
Running a Spark program developed in Eclipse on a cluster

This blog post explains how to submit a Spark program developed in an IDE to run on a cluster.
First, make sure your cluster is up and running; for cluster setup, see http://kevin12.iteye.com/blog/2273556

Developing the Spark WordCount program for cluster testing:
1. Prepare the data on HDFS.
First, upload the README.md file to the /library/wordcount/input2 directory on HDFS:
root@master1:/usr/local/hadoop/hadoop-2.6.0/sbin# hdfs dfs -mkdir /library/wordcount/input2
root@master1:/usr/local/hadoop/hadoop-2.6.0/sbin# hdfs dfs -put /usr/local/tools/README.md /library/wordcount/input2
Then check that the file has actually been uploaded to HDFS.
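The check can be done with the HDFS shell (a sketch; it assumes the hdfs client is on the PATH and must be run on the cluster, so the output is environment-dependent):

```shell
# List the input directory to confirm README.md landed in HDFS
hdfs dfs -ls /library/wordcount/input2
# Optionally peek at the first few lines of the uploaded file
hdfs dfs -cat /library/wordcount/input2/README.md | head -n 5
```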


Program code:
package com.imf.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
/**
 * A Spark WordCount program for cluster testing, written in Scala.
 */
object WordCount_Cluster {
   def main(args: Array[String]): Unit = {
     /**
      * 1. Create the SparkConf object, which carries the runtime configuration of the
      * Spark application. For example, setMaster sets the URL of the Spark master the
      * program connects to; setting it to "local" runs the program locally, which is
      * well suited to machines with very limited resources.
      */
     //Create the SparkConf object
     val conf = new SparkConf()
     //Set the application name; it is shown in the monitoring UI while the program runs
     conf.setAppName("My First Spark App!")
     /**
      * 2. Create the SparkContext object.
      * SparkContext is the single entry point to all Spark functionality; whether you
      * use Scala, Java, Python, or R, you must have a SparkContext.
      * Its core role is to initialize the components the application needs at runtime,
      * including DAGScheduler, TaskScheduler, and SchedulerBackend; it is also
      * responsible for registering the program with the Master.
      * SparkContext is the most important object in the entire application.
      */
     //Create the SparkContext, passing in the SparkConf instance that carries the
     //concrete runtime parameters and configuration
     val sc = new SparkContext(conf)

     /**
      * 3. Create an RDD through the SparkContext from the concrete data source
      * (HDFS, HBase, local FS, DB, S3, etc.).
      * There are three basic ways to create an RDD: from an external data source
      * (such as HDFS), from a Scala collection, or by transforming another RDD.
      * The data is split by the RDD into a series of partitions; the data assigned
      * to each partition is processed by one task.
      */
     //Read the HDFS file's contents and split them into partitions
     //val lines = sc.textFile("hdfs://master1:9000/library/wordcount/input2")
     val lines = sc.textFile("/library/wordcount/input2")//No parallelism is set here; Spark applies its own defaults, which we ignore for now

     /**
      * 4. Apply Transformation-level operations, i.e. higher-order functions such as
      * map and filter, to the initial RDD to perform the actual computation.
      * 4.1. Split each line's string into individual words.
      */
     //Split each line's string and merge the results of all lines into one large, flat collection
      val words = lines.flatMap { line => line.split(" ") }
     /**
      * 4.2. On top of the word split, count each word instance as 1, i.e. word => (word,1)
      */
     val pairs = words.map{word =>(word,1)}

     /**
      * 4.3. On top of the per-instance counts, compute the total number of occurrences
      * of each word in the file.
      */
     //Accumulate the values of identical keys (reducing at both the local and the reducer level)
     val wordCounts = pairs.reduceByKey(_+_)
     //Print the output
     wordCounts.collect.foreach(pair => println(pair._1+":"+pair._2))
     sc.stop()
   }
}
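The same transformation pipeline can be traced on plain Scala collections without a cluster. The sketch below mirrors the RDD steps; reduceByKey has no collection equivalent, so it is emulated here with groupBy plus a sum:

```scala
object WordCountLocalSketch {
  // Mirrors the RDD pipeline (flatMap -> map -> reduceByKey) on an in-memory Seq of lines
  def wordCounts(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(line => line.split(" "))  // like lines.flatMap on the RDD
    val pairs = words.map(word => (word, 1))            // like words.map: word => (word, 1)
    // reduceByKey(_ + _) emulated: group the pairs by word and sum the per-instance 1s
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    val demo = Seq("a b a", "b c")
    wordCounts(demo).foreach { case (word, count) => println(word + ":" + count) }
  }
}
```

This only illustrates the semantics of each step; on an RDD the same operations run partition by partition across the cluster.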


2. Package the program
Package the program into a jar file and copy the jar to the /usr/local/sparkApps directory on the master1 virtual machine.
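If you prefer the command line over Eclipse's Export wizard, the jar can be built with the JDK's jar tool (a sketch; the project path is hypothetical, and it assumes Eclipse's compiled classes sit under the default bin/ output folder):

```shell
cd /path/to/workspace/WordCountProject   # hypothetical Eclipse project location
# Bundle the compiled classes into WordCount.jar
jar cvf WordCount.jar -C bin .
# Copy the jar to master1 (assumes ssh access as root)
scp WordCount.jar root@master1:/usr/local/sparkApps/
```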



3. Submit the application to the cluster.
See the official documentation: http://spark.apache.org/docs/latest/submitting-applications.html
The first attempt used the command below and failed; the command and error message follow. The cause is that --master was given the invalid value "master", so the application was never submitted to the cluster.
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-submit \
> --class com.imf.spark.WordCount_Cluster \
> --master master spark://master1:7077 \
> /usr/local/sparkApps/WordCount.jar
Error: Master must start with yarn, spark, mesos, or local
Run with --help for usage help or --verbose for debug output

The second attempt, with the command corrected as follows, ran normally:
./spark-submit \
--class com.imf.spark.WordCount_Cluster \
--master yarn \
/usr/local/sparkApps/WordCount.jar
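When the default resources are too small for a job, spark-submit also accepts standard resource flags. The values below are illustrative only, not tuned for this cluster, and the command must be run on the cluster itself:

```shell
./spark-submit \
  --class com.imf.spark.WordCount_Cluster \
  --master yarn \
  --deploy-mode client \      # run the driver on the submitting machine (the default)
  --num-executors 2 \         # illustrative executor count
  --executor-memory 1g \      # illustrative per-executor memory
  /usr/local/sparkApps/WordCount.jar
```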


The run output is as follows:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-submit \
> --class com.imf.spark.WordCount_Cluster \
> --master yarn \
> /usr/local/sparkApps/WordCount.jar
16/01/27 07:21:55 INFO spark.SparkContext: Running Spark version 1.6.0
16/01/27 07:21:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/27 07:21:56 INFO spark.SecurityManager: Changing view acls to: root
16/01/27 07:21:56 INFO spark.SecurityManager: Changing modify acls to: root
16/01/27 07:21:56 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/01/27 07:21:57 INFO util.Utils: Successfully started service 'sparkDriver' on port 33500.
16/01/27 07:21:57 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/01/27 07:21:57 INFO Remoting: Starting remoting
16/01/27 07:21:58 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.112.130:40488]
16/01/27 07:21:58 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 40488.
16/01/27 07:21:58 INFO spark.SparkEnv: Registering MapOutputTracker
16/01/27 07:21:58 INFO spark.SparkEnv: Registering BlockManagerMaster
16/01/27 07:21:58 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-298b85db-a14e-448b-bf12-a76e7f9d3e61
16/01/27 07:21:58 INFO storage.MemoryStore: MemoryStore started with capacity 517.4 MB
16/01/27 07:21:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/01/27 07:21:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/01/27 07:21:59 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/01/27 07:21:59 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/01/27 07:21:59 INFO ui.SparkUI: Started SparkUI at http://192.168.112.130:4040
16/01/27 07:21:59 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-1273b284-86a3-45a5-ab5f-7996b7f78a5f/httpd-c0ad1a3a-48fc-4aef-8ac1-ddc28159fe2c
16/01/27 07:21:59 INFO spark.HttpServer: Starting HTTP Server
16/01/27 07:21:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/01/27 07:21:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:41722
16/01/27 07:21:59 INFO util.Utils: Successfully started service 'HTTP file server' on port 41722.
16/01/27 07:21:59 INFO spark.SparkContext: Added JAR file:/usr/local/sparkApps/WordCount.jar at http://192.168.112.130:41722/jars/WordCount.jar with timestamp 1453850519749
16/01/27 07:22:00 INFO client.RMProxy: Connecting to ResourceManager at master1/192.168.112.130:8032
16/01/27 07:22:01 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers
16/01/27 07:22:01 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/01/27 07:22:01 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/01/27 07:22:01 INFO yarn.Client: Setting up container launch context for our AM
16/01/27 07:22:01 INFO yarn.Client: Setting up the launch environment for our AM container
16/01/27 07:22:01 INFO yarn.Client: Preparing resources for our AM container
16/01/27 07:22:02 INFO yarn.Client: Uploading resource file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://master1:9000/user/root/.sparkStaging/application_1453847555417_0001/spark-assembly-1.6.0-hadoop2.6.0.jar
16/01/27 07:22:11 INFO yarn.Client: Uploading resource file:/tmp/spark-1273b284-86a3-45a5-ab5f-7996b7f78a5f/__spark_conf__2067657392446030944.zip -> hdfs://master1:9000/user/root/.sparkStaging/application_1453847555417_0001/__spark_conf__2067657392446030944.zip
16/01/27 07:22:11 INFO spark.SecurityManager: Changing view acls to: root
16/01/27 07:22:11 INFO spark.SecurityManager: Changing modify acls to: root
16/01/27 07:22:11 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/01/27 07:22:12 INFO yarn.Client: Submitting application 1 to ResourceManager
16/01/27 07:22:12 INFO impl.YarnClientImpl: Submitted application application_1453847555417_0001
16/01/27 07:22:14 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:14 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1453850532502
     final status: UNDEFINED
     tracking URL: http://master1:8088/proxy/application_1453847555417_0001/
     user: root
16/01/27 07:22:15 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:16 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:17 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:18 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:19 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:20 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:21 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:22 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:23 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:24 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:25 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:26 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:27 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:28 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:29 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:30 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:31 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:32 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:32 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/01/27 07:22:32 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master1, PROXY_URI_BASES -> http://master1:8088/proxy/application_1453847555417_0001), /proxy/application_1453847555417_0001
16/01/27 07:22:32 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/01/27 07:22:33 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED)
16/01/27 07:22:34 INFO yarn.Client: Application report for application_1453847555417_0001 (state: RUNNING)
16/01/27 07:22:34 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.112.132
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1453850532502
     final status: UNDEFINED
     tracking URL: http://master1:8088/proxy/application_1453847555417_0001/
     user: root
16/01/27 07:22:34 INFO cluster.YarnClientSchedulerBackend: Application application_1453847555417_0001 has started running.
16/01/27 07:22:34 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33873.
16/01/27 07:22:34 INFO netty.NettyBlockTransferService: Server created on 33873
16/01/27 07:22:34 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/01/27 07:22:34 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.112.130:33873 with 517.4 MB RAM, BlockManagerId(driver, 192.168.112.130, 33873)
16/01/27 07:22:34 INFO storage.BlockManagerMaster: Registered BlockManager
16/01/27 07:22:37 INFO scheduler.EventLoggingListener: Logging events to hdfs://master1:9000/historyserverforSpark/application_1453847555417_0001
16/01/27 07:22:37 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
16/01/27 07:22:46 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 212.9 KB, free 212.9 KB)
16/01/27 07:22:46 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.6 KB, free 232.4 KB)
16/01/27 07:22:46 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.112.130:33873 (size: 19.6 KB, free: 517.4 MB)
16/01/27 07:22:46 INFO spark.SparkContext: Created broadcast 0 from textFile at WordCount_Cluster.scala:36
16/01/27 07:22:49 INFO mapred.FileInputFormat: Total input paths to process : 1
16/01/27 07:22:50 INFO spark.SparkContext: Starting job: collect at WordCount_Cluster.scala:55
16/01/27 07:22:51 INFO scheduler.DAGScheduler: Registering RDD 3 (map at WordCount_Cluster.scala:47)
16/01/27 07:22:51 INFO scheduler.DAGScheduler: Got job 0 (collect at WordCount_Cluster.scala:55) with 2 output partitions
16/01/27 07:22:51 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at WordCount_Cluster.scala:55)
16/01/27 07:22:51 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/01/27 07:22:51 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/01/27 07:22:51 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount_Cluster.scala:47), which has no missing parents
16/01/27 07:22:52 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 236.5 KB)
16/01/27 07:22:52 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 238.8 KB)
16/01/27 07:22:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.112.130:33873 (size: 2.3 KB, free: 517.4 MB)
16/01/27 07:22:52 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/01/27 07:22:52 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount_Cluster.scala:47)
16/01/27 07:22:52 INFO cluster.YarnScheduler: Adding task set 0.0 with 2 tasks
16/01/27 07:23:08 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/01/27 07:23:22 INFO cluster.YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (worker3:56136) with ID 2
16/01/27 07:23:23 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, worker3, partition 0,NODE_LOCAL, 2202 bytes)
16/01/27 07:23:23 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/01/27 07:23:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager worker3:41970 with 517.4 MB RAM, BlockManagerId(2, worker3, 41970)
16/01/27 07:23:33 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on worker3:41970 (size: 2.3 KB, free: 517.4 MB)
16/01/27 07:23:37 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on worker3:41970 (size: 19.6 KB, free: 517.4 MB)
16/01/27 07:23:39 INFO cluster.YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (worker2:42174) with ID 1
16/01/27 07:23:39 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, worker2, partition 1,RACK_LOCAL, 2202 bytes)
16/01/27 07:23:40 INFO storage.BlockManagerMasterEndpoint: Registering block manager worker2:44472 with 517.4 MB RAM, BlockManagerId(1, worker2, 44472)
16/01/27 07:23:47 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on worker2:44472 (size: 2.3 KB, free: 517.4 MB)
16/01/27 07:23:51 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on worker2:44472 (size: 19.6 KB, free: 517.4 MB)
16/01/27 07:24:00 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 37460 ms on worker3 (1/2)
16/01/27 07:24:09 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (map at WordCount_Cluster.scala:47) finished in 75.810 s
16/01/27 07:24:09 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 29446 ms on worker2 (2/2)
16/01/27 07:24:09 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/01/27 07:24:09 INFO scheduler.DAGScheduler: running: Set()
16/01/27 07:24:09 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
16/01/27 07:24:09 INFO scheduler.DAGScheduler: failed: Set()
16/01/27 07:24:09 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/01/27 07:24:09 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount_Cluster.scala:53), which has no missing parents
16/01/27 07:24:09 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 241.4 KB)
16/01/27 07:24:09 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1595.0 B, free 242.9 KB)
16/01/27 07:24:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.112.130:33873 (size: 1595.0 B, free: 517.4 MB)
16/01/27 07:24:09 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/01/27 07:24:09 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount_Cluster.scala:53)
16/01/27 07:24:09 INFO cluster.YarnScheduler: Adding task set 1.0 with 2 tasks
16/01/27 07:24:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, worker2, partition 0,NODE_LOCAL, 1951 bytes)
16/01/27 07:24:09 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, worker3, partition 1,NODE_LOCAL, 1951 bytes)
16/01/27 07:24:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on worker2:44472 (size: 1595.0 B, free: 517.4 MB)
16/01/27 07:24:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on worker3:41970 (size: 1595.0 B, free: 517.4 MB)
16/01/27 07:24:10 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to worker2:42174
16/01/27 07:24:10 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 158 bytes
16/01/27 07:24:10 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to worker3:56136
16/01/27 07:24:13 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 3634 ms on worker3 (1/2)
16/01/27 07:24:13 INFO scheduler.DAGScheduler: ResultStage 1 (collect at WordCount_Cluster.scala:55) finished in 3.691 s
16/01/27 07:24:13 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 3696 ms on worker2 (2/2)
16/01/27 07:24:13 INFO cluster.YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 
16/01/27 07:24:13 INFO scheduler.DAGScheduler: Job 0 finished: collect at WordCount_Cluster.scala:55, took 82.513256 s
package:1
this:1
Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version):1
Because:1
Python:2
cluster.:1
[run:1
its:1
YARN,:1
have:1
general:2
pre-built:1
locally:2
locally.:1
changed:1
sc.parallelize(1:1
only:1
Configuration:1
This:2
first:1
basic:1
documentation:3
learning,:1
graph:1
Hive:2
several:1
["Specifying:1
"yarn":1
page](http://spark.apache.org/documentation.html):1
[params]`.:1
[project:2
prefer:1
SparkPi:2
<http://spark.apache.org/>:1
engine:1
version:1
file:1
documentation,:1
MASTER:1
example:3
are:1
systems.:1
params:1
scala>:1
DataFrames,:1
provides:1
refer:2
configure:1
Interactive:2
R,:1
can:6
build:3
when:1
how:2
easiest:1
Apache:1
package.:1
1000).count():1
Note:1
Data.:1
>>>:1
Scala:2
Alternatively,:1
variable:1
submit:1
Testing:1
Streaming:1
module,:1
thread,:1
rich:1
them,:1
detailed:2
stream:1
GraphX:1
distribution:1
Please:3
is:6
return:2
Thriftserver:1
same:1
start:1
Spark](#building-spark).:1
one:2
with:3
built:1
Spark"](http://spark.apache.org/docs/latest/building-spark.html).:1
data:1
wiki](https://cwiki.apache.org/confluence/display/SPARK).:1
using:2
talk:1
class:2
Shell:2
README:1
computing:1
Python,:2
example::1
##:8
building:2
N:1
set:2
Hadoop-supported:1
other:1
Example:1
analysis.:1
from:1
runs.:1
Building:1
higher-level:1
need:1
guidance:2
Big:1
guide,:1
Java,:1
fast:1
uses:1
SQL:2
will:1
<class>:1
requires:1
:67
Documentation:1
web:1
cluster:2
using::1
MLlib:1
shell::2
Scala,:1
supports:2
built,:1
./dev/run-tests:1
build/mvn:1
sample:1
For:2
Programs:1
Spark:13
particular:2
The:1
processing.:1
APIs:1
computation:1
Try:1
[Configuration:1
./bin/pyspark:1
A:1
through:1
#:1
library:1
following:2
More:1
which:2
also:4
storage:1
should:2
To:2
for:11
Once:1
setup:1
mesos://:1
Maven](http://maven.apache.org/).:1
latest:1
processing,:1
the:21
your:1
not:1
different:1
distributions.:1
given.:1
About:1
if:4
instructions.:1
be:2
do:2
Tests:1
no:1
./bin/run-example:2
programs,:1
including:3
`./bin/run-example:1
Spark.:1
Versions:1
HDFS:1
individual:1
spark://:1
It:2
an:3
programming:1
machine:1
run::1
environment:1
clean:1
1000::2
And:1
run:7
./bin/spark-shell:1
URL,:1
"local":1
MASTER=spark://host:7077:1
on:5
You:3
threads.:1
against:1
[Apache:1
help:1
print:1
tests:2
examples:2
at:2
in:5
-DskipTests:1
optimized:1
downloaded:1
versions:1
graphs:1
Guide](http://spark.apache.org/docs/latest/configuration.html):1
online:1
usage:1
abbreviated:1
comes:1
directory.:1
overview:1
[building:1
`examples`:2
Many:1
Running:1
way:1
use:3
Online:1
site,:1
tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).:1
running:1
find:1
sc.parallelize(range(1000)).count():1
contains:1
project:1
you:4
Pi:1
that:2
protocols:1
a:8
or:3
high-level:1
name:1
Hadoop,:2
to:14
available:1
(You:1
core:1
instance::1
see:1
of:5
tools:1
"local[N]":1
programs:2
package.):1
["Building:1
must:1
and:10
command,:2
system:1
Hadoop:3
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/01/27 07:24:13 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.112.130:4040
16/01/27 07:24:14 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread
16/01/27 07:24:14 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
16/01/27 07:24:14 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down
16/01/27 07:24:14 INFO cluster.YarnClientSchedulerBackend: Stopped
16/01/27 07:24:14 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/01/27 07:24:14 INFO storage.MemoryStore: MemoryStore cleared
16/01/27 07:24:14 INFO storage.BlockManager: BlockManager stopped
16/01/27 07:24:14 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/01/27 07:24:14 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/01/27 07:24:15 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/27 07:24:15 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/01/27 07:24:15 INFO spark.SparkContext: Successfully stopped SparkContext
16/01/27 07:24:15 INFO util.ShutdownHookManager: Shutdown hook called
16/01/27 07:24:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-1273b284-86a3-45a5-ab5f-7996b7f78a5f
16/01/27 07:24:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-1273b284-86a3-45a5-ab5f-7996b7f78a5f/httpd-c0ad1a3a-48fc-4aef-8ac1-ddc28159fe2c
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# 


The result is the same as when running locally.

4. Automate the submission with a wordcount.sh script.
Create a wordcount.sh file wrapping the spark-submit command above, save it, and run chmod +x wordcount.sh to give the script execute permission. Then execute it with the command below; the result is the same as above.
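The original script's content is not shown; a minimal wordcount.sh might look like the following sketch, reusing the paths from the steps above:

```shell
#!/bin/bash
# Submit the WordCount jar to YARN (same command as in step 3)
/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-submit \
  --class com.imf.spark.WordCount_Cluster \
  --master yarn \
  /usr/local/sparkApps/WordCount.jar
```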
root@master1:/usr/local/sparkApps# ./wordcount.sh


5. View the program's run results in a browser:






