- 浏览: 236083 次
- 性别:
- 来自: 上海
文章分类
最新评论
-
lwb314:
你的这个是创建的临时的hive表,数据也是通过文件录入进去的, ...
Spark SQL操作Hive数据库 -
yixiaoqi2010:
你好 我的提交上去 总是报错,找不到hive表,可能是哪里 ...
Spark SQL操作Hive数据库 -
bo_hai:
target jvm版本也要选择正确。不能选择太高。2.10对 ...
eclipse开发spark程序配置本地运行
这篇bolg讲一下,IDE开发的spark程序如何提交到集群上运行。
首先保证你的集群是运行成功的,集群搭建可以参考http://kevin12.iteye.com/blog/2273556
开发集群测试的spark wordcount程序;
1.hdfs数据准备.
先将README.md文件上传到hdfs上的/library/wordcount/input2目录
root@master1:/usr/local/hadoop/hadoop-2.6.0/sbin# hdfs dfs -mkdir /library/wordcount/input2
root@master1:/usr/local/hadoop/hadoop-2.6.0/sbin# hdfs dfs -put /usr/local/tools/README.md /library/wordcount/input2
查看一下确保文件已经上传到hdfs上。
程序代码:
2.打包程序
将程序打包成jar文件,并将jar文件复制到虚拟机的master1上的/usr/local/sparkApps目录下。
3.提交应用程序到集群中。
可以参考官方文档http://spark.apache.org/docs/latest/submitting-applications.html
第一次用的是下面的命令,结果运行不了,提示信息和命令如下:原因没有提交到集群中。
第二次修改命令如下,运行正常:
运行结果如下:
运行结果和本地的一样。
4.通过写一个wordcount.sh,实现自动化。
创建一个wordcount.sh文件,内容如下,保存并退出。然后chmod +x wordcount.sh给shell文件执行权限。运行下面的命令,执行即可,执行结果和上面的一样。
6.通过浏览器查看程序运行结果:
首先保证你的集群是运行成功的,集群搭建可以参考http://kevin12.iteye.com/blog/2273556
开发集群测试的spark wordcount程序;
1.hdfs数据准备.
先将README.md文件上传到hdfs上的/library/wordcount/input2目录
root@master1:/usr/local/hadoop/hadoop-2.6.0/sbin# hdfs dfs -mkdir /library/wordcount/input2
root@master1:/usr/local/hadoop/hadoop-2.6.0/sbin# hdfs dfs -put /usr/local/tools/README.md /library/wordcount/input2
查看一下确保文件已经上传到hdfs上。
程序代码:
package com.imf.spark import org.apache.spark.SparkConf import org.apache.spark.SparkContext /** * 用户scala开发集群测试的spark wordcount程序 */ object WordCount_Cluster { def main(args: Array[String]): Unit = { /** * 1.创建Spark的配置对象SparkConf,设置Spark程序的运行时的配置信息, * 例如:通过setMaster来设置程序要链接的Spark集群的Master的URL,如果设置为local, * 则代表Spark程序在本地运行,特别适合于机器配置条件非常差的情况。 */ //创建SparkConf对象 val conf = new SparkConf() //设置应用程序名称,在程序运行的监控界面可以看到名称 conf.setAppName("My First Spark App!") /** * 2.创建SparkContext对象 * SparkContext是spark程序所有功能的唯一入口,无论是采用Scala,java,python,R等都必须有一个SprakContext * SparkContext核心作用:初始化spark应用程序运行所需要的核心组件,包括DAGScheduler,TaskScheduler,SchedulerBackend * 同时还会负责Spark程序往Master注册程序等; * SparkContext是整个应用程序中最为至关重要的一个对象; */ //通过创建SparkContext对象,通过传入SparkConf实例定制Spark运行的具体参数和配置信息 val sc = new SparkContext(conf) /** * 3.根据具体数据的来源(HDFS,HBase,Local,FS,DB,S3等)通过SparkContext来创建RDD; * RDD的创建基本有三种方式:根据外部的数据来源(例如HDFS)、根据Scala集合、由其他的RDD操作; * 数据会被RDD划分成为一系列的Partitions,分配到每个Partition的数据属于一个Task的处理范畴; */ //读取hdfs文件内容并切分成Pratitions //val lines = sc.textFile("hdfs://master1:9000/library/wordcount/input2") val lines = sc.textFile("/library/wordcount/input2")//这里不设置并行度是spark有自己的算法,暂时不去考虑 /** * 4.对初始的RDD进行Transformation级别的处理,例如map,filter等高阶函数的变成,来进行具体的数据计算 * 4.1.将每一行的字符串拆分成单个单词 */ //对每一行的字符串进行拆分并把所有行的拆分结果通过flat合并成一个大的集合 val words = lines.flatMap { line => line.split(" ") } /** * 4.2.在单词拆分的基础上对每个单词实例计数为1,也就是word => (word,1) */ val pairs = words.map{word =>(word,1)} /** * 4.3.在每个单词实例计数为1基础上统计每个单词在文件中出现的总次数 */ //对相同的key进行value的累积(包括Local和Reducer级别同时Reduce) val wordCounts = pairs.reduceByKey(_+_) //打印输出 wordCounts.collect.foreach(pair => println(pair._1+":"+pair._2)) sc.stop() } }
2.打包程序
将程序打包成jar文件,并将jar文件复制到虚拟机的master1上的/usr/local/sparkApps目录下。
3.提交应用程序到集群中。
可以参考官方文档http://spark.apache.org/docs/latest/submitting-applications.html
第一次用的是下面的命令,结果运行不了,提示信息和命令如下:原因没有提交到集群中。
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-submit \ > --class com.imf.spark.WordCount_Cluster \ > --master master spark://master1:7077 \ > /usr/local/sparkApps/WordCount.jar Error: Master must start with yarn, spark, mesos, or local Run with --help for usage help or --verbose for debug output
第二次修改命令如下,运行正常:
./spark-submit \ --class com.imf.spark.WordCount_Cluster \ --master yarn \ /usr/local/sparkApps/WordCount.jar
运行结果如下:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-submit \ > --class com.imf.spark.WordCount_Cluster \ > --master yarn \ > /usr/local/sparkApps/WordCount.jar 16/01/27 07:21:55 INFO spark.SparkContext: Running Spark version 1.6.0 16/01/27 07:21:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/01/27 07:21:56 INFO spark.SecurityManager: Changing view acls to: root 16/01/27 07:21:56 INFO spark.SecurityManager: Changing modify acls to: root 16/01/27 07:21:56 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 16/01/27 07:21:57 INFO util.Utils: Successfully started service 'sparkDriver' on port 33500. 16/01/27 07:21:57 INFO slf4j.Slf4jLogger: Slf4jLogger started 16/01/27 07:21:57 INFO Remoting: Starting remoting 16/01/27 07:21:58 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.112.130:40488] 16/01/27 07:21:58 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 40488. 16/01/27 07:21:58 INFO spark.SparkEnv: Registering MapOutputTracker 16/01/27 07:21:58 INFO spark.SparkEnv: Registering BlockManagerMaster 16/01/27 07:21:58 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-298b85db-a14e-448b-bf12-a76e7f9d3e61 16/01/27 07:21:58 INFO storage.MemoryStore: MemoryStore started with capacity 517.4 MB 16/01/27 07:21:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator 16/01/27 07:21:59 INFO server.Server: jetty-8.y.z-SNAPSHOT 16/01/27 07:21:59 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 16/01/27 07:21:59 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 16/01/27 07:21:59 INFO ui.SparkUI: Started SparkUI at http://192.168.112.130:4040 16/01/27 07:21:59 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-1273b284-86a3-45a5-ab5f-7996b7f78a5f/httpd-c0ad1a3a-48fc-4aef-8ac1-ddc28159fe2c 16/01/27 07:21:59 INFO spark.HttpServer: Starting HTTP Server 16/01/27 07:21:59 INFO server.Server: jetty-8.y.z-SNAPSHOT 16/01/27 07:21:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:41722 16/01/27 07:21:59 INFO util.Utils: Successfully started service 'HTTP file server' on port 41722. 16/01/27 07:21:59 INFO spark.SparkContext: Added JAR file:/usr/local/sparkApps/WordCount.jar at http://192.168.112.130:41722/jars/WordCount.jar with timestamp 1453850519749 16/01/27 07:22:00 INFO client.RMProxy: Connecting to ResourceManager at master1/192.168.112.130:8032 16/01/27 07:22:01 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers 16/01/27 07:22:01 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 16/01/27 07:22:01 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 16/01/27 07:22:01 INFO yarn.Client: Setting up container launch context for our AM 16/01/27 07:22:01 INFO yarn.Client: Setting up the launch environment for our AM container 16/01/27 07:22:01 INFO yarn.Client: Preparing resources for our AM container 16/01/27 07:22:02 INFO yarn.Client: Uploading resource file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://master1:9000/user/root/.sparkStaging/application_1453847555417_0001/spark-assembly-1.6.0-hadoop2.6.0.jar 16/01/27 07:22:11 INFO yarn.Client: Uploading resource file:/tmp/spark-1273b284-86a3-45a5-ab5f-7996b7f78a5f/__spark_conf__2067657392446030944.zip -> hdfs://master1:9000/user/root/.sparkStaging/application_1453847555417_0001/__spark_conf__2067657392446030944.zip 16/01/27 07:22:11 INFO spark.SecurityManager: Changing view acls to: root 16/01/27 07:22:11 INFO spark.SecurityManager: Changing modify acls to: root 16/01/27 07:22:11 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 16/01/27 07:22:12 INFO yarn.Client: Submitting application 1 to ResourceManager 16/01/27 07:22:12 INFO impl.YarnClientImpl: Submitted application application_1453847555417_0001 16/01/27 07:22:14 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:14 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1453850532502 final status: UNDEFINED tracking URL: http://master1:8088/proxy/application_1453847555417_0001/ user: root 16/01/27 07:22:15 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:16 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:17 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:18 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:19 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:20 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:21 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:22 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:23 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:24 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:25 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:26 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:27 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:28 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:29 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:30 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:31 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:32 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:32 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null) 16/01/27 07:22:32 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master1, PROXY_URI_BASES -> http://master1:8088/proxy/application_1453847555417_0001), /proxy/application_1453847555417_0001 16/01/27 07:22:32 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter 16/01/27 07:22:33 INFO yarn.Client: Application report for application_1453847555417_0001 (state: ACCEPTED) 16/01/27 07:22:34 INFO yarn.Client: Application report for application_1453847555417_0001 (state: RUNNING) 16/01/27 07:22:34 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: 192.168.112.132 ApplicationMaster RPC port: 0 queue: default start time: 1453850532502 final status: UNDEFINED tracking URL: http://master1:8088/proxy/application_1453847555417_0001/ user: root 16/01/27 07:22:34 INFO cluster.YarnClientSchedulerBackend: Application application_1453847555417_0001 has started running. 16/01/27 07:22:34 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33873. 16/01/27 07:22:34 INFO netty.NettyBlockTransferService: Server created on 33873 16/01/27 07:22:34 INFO storage.BlockManagerMaster: Trying to register BlockManager 16/01/27 07:22:34 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.112.130:33873 with 517.4 MB RAM, BlockManagerId(driver, 192.168.112.130, 33873) 16/01/27 07:22:34 INFO storage.BlockManagerMaster: Registered BlockManager 16/01/27 07:22:37 INFO scheduler.EventLoggingListener: Logging events to hdfs://master1:9000/historyserverforSpark/application_1453847555417_0001 16/01/27 07:22:37 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms) 16/01/27 07:22:46 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 212.9 KB, free 212.9 KB) 16/01/27 07:22:46 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.6 KB, free 232.4 KB) 16/01/27 07:22:46 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.112.130:33873 (size: 19.6 KB, free: 517.4 MB) 16/01/27 07:22:46 INFO spark.SparkContext: Created broadcast 0 from textFile at WordCount_Cluster.scala:36 16/01/27 07:22:49 INFO mapred.FileInputFormat: Total input paths to process : 1 16/01/27 07:22:50 INFO spark.SparkContext: Starting job: collect at WordCount_Cluster.scala:55 16/01/27 07:22:51 INFO scheduler.DAGScheduler: Registering RDD 3 (map at WordCount_Cluster.scala:47) 16/01/27 07:22:51 INFO scheduler.DAGScheduler: Got job 0 (collect at WordCount_Cluster.scala:55) with 2 output partitions 16/01/27 07:22:51 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at WordCount_Cluster.scala:55) 16/01/27 07:22:51 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 16/01/27 07:22:51 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0) 16/01/27 07:22:51 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount_Cluster.scala:47), which has no missing parents 16/01/27 07:22:52 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 236.5 KB) 16/01/27 07:22:52 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 238.8 KB) 16/01/27 07:22:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.112.130:33873 (size: 2.3 KB, free: 517.4 MB) 16/01/27 07:22:52 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006 16/01/27 07:22:52 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount_Cluster.scala:47) 16/01/27 07:22:52 INFO cluster.YarnScheduler: Adding task set 0.0 with 2 tasks 16/01/27 07:23:08 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 16/01/27 07:23:22 INFO cluster.YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (worker3:56136) with ID 2 16/01/27 07:23:23 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, worker3, partition 0,NODE_LOCAL, 2202 bytes) 16/01/27 07:23:23 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 16/01/27 07:23:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager worker3:41970 with 517.4 MB RAM, BlockManagerId(2, worker3, 41970) 16/01/27 07:23:33 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on worker3:41970 (size: 2.3 KB, free: 517.4 MB) 16/01/27 07:23:37 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on worker3:41970 (size: 19.6 KB, free: 517.4 MB) 16/01/27 07:23:39 INFO cluster.YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (worker2:42174) with ID 1 16/01/27 07:23:39 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, worker2, partition 1,RACK_LOCAL, 2202 bytes) 16/01/27 07:23:40 INFO storage.BlockManagerMasterEndpoint: Registering block manager worker2:44472 with 517.4 MB RAM, BlockManagerId(1, worker2, 44472) 16/01/27 07:23:47 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on worker2:44472 (size: 2.3 KB, free: 517.4 MB) 16/01/27 07:23:51 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on worker2:44472 (size: 19.6 KB, free: 517.4 MB) 16/01/27 07:24:00 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 37460 ms on worker3 (1/2) 16/01/27 07:24:09 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (map at WordCount_Cluster.scala:47) finished in 75.810 s 16/01/27 07:24:09 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 29446 ms on worker2 (2/2) 16/01/27 07:24:09 INFO scheduler.DAGScheduler: looking for newly runnable stages 16/01/27 07:24:09 INFO scheduler.DAGScheduler: running: Set() 16/01/27 07:24:09 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1) 16/01/27 07:24:09 INFO scheduler.DAGScheduler: failed: Set() 16/01/27 07:24:09 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 16/01/27 07:24:09 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount_Cluster.scala:53), which has no missing parents 16/01/27 07:24:09 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 241.4 KB) 16/01/27 07:24:09 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1595.0 B, free 242.9 KB) 16/01/27 07:24:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.112.130:33873 (size: 1595.0 B, free: 517.4 MB) 16/01/27 07:24:09 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006 16/01/27 07:24:09 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount_Cluster.scala:53) 16/01/27 07:24:09 INFO cluster.YarnScheduler: Adding task set 1.0 with 2 tasks 16/01/27 07:24:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, worker2, partition 0,NODE_LOCAL, 1951 bytes) 16/01/27 07:24:09 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, worker3, partition 1,NODE_LOCAL, 1951 bytes) 16/01/27 07:24:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on worker2:44472 (size: 1595.0 B, free: 517.4 MB) 16/01/27 07:24:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on worker3:41970 (size: 1595.0 B, free: 517.4 MB) 16/01/27 07:24:10 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to worker2:42174 16/01/27 07:24:10 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 158 bytes 16/01/27 07:24:10 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to worker3:56136 16/01/27 07:24:13 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 3634 ms on worker3 (1/2) 16/01/27 07:24:13 INFO scheduler.DAGScheduler: ResultStage 1 (collect at WordCount_Cluster.scala:55) finished in 3.691 s 16/01/27 07:24:13 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 3696 ms on worker2 (2/2) 16/01/27 07:24:13 INFO cluster.YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 16/01/27 07:24:13 INFO scheduler.DAGScheduler: Job 0 finished: collect at WordCount_Cluster.scala:55, took 82.513256 s package:1 this:1 Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version):1 Because:1 Python:2 cluster.:1 [run:1 its:1 YARN,:1 have:1 general:2 pre-built:1 locally:2 locally.:1 changed:1 sc.parallelize(1:1 only:1 Configuration:1 This:2 first:1 basic:1 documentation:3 learning,:1 graph:1 Hive:2 several:1 ["Specifying:1 "yarn":1 page](http://spark.apache.org/documentation.html):1 [params]`.:1 [project:2 prefer:1 SparkPi:2 <http://spark.apache.org/>:1 engine:1 version:1 file:1 documentation,:1 MASTER:1 example:3 are:1 systems.:1 params:1 scala>:1 DataFrames,:1 provides:1 refer:2 configure:1 Interactive:2 R,:1 can:6 build:3 when:1 how:2 easiest:1 Apache:1 package.:1 1000).count():1 Note:1 Data.:1 >>>:1 Scala:2 Alternatively,:1 variable:1 submit:1 Testing:1 Streaming:1 module,:1 thread,:1 rich:1 them,:1 detailed:2 stream:1 GraphX:1 distribution:1 Please:3 is:6 return:2 Thriftserver:1 same:1 start:1 Spark](#building-spark).:1 one:2 with:3 built:1 Spark"](http://spark.apache.org/docs/latest/building-spark.html).:1 data:1 wiki](https://cwiki.apache.org/confluence/display/SPARK).:1 using:2 talk:1 class:2 Shell:2 README:1 computing:1 Python,:2 example::1 ##:8 building:2 N:1 set:2 Hadoop-supported:1 other:1 Example:1 analysis.:1 from:1 runs.:1 Building:1 higher-level:1 need:1 guidance:2 Big:1 guide,:1 Java,:1 fast:1 uses:1 SQL:2 will:1 <class>:1 requires:1 :67 Documentation:1 web:1 cluster:2 using::1 MLlib:1 shell::2 Scala,:1 supports:2 built,:1 ./dev/run-tests:1 build/mvn:1 sample:1 For:2 Programs:1 Spark:13 particular:2 The:1 processing.:1 APIs:1 computation:1 Try:1 [Configuration:1 ./bin/pyspark:1 A:1 through:1 #:1 library:1 following:2 More:1 which:2 also:4 storage:1 should:2 To:2 for:11 Once:1 setup:1 mesos://:1 Maven](http://maven.apache.org/).:1 latest:1 processing,:1 the:21 your:1 not:1 different:1 distributions.:1 given.:1 About:1 if:4 instructions.:1 be:2 do:2 Tests:1 no:1 ./bin/run-example:2 programs,:1 including:3 `./bin/run-example:1 Spark.:1 Versions:1 HDFS:1 individual:1 spark://:1 It:2 an:3 programming:1 machine:1 run::1 environment:1 clean:1 1000::2 And:1 run:7 ./bin/spark-shell:1 URL,:1 "local":1 MASTER=spark://host:7077:1 on:5 You:3 threads.:1 against:1 [Apache:1 help:1 print:1 tests:2 examples:2 at:2 in:5 -DskipTests:1 optimized:1 downloaded:1 versions:1 graphs:1 Guide](http://spark.apache.org/docs/latest/configuration.html):1 online:1 usage:1 abbreviated:1 comes:1 directory.:1 overview:1 [building:1 `examples`:2 Many:1 Running:1 way:1 use:3 Online:1 site,:1 tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).:1 running:1 find:1 sc.parallelize(range(1000)).count():1 contains:1 project:1 you:4 Pi:1 that:2 protocols:1 a:8 or:3 high-level:1 name:1 Hadoop,:2 to:14 available:1 (You:1 core:1 instance::1 see:1 of:5 tools:1 "local[N]":1 programs:2 package.):1 ["Building:1 must:1 and:10 command,:2 system:1 Hadoop:3 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null} 16/01/27 07:24:13 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null} 16/01/27 07:24:13 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.112.130:4040 16/01/27 07:24:14 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread 16/01/27 07:24:14 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors 16/01/27 07:24:14 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down 16/01/27 07:24:14 INFO cluster.YarnClientSchedulerBackend: Stopped 16/01/27 07:24:14 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 16/01/27 07:24:14 INFO storage.MemoryStore: MemoryStore cleared 16/01/27 07:24:14 INFO storage.BlockManager: BlockManager stopped 16/01/27 07:24:14 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 16/01/27 07:24:14 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 16/01/27 07:24:15 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 16/01/27 07:24:15 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 16/01/27 07:24:15 INFO spark.SparkContext: Successfully stopped SparkContext 16/01/27 07:24:15 INFO util.ShutdownHookManager: Shutdown hook called 16/01/27 07:24:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-1273b284-86a3-45a5-ab5f-7996b7f78a5f 16/01/27 07:24:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-1273b284-86a3-45a5-ab5f-7996b7f78a5f/httpd-c0ad1a3a-48fc-4aef-8ac1-ddc28159fe2c root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin#
运行结果和本地的一样。
4.通过写一个wordcount.sh,实现自动化。
创建一个wordcount.sh文件,内容如下,保存并退出。然后chmod +x wordcount.sh给shell文件执行权限。运行下面的命令,执行即可,执行结果和上面的一样。
root@master1:/usr/local/sparkApps# ./wordcount.sh
6.通过浏览器查看程序运行结果:
发表评论
-
SparkStreaming pull data from Flume
2016-06-19 17:29 1232Spark Streaming + Flume Integra ... -
Flume push数据到SparkStreaming
2016-06-19 15:16 1948上节http://kevin12.iteye.com/blog ... -
Spark Streaming 统计单词的例
2016-06-19 14:55 3测试Spark Streaming 统计单词的例子 1.准 ... -
Spark Streaming 统计单词的例子
2016-06-19 12:29 3690测试Spark Streaming 统计单词的例子 1.准备 ... -
Spark SQL窗口函数
2016-04-22 07:18 2564窗口函数又叫着窗口分析函数,Spark 1.4版本SparkS ... -
Spark SQL内置函数应用
2016-04-22 07:00 8672简单说明 使用Spark SQL中的内置函数对数据进行 ... -
Spark SQL操作Hive数据库
2016-04-13 22:37 17608本次例子通过scala编程实现Spark SQL操作Hive数 ... -
Spark SQL on hive配置和实战
2016-03-26 18:40 5581spark sql 官网:http://spark ... -
Spark RDD弹性表现和来源
2016-02-09 20:12 3861hadoop 的MapReduce是基于数 ... -
Spark内核架构
2016-02-07 12:24 10181.在将spark内核架构前,先了解一下Hadoop的MR,H ... -
spark集群HA搭建
2016-01-31 08:50 4536spark集群的HA图: 搭建spark的HA需要安装z ... -
Spark集群中WordCount运行原理
2016-01-31 07:05 2514以数据流动的视角解释一下wordcount运行的原理 pa ... -
eclipse开发spark程序配置本地运行
2016-01-27 07:58 12414今天简单讲一下在local模式下用eclipse开发一个简单的 ... -
spark1.6.0搭建(基于hadoop2.6.0分布式)
2016-01-24 10:11 5981本文是基于hadoop2.6.0的分布式环境搭建spark1. ...
相关推荐
### Spark集群及开发环境搭建(完整版) #### 一、软件及下载 本文档提供了详细的步骤来指导初学者搭建Spark集群及其开发环境。首先需要准备的软件包括: - **VirtualBox-5.1**:虚拟机软件,用于安装CentOS操作...
在Eclipse中直接运行MapReduce程序,可以进行快速的本地测试和调试,减少了实际在集群上运行的时间。 任务3是对开发过程的总结和反思,通常包括遇到的问题、解决策略以及优化建议。在实践中,可能需要根据硬件资源...
9. **建立连接**:在Eclipse中配置Hadoop和Spark连接,使得Eclipse能够与本地或远程Hadoop和Spark集群通信。 10. **开发源码**:现在你可以在Eclipse中编写Hadoop MapReduce、Spark应用和Hive查询。使用Eclipse的...
同时,Eclipse的日志视图可以显示程序运行的详细信息,帮助调试。 以上就是基于Eclipse的Hadoop应用开发环境配置的全过程。通过这个环境,开发者可以更专注于Hadoop应用的编写和优化,而无需关心底层的集群管理。...
标题 "eclipse 运行hadoop工具包" 涉及到的是在Eclipse集成开发...通过以上步骤,开发者能够在Eclipse这个强大的平台上高效地开发和运行Hadoop项目,充分发挥两者结合的优势,实现快速迭代和调试,从而提高工作效率。
初学者手册 一、 软件及下载 2 ...3. 测试spark集群 20 八、 Scala开发环境搭建 21 1、系统安装 21 2、安装jdk8 21 3、安装scala2.11 21 4、安装scala for eclipse 21 5、创建scala工程 21
5. 编写和运行代码:使用Scala编写Spark程序,然后通过ECLIPSE的运行配置来启动Spark集群并提交作业。 总之,Spark Scala开发依赖包对于在ECLIPSE中进行大数据处理项目至关重要。正确配置这些依赖将使开发者能够在...
1. **虚拟机(VM)**:VM提供了在本地计算机上运行Redhat系统的环境,使得用户可以在自己的机器上模拟一个完整的Linux服务器,便于学习和测试Hadoop集群配置。 2. **Redhat系统镜像**:Redhat是流行的Linux发行版之...
完成后,使用`spark-submit`工具提交程序到Spark集群。 在"老汤Spark2.x实战应用系列"中,很可能会涵盖以上步骤的详细说明,并给出在Windows环境下解决常见问题的技巧。这将帮助开发者避免很多初学者常见的错误,...
在Windows 7操作系统中,使用Eclipse开发Hadoop应用程序的过程涉及多个步骤,涵盖了从环境配置到实际编程的各个层面。以下是对这个主题的详细讲解: 首先,我们需要了解Hadoop和Eclipse的基础。Hadoop是一个开源的...
4. **编写和运行MapReduce程序**:使用Eclipse创建Java项目,编写MapReduce代码,并使用Eclipse的Run As > Hadoop Job选项直接在Hadoop集群上运行。 5. **调试和日志查看**:Eclipse还可以帮助调试MapReduce程序,...
6. **配置运行配置**:右键点击项目,选择`Run As` > `Run Configurations`,在`Map/Reduce Job`中设置运行参数,如主类、输入和输出路径。 7. **Hadoop.dll和winutils.exe**:在Windows环境下,由于Hadoop主要为...
- 通过Eclipse的构建和运行选项,可以直接将程序提交到运行中的Hadoop集群上进行测试和运行。 7. **监控和调试**: - Hadoop提供Web UI供用户监控集群状态,如NameNode的50070端口和ResourceManager的8088端口。 ...
综上所述,本文介绍了在特定的 Hadoop 和 Spark 集群环境下进行 WordCount 示例的实现过程。从环境搭建、IDE 配置到代码编写,每个步骤都进行了详细的说明。通过学习这个案例,可以帮助读者更好地理解 Spark 的基本...
在本地模式下,Spark会在单个JVM上运行,这方便开发者进行快速调试和测试,而不需要分布式集群环境。 在实际应用中,Spark的强大在于它的弹性分布式数据集(Resilient Distributed Datasets, RDDs),以及其支持的...
在开发过程中,可能会遇到的问题包括网络连接问题、权限问题、依赖库冲突等,解决这些问题需要对Hadoop和Eclipse的配置有深入理解。同时,为了更好地调试和优化Hadoop应用程序,开发者还需要掌握Hadoop的生态系统,...
用户编写的程序在集群上运行时,通常会包含一个Driver和多个executors,共同完成整个应用程序的计算任务。在Spark的独立部署模式中,用户需要关注如何有效地配置和管理集群资源,以便能够高效地运行Spark作业。
最后,我们可以在Eclipse中编写和运行Spark程序。利用Eclipse的调试工具,可以方便地设置断点,查看变量值,理解程序执行流程。此外,Eclipse的构建工具如Maven或Gradle可以帮助管理依赖,自动化构建过程。 关于...
看这一篇就够啦,给出一个完全分布式hadoop+spark集群搭建完整文档,从环境准备(包括机器名,ip映射步骤,ssh免密,Java等)开始,包括zookeeper,hadoop,hive,spark,eclipse/idea安装全过程,3-4节点,集群部署...
6. **运行和测试**:配置完成后,你可以通过编写简单的Java或Scala程序,利用Hadoop和Spark处理本地文件或模拟集群进行测试。例如,使用WordCount示例来验证配置是否正确。 这个压缩包中的Word文档可能包含了配置...