Scala version:
scala-2.10.4
Note:
Earlier attempts to set up the environment kept failing, most likely because Scala 2.11.4 was being used. Spark's official site states explicitly that the prebuilt Spark 1.2.0 packages do not support Scala 2.11.4:
Note: Scala 2.11 users should download the Spark source package and build with Scala 2.11 support.
Spark version:
spark-1.2.0-bin-hadoop2.4.tgz
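As an aside, if you prefer an SBT-based project instead of the Non-SBT setup described below, the version match can be expressed directly in build.sbt. This is a minimal sketch, not part of the original setup; the %% operator appends the Scala binary suffix, so the dependency resolves to spark-core_2.10 and the Scala/Spark versions stay consistent:

scalaVersion := "2.10.4"

// %% appends _2.10, keeping the Spark artifact consistent with the Scala version
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"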
Configure environment variables
export SCALA_HOME=/home/hadoop/spark1.2.0/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
export SPARK_HOME=/home/hadoop/spark1.2.0/spark-1.2.0-bin-hadoop2.4
export PATH=$SPARK_HOME/bin:$PATH
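After adding these lines (for example to ~/.bashrc — an assumption about where you keep them), a quick sanity check might look like this:

source ~/.bashrc
scala -version      # should report Scala code runner version 2.10.4
echo $SPARK_HOME    # should print /home/hadoop/spark1.2.0/spark-1.2.0-bin-hadoop2.4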
Setting up IntelliJ IDEA for Spark development
1. Download and install the Scala plugin
2. Create a Scala Non-SBT project
3. Import the Spark jar: add the Spark assembly jar from the lib/ directory of the spark-1.2.0-bin-hadoop2.4 distribution to the project's dependencies
4. Write the WordCount example code
package spark.examples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]) {
    // Note the setMaster("local") call: it makes Spark run in local mode
    // (mind the difference between local and standalone mode)
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("file:///home/hadoop/spark1.2.0/word.txt")
    rdd.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map(x => (x._2, x._1)) // swap to (count, word) so sortByKey sorts by count
      .sortByKey(false)       // descending
      .map(x => (x._2, x._1)) // swap back to (word, count)
      .saveAsTextFile("file:///home/hadoop/spark1.2.0/WordCountResult")
    sc.stop
  }
}
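As a hypothetical illustration (the actual contents of word.txt are not shown in this post), suppose the input file contains:

hello world
hello spark

Since saveAsTextFile writes each (word, count) tuple using its default toString, the part file under WordCountResult/ would then read, sorted by count in descending order (the relative order of equal counts is not defined):

(hello,2)
(world,1)
(spark,1)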
Console log:
15/01/14 22:06:34 WARN Utils: Your hostname, hadoop-Inspiron-3521 resolves to a loopback address: 127.0.1.1; using 192.168.0.111 instead (on interface eth1)
15/01/14 22:06:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/01/14 22:06:35 INFO SecurityManager: Changing view acls to: hadoop
15/01/14 22:06:35 INFO SecurityManager: Changing modify acls to: hadoop
15/01/14 22:06:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/01/14 22:06:36 INFO Slf4jLogger: Slf4jLogger started
15/01/14 22:06:36 INFO Remoting: Starting remoting
15/01/14 22:06:36 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@hadoop-Inspiron-3521.local:53624]
15/01/14 22:06:36 INFO Utils: Successfully started service 'sparkDriver' on port 53624.
15/01/14 22:06:36 INFO SparkEnv: Registering MapOutputTracker
15/01/14 22:06:36 INFO SparkEnv: Registering BlockManagerMaster
15/01/14 22:06:36 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150114220636-4826
15/01/14 22:06:36 INFO MemoryStore: MemoryStore started with capacity 461.7 MB
15/01/14 22:06:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/01/14 22:06:37 INFO HttpFileServer: HTTP File server directory is /tmp/spark-19683393-0315-498c-9b72-9c6a13684f44
15/01/14 22:06:37 INFO HttpServer: Starting HTTP Server
15/01/14 22:06:38 INFO Utils: Successfully started service 'HTTP file server' on port 53231.
15/01/14 22:06:43 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/01/14 22:06:43 INFO SparkUI: Started SparkUI at http://hadoop-Inspiron-3521.local:4040
15/01/14 22:06:43 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@hadoop-Inspiron-3521.local:53624/user/HeartbeatReceiver
15/01/14 22:06:44 INFO NettyBlockTransferService: Server created on 46971
15/01/14 22:06:44 INFO BlockManagerMaster: Trying to register BlockManager
15/01/14 22:06:44 INFO BlockManagerMasterActor: Registering block manager localhost:46971 with 461.7 MB RAM, BlockManagerId(<driver>, localhost, 46971)
15/01/14 22:06:44 INFO BlockManagerMaster: Registered BlockManager
15/01/14 22:06:44 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=484127539
15/01/14 22:06:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 461.5 MB)
15/01/14 22:06:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=484127539
15/01/14 22:06:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 461.5 MB)
15/01/14 22:06:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46971 (size: 22.2 KB, free: 461.7 MB)
15/01/14 22:06:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/01/14 22:06:45 INFO SparkContext: Created broadcast 0 from textFile at SparkWordCount.scala:40
15/01/14 22:06:45 INFO FileInputFormat: Total input paths to process : 1
15/01/14 22:06:45 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/01/14 22:06:45 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/01/14 22:06:45 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/01/14 22:06:45 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/01/14 22:06:45 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/01/14 22:06:46 INFO SparkContext: Starting job: saveAsTextFile at SparkWordCount.scala:43
15/01/14 22:06:46 INFO DAGScheduler: Registering RDD 3 (map at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO DAGScheduler: Registering RDD 5 (map at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO DAGScheduler: Got job 0 (saveAsTextFile at SparkWordCount.scala:43) with 1 output partitions (allowLocal=false)
15/01/14 22:06:46 INFO DAGScheduler: Final stage: Stage 2(saveAsTextFile at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO DAGScheduler: Parents of final stage: List(Stage 1)
15/01/14 22:06:46 INFO DAGScheduler: Missing parents: List(Stage 1)
15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[3] at map at SparkWordCount.scala:43), which has no missing parents
15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(3560) called with curMem=186397, maxMem=484127539
15/01/14 22:06:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.5 KB, free 461.5 MB)
15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2528) called with curMem=189957, maxMem=484127539
15/01/14 22:06:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 461.5 MB)
15/01/14 22:06:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:46971 (size: 2.5 KB, free: 461.7 MB)
15/01/14 22:06:46 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/01/14 22:06:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/01/14 22:06:46 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[3] at map at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/01/14 22:06:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1292 bytes)
15/01/14 22:06:46 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/01/14 22:06:46 INFO HadoopRDD: Input split: file:/home/hadoop/spark1.2.0/word.txt:0+29
15/01/14 22:06:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1895 bytes result sent to driver
15/01/14 22:06:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 323 ms on localhost (1/1)
15/01/14 22:06:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/01/14 22:06:46 INFO DAGScheduler: Stage 0 (map at SparkWordCount.scala:43) finished in 0.350 s
15/01/14 22:06:46 INFO DAGScheduler: looking for newly runnable stages
15/01/14 22:06:46 INFO DAGScheduler: running: Set()
15/01/14 22:06:46 INFO DAGScheduler: waiting: Set(Stage 1, Stage 2)
15/01/14 22:06:46 INFO DAGScheduler: failed: Set()
15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 1: List()
15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 2: List(Stage 1)
15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[5] at map at SparkWordCount.scala:43), which is now runnable
15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2992) called with curMem=192485, maxMem=484127539
15/01/14 22:06:46 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 461.5 MB)
15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2158) called with curMem=195477, maxMem=484127539
15/01/14 22:06:46 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 461.5 MB)
15/01/14 22:06:46 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:46971 (size: 2.1 KB, free: 461.7 MB)
15/01/14 22:06:46 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/01/14 22:06:46 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/01/14 22:06:46 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[5] at map at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/01/14 22:06:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1045 bytes)
15/01/14 22:06:46 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/01/14 22:06:46 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/01/14 22:06:46 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 12 ms
15/01/14 22:06:46 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1000 bytes result sent to driver
15/01/14 22:06:46 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 110 ms on localhost (1/1)
15/01/14 22:06:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/01/14 22:06:46 INFO DAGScheduler: Stage 1 (map at SparkWordCount.scala:43) finished in 0.106 s
15/01/14 22:06:46 INFO DAGScheduler: looking for newly runnable stages
15/01/14 22:06:46 INFO DAGScheduler: running: Set()
15/01/14 22:06:46 INFO DAGScheduler: waiting: Set(Stage 2)
15/01/14 22:06:46 INFO DAGScheduler: failed: Set()
15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 2: List()
15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[8] at saveAsTextFile at SparkWordCount.scala:43), which is now runnable
15/01/14 22:06:47 INFO MemoryStore: ensureFreeSpace(112880) called with curMem=197635, maxMem=484127539
15/01/14 22:06:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 110.2 KB, free 461.4 MB)
15/01/14 22:06:47 INFO MemoryStore: ensureFreeSpace(67500) called with curMem=310515, maxMem=484127539
15/01/14 22:06:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 65.9 KB, free 461.3 MB)
15/01/14 22:06:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:46971 (size: 65.9 KB, free: 461.6 MB)
15/01/14 22:06:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/01/14 22:06:47 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
15/01/14 22:06:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (MappedRDD[8] at saveAsTextFile at SparkWordCount.scala:43)
15/01/14 22:06:47 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/01/14 22:06:47 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1056 bytes)
15/01/14 22:06:47 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
15/01/14 22:06:47 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/01/14 22:06:47 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/01/14 22:06:47 INFO FileOutputCommitter: Saved output of task 'attempt_201501142206_0002_m_000000_2' to file:/home/hadoop/spark1.2.0/WordCountResult/_temporary/0/task_201501142206_0002_m_000000
15/01/14 22:06:47 INFO SparkHadoopWriter: attempt_201501142206_0002_m_000000_2: Committed
15/01/14 22:06:47 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 824 bytes result sent to driver
15/01/14 22:06:47 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 397 ms on localhost (1/1)
15/01/14 22:06:47 INFO DAGScheduler: Stage 2 (saveAsTextFile at SparkWordCount.scala:43) finished in 0.399 s
15/01/14 22:06:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/01/14 22:06:47 INFO DAGScheduler: Job 0 finished: saveAsTextFile at SparkWordCount.scala:43, took 1.241181 s
15/01/14 22:06:47 INFO SparkUI: Stopped Spark web UI at http://hadoop-Inspiron-3521.local:4040
15/01/14 22:06:47 INFO DAGScheduler: Stopping DAGScheduler
15/01/14 22:06:48 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/01/14 22:06:48 INFO MemoryStore: MemoryStore cleared
15/01/14 22:06:48 INFO BlockManager: BlockManager stopped
15/01/14 22:06:48 INFO BlockManagerMaster: BlockManagerMaster stopped
15/01/14 22:06:48 INFO SparkContext: Successfully stopped SparkContext
15/01/14 22:06:48 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/01/14 22:06:48 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

Process finished with exit code 0
Adjusting the log level
The output above shows that Spark logs at the INFO level by default. To see the complete log output, you can adjust Spark's logging by creating a log4j.properties file in the source root of the wordcount project, with the following content:
log4j.rootCategory=DEBUG, file
log4j.appender.file=org.apache.log4j.ConsoleAppender
# To write the log to a file instead, use FileAppender
#log4j.appender.file=org.apache.log4j.FileAppender
#log4j.appender.file.file=spark.log
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n

# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
org.eclipse.jetty.LEVEL=WARN
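Alternatively, because Spark 1.2 logs through log4j, the level can also be set programmatically at the start of the driver. A minimal sketch (the logger names and levels here are only an example, not part of the original write-up):

import org.apache.log4j.{Level, Logger}

// Quiet Spark's and Jetty's INFO chatter; application log statements are unaffected
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.eclipse.jetty").setLevel(Level.WARN)

Place these lines at the top of main, before new SparkContext(conf), so they take effect before Spark starts logging.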