Configuring Eclipse to develop a Spark program and run it locally
This post walks through developing a simple Spark application in Eclipse and running and testing it locally in local mode.
1. Download the latest Scala IDE for Eclipse (Windows 64-bit) from http://scala-ide.org/download/sdk.html
After downloading, extract it to the D: drive, launch it, and choose a workspace.
Then create a test project named ScalaDev. Right-click the project and select Properties, choose Scala Compiler in the dialog, check Use Project Settings, pick a Scala Installation on the right-hand tab, and click OK to save the configuration.
2. Add the Spark 1.6.0 jar dependency spark-assembly-1.6.0-hadoop2.6.0.jar to the project.
spark-assembly-1.6.0-hadoop2.6.0.jar is located under the lib directory inside the spark-1.6.0-bin-hadoop2.6.tgz package.
Right-click the ScalaDev project and select Build Path -> Configure Build Path.
Note: if you chose Latest 2.11 bundle (dynamic) as the Scala Installation, the project will fail to build: a red cross appears on the ScalaDev project, and the Problems view shows that the Scala compiler version does not match the one Spark was built against:
More than one scala library found in the build path (D:/eclipse/plugins/org.scala-lang.scala-library_2.11.7.v20150622-112736-1fbce4612c.jar, F:/IMF/Big_Data_Software/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar). At least one has an incompatible version. Please update the project build path so it contains only one compatible scala library.
Fix: right-click Scala Library Container -> Properties, select Latest 2.10 bundle (dynamic) in the dialog, and save.
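As an aside (not covered in the original post): if you prefer a build tool to a hand-copied assembly jar, a minimal build.sbt along roughly these lines should pull in an equivalent Spark 1.6.0 dependency. The exact Scala patch version is an assumption; any 2.10.x release matches Spark 1.6.0.

name := "ScalaDev"

// Must stay on Scala 2.10.x to match the Scala version Spark 1.6.0 was built against
scalaVersion := "2.10.5"

// spark-core provides the classes this WordCount example needs (the assembly jar bundles them too)
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"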
3. Create the Spark package and the entry-point class under src.
Right-click the project and choose New -> Package to create the com.imf.spark package;
select the com.imf.spark package and create a Scala Object.
Before running the test program, copy the README.md file from the spark-1.6.0-bin-hadoop2.6 directory to D://testspark//. The code is as follows:
package com.imf.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

/**
 * A Spark WordCount program, written in Scala, for local testing.
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    /**
     * 1. Create the SparkConf object, which holds the runtime configuration of the Spark program.
     *    For example, setMaster sets the URL of the Spark cluster master the program connects to;
     *    setting it to "local" runs the program locally, which is handy on a poorly equipped machine.
     */
    // Create the SparkConf object
    val conf = new SparkConf()
    // Set the application name, shown in the monitoring UI while the program runs
    conf.setAppName("My First Spark App!")
    // Set "local" so the program runs locally, without a Spark cluster installation
    conf.setMaster("local")

    /**
     * 2. Create the SparkContext object.
     *    SparkContext is the single entry point to all Spark functionality, whether the program is
     *    written in Scala, Java, Python or R.
     *    Its core job is to initialize the components a Spark application needs at runtime, including
     *    DAGScheduler, TaskScheduler and SchedulerBackend, and to register the program with the Master.
     *    SparkContext is the most important object in the whole application.
     */
    // Create the SparkContext, passing in the SparkConf instance that customizes the runtime configuration
    val sc = new SparkContext(conf)

    /**
     * 3. Create an RDD through the SparkContext from the concrete data source (HDFS, HBase, local FS, DB, S3, ...).
     *    An RDD can be created in three basic ways: from an external data source (e.g. HDFS),
     *    from a Scala collection, or by transforming another RDD.
     *    The data is split into a series of partitions; the data assigned to each partition is handled by one task.
     */
    // Read the local file as a single partition
    val lines = sc.textFile("D://testspark//README.md", 1)

    /**
     * 4. Apply transformations such as map and filter to the initial RDD to perform the actual computation.
     * 4.1 Split every line into individual words.
     */
    // Split each line and flatten the per-line results into one large collection
    val words = lines.flatMap { line => line.split(" ") }

    /**
     * 4.2 On top of the split words, count each occurrence as 1, i.e. word => (word, 1).
     */
    val pairs = words.map { word => (word, 1) }

    /**
     * 4.3 Sum the per-occurrence counts to get the total count of each word in the file.
     */
    // Accumulate the values for identical keys (both locally and at the reducer level)
    val wordCounts = pairs.reduceByKey(_ + _)

    // Print the result
    wordCounts.foreach(pair => println(pair._1 + ":" + pair._2))

    sc.stop()
  }
}
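As a small follow-up sketch (not part of the original program): instead of printing the counts to the console, they could be sorted by frequency and written to disk with the same RDD API. The output directory D://testspark//wc_output is an assumption and must not exist before the run, since saveAsTextFile refuses to overwrite an existing directory.

// Sort by count in descending order, then write the result out as text files
val sorted = wordCounts.sortBy(_._2, ascending = false)
sorted.saveAsTextFile("D://testspark//wc_output")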
Run result of the WordCount program above:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 16/01/26 08:23:37 INFO SparkContext: Running Spark version 1.6.0 16/01/26 08:23:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/01/26 08:23:42 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363) at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104) at org.apache.hadoop.security.Groups.<init>(Groups.java:86) at org.apache.hadoop.security.Groups.<init>(Groups.java:66) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248) at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621) at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136) at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2136) at org.apache.spark.SparkContext.<init>(SparkContext.scala:322) at com.dt.spark.WordCount$.main(WordCount.scala:29) at com.dt.spark.WordCount.main(WordCount.scala) 16/01/26 08:23:42 INFO SecurityManager: Changing view acls to: vivi 16/01/26 08:23:42 INFO SecurityManager: Changing modify acls to: vivi 16/01/26 08:23:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vivi); users with modify permissions: Set(vivi) 16/01/26 08:23:43 INFO Utils: Successfully started service 'sparkDriver' on port 54663. 16/01/26 08:23:43 INFO Slf4jLogger: Slf4jLogger started 16/01/26 08:23:43 INFO Remoting: Starting remoting 16/01/26 08:23:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.100.102:54676] 16/01/26 08:23:43 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54676. 16/01/26 08:23:43 INFO SparkEnv: Registering MapOutputTracker 16/01/26 08:23:43 INFO SparkEnv: Registering BlockManagerMaster 16/01/26 08:23:43 INFO DiskBlockManager: Created local directory at C:\Users\vivi\AppData\Local\Temp\blockmgr-5f59f3c2-3b87-49c5-a1ae-e21847aac44b 16/01/26 08:23:43 INFO MemoryStore: MemoryStore started with capacity 1813.7 MB 16/01/26 08:23:43 INFO SparkEnv: Registering OutputCommitCoordinator 16/01/26 08:23:43 INFO Utils: Successfully started service 'SparkUI' on port 4040. 16/01/26 08:23:43 INFO SparkUI: Started SparkUI at http://192.168.100.102:4040 16/01/26 08:23:43 INFO Executor: Starting executor ID driver on host localhost 16/01/26 08:23:43 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54683. 
16/01/26 08:23:43 INFO NettyBlockTransferService: Server created on 54683 16/01/26 08:23:43 INFO BlockManagerMaster: Trying to register BlockManager 16/01/26 08:23:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54683 with 1813.7 MB RAM, BlockManagerId(driver, localhost, 54683) 16/01/26 08:23:43 INFO BlockManagerMaster: Registered BlockManager 16/01/26 08:23:46 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB) 16/01/26 08:23:46 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.6 KB) 16/01/26 08:23:46 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54683 (size: 13.9 KB, free: 1813.7 MB) 16/01/26 08:23:46 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:37 16/01/26 08:23:47 WARN : Your hostname, vivi-PC resolves to a loopback/non-reachable address: fe80:0:0:0:5937:95c4:86da:2f43%30, but we couldn't find any external IP address! 16/01/26 08:23:48 INFO FileInputFormat: Total input paths to process : 1 16/01/26 08:23:48 INFO SparkContext: Starting job: foreach at WordCount.scala:56 16/01/26 08:23:48 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:48) 16/01/26 08:23:48 INFO DAGScheduler: Got job 0 (foreach at WordCount.scala:56) with 1 output partitions 16/01/26 08:23:48 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at WordCount.scala:56) 16/01/26 08:23:48 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 16/01/26 08:23:48 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 16/01/26 08:23:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:48), which has no missing parents 16/01/26 08:23:48 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.0 KB, free 171.6 KB) 16/01/26 08:23:48 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 173.9 KB) 16/01/26 08:23:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54683 (size: 2.3 KB, free: 1813.7 MB) 16/01/26 08:23:48 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006 16/01/26 08:23:48 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:48) 16/01/26 08:23:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 16/01/26 08:23:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2119 bytes) 16/01/26 08:23:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 16/01/26 08:23:48 INFO HadoopRDD: Input split: file:/D:/testspark/README.md:0+3359 16/01/26 08:23:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 16/01/26 08:23:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 16/01/26 08:23:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 16/01/26 08:23:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 16/01/26 08:23:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 16/01/26 08:23:48 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 
2253 bytes result sent to driver 16/01/26 08:23:48 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 177 ms on localhost (1/1) 16/01/26 08:23:48 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 16/01/26 08:23:48 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:48) finished in 0.186 s 16/01/26 08:23:48 INFO DAGScheduler: looking for newly runnable stages 16/01/26 08:23:48 INFO DAGScheduler: running: Set() 16/01/26 08:23:48 INFO DAGScheduler: waiting: Set(ResultStage 1) 16/01/26 08:23:48 INFO DAGScheduler: failed: Set() 16/01/26 08:23:48 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:54), which has no missing parents 16/01/26 08:23:48 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.5 KB, free 176.4 KB) 16/01/26 08:23:48 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1581.0 B, free 177.9 KB) 16/01/26 08:23:48 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54683 (size: 1581.0 B, free: 1813.7 MB) 16/01/26 08:23:48 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006 16/01/26 08:23:48 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:54) 16/01/26 08:23:48 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks 16/01/26 08:23:48 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes) 16/01/26 08:23:48 INFO Executor: Running task 0.0 in stage 1.0 (TID 1) 16/01/26 08:23:48 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks 16/01/26 08:23:48 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms package:1 For:2 Programs:1 processing.:1 Because:1 The:1 cluster.:1 its:1 [run:1 APIs:1 have:1 Try:1 computation:1 through:1 several:1 This:2 graph:1 Hive:2 storage:1 ["Specifying:1 To:2 page](http://spark.apache.org/documentation.html):1 Once:1 "yarn":1 prefer:1 SparkPi:2 engine:1 version:1 file:1 documentation,:1 processing,:1 the:21 are:1 systems.:1 params:1 not:1 different:1 refer:2 Interactive:2 R,:1 given.:1 if:4 build:3 when:1 be:2 Tests:1 Apache:1 ./bin/run-example:2 programs,:1 including:3 Spark.:1 package.:1 1000).count():1 Versions:1 HDFS:1 Data.:1 >>>:1 programming:1 Testing:1 module,:1 Streaming:1 environment:1 run::1 clean:1 1000::2 rich:1 GraphX:1 Please:3 is:6 run:7 URL,:1 threads.:1 same:1 MASTER=spark://host:7077:1 on:5 built:1 against:1 [Apache:1 tests:2 examples:2 at:2 optimized:1 usage:1 using:2 graphs:1 talk:1 Shell:2 class:2 abbreviated:1 directory.:1 README:1 computing:1 overview:1 `examples`:2 example::1 ##:8 N:1 set:2 use:3 Hadoop-supported:1 tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).:1 running:1 find:1 contains:1 project:1 Pi:1 need:1 or:3 Big:1 Java,:1 high-level:1 uses:1 <class>:1 Hadoop,:2 available:1 requires:1 (You:1 see:1 Documentation:1 of:5 tools:1 using::1 cluster:2 must:1 supports:2 built,:1 system:1 build/mvn:1 Hadoop:3 this:1 Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version):1 particular:2 Python:2 Spark:13 general:2 YARN,:1 pre-built:1 [Configuration:1 locally:2 library:1 A:1 locally.:1 sc.parallelize(1:1 only:1 Configuration:1 following:2 basic:1 #:1 changed:1 More:1 which:2 learning,:1 first:1 ./bin/pyspark:1 also:4 should:2 for:11 [params]`.:1 documentation:3 [project:2 mesos://:1 
Maven](http://maven.apache.org/).:1 setup:1 <http://spark.apache.org/>:1 latest:1 your:1 MASTER:1 example:3 scala>:1 DataFrames,:1 provides:1 configure:1 distributions.:1 can:6 About:1 instructions.:1 do:2 easiest:1 no:1 how:2 `./bin/run-example:1 Note:1 individual:1 spark://:1 It:2 Scala:2 Alternatively,:1 an:3 variable:1 submit:1 machine:1 thread,:1 them,:1 detailed:2 stream:1 And:1 distribution:1 return:2 Thriftserver:1 ./bin/spark-shell:1 "local":1 start:1 You:3 Spark](#building-spark).:1 one:2 help:1 with:3 print:1 Spark"](http://spark.apache.org/docs/latest/building-spark.html).:1 data:1 wiki](https://cwiki.apache.org/confluence/display/SPARK).:1 in:5 -DskipTests:1 downloaded:1 versions:1 online:1 Guide](http://spark.apache.org/docs/latest/configuration.html):1 comes:1 [building:1 Python,:2 Many:1 building:2 Running:1 from:1 way:1 Online:1 site,:1 other:1 Example:1 analysis.:1 sc.parallelize(range(1000)).count():1 you:4 runs.:1 Building:1 higher-level:1 protocols:1 guidance:2 a:8 guide,:1 name:1 fast:1 SQL:2 will:1 instance::1 to:14 core:1 :67 web:1 "local[N]":1 programs:2 package.):1 that:2 MLlib:1 ["Building:1 shell::2 Scala,:1 and:10 command,:2 ./dev/run-tests:1 sample:1 16/01/26 08:23:48 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver 16/01/26 08:23:48 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 61 ms on localhost (1/1) 16/01/26 08:23:48 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 16/01/26 08:23:48 INFO DAGScheduler: ResultStage 1 (foreach at WordCount.scala:56) finished in 0.061 s 16/01/26 08:23:48 INFO DAGScheduler: Job 0 finished: foreach at WordCount.scala:56, took 0.328012 s 16/01/26 08:23:48 INFO SparkUI: Stopped Spark web UI at http://192.168.100.102:4040 16/01/26 08:23:48 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 16/01/26 08:23:48 INFO MemoryStore: MemoryStore cleared 16/01/26 08:23:48 INFO BlockManager: BlockManager stopped 16/01/26 08:23:48 INFO BlockManagerMaster: BlockManagerMaster stopped 16/01/26 08:23:48 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 16/01/26 08:23:48 INFO SparkContext: Successfully stopped SparkContext 16/01/26 08:23:48 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 16/01/26 08:23:48 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 16/01/26 08:23:48 INFO ShutdownHookManager: Shutdown hook called 16/01/26 08:23:48 INFO ShutdownHookManager: Deleting directory C:\Users\vivi\AppData\Local\Temp\spark-56f9ed0a-5671-449a-955a-041c63569ff2
Note: the ERROR near the top of the output comes from Hadoop failing to locate the winutils binary while loading its configuration; since the program runs purely locally, the binary cannot be found, but this does not affect the test.
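If you want to silence that winutils error rather than ignore it, a commonly used workaround is to point hadoop.home.dir at a directory containing winutils.exe before the SparkContext is created. This is only a sketch: the D:\hadoop path is an assumption, and winutils.exe has to be downloaded separately and placed under its bin subdirectory.

// Run this at the top of main(), before new SparkContext(conf).
// The path is an assumption -- adjust it to wherever winutils.exe actually lives (here D:\hadoop\bin\winutils.exe).
System.setProperty("hadoop.home.dir", "D:\\hadoop")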