[Spark 67] Spark Standalone Fully Distributed Installation

 

 

1. Download and extract Spark 1.2.1 (pre-built for Hadoop 2.4)

http://mirror.bit.edu.cn/apache/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
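For example, a minimal sketch assuming the installation directory /home/hadoop/software that all later paths in this post use:

# download the pre-built Spark 1.2.1 package and unpack it
cd /home/hadoop/software
wget http://mirror.bit.edu.cn/apache/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.4.tgz
tar -zxf spark-1.2.1-bin-hadoop2.4.tgz    # creates spark-1.2.1-bin-hadoop2.4/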

 

2. Download and extract Scala 2.10.4

http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
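Likewise, assuming the same directory:

# download and unpack Scala 2.10.4
cd /home/hadoop/software
wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
tar -zxf scala-2.10.4.tgz    # creates scala-2.10.4/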

 

3. Configure the Scala environment variables

3.1 Set SCALA_HOME to the Scala installation directory:

 

/home/hadoop/software/scala-2.10.4
 

 

3.2 Add $SCALA_HOME/bin to the system PATH variable, for example:
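A minimal sketch, assuming the variables are appended to /etc/profile (the path is the one from 3.1):

# make Scala available on the command line
export SCALA_HOME=/home/hadoop/software/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH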

 

3.3 Run scala -version in a console to check the installed Scala version:

 

Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67).
 

 

 

4. Configure the Spark environment variables (edit /etc/profile, e.g. with vim)

 

export JAVA_HOME=/home/hadoop/software/jdk1.7.0_67
export HADOOP_HOME=/home/hadoop/software/hadoop-2.5.2
export SPARK_HOME=/home/hadoop/software/spark-1.2.1-bin-hadoop2.4
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
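After saving the file, the variables can be loaded into the current shell (otherwise they only take effect at the next login); this also applies to the Scala settings from step 3:

source /etc/profile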

 

5. Edit the spark-env.sh file that ships with Spark (do this on every node; see the note after the variable list about creating and distributing the file)

 

export JAVA_HOME=/home/hadoop/software/jdk1.7.0_67
export HADOOP_HOME=/home/hadoop/software/hadoop-2.5.2
export SPARK_HOME=/home/hadoop/software/spark-1.2.1-bin-hadoop2.4
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SCALA_HOME=/home/hadoop/software/scala-2.10.4
export SPARK_MASTER_IP=192.168.26.131
export SPARK_WORKER_INSTANCES=1
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7087
export SPARK_WORKER_PORT=8077
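Note that spark-env.sh is not present in a fresh distribution; it is normally created from the template bundled with Spark and, since every node needs the same settings, copied to the slaves afterwards. A sketch, assuming password-less SSH for the hadoop user and the slave IPs used below:

cd /home/hadoop/software/spark-1.2.1-bin-hadoop2.4
# create spark-env.sh from the bundled template, then add the exports above
cp conf/spark-env.sh.template conf/spark-env.sh
# push the same file to both slave nodes
scp conf/spark-env.sh hadoop@192.168.26.133:/home/hadoop/software/spark-1.2.1-bin-hadoop2.4/conf/
scp conf/spark-env.sh hadoop@192.168.26.134:/home/hadoop/software/spark-1.2.1-bin-hadoop2.4/conf/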

 

6. Edit the slaves file on the Master node

 

On the Master node, edit $SPARK_HOME/conf/slaves to list the slave nodes. Do not make this change on the Slave nodes:

 

192.168.26.133
192.168.26.134

 

7. Start the Master and the two slaves

7.1 On the Master node, start the Master and the two Slaves with the following command:

 

sbin/start-all.sh
 

 

 

7.2 On this first start, the two Slaves came up but the Master did not. The Master's exception was:

 

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /home/hadoop/software/jdk1.7.0_67/bin/java -cp ::/home/hadoop/software/spark-1.2.1-bin-hadoop2.4/sbin/../conf:/home/hadoop/software/spark-1.2.1-bin-hadoop2.4/lib/spark-assembly-1.2.1-hadoop2.4.0.jar:/home/hadoop/software/spark-1.2.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/hadoop/software/spark-1.2.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/software/spark-1.2.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/software/hadoop-2.5.2/etc/hadoop:/home/hadoop/software/hadoop-2.5.2/etc/hadoop -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip 192.168.26.131 --port 7077 --webui-port 7087
========================================

15/02/18 01:47:16 INFO master.Master: Registered signal handlers for [TERM, HUP, INT]
15/02/18 01:47:17 INFO spark.SecurityManager: Changing view acls to: hadoop
15/02/18 01:47:17 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/02/18 01:47:17 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/02/18 01:47:21 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/18 01:47:24 INFO Remoting: Starting remoting
Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at akka.remote.Remoting.start(Remoting.scala:180)
        at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
        at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
        at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
        at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
        at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
        at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
        at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1765)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1756)
        at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
        at org.apache.spark.deploy.master.Master$.startSystemAndActor(Master.scala:849)
        at org.apache.spark.deploy.master.Master$.main(Master.scala:829)
        at org.apache.spark.deploy.master.Master.main(Master.scala)
15/02/18 01:47:33 ERROR Remoting: Remoting error: [Startup timed out] [
akka.remote.RemoteTransportException: Startup timed out
        at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:136)
        at akka.remote.Remoting.start(Remoting.scala:198)
        at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
        at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
        at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
        at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
        at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
        at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
        at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1765)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1756)
        at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
        at org.apache.spark.deploy.master.Master$.startSystemAndActor(Master.scala:849)
        at org.apache.spark.deploy.master.Master$.main(Master.scala:829)
        at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at akka.remote.Remoting.start(Remoting.scala:180)
        ... 17 more
]
15/02/18 01:47:37 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/02/18 01:47:37 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

 

 

7.3 Run stop-all.sh to shut down all Master and Slave processes, then run start-all.sh again; this time the Master and both Slaves start successfully.
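The restart, run from $SPARK_HOME on the Master:

# stop the Master and all Workers, then bring the whole cluster back up
sbin/stop-all.sh
sbin/start-all.sh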

7.3.1 The Master's startup log:

 

15/02/18 01:54:50 INFO master.Master: Registered signal handlers for [TERM, HUP, INT]
15/02/18 01:54:50 INFO spark.SecurityManager: Changing view acls to: hadoop
15/02/18 01:54:50 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/02/18 01:54:50 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/02/18 01:54:51 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/18 01:54:51 INFO Remoting: Starting remoting
15/02/18 01:54:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@192.168.26.131:7077]
15/02/18 01:54:52 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkMaster@192.168.26.131:7077]
15/02/18 01:54:52 INFO util.Utils: Successfully started service 'sparkMaster' on port 7077.
15/02/18 01:54:52 INFO master.Master: Starting Spark master at spark://192.168.26.131:7077
15/02/18 01:54:52 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/02/18 01:54:52 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:7087
15/02/18 01:54:52 INFO util.Utils: Successfully started service 'MasterUI' on port 7087.
15/02/18 01:54:52 INFO ui.MasterWebUI: Started MasterWebUI at http://hadoop.master:7087
15/02/18 01:54:53 INFO master.Master: I have been elected leader! New state: ALIVE
15/02/18 01:54:57 INFO master.Master: Registering worker hadoop.slave2:8077 with 1 cores, 971.0 MB RAM
15/02/18 01:54:57 INFO master.Master: Registering worker hadoop.slave1:8077 with 1 cores, 971.0 MB RAM

From this log we can see that:

7.3.1.1 The Master started and is listening on port 7077.

7.3.1.2 The Master's WebUI started and is reachable at http://hadoop.master:7087 (port 7087, as set by SPARK_MASTER_WEBUI_PORT).

7.3.1.3 hadoop.slave1 and hadoop.slave2 are both listening on port 8077 and have registered with the Master.

 

 

7.3.2 The Slave-side log:

 

15/02/18 01:54:54 INFO worker.Worker: Registered signal handlers for [TERM, HUP, INT]
15/02/18 01:54:54 INFO spark.SecurityManager: Changing view acls to: hadoop
15/02/18 01:54:54 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/02/18 01:54:54 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/02/18 01:54:55 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/18 01:54:56 INFO Remoting: Starting remoting
15/02/18 01:54:56 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@hadoop.slave1:8077]
15/02/18 01:54:56 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkWorker@hadoop.slave1:8077]
15/02/18 01:54:56 INFO util.Utils: Successfully started service 'sparkWorker' on port 8077.
15/02/18 01:54:56 INFO worker.Worker: Starting Spark worker hadoop.slave1:8077 with 1 cores, 971.0 MB RAM
15/02/18 01:54:56 INFO worker.Worker: Spark home: /home/hadoop/software/spark-1.2.1-bin-hadoop2.4
15/02/18 01:54:56 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/02/18 01:54:56 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:8081
15/02/18 01:54:56 INFO util.Utils: Successfully started service 'WorkerUI' on port 8081.
15/02/18 01:54:56 INFO ui.WorkerWebUI: Started WorkerWebUI at http://hadoop.slave1:8081
15/02/18 01:54:56 INFO worker.Worker: Connecting to master spark://192.168.26.131:7077...
15/02/18 01:54:57 INFO worker.Worker: Successfully registered with master spark://192.168.26.131:7077

 

8. Web UI

 

8.1 Master UI (port 7087):

 

 

 

8.2 Slave UI (port 8081):

 

 

 

9. Spark cluster tests

 

 

9.1 Start a Spark shell on the Master and run the operations below. All of the results end up on the Master and nothing is executed on the Workers. Why? The suspicion is that bin/spark-shell uses a local master by default, i.e. it behaves like spark-shell --master local (running sc.master inside the shell shows which master is actually in use).

 

bin>./spark-shell
scala> val rdd = sc.parallelize(List(1,3,7,7,8,9,11,2,11,33,44,99,111,2432,4311,111,111), 7)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:12

scala> rdd.saveAsTextFile("file:///home/hadoop/output")

 
9.2 Exit the shell above and instead start it against the cluster:

./spark-shell --master spark://192.168.26.131:7077

If the IP above is replaced with a hostname such as hadoop.master, the shell reports that it cannot connect to the master; the reason is unknown.

The log shows that the work is now submitted to the Workers for execution:

scala> rdd.saveAsTextFile("file:///home/hadoop/output2")
15/02/18 02:37:55 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/02/18 02:37:55 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/02/18 02:37:55 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/02/18 02:37:55 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/02/18 02:37:55 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/02/18 02:37:55 INFO spark.SparkContext: Starting job: saveAsTextFile at <console>:15
15/02/18 02:37:55 INFO scheduler.DAGScheduler: Got job 0 (saveAsTextFile at <console>:15) with 7 output partitions (allowLocal=false)
15/02/18 02:37:55 INFO scheduler.DAGScheduler: Final stage: Stage 0(saveAsTextFile at <console>:15)
15/02/18 02:37:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/02/18 02:37:55 INFO scheduler.DAGScheduler: Missing parents: List()
15/02/18 02:37:55 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[1] at saveAsTextFile at <console>:15), which has no missing parents
15/02/18 02:37:55 INFO storage.MemoryStore: ensureFreeSpace(112056) called with curMem=0, maxMem=280248975
15/02/18 02:37:55 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 109.4 KB, free 267.2 MB)
15/02/18 02:37:55 INFO storage.MemoryStore: ensureFreeSpace(67552) called with curMem=112056, maxMem=280248975
15/02/18 02:37:55 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 66.0 KB, free 267.1 MB)
15/02/18 02:37:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop.master:44435 (size: 66.0 KB, free: 267.2 MB)
15/02/18 02:37:55 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/18 02:37:55 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:838
15/02/18 02:37:55 INFO scheduler.DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[1] at saveAsTextFile at <console>:15)
15/02/18 02:37:55 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
15/02/18 02:37:55 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, hadoop.slave1, PROCESS_LOCAL, 1208 bytes)
15/02/18 02:37:55 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, hadoop.slave2, PROCESS_LOCAL, 1208 bytes)
15/02/18 02:38:06 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop.slave1:41802 (size: 66.0 KB, free: 267.2 MB)
15/02/18 02:38:06 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop.slave2:34337 (size: 66.0 KB, free: 267.2 MB)
15/02/18 02:38:24 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, hadoop.slave1, PROCESS_LOCAL, 1212 bytes)
15/02/18 02:38:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 28274 ms on hadoop.slave1 (1/7)
15/02/18 02:38:24 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, hadoop.slave2, PROCESS_LOCAL, 1208 bytes)
15/02/18 02:38:24 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 28719 ms on hadoop.slave2 (2/7)
15/02/18 02:38:26 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, hadoop.slave2, PROCESS_LOCAL, 1212 bytes)
15/02/18 02:38:26 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 1908 ms on hadoop.slave2 (3/7)
15/02/18 02:38:26 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, hadoop.slave1, PROCESS_LOCAL, 1208 bytes)
15/02/18 02:38:26 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2595 ms on hadoop.slave1 (4/7)
15/02/18 02:38:27 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, hadoop.slave2, PROCESS_LOCAL, 1212 bytes)
15/02/18 02:38:27 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 440 ms on hadoop.slave2 (5/7)
15/02/18 02:38:27 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 336 ms on hadoop.slave1 (6/7)
15/02/18 02:38:27 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 193 ms on hadoop.slave2 (7/7)
15/02/18 02:38:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/02/18 02:38:27 INFO scheduler.DAGScheduler: Stage 0 (saveAsTextFile at <console>:15) finished in 31.266 s
15/02/18 02:38:27 INFO scheduler.DAGScheduler: Job 0 finished: saveAsTextFile at <console>:15, took 31.932050 s

 

It turns out that although Slave1 and Slave2 do have the output directory, there is no data underneath it. (Presumably because, with a file:// path, each Worker writes its partitions only to its own local filesystem, a local output path does not give a usable result on a cluster; writing to HDFS, as in 9.5 below, avoids this.)

 

9.3 Running a wordcount gives the same result:

bin/spark-shell --master spark://192.168.26.131:7077
scala> var rdd = sc.textFile("file:///home/hadoop/history.txt.used.byspark", 7)
rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _,5).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).saveAsTextFile("file:///home/hadoop/output")

 

9.4 Run the SparkPi example that ships with Spark:

 

./run-example SparkPi 1000 --master spark://192.168.26.131:7077
The result is printed correctly: Pi is roughly 3.14173708

 

9.5 Run wordcount reading from and writing to HDFS:

bin/spark-shell --master spark://192.168.26.131:7077
scala> var rdd = sc.textFile("/user/hadoop/history.txt.used.byspark", 7)
rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _,5).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).saveAsTextFile("/user/hadoop/output")

This produces the correct output in part-00000 through part-00004.
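The output can be inspected directly in HDFS, e.g.:

# list the result files and peek at the first partition
hadoop fs -ls /user/hadoop/output
hadoop fs -cat /user/hadoop/output/part-00000 | head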

 

 

As the Web UI shows, slave1 executed 3 tasks and slave2 executed 2 tasks, reading 661 B and 1248 B of shuffle data respectively.

 

 

 
