Similar to the previous article, this one focuses on cluster mode.
1. Issue the command
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode cluster --master spark://gzsw-02:6066 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt
Note: 1) the --deploy-mode parameter must be explicitly set to 'cluster'.
2) the --master parameter is the REST URL, i.e.
REST URL: spark://gzsw-02:6066 (cluster mode)
which is shown on the Spark master UI page, since Spark uses rest.RestSubmissionClient to submit jobs.
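For contrast, the same job submitted in client mode (the default, as in the previous article) would go through the legacy standalone master URL rather than the REST URL; a sketch, assuming the default standalone port 7077:
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master spark://gzsw-02:7077 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt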
2. Run logs on the user side (brief, since this is cluster mode)
Spark Command: /usr/local/jdk/jdk1.6.0_31/bin/java -cp /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/conf/:/home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/spark-assembly-1.4.1-hadoop2.4.0.jar:/home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/local/hadoop/hadoop-2.5.2/etc/hadoop/ -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master spark://gzsw-02:6066 --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://hd02:/user/hadoop/input.txt
========================================
- executed cmd returned by Main.java: /usr/local/jdk/jdk1.6.0_31/bin/java -cp /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/conf/:/home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/spark-assembly-1.4.1-hadoop2.4.0.jar:/home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/local/hadoop/hadoop-2.5.2/etc/hadoop/ -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master spark://gzsw-02:6066 --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt
Running Spark using the REST application submission protocol.
16/09/19 11:26:06 INFO rest.RestSubmissionClient: Submitting a request to launch an application in spark://gzsw-02:6066.
16/09/19 11:26:07 INFO rest.RestSubmissionClient: Submission successfully created as driver-20160919112607-0001. Polling submission state...
16/09/19 11:26:07 INFO rest.RestSubmissionClient: Submitting a request for the status of submission driver-20160919112607-0001 in spark://gzsw-02:6066.
16/09/19 11:26:07 INFO rest.RestSubmissionClient: State of driver driver-20160919112607-0001 is now RUNNING.
16/09/19 11:26:07 INFO rest.RestSubmissionClient: Driver is running on worker worker-20160914175456-192.168.100.14-36693 at 192.168.100.14:36693.
16/09/19 11:26:07 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20160919112607-0001",
  "serverSparkVersion" : "1.4.1",
  "submissionId" : "driver-20160919112607-0001",
  "success" : true
}
16/09/19 11:26:07 INFO util.Utils: Shutdown hook called
So we know the driver is running on worker 192.168.100.14:36693 (not on the local host).
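Because the driver runs on a remote worker, you can also check on it (or stop it) from the submitting machine through the same REST endpoint; a small sketch using the submission ID from the log above (the --status/--kill flags apply to standalone cluster mode):
./bin/spark-submit --master spark://gzsw-02:6066 --status driver-20160919112607-0001
./bin/spark-submit --master spark://gzsw-02:6066 --kill driver-20160919112607-0001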
3. FAQ
1) In cluster mode, the driver info is shown on the Spark master UI page (but not in client mode).
(app-0000/0001 were both run in cluster mode, so the corresponding drivers are shown in the 'Completed Drivers' block.)
2) The application detail UI cannot be opened, i.e. when you click an app that was run in cluster mode, an error similar to the following is reported:
Application history not found (app-20160919151936-0000) No event logs found for application JavaWordCount in file:/home/hadoop/spark/spark-eventlog/. Did you specify the correct logging directory?
This message appears because, in cluster mode, the driver runs on another worker rather than on the master's local host, so the event logs are written to that worker's local filesystem and a request to the master finds nothing about this app.
Workaround: use HDFS instead of the local filesystem for the event log directory, i.e.
spark.eventLog.dir=hdfs://host02:8020/user/hadoop/spark-eventlog
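A minimal conf/spark-defaults.conf sketch for this workaround (the namenode host/port are just this article's example values; spark.eventLog.enabled must also be true for event logs to be written at all):
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs://host02:8020/user/hadoop/spark-eventlog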
3) Applications disappear after restarting Spark
Even though you set a distributed filesystem in 'spark.eventLog.dir' as mentioned above, you will still see nothing after restarting Spark. This means the Spark master only keeps application info in memory while it is alive and loses it on restart. The history server (sbin/start-history-server.sh) addresses this problem [1].
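A sketch of pointing the history server at the same HDFS event-log directory (standard properties; the path reuses the example above):
# conf/spark-defaults.conf
spark.history.fs.logDirectory   hdfs://host02:8020/user/hadoop/spark-eventlog
# start it on the master (or any node that can read that directory); the UI defaults to port 18080
sbin/start-history-server.sh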
ref: