Setting up Spark 1.6.0 (on a distributed Hadoop 2.6.0 cluster)

This post builds a Spark 1.6.0 distributed cluster on top of an existing Hadoop 2.6.0 distributed environment.
For the Hadoop 2.6.0 cluster setup, see: http://kevin12.iteye.com/blog/2273532
1. Unpack the Spark tarball with tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz and move it under /usr/local/spark.
Configure the Spark environment variables in ~/.bashrc, save and exit, then run source ~/.bashrc to make them take effect:
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_60
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export SCALA_HOME=/usr/local/scala/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/usr/local/spark/spark-1.6.0-bin-hadoop2.6
export PATH=.:${JAVA_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:$PATH

Then run the following commands to copy the .bashrc on master1 to the four workers.
root@master1:~# scp ~/.bashrc root@worker1:~/
root@master1:~# scp ~/.bashrc root@worker2:~/
root@master1:~# scp ~/.bashrc root@worker3:~/
root@master1:~# scp ~/.bashrc root@worker4:~/

Run source ~/.bashrc on each of the four workers to make the configuration take effect.
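To confirm the environment variables have taken effect, a quick sanity check on each node (assuming the paths configured above):
root@worker1:~# echo $SPARK_HOME
root@worker1:~# java -version
root@worker1:~# scala -version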
2. Configure the Spark environment
2.1 Copy spark-env.sh.template under conf to spark-env.sh and edit the configuration.
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp spark-env.sh.template  spark-env.sh
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vim spark-env.sh
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_60
export SCALA_HOME=/usr/local/scala/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_MASTER_IP=master1
export SPARK_WORKER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=2g
export SPARK_DRIVER_MEMORY=2g
export SPARK_WORKER_CORES=4


Note: the HADOOP_CONF_DIR setting is what allows Spark to run in YARN mode, so it is critical.
Set SPARK_WORKER_MEMORY, SPARK_EXECUTOR_MEMORY, SPARK_DRIVER_MEMORY and SPARK_WORKER_CORES according to your own cluster.
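Because HADOOP_CONF_DIR is what points Spark at the YARN ResourceManager, one way to verify the YARN integration later (a hedged sketch reusing the examples jar shipped with this distribution) is to submit SparkPi in yarn-client mode:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client lib/spark-examples-1.6.0-hadoop2.6.0.jar 10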
Configure slaves:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp slaves.template slaves
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vim slaves
# A Spark Worker will be started on each of the machines listed below.
worker1
worker2
worker3
worker4
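
Note that sbin/start-all.sh reaches each host listed in slaves over SSH, so password-less SSH from master1 to every worker is required; it should already be in place from the Hadoop setup. If not, a minimal sketch:
root@master1:~# ssh-keygen -t rsa
root@master1:~# ssh-copy-id root@worker1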

Configure spark-defaults.conf:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp spark-defaults.conf.template spark-defaults.conf
# Add the following settings:
spark.executor.extraJavaOptions    -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.eventLog.enabled    true
spark.eventLog.dir    hdfs://master1:9000/historyserverforSpark
spark.yarn.historyServer.address    master1:18080
spark.history.fs.logDirectory    hdfs://master1:9000/historyserverforSpark

Note: with spark.eventLog.enabled turned on and spark.eventLog.dir configured, the cluster records the event log of every application it runs, which is convenient for operations.
Sync the Spark installation configured on master1 to the workers with scp.
root@master1:/usr/local# scp -r spark/ root@worker1:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker2:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker3:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker4:/usr/local/

Then check the /usr/local/ directory on each worker to confirm that Spark was copied over.
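A quick way to check from master1 itself (assuming the SSH access above):
root@master1:/usr/local# ssh worker1 ls /usr/local/spark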

Create a historyserverforSpark directory on HDFS:
root@master1:/usr/local# hdfs dfs -mkdir /historyserverforSpark
16/01/24 07:46:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local# hdfs dfs -ls /
16/01/24 07:46:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-01-24 07:46 /historyserverforSpark
The newly created directory can also be viewed in a browser.
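Assuming the default Hadoop 2.x NameNode web UI port, the directory can be browsed at http://master1:50070, or listed again from the shell:
root@master1:/usr/local# hdfs dfs -ls /historyserverforSpark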

3. Start Spark
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-master1.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker4: failed to launch org.apache.spark.deploy.worker.Worker:
worker4: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker1: failed to launch org.apache.spark.deploy.worker.Worker:
worker1: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker3: failed to launch org.apache.spark.deploy.worker.Worker:
worker3: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker2: failed to launch org.apache.spark.deploy.worker.Worker:
worker2: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out

As the output above shows, the Worker nodes failed to start, yet their logs contained no errors; the cause appears to be an issue with the virtual machines themselves, though exactly what is still unclear.
Stop the cluster with ./sbin/stop-all.sh, delete all logs under /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs on every node, and start again; this time the cluster comes up successfully.
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-master1.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out

Use the jps command to confirm that the Master and Worker processes have started.
On master1:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# jps
4551 ResourceManager
7255 Jps
7143 Master
4379 SecondaryNameNode
4175 NameNode

On worker1:
root@worker1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs# jps
4528 Worker
2563 DataNode
2713 NodeManager
4606 Jps


Open http://192.168.112.130:8080/ in a browser to view the master web console; it shows the 4 worker nodes.

At this point, the Spark cluster setup is complete!

Start the history-server process so that application runs are recorded and the run information can be recovered even after a restart.
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-history-server.sh 
starting org.apache.spark.deploy.history.HistoryServer, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.history.HistoryServer-1-master1.out

View the History Server at http://192.168.112.130:18080/
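A simple way to confirm the process is up:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# jps | grep HistoryServer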

Run an example: computing Pi.
Source location: /usr/local/spark/spark-1.6.0-bin-hadoop2.6/examples/src/main/scala/org/apache/spark/examples
The source code is as follows:
// scalastyle:off println
package org.apache.spark.examples

import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
// scalastyle:on println

Set the parallelism to 5000 so the job runs long enough to watch it in the browser while it executes:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://master1:7077 ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 5000
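For a quick interactive variant, essentially the same Monte Carlo estimate can be typed into spark-shell against the standalone master (a minimal sketch, not part of the original submission above):
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-shell --master spark://master1:7077
scala> val n = 100000
scala> val count = sc.parallelize(1 until n, 50).map { _ =>
     |   val x = math.random * 2 - 1
     |   val y = math.random * 2 - 1
     |   if (x * x + y * y < 1) 1 else 0   // 1 if the random point falls inside the unit circle
     | }.reduce(_ + _)
scala> println("Pi is roughly " + 4.0 * count / n)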

View the job in the browser:


Result: Pi is roughly 3.14156656
Why, judging from the logs, does the program start so quickly?
Answer: because Spark uses coarse-grained resource scheduling.
With coarse-grained scheduling, resources are allocated once at application startup; later computations simply reuse them instead of requesting resources for every task.
Coarse-grained scheduling suits workloads with many jobs that benefit from resource reuse. Its drawback: with high parallelism, if one job runs for a long time while the others finish quickly, resources sit idle and are wasted.
Fine-grained scheduling, by contrast, allocates resources only when a computation runs and releases them as soon as it finishes.
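For reference, fine-grained mode only exists when running on Mesos; standalone mode (used here) and YARN always allocate executors up front, i.e. coarse-grained. On Mesos the behaviour is switched in spark-defaults.conf (a hedged sketch, not part of this cluster's configuration):
spark.mesos.coarse    true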
Check the run in the History Server:


