Spark/Hadoop/Zeppelin Upgrade(2)

 

1 Install Hadoop 2.6.4
> wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4-src.tar.gz

The build from source fails on the annotations package:
> mvn package -Pdist,native -DskipTests -Dtar

It is probably caused by the versions of Java, CMake, or other native build packages, so I choose to download the Hadoop 2.6.4 binary directly instead.
> wget http://mirror.nexcess.net/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
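Unpack it under /opt. The rest of this post uses /opt/hadoop, so a sketch that keeps a versioned directory plus a symlink (the symlink is an assumption; any layout that ends up at /opt/hadoop works):
> tar zxvf hadoop-2.6.4.tar.gz
> sudo mv hadoop-2.6.4 /opt/hadoop-2.6.4
> sudo ln -s /opt/hadoop-2.6.4 /opt/hadoop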

Configure and set it up the same way as 2.7.2.
> cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ubuntu-master:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop/temp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
</configuration>

Edit hadoop-env.sh
export JAVA_HOME="/opt/jdk"

> cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ubuntu-master:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>

> cat slaves
ubuntu-dev1
ubuntu-dev2

> cat yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ubuntu-master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ubuntu-master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ubuntu-master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>ubuntu-master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>ubuntu-master:8088</value>
  </property>
</configuration>

> mkdir /opt/hadoop/temp

> mkdir -p /opt/hadoop/dfs/data

> mkdir -p /opt/hadoop/dfs/name

Do the same setup on ubuntu-dev1 and ubuntu-dev2.
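One way to push the same build and configuration to the slaves, assuming passwordless ssh and the same /opt layout (a sketch; the actual steps may differ):
> scp -r /opt/hadoop-2.6.4 ubuntu-dev1:/opt/
> scp -r /opt/hadoop-2.6.4 ubuntu-dev2:/opt/

On a fresh cluster, format the NameNode once on ubuntu-master before the first start:
> bin/hdfs namenode -format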

Hadoop setup is done.
1 HDFS
cd /opt/hadoop
sbin/start-dfs.sh

http://ubuntu-master:50070/dfshealth.html#tab-overview
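A quick sanity check from the master (standard Hadoop commands; the output depends on the cluster):
> jps
> bin/hdfs dfsadmin -report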

2 YARN
cd /opt/hadoop
sbin/start-yarn.sh

http://ubuntu-master:8088/cluster
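To confirm that both NodeManagers registered, for example:
> bin/yarn node -list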

2 Installation of Spark
Build the Spark with MAVEN
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive -DskipTests clean package

Build the Spark with SBT
> build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive assembly

Here is the command to build the binary distribution:
> ./make-distribution.sh --name spark-1.6.1 --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive

Build success. I get the binary file spark-1.6.1-bin-spark-1.6.1.tgz.
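The tarball can then be unpacked so that /opt/spark, the SPARK_HOME used for Zeppelin below, points to it; the versioned directory plus symlink is an assumption, not a requirement:
> tar zxvf spark-1.6.1-bin-spark-1.6.1.tgz
> sudo mv spark-1.6.1-bin-spark-1.6.1 /opt/spark-1.6.1
> sudo ln -s /opt/spark-1.6.1 /opt/spark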

Spark YARN Setting
On ubuntu-master
>cat conf/spark-env.sh
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

This command starts the Spark shell on YARN:
> MASTER=yarn-client bin/spark-shell

We can also use spark-submit to submit a job to the remote cluster:
http://sillycat.iteye.com/blog/2103457
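For example, a sketch of running the bundled SparkPi example on YARN from /opt/spark (the examples jar name varies with the build, hence the glob):
> bin/spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi lib/spark-examples-*.jar 10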

3 Zeppelin Installation
http://sillycat.iteye.com/blog/2286997

> git clone https://github.com/apache/incubator-zeppelin.git

> cd incubator-zeppelin

> git checkout tags/v0.5.6

> mvn clean package -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 -Phadoop-2.6 -Dhadoop.version=2.6.4

> mvn clean package -Pbuild-distr -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 -Phadoop-2.6 -Dhadoop.version=2.6.4

Build success. The binary is generated here: /home/carl/install/incubator-zeppelin/zeppelin-distribution/target

Unzip it and check the configuration.
> cat zeppelin-env.sh

# export HADOOP_CONF_DIR
# yarn-site.xml is located in configuration directory in HADOOP_CONF_DIR.
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"

# export SPARK_HOME
# (required) When it is defined, load it instead of Zeppelin embedded Spark libraries
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
# export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"

Start the Server
> bin/zeppelin-daemon.sh start
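To check whether the daemon came up, the same script provides a status subcommand:
> bin/zeppelin-daemon.sh status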

Then visit the console:
http://ubuntu-master:8080/#/

Error Message:
ERROR [2016-04-01 13:58:49,540] ({qtp1232306490-35} NotebookServer.java[onMessage]:162) - Can't handle message
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.cancel(RemoteInterpreter.java:248)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:99)
        at org.apache.zeppelin.notebook.Paragraph.jobAbort(Paragraph.java:229)
        at org.apache.zeppelin.scheduler.Job.abort(Job.java:232)
        at org.apache.zeppelin.socket.NotebookServer.cancelParagraph(NotebookServer.java:695)

More error messages are in the log file:
> less zeppelin-carl-ubuntu-master.out

16/04/01 14:10:40 WARN netty.NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(1,Container killed by YARN for exceeding memory limits. 2.1 GB of 2.1 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead.)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)

org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)

On the Hadoop slaves, the NodeManager log yarn-carl-nodemanager-ubuntu-dev2.log shows:

2016-04-01 15:28:54,525 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 2229 for container-id container_1459541332549_0002_02_000001: 124.6 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used

Solution:
http://www.wdong.org/wordpress/blog/2015/01/08/spark-on-yarn-where-have-all-my-memory-gone/

http://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits

Adding this configuration in yarn-site.xml fixed the problem:
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
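The change has to reach the NodeManagers on the slaves, so a sketch of rolling it out, assuming the same paths and hostnames as above:
> scp /opt/hadoop/etc/hadoop/yarn-site.xml ubuntu-dev1:/opt/hadoop/etc/hadoop/
> scp /opt/hadoop/etc/hadoop/yarn-site.xml ubuntu-dev2:/opt/hadoop/etc/hadoop/
> sbin/stop-yarn.sh
> sbin/start-yarn.sh

Alternatively, the warning above suggests raising spark.yarn.executor.memoryOverhead instead of disabling the virtual memory check.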

After we start a task in Zeppelin, we can visit the Spark context UI from this console:
http://ubuntu-master:4040/

References:
http://sillycat.iteye.com/blog/2286997