sillycat

Data Solution 2019(10)Spark Cluster Solution with Zeppelin

 

Spark Standalone Cluster
https://spark.apache.org/docs/latest/spark-standalone.html
Mesos Cluster
https://spark.apache.org/docs/latest/running-on-mesos.html
Hadoop2 YARN
https://spark.apache.org/docs/latest/running-on-yarn.html
K8S
https://spark.apache.org/docs/latest/running-on-kubernetes.html

Zeppelin with Cluster
https://zeppelin.apache.org/docs/latest/interpreter/spark.html

I decided to set up a Spark standalone cluster with Zeppelin.
Start the Spark Master Machine
Prepare Spark
> wget http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
> mv spark-2.4.4-bin-hadoop2.7 ~/tool/spark-2.4.4
> sudo ln -s /home/carl/tool/spark-2.4.4 /opt/spark-2.4.4
> sudo ln -s /opt/spark-2.4.4 /opt/spark
> cd /opt/spark
> cp conf/spark-env.sh.template conf/spark-env.sh

The template contains many sample settings:
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

https://spark.apache.org/docs/latest/spark-standalone.html
Make some changes for my environment
> vi conf/spark-env.sh

SPARK_MASTER_HOST=rancher-home
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8088
SPARK_WORKER_PORT=7177
SPARK_WORKER_WEBUI_PORT=8188

Start the master service
> sbin/start-master.sh

Start the Slave on rancher-worker1
> wget http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
> tar zxvf spark-2.4.4-bin-hadoop2.7.tgz
> mv spark-2.4.4-bin-hadoop2.7 ~/tool/spark-2.4.4
> sudo ln -s /home/carl/tool/spark-2.4.4 /opt/spark-2.4.4
> sudo ln -s /opt/spark-2.4.4 /opt/spark

Prepare the configuration
> cd /opt/spark
> cp conf/spark-env.sh.template conf/spark-env.sh
> vi conf/spark-env.sh

SPARK_MASTER_HOST=rancher-home
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8088
SPARK_WORKER_PORT=7177
SPARK_WORKER_WEBUI_PORT=8188

Start the slave and connect it to the master
> sbin/start-slave.sh spark://rancher-home:7077

Stop the slave
> sbin/stop-slave.sh spark://rancher-home:7077

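Run the Spark Cluster in Docker
A container's main process has to stay in the foreground, so disable daemonizing in conf/spark-env.sh: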
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.
SPARK_NO_DAEMONIZE=true

Starting the services fails at first:
2019-10-28T00:41:42.502359700Z 19/10/28 00:41:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2019-10-28T00:41:43.110823900Z 19/10/28 00:41:43 WARN Utils: Service 'sparkMaster' could not bind on port 7077. Attempting port 7078.
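
To spot this kind of failure quickly, the container log can be filtered for the bind warning. This is only a sketch: bind_failed_ports is a helper name I made up, and it assumes the warning keeps the wording shown in the log above.

```shell
# Hypothetical helper: read Spark log lines on stdin and print every port
# that Spark reported it could not bind to.
bind_failed_ports() {
  sed -n 's/.*could not bind on port \([0-9][0-9]*\)\..*/\1/p'
}
```

Usage would be something like `docker logs <container> 2>&1 | bind_failed_ports`, where `<container>` is whatever name the container was started with.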

This looks like a hostname resolution problem, so check the hosts file inside the container:
https://cloud.tencent.com/developer/article/1175087
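
A quick way to check whether a hostname is listed in a hosts file is sketched below. hosts_has_entry is a hypothetical helper, not something from Spark or Docker, and it only looks at the file, not at DNS.

```shell
# Hypothetical helper: succeed if hostname $2 appears in hosts file $1.
# Comment lines and anything after an inline '#' are ignored.
hosts_has_entry() {
  awk -v name="$2" '
    /^[[:space:]]*#/ { next }
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break
        if ($i == name) found = 1
      }
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

Inside the container this could be called as `hosts_has_entry /etc/hosts rancher-home && echo ok`.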

The final configuration for the master looks like this:
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8088
SPARK_LOCAL_HOSTNAME=rancher-home
SPARK_IDENT_STRING=rancher-home
SPARK_PUBLIC_DNS=rancher-home
SPARK_NO_DAEMONIZE=true
SPARK_DAEMON_MEMORY=1g

The Dockerfile is as follows:
#Set up spark master in Docker

#Prepare the OS
FROM    centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>

RUN     yum -y update
RUN     yum install -y wget

#install jdk
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile

RUN            mkdir /tool/
WORKDIR        /tool/

#add the software spark
RUN  wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN  tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN  ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark

ADD  conf/spark-env.sh /tool/spark/conf/

#set up the app
EXPOSE  8088 7077
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh" ]
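
The Dockerfile adds a start.sh that is not shown in this post. A minimal sketch of what it could contain follows. Everything here is an assumption: the start_master function name, the SPARK_HOME default of /tool/spark (the symlink created above), and the reliance on SPARK_NO_DAEMONIZE=true in conf/spark-env.sh to keep the master in the foreground.

```shell
#!/bin/sh
# Hypothetical start.sh for the master image (not the original file).

# Assumed default: the /tool/spark symlink created in the Dockerfile.
SPARK_HOME="${SPARK_HOME:-/tool/spark}"
# /etc/profile is only read by login shells, so set JAVA_HOME here as well.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/jre-1.8.0-openjdk}"

start_master() {
  script="$SPARK_HOME/sbin/start-master.sh"
  if [ ! -x "$script" ]; then
    echo "start-master.sh not found under $SPARK_HOME" >&2
    return 1
  fi
  # exec replaces this shell, so the master itself receives the signals
  # sent by `docker stop`.
  exec "$script"
}
```

The last line of the real file would then just call `start_master`. A worker variant could follow the same pattern with `sbin/start-slave.sh spark://rancher-home:7077`.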

The important part of the Makefile is as follows:
run:
    docker run -d -p 7077:7077 -p 8088:8088 \
    --hostname rancher-home \
    --name $(NAME) $(IMAGE):$(TAG)

The slave machine configuration is as follows:
SPARK_WORKER_PORT=7177
SPARK_WORKER_WEBUI_PORT=8188
SPARK_PUBLIC_DNS=rancher-worker1
SPARK_LOCAL_HOSTNAME=rancher-worker1
SPARK_IDENT_STRING=rancher-worker1
SPARK_NO_DAEMONIZE=true

The Dockerfile is as follows:
#Set up spark slave in Docker

#Prepare the OS
FROM    centos:7
MAINTAINER Yiyi Kang <yiyikangrachel@gmail.com>

RUN     yum -y update
RUN     yum install -y wget

#install jdk
RUN yum -y install java-1.8.0-openjdk.x86_64
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' | tee -a /etc/profile

RUN            mkdir /tool/
WORKDIR        /tool/

#add the software spark
RUN  wget --no-verbose http://apache.mirrors.ionfish.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN  tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz
RUN  ln -s /tool/spark-2.4.4-bin-hadoop2.7 /tool/spark

ADD  conf/spark-env.sh /tool/spark/conf/

#set up the app
EXPOSE  8188 7177
RUN     mkdir -p /app/
ADD     start.sh /app/
WORKDIR /app/
CMD    [ "./start.sh" ]

In the Makefile, add a host entry that points to the master machine:
run:
    docker run -d -p 7177:7177 -p 8188:8188 \
    --name $(NAME) \
    --hostname rancher-worker1 \
    --add-host=rancher-home:192.168.56.110 $(IMAGE):$(TAG)

The next step is to move much of this configuration into parameters.
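
One possible shape for that step, as a sketch only: generate spark-env.sh at container start from environment variables, falling back to the master values used in this post. render_spark_env is a hypothetical helper name, and I have not wired it into the images above.

```shell
# Hypothetical helper: write spark-env.sh to the file given as $1.
# Each setting comes from the environment, with this post's master
# values as defaults.
render_spark_env() {
  cat > "$1" <<EOF
SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
SPARK_MASTER_WEBUI_PORT=${SPARK_MASTER_WEBUI_PORT:-8088}
SPARK_LOCAL_HOSTNAME=${SPARK_LOCAL_HOSTNAME:-rancher-home}
SPARK_IDENT_STRING=${SPARK_IDENT_STRING:-rancher-home}
SPARK_PUBLIC_DNS=${SPARK_PUBLIC_DNS:-rancher-home}
SPARK_NO_DAEMONIZE=true
SPARK_DAEMON_MEMORY=${SPARK_DAEMON_MEMORY:-1g}
EOF
}
```

A start script could call `render_spark_env /tool/spark/conf/spark-env.sh` before starting the service, and the values could then be overridden per container with `docker run -e SPARK_MASTER_WEBUI_PORT=9090 ...`.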

References:
https://spark.apache.org/docs/latest/cluster-overview.html
https://stackoverflow.com/questions/28664834/which-cluster-type-should-i-choose-for-spark
https://stackoverflow.com/questions/39671117/docker-container-with-apache-spark-in-standalone-cluster-mode
https://github.com/shuaicj/docker-spark-master
https://stackoverflow.com/questions/32719007/spark-spark-public-dns-and-spark-local-ip-on-stand-alone-cluster-with-docker-con
