Apache Zeppelin(2)Zeppelin and Spark Yarn Cluster

 

Recently I have been trying to debug some issues on Zeppelin. When an error happens, we need to go to the log files to check for more information.

Check the log files under /opt/zeppelin/logs:
zeppelin-carl-carl-mac.local.log
zeppelin-interpreter-spark-carl-carl-mac.local.log

Error Message:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.deploy.SparkHadoopUtil$
        at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:104)
        at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:179)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:310)
        at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)
        at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:272)

Build Zeppelin again:
> mvn clean package -Pspark-1.4 -Dhadoop.version=2.6.0 -Phadoop-2.6 -Pyarn -DskipTests

Error Message:
ERROR [2015-06-30 17:04:43,588] ({Thread-43} JobProgressPoller.java[run]:57) - Can not get or update progress
org.apache.zeppelin.interpreter.InterpreterException: java.lang.IllegalStateException: Pool not open
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:286)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:110)
        at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:179)
        at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:54)
Caused by: java.lang.IllegalStateException: Pool not open
        at org.apache.commons.pool2.impl.BaseGenericObjectPool.assertOpen(BaseGenericObjectPool.java:662)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:412)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:139)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:284)

Error Message:
ERROR [2015-06-30 17:18:05,297] ({sparkDriver-akka.actor.default-dispatcher-4} Logging.scala[logError]:75) - Lost executor 13 on ubuntu-dev1: remote Rpc client disassociated
INFO [2015-06-30 17:18:05,297] ({sparkDriver-akka.actor.default-dispatcher-4} Logging.scala[logInfo]:59) - Re-queueing tasks for 13 from TaskSet 3.0
WARN [2015-06-30 17:18:05,298] ({sparkDriver-akka.actor.default-dispatcher-4} Logging.scala[logWarning]:71) - Lost task 0.3 in stage 3.0 (TID 14, ubuntu-dev1): ExecutorLostFailure (executor 13 lost)
ERROR [2015-06-30 17:18:05,298] ({sparkDriver-akka.actor.default-dispatcher-4} Logging.scala[logError]:75) - Task 0 in stage 3.0 failed 4 times; aborting job

Solutions:
After I fetched the latest Zeppelin from GitHub and built it myself again, everything worked.

Some of the configuration is as follows:
> less conf/zeppelin-env.sh
export MASTER="yarn-client"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"

export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
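
If the interpreter needs more memory for bigger jobs (see the MLlib example below), Spark properties can also be passed here; the values below are only an illustration, not from my actual setup:
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=2g -Dspark.cores.max=4"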

Start the YARN cluster, then start Zeppelin with this command:
> bin/zeppelin-daemon.sh start

Check the YARN cluster:
http://ubuntu-master:8088/cluster/apps

Visit the Zeppelin UI:
http://ubuntu-master:8080/

Check the Interpreter settings to make sure we are using yarn-client mode, and verify the other settings there.
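
To confirm the same thing from a notebook paragraph, here is a minimal sanity check (just a sketch; sc is the SparkContext that Zeppelin injects):

println(sc.master)   // should print "yarn-client" with the settings above
println(sc.version)  // the Spark version the interpreter runs on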

Paste this simple snippet into a notebook paragraph:
val threshold = "book1"
val products = Seq("book1", "book2", "book3", "book4")
// Distribute the four products across 2 partitions
val rdd = sc.makeRDD(products, 2)
// Count how many products match the threshold
val result = rdd.filter { p =>
  p.equals(threshold)
}.count()
println("!!!!!!!!!!!!!!================result=" + result)

Run that snippet. Zeppelin will start something like a spark-shell context on the YARN cluster, and that application will keep running; since only "book1" matches the threshold, the count should be 1. After that, we can visit the Spark application UI (served by the driver on port 4040) at this URL:
http://ubuntu-master:4040/

We can see all the Spark jobs and executors there.
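
If you prefer to check the executors from a notebook paragraph instead of the UI, a small sketch (getExecutorMemoryStatus maps each executor address to its maximum and remaining storage memory, in bytes):

sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remainingMem)) =>
  println("executor=" + executor + " max=" + maxMem + " remaining=" + remainingMem)
}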

If you plan to try a more complex example like the one below, you need to open the Interpreter settings and increase the memory.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "file:///opt/spark/data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits: Array[RDD[LabeledPoint]] = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training: RDD[LabeledPoint] = splits(0).cache()
val test: RDD[LabeledPoint] = splits(1)

// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels: RDD[(Double, Double)] = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

scoreAndLabels.take(10).foreach { case (score, label) =>
  println("Score = " + score + " Label = " + label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)


Reference:
http://sillycat.iteye.com/blog/2216604

