Apache Zeppelin(2)Zeppelin and Spark Yarn Cluster
Recently I have been trying to debug some issues on Zeppelin. When an error happens, we need to go to the log files for more information.
Check the log files under the Zeppelin installation directory, /opt/zeppelin/logs:
zeppelin-carl-carl-mac.local.log
zeppelin-interpreter-spark-carl-carl-mac.local.log
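To follow both log files while reproducing the problem (paths assumed to match the installation above):
> tail -f /opt/zeppelin/logs/zeppelin-*.log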
Error Message:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.deploy.SparkHadoopUtil$
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:104)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:179)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:310)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:272)
Build Zeppelin again:
> mvn clean package -Pspark-1.4 -Dhadoop.version=2.6.0 -Phadoop-2.6 -Pyarn -DskipTests
Error Message:
ERROR [2015-06-30 17:04:43,588] ({Thread-43} JobProgressPoller.java[run]:57) - Can not get or update progress
org.apache.zeppelin.interpreter.InterpreterException: java.lang.IllegalStateException: Pool not open
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:286)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:110)
at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:179)
at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:54)
Caused by: java.lang.IllegalStateException: Pool not open
at org.apache.commons.pool2.impl.BaseGenericObjectPool.assertOpen(BaseGenericObjectPool.java:662)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:412)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:139)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:284)
Error Message:
ERROR [2015-06-30 17:18:05,297] ({sparkDriver-akka.actor.default-dispatcher-4} Logging.scala[logError]:75) - Lost executor 13 on ubuntu-dev1: remote Rpc client disassociated
INFO [2015-06-30 17:18:05,297] ({sparkDriver-akka.actor.default-dispatcher-4} Logging.scala[logInfo]:59) - Re-queueing tasks for 13 from TaskSet 3.0
WARN [2015-06-30 17:18:05,298] ({sparkDriver-akka.actor.default-dispatcher-4} Logging.scala[logWarning]:71) - Lost task 0.3 in stage 3.0 (TID 14, ubuntu-dev1): ExecutorLostFailure (executor 13 lost)
ERROR [2015-06-30 17:18:05,298] ({sparkDriver-akka.actor.default-dispatcher-4} Logging.scala[logError]:75) - Task 0 in stage 3.0 failed 4 times; aborting job
Solutions:
After I fetched the latest Zeppelin source from GitHub and built it myself again, everything worked. Some of the configuration is as follows:
> less conf/zeppelin-env.sh
export MASTER="yarn-client"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
Start the YARN cluster, then start Zeppelin with this command:
> bin/zeppelin-daemon.sh start
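The same daemon script also supports stop/restart/status sub-commands, for example:
> bin/zeppelin-daemon.sh status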
Check the YARN cluster:
http://ubuntu-master:8088/cluster/apps
Visit the Zeppelin UI:
http://ubuntu-master:8080/
Check the interpreter settings to make sure we are using yarn-client mode and that the other settings look right.
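A quick way to double-check from inside the notebook is to run a small paragraph against the SparkContext that Zeppelin provides as sc; in yarn-client mode the master should be reported as yarn-client:
println(sc.version)
println(sc.master) // expected to print yarn-client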
Place this simple example in a notebook paragraph:
// a tiny RDD example: count how many products match the threshold
val threshold = "book1"
val products = Seq("book1", "book2", "book3", "book4")
val rdd = sc.makeRDD(products, 2)
val result = rdd.filter { p =>
  p.equals(threshold)
}.count()
println("!!!!!!!!!!!!!!================result=" + result) // prints result=1, since only "book1" matches
Run that simple example. Zeppelin will start a long-running application on the YARN cluster (similar to a spark-shell context), and after that we can visit the Spark UI at this URL:
http://ubuntu-master:4040/
We can see all the Spark jobs and executors there.
If you plan to try a more complex example like the one below, you need to open the interpreter settings and increase the memory (for example, via the spark.driver.memory and spark.executor.memory properties).
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Load the libSVM-format sample data that ships with Spark.
val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "file:///opt/spark/data/mllib/sample_libsvm_data.txt")
// Split data into training (60%) and test (40%).
val splits: Array[RDD[LabeledPoint]] = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training: RDD[LabeledPoint] = splits(0).cache()
val test: RDD[LabeledPoint] = splits(1)
// Run the training algorithm to build the model.
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
// Clear the default threshold so predict() returns raw scores instead of 0/1 labels.
model.clearThreshold()
// Compute raw scores on the test set.
val scoreAndLabels: RDD[(Double, Double)] = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
scoreAndLabels.take(10).foreach { case (score, label) =>
  println("Score = " + score + " Label = " + label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
Reference:
http://sillycat.iteye.com/blog/2216604