Prediction(6)PyLib and Machine Learning
1. Introduction
An ensemble method builds a model composed of a set of other base models. Gradient-Boosted Trees (GBTs) and Random Forests both use decision trees as their base models.
GBTs train one tree at a time, so they can take longer to train than Random Forests, which can train multiple trees in parallel. On the other hand, GBTs often work well with smaller (shallower) trees than Random Forests, and smaller trees take less time to train.
Training more trees in a Random Forest reduces the likelihood of overfitting, while training more trees with GBTs increases the likelihood of overfitting.
In other words, Random Forests reduce variance by using more trees, while GBTs reduce bias by using more trees.
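The variance/bias contrast above can be illustrated with a small pure-Python sketch. This is a toy simulation and not Spark code: `noisy_tree` is a hypothetical stand-in for one high-variance tree (and it assumes the trees are independent, which real forest trees are only approximately), and `fit_stump` is a hypothetical one-split regressor used as the weak learner for boosting.

```python
import random
import statistics


def noisy_tree(rng, truth=1.0, noise=1.0):
    """Stand-in for one deep tree: right on average, but noisy."""
    return truth + rng.gauss(0.0, noise)


def forest_prediction(rng, num_trees):
    """Random-Forest-style prediction: the average of many trees."""
    return sum(noisy_tree(rng) for _ in range(num_trees)) / num_trees


def prediction_variance(num_trees, trials=2000, seed=42):
    """Variance of the averaged prediction over repeated simulated forests."""
    rng = random.Random(seed)
    preds = [forest_prediction(rng, num_trees) for _ in range(trials)]
    return statistics.pvariance(preds)


def fit_stump(xs, residuals):
    """Least-squares regression stump (a single split) on 1-D inputs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, left_mean, right_mean)
    _, split, low, high = best
    return lambda x: low if x <= split else high


def boosted_training_mse(xs, ys, num_trees, learning_rate=0.5):
    """GBT-style training: each new stump is fitted to the current residuals."""
    preds = [0.0] * len(xs)
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(xs)


if __name__ == "__main__":
    xs = list(range(10))
    ys = [x * x for x in xs]
    print("forest variance, 1 tree  :", prediction_variance(1))
    print("forest variance, 50 trees:", prediction_variance(50))
    print("boosted MSE, 1 stump     :", boosted_training_mse(xs, ys, 1))
    print("boosted MSE, 20 stumps   :", boosted_training_mse(xs, ys, 20))
```

In this sketch, averaging more trees shrinks the spread of the prediction without moving its mean, while adding more boosted stumps keeps lowering the training error. The second effect is also why too many GBT iterations can start fitting noise.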
2. Try with Random Forests
Error Message in Zeppelin:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 162, in <module>
    eval(compiledCode)
  File "<string>", line 1, in <module>
  File "/opt/spark/python/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named numpy
Solution:
http://stackoverflow.com/questions/7818811/import-error-no-module-named-numpy
Download the latest file from http://sourceforge.net/projects/numpy/files/NumPy/, then unpack and install it.
> wget http://tcpdiag.dl.sourceforge.net/project/numpy/NumPy/1.10.0/numpy-1.10.0.tar.gz
> tar zxvf numpy-1.10.0.tar.gz
> cd numpy-1.10.0
> sudo python setup.py install
Verify the installation
> python
>>> import numpy
>>> exit()
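The same check can be scripted instead of typed into the interactive prompt. The following is a minimal sketch, assuming Python 3.4 or later for `importlib.util.find_spec` (the original setup used Python 2); `module_available` and `missing_modules` are hypothetical helper names. Run it with the same interpreter that Spark uses for its Python workers, otherwise the result says nothing about the Zeppelin error.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(name) is not None


def missing_modules(names):
    """Return the subset of `names` that this interpreter cannot import."""
    return [name for name in names if not module_available(name)]


if __name__ == "__main__":
    # numpy is the module that pyspark.mllib failed to import above.
    print("missing:", missing_modules(["numpy"]))
```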
Error Message in Zeppelin Logs
ERROR [2015-10-06 14:14:40,447] ({qtp1852584274-48} NotebookServer.java[runParagraph]:630) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: pyspark interpreter not found
at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:148)
at org.apache.zeppelin.notebook.Note.run(Note.java:282)
at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:628)
at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
Solution:
http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/
>mvn clean package -Pspark-1.5 -Dpyspark -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -P build-distr
Try this code in Zeppelin:
%pyspark
sc.parallelize([1,2,3]).count()
Exception:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /home/carl/tool/hadoop-2.7.1/temp/nm-local-dir/usercache/carl/filecache/20/spark-assembly-1.5.0-hadoop2.6.0.jar
java.io.EOFException
Solution:
http://stackoverflow.com/questions/30824818/what-to-set-spark-home-to
Add this to the Zeppelin configuration file:
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
That should fix it, but the VMs are slow, so I did not get this setup working completely. I may try again with a later version.
3. Set up Single Mode
Follow the notes here: http://sillycat.iteye.com/blog/2247102
Only this configuration is needed for Zeppelin in local mode:
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
Single mode works great for me, and it is much faster than running in the VMs.
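The PYTHONPATH line hard-codes the py4j version (0.8.2.1), which can differ between Spark builds. A small sketch like the one below can show which entries actually exist under a given SPARK_HOME. It is only an illustration: `pyspark_python_path` and `missing_entries` are hypothetical helpers, and the `python/lib/py4j-*-src.zip` layout is assumed from the export line above.

```python
import glob
import os


def pyspark_python_path(spark_home):
    """Build the PYTHONPATH entries PySpark needs under `spark_home`."""
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j source zip this Spark build bundles.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    return [python_dir] + sorted(glob.glob(pattern))


def missing_entries(entries):
    """Return the entries that do not exist on disk."""
    return [entry for entry in entries if not os.path.exists(entry)]


if __name__ == "__main__":
    # /opt/spark is the SPARK_HOME used in the configuration above.
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
    entries = pyspark_python_path(spark_home)
    print("PYTHONPATH=" + os.pathsep.join(entries))
    print("missing:", missing_entries(entries))
```

If the printed value contains only the `python` directory, no py4j zip matched, and the worker would fail with an import error similar to the one shown earlier.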
4. Random Forest Sample on Zeppelin
%pyspark
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
%pyspark
# Load the sample LIBSVM file and hold out about 30% of it for testing.
data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")
(trainingData, testData) = data.randomSplit([0.7, 0.3])
%pyspark
# An empty categoricalFeaturesInfo means all features are treated as continuous.
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)
%pyspark
# Predict on the held-out features and pair each label with its prediction.
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; lambda (v, p) only parses in Python 2.
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) ** 2).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())
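For reference, the test MSE computed in the last paragraph is just the average squared difference between each label and its prediction. Here is a plain-Python sketch of the same arithmetic on ordinary lists, without Spark; `mean_squared_error` is a hypothetical helper name and the sample values are made up for illustration.

```python
def mean_squared_error(labels, predictions):
    """Average of (label - prediction)^2 over paired values."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("need at least one (label, prediction) pair")
    squared_errors = [(v - p) ** 2 for v, p in zip(labels, predictions)]
    return sum(squared_errors) / float(len(squared_errors))


if __name__ == "__main__":
    # Made-up labels and predictions, only to exercise the function.
    labels = [1.0, 0.0, 1.0]
    predictions = [1.0, 1.0, 0.0]
    print("MSE =", mean_squared_error(labels, predictions))
```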
References:
http://spark.apache.org/docs/latest/mllib-ensembles.html
Setup Zeppelin Again with Python
http://sillycat.iteye.com/blog/2247102