Prediction(6)PyLib and Machine Learning

sillycat


1. Introduction
An ensemble method creates a model composed of a set of other base models. Gradient-Boosted Trees (GBTs) and Random Forests both use decision trees as their base models.

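GBTs train one tree at a time, so they can take longer to train than Random Forests, which can train multiple trees in parallel. On the other hand, GBTs often work well with smaller (shallower) trees than Random Forests, and smaller trees take less time to train.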
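To make the "one tree at a time" point concrete, here is a toy sketch of gradient boosting for squared loss on 1-D data, using decision stumps as the base model. This is not the Spark implementation; the names fit_stump, predict, boost and train_mse, the learning rate and the data are all made up for illustration.

```python
# Toy gradient boosting for squared loss on 1-D data with decision stumps.
# All names and data here are illustrative assumptions, not Spark API.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # Skip the largest x as a threshold, it would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict(stumps, x, learning_rate):
    """Sum the shrunken outputs of all stumps trained so far."""
    return sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)

def boost(xs, ys, num_trees, learning_rate=0.5):
    stumps = []
    for _ in range(num_trees):
        # Each new stump is fitted to the residuals of all previous stumps,
        # so this loop cannot be parallelised across trees.
        residuals = [y - predict(stumps, x, learning_rate) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return stumps

def train_mse(xs, ys, stumps, learning_rate=0.5):
    return sum((y - predict(stumps, x, learning_rate)) ** 2
               for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10.0 for i in range(20)]
ys = [x * x for x in xs]
mse_1 = train_mse(xs, ys, boost(xs, ys, 1))
mse_20 = train_mse(xs, ys, boost(xs, ys, 20))
print(mse_1, mse_20)
```

The training error keeps dropping as stumps are added, which is the bias reduction GBTs are known for. It is also why adding too many trees can start to fit noise.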

Training more trees in a Random Forest reduces the likelihood of overfitting, while training more trees with GBTs increases it.

Random Forests reduce variance by using more trees, while GBTs reduce bias by using more trees.
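The variance claim can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "tree" is replaced by an independent, unbiased but noisy predictor (the hypothetical noisy_model below). Real trees in a forest are correlated, so the reduction in practice is smaller.

```python
import random
import statistics

def noisy_model(rng, truth=10.0, noise=2.0):
    """Stand-in for one tree: an unbiased but noisy prediction of `truth`."""
    return truth + rng.gauss(0.0, noise)

def ensemble_prediction(rng, num_models):
    """Average the predictions of `num_models` independent noisy models."""
    return sum(noisy_model(rng) for _ in range(num_models)) / num_models

def prediction_variance(num_models, trials=2000, seed=42):
    """Estimate how much the averaged prediction varies across repeated runs."""
    rng = random.Random(seed)
    preds = [ensemble_prediction(rng, num_models) for _ in range(trials)]
    return statistics.pvariance(preds)

var_1 = prediction_variance(1)
var_50 = prediction_variance(50)
print(var_1, var_50)
```

Averaging 50 models gives a much more stable prediction than a single model, while the expected value stays the same. Averaging does nothing for bias, which is the part boosting addresses.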

2. Try with Random Forests

Error Message in Zeppelin:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 162, in <module>
    eval(compiledCode)
  File "<string>", line 1, in <module>
  File "/opt/spark/python/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named numpy

Solution:
http://stackoverflow.com/questions/7818811/import-error-no-module-named-numpy

NumPy is missing from the Python installation that Spark uses, so install it from source. Download the latest file from http://sourceforge.net/projects/numpy/files/NumPy/

> wget http://tcpdiag.dl.sourceforge.net/project/numpy/NumPy/1.10.0/numpy-1.10.0.tar.gz
> tar zxvf numpy-1.10.0.tar.gz
> cd numpy-1.10.0
> sudo python setup.py install

Verify the installation:
> python
>>> import numpy
>>> exit()
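The same check can be scripted, which helps when it is not obvious which interpreter is being used (the Spark workers may run a different python than your shell does). A minimal sketch, where module_available is a hypothetical helper name:

```python
import sys

def module_available(name):
    """Return True if `name` can be imported by this interpreter."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False

# Show which interpreter ran the check, then report on each required module.
print("interpreter: " + sys.executable)
for mod in ("numpy",):
    status = "found" if module_available(mod) else "MISSING"
    print(mod + ": " + status)
```

Run it with the same python binary that Spark is configured to use. A "found" from a different interpreter does not help.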

Error Message in Zeppelin Logs:
ERROR [2015-10-06 14:14:40,447] ({qtp1852584274-48} NotebookServer.java[runParagraph]:630) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: pyspark interpreter not found
        at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:148)
        at org.apache.zeppelin.notebook.Note.run(Note.java:282)
        at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:628)
        at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)

Solution:
http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/

Rebuild Zeppelin with the pyspark profile enabled:
> mvn clean package -Pspark-1.5 -Ppyspark -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -P build-distr

Try this code in Zeppelin:
%pyspark
sc.parallelize([1,2,3]).count()

Exception:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /home/carl/tool/hadoop-2.7.1/temp/nm-local-dir/usercache/carl/filecache/20/spark-assembly-1.5.0-hadoop2.6.0.jar
java.io.EOFException

Solution:
http://stackoverflow.com/questions/30824818/what-to-set-spark-home-to

Add this to the Zeppelin configuration file (conf/zeppelin-env.sh):
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

That should fix it, but the VMs are slow, so I did not get it working completely in this setup. I may try again with a later version.

3. Set Up Single Mode
Following the notes here: http://sillycat.iteye.com/blog/2247102
Only this configuration is needed for Zeppelin in local mode:
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
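The py4j version in the file name (0.8.2.1 here) changes between Spark releases, so a hard-coded PYTHONPATH can silently break after an upgrade. The sketch below locates the zip with a wildcard instead. The helper name spark_python_path is hypothetical, and /opt/spark is only a fallback for when SPARK_HOME is not set.

```python
import glob
import os

def spark_python_path(spark_home):
    """Return the PYTHONPATH entries PySpark needs under `spark_home`."""
    python_dir = os.path.join(spark_home, "python")
    # Match the py4j zip with a wildcard instead of hard-coding its version.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    return os.pathsep.join([python_dir] + py4j_zips)

# Print a value that can be pasted into the export line.
print(spark_python_path(os.environ.get("SPARK_HOME", "/opt/spark")))
```

If the printed value contains no py4j entry, SPARK_HOME probably points at the wrong directory.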

Single mode works great for me, and it is also much faster than running on the VMs.

4. Random Forest Sample on Zeppelin
%pyspark

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

%pyspark

data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")
(trainingData, testData) = data.randomSplit([0.7, 0.3])

%pyspark

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)

%pyspark

# Predict on the held-out features, then pair each true label with its prediction.
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) tuple; tuple-unpacking lambdas only work on Python 2.
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) ** 2).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())
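For reference, the testMSE line computes the mean of the squared differences between label and prediction. The same calculation on a plain Python list looks like this; the sample pairs are made up for illustration and are not output from the model above.

```python
def mean_squared_error(labels_and_predictions):
    """Average of (label - prediction)^2 over all (label, prediction) pairs."""
    pairs = list(labels_and_predictions)
    return sum((v - p) ** 2 for v, p in pairs) / float(len(pairs))

# Hypothetical (label, prediction) pairs, not real model output.
sample = [(1.0, 0.8), (0.0, 0.3), (1.0, 1.0)]
print('Test Mean Squared Error = ' + str(mean_squared_error(sample)))
```

A lower value means the predictions are closer to the labels. The value is 0 only when every prediction matches its label exactly.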

References:
http://spark.apache.org/docs/latest/mllib-ensembles.html

Setup Zeppelin Again with Python
http://sillycat.iteye.com/blog/2247102


2015-10-08 03:28