sillycat

Prediction(6)PyLib and Machine Learning

 

1. Introduction
An ensemble method creates a model composed of a set of other base models. Gradient-Boosted Trees (GBTs) and Random Forests both use decision trees as their base models.

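GBTs train one tree at a time, so they can take longer to train than Random Forests, which can train multiple trees in parallel. On the other hand, GBTs often work well with smaller (shallower) trees, and smaller trees take less time to train.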

Training more trees in a Random Forest reduces the likelihood of overfitting. Training more trees with GBTs increases the likelihood of overfitting.

In other words, Random Forests reduce variance by using more trees, while GBTs reduce bias by using more trees.
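
The difference between the two training styles can be illustrated with a small pure-Python sketch. This is not how Spark MLlib implements either algorithm; it is a toy version that uses one-split regression "stumps" as base models, and every function name in it (fit_stump, train_forest, train_gbt, and so on) is made up for illustration. The forest trains each stump independently on its own bootstrap sample and averages them, while the boosted model trains each stump on the residuals left by the previous ones.

```python
import random

def mean(values):
    return sum(values) / float(len(values))

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[1:]:  # every split that leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # only one distinct x value: fall back to a constant
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x < t else rm

def train_forest(xs, ys, num_trees, rng):
    """Bagging: each stump gets its own bootstrap sample, independent of the others."""
    n = len(xs)
    forest = []
    for _ in range(num_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    # The forest prediction is the average of the individual predictions.
    return mean([predict_stump(s, x) for s in forest])

def train_gbt(xs, ys, num_iterations, learning_rate=0.5):
    """Boosting for squared loss: each stump fits the residuals of the previous ones."""
    residuals = list(ys)
    stumps = []
    for _ in range(num_iterations):
        s = fit_stump(xs, residuals)
        stumps.append(s)
        residuals = [r - learning_rate * predict_stump(s, x)
                     for x, r in zip(xs, residuals)]
    return stumps

def predict_gbt(stumps, x, learning_rate=0.5):
    return sum(learning_rate * predict_stump(s, x) for s in stumps)

def mse(predict, xs, ys):
    return mean([(predict(x) - y) ** 2 for x, y in zip(xs, ys)])

# Toy data: y = x^2 on 20 points (an assumption made only for this sketch).
xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]

forest = train_forest(xs, ys, 10, random.Random(0))
gbt_1 = train_gbt(xs, ys, 1)
gbt_20 = train_gbt(xs, ys, 20)

print("forest training MSE:", mse(lambda x: predict_forest(forest, x), xs, ys))
print("GBT training MSE, 1 iteration:", mse(lambda x: predict_gbt(gbt_1, x), xs, ys))
print("GBT training MSE, 20 iterations:", mse(lambda x: predict_gbt(gbt_20, x), xs, ys))
```

The loop in train_forest has no dependency between iterations, which is why a forest can be trained in parallel. The loop in train_gbt needs the residuals from the previous iteration, so it has to run sequentially. Each boosting iteration lowers the training error further, which is the bias reduction mentioned above and also the reason too many iterations can overfit.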

2. Try with Random Forests

Error Message in Zeppelin:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark.py", line 162, in <module>
    eval(compiledCode)
  File "<string>", line 1, in <module>
  File "/opt/spark/python/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named numpy

Solution:
http://stackoverflow.com/questions/7818811/import-error-no-module-named-numpy

Download the latest file from http://sourceforge.net/projects/numpy/files/NumPy/

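> wget http://tcpdiag.dl.sourceforge.net/project/numpy/NumPy/1.10.0/numpy-1.10.0.tar.gz

Unpack the archive and install from the extracted directory
> tar zxvf numpy-1.10.0.tar.gz
> cd numpy-1.10.0
> sudo python setup.py install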

Verify the installation
>python
python>>>import numpy
python>>>exit()
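
The same check can be scripted, which helps when several machines need to be verified. The following is a minimal sketch; the helper name missing_modules is made up for illustration, and the script should be run with the same interpreter that Zeppelin and the Spark workers use.

```python
import importlib

def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# numpy is required by pyspark.mllib; pyspark itself must be on PYTHONPATH.
print("missing modules:", missing_modules(["numpy", "pyspark"]))
```

An empty list means both modules are importable by that interpreter.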

Error Message in Zeppelin Logs
ERROR [2015-10-06 14:14:40,447] ({qtp1852584274-48} NotebookServer.java[runParagraph]:630) - Exception from run
org.apache.zeppelin.interpreter.InterpreterException: pyspark interpreter not found
        at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:148)
        at org.apache.zeppelin.notebook.Note.run(Note.java:282)
        at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:628)
        at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)

Solution:
http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/

>mvn clean package -Pspark-1.5 -Dpyspark -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -P build-distr

Try this code in Zeppelin.
%pyspark
sc.parallelize([1,2,3]).count()

Exception:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /home/carl/tool/hadoop-2.7.1/temp/nm-local-dir/usercache/carl/filecache/20/spark-assembly-1.5.0-hadoop2.6.0.jar
java.io.EOFException

Solution:
http://stackoverflow.com/questions/30824818/what-to-set-spark-home-to

Add this to the Zeppelin configuration file (conf/zeppelin-env.sh).
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

That should be all that is needed. However, the VMs are slow, so I did not get it working perfectly. I may try this again with a later version.
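
The PYTHONPATH export above hardcodes the py4j version (0.8.2.1), which differs between Spark releases. As a hedged sketch, the value can be built by looking for whatever py4j source zip sits under $SPARK_HOME/python/lib. The function name build_pythonpath and the /opt/spark fallback are assumptions for illustration only.

```python
import glob
import os

def build_pythonpath(spark_home, existing=""):
    """Build a PYTHONPATH value with Spark's Python sources and its bundled py4j zip."""
    entries = [os.path.join(spark_home, "python")]
    # Pick up the bundled py4j zip regardless of its version number.
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    entries.extend(sorted(glob.glob(pattern)))
    if existing:
        entries.append(existing)
    return os.pathsep.join(entries)

# /opt/spark is only a fallback for this sketch; set SPARK_HOME for a real install.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
value = build_pythonpath(spark_home, os.environ.get("PYTHONPATH", ""))
print('export PYTHONPATH="%s"' % value)
```

The printed line can then be pasted into the Zeppelin configuration in place of the hand-written export.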

3. Set up Single Mode
Follow the notes here: http://sillycat.iteye.com/blog/2247102
Only these settings are needed for Zeppelin in local mode:
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"

Single mode works well for me, and it is much faster than running in the VMs.

4. Random Forest Sample on Zeppelin
%pyspark

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

%pyspark

data = MLUtils.loadLibSVMFile(sc, "/opt/spark/data/mllib/sample_libsvm_data.txt")
(trainingData, testData) = data.randomSplit([0.7, 0.3])

%pyspark

model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)

%pyspark

predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; tuple-unpacking lambdas only work in Python 2.
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) ** 2).sum() / float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())
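
For comparison with the introduction, the same evaluation can be repeated with GBTs. The sketch below is an assumption-laden variant rather than part of the original notebook: mean_squared_error and train_gbt_and_score are illustrative names, the MSE is computed in plain Python on collected (label, prediction) pairs, and GradientBoostedTrees.trainRegressor is imported lazily so that the helper can be tried without Spark installed.

```python
def mean_squared_error(pairs):
    """Mean of (label - prediction)^2 over an iterable of (label, prediction) pairs."""
    pairs = list(pairs)
    return sum((label - pred) ** 2 for label, pred in pairs) / float(len(pairs))

def train_gbt_and_score(training_data, test_data, num_iterations=3):
    """Train a GBT regressor on an RDD of LabeledPoint and return the test MSE."""
    # Imported here so mean_squared_error is usable without pyspark.
    from pyspark.mllib.tree import GradientBoostedTrees

    model = GradientBoostedTrees.trainRegressor(
        training_data, categoricalFeaturesInfo={}, numIterations=num_iterations)
    predictions = model.predict(test_data.map(lambda lp: lp.features))
    pairs = test_data.map(lambda lp: lp.label).zip(predictions)
    # collect() is fine for the small sample file; avoid it on large datasets.
    return mean_squared_error(pairs.collect())

# Plain-Python check of the formula: errors of 1 and 0 average to 0.5.
print(mean_squared_error([(1.0, 0.0), (0.0, 0.0)]))
```

In a Zeppelin %pyspark paragraph, calling train_gbt_and_score(trainingData, testData) should give a number that can be compared with the Random Forest test MSE above.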

References:
http://spark.apache.org/docs/latest/mllib-ensembles.html

Setup Zeppelin Again with Python
http://sillycat.iteye.com/blog/2247102