Prediction(4)Logistic Regression - Local Cluster Set Up

 

1. Try to Set Up Hadoop
Download the right version
> wget http://apache.spinellicreations.com/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
Place the archive in the right location and create a soft link.
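Those steps might look like the following sketch; the /opt layout and the versioned-directory-plus-symlink convention are assumptions, so adjust to your own setup:

```shell
# Extract the archive, keep a versioned install, and point a stable
# symlink at it so upgrades only need the link changed.
tar -xzf hadoop-2.7.1.tar.gz
sudo mv hadoop-2.7.1 /opt/hadoop-2.7.1
sudo ln -s /opt/hadoop-2.7.1 /opt/hadoop

# Put the binaries on the PATH (e.g. in ~/.bashrc).
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```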
> hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a

Set up the Cluster
> mkdir /opt/hadoop/temp

Configure core-site.xml
<configuration>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ubuntu-master:9000</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/opt/hadoop/temp</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>
</configuration>

> mkdir /opt/hadoop/dfs
> mkdir /opt/hadoop/dfs/name

> mkdir /opt/hadoop/dfs/data

Configure hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ubuntu-master:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>

> mv mapred-site.xml.template mapred-site.xml

Configure mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>ubuntu-master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>ubuntu-master:19888</value>
  </property>
</configuration>

Configure the yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ubuntu-master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ubuntu-master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ubuntu-master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>ubuntu-master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>ubuntu-master:8088</value>
  </property>
</configuration>

Configure slaves
ubuntu-dev1
ubuntu-dev2
ubuntu-dev3

Prepare the 3 slave machines if needed.
> mkdir ~/.ssh

> vi ~/.ssh/authorized_keys

Copy the master's public key there; the content comes from cat ~/.ssh/id_rsa.pub on the master.

scp all the configured files to every slave machine.
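The key copy and file distribution above can be sketched as a loop over the slaves file; the host names and paths are the ones used in this post, and ssh-copy-id is assumed to be available:

```shell
# Generate a key pair on the master if one does not exist yet, then push
# the public key and the configured Hadoop tree to every slave.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in ubuntu-dev1 ubuntu-dev2 ubuntu-dev3; do
  ssh-copy-id "$host"                   # appends the key to ~/.ssh/authorized_keys
  scp -r /opt/hadoop-2.7.1 "$host":/tmp/
  ssh "$host" "sudo mv /tmp/hadoop-2.7.1 /opt/ && sudo ln -s /opt/hadoop-2.7.1 /opt/hadoop"
done
```

Staging through /tmp avoids needing root for the scp itself; if your user can write /opt directly, copy there in one step.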

The usual commands start Hadoop HDFS and YARN:
cd /opt/hadoop
sbin/start-dfs.sh
sbin/start-yarn.sh

Visit the web UIs:
http://ubuntu-master:50070/dfshealth.html#tab-overview
http://ubuntu-master:8088/cluster
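Besides the web UIs, the cluster state can be checked from the command line; a few quick checks, assuming the /opt/hadoop path used above:

```shell
# Processes on the master should include NameNode and ResourceManager.
jps

# Report live DataNodes and HDFS capacity.
/opt/hadoop/bin/hdfs dfsadmin -report

# List the NodeManagers registered with YARN.
/opt/hadoop/bin/yarn node -list
```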

Error Message:
> sbin/start-dfs.sh
Starting namenodes on [ubuntu-master]
ubuntu-master: Error: JAVA_HOME is not set and could not be found.
ubuntu-dev1: Error: JAVA_HOME is not set and could not be found.
ubuntu-dev2: Error: JAVA_HOME is not set and could not be found.

Solution:
> vi hadoop-env.sh

export JAVA_HOME="/usr/lib/jvm/java-8-oracle"

Error Message:
2015-09-30 19:39:49,482 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /opt/hadoop/dfs/name/in_use.lock acquired by nodename 3017@ubuntu-master
2015-09-30 19:39:49,487 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.IOException: NameNode is not formatted.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:225)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)

Solution:
hdfs namenode -format

Cool, everything is up and running for the YARN cluster.

2. Try to Set Up Spark 1.5.0
Fetch the latest Spark
> wget http://apache.mirrors.ionfish.org/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz

Unzip it and place it in the right working directory.
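That might look like the following, mirroring the Hadoop layout; the target paths are assumptions:

```shell
# Unpack the prebuilt Spark, keep it versioned, and symlink /opt/spark.
tar -xzf spark-1.5.0-bin-hadoop2.6.tgz
sudo mv spark-1.5.0-bin-hadoop2.6 /opt/spark-1.5.0
sudo ln -s /opt/spark-1.5.0 /opt/spark
export SPARK_HOME=/opt/spark
```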

3. Try to Set Up Zeppelin
Fetch the source code first.
> git clone https://github.com/apache/incubator-zeppelin.git

> npm install -g grunt-cli

> grunt --version
grunt-cli v0.1.13

> mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests

Exception:
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.23:grunt (grunt build) on project zeppelin-web: Failed to run task: 'grunt --no-color' failed. (error code 3) -> [Help 1]

INFO [launcher]: Trying to start PhantomJS again (1/2).
ERROR [launcher]: Cannot start PhantomJS


INFO [launcher]: Trying to start PhantomJS again (2/2).
ERROR [launcher]: Cannot start PhantomJS


ERROR [launcher]: PhantomJS failed 2 times (cannot start). Giving up.
Warning: Task "karma:unit" failed. Use --force to continue.

Solution:
> cd /home/carl/install/incubator-zeppelin/zeppelin-web

> mvn clean install

This produces more detailed exceptions, which show that PhantomJS is not installed.
Install PhantomJS
http://sillycat.iteye.com/blog/1874971

Build own PhantomJS from source
http://phantomjs.org/build.html

Or find an older version from here
https://code.google.com/p/phantomjs/downloads/list

Download the right version
> wget https://phantomjs.googlecode.com/files/phantomjs-1.9.2-linux-x86_64.tar.bz2

> bzip2 -d phantomjs-1.9.2-linux-x86_64.tar.bz2

> tar -xvf phantomjs-1.9.2-linux-x86_64.tar

Move it to the proper directory, add it to the PATH, and verify the installation.
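A sketch of those steps; the directories are assumptions:

```shell
# Install the extracted PhantomJS under /opt and expose its bin directory.
sudo mv phantomjs-1.9.2-linux-x86_64 /opt/phantomjs-1.9.2
sudo ln -s /opt/phantomjs-1.9.2 /opt/phantomjs
export PATH=$PATH:/opt/phantomjs/bin
phantomjs --version
```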
Error Message:
phantomjs --version
phantomjs: error while loading shared libraries: libfontconfig.so.1: cannot open shared object file: No such file or directory

Solution:
> sudo apt-get install libfontconfig

It works.
> phantomjs --version
1.9.2

Build Success.

4. Configure Spark and Zeppelin
Set Up Zeppelin
> cp zeppelin-env.sh.template zeppelin-env.sh
> cp zeppelin-site.xml.template zeppelin-site.xml

> vi zeppelin-env.sh
export MASTER="yarn-client"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"

export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"

Set Up Spark
> cp spark-env.sh.template spark-env.sh
> vi spark-env.sh
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
export SPARK_WORKER_MEMORY=768m
export SPARK_JAVA_OPTS="-Dbuild.env=lmm.sparkvm"
export USER=carl

Rebuild Zeppelin and set it up.
> mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -P build-distr

The final gz file will be here:
/home/carl/install/incubator-zeppelin-0.6.0/zeppelin-distribution/target

> mv zeppelin-0.6.0-incubating-SNAPSHOT /home/carl/tool/zeppelin-0.6.0

> sudo ln -s /opt/zeppelin-0.6.0 /opt/zeppelin

Start the Server
> bin/zeppelin-daemon.sh start

Visit Zeppelin:
http://ubuntu-master:8080/#/

Exception:
Found both spark.driver.extraJavaOptions and SPARK_JAVA_OPTS. Use only the former.

Solution:
Zeppelin Configuration
export ZEPPELIN_JAVA_OPTS="-Dspark.akka.frameSize=100 -Dspark.jars=/home/hadoop/spark-seed-assembly-0.0.1.jar"

Spark Configuration
export SPARK_DAEMON_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70"
export SPARK_LOCAL_DIRS=/opt/spark

export SPARK_LOG_DIR=/var/log/apps
export SPARK_CLASSPATH="/opt/spark/conf:/home/hadoop/conf:/opt/spark/classpath/emr/*:/opt/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar"

References:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression

zeppelin
http://sillycat.iteye.com/blog/2216604
http://sillycat.iteye.com/blog/2223622

https://github.com/apache/incubator-zeppelin

hadoop
http://sillycat.iteye.com/blog/2242559
http://sillycat.iteye.com/blog/2193762
http://sillycat.iteye.com/blog/2103457
http://sillycat.iteye.com/blog/2084169
http://sillycat.iteye.com/blog/2090186