Spark(1)Introduction and Installation

sillycat

1. Introduction
1.1 MapReduce Model
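Map -- read the input and convert it into key-value pairs
Reduce -- aggregate the values that share a key

A job is built from 4 classes:
one reads the input and converts it to key-value pairs, one maps, one reduces, and one converts the resulting key-value pairs to output data.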
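
The four steps can be sketched as a word count. This is plain Python on one machine, not Hadoop or Spark code, and the function names are made up for illustration; it only shows how data flows from input records to key-value pairs and back to output lines.

```python
from collections import defaultdict


def read_records(text):
    """Input step: turn raw text into (line_number, line) key-value pairs."""
    return list(enumerate(text.splitlines()))


def map_phase(records):
    """Map step: emit one (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for _, line in records for word in line.split()]


def reduce_phase(pairs):
    """Reduce step: group the pairs by key and sum the values of each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def write_output(totals):
    """Output step: convert the key-value pairs into tab-separated lines."""
    return [f"{word}\t{count}" for word, count in sorted(totals.items())]


def word_count(text):
    """Run the four steps in order."""
    return write_output(reduce_phase(map_phase(read_records(text))))
```

In a real cluster the map and reduce steps run on many machines and a shuffle moves each key to one reducer; the sketch keeps everything in one process.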

1.2 Apache Mesos
Mesos and YARN are both resource managers: they let several frameworks share the resources of one cluster.

Frameworks: Hadoop scheduler, MPI scheduler, Spark
Masters: Mesos master, standby masters, … (coordinated by ZooKeeper)
Slaves: Mesos slave, Mesos slave, Mesos slave … (each runs Hadoop executor tasks, MPI executor tasks, …)

Mesos master: manages the frameworks and slaves, and offers the slaves' resources to the frameworks
Mesos slave: runs Mesos tasks
Framework: Hadoop, Spark …
Executor: the process a framework launches on a slave to run its tasks

1.3 Spark Introduction
Spark is implemented in Scala and was originally built to run on Mesos.
It works with Hadoop and EC2, and can read data directly from HDFS or S3.

Bagel    Shark
Spark(RDD, Map Reduce, FP)
Mesos
HDFS AWS s3n

Spark combines the MapReduce model, functional programming, Mesos, HDFS and S3.

Spark Terms
RDD - Resilient Distributed Dataset
Local mode and Mesos mode
Transformations and Actions -
          A transformation returns a new RDD.
          An action returns a Scala collection, a single value, or nothing.
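
The difference between the two can be shown with a toy class. This is a hypothetical stand-in written in plain Python, not the Spark API: `ToyRDD` and its methods only imitate the idea that transformations build a new dataset description without doing any work, while actions run the computation and return a result.

```python
import functools


class ToyRDD:
    """Hypothetical, single-machine imitation of an RDD (not Spark code)."""

    def __init__(self, source):
        # source is a zero-argument callable that returns a fresh iterator
        self._source = source

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: return a new ToyRDD, nothing is computed yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._source()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole chain and return a plain value.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def reduce(self, f):
        return functools.reduce(f, self._source())
```

Calling `ToyRDD.from_list([1, 2, 3, 4]).map(lambda x: x * x)` does no work by itself; the squares are only computed once `collect()`, `count()` or `reduce(...)` is called, and each action recomputes the chain from the start.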

Spark on Mesos
RDD + Job(tasks) ----> SparkScheduler -----> Mesos Master ---> Mesos Slave, Mesos Slave … ( Spark executor… tasks)

1.4 HDFS Introduction
Hadoop Distributed File System ---- NameNode (only one) ------> DataNodes

Block        Files are stored in blocks; the default block size is 64 MB.
NameNode     Holds the file names, directory tree, namespace image and edit log, and knows how many blocks each file has and where they are on the DataNodes.
DataNode     Stores the blocks; a client or the NameNode reads and writes data on the DataNodes.
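
As a small worked example of the block size, the sketch below splits a file length into 64 MB blocks. It is plain Python arithmetic, not HDFS client code, and `split_into_blocks` is a name invented here.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of every block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

A 200 MB file, for instance, becomes three full 64 MB blocks plus one 8 MB block; the NameNode records which DataNodes hold each of the four.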

1.5 ZooKeeper
Configuration Management
Cluster Management

1.6 NFS Introduction
NFS - Network FileSystem

2. Installation of Spark
Since version 0.6, Spark can run without Mesos, so we can skip Mesos at first.
Get the source code
>git clone https://github.com/mesos/spark.git

My Scala version is 2.10.0. I just tried the build command
>sudo sbt/sbt package

The build works.

I also tried to build with Maven, but that did not seem to work. Since I already have SCALA_HOME set, I run the examples directly.
Syntax: ./run <class> <params>
>./run spark.examples.SparkLR local[2]

Or
>./run spark.examples.SparkPi local

I ran the examples to verify my environment, but they failed, apparently because of SCALA_HOME.
Error Message:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
     at spark.examples.SparkPi.main(SparkPi.scala)
Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
     at java.security.AccessController.doPrivileged(Native Method)
First attempt (generate Eclipse project files so the code can be read in the IDE):
>cd examples
>sudo mvn eclipse:eclipse
>cd ..
>sudo mvn eclipse:eclipse

Then I imported the examples and the spark project into Eclipse to read the source code.

I read the code in spark-examples/src/main/scala/spark/examples/SparkPi.scala
and ran the example again
>sudo ./run spark.examples.SparkPi local

It still does not work. The message says SCALA_HOME is not set, although I am sure it is set in my environment.

>wget http://www.spark-project.org/files/spark-0.7.0-sources.tgz
Unpack the archive, put it in the working directory, and link it under /opt
>sudo ln -s /Users/carl/tool/spark-0.7.0 /opt/spark-0.7.0
>sudo ln -s /opt/spark-0.7.0 /opt/spark

Compile the source code
>sudo sbt/sbt compile
>sudo sbt/sbt package
>sudo sbt/sbt assembly

>sudo ./run spark.examples.SparkPi local
The error is still there: SCALA_HOME is not set.

Finally, I found the reason: SCALA_HOME has to be set in conf/spark-env.sh.
>cd conf
>cp spark-env.sh.template spark-env.sh
Be careful not to point it at Scala 2.10.0. Spark 0.7.0 requires Scala 2.9.2, so spark-env.sh should contain
export SCALA_HOME=/opt/scala2.9.2
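
This kind of misconfiguration can be caught by inspecting the export lines before running anything. The following is a minimal Python sketch, not part of Spark: the helper names `parse_exports` and `check_scala_home` are hypothetical, and it assumes the Scala version appears in the install path, as it does in /opt/scala2.9.2 above.

```python
import re

# Matches lines such as: export SCALA_HOME=/opt/scala2.9.2
EXPORT_RE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_exports(script_text):
    """Collect NAME -> value for every uncommented export line."""
    exports = {}
    for line in script_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out template lines
        match = EXPORT_RE.match(line)
        if match:
            exports[match.group(1)] = match.group(2).strip().strip("\"'")
    return exports


def check_scala_home(script_text, required="2.9.2"):
    """Heuristic check: assumes the version number is part of the path."""
    scala_home = parse_exports(script_text).get("SCALA_HOME")
    if scala_home is None:
        return "SCALA_HOME is not set"
    if required not in scala_home:
        return f"SCALA_HOME={scala_home} does not mention Scala {required}"
    return "ok"
```

Passing the contents of conf/spark-env.sh to `check_scala_home` returns "ok" only when an uncommented export line points at a path containing 2.9.2.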

This time, everything works.
>sudo ./run spark.examples.SparkPi local
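
SparkPi estimates pi with a Monte Carlo method: random points in a square are mapped to 1 or 0 depending on whether they fall inside the unit circle, and the results are reduced by summing. The plain-Python sketch below mirrors that idea on a single machine. It is not the Spark example itself, and `estimate_pi` and its `seed` parameter are names chosen here for illustration.

```python
import random
from functools import reduce


def inside_unit_circle(rng):
    """Map step: draw a point in [-1, 1) x [-1, 1) and return 1 if it is inside the circle."""
    x = rng.random() * 2 - 1
    y = rng.random() * 2 - 1
    return 1 if x * x + y * y < 1 else 0


def estimate_pi(samples, seed=None):
    """Reduce step: sum the hits and scale by the area of the square."""
    rng = random.Random(seed)
    hits = reduce(lambda a, b: a + b,
                  (inside_unit_circle(rng) for _ in range(samples)), 0)
    return 4.0 * hits / samples
```

The fraction of points inside the circle approaches pi/4, so the estimate gets closer to pi as the number of samples grows. In Spark the samples are split into slices and mapped in parallel before the reduce.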

>sudo ./run spark.examples.SparkLR local[2]

local[2] runs Spark locally with 2 worker threads, so it can use 2 CPU cores.
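
The master argument can be read as follows: local means one local thread, and local[N] means N local threads. The helper below is a hypothetical Python sketch of that rule, not Spark's own parsing code.

```python
import re


def local_threads(master):
    """Return the number of local threads for 'local' or 'local[N]', else None."""
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match:
        return int(match.group(1))
    return None  # not a local master string
```

With this rule, the SparkPi run above uses one thread and the SparkLR run uses two.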

2013-05-17 22:45