Spark (1) Introduction and Installation

1. Introduction
1.1 MapReduce Model
Map -- read, convert
Reduce -- calculate

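4 classes
A MapReduce job uses 4 classes: one that reads the input data and converts it to key-value pairs, a Map class, a Reduce class, and one that converts the resulting key-value pairs to output data.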
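
To make the 4 steps concrete, here is a minimal single-machine sketch in plain Python (not Hadoop or Spark code). The word-count task and all function names (read_input, map_phase, reduce_phase, write_output, word_count) are my own assumptions for illustration.

```python
from collections import defaultdict


def read_input(text):
    """Step 1: read raw input and convert it to (line_number, line) pairs."""
    return list(enumerate(text.splitlines()))


def map_phase(pairs):
    """Step 2 (Map): convert every input pair to intermediate (word, 1) pairs."""
    for _, line in pairs:
        for word in line.split():
            yield (word.lower(), 1)


def reduce_phase(pairs):
    """Step 3 (Reduce): group the pairs by key and sum the values per key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def write_output(totals):
    """Step 4: convert the final key-value pairs to output text."""
    return "\n".join(f"{word}\t{count}" for word, count in sorted(totals.items()))


def word_count(text):
    """Chain the 4 steps together."""
    return write_output(reduce_phase(map_phase(read_input(text))))


if __name__ == "__main__":
    print(word_count("a b\nb a a"))
```

In a real cluster the Map and Reduce steps run on many machines and the framework does the grouping by key between them; here everything happens in one process.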


1.2 Apache Mesos
Mesos and YARN are both resource sharing systems. They manage the resources of a cluster and share them among frameworks.

Hadoop scheduler, MPI scheduler, Spark
Mesos master, Standby master, … (Controlled by ZooKeeper)
Mesos slave, Mesos slave, Mesos slave … (execute Hadoop executor task, MPI executor task, … )

Mesos-master: manages the frameworks and the slaves, and offers the resources of the slaves to the frameworks
Mesos-slave: runs the mesos-tasks
Framework: Hadoop, Spark …
Executor: runs the tasks of a framework on a slave
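
The roles above can be sketched in a few lines of Python. This is only a toy model of the idea (the master offers slave resources, the framework decides which tasks go where), not the Mesos API. The names make_offers and schedule, the dict layout, and the rule "1 CPU per task" are all my assumptions.

```python
def make_offers(slaves):
    """Master side: advertise the free CPUs of every slave as an offer."""
    return [{"slave": name, "cpus": cpus} for name, cpus in slaves.items() if cpus > 0]


def schedule(offers, tasks):
    """Framework side: place tasks on the offered slaves, 1 CPU per task (assumed).

    Returns (assignments, pending): the (task, slave) pairs that were placed
    and the tasks that did not fit into the current offers.
    """
    assignments = []
    pending = list(tasks)
    for offer in offers:
        free = offer["cpus"]
        while pending and free > 0:
            assignments.append((pending.pop(0), offer["slave"]))
            free -= 1
    return assignments, pending


if __name__ == "__main__":
    offers = make_offers({"slave1": 2, "slave2": 1})
    placed, waiting = schedule(offers, ["t1", "t2", "t3", "t4"])
    print(placed)   # tasks that an executor on each slave would run
    print(waiting)  # tasks that wait for the next round of offers
```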


1.3 Spark Introduction
Spark is implemented in Scala and runs on top of Mesos.
It works with Hadoop and EC2, and it can read data directly from HDFS or S3.

Bagel    Shark
Spark(RDD, Map Reduce, FP)
Mesos
HDFS  AWS s3n

Spark uses the MapReduce model, functional programming, Mesos, HDFS and S3.

Spark Terms
RDD - Resilient Distributed Datasets
Local mode and Mesos Mode
Transformations and Actions -
          A transformation returns a new RDD,
          An action returns a Scala collection, a single value, or nothing
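
The difference between the two kinds of operations can be sketched with a toy class in plain Python. ToyRDD is a hypothetical single-machine stand-in that I made up for illustration, not Spark's real RDD. The method names only mirror the ones Spark uses (map, filter, collect, count, reduce).

```python
import functools


class ToyRDD:
    """A toy, single-machine stand-in for an RDD (hypothetical, not Spark's class)."""

    def __init__(self, compute):
        # compute is a zero-argument function that yields the elements on demand
        self._compute = compute

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: only describe a new ToyRDD, no element is processed yet
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: run the whole chain and return a plain value or collection
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return functools.reduce(f, self._compute())


if __name__ == "__main__":
    nums = ToyRDD.parallelize(range(1, 6))
    even_squares = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # still a ToyRDD
    print(even_squares.collect())  # the action triggers the computation
```

Nothing is computed when filter and map are called. The work only happens when an action such as collect is called.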

Spark on Mesos
RDD + Job(tasks)  ----> SparkScheduler -----> Mesos Master ---> Mesos Slave, Mesos Slave … ( Spark executor… tasks)

1.4 HDFS Introduction
Hadoop Distributed File System ---- NameNode(Only One)------> DataNode

Block          64 MB, the default block size of a file
NameNode       Holds the file names, the directory tree, the namespace image and the edit log. It knows how many blocks each file has and where they are on the DataNodes.
DataNode       Stores the blocks. A client or the NameNode can read and write data on the DataNodes
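
A small worked example of the block idea, as a plain Python sketch. The 64 MB value comes from the note above. The function names and the round-robin placement are my assumptions for illustration only; real HDFS also replicates every block, which this sketch ignores.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size noted above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_blocks(blocks, datanodes):
    """Toy NameNode bookkeeping: block index -> DataNode, round robin (assumed)."""
    return {index: datanodes[index % len(datanodes)] for index in range(len(blocks))}


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = split_into_blocks(200 * mb)  # a 200 MB file
    for index, (offset, length) in enumerate(blocks):
        print(index, offset // mb, length // mb)
    print(place_blocks(blocks, ["datanode1", "datanode2", "datanode3"]))
```

A 200 MB file is split into 3 full blocks of 64 MB and a last block of 8 MB.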

1.5 ZooKeeper
Configuration Management
Cluster Management

1.6 NFS Introduction
NFS - Network File System

2. Installation of Spark
From version 0.6 on, we can ignore Mesos at first.
Get the source code
>git clone https://github.com/mesos/spark.git

My Scala version is 2.10.0. Just try the command
>sudo sbt/sbt package

It works.

I also tried to build with Maven, but it did not seem to work. Since I already have SCALA_HOME set, I will run the examples directly.
Syntax: ./run <class> <params>
>./run spark.examples.SparkLR local[2]

Or
>./run spark.examples.SparkPi local

I tried to run Spark to verify my environment, but it does not work, and it seems to be because of SCALA_HOME.
Error Message:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
     at spark.examples.SparkPi.main(SparkPi.scala)
Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
     at java.security.AccessController.doPrivileged(Native Method)
Attempted solution:
>cd examples
>sudo mvn eclipse:eclipse
>cd ..
>sudo mvn eclipse:eclipse

Then I import the examples and the Spark project into Eclipse and read the source code.

Read the code in spark-examples/src/main/scala/spark/examples/SparkPi.scala
Run this again
>sudo ./run spark.examples.SparkPi local

Still not working. It tells me SCALA_HOME is not set, but I am sure it is set.

>wget http://www.spark-project.org/files/spark-0.7.0-sources.tgz
Unzip and put it in the working directory
>sudo ln -s /Users/carl/tool/spark-0.7.0 /opt/spark-0.7.0
>sudo ln -s /opt/spark-0.7.0 /opt/spark

Compile the source code
>sudo sbt/sbt compile
>sudo sbt/sbt package
>sudo sbt/sbt assembly

>sudo ./run spark.examples.SparkPi local
The error is still there: SCALA_HOME is not set.

Finally, I found the reason. I need to create conf/spark-env.sh and set SCALA_HOME there.
>cd conf
>cp spark-env.sh.template spark-env.sh
Be careful: do not use Scala version 2.10.0 there. I should use 2.9.2.
export SCALA_HOME=/opt/scala2.9.2

This time, everything goes well.
>sudo ./run spark.examples.SparkPi local 

>sudo ./run spark.examples.SparkLR local[2]

local[2] runs Spark locally with 2 worker threads, so it uses 2 CPU cores.
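
To show how I read the last parameter of the run commands: "local" means one worker thread and "local[N]" means N worker threads. A tiny Python sketch of that rule follows. The function local_threads is hypothetical and is not Spark's own parsing code.

```python
import re


def local_threads(master):
    """Return the number of worker threads for a local master string.

    "local"    -> 1 thread
    "local[N]" -> N threads
    Anything else is not a local master and is rejected by this sketch.
    """
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match:
        return int(match.group(1))
    raise ValueError("not a local master string: " + master)


if __name__ == "__main__":
    print(local_threads("local"))     # master used for SparkPi above
    print(local_threads("local[2]"))  # master used for SparkLR above
```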


Reposted from sillycat.iteye.com/blog/1871204