Spark is a fast, general-purpose computing engine designed for large-scale data processing. It offers the advantages of Hadoop MapReduce, but unlike MapReduce it can keep intermediate job output in memory instead of repeatedly reading and writing HDFS, which makes it much better suited to iterative workloads such as data mining and machine learning algorithms. Spark depends on Scala, so we install the two together.
1. Unpack the archives
tar -zxvf /opt/spark-1.6.0-cdh5.8.0.tar.gz
tar -zxvf /opt/scala-2.10.4.tgz
2. Configure environment variables
# vim /etc/profile
Add at the end of the file:
export SPARK_HOME=/opt/spark-1.6.0-cdh5.8.0
export SCALA_HOME=/opt/scala-2.10.4
export PATH=.:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH  # add the Scala bin directory to PATH
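Edits to /etc/profile only take effect at the next login; `source /etc/profile` applies them immediately. The sketch below re-runs the same exports so it can be checked standalone, then confirms the Scala bin directory actually landed on PATH (JAVA_HOME is assumed to be set elsewhere in the profile):

```shell
# Same exports as in /etc/profile (JAVA_HOME assumed set elsewhere).
export SPARK_HOME=/opt/spark-1.6.0-cdh5.8.0
export SCALA_HOME=/opt/scala-2.10.4
export PATH=.:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH

# Spot-check: is $SCALA_HOME/bin now a PATH component?
case ":$PATH:" in
  *":$SCALA_HOME/bin:"*) echo "scala on PATH" ;;
  *)                     echo "scala missing from PATH" ;;
esac
```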
3. Configure spark-env.sh
The spark-env.sh file sets up the Spark runtime environment: its dependencies and the resource configuration of the master and slaves.
cp conf/spark-env.sh.template conf/spark-env.sh  # copy the template as spark-env.sh
The configuration is as follows:
HADOOP_CONF_DIR=/opt/hadoop-2.6.0-cdh5.8.0/etc/hadoop
SPARK_LOCAL_IP=slave1   # the machine this copy of Spark runs on
SPARK_MASTER_IP=master  # IP/hostname of the master node
SPARK_CLASSPATH=$CLASSPATH:`find /opt/hadoop-2.6.0-cdh5.8.0 -name '*.jar' | tr '\n' ':'`
SPARK_LOCAL_DIRS=/opt/spark/
HADOOP_HOME=/opt/hadoop-2.6.0-cdh5.8.0
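The backtick expression in SPARK_CLASSPATH collects every Hadoop jar and joins the paths with colons into a single classpath string. The demo below runs the same pipeline against a throwaway directory (a stand-in for /opt/hadoop-2.6.0-cdh5.8.0, which need not exist on the machine you try this on):

```shell
# Demo of the SPARK_CLASSPATH expression on a throwaway directory.
demo=$(mktemp -d)
touch "$demo/hadoop-common.jar" "$demo/hadoop-hdfs.jar" "$demo/notes.txt"
# Quote the pattern ('*.jar') so the shell cannot expand it before find runs.
jars=$(find "$demo" -name '*.jar' | tr '\n' ':')
echo "$jars"   # the two .jar paths joined by ':'; notes.txt is excluded
rm -rf "$demo"
```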
4. Configure /opt/spark-1.6.0-cdh5.8.0/conf/slaves
master
slave1
slave2
5. Copy the entire directory to slave1, slave2
scp -r /opt/spark-1.6.0-cdh5.8.0 hadoop@slave1:/opt/
scp -r /opt/spark-1.6.0-cdh5.8.0 hadoop@slave2:/opt/
On slave1 and slave2, edit spark-env.sh and set SPARK_LOCAL_IP to that machine's own hostname.
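Hand-editing each copy works; a scripted alternative is a sed one-liner per node. The sketch below shows the substitution against a scratch file (the hostname `slave2` and the ssh invocation in the comment are illustrative; on the real cluster you would run the sed over ssh on each slave's conf/spark-env.sh):

```shell
# Sketch: point SPARK_LOCAL_IP at a node's own hostname with sed.
# On the cluster this would run on each slave, e.g. via:
#   ssh hadoop@slave2 "sed -i ... /opt/spark-1.6.0-cdh5.8.0/conf/spark-env.sh"
conf=$(mktemp)
echo 'SPARK_LOCAL_IP=slave1' > "$conf"
node=slave2   # the hostname this copy should advertise
sed -i "s/^SPARK_LOCAL_IP=.*/SPARK_LOCAL_IP=$node/" "$conf"
result=$(cat "$conf")
echo "$result"   # SPARK_LOCAL_IP=slave2
rm -f "$conf"
```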
6. Verify