1. Introduction to Spark
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is a Hadoop MapReduce-like parallel framework open-sourced by the UC Berkeley AMP Lab (the AMP Lab at the University of California, Berkeley). Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate results of a job can be kept in memory, so it is no longer necessary to read and write HDFS between stages. This makes Spark better suited to MapReduce-style algorithms that require iteration, such as those used in data mining and machine learning.
Spark is an open-source cluster computing environment similar to Hadoop, but there are differences between the two that make Spark superior for certain workloads: in addition to providing interactive queries, Spark can also optimize iterative workloads.
Spark is implemented in Scala, which it uses as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, and Scala can manipulate distributed datasets as easily as native collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system; this is supported through a third-party cluster framework called Mesos. Spark was developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab) to build large-scale, low-latency data analytics applications.
2. Deployment preparation
2.1. Preparation of installation package
2.2. Node configuration information
IP address | Host name |
192.168.23.1 | risen01 |
192.168.23.2 | risen02 |
192.168.23.3 | risen03 |
2.3. Node resource configuration information
IP address | Role |
192.168.23.1 | master,worker |
192.168.23.2 | HA-Master,worker |
192.168.23.3 | worker |
3. Cluster configuration and startup
3.1. Upload and decompress the installation package
Operation node: risen01
Operating user: root
1. Upload the installation packages spark-2.2.0-bin-hadoop2.6.tgz, scala-2.11.0.tgz, and jdk-8u161-linux-x64.tar.gz to the ~/packages directory on the risen01 node (if they are already there, this step is not required). The result is as shown in the figure:
2. Unzip the JDK, Spark, and Scala installation packages to /usr/local
Operation node: risen01
Operating user: root
Unzip the JDK command:
tar -zxvf ~/packages/jdk-8u161-linux-x64.tar.gz -C /usr/local
Unzip the spark command:
tar -zxvf ~/packages/spark-2.2.0-bin-hadoop2.6.tgz -C /usr/local
Unpack the Scala command:
tar -zxvf ~/packages/scala-2.11.0.tgz -C /usr/local
3.2. Preparation before startup
Operation nodes: risen01, risen02, risen03
Operating user: root
1. Create a spark/work directory under /data to store Spark task processing logs
2. Create a spark directory under /log to store Spark startup logs, etc.
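The two steps above can be done in one idempotent command on each node (a sketch, run as root; `mkdir -p` also creates the missing parent directories /data and /log):

```shell
# Create the Spark work-log and startup-log directories in one go;
# -p creates missing parents and does not fail if the directories exist
mkdir -p /data/spark/work
mkdir -p /log/spark
ls -ld /data/spark/work /log/spark
```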
3.3. Modify the configuration files
3.3.1. Edit spark-env.sh file
Operation node: risen01
Operating user: root
Note: Please configure each parameter according to the actual cluster size and hardware conditions
Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/conf directory and execute the command:
cp spark-env.sh.template spark-env.sh
Edit the spark-env.sh file and add the following:
# Set the Spark web UI port
SPARK_MASTER_WEBUI_PORT=18080
# Set the directory for Spark task processing logs
SPARK_WORKER_DIR=/data/spark/work
# Set the number of cores on each Spark worker
SPARK_WORKER_CORES=2
# Set the memory of each Spark worker
SPARK_WORKER_MEMORY=1g
# Set the directory for Spark startup logs, etc.
SPARK_LOG_DIR=/log/spark
# Specify the JDK directory Spark needs
export JAVA_HOME=/usr/local/jdk1.8.0_161
# Specify the Scala directory Spark needs
export SCALA_HOME=/usr/local/scala-2.11.0
# Specify the Hadoop installation directory
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
# Specify the Hadoop configuration directory
export HADOOP_CONF_DIR=/opt/cloudera/parcels/CDH/lib/hadoop/etc/hadoop/
# Enable Spark standalone HA (our HA setup only switches between risen01 and risen02 and does not involve risen03, so this setting is optional on risen03)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=risen01:2181,risen02:2181,risen03:2181 -Dspark.deploy.zookeeper.dir=/data/spark"
3.3.2. Edit spark-defaults.conf file
Operation node: risen01
Operating user: root
Note: Please configure each parameter according to the actual cluster size and hardware conditions
Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/conf directory and execute the command:
cp spark-defaults.conf.template spark-defaults.conf
Edit the spark-defaults.conf file and add the following:
# Set the Spark master
spark.master spark://risen01:7077
# Enable the event log
spark.eventLog.enabled true
# Set the event log storage directory
spark.eventLog.dir /log/spark/eventLog
# Set the Spark serializer
spark.serializer org.apache.spark.serializer.KryoSerializer
# Set the Spark driver memory
spark.driver.memory 1g
# Set the executor heartbeat interval
spark.executor.heartbeatInterval 20s
# Default parallelism
spark.default.parallelism 20
# Maximum network timeout
spark.network.timeout 3000s
3.3.3. Edit the slaves file
Operation node: risen01
Operating user: root
Note: Please configure each parameter according to the actual cluster size and hardware conditions
Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/conf directory and execute the command:
cp slaves.template slaves
Edit the slaves file and modify localhost to:
risen01
risen02
risen03
3.4. Distribute other nodes
1. Execute the scp command:
Operation node: risen01
Operating user: root
scp -r /usr/local/spark-2.2.0-bin-hadoop2.6 root@risen02:/usr/local
scp -r /usr/local/scala-2.11.0 root@risen02:/usr/local
scp -r /usr/local/jdk1.8.0_161 root@risen02:/usr/local
scp -r /usr/local/spark-2.2.0-bin-hadoop2.6 root@risen03:/usr/local
scp -r /usr/local/scala-2.11.0 root@risen03:/usr/local
scp -r /usr/local/jdk1.8.0_161 root@risen03:/usr/local
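The six scp commands above can be expressed as one loop. A sketch using a hypothetical `distribute` helper that only prints the commands by default; pass `run` to actually execute them (this assumes passwordless SSH for root is already set up):

```shell
# Hypothetical helper: prints the scp commands that copy the three unpacked
# directories to each remaining node; pass "run" to actually execute them
distribute() {
  runner="echo"
  if [ "$1" = "run" ]; then runner=""; fi
  for node in risen02 risen03; do
    for dir in spark-2.2.0-bin-hadoop2.6 scala-2.11.0 jdk1.8.0_161; do
      $runner scp -r "/usr/local/$dir" "root@$node:/usr/local"
    done
  done
}

distribute    # dry run: prints the six scp commands
```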
2. The bigdata user must be created in advance on all nodes, with passwordless SSH configured between them (not repeated here; if you have already done this step, you can skip it)
3. Permission modification
Operation nodes: risen01, risen02, risen03
Operating user: root
Modify the /log/spark permissions:
chown -R bigdata:bigdata /log/spark
Modify the /data/spark permissions:
chown -R bigdata:bigdata /data/spark
Modify the Spark installation directory permissions:
chown -R bigdata:bigdata /usr/local/spark-2.2.0-bin-hadoop2.6
Modify the Scala installation directory permissions:
chown -R bigdata:bigdata /usr/local/scala-2.11.0
Modify the JDK 1.8 installation directory permissions (skip if you have already done this step):
chown -R bigdata:bigdata /usr/local/jdk1.8.0_161
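The chown commands above can be collapsed into one loop. A minimal sketch, run as root; it also creates the bigdata user if it is missing (which step 2 of section 3.4 normally handles) and skips any directory that does not exist yet:

```shell
# Create the bigdata user if it does not exist yet (step 2 above normally does this)
id bigdata >/dev/null 2>&1 || useradd bigdata

# Hand ownership of the log, data, and installation directories to bigdata
for path in /log/spark /data/spark \
            /usr/local/spark-2.2.0-bin-hadoop2.6 \
            /usr/local/scala-2.11.0 \
            /usr/local/jdk1.8.0_161; do
  if [ -d "$path" ]; then
    chown -R bigdata:bigdata "$path"
  fi
done
```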
The result is shown below:
3.5. Start the cluster
Operation nodes: risen01, risen02
Operating user: bigdata
(1) On risen01, go to the /usr/local/spark-2.2.0-bin-hadoop2.6/sbin directory and execute ./start-all.sh; the web interface is shown below:
Then run ./start-master.sh in the /usr/local/spark-2.2.0-bin-hadoop2.6/sbin directory on the risen02 machine to start the standby master node of the Spark cluster. (Remember to start the standby master process. Here only risen02 is used as the standby master; although risen03 is also configured to be eligible, we do not need it for the time being.)
(2) Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/bin directory, run spark-shell, and perform a word-frequency test. The results are shown in the following figure:
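The word-frequency test can be typed directly at the spark-shell prompt. A minimal sketch (the input path /tmp/words.txt and the variable names are illustrative assumptions, not from the original):

```
scala> val lines = sc.textFile("/tmp/words.txt")
scala> val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.collect().foreach(println)
```

Each printed pair is a word and the number of times it appears in the file.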
At this point, Spark standalone mode has been installed successfully!