Spark 2.2.0 Distributed Offline Deployment

1. Introduction to Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a Hadoop MapReduce-like general parallel framework open-sourced by the AMP Lab at the University of California, Berkeley. Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output of a job can be kept in memory, so it no longer needs to be read from and written to HDFS. This makes Spark better suited to algorithms that require iteration, such as data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop, but there are differences between the two that make Spark superior for certain workloads: in addition to providing interactive queries, Spark can also optimize iterative workloads.

Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which lets Scala manipulate distributed datasets as easily as local collection objects.

Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on top of the Hadoop file system; this is enabled through a third-party cluster framework called Mesos. Spark was developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab) to build large-scale, low-latency data analytics applications.

2. Deployment preparation

2.1. Preparation of installation packages

The following installation packages are required: spark-2.2.0-bin-hadoop2.6.tgz, scala-2.11.0.tgz, and jdk-8u161-linux-x64.tar.gz.

2.2. Node configuration information

IP address      Hostname
192.168.23.1    risen01
192.168.23.2    risen02
192.168.23.3    risen03

2.3. Node resource configuration information

IP address      Role
192.168.23.1    master, worker
192.168.23.2    HA-Master, worker
192.168.23.3    worker

3. Cluster configuration and startup

3.1. Upload and decompress the installation package

Operation node: risen01

Operating user: root

1. Upload the installation packages spark-2.2.0-bin-hadoop2.6.tgz, scala-2.11.0.tgz, and jdk-8u161-linux-x64.tar.gz to the ~/packages directory on the risen01 node (if they are already there, this step is not required), for example:
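A minimal sketch of the upload, assuming the packages sit in the current directory of your workstation and that the risen01 hostname resolves (otherwise use 192.168.23.1):

# Create the target directory on risen01 and copy the three packages into it
ssh root@risen01 "mkdir -p ~/packages"
scp spark-2.2.0-bin-hadoop2.6.tgz scala-2.11.0.tgz jdk-8u161-linux-x64.tar.gz root@risen01:~/packages/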

2. Extract the JDK, Spark, and Scala installation packages to /usr/local

Operation node: risen01

Operating user: root

Unzip the JDK command:

tar -zxvf ~/packages/jdk-8u161-linux-x64.tar.gz -C /usr/local

Unzip the spark command:

tar -zxvf ~/packages/spark-2.2.0-bin-hadoop2.6.tgz -C /usr/local

Unpack the Scala command:

tar -zxvf ~/packages/scala-2.11.0.tgz -C /usr/local

3.2. Preparation before startup

Operation nodes: risen01, risen02, risen03

Operating user: root

1. Create a new spark/work directory under /data to store Spark task processing logs

2. Create a new spark directory under /log to store Spark startup logs and similar output (example commands for both steps are shown after this list)
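A minimal sketch of the two steps above, run as root on every node; it assumes the /data and /log directories already exist:

# Directory for Spark task processing logs (used by SPARK_WORKER_DIR below)
mkdir -p /data/spark/work

# Directory for Spark startup logs (used by SPARK_LOG_DIR below)
mkdir -p /log/spark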

3.3. Modify the configuration files

3.3.1. Edit spark-env.sh file

Operation node: risen01

Operating user: root

Note: Please configure each parameter according to the actual cluster size and hardware conditions

Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/conf directory and execute the command:

cp spark-env.sh.template spark-env.sh

Edit the spark-env.sh file and add the following:

# Set the Spark master web UI port
SPARK_MASTER_WEBUI_PORT=18080

# Set the directory for Spark task processing logs
SPARK_WORKER_DIR=/data/spark/work

# Set the number of cores per worker
SPARK_WORKER_CORES=2

# Set the memory per worker
SPARK_WORKER_MEMORY=1g

# Set the directory for Spark startup logs, etc.
SPARK_LOG_DIR=/log/spark

# Specify the JDK directory required by Spark
export JAVA_HOME=/usr/local/jdk1.8.0_161

# Specify the Scala directory required by Spark
export SCALA_HOME=/usr/local/scala-2.11.0

# Specify the Hadoop installation directory
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop

# Specify the Hadoop configuration directory
export HADOOP_CONF_DIR=/opt/cloudera/parcels/CDH/lib/hadoop/etc/hadoop/

# Enable Spark standalone HA (our HA only fails over between risen01 and risen02 and does not involve risen03, so this setting is optional on risen03)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=risen01:2181,risen02:2181,risen03:2181 -Dspark.deploy.zookeeper.dir=/data/spark"

3.3.2. Edit spark-defaults.conf file

Operation node: risen01

Operating user: root

Note: Please configure each parameter according to the actual cluster size and hardware conditions

Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/conf directory and execute the command:

cp spark-defaults.conf.template spark-defaults.conf

Edit the spark-defaults.conf file and add the following:

# Set the Spark master
spark.master                      spark://risen01:7077

# Enable the event log
spark.eventLog.enabled            true

# Set the event log storage directory
spark.eventLog.dir                /log/spark/eventLog

# Set the Spark serializer
spark.serializer                  org.apache.spark.serializer.KryoSerializer

# Set the Spark driver memory
spark.driver.memory               1g

# Set the executor heartbeat interval
spark.executor.heartbeatInterval  20s

# Default parallelism
spark.default.parallelism         20

# Maximum network timeout
spark.network.timeout             3000s

3.3.3. Edit the slaves file

Operation node: risen01

Operating user: root

Note: Please configure each parameter according to the actual cluster size and hardware conditions

Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/conf directory and execute the command:

cp slaves.template slaves

Edit the slaves file and modify localhost to:

risen01
risen02
risen03

3.4. Distribute to the other nodes

1. Execute the scp command:

Operation node: risen01

Operating user: root

scp -r /usr/local/spark-2.2.0-bin-hadoop2.6 root@risen02:/usr/local
scp -r /usr/local/scala-2.11.0 root@risen02:/usr/local
scp -r /usr/local/jdk1.8.0_161 root@risen02:/usr/local
scp -r /usr/local/spark-2.2.0-bin-hadoop2.6 root@risen03:/usr/local
scp -r /usr/local/scala-2.11.0 root@risen03:/usr/local
scp -r /usr/local/jdk1.8.0_161 root@risen03:/usr/local

2. A bigdata user must be created on every node in advance, with passwordless SSH set up for it (not repeated in detail here; if you have already done this step, you can skip it). A sketch follows.
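A minimal sketch of this step, assuming the bigdata user does not exist yet; the password is a placeholder:

# As root on risen01, risen02 and risen03: create the bigdata user (replace CHANGE_ME with a real password)
useradd bigdata
echo "bigdata:CHANGE_ME" | chpasswd

# As bigdata on risen01: generate an SSH key and distribute it for passwordless login
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in risen01 risen02 risen03; do ssh-copy-id bigdata@$host; done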

3. Permission modification

Operation nodes: risen01, risen02, risen03

Operating user: root

Change ownership of /log/spark:

chown -R bigdata.bigdata /log/spark

Change ownership of /data/spark:

chown -R bigdata.bigdata /data/spark

Change ownership of the Spark installation directory:

chown -R bigdata.bigdata /usr/local/spark-2.2.0-bin-hadoop2.6

Change ownership of the Scala installation directory:

chown -R bigdata.bigdata /usr/local/scala-2.11.0

Change ownership of the JDK 1.8 installation directory (if you have already done this step, you can skip it):

chown -R bigdata.bigdata /usr/local/jdk1.8.0_161


3.5. Start the cluster

Operation nodes: risen01, risen02

Operating user: bigdata

(1) Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/sbin directory and execute ./start-all.sh, then check the master web UI (port 18080, as configured above) to confirm that the master and all three workers are up.

Then run ./start-master.sh in /usr/local/spark-2.2.0-bin-hadoop2.6/sbin on the risen02 machine to start the standby master of the Spark cluster, as sketched below. (Remember to start the standby master process. Here we only use risen02 as the standby master; although risen03 is also configured to be eligible, we do not need it for the time being.)
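A sketch of the start-up commands described above, run as the bigdata user:

# On risen01: start the master and all workers listed in conf/slaves
cd /usr/local/spark-2.2.0-bin-hadoop2.6/sbin
./start-all.sh

# On risen02: start the standby master for HA
cd /usr/local/spark-2.2.0-bin-hadoop2.6/sbin
./start-master.sh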

(2) Go to the /usr/local/spark-2.2.0-bin-hadoop2.6/bin directory, start spark-shell, and run a word-count test, for example as sketched below.
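A minimal word-count smoke test; it uses an in-memory collection instead of an input file so nothing needs to be staged on the workers, and the Scala statements are piped into spark-shell:

cd /usr/local/spark-2.2.0-bin-hadoop2.6/bin
./spark-shell --master spark://risen01:7077 <<'EOF'
// word count over a small in-memory test collection
val lines = sc.parallelize(Seq("hello spark", "hello world", "spark standalone"))
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)
EOF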

At this point, Spark standalone mode has been installed successfully!
