1. Cluster Plan
Here we build a three-node Spark cluster, with a Worker service deployed on each of the three hosts. To ensure high availability, in addition to the primary Master service deployed on hadoop001, standby Master services are also deployed on hadoop002 and hadoop003. Master coordination is handled by the ZooKeeper cluster: if the primary Master becomes unavailable, a standby Master is elected as the new primary.
2. Prerequisites
Before building the Spark cluster, make sure the JDK environment is ready and that the ZooKeeper and Hadoop clusters have already been set up. The setup steps can be found in:
- JDK installation on Linux
- ZooKeeper standalone and cluster environment setup
- Hadoop cluster environment setup
3. Spark Cluster Setup
3.1 Download and Unzip
Download the required Spark version from the official download page: http://spark.apache.org/downloads.html
After downloading, unpack it:
# tar -zxvf spark-2.2.3-bin-hadoop2.6.tgz
3.2 Configure Environment Variables
# vim /etc/profile
Add environment variables:
export SPARK_HOME=/usr/app/spark-2.2.3-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:$PATH
Make the environment variable configuration take effect immediately:
# source /etc/profile
3.3 Cluster Configuration
Enter the ${SPARK_HOME}/conf directory and copy the sample configuration files that need to be modified:
1. spark-env.sh
cp spark-env.sh.template spark-env.sh
# JDK installation location
JAVA_HOME=/usr/java/jdk1.8.0_201
# location of the Hadoop configuration files
HADOOP_CONF_DIR=/usr/app/hadoop-2.6.0-cdh5.15.2/etc/hadoop
# ZooKeeper addresses
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop001:2181,hadoop002:2181,hadoop003:2181 -Dspark.deploy.zookeeper.dir=/spark"
2. slaves
cp slaves.template slaves
Configure the locations of all Worker nodes:
hadoop001
hadoop002
hadoop003
3.4 Distribute the Installation Package
Distribute the Spark installation package to the other servers. After distributing it, it is also recommended to configure the Spark environment variables on those two servers.
scp -r /usr/app/spark-2.2.3-bin-hadoop2.6/ hadoop002:/usr/app/
scp -r /usr/app/spark-2.2.3-bin-hadoop2.6/ hadoop003:/usr/app/
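With only two target hosts the commands above are easy to type by hand, but the distribution can also be generated in a loop and reviewed before running. A minimal sketch using the hostnames and install path from this guide (it only prints the commands; pipe the output to sh to actually execute them):

```shell
# Print the distribution commands for review; pipe the output to `sh` to run them.
# SPARK_DIR and the host list follow the layout assumed in this guide.
SPARK_DIR=/usr/app/spark-2.2.3-bin-hadoop2.6
for host in hadoop002 hadoop003; do
  echo "scp -r ${SPARK_DIR}/ ${host}:/usr/app/"
done
```

Printing the commands first makes it easy to double-check paths and hostnames before copying a large install directory over the network.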
4. Starting the Cluster
4.1 Start the ZooKeeper Cluster
Start the ZooKeeper service on each of the three servers:
zkServer.sh start
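To confirm each quorum member actually came up, `zkServer.sh status` can be run on each node, or the nodes can be probed over the network with ZooKeeper's four-letter `ruok` command. A minimal sketch, assuming `nc` (netcat) is installed and using this guide's hostnames; note that on newer ZooKeeper versions the four-letter commands may need to be enabled via `4lw.commands.whitelist`:

```shell
# Probe each ZooKeeper node with the "ruok" four-letter command; a healthy
# server answers "imok". Unreachable nodes are reported after a 2s timeout.
for host in hadoop001 hadoop002 hadoop003; do
  reply=$(printf 'ruok' | nc -w 2 "$host" 2181 2>/dev/null)
  echo "${host}: ${reply:-no response}"
done
```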
4.2 Start the Hadoop Cluster
# start the HDFS services
start-dfs.sh
# start the YARN services
start-yarn.sh
4.3 Start the Spark Cluster
On hadoop001, enter the ${SPARK_HOME}/sbin directory and run the following command to start the cluster. It starts the Master service on hadoop001 and a Worker service on every node listed in the slaves configuration file.
start-all.sh
Run the following command on hadoop002 and hadoop003 to start the standby Master services:
# run under ${SPARK_HOME}/sbin
start-master.sh
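The start scripts only print log file locations, so it is worth confirming which daemons are actually running on each host. One way is a jps sweep over the three nodes (a sketch assuming passwordless ssh between the hosts, which start-all.sh itself already requires):

```shell
# List the JVM daemons running on each node. Expect Master + Worker on
# hadoop001, and Worker (plus the standby Master, once started) on the others.
for host in hadoop001 hadoop002 hadoop003; do
  echo "== ${host} =="
  ssh -o BatchMode=yes -o ConnectTimeout=3 "$host" jps \
    || echo "(ssh to ${host} failed)"
done
```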
4.4 Check the Services
Open Spark's Web UI (port 8080). You can see that the Master on hadoop001 is in the ALIVE state and has three available Worker nodes.
The Masters on hadoop002 and hadoop003 are in the STANDBY state, with no available Worker nodes.
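Besides the browser, each Master's state can be read from the standalone Web UI's JSON endpoint (`/json` on the same port 8080), which is convenient for scripting the check. A sketch using this guide's hostnames:

```shell
# Ask each Master which state it is in (ALIVE or STANDBY) through the
# standalone Web UI's JSON endpoint; nodes that do not answer are reported.
for host in hadoop001 hadoop002 hadoop003; do
  state=$(curl -s --max-time 2 "http://${host}:8080/json" 2>/dev/null \
          | grep -o '"status" *: *"[A-Z]*"')
  echo "${host}: ${state:-no response}"
done
```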
5. High-Availability Verification
Now use the kill command to kill the Master process on hadoop001. A standby Master will then be elected as the new primary Master; in my case this was hadoop002. You can see that the Master on hadoop002 goes through the RECOVERING state, becomes the new primary Master, and takes over all available Workers.
If you then run start-master.sh on hadoop001 to start its Master service again, it will rejoin the cluster as a standby Master.
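Rather than hunting through ps output, the Master's PID can be looked up with jps, where the standalone master appears as `<pid> Master`. A small sketch of the kill step, to be run on the host of the current primary Master:

```shell
# Look up the standalone Master's PID with jps ("<pid> Master") and kill it
# to trigger failover; run this on the host of the current primary Master.
MASTER_PID=$(jps | awk '$2 == "Master" {print $1}')
if [ -n "$MASTER_PID" ]; then
  kill "$MASTER_PID"
  echo "killed Master (pid ${MASTER_PID})"
else
  echo "no Master process found"
fi
```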
6. Submitting Jobs
The command is exactly the same as the one used to submit to YARN in a single-node environment. Taking Spark's built-in sample program that computes Pi as an example, the submit command is:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--executor-memory 1G \
--num-executors 10 \
/usr/app/spark-2.2.3-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.2.3.jar \
100
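The command above submits to YARN. Since this guide also built a ZooKeeper-backed standalone cluster, the same example can alternatively be submitted to the standalone Masters; listing every Master in the URL lets the client locate whichever one is currently ALIVE. A sketch (this is a command fragment that needs the running cluster; 7077 is the standalone Master's default RPC port, and the jar path matches the install directory used earlier):

```shell
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop001:7077,hadoop002:7077,hadoop003:7077 \
--executor-memory 1G \
/usr/app/spark-2.2.3-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.2.3.jar \
100
```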
More articles in this big data series can be found in the GitHub open-source project: 大数据入门指南 (Big Data Getting Started Guide).