Introduction: Standalone mode is the cluster mode that ships with Spark. Unlike the earlier local mode, which starts multiple processes on a single machine to simulate a cluster, Standalone mode builds an actual Spark cluster across multiple machines, so it can be fully utilized for real big data processing.
- A Standalone cluster uses the master-slave model common in distributed computing. The master is the node running the Master process; the slaves are the Worker nodes, which host the Executor processes.
Official cluster overview: http://spark.apache.org/docs/latest/cluster-overview.html
- A Spark Standalone cluster, similar to Hadoop YARN, manages and schedules cluster resources:
- Master (master node): manages the resources of the entire cluster, receives submitted applications, and allocates resources to each application so its Tasks can run
- Workers (slave nodes): 1) manage the resources of their own machine (CPU cores + memory) and use those resources to run Tasks; 2) each slave node's resources are managed by its Worker, and the resource information consists of memory and CPU cores.
- HistoryServer (optional): after a Spark application finishes running, its event log data is saved to HDFS; start the HistoryServer to view information about completed application runs.
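Once the cluster is running, the standalone Master's web UI also exposes its status as JSON (typically at http://node1:8080/json), listing each Worker with its cores and memory. A minimal sketch of summing worker resources from such a response, using an inline sample instead of a live request (the exact field names are assumptions to verify against your own cluster's output):

```python
import json

# Illustrative sample of the Master's JSON status view (field names assumed;
# check http://node1:8080/json on your own cluster for the real shape).
sample = json.loads("""
{
  "url": "spark://node1:7077",
  "workers": [
    {"host": "node1", "cores": 1, "memory": 1024, "state": "ALIVE"},
    {"host": "node2", "cores": 1, "memory": 1024, "state": "ALIVE"},
    {"host": "node3", "cores": 1, "memory": 1024, "state": "ALIVE"}
  ]
}
""")

# Sum the resources of all ALIVE workers
alive = [w for w in sample["workers"] if w["state"] == "ALIVE"]
total_cores = sum(w["cores"] for w in alive)
total_memory_mb = sum(w["memory"] for w in alive)
print(total_cores, total_memory_mb)  # 3 3072
```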
Cluster planning for this installation (service placement and resource configuration):
node1:master/worker
node2:slave/worker
node3:slave/worker
Official document: http://spark.apache.org/docs/2.4.5/spark-standalone.html
Steps:
- 1- Modify the slaves file
Enter the configuration directory
cd /export/server/spark/conf
Rename the configuration file
mv slaves.template slaves
vim slaves
The contents are as follows:
node1
node2
node3
- 2- Modify spark-env.sh
Enter the configuration directory
cd /export/server/spark/conf
Rename the configuration file
mv spark-env.sh.template spark-env.sh
Edit the configuration file
vim spark-env.sh
Add the following content:
## Set the JAVA installation directory
JAVA_HOME=/export/server/jdk # change to your own JDK directory, matching your JAVA_HOME environment variable
## HADOOP configuration file directory, for reading files on HDFS and running on a YARN cluster
HADOOP_CONF_DIR=/export/server/hadoop/etc/hadoop
YARN_CONF_DIR=/export/server/hadoop/etc/hadoop
## Specify the host of the Spark Master and the port for submitting tasks
export SPARK_MASTER_HOST=node1 # hostname
export SPARK_MASTER_PORT=7077 # communication port
SPARK_MASTER_WEBUI_PORT=8080 # web UI port
SPARK_WORKER_CORES=1 # CPU cores per worker
SPARK_WORKER_MEMORY=1g # memory per worker
SPARK_WORKER_PORT=7078 # communication port
SPARK_WORKER_WEBUI_PORT=8081 # web UI port
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://node1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true" # the sparklog directory must be created manually (see step 3)
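With the per-worker settings above (1 core and 1g of memory per Worker) and three workers in the slaves file, the total schedulable resources of this cluster are small. A quick sketch of the arithmetic, using the values from this configuration:

```python
# Per-worker resources taken from spark-env.sh above
worker_cores = 1      # SPARK_WORKER_CORES
worker_memory_gb = 1  # SPARK_WORKER_MEMORY=1g
num_workers = 3       # node1, node2, node3 (all listed in the slaves file)

# Total resources the Master can allocate to applications
total_cores = worker_cores * num_workers
total_memory_gb = worker_memory_gb * num_workers
print(total_cores, total_memory_gb)  # 3 3
```

For a real workload you would raise these values to match each machine's actual hardware; the 1-core/1g settings here are only suitable for a learning environment.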
- 3- Create EventLogs storage directory
Start the HDFS service and create the directory for application event logs, with the following command:
hdfs dfs -mkdir -p /sparklog/
- 4- Configure Spark application to save EventLogs
## Enter the configuration directory (history server settings)
cd /export/server/spark/conf
## Rename the configuration file: rename [$SPARK_HOME/conf/spark-defaults.conf.template] to [spark-defaults.conf]
mv spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
## Add the following content (note: spark-defaults.conf is a Java-style properties file, so comments must be on their own lines, not after a value):
# whether event logging is enabled
spark.eventLog.enabled true
# which directory the event logs are written to
spark.eventLog.dir hdfs://node1:8020/sparklog/
# whether to compress the event logs
spark.eventLog.compress true
- 5- Set the log level
## Enter the directory
cd /export/server/spark/conf
## Rename the log properties configuration file: rename [$SPARK_HOME/conf/log4j.properties.template] to [log4j.properties], and change the level to WARN.
mv log4j.properties.template log4j.properties
## Change the log level
vim log4j.properties
The revised content is as follows (change the root logger line in the template from INFO to WARN):
log4j.rootCategory=WARN, console
- 6- Distribute to the other machines
Distribute the configured Spark installation to the other machines in the cluster, with the following commands:
cd /export/server/
scp -r spark-2.4.5-bin-hadoop2.7 root@node2:$PWD
scp -r spark-2.4.5-bin-hadoop2.7 root@node3:$PWD
## Create the soft link (also run this on node2 and node3 after distribution)
ln -s /export/server/spark-2.4.5-bin-hadoop2.7 /export/server/spark
That completes the setup.
- Start Spark
Option 1: start and stop the whole cluster
Start the Spark cluster on the master node
/export/server/spark/sbin/start-all.sh
Stop the Spark cluster on the master node
/export/server/spark/sbin/stop-all.sh
Option 2: start and stop daemons individually
Start and stop the master on the Master node:
start-master.sh
stop-master.sh
Start and stop the workers from the Master node (the workers are the hostnames listed in the slaves configuration file):
start-slaves.sh
stop-slaves.sh
- WEB UI page
http://node1:8080/
- Start the HistoryServer:
/export/server/spark/sbin/start-history-server.sh
Web UI address: http://node1:18080
- How can you write code interactively in Standalone mode?
bin/spark-shell --master spark://node1:7077
WordCount example:
sc.textFile("hdfs://node1:8020/wordcount/input/words.txt").flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).collect
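The one-liner above chains flatMap → map → reduceByKey over the file on HDFS. A plain-Python sketch of the same logic on a local list, so the transformation can be followed without a cluster (the sample lines are hypothetical; the HDFS path and Spark APIs from the example are not used here):

```python
import re
from collections import Counter

# Stand-in for the contents of words.txt on HDFS (hypothetical sample lines)
lines = ["hello spark", "hello hadoop spark"]

# flatMap(_.split("\\s+")): split each line on whitespace and flatten the result
words = [w for line in lines for w in re.split(r"\s+", line.strip())]

# map((_, 1)) + reduceByKey(_ + _): count occurrences per word
counts = Counter(words)
print(dict(counts))  # {'hello': 2, 'spark': 2, 'hadoop': 1}
```

In the real job each stage runs distributed across the Executors; reduceByKey shuffles the per-partition counts by key before summing them.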
- How to submit tasks in Standalone mode
bin/spark-submit \
--master spark://node1:7077 \
--class org.apache.spark.examples.SparkPi \
/export/server/spark/examples/jars/spark-examples_2.11-2.4.5.jar \
10
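The SparkPi example estimates π by Monte Carlo sampling: it throws random points into the unit square and counts how many fall inside the quarter circle, with the final `10` above setting the number of partitions (slices) to parallelize over. A plain-Python sketch of the same sampling idea (the standard technique; the Spark example's exact details may differ):

```python
import random

random.seed(0)  # fixed seed so the estimate is reproducible

n = 100_000  # number of random points to sample
inside = 0
for _ in range(n):
    x, y = random.random(), random.random()
    if x * x + y * y <= 1.0:  # point falls inside the quarter circle
        inside += 1

# Area of quarter circle / area of unit square = pi/4
pi_estimate = 4.0 * inside / n
print(pi_estimate)  # roughly 3.14 (Monte Carlo, so not exact)
```

In the distributed version, each partition samples its share of points on an Executor and only the per-partition counts are sent back to the Driver to be combined.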
- Supplement: diagram of the Driver and Executors (image not included here)