A detailed walkthrough of setting up Spark's Standalone environment

Introduction: Standalone mode is a cluster mode that ships with Spark. Unlike local mode, which starts multiple processes on one machine to simulate a cluster, Standalone mode builds an actual Spark cluster across multiple machines and can make full use of their resources for real big-data processing.

  • The Standalone cluster uses the master-slave model common in distributed computing. The master is the node running the Master process; the slaves are the Worker nodes, each running a Worker process that launches Executor processes.
    Official cluster overview: http://spark.apache.org/docs/latest/cluster-overview.html


  • A Spark Standalone cluster, similar to Hadoop YARN, manages and schedules cluster resources:
  • Master node (Master): manages the resources of the entire cluster, receives submitted applications, and allocates resources to each application; the Tasks themselves run on the Workers.
  • Slave nodes (Workers): each Worker manages the resources of its own machine (CPU cores + memory) and allocates them to run Tasks; each slave node reports its resource information (memory and CPU cores) to the Master through the Worker process.
  • History Server (HistoryServer, optional): after a Spark application finishes running, its event-log data is saved to HDFS, and the HistoryServer can be started to view information about the application's run.


The cluster plan for this build:
Standalone cluster service placement and resource configuration:

node1: master/worker

node2: slave/worker

node3: slave/worker

Official document: http://spark.apache.org/docs/2.4.5/spark-standalone.html

Steps:

  • 1 - Modify slaves
Enter the configuration directory:
cd /export/server/spark/conf
Rename the template configuration file:
mv slaves.template slaves
vim slaves
Contents:
node1
node2
node3
  • 2 - Modify spark-env.sh
Enter the configuration directory:
cd /export/server/spark/conf
Rename the template configuration file:
mv spark-env.sh.template spark-env.sh
Edit the configuration file:
vim spark-env.sh
Add the following content:

## Set the JAVA installation directory
JAVA_HOME=/export/server/jdk    # change to your own JDK directory (match your JDK environment variables)

## Hadoop configuration directory, for reading files on HDFS and running on the YARN cluster
HADOOP_CONF_DIR=/export/server/hadoop/etc/hadoop
YARN_CONF_DIR=/export/server/hadoop/etc/hadoop

## Specify the host of the Spark Master and the port used to submit applications
export SPARK_MASTER_HOST=node1  # hostname
export SPARK_MASTER_PORT=7077   # communication port
SPARK_MASTER_WEBUI_PORT=8080    # web UI port

SPARK_WORKER_CORES=1            # CPU cores per worker
SPARK_WORKER_MEMORY=1g          # memory per worker
SPARK_WORKER_PORT=7078          # communication port
SPARK_WORKER_WEBUI_PORT=8081    # web UI port

SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://node1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"   # the /sparklog/ directory must be created manually (see step 3)
  • 3 - Create the EventLogs storage directory
Start the HDFS service and create the application event-log directory:
hdfs dfs -mkdir -p /sparklog/
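Before moving on, it may be worth confirming that the directory actually exists; a minimal check, assuming HDFS is up on node1:8020 and the hdfs CLI is on the PATH:

```shell
# Check that the event-log directory was created (exit code 0 means it exists)
hdfs dfs -test -d /sparklog/ && echo "/sparklog/ exists"
```

If the Spark daemons run as a different user than the one that created the directory, you may also need to adjust HDFS permissions on /sparklog/.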
  • 4 - Configure Spark applications to save EventLogs (history server settings)
Enter the configuration directory:
cd /export/server/spark/conf
Rename 【$SPARK_HOME/conf/spark-defaults.conf.template】 to 【spark-defaults.conf】:

mv spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf

Add the following content (note: spark-defaults.conf does not support inline comments after a value, so keep comments on their own lines):
## Enable event logging
spark.eventLog.enabled     true
## Directory where event logs are written
spark.eventLog.dir         hdfs://node1:8020/sparklog/
## Compress the event logs
spark.eventLog.compress    true
  • 5 - Set the log level
Enter the directory:
cd /export/server/spark/conf
Rename 【$SPARK_HOME/conf/log4j.properties.template】 to 【log4j.properties】 and change the level to WARN:

mv log4j.properties.template log4j.properties
Change the log level:
vim log4j.properties

Change the root logger level from INFO to WARN:
log4j.rootCategory=WARN, console

  • 6 - Distribute to the other machines
Distribute the configured Spark installation to the other machines in the cluster:
cd /export/server/
scp -r spark-2.4.5-bin-hadoop2.7 root@node2:$PWD
scp -r spark-2.4.5-bin-hadoop2.7 root@node3:$PWD
## Create the soft link (run this on node2 and node3 as well)
ln -s /export/server/spark-2.4.5-bin-hadoop2.7 /export/server/spark
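The distribution and linking steps above can be sketched as one loop, assuming passwordless SSH as root to node2 and node3:

```shell
# Copy the Spark installation to each slave node and create the soft link there
cd /export/server/
for host in node2 node3; do
  scp -r spark-2.4.5-bin-hadoop2.7 root@${host}:$PWD
  ssh root@${host} "ln -s /export/server/spark-2.4.5-bin-hadoop2.7 /export/server/spark"
done
```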

That completes the setup!

  • Start Spark
Option 1: start and stop the whole cluster

Start the Spark cluster on the master node:
/export/server/spark/sbin/start-all.sh

Stop the Spark cluster on the master node:
/export/server/spark/sbin/stop-all.sh

Option 2: start and stop daemons individually
Start and stop the Master on the node where it is installed:
start-master.sh
stop-master.sh
Start and stop the Workers from the Master node (the workers are the hosts listed in the slaves configuration file):
start-slaves.sh
stop-slaves.sh
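After start-all.sh, a quick way to confirm the daemons are up is jps on each node; a sketch, assuming passwordless SSH between the nodes:

```shell
# On node1 you would expect a Master (and a Worker, since node1 is also a worker);
# on node2/node3 you would expect a Worker process
jps
ssh node2 jps
ssh node3 jps
```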
  • WEB UI page
http://node1:8080/


  • History Server HistoryServer:
/export/server/spark/sbin/start-history-server.sh

Web UI address: http://node1:18080


  • How to write code in Standalone mode through the interactive shell
bin/spark-shell --master spark://node1:7077

Word-count example:
sc.textFile("hdfs://node1:8020/wordcount/input/words.txt").flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).collect
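For readers less familiar with the Scala pipeline, the same split-and-count logic can be illustrated locally with plain shell tools (no HDFS involved; /tmp/words.txt is a made-up sample file):

```shell
# Build a tiny sample input
printf 'hello spark\nhello world\n' > /tmp/words.txt
# Split on whitespace, then count occurrences of each word,
# mirroring flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_)
tr -s ' ' '\n' < /tmp/words.txt | sort | uniq -c
```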
  • How to submit tasks in Standalone mode
bin/spark-submit \
--master spark://node1:7077 \
--class org.apache.spark.examples.SparkPi \
/export/server/spark/examples/jars/spark-examples_2.11-2.4.5.jar \
10
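The same submission can also pin executor resources explicitly; a variant using standard spark-submit options for standalone mode (the resource values here are illustrative):

```shell
bin/spark-submit \
--master spark://node1:7077 \
--class org.apache.spark.examples.SparkPi \
--executor-memory 512m \
--total-executor-cores 2 \
/export/server/spark/examples/jars/spark-examples_2.11-2.4.5.jar \
10
```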


  • Supplement: diagram of the Driver and Executors

Origin blog.csdn.net/m0_49834705/article/details/112505564