Getting started with Spark (2)

2.3 Standalone mode

Standalone mode is Spark's own resource scheduling engine. It builds a Spark cluster consisting of a Master and Workers, and Spark applications run in that cluster.
Note that this Standalone has nothing to do with Hadoop's standalone mode: here it simply means that the cluster is built with Spark alone, without the need for any other framework.

2.3.1 Resource management of cluster roles

Master and Worker cluster resource management
Master and Worker are Spark daemon processes (resident background processes) that manage the cluster's resources; in other words, they are the background processes that must be running for Spark to work in Standalone mode.
Driver and Executor are temporary programs that will only be started when a specific task is submitted to the Spark cluster.
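A quick way to see the difference (a hedged illustration, assuming the cluster built in 2.3.2 below is already running and a spark-shell session is attached to it): the resident daemons stay in jps output all the time, while the Driver and Executor processes appear only while the application runs.

```bash
jps
# 10231 Master                          <- resident daemon
# 10341 Worker                          <- resident daemon
# 12044 SparkSubmit                     <- Driver of the running spark-shell (temporary)
# 12135 CoarseGrainedExecutorBackend    <- Executor (temporary)
# (PIDs are illustrative only)
```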

2.3.2 Installation and use

1) Cluster planning
hadoop102: Master, Worker
hadoop103: Worker
hadoop104: Worker
2) Extract another copy of the Spark installation package and rename the extracted directory to spark-standalone

[aa@hadoop102 software]$ tar -zxvf spark-3.1.3-bin-hadoop3.2.tgz -C /opt/module/
[aa@hadoop102 module]$ mv spark-3.1.3-bin-hadoop3.2 spark-standalone
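A quick check that the example jar used later in this section is in place (the jar name matches the spark-submit commands below):

```bash
ls /opt/module/spark-standalone/examples/jars/ | grep spark-examples
# spark-examples_2.12-3.1.3.jar
```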

3) Enter the Spark configuration directory /opt/module/spark-standalone/conf

[aa@hadoop102 spark-standalone]$ cd conf

4) Rename the conf/workers.template file to conf/workers, then edit the workers file and add the worker nodes:

[aa@hadoop102 conf]$ mv workers.template workers
[aa@hadoop102 conf]$ vim workers

hadoop102
hadoop103
hadoop104

5) Rename the conf/spark-env.sh.template file to conf/spark-env.sh and add the Master node configuration:

[aa@hadoop102 conf]$ mv spark-env.sh.template spark-env.sh
[aa@hadoop102 conf]$ vim spark-env.sh
SPARK_MASTER_HOST=hadoop102
SPARK_MASTER_PORT=7077
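Optionally, if port 8080 is already in use on hadoop102, the Master web UI port can be changed in the same file; SPARK_MASTER_WEBUI_PORT is one of the options listed in spark-env.sh.template (the value below is just an example):

```bash
# Optional, only needed if the default 8080 clashes with another service (example value)
SPARK_MASTER_WEBUI_PORT=8989
```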

6) Distribute the spark-standalone package

[aa module]$ xsync spark-standalone/
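xsync is the rsync-based distribution script from the earlier collection project. If it is not available, a minimal sketch with plain scp does the same job (host names taken from the cluster plan above):

```bash
# Copy the directory to the other two nodes (assumes passwordless ssh is configured)
for host in hadoop103 hadoop104; do
  scp -r /opt/module/spark-standalone ${host}:/opt/module/
done
```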

7) Start the Spark cluster

[aa spark-standalone]$ sbin/start-all.sh
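start-all.sh simply starts the individual daemons in turn; the equivalent step-by-step commands (script names as shipped with Spark 3.1.x) are sketched below, together with the matching shutdown command:

```bash
sbin/start-master.sh    # start the Master on the current node (hadoop102)
sbin/start-workers.sh   # start a Worker on every host listed in conf/workers
# ... and to shut the whole cluster down later:
sbin/stop-all.sh
```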

Check the running processes on the three servers (xcall.sh is the script introduced in the earlier collection project; a minimal sketch of such a script follows the output below).

[aa spark-standalone]$ xcall.sh
=============== hadoop102 ===============
10341 Worker
11061 Jps
10231 Master
=============== hadoop103 ===============
7800 Worker
8266 Jps
=============== hadoop104 ===============
13601 Jps
13293 Worker
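A minimal sketch of such an xcall-style helper, assuming passwordless ssh between the three nodes as set up in the earlier project:

```bash
#!/bin/bash
# Run jps on every node and print the results, mimicking the output above.
# (If jps is not found over ssh, source /etc/profile first or use the full JDK path.)
for host in hadoop102 hadoop103 hadoop104; do
  echo "=============== ${host} ==============="
  ssh ${host} "jps"
done
```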

Note: If you encounter a "JAVA_HOME not set" exception, you can add the following configuration to the spark-config.sh file in the sbin directory.

export JAVA_HOME=XXXX
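For example (the JDK path below is only a placeholder; substitute your actual installation), the line can be appended and then distributed so that every node picks it up:

```bash
# /opt/module/jdk1.8.0_212 is an example path; use your real JAVA_HOME
echo 'export JAVA_HOME=/opt/module/jdk1.8.0_212' >> sbin/spark-config.sh
xsync sbin/spark-config.sh
```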

8) Web UI: hadoop102:8080 (the Master's web UI port, comparable to YARN's port 8088).
Currently, no task execution information can be seen.
9) Official example: computing Pi

[aa spark-standalone]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.1.3.jar \
10

Parameter: --master spark://hadoop102:7077 specifies the Master of the cluster to connect to (it must match what was configured in spark-env.sh).
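The SparkPi example prints its result to the driver's stdout (the driver runs locally in the default client deploy mode), so the outcome can also be checked directly from the command line, for example:

```bash
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.1.3.jar \
10 2>&1 | grep "Pi is roughly"
# Pi is roughly 3.14...
```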
10) View the page at http://hadoop102:8080/.
By default, the total number of cores across the three server nodes is 24, and each node contributes 1024 MB of memory.
8080: the web UI of the Master
4040: the web UI port of the currently running application

2.3.3 Parameter description

Of course, we can also specify resource usage according to the actual task requirements.
1) Configure each Executor with 2G of available memory and use 2 CPU cores in total

[aa spark-standalone]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 2G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.12-3.1.3.jar \
10

2) View the page http://hadoop102:8080/
3) Basic syntax
bin/spark-submit \
--class <main-class> \
--master <master-url> \
... # other options
<application-jar> \
[application-arguments]
4) Parameter description
--class: the main class of the application (the entry point containing the main method)
--master: the master URL to connect to, e.g. spark://hadoop102:7077, yarn, or local[*]
--executor-memory: the amount of memory available to each Executor, e.g. 1G
--total-executor-cores: the total number of CPU cores used by all Executors (Standalone mode)
application-jar: the path to the packaged application jar, including its dependencies
application-arguments: the arguments passed to the main method of the main class

2.3.4 Configuring History Service

After spark-shell (or any application) exits, the hadoop102:4040 page is gone and the status of finished tasks can no longer be viewed, so we need a history server to keep the task logs.
1) Rename spark-defaults.conf.template

[aa conf]$ mv spark-defaults.conf.template spark-defaults.conf

2) Modify the spark-defaults.conf file and configure the log storage path (where logs are written)

[aa conf]$ vim spark-defaults.conf
spark.eventLog.enabled          true
spark.eventLog.dir              hdfs://hadoop102:8020/directory
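Optionally (not required by the steps in this section), event logs can also be compressed to save space on HDFS; spark.eventLog.compress is a standard Spark property:

```bash
# Append to spark-defaults.conf while still in the conf directory (optional)
echo "spark.eventLog.compress    true" >> spark-defaults.conf
```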

Note: The Hadoop cluster needs to be started, and the directory on HDFS needs to exist in advance (because the historical task log data is stored on HDFS).

[aa hadoop-3.1.3]$ hadoop.sh start
[aa hadoop-3.1.3]$ hadoop fs -mkdir /directory

3) Modify the spark-env.sh file and add the following configuration:

[aa conf]$ vim spark-env.sh

export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 
-Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/directory 
-Dspark.history.retainedApplications=30"

Parameter 1: the port for web UI access is 18080.
Parameter 2: the storage path the history server reads logs from.
Parameter 3: the number of application histories to retain; once this value is exceeded, the oldest application information is deleted. This is the number of applications held in memory, not the number displayed on the page.
4) Distribute the configuration files

[aa conf]$ xsync spark-defaults.conf 
[aa conf]$ xsync spark-env.sh 

5) Start the history service

[aa spark-standalone]$ sbin/start-history-server.sh
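A quick check that the daemon is up: a HistoryServer process should now appear in jps alongside Master and Worker on hadoop102.

```bash
jps | grep HistoryServer
# 14253 HistoryServer   (PID is illustrative)
```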

6) Execute the task again

[aa spark-standalone]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.12-3.1.3.jar \
10

7) View Spark history service address: hadoop102:18080
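Besides the web page, the history server exposes a REST API (Spark's monitoring API under /api/v1), which is handy for a quick scripted check that the finished SparkPi run was recorded:

```bash
curl http://hadoop102:18080/api/v1/applications
```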

