A detailed explanation of the two deploy modes of Spark on YARN (this is important)

Introduction: When a Spark application is submitted to run on the cluster, the application architecture consists of two parts:

  • Driver Program (applies for resources and schedules Job execution)
  • Executors (run the Job's Tasks and cache its data)

Both are JVM processes.


1. Where the Driver program runs is specified with --deploy-mode:

To be clear: the Driver is the process that runs the application's main() function and creates the SparkContext.

  • client : the Driver runs on the Client host that submits the application (the default)

  • cluster : the Driver runs inside the cluster (Standalone: on a Worker; YARN: on a NodeManager)

  • The most essential difference between cluster and client mode is where the Driver program runs.
    Cluster mode is what enterprises use in real production environments.
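On the command line, the only thing that changes between the two modes is the value passed to --deploy-mode; everything else in the spark-submit invocation stays the same. A minimal sketch (the wrapper itself is illustrative, not part of the original examples):

```shell
# Illustrative sketch: build the spark-submit line for either mode;
# only the --deploy-mode value differs between the two modes.
MODE="client"                                    # switch to "cluster" for production runs
SPARK_HOME="${SPARK_HOME:-/export/server/spark}" # path used in the examples below
CMD="${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode ${MODE} \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar 10"
echo "${CMD}"
```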

1. Client mode
  • With DeployMode set to client, the application's Driver Program runs on the host of the Client that submits the application.
    Diagram:
    Testing the Pi example in client mode:
SPARK_HOME=/export/server/spark
# On YARN, cores per executor are set with --executor-cores;
# --total-executor-cores applies only to Standalone/Mesos masters.
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 512m \
--executor-memory 512m \
--num-executors 1 \
--executor-cores 2 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
10
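For context, the SparkPi example estimates π by Monte Carlo sampling: it draws random points in the unit square and counts the fraction that lands inside the quarter circle. The same computation, sketched locally with awk (no cluster needed; the sample size is arbitrary):

```shell
# Monte Carlo estimate of pi -- the same computation SparkPi distributes
# across its executors, done here in plain awk for illustration.
awk 'BEGIN {
  srand(42); n = 100000; hits = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x*x + y*y <= 1) hits++   # point falls inside the quarter circle
  }
  printf "Pi is roughly %f\n", 4 * hits / n
}'
```

SparkPi prints the same "Pi is roughly ..." line; where that line ends up (client console vs. cluster log) is exactly what differs between the two deploy modes.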

Results in YARN's web UI:

2. Cluster mode (used in production environments)
  • With DeployMode set to cluster, the application's Driver Program runs on one of the cluster's slave nodes.

Diagram:
Pi test:

SPARK_HOME=/export/server/spark
# As above, --executor-cores is the YARN flag; --total-executor-cores
# applies only to Standalone/Mesos masters.
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 512m \
--executor-memory 512m \
--num-executors 1 \
--executor-cores 2 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
10

YARN's web UI:

Click into the task entry to view its logs:
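Since the Driver runs inside the cluster in this mode, its console output (the "Pi is roughly ..." line) lands in the ApplicationMaster's container log rather than on the submitting client. It is normally retrieved with the yarn CLI; the sketch below only prints the command to run (the application id is a made-up placeholder, and log aggregation must be enabled on the cluster):

```shell
# Placeholder id -- copy the real one from YARN's web UI or from
# `yarn application -list -appStates FINISHED`.
APP_ID="application_1610000000000_0001"
CMD="yarn logs -applicationId ${APP_ID}"
echo "${CMD}"   # run this on a node with the Hadoop client configured
```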
Summary: the most essential difference between Client mode and Cluster mode is where the Driver program runs.

  • Client mode: used during testing; you only need to understand it, not develop against it
    1. The Driver runs on the Client, so the communication cost with the cluster is high
    2. The Driver's output is displayed on the client
  • Cluster mode: use this mode in production environments
    1. The Driver program runs inside the YARN cluster, so the communication cost with the cluster is low
    2. The Driver's output cannot be displayed on the client
    3. In this mode the Driver runs on the ApplicationMaster node and is managed by YARN; if it fails, YARN restarts the ApplicationMaster (Driver)
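The restart behavior in point 3 can be tuned at submission time: Spark's spark.yarn.maxAppAttempts property caps the number of ApplicationMaster attempts (it must not exceed YARN's own yarn.resourcemanager.am.max-attempts). A configuration sketch reusing the Pi example (the value 2 is illustrative):

```shell
SPARK_HOME=/export/server/spark
# Assumption for illustration: allow at most 2 AM (Driver) attempts
# before YARN gives up on the application.
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=2 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples_2.11-2.4.5.jar \
10
```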
3. Detailed flowchart of the two modes

client mode:

cluster mode:

Appendix: Roles in a Spark cluster

  • 1) Driver : a JVM process; the Spark application you write runs on, and is executed by, the Driver process.
  • 2) Master (ResourceManager) : a JVM process mainly responsible for scheduling and allocating resources, as well as monitoring the cluster.
  • 3) Worker (NodeManager) : a JVM process; one Worker runs on each server in the cluster. It has two main responsibilities: storing one or more partitions of an RDD in its own memory, and starting other processes and threads (Executors) to process and compute the RDD's partitions in parallel.
  • 4) Executor : a JVM process; multiple Executors can run on one Worker (NodeManager). An Executor computes RDD partitions in parallel by running multiple threads (Tasks), i.e. it executes the operator operations we define on RDDs, such as map, flatMap, and reduce.

Origin blog.csdn.net/m0_49834705/article/details/112557164