Spark execution process (detailed)

General execution process
1. The driver executes the main method (transformations are lazy); an action operator triggers the job
2. The job is divided into stages according to wide dependencies (shuffle boundaries)
3. Each stage is organized into a TaskSet (containing multiple tasks)
4. Each task is distributed to a specific Executor for execution
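The stage division in step 2 can be illustrated with a toy sketch (plain Python, not Spark internals): walk a linear lineage of transformations and start a new stage at every wide (shuffle) dependency. The operator names and `wide` flags below are illustrative assumptions.

```python
# Toy sketch (not Spark's real implementation): split a linear lineage
# of transformations into stages at wide (shuffle) dependencies.

def split_into_stages(lineage):
    """lineage: list of (op_name, is_wide). A wide dependency starts a new stage."""
    stages, current = [], []
    for op, is_wide in lineage:
        if is_wide and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

lineage = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("reduceByKey", True),   # shuffle -> stage boundary
    ("map", False),
    ("sortByKey", True),     # shuffle -> stage boundary
]
print(split_into_stages(lineage))
# two shuffles -> two boundaries -> three stages
```

Each inner list corresponds to one stage whose narrow transformations can be pipelined into the same tasks.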
 
Complete scheduling process
 
 

1. When the Driver starts, it creates a DAGScheduler and a TaskScheduler during initialization

2. When the TaskScheduler is initialized, it creates a SchedulerBackend (mainly responsible for communication with the cluster)

3. The SchedulerBackend communicates with the ApplicationMaster, telling it how many Executors to start

4. The ApplicationMaster then applies to the ResourceManager for resources

5. The corresponding Executors are then started

6. Each Executor registers, through its ExecutorBackend, with the SchedulerBackend in the Driver, until all Executors are registered

7. The Driver starts executing the main function. When it encounters an action operator, a job is triggered; the DAGScheduler divides the job into stages according to wide dependencies and creates a TaskSet for each stage. The DAGScheduler then sends each TaskSet to the TaskScheduler for maintenance; the TaskScheduler wraps it in a TaskSetManager, which corresponds one-to-one with the TaskSet

8. The TaskScheduler uses a scheduling strategy (e.g. FIFO or FAIR) to decide which task executes next, and on which Executor it runs

9. The SchedulerBackend then communicates with the corresponding ExecutorBackend, telling that Executor which task to execute

10. The Executor then starts executing the task

11. During execution, the Executor continuously sends heartbeats to the HeartbeatReceiver in the Driver, so the Driver can monitor in real time whether the Executor is alive.

12. At the same time, while tasks run, the Executor reports each task's running status to the SchedulerBackend through the ExecutorBackend

13. The SchedulerBackend forwards these status updates to the TaskScheduler, so the TaskScheduler always knows the state of the tasks currently running on each Executor. If a task fails, the TaskScheduler reschedules it and selects another Executor to run it.
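The retry behavior in step 13 can be simulated with a toy sketch (made-up structure and names, not Spark code): when an executor reports a task failure, the scheduler reassigns the task to a different executor, up to a retry limit.

```python
# Toy sketch of task retry (made-up structure, not Spark internals):
# when an executor reports a task failure, reschedule the task on a
# different executor, up to a retry limit.

def run_with_retry(task, executors, fails_on, max_retries=3):
    """fails_on: set of executors where this task fails (simulated)."""
    attempts = []
    for attempt in range(max_retries + 1):
        exec_id = executors[attempt % len(executors)]  # naive rotation
        attempts.append(exec_id)
        if exec_id not in fails_on:
            return ("SUCCESS", attempts)
        # failure is reported back via ExecutorBackend -> SchedulerBackend;
        # the scheduler picks another executor on the next iteration
    return ("FAILED", attempts)

status, attempts = run_with_retry("t0", ["exec-1", "exec-2"], fails_on={"exec-1"})
print(status, attempts)   # exec-1 fails once, exec-2 succeeds
```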

To put it simply: an action operator triggers the job; the DAGScheduler divides it into stages according to wide dependencies, determines the number of tasks from the data (partitions) processed in each stage, and creates the corresponding TaskSet; the TaskSet is handed to the TaskScheduler for scheduling, which, through the SchedulerBackend (used for interaction between the Driver and the other components), dispatches each task to an Executor for execution. How many tasks a given Executor executes is determined by the TaskScheduler.
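Steps 8-9 above can be sketched as a toy FIFO assignment loop (illustrative only, not Spark's TaskScheduler): tasks are taken in submission order, and each is placed on the executor with the most free cores.

```python
# Toy FIFO task assignment (illustrative, not Spark's real scheduler):
# take tasks in submission order and place each one on the executor
# with the most free cores.

def assign_fifo(tasks, executor_cores):
    """tasks: list of task ids; executor_cores: dict executor -> free cores."""
    free = dict(executor_cores)
    placement = {}
    for task in tasks:
        # choose the executor with the most free cores (ties: name order)
        exec_id = max(sorted(free), key=lambda e: free[e])
        if free[exec_id] == 0:
            break                 # no capacity left; remaining tasks wait
        placement[task] = exec_id
        free[exec_id] -= 1
    return placement

print(assign_fifo(["t0", "t1", "t2"], {"exec-1": 2, "exec-2": 1}))
```

A real scheduler additionally weighs data locality (preferring executors that hold the task's partition), which this sketch ignores.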

 

Detailed explanation of each related component
Driver
The Spark driver node executes the main method of the Spark application and is responsible for running the actual user code. The Driver is mainly responsible for:
1. Converting the user program into jobs
2. Scheduling tasks among Executors
3. Tracking the execution status of Executors
4. Displaying the status of running queries through the UI
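The "convert the user program into jobs" behavior in point 1 hinges on lazy evaluation, which can be sketched with a toy pipeline (plain Python, not the real Spark API): transformations only record a plan, and nothing runs until an action is called.

```python
# Toy sketch of lazy evaluation (plain Python, not the real Spark API):
# map/filter only record the plan; collect() (an "action") executes it.

class LazyDataset:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []            # recorded transformations

    def map(self, f):                     # transformation: lazy
        return LazyDataset(self.data, self.plan + [("map", f)])

    def filter(self, f):                  # transformation: lazy
        return LazyDataset(self.data, self.plan + [("filter", f)])

    def collect(self):                    # action: triggers execution
        out = list(self.data)
        for kind, f in self.plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

ds = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(len(ds.plan))         # plan recorded, nothing computed yet
print(ds.collect())         # execution happens only here
```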
 
Executor
Each Executor is a JVM process responsible for running the specific tasks of a Spark job; the tasks are independent of one another. Executors are started when the Spark application starts and live for the entire lifetime of the application. If an Executor fails, the Spark application can continue to execute: tasks on the failed node are rescheduled to other Executor nodes.
Executor core functions:
1. Running the tasks that make up the Spark application and returning their results to the Driver process;
2. Providing in-memory storage, via its Block Manager, for RDDs that the user program asks to cache. Because the RDD is cached directly inside the Executor process, tasks can make full use of the cached data at runtime to speed up computation.
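Point 2 can be illustrated with a toy in-memory partition cache (illustrative only; `PartitionCache` and its methods are made up, not the BlockManager API): the first access computes and stores each partition, later accesses are served from memory.

```python
# Toy in-memory partition cache (made-up class, not Spark's BlockManager):
# analogous to what an Executor provides for RDDs marked with cache()/persist().

class PartitionCache:
    def __init__(self):
        self._store = {}          # (rdd_id, partition_index) -> data
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, rdd_id, partition, compute):
        key = (rdd_id, partition)
        if key in self._store:
            self.hits += 1        # cached: skip recomputation
            return self._store[key]
        self.misses += 1
        data = compute(partition) # compute the partition, then cache it
        self._store[key] = data
        return data

cache = PartitionCache()
square = lambda p: [x * x for x in range(p * 3, (p + 1) * 3)]
first = [cache.get_or_compute("rdd1", p, square) for p in range(2)]
again = [cache.get_or_compute("rdd1", p, square) for p in range(2)]
print(first == again, cache.misses, cache.hits)   # second pass is all hits
```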
 
Master: a process (in Spark standalone mode) mainly responsible for resource scheduling and allocation, as well as cluster monitoring.
 
Worker: a process that runs on a server in the cluster. It has two main responsibilities: storing one or more partitions of an RDD in its own memory, and starting other processes and threads (Executors) to process and compute RDD partitions in parallel.
 
YARN Cluster mode
1. After the application is submitted, the client communicates with the ResourceManager to request starting an ApplicationMaster; the ResourceManager allocates a container and starts the ApplicationMaster on a suitable NodeManager. In cluster mode, the Driver runs inside this ApplicationMaster;
2. After the Driver starts, the ApplicationMaster applies to the ResourceManager for Executor resources. When the ResourceManager receives the request, it allocates containers, and Executor processes are started on suitable NodeManagers. Once started, each Executor registers back with the Driver;
3. After all Executors have registered, the Driver executes the main function. When an action operator is reached, a job is triggered and divided into stages according to wide dependencies; each stage generates a corresponding TaskSet, whose tasks are then distributed to the Executors for execution.
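For reference, a typical YARN cluster-mode submission looks like the following (a sketch: the class name, jar, paths, and resource sizes are illustrative placeholders; the flags themselves are standard spark-submit options):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  wordcount.jar hdfs:///input hdfs:///output
```

With --deploy-mode cluster, the Driver runs inside the ApplicationMaster container on the cluster, matching step 1 above.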
 
Supplement
Spark's task scheduling mainly focuses on two aspects: resource application and task distribution.
A Job is bounded by action methods: a Job is triggered whenever an action is encountered;
A Stage is a subset of a Job, bounded by wide RDD dependencies (i.e. shuffles): each shuffle introduces a stage division;
A Task is a subset of a Stage, measured by the degree of parallelism (the number of partitions): there are as many tasks as partitions. One task corresponds to one RDD partition; if the data comes from HDFS, one partition's data is one input split's data.
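The Job/Stage/Task counting above can be made concrete with a tiny sketch (an illustrative assumption, not Spark internals): one task per partition per stage, so the job's total task count is the sum over stages.

```python
# Toy sketch (illustrative, not Spark internals): one task per partition
# of each stage's final RDD, so total tasks = sum over stages.

def tasks_per_stage(stage_partitions):
    """stage_partitions: list of partition counts, one per stage."""
    return {f"stage-{i}": n for i, n in enumerate(stage_partitions)}

# e.g. a 2-stage job: 128 HDFS input splits, then 16 post-shuffle partitions
plan = tasks_per_stage([128, 16])
print(plan, "total tasks:", sum(plan.values()))
```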
 

Origin blog.csdn.net/shenyuye/article/details/107789973