Spark Kernel Analysis (5): A Deep Analysis of Spark Task Scheduling Principles and Mechanisms

In production environments, Spark clusters are generally deployed in YARN-Cluster mode. In the kernel analysis that follows, we therefore assume the cluster is deployed in YARN-Cluster mode by default.

1. Spark task submission process

In the previous chapter, we explained the task submission process in Spark YARN-Cluster mode, as shown in the following figure:
[Figure: Spark YARN-Cluster task submission process]

The following sequence diagram illustrates the complete process of a Spark application from submission to execution:
[Figure: sequence diagram of a Spark application from submission to execution]
To submit a Spark application, the Client first sends a request to the ResourceManager to start an Application and checks whether the cluster has enough resources to meet the Application's requirements. If so, the Client prepares the startup context for the ApplicationMaster, hands it over to the ResourceManager, and then monitors the Application's status in a loop.

When the resource queue the application was submitted to has resources available, the ResourceManager starts the ApplicationMaster process on some NodeManager, and the ApplicationMaster starts the Driver as a separate background thread. Once the Driver is up, the ApplicationMaster connects to it through a local RPC and starts requesting Container resources from the ResourceManager to run Executor processes (one Executor per Container). When the ResourceManager returns Container resources, the ApplicationMaster starts an Executor on each corresponding Container.

The Driver thread mainly initializes the SparkContext object and prepares the context needed for execution. After that, on the one hand it maintains the RPC connection with the ApplicationMaster and applies for resources through it; on the other hand it starts scheduling tasks according to the user's business logic, dispatching tasks to idle Executors. When the ResourceManager returns Container resources to the ApplicationMaster, the ApplicationMaster tries to start the Executor process on the corresponding Container. Once an Executor process is up, it reverse-registers with the Driver; after registering successfully, it keeps a heartbeat with the Driver while waiting for the Driver to dispatch tasks, and it reports task status back to the Driver when each task finishes executing.

As can be seen from the sequence diagram above, the Client is only responsible for submitting the Application and monitoring its status.
Spark's task scheduling itself is concentrated in two aspects: resource application and task distribution, which are carried out mainly through interaction among the ApplicationMaster, the Driver, and the Executors.

2. Overview of Spark task scheduling

Once the Driver is up, it prepares tasks according to the logic of the user program and distributes them gradually according to the Executors' resource situation. Before elaborating on task scheduling, we first explain a few Spark concepts.

A Spark application includes three concepts: Job, Stage, and Task:

(1) A Job is bounded by Action methods: each time an Action method is encountered, a Job is triggered;

(2) A Stage is a subset of a Job, bounded by RDD wide dependencies (i.e. Shuffle): a new division is made each time a shuffle is encountered;

(3) A Task is a subset of a Stage, measured by the degree of parallelism (the number of partitions): the number of partitions equals the number of Tasks (see the sketch after this list).
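
As a concrete illustration, here is a minimal sketch, assuming an already-running SparkContext named `sc`, in which one Action produces one Job, one shuffle splits it into two Stages, and the partition count determines the number of Tasks:

```scala
// Minimal sketch, assuming an existing SparkContext `sc`.
val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2) // an RDD with 2 partitions
val pairs  = words.map(word => (word, 1))  // narrow dependency: stays in the same Stage
val counts = pairs.reduceByKey(_ + _)      // wide dependency (shuffle): Stage boundary
counts.collect()                           // Action: triggers exactly one Job
// The Job is split into 2 Stages at the shuffle; each Stage runs one Task per partition.
```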

Spark's task scheduling is generally divided into two levels: Stage-level scheduling and Task-level scheduling. The overall scheduling process is shown in the following figure:
[Figure: overall Spark task scheduling process]
Through its Transformation operations, a Spark RDD forms an RDD lineage graph, i.e. a DAG; finally, the invocation of an Action triggers a Job and schedules it for execution.
DAGScheduler is responsible for Stage-level scheduling: it mainly divides a Job into several Stages, packages each Stage into a TaskSet, and delivers it to the TaskScheduler for scheduling.
TaskScheduler is responsible for Task-level scheduling: it distributes the TaskSets handed over by the DAGScheduler to Executors for execution according to the specified scheduling strategy. During scheduling, the SchedulerBackend is responsible for providing the available resources; SchedulerBackend has multiple implementations, each connecting to a different resource management system.

3. Spark Stage-level scheduling

Spark's task scheduling starts from cutting the DAG, which is mainly done by the DAGScheduler. When an Action operation is encountered, it triggers the computation of a Job and hands it to the DAGScheduler for submission. The following figure shows the flow of method calls involved in Job submission.
[Figure: method-call flow for Job submission]
A Job is encapsulated by the final RDD and the Action method. SparkContext hands the Job to the DAGScheduler for submission; the DAGScheduler cuts the DAG based on the RDDs' lineage and divides the Job into several Stages. The concrete division strategy is to start from the final RDD and keep backtracking through dependencies, checking whether each parent dependency is a wide dependency: Stages are divided with Shuffle as the boundary, while narrowly dependent RDDs are placed into the same Stage, where pipeline-style computation is possible, as shown in the purple part of the figure above. The divided Stages fall into two categories: one is called ResultStage, the most downstream Stage of the DAG, determined by the Action method; the other is called ShuffleMapStage, which prepares data for its downstream Stages. Let's look at a simple WordCount example.
[Figure: Stage division for a WordCount example]
The Job is triggered by saveAsTextFile and consists of RDD-3 plus the saveAsTextFile method. Based on the dependencies between RDDs, a backtracking search starts from RDD-3 and proceeds until RDD-0, which depends on nothing. During the backtracking, RDD-3 depends on RDD-2 through a wide dependency, so a Stage boundary is drawn between RDD-2 and RDD-3, and RDD-3 becomes the last Stage, i.e. the ResultStage. RDD-2 depends on RDD-1 and RDD-1 depends on RDD-0, both narrow dependencies, so RDD-0, RDD-1, and RDD-2 are placed into the same Stage, i.e. a ShuffleMapStage; at actual execution time, each data record is processed in one go, transformed from RDD-0 all the way to RDD-2. It is not hard to see that this is essentially a depth-first search algorithm, as the sketch below illustrates.
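
The following self-contained sketch is a simplification, not Spark's actual DAGScheduler code: it models the RDD lineage from the figure and backtracks depth-first, treating every wide (shuffle) dependency as a Stage boundary.

```scala
// Simplified model of Stage division; not Spark's actual DAGScheduler code.
object StageDivisionSketch extends App {
  sealed trait Dependency { def parent: RddNode }
  final case class NarrowDep(parent: RddNode) extends Dependency
  final case class ShuffleDep(parent: RddNode) extends Dependency
  final case class RddNode(name: String, deps: Seq[Dependency])

  // Backtrack from the final RDD; every ShuffleDep marks a Stage boundary,
  // and the RDD behind it becomes the last RDD of a parent ShuffleMapStage.
  def parentStageBoundaries(rdd: RddNode): Seq[RddNode] = {
    val found = scala.collection.mutable.ArrayBuffer.empty[RddNode]
    def visit(node: RddNode): Unit = node.deps.foreach {
      case ShuffleDep(p) => found += p // stop here: p closes a parent Stage
      case NarrowDep(p)  => visit(p)   // same Stage: keep backtracking (DFS)
    }
    visit(rdd)
    found.toSeq
  }

  // WordCount lineage from the figure: RDD-0 -> RDD-1 -> RDD-2 =shuffle=> RDD-3
  val rdd0 = RddNode("RDD-0", Nil)
  val rdd1 = RddNode("RDD-1", Seq(NarrowDep(rdd0)))
  val rdd2 = RddNode("RDD-2", Seq(NarrowDep(rdd1)))
  val rdd3 = RddNode("RDD-3", Seq(ShuffleDep(rdd2)))

  // RDD-3 alone forms the ResultStage; RDD-0..2 form one ShuffleMapStage ending at RDD-2.
  println(parentStageBoundaries(rdd3).map(_.name)) // prints: List(RDD-2)
}
```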

Whether a Stage can be submitted depends on whether its parent Stages have finished executing: the current Stage can only be submitted after its parent Stages have completed, and if a Stage has no parent Stage, submission starts from that Stage. When a Stage is submitted, its Task information (partition information, methods, and so on) is serialized and packaged into a TaskSet that is handed to the TaskScheduler; one Partition corresponds to one Task. Meanwhile, the TaskScheduler monitors the running status of the Stage: only when an Executor is lost or a Task fails because of a Fetch failure does the failed Stage need to be resubmitted to reschedule the failed tasks; other kinds of Task failure are retried during the TaskScheduler's own scheduling process.
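
A hedged sketch of that parent-first submission logic follows. It is modeled loosely on the recursive shape of DAGScheduler.submitStage, with one deliberate simplification noted in the comments:

```scala
// Hedged sketch of parent-first Stage submission; in real Spark the DAGScheduler
// waits for parent Stages to *complete*, while here submission stands in for
// completion to keep the sketch short.
final case class StageNode(id: Int, parents: Seq[StageNode]) {
  var finished: Boolean = false
}

def submitStage(stage: StageNode): Unit = {
  val missingParents = stage.parents.filterNot(_.finished)
  if (missingParents.isEmpty) {
    // No unfinished parents: the Stage's Tasks would now be serialized into a
    // TaskSet (one Task per Partition) and handed to the TaskScheduler.
    println(s"submitting Stage ${stage.id}")
    stage.finished = true
  } else {
    missingParents.foreach(submitStage) // submit parents first
    submitStage(stage)                  // then submit this Stage
  }
}
```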

Relatively speaking, the DAGScheduler's job is simple: it divides the DAG at the Stage level, submits Stages, and monitors related status information. The TaskScheduler is more complex; it is explained in detail below.

4. Spark Task-level scheduling

Spark Task scheduling is done by the TaskScheduler. As we saw above, the DAGScheduler packages Stages into TaskSets and hands them to the TaskScheduler, which encapsulates each TaskSet as a TaskSetManager and adds it to the scheduling queue. The structure of TaskSetManager is shown in the figure below.
[Figure: TaskSetManager structure]
The TaskSetManager is responsible for monitoring and managing the Tasks of a single Stage; the TaskScheduler schedules tasks with the TaskSetManager as its unit.

After the TaskScheduler is initialized, it starts the SchedulerBackend, which is responsible for dealing with the outside world: it receives Executor registrations and maintains Executor state. The SchedulerBackend is thus the one holding the "rations": after starting, it periodically "asks" the TaskScheduler whether there are tasks to run, in effect saying "I have this much spare capacity, do you want it?". When "asked" by the SchedulerBackend, the TaskScheduler selects TaskSetManagers from the scheduling queue according to the specified scheduling strategy and schedules them to run. The general method-call flow is shown in the figure below:
[Figure: method-call flow from TaskSetManager enqueue to task launch]
In the figure above, after the TaskSetManager is added to the rootPool scheduling pool, the reviveOffers method of the SchedulerBackend is called to send a ReviveOffers message to the driverEndpoint. Upon receiving the ReviveOffers message, the driverEndpoint calls the makeOffers method and filters out the active Executors (all of these Executors reverse-registered with the Driver when the task started); it then wraps each Executor into a WorkerOffer object. With the computing resources (WorkerOffers) prepared, the taskScheduler calls resourceOffers on these resources to allocate tasks onto the Executors.
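
A loose sketch of that offer round trip is shown below. The names follow the article's description rather than Spark's actual code, and `resourceOffers` here is just a function parameter standing in for the TaskScheduler:

```scala
// Loose sketch of the offer round trip; names follow the article, not Spark's code.
final case class WorkerOffer(executorId: String, host: String, freeCores: Int)
final case class ExecutorData(host: String, freeCores: Int, alive: Boolean)

// `resourceOffers` stands in for TaskScheduler.resourceOffers: it maps the
// prepared WorkerOffers to descriptions of the tasks to launch.
def makeOffers(executors: Map[String, ExecutorData],
               resourceOffers: Seq[WorkerOffer] => Seq[String]): Seq[String] = {
  // 1. Keep only active Executors (the ones that reverse-registered with the Driver).
  val active = executors.filter { case (_, data) => data.alive }
  // 2. Wrap each active Executor's spare capacity into a WorkerOffer.
  val offers = active.map { case (id, data) => WorkerOffer(id, data.host, data.freeCores) }.toSeq
  // 3. Hand the offers to the TaskScheduler, which allocates tasks onto them.
  resourceOffers(offers)
}
```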

4.1 Scheduling strategy

As mentioned earlier, the TaskScheduler first encapsulates each TaskSet handed over by the DAGScheduler into a TaskSetManager and throws it into the task queue, then takes TaskSetManagers out of the queue according to certain rules and runs them on the Executors provided by the SchedulerBackend. This scheduling process is actually still fairly coarse-grained, being oriented towards TaskSetManagers.

The hierarchy of the scheduling queue is shown in the following figure:
[Figure: hierarchy of the scheduling queue (tree of Pools and TaskSetManagers)]
The TaskScheduler manages the task queue as a tree: the node type in the tree is Schedulable, the leaf nodes are TaskSetManagers, and the non-leaf nodes are Pools. The following figure shows the inheritance relationship between them.
[Figure: inheritance relationship between Schedulable, Pool, and TaskSetManager]
TaskScheduler supports two scheduling strategies:

(1) FIFO, which is also the default scheduling strategy;

(2) FAIR.

During the initialization of the TaskScheduler, the rootPool is instantiated; it represents the root node of the tree and is of type Pool.

1. FIFO scheduling strategy

The execution steps of the FIFO scheduling strategy are as follows:
(1) First compare the priorities of the two Schedulables s1 and s2 (priority is an attribute of the Schedulable trait; the smaller the value, the higher the priority);

(2) If the two priorities are the same, compare the ids of the Stages to which s1 and s2 belong (stageId, another attribute of the Schedulable trait; again, the smaller the value, the higher the priority);

(3) If the result of the comparison is less than 0, s1 is scheduled first; otherwise s2 is scheduled first.
[Figure: FIFO scheduling comparison flow]
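
The FIFO comparison just described can be sketched as follows; this mirrors the logic of Spark's FIFOSchedulingAlgorithm in simplified form:

```scala
// Sketch of the FIFO comparison between two schedulable entities.
final case class FifoEntity(priority: Int, stageId: Int)

def fifoRunsFirst(s1: FifoEntity, s2: FifoEntity): Boolean = {
  val byPriority = s1.priority - s2.priority // smaller priority value (earlier Job) wins
  val res =
    if (byPriority == 0) s1.stageId - s2.stageId // tie: smaller stageId wins
    else byPriority
  res < 0 // a negative result means s1 is scheduled before s2
}
```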

2. FAIR scheduling strategy

The tree structure used by the FAIR scheduling strategy is shown in the following figure:
[Figure: tree structure of the FAIR scheduling strategy]
In FAIR mode, there is one rootPool and multiple sub-Pools, and each sub-Pool stores all of the TaskSetManagers waiting to be allocated.

You can designate a pool within the scheduling pools as the parent pool of a TaskSetManager by setting the spark.scheduler.pool property in the Properties. If the root scheduling pool has no pool corresponding to that property value, a pool named after the value is created and used as the TaskSetManager's parent pool, and that pool is registered as a child of the root scheduling pool.
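
For example, jobs can be routed to a named pool from user code with the standard setLocalProperty API (the pool name "production" below is illustrative):

```scala
// Assumes an existing SparkContext `sc`; "production" is an illustrative pool name.
sc.setLocalProperty("spark.scheduler.pool", "production")
// ... jobs submitted from this thread now go to the "production" pool ...
sc.setLocalProperty("spark.scheduler.pool", null) // back to the default pool
```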

In FAIR mode, the sub-Pools are sorted first, and then the TaskSetManagers inside each sub-Pool are sorted. Because Pool and TaskSetManager both extend the Schedulable trait, the same sorting algorithm is used for both.

The comparison during sorting is based on fair shares. Each object to be sorted carries three attributes: a runningTasks value (the number of running Tasks), a minShare value, and a weight value; the comparison takes runningTasks, minShare, and weight into account together.

Note: the values of minShare and weight are specified in the fair-scheduling configuration file fairscheduler.xml, which the scheduling pool reads during its construction phase.
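
A minimal fairscheduler.xml, following the format documented for Spark's fair scheduler (the pool name and values here are illustrative):

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>       <!-- relative weight of this pool -->
    <minShare>3</minShare>   <!-- minimum share, as a number of CPU cores -->
  </pool>
</allocations>
```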

(1) If object A's runningTasks is greater than its minShare while object B's runningTasks is less than its minShare, then B is ranked before A (whichever has runningTasks below its minShare executes first);

(2) If the runningTasks of both A and B are less than their minShares, compare the ratio of runningTasks to minShare (the minShare usage): whichever is smaller ranks first (lower minShare usage executes first);

(3) If the runningTasks of both A and B are greater than their minShares, compare the ratio of runningTasks to weight (the weight usage): whichever is smaller ranks first (lower weight usage executes first);

(4) If all the comparisons above come out equal, compare names. Overall, the comparison process is controlled by the two parameters minShare and weight, so that entities with low minShare usage and low weight usage (i.e. a relatively small proportion of actually running tasks) run first. (A code sketch of these rules follows the next paragraph.)

After FAIR-mode sorting is complete, all TaskSetManagers are put into an ArrayBuffer and then taken out in turn and sent to Executors for execution.
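
The FAIR comparison rules (1)-(4) above can be sketched like this, mirroring Spark's FairSchedulingAlgorithm in simplified form:

```scala
// Sketch of the FAIR comparison; fields mirror the three attributes described above.
final case class FairEntity(name: String, runningTasks: Int, minShare: Int, weight: Int)

def fairRunsFirst(s1: FairEntity, s2: FairEntity): Boolean = {
  val s1Needy = s1.runningTasks < s1.minShare
  val s2Needy = s2.runningTasks < s2.minShare
  val minShareRatio1 = s1.runningTasks.toDouble / math.max(s1.minShare, 1) // minShare usage
  val minShareRatio2 = s2.runningTasks.toDouble / math.max(s2.minShare, 1)
  val taskToWeight1  = s1.runningTasks.toDouble / s1.weight                // weight usage
  val taskToWeight2  = s2.runningTasks.toDouble / s2.weight

  if (s1Needy && !s2Needy) true                                           // rule (1)
  else if (!s1Needy && s2Needy) false                                     // rule (1), mirrored
  else if (s1Needy && s2Needy) minShareRatio1 < minShareRatio2            // rule (2)
  else if (taskToWeight1 != taskToWeight2) taskToWeight1 < taskToWeight2  // rule (3)
  else s1.name < s2.name                                                  // rule (4): name tie-break
}
```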

After a TaskSetManager is obtained from the scheduling queue, the next step belongs to the TaskSetManager itself: since it encapsulates all the Tasks of one Stage and is responsible for managing and scheduling them, it takes the Tasks out one by one according to certain rules and gives them to the TaskScheduler, which in turn passes them to the SchedulerBackend to be sent to Executors for execution.

4.2 Locality-aware scheduling

The DAGScheduler cuts the Job and divides the Stages, then submits the tasks corresponding to a Stage by calling submitStage, which in turn calls submitMissingTasks; submitMissingTasks determines the preferredLocations of each task that needs to be computed. A partition's preferred locations are obtained by calling getPreferredLocations(); since one partition corresponds to one task, the partition's preferred locations are also that task's preferred locations. For each task in the TaskSet submitted to the TaskScheduler, the task's preferred locations agree with the preferred locations of its corresponding partition.

After getting a TaskSetManager from the scheduling queue, the next step is for the TaskSetManager to take its tasks out one by one according to certain rules and give them to the TaskScheduler, which then hands them to the SchedulerBackend to be sent to Executors for execution. As mentioned earlier, a TaskSetManager encapsulates all the tasks of one Stage and is responsible for managing and scheduling them.

Each task's locality level is determined from its preferred locations. There are five locality levels, listed from highest to lowest priority:
PROCESS_LOCAL (data in the same Executor process), NODE_LOCAL (data on the same node), NO_PREF (no locality preference), RACK_LOCAL (data on the same rack), and ANY (data may be anywhere).
During scheduling, Spark always tries to start each task at its highest possible locality level. When a task could start at locality level X but all nodes at that level have no free resources, the locality level is not lowered immediately: within a certain tolerance period, Spark keeps trying to start the task at level X, and only once that period is exceeded does it degrade to the next locality level and try again, and so on. By increasing the maximum tolerated delay for each level, the corresponding Executor may free up resources during the waiting period and run the task, which improves running performance to a certain extent.
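
Those per-level tolerances are controlled by standard Spark configuration properties; the values below are illustrative, not recommendations:

```scala
// Illustrative SparkConf settings; the wait values are examples only.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "6s")          // base tolerance before downgrading a level (default 3s)
  .set("spark.locality.wait.process", "6s")  // PROCESS_LOCAL -> NODE_LOCAL tolerance
  .set("spark.locality.wait.node", "6s")     // NODE_LOCAL -> RACK_LOCAL tolerance
  .set("spark.locality.wait.rack", "6s")     // RACK_LOCAL -> ANY tolerance
```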

4.3 Failure retry and blacklist mechanism

Besides selecting suitable tasks to schedule and run, the TaskScheduler also has to monitor task execution. As mentioned earlier, the SchedulerBackend is the component that deals with the outside world: after a task is submitted to an Executor and starts executing, the Executor reports the execution status to the SchedulerBackend, which tells the TaskScheduler; the TaskScheduler finds the TaskSetManager corresponding to the Task and notifies it, so that the TaskSetManager knows the Task's failure or success status. For a failed Task, its failure count is recorded: if the count has not yet exceeded the maximum number of retries, the Task is put back into the pool of Tasks waiting to be scheduled; otherwise the whole Application fails.

While recording a Task's failure count, Spark also records the Executor Id and Host of its last failure, so that the next time this Task is scheduled, a blacklist mechanism prevents it from being scheduled onto the node where it last failed, providing a degree of fault tolerance. The blacklist records the Executor Id and Host of the Task's last failure together with a corresponding "blackout" time, meaning that during this period the Task will not be scheduled on that node.
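
The retry and blacklist behavior is governed by standard Spark properties; the values below are illustrative, and note that the property names shown are the Spark 2.x/3.0 ones (later versions renamed the blacklist settings to spark.excludeOnFailure.*):

```scala
// Illustrative settings for failure retry and the blacklist mechanism.
import org.apache.spark.SparkConf

val faultConf = new SparkConf()
  .set("spark.task.maxFailures", "4")     // retries per task before the whole Application fails
  .set("spark.blacklist.enabled", "true") // enable the blacklist mechanism
  .set("spark.blacklist.timeout", "1h")   // how long a failing node/executor stays "blacked out"
```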
