Spark architecture and operating mechanism (5) - RDD directed acyclic graph splitting

    A Spark application is decomposed into multiple jobs (Job) that are submitted to the Spark cluster for processing. However, a job is not the smallest unit of computation into which the application is split. After receiving a job, Spark splits and plans it in two further steps. First, the job is divided into the smallest processing units, tasks (Task), according to the RDD transformation operations; second, the tasks are planned and grouped into stages (Stage), each containing multiple tasks. Both steps are performed by the DAGScheduler instance, whose input is the "RDD directed acyclic graph" and whose output is "a series of tasks". This step is called "RDD directed acyclic graph splitting".
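For reference, a job is created each time an action is invoked on an RDD; transformations alone do not trigger any computation. A minimal sketch, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 100, 4)

// Transformations only build the RDD DAG; nothing runs yet
val pairs = rdd.map(x => (x % 10, x))
val sums  = pairs.reduceByKey(_ + _)

// The action submits a job; the DAGScheduler then splits the
// RDD DAG of this job into stages and tasks
sums.collect()
```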
    Each RDD can be divided into smaller data blocks that are processed on different computing nodes; such data blocks are called partitions. In the RDD DAG, a child RDD has a dependency relationship with its parent RDD. Depending on how the partitions of the parent and child RDDs correspond to each other, the dependencies between RDDs are divided into two types: narrow dependencies and wide dependencies.
    In the figure below, each rectangle represents an RDD and each oval block represents a partition:

  1. Narrow dependency: during the RDD transformation, a partition of the parent RDD is used by at most one partition of the child RDD, such as map and union in (a).
  2. Wide dependency: during the RDD transformation, a partition of the parent RDD is used by multiple partitions of the child RDD, such as groupByKey in (b); both cases can be verified in code, as in the sketch below.
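The dependency type of an RDD can be inspected directly through its `dependencies` field. A minimal sketch in spark-shell, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 100, 4)

// map is a narrow dependency: each child partition reads exactly one parent partition
val mapped = rdd.map(_ * 2)
mapped.dependencies.foreach(d => println(d.getClass.getSimpleName)) // OneToOneDependency

// groupByKey is a wide dependency: a child partition may read from every parent partition
val grouped = rdd.map(x => (x % 3, x)).groupByKey()
grouped.dependencies.foreach(d => println(d.getClass.getSimpleName)) // ShuffleDependency
```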


The difference between narrow dependencies and wide dependencies
Narrow dependencies: partitions are transformed one-to-one, so the transformation can be executed entirely within a single computing node, and a chain of such narrow dependencies can be pipelined within one node. If the narrow dependency is many-to-one (multiple parent RDD partitions corresponding to one child RDD partition), the transformations can still be executed in parallel across multiple nodes without affecting each other. When a partition of a narrowly dependent child RDD fails during computation, only the corresponding parent RDD partition needs to be retrieved to recover that partition of the child RDD;

Wide dependencies: the child RDD can only be computed after all partition data of the parent RDD is available, which inevitably brings problems such as network overhead and intermediate result storage. The cost of recovering from a computation error is also much higher than that of narrow dependencies;
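The pipelining of narrow dependencies and the cut caused by a wide dependency can be observed with `toDebugString`, which prints the RDD lineage with shuffle boundaries shown by indentation. A minimal sketch, again assuming a SparkContext `sc`:

```scala
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

// Two consecutive narrow transformations: pipelined, no shuffle in between
val narrowChain = rdd.mapValues(_ + 1).filter(_._2 > 1)

// A wide transformation: forces a shuffle, i.e. a stage boundary
val wide = narrowChain.groupByKey()

// Indentation changes in the output mark the shuffle (stage) boundaries
println(wide.toDebugString)
```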

    Because of this difference between narrow and wide dependencies, after Spark splits the application into tasks, it does not allocate the tasks corresponding to the transformation operations one by one. Instead, it adds a task-planning step that combines tasks that fit together into a single stage.
    This process is completed by the DAGScheduler instance. The principle is:
    if the child RDD depends narrowly on its parent RDD, multiple operator operations are processed together and a single synchronization is performed at the end, which both removes a lot of global synchronization and avoids storing many intermediate results.
    If the dependency is wide, the RDDs are divided into different stages as far as possible, to avoid excessive network transmission and computation overhead.

    To achieve this, after the application submits a job to Spark, the DAGScheduler traverses the RDD DAG. During the traversal, consecutive narrowly dependent RDD transformations are put into the same stage as far as possible; whenever a wide dependency is encountered, a new stage is generated.
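The splitting rule can be illustrated with a small self-contained sketch. This is not Spark's actual DAGScheduler code; it is a simplified model in which a node records only its parents and whether the edge to each parent is a wide (shuffle) dependency:

```scala
// A simplified model of an RDD node: its parents, plus a flag per edge
// marking whether that edge is a wide (shuffle) dependency.
case class Node(name: String, parents: List[(Node, Boolean)] = Nil)

// Walk the DAG backwards from the final RDD: narrow edges keep the parent
// in the current stage, wide edges start a new stage rooted at the parent.
def splitStages(finalNode: Node): Map[String, List[String]] = {
  val stages = scala.collection.mutable.Map[String, List[String]]()
  def visit(node: Node, stage: String): Unit = {
    stages(stage) = node.name :: stages.getOrElse(stage, Nil)
    node.parents.foreach {
      case (p, true)  => visit(p, s"stage-of-${p.name}") // wide: new stage
      case (p, false) => visit(p, stage)                 // narrow: same stage
    }
  }
  visit(finalNode, s"stage-of-${finalNode.name}")
  stages.toMap
}
```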
    In the figure below, the transformation from A to B is a wide dependency, so stage1 is generated; the transformations from C to D and from D and E to F are narrow dependencies, so they are all put into stage2; the transformations from B and F to G are wide dependencies, so stage3 is generated.
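A usage sketch of the toy model above. Note one subtlety: in the classic version of this figure, B is already hash-partitioned when it reaches the join, so that edge does not trigger another shuffle; marking only A to B and F to G as shuffle edges reproduces the three stages described:

```scala
val a = Node("A")
val c = Node("C")
val e = Node("E")
val b = Node("B", List((a, true)))              // A -> B: shuffle, ends stage1
val d = Node("D", List((c, false)))             // C -> D: narrow
val f = Node("F", List((d, false), (e, false))) // D, E -> F: narrow
val g = Node("G", List((b, false), (f, true)))  // F -> G: shuffle; B is read across the boundary

splitStages(g).foreach(println)
// three stages: {A} (stage1), {C, D, E, F} (stage2), {B, G} (stage3)
```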


    Through this process, the DAGScheduler divides the dependency chain: the entire chain is split into multiple stages, and each stage is a set of tasks that are related to each other but have no shuffle dependencies among themselves, called a "TaskSet". Each TaskSet contains multiple tasks.
    The DAGScheduler determines the number of tasks according to the number of partitions: one partition corresponds to one task, and tasks in the same stage are executed in parallel. The relationship between job (Job), stage (Stage), task set (TaskSet) and task (Task) is shown in the following figure:
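The one-partition-one-task rule can be observed in the Spark UI. A minimal sketch, assuming `sc` exists; the partition count of 8 is an arbitrary choice:

```scala
// 8 partitions -> the resulting stage runs 8 parallel tasks
val rdd = sc.parallelize(1 to 1000, 8)
println(rdd.getNumPartitions) // 8

// count() submits one job; its single stage contains 8 tasks,
// visible in the Spark UI under that job's stage
rdd.count()
```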




DAGScheduler also maintains three collections to track the execution status of stages:

1. waitingStages collection: analogous to the parent-child dependencies between RDDs, in a series of mutually dependent Stages the subsequent Stage is called the child Stage and the Stage it depends on is called the parent Stage. If a Stage's parent Stages have not yet completed, the waitingStages collection records that child Stage;

2. runningStages collection: to avoid submitting a Stage repeatedly, the runningStages collection saves the Stages that are currently executing;

3. failedStages collection: saves the Stages that have failed.
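A simplified sketch of how these three collections might be used when submitting stages. This mirrors the idea rather than Spark's actual DAGScheduler code; `Stage`, `parents`, and the `finished` set are assumed placeholder definitions:

```scala
import scala.collection.mutable

// Placeholder stage: an id plus the parent stages it depends on
case class Stage(id: Int, parents: List[Stage] = Nil)

val waitingStages = mutable.Set[Stage]() // parents not finished yet
val runningStages = mutable.Set[Stage]() // currently executing
val failedStages  = mutable.Set[Stage]() // failed, awaiting resubmission
val finished      = mutable.Set[Stage]() // completed stages

// Submit a stage only when all parents are done; otherwise park it
// in waitingStages and submit the missing parents first.
def submitStage(stage: Stage): Unit = {
  if (runningStages(stage) || finished(stage)) return // avoid resubmission
  val missing = stage.parents.filterNot(finished)
  if (missing.isEmpty) {
    waitingStages -= stage
    runningStages += stage
    println(s"launching stage ${stage.id}")
  } else {
    waitingStages += stage
    missing.foreach(submitStage)
  }
}
```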


    According to the running status of the Stages, the DAGScheduler schedules all Stages to be submitted to the cluster. The DAGScheduler assigns each Stage a StageID that indicates the Stage's priority: the smaller the StageID, the higher the priority.
    The DAGScheduler traverses the RDD dependency chain in reverse, so the FinalStage generated from the last RDD has the smallest StageID and is submitted first.
    Taking WordCount as an example, Spark judges starting from the last RDD of the dependency chain. When the traversal reaches the ShuffledRDD generated by reduceByKey, Spark cuts the dependency chain once. Since WordCount contains only this one wide dependency, the DAGScheduler finally splits WordCount into 2 Stages.
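A minimal WordCount sketch showing the single cut. The input path is a placeholder; `toDebugString` prints the lineage, where the indentation change at the ShuffledRDD marks the boundary between the two stages:

```scala
// "input.txt" is a placeholder path
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))   // narrow
  .map(word => (word, 1))  // narrow: pipelined with flatMap in stage 1
  .reduceByKey(_ + _)      // wide: the ShuffledRDD starts stage 2

println(counts.toDebugString) // the ShuffledRDD line marks the stage boundary
counts.collect().foreach(println)
```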

