Spark Core: RDD [DAG Generation and Stage Division]

1. Concept

A DAG (Directed Acyclic Graph) is a graph whose edges are directed and which contains no cycles.


2. DAG generation

The original RDDs are connected into a DAG by a series of transformation operations. When the job runs, the actual computation (the process in which the data is manipulated) is carried out according to the description in the DAG.
DAG boundary

  • Start: the RDDs created through SparkContext
  • End: an Action is triggered; once the Action fires, a complete DAG has been formed
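
This lazy DAG behavior can be sketched with a toy lineage model. This is a minimal pure-Python sketch, not Spark itself; the `ToyRDD` class and its methods are illustrative names, not a real API:

```python
class ToyRDD:
    """Toy stand-in for an RDD: records lineage, computes nothing until an action."""
    def __init__(self, data=None, parent=None, fn=None):
        self.data = data      # only set on the source RDD (the DAG's start)
        self.parent = parent  # upstream RDD in the lineage (a DAG edge)
        self.fn = fn          # transformation to apply per element

    def map(self, fn):
        # Transformation: returns a new node in the DAG, no computation yet.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        # Action: triggers evaluation of the whole lineage, back to the source.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.collect()]

source = ToyRDD(data=[1, 2, 3])        # "start": source RDD
mapped = source.map(lambda x: x * 10)  # the DAG grows; nothing runs yet
result = mapped.collect()              # "end": the action triggers the computation
print(result)                          # [10, 20, 30]
```

Note that `map` only records a new edge in the lineage; all work is deferred until `collect`, mirroring how a Spark DAG is only materialized into a job when an Action fires.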

3. Stage division

Why divide stages? For parallel computing.

If a complex piece of business logic contains a shuffle, then the next stage can only execute after the previous stage has produced its results; in other words, the computation of the next stage depends on the data of the previous stage. By dividing at each shuffle (that is, at each wide dependency), we can split one DAG into multiple stages. Within a single stage there are multiple operator operations, which can be fused into a pipeline, and the multiple partitions inside that stage can then be computed in parallel.
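The pipeline fusion within a single stage can be illustrated with a generator: each element flows through the whole operator chain in one pass, with no intermediate collection materialized between operators. This is a pure-Python sketch of the idea; the `pipeline` helper is hypothetical, not a Spark API:

```python
def pipeline(partition, *ops):
    """Apply a chain of narrow (per-element) operators in one pass over a partition,
    the way operators inside a single stage are pipelined."""
    for item in partition:
        for op in ops:
            item = op(item)
        yield item  # each element traverses the whole chain before the next starts

partition = [1, 2, 3, 4]
# Two chained map-like operators executed in a single pass, no intermediate list:
out = list(pipeline(partition, lambda x: x + 1, lambda x: x * x))
print(out)  # [4, 9, 16, 25]
```

Because no shuffle is needed between per-element operators, each partition can run this pipeline independently and in parallel with the others.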

How are the stages of a DAG divided?

Spark uses a backtracking algorithm to divide the DAG according to shuffle/wide dependencies. Working from back to front, it breaks the lineage whenever it meets a wide dependency, placing everything before that wide dependency into a separate stage; whenever it meets a narrow dependency, it adds the current RDD to the current stage.

  • For a narrow dependency, the partition's transformation is completed inside the stage, with no division (narrow dependencies are kept in the same stage as much as possible to enable pipelined computation)
  • For a wide dependency, because a shuffle exists (introduced by a common shuffle operator), the next computation can only start after the parent RDD has been fully processed; in other words, a stage boundary must be drawn there (the lineage is split at the wide dependency)
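
The back-to-front division described above can be sketched as a small function over a linear lineage. This is a simplified model under stated assumptions: the lineage is a straight chain (real DAGs can branch), and the RDD names in the example are hypothetical:

```python
def divide_stages(lineage):
    """Split a linear lineage into stages, cutting at wide dependencies.

    `lineage` is an ordered list of (rdd_name, dep_type) pairs, where dep_type
    says how that RDD depends on its parent: "narrow", "wide", or None for the
    source. Walks from the final RDD backwards (the backtracking approach) and
    closes the current stage each time a wide dependency (shuffle) is met.
    """
    stages = []
    current = []
    for name, dep in reversed(lineage):
        current.append(name)
        if dep == "wide":  # shuffle boundary: this RDD starts a later stage
            stages.append(list(reversed(current)))
            current = []
    if current:
        stages.append(list(reversed(current)))
    return list(reversed(stages))  # stages in execution order

# Hypothetical word-count-style lineage with one shuffle in the middle:
lineage = [
    ("textFile",    None),
    ("flatMap",     "narrow"),
    ("map",         "narrow"),
    ("shuffledRDD", "wide"),
    ("mapValues",   "narrow"),
]
print(divide_stages(lineage))
# [['textFile', 'flatMap', 'map'], ['shuffledRDD', 'mapValues']]
```

Everything before the shuffle lands in the first stage and can be pipelined; the RDD produced by the shuffle begins the next stage, which must wait for the first stage's output.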


Origin blog.csdn.net/weixin_45666566/article/details/112553957