Deep understanding of Spark Stages

Narrow and wide dependencies

Narrow dependencies:

A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD. There are two cases: either one parent partition corresponds to exactly one child partition, or several parent partitions feed a single child partition while each parent partition still goes to only that one child. In the figure, map/filter and union belong to the first case, and joins whose inputs are co-partitioned belong to the second.
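For instance, the following minimal sketch (assuming an existing SparkContext sc) builds only narrow dependencies:

    // Narrow dependencies: each parent partition feeds at most one child partition,
    // so these transformations can be pipelined inside one stage.
    val nums    = sc.parallelize(1 to 10, 4)   // 4 partitions
    val doubled = nums.map(_ * 2)              // map: one-to-one, narrow
    val evens   = doubled.filter(_ % 4 == 0)   // filter: one-to-one, narrow
    val merged  = evens.union(nums)            // union: parent partitions carried over as-is, narrow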

Wide dependencies:

A wide dependency means that a partition of the child RDD may depend on many (possibly all) partitions of the parent RDD. This is caused by shuffle-style operations, such as groupByKey and joins whose inputs are not co-partitioned, as shown in the figure.
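By contrast, the following minimal sketch (again assuming an existing SparkContext sc) creates wide dependencies and therefore shuffles:

    // Wide dependencies: a child partition may need records from many parent partitions,
    // so Spark must shuffle and a new stage begins at each of these operations.
    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val others  = sc.parallelize(Seq(("a", "x"), ("b", "y")))
    val grouped = pairs.groupByKey()     // wide: values for one key may live in any partition
    val joined  = pairs.join(others)     // wide here, because the two inputs are not co-partitioned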

Stage:

A Job is split into multiple groups of tasks, and each group is called a Stage, analogous to a Map Stage and a Reduce Stage. Stage division is described in detail in the RDD paper. Simply put, stages come in two kinds, shuffle and result, which match Spark's two task types: ShuffleMapTask and ResultTask. The output of the first kind of task is the data needed by a shuffle; the output of the second kind is the job's result. Stages are divided on exactly this basis: all transformations before a shuffle form one stage, and the operations after the shuffle form another. For example, sc.parallelize(1 to 10).foreach(println) involves no shuffle and outputs directly, so its only tasks are ResultTasks and there is just one stage. By contrast, rdd.map(x => (x, 1)).reduceByKey(_ + _).foreach(println) has a reduce and therefore a shuffle: everything before reduceByKey is one stage, whose ShuffleMapTasks output the data the shuffle needs, and reduceByKey through the end is another stage, which outputs the result directly. If a job contains several shuffles, each shuffle boundary closes one stage and opens the next.
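Written out as runnable Scala, the two example jobs might look like this (a minimal sketch assuming an existing SparkContext sc):

    // Job 1: no shuffle, so a single stage made entirely of ResultTasks.
    sc.parallelize(1 to 10).foreach(println)

    // Job 2: reduceByKey needs a shuffle, so the job has two stages:
    //   stage 0 (ShuffleMapTasks): parallelize + map, writes the shuffle output
    //   stage 1 (ResultTasks):     reduceByKey + foreach, produces the final result
    sc.parallelize(1 to 10)
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .foreach(println)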
The DAG is divided into stages according to the dependencies between RDDs. For narrow dependencies, the partition-to-partition relationship is fixed, so the partition transformations can be pipelined and completed within a single task (one thread); Spark therefore puts consecutive narrow dependencies into the same stage. For wide dependencies, the next stage can only start computing after the parent RDD's shuffle output is complete. The tasks of such an upstream stage are called ShuffleMapTasks precisely because they must shuffle their computation results to the next stage.
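One way to see where the DAG is cut is RDD.toDebugString, which indents the lineage once per shuffle boundary (a minimal sketch; the exact output format depends on the Spark version):

    val rdd = sc.parallelize(1 to 10)
      .map(x => (x, 1))                    // narrow: pipelined with parallelize in the same stage
      .reduceByKey(_ + _)                  // wide: shuffle dependency, starts a new stage
      .map { case (k, v) => (k, v * 2) }   // narrow: pipelined with reduceByKey

    // Each extra indentation level in the printed lineage marks a shuffle,
    // i.e. a stage boundary.
    println(rdd.toDebugString)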
An example of stage division (Figure 2)

Stage division ideas

The overall idea of Spark's stage division is therefore: work backwards from the final RDD; when a wide dependency is encountered, cut the graph there and close off a stage; when a narrow dependency is encountered, add that RDD to the current stage. In Figure 2, RDD C, RDD D, RDD E, and RDD F are thus built in one stage, RDD A forms a stage of its own, and RDD B and RDD G are built in the same stage.
Spark's tasks come in two types, ShuffleMapTask and ResultTask. In short, the last stage of the DAG generates one ResultTask per result partition; in other words, the number of tasks in a stage is determined by the number of partitions of that stage's last RDD. All other stages generate ShuffleMapTasks, so called because they need to shuffle their computation results to the next stage. Put differently, stage 1 and stage 2 in Figure 2 are equivalent to the mappers in MapReduce, and stage 3, which runs ResultTasks, is equivalent to the reducer.
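The back-to-front division can be sketched in a few lines (a simplified illustration only, not Spark's DAGScheduler; the Node type and the A..G structure are assumptions based on Figure 2):

    // Hypothetical, simplified model of an RDD lineage: each node records which
    // parents it reaches through narrow dependencies and which through wide ones.
    case class Node(name: String, narrowParents: Seq[Node] = Nil, wideParents: Seq[Node] = Nil)

    // Walk backwards from the final RDD: narrow parents are folded into the current
    // stage, while each wide parent closes the stage and seeds a stage of its own.
    def buildStages(last: Node): List[List[String]] = {
      def collect(n: Node): (List[String], List[Node]) =
        n.narrowParents.map(collect).foldLeft((List(n.name), n.wideParents.toList)) {
          case ((names, wides), (pNames, pWides)) => (names ++ pNames, wides ++ pWides)
        }
      val (stage, wideParents) = collect(last)
      stage :: wideParents.flatMap(buildStages)
    }

    // The RDD A..G example from Figure 2 (assumed structure):
    val a = Node("A")
    val b = Node("B", wideParents = Seq(a))                          // A -> B: wide (groupBy)
    val c = Node("C"); val d = Node("D", narrowParents = Seq(c))     // C -> D: narrow (map)
    val e = Node("E"); val f = Node("F", narrowParents = Seq(d, e))  // D, E -> F: narrow (union)
    val g = Node("G", narrowParents = Seq(b), wideParents = Seq(f))  // join: B co-partitioned, F shuffled
    println(buildStages(g))   // List(List(G, B), List(F, D, C, E), List(A))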

Summary

map and filter are narrow dependencies;
groupByKey is a wide dependency;
a wide dependency marks a stage boundary.
