Spark Basic Study Notes 19: RDD Dependency and Stage Division

Zero, the learning objectives of this lecture

  1. Understand the wide and narrow dependencies of RDD
  2. Understand that Spark divides computation into multiple stages based on DAG

1. Dependency of RDD

  • In Spark, each transformation on an RDD generates a new RDD. Because of the RDD's lazy-evaluation property, the new RDD depends on the original RDD, so RDDs form pipeline-like front-to-back dependencies. These dependencies come in two types: narrow dependencies and wide dependencies.
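  • As a quick check of this dependency chain, the sketch below (assuming a spark-shell session, where `sc` is the predefined SparkContext) builds a child RDD and inspects the dependency it records on its parent; `dependencies` and `toDebugString` are standard RDD methods.

```scala
// Minimal sketch, assuming spark-shell (sc = SparkContext).
val rdd1 = sc.parallelize(1 to 10, 2)  // parent RDD with 2 partitions
val rdd2 = rdd1.map(_ * 2)             // transformation -> new child RDD

// Nothing is computed yet (lazy evaluation); only the lineage is recorded.
println(rdd2.dependencies)             // e.g. List(org.apache.spark.OneToOneDependency@...)
println(rdd2.toDebugString)            // prints the chain of parent RDDs
```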

(1) Narrow dependency

  • A narrow dependency means that one partition of the parent RDD is used by at most one partition of the child RDD. That is, the correspondence between parent partitions and child partitions is one-to-one or many-to-one. For example, map(), filter(), union(), and join() with co-partitioned inputs all generate narrow dependencies.

1. map() and filter() operators

  • one-to-one dependency
    [figure: one-to-one dependency between parent and child partitions]

2. union() operator

  • many-to-one dependency
    [figure: many-to-one dependency of union()]

3. join() operator

  • many-to-one dependency
    [figure: many-to-one dependency of join()]
  • For narrowly dependent RDDs, the partitions of the child RDD can be computed in a pipelined fashion from the corresponding partitions of the parent RDD, and the whole computation can run on a single node of the cluster.
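  • A small sketch of how these narrow dependencies surface in the API (spark-shell assumed; OneToOneDependency and RangeDependency are the classes Spark uses internally for the one-to-one and many-to-one cases):

```scala
val a = sc.parallelize(1 to 4, 2)
val b = sc.parallelize(5 to 8, 2)

val mapped = a.map(_ + 1)          // one-to-one dependency
println(mapped.dependencies)       // List(org.apache.spark.OneToOneDependency@...)

val unioned = a.union(b)           // many-to-one: child concatenates the parents
println(unioned.dependencies)      // one RangeDependency per parent RDD
```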

(2) Wide dependency

  • A wide dependency means that one partition of the parent RDD is used by multiple partitions of the child RDD. That is, the correspondence between parent partitions and child partitions is one-to-many or many-to-many. For example, groupByKey(), reduceByKey(), and sortByKey() all generate wide dependencies.
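  • For instance (a sketch under the same spark-shell assumption), a groupByKey() on an un-partitioned pair RDD records a ShuffleDependency, Spark's internal representation of a wide dependency:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val grouped = pairs.groupByKey()    // wide dependency
println(grouped.dependencies.head)  // org.apache.spark.ShuffleDependency@...
```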

1. groupBy() operator

[figure: wide dependency produced by groupBy()]

2. join() operator

[figure: wide dependency produced by join()]

  • The dependency produced by join() falls into two cases: if each partition of one RDD is combined with only a known, fixed number of partitions of the other RDD (for example, when the two inputs are co-partitioned), the join() is a narrow dependency; all other cases are wide dependencies (see the sketch after this list).
  • Under a wide dependency, the RDD must regroup data across partitions according to the key of each record. This regrouping process is called Shuffle, and it is similar to the Shuffle phase in MapReduce. As an analogy from everyday life: four people play cards together and need to shuffle the deck after each game. The four players correspond to four partitions, the cards in each player's hand correspond to the data in a partition, and shuffling the deck corresponds to the Shuffle. In short, Shuffle is the aggregation or reshuffling of data across partitions. It is a resource-intensive operation because it involves disk I/O, data serialization, and network I/O.
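  • The sketch below illustrates both join() cases (spark-shell assumed; the keys and values are made up). With co-partitioned inputs the join adds no new Shuffle; without co-partitioning it does, which appears as an extra stage boundary in the indentation of toDebugString's output.

```scala
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(4)
val x = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(p)
val y = sc.parallelize(Seq(("a", "u"), ("b", "v"))).partitionBy(p)

// Co-partitioned inputs: each output partition reads exactly one partition
// of each parent, so the join itself is a narrow dependency (no new Shuffle).
println(x.join(y).toDebugString)

// Un-partitioned inputs: records with equal keys must first be brought
// together, so this join introduces a Shuffle (wide dependency).
val u = sc.parallelize(Seq(("a", 1)))
val v = sc.parallelize(Seq(("a", "u")))
println(u.join(v).toDebugString)
```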

3. reduceByKey() operator

  • When reduceByKey() is performed on an RDD, all records with the same key are aggregated. Records sharing a key may sit in different partitions, or even on different nodes, yet the operation must bring them together before computing to guarantee a correct result. Therefore reduceByKey() produces a Shuffle and a wide dependency.
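  • A minimal sketch to make this concrete (spark-shell assumed; the data is made up): records with the key "spark" start out in different partitions and are aggregated across the Shuffle.

```scala
val scores = sc.parallelize(
  Seq(("hadoop", 1), ("spark", 1), ("spark", 1), ("hadoop", 1)), 2)

// reduceByKey first combines values locally within each partition, then
// shuffles the partial results so that equal keys meet in one partition.
val totals = scores.reduceByKey(_ + _)
totals.collect().foreach(println)   // (spark,2) and (hadoop,2); order may vary
```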

(3) Comparison of the two dependencies

  • In terms of data fault tolerance, narrow dependencies are preferable to wide dependencies. If a partition of a child RDD is lost, a narrow dependency only requires recomputing the corresponding parent partition, whereas a wide dependency requires recomputing all partitions of the parent RDD. In the groupByKey() example, if partition 1 of RDD2 is lost, all partitions of RDD1 (partitions 1, 2, and 3) must be recomputed to restore it. In addition, with a wide dependency the data of all parent partitions must be computed before the Shuffle can proceed; if some parent partition has not finished, the Shuffle has to wait.

2. Stage division

(1) Directed acyclic graph

  • In Spark, each operation on an RDD generates a new RDD. Connecting these RDDs with directed edges (from parent RDD to child RDD) forms a directed acyclic graph of the computation path, called a DAG (Directed Acyclic Graph).
    [figure: a DAG formed by RDD transformations]
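  • The lineage behind such a DAG can be printed with the RDD's toDebugString method, as in this minimal sketch (spark-shell assumed):

```scala
val lines  = sc.parallelize(Seq("a b", "b c"))
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Prints the lineage from the final RDD back to its sources; indentation
// changes in the output mark Shuffle (and therefore Stage) boundaries.
println(counts.toDebugString)
```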

(2) Stage division basis

  • Based on the DAG, Spark divides the whole computation into multiple phases, each called a Stage. Every Stage is executed in parallel by multiple tasks, with each task operating on one partition; the total number of tasks in a Stage equals the number of partitions of the last RDD in that Stage.
  • Stages are divided according to whether a wide dependency, that is, a Shuffle, occurs. The Spark scheduler recursively walks the DAG backwards from its end and cuts the graph wherever it meets a Shuffle, so all the RDDs before a Shuffle form one Stage; if the whole DAG contains no Shuffle, the entire DAG is a single Stage.

1. Two-stage case

  • The Stage division of the classic word count execution process is shown in the following figure.
    [figure: Stage division of the word count DAG]
  • The dependencies in the above figure can be divided into two Stages by recursing from back to front: the conversion from RDD3 to RDD4 is a Shuffle operation, so the graph is cut between RDD3 and RDD4; continuing forward, RDD1, RDD2, and RDD3 are linked by narrow dependencies, so they form one Stage, and the conversion from RDD4 onward forms the other. The entire conversion process is thus divided into two Stages.
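  • In code, the word count above might look like the following sketch (the HDFS paths are hypothetical); only reduceByKey() triggers a Shuffle, so the job runs in exactly two Stages.

```scala
val counts = sc.textFile("hdfs://host:9000/input/words.txt")  // hypothetical path
  .flatMap(_.split(" "))   // narrow  -+
  .map((_, 1))             // narrow   +-- Stage 1
  .reduceByKey(_ + _)      // Shuffle -> Stage 2
counts.saveAsTextFile("hdfs://host:9000/output")              // hypothetical path
```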

2. Three-stage case

  • The dependencies in the figure below can be divided into three Stages by recursing from back to front. The conversion from RDD6 to RDD7 is a Shuffle operation, so the graph is cut between RDD6 and RDD7; continuing forward, RDD3, RDD4, RDD5, and RDD6 form one Stage. The conversion from RDD1 to RDD2 is also a Shuffle operation, so the graph is cut between RDD1 and RDD2, and RDD1 forms a Stage of its own. The entire conversion process is thus divided into three Stages.
    [figure: a DAG divided into three Stages]
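  • A three-Stage job can be produced with two Shuffle operators in a row, for example (a sketch in spark-shell, not the exact RDD chain of the figure):

```scala
val result = sc.parallelize(Seq("a b", "b c", "a a"), 2)
  .flatMap(_.split(" "))
  .map((_, 1))             // Stage 1 ends: reduceByKey needs a Shuffle
  .reduceByKey(_ + _)      // Stage 2 ends: sortByKey needs another Shuffle
  .sortByKey()             // Stage 3
result.collect().foreach(println)
```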

Origin: blog.csdn.net/howard2005/article/details/123926097