Spark Core: RDD Core API [Dependencies: Wide and Narrow]

An RDD can relate to the parent RDD it depends on in two different ways: through a narrow dependency or through a wide dependency.

1. Narrow dependency

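  • Narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD (usually a one-to-one relationship; a child partition may still read several parent partitions, as in coalesce without a shuffle)
  • Summary: think of a narrow dependency as an only child, since each parent partition has at most one child partition
  • Common operators: map, flatMap, filter, union, sample, etc.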
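To make the one-to-one relationship concrete, here is a minimal sketch in plain Python. It is only a model, not the Spark API: an "RDD" is assumed to be a list of partitions, and the helper names `map_partitions` and `filter_partitions` are made up for illustration.

```python
# Plain-Python model of an RDD: a list of partitions, each a list of records.
# This is an illustration only; none of these names come from Spark.
parent = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def map_partitions(partitions, fn):
    """Model of map: child partition i reads only parent partition i."""
    return [[fn(x) for x in part] for part in partitions]

def filter_partitions(partitions, pred):
    """Model of filter: child partition i reads only parent partition i."""
    return [[x for x in part if pred(x)] for part in partitions]

doubled = map_partitions(parent, lambda x: x * 2)
large = filter_partitions(doubled, lambda x: x > 8)

print(doubled)  # [[2, 4, 6], [8, 10], [12, 14, 16, 18]]
print(large)    # [[], [10], [12, 14, 16, 18]]
```

The number of partitions never changes and no record moves between partitions, which is why such operators can run independently on each partition.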

2. Wide dependency

  • Wide dependency means that multiple partitions of the child RDD depend on the same partition of the parent RDD (one-to-many relationship)
  • Summary: think of a wide dependency as a parent with many children, since one parent partition feeds several child partitions
  • Common operators: groupByKey, reduceByKey, sortByKey, join (when the inputs are not already partitioned the same way), etc.
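The one-to-many relationship can be sketched the same way. The snippet below is a plain-Python model of a groupByKey-style operation, not Spark code: the function `group_by_key`, the `feeds` bookkeeping, and the `key % n` placement rule are all assumptions chosen to keep the example deterministic.

```python
from collections import defaultdict

# Model of a key-value RDD with two partitions (illustration only).
parent = [[(1, "a"), (2, "b"), (3, "c")], [(1, "d"), (4, "e")]]

def group_by_key(partitions, num_child_partitions):
    """Group values by key, placing each key in child partition key % n.

    Returns (child_partitions, feeds), where feeds[p] is the set of child
    partitions that receive records from parent partition p.
    """
    buckets = [defaultdict(list) for _ in range(num_child_partitions)]
    feeds = defaultdict(set)
    for p_idx, part in enumerate(partitions):
        for key, value in part:
            c_idx = key % num_child_partitions  # stand-in for a hash partitioner
            buckets[c_idx][key].append(value)
            feeds[p_idx].add(c_idx)
    child = [sorted(bucket.items()) for bucket in buckets]
    return child, dict(feeds)

child, feeds = group_by_key(parent, 2)
print(child)  # [[(2, ['b']), (4, ['e'])], [(1, ['a', 'd']), (3, ['c'])]]
print(feeds)  # {0: {0, 1}, 1: {0, 1}}
```

Every parent partition ends up feeding both child partitions, so records have to be redistributed across partitions. That redistribution is what the shuffle does.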

How to distinguish between wide and narrow dependencies?

  • Narrow dependency: a partition of the parent RDD is depended on by only one partition of the child RDD
  • Wide dependency: a partition of the parent RDD is depended on by multiple partitions of the child RDD (this involves a shuffle)

Common pitfall: if a partition of a child RDD depends on multiple parent RDDs, is that a wide dependency or a narrow dependency?

  • It depends. The split between wide and narrow dependencies is based on whether a partition of the parent RDD is depended on by multiple partitions of the child RDD: if it is, the dependency is wide. Alternatively, judge by the shuffle: a dependency that requires a shuffle is wide.
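The rule can be written down as a small check. This is a sketch of the counting rule described here, not how Spark itself decides (Spark records the dependency type on each RDD); the function `classify` and the `(rdd_name, partition_index)` encoding are hypothetical.

```python
def classify(child_to_parents):
    """Classify a dependency from a map: child partition -> parent partitions.

    Parent partitions are written as (rdd_name, partition_index) tuples.
    The dependency is wide as soon as any single parent partition is read
    by more than one child partition.
    """
    readers = {}
    for child, parents in child_to_parents.items():
        for parent in parents:
            readers.setdefault(parent, set()).add(child)
    is_wide = any(len(children) > 1 for children in readers.values())
    return "wide" if is_wide else "narrow"

# A child partition reading from two parent RDDs, each parent partition used once
# (for example a join of two RDDs that are already partitioned the same way).
co_partitioned = {0: [("A", 0), ("B", 0)], 1: [("A", 1), ("B", 1)]}

# Every child partition reads every parent partition (shuffle-style).
shuffled = {0: [("A", 0), ("A", 1)], 1: [("A", 0), ("A", 1)]}

print(classify(co_partitioned))  # narrow
print(classify(shuffled))        # wide
```

The first case shows why the number of parents a child partition reads does not decide the answer: what matters is how many child partitions each parent partition serves.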

3. Purpose

Narrow dependency:

  • Spark can compute the partitions in parallel, because no partition has to wait for data from other partitions
  • If a partition's data is lost, Spark only needs to recompute it from the one corresponding partition of the parent RDD rather than rerunning the entire job, which makes fault recovery cheap.
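The recovery argument can be illustrated with a plain-Python model. It assumes a one-to-one narrow dependency, and the names `recover` and `recomputed` are made up for this sketch; real Spark does this through the RDD lineage.

```python
# Model of a parent RDD and a child derived from it by a narrow (map) dependency.
parent = [[1, 2], [3, 4], [5, 6]]

def times_ten(x):
    return x * 10

child = [[times_ten(x) for x in part] for part in parent]

recomputed = []  # records which parent partitions had to be read again

def recover(child_index):
    """Rebuild one lost child partition from its single parent partition."""
    recomputed.append(child_index)
    return [times_ten(x) for x in parent[child_index]]

child[1] = None        # simulate losing child partition 1
child[1] = recover(1)  # only parent partition 1 is read again

print(child)       # [[10, 20], [30, 40], [50, 60]]
print(recomputed)  # [1]
```

With a wide dependency the lost partition would need records from every parent partition, so the same recovery would touch the whole parent.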

Wide dependency:

  • It is the basis for dividing a job into stages
  • Fault tolerance: with complex business logic, it is worth caching the data when execution reaches a wide dependency, so that an abnormal task failure does not force the data to be recomputed from the beginning
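Stage division can be sketched as cutting a chain of operators at every wide dependency. This is a simplified model that assumes a linear lineage; the function `split_into_stages` and the hand-written `kind` labels are hypothetical, and Spark's real scheduler works on a full DAG rather than a list.

```python
def split_into_stages(ops):
    """Split a linear chain of (name, kind) operators into stages.

    kind is "narrow" or "wide". A wide operator needs a shuffle, so it
    starts a new stage; narrow operators stay in the current stage.
    """
    stages = [[]]
    for name, kind in ops:
        if kind == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages

lineage = [
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
    ("sortByKey", "wide"),
]

for i, stage in enumerate(split_into_stages(lineage)):
    print(i, stage)
# 0 ['flatMap', 'map']
# 1 ['reduceByKey', 'map']
# 2 ['sortByKey']
```

Each stage boundary is also a natural place to consider caching, since everything before it would otherwise have to be recomputed after a failure.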

Origin blog.csdn.net/weixin_45666566/article/details/112548192