Detailed explanation of dependencies (lineage) between Spark Core RDDs

  • 1- What is a dependency (lineage)?
    • RDD fault tolerance is achieved by building up dependencies between RDDs
    • A child RDD depends on its parent RDD(s)
  • 2- Why do we need dependencies?
    • Because Spark is a parallel computing framework built on RDDs
    • An RDD is an immutable, partitioned collection whose partitions can be computed in parallel
    • By dividing dependencies into wide and narrow ones, Spark can compute the partitions of an RDD in parallel along chains of narrow dependencies
    • At a wide dependency, however, each child partition must pull data from several partitions of the parent RDD, so computation cannot be pipelined in parallel across the shuffle boundary
  • 3-How many kinds of dependencies are there?
    • Narrow dependency: NarrowDependency
    • Wide dependency: ShuffleDependency
  • 4- How to judge whether a dependency is narrow or wide?
    • If one partition of the parent RDD is used by at most one partition of the child RDD, it is a narrow dependency
    • If one partition of the parent RDD is used by multiple partitions of the child RDD, it is a wide dependency (a runnable check is sketched right after this list)
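
The distinction can also be checked programmatically: every RDD exposes its dependencies, and Spark models them with the NarrowDependency and ShuffleDependency classes listed above. Below is a minimal Scala sketch, assuming a local master and toy data chosen only for illustration:

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyKindCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("dependency-kind-check"))

    val words  = sc.parallelize(Seq("a", "b", "a", "c"), 2)
    val pairs  = words.map(w => (w, 1))      // each child partition reads exactly one parent partition
    val counts = pairs.reduceByKey(_ + _)    // child partitions pull data from many parent partitions

    // map -> OneToOneDependency, a subclass of NarrowDependency
    println(pairs.dependencies.head.isInstanceOf[NarrowDependency[_]])          // true
    // reduceByKey on an RDD without a partitioner -> ShuffleDependency (wide)
    println(counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])  // true

    sc.stop()
  }
}
```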

Here is an interview question: if a partition of a child RDD depends on multiple partitions of its parent RDD, is the dependency wide or narrow?
1) It cannot be determined from that alone. The wide/narrow division is based on whether a partition of the parent RDD is depended on by multiple partitions of the child RDD; if it is, the dependency is wide. Equivalently, judge from the shuffle perspective: whatever requires a shuffle is a wide dependency, for example join (a sketch follows below).
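
To make this concrete, here is a minimal sketch (local master, object name, and sample data are assumptions for illustration): coalesce without shuffle is a case where one child partition reads several parent partitions and the dependency is still narrow, whereas an operation that needs a shuffle, such as groupByKey, produces a wide dependency:

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object ChildReadsManyParents {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("child-reads-many-parents"))

    val nums = sc.parallelize(1 to 100, 8)

    // coalesce without shuffle: each of the 2 child partitions reads several parent
    // partitions, but no parent partition is split across children -> still narrow.
    val coalesced = nums.coalesce(2)
    println(coalesced.dependencies.head.isInstanceOf[NarrowDependency[_]])      // true

    // groupByKey on an RDD with no partitioner: every parent partition may feed
    // every child partition, so a shuffle is required -> wide (ShuffleDependency).
    val grouped = nums.map(n => (n % 3, n)).groupByKey()
    println(grouped.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]) // true

    sc.stop()
  }
}
```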

5- What is the purpose of dependencies in Spark's design?

  • To enable Spark's parallel computation: dependencies are the basis for dividing a job into stages
  • To build lineage for RDD fault tolerance: if the data of a partition is lost, only the corresponding partition(s) of the parent RDD need to be recomputed
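
Both purposes can be seen from an RDD's lineage string. The sketch below is a classic word count, assuming a local master and a hypothetical input path data/words.txt:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageAndStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lineage-and-stages"))

    // "data/words.txt" is a hypothetical input path used only for this sketch.
    val counts = sc.textFile("data/words.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // toDebugString prints the lineage (dependency chain); the indentation step that
    // begins at the ShuffledRDD marks the shuffle, which is where the DAGScheduler
    // cuts the job into two stages.
    println(counts.toDebugString)

    // If a partition of `counts` is lost, Spark walks this lineage backwards and
    // recomputes only the parent partitions that the lost partition depends on.
    sc.stop()
  }
}
```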

Origin blog.csdn.net/m0_49834705/article/details/112647243