Spark Big Data Processing Lecture Notes 3.4: Understanding RDD Dependencies


Table of contents

0. Learning objectives of this lecture

1. RDD dependencies

2. Narrow dependency

(1) map() and filter() operators

(2) union() operator

(3) join() operator

3. Wide dependency

(1) groupBy() operator

(2) join() operator

(3) reduceByKey() operator

4. Comparison of the two kinds of dependencies


0. Learning objectives of this lecture

  1. Understand narrow dependencies of RDDs
  2. Understand wide dependencies of RDDs
  3. Understand the difference between the two kinds of dependencies

1. RDD dependencies

  • In Spark, every transformation on an RDD produces a new RDD. Because RDDs are lazily evaluated, the new RDD depends on the original one, so RDDs form a pipeline-like chain of dependencies (the lineage). These dependencies come in two types: narrow dependencies and wide dependencies.

2. Narrow dependency

  • A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD (in Spark, a OneToOneDependency). Narrow dependencies come in two forms: in the first, one parent RDD partition corresponds to one child RDD partition; in the second, multiple parent RDD partitions correspond to one child RDD partition. A single parent partition can never feed multiple child partitions. As an analogy, a narrow dependency is like a family with an only child: each parent partition has at most one child partition depending on it.
  • The map(), filter(), and union() operators fall into the first form, while join() on co-partitioned inputs falls into the second. Co-partitioned input means that all records with the same key across the parent RDDs are placed into the same partition of the child RDD (the parents use the same partitioner). If a partition operation fails and the data of a child RDD partition is lost, only the corresponding partitions of the parent RDDs need to be recomputed to recover it.

(1) map() and filter() operators

  • one-to-one dependency
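To make the one-to-one dependency concrete, here is a minimal sketch in plain Python (no Spark required) that models an RDD as a list of partitions; the helper names map_narrow and filter_narrow are hypothetical, not Spark API:

```python
# Sketch: model an RDD as a list of partitions and show that narrow
# transformations such as map() and filter() work on each partition
# independently -- no data moves between partitions.

def map_narrow(partitions, f):
    # Child partition i is built from parent partition i only.
    return [[f(x) for x in part] for part in partitions]

def filter_narrow(partitions, pred):
    # Filtering likewise stays inside each partition.
    return [[x for x in part if pred(x)] for part in partitions]

parent = [[1, 2], [3, 4], [5, 6]]          # an "RDD" with 3 partitions
child = filter_narrow(map_narrow(parent, lambda x: x * 10),
                      lambda x: x > 20)
print(child)                                # [[], [30, 40], [50, 60]]
```

Because each child partition is derived from exactly one parent partition, a lost child partition can be rebuilt by re-running the functions over just that one parent partition.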

(2) union() operator

  • one-to-one dependency
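A plain-Python sketch of union(): the child's partition list is simply the concatenation of the two parents' partition lists, so child partition i still comes from exactly one parent partition and the dependency remains narrow (Spark models this as a RangeDependency):

```python
# Sketch: union() concatenates the partition lists of the two parents.
# No records are merged or moved between partitions.

def union_narrow(parts_a, parts_b):
    return parts_a + parts_b

a = [[1, 2], [3]]        # parent RDD A: 2 partitions
b = [[4], [5, 6]]        # parent RDD B: 2 partitions
u = union_narrow(a, b)   # child RDD: 4 partitions
print(u)                 # [[1, 2], [3], [4], [5, 6]]
```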

(3) join() operator

  • many-to-one dependency
  • For RDDs with narrow dependencies, each child partition can be computed by pipelining the operations over the corresponding parent partition(s), and the whole computation for one partition can run on a single node of the cluster.
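The many-to-one case can be sketched in plain Python as well. Assuming both parents are already partitioned by the same key partitioner (here, key % 2), child partition i of the join can be built from partition i of each parent alone, with no shuffle; join_copartitioned is a hypothetical helper, not Spark API:

```python
# Sketch: join() on co-partitioned inputs. Child partition i depends on
# partition i of each parent -- many-to-one, but still a narrow dependency.

def join_copartitioned(parts_a, parts_b):
    child = []
    for pa, pb in zip(parts_a, parts_b):   # partition i joins only partition i
        b_by_key = {}
        for k, v in pb:
            b_by_key.setdefault(k, []).append(v)
        child.append([(k, (va, vb))
                      for k, va in pa
                      for vb in b_by_key.get(k, [])])
    return child

# Both parents partitioned by key % 2: even keys in partition 0, odd in 1.
a = [[(2, 'a2')], [(1, 'a1')]]
b = [[(2, 'b2')], [(1, 'b1'), (3, 'b3')]]
joined = join_copartitioned(a, b)
print(joined)   # [[(2, ('a2', 'b2'))], [(1, ('a1', 'b1'))]]
```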

3. Wide dependency

  • A wide dependency means that each partition of the child RDD may use multiple partitions, or even all partitions, of the parent RDDs (in Spark, a ShuffleDependency). Continuing the family analogy, a wide dependency is like a family with many children: one parent partition feeds several child partitions.
  • When groupByKey() or join() (with inputs that are not co-partitioned) is applied to a parent RDD, each partition of the child RDD depends on all partitions of all parent RDDs. If a partition operation fails and the data of a child RDD partition is lost, all partitions of the parent RDDs must be recomputed to recover it.
  • For example, operations such as groupByKey(), reduceByKey(), and sortByKey() generate wide dependencies.

(1) groupBy() operator

  • many-to-many dependency
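A plain-Python sketch of the many-to-many pattern behind groupBy()/groupByKey(): records are repartitioned by key, so each child partition (chosen by the key's hash) may pull data from every parent partition; group_by_key_wide is a hypothetical helper, and integer keys are used so the hash is deterministic:

```python
# Sketch: grouping by key redistributes records, so every parent
# partition may contribute to every child partition -- a wide dependency.

def group_by_key_wide(parent_parts, num_child_parts):
    child = [dict() for _ in range(num_child_parts)]
    for part in parent_parts:              # every parent partition contributes
        for k, v in part:
            child[hash(k) % num_child_parts].setdefault(k, []).append(v)
    return child

parent = [[(0, 1), (1, 2)], [(0, 3), (3, 4)]]
grouped = group_by_key_wide(parent, 2)
print(grouped)   # [{0: [1, 3]}, {1: [2], 3: [4]}]
```

Note that key 0 appears in both parent partitions, yet both of its values end up together in one child partition, which is exactly why all parent partitions must be read.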

(2) join() operator

  • many-to-many dependency
  • The dependency created by a join() operation falls into two cases: when each partition of one RDD joins only with a known, bounded set of partitions of the other RDD (i.e., the inputs are co-partitioned), the join() is a narrow dependency; in all other cases it is a wide dependency.
  • Under a wide dependency, the RDD aggregates data across partitions according to each record's key. This redistribution of data is called Shuffle, and it is similar to the Shuffle phase in MapReduce. As an everyday analogy, imagine four people playing cards: after a round, the deck must be shuffled. The four people correspond to four partitions, the cards in each hand to the data in each partition, and shuffling the deck to the Shuffle. Shuffle, then, is the aggregation or redistribution of data between partitions. It is an expensive operation because it involves disk I/O, data serialization, and network I/O.
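The Shuffle can be sketched in plain Python as two phases, loosely mirroring Spark's shuffle write and shuffle read (the helper names are hypothetical): each upstream task writes one bucket per downstream partition, and each downstream task then fetches its bucket from every upstream task, which is why a wide dependency involves disk and network I/O.

```python
# Sketch: a two-phase shuffle over lists of (key, value) partitions.

def shuffle_write(parent_parts, num_reducers):
    # buckets[i][j] = data that map task i has written for reducer j.
    return [[[(k, v) for k, v in part if hash(k) % num_reducers == j]
             for j in range(num_reducers)]
            for part in parent_parts]

def shuffle_read(buckets, reducer_id):
    # Reducer j concatenates bucket j from *every* map task.
    out = []
    for task_buckets in buckets:
        out.extend(task_buckets[reducer_id])
    return out

parts = [[(0, 'x'), (1, 'y')], [(2, 'z'), (1, 'w')]]
buckets = shuffle_write(parts, 2)
print(shuffle_read(buckets, 0))   # [(0, 'x'), (2, 'z')]
print(shuffle_read(buckets, 1))   # [(1, 'y'), (1, 'w')]
```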

(3) reduceByKey() operator

  • When reduceByKey() is applied to an RDD, all records that share a key are aggregated together. Those records may sit in different partitions, or even on different nodes, yet they must be brought together and computed as a group to produce a correct result. The reduceByKey() operation therefore triggers a Shuffle and creates a wide dependency.
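A plain-Python sketch of the idea (reduce_by_key is a hypothetical helper, not Spark API). Spark also pre-aggregates within each parent partition (a map-side combine) to reduce the data moved by the Shuffle, and that step is shown here too:

```python
# Sketch: reduceByKey() = map-side combine per partition, then a
# shuffle by key, then a final reduce of the partial values.

def reduce_by_key(parent_parts, func, num_child_parts):
    # Phase 1: combine values locally inside each parent partition.
    combined = []
    for part in parent_parts:
        local = {}
        for k, v in part:
            local[k] = func(local[k], v) if k in local else v
        combined.append(local)
    # Phase 2: shuffle partial results by key and reduce them again.
    child = [dict() for _ in range(num_child_parts)]
    for local in combined:
        for k, v in local.items():
            tgt = child[hash(k) % num_child_parts]
            tgt[k] = func(tgt[k], v) if k in tgt else v
    return child

parts = [[(0, 1), (1, 2), (0, 3)], [(1, 4), (2, 5)]]
result = reduce_by_key(parts, lambda a, b: a + b, 2)
print(result)   # [{0: 4, 2: 5}, {1: 6}]
```

Key 1 appears in both parent partitions; only after the shuffle can its partial sums (2 and 4) be combined into the correct total of 6.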

4. Comparison of the two kinds of dependencies

  • In terms of fault tolerance, narrow dependencies are cheaper to recover than wide dependencies. When the data of a child RDD partition is lost, a narrow dependency requires recomputing only the corresponding parent partition, whereas a wide dependency requires recomputing all partitions of the parent RDD. For example, in a groupByKey() operation, if partition 1 of RDD2 is lost, all partitions of RDD1 (partitions 1, 2, and 3) must be recomputed to restore it. In addition, with a wide dependency the Shuffle cannot begin until every parent partition has been computed; if any parent partition is not yet ready, the operation must wait.
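The recovery-cost contrast above can be summarized in a tiny plain-Python sketch (a simplified model, assuming one-to-one lineage for the narrow case):

```python
# Sketch: which parent partitions must be recomputed to restore one
# lost child partition, under each kind of dependency.

def partitions_to_recompute(dep_kind, lost_child_partition, num_parent_parts):
    if dep_kind == 'narrow':                  # only the matching parent
        return [lost_child_partition]
    return list(range(num_parent_parts))      # wide: every parent partition

print(partitions_to_recompute('narrow', 1, 3))  # [1]
print(partitions_to_recompute('wide', 1, 3))    # [0, 1, 2]
```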

Origin: blog.csdn.net/qq_61324603/article/details/130614062