Spark source code analysis (1): the four kinds of RDD dependencies

 The four kinds of RDD dependencies

  RDDs have four kinds of dependencies: ShuffleDependency, PruneDependency, RangeDependency and OneToOneDependency. org.apache.spark.Dependency has two direct subclasses, ShuffleDependency and NarrowDependency. NarrowDependency is an abstract class with three concrete implementations: OneToOneDependency, RangeDependency and PruneDependency.

  

 Narrow dependencies

  How does a narrow dependency determine which parent RDD partitions a child RDD partition depends on? NarrowDependency defines an abstract method for this, as follows:

/**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  Its input parameter is the id of a child RDD partition, and its output is the sequence of ids of the parent RDD partitions that this child partition depends on.

  Let's look at the implementations in the three subclasses in turn:

  OneToOneDependency

  First, OneToOneDependency implements getParents as follows:

override def getParents(partitionId: Int): List[Int] = List(partitionId)

  This single line of code is simple: the partition index in the child RDD is the same as the partition index in the parent RDD. Each partition of the parent RDD maps to exactly one partition of the child RDD, so the partition relationship is one-to-one, and the relationship between the RDDs is one-to-one as well.
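  As a minimal sketch (assuming a local SparkContext named sc): map() does not change partitioning, so the resulting MapPartitionsRDD depends on its parent through a OneToOneDependency.

import org.apache.spark.OneToOneDependency

// map() keeps the partition count, so the child reads each parent partition one-to-one.
val parent = sc.parallelize(1 to 100, 4)
val child  = parent.map(_ * 2)

val dep = child.dependencies.head.asInstanceOf[OneToOneDependency[_]]
println(dep.getParents(2))   // List(2): child partition 2 reads only parent partition 2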

  RangeDependency

  Next, RangeDependency implements getParents as follows:

  

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}

  First, the three constructor parameters: inStart is the start of the range in the parent RDD; outStart is the start of the range in the child RDD; length is the length of the range.

  The rule for finding the parent partition index is: if the child RDD partition index falls within the child's range, return (child partition index - outStart + inStart). Here (inStart - outStart) is the offset between the start of the range in the parent RDD and the start of the range in the child RDD, so adding this offset to the child partition index gives the corresponding parent partition index. Otherwise, the child partition does not depend on any partition of this parent RDD. At the partition level, the parent-child relationship is one-to-one. At the RDD level, the relationship may be one-to-one (a RangeDependency of length 1 is a special case of a one-to-one mapping), many-to-one, or one-to-many.
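  As a hedged sketch (again assuming a local SparkContext named sc), union is the typical producer of RangeDependency: the parents' partitions are laid end to end in the child RDD.

import org.apache.spark.RangeDependency

val a = sc.parallelize(1 to 10, 3)    // becomes partitions 0..2 of the union
val b = sc.parallelize(11 to 20, 2)   // becomes partitions 3..4 of the union
val u = a.union(b)

// The second dependency is the RangeDependency for b: inStart = 0, outStart = 3, length = 2.
val dep = u.dependencies(1).asInstanceOf[RangeDependency[_]]
println(dep.getParents(4))   // List(1): child partition 4 maps back to partition 1 of b
println(dep.getParents(1))   // Nil: child partition 1 belongs to a's range, not b's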

  PruneDependency

  Finally, PruneDependency implements getParents as follows:

/**
 * Represents a dependency between the PartitionPruningRDD and its parent. In this
 * case, the child RDD contains a subset of partitions of the parents'.
 */
private[spark] class PruneDependency[T](rdd: RDD[T], partitionFilterFunc: Int => Boolean)
  extends NarrowDependency[T](rdd) {

  @transient
  val partitions: Array[Partition] = rdd.partitions
    .filter(s => partitionFilterFunc(s.index)).zipWithIndex
    .map { case(split, idx) => new PartitionPruningRDDPartition(idx, split) : Partition }

  override def getParents(partitionId: Int): List[Int] = {
    List(partitions(partitionId).asInstanceOf[PartitionPruningRDDPartition].parentSplit.index)
  }
}

  First, the parameters and members: rdd is a reference to the parent RDD; partitionFilterFunc is a callback function used to filter out the parent RDD partitions that satisfy the condition; the PartitionPruningRDDPartition class is declared as follows:

private[spark] class PartitionPruningRDDPartition(idx: Int, val parentSplit: Partition)
  extends Partition {
  override val index = idx
}

  partitions is built as follows: first, the parent RDD's partitions are obtained through the rdd reference and filtered by their index with partitionFilterFunc; the surviving parent partitions are then renumbered starting from 0, and each of them, together with its new number, is wrapped in a new PartitionPruningRDDPartition instance and placed into the partitions array. In effect, the parent RDD's partitions are first pruned by the filter.

  In the getParents method, the child RDD partition index is first used to look up the corresponding partition in partitions, and that partition's parentSplit.index member then gives the index of the parent partition among all partitions of the parent RDD. The relationship between a child RDD partition and a parent RDD partition is one-to-one; at the RDD level, the child RDD contains a subset of the parent RDD's partitions.
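  A hedged usage sketch (assuming a local SparkContext named sc): PartitionPruningRDD keeps only the parent partitions accepted by the filter and renumbers them from 0.

import org.apache.spark.NarrowDependency
import org.apache.spark.rdd.PartitionPruningRDD

val parent = sc.parallelize(1 to 100, 5)                              // parent partitions 0..4
val pruned = PartitionPruningRDD.create(parent, idx => idx % 2 == 0)  // keep 0, 2, 4

println(pruned.partitions.length)   // 3
// Child partition 1 was built from parent partition 2.
val dep = pruned.dependencies.head.asInstanceOf[NarrowDependency[_]]
println(dep.getParents(1))          // List(2)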

  Briefly, in a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD, i.e. the partition relationship is one-to-one.

 Wide dependencies

  Now let's focus on ShuffleDependency, which represents a dependency on the output of a shuffle stage. First look at its constructor, i.e. the member variables it relies on:

@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]]

  Here, _rdd is the parent RDD instance; partitioner is used to partition the shuffle output; serializer handles serialization, defaulting to org.apache.spark.serializer.JavaSerializer and configurable through the `spark.serializer` parameter; keyOrdering is the ordering of the shuffle keys; aggregator is the combiner used on the map side or reduce side of the shuffle; mapSideCombine indicates whether to perform partial aggregation (i.e. pre-aggregation on the map side, which reduces network traffic and improves efficiency) and defaults to false, because not every computation is suitable for it. For example, a global average or variance is not suited to map-side combining, while a global maximum or minimum is. Note that when mapSideCombine is true, an aggregator must be provided, since the map-side combine performed before the shuffle relies on it.
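  As a hedged illustration (assuming a local SparkContext named sc), reduceByKey builds a ShuffledRDD whose single dependency is a ShuffleDependency with mapSideCombine enabled and an Aggregator wrapping the reduce function.

import org.apache.spark.{HashPartitioner, ShuffleDependency}

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val reduced = pairs.reduceByKey(new HashPartitioner(4), _ + _)

val dep = reduced.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.partitioner.numPartitions)   // 4
println(dep.mapSideCombine)              // true: partial aggregation happens before the shuffle
println(dep.aggregator.isDefined)        // true: required when mapSideCombine is true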

  The seven Partitioner implementations

  A partitioner defines how the key-value pairs in an RDD are partitioned by key. It maps each key to a partition id, ranging from 0 to numPartitions - 1. Note that the partitioner must be deterministic: given the same key it must always return the same partition id, so that a failed task can be rerun and the data belonging to each partition can be traced back, keeping the data that participates in each partition's computation consistent. In other words, the partitioner determines which partition a given record is shuffled to.

  org.apache.spark.Partitioner has seven implementation classes; the main ones are walked through below.

  

  Let's first look at how Partitioner is defined:

abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

  numPartitions returns the number of partitions of the child RDD; getPartition returns the partition index in the child RDD for a given key.
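  For instance, a hand-rolled partitioner only has to supply these two methods. The sketch below is a hypothetical example (not from the Spark source) that buckets string keys by their first character; like any partitioner, it must stay deterministic for the reasons given above.

import org.apache.spark.Partitioner

// Hypothetical example: route string keys by their first character.
class FirstCharPartitioner(parts: Int) extends Partitioner {
  require(parts > 0)
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = key match {
    case null => 0
    case s: String if s.nonEmpty => s.head.toInt % parts   // deterministic for a given key
    case _ => 0
  }
}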

  HashPartitioner implements getPartition as follows; the idea is key.hashCode() mod the number of partitions of the child RDD:

def getPartition(key: Any): Int = key match {
  case null => 0
  case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}
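  Utils.nonNegativeMod is used because hashCode can be negative; a minimal sketch of the same idea (not the actual Spark utility, which is internal) looks like this:

// Shift a possibly negative remainder back into the range [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

println(nonNegativeMod(-7, 3))   // 2, whereas -7 % 3 alone would give -1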

  RangePartitioner implements getPartition as follows. rangeBounds is an array of sampled upper bounds, one per range boundary; the key is placed into the first range whose upper bound is not less than it (a linear scan when there are at most 128 bounds, a binary search otherwise), and the resulting index is mirrored when the ordering is descending:

def getPartition(key: Any): Int = {
  val k = key.asInstanceOf[K]
  var partition = 0
  if (rangeBounds.length <= 128) { // no more than 128 partitions
    // If we have less than 128 partitions naive search
    while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
      partition += 1
    }
  } else { // more than 128 partitions
    // Determine which binary search method to use only once.
    partition = binarySearch(rangeBounds, k) // binary search
    // binarySearch either returns the match location or -[insertion point]-1
    if (partition < 0) {
      partition = -partition-1
    }
    if (partition > rangeBounds.length) {
      partition = rangeBounds.length
    }
  }
  if (ascending) {
    partition
  } else {
    rangeBounds.length - partition
  }
}
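  A hedged usage sketch (assuming a local SparkContext named sc): RangePartitioner samples the keys of a pair RDD to build rangeBounds, which is what sortByKey relies on for globally ordered output.

import org.apache.spark.RangePartitioner

val kv = sc.parallelize(Seq(30, 1, 17, 42, 8).map(k => (k, k)), 2)
val rp = new RangePartitioner(3, kv)

// Small keys fall into low partitions, large keys into high ones;
// the exact boundaries depend on the sampled data.
println(rp.getPartition(1))
println(rp.getPartition(42))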

  PythonPartitioner's getPartition, shown below, is very similar to the hash-based one:

override def getPartition(key: Any): Int = key match {
  case null => 0
  // we don't trust the Python partition function to return valid partition ID's so
  // let's do a modulo numPartitions in any case
  case key: Long => Utils.nonNegativeMod(key.toInt, numPartitions)
  case _ => Utils.nonNegativeMod(key.hashCode(), numPartitions)
}

  PartitionIdPassthrough's getPartition simply treats the key itself as the partition id:

override def getPartition(key: Any): Int = key.asInstanceOf[Int]

  GridPartitioner implements getPartition as follows; the idea is to locate the grid cell (partition) that a coordinate tuple belongs to:

override val numPartitions: Int = rowPartitions * colPartitions

/**
 * Returns the index of the partition the input coordinate belongs to.
 *
 * @param key The partition id i (calculated through this method for coordinate (i, j) in
 *            `simulateMultiply`, the coordinate (i, j) or a tuple (i, j, k), where k is
 *            the inner index used in multiplication. k is ignored in computing partitions.
 * @return The index of the partition, which the coordinate belongs to.
 */
override def getPartition(key: Any): Int = {
  key match {
    case i: Int => i
    case (i: Int, j: Int) =>
      getPartitionId(i, j)
    case (i: Int, j: Int, _: Int) =>
      getPartitionId(i, j)
    case _ =>
      throw new IllegalArgumentException(s"Unrecognized key: $key.")
  }
}

/** Partitions sub-matrices as blocks with neighboring sub-matrices. */
private def getPartitionId(i: Int, j: Int): Int = {
  require(0 <= i && i < rows, s"Row index $i out of range [0, $rows).")
  require(0 <= j && j < cols, s"Column index $j out of range [0, $cols).")
  i / rowsPerPart + j / colsPerPart * rowPartitions
}
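  A worked example with made-up values (not taken from the Spark source): for a 4x4 block matrix split into 2x2 sub-blocks, rowsPerPart = colsPerPart = 2 and rowPartitions = 2, so the arithmetic in getPartitionId plays out as follows.

val (rowsPerPart, colsPerPart, rowPartitions) = (2, 2, 2)
def partitionId(i: Int, j: Int): Int = i / rowsPerPart + j / colsPerPart * rowPartitions

println(partitionId(3, 1))   // 1: row block 1, column block 0
println(partitionId(1, 3))   // 2: row block 0, column block 1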

  Besides these, there is an anonymous partitioner and a few other variants that are not covered here. In short, in a wide dependency the partitioner determines which partition each record is shuffled to.

  With that, both the narrow and the wide dependencies of RDDs have been described.

 


Origin www.cnblogs.com/johnny666888/p/11111957.html