Spark Source Code Analysis - 3. Dependency Analysis

In the previous article we finished walking through the job-level code and ended by introducing stages. Before diving into the stage-splitting source, let's first look at how stages are divided and at the Dependency class.


Stages are split along wide and narrow dependencies: hitting a wide dependency requires a shuffle, so the computation cannot proceed fully in parallel across nodes, whereas under a narrow dependency each partition can be computed in parallel. Let's explain narrow and wide dependencies from two angles (a small spark-shell sketch follows the list).
1. From the child RDD's perspective
(1) Narrow dependency: each partition of the child RDD depends on a constant number of parent partitions (independent of the data size);
(2) Wide dependency: each partition of the child RDD depends on all partitions of the parent RDD. For example, map produces a narrow dependency, while join produces a wide dependency (unless the parent RDDs are already hash-partitioned).
2. From the parent RDD's perspective
(1) Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD;
(2) Wide dependency: each partition of the parent RDD is used by multiple partitions of the child RDD.
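As a quick illustration, here is a minimal sketch (assuming a spark-shell session, where sc is the usual SparkContext) that inspects an RDD's dependencies directly: map produces a narrow dependency, while groupByKey produces a ShuffleDependency:

val nums = sc.parallelize(1 to 100, 4)                    // 4 partitions
val mapped = nums.map(_ * 2)
mapped.dependencies.head                                   // OneToOneDependency (narrow)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val grouped = pairs.groupByKey()
grouped.dependencies.head                                  // ShuffleDependency (wide)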

Dependency.scala
[Figure: the classes defined in Dependency.scala]
As the figure shows, Dependency.scala defines the Dependency class and its subclasses.

Let's look at the inheritance hierarchy:

The Dependency abstract class

It defines only a single method named rdd, whose return type is RDD; the subclasses that follow override it.

abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}

Dependency has two direct subclasses: NarrowDependency and ShuffleDependency.

The NarrowDependency abstract class

It declares a getParents method that returns the parent partitions a given child partition depends on. Note the plural: under a narrow dependency one child partition may depend on several parent partitions, but each parent partition is used by at most one child partition.
It also overrides rdd, simply returning the constructor argument _rdd.

abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}

NarrowDependency has two subclasses, OneToOneDependency and RangeDependency; both are concrete classes, not abstract.

The OneToOneDependency class:

It overrides getParents, narrowing the return type from Seq[Int] to List[Int]; the body simply returns the child partition ID wrapped in a single-element List, because the child partition has exactly the same partition ID as the parent partition it depends on. Since Spark evaluates lazily, the dependency only records this partition ID; what differs is the operation applied to the RDD.

class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
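A minimal sketch (again assuming a spark-shell session with SparkContext sc) shows that the child partition ID maps straight back to the same parent partition ID:

val nums = sc.parallelize(1 to 100, 4)
val doubled = nums.map(_ * 2)                              // map yields a OneToOneDependency
val dep = doubled.dependencies.head.asInstanceOf[OneToOneDependency[Int]]
dep.getParents(2)                                          // List(2): child partition 2 reads only parent partition 2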

The RangeDependency class:

It computes the parent partition from the child partition: the run of parent partitions starting at inStart maps one-to-one onto the run of child partitions starting at outStart. Because the mapping is a contiguous range, storing the two start offsets and the length is enough to recover any parent partition ID, so getParents only needs to return a single-element List with the computed ID.
Nil is the empty List, returned when the child partition falls outside the range.

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
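union is the typical user of this class: the resulting UnionRDD carries one RangeDependency per parent. A sketch (assuming a spark-shell session; the partition counts are chosen just for illustration):

val a = sc.parallelize(1 to 10, 2)    // parent a: partitions 0..1
val b = sc.parallelize(11 to 20, 3)   // parent b: partitions 0..2
val u = a.union(b)                    // child: partitions 0..4
u.dependencies                        // two RangeDependency instances, one per parent
// For the dependency on b: inStart = 0, outStart = 2, length = 3,
// so getParents(3) = 3 - 2 + 0 = List(1), i.e. child partition 3 is partition 1 of b.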

The ShuffleDependency class:

It represents a dependency on the output of a shuffle stage; note that it only applies when a shuffle takes place.
_rdd: the parent RDD
partitioner: the partitioner used to partition the shuffle output
@transient marks _rdd as excluded from serialization, since the RDD is not needed on the executor side
serializer: shuffle output usually travels over the network, so it needs to be serialized; by default the serializer comes from SparkEnv
keyOrdering: the ordering of keys in the shuffle, None by default
aggregator: the map/reduce-side aggregator used to combine shuffle values, None by default
mapSideCombine: whether to perform partial aggregation on the map side before transferring data, false by default
ShuffleDependency operates on key-value RDDs (join, groupByKey, and so on), hence the Product2[K, V] bound, which you can think of as a Tuple2.

/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 *
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If not set
 *                   explicitly then the default serializer, as specified by `spark.serializer`
 *                   config option, will be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  val shuffleId: Int = _rdd.context.newShuffleId()

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
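To see how these fields get populated in practice, here is a short sketch (assuming a spark-shell session on Spark 2.x; the stated defaults are my expectation, not something the original post verifies): reduceByKey creates a ShuffleDependency with map-side combine enabled:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val reduced = pairs.reduceByKey(_ + _)
val dep = reduced.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
dep.partitioner       // HashPartitioner over the parent's partition count (4), unless spark.default.parallelism is set
dep.mapSideCombine    // true: reduceByKey combines values on the map side
dep.keyOrdering       // None
dep.shuffleId         // unique id obtained from SparkContext.newShuffleId()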


Reposted from blog.csdn.net/bloddy/article/details/79315624