In the previous article we finished walking through the job-level code and ended by introducing stages. Before reading the stage-splitting source, let's first look at how stages are divided and at the Dependency class.
Stages are split according to narrow and wide dependencies: a wide dependency requires a shuffle, so the nodes on either side of it cannot run in parallel, whereas under a narrow dependency each partition can be computed independently and in parallel. We can explain narrow vs. wide dependencies from two angles.
1. From the child RDD's perspective
(1) Narrow dependency: each partition of the child RDD depends on a constant number of parent partitions (i.e., a number independent of the data size);
(2) Wide dependency: each partition of the child RDD depends on all partitions of the parent RDD. For example, map produces a narrow dependency, while join produces a wide dependency (unless the parent RDDs are hash-partitioned).
2. From the parent RDD's perspective
(1) Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD;
(2) Wide dependency: each partition of the parent RDD is used by multiple partitions of the child RDD.
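The distinction above can be observed directly on an RDD's dependencies field. A minimal sketch, assuming a running local SparkContext named sc:

```scala
// Inspect which Dependency subclass each transformation produces.
val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x))

val mapped  = pairs.mapValues(_ + 1)   // narrow: no shuffle needed
val grouped = pairs.groupByKey()       // wide: requires a shuffle

println(mapped.dependencies.head)   // a OneToOneDependency instance
println(grouped.dependencies.head)  // a ShuffleDependency instance
```

The shuffle boundary between pairs and grouped is exactly where the DAGScheduler will cut a new stage.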
Dependency.scala
As the class diagram shows, Dependency.scala defines the Dependency class and its subclasses.
Let's look at the inheritance hierarchy.
Dependency abstract class
It defines only a single method named rdd, whose return value is of type RDD; the subclasses below override it.
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
Dependency has two direct subclasses: NarrowDependency and ShuffleDependency.
NarrowDependency abstract class
It declares a getParents method that returns the parent partitions for a given child partition. Note the plural: under a narrow dependency, one child partition may have multiple parent partitions, but each parent partition is used by at most one child partition.
It also overrides rdd, simply returning the constructor argument.
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}
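To make the getParents contract concrete, here is a hypothetical subclass (BlockDependency is our own illustrative name, not part of Spark's source): a coalesce-style dependency in which child partition i reads the contiguous block of k parent partitions starting at i*k. One child has several parents, yet each parent is used by exactly one child, so the dependency is still narrow.

```scala
// Hypothetical example, not in Spark: child partition i depends on
// parent partitions [i*k, i*k + k). getParents returns multiple IDs,
// but every parent partition is consumed by at most one child partition.
class BlockDependency[T](rdd: RDD[T], k: Int) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): Seq[Int] =
    (partitionId * k) until math.min((partitionId + 1) * k, rdd.partitions.length)
}
```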
NarrowDependency has two subclasses, OneToOneDependency and RangeDependency; both are concrete classes, not abstract.
OneToOneDependency class:
It overrides getParents, narrowing the return type from Seq[Int] to List[Int]; the body simply wraps the child's partitionId in a one-element list. In other words, partition i of the child RDD depends on exactly partition i of the parent RDD: the partition IDs are identical, and only the operation applied to the data differs.
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
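Transformations such as map and filter create exactly this dependency. A quick check, assuming a local SparkContext named sc:

```scala
val nums   = sc.parallelize(1 to 8, numSlices = 4)
val mapped = nums.map(_ * 2)  // map preserves partitioning one-to-one

val dep = mapped.dependencies.head.asInstanceOf[OneToOneDependency[Int]]
println(dep.getParents(2))    // List(2): child partition 2 reads parent partition 2
```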
RangeDependency class:
It computes the parent partition from the child partition: the parent RDD's partitions starting at inStart map one-to-one onto the child RDD's partitions starting at outStart, for length partitions. Because the mapping is a contiguous range, the parent partition ID does not need to be stored; it can be computed by offset arithmetic (partitionId - outStart + inStart), and getParents returns a one-element list.
Nil is the empty List, returned when the child partition falls outside the range.
/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param rdd the parent RDD
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
@DeveloperApi
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
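RangeDependency shows up when RDDs are unioned: UnionRDD creates one RangeDependency per parent, each covering that parent's slice of the child's partition space. A sketch of the arithmetic, assuming a local SparkContext named sc:

```scala
val a = sc.parallelize(1 to 6,  numSlices = 3)  // becomes child partitions 0..2
val b = sc.parallelize(7 to 12, numSlices = 3)  // becomes child partitions 3..5
val u = a.union(b)

// The second dependency covers b: inStart = 0, outStart = 3, length = 3.
val dep = u.dependencies(1).asInstanceOf[RangeDependency[Int]]
println(dep.getParents(4))  // List(1): 4 - outStart(3) + inStart(0) = 1
println(dep.getParents(1))  // Nil: child partition 1 comes from a, not b
```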
ShuffleDependency class:
It represents a dependency on the output of a shuffle stage; note that it arises only when a shuffle occurs.
_rdd: the parent RDD
partitioner: the partitioner used to partition the shuffle output
@transient marks _rdd as excluded from serialization, since the RDD is not needed on the executor side
serializer: a shuffle usually involves network transfer, so data frequently needs to be serialized; by default the serializer is obtained from SparkEnv
keyOrdering: the ordering of keys in the shuffle, None by default
aggregator: an optional map/reduce-side aggregator for combining shuffle values, None by default
mapSideCombine: whether to partially combine on the map side before transferring, false by default
ShuffleDependency operates on key-value RDDs (join, groupByKey, and so on), hence the Product2 bound here, which you can read as a Tuple2 pair.
/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 *
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If not set
 *                   explicitly then the default serializer, as specified by `spark.serializer`
 *                   config option, will be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  val shuffleId: Int = _rdd.context.newShuffleId()

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
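Key-based transformations such as reduceByKey create this dependency, and its public fields can be inspected directly. A sketch, assuming a local SparkContext named sc:

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = pairs.reduceByKey(_ + _)

val dep = reduced.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.partitioner)     // the partitioner for the shuffle output (a HashPartitioner here)
println(dep.mapSideCombine)  // true: reduceByKey combines values on the map side
println(dep.shuffleId)       // the unique ID registered with the shuffle manager
```

Note how reduceByKey turns on mapSideCombine while groupByKey does not; combining before the network transfer is exactly why reduceByKey is usually preferred.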