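Spark's core design has not changed much over the years. If you want a quick understanding of the underlying implementation, read the early source code,
for example the branch-0.5 branch at https://github.com/apache/spark/tree/branch-0.5, which you can find directly on GitHub. Compared with the sprawling Spark 2.x source
and its dozens of packages, the early code has fewer comments but is far less daunting to read.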
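Spark source, part 1: this post covers RDD, Split, and Partitioner. All code shown is from the early branch and may differ from Spark 2.x, so keep that in mind while reading.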
1.RDD
/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel.
* Each RDD is characterized by five main properties:
* - A list of splits (partitions)
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for HDFS)
* All the scheduling and execution in Spark is done based on these methods, allowing each RDD to
* implement its own way of computing itself.
*/
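This is the official comment from the source. An RDD is Spark's basic data abstraction: an immutable, partitioned collection of elements that can be operated on in parallel.
Each RDD has five main properties:
1. A list of partitions:
Spark cuts the data into n pieces, and each piece is one partition. The number of partitions is specified by the user, or defaults to the largest partition count among the parent RDDs (there may be several parents).
2. A function for computing each partition:
Most RDD subclasses override this method, which is why there is such a rich set of operators.
3. A list of dependencies on other RDDs
4. Optionally, a Partitioner for key-value RDDs
5. Optionally, a list of preferred locations for computing each partition (e.g. if the data is on the local node, compute it there; otherwise copy it from another node and then compute)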
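abstract class RDD[T: ClassManifest](@transient sc: SparkContext) extends Serializable {
  def splits: Array[Split] // returns the array of splits (partitions)
  def compute(split: Split): Iterator[T] // computes the given split
  @transient val dependencies: List[Dependency[_]] // dependency list; @transient keeps the field out of serialization
  val partitioner: Option[Partitioner] = None // partitioner
  // Optionally overridden by subclasses to specify placement preferences
  def preferredLocations(split: Split): Seq[String] = Nil // preferred locations
  def context = sc
  // Get a unique ID for this RDD
  val id = sc.newRddId() // obtains a new RDD id
  private var shouldCache = false // whether to cache; false by default
  def cache(): RDD[T] = { // marks this RDD to be cached and returns it
    shouldCache = true
    this
  }
  // Returns an iterator over the split's data: when caching is enabled it goes through
  // the cache tracker (read the cached data, or compute it), otherwise it computes directly
  final def iterator(split: Split): Iterator[T] = {
    if (shouldCache) {
      SparkEnv.get.cacheTracker.getOrCompute[T](this, split)
    } else {
      compute(split)
    }
  }
  // The rest of the class defines the operators, implemented through RDD subclasses;
  // a separate post will cover them
  // ......
}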
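The interplay between cache() and iterator() can be tried out without Spark. The following is a minimal single-process sketch, not the branch-0.5 implementation: MiniSplit, MiniRDD, RangeRDD and the in-memory map are hypothetical stand-ins, and the map only imitates what the real code delegates to SparkEnv.get.cacheTracker.

```scala
import scala.collection.mutable

object MiniRddDemo {
  // Hypothetical stand-in for Split: it only carries an index.
  final case class MiniSplit(index: Int)

  // Hypothetical single-process model of the RDD contract.
  abstract class MiniRDD[T] {
    def splits: Array[MiniSplit]
    def compute(split: MiniSplit): Iterator[T]

    private var shouldCache = false
    // Assumption: a plain map replaces Spark's cache tracker.
    private val cached = mutable.Map.empty[Int, Vector[T]]

    def cache(): MiniRDD[T] = {
      shouldCache = true
      this
    }

    // Same branching as RDD.iterator: cached path or direct computation.
    final def iterator(split: MiniSplit): Iterator[T] = {
      if (shouldCache) {
        cached.getOrElseUpdate(split.index, compute(split).toVector).iterator
      } else {
        compute(split)
      }
    }
  }

  // Toy RDD over the integers [0, n), cut into numSplits contiguous slices.
  class RangeRDD(n: Int, numSplits: Int) extends MiniRDD[Int] {
    var computeCalls = 0 // counts how often compute actually runs

    def splits: Array[MiniSplit] = Array.tabulate(numSplits)(i => MiniSplit(i))

    def compute(split: MiniSplit): Iterator[Int] = {
      computeCalls += 1
      val start = split.index * n / numSplits
      val end = (split.index + 1) * n / numSplits
      (start until end).iterator
    }
  }

  def main(args: Array[String]): Unit = {
    val rdd = new RangeRDD(10, 2)
    rdd.cache()
    val first = rdd.splits(0)
    println(rdd.iterator(first).toList)
    println(rdd.iterator(first).toList)
    println(s"compute calls: ${rdd.computeCalls}")
  }
}
```

With cache() called, the second iterator(first) call is served from the map, so compute runs only once for that split; without cache(), every call recomputes.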
2.Split
A Split corresponds one-to-one with a partition and with a task.
trait Split extends Serializable {
  val index: Int // index of this split
  override def hashCode(): Int = index
}
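To make the idea of indexed splits concrete, here is a small sketch of cutting a local collection into contiguous slices, one Split per slice. The Split trait is copied from the listing above so the example is self-contained; SliceSplit and slice are hypothetical names introduced only for illustration and are not claimed to match any class in branch-0.5.

```scala
object SplitDemo {
  // Same shape as the Split trait quoted above.
  trait Split extends Serializable {
    val index: Int
    override def hashCode(): Int = index
  }

  // Hypothetical concrete split that holds its slice of the data directly.
  class SliceSplit[T](val index: Int, val data: Seq[T]) extends Split

  // Cut `data` into `numSlices` contiguous slices of near-equal size.
  def slice[T](data: Seq[T], numSlices: Int): Seq[SliceSplit[T]] = {
    require(numSlices >= 1, "need at least one slice")
    (0 until numSlices).map { i =>
      // Long arithmetic avoids overflow for large collections.
      val start = (i.toLong * data.length / numSlices).toInt
      val end = ((i + 1).toLong * data.length / numSlices).toInt
      new SliceSplit(i, data.slice(start, end))
    }
  }

  def main(args: Array[String]): Unit = {
    slice(1 to 10, 3).foreach { s =>
      println(s"split ${s.index}: ${s.data.mkString(",")}")
    }
  }
}
```

Each slice would then be handed to exactly one task, which is the one-to-one relationship described above.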
3.Partitioner
abstract class Partitioner extends Serializable {
  def numPartitions: Int // number of partitions
  def getPartition(key: Any): Int // maps a key to the partition it should go to
}
Partitioner applies only to key-value data. It has two subclasses: HashPartitioner and RangePartitioner.
<1>HashPartitioner
The target partition is the key's hash code modulo the number of partitions.
class HashPartitioner(partitions: Int) extends Partitioner {
  def numPartitions = partitions
  def getPartition(key: Any): Int = {
    if (key == null) { // a null key always goes to partition 0
      return 0
    } else {
      val mod = key.hashCode % partitions // key's hash code modulo the number of partitions
      if (mod < 0) {
        mod + partitions // hashCode can be negative, so shift a negative remainder into [0, partitions)
      } else {
        mod
      }
    }
  }
  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }
}
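The negative-remainder branch is easy to overlook, because on the JVM the % operator keeps the sign of the dividend. The sketch below repeats the arithmetic of getPartition as a standalone function; partitionFor is a name chosen here for illustration, not part of Spark.

```scala
object HashModDemo {
  // Same arithmetic as HashPartitioner.getPartition, without the class around it.
  def partitionFor(key: Any, partitions: Int): Int = {
    if (key == null) {
      0
    } else {
      val mod = key.hashCode % partitions
      if (mod < 0) {
        mod + partitions // shift a negative remainder into [0, partitions)
      } else {
        mod
      }
    }
  }

  def main(args: Array[String]): Unit = {
    println(-7 % 3)                // -1: the raw remainder is negative
    println(partitionFor(-7, 3))   // 2
    println(partitionFor(7, 3))    // 1
    println(partitionFor(null, 3)) // 0
  }
}
```

Without the adjustment, a key with a negative hash code would produce a negative partition index.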
<2>RangePartitioner
Suited to keys that can be ordered. A simulated example after the code should help with understanding.
class RangePartitioner[K <% Ordered[K]: ClassManifest, V](
    partitions: Int,
    @transient rdd: RDD[(K,V)],
    private val ascending: Boolean = true)
  extends Partitioner {
  // Step 1: determine the range bounds
  private val rangeBounds: Array[K] = {
    if (partitions == 1) { // a single partition needs no bounds, so return an empty array
      Array()
    } else {
      val rddSize = rdd.count() // total number of elements
      val maxSampleSize = partitions * 20.0 // target sample size
      val frac = math.min(maxSampleSize / math.max(rddSize, 1), 1.0)
      // sampling fraction = sample size / total count, capped at 1.0
      val rddSample = rdd.sample(true, frac, 1).map(_._1).collect().sortWith(_ < _)
      // sample with replacement at that fraction, keep the keys, sort them ascending
      if (rddSample.length == 0) { // nothing was sampled, so return an empty array
        Array()
      } else {
        val bounds = new Array[K](partitions - 1) // array of length partitions - 1
        for (i <- 0 until partitions - 1) {
          // take evenly spaced elements of the sorted sample as the range bounds
          val index = (rddSample.length - 1) * (i + 1) / partitions
          bounds(i) = rddSample(index)
        }
        bounds
      }
    }
  }
  def numPartitions = partitions
  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K] // cast to the key type
    var partition = 0
    while (partition < rangeBounds.length && k > rangeBounds(partition)) {
      // advance while the key is greater than the current bound,
      // i.e. stop at the first bound that is >= the key
      partition += 1
    }
    // ascending or descending order; ascending by default
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_,_] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }
}
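Simulated example: 300 elements with keys 1 to 300, split into 6 partitions
#rangeBounds: returns an array of length 6 - 1 = 5. In the ideal case where the sorted sample is exactly the keys 1 to 300, the bounds are 50, 100, 150, 200, 250 (a real run samples roughly 6 * 20 = 120 keys at random, so the bounds only approximate these values)
#getPartition: compares the given key with the bounds array to decide the partition. In ascending order, keys 1-50 go to partition 0, keys 51-100 to partition 1, and so on up to keys 251-300 in partition 5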
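The example can be checked with a few lines of standalone Scala. This is a sketch under one loud assumption: random sampling is skipped and the full sorted key range 1 to 300 is used as the "sample", which is the ideal case described above. The names bounds and partitionFor are chosen here for illustration; only the index arithmetic and the scan are taken from the listing.

```scala
object RangeBoundsDemo {
  // Same index arithmetic as rangeBounds, applied to an already sorted sample.
  // Assumption: the caller passes the sorted sample; no random sampling happens here.
  def bounds(sortedSample: Array[Int], partitions: Int): Array[Int] = {
    if (partitions == 1 || sortedSample.isEmpty) {
      Array.empty[Int]
    } else {
      Array.tabulate(partitions - 1) { i =>
        sortedSample((sortedSample.length - 1) * (i + 1) / partitions)
      }
    }
  }

  // Same scan as getPartition: advance until a bound is >= the key.
  def partitionFor(key: Int, rangeBounds: Array[Int], ascending: Boolean = true): Int = {
    var partition = 0
    while (partition < rangeBounds.length && key > rangeBounds(partition)) {
      partition += 1
    }
    if (ascending) partition else rangeBounds.length - partition
  }

  def main(args: Array[String]): Unit = {
    val b = bounds((1 to 300).toArray, 6)
    println(b.mkString(","))                                   // 50,100,150,200,250
    println(Seq(1, 50, 51, 300).map(k => partitionFor(k, b)))  // List(0, 0, 1, 5)
  }
}
```

Note that a key equal to a bound (for example 50) stays in the lower partition, because the loop only advances while the key is strictly greater than the bound.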