Spark Source Code (1): RDD

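Spark's core design has not changed much over the years. If you want a quick look at the underlying implementation, read the early source code, for example the branch-0.5 branch, which is available on GitHub at https://github.com/apache/spark/tree/branch-0.5. Compared with the Spark 2.x code base, which is huge and spread over dozens of packages, the early code has fewer comments but is not frustrating to read.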

This first post in the Spark source code series covers RDD, Split and Partitioner. All code shown is the early code and may differ from Spark 2.x.

1.RDD

/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel.
 * Each RDD is characterized by five main properties:
 * - A list of splits (partitions)
 * - A function for computing each split
 * - A list of dependencies on other RDDs
 * - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 * - Optionally, a list of preferred locations to compute each split on (e.g. block locations for HDFS)
 * All the scheduling and execution in Spark is done based on these methods, allowing each RDD to
 * implement its own way of computing itself.
*/

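This is the official comment. An RDD is Spark's basic data abstraction: an immutable, partitioned collection of elements that can be operated on in parallel.

Each RDD has five main properties:

1. A list of partitions:

Spark cuts the data into n pieces, each piece being one partition. The number of partitions is either specified by the user or defaults to the largest partition count among the parent RDDs (there may be several parents).

2. A function for computing each partition:

Most RDD subclasses override this method, which is why there is such a rich set of operators.

3. A list of the RDDs it depends on

4. Optionally, a partitioner, for key-value RDDs

5. Optionally, a list of preferred locations for computing each partition (e.g. if the data is on the local node, compute it there; otherwise copy it over from another node first)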

abstract class RDD[T: ClassManifest](@transient sc: SparkContext) extends Serializable {
  
  def splits: Array[Split] // returns the array of partitions
  def compute(split: Split): Iterator[T] // computes one partition

  @transient val dependencies: List[Dependency[_]] // list of dependencies; @transient means the field is not serialized
  

  val partitioner: Option[Partitioner] = None // partitioner, if any

  // Optionally overridden by subclasses to specify placement preferences
  def preferredLocations(split: Split): Seq[String] = Nil // preferred locations
  
  def context = sc
  
  // Get a unique ID for this RDD
  val id = sc.newRddId() // get a new RDD id
  

  private var shouldCache = false // whether to cache; false by default
  

  def cache(): RDD[T] = { // mark this RDD to be cached
    shouldCache = true
    this
  }
  
  // Returns an iterator over a split's data; if caching is enabled, the cache tracker returns the cached data or computes it
  final def iterator(split: Split): Iterator[T] = {
    if (shouldCache) {
      SparkEnv.get.cacheTracker.getOrCompute[T](this, split)
    } else {
      compute(split)
    }
  }
// What follows are the operators built on RDD subclasses; a separate post will cover them
// ......
}
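
To make the interplay of cache() and iterator concrete, here is a minimal self-contained sketch. MiniRDD, MiniSplit and LocalRDD are hypothetical stand-ins made up for this post, and the in-memory map is only an assumed substitute for SparkEnv.get.cacheTracker, which is not shown here and may well do more.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration; these are not the Spark classes.
object MiniRddSketch {
  final case class MiniSplit(index: Int)

  abstract class MiniRDD[T] {
    def splits: Array[MiniSplit]
    def compute(split: MiniSplit): Iterator[T]

    private var shouldCache = false
    // Assumed stand-in for the cache tracker: computed data keyed by split index.
    private val cached = mutable.Map.empty[Int, Vector[T]]

    def cache(): MiniRDD[T] = { shouldCache = true; this }

    // Same branching as RDD.iterator: go through the cache only when asked to.
    final def iterator(split: MiniSplit): Iterator[T] =
      if (shouldCache) cached.getOrElseUpdate(split.index, compute(split).toVector).iterator
      else compute(split)
  }

  // Cuts a local vector into numSlices contiguous slices and counts compute() calls.
  class LocalRDD(data: Vector[Int], numSlices: Int) extends MiniRDD[Int] {
    var computeCalls = 0
    def splits: Array[MiniSplit] = Array.tabulate(numSlices)(MiniSplit(_))
    def compute(split: MiniSplit): Iterator[Int] = {
      computeCalls += 1
      val from = split.index * data.length / numSlices
      val until = (split.index + 1) * data.length / numSlices
      data.slice(from, until).iterator
    }
  }

  def main(args: Array[String]): Unit = {
    val rdd = new LocalRDD(Vector(1, 2, 3, 4, 5, 6), 2)
    rdd.cache()
    val first = rdd.splits(0)
    println(rdd.iterator(first).toList) // first access computes the split
    println(rdd.iterator(first).toList) // second access is served from the map
    println(rdd.computeCalls)           // number of times compute() actually ran
  }
}
```

Without the call to cache(), every call to iterator would run compute again, which is the default behaviour in the class above.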

2.Split

A Split corresponds one-to-one with a partition and with a task.

trait Split extends Serializable {

  val index: Int // index of the split
  override def hashCode(): Int = index
}

3.Partitioner

abstract class Partitioner extends Serializable {
  def numPartitions: Int // number of partitions
  def getPartition(key: Any): Int // the partition a given key should go to
}

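Partitioner is only used for key-value data. It has two subclasses: HashPartitioner and RangePartitioner.

<1> HashPartitioner

Takes the key's hash code modulo the number of partitions to decide which partition a record goes to.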
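
The modulo rule, including the correction for negative hash codes, can be tried in isolation. The sketch below copies the arithmetic of getPartition into a standalone function; HashPartitionSketch and partitionFor are names made up for this example.

```scala
object HashPartitionSketch {
  // Same arithmetic as HashPartitioner.getPartition, as a standalone function.
  def partitionFor(key: Any, partitions: Int): Int =
    if (key == null) 0
    else {
      val mod = key.hashCode % partitions
      // On the JVM, % keeps the sign of the dividend, so mod can be negative.
      if (mod < 0) mod + partitions else mod
    }

  def main(args: Array[String]): Unit = {
    val partitions = 4
    // Int keys hash to themselves, so the arithmetic is easy to follow by hand.
    for (key <- Seq[Any](10, -7, null))
      println(s"$key -> partition ${partitionFor(key, partitions)}")
  }
}
```

With 4 partitions, key 10 lands in partition 2. Key -7 gives -7 % 4 = -3, which the correction turns into partition 1, and a null key goes to partition 0.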

class HashPartitioner(partitions: Int) extends Partitioner {
  def numPartitions = partitions

  def getPartition(key: Any): Int = {
    if (key == null) { // a null key always goes to partition 0
      return 0
    } else {
      val mod = key.hashCode % partitions // hash code modulo the number of partitions
      if (mod < 0) {
        mod + partitions // hashCode can be negative, which makes % negative; shift it back into range
      } else {
        mod 
      }
    }
  }
  
  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }
}

<2> RangePartitioner

Suited to keys that can be ordered. I walk through a simulated example after the code to make it concrete.

class RangePartitioner[K <% Ordered[K]: ClassManifest, V](
    partitions: Int,
    @transient rdd: RDD[(K,V)],
    private val ascending: Boolean = true) 
  extends Partitioner {

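  // Step 1: compute the range boundaries
  private val rangeBounds: Array[K] = {
    if (partitions == 1) { // with a single partition, return an empty array
      Array()
    } else {
      val rddSize = rdd.count() // total number of elements
      val maxSampleSize = partitions * 20.0 // maximum sample size: 20 per partition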

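      val frac = math.min(maxSampleSize / math.max(rddSize, 1), 1.0)
      // sampling fraction = sample size / total element count, capped at 1.0

      val rddSample = rdd.sample(true, frac, 1).map(_._1).collect().sortWith(_ < _)
      // sample at that fraction, keep only the keys, sort them in ascending order
    
      if (rddSample.length == 0) { // nothing was sampled: return an empty array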
        Array()
      } else {
        val bounds = new Array[K](partitions - 1) // array of length partitions - 1
        for (i <- 0 until partitions - 1) {
          // pick evenly spaced sample keys as the boundaries
          val index = (rddSample.length - 1) * (i + 1) / partitions 
          bounds(i) = rddSample(index)
        }
        bounds
      }
    }
  }

  def numPartitions = partitions

  def getPartition(key: Any): Int = {
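    val k = key.asInstanceOf[K] // cast the key to K
    var partition = 0
    while (partition < rangeBounds.length && k > rangeBounds(partition)) {
      // keep advancing while the key is greater than the current boundary
      partition += 1
    }
    // ascending or descending order; ascending by default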
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_,_] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }
}

Simulated example: 300 elements, the keys 1 to 300, split into 6 partitions

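# rangeBounds: returns an array of length 6 - 1 = 5. With evenly spread samples it would be 50, 100, 150, 200, 250; because the bounds come from a random sample (fraction 120 / 300 = 0.4 here), the actual values will only be close to these.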
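
The figures in this example can be checked with a few lines of Scala. The sketch below reuses the two formulas from rangeBounds; RangeBoundsSketch, sampleFraction and bounds are names made up here. As an idealizing assumption, bounds is fed the full sorted key set 1 to 300 instead of a random sample, which is what produces the round numbers.

```scala
object RangeBoundsSketch {
  // Fraction used for sampling: at most 20 sampled keys per partition, capped at 1.0.
  def sampleFraction(rddSize: Long, partitions: Int): Double = {
    val maxSampleSize = partitions * 20.0
    math.min(maxSampleSize / math.max(rddSize, 1), 1.0)
  }

  // Picks partitions - 1 evenly spaced keys from an already sorted sample.
  def bounds(sortedSample: Array[Int], partitions: Int): Array[Int] =
    if (partitions == 1 || sortedSample.isEmpty) Array.empty[Int]
    else Array.tabulate(partitions - 1) { i =>
      sortedSample((sortedSample.length - 1) * (i + 1) / partitions)
    }

  def main(args: Array[String]): Unit = {
    println(sampleFraction(300L, 6))
    // Assumption: the sorted sample is the full key set 1..300.
    println(bounds((1 to 300).toArray, 6).mkString(", "))
  }
}
```

The index formula gives positions 49, 99, 149, 199 and 249 in the sorted array, that is the keys 50, 100, 150, 200 and 250.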

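# getPartition: compares the incoming key with that array to decide which partition it goes to: keys up to 50 go to partition 0, 51 to 100 to partition 1, and so on, with keys above 250 in partition 5.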
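
The lookup itself can be sketched the same way. RangeLookupSketch and partitionFor are hypothetical names, the keys are plain Int instead of the generic K, and the bounds are the idealized 50, 100, 150, 200, 250 from the example.

```scala
object RangeLookupSketch {
  // Linear scan over the bounds, as in getPartition; descending order mirrors the index.
  def partitionFor(key: Int, rangeBounds: Array[Int], ascending: Boolean = true): Int = {
    var partition = 0
    while (partition < rangeBounds.length && key > rangeBounds(partition)) {
      partition += 1
    }
    if (ascending) partition else rangeBounds.length - partition
  }

  def main(args: Array[String]): Unit = {
    val bounds = Array(50, 100, 150, 200, 250) // idealized bounds from the example
    for (key <- Seq(1, 50, 51, 175, 300))
      println(s"key $key -> partition ${partitionFor(key, bounds)}")
    // With ascending = false the largest keys end up in partition 0.
    println(partitionFor(300, bounds, ascending = false))
  }
}
```

A key equal to a boundary stays in the lower partition, because the loop only advances while the key is strictly greater than the boundary.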