RDD (Resilient Distributed Dataset)

RDD is an abstract class. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

A list of partitions

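An RDD is made up of a number of partitions. With an ordinary collection that cannot be partitioned, all of the data has to be processed on a single node. Because an RDD is partitioned, its elements can be stored and processed on different machines, which usually improves performance. The corresponding method is getPartitions: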

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
  protected def getPartitions: Array[Partition]
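
To make the index property in the comment above concrete, here is a minimal sketch in plain Scala with no Spark dependency. SimplePartition and PartitionSketch are hypothetical names invented for illustration; they are not Spark classes and this is not how any particular RDD subclass builds its partitions.

```scala
// Simplified model, not Spark code: SimplePartition is a hypothetical stand-in for Partition.
final case class SimplePartition(index: Int, data: Vector[Int])

object PartitionSketch {
  // Split `data` into `numSlices` contiguous slices, one per partition.
  def slice(data: Vector[Int], numSlices: Int): Array[SimplePartition] = {
    require(numSlices > 0, "numSlices must be positive")
    (0 until numSlices).map { i =>
      val start = ((i.toLong * data.length) / numSlices).toInt
      val end = (((i + 1).toLong * data.length) / numSlices).toInt
      SimplePartition(i, data.slice(start, end))
    }.toArray
  }

  // The property quoted in the getPartitions comment: position in the array == partition.index.
  def indicesAreConsistent(parts: Array[SimplePartition]): Boolean =
    parts.zipWithIndex.forall { case (partition, index) => partition.index == index }
}
```

Each slice could then live on a different machine; the index is what lets the rest of the system refer to "partition i" unambiguously.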

A function for computing each split/partition

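This property is what lets an RDD be processed in parallel. A function applied to an RDD is applied to every partition, and there are as many tasks as there are partitions. In other words, computing on an RDD means computing on each of its splits. The corresponding method is compute: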

  /**
   * :: DeveloperApi ::
   * Implemented by subclasses to compute a given partition.
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]
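
The "one task per partition" idea can be sketched in plain Scala, using a Future as a stand-in for a task. ComputeSketch and its methods are hypothetical names, and the real compute also receives a TaskContext, which is omitted here.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ComputeSketch {
  // Hypothetical stand-in for compute(split, context): an iterator over one partition's results.
  def compute(split: Vector[Int], f: Int => Int): Iterator[Int] =
    split.iterator.map(f)

  // Start one "task" (here a Future) per partition and collect the results in partition order.
  def runAll(partitions: Vector[Vector[Int]], f: Int => Int): Vector[Vector[Int]] = {
    val tasks = partitions.map(p => Future(compute(p, f).toVector))
    Await.result(Future.sequence(tasks), 10.seconds)
  }
}
```

Three partitions produce three futures, which mirrors the statement that the number of tasks equals the number of partitions.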

A list of dependencies on other RDDs

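Suppose a map operation is applied to RDD1. The result, RDD2, is a new RDD. Every RDD depends on its parent RDDs; it may have one parent or several. An RDD records these dependencies, and this is what provides fault tolerance: if an operation on an in-memory RDD fails or its data is lost, the dependencies can be followed to recompute it. The corresponding method is getDependencies: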

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps
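
The recompute-from-dependencies idea can be illustrated with a toy model in plain Scala. MiniRDD, SourceRDD, MappedRDD and LineageSketch are hypothetical names for this sketch only; real RDDs track Dependency objects per partition, which this model does not attempt.

```scala
// Simplified model of lineage, not Spark's implementation.
sealed trait MiniRDD {
  def dependencies: Seq[MiniRDD]
  def compute(): Vector[Int]
}

final case class SourceRDD(data: Vector[Int]) extends MiniRDD {
  val dependencies: Seq[MiniRDD] = Nil
  def compute(): Vector[Int] = data
}

final case class MappedRDD(parent: MiniRDD, f: Int => Int) extends MiniRDD {
  val dependencies: Seq[MiniRDD] = Seq(parent)
  // Nothing is stored here: the result is rebuilt from the parent on every call,
  // so a "lost" result can always be recovered by calling compute() again.
  def compute(): Vector[Int] = parent.compute().map(f)
}

object LineageSketch {
  // Number of steps from this RDD back to a source with no parents.
  def depth(rdd: MiniRDD): Int =
    if (rdd.dependencies.isEmpty) 0 else 1 + rdd.dependencies.map(depth).max
}
```

Here MappedRDD(SourceRDD(...), f) plays the role of RDD2 derived from RDD1 by a map.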

Optionally, a Partitioner for key-value RDDs

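A key-value RDD can have a partitioner, which decides which partition each key is assigned to. It comes into play in operations that shuffle data, such as join and groupBy.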

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None

Note: MapReduce's default partitioning rule takes the hash of the key modulo the number of reduce tasks: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
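
The arithmetic of that rule can be tried out directly. The sketch below only reproduces the formula in Scala; HashPartitionSketch and partitionFor are hypothetical names, not part of Hadoop or Spark.

```scala
object HashPartitionSketch {
  // Clear the sign bit so the value is non-negative, then take the remainder.
  def partitionFor(key: Any, numReduceTasks: Int): Int = {
    require(numReduceTasks > 0, "numReduceTasks must be positive")
    (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
  }
}
```

The bitwise AND matters because hashCode() can be negative, and a negative remainder would not be a valid partition number. Equal keys always land in the same partition, which is what a join or group-by relies on.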

Optionally, a list of preferred locations to compute each split on

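When a computation is scheduled, each task is preferably placed on a node that holds the block containing its data. If that node is busy, the scheduler waits for a while; if it is not busy, the task starts there and runs in parallel with the other tasks.

For example, suppose a file consists of a single block with three replicas, stored on machines 1, 2 and 3. The task is preferably launched on one of these three machines, and if they are not too busy the computation runs there. If all three are busy and the task is placed on machine 4 or some other machine, that machine does not have the block, so a copy has to be transferred to it and data locality is lost.

For this reason the usual choice is not to move the data to an idle machine but to schedule the job on the machines where the data lives and keep the computation local: move the computation, not the data.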

  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
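
The placement decision described in this section can be sketched as a small function in plain Scala. LocalitySketch and chooseHost are hypothetical names; the sketch leaves out the waiting step entirely, and the real scheduler is considerably more involved.

```scala
object LocalitySketch {
  // Pick a host for a task and report whether the placement is data-local.
  // preferred: hosts holding a replica of the block; freeHosts: hosts that can run a task now.
  def chooseHost(preferred: Seq[String], freeHosts: Set[String]): Option[(String, Boolean)] =
    preferred.find(freeHosts.contains) match {
      // A host with the block is free: run there, no data movement needed.
      case Some(host) => Some((host, true))
      // No such host is free: fall back to any free host; the block must be copied to it.
      case None => freeHosts.toSeq.sorted.headOption.map(host => (host, false))
    }
}
```

With replicas on host1, host2 and host3 and only host2 and host4 free, the function picks host2 and reports a local placement. If only host4 is free it picks host4 and reports a non-local one.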

The five main properties of an RDD