Spark源码之checkpoint方法解析

转载自：https://blog.csdn.net/do_yourself_go_on/article/details/74946288

今天在阅读Spark源码的时候看到了checkpoint方法，之前也在处理数据的的时候用到过，但是没有深入理解这个方法，今天结合官方文档以及网上博客重新认识了一下这个方法，这里做个总结。主要从两个方面讲解：
1.官方对这个方法的解释
2.这个方法的使用场景

checkpoint官方源码以及解释

  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its  parent RDDs will be removed.
   * 给这个RDD做checkpoint,之后这个RDD会以file的形式保存在之前设定的checkpoint目录下面并且
   * 此RDD之前的父RDD们会被删除
   * This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   * 这个方法必须在RDD执行任何Job之前被使用。并且这里强烈推荐在checkpoint之前应该将该RDD持久化到内存中去，否则将它保存为一个文件会做重复计算的多余工作
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    //这里可以看出，在使用checkpoint方法之前必须要设置checkpoint的保存目录，否则会抛出异常
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }

从上面的解释可以看出，checkpoint方法是一个action算子，会触发一个job，所以在执行该方法前需要将RDD持久化，不然执行下一个job的时候会重新计算该RDD，造成时间效率较低。

应用场景

checkpoint方法其实是将RDD之前的血统lineage关系持久化到磁盘上去，这样在计算过程中如果出现错误情况，可以直接读取磁盘上的持久化的RDD，而避免了之前的许多重复计算，也是容错的一种手段，特别适用于迭代次数很多，数据量较大的计算场景。试想一下，如果你对大量的数据做10000此迭代处理，在第9998次迭代出现错误，如果之前没有做checkpoint处理的话，那么根据RDD的自我容错机制会重新计算出现错误的分区，这样也就是说需要重新进行之前的9998次迭代计算，势必会造成时间效率低下；如果之前在9997次做了checkpoint的话，那么即使在第9998次出现错误，程序会加载第9997次保存的RDD然后再从第9998次重新计算即可，不需要再计算前面那么多次的迭代了，极大的节省的计算时间，所以当迭代次数很多，数据量较大的时候，适当的做checkpoint是很有必要的。