Spark (9): The checkpoint read and write flow

RDD.checkpoint

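In a Spark computation, when the DAG is very long, the cluster has to compute the whole DAG to produce a result. If intermediate data is lost partway through this long computation, Spark recomputes everything from the start by following the RDD dependencies, which wastes compute resources and takes a long time. A checkpoint saves an important intermediate result in the DAG to a highly available store (such as HDFS). When a downstream RDD computation fails, it can read the data directly from the checkpointed RDD and continue from there.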

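With large data volumes, saving and reading RDD data is itself expensive. Whether to restore data through the checkpoint mechanism or simply recompute the RDD is therefore a trade-off that has to be weighed for each real workload.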

checkpoint() is lazy, much like a transformation: it only marks the RDD, and an action is needed to trigger the actual checkpoint. Usually we call cache first and then checkpoint.
Usage:

  sc.setCheckpointDir("hdfs://m1:9000/checkpoint0518")
  val rdd = sc.parallelize(1 to 10).map(x => (x % 3, 1)).reduceByKey(_ + _)
  // then
  rdd.cache()
  rdd.checkpoint()
  rdd.collect()

Differences between checkpoint and persist/cache

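The most basic difference is that checkpointing an RDD removes its lineage: once the checkpoint has been written, the references to its parent RDDs are dropped.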

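With persist, the RDD's lineage stays intact even when the data is served from the cache at its StorageLevel. This means that if some partitions of the RDD are lost, they can be recomputed from the lineage. After a checkpoint, the RDD no longer has its lineage, so it cannot be rebuilt from its parents and is read back from the checkpoint files instead.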

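Second, the checkpoint is computed in a separate job, so it is strongly recommended to persist an RDD in memory before checkpointing it. Otherwise, when the action triggers the checkpoint and the RDD is saved to the highly available file system, the RDD is computed a second time. With cache() before checkpoint(), the data that was just cached in memory is written to HDFS and the recomputation is avoided.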

Finally, checkpointed data is persistent and is not removed after the SparkContext is destroyed.

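Note:
RDD checkpointing and the checkpoint in Spark Streaming are entirely different concepts. RDD checkpointing is designed to address lineage problems, while the checkpoint in Spark Streaming is all about high availability and failure recovery for streaming.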

The checkpoint write flow

setCheckpointDir()

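This simply uses the Hadoop API to create a directory on the target file system (HDFS in the usage example above): a UUID-named subdirectory under the given path.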

  def setCheckpointDir(directory: String) {
    //
    if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
      logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
        s"must not be on the local filesystem. Directory '$directory' " +
        "appears to be on the local filesystem.")
    }

    checkpointDir = Option(directory).map { dir =>
      val path = new Path(dir, UUID.randomUUID().toString)
      val fs = path.getFileSystem(hadoopConfiguration)
      fs.mkdirs(path)
      fs.getFileStatus(path).getPath.toString
    }
  }
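
The effect of the UUID subdirectory can be sketched without Spark or Hadoop. The snippet below is only an illustration that uses java.nio on the local file system; the object name CheckpointDirSketch and the method createCheckpointDir are hypothetical and are not part of Spark.

```scala
import java.nio.file.{Files, Path}
import java.util.UUID

// Spark-free sketch: every call creates a fresh UUID-named subdirectory,
// so two callers sharing one base path never write into the same directory.
object CheckpointDirSketch {
  def createCheckpointDir(base: Path): Path =
    Files.createDirectories(base.resolve(UUID.randomUUID().toString))

  def main(args: Array[String]): Unit = {
    val base = Files.createTempDirectory("checkpoint-base")
    println(createCheckpointDir(base))
    println(createCheckpointDir(base))
  }
}
```

Because the subdirectory name is random, pointing several applications at the same base path should not make them collide.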

rdd.checkpoint()

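Its purpose is to mark this RDD for checkpointing without executing anything immediately. checkpoint() must be called before any job has been executed on this RDD.
Note the comment in the source.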

  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }

The ReliableRDDCheckpointData created here extends RDDCheckpointData, whose internal cpState takes its values from the CheckpointState enumeration: Initialized, CheckpointingInProgress, Checkpointed. The initial value is Initialized.

An RDD has to go through the stages
  [ Initialized --> CheckpointingInProgress --> Checkpointed ]
to complete a checkpoint.

The first action run on this RDD afterwards triggers the actual checkpoint, inside runJob().
Take RDD's collect() as an example:

  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

runJob() triggers the checkpoint

When SparkContext executes runJob(), its last step is a call to rdd.doCheckpoint().

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

RDD.doCheckpoint()

If checkpointData is defined, meaning rdd.checkpoint() was called earlier to mark the RDD (which created checkpointData), the checkpoint is carried out.

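checkpointAllMarkedAncestors is read from the property "spark.checkpoint.checkpointAllMarkedAncestors". It controls whether this job also processes every ancestor that has been marked for checkpointing. The default is false, in which case a marked RDD checkpoints only itself and does not descend into its ancestors.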

  private[spark] def doCheckpoint(): Unit = {
    RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (checkpointData.isDefined) {
          if (checkpointAllMarkedAncestors) {
            // TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint
            // them in parallel.
            // Checkpoint parents first because our lineage will be truncated after we checkpoint ourselves
            // Walk the RDDs we depend on and call doCheckpoint on each of them
            dependencies.foreach(_.rdd.doCheckpoint())
          }
          checkpointData.get.checkpoint()  // the actual checkpoint operation
        } else {
          dependencies.foreach(_.rdd.doCheckpoint())
        }
      }
    }
  }

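If checkpointData has been created, checkpointData's checkpoint() is called, that is, the checkpoint() of the ReliableRDDCheckpointData instance. With the default setting the ancestors are not visited first; they are only visited when checkpointAllMarkedAncestors is true.
If checkpointData has not been created, doCheckpoint is called on the RDD of every dependency instead.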
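
The traversal can be replayed on a toy graph to see which marked RDDs a single job actually checkpoints. This is a minimal sketch with hypothetical names (DoCheckpointSketch, Node, run), not Spark code; it assumes a linear lineage a -> b -> c in which a and b are marked and the action runs on c.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model of RDD.doCheckpoint(): which marked nodes does one job checkpoint?
object DoCheckpointSketch {
  final class Node(val name: String, val parents: Seq[Node], val marked: Boolean) {
    private var called = false

    def doCheckpoint(allMarkedAncestors: Boolean, log: ArrayBuffer[String]): Unit = {
      if (!called) {
        called = true
        if (marked) {
          // Ancestors go first, because this node's lineage is cut afterwards.
          if (allMarkedAncestors) parents.foreach(_.doCheckpoint(allMarkedAncestors, log))
          log += name // stands in for checkpointData.get.checkpoint()
        } else {
          parents.foreach(_.doCheckpoint(allMarkedAncestors, log))
        }
      }
    }
  }

  // Builds a -> b -> c with a and b marked, then starts the traversal at c.
  def run(allMarkedAncestors: Boolean): List[String] = {
    val a = new Node("a", Nil, marked = true)
    val b = new Node("b", Seq(a), marked = true)
    val c = new Node("c", Seq(b), marked = false)
    val log = ArrayBuffer.empty[String]
    c.doCheckpoint(allMarkedAncestors, log)
    log.toList
  }

  def main(args: Array[String]): Unit = {
    println(run(allMarkedAncestors = false))
    println(run(allMarkedAncestors = true))
  }
}
```

With the flag off, the traversal stops at the first marked node it meets (b), so a is not checkpointed by this job. With the flag on, a is checkpointed before b.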

Here cpState is changed from Initialized to CheckpointingInProgress, and the work begins.
ReliableRDDCheckpointData's doCheckpoint is called and returns the new RDD.
When it finishes, the state is set to Checkpointed.

  final def checkpoint(): Unit = {
    // Guard against multiple threads checkpointing the same RDD by
    // atomically flipping the state of this RDDCheckpointData
    // Mark the state as "in progress"
    RDDCheckpointData.synchronized {
      if (cpState == Initialized) {
        cpState = CheckpointingInProgress
      } else {
        return
      }
    }
    // Call doCheckpoint() of the subclass ReliableRDDCheckpointData
    val newRDD = doCheckpoint()

    // Update our state and truncate the RDD lineage
    // That is, set the state to "done" and clear the dependencies.
    RDDCheckpointData.synchronized {
      cpRDD = Some(newRDD)
      cpState = Checkpointed
      rdd.markCheckpointed()
    }
  }
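
The guard in checkpoint() can be modelled without Spark to show why a second call does nothing. The following is a simplified sketch, not the Spark class: CheckpointStateSketch, CheckpointData and the write function are hypothetical stand-ins, and a String label stands in for the new RDD.

```scala
// Simplified, Spark-free model of the state machine in RDDCheckpointData.
object CheckpointStateSketch {
  object State extends Enumeration {
    val Initialized, CheckpointingInProgress, Checkpointed = Value
  }

  // `write` stands in for doCheckpoint(); it returns a label for the "new RDD".
  final class CheckpointData(write: () => String) {
    private val lock = new Object
    private var state: State.Value = State.Initialized
    private var result: Option[String] = None

    def currentState: State.Value = lock.synchronized { state }
    def checkpointed: Option[String] = lock.synchronized { result }

    def checkpoint(): Unit = {
      // Atomically flip Initialized -> CheckpointingInProgress; later callers skip the write.
      val proceed = lock.synchronized {
        if (state == State.Initialized) {
          state = State.CheckpointingInProgress
          true
        } else false
      }
      if (proceed) {
        val newRdd = write() // the expensive write happens outside the lock
        lock.synchronized {
          result = Some(newRdd)
          state = State.Checkpointed
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    var writes = 0
    val data = new CheckpointData(() => { writes += 1; "checkpoint-rdd" })
    data.checkpoint()
    data.checkpoint() // no-op: the state is no longer Initialized
    println(s"state=${data.currentState}, writes=$writes")
  }
}
```

Only the state flips are done under the lock; the write itself runs outside it, which mirrors the structure of the method above.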

Writing the data to the checkpoint directory

doCheckpoint() in ReliableRDDCheckpointData

It writes the RDD to the checkpoint directory and returns a ReliableCheckpointRDD as newRDD.

The write itself is done by submitting a separate job to the cluster.

  protected override def doCheckpoint(): CheckpointRDD[T] = {
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

    // Optionally clean our checkpoint files if the reference is out of scope
    if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
      rdd.context.cleaner.foreach { cleaner =>
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }

    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }


writeRDDToCheckpointDirectory in ReliableCheckpointRDD does the actual writing of the RDD into the directory ${checkpointDir}/uuid/rdd-xx.

It calls runJob on the SparkContext of the original RDD (the one being checkpointed) to run a job that executes writePartitionToCheckpointFile().

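Each partition of the RDD is written to its own file:
When writing, a temporary file is created together with a buffer whose size comes from ("spark.buffer.size", 65536).
A serialization stream obtained from env.serializer wraps the output stream opened on the Hadoop file system, and its writeAll() performs the write.
Once the write is complete, fs.rename turns the temporary file into the final file.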

After the write, the files in the checkpoint directory are loaded as a new RDD, a ReliableCheckpointRDD that carries over the original RDD's partitioner,
and this newRDD is returned.

  def writeRDDToCheckpointDirectory[T: ClassTag](
      originalRDD: RDD[T],
      checkpointDir: String,
      blockSize: Int = -1): ReliableCheckpointRDD[T] = {
    val checkpointStartTimeNs = System.nanoTime()

    val sc = originalRDD.sparkContext

    // Create the output path for the checkpoint
    val checkpointDirPath = new Path(checkpointDir)
    val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
    if (!fs.mkdirs(checkpointDirPath)) {
      throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
    }

    // Save to file, and reload it as an RDD
    // Broadcast sc.hadoopConfiguration to the executors that run this job
    val broadcastedConf = sc.broadcast(
      new SerializableConfiguration(sc.hadoopConfiguration))
    // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
    sc.runJob(originalRDD,
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

    // If the original RDD has a partitioner, write it to the checkpoint directory as well
    if (originalRDD.partitioner.nonEmpty) {
      writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
    }

    val checkpointDurationMs =
      TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - checkpointStartTimeNs)
    logInfo(s"Checkpointing took $checkpointDurationMs ms.")

    val newRDD = new ReliableCheckpointRDD[T](
      sc, checkpointDirPath.toString, originalRDD.partitioner)
    if (newRDD.partitions.length != originalRDD.partitions.length) {
      throw new SparkException(
        "Checkpoint RDD has a different number of partitions from original RDD. Original " +
          s"RDD [ID: ${originalRDD.id}, num of partitions: ${originalRDD.partitions.length}]; " +
          s"Checkpoint RDD [ID: ${newRDD.id}, num of partitions: " +
          s"${newRDD.partitions.length}].")
    }
    newRDD
  }
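
The "write to a temporary file, then rename" step for a single partition can be sketched with plain java.nio. This is an illustration only: it uses Java serialization and the local file system instead of env.serializer and the Hadoop FileSystem, and the names PartitionFileSketch, partFile, writePartition and readPartition are hypothetical.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import java.nio.file.{Files, Path, StandardCopyOption}

// Spark-free sketch of writing one partition via a temporary file and a rename.
object PartitionFileSketch {
  // File naming is an assumption of this sketch (zero-padded partition index).
  def partFile(dir: Path, index: Int): Path = dir.resolve(f"part-$index%05d")

  def writePartition(dir: Path, index: Int, records: Iterator[String]): Path = {
    Files.createDirectories(dir)
    val finalPath = partFile(dir, index)
    val tempPath = Files.createTempFile(dir, s".part-$index-", ".tmp")
    val out = new ObjectOutputStream(Files.newOutputStream(tempPath))
    try {
      records.foreach(r => out.writeObject(r))
      out.writeObject(null) // end-of-data marker used by this sketch only
    } finally out.close()
    // Readers only ever see a complete file under the final name.
    Files.move(tempPath, finalPath, StandardCopyOption.REPLACE_EXISTING)
  }

  def readPartition(dir: Path, index: Int): List[String] = {
    val in = new ObjectInputStream(Files.newInputStream(partFile(dir, index)))
    try Iterator.continually(in.readObject()).takeWhile(_ != null).map(_.toString).toList
    finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("checkpoint-sketch")
    writePartition(dir, 0, Iterator("a", "b", "c"))
    println(readPartition(dir, 0))
  }
}
```

Writing under a temporary name first means a task that dies halfway leaves no partially written file under the final name, which is presumably why the rename is the last step.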

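In RDD.doCheckpoint(), the checkpoint() method of the member checkpointData (a ReliableRDDCheckpointData, which is an RDDCheckpointData) is called.
It assigns the newRDD that was finally written to checkpointData's cpRDD, which is exposed by the abstract class RDDCheckpointData: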

def checkpointRDD: Option[CheckpointRDD[T]] = RDDCheckpointData.synchronized { cpRDD }

So once this finishes, the RDD's checkpointData.checkpointRDD returns cpRDD, which is newRDD.

The checkpoint read flow

Reading a checkpoint is not triggered only when a task fails. So when are the checkpoint files read?

As covered in [Spark (8): How an Executor runs a task],
a task such as ShuffleMapTask calls rdd.iterator(partition, context) to run the chain of user-defined functions on each partition.

When RDD.iterator() runs the computation for a partition, it reads the cached partition if the RDD has been cached; otherwise it calls computeOrReadCheckpoint().

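**computeOrReadCheckpoint()** checks whether the RDD is isCheckpointedAndMaterialized. If so, it calls iterator() on the parent RDD,
**which is in fact CheckpointRDD.iterator()**, because after checkpointing the RDD's only parent is the CheckpointRDD;
otherwise it calls the current RDD's compute().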

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }
  
  /**
   * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
   */
  private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      compute(split, context)
    }
  }
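
The decision made by these two methods can be condensed into one small function. This is a toy model with hypothetical names (IteratorPathSketch, Source, source); the blockPresent flag is an assumption that stands for "the cached block is actually available", since a cache miss falls back to computing or reading the checkpoint.

```scala
// Toy model of the choice made by RDD.iterator() and computeOrReadCheckpoint().
object IteratorPathSketch {
  sealed trait Source
  case object FromCache extends Source
  case object FromCheckpoint extends Source
  case object Computed extends Source

  // cached:       storageLevel != StorageLevel.NONE
  // blockPresent: the cached block for this partition is available (assumption of this sketch)
  // checkpointed: isCheckpointedAndMaterialized
  def source(cached: Boolean, blockPresent: Boolean, checkpointed: Boolean): Source =
    if (cached && blockPresent) FromCache
    else if (checkpointed) FromCheckpoint
    else Computed

  def main(args: Array[String]): Unit = {
    println(source(cached = true, blockPresent = true, checkpointed = true))
    println(source(cached = true, blockPresent = false, checkpointed = true))
    println(source(cached = false, blockPresent = false, checkpointed = false))
  }
}
```

The order matters: the cache is consulted first, the checkpoint second, and a full compute() is the last resort.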

If the current RDD is itself a CheckpointRDD, the compute called here is CheckpointRDD.compute(),
that is, ReliableCheckpointRDD.compute(), and this is where the data is read.

  /**
   * Read the content of the checkpoint file associated with the given partition.
   */
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val file = new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index))
    ReliableCheckpointRDD.readCheckpointFile(file, broadcastedConf, context)
  }

CheckpointRDD reads the files on the file system, produces this RDD's partitions, and deserializes the data into an Iterator.

Calling compute returns that Iterator, over which the user functions for the partition are then applied.

  def readCheckpointFile[T](
      path: Path,
      broadcastedConf: Broadcast[SerializableConfiguration],
      context: TaskContext): Iterator[T] = {
    val env = SparkEnv.get
    val fs = path.getFileSystem(broadcastedConf.value.value)
    val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
    val fileInputStream = {
      val fileStream = fs.open(path, bufferSize)
      if (env.conf.get(CHECKPOINT_COMPRESS)) {
        CompressionCodec.createCodec(env.conf).compressedInputStream(fileStream)
      } else {
        fileStream
      }
    }
    val serializer = env.serializer.newInstance()
    val deserializeStream = serializer.deserializeStream(fileInputStream)

    // Register an on-task-completion callback to close the input stream.
    context.addTaskCompletionListener[Unit](context => deserializeStream.close())

    deserializeStream.asIterator.asInstanceOf[Iterator[T]]
  }