Checkpoint source code analysis

Checkpoint is a relatively advanced feature provided by Spark. It is aimed at cases where a Spark application is extremely complex: the chain from the initial RDD to the end of the application contains many steps, perhaps more than 20 transformation operations, and the whole application takes a long time to run, for example one to five hours.

In such cases the checkpoint feature is a good fit. In a particularly complex Spark application there is a high risk that some RDD which needs to be reused loses its data because of a node failure, even though it was persisted earlier. Without any additional fault-tolerance mechanism, a later transformation that uses this RDD will find (via the CacheManager) that the data is gone, and the RDD has to be recomputed from scratch.

In short, in this situation the fault tolerance of the whole Spark application is poor.

Therefore, for such complex Spark applications, which otherwise have no fault-tolerance mechanism for lost persisted data, you can use the checkpoint feature.

What does checkpointing mean? For a complex RDD chain, if we are worried that a key RDD which will be reused later might lose its persisted data because of a node failure, we can enable the checkpoint mechanism specifically for that RDD to achieve fault tolerance and high availability.

Checkpointing works as follows: first, call SparkContext's setCheckpointDir() method to set a directory on a fault-tolerant file system such as HDFS; then call checkpoint() on the RDD. After the job that contains the RDD finishes, a separate job is launched that writes the checkpointed RDD's data into the previously configured file system, producing a highly available, fault-tolerant persistent copy.
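A minimal usage sketch (the application name, master, HDFS paths and the RDD below are illustrative assumptions, not taken from the original post or the Spark source):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]")
  val sc = new SparkContext(conf)

  // 1. Set a checkpoint directory on a fault-tolerant file system such as HDFS
  sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")

  // Some RDD produced by a chain of transformations
  val rdd = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)

  // 2. Mark the RDD for checkpointing
  rdd.checkpoint()

  // 3. Run an action; after this job finishes, Spark launches a separate job
  //    that writes the RDD's partitions into the checkpoint directory
  rdd.count()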

This way, even if the RDD's persisted data is accidentally lost when the RDD is used later, its data can still be read directly from the checkpoint file (via the CacheManager) without recomputation.

1. How to checkpoint?
SparkContext.setCheckpointDir()
RDD.checkpoint()
2. Checkpoint principle analysis
3. The difference between Checkpoint and persistence: lineage changes
4. RDD.iterator(): read checkpoint data
5. Persist the RDD that is going to be checkpointed first (StorageLevel.DISK_ONLY); see the sketch after this list
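A sketch of point 5, continuing the example above (again with illustrative names): persisting with StorageLevel.DISK_ONLY before calling checkpoint() lets the separate checkpoint job read the partitions back from the BlockManager instead of recomputing the whole chain, and toDebugString shows the lineage truncated at the checkpoint afterwards.

  import org.apache.spark.storage.StorageLevel

  // Persist to disk before checkpointing, so the separate checkpoint job can
  // read the partitions from the BlockManager instead of recomputing the chain
  rdd.persist(StorageLevel.DISK_ONLY)
  rdd.checkpoint()
  rdd.count()

  // After the checkpoint job has run, the lineage is truncated: toDebugString
  // now shows a CheckpointRDD in place of the original parent chain
  println(rdd.toDebugString)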

RDD.iterator
  /**
   * Persist first, then checkpoint.
   * The first time execution reaches this RDD's iterator, it finds storageLevel != StorageLevel.NONE,
   * so it goes through the CacheManager to fetch the data. At this point the BlockManager has
   * nothing (it is the first execution), so the RDD's data is still computed once, and then
   * persisted via the BlockManager by CacheManager's putInBlockManager.
   * After the job containing this RDD finishes, a separate job is launched to perform the
   * checkpoint, which reaches this RDD's iterator again. The storage level is not NONE, so by
   * default the persisted data is read straight from the BlockManager (fine in the normal case).
   * If, however, the persisted data has been lost, execution falls back to
   * computeOrReadCheckpoint: when the RDD's isCheckpointed is true, it calls its parent RDD's
   * iterator method, which in effect reads the data from the external checkpoint file system.
   */
  
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    // getOrCompute: if the StorageLevel is not NONE, this RDD was persisted before, so do not
    // go straight back to the parent RDD to run the operator and compute a new partition;
    // first try the CacheManager to fetch the persisted data.
    if (storageLevel != StorageLevel.NONE) {
      SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
      // compute the RDD partition, or read it from the checkpoint
      computeOrReadCheckpoint(split, context)
    }
  }
=> computeOrReadCheckpoint
  private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    // If this RDD has been checkpointed, its first parent is a CheckpointRDD, so calling the
    // parent's iterator reads the checkpoint data; otherwise compute() runs the normal
    // computation (e.g. MapPartitionsRDD.compute).
    if (isCheckpointed) firstParent[T].iterator(split, context) else compute(split, context)
  }
==> CheckpointRDD.compute
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    // Map this partition's index to its checkpoint file and read the data back from it
    val file = new Path(checkpointPath, CheckpointRDD.splitIdToFile(split.index))
    CheckpointRDD.readFromFile(file, broadcastedConf, context)
  }
===> CheckpointRDD.readFromFile: read checkpoint data
  def readFromFile[T](
      path: Path,
      broadcastedConf: Broadcast[SerializableWritable[Configuration]],
      context: TaskContext
    ): Iterator[T] = {
    val env = SparkEnv.get
    val fs = path.getFileSystem(broadcastedConf.value.value)
    val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
    val fileInputStream = fs.open(path, bufferSize)
    val serializer = env.serializer.newInstance()
    val deserializeStream = serializer.deserializeStream(fileInputStream)

    // Register an on-task-completion callback to close the input stream.
    context.addTaskCompletionListener(context => deserializeStream.close())

    deserializeStream.asIterator.asInstanceOf[Iterator[T]]
  }

Origin blog.csdn.net/m0_46449152/article/details/109562062