How Spark removes invalid RDD checkpoints

Spark checkpointing writes an RDD's data to files on HDFS; the local cache subsystem can also be used (local checkpointing).
When we checkpoint an RDD to HDFS and the job keeps running for a long time, the temporary files are never removed, so HDFS slowly fills up with useless files. Spark takes this into account and solves the problem in a somewhat tricky way.

Spark config:

spark.cleaner.referenceTracking.cleanCheckpoints = false (default)

By default, checkpoint files that have been saved stay on the DFS unless they are deleted manually.
Everything below applies only when this value is set to true.
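
A minimal end-to-end sketch of how the flag is used (the application name, the HDFS path, and the data are placeholders; setCheckpointDir is covered in the next section):

import org.apache.spark.sql.SparkSession

// Enable checkpoint cleanup before the SparkContext is created.
val spark = SparkSession.builder()
  .appName("checkpoint-cleanup-demo")
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()

val sc = spark.sparkContext
sc.setCheckpointDir("hdfs://nameservice1/tmp/checkpoints")

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()    // mark the RDD for reliable checkpointing
rdd.count()         // the first action triggers the actual write to HDFS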

Checkpoint path

spark.sparkContext().setCheckpointDir("hdfs://nameservice1/xx/xx");

The benefit of storing checkpoints on HDFS is its built-in fault tolerance and high availability.
But if every application writes to this same path, won't they overwrite each other's files? The answer is no:

  /**
   * Set the directory under which RDDs are going to be checkpointed.
   * @param directory path to the directory where checkpoint files will be stored
   * (must be HDFS path if running in cluster)
   */
  def setCheckpointDir(directory: String) {

    // If we are running on a cluster, log a warning if the directory is local.
    // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
    // its own local file system, which is incorrect because the checkpoint files
    // are actually on the executor machines.
    if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
      logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
        s"must not be on the local filesystem. Directory '$directory' " +
        "appears to be on the local filesystem.")
    }

    checkpointDir = Option(directory).map { dir =>
      // Use a UUID to create a subdirectory; the checkpointed RDD files go under it
      val path = new Path(dir, UUID.randomUUID().toString)
      val fs = path.getFileSystem(hadoopConfiguration)
      fs.mkdirs(path)
      fs.getFileStatus(path).getPath.toString
    }
  }

Because the UUID is unique, different driver processes that share the same checkpoint root do not interfere with each other. Every subsequent checkpoint request creates files under this subdirectory to hold the RDD's contents.
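
Within that UUID directory, each RDD gets its own rdd-<id> subdirectory. In the Spark source this path is computed roughly as follows (a paraphrased sketch of ReliableRDDCheckpointData's helper, not a verbatim copy):

// Paraphrased sketch: every checkpointed RDD is written under
// <checkpointDir>/<uuid>/rdd-<id>
def checkpointPath(sc: SparkContext, rddId: Int): Option[Path] =
  sc.checkpointDir.map(dir => new Path(dir, s"rdd-$rddId"))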

The checkpoint itself is produced by ReliableRDDCheckpointData's doCheckpoint method.

Save Checkpoint

  /**
   * Materialize this RDD and write its content to a reliable DFS.
   * This is called immediately after the first action invoked on this RDD has completed.
   */
  protected override def doCheckpoint(): CheckpointRDD[T] = {
    // Write the RDD's contents to a reliable file
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

    // Optionally clean our checkpoint files if the reference is out of scope
    // Only when this is true (default: false) is a cleanup handler registered
    if (rdd.conf.get(CLEANER_REFERENCE_TRACKING_CLEAN_CHECKPOINTS)) {
      rdd.context.cleaner.foreach { cleaner =>
        // Register the cleanup event
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }

    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }
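
From the user's side, the effect of doCheckpoint can be observed after the first action. A small sketch, reusing the sc from the earlier example (the printed path is a placeholder):

val data = sc.parallelize(1 to 100)
data.checkpoint()                 // mark for reliable checkpointing
data.count()                      // first action: doCheckpoint() runs now
println(data.isCheckpointed)      // true
println(data.getCheckpointFile)   // Some(hdfs://.../<uuid>/rdd-<id>)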

Registering the cleanup

The registered cleanup event means: once no other object references the RDD, the cleaning thread asynchronously removes the corresponding checkpoint files.

  /** Register a RDDCheckpointData for cleanup when it is garbage collected. */
  def registerRDDCheckpointDataForCleanup[T](rdd: RDD[_], parentId: Int): Unit = {
    registerForCleanup(rdd, CleanCheckpoint(parentId))
  }

  /** Register an object for cleanup. */
  private def registerForCleanup(objectForCleanup: AnyRef, task: CleanupTask): Unit = {
    referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
  }

The role of referenceBuffer is to hold strong references to the CleanupTaskWeakReference objects themselves, so they are not garbage collected before their cleanup task has been processed.
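
In ContextCleaner, referenceBuffer is declared roughly like this (paraphrased): a concurrent set that keeps the weak-reference wrappers strongly reachable until they have been handled:

// Paraphrased from ContextCleaner: a concurrent set holding the
// CleanupTaskWeakReference wrappers so they are not collected before
// their cleanup task has been processed.
private val referenceBuffer =
  Collections.newSetFromMap[CleanupTaskWeakReference](new ConcurrentHashMap)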

Weak reference objects

CleanupTaskWeakReference extends WeakReference and binds its referent (the RDD) to a ReferenceQueue. When the garbage collector finds that the referent is reachable only through this weak reference, i.e. no other object references it, the CleanupTaskWeakReference is pushed onto that ReferenceQueue.

// Reference queue: when the garbage collector detects the corresponding reachability change, it pushes the reference object onto the queue.
// This is done via Reference.enqueue: public boolean enqueue() { return this.queue.enqueue(this); }
private val referenceQueue = new ReferenceQueue[AnyRef]

/**
 * A WeakReference associated with a CleanupTask.
 *
 * When the referent object becomes only weakly reachable, the corresponding
 * CleanupTaskWeakReference is automatically added to the given reference queue.
 */
private class CleanupTaskWeakReference(
    val task: CleanupTask,
    referent: AnyRef,
    referenceQueue: ReferenceQueue[AnyRef])
  extends WeakReference(referent, referenceQueue)
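
The WeakReference plus ReferenceQueue mechanism can be reproduced outside Spark with a few lines of plain JVM code. A self-contained sketch (note that System.gc() is only a hint, so the poll may occasionally come back empty):

import java.lang.ref.{ReferenceQueue, WeakReference}

object WeakRefDemo {
  def main(args: Array[String]): Unit = {
    val queue = new ReferenceQueue[AnyRef]
    var payload: AnyRef = new Array[Byte](1024)         // plays the role of the RDD
    val ref = new WeakReference[AnyRef](payload, queue)

    payload = null                                      // drop the last strong reference
    System.gc()                                         // request a collection (only a hint)

    // Once the referent is collected, the weak reference shows up on the queue,
    // which is exactly how ContextCleaner learns that an RDD is gone.
    val enqueued = Option(queue.remove(1000))
    println(s"enqueued = ${enqueued.isDefined}, referent = ${ref.get}")
  }
}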

The cleaning thread

Now for a closer look at the cleaning thread.
When the SparkContext is initialized, the cleaner is started; the code is short, so here it is directly:

_cleaner =
  if (_conf.get(CLEANER_REFERENCE_TRACKING)) {
    Some(new ContextCleaner(this))
  } else {
    None
  }
_cleaner.foreach(_.start())

  /** Start the cleaner. */
  def start(): Unit = {
    cleaningThread.setDaemon(true) // run as a daemon thread
    cleaningThread.setName("Spark Context Cleaner")
    cleaningThread.start()
    // Something of a silver-bullet move: trigger a GC on a schedule (every 30 minutes
    // by default), mainly to cope with long-running jobs
    periodicGCService.scheduleAtFixedRate(() => System.gc(),
      periodicGCInterval, periodicGCInterval, TimeUnit.SECONDS)
  }

private val cleaningThread = new Thread() { override def run() { keepCleaning() }}

  /** Keep cleaning RDD, shuffle, and broadcast state. */
  private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
    while (!stopped) {
      try {
        // Poll the referenceQueue for a collectible weak reference; getting one back
        // means the registered RDD is now collectible
        val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
          .map(_.asInstanceOf[CleanupTaskWeakReference])
        // Synchronize here to avoid being interrupted on stop()
        synchronized {
          reference.foreach { ref =>
            logDebug("Got cleaning task " + ref.task)
            // Remove the strong reference held in referenceBuffer
            referenceBuffer.remove(ref)
            ref.task match {
              case CleanRDD(rddId) =>
                doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
              case CleanShuffle(shuffleId) =>
                doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
              case CleanBroadcast(broadcastId) =>
                doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
              case CleanAccum(accId) =>
                doCleanupAccum(accId, blocking = blockOnCleanupTasks)
              case CleanCheckpoint(rddId) =>
                doCleanCheckpoint(rddId) // the task is a CleanCheckpoint task
            }
          }
        }
      } catch {
        case ie: InterruptedException if stopped => // ignore
        case e: Exception => logError("Error in cleaning thread", e)
      }
    }
  }

  /**
   * Clean up checkpoint files written to a reliable storage.
   * Locally checkpointed files are cleaned up separately through RDD cleanups.
   */
  def doCleanCheckpoint(rddId: Int): Unit = {
    try {
      logDebug("Cleaning rdd checkpoint data " + rddId)
      // The checkpoint deletion is triggered here
      ReliableRDDCheckpointData.cleanCheckpoint(sc, rddId)
      listeners.asScala.foreach(_.checkpointCleaned(rddId))
      logInfo("Cleaned rdd checkpoint data " + rddId)
    }
    catch {
      case e: Exception => logError("Error cleaning rdd checkpoint data " + rddId, e)
    }
  }
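
The deletion itself is nothing more than a recursive delete of that RDD's rdd-<id> directory; paraphrased from ReliableRDDCheckpointData, consistent with the checkpointPath sketch above:

// Paraphrased sketch: resolve <checkpointDir>/<uuid>/rdd-<rddId> and delete it recursively
def cleanCheckpoint(sc: SparkContext, rddId: Int): Unit = {
  checkpointPath(sc, rddId).foreach { path =>
    path.getFileSystem(sc.hadoopConfiguration).delete(path, true)
  }
}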

What the special operation is for

Why run System.gc() periodically to trigger a full GC?

  • Because the logic for deleting RDD checkpoints relies on WeakReference, it depends entirely on GC: if no GC runs, the collectible objects are never discovered and the cleanup logic never fires.
  • In extreme cases only young GCs may run for a long time, so objects in the old generation are never collected even though nothing references them any more. Calling System.gc() periodically requests a full GC so the old generation can be reclaimed (the interval is configurable, see the snippet after this list).
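
The interval of that forced full GC is configurable. A hedged example on SparkConf (the 10-minute value is only an illustration, not a recommendation):

import org.apache.spark.SparkConf

// spark.cleaner.periodicGC.interval controls how often ContextCleaner forces a GC
// (30 minutes by default); shorten it if checkpoint directories pile up faster
// than the weak references get enqueued.
val conf = new SparkConf()
  .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .set("spark.cleaner.periodicGC.interval", "10min")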

To sum up

  • By default, checkpoint files remain on the DFS until they are deleted manually.
  • Even with spark.cleaner.referenceTracking.cleanCheckpoints enabled, cleanup is not guaranteed: the garbage collector may not run at the right time, so the weak reference may never be enqueued and the cleanup task may never be triggered.

Origin www.cnblogs.com/windliu/p/10983334.html