[Spark] Spark speculative execution

Overview

A speculative task is a duplicate of a task that is lagging behind the rest of its stage; the task is launched again on an executor on another node. Whichever copy finishes first supplies the final result, and the other running copies are killed at the same time. Spark speculative execution is disabled by default and can be enabled through the spark.speculation property.
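
These properties are set on the SparkConf before the SparkContext is created. Below is a minimal sketch (the application name is made up; the values shown are the defaults discussed later in this article):

import org.apache.spark.{SparkConf, SparkContext}

// Enable speculative execution and expose the tuning knobs covered below.
val conf = new SparkConf()
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")             // disabled by default
  .set("spark.speculation.interval", "100ms")   // how often to check for slow tasks
  .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before checking
  .set("spark.speculation.multiplier", "1.5")   // how much slower than the median counts as slow
val sc = new SparkContext(conf)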

Check if there is a task that needs to be executed speculatively

After SparkContext creates the schedulerBackend and the taskScheduler, it immediately calls the taskScheduler's start method:

override def start() {
    backend.start()
    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      speculationScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
          checkSpeculatableTasks()
        }
      }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
  }

As the code shows, after starting the SchedulerBackend, the TaskScheduler checks (in non-local mode) whether speculative execution is enabled (disabled by default, enabled via spark.speculation). If it is, a task is scheduled every SPECULATION_INTERVAL_MS (100 ms by default, configurable via the spark.speculation.interval property) that calls the checkSpeculatableTasks method to detect whether any tasks need to be executed speculatively:

// Check for speculatable tasks in all our active jobs.
  def checkSpeculatableTasks() {
    var shouldRevive = false
    synchronized {
      shouldRevive = rootPool.checkSpeculatableTasks()
    }
    if (shouldRevive) {
      backend.reviveOffers()
    }
  }

checkSpeculatableTasks then asks the rootPool whether any tasks need to be executed speculatively; if so, it calls the SchedulerBackend's reviveOffers to try to obtain resources and run the speculative tasks. Let's look at what the detection logic looks like:

override def checkSpeculatableTasks(): Boolean = {
    var shouldRevive = false
    for (schedulable <- schedulableQueue.asScala) {
      shouldRevive |= schedulable.checkSpeculatableTasks()
    }
    shouldRevive
  }

rootPool.checkSpeculatableTasks in turn delegates to every Schedulable in its schedulableQueue, a ConcurrentLinkedQueue[Schedulable] that holds all the TaskSetManagers. Looking at the checkSpeculatableTasks method of TaskSetManager, we finally reach the actual detection logic:

 override def checkSpeculatableTasks(): Boolean = {
    // no need to check if there is only one task, or if no task needs to run any more
    if (isZombie || numTasks == 1) {
      return false
    }
    var foundTasks = false
    // total number of tasks * SPECULATION_QUANTILE (0.75 by default, set via spark.speculation.quantile)
    val minFinishedForSpeculation = (SPECULATION_QUANTILE * numTasks).floor.toInt
    logDebug("Checking for speculative tasks: minFinished = " + minFinishedForSpeculation)
    // have at least 75% of the tasks succeeded, and is the number of successful tasks greater than 0?
    if (tasksSuccessful >= minFinishedForSpeculation && tasksSuccessful > 0) {
      val time = clock.getTimeMillis()
      // collect and sort the durations of the successfully finished tasks
      val durations = taskInfos.values.filter(_.successful).map(_.duration).toArray
      Arrays.sort(durations)
      // take the median of those durations
      val medianDuration = durations(min((0.5 * tasksSuccessful).round.toInt, durations.length - 1))
      // median * SPECULATION_MULTIPLIER (1.5 by default, set via spark.speculation.multiplier)
      val threshold = max(SPECULATION_MULTIPLIER * medianDuration, 100)
      logDebug("Task length threshold for speculation: " + threshold)
      // walk the tasks of this TaskSet and add to speculatableTasks every task that has not yet
      // succeeded, has exactly one copy running, has been running longer than threshold,
      // and is not already in the speculatable list
      for ((tid, info) <- taskInfos) {
        val index = info.index
        if (!successful(index) && copiesRunning(index) == 1 && info.timeRunning(time) > threshold &&
          !speculatableTasks.contains(index)) {
          logInfo(
            "Marking task %d in stage %s (on %s) as speculatable because it ran more than %.0f ms"
              .format(index, taskSet.id, info.host, threshold))
          speculatableTasks += index
          foundTasks = true
        }
      }
    }
    foundTasks
  }

The comments in the code make the logic clear: once the number of successfully finished tasks exceeds 75% of the total number of tasks (configurable via spark.speculation.quantile), the run times of all successful tasks are collected and their median is taken; that median multiplied by 1.5 (configurable via spark.speculation.multiplier) becomes the run-time threshold, and any still-running task whose run time exceeds the threshold becomes a candidate for speculation. Put simply, speculation is enabled for the tasks that are dragging down the overall progress, in order to speed up the stage as a whole.
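
To make the arithmetic concrete, here is a small standalone sketch of the same rule with made-up numbers. It is not Spark's class, only the same computation as the method above, using the default quantile of 0.75 and multiplier of 1.5:

import java.util.Arrays
import scala.math.{max, min}

object SpeculationThresholdDemo {
  def main(args: Array[String]): Unit = {
    val numTasks  = 8
    // durations (ms) of the 6 tasks that have already succeeded
    val durations = Array(2000L, 2100L, 2300L, 2500L, 2600L, 2800L)

    val minFinishedForSpeculation = (0.75 * numTasks).floor.toInt          // quantile: 6 of 8 tasks
    if (durations.length >= minFinishedForSpeculation && durations.length > 0) {
      Arrays.sort(durations)
      val medianDuration = durations(min((0.5 * durations.length).round.toInt, durations.length - 1))
      val threshold = max(1.5 * medianDuration, 100)                       // 1.5 * 2500 = 3750 ms
      println(s"running tasks slower than $threshold ms become speculatable")
    }
  }
}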

When speculative tasks are scheduled

When the TaskSetManager assigns a task to an executor under the delay scheduling policy, the dequeueTask method is called:

private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
    : Option[(Int, TaskLocality.Value, Boolean)] =
  {
    for (index <- dequeueTaskFromList(execId, getPendingTasksForExecutor(execId))) {
      return Some((index, TaskLocality.PROCESS_LOCAL, false))
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {
      for (index <- dequeueTaskFromList(execId, getPendingTasksForHost(host))) {
        return Some((index, TaskLocality.NODE_LOCAL, false))
      }
    }
   ......
    // find a speculative task if all others tasks have been scheduled
    dequeueSpeculativeTask(execId, host, maxLocality).map {
      case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
  }

The last part of the method is to schedule speculative tasks after other tasks have been scheduled. Let's see the implementation:

protected def dequeueSpeculativeTask(execId: String, host: String, locality: TaskLocality.Value)
    : Option[(Int, TaskLocality.Value)] =
  {
    // Remove tasks that have already finished successfully from the speculatable list: some time
    // passes between detection and scheduling, and some of them may have succeeded in the meantime
    speculatableTasks.retain(index => !successful(index)) // Remove finished tasks from set
    // Decide whether the task can run on this executor's host. The conditions are:
    // the task has no attempt already running on that host, and
    // the executor is not blacklisted for the task (the task failed on that executor
    // and the blacklist timeout has not expired yet)
    def canRunOnHost(index: Int): Boolean =
      !hasAttemptOnHost(index, host) && !executorIsBlacklisted(execId, index)

    if (!speculatableTasks.isEmpty) {
      // look for a task index that can be launched on this executor
      for (index <- speculatableTasks if canRunOnHost(index)) {
        // the task's preferred locations
        val prefs = tasks(index).preferredLocations
        val executors = prefs.flatMap(_ match {
          case e: ExecutorCacheTaskLocation => Some(e.executorId)
          case _ => None
        });
        // if the preferred locations are ExecutorCacheTaskLocations and the executors holding
        // the data include the current executor, return the task's index within the taskSet
        // together with its locality level
        if (executors.contains(execId)) {
          speculatableTasks -= index
          return Some((index, TaskLocality.PROCESS_LOCAL))
        }
      }

      // this check is where delay scheduling comes in: even a speculative task is launched
      // with the best locality level possible
      if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) {
        for (index <- speculatableTasks if canRunOnHost(index)) {
          val locations = tasks(index).preferredLocations.map(_.host)
          if (locations.contains(host)) {
            speculatableTasks -= index
            return Some((index, TaskLocality.NODE_LOCAL))
          }
        }
      }

       ........
    }
    None
  }

The method is long, so only the first part is shown, but the remaining branches follow the same pattern and the comments make them clear. First, tasks that have already finished successfully are filtered out. In addition, a speculative copy is never launched on the same host as an already-running attempt of the task, nor on an executor where the task is blacklisted. Then, under the delay scheduling policy, the task's preferred locations determine whether it can be launched on the given executor, and at which locality level.
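
As a rough standalone illustration of those placement constraints (hypothetical data structures, not Spark's TaskSetManager), a speculative copy is skipped for any host that already runs an attempt of the task and for any executor that is blacklisted for it:

object SpeculativePlacementDemo {
  def main(args: Array[String]): Unit = {
    // hypothetical bookkeeping: hosts with a running attempt, and blacklisted executors, per task index
    val attemptsByTask  = Map(3 -> Set("host-a"))
    val blacklistByTask = Map(3 -> Set("exec-2"))

    def canRunOnHost(index: Int, host: String, execId: String): Boolean =
      !attemptsByTask.getOrElse(index, Set.empty[String]).contains(host) &&
        !blacklistByTask.getOrElse(index, Set.empty[String]).contains(execId)

    println(canRunOnHost(3, "host-a", "exec-1"))  // false: an attempt is already running on host-a
    println(canRunOnHost(3, "host-b", "exec-2"))  // false: exec-2 is blacklisted for task 3
    println(canRunOnHost(3, "host-b", "exec-3"))  // true: different host, non-blacklisted executor
  }
}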
