[spark] Result processing of a successfully executed Task

Foreword

The article Task Execution Process describes how a task is assigned to an executor for execution. This article explains how the result is returned to the driver after a task executes successfully.

The driver side receives the task completion event

After a task finishes successfully on the executor and the serializedResult is obtained, the result is reported through the statusUpdate method of CoarseGrainedExecutorBackend, which uses the driver's RpcEndpointRef to send a StatusUpdate message containing the serializedResult to the driver.

execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
    val msg = StatusUpdate(executorId, taskId, state, data)
    driver match {
      case Some(driverRef) => driverRef.send(msg)
      case None => logWarning(s"Drop $msg because has not yet connected to driver")
    }
  }

On the driver side, CoarseGrainedSchedulerBackend handles the StatusUpdate message as follows:

case StatusUpdate(executorId, taskId, state, data) =>
        scheduler.statusUpdate(taskId, state, data.value)
        if (TaskState.isFinished(state)) {
          executorDataMap.get(executorId) match {
            case Some(executorInfo) =>
              executorInfo.freeCores += scheduler.CPUS_PER_TASK
              makeOffers(executorId)
            case None =>
              // Ignoring the update since we don't know about the executor.
              logWarning(s"Ignored task status update ($taskId state $state) " +
                s"from unknown executor with ID $executorId")
          }
        }
  • Call the statusUpdate method of TaskSchedulerImpl, passing on the task's state so that the corresponding handling is triggered
  • If the task has finished, its resources are released: the free cores of the executor that ran it are incremented by CPUS_PER_TASK
  • Since that executor now has free resources, makeOffers(executorId) is called so new tasks can be assigned to it

Here we focus on what TaskSchedulerImpl does depending on the task's state:

def statusUpdate(tid: Long, state: TaskState, serializedData: ByteBuffer) {
    var failedExecutor: Option[String] = None
    var reason: Option[ExecutorLossReason] = None
    synchronized {
      try {
        // If the task is lost, mark its executor as lost too; this also updates some of the related mappings
        if (state == TaskState.LOST && taskIdToExecutorId.contains(tid)) {
          // We lost this entire executor, so remember that it's gone
          val execId = taskIdToExecutorId(tid)

          if (executorIdToTaskCount.contains(execId)) {
            reason = Some(
              SlaveLost(s"Task $tid was lost, so marking the executor as lost as well."))
            removeExecutor(execId, reason.get)
            failedExecutor = Some(execId)
          }
        }
        // Look up the TaskSetManager that this task belongs to
        taskIdToTaskSetManager.get(tid) match {
          case Some(taskSet) =>
            if (TaskState.isFinished(state)) {
              taskIdToTaskSetManager.remove(tid)
              taskIdToExecutorId.remove(tid).foreach { execId =>
                if (executorIdToTaskCount.contains(execId)) {
                  executorIdToTaskCount(execId) -= 1
                }
              }
            }
            // Handle a successfully finished task
            if (state == TaskState.FINISHED) {
              // Remove the task from the taskSet's list of running tasks
              taskSet.removeRunningTask(tid)
              // On success, process the task result in a thread pool
              taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
            // Handle the failure cases
            } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
              taskSet.removeRunningTask(tid)
              taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
            }
          case None =>
            logError(
              ("Ignoring update with state %s for TID %s because its task set is gone (this is " +
                "likely the result of receiving duplicate task finished status updates)")
                .format(state, tid))
        }
      } catch {
        case e: Exception => logError("Exception in statusUpdate", e)
      }
    }
    // Update the DAGScheduler without holding a lock on this, since that can deadlock
    if (failedExecutor.isDefined) {
      assert(reason.isDefined)
      dagScheduler.executorLost(failedExecutor.get, reason.get)
      backend.reviveOffers()
    }
  }

If the task state is LOST, the executor the task ran on is also marked as lost and the related mappings are updated, which causes the tasks on that executor to be rescheduled; the other states are not covered here. The key point is that when the task state is FINISHED, processing of the task's result is handed off to a thread pool through the enqueueSuccessfulTask method of taskResultGetter:

def enqueueSuccessfulTask(
      taskSetManager: TaskSetManager,
      tid: Long,
      serializedData: ByteBuffer): Unit = {
    getTaskResultExecutor.execute(new Runnable {
      override def run(): Unit = Utils.logUncaughtExceptions {
        try {
          // Deserialize the result and its size from serializedData
          val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
            // The result is available directly
            case directResult: DirectTaskResult[_] =>
              // The accumulated result size of the taskSet exceeds the limit
              if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
                return
              } 
              directResult.value()
              // Return the result and its size directly
              (directResult, serializedData.limit())
            // The result is indirect and must be fetched via the BlockManager
            case IndirectTaskResult(blockId, size) =>
              // If the size exceeds the maximum the taskSetManager may fetch, remove the block from the remote BlockManager
              if (!taskSetManager.canFetchMoreResults(size)) {
                // dropped by executor if size is larger than maxResultSize
                sparkEnv.blockManager.master.removeBlock(blockId)
                return
              }
              logDebug("Fetching indirect task result for TID %s".format(tid))
              // Mark the task as one whose result needs to be fetched remotely and notify the DAGScheduler
              scheduler.handleTaskGettingResult(taskSetManager, tid)
              // Fetch the task's result from the remote BlockManager
              val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
              // Fetching the result failed; the result is lost
              if (!serializedTaskResult.isDefined) {
               // Between the task finishing with a result and the driver fetching it remotely, the fetch
               // fails if the machine that ran the task goes down or its BlockManager has evicted the result.
                scheduler.handleFailedTask(
                  taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
                return
              }
              // The fetch succeeded; deserialize the result
              val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                serializedTaskResult.get.toByteBuffer)
                // Remove the result from the remote BlockManager
               sparkEnv.blockManager.master.removeBlock(blockId)
              // Return the result
              (deserializedResult, size)
          }
          ...
        // Notify the scheduler to handle the successful task
        scheduler.handleSuccessfulTask(taskSetManager, tid, result)
        } catch { 
          ...
        }
      }
    })
  }
  • Deserialize serializedData
  • If the result is available directly (DirectTaskResult), it is returned as-is, provided the total result size of the completed tasks of the current taskSet has not exceeded the limit (spark.driver.maxResultSize, default 1g); see the sketch after this list for how the executor side chooses between the two result types.
  • If the result is indirect (IndirectTaskResult) and the size check passes, mark the task as one whose result must be fetched remotely, notify the DAGScheduler, and fetch the result from the remote BlockManager. If the fetch fails, notify the scheduler to handle the failed task; there are two causes of failure:
    • the machine that ran the task goes down between the task finishing and the driver fetching the result remotely
    • that machine's BlockManager has already evicted the task's result
  • If the remote fetch succeeds, the block is removed from the remote BlockManager and the deserialized result is returned.
  • Finally, the scheduler is notified of the successful task, with the TaskSetManager, the tid, and the task's result as parameters.
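
As a reference for the direct/indirect distinction above, here is a minimal, self-contained sketch of how the executor side may choose between the two result types. This is not the actual Spark source: SketchDirectResult, SketchIndirectResult, chooseResult and both threshold parameters are illustrative names; the thresholds mirror spark.driver.maxResultSize and the limit on results that may travel inline in the RPC message.

// Simplified stand-ins for DirectTaskResult / IndirectTaskResult.
sealed trait SketchTaskResult
case class SketchDirectResult(bytes: Array[Byte]) extends SketchTaskResult
case class SketchIndirectResult(blockId: String, size: Long) extends SketchTaskResult

def chooseResult(
    taskId: Long,
    serialized: Array[Byte],
    maxResultSize: Long,        // mirrors spark.driver.maxResultSize
    maxDirectResultSize: Long   // limit for sending the result inline
  ): SketchTaskResult = {
  val resultSize = serialized.length.toLong
  if (maxResultSize > 0 && resultSize > maxResultSize) {
    // Larger than the driver will ever accept: only the size is reported, so the
    // driver can fail the task with a clear error ("dropped by executor" above).
    SketchIndirectResult(s"taskresult_$taskId", resultSize)
  } else if (resultSize > maxDirectResultSize) {
    // Too large to send inline: store the bytes in the local BlockManager
    // (omitted here) and tell the driver which block to fetch.
    SketchIndirectResult(s"taskresult_$taskId", resultSize)
  } else {
    // Small enough to ship inline with the StatusUpdate message.
    SketchDirectResult(serialized)
  }
}

In the first indirect case the driver-side canFetchMoreResults check shown above fails and the result is never fetched; in the second, taskResultGetter fetches the block and deserializes it into a DirectTaskResult.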

Next, let's follow how the scheduler handles a successful task:

def handleSuccessfulTask(
      taskSetManager: TaskSetManager,
      tid: Long,
      taskResult: DirectTaskResult[_]): Unit = synchronized {
    taskSetManager.handleSuccessfulTask(tid, taskResult)
  }

It calls the taskSetManager's processing method for successful tasks:

def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
    val info = taskInfos(tid)
    val index = info.index
    info.markSuccessful()
    // Remove the task from the set of running tasks
    removeRunningTask(tid)
    // Notify the dagScheduler
    sched.dagScheduler.taskEnded(tasks(index), Success, result.value(), result.accumUpdates, info)
    // Mark the task as successfully completed
    if (!successful(index)) {
      tasksSuccessful += 1
      logInfo("Finished task %s in stage %s (TID %d) in %d ms on %s (%d/%d)".format(
        info.id, taskSet.id, info.taskId, info.duration, info.host, tasksSuccessful, numTasks))
      // Mark successful and stop if all the tasks have succeeded.
      successful(index) = true
      if (tasksSuccessful == numTasks) {
        isZombie = true
      }
    } else {
      logInfo("Ignoring task-finished event for " + info.id + " in stage " + taskSet.id +
        " because task " + index + " has already completed successfully")
    }
    // Remove this task from the map of failed task -> executors
    failedExecutors.remove(index)
    // If all tasks of the taskSet have completed successfully, finish the taskSet
    maybeFinishTaskSet()
  }

The logic is straightforward: mark the task as successful, update failedExecutors, and finish the taskSet once all of its tasks have completed successfully. Now let's see how the dagScheduler is notified; its taskEnded method is called:

def taskEnded(
      task: Task[_],
      reason: TaskEndReason,
      result: Any,
      accumUpdates: Seq[AccumulatorV2[_, _]],
      taskInfo: TaskInfo): Unit = {
    eventProcessLoop.post(
      CompletionEvent(task, reason, result, accumUpdates, taskInfo))
  }

Here a CompletionEvent is posted to the DAGScheduler's event loop, and it is handled in DAGScheduler#doOnReceive:

// DAGScheduler#doOnReceive
 case completion: CompletionEvent =>
      dagScheduler.handleTaskCompletion(completion)

Continuing with the implementation of DAGScheduler#handleTaskCompletion: the code is long, so only the main logic is listed here:

 private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
    ...
    val stage = stageIdToStage(task.stageId)
    event.reason match {
      case Success =>
        // Remove the task's partition from the stage's set of pending partitions
        stage.pendingPartitions -= task.partitionId
        task match {
          case rt: ResultTask[_, _] =>
            // Cast to ResultStage here because it's part of the ResultTask
            // TODO Refactor this out to a function that accepts a ResultStage
            val resultStage = stage.asInstanceOf[ResultStage]
            resultStage.activeJob match {
              case Some(job) =>
                if (!job.finished(rt.outputId)) {
                  updateAccumulators(event)
                  job.finished(rt.outputId) = true
                  job.numFinished += 1
                  // If the whole job has finished, remove it
                  if (job.numFinished == job.numPartitions) {
                    markStageAsFinished(resultStage)
                    cleanupStateForJobAndIndependentStages(job)
                    listenerBus.post(
                      SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))
                  }

                  // taskSucceeded runs some user code that might throw an exception. Make sure
                  // we are resilient against that.
                  try {
                    job.listener.taskSucceeded(rt.outputId, event.result)
                  } catch {
                    case e: Exception =>
                      // TODO: Perhaps we want to mark the resultStage as failed?
                      job.listener.jobFailed(new SparkDriverExecutionException(e))
                  }
                }
              case None =>
                logInfo("Ignoring result from " + rt + " because its job has finished")
            }
          // If it is a ShuffleMapTask
          case smt: ShuffleMapTask =>
            val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
            updateAccumulators(event)
            val status = event.result.asInstanceOf[MapStatus]
            val execId = status.location.executorId
            logDebug("ShuffleMapTask finished on " + execId)
            // Ignore a stray ShuffleMapTask result coming from a node that has been marked as failed
            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
              logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
            } else {
              // Save the MapStatus result into the stage
              shuffleStage.addOutputLoc(smt.partitionId, status)
            }
            // If all tasks of the current stage have finished running
            if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
              markStageAsFinished(shuffleStage)
              logInfo("looking for newly runnable stages")
              logInfo("running: " + runningStages)
              logInfo("waiting: " + waitingStages)
              logInfo("failed: " + failedStages)

              // Register the stage's map outputs with the MapOutputTrackerMaster
              mapOutputTracker.registerMapOutputs(
                shuffleStage.shuffleDep.shuffleId,
                shuffleStage.outputLocInMapOutputTrackerFormat(),
                changeEpoch = true)
              // Clear the cached locations
              clearCacheLocs()
              // If some tasks of the stage failed and left no output, resubmit the stage to run the missing tasks
              if (!shuffleStage.isAvailable) {
                // Some tasks had failed; let's resubmit this shuffleStage
                // TODO: Lower-level scheduler should also deal with this
                logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
                  ") because some of its tasks had failed: " +
                  shuffleStage.findMissingPartitions().mkString(", "))
                submitStage(shuffleStage)
              } else {
                // Mark every map-stage job waiting for this stage as finished
                if (shuffleStage.mapStageJobs.nonEmpty) {
                  val stats = mapOutputTracker.getStatistics(shuffleStage.shuffleDep)
                  for (job <- shuffleStage.mapStageJobs) {
                    markMapStageJobAsFinished(job, stats)
                  }
                }
              }

              // Note: newly runnable stages will be submitted below when we submit waiting stages
            }
        }
        ...
    }
    submitWaitingStages()
  }

When the task is a ShuffleMapTask, its result is not saved into the stage if it came from a node that has been marked as failed. Once all tasks of the current stage have finished (not necessarily successfully), all map outputs are registered with the MapOutputTrackerMaster (so that tasks of the next stage can look up the metadata of the shuffle output through it), and the cached locations are cleared. If some tasks of the stage did not execute successfully, their outputs are missing and the stage is resubmitted to run the missing tasks; if every task completed successfully, the stage is done and all map-stage jobs waiting on this stage are marked as finished. A conceptual sketch of this map-output bookkeeping is given below.
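
As a rough illustration of that bookkeeping (simplified stand-ins only, not the Spark API; SketchMapStatus and SketchMapOutputRegistry are made-up names), a registry keyed by shuffleId that remembers, per map partition, where the output lives could look like this:

import scala.collection.mutable

// Which executor produced a map partition's output, and the sizes per reduce partition.
case class SketchMapStatus(executorId: String, sizesByReducer: Array[Long])

class SketchMapOutputRegistry {
  // shuffleId -> one optional status per map partition
  private val outputs = mutable.Map.empty[Int, Array[Option[SketchMapStatus]]]

  def registerShuffle(shuffleId: Int, numMaps: Int): Unit =
    outputs(shuffleId) = Array.fill[Option[SketchMapStatus]](numMaps)(None)

  def registerMapOutput(shuffleId: Int, mapPartition: Int, status: SketchMapStatus): Unit =
    outputs(shuffleId)(mapPartition) = Some(status)

  // The stage is "available" only when every map partition has output; otherwise the
  // DAGScheduler resubmits the stage for exactly the missing partitions.
  def missingPartitions(shuffleId: Int): Seq[Int] =
    outputs(shuffleId).zipWithIndex.collect { case (None, i) => i }.toSeq
}

Reduce tasks of the next stage consult this metadata (through the MapOutputTracker in real Spark) to find out which executors to fetch their shuffle blocks from, and the missing-partitions view is what drives the resubmission of a partially failed stage.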

When the task is a ResultTask, the number of finished tasks of the job is incremented. If all tasks have finished, the job is complete: the stage is marked as finished and removed from runningStages. The cleanupStateForJobAndIndependentStages method walks over all stages of the current job and removes every stage that no other active job still depends on, and the current job is removed from the active jobs. A small sketch of this cleanup bookkeeping follows below.
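
The cleanup can be pictured with a small sketch of the job/stage bookkeeping (illustrative names only, not the Spark internals): a stage is dropped only when no remaining active job depends on it.

import scala.collection.mutable

class SketchJobStageBookkeeping {
  val activeJobIds = mutable.Set.empty[Int]
  val jobToStageIds = mutable.Map.empty[Int, Set[Int]]
  val stageToJobIds = mutable.Map.empty[Int, mutable.Set[Int]]

  // Called when a job finishes: drop every stage that no other job still depends on,
  // then forget the job itself.
  def cleanupJob(jobId: Int): Unit = {
    for (stageId <- jobToStageIds.getOrElse(jobId, Set.empty)) {
      val dependentJobs = stageToJobIds.getOrElse(stageId, mutable.Set.empty[Int])
      dependentJobs -= jobId
      if (dependentJobs.isEmpty) stageToJobIds -= stageId
    }
    jobToStageIds -= jobId
    activeJobIds -= jobId
  }
}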

Finally, job.listener.taskSucceeded(rt.outputId, event.result) is called, which invokes the taskSucceeded method of JobWaiter (the concrete implementation of JobListener):

override def taskSucceeded(index: Int, result: Any): Unit = {
    // resultHandler call must be synchronized in case resultHandler itself is not thread safe.
    synchronized {
      resultHandler(index, result.asInstanceOf[T])
    }
    if (finishedTasks.incrementAndGet() == totalTasks) {
      jobPromise.success(())
    }
  }

The resultHandler here is the result handler that was supplied when an action triggered runJob:

def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }

Here (index, res) => results(index) = res is the resultHandler: it fills in the results array, which is then returned, and different actions supply different handlers. When the number of finished tasks equals totalTasks, the whole job has completed successfully and jobPromise is marked as successful. A minimal sketch of this resultHandler pattern follows.
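
To see the resultHandler pattern end to end, here is a minimal, self-contained sketch; runJobSketch and the sample data are made up for illustration, and in real Spark func runs on executors, with the result travelling back along the path described in this article before the handler is invoked on the driver.

// Run `func` over each "partition" and hand every result to resultHandler,
// indexed by partition, in the same spirit as SparkContext.runJob.
def runJobSketch[T, U](
    partitions: Seq[Seq[T]],
    func: Iterator[T] => U,
    resultHandler: (Int, U) => Unit): Unit = {
  partitions.zipWithIndex.foreach { case (part, index) =>
    resultHandler(index, func(part.iterator))
  }
}

// A collect-like action: gather every partition's elements into one array.
val parts = Seq(Seq(1, 2), Seq(3, 4), Seq(5))
val results = new Array[Array[Int]](parts.size)
runJobSketch[Int, Array[Int]](parts, _.toArray, (index, res) => results(index) = res)
val collected: Array[Int] = results.flatten   // Array(1, 2, 3, 4, 5)

Other actions simply plug in a different handler; an action like reduce, for example, can merge each partition's partial result as it arrives instead of storing it into an array.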
