[spark] TaskScheduler task submission and scheduling source code analysis

After DAGScheduler divides a job into stages and submits each stage to TaskScheduler as a TaskSet, TaskScheduler schedules and runs the tasks in that TaskSet through a TaskSetManager.

taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))

The submitTasks method is implemented in TaskSchedulerImpl, the implementation class of TaskScheduler. First, the whole implementation:

override def submitTasks(taskSet: TaskSet) {
    val tasks = taskSet.tasks
    logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
    this.synchronized {
      val manager = createTaskSetManager(taskSet, maxTaskFailures)
      val stage = taskSet.stageId
      val stageTaskSets =
        taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
      stageTaskSets(taskSet.stageAttemptId) = manager
      val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
        ts.taskSet != taskSet && !ts.isZombie
      }
      if (conflictingTaskSet) {
        throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
          s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
      }
      schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

      if (!isLocal && !hasReceivedTask) {
        starvationTimer.scheduleAtFixedRate(new TimerTask() {
          override def run() {
            if (!hasLaunchedTask) {
              logWarning("Initial job has not accepted any resources; " +
                "check your cluster UI to ensure that workers are registered " +
                "and have sufficient resources")
            } else {
              this.cancel()
            }
          }
        }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
      }
      hasReceivedTask = true
    }
    backend.reviveOffers()
  }

val manager = createTaskSetManager(taskSet, maxTaskFailures)

First, a TaskSetManager is created for the current TaskSet. The TaskSetManager manages the tasks of a single TaskSet and decides whether a task is launched on a given executor. If a task fails, it retries the task until the failure limit (maxTaskFailures) is reached, and it implements locality-aware scheduling through delay scheduling.
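The retry limit can be pictured with a minimal sketch. It assumes that failures are counted per task index and that the whole task set is given up once one task reaches maxTaskFailures; FailureTracker and onFailure are hypothetical names, not Spark's API.

```scala
object RetrySketch {
  // Hypothetical, simplified stand-in for the failure bookkeeping of a TaskSetManager.
  final class FailureTracker(numTasks: Int, maxTaskFailures: Int) {
    private val numFailures = new Array[Int](numTasks)
    var isZombie: Boolean = false

    // Returns true if the task may be retried, false if the task set is given up.
    def onFailure(index: Int): Boolean = {
      numFailures(index) += 1
      if (numFailures(index) >= maxTaskFailures) {
        isZombie = true // no further tasks are launched for this task set
        false
      } else {
        true
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val tracker = new FailureTracker(numTasks = 2, maxTaskFailures = 4)
    val outcomes = (1 to 4).map(_ => tracker.onFailure(0))
    println(s"retry allowed after each failure: $outcomes, zombie: ${tracker.isZombie}")
  }
}
```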

val stageTaskSets =
        taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
      stageTaskSets(taskSet.stageAttemptId) = manager

taskSetsByStageIdAndAttempt maps a stageId to a HashMap whose key is the stageAttemptId and whose value is the TaskSetManager for that attempt.

val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
        ts.taskSet != taskSet && !ts.isZombie
      }
      if (conflictingTaskSet) {
        throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
          s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
      }

isZombie marks a TaskSetManager whose tasks no longer need to be launched (every task has already succeeded, or the task set has been aborted). If the stage already has another TaskSetManager that is not a zombie and whose TaskSet differs from the newly submitted one, an exception is thrown. This ensures that a stage never has two active TaskSets at the same time.
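The check can be reproduced in isolation. The following is a sketch under simplifying assumptions: SimpleTaskSet and SimpleManager are hypothetical stand-ins for TaskSet and TaskSetManager, and only the nested map and the conflict test are modeled.

```scala
import scala.collection.mutable.HashMap

object TaskSetConflictSketch {
  // Hypothetical stand-ins for Spark's TaskSet and TaskSetManager.
  final case class SimpleTaskSet(stageId: Int, stageAttemptId: Int)
  final class SimpleManager(val taskSet: SimpleTaskSet) {
    var isZombie: Boolean = false
  }

  // stageId -> (stageAttemptId -> manager), mirroring taskSetsByStageIdAndAttempt.
  private val byStageAndAttempt = HashMap.empty[Int, HashMap[Int, SimpleManager]]

  def submit(taskSet: SimpleTaskSet): SimpleManager = {
    val manager = new SimpleManager(taskSet)
    val attempts =
      byStageAndAttempt.getOrElseUpdate(taskSet.stageId, HashMap.empty[Int, SimpleManager])
    attempts(taskSet.stageAttemptId) = manager
    // Any other attempt of the same stage that is still active is a conflict.
    val conflicting = attempts.exists { case (_, m) => m.taskSet != taskSet && !m.isZombie }
    if (conflicting) {
      throw new IllegalStateException(
        s"more than one active taskSet for stage ${taskSet.stageId}")
    }
    manager
  }

  def main(args: Array[String]): Unit = {
    val first = submit(SimpleTaskSet(stageId = 1, stageAttemptId = 0))
    first.isZombie = true // attempt 0 no longer launches tasks
    submit(SimpleTaskSet(stageId = 1, stageAttemptId = 1)) // accepted
    try {
      submit(SimpleTaskSet(stageId = 1, stageAttemptId = 2)) // attempt 1 is still active
    } catch {
      case e: IllegalStateException => println(e.getMessage)
    }
  }
}
```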

schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

This adds the current TaskSetManager to the scheduling pool. schedulableBuilder is of type SchedulableBuilder, a trait with two implementations, FIFOSchedulableBuilder and FairSchedulableBuilder; FIFO is the default.
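To make the FIFO default concrete, here is a small sketch of a FIFO ordering. It assumes that task sets are compared first by a priority value taken from the job (lower runs first) and then by stage ID; Entry and fifoLessThan are hypothetical names used only for illustration.

```scala
object FifoOrderSketch {
  // Hypothetical, simplified view of a schedulable task set.
  final case class Entry(name: String, priority: Int, stageId: Int)

  // Lower priority value (earlier job) first; ties are broken by the lower stageId.
  def fifoLessThan(a: Entry, b: Entry): Boolean =
    if (a.priority != b.priority) a.priority < b.priority
    else a.stageId < b.stageId

  def main(args: Array[String]): Unit = {
    val queue = Seq(
      Entry("job1-stage3", priority = 1, stageId = 3),
      Entry("job0-stage1", priority = 0, stageId = 1),
      Entry("job1-stage2", priority = 1, stageId = 2))
    println(queue.sortWith(fifoLessThan).map(_.name).mkString(", "))
  }
}
```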

schedulableBuilder is created in the initialize method of TaskSchedulerImpl. SparkContext creates the scheduler with new TaskSchedulerImpl(sc) and then calls scheduler.initialize(backend):

def initialize(backend: SchedulerBackend) {
    this.backend = backend
    // temporarily set rootPool name to empty
    rootPool = new Pool("", schedulingMode, 0, 0)
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported spark.scheduler.mode: $schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }

backend.reviveOffers()

Next, the reviveOffers method of SchedulerBackend is called to request resources from the scheduler backend. The backend is the one passed in through scheduler.initialize(backend), and it is created in SparkContext:

val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)

Returning to the resource request: the call goes to the reviveOffers method of CoarseGrainedSchedulerBackend, which sends a ReviveOffers message to driverEndpoint.

 override def reviveOffers() {
    driverEndpoint.send(ReviveOffers)
  }

DriverEndpoint calls the makeOffers method after receiving the ReviveOffers message.

private def makeOffers() {
      // Filter out executors under killing
      val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
      val workOffers = activeExecutors.map { case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
      }.toSeq
      launchTasks(scheduler.resourceOffers(workOffers))
    }

This method first keeps only the active executors (those not being killed) and wraps each one in a WorkerOffer. A WorkerOffer carries three pieces of information: executorId, host, and free cores. The executorDataMap here is of type HashMap[String, ExecutorData]; the key is the executorId, and the value holds the information of the corresponding executor, including host, RPC information, totalCores, and freeCores.
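The mapping from executor records to offers can be sketched as follows. ExecData and Offer are hypothetical, trimmed-down versions of ExecutorData and WorkerOffer, and the pendingRemoval set is an assumed stand-in for the executorIsAlive test.

```scala
import scala.collection.mutable.HashMap

object MakeOffersSketch {
  // Hypothetical, trimmed-down versions of ExecutorData and WorkerOffer.
  final case class ExecData(host: String, totalCores: Int, freeCores: Int)
  final case class Offer(executorId: String, host: String, cores: Int)

  // Keep executors that are not being removed and offer their free cores.
  def buildOffers(executors: HashMap[String, ExecData], pendingRemoval: Set[String]): Seq[Offer] =
    executors.toSeq.collect {
      case (id, data) if !pendingRemoval.contains(id) => Offer(id, data.host, data.freeCores)
    }

  def main(args: Array[String]): Unit = {
    val executors = HashMap(
      "exec-1" -> ExecData("hostA", totalCores = 4, freeCores = 2),
      "exec-2" -> ExecData("hostB", totalCores = 4, freeCores = 4))
    buildOffers(executors, pendingRemoval = Set("exec-2")).foreach(println)
  }
}
```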

When the client registered the application with the Master, the Master allocated and started executors for it, and each executor then registered with CoarseGrainedSchedulerBackend. That registration information is stored in executorDataMap.

launchTasks(scheduler.resourceOffers(workOffers))

First, look inside scheduler.resourceOffers(workOffers). TaskSchedulerImpl's resourceOffers method takes the offered resources and returns the Seq[Seq[TaskDescription]] to run on them, which CoarseGrainedSchedulerBackend then distributes to the executors for execution. The implementation:

def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
    // Tracks whether a new executor has become available
    var newExecAvail = false
    // Update the executor, host, and rack bookkeeping
    for (o <- offers) {
      executorIdToHost(o.executorId) = o.host
      executorIdToTaskCount.getOrElseUpdate(o.executorId, 0)
      if (!executorsByHost.contains(o.host)) {
        executorsByHost(o.host) = new HashSet[String]()
        executorAdded(o.executorId, o.host)
        newExecAvail = true
      }
      for (rack <- getRackForHost(o.host)) {
        hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
      }
    }

    // Shuffle the offers so tasks are not always placed on the same nodes (load balancing)
    val shuffledOffers = Random.shuffle(offers)
    // Build one buffer per offer to hold the tasks that will be assigned to that executor
    val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
    // Free cores available on each executor
    val availableCpus = shuffledOffers.map(o => o.cores).toArray
    // Task sets in scheduling order; the ordering is FIFO or FAIR, FIFO by default
    val sortedTaskSets = rootPool.getSortedTaskSetQueue
    for (taskSet <- sortedTaskSets) {
      logDebug("parentName: %s, name: %s, runningTasks: %s".format(
        taskSet.parent.name, taskSet.name, taskSet.runningTasks))
      if (newExecAvail) { // a new executor has been added
        taskSet.executorAdded() // recompute the locality levels of the TaskSetManager
      }
    }

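    // Nested loop: take each taskSet in scheduling order and try to launch its tasks,
    // one locality level at a time in increasing order. For each taskSet and locality,
    // walk all available executors, find the tasks that can start on each executor,
    // and add them to tasks: Seq[Seq[TaskDescription]]
    // Locality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY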
    var launchedTask = false
    for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
      do {
        // Offer resources to the taskSet by locality so it can run its tasks
        launchedTask = resourceOfferSingleTaskSet(
            taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      } while (launchedTask)
    }

    if (tasks.size > 0) {
      hasLaunchedTask = true
    }
    return tasks
  }

Follow up on the resourceOfferSingleTaskSet method:

private def resourceOfferSingleTaskSet(
      taskSet: TaskSetManager,
      maxLocality: TaskLocality,
      shuffledOffers: Seq[WorkerOffer],
      availableCpus: Array[Int],
      tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
    var launchedTask = false
    // Iterate over all executors
    for (i <- 0 until shuffledOffers.size) {
      val execId = shuffledOffers(i).executorId
      val host = shuffledOffers(i).host
      // Only if the executor's free cores cover the cores required by one task
      if (availableCpus(i) >= CPUS_PER_TASK) {
        try {
          // Ask the taskSet which task can start on this executor
          for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
            // Add the task to be started on this executor to tasks
            tasks(i) += task
            val tid = task.taskId
            taskIdToTaskSetManager(tid) = taskSet // task -> taskSetManager
            taskIdToExecutorId(tid) = execId // task -> executorId
            executorIdToTaskCount(execId) += 1 // one more task on this executor
            executorsByHost(host) += execId // host -> executorId
            availableCpus(i) -= CPUS_PER_TASK // subtract the task's cores from the executor's free cores
            assert(availableCpus(i) >= 0)
            launchedTask = true
          }
        } catch {
          case e: TaskNotSerializableException =>
            logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
            // Do not offer resources for this task, but don't throw an error to allow other
            // task sets to be submitted.
            return launchedTask
        }
      }
    }
    return launchedTask
  }

This method iterates over all offered executors. For each executor whose free cores cover the cores required by one task (CPUS_PER_TASK), it calls resourceOffer to get a task from the taskSet that can start on that executor, adds it to tasks, and updates the bookkeeping. Now look at the implementation of resourceOffer in detail:
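The interplay between the do/while loop in resourceOffers and a single pass of resourceOfferSingleTaskSet can be illustrated with a stripped-down model. This sketch ignores locality entirely and treats tasks as plain strings; offerOnce and assign are hypothetical names.

```scala
import scala.collection.mutable.{ArrayBuffer, Queue}

object OfferRoundSketch {
  // One pass over the offers: every executor with enough free cores gets at most one task.
  def offerOnce(
      pending: Queue[String],
      availableCpus: Array[Int],
      assigned: Seq[ArrayBuffer[String]],
      cpusPerTask: Int): Boolean = {
    var launched = false
    for (i <- availableCpus.indices) {
      if (availableCpus(i) >= cpusPerTask && pending.nonEmpty) {
        assigned(i) += pending.dequeue()
        availableCpus(i) -= cpusPerTask
        launched = true
      }
    }
    launched
  }

  // Repeat passes until one launches nothing, like the do/while loop in resourceOffers.
  def assign(tasks: Seq[String], cores: Seq[Int], cpusPerTask: Int): Seq[Seq[String]] = {
    val pending = Queue.empty[String]
    pending ++= tasks
    val availableCpus = cores.toArray
    val assigned = Vector.fill(cores.size)(ArrayBuffer.empty[String])
    while (offerOnce(pending, availableCpus, assigned, cpusPerTask)) {}
    assigned.map(_.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Two executors with 2 and 1 free cores; five tasks needing 1 core each.
    println(assign(Seq("t0", "t1", "t2", "t3", "t4"), cores = Seq(2, 1), cpusPerTask = 1))
  }
}
```

Because each pass hands out at most one task per executor, tasks are spread across executors before any executor receives a second one.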

def resourceOffer(
      execId: String,
      host: String,
      maxLocality: TaskLocality.TaskLocality)
    : Option[TaskDescription] =
  {
    if (!isZombie) {
      val curTime = clock.getTimeMillis()

      var allowedLocality = maxLocality

      if (maxLocality != TaskLocality.NO_PREF) {
        allowedLocality = getAllowedLocalityLevel(curTime)
        if (allowedLocality > maxLocality) {
          // We're not allowed to search for farther-away tasks
          allowedLocality = maxLocality
        }
      }

      dequeueTask(execId, host, allowedLocality) match {
        case Some((index, taskLocality, speculative)) =>
          // Found a task; do some bookkeeping and return a task description
          val task = tasks(index)
          val taskId = sched.newTaskId()
          // Do various bookkeeping
          copiesRunning(index) += 1
          val attemptNum = taskAttempts(index).size
          val info = new TaskInfo(taskId, index, attemptNum, curTime,
            execId, host, taskLocality, speculative)
          taskInfos(taskId) = info
          taskAttempts(index) = info :: taskAttempts(index)
          // Update our locality level for delay scheduling
          // NO_PREF will not affect the variables related to delay scheduling
          if (maxLocality != TaskLocality.NO_PREF) {
            currentLocalityIndex = getLocalityIndex(taskLocality)
            lastLaunchTime = curTime
          }
          // Serialize and return the task
          val startTime = clock.getTimeMillis()
          val serializedTask: ByteBuffer = try {
            Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, ser)
          } catch {
            // If the task cannot be serialized, then there's no point to re-attempt the task,
            // as it will always fail. So just abort the whole task-set.
            case NonFatal(e) =>
              val msg = s"Failed to serialize task $taskId, not attempting to retry it."
              logError(msg, e)
              abort(s"$msg Exception during serialization: $e")
              throw new TaskNotSerializableException(e)
          }
          if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
              !emittedTaskSizeWarning) {
            emittedTaskSizeWarning = true
            logWarning(s"Stage ${task.stageId} contains a task of very large size " +
              s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
              s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
          }
          addRunningTask(taskId)

          // We used to log the time it takes to serialize the task, but task size is already
          // a good proxy to task serialization time.
          // val timeTaken = clock.getTime() - startTime
          val taskName = s"task ${info.id} in stage ${taskSet.id}"
          logInfo(s"Starting $taskName (TID $taskId, $host, partition ${task.partitionId}," +
            s" $taskLocality, ${serializedTask.limit} bytes)")

          sched.dagScheduler.taskStarted(task, info)
          return Some(new TaskDescription(taskId = taskId, attemptNumber = attemptNum, execId,
            taskName, index, serializedTask))
        case _ =>
      }
    }
    None
  }

      if (maxLocality != TaskLocality.NO_PREF) {
        allowedLocality = getAllowedLocalityLevel(curTime)
        if (allowedLocality > maxLocality) {
          // We're not allowed to search for farther-away tasks
          allowedLocality = maxLocality
        }
      }

getAllowedLocalityLevel(curTime) applies delay scheduling to pick the locality level the task set may currently use. The goal is to launch each task at the best possible locality: the task set stays at its current level while tasks are still pending there and the locality wait has not expired, and only then relaxes to the next, less local level. The returned level is the worst locality tolerated in this scheduling round, so the search that follows only considers tasks at that level or better. Finally, allowedLocality is capped at maxLocality, so it ends up as the more local (stricter) of the two levels.
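A rough sketch of this relaxation logic, under loud simplifications: every level uses the same wait, the check for remaining pending tasks at a level is omitted, and LocalityState, allowedIndex, and clamp are hypothetical names. The 3000 ms wait in the example is only an illustrative value.

```scala
object DelaySchedulingSketch {
  // Levels ordered from most local to least local.
  val levels: Vector[String] =
    Vector("PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY")

  // Hypothetical simplification: one fixed wait per level.
  final class LocalityState(waitMs: Long) {
    var currentIndex: Int = 0
    var lastLaunchTime: Long = 0L

    // Relax by one level for each full wait that has elapsed since the last launch.
    def allowedIndex(now: Long): Int = {
      while (currentIndex < levels.length - 1 && now - lastLaunchTime >= waitMs) {
        lastLaunchTime += waitMs
        currentIndex += 1
      }
      currentIndex
    }
  }

  // The level actually used is never less local than maxLocality.
  def clamp(allowed: Int, maxLocality: Int): Int = math.min(allowed, maxLocality)

  def main(args: Array[String]): Unit = {
    val state = new LocalityState(waitMs = 3000L)
    for (now <- Seq(1000L, 3000L, 7000L)) {
      println(s"t=$now ms -> ${levels(state.allowedIndex(now))}")
    }
  }
}
```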

dequeueTask then looks for a suitable task within allowedLocality. A non-empty result means a task has been assigned to the executor: the method records the task's TaskInfo, updates the delay scheduling state, serializes the task, adds the task ID to the running tasks, notifies DAGScheduler, and finally returns a TaskDescription. Here is the implementation of dequeueTask:

private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
    : Option[(Int, TaskLocality.Value, Boolean)] =
  {
    for (index <- dequeueTaskFromList(execId, getPendingTasksForExecutor(execId))) {
      return Some((index, TaskLocality.PROCESS_LOCAL, false))
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {
      for (index <- dequeueTaskFromList(execId, getPendingTasksForHost(host))) {
        return Some((index, TaskLocality.NODE_LOCAL, false))
      }
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
      // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
      for (index <- dequeueTaskFromList(execId, pendingTasksWithNoPrefs)) {
        return Some((index, TaskLocality.PROCESS_LOCAL, false))
      }
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.RACK_LOCAL)) {
      for {
        rack <- sched.getRackForHost(host)
        index <- dequeueTaskFromList(execId, getPendingTasksForRack(rack))
      } {
        return Some((index, TaskLocality.RACK_LOCAL, false))
      }
    }

    if (TaskLocality.isAllowed(maxLocality, TaskLocality.ANY)) {
      for (index <- dequeueTaskFromList(execId, allPendingTasks)) {
        return Some((index, TaskLocality.ANY, false))
      }
    }

    // find a speculative task if all others tasks have been scheduled
    dequeueSpeculativeTask(execId, host, maxLocality).map {
      case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
  }

dequeueTask first looks for a PROCESS_LOCAL task pending for execId and, if there is one, returns it for scheduling. Otherwise it moves through NODE_LOCAL, NO_PREF, RACK_LOCAL, and ANY in turn, checking a level only when TaskLocality.isAllowed reports that the level is within maxLocality (the allowedLocality passed in by resourceOffer), and returns the first task it finds.
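The level-by-level fall-through can be sketched with a small enumeration. This is an illustration only: it assumes isAllowed simply compares enumeration order, the pending map is a hypothetical replacement for the per-executor, per-host, and per-rack lists, and firstAllowedTask only peeks at a task instead of removing it.

```scala
object LocalityOrderSketch {
  object Locality extends Enumeration {
    val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
  }

  // A level is allowed when it is at least as local as the constraint.
  def isAllowed(constraint: Locality.Value, condition: Locality.Value): Boolean =
    condition <= constraint

  // Hypothetical pending lists keyed by level; returns the first task at an allowed level.
  def firstAllowedTask(
      pending: Map[Locality.Value, List[Int]],
      maxLocality: Locality.Value): Option[(Int, Locality.Value)] = {
    val allowed = Locality.values.toSeq.filter(level => isAllowed(maxLocality, level))
    allowed.collectFirst {
      case level if pending.getOrElse(level, Nil).nonEmpty => (pending(level).head, level)
    }
  }

  def main(args: Array[String]): Unit = {
    val pending = Map(Locality.RACK_LOCAL -> List(4), Locality.ANY -> List(9))
    println(firstAllowedTask(pending, Locality.NODE_LOCAL))
    println(firstAllowedTask(pending, Locality.RACK_LOCAL))
  }
}
```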

dequeueTaskFromList takes a task from the end of the given pending list (for example, the PROCESS_LOCAL list for execId) and returns its index in the taskSet. Its implementation:

private def dequeueTaskFromList(execId: String, list: ArrayBuffer[Int]): Option[Int] = {
    var indexOffset = list.size
    while (indexOffset > 0) {
      indexOffset -= 1
      val index = list(indexOffset)
      if (!executorIsBlacklisted(execId, index)) {
        // This should almost always be list.trimEnd(1) to remove tail
        list.remove(indexOffset)
        if (copiesRunning(index) == 0 && !successful(index)) {
          return Some(index)
        }
      }
    }
    None
  }

There is a blacklist mechanism here: executorIsBlacklisted checks whether the executor is on the task's blacklist. The blacklist records the executor ID on which the task last failed, together with the time of that failure. For a "blackout" period after the failure (EXECUTOR_TASK_BLACKLIST_TIMEOUT), the task is not scheduled on that executor again.

private def executorIsBlacklisted(execId: String, taskId: Int): Boolean = {
    if (failedExecutors.contains(taskId)) {
      val failed = failedExecutors.get(taskId).get
      return failed.contains(execId) &&
        clock.getTimeMillis() - failed.get(execId).get < EXECUTOR_TASK_BLACKLIST_TIMEOUT
    }
    false
  }
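The same check can be written more compactly with Option combinators. The sketch below is standalone and only approximates the method above: recordFailure is a hypothetical helper, and the clock and timeout are passed in as plain parameters.

```scala
import scala.collection.mutable.HashMap

object BlacklistSketch {
  // taskIndex -> (executorId -> time of the last failure in ms), mirroring failedExecutors.
  val failedExecutors: HashMap[Int, HashMap[String, Long]] = HashMap.empty

  // Hypothetical helper: remember when a task failed on an executor.
  def recordFailure(taskIndex: Int, execId: String, now: Long): Unit =
    failedExecutors.getOrElseUpdate(taskIndex, HashMap.empty[String, Long]).update(execId, now)

  // Blacklisted only while the failure is more recent than the timeout.
  def isBlacklisted(taskIndex: Int, execId: String, now: Long, timeoutMs: Long): Boolean =
    failedExecutors.get(taskIndex)
      .flatMap(_.get(execId))
      .exists(failedAt => now - failedAt < timeoutMs)

  def main(args: Array[String]): Unit = {
    recordFailure(taskIndex = 0, execId = "exec-1", now = 1000L)
    println(isBlacklisted(0, "exec-1", now = 1500L, timeoutMs = 1000L))
    println(isBlacklisted(0, "exec-1", now = 2500L, timeoutMs = 1000L))
  }
}
```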

You can see the last piece of code in the dequeueTask method:

 // find a speculative task if all others tasks have been scheduled
    dequeueSpeculativeTask(execId, host, maxLocality).map {
      case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}

This is where speculative execution starts. Speculation launches additional copies of a task on different executors; once one copy succeeds, the copies running on other executors are killed. Only slow-running tasks get a speculative copy.
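How "slow-running" might be decided is not shown in the code above. The sketch below is one plausible rule, stated as an assumption rather than Spark's exact logic: once a given fraction of the tasks has finished, a running task becomes a candidate when its elapsed time exceeds a multiple of the median finished duration. The quantile and multiplier parameters and their default values here are illustrative.

```scala
object SpeculationSketch {
  // Returns the indices of running tasks that look slow enough to deserve a second copy.
  def speculatable(
      finishedMs: Seq[Long],
      runningMs: Map[Int, Long],
      totalTasks: Int,
      quantile: Double = 0.75,
      multiplier: Double = 1.5): Set[Int] = {
    val minFinished = math.floor(quantile * totalTasks).toInt
    if (finishedMs.isEmpty || finishedMs.size < minFinished) {
      Set.empty[Int] // too few finished tasks to judge what "slow" means
    } else {
      val sorted = finishedMs.sorted
      val median = sorted(sorted.size / 2)
      val threshold = multiplier * median
      runningMs.collect { case (index, elapsed) if elapsed > threshold => index }.toSet
    }
  }

  def main(args: Array[String]): Unit = {
    val finished = Seq(100L, 110L, 120L)
    println(speculatable(finished, runningMs = Map(3 -> 500L), totalTasks = 4))
  }
}
```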

scheduler.resourceOffers(workOffers) returns a Seq[Seq[TaskDescription]] describing which tasks start on which executors; launchTasks is then called to start each of them. It is implemented as follows:

private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
      for (task <- tasks.flatten) {
        val serializedTask = ser.serialize(task)
        if (serializedTask.limit >= maxRpcMessageSize) {
          scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
            try {
              var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
                "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
                "spark.rpc.message.maxSize or using broadcast variables for large values."
              msg = msg.format(task.taskId, task.index, serializedTask.limit, maxRpcMessageSize)
              taskSetMgr.abort(msg)
            } catch {
              case e: Exception => logError("Exception in error callback", e)
            }
          }
        }
        else {
          val executorData = executorDataMap(task.executorId)
          executorData.freeCores -= scheduler.CPUS_PER_TASK

          logInfo(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
            s"${executorData.executorHost}.")

          executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
        }
      }
    }

The task is serialized first. If its serialized size reaches maxRpcMessageSize (set by spark.rpc.message.maxSize, 128 MB by default), the task is not sent and the corresponding TaskSetManager is aborted, which puts it in zombie mode. Otherwise, the executor's freeCores is reduced by CPUS_PER_TASK and a LaunchTask message is sent to the executor to start the task.
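The size gate can be isolated into a tiny decision function. This is a sketch with hypothetical names (Decision, Send, Abort, decide); it only models the comparison against the limit, not the RPC call or the abort itself.

```scala
import java.nio.ByteBuffer

object LaunchSizeCheckSketch {
  // Hypothetical result type: either send the task or abort its task set.
  sealed trait Decision
  case object Send extends Decision
  final case class Abort(reason: String) extends Decision

  // Mirrors the check in launchTasks: a task at or above the limit is never sent.
  def decide(taskId: Long, serialized: ByteBuffer, maxRpcMessageSize: Int): Decision =
    if (serialized.limit() >= maxRpcMessageSize) {
      Abort(s"Serialized task $taskId was ${serialized.limit()} bytes, " +
        s"which exceeds max allowed: $maxRpcMessageSize bytes")
    } else {
      Send
    }

  def main(args: Array[String]): Unit = {
    println(decide(1L, ByteBuffer.allocate(10), maxRpcMessageSize = 16))
    println(decide(2L, ByteBuffer.allocate(16), maxRpcMessageSize = 16))
  }
}
```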
