Preface
Earlier we analyzed how the DAG Scheduler splits stages [if you're interested, see "How the DAG Scheduler Splits Stages"]. Now let's look at how the Task Scheduler divides the work into tasks.
Note: the Spark source version used in this article is 2.3.0 and the IDE is IDEA 2019. If you want to follow along, you can download the source package, unpack it, and open it directly in IDEA.
Main Text
1. Computing Task Parallelism
First, look at the method the DAG Scheduler uses to submit a TaskSet: submitMissingTasks(stage: Stage, jobId: Int). It takes the stage and the job id; the stage holds the stage's last RDD, from which the preceding RDDs can be reached (and later serialized) through its lineage.
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage
As you can see, partitionsToCompute is created here, and the partitions to compute are determined via findMissingPartitions().
override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    .getOrElse(0 until numPartitions)
}
// numPartitions is the number of partitions of the stage's last RDD (defined in Stage)
val numPartitions = rdd.partitions.length
As mentioned before, the parallelism of a stage's tasks is determined by the number of partitions of its last RDD, and this is exactly where that comes from.
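A minimal sketch (a hypothetical spark-shell session, not from the Spark source) that illustrates this:

// The task count of each stage follows the partition count of its last RDD.
val rdd = sc.parallelize(1 to 100, 4)                       // 4 partitions
val counts = rdd.map(x => (x % 2, 1)).reduceByKey(_ + _, 8) // 8 partitions after the shuffle
println(rdd.getNumPartitions)    // 4  -> the ShuffleMapStage runs 4 tasks
println(counts.getNumPartitions) // 8  -> the ResultStage runs 8 tasks
counts.collect()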
2. Computing the Best Locations for Task Execution
Now that the task partitions are determined, the next question is how to assign each task to a node. Depending on the stage type, a different partition index is passed to getPreferredLocs(stage.rdd, id), which returns the preferred locations for that task.
// (the try/catch wrapping this in submitMissingTasks is omitted here)
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = stage match {
  case s: ShuffleMapStage =>
    partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
  case s: ResultStage =>
    partitionsToCompute.map { id =>
      val p = s.partitions(id)
      (id, getPreferredLocs(stage.rdd, p))
    }.toMap
}
def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = {
  getPreferredLocsInternal(rdd, partition, new HashSet)
}
// This is a recursive method: it walks back through the parent RDDs and collects the preferred locations.
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration. SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
Inside, rdd.preferredLocations(rdd.partitions(partition)) is called to get the preferred locations:
final def preferredLocations(split: Partition): Seq[String] = {
  checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
    getPreferredLocations(split)
  }
}
Note the getOrElse(): if the RDD has been checkpointed, the checkpoint RDD's locations are used; otherwise getPreferredLocations(split) is called. That method's default implementation in RDD returns nothing, but subclasses such as HadoopRDD and ShuffledRDD override it. Here is HadoopRDD's implementation:
override def getPreferredLocations(split: Partition): Seq[String] = {
  val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
  val locs = hsplit match {
    case lsplit: InputSplitWithLocationInfo =>
      HadoopRDD.convertSplitLocationInfo(lsplit.getLocationInfo)
    case _ => None
  }
  locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
}
This is where the locations at which a task should run are obtained: for an input RDD they come from the underlying input split (e.g. the hosts of the HDFS blocks).
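A minimal sketch (hypothetical HDFS path, not from the Spark source) showing how preferred locations can be inspected from user code:

val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
lines.partitions.foreach { p =>
  // for an HDFS-backed RDD these are the hosts holding the partition's blocks
  println(lines.preferredLocations(p)) // e.g. List(host1, host2, host3)
}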
3. Creating the Tasks
Back in DAGScheduler.scala: partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap gives us taskIdToLocations, the preferred locations for every partition that needs to be computed.
Further down, taskBinary = sc.broadcast(taskBinaryBytes) is created: taskBinaryBytes serializes the stage's last RDD (which, through its dependencies, pulls in the preceding RDDs as well), and this taskBinary is then broadcast to the executors.
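For context, this is roughly how taskBinaryBytes is built in the 2.3.0 submitMissingTasks (surrounding try/catch omitted): a ShuffleMapStage serializes the (rdd, shuffleDep) pair, while a ResultStage serializes (rdd, func):

val taskBinaryBytes: Array[Byte] = stage match {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  case stage: ShuffleMapStage =>
    JavaUtils.bufferToArray(
      closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
  // For ResultTask, serialize and broadcast (rdd, func).
  case stage: ResultStage =>
    JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
taskBinary = sc.broadcast(taskBinaryBytes)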
Further down still, the tasks themselves are created:
val tasks: Seq[Task[_]] = try {
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    case stage: ShuffleMapStage =>
      stage.pendingPartitions.clear()
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = partitions(id)
        stage.pendingPartitions += id
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId)
      }
    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
      }
  }
}
Task creation starts by matching on the stage type: a ShuffleMapStage produces ShuffleMapTasks, while a ResultStage produces ResultTasks. The partition, the stage id, the broadcast task binary, the preferred locations and so on are passed into the constructor, producing one task per partition.
4. Submitting the Tasks
With the tasks created, submission begins. If the number of tasks is greater than 0, a TaskSet is created wrapping the tasks array, the stage id, the attempt number, the job id and the properties, and taskScheduler.submitTasks submits it.
if (tasks.size > 0) {
  logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
    s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
}
5. Task Scheduling via the schedulableBuilder
After the TaskSet reaches TaskSchedulerImpl, a TaskSetManager is created first, and scheduling is handled by schedulableBuilder, an instance of SchedulableBuilder. Internally, the schedulableBuilder holds a rootPool, which we can think of as a tree of schedulable entities: every TaskSetManager is inserted into this tree when it is created. Scheduling therefore starts by putting the TaskSet's manager into the rootPool; afterwards, backend.reviveOffers() is called.
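A minimal sketch (hypothetical app name, not from the Spark source) of the knob that decides which SchedulableBuilder populates the rootPool:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("scheduler-demo")        // hypothetical app name
  .set("spark.scheduler.mode", "FAIR") // "FIFO" is the default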
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets.foreach { case (_, ts) =>
      ts.isZombie = true
    }
    stageTaskSets(taskSet.stageAttemptId) = manager
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}
backend.reviveOffers() sends a ReviveOffers message to the driver endpoint via driverEndpoint.send(ReviveOffers), which is handled in receive:
override def receive: PartialFunction[Any, Unit] = {
  case StatusUpdate(executorId, taskId, state, data) =>
    scheduler.statusUpdate(taskId, state, data.value)
    if (TaskState.isFinished(state)) {
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          executorInfo.freeCores += scheduler.CPUS_PER_TASK
          makeOffers(executorId)
        case None =>
          // Ignoring the update since we don't know about the executor.
          logWarning(s"Ignored task status update ($taskId state $state) " +
            s"from unknown executor with ID $executorId")
      }
    }
  case ReviveOffers =>
    makeOffers()
The ReviveOffers case ends up calling makeOffers().
private def makeOffers() {
  // Make sure no executor is killed while some task is launching on it
  val taskDescs = withLock {
    // Filter out executors under killing
    val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
    val workOffers = activeExecutors.map {
      case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
    }.toIndexedSeq
    scheduler.resourceOffers(workOffers)
  }
  if (!taskDescs.isEmpty) {
    launchTasks(taskDescs)
  }
}
Before launching anything, the alive executors (activeExecutors) are collected and a WorkerOffer is built for each one, carrying the executor id, its host (executorHost) and its free cores. scheduler.resourceOffers(workOffers) then assigns tasks to these offers, and launchTasks() is called with the resulting task descriptions.
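To make the offer model concrete, here is a conceptual sketch (simplified; the real resourceOffers also honors locality levels and the rootPool's scheduling order):

// Each alive executor becomes one WorkerOffer; an executor can accept
// cores / CPUS_PER_TASK tasks at a time.
case class WorkerOffer(executorId: String, host: String, cores: Int)

def maxConcurrentTasks(offers: Seq[WorkerOffer], cpusPerTask: Int): Int =
  offers.map(_.cores / cpusPerTask).sum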
6. Sending Tasks to the Executors
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = TaskDescription.encode(task)
    if (serializedTask.limit() >= maxRpcMessageSize) {
      Option(scheduler.taskIdToTaskSetManager.get(task.taskId)).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
            "spark.rpc.message.maxSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    } else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
        s"${executorData.executorHost}.")
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}
Here each task description is serialized via TaskDescription.encode. If the serialized task exceeds the maximum RPC message size, the task set is aborted with a hint to raise spark.rpc.message.maxSize or to use broadcast variables; otherwise, the task is sent to the corresponding executorEndpoint as a LaunchTask message. From here on, it is the executor's job to run the task.
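A minimal sketch (illustrative value) of the configuration the abort message points at, should large tasks keep hitting the limit:

import org.apache.spark.SparkConf

// Value is in MiB; the default is 128. Prefer broadcast variables for large
// read-only data instead of simply raising this limit.
val conf = new SparkConf().set("spark.rpc.message.maxSize", "256")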
Summary
This post walked through how the Task Scheduler divides and dispatches tasks: from determining the task partitions and obtaining their preferred locations, to serializing the tasks and launching them against the WorkerOffers. Next, we will look at how the executor runs a task.