Spark: DAGScheduler Principles and Source Code Analysis

Job Trigger Flow: Principles and Source Code Analysis


We use a WordCount example to analyze how a Spark job is triggered.
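For reference, the complete driver we trace line by line looks roughly like the sketch below (a minimal sketch; the object name, application name, and HDFS path are hypothetical placeholders, not from the original article).

import org.apache.spark.{SparkConf, SparkContext}

// Minimal WordCount driver sketch; the app name and input path are placeholders.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val linesRDD = sc.textFile("hdfs://namenode:9000/input/words.txt") // hypothetical path
    val wordsRDD = linesRDD.flatMap(line => line.split(" "))
    val pairsRDD = wordsRDD.map(word => (word, 1))
    val countRDD = pairsRDD.reduceByKey(_ + _)
    countRDD.foreach(count => println(count._1 + ":" + count._2)) // the action that triggers the job
    sc.stop()
  }
}

We now trace each of these lines through the Spark source.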
Code: var linesRDD = sc.textFile("hdfs://")
The textFile method in SparkContext:

  /**
   * The hadoopFile call creates a HadoopRDD whose elements are (key, value) pairs:
   * the key is the byte offset of each line in the HDFS/text file, and the value is the line itself.
   * map is then applied to drop the key and keep the value, yielding a MapPartitionsRDD of text lines.
   */
  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    // Hadoop MapReduce TextInputFormat: a way to read text data; LongWritable is the byte offset
    // of the line read (the key), Text is the line content (the value).
    // In the map below, pair is the (offset, line) tuple; pair._2.toString extracts the line.
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString)
  }

The hadoopFile method:

  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()
    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    // The HadoopRDD created here reads the configuration through the broadcast variable above, so each worker can read it locally
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }

hadoopFile().map() ultimately invokes the map method in RDD.scala:

  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

Code: var wordsRDD = linesRDD.flatMap(line => line.split(" "))
The flatMap method in RDD.scala:

  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

Code: var pairsRDD = wordsRDD.map(word => (word,1))
The map method in RDD.scala:

  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

Code: var countRDD = pairsRDD.reduceByKey(_+_)

RDD itself does not define a reduceByKey method. An implicit conversion (roughly analogous to a wrapper class in Java) comes into play: the compiler looks up the implicit conversion defined in RDD.scala:

  implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }

This creates a PairRDDFunctions instance. Calling reduceByKey on a MapPartitionsRDD triggers Scala's implicit resolution: the conversion in scope wraps the MapPartitionsRDD in PairRDDFunctions, and its reduceByKey method is then invoked.
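To illustrate the mechanism in isolation, here is a simplified analogue of the rddToPairRDDFunctions pattern (illustrative names only, not Spark code): an implicit conversion in scope adds a reduceByKey method to a plain Seq[(K, V)], just as PairRDDFunctions adds it to RDD[(K, V)].

import scala.language.implicitConversions

// Wrapper class providing the extra method, analogous to PairRDDFunctions.
class SeqPairFunctions[K, V](pairs: Seq[(K, V)]) {
  def reduceByKey(func: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(func) }
}

object SeqPairFunctions {
  // With this conversion in scope, any Seq[(K, V)] gains reduceByKey.
  implicit def seqToPairFunctions[K, V](pairs: Seq[(K, V)]): SeqPairFunctions[K, V] =
    new SeqPairFunctions(pairs)
}

// Usage:
//   import SeqPairFunctions._
//   Seq(("a", 1), ("a", 2), ("b", 3)).reduceByKey(_ + _)   // Map(a -> 3, b -> 3)

The PairRDDFunctions.reduceByKey that is actually invoked then delegates to combineByKey: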

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKey[V]((v: V) => v, func, func, partitioner)
  }

Code: countRDD.foreach(count => println(count._1 + ":"+count._2))
The foreach method in RDD.scala:

  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

There are several nested runJob overloads in SparkContext:

  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    // Core: call runJob on the DAGScheduler that was created during SparkContext initialization
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

Stage Division Algorithm Analysis

Spark operators are generally chained together, which raises the question of how to execute these chained computations. Spark's strategy is to first divide the operators into stages and then execute them stage by stage.


Algorithm summary: starting from the RDD on which the action is triggered, Spark walks backwards through the lineage. It first creates a stage (stage 1) for the last RDD. While walking backwards, whenever it finds a wide (shuffle) dependency on some RDD, it creates a new stage (stage 0) with that RDD as the new stage's last RDD. It continues backwards in the same way, splitting stages at wide dependencies and keeping narrow dependencies in the same stage, until every RDD has been visited. A simplified sketch of this walk is shown below.
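The sketch below captures the idea in a few lines (a simplified illustration with hypothetical names such as StageCountSketch, not the real DAGScheduler code, which is walked through later): walk backwards from the final RDD and treat every ShuffleDependency as a stage boundary.

import scala.collection.mutable

import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD

object StageCountSketch {
  // Count how many stages a lineage splits into: one ShuffleMapStage per shuffle boundary
  // plus the final ResultStage.
  def countStages(finalRdd: RDD[_]): Int = {
    val visited = mutable.HashSet[RDD[_]]()
    var shuffleBoundaries = 0
    def visit(rdd: RDD[_]): Unit = {
      if (visited.add(rdd)) {
        rdd.dependencies.foreach {
          case shufDep: ShuffleDependency[_, _, _] =>
            shuffleBoundaries += 1     // wide dependency: the parent RDD ends a new, earlier stage
            visit(shufDep.rdd)
          case narrowDep: NarrowDependency[_] =>
            visit(narrowDep.rdd)       // narrow dependency: same stage, keep walking backwards
        }
      }
    }
    visit(finalRdd)
    shuffleBoundaries + 1
  }
}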

Spark Web UI (figure omitted)

Source Code Walkthrough
Step 1: the runJob method of the DAGScheduler class
Source location: org.apache.spark.scheduler.DAGScheduler

  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
    waiter.completionFuture.value.get match {
      case scala.util.Success(_) =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case scala.util.Failure(exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
        throw exception
    }
  }

Step 2: the submitJob method of the DAGScheduler class

 private[spark] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
  taskScheduler.setDAGScheduler(this)

......

  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      // Return immediately if the job is running 0 tasks
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    // Post a JobSubmitted event to eventProcessLoop (the event loop)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }
// Encapsulates which partitions are to be computed, the job's JobListener, and so on
private[scheduler] case class JobSubmitted(
    jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties = null)
  extends DAGSchedulerEvent

......

Step 3: the post method of eventProcessLoop (defined in EventLoop)

// A double-ended blocking queue: elements can be inserted or removed at either end
  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  // Exposed for testing.
  private[spark] val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
        // Receive events. onReceive is not implemented here; the concrete implementation
        // is in DAGSchedulerEventProcessLoop#onReceive.
          val event = eventQueue.take() // take the next event that was posted
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) =>
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }

  }

......

  def post(event: E): Unit = {
    eventQueue.put(event)
  }

Step 3 (continued): the onReceive method

 protected def onReceive(event: E): Unit // abstract method; the subclass implementation is called

Step 3 (continued): the onReceive implementation in the subclass DAGSchedulerEventProcessLoop
Source location: org.apache.spark.scheduler.DAGScheduler.DAGSchedulerEventProcessLoop

  /**
   * The main event loop of the DAG scheduler.
   */
  override def onReceive(event: DAGSchedulerEvent): Unit = {
    val timerContext = timer.time()
    try {
      doOnReceive(event)
    } finally {
      timerContext.stop()
    }
  }

Step 3 (continued): the doOnReceive method

  private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

......

}

The DAGScheduler starts an EventLoop thread (the event loop) that keeps taking events off the event queue. Events are placed on the queue through the EventLoop's post method; when the EventLoop takes an event, it calls back into onReceive, which in turn calls doOnReceive to handle it.
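The pattern is easy to reproduce in isolation. Below is a minimal sketch (an illustrative class, not Spark's actual EventLoop) of the same producer/consumer structure: post enqueues an event, and a daemon thread dequeues events and dispatches each one to onReceive.

import java.util.concurrent.LinkedBlockingDeque
import java.util.concurrent.atomic.AtomicBoolean

// Simplified event loop: a daemon thread blocks on a queue and hands every event to onReceive.
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get()) {
          onReceive(eventQueue.take())  // blocks until an event is available
        }
      } catch {
        case _: InterruptedException => // exit when stop() interrupts take()
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped.set(true); eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)  // producers enqueue events

  protected def onReceive(event: E): Unit           // subclasses implement event handling
}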

Step 4: the handleJobSubmitted method

/**
   * The core entry point of DAGScheduler's job scheduling
   */
  private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_], // finalRDD is the last RDD of the lineage when the action was triggered
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
    // Create the final stage from the last RDD of the job that triggered the action
    var finalStage: ResultStage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.
      // Step 1: create a ResultStage and add it to the DAGScheduler's internal in-memory caches
      finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
    } catch {
      case e: BarrierJobSlotsNumberCheckFailed =>
        logWarning(s"The job $jobId requires to run a barrier stage that requires more slots " +
          "than the total number of slots in the cluster currently.")
        // If jobId doesn't exist in the map, Scala coverts its value null to 0: Int automatically.
        val numCheckFailures = barrierJobIdToNumTasksCheckFailures.compute(jobId,
          new BiFunction[Int, Int, Int] {
            override def apply(key: Int, value: Int): Int = value + 1
          })
        if (numCheckFailures <= maxFailureNumTasksCheck) {
          messageScheduler.schedule(
            new Runnable {
              override def run(): Unit = eventProcessLoop.post(JobSubmitted(jobId, finalRDD, func,
                partitions, callSite, listener, properties))
            },
            timeIntervalNumTasksCheck,
            TimeUnit.SECONDS
          )
          return
        } else {
          // Job failed, clear internal data.
          barrierJobIdToNumTasksCheckFailures.remove(jobId)
          listener.jobFailed(e)
          return
        }

      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }
    // Job submitted, clear internal data.
    barrierJobIdToNumTasksCheckFailures.remove(jobId)
    // Step 2: create a job from finalStage, i.e. finalStage is the last stage of this job
    val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
    clearCacheLocs()
    logInfo("Got job %s (%s) with %d output partitions".format(
      job.jobId, callSite.shortForm, partitions.length))
    logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
    logInfo("Parents of final stage: " + finalStage.parents)
    logInfo("Missing parents: " + getMissingParentStages(finalStage))

    val jobSubmissionTime = clock.getTimeMillis()
    // Step 3: add the job to the in-memory caches
    jobIdToActiveJob(jobId) = job
    activeJobs += job
    finalStage.setActiveJob(job)
    val stageIds = jobIdToStageIds(jobId).toArray
    val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
    listenerBus.post(
      SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
    
   // Step 4: submit finalStage via submitStage()
   submitStage(finalStage)
  }

Step 4 (continued): the createResultStage method

  /**
   * Create a ResultStage associated with the provided jobId.
   */
  private def createResultStage(
      rdd: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      jobId: Int,
      callSite: CallSite): ResultStage = {
    checkBarrierStageWithDynamicAllocation(rdd)
    checkBarrierStageWithNumSlots(rdd)
    checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)
    // Get (or create) the parent stages
    val parents = getOrCreateParentStages(rdd, jobId)
    val id = nextStageId.getAndIncrement()
    val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
    stageIdToStage(id) = stage // register the stage in the stageIdToStage HashMap
    updateJobIdStageIdMaps(jobId, stage) // update the jobId -> stageIds mapping
    stage 
  }

Step 5: the submitStage method, the entry point of the stage division algorithm

/** Submits stage, but first recursively submits any missing parents. */
  /**
   * This is the entry point of stage division: submitStage and getMissingParentStages together perform the division.
   *
   */
  private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        // Call getMissingParentStages() to get this stage's missing parent stages, sorted by id
        // in ascending order so they are computed front to back, since the RDDs depend on one another
        val missing = getMissingParentStages(stage).sortBy(_.id)
        logDebug("missing: " + missing)
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          // Submit the earliest stage (stage 0) first; all the other stages are in waitingStages at this point
          submitMissingTasks(stage, jobId.get)
        } else {
          // Recursively submit every missing parent stage
          for (parent <- missing) {
            submitStage(parent)
          }
          // Put the current stage into the queue of stages waiting to run
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }

Step 6: the getMissingParentStages method

/**
   * Get the missing parent stages of a stage.
   *
   * If all dependencies of the last RDD are narrow, no new shuffle stage is created;
   * but whenever an RDD has a wide (shuffle) dependency, a new stage is created from that dependency and returned.
   *
   */
  private def getMissingParentStages(stage: Stage): List[Stage] = {
    // HashSet holding the missing parent stages
    val missing = new HashSet[Stage]
    // RDDs already visited; the traversal walks backwards from the last RDD and records each RDD it has seen
    val visited = new HashSet[RDD[_]]
    // We are manually maintaining a stack here to prevent StackOverflowError
    // caused by recursively visiting
    // Nothing is actually computed here; we are only traversing RDD metadata, not partition data.
    val waitingForVisit = new ArrayStack[RDD[_]] // stack of RDDs waiting to be visited
    def visit(rdd: RDD[_]) {
      if (!visited(rdd)) {
        visited += rdd // mark this RDD as visited
        val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
        if (rddHasUncachedPartitions) { // only walk the dependencies if some partitions are not cached
          // Iterate over the RDD's dependencies
          for (dep <- rdd.dependencies) {
            dep match {
              // For a shuffle (wide) dependency, get or create a ShuffleMapStage from it.
              // The last stage is a ResultStage, not a ShuffleMapStage,
              // but every stage before the finalStage is a ShuffleMapStage.
              case shufDep: ShuffleDependency[_, _, _] =>
                val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
                if (!mapStage.isAvailable) {
                  missing += mapStage
                }
              // For a narrow dependency, push the parent RDD onto the stack
              case narrowDep: NarrowDependency[_] =>
                waitingForVisit.push(narrowDep.rdd)
            }
          }
        }
      }
    }

    // Push the stage's last RDD onto the stack
    waitingForVisit.push(stage.rdd)
    while (waitingForVisit.nonEmpty) {
      // Call the local visit function
      visit(waitingForVisit.pop())
    }
    // Return all missing parent stages of this stage so they can be looked up recursively later
    missing.toList
  }

Step 7: the getOrCreateShuffleMapStage method

  private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
    shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
      case Some(stage) =>
        stage

      case None =>
        // Create stages for all missing ancestor shuffle dependencies.
        getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
          // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
          // that were not already in shuffleIdToMapStage, it's possible that by the time we
          // get to a particular dependency in the foreach loop, it's been added to
          // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
          // SPARK-13902 for more information.
          if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
            // Create a ShuffleMapStage
            createShuffleMapStage(dep, firstJobId)
          }
        }
        // Finally, create a stage for the given shuffle dependency.
        createShuffleMapStage(shuffleDep, firstJobId)
    }
  }

Let's walk through stage division in detail using an example DAG in which RDD G triggers the action. (The original figure is omitted; it shows RDD G depending on RDD B through a narrow dependency and on RDD F through a shuffle dependency, while RDD B depends on RDD A through a shuffle dependency.)

First iteration:
1. RDD G is passed in and pushed onto the stack.

waitingForVisit.push(stage.rdd)

2. The stack is not empty, so RDD G is popped off the stack and passed into the visit function.

while (waitingForVisit.nonEmpty) {
      // Call the local visit function
      visit(waitingForVisit.pop())
}

3. RDD G has not been visited yet, so the body of the if is executed.

 val waitingForVisit = new ArrayStack[RDD[_]] // stack of RDDs waiting to be visited
    def visit(rdd: RDD[_]) {
      if (!visited(rdd)) {

4. RDD G is processed and added to visited.

visited += rdd

5. Iterate over the parent RDDs that RDD G depends on:

 for (dep <- rdd.dependencies) {
       dep match {
      // A shuffle dependency (RDD F): create a new stage and store it in missing
         case shufDep: ShuffleDependency[_, _, _] =>
         val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
         if (!mapStage.isAvailable) {
                  missing += mapStage
          }
       // A narrow dependency (RDD B): it belongs to the same stage as RDD G, so push RDD B onto the stack
        case narrowDep: NarrowDependency[_] =>
             waitingForVisit.push(narrowDep.rdd)
       }
 }

RDD G depends on two RDDs:

  1. The dependency on RDD B is narrow, so no new stage is created; RDD B is simply pushed onto the stack.
  2. The dependency on RDD F is wide, so RDD G and RDD F are split into two stages. The shuffle dependency's stage is recorded in missing, and RDD F's stage (Stage 2) becomes the parent of RDD G's stage (Stage 3).

Second iteration: RDD B is now on the stack.
RDD B's parent RDD is RDD A, and the dependency between them is wide, so a new stage, Stage 1, is created.

At this point missing contains two stages: Stage 1 and Stage 2.

submitStage is called recursively for Stage 1; RDD A has no parent RDD, so the stage is submitted.

 // Recurse until the earliest stage, which has no parent stage
 for (parent <- missing) {
    submitStage(parent)
 }

submitStage is then called recursively for Stage 2; all of RDD F's parent RDDs are narrow dependencies, so no new stage is created and they all belong to Stage 2.

Summary of the stage division algorithm (see the lineage sketch below for the WordCount case):

  1. Walk backwards from the finalStage.
  2. Split off a new stage at every wide (shuffle) dependency.
  3. Use recursion to submit parent stages first.
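Applied to the WordCount example from the start of this article, reduceByKey introduces the only shuffle, so the job splits into two stages: a ShuffleMapStage covering textFile -> flatMap -> map, and a ResultStage for the shuffled output consumed by foreach. The boundary can be seen in the lineage (a usage sketch assuming the RDDs defined earlier; the printed shape is indicative, not exact output):

// Print the lineage of the final RDD; the ShuffledRDD created by reduceByKey marks the stage boundary.
println(countRDD.toDebugString)
// Indicative shape of the output:
//   ShuffledRDD                      <- reduceByKey  (ResultStage)
//   +- MapPartitionsRDD              <- map          (ShuffleMapStage)
//      |  MapPartitionsRDD           <- flatMap
//      |  MapPartitionsRDD/HadoopRDD <- textFile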

Task Locality Algorithm

1. In submitMissingTasks, task locality is obtained through the following code.

 // Submit the stage: create a batch of tasks for it, one task per partition
  private def submitMissingTasks(stage: Stage, jobId: Int) {
    logDebug("submitMissingTasks(" + stage + ")")

    // First figure out the indexes of partition ids to compute.
    // Determine which partitions need to be computed (and therefore how many tasks to create)
    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

    // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
    // with this Stage
    val properties = jobIdToActiveJob(jobId).properties
    // Add the stage to the runningStages set
    runningStages += stage
    // SparkListenerStageSubmitted should be posted before testing whether tasks are
    // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
    // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
    // event.
    stage match {
      case s: ShuffleMapStage =>
        outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
      case s: ResultStage =>
        outputCommitCoordinator.stageStart(
          stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
    }
    // Compute task locality: the preferred-location algorithm for each task
    val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
        case s: ResultStage =>
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    } catch {
      case NonFatal(e) =>
        stage.makeNewStageAttempt(partitionsToCompute.size)
        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

    stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)

    // If there are tasks to execute, record the submission time of the stage. Otherwise,
    // post the even without the submission time, which indicates that this stage was
    // skipped.
    if (partitionsToCompute.nonEmpty) {
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
    }
    listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

    // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
    // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
    // the serialized copy of the RDD and for each task we will deserialize it, which means each
    // task gets a different copy of the RDD. This provides stronger isolation between tasks that
    // might modify state of objects referenced in their closures. This is necessary in Hadoop
    // where the JobConf/Configuration object is not thread-safe.
    var taskBinary: Broadcast[Array[Byte]] = null
    var partitions: Array[Partition] = null
    try {
      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
      // For ResultTask, serialize and broadcast (rdd, func).
      var taskBinaryBytes: Array[Byte] = null
      // taskBinaryBytes and partitions are both effected by the checkpoint status. We need
      // this synchronization in case another concurrent job is checkpointing this RDD, so we get a
      // consistent view of both variables.
      RDDCheckpointData.synchronized {
        taskBinaryBytes = stage match {
          case stage: ShuffleMapStage =>
            JavaUtils.bufferToArray(
              closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
          case stage: ResultStage =>
            JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
        }

        partitions = stage.rdd.partitions
      }

      taskBinary = sc.broadcast(taskBinaryBytes)
    } catch {
      // In the case of a failure during serialization, abort the stage.
      case e: NotSerializableException =>
        abortStage(stage, "Task not serializable: " + e.toString, Some(e))
        runningStages -= stage

        // Abort execution
        return
      case NonFatal(e) =>
        abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }
    // Create the appropriate type of tasks depending on the stage type
    val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            // Create a ShuffleMapTask for each partition, along with its preferred locations
            val locs = taskIdToLocations(id)
            val part = partitions(id)
            stage.pendingPartitions += id
            // Every stage other than the finalStage creates ShuffleMapTasks
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
          }

        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = partitions(p)
            val locs = taskIdToLocations(id)
            // The finalStage is a ResultStage and therefore creates ResultTasks
            new ResultTask(stage.id, stage.latestInfo.attemptNumber,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
              stage.rdd.isBarrier())
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

    if (tasks.size > 0) {
      logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
        s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
      // Wrap the tasks in a TaskSet and submit it to the TaskScheduler
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
    } else {
      // Because we posted SparkListenerStageSubmitted earlier, we should mark
      // the stage as completed here in case there are no tasks to run
      markStageAsFinished(stage, None)

      stage match {
        case stage: ShuffleMapStage =>
          logDebug(s"Stage ${stage} is actually done; " +
            s"(available: ${stage.isAvailable}," +
            s"available outputs: ${stage.numAvailableOutputs}," +
            s"partitions: ${stage.numPartitions})")
            // With no tasks to run, the stage is already complete; mark its map-stage jobs as finished
          markMapStageJobsAsFinished(stage)
        case stage: ResultStage =>
          logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")
      }
      submitWaitingChildStages(stage)
    }
  }

2. The getPreferredLocsInternal method

 /**
   * Recursive implementation for getPreferredLocs.
   *
   * This method is thread-safe because it only accesses DAGScheduler state through thread-safe
   * methods (getCacheLocs()); please be careful when modifying this method, because any new
   * DAGScheduler state accessed by it may require additional synchronization.
   *
   * Compute the preferred location of the partition each task will process.
   * Starting from the stage's last RDD, look for an RDD whose partition has been cached or checkpointed;
   * the task's preferred location is then the location of that cached/checkpointed partition,
   * because running the task on that node avoids recomputing the earlier RDDs.
   */
  private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
    // If the partition has already been visited, no need to re-visit.
    // This avoids exponential path exploration.  SPARK-695
    // If this (rdd, partition) has already been visited, its TaskLocations were already obtained
    if (!visited.add((rdd, partition))) {
      // Nil has already been returned for previously visited partitions.
      return Nil
    }
    // If the partition is cached, return the cache locations
    // Check whether this partition of the current RDD is cached
    val cached = getCacheLocs(rdd)(partition)
    if (cached.nonEmpty) {
      return cached
    }
    // If the RDD has some placement preferences (as is the case for input RDDs), get those
    // Check the RDD's own placement preferences (e.g. checkpointed data or input splits)
    val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
    if (rddPrefs.nonEmpty) {
      return rddPrefs.map(TaskLocation(_))
    }

    // If the RDD has narrow dependencies, pick the first partition of the first narrow dependency
    // that has any placement preferences. Ideally we would choose based on transfer sizes,
    // but this will do for now.
    // Finally, recurse into the parent RDDs of narrow dependencies to see whether their partitions are cached or checkpointed
    rdd.dependencies.foreach {
      case n: NarrowDependency[_] =>
        for (inPart <- n.getParents(partition)) {
          val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
          if (locs != Nil) {
            return locs
          }
        }

      case _ =>
    }
    // If, from the last RDD back to the first RDD of this stage, no partition is cached or checkpointed,
    // then the task has no preferred locations and Nil is returned
    Nil
  }

However the preferred locations of an RDD's partitions are obtained, on the first computation the data source is always determined through the RDD's preferredLocations method. Different RDDs implement preferredLocations differently, but the data ultimately lives in one of three places: cached in executor memory, cached by HDFS, or simply on a host's disk, and each case has its own TaskLocation implementation:

/**
 * A location that includes both a host and an executor id on that host.
 */
// Data cached in an executor's memory
private [spark]
case class ExecutorCacheTaskLocation(override val host: String, executorId: String)
  extends TaskLocation {
  override def toString: String = s"${TaskLocation.executorLocationTag}${host}_$executorId"
}

/**
 * A location on a host.
 */
// Data on a host's disk (not an HDFS cache)
private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation {
  override def toString: String = host
}

/**
 * A location on a host that is cached by HDFS.
 */
// Data cached by HDFS on a host
private [spark] case class HDFSCacheTaskLocation(override val host: String) extends TaskLocation {
  override def toString: String = TaskLocation.inMemoryLocationTag + host
}

Reposted from blog.csdn.net/jiaojiao521765146514/article/details/85250049