Spark Source Code Analysis: How the DAGScheduler Works


[Figure: the call chain from an RDD action down to DAGScheduler.runJob]
An action triggers SparkContext's runJob method, which then calls through one overloaded runJob after another until it reaches DAGScheduler's runJob.
Explanation (for a typical word-count style job):
1. First a HadoopRDD is created, which is then turned into a MappedRDD.
2. A FlatMappedRDD is created (by flatMap).
3. A MappedRDD is created (by map).
4. The reduceByKey operator first produces a MapPartitionsRDD, then a ShuffledRDD, and then another MapPartitionsRDD:

  • The first MapPartitionsRDD does the local (map-side) aggregation: using the HashPartitioner, keys are partially combined on each node and written into multiple partitions, i.e. into separate local files.
  • The ShuffledRDD then gathers the records with the same key from those outputs into a single partition.
  • The final MapPartitionsRDD performs the aggregation for each key.

5. When the foreach action executes, SparkContext's runJob() is invoked, which hands the job to the DAGScheduler (see the sketch right after this list).
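For reference, here is a minimal word-count style program that produces exactly the lineage described above (my own sketch; the input path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

    val wordCounts = sc.textFile("hdfs:///tmp/input.txt")   // HadoopRDD, then the RDD mapped over its line values
      .flatMap(_.split(" "))                                // FlatMappedRDD
      .map(word => (word, 1))                               // MappedRDD
      .reduceByKey(_ + _)                                   // shuffle boundary: the RDD chain described in point 4

    wordCounts.foreach(println)                             // action: sc.runJob -> ... -> DAGScheduler.runJob

    sc.stop()
  }
}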

Summary:
The DAGScheduler's stage-splitting algorithm walks backwards from the RDD on which the final action was triggered. It first creates a stage for that last RDD; while walking backwards, whenever it finds an RDD reached through a wide (shuffle) dependency, it creates a new stage, and that RDD becomes the last RDD of the new stage. This repeats until every RDD in the lineage has been visited.
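Applied to the word-count sketch above, this should yield two stages: a ShuffleMapStage covering everything up to reduceByKey's map side, and a ResultStage from the shuffle onwards. One way to see the boundary yourself (a small sketch, assuming wordCounts is the RDD from the example above):

import org.apache.spark.ShuffleDependency

// toDebugString prints the lineage; each indentation level in its output is a
// shuffle boundary, i.e. a place where the DAGScheduler cuts a new stage.
println(wordCounts.toDebugString)

// The first dependency of the reduceByKey result is the ShuffleDependency
// that forces the new stage.
println(wordCounts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])   // expected: true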

Source code analysis:
Step 1: step into the runJob method
Source location: org/apache/spark/SparkContext.scala

  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }
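How does an action reach this method in the first place? In Spark 1.6.x, RDD.foreach (in org/apache/spark/rdd/RDD.scala) is essentially a thin wrapper around this overload; it looks roughly like this:

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

Other actions (count, collect, saveAsTextFile, ...) funnel into the runJob overloads in the same way.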

Step 2: step into the runJob call from Step 1
Source location: org/apache/spark/SparkContext.scala

  /**
   * Run a job on a given set of partitions of an RDD, but take a function of type
   * `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }

Step 3: step into the runJob call from Step 2
Source location: org/apache/spark/SparkContext.scala

  /**
   * Run a function on a given set of partitions in an RDD and return the results as an array.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int]): Array[U] = {
    val results = new Array[U](partitions.size)
    runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
    results
  }

Step 4: step into the runJob call from Step 3
Source location: org/apache/spark/SparkContext.scala

  /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }


    // Call DAGScheduler's runJob(); here rdd is the final RDD, the one produced by reduceByKey
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)

    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

Step 5: step into dagScheduler.runJob from Step 4
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime

    // Submit the job
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)

    // Wait for the job result and handle success or failure
    waiter.awaitResult() match {
      case JobSucceeded =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case JobFailed(exception: Exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
        throw exception
    }
  }

Step 6: step into the submitJob method from Step 5
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      // Return immediately if the job is running 0 tasks
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)

    // Post a JobSubmitted event to the event loop; DAGSchedulerEventProcessLoop's
    // doOnReceive is the core entry point where the DAGScheduler job begins
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }

Step 7: step into eventProcessLoop from Step 6
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
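eventProcessLoop is a DAGSchedulerEventProcessLoop, which extends Spark's internal EventLoop: post() merely enqueues the event, and a dedicated daemon thread dequeues events one at a time and dispatches them to onReceive/doOnReceive. A stripped-down sketch of that pattern (my own simplification, not Spark's actual EventLoop class):

import java.util.concurrent.LinkedBlockingDeque

// Simplified event loop: post() enqueues, a daemon thread drains the queue
// and dispatches each event to onReceive.
abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!Thread.currentThread().isInterrupted) {
        onReceive(eventQueue.take())   // blocks until an event is available
      }
    }
  }

  def start(): Unit = eventThread.start()
  def post(event: E): Unit = eventQueue.put(event)

  protected def onReceive(event: E): Unit
}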

Step 8: step into DAGSchedulerEventProcessLoop from Step 7
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    // Ultimately this calls handleJobSubmitted
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

    // ... the remaining event cases are omitted
  }

Step 9: step into handleJobSubmitted from Step 8
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  /**
   * The core entry point of DAGScheduler's job scheduling.
   *
   * The stage-splitting algorithm is very important in practice:
   *    you need a clear picture of how your Spark application is split into jobs,
   *    how each job is split into stages, and which of your code each stage covers.
   *    Only then, when some stage runs unusually slowly in production or keeps
   *    failing, can you go to the code behind that stage to troubleshoot it or
   *    tune its performance.
   *
   *  Stage-splitting algorithm in short:
   *    1. Walk backwards from the finalStage.
   *    2. Cut a new stage at every wide (shuffle) dependency.
   *    3. Recurse so that parent stages are submitted first.
   */
  private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {

    // Step 1: create the finalStage using the last RDD that triggered the job
    var finalStage: ResultStage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.

      // Create a new stage and register it in DAGScheduler's internal in-memory caches
      finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
    } catch {
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }

    // Step 2: create a job from finalStage; finalStage is the last stage of this job
    val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
    clearCacheLocs()
    logInfo("Got job %s (%s) with %d output partitions".format(
      job.jobId, callSite.shortForm, partitions.length))
    logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
    logInfo("Parents of final stage: " + finalStage.parents)
    logInfo("Missing parents: " + getMissingParentStages(finalStage))

    val jobSubmissionTime = clock.getTimeMillis()

    // Step 3: register the job in the in-memory caches
    jobIdToActiveJob(jobId) = job
    activeJobs += job
    finalStage.setActiveJob(job)
    val stageIds = jobIdToStageIds(jobId).toArray
    val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
    listenerBus.post(
      SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))

    // Step 4: submit finalStage via submitStage()
    // This call ends up submitting the first stage and putting all the
    // other stages into the waitingStages queue
    submitStage(finalStage)

    // Submit the waiting stages
    submitWaitingStages()
  }

Step 10: step into newResultStage from Step 9
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  /**
   * Create a ResultStage associated with the provided jobId.
   */
  private def newResultStage(
      rdd: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      jobId: Int,
      callSite: CallSite): ResultStage = {
    val (parentStages: List[Stage], id: Int) = getParentStagesAndId(rdd, jobId)
    // Create the ResultStage
    val stage = new ResultStage(id, rdd, func, partitions, parentStages, jobId, callSite)
    stageIdToStage(id) = stage
    updateJobIdStageIdMaps(jobId, stage)
    stage
  }

Step 11: step into the ResultStage class from Step 10
Source location: org/apache/spark/scheduler/ResultStage.scala

private[spark] class ResultStage(
    id: Int,
    rdd: RDD[_],
    val func: (TaskContext, Iterator[_]) => _,
    val partitions: Array[Int],
    parents: List[Stage],
    firstJobId: Int,
    callSite: CallSite)
  extends Stage(id, rdd, partitions.length, parents, firstJobId, callSite) { 
  ......
}

Step 12: step into submitStage from Step 9
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  /** Submits stage, but first recursively submits any missing parents. */
  /**
   * Submits a stage.
   * This is the entry point of the stage-splitting algorithm, but the algorithm
   * itself is really implemented jointly by submitStage() and getMissingParentStages().
   */
  private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {

        // Call getMissingParentStages to get this stage's missing parent stages
        val missing = getMissingParentStages(stage).sortBy(_.id)

        logDebug("missing: " + missing)

        /**
         * In short, this recurses repeatedly until it reaches the earliest stage,
         * the one that has no parent stage. That stage (stage 0) is submitted first,
         * while all the other stages sit in waitingStages in the meantime.
         */

        // If this stage has no missing parent stages
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          // Submit this stage's tasks (the tasks are created inside submitMissingTasks)
          submitMissingTasks(stage, jobId.get)
        } else {
          for (parent <- missing) {
            // Recursively call submitStage() to submit the parent stage first.
            // This recursion is what drives the stage-splitting algorithm.
            submitStage(parent)
          }

          // and put the current stage into the waitingStages queue, to run later
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }
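To make the recursion concrete, here is a toy, self-contained illustration of the same parents-first pattern over a made-up three-stage chain (hypothetical ToyStage type, not Spark's Stage; paste into a Scala REPL):

import scala.collection.mutable

// Hypothetical stand-in for a stage and its missing parent stages.
case class ToyStage(id: Int, parents: List[ToyStage])

val waiting = mutable.LinkedHashSet[ToyStage]()

def submit(stage: ToyStage): Unit = {
  if (stage.parents.isEmpty) {
    println(s"running stage ${stage.id}")   // plays the role of submitMissingTasks(stage)
  } else {
    stage.parents.foreach(submit)           // recurse: parent stages are submitted first
    waiting += stage                        // park this stage until its parents finish
  }
}

// stage0 <- stage1 <- stage2 (final): only stage0 runs now; 1 and 2 wait.
val stage0 = ToyStage(0, Nil)
val stage1 = ToyStage(1, List(stage0))
val stage2 = ToyStage(2, List(stage1))
submit(stage2)
println(s"waiting stages: ${waiting.map(_.id).mkString(", ")}")   // 1, 2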

Step 13: step into getMissingParentStages from Step 12
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  /**
   * Get the (missing) parent stages of a given stage.
   *
   * In short: for a stage, as long as the dependencies reached from its last rdd
   * are all narrow dependencies, no new stage is created; but as soon as a wide
   * (shuffle) dependency on some rdd is found, a new stage is created from that
   * rdd and returned.
   */
  private def getMissingParentStages(stage: Stage): List[Stage] = {
    val missing = new HashSet[Stage]
    val visited = new HashSet[RDD[_]]
    // We are manually maintaining a stack here to prevent StackOverflowError
    // caused by recursively visiting
    val waitingForVisit = new Stack[RDD[_]]


    def visit(rdd: RDD[_]) {
      if (!visited(rdd)) {
        visited += rdd
        val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
        if (rddHasUncachedPartitions) {
          // Iterate over the rdd's dependencies.
          // As in the earlier diagram, every shuffle operator (e.g. groupByKey,
          // reduceByKey, countByKey) corresponds to three RDDs underneath:
          // MapPartitionsRDD, ShuffledRDD, MapPartitionsRDD
          for (dep <- rdd.dependencies) {
            dep match {
              // Wide (shuffle) dependency
              case shufDep: ShuffleDependency[_, _, _] =>
                // Use the rdd on the wide-dependency side to get or create a
                // ShuffleMapStage. The final stage is a ResultStage, not a
                // ShuffleMapStage, but every stage before the finalStage is a
                // ShuffleMapStage.
                val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
                if (!mapStage.isAvailable) {
                  missing += mapStage
                }

              // Narrow dependency: push the parent rdd onto the stack
              case narrowDep: NarrowDependency[_] =>
                waitingForVisit.push(narrowDep.rdd)
            }
          }
        }
      }
    }

    // First, push the stage's last rdd onto the stack
    waitingForVisit.push(stage.rdd)

    // Then loop until the stack is empty,
    while (waitingForVisit.nonEmpty) {
      // calling the locally defined visit() method on each popped rdd
      visit(waitingForVisit.pop())
    }
    missing.toList
  }

Step 14: step into getShuffleMapStage from Step 13
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  /**
   * Get or create a shuffle map stage for the given shuffle dependency's map side.
   */
  private def getShuffleMapStage(
      shuffleDep: ShuffleDependency[_, _, _],
      firstJobId: Int): ShuffleMapStage = {
    shuffleToMapStage.get(shuffleDep.shuffleId) match {
      case Some(stage) => stage
      case None =>
        // We are going to register ancestor shuffle dependencies
        getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
          shuffleToMapStage(dep.shuffleId) = newOrUsedShuffleStage(dep, firstJobId)
        }

        // Then register current shuffleDep
        val stage = newOrUsedShuffleStage(shuffleDep, firstJobId)
        shuffleToMapStage(shuffleDep.shuffleId) = stage
        stage
    }
  }

Step 15: step into newOrUsedShuffleStage from Step 14
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  private def newOrUsedShuffleStage(
      shuffleDep: ShuffleDependency[_, _, _],
      firstJobId: Int): ShuffleMapStage = {
    val rdd = shuffleDep.rdd
    val numTasks = rdd.partitions.length
    val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite)
    if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
      val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
      val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
      (0 until locs.length).foreach { i =>
        if (locs(i) ne null) {
          // locs(i) will be null if missing
          stage.addOutputLoc(i, locs(i))
        }
      }
    } else {
      // Kind of ugly: need to register RDDs with the cache and map output tracker here
      // since we can't do it in the RDD constructor because # of partitions is unknown
      logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
      mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
    }
    stage
  }

Step 16: step into newShuffleMapStage from Step 15
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  private def newShuffleMapStage(
      rdd: RDD[_],
      numTasks: Int,
      // The shuffle (wide) dependency for which this stage is being created
      shuffleDep: ShuffleDependency[_, _, _],
      firstJobId: Int,
      callSite: CallSite): ShuffleMapStage = {
    val (parentStages: List[Stage], id: Int) = getParentStagesAndId(rdd, firstJobId)
    val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, parentStages,
      firstJobId, callSite, shuffleDep)

    stageIdToStage(id) = stage
    updateJobIdStageIdMaps(firstJobId, stage)
    stage
  }

Step 17: step into submitMissingTasks from Step 12
Source location: org/apache/spark/scheduler/DAGScheduler.scala

/** Called when stage's parents are available and we can now do its task. */
  /**
   * Submit a stage by creating a batch of tasks for it; the number of tasks
   * equals the number of partitions to compute.
   */
  private def submitMissingTasks(stage: Stage, jobId: Int) {
    logDebug("submitMissingTasks(" + stage + ")")
    // Get our pending tasks and remember them in our pendingTasks entry
    stage.pendingPartitions.clear()

    // First figure out the indexes of partition ids to compute.
    // Work out which partitions need to be computed (this determines how many tasks to create)
    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

    // Create internal accumulators if the stage has no accumulators initialized.
    // Reset internal accumulators only if this stage is not partially submitted
    // Otherwise, we may override existing accumulator values from some tasks
    if (stage.internalAccumulators.isEmpty || stage.numPartitions == partitionsToCompute.size) {
      stage.resetInternalAccumulators()
    }

    // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
    // with this Stage
    val properties = jobIdToActiveJob(jobId).properties

    // Add the stage to the runningStages set
    runningStages += stage
    // SparkListenerStageSubmitted should be posted before testing whether tasks are
    // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
    // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
    // event.
    stage match {
      case s: ShuffleMapStage =>
        outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
      case s: ResultStage =>
        outputCommitCoordinator.stageStart(
          stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
    }
    val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
        case s: ResultStage =>
          val job = s.activeJob.get
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    } catch {
      case NonFatal(e) =>
        stage.makeNewStageAttempt(partitionsToCompute.size)
        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
        abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
        runningStages -= stage
        return
    }

    stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
    listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

    // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
    // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
    // the serialized copy of the RDD and for each task we will deserialize it, which means each
    // task gets a different copy of the RDD. This provides stronger isolation between tasks that
    // might modify state of objects referenced in their closures. This is necessary in Hadoop
    // where the JobConf/Configuration object is not thread-safe.
    var taskBinary: Broadcast[Array[Byte]] = null
    try {
      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
      // For ResultTask, serialize and broadcast (rdd, func).
      val taskBinaryBytes: Array[Byte] = stage match {
        case stage: ShuffleMapStage =>
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
        case stage: ResultStage =>
          closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
      }

      taskBinary = sc.broadcast(taskBinaryBytes)
    } catch {
      // In the case of a failure during serialization, abort the stage.
      case e: NotSerializableException =>
        abortStage(stage, "Task not serializable: " + e.toString, Some(e))
        runningStages -= stage

        // Abort execution
        return
      case NonFatal(e) =>
        abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}", Some(e))
        runningStages -= stage
        return
    }

    // Create the tasks for the stage, one per partition to compute.
    // The key part here is the best-location (locality) calculation for each task.
    val tasks: Seq[Task[_]] = try {
      stage match {
        case stage: ShuffleMapStage =>
          partitionsToCompute.map { id =>
            // Create one task per partition,
            // and compute the best location for each task
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            // Every stage other than the finalStage is a ShuffleMapStage,
            // so a ShuffleMapTask is created for it
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, stage.internalAccumulators)
          }

          // If it is not a ShuffleMapStage, it is the finalStage (a ResultStage),
          // for which ResultTasks are created
        case stage: ResultStage =>
          val job = stage.activeJob.get
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, stage.internalAccumulators)
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
        runningStages -= stage
        return
    }

    if (tasks.size > 0) {
      logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
      stage.pendingPartitions ++= tasks.map(_.partitionId)
      logDebug("New pending partitions: " + stage.pendingPartitions)

      // Finally, wrap the stage's tasks in a TaskSet and submit it via TaskScheduler.submitTasks
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
    } else {
      // Because we posted SparkListenerStageSubmitted earlier, we should mark
      // the stage as completed here in case there are no tasks to run
      markStageAsFinished(stage, None)

      val debugString = stage match {
        case stage: ShuffleMapStage =>
          s"Stage ${stage} is actually done; " +
            s"(available: ${stage.isAvailable}," +
            s"available outputs: ${stage.numAvailableOutputs}," +
            s"partitions: ${stage.numPartitions})"
        case stage : ResultStage =>
          s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
      }
      logDebug(debugString)
    }
  }
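Since one task is created per partition to compute, a stage's parallelism follows from the partitioning of its RDDs. A small script-style sketch (hypothetical local example) of how that plays out around a shuffle:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("task-count-demo").setMaster("local[2]"))

// 4 input partitions -> the shuffle-map stage gets 4 ShuffleMapTasks
// (numTasks = rdd.partitions.length in newOrUsedShuffleStage above).
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), numSlices = 4)

// numPartitions = 8 -> the shuffled RDD has 8 partitions, so the result stage
// of the next action gets 8 ResultTasks.
val counts = pairs.reduceByKey(_ + _, 8)
counts.collect().foreach(println)

sc.stop()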

Step 18: step into getPreferredLocs from Step 17
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  private[spark]
  def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = {
    getPreferredLocsInternal(rdd, partition, new HashSet)
  }

Step 19: step into getPreferredLocsInternal from Step 18
Source location: org/apache/spark/scheduler/DAGScheduler.scala

  /**
   * Compute the best locations for the partition each task will process.
   *
   * In short: starting from the stage's last rdd, look for an rdd whose partition
   * is cached or checkpointed. The task's best locations are then the locations of
   * that cached partition or of the checkpoint data, because running the task on
   * that node avoids recomputing all the earlier rdds.
   */
  private def getPreferredLocsInternal(
      rdd: RDD[_],
      partition: Int,
      visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
    // If the partition has already been visited, no need to re-visit.
    // This avoids exponential path exploration.  SPARK-695
    if (!visited.add((rdd, partition))) {
      // Nil has already been returned for previously visited partitions.
      return Nil
    }

    // If the partition is cached, return the cache locations
    // Check whether this rdd's partition is cached
    val cached = getCacheLocs(rdd)(partition)
    if (cached.nonEmpty) {
      return cached
    }

    // If the RDD has some placement preferences (as is the case for input RDDs), get those
    // Check the rdd's own preferred locations (this also covers checkpointed RDDs and input RDDs)
    val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
    if (rddPrefs.nonEmpty) {
      return rddPrefs.map(TaskLocation(_))
    }

    // If the RDD has narrow dependencies, pick the first partition of the first narrow dependency
    // that has any placement preferences. Ideally we would choose based on transfer sizes,
    // but this will do for now.
    // Finally, recurse into the rdd's narrow-dependency parents to see whether
    // the corresponding parent partitions are cached or checkpointed
    rdd.dependencies.foreach {
      case n: NarrowDependency[_] =>
        for (inPart <- n.getParents(partition)) {
          val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
          if (locs != Nil) {
            return locs
          }
        }

      case _ =>
    }

    // If, from the stage's last rdd all the way back to the first rdd, no partition
    // is cached or checkpointed, the task's preferred locations (preferredLocs) are Nil
    Nil
  }
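As the comments above describe, caching (or checkpointing) pins the preferred location of the task that computes a partition to the executor(s) holding that cached block. A small sketch (assuming sc is an existing SparkContext; the path is hypothetical):

// Persist the RDD so its partitions are materialized as blocks on executors.
val data = sc.textFile("hdfs:///tmp/input.txt").cache()

data.count()   // 1st action: computes the partitions and caches the blocks
data.count()   // 2nd action: getPreferredLocsInternal finds the cached block
               // locations, so the tasks are scheduled next to those blocks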

Step 20: step into preferredLocations from Step 19
Source location: org/apache/spark/rdd/RDD.scala

  /**
   * Get the preferred locations of a partition, taking into account whether the
   * RDD is checkpointed.
   */
  final def preferredLocations(split: Partition): Seq[String] = {
    checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
      getPreferredLocations(split)
    }
  }
