Spark Source Code Analysis: A Deep Dive into DAGScheduler

The previous posts covered the startup and initialization of SparkContext: launching the Driver, registering with the Master, the Master starting Workers, the Workers launching Executors, and the Workers registering back with the Master. With all of that preparation done, the Application can finally execute. A job is first submitted to the DAGScheduler, which is responsible for splitting the job into stages and computing the preferred locations of the tasks, among other work. This post studies the DAGScheduler source code from two angles:

  1. The stage-splitting algorithm
  2. The algorithm for computing the preferred location of the partition each task processes


Let's use the word count program as the running example:

val rdd=sc.textFile("test.txt")

When this line runs, textFile internally calls hadoopFile to create a HadoopRDD, and then calls map on it to create a MapPartitionsRDD. Both HadoopRDD and MapPartitionsRDD extend RDD. The source looks like this:

 /**
    * hadoopFile() is called first, creating a HadoopRDD whose elements are (key, value) pairs:
    * the key is the offset of each line in the HDFS text file, and the value is the text of the line.
    * map() is then called on the HadoopRDD to drop the key and keep only the value,
    * producing a MapPartitionsRDD whose elements are the text lines.
    * @param path
    * @param minPartitions
    * @return
    */
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      //keep the value, drop the key
      minPartitions).map(pair => pair._2.toString).setName(path)
  }
   /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }
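
If you want to see this two-level lineage for yourself, a quick check in spark-shell works (a small sketch; the exact output depends on your file and Spark version, so it is not shown here):

// Inspect the lineage built by textFile
val rdd = sc.textFile("test.txt")
println(rdd.getClass.getSimpleName)  // MapPartitionsRDD -- the RDD returned by .map(...)
println(rdd.toDebugString)           // prints the lineage; the innermost entry is the HadoopRDD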

With the RDD created by textFile in hand, the next statement is:

val linesRDD=rdd.flatMap(_.split(" "))

Inside flatMap another MapPartitionsRDD is created; per partition, the iterator's flatMap walks over the elements of every line. The source is shown below (the second snippet is Scala's own Iterator.flatMap, which the per-partition iterator uses):

  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

  def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = new AbstractIterator[B] {
    private var cur: Iterator[B] = empty
    def hasNext: Boolean =
      cur.hasNext || self.hasNext && { cur = f(self.next).toIterator; hasNext }
    def next(): B = (if (hasNext) cur else empty).next()
  }
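
Note that nothing is computed at this point: flatMap only wraps the parent RDD in a new MapPartitionsRDD connected by a narrow (one-to-one) dependency, which is exactly what the stage-splitting algorithm relies on later. A quick way to confirm the dependency type (a spark-shell sketch, not part of the original program):

val linesRDD = rdd.flatMap(_.split(" "))
// dependencies is public API on RDD; for flatMap the single dependency is one-to-one (narrow)
println(linesRDD.dependencies.head.getClass.getSimpleName)  // OneToOneDependency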

Execution then reaches the following statement, which calls map:

val words=linesRDD.map(x=>(x,1))

Internally it again creates a MapPartitionsRDD:

 def map[U: ClassTag](f: T => U): RDD[U] = {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

The next step in the example is reduceByKey:

val wordCount=words.reduceByKey(_+_)

You may be surprised to find, however, that reduceByKey is not defined in the RDD class at all. It actually lives in PairRDDFunctions: when reduceByKey is called on an RDD of pairs, an implicit conversion first wraps the RDD in a PairRDDFunctions, and the method is then invoked on that wrapper. The implicit conversion is defined as follows:

 //RDD itself has no reduceByKey operator, so calling it triggers an implicit conversion
  //that wraps the RDD in PairRDDFunctions, whose reduceByKey method is then invoked
  implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }
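
In other words, the compiler silently rewrites the call into the wrapped form. The two calls below are equivalent (a small sketch to make the conversion visible; it assumes the words RDD from the example above):

import org.apache.spark.rdd.PairRDDFunctions

val wordCount1 = words.reduceByKey(_ + _)                        // implicit conversion kicks in here
val wordCount2 = new PairRDDFunctions(words).reduceByKey(_ + _)  // what the compiler effectively generates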

The reduceByKey source in the PairRDDFunctions class:

def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
    reduceByKey(defaultPartitioner(self), func)
  }
/**
    * Determines the partitioning of the result: before the job is actually submitted,
    * we must know how many partitions the map output has.
    * If no partitioner is specified, the default HashPartitioner is used.
    * @param rdd
    * @param others
    * @return
    */
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
    for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.size)
    }
  }
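
So for our word count, with no spark.default.parallelism set and no parent partitioner, reduceByKey ends up with a HashPartitioner whose partition count equals that of the parent RDD. Other overloads let you control this directly (a sketch; the number 8 is just for illustration):

import org.apache.spark.HashPartitioner

val wc1 = words.reduceByKey(_ + _)                          // defaultPartitioner: HashPartitioner(words.partitions.size)
val wc2 = words.reduceByKey(_ + _, 8)                       // explicit number of reduce partitions
val wc3 = words.reduceByKey(new HashPartitioner(8), _ + _)  // explicit partitioner; defaultPartitioner is bypassed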

After reduceByKey, foreach is called to print the results:

wordCount.foreach(println)
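
At this point the whole example has been turned into a chain of RDDs, and only the final foreach actually triggers execution. To recap (a minimal sketch; the class names are those used in Spark 1.x, and reduceByKey introduces the only shuffle dependency, which is where the DAG will later be split into two stages):

val rdd       = sc.textFile("test.txt")        // HadoopRDD wrapped in a MapPartitionsRDD (offset key dropped)
val linesRDD  = rdd.flatMap(_.split(" "))      // MapPartitionsRDD, narrow dependency
val words     = linesRDD.map(x => (x, 1))      // MapPartitionsRDD, narrow dependency
val wordCount = words.reduceByKey(_ + _)       // ShuffledRDD, shuffle (wide) dependency
wordCount.foreach(println)                     // action: hands the job to SparkContext.runJob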

Calling foreach triggers an action, which means the job really starts to run; foreach delegates to runJob:

/**
    * Calling an action triggers the job through the underlying DAGScheduler
    * @param f
    */
  def foreach(f: T => Unit) {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }
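
foreach is not special in this respect; every action funnels into sc.runJob. For comparison, count() in the same Spark 1.x codebase is essentially a single runJob call whose per-partition function just sizes the iterator:

  /** Return the number of elements in the RDD. */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum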

The sc.runJob call made by foreach passes through several overloaded runJob methods until it reaches the one below, which hands the job to the DAGScheduler:

def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit) {
    if (stopped) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    //call runJob on the DAGScheduler that was created when SparkContext was initialized
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
      resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

Inside DAGScheduler.runJob, submitJob is called:

def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    //call submitJob
    val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
    waiter.awaitResult() match {
      case JobSucceeded => {
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      }
      case JobFailed(exception: Exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        throw exception
    }
  }

  /**
   * Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
   * can be used to block until the job finishes executing or can be used to cancel the job.
   */
  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties))
    waiter
  }

Here we encounter an important component, eventProcessLoop. Its event loop pattern-matches on the posted event; for a JobSubmitted event it calls dagScheduler.handleJobSubmitted:

private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
  extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging {

  /**
   * The main event loop of the DAG scheduler.
   */
  override def onReceive(event: DAGSchedulerEvent): Unit = event match {
    case JobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite, listener, properties) =>
     //call dagScheduler.handleJobSubmitted
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite,
        listener, properties)
...
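
DAGSchedulerEventProcessLoop extends EventLoop, which keeps a blocking queue of events and a dedicated thread that takes events off the queue and hands them to onReceive. The sketch below shows the general post/receive pattern (an illustration only, not Spark's actual EventLoop class):

import java.util.concurrent.LinkedBlockingDeque

// Minimal illustration of the post/receive pattern behind EventLoop
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingDeque[E]()
  private val thread = new Thread(name) {
    override def run(): Unit = {
      while (true) {
        onReceive(queue.take())  // block until an event arrives, then dispatch it
      }
    }
  }
  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)  // submitJob posts JobSubmitted through a call like this
  protected def onReceive(event: E): Unit      // DAGSchedulerEventProcessLoop pattern-matches the event here
}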

That handler, DAGScheduler.handleJobSubmitted, is the core entry point of job scheduling, and it brings us to the first focus of this post: the stage-splitting algorithm. Before digging into the source, here is the algorithm in outline: a finalStage is created from the job's last RDD, and the RDD lineage is then walked backwards recursively. RDDs connected by narrow dependencies are pushed onto a stack-based waiting structure and remain inside the current stage, whereas each wide (shuffle) dependency produces a new parent stage that is recorded in a cache. The recursion continues until the earliest stage, which has no parent stages and is therefore submitted first. The source of handleJobSubmitted is:

/**
    * The core entry point of DAGScheduler's scheduling
    */
  private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      allowLocal: Boolean,
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
    //create a finalStage from the last RDD of the job that was triggered
    var finalStage: Stage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.
      finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
    } catch {
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }
    if (finalStage != null) {
      //create a Job from finalStage; in other words, finalStage is the last stage of this job
      val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
      clearCacheLocs()
      logInfo("Got job %s (%s) with %d output partitions (allowLocal=%s)".format(
        job.jobId, callSite.shortForm, partitions.length, allowLocal))
      logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
      logInfo("Parents of final stage: " + finalStage.parents)
      logInfo("Missing parents: " + getMissingParentStages(finalStage))
      val shouldRunLocally =
        localExecutionEnabled && allowLocal && finalStage.parents.isEmpty && partitions.length == 1
      val jobSubmissionTime = clock.getTimeMillis()
      if (shouldRunLocally) {
        // Compute very short actions like first() or take() with no parent stages locally.
        listenerBus.post(
          SparkListenerJobStart(job.jobId, jobSubmissionTime, Seq.empty, properties))
        runLocally(job)
      } else {
        //register the Job in the in-memory bookkeeping caches
        jobIdToActiveJob(jobId) = job
        activeJobs += job
        finalStage.resultOfJob = Some(job)
        val stageIds = jobIdToStageIds(jobId).toArray
        val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
        listenerBus.post(
          SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
        //submit the Job by submitting its final stage
        submitStage(finalStage)
      }
    }
    //submit any stages in the waiting queue
    submitWaitingStages()
  }

Let's step into submitStage, which is the heart of the whole algorithm:

 /**
    * Recursively submit stages until we reach a stage that has no parent stage
    */
  /** Submits stage, but first recursively submits any missing parents. */
  private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        //compute the missing parent stages of this stage, creating new stages at wide-dependency boundaries
        val missing = getMissingParentStages(stage).sortBy(_.id)
        logDebug("missing: " + missing)
        //recurse until the earliest stage, which has no parent stages; that first stage is submitted here, while all the others wait in waitingStages
        if (missing == Nil) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          //submit the stage's tasks
          submitMissingTasks(stage, jobId.get)
        } else {
          for (parent <- missing) {
          //recursive call
            submitStage(parent)
          }
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id)
    }
  }
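
The wide/narrow decision itself is made in getMissingParentStages, which submitStage calls above. Its logic is the stack-based traversal described earlier: walk the RDD lineage backwards, stay inside the current stage across narrow dependencies, and get or create a parent ShuffleMapStage at every shuffle dependency whose output is not yet available. The sketch below paraphrases the Spark 1.x method with added comments (some details, such as a check that skips fully cached RDDs, are omitted, and exact code differs slightly between versions):

private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]          // parent stages that still need to be computed
  val visited = new HashSet[RDD[_]]         // RDDs already examined
  val waitingForVisit = new Stack[RDD[_]]   // the stack-based "waiting queue" of RDDs to walk
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      for (dep <- rdd.dependencies) {
        dep match {
          // wide (shuffle) dependency: this is a stage boundary, so get/create the parent
          // ShuffleMapStage and record it if its output is not available yet
          case shufDep: ShuffleDependency[_, _, _] =>
            val mapStage = getShuffleMapStage(shufDep, stage.jobId)
            if (!mapStage.isAvailable) {
              missing += mapStage
            }
          // narrow dependency: stays inside the current stage, keep walking backwards
          case narrowDep: NarrowDependency[_] =>
            waitingForVisit.push(narrowDep.rdd)
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)           // start from the last RDD of the stage
  while (!waitingForVisit.isEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}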

When the recursion reaches a stage with no missing parents, submitMissingTasks is called to create one task per partition of that stage and submit them. While submitting the tasks it computes the preferred location of the partition each task will process. The algorithm starts from the last RDD of the stage and looks for a partition that has been cached, or for an RDD that reports its own preferred locations (as checkpointed RDDs and input RDDs do); that location becomes the task's preferred location, because the task can then run there without recomputing the upstream RDDs. The source is:

 /**
    * Submit the stage: create a batch of tasks for it, one task per partition to compute
    * @param stage
    * @param jobId
    */
  private def submitMissingTasks(stage: Stage, jobId: Int) {
    logDebug("submitMissingTasks(" + stage + ")")
    // Get our pending tasks and remember them in our pendingTasks entry
    stage.pendingTasks.clear()
    // First figure out the indexes of partition ids to compute.
    val partitionsToCompute: Seq[Int] = {
      if (stage.isShuffleMap) {
        (0 until stage.numPartitions).filter(id => stage.outputLocs(id) == Nil)
      } else {
        val job = stage.resultOfJob.get
        (0 until job.numPartitions).filter(id => !job.finished(id))
      }
    }
    val properties = if (jobIdToActiveJob.contains(jobId)) {
      jobIdToActiveJob(stage.jobId).properties
    } else {
      // this stage will be assigned to "default" pool
      null
    }
    runningStages += stage
    stage.latestInfo = StageInfo.fromStage(stage, Some(partitionsToCompute.size))
    outputCommitCoordinator.stageStart(stage.id)
    listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
    var taskBinary: Broadcast[Array[Byte]] = null
    try {
      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
      // For ResultTask, serialize and broadcast (rdd, func).
      val taskBinaryBytes: Array[Byte] =
        if (stage.isShuffleMap) {
          closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
        } else {
          closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
        }
      taskBinary = sc.broadcast(taskBinaryBytes)
    } catch {
      // In the case of a failure during serialization, abort the stage.
      case e: NotSerializableException =>
        abortStage(stage, "Task not serializable: " + e.toString)
        runningStages -= stage
        return
      case NonFatal(e) =>
        abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
        runningStages -= stage
        return
    }
    //create the tasks for this stage
    val tasks: Seq[Task[_]] = try {
      if (stage.isShuffleMap) {
        partitionsToCompute.map { id =>
          //create one task for each partition
          //and compute the preferred locations for each task
          val locs = getPreferredLocs(stage.rdd, id)
          val part = stage.rdd.partitions(id)
          //every stage except the finalStage has isShuffleMap == true,
          //so a ShuffleMapTask is created for each of its partitions
          new ShuffleMapTask(stage.id, taskBinary, part, locs)
        }
      } else {
        //if isShuffleMap is false, this is the finalStage,
        //and ResultTasks are created for it
        val job = stage.resultOfJob.get
        partitionsToCompute.map { id =>
          val p: Int = job.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = getPreferredLocs(stage.rdd, p)
          new ResultTask(stage.id, taskBinary, part, locs, id)
        }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}")
        runningStages -= stage
        return
    }
    if (tasks.size > 0) {
      logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
      stage.pendingTasks ++= tasks
      logDebug("New pending tasks: " + stage.pendingTasks)
       //wrap the tasks in a TaskSet and submit them via taskScheduler.submitTasks
      taskScheduler.submitTasks(
        new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
    } else {
      markStageAsFinished(stage, None)
      logDebug("Stage " + stage + " is actually done; %b %d %d".format(
        stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
    }
  }

The following method is the core of the preferred-location algorithm:

/**
    * Computes the preferred locations of the partition a given task will process.
    * Starting from the stage's last RDD, it looks for a partition that has been cached,
    * or for an RDD that reports its own preferred locations (checkpointed RDDs and input
    * RDDs such as HadoopRDD do this). Running the task at such a location means the
    * upstream RDDs do not have to be recomputed.
    */
  private def getPreferredLocsInternal(
      rdd: RDD[_],
      partition: Int,
      visited: HashSet[(RDD[_],Int)])
    : Seq[TaskLocation] =
  {
    // If the partition has already been visited, no need to re-visit.
    // This avoids exponential path exploration.  SPARK-695
    if (!visited.add((rdd,partition))) {
      // Nil has already been returned for previously visited partitions.
      return Nil
    }
    // If the partition is cached, return the cache locations
    //check whether this partition of the RDD has been cached
    val cached = getCacheLocs(rdd)(partition)
    if (!cached.isEmpty) {
      return cached
    }
    // If the RDD has some placement preferences (as is the case for input RDDs), get those
    //check whether the RDD reports preferred locations for this partition (checkpointed RDDs and input RDDs such as HadoopRDD do)
    val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
    if (!rddPrefs.isEmpty) {
      return rddPrefs.map(TaskLocation(_))
    }
    // If the RDD has narrow dependencies, pick the first partition of the first narrow dep
    // that has any placement preferences. Ideally we would choose based on transfer sizes,
    // but this will do for now.
    rdd.dependencies.foreach {
      case n: NarrowDependency[_] =>
        //recurse into the parent RDD's partitions to see whether any of them are cached or have preferred locations
        for (inPart <- n.getParents(partition)) {
          val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
          if (locs != Nil) {
            return locs
          }
        }
      case _ =>
    }
    //if no partition from the stage's last RDD back to its first is cached or has preferred locations, the preferred location is Nil
    Nil
  }
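
To make the cache branch concrete: if the word-count program were changed to persist one of the map-side RDDs (a hypothetical tweak, not part of the original example), then after the first job fills the cache, later jobs over the same RDD would get the cached block locations back from getCacheLocs and schedule their tasks on the executors holding those blocks, instead of falling back to the HDFS block hosts of the underlying HadoopRDD:

// Hypothetical variation of the example to exercise the cache branch above
val words = linesRDD.map(x => (x, 1)).cache()    // persist the pair RDD (MEMORY_ONLY)

words.reduceByKey(_ + _).foreach(println)        // job 1: computes `words` and fills the cache
words.reduceByKey(_ max _).foreach(println)      // job 2: its shuffle-map tasks prefer the executors
                                                 //        that hold the cached partitions of `words`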

Once the preferred locations are chosen, the tasks are wrapped in a TaskSet and handed to TaskScheduler.submitTasks. The TaskScheduler is where the task-assignment algorithm lives, deciding which executors the tasks run on; we will dig into it in a later post. Today we took a close look at the stage-splitting algorithm and the preferred-location algorithm. Questions and corrections are welcome in the comments!



Reposted from blog.csdn.net/qq_37142346/article/details/81317017