Spark-Core Source Code Study Notes

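This series records my review of the Spark source code. The goal is to untangle how Spark distributes and runs a program, trace the key parts of the source, and understand why things work the way they do. Peripheral code is described in prose only, without drilling down, so that it does not obscure the main line.
This article supplements the Worker registration process. As mentioned in Spark-Core Source Code Study Notes 1, once a Worker has finished registering with the Master, the Master calls the schedule method to allocate resources. Below we trace the scheduling flow of schedule in detail.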

Source of the schedule method: how Drivers fit into the gaps

/* Schedule the currently available resources among waiting apps. This method 
	will be called every time a new app joins or resource availability changes.*/
  private def schedule(): Unit = {...}

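The comment makes it clear that this method is called whenever resource availability changes or a new application is submitted, and that it hands the currently available resources to the waiting applications.
Now let's look at the body: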

 private def schedule(): Unit = {
    // Drivers take strict precedence over executors
    // Among the registered Workers, keep those in the ALIVE state and shuffle them randomly
    val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
    val numWorkersAlive = shuffledAliveWorkers.size
    var curPos = 0
    for (driver <- waitingDrivers.toList) {
      var launched = false // whether the current driver has been launched
      var numWorkersVisited = 0 // number of workers already tried for the current driver
      while (numWorkersVisited < numWorkersAlive && !launched) {
        val worker = shuffledAliveWorkers(curPos)
        numWorkersVisited += 1
        if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
          // Launch the Driver, explained below
          launchDriver(worker, driver)
          waitingDrivers -= driver
          launched = true
        }
        curPos = (curPos + 1) % numWorkersAlive
      }
    }
    // Launch the Executors, explained below
    startExecutorsOnWorkers()
  }

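Put simply, the outer for loop iterates over the waiting Drivers and the inner while loop walks the alive Workers, checking whether a Worker's free cores and memory satisfy the Driver's requirements. If they do, launchDriver(worker, driver) starts the Driver on that Worker; if not, the loop moves on to the next Worker, fitting Drivers into whatever gaps are available. Because curPos is not reset between Drivers, each Driver begins its search where the previous one stopped, so Drivers are spread round-robin over the shuffled Workers.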
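The placement loop can be tried in isolation with a small model. The sketch below is not Spark code: Need and Slot are hypothetical stand-ins for DriverInfo and WorkerInfo, the random shuffle is left out so the result is deterministic, and deducting resources inline stands in for what worker.addDriver does during launchDriver.

```scala
object DriverPlacementSketch {
  // Hypothetical stand-in for a waiting driver's resource request.
  final case class Need(id: String, cores: Int, memMb: Int)
  // Hypothetical stand-in for an alive worker's free resources.
  final class Slot(val id: String, var coresFree: Int, var memFreeMb: Int)

  /** Returns driverId -> workerId for every driver that found a worker. */
  def place(drivers: Seq[Need], workers: IndexedSeq[Slot]): Map[String, String] = {
    var placed = Map.empty[String, String]
    val numWorkers = workers.size
    var curPos = 0 // shared across drivers, so the search continues round-robin
    for (d <- drivers) {
      var launched = false
      var visited = 0
      while (visited < numWorkers && !launched) {
        val w = workers(curPos)
        visited += 1
        if (w.memFreeMb >= d.memMb && w.coresFree >= d.cores) {
          // Reserve the resources on the chosen worker.
          w.coresFree -= d.cores
          w.memFreeMb -= d.memMb
          placed += (d.id -> w.id)
          launched = true
        }
        curPos = (curPos + 1) % numWorkers
      }
    }
    placed
  }
}
```

A driver that fits on no worker is simply left out of the result, which mirrors how an unplaced driver stays in waitingDrivers until the next call to schedule.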
We look at launchDriver first and then turn to startExecutorsOnWorkers.

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
    logInfo("Launching driver " + driver.id + " on worker " + worker.id)
    // The worker and the driver record each other
    worker.addDriver(driver)
    driver.worker = Some(worker)
    // The Master sends a LaunchDriver message to the Worker's endpoint; driver.desc is a case class describing the resources the driver needs
    worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
    driver.state = DriverState.RUNNING
  }

Next, trace what the Worker does when it receives the LaunchDriver message:

override def receive: PartialFunction[Any, Unit] = synchronized {
case LaunchDriver(driverId, driverDesc) =>
      logInfo(s"Asked to launch driver $driverId")
      // Instantiate a DriverRunner
      val driver = new DriverRunner(conf, driverId, workDir, sparkHome,
        // Call copy on the incoming case class to update some of its fields
        driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
        self, workerUri, securityMgr)
      // Record the driver in this worker, then call start to launch it
      drivers(driverId) = driver
      driver.start()
      // Update the worker's own resource usage
      coresUsed += driverDesc.cores
      memoryUsed += driverDesc.mem
}

Now look at how DriverRunner is instantiated; the important part is what its start method does.

/**
 * Manages the execution of one driver, including automatically restarting the driver on failure.
 * This is currently only used in standalone cluster deploy mode.
 */
private[deploy] class DriverRunner(...){...}

The comment tells us that DriverRunner is only used in standalone cluster deploy mode.

/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      // prepare driver jars and run driver (the key method)
      val exitCode = prepareAndRunDriver()
      // notify worker of final driver state, possible exception
      // Send the final state to the owning worker
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }}.start()
}

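At the end, worker.send reports the Driver's final state to the owning Worker. On receiving DriverStateChanged, the Worker updates its own resource accounting and forwards the DriverStateChanged message to the Master, which updates the driver's state and then calls schedule() again. That path is not our focus here, so we go into prepareAndRunDriver() to see how the driver is actually started: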

private[worker] def prepareAndRunDriver(): Int = {
   /* Creates the working directory for this driver. */
   val driverDir = createWorkingDirectory()
   /* Download the user jar into the supplied directory and return its local path. */
   // Call chain: Utils.fetchFile -> copyFile -> copyRecursive -> Files.copy
   val localJarFilename = downloadUserJar(driverDir)

   def substituteVariables(argument: String): String = argument match {
     case "{{WORKER_URL}}" => workerUrl
     case "{{USER_JAR}}" => localJarFilename
     case other => other
   }

   // TODO: If we add ability to submit multiple jars they should also be added here
   // Build a ProcessBuilder from the system environment and the information carried by driverDesc
   val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
     driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
   // Entry point for the launch, explained below
   runDriver(builder, driverDir, driverDesc.supervise)
}

buildProcessBuilder: a key method

First step into buildProcessBuilder to see what goes into the builder:

 /*Build a ProcessBuilder based on the given parameters.*/
def buildProcessBuilder(...): ProcessBuilder = {
  // Rebuild the Command case class against the local system environment
  val localCommand = buildLocalCommand(command, securityMgr, substituteArguments, classPaths, env)
  // Turn the rebuilt Command into the argument sequence for a ProcessBuilder
  val commandSeq = buildCommandSeq(localCommand, memory, sparkHome)
  val builder = new ProcessBuilder(commandSeq: _*)
  // Fill in the ProcessBuilder's Map<String,String> environment
  val environment = builder.environment()
  for ((key, value) <- localCommand.environment) {
    environment.put(key, value)
  }
  builder
}
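The shape of this method can be reproduced with nothing but java.lang.ProcessBuilder. The following is a minimal sketch under stated assumptions: ProcessBuilderSketch and build are illustrative names, and the placeholder substitution is applied directly to the argument list here, whereas the source above delegates that work to buildLocalCommand and buildCommandSeq.

```scala
object ProcessBuilderSketch {
  /** Builds (but does not start) a process from a command line, an
    * environment map and a placeholder-substitution function. */
  def build(
      command: Seq[String],
      env: Map[String, String],
      substitute: String => String): ProcessBuilder = {
    // Resolve placeholders such as {{USER_JAR}} before handing the
    // arguments to the varargs ProcessBuilder constructor.
    val builder = new ProcessBuilder(command.map(substitute): _*)
    // environment() returns a mutable map backing the future process.
    val environment = builder.environment()
    for ((key, value) <- env) {
      environment.put(key, value)
    }
    builder
  }
}
```

Nothing is executed until start() is called on the returned builder, which is why the builder can be assembled in one place and launched in another.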

Next, step into runDriver(builder, driverDir, driverDesc.supervise). The source below keeps only the main line:

private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
    // initialize is passed as an argument to the call below; all it does is redirect the Process's stdout and stderr to files in the working directory
    // Redirect stdout and stderr to files
    def initialize(process: Process): Unit = {...}
    // Continue downward
    runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
}
private[worker] def runCommandWithRetry(...): Int = {
	synchronized {
        if (killed) { return exitCode }
        // Call command.start() and keep going
        process = Some(command.start())
        initialize(process.get)
    }
}
public Process start() throws IOException {
    String[] cmdarray = command.toArray(new String[command.size()]);
    // We stop here: ProcessImpl starts a process that executes the command; how the command itself runs is not traced further
	return ProcessImpl.start(cmdarray,environment,dir,redirects,redirectErrorStream);
}
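The excerpt above drops the part of runCommandWithRetry that uses the supervise flag. The DriverRunner class comment says a driver is restarted automatically on failure, so the idea can be sketched as a loop that reruns the command while supervision is on and the exit code is non-zero. This is an illustration only: runOnce and maxAttempts are assumptions made here to keep the example finite and testable, and it does not reproduce Spark's actual retry or back-off logic.

```scala
object SuperviseSketch {
  /** Runs `runOnce` and, when `supervise` is set, reruns it after a
    * non-zero exit code, up to `maxAttempts` attempts in total.
    * Returns the exit code of the last attempt. */
  def runWithRetry(supervise: Boolean, maxAttempts: Int)(runOnce: () => Int): Int = {
    var attempts = 0
    var exitCode = -1
    var keepTrying = true
    while (keepTrying) {
      exitCode = runOnce() // in DriverRunner this would start the process and wait for it
      attempts += 1
      keepTrying = supervise && exitCode != 0 && attempts < maxAttempts
    }
    exitCode
  }
}
```

With supervise set to false the command runs exactly once, whatever its exit code.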

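At this point we have walked the whole path of a Driver, from being assigned a Worker to running the command it carries and finishing startup. The Driver itself is instantiated during submit, which a later article covers in detail.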

Back to startExecutorsOnWorkers()

Let us return to where we started: at the end of schedule(), startExecutorsOnWorkers() is called to launch the Executors.

/*Schedule and launch executors on workers*/
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler (first in, first out).
  for (app <- waitingApps) {
    // Cores per Executor; defaults to 1 when not configured
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
    // Keep only ALIVE workers with enough free memory and cores, sorted by free cores in descending order
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB && worker.coresFree >= coresPerExecutor)
      .sortBy(_.coresFree).reverse
    // Decide how many cores each usable Worker contributes, explained below
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    // Allocate resources to the Executors and launch them, explained below
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
    }
  }
}
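The worker filter at the top of this method is easy to check on sample data. In the sketch below, W is a hypothetical stand-in for WorkerInfo and a Boolean replaces the WorkerState.ALIVE comparison; the filter and sort themselves follow the code above.

```scala
object UsableWorkersSketch {
  // Hypothetical stand-in for WorkerInfo.
  final case class W(id: String, alive: Boolean, coresFree: Int, memoryFreeMb: Int)

  /** Alive workers that can host at least one executor, most free cores first. */
  def usable(workers: Seq[W], coresPerExecutor: Int, memoryPerExecutorMb: Int): Seq[W] =
    workers
      .filter(_.alive)
      .filter(w => w.memoryFreeMb >= memoryPerExecutorMb && w.coresFree >= coresPerExecutor)
      .sortBy(_.coresFree)
      .reverse
}
```

Sorting by free cores in descending order means the workers with the most spare capacity are offered to the application first.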

Start with scheduleExecutorsOnWorkers, which contains some interesting details:

/* Schedule executors to be launched on the workers.
 Returns an array containing number of cores assigned to each worker.*/
private def scheduleExecutorsOnWorkers(...): Array[Int] = {
    val coresPerExecutor = app.desc.coresPerExecutor
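    // Minimum number of cores per Executor; defaults to 1 when not configured
    val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
    // If the cores per Executor were configured, a Worker may launch several Executors.
    // Otherwise each Worker launches only one Executor and gives it as many cores as possible; oneExecutorPerWorker is the flag recording this decision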
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    val memoryPerExecutor = app.desc.memoryPerExecutorMB
    val numUsable = usableWorkers.length
    // Cores each Worker is about to contribute, in Worker order
    val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
    // Number of Executors on each Worker, in Worker order
    val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
    // Cores still to be assigned: the smaller of the cores the app still needs and the total free cores across the usable Workers
    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
    // Return whether the specified worker can launch an executor for this app.
    // That is, whether this Worker can still take more work: either add cores to its existing executor or launch a new one, depending on oneExecutorPerWorker
    def canLaunchExecutor(pos: Int): Boolean = {...}
    // Keep launching executors until no more workers can accommodate any more executors, or if we have reached this application's limits
    // Indices of the Workers that still have resources to offer
    var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
    while (freeWorkers.nonEmpty) { 
      freeWorkers.foreach { pos => // Iterate over each usable Worker
        var keepScheduling = true
        while (keepScheduling && canLaunchExecutor(pos)) {
          // Update the number of cores still to be assigned
          coresToAssign -= minCoresPerExecutor
          // Update the cores this Worker contributes; assignedCores is eventually returned as the result
          assignedCores(pos) += minCoresPerExecutor
          // If we are launching one executor per worker, then every iteration assigns 1 core to the executor. Otherwise, every iteration assigns cores to a new executor.
          if (oneExecutorPerWorker) {
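            // Flag is true, i.e. the cores per Executor were not configured: each Worker gets a single Executor with as many cores as possible, so the entry in assignedExecutors stays at 1
            assignedExecutors(pos) = 1
          } else {
            // Otherwise several Executors are launched, so the entry in assignedExecutors keeps growing
            assignedExecutors(pos) += 1
          }
          // spreadOutApps selects between two placement models: spread the executors over as many workers as possible, or pack them onto as few workers as possible.
          // The first model is the default (spreadOutApps = true): keepScheduling is set to false,
          // which exits the inner loop and moves on to the next Worker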
          if (spreadOutApps) {
            keepScheduling = false
          }
        }
      }
      freeWorkers = freeWorkers.filter(canLaunchExecutor)
    }
    assignedCores
  }
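To see how spreadOutApps changes the outcome, the loop can be run on plain numbers. The sketch below is a simplified model rather than the Spark implementation: FreeWorker is a hypothetical stand-in for WorkerInfo, coresWanted plays the role of app.coresLeft, and canLaunch is a guess at the checks hidden behind the elided canLaunchExecutor (enough cores left to assign, enough free cores on the worker, and enough free memory when a new executor would be created).

```scala
object CoreAssignmentSketch {
  // Hypothetical stand-in for WorkerInfo: only the fields the sketch needs.
  final case class FreeWorker(coresFree: Int, memoryFreeMb: Int)

  /** Returns the number of cores assigned to each worker, by position. */
  def assignCores(
      workers: IndexedSeq[FreeWorker],
      coresWanted: Int,
      coresPerExecutor: Option[Int],
      memoryPerExecutorMb: Int,
      spreadOut: Boolean): Array[Int] = {
    val minCores = coresPerExecutor.getOrElse(1)
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    val assignedCores = new Array[Int](workers.length)
    val assignedExecutors = new Array[Int](workers.length)
    var coresToAssign = math.min(coresWanted, workers.map(_.coresFree).sum)

    // Assumed approximation of canLaunchExecutor, not copied from Spark.
    def canLaunch(pos: Int): Boolean = {
      val w = workers(pos)
      val enoughCores =
        coresToAssign >= minCores && w.coresFree - assignedCores(pos) >= minCores
      val needsNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
      val memoryAlreadyPromised = assignedExecutors(pos) * memoryPerExecutorMb
      val enoughMemory = w.memoryFreeMb - memoryAlreadyPromised >= memoryPerExecutorMb
      enoughCores && (!needsNewExecutor || enoughMemory)
    }

    var free = workers.indices.filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunch(pos)) {
          coresToAssign -= minCores
          assignedCores(pos) += minCores
          if (oneExecutorPerWorker) assignedExecutors(pos) = 1
          else assignedExecutors(pos) += 1
          // Spreading out: hand over one unit, then move to the next worker.
          if (spreadOut) keepScheduling = false
        }
      }
      free = free.filter(canLaunch)
    }
    assignedCores
  }
}
```

For three workers with 4 free cores each and an application that wants 6 cores, spreading out yields 2 cores on every worker, while packing fills the first worker with 4 cores and takes the remaining 2 from the second.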

Now go back and continue with allocateWorkerResourceToExecutors:

/* Allocate a worker's resources to one or more executors.*/
private def allocateWorkerResourceToExecutors(...): Unit = {
  // From the earlier result, compute how many Executors to launch on this Worker
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  // Cores taken by each Executor
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    // Internally instantiates an ExecutorDesc with initial state ExecutorState.LAUNCHING,
    // records the Executor in ApplicationInfo and updates coresGranted; the source is not expanded here
    val exec = app.addExecutor(worker, coresToAssign)
    // Launch the Executor
    launchExecutor(worker, exec)
    // Set the Application's state to RUNNING
    app.state = ApplicationState.RUNNING
  }
}
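The two Option expressions at the top of this method decide how a Worker's assigned cores are cut into Executors. The arithmetic can be isolated as follows; ExecutorSplitSketch and split are illustrative names, not part of Spark.

```scala
object ExecutorSplitSketch {
  /** Returns the core count of each executor launched on one worker. */
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): Seq[Int] = {
    // Configured size: as many executors as fit. Not configured: one executor.
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    // Configured size: that many cores each. Not configured: all assigned cores.
    val coresEach = coresPerExecutor.getOrElse(assignedCores)
    Seq.fill(numExecutors)(coresEach)
  }
}
```

With 6 assigned cores, a configured size of 2 gives three executors of 2 cores each, while an unset size gives a single executor holding all 6 cores.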

Step into launchExecutor:

private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
   logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
   // Record the Executor and update coresUsed and memoryUsed
   worker.addExecutor(exec)
   // The Master sends a LaunchExecutor message to the Worker's endpoint
   worker.endpoint.send(LaunchExecutor(masterUrl,exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
   // Send an ExecutorAdded message to the application's driver
   exec.application.driver.send(ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}

First, what the Worker does on receiving the LaunchExecutor message:

case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
    try {
      // Create the executor's working directory
      val executorDir = new File(workDir, appId + "/" + execId)
      // Create local dirs for the executor. These are passed to the executor via the
      // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
      // application finishes.
      val appLocalDirs = appDirectories.getOrElse(appId, {
        val localRootDirs = Utils.getOrCreateLocalRootDirs(conf)
        // Create the local working directories; part of the source is omitted
        val dirs = localRootDirs.flatMap {...}.toSeq
        dirs
        })
      appDirectories(appId) = appLocalDirs
      // As with the driver launch above, instantiate an ExecutorRunner and call its start method
      val manager = new ExecutorRunner(...)
      executors(appId + "/" + execId) = manager
      manager.start()
      coresUsed += cores_
      memoryUsed += memory_
      // Send a state-change message to the Master; the flow is not expanded here
      sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
    } catch {
      case e: Exception =>
        // Other error handling is omitted; an ExecutorStateChanged message is still sent to the Master, but with different arguments from the one above
        sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,Some(e.toString), None))
    }

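sendToMaster delivers the message to the Master; in turn the Master notifies the corresponding Driver of the state change and updates the Application's state. In short, all of these state changes are carried out by messages passed between components, and we do not follow them further here.
Leaving aside how ExecutorRunner is instantiated, we continue with manager.start():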

private[worker] def start() {
    workerThread = new Thread("ExecutorRunner for " + fullId) {
      override def run() { fetchAndRunExecutor() }
    }
    workerThread.start()
}
/* Download and run the executor described in our ApplicationDescription*/
private def fetchAndRunExecutor() {  // Only the main line of the source is kept
      val builder = CommandUtils.buildProcessBuilder(subsCommand, new SecurityManager(conf),memory, sparkHome.getAbsolutePath, substituteVariables)
      // Same as the Driver path above: build a ProcessBuilder, then start() indirectly calls ProcessImpl.start(...); see the earlier part of this article
      process = builder.start()
      val exitCode = process.waitFor()
      state = ExecutorState.EXITED
      // The Worker in turn receives ExecutorStateChanged and updates its resource accounting
      worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
}

Summary

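That completes the walk through schedule, covering Driver launch as well as Executor allocation and launch. Back at the start of the method, note that when the cluster has only just started, both waitingDrivers and waitingApps are empty; they are populated only once a user submits a program, so there is no need to be puzzled by this. Later articles trace the source starting from program submission.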

private def schedule(): Unit = {
	for (driver <- waitingDrivers.toList) {...}
}
private def startExecutorsOnWorkers(): Unit = {
	for (app <- waitingApps) {...}
}

References:

Apache Spark source code