Spark-Core Source Code Study Notes
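This series records my review of the Spark source code. The goal is to untangle the mechanism and flow by which Spark distributes and runs programs, trace the key parts of the source, and understand not only what happens but why. Peripheral code is only described in prose and not drilled into, so that it does not obscure the main line.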
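This article supplements the Worker registration process. As mentioned in Spark-Core Source Code Study Notes 1, once a Worker finishes registering with the Master, the Master calls the schedule method to schedule resources. Below we trace the resource scheduling flow of the schedule method in detail.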
Source of the schedule method: how Drivers are fitted into whatever space is free
/* Schedule the currently available resources among waiting apps. This method
will be called every time a new app joins or resource availability changes.*/
private def schedule(): Unit = {...}
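The comment makes it clear that this method is called whenever resource availability changes or a new application is submitted, and that it allocates the currently available resources to the waiting applications.
Let's look at the body: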
private def schedule(): Unit = {
  // Drivers take strict precedence over executors
  // Keep only the registered Workers whose state is ALIVE, then shuffle them randomly
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  for (driver <- waitingDrivers.toList) {
    var launched = false // whether the driver currently being visited has been launched
    var numWorkersVisited = 0 // how many workers have been visited so far for this driver
    while (numWorkersVisited < numWorkersAlive && !launched) {
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // Launch the Driver (explained below)
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // Launch the Executors (explained below)
  startExecutorsOnWorkers()
}
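Put simply, the outer for loop iterates over the waiting Drivers and the inner while loop iterates over the alive Workers, checking whether a Worker's free CPU cores and memory satisfy the Driver's requirements. If they do, the Driver is launched with launchDriver(worker, driver); if not, the loop moves on to the next Worker, fitting each Driver into whatever gap is available. Because curPos is declared outside the for loop, each Driver resumes the scan where the previous one stopped rather than starting again from the first Worker.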
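To make the round-robin placement easier to follow, here is a minimal, self-contained sketch of the same loop. SimWorker, SimDriver and DriverPlacementSketch are hypothetical names invented for illustration, not Spark classes, and reserving resources inline is only a stand-in for what launchDriver does.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified stand-ins for Spark's WorkerInfo and DriverInfo.
final case class SimWorker(id: String, var coresFree: Int, var memoryFree: Int)
final case class SimDriver(id: String, cores: Int, mem: Int)

object DriverPlacementSketch {
  /** Returns driverId -> workerId for every driver that could be placed.
    * Workers are used in the given order (Spark shuffles them first). */
  def place(workers: IndexedSeq[SimWorker], waiting: ArrayBuffer[SimDriver]): Map[String, String] = {
    var placed = Map.empty[String, String]
    val numWorkersAlive = workers.size
    var curPos = 0 // shared across drivers, so the next driver resumes where the last one stopped
    for (driver <- waiting.toList) { // iterate over a copy because `waiting` is mutated below
      var launched = false
      var numWorkersVisited = 0
      while (numWorkersVisited < numWorkersAlive && !launched) {
        val worker = workers(curPos)
        numWorkersVisited += 1
        if (worker.memoryFree >= driver.mem && worker.coresFree >= driver.cores) {
          // Stand-in for launchDriver: reserve the resources on the chosen worker.
          worker.coresFree -= driver.cores
          worker.memoryFree -= driver.mem
          placed += (driver.id -> worker.id)
          waiting -= driver
          launched = true
        }
        curPos = (curPos + 1) % numWorkersAlive
      }
    }
    placed
  }
}
```

A driver that fits on no worker simply stays in the waiting buffer after one full pass, which mirrors how it would be retried the next time scheduling runs.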
We look at the launchDriver method first and then turn to startExecutorsOnWorkers.
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // The worker and the driver record each other
  worker.addDriver(driver)
  driver.worker = Some(worker)
  // The Master sends a LaunchDriver message to the Worker through the Worker's endpoint reference;
  // driver.desc is a case class that records the resources the driver needs
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  driver.state = DriverState.RUNNING
}
Next, trace what the Worker does after receiving the LaunchDriver message:
override def receive: PartialFunction[Any, Unit] = synchronized {
  case LaunchDriver(driverId, driverDesc) =>
    logInfo(s"Asked to launch driver $driverId")
    // Instantiate a DriverRunner
    val driver = new DriverRunner(conf, driverId, workDir, sparkHome,
      // Call copy on the case class passed in to update some of its fields
      driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
      self, workerUri, securityMgr)
    // Record the driver in this worker, then call start to launch it
    drivers(driverId) = driver
    driver.start()
    // Update the worker's own resource usage
    coresUsed += driverDesc.cores
    memoryUsed += driverDesc.mem
}
Let's look at how DriverRunner is instantiated; the focus is on what its start method does.
/**
* Manages the execution of one driver, including automatically restarting the driver on failure.
* This is currently only used in standalone cluster deploy mode.
*/
private[deploy] class DriverRunner(...){...}
According to the comment, this class is currently used only in standalone cluster deploy mode.
/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      // prepare driver jars and run driver (the key method)
      val exitCode = prepareAndRunDriver()
      // notify worker of final driver state, possible exception
      // i.e. report the final state to the corresponding worker
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}
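At the end, worker.send reports the driver's final state to the corresponding Worker. On receiving DriverStateChanged, the Worker updates its own resource bookkeeping and then forwards the DriverStateChanged message to the Master, which updates the driver's state and finally calls schedule() again. That path is not our focus here, so we now go into prepareAndRunDriver() to see in detail how the driver is started: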
private[worker] def prepareAndRunDriver(): Int = {
  /* Creates the working directory for this driver. */
  val driverDir = createWorkingDirectory()
  /* Download the user jar into the supplied directory and return its local path. */
  // Call chain: Utils.fetchFile -> copyFile -> copyRecursive -> Files.copy
  val localJarFilename = downloadUserJar(driverDir)
  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJarFilename
    case other => other
  }
  // TODO: If we add ability to submit multiple jars they should also be added here
  // Instantiate a ProcessBuilder from the system environment variables and the information carried by driverDesc
  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
  // The launch entry point (explained below)
  runDriver(builder, driverDir, driverDesc.supervise)
}
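The substituteVariables helper above is a plain pattern match on whole arguments. A small standalone sketch of the same idea follows; PlaceholderSketch, substituter and the sample URL and path are hypothetical values chosen only for illustration.

```scala
object PlaceholderSketch {
  /** Builds a function that replaces whole-argument placeholders;
    * any other argument passes through unchanged. */
  def substituter(workerUrl: String, localJar: String): String => String = {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}"   => localJar
    case other            => other
  }

  // Example: resolve the placeholders in a command's argument list.
  val resolved: Seq[String] =
    Seq("{{WORKER_URL}}", "{{USER_JAR}}", "--verbose")
      .map(substituter("spark://worker-host:7078", "/tmp/driver/app.jar"))
}
```

Note that only an argument that is exactly the placeholder is replaced; a placeholder embedded inside a longer string is left as it is.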
buildProcessBuilder: an important method
First step into buildProcessBuilder to see what the builder contains:
/* Build a ProcessBuilder based on the given parameters. */
def buildProcessBuilder(...): ProcessBuilder = {
  // Rebuild the Command case class using the local system environment variables
  val localCommand = buildLocalCommand(command, securityMgr, substituteArguments, classPaths, env)
  // Instantiate the ProcessBuilder from the rebuilt Command
  val commandSeq = buildCommandSeq(localCommand, memory, sparkHome)
  val builder = new ProcessBuilder(commandSeq: _*)
  // Fill the ProcessBuilder's Map<String,String> environment
  val environment = builder.environment()
  for ((key, value) <- localCommand.environment) {
    environment.put(key, value)
  }
  builder
}
Next, step into runDriver(builder, driverDir, driverDesc.supervise). Only the main line of the source is kept below:
private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
  // Define initialize as a function to pass to the call below; all it does is
  // redirect the Process's output and error streams to files in the working directory
  // Redirect stdout and stderr to files
  def initialize(process: Process): Unit = {...}
  // Keep going
  runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
}
private[worker] def runCommandWithRetry(...): Int = {
  synchronized {
    if (killed) { return exitCode }
    // Call command.start() and keep going
    process = Some(command.start())
    initialize(process.get)
  }
}
// java.lang.ProcessBuilder.start() from the JDK (Java source)
public Process start() throws IOException {
  String[] cmdarray = command.toArray(new String[command.size()]);
  // We stop here: ProcessImpl starts a process that executes the command; how the command itself runs is not traced further
  return ProcessImpl.start(cmdarray,environment,dir,redirects,redirectErrorStream);
}
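At this point we have walked through the whole process for a Driver, from being assigned a Worker to finally running the command it carries and completing startup. The Driver itself is instantiated during the submit process, which later articles will cover in detail.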
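The launch path above boils down to the standard java.lang.ProcessBuilder workflow: build a command, fill in the environment, redirect the output, start the process and wait for it. The sketch below shows that workflow in isolation. ProcessLaunchSketch and its parameters are hypothetical, and since the body of initialize is elided in the excerpt, the use of redirectOutput and redirectError here is my own simplification rather than what Spark does.

```scala
import java.io.File
import java.nio.file.Files

object ProcessLaunchSketch {
  /** Starts `command` with extra environment variables, sends stdout and stderr to
    * files under `workDir`, blocks until the process exits, and returns its exit code. */
  def run(command: Seq[String], extraEnv: Map[String, String], workDir: File): Int = {
    val builder = new ProcessBuilder(command: _*)
    builder.directory(workDir)
    val environment = builder.environment() // mutable java.util.Map[String, String]
    for ((key, value) <- extraEnv) environment.put(key, value)
    // Assumption: plain file redirection instead of Spark's own stream handling.
    builder.redirectOutput(new File(workDir, "stdout"))
    builder.redirectError(new File(workDir, "stderr"))
    val process = builder.start()
    process.waitFor()
  }

  /** Example: run `java -version` with the JVM that is executing this code. */
  def demo(): Int = {
    val javaBin = new File(System.getProperty("java.home"), "bin/java").getPath
    val dir = Files.createTempDirectory("driver-sketch").toFile
    run(Seq(javaBin, "-version"), Map("SKETCH_FLAG" -> "1"), dir)
  }
}
```

Waiting on the process inside the same call is what makes a dedicated thread necessary in the real code: the caller blocks for as long as the child process lives.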
Back to startExecutorsOnWorkers()
Let's return to where we started: at the end of schedule(), startExecutorsOnWorkers() is called to launch the Executors.
/* Schedule and launch executors on workers */
private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler (first in, first out)
  for (app <- waitingApps) {
    // Cores per Executor; defaults to 1 when not configured
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
    // Keep only the workers with enough free memory and cores, then sort by free cores in descending order
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB && worker.coresFree >= coresPerExecutor)
      .sortBy(_.coresFree).reverse
    // Plan how many cores each usable Worker will provide (explained below)
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    // Allocate the resources to Executors and launch them (explained below)
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
    }
  }
}
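The filter-then-sort step that produces usableWorkers can be tried out on its own. In this sketch FreeWorker and UsableWorkersSketch are hypothetical names, and the alive flag stands in for the WorkerState.ALIVE check.

```scala
// Hypothetical, simplified stand-in for Spark's WorkerInfo.
final case class FreeWorker(id: String, alive: Boolean, coresFree: Int, memoryFree: Int)

object UsableWorkersSketch {
  /** Keeps alive workers that can host at least one executor,
    * ordered by free cores from most to fewest. */
  def usable(workers: Seq[FreeWorker], memoryPerExecutorMB: Int, coresPerExecutor: Int): Seq[FreeWorker] =
    workers
      .filter(_.alive)
      .filter(w => w.memoryFree >= memoryPerExecutorMB && w.coresFree >= coresPerExecutor)
      .sortBy(_.coresFree)
      .reverse
}
```

Sorting by free cores in descending order means the workers with the most spare capacity are offered to the application first.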
First look at scheduleExecutorsOnWorkers, which does some interesting things:
/* Schedule executors to be launched on the workers.
   Returns an array containing number of cores assigned to each worker. */
private def scheduleExecutorsOnWorkers(...): Array[Int] = {
  val coresPerExecutor = app.desc.coresPerExecutor
  // Minimum number of cores each Executor needs; defaults to 1 when not configured
  val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
  // If coresPerExecutor is configured, one Worker may launch multiple Executors.
  // Otherwise each Worker launches only one Executor, which is given as many cores as possible.
  // oneExecutorPerWorker is the flag that records this decision.
  val oneExecutorPerWorker = coresPerExecutor.isEmpty
  val memoryPerExecutor = app.desc.memoryPerExecutorMB
  val numUsable = usableWorkers.length
  // Cores each Worker is going to provide, in worker order
  val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
  // Number of Executors on each Worker, in worker order
  val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
  // Cores still to be assigned: the smaller of the cores the app still needs
  // and the total free cores across the usable Workers
  var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
  // Return whether the specified worker can launch an executor for this app.
  // That is, whether this Worker can still take on more work: either more cores for its
  // single executor or a new executor, depending on oneExecutorPerWorker
  def canLaunchExecutor(pos: Int): Boolean = {...}
  // Keep launching executors until no more workers can accommodate any more executors, or if we have reached this application's limits
  // Positions of the Workers that still have resources available
  var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos => // visit each available Worker
      var keepScheduling = true
      while (keepScheduling && canLaunchExecutor(pos)) {
        // Update the number of cores still to be assigned
        coresToAssign -= minCoresPerExecutor
        // Update the cores this Worker provides; assignedCores is returned as the result array
        assignedCores(pos) += minCoresPerExecutor
        // If we are launching one executor per worker, then every iteration assigns 1 core to the executor. Otherwise, every iteration assigns cores to a new executor.
        if (oneExecutorPerWorker) {
          // The flag is true, meaning the cores per Executor were not configured, so each Worker gets
          // one Executor with as many cores as possible; the value in assignedExecutors stays at 1
          assignedExecutors(pos) = 1
        } else {
          // Otherwise multiple Executors are launched, so the value in assignedExecutors keeps growing
          assignedExecutors(pos) += 1
        }
        // spreadOutApps selects between two placement strategies: spread the executors across as many
        // workers as possible, or pack them onto as few workers as possible.
        // The default is the first (spreadOutApps = true): keepScheduling is set to false
        // to leave the inner loop and continue with the next Worker in the outer iteration
        if (spreadOutApps) {
          keepScheduling = false
        }
      }
    }
    freeWorkers = freeWorkers.filter(canLaunchExecutor)
  }
  assignedCores
}
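The difference between the two placement strategies is easiest to see with numbers. The following sketch reproduces the loop with a deliberately simplified canLaunch check: since canLaunchExecutor is elided in the excerpt, I assume it only compares cores and I ignore memory and executor-count limits. CoreAssignmentSketch and its parameter names are hypothetical.

```scala
object CoreAssignmentSketch {
  /** coresFree(i) is the free core count of usable worker i; returns the cores assigned per worker.
    * Simplification: memory and executor-count limits are ignored. */
  def assign(coresFree: Array[Int], coresLeft: Int, coresPerExecutor: Option[Int],
             spreadOut: Boolean): Array[Int] = {
    val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
    val assignedCores = new Array[Int](coresFree.length)
    var coresToAssign = math.min(coresLeft, coresFree.sum)

    // Assumed, simplified stand-in for canLaunchExecutor: only checks cores.
    def canLaunch(pos: Int): Boolean =
      coresToAssign >= minCoresPerExecutor &&
        coresFree(pos) - assignedCores(pos) >= minCoresPerExecutor

    var freeWorkers = coresFree.indices.filter(canLaunch)
    while (freeWorkers.nonEmpty) {
      freeWorkers.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunch(pos)) {
          coresToAssign -= minCoresPerExecutor
          assignedCores(pos) += minCoresPerExecutor
          if (spreadOut) keepScheduling = false // one step per worker, then move on
        }
      }
      freeWorkers = freeWorkers.filter(canLaunch)
    }
    assignedCores
  }
}
```

With three workers that have 8, 8 and 4 free cores, an application that still needs 10 cores and 2 cores per executor, spreading out yields 4, 4 and 2 cores, while packing yields 8, 2 and 0.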
Now go back and continue with the allocateWorkerResourceToExecutors method:
/* Allocate a worker's resources to one or more executors. */
private def allocateWorkerResourceToExecutors(...): Unit = {
  // From the result computed earlier, work out how many Executors to launch on this Worker
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  // Number of cores each Executor takes
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    // Internally instantiates an ExecutorDesc with initial state = ExecutorState.LAUNCHING,
    // records the Executor in ApplicationInfo and updates coresGranted; the source is not expanded here
    val exec = app.addExecutor(worker, coresToAssign)
    // Launch the Executor
    launchExecutor(worker, exec)
    // Set the Application's state to RUNNING
    app.state = ApplicationState.RUNNING
  }
}
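The two Option expressions at the top of this method decide how a worker's assigned cores are divided. A tiny sketch makes the two cases concrete; ExecutorSplitSketch and split are hypothetical names.

```scala
object ExecutorSplitSketch {
  /** Returns (number of executors, cores per executor) for one worker. */
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): (Int, Int) = {
    // Configured: as many executors as fit. Not configured: exactly one executor.
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    // Configured: the configured size. Not configured: the single executor takes everything.
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    (numExecutors, coresToAssign)
  }
}
```

For 8 assigned cores, split(8, Some(2)) gives four executors with 2 cores each, whereas split(8, None) gives one executor with all 8 cores.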
Step into launchExecutor:
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  // Record the Executor and update coresUsed and memoryUsed
  worker.addExecutor(exec)
  // The Master sends a LaunchExecutor message to the Worker through the Worker's endpoint reference
  worker.endpoint.send(LaunchExecutor(masterUrl, exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Send an ExecutorAdded message to the application's driver
  exec.application.driver.send(ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
First see what the Worker does after receiving the LaunchExecutor message:
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  try {
    // Create the executor's working directory
    val executorDir = new File(workDir, appId + "/" + execId)
    // Create local dirs for the executor. These are passed to the executor via the
    // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
    // application finishes.
    val appLocalDirs = appDirectories.getOrElse(appId, {
      val localRootDirs = Utils.getOrCreateLocalRootDirs(conf)
      // Create the local working directories (part of the source omitted)
      val dirs = localRootDirs.flatMap {...}.toSeq
      dirs
    })
    appDirectories(appId) = appLocalDirs
    // As with the driver launch earlier, instantiate an ExecutorRunner and call its start method
    val manager = new ExecutorRunner(...)
    executors(appId + "/" + execId) = manager
    manager.start()
    coresUsed += cores_
    memoryUsed += memory_
    // Send a state change message to the Master (flow not expanded here)
    sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
  } catch {
    case e: Exception =>
      // Other error handling omitted; an ExecutorStateChanged message is still sent to the Master,
      // but with different arguments from the one above
      sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED, Some(e.toString), None))
  }
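Here sendToMaster sends the message to the Master. In the process the Master notifies the corresponding Driver so that it can update its state, and it also updates the Application's state information. In short, all of these state changes are carried out by passing messages between components, and we will not trace them further here.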
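The general pattern, state that changes only in response to messages handled by a partial function, can be shown with a toy example. Everything in this sketch (SketchMessage, LaunchExec, ExecFinished, ResourceLedger) is hypothetical and much simpler than Spark's RPC endpoints; it only illustrates the bookkeeping idea.

```scala
// Hypothetical messages, loosely modelled on LaunchExecutor / ExecutorStateChanged.
sealed trait SketchMessage
final case class LaunchExec(execId: Int, cores: Int, memoryMB: Int) extends SketchMessage
final case class ExecFinished(execId: Int) extends SketchMessage

final class ResourceLedger {
  private var running = Map.empty[Int, LaunchExec]

  def coresUsed: Int = running.valuesIterator.map(_.cores).sum
  def memoryUsed: Int = running.valuesIterator.map(_.memoryMB).sum

  // All state changes go through this handler; unknown messages are simply not matched.
  val receive: PartialFunction[Any, Unit] = {
    case msg: LaunchExec      => running += (msg.execId -> msg)
    case ExecFinished(execId) => running -= execId
  }
}
```

Because the usage figures are derived from the set of running executors, a finished executor frees its cores and memory as soon as its message is handled.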
Setting aside how ExecutorRunner is instantiated, we continue with the manager.start() method:
private[worker] def start() {
  workerThread = new Thread("ExecutorRunner for " + fullId) {
    override def run() { fetchAndRunExecutor() }
  }
  workerThread.start()
}
/* Download and run the executor described in our ApplicationDescription */
private def fetchAndRunExecutor() { // only the main line of the source is kept
  val builder = CommandUtils.buildProcessBuilder(subsCommand, new SecurityManager(conf), memory, sparkHome.getAbsolutePath, substituteVariables)
  // Same as the Driver part earlier: build a ProcessBuilder instance, then call start,
  // which indirectly calls ProcessImpl.start(...); see the earlier part of this article for details
  process = builder.start()
  val exitCode = process.waitFor()
  state = ExecutorState.EXITED
  // The Worker likewise receives an ExecutorStateChanged message and updates its resource state
  worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
}
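Summary
That completes the walkthrough of the schedule method, covering Driver launch as well as Executor allocation and launch. Looking back at the start of the method: when the cluster has just started, the waitingDrivers and waitingApps collections are both empty, and they are only populated once a user submits a program. So there is no need to be puzzled by this; later articles will trace the source starting from program submission.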
private def schedule(): Unit = {
for (driver <- waitingDrivers.toList) {...}
}
private def startExecutorsOnWorkers(): Unit = {
for (app <- waitingApps) {...}
}