In standalone mode, when the Master runs schedule() it first launches the registered waitingDrivers and then launches the Executors on the Workers.
Worker launches the Driver
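The Master sends a LaunchDriver(driver.id, driver.desc) message to the Worker.
On receiving LaunchDriver, the Worker creates a DriverRunner to manage the Driver's execution, including restarting it if it fails.
The Worker calls start() on the DriverRunner and records the CPU cores and memory the Driver uses.
DriverRunner starts a Java thread that executes its run() function.
That thread creates the Driver's working directory, downloads the jar uploaded by the user, and launches the Driver process with Java's ProcessBuilder.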
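Each ProcessBuilder instance manages a set of process attributes. Its start() method uses those attributes to create a new Process instance, and it can be called repeatedly on the same ProcessBuilder to create further child processes with the same or related attributes. ProcessBuilder was introduced in J2SE 1.5; before that, child processes were created through Runtime.exec() and controlled through the Process object it returns.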
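The reuse of a single ProcessBuilder described above can be sketched as follows. This is a minimal illustration, not Spark's DriverRunner code: the object name ProcessBuilderSketch, the helper runOnce, the environment variable SKETCH_DRIVER_ID, and the choice of `java -version` as the child command are all assumptions made for the example.

```scala
import java.io.File

object ProcessBuilderSketch {
  // Path to the java binary of the running JVM, so the sketch does not depend on PATH.
  val javaBin: String =
    new File(new File(System.getProperty("java.home"), "bin"), "java").getPath

  /** Starts one child process from the given builder and returns its exit code. */
  def runOnce(builder: ProcessBuilder): Int = {
    val process = builder.start()
    // Drain the merged stdout/stderr so the child cannot block on a full pipe.
    val in = process.getInputStream
    try { while (in.read() != -1) {} } finally in.close()
    process.waitFor()
  }

  def main(args: Array[String]): Unit = {
    // One builder holds the attributes: command, working directory, environment.
    val builder = new ProcessBuilder(javaBin, "-version")
      .directory(new File(System.getProperty("java.io.tmpdir")))
      .redirectErrorStream(true)
    builder.environment().put("SKETCH_DRIVER_ID", "driver-0") // hypothetical variable

    // start() is invoked twice on the same instance, giving two separate child processes.
    val first = runOnce(builder)
    val second = runOnce(builder)
    println(s"exit codes: $first, $second")
  }
}
```

Keeping the attributes on the builder means a restart after a failure only needs another call to start(), without rebuilding the command line.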
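When the Driver finishes, the DriverRunner thread sends a DriverStateChanged(driverId, finalState.get, finalException) message to its own Worker (not directly to the Master), reporting the final state, which may include an exception.
On receiving DriverStateChanged, the Worker records the Driver's result and forwards the message to the Master.
The Worker then removes the driver from its in-memory cache and stores it in finishedDrivers for further processing.
Finally, it releases the driver's resources from coresUsed/memoryUsed.
When the Master receives DriverStateChanged, it also calls removeDriver, which removes the driver from its local cache, records the state information, and triggers a new round of scheduling.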
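The Worker-side bookkeeping in these steps can be modelled with a few mutable collections. The sketch below is a simplified stand-in, not Spark's Worker class: DriverInfo, WorkerState, launch, and the forwardedToMaster buffer (which takes the place of actually sending a message to the Master) are hypothetical names introduced for illustration, and the state is a plain string.

```scala
import scala.collection.mutable

object DriverBookkeepingSketch {
  // Hypothetical, simplified stand-ins for Spark's internal classes.
  final case class DriverInfo(id: String, cores: Int, memoryMb: Int)
  final case class DriverStateChanged(
      driverId: String, state: String, exception: Option[Exception])

  final class WorkerState {
    val drivers = mutable.HashMap.empty[String, DriverInfo]
    val finishedDrivers = mutable.LinkedHashMap.empty[String, DriverInfo]
    val forwardedToMaster = mutable.ArrayBuffer.empty[DriverStateChanged]
    var coresUsed = 0
    var memoryUsed = 0

    /** Records the resources a newly launched driver occupies. */
    def launch(d: DriverInfo): Unit = {
      drivers(d.id) = d
      coresUsed += d.cores
      memoryUsed += d.memoryMb
    }

    /** Forwards the message, then moves the driver to finishedDrivers and frees its resources. */
    def handleDriverStateChanged(msg: DriverStateChanged): Unit = {
      forwardedToMaster += msg // assumption: stands in for sending to the Master
      drivers.remove(msg.driverId).foreach { d =>
        finishedDrivers(d.id) = d
        coresUsed -= d.cores
        memoryUsed -= d.memoryMb
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val worker = new WorkerState
    worker.launch(DriverInfo("driver-0", cores = 2, memoryMb = 1024))
    worker.handleDriverStateChanged(DriverStateChanged("driver-0", "FINISHED", None))
    println(s"coresUsed=${worker.coresUsed}, memoryUsed=${worker.memoryUsed}")
  }
}
```

Because the resources are released only when the driver is actually found in the cache, a duplicate or unknown state change cannot drive coresUsed or memoryUsed negative in this model.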
Worker launches the Executor
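The Master sends the Worker a LaunchExecutor(masterUrl, exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory) message and sends the Driver an ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory) message.
On receiving LaunchExecutor, the Worker checks whether masterUrl belongs to the active Master; if it does not, the Executor is not launched.
The Worker creates a local directory for the executor,
then creates a new ExecutorRunner and starts it.
Once the runner has been started, the Worker likewise reports the Executor's state to the Master by sending an ExecutorStateChanged message.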
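ExecutorRunner starts a worker thread that downloads and runs the executor recorded in the app description, using ProcessBuilder to launch the Executor process.
It calls process = builder.start(), redirects stdout/stderr, and then blocks on exitCode = process.waitFor().
Finally, it reports back to the Worker with an ExecutorStateChanged message.
If an exception occurs while running, it calls killProcess() and does not notify the Worker again.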
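The normal path of this runner thread (start the process, redirect its output, wait for the exit code, report back) might look like the sketch below. It is an illustration under stated assumptions, not Spark's ExecutorRunner: RunnerThreadSketch, startRunner, the simplified StateChanged message, and the notifyWorker callback are hypothetical, output is redirected straight to one log file rather than through Spark's own log handling, and the exception path is deliberately left out.

```scala
import java.io.File

object RunnerThreadSketch {
  // Hypothetical, simplified version of the state-change message.
  final case class StateChanged(execId: Int, state: String, exitStatus: Option[Int])

  /** Runs `command` on its own thread and reports the final state through `notifyWorker`. */
  def startRunner(execId: Int, command: Seq[String], logFile: File)(
      notifyWorker: StateChanged => Unit): Thread = {
    val thread = new Thread(s"ExecutorRunner-sketch-$execId") {
      override def run(): Unit = {
        val builder = new ProcessBuilder(command: _*)
          .redirectErrorStream(true) // merge stderr into stdout
          .redirectOutput(logFile)   // assumption: a single log file for both streams
        val process = builder.start()
        val exitCode = process.waitFor() // blocks this thread until the child exits
        notifyWorker(StateChanged(execId, "EXITED", Some(exitCode)))
      }
    }
    thread.start()
    thread
  }

  def main(args: Array[String]): Unit = {
    val javaBin =
      new File(new File(System.getProperty("java.home"), "bin"), "java").getPath
    val log = File.createTempFile("executor-sketch", ".log")
    val t = startRunner(0, Seq(javaBin, "-version"), log)(msg => println(s"worker got $msg"))
    t.join()
  }
}
```

Running the blocking waitFor() on a dedicated thread keeps the caller responsive; the callback is the only channel through which the result travels back.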
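When the Worker receives ExecutorStateChanged, it forwards the message to the Master.
If the Executor has finished, the Worker removes it from its in-memory cache and releases its CPU and memory.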
private[worker] def handleExecutorStateChanged(executorStateChanged: ExecutorStateChanged):
    Unit = {
  sendToMaster(executorStateChanged)
  val state = executorStateChanged.state
  if (ExecutorState.isFinished(state)) {
    val appId = executorStateChanged.appId
    val fullId = appId + "/" + executorStateChanged.execId
    val message = executorStateChanged.message
    val exitStatus = executorStateChanged.exitStatus
    executors.get(fullId) match {
      case Some(executor) =>
        logInfo("Executor " + fullId + " finished with state " + state +
          message.map(" message " + _).getOrElse("") +
          exitStatus.map(" exitStatus " + _).getOrElse(""))
        executors -= fullId
        finishedExecutors(fullId) = executor
        trimFinishedExecutorsIfNecessary()
        coresUsed -= executor.cores
        memoryUsed -= executor.memory
        if (CLEANUP_NON_SHUFFLE_FILES_ENABLED) {
          shuffleService.executorRemoved(executorStateChanged.execId.toString, appId)
        }
      case None =>
        logInfo("Unknown Executor " + fullId + " finished with state " + state +
          message.map(" message " + _).getOrElse("") +
          exitStatus.map(" exitStatus " + _).getOrElse(""))
    }
    maybeCleanupApplication(appId)
  }
}
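When the Master receives ExecutorStateChanged, it first sends an ExecutorUpdated message to the corresponding Driver.
If the ExecutorState is a finished state, the Master removes the Executor from its Worker via removeExecutor.
If the Executor exited abnormally, the app's retry count has reached MAX_EXECUTOR_RETRIES, and none of the app's Executors is still RUNNING, the Master calls removeApplication and reports the failure.
Finally, the Master calls schedule() again.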
case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) =>
      val appInfo = idToApp(appId)
      val oldState = exec.state
      exec.state = state
      if (state == ExecutorState.RUNNING) {
        assert(oldState == ExecutorState.LAUNCHING,
          s"executor $execId state transfer from $oldState to RUNNING is illegal")
        appInfo.resetRetryCount()
      }
      exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))
      if (ExecutorState.isFinished(state)) {
        // Remove this executor from the worker and app
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // If an application has already finished, preserve its
        // state to display its information properly on the UI
        if (!appInfo.isFinished) {
          appInfo.removeExecutor(exec)
        }
        exec.worker.removeExecutor(exec)
        val normalExit = exitStatus == Some(0)
        // Only retry certain number of times so we don't go into an infinite loop.
        // Important note: this code path is not exercised by tests, so be very careful when
        // changing this `if` condition.
        if (!normalExit
            && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
            && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
          val execs = appInfo.executors.values
          if (!execs.exists(_.state == ExecutorState.RUNNING)) {
            logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
              s"${appInfo.retryCount} times; removing it")
            removeApplication(appInfo, ApplicationState.FAILED)
          }
        }
      }
      schedule()
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }
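The removal condition in the handler above is easier to reason about when pulled out as a pure function. The sketch below restates it under assumptions: RetryPolicySketch and shouldRemoveApp are hypothetical names, the retry count is passed in as the value already incremented for the current failure, and the limit of 10 used in main is only an illustrative value.

```scala
object RetryPolicySketch {
  /**
   * Returns true when the application should be removed as FAILED.
   *
   * @param normalExit                whether the executor exited with status 0
   * @param retryCountAfterIncrement  failure count including the current abnormal exit
   * @param maxRetries                retry limit; a negative value disables removal
   * @param anyExecutorRunning        whether any executor of the app is still RUNNING
   */
  def shouldRemoveApp(
      normalExit: Boolean,
      retryCountAfterIncrement: Int,
      maxRetries: Int,
      anyExecutorRunning: Boolean): Boolean =
    !normalExit &&
      retryCountAfterIncrement >= maxRetries &&
      maxRetries >= 0 &&
      !anyExecutorRunning

  def main(args: Array[String]): Unit = {
    val limit = 10 // illustrative value only
    // Abnormal exit, limit reached, nothing running: the app is removed.
    println(shouldRemoveApp(normalExit = false, 10, limit, anyExecutorRunning = false))
    // Same failure count, but another executor is still running: the app is kept.
    println(shouldRemoveApp(normalExit = false, 10, limit, anyExecutorRunning = true))
    // A negative limit disables the removal path entirely.
    println(shouldRemoveApp(normalExit = false, 10, -1, anyExecutorRunning = false))
  }
}
```

One difference from the handler is worth keeping in mind: there, incrementRetryCount() is only evaluated when the exit was abnormal because of short-circuiting, so a normal exit never increases the counter.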