Spark (3): How the Worker Launches Drivers and Executors

In standalone mode, when the Master runs schedule(), it first launches the drivers queued in waitingDrivers and then launches the executors on the workers.

Worker launches the Driver

The Master sends a LaunchDriver(driver.id, driver.desc) message to the Worker.

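On receiving LaunchDriver, the Worker creates a DriverRunner to manage the driver's execution, including restarting it on failure.
The Worker then calls the DriverRunner's start() and records the CPU cores and memory the driver uses.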

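DriverRunner starts a Java thread that executes its run() function.
The thread creates the driver's working directory, downloads the user-supplied jar, and starts the Driver process with Java's ProcessBuilder.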

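Each ProcessBuilder instance manages a collection of process attributes. Its start() method creates a new Process instance with those attributes, and it can be invoked repeatedly on the same instance to create new subprocesses with identical or related attributes. ProcessBuilder was added in J2SE 1.5; before that, subprocesses were created with Runtime.exec() and controlled through the Process class.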
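The ProcessBuilder behaviour described above can be sketched as follows. This is only an illustration, not DriverRunner's actual code: the object name ProcessBuilderSketch, the helper names, and the use of `java -version` as a stand-in command are assumptions chosen so the example runs anywhere a JVM is available.

```scala
import java.io.File
import java.nio.file.Files

object ProcessBuilderSketch {
  // Path to the JVM running this code; a stand-in for the real driver command.
  private val javaBin: String =
    new File(new File(System.getProperty("java.home"), "bin"), "java").getPath

  /** Builds a reusable set of process attributes rooted at `workDir`. */
  def newBuilder(workDir: File): ProcessBuilder = {
    val builder = new ProcessBuilder(javaBin, "-version")
    builder.directory(workDir)                           // child's working directory
    builder.redirectErrorStream(true)                    // merge stderr into stdout
    builder.redirectOutput(new File(workDir, "stdout"))  // write output to a file
    builder
  }

  /** Starts one child process from the builder and waits for its exit code. */
  def launch(builder: ProcessBuilder): Int = {
    val process = builder.start() // every call creates a new child process
    process.waitFor()             // blocks until the child exits
  }

  /** Launches two processes from the same builder and returns both exit codes. */
  def demo(): (Int, Int) = {
    val workDir = Files.createTempDirectory("driver-work").toFile
    val builder = newBuilder(workDir)
    (launch(builder), launch(builder))
  }
}
```

Calling ProcessBuilderSketch.demo() starts two separate child processes from one builder, which is the reuse property the paragraph refers to. Redirecting output to a file also keeps waitFor() from blocking on a full pipe.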

When the Driver finishes, the DriverRunner thread sends DriverStateChanged(driverId, finalState.get, finalException) to its own Worker (not directly to the Master), reporting the final state and any exception.

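On receiving DriverStateChanged, the Worker records the driver's result and forwards the message to the Master.
It then removes the driver from its in-memory map of running drivers and stores it in finishedDrivers for later processing.
Finally, it subtracts the driver's resources from coresUsed and memoryUsed.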

On receiving DriverStateChanged, the Master likewise calls removeDriver, which removes the driver from its local cache, records its state, and reschedules.
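The Worker-side bookkeeping for a driver can be modelled roughly as below. This is a simplified sketch, not Spark's Worker class: DriverInfo, WorkerState, launchDriver and driverFinished are hypothetical names, and only the fields mentioned in the text (the running-driver map, finishedDrivers, coresUsed, memoryUsed) are represented.

```scala
import scala.collection.mutable

object DriverBookkeepingSketch {
  // Hypothetical, simplified stand-in for what the Worker tracks per driver.
  final case class DriverInfo(id: String, cores: Int, memoryMb: Int)

  final class WorkerState {
    val drivers = mutable.HashMap.empty[String, DriverInfo]
    val finishedDrivers = mutable.LinkedHashMap.empty[String, DriverInfo]
    var coresUsed = 0
    var memoryUsed = 0

    /** Called when a driver is launched: register it and account for its resources. */
    def launchDriver(driver: DriverInfo): Unit = {
      drivers(driver.id) = driver
      coresUsed += driver.cores
      memoryUsed += driver.memoryMb
    }

    /** Called on DriverStateChanged: move the driver to finishedDrivers and
      * release its resources. Returns false if the driver id is unknown. */
    def driverFinished(driverId: String): Boolean =
      drivers.remove(driverId) match {
        case Some(driver) =>
          finishedDrivers(driverId) = driver
          coresUsed -= driver.cores
          memoryUsed -= driver.memoryMb
          true
        case None => false
      }
  }
}
```

Keeping finished drivers in a separate map means the running-driver map only ever reflects resources that are actually in use.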

Worker launches the Executor

The Master sends LaunchExecutor(masterUrl, exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory) to the Worker,
and sends ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory) to the driver.

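On receiving LaunchExecutor, the Worker checks whether masterUrl is the active Master; if it is not, the Worker does not launch the executor.
Otherwise the Worker creates a local directory for the executor,
then creates an ExecutorRunner and starts it.
Once the runner is started, the Worker reports the executor's state to the Master in an ExecutorStateChanged message.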

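ExecutorRunner starts a worker thread that downloads and runs the executor described in the application description, using ProcessBuilder to start the Executor process.
It calls process = builder.start(), redirects stdout and stderr, and blocks on exitCode = process.waitFor().
Finally it sends an ExecutorStateChanged message back to the Worker.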
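If the run throws an exception, the runner calls killProcess(), which terminates the process and reports the final state to the Worker with ExecutorStateChanged.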

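On receiving ExecutorStateChanged, the Worker forwards it to the Master.
If the executor has finished, the Worker removes it from its in-memory map and releases its CPU and memory.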

private[worker] def handleExecutorStateChanged(executorStateChanged: ExecutorStateChanged):
    Unit = {
    sendToMaster(executorStateChanged)
    val state = executorStateChanged.state
    if (ExecutorState.isFinished(state)) {
      val appId = executorStateChanged.appId
      val fullId = appId + "/" + executorStateChanged.execId
      val message = executorStateChanged.message
      val exitStatus = executorStateChanged.exitStatus
      executors.get(fullId) match {
        case Some(executor) =>
          logInfo("Executor " + fullId + " finished with state " + state +
            message.map(" message " + _).getOrElse("") +
            exitStatus.map(" exitStatus " + _).getOrElse(""))
          executors -= fullId
          finishedExecutors(fullId) = executor
          trimFinishedExecutorsIfNecessary()
          coresUsed -= executor.cores
          memoryUsed -= executor.memory
          if (CLEANUP_NON_SHUFFLE_FILES_ENABLED) {
            shuffleService.executorRemoved(executorStateChanged.execId.toString, appId)
          }
        case None =>
          logInfo("Unknown Executor " + fullId + " finished with state " + state +
            message.map(" message " + _).getOrElse("") +
            exitStatus.map(" exitStatus " + _).getOrElse(""))
      }
      maybeCleanupApplication(appId)
    }
  }

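On receiving ExecutorStateChanged, the Master first sends an ExecutorUpdated message to the corresponding driver.
If the ExecutorState is a finished state, the Master removes the executor from the application and from the worker.
If the executor exited abnormally, the application's retry count has reached MAX_EXECUTOR_RETRIES, and none of its executors is still RUNNING, the Master calls removeApplication and marks the application as FAILED.
Finally, the Master calls schedule() again.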

    case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
      val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
      execOption match {
        case Some(exec) =>
          val appInfo = idToApp(appId)
          val oldState = exec.state
          exec.state = state

          if (state == ExecutorState.RUNNING) {
            assert(oldState == ExecutorState.LAUNCHING,
              s"executor $execId state transfer from $oldState to RUNNING is illegal")
            appInfo.resetRetryCount()
          }

          exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))

          if (ExecutorState.isFinished(state)) {
            // Remove this executor from the worker and app
            logInfo(s"Removing executor ${exec.fullId} because it is $state")
            // If an application has already finished, preserve its
            // state to display its information properly on the UI
            if (!appInfo.isFinished) {
              appInfo.removeExecutor(exec)
            }
            exec.worker.removeExecutor(exec)

            val normalExit = exitStatus == Some(0)
            // Only retry certain number of times so we don't go into an infinite loop.
            // Important note: this code path is not exercised by tests, so be very careful when
            // changing this `if` condition.
            if (!normalExit
                && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
                && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
              val execs = appInfo.executors.values
              if (!execs.exists(_.state == ExecutorState.RUNNING)) {
                logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
                  s"${appInfo.retryCount} times; removing it")
                removeApplication(appInfo, ApplicationState.FAILED)
              }
            }
          }
          schedule()
        case None =>
          logWarning(s"Got status update for unknown executor $appId/$execId")
      }
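The retry condition in the Master handler above can be isolated as a pure function, which makes its edge cases easier to reason about. This is a sketch only: ExecutorRetrySketch and shouldRemoveApplication are hypothetical names, and the retry count is passed in as the value already returned by incrementRetryCount() rather than being incremented as a side effect.

```scala
object ExecutorRetrySketch {
  /** Returns true when the application should be removed as FAILED.
    *
    * @param exitStatus               exit code of the finished executor, if known
    * @param retryCountAfterIncrement value returned by incrementRetryCount()
    * @param maxExecutorRetries       MAX_EXECUTOR_RETRIES; negative disables removal
    * @param anyExecutorRunning       whether any executor of the app is still RUNNING
    */
  def shouldRemoveApplication(
      exitStatus: Option[Int],
      retryCountAfterIncrement: Int,
      maxExecutorRetries: Int,
      anyExecutorRunning: Boolean): Boolean = {
    val normalExit = exitStatus.contains(0) // only exit code 0 counts as normal
    !normalExit &&
      retryCountAfterIncrement >= maxExecutorRetries &&
      maxExecutorRetries >= 0 &&
      !anyExecutorRunning
  }
}
```

Two details stand out: a missing exit status (None) is treated as an abnormal exit, and a negative MAX_EXECUTOR_RETRIES switches this application-killing path off entirely.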

Reposted from blog.csdn.net/rover2002/article/details/106072473