The Message Passing Mechanism

Spark's communication layer is built on Netty, an excellent network communication framework, and thereby inherits Netty's reliability and efficiency. The communication framework is implemented with the factory design pattern, which decouples the rest of Spark from Netty and makes it possible to plug in other messaging tools as needed.

Consider Spark's message-communication class diagram:

The communication framework is built from the four classes on the left of that diagram, in three steps:

1. First, two abstractions are defined: RpcEnv and RpcEnvFactory. RpcEnv declares the abstract methods for starting, stopping, and shutting down the RPC framework, while RpcEnvFactory declares the abstract creation method.

2. Then NettyRpcEnv and NettyRpcEnvFactory implement the inherited methods on top of Netty. Note the endpoint-setup method setupEndpoint in NettyRpcEnv: it stores each RpcEndpoint and its RpcEndpointRef as key-value pairs of each other in thread-safe ConcurrentHashMaps.

3. Finally, the RpcEnv companion object provides a static create method that instantiates the RpcEnv implementation via reflection.
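The three steps above can be sketched in a compact, self-contained form. This is an illustrative skeleton, not Spark's actual code: the signatures are simplified, but it shows how the abstract RpcEnv/RpcEnvFactory pair, the ConcurrentHashMap-backed endpoint registry, and the reflective create call fit together.

```scala
import java.util.concurrent.ConcurrentHashMap

trait RpcEndpoint
case class RpcEndpointRef(name: String)

// Step 1: the abstractions callers depend on
trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
  def shutdown(): Unit
}
trait RpcEnvFactory { def create(name: String): RpcEnv }

// Step 2: the Netty-flavored implementations; endpoints and their
// references are kept in a thread-safe map
class NettyRpcEnv(name: String) extends RpcEnv {
  private val endpoints = new ConcurrentHashMap[RpcEndpointRef, RpcEndpoint]()
  override def setupEndpoint(endpointName: String, endpoint: RpcEndpoint): RpcEndpointRef = {
    val ref = RpcEndpointRef(endpointName)
    endpoints.put(ref, endpoint)
    ref
  }
  override def shutdown(): Unit = endpoints.clear()
}
class NettyRpcEnvFactory extends RpcEnvFactory {
  override def create(name: String): RpcEnv = new NettyRpcEnv(name)
}

object RpcEnvSketch {
  // Step 3: the concrete factory is chosen by class name via reflection,
  // so a different transport could be configured in without code changes
  def create(factoryClassName: String, name: String): RpcEnv = {
    Class.forName(factoryClassName).getDeclaredConstructor()
      .newInstance().asInstanceOf[RpcEnvFactory].create(name)
  }
}
```

Because callers only ever see RpcEnv and RpcEnvFactory, swapping Netty for another transport is a matter of supplying a different factory class name.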

Each communicating module, such as Master or Worker, first calls RpcEnv's static method to create an RpcEnv instance and then instantiates Master. Because Master extends ThreadSafeRpcEndpoint, the resulting Master instance is a thread-safe endpoint. Next, RpcEnv's endpoint-setup method is called to register the Master endpoint and its corresponding reference with the RpcEnv. From then on, any object that obtains the reference to the Master endpoint can send messages to the Master. The following code from Master's startRpcEnvAndEndpoint method starts the communication framework:

 
    /**
     * Start the Master and return a three tuple of:
     *   (1) The Master RpcEnv
     *   (2) The web UI bound port
     *   (3) The REST server bound port, if any
     */
    def startRpcEnvAndEndpoint(
        host: String,
        port: Int,
        webUiPort: Int,
        conf: SparkConf): (RpcEnv, Int, Option[Int]) = {
      val securityMgr = new SecurityManager(conf)
      val rpcEnv = RpcEnv.create(SYSTEM_NAME, host, port, conf, securityMgr)
      val masterEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME,
        new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))
      val portsResponse = masterEndpoint.askWithRetry[BoundPortsResponse](BoundPortsRequest)
      (rpcEnv, portsResponse.webUIPort, portsResponse.restPort)
    }

Message Communication During Spark Startup

During Spark startup the main communication happens between Master and Worker. The message exchange works as follows: a Worker node first sends a registration message to the Master; after processing it, the Master replies with a registration-success or registration-failure message.

(1) After the Master starts, the Workers start in turn. On startup each Worker creates its communication environment RpcEnv and its endpoint, and sends the Master a RegisterWorker message to register itself. The Worker.tryRegisterAllMasters method is as follows:

 
    // There may be more than one Master
    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      masterRpcAddresses.map { masterAddress =>
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = {
            try {
              logInfo("Connecting to master " + masterAddress + "...")
              // Obtain a reference to the Master endpoint
              val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
              registerWithMaster(masterEndpoint)
            } catch { ... }
          }
        })
      }
    }

    private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
      // Send the registration message through the Master endpoint reference
      masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
          workerId, host, port, self, cores, memory, workerWebUiUrl))
        .onComplete {
          // Handle the registration-success or registration-failure reply
          // This is a very fast action so we can use "ThreadUtils.sameThread"
          case Success(msg) =>
            Utils.tryLogNonFatalError { handleRegisterResponse(msg) }
          case Failure(e) =>
            logError(s"Cannot register with master: ${masterEndpoint.address}", e)
            System.exit(1)
        }(ThreadUtils.sameThread)
    }

(2) On receiving the message, the Master validates and records the information the Worker sent. If registration succeeds, it sends a RegisteredWorker message to the Worker to confirm that registration is complete, and step (3) follows: the Worker starts sending periodic heartbeats to the Master. If registration fails, the Master sends a RegisterWorkerFailed message; the Worker logs the error and aborts its startup. The Master.receiveAndReply method is as follows:

 
    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      case RegisterWorker(
          id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl) =>
        logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
          workerHost, workerPort, cores, Utils.megabytesToString(memory)))
        // The Master is in STANDBY state
        if (state == RecoveryState.STANDBY) {
          context.reply(MasterInStandby)
        } else if (idToWorker.contains(id)) { // The Worker is already in the registration list
          context.reply(RegisterWorkerFailed("Duplicate worker ID"))
        } else {
          val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
            workerRef, workerWebUiUrl)
          // registerWorker adds the Worker to the registration list
          if (registerWorker(worker)) {
            persistenceEngine.addWorker(worker)
            context.reply(RegisteredWorker(self, masterWebUiUrl))
            schedule()
          } else {
            val workerAddress = worker.endpoint.address
            logWarning("Worker registration failed. Attempted to re-register worker at same " +
              "address: " + workerAddress)
            context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
              + workerAddress))
          }
        }

      ...
    }

The Worker's handleRegisterResponse method:

 
    private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
      msg match {
        case RegisteredWorker(masterRef, masterWebUiUrl) =>
          logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
          registered = true
          changeMaster(masterRef, masterWebUiUrl)
          forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
            override def run(): Unit = Utils.tryLogNonFatalError {
              self.send(SendHeartbeat)
            }
          }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
          if (CLEANUP_ENABLED) {
            logInfo(
              s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
            forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
              override def run(): Unit = Utils.tryLogNonFatalError {
                self.send(WorkDirCleanup)
              }
            }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
          }

        case RegisterWorkerFailed(message) =>
          if (!registered) {
            logError("Worker registration failed: " + message)
            System.exit(1)
          }

        case MasterInStandby =>
          // Ignore. Master not yet ready.
      }
    }

(3) Once its registration is confirmed, the Worker periodically sends Heartbeat messages to the Master so that the Master knows the Worker's live status. The interval is derived from spark.worker.timeout; note that the heartbeat interval is one quarter of that setting:

    private val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 4
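The arithmetic above is easy to check. A minimal sketch (the object and method names here are illustrative; the 60-second default and the divide-by-four come straight from the line above):

```scala
object HeartbeatInterval {
  // spark.worker.timeout is in seconds; the heartbeat fires four times
  // per timeout window so the Master misses several beats before it
  // declares the Worker dead
  def heartbeatMillis(timeoutSeconds: Long): Long = timeoutSeconds * 1000 / 4

  def main(args: Array[String]): Unit = {
    // With the default spark.worker.timeout of 60 seconds,
    // the Worker heartbeats every 15 seconds
    println(heartbeatMillis(60))  // 15000
  }
}
```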

Spark Runtime Message Communication

When a user submits an application, the application's SparkContext sends an application-registration message to the Master, the Master assigns Executors to the application, and once started, the Executors send a registration-success message back to the SparkContext.

(1) While the SparkContext is being created it first instantiates a SchedulerBackend object; in standalone mode this is actually a StandaloneSchedulerBackend. As that object starts, it brings up two endpoints: the DriverEndpoint inherited from its parent class, and the ClientEndpoint created by StandaloneAppClient.

ClientEndpoint's tryRegisterAllMasters method creates the registration thread pool registerMasterThreadPool, starts registration threads in it, and sends the Master a RegisterApplication message to register the application:

 
    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      // Iterate over all Masters; this for comprehension yields a collection
      for (masterAddress <- masterRpcAddresses) yield {
        // Start a registration thread in the pool; the thread exits as soon
        // as it sees the registration-success flag registered == true
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = try {
            if (registered.get) { // private val registered = new AtomicBoolean(false)
              return
            }
            logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
            val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
            // Send the registration message
            masterRef.send(RegisterApplication(appDescription, self))
          } catch {...}
        })
      }
    }
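The pattern in the method above, one registration task per Master address, all guarded by a shared AtomicBoolean, can be reproduced in a self-contained sketch. Names and the "successful reply" stand-in are illustrative, not Spark's:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object RegisterSketch {
  // Shared flag: set once any registration attempt succeeds
  val registered = new AtomicBoolean(false)

  def main(args: Array[String]): Unit = {
    val registerMasterThreadPool = Executors.newFixedThreadPool(2)
    val masterAddresses = Seq("master-a:7077", "master-b:7077")
    masterAddresses.foreach { address =>
      registerMasterThreadPool.submit(new Runnable {
        override def run(): Unit = {
          if (registered.get) return // another thread already succeeded
          println(s"Connecting to master $address...")
          // Stand-in for receiving a successful registration reply
          registered.compareAndSet(false, true)
        }
      })
    }
    registerMasterThreadPool.shutdown()
    registerMasterThreadPool.awaitTermination(5, TimeUnit.SECONDS)
    assert(registered.get)
  }
}
```

Using an AtomicBoolean rather than a plain var matters here: several pool threads race on the flag, and get/compareAndSet give the check-then-act steps well-defined visibility without a lock.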

When the Master receives the application-registration message, its registerApplication method records the application's information, adds the application to the waiting list, and sends the registration-success message RegisteredApplication back to the ClientEndpoint:

 
    override def receive: PartialFunction[Any, Unit] = {
      case ElectedLeader => {
        val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv)
        state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
          RecoveryState.ALIVE
        } else {
          RecoveryState.RECOVERING
        }
        logInfo("I have been elected leader! New state: " + state)
        if (state == RecoveryState.RECOVERING) {
          beginRecovery(storedApps, storedDrivers, storedWorkers)
          recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
            override def run(): Unit = Utils.tryLogNonFatalError {
              self.send(CompleteRecovery)
            }
          }, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
        }
      }

      case CompleteRecovery => completeRecovery()

      case RevokedLeadership => {
        logError("Leadership has been revoked -- master shutting down.")
        System.exit(0)
      }

      // Handle the RegisterApplication message sent by the driver's ClientEndpoint
      case RegisterApplication(description, driver) => {
        // TODO Prevent repeated registrations from some driver
        if (state == RecoveryState.STANDBY) {
          // ignore, don't send response
        } else {
          logInfo("Registering app " + description.name)
          val app = createApplication(description, driver)
          registerApplication(app)
          logInfo("Registered app " + description.name + " with ID " + app.id)
          persistenceEngine.addApplication(app)
          driver.send(RegisteredApplication(app.id, self))
          schedule()
        }
      }

      ....
    }

Starting the Driver and Executors:

 
    private def schedule(): Unit = {
      if (state != RecoveryState.ALIVE) { return }
      // Shuffle the Worker nodes randomly
      val shuffledWorkers = Random.shuffle(workers) // Randomization helps balance drivers
      for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
        // Launch Drivers across the cluster in order, preferably on different Workers
        for (driver <- waitingDrivers) {
          if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
            launchDriver(worker, driver)
            waitingDrivers -= driver
          }
        }
      }
      startExecutorsOnWorkers()
    }

    /**
     * Schedule and launch executors on workers
     */
    private def startExecutorsOnWorkers(): Unit = {
      // Applications run in FIFO order: the first to register runs first
      for (app <- waitingApps if app.coresLeft > 0) {
        val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
        // Filter out workers that don't have enough resources to launch an executor
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
            worker.coresFree >= coresPerExecutor.getOrElse(1))
          .sortBy(_.coresFree).reverse
        // spreadOutApps spreads the application over as many Workers as possible;
        // the alternative packs it onto as few Workers as possible
        val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

        // After assigning the cores the application asked for, walk the
        // workers and launch the executors
        for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
          allocateWorkerResourceToExecutors(
            app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
        }
      }
    }
(2) When AppClient.ClientEndpoint receives the RegisteredApplication message from the Master, it sets the registration flag registered to true:

 
    override def receive: PartialFunction[Any, Unit] = {
      case RegisteredApplication(appId_, masterRef) =>
        // FIXME How to handle the following cases?
        // 1. A master receives multiple registrations and sends back multiple
        //    RegisteredApplications due to an unstable network.
        // 2. Receive multiple RegisteredApplication from different masters because the master is
        //    changing.
        appId.set(appId_)
        registered.set(true)
        master = Some(masterRef)
        listener.connected(appId.get)

      case ApplicationRemoved(message) =>
        markDead("Master removed our application: %s".format(message))
        stop()
      ....
    }

(3) When the Master allocates resources to an application in its startExecutorsOnWorkers method, it calls allocateWorkerResourceToExecutors to start Executors on the Workers. When a Worker receives the LaunchExecutor message from the Master, it first instantiates an ExecutorRunner object. As the ExecutorRunner starts, it creates a ProcessBuilder, which then uses the command to create a CoarseGrainedExecutorBackend object, the container in which the Executor runs. Finally, the Worker sends an ExecutorStateChanged message to the Master to report that the Executor container has been created.

 
    case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
      if (masterUrl != activeMasterUrl) {
        logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
      } else {
        try {
          logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

          // Create the executor's working directory
          val executorDir = new File(workDir, appId + "/" + execId)
          if (!executorDir.mkdirs()) {
            throw new IOException("Failed to create directory " + executorDir)
          }

          // Create the executor's local directories, deleted by the Worker
          // when the application finishes
          val appLocalDirs = appDirectories.getOrElse(appId,
            Utils.getOrCreateLocalRootDirs(conf).map { dir =>
              val appDir = Utils.createDirectory(dir, namePrefix = "executor")
              Utils.chmod700(appDir)
              appDir.getAbsolutePath()
            }.toSeq)
          appDirectories(appId) = appLocalDirs

          // The ExecutorRunner creates the CoarseGrainedExecutorBackend using
          // the command in the application description; that command is built
          // in StandaloneSchedulerBackend's start method
          val manager = new ExecutorRunner(appId, execId,
            appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
            cores_, memory_, self, workerId, host, webUi.boundPort, publicAddress,
            sparkHome, executorDir, workerUri, conf, appLocalDirs, ExecutorState.RUNNING)
          executors(appId + "/" + execId) = manager
          manager.start() // Start the ExecutorRunner
          coresUsed += cores_
          memoryUsed += memory_
          sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
        } catch {...}
      }

The ExecutorRunner does this work in its fetchAndRunExecutor method. The command it runs is defined in StandaloneSchedulerBackend and specifies that CoarseGrainedExecutorBackend, the Executor's runtime container, be constructed:

 
    private def fetchAndRunExecutor() {
      try {
        // Build the process builder from the application description and
        // environment configuration
        val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
          memory, sparkHome.getAbsolutePath, substituteVariables)
        val command = builder.command()
        val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
        logInfo(s"Launch command: $formattedCommand")

        // Add the working directory and environment variables to the builder
        builder.directory(executorDir)
        builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
        builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

        // Add webUI log urls
        val baseUrl =
          s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
        builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
        builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

        // Start the builder, creating the CoarseGrainedExecutorBackend instance
        process = builder.start()
        val header = "Spark Executor Command: %s\n%s\n\n".format(
          formattedCommand, "=" * 40)

        // Capture the CoarseGrainedExecutorBackend instance's output
        val stdout = new File(executorDir, "stdout")
        stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
        val stderr = new File(executorDir, "stderr")
        Files.write(header, stderr, StandardCharsets.UTF_8)
        stderrAppender = FileAppender(process.getErrorStream, stderr, conf)

        // Wait for CoarseGrainedExecutorBackend to exit; on exit, report the
        // state back to the Worker
        val exitCode = process.waitFor()
        state = ExecutorState.EXITED
        val message = "Command exited with code " + exitCode
        worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
      } catch {...}
    }
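The ProcessBuilder pattern used above, configure command, working directory, and environment, start the child, and wait for its exit code, can be exercised stand-alone. The echo command below is an illustrative stand-in (fetchAndRunExecutor launches CoarseGrainedExecutorBackend this way instead), and it assumes a Unix-like environment where echo exists:

```scala
import java.io.File

object ProcessSketch {
  def runOnce(): Int = {
    val builder = new ProcessBuilder("echo", "executor started")
    builder.directory(new File("."))                        // the child's working directory
    builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0") // env vars passed to the child
    val process = builder.start()
    process.waitFor()                                       // block until the child exits
  }

  def main(args: Array[String]): Unit =
    println("Command exited with code " + runOnce())
}
```

Note that start() returns immediately with a Process handle; it is waitFor() that blocks, which is why ExecutorRunner runs fetchAndRunExecutor on its own thread while the Worker's message loop stays responsive.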

(4) The Master receives the ExecutorStateChanged message from the Worker:

 
    case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
      // Find the app that owns the executor, then flatMap to look the
      // executor up in the app's internal cache
      val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
      execOption match {
        case Some(exec) =>
          // Record the executor's current state
          val appInfo = idToApp(appId)
          val oldState = exec.state
          exec.state = state

          if (state == ExecutorState.RUNNING) {
            assert(oldState == ExecutorState.LAUNCHING,
              s"executor $execId state transfer from $oldState to RUNNING is illegal")
            appInfo.resetRetryCount()
          }
          // Forward an ExecutorUpdated message to the Driver
          exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))
          ...

(5) In the onStart method of the CoarseGrainedExecutorBackend launched in step (3), a RegisterExecutor message is sent to the DriverEndpoint. The DriverEndpoint first checks whether that Executor is already registered, allocates task resources in its makeOffers() method, and finally sends LaunchTask messages to run tasks:

 
    case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
      if (executorDataMap.contains(executorId)) {
        executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
        context.reply(true)
      } else {
        ...
        // Record the executor's id and the number of cores it will use
        addressToExecutorId(executorAddress) = executorId
        totalCoreCount.addAndGet(cores)
        totalRegisteredExecutors.addAndGet(1)
        val data = new ExecutorData(executorRef, executorRef.address, hostname,
          cores, cores, logUrls)
        // Map the executor id to its details
        CoarseGrainedSchedulerBackend.this.synchronized {
          executorDataMap.put(executorId, data)
          if (currentExecutorIdCounter < executorId.toInt) {
            currentExecutorIdCounter = executorId.toInt
          }
          if (numPendingExecutors > 0) {
            numPendingExecutors -= 1
            logDebug(s"Decremented number of pending executors ($numPendingExecutors left)")
          }
        }
        // Reply that registration is complete and post an executor-added
        // event on the listener bus
        executorRef.send(RegisteredExecutor)
        context.reply(true)
        listenerBus.post(
          SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))
        // Allocate task resources and send LaunchTask messages to run tasks
        makeOffers()
      }

(6) When CoarseGrainedExecutorBackend receives the registration-success message RegisteredExecutor, it instantiates an Executor object inside the CoarseGrainedExecutorBackend container. Once started, the Executor sends periodic heartbeats to the Driver while waiting for task-execution messages from the DriverEndpoint.

 
    // Registration with the driver succeeded; it replied with RegisteredExecutor
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      try {
        // Create the Executor, which sends periodic heartbeats to the Driver
        // while waiting for tasks to be assigned
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
      } catch {...}

(7) After the Executor inside CoarseGrainedExecutorBackend starts, it receives the LaunchTask messages sent by the DriverEndpoint. Task execution is implemented in the Executor's launchTask method, which creates a TaskRunner for the task and lets it do the processing; when processing finishes, a StatusUpdate message is sent back through CoarseGrainedExecutorBackend:

 
    def launchTask(context: ExecutorBackend, taskId: Long,
        attemptNumber: Int, taskName: String, serializedTask: ByteBuffer): Unit = {
      // Create a TaskRunner for each task
      val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber,
        taskName, serializedTask)
      // Cache the TaskRunner in memory
      runningTasks.put(taskId, tr)
      // Hand the TaskRunner to the thread pool, which queues it automatically
      threadPool.execute(tr)
    }
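The structure of launchTask, wrap the task in a Runnable, remember it in a concurrent map keyed by task id, and hand it to a thread pool, can be sketched stand-alone. Names here are illustrative, not Spark's; the real TaskRunner deserializes and runs the task bytes:

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

object TaskRunnerSketch {
  class TaskRunner(taskId: Long) extends Runnable {
    override def run(): Unit = println(s"running task $taskId")
  }

  // The map lets a running task be looked up later (e.g. to kill it)
  val runningTasks = new ConcurrentHashMap[Long, TaskRunner]()
  val threadPool = Executors.newFixedThreadPool(2)

  def launchTask(taskId: Long): Unit = {
    val tr = new TaskRunner(taskId)
    runningTasks.put(taskId, tr)
    threadPool.execute(tr) // queued automatically when all threads are busy
  }

  def main(args: Array[String]): Unit = {
    (1L to 3L).foreach(launchTask) // with 2 threads, the third task waits in the queue
    threadPool.shutdown()
    threadPool.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```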

(8) When a TaskRunner finishes its task, a StatusUpdate message is sent to the DriverEndpoint, which calls TaskSchedulerImpl's statusUpdate method to handle the different task outcomes; afterwards, the Executor is assigned further tasks:

 
    case StatusUpdate(executorId, taskId, state, data) =>
      // Let TaskSchedulerImpl's statusUpdate handle the different task outcomes
      scheduler.statusUpdate(taskId, state, data.value)
      if (TaskState.isFinished(state)) {
        executorDataMap.get(executorId) match {
          // After the task finishes, reclaim the CPU cores it used on the
          // Executor, then assign new tasks as appropriate
          case Some(executorInfo) =>
            executorInfo.freeCores += scheduler.CPUS_PER_TASK
            makeOffers(executorId)
          case None => ...
        }
      }


Reprinted from blog.csdn.net/liudongdong19/article/details/81872212