Spark Architecture Internals: Startup Message Communication

Having covered "Spark Architecture Internals: Master Source Code Analysis" and "Spark Architecture Internals: Worker Source Code Analysis", let's walk through the whole startup message communication process of Spark against the source code.

Spark startup mainly involves communication between the Master and the Workers; the message flow is shown in the figure below. First, each Worker node sends a registration message to the Master. After processing it, the Master replies with either a registration-success or a registration-failure message. If registration succeeds, the Worker then sends periodic heartbeat messages to the Master.
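This handshake can be sketched as a minimal message protocol. The case classes below are simplified stand-ins modeled on Spark's DeployMessages; the field lists and the ToyMaster are illustrative, not the real Spark definitions:

```scala
// Simplified sketch of the Worker <-> Master handshake messages
// (modeled on Spark's DeployMessages; names and fields are illustrative).
sealed trait DeployMessage
case class RegisterWorker(id: String, host: String, port: Int,
                          cores: Int, memoryMb: Int) extends DeployMessage
case class Heartbeat(workerId: String) extends DeployMessage

sealed trait RegisterWorkerResponse
case class RegisteredWorker(masterUrl: String) extends RegisterWorkerResponse
case class RegisterWorkerFailed(reason: String) extends RegisterWorkerResponse

// A toy master that accepts the first registration per worker id
// and rejects duplicates, mirroring the flow described above.
object ToyMaster {
  private val registered = scala.collection.mutable.Set[String]()
  def handle(msg: RegisterWorker): RegisterWorkerResponse =
    if (registered.contains(msg.id)) RegisterWorkerFailed("Duplicate worker ID")
    else { registered += msg.id; RegisteredWorker("spark://master:7077") }
}
```

A second registration with the same id is rejected, which is exactly the duplicate-id branch the Master code below implements.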

The detailed process is as follows:

(1) After the Master starts, the Workers start as well. On startup, each Worker creates its communication environment (RpcEnv) and endpoint (Endpoint), and sends a RegisterWorker message to the Master to register itself.

Because a Worker may need to register with multiple Masters (e.g., in an HA setup), the Worker's tryRegisterAllMasters method initializes a registration thread pool, registerMasterThreadPool, submits one registration request per Master to that pool, and lets the pool run the registration attempts. The Worker.tryRegisterAllMasters method is as follows:

private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  masterRpcAddresses.map { masterAddress =>
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = {
        try {
          logInfo("Connecting to master " + masterAddress + "...")
          // Obtain a reference to the Master endpoint
          val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
          // Call registerWithMaster to send the registration message
          registerWithMaster(masterEndpoint)
        } catch {
          case ie: InterruptedException => // Cancelled
          case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
        }
      }
    })
  }
}

private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
  // Send a RegisterWorker request to the master
  masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
    workerId, host, port, self, cores, memory, workerWebUiUrl))
    .onComplete {
      // On success, handle the registration response
      case Success(msg) =>
        Utils.tryLogNonFatalError {
          handleRegisterResponse(msg)
        }
      // On failure, log the error and exit
      case Failure(e) =>
        logError(s"Cannot register with master: ${masterEndpoint.address}", e)
        System.exit(1)
    }(ThreadUtils.sameThread)
}
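The pattern in the two methods above (fan registration attempts out to a dedicated pool, then react to the asynchronous result) can be reduced to a small self-contained sketch. ToyRegistrar, the addresses, and the connect stub are all hypothetical, standing in for the RpcEnv plumbing:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Toy version of tryRegisterAllMasters: one registration attempt per
// master address, each submitted to a dedicated thread pool.
object ToyRegistrar {
  private val pool = Executors.newFixedThreadPool(2)
  private implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  // Stub for "connect and register": pretend only some masters are reachable.
  def connect(address: String): String =
    if (address.contains("alive")) s"registered with $address"
    else throw new RuntimeException(s"cannot reach $address")

  // Like tryRegisterAllMasters, returns one future per master address;
  // callers inspect success/failure per attempt, as registerWithMaster does.
  def tryRegisterAll(addresses: Seq[String]): Seq[Future[String]] =
    addresses.map(addr => Future(connect(addr)))

  def shutdown(): Unit = pool.shutdown()
}
```

As in the real code, a failure against one Master does not abort the attempts against the others; each future completes independently.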

(2) When the Master receives the RegisterWorker request, it validates and records the information the Worker sent. If registration succeeds, it sends a RegisteredWorker message back to the Worker to confirm the registration, after which the Worker starts sending periodic heartbeats to the Master. If registration fails, the Master sends a RegisterWorkerFailed message; the Worker logs the error and aborts its startup.

On receiving the registration message, the Master first checks whether it is currently in STANDBY state; if so, it replies MasterInStandby and does no registration. If the Worker's id already appears in the registered-worker map, a registration-failure message is returned. Otherwise the Master calls registerWorker to add the Worker to its list, so the Worker can be used when the cluster schedules tasks. The Master's handling of RegisterWorker is as follows:

case RegisterWorker(
      id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl) =>
    logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
      workerHost, workerPort, cores, Utils.megabytesToString(memory)))
    // If this node is in STANDBY state, reply MasterInStandby
    if (state == RecoveryState.STANDBY) {
      context.reply(MasterInStandby)
    } else if (idToWorker.contains(id)) {
      // The workerId -> WorkerInfo map already contains this id,
      // so reply RegisterWorkerFailed with a duplicate-id error
      context.reply(RegisterWorkerFailed("Duplicate worker ID"))
    } else { // this node is the active master and the worker id is new
      // Create the WorkerInfo and try to register it: on success, persist the
      // worker, reply RegisteredWorker, and start scheduling; otherwise reply
      // RegisterWorkerFailed
      val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
        workerRef, workerWebUiUrl)
      if (registerWorker(worker)) {
        persistenceEngine.addWorker(worker)
        context.reply(RegisteredWorker(self, masterWebUiUrl))
        schedule()
      } else {
        val workerAddress = worker.endpoint.address
        logWarning("Worker registration failed. Attempted to re-register worker at same " +
          "address: " + workerAddress)
        context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
          + workerAddress))
      }
    }

(3) When the Worker receives confirmation that registration succeeded, it logs the event and updates its Master information, then starts sending periodic Heartbeat messages to the Master so that the Master can track the Worker's live status.

case RegisteredWorker(masterRef, masterWebUiUrl) =>
      logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
      registered = true // mark this worker as registered
      changeMaster(masterRef, masterWebUiUrl)
      // Schedule a background task that periodically sends heartbeats to the master
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      // If cleanup is enabled, periodically send WorkDirCleanup to clean old app directories
      if (CLEANUP_ENABLED) {
        logInfo(
          s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }
      // Build an ExecutorDescription for each executor this worker holds
      val execs = executors.values.map { e =>
        new ExecutorDescription(e.appId, e.execId, e.cores, e.state)
      }
      // Send WorkerLatestState to the master to report the worker's latest state
      masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))
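The heartbeat wiring above boils down to scheduleAtFixedRate on a single-threaded scheduler. The sketch below isolates that pattern; ToyHeartbeat and its counter are hypothetical stand-ins for the worker endpoint and self.send(SendHeartbeat), and the interval is shortened for demonstration:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Toy version of the heartbeat loop: once "registered", a scheduler
// fires a heartbeat action at a fixed rate until stopped.
object ToyHeartbeat {
  val sent = new AtomicInteger(0)
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(intervalMs: Long): Unit =
    scheduler.scheduleAtFixedRate(new Runnable {
      // Stands in for self.send(SendHeartbeat) in the real Worker
      override def run(): Unit = sent.incrementAndGet()
    }, 0, intervalMs, TimeUnit.MILLISECONDS)

  def stop(): Unit = scheduler.shutdown()
}
```

Using a dedicated scheduler thread keeps the heartbeat cadence independent of whatever the endpoint's message-processing thread is doing, which is the same reason the Worker uses a separate forwordMessageScheduler.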

Note: this article draws on Guo Jingzhan's book 《图解Spark:核心技术与案例实战》 (Illustrated Spark: Core Technologies and Case Studies).

Reposted from blog.csdn.net/Anbang713/article/details/81604693