[spark] Master and Worker startup process in Standalone mode

This article analyzes the Spark 2.1 source code.

Foreword

Spark, as a distributed computing framework, supports multiple operating modes:

  • Local operation mode (stand-alone)
  • Local pseudo-cluster operation mode (single-machine simulated cluster)
  • Standalone Client mode (cluster)
  • Standalone Cluster mode (cluster)
  • YARN Client mode (cluster)
  • YARN Cluster mode (cluster)

Standalone is Spark's built-in cluster manager, and it requires the Master and Worker daemons to be running. This article analyzes the startup process of both from the source code perspective. Master and Worker communicate through Netty-based RPC; for background on Spark's RPC layer, the article "In-depth analysis of RPC in Spark" is recommended reading.

Master start

The Master is started by the script start-master.sh, which ultimately launches the class:

org.apache.spark.deploy.master.Master

Take a look at its main method:

def main(argStrings: Array[String]) {
    Utils.initDaemon(log)
    val conf = new SparkConf
    val args = new MasterArguments(argStrings, conf)
    // Create the RpcEnv and start the RPC service
    val (rpcEnv, _, _) = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, conf)
    // Block until the RpcEnv terminates
    rpcEnv.awaitTermination()
  }

The main method first builds a SparkConf and parses the arguments, then calls startRpcEnvAndEndpoint to start an RpcEnv and register an Endpoint on it, and finally calls awaitTermination, which blocks the main thread while the server listens for and processes requests. Let's take a closer look at the startRpcEnvAndEndpoint method:

  def startRpcEnvAndEndpoint(
      host: String,
      port: Int,
      webUiPort: Int,
      conf: SparkConf): (RpcEnv, Int, Option[Int]) = {
    val securityMgr = new SecurityManager(conf)
    // Create the RpcEnv
    val rpcEnv = RpcEnv.create(SYSTEM_NAME, host, port, conf, securityMgr)
    // Create an Endpoint through rpcEnv
    val masterEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME,
      new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))
    val portsResponse = masterEndpoint.askWithRetry[BoundPortsResponse](BoundPortsRequest)
    (rpcEnv, portsResponse.webUIPort, portsResponse.restPort)
  }

First, the RpcEnv is created; RpcEnv is the core of Spark's RPC layer. An RpcEndpoint defines the logic for processing messages, and once created it is managed by the RpcEnv. Its life cycle is onStart, then receive (any number of times), then onStop. For a plain RpcEndpoint, receive may be invoked concurrently; for a ThreadSafeRpcEndpoint, receive is thread-safe and is entered by only one thread at a time.
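
The life cycle can be illustrated without Spark. The following is a minimal sketch under stated assumptions, not Spark's RpcEnv: MiniEndpoint, MiniEnv and LifecycleSketch are hypothetical names, and a single-threaded executor stands in for the dispatcher so that callbacks run one at a time, in the spirit of ThreadSafeRpcEndpoint.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for illustration only; these are not Spark classes.
trait MiniEndpoint {
  def onStart(): Unit = ()
  def receive(msg: Any): Unit
  def onStop(): Unit = ()
}

// Runs every callback of one endpoint on a single thread, so onStart,
// receive and onStop never overlap.
class MiniEnv(endpoint: MiniEndpoint) {
  private val inbox = Executors.newSingleThreadExecutor()
  inbox.execute(() => endpoint.onStart())

  def send(msg: Any): Unit = inbox.execute(() => endpoint.receive(msg))

  def stop(): Unit = {
    inbox.execute(() => endpoint.onStop())
    inbox.shutdown()
    inbox.awaitTermination(5, TimeUnit.SECONDS)
  }
}

object LifecycleSketch {
  // Returns the order in which the callbacks were invoked.
  def run(): List[String] = {
    val events = ArrayBuffer.empty[String]
    val env = new MiniEnv(new MiniEndpoint {
      override def onStart(): Unit = events += "onStart"
      override def receive(msg: Any): Unit = events += s"receive($msg)"
      override def onStop(): Unit = events += "onStop"
    })
    env.send("a")
    env.send("b")
    env.stop()
    events.toList
  }
}
```

Calling LifecycleSketch.run() returns the callbacks in life-cycle order: onStart first, one receive per message, onStop last.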

The Endpoint registered with rpcEnv in this method is the Master (which extends ThreadSafeRpcEndpoint). The Master's constructor creates the collections that hold its bookkeeping state:

 ...
  // A HashSet that holds WorkerInfo
  val workers = new HashSet[WorkerInfo]
  // A HashSet that holds the applications submitted by clients (SparkSubmit)
  val apps = new HashSet[ApplicationInfo]
  // Apps waiting to be scheduled
  val waitingApps = new ArrayBuffer[ApplicationInfo]
  // Holds DriverInfo
  val drivers = new HashSet[DriverInfo]
 ...

Since the Master is an Endpoint managed by the RpcEnv, its onStart life-cycle method is executed first:

override def onStart(): Unit = {
   ...
    checkForWorkerTimeOutTask = forwardMessageThread.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        self.send(CheckForWorkerTimeOut)
      }
    }, 0, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
   ...
  }

A periodic task is scheduled on forwardMessageThread: every WORKER_TIMEOUT_MS (60 seconds by default) it checks for timed-out Workers. In practice the task only sends a CheckForWorkerTimeOut message to the Master itself; how that message is handled is discussed in detail later.
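
The "timer sends a message to self" pattern can be shown in isolation. This is a minimal sketch, not the Master's code: the inbox queue and the collectTicks helper are hypothetical, and the latch exists only so the example stops after a fixed number of ticks.

```scala
import java.util.concurrent.{CountDownLatch, Executors, LinkedBlockingQueue, TimeUnit}

object TimeoutTickSketch {
  case object CheckForWorkerTimeOut

  // Schedules a periodic task that only enqueues a message; the actual check
  // would run later on the thread that drains the inbox. Returns the number
  // of messages in the inbox once `ticks` messages have been posted.
  def collectTicks(ticks: Int, periodMs: Long): Int = {
    val inbox = new LinkedBlockingQueue[Any]()
    val latch = new CountDownLatch(ticks)
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // Stop posting once enough ticks have been recorded.
        if (latch.getCount > 0) {
          inbox.put(CheckForWorkerTimeOut)
          latch.countDown()
        }
      }
    }, 0, periodMs, TimeUnit.MILLISECONDS)
    latch.await(5, TimeUnit.SECONDS)
    scheduler.shutdownNow()
    inbox.size()
  }
}
```

A plausible reason for this design is that the timer thread never touches the endpoint's state directly; it only posts a message, and the state is read and modified where messages are processed.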

Worker start

Workers on multiple nodes are started by the script start-slaves.sh, which ultimately launches the class:

org.apache.spark.deploy.worker.Worker

Take a look at its main method:

def main(argStrings: Array[String]) {
    Utils.initDaemon(log)
    val conf = new SparkConf
    val args = new WorkerArguments(argStrings, conf)
    val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores,
      args.memory, args.masters, args.workDir, conf = conf)
    rpcEnv.awaitTermination()
  }

As with the Master, main first builds a SparkConf and parses the arguments, then calls startRpcEnvAndEndpoint to start an RpcEnv and register an Endpoint, and finally calls awaitTermination to block while the server listens for and processes requests.

 def startRpcEnvAndEndpoint(
      host: String,
      port: Int,
      webUiPort: Int,
      cores: Int,
      memory: Int,
      masterUrls: Array[String],
      workDir: String,
      workerNumber: Option[Int] = None,
      conf: SparkConf = new SparkConf): RpcEnv = {

    // The LocalSparkCluster runs multiple local sparkWorkerX RPC Environments
    val systemName = SYSTEM_NAME + workerNumber.map(_.toString).getOrElse("")
    val securityMgr = new SecurityManager(conf)
    val rpcEnv = RpcEnv.create(systemName, host, port, conf, securityMgr)
    val masterAddresses = masterUrls.map(RpcAddress.fromSparkURL(_))
    rpcEnv.setupEndpoint(ENDPOINT_NAME, new Worker(rpcEnv, webUiPort, cores, memory,
      masterAddresses, ENDPOINT_NAME, workDir, conf, securityMgr))
    rpcEnv
  }

Here a new Worker instance is created as an Endpoint and registered with the RpcEnv. Among other fields, the Worker's constructor initializes the heartbeat interval to 1/4 of the Master-side Worker timeout.
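
The 1/4 relationship is simple arithmetic. The sketch below assumes the 60-second default timeout mentioned earlier; HeartbeatIntervalSketch and its member names are hypothetical and only mirror the ratio described in the text.

```scala
object HeartbeatIntervalSketch {
  // Assumed default, mirroring the 60-second Worker timeout on the Master side.
  val DefaultWorkerTimeoutSeconds: Long = 60L

  // Master side: how long a Worker may stay silent before it is timed out.
  def workerTimeoutMs(timeoutSeconds: Long): Long = timeoutSeconds * 1000

  // Worker side: heartbeats are sent four times per timeout window.
  def heartbeatMs(timeoutSeconds: Long): Long = workerTimeoutMs(timeoutSeconds) / 4
}
```

With the assumed default, a Worker sends a heartbeat every 15 seconds, so roughly four consecutive heartbeats have to be missed before the Master treats it as timed out.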

Worker registers with Master

Like any Endpoint, the Worker first executes its onStart() method:

override def onStart() {
   ...
    registerWithMaster()
   ...
  }

In the onStart() method, registerWithMaster is called to register itself with the Master:

private def registerWithMaster() {
    // onDisconnected may be triggered multiple times, so don't attempt registration
    // if there are outstanding registration attempts scheduled.
    registrationRetryTimer match {
      case None =>
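        // Mark as not yet registered
        registered = false
        // Try to register with all Masters
        registerMasterFutures = tryRegisterAllMasters()
        // Number of connection attempts so far
        connectionAttemptCount = 0
        // Re-registration is needed when the network or the Master fails;
        // the Worker exits once the number of retries exceeds the threshold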
        registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
          new Runnable {
            override def run(): Unit = Utils.tryLogNonFatalError {
              Option(self).foreach(_.send(ReregisterWithMaster))
            }
          },
          INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
          INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
          TimeUnit.SECONDS))
      case Some(_) =>
        logInfo("Not spawning another attempt to register with the master, since there is an" +
          " attempt scheduled already.")
    }
  }

On the first call registrationRetryTimer is necessarily None, so the Worker registers itself with the Masters through tryRegisterAllMasters and then schedules a periodic task that retries the registration a limited number of times (re-registration is needed when the network or the Master fails). Here is how tryRegisterAllMasters registers with the Masters:
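
The Option-based guard that prevents a second retry timer can be sketched on its own. This is an illustration, not the Worker's code: RetryGuardSketch, timersStarted and cancelRetries are hypothetical names, and the scheduled task is an empty placeholder for sending ReregisterWithMaster.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}

class RetryGuardSketch {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private var retryTimer: Option[ScheduledFuture[_]] = None
  var timersStarted = 0

  // Starts the periodic retry only if none is pending; returns true if it did.
  def registerWithMaster(): Boolean = retryTimer match {
    case None =>
      val task: Runnable = () => () // placeholder for sending ReregisterWithMaster
      retryTimer = Some(scheduler.scheduleAtFixedRate(task, 1, 1, TimeUnit.SECONDS))
      timersStarted += 1
      true
    case Some(_) =>
      // A retry is already scheduled, so do not spawn another one.
      false
  }

  // Cancels the pending retry and clears the guard.
  def cancelRetries(): Unit = {
    retryTimer.foreach(_.cancel(true))
    retryTimer = None
  }

  def shutdown(): Unit = scheduler.shutdownNow()
}
```

Calling registerWithMaster() twice in a row starts only one timer; a new timer can be started again only after cancelRetries() has cleared the guard.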

private def tryRegisterAllMasters(): Array[JFuture[_]] = {
    masterRpcAddresses.map { masterAddress =>
      registerMasterThreadPool.submit(new Runnable {
        override def run(): Unit = {
          try {
            logInfo("Connecting to master " + masterAddress + "...")
            val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
            registerWithMaster(masterEndpoint)
          } catch {
            case ie: InterruptedException => // Cancelled
            case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
          }
        }
      })
    }
  }

This calls rpcEnv.setupEndpointRef. An RpcEndpointRef is a reference to an RpcEndpoint living in some RpcEnv; it is serializable, so it can be sent over the network or stored for later use. An RpcEndpointRef has an address and a name, and its send method delivers an asynchronous one-way message to the corresponding RpcEndpoint.

Taken together, the code iterates over all masterRpcAddresses, obtains the RpcEndpointRef of each Master's RpcEndpoint, and passes it to registerWithMaster. Let's continue with that method:

private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
    masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
      workerId, host, port, self, cores, memory, workerWebUiUrl))
      .onComplete {
        // This is a very fast action so we can use "ThreadUtils.sameThread"
        case Success(msg) =>
          Utils.tryLogNonFatalError {
            handleRegisterResponse(msg)
          }
        case Failure(e) =>
          logError(s"Cannot register with master: ${masterEndpoint.address}", e)
          System.exit(1)
      }(ThreadUtils.sameThread)
  }

The Worker talks to the Master through the RpcEndpointRef: it sends a RegisterWorker message carrying workerId, host, port, cores, memory and so on. The callbacks for success and failure are explained later.
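
The ask-then-callback shape can be reproduced with plain Scala futures. The sketch below makes several assumptions: the ask function is a hypothetical stand-in for RpcEndpointRef.ask, the master URL is made up, the global execution context replaces ThreadUtils.sameThread, and the failure branch returns a description instead of calling System.exit.

```scala
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object AskSketch {
  sealed trait RegisterWorkerResponse
  case class RegisteredWorker(masterUrl: String) extends RegisterWorkerResponse
  case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse

  // Hypothetical stand-in for an RPC ask: the reply arrives as a Future.
  def ask(masterReachable: Boolean): Future[RegisterWorkerResponse] =
    if (masterReachable) Future.successful(RegisteredWorker("spark://master:7077"))
    else Future.failed(new java.io.IOException("connection refused"))

  // Describes what the callback would do for each outcome.
  def register(masterReachable: Boolean): String = {
    val outcome = Promise[String]()
    ask(masterReachable).onComplete {
      case Success(msg) => outcome.success(s"handle $msg")
      case Failure(e)   => outcome.success(s"exit: ${e.getMessage}")
    }(ExecutionContext.global)
    Await.result(outcome.future, 5.seconds)
  }
}
```

Success means a reply message arrived, which may still be a rejection such as RegisterWorkerFailed; Failure means the request itself did not complete.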

Master receives worker registration

In the Master, messages that expect a reply are handled in receiveAndReply (one-way messages are handled in receive). The logic for the Worker's RegisterWorker message is:

case RegisterWorker(
        id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl) =>
      logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
        workerHost, workerPort, cores, Utils.megabytesToString(memory)))
      // The current Master is in STANDBY
      if (state == RecoveryState.STANDBY) {
        context.reply(MasterInStandby)
      // The Worker has already been registered
      } else if (idToWorker.contains(id)) {
        context.reply(RegisterWorkerFailed("Duplicate worker ID"))
      } else {
        // Create a WorkerInfo from the Worker's registration information
        val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
          workerRef, workerWebUiUrl)
        if (registerWorker(worker)) {
          // Persist the Worker information
          persistenceEngine.addWorker(worker)
          // Reply to the Worker that registration succeeded
          context.reply(RegisteredWorker(self, masterWebUiUrl))
          // A new Worker means new resources: schedule the waiting apps
          schedule()
        } else {
          val workerAddress = worker.endpoint.address
          logWarning("Worker registration failed. Attempted to re-register worker at same " +
            "address: " + workerAddress)
          // Reply to the Worker that registration failed
          context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
            + workerAddress))
        }
      }

  1. If the current Master is in the STANDBY state, it replies with MasterInStandby right away.
  2. If the Worker ID is already registered, it replies with RegisterWorkerFailed right away.
  3. Otherwise it creates a WorkerInfo from the registration information and calls registerWorker:
    • If registration succeeds, the Worker information is persisted and a RegisteredWorker message is returned to the Worker. Because a new Worker means more resources, schedule() is then called to schedule the apps that are waiting.
    • If registration fails, a RegisterWorkerFailed message is returned to the Worker.
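
The three-way decision above can be condensed into a pure function. This is a simplified sketch: the reply types carry less data than Spark's messages, and the addressFree parameter is a hypothetical stand-in for the result of registerWorker(worker).

```scala
object RegisterDecisionSketch {
  sealed trait Reply
  case object MasterInStandby extends Reply
  case object RegisteredWorker extends Reply
  case class RegisterWorkerFailed(message: String) extends Reply

  // Picks the reply for a registration request, checking the same
  // conditions in the same order as the handler described above.
  def decide(
      standby: Boolean,
      knownIds: Set[String],
      id: String,
      addressFree: Boolean): Reply =
    if (standby) MasterInStandby
    else if (knownIds.contains(id)) RegisterWorkerFailed("Duplicate worker ID")
    else if (addressFree) RegisteredWorker
    else RegisterWorkerFailed("Attempted to re-register worker at same address")
}
```

The order of the checks matters: a standby Master answers MasterInStandby even for an ID it already knows.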

How is success determined? Let's look at the registerWorker method:

private def registerWorker(worker: WorkerInfo): Boolean = {
    // There may be one or more refs to dead workers on this same node (w/ different ID's),
    // remove them.
    workers.filter { w =>
      (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
    }.foreach { w =>
      workers -= w
    }
    // Get the address of the new Worker
    val workerAddress = worker.endpoint.address
    if (addressToWorker.contains(workerAddress)) {
      // Look up the previously registered Worker at this address
      val oldWorker = addressToWorker(workerAddress)
      // UNKNOWN means the Master is recovering and the Worker is being recovered
      if (oldWorker.state == WorkerState.UNKNOWN) {
        // Remove the old Worker and accept the newly registered one
        removeWorker(oldWorker)
      } else {
        logInfo("Attempted to re-register worker at same address: " + workerAddress)
        return false
      }
    }
    // Update the bookkeeping structures
    workers += worker
    idToWorker(worker.id) = worker
    addressToWorker(workerAddress) = worker
    true
  }

First, any managed Worker that has the same host and port as the newly registering Worker and is in the DEAD (timed out) state is removed from workers. Then, if addressToWorker already contains the new Worker's address, the old Worker is looked up. If its state is UNKNOWN, the Master is recovering and has not yet heard back from that Worker, so the old Worker is removed and the new one is accepted. If the old Worker is in any other state, this is a duplicate registration and the method returns false. Otherwise the new Worker is added to the bookkeeping structures and the method returns true.
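
The bookkeeping can be exercised with a small model. The sketch below is a simplification under stated assumptions: Worker and Registry are hypothetical classes, the address is reduced to a "host:port" string, and removing an UNKNOWN Worker is modeled as marking it Dead and dropping it from the two maps.

```scala
import scala.collection.mutable

object RegisterWorkerSketch {
  sealed trait State
  case object Alive extends State
  case object Dead extends State
  case object Unknown extends State

  // Simplified WorkerInfo: the address is just "host:port".
  class Worker(val id: String, val address: String, var state: State)

  class Registry {
    val workers = mutable.HashSet.empty[Worker]
    val idToWorker = mutable.HashMap.empty[String, Worker]
    val addressToWorker = mutable.HashMap.empty[String, Worker]

    def register(w: Worker): Boolean = {
      // Drop DEAD entries left behind at the same address.
      workers.filter(o => o.address == w.address && o.state == Dead).foreach(workers -= _)
      val accepted = addressToWorker.get(w.address) match {
        case Some(old) if old.state == Unknown =>
          // Stale entry from a Master recovery: retire it and accept the new one.
          old.state = Dead
          idToWorker -= old.id
          addressToWorker -= old.address
          true
        case Some(_) => false // a live Worker already owns this address
        case None => true
      }
      if (accepted) {
        workers += w
        idToWorker(w.id) = w
        addressToWorker(w.address) = w
      }
      accepted
    }
  }
}
```

Registering a second Worker at an address held by a live Worker is rejected, while the same registration succeeds once the old entry is in the Unknown state.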

Worker receives Master registration feedback message

private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
    masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
      workerId, host, port, self, cores, memory, workerWebUiUrl))
      .onComplete {
        // This is a very fast action so we can use "ThreadUtils.sameThread"
        case Success(msg) =>
          Utils.tryLogNonFatalError {
            handleRegisterResponse(msg)
          }
        case Failure(e) =>
          logError(s"Cannot register with master: ${masterEndpoint.address}", e)
          System.exit(1)
      }(ThreadUtils.sameThread)
  }

This is the registerWithMaster method the Worker called to register with the Master. Its onComplete callback processes the result, and handleRegisterResponse handles the different reply messages:

private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
    msg match {
      // Registration succeeded
      case RegisteredWorker(masterRef, masterWebUiUrl) =>
        logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
        // Mark the Worker as registered
        registered = true
        // Update the master references and cancel the other registration retries
        changeMaster(masterRef, masterWebUiUrl)
        // Send heartbeats to the Master periodically
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(SendHeartbeat)
          }
        }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
       ...
      // Registration failed: exit the process
      case RegisterWorkerFailed(message) =>
        if (!registered) {
          logError("Worker registration failed: " + message)
          System.exit(1)
        }
      // The Master is not the active Master: ignore
      case MasterInStandby =>
        // Ignore. Master not yet ready.
    }
  }

  1. If registration fails and a RegisterWorkerFailed message arrives, the Worker process exits (unless it has already registered successfully).
  2. If the Master is in the Standby state (MasterInStandby), the reply is simply ignored.
  3. If registration succeeds and a RegisteredWorker message arrives, the Worker first marks itself as registered, then updates some variables (such as activeMasterUrl, master, connected, etc.) through changeMaster and cancels the other pending registration retries. It then schedules a periodic task that sends a SendHeartbeat message to the Worker itself every HEARTBEAT_MILLIS. The handler for that message in receive sends a heartbeat to the Master:
 case SendHeartbeat =>
      if (connected) { sendToMaster(Heartbeat(workerId, self)) }

Master receives heartbeat

case Heartbeat(workerId, worker) =>
      idToWorker.get(workerId) match {
        case Some(workerInfo) =>
          workerInfo.lastHeartbeat = System.currentTimeMillis()
        case None =>
          if (workers.map(_.id).contains(workerId)) {
            logWarning(s"Got heartbeat from unregistered worker $workerId." +
              " Asking it to re-register.")
            worker.send(ReconnectWorker(masterUrl))
          } else {
            logWarning(s"Got heartbeat from unregistered worker $workerId." +
              " This worker was never registered, so ignoring the heartbeat.")
          }
      }

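The Master looks up the WorkerInfo for the given workerId. If it exists, the Master updates its lastHeartbeat to the current time. If it does not exist but the Worker is still present in workers, the Master sends ReconnectWorker to ask the Worker to re-register; a Worker that was never registered is ignored.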
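
The three outcomes of a heartbeat can be written as a small function. This is an illustrative sketch, not the Master's handler: the Outcome type and the simplified WorkerInfo are hypothetical, and the current time is passed in so the behavior is deterministic.

```scala
import scala.collection.mutable

object HeartbeatSketch {
  sealed trait Outcome
  case object Updated extends Outcome
  case object AskToReconnect extends Outcome
  case object Ignored extends Outcome

  // Simplified WorkerInfo that only tracks the last heartbeat time.
  class WorkerInfo(val id: String, var lastHeartbeat: Long)

  def onHeartbeat(
      id: String,
      now: Long,
      idToWorker: mutable.Map[String, WorkerInfo],
      workers: Iterable[WorkerInfo]): Outcome =
    idToWorker.get(id) match {
      case Some(info) =>
        // Known Worker: record the time of this heartbeat.
        info.lastHeartbeat = now
        Updated
      case None if workers.exists(_.id == id) =>
        // Seen before but no longer registered: ask it to re-register.
        AskToReconnect
      case None =>
        // Never registered: nothing to do.
        Ignored
    }
}
```

The middle case covers a Worker that is still listed in workers but has already been dropped from idToWorker, for example after a timeout.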

Master detects Worker heartbeat timeout

As seen above, the Master's onStart schedules a periodic task that checks whether any Worker has timed out. Here is how the Master handles the resulting message:

case CheckForWorkerTimeOut =>
      timeOutDeadWorkers()

private def timeOutDeadWorkers() {
    // Copy the workers into an array so we don't modify the hashset while iterating through it
    val currentTime = System.currentTimeMillis()
    val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT_MS).toArray
    for (worker <- toRemove) {
      if (worker.state != WorkerState.DEAD) {
        logWarning("Removing %s because we got no heartbeat in %d seconds".format(
          worker.id, WORKER_TIMEOUT_MS / 1000))
        removeWorker(worker)
      } else {
        if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)) {
          workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
        }
      }
    }
  }

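The Master collects every managed Worker whose last heartbeat is older than WORKER_TIMEOUT_MS. A Worker that is not yet DEAD is handed to removeWorker. A Worker that is already DEAD stays in workers for a while (so it remains visible in the UI) and is finally culled once its last heartbeat is older than (REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS.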
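
The two-stage sweep can be modeled with explicit timestamps. In this sketch the constants are assumed values for illustration (the real ones come from configuration), WorkerInfo is a simplified hypothetical class, and removeWorker is reduced to setting a dead flag.

```scala
import scala.collection.mutable

object TimeoutSweepSketch {
  class WorkerInfo(val id: String, var lastHeartbeat: Long, var dead: Boolean = false)

  // Assumed values for illustration only.
  val TimeoutMs = 60000L
  val ReaperIterations = 15

  def sweep(workers: mutable.HashSet[WorkerInfo], now: Long): Unit = {
    // Copy to an array so the set is not modified while it is being iterated.
    val expired = workers.filter(_.lastHeartbeat < now - TimeoutMs).toArray
    for (w <- expired) {
      if (!w.dead) {
        // First stage: mark the Worker dead but keep it listed.
        w.dead = true
      } else if (w.lastHeartbeat < now - (ReaperIterations + 1) * TimeoutMs) {
        // Second stage: the dead Worker has been listed long enough; cull it.
        workers -= w
      }
    }
  }
}
```

With these assumed values, a Worker is marked dead after 60 seconds of silence and disappears from the set after 16 minutes of silence.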
