Spark source code study (6): how a Worker registers with the Master

Background

         When Worker instances are started with start-slaves.sh, what actually runs is the Worker class defined in Worker.scala. Once it is up, the Worker registers with the Master. Note that an Executor does not register with the Master when it starts; the reason is covered in a separate blog post. The Worker-to-Master registration process is described below.

Files involved:

             (1) Worker.scala

             (2) Master.scala


Main text

0.Worker.scala

      The class declaration shows that Worker extends ThreadSafeRpcEndpoint. This means the Worker is a message loop: any component that holds a reference to it can send it messages, which it then handles in its receive methods. The same is true of the Master, and this message passing is the main way the Master and the Worker communicate.

private[deploy] class Worker(
    override val rpcEnv: RpcEnv,
    webUiPort: Int,
    cores: Int,
    memory: Int,
    masterRpcAddresses: Array[RpcAddress],
    endpointName: String,
    workDirPath: String = null,
    val conf: SparkConf,
    val securityMgr: SecurityManager)
  extends ThreadSafeRpcEndpoint with Logging
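
To make the "message loop" idea concrete, here is a minimal sketch of two endpoints that talk only through references to each other's mailboxes. It is a simplified model and not Spark's RpcEndpoint API: the names Endpoint, Ref, ToyMaster, ToyWorker and processPending are hypothetical, and the messages are drained by hand instead of by a dispatcher thread.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical, simplified model of endpoints and references (not Spark's API).
object EndpointSketch {
  sealed trait Message
  final case class Register(workerId: String, replyTo: Ref) extends Message
  case object Registered extends Message

  // A reference is a handle that can only enqueue messages for its endpoint.
  final class Ref(mailbox: LinkedBlockingQueue[Message]) {
    def send(msg: Message): Unit = mailbox.put(msg)
  }

  // An endpoint owns a mailbox and handles its messages one at a time.
  abstract class Endpoint {
    private val mailbox = new LinkedBlockingQueue[Message]()
    val self: Ref = new Ref(mailbox)

    def receive(msg: Message): Unit

    // Handle everything currently queued, in arrival order; returns the count.
    def processPending(): Int = {
      var handled = 0
      var next = Option(mailbox.poll())
      while (next.isDefined) {
        receive(next.get)
        handled += 1
        next = Option(mailbox.poll())
      }
      handled
    }
  }

  final class ToyMaster extends Endpoint {
    var workerIds: Vector[String] = Vector.empty
    def receive(msg: Message): Unit = msg match {
      case Register(id, replyTo) =>
        workerIds :+= id
        replyTo.send(Registered) // reply through the sender's reference
      case Registered => ()      // not meaningful for a master; ignored
    }
  }

  final class ToyWorker(val id: String) extends Endpoint {
    var registered = false
    def register(master: Ref): Unit = master.send(Register(id, self))
    def receive(msg: Message): Unit = msg match {
      case Registered  => registered = true
      case _: Register => ()     // a worker does not accept registrations
    }
  }
}
```

Calling worker.register(master.self) only enqueues a message; nothing changes on either side until each endpoint processes its own mailbox, which is why the rest of the registration flow is written as a request message followed by a separate response message.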

1.Worker.main()

       The main method lives in Worker's companion object. It creates the Worker instance indirectly, through startRpcEnvAndEndpoint(). The code of main is as follows:

def main(argStrings: Array[String]) {
    Utils.initDaemon(log)
    val conf = new SparkConf

    //Parse the command-line arguments and configuration
    val args = new WorkerArguments(argStrings, conf)

    //Start the RpcEnv and set up the endpoint; the Worker object is instantiated inside this method, because the Worker itself is an endpoint
    val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores,
      args.memory, args.masters, args.workDir, conf = conf)
    rpcEnv.awaitTermination()
}
     The key call in main is startRpcEnvAndEndpoint(). It first creates an RpcEnv from the host, port and conf, and then calls rpcEnv.setupEndpoint() to register a newly created Worker object as an endpoint. The code is as follows:
       
  def startRpcEnvAndEndpoint(
      host: String,
      port: Int,
      webUiPort: Int,
      cores: Int,
      memory: Int,
      masterUrls: Array[String],
      workDir: String,
      workerNumber: Option[Int] = None,
      conf: SparkConf = new SparkConf): RpcEnv = {

    // The LocalSparkCluster runs multiple local sparkWorkerX RPC Environments
    val systemName = SYSTEM_NAME + workerNumber.map(_.toString).getOrElse("")
    val securityMgr = new SecurityManager(conf)
    //Create rpcEnv
    val rpcEnv = RpcEnv.create(systemName, host, port, conf, securityMgr)
    val masterAddresses = masterUrls.map(RpcAddress.fromSparkURL(_))

    //Instantiate the Worker object and register it as an endpoint of rpcEnv
    rpcEnv.setupEndpoint(ENDPOINT_NAME, new Worker(rpcEnv, webUiPort, cores, memory,
      masterAddresses, ENDPOINT_NAME, workDir, conf, securityMgr))
    rpcEnv
  }

2.Worker.onStart()

        The Worker here is not the companion object but the instance created in the previous step. Because it extends ThreadSafeRpcEndpoint, its onStart() method is executed once the endpoint has been set up. The code is shown below; the core call is registerWithMaster():

override def onStart() {
    assert(!registered)
    logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
      host, port, cores, Utils.megabytesToString(memory)))
    logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
    logInfo("Spark home: " + sparkHome)

    //Create the work directory. The default is SPARK_HOME/work; it can be overridden with --work-dir at startup or with the SPARK_WORKER_DIR environment variable
    createWorkDir()

    //Start the shuffle service (only if it is enabled)
    shuffleService.startIfEnabled()

    //Create the Worker's own web UI
    webUi = new WorkerWebUI(this, workDir, webUiPort)
    webUi.bind()

    workerWebUiUrl = s"http://$publicAddress:${webUi.boundPort}"


    //Core method, register with Master
    registerWithMaster()

    metricsSystem.registerSource(workerSource)
    metricsSystem.start()
    // Attach the worker metrics servlet handler to the web ui after the metrics system is started.

    //The metrics servlet handlers are attached to this Worker's own web UI
    metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
  }

3.Worker.registerWithMaster()

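       registerWithMaster() calls tryRegisterAllMasters() to register with every configured Master. It also schedules a timer that periodically sends ReregisterWithMaster to the Worker itself, so that registration is retried until it succeeds. The registrationRetryTimer check ensures that only one such timer exists at a time.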

private def registerWithMaster() {
    // onDisconnected may be triggered multiple times, so don't attempt registration
    // if there are outstanding registration attempts scheduled.
    registrationRetryTimer match {
      case None =>
        registered = false


        //Core code, register with all Masters
        registerMasterFutures = tryRegisterAllMasters()
        connectionAttemptCount = 0
        registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
          new Runnable {
            override def run(): Unit = Utils.tryLogNonFatalError {
              Option(self).foreach(_.send(ReregisterWithMaster))
            }
          },
          INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
          INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
          TimeUnit.SECONDS))
      case Some(_) =>
        logInfo("Not spawning another attempt to register with the master, since there is an" +
          " attempt scheduled already.")
    }
  }
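
The Option-based guard around the retry timer can be isolated in a small sketch. This is an illustration under stated assumptions and not Spark code: Registrar, immediateAttempts, cancelRetries and shutdown are hypothetical names, the scheduled task is a no-op stand-in for sending ReregisterWithMaster, and a daemon thread is used so the sketch never keeps the JVM alive.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, ThreadFactory, TimeUnit}

// Hypothetical sketch of "start at most one retry timer" (not Spark code).
object RetryGuardSketch {
  final class Registrar(intervalSeconds: Long) {
    private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "retry-sketch")
        t.setDaemon(true)
        t
      }
    })

    // None means that no retry timer is currently scheduled.
    private var retryTimer: Option[ScheduledFuture[_]] = None
    var immediateAttempts = 0

    // Returns true if a new registration round was started, false if one is pending.
    def registerWithMaster(): Boolean = synchronized {
      retryTimer match {
        case None =>
          immediateAttempts += 1 // stand-in for tryRegisterAllMasters()
          retryTimer = Some(scheduler.scheduleAtFixedRate(
            new Runnable { override def run(): Unit = () }, // stand-in for the retry message
            intervalSeconds, intervalSeconds, TimeUnit.SECONDS))
          true
        case Some(_) =>
          false // a retry is already scheduled, so do not start another
      }
    }

    // Called once registration has succeeded (or been abandoned).
    def cancelRetries(): Unit = synchronized {
      retryTimer.foreach(_.cancel(true))
      retryTimer = None
    }

    def shutdown(): Unit = scheduler.shutdownNow()
  }
}
```

With this shape, a second call to registerWithMaster() while a timer is pending is a no-op, which matches the "Not spawning another attempt" branch in the listing above.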

4.Worker.tryRegisterAllMasters()

        Because there can be multiple Masters (for example in a high-availability deployment), the registration attempt for each Master is submitted to a thread pool.
  private def tryRegisterAllMasters(): Array[JFuture[_]] = {
    masterRpcAddresses.map { masterAddress =>

      //Because there may be multiple masters, a thread pool is used here
      registerMasterThreadPool.submit(new Runnable {
        override def run(): Unit = {
          try {
            logInfo("Connecting to master " + masterAddress + "...")
            val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)

            //Use masterEndPoint to register
            sendRegisterMessageToMaster(masterEndpoint)
          } catch {
            case ie: InterruptedException => // Cancelled
            case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
          }
        }
      })
    }
  }
    In Spark 2.2, sendRegisterMessageToMaster only sends a RegisterWorker message to the Master and does nothing else. The Master replies through the Worker's endpoint reference (the self argument in the message). If registration does not succeed, the retry timer set up in registerWithMaster triggers another attempt.
private def sendRegisterMessageToMaster(masterEndpoint: RpcEndpointRef): Unit = {
    masterEndpoint.send(RegisterWorker(
      workerId,
      host,
      port,
      self,
      cores,
      memory,
      workerWebUiUrl,
      masterEndpoint.address))
  }

5.Master.receive()

  The RegisterWorker message sent above through masterEndpoint is handled in the Master's receive method. The code is as follows:
case RegisterWorker(
      id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
      logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
        workerHost, workerPort, cores, Utils.megabytesToString(memory)))

      //A Master in STANDBY state does not accept registrations
      if (state == RecoveryState.STANDBY) {
        workerRef.send(MasterInStandby)
      } else if (idToWorker.contains(id)) {

        // already registered
        workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
      } else {

        // Formally register
        val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
          workerRef, workerWebUiUrl)

        //Do the actual registration
        if (registerWorker(worker)) {
          persistenceEngine.addWorker(worker)

          //return the registration information response
          workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))


          //A new worker has joined, so run scheduling again
          schedule()
        } else {
          val workerAddress = worker.endpoint.address
          logWarning("Worker registration failed. Attempted to re-register worker at same " +
            "address: " + workerAddress)
          workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
            + workerAddress))
        }
      }

        registerWorker is where the registration is actually carried out on the Master: it checks the state of any existing worker at the same address and then adds the new worker to the Master's data structures. Let's take a look:

private def registerWorker(worker: WorkerInfo): Boolean = {
    // There may be one or more refs to dead workers on this same node (w/ different ID's),
    // remove them.


    //Remove DEAD entries for the same host and port
    workers.filter { w =>
      (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
    }.foreach { w =>
      workers -= w
    }
    
    //Handle an existing worker at the same address (accepted only if the old one is UNKNOWN)
    val workerAddress = worker.endpoint.address
    if (addressToWorker.contains(workerAddress)) {
      val oldWorker = addressToWorker(workerAddress)
      if (oldWorker.state == WorkerState.UNKNOWN) {
        // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
        // The old worker must thus be dead, so we will remove it and accept the new worker.
        removeWorker(oldWorker)
      } else {
        logInfo("Attempted to re-register worker at same address: " + workerAddress)
        return false
      }
    }
    

    //Add worker information
    workers += worker
    idToWorker(worker.id) = worker
    addressToWorker(workerAddress) = worker
    if (reverseProxy) {
       webUi.addProxyTargets(worker.id, worker.webUiAddress)
    }
    true
  }

   After registerWorker returns true, persistenceEngine.addWorker(worker) is also called. It persists the worker's information, for example in ZooKeeper, so that it can be recovered later.
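
The role of the persistence step can be sketched with a small store interface. This is a hypothetical model and not Spark's PersistenceEngine: WorkerStore, InMemoryStore, WorkerRecord and the "worker_" key prefix are assumptions chosen for illustration, and a real deployment would back the store with ZooKeeper or the file system instead of a map.

```scala
import scala.collection.mutable

// Hypothetical model of persisting worker info for later recovery (not Spark's API).
object PersistenceSketch {
  final case class WorkerRecord(id: String, host: String, port: Int)

  trait WorkerStore {
    def addWorker(w: WorkerRecord): Unit
    def removeWorker(id: String): Unit
    def readWorkers(): Seq[WorkerRecord]
  }

  // In-memory stand-in for a durable backend.
  final class InMemoryStore extends WorkerStore {
    private val entries = mutable.LinkedHashMap.empty[String, WorkerRecord]

    def addWorker(w: WorkerRecord): Unit = {
      entries.update("worker_" + w.id, w)
    }

    def removeWorker(id: String): Unit = {
      entries.remove("worker_" + id)
      ()
    }

    // A recovering master would read this list to rebuild its worker table.
    def readWorkers(): Seq[WorkerRecord] = entries.values.toList
  }
}
```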

6.Worker.receive()

     Finally, the Master uses its reference to the Worker's endpoint to send back the registration response (RegisteredWorker on success). The Worker's receive method contains the following case:

case msg: RegisterWorkerResponse =>
      handleRegisterResponse(msg)
     handleRegisterResponse is where the Worker finishes the registration. On success it mainly does two things:
     a. Sets the Worker's registered variable to true
     b. Starts the periodic task that sends heartbeats to the Master
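
The response handling can be sketched as a pure state transition. The three response types are the ones sent in the Master.receive listing above; everything else is an assumption made for illustration: WorkerState, handle and the masterUrl field are hypothetical, and a failed registration is recorded in lastError here, whereas the real Worker may react differently.

```scala
// Hypothetical sketch of handling the registration response (not Spark code).
object RegisterResponseSketch {
  sealed trait RegisterWorkerResponse
  final case class RegisteredWorker(masterUrl: String) extends RegisterWorkerResponse
  final case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse
  case object MasterInStandby extends RegisterWorkerResponse

  final case class WorkerState(
      registered: Boolean,
      heartbeating: Boolean,
      activeMaster: Option[String],
      lastError: Option[String])

  val initial: WorkerState = WorkerState(false, false, None, None)

  def handle(state: WorkerState, msg: RegisterWorkerResponse): WorkerState = msg match {
    case RegisteredWorker(url) =>
      // a. mark the worker as registered  b. start heartbeating to this master
      state.copy(registered = true, heartbeating = true,
        activeMaster = Some(url), lastError = None)
    case RegisterWorkerFailed(message) =>
      // Assumption: a failure after a successful registration is ignored.
      if (state.registered) state else state.copy(lastError = Some(message))
    case MasterInStandby =>
      state // a standby master cannot register the worker; keep waiting
  }
}
```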


Summary:

     A Worker is normally started by an administrator who runs a script by hand. The script (start-slaves.sh, under sbin/) eventually invokes the main method in Worker's companion object, which instantiates a Worker. Because the Worker is a ThreadSafeRpcEndpoint, its onStart method runs once the endpoint is set up, and onStart kicks off registration. Throughout the process, the Worker and the Master communicate by sending messages through the endpoint references they hold to each other.

    After that, the Worker receives requests from the Master to launch Executors. It creates the executor backend, and from then on the ExecutorBackend communicates with the Driver.
