Background
When start-slaves.sh is used to start a Worker, what is actually started is an instance of Worker.scala. Once started, the Worker registers with the Master. Note that an Executor does not register with the Master when it starts; the reason is covered in a separate blog post. The Worker's registration process is described below.
Files involved:
(1) Worker.scala
(2) Master.scala
0.Worker.scala
As the class declaration shows, Worker extends ThreadSafeRpcEndpoint. This means the Worker is a message loop: anyone holding a reference to it can send it messages, which it then processes. The same is true of the Master, and this is the main way the Master and the Worker communicate.

    private[deploy] class Worker(
        override val rpcEnv: RpcEnv,
        webUiPort: Int,
        cores: Int,
        memory: Int,
        masterRpcAddresses: Array[RpcAddress],
        endpointName: String,
        workDirPath: String = null,
        val conf: SparkConf,
        val securityMgr: SecurityManager)
      extends ThreadSafeRpcEndpoint with Logging
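To make the "message loop" idea concrete, here is a minimal sketch that does not use Spark at all. Every name in it (EndpointSketch, ToyEndpoint, Greet, Stop) is hypothetical and chosen only for illustration; it is a toy model of an endpoint with an inbox, not Spark's RpcEnv implementation.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Toy model only: all names here are hypothetical, not part of Spark's RPC API.
object EndpointSketch {
  sealed trait Message
  case class Greet(name: String) extends Message
  case object Stop extends Message

  // An endpoint owns an inbox; callers only hold a reference that enqueues messages.
  final class ToyEndpoint {
    private val inbox = new LinkedBlockingQueue[Message]()
    val handled = ArrayBuffer.empty[String]

    // The "reference" side: sending just places a message in the inbox.
    def send(msg: Message): Unit = inbox.put(msg)

    // The message loop: messages are handled one at a time, in arrival order,
    // so the handler itself needs no internal locking.
    def runLoop(): Unit = {
      var running = true
      while (running) {
        inbox.take() match {
          case Greet(name) => handled += s"hello, $name"
          case Stop        => running = false
        }
      }
    }
  }

  // Enqueue three messages, then drain them with the loop.
  def demo(): List[String] = {
    val endpoint = new ToyEndpoint
    endpoint.send(Greet("master"))
    endpoint.send(Greet("worker"))
    endpoint.send(Stop)
    endpoint.runLoop()
    endpoint.handled.toList
  }
}
```

In this sketch the sender never calls the handler directly; it only enqueues a message, which is the same shape as the Worker and the Master sending messages to each other through endpoint references.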
1.Worker.main()
The main method is located in the Worker's companion object, and it is here that a new Worker object is created. The code of the main method is as follows:

    def main(argStrings: Array[String]) {
      Utils.initDaemon(log)
      val conf = new SparkConf
      // Parse the startup arguments
      val args = new WorkerArguments(argStrings, conf)
      // Start the RpcEnv and the endpoint. The Worker object is instantiated inside
      // this method, because the Worker itself is an endpoint
      val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores,
        args.memory, args.masters, args.workDir, conf = conf)
      rpcEnv.awaitTermination()
    }

The key call in main is startRpcEnvAndEndpoint(). It first creates an RpcEnv from the host, port and conf, and then calls rpcEnv.setupEndpoint() to register a new Worker object as an endpoint. The code is as follows:

    def startRpcEnvAndEndpoint(
        host: String,
        port: Int,
        webUiPort: Int,
        cores: Int,
        memory: Int,
        masterUrls: Array[String],
        workDir: String,
        workerNumber: Option[Int] = None,
        conf: SparkConf = new SparkConf): RpcEnv = {
      // The LocalSparkCluster runs multiple local sparkWorkerX RPC Environments
      val systemName = SYSTEM_NAME + workerNumber.map(_.toString).getOrElse("")
      val securityMgr = new SecurityManager(conf)
      // Create the RpcEnv
      val rpcEnv = RpcEnv.create(systemName, host, port, conf, securityMgr)
      val masterAddresses = masterUrls.map(RpcAddress.fromSparkURL(_))
      // Instantiate the Worker object and register it as an endpoint
      rpcEnv.setupEndpoint(ENDPOINT_NAME, new Worker(rpcEnv, webUiPort, cores, memory,
        masterAddresses, ENDPOINT_NAME, workDir, conf, securityMgr))
      rpcEnv
    }
2.Worker.onStart()
The Worker here is not the companion object but the instance created in the previous step. Because it extends ThreadSafeRpcEndpoint, its onStart() method is executed once the endpoint has been set up. The code is shown below; the core call is registerWithMaster():

    override def onStart() {
      assert(!registered)
      logInfo("Starting Spark worker %s:%d with %d cores, %s RAM".format(
        host, port, cores, Utils.megabytesToString(memory)))
      logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
      logInfo("Spark home: " + sparkHome)
      // Create the work directory. It defaults to SPARK_HOME/work and can be changed
      // with --work-dir at startup or through an environment variable
      createWorkDir()
      // Start the shuffle service, if enabled
      shuffleService.startIfEnabled()
      // Create the Worker's own web UI
      webUi = new WorkerWebUI(this, workDir, webUiPort)
      webUi.bind()

      workerWebUiUrl = s"http://$publicAddress:${webUi.boundPort}"
      // Core method: register with the Master
      registerWithMaster()

      metricsSystem.registerSource(workerSource)
      metricsSystem.start()
      // Attach the worker metrics servlet handler to the web ui after the metrics system is started.
      metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
    }
3.Worker.registerWithMaster()
registerWithMaster calls tryRegisterAllMasters() to register with all Masters. It also schedules a timer that periodically sends a ReregisterWithMaster message to the Worker itself, so that registration is retried if it does not succeed.

    private def registerWithMaster() {
      // onDisconnected may be triggered multiple times, so don't attempt registration
      // if there are outstanding registration attempts scheduled.
      registrationRetryTimer match {
        case None =>
          registered = false
          // Core code: register with all Masters
          registerMasterFutures = tryRegisterAllMasters()
          connectionAttemptCount = 0
          registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
            new Runnable {
              override def run(): Unit = Utils.tryLogNonFatalError {
                Option(self).foreach(_.send(ReregisterWithMaster))
              }
            },
            INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
            INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
            TimeUnit.SECONDS))
        case Some(_) =>
          logInfo("Not spawning another attempt to register with the master, since there is an" +
            " attempt scheduled already.")
      }
    }
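The guard on registrationRetryTimer is the part that is easy to miss: a retry timer is created only if none is outstanding. The pattern can be sketched on its own with the JDK scheduler. This is a simplified illustration under stated assumptions, not the Worker's code: RetryTimerSketch, startRegistration, cancel and shutdown are hypothetical names, and the "registration attempt" is reduced to incrementing a counter.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch of "schedule retries only once"; not Spark code.
object RetryTimerSketch {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  // None means no retry timer is outstanding.
  private var retryTimer: Option[ScheduledFuture[_]] = None
  // Stand-in for real registration attempts: we only count them.
  val attempts = new AtomicInteger(0)

  // Returns true if a new round of registration was started,
  // false if a retry timer was already scheduled.
  def startRegistration(intervalMillis: Long): Boolean = synchronized {
    retryTimer match {
      case None =>
        attempts.incrementAndGet() // the immediate first attempt
        retryTimer = Some(scheduler.scheduleAtFixedRate(
          new Runnable {
            // each tick stands for one re-registration attempt
            override def run(): Unit = attempts.incrementAndGet()
          },
          intervalMillis, intervalMillis, TimeUnit.MILLISECONDS))
        true
      case Some(_) =>
        false
    }
  }

  // Cancel the outstanding timer, for example once registration has succeeded.
  def cancel(): Unit = synchronized {
    retryTimer.foreach(_.cancel(true))
    retryTimer = None
  }

  // Stop the scheduler thread so the JVM can exit.
  def shutdown(): Unit = scheduler.shutdownNow()
}
```

Calling startRegistration twice in a row starts only one timer; a new one can be started only after cancel() has cleared the old one.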
4.Worker.tryRegisterAllMasters()
Because there can be multiple Masters, a thread pool is used here: one registration task is submitted per Master address.

    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      masterRpcAddresses.map { masterAddress =>
        // Because there may be multiple Masters, a thread pool is used here
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = {
            try {
              logInfo("Connecting to master " + masterAddress + "...")
              val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
              // Use the Master's endpoint reference to register
              sendRegisterMessageToMaster(masterEndpoint)
            } catch {
              case ie: InterruptedException => // Cancelled
              case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
            }
          }
        })
      }
    }

In Spark 2.2, sendRegisterMessageToMaster only sends a RegisterWorker message to the Master and does nothing else. The Master replies through the Worker's endpoint reference (the self argument in the message). If registration fails, the retry logic set up in registerWithMaster tries again.

    private def sendRegisterMessageToMaster(masterEndpoint: RpcEndpointRef): Unit = {
      masterEndpoint.send(RegisterWorker(
        workerId,
        host,
        port,
        self,
        cores,
        memory,
        workerWebUiUrl,
        masterEndpoint.address))
    }
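The "one task per Master" pattern can be tried out without a cluster. The sketch below is an illustration under assumptions: MultiMasterSketch, connect and tryAll are hypothetical names, and connect merely builds a string where the real code would look up the Master's endpoint and send RegisterWorker. Unlike the Worker, which keeps the futures around so they can be cancelled, this sketch waits for each result so that the outcome can be inspected.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit, Future => JFuture}
import scala.util.control.NonFatal

// Hypothetical sketch: submit one connection task per master address.
object MultiMasterSketch {
  // Stand-in for "connect to a master and send RegisterWorker".
  def connect(address: String): String = {
    if (address.isEmpty) throw new IllegalArgumentException("empty master address")
    s"sent RegisterWorker to $address"
  }

  // Returns Some(result) for each master that was reached and None for each failure.
  def tryAll(masters: Array[String]): Array[Option[String]] = {
    val pool = Executors.newFixedThreadPool(math.max(1, masters.length))
    try {
      val futures: Array[JFuture[Option[String]]] = masters.map { address =>
        pool.submit(new Callable[Option[String]] {
          override def call(): Option[String] =
            try Some(connect(address))
            catch { case NonFatal(_) => None } // a failure for one master does not affect the others
        })
      }
      // Wait for every task; the timeout only guards against a hung task.
      futures.map(_.get(5, TimeUnit.SECONDS))
    } finally {
      pool.shutdown()
    }
  }
}
```

A failure to reach one address is contained in its own task, which mirrors the try/catch inside each Runnable in tryRegisterAllMasters.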
5.Master.receive()
The RegisterWorker message sent above through the Master's endpoint reference is handled in the Master's receive method. The code is as follows:

    case RegisterWorker(
        id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl, masterAddress) =>
      logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
        workerHost, workerPort, cores, Utils.megabytesToString(memory)))
      // The Master is in standby, so it does not register the worker
      if (state == RecoveryState.STANDBY) {
        workerRef.send(MasterInStandby)
      } else if (idToWorker.contains(id)) {
        // Already registered
        workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
      } else {
        // Register the worker
        val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
          workerRef, workerWebUiUrl)
        // The method that performs the registration
        if (registerWorker(worker)) {
          persistenceEngine.addWorker(worker)
          // Send the registration response back to the worker
          workerRef.send(RegisteredWorker(self, masterWebUiUrl, masterAddress))
          // A new worker has been added, so reschedule
          schedule()
        } else {
          val workerAddress = worker.endpoint.address
          logWarning("Worker registration failed. Attempted to re-register worker at same " +
            "address: " + workerAddress)
          workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
            + workerAddress))
        }
      }
registerWorker is where the Master actually performs the registration: it removes stale entries for the same node, checks the state of any worker already known at the same address, and adds the new worker to the Master's data structures. Let's take a look:

    private def registerWorker(worker: WorkerInfo): Boolean = {
      // There may be one or more refs to dead workers on this same node (w/ different ID's),
      // remove them.
      // Handling of DEAD workers
      workers.filter { w =>
        (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
      }.foreach { w =>
        workers -= w
      }

      // Handling of UNKNOWN workers
      val workerAddress = worker.endpoint.address
      if (addressToWorker.contains(workerAddress)) {
        val oldWorker = addressToWorker(workerAddress)
        if (oldWorker.state == WorkerState.UNKNOWN) {
          // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
          // The old worker must thus be dead, so we will remove it and accept the new worker.
          removeWorker(oldWorker)
        } else {
          logInfo("Attempted to re-register worker at same address: " + workerAddress)
          return false
        }
      }

      // Add the worker information
      workers += worker
      idToWorker(worker.id) = worker
      addressToWorker(workerAddress) = worker
      if (reverseProxy) {
        webUi.addProxyTargets(worker.id, worker.webUiAddress)
      }
      true
    }
After registerWorker returns successfully, persistenceEngine.addWorker(worker) is also called to persist the worker information, for example in ZooKeeper.
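The Master's decision can be summarized as three checks in order: standby, duplicate ID, duplicate address. The following sketch models only those checks with plain collections. It is a deliberate simplification with hypothetical names (MasterDecisionSketch, ToyMaster, WorkerRecord, Failed) and leaves out the DEAD and UNKNOWN state handling, persistence and scheduling shown above.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the Master's registration decision.
object MasterDecisionSketch {
  sealed trait Reply
  case object Registered extends Reply
  case object MasterInStandby extends Reply
  case class Failed(reason: String) extends Reply

  // Only the two fields the checks need: an ID and a "host:port" address.
  case class WorkerRecord(id: String, address: String)

  final class ToyMaster(var standby: Boolean = false) {
    private val idToWorker = mutable.HashMap.empty[String, WorkerRecord]
    private val addressToWorker = mutable.HashMap.empty[String, WorkerRecord]

    def register(worker: WorkerRecord): Reply =
      if (standby) {
        // A standby master accepts no registrations
        MasterInStandby
      } else if (idToWorker.contains(worker.id)) {
        Failed("Duplicate worker ID")
      } else if (addressToWorker.contains(worker.address)) {
        // Simplification: the real code first checks whether the old worker is UNKNOWN
        Failed("Attempted to re-register worker at same address: " + worker.address)
      } else {
        idToWorker(worker.id) = worker
        addressToWorker(worker.address) = worker
        Registered
      }
  }
}
```

Each branch corresponds to one of the three replies the Master can send back to the Worker: MasterInStandby, RegisterWorkerFailed or RegisteredWorker.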
6.Worker.receive()
Finally, the Master uses its reference to the Worker's endpoint (workerRef) to send back the registration response. The Worker's receive method contains the following case:

    case msg: RegisterWorkerResponse =>
      handleRegisterResponse(msg)

handleRegisterResponse is where the Worker handles the outcome of the registration. On success it performs two main operations:
a. Set the registered variable of the current Worker to true
b. Start the thread that sends heartbeats
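The two operations above can be sketched as a pattern match over the possible replies. This is a stripped-down illustration rather than the Worker's real handler: ToyWorker and its fields are hypothetical, the message types are redefined locally with fewer fields than Spark's, the heartbeat is reduced to a flag, and recording the failure message in lastError is an assumption made for the sake of the example.

```scala
// Hypothetical sketch of handling a registration response; not Spark code.
object RegisterResponseSketch {
  sealed trait RegisterWorkerResponse
  case class RegisteredWorker(masterUrl: String) extends RegisterWorkerResponse
  case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse
  case object MasterInStandby extends RegisterWorkerResponse

  final class ToyWorker {
    var registered = false
    var heartbeatStarted = false
    var lastError: Option[String] = None

    def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = msg match {
      case RegisteredWorker(_) =>
        registered = true       // a. mark this worker as registered
        heartbeatStarted = true // b. stand-in for starting the heartbeat thread
      case RegisterWorkerFailed(message) =>
        // Assumption for this sketch: just remember why registration failed
        if (!registered) lastError = Some(message)
      case MasterInStandby =>
        () // nothing to do; a standby master does not register workers
    }
  }
}
```

Because the trait is sealed, the compiler can check that every kind of reply is covered by the match.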
Summary:
A Worker is generally started manually by an administrator using a script. The script (under sbin/) first runs the main method in the Worker's companion object, which instantiates a Worker object. Because the Worker is a ThreadSafeRpcEndpoint, its onStart method is then executed, and registration is initiated from there. Throughout the process, the Master and the Worker communicate mainly through the endpoint references they hold to each other.
After that, the Worker receives the Master's requests to launch executors and creates the executor backend; from then on, communication takes place between the ExecutorBackend and the driver.