Spark: Master Internals and Source Code Analysis

Master Failover (Active/Standby Switching)

Spark's native standalone mode supports active/standby failover: more than one Master can be configured, and when the Active Master goes down due to a failure, the system automatically switches a Standby Master to become the new Active Master.

There are two kinds of Master failover:

  1. File-system based: Spark provides a directory that stores the registration information of Spark applications and workers and writes their recovery state into it. When the Master node goes down, restarting the Master lets it read back the application and worker registration information. The switch has to be performed manually.
  2. ZooKeeper based, intended for production. The basic idea is to use ZooKeeper to elect one Master while the other Masters stay in Standby state. Connect the standalone cluster to the same ZooKeeper ensemble and start multiple Masters; using ZooKeeper's election and state-storage facilities, one Master is elected and the others remain Standby. If the current Master dies, another Master is elected, restores the old Master's state, and then resumes scheduling. The whole recovery may take 1-2 minutes. (A configuration sketch follows this list.)
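
To make this concrete, here is a minimal configuration sketch (hedged: the property names below are the standard standalone-HA settings, but the hosts and directories are made-up placeholders). They are normally passed to the Master and Workers through SPARK_DAEMON_JAVA_OPTS in spark-env.sh:

# ZooKeeper-based HA (hosts and paths are placeholders)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark-ha"

# File-system based recovery (manual failover; directory is a placeholder)
# export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
#   -Dspark.deploy.recoveryDirectory=/var/spark/recovery"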


Recovery flow:

  1. After the Active Master goes down, the new Master's persistence engine (FileSystemPersistenceEngine or ZookeeperPersistenceEngine) first reads the persisted storedApps, storedDrivers and storedWorkers.
  2. If any of storedApps, storedDrivers or storedWorkers is non-empty,
  3. the persisted Application, Driver and Worker information is re-registered into the Master's internal in-memory caches.
  4. The state of every Application and Worker is set to UNKNOWN, and the new (formerly Standby) Master's address is sent to each Application's Driver and to each Worker.
  5. If a Driver or Worker is still running normally, it receives the new Master's address and replies with a response message to the new Master.
  6. As the Master receives the responses from Drivers and Workers, it calls completeRecovery() to deal with the Drivers and Workers that did not respond, filtering their information out.
  7. Finally the Master's schedule() method is called to schedule the pending Drivers and Applications: launching drivers on workers, and launching executors on workers for applications.

Source code analysis:
Source: org/apache/spark/deploy/master/Master.scala

override def receive: PartialFunction[Any, Unit] = {
    case ElectedLeader => {
      // read the persisted storedApps, storedDrivers and storedWorkers
      val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv)
      // if any of storedApps, storedDrivers or storedWorkers is non-empty, the state becomes RECOVERING
      state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
        RecoveryState.ALIVE
      } else {
        RecoveryState.RECOVERING
      }
      logInfo("I have been elected leader! New state: " + state)
      if (state == RecoveryState.RECOVERING) {
        // begin Master recovery
        beginRecovery(storedApps, storedDrivers, storedWorkers)
        recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(CompleteRecovery)
          }
        }, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
      }
    }
    // complete the Master failover (recovery)
    case CompleteRecovery => completeRecovery()

......

}

Step 1: begin Master recovery (beginRecovery)

/**
   * Begin Master recovery
   */
  private def beginRecovery(storedApps: Seq[ApplicationInfo], storedDrivers: Seq[DriverInfo],
    storedWorkers: Seq[WorkerInfo]) {
    for (app <- storedApps) {
      logInfo("Trying to recover app: " + app.id)
      try {
        // register the application
        registerApplication(app)
        app.state = ApplicationState.UNKNOWN
        app.driver.send(MasterChanged(self, masterWebUiUrl))
      } catch {
        case e: Exception => logInfo("App " + app.id + " had exception on reconnect")
      }
    }

    for (driver <- storedDrivers) {
      // Here we just read in the list of drivers. Any drivers associated with now-lost workers
      // will be re-launched when we detect that the worker is missing.
      drivers += driver
    }

    for (worker <- storedWorkers) {
      logInfo("Trying to recover worker: " + worker.id)
      try {
        // register the worker
        registerWorker(worker)
        worker.state = WorkerState.UNKNOWN
        worker.endpoint.send(MasterChanged(self, masterWebUiUrl))
      } catch {
        case e: Exception => logInfo("Worker " + worker.id + " had exception on reconnect")
      }
    }
  }

Step 2: registerApplication, called in Step 1

  private def registerApplication(app: ApplicationInfo): Unit = {
    // get the driver's address
    val appAddress = app.driver.address
    // if the driver's address already exists, this is a duplicate registration, so return immediately
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " + appAddress)
      return
    }

    applicationMetricsSystem.registerSource(app.appSource)
    
    // add the application's information to the in-memory caches
    apps += app
    idToApp(app.id) = app
    endpointToApp(app.driver) = app
    addressToApp(appAddress) = app
    
    // add the application to the waiting-for-scheduling queue; the scheduling order is FIFO
    waitingApps += app
  }

Step 3: registerWorker, called in Step 1

  /**
   * Register a worker
   *
   * Iterate over all managed workers; if one has the same host and port as the newly registering worker and is in the DEAD (timed-out) state, remove it directly from workers.
   * If addressToWorker already contains an entry with the same workerAddress, fetch the old worker: if its state is UNKNOWN, the Master is recovering and the worker is being recovered, so remove the old worker, add the new one and return success.
   * If the old worker is in any other state, this is a duplicate registration and false is returned.
   *
   */
  private def registerWorker(worker: WorkerInfo): Boolean = {
    // There may be one or more refs to dead workers on this same node (w/ different ID's),
    // remove them.
    // remove workers in the DEAD state
    workers.filter { w =>
      (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
    }.foreach { w =>
      //  remove this worker from the workers HashSet
      workers -= w
    }

    // the worker's address
    val workerAddress = worker.endpoint.address
    // if the address already exists, the existing entry is oldWorker
    if (addressToWorker.contains(workerAddress)) {
      val oldWorker = addressToWorker(workerAddress)
      //  if oldWorker's state is UNKNOWN, remove oldWorker
      if (oldWorker.state == WorkerState.UNKNOWN) {
        // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
        // The old worker must thus be dead, so we will remove it and accept the new worker.
        // a worker registering from UNKNOWN means it was restarted during recovery
        // the old worker must therefore be dead, so remove it and accept the new worker
        removeWorker(oldWorker)
      } else {
        logInfo("Attempted to re-register worker at same address: " + workerAddress)
        return false
      }
    }
    //  add the new worker to the HashSet
    workers += worker
    //  add the worker's id and address to the corresponding HashMaps
    idToWorker(worker.id) = worker
    addressToWorker(workerAddress) = worker
    //  return true: the worker registered successfully
    true
  }

Step 4: complete the Master failover (completeRecovery)

/**
   * Complete the Master failover, i.e. finish recovery
   */
  private def completeRecovery() {
    // Ensure "only-once" recovery semantics using a short synchronization period.
    if (state != RecoveryState.RECOVERING) { return }
    state = RecoveryState.COMPLETING_RECOVERY

    // Kill off any workers and apps that didn't respond to us.
    // filter out the applications and workers whose state is still UNKNOWN,
    // then iterate over them and call removeWorker / finishApplication to clean up applications and workers that may have failed or already died
    /**
     * The cleanup does three things:
     *   1. remove them from the in-memory caches
     *   2. remove them from the caches of related components (e.g. drivers running on a removed worker)
     *   3. remove them from the persistence engine
     */
    workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
    apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

    // Reschedule drivers which were not claimed by any workers
    drivers.filter(_.worker.isEmpty).foreach { d =>
      logWarning(s"Driver ${d.id} was not found after master recovery")
      // relaunch the driver (relevant e.g. for Spark Streaming programs submitted with supervise)
      if (d.desc.supervise) {
        logWarning(s"Re-launching ${d.id}")
        relaunchDriver(d)
      } else {
        removeDriver(d.id, DriverState.ERROR, None)
        logWarning(s"Did not re-launch ${d.id} because it was not supervised")
      }
    }

    state = RecoveryState.ALIVE
    // resource scheduling
    schedule()
    logInfo("Recovery complete - resuming operations!")
  }

Step 5: removeWorker, called in Step 4

/**
   * Remove a worker
   */
  private def removeWorker(worker: WorkerInfo) {
    logInfo("Removing worker " + worker.id + " on " + worker.host + ":" + worker.port)
    worker.setState(WorkerState.DEAD)
    // remove it from the in-memory caches (HashMaps)
    idToWorker -= worker.id
    addressToWorker -= worker.endpoint.address
    for (exec <- worker.executors.values) {
      logInfo("Telling app of lost executor: " + exec.id)
      // tell the driver that the executor has been lost
      exec.application.driver.send(ExecutorUpdated(
        exec.id, ExecutorState.LOST, Some("worker lost"), None))
      // remove all executors on this worker from their applications
      exec.application.removeExecutor(exec)
    }
    for (driver <- worker.drivers.values) {
      // when the worker a driver runs on goes down, the driver is removed as well; if the driver was submitted with supervise, the Master relaunches it
      if (driver.desc.supervise) {
        logInfo(s"Re-launching ${driver.id}")
        relaunchDriver(driver)
      } else {
        logInfo(s"Not re-launching ${driver.id} because it was not supervised")
        removeDriver(driver.id, DriverState.ERROR, None)
      }
    }
    // remove the worker from the persistence engine
    persistenceEngine.removeWorker(worker)
  }

Step 6: finishApplication

private def finishApplication(app: ApplicationInfo) {
    removeApplication(app, ApplicationState.FINISHED)
  }

  def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
    // remove the application from the in-memory caches
    if (apps.contains(app)) {
      logInfo("Removing app " + app.id)
      apps -= app
      idToApp -= app.id
      endpointToApp -= app.driver
      addressToApp -= app.driver.address
      if (completedApps.size >= RETAINED_APPLICATIONS) {
        val toRemove = math.max(RETAINED_APPLICATIONS / 10, 1)
        completedApps.take(toRemove).foreach(a => {
          appIdToUI.remove(a.id).foreach { ui => webUi.detachSparkUI(ui) }
          applicationMetricsSystem.removeSource(a.appSource)
        })
        completedApps.trimStart(toRemove)
      }
      completedApps += app // Remember it in our history
      waitingApps -= app

      // If application events are logged, use them to rebuild the UI
      rebuildSparkUI(app)

      for (exec <- app.executors.values) {
        // kill the application's executors
        killExecutor(exec)
      }
      app.markFinished(state)
      if (state != ApplicationState.FINISHED) {
        app.driver.send(ApplicationRemoved(state.toString))
      }
      // remove the application from the persistence engine
      persistenceEngine.removeApplication(app)
      schedule()

      // Tell all workers that the application has finished, so they can clean up any app state.
      workers.foreach { w =>
        w.endpoint.send(ApplicationFinished(app.id))
      }
    }
  }

Step 7: relaunchDriver

  /**
   * Relaunch a driver
   */
  private def relaunchDriver(driver: DriverInfo) {
    driver.worker = None
    // set the driver's state to RELAUNCHING
    driver.state = DriverState.RELAUNCHING
    // add the driver to the waiting queue
    waitingDrivers += driver
    schedule()
  }

Step 8: removeDriver

  /**
   * Remove a driver
   */
  private def removeDriver(
    driverId: String,
    finalState: DriverState,
    exception: Option[Exception]) {
    drivers.find(d => d.id == driverId) match {
      case Some(driver) =>
        logInfo(s"Removing driver: $driverId")
        // remove it from the in-memory cache (HashSet)
        drivers -= driver
        if (completedDrivers.size >= RETAINED_DRIVERS) {
          val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
          completedDrivers.trimStart(toRemove)
        }
        // add the driver to the completed drivers
        completedDrivers += driver
        // remove the driver from the persistence engine
        persistenceEngine.removeDriver(driver)
        // set the driver's state to the final state
        driver.state = finalState
        driver.exception = exception
        // remove the driver from the worker it was running on
        driver.worker.foreach(w => w.removeDriver(driver))
        schedule()
      case None =>
        logWarning(s"Asked to remove unknown driver: $driverId")
    }
  }

Master Registration Mechanism


Worker registration with the Master

  1. After a worker starts up, it registers with the Master.
  2. Workers in the DEAD state are filtered out. For example, if the Master has finished recovery within the time limit but a worker is still UNKNOWN, the Master removes it and marks it DEAD; should that worker inexplicably try to register again much later, it is simply filtered out. For a worker in the UNKNOWN state, the Master clears the old worker information and replaces it with the new information: for instance, right after the Master starts it sends its new address to the workers and marks each of them UNKNOWN, and when a worker replies with its registration, the Master swaps the old worker entry for the new one.
  3. The worker is added to the in-memory cache (a HashMap).
  4. The persistence engine persists the worker information, either to the file system or to ZooKeeper.
  5. schedule() is called to perform scheduling.

Source code analysis:
Step 1: handle the worker's registration request
Source location: org.apache.spark.deploy.master.Master

/**
  * Handle a worker's registration request.
  * Note:
  * Designing worker registration this way has a big benefit: in production, a new worker can be added to an already
  * running Spark cluster and used to increase processing capacity, without restarting the cluster.
  * 
*/
case RegisterWorker(
      id, workerHost, workerPort, workerRef, cores, memory, workerUiPort, publicAddress) => {
      logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
        workerHost, workerPort, cores, Utils.megabytesToString(memory)))
      // check whether this Master is in STANDBY state
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send a response
      } else if (idToWorker.contains(id)) {
        //  if the worker id already exists, reply that registration failed with "Duplicate worker ID"
        workerRef.send(RegisterWorkerFailed("Duplicate worker ID"))
      } else {
        //  wrap the worker's id, host, port, core count, memory, etc. into a WorkerInfo
        val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
          workerRef, workerUiPort, publicAddress)
        //  check whether the worker registration succeeded
        if (registerWorker(worker)) {
          // persist the worker's information through the persistence engine
          persistenceEngine.addWorker(worker)
          workerRef.send(RegisteredWorker(self, masterWebUiUrl))
          // resource scheduling
          schedule()
        } else {
          // worker registration failed
          val workerAddress = worker.endpoint.address
          logWarning("Worker registration failed. Attempted to re-register worker at same " +
            "address: " + workerAddress)
          workerRef.send(RegisterWorkerFailed("Attempted to re-register worker at same address: "
            + workerAddress))
        }
      }
}

Step 2: registerWorker, called in Step 1

  /**
   * Register a worker
   */
  private def registerWorker(worker: WorkerInfo): Boolean = {
    // There may be one or more refs to dead workers on this same node (w/ different ID's),
    // remove them.
    // remove workers in the DEAD state
    workers.filter { w =>
      (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
    }.foreach { w =>
      //  remove this worker from the workers HashSet
      workers -= w
    }
 
    // the worker's address
    val workerAddress = worker.endpoint.address
    // if the address already exists, the existing entry is oldWorker
    if (addressToWorker.contains(workerAddress)) {
      val oldWorker = addressToWorker(workerAddress)
      //  if oldWorker's state is UNKNOWN, remove oldWorker
      if (oldWorker.state == WorkerState.UNKNOWN) {
        // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
        // The old worker must thus be dead, so we will remove it and accept the new worker.
        // a worker registering from UNKNOWN means it was restarted during recovery
        // the old worker must therefore be dead, so remove it and accept the new worker
        removeWorker(oldWorker)
      } else {
        logInfo("Attempted to re-register worker at same address: " + workerAddress)
        return false
      }
    }
    //  add the new worker to the HashSet
    workers += worker
    //  add the worker's id and address to the corresponding HashMaps
    idToWorker(worker.id) = worker
    addressToWorker(workerAddress) = worker
    //  return true: the worker registered successfully
    true
  }

Step 3: removeWorker, called in Step 2

/**
   * Remove a worker
   */
  private def removeWorker(worker: WorkerInfo) {
    logInfo("Removing worker " + worker.id + " on " + worker.host + ":" + worker.port)
    worker.setState(WorkerState.DEAD)
    // remove it from the in-memory caches (HashMaps)
    idToWorker -= worker.id
    addressToWorker -= worker.endpoint.address
    for (exec <- worker.executors.values) {
      logInfo("Telling app of lost executor: " + exec.id)
      // tell the driver that the executor has been lost
      exec.application.driver.send(ExecutorUpdated(
        exec.id, ExecutorState.LOST, Some("worker lost"), None))
      // remove all executors on this worker from their applications
      exec.application.removeExecutor(exec)
    }
    for (driver <- worker.drivers.values) {
      // when the worker a driver runs on goes down, the driver is removed as well; if the driver was submitted with supervise, the Master relaunches it
      if (driver.desc.supervise) {
        logInfo(s"Re-launching ${driver.id}")
        relaunchDriver(driver)
      } else {
        logInfo(s"Not re-launching ${driver.id} because it was not supervised")
        removeDriver(driver.id, DriverState.ERROR, None)
      }
    }
    // remove the worker from the persistence engine
    persistenceEngine.removeWorker(worker)
  }

Step 4: removeExecutor, called in Step 3
Source location: org.apache.spark.deploy.master.ApplicationInfo

 /**
   * Remove the executor from the internal in-memory caches
   */
 private[master] def removeExecutor(exec: ExecutorDesc) {
    if (executors.contains(exec.id)) {
      removedExecutors += executors(exec.id)
      executors -= exec.id
      coresGranted -= exec.cores
    }
  }

Step 5: relaunchDriver, called in Step 3

 /**
   * Relaunch a driver
   */
  private def relaunchDriver(driver: DriverInfo) {
    driver.worker = None
    // set the driver's state to RELAUNCHING
    driver.state = DriverState.RELAUNCHING
    // add the driver to the waiting queue
    waitingDrivers += driver
    schedule()
  }

Step 6: removeDriver, called in Step 3

  /**
   * Remove a driver
   */
  private def removeDriver(
    driverId: String,
    finalState: DriverState,
    exception: Option[Exception]) {
    drivers.find(d => d.id == driverId) match {
      case Some(driver) =>
        logInfo(s"Removing driver: $driverId")
        // remove it from the in-memory cache (HashSet)
        drivers -= driver
        if (completedDrivers.size >= RETAINED_DRIVERS) {
          val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
          completedDrivers.trimStart(toRemove)
        }
        // add the driver to the completed drivers
        completedDrivers += driver
        // remove the driver from the persistence engine
        persistenceEngine.removeDriver(driver)
        // set the driver's state to the final state
        driver.state = finalState
        driver.exception = exception
        // remove the driver from the worker it was running on
        driver.worker.foreach(w => w.removeDriver(driver))
        schedule()
      case None =>
        logWarning(s"Asked to remove unknown driver: $driverId")
    }
  }

Driver registration with the Master

  1. When a Spark application is submitted with spark-submit (in cluster mode), the driver first registers with the Master, and its information is put into the in-memory cache, i.e. a HashMap (see the submission example after this list).
  2. The driver is added to the waiting-for-scheduling queue (an ArrayBuffer).
  3. The persistence engine persists the driver information, either to the file system or to ZooKeeper.
  4. schedule() is called to perform scheduling.
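
Only cluster-mode submissions register a driver with the Master in this way (client-mode drivers run locally and skip this step). As a hedged illustration, such a submission might look like the following, where the master URL, class name and jar path are placeholders; the --supervise flag is what later allows the Master to relaunch the driver:

spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.MyApp \
  /path/to/my-app.jar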

Source code analysis:
Step 1: handle the driver-submission request (RequestSubmitDriver)

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    /**
     * Request to submit a driver; the parameter is a DriverDescription
     */
    case RequestSubmitDriver(description) => {
      // check whether this Master's state is ALIVE
      if (state != RecoveryState.ALIVE) {
       //  if not ALIVE, reply that the submission failed; drivers can only be submitted to an ALIVE Master
        val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
          "Can only accept driver submissions in ALIVE state."
        context.reply(SubmitDriverResponse(self, false, None, msg))
      } else {
        logInfo("Driver submitted " + description.command.mainClass)
         // create the driver with createDriver
        val driver = createDriver(description)
         // persist the driver's information through the persistence engine
        persistenceEngine.addDriver(driver)
        // add the driver to the waiting-for-scheduling queue (ArrayBuffer)
        waitingDrivers += driver
        // add the driver to the internal in-memory cache (HashSet)
        drivers.add(driver)
        // scheduling
        schedule()

        // TODO: It might be good to instead have the submission client poll the master to determine
        //       the current status of the driver. For now it's simply "fire and forget".
        // after successful registration, reply through the RpcCallContext that the driver was successfully submitted, including its id
        context.reply(SubmitDriverResponse(self, true, Some(driver.id),
          s"Driver successfully submitted as ${driver.id}"))
      }
    }
    
......

}

Step 2: createDriver, called in Step 1

  /**
   * Create a driver and return the wrapped DriverInfo
   */
  private def createDriver(desc: DriverDescription): DriverInfo = {
    val now = System.currentTimeMillis()
    val date = new Date(now)
    // wrap the timestamp, driver id, description, etc. into a DriverInfo and return it
    new DriverInfo(now, newDriverId(date), desc, date)
  }

Application registration with the Master (the registerApplication() method)

  1. Once the driver is up, it runs the application code and initializes the SparkContext; underneath, SparkDeploySchedulerBackend uses the AppClient's internal endpoint thread to send a RegisterApplication message to the Master, registering the application.
  2. The application's information is put into the in-memory cache (a HashMap).
  3. The application is added to the waiting-for-scheduling queue (an ArrayBuffer).
  4. The persistence engine persists the application information, either to the file system or to ZooKeeper.
  5. schedule() is called to perform scheduling.

Source code analysis:
Step 1: SparkDeploySchedulerBackend, through the AppClient's internal endpoint, sends RegisterApplication to the Master to register the application
Source location: org.apache.spark.deploy.client.AppClient

/**
 *  Register with all masters asynchronously and returns an array `Future`s for cancellation.
 *  The Master supports two failover mechanisms: one based on the file system, and one based on ZooKeeper (dynamic HA with hot switchover)
*/
 private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      for (masterAddress <- masterRpcAddresses) yield {
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = try {
            if (registered) {
              return
            }
            logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
            val masterRef =
              rpcEnv.setupEndpointRef(Master.SYSTEM_NAME, masterAddress, Master.ENDPOINT_NAME)
            // send the RegisterApplication case class, carrying appDescription, to the Master to register the application
            masterRef.send(RegisterApplication(appDescription, self))
          } catch {
            case ie: InterruptedException => // Cancelled
            case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
          }
        })
      }
 }

Step 2: handle the application's registration request
Source location: org.apache.spark.deploy.master.Master

 /**
  * Handle an application's registration request
 */
case RegisterApplication(description, driver) => {
      // TODO Prevent repeated registrations from some driver
      // if this Master's state is STANDBY, i.e. it is not the active Master,
      // then do nothing with the application's registration request
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
      } else {
        logInfo("Registering app " + description.name)
        // create an ApplicationInfo from the ApplicationDescription
        val app = createApplication(description, driver)
        // register the application: add it to the caches and to the waiting-for-scheduling queue
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        // persist the ApplicationInfo through the persistence engine
        persistenceEngine.addApplication(app)
        // reply back to SparkDeploySchedulerBackend's AppClient endpoint with RegisteredApplication (note: not RegisterApplication)
        driver.send(RegisteredApplication(app.id, self))
        schedule()
      }
 }

Step 3: createApplication, called in Step 2

private def createApplication(desc: ApplicationDescription, driver: RpcEndpointRef): ApplicationInfo = {
    val now = System.currentTimeMillis()
    val date = new Date(now)
    // use the current timestamp to generate a unique id for the application
    new ApplicationInfo(now, newApplicationId(date), desc, date, driver, defaultCores)
 }

Step 4: registerApplication

private def registerApplication(app: ApplicationInfo): Unit = {
    // get the driver's address
    val appAddress = app.driver.address
    // if the driver's address already exists, this is a duplicate registration, so return immediately
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " + appAddress)
      return
    }

    applicationMetricsSystem.registerSource(app.appSource)
    
    // add the application's information to the in-memory caches
    apps += app
    idToApp(app.id) = app
    endpointToApp(app.driver) = app
    addressToApp(appAddress) = app
    
    // add the application to the waiting-for-scheduling queue; the scheduling order is FIFO
    waitingApps += app
 }

Step 5: the reply sent back to SparkDeploySchedulerBackend's AppClient endpoint: how RegisteredApplication is handled
Source location: org.apache.spark.deploy.client.AppClient

override def receive: PartialFunction[Any, Unit] = {
      case RegisteredApplication(appId_, masterRef) =>
        // FIXME How to handle the following cases?
        // 1. A master receives multiple registrations and sends back multiple
        // RegisteredApplications due to an unstable network.
        // 2. Receive multiple RegisteredApplication from different masters because the master is
        // changing.
        
        /**
         * Open questions from the FIXME above:
         * 1. due to an unstable network, one Master may receive multiple registrations and send back multiple RegisteredApplication messages
         * 2. RegisteredApplication may be received from different Masters while the master is changing
         */
        appId = appId_
        registered = true
        master = Some(masterRef)
        listener.connected(appId)

      case ApplicationRemoved(message) =>
        markDead("Master removed our application: %s".format(message))
        stop()

      case ExecutorAdded(id: Int, workerId: String, hostPort: String, cores: Int, memory: Int) =>
        val fullId = appId + "/" + id
        logInfo("Executor added: %s on %s (%s) with %d cores".format(fullId, workerId, hostPort,
          cores))
        // FIXME if changing master and `ExecutorAdded` happen at the same time (the order is not
        // guaranteed), `ExecutorStateChanged` may be sent to a dead master.
        sendToMaster(ExecutorStateChanged(appId, id, ExecutorState.RUNNING, None, None))
        listener.executorAdded(fullId, workerId, hostPort, cores, memory)

      case ExecutorUpdated(id, state, message, exitStatus) =>
        val fullId = appId + "/" + id
        val messageText = message.map(s => " (" + s + ")").getOrElse("")
        logInfo("Executor updated: %s is now %s%s".format(fullId, state, messageText))
        if (ExecutorState.isFinished(state)) {
          listener.executorRemoved(fullId, message.getOrElse(""), exitStatus)
        }

      case MasterChanged(masterRef, masterWebUiUrl) =>
        logInfo("Master has changed, new master is at " + masterRef.address.toSparkURL)
        master = Some(masterRef)
        alreadyDisconnected = false
        masterRef.send(MasterChangeAcknowledged(appId))
}

How the Master Handles State Changes

Driver state changes (DriverStateChanged)

  1. If the driver's state is ERROR, FINISHED, KILLED or FAILED, the driver is removed.
  2. The driver is added to the completed drivers, its state is set to the final state, and it is removed from the worker it was running on, releasing its resources.

Step 1: the driver's state changes
Source code analysis:

    /**
     * Driver state changed
     */
    case DriverStateChanged(driverId, state, exception) => {
      state match {
        // if the driver's state is ERROR, FINISHED, KILLED or FAILED
        case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
          // remove the driver
          removeDriver(driverId, state, exception)
        case _ =>
          throw new Exception(s"Received unexpected state update for driver $driverId: $state")
      }
    }

Step 2: removeDriver, called in Step 1

/**
 * Remove a driver
*/
 private def removeDriver(
    driverId: String,
    finalState: DriverState,
    exception: Option[Exception]) {
    // Scala's find higher-order function: locate the driver matching driverId
    drivers.find(d => d.id == driverId) match {
      // if it was found: Some (an Option case)
      case Some(driver) =>
        logInfo(s"Removing driver: $driverId")
        // remove it from the in-memory cache (HashSet)
        drivers -= driver
        if (completedDrivers.size >= RETAINED_DRIVERS) {
          val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
          completedDrivers.trimStart(toRemove)
        }
        // add the driver to the completed drivers
        completedDrivers += driver
        // remove the driver from the persistence engine
        persistenceEngine.removeDriver(driver)
        // set the driver's state to the final state
        driver.state = finalState
        driver.exception = exception
        // remove the driver from the worker it was running on
        driver.worker.foreach(w => w.removeDriver(driver))
        schedule()
      case None =>
        logWarning(s"Asked to remove unknown driver: $driverId")
    }
  }

Executor state changes (ExecutorStateChanged)

  1. Find the application the executor belongs to, then look up the executor's information in the application's internal executors cache.
  2. Update the executor's state to the newly reported state, and send an ExecutorUpdated message to the driver.
  3. If the executor's state is a finished state, remove the executor from its application and from the worker that was running it.
  4. If the executor exited abnormally, the application is retried; if it has already been retried 10 times and every attempt failed, scheduling is considered to have failed and the application is removed. If things recover within those 10 retries, scheduling simply continues.

Source code analysis:

/**
  * Executor state changed
  */
case ExecutorStateChanged(appId, execId, state, message, exitStatus) => {
      // find the application for this executor, then look up the executor in the application's executors cache
      val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
      execOption match {
        // if it was found
        case Some(exec) => {
          // update the executor's state to the newly reported state
          val appInfo = idToApp(appId)
          exec.state = state
          if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }

          // send an ExecutorUpdated message to the driver
          exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus))
          // if the executor's state is a finished state
          if (ExecutorState.isFinished(state)) {
            // Remove this executor from the worker and app
            logInfo(s"Removing executor ${exec.fullId} because it is $state")
            // If an application has already finished, preserve its
            // state to display its information properly on the UI
            if (!appInfo.isFinished) {
              // remove the executor from the application's cache
              appInfo.removeExecutor(exec)
            }
            // remove the executor from the cache of the worker that was running it
            exec.worker.removeExecutor(exec)
            // check whether the executor exited normally (exit status 0)
            val normalExit = exitStatus == Some(0)
            // Only retry certain number of times so we don't go into an infinite loop.
            if (!normalExit) {
              // check whether the application's retry count has reached the maximum (10)
              if (appInfo.incrementRetryCount() < ApplicationState.MAX_NUM_RETRY) {
                // reschedule
                schedule()
              } else {
                // otherwise perform removeApplication:
                // if the executors keep failing on every attempt, the application is considered failed
                val execs = appInfo.executors.values
                if (!execs.exists(_.state == ExecutorState.RUNNING)) {
                  logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
                    s"${appInfo.retryCount} times; removing it")
                  removeApplication(appInfo, ApplicationState.FAILED)
                }
              }
            }
          }
        }
        case None =>
          logWarning(s"Got status update for unknown executor $appId/$execId")
      }
}

Step 2: removeExecutor, called in Step 1

  /**
   * Remove the executor from the internal in-memory caches
   */
  private[master] def removeExecutor(exec: ExecutorDesc) {
    if (executors.contains(exec.id)) {
      removedExecutors += executors(exec.id)
      executors -= exec.id
      coresGranted -= exec.cores
    }
  }

Step 3: removeApplication, called in Step 1

def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
    // remove the application from the in-memory caches
    if (apps.contains(app)) {
      logInfo("Removing app " + app.id)
      apps -= app
      idToApp -= app.id
      endpointToApp -= app.driver
      addressToApp -= app.driver.address
      if (completedApps.size >= RETAINED_APPLICATIONS) {
        val toRemove = math.max(RETAINED_APPLICATIONS / 10, 1)
        completedApps.take(toRemove).foreach(a => {
          appIdToUI.remove(a.id).foreach { ui => webUi.detachSparkUI(ui) }
          applicationMetricsSystem.removeSource(a.appSource)
        })
        completedApps.trimStart(toRemove)
      }
      completedApps += app // Remember it in our history
      waitingApps -= app

      // If application events are logged, use them to rebuild the UI
      rebuildSparkUI(app)

      for (exec <- app.executors.values) {
        // kill the application's executors
        killExecutor(exec)
      }
      app.markFinished(state)
      if (state != ApplicationState.FINISHED) {
        // notify the driver
        app.driver.send(ApplicationRemoved(state.toString))
      }
      // remove the application from the persistence engine
      persistenceEngine.removeApplication(app)
      schedule()

      // Tell all workers that the application has finished, so they can clean up any app state.
      workers.foreach { w =>
        w.endpoint.send(ApplicationFinished(app.id))
      }
    }
 }

Master Resource Scheduling Algorithm

How it works:

  1. First, if the Master's state is not ALIVE, return immediately; in other words, a standby Master never schedules applications or other resources.
  2. Take all registered workers, keep only those whose state is ALIVE, and shuffle them randomly with Random.shuffle (the shuffle implementation is shown in Step 2 below).
  3. Iterate over the live workers and launch the waiting drivers on them: each launched driver is added to the in-memory structures and removed from the waiting-drivers queue.
  4. Launch executors on the workers, using the spreadOutApps algorithm by default.

Source code analysis:
Step 1: the schedule() method
Source location: org.apache.spark.deploy.master.Master

  /**
   * Schedule the currently available resources among waiting apps. This method will be called
   * every time a new app joins or resource availability changes.
   * 
   * Resource scheduling algorithm
   * 
   */
  private def schedule(): Unit = {
    // if this Master's state is not ALIVE, return immediately;
    // in other words, a standby Master never schedules applications or other resources
    if (state != RecoveryState.ALIVE) { return }
    // Drivers take strict precedence over executors
    
    /**
     * Random.shuffle randomly permutes the elements of the collection passed in.
     * It walks the ArrayBuffer backwards from the last element; for each position it draws a random index within the remaining range
     * (e.g. for buf.length = 10, nextInt(10) returns a value in [0, 10)) and swaps that position with the current one (a Fisher-Yates shuffle).
     */
    val shuffledWorkers = Random.shuffle(workers) // Randomization helps balance drivers
    
     /**
     * Filter the registered workers, keeping only those whose state is ALIVE
     */
    for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
      
      /**
       * Iterate over the waiting drivers. Only cluster-mode submissions to the standalone cluster register a driver with the Master;
       * in other modes the driver is launched locally, is never registered here, and is never scheduled by the Master.
       */
      for (driver <- waitingDrivers) {
        // check that the worker's free memory is at least what the driver needs, and that its free cores are at least what the driver needs
        if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
          // launch the driver
          launchDriver(worker, driver)
          // remove the driver from the waiting-drivers queue
          waitingDrivers -= driver
        }
      }
    }
    // schedule and launch executors on the workers
    startExecutorsOnWorkers()
  }

Step 2: the shuffle method called in Step 1
Source location: scala.util.Random

/** Returns a new collection of the same type in a randomly chosen order.
   *  Randomly permutes the elements of the collection passed in.
   *  @return         the shuffled collection
   */
  def shuffle[T, CC[X] <: TraversableOnce[X]](xs: CC[T])(implicit bf: CanBuildFrom[CC[T], T, CC[T]]): CC[T] = {
    // copy the input into an ArrayBuffer
    val buf = new ArrayBuffer[T] ++= xs
    // swap two positions
    def swap(i1: Int, i2: Int) {
      val tmp = buf(i1)
      buf(i1) = buf(i2)
      buf(i2) = tmp
    }
   // walk backwards from the last element of the ArrayBuffer; for each position draw a random index in the remaining range
   // e.g. if buf.length is 10, nextInt(10) returns a value in [0, 10), and that random position is swapped with the current one
    for (n <- buf.length to 2 by -1) {
      val k = nextInt(n)
      swap(n - 1, k)
    }

    (bf(xs) ++= buf).result
  }

Step 3: the launchDriver method called in Step 1

/**
  * Launch a driver on a given worker
  */
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
    logInfo("Launching driver " + driver.id + " on worker " + worker.id)
    // add the driver to the worker's internal cache,
    // and add the driver's required memory and cores to the worker's used memory and cores
    worker.addDriver(driver)
    // record the worker in the driver's in-memory structure
    driver.worker = Some(worker)
    // send a LaunchDriver message to the worker's endpoint so the worker starts the driver
    worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
    // set the driver's state to RUNNING
    driver.state = DriverState.RUNNING
 }

Step 4: the startExecutorsOnWorkers method called in Step 1

  /**
   * Schedule and launch executors on workers
   *
   * The core of application scheduling. There are two modes:
   *   1. spreadOutApps: spread the app's resources as evenly as possible across the workers
   *   2. non-spreadOutApps: pack the app onto as few workers as possible (preferring the workers with the most resources), leaving the other workers untouched
   */
  private def startExecutorsOnWorkers(): Unit = {
    // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
    // in the queue, then the second app, etc.
    // iterate over the ApplicationInfos in waitingApps, considering only applications whose remaining core demand
    // (cores requested minus cores already used by its executors on workers) is greater than 0
    for (app <- waitingApps if app.coresLeft > 0) {
      val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
      // Filter out workers that don't have enough resources to launch an executor
      // keep only ALIVE workers with enough memory and cores for one executor, sorted by free cores in descending order
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
          worker.coresFree >= coresPerExecutor.getOrElse(1))
        .sortBy(_.coresFree).reverse
      val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
      
      // Now that we've decided how many cores to allocate on each worker, let's allocate them
      // after deciding how many cores each worker gets, iterate over the workers that were assigned at least one core
      for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
        // and launch executors on those workers
        allocateWorkerResourceToExecutors(
          app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
      }
    }
  }

In this method, executors are allocated for each waiting application. First the workers are filtered, keeping only those able to launch an executor for that application; a worker qualifies when:

  1. its free memory is at least the per-executor memory specified by the application, and
  2. its free cores are at least the per-executor core count specified by the application (if coresPerExecutor was not specified, this value is taken to be 1).

Step 5: the scheduleExecutorsOnWorkers method called in Step 4

/**
   * Schedule executors to be launched on the workers.
   * Returns an array containing number of cores assigned to each worker.
   *
   * There are two modes of launching executors. The first attempts to spread out an application's
   * executors on as many workers as possible, while the second does the opposite (i.e. launch them
   * on as few workers as possible). The former is usually better for data locality purposes and is
   * the default.
   *
   * The number of cores assigned to each executor is configurable. When this is explicitly set,
   * multiple executors from the same application may be launched on the same worker if the worker
   * has enough cores and memory. Otherwise, each executor grabs all the cores available on the
   * worker by default, in which case only one executor may be launched on each worker.
   *
   * It is important to allocate coresPerExecutor on each worker at a time (instead of 1 core
   * at a time). Consider the following example: cluster has 4 workers with 16 cores each.
   * User requests 3 executors (spark.cores.max = 48, spark.executor.cores = 16). If 1 core is
   * allocated at a time, 12 cores from each worker would be assigned to each executor.
   * Since 12 < 16, no executors would launch [SPARK-8881].
   *  
   *  (Restating the example above: with 4 workers of 16 cores each and a request for 3 executors of 16 cores, i.e. 48 cores in total,
   *  allocating one core at a time would give each worker 12 cores; since 12 < 16, no executor would launch [SPARK-8881].)
   *
   *  Computes the executors to launch on the workers for the application and returns an array with the number of cores assigned to each worker.
   * 
   */
  private def scheduleExecutorsOnWorkers(
    app: ApplicationInfo,
    usableWorkers: Array[WorkerInfo],
    spreadOutApps: Boolean): Array[Int] = {
    // number of cores per executor (if specified)
    val coresPerExecutor = app.desc.coresPerExecutor
    // minimum cores per executor for this app; defaults to 1 if not set
    val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
    // if coresPerExecutor is not set, at most one executor of this app is launched per worker
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    //  memory per executor
    val memoryPerExecutor = app.desc.memoryPerExecutorMB
    // number of usable workers
    val numUsable = usableWorkers.length
    // an array of length numUsable holding the number of cores assigned on each worker
    val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
    // an array of length numUsable holding the number of new executors assigned on each worker (e.g. 1,2,1,0)
    val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
    // cores that will actually be assigned to the app: the minimum of the app's remaining unassigned cores and the
    // total free cores on the usable workers, so we never assign more cores than are currently available
    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

    /** Return whether the specified worker can launch an executor for this app. */

    /**
     * Decide whether the specified worker can launch an executor for this app:
     * 1. there must still be cores left to assign
     * 2. the worker must have enough free cores
     * 3. if a new executor is being launched, the worker must also have enough memory and the app's executor limit must not be exceeded
     * 4. if cores are only being added to an existing executor, only checks 1 and 2 apply
     */
    def canLaunchExecutor(pos: Int): Boolean = {

      /**
      * Check that the cores still to be assigned are at least the per-executor minimum;
      * e.g. if only 8 cores remain but each executor needs 12, no further executor can be launched because resources are insufficient
      */
      val keepScheduling = coresToAssign >= minCoresPerExecutor

      /**
       * Check that the worker still has enough cores: its free cores minus the cores already assigned in this round.
       *
       * For example, if the worker has 16 free cores and 12 have already been assigned to this application, 4 remain;
       * if an executor needs 12 cores, 4 < 12 means there are not enough cores left.
       */
      val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor

      // If we allow multiple executors per worker, then we can always launch new executors.
      // Otherwise, if there is already an executor on this worker, just give it more cores.

     /**
       * If multiple executors per worker are allowed, we can always launch a new executor;
       * otherwise, if the worker already has an executor for this app, we just give that executor more cores.
       */
      val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
      // when launching a new executor we must also check that memory is sufficient and that the executor limit is not exceeded
      if (launchingNewExecutor) {
        // memory already assigned on this worker in this round
        val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
        // the worker's free memory minus the memory already assigned must cover one more executor
        val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
        // check that the executor limit is not exceeded
        val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
        keepScheduling && enoughCores && enoughMemory && underLimit
      } else {
        // We're adding cores to an existing executor, so no need
        // to check memory and executor limits
        // otherwise we are only adding cores to an existing executor, so memory and executor limits need not be checked
        keepScheduling && enoughCores
      }
    }

    // Keep launching executors until no more workers can accommodate any
    // more executors, or if we have reached this application's limits
    // canLaunchExecutor simply checks whether the worker's cores and memory satisfy the conditions above
    var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)

    // keep launching executors until no worker can accommodate any more, or the application's demand is met
    while (freeWorkers.nonEmpty) {
      freeWorkers.foreach { pos =>
        var keepScheduling = true
        // re-check whether an executor can still be launched on this worker
        while (keepScheduling && canLaunchExecutor(pos)) {
          // subtract the per-executor minimum from the cores still to be assigned
          coresToAssign -= minCoresPerExecutor
          // add the per-executor minimum to the cores assigned on this worker
          assignedCores(pos) += minCoresPerExecutor

          // If we are launching one executor per worker, then every iteration assigns 1 core
          // to the executor. Otherwise, every iteration assigns cores to a new executor.
          /**
           * If only one executor per worker is allowed, set this worker's assigned-executor count to 1;
           * otherwise increment the existing count by 1.
           */
          if (oneExecutorPerWorker) {
            assignedExecutors(pos) = 1
          } else {
            assignedExecutors(pos) += 1
          }

          // Spreading out an application means spreading out its executors across as
          // many workers as possible. If we are not spreading out, then we should keep
          // scheduling executors on this worker until we use all of its resources.
          // Otherwise, just move on to the next worker.
          /**
           * With spreadOutApps, executors are spread across as many workers as possible, so keepScheduling is set
           * to false and we move on to the next worker after a single allocation step.
           *
           * Without spreadOutApps, keepScheduling stays true and we keep allocating on this worker in a loop until
           * its resources can no longer accommodate another allocation step.
           */
          if (spreadOutApps) {
            keepScheduling = false
          }
        }
      }
      // since the outer loop is a while loop, filter again here: once allocation is finished, canLaunchExecutor returns false
      // for every worker, freeWorkers becomes empty, and the outer loop terminates
      freeWorkers = freeWorkers.filter(canLaunchExecutor)
    }
    assignedCores
  }

About the two scheduling strategies:

spreadOutApps walks the list of usable workers and launches executors on them as evenly as possible. For example, suppose 10 executors are needed and the first check finds 5 usable workers (workers with free resources): the first pass launches one executor on each of them. A second check then follows, because a node may have run out of free resources after the first round, and a second pass allocates again. Assuming the 5 workers still have enough free resources in the second pass, after it each worker runs 2 executors for this application (an even spread).

Non-spreadOutApps is roughly the opposite. In earlier versions it meant that, after the check, the usable worker with the most free resources was picked and as many of the application's executors as possible were launched on it; if resources ran out, the check was repeated and the next worker with the most free resources was used, and so on (not particularly sensible). In the current code, the assignedExecutors array records how many executors to launch on each usable worker, so executors are launched on each worker as needed.

Which of the two strategies is used comes down to the `if (spreadOutApps) { keepScheduling = false }` check, as sketched below.
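
To make the difference concrete, here is a small, self-contained Scala sketch of the core-assignment loop (an illustration only, not the real Master code: it models cores but omits the memory and executor-limit checks, and the worker numbers are made up):

object ScheduleSketch {
  // Returns the number of cores assigned on each worker.
  def assignCores(freeCores: Array[Int], coresNeeded: Int,
                  coresPerExecutor: Int, spreadOut: Boolean): Array[Int] = {
    val assigned = new Array[Int](freeCores.length)
    var toAssign = math.min(coresNeeded, freeCores.sum)
    // A worker can take another allocation step if cores remain to assign
    // and it still has at least coresPerExecutor free cores.
    def canLaunch(pos: Int): Boolean =
      toAssign >= coresPerExecutor && freeCores(pos) - assigned(pos) >= coresPerExecutor

    var free = freeCores.indices.filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunch(pos)) {
          toAssign -= coresPerExecutor
          assigned(pos) += coresPerExecutor
          if (spreadOut) keepScheduling = false // spread out: move on to the next worker
        }
      }
      free = free.filter(canLaunch)
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    val workers = Array(16, 16, 16, 16) // free cores per worker (made up)
    // spread out: cores are spread across workers, e.g. 4,4,4,0
    println(assignCores(workers, 12, 4, spreadOut = true).mkString(","))
    // not spread out: cores are packed onto the first worker, e.g. 12,0,0,0
    println(assignCores(workers, 12, 4, spreadOut = false).mkString(","))
  }
}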

Step 6: the allocateWorkerResourceToExecutors method called in Step 4

private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    // cores to assign on this worker
    assignedCores: Int,
    // cores per executor
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
    // If the number of cores per executor is specified, we divide the cores assigned
    // to this worker evenly among the executors with no remainder.
    // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
    // how many executors this worker should launch
    val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
   // cores per executor; if not specified, a single executor grabs all the assigned cores
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    for (i <- 1 to numExecutors) {
      // add an executor to the application: create an ExecutorDesc and update the application's granted cores
      val exec = app.addExecutor(worker, coresToAssign)
      // launch the executor
      launchExecutor(worker, exec)
      // update the application's state
      app.state = ApplicationState.RUNNING
    }
  }

Step 7: the launchExecutor method called in Step 6

  private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
   // record the executor on the worker and update the worker's used cores and memory
    worker.addExecutor(exec)
    worker.endpoint.send(LaunchExecutor(
      masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
    // send an ExecutorAdded message to the application's driver
    exec.application.driver.send(ExecutorAdded(
      exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
  }

Reposted from blog.csdn.net/jiaojiao521765146514/article/details/85098823