Hadoop YARN 3.1.0 Source Code Analysis (02: Job Scheduling)

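In the previous part we analyzed in detail how a job makes its way to the ResourceManager (RM). The job has now arrived at the RM and been handed over to RMAppManager, so let's keep tracing how it moves through YARN.
Server side:
ApplicationClientProtocolPBServiceImpl.submitApplication() -> ClientRMService.submitApplication() -> RMAppManager.submitApplication():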

 protected void submitApplication(
      ApplicationSubmissionContext submissionContext, long submitTime,
      String user) throws YarnException {
    ApplicationId applicationId = submissionContext.getApplicationId();
    // Create an RMAppImpl instance from the submitted submissionContext and cache the applicationId -> RMAppImpl mapping for later lookups
    RMAppImpl application = createAndPopulateNewRMApp(
        submissionContext, submitTime, user, false, -1);
    try {
      // If security is enabled, hand the credentials to the DelegationTokenRenewer asynchronously
      if (UserGroupInformation.isSecurityEnabled()) {
        this.rmContext.getDelegationTokenRenewer()
            .addApplicationAsync(applicationId,
                BuilderUtils.parseCredentials(submissionContext),
                submissionContext.getCancelTokensWhenComplete(),
                application.getUser(),
                BuilderUtils.parseTokensConf(submissionContext));
      } else {
        // Security is disabled: get the dispatcher and hand an RMAppEventType.START event to its EventHandler, which kicks off the RMAppImpl state machine
        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));
      }
    } catch (Exception e) {
      LOG.warn("Unable to parse credentials for " + applicationId, e);
      // On failure, fire an RMAppEventType.APP_REJECTED event
      // Sending APP_REJECTED is fine, since we assume that the
      // RMApp is in NEW state and thus we haven't yet informed the
      // scheduler about the existence of the application
      this.rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppEvent(applicationId,
              RMAppEventType.APP_REJECTED, e.getMessage()));
      throw RPCUtil.getRemoteException(e);
    }
  }

Let's look at the main steps of createAndPopulateNewRMApp:
ApplicationClientProtocolPBServiceImpl.submitApplication() -> ClientRMService.submitApplication() -> RMAppManager.submitApplication() -> RMAppManager.createAndPopulateNewRMApp():

  private RMAppImpl createAndPopulateNewRMApp(
      ApplicationSubmissionContext submissionContext, long submitTime,
      String user, boolean isRecovery, long startTime) throws YarnException {
    // Get the applicationId of the current job
    ApplicationId applicationId = submissionContext.getApplicationId();
    // Validate the resource request
    List<ResourceRequest> amReqs = validateAndCreateResourceRequest(
        submissionContext, isRecovery);
    // Create the RMApp instance, i.e. an RMAppImpl
    RMAppImpl application =
        new RMAppImpl(applicationId, rmContext, this.conf,
            submissionContext.getApplicationName(), user,
            submissionContext.getQueue(),
            submissionContext, this.scheduler, this.masterService,
            submitTime, submissionContext.getApplicationType(),
            submissionContext.getApplicationTags(), amReqs, placementContext,
            startTime);
    // If the rmContext cache has no RMApp for this applicationId yet, cache it so it can be looked up directly later; otherwise reject the duplicate
    if (rmContext.getRMApps().putIfAbsent(applicationId, application) !=
        null) {
      String message = "Application with id " + applicationId
          + " is already present! Cannot add a duplicate!";
      LOG.warn(message);
      throw new YarnException(message);
    }
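
The duplicate check above relies on the return value of putIfAbsent. As a minimal sketch of that idiom, assuming a plain ConcurrentHashMap keyed by a string id (the class name AppRegistrySketch and its methods are hypothetical and not part of YARN):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class AppRegistrySketch {
  // Hypothetical stand-in for rmContext.getRMApps(): id -> application record.
  private final ConcurrentMap<String, String> apps = new ConcurrentHashMap<>();

  /** Registers an application; throws if the id is already present. */
  void register(String appId, String app) {
    // putIfAbsent returns null only when no mapping existed before the call,
    // so a non-null result means another entry already owns this id.
    if (apps.putIfAbsent(appId, app) != null) {
      throw new IllegalStateException(
          "Application with id " + appId + " is already present!");
    }
  }

  String lookup(String appId) {
    return apps.get(appId);
  }

  public static void main(String[] args) {
    AppRegistrySketch registry = new AppRegistrySketch();
    registry.register("application_1_0001", "first");
    try {
      registry.register("application_1_0001", "second");
    } catch (IllegalStateException e) {
      System.out.println("rejected duplicate: " + e.getMessage());
    }
    System.out.println(registry.lookup("application_1_0001"));
  }
}
```

Because the check and the insert happen in one atomic call, two concurrent submissions with the same id cannot both succeed.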

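Creating the RMAppImpl is the most important step: it is how a job is represented on the RM side. The constructor arguments fall into two groups.

Information from the submissionContext sent by the client:
the job name (getApplicationName()), the job ID (applicationId), the target queue (getQueue()), the job type (getApplicationType()), the job tags (getApplicationTags()), the AM resource requests (amReqs), and so on.
Information supplied by RMAppManager on the RM side:
the RM context (rmContext), the configuration (this.conf), the scheduler in use (this.scheduler), the service that manages AMs (this.masterService), and so on.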

Back to RMAppManager.submitApplication() above:

        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));

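Submitting a job triggers an RMAppEventType.START event on the RM side. For every job that reaches the RM over RPC, RMAppManager creates an RMAppImpl object, which is the RM's representation of that job. A job's progress is modeled as a state machine (state machines are covered in detail in another part); here that means the RMAppImpl state machine.
this.rmContext.getDispatcher() returns an asynchronous dispatcher, the AsyncDispatcher instance rmDispatcher. rmDispatcher.getEventHandler() then returns an EventHandler that routes each event to the handler registered on the dispatcher for its event type; that handler's handle() invokes the transition function for the event and moves the state machine to its next state. The event sent here is RMAppEventType.START.
The rmContext in RMAppManager is the rmContext object passed in from ResourceManager, so let's first see how rmContext is created in the ResourceManager class: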

protected void serviceInit(Configuration conf) throws Exception {
    this.conf = conf;
    // First create an RMContextImpl instance
    this.rmContext = new RMContextImpl();
    // Then fill in its fields through setters; here we only look at the Dispatcher-related ones
    rmDispatcher = setupDispatcher();
    addIfService(rmDispatcher);
    rmContext.setDispatcher(rmDispatcher);
    super.serviceInit(this.conf);
  }

 private Dispatcher setupDispatcher() {
    Dispatcher dispatcher = createDispatcher();
    dispatcher.register(RMFatalEventType.class,
        new ResourceManager.RMFatalEventDispatcher());
    return dispatcher;
  }

  protected Dispatcher createDispatcher() {
    return new AsyncDispatcher("RM Event dispatcher");
  }

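As we can see, the AsyncDispatcher instance rmDispatcher is created during ResourceManager.serviceInit(). In serviceInit() of RMActiveServices, an inner class of ResourceManager, many EventHandlers are also registered with rmDispatcher so that different event types can later be dispatched to different handlers: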

// Groups all of the RM's active services
public class RMActiveServices extends CompositeService {
  // Renews security delegation tokens
  private DelegationTokenRenewer delegationTokenRenewer;
  // Event handler for scheduler events
  private EventHandler<SchedulerEvent> schedulerDispatcher;
  // Launches AMs from the RM side
  private ApplicationMasterLauncher applicationMasterLauncher;
  // Tracks expiry of allocated containers
  private ContainerAllocationExpirer containerAllocationExpirer;
  private ResourceManager rm;
  private boolean fromActive = false;
  private StandByTransitionRunnable standByTransitionRunnable;
  // serviceInit() of RMActiveServices; we only care about the registrations on rmDispatcher
  protected void serviceInit(Configuration configuration) throws Exception {
    // Register the event handler for NodesListManager events
    rmDispatcher.register(NodesListManagerEventType.class, nodesListManager);
    // Register the event handler for scheduler events
    rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher);
    // Register the event handler for RMApp, the RM's representation of an application
    rmDispatcher.register(RMAppEventType.class, new ApplicationEventDispatcher(rmContext));
    // Register the event handler for RMAppAttempt, which represents one attempt of an RMApp
    rmDispatcher.register(RMAppAttemptEventType.class, new ApplicationAttemptEventDispatcher(rmContext));
    // Register the event handler for RMNode, the RM's representation of a node
    rmDispatcher.register(RMNodeEventType.class, new NodeEventDispatcher(rmContext));

Let's take one registration as an example, the scheduler's event handler: rmDispatcher.register(SchedulerEventType.class, schedulerDispatcher).
SchedulerEventType is the event-type class, an enum that lists every scheduler-related event type:

public enum SchedulerEventType {

  // Scheduler event types for node operations
  NODE_ADDED,
  NODE_REMOVED,
  NODE_UPDATE,
  NODE_RESOURCE_UPDATE,
  NODE_LABELS_UPDATE,

  // Scheduler event types for RMApp operations
  APP_ADDED,
  APP_REMOVED,

  // Scheduler event types for RMAppAttempt operations
  APP_ATTEMPT_ADDED,
  APP_ATTEMPT_REMOVED,

  // Event type raised by the ContainerAllocationExpirer
  CONTAINER_EXPIRED,

  // Source: SchedulerAppAttempt::pullNewlyUpdatedContainer.
  RELEASE_CONTAINER,

  /* Source: SchedulingEditPolicy */
  KILL_RESERVED_CONTAINER,

  // Mark a container for preemption
  MARK_CONTAINER_FOR_PREEMPTION,

  // Mark a for-preemption container killable
  MARK_CONTAINER_FOR_KILLABLE,

  // Cancel a killable container
  MARK_CONTAINER_FOR_NONKILLABLE,

  //Queue Management Change
  MANAGE_QUEUE
}

schedulerDispatcher is the event handler for scheduler-related events. It is considerably more complex than the other handlers, so let's look closer: schedulerDispatcher = createSchedulerEventDispatcher() -> ResourceManager.createSchedulerEventDispatcher()

protected ResourceScheduler scheduler;
protected EventHandler<SchedulerEvent> createSchedulerEventDispatcher() {
    return new EventDispatcher(this.scheduler, "SchedulerEventDispatcher");
  }

public interface ResourceScheduler extends YarnScheduler, Recoverable { }

public interface YarnScheduler extends EventHandler<SchedulerEvent> {}

// The handler passed in is a ResourceScheduler, which extends YarnScheduler, which in turn extends EventHandler<SchedulerEvent>
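
To make the register-then-dispatch idea concrete, here is a minimal sketch of a dispatcher that routes events by the enum class of their type. All names in it (DispatcherSketch, Event, SimpleEvent, the two enums) are hypothetical, and unlike AsyncDispatcher it dispatches synchronously on the caller's thread so the output is deterministic:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class DispatcherSketch {
  /** Hypothetical event: every event exposes an enum constant as its type. */
  interface Event {
    Enum<?> getType();
  }

  enum AppEventType { START, KILL }
  enum SchedulerEventType { APP_ADDED }

  static final class SimpleEvent implements Event {
    private final Enum<?> type;
    SimpleEvent(Enum<?> type) { this.type = type; }
    @Override public Enum<?> getType() { return type; }
  }

  // One handler per event-type enum class.
  private final Map<Class<?>, Consumer<Event>> handlers = new HashMap<>();

  void register(Class<? extends Enum<?>> eventTypeClass, Consumer<Event> handler) {
    handlers.put(eventTypeClass, handler);
  }

  void dispatch(Event event) {
    // The enum class that declares the constant selects the handler.
    Class<?> typeClass = event.getType().getDeclaringClass();
    Consumer<Event> handler = handlers.get(typeClass);
    if (handler == null) {
      throw new IllegalStateException("No handler registered for " + typeClass);
    }
    handler.accept(event);
  }

  static List<String> demo() {
    List<String> log = new ArrayList<>();
    DispatcherSketch dispatcher = new DispatcherSketch();
    dispatcher.register(AppEventType.class, e -> log.add("app:" + e.getType()));
    dispatcher.register(SchedulerEventType.class, e -> log.add("scheduler:" + e.getType()));
    dispatcher.dispatch(new SimpleEvent(AppEventType.START));
    dispatcher.dispatch(new SimpleEvent(SchedulerEventType.APP_ADDED));
    return log;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // [app:START, scheduler:APP_ADDED]
  }
}
```

The sender only needs the dispatcher and an event object; which component ends up handling the event is decided entirely by the earlier registrations.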

With a rough understanding of event types and event handlers, let's return once more to RMAppManager.submitApplication() above:

        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));

We have already covered this.rmContext.getDispatcher().getEventHandler(). Once rmDispatcher hands the event to the handler previously registered for RMApp events, handle() is where that handler acts on the specific event type:
ApplicationEventDispatcher.handle():

public void handle(RMAppEvent event) {
      ApplicationId appID = event.getApplicationId();
      RMApp rmApp = this.rmContext.getRMApps().get(appID);
      if (rmApp != null) {
        try {
          rmApp.handle(event);
        } catch (Throwable t) {
          LOG.error("Error in handling event type " + event.getType()
              + " for application " + appID, t);
        }
      }
    }

The event handler looks up the RMApp and calls rmApp.handle(event), which in practice is RMAppImpl.handle(event). RMAppImpl can take the event because it is itself an EventHandler: it implements the RMApp interface, which extends EventHandler.
[Figure: RMAppImpl class diagram]

ApplicationEventDispatcher.handle() -> RMAppImpl.handle() :

public void handle(RMAppEvent event) {

    this.writeLock.lock();

    try {
      ApplicationId appID = event.getApplicationId();
      LOG.debug("Processing event for " + appID + " of type "
          + event.getType());
      final RMAppState oldState = getState();
      try {
        /* keep the master in sync with the state machine */
        // Perform the state machine transition that matches the event type
        this.stateMachine.doTransition(event.getType(), event);
      } catch (InvalidStateTransitionException e) {
        LOG.error("App: " + appID
            + " can't handle this event at current state", e);
        onInvalidStateTransition(event.getType(), oldState);
      }

      // Log at INFO if we're not recovering or not in a terminal state.
      // Log at DEBUG otherwise.
      if ((oldState != getState()) &&
          (((recoveredFinalState == null)) ||
            (event.getType() != RMAppEventType.RECOVER))) {
        LOG.info(String.format(STATE_CHANGE_MESSAGE, appID, oldState,
            getState(), event.getType()));
      } else if ((oldState != getState()) && LOG.isDebugEnabled()) {
        LOG.debug(String.format(STATE_CHANGE_MESSAGE, appID, oldState,
            getState(), event.getType()));
      }
    } finally {
      this.writeLock.unlock();
    }
  }

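The event type passed in here, event.getType(), is RMAppEventType.START. RMAppImpl builds its transition table with a StateMachineFactory. The excerpt below shows part of it: the initial RMApp state is RMAppState.NEW, followed by the transitions registered for leaving NEW. The StateMachineFactory holds more than the NEW transitions, of course; it also stores the transition table for every other state.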

private static final StateMachineFactory<RMAppImpl,
                                           RMAppState,
                                           RMAppEventType,
                                           RMAppEvent> stateMachineFactory
                               = new StateMachineFactory<RMAppImpl,
                                           RMAppState,
                                           RMAppEventType,
                                           RMAppEvent>(RMAppState.NEW)


     // Transitions from NEW state
    .addTransition(RMAppState.NEW, RMAppState.NEW,
        RMAppEventType.NODE_UPDATE, new RMAppNodeUpdateTransition())
    .addTransition(RMAppState.NEW, RMAppState.NEW_SAVING,
        RMAppEventType.START, new RMAppNewlySavingTransition())
    .addTransition(RMAppState.NEW, EnumSet.of(RMAppState.SUBMITTED,
            RMAppState.ACCEPTED, RMAppState.FINISHED, RMAppState.FAILED,
            RMAppState.KILLED, RMAppState.FINAL_SAVING),
        RMAppEventType.RECOVER, new RMAppRecoveredTransition())
    .addTransition(RMAppState.NEW, RMAppState.KILLED, RMAppEventType.KILL,
        new AppKilledTransition())
    .addTransition(RMAppState.NEW, RMAppState.FINAL_SAVING,
        RMAppEventType.APP_REJECTED,
        new FinalSavingTransition(new AppRejectedTransition(),
          RMAppState.FAILED))

Looking up the matching transition in the table above, starting from the initial RMAppState.NEW state there is exactly one transition for the RMAppEventType.START event type:

addTransition(RMAppState.NEW, RMAppState.NEW_SAVING,
        RMAppEventType.START, new RMAppNewlySavingTransition())
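
The lookup described here, (current state, event type) to (target state, action), can be sketched with a small table-driven state machine. This is a simplified illustration under my own assumptions, not YARN's StateMachineFactory: the class StateMachineSketch, its Arc type, and the use of Runnable as the transition hook are all hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

final class StateMachineSketch {
  enum State { NEW, NEW_SAVING, SUBMITTED, ACCEPTED }
  enum EventType { START, APP_NEW_SAVED, APP_ACCEPTED }

  /** One arc of the table: the target state plus a hook run during the transition. */
  static final class Arc {
    final State target;
    final Runnable hook;
    Arc(State target, Runnable hook) {
      this.target = target;
      this.hook = hook;
    }
  }

  private final Map<State, Map<EventType, Arc>> table = new EnumMap<>(State.class);
  private State current;

  StateMachineSketch(State initial) {
    this.current = initial;
  }

  StateMachineSketch addTransition(State from, State to, EventType on, Runnable hook) {
    table.computeIfAbsent(from, s -> new EnumMap<EventType, Arc>(EventType.class))
        .put(on, new Arc(to, hook));
    return this;
  }

  /** Looks up the arc for (current state, event type), runs its hook, and moves on. */
  State doTransition(EventType type) {
    Map<EventType, Arc> arcs = table.get(current);
    Arc arc = (arcs == null) ? null : arcs.get(type);
    if (arc == null) {
      throw new IllegalStateException("Invalid event " + type + " at " + current);
    }
    arc.hook.run();
    current = arc.target;
    return current;
  }

  State getState() {
    return current;
  }

  /** Builds the three arcs this article follows for an application. */
  static StateMachineSketch newAppMachine() {
    return new StateMachineSketch(State.NEW)
        .addTransition(State.NEW, State.NEW_SAVING, EventType.START, () -> { })
        .addTransition(State.NEW_SAVING, State.SUBMITTED, EventType.APP_NEW_SAVED, () -> { })
        .addTransition(State.SUBMITTED, State.ACCEPTED, EventType.APP_ACCEPTED, () -> { });
  }

  public static void main(String[] args) {
    StateMachineSketch machine = newAppMachine();
    System.out.println(machine.doTransition(EventType.START)); // NEW_SAVING
  }
}
```

An event with no arc from the current state is rejected, which mirrors the role the InvalidStateTransitionException branch plays in RMAppImpl.handle().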

The transition class is RMAppNewlySavingTransition, and the transition action is RMAppNewlySavingTransition.transition():

// A SingleArcTransition: the transition has exactly one target state (one-to-one)
private static class RMAppTransition implements
      SingleArcTransition<RMAppImpl, RMAppEvent> {
    public void transition(RMAppImpl app, RMAppEvent event) {
    };
  }
// Extends the single-arc RMAppTransition above and overrides transition()
private static final class RMAppNewlySavingTransition extends RMAppTransition {
    @Override
    public void transition(RMAppImpl app, RMAppEvent event) {
      // Application lifetime
      long applicationLifetime =
          app.getApplicationLifetime(ApplicationTimeoutType.LIFETIME);
      applicationLifetime = app.scheduler
          .checkAndGetApplicationLifetime(app.queue, applicationLifetime);
      if (applicationLifetime > 0) {
        // calculate next timeout value
        Long newTimeout =
            Long.valueOf(app.submitTime + (applicationLifetime * 1000));
        app.rmContext.getRMAppLifetimeMonitor().registerApp(app.applicationId,
            ApplicationTimeoutType.LIFETIME, newTimeout);

        // Update the application's timeout
        app.applicationTimeouts.put(ApplicationTimeoutType.LIFETIME,
            newTimeout);

        LOG.info("Application " + app.applicationId
            + " is registered for timeout monitor, type="
            + ApplicationTimeoutType.LIFETIME + " value=" + applicationLifetime
            + " seconds");
      }

      // If recovery is enabled then store the application information in a
      // non-blocking call so make sure that RM has stored the information
      // needed to restart the AM after RM restart without further client
      // communication
      LOG.info("Storing application with id " + app.applicationId);
      app.rmContext.getStateStore().storeNewApplication(app);
    }
  }
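
The timeout computed in this transition is an absolute timestamp: the submit time in milliseconds plus the lifetime converted from seconds to milliseconds. A small worked example of that arithmetic (the class and method names are hypothetical, and the input values are made up for illustration):

```java
final class LifetimeTimeoutSketch {
  /** Absolute expiry in epoch millis: submit time plus the lifetime in seconds. */
  static long expiryMillis(long submitTimeMillis, long lifetimeSeconds) {
    return submitTimeMillis + lifetimeSeconds * 1000L;
  }

  public static void main(String[] args) {
    long submitTime = 1_500_000_000_000L; // illustrative submit time in millis
    long lifetime = 3600L;                // one hour, in seconds
    System.out.println(expiryMillis(submitTime, lifetime)); // 1500003600000
  }
}
```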

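In YARN, state machines mostly complement and drive one another. The RMApp state machine's move from NEW to NEW_SAVING drives the RMStateStore state machine, and the RMStateStore change in turn pushes the RMApp state machine forward.
RMAppNewlySavingTransition.transition() -> RMStateStore.storeNewApplication():

The ResourceManager uses RMStateStore.storeNewApplication() to persist the application's state. This also goes through an AsyncDispatcher, so it is non-blocking; once the store completes, the corresponding event is sent to the RMApp, triggering its state machine.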

  public void storeNewApplication(RMApp app) {
    ApplicationSubmissionContext context = app
                                            .getApplicationSubmissionContext();
    assert context instanceof ApplicationSubmissionContextPBImpl;
    ApplicationStateData appState =
        ApplicationStateData.newInstance(app.getSubmitTime(),
            app.getStartTime(), context, app.getUser(), app.getCallerContext());
    appState.setApplicationTimeouts(app.getApplicationTimeouts());
    getRMStateStoreEventHandler().handle(new RMStateStoreAppEvent(appState));
  }

The event type is:

 public RMStateStoreAppEvent(ApplicationStateData appState) {
    super(RMStateStoreEventType.STORE_APP);
    this.appState = appState;
  }

The matching state machine transition and its transition class are:

addTransition(RMStateStoreState.ACTIVE,
          EnumSet.of(RMStateStoreState.ACTIVE, RMStateStoreState.FENCED),
          RMStateStoreEventType.STORE_APP, new StoreAppTransition())

Now let's look at the transition operation in this class:

private static class StoreAppTransition
      implements MultipleArcTransition<RMStateStore, RMStateStoreEvent,
          RMStateStoreState> {
    public RMStateStoreState transition(RMStateStore store,
        RMStateStoreEvent event) {
      if (!(event instanceof RMStateStoreAppEvent)) {
        // should never happen
        LOG.error("Illegal event type: " + event.getClass());
        return RMStateStoreState.ACTIVE;
      }
      boolean isFenced = false;
      ApplicationStateData appState =
          ((RMStateStoreAppEvent) event).getAppState();
      ApplicationId appId =
          appState.getApplicationSubmissionContext().getApplicationId();
      LOG.info("Storing info for app: " + appId);
      try {
        store.storeApplicationStateInternal(appId, appState);
        store.notifyApplication(
            new RMAppEvent(appId, RMAppEventType.APP_NEW_SAVED));
      } catch (Exception e) {
        LOG.error("Error storing app: " + appId, e);
        if (e instanceof StoreLimitException) {
          store.notifyApplication(
              new RMAppEvent(appId, RMAppEventType.APP_SAVE_FAILED,
                  e.getMessage()));
        } else {
          isFenced = store.notifyStoreOperationFailedInternal(e);
        }
      }
      return finalState(isFenced);
    };

  }

It clearly does one main thing: after saving the application's information, it notifies the RMApp:

store.notifyApplication(
            new RMAppEvent(appId, RMAppEventType.APP_NEW_SAVED));

 private void notifyApplication(RMAppEvent event) {
    rmDispatcher.getEventHandler().handle(event);
  }
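
The back-and-forth between the application and the state store can be sketched as two parties that only talk through a shared event queue. This is a deliberately simplified, single-threaded illustration: the class name, the string-typed events, and the trace format are assumptions made for the example, not YARN code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class MutualNotificationSketch {
  /** Runs the START -> STORE_APP -> APP_NEW_SAVED chain and returns a trace of it. */
  static List<String> run() {
    List<String> trace = new ArrayList<>();
    Queue<String> events = new ArrayDeque<>(); // stand-in for the dispatcher queue
    String appState = "NEW";

    events.add("START");
    while (!events.isEmpty()) {
      String event = events.remove();
      switch (event) {
        case "START":         // app side: NEW -> NEW_SAVING, then ask the store to persist
          appState = "NEW_SAVING";
          events.add("STORE_APP");
          break;
        case "STORE_APP":     // store side: persist, then notify the app
          trace.add("store: saved");
          events.add("APP_NEW_SAVED");
          break;
        case "APP_NEW_SAVED": // app side: NEW_SAVING -> SUBMITTED
          appState = "SUBMITTED";
          break;
        default:
          throw new IllegalStateException("Unknown event " + event);
      }
      trace.add(event + " -> app is " + appState);
    }
    return trace;
  }

  public static void main(String[] args) {
    for (String line : run()) {
      System.out.println(line);
    }
  }
}
```

Neither side calls the other directly; each one only enqueues the event that the other side is waiting for.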

This fires the RMApp's RMAppEventType.APP_NEW_SAVED event; the state machines really do advance by helping each other. Back in RMAppImpl, here is the transition for this event and its action:

addTransition(RMAppState.NEW_SAVING, RMAppState.SUBMITTED,
        RMAppEventType.APP_NEW_SAVED, new AddApplicationToSchedulerTransition())

As its name suggests, the AddApplicationToSchedulerTransition class adds the application to the scheduler. Let's take a closer look:

 private static final class AddApplicationToSchedulerTransition extends
      RMAppTransition {
    public void transition(RMAppImpl app, RMAppEvent event) {
      app.handler.handle(
          new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
              app.applicationPriority, app.placementContext));
      // send the ATS create Event
      app.sendATSCreateEvent();
    }
  }

public AppAddedSchedulerEvent(ApplicationId applicationId, String queue,
      String user, boolean isAppRecovering, ReservationId reservationID,
      Priority appPriority, ApplicationPlacementContext placementContext) {
    super(SchedulerEventType.APP_ADDED);
    this.applicationId = applicationId;
    this.queue = queue;
    this.user = user;
    this.reservationID = reservationID;
    this.isAppRecovering = isAppRecovering;
    this.appPriority = appPriority;
    this.placementContext = placementContext;
  }

AddApplicationToSchedulerTransition.transition() actually fires a SchedulerEventType.APP_ADDED event, which is then handled by the schedulerDispatcher event handler registered by the ResourceManager:
ResourceManager.schedulerDispatcher.handle():

 public void handle(T event) {
    try {
      int qSize = eventQueue.size();
      if (qSize !=0 && qSize %1000 == 0) {
        LOG.info("Size of " + getName() + " event-queue is " + qSize);
      }
      int remCapacity = eventQueue.remainingCapacity();
      if (remCapacity < 1000) {
        LOG.info("Very low remaining capacity on " + getName() + "" +
            "event queue: " + remCapacity);
      }
      // Put the event on the scheduler's event queue
      this.eventQueue.put(event);
    } catch (InterruptedException e) {
      LOG.info("Interrupted. Trying to exit gracefully.");
    }
  }

EventProcessor takes events off the queue and processes them:

 private final class EventProcessor implements Runnable {
    @Override
    public void run() {

      T event;

      while (!stopped && !Thread.currentThread().isInterrupted()) {
        try {
          event = eventQueue.take();
        } catch (InterruptedException e) {
          LOG.error("Returning, interrupted : " + e);
          return; // TODO: Kill RM.
        }

        try {
          // handler here is the concrete scheduler that was passed in
          handler.handle(event);
        } catch (Throwable t) {
          // An error occurred, but we are shutting down anyway.
          // If it was an InterruptedException, the very act of
          // shutdown could have caused it and is probably harmless.
          if (stopped) {
            LOG.warn("Exception during shutdown: ", t);
            break;
          }
          LOG.fatal("Error in handling event type " + event.getType()
              + " to the Event Dispatcher", t);
          if (shouldExitOnError
              && !ShutdownHookManager.get().isShutdownInProgress()) {
            LOG.info("Exiting, bbye..");
            System.exit(-1);
          }
        }
      }
    }
  }
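
The producer/consumer split shown by handle() and EventProcessor can be sketched with a BlockingQueue and one worker thread. This is a minimal sketch under my own assumptions: QueuedDispatcherSketch and its methods are hypothetical names, and the error handling and queue-size logging of the real EventDispatcher are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class QueuedDispatcherSketch<T> {
  private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
  private final Consumer<T> handler;
  private final Thread worker;
  private volatile boolean stopped = false;

  QueuedDispatcherSketch(Consumer<T> handler) {
    this.handler = handler;
    this.worker = new Thread(this::drain, "sketch-event-processor");
    this.worker.setDaemon(true);
  }

  void start() {
    worker.start();
  }

  /** Producer side: enqueue the event and return without processing it. */
  void handle(T event) {
    try {
      queue.put(event);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
    }
  }

  void stop() {
    stopped = true;
    worker.interrupt();
  }

  /** Consumer side: take events one by one and pass them to the real handler. */
  private void drain() {
    while (!stopped && !Thread.currentThread().isInterrupted()) {
      try {
        handler.accept(queue.take());
      } catch (InterruptedException e) {
        return;
      }
    }
  }

  static List<String> demo() {
    List<String> seen = Collections.synchronizedList(new ArrayList<String>());
    CountDownLatch done = new CountDownLatch(2);
    QueuedDispatcherSketch<String> dispatcher =
        new QueuedDispatcherSketch<>(e -> { seen.add(e); done.countDown(); });
    dispatcher.start();
    dispatcher.handle("APP_ADDED");
    dispatcher.handle("APP_ATTEMPT_ADDED");
    try {
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      dispatcher.stop();
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

With a single consumer thread and a FIFO queue, events are handled in the order they were enqueued, while the thread that calls handle() is never blocked by the handler's work.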

The handler is passed in when ResourceManager initializes schedulerDispatcher:

 schedulerDispatcher = createSchedulerEventDispatcher();

 protected EventHandler<SchedulerEvent> createSchedulerEventDispatcher() {
    return new EventDispatcher(this.scheduler, "SchedulerEventDispatcher");
  }

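The ResourceManager's scheduler can be one of three implementations (FifoScheduler, CapacityScheduler, and FairScheduler); I pick FairScheduler for this analysis:
ResourceManager.schedulerDispatcher.handle() -> FairScheduler.handle():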

public void handle(SchedulerEvent event) {
    switch (event.getType()) {
    case NODE_ADDED:
    ...
    case NODE_REMOVED:
    ...
    case NODE_UPDATE:
    ... 
    case APP_ADDED:
      if (!(event instanceof AppAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAddedSchedulerEvent appAddedEvent = (AppAddedSchedulerEvent) event;
      String queueName =
          resolveReservationQueueName(appAddedEvent.getQueue(),
              appAddedEvent.getApplicationId(),
              appAddedEvent.getReservationID(),
              appAddedEvent.getIsAppRecovering());
      if (queueName != null) {
        addApplication(appAddedEvent.getApplicationId(),
            queueName, appAddedEvent.getUser(),
            appAddedEvent.getIsAppRecovering());
      }
      break;
    case APP_REMOVED:
    ... 
    case NODE_RESOURCE_UPDATE:
    ...
    case APP_ATTEMPT_ADDED:
    ...
    case APP_ATTEMPT_REMOVED:
    ...  
    case RELEASE_CONTAINER:
    ... 
    case CONTAINER_EXPIRED:
    ...  
    }
  }

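The event passed in is of type SchedulerEventType.APP_ADDED, so it hits the APP_ADDED case and we can ignore the other event types for now. Note that the scheduler handles events with a switch statement rather than a state machine.
ResourceManager.schedulerDispatcher.handle() -> FairScheduler.handle() -> FairScheduler.addApplication():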

protected void addApplication(ApplicationId applicationId,
      String queueName, String user, boolean isAppRecovering) {
    // Validate the queue name
    if (queueName == null || queueName.isEmpty()) {
      String message =
          "Reject application " + applicationId + " submitted by user " + user
              + " with an empty queue name.";
      LOG.info(message);
      rmContext.getDispatcher().getEventHandler().handle(
          new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED,
              message));
      return;
    }

    if (queueName.startsWith(".") || queueName.endsWith(".")) {
      String message =
          "Reject application " + applicationId + " submitted by user " + user
              + " with an illegal queue name " + queueName + ". "
              + "The queue name cannot start/end with period.";
      LOG.info(message);
      rmContext.getDispatcher().getEventHandler().handle(
          new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED,
              message));
      return;
    }

    try {
      writeLock.lock();
      // Assign the application to a specific queue
      RMApp rmApp = rmContext.getRMApps().get(applicationId);
      FSLeafQueue queue = assignToQueue(rmApp, queueName, user);
      if (queue == null) {
        return;
      }

      // ACL checks and related steps
      UserGroupInformation userUgi = UserGroupInformation.createRemoteUser(
          user);

      if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi) && !queue
          .hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
        String msg = "User " + userUgi.getUserName()
            + " cannot submit applications to queue " + queue.getName()
            + "(requested queuename is " + queueName + ")";
        LOG.info(msg);
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED, msg));
        return;
      }
      // Create a SchedulerApplication, the scheduler's representation of the application, and map the applicationId to it so it can be looked up by applicationId later
      SchedulerApplication<FSAppAttempt> application =
          new SchedulerApplication<FSAppAttempt>(queue, user);
      applications.put(applicationId, application);
      // Record application metrics
      queue.getMetrics().submitApp(user);

      LOG.info("Accepted application " + applicationId + " from user: " + user
          + ", in queue: " + queue.getName()
          + ", currently num of applications: " + applications.size());
      // Recovery-related handling
      if (isAppRecovering) {
        if (LOG.isDebugEnabled()) {
          LOG.debug(applicationId
              + " is recovering. Skip notifying APP_ACCEPTED");
        }
      } else {
        if (rmApp != null && rmApp.getApplicationSubmissionContext() != null) {
          // Set the queue name in the ApplicationSubmissionContext (ASC)
          rmApp.getApplicationSubmissionContext().setQueue(queue.getName());
        }
        // Drive the RMApp state machine forward
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
      }
    } finally {
      writeLock.unlock();
    }
  }
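
The two queue-name checks at the top of addApplication() can be isolated into a small helper. The following is only a sketch of those two rules as they appear in the code above; the class QueueNameCheckSketch and the method rejectReason are hypothetical names, and the sample queue names are made up.

```java
import java.util.Optional;

final class QueueNameCheckSketch {
  /** Returns a rejection reason, or an empty Optional if the name passes both checks. */
  static Optional<String> rejectReason(String queueName) {
    if (queueName == null || queueName.isEmpty()) {
      return Optional.of("empty queue name");
    }
    if (queueName.startsWith(".") || queueName.endsWith(".")) {
      return Optional.of("queue name cannot start/end with period");
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(rejectReason("root.dev").isPresent()); // false
    System.out.println(rejectReason("root.dev.").isPresent()); // true
  }
}
```

In the scheduler, a failed check does not throw; it sends an APP_REJECTED event back to the RMApp and returns.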

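The main work is to create the SchedulerApplication that represents the application in the scheduler, record the applicationId mapping for later lookups, and then send an RMAppEventType.APP_ACCEPTED event, which in turn drives the RMApp state machine. In RMAppImpl, the corresponding transition and its action are: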

addTransition(RMAppState.SUBMITTED, RMAppState.ACCEPTED,
        RMAppEventType.APP_ACCEPTED, new StartAppAttemptTransition())

private static final class StartAppAttemptTransition extends RMAppTransition {
    public void transition(RMAppImpl app, RMAppEvent event) {
      app.createAndStartNewAttempt(false);
    };
  }

 private void
      createAndStartNewAttempt(boolean transferStateFromPreviousAttempt) {
    createNewAttempt();
    handler.handle(new RMAppStartAttemptEvent(currentAttempt.getAppAttemptId(),
      transferStateFromPreviousAttempt));
  }
// Generate the ApplicationAttemptId for this attempt
private void createNewAttempt() {
    ApplicationAttemptId appAttemptId =
        ApplicationAttemptId.newInstance(applicationId, nextAttemptId++);
    createNewAttempt(appAttemptId);
  }
// Then create a new AppAttempt for that ApplicationAttemptId
private void createNewAttempt(ApplicationAttemptId appAttemptId) {
    BlacklistManager currentAMBlacklistManager;
    if (currentAttempt != null) {
      currentAMBlacklistManager = currentAttempt.getAMBlacklistManager();
    } else {
      if (amBlacklistingEnabled && !submissionContext.getUnmanagedAM()) {
        currentAMBlacklistManager = new SimpleBlacklistManager(
            RMServerUtils.getApplicableNodeCountForAM(rmContext, conf,
                getAMResourceRequests()),
            blacklistDisableThreshold);
      } else {
        currentAMBlacklistManager = new DisabledBlacklistManager();
      }
    }
    // Create a new RMAppAttempt
    RMAppAttempt attempt =
        new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
          submissionContext, conf, amReqs, this, currentAMBlacklistManager);
    // Store the attempt under its appAttemptId in attempts for later lookups
    attempts.put(appAttemptId, attempt);
    currentAttempt = attempt;
  }
 public RMAppStartAttemptEvent(ApplicationAttemptId appAttemptId,
      boolean transferStateFromPreviousAttempt) {
    super(appAttemptId, RMAppAttemptEventType.START);
    this.transferStateFromPreviousAttempt = transferStateFromPreviousAttempt;
  }

Now that the scheduler has accepted the application, an RMAppAttemptEventType.START event can be fired to begin one attempt at running it. Here are the matching transition and action in RMAppAttemptImpl:

addTransition(RMAppAttemptState.NEW, RMAppAttemptState.SUBMITTED,
          RMAppAttemptEventType.START, new AttemptStartedTransition())

private static final class AttemptStartedTransition extends BaseTransition {
    public void transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {
      // Whether to carry over state from the previous attempt
      boolean transferStateFromPreviousAttempt = false;
      if (event instanceof RMAppStartAttemptEvent) {
        transferStateFromPreviousAttempt =
            ((RMAppStartAttemptEvent) event)
              .getTransferStateFromPreviousAttempt();
      }
      appAttempt.startTime = System.currentTimeMillis();

      // Register the appAttempt with ApplicationMasterService, the service dedicated to communicating with AMs
      appAttempt.masterService
          .registerAppAttempt(appAttempt.applicationAttemptId);

      if (UserGroupInformation.isSecurityEnabled()) {
        appAttempt.clientTokenMasterKey =
            appAttempt.rmContext.getClientToAMTokenSecretManager()
              .createMasterKey(appAttempt.applicationAttemptId);
      }

      // Add the application attempt to the scheduler by sending it an event
      appAttempt.eventHandler.handle(new AppAttemptAddedSchedulerEvent(
        appAttempt.applicationAttemptId, transferStateFromPreviousAttempt));
    }
  }

  public AppAttemptAddedSchedulerEvent(
      ApplicationAttemptId applicationAttemptId,
      boolean transferStateFromPreviousAttempt) {
    this(applicationAttemptId, transferStateFromPreviousAttempt, false);
  }

  public AppAttemptAddedSchedulerEvent(
      ApplicationAttemptId applicationAttemptId,
      boolean transferStateFromPreviousAttempt,
      boolean isAttemptRecovering) {
    super(SchedulerEventType.APP_ATTEMPT_ADDED);
    this.applicationAttemptId = applicationAttemptId;
    this.transferStateFromPreviousAttempt = transferStateFromPreviousAttempt;
    this.isAttemptRecovering = isAttemptRecovering;
  }

The main work is to add this application attempt to the scheduler and notify the scheduler to react. The event is SchedulerEventType.APP_ATTEMPT_ADDED, and FairScheduler handles it as follows:

public void handle(SchedulerEvent event) {
    switch (event.getType()) {
     case APP_ATTEMPT_ADDED:
      if (!(event instanceof AppAttemptAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAttemptAddedSchedulerEvent appAttemptAddedEvent =
          (AppAttemptAddedSchedulerEvent) event;
      addApplicationAttempt(appAttemptAddedEvent.getApplicationAttemptId(),
        appAttemptAddedEvent.getTransferStateFromPreviousAttempt(),
        appAttemptAddedEvent.getIsAttemptRecovering());
      break;
    }
 }

This calls FairScheduler.addApplicationAttempt() to add the application attempt to the scheduler:

  protected void addApplicationAttempt(
      ApplicationAttemptId applicationAttemptId,
      boolean transferStateFromPreviousAttempt,
      boolean isAttemptRecovering) {
    try {
      writeLock.lock();
      // Look up the scheduler's SchedulerApplication by applicationId
      SchedulerApplication<FSAppAttempt> application = applications.get(
          applicationAttemptId.getApplicationId());
      String user = application.getUser();
      FSLeafQueue queue = (FSLeafQueue) application.getQueue();
      // Create an FSAppAttempt, the scheduler's representation of the AppAttempt
      FSAppAttempt attempt = new FSAppAttempt(this, applicationAttemptId, user,
          queue, new ActiveUsersManager(getRootQueueMetrics()), rmContext);
      if (transferStateFromPreviousAttempt) {
        attempt.transferStateFromPreviousAttempt(
            application.getCurrentAppAttempt());
      }
      // Make the newly created FSAppAttempt the current attempt
      application.setCurrentAppAttempt(attempt);

      boolean runnable = maxRunningEnforcer.canAppBeRunnable(queue, attempt);
      queue.addApp(attempt, runnable);
      if (runnable) {
        maxRunningEnforcer.trackRunnableApp(attempt);
      } else{
        maxRunningEnforcer.trackNonRunnableApp(attempt);
      }
      // Record the attempt in the queue metrics
      queue.getMetrics().submitAppAttempt(user);

      LOG.info("Added Application Attempt " + applicationAttemptId
          + " to scheduler from user: " + user);

      if (isAttemptRecovering) {
        // When recovering an attempt, no notification is needed
        if (LOG.isDebugEnabled()) {
          LOG.debug(applicationAttemptId
              + " is recovering. Skipping notifying ATTEMPT_ADDED");
        }
      } else{
        // Send an RMAppAttemptEventType.ATTEMPT_ADDED event to the RMAppAttempt
        rmContext.getDispatcher().getEventHandler().handle(
            new RMAppAttemptEvent(applicationAttemptId,
                RMAppAttemptEventType.ATTEMPT_ADDED));
      }
    } finally {
      writeLock.unlock();
    }
  }

The corresponding transition in RMAppAttemptImpl is:

addTransition(RMAppAttemptState.SUBMITTED, 
          EnumSet.of(RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING,
                     RMAppAttemptState.SCHEDULED),
          RMAppAttemptEventType.ATTEMPT_ADDED,
          new ScheduleTransition())

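This is a multi-arc transition: the target state depends on whether the AM is managed, meaning it runs in a container requested from the RM, or unmanaged, meaning it is started outside the RM's control.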
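
A multi-arc transition differs from a single-arc one in that the transition code itself returns the next state, chosen from a declared set of allowed targets. A minimal sketch of that idea, assuming a boolean flag stands in for the unmanaged-AM check (the class MultiArcSketch and its members are hypothetical, not YARN's MultipleArcTransition interface):

```java
import java.util.EnumSet;
import java.util.Set;

final class MultiArcSketch {
  enum AttemptState { SUBMITTED, SCHEDULED, LAUNCHED_UNMANAGED_SAVING }

  // The set of states this transition is allowed to end in.
  static final Set<AttemptState> ALLOWED_TARGETS =
      EnumSet.of(AttemptState.SCHEDULED, AttemptState.LAUNCHED_UNMANAGED_SAVING);

  /** The transition picks its own target state and returns it. */
  static AttemptState onAttemptAdded(boolean unmanagedAm) {
    AttemptState next = unmanagedAm
        ? AttemptState.LAUNCHED_UNMANAGED_SAVING
        : AttemptState.SCHEDULED;
    if (!ALLOWED_TARGETS.contains(next)) {
      throw new IllegalStateException("Target state not declared: " + next);
    }
    return next;
  }

  public static void main(String[] args) {
    System.out.println(onAttemptAdded(false)); // SCHEDULED
    System.out.println(onAttemptAdded(true));  // LAUNCHED_UNMANAGED_SAVING
  }
}
```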
ScheduleTransition.transition:

 public static final class ScheduleTransition
      implements
      MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {

    public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {
      ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
      // Managed AM: its container is requested from the RM and managed by it
      if (!subCtx.getUnmanagedAM()) {
        for (ResourceRequest amReq : appAttempt.amReqs) {
          // The AM needs exactly one container, at the AM container priority
          amReq.setNumContainers(1);
          amReq.setPriority(AM_CONTAINER_PRIORITY);
        }

        int numNodes =
            RMServerUtils.getApplicableNodeCountForAM(appAttempt.rmContext,
                appAttempt.conf, appAttempt.amReqs);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Setting node count for blacklist to " + numNodes);
        }
        appAttempt.getAMBlacklistManager().refreshNodeHostCount(numNodes);

        ResourceBlacklistRequest amBlacklist =
            appAttempt.getAMBlacklistManager().getBlacklistUpdates();
        if (LOG.isDebugEnabled()) {
          LOG.debug("Using blacklist for AM: additions(" +
              amBlacklist.getBlacklistAdditions() + ") and removals(" +
              amBlacklist.getBlacklistRemovals() + ")");
        }
        // Call into the concrete scheduler; here that is FairScheduler.allocate()
        Allocation amContainerAllocation =
            appAttempt.scheduler.allocate(
                appAttempt.applicationAttemptId,
                appAttempt.amReqs, null, EMPTY_CONTAINER_RELEASE_LIST,
                amBlacklist.getBlacklistAdditions(),
                amBlacklist.getBlacklistRemovals(),
                new ContainerUpdates());
        if (amContainerAllocation != null
            && amContainerAllocation.getContainers() != null) {
          assert (amContainerAllocation.getContainers().size() == 0);
        }
        // No container has been allocated yet (see the assert above); move to SCHEDULED
        return RMAppAttemptState.SCHEDULED;
      } else {
        // Unmanaged AM: the AM is not launched or managed by the RM. Store the attempt
        // and move to the LAUNCHED_UNMANAGED_SAVING state
        appAttempt.storeAttempt();
        return RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING;
      }
    }
  }
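
To make the idea of a multiple-arc transition concrete, here is a minimal sketch that does not use any YARN classes. MultiArcSketch, Attempt and Arc are hypothetical names invented for this illustration; the only thing taken from the code above is the rule that a managed AM ends up in SCHEDULED and an unmanaged AM ends up in LAUNCHED_UNMANAGED_SAVING.

```java
import java.util.EnumSet;
import java.util.function.Function;

// Hypothetical, simplified stand-in for a multiple-arc transition: the hook
// inspects its operand and picks one of several allowed post-states.
class MultiArcSketch {
  enum AttemptState { SUBMITTED, SCHEDULED, LAUNCHED_UNMANAGED_SAVING }

  /** Illustrative operand; it only carries the unmanaged-AM flag. */
  static final class Attempt {
    final boolean unmanagedAM;
    AttemptState state = AttemptState.SUBMITTED;

    Attempt(boolean unmanagedAM) {
      this.unmanagedAM = unmanagedAM;
    }
  }

  /** The set of allowed post-states plus the hook that chooses among them. */
  static final class Arc {
    final EnumSet<AttemptState> postStates;
    final Function<Attempt, AttemptState> hook;

    Arc(EnumSet<AttemptState> postStates, Function<Attempt, AttemptState> hook) {
      this.postStates = postStates;
      this.hook = hook;
    }

    void fire(Attempt attempt) {
      AttemptState next = hook.apply(attempt);
      // The hook may only return a state that was declared up front.
      if (!postStates.contains(next)) {
        throw new IllegalStateException("Undeclared post-state: " + next);
      }
      attempt.state = next;
    }
  }

  // Mirrors the decision in ScheduleTransition, with all side effects left out.
  static final Arc ATTEMPT_ADDED = new Arc(
      EnumSet.of(AttemptState.SCHEDULED, AttemptState.LAUNCHED_UNMANAGED_SAVING),
      a -> a.unmanagedAM
          ? AttemptState.LAUNCHED_UNMANAGED_SAVING
          : AttemptState.SCHEDULED);
}
```

Declaring the allowed post-states separately from the hook is what lets the state machine reject a hook that returns a state nobody registered.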

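Note that this allocate() call is expected to return no containers (the assert checks that the returned list is empty), so the attempt moves to RMAppAttemptState.SCHEDULED and stays there. How does the state machine move forward from that point? Let's follow the scheduler's allocation path and see what happens to the state machines once containers have actually been allocated.
appAttempt.scheduler.allocate() actually calls FairScheduler.allocate(). Looking only at the container-related part: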

 public Allocation allocate(ApplicationAttemptId appAttemptId,
      List<ResourceRequest> ask, List<SchedulingRequest> schedulingRequests,
      List<ContainerId> release, List<String> blacklistAdditions,
      List<String> blacklistRemovals, ContainerUpdates updateRequests) {

    List<Container> newlyAllocatedContainers =
        application.pullNewlyAllocatedContainers();

    return new Allocation(newlyAllocatedContainers, headroom,
        preemptionContainerIds, null, null,
        application.pullUpdatedNMTokens(), null, null,
        application.pullNewlyPromotedContainers(),
        application.pullNewlyDemotedContainers(),
        application.pullPreviousAttemptContainers());
  }

Containers are collected by calling application.pullNewlyAllocatedContainers():

 public List<Container> pullNewlyAllocatedContainers() {
    try {
      writeLock.lock();
      List<Container> returnContainerList = new ArrayList<Container>(
          newlyAllocatedContainers.size());

      Iterator<RMContainer> i = newlyAllocatedContainers.iterator();
      while (i.hasNext()) {
        RMContainer rmContainer = i.next();
        Container updatedContainer =
            updateContainerAndNMToken(rmContainer, null);
        // A successfully updated container is added to returnContainerList and removed from the pending list
        if (updatedContainer != null) {
          returnContainerList.add(updatedContainer);
          i.remove();
        }
      }
      return returnContainerList;
    } finally {
      writeLock.unlock();
    }
  }
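
The "pull" pattern used above can be sketched in isolation. This is only an illustration: PullSketch and its generic pull() helper are hypothetical names, and the update function stands in for updateContainerAndNMToken(), which is assumed here to return null when a container cannot be handed out yet. Locking is left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

class PullSketch {
  /**
   * Moves every item for which 'update' returns a non-null result out of
   * 'pending'. Items whose update yields null stay in 'pending' and are
   * retried on the next pull.
   */
  static <T, R> List<R> pull(List<T> pending, Function<T, R> update) {
    List<R> pulled = new ArrayList<>(pending.size());
    Iterator<T> it = pending.iterator();
    while (it.hasNext()) {
      R updated = update.apply(it.next());
      if (updated != null) {
        pulled.add(updated);
        // Remove through the iterator to avoid ConcurrentModificationException.
        it.remove();
      }
    }
    return pulled;
  }
}
```

Because removal only happens on success, a container that could not be updated in one call is not lost; it is simply returned by a later call.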

Next, updateContainerAndNMToken(rmContainer, null):

 private Container updateContainerAndNMToken(RMContainer rmContainer,
      ContainerUpdateType updateType) {
    try {
    // Only the key part is shown
    ......
    if (updateType == null) {
      // This is a newly allocated container
      rmContainer.handle(new RMContainerEvent(
          rmContainer.getContainerId(), RMContainerEventType.ACQUIRED));
    } else {
    }
    return container;
  }

RMContainerImpl reacts to this event with the following state transition:

addTransition(RMContainerState.NEW, RMContainerState.ACQUIRED,
        RMContainerEventType.ACQUIRED, new AcquiredTransition())

The transition hook that runs with it is:

 private static final class AcquiredTransition extends BaseTransition {

    public void transition(RMContainerImpl container, RMContainerEvent event) {
      // Register the container with the allocation expirer
      container.containerAllocationExpirer.register(
          new AllocationExpirationInfo(container.getContainerId()));

      // Notify the RMApp so that its state machine can advance
      container.eventHandler.handle(new RMAppRunningOnNodeEvent(container
          .getApplicationAttemptId().getApplicationId(), container.nodeId));
    }
  }
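
The role of the allocation expirer can be illustrated with a small sketch: a container is registered when it is acquired and must be unregistered (for example, once the NM reports it) before a timeout passes, otherwise it is treated as expired. AllocationExpirerSketch is a hypothetical class written for this article. It takes the current time as a parameter instead of running a monitoring thread, which is an assumption made to keep the example deterministic and is not how the YARN service is driven.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class AllocationExpirerSketch {
  // Container id -> time (ms) at which it was registered.
  private final Map<String, Long> registeredAt = new HashMap<>();
  private final long expiryMs;

  AllocationExpirerSketch(long expiryMs) {
    this.expiryMs = expiryMs;
  }

  synchronized void register(String containerId, long nowMs) {
    registeredAt.put(containerId, nowMs);
  }

  synchronized void unregister(String containerId) {
    registeredAt.remove(containerId);
  }

  /** Returns and forgets every container registered more than expiryMs ago. */
  synchronized List<String> expire(long nowMs) {
    List<String> expired = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = registeredAt.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (nowMs - entry.getValue() > expiryMs) {
        expired.add(entry.getKey());
        it.remove();
      }
    }
    return expired;
  }
}
```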

 public RMAppRunningOnNodeEvent(ApplicationId appId, NodeId node) {
    super(appId, RMAppEventType.APP_RUNNING_ON_NODE);
    this.node = node;
  }

This notifies the RMApp that the application now has a container on a given NM node:

private static final class AppRunningOnNodeTransition extends RMAppTransition {
    public void transition(RMAppImpl app, RMAppEvent event) {
      RMAppRunningOnNodeEvent nodeAddedEvent = (RMAppRunningOnNodeEvent) event;

      // If the app is already in a final state, tell the RMNode to clean up this app
      if (isAppInFinalState(app)) {
        app.handler.handle(
            new RMNodeCleanAppEvent(nodeAddedEvent.getNodeId(), nodeAddedEvent
                .getApplicationId()));
        return;
      }

      // Otherwise record the node for further processing
      app.ranNodes.add(nodeAddedEvent.getNodeId());

    }
  }

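Once containers have been collected, the application can run in them. That brings us to the other question: when are there containers to collect? The answer is the NM heartbeat. Inside the NodeManager's JVM, NodeStatusUpdaterImpl runs a thread that periodically collects the node's status and reports it to the RM:
NodeStatusUpdaterImpl.statusUpdaterRunnable.run() ->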

private class StatusUpdaterRunnable implements Runnable {

    public void run() {
      int lastHeartbeatID = 0;
      while (!isStopped) {
        // Send heartbeat
        try {
          NodeHeartbeatResponse response = null;
          Set<NodeLabel> nodeLabelsForHeartbeat =
              nodeLabelsHandler.getNodeLabelsForHeartbeat();
          // Collect this node's status
          NodeStatus nodeStatus = getNodeStatus(lastHeartbeatID);
          // Build the heartbeat request
          NodeHeartbeatRequest request =
              NodeHeartbeatRequest.newInstance(nodeStatus,
                  NodeStatusUpdaterImpl.this.context
                      .getContainerTokenSecretManager().getCurrentKey(),
                  NodeStatusUpdaterImpl.this.context
                      .getNMTokenSecretManager().getCurrentKey(),
                  nodeLabelsForHeartbeat,
                  NodeStatusUpdaterImpl.this.context
                      .getRegisteringCollectors());

          // Send the heartbeat to the RM and receive the response
          response = resourceTracker.nodeHeartbeat(request);
          //get next heartbeat interval from response
          nextHeartBeatInterval = response.getNextHeartBeatInterval();
          updateMasterKeys(response);

          if (!handleShutdownOrResyncCommand(response)) {
            nodeLabelsHandler.verifyRMHeartbeatResponseForNodeLabels(
                response);

            // Following the RM's instructions, clean up containers and applications that no longer need to run
            removeOrTrackCompletedContainersFromContext(response
                .getContainersToBeRemovedFromNM());

            logAggregationReportForAppsTempList.clear();
            lastHeartbeatID = response.getResponseId();
            List<ContainerId> containersToCleanup = response
                .getContainersToCleanup();
            if (!containersToCleanup.isEmpty()) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedContainersEvent(containersToCleanup,
                      CMgrCompletedContainersEvent.Reason
                          .BY_RESOURCEMANAGER));
            }
            List<ApplicationId> appsToCleanup =
                response.getApplicationsToCleanup();
            //Only start tracking for keepAlive on FINISH_APP
            trackAppsForKeepAlive(appsToCleanup);
            if (!appsToCleanup.isEmpty()) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedAppsEvent(appsToCleanup,
                      CMgrCompletedAppsEvent.Reason.BY_RESOURCEMANAGER));
            }
            Map<ApplicationId, ByteBuffer> systemCredentials =
                response.getSystemCredentialsForApps();
            if (systemCredentials != null && !systemCredentials.isEmpty()) {
              ((NMContext) context).setSystemCrendentialsForApps(
                  parseCredentials(systemCredentials));
            }
            List<org.apache.hadoop.yarn.api.records.Container>
                containersToUpdate = response.getContainersToUpdate();
            if (!containersToUpdate.isEmpty()) {
              dispatcher.getEventHandler().handle(
                  new CMgrUpdateContainersEvent(containersToUpdate));
            }


            List<SignalContainerRequest> containersToSignal = response
                .getContainersToSignalList();
            if (!containersToSignal.isEmpty()) {
              dispatcher.getEventHandler().handle(
                  new CMgrSignalContainersEvent(containersToSignal));
            }

            // Update QueuingLimits if ContainerManager supports queuing
            ContainerQueuingLimit queuingLimit =
                response.getContainerQueuingLimit();
            if (queuingLimit != null) {
              context.getContainerManager().updateQueuingLimit(queuingLimit);
            }
          }
          // Handling node resource update case.
          Resource newResource = response.getResource();
          if (newResource != null) {
            updateNMResource(newResource);
            if (LOG.isDebugEnabled()) {
              LOG.debug("Node's resource is updated to " +
                  newResource.toString());
            }
          }
          if (YarnConfiguration.timelineServiceV2Enabled(context.getConf())) {
            updateTimelineCollectorData(response);
          }

        } catch (ConnectException e) {
          //catch and throw the exception if tried MAX wait time to connect RM
          dispatcher.getEventHandler().handle(
              new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
          // failed to connect to RM.
          failedToConnect = true;
          throw new YarnRuntimeException(e);
        } catch (Exception e) {

          // TODO Better error handling. Thread can die with the rest of the
          // NM still running.
          LOG.error("Caught exception in status-updater", e);
        } finally {
          synchronized (heartbeatMonitor) {
            nextHeartBeatInterval = nextHeartBeatInterval <= 0 ?
                YarnConfiguration.DEFAULT_RM_NM_HEARTBEAT_INTERVAL_MS :
                nextHeartBeatInterval;
            try {
              // Wait for the heartbeat interval, then start the next round
              heartbeatMonitor.wait(nextHeartBeatInterval);
            } catch (InterruptedException e) {
              // Do Nothing
            }
          }
        }
      }
    }
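
Stripped of all the response handling, the loop above has a simple shape: send a heartbeat, take the next interval from the response, fall back to a default when the response gives a non-positive value, then wait on a monitor. The sketch below shows only that shape. HeartbeatSketch, its constructor argument and the LongSupplier standing in for the RPC call are all hypothetical; the bounded number of rounds is an assumption added so the example terminates.

```java
import java.util.function.LongSupplier;

class HeartbeatSketch {
  private final Object monitor = new Object();
  private final long defaultIntervalMs;

  HeartbeatSketch(long defaultIntervalMs) {
    this.defaultIntervalMs = defaultIntervalMs;
  }

  /** A non-positive interval from the response falls back to the default. */
  long effectiveInterval(long fromResponseMs) {
    return fromResponseMs <= 0 ? defaultIntervalMs : fromResponseMs;
  }

  /**
   * Runs a fixed number of heartbeat rounds. 'sendHeartbeat' stands in for the
   * RPC and returns the next interval suggested by the server.
   * Returns how many rounds succeeded.
   */
  int run(LongSupplier sendHeartbeat, int rounds) {
    int succeeded = 0;
    for (int i = 0; i < rounds; i++) {
      long next = 0;
      try {
        next = sendHeartbeat.getAsLong();
        succeeded++;
      } catch (RuntimeException e) {
        // A failed round still waits below; 'next' stays 0, so the default applies.
      }
      synchronized (monitor) {
        try {
          // Waiting on a monitor (rather than sleeping) lets another thread
          // call notify() to trigger an early heartbeat.
          monitor.wait(effectiveInterval(next));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    return succeeded;
  }
}
```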

Now let's look at what the RM does when it receives an NM heartbeat:
ResourceTracker.nodeHeartbeat() -> ResourceTrackerService.nodeHeartbeat():

 public NodeHeartbeatResponse nodeHeartbeat(NodeHeartbeatRequest request)
      throws YarnException, IOException {

    NodeStatus remoteNodeStatus = request.getNodeStatus();

    NodeId nodeId = remoteNodeStatus.getNodeId();
    // Some validity checks omitted

    // Record that the node is alive
    this.nmLivelinessMonitor.receivedPing(nodeId);
    this.decommissioningWatcher.update(rmNode, remoteNodeStatus);


    // Logging and related code omitted
    // Prepare the heartbeat response
    NodeHeartbeatResponse nodeHeartBeatResponse =
        YarnServerBuilderUtils.newNodeHeartbeatResponse(
            getNextResponseId(lastNodeHeartbeatResponse.getResponseId()),
            NodeAction.NORMAL, null, null, null, null, nextHeartBeatInterval);
    rmNode.setAndUpdateNodeHeartbeatResponse(nodeHeartBeatResponse);

    populateKeys(request, nodeHeartBeatResponse);

    ConcurrentMap<ApplicationId, ByteBuffer> systemCredentials =
        rmContext.getSystemCredentialsForApps();
    if (!systemCredentials.isEmpty()) {
      nodeHeartBeatResponse.setSystemCredentialsForApps(systemCredentials);
    }


    // Send the status update to the RMNode, saving the latest heartbeat response
    RMNodeStatusEvent nodeStatusEvent =
        new RMNodeStatusEvent(nodeId, remoteNodeStatus);
    if (request.getLogAggregationReportsForApps() != null
        && !request.getLogAggregationReportsForApps().isEmpty()) {
      nodeStatusEvent.setLogAggregationReportsForApps(request
        .getLogAggregationReportsForApps());
    }

    // Use this event to drive the RMNodeImpl state machine
    this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);


   //...
    return nodeHeartBeatResponse;
  }

The RM keeps an RMNodeImpl object for every registered NM. When a heartbeat arrives from an NM, a STATUS_UPDATE event is sent to the corresponding RMNodeImpl:

addTransition(NodeState.RUNNING,
          EnumSet.of(NodeState.RUNNING, NodeState.UNHEALTHY),
          RMNodeEventType.STATUS_UPDATE,
          new StatusUpdateWhenHealthyTransition())

  public static class StatusUpdateWhenHealthyTransition implements
      MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {

    public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {

      RMNodeStatusEvent statusEvent = (RMNodeStatusEvent) event;
      rmNode.setOpportunisticContainersStatus(
          statusEvent.getOpportunisticContainersStatus());
      NodeHealthStatus remoteNodeHealthStatus = updateRMNodeFromStatusEvents(
          rmNode, statusEvent);
      NodeState initialState = rmNode.getState();
      boolean isNodeDecommissioning =
          initialState.equals(NodeState.DECOMMISSIONING);
      if (isNodeDecommissioning) {
        List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
        if (rmNode.runningApplications.isEmpty() &&
            (keepAliveApps == null || keepAliveApps.isEmpty())) {
          RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
          return NodeState.DECOMMISSIONED;
        }
      }
      // The node reports itself as unhealthy
      if (!remoteNodeHealthStatus.getIsNodeHealthy()) {
        LOG.info("Node " + rmNode.nodeId +
            " reported UNHEALTHY with details: " +
            remoteNodeHealthStatus.getHealthReport());
        // if a node in decommissioning receives an unhealthy report,
        // it will stay in decommissioning.
        if (isNodeDecommissioning) {
          return NodeState.DECOMMISSIONING;
        } else {
          reportNodeUnusable(rmNode, NodeState.UNHEALTHY);
          return NodeState.UNHEALTHY;
        }
      }
      // The node is healthy
      rmNode.handleContainerStatus(statusEvent.getContainers());
      rmNode.handleReportedIncreasedContainers(
          statusEvent.getNMReportedIncreasedContainers());

      List<LogAggregationReport> logAggregationReportsForApps =
          statusEvent.getLogAggregationReportsForApps();
      if (logAggregationReportsForApps != null
          && !logAggregationReportsForApps.isEmpty()) {
        rmNode.handleLogAggregationStatus(logAggregationReportsForApps);
      }

      if(rmNode.nextHeartBeat) {
        rmNode.nextHeartBeat = false;
        rmNode.context.getDispatcher().getEventHandler().handle(
            new NodeUpdateSchedulerEvent(rmNode));
      }

      // Update DTRenewer in secure mode to keep these apps alive. Today this is
      // needed for log-aggregation to finish long after the apps are gone.
      if (UserGroupInformation.isSecurityEnabled()) {
        rmNode.context.getDelegationTokenRenewer().updateKeepAliveApplications(
          statusEvent.getKeepAliveAppIds());
      }

      return initialState;
    }
  }

The healthy-node path is our main line, so let's follow it:
rmNode.handleContainerStatus():

 private void handleContainerStatus(List<ContainerStatus> containerStatuses) {
    // Collect the containers newly launched and newly completed on this node
    List<ContainerStatus> newlyLaunchedContainers =
        new ArrayList<ContainerStatus>();
    List<ContainerStatus> newlyCompletedContainers =
        new ArrayList<ContainerStatus>();
    int numRemoteRunningContainers = 0;
    for (ContainerStatus remoteContainer : containerStatuses) {
      ContainerId containerId = remoteContainer.getContainerId();

      // Containers already being cleaned up because their application was killed need no further processing
      if (containersToClean.contains(containerId)) {
        LOG.info("Container " + containerId + " already scheduled for "
            + "cleanup, no further processing");
        continue;
      }

      ApplicationId containerAppId =
          containerId.getApplicationAttemptId().getApplicationId();

      if (finishedApplications.contains(containerAppId)) {
        LOG.info("Container " + containerId
            + " belongs to an application that is already killed,"
            + " no further processing");
        continue;
      } else if (!runningApplications.contains(containerAppId)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Container " + containerId
              + " is the first container get launched for application "
              + containerAppId);
        }
        handleRunningAppOnNode(this, context, containerAppId, nodeId);
      }

      // Handle containers in the RUNNING state
      if (remoteContainer.getState() == ContainerState.RUNNING) {
        ++numRemoteRunningContainers;
        if (!launchedContainers.contains(containerId)) {
          // A container that has just been launched; the RM needs to know about it
          launchedContainers.add(containerId);
          newlyLaunchedContainers.add(remoteContainer);
          // Unregister it from containerAllocationExpirer
          containerAllocationExpirer
              .unregister(new AllocationExpirationInfo(containerId));
        }
      } else {
        // A container that has just finished: remove it from the launched list
        launchedContainers.remove(containerId);
        if (completedContainers.add(containerId)) {
          newlyCompletedContainers.add(remoteContainer);
        }
        // Unregister it from containerAllocationExpirer
        containerAllocationExpirer
            .unregister(new AllocationExpirationInfo(containerId));
      }
    }
    // Lost containers are treated as newly completed containers
    List<ContainerStatus> lostContainers =
        findLostContainers(numRemoteRunningContainers, containerStatuses);
    for (ContainerStatus remoteContainer : lostContainers) {
      ContainerId containerId = remoteContainer.getContainerId();
      if (completedContainers.add(containerId)) {
        newlyCompletedContainers.add(remoteContainer);
      }
    }
    // Finally, queue the newly launched and newly completed containers in nodeUpdateQueue
    if (newlyLaunchedContainers.size() != 0
        || newlyCompletedContainers.size() != 0) {
      nodeUpdateQueue.add(new UpdatedContainerInfo(newlyLaunchedContainers,
          newlyCompletedContainers));
    }
  }
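
The core of handleContainerStatus() is a diff between what the node reports and what the RM already knows. The sketch below keeps only that part. ContainerDiffSketch, its State enum and the Update holder are hypothetical names; container ids are plain strings, and the cleanup checks, the allocation expirer and the lost-container handling shown above are deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ContainerDiffSketch {
  enum State { RUNNING, COMPLETE }

  /** What changed in one heartbeat. */
  static final class Update {
    final List<String> newlyLaunched = new ArrayList<>();
    final List<String> newlyCompleted = new ArrayList<>();

    boolean isEmpty() {
      return newlyLaunched.isEmpty() && newlyCompleted.isEmpty();
    }
  }

  // State remembered between heartbeats.
  final Set<String> launched = new HashSet<>();
  final Set<String> completed = new HashSet<>();

  /** Classifies one heartbeat's container report against the remembered state. */
  Update handle(Map<String, State> reported) {
    Update update = new Update();
    for (Map.Entry<String, State> entry : reported.entrySet()) {
      String id = entry.getKey();
      if (entry.getValue() == State.RUNNING) {
        // Set.add returns true only the first time, so repeats are ignored.
        if (launched.add(id)) {
          update.newlyLaunched.add(id);
        }
      } else {
        launched.remove(id);
        if (completed.add(id)) {
          update.newlyCompleted.add(id);
        }
      }
    }
    return update;
  }
}
```

Since the sets persist across calls, reporting the same status twice produces an empty Update the second time, which matches the check above that only queues an UpdatedContainerInfo when something is new.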

Back in StatusUpdateWhenHealthyTransition:

  public static class StatusUpdateWhenHealthyTransition implements
      MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
    public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {

     //....

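      // Notify the scheduler, unless an earlier node update is still pending
      if(rmNode.nextHeartBeat) {
        // Clear the flag so that the event is sent only once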
        rmNode.nextHeartBeat = false;
        rmNode.context.getDispatcher().getEventHandler().handle(
            new NodeUpdateSchedulerEvent(rmNode));
      }



      return initialState;
    }
  }

new NodeUpdateSchedulerEvent():

 public NodeUpdateSchedulerEvent(RMNode rmNode) {
    super(SchedulerEventType.NODE_UPDATE);
    this.rmNode = rmNode;
  }

This raises a SchedulerEventType.NODE_UPDATE event for the scheduler, which in our case is FairScheduler:
FairScheduler.handle():

 public void handle(SchedulerEvent event) {
    switch (event.getType()) {

     case NODE_UPDATE:
      if (!(event instanceof NodeUpdateSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent)event;
      nodeUpdate(nodeUpdatedEvent.getRMNode());
      break;

    //.........
    }
  }

FairScheduler.handle() -> FairScheduler.nodeUpdate():

 protected void nodeUpdate(RMNode nm) {
    try {
      writeLock.lock();
      long start = getClock().getTime();
      super.nodeUpdate(nm);

      FSSchedulerNode fsNode = getFSSchedulerNode(nm.getNodeID());
      attemptScheduling(fsNode);

      long duration = getClock().getTime() - start;
      fsOpDurations.addNodeUpdateDuration(duration);
    } finally {
      writeLock.unlock();
    }
  }

FairScheduler.handle() -> FairScheduler.nodeUpdate() -> FairScheduler.attemptScheduling()

 void attemptScheduling(FSSchedulerNode node) {
    try {
      writeLock.lock();
      // Some validation omitted
      // First assign containers for resources that were preempted on this node
      assignPreemptedContainers(node);
      FSAppAttempt reservedAppSchedulable = node.getReservedAppSchedulable();
      boolean validReservation = false;
      // Then assign the reserved container, if there is one
      if (reservedAppSchedulable != null) {
        validReservation = reservedAppSchedulable.assignReservedContainer(node);
      }
      if (!validReservation) {
        // No valid reservation: schedule at the queue farthest below its fair share
        int assignedContainers = 0;
        Resource assignedResource = Resources.clone(Resources.none());
        // Cap for this pass: half of the node's unallocated resource
        Resource maxResourcesToAssign = Resources.multiply(
            node.getUnallocatedResource(), 0.5f);
        while (node.getReservedContainer() == null) {
          // Starting from the root queue, pick an app by the scheduling policy and assign a container
          Resource assignment = queueMgr.getRootQueue().assignContainer(node);
          if (assignment.equals(Resources.none())) {
            break;
          }

          assignedContainers++;
          Resources.addTo(assignedResource, assignment);
          if (!shouldContinueAssigning(assignedContainers, maxResourcesToAssign,
              assignedResource)) {
            break;
          }
        }
      }
      // Update the root queue metrics
      updateRootQueueMetrics();
    } finally {
      writeLock.unlock();
    }
  }
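
The assignment loop in attemptScheduling() can be sketched with a single resource dimension. AssignLoopSketch is a hypothetical class: memory in MB replaces the Resource type, and a FIFO queue of request sizes replaces the queue hierarchy walked by assignContainer(). The body of shouldContinueAssigning() is not shown in this article, so the stop rule used here (keep going while the total assigned is still within the cap) is an assumption chosen for illustration; the real behaviour depends on scheduler configuration.

```java
import java.util.Deque;

class AssignLoopSketch {
  /**
   * Assigns requests from the head of 'pendingRequestsMb' to a node with
   * 'unallocatedMb' free memory and returns how many containers were assigned.
   */
  static int assign(long unallocatedMb, Deque<Long> pendingRequestsMb) {
    // Cap for this pass: half of what is currently unallocated on the node.
    long maxToAssignMb = unallocatedMb / 2;
    long assignedMb = 0;
    long freeMb = unallocatedMb;
    int containers = 0;

    while (true) {
      Long next = pendingRequestsMb.peek();
      // Nothing to assign, or the request does not fit: the analogue of
      // assignContainer() returning Resources.none().
      if (next == null || next > freeMb) {
        break;
      }
      pendingRequestsMb.poll();
      freeMb -= next;
      assignedMb += next;
      containers++;
      // Assumed stop rule: continue only while the total is within the cap.
      if (assignedMb > maxToAssignMb) {
        break;
      }
    }
    return containers;
  }
}
```

Under this assumed rule the check runs after each assignment, so one pass can exceed the cap by at most one container.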

凡哲_Lucas
