CapacityScheduler -- ApplicationMaster Resource Allocation (based on Hadoop 2.7.6)


   Resource allocation is passive: when a node sends a heartbeat (NODE_UPDATE), the scheduler allocates based on the resource situation the node reports.

   First, a note: the resources an ApplicationMaster needs at startup (memory and virtual cores) are initialized on the client side when the application is submitted (in the YARNRunner class); memory defaults to 1536 MB and virtual cores defaults to 1.
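To make that concrete, here is a minimal sketch mirroring what YARNRunner does on the client side (it assumes Hadoop 2.7.x on the classpath; the wrapper class name is made up, while the MRJobConfig constants are the real ones):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.MRJobConfig;
import org.apache.hadoop.yarn.api.records.Resource;

public class AmResourceSketch {
  // Builds the AM's resource request the same way YARNRunner initializes it.
  public static Resource amCapability(Configuration conf) {
    // yarn.app.mapreduce.am.resource.mb, default 1536
    int memMb = conf.getInt(MRJobConfig.MR_AM_VMEM_MB,
        MRJobConfig.DEFAULT_MR_AM_VMEM_MB);
    // yarn.app.mapreduce.am.resource.cpu-vcores, default 1
    int vcores = conf.getInt(MRJobConfig.MR_AM_CPU_VCORES,
        MRJobConfig.DEFAULT_MR_AM_CPU_VCORES);
    return Resource.newInstance(memMb, vcores);
  }
}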

Code listing (the NODE_UPDATE branch of CapacityScheduler's event handler):

	case NODE_UPDATE:
    {
      NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent)event;
      RMNode node = nodeUpdatedEvent.getRMNode();
      /**
       * Update the node's info:
       * 1. Process newly launched containers:
       *    fires RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose
       *    main job is to remove the container from containerAllocationExpirer monitoring,
       *    since the container has already launched.
       * 2. Process completed containers:
       *    mainly updates the resource accounting of the queue, user,
       *    (FiCaSchedulerApp) application and (FiCaSchedulerNode) node.
       */
      nodeUpdate(node);
      /**
       * Whether to allocate asynchronously. Defaults to false; the stock
       * capacity-scheduler.xml does not set it.
       * Config key: yarn.scheduler.capacity.schedule-asynchronously.enable
       */
      if (!scheduleAsynchronously) {
        /**
         * Perform the resource allocation.
         */
        allocateContainersToNode(getNode(node.getNodeID()));
      }
    }

The NODE_UPDATE handling logic:
    1. Process the node's update info.
    2. Allocate resources (synchronously here; the asynchronous path is sketched below).
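When yarn.scheduler.capacity.schedule-asynchronously.enable is turned on, allocation is decoupled from heartbeats: a background thread inside CapacityScheduler sweeps the nodes and calls allocateContainersToNode on each. Roughly paraphrased from the 2.7.x source (the real loop also randomizes its starting node; this is a simplification, not the verbatim code):

static void schedule(CapacityScheduler cs) throws InterruptedException {
  // Sweep every node and try to allocate on it, instead of waiting for
  // each node's heartbeat to arrive.
  for (FiCaSchedulerNode node : cs.getAllNodes().values()) {
    cs.allocateContainersToNode(node);
  }
  // Pause between sweeps.
  Thread.sleep(cs.getAsyncScheduleInterval());
}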

/**
   * 1. Process newly launched containers:
   *    fires RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main
   *    job is to remove the container from containerAllocationExpirer monitoring, since the
   *    container has already launched (it was put under containerAllocationExpirer
   *    monitoring while the APP_ATTEMPT_ADDED event was handled)
   *
   * 2. Process completed containers:
   *    mainly updates the resource accounting of the queue, user,
   *    (FiCaSchedulerApp) application and (FiCaSchedulerNode) node
   * @param nm
   */
   private synchronized void nodeUpdate(RMNode nm) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("nodeUpdate: " + nm + " clusterResources: " + clusterResource);
    }
    FiCaSchedulerNode node = getNode(nm.getNodeID());
    List<UpdatedContainerInfo> containerInfoList = nm.pullContainerUpdates();
    List<ContainerStatus> newlyLaunchedContainers = new ArrayList<ContainerStatus>();
    List<ContainerStatus> completedContainers = new ArrayList<ContainerStatus>();
    for(UpdatedContainerInfo containerInfo : containerInfoList) {
      newlyLaunchedContainers.addAll(containerInfo.getNewlyLaunchedContainers());
      completedContainers.addAll(containerInfo.getCompletedContainers());
    }
    
    // Processing the newly launched containers
    for (ContainerStatus launchedContainer : newlyLaunchedContainers) {
        /**
         * Fires RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, whose main
         * job is to remove the container from containerAllocationExpirer monitoring, since
         * the container has already launched (it was put under containerAllocationExpirer
         * monitoring while the APP_ATTEMPT_ADDED event was handled).
         */
      containerLaunchedOnNode(launchedContainer.getContainerId(), node);
    }

    // Process completed containers
    for (ContainerStatus completedContainer : completedContainers) {
      ContainerId containerId = completedContainer.getContainerId();
      LOG.debug("Container FINISHED: " + containerId);
      /**
       * Mainly updates the resource accounting of the queue, user,
       * (FiCaSchedulerApp) application and (FiCaSchedulerNode) node.
       */
      completedContainer(getRMContainer(containerId), 
          completedContainer, RMContainerEventType.FINISHED);
    }

    // Now node data structures are upto date and ready for scheduling.
    if(LOG.isDebugEnabled()) {
      LOG.debug("Node being looked for scheduling " + nm
        + " availableResource: " + node.getAvailableResource());
    }
  }

Updating the node's info:
     1. Process newly launched containers:
           fires RMContainerEventType.LAUNCHED, handled by the LaunchedTransition, which removes the container from containerAllocationExpirer monitoring since it has already launched.
    2. Process completed containers:
           mainly updates the resource accounting of the queue, user, (FiCaSchedulerApp) application and (FiCaSchedulerNode) node.


Before diving into the allocation code, two questions worth pondering:
   1. Allocation is done queue by queue; how is a queue chosen (in what order, by what criteria)?
   2. Once a queue is chosen, how are applications within it picked (in what order are the applications submitted to the queue served)?

/**
 * To keep things simple and focus on the main flow, ignore the reserved case for now.
 */
@VisibleForTesting
  public synchronized void allocateContainersToNode(FiCaSchedulerNode node) {
    if (rmContext.isWorkPreservingRecoveryEnabled()
        && !rmContext.isSchedulerReadyForAllocatingContainers()) {
      return;
    }
    /**
     * The node is unknown to the scheduler (never registered, or already removed).
     */
    if (!nodes.containsKey(node.getNodeID())) {
      LOG.info("Skipping scheduling as the node " + node.getNodeID() +
          " has been removed");
      return;
    }

    // Assign new containers...
    // 1. Check for reserved applications
    // 2. Schedule if there are no reservations

    /**
     * Check whether the node has resources reserved on it; a reservation is served first.
     *
     * To keep things simple, ignore the reservedContainer case for now.
     */
    RMContainer reservedContainer = node.getReservedContainer();
    if (reservedContainer != null) {
      FiCaSchedulerApp reservedApplication =
          getCurrentAttemptForContainer(reservedContainer.getContainerId());
      
      // Try to fulfill the reservation
      LOG.info("Trying to fulfill reservation for application " + 
          reservedApplication.getApplicationId() + " on node: " + 
          node.getNodeID());
      
      LeafQueue queue = ((LeafQueue)reservedApplication.getQueue());
      CSAssignment assignment =
          queue.assignContainers(
              clusterResource,
              node,
              new ResourceLimits(labelManager.getResourceByLabel(
                  RMNodeLabelsManager.NO_LABEL, clusterResource)));
      
      RMContainer excessReservation = assignment.getExcessReservation();
      if (excessReservation != null) {
        Container container = excessReservation.getContainer();
        queue.completedContainer(
            clusterResource, assignment.getApplication(), node, 
            excessReservation, 
            SchedulerUtils.createAbnormalContainerStatus(
                container.getId(), 
                SchedulerUtils.UNRESERVED_CONTAINER), 
            RMContainerEventType.RELEASED, null, true);
      }
    }

    /**
     * minimumAllocation covers the minimum memory and minimum vcores; it is initialized in
     * CapacityScheduler's initScheduler:
     *   minimum memory: yarn.scheduler.minimum-allocation-mb, default 1024 MB
     *   minimum vcores: yarn.scheduler.minimum-allocation-vcores, default 1
     */
    // Try to schedule more if there are no reservations to fulfill
    if (node.getReservedContainer() == null) {
      /**
       * Can the node's available resources fit at least one minimum allocation?
       * Roughly: node.getAvailableResource() / minimumAllocation
       */
      if (calculator.computeAvailableContainers(node.getAvailableResource(),
        minimumAllocation) > 0) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Trying to schedule on node: " + node.getNodeName() +
              ", available: " + node.getAvailableResource());
        }
        /**
         * The two questions posed above come up here:
         * 1. Matching starts from root, so which queue gets tried first?
         *    Queues are traversed ordered by available capacity: the more available
         *    capacity, the earlier a queue is tried.
         * 2. In what order are requests matched within a queue?
         *    Within a queue, requests are matched in FIFO order.
         *
         * Note: assignContainers starts matching from the root. assignContainers and
         * assignContainersToChildQueues are mutually recursive; only at a leaf queue does
         * the leaf's assignContainers do the actual allocation.
         */
        root.assignContainers(
            clusterResource,
            node,
            new ResourceLimits(labelManager.getResourceByLabel(
                RMNodeLabelsManager.NO_LABEL, clusterResource)));
      }
    } else {
      LOG.info("Skipping scheduling since node " + node.getNodeID() + 
          " is reserved by application " + 
          node.getReservedContainer().getContainerId().getApplicationAttemptId()
          );
    }
  }

What allocateContainersToNode does:
  Starting from the root queue, assignContainers is called and recurses down until a leaf queue actually completes the allocation. During this descent, assignContainers and ParentQueue.assignContainersToChildQueues call each other recursively.
The main checks deciding whether allocation can proceed:
      1. Whether the resources the node reports as available are at least the configured minimumAllocation.
      2. Whether the queue's total usage after the allocation would exceed the queue's resource cap.
Back to the main flow:

@Override
  public synchronized CSAssignment ParentQueue.assignContainers(Resource clusterResource,
      FiCaSchedulerNode node, ResourceLimits resourceLimits) {
    CSAssignment assignment = 
        new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);
    Set<String> nodeLabels = node.getLabels();
    
    /**
     * Does the node's label set match the queue's?
     *   1. If the queue's label is *, it can access any node.
     *   2. If the node has no label, any queue can access it.
     *   3. If the queue carries specific labels, it can only access nodes with a matching label.
     */
    if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels, nodeLabels)) {
      return assignment;
    }
    /**
     * Check whether the node's available resources meet the minimumAllocation bar.
     *
     * I.e. is node.getAvailableResource() - minimumAllocation >= 0:
     *   1. DefaultResourceCalculator applies the formula directly and does not need clusterResource.
     *   2. DominantResourceCalculator compares by resource share, so it needs clusterResource.
     */
    while (canAssign(clusterResource, node)) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Trying to assign containers to child-queue of "
          + getQueueName());
      }
      /**
       * Check whether the current queue would exceed its resource cap, i.e. whether this
       * queue may allocate at all.
       */
      if (!super.canAssignToThisQueue(clusterResource, nodeLabels, resourceLimits,
          minimumAllocation, Resources.createResource(getMetrics()
              .getReservedMB(), getMetrics().getReservedVirtualCores()))) {
        break;
      }
      
      /**
       * Checks passed: hand the assignment down to the child queues.
       */
      CSAssignment assignedToChild = 
          assignContainersToChildQueues(clusterResource, node, resourceLimits);
      assignment.setType(assignedToChild.getType());
      
      // Done if no child-queue assigned anything
      /**
       * A non-zero assignment means the allocation succeeded.
       */
      if (Resources.greaterThan(
              resourceCalculator, clusterResource, 
              assignedToChild.getResource(), Resources.none())) {
        // Track resource utilization for the parent-queue
        /**
         * After a successful assignment, update the parent queue's resource usage.
         */
        super.allocateResource(clusterResource, assignedToChild.getResource(),
            nodeLabels);
        
        /**
         * Merge the child queue's assignment into the total being returned from this call.
         */
        Resources.addTo(assignment.getResource(), assignedToChild.getResource());
        
        LOG.info("assignedContainer" +
            " queue=" + getQueueName() + 
            " usedCapacity=" + getUsedCapacity() +
            " absoluteUsedCapacity=" + getAbsoluteUsedCapacity() +
            " used=" + queueUsage.getUsed() + 
            " cluster=" + clusterResource);

      } else {
        break;
      }

      if (LOG.isDebugEnabled()) {
        LOG.debug("ParentQ=" + getQueueName()
          + " assignedSoFarInThisIteration=" + assignment.getResource()
          + " usedCapacity=" + getUsedCapacity()
          + " absoluteUsedCapacity=" + getAbsoluteUsedCapacity());
      }

      if (!rootQueue || assignment.getType() == NodeType.OFF_SWITCH) {
        if (LOG.isDebugEnabled()) {
          if (rootQueue && assignment.getType() == NodeType.OFF_SWITCH) {
            LOG.debug("Not assigning more than one off-switch container," +
                " assignments so far: " + assignment);
          }
        }
        break;
      }
    } 
    
    return assignment;
  }

The main logic of ParentQueue.assignContainers:
   1. Check whether the reporting node's labels match.
   2. Check whether the node's reported available resources meet the minimumAllocation bar.
   3. Check whether the current queue would exceed its resource cap.
   4. Once the checks pass, hand the assignment down to the child queues.

/**
   * Does the node's label set match:
   *   1. If the queue's label is *, it can access any node.
   *   2. If the node has no label, any queue can access it.
   *   3. If the queue carries specific labels, it can only access nodes with a matching label.
   * @param queueLabels
   * @param nodeLabels
   * @return
   */
  public static boolean checkQueueAccessToNode(Set<String> queueLabels,
      Set<String> nodeLabels) {
    if (queueLabels != null && queueLabels.contains(RMNodeLabelsManager.ANY)) {
      return true;
    }
    // any queue can access to a node without label
    if (nodeLabels == null || nodeLabels.isEmpty()) {
      return true;
    }
    // a queue can access to a node only if it contains any label of the node
    if (queueLabels != null
        && Sets.intersection(queueLabels, nodeLabels).size() > 0) {
      return true;
    }
    return false;
  }

Checking whether the reporting node's labels match:
   1. If the queue's label is *, it can access any node.
   2. If the node has no label, any queue can access it.
   3. If the queue carries specific labels, it can only access nodes with a matching label.
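A quick usage sketch of checkQueueAccessToNode as listed above (assuming the Hadoop 2.7.x resourcemanager jar and Guava on the classpath; the label names are made up):

import com.google.common.collect.Sets;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils;

public class LabelMatchDemo {
  public static void main(String[] args) {
    // Rule 3: a queue labeled "gpu" can access a node labeled {"gpu", "ssd"}.
    System.out.println(SchedulerUtils.checkQueueAccessToNode(
        Sets.newHashSet("gpu"), Sets.newHashSet("gpu", "ssd"))); // true
    // Rule 2: any queue can access an unlabeled node.
    System.out.println(SchedulerUtils.checkQueueAccessToNode(
        Sets.newHashSet("gpu"), Sets.<String>newHashSet()));     // true
    // No common label: a "gpu" queue cannot access a node labeled only "ssd".
    System.out.println(SchedulerUtils.checkQueueAccessToNode(
        Sets.newHashSet("gpu"), Sets.newHashSet("ssd")));        // false
  }
}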

/**
   * Whether the node's reported resources are usable: node.getAvailableResource() - minimumAllocation >= 0
   *   1. DefaultResourceCalculator applies the formula directly and does not need clusterResource.
   *   2. DominantResourceCalculator compares by resource share, so it needs clusterResource.
   */
  private boolean canAssign(Resource clusterResource, FiCaSchedulerNode node) {
    return (node.getReservedContainer() == null) && 
        Resources.greaterThanOrEqual(resourceCalculator, clusterResource, 
            node.getAvailableResource(), minimumAllocation);
  }

Checks whether the node's reported available resources meet the minimumAllocation bar.
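The difference between the two calculators is easy to miss. Below is a simplified, self-contained sketch (not the Hadoop source; the method names here are invented) of how each decides whether the available resources cover minimumAllocation, and why only the dominant-resource variant needs clusterResource:

import org.apache.hadoop.yarn.api.records.Resource;

public class CalculatorSketch {
  // DefaultResourceCalculator: memory is the only dimension compared.
  static boolean fitsDefault(Resource available, Resource min) {
    return available.getMemory() >= min.getMemory();
  }

  // DominantResourceCalculator: compare each side's dominant share of the
  // cluster, which is why canAssign must pass clusterResource along.
  static boolean fitsDominant(Resource cluster, Resource available, Resource min) {
    float availShare = Math.max(
        (float) available.getMemory() / cluster.getMemory(),
        (float) available.getVirtualCores() / cluster.getVirtualCores());
    float minShare = Math.max(
        (float) min.getMemory() / cluster.getMemory(),
        (float) min.getVirtualCores() / cluster.getVirtualCores());
    return availShare >= minShare;
  }

  public static void main(String[] args) {
    Resource cluster = Resource.newInstance(102400, 100);
    Resource available = Resource.newInstance(2048, 0); // memory left, no vcores
    Resource min = Resource.newInstance(1024, 1);
    System.out.println(fitsDefault(available, min));           // true
    System.out.println(fitsDominant(cluster, available, min)); // false
  }
}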

/**
   * Check whether the allocation would push the current queue past its resource cap.
   *
   * @param clusterResource
   * @param nodeLabels
   * @param currentResourceLimits
   * @param nowRequired
   * @param resourceCouldBeUnreserved
   * @return
   */
  synchronized boolean canAssignToThisQueue(Resource clusterResource,
      Set<String> nodeLabels, ResourceLimits currentResourceLimits,
      Resource nowRequired, Resource resourceCouldBeUnreserved) {
    // Get label of this queue can access, it's (nodeLabel AND queueLabel)
    Set<String> labelCanAccess;
    if (null == nodeLabels || nodeLabels.isEmpty()) {
      labelCanAccess = new HashSet<String>();
      // Any queue can always access any node without label
      labelCanAccess.add(RMNodeLabelsManager.NO_LABEL);
    } else {
      labelCanAccess = new HashSet<String>(
          accessibleLabels.contains(CommonNodeLabelsManager.ANY) ? nodeLabels
              : Sets.intersection(accessibleLabels, nodeLabels));
    }
    
    for (String label : labelCanAccess) {
      // New total resource = used + required
      Resource newTotalResource =
          Resources.add(queueUsage.getUsed(label), nowRequired);

      /**
       * Cap for a queue without a label: min(this level's own cap, the cap handed down by the parent).
       * Cap for a queue with a label: this level's own cap.
       *
       * root is passed the whole cluster's resources, so normally this is just the current level's cap.
       */
      Resource currentLimitResource =
          getCurrentLimitResource(label, clusterResource, currentResourceLimits);

      /**
       * Would the queue exceed its cap if this allocation went through?
       */
      if (Resources.greaterThan(resourceCalculator, clusterResource,
          newTotalResource, currentLimitResource)) {

        if (this.reservationsContinueLooking
            && label.equals(RMNodeLabelsManager.NO_LABEL)
            && Resources.greaterThan(resourceCalculator, clusterResource,
            resourceCouldBeUnreserved, Resources.none())) {
          // resource-without-reserved = used - reserved
          Resource newTotalWithoutReservedResource =
              Resources.subtract(newTotalResource, resourceCouldBeUnreserved);

          if (Resources.lessThanOrEqual(resourceCalculator, clusterResource,
              newTotalWithoutReservedResource, currentLimitResource)) {
            if (LOG.isDebugEnabled()) {
              LOG.debug("try to use reserved: " + getQueueName()
                  + " usedResources: " + queueUsage.getUsed()
                  + ", clusterResources: " + clusterResource
                  + ", reservedResources: " + resourceCouldBeUnreserved
                  + ", capacity-without-reserved: "
                  + newTotalWithoutReservedResource + ", maxLimitCapacity: "
                  + currentLimitResource);
            }
            currentResourceLimits.setAmountNeededUnreserve(
                Resources.subtract(newTotalResource, currentLimitResource));
            return true;
          }
        }
        if (LOG.isDebugEnabled()) {
          LOG.debug(getQueueName()
              + "Check assign to queue, label=" + label
              + " usedResources: " + queueUsage.getUsed(label)
              + " clusterResources: " + clusterResource
              + " currentUsedCapacity "
              + Resources.divide(resourceCalculator, clusterResource,
              queueUsage.getUsed(label),
              labelManager.getResourceByLabel(label, clusterResource))
              + " max-capacity: "
              + queueCapacities.getAbsoluteMaximumCapacity(label)
              + ")");
        }
        return false;
      }
      return true;
    }
    return false;
  }

Checks whether the allocation would push the current queue past its resource cap.
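The core of that check is just "used + required must stay within the current limit". A toy numeric walk-through (hypothetical numbers, using only the Resource/Resources helpers from Hadoop 2.7.x):

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class QueueCapDemo {
  public static void main(String[] args) {
    Resource used = Resource.newInstance(6144, 6);     // queue already uses 6 GB / 6 vcores
    Resource required = Resource.newInstance(1536, 1); // e.g. an AM request
    Resource limit = Resource.newInstance(8192, 8);    // the queue's current cap
    Resource newTotal = Resources.add(used, required); // 7680 MB / 7 vcores
    // 7680 <= 8192 and 7 <= 8, so the allocation would stay within the cap.
    System.out.println(Resources.fitsIn(newTotal, limit)); // true
  }
}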

/**
   * Iterate over the current queue's child queues, which raises the ordering question:
   *   CapacityScheduler defines a comparator to sort the queues:
   *     1. Queues with more available capacity sort first.
   *     2. On equal available capacity, queues are ordered by queue path.
   * @param cluster
   * @param node
   * @param limits
   * @return
   */
  private synchronized CSAssignment ParentQueue.assignContainersToChildQueues(
      Resource cluster, FiCaSchedulerNode node, ResourceLimits limits) {
    CSAssignment assignment = 
        new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);
    printChildQueues();
    /**
     * Try each child in turn; return as soon as any child makes an assignment.
     * Two possible outcomes:
     *   1. No child queue can satisfy the request: the allocation fails and waits for the
     *      next heartbeat (resources may have been released by then, or the failure may be
     *      permanent because of a misconfiguration).
     *   2. The allocation succeeds.
     */
    for (Iterator<CSQueue> iter = childQueues.iterator(); iter.hasNext();) {
      CSQueue childQueue = iter.next();
      if(LOG.isDebugEnabled()) {
        LOG.debug("Trying to assign to queue: " + childQueue.getQueuePath()
          + " stats: " + childQueue);
      }
      /**
       * Compute the child queue's resource limit.
       */
      ResourceLimits childLimits =
          getResourceLimitsOfChild(childQueue, cluster, limits);
      
      /**
       * Since queues form a hierarchy, childQueue here may be a ParentQueue or a LeafQueue.
       * For a ParentQueue, assignContainers recurses (calling assignContainersToChildQueues
       * again) until a LeafQueue is reached, whose assignContainers does the actual allocation.
       */
      assignment = childQueue.assignContainers(cluster, node, childLimits);
      if(LOG.isDebugEnabled()) {
        LOG.debug("Assigned to queue: " + childQueue.getQueuePath() +
          " stats: " + childQueue + " --> " + 
          assignment.getResource() + ", " + assignment.getType());
      }

      /**
       * Remove the queue that just received an assignment and re-insert it, so the set
       * re-sorts and queues that have just been assigned move later (ordering is by
       * available capacity, then queue path).
       */
      if (Resources.greaterThan(
              resourceCalculator, cluster, 
              assignment.getResource(), Resources.none())) {
        // Remove and re-insert to sort
        iter.remove();
        LOG.info("Re-sorting assigned queue: " + childQueue.getQueuePath() + 
            " stats: " + childQueue);
        childQueues.add(childQueue);
        if (LOG.isDebugEnabled()) {
          printChildQueues();
        }
        break;
      }
    }
    return assignment;
  }

Handing the assignment down to the child queues.
This touches the first question posed at the start: allocation is queue by queue, so how is a queue chosen (in what order, by what criteria)?
The code is just a plain for loop, so the answer lies in how childQueues is ordered.

this.childQueues = new TreeSet<CSQueue>(queueComparator);
/**
   * Queue comparator:
   *   1. The more available (less used) capacity, the earlier a queue sorts.
   *   2. On equal used capacity, queues are ordered by their queue path.
   */
  static final Comparator<CSQueue> queueComparator = new Comparator<CSQueue>() {
    @Override
    public int compare(CSQueue q1, CSQueue q2) {
      if (q1.getUsedCapacity() < q2.getUsedCapacity()) {
        return -1;
      } else if (q1.getUsedCapacity() > q2.getUsedCapacity()) {
        return 1;
      }
      return q1.getQueuePath().compareTo(q2.getQueuePath());
    }
  };

The comparator shows the queue-matching rules:
    1. Queues with more available capacity are matched first.
    2. On equal used capacity, queues are ordered by queue path (lexicographically).
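Note that a TreeSet does not re-sort when an element's fields change, which is exactly why assignContainersToChildQueues removes and re-adds a queue after assigning to it. A toy illustration of the same trick (hypothetical Q class, not Hadoop code):

import java.util.Comparator;
import java.util.TreeSet;

public class QueueSortDemo {
  static class Q {
    final String path; float used;
    Q(String path, float used) { this.path = path; this.used = used; }
    @Override public String toString() { return path + "@" + used; }
  }

  public static void main(String[] args) {
    // Same shape as queueComparator: less used capacity first, then path.
    Comparator<Q> cmp = (a, b) -> {
      int c = Float.compare(a.used, b.used);
      return c != 0 ? c : a.path.compareTo(b.path);
    };
    TreeSet<Q> queues = new TreeSet<>(cmp);
    Q a = new Q("root.a", 0.1f);
    Q b = new Q("root.b", 0.3f);
    queues.add(a);
    queues.add(b);
    // root.a gets an assignment and its used capacity grows; the TreeSet will
    // not notice, so remove and re-insert, just like the scheduler does.
    queues.remove(a);
    a.used = 0.5f;
    queues.add(a);
    System.out.println(queues); // [root.b@0.3, root.a@0.5]
  }
}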

/**
   * LeafQueue.assignContainers:
   * 1. Check whether the node's labels match.
   * 2. Iterate over the application list (activeApplications) and try to allocate.
   */
  @Override
  public synchronized CSAssignment assignContainers(Resource clusterResource,
      FiCaSchedulerNode node, ResourceLimits currentResourceLimits) {
    updateCurrentResourceLimits(currentResourceLimits, clusterResource);
    
    if(LOG.isDebugEnabled()) {
      LOG.debug("assignContainers: node=" + node.getNodeName()
        + " #applications=" + activeApplications.size());
    }
    
    // if our queue cannot access this node, just return
    /**
     * Does the node's label set match?
     *   1. If the queue's label is *, it can access any node.
     *   2. If the node has no label, any queue can access it.
     *   3. If the queue carries specific labels, it can only access nodes with a matching label.
     */
    if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels,
        node.getLabels())) {
      return NULL_ASSIGNMENT;
    }
    
    // Check for reserved resources
    RMContainer reservedContainer = node.getReservedContainer();
    /**
     * To keep things simple, ignore the reservation case for now.
     */
    if (reservedContainer != null) {
      FiCaSchedulerApp application = 
          getApplication(reservedContainer.getApplicationAttemptId());
      synchronized (application) {
        return assignReservedContainer(application, node, reservedContainer,
            clusterResource);
      }
    }
    
    Resource initAmountNeededUnreserve =
        currentResourceLimits.getAmountNeededUnreserve();

    // Try to assign containers to applications in order
    /**
     * activeApplications is built up while the APP_ATTEMPT_ADDED event is handled.
     * We iterate over it, so ordering matters; its comparator is:
     *   static final Comparator<FiCaSchedulerApp> applicationComparator = 
     *       new Comparator<FiCaSchedulerApp>() {
     *         @Override
     *         public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
     *           return a1.getApplicationId().compareTo(a2.getApplicationId());
     *         }
     *       };
     * The comparator makes it clear that, within a queue, FiCaSchedulerApp allocation is FIFO.
     */
    for (FiCaSchedulerApp application : activeApplications) {
      
      if(LOG.isDebugEnabled()) {
        LOG.debug("pre-assignContainers for application "
        + application.getApplicationId());
        application.showRequests();
      }

      synchronized (application) {
        // Check if this resource is on the blacklist
        /**
         * Check whether the node is on the application's blacklist.
         */
        if (SchedulerAppUtils.isBlacklisted(application, node, LOG)) {
          continue;
        }
        // Schedule in priority order
        for (Priority priority : application.getPriorities()) {
          /**
           * The request was already added/updated when SchedulerTransition called scheduler.allocate.
           */
          ResourceRequest anyRequest =
              application.getResourceRequest(priority, ResourceRequest.ANY);
          if (null == anyRequest) {
            continue;
          }
          
          // Required resource
          Resource required = anyRequest.getCapability();

          // Do we need containers at this 'priority'?
          /**
           * I.e. is NumContainers > 0?
           */
          if (application.getTotalRequiredResources(priority) <= 0) {
            continue;
          }
          if (!this.reservationsContinueLooking) {
            if (!shouldAllocOrReserveNewContainer(application, priority, required)) {
              if (LOG.isDebugEnabled()) {
                LOG.debug("doesn't need containers based on reservation algo!");
              }
              continue;
            }
          }
          
          Set<String> requestedNodeLabels =
              getRequestLabelSetByExpression(anyRequest
                  .getNodeLabelExpression());
          Resource userLimit = 
              computeUserLimitAndSetHeadroom(application, clusterResource, 
                  required, requestedNodeLabels);          
          
          currentResourceLimits.setAmountNeededUnreserve(
              initAmountNeededUnreserve);
          // Check queue max-capacity limit
          if (!super.canAssignToThisQueue(clusterResource, node.getLabels(),
              currentResourceLimits, required, application.getCurrentReservation())) {
            return NULL_ASSIGNMENT;
          }
          // Check user limit
          if (!assignToUser(clusterResource, application.getUser(), userLimit,
              application, requestedNodeLabels, currentResourceLimits)) {
            break;
          }
          // Inform the application it is about to get a scheduling opportunity
          application.addSchedulingOpportunity(priority);
          
          // Try to schedule
          CSAssignment assignment =  
            assignContainersOnNode(clusterResource, node, application, priority, 
                null, currentResourceLimits);

          // Did the application skip this node?
          if (assignment.getSkipped()) {
            // Don't count 'skipped nodes' as a scheduling opportunity!
            application.subtractSchedulingOpportunity(priority);
            continue;
          }
          
          Resource assigned = assignment.getResource();
          if (Resources.greaterThan(
              resourceCalculator, clusterResource, assigned, Resources.none())) {

            allocateResource(clusterResource, application, assigned,
                node.getLabels());
            
            if (assignment.getType() != NodeType.OFF_SWITCH) {
              if (LOG.isDebugEnabled()) {
                LOG.debug("Resetting scheduling opportunities");
              }
              if (assignment.getType() == NodeType.NODE_LOCAL
                  || getRackLocalityFullReset()) {
                application.resetSchedulingOpportunities(priority);
              }
            }
            return assignment;
          } else {
            // Do not assign out of order w.r.t priorities
            break;
          }
        }
      }
      if(LOG.isDebugEnabled()) {
        LOG.debug("post-assignContainers for application "
          + application.getApplicationId());
      }
      application.showRequests();
    }
    return NULL_ASSIGNMENT;
  }

LeafQueue.assignContainers performs the final allocation, triggering the chain of events that eventually launches the Container. The concrete work happens in assignContainersOnNode, which fires a series of events; in the end AMLauncher.launch invokes the startContainers RPC to start the Container. Reading the leaf queue's assignContainers also answers the second question posed at the start: once a queue is chosen, in what order are the applications submitted to it served? The for loop iterates over activeApplications, a sorted Set whose comparator is:

static final Comparator<FiCaSchedulerApp> applicationComparator =
    new Comparator<FiCaSchedulerApp>() {
      @Override
      public int compare(FiCaSchedulerApp a1, FiCaSchedulerApp a2) {
        return a1.getApplicationId().compareTo(a2.getApplicationId());
      }
    };
The comparator makes it clear that, within a queue, applications (FiCaSchedulerApp) are served in FIFO order.
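ApplicationId ordering is submission order: compareTo looks at the RM's cluster timestamp first and then the monotonically increasing application sequence number, so an earlier-submitted application always compares smaller. A minimal check (assuming Hadoop 2.7.x on the classpath; the timestamp value is made up):

import org.apache.hadoop.yarn.api.records.ApplicationId;

public class FifoOrderDemo {
  public static void main(String[] args) {
    long clusterTs = 1543000000000L; // the RM start time, example value
    ApplicationId first = ApplicationId.newInstance(clusterTs, 1);
    ApplicationId second = ApplicationId.newInstance(clusterTs, 2);
    // The earlier application sorts first, hence FIFO within the queue.
    System.out.println(first.compareTo(second) < 0); // true
  }
}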

/**
   * Fires a series of events; in the end AMLauncher.launch invokes the startContainers RPC to start the Container.
   * @param clusterResource
   * @param node
   * @param application
   * @param priority
   * @param reservedContainer
   * @param currentResoureLimits
   * @return
   */
  private CSAssignment assignContainersOnNode(Resource clusterResource,
      FiCaSchedulerNode node, FiCaSchedulerApp application, Priority priority,
      RMContainer reservedContainer, ResourceLimits currentResoureLimits) {
    Resource assigned = Resources.none();

    NodeType requestType = null;
    MutableObject allocatedContainer = new MutableObject();
    // Data-local
    ResourceRequest nodeLocalResourceRequest =
        application.getResourceRequest(priority, node.getNodeName());
    if (nodeLocalResourceRequest != null) {
      requestType = NodeType.NODE_LOCAL;
      assigned =
          assignNodeLocalContainers(clusterResource, nodeLocalResourceRequest, 
            node, application, priority, reservedContainer,
            allocatedContainer, currentResoureLimits);
      if (Resources.greaterThan(resourceCalculator, clusterResource,
          assigned, Resources.none())) {

        //update locality statistics
        if (allocatedContainer.getValue() != null) {
          application.incNumAllocatedContainers(NodeType.NODE_LOCAL,
            requestType);
        }
        return new CSAssignment(assigned, NodeType.NODE_LOCAL);
      }
    }

    // Rack-local
    ResourceRequest rackLocalResourceRequest =
        application.getResourceRequest(priority, node.getRackName());
    if (rackLocalResourceRequest != null) {
      if (!rackLocalResourceRequest.getRelaxLocality()) {
        return SKIP_ASSIGNMENT;
      }

      if (requestType != NodeType.NODE_LOCAL) {
        requestType = NodeType.RACK_LOCAL;
      }

      assigned = 
          assignRackLocalContainers(clusterResource, rackLocalResourceRequest, 
            node, application, priority, reservedContainer,
            allocatedContainer, currentResoureLimits);
      if (Resources.greaterThan(resourceCalculator, clusterResource,
          assigned, Resources.none())) {

        //update locality statistics
        if (allocatedContainer.getValue() != null) {
          application.incNumAllocatedContainers(NodeType.RACK_LOCAL,
            requestType);
        }
        return new CSAssignment(assigned, NodeType.RACK_LOCAL);
      }
    }
    // Off-switch
    /**
     * The AM's resource request was made with its ResourceName set to ResourceRequest.ANY.
     */
    ResourceRequest offSwitchResourceRequest =
        application.getResourceRequest(priority, ResourceRequest.ANY);
    if (offSwitchResourceRequest != null) {
      if (!offSwitchResourceRequest.getRelaxLocality()) {
        return SKIP_ASSIGNMENT;
      }
      if (requestType != NodeType.NODE_LOCAL
          && requestType != NodeType.RACK_LOCAL) {
        requestType = NodeType.OFF_SWITCH;
      }
      assigned =
          assignOffSwitchContainers(clusterResource, offSwitchResourceRequest,
            node, application, priority, reservedContainer,
            allocatedContainer, currentResoureLimits);

      if (allocatedContainer.getValue() != null) {
        application.incNumAllocatedContainers(NodeType.OFF_SWITCH, requestType);
      }
      return new CSAssignment(assigned, NodeType.OFF_SWITCH);
    }
    return SKIP_ASSIGNMENT;
  }

1. Completes the allocation.
2. Fires a series of events; in the end AMLauncher.launch invokes the startContainers RPC to start the Container.
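The canAssign call at the top of assignOffSwitchContainers below applies delay scheduling: give up locality only after the application has passed up enough scheduling opportunities. A simplified standalone sketch of that idea (not the Hadoop source; the helper and its exact threshold are an approximation):

public class DelaySchedulingSketch {
  // The more containers the app still needs, the more scheduling opportunities
  // it is willing to skip while waiting for a node- or rack-local slot.
  static boolean canAssignOffSwitch(long missedOpportunities,
      long requiredContainers, float localityWaitFactor) {
    return requiredContainers * localityWaitFactor < missedOpportunities;
  }

  public static void main(String[] args) {
    // Needs 10 containers, wait factor 0.5: allow off-switch only after the
    // app has been offered (and skipped) more than 5 nodes.
    System.out.println(canAssignOffSwitch(3, 10, 0.5f)); // false: keep waiting
    System.out.println(canAssignOffSwitch(6, 10, 0.5f)); // true: go off-switch
  }
}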

private Resource assignOffSwitchContainers(Resource clusterResource,
      ResourceRequest offSwitchResourceRequest, FiCaSchedulerNode node,
      FiCaSchedulerApp application, Priority priority,
      RMContainer reservedContainer, MutableObject allocatedContainer,
      ResourceLimits currentResoureLimits) {
    /**
     * Decides, mainly from a delay-scheduling (locality wait) angle, whether this allocation should proceed.
     */
    if (canAssign(application, priority, node, NodeType.OFF_SWITCH,
        reservedContainer)) {
      /**
       * assignContainer creates the RMContainer and fires the RMContainerEventType.START event.
       */
      return assignContainer(clusterResource, node, application, priority,
          offSwitchResourceRequest, NodeType.OFF_SWITCH, reservedContainer,
          allocatedContainer, currentResoureLimits);
    }
    return Resources.none();
  }
private Resource assignContainer(Resource clusterResource, FiCaSchedulerNode node, 
      FiCaSchedulerApp application, Priority priority, 
      ResourceRequest request, NodeType type, RMContainer rmContainer,
      MutableObject createdContainer, ResourceLimits currentResoureLimits) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("assignContainers: node=" + node.getNodeName()
        + " application=" + application.getApplicationId()
        + " priority=" + priority.getPriority()
        + " request=" + request + " type=" + type);
    }
    
    if (!SchedulerUtils.checkNodeLabelExpression(
        node.getLabels(),
        request.getNodeLabelExpression())) {
      if (rmContainer != null) {
        unreserve(application, priority, node, rmContainer);
      }
      return Resources.none();
    }
    
    Resource capability = request.getCapability();
    Resource available = node.getAvailableResource();
    Resource totalResource = node.getTotalResource();

    if (!Resources.lessThanOrEqual(resourceCalculator, clusterResource,
        capability, totalResource)) {
      LOG.warn("Node : " + node.getNodeID()
          + " does not have sufficient resource for request : " + request
          + " node total capability : " + node.getTotalResource());
      return Resources.none();
    }

    assert Resources.greaterThan(
        resourceCalculator, clusterResource, available, Resources.none());

    // Create the container if necessary
    /**
     * Build a Container from the node's info. A Container's main fields:
     *   1. nodeId: the node's id
     *   2. containerId: generated from the appAttemptId and a container sequence number
     *   3. priority
     *   4. resource: the requested resources
     *   5. httpAddress: the address for talking to the node
     *   6. containerToken: the token
     */
    Container container = 
        getContainer(rmContainer, application, node, capability, priority);
  
    // something went wrong getting/creating the container 
    if (container == null) {
      LOG.warn("Couldn't get container for allocation!");
      return Resources.none();
    }
    
    boolean shouldAllocOrReserveNewContainer = shouldAllocOrReserveNewContainer(
        application, priority, capability);

    // Can we allocate a container on this node?
    /**
     * Compare the node's available resources with the request size: is there enough room?
     */
    int availableContainers = 
        resourceCalculator.computeAvailableContainers(available, capability);

    boolean needToUnreserve = Resources.greaterThan(resourceCalculator,clusterResource,
        currentResoureLimits.getAmountNeededUnreserve(), Resources.none());

    if (availableContainers > 0) {
      // Allocate...

      // Did we previously reserve containers at this 'priority'?
      /**
       * rmContainer is a parameter (null on this path); it is the RMContainer-typed reservedContainer.
       */
      if (rmContainer != null) {
        unreserve(application, priority, node, rmContainer);
      } else if (this.reservationsContinueLooking && node.getLabels().isEmpty()) {
        if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
          Resource amountToUnreserve = capability;
          if (needToUnreserve) {
            amountToUnreserve = currentResoureLimits.getAmountNeededUnreserve();
          }
          boolean containerUnreserved =
              findNodeToUnreserve(clusterResource, node, application, priority,
                  amountToUnreserve);
          if (!containerUnreserved) {
            return Resources.none();
          }
        }
      }

      // Inform the application
      /**
       * 1. Create the RMContainer.
       * 2. Add it to newlyAllocatedContainers (containers in this list are launched while
       *    subsequent NODE_UPDATE events are handled).
       * 3. Add it to liveContainers, the map of containers currently alive for this attempt.
       * 4. Record the allocated resourceRequests on the RMContainer so they can be restored later.
       * 5. Fire RMContainerEventType.START.
       */
      RMContainer allocatedContainer = 
          application.allocate(type, node, priority, request, container);

      // Does the application need this resource?
      if (allocatedContainer == null) {
        return Resources.none();
      }
      /**
       * 1. Record the allocated container on the node (launchedContainers).
       * 2. Subtract from the node's available resources and add to its used resources.
       */
      node.allocateContainer(allocatedContainer);

      String label = RMNodeLabelsManager.NO_LABEL;
      if (node.getLabels() != null && !node.getLabels().isEmpty()) {
        label = node.getLabels().iterator().next();
      }
      LOG.info("assignedContainer" +
          " application attempt=" + application.getApplicationAttemptId() +
          " container=" + container + 
          " queue=" + this + 
          " clusterResource=" + clusterResource + 
          " type=" + type +
          " requestedPartition=" + label);

      createdContainer.setValue(allocatedContainer);
      return container.getResource();
    } else {
      if (shouldAllocOrReserveNewContainer || rmContainer != null) {

        if (reservationsContinueLooking && rmContainer == null) {
          if (needToUnreserve) {
            if (LOG.isDebugEnabled()) {
              LOG.debug("we needed to unreserve to be able to allocate");
            }
            return Resources.none();
          }
        }

        // Reserve by 'charging' in advance...
        reserve(application, priority, node, rmContainer, container);

        LOG.info("Reserved container " + 
            " application=" + application.getApplicationId() + 
            " resource=" + request.getCapability() + 
            " queue=" + this.toString() + 
            " usedCapacity=" + getUsedCapacity() + 
            " absoluteUsedCapacity=" + getAbsoluteUsedCapacity() + 
            " used=" + queueUsage.getUsed() +
            " cluster=" + clusterResource);

        return request.getCapability();
      }
      return Resources.none();
    }
  }
/**
   * 1. Create the RMContainer.
   * 2. Add it to newlyAllocatedContainers (containers in this list are launched while subsequent NODE_UPDATE events are handled).
   * 3. Add it to liveContainers, the map of containers currently alive for this attempt.
   * 4. Record the allocated resourceRequests on the RMContainer so they can be restored later.
   * 5. Fire RMContainerEventType.START.
   * @param type
   * @param node
   * @param priority
   * @param request
   * @param container
   * @return
   */
  synchronized public RMContainer allocate(NodeType type, FiCaSchedulerNode node,
      Priority priority, ResourceRequest request, 
      Container container) {

    if (isStopped) {
      return null;
    }
    
    if (getTotalRequiredResources(priority) <= 0) {
      return null;
    }
    
    // Create RMContainer
    RMContainer rmContainer = new RMContainerImpl(container, this
        .getApplicationAttemptId(), node.getNodeID(),
        appSchedulingInfo.getUser(), this.rmContext);

    // Add it to allContainers list.
    newlyAllocatedContainers.add(rmContainer);
    liveContainers.put(container.getId(), rmContainer);    

    // Update consumption and track allocations
    /**
     * 1. At this point the allocation is treated as successful: decrement NumContainers on
     *    the matching resource requests.
     * 2. Return the decremented requests so they can be recorded and restored later.
     */
    List<ResourceRequest> resourceRequestList = appSchedulingInfo.allocate(
        type, node, priority, request, container);
    Resources.addTo(currentConsumption, container.getResource());
    
    /**
     * Save the resourceRequests returned by appSchedulingInfo.allocate so they can be restored later.
     */
    ((RMContainerImpl)rmContainer).setResourceRequests(resourceRequestList);

    // Inform the container
    /**
     * Fire RMContainerEventType.START.
     */
    rmContainer.handle(
        new RMContainerEvent(container.getId(), RMContainerEventType.START));

    if (LOG.isDebugEnabled()) {
      LOG.debug("allocate: applicationAttemptId=" 
          + container.getId().getApplicationAttemptId() 
          + " container=" + container.getId() + " host="
          + container.getNodeId().getHost() + " type=" + type);
    }
    RMAuditLogger.logSuccess(getUser(), 
        AuditConstants.ALLOC_CONTAINER, "SchedulerApp", 
        getApplicationId(), container.getId());
    return rmContainer;
  }
