Preemption concepts

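When a job is submitted to an empty queue on a busy cluster, it does not start immediately; it blocks until running jobs release resources.

To make the time a submitted job takes to start more predictable, the Fair Scheduler supports preemption, and you can bound the wait with timeouts (MinShareTimeout and FairShareTimeout, discussed below).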

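Preemption lets the scheduler kill containers that belong to queues using more than their share of resources, so that those resources can be allocated to the queues entitled to them. (Note: what gets preempted is the container.)

Keep in mind that preemption lowers overall cluster efficiency, because the terminated containers have to be re-executed.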

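Preemption is enabled globally by setting yarn.scheduler.fair.preemption to true in yarn-site.xml. Two further settings control the preemption timeouts (neither is set by default, so at least one must be configured before containers can be preempted):

1- minimum share preemption timeout
2- fair share preemption timeout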

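If a queue has not received its minimum guaranteed resources within the minimum share preemption timeout, the scheduler may preempt containers. The timeout can be set for all queues through the top-level defaultMinSharePreemptionTimeout element of the allocation file, or for a single queue through a minSharePreemptionTimeout element nested inside that queue's queue element.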
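The timeout rule can be sketched in a few lines of Java. This is a simplified, memory-only model and not the Hadoop API: the class StarvationTimer and its methods are hypothetical names, and the sketch assumes a queue counts as satisfied once its usage reaches min(minShare, demand).

```java
// Hypothetical model (not a Hadoop class): remembers when a queue last had
// its minimum share and reports whether the preemption timeout has expired.
final class StarvationTimer {
    private final long minShareTimeoutMs;
    private long lastTimeAtMinShareMs;

    StarvationTimer(long minShareTimeoutMs, long nowMs) {
        this.minShareTimeoutMs = minShareTimeoutMs;
        this.lastTimeAtMinShareMs = nowMs;
    }

    /** Call on every scheduler update with the queue's current state (MB). */
    void update(long usageMb, long minShareMb, long demandMb, long nowMs) {
        // Assumption: the queue is satisfied once usage reaches min(minShare, demand).
        if (usageMb >= Math.min(minShareMb, demandMb)) {
            lastTimeAtMinShareMs = nowMs;
        }
    }

    /** True once the queue has been below its minimum share for longer than the timeout. */
    boolean mayPreempt(long nowMs) {
        return nowMs - lastTimeAtMinShareMs > minShareTimeoutMs;
    }
}
```

A queue that stays starved never resets the timestamp, so mayPreempt() turns true once the timeout has passed; as soon as the queue is satisfied again the clock restarts.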

The Container object

Container source code

@Private
@Unstable
public static Container newInstance(ContainerId containerId, NodeId nodeId,
		String nodeHttpAddress, Resource resource, Priority priority,
		Token containerToken) {
	Container container = Records.newRecord(Container.class);
	container.setId(containerId);
	container.setNodeId(nodeId);
	container.setNodeHttpAddress(nodeHttpAddress);
	container.setResource(resource);
	container.setPriority(priority);
	container.setContainerToken(containerToken);
	return container;
}

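id is the container's unique identifier across the whole cluster;
NodeId identifies the node the container lives on; with it, the application can contact the corresponding NM and pick up the resources the RM allocated to it;
NodeHttpAddress is the HTTP address of that node;
resource is the set of resources the container holds;
priority is the container's priority.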

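The container is the unit of computation in the YARN framework: the basic unit in which an application's tasks (such as map tasks and reduce tasks) actually execute.
You can think of it as the container that map/reduce tasks run in.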

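(1) A Container is YARN's abstraction of resources: it encapsulates a certain amount of resources (two kinds, CPU and memory) on a particular node. It has nothing to do with Linux containers; it is purely a YARN concept (in implementation terms, it can be viewed as a serializable/deserializable Java class).
(2) Containers are requested from the ResourceManager by the ApplicationMaster, and the resource scheduler inside the ResourceManager allocates them to the ApplicationMaster asynchronously.
(3) A Container is started by the ApplicationMaster, which sends the launch request to the NodeManager that owns the resources.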

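In addition, the Containers an application needs fall into two groups:
(1) The Container that runs the ApplicationMaster: the ResourceManager requests it (from its internal resource scheduler) and starts it. When submitting the application, the user can specify the resources needed by its single ApplicationMaster.
(2) The Containers that run the application's tasks: the ApplicationMaster requests these from the ResourceManager and then talks to the NodeManager to start them.
Both kinds of Container may land on any node; their placement is generally arbitrary, so an ApplicationMaster may end up on the same node as the tasks it manages.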

Preemption source code analysis

https://blog.csdn.net/zhanyuanlin/article/details/71516286
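YARN manages queues as a tree. At the resource level, every node of that tree (the root queue, an intermediate queue, or a leaf queue) is an abstraction of resources, and in YARN each one is a Schedulable. FSLeafQueue (a leaf of the queue tree), FSParentQueue (a non-leaf node of the queue tree), and FSAppAttempt (an application as seen by the FairScheduler) all implement Schedulable, including its preemptContainer() method. Each has its own fair share (an amount of resources), weight, minShare (minimum resources), maxShare (maximum resources), priority, resourceUsage (resources in use), and demand (resources requested), as the definition of the Schedulable interface shows: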

public interface Schedulable {
  /**
   * Name of job/queue, used for debugging as well as for breaking ties in
   * scheduling order deterministically.
   */
  public String getName();

  /**
   * Maximum number of resources required by this Schedulable. This is defined as
   * number of currently utilized resources + number of unlaunched resources (that
   * are either not yet launched or need to be speculated).
   */
  public Resource getDemand();

  /** Get the aggregate amount of resources consumed by the schedulable. */
  public Resource getResourceUsage();

  /** Minimum Resource share assigned to the schedulable. */
  public Resource getMinShare();

  /** Maximum Resource share assigned to the schedulable. */
  public Resource getMaxShare();

  /** Job/queue weight in fair sharing. */
  public ResourceWeights getWeights();

  /** Start time for jobs in FIFO queues; meaningless for QueueSchedulables.*/
  public long getStartTime();

 /** Job priority for jobs in FIFO queues; meaningless for QueueSchedulables. */
  public Priority getPriority();

  /** Refresh the Schedulable's demand and those of its children if any. */
  public void updateDemand();

  /**
   * Assign a container on this node if possible, and return the amount of
   * resources assigned.
   */
  public Resource assignContainer(FSSchedulerNode node);

  /**
   * Preempt a container from this Schedulable if possible.
   */
  public RMContainer preemptContainer();

  /** Get the fair share assigned to this Schedulable. */
  public Resource getFairShare();

  /** Assign a fair share to this Schedulable. */
  public void setFairShare(Resource fairShare);
}
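
To make the tree idea concrete, here is a minimal sketch in which queues and applications share one interface and a queue's demand and usage are the sums over its children. It is a simplified, memory-only stand-in, not Hadoop code: SimpleSchedulable, SimpleApp and SimpleQueue are hypothetical names, and the sketch ignores details such as capping a queue's demand at its maxShare.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for Hadoop's Schedulable (memory only, in MB).
interface SimpleSchedulable {
    String getName();
    long getDemandMb();
    long getUsageMb();
}

// Leaf of the tree: an application with its own demand and usage.
final class SimpleApp implements SimpleSchedulable {
    private final String name;
    private final long demandMb;
    private final long usageMb;

    SimpleApp(String name, long demandMb, long usageMb) {
        this.name = name;
        this.demandMb = demandMb;
        this.usageMb = usageMb;
    }

    public String getName() { return name; }
    public long getDemandMb() { return demandMb; }
    public long getUsageMb() { return usageMb; }
}

// Inner node: a queue whose demand and usage aggregate those of its children.
final class SimpleQueue implements SimpleSchedulable {
    private final String name;
    private final List<SimpleSchedulable> children = new ArrayList<>();

    SimpleQueue(String name) { this.name = name; }

    SimpleQueue add(SimpleSchedulable child) {
        children.add(child);
        return this;
    }

    public String getName() { return name; }

    public long getDemandMb() {
        long sum = 0;
        for (SimpleSchedulable c : children) sum += c.getDemandMb();
        return sum;
    }

    public long getUsageMb() {
        long sum = 0;
        for (SimpleSchedulable c : children) sum += c.getUsageMb();
        return sum;
    }
}
```

Because a queue is itself a SimpleSchedulable, queues nest to any depth and the root's figures fall out of the same recursive sum.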

When FairScheduler initializes, it creates an UpdateThread:

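private void initScheduler(Configuration conf) throws IOException {
    synchronized (this) {
      // ... (earlier initialization omitted in this excerpt)
      // Create the updateThread, which monitors queue state and drives preemption
      updateThread = new UpdateThread();
      updateThread.setName("FairSchedulerUpdateThread");
      updateThread.setDaemon(true);
      // ... (remaining initialization omitted in this excerpt)
    }
  }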

The UpdateThread does three things:

decides whether preemption is needed
computes how much needs to be preempted
carries out the preemption

private class UpdateThread extends Thread {

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          Thread.sleep(updateInterval);
          long start = getClock().getTime();
          update();
          preemptTasksIfNecessary();
          long duration = getClock().getTime() - start;
          fsOpDurations.addUpdateThreadRunDuration(duration);
        } catch (InterruptedException ie) {
          LOG.warn("Update thread interrupted. Exiting.");
          return;
        } catch (Exception e) {
          LOG.error("Exception in fair scheduler UpdateThread", e);
        }
      }
    }
  }

Preemption happens in preemptTasksIfNecessary(), which UpdateThread.run() calls on every iteration:

  /**
   * Check every Schedulable that is short of resources, whether because it has been below its
   * minShare for longer than minSharePreemptionTimeout or below its fairShare for longer than
   * fairSharePreemptionTimeout. After summing the shortfall of all of them, try to preempt
   * that amount of resources.
   */
  protected synchronized void preemptTasksIfNecessary() {
    // Check whether the cluster allows preemption at all
    if (!shouldAttemptPreemption()) {
      return;
    }

    long curTime = getClock().getTime();
    // Has enough time passed since the last preemption check?
    if (curTime - lastPreemptCheckTime < preemptionInterval) {
      return;
    }
    lastPreemptCheckTime = curTime;
    // Start from "none", i.e. nothing to preempt
    Resource resToPreempt = Resources.clone(Resources.none());
    for (FSLeafQueue sched : queueMgr.getLeafQueues()) {
      // resourceDeficit(FSLeafQueue, long) computes how much this queue should preempt
      Resources.addTo(resToPreempt, resourceDeficit(sched, curTime));
    }
    if (isResourceGreaterThanNone(resToPreempt)) {
      // Perform the preemption
      preemptResources(resToPreempt);
    }
  }

shouldAttemptPreemption() checks whether the cluster allows preemption. It decides based on two settings:

whether yarn.scheduler.fair.preemption is true
whether cluster utilization exceeds the value configured for yarn.scheduler.fair.preemption.cluster-utilization-threshold

private boolean shouldAttemptPreemption() {
    if (preemptionEnabled) { // is yarn.scheduler.fair.preemption set to true?
      // Does cluster utilization (the larger of the memory and vcore ratios) exceed
      // yarn.scheduler.fair.preemption.cluster-utilization-threshold?
      return (preemptionUtilizationThreshold < Math.max(
          (float) rootMetrics.getAllocatedMB() / clusterResource.getMemorySize(),
          (float) rootMetrics.getAllocatedVirtualCores() /
              clusterResource.getVirtualCores()));
    }
    return false;
  }
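
The utilization check is easy to try with plain numbers. The sketch below restates the same comparison outside Hadoop: PreemptionGate is a hypothetical class, the inputs are passed in as arguments instead of being read from cluster metrics, and the 0.8 threshold in the usage note is only an example value.

```java
// Hypothetical helper (not a Hadoop class) that mirrors the comparison above.
final class PreemptionGate {
    static boolean shouldAttemptPreemption(boolean preemptionEnabled, float threshold,
                                           long allocatedMb, long clusterMb,
                                           int allocatedVcores, int clusterVcores) {
        if (!preemptionEnabled) {
            return false;
        }
        // Utilization is the larger of the memory ratio and the vcore ratio.
        float memoryUtilization = (float) allocatedMb / clusterMb;
        float vcoreUtilization = (float) allocatedVcores / clusterVcores;
        return threshold < Math.max(memoryUtilization, vcoreUtilization);
    }
}
```

With a threshold of 0.8, a cluster at 60% memory and 90% vcores passes the check (0.9 > 0.8), while one at 60% memory and 50% vcores does not, so preemption is skipped while the cluster still has headroom on both resources.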

Back in preemptTasksIfNecessary(), consider this loop:

    for (FSLeafQueue sched : queueMgr.getLeafQueues()) {
      // resourceDeficit(FSLeafQueue, long) computes how much this queue should preempt
      Resources.addTo(resToPreempt, resourceDeficit(sched, curTime));
    }

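resourceDeficit(FSLeafQueue, long) computes the amount of resources to preempt.
Its FSLeafQueue parameter is the queue object and its long parameter is the current time.
resourceDeficit() is the heart of the mechanism.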

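  /**
   * Compute how much this queue is allowed to preempt from other queues. If the queue has been
   * below its min share for longer than its preemption timeout, it should preempt the difference
   * between its current usage and its min share. If it has been below its fair share for longer
   * than fairSharePreemptionTimeout, it should preempt enough to reach its full fair share.
   * If both hold, preempt the larger of the two amounts.
   *
   * minSharePreemptionTimeout: if the Schedulable still has not received its minShare after
   * this long, preempt.
   * fairSharePreemptionTimeout: if the Schedulable still has not received its fairShare after
   * this long, preempt.
   *
   * resDueToMinShare = max(0, subtract(min(minShare, demand), resourceUsage))
   * resDueToFairShare = max(0, subtract(min(fairShare, demand), resourceUsage))
   * deficit = max(resDueToMinShare, resDueToFairShare)
   *
   * resourceUsage < min(minShare, demand) || resourceUsage < min(fairShare, demand)
   * When this holds, the larger of the two differences is the amount to preempt.
   */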
  protected Resource resourceDeficit(FSLeafQueue sched, long curTime) {
    long minShareTimeout = sched.getMinSharePreemptionTimeout();
    long fairShareTimeout = sched.getFairSharePreemptionTimeout();
    Resource resDueToMinShare = Resources.none();
    Resource resDueToFairShare = Resources.none();
    ResourceCalculator calc = sched.getPolicy().getResourceCalculator();
    // Case 1: the minShare timeout has expired
    if (curTime - sched.getLastTimeAtMinShare() > minShareTimeout) {
      // target is the component-wise minimum of minShare and demand
      Resource target = Resources.componentwiseMin(
          sched.getMinShare(), sched.getDemand());
      // resDueToMinShare = max(0, target - resourceUsage): what the queue is owed
      // because of the minShare timeout
      resDueToMinShare = Resources.max(calc, clusterResource,
          Resources.none(), Resources.subtract(target, sched.getResourceUsage()));
    }
    // Case 2: the fairShare timeout has expired
    if (curTime - sched.getLastTimeAtFairShareThreshold() > fairShareTimeout) {
      Resource target = Resources.componentwiseMin(
              sched.getFairShare(), sched.getDemand());
      resDueToFairShare = Resources.max(calc, clusterResource,
          Resources.none(), Resources.subtract(target, sched.getResourceUsage()));
    }
    Resource deficit = Resources.max(calc, clusterResource,
        resDueToMinShare, resDueToFairShare);
    if (Resources.greaterThan(calc, clusterResource,
        deficit, Resources.none())) {
      String message = "Should preempt " + deficit + " res for queue "
          + sched.getName() + ": resDueToMinShare = " + resDueToMinShare
          + ", resDueToFairShare = " + resDueToFairShare;
      LOG.info(message);
    }
    return deficit;
  }

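Preemption model

These two timeouts need to be configured in the allocation file (fair-scheduler.xml):

Preemption conditions
minSharePreemptionTimeout: if the Schedulable still has not received its minShare after this long, preempt
fairSharePreemptionTimeout: if the Schedulable still has not received its fairShare after this long, preempt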

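The model that falls out of the source code is as follows.
For a given queue, when
resourceUsage < min(minShare, demand) || resourceUsage < min(fairShare, demand)
holds (and the corresponding timeout has expired), the larger of the two differences is the amount of resources the queue needs to preempt:

resDueToMinShare = max(0, subtract(min(minShare, demand), resourceUsage))
resDueToFairShare = max(0, subtract(min(fairShare, demand), resourceUsage))
deficit = max(resDueToMinShare, resDueToFairShare)

resDueToMinShare: the resources owed to the queue because the minShare timeout expired
resDueToFairShare: the resources owed to the queue because the fairShare timeout expired
deficit: the resources the queue is short of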
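
The model can be checked by hand with a small, memory-only sketch. DeficitModel is a hypothetical class, not Hadoop code: it works on plain long values in MB instead of Resource objects, and it takes the two timeout checks as boolean arguments.

```java
// Hypothetical, memory-only restatement of the deficit model (all values in MB).
final class DeficitModel {
    /** max(0, min(share, demand) - usage) */
    static long shortfall(long share, long demand, long usage) {
        return Math.max(0L, Math.min(share, demand) - usage);
    }

    /** deficit = max(resDueToMinShare, resDueToFairShare), each counted only after its timeout. */
    static long deficit(long minShare, long fairShare, long demand, long usage,
                        boolean minShareTimedOut, boolean fairShareTimedOut) {
        long resDueToMinShare = minShareTimedOut ? shortfall(minShare, demand, usage) : 0L;
        long resDueToFairShare = fairShareTimedOut ? shortfall(fairShare, demand, usage) : 0L;
        return Math.max(resDueToMinShare, resDueToFairShare);
    }
}
```

For example, take minShare = 4096, fairShare = 6144, demand = 5120 and usage = 2048. With both timeouts expired, resDueToMinShare = min(4096, 5120) - 2048 = 2048 and resDueToFairShare = min(6144, 5120) - 2048 = 3072, so the deficit is 3072. Demand caps the fair-share target here: the queue is never owed more than it has asked for.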

Author: 双椒叔叔
