[Translation] Load tracking in the scheduler - load tracking in the Linux scheduler, based on PELT

Preface

This is a translation of the article at https://lwn.net/Articles/639543/
The original article was published on April 15, 2015. PELT was further reworked in July 2015, mainly around the sched_avg structure, so for completeness the last section of this translation adds an analysis based on the new sched_avg structure.

Background

The scheduler is an essential part of an operating system, tasked with, among other things, ensuring that processes get their fair share of CPU time. This is not as easy as it may seem initially. While some processes perform critical operations and have to be completed at high priority, others are not time-constrained. The former category of processes expects a bigger share of CPU time than the latter so as to finish as quickly as possible. But how big a share should the scheduler allocate to them?

Another factor that adds to the complexity in scheduling is the CPU topology. Scheduling on uniprocessor systems is simpler than scheduling on the multiprocessor systems that are more commonly found today. The topology of CPUs is only getting more complex with hyperthreads and heterogeneous processors like big.LITTLE taking the place of symmetric processors. Scheduling a process on the wrong processor can adversely affect its performance. Thus, designing a scheduling algorithm that can keep all processes happy with the computing time allocated to them can be a formidable challenge.

The Linux kernel scheduler has addressed many of these challenges and matured over the years. Today there are different scheduling algorithms (or "scheduling classes") in the kernel to suit processes having different requirements. The Completely Fair Scheduling (CFS) class is designed to suit a majority of today's workloads. The realtime and deadline scheduling classes are designed for latency-sensitive and deadline-driven processes respectively. So we see that the scheduler developers have answered a range of requirements.

The Completely Fair Scheduling class

The CFS class is the class to which most tasks belong. In spite of the robustness of this algorithm, an area that has always had scope for improvement is process-load estimation.

If a CPU is associated with a number C that represents its ability to process tasks (let's call it "capacity"), then the load of a process is a metric that is expressed in units of C, indicating the number of such CPUs required to make satisfactory progress on its job. This number could also be a fraction of C, in which case it indicates that a single such CPU is good enough. The load of a process is important in scheduling because, besides influencing the time that a task spends running on the CPU, it helps to estimate overall CPU load, which is required during load balancing.

The question is how to estimate the load of a process. Should it be set statically or should it be set dynamically at run time based on the behavior of the process? Either way, how should it be calculated? There have been significant efforts at answering these questions in the recent past. As a consequence, the number of load-tracking metrics has grown significantly and load estimation itself has gotten quite complex. This landscape appears quite formidable to reviewers and developers of CFS. The aim of this article is to bring about clarification on this front.

Before proceeding, it is helpful to point out that the granularity of scheduling in Linux is at a thread level and not at a process level. However, the scheduling jargon for thread is "task." Hence throughout this article the term "task" has been used to mean a thread.

Scheduling entities and task groups

The CFS algorithm defines a time duration called the "scheduling period," during which every runnable task on the CPU should run at least once. This way no task gets starved for longer than a scheduling period. The scheduling period is divided among the tasks into time slices, which are the maximum amount of time that a task runs within a scheduling period before it gets preempted. This approach may seem to avoid task starvation at first. However, it can lead to an undesirable consequence.

Linux is a multi-user operating system. Consider a scenario where user A spawns ten tasks and user B spawns five. Using the above approach, every task would get ~7% of the available CPU time within a scheduling period. So user A gets 67% and user B gets 33% of the CPU time during their runs. Clearly, if user A continues to spawn more tasks, he can starve user B of even more CPU time. To address this problem, the concept of "group scheduling" was introduced in the scheduler, where, instead of dividing the CPU time among tasks, it is divided among groups of tasks.

In the above example, the tasks spawned by user A belong to one group and those spawned by user B belong to another. The granularity of scheduling is at a group level; when a group is picked to run, its time slice is further divided between its tasks. In the above example, each group gets 50% of the CPU's time and tasks within each group divide this share further among themselves. As a consequence, each task in group A gets 5% of the CPU and each task in group B gets 10% of the CPU. So the group that has more tasks to run gets penalized with less CPU time per task and, more importantly, it is not allowed to starve sibling groups.
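
To make the arithmetic above concrete, here is a small standalone sketch (ordinary user-space C, not kernel code) that reproduces the numbers from the user A / user B example, assuming all tasks have equal weight:

#include <stdio.h>

int main(void)
{
    double tasks_a = 10, tasks_b = 5;

    /* Without group scheduling: every task gets an equal slice of the period. */
    double flat = 1.0 / (tasks_a + tasks_b);
    printf("flat: each task %.1f%%, user A total %.1f%%, user B total %.1f%%\n",
           flat * 100, flat * tasks_a * 100, flat * tasks_b * 100);

    /* With group scheduling: the period is split between the two groups first,
     * then each group's half is divided among its own tasks. */
    double per_group = 0.5;
    printf("grouped: each of A's tasks %.1f%%, each of B's tasks %.1f%%\n",
           per_group / tasks_a * 100, per_group / tasks_b * 100);
    return 0;
}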

Group scheduling is enabled only if CONFIG_FAIR_GROUP_SCHED is set in the kernel configuration. A group of tasks is called a "scheduling entity" in the kernel and is represented by the sched_entity data structure:

struct sched_entity { 
    struct load_weight load;        /* load (weight) of this entity */
    struct sched_entity *parent;    /* parent entity when nested inside a group */
    struct cfs_rq *cfs_rq;          /* run queue this entity is queued on */
    struct cfs_rq *my_q;            /* run queue owned by this entity, if it is a group */
    struct sched_avg avg;           /* per-entity load-tracking state */
    /* ... */
};

Before getting into the details of how this structure is used, it is worth considering how and when groups of tasks are created. This happens under two scenarios:

Users may use the control group ("cgroup") infrastructure to partition system resources between tasks. Tasks belonging to a cgroup are associated with a group in the scheduler (if the scheduler controller is attached to the group).

When a new session is created through the setsid() system call. All tasks belonging to a specific session also belong to the same scheduling group. This feature is enabled when CONFIG_SCHED_AUTOGROUP is set in the kernel configuration.

Outside of these scenarios, a single task becomes a scheduling entity on its own. A task is represented by the task_struct data structure:

struct task_struct {
    struct sched_entity se;
    /* ... */
};

Scheduling is always at the granularity of a sched_entity. That is why every task_struct is associated with a sched_entity data structure. CFS also accommodates nested groups of tasks. Each scheduling entity contains a run queue represented by:

struct cfs_rq {
    struct load_weight load;            /* sum of the weights of all queued entities */
    unsigned long runnable_load_avg;    /* sum of load_avg_contrib of runnable entities */
    unsigned long blocked_load_avg;     /* load contributed by blocked (sleeping) entities */
    unsigned long tg_load_contrib;      /* this queue's contribution to its task group's load */
    /* ... */
};

Each scheduling entity may, in turn, be queued on a parent scheduling entity's run queue. At the lowest level of this hierarchy, the scheduling entity is a task; the scheduler traverses this hierarchy until the end when it has to pick a task to run on the CPU.

The parent run queue on which a scheduling entity is queued is represented by cfs_rq, while the run queue that it owns is represented by my_q in the sched_entity data structure. The scheduling entity gets picked from the cfs_rq when its turn arrives, and its time slice gets divided among the tasks on my_q.
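
The following standalone sketch illustrates that descent. The types and the pick_task()/pick_next_entity() helpers here are simplified stand-ins rather than the kernel's own (the real logic lives in pick_next_task_fair() in kernel/sched/fair.c and picks from a red-black tree ordered by vruntime):

#include <stddef.h>

struct cfs_rq;

struct sched_entity {
    struct cfs_rq *my_q;             /* run queue owned by this entity; NULL for a plain task */
};

struct cfs_rq {
    struct sched_entity *leftmost;   /* stand-in for the leftmost red-black tree node */
};

static struct sched_entity *pick_next_entity(struct cfs_rq *rq)
{
    return rq->leftmost;             /* the scheduler picks the entity with the smallest vruntime */
}

/* Keep descending through owned run queues until the picked entity is a task. */
static struct sched_entity *pick_task(struct cfs_rq *rq)
{
    struct sched_entity *se;

    do {
        se = pick_next_entity(rq);
        rq = se->my_q;               /* non-NULL only when se represents a group */
    } while (rq != NULL);

    return se;
}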

Let us now extend the concept of group scheduling to multiprocessor systems. Tasks belonging to a group can be scheduled on any CPU. Therefore it is not sufficient for a group to have a single scheduling entity; instead, every group must have one scheduling entity for each CPU. Tasks belonging to a group must move between the run queues in these per-CPU scheduling entities only, so that the footprint of the task is associated with the group even during task migrations. The data structure that represents scheduling entities of a group across CPUs is:

struct task_group {
    struct sched_entity **se;       /* one scheduling entity per CPU */
    struct cfs_rq **cfs_rq;         /* one run queue per CPU, owned by the entity above */
    unsigned long shares;           /* maximum allowed load for the group */
    atomic_long_t load_avg;         /* group load accumulated from all its per-CPU run queues */
    /* ... */
};

For every CPU c, a given task_group tg has a sched_entity called se and a run queue cfs_rq associated with it. These are related as follows:

    tg->se[c] = &se;
    tg->cfs_rq[c] = &se->my_q;

So when a task belonging to tg migrates from CPUx to CPUy, it will be dequeued from tg->cfs_rq[x] and enqueued on tg->cfs_rq[y].

Time slice and task load

The concept of a time slice was introduced above as the amount of time that a task is allowed to run on a CPU within a scheduling period. Any given task's time slice is dependent on its priority and the number of tasks on the run queue. The priority of a task is a number that represents its importance; it is represented in the kernel by a number between zero and 139. The lower the value, the higher the priority. A task that has a stricter time requirement needs to have higher priority than others.

But the priority value by itself is not helpful to the scheduler, which also needs to know the load of the task to estimate its time slice. As mentioned above, the load must be the multiple (or fraction) of the capacity of a standard CPU that is required to make satisfactory progress on the task. Hence this priority number must be mapped to such a value; this is done in the array prio_to_weight[].

A priority number of 120, which is the priority of a normal task, is mapped to a load of 1024, which is the value that the kernel uses to represent the capacity of a single standard CPU. The remaining values in the array are arranged such that the multiplier between two successive entries is ~1.25. This number is chosen such that if the priority number of a task is reduced by one level, it gets a 10% higher share of CPU time than otherwise. Similarly, if the priority number is increased by one level, the task will get a 10% lower share of the available CPU time.
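
For reference, here is a short excerpt of that mapping around the nice-0 entry; the values are copied from the kernel's table (prio_to_weight[] in the kernels discussed here, sched_prio_to_weight[] in current ones) and show the ~1.25 step between neighbouring levels:

static const int prio_to_weight_excerpt[] = {
    /* nice -2 */ 1586,
    /* nice -1 */ 1277,
    /* nice  0 */ 1024,    /* NICE_0_LOAD: one standard CPU */
    /* nice  1 */  820,    /* ~1024 / 1.25 */
    /* nice  2 */  655,
};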

Let us consider an example to illustrate this. If there are two tasks, A and B, running at a priority of 120, the portion of available CPU time given to each task is calculated as:
        1024/(1024*2) = 0.5
However if the priority of task A is increased by one level to 121, its load becomes:
        (1024/1.25) = ~820
(Recall that the higher the priority number, the lower the load.) Then, task A's portion of the CPU becomes:
        (820/(1024+820)) = ~0.45
while task B will get:
        (1024/(1024+820)) = ~0.55
This is a 10% decrease in the CPU time share for task A.

The load value of a process is stored in the weight field of the load_weight structure (which is, in turn, found in struct sched_entity):

struct load_weight {
    unsigned long weight;
};

A run queue (struct cfs_rq) is also characterized by a "weight" value that is the accumulation of the weights of all tasks on it.

The time slice can now be calculated as:
        time_slice = (sched_period() * se.load.weight) / cfs_rq.load.weight;
where sched_period() returns the scheduling period as a factor of the number of running tasks on the CPU. We see that the higher the load, the higher the fraction of the scheduling period that the task gets to run on the CPU.
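
A minimal sketch of this calculation is shown below. The helper name sched_period() follows the text above; the 6 ms target latency and 0.75 ms minimum granularity are common defaults used only for illustration (the kernel's __sched_period() and sched_slice() in kernel/sched/fair.c are the real thing):

#include <stdio.h>

#define TARGET_LATENCY_NS   6000000ULL   /* scheduling period when few tasks are runnable */
#define MIN_GRANULARITY_NS   750000ULL   /* lower bound on a single task's slice */
#define NR_LATENCY          (TARGET_LATENCY_NS / MIN_GRANULARITY_NS)

static unsigned long long sched_period(unsigned int nr_running)
{
    /* With many runnable tasks the period stretches so that each task
     * still gets at least the minimum granularity. */
    if (nr_running > NR_LATENCY)
        return nr_running * MIN_GRANULARITY_NS;
    return TARGET_LATENCY_NS;
}

int main(void)
{
    unsigned long se_weight = 1024;           /* a nice-0 task */
    unsigned long cfs_rq_weight = 1024 + 820; /* nice 0 + nice 1 on the same queue */
    unsigned int nr_running = 2;

    unsigned long long slice =
        sched_period(nr_running) * se_weight / cfs_rq_weight;

    printf("time slice: %llu ns\n", slice);   /* ~3.3 ms of the 6 ms period */
    return 0;
}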

Runtime and task load

We have seen how long a task runs on a CPU when picked, but how does the scheduler decide which task to pick? The tasks are arranged in a red-black tree in increasing order of the amount of time that they have spent running on the CPU, which is accumulated in a variable called vruntime. The lowest vruntime found in the queue is stored in cfs_rq.min_vruntime. When a new task is picked to run, the leftmost node of the red-black tree is chosen since that task has had the least running time on the CPU. Each time a new task forks or a task wakes up, its vruntime is assigned to a value that is the maximum of its last updated value and cfs_rq.min_vruntime. If not for this, its vruntime would be very small as an effect of not having run for a long time (or at all) and would take an unacceptably long time to catch up to the vruntime of its sibling tasks and hence starve them of CPU time.

Every periodic tick, the vruntime of the currently-running task is updated as follows:
        vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight);
where delta_exec is the time spent by the task since the last time vruntime was updated, NICE_0_LOAD is the load of a task with normal priority, and curr is the currently-running task. We see that vruntime progresses slowly for tasks of higher priority. It has to, because the time slice for these tasks is large and they cannot be preempted until the time slice is exhausted.
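
The sketch below makes the effect visible with plain integer arithmetic; the weights 1024 (nice 0) and 3121 (nice -5) come from the priority table shown earlier, while the kernel itself performs this scaling in calc_delta_fair() using precomputed inverse weights rather than a division:

#include <stdio.h>

#define NICE_0_LOAD 1024ULL

int main(void)
{
    unsigned long long delta_exec = 1000000;  /* 1 ms of real runtime, in ns */

    unsigned long long vruntime_nice0  = delta_exec * NICE_0_LOAD / 1024;
    unsigned long long vruntime_nice_5 = delta_exec * NICE_0_LOAD / 3121;

    /* 1 ms of execution adds ~1 ms of vruntime at nice 0, but only ~0.33 ms
     * for a nice -5 task (weight 3121), so it keeps the CPU longer. */
    printf("nice 0: +%llu ns, nice -5: +%llu ns\n",
           vruntime_nice0, vruntime_nice_5);
    return 0;
}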

Per-entity load-tracking metrics

The load of a CPU could have simply been the sum of the load of all the scheduling entities running on its run queue. In fact, that was once all there was to it. This approach has a disadvantage, though, in that tasks are associated with load values based only on their priorities. This approach does not take into account the nature of a task, such as whether it is a bursty or a steady task, or whether it is a CPU-intensive or an I/O-bound task. While this does not matter for scheduling within a CPU, it does matter when load balancing across CPUs because it helps estimate the CPU load more accurately. Therefore the per-entity load tracking metric was introduced to estimate the nature of a task numerically. This metric calculates task load as the amount of time that the task was runnable during the time that it was alive. This is kept track of in the sched_avg data structure (stored in the sched_entity structure):

struct sched_avg {
    u32 runnable_sum, runnable_avg_period;
    unsigned long load_avg_contrib;
};

Given a task p, if the sched_entity associated with it is se and the sched_avg of se is sa, then:
        sa.load_avg_contrib = (sa.runnable_sum * se.load.weight) / sa.runnable_avg_period;
where runnable_sum is the amount of time that the task was runnable and runnable_avg_period is the period during which the task could have been runnable. Therefore load_avg_contrib is the fraction of the time that the task was ready to run, scaled by the task's weight. Again, the higher the priority, the higher the load.

So tasks showing peaks of activity after long periods of inactivity and tasks that are blocked on disk access (and thus non-runnable) most of the time have a smaller load_avg_contrib than CPU-intensive tasks such as code doing matrix multiplication. In the former case, runnable_sum would be a fraction of the runnable_avg_period. In the latter, both these numbers would be equal (i.e. the task was runnable throughout the time that it was alive), identifying it as a high-load task.
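
As a rough illustration of that contrast, the sketch below applies the simplified formula to made-up numbers for an I/O-bound task and a CPU-bound task over the same window (the real metric also decays older contributions geometrically, which is ignored here):

#include <stdio.h>

int main(void)
{
    unsigned long weight = 1024;              /* nice 0 */

    /* An I/O-bound task: runnable for 10 ms out of a 100 ms window. */
    unsigned long io_contrib = 10 * weight / 100;

    /* A matrix-multiply style task: runnable for the whole window. */
    unsigned long cpu_contrib = 100 * weight / 100;

    printf("io-bound: %lu, cpu-bound: %lu\n", io_contrib, cpu_contrib);
    return 0;
}
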
The load on a CPU is the sum of the load_avg_contrib of all the scheduling entities on its run queue; it is accumulated in a field called runnable_load_avg in the cfs_rq data structure. This is roughly a measure of how heavily contended the CPU is. The kernel also tracks the load associated with blocked tasks. When a task gets blocked, its load is accumulated in the blocked_load_avg metric of the cfs_rq structure.

Per-entity load tracking in presence of task groups

Now what about the load_avg_contrib of a scheduling entity, se, when it is a group of tasks? The cfs_rq that the scheduling entity owns accumulates the load of its children in runnable_load_avg as explained above. From there, the parent task group of the cfs_rq is first retrieved:
        tg = cfs_rq->tg;
The load contributed by this cfs_rq is added to the load of the task group tg:
        cfs_rq->tg_load_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
        tg->load_avg += cfs_rq->tg_load_contrib;
The load_avg_contrib of the scheduling entity se is now calculated as:
        se->avg.load_avg_contrib =
        (cfs_rq->tg_load_contrib * tg->shares / tg->load_avg);
where tg->shares is the maximum allowed load for the task group. This means that the load of a sched_entity should be a fraction of the shares of its parent task group, in proportion to the load of its children. tg->shares can be set by users to indicate the importance of a task group. As is clear now, both runnable_load_avg and blocked_load_avg are required to estimate the load contributed by the task group.

There are still drawbacks in load tracking. The load metrics that are currently used are not CPU-frequency invariant. So if the CPU frequency increases, the load of the currently running task may appear smaller than otherwise. This may upset load-balancing decisions. The current load-tracking algorithm also falls apart in a few places when run on big.LITTLE processors. It either underestimates or overestimates the capacity of these processors. There are efforts ongoing to fix these problems. So there is good scope for improving the load-tracking heuristics in the scheduler. Hopefully this writeup has laid out the basics to help ease understanding and reviewing of the ongoing improvements on this front.

Per-entity load-tracking metrics based on the new sched_avg (supplementary section, not part of the original article)

The new sched_avg brings two changes: the internal structure is different, and a sched_avg is now embedded not only in each scheduling entity but also in each cfs_rq.

/*
 * The load_avg/util_avg accumulates an infinite geometric series.
 * 1) load_avg factors frequency scaling into the amount of time that a
 * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
 * aggregated such weights of all runnable and blocked sched_entities.
 * 2) util_avg factors frequency and cpu scaling into the amount of time
 * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
 * For cfs_rq, it is the aggregated such times of all runnable and
 * blocked sched_entities.
 * The 64 bit load_sum can:
 * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
 * the highest weight (=88761) always runnable, we should not overflow
 * 2) for entity, support any load.weight always runnable
 */
struct sched_avg {
    u64 last_update_time, load_sum;
    u32 util_sum, period_contrib;
    unsigned long load_avg, util_avg;
    unsigned long loadwop_avg, loadwop_sum;
};

load_sum: the accumulated time spent in the runnable state, decayed over time and scaled as weight * decayed accumulated time * (current frequency / highest frequency of the CPU).
loadwop_sum: the same as load_sum, except that the weight is replaced by NICE_0_LOAD, i.e. 1024.

load_avg: sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); updated once the accumulated period exceeds 1024 microseconds.
loadwop_avg: sa->loadwop_avg = div_u64(sa->loadwop_sum, LOAD_AVG_MAX); updated once the accumulated period exceeds 1024 microseconds.

LOAD_AVG_MAX is the maximum value that the accumulated runnable sum of a single entity can reach: 47742.

util_sum: the accumulated time spent in the running state, decayed over time and scaled by (current frequency / highest frequency of the CPU) * (maximum capacity of the current CPU / capacity of the highest-performance CPU in the system).
util_avg: sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
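
As a rough check on where 47742 comes from: PELT accumulates contributions in 1024 us periods and decays older periods by a factor y chosen so that y^32 = 1/2; after about 345 periods a contribution has decayed to nothing. Summing the resulting geometric series in floating point, as the sketch below does, lands close to the kernel's LOAD_AVG_MAX (the exact kernel constant is computed with fixed-point decay tables, so it differs slightly):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double y = pow(0.5, 1.0 / 32.0);   /* per-period decay factor, ~0.9786 */
    double sum = 0.0;

    for (int n = 0; n < 345; n++)      /* contributions older than ~345 periods decay to zero */
        sum += 1024.0 * pow(y, n);

    printf("y = %.4f, maximum accumulated sum ~= %.0f\n", y, sum);
    return 0;
}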


Reposted from blog.csdn.net/memory01/article/details/79933027