CFS scheduler (Principle->Source code->Summary)

1. CFS Scheduler-Basic Principles

The first question to think about is: what is a scheduler, and what is its role? The scheduler is a core part of an operating system and can be thought of as the manager of CPU time. It is responsible for selecting which ready processes get to execute, and different schedulers use different policies to pick the best process to run. The schedulers currently supported by Linux include the RT scheduler, the Deadline scheduler, the CFS scheduler and the Idle scheduler.

1. What is the scheduling class?

Starting from Linux 2.6.23, Linux introduced the concept of the scheduling class in order to modularize the scheduler. This improves scalability, makes it easier to add a new scheduler, and allows multiple schedulers to coexist in one system. The common parts of the schedulers are abstracted out, and the struct sched_class structure describes a specific scheduling class. The core scheduling code calls the core algorithms of a specific scheduling class through the members of struct sched_class. First, let's briefly introduce the roles of some members of struct sched_class.

struct sched_class {
	const struct sched_class *next;
	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
	struct task_struct * (*pick_next_task)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
    /* ... */
};

  1. next: The next member points to the next scheduling class (one priority lower than itself). In Linux, each scheduling class has a clear priority relationship. Processes managed by high-priority scheduling classes will get the right to use the CPU first.
  2. enqueue_task: Add a process to the runqueue managed by the scheduler. We call this operation enqueuing.
  3. dequeue_task: Delete a process from the runqueue managed by the scheduler. We call this operation dequeuing.
  3. check_preempt_curr: When a process is woken up or created, this checks whether the newly woken or created process can preempt the process currently running on the CPU. If it can, the TIF_NEED_RESCHED flag is set.
  5. pick_next_task: Select the most suitable task to run from the runqueue. This is also considered a core operation of the scheduler. For example, on what basis do we select the most suitable process to run? This is what every scheduler needs to focus on.

2. What are the scheduling classes in Linux?

Linux mainly includes scheduling classes such as dl_sched_class, rt_sched_class, fair_sched_class and idle_sched_class. Each process corresponds to a scheduling strategy, and each scheduling strategy corresponds to a scheduling class (each scheduling class can correspond to multiple scheduling strategies). For example, the real-time scheduler is priority-oriented and selects the process with the highest priority to run. After each process is created, it always chooses a scheduling strategy. For different scheduling strategies, the schedulers selected are also different. The scheduling classes corresponding to different scheduling strategies are as follows.

Scheduling class      Description                    Scheduling policy
-------------------   -----------------------------  --------------------------
dl_sched_class        deadline scheduler             SCHED_DEADLINE
rt_sched_class        real-time scheduler            SCHED_FIFO, SCHED_RR
fair_sched_class      completely fair scheduler      SCHED_NORMAL, SCHED_BATCH
idle_sched_class      idle task                      SCHED_IDLE

For the above scheduling classes, the system defines a clear priority order. Each scheduling class uses its next member to link into a singly linked list. From high priority to low priority the chain looks like this:

sched_class_highest----->stop_sched_class
                         .next---------->dl_sched_class
                                         .next---------->rt_sched_class
                                                         .next--------->fair_sched_class
                                                                        .next----------->idle_sched_class
                                                                                         .next = NULL

When the Linux scheduling core selects the next task to run, it traverses the scheduling classes in priority order and calls each class's pick_next_task method. Therefore a real-time process using the SCHED_FIFO scheduling policy always gets to run before an ordinary process using the SCHED_NORMAL policy. This is reflected directly in the code: the pick_next_task() function is responsible for picking the process that will run next. A trimmed-down version of the code is shown below.

static inline struct task_struct *pick_next_task(struct rq *rq,
                                                 struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;
 
	for_each_class(class) {          /* walk the scheduling classes in priority order, following the next pointers of the singly linked list */
		p = class->pick_next_task(rq, prev, rf);
		if (p)
			return p;
	}
}

For the CFS scheduler, the processes it manages all use the SCHED_NORMAL or SCHED_BATCH policy. The following sections focus on the CFS scheduler.

3. Priority of normal processes

CFS is short for Completely Fair Scheduler. The design idea behind CFS is to model an ideal, precise multitasking CPU on real hardware. Unlike earlier schedulers, CFS has no concept of a time slice; instead it hands out a proportion of CPU time. For example, if 2 processes of the same priority run on one CPU, each of them gets 50% of the CPU time. That is the fairness CFS wants to achieve.

The example above assumes equal priority, but reality is not like that: some tasks simply have a higher priority. So how does the CFS scheduler implement priority? First we introduce the concept of weight, which represents the priority of a process; CPU time is divided among processes in proportion to their weights. For example, take two processes A and B, with weights 1024 and 2048. The share of CPU time A gets is 1024/(1024+2048) = 33.3%, and B's share is 2048/(1024+2048) = 66.7%. As you can see, the larger the weight, the larger the share of time, which corresponds to a higher priority. With weights introduced, the time allocated to a process is calculated as:

time allocated to a process = total CPU time * process weight / sum of the weights of all processes on the runqueue
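
To make the proportion concrete, here is a minimal user-space sketch (plain C, not kernel code) using the hypothetical weights 1024 and 2048 from the example above and an assumed 6 ms period:

#include <stdio.h>
 
int main(void)
{
	/* Hypothetical runqueue: the two example weights from the text. */
	unsigned long weights[] = { 1024, 2048 };       /* process A, process B */
	unsigned long total = 0;
	unsigned long period_ms = 6;                    /* assumed scheduling period */
 
	for (int i = 0; i < 2; i++)
		total += weights[i];
 
	for (int i = 0; i < 2; i++)
		printf("process %c: %.1f%% of the CPU, %.2f ms of a %lu ms period\n",
		       'A' + i,
		       100.0 * weights[i] / total,
		       (double)period_ms * weights[i] / total,
		       period_ms);
	return 0;
}

This prints 33.3% / 2.00 ms for A and 66.7% / 4.00 ms for B, matching the numbers above.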

For priorities, the CFS scheduler also uses the concept of a nice value, which maps one-to-one to a weight. The nice value is simply a number in the range [-20, 19]; the smaller the value, the higher the priority and hence the larger the weight. Nice values and weights can be converted into each other, and the kernel provides a table for the conversion.

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

The values in this array follow the formula weight = 1024 / 1.25^nice. The factor 1.25 comes from the rule that a process gains roughly 10% more CPU time for every point its nice value is lowered. The formula uses the weight 1024 as the base value: a weight of 1024 corresponds to a nice value of 0, and this weight is called NICE_0_LOAD. By default, most processes run with a weight of NICE_0_LOAD.
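
As a sanity check on where these numbers come from, the short user-space sketch below (compile with -lm) compares 1024 / 1.25^nice against a few entries copied from sched_prio_to_weight; the kernel values were hand-rounded, so small deviations are expected:

#include <stdio.h>
#include <math.h>
 
int main(void)
{
	/* A few (nice, weight) pairs copied from sched_prio_to_weight. */
	const struct { int nice; int kernel_weight; } samples[] = {
		{ -5, 3121 }, { -1, 1277 }, { 0, 1024 }, { 1, 820 }, { 5, 335 },
	};
 
	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		double w = 1024.0 / pow(1.25, samples[i].nice);
		printf("nice %3d: formula %6.0f, kernel table %d\n",
		       samples[i].nice, w, samples[i].kernel_weight);
	}
	return 0;
}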

4. Scheduling delay

What is scheduling delay? The scheduling delay is the time interval within which every runnable process is guaranteed to run at least once. For example, if each process runs for 10ms and there are 2 processes in the system, the scheduling delay is 20ms; with 5 processes it becomes 50ms. If instead we keep the scheduling delay fixed, say at 6ms, then with 2 processes each one runs for 3ms, with 6 processes each runs for 1ms, and with 100 processes each would get only 0.06ms. As the number of processes grows, the time allocated to each process shrinks, and scheduling that frequently makes the context-switch overhead significant. Therefore the scheduling delay of the CFS scheduler is not unconditionally fixed. When the number of runnable processes is below a threshold (default 8), the scheduling delay is fixed at a constant value (default 6ms). When the number of runnable processes exceeds this threshold, we instead guarantee that each process runs for at least a certain amount of time before giving up the CPU. This "at least a certain amount of time" is called the minimum granularity; in the default CFS settings it is 0.75ms, stored in the variable sysctl_sched_min_granularity. The scheduling period is therefore a dynamically changing value, and it is computed by the __sched_period() function.

static u64 __sched_period(unsigned long nr_running)
{
	if (unlikely(nr_running > sched_nr_latency))
		return nr_running * sysctl_sched_min_granularity;
	else
		return sysctl_sched_latency;
}

nr_running is the number of ready processes in the system. When it exceeds sched_nr_latency, we cannot guarantee the scheduling delay, so we switch to ensuring the minimum granularity of scheduling. If nr_running does not exceed sched_nr_latency, then the scheduling period is equal to the scheduling delay sysctl_sched_latency (6ms).
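
The same logic is easy to sketch in user space. The constants below are just the defaults mentioned above (6 ms delay, 0.75 ms minimum granularity, threshold 8); the real kernel works in nanoseconds and may scale these values with the number of online CPUs:

#include <stdio.h>
 
#define SYSCTL_SCHED_LATENCY_NS          6000000ULL   /* 6 ms, default scheduling delay */
#define SYSCTL_SCHED_MIN_GRANULARITY_NS   750000ULL   /* 0.75 ms, default minimum granularity */
#define SCHED_NR_LATENCY                        8     /* latency / min_granularity */
 
/* Mirrors the structure of __sched_period() shown above. */
static unsigned long long sched_period_ns(unsigned long nr_running)
{
	if (nr_running > SCHED_NR_LATENCY)
		return nr_running * SYSCTL_SCHED_MIN_GRANULARITY_NS;
	return SYSCTL_SCHED_LATENCY_NS;
}
 
int main(void)
{
	unsigned long counts[] = { 2, 8, 10, 100 };
 
	for (unsigned int i = 0; i < 4; i++)
		printf("%3lu runnable tasks -> period %.2f ms\n",
		       counts[i], sched_period_ns(counts[i]) / 1e6);
	return 0;
}

With 2 or 8 runnable tasks the period stays at 6 ms; with 10 tasks it grows to 7.5 ms and with 100 tasks to 75 ms.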

5. Virtual time

The goal of the CFS scheduler is to ensure completely fair scheduling of each process. The CFS scheduler is like a mother with many children (processes). However, there is only one toy (CPU) in hand that needs to be distributed fairly to children to play with. Suppose there are 2 children, then how can one toy be played fairly by two children? A simpler idea is that the first child plays for 10 minutes, then the second child plays for 10 minutes, and the cycle continues. The CFS scheduler also records the execution time of each process in this way to ensure that each process gets fair CPU execution time. Therefore, whichever process takes the least amount of time to run should be allowed to run.

For example, suppose the scheduling period is 6ms and there are two processes A and B of the same priority in the system; then each process runs for 3ms within the 6ms period. Now suppose the weights of A and B are 1024 and 820 respectively (nice values 0 and 1). The running time process A gets is 6x1024/(1024+820)=3.3ms, and the running time process B gets is 6x820/(1024+820)=2.7ms. Process A's CPU share is 3.3/6x100%=55% and process B's share is 2.7/6x100%=45%. This matches the rule quoted above: "every time the nice value is decreased by one, the process gains about 10% more CPU time." Obviously the actual execution times of the two processes are not equal, yet CFS wants to ensure that the running time of every process is "equal". Therefore CFS introduces the concept of virtual time: the 2.7ms and 3.3ms above are converted by a formula into the same value, and this converted value is called virtual time. CFS then only has to ensure that every process runs for the same amount of virtual time. The conversion formula between virtual time (vruntime) and real time (wall time) is as follows:

                        NICE_0_LOAD
vruntime = wall_time * -------------
                           weight

The virtual time of process A is 3.3 * 1024 / 1024 = 3.3ms; we can see that for a process with a nice value of 0, virtual time and actual time are equal. The virtual time of process B is 2.7 * 1024 / 820 = 3.3ms. Although the weights of A and B differ, their computed virtual times are the same. So CFS mainly ensures that every process gets the same amount of virtual time to run, and when picking the next process to run it only needs to find the one with the smallest virtual time. To avoid floating-point operations, the kernel uses the trick of scaling up first and then scaling back down to preserve precision, transforming the formula as follows.

                        NICE_0_LOAD
vruntime = wall_time * -------------
                           weight
 
                        NICE_0_LOAD * 2^32
         = (wall_time * --------------------) >> 32
                              weight
 
         = (wall_time * NICE_0_LOAD * inv_weight) >> 32        (inv_weight = 2^32 / weight)

The weight values have already been calculated and stored in the sched_prio_to_weight array, so inv_weight is easy to precompute as well. The kernel stores inv_weight in the sched_prio_to_wmult array, calculated as sched_prio_to_wmult[i] = 2^32 / sched_prio_to_weight[i].
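
A quick user-space check of that relationship, using a few weight/wmult pairs copied from the two tables:

#include <stdio.h>
#include <stdint.h>
 
int main(void)
{
	/* (weight, wmult) pairs copied from sched_prio_to_weight / sched_prio_to_wmult. */
	const struct { uint32_t weight; uint32_t wmult; } samples[] = {
		{ 88761,     48388 },   /* nice -20 */
		{  1024,   4194304 },   /* nice   0 */
		{   820,   5237765 },   /* nice   1 */
		{    15, 286331153 },   /* nice  19 */
	};
 
	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		uint64_t inv = (1ULL << 32) / samples[i].weight;
		printf("weight %6u: 2^32/weight = %9llu, kernel table = %u\n",
		       samples[i].weight, (unsigned long long)inv, samples[i].wmult);
	}
	return 0;
}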

const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

The kernel uses the struct load_weight structure to describe the weight information of a process. weight is the weight of the process, and inv_weight equals 2^32/weight.

struct load_weight {
	unsigned long		weight;
	u32			inv_weight;
};

The implementation function that converts real time into virtual time is calc_delta_fair(). calc_delta_fair() calls the __calc_delta() function. The main function of __calc_delta() is to implement the calculation of the following formula.

__calc_delta() = (delta_exec * weight * lw->inv_weight) >> 32
 
                               weight
               = delta_exec * ------------          (since lw->inv_weight = 2^32 / lw->weight)
                              lw->weight

Compare this with the virtual time formula above: to compute the virtual time of a process, simply pass NICE_0_LOAD as the weight parameter here, and pass the process's struct load_weight as the lw parameter (a small user-space sketch of this fixed-point conversion appears a little further below).

static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	int shift = 32;
 
	__update_inv_weight(lw);
 
	if (unlikely(fact >> 32)) {
		while (fact >> 32) {
			fact >>= 1;
			shift--;
		}
	}
 
	fact = (u64)(u32)fact * lw->inv_weight;
 
	while (fact >> 32) {
		fact >>= 1;
		shift--;
	}
 
	return mul_u64_u32_shr(delta_exec, fact, shift);
}

According to the theory mentioned above, the weight parameter passed when the calc_delta_fair() function calls __calc_delta() is NICE_0_LOAD, and the lw parameter is the struct load_weight structure corresponding to the process.

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	if (unlikely(se->load.weight != NICE_0_LOAD))             /* 1 */
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);  /* 2 */
 
	return delta;
}
  1. According to the previous theory, the virtual time and actual time of a process with a nice value of 0 (weight is NICE_0_LOAD) are equal. Therefore, if the weight of the process is NICE_0_LOAD, the virtual time corresponding to the process does not need to be calculated.
  2. Call the __calc_delta() function.
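
To see the fixed-point arithmetic in action, here is a small user-space sketch (not the kernel implementation) that converts 2.7 ms of wall time for a nice-1 task, using weight 820 and inv_weight 5237765 from the tables above, and compares the shift-based result with the exact division:

#include <stdio.h>
#include <stdint.h>
 
#define NICE_0_LOAD 1024ULL
 
/* Fixed-point version of delta * NICE_0_LOAD / weight, in the spirit of __calc_delta().
 * Unlike the kernel, this sketch does not guard against intermediate overflow. */
static uint64_t calc_delta_fixed(uint64_t delta_ns, uint32_t inv_weight)
{
	uint64_t fact = NICE_0_LOAD * inv_weight;
 
	return (delta_ns * fact) >> 32;
}
 
int main(void)
{
	uint64_t delta_ns = 2700000;                 /* 2.7 ms of wall time */
	uint32_t weight = 820, inv_weight = 5237765; /* nice 1, from the tables above */
 
	printf("fixed point: %llu ns, exact division: %llu ns\n",
	       (unsigned long long)calc_delta_fixed(delta_ns, inv_weight),
	       (unsigned long long)(delta_ns * NICE_0_LOAD / weight));
	return 0;
}

Both paths give roughly 3.37 ms of virtual time, i.e. the 3.3 ms in the earlier example; the kernel version additionally shrinks fact and shift to avoid overflow for very large values.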

Linux describes each process with the struct task_struct structure. However, what a scheduling class manages and schedules is the scheduling entity, not the task_struct itself. When group scheduling is supported, a group is also abstracted into a scheduling entity, and a group is not a task. That is why we find the following scheduling entities of the different scheduling classes inside struct task_struct.

struct task_struct {
	struct sched_entity		se;
	struct sched_rt_entity		rt;
	struct sched_dl_entity		dl;
    /* ... */
}

se, rt, and dl correspond to the scheduling entities of the CFS scheduler, RT scheduler, and Deadline scheduler respectively.

The struct sched_entity structure describes the scheduling entity, including struct load_weight, which is used to record weight information. In addition, the time information that we have always been concerned about must also be recorded together. The struct sched_entity structure is simplified as follows:

struct sched_entity {
	struct load_weight		load;
	struct rb_node		run_node;
	unsigned int		on_rq;
	u64			sum_exec_runtime;
	u64			vruntime;
};
  1. load: weight information, the inv_weight member will be used when calculating virtual time.
  2. run_node: Each ready queue of the CFS scheduler maintains a red-black tree, which is full of tasks waiting to be executed. run_node is the mounting point.
  3. on_rq: After the scheduling entity se joins the ready queue, on_rq is set to 1. After being deleted from the ready queue, on_rq is set to 0.
  4. sum_exec_runtime: The total actual time that the scheduling entity has been running.
  5. vruntime: The total virtual time that the scheduling entity has been running.

6. Ready queue (runqueue)

Each CPU in the system will have a global ready queue (cpu runqueue), which is described by a struct rq structure. It is a per-cpu type, that is, there will be a struct rq structure on each CPU. Each scheduling class also has its own managed ready queue. For example, struct cfs_rq is the ready queue of the CFS scheduling class, which manages the ready struct sched_entity scheduling entity. Subsequently, the pick_next_task interface is used to select the most suitable scheduling entity for running (the scheduling entity with the smallest virtual time) from the ready queue. struct rt_rq is the real-time scheduler ready queue. Struct dl_rq is the Deadline scheduler ready queue.

struct rq {
	struct cfs_rq cfs;
	struct rt_rq rt;
	struct dl_rq dl;
};
 
struct rb_root_cached {
	struct rb_root rb_root;
	struct rb_node *rb_leftmost;
};
 
struct cfs_rq {
	struct load_weight load;
	unsigned int nr_running;
	u64 min_vruntime;
	struct rb_root_cached tasks_timeline;
};
  1. load: Ready queue weight, the sum of the weights of all scheduling entities managed by the ready queue.
  2. nr_running: The number of scheduling entities on the ready queue.
  3. min_vruntime: Tracks the minimum virtual time for all scheduled entities on the ready queue.
  4. tasks_timeline: used to track information about the red-black tree of scheduling entities sorted by virtual time size (including the root of the red-black tree and the leftmost node in the red-black tree).

CFS maintains a red-black tree sorted by virtual time, and all runnable scheduling entities are inserted into the red-black tree sorted by p->se.vruntime. As shown below.

CFS selects the leftmost process in the red-black tree to run. As system time goes by, the process that originally ran on the left will slowly move to the right of the red-black tree, and the process that originally ran on the right will eventually run to the far left. So every process in the red-black tree has a chance to run.

Let's summarize. All processes in Linux are described by task_struct, which contains a lot of process-related information (for example the priority, the process state and the scheduling entities). However, a scheduling class does not manage task_struct directly; instead it introduces the concept of the scheduling entity. The CFS scheduler uses sched_entity to track scheduling information, and uses cfs_rq to track ready-queue information, manage the runnable scheduling entities and maintain a red-black tree sorted by virtual time. tasks_timeline->rb_root is the root of the red-black tree, and tasks_timeline->rb_leftmost points to the leftmost scheduling entity in the tree, i.e. the one with the smallest virtual time (rb_leftmost acts as a cache so that the most suitable entity can be selected faster). Every runnable scheduling entity sched_entity contains the rb_node used to insert it into the red-black tree, and its vruntime member records the virtual time it has already run. These data structures are laid out in the figure below.

2. CFS Scheduler-Source Code Analysis

1. Process creation

A process is created through the do_fork() function. When a new process is born, the scheduling core notifies the scheduling class and calls a dedicated interface to initialize the newborn. Let's follow do_fork() down the call chain: do_fork()---->_do_fork()---->copy_process()---->sched_fork(). A trimmed-down version of sched_fork() looks like this:

int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
	p->state = TASK_NEW;
	p->prio = current->normal_prio;
	p->sched_class = &fair_sched_class;         /* 1 */
 
	if (p->sched_class->task_fork)
		p->sched_class->task_fork(p);           /* 2 */
 
	return 0;
}
  1. Here we take CFS as an example to select the scheduling class for the task. fair_sched_class is the CFS scheduling class.
  2. Call the task_fork function in the scheduling class. The task_fork method mainly performs fork-related operations. The parameter p passed is the created task_struct.

The CFS scheduling class fair_sched_class method is as follows:

const struct sched_class fair_sched_class = {
	.next				= &idle_sched_class,
	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task			= yield_task_fair,
	.yield_to_task		= yield_to_task_fair,
	.check_preempt_curr	= check_preempt_wakeup,
	.pick_next_task		= pick_next_task_fair,
	.put_prev_task		= put_prev_task_fair,
#ifdef CONFIG_SMP
	.select_task_rq		= select_task_rq_fair,
	.migrate_task_rq	= migrate_task_rq_fair,
	.rq_online			= rq_online_fair,
	.rq_offline			= rq_offline_fair,
	.task_dead			= task_dead_fair,
	.set_cpus_allowed	= set_cpus_allowed_common,
#endif
	.set_curr_task      = set_curr_task_fair,
	.task_tick			= task_tick_fair,
	.task_fork			= task_fork_fair,
	.prio_changed		= prio_changed_fair,
	.switched_from		= switched_from_fair,
	.switched_to		= switched_to_fair,
	.get_rr_interval	= get_rr_interval_fair,
	.update_curr		= update_curr_fair,
#ifdef CONFIG_FAIR_GROUP_SCHED
	.task_change_group	= task_change_group_fair,
#endif
};

task_fork_fair is implemented as follows:

static void task_fork_fair(struct task_struct *p)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se, *curr;
	struct rq *rq = this_rq();
	struct rq_flags rf;
 
	rq_lock(rq, &rf);
	update_rq_clock(rq);
 
	cfs_rq = task_cfs_rq(current);
	curr = cfs_rq->curr;                     /* 1 */
	if (curr) {
		update_curr(cfs_rq);                 /* 2 */
		se->vruntime = curr->vruntime;       /* 3 */
	}
	place_entity(cfs_rq, se, 1);             /* 4 */
 
	se->vruntime -= cfs_rq->min_vruntime; /* 5 */
	rq_unlock(rq, &rf);
}
  1. cfs_rq is the CFS scheduler ready queue, and curr points to the scheduling entity of the task currently running on the CPU.
  2. The update_curr() function is a relatively important function and is called in many places. It mainly updates the running time information of the currently running scheduling entity.
  3. Initialize the virtual time of the currently created new process.
  4. The place_entity() function is called both when a process is created and when it is woken up; the parameter initial=1 is passed when a process is created. Its main purpose is to adjust the virtual time of the scheduling entity (the se->vruntime member). The value should not differ too much from cfs_rq->min_vruntime: if it were much smaller, the new process would monopolize the CPU like crazy.
  5. Why subtract cfs_rq->min_vruntime here? Because the vruntime just computed is based on the cfs_rq of the current CPU, and the process has not yet been added to any ready queue. When the new process is later woken up, the ready queue it joins is not necessarily on the CPU this calculation was based on, so the enqueue path adds the target queue's cfs_rq->min_vruntime back as appropriate. Why "subtract first, add later"? Suppose the min_vruntime of the CFS ready queue on cpu0 is 1000000 and the new process is given a vruntime of 1000500 at creation time, but the process is woken up onto the CFS ready queue of cpu1, whose min_vruntime is 9000000. Without the "subtract first, add later" trick the process would be "very happy" on cpu1 and run like crazy, because its vruntime would be far below everyone else's. With it, the entity's vruntime becomes 1000500 - 1000000 + 9000000 = 9000500, so things are not that bad (a small numeric sketch of this follows the list).
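
A tiny sketch of the arithmetic in point 5, using the hypothetical min_vruntime values from the example above:

#include <stdio.h>
#include <stdint.h>
 
int main(void)
{
	uint64_t cpu0_min_vruntime = 1000000;   /* cfs_rq->min_vruntime on cpu0 */
	uint64_t cpu1_min_vruntime = 9000000;   /* cfs_rq->min_vruntime on cpu1 */
	uint64_t se_vruntime = 1000500;         /* assigned at fork time, based on cpu0 */
 
	/* task_fork_fair(): make the value relative to the fork-time runqueue. */
	se_vruntime -= cpu0_min_vruntime;
	/* enqueue_entity() on cpu1: make it absolute again on the target runqueue. */
	se_vruntime += cpu1_min_vruntime;
 
	/* Prints 9000500 rather than the far-too-small 1000500. */
	printf("vruntime after enqueue on cpu1: %llu\n", (unsigned long long)se_vruntime);
	return 0;
}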

Let’s take a closer look at update_curr().

static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;
 
	if (unlikely(!curr))
		return;
 
	delta_exec = now - curr->exec_start;                    /* 1 */
	if (unlikely((s64)delta_exec <= 0))
		return;
 
	curr->exec_start = now;
	curr->sum_exec_runtime += delta_exec;
	curr->vruntime += calc_delta_fair(delta_exec, curr);    /* 2 */
	update_min_vruntime(cfs_rq);                            /* 3 */
}
  1. delta_exec calculates the difference between this updated virtual time and the last updated virtual time.
  2. To update the virtual time of the current scheduling entity, the calc_delta_fair() function calculates the virtual time based on the virtual time calculation formula mentioned above (that is, calling the __calc_delta() function).
  3. Update the minimum virtual time min_vruntime of the CFS ready queue. min_vruntime is updated continuously and tracks the smallest virtual time of all scheduling entities on the ready queue. If min_vruntime were never updated and stayed too small, a new process created later would initialize its virtual time from this stale value, and that newly created process could monopolize the CPU again, this time on the CPU it was created on (for example cpu0).

Let's take a look at how update_min_vruntime() updates min_vruntime.

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
	u64 vruntime = cfs_rq->min_vruntime;
 
	if (curr) {
		if (curr->on_rq)
			vruntime = curr->vruntime;
		else
			curr = NULL;
	}
 
	if (leftmost) { /* non-empty tree */
		struct sched_entity *se;
		se = rb_entry(leftmost, struct sched_entity, run_node);
 
		if (!curr)
			vruntime = se->vruntime;
		else
			vruntime = min_vruntime(vruntime, se->vruntime);
	}
 
	/* ensure we never gain time by being placed backwards. */
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
}

Since we want to update the minimum virtual time min_vruntime of the ready queue, think about it: where can the minimum virtual time come from?

  • The cfs_rq->min_vruntime member of the ready queue itself.
  • The virtual time of the currently running process, because the CFS scheduler always picks the process with the smallest virtual time in the red-black tree as the most suitable one to run.
  • If a process joins the ready queue while the current process is running, the virtual time of the leftmost process in the red-black tree may also be the minimum virtual time.

Therefore, the update_min_vruntime() function updates the minimum virtual time based on the above possible judgments and ensures that the minimum virtual time of the ready queue, min_vruntime, monotonically increases.

Let's continue with the place_entity() function.

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;
 
	/*
	 * The 'current' period is already promised to the current tasks,
	 * however the extra weight of the new task will slow them down a
	 * little, place the new task so that it fits in the slot that
	 * stays open at the end.
	 */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);               /* 1 */
 
	/* sleeps up to a single latency don't count. */
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;
 
		/*
		 * Halve their sleep time's effect, to allow
		 * for a gentler effect of sleepers:
		 */
		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;
 
		vruntime -= thresh;                                 /* 2 */
	}
 
	/* ensure we never gain time by being placed backwards. */
	se->vruntime = max_vruntime(se->vruntime, vruntime); /* 3 */
}
  1. If this function is called from process creation, the initial parameter is 1, so this branch handles the creation case. A newly created process is given a certain penalty: a value is added to its virtual time. After all, the smaller the virtual time, the sooner a process gets scheduled. The penalty time is calculated by sched_vslice().
  2. This is mainly for processes that wake up. For processes that have been sleeping for a long time, we always expect them to be scheduled and executed quickly. After all, they have slept for so long. So a certain virtual time is subtracted here as compensation.
  3. We guarantee that the virtual time of the scheduling entity never goes backwards. Why? Think about it: if a process sleeps for just 1ms and then wakes up, and you reward it with 3ms (subtract 3ms of virtual time), it has actually earned 2ms. As a scheduler, we are not in the business of losing money. If you sleep for 100ms and are rewarded 3ms, that is no problem.

What is the penalty time for newly created processes?

As can be seen from the above, the penalty time calculation function is the sched_vslice() function.

static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	return calc_delta_fair(sched_slice(cfs_rq, se), se);
}

The calc_delta_fair() function has been analyzed above and calculates the virtual time corresponding to the actual running time delta. The delta here is calculated by the sched_slice() function.

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);    /* 1 */
 
	for_each_sched_entity(se) {                                     /* 2 */
		struct load_weight *load;
		struct load_weight lw;
 
		cfs_rq = cfs_rq_of(se);
		load = &cfs_rq->load;                                       /* 3 */
 
		if (unlikely(!se->on_rq)) {
			lw = cfs_rq->load;
 
			update_load_add(&lw, se->load.weight);
			load = &lw;
		}
		slice = __calc_delta(slice, se->load.weight, load);         /* 4 */
	}
	return slice;
}
  1. As mentioned earlier, __sched_period() calculates the scheduling period based on the number of ready queue scheduling entities.
  2. For the case where group scheduling is not enabled, for_each_sched_entity(se) is for (; se; se = NULL), looping once.
  3. Get the weight of the ready queue, which is the sum of the weights of all scheduling entities on the ready queue.
  4. The __calc_delta() function has two uses. Besides converting a process's real running time into virtual time as described above, it can also compute the fraction of the whole ready queue's weight that the scheduling entity se accounts for, and multiply the scheduling period by that fraction to obtain the time the current scheduling entity should run (the weight parameter carries the weight of the scheduling entity se, and the lw parameter carries the ready queue weight cfs_rq->load). For example, if the ready queue weight is 3072, the weight of the current scheduling entity se is 1024 and the scheduling period is 6ms, then the entity should get 6*1024/3072 = 2ms, as in the sketch below.
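
The proportion in the example can be sketched as follows (the 6 ms period and the weights 1024 and 3072 are the example numbers from the list above, not values read from a real runqueue):

#include <stdio.h>
#include <stdint.h>
 
int main(void)
{
	uint64_t period_ns = 6000000;   /* scheduling period from __sched_period() */
	uint64_t se_weight = 1024;      /* weight of this scheduling entity */
	uint64_t rq_weight = 3072;      /* sum of the weights on the ready queue */
 
	/* sched_slice(): the entity's share of the period, proportional to its weight. */
	uint64_t slice_ns = period_ns * se_weight / rq_weight;
 
	printf("slice = %.2f ms\n", slice_ns / 1e6);   /* 2.00 ms */
	return 0;
}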

2. Adding a new process to the ready queue

After most of the initialization work in do_fork() has been completed, we can wake the new process up so it is ready to run, i.e. add the new process to the ready queue so it can be scheduled. The flow of waking up a new process is shown below.

do_fork()--->_do_fork()--->wake_up_new_task()--->activate_task()--->enqueue_task()--->enqueue_task_fair()
                                   |
                                   +------------>check_preempt_curr()--->check_preempt_wakeup()

wake_up_new_task() is responsible for waking up the newly created process. A simplified version of the function is shown below.

void wake_up_new_task(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;
 
	p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
	p->recent_used_cpu = task_cpu(p);
	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));   /* 1 */
#endif
	rq = __task_rq_lock(p, &rf);
	activate_task(rq, p, ENQUEUE_NOCLOCK);                                   /* 2 */
	p->on_rq = TASK_ON_RQ_QUEUED;
	check_preempt_curr(rq, p, WF_FORK);                                      /* 3 */
}
  1. Re-select a CPU for the process by calling select_task_rq(), which invokes the scheduling class's select_task_rq method to pick the idlest CPU.
  2. Add the process to the ready queue by calling the enqueue_task method in the scheduling class.
  3. Now that the new process is ready, you need to check whether the new process meets the conditions for preempting the currently running process. If the preemption conditions are met, the TIF_NEED_RESCHED flag needs to be set.

The enqueue_task method function corresponding to the CFS scheduling class is enqueue_task_fair(). We delete part of the code related to group scheduling. The concise code looks pleasing to the eye.

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se;
 
	for_each_sched_entity(se) {                       /* 1 */
		if (se->on_rq)                                /* 2 */
			break;
		cfs_rq = cfs_rq_of(se);
		enqueue_entity(cfs_rq, se, flags);            /* 3 */
	}
 
	if (!se)
		add_nr_running(rq, 1);
 
	hrtick_update(rq);
}
  1. When group scheduling is turned off, this loop runs only once, so don't worry about it.
  2. The on_rq member represents whether the scheduling entity is already in the ready queue. A value of 1 means it is in the ready queue. Of course, there is no need to continue adding to the ready queue.
  3. enqueue_entity, as you can see from the name, adds the scheduling entity to the ready queue, which we call enqueue.

The enqueue_entity() code is as follows, and some parts of the code that do not require attention for the time being have been deleted.

static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
	bool curr = cfs_rq->curr == se;
 
	/*
	 * If we're the current task, we must renormalise before calling
	 * update_curr().
	 */
	if (renorm && curr)
		se->vruntime += cfs_rq->min_vruntime;
 
	update_curr(cfs_rq);                        /* 1 */
 
	if (renorm && !curr)
		se->vruntime += cfs_rq->min_vruntime; /* 2 */
 
	account_entity_enqueue(cfs_rq, se);         /* 3 */
 
	if (flags & ENQUEUE_WAKEUP)
		place_entity(cfs_rq, se, 0);            /* 4 */
 
	if (!curr)
		__enqueue_entity(cfs_rq, se);           /* 5 */
	se->on_rq = 1;                              /* 6 */
}
  1. update_curr() updates the virtual time information of the currently running scheduling entity.
  2. Remember the min_vruntime subtracted at the end of the task_fork_fair() function? Now it's time to add it back.
  3. Update information related to the ready queue, such as the ready queue weight.
  4. For the awakened process (the flag has the ENQUEUE_WAKEUP mark), we need to provide certain compensation according to the situation. I also talked about the role of the place_entity() function in two situations before. Of course, there is no need to call this when a new process joins the ready queue for the first time.
  5. __enqueue_entity() is to add se to the red-black tree maintained by the ready queue. All ses use vruntime as the key.
  6. The completion of all operations also means that se has joined the ready queue and the on_rq member is set.

What information does the account_entity_enqueue() function update in the ready queue?

static void account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	update_load_add(&cfs_rq->load, se->load.weight);  /* 1 */
	if (!parent_entity(se))
		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);    /* 2 */
#ifdef CONFIG_SMP
	if (entity_is_task(se)) {
		struct rq *rq = rq_of(cfs_rq);
 
		account_numa_enqueue(rq, task_of(se));
		list_add(&se->group_node, &rq->cfs_tasks);                 /* 3 */
	}
#endif
	cfs_rq->nr_running++;                                          /* 4 */
}
  1. Update the ready queue weight by adding the weight of se to it.
  2. The per-CPU ready queue struct rq also needs its weight information updated.
  3. Add the scheduling entity se to the linked list.
  4. The nr_running member is the number of all scheduling entities on the ready queue.

What about vruntime overflow?

Although the vruntime member of the scheduling entity se is of type u64 and can hold very large numbers, it overflows once it reaches 2^64 ns. Is overflow a problem? Let's first look at the code __enqueue_entity() uses to insert into the ready queue.

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
	struct rb_node *parent = NULL;
	struct sched_entity *entry;
	bool leftmost = true;
 
	/*
	 * Find the right place in the rbtree:
	 */
	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct sched_entity, run_node);
		/*
		 * We dont care about collisions. Nodes with
		 * the same key stay together.
		 */
		if (entity_before(se, entry)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = false;
		}
	}
 
	rb_link_node(&se->run_node, parent, link);
	rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);
}

We traverse the red-black tree to find the position where the node should be inserted, using the entity_before() function to compare the vruntime of two scheduling entities se and decide the search direction.

static inline int entity_before(struct sched_entity *a, struct sched_entity *b)
{
	return (s64)(a->vruntime - b->vruntime) < 0;
}

Assume that the vruntime of a, the entity to be inserted, is 101 and the vruntime of b is 100; then entity_before() returns 0. Now suppose a's vruntime has overflowed and is 5 (we expected 2^64 + 5, but unfortunately it wrapped around to 5), while b's vruntime is about to overflow with the value 2^64 - 2. Then whether a's vruntime is 5 or 2^64 + 5, entity_before() still returns 0, so the comparison result stays consistent and overflow causes no problem. To understand this code you need to understand how negative numbers are represented in a computer.

The same C-language trick is applied to the ready queue's min_vruntime member. min_vruntime is also a u64 and can also overflow; is that a problem? Again no. Look at the last line of the update_min_vruntime() function, cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime); the max_vruntime() function uses the same technique as entity_before(), so min_vruntime overflow causes no problem either, and max_vruntime() still returns the correct result.

static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
{
	s64 delta = (s64)(vruntime - max_vruntime);
	if (delta > 0)
		max_vruntime = vruntime;
 
	return max_vruntime;
}
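
The wraparound argument can be checked directly with a user-space sketch that uses the same signed-difference comparison as entity_before(); the values are chosen to reproduce the overflow example above:

#include <stdio.h>
#include <stdint.h>
 
/* Same comparison as entity_before(): still correct across u64 wraparound,
 * as long as the two values are less than 2^63 apart. */
static int before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}
 
int main(void)
{
	uint64_t b = UINT64_MAX - 1;   /* vruntime just below the overflow point (2^64 - 2) */
	uint64_t a = 5;                /* conceptually 2^64 + 5, wrapped around to 5 */
 
	printf("before(a, b) = %d\n", before(a, b));            /* 0: a is NOT before b */
	printf("a - b = %llu\n", (unsigned long long)(a - b));  /* 7, the true distance */
	return 0;
}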

Conditions for preempting the current process

When waking up a new process, this is also an opportunity to detect preemption. Because the awakened process may have a higher priority or a smaller virtual time. Immediately after waking up the new process in the previous section, call the check_preempt_curr() function to check whether the preemption conditions are met.

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
	const struct sched_class *class;
 
	if (p->sched_class == rq->curr->sched_class) {
		rq->curr->sched_class->check_preempt_curr(rq, p, flags);   /* 1 */
	} else {
		for_each_class(class) {                                    /* 2 */
			if (class == rq->curr->sched_class)
				break;
			if (class == p->sched_class) {
				resched_curr(rq);
				break;
			}
		}
	}
}
  1. The awakened process and the current process belong to the same scheduling class. Directly call the check_preempt_curr method of the scheduling class to check the preemption conditions. After all, the scheduler manages the process itself and knows best whether it is suitable to preempt the current process.
  2. If the awakened process and the current process do not belong to the same scheduling class, you need to compare the priorities of the scheduling classes. For example, the current process is a CFS scheduling class, and the awakened process is an RT scheduling class. Naturally, the real-time process needs to preempt the current process because it has a higher priority.

Now consider the situation where the awakened process and the current process belong to the same CFS scheduling class. The natural call is the check_preempt_wakeup() function.

static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
	struct task_struct *curr = rq->curr;
	struct sched_entity *se = &curr->se, *pse = &p->se;
	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 
	if (wakeup_preempt_entity(se, pse) == 1)    /* 1 */
		goto preempt;
 
	return;
preempt:
	resched_curr(rq);                           /* 2 */
}
  1. Check whether the awakened process meets the conditions for preempting the current process.
  2. If the current process can be preempted, set the TIF_NEED_RESCHED flag.

The wakeup_preempt_entity() function is as follows.

/*
 * Should 'se' preempt 'curr'.    
 */
static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
	if (vdiff <= 0)                    /* 1 */
		return -1;
 
	gran = wakeup_gran(se);
	if (vdiff > gran)                  /* 2 */
		return 1;
 
	return 0;
}

The wakeup_preempt_entity() function can return 3 results. The virtual time of se1, se2, se3 and curr scheduling entities is shown in the figure below. If the curr virtual time is smaller than se, return -1; if the curr virtual time is larger than se, and the difference between the two is less than gran, return 0; otherwise, return 1. By default, the value returned by the wakeup_gran() function is 1ms virtual time calculated based on the weight of the scheduling entity se. Therefore, the condition for satisfying preemption is that the virtual time of the awakened process is first smaller than the virtual time of the running process, and the difference must be greater than a certain value (this value is sysctl_sched_wakeup_granularity, called wake-up preemption granularity). The purpose of this is to avoid preemption too frequently, causing a large number of context switches to affect system performance.

se3             se2    curr         se1
------|---------------|------|-----------|--------> vruntime
          |<------gran------>|
                         
 
     wakeup_preempt_entity(curr, se1) = -1
     wakeup_preempt_entity(curr, se2) =  0
     wakeup_preempt_entity(curr, se3) =  1
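
The three cases can be reproduced with a small user-space sketch of the same decision; the vruntime values and the 1 ms granularity below are made-up numbers chosen to match the figure, not values taken from a real system:

#include <stdio.h>
#include <stdint.h>
 
/* Simplified wakeup_preempt_entity(): should 'se' preempt 'curr'? */
static int should_preempt(uint64_t curr_vruntime, uint64_t se_vruntime, uint64_t gran)
{
	int64_t vdiff = (int64_t)(curr_vruntime - se_vruntime);
 
	if (vdiff <= 0)
		return -1;              /* curr still has the smaller (or equal) vruntime */
	if ((uint64_t)vdiff > gran)
		return 1;               /* se is far enough ahead: preempt */
	return 0;                       /* ahead, but within the wakeup granularity */
}
 
int main(void)
{
	uint64_t gran = 1000000;        /* roughly 1 ms of virtual time */
	uint64_t curr = 5000000;        /* hypothetical vruntime of the running task */
 
	printf("se1: %d\n", should_preempt(curr, 6000000, gran));   /* -1 */
	printf("se2: %d\n", should_preempt(curr, 4500000, gran));   /*  0 */
	printf("se3: %d\n", should_preempt(curr, 3000000, gran));   /*  1 */
	return 0;
}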

3. Periodic scheduling

Periodic scheduling means that Linux periodically checks whether the currently running process has used up its time slice and whether it should be preempted. This check usually happens in the timer interrupt handler: the call chain works its way down layer by layer and finally reaches the scheduler_tick() function.

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;
	struct rq_flags rf;
 
	sched_clock_tick();
	rq_lock(rq, &rf);
	update_rq_clock(rq);
	curr->sched_class->task_tick(rq, curr, 0);        /* 1 */
	cpu_load_update_active(rq);
	calc_global_load_tick(rq);
	rq_unlock(rq, &rf);
	perf_event_task_tick();
#ifdef CONFIG_SMP
	rq->idle_balance = idle_cpu(cpu);
	trigger_load_balance(rq);                         /* 2 */
#endif
}
  1. Call the task_tick method corresponding to the scheduling class. For the CFS scheduling class, this function is task_tick_fair.
  2. Triggering load balancing will be discussed in detail later when we have time.

The task_tick_fair() function is as follows.

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;
 
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		entity_tick(cfs_rq, se, queued);
	}
}

The for loop exists for group scheduling; when group scheduling is not enabled, the loop runs only once.

entity_tick() does the main work.

static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);                      /* 1 */
 
	if (cfs_rq->nr_running > 1)
		check_preempt_tick(cfs_rq, curr);     /* 2 */
}
  1. Call update_curr() to update the virtual time and other information of the currently running scheduling entity.
  2. If the number of ready scheduling entities in the ready queue is greater than 1, you need to check whether the preemption conditions are met. If preemption is possible, set the TIF_NEED_RESCHED flag.

The check_preempt_tick() function is as follows.

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;
 
	ideal_runtime = sched_slice(cfs_rq, curr);    /* 1 */
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;    /* 2 */
	if (delta_exec > ideal_runtime) {
		resched_curr(rq_of(cfs_rq));              /* 3 */
		clear_buddies(cfs_rq, curr);
		return;
	}
 
	if (delta_exec < sysctl_sched_min_granularity)    /* 4 */
		return;
 
	se = __pick_first_entity(cfs_rq);             /* 5 */
	delta = curr->vruntime - se->vruntime;
 
	if (delta < 0)                                /* 6 */
		return;
 
	if (delta > ideal_runtime)                    /* 7 */
		resched_curr(rq_of(cfs_rq));
}
  1. The sched_slice() function has been analyzed above and calculates the time slice that the curr process should allocate in this scheduling cycle. It should be preempted when the time slice is used up.
  2. delta_exec is the actual time the current process has been running.
  3. If the actual running time has exceeded the time slice allocated to the process, it is natural to preempt the current process. Set the TIF_NEED_RESCHED flag.
  4. In order to prevent frequent excessive preemption, we should ensure that each process running time should not be less than the minimum granularity time sysctl_sched_min_granularity. Therefore if the running time is less than the minimum granularity time, it should not be preempted.
  5. Find the scheduling entity with the smallest virtual time from the red-black tree.
  6. If the virtual time of the current process is still smaller than the virtual time of the leftmost scheduling entity in the red-black tree, scheduling should not occur.
  7. Comparing a virtual time with a real time here looks odd and feels like a bug. Checking the commit record, the author's intention was to make it easier for tasks with a small weight to be preempted.

The process of each periodic scheduling (scheduling tick) above can be summarized as follows.

  1. Updates the virtual time of the currently running process.
  2. Check whether the current process meets the conditions for being preempted.
    • if (delta_exec > ideal_runtime), then set TIF_NEED_RESCHED.
  3. Check the TIF_NEED_RESCHED flag.
    • If it is set, pick the process with the smallest virtual time from the ready queue to run.
    • Re-add the process that was just preempted to the ready queue's red-black tree (enqueue task).
    • Remove the process that is about to run from the ready queue's red-black tree (dequeue task).

4. How to choose the next suitable process to run

When a process has the TIF_NEED_RESCHED flag set, system scheduling is triggered at some later point; alternatively the process calls the schedule() function itself to voluntarily give up the CPU, which also triggers scheduling. Let's take the schedule() function as an example.

asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;
 
	sched_submit_work(tsk);
	do {
		preempt_disable();
		__schedule(false);
		sched_preempt_enable_no_resched();
	} while (need_resched());
}

The main job is still the __schedule() function.

static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;
 
	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;
 
	if (!preempt && prev->state) {
		if (unlikely(signal_pending_state(prev->state, prev))) {
			prev->state = TASK_RUNNING;
		} else {
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);    /* 1 */
			prev->on_rq = 0;
		}
	}
 
	next = pick_next_task(rq, prev, &rf);    /* 2 */
	clear_tsk_need_resched(prev);            /* 3 */
 
	if (likely(prev != next)) {
		rq->curr = next;
		rq = context_switch(rq, prev, next, &rf);    /* 4 */
	}
 
	balance_callback(rq);
}
  1. For processes that actively give up the CPU and go to sleep, we need to delete the process from the corresponding ready queue.
  2. Select the next appropriate process to start running. This function has been analyzed previously.
  3. Clear the TIF_NEED_RESCHED flag.
  4. Context switch, switch from prev process to next process.

The CFS scheduling class pick_next_task method is the pick_next_task_fair() function.

static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;
	struct task_struct *p;
	int new_tasks;
 
again:
	if (!cfs_rq->nr_running)
		goto idle;
 
	put_prev_task(rq, prev);                        /* 1 */
	do {
		se = pick_next_entity(cfs_rq, NULL);        /* 2 */
		set_next_entity(cfs_rq, se);                /* 3 */
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);                               /* 4 */
 
	p = task_of(se);
#ifdef CONFIG_SMP
	list_move(&p->se.group_node, &rq->cfs_tasks);
#endif
 
	if (hrtick_enabled(rq))
		hrtick_start_fair(rq, p);
 
	return p;
idle:
	new_tasks = idle_balance(rq, rf);
 
	if (new_tasks < 0)
		return RETRY_TASK;
 
	if (new_tasks > 0)
		goto again;
 
	return NULL;
}
  1. It is mainly used to handle the funeral affairs of the prev process. This function will be called when the process gives up the CPU.
  2. Select the most appropriate scheduling entity to run.
  3. The selected scheduling entity se still needs to be processed before it can be put into operation. The set_next_entity() function is responsible for the processing.
  4. When group scheduling is not enabled, the cycle ends once.

What exactly does put_prev_task() do to wrap up the prev process? The function behind the CFS scheduling class's put_prev_task method is put_prev_task_fair().

static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
	struct sched_entity *se = &prev->se;
	struct cfs_rq *cfs_rq;

	for_each_sched_entity(se) {           /* 1 */
		cfs_rq = cfs_rq_of(se);
		put_prev_entity(cfs_rq, se);      /* 2 */
	}
}
  1. This loop exists for group scheduling; we do not consider it for now.
  2. put_prev_entity() is the part that does the real work.

put_prev_entity()函数如下。

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
	/*
	 * If still on the runqueue then deactivate_task()
	 * was not called and update_curr() has to be done:
	 */
	if (prev->on_rq)                            /* 1 */
		update_curr(cfs_rq);

	if (prev->on_rq) {
		/* Put 'current' back into the tree. */
		__enqueue_entity(cfs_rq, prev);         /* 2 */
		/* in !on_rq case, update occurred at dequeue */
		update_load_avg(cfs_rq, prev, 0);       /* 3 */
	}
	cfs_rq->curr = NULL;                        /* 4 */
}
  1. If the prev process is still on the ready queue, it is most likely that the prev process has been preempted. Information such as process virtual time needs to be updated before giving up the CPU. If the prev process is not on the ready queue, the update can be skipped directly. Because the prev process has already called update_curr() in deactivate_task(), it can be omitted here.
  2. If the prev process is still on the ready queue, we need to re-insert the prev process into the red-black tree to wait for scheduling.
  3. update_load_avg() updates the load information of the prev process, which is used during load balancing.
  4. After the funeral has been processed, the curr pointer of the ready queue should also point to NULL, which means that there is no running process on the ready queue.

The funeral affairs of the prev process have been dealt with; the next process chosen to take over the CPU now uses the set_next_entity() function to announce itself to the world.

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* 'current' is not kept within the tree. */
	if (se->on_rq) {
		__dequeue_entity(cfs_rq, se);                   /* 1 */
		update_load_avg(cfs_rq, se, UPDATE_TG);         /* 2 */
	}
	cfs_rq->curr = se;                                  /* 3 */
    update_stats_curr_start(cfs_rq, se);                /* 4 */
	se->prev_sum_exec_runtime = se->sum_exec_runtime;   /* 5 */
}
  1. __dequeue_entity() removes the scheduling entity from the red-black tree. For the process that is about to run, we remove it from the red-black tree; when it is later preempted, put_prev_entity() will insert it back into the tree. So this echoes the re-insertion into the red-black tree in put_prev_entity().
  2. Update process load information. Load balancing will be used.
  3. Update the ready queue curr member to tell the world, "Now I am the currently running process."
  4. The update_stats_curr_start() function is just one sentence, updating the exec_start member of the scheduling entity to prepare for the update_curr() function statistics time.
  5. Record prev_sum_exec_runtime, the total runtime at the moment this entity was picked; check_preempt_tick() later uses it to work out how long the current process has been running and whether it can be preempted by another process.

5. Process sleep

In the __schedule() function, if the prev process goes to sleep voluntarily, the deactivate_task() function is called, and deactivate_task() eventually calls the scheduling class's dequeue_task method. For the CFS scheduling class the corresponding function is dequeue_task_fair(), the reverse of enqueue_task_fair().

static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se;
	int task_sleep = flags & DEQUEUE_SLEEP;

	for_each_sched_entity(se) {                 /* 1 */
		cfs_rq = cfs_rq_of(se);
		dequeue_entity(cfs_rq, se, flags);      /* 2 */
	}

	if (!se)
		sub_nr_running(rq, 1);
}
  1. For group scheduling operations, when group scheduling is not enabled, the cycle only occurs once.
  2. Delete the scheduling entity se from the corresponding ready queue cfs_rq.

The dequeue_entity() function is as follows.

static void dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	update_curr(cfs_rq);                          /* 1 */

	if (se != cfs_rq->curr)
		__dequeue_entity(cfs_rq, se);             /* 2 */
	se->on_rq = 0;                                /* 3 */
	account_entity_dequeue(cfs_rq, se);           /* 4 */

	if (!(flags & DEQUEUE_SLEEP))
		se->vruntime -= cfs_rq->min_vruntime; /* 5 */
}
  1. Take the opportunity to update the virtual time information of the currently running process. If the current dequeue process is the currently running process, then update_curr() is necessary this time.
  2. If se is the currently running process, its scheduling entity is not in the red-black tree anyway, so there is no need to call __dequeue_entity() to remove it from the tree.
  3. The scheduling entity has been removed from the red-black tree of the ready queue, so the on_rq member is updated.
  4. Update ready queue related information, such as weight information. Introduced later.
  5. If the process is not going to sleep (for example, it is migrating from one CPU to another), its virtual time needs the current ready queue's minimum virtual time subtracted, for the reasons explained earlier. After the migration, the minimum virtual time of the target CFS ready queue will be added back at enqueue time.

account_entity_dequeue() is the opposite of the account_entity_enqueue() operation mentioned earlier. The account_entity_dequeue() function is as follows.

static void account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	update_load_sub(&cfs_rq->load, se->load.weight);      /* 1 */
	if (!parent_entity(se))
		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
	if (entity_is_task(se)) {
		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
		list_del_init(&se->group_node);                   /* 2 */
	}
#endif
	cfs_rq->nr_running--;                                 /* 3 */
}
  1. Subtract the weight of the current dequeue scheduling entity from the sum of the ready queue weights.
  2. Delete the scheduling entity se from the linked list.
  3. The count of runnable scheduling entities in the ready queue is decremented by 1.

3. CFS Scheduler-Group Scheduling

1. Introduction

Most computers today support multiple users logging in. Suppose a machine is shared by two users, A and B, with user A running 9 processes and user B running only 1. According to the earlier description of the CFS scheduler, user A would get 90% of the CPU time and user B only 10%. As user A keeps starting more processes, user B gets less and less CPU time, which is clearly unfair. Therefore we introduce the concept of group scheduling. With the user group as the scheduling unit, users A and B each get 50% of the CPU time; each of user A's processes gets about 5.5% (50%/9) of the CPU, while user B's process gets 50%. This matches our expectation. This part of the article explains how CFS group scheduling is implemented.
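
The arithmetic of this example as a short sketch (9 processes for user A, 1 for user B, all weights assumed equal):

#include <stdio.h>
 
int main(void)
{
	int nr_a = 9, nr_b = 1;
 
	/* Without group scheduling: every process competes on its own. */
	printf("flat:  each of A's processes %.1f%%, B's process %.1f%%\n",
	       100.0 / (nr_a + nr_b), 100.0 / (nr_a + nr_b));
 
	/* With group scheduling: each user group first gets 50%, which is then
	 * split among the processes inside the group. */
	printf("group: each of A's processes %.1f%%, B's process %.1f%%\n",
	       50.0 / nr_a, 50.0 / nr_b);
	return 0;
}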

2. Revisiting the scheduling entity

In the earlier sections we saw that what the CFS scheduler manages is the scheduling entity. Each process is described by a task_struct, which contains a sched_entity that takes part in scheduling; for now, call this kind of scheduling entity a task se. With group scheduling, we use a task_group to describe a group, which manages all the processes inside the group. Because the unit the CFS ready queue manages is the scheduling entity, a task_group cannot do without a sched_entity either, so the task_group structure also embeds scheduling entities; we call this kind of scheduling entity a group se. task_group is defined in kernel/sched/sched.h.

struct task_group {
	struct cgroup_subsys_state css;
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* schedulable entities of this group on each CPU */
	struct sched_entity	**se;                   /* 1 */
	/* runqueue "owned" by this group on each CPU */
	struct cfs_rq		**cfs_rq;               /* 2 */
	unsigned long		shares;                 /* 3 */
#ifdef	CONFIG_SMP
	atomic_long_t		load_avg ____cacheline_aligned;    /* 4 */
#endif
#endif
	struct cfs_bandwidth	cfs_bandwidth;
    /* ... */
};
  1. Array of pointers; the size of the array equals the number of CPUs. First assume a system with only one CPU. A user group is represented by a scheduling entity, which is inserted into the corresponding red-black tree. For example, user group A and user group B above are two scheduling entities hung on the top-level ready queue cfs_rq. User group A manages 9 runnable processes, and these 9 scheduling entities are children of user group A's scheduling entity; the relationship is established through the se->parent member. User group A also maintains its own ready queue, which we temporarily call the group cfs_rq; the scheduling entities of the 9 managed processes are hung on this group cfs_rq. When we pick a process to run, we first pick user group A from the root cfs_rq and then pick one of the processes from user group A's group cfs_rq. Now consider a multi-core system: processes in a user group can run on multiple CPUs, so we need one scheduling entity per CPU, each hung on the root cfs_rq of its CPU.
  2. The group cfs_rq mentioned above; also an array of pointers whose size is the number of CPUs. Because processes can run on every CPU, one group cfs_rq per CPU must be maintained.
  3. A scheduling entity has a weight, and CPU time is allocated in proportion to it. A user group also has a weight; shares is the weight of the task_group.
  4. The total load contribution of the entire user group.

If our number of CPUs is equal to 2 and there is only one user group, then the group scheduling diagram in the system is as follows.

There are a total of 8 processes running in the system. There are 3 processes running on CPU0 and 5 processes running on CPU1. It contains a user group A, and user group A contains 5 processes. The CPU time obtained by group se on CPU0 is the sum of the CPU time obtained by all processes managed by group cfs_rq corresponding to group se. After the system starts, there is a root_task_group by default, which manages the top-level CFS ready queue cfs_rq in the system. On a 2-CPU system, the length of the task_group structure se and cfs_rq member arrays is 2, and each group se corresponds to a group cfs_rq.

3. The relationship between data structures

Assume that the system contains 4 CPUs and group scheduling is turned on. The relationship between various structures is as follows.


 

There is a global ready queue struct rq on each CPU; on a 4-CPU system there are 4 of them, shown as the purple structures in the picture. By default the system has only one root task_group, called root_task_group. rq->cfs_rq points to the system's root CFS ready queue. The root CFS ready queue maintains a red-black tree with a total of 10 ready scheduling entities, 9 of which are task se and 1 a group se (the blue se in the figure). The my_q member of the group se points to its own ready queue, whose red-black tree holds 9 task se whose parent member points to the group se. Each group se corresponds to one group cfs_rq; 4 CPUs correspond to 4 group se and 4 group cfs_rq, stored in the se and cfs_rq members of the task_group structure respectively. The se->depth member records the nesting depth of an se: entities under the top-level CFS ready queue have depth 0, and the depth increases by one per group level. The cfs_rq->nr_running member counts the scheduling entities directly on the CFS ready queue, excluding child ready queues, while cfs_rq->h_nr_running counts all scheduling entities in the hierarchy, including those on the group cfs_rq that belongs to a group se. For example, in the upper half of the figure nr_running and h_nr_running are 10 and 19 respectively; the extra 9 is the h_nr_running of the group cfs_rq. Since the group cfs_rq contains no further group se, its nr_running and h_nr_running are both 9.
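
To make the counting concrete, here is a small, self-contained toy model (not kernel code) of the hierarchy in the figure: a root cfs_rq with 9 task se plus 1 group se whose my_q holds another 9 task se. The struct and field names are illustrative only.

#include <stdio.h>

/* Toy model (not kernel code) of the relationships described above. */
struct toy_cfs_rq {
	int nr_running;     /* entities directly on this queue            */
	int h_nr_running;   /* entities in the whole hierarchy below it   */
};

struct toy_se {
	int depth;                 /* 0 on the root cfs_rq, +1 per group level */
	struct toy_se *parent;     /* group se one level up, NULL at the top   */
	struct toy_cfs_rq *my_q;   /* owned group cfs_rq; NULL for a task se   */
};

int main(void)
{
	struct toy_cfs_rq group_cfs_rq = { .nr_running = 9, .h_nr_running = 9 };
	struct toy_cfs_rq root_cfs_rq  = { .nr_running = 10,   /* 9 task se + 1 group se */
					   .h_nr_running = 10 + group_cfs_rq.h_nr_running };
	struct toy_se group_se = { .depth = 0, .parent = NULL, .my_q = &group_cfs_rq };
	struct toy_se task_se  = { .depth = 1, .parent = &group_se, .my_q = NULL };

	printf("root:  nr_running=%d h_nr_running=%d\n",
	       root_cfs_rq.nr_running, root_cfs_rq.h_nr_running);    /* 10, 19 */
	printf("group: nr_running=%d h_nr_running=%d\n",
	       group_cfs_rq.nr_running, group_cfs_rq.h_nr_running);  /* 9, 9 */
	printf("task se depth=%d, parent is the group se: %s\n",
	       task_se.depth, task_se.parent == &group_se ? "yes" : "no");
	return 0;
}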

4. Group process scheduling

How to schedule processes within a user group? Through the above analysis, we can conveniently select the appropriate process layer by layer through the root CFS ready queue. For example, first select the group se suitable for operation from the root ready queue, then find the corresponding group cfs_rq, and then select the task se from group cfs_rq. In the CFS scheduling class, the function to select a process is pick_next_task_fair().

static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct cfs_rq *cfs_rq = &rq->cfs;               /* 1 */
	struct sched_entity *se;
	struct task_struct *p;
 
	put_prev_task(rq, prev);
	do {
		se = pick_next_entity(cfs_rq, NULL);        /* 2 */
		set_next_entity(cfs_rq, se);
		cfs_rq = group_cfs_rq(se);                  /* 3 */
	} while (cfs_rq);                               /* 4 */
 
	p = task_of(se);
 
	return p;
}
  1. Traversal starts from the root CFS ready queue.
  2. Select the SE with the smallest virtual time from the red-black tree of the ready queue cfs_rq.
  3. group_cfs_rq() returns the se->my_q member. If it is task se, then group_cfs_rq() returns NULL. If it is group se, then group_cfs_rq() returns the group cfs_rq corresponding to group se.
  4. If it is group se, we need to select the next se with the smallest virtual time from the red-black tree on group cfs_rq, and loop until the bottom task se.

5. Group process preemption

Periodic scheduling will call the task_tick_fair() function.

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;
 
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		entity_tick(cfs_rq, se, queued);
	}
}

for_each_sched_entity() is a macro, for (; se; se = se->parent), which walks up the se->parent chain. entity_tick() in turn calls check_preempt_tick(), which was covered earlier. check_preempt_tick() sets the TIF_NEED_RESCHED flag when the conditions for preempting the current process are met. The condition is simple: walking along the se->parent chain, if any se has run longer than its allotted time slice, a reschedule is needed. The macro itself is shown below.
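
For reference, the macro is defined in kernel/sched/fair.c roughly as follows; without CONFIG_FAIR_GROUP_SCHED the loop body runs only once.

#ifdef CONFIG_FAIR_GROUP_SCHED
/* Walk up scheduling entities hierarchy */
#define for_each_sched_entity(se) \
		for (; se; se = se->parent)
#else
#define for_each_sched_entity(se) \
		for (; se; se = NULL)
#endif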

6. The weight of the user group

Every process has a weight, and the CFS scheduler allocates CPU time according to the size of the weight. The task_group is no exception; as mentioned earlier, its weight is recorded in the shares member. Following the earlier example, the system has 2 CPUs, so the task_group must contain two group se and their corresponding group cfs_rq. The task_group weight is distributed between these 2 group se in proportion, as shown in the figure below.


 

On CPU0 the group se has 2 task se below it, with a weight sum of 3072; on CPU1 the group se has 3 task se below it, with a weight sum of 4096. The task_group weight is 1024. Therefore the group se weight on CPU0 is 439 (1024*3072/(3072+4096)) and the group se weight on CPU1 is 585 (1024-439). Of course this is the simplest way of computing the group se weight; the formula actually used in the code takes the load contribution ratio of each group cfs_rq into account rather than the simple weight ratio. A toy calculation of this simple split is shown below.
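
The following minimal C program just reproduces that arithmetic under the numbers assumed in the figure (weight sums 3072 and 4096, tg->shares = 1024); it is an illustration only, not the kernel's calc_group_shares() logic, which is covered in section 8.

#include <stdio.h>

/* Toy illustration (not kernel code) of the simple proportional split above. */
int main(void)
{
	unsigned long tg_shares = 1024;   /* task_group weight (tg->shares)    */
	unsigned long cpu0_sum  = 3072;   /* weight sum on CPU0's group cfs_rq */
	unsigned long cpu1_sum  = 4096;   /* weight sum on CPU1's group cfs_rq */
	unsigned long total     = cpu0_sum + cpu1_sum;

	/* round to nearest, as the article does */
	unsigned long gse0 = (tg_shares * cpu0_sum + total / 2) / total;
	unsigned long gse1 = tg_shares - gse0;

	printf("group se weight on CPU0: %lu\n", gse0);   /* 439 */
	printf("group se weight on CPU1: %lu\n", gse1);   /* 585 */
	return 0;
}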

7. Time slice allocation for user groups

The function that computes the time allocated to each process is sched_slice(). The earlier analysis did not consider group scheduling; how should the time a process gets be adjusted once group scheduling is taken into account? Let's start with a simple example that ignores group scheduling: on a single-core system there are two processes, each with weight 1024. Ignoring group scheduling, the time allocated to a scheduling entity se is computed as follows:

                           se->load.weight
time = sched_period * -------------------------
                         cfs_rq->load.weight

That is, we multiply the scheduling period by the ratio of se's weight to the total weight of the CFS ready queue. According to the earlier analysis, the scheduling period for two processes is 6ms, so each process is allocated 6ms*1024/(1024+1024) = 3ms.

Now consider the case of group scheduling. The system is still single-core, there is a task_group, and the weight of all processes is 1024. The task_group weight is also 1024 (that is, the share value). As shown below.

The process allocation time calculation formula under group cfs_rq is as follows (gse := group se; gcfs_rq := group cfs_rq):

                           se->load.weight              gse->load.weight
time = sched_period * ------------------------- * ------------------------
                         gcfs_rq->load.weight        cfs_rq->load.weight

According to the formula, the time allocated to a process under the group cfs_rq is:

                  1024               1024
time = 6ms * --------------- * ---------------- = 1.5ms
               1024 + 1024        1024 + 1024

Based on the two calculation formulas above, we can calculate the time allocated to each process in the above example as shown in the figure below.

The above covers a task_group nested one level deep. If a task_group in turn contains another task_group, the formula simply gains one more ratio factor per additional level. The function that implements this calculation is sched_slice(); a toy walk-through of the computation follows the notes on the function.

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);    /* 1 */
 
	for_each_sched_entity(se) {                                     /* 2 */
		struct load_weight *load;
		struct load_weight lw;
 
		cfs_rq = cfs_rq_of(se);
		load = &cfs_rq->load;                                       /* 3 */
 
		if (unlikely(!se->on_rq)) {
			lw = cfs_rq->load;
 
			update_load_add(&lw, se->load.weight);
			load = &lw;
		}
		slice = __calc_delta(slice, se->load.weight, load);         /* 4 */
	}
	return slice;
}
  1. The scheduling period is calculated based on the current number of ready processes. By default, when there are no more than 8 processes, the scheduling period defaults to 6ms.
  2. The for loop calculates the proportion upwards based on the se->parent linked list.
  3. Obtain the load information of cfs_rq attached to se.
  4. Calculate the value of slice = slice * se->load.weight / cfs_rq->load.weight.
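
To make the loop concrete, here is a toy, non-kernel walk-through of the two-level example above (a task se of weight 1024 on a group cfs_rq of total weight 2048, whose group se of weight 1024 sits on a root cfs_rq of total weight 2048); the real sched_slice() works in nanoseconds and scales through __calc_delta() using inverse weights.

#include <stdio.h>

/* Toy walk-through (not kernel code) of the sched_slice() loop. */
struct level {
	double se_weight;       /* weight of the se at this level        */
	double cfs_rq_weight;   /* total weight of the cfs_rq it sits on */
};

int main(void)
{
	struct level path[] = {
		{ 1024, 2048 },   /* task se on its group cfs_rq */
		{ 1024, 2048 },   /* group se on the root cfs_rq */
	};
	double slice = 6.0;   /* ms, scheduling period */
	int i;

	for (i = 0; i < 2; i++)   /* mirrors for_each_sched_entity() walking up */
		slice = slice * path[i].se_weight / path[i].cfs_rq_weight;

	printf("slice = %.1f ms\n", slice);   /* 1.5 ms */
	return 0;
}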

8. Group Se weight calculation

The example above computed the weight of a group se from the simple weight ratio; the actual code, however, does not do it that way. On task enqueue, task dequeue and on the scheduler tick, the weight of the group se is updated through the update_cfs_group() function.

static void update_cfs_group(struct sched_entity *se)
{
	struct cfs_rq *gcfs_rq = group_cfs_rq(se);              /* 1 */
	long shares, runnable;
 
	if (!gcfs_rq)
		return;
 
	shares   = calc_group_shares(gcfs_rq);                 /* 2 */
	runnable = calc_group_runnable(gcfs_rq, shares);
 
	reweight_entity(cfs_rq_of(se), se, shares, runnable);  /* 3 */
}
  1. Get the group cfs_rq corresponding to group se.
  2. Calculate new weight values.
  3. Update the weight value of group se to shares.

calc_group_shares() calculates new weights based on the current group cfs_rq load.

static long calc_group_shares(struct cfs_rq *cfs_rq)
{
	long tg_weight, tg_shares, load, shares;
	struct task_group *tg = cfs_rq->tg;
 
	tg_shares = READ_ONCE(tg->shares);
	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
	tg_weight = atomic_long_read(&tg->load_avg);
	/* Ensure tg_weight >= load */
	tg_weight -= cfs_rq->tg_load_avg_contrib;
	tg_weight += load;
	shares = (tg_shares * load);
	if (tg_weight)
		shares /= tg_weight;
 
	return clamp_t(long, shares, MIN_SHARES, tg_shares);
}

According to the calc_group_shares() function, we can get the weight calculation formula as follows (grq := group cfs_rq):

                                 tg->shares * load
ge->load.weight = -------------------------------------------------
                   tg->load_avg - grq->tg_load_avg_contrib + load
 
load = max(grq->load.weight, grq->avg.load_avg)

tg->load_avg is the sum of the load contributions of all of the group's cfs_rq. grq->tg_load_avg_contrib is the amount of load that this group cfs_rq has already contributed to tg->load_avg. Because tg is a globally shared variable that many CPUs may access at the same time, the updated load contribution of a group cfs_rq is not folded into tg->load_avg immediately, in order to avoid heavy contention; it is only propagated once the difference between grq->avg.load_avg and tg_load_avg_contrib exceeds a certain threshold. For example, on a 2-CPU system, the group cfs_rq on CPU0 starts with some initial tg_load_avg_contrib. While the periodic load updates stay below the threshold, the tg variable is not touched; once grq->avg.load_avg exceeds tg_load_avg_contrib by the threshold (say 2000), tg->load_avg is updated to 2000 and tg_load_avg_contrib is set to 2000 as well. After many more periods, when the difference reaches 2000 again, tg->load_avg is updated to 4000, and so on. This avoids frequent accesses to the tg variable.
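
In the kernel this batching is done by update_tg_load_avg(); a simplified sketch of its core logic, based on Linux 4.18 and omitting the root_task_group special case, is shown below. Note that the real threshold is 1/64 of the current contribution, not the fixed 2000 used in the example above.

/* Simplified sketch of update_tg_load_avg() (Linux 4.18): propagate the
 * cfs_rq's load contribution to tg->load_avg only when it drifted enough. */
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}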

But what is the basis for this formula? How is it derived? The most intuitive approach is the one used in the earlier section "The weight of the user group": compute the group se weight from the weight proportion of its group cfs_rq. The formula is as follows.

                   tg->shares * grq->load.weight
ge->load.weight = -------------------------------               (1)
		      \Sum grq->load.weight

Computing the sum \Sum grq->load.weight is too expensive (on systems with many CPUs, accessing the group cfs_rq of other CPUs causes heavy contention for the data). Therefore we approximate using the average load. The average load changes slowly, so the approximated value is easier to compute and more stable. The approximation replaces the weight with the average load:

grq->load.weight -> grq->avg.load_avg                           (2)

After approximation, formula (1) is transformed as follows:

                  tg->shares * grq->avg.load_avg
ge->load.weight = ------------------------------                (3)
			tg->load_avg
 
Where: tg->load_avg ~= \Sum grq->avg.load_avg

The problem with formula (3) is that, because the average load changes very slowly (by design), it leads to transients at boundary conditions: when an idle group starts running a process, grq->avg.load_avg on our CPU needs time to build up, producing undesirable latency. In this special case (which is exactly the UP, single-CPU, scenario), formula (1) evaluates to:

                   tg->shares * grq->load.weight
ge->load.weight = ------------------------------- = tg->shares (4)
			grq->load.weight

Our goal is to modify the approximation (3) so that in the UP scenario it approaches formula (4):

                             tg->shares * grq->load.weight
ge->load.weight = ---------------------------------------------------          (5)
                   tg->load_avg - grq->avg.load_avg + grq->load.weight

But because grq->load.weight can drop to 0, the denominator could become 0. Therefore we use grq->avg.load_avg as its lower bound, which then gives:

                    tg->shares * grq->load.weight
ge->load.weight = -------------------------------                 (6)
                            tg_load_avg'
 
 Where: tg_load_avg' = tg->load_avg - grq->avg.load_avg +
                       max(grq->load.weight, grq->avg.load_avg)

On the UP system, formula (6) is similar to formula (4). Under normal circumstances, formula (6) is similar to formula (3).

To be honest, it is really a lot of formulas, and there are various approximations and parameters. It’s always confusing to see the results of a formula at once, because it may involve multiple different optimization modifications, some of which may be summaries of experience, and some of which may be actual environment tests. When you can't understand the formula, you might as well go back to how it was when this function was first added. The initial version is always easier to accept. Then, follow each submission record to see the reasons for optimizing the code, step by step, maybe "facing the sea with spring flowers blooming".

4. CFS scheduler-PELT (per entity load tracking)

1. Why is PELT needed?

In order to make the scheduler smarter, we always hope that the system can meet the maximum throughput while minimizing power consumption. Although there may be some contradictions, the reality is always like this. The PELT algorithm was incorporated into Linux 3.8. So before that, what problems did we have to introduce the PELT algorithm? Prior to Linux 3.8, CFS tracked load on a per-runqueue (rq) basis. But with this approach, we cannot determine the source of the current load. At the same time, even when the workload is relatively stable, tracking the load at the rq level can cause large changes in its value. In order to solve the above problems, the PELT algorithm tracks the load of each scheduling entity (per-scheduling entity).

2. How to perform PELT

For the underlying principles, please refer to the article "per-entity load tracking"; let me shamelessly excerpt from it. To achieve per-entity load tracking, time (physical time, not virtual time) is divided into a sequence of 1024us periods. In each 1024us period, an entity's contribution to the system load is computed from the time it was runnable in that period (time spent running on the CPU plus time spent waiting to be scheduled). If the runnable time in the period is x, the contribution is x/1024. Of course, an entity's accumulated load can exceed 1024us, because we also accumulate the load of past periods, each multiplied by a decay factor. If Li denotes the contribution of a scheduling entity to the system load in period pi, then its total contribution can be expressed as:

L = L0 + L1 * y + L2 * y^2 + L3 * y^3 + ... + Ln * y^n
  • y^32 = 0.5, y ≈ 0.97857206

When you see this formula for the first time you may well wonder what it is all about. Let's take an example of how an se's load contribution is computed: suppose a task runs for 4096us after first joining the rq and then goes to sleep. What is its load contribution at 1023us, 2047us, 3071us, 4095us, 5119us, 6143us, 7167us and 8191us?

1023us: L0 = 1023
2047us: L1 = 1023 + 1024 * y = 1023 + (L0 + 1) * y = 2025
3071us: L2 = 1023 + 1024 * y + 1024 * y^2 = 1023 + (L1 + 1) * y = 3005
4095us: L3 = 1023 + 1024 * y + 1024 * y^2 + 1024 * y^3 = 1023 + (L2 + 1) * y = 3963
5119us: L4 = 0 + 1024 * y + 1024 * y^2 + 1024 * y^3 + 1024 * y^4 = 0 + (L3 + 1) * y = 3877
6143us: L5 = 0 + 0 + 1024 * y^2 + 1024 * y^3 + 1024 * y^4 + 1024 * y^5 = 0 + L4 * y = 3792
7167us: L6 = 0 + L5 * y = L4 * y^2 = 3709
8191us: L7 = 0 + L6 * y = L5 * y^2 = L4 * y^3 = 3627

After the above examples, it is not difficult to find a rule. To calculate the load at the current time, we only need to multiply the sum of the load contributions of the previous period by the attenuation coefficient y, and add the load at the current time point.
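
The rule can be checked with a small, self-contained C program (not kernel code) that replays the example: each period the old sum decays by y and the new period's contribution is added. The printed values are close to the table above (the table rounds intermediate results).

#include <stdio.h>

/* Toy reproduction (not kernel code) of the table above: the task runs for
 * 4096us after being enqueued and then sleeps. */
int main(void)
{
	const double y = 0.97857206;   /* y^32 = 0.5 */
	double L = 1023;               /* L0, sampled at 1023us */
	int i;

	for (i = 1; i <= 7; i++) {
		if (i <= 4)
			/* the 1us finishing the old period is decayed together with it;
			 * periods 1-3 then add 1023us of new running time, period 4 adds 0 */
			L = (L + 1) * y + (i < 4 ? 1023 : 0);
		else
			L = L * y;     /* sleeping: pure decay */
		printf("L%d = %.0f\n", i, L);
	}
	return 0;
}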

From the calculation above we can also see that we often need to compute val*y^n, so the kernel provides the decay_load() function to compute a value decayed over n periods. To avoid floating-point arithmetic, shifts and multiplications are used instead: decay_load(val, n) = (val * y^n * 2^32) >> 32. The values of y^n * 2^32 are computed in advance and stored in the runnable_avg_yN_inv array.


 

runnable_avg_yN_inv[n] = y^n * 2^32, 0 <= n < 32


 

For the calculation of runnable_avg_yN_inv, see the calc_runnable_avg_yN_inv() function in the Documentation/scheduler/sched-pelt.c file. Since y^32 = 0.5, we only need to compute y^0*2^32 through y^31*2^32 and store them in the array; when n is greater than 31, y^n*2^32 can be derived indirectly using y^32 = 0.5. For example, y^33*2^32 = y^32 * y * 2^32 = 0.5 * y * 2^32 = 0.5 * runnable_avg_yN_inv[1]. In short, calc_runnable_avg_yN_inv() computes runnable_avg_yN_inv[i] = ((1UL << 32) - 1) * pow(0.97857206, i) for 0 <= i < 32, where pow(x, y) is x raised to the power y. The resulting runnable_avg_yN_inv array is as follows:

static const u32 runnable_avg_yN_inv[] = {
	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
	0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
	0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
	0x85aac367, 0x82cd8698,
};

With the values of the runnable_avg_yN_inv array, the decay_load() function is easy to implement.

/*
 * Approximate:
 *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
 */
static u64 decay_load(u64 val, u64 n)
{
	unsigned int local_n;
 
	if (unlikely(n > LOAD_AVG_PERIOD * 63))                              /* 1 */
		return 0;
 
	/* after bounds checking we can collapse to 32-bit */
	local_n = n;
 
	/*
	 * As y^PERIOD = 1/2, we can combine
	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
	 * With a look-up table which covers y^n (n<PERIOD)
	 *
	 * To achieve constant time decay_load.
	 */
	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {                           /* 2 */
		val >>= local_n / LOAD_AVG_PERIOD;
		local_n %= LOAD_AVG_PERIOD;
	}
 
	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);         /* 2 */
	return val;
}
  1. The value of LOAD_AVG_PERIOD is 32. After LOAD_AVG_PERIOD * 63 = 2016 periods the decayed value is treated as 0, i.e. val*y^n = 0 for n > 2016.
  2. When n is greater than or equal to 32, y^n is computed using y^32 = 0.5: y^n * 2^32 = (1/2)^(n/32) * y^(n%32) * 2^32 = (1/2)^(n/32) * runnable_avg_yN_inv[n%32]. A small worked example follows.
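
As a worked example, assuming the array values listed above:

decay_load(47742, 40):
    n = 40 = 1 * 32 + 8
    val >>= 40 / 32              -> 47742 >> 1 = 23871
    local_n = 40 % 32 = 8
    mul_u64_u32_shr(23871, runnable_avg_yN_inv[8], 32)
        = (23871 * 0xd744fcc9) >> 32 ≈ 20073
    (check: 47742 * y^40 = 47742 * 0.5 * y^8 ≈ 20073)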

3. How to calculate the current load contribution

From the example above we know that computing the current load contribution does not require recording every historical contribution; knowing the contribution at the previous update is enough, which greatly simplifies the implementation. Let's continue with the example. Assume a task starts running at time 0, so after 1022us its load contribution is 1022. What is the contribution after another 10us (the current time is 1032us)? It is simple: 2us of the 10us combine with the previous 1022us to complete one 1024us period, and that period must be decayed once, so the current contribution is (1022 + 2)y + 8 = 1022y + 2y + 8 = 1010. The 1022y term appears because one full period has passed, so the previous contribution is decayed once; the 2y term is the 2us that completed that period, which must also be decayed; the remaining 8us belong to the current period and are not decayed. Now another 2124us pass (the current time is 3156us). What is the contribution now? 2124us can be split into three parts: 1016us fill up the incomplete period left over from the previous update, 1024us form one whole period, and 84us are the elapsed part of the current, still incomplete period. Two period boundaries are crossed, so the previous contribution is decayed twice, giving the 1010y^2 term; the 1016us also get decayed twice, giving 1016y^2; the whole 1024us period is one period away from the current time, so it is decayed once, giving 1024y; the final 84us are the current remainder and are not decayed. In total: 1010y^2 + 1016y^2 + 1024y + 84 = 3024.

From these examples we can derive a formula for the general case. Suppose the load contribution at the previous update was u; how do we compute the contribution after a further time d? As in the example, the time d is split into three parts: d1 is the remainder of the (incomplete) period farthest from the current time, d2 is the time made up of complete periods in between, and d3 is the elapsed part of the (still incomplete) current period. Suppose d spans p periods (d = d1 + d2 + d3, p = 1 + d2/1024). d1, d2 and d3 are illustrated below:

      d1          d2           d3
      ^           ^            ^
      |           |            |
    |<->|<----------------->|<--->|
|---x---|------| ... |------|-----x (now)
          
                           p-1
 u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
                           n=1
                             p-1
    = u y^p + d1 y^p + 1024 \Sum y^n + d3 y^0
                             n=1

The example above can now be plugged into this formula. The previous contribution is u = 1010, the elapsed time d = 2124us is decomposed into the three parts d1 = 1016us, d2 = 1024us and d3 = 84us, and p = 2 periods are crossed. The current contribution is therefore u' = 1010y^2 + 1016y^2 + 1024y + 84, matching the result computed above. A small C walk-through follows.
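
A small, self-contained C program (not kernel code) that plugs the numbers of this example into the formula:

#include <stdio.h>

/* Worked example of u' = (u + d1) y^p + 1024 \Sum y^n + d3 for u = 1010,
 * d = 2124us split into d1 = 1016, d2 = 1024, d3 = 84, p = 2. */
int main(void)
{
	const double y = 0.97857206;
	double u = 1010, d1 = 1016, d3 = 84;
	int p = 2, n;

	double u_new = (u + d1) * y * y;   /* (u + d1) decayed over p = 2 periods  */
	double yn = 1.0;
	for (n = 1; n <= p - 1; n++) {     /* full periods in between              */
		yn *= y;
		u_new += 1024 * yn;
	}
	u_new += d3;                       /* current, incomplete period: no decay */

	printf("u' = %.0f\n", u_new);      /* ~3026; the article rounds this to 3024 */
	return 0;
}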

4. How load information is recorded

In Linux, the struct sched_avg structure records the load information of a scheduling entity se or of a CFS ready queue cfs_rq. Every scheduling entity and every CFS ready queue embeds a struct sched_avg for this purpose. struct sched_avg is defined as follows.

struct sched_avg {
	u64						last_update_time;
	u64						load_sum;
	u64						runnable_load_sum;
	u32						util_sum;
	u32						period_contrib;
	unsigned long			load_avg;
	unsigned long			runnable_load_avg;
	unsigned long			util_avg;
};
  1. last_update_time: the time of the last load update, used to compute the time delta.
  2. load_sum: the accumulated load contribution based on runnable time. Runnable time has two parts: time spent in the rq waiting to be scheduled, and time spent actually running on the CPU.
  3. util_sum: the accumulated load contribution based on running time, i.e. the time the scheduling entity is actually executing on the CPU.
  4. load_avg: the average load contribution based on runnable time.
  5. util_avg: the average load contribution based on running time.

A scheduling entity se may belong to a task or to a group (Linux supports group scheduling when CONFIG_FAIR_GROUP_SCHED is configured), so the initialization differs for a task se and a group se. A scheduling entity is described by struct sched_entity as follows.

struct sched_entity {
	struct load_weight		load;
	unsigned long			runnable_weight;
#ifdef CONFIG_SMP
	struct sched_avg		avg;
#endif
};

The scheduling entity se initialization function is init_entity_runnable_average(), the code is as follows.

void init_entity_runnable_average(struct sched_entity *se)
{
	struct sched_avg *sa = &se->avg;
 
	memset(sa, 0, sizeof(*sa));
 
	/*
	 * Tasks are initialized with full load to be seen as heavy tasks until
	 * they get a chance to stabilize to their real load level.
	 * Group entities are initialized with zero load to reflect the fact that
	 * nothing has been attached to the task group yet.
	 */
	if (entity_is_task(se))
		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
 
	se->runnable_weight = se->load.weight;
 
	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}

For a task se, runnable_load_avg and load_avg are initialized to the weight of the se (se->load.weight). From the comment we can also tell that in later load calculations the maximum value that runnable_load_avg and load_avg accumulate to is essentially the weight of the se. This means their values indirectly indicate how heavily the task loads the CPU. The runnable_weight member was introduced mainly for group se; for a task se, runnable_weight is simply the weight of the se and the two values are always identical.
 
For a group se, runnable_load_avg and load_avg are initialized to 0. This also means that there is currently no task in the task group that needs to be scheduled. runnable_weight is initialized to the weight of the se here as well, but its value is continually updated in later code; for a group se, runnable_weight is the part of the entity weight that represents the runnable portion of the group run queue.

5. Load calculation code implementation

After understanding the above information, you can start to study the source code implementation of the formula for calculating load contribution in the previous section.

                           p-1
 u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
                           n=1
 
    = u y^p +								(Step 1)
 
                     p-1
      d1 y^p + 1024 \Sum y^n + d3 y^0		(Step 2)
                     n=1

The above formula is implemented in two parts in the code. The accumulate_sum() function calculates the step1 part, and then calls the __accumulate_pelt_segments() function to calculate the step2 part.

static __always_inline u32
accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
	       unsigned long load, unsigned long runnable, int running)
{
	unsigned long scale_freq, scale_cpu;
	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
	u64 periods;
 
	scale_freq = arch_scale_freq_capacity(cpu);
	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
	delta += sa->period_contrib;                                 /* 1 */
	periods = delta / 1024; /* A period is 1024us (~1ms) */      /* 2 */
 
	/*
	 * Step 1: decay old *_sum if we crossed period boundaries.
	 */
	if (periods) {
		sa->load_sum = decay_load(sa->load_sum, periods);        /* 3 */
		sa->runnable_load_sum = decay_load(sa->runnable_load_sum, periods);
		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
		/*
		 * Step 2
		 */
		delta %= 1024;
		contrib = __accumulate_pelt_segments(periods,            /* 4 */
				1024 - sa->period_contrib, delta);
	}
	sa->period_contrib = delta;                                  /* 5 */
 
	contrib = cap_scale(contrib, scale_freq);
	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_load_sum += runnable * contrib;
	if (running)
		sa->util_sum += contrib * scale_cpu;
 
	return periods;
}
  1. period_contrib records the time, less than one 1024us period, left over from the previous load update. delta is the elapsed time; to compute the number of whole periods crossed, period_contrib is added before dividing by 1024.
  2. Compute the number of periods.
  3. decay_load() computes the step 1 part of the formula.
  4. __accumulate_pelt_segments() computes the step 2 part of the formula.
  5. Update period_contrib to the part of this update that falls short of 1024us.

The __accumulate_pelt_segments() function is analyzed below.

static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
{
	u32 c1, c2, c3 = d3; /* y^0 == 1 */
 
	/*
	 * c1 = d1 y^p
	 */
	c1 = decay_load((u64)d1, periods);
 
	/*
	 *            p-1
	 * c2 = 1024 \Sum y^n
	 *            n=1
	 *
	 *              inf        inf
	 * = 1024 ( \Sum y^n - \Sum y^n - y^0 )
	 *              n=0        n=p
	 */
	c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
 
	return c1 + c2 + c3;
}

The main focus of the __accumulate_pelt_segments() function should be how this c2 is calculated. What was originally a polynomial summation was very cleverly transformed into a very simple calculation method. This conversion process is as follows.

                       p-1
            c2 = 1024 \Sum y^n
                       n=1
    
    In terms of our maximum value:
    
                        inf               inf        p-1
            max = 1024 \Sum y^n = 1024 ( \Sum y^n + \Sum y^n + y^0 )
                        n=0               n=p        n=1
    
    Further note that:
    
               inf              inf            inf
            ( \Sum y^n ) y^p = \Sum y^(n+p) = \Sum y^n
               n=0              n=0            n=p
    
    Combined that gives us:
    
                       p-1
            c2 = 1024 \Sum y^n
                       n=1
    
                         inf        inf
               = 1024 ( \Sum y^n - \Sum y^n - y^0 )
                         n=0        n=p
    
               = max - (max y^p) - 1024

LOAD_AVG_MAX is the maximum value of 1024 * (1 + y + y^2 + ... + y^n). The calculation is simple: apply the geometric series sum formula and let n tend to infinity. The final value of LOAD_AVG_MAX is 47742. The value obtained purely mathematically may differ slightly from this and is not exactly equal; that is because 47742 is computed by code, and the computation involves floating-point arithmetic and rounding, so a small error is normal. The code that computes LOAD_AVG_MAX is as follows.

/* Adapted from Documentation/scheduler/sched-pelt.c; y is the global decay
 * factor defined there, with y^32 = 0.5 (y ≈ 0.97857206). */
void calc_converged_max(void)
{
	int n = -1;
	long max = 1024;
	long last = 0, y_inv = ((1UL << 32) - 1) * y;
 
	for (; ; n++) {
		if (n > -1)
			max = ((max * y_inv) >> 32) + 1024;
			/*
			 * This is the same as:
			 * max = max*y + 1024;
			 */
		if (last == max)
			break;
		last = max;
	}
	printf("#define LOAD_AVG_MAX %ld\n", max);
}

6. Scheduling entity updates load contribution

The function to update the load of the scheduling entity is update_load_avg(). This function will be called in the following situations.

  • Adding a process to the ready queue is the enqueue_entity operation in CFS.
  • Deleting a process from the ready queue is the dequeue_entity operation in CFS.
  • scheduler tick, called periodically to update load information.
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	u64 now = cfs_rq_clock_task(cfs_rq);
	struct rq *rq = rq_of(cfs_rq);
	int cpu = cpu_of(rq);
	int decayed;
 
	/*
	 * Track task load average for carrying it to new CPU after migrated, and
	 * track group sched_entity load average for task_h_load calc in migration
	 */
	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
		__update_load_avg_se(now, cpu, cfs_rq, se);                  /* 1 */
 
	decayed  = update_cfs_rq_load_avg(now, cfs_rq);                  /* 2 */
	/* ...... */
}
  1. __update_load_avg_se() is responsible for updating the load information of the scheduling entity se.
  2. After updating the se load, update the load information of the CFS ready queue that se is attached to. The load of a ready queue is the sum of the loads of all scheduling entities on it.

The __update_load_avg_se() code is as follows.

static int
__update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	if (entity_is_task(se))
		se->runnable_weight = se->load.weight;                            /* 1 */
 
	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,  /* 2 */
				cfs_rq->curr == se)) {
 
		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));     /* 3 */
		cfs_se_util_change(&se->avg);
		return 1;
	}
 
	return 0;
}
  1. runnable_weight is the runnable weight; the concept was introduced mainly for group se. For a task se, runnable_weight is simply equal to the process weight; for a group se, runnable_weight is always less than or equal to the weight.
  2. ___update_load_sum() computes the accumulated load sums of the scheduling entity se.
  3. Update the average load information, e.g. the se->avg.load_avg member.
 
The ___update_load_sum() function is implemented as follows.

static __always_inline int
___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
		  unsigned long load, unsigned long runnable, int running)
{
	u64 delta;
 
	delta = now - sa->last_update_time;
	delta >>= 10;                                                       /* 1 */
	if (!delta)
		return 0;
 
	sa->last_update_time += delta << 10;                                /* 2 */
 
	if (!load)
		runnable = running = 0;
 
	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))       /* 3 */
		return 0;
 
	return 1;
}
  1. delta is the time between two load updates, in ns. Dividing by 1024 converts ns to us; the smallest time unit in the PELT algorithm is 1us, so if the difference is less than 1us there is nothing to decay and we simply return.
  2. Update last_update_time so that the next update can compute its time delta.
  3. accumulate_sum() does the load calculation. From the call site above, the load, runnable and running arguments here are either 0 or 1. Therefore, from the load calculation we can see that the maximum value of se->load_sum and se->runnable_load_sum is LOAD_AVG_MAX - 1024 + se->period_contrib, and that se->load_sum and se->runnable_load_sum are equal.
 
Let's continue and look at how the average load is updated. The ___update_load_avg() function is as follows.

static __always_inline void
___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
{
	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
	/*
	 * Step 2: update *_avg.
	 */
	sa->load_avg = div_u64(load * sa->load_sum, divider);
	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
	sa->util_avg = sa->util_sum / divider;
}

As can be seen from the code above, load is the weight of the scheduling entity se and runnable is its runnable_weight. The formula for the average load is therefore as follows. For a task se, the values of se->load_avg and se->runnable_load_avg are equal (because se->load_sum equals se->runnable_load_sum and se->load.weight equals se->runnable_weight), and their value is less than or equal to se->load.weight.


 

                               se->load_sum
se->load_avg = -------------------------------------------- * se->load.weight
                 LOAD_AVG_MAX - 1024 + sa->period_contrib
 
                                  se->runnable_load_sum
se->runnable_load_avg = -------------------------------------------- * se->runnable_weight
                          LOAD_AVG_MAX - 1024 + sa->period_contrib


 

For a process that runs frequently, load_avg gets closer and closer to its weight. For example, if a process with weight 1024 keeps running for a long time, its load contribution curve looks as follows: the upper part of the figure shows the running time of the process and the lower part shows its load contribution curve.

From the moment a process starts running, the load contribution starts to increase. Now if it is a process that runs periodically (running 1ms each time and sleeping 9ms), what about the load contribution curve?

The load contribution then stays between two peaks, a minimum and a maximum, which matches our expectation: we regard the load contribution as reflecting how frequently the process runs. Based on the PELT algorithm we can therefore estimate much more accurately what impact migrating a process to another CPU will have during load balancing. The toy simulation below sketches this behaviour.
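
This toy simulation (not kernel code) models a task that runs roughly 1ms out of every 10ms as 1 contributing period out of every 10. It only approximates the real per-1024us accounting, but it shows load_avg settling into an oscillation around roughly 10% of the weight.

#include <stdio.h>

/* Toy simulation (not kernel code) of a periodic 1ms-run / 9ms-sleep task. */
int main(void)
{
	const double y = 0.97857206;        /* y^32 = 0.5              */
	const double LOAD_AVG_MAX = 47742;  /* roughly 1024 / (1 - y)  */
	double sum = 0;
	int period;

	for (period = 0; period < 1000; period++) {
		int running = (period % 10) == 0;
		sum = sum * y + (running ? 1024 : 0);
		if (period >= 990)              /* steady state: print one full cycle */
			printf("period %3d: load_avg ~ %4.0f\n",
			       period, 1024 * sum / LOAD_AVG_MAX);
	}
	return 0;
}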

7. Ready queue updates load information

As mentioned earlier, the function that updates the ready queue load information is update_cfs_rq_load_avg().

static inline int
update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
	int decayed = 0;

	decayed |= __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);

	return decayed;
}


 

Continue to call __update_load_avg_cfs_rq() to update the CFS ready queue load information. This function is very similar to the above update scheduling entity se load information function.

static int
__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
{
	if (___update_load_sum(now, cpu, &cfs_rq->avg,
				scale_load_down(cfs_rq->load.weight),
				scale_load_down(cfs_rq->runnable_weight),
				cfs_rq->curr != NULL)) {
 
		___update_load_avg(&cfs_rq->avg, 1, 1);
		return 1;
	}
 
	return 0;
}

The struct cfs_rq structure embeds a struct sched_avg used to track the ready queue's load information. The ___update_load_sum() function was analyzed above; the difference from updating a scheduling entity's load lies in the arguments passed: load and runnable are the weight and runnable weight of the CFS ready queue respectively. The weight of the CFS ready queue is the sum of the weights of all ready scheduling entities on it, and the average load contribution of the CFS ready queue is the sum of the average loads of all its scheduling entities. Whenever a scheduling entity's load is updated, the load of the CFS ready queue it is attached to is updated as well.

8. The difference between runnable_load_avg and load_avg

When introducing the struct sched_avg structure, we only introduced the load_avg member and ignored the runnable_load_avg member. So what exactly is the difference between them? We know that the struct sched_avg structure will be embedded in the scheduling entity struct sched_entity and the ready queue struct cfs_rq, which are used to track the load information of the scheduling entity and the ready queue respectively. For task se, there is no difference between the values ​​of runnable_load_avg and load_avg. But for ready queue loads, the two have different meanings. load_avg represents the average load of the ready queue, which includes the load contribution of sleeping processes. runnable_load_avg only contains the load contribution of all runnable processes on the ready queue. How to reflect the difference? Let's take a look at the processing of adding a process to the ready queue. It is the familiar enqueue_entity() function again.

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * When enqueuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Add its load to cfs_rq->runnable_avg
	 *   - For group_entity, update its weight to reflect the new share of
	 *     its group cfs_rq
	 *   - Add its new weight to cfs_rq->load.weight
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);   /* 1 */
    enqueue_runnable_load_avg(cfs_rq, se);                /* 2 */
}
  1. The load_avg member updates the information and passes the flag including DO_ATTACH. This flag will be used when the update_load_avg() function is called for the first time when the process is created.
  2. Update runnable_load_avg information.

The update_load_avg() function we are familiar with is as follows.

static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	u64 now = cfs_rq_clock_task(cfs_rq);
	struct rq *rq = rq_of(cfs_rq);
	int cpu = cpu_of(rq);
	int decayed;
 
	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
		/*
		 * DO_ATTACH means we're here from enqueue_entity().
		 * !last_update_time means we've passed through
		 * migrate_task_rq_fair() indicating we migrated.
		 *
		 * IOW we're enqueueing a task on a new CPU.
		 */
		attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);  /* 1 */
		update_tg_load_avg(cfs_rq, 0);
	} else if (decayed && (flags & UPDATE_TG))
		update_tg_load_avg(cfs_rq, 0);
}
  1. After the process is created for the first time, the value of se->avg.last_update_time is 0. Therefore, the attach_entity_load_avg() function will be called this time.

The attach_entity_load_avg() function is as follows.

static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
 
	se->avg.last_update_time = cfs_rq->avg.last_update_time;
	se->avg.period_contrib = cfs_rq->avg.period_contrib;
	se->avg.util_sum = se->avg.util_avg * divider;
 
	se->avg.load_sum = divider;
	if (se_weight(se)) {
		se->avg.load_sum =
			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
	}
 
	se->avg.runnable_load_sum = se->avg.load_sum;
	enqueue_load_avg(cfs_rq, se);
	cfs_rq->avg.util_avg += se->avg.util_avg;
	cfs_rq->avg.util_sum += se->avg.util_sum;
	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
	cfs_rq_util_change(cfs_rq, flags);
}

We can see that a lot of load information is initialized here when the entity is attached to the ready queue. Our focus now is the enqueue_load_avg() function.

The enqueue_load_avg() function is as follows, which clearly and directly accumulates the scheduling entity load information to the load_avg member of the ready queue.

static inline void
enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	cfs_rq->avg.load_avg += se->avg.load_avg;
	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
}

When the process is deleted from the ready queue, the load of se will not be deleted from the load_avg of the ready queue. Therefore, load_avg contains the load information of the runnable state and blocking state of all scheduling entities.

runnable_load_avg only contains load information for runnable processes. Let's take a look at the enqueue_runnable_load_avg() function. It is very clear. Directly accumulate the load information of the scheduling entity into the runnable_load_avg member.

static inline void
enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	cfs_rq->runnable_weight += se->runnable_weight;
 
	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
}

Let's continue to look at the dequeue_entity operation.

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * When dequeuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Substract its load from the cfs_rq->runnable_avg.
	 *   - Substract its previous weight from cfs_rq->load.weight.
	 *   - For group entity, update its weight to reflect the new share
	 *     of its group cfs_rq.
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG);
	dequeue_runnable_load_avg(cfs_rq, se);
	account_entity_dequeue(cfs_rq, se);
}

dequeue_runnable_load_avg() subtracts the load information of the scheduling entity that is about to be dequeued. The dequeue_runnable_load_avg() function is as follows.

static inline void
dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	cfs_rq->runnable_weight -= se->runnable_weight;
 
	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
	sub_positive(&cfs_rq->avg.runnable_load_sum,
		     se_runnable(se) * se->avg.runnable_load_sum);
}

We do not see the load of the scheduling entity being subtracted from the load_avg member; only the runnable_load_avg member changes. Thus enqueueing and dequeueing a scheduling entity respectively adds to and subtracts from runnable_load_avg. So runnable_load_avg contains the load sum of only the runnable-state scheduling entities on the ready queue, while load_avg is the load sum of all processes in both the runnable and blocked states.

5. CFS Scheduler-Bandwidth Control

1. Preface

What is bandwidth control? In short, it controls the amount of CPU time a user group may consume within a given period. If the time consumed in a period exceeds the limit, task scheduling within the user group is restricted until the next period begins. Is it really necessary to cap a process's CPU usage? Suppose a system runs only one process and we cap its CPU usage at 50%: when its usage reaches 50% the process is stopped and the CPU goes idle, which does not seem to make much sense. Sometimes, however, this is exactly what a system administrator wants. If the processes belong to a customer who has only paid for a certain amount of CPU time, or in situations that require strict resource provisioning, it may be necessary to limit the maximum share of CPU time a process (or group of processes) may consume. After all, you only get as much service as you pay for. This part only discusses CPU bandwidth control for SCHED_NORMAL processes.

Note: Code analysis is based on Linux 4.18.0.

2. Design principle

If you want to use CPU bandwidth control, you need to enable the CONFIG_FAIR_GROUP_SCHED and CONFIG_CFS_BANDWIDTH configuration options. This feature limits the maximum CPU bandwidth a group may use. It is configured through two variables, quota and period: period is a time interval, and quota is the CPU time the group is allowed to consume within that period. When the total running time of a group's processes exceeds the quota, the group is prevented from running; this action is called throttle. At the beginning of the next period the group is scheduled again; this is called unthrottle.

On a multi-core system a user group is described by a task_group, which contains one scheduling entity per CPU and the group cfs_rq corresponding to each scheduling entity. How do we restrict the processes of a user group? We can simply remove the group's scheduling entity from the ready queue it is attached to and mark a flag in the corresponding group cfs_rq. The values of quota and period are stored in the cfs_bandwidth structure, which is embedded in the task_group. cfs_bandwidth also has a runtime member that records the remaining quota; whenever a process of the group runs for some time, runtime is decreased accordingly. The system starts a high-resolution timer whose period is period; when it fires, the remaining quota runtime is refilled to quota and the next round of accounting begins. The running times of all of the group's processes are added up, ensuring the total stays below the quota. Each user group manages one group cfs_rq per CPU, and each group cfs_rq in turn has its own slice of time which it requests from the group's global quota. For example, with period = 100ms, quota = 50ms and a 2-CPU system: the group cfs_rq on CPU0 first requests 5ms from the global quota (leaving runtime = 45ms) and runs its processes; when those 5ms are used up it requests another 5ms (runtime = 40ms), and so on. CPU1 works the same way, first requesting a time slice from the quota for its ready queue cfs_rq and then consuming it by running processes. When the remaining global quota can no longer satisfy a request from CPU0 or CPU1, the requesting cfs_rq must be throttled. When the timer fires, all throttled cfs_rq are unthrottled.

To sum up, cfs_bandwidth acts like a global time pool (a time pool manages time the way a memory pool manages memory). Whenever a group cfs_rq wants to schedule the entities on its red-black tree, it must first request a fixed time slice from the global time pool and let its processes consume it; when the slice is used up, it requests another one. At some point there is no time left in the pool to hand out, and that is the moment to throttle the cfs_rq. A toy model of this time pool is sketched below.
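
The following toy model (not kernel code, only an illustration of the design) captures the idea: a global pool refilled every period, per-CPU cfs_rq pulling 5ms slices, and throttling once the pool runs dry. All names here are made up for the illustration.

#include <stdio.h>

/* Toy model (not kernel code) of the global time pool. */
struct toy_pool   { long quota, runtime; };                  /* quota per period, time left  */
struct toy_cfs_rq { long runtime_remaining; int throttled; };

#define TOY_SLICE 5   /* ms, slice requested from the pool */

static void toy_period_timer(struct toy_pool *pool, struct toy_cfs_rq *rq, int nr)
{
	int i;
	pool->runtime = pool->quota;          /* refill the pool ...          */
	for (i = 0; i < nr; i++)
		rq[i].throttled = 0;          /* ... and unthrottle everyone  */
}

static void toy_assign_runtime(struct toy_pool *pool, struct toy_cfs_rq *rq)
{
	long amount = pool->runtime < TOY_SLICE ? pool->runtime : TOY_SLICE;

	pool->runtime -= amount;
	rq->runtime_remaining += amount;
	if (rq->runtime_remaining <= 0)
		rq->throttled = 1;            /* pool exhausted: throttle this cfs_rq */
}

int main(void)
{
	struct toy_pool pool = { .quota = 50, .runtime = 0 };   /* period = 100ms, quota = 50ms */
	struct toy_cfs_rq rq[2] = { { 0, 0 }, { 0, 0 } };       /* one group cfs_rq per CPU     */
	int i;

	toy_period_timer(&pool, rq, 2);
	for (i = 0; i < 12; i++) {                              /* the CPUs keep asking for slices */
		toy_assign_runtime(&pool, &rq[i % 2]);
		rq[i % 2].runtime_remaining = 0;                /* pretend the slice is consumed   */
	}
	printf("pool left: %ldms, cpu0 throttled: %d, cpu1 throttled: %d\n",
	       pool.runtime, rq[0].throttled, rq[1].throttled);
	return 0;
}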

3. Data structure

Each task_group contains the cfs_bandwidth structure, which mainly records and manages the time information of the time pool.

struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
	ktime_t				period;             /* 1 */
	u64					quota;              /* 2 */
	u64					runtime;            /* 3 */
 
	struct hrtimer		period_timer;       /* 4 */
	struct list_head	throttled_cfs_rq;   /* 5 */
    /* ... */
#endif
};
  1. The set timer cycle time.
  2. Limit time.
  3. The remaining runnable time is updated to quota in each timer callback function.
  4. The high-precision timer mentioned above.
  5. All throttled cfs_rqs are hung into this linked list, and the linked list is used to perform the unthrottle cfs_rq operation in the timer callback function.

The CFS ready queue is described using the cfs_rq structure, and the bandwidth-related members are as follows:

struct cfs_rq {
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct rq *rq;	              /* 1 */
	struct task_group *tg;        /* 2 */
#ifdef CONFIG_CFS_BANDWIDTH
	int runtime_enabled;          /* 3 */
	u64 runtime_expires;
	s64 runtime_remaining;        /* 4 */
 
	u64 throttled_clock, throttled_clock_task;     /* 5 */
	u64 throttled_clock_task_time;
	int throttled, throttle_count;                 /* 6 */
	struct list_head throttled_list;               /* 7 */
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
  1. The cpu runqueue that cfs_rq is attached to, each CPU has only one rq run queue.
  2. The task_group to which cfs_rq belongs.
  3. Whether the bandwidth limit has been turned on for the ready queue. The default bandwidth limit is turned off. If the bandwidth limit is enabled, the value of runtime_enabled is 1.
  4. The remaining time of the time slice applied by cfs_rq from the global time pool. When the remaining time is less than or equal to 0, you need to apply for the time slice again.
  5. When cfs_rq is throttled, it is convenient to count the time of being throttled, and the time when the throttle starts needs to be recorded.
  6. throttled: when the cfs_rq is throttled, throttled is set to 1; when it is unthrottled, it is set back to 0. throttle_count: since task_groups can be nested, when the cfs_rq of a parent task_group is throttled, the throttle_count of the cfs_rq belonging to each of its child task_groups is incremented.
  7. The throttled cfs_rq is hung into the cfs_bandwidth->throttled_cfs_rq linked list.

4. Bandwidth contribution

In periodic scheduling, the update_curr() function is called to update the virtual time of the currently running process. The bandwidth contribution of the process is also accumulated at this time. Subtract the running time of the process from the available time of cfs_rq to which the process is attached. If the time is not enough, apply for a certain time slice from the global time pool. Call the account_cfs_rq_runtime() function in the update_curr() function to count the remaining runtime of cfs_rq.

static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
	if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
		return;
 
	__account_cfs_rq_runtime(cfs_rq, delta_exec);
}

If the CFS bandwidth control function is enabled, cfs_bandwidth_used() returns 1, and the cfs_rq->runtime_enabled value is 1. The __account_cfs_rq_runtime() function is as follows:

static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
	/* dock delta_exec before expiring quota (as it could span periods) */
	cfs_rq->runtime_remaining -= delta_exec;           /* 1 */
	expire_cfs_rq_runtime(cfs_rq);
 
	if (likely(cfs_rq->runtime_remaining > 0))         /* 2 */
		return;
 
	/*
	 * if we're unable to extend our runtime we resched so that the active
	 * hierarchy can be throttled
	 */
	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))         /* 3 */
		resched_curr(rq_of(cfs_rq));                   /* 4 */
}
  1. The process has been running for delta_exec time, so the remaining runnable time of cfs_rq is reduced.
  2. If there is still remaining running time for cfs_rq, there is no need to apply for a time slice from the global time pool.
  3. If the runtime of cfs_rq is insufficient, assign_cfs_rq_runtime() is responsible for applying for time slices from the global time pool.
  4. If the global time slice time is not enough, you need to throttle the current cfs_rq. Of course, the TIF_NEED_RESCHED flag is set here. Throttle in the back.

The assign_cfs_rq_runtime() function is as follows:

static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct task_group *tg = cfs_rq->tg;
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
	u64 amount = 0, min_amount, expires;
	int expires_seq;
 
	/* note: this is a positive sum as runtime_remaining <= 0 */
	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;    /* 1 */
 
	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota == RUNTIME_INF)                          /* 2 */
		amount = min_amount;
	else {
		start_cfs_bandwidth(cfs_b);                           /* 3 */
 
		if (cfs_b->runtime > 0) {
			amount = min(cfs_b->runtime, min_amount);
			cfs_b->runtime -= amount;                         /* 4 */
			cfs_b->idle = 0;
		}
	}
	expires_seq = cfs_b->expires_seq;
	expires = cfs_b->runtime_expires;
	raw_spin_unlock(&cfs_b->lock);
 
	cfs_rq->runtime_remaining += amount;                      /* 5 */
	/*
	 * we may have advanced our local expiration to account for allowed
	 * spread between our sched_clock and the one on which runtime was
	 * issued.
	 */
	if (cfs_rq->expires_seq != expires_seq) {
		cfs_rq->expires_seq = expires_seq;
		cfs_rq->runtime_expires = expires;
	}
 
	return cfs_rq->runtime_remaining > 0;                     /* 6 */
}
  1. The default time slice requested from the global time pool is 5ms (where this default comes from is shown in the snippet after this list).
  2. If the cfs_rq does not limit the bandwidth, then the quota value is RUNTIME_INF. Since the bandwidth is not limited, the natural time pool time is inexhaustible, so the time slice application must be successful.
  3. Make sure the timer is on, if off, turn it on. This timer will reset the remaining time available in the global time pool after the scheduled time is reached.
  4. The time slice application is successful, and the remaining available time in the global time pool is updated.
  5. The remaining available time of cfs_rq increases.
  6. If cfs_rq cannot apply for a time slice from the global time pool, then this function returns 0, otherwise it returns 1, which means the application for the time slice is successful and no throttle is required.
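
The 5ms default mentioned in note 1 is a sysctl; in the kernel it looks roughly like this (kernel/sched/fair.c in Linux 4.18; the value can be tuned at runtime through the sched_cfs_bandwidth_slice_us sysctl):

/*
 * Amount of runtime transferred from the global (tg) pool to a local
 * (per-cpu) pool in a single request; default: 5 msec, units: microseconds.
 */
unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;

static inline u64 sched_cfs_bandwidth_slice(void)
{
	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
}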

5. How to throttle cfs_rq

Assume that assign_cfs_rq_runtime() above returns 0, i.e. the request for runtime failed and cfs_rq needs to be throttled. After the function returns, the TIF_NEED_RESCHED flag is set, meaning a schedule is about to happen. The scheduler core selects the next process to run through the pick_next_task() function; the pick_next_task interface of the CFS scheduler is pick_next_task_fair(). Before picking a process the CFS scheduler calls put_prev_task(), and in that path the class interface put_prev_task_fair() is invoked. The function is as follows:

static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
	struct sched_entity *se = &prev->se;
	struct cfs_rq *cfs_rq;
 
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		put_prev_entity(cfs_rq, se);
	}
}

prev points to the process that is about to be switched out. In put_prev_entity() we call check_cfs_rq_runtime() to check whether the value of cfs_rq->runtime_remaining is less than 0; if it is, the cfs_rq needs to be throttled.

static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	if (!cfs_bandwidth_used())
		return false;
 
	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))    /* 1 */
		return false;
 
	if (cfs_rq_throttled(cfs_rq))    /* 2 */
		return true;
 
	throttle_cfs_rq(cfs_rq);         /* 3 */
	return true;
}
  1. Check whether the cfs_rq meets the conditions for being throttled: bandwidth limiting must be enabled for it and its remaining runtime must be no greater than 0.
  2. If the cfs_rq has already been throttled, there is no need to repeat the operation.
  3. throttle_cfs_rq() performs the actual throttle and is the core function of the throttle path.

The throttle_cfs_rq() function is as follows:

static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	struct sched_entity *se;
	long task_delta, dequeue = 1;
	bool empty;
 
	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];              /* 1 */
 
	/* freeze hierarchy runnable averages while throttled */
	rcu_read_lock();
	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);   /* 2 */
	rcu_read_unlock();
 
	task_delta = cfs_rq->h_nr_running;
	for_each_sched_entity(se) {
		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
		/* throttled entity or throttle-on-deactivate */
		if (!se->on_rq)
			break;
 
		if (dequeue)
			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);   /* 3 */
		qcfs_rq->h_nr_running -= task_delta;
 
		if (qcfs_rq->load.weight)                         /* 4 */
			dequeue = 0;
	}
 
	if (!se)
		sub_nr_running(rq, task_delta);
 
	cfs_rq->throttled = 1;                                /* 5 */
	cfs_rq->throttled_clock = rq_clock(rq);
	raw_spin_lock(&cfs_b->lock);
	empty = list_empty(&cfs_b->throttled_cfs_rq);
	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);   /* 6 */
	if (empty)
		start_cfs_bandwidth(cfs_b);
	raw_spin_unlock(&cfs_b->lock);
}
  1. The group se corresponding to the throttled cfs_rq can be removed from the red-black tree of the cfs_rq it is attached to. That way, when pick_next_task() walks down from the root cfs_rq's red-black tree, it will never reach the throttled se, which therefore gets no chance to run.
  2. task_groups can be nested in parent-child relationships. walk_tg_tree_from() traverses every child task_group below cfs_rq->tg and calls tg_throttle_down() on each one. tg_throttle_down() increments cfs_rq->throttle_count.
  3. Remove the se from the red-black tree of the cfs_rq it is attached to.
  4. If the se about to be dequeued is the only runnable entity on qcfs_rq, then the parent se must also be dequeued. If qcfs_rq->load.weight is not 0, qcfs_rq still has other runnable entities besides this se, so the parent se must of course not be dequeued.
  5. Set the throttled flag and record the time at which the throttle started.
  6. Add the throttled cfs_rq to the cfs_b list so that later unthrottle operations can find all throttled cfs_rq.

The tg_throttle_down() function is as follows; it mainly increments cfs_rq->throttle_count:

static int tg_throttle_down(struct task_group *tg, void *data)
{
	struct rq *rq = data;
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
	/* group is entering throttled state, stop time */
	if (!cfs_rq->throttle_count)
		cfs_rq->throttled_clock_task = rq_clock_task(rq);
	cfs_rq->throttle_count++;
 
	return 0;
}

When a cfs_rq is throttled, the data structures relate as follows (the original article includes a diagram here):

Starting from the task_group attached to the throttled cfs_rq, follow its children list to find every descendant task_group, and increment the cfs_rq->throttle_count member of the cfs_rq belonging to this CPU in each of them.
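
For reference, the traversal done by walk_tg_tree_from() can be written in a logically equivalent recursive form (a sketch, assuming the kernel's tg_visitor type; the in-tree implementation in kernel/sched/core.c is iterative):

/* Sketch: visit @from and all of its descendants, calling down() before
 * descending into a task_group's children and up() after they have all
 * been visited. throttle_cfs_rq() passes tg_throttle_down/tg_nop,
 * unthrottle_cfs_rq() passes tg_nop/tg_unthrottle_up. */
static int walk_tg_tree_from_sketch(struct task_group *from,
				    tg_visitor down, tg_visitor up, void *data)
{
	struct task_group *child;
	int ret;
 
	ret = down(from, data);
	if (ret)
		return ret;
 
	list_for_each_entry_rcu(child, &from->children, siblings) {
		ret = walk_tg_tree_from_sketch(child, down, up, data);
		if (ret)
			return ret;
	}
 
	return up(from, data);
}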

How to unthrottle cfs_rq

The unthrottle operation is performed when the periodic timer expires. The function responsible for it is unthrottle_cfs_rq(), which is the inverse of throttle_cfs_rq(). The function is as follows:

void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	struct sched_entity *se;
	int enqueue = 1;
	long task_delta;
 
	se = cfs_rq->tg->se[cpu_of(rq)];             /* 1 */
	cfs_rq->throttled = 0;                       /* 2 */
	update_rq_clock(rq);
	raw_spin_lock(&cfs_b->lock);
	cfs_b->throttled_time += rq_clock(rq) - cfs_rq->throttled_clock;  /* 3 */
	list_del_rcu(&cfs_rq->throttled_list);       /* 4 */
	raw_spin_unlock(&cfs_b->lock);
 
	/* update hierarchical throttle state */
	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);    /* 5 */
 
	if (!cfs_rq->load.weight)                    /* 6 */
		return;
 
	task_delta = cfs_rq->h_nr_running;
	for_each_sched_entity(se) {
		if (se->on_rq)
			enqueue = 0;
 
		cfs_rq = cfs_rq_of(se);
		if (enqueue)
			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);    /* 7 */
		cfs_rq->h_nr_running += task_delta;
 
		if (cfs_rq_throttled(cfs_rq))
			break;
	}
 
	if (!se)
		add_nr_running(rq, task_delta);
 
	/* Determine whether we need to wake up potentially idle CPU: */
	if (rq->curr == rq->idle && rq->cfs.nr_running)
		resched_curr(rq);
}
  1. The unthrottle operation targets the group scheduling entity corresponding to this cfs_rq; only once that entity is back on its parent cfs_rq does it have a chance to run again.
  2. Clear the throttled flag.
  3. throttled_time accumulates the total time the cfs_rq has spent throttled; throttled_clock was recorded in throttle_cfs_rq() when the throttle started.
  4. Remove the cfs_rq from cfs_b's throttled list.
  5. tg_unthrottle_up() is the reverse of tg_throttle_down(): it decrements cfs_rq->throttle_count.
  6. If there is no runnable process on the cfs_rq being unthrottled (cfs_rq->load.weight is 0), there is nothing to enqueue.
  7. Enqueue the scheduling entities. The for loop here mirrors the dequeue loop in throttle_cfs_rq().

The tg_unthrottle_up() function is as follows:

static int tg_unthrottle_up(struct task_group *tg, void *data)
{
	struct rq *rq = data;
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
	cfs_rq->throttle_count--;
	if (!cfs_rq->throttle_count) {
		/* adjust cfs_rq_clock_task() */
		cfs_rq->throttled_clock_task_time += rq_clock_task(rq) -
					     cfs_rq->throttled_clock_task;
	}
 
	return 0;
}

In addition to decrementing cfs_rq->throttle_count, throttled_clock_task_time is also updated. Unlike throttled_time, throttled_clock_task_time also includes the time during which a parent cfs_rq was throttled: even if this cfs_rq itself is not throttled, it cannot run while its parent is. Therefore throttled_clock_task_time accumulates the total time during which cfs_rq->throttle_count is non-zero.
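
The value accumulated in throttled_clock_task_time is later consumed by cfs_rq_clock_task(), so that the task clock of the cfs_rq does not advance while the hierarchy is throttled; in this kernel version the function looks roughly like this:

static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
{
	/* While any ancestor is throttled, the task clock stays frozen at
	 * the moment the throttle started. */
	if (unlikely(cfs_rq->throttle_count))
		return cfs_rq->throttled_clock_task;
 
	/* Otherwise report the rq task clock minus all time spent throttled. */
	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
}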

Update quota periodically

Bandwidth limiting is applied per task_group, and each task_group embeds a cfs_bandwidth structure. The quota is refreshed periodically (once per period) using a high-resolution timer; struct hrtimer period_timer is embedded in the cfs_bandwidth structure for this purpose. The timer is initialized in init_cfs_bandwidth().

void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
	raw_spin_lock_init(&cfs_b->lock);
	cfs_b->runtime = 0;
	cfs_b->quota = RUNTIME_INF;
	cfs_b->period = ns_to_ktime(default_cfs_period());
 
	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
	cfs_b->period_timer.function = sched_cfs_period_timer;
	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	cfs_b->slack_timer.function = sched_cfs_slack_timer;
}

Two hrtimers are initialized: period_timer and slack_timer. The callback of period_timer is sched_cfs_period_timer(), which refreshes the quota and calls distribute_cfs_runtime() to unthrottle throttled cfs_rq. The distribute_cfs_runtime() function is as follows:

static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
		u64 remaining, u64 expires)
{
	struct cfs_rq *cfs_rq;
	u64 runtime;
	u64 starting_runtime = remaining;
 
	rcu_read_lock();
	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,   /* 1 */
				throttled_list) {
		struct rq *rq = rq_of(cfs_rq);
		struct rq_flags rf;
 
		rq_lock(rq, &rf);
		if (!cfs_rq_throttled(cfs_rq))
			goto next;
 
		runtime = -cfs_rq->runtime_remaining + 1;
		if (runtime > remaining)
			runtime = remaining;
		remaining -= runtime;                                /* 2 */
 
		cfs_rq->runtime_remaining += runtime;                /* 3 */
		cfs_rq->runtime_expires = expires;
 
		/* we check whether we're throttled above */
		if (cfs_rq->runtime_remaining > 0)
			unthrottle_cfs_rq(cfs_rq);                       /* 3 */
 
next:
		rq_unlock(rq, &rf);
 
		if (!remaining)
			break;
	}
	rcu_read_unlock();
 
	return starting_runtime - remaining;
}
  1. The loop traverses all throttled cfs_rq; the function parameter remaining is the remaining runtime in the global time pool.
  2. runtime is the amount of time handed from the global pool to this cfs_rq; remaining, the time left in the global pool, is reduced accordingly.
  3. If the time obtained from the global pool makes cfs_rq->runtime_remaining greater than 0, the cfs_rq is unthrottled.
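
Before distribute_cfs_runtime() is called, the period_timer path first refills the global pool. In kernels of this era (the ones that carry expires_seq), the refill looks roughly like the sketch below.

void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
	u64 now;
 
	if (cfs_b->quota == RUNTIME_INF)
		return;
 
	now = sched_clock_cpu(smp_processor_id());
	cfs_b->runtime = cfs_b->quota;			/* refill the global pool */
	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
	cfs_b->expires_seq++;				/* start a new expiration epoch */
}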

What is the other timer, slack_timer, for? Consider a different question first. Suppose a cfs_rq requests a 5ms slice from the global time pool and there is only one process on that cfs_rq; the process goes to sleep after running for 0.5ms. According to the CFS logic, the whole group se corresponding to the cfs_rq is dequeued. Should the remaining 4.5ms be returned to the global time pool? If it is not returned and the process sleeps for a long time, a cfs_rq on another CPU may fail to obtain a 5ms slice (the global pool has only 4ms left) and be throttled, even though 8.5ms is effectively still available. Therefore, in this case part of the time is returned so that it can be used on other CPUs. The call chain for this is dequeue_entity()->return_cfs_rq_runtime()->__return_cfs_rq_runtime().

static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;  /* 1 */
 
	if (slack_runtime <= 0)
		return;
 
	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota != RUNTIME_INF &&
	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
		cfs_b->runtime += slack_runtime;                                 /* 2 */
 
		/* we are under rq->lock, defer unthrottling using a timer */
		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
		    !list_empty(&cfs_b->throttled_cfs_rq))
			start_cfs_slack_bandwidth(cfs_b);                            /* 3 */
	}
	raw_spin_unlock(&cfs_b->lock);
 
	/* even if it's not valid for return we don't want to try again */
	cfs_rq->runtime_remaining -= slack_runtime;                          /* 4 */
}
  1. The value of min_cfs_rq_runtime is 1ms. The cfs_rq keeps at least min_cfs_rq_runtime for itself and returns the rest to the global time pool. Returning everything would also be unwise: a process may run on this cfs_rq again soon, and it would then have to request time from the global pool immediately, which is inefficient.
  2. Return slack_runtime to the global time pool.
  3. There are two conditions for starting the slack_timer (as the comment indicates, a timer is used because rq->lock is currently held).
    • The time left in the global pool is greater than 5ms, so that other cfs_rq have a chance to request a slice (the slice size is 5ms).
    • There is at least one throttled cfs_rq. The slack_timer is then started; its callback tries to hand out time slices and unthrottle cfs_rq.
  4. Reduce the remaining runtime of this cfs_rq by the amount returned.

The callback function of the slack_timer timer is sched_cfs_slack_timer(). sched_cfs_slack_timer() calls do_sched_cfs_slack_timer() to process the main logic.

static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
	u64 expires;
 
	/* confirm we're still not at a refresh boundary */
	raw_spin_lock(&cfs_b->lock);
	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {     /* 1 */
		raw_spin_unlock(&cfs_b->lock);
		return;
	}
 
	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)         /* 2 */
		runtime = cfs_b->runtime;
 
	expires = cfs_b->runtime_expires;
	raw_spin_unlock(&cfs_b->lock);
 
	if (!runtime)
		return;
 
	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);        /* 3 */
 
	raw_spin_lock(&cfs_b->lock);
	if (expires == cfs_b->runtime_expires)
		cfs_b->runtime -= min(runtime, cfs_b->runtime);
	raw_spin_unlock(&cfs_b->lock);
}
  1. Check whether the period_timer is about to expire. If it is, the global time pool will be refreshed shortly anyway, so the throttled cfs_rq can simply be unthrottled by period_timer. If the period_timer will not fire for a while, this function needs to unthrottle cfs_rq now.
  2. The remaining time in the global pool must be greater than one slice (default 5ms), because cfs_rq requests time in units of one slice.
  3. The distribute_cfs_runtime() function has already been analyzed. Given the runtime passed in, it unthrottles as many cfs_rq as the available time allows, on a best-effort basis.

3. How to use it from user space

The interface provided by CFS bandwidth control is presented in the form of cgroupfs. The following three files are provided.

  • cpu.cfs_quota_us: the quota within one period, i.e. the quota discussed in this article
  • cpu.cfs_period_us: the length of one period, i.e. the period discussed in this article
  • cpu.stat: bandwidth limiting status information

By default, cpu.cfs_quota_us = -1 and cpu.cfs_period_us = 100ms. A quota of -1 means no bandwidth limit. To limit bandwidth, write legal values to these two files: the period must lie between 1ms and 1000ms, and the quota must be at least 1ms (it may exceed the period on multi-CPU systems, as in the 200% example below). Hierarchical restrictions also apply, as described later. Writing a negative value to cpu.cfs_quota_us removes the bandwidth limit.

As mentioned above, cfs_rq applies for a fixed size of time slice from the global time pool. The default fixed size is 5ms. Of course, this value can also be changed. The file path is as follows:

/proc/sys/kernel/sched_cfs_bandwidth_slice_us

The cpu.stat file reports the following three pieces of information.

  • nr_periods: the number of periods that have elapsed so far
  • nr_throttled: the number of times the group has been throttled
  • throttled_time: the total time for which scheduling entities of the group have been throttled

User group level restrictions

The cpu.cfs_quota_us and cpu.cfs_period_us interfaces control the bandwidth of a task_group subject to max(c_i) <= C, where C is the bandwidth of the parent task_group and c_i is the bandwidth of each of its child task_groups. No single child task_group may be given more bandwidth than its parent, but the total bandwidth of all children is allowed to exceed the parent's bandwidth, i.e. \Sum c_i >= C is permitted. Therefore, there are two possible reasons for a task_group being throttled:

  1. task_group consumes its own quota within one cycle
  2. The parent task_group consumes its own quota within one cycle.

In the second case, even if the child task_group still has quota left, it must wait until the parent task_group's next period begins.

Usage examples

  1. Setting both quota and period to 250ms gives the task_group 100% bandwidth, i.e. the bandwidth of one CPU, for a total CPU usage of 100%.
echo 250000 > cpu.cfs_quota_us  /* quota = 250ms */
echo 250000 > cpu.cfs_period_us /* period = 250ms */
  2. On a multi-core system, setting quota to 1000ms and period to 500ms gives the task_group 200% bandwidth, i.e. the bandwidth of two CPUs, for a total CPU usage of 200%.
echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
echo 500000 > cpu.cfs_period_us /* period = 500ms */

A larger period time can increase task_group throughput.

  3. Setting quota to 10ms and period to 50ms limits the task_group to 20% of one CPU's bandwidth.
echo 10000 > cpu.cfs_quota_us  /* quota = 10ms */
echo 50000 > cpu.cfs_period_us /* period = 50ms */

When a smaller period is used, the shorter the period, the lower the scheduling latency.

6. CFS Scheduler-Summary

After the previous articles in this series, we already have a certain understanding of the CFS scheduler, so this part serves as a summary and review. Let's revisit the design principles of the CFS scheduler in the form of questions.

1. Why does the CFS scheduler introduce virtual time slice vruntime?

If there were no priority differences between processes, we would not need vruntime at all: every process would be entitled to the same amount of actual running time, and everyone would be treated absolutely fairly. If we were designing such a scheduler, we could simply record the actual running time of each process and, at every scheduling point, pick the process that has run for the shortest time. However, processes do differ in importance: processes the user considers important should get to run longer. Once priorities give different processes different running times, we can no longer pick the next process simply by comparing actual running time, because those times are supposed to differ. So we introduce the concept of virtual time: each process maps the physical time it is allocated according to its priority through a formula into a comparable value, which we call virtual time. We record each process's virtual running time, and when we need to pick the next process to run, we simply choose the one with the smallest virtual time.
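
To make the idea concrete, here is a small user-space toy (not kernel code) that mimics the weighting CFS applies in update_curr()/calc_delta_fair(): vruntime advances by delta_exec * NICE_0_LOAD / weight, so a nice-0 task (weight 1024) accrues virtual time at wall-clock speed while a heavier task accrues it more slowly.

#include <stdio.h>
#include <stdint.h>
 
#define NICE_0_LOAD	1024ULL		/* weight of a nice-0 task */
 
/* Toy version of the CFS weighting: convert real execution time into
 * virtual time according to the task's load weight. */
static uint64_t delta_vruntime(uint64_t delta_exec_ns, uint64_t weight)
{
	return delta_exec_ns * NICE_0_LOAD / weight;
}
 
int main(void)
{
	uint64_t delta_exec = 10 * 1000 * 1000;	/* 10ms of real CPU time */
 
	/* Both tasks run 10ms, but the heavier (higher-priority) task's
	 * vruntime advances only half as fast. */
	printf("weight 1024: vruntime += %llu ns\n",
	       (unsigned long long)delta_vruntime(delta_exec, 1024));
	printf("weight 2048: vruntime += %llu ns\n",
	       (unsigned long long)delta_vruntime(delta_exec, 2048));
	return 0;
}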

2. Is the initial value of vruntime of the newly created process 0?

Consider first what would happen if the vruntime of a process newly created via fork() were 0. By then, the vruntime of every process already on the ready queue is a fairly large value. With a vruntime of 0, the pick_next_task_fair() logic would keep preferring the new process and let it run until it caught up with the vruntime of the other processes on the queue. Since 0 is therefore not acceptable, what is a reasonable initial value? Naturally, something close to the vruntime of the processes already on the ready queue; how this is done is revealed in the next question.

3. What is the role of min_vruntime recorded in the ready queue?

First, we need to understand what min_vruntime records: the minimum virtual time of all processes managed by the ready queue. In theory, the virtual time of every process on the queue is at least min_vruntime (question 5 below discusses the exceptions). What is the point of recording this value? I think it serves three main purposes.

  • When we fork() a new process, its virtual time must not simply be set to 0; otherwise the new process would keep running until it caught up with the other processes on the ready queue, which is an unfair and unreasonable design. Instead, the virtual time assigned to the new process is derived from min_vruntime with an appropriate adjustment, so the newly created process cannot run wild.
  • When a process sleeps for a while and is then woken up, the situation is similar to that of a new process: we again derive an appropriate value from min_vruntime and assign it to the process.
  • For process migration there is a new problem: min_vruntime differs between the ready queues of different CPUs, possibly by a lot. If a process migrates from CPU0 to CPU1, should its vruntime change? It should; otherwise the process would be unfairly penalized or rewarded after the migration. The approach is to subtract CPU0's min_vruntime from the process's vruntime and then add CPU1's min_vruntime (see the sketch after this list).
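
A minimal user-space sketch of that migration adjustment (the real logic lives in the CFS dequeue/enqueue-on-migration paths in kernel/sched/fair.c; the structures below are toys, not the kernel's):

#include <stdint.h>
 
/* Toy structures, for illustration only. */
struct toy_cfs_rq { uint64_t min_vruntime; };
struct toy_se     { uint64_t vruntime; };
 
/* Make vruntime relative when leaving the source queue, then rebase it
 * against the destination queue's min_vruntime when arriving. */
void migrate_vruntime(struct toy_se *se,
		      struct toy_cfs_rq *src, struct toy_cfs_rq *dst)
{
	se->vruntime -= src->min_vruntime;	/* dequeue from CPU0 */
	/* ... the task is moved to the destination CPU ... */
	se->vruntime += dst->min_vruntime;	/* enqueue on CPU1 */
}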

4. How to deal with the vruntime of the awakened process?

After the previous question we already have part of the answer. If the sleep was long, the vruntime is re-derived from min_vruntime. How exactly? We take min_vruntime, subtract a certain amount, and use the result as the woken process's vruntime. Why subtract something? A process that has been sleeping for a long time has not been consuming much CPU time, so giving it some compensation is reasonable. Most interactive applications fall into this category, and this treatment improves their responsiveness. What if the sleep was very short? Then there is no need to touch the process's vruntime at all.
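
For reference, both the new-process case (question 2) and the wakeup compensation described here were handled by place_entity() in kernel/sched/fair.c of this era; roughly:

static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;
 
	/* New task (fork): push it slightly behind min_vruntime so that it
	 * does not immediately preempt everyone else. */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);
 
	/* Wakeup: compensate the sleeper by placing it a bit before
	 * min_vruntime (half a latency period with GENTLE_FAIR_SLEEPERS). */
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;
 
		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;
 
		vruntime -= thresh;
	}
 
	/* Never let an entity gain time by being placed backwards. */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}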

5. Is the vruntime of all processes on the ready queue necessarily greater than min_vruntime?

The answer is no. Although min_vruntime is meant to track the minimum virtual time of all processes on the ready queue, this does not mean that every process's vruntime is greater than min_vruntime. In some cases the opposite holds: for example, when a woken process receives the compensation described above, its vruntime can end up smaller than min_vruntime.

6. Will the awakened process preempt the currently running process?

There are two cases, depending on whether the wakeup preemption feature (WAKEUP_PREEMPTION in sched_feat) is enabled. If it is disabled, there is no preemption to discuss. Now consider the case where it is enabled. Since the woken process receives a reward relative to min_vruntime, its vruntime is quite likely smaller than that of the currently running process. Does that mean it preempts as soon as its vruntime is smaller? Not quite: being smaller is not enough; the difference must also exceed the wakeup granularity, stored in sysctl_sched_wakeup_granularity with a default of 1ms.
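
The check itself was implemented by wakeup_preempt_entity(); roughly (the granularity is converted into vruntime units by wakeup_gran() according to the woken task's weight):

static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
	/* The woken entity's vruntime is not smaller: no preemption. */
	if (vdiff <= 0)
		return -1;
 
	/* Smaller, and by more than the wakeup granularity: preempt. */
	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;
 
	/* Smaller, but within the granularity: do not preempt. */
	return 0;
}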

7. Why is the initial value of min_vruntime so strange?

The ready queue struct cfs_rq is initialized through the init_cfs_rq() function. The function is as follows:

void init_cfs_rq(struct cfs_rq *cfs_rq)
{
	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
}

The initial value is U64_MAX - (1 << 20) + 1, i.e. (u64)(-(1LL << 20)), where U64_MAX is the maximum value of a 64-bit unsigned integer. I had the same question: why is the initial value of min_vruntime not 0, and what is the benefit of such a large number? I did not find an authoritative answer either, so the following is only my guess. The unit of min_vruntime is ns, so after the system has run for about (1 << 20) ns, roughly 1ms, min_vruntime overflows. The reason may therefore be to expose problems caused by min_vruntime overflow as early as possible. If the initial value were 0, it would take several hundred years (roughly 584 years for a nice-0 process, or only about 8 years for a lowest-priority nice 19 process) before an overflow problem could show up.
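
A quick user-space check of the arithmetic (assuming the initialization above):

#include <stdio.h>
#include <stdint.h>
 
int main(void)
{
	uint64_t init = (uint64_t)(-(1LL << 20));	/* the init_cfs_rq() value */
 
	/* Distance from the initial value to the u64 wrap-around point:
	 * 2^20 ns, i.e. roughly 1ms of nice-0 virtual time. */
	printf("initial min_vruntime = %llu\n", (unsigned long long)init);
	printf("ns until overflow    = %llu\n",
	       (unsigned long long)(UINT64_MAX - init + 1));
	return 0;
}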

8. Will the minimum granularity time sysctl_sched_min_granularity be satisfied?

The length of the CFS scheduling period depends on the number of runnable processes. According to __sched_period(), when the number of processes exceeds sched_nr_latency, the scheduling period equals the number of processes multiplied by sysctl_sched_min_granularity; otherwise it is sysctl_sched_latency.

static u64 __sched_period(unsigned long nr_running)
{
	if (unlikely(nr_running > sched_nr_latency))
		return nr_running * sysctl_sched_min_granularity;
	else
		return sysctl_sched_latency;
}

But does this mean a process will run for at least sysctl_sched_min_granularity before being preempted? If all processes have the same priority, yes. However, when processes have different priorities and the number of processes exceeds sched_nr_latency, a process with a high nice value (low weight) is not guaranteed sysctl_sched_min_granularity of running time: this is the special case where the share allocated to that process within one scheduling period is itself less than sysctl_sched_min_granularity.

What if the share allocated to a process within one period is greater than sysctl_sched_min_granularity? In that case the CFS scheduler can guarantee the minimum granularity. Let's look at check_preempt_tick() (an abridged sketch follows the list below).

  1. If the process has run longer than the time allocated to it in this period, reschedule it.
  2. Otherwise, if its running time is still less than the minimum granularity, return without preempting.
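
An abridged sketch of check_preempt_tick() from kernel/sched/fair.c of this era (the buddy handling is omitted and the final vruntime comparison is condensed):

static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;
 
	/* 1: used up the time allocated for this period -> reschedule. */
	ideal_runtime = sched_slice(cfs_rq, curr);
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	if (delta_exec > ideal_runtime) {
		resched_curr(rq_of(cfs_rq));
		return;
	}
 
	/* 2: protect the minimum granularity -> no preemption yet. */
	if (delta_exec < sysctl_sched_min_granularity)
		return;
 
	/* Otherwise compare against the leftmost entity's vruntime. */
	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime;
	if (delta > (s64)ideal_runtime)
		resched_curr(rq_of(cfs_rq));
}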

Original address:

CFS Scheduler (6) - Summary

CFS Scheduler (5) - Bandwidth Control

CFS Scheduler (4) - PELT (Per-Entity Load Tracking)

CFS Scheduler (3) - Group Scheduling

CFS Scheduler (2) - Source Code Analysis

CFS Scheduler (1) - Basic Principles

Origin blog.csdn.net/weixin_47465999/article/details/131955734