Linux CFS Scheduler: Principle and Implementation

The concepts section covers only the CFS scheduler.

Highlights of CFS Scheduler

Basic concepts

In my opinion, Linux 2.6 is a fairly mature version, and the biggest difference between 2.4 and 2.6 is the introduction of the CFS scheduler (merged in 2.6.23).

Because CFS takes priority into account, CPU time (that is, the actual running time, runtime) is allocated by weight. The mapping between weight and nice value (priority) is shown below. The lower the nice value, the higher the priority and the larger the weight; each nice step changes the weight by a factor of roughly 1.25.

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
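To make the mapping concrete, here is a minimal sketch that looks up a weight by nice value and computes how two competing tasks would split the CPU. The table is the one above; the helper names `nice_to_weight` and `cpu_share_percent` are made up for this illustration and are not kernel functions.

```c
#include <assert.h>

/* Same table as above: index 0 is nice -20, index 39 is nice 19. */
static const int prio_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,
    29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,
     3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,
      335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,
       36,    29,    23,    18,    15,
};

/* Hypothetical helper: weight for a nice value in [-20, 19]. */
static int nice_to_weight(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return prio_to_weight[nice + 20];
}

/*
 * Hypothetical helper: CPU share (integer percent, rounded down) of a task
 * with nice value `mine` when it competes with exactly one other task.
 */
static int cpu_share_percent(int mine, int other)
{
    int w = nice_to_weight(mine);

    return 100 * w / (w + nice_to_weight(other));
}
```

With these helpers, `cpu_share_percent(0, 1)` comes out at 55 and `cpu_share_percent(1, 0)` at 44, which is the roughly 10% shift per nice step that the 1.25 ratio is meant to produce.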

The actual running time, runtime, of scheduling entity i within one scheduling period is

$runtime_i = period \times weight_i / \sum_{j=1}^{n} weight_j \quad (i = 1, 2, \dots, n)$

A virtual running time, vruntime, is defined as

$vruntime_i = runtime_i \times (1024 / weight_i)$

Substituting the first formula into the second shows that an entity's vruntime does not depend on its own weight:

$vruntime_i = period \times (1024 / \sum_{j=1}^{n} weight_j)$

Therefore, from a macro perspective, every scheduling entity should accumulate the same vruntime in each scheduling period. This is the ideal state.
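The two formulas can be checked with a few lines of integer arithmetic. This is only a sketch: the function names and the example period are assumptions chosen so the numbers divide evenly, not values taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024

/* runtime_i = period * weight_i / sum of all weights (nanoseconds). */
static uint64_t ideal_runtime_ns(uint64_t period_ns, const int *weights,
                                 int n, int i)
{
    uint64_t total = 0;

    assert(n > 0 && i >= 0 && i < n);
    for (int j = 0; j < n; j++)
        total += (uint64_t)weights[j];
    return period_ns * (uint64_t)weights[i] / total;
}

/* vruntime = runtime * 1024 / weight (nanoseconds). */
static uint64_t runtime_to_vruntime_ns(uint64_t runtime_ns, int weight)
{
    assert(weight > 0);
    return runtime_ns * NICE_0_WEIGHT / (uint64_t)weight;
}
```

For three tasks at nice -5, 0 and 5 (weights 3121, 1024 and 335) and an assumed period of 4,480,000 ns, the runtimes are 3,121,000 ns, 1,024,000 ns and 335,000 ns, while all three vruntime values are 1,024,000 ns.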

When a scheduling entity blocks or sleeps for some reason, it voluntarily gives up its time slice, so its vruntime temporarily stops growing, while the other scheduling entities get the CPU and their vruntime increases. This creates an imbalance, which is unfair, so at the next process switch the process with the smallest vruntime is scheduled.

Why is it "completely fair" when high-priority and low-priority entities are given different amounts of runtime?

Because fairness is defined in terms of vruntime, not runtime. CFS tries to keep the vruntime of every scheduling entity equal; whichever entity has the smallest vruntime is scheduled first.

A higher-priority entity gets a larger runtime and a lower-priority entity gets a smaller one, but their vruntime is the same. In effect, the virtual clock of a lower-priority entity simply ticks faster.
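A toy simulation makes the "faster clock" idea visible. The sketch below is not the kernel's code: it uses a linear scan instead of a red-black tree, fixed-size slices, and made-up names (`toy_task`, `toy_cfs_run`). Each slice goes to the task with the smallest vruntime, and the winner's vruntime advances by the slice scaled by 1024 / weight.

```c
#include <assert.h>
#include <stdint.h>

struct toy_task {
    int weight;         /* taken from prio_to_weight */
    uint64_t runtime;   /* real time on the CPU, ns */
    uint64_t vruntime;  /* virtual time, ns */
};

/*
 * Hand out `ticks` slices of `slice_ns` each. Every slice goes to the task
 * with the smallest vruntime (ties go to the lowest index).
 */
static void toy_cfs_run(struct toy_task *tasks, int n, int ticks,
                        uint64_t slice_ns)
{
    assert(n > 0);
    for (int t = 0; t < ticks; t++) {
        int next = 0;

        for (int i = 1; i < n; i++)
            if (tasks[i].vruntime < tasks[next].vruntime)
                next = i;

        tasks[next].runtime += slice_ns;
        /* A lighter task's virtual clock advances faster per real ns. */
        tasks[next].vruntime += slice_ns * 1024 / (uint64_t)tasks[next].weight;
    }
}
```

Running a nice 0 task (weight 1024) against a nice 5 task (weight 335) for 400 slices of 1 ms gives the first task about three times the runtime of the second, while their vruntime values never drift apart by more than one slice of the lighter task.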

Some questions

Is the initial vruntime of a new process 0?

When the child process is created, its vruntime is first set to min_vruntime.
If the START_DEBIT bit is set in sched_features, the vruntime is then increased somewhat on top of min_vruntime.
After the child's vruntime has been set, the sched_child_runs_first parameter is checked. If it is 1, the vruntime of the parent and the child are compared; if the parent's vruntime is smaller, the two values are swapped. This ensures that the child runs before the parent.
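The three steps above can be sketched as follows. The struct and function names are hypothetical, and the size of the START_DEBIT increase is passed in as a plain parameter (`debit`) because the text does not say how the kernel computes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_entity {
    uint64_t vruntime;
};

/* Hypothetical sketch of placing a freshly forked child on the run queue. */
static void toy_place_child(struct toy_entity *parent,
                            struct toy_entity *child,
                            uint64_t min_vruntime,
                            bool start_debit, uint64_t debit,
                            bool child_runs_first)
{
    assert(parent && child);

    /* Step 1: start from the queue's min_vruntime, not from 0. */
    child->vruntime = min_vruntime;

    /* Step 2: with START_DEBIT the child starts slightly "in debt". */
    if (start_debit)
        child->vruntime += debit;

    /* Step 3: if the child must run first, give it the smaller value. */
    if (child_runs_first && parent->vruntime < child->vruntime) {
        uint64_t tmp = parent->vruntime;

        parent->vruntime = child->vruntime;
        child->vruntime = tmp;
    }
}
```

Starting a child at 0 instead would let it monopolize the CPU until it caught up with long-running tasks, which is why the queue minimum is used as the baseline.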

Does the vruntime of a sleeping process remain unchanged?

No. When a sleeping process is woken up, its vruntime is reset: the new value is based on min_vruntime and includes a certain amount of compensation, but not too much.
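One way to picture "some compensation, but not too much" is a floor slightly below min_vruntime. The sketch below is an assumption about the shape of the rule, not the kernel's exact code: the function name and the `bonus` parameter are made up, and wraparound is ignored for simplicity.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: a waking task may be placed at most `bonus` ns
 * behind min_vruntime, and its vruntime never moves backwards.
 */
static uint64_t toy_wakeup_vruntime(uint64_t old_vruntime,
                                    uint64_t min_vruntime,
                                    uint64_t bonus)
{
    uint64_t floor = min_vruntime > bonus ? min_vruntime - bonus : 0;

    return old_vruntime > floor ? old_vruntime : floor;
}
```

A task that slept for a long time (old vruntime 100, queue minimum 10000, bonus 3000) wakes up at 7000, so it is scheduled soon but cannot starve everyone else. A task that slept only briefly (old vruntime 9500) keeps its own value.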

Can the time slice of a process be arbitrarily small?

No. CFS sets a minimum time that a process may occupy the CPU, sched_min_granularity_ns. A process that has been running on the CPU for less than this time cannot be switched off the CPU.

liuzixuan@10-60-73-159:~$ cat /proc/sys/kernel/sched_min_granularity_ns
1500000
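The rule amounts to a simple guard in the preemption path. The function below is a hypothetical sketch of that guard, using the 1500000 ns value printed above as an example input.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: the running task may only be preempted once it has
 * run for at least the minimum granularity.
 */
static bool toy_may_preempt(uint64_t ran_ns, uint64_t min_granularity_ns)
{
    return ran_ns >= min_granularity_ns;
}
```

With a granularity of 1500000 ns, a task that has run for 1 ms is left alone, while one that has run for 1.5 ms or more becomes eligible for preemption.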

Does vruntime change when a process migrates from one CPU to another?

Yes. When a process leaves the run queue of one CPU, the min_vruntime of that queue is subtracted from its vruntime; when it joins the run queue of another CPU, the min_vruntime of the new queue is added. In this way the process is still treated fairly after migrating from one CPU to another.
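The subtract-then-add step can be sketched like this. `toy_rq` and `toy_migrate_vruntime` are made-up names; the real run queue structure holds far more state.

```c
#include <stdint.h>

struct toy_rq {
    uint64_t min_vruntime;
};

/*
 * Hypothetical sketch: make vruntime relative to the source queue, then
 * absolute again on the destination queue.
 */
static uint64_t toy_migrate_vruntime(uint64_t vruntime,
                                     const struct toy_rq *src,
                                     const struct toy_rq *dst)
{
    vruntime -= src->min_vruntime;  /* now an offset from src's minimum */
    vruntime += dst->min_vruntime;  /* same offset from dst's minimum */
    return vruntime;
}
```

A task that was 200 ns ahead of the minimum on the old CPU (5200 against 5000) is still 200 ns ahead on the new one (90200 against 90000), no matter how far apart the two queues' clocks are.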

What happens if vruntime keeps accumulating and eventually overflows?

The key in the red-black tree is not vruntime itself but vruntime - min_vruntime, where min_vruntime is the smallest vruntime in the tree. Subtracting the minimum keeps the keys of all processes clustered around the smallest vruntime. In other words, only relative sizes are compared.

static inline int less(u32 left, u32 right)
{
	return (less_eq(left, right) && (mod(right) != mod(left)));
}
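The snippet above works on 32-bit values with helpers defined elsewhere. For a 64-bit vruntime the same "compare relative sizes only" idea can be sketched with a signed difference. The function name is an assumption, and the trick relies on the two values never being more than 2^63 apart and on the usual two's-complement conversion to int64_t.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-safe "a comes before b": look at the sign of the difference
 * instead of comparing the raw unsigned values.
 */
static bool vruntime_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}
```

A value just below the wrap point, such as `UINT64_MAX - 5`, is correctly treated as earlier than a value just past it, such as 10, even though a plain `a < b` would say the opposite.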

Source code and supplements

Scheduling classes and scheduling policies

Linux scheduling classes: allocate computing power as needed so that every process in the system is treated as fairly as possible.

fair_sched_class: the CFS completely fair scheduler
idle_sched_class: each processor has one idle thread, that is, thread number 0
rt_sched_class: maintains one queue for each scheduling priority

struct sched_class {
	const struct sched_class *next;

	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
	void (*yield_task) (struct rq *rq);

	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync);

	struct task_struct * (*pick_next_task) (struct rq *rq);
	void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
	int  (*select_task_rq)(struct task_struct *p, int sync);

	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
			struct rq *busiest, unsigned long max_load_move,
			struct sched_domain *sd, enum cpu_idle_type idle,
			int *all_pinned, int *this_best_prio);

	int (*move_one_task) (struct rq *this_rq, int this_cpu,
			      struct rq *busiest, struct sched_domain *sd,
			      enum cpu_idle_type idle);
	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
	int (*needs_post_schedule) (struct rq *this_rq);
	void (*post_schedule) (struct rq *this_rq);
	void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);

	void (*set_cpus_allowed)(struct task_struct *p,
				 const struct cpumask *newmask);

	void (*rq_online)(struct rq *rq);
	void (*rq_offline)(struct rq *rq);
#endif

	void (*set_curr_task) (struct rq *rq);
	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
	void (*task_new) (struct rq *rq, struct task_struct *p);

	void (*switched_from) (struct rq *this_rq, struct task_struct *task,
			       int running);
	void (*switched_to) (struct rq *this_rq, struct task_struct *task,
			     int running);
	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
			     int oldprio, int running);

#ifdef CONFIG_FAIR_GROUP_SCHED
	void (*moved_group) (struct task_struct *p);
#endif
};

static const struct sched_class fair_sched_class;  // fair (CFS) scheduling class
static const struct sched_class idle_sched_class;  // idle scheduling class
static const struct sched_class rt_sched_class;    // real-time scheduling class
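The `next` pointer and the function pointers turn `struct sched_class` into a linked chain of "virtual tables". The sketch below shows how a core scheduler could walk such a chain from the real-time class down to the idle class. Everything prefixed with `toy_` is a simplified stand-in invented for this example, with only one of the many callbacks.

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq;

struct toy_task {
    const char *name;
};

struct toy_sched_class {
    const struct toy_sched_class *next;  /* next lower-priority class */
    struct toy_task *(*pick_next_task)(struct toy_rq *rq);
};

struct toy_rq {
    struct toy_task *rt_task;    /* NULL if no real-time task is runnable */
    struct toy_task *fair_task;  /* NULL if no CFS task is runnable */
    struct toy_task idle_task;   /* always available */
};

static struct toy_task *pick_rt(struct toy_rq *rq)   { return rq->rt_task; }
static struct toy_task *pick_fair(struct toy_rq *rq) { return rq->fair_task; }
static struct toy_task *pick_idle(struct toy_rq *rq) { return &rq->idle_task; }

/* Chain: real-time -> fair -> idle. */
static const struct toy_sched_class toy_idle = { NULL, pick_idle };
static const struct toy_sched_class toy_fair = { &toy_idle, pick_fair };
static const struct toy_sched_class toy_rt   = { &toy_fair, pick_rt };

/* Ask each class in priority order until one of them offers a task. */
static struct toy_task *toy_pick_next_task(struct toy_rq *rq)
{
    assert(rq != NULL);
    for (const struct toy_sched_class *c = &toy_rt; c; c = c->next) {
        struct toy_task *p = c->pick_next_task(rq);

        if (p)
            return p;
    }
    return NULL;  /* not reached: the idle class always returns a task */
}
```

The design keeps the core scheduler ignorant of any particular policy: adding a class means adding one more node to the chain rather than touching the selection loop.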

Linux scheduling policies: decide when and how a new process is selected to run on the CPU.

SCHED_NORMAL: scheduling policy for ordinary processes; the scheduling entity runs under the CFS scheduler
SCHED_FIFO: scheduling policy for real-time processes; first-in, first-out scheduling algorithm
SCHED_RR: scheduling policy for real-time processes; round-robin (time slice rotation) algorithm
SCHED_BATCH: scheduling policy for ordinary processes; batch processing, the scheduling entity runs under the CFS scheduler
SCHED_IDLE: scheduling policy for ordinary processes; the scheduling entity runs under the CFS scheduler with the lowest priority

#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
#define SCHED_IDLE		5

The scheduling policy of a given process can be obtained with the sched_getscheduler() system call.
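A small user-space sketch of that call follows. `sched_getscheduler()` and the `SCHED_*` constants come from `<sched.h>`; the helpers `policy_name` and `print_policy` are made up for this example. Note that user space calls the default policy `SCHED_OTHER`, which corresponds to the kernel's `SCHED_NORMAL`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: map a policy number to a printable name. */
static const char *policy_name(int policy)
{
    assert(policy >= 0);  /* negative values are error returns, not policies */
    switch (policy) {
    case SCHED_OTHER: return "SCHED_NORMAL";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
#ifdef SCHED_BATCH
    case SCHED_BATCH: return "SCHED_BATCH";
#endif
#ifdef SCHED_IDLE
    case SCHED_IDLE:  return "SCHED_IDLE";
#endif
    default:          return "unknown";
    }
}

/* Hypothetical helper: print the policy of `pid` (0 = calling process). */
static int print_policy(pid_t pid)
{
    int policy = sched_getscheduler(pid);

    if (policy < 0) {
        perror("sched_getscheduler");
        return -1;
    }
    printf("pid %ld: %s\n", (long)pid, policy_name(policy));
    return 0;
}
```

Calling `print_policy(0)` from an ordinary shell-launched program would normally report SCHED_NORMAL, unless the process was started under a different policy.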

Priority classification

Priority is generally divided into static priority and dynamic priority.

Static priority: the values 100 to 139 represent the static priority of an ordinary process. It is used to rank this process against the other ordinary processes in the system for scheduling purposes, and it essentially determines the basic time slice of the process.
Dynamic priority: the values 100 to 139 represent the dynamic priority of an ordinary process. This is the number the scheduler actually uses when selecting a new process to run.
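The 100 to 139 range lines up with the 40 nice values: nice -20 maps to 100 and nice 19 maps to 139. A minimal sketch of the conversion, assuming the usual offset of 100 reserved for real-time priorities; the function names are made up for this example.

```c
#include <assert.h>

#define MAX_RT_PRIO 100  /* priorities 0..99 are reserved for real-time */

/* nice in [-20, 19] -> static priority in [100, 139] */
static int nice_to_static_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* static priority in [100, 139] -> nice in [-20, 19] */
static int static_prio_to_nice(int prio)
{
    assert(prio >= MAX_RT_PRIO && prio <= MAX_RT_PRIO + 39);
    return prio - MAX_RT_PRIO - 20;
}
```

A default process with nice 0 therefore has static priority 120, and subtracting 100 from a static priority gives the index into prio_to_weight.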

Balancing run queues (rq) in multiprocessor systems

One principle: no runnable process may appear in two or more run queues at the same time.

Scheduling domain: a set of CPUs whose workload should be kept balanced by the kernel. Scheduling domains are organized hierarchically, similar to a radix tree. Each scheduling domain is in turn divided into one or more groups, and each group is a subset of the CPUs of that scheduling domain. Workload balancing is always done between the groups of a scheduling domain.
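To illustrate the domain and group relationship, here is a deliberately simplified sketch. The `toy_` structures are invented for this example and are much smaller than the kernel's `sched_domain` and `sched_group`. It sums the load of each group and reports which group a balancer would want to pull work from.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CPUS 8

/* A group is a subset of the CPUs of its domain. */
struct toy_group {
    int cpus[TOY_MAX_CPUS];
    int nr_cpus;
};

/* A domain is split into one or more groups. */
struct toy_domain {
    const struct toy_group *groups;
    int nr_groups;
};

/* Sum the load of all CPUs in a group; cpu_load is indexed by CPU number. */
static unsigned long toy_group_load(const struct toy_group *g,
                                    const unsigned long *cpu_load)
{
    unsigned long sum = 0;

    for (int i = 0; i < g->nr_cpus; i++)
        sum += cpu_load[g->cpus[i]];
    return sum;
}

/* Return the index of the most heavily loaded group in the domain. */
static int toy_busiest_group(const struct toy_domain *sd,
                             const unsigned long *cpu_load)
{
    int busiest = 0;

    assert(sd->nr_groups > 0);
    for (int i = 1; i < sd->nr_groups; i++)
        if (toy_group_load(&sd->groups[i], cpu_load) >
            toy_group_load(&sd->groups[busiest], cpu_load))
            busiest = i;
    return busiest;
}
```

With CPUs 0 and 1 in one group, CPUs 2 and 3 in another, and per-CPU loads of 1024, 0, 2048 and 1024, the second group carries 3072 against 1024 and is the one to pull tasks from.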

The sched_domain descriptors of all physical CPUs in the system are stored in the per-CPU variable phys_domains.

static DEFINE_PER_CPU(struct static_sched_domain, phys_domains);

They are initialized in each machine's own directory:

/* sched_domains SD_NODE_INIT for SGI IP27 machines */
#define SD_NODE_INIT (struct sched_domain) {		\
	.parent			= NULL,			\
	.child			= NULL,			\
	.groups			= NULL,			\
	.min_interval		= 8,			\
	.max_interval		= 32,			\
	.busy_factor		= 32,			\
	.imbalance_pct		= 125,			\
	.cache_nice_tries	= 1,			\
	.flags			= SD_LOAD_BALANCE	\
				| SD_BALANCE_EXEC	\
				| SD_WAKE_BALANCE,	\
	.last_balance		= jiffies,		\
	.balance_interval	= 1,			\
	.nr_balance_failed	= 0,			\
}