The Linux CFS Scheduler: Principles and Implementation

Key Points of the CFS Scheduler

Basic Concepts

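In my view, Linux 2.6 is a fairly polished release series, and from a scheduling point of view its biggest difference from 2.4 is CFS (the Completely Fair Scheduler), which was merged in 2.6.23 and replaced the earlier O(1) scheduler.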

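Because tasks have priorities, CFS divides CPU time (the actual run time, runtime) in proportion to each task's weight. The mapping from nice value to weight is shown below: the lower the nice value (that is, the higher the priority), the larger the weight, and each nice step changes the weight by a factor of roughly 1.25.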

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
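To make the table concrete, here is a minimal sketch of looking up a weight by nice value and turning weights into a CPU share. The helper names nice_to_weight and cpu_share_percent are hypothetical and exist only for this illustration; the table is repeated so the snippet compiles on its own.

```c
/* Same table as above, repeated so the sketch is self-contained. */
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/* Hypothetical helper: nice values run from -20 to 19, so the
 * array index is simply nice + 20. */
static int nice_to_weight(int nice)
{
    return prio_to_weight[nice + 20];
}

/* Hypothetical helper: share of the CPU (in percent) that a task of
 * weight w receives when the runnable tasks' weights sum to total. */
static double cpu_share_percent(int w, int total)
{
    return 100.0 * w / total;
}
```

With one nice 0 task (weight 1024) and one nice 1 task (weight 820) competing, the nice 0 task gets about 55% of the CPU and the nice 1 task about 45%, so a single nice step shifts roughly 10% of the CPU.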

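The actual run time of scheduling entity i within one scheduling period is then

$runtime_i = period \times weight_i / \sum_{j=1}^{n} weight_j \quad (i = 1, 2, \ldots, n)$

A virtual run time, vruntime, is defined on top of it by scaling runtime with the ratio of the nice 0 weight (1024) to the entity's own weight:

$vruntime_i = runtime_i \times (1024 / weight_i)$

Substituting the first formula into the second cancels $weight_i$, so the vruntime an entity accumulates per period does not depend on its own weight. Seen from a distance, every scheduling entity should therefore advance its vruntime by the same amount in each scheduling period. This is the ideal case:

$vruntime_i = period \times (1024 / \sum_{j=1}^{n} weight_j)$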
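As a worked check of these formulas, the sketch below computes the slice and the resulting vruntime advance for each entity. The function names and the 20 ms period used in the note afterwards are assumptions made for the example, not values taken from the kernel source quoted here.

```c
/* Weight of a nice 0 task; this is the 1024 in the vruntime formula. */
#define TOY_NICE_0_WEIGHT 1024

/* runtime_i = period * weight_i / sum of all weights */
static double slice_ns(double period_ns, int weight, int total_weight)
{
    return period_ns * weight / total_weight;
}

/* vruntime_i = runtime_i * (1024 / weight_i) */
static double vruntime_delta_ns(double runtime_ns, int weight)
{
    return runtime_ns * TOY_NICE_0_WEIGHT / weight;
}
```

For an assumed 20 ms period shared by a nice 0 task (weight 1024) and a nice 5 task (weight 335), the slices come out at roughly 15.07 ms and 4.93 ms, while both tasks advance their vruntime by the same amount, about 15.07 ms.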

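When a scheduling entity blocks or goes to sleep, it gives up the CPU voluntarily and its vruntime stops advancing, while the entities that run in its place keep accumulating vruntime. The vruntime values drift apart, which is unfair to the entity that slept, so at the next context switch the scheduler picks the entity with the smallest vruntime.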
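The selection rule can be sketched as follows. CFS keeps runnable entities in a red-black tree and takes the leftmost node; this sketch deliberately swaps that for a linear scan over an array so the rule itself stays visible. The toy_entity type and pick_min_vruntime are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for a scheduling entity. */
struct toy_entity {
    const char *name;
    uint64_t vruntime;  /* virtual run time in ns */
    int runnable;       /* 0 while blocked or sleeping */
};

/* Return the runnable entity with the smallest vruntime, or NULL if
 * nothing is runnable. The signed difference keeps the comparison
 * correct even if the 64-bit counter has wrapped around. */
static struct toy_entity *pick_min_vruntime(struct toy_entity *e, size_t n)
{
    struct toy_entity *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!e[i].runnable)
            continue;
        if (best == NULL || (int64_t)(e[i].vruntime - best->vruntime) < 0)
            best = &e[i];
    }
    return best;
}
```

An entity that slept while the others ran comes back with the smallest vruntime and is therefore picked first.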

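Why is the scheduler called completely fair when high-priority and low-priority tasks receive different amounts of runtime?

Because fairness is defined in terms of vruntime, not runtime. CFS tries to keep the vruntime of all scheduling entities equal, and whenever one entity has a smaller vruntime it is scheduled first.

A higher-priority entity gets more runtime and a lower-priority entity gets less, yet their vruntime advances by the same amount. In effect, the virtual clock of a lower-priority entity simply ticks faster for each unit of real CPU time.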

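Some Questions

Is the initial vruntime of a new process 0?

No. When a child process is created, its vruntime is first set to the run queue's min_vruntime.
If the START_DEBIT bit is set in sched_features, the vruntime is pushed somewhat beyond min_vruntime.
After the child's vruntime has been set, the sched_child_runs_first parameter is checked. If it is 1, the vruntime of the parent and the child are compared, and if the parent's is smaller the two values are swapped, which ensures that the child runs before the parent.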
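The three steps above can be sketched as follows. This is a simplified model, not the kernel's code: toy_task and place_child are hypothetical names, and debit_ns is a plain parameter standing in for whatever amount the kernel adds when START_DEBIT is set.

```c
#include <stdint.h>

/* Hypothetical stand-in for a task's scheduling state. */
struct toy_task {
    uint64_t vruntime;
};

static void place_child(struct toy_task *parent, struct toy_task *child,
                        uint64_t min_vruntime, int start_debit,
                        uint64_t debit_ns, int child_runs_first)
{
    /* Step 1: start from the run queue's min_vruntime. */
    child->vruntime = min_vruntime;

    /* Step 2: with START_DEBIT, push the child a bit into the future. */
    if (start_debit)
        child->vruntime += debit_ns;

    /* Step 3: if the child should run first but the parent currently
     * sorts earlier, swap the two values. */
    if (child_runs_first &&
        (int64_t)(parent->vruntime - child->vruntime) < 0) {
        uint64_t tmp = parent->vruntime;
        parent->vruntime = child->vruntime;
        child->vruntime = tmp;
    }
}
```

With a parent at 1000, min_vruntime at 900 and a debit of 200, the child is first placed at 1100; because the parent's 1000 is smaller, the values are swapped and the child ends up at 1000, ahead of the parent at 1100.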

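Does the vruntime of a sleeping process stay unchanged forever?

No. When a sleeping process is woken up, its vruntime is set again: starting from min_vruntime, it is granted some compensation for the time it slept, but the compensation is capped so that a long sleeper cannot monopolize the CPU afterwards.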
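One way to express "compensate, but not too much" is to place the woken task a bounded distance behind min_vruntime while never moving its vruntime backwards. The sketch below follows that idea; place_on_wakeup and the bonus_ns parameter are hypothetical, and the actual size of the bonus in a given kernel is not specified here.

```c
#include <stdint.h>

/* Return the vruntime a task should have after waking up.
 * bonus_ns is the maximum head start a sleeper may receive. */
static uint64_t place_on_wakeup(uint64_t old_vruntime,
                                uint64_t min_vruntime,
                                uint64_t bonus_ns)
{
    uint64_t candidate = min_vruntime - bonus_ns;

    /* A task that slept only briefly may already be ahead of the
     * candidate value; it keeps its own vruntime in that case. */
    if ((int64_t)(old_vruntime - candidate) > 0)
        return old_vruntime;

    /* A long sleeper is pulled up to min_vruntime minus the bonus. */
    return candidate;
}
```

A task that fell far behind (old vruntime 100, min_vruntime 10000, bonus 3000) wakes up at 7000 rather than at 100, so its advantage over the other tasks is limited to the bonus.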

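Can the time slice a process receives be arbitrarily small?

No. CFS defines a minimum time a process may occupy the CPU, sched_min_granularity_ns. A process running on a CPU is not preempted at the scheduler tick until it has run for at least this long, although it can still leave the CPU voluntarily by blocking or yielding.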

liuzixuan@10-60-73-159:~$ cat /proc/sys/kernel/sched_min_granularity_ns
1500000

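Does vruntime change when a process migrates from one CPU to another?

Yes. When a process is removed from one CPU's run queue, that queue's min_vruntime is subtracted from its vruntime; when it is added to another CPU's run queue, that queue's min_vruntime is added. After migration the process's vruntime therefore stays fair relative to the tasks on the new queue.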
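The renormalization described above is just a subtraction followed by an addition. A minimal sketch, with migrate_vruntime as a hypothetical helper name:

```c
#include <stdint.h>

/* Carry a task's vruntime from a source run queue to a destination
 * run queue by preserving its offset from min_vruntime. */
static uint64_t migrate_vruntime(uint64_t vruntime,
                                 uint64_t src_min_vruntime,
                                 uint64_t dst_min_vruntime)
{
    vruntime -= src_min_vruntime;  /* leaving: keep only the offset */
    vruntime += dst_min_vruntime;  /* joining: rebase on the new queue */
    return vruntime;
}
```

A task that was 300 ns ahead of min_vruntime on a busy CPU (5000300 against 5000000) is still 300 ns ahead after moving to a queue whose min_vruntime is 1000, giving 1300. Without the rebasing it would arrive with 5000300 and wait a very long time behind tasks near 1000.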

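vruntime only ever grows. What happens when it overflows?

The key in the red-black tree is not vruntime itself but vruntime - min_vruntime, where min_vruntime tracks the smallest vruntime on the run queue. Subtracting it keeps every key clustered around the current minimum. In other words, only relative order is compared, so a counter that has wrapped around still sorts correctly. The comparison below applies the same relative-order idea to 32-bit values: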

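/*
 * Wraparound-safe "left < right" for 32-bit values.
 * less_eq() and mod() are helpers that are not shown in this excerpt.
 */
static inline int less(u32 left, u32 right)
{
	return (less_eq(left, right) && (mod(right) != mod(left)));
}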
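For 64-bit vruntime values the same effect is usually obtained by casting the unsigned difference to a signed type. The sketch below shows the idiom; vruntime_before is a hypothetical name chosen for this example.

```c
#include <stdint.h>

/* Return 1 if a is "earlier" than b, treating the 64-bit counter as
 * circular. The result is correct as long as the two values are less
 * than 2^63 apart, which holds when keys stay near min_vruntime. */
static int vruntime_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}
```

If a sits just below the 64-bit limit and b is ten units later (and has therefore wrapped to a small number), a plain a < b reports the wrong order, while vruntime_before(a, b) still reports that a comes first.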

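Source Code and Supplementary Notes

Scheduling Classes and Scheduling Policies

Linux scheduling classes: each class implements one scheduling algorithm behind a common set of hooks. The CFS class aims to give every process in the system as much fairness as possible, according to the share of computing power it is entitled to.

fair_sched_class: the CFS completely fair scheduler
idle_sched_class: each processor has one idle thread, that is, thread 0
rt_sched_class: maintains one queue for each real-time scheduling priority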

struct sched_class {
	const struct sched_class *next;

	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
	void (*yield_task) (struct rq *rq);

	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync);

	struct task_struct * (*pick_next_task) (struct rq *rq);
	void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
	int  (*select_task_rq)(struct task_struct *p, int sync);

	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
			struct rq *busiest, unsigned long max_load_move,
			struct sched_domain *sd, enum cpu_idle_type idle,
			int *all_pinned, int *this_best_prio);

	int (*move_one_task) (struct rq *this_rq, int this_cpu,
			      struct rq *busiest, struct sched_domain *sd,
			      enum cpu_idle_type idle);
	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
	int (*needs_post_schedule) (struct rq *this_rq);
	void (*post_schedule) (struct rq *this_rq);
	void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);

	void (*set_cpus_allowed)(struct task_struct *p,
				 const struct cpumask *newmask);

	void (*rq_online)(struct rq *rq);
	void (*rq_offline)(struct rq *rq);
#endif

	void (*set_curr_task) (struct rq *rq);
	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
	void (*task_new) (struct rq *rq, struct task_struct *p);

	void (*switched_from) (struct rq *this_rq, struct task_struct *task,
			       int running);
	void (*switched_to) (struct rq *this_rq, struct task_struct *task,
			     int running);
	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
			     int oldprio, int running);

#ifdef CONFIG_FAIR_GROUP_SCHED
	void (*moved_group) (struct task_struct *p);
#endif
};

static const struct sched_class fair_sched_class;  // fair (CFS) scheduling class
static const struct sched_class idle_sched_class;  // idle scheduling class
static const struct sched_class rt_sched_class;    // real-time scheduling class
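The next pointer in struct sched_class links the classes into a chain, and the core scheduler can walk that chain from the highest-priority class down until one of them has something to run. The following is a toy model of that walk, assuming the order real-time, then fair, then idle; toy_rq, toy_class and pick_class are hypothetical and carry only what the example needs.

```c
#include <stddef.h>

/* Hypothetical run queue: just counts of runnable tasks per class. */
struct toy_rq {
    int nr_rt;
    int nr_fair;
};

/* Hypothetical, reduced scheduling class: a name, the next class in
 * the chain, and a hook that says whether the class has work. */
struct toy_class {
    const char *name;
    const struct toy_class *next;
    int (*has_task)(const struct toy_rq *rq);
};

static int rt_has_task(const struct toy_rq *rq)   { return rq->nr_rt > 0; }
static int fair_has_task(const struct toy_rq *rq) { return rq->nr_fair > 0; }
static int idle_has_task(const struct toy_rq *rq) { (void)rq; return 1; }

/* The chain is built back to front: rt -> fair -> idle -> NULL. */
static const struct toy_class toy_idle = { "idle", NULL, idle_has_task };
static const struct toy_class toy_fair = { "fair", &toy_idle, fair_has_task };
static const struct toy_class toy_rt   = { "rt", &toy_fair, rt_has_task };

/* Walk the chain and return the first class that has a runnable task.
 * The idle class always answers yes, so the walk never comes up empty. */
static const struct toy_class *pick_class(const struct toy_rq *rq)
{
    for (const struct toy_class *c = &toy_rt; c != NULL; c = c->next)
        if (c->has_task(rq))
            return c;
    return NULL;
}
```

With this layout a runnable real-time task always wins over CFS tasks, and the idle thread runs only when both other classes are empty.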

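Linux scheduling policies: decide when, and in what manner, a new process is selected to run on the CPU

SCHED_NORMAL: policy for normal processes; the scheduling entity is run by the CFS scheduler
SCHED_FIFO: policy for real-time processes; first-in, first-out scheduling
SCHED_RR: policy for real-time processes; round-robin scheduling with time slices
SCHED_BATCH: policy for normal processes doing batch work; the scheduling entity is run by the CFS scheduler
SCHED_IDLE: policy for normal processes; the scheduling entity is run by the CFS scheduler at the lowest priority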

#define SCHED_NORMAL		0  
#define SCHED_FIFO		    1
#define SCHED_RR		    2
#define SCHED_BATCH		    3
#define SCHED_IDLE		    5

The scheduling policy of a given process can be obtained with the sched_getscheduler() system call
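A minimal user-space sketch of that call is shown below. Note that user-space headers name policy 0 SCHED_OTHER, while the kernel source calls it SCHED_NORMAL. The helpers policy_name and query_policy are hypothetical wrappers written for this example.

```c
#define _GNU_SOURCE          /* exposes SCHED_BATCH and SCHED_IDLE in glibc */
#include <sched.h>
#include <sys/types.h>

/* Map a policy number to a printable name. */
static const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER (SCHED_NORMAL)";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
#ifdef SCHED_BATCH
    case SCHED_BATCH: return "SCHED_BATCH";
#endif
#ifdef SCHED_IDLE
    case SCHED_IDLE:  return "SCHED_IDLE";
#endif
    default:          return "unknown";
    }
}

/* Return the policy of the given process, or -1 on error (errno is
 * set by the system call). A pid of 0 means the calling process. */
static int query_policy(pid_t pid)
{
    return sched_getscheduler(pid);
}
```

Calling policy_name(query_policy(0)) from an ordinary shell-launched program typically yields the SCHED_OTHER entry, since that is the default policy for normal processes.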

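Priority Categories

Priorities are generally divided into static priority and dynamic priority

Static priority: a value from 100 to 139 for normal processes (120 plus the nice value). It rates how the process should be scheduled relative to other normal processes. Under the older time-slice schedulers it essentially determined the process's base time slice, while under CFS it selects the weight.
Dynamic priority: also a value from 100 to 139 for normal processes. It is the number the scheduler actually looks at when selecting a new process to run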
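The 100 to 139 range follows directly from the nice range of -20 to 19: priorities 0 to 99 are reserved for real-time tasks, and normal tasks are placed after them. A small sketch of the conversion, using hypothetical names prefixed with toy_ or written out in full to avoid clashing with kernel macros:

```c
/* Number of real-time priority levels; normal priorities start here. */
#define TOY_MAX_RT_PRIO 100

/* nice -20..19 maps to static priority 100..139 */
static int nice_to_static_prio(int nice)
{
    return TOY_MAX_RT_PRIO + nice + 20;
}

/* static priority 100..139 maps back to nice -20..19 */
static int static_prio_to_nice(int prio)
{
    return prio - TOY_MAX_RT_PRIO - 20;
}
```

A default task with nice 0 therefore has static priority 120, the most favored normal task (nice -20) has 100, and the least favored (nice 19) has 139.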

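Run-Queue (rq) Balancing on Multiprocessor Systems

One principle: a runnable process never appears in two or more run queues at the same time

Scheduling domain: a set of CPUs whose workload the kernel keeps balanced. Domains are organized hierarchically, much like a radix tree. Each scheduling domain is in turn partitioned into one or more groups, each group represents a subset of the domain's CPUs, and workload balancing is always done between the groups of a scheduling domain

The sched_domain descriptors of all physical CPUs in the system are stored in the per-CPU variable phys_domains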

static DEFINE_PER_CPU(struct static_sched_domain, phys_domains);

Their initializers are defined in the per-machine directories

/* sched_domains SD_NODE_INIT for SGI IP27 machines */
#define SD_NODE_INIT (struct sched_domain) {		\
	.parent			= NULL,			\
	.child			= NULL,			\
	.groups			= NULL,			\
	.min_interval		= 8,			\
	.max_interval		= 32,			\
	.busy_factor		= 32,			\
	.imbalance_pct		= 125,			\
	.cache_nice_tries	= 1,			\
	.flags			= SD_LOAD_BALANCE	\
				| SD_BALANCE_EXEC	\
				| SD_WAKE_BALANCE,	\
	.last_balance		= jiffies,		\
	.balance_interval	= 1,			\
	.nr_balance_failed	= 0,			\
}