Overview
The kernel keeps a unique descriptor for every process in memory and links these descriptors together through a number of structures.
The scheduler's job is to share CPU time among the processes, creating the illusion that they execute in parallel.
This involves two parts:
1. the scheduling policy;
2. the context switch;
Schedulers differ in the scheduling policy they implement. This article focuses on the completely fair scheduler (CFS), which picks the process that has waited longest for the CPU and hands the CPU to it.
A scheduling policy also has to deal with some practical concerns:
1. processes have different priorities;
2. processes must not be switched too often, because each switch itself costs time;
Scheduling is triggered in two ways. One is direct: a process is about to sleep, or gives up the CPU for some other reason, and invokes the scheduler explicitly. The other is periodic: a timer-driven mechanism checks at a fixed frequency whether a switch is necessary.
The scheduler as a whole is divided into several subsystems. Its core is the generic scheduler, which interacts with two other components:
1. the scheduler classes, which actually decide which process runs next; they are implemented in a modular fashion, one per scheduling policy;
2. the context switch, which interacts closely with the CPU;
Implementation
The following walks through the scheduler's code. Since a great deal of code is involved, only the most essential parts are covered.
Entry points
The scheduler implementation rests on two functions: the periodic scheduler function and the main scheduler function.
schedule() is the main scheduler function:
/*
* schedule() is the main scheduler function.
*/
asmlinkage void __sched schedule(void)
It is the direct form of scheduling: in many places in the kernel, code that wants to hand the CPU over to another process simply calls it. One example is __lock_sock() (net/core/sock.c):
static void __lock_sock(struct sock *sk)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
					TASK_UNINTERRUPTIBLE);
		spin_unlock_bh(&sk->sk_lock.slock);
		schedule();
		spin_lock_bh(&sk->sk_lock.slock);
		if (!sock_owned_by_user(sk))
			break;
	}
	finish_wait(&sk->sk_lock.wq, &wait);
}
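Internally, schedule() retires the previously running process and asks the scheduler classes for the next one. The following is a heavily condensed sketch of its core, simplified from the kernel/sched.c of the kernel generation discussed here; error paths, statistics, signal handling and SMP balancing are omitted, so treat it as an outline rather than the full function:

asmlinkage void __sched schedule(void)
{
	struct task_struct *prev, *next;
	struct rq *rq;

	preempt_disable();
	rq = this_rq();                  /* run queue of the current CPU */
	prev = rq->curr;

	spin_lock_irq(&rq->lock);

	/* A process that is going to sleep leaves the run queue. */
	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE))
		deactivate_task(rq, prev, 1);

	prev->sched_class->put_prev_task(rq, prev);
	next = pick_next_task(rq, prev); /* ask the scheduler classes */

	if (likely(prev != next)) {
		rq->curr = next;
		context_switch(rq, prev, next); /* unlocks the run queue */
	} else
		spin_unlock_irq(&rq->lock);

	preempt_enable_no_resched();
}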
scheduler_tick() is the periodic scheduler function:
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
* It also gets called by the fork code, when changing the parent's
* timeslices.
*/
void scheduler_tick(void)
As the comment above indicates, it is the periodic form of scheduling: the timer code calls it with HZ frequency, via update_process_times():
/*
* Called from the timer interrupt handler to charge one tick to the current
* process. user_tick is 1 if the tick is user time, 0 for system.
*/
void update_process_times(int user_tick)
{
	struct task_struct *p = current;
	int cpu = smp_processor_id();

	/* Note: this timer irq context must be accounted for as well. */
	account_process_tick(p, user_tick);
	run_local_timers();
	if (rcu_pending(cpu))
		rcu_check_callbacks(cpu, user_tick);
	scheduler_tick();
	run_posix_cpu_timers(p);
}
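scheduler_tick() itself mainly updates the run-queue clock and then delegates the periodic work to the scheduler class of the currently running process. A condensed sketch (simplified; the tick-timestamp bookkeeping and SMP load balancing are left out):

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;

	spin_lock(&rq->lock);
	__update_rq_clock(rq);            /* advance the per-runqueue clock */
	update_cpu_load(rq);
	if (curr != rq->idle)             /* let the class do the real work */
		curr->sched_class->task_tick(rq, curr);
	spin_unlock(&rq->lock);
}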
Related structures
Any discussion of the scheduler implementation starts with the scheduler-related members of the process descriptor, struct task_struct:
	int prio, static_prio, normal_prio;
	struct list_head run_list;
	const struct sched_class *sched_class;
	struct sched_entity se;
	unsigned int policy;
	cpumask_t cpus_allowed;
	unsigned int time_slice;
	unsigned int rt_priority;
prio, static_prio and normal_prio are the process's priorities. static_prio is the static priority assigned when the process starts; normal_prio is computed from the static priority and the scheduling policy; prio and normal_prio are dynamic priorities, and prio is the one the scheduler actually considers;
rt_priority is the priority of a real-time process;
sched_class is the scheduler class the process belongs to;
se is the schedulable entity. The scheduler is not limited to scheduling single processes; it can also schedule groups of processes, each of which forms one schedulable entity. In fact the scheduler always operates on schedulable entities, and since a sched_entity is embedded in the process descriptor, every process is itself a schedulable entity;
policy holds the scheduling policy applied to the process and takes one of the following values:
/*
* Scheduling policies
*/
#define SCHED_NORMAL 0
#define SCHED_FIFO 1
#define SCHED_RR 2
#define SCHED_BATCH 3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
SCHED_NORMAL is used for normal processes, which are handled by the completely fair scheduler;
SCHED_BATCH and SCHED_IDLE are also handled by the completely fair scheduler, but are meant for less important processes;
SCHED_RR and SCHED_FIFO are used to implement soft real-time processes;
cpus_allowed is used on multiprocessor systems to restrict the CPUs a process may run on;
run_list and time_slice are needed by the round-robin real-time scheduler, but are not used by the completely fair scheduler;
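The policy determines which scheduler class a process is attached to. An excerpt from __setscheduler() in kernel/sched.c shows the mapping (the surrounding priority and load-weight updates are omitted):

	p->policy = policy;
	switch (p->policy) {
	case SCHED_NORMAL:
	case SCHED_BATCH:
	case SCHED_IDLE:
		p->sched_class = &fair_sched_class;
		break;
	case SCHED_FIFO:
	case SCHED_RR:
		p->sched_class = &rt_sched_class;
		break;
	}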
Scheduler classes
As mentioned above, scheduler classes are implemented in a modular fashion. Each of them fills in the following structure (include/linux/sched.h):
struct sched_class {
	const struct sched_class *next;

	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
	void (*yield_task) (struct rq *rq);

	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);

	struct task_struct * (*pick_next_task) (struct rq *rq);
	void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
			struct rq *busiest, unsigned long max_load_move,
			struct sched_domain *sd, enum cpu_idle_type idle,
			int *all_pinned, int *this_best_prio);

	int (*move_one_task) (struct rq *this_rq, int this_cpu,
			      struct rq *busiest, struct sched_domain *sd,
			      enum cpu_idle_type idle);
#endif

	void (*set_curr_task) (struct rq *rq);
	void (*task_tick) (struct rq *rq, struct task_struct *p);
	void (*task_new) (struct rq *rq, struct task_struct *p);
};
Every scheduling policy implements this structure. For example, the completely fair scheduler (kernel/sched_fair.c):
/*
* All the scheduling class methods:
*/
static const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,
	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,

	.check_preempt_curr	= check_preempt_wakeup,

	.pick_next_task		= pick_next_task_fair,
	.put_prev_task		= put_prev_task_fair,

#ifdef CONFIG_SMP
	.load_balance		= load_balance_fair,
	.move_one_task		= move_one_task_fair,
#endif

	.set_curr_task		= set_curr_task_fair,
	.task_tick		= task_tick_fair,
	.task_new		= task_new_fair,
};
The interface a scheduler class provides is as follows:
enqueue_task: adds a new process to the run queue;
dequeue_task: removes a process from the run queue;
yield_task: when a process wants to give up the CPU voluntarily, it issues the sched_yield system call, which in turn calls this hook;
check_preempt_curr: checks whether a newly woken process should preempt the currently running one;
pick_next_task: selects the next process to run;
put_prev_task: called before the currently running process is replaced by another;
set_curr_task: called when the process's scheduling policy is changed;
task_tick: called by the periodic scheduler each time it is activated;
task_new: connects the fork system call with the scheduler; it is called whenever a new process is created;
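The generic scheduler never calls a class by name; it walks the chain formed by the next pointers (rt_sched_class, then fair_sched_class, then idle_sched_class) and takes the first process offered. This is visible in pick_next_task() in kernel/sched.c, shown here slightly condensed:

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: if all runnable tasks belong to the fair
	 * class, ask it directly.
	 */
	if (likely(rq->nr_running == rq->cfs.nr_running)) {
		p = fair_sched_class.pick_next_task(rq);
		if (likely(p))
			return p;
	}

	class = sched_class_highest;	/* &rt_sched_class */
	for ( ; ; ) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
		/* The idle class always returns a task, so this terminates. */
		class = class->next;
	}
}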
Run queues
The central data structure the generic scheduler uses to manage active processes is called the run queue.
It corresponds to the following structure (kernel/sched.c):
/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
* (such as the load balancing or the thread migration code), lock
* acquire operations must be ordered by ascending &runqueue.
*/
struct rq {
	/* runqueue lock: */
	spinlock_t lock;

	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned long nr_running;
	#define CPU_LOAD_IDX_MAX 5
	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
	unsigned char idle_at_tick;
#ifdef CONFIG_NO_HZ
	unsigned char in_nohz_recently;
#endif
	/* capture load from *all* tasks on this cpu: */
	struct load_weight load;
	unsigned long nr_load_updates;
	u64 nr_switches;

	struct cfs_rq cfs;
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* list of leaf cfs_rq on this cpu: */
	struct list_head leaf_cfs_rq_list;
#endif
	struct rt_rq rt;

	/*
	 * This is part of a global counter where only the total sum
	 * over all CPUs matters. A task can increase this counter on
	 * one CPU and if it got migrated afterwards it may decrease
	 * it on another CPU. Always updated under the runqueue lock:
	 */
	unsigned long nr_uninterruptible;

	struct task_struct *curr, *idle;
	unsigned long next_balance;
	struct mm_struct *prev_mm;

	u64 clock, prev_clock_raw;
	s64 clock_max_delta;

	/* ... remaining members omitted ... */
};
Each CPU has its own run queue; all of them are collected in the runqueues array:
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
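Code that needs a run queue reaches it through a small set of macros defined next to the array in kernel/sched.c:

#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
#define this_rq()		(&__get_cpu_var(runqueues))
#define task_rq(p)		cpu_rq(task_cpu(p))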
Priorities
Process priority is one of the factors the scheduler must take into account.
How the various priorities are computed differs by process type: for normal processes the dynamic priorities are derived from the static priority, while for real-time processes they are derived from rt_priority.
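The core of that computation is normal_prio() in kernel/sched.c: real-time tasks map rt_priority into the real-time range, everything else simply inherits static_prio:

static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;
}

static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO-1 - p->rt_priority;
	else
		prio = __normal_prio(p);
	return prio;
}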
Besides its priority, a process's importance is also determined by its load weight, which is kept in the schedulable entity:
/*
* CFS stats for a schedulable entity (task, task-group etc)
*
* Current field usage histogram:
*
* 4 se->block_start
* 4 se->run_node
* 4 se->sleep_start
* 6 se->load.weight
*/
struct sched_entity {
	struct load_weight	load;		/* for load-balancing */
Priority and load weight are connected by a fixed conversion: each step down in nice level increases the weight by roughly a factor of 1.25, which amounts to about a 10% change in CPU share per nice level.
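The conversion is done with a static table in kernel/sched.c; nice level 0 corresponds to a weight of 1024:

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed.
 */
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};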
Context switching
Once the kernel has selected a new process, it has to take care of the technical details of multitasking; these are collectively known as the context switch.
It is performed by the function context_switch() (kernel/sched.c):
/*
* context_switch - switch to the new MM and the new
* thread's register state.
*/
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
It is called from within schedule().
context_switch() relies mainly on two routines, as sketched below:
switch_mm(): switches the memory-management context;
switch_to(): switches the processor registers and the kernel stack;
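A condensed sketch of context_switch() shows how the two routines are used. The lazy-TLB handling for kernel threads (which have no mm of their own) is kept; lockdep and architecture hooks are omitted:

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm = next->mm;
	struct mm_struct *oldmm = prev->active_mm;

	prepare_task_switch(rq, prev, next);

	if (unlikely(!mm)) {
		/* Kernel threads have no mm: borrow the previous one. */
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);	/* switch the address space */

	if (unlikely(!prev->mm)) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}

	/* Switch the register state and the kernel stack. */
	switch_to(prev, next, prev);

	barrier();
	/* prev and this_rq() must be re-evaluated after switch_to(). */
	finish_task_switch(this_rq(), prev);
}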