Overview
The kernel keeps a unique descriptor for every process in memory and links these descriptors together through a number of structures.
The scheduler's job is to share CPU time among the processes, creating the illusion that they execute in parallel.
This involves two parts:
1. the scheduling policy;
2. the context switch;
Schedulers differ in the scheduling policy they implement. This article focuses on the completely fair scheduler (CFS), which picks the process that has waited longest for the CPU and hands the CPU to it.
A scheduling policy also has to deal with some practical concerns:
1. processes have different priorities;
2. processes must not be switched too often, because each switch itself costs time;
Scheduling is triggered in two ways. One is direct: a process is about to sleep, or gives up the CPU for some other reason, and invokes the scheduler explicitly. The other is periodic: a timer-driven mechanism checks at a fixed frequency whether a switch is necessary.
The scheduler as a whole is divided into several subsystems. Its core is the generic scheduler, which interacts with two other components:
1. the scheduler classes, which actually decide which process runs next; they are implemented in a modular fashion, one per scheduling policy;
2. the context switch, which interacts closely with the CPU;
Implementation
The following walks through the scheduler's code. Since a great deal of code is involved, only the most essential parts are covered.
Entry points
The scheduler implementation rests on two functions: the periodic scheduler function and the main scheduler function.
schedule() is the main scheduler function:
/*
* schedule() is the main scheduler function.
*/
asmlinkage void __sched schedule(void)
It is the direct form of scheduling: in many places in the kernel, code that wants to hand the CPU over to another process simply calls it. One example is __lock_sock() (net/core/sock.c):
static void __lock_sock(struct sock *sk)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
					TASK_UNINTERRUPTIBLE);
		spin_unlock_bh(&sk->sk_lock.slock);
		schedule();
		spin_lock_bh(&sk->sk_lock.slock);
		if (!sock_owned_by_user(sk))
			break;
	}
	finish_wait(&sk->sk_lock.wq, &wait);
}
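Internally, schedule() retires the previously running process and asks the scheduler classes for the next one. The following is a heavily condensed sketch of its core, simplified from the kernel/sched.c of the kernel generation discussed here; error paths, statistics, signal handling and SMP balancing are omitted, so treat it as an outline rather than the full function:

asmlinkage void __sched schedule(void)
{
	struct task_struct *prev, *next;
	struct rq *rq;

	preempt_disable();
	rq = this_rq();                  /* run queue of the current CPU */
	prev = rq->curr;

	spin_lock_irq(&rq->lock);

	/* A process that is going to sleep leaves the run queue. */
	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE))
		deactivate_task(rq, prev, 1);

	prev->sched_class->put_prev_task(rq, prev);
	next = pick_next_task(rq, prev); /* ask the scheduler classes */

	if (likely(prev != next)) {
		rq->curr = next;
		context_switch(rq, prev, next); /* unlocks the run queue */
	} else
		spin_unlock_irq(&rq->lock);

	preempt_enable_no_resched();
}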
scheduler_tick() is the periodic scheduler function:
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
* It also gets called by the fork code, when changing the parent's
* timeslices.
*/
void scheduler_tick(void)
As the comment above indicates, it is the periodic form of scheduling: the timer code calls it with HZ frequency, via update_process_times():
/*
* Called from the timer interrupt handler to charge one tick to the current
* process. user_tick is 1 if the tick is user time, 0 for system.
*/
void update_process_times(int user_tick)
{
	struct task_struct *p = current;
	int cpu = smp_processor_id();

	/* Note: this timer irq context must be accounted for as well. */
	account_process_tick(p, user_tick);
	run_local_timers();
	if (rcu_pending(cpu))
		rcu_check_callbacks(cpu, user_tick);
	scheduler_tick();
	run_posix_cpu_timers(p);
}
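scheduler_tick() itself mainly updates the run-queue clock and then delegates the periodic work to the scheduler class of the currently running process. A condensed sketch (simplified; the tick-timestamp bookkeeping and SMP load balancing are left out):

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;

	spin_lock(&rq->lock);
	__update_rq_clock(rq);            /* advance the per-runqueue clock */
	update_cpu_load(rq);
	if (curr != rq->idle)             /* let the class do the real work */
		curr->sched_class->task_tick(rq, curr);
	spin_unlock(&rq->lock);
}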
Related structures
Any discussion of the scheduler implementation starts with the scheduler-related members of the process descriptor, struct task_struct:
	int prio, static_prio, normal_prio;
	struct list_head run_list;
	const struct sched_class *sched_class;
	struct sched_entity se;
	unsigned int policy;
	cpumask_t cpus_allowed;
	unsigned int time_slice;
	unsigned int rt_priority;
prio, static_prio and normal_prio are the process's priorities. static_prio is the static priority assigned when the process starts; normal_prio is computed from the static priority and the scheduling policy; prio and normal_prio are dynamic priorities, and prio is the one the scheduler actually considers;
rt_priority is the priority of a real-time process;
sched_class is the scheduler class the process belongs to;
se is the schedulable entity. The scheduler is not limited to scheduling single processes; it can also schedule groups of processes, each of which forms one schedulable entity. In fact the scheduler always operates on schedulable entities, and since a sched_entity is embedded in the process descriptor, every process is itself a schedulable entity;
policy holds the scheduling policy applied to the process and takes one of the following values:
/*
* Scheduling policies
*/
#define SCHED_NORMAL 0
#define SCHED_FIFO 1
#define SCHED_RR 2
#define SCHED_BATCH 3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
SCHED_NORMAL is used for normal processes, which are handled by the completely fair scheduler;
SCHED_BATCH and SCHED_IDLE are also handled by the completely fair scheduler, but are meant for less important processes;
SCHED_RR and SCHED_FIFO are used to implement soft real-time processes;
cpus_allowed is used on multiprocessor systems to restrict the CPUs a process may run on;
run_list and time_slice are needed by the round-robin real-time scheduler, but are not used by the completely fair scheduler;
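The policy determines which scheduler class a process is attached to. An excerpt from __setscheduler() in kernel/sched.c shows the mapping (the surrounding priority and load-weight updates are omitted):

	p->policy = policy;
	switch (p->policy) {
	case SCHED_NORMAL:
	case SCHED_BATCH:
	case SCHED_IDLE:
		p->sched_class = &fair_sched_class;
		break;
	case SCHED_FIFO:
	case SCHED_RR:
		p->sched_class = &rt_sched_class;
		break;
	}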
Scheduler classes
As mentioned above, scheduler classes are implemented in a modular fashion. Each of them fills in the following structure (include/linux/sched.h):
struct sched_class {
	const struct sched_class *next;

	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
	void (*yield_task) (struct rq *rq);

	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);

	struct task_struct * (*pick_next_task) (struct rq *rq);
	void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
			struct rq *busiest, unsigned long max_load_move,
			struct sched_domain *sd, enum cpu_idle_type idle,
			int *all_pinned, int *this_best_prio);

	int (*move_one_task) (struct rq *this_rq, int this_cpu,
			      struct rq *busiest, struct sched_domain *sd,
			      enum cpu_idle_type idle);
#endif

	void (*set_curr_task) (struct rq *rq);
	void (*task_tick) (struct rq *rq, struct task_struct *p);
	void (*task_new) (struct rq *rq, struct task_struct *p);
};
Every scheduling policy implements this structure. For example, the completely fair scheduler (kernel/sched_fair.c):
/*
* All the scheduling class methods:
*/
static const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,
	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,

	.check_preempt_curr	= check_preempt_wakeup,

	.pick_next_task		= pick_next_task_fair,
	.put_prev_task		= put_prev_task_fair,

#ifdef CONFIG_SMP
	.load_balance		= load_balance_fair,
	.move_one_task		= move_one_task_fair,
#endif

	.set_curr_task		= set_curr_task_fair,
	.task_tick		= task_tick_fair,
	.task_new		= task_new_fair,
};
The interface a scheduler class provides is as follows:
enqueue_task: adds a new process to the run queue;
dequeue_task: removes a process from the run queue;
yield_task: when a process wants to give up the CPU voluntarily, it issues the sched_yield system call, which in turn calls this hook;
check_preempt_curr: checks whether a newly woken process should preempt the currently running one;
pick_next_task: selects the next process to run;
put_prev_task: called before the currently running process is replaced by another;
set_curr_task: called when the process's scheduling policy is changed;
task_tick: called by the periodic scheduler each time it is activated;
task_new: connects the fork system call with the scheduler; it is called whenever a new process is created;
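The generic scheduler never calls a class by name; it walks the chain formed by the next pointers (rt_sched_class, then fair_sched_class, then idle_sched_class) and takes the first process offered. This is visible in pick_next_task() in kernel/sched.c, shown here slightly condensed:

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: if all runnable tasks belong to the fair
	 * class, ask it directly.
	 */
	if (likely(rq->nr_running == rq->cfs.nr_running)) {
		p = fair_sched_class.pick_next_task(rq);
		if (likely(p))
			return p;
	}

	class = sched_class_highest;	/* &rt_sched_class */
	for ( ; ; ) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
		/* The idle class always returns a task, so this terminates. */
		class = class->next;
	}
}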
Run queues
The central data structure the generic scheduler uses to manage active processes is called the run queue.
It corresponds to the following structure (kernel/sched.c):
/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
* (such as the load balancing or the thread migration code), lock
* acquire operations must be ordered by ascending &runqueue.
*/
struct rq {
	/* runqueue lock: */
	spinlock_t lock;

	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned long nr_running;
	#define CPU_LOAD_IDX_MAX 5
	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
	unsigned char idle_at_tick;
#ifdef CONFIG_NO_HZ
	unsigned char in_nohz_recently;
#endif
	/* capture load from *all* tasks on this cpu: */
	struct load_weight load;
	unsigned long nr_load_updates;
	u64 nr_switches;

	struct cfs_rq cfs;
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* list of leaf cfs_rq on this cpu: */
	struct list_head leaf_cfs_rq_list;
#endif
	struct rt_rq rt;

	/*
	 * This is part of a global counter where only the total sum
	 * over all CPUs matters. A task can increase this counter on
	 * one CPU and if it got migrated afterwards it may decrease
	 * it on another CPU. Always updated under the runqueue lock:
	 */
	unsigned long nr_uninterruptible;

	struct task_struct *curr, *idle;
	unsigned long next_balance;
	struct mm_struct *prev_mm;

	u64 clock, prev_clock_raw;
	s64 clock_max_delta;

	/* ... remaining members omitted ... */
};
Each CPU has its own run queue; all of them are collected in the runqueues array:
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
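Code that needs a run queue reaches it through a small set of macros defined next to the array in kernel/sched.c:

#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
#define this_rq()		(&__get_cpu_var(runqueues))
#define task_rq(p)		cpu_rq(task_cpu(p))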
Priorities
Process priority is one of the factors the scheduler must take into account.
How the various priorities are computed differs by process type: for normal processes the dynamic priorities are derived from the static priority, while for real-time processes they are derived from rt_priority.
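The core of that computation is normal_prio() in kernel/sched.c: real-time tasks map rt_priority into the real-time range, everything else simply inherits static_prio:

static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;
}

static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO-1 - p->rt_priority;
	else
		prio = __normal_prio(p);
	return prio;
}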
Besides its priority, a process's importance is also determined by its load weight, which is kept in the schedulable entity:
/*
* CFS stats for a schedulable entity (task, task-group etc)
*
* Current field usage histogram:
*
* 4 se->block_start
* 4 se->run_node
* 4 se->sleep_start
* 6 se->load.weight
*/
struct sched_entity {
	struct load_weight	load;		/* for load-balancing */
Priority and load weight are connected by a fixed conversion: each step down in nice level increases the weight by roughly a factor of 1.25, which amounts to about a 10% change in CPU share per nice level.
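The conversion is done with a static table in kernel/sched.c; nice level 0 corresponds to a weight of 1024:

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed.
 */
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};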
Context switching
Once the kernel has selected a new process, it has to take care of the technical details of multitasking; these are collectively known as the context switch.
It is performed by the function context_switch() (kernel/sched.c):
/*
* context_switch - switch to the new MM and the new
* thread's register state.
*/
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
It is called from within schedule().
context_switch() relies mainly on two routines, as sketched below:
switch_mm(): switches the memory-management context;
switch_to(): switches the processor registers and the kernel stack;
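A condensed sketch of context_switch() shows how the two routines are used. The lazy-TLB handling for kernel threads (which have no mm of their own) is kept; lockdep and architecture hooks are omitted:

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm = next->mm;
	struct mm_struct *oldmm = prev->active_mm;

	prepare_task_switch(rq, prev, next);

	if (unlikely(!mm)) {
		/* Kernel threads have no mm: borrow the previous one. */
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);	/* switch the address space */

	if (unlikely(!prev->mm)) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}

	/* Switch the register state and the kernel stack. */
	switch_to(prev, next, prev);

	barrier();
	/* prev and this_rq() must be re-evaluated after switch_to(). */
	finish_task_switch(this_rq(), prev);
}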