《深入Linux内核架构》 (Professional Linux Kernel Architecture) Reading Notes 007: The Scheduler

Overview

Memory holds a unique description of every process, linked together through a number of structures.

The scheduler's job is to share CPU time among processes, creating the illusion that they execute in parallel.

This involves two parts:

1. the scheduling policy;

2. the context switch.

Schedulers fall into different types according to their scheduling policy. These notes focus on the completely fair scheduler (CFS), which picks the process that has waited longest and hands the CPU to it.

The scheduling policy also has to deal with a number of practical concerns:

1. Processes have different priorities;

2. Processes must not be switched too often, because each switch itself costs overhead.

Scheduling happens in two ways. One is direct scheduling, for example when a process intends to sleep or gives up the CPU for some other reason; the other is a periodic mechanism that runs at a fixed frequency and checks whether a switch is necessary.

The scheduler as a whole is divided into several subsystems. Its core is known as the generic scheduler, and it interacts with two other components, scheduler classes and the context switch code:

1. Scheduler classes do the actual work of deciding which process is scheduled next; they are implemented in a modular fashion, one module per scheduling policy;

2. Context switching interacts closely with the CPU.

Implementation

The following sections look at the scheduler's code. Since a great deal of code is involved, only the simplest parts are covered here.

Entry Points

The scheduler implementation rests on two functions: the periodic scheduler function and the main scheduler function.

schedule() is the main scheduler function:

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
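
An abridged view of its body (locking, preemption, and statistics details omitted) shows the overall flow: the outgoing task is put back through its scheduler class, the next task is picked, and control is transferred via context_switch(). Roughly:

asmlinkage void __sched schedule(void)
{
	struct task_struct *prev, *next;
	struct rq *rq;
	int cpu;

	// ... preemption and locking setup omitted ...
	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	// ... deactivate a sleeping task, update the rq clock, etc. ...

	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);

	prev->sched_class->put_prev_task(rq, prev);
	next = pick_next_task(rq, prev);

	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		context_switch(rq, prev, next); /* unlocks the rq */
	} else
		spin_unlock_irq(&rq->lock);

	// ... re-check TIF_NEED_RESCHED and possibly loop ...
}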

It implements the direct scheduling approach: many places in the kernel call it directly when the CPU is to be handed to another process. For example:

static void __lock_sock(struct sock *sk)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
					TASK_UNINTERRUPTIBLE);
		spin_unlock_bh(&sk->sk_lock.slock);
		schedule();	/* give up the CPU until we are woken */
		spin_lock_bh(&sk->sk_lock.slock);
		if (!sock_owned_by_user(sk))
			break;
	}
	finish_wait(&sk->sk_lock.wq, &wait);
}

scheduler_tick() is the periodic scheduler function:

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 *
 * It also gets called by the fork code, when changing the parent's
 * timeslices.
 */
void scheduler_tick(void)
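
Abridged, its body updates the run queue clock and the per-CPU load statistics, then delegates the real work to the current task's scheduler class through the task_tick hook, roughly:

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;

	spin_lock(&rq->lock);
	__update_rq_clock(rq);
	// ... clock adjustment omitted ...
	update_cpu_load(rq);
	if (curr != rq->idle)
		curr->sched_class->task_tick(rq, curr);
	spin_unlock(&rq->lock);

#ifdef CONFIG_SMP
	rq->idle_at_tick = idle_cpu(cpu);
	trigger_load_balance(rq, cpu);
#endif
}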

As the call site below shows, it implements the periodic scheduling approach, driven from the timer interrupt:

/*
 * Called from the timer interrupt handler to charge one tick to the current
 * process.  user_tick is 1 if the tick is user time, 0 for system.
 */
void update_process_times(int user_tick)
{
	struct task_struct *p = current;
	int cpu = smp_processor_id();

	/* Note: this timer irq context must be accounted for as well. */
	account_process_tick(p, user_tick);
	run_local_timers();
	if (rcu_pending(cpu))
		rcu_check_callbacks(cpu, user_tick);
	scheduler_tick();
	run_posix_cpu_timers(p);
}

Related Structures

Any discussion of the scheduler implementation first has to mention the scheduling-related members of the process structure task_struct:

	int prio, static_prio, normal_prio;
	struct list_head run_list;
	const struct sched_class *sched_class;
	struct sched_entity se;
	unsigned int policy;
	cpumask_t cpus_allowed;
	unsigned int time_slice;
	unsigned int rt_priority;

prio, static_prio, and normal_prio are the process's priorities. prio and normal_prio are dynamic priorities, while static_prio is the static priority assigned when the process starts. normal_prio is computed from the static priority and the scheduling policy. The priority the scheduler actually considers is prio.

rt_priority is the priority of a real-time process.

sched_class denotes the scheduler class the process belongs to.

se is a schedulable entity. The scheduler is not limited to scheduling processes: it can also work with larger entities, which makes it possible to schedule whole groups of processes. The scheduler in fact operates on schedulable entities; since the entity is embedded in the process structure, every process is itself a schedulable entity, as the helper below shows.
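
Because se is embedded in task_struct, the fair scheduler can recover the owning process from a sched_entity pointer with container_of(); kernel/sched_fair.c has a helper along these lines:

static inline struct task_struct *task_of(struct sched_entity *se)
{
	/* se is a member of task_struct, so step back to the enclosing task */
	return container_of(se, struct task_struct, se);
}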

policy holds the scheduling policy applied to the process. The following values are possible:

/*
 * Scheduling policies
 */
#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5

SCHED_NORMAL is used for normal processes, which are handled by the completely fair scheduler.

SCHED_BATCH and SCHED_IDLE are likewise handled by the completely fair scheduler, but are intended for less important processes.

SCHED_RR and SCHED_FIFO are used to implement soft real-time processes; the kernel tests for these two with a small helper, shown below.
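
For reference, kernel/sched.c checks for the two real-time policies roughly like this:

static inline int rt_policy(int policy)
{
	/* only SCHED_FIFO and SCHED_RR count as real-time policies */
	if (unlikely(policy == SCHED_FIFO) || unlikely(policy == SCHED_RR))
		return 1;
	return 0;
}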

cpus_allowed is a bit field used on multiprocessor systems to restrict the CPUs on which the process may run.

run_list and time_slice are required by the round-robin real-time scheduler, but are not used by the completely fair scheduler.

Scheduler Classes

As mentioned above, scheduler classes are handled in a modular fashion; each modular scheduler implements the following structure (include/linux/sched.h):

struct sched_class {
	const struct sched_class *next;

	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
	void (*yield_task) (struct rq *rq);

	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);

	struct task_struct * (*pick_next_task) (struct rq *rq);
	void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
	unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
			struct rq *busiest, unsigned long max_load_move,
			struct sched_domain *sd, enum cpu_idle_type idle,
			int *all_pinned, int *this_best_prio);

	int (*move_one_task) (struct rq *this_rq, int this_cpu,
			      struct rq *busiest, struct sched_domain *sd,
			      enum cpu_idle_type idle);
#endif

	void (*set_curr_task) (struct rq *rq);
	void (*task_tick) (struct rq *rq, struct task_struct *p);
	void (*task_new) (struct rq *rq, struct task_struct *p);
};

The scheduler for each policy implements this structure. For example, the completely fair scheduler (kernel/sched_fair.c):

/*
 * All the scheduling class methods:
 */
static const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,
	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,

	.check_preempt_curr	= check_preempt_wakeup,

	.pick_next_task		= pick_next_task_fair,
	.put_prev_task		= put_prev_task_fair,

#ifdef CONFIG_SMP
	.load_balance		= load_balance_fair,
	.move_one_task		= move_one_task_fair,
#endif

	.set_curr_task          = set_curr_task_fair,
	.task_tick		= task_tick_fair,
	.task_new		= task_new_fair,
};

The interfaces provided by a scheduler class are as follows:

enqueue_task: adds a new process to the run queue;

dequeue_task: removes a process from the run queue;

yield_task: when a process wants to give up the processor voluntarily, it issues the sched_yield system call, which in turn calls this interface;

check_preempt_curr: preempts the current process with a newly woken one, if necessary;

pick_next_task: selects the next process that is to run;

put_prev_task: called before the currently running process is replaced by another one;

set_curr_task: called when the scheduling policy of a process is changed;

task_tick: called by the periodic scheduler each time it is activated;

task_new: establishes the connection between the fork system call and the scheduler; it is called whenever a new task is created.
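
The generic scheduler invokes these hooks through the sched_class pointer. For example, to choose the next task it walks the scheduler classes, linked by the next pointer in order of decreasing priority, until one of them returns a task. kernel/sched.c does this roughly as follows (abridged):

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: if all runnable tasks belong to the fair class,
	 * call its method directly:
	 */
	if (likely(rq->nr_running == rq->cfs.nr_running)) {
		p = fair_sched_class.pick_next_task(rq);
		if (likely(p))
			return p;
	}

	class = sched_class_highest;
	for ( ; ; ) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
		/*
		 * The loop always terminates: the idle class at the end
		 * of the chain never returns NULL.
		 */
		class = class->next;
	}
}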

Run Queues

The central data structure the generic scheduler uses to manage active processes is called the run queue.

The corresponding structure is as follows:

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
	/* runqueue lock: */
	spinlock_t lock;

	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned long nr_running;
	#define CPU_LOAD_IDX_MAX 5
	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
	unsigned char idle_at_tick;
#ifdef CONFIG_NO_HZ
	unsigned char in_nohz_recently;
#endif
	/* capture load from *all* tasks on this cpu: */
	struct load_weight load;
	unsigned long nr_load_updates;
	u64 nr_switches;

	struct cfs_rq cfs;
#ifdef CONFIG_FAIR_GROUP_SCHED
	/* list of leaf cfs_rq on this cpu: */
	struct list_head leaf_cfs_rq_list;
#endif
	struct rt_rq rt;

	/*
	 * This is part of a global counter where only the total sum
	 * over all CPUs matters. A task can increase this counter on
	 * one CPU and if it got migrated afterwards it may decrease
	 * it on another CPU. Always updated under the runqueue lock:
	 */
	unsigned long nr_uninterruptible;

	struct task_struct *curr, *idle;
	unsigned long next_balance;
	struct mm_struct *prev_mm;

	u64 clock, prev_clock_raw;
	s64 clock_max_delta;
	// remainder omitted

Each CPU has its own run queue; all of them are held in the per-CPU variable runqueues:

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
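
Access to these queues goes through a set of convenience macros defined nearby in kernel/sched.c:

#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
#define this_rq()		(&__get_cpu_var(runqueues))
#define task_rq(p)		cpu_rq(task_cpu(p))
#define cpu_curr(cpu)		(cpu_rq(cpu)->curr)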

Priorities

Process priority is one of the issues the scheduler has to take into account.

The result of the priority computation differs for different types of processes: for a normal process all three priorities derive from the static priority, while for real-time (and temporarily priority-boosted) processes the computation involves rt_priority. The relevant code is shown below.
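
The computation can be read directly from kernel/sched.c (same kernel generation as the other excerpts, lightly abridged): normal_prio is the static priority for a regular task and is derived from rt_priority for a real-time task, and effective_prio leaves a boosted prio untouched:

static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO-1 - p->rt_priority;
	else
		prio = __normal_prio(p);	/* simply returns p->static_prio */
	return prio;
}

static int effective_prio(struct task_struct *p)
{
	p->normal_prio = normal_prio(p);
	/*
	 * If we are RT tasks or we were boosted to RT priority,
	 * keep the priority unchanged. Otherwise, update priority
	 * to the normal priority:
	 */
	if (!rt_prio(p->prio))
		return p->normal_prio;
	return p->prio;
}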

Besides priority, the importance of a process also depends on its load weight, which is stored in the schedulable entity:

/*
 * CFS stats for a schedulable entity (task, task-group etc)
 *
 * Current field usage histogram:
 *
 *     4 se->block_start
 *     4 se->run_node
 *     4 se->sleep_start
 *     6 se->load.weight
 */
struct sched_entity {
	struct load_weight	load;		/* for load-balancing */
	// remainder omitted

The relation between priority and load weight: the kernel is calibrated so that each nice level corresponds to roughly a 10% difference in CPU time, which amounts to a multiplicative factor of about 1.25 between adjacent nice levels, with nice 0 mapped to a weight of 1024. The kernel keeps the results in a precomputed table, prio_to_weight[] in kernel/sched.c.
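
A minimal user-space sketch (not kernel code) that reproduces the table's values approximately:

#include <math.h>
#include <stdio.h>

/*
 * Approximate the kernel's prio_to_weight[] table: nice 0 maps to a
 * weight of 1024, and each nice level scales the weight by ~1.25
 * (about 10% of CPU time per level). The kernel uses precomputed
 * integers rather than floating point.
 */
int main(void)
{
	for (int nice = -20; nice <= 19; nice++)
		printf("nice %3d -> weight ~%.0f\n",
		       nice, 1024.0 / pow(1.25, nice));
	return 0;
}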

Context Switching

Once the kernel has selected a new process, it must deal with the technical details associated with multitasking; these details are collectively known as context switching.

It is handled by the function context_switch() (kernel/sched.c):

/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)

It is called from within schedule().

context_switch() is built mainly around two functions:

switch_mm(): switches the memory management context;

switch_to(): switches the processor register contents and the kernel stack.
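
An abridged view of the body (lazy-TLB handling for kernel threads shortened) shows where the two calls sit:

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);
	mm = next->mm;
	oldmm = prev->active_mm;

	if (unlikely(!mm)) {
		/* kernel thread: borrow the previous mm (lazy TLB) */
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);

	// ...

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);

	barrier();
	/*
	 * this_rq must be evaluated again because prev may have moved
	 * CPUs since it called schedule(), thus the 'rq' on its stack
	 * frame will be invalid.
	 */
	finish_task_switch(this_rq(), prev);
}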

Reposted from blog.csdn.net/jiangwei0512/article/details/105478031