Linux Kernel Scheduler: A Detailed Analysis of the Main Scheduler schedule() Function

Date: 2019-3-23
Kernel version: Linux-2.6.32
Architecture: X86
Author: Bystander
Content: Linux process scheduling

1 Introduction

In "Linux systems process scheduling - scheduling architecture detailed analysis" in an article analyzes the principle and process scheduler runs, it will analyze in detail the master scheduler.

1.1 Linux process scheduling

Memory holds a unique descriptor for each process, linked to other processes through a number of data structures. The scheduler's task is to share CPU time among the programs, creating the illusion of parallel execution. This task breaks down into two different parts: one is the scheduling policy, the other is the context switch.

Scheduling is activated in two ways:

  1. A process directly gives up the CPU.
  2. A periodic mechanism checks at a fixed frequency whether scheduling is necessary.

The current Linux kernel therefore contains two scheduler components: the main scheduler and the periodic scheduler (together referred to as the generic scheduler or core scheduler).
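
The two activation paths map onto these two components. The following is a minimal conceptual sketch, not actual kernel code: my_time_slice_expired() is a hypothetical placeholder for the per-class policy check, while set_tsk_need_resched(), need_resched() and schedule() are real kernel functions.

/*
 * Minimal sketch of the periodic activation path: the timer tick only
 * marks that a reschedule is needed; schedule() itself runs later,
 * for example on the return path from the interrupt.
 */
static void periodic_tick_sketch(struct task_struct *curr)
{
	if (my_time_slice_expired(curr))	/* hypothetical policy check */
		set_tsk_need_resched(curr);	/* sets TIF_NEED_RESCHED on curr */
}

static void interrupt_return_sketch(void)
{
	if (need_resched())	/* was TIF_NEED_RESCHED set? */
		schedule();	/* only now does the main scheduler run */
}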

2 The main scheduler schedule() function

schedule() is the main scheduler function in the kernel. Whenever the CPU is to be handed from the currently active process to another process, the main scheduler function schedule() is called, directly or indirectly. For example, down(struct semaphore *sem) eventually calls schedule() as well.
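
As an illustration of the direct path, the sketch below shows how a blocking primitive typically ends up in schedule(). It is a simplified, hypothetical example (struct my_waitable and resource_available() are made up, not real kernel symbols), not the real down() implementation, but the real semaphore code follows the same pattern.

/* Simplified, hypothetical sketch of a blocking primitive calling schedule() */
static void blocking_primitive_sketch(struct my_waitable *w)
{
	DEFINE_WAIT(wait);				/* a wait-queue entry for the current task */

	prepare_to_wait(&w->wq, &wait, TASK_UNINTERRUPTIBLE);	/* mark current as sleeping */
	if (!resource_available(w))			/* hypothetical availability check */
		schedule();				/* give up the CPU until woken */
	finish_wait(&w->wq, &wait);
}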

The function does the following:

  1. Determine the current run queue and save a pointer to the task_struct of the currently active process.
  2. Disable kernel preemption and carry out the actual scheduling work.
  3. Re-enable kernel preemption and check whether the current process's reschedule flag TIF_NEED_RESCHED is set; if another process has set TIF_NEED_RESCHED, execute the scheduling function again.

The framework of schedule() is as follows:

/*
 * schedule() is the main scheduler function.
 * Its main job is to replace the currently executing
 * process with another one.
 */
asmlinkage void __sched schedule(void)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq *rq;
	int cpu;

need_resched:
	/*
	 * Disable kernel preemption
	 */
	preempt_disable();
	
	/*
	 * Get the ID of the current CPU
	 */
	cpu = smp_processor_id();
	
	/*
	 * Get the run queue data structure of the current CPU via its ID
	 */
	rq = cpu_rq(cpu);
	
	/*
	 * Note a quiescent state on this CPU (RCU bookkeeping).
	 * Understanding this function requires knowledge of RCU,
	 * an important technique in the Linux kernel; interested
	 * readers can study it on their own.
	 */
	rcu_sched_qs(cpu);

	/*
	 * Let prev point to the current process
	 */
	prev = rq->curr;

	/*
	 * Point switch_count at the context switch counter (initially the involuntary one, nivcsw)
	 */
	switch_count = &prev->nivcsw;

	/*
	 * Release the big kernel lock; schedule() must ensure that prev does not hold it
	 */
	release_kernel_lock(prev);
need_resched_nonpreemptible:

	/*
	 * Calling cond_resched() while kernel preemption is disabled is
	 * an error; this function is used to catch that kind of mistake
	 */
	schedule_debug(prev);

	/*
	 * Cancel the hrtick_timer in rq
	 */
	if (sched_feat(HRTICK))
		hrtick_clear(rq);
	
	/*
	 * Take the rq spinlock (with interrupts disabled) to protect the run queue
	 */
	spin_lock_irq(&rq->lock);
	
	/*
	 * Update the run queue clock
	 */
	update_rq_clock(rq);

	/*
	 * Clear prev's reschedule flag TIF_NEED_RESCHED to avoid
	 * a redundant rescheduling request
	 */
	clear_tsk_need_resched(prev);
	/*
	 * Check prev's state: handle the case where it is no longer
	 * runnable and was not preempted in kernel mode.
	 */
	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		/*
		 * If prev is blocked in TASK_INTERRUPTIBLE and has a pending
		 * signal, set its state back to TASK_RUNNING.
		 */
		if (unlikely(signal_pending_state(prev->state, prev)))
			/*
			 * Set TASK_RUNNING here because prev has a signal to handle;
			 * it must not be removed from the run queue, otherwise the
			 * signal could not be processed and other processes would be affected.
			 */
			prev->state = TASK_RUNNING;
		else
			/*
			 * Remove prev from the run queue
			 */
			deactivate_task(rq, prev, 1);
		switch_count = &prev->nvcsw;
	}
	/*
	 * Notify the scheduling class that a process switch is about to happen
	 */
	pre_schedule(rq, prev);
	
	/*
	 * If this run queue has no runnable process, migrate runnable
	 * processes from another run queue to the local one.
	 */
	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);
	
	/*
	 * Notify the scheduling class that the current process is about to be replaced by another one
	 */
	put_prev_task(rq, prev);
	
	/*
	 * Pick the next process to run
	 */
	next = pick_next_task(rq);

	/*
	 * Check whether the process picked next differs from the current one
	 */
	if (likely(prev != next)) {
		
		/*
		 * Update run-time accounting and statistics for prev and next
		 */
		sched_info_switch(prev, next);
		
		/*
		 * Called from the scheduler to remove the current task's events:
		 * with interrupts disabled, stop each event and update the event
		 * value in event->count.
		 */
		perf_event_task_sched_out(prev, next, cpu);
	
		/*  
		 * Update the run queue's switch counter
		 */
		rq->nr_switches++;
		
		/*  
		 * Mark next as the run queue's curr process
		 */
		rq->curr = next;

		/*
		 * Update the process switch counter
		 */
		++*switch_count;
		/*
		 * The context switch between the two processes happens here.
		 * It has two major parts: 1. switching the virtual address
		 * space mapping from prev to next; since the kernel address
		 * space does not need to be switched, it is mainly the
		 * user-space address space that changes. 2. saving and
		 * restoring the stack and register state.
		 */
		context_switch(rq, prev, next); /* unlocks the rq */
		/*
		 * the context switch might have flipped the stack from under
		 * us, hence refresh the local variables.
		 */
		cpu = smp_processor_id();
		rq = cpu_rq(cpu);
	} else
		/*
		 * Release the rq lock
		 */
		spin_unlock_irq(&rq->lock);
	/*
	 * Notify the scheduling class that the process switch has completed
	 */
	post_schedule(rq);
	/*
	 * Reacquire the big kernel lock; if that fails, reschedule
	 */
	if (unlikely(reacquire_kernel_lock(current) < 0))
		goto need_resched_nonpreemptible;
	/*
	 * Re-enable kernel preemption
	 */
	preempt_enable_no_resched();
	/*
	 * Check whether another process has set the current process's
	 * TIF_NEED_RESCHED flag; if so, reschedule.
	 */
	if (need_resched())
		goto need_resched;
}

3 Key functions in the process switch performed by schedule()

The above explained schedule() together with the actual code; my ability is limited, so the explanation is somewhat superficial. Below, some of the key functions are explained.

  1. idle_balance(int this_cpu, struct rq *this_rq): if the current CPU's run queue has no runnable process, the CPU is about to go idle and idle_balance() is called, which in turn calls load_balance_newidle() to balance the run queues across processors.
  2. pick_next_task(struct rq *rq): selects the next process from the CPU's run queue.
  3. context_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next): after pick_next_task() has chosen a process, context_switch() performs the switch to it, i.e. the process context switch.

3.1 The pick_next_task() function

pick_next_task() selects a process by going through the scheduling classes; note that in Linux pick_next_task() does not operate on processes directly but on scheduling entities.

In Linux-2.6.32 there are three scheduling classes, ordered from highest to lowest priority: rt_sched_class, fair_sched_class, and idle_sched_class. pick_next_task() starts from the highest-priority class, sched_class_highest (i.e. rt_sched_class), and selects the highest-priority runnable process; if rt_sched_class has no runnable process, it selects from the lower-priority fair_sched_class, and so on.

Next, an analysis of the pick_next_task() source code:

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: we know that if all tasks are in
	 * the fair class we can call that function directly:
	 * As the comment above says, if the number of runnable processes
	 * in the run queue equals the number of runnable processes in the
	 * fair class, pick directly from the fair class. The vast majority
	 * of processes in Linux are normal processes belonging to the fair
	 * class, so this avoids the overhead of first searching the rt class.
	 */
	if (likely(rq->nr_running == rq->cfs.nr_running)) {
		p = fair_sched_class.pick_next_task(rq);
		if (likely(p))
			return p;
	}
	/* Start with the highest-priority scheduling class */
	class = sched_class_highest;
	for ( ; ; ) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
		/*
		 * Will never be NULL as the idle class always
		 * returns a non-NULL p:
		 */
		/* If the highest-priority class has no runnable process, pick from the next lower-priority class */
		class = class->next;
	}
}

In Linux-2.6.32, sched_class_highest is rt_sched_class, defined in kernel/sched.c:

#define sched_class_highest (&rt_sched_class)

Each scheduling class defines a .next pointer to the scheduling class with the next lower priority; this is what the class = class->next step in pick_next_task() follows.

When rt_sched_class has no runnable process, a runnable process is selected from fair_sched_class; this is expressed in the definition of rt_sched_class:

static const struct sched_class rt_sched_class = {
	.next			= &fair_sched_class,
    ...
};

Likewise, when fair_sched_class has no runnable process, one is selected from idle_sched_class; fair_sched_class is defined as:

static const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,
    ...
};

In idle_sched_class, .next is defined as NULL:

static const struct sched_class idle_sched_class = {
	/* .next is NULL */
	/* no enqueue/yield_task for idle tasks */
    ...
};

If neither rt_sched_class nor fair_sched_class has a runnable process, the idle process runs. Runnable processes are thus always selected in order from the highest-priority class down to the lowest.

3.2 context_switch(): the process context switch

A context switch, sometimes called a process switch or task switch, means switching the CPU from one process or thread to another. In an operating system, switching the CPU to another process requires saving the state of the current process and restoring the state of the other process: the currently running task moves to the ready (or suspended, or terminated) state, and another selected task becomes the running task. A context switch therefore involves saving the running environment of the current task and restoring the running environment of the task that will run next.

A context switch can occur in three situations: interrupt handling, multitasking, and user switching. In interrupt handling, other code "interrupts" the currently running program: when the CPU receives an interrupt request, a context switch takes place between the running program and the code servicing the interrupt. In multitasking, the CPU switches back and forth among different programs; each program has a corresponding time slice, and the CPU performs a context switch at the boundary between two time slices. In some operating systems, a context switch also happens when a user switches sessions, although this is not strictly necessary.
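
The voluntary and involuntary switch counters that schedule() maintains through switch_count (nvcsw and nivcsw) are visible from user space. A minimal sketch of reading them for the current process with getrusage():

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) == -1) {	/* statistics for the calling process */
		perror("getrusage");
		return 1;
	}

	printf("voluntary context switches:   %ld\n", ru.ru_nvcsw);
	printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
	return 0;
}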

The cost of a process context switch:

  1. Context switching is computationally expensive. It requires considerable processor time: with tens to hundreds of switches per second, each costing on the order of microseconds, context switches can consume a large amount of CPU time and may be among the most expensive operations the operating system performs.
  2. A thread can run on a dedicated processor or migrate across processors. A thread serviced by a single processor benefits from processor affinity and runs more efficiently. Being preempted and rescheduled onto a different processor causes cache misses and, as a consequence, extra context switches and accesses to non-local memory. For processes or threads that communicate at a high rate, it is best to bind them to a CPU to avoid the losses caused by "cross-core" context switches; a minimal sketch of doing so follows this list.
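
A minimal user-space sketch of such CPU binding with sched_setaffinity(); pinning to CPU 0 is an arbitrary choice for the example:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);			/* start with an empty CPU mask */
	CPU_SET(0, &set);		/* allow only CPU 0 */

	/* pid 0 means "the calling process" */
	if (sched_setaffinity(0, sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		return 1;
	}

	/* from now on, this process will not be migrated across cores */
	return 0;
}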

3.2.1 Implementation of context_switch()

/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);
	trace_sched_switch(rq, prev, next);
	mm = next->mm;
	oldmm = prev->active_mm;
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * If next is a kernel thread it has no mm of its own: borrow
	 * prev's active address space and enter lazy TLB mode instead
	 * of switching page tables.
	 */
	if (unlikely(!mm)) {
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);

	/*
	 * If prev is a kernel thread, it was borrowing an address space;
	 * record it in rq->prev_mm so the reference can be dropped in
	 * finish_task_switch() after the switch.
	 */
	if (unlikely(!prev->mm)) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}
	/*
	 * Since the runqueue lock will be released by the next
	 * task (which is an invalid locking op but in the case
	 * of the scheduler it's an obvious special-case), so we
	 * do an early lockdep release here:
	 */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);

	barrier();
	/*
	 * this_rq must be evaluated again because prev may have moved
	 * CPUs since it called schedule(), thus the 'rq' on its stack
	 * frame will be invalid.
	 */
	finish_task_switch(this_rq(), prev);
}

As the code above shows, the switch is implemented by the kernel in two parts:

  1. Switching the address space: the global page directory is switched to install the new address space; this is done by switch_mm() (a simplified sketch follows after this list).
  2. Switching the kernel stack and the hardware context. The hardware context provides all the information the kernel needs to execute the new process, including the CPU registers; this is done by switch_to().
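
For orientation, here is a heavily simplified sketch of what switch_mm() boils down to on x86 (TLB state, flush IPIs and the lazy-TLB case are omitted); the real implementation lives in the arch-specific mmu_context header:

/*
 * Heavily simplified sketch of switch_mm() on x86; not the real code.
 */
static void switch_mm_sketch(struct mm_struct *prev, struct mm_struct *next,
			     struct task_struct *tsk)
{
	unsigned cpu = smp_processor_id();

	if (likely(prev != next)) {
		cpumask_set_cpu(cpu, mm_cpumask(next));	/* this CPU now uses next's mm */
		load_cr3(next->pgd);			/* install next's page global directory */
	}
}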

4 Conclusion

This article analyzed the execution flow and implementation of the schedule() function. Since switch_mm() and switch_to() inside context_switch() are rather obscure, they are not analyzed in more detail here; interested readers can study them in their spare time.

 
