date	Kernel version	Architecture	Author	content
2018-11-11	Linux-2.6.32	X86	Bystander	Linux system CFS

One, Preamble

Linux currently supports three process scheduling policies: SCHED_FIFO, SCHED_RR and SCHED_NORMAL. It also distinguishes two types of processes: real-time processes and normal processes. Real-time processes use the SCHED_FIFO or SCHED_RR policy; normal processes use SCHED_NORMAL. Starting with kernel version 2.6.23, normal processes (those scheduled under SCHED_NORMAL) are handled by the Completely Fair Scheduler (CFS). CFS no longer tracks how long a process sleeps and does not try to identify interactive processes; it treats all processes uniformly, which is what "completely fair" means.

Two, Overview of CFS basic principles

CFS defines a new scheduling model: it gives every process on cfs_rq (the CFS run queue) a virtual clock, the virtual runtime (vruntime). While a process executes, its vruntime keeps increasing; the vruntime of a process that is not executing stays unchanged.
The scheduler always selects the process with the smallest vruntime to run next. This is what "completely fair" refers to. The vruntime of processes with different priorities grows at different rates: the vruntime of a higher-priority process grows more slowly, so it gets more opportunities to run.
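
The selection rule can be illustrated with a minimal user-space sketch. The struct vr_task type and the vr_pick_next function are hypothetical names made up for this example, and a linear scan stands in for the kernel's red-black tree:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a schedulable task. */
struct vr_task {
    const char *name;
    uint64_t vruntime; /* virtual runtime, in nanoseconds */
};

/*
 * Return the index of the task with the smallest vruntime,
 * or -1 if the array is empty. On a tie the lowest index wins.
 */
static int vr_pick_next(const struct vr_task *tasks, size_t n)
{
    if (n == 0)
        return -1;

    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (tasks[i].vruntime < tasks[best].vruntime)
            best = i;
    }
    return (int)best;
}
```

With three tasks whose vruntime values are 300, 100 and 200, vr_pick_next returns index 1, the task that has consumed the least virtual time so far.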

Three, CFS core algorithm design

       CFS distributes run time among processes according to the weight of each process.
The run time allocated to a process is calculated as:
      run time allocated to a process = scheduling period * weight of the process / sum of the weights of all processes (Formula 3.1)
    Note: the scheduling period is the time it takes for every process in the TASK_RUNNING state to be scheduled once; in the O(1) scheduling algorithm it corresponds to the time for every process on the run queue to run once. The run time allocated to a process is therefore directly proportional to its weight.
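
As a small worked example of Formula 3.1, the following sketch computes the share of a period for one process. The function name slice_ns, the 6 ms period and the weights 1024 and 2048 are illustrative assumptions, not values taken from the kernel:

```c
#include <stdint.h>

/*
 * Formula 3.1: run time allocated to one process.
 * period_ns    - length of the scheduling period in nanoseconds
 * weight       - weight of this process
 * total_weight - sum of the weights of all runnable processes
 * Returns 0 if total_weight is 0 (nothing to schedule).
 */
static uint64_t slice_ns(uint64_t period_ns, uint64_t weight,
                         uint64_t total_weight)
{
    if (total_weight == 0)
        return 0;
    return period_ns * weight / total_weight;
}
```

For a 6,000,000 ns period shared by two processes with weights 1024 and 2048, the first gets 2,000,000 ns and the second 4,000,000 ns, i.e. twice the weight yields twice the run time.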

The formula for vruntime is:
       vruntime = actual running time * NICE_0_LOAD / weight of the process (Formula 3.2)
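
The vruntime formula above can be sketched in a few lines of C. The function name vruntime_delta is hypothetical, and NICE_0_LOAD is assumed to be 1024 (the weight of a nice-0 process); the real kernel uses a more elaborate fixed-point calculation:

```c
#include <stdint.h>

#define VR_NICE_0_LOAD 1024ULL /* assumed weight of a nice-0 process */

/*
 * Convert real execution time into virtual runtime for a process
 * with the given weight. Returns 0 for an invalid weight of 0.
 */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint64_t weight)
{
    if (weight == 0)
        return 0;
    return delta_exec_ns * VR_NICE_0_LOAD / weight;
}
```

After 1,000,000 ns of real execution, a process with weight 1024 gains 1,000,000 ns of vruntime, a process with weight 2048 gains only 500,000 ns, and a process with weight 512 gains 2,000,000 ns.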

If the run time allocated to a process equals its actual running time, the vruntime formula can be taken one step further. Substituting the allocated run time from Formula 3.1 for the actual running time in Formula 3.2 gives:
vruntime = (scheduling period * weight of the process / sum of the weights of all processes) * NICE_0_LOAD / weight of the process

= scheduling period * NICE_0_LOAD / sum of the weights of all processes
A preliminary conclusion: when the run time allocated to a process equals its actual running time, the vruntime of every process grows by the same amount per scheduling period, independent of its own weight, even though the weights differ. As described above, vruntime is used to select the next process to run: a small vruntime means the process has so far received little CPU time and has been treated "unfairly", so it is the one to run next. This keeps process selection fair while still ensuring that high-priority processes get more run time.
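
This conclusion can be checked numerically. The sketch below is an illustration only: eq_vruntime_gains is a made-up helper, NICE_0_LOAD is assumed to be 1024, and the weights are chosen so that the integer divisions are exact:

```c
#include <stddef.h>
#include <stdint.h>

#define EQ_NICE_0_LOAD 1024ULL /* assumed value of NICE_0_LOAD */

/*
 * For each (non-zero) weight, compute the slice from Formula 3.1 and
 * the vruntime gained by running exactly that slice. The gains are
 * stored in out[]. Returns 1 if all gains are equal, 0 otherwise.
 */
static int eq_vruntime_gains(uint64_t period_ns, const uint64_t *weights,
                             size_t n, uint64_t *out)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += weights[i];
    if (total == 0)
        return 0;

    int all_equal = 1;
    for (size_t i = 0; i < n; i++) {
        uint64_t slice = period_ns * weights[i] / total;

        out[i] = slice * EQ_NICE_0_LOAD / weights[i];
        if (out[i] != out[0])
            all_equal = 0;
    }
    return all_equal;
}
```

With a 6,000,000 ns period and weights 1024, 2048 and 3072, the slices are 1, 2 and 3 ms, yet every process gains exactly 1,000,000 ns of vruntime. With weights that do not divide evenly, integer rounding can make the gains differ by a few nanoseconds.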

If the run time allocated to a process does not equal its actual running time: the idea of CFS is to let the vruntime of each scheduling entity grow at a different speed. The larger the weight, the more slowly vruntime grows, so a high-priority process obtains more CPU execution time, while the entity with the smallest vruntime is still always the one selected to run.
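
A toy simulation makes this effect visible. Everything here is an assumption made for illustration: the struct sim_entity type, the sim_run function, a fixed tick length, and NICE_0_LOAD taken as 1024. On every tick the entity with the smallest vruntime runs for one tick:

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_NICE_0_LOAD 1024ULL /* assumed value of NICE_0_LOAD */

/* Hypothetical scheduling entity for the simulation. */
struct sim_entity {
    uint64_t weight;   /* must be non-zero */
    uint64_t vruntime; /* virtual runtime in ns */
    uint64_t exec_ns;  /* real CPU time received in ns */
};

/*
 * Run `ticks` ticks of `tick_ns` each. On every tick the entity with
 * the smallest vruntime runs (lowest index wins a tie) and is charged
 * tick_ns scaled by NICE_0_LOAD / weight.
 */
static void sim_run(struct sim_entity *e, size_t n,
                    unsigned int ticks, uint64_t tick_ns)
{
    if (n == 0)
        return;

    for (unsigned int t = 0; t < ticks; t++) {
        size_t best = 0;

        for (size_t i = 1; i < n; i++) {
            if (e[i].vruntime < e[best].vruntime)
                best = i;
        }
        e[best].exec_ns += tick_ns;
        e[best].vruntime += tick_ns * SIM_NICE_0_LOAD / e[best].weight;
    }
}
```

Running two entities with weights 1024 and 2048 for 300 ticks of 1 ms gives the second entity twice the CPU time of the first (200 ms versus 100 ms), while both end with the same vruntime.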

Each process or scheduling group corresponds to one scheduling entity, and each process is linked to the CFS run queue through its scheduling entity. On every scheduling decision, CFS selects a process from the red-black tree of the CFS run queue (the one with the smallest vruntime). cfs_rq represents the CFS run queue, and the corresponding red-black tree can be found through it. From the task_struct of a process, the corresponding scheduling entity can be found. A sched_entity corresponds to one node in the red-black tree of the run queue.
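
The relationship between these three structures can be sketched as follows. The rel_* types and functions are simplified stand-ins invented for this example; in particular, a list sorted by vruntime replaces the kernel's red-black tree so that the sketch stays short:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified scheduling entity; the list link replaces the rb-tree node. */
struct rel_sched_entity {
    uint64_t vruntime;
    struct rel_sched_entity *next;
};

/* Simplified task: the scheduling entity is embedded in the task. */
struct rel_task_struct {
    int pid;
    struct rel_sched_entity se;
};

/* Simplified CFS run queue: remembers the entity with the smallest vruntime. */
struct rel_cfs_rq {
    struct rel_sched_entity *leftmost;
    unsigned int nr_running;
};

/* Recover the owning task from its embedded entity (the container_of idea). */
static struct rel_task_struct *rel_task_of(struct rel_sched_entity *se)
{
    return (struct rel_task_struct *)
        ((char *)se - offsetof(struct rel_task_struct, se));
}

/* Insert an entity, keeping the list ordered by vruntime (smallest first). */
static void rel_enqueue(struct rel_cfs_rq *rq, struct rel_sched_entity *se)
{
    struct rel_sched_entity **link = &rq->leftmost;

    while (*link && (*link)->vruntime <= se->vruntime)
        link = &(*link)->next;
    se->next = *link;
    *link = se;
    rq->nr_running++;
}

/* The task CFS would run next: the owner of the leftmost entity. */
static struct rel_task_struct *rel_pick_next_task(struct rel_cfs_rq *rq)
{
    return rq->leftmost ? rel_task_of(rq->leftmost) : NULL;
}
```

Embedding the entity in the task means the run queue only ever handles entities, and the task is recovered by pointer arithmetic when it is needed.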

Four, CFS scheduler

4.1 Overview of the Scheduler

Modern operating systems are multitasking systems, and although processors gain more and more cores, the CPU remains a limited hardware resource. To ensure that processes use the CPU in a reasonable way, a management unit is needed that is responsible for scheduling processes and decides which process should be using the CPU at any given moment. This unit is the process scheduler. The scheduler assigns the CPU to a task for a limited time (a time slice). Its job is to distribute CPU time among runnable processes so as to create the illusion that all of them run in parallel, which lets users run several programs at the same time and lets programs with different needs share the CPU. The scheduler must therefore share CPU time between processes as fairly as possible while also taking their different priorities into account. An important goal of the scheduler is to allocate CPU time efficiently while providing a good user experience. The general principle is to give each process in the system the greatest possible fairness with respect to the computing power it needs or, seen from the other side, to try to ensure that no process is starved.

4.2 Scheduler structure

In the current Linux kernel the scheduler is divided into two layers. The first layer, which processes invoke directly, is called the generic scheduler or core scheduler. It is a component separated from the processes themselves: the generic scheduler has no direct connection to individual processes and manages them through the second layer, the scheduler classes. The detailed architecture is shown in the figure below:

[Figure: scheduler architecture]

4.2.1 Scheduler classes:

The Linux kernel implements a scheduler class framework that defines the functions a scheduler must provide; every concrete scheduler class must implement these functions.

In Linux 3.11.1 there are four scheduler classes: stop_sched_class, rt_sched_class, fair_sched_class and idle_sched_class; more recent kernels add another scheduling class, dl_sched_class. Every process belongs to exactly one scheduler class, and Linux implements different scheduling classes for different needs. The scheduling classes are ordered by priority: when the generic scheduler selects a process, it starts with the highest-priority scheduling class; if that class has no runnable process, it moves on to the next class, and so on down the chain. A scheduler class is described by struct sched_class.

struct sched_class {
    const struct sched_class *next;

    // Add a process to the run queue; happens when a process becomes ready (runnable).
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);

    // Inverse of enqueue_task; happens when a process goes from running to blocked.
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    
    // Called when a process voluntarily gives up the CPU.
    void (*yield_task) (struct rq *rq);
    bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
    
    // Pick the next runnable process; happens during process scheduling.
    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
    int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
    void (*migrate_task_rq)(struct task_struct *p, int next_cpu);

    void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
    void (*post_schedule) (struct rq *this_rq);
    void (*task_waking) (struct task_struct *task);
    void (*task_woken) (struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                 const struct cpumask *newmask);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif
    // Must be called when the scheduling policy of a process changes.
    void (*set_curr_task) (struct rq *rq);
    // Called by the periodic scheduler each time it is activated.
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    // Links the fork system call to the scheduler: called to notify the scheduler each time a new process is created.
    void (*task_fork) (struct task_struct *p);

    void (*switched_from) (struct rq *this_rq, struct task_struct *task);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                 int oldprio);

    unsigned int (*get_rr_interval) (struct rq *rq,
                     struct task_struct *task);

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*task_move_group) (struct task_struct *p, int on_rq);
#endif
};
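
How the generic scheduler walks the chain of scheduling classes through the next pointer can be sketched in user space. The cls_* names are hypothetical, only three classes are modeled, and each class simply returns a task stored on a toy run queue:

```c
#include <stddef.h>

/* Hypothetical task and run queue for the sketch. */
struct cls_task {
    const char *name;
};

struct cls_rq {
    struct cls_task *rt_task;   /* runnable real-time task, or NULL */
    struct cls_task *fair_task; /* runnable normal task, or NULL */
    struct cls_task idle_task;  /* always available */
};

/* Reduced scheduling class: a link to the next class and one hook. */
struct cls_sched_class {
    const struct cls_sched_class *next;
    struct cls_task *(*pick_next_task)(struct cls_rq *rq);
};

static struct cls_task *cls_pick_rt(struct cls_rq *rq)   { return rq->rt_task; }
static struct cls_task *cls_pick_fair(struct cls_rq *rq) { return rq->fair_task; }
static struct cls_task *cls_pick_idle(struct cls_rq *rq) { return &rq->idle_task; }

/* Classes chained from highest to lowest priority. */
static const struct cls_sched_class cls_idle_class = { NULL, cls_pick_idle };
static const struct cls_sched_class cls_fair_class = { &cls_idle_class, cls_pick_fair };
static const struct cls_sched_class cls_rt_class   = { &cls_fair_class, cls_pick_rt };

/* Ask each class in priority order until one of them offers a task. */
static struct cls_task *cls_pick_next_task(struct cls_rq *rq)
{
    const struct cls_sched_class *c;

    for (c = &cls_rt_class; c != NULL; c = c->next) {
        struct cls_task *p = c->pick_next_task(rq);

        if (p != NULL)
            return p;
    }
    return NULL; /* not reached: the idle class always returns a task */
}
```

Because the generic code only calls through function pointers, a new scheduling class can be added by linking another structure into the chain without changing the selection loop.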

4.2.2 Periodic scheduler:

scheduler_tick is the periodic scheduler function; the kernel calls it automatically at a fixed frequency. Its main role is to trigger scheduling based on how long the current process has been running. A process can also invoke the scheduler explicitly, for example when it blocks while waiting for a resource. In addition, on the way back from kernel space to user space the kernel checks whether the current process needs to be rescheduled: the thread_info structure of the process contains a flags field, and one bit of that field, TIF_NEED_RESCHED (its bit position is architecture-specific), marks a pending reschedule. When the bit is set, a scheduling pass is required, for example because a higher-priority process needs to run. The kernel also supports kernel preemption, so code running in kernel mode can be preempted at suitable points. The periodic scheduler does not perform the switch itself: it only sets TIF_NEED_RESCHED on the process to be rescheduled, and the actual scheduling is still carried out by the main scheduler, for example on return to user space.

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;

	sched_clock_tick();

	spin_lock(&rq->lock);
	update_rq_clock(rq);
	update_cpu_load(rq);
	curr->sched_class->task_tick(rq, curr, 0);
	spin_unlock(&rq->lock);

	perf_event_task_tick(curr, cpu);

#ifdef CONFIG_SMP
	rq->idle_at_tick = idle_cpu(cpu);
	trigger_load_balance(rq, cpu);
#endif
}
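
The division of labor between the periodic tick and the main scheduler can be sketched with a plain flag word. All flg_* names are hypothetical, the bit index 1 is an arbitrary choice for the example, and clearing the flag stands in for the call to schedule():

```c
#include <stdbool.h>

#define FLG_NEED_RESCHED_BIT 1 /* arbitrary bit index for this sketch */

/* Hypothetical stand-in for the flags word in thread_info. */
struct flg_thread_info {
    unsigned long flags;
};

static void flg_set_need_resched(struct flg_thread_info *ti)
{
    ti->flags |= 1UL << FLG_NEED_RESCHED_BIT;
}

static void flg_clear_need_resched(struct flg_thread_info *ti)
{
    ti->flags &= ~(1UL << FLG_NEED_RESCHED_BIT);
}

static bool flg_need_resched(const struct flg_thread_info *ti)
{
    return (ti->flags & (1UL << FLG_NEED_RESCHED_BIT)) != 0;
}

/* Periodic tick: only marks the task once it has used up its slice. */
static void flg_tick(struct flg_thread_info *ti,
                     unsigned long ran_ns, unsigned long slice_ns)
{
    if (ran_ns >= slice_ns)
        flg_set_need_resched(ti);
}

/*
 * Return-to-user path: the switch happens here, not in the tick.
 * Returns true if a reschedule was performed.
 */
static bool flg_return_to_user(struct flg_thread_info *ti)
{
    if (!flg_need_resched(ti))
        return false;
    flg_clear_need_resched(ti); /* stands in for calling schedule() */
    return true;
}
```

Deferring the switch this way keeps the tick handler short: it records the decision, and the switch happens later at a point where it is safe to change tasks.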

4.2.3 Main scheduler:

The main scheduler is implemented by the schedule() function (its body is __schedule() in the version shown here), which selects the next process and switches to it.

static void __sched __schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    // Disable kernel preemption
    preempt_disable();
    cpu = smp_processor_id();
    // Get the run queue of this CPU
    rq = cpu_rq(cpu);
    rcu_note_context_switch(cpu);
    // Remember the currently running task
    prev = rq->curr;

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    /*
     * Make sure that signal_pending_state()->signal_pending() below
     * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
     * done by the caller to avoid the race with signal_wake_up().
     */
    smp_mb__before_spinlock();
    raw_spin_lock_irq(&rq->lock);

    switch_count = &prev->nivcsw;
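     /* Taken when both of the following hold:
        1. prev is no longer runnable (prev->state is non-zero)
        2. prev was not preempted in kernel mode (PREEMPT_ACTIVE is clear) */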
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev))) {
            prev->state = TASK_RUNNING;
        } else {
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
            prev->on_rq = 0;

            /*
             * If a worker went to sleep, notify and ask workqueue
             * whether it wants to wake up a task to maintain
             * concurrency.
             */
            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;

                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }
        }
        switch_count = &prev->nvcsw;
    }

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);
    // Tell the scheduler that prev is about to be switched out
    put_prev_task(rq, prev);
    // Select the next runnable process
    next = pick_next_task(rq);
    // Clear the TIF_NEED_RESCHED flag of prev
    clear_tsk_need_resched(prev);
    rq->skip_clock_update = 0;
    // Switch only if next differs from the current process
    if (likely(prev != next)) {
        rq->nr_switches++;
        // Make next the current task of this run queue
        rq->curr = next;
        ++*switch_count;
        // Switch the process context
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * The context switch have flipped the stack from under us
         * and restored the local variables which were saved when
         * this task called schedule() in the past. prev == current
         * is still correct, but it can be moved to another cpu/rq.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);
  
    sched_preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}

4.2.4 Context switching:

Context switching is done mainly by the context_switch function, which does two things: it switches the address space, and it switches the register state and the stack. The run queue lock is held and interrupts are disabled throughout the switch. The address space is switched first. In the code, mm and active_mm refer to the mm_struct of the process being scheduled in (next) and of the process being scheduled out (prev). If next->mm is NULL, next is a kernel thread; a kernel thread has no address space of its own, so its mm is NULL, and while it runs it borrows the active_mm of prev. If next->mm is not NULL, next is a user process and the address space can be switched directly by calling switch_mm. If prev is a kernel thread, it has no address space of its own either, so its active_mm is set to NULL. Finally, switch_to is called to switch the register state and the stack. (switch_to is a macro implemented in assembly code; interested readers can study it in more depth.)

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
           struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;
    // Preparation for the switch (locking and disabling interrupts); paired with finish_task_switch at the end
    prepare_task_switch(rq, prev, next);
    
    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);
    // next (the task about to run) is a kernel thread
    if (!mm) {
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);
    // prev (the task being switched out) is a kernel thread
    if (!prev->mm) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }
    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

    context_tracking_task_switch(prev, next);
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);

    barrier();
    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);
}
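
The address-space half of the switch, in particular how a kernel thread borrows active_mm, can be modeled in user space. The ctx_* types are hypothetical, a plain counter replaces the atomic mm_count, and the reference that the kernel drops later in finish_task_switch is dropped immediately here to keep the sketch short:

```c
#include <stddef.h>

/* Hypothetical address-space descriptor with a simple reference count. */
struct ctx_mm {
    int users;
};

/* Hypothetical task: mm is NULL for a kernel thread. */
struct ctx_task {
    struct ctx_mm *mm;        /* own address space, or NULL */
    struct ctx_mm *active_mm; /* address space in use while running */
};

/*
 * Model of the mm handling in context_switch().
 * Returns the address space that is active once next is running.
 */
static struct ctx_mm *ctx_switch_mm(struct ctx_task *prev,
                                    struct ctx_task *next)
{
    struct ctx_mm *oldmm = prev->active_mm;

    if (next->mm == NULL) {
        /* next is a kernel thread: borrow the address space of prev */
        next->active_mm = oldmm;
        oldmm->users++;
    } else {
        /* next is a user process: this assignment stands in for switch_mm() */
        next->active_mm = next->mm;
    }

    if (prev->mm == NULL) {
        /* prev was a kernel thread: give the borrowed address space back */
        prev->active_mm = NULL;
        oldmm->users--;
    }
    return next->active_mm;
}
```

Switching from a user process to a kernel thread leaves the page tables untouched and only raises the reference count; the next switch to a user process drops that reference again.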

This was written in haste; if anything here is wrong, I welcome corrections from readers and will fix it promptly.

Bystander_J

