CFS调度器（一）

关于调度类和优先级的概念，前面的文章《调度器概述》中已经做了介绍了，本文不在重述。本文主要关注的就是CFS调度器，或者叫做fair_sched_class调度类。这种调度器是被SCHED_NORMAL、SCHED_BATCH这两种调度策略使用的。（本文基于Linux4.0）

提到调度器涉及到两个问题：

1.怎么根据参量快速选出下一个要调度执行的进程？
2.怎么定义和更新不同进程的相关参量？

带着这两个问题，我们慢慢去讲解CFS调度器。

权重

权重代表着进程对于CPU使用时间的占有率，所以它的大小跟进程优先级相关，优先级越高，抢占的CPU时间越多，因此它的权重就越大。它的关系可以用下面的公式来表示：

进程运行时间 = 进程权重 * CPU总时间 / 所有进程总权重

内核中每个调度实体都用一个结构体成员来表示权重：

struct sched_entity {
	struct load_weight load;
	...
}

表示权重的结构体为：

struct load_weight {
	unsigned long weight;
	u32 inv_weight;
}

其中weight代表的是权重，inv_weight是为了方便计算出来的一个中间值。内核根据优先级nice来定义权重，约定nice值为0的进程权重值为1024，其他nice值的进程权重可以根据查表获取：

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

其他nice值的权重值都是基于1024计算出来的，高一优先级的进程权重总是低一优先级进程的1.25倍。除了这个表，内核为了计算方便还定义了另外一个表：

/*
 * Inverse (2^32/x) values of the prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
static const u32 prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

内核设置进程权重的函数：

static void set_load_weight(struct task_struct *p)
{
    int prio = p->static_prio - MAX_RT_PRIO;
    struct load_weight *load = &p->se.load;

    /*
     * SCHED_IDLE tasks get minimal weight:
     */
    if (p->policy == SCHED_IDLE) {
        load->weight = scale_load(WEIGHT_IDLEPRIO);
        load->inv_weight = WMULT_IDLEPRIO;
        return;
    }

    load->weight = scale_load(prio_to_weight[prio]);
    load->inv_weight = prio_to_wmult[prio];
}

虚拟时间

CFS调度器的目标是保证每个进程的完全公平调度，尝试对进程一视同仁，把CPU时间平均分配给每个进程去运行，进程运行的同时会记录每个进程的运行时间，按照这种公平算法，CPU每次调度都会优先选择运行时间最少的那个进程，以保证进程的公平。
假如按照上述思想去设计调度器，那么会有一个问题，所有进程都是完全公平，那么如何体现优先级的不同呢？实际上内核并不是使用实际运行时间作为pick_next进程的参考量，而采用的是一个虚拟时间，这个虚拟时间加入了不同优先级权重因素的。虚拟时间的计算方法如下：

vruntime = delta_exec * NICE0_WEIGHT / weight;

vruntime是虚拟运行时间，delta_exec是实际运行时间，NICE0_WEIGHT是nice值为0的权重，也就是1024，weight进程的权重。这样由上面的表可知，优先级越高的进程算出来的vruntime越小，从而被CFS调度器选出来的概率就越大。runqueue中的所有进程都要维护一个vruntime变量，并且把这些进程按照红黑树来管理，vrunquque总是处于红黑树的最左侧，这样在算法选择时具有更高的效率。

回到前面的问题，CFS调度器就是通过vruntime参量来决定当前要选择运行的进程，为了选择算法的高效和快速，采用红黑树来管理这个参量，这就是CFS调度器的核心思想了。
下面是内核中的代码定义：

struct sched_entity {

    u64         exec_start;
    u64         sum_exec_runtime;
    u64         vruntime;
    u64         prev_sum_exec_runtime;
	......
};

对于实际运行时间转换为虚拟时间，通过如下函数来实现calc_delta_fair：

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
    if (unlikely(se->load.weight != NICE_0_LOAD))
        delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

    return delta;
}

有了这个虚拟时间，我们可以用来作为选择pick next进程的参考量，那么此参考量何时需要更新呢？


1.进程创建，加入运行队列时要更新

2.周期调度时要更新

调度延迟

调度延迟表示是保证系统中的每个进程都能执行一遍所需要的时间。它不是一个固定值，计算方式如下：


unsigned int sysctl_sched_latency = 6000000ULL;         //6ms
static unsigned int sched_nr_latency = 8;
unsigned int sysctl_sched_min_granularity = 750000ULL;   //0.75ms
 
static u64 __sched_period(unsigned long nr_running)
{
    u64 period = sysctl_sched_latency;
    unsigned long nr_latency = sched_nr_latency;

    if (unlikely(nr_running > nr_latency)) {
        period = sysctl_sched_min_granularity;
        period *= nr_running;
    }

    return period;
}

当系统处于就runnable的进程少于一个定值（默认值8）的时候，调度延迟也是固定一个值不变（默认值6ms）。当系统就绪态进程个数超过这个值时，我们保证每个进程至少运行一定的时间，那么调度延迟总时间就是nr_running * sysctl_sched_min_granularity。
当获取到一个调度延迟后，还需要计算一个进程分配的时间片，CFS调度器和以往的O(1)调度器不同，不在有固定的时间片，而是通过权重计算出的时间片。按照前面权重的介绍，通过

时间片=调度延迟总时间*进程权重/总权重

来计算一个进程的时间片。

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

    for_each_sched_entity(se) {
        struct load_weight *load;
        struct load_weight lw;

        cfs_rq = cfs_rq_of(se);
        load = &cfs_rq->load;

        if (unlikely(!se->on_rq)) {
            lw = cfs_rq->load;

            update_load_add(&lw, se->load.weight);
            load = &lw;
        }
        slice = __calc_delta(slice, se->load.weight, load);
    }
    return slice;
}

__calc_delta传入的参数分别是总时间，当前调度实体的权重，整个runqueue的总权重。因此每个进程的时间片是不固定的，而是根据权重和系统中进程的数量发生着变化。

与调度相关的场景

1.进程创建过程
2.周期调度
3.直接调度
后续文章将针对每个场景介绍调度相关的内容。

权重

虚拟时间

调度延迟

与调度相关的场景

猜你喜欢