On the Root Cause of Linux System Stutter: A CFS Scheduling Case Study

Linux is a system that feels laggy. Hold your fire and let me finish:

  • The lag comes from the fact that the Linux kernel's scheduler never cares about business scenarios!

The Linux kernel only sees the machine; it refuses to see the application. It works bottom-up, squeezing throughput out of the CPU, rather than top-down, improving user experience from the business side.

Personified, Linux is a good programmer, but not a good manager.

Everything has its causes: Linux was started by one programmer and built up by a crowd of programmers, with hardly a suit-wearing manager anywhere in sight.

What programmers talk about all day is performance, time complexity, cache hit rates, CPU, memory. What managers shout about all day is customers, customers, customers, and experience, experience, experience!


The night before last I got back to my place quite late, and a deputy manager surnamed Liu came to me with a question. He was debugging a message-queue component involving producers, consumers, and several other cooperating threads, all quite complex. After deploying it to production, he hit a strange problem:

  • The whole system's resources seemed monopolized by the component's threads; the message queue itself ran extremely efficiently, but the system was terribly sluggish!

I asked whether any cgroup or cpuset configuration was deployed; he said no.

I asked how many threads the component had in total; he said not many, fewer than 20.

I asked... he said...

Puzzled, I offered to log in to the machine and poke around myself, but he declined; he would only describe as many details as he could while I assisted remotely.

I don't understand message queues, nor any middleware; debugging any system of that kind is beyond me, regrettably. All I could do was try to explain and solve the problem from the operating system's angle. And clearly this was an operating-system problem; more specifically, a scheduling problem.

It involved producers and consumers; if I could reproduce the problem locally, I was sure I could solve it.

So I started googling the Linux scheduler subsystem for producer/consumer material: producer, consumer, schedule, cfs, linux kernel...

I chewed on the problem the whole next day; without a reproduction environment, all I could do was turn it over in spare moments.

After dinner, I found the following patch:
https://git.bricked.de/Bricked/flo/commit/4793241be408b3926ee00c704d7da3b3faf3a05f

Impact: improve/change/fix wakeup-buddy scheduling

Currently we only have a forward looking buddy, that is, we prefer to
schedule to the task we last woke up, under the presumption that its
going to consume the data we just produced, and therefore will have
cache hot benefits.

This allows co-waking producer/consumer task pairs to run ahead of the
pack for a little while, keeping their cache warm. Without this, we
would interleave all pairs, utterly trashing the cache.

This patch introduces a backward looking buddy, that is, suppose that
in the above scenario, the consumer preempts the producer before it
can go to sleep, we will therefore miss the wakeup from consumer to
producer (its already running, after all), breaking the cycle and
reverting to the cache-trashing interleaved schedule pattern.

The backward buddy will try to schedule back to the task that woke us
up in case the forward buddy is not available, under the assumption
that the last task will be the one with the most cache hot task around
barring current.

This will basically allow a task to continue after it got preempted.

In order to avoid starvation, we allow either buddy to get wakeup_gran
ahead of the pack.

This looked related to the CFS scheduler's LAST_BUDDY feature, which involves the run queue's next and last buddy pointers (sketched below). Having finally found this, I decided to design the simplest possible experiment to try to reproduce the problem and dig into the LAST_BUDDY feature.
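
Those buddy pointers sit directly on the CFS run queue. As a sketch (not the full struct), cfs_rq in kernels of this era carries roughly:

struct cfs_rq {
	/* ... */
	struct sched_entity *curr;  /* entity currently running */
	struct sched_entity *next;  /* forward buddy: the task we just woke */
	struct sched_entity *last;  /* backward buddy: the task that woke us */
	struct sched_entity *skip;  /* entity that called yield() */
	/* ... */
};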

My experiment is very simple:

  • A producer wakes up a consumer in a loop.

To make the wakeup as direct as possible, I wanted to call wake_up_process directly rather than go through heavyweight mechanisms like signals, so I needed kernel support. First, a character device to back the operation:

// wakedev.c
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/uaccess.h>

#define CMD_WAKE	122

dev_t dev = 0;
static struct class *dev_class;
static struct cdev wake_cdev;

static long _ioctl(struct file *file, unsigned int cmd, unsigned long arg);

static struct file_operations fops = {
	.owner          = THIS_MODULE,
	.unlocked_ioctl = _ioctl,
};

static long _ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	u32 pid = 0;
	struct task_struct *task = NULL;

	switch(cmd) {
		// The ioctl wake command: copy the target pid in from
		// userspace, look the task up and wake it directly.
		case CMD_WAKE:
			if (copy_from_user(&pid, (u32 *) arg, sizeof(pid)))
				return -EFAULT;
			task = pid_task(find_vpid(pid), PIDTYPE_PID);
			if (task) {
				wake_up_process(task);
			}
			break;
	}
	return 0;
}

static int __init crosswake_init(void)
{
	if (alloc_chrdev_region(&dev, 0, 1, "test_dev") < 0) {
		printk("alloc failed\n");
		return -1;
	}
	printk("major=%d minor=%d \n",MAJOR(dev), MINOR(dev));

	cdev_init(&wake_cdev, &fops);

	if ((cdev_add(&wake_cdev, dev, 1)) < 0) {
		printk("add failed\n");
		goto err_class;
	}

	/* class_create()/device_create() return ERR_PTR, not NULL, on failure */
	dev_class = class_create(THIS_MODULE, "etx_class");
	if (IS_ERR(dev_class)) {
		printk("class failed\n");
		goto err_class;
	}

	if (IS_ERR(device_create(dev_class, NULL, dev, NULL, "etx_device"))) {
		printk(KERN_INFO "create failed\n");
		goto err_device;
	}

	return 0;

err_device:
	class_destroy(dev_class);
err_class:
	unregister_chrdev_region(dev,1);
	return -1;
}

static void __exit crosswake_exit(void)
{
	device_destroy(dev_class,dev);
	class_destroy(dev_class);
	cdev_del(&wake_cdev);
	unregister_chrdev_region(dev, 1);
}

module_init(crosswake_init);
module_exit(crosswake_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("shabi");
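
The module builds with a standard kbuild Makefile; a minimal sketch, assuming the source file is named wakedev.c as above:

# Makefile (sketch) for the wakedev module
obj-m := wakedev.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules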

Build and load it, then create the device node:

[root@localhost test]# insmod ./wakedev.ko
[root@localhost test]# dmesg |grep major.*minor
[   68.385310] major=248 minor=0
[root@localhost test]# mknod /dev/test c 248 0

OK, next comes the producer program:

// producer.c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, j;
	int pid = -1;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <consumer-pid>\n", argv[0]);
		return 1;
	}

	pid = atoi(argv[1]);
	fd = open("/dev/test", O_RDWR);
	perror("open");

	while (1) {
		j = 0xfffff;
		ioctl(fd, 122, &pid); // CMD_WAKE: wake the consumer up
		while (j--) {}        // burn some cycles to simulate producing
	}
}

And then the consumer program:

// consumer.c
#include <stdio.h>
#include <unistd.h> // for sleep()

int main(void)
{
	while (1) {
		sleep(1); // block; the producer wakes us via wake_up_process()
	}
}

Compile and run them:

# The long names are there to stand out in the trace logs!
[root@localhost test]# gcc consumer.c -O0 -o consumerAconsumerACconsumerAconsumer
[root@localhost test]# gcc producer.c -O0 -o producerAproducerAproducerAproducer
# Start the consumer
[root@localhost test]# ./consumerAconsumerACconsumerAconsumer &
# Start the producer, pointing it at the consumer's pid
[root@localhost test]# ./producerAproducerAproducerAproducer 26274
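
(26274 is simply the consumer's pid on my machine; right after launching the consumer in the background you can capture it with the shell's $! variable.)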

That is roughly the whole experiment. I ran it many times and saw no stutter at all.

Disappointing, but I figured it was inevitable: if the problem were this easy to reproduce, the community would have fixed it long ago. Evidently there was no glaring bug waiting to be found here; perhaps just an unfortunate way of using the system.

Either way, let's look at the data first and dig for details in it.

First, with the experiment running, use perf to dump the context-switch activity:

[root@localhost test]# perf record -e sched:sched_switch -e sched:sched_wakeup -a -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.303 MB perf.data (1303 samples) ]
[root@localhost test]# perf script -i perf.data > final.data

Below is a fragment, one I managed to capture only after roughly ten attempts:

      loop_sleep  6642 [000] 23085.198476: sched:sched_switch: loop_sleep:6642 [120] R ==> xfsaild/dm-0:400 [120]
    xfsaild/dm-0   400 [000] 23085.198482: sched:sched_switch: xfsaild/dm-0:400 [120] S ==> loop_sleep:6642 [120]
      loop_sleep  6642 [000] 23085.217752: sched:sched_switch: loop_sleep:6642 [120] R ==> producerAproduc:26285 [130]
## The hogging starts here
 producerAproduc 26285 [000] 23085.220257: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.220259: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.220273: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.269921: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.269923: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.269927: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.292748: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.292749: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.292752: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.320205: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.320208: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.320212: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.340971: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.340973: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.340977: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.369630: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.369632: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.369637: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.400818: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.400821: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.400825: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.426043: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.426045: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.426048: sched:sched_switch: consumerAconsum:26274 [120] S ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.447646: sched:sched_wakeup: xfsaild/dm-0:400 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.447649: sched:sched_switch: producerAproduc:26285 [130] R ==> xfsaild/dm-0:400 [120]
## It has been far too long; the R-state task finally gives way!
## Only here does the hogging end!!!
    xfsaild/dm-0   400 [000] 23085.447654: sched:sched_switch: xfsaild/dm-0:400 [120] S ==> loop_sleep:6642 [120]
      loop_sleep  6642 [000] 23085.468047: sched:sched_switch: loop_sleep:6642 [120] R ==> producerAproduc:26285 [130]
 producerAproduc 26285 [000] 23085.469862: sched:sched_wakeup: consumerAconsum:26274 [120] success=1 CPU:000
 producerAproduc 26285 [000] 23085.469863: sched:sched_switch: producerAproduc:26285 [130] R ==> consumerAconsum:26274 [120]
 consumerAconsum 26274 [000] 23085.469867: sched:sched_switch: consumerAconsum:26274 [120] S ==> loop_sleep:6642 [120]
      loop_sleep  6642 [000] 23085.488800: sched:sched_switch: loop_sleep:6642 [120] R ==> producerAproduc:26285 [130]

So the producer and consumer really do team up to hog the CPU; it is just rare.

What on earth glues these two processes together like this? Now that is an interesting question.

The CFS scheduler has not kept its promise of "no tricks, stay simple": more and more heuristic features have been bolted on, retracing the old path of the O(1) scheduler!!

We can see these trick-style features in /sys/kernel/debug/sched_features:

[root@localhost test]# cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_POWER TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN NO_NUMA NUMA_FAVOUR_HIGHER NO_NUMA_RESIST_LOWER

Work through them one by one against the documentation if you like. Here we only care about LAST_BUDDY.

Reviewing the relevant code, we find this fragment on the wakeup path:

static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
	...
    if (wakeup_preempt_entity(se, pse) == 1) {
        /*
         * Bias pick_next to pick the sched entity that is
         * triggering this preemption.
         */
        if (!next_buddy_marked)
            set_next_buddy(pse);
        goto preempt;
    }	
    ...
preempt:
    resched_task(curr);
    /*
     * Only set the backward buddy when the current task is still
     * on the rq. This can happen when a wakeup gets interleaved
     * with schedule on the ->pre_schedule() or idle_balance()
     * point, either of which can * drop the rq lock.
     *
     * Also, during early boot the idle thread is in the fair class,
     * for obvious reasons its a bad idea to schedule back to it.
     */
    if (unlikely(!se->on_rq || curr == rq->idle))
        return;
	// The key! The task about to be preempted (current) is recorded as the last buddy
    if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
        set_last_buddy(se);
}
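
For completeness, set_last_buddy itself just records the entity (and, with group scheduling, its ancestors) in each cfs_rq's last pointer; in kernels of this era it reads roughly like this:

static void set_last_buddy(struct sched_entity *se)
{
	/* never mark a SCHED_IDLE task as a buddy */
	if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
		return;

	/* tag 'last' at every level of the group hierarchy */
	for_each_sched_entity(se)
		cfs_rq_of(se)->last = se;
}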

And this fragment on the pick-next path:

/* These comments alone tell the story:
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	...
	/*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    // next is checked after last, so next takes priority!
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;

    /*
     * Someone really wants this to run. If it's not unfair, run it.
     */
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;

    clear_buddies(cfs_rq, se);

    return se;
}

Here lies the suspicious part.

Let P be the producer and C the consumer. Then:

  • P wakes C, so P is set as the last buddy.
  • C runs.
  • C finishes its work and blocks.
  • The CPU reschedules: pick next.
  • last beats the leftmost entity and wins:
    last's vruntime is larger than leftmost's, but by less than one wakeup granularity, so last is preferred!

That last point needs unpacking; the comment block above the kernel function wakeup_preempt_entity explains it best.
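
A sketch of the function as it reads in fair.c of kernels from this era (quoted from memory; treat it as illustrative):

/*
 * Should 'se' preempt 'curr'.
 *
 *             |s1
 *        |s2
 *   |s3
 *         g
 *      |<--->|c
 *
 *  w(c, s1) = -1
 *  w(c, s2) =  0
 *  w(c, s3) =  1
 */
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)                // se is not ahead of curr: no preemption
		return -1;

	gran = wakeup_gran(curr, se);  // one wakeup granularity, weight-scaled
	if (vdiff > gran)              // ahead by more than a granularity: preempt
		return 1;

	return 0;                      // ahead, but within the granularity
}

Read against pick_next_entity above: wakeup_preempt_entity(cfs_rq->last, left) < 1 means the last buddy wins whenever its vruntime does not trail leftmost's by more than one wakeup granularity, which is exactly the window our producer keeps hitting.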

At this point the solution was already in hand; roughly:

  • Disable the LAST_BUDDY feature.

And that takes a single command:

[root@localhost test]# echo NO_LAST_BUDDY >/sys/kernel/debug/sched_features
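
To confirm the toggle took effect, read the file back; the feature now lists as negated:

[root@localhost test]# grep -o NO_LAST_BUDDY /sys/kernel/debug/sched_features
NO_LAST_BUDDY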

Having reported back to the deputy manager, my own curiosity remained unsatisfied; I still wanted to see the thing through.

To dig out how it happens, perf alone is not enough; systemtap has to step in.

The following script probes what the details look like when a process is woken up and when the scheduler switches:

#!/usr/bin/stap -g
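#
# Trace the buddy bookkeeping around every wakeup and pick-next:
#  - check_preempt_wakeup: who wakes whom, and both vruntimes
#  - set_last_buddy / __clear_buddies_last: buddy pointer changes
#  - pick_next_entity: the last buddy vs. the leftmost entity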

global g_cfs_rq;

probe begin {
	g_cfs_rq = 0;
}

function container_of_entity:long(se:long)
{
	offset = &@cast(0, "struct task_struct")->se;
	return se - offset;
}

function container_to_entity:long(task:long)
{
	offset = &@cast(0, "struct task_struct")->se;
	return task +  offset;
}

function entity_to_rbnode:long(rb:long)
{
	offset = &@cast(0, "struct sched_entity")->run_node;
	return rb - offset;
}

function print_task(s:string, se:long, verbose:long, min_vruntime:long)
{
	my_q = @cast(se, "struct sched_entity")->my_q;
	if(my_q == 0) {
		t_se = container_of_entity(se);
		printf("%8s %p  %s %d \n", s, t_se, task_execname(t_se), task_pid(t_se));
	}
}

probe kernel.function("pick_next_task_fair")
{
	printf("--------------- begin pick_next_task_fair --------------->\n");
	g_cfs_rq = &$rq->cfs;
}

probe kernel.function("pick_next_entity")
{
	if (g_cfs_rq == 0)
		next;

	printf("------- begin pick_next_entity ------->\n");
	cfsq = g_cfs_rq;
	vrun_first = 0;
	vrun_last = 0;
	last = @cast(cfsq, "struct cfs_rq")->last;
	if (last) {
		my_q = @cast(last, "struct sched_entity")->my_q;
		if(my_q != 0) {
			cfsq = @cast(last, "struct sched_entity")->my_q;
			last = @cast(cfsq, "struct cfs_rq")->last;
		}
		t_last = container_of_entity(last);
		vrun_last = @cast(last, "struct sched_entity")->vruntime;
		printf("LAST:[%s] vrun:%d\t", task_execname(t_last), vrun_last);
	}
	firstrb = @cast(cfsq, "struct cfs_rq")->rb_leftmost;
	if (firstrb) {
		firstse = entity_to_rbnode(firstrb);
		my_q = @cast(firstse, "struct sched_entity")->my_q;
		if(my_q != 0) {
			firstrb = @cast(my_q, "struct cfs_rq")->rb_leftmost;
			firstse = entity_to_rbnode(firstrb);
		}
		t_first = container_of_entity(firstse);
		vrun_first = @cast(firstse, "struct sched_entity")->vruntime;
		printf("FIRST:[%s] vrun:%d\t", task_execname(t_first), vrun_first);
	}
	if (last && firstrb) {
		printf("delta: %d\n", vrun_last - vrun_first);
	} else {
		printf("delta: N/A\n");
	}
	printf("<------- end pick_next_entity -------\n");
	printf("###################\n");
}

probe kernel.function("pick_next_task_fair").return
{
	if($return != 0) {
		se = &$return->se;
		t_se = container_of_entity(se);
		t_curr = task_current();
		printf("Return task: %s[%d]  From current: %s[%d]\n", task_execname(t_se), task_pid(t_se), task_execname(t_curr), task_pid(t_curr));
	}

	printf("<--------------- end pick_next_task_fair ---------------\n");
	printf("###########################################################\n");
	g_cfs_rq = 0;
}

probe kernel.function("set_last_buddy")
{
	se_se = $se;
	print_task("=== set_last_buddy", se_se, 0, 0);
}

probe kernel.function("__clear_buddies_last")
{
	se_se = $se;
	print_task("=== __clear_buddies_last", se_se, 0, 0);
}

probe kernel.function("check_preempt_wakeup")
{
	printf("--------------- begin check_preempt_wakeup --------------->\n");
	_cfs_rq = &$rq->cfs;
	min_vruntime = @cast(_cfs_rq, "struct cfs_rq")->min_vruntime;
	# sched_nr_latency is a kernel-side variable; read it via @var() (needs debuginfo)
	ok = @cast(_cfs_rq, "struct cfs_rq")->nr_running - @var("sched_nr_latency");
	t_curr = task_current();
	t_se = $p;
	se_curr = container_to_entity(t_curr);
	se_se = container_to_entity(t_se);
	vrun_curr = @cast(se_curr, "struct sched_entity")->vruntime;
	vrun_se = @cast(se_se, "struct sched_entity")->vruntime;

	printf("curr wake:[%s]   woken:[%s]\t", task_execname(t_curr), task_execname(t_se));
	printf("UUUUU curr:%d  se:%d min:%d\t", vrun_curr, vrun_se, min_vruntime);
	printf("VVVVV delta:%d   %d\n", vrun_curr - vrun_se, ok);
}

probe kernel.function("check_preempt_wakeup").return
{
	printf("<--------------- end check_preempt_wakeup ---------------\n");
	printf("###########################################################\n");
}
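
The shebang already passes -g (guru mode), so assuming the script is saved as buddy.stp it can be run directly:

[root@localhost test]# chmod +x ./buddy.stp
[root@localhost test]# ./buddy.stp > stap.log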

I reran the experiment and the script several times and finally caught a case; look at the output below:

.....
--------------- begin check_preempt_wakeup --------------->
curr wake:[producerAproduc]   woken:[consumerAconsum]   UUUUU curr:17886790442766  se:17886787442766 min:20338095223270 VVVVV delta:3000000   1
=== set_last_buddy 0xffff8800367b4a40  producerAproduc 26285
<--------------- end check_preempt_wakeup ---------------
###########################################################
--------------- begin pick_next_task_fair --------------->
------- begin pick_next_entity ------->
LAST:[producerAproduc] vrun:17886790442766  FIRST:[consumerAconsum] vrun:17886787442766 delta: 3000000
<------- end pick_next_entity -------
###################
Return task: consumerAconsum[26274]  From current: producerAproduc[26285]
<--------------- end pick_next_task_fair ---------------
###########################################################
--------------- begin pick_next_task_fair --------------->
------- begin pick_next_entity ------->
#[Note the case here!]
#[loop_sleep was about to get the CPU, but producerAproduc, pinned as last when it was preempted earlier, cuts in line!!]
LAST:[producerAproduc] vrun:17886790442766  FIRST:[loop_sleep] vrun:17886790410519  delta: 32247
<------- end pick_next_entity -------
###################
=== __clear_buddies_last 0xffff8800367b4a40  producerAproduc 26285
#[loop_sleep is passed over; producerAproduc is chosen]
Return task: producerAproduc[26285]  From current: consumerAconsum[26274]
<--------------- end pick_next_task_fair ---------------
###########################################################
--------------- begin pick_next_task_fair --------------->
------- begin pick_next_entity ------->
FIRST:[loop_sleep] vrun:17886790410519  delta: N/A
<------- end pick_next_entity -------
###################
Return task: loop_sleep[4227]  From current: producerAproduc[26285]
<--------------- end pick_next_task_fair ---------------
.......

What this capture shows: the loop_sleep process that should have run got beaten to the CPU, purely for CPU cache-affinity reasons, by a process pinned earlier as the last buddy. From the CPU's point of view this maximizes cache utilization; from the business point of view it is unreasonable.

Because the code paths and timestamps are tightly coupled, it is hard to construct a scenario where LAST_BUDDY preemption persists continuously. But a single case like the one above is evidence enough: once such preemptions synchronize globally into resonance, you get exactly the picture of coupled processes hogging the CPU and the system stuttering.

Overall, the LAST_BUDDY feature gives the leftmost node of the CFS red-black tree a powerful rival, and the leftmost node may well lose the contest and be denied the CPU. Doesn't that squander the very benefit CFS prides itself on, fast task selection from an elegantly simple red-black tree?

The red-black tree has been pecked full of holes by one heuristic feature after another, like woodpeckers at work!

Recall that the O(1) scheduler was dethroned precisely because ever more heuristic tricks, compensating here and penalizing there, pushed the next generation forward; after transitional designs like RSDL (The Rotating Staircase Deadline Scheduler), nothing turned out as simple and direct as CFS. Now CFS itself is on the same trajectory of growing complex and bloated. In effect, the CFS scheduler has ended up as one big pile of trade-offs.

Next, a few remarks of my own.

Perhaps loop_sleep was a process urgently needing to run, one that might save the manager from fire and flood. An application can afford to ignore CPU cache utilization, but it cannot ignore its business logic and the consequences of delayed execution!

Of course, Linux as a general-purpose OS never claimed to be a real-time OS. But even granting all that, stutter is a terrible experience for users. Windows burns more power and switches threads more often, yet it feels silky smooth in use. That is the difference, and the difference is more than a pair of leather shoes and a suit.

Look again at that big pile of features in /sys/kernel/debug/sched_features. Tuning parameters well is not a skill everyone has; not even the manager. As for combining so many features into something balanced for the general case, I doubt even guru-level experts could manage it, all the more so since many of the features work against each other when combined.

A single dynamic priority-boosting scheduling algorithm could address all of these problems; judged purely by user experience, the Windows scheduler does this very well.

As for the LAST_BUDDY feature, before the pick-next ordering was corrected in kernel 2.6.39-rc1, it was arguably outright buggy:

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = __pick_next_entity(cfs_rq);
	struct sched_entity *left = se;

	// next is checked first here...
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	/*
	 * Prefer last buddy, try to return the CPU to a preempted task.
	 */
	// ...and last is checked after, so last can override the next just chosen, causing a nasty ping-pong effect!
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;

	clear_buddies(cfs_rq, se);

	return se;
}

I shall laugh aloud, and I shall sing.


Wenzhou, Zhejiang, where leather shoes get wet; rain seeps in, but they never swell.

Reposted from blog.csdn.net/dog250/article/details/105710571