Notes on analyzing a soft lockup triggered by an LTP test, and how it relates to non-preemptible kernel mode

First, what is a soft lockup? The kernel documentation defines it as follows:

A ‘softlockup’ is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds (see “Implementation” below for details), without giving other tasks a chance to run. The current stack trace is displayed upon detection and, by default, the system will stay locked up. Alternatively, the kernel can be configured to panic; a sysctl, “kernel.softlockup_panic”, a kernel parameter, “softlockup_panic” (see “Documentation/kernel-parameters.txt” for details), and a compile option, “BOOTPARAM_SOFTLOCKUP_PANIC”, are provided for this.
A ‘hardlockup’ is defined as a bug that causes the CPU to loop in kernel mode for more than 10 seconds (see “Implementation” below for details), without letting other interrupts have a chance to run. Similarly to the softlockup case, the current stack trace is displayed upon detection and the system will stay locked up unless the default behavior is changed, which can be done through a sysctl, ‘hardlockup_panic’, a compile time knob, “BOOTPARAM_HARDLOCKUP_PANIC”, and a kernel parameter, “nmi_watchdog”.

Let's first look at how soft lockups are detected:

What is the watchdog?

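On Linux, the kernel starts one real-time thread per CPU, named watchdog/N, running at the highest real-time priority. That is real-time priority 99; the value 139 is presumably what ps shows in its PRI column for such a thread.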

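The watchdog thread on each CPU periodically refreshes a per-CPU variable, watchdog_touch_ts. By default, if this variable has not been refreshed for 20 seconds (2 * /proc/sys/kernel/watchdog_thresh, configurable), meaning the watchdog thread has not been scheduled for 20 seconds, a soft lockup is reported: the kernel prints "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]", where %s is current->comm.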
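The code that decides whether a soft lockup has occurred compares the current time with touch_ts, the timestamp recorded the last time the watchdog thread ran.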
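
The check described above can be sketched in a few lines of Python. This is a simplified user-space model, not the kernel source: the function names, the second-granularity timestamps, and the default watchdog_thresh of 10 are assumptions made for illustration.

```python
WATCHDOG_THRESH = 10  # assumed default of /proc/sys/kernel/watchdog_thresh, in seconds


def softlockup_thresh(watchdog_thresh=WATCHDOG_THRESH):
    """Soft lockup threshold in seconds: twice watchdog_thresh, as stated in the text."""
    return 2 * watchdog_thresh


def is_softlockup(now, touch_ts, watchdog_thresh=WATCHDOG_THRESH):
    """Return how long the CPU has been stuck (seconds), or 0 if there is no lockup.

    touch_ts is the timestamp last written by the watchdog thread.
    A watchdog_thresh of 0 is treated as "detection disabled" in this model.
    """
    if watchdog_thresh and now > touch_ts + softlockup_thresh(watchdog_thresh):
        return now - touch_ts
    return 0


if __name__ == "__main__":
    print(is_softlockup(now=115, touch_ts=100))  # watchdog ran 15 s ago: prints 0
    print(is_softlockup(now=122, touch_ts=100))  # watchdog ran 22 s ago: prints 22
```

A non-zero return value is what would fill the "stuck for %us" field of the message in this model.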

Who wakes up the watchdog?
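As described above, the watchdog thread periodically refreshes watchdog_touch_ts, so the thread itself has to be woken up periodically. The kernel creates an hrtimer for each CPU that fires every sample_period (unless a hard lockup has occurred). By default this value is 2 * (/proc/sys/kernel/watchdog_thresh) / 5 = 4 seconds.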
The hrtimer is created per CPU.

Its firing period is set to sample_period.
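
The arithmetic behind sample_period can be written out as a short sketch. The function name sample_period_ns and the choice of nanoseconds as the unit are assumptions of this example; only the formula (twice watchdog_thresh, divided by 5) comes from the text.

```python
NSEC_PER_SEC = 1_000_000_000


def sample_period_ns(watchdog_thresh):
    """hrtimer period in nanoseconds: the soft lockup threshold divided by 5."""
    softlockup_thresh = 2 * watchdog_thresh  # seconds
    return softlockup_thresh * (NSEC_PER_SEC // 5)


if __name__ == "__main__":
    for thresh in (10, 30):
        period = sample_period_ns(thresh)
        print(f"watchdog_thresh={thresh}s -> sample_period={period // NSEC_PER_SEC}s")
```

With the default watchdog_thresh of 10 this gives the 4-second period mentioned above; raising watchdog_thresh to 30 would stretch it to 12 seconds.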

The hrtimer's handler function wakes up the watchdog thread.

The watchdog gets plenty of chances

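As shown above, a soft lockup is declared if the watchdog thread has not refreshed the timestamp within 20 seconds, while the hrtimer wakes the watchdog thread every 4 seconds. The thread is therefore woken 5 times within those 20 seconds. If it still has not been scheduled after 5 wakeups, something is wrong.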
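
The timeline can be replayed with a small simulation. It is an idealized model under stated assumptions: a single CPU, a watchdog thread that never gets to run after time 0, exact timer ticks, and a strict "greater than" comparison against the threshold. The function name first_soft_lockup_report is hypothetical.

```python
def first_soft_lockup_report(watchdog_thresh=10, max_ticks=100):
    """Return (tick, stuck_seconds) for the first timer tick that reports a lockup.

    Times are kept in integer milliseconds to avoid rounding. Returns None if
    no tick within max_ticks exceeds the threshold.
    """
    thresh_ms = 2 * watchdog_thresh * 1000  # soft lockup threshold (20 s by default)
    period_ms = thresh_ms // 5              # hrtimer period (4 s by default)
    touch_ts = 0                            # last refresh by the watchdog thread
    for tick in range(1, max_ticks + 1):
        now = tick * period_ms
        # The timer handler wakes the watchdog thread here, but in this
        # scenario the thread never gets the CPU, so touch_ts stays unchanged.
        if now > touch_ts + thresh_ms:
            return tick, (now - touch_ts) // 1000
    return None


if __name__ == "__main__":
    tick, stuck = first_soft_lockup_report()
    print(f"reported on tick {tick}, stuck for {stuck}s")
```

In this model the five ticks inside the 20-second window pass without a report, and the next tick is the first one to exceed the threshold, so the script prints "reported on tick 6, stuck for 24s". The exact figure on a real system depends on timer and timestamp granularity.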

Why is the watchdog thread given the highest priority?

If even the highest-priority thread cannot get scheduled, something has really gone wrong.

Reposted from blog.csdn.net/yiyeguzhou100/article/details/103093496