1. Problem phenomenon
The business process (a user-mode multi-threaded program) hangs and stops responding, yet the system log shows nothing abnormal. Judging from the kernel-mode stacks of the process, every thread appears stuck in the same kernel flow:
[root@vmc116 ~]# cat /proc/27007/task/11825/stack
[] retint_careful+0x14/0x32
[ ] 0xffffffffffffffff
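To take this kind of snapshot across all threads at once, a small loop over /proc helps. This is a sketch, not a command from the original investigation: it defaults to the current shell so it runs unprivileged (for the case above you would pass 27007), and note that reading /proc/&lt;pid&gt;/task/&lt;tid&gt;/stack usually requires root:

```shell
#!/bin/sh
# Print the state and (if readable) kernel stack of every thread of a process.
pid=${1:-$$}                  # defaults to this shell for a harmless demo
for task in /proc/"$pid"/task/*; do
    tid=${task##*/}
    # field 3 of /proc/<tid>/stat is the state (R, S, D, ...)
    state=$(awk '{print $3}' "$task"/stat)
    echo "== tid $tid (state $state) =="
    cat "$task"/stack 2>/dev/null || echo "(stack unreadable without root)"
done
```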
2. Problem analysis
2.1 Kernel stack analysis
From the kernel stacks, every thread is blocked at retint_careful, which lies on the interrupt-return path. The code (assembly) is as follows:
entry_64.S:
ret_from_intr:
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
decl PER_CPU_VAR(irq_count)
/* Restore saved previous stack */
popq %rsi
CFI_DEF_CFA rsi,SS+8-RBP /* reg/off reset after def_cfa_expr */
leaq ARGOFFSET-RBP(%rsi), %rsp
CFI_DEF_CFA_REGISTER rsp
CFI_ADJUST_CFA_OFFSET RBP-ARGOFFSET
...
retint_careful: // interrupt return
CFI_RESTORE_STATE
bt $TIF_NEED_RESCHED,%edx
jnc retint_signal
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi %rdi
SCHEDULE_USER // scheduling point
popq_cfi %rdi
GET_THREAD_INFO(%rcx)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp retint_check
This is the flow of returning to user mode after the user-mode process was interrupted. Combining the offset retint_careful+0x14/0x32 with the disassembly confirms that the blocking point is actually SCHEDULE_USER,
which calls schedule(). In other words, on the interrupt-return path the kernel finds that rescheduling is needed (TIF_NEED_RESCHED is set), so the thread is switched out right here.
One question: why is there no schedule() stack frame visible in the stack?
Because schedule() is invoked directly from assembly here, without the usual stack-frame push and context-saving operations, so the unwinder cannot show it.
2.2 Running-state analysis
The output of top shows that the relevant threads are in fact in the R state, the CPUs are almost fully consumed, and most of the time is spent in user mode:
[root@vmc116 ~]# top
top - 09:42:23 up 16 days, 2:21, 23 users, load average: 84.08, 84.30, 83.62
Tasks: 1037 total, 85 running, 952 sleeping, 0 stopped, 0 zombie
Cpu(s): 97.6%us, 2.2%sy, 0.2%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32878852k total, 32315464k used, 563388k free, 374152k buffers
Swap: 35110904k total, 38644k used, 35072260k free, 28852536k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27074 root 20 0 5316m 163m 14m R 10.2 0.5 321:06.17 z_itask_templat
27084 root 20 0 5316m 163m 14m R 10.2 0.5 296:23.37 z_itask_templat
27085 root 20 0 5316m 163m 14m R 10.2 0.5 337:57.26 z_itask_templat
27095 root 20 0 5316m 163m 14m R 10.2 0.5 327:31.93 z_itask_templat
27102 root 20 0 5316m 163m 14m R 10.2 0.5 306:49.44 z_itask_templat
27113 root 20 0 5316m 163m 14m R 10.2 0.5 310:47.41 z_itask_templat
25730 root 20 0 5316m 163m 14m R 10.2 0.5 283:03.37 z_itask_templat
30069 root 20 0 5316m 163m 14m R 10.2 0.5 283:49.67 z_itask_templat
13938 root 20 0 5316m 163m 14m R 10.2 0.5 261:24.46 z_itask_templat
16326 root 20 0 5316m 163m 14m R 10.2 0.5 150:24.53 z_itask_templat
6795 root 20 0 5316m 163m 14m R 10.2 0.5 100:26.77 z_itask_templat
27063 root 20 0 5316m 163m 14m R 9.9 0.5 337:18.77 z_itask_templat
27065 root 20 0 5316m 163m 14m R 9.9 0.5 314:24.17 z_itask_templat
27068 root 20 0 5316m 163m 14m R 9.9 0.5 336:32.78 z_itask_templat
27069 root 20 0 5316m 163m 14m R 9.9 0.5 338:55.08 z_itask_templat
27072 root 20 0 5316m 163m 14m R 9.9 0.5 306:46.08 z_itask_templat
27075 root 20 0 5316m 163m 14m R 9.9 0.5 316:49.51 z_itask_templat
...
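The same view can be narrowed to one process's threads with ps -L (or top -H). A sketch; for the case above you would pass 27007, and the default of the current shell is only so the command runs anywhere:

```shell
# One row per thread (LWP): thread id, state, %CPU, command.
pid=${1:-$$}
ps -L -p "$pid" -o lwp,stat,pcpu,comm
```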
2.3 Process scheduling information
Sampling the scheduling statistics of one of the threads several times:
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15681811525768 129628804592612 3557465
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15682016493013 129630684625241 3557509
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15682843570331 129638127548315 3557686
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15683323640217 129642447477861 3557793
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15683698477621 129645817640726 3557875
The counters keep increasing, which shows the threads are continually being scheduled and run. Combined with their state remaining R, we can infer that they are most likely spinning in a user-mode infinite loop (or some non-sleeping livelock).
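The three schedstat fields are cumulative time on the CPU (ns), time spent waiting on the run queue (ns), and the number of timeslices run. Comparing the first and last samples captured above makes the growth explicit:

```shell
# Delta between the first and last schedstat samples shown above.
# Fields: on-CPU time (ns), run-queue wait (ns), timeslices run.
printf '%s\n' "15681811525768 129628804592612 3557465" \
              "15683698477621 129645817640726 3557875" |
awk 'NR == 1 { t = $1; s = $3 }
     NR == 2 { printf "ran %.1f ms more over %d extra timeslices\n",
                      ($1 - t) / 1e6, $3 - s }'
# -> ran 1887.0 ms more over 410 extra timeslices
```

Almost two seconds of additional CPU time across a few seconds of sampling: the thread is very much alive, just never making progress.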
Another question: why does top show each thread at only about 10% CPU, instead of the 100% you would expect from a spinning thread?
Because there are many threads and they all have the same priority. Under the CFS scheduler the available CPU time is divided fairly among them, so no single thread can monopolize a CPU; the threads run in turn and together consume all of the CPUs.
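A rough sanity check of that ~10% figure (the core count is not in the captures, so 8 cores is an assumption for illustration): with about 85 runnable threads of equal weight, a fair split gives each thread

```shell
# Fair CPU share per thread: 100% * ncpu / nthreads (8 cores is assumed).
awk -v ncpu=8 -v nthreads=85 \
    'BEGIN { printf "~%.1f%% per thread\n", 100 * ncpu / nthreads }'
# -> ~9.4% per thread
```

which is in the same ballpark as the ~10% per thread that top reports.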
Another question: why does the kernel not detect a softlockup in this case?
Because the business threads run at a normal priority, they cannot starve the watchdog kernel thread (a highest-priority real-time thread) of CPU time, so no softlockup is reported.
Another question: why does every stack sample show the thread blocked at retint_careful and nowhere else?
Because interrupt return is where scheduling happens (setting other preemption points aside for the moment), and sampling a thread's stack itself depends on scheduling: each time the cat command reading the stack gets to run, the target thread has just been switched out on the interrupt-return path, so the blocking point we observe is always retint_careful.
2.4 User-mode analysis
The analysis above points to a user-mode infinite loop rather than a lock-wait deadlock: if the threads were deadlocked on a mutex, they would not be running but sleeping in the kernel on a futex. For comparison, a thread blocked on a mutex shows a kernel stack like this:
Detaching from program: /sbc/wq/some_func/share_mem_mutex, process 29758
[root@localhost ~]# cat /proc/29758/stack
[<ffffffff810a4179>] futex_wait_queue_me+0xb9/0xf0
[<ffffffff810a53a8>] futex_wait+0x1f8/0x390
[<ffffffff810a6c11>] do_futex+0x121/0xb10
[<ffffffff810a767b>] sys_futex+0x7b/0x170
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
At this point the user-mode side can be confirmed as follows:
deploy the debug symbols, attach gdb to the relevant process, inspect the thread stacks, and analyze them against the code logic.
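That confirmation step can be sketched as a small wrapper around gdb (assuming gdb and the debug symbols are installed; the script name bt_all.sh is made up here):

```shell
#!/bin/sh
# Snapshot user-mode backtraces of all threads of a running process.
# Usage: bt_all.sh <pid>   (e.g. the hung process from this case)
pid=$1
if [ -z "$pid" ]; then
    echo "usage: bt_all.sh <pid>"
    exit 0
fi
# -batch runs the -ex commands, prints the result, and detaches.
exec gdb -p "$pid" -batch \
         -ex 'set pagination off' \
         -ex 'thread apply all bt'
```

For a spinning thread, repeated snapshots will show user-mode frames that keep cycling through the same loop body instead of sitting in a wait.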
It was finally confirmed that the problem was indeed an infinite loop in the user-mode process.