What should I do if a Linux process is stuck?

When using Linux, if network or disk I/O runs into a problem, a process can get stuck: even kill -9 cannot kill it, and many commonly used debugging tools, such as strace and pstack, stop working as well. What is going on?

At this point, if we use ps to view the process list, the state of the stuck process is shown as D.

In man ps, the D state is described as Uninterruptible Sleep.

Linux processes have two sleep states:

  1. Interruptible Sleep, shown as S by the ps command. A process in this state can be woken up by sending it a signal.
  2. Uninterruptible Sleep, shown as D by the ps command. A process in this state cannot immediately handle any signal sent to it, which is why kill cannot kill it.
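
To spot such processes quickly, one option is to filter the ps output on the state column. The following is a minimal sketch, assuming a procps-style ps that accepts -eo with the pid, stat, wchan, and comm columns; the function names filter_d_state and list_d_state are made up for this example.

```shell
# Keep the header line plus every row whose STAT column (2nd field) starts
# with D. Reads "ps -eo pid,stat,wchan:32,comm"-style output on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List all processes currently in uninterruptible sleep, together with the
# kernel function they are sleeping in (the WCHAN column).
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Calling list_d_state prints only the header when nothing is in the D state. Processes that show up briefly and then disappear are usually just doing normal disk I/O.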

There is an answer on Stack Overflow:

kill -9 just sends a SIGKILL signal. When a process is in a special state (handling a signal, or inside a system call), it cannot handle any signals, not even SIGKILL, so the process cannot be killed immediately. This is what we usually call the D state (uninterruptible sleep). Commonly used debugging tools (such as strace and pstack) are generally implemented on top of a special signal, so they cannot be used in this state either.

This shows that a process in the D state is generally inside a kernel-mode system call. So how do we find out which system call it is and what it is waiting for? Fortunately, Linux provides procfs (the /proc directory), through which you can see the current kernel call stack of any process. Next, we simulate this with a process accessing JuiceFS (the JuiceFS client is based on FUSE, a user-space file system, which makes I/O failures easier to simulate).

First mount JuiceFS in the foreground (add the -f parameter to the ./juicefs mount command), then press Ctrl+Z to stop the process. Now use ls /jfs to access the mount point, and you will find that ls is stuck.

The following command shows that ls is stuck in the vfs_fstatat call: it sends a getattr request to the FUSE device and waits for the response. Because we have stopped the JuiceFS client process, it stays stuck:

$ cat /proc/`pgrep ls`/stack
[<ffffffff813277c7>] request_wait_answer+0x197/0x280
[<ffffffff81327d07>] __fuse_request_send+0x67/0x90
[<ffffffff81327d57>] fuse_request_send+0x27/0x30
[<ffffffff8132b0ac>] fuse_simple_request+0xcc/0x1a0
[<ffffffff8132c0f0>] fuse_do_getattr+0x120/0x330
[<ffffffff8132df28>] fuse_update_attributes+0x68/0x70
[<ffffffff8132e33d>] fuse_getattr+0x3d/0x50
[<ffffffff81220c6f>] vfs_getattr_nosec+0x2f/0x40
[<ffffffff81220ee6>] vfs_getattr+0x26/0x30
[<ffffffff81220fc8>] vfs_fstatat+0x78/0xc0
[<ffffffff8122150e>] SYSC_newstat+0x2e/0x60
[<ffffffff8122169e>] SyS_newstat+0xe/0x10
[<ffffffff8186281b>] entry_SYSCALL_64_fastpath+0x22/0xcb
[<ffffffffffffffff>] 0xffffffffffffffff
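
When comparing several stuck processes, the addresses and offsets in these traces are mostly noise. The sketch below reduces a trace to bare function names. It assumes the line format shown above and GNU or BSD sed with -E; the function name stack_funcs is made up for this example.

```shell
# Reduce /proc/<pid>/stack lines such as
#   [<ffffffff81327d07>] __fuse_request_send+0x67/0x90
# to bare function names, one per line, innermost frame first.
# The trailing "[<ffffffffffffffff>] 0xffffffffffffffff" marker is dropped.
stack_funcs() {
    sed -E \
        -e 's/^\[<[0-9a-f]+>\] //' \
        -e 's/\+0x[0-9a-f]+\/0x[0-9a-f]+.*$//' \
        -e '/^0x[0-9a-f]+$/d'
}
```

For example, cat /proc/1592/stack | stack_funcs (run as root, since the stack file is normally readable only by root) would print request_wait_answer on the first line for a process blocked on a FUSE request.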

At this point, pressing Ctrl+C does not make it exit.

root@localhost:~# ls /jfs
^C
^C^C^C^C^C

But attaching strace wakes it up: it starts processing the earlier interrupt signals and then exits.

root@localhost:~# strace -p `pgrep ls`
strace: Process 26469 attached
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=13290, si_uid=0} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
...
tgkill(26469, 26469, SIGINT)            = 0
--- SIGINT {si_signo=SIGINT, si_code=SI_TKILL, si_pid=26469, si_uid=0} ---
+++ killed by SIGINT +++

At this point, kill -9 can also kill it:

root@localhost:~# ls /jfs
^C
^C^C^C^C^C
^C^CKilled

This works because a simple call like vfs_fstatat() does not block signals such as SIGKILL, SIGQUIT, and SIGABRT, so they can still be handled in the usual way.

Let's simulate a more complex I/O error: configure JuiceFS with a storage backend that cannot be written to, mount it, and try to write data into it with cp. cp also gets stuck:

root@localhost:~# cat /proc/`pgrep cp`/stack
[<ffffffff813277c7>] request_wait_answer+0x197/0x280
[<ffffffff81327d07>] __fuse_request_send+0x67/0x90
[<ffffffff81327d57>] fuse_request_send+0x27/0x30
[<ffffffff81331b3f>] fuse_flush+0x17f/0x200
[<ffffffff81218fd2>] filp_close+0x32/0x80
[<ffffffff8123ac53>] __close_fd+0xa3/0xd0
[<ffffffff81219043>] SyS_close+0x23/0x50
[<ffffffff8186281b>] entry_SYSCALL_64_fastpath+0x22/0xcb
[<ffffffffffffffff>] 0xffffffffffffffff

Why is it stuck in __close_fd()? Because writing data to JuiceFS is asynchronous. When cp calls write(), the data is cached in the JuiceFS client process and written to the backend storage asynchronously. After cp has written the data, it calls close() to make sure the write has completed, which corresponds to the FUSE flush operation. When the JuiceFS client receives a flush, it has to ensure that all written data is persisted to the backend storage. Since writes to the backend storage are failing, the client keeps retrying, so the flush is stuck and no reply has been sent to cp yet, which is why cp is stuck too.

At this point, Ctrl+C or kill can still interrupt cp, because JuiceFS implements interrupt handling for the various file system operations: it abandons the operation in progress (for example flush) and returns EINTR. This makes it possible to interrupt applications that access JuiceFS when network failures occur.
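
Because an ordinary signal is enough to interrupt the operation while the client is healthy, a script can put a deadline on a copy instead of waiting for someone to press Ctrl+C. This is a minimal sketch, assuming the coreutils timeout utility is available; the function name run_with_deadline and the 5-second grace period are choices made for this example.

```shell
# Run a command with a deadline: send SIGTERM (what plain kill sends) after
# $1 seconds, then SIGKILL 5 seconds later if the command is still running.
# Assumes the coreutils `timeout` utility is installed.
run_with_deadline() {
    secs=$1
    shift
    timeout -k 5 "$secs" "$@"
}
```

For instance, run_with_deadline 60 cp parity /jfs/aaa gives the copy one minute, and an exit status of 124 means SIGTERM was sent at the deadline. This only helps while the file system client can still answer interrupt requests. If it cannot, the command stays in the D state regardless of which signal is sent.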

If I now stop the JuiceFS client process so that it can no longer handle any FUSE requests (including interrupt requests), cp can no longer be killed, not even with kill -9. Viewing the process with ps shows that it is now in the D state.

root      1592  0.1  0.0  20612  1116 pts/3    D+   12:45   0:00 cp parity /jfs/aaa

But cat /proc/1592/stack can still show its kernel call stack:

root@localhost:~# cat /proc/1592/stack
[<ffffffff8132775d>] request_wait_answer+0x12d/0x280
[<ffffffff81327d07>] __fuse_request_send+0x67/0x90
[<ffffffff81327d57>] fuse_request_send+0x27/0x30
[<ffffffff81331b3f>] fuse_flush+0x17f/0x200
[<ffffffff81218fd2>] filp_close+0x32/0x80
[<ffffffff8123ac53>] __close_fd+0xa3/0xd0
[<ffffffff81219043>] SyS_close+0x23/0x50
[<ffffffff8186281b>] entry_SYSCALL_64_fastpath+0x22/0xcb
[<ffffffffffffffff>] 0xffffffffffffffff

The kernel call stack shows that it is stuck on a FUSE flush call. As soon as the JuiceFS client process is resumed, cp is interrupted immediately and exits.
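
The stop and resume steps of this experiment can also be driven from a second terminal with signals instead of Ctrl+Z and fg. The helpers below are a minimal sketch with made-up names (pause_proc, resume_proc, proc_state), and the state lookup assumes a Linux /proc.

```shell
# Stop a process without terminating it. SIGSTOP cannot be caught or ignored.
pause_proc() {
    kill -STOP "$1"
}

# Let a stopped process run again.
resume_proc() {
    kill -CONT "$1"
}

# Print the one-letter state (R, S, D, T, Z, ...) from /proc/<pid>/status.
proc_state() {
    awk '$1 == "State:" { print $2 }' "/proc/$1/status"
}
```

With the PID of the JuiceFS client in a variable such as client_pid (hypothetical), pause_proc "$client_pid" freezes the client, proc_state 1592 can be used to watch cp enter the D state, and resume_proc "$client_pid" lets the pending interrupt be processed.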

An operation like close, which involves data safety, is not restartable, so it cannot be interrupted at will by signals such as SIGKILL. In the FUSE implementation, for example, it can only be interrupted when the file system responds to the interrupt request.

Therefore, as long as the JuiceFS client process responds to interrupts properly, there is no need to worry about applications that access JuiceFS getting stuck. Alternatively, killing the JuiceFS client process ends the current mount point and interrupts all applications accessing it.
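
To check whether a FUSE mount has requests that the client has not answered yet, the kernel's FUSE control file system can help. The sketch below assumes it is mounted at /sys/fs/fuse/connections, where each connection has a directory containing a waiting file; the function name fuse_waiting is made up for this example, and this is a general FUSE facility rather than something specific to JuiceFS.

```shell
# Print "<connection id> <waiting requests>" for every FUSE connection.
# $1 optionally overrides the control directory (useful for testing).
fuse_waiting() {
    base=${1:-/sys/fs/fuse/connections}
    for dir in "$base"/*/; do
        # Skip entries we cannot read (or the unexpanded glob if none exist).
        [ -r "${dir}waiting" ] || continue
        printf '%s %s\n' "$(basename "$dir")" "$(cat "${dir}waiting")"
    done
}
```

A waiting count that stays above zero while nothing is making progress suggests the client is hung. Each connection directory also has an abort file, and writing anything to it aborts the connection, so all waiting requests return an error. Like killing the client, this should be treated as a last resort.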

If this was helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)

