[Linux] 20. Process status: uninterruptible process, iowait, zombie process, dstat strace pstree

Short-lived applications run for only a brief time, so it is hard to spot them with tools such as top or ps, which show a system overview or a snapshot of processes. You need tools that record events, such as execsnoop or perf top, to help with the diagnosis.

Recall the different types of CPU usage: besides user CPU, there are system CPU (for example, context switching), CPU time spent waiting for I/O (for example, waiting for a disk response), and interrupt CPU time (both soft and hard interrupts).

We have already analyzed high system CPU usage in the article on context switching. The remaining case, a rise in the CPU time spent waiting for I/O (iowait for short), is also one of the most common server performance problems. Today we will look at a multi-process I/O case and analyze this situation.

1. Process status

When iowait rises, processes are likely stuck in an uninterruptible state for a long time because they cannot get a response from the hardware. In the output of ps or top you can see that they are in the D state, that is, uninterruptible sleep.

top and ps are the most commonly used tools for viewing process state. Their S column (the Status column) shows the state of each process: R, D, Z, S, I, and so on. The meanings are as follows:

  • R is the abbreviation of Running or Runnable, which means that the process is in the CPU's ready queue, running or waiting to run.
  • D is the abbreviation of Disk Sleep, which is Uninterruptible Sleep. It generally indicates that the process is interacting with the hardware, and the interaction process is not allowed to be interrupted by other processes or interrupts.
    • This is actually to ensure that the process data is consistent with the hardware status, and under normal circumstances, the uninterruptible state will end in a short time. Therefore, we can generally ignore short-term uninterruptible state processes.
    • However, if a system or hardware failure occurs, the process may remain in an uninterruptible state for a long time, or even lead to a large number of uninterruptible processes in the system. At this time, you have to pay attention to whether the system has I/O or other performance problems.
  • Z is the abbreviation of Zombie, which means a zombie process, that is, the process has actually ended, but the parent process has not reclaimed its resources (such as process descriptors, PID, etc.).
    • Under normal circumstances, when a process creates a child process, it should wait for the child to finish via the wait() or waitpid() system call and reclaim the child's resources. When a child process ends, it sends a SIGCHLD signal to its parent, so the parent can also register a SIGCHLD handler and reclaim the resources asynchronously (a minimal sketch of this reaping pattern follows this list).
    • If the parent does neither, or the child exits before the parent has had a chance to handle its status, the child becomes a zombie process. In other words, the parent must see its children through to the end; if it fails to act or cannot keep up, "problem children" appear.
    • Usually a zombie process is short-lived: it disappears once the parent reclaims its resources, or, after the parent exits, once it is adopted and reaped by the init process.
    • However, if the parent keeps running but never handles the termination of its children, the children remain zombies forever. A large number of zombie processes can exhaust the available PIDs and prevent new processes from being created, so this situation must be avoided.
  • S is the abbreviation of Interruptible Sleep, meaning the process has been suspended by the system because it is waiting for an event. When that event occurs, the process wakes up and enters the R state.
  • I is the abbreviation of Idle, the idle state, used for kernel threads in uninterruptible sleep. As mentioned earlier, uninterruptible processes caused by hardware interaction are shown as D, but some kernel threads in that state actually carry no load; Idle is used to distinguish this case. Note that D-state processes raise the load average, while I-state processes do not.
  • T or t, which is the abbreviation of Stopped or Traced, indicates that the process is suspended or traced.
    • Send a SIGSTOP signal to a process and it becomes Stopped in response; send it SIGCONT and it resumes running (if it was started directly in a terminal, you need the fg command to bring it back to the foreground).
    • When you debug a process with a debugger such as gdb and stop it at a breakpoint, the process enters the Traced state. This is essentially a special kind of stopped state, except that you can use the debugger to trace and control its execution as needed.
  • X, which is the abbreviation of Dead, means that the process has died, so you will not see it in the top or ps commands.
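To make the wait()/SIGCHLD mechanism concrete, here is a minimal, hypothetical C sketch (my own illustration, not the case application's code) of a parent that registers a SIGCHLD handler and reaps every exited child with waitpid(), so that none of them lingers as a zombie:

#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>

static void reap_children(int sig)
{
    (void)sig;
    /* Reap every child that has already exited; WNOHANG keeps the
       handler from blocking when no more children are ready. */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = reap_children;   /* asynchronous SIGCHLD-based reaping */
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGCHLD, &sa, NULL);

    for (int i = 0; i < 2; i++) {
        if (fork() == 0) {           /* child: pretend to work, then exit */
            _exit(0);
        }
    }

    sleep(5);                        /* parent keeps running; exited children
                                        are reaped by the handler instead of
                                        lingering as zombies */
    return 0;
}

For reference, the top output below shows processes in several of the states just described (R, D, S, Z, and I):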
$ top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28961 root      20   0   43816   3148   4040 R   3.2  0.0   0:00.01 top
  620 root      20   0   37280  33676    908 D   0.3  0.4   0:00.01 app
    1 root      20   0  160072   9416   6752 S   0.0  0.1   0:37.64 systemd
 1896 root      20   0       0      0      0 Z   0.0  0.0   0:00.00 devapp
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.10 kthreadd
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    7 root      20   0       0      0      0 S   0.0  0.0   0:06.37 ksoftirqd/0

The case application runs in a Docker container (a prebuilt image is used below). Case environment:

Machine configuration: 2 CPUs, 8 GB of memory.
Pre-install docker, sysstat, dstat and other tools, for example apt install docker.io dstat sysstat. (dstat is a versatile performance tool that absorbs the strengths of vmstat, iostat, ifstat and others, and can observe the system's CPU, disk I/O, network, and memory usage at the same time.)

# After installation, first run the following command to start the case application:
$ docker run --privileged --name=app -itd feisky/app:iowait

# Then run ps to confirm that the case application has started. If everything is normal, you should see output like this:
$ ps aux | grep /app
root      4009  0.0  0.0   4376  1008 pts/0    Ss+  05:51   0:00 /app
root      4287  0.6  0.4  37280 33660 pts/0    D+   05:54   0:00 /app
root      4288  0.6  0.4  37280 33668 pts/0    D+   05:54   0:00 /app

From this output, we can see that multiple app processes have started, in the states Ss+ and D+. S represents interruptible sleep, and D represents uninterruptible sleep.

But what do the trailing s and + mean? It doesn't matter if you don't know; just check man ps:

  • s indicates that this process is the leader process of a session
  • + indicates that the process belongs to the foreground process group.

Two new concepts appear here, process groups and sessions. They are used to manage a group of interrelated processes, and their meaning is actually easy to understand.

  • A process group represents a group of interrelated processes, for example, each child process is a member of the group to which the parent process belongs;
  • A session refers to one or more process groups that share the same control terminal.

For example, when we log in to the server through SSH, a control terminal (TTY) will be opened, and this control terminal corresponds to a session. The commands we run in the terminal and their sub-processes form process groups. Among them, the commands running in the background constitute the background process group; the commands running in the foreground constitute the foreground process group.
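As a small illustration (a hypothetical sketch, not part of the case), the following C snippet prints the identifiers that tie a process to its parent, its process group, and its session; these are exactly the pieces of information behind the s and + flags shown by ps:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    pid_t pid  = getpid();                 /* this process */
    pid_t ppid = getppid();                /* its parent */
    pid_t pgid = getpgid(0);               /* process group it belongs to */
    pid_t sid  = getsid(0);                /* session it belongs to */
    pid_t fg   = tcgetpgrp(STDIN_FILENO);  /* foreground process group of the
                                              controlling terminal (-1 if none) */

    printf("PID=%d PPID=%d PGID=%d SID=%d foreground PGID=%d\n",
           (int)pid, (int)ppid, (int)pgid, (int)sid, (int)fg);

    /* ps marks a process with "s" when PID == SID (session leader),
       and with "+" when its PGID equals the terminal's foreground PGID. */
    return 0;
}

Each command you launch from a job-control shell typically gets its own process group, while the session ID remains that of your login shell.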

After understanding this, let's use top to look at the resource usage of the system:

# Press 1 to show per-CPU usage; observe for a while, then press Ctrl+C to exit
$ top
top - 05:56:23 up 17 days, 16:45,  2 users,  load average: 2.00, 1.68, 1.39
Tasks: 247 total,   1 running,  79 sleeping,   0 stopped, 115 zombie
%Cpu0  :  0.0 us,  0.7 sy,  0.0 ni, 38.9 id, 60.5 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.7 sy,  0.0 ni,  4.7 id, 94.6 wa,  0.0 hi,  0.0 si,  0.0 st
...
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4340 root      20   0   44676   4048   3432 R   0.3  0.0   0:00.05 top
 4345 root      20   0   37280  33624    860 D   0.3  0.0   0:00.01 app
 4344 root      20   0   37280  33624    860 D   0.3  0.4   0:00.01 app
    1 root      20   0  160072   9416   6752 S   0.0  0.1   0:38.59 systemd
...

Can you spot any problems here? Be careful: go through the output line by line and do not miss anything. If you have forgotten what any line or field means, go back and review it.

Okay, if you already have your answer, read on and see whether it matches mine. Here I found four suspicious points.

  • First look at the load average in the first line. The 1-minute, 5-minute, and 15-minute averages decrease in that order (2.00, 1.68, 1.39), which means the load is trending upward; and the 1-minute average has reached the number of CPUs in the system, so there is very likely a performance bottleneck.
  • Next look at Tasks in the second line: there is 1 running process, but also 115 zombie processes, and the number keeps growing, which means some child processes are not being cleaned up when they exit.
  • Next, look at the usage of the two CPUs. The user CPU and system CPU are not high, but iowait is 60.5% and 94.6% respectively, which seems a bit abnormal.
  • Finally, look at the per-process statistics. The process with the highest CPU usage is at only 0.3%, which does not look high; but two processes are in the D state. They may be waiting for I/O, but that alone does not prove they are what is driving iowait up.

Let's summarize these four observations; they boil down to two very clear points:

  • The first point is that iowait is too high, causing the average load of the system to increase, even reaching the number of system CPUs.
  • The second point is that the number of zombie processes is increasing, indicating that some programs have failed to correctly clean up the resources of child processes.

So, what should we do if we encounter these two problems? Let’s continue to break it down:

1.1 iowait analysis

Let’s first look at the problem of iowait rising.

When iowait rises, the first instinct is to check the system's I/O activity. I think that way too, so which tool can show the system's I/O status? Here I recommend dstat, which can show CPU and I/O usage at the same time, making comparison and analysis easy. Let's run dstat in the terminal and observe CPU and I/O usage:

# Output 10 samples at 1-second intervals
$ dstat 1 10
You did not select any stats, using -cdngy by default.
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
  0   0  96   4   0|1219k  408k|   0     0 |   0     0 |  42   885
  0   0   2  98   0|  34M    0 | 198B  790B|   0     0 |  42   138
  0   0   0 100   0|  34M    0 |  66B  342B|   0     0 |  42   135
  0   0  84  16   0|5633k    0 |  66B  342B|   0     0 |  52   177
  0   3  39  58   0|  22M    0 |  66B  342B|   0     0 |  43   144
  0   0   0 100   0|  34M    0 | 200B  450B|   0     0 |  46   147
  0   0   2  98   0|  34M    0 |  66B  342B|   0     0 |  45   134
  0   0   0 100   0|  34M    0 |  66B  342B|   0     0 |  39   131
  0   0  83  17   0|5633k    0 |  66B  342B|   0     0 |  46   168
  0   3  39  59   0|  22M    0 |  66B  342B|   0     0 |  37   134
From the dstat output, we can see that whenever iowait (wai) rises, the disk read traffic (read) is large. This indicates that the rise in iowait is related to disk reads and is most likely caused by them.
So which process is reading the disk? Do you remember the uninterruptible (D state) processes we saw in top earlier? They look suspicious to me, so let's analyze them.

# Back in the same terminal, run top and observe the D-state processes:
# Observe for a while, then press Ctrl+C to exit
$ top
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4340 root      20   0   44676   4048   3432 R   0.3  0.0   0:00.05 top
 4345 root      20   0   37280  33624    860 D   0.3  0.0   0:00.01 app
 4344 root      20   0   37280  33624    860 D   0.3  0.4   0:00.01 app
...
 
From the top output we find the PIDs of the D-state processes: there are two of them, 4344 and 4345.
Next, let's check the disk read/write activity of these processes. To inspect the resource usage of a single process, pidstat is the usual tool; this time, remember to add the -d option so that it reports I/O statistics.

# For example, taking 4344, run the following pidstat command in the terminal, with -p 4344 specifying the PID:
# -d shows I/O statistics, -p specifies the PID; output 3 samples at 1-second intervals
$ pidstat -d -p 4344 1 3
06:38:50      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:38:51        0      4344      0.00      0.00      0.00       0  app
06:38:52        0      4344      0.00      0.00      0.00       0  app
06:38:53        0      4344      0.00      0.00      0.00       0  app
In this output, kB_rd is the number of KB read per second, kB_wr is the number of KB written per second, and iodelay is the I/O delay (in clock ticks). They are all 0, meaning there is no disk read or write at the moment, so the problem is not caused by process 4344.

However, analyzing process 4345 the same way shows that it has no disk reads or writes either.

# So how do we find out which process is actually reading or writing the disk? Let's keep using pidstat, but this time drop the PID and observe the I/O usage of all processes. Run the following pidstat command in the terminal:
# Output samples at 1-second intervals (20 groups here)
$ pidstat -d 1 20
...
06:48:46      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:47        0      4615      0.00      0.00      0.00       1  kworker/u4:1
06:48:47        0      6080  32768.00      0.00      0.00     170  app
06:48:47        0      6081  32768.00      0.00      0.00     184  app
 
06:48:47      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:48        0      6080      0.00      0.00      0.00     110  app
 
06:48:48      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:49        0      6081      0.00      0.00      0.00     191  app
 
06:48:49      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
 
06:48:50      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:51        0      6082  32768.00      0.00      0.00       0  app
06:48:51        0      6083  32768.00      0.00      0.00       0  app
 
06:48:51      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:52        0      6082  32768.00      0.00      0.00     184  app
06:48:52        0      6083  32768.00      0.00      0.00     175  app
 
06:48:52      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:53        0      6083      0.00      0.00      0.00     105  app
...
After watching for a while, we can see that it is indeed app processes doing the disk reads, at about 32 MB per second, so app is the problem. But what I/O operations is the app process actually performing?

Here we need to recall the difference between a process's user mode and kernel mode. To access the disk, a process must go through system calls, so the next step is to find out which system calls the app process is making.
# strace is the most commonly used tool for tracing a process's system calls. So we take a PID from the pidstat output, for example 6082, and run strace in the terminal, with -p specifying the PID:
$ strace -p 6082
strace: attach: ptrace(PTRACE_SEIZE, 6082): Operation not permitted
A strange error appears here: the strace command fails, and the error it reports is a lack of permission. In theory we have been running everything as root, so why would permission be denied? You might want to pause and think: how would you handle this situation?

# When I hit a problem like this, I first check whether the process is still in a normal state. For example, run ps in the terminal and grep for process 6082:
$ ps aux | grep 6082
root      6082  0.0  0.0      0     0 pts/0    Z+   13:43   0:00 [app] <defunct>
Sure enough, process 6082 has turned into the Z state, a zombie. Zombie processes have already exited, so there is no way to analyze their system calls any further. We will come back to how to deal with zombie processes; for now, let's continue analyzing the iowait problem.

At this point, you should have noticed that the system iowait problem is still continuing, but tools such as top and pidstat can no longer give more information. At this time, we should turn to dynamic tracking tools based on event recording.

You can use perf top to see if there are any new findings. Or, like me, you can run perf record in the terminal for a while (for example, 15 seconds), then press Ctrl+C to exit, and then run perf report to view the report:

$ perf record -g
$ perf report

Then, find the app process we are interested in and press Enter to expand the call stack. You will get the following call relationship diagram:

The swapper in this picture is the kernel's scheduling (idle) process; you can ignore it for now.

Looking at the rest, you can see that the app is indeed reading data through the sys_read() system call, and from new_sync_read and blkdev_direct_IO we can tell it is reading directly from the disk, bypassing the system cache. Since every read request goes straight to the disk, this explains the rise in iowait we observed.

It seems that the culprit is that the app performs direct I/O to the disk internally!

The rest is straightforward. Next, we should work out at the code level where the direct read requests come from. Looking at the source file app.c, you will find that it indeed opens the disk with the O_DIRECT option, bypassing the system cache and reading the disk directly:

open(disk, O_RDONLY|O_DIRECT|O_LARGEFILE, 0755)

Reading and writing the disk directly can be a good fit for I/O-intensive applications such as database systems, because the application can then control disk reads and writes itself. In most other cases, though, we are better off optimizing disk I/O through the system cache. In other words, simply removing the O_DIRECT option is enough.
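To see the difference at the code level, here is a small, hypothetical C sketch; the device path /dev/sdb, the 512-byte alignment, and the buffer size are illustrative assumptions rather than values taken from app.c. The first read uses O_DIRECT, as the case application does; the second is the buffered variant that the fix switches to:

#define _GNU_SOURCE                     /* exposes O_DIRECT and O_LARGEFILE */
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define BUF_SIZE (512 * 1024)           /* assumed read size, multiple of 512 */

int main(void)
{
    const char *disk = "/dev/sdb";      /* assumed device path, for illustration */
    void *buf;
    ssize_t n = 0;

    /* Direct I/O: the buffer (and transfer size/offset) must be aligned,
       typically to the logical block size, and every read() goes straight
       to the disk, bypassing the page cache; this is what drives iowait up. */
    if (posix_memalign(&buf, 512, BUF_SIZE) != 0)
        return 1;
    int fd = open(disk, O_RDONLY | O_DIRECT | O_LARGEFILE);
    if (fd >= 0) {
        n = read(fd, buf, BUF_SIZE);
        close(fd);
    }

    /* Buffered I/O (the fix): the same open() without O_DIRECT, so reads are
       served from, and cached in, the page cache with read-ahead. */
    fd = open(disk, O_RDONLY | O_LARGEFILE);
    if (fd >= 0) {
        n = read(fd, buf, BUF_SIZE);
        close(fd);
    }

    (void)n;
    free(buf);
    return 0;
}

Note that O_DIRECT also imposes alignment requirements on the buffer and transfer size, which is one more reason to use it only when the application really manages its own caching, as databases do.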

app-fix1.c is the modified file; I have also packaged it into a Docker image. Run the following commands to start it:

# First remove the original application
$ docker rm -f app
# Run the new application
$ docker run --privileged --name=app -itd feisky/app:iowait-fix1

# Finally, check again with top:
$ top
top - 14:59:32 up 19 min,  1 user,  load average: 0.15, 0.07, 0.05
Tasks: 137 total,   1 running,  72 sleeping,   0 stopped,  12 zombie
%Cpu0  :  0.0 us,  1.7 sy,  0.0 ni, 98.0 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3084 root      20   0       0      0      0 Z   1.3  0.0   0:00.04 app
 3085 root      20   0       0      0      0 Z   1.3  0.0   0:00.04 app
    1 root      20   0  159848   9120   6724 S   0.0  0.1   0:09.03 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 I   0.0  0.0   0:00.40 kworker/0:0
...

You will find that iowait is now very low, only 0.3%, which means the change we just made has fixed the high-iowait problem. Done! But don't forget the zombie processes: if you watch their count carefully, you will be dismayed to find that it is still growing.

1.2 Zombie process

Next, let's deal with the zombie processes. Since zombies appear because the parent process never reclaims its children's resources, solving them means going to the root: find the parent process, and fix the problem there.

We have talked about how to find the parent process before. The simplest way is to run the pstree command:

# -a shows command-line arguments
# -p shows PIDs
# -s shows the parent processes of the specified process
$ pstree -aps 3084
systemd,1
  └─dockerd,15006 -H fd://
      └─docker-containe,15024 --config /var/run/docker/containerd/containerd.toml
          └─docker-containe,3991 -namespace moby -workdir...
              └─app,4009
                  └─(app,3084)

After running, you will find that the parent process of process No. 3084 is 4009, which is the app application.

So, we then check the code of the app application to see if the end of the child process is handled correctly, such as whether wait() or waitpid() is called, or whether a handler function for the SIGCHLD signal is registered.

Now let's look at app-fix1.c, the source file after the iowait fix, and find where the child processes are created and cleaned up:

int status = 0;
for (;;) {
    /* create two child processes on every iteration */
    for (int i = 0; i < 2; i++) {
        if (fork() == 0) {
            sub_process();
        }
    }
    sleep(5);
}

/* never reached: wait() sits outside the infinite loop above */
while (wait(&status) > 0);

Loops are notoriously error-prone; can you find the problem here? Although this code appears to call wait() to wait for the child processes to finish, it mistakenly places wait() outside the infinite for(;;) loop, so wait() is never actually called. Moving it inside the for loop fixes the problem, as shown below:
int i = 0;
for (;;) {
    for (i = 0; i < 2; i++) {
        if (fork() == 0) {
            sub_process(disk, buffer_size, buffer_count);
        }
    }
    /* reap all exited children before sleeping, so no zombies linger */
    while (wait(&status) > 0);
    sleep(5);
}

I have put the modified file into app-fix2.c and packaged it into a Docker image as well. Run the following commands to start it:

# First stop the app that produces the zombie processes
$ docker rm -f app
# Then start the new app
$ docker run --privileged --name=app -itd feisky/app:iowait-fix2

# After it starts, check one last time with top:
$ top
top - 15:00:44 up 20 min,  1 user,  load average: 0.05, 0.05, 0.04
Tasks: 125 total,   1 running,  72 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  1.7 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3198 root      20   0    4376    840    780 S   0.3  0.0   0:00.01 app
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 I   0.0  0.0   0:00.41 kworker/0:0
...

Well, the zombie process (Z state) is gone, iowait is also 0, and the problem is finally solved.

1.3 Summary

Today, I used a multi-process case to walk you through analyzing a rise in the CPU time the system spends waiting for I/O (that is, iowait%).

Although disk I/O did cause iowait to rise in this case, a high iowait does not necessarily mean there is an I/O performance bottleneck. When only I/O-intensive processes are running on the system, iowait can also be high even though disk reads and writes are nowhere near a bottleneck.

Therefore, when encountering an increase in iowait, you need to first use tools such as dstat and pidstat to confirm whether it is a disk I/O problem, and then find out which processes are causing the I/O.

Processes waiting for I/O are generally in the uninterruptible state, so D-state (uninterruptible) processes found with ps are the usual suspects. In this case, however, the processes turned into zombies right after finishing their I/O, so strace could not be used to analyze their system calls directly.

In this case, we used the perf tool to analyze the system's CPU clock events and finally found that the problem was caused by direct I/O. From there, it was easy to check the corresponding location in the source code.

The problem of zombie processes is relatively easy to troubleshoot. After using pstree to find the parent process, check the code of the parent process and check the call of wait() / waitpid() or the registration of the SIGCHLD signal processing function.


Reproduced from: blog.csdn.net/jiaoyangwm/article/details/134480509