What to do when a large number of uninterruptible and zombie processes appear on the system

Short-lived applications run only briefly, so tools such as top or ps, which show a snapshot overview of the system and its processes, rarely catch them. To diagnose them you need tools that record events, such as execsnoop or perf top.

Besides user CPU, the CPU utilization types mentioned earlier also include system CPU (for example context switches), iowait CPU (waiting for the disk to respond) and interrupt CPU (both hardware and software interrupts).
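As a rough illustration of these CPU categories, the sketch below computes the share of each category between two samples of the aggregate `cpu` line in /proc/stat. The function names (`cpu_shares`, `sample_live`) are made up for this example, and it assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat on Linux.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu  u n s i w q sq st ...' line into a list of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]


def cpu_shares(before, after):
    """Return the percentage of CPU time spent in each category between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * d / total for name, d in zip(CPU_FIELDS, deltas)}


def sample_live(interval=1.0, path="/proc/stat"):
    """Take two samples from /proc/stat (Linux only) and return the shares."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return cpu_shares(first, second)
```

Calling `sample_live()` on a Linux host returns a dictionary whose `iowait` entry corresponds to the `wai` column of dstat and the `wa` value in top.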

- Process Status

When iowait rises, a process may stay in the uninterruptible state for a long time because it is not getting a response from the hardware. ps or top will then show it in the D state, i.e. uninterruptible sleep (Uninterruptible Sleep).

top and ps are the most commonly used tools for viewing process state. The S column in top shows the state of each process: R, D, Z, S, I and several others.

--R is Running or Runnable: the process is in the CPU's ready queue, either running or waiting to run

--D is Disk Sleep, the uninterruptible sleep state (Uninterruptible Sleep): the process is usually interacting with hardware, and that interaction must not be interrupted by other processes or by interrupts

--Z is Zombie: the process has actually ended, but its parent has not yet reclaimed its resources

--S is Interruptible Sleep: the process is suspended while waiting for an event; when the event occurs, it is woken up and enters the R state

--I is Idle, the idle state, used for kernel threads in uninterruptible sleep. Note that D-state processes raise the load average, while I-state processes do not.

--T or t is Stopped or Traced: the process is paused or being traced. Sending SIGSTOP to a process puts it into the stopped state; sending SIGCONT afterwards lets it resume

--X is Dead: the process is gone, so you will not see it in top or ps

--In top, press 1 to switch to the per-CPU view
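To get a quick count of how many processes are in each of these states, one option is to tally the first letter of the STAT column printed by `ps -eo stat`. The following is only a sketch; `count_states` and `live_counts` are illustrative names, not part of any tool mentioned above.

```python
import subprocess
from collections import Counter


def count_states(ps_output):
    """Count processes per state letter from the output of `ps -eo stat`."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "STAT" header row.
        if not fields or fields[0] == "STAT":
            continue
        # Only the first character is the state; the rest are modifiers (s, +, <, l ...).
        counts[fields[0][0]] += 1
    return counts


def live_counts():
    """Run ps on the local machine and count the states (needs procps `ps`)."""
    out = subprocess.run(["ps", "-eo", "stat"], capture_output=True,
                         text=True, check=True).stdout
    return count_states(out)
```

A growing `D` or `Z` count across repeated calls to `live_counts()` is the kind of signal the rest of this article investigates.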

If a system or hardware fault occurs, a process may stay in the uninterruptible state for a long time, and the system may even accumulate a large number of uninterruptible processes. When that happens, check whether the system has I/O or other performance problems.

[root@mysqlhq ~]# yum install dstat -y

dstat is a newer performance tool that combines the strengths of vmstat, iostat and ifstat; it can monitor the system's CPU, disk I/O, memory and network usage at the same time.

 

- Uninterruptible state: the process is interacting with hardware. To keep the process data consistent with the hardware, the system does not allow other processes or interrupts to interrupt it. A process that stays in the uninterruptible state for a long time usually indicates an I/O performance problem.

- Zombie process: the process has exited, but its parent has not yet reclaimed the resources held by the child. A short-lived zombie state is usually nothing to worry about, but a process that stays a zombie for a long time deserves attention: the application may not be handling child exits correctly.

--1 iowait is too high, and the system load average has climbed to the number of CPUs

--iowait analysis

[root@mysqlhq ~]# dstat 1 10 ## output 10 samples at 1-second intervals
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  1   0  99   0   0   0| 329k  133k|   0     0 |   0     0 | 202   249 
  0   0 100   0   0   0|   0    15k|4086B  842B|   0     0 | 176   225 
  0   0 100   0   0   0|   0   206k|3282B  362B|   0     0 | 182   258 
  0   0 100   0   0   0|   0  5120B|3341B  362B|   0     0 | 141   174 
  0   0 100   0   0   0|   0     0 |2946B  362B|   0     0 | 144   178 
  0   0 100   0   0   0|   0    10k|2142B  362B|   0     0 | 151   208 
  0   0 100   0   0   0|   0    15k|2640B  362B|   0     0 | 171   213 

Look at the read and writ columns. When iowait rises, check whether disk read requests (read) or write requests (writ) rise with it; if they do, the iowait is most likely caused by disk reads or writes.

Then use top to look for processes in the D state.

Note the PID of such a process; in this case it is 2171.

## -d shows I/O statistics, -p selects the process ID; output 3 samples at 1-second intervals
[root@mysqlhq ~]# pidstat -d -p 2171 1 3
Linux 3.10.0-514.ky3.kb3.x86_64 (mysqlhq) 	06/11/2019 	_x86_64_	(4 CPU)
04:51:13 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
04:51:14 PM  1001      2171      0.00      0.00      0.00       0  zabbix_agentd
04:51:15 PM  1001      2171      0.00      0.00      0.00       0  zabbix_agentd
04:51:16 PM  1001      2171      0.00      0.00      0.00       0  zabbix_agentd
Average:     1001      2171      0.00      0.00      0.00       0  zabbix_agentd

kB_rd/s is the number of KB read per second, kB_wr/s is the number of KB written per second, and iodelay is the I/O delay in clock ticks. All of them are 0, so the process is not reading or writing at this point and the problem does not come from PID 2171.

Analyze the other D-state processes in the same way.

[root@mysqlhq ~]# pidstat -d 1 20 ## output 20 samples at 1-second intervals
Linux 3.10.0-514.ky3.kb3.x86_64 (mysqlhq) 	06/11/2019 	_x86_64_	(4 CPU)
04:55:37 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
04:55:38 PM  1000      3093      0.00     15.84      0.00       0  mysqld

04:55:38 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
04:55:39 PM  1000      3093      0.00     12.00      0.00       0  mysqld

The output shows that mysqld is writing to disk, at about 15 KB per second. If this value were large, this process would be the source of the problem.
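When pidstat prints many samples for many processes, picking out the heavy writers by eye gets tedious. The sketch below parses text in the `pidstat -d` layout shown above and averages kB_wr/s per process. The function names are invented for this example, and the parser assumes each data row has the same trailing columns as the header row that precedes it.

```python
def parse_pidstat_d(text):
    """Extract pid, read rate, write rate and command from `pidstat -d` output."""
    rows = []
    offsets = None
    for line in text.splitlines():
        tokens = line.split()
        if "kB_rd/s" in tokens:
            # Header row: remember each column's position counted from the right,
            # so a 12-hour or 24-hour timestamp prefix does not matter.
            offsets = {name: i - len(tokens) for i, name in enumerate(tokens)}
            continue
        if offsets is None or not tokens or tokens[0] == "Average:":
            continue
        try:
            rows.append({
                "pid": int(tokens[offsets["PID"]]),
                "kB_rd/s": float(tokens[offsets["kB_rd/s"]]),
                "kB_wr/s": float(tokens[offsets["kB_wr/s"]]),
                "command": tokens[offsets["Command"]],
            })
        except (ValueError, IndexError):
            # Not a data row (for example a wrapped or truncated line).
            continue
    return rows


def average_write_rate(rows):
    """Average kB_wr/s per (pid, command) over all parsed samples."""
    totals = {}
    for row in rows:
        key = (row["pid"], row["command"])
        total, count = totals.get(key, (0.0, 0))
        totals[key] = (total + row["kB_wr/s"], count + 1)
    return {key: total / count for key, (total, count) in totals.items()}
```

Feeding it the saved output of `pidstat -d 1 20` and sorting the result by value puts the processes that write the most at the top.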

A process must use system calls to access the disk, so the next step is to find out which system calls mysqld is making.

strace is the most commonly used tool for tracing a process's system calls.

[root@mysqlhq ~]# strace -p 1000
strace: attach: ptrace(PTRACE_ATTACH, ...): No such process
[root@mysqlhq ~]# strace -p 3093
Process 3093 attached
restart_syscall(<... resuming interrupted call ...>) = 1
fcntl(31, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(31, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
accept(31, {sa_family=AF_INET6, sin6_port=htons(37136), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 133
fcntl(31, F_SETFL, O_RDWR)              = 0
setsockopt(133, SOL_IP, IP_TOS, [8], 4) = 0
setsockopt(133, SOL_TCP, TCP_NODELAY, [1], 4) = 0

If attaching fails:

strace -p 6082

strace: attach: ptrace(PTRACE_SEIZE, 6082): Operation not permitted

When you run into this error, first check whether the process is still in a normal state:

[root@mysqlhq ~]# ps aux|grep 6082
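The same check can be made without grep by reading the `State:` line of /proc/<pid>/status. The sketch below is one way to do that; `parse_state` and `read_state` are illustrative names. If the state letter comes back as Z, the process has already exited, which would explain why strace cannot attach to it.

```python
def parse_state(status_text):
    """Return (letter, description) from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            value = line.split(":", 1)[1].strip()   # e.g. "Z (zombie)"
            return value.split()[0], value
    raise ValueError("no 'State:' line found")


def read_state(pid):
    """Read the state of a live process on Linux; raises if the PID is gone."""
    with open(f"/proc/{pid}/status") as f:
        return parse_state(f.read())
```

For example, `read_state(6082)` would report the state of the process that strace failed to attach to, provided that PID still exists.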

Use perf to record and inspect what is running:

$ perf record -g

$ perf report

In the report, swapper is the kernel's scheduling process and can be ignored.

Then look at the system calls made by the process.

- zombie process

To get rid of zombie processes you have to find their root cause, that is, find the parent process and fix the problem there.

How to find the parent process:

# -a shows the command-line arguments

# -p shows the PIDs

# -s shows the parent processes of the specified process

[root@mysqlhq ~]# pstree -aps 3093

 

Once you have found the parent process, fix the problem there.
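When there are many zombies, it can help to group them by parent PID first and only then inspect each parent with pstree. A minimal sketch, assuming the input is the output of `ps -eo pid,ppid,stat,comm` (the function name `zombies_by_parent` is made up for this example):

```python
from collections import defaultdict


def zombies_by_parent(ps_output):
    """Map parent PID -> list of zombie child PIDs from `ps -eo pid,ppid,stat,comm`."""
    result = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip the header row and anything that is not "pid ppid stat ...".
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        pid, ppid, stat = int(fields[0]), int(fields[1]), fields[2]
        if stat.startswith("Z"):
            result[ppid].append(pid)
    return dict(result)
```

The parent with the longest list is the first candidate to check for a missing wait() or waitpid() call.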

Summary:

High iowait does not necessarily mean there is an I/O performance bottleneck. When the only processes running on the system are I/O-bound, iowait will be high even though disk reads and writes are nowhere near a bottleneck.

When iowait rises, first use tools such as dstat and pidstat to confirm whether it really is a disk I/O problem, then find the processes causing the I/O.

Processes waiting on I/O are usually in the uninterruptible state, so D-state processes found with ps are the main suspects. If such a process has turned into a zombie, strace cannot analyze its system calls directly; in that case use perf to analyze the system's CPU clock events and eventually find the direct I/O problem.

For zombie processes, use pstree to identify the parent process, then check the parent's wait() / waitpid() calls.
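For reference, this is roughly what correct reaping looks like in a parent process. The sketch uses Python's `os.fork` and `os.waitpid` as stand-ins for the fork() and waitpid() calls of whatever language the real application is written in; `run_and_reap` is a made-up name and the code runs only on POSIX systems.

```python
import os


def run_and_reap(exit_code=0):
    """Fork a child that exits immediately, then reap it so no zombie is left."""
    pid = os.fork()
    if pid == 0:
        # Child: exit right away without running the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent: waitpid() collects the child's exit status and frees its
    # process-table entry. Skipping this call is what leaves a zombie behind.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

If the parent never calls waitpid() (or calls it only once for many children), each exited child stays in the Z state until the parent itself exits.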

Origin www.cnblogs.com/yhq1314/p/11004945.html