Recently, the server's responses became very slow: applications on it often timed out and sometimes hung. Investigation showed that I/O pressure on the server was very high, with hard disk I/O utilization at 100%.
The root cause turned out to be that the online business code was performing many writes at the same time, which saturated the server's disk I/O. I am recording the troubleshooting process here so that you and I can resolve this kind of problem quickly in the future.
Use the top command to view real-time system status information:
CPU status (Cpu(s)): the share of CPU time used by user processes, system processes, and niced user processes, plus the idle share, etc.;
Memory status (Mem): total memory, used amount, free amount, etc.;
Swap partition status (Swap): total swap space, used amount, free amount, etc.;
The fields in the CPU status line are:
us: percentage of CPU time spent in user mode
sy: percentage of CPU time spent in system (kernel) mode
ni: percentage of CPU time spent on user-mode processes with an adjusted nice priority
id: percentage of CPU time spent idle
wa: percentage of CPU time spent waiting for I/O to complete
hi: percentage of CPU time spent servicing hardware interrupts
si: percentage of CPU time spent servicing software interrupts
st: percentage of CPU time stolen from this virtual machine by the hypervisor
The server's wa value (71.1%) is extremely high. As a rule of thumb, when I/O wait takes up more than 30% of CPU time, there is a problem with the disk I/O.
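The 30% rule of thumb can also be checked from a script. The following is a minimal sketch rather than a finished tool: the function names iowait_from_top and iowait_is_high are made up for this example, and it assumes the procps-ng layout of the CPU line ("%Cpu(s): ... 71.1 wa, ..."); other versions of top may format that line differently.

```shell
#!/bin/sh
# Print the "wa" value from the CPU summary line of top, read on stdin.
# Assumption: procps-ng layout, e.g. "%Cpu(s):  1.0 us, ... 95.0 wa, ..."
iowait_from_top() {
    awk '/Cpu\(s\)/ {
        gsub(/[,:]/, " ")      # "id,100.0 wa," becomes "id 100.0 wa "
        for (i = 2; i <= NF; i++)
            if ($i == "wa") { print $(i - 1); exit }
    }'
}

# Succeed (exit status 0) when the value in $1 is above the limit in $2.
# The default limit of 30 is the rule of thumb from the text.
iowait_is_high() {
    awk -v wa="$1" -v limit="${2:-30}" 'BEGIN { exit !(wa + 0 > limit + 0) }'
}

# Usage on a live system (-b = batch mode, -n 1 = a single iteration):
#   wa=$(top -bn1 | iowait_from_top)
#   iowait_is_high "$wa" && echo "iowait ${wa}% is above 30%"
```

Both functions read or take plain values, so they can be tried on a saved copy of the top output before being pointed at a live server.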
We continue the detailed analysis with iostat and related commands. If iostat is not available on the server, install it as follows (it is part of the sysstat package):
[root@Mike-VM-Node-172_31_225_214 ~]# yum install sysstat
[root@Mike-VM-Node-172_31_225_214 ~]# iostat
Linux 3.10.0-514.26.2.el7.x86_64 (Mike-VM-Node172_31_225_214.com) 11/03/2020 _x86_64_ (1 CPU)
avg-cpu:   %user   %nice %system %iowait  %steal   %idle
            0.14    0.00    0.04    0.01    0.00   99.81
Device:         tps  kB_read/s  kB_wrtn/s    kB_read    kB_wrtn
vda            0.44       1.38       4.59    1786837    5940236
[root@Mike-VM-Node-172_31_225_214 ~]#
Parameter description:
%user: percentage of time the CPU spends in user mode
%nice: percentage of time the CPU spends in user mode running processes with a nice priority
%system: percentage of time the CPU spends in system (kernel) mode
%iowait: percentage of time the CPU is idle while waiting for outstanding I/O requests to complete
%steal: percentage of time the virtual CPU waits involuntarily while the hypervisor services another virtual processor
%idle: percentage of time the CPU is idle
tps: number of transfers per second issued to the device. One transfer is one I/O request; multiple logical requests may be merged into a single I/O request, and the size of a transfer is not fixed
kB_read/s: amount of data read from the device per second
kB_wrtn/s: amount of data written to the device per second
kB_read: total amount of data read
kB_wrtn: total amount of data written
All of these data amounts are in kilobytes.
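Use the iostat -x 1 10 command to check the I/O status: -x prints extended statistics, 1 is the sampling interval in seconds, and 10 is the number of reports.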
[root@Mike-VM-Node-172_31_225_214 ~]# iostat -x 1 10
Linux 3.10.0-514.26.2.el7.x86_64 (Mike-VM-Node172_31_225_214.com) 11/03/2020 _x86_64_ (1 CPU)
avg-cpu:   %user   %nice %system %iowait  %steal   %idle
            0.13    0.00    0.04   97.01    0.00   99.82
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
vda          0.00     0.10     0.06     0.33     1.07     4.42    28.07     0.00    10.94    22.13     8.83     0.35     0.01

avg-cpu:   %user   %nice %system %iowait  %steal   %idle
            1.00    0.00    4.00   95.00    0.00    0.00
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
vda          0.00     0.00     0.00  2140.00     0.00  8560.00     8.00    19.87     9.29     0.00     9.29     0.47   100.00
Note the values %util 100.00 and %idle 0.00 in the second report. The first report shows averages since boot (hence its %idle of 99.82 and %util of 0.01); each later report covers only the preceding one-second interval.
A %util value that keeps rising means the disk is busy for a growing share of the time: I/O operations are becoming more frequent and consuming more disk resources, which is consistent with adding threads that perform I/O.
If %util is already at 100%, too many I/O requests are being generated, the I/O system is fully loaded, and the disk may be a bottleneck.
In the second report %idle is 0.00 and %iowait is 95.00, so the CPU spends almost all of its time waiting for the disk. I/O pressure has reached its limit, and requests spend a long time waiting in the queue.
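To pick out saturated devices without reading every report by eye, the %util column can be filtered with awk. This is only a sketch under a few assumptions: the function name busy_devices is invented here, the default limit of 90 is an arbitrary choice, and the header row is assumed to start with "Device" and contain a "%util" column, as in the output shown in this article.

```shell
#!/bin/sh
# Print "device %util" for every device row whose %util exceeds the limit.
# Reads iostat -x output on stdin and locates the %util column from the
# header row, so the column position does not have to be hard-coded.
busy_devices() {
    awk -v limit="${1:-90}" '
        $1 ~ /^Device/ {                  # header row of a device table
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            in_table = 1
            next
        }
        NF == 0 || $1 ~ /^avg-cpu/ { in_table = 0; next }   # table ended
        in_table && col && ($col + 0 > limit + 0) { print $1, $col }
    '
}

# Usage on a live system:
#   iostat -x 1 10 | busy_devices 90
# Keep in mind that the first report is the average since boot.
```

Fed the two reports shown earlier, it prints only the vda row from the second report, since the since-boot row has a %util of 0.01.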
Parameter description:
rrqm/s: number of read requests merged per second, i.e. rmerge/s
wrqm/s: number of write requests merged per second, i.e. wmerge/s
r/s: number of read requests completed by the device per second, i.e. rio/s
w/s: number of write requests completed by the device per second, i.e. wio/s
rkB/s: number of kilobytes read per second; half of rsect/s, because each sector is 512 bytes
wkB/s: number of kilobytes written per second; half of wsect/s
avgrq-sz: average size (in sectors) of each I/O request issued to the device
avgqu-sz: average I/O queue length
rsec/s: number of sectors read per second, i.e. rsect/s
wsec/s: number of sectors written per second, i.e. wsect/s
r_await: average time (milliseconds) for each read request, including both the time spent waiting in the kernel queue and the time the disk takes to service the read
w_await: average time (milliseconds) for each write request, including both the time spent waiting in the kernel queue and the time the disk takes to service the write
await: average time (milliseconds) for each device I/O request, including queue time and service time
svctm: average service time (milliseconds) per device I/O request; the iostat man page warns that this field should no longer be trusted
%util: percentage of elapsed time during which the device was handling I/O requests, i.e. the utilization of the device (not of the CPU)
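The columns are related to each other, which gives a quick sanity check on a report. The sketch below (the function name iostat_crosscheck is invented for this example) derives the average request size from the throughput and request rate, converts it to 512-byte sectors for comparison with avgrq-sz, and estimates the queue length as request rate times await, which should land close to avgqu-sz.

```shell
#!/bin/sh
# Cross-check one iostat -x device row using the column definitions above.
# Arguments: r/s  w/s  rkB/s  wkB/s  await(ms)
iostat_crosscheck() {
    awk -v r="$1" -v w="$2" -v rkb="$3" -v wkb="$4" -v await="$5" 'BEGIN {
        iops = r + w
        if (iops == 0) { print "no I/O in this interval"; exit }
        kb_per_req = (rkb + wkb) / iops     # average request size in kB
        sectors    = kb_per_req * 2         # 512-byte sectors, compare avgrq-sz
        queue      = iops * await / 1000    # rate x time in system, compare avgqu-sz
        printf "avg request: %.2f kB (%.2f sectors), est. queue: %.2f\n",
               kb_per_req, sectors, queue
    }'
}

# Values from the saturated vda row shown earlier:
iostat_crosscheck 0.00 2140.00 0.00 8560.00 9.29
# prints: avg request: 4.00 kB (8.00 sectors), est. queue: 19.88
```

For that row the results agree with the reported avgrq-sz of 8.00 and avgqu-sz of 19.87: the disk is being hit with a stream of 4 kB writes and about 20 requests are outstanding at any moment.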
If you want to perform an IO load stress test on the hard disk, you can use the fio command. If there is no fio on the server, install it as follows:
[root@Mike-VM-Node-172_31_225_214 ~]# yum install -y fio
The following command creates 30 files of 1 GB each in the specified directory, written concurrently by multiple threads:
[root@Mike-VM-Node-172_31_225_214 /tmp]# fio -directory=/tmp/ -name=readtest -direct=1 -iodepth 1 -thread -rw=write -ioengine=psync -bs=4k -size=1G -numjobs=30 -runtime=3 -group_reporting
-numjobs=30 runs 30 concurrent jobs.
-rw= selects the I/O pattern: read (sequential read only), write (sequential write only), rw (mixed sequential read and write), randrw (mixed random read and write), randread (random read only), randwrite (random write only).
-runtime= sets the total duration of the test, in seconds.
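When several patterns need to be tested, it can help to generate the fio command line from a few parameters. The following is a sketch only: the function name fio_cmd and the job name iotest are invented for this example, the fixed values mirror the command in this article, and the options are written in the double-dash form of the same fio parameters. It prints the command rather than running it, so it can be reviewed before putting load on a server.

```shell
#!/bin/sh
# Print a fio command line for the given I/O pattern, job count and directory.
# Only the rw modes described in this article are accepted here; fio itself
# supports more.
fio_cmd() {
    fio_rw="$1"
    fio_jobs="${2:-30}"
    fio_dir="${3:-/tmp}"
    case "$fio_rw" in
        read|write|rw|randrw|randread|randwrite) ;;
        *) echo "unsupported rw mode: $fio_rw" >&2; return 1 ;;
    esac
    echo "fio --directory=$fio_dir --name=iotest --direct=1 --iodepth=1 --thread" \
         "--rw=$fio_rw --ioengine=psync --bs=4k --size=1G" \
         "--numjobs=$fio_jobs --runtime=3 --group_reporting"
}

# Usage: review the command first, then run it deliberately.
#   fio_cmd randwrite 4 /tmp
#   eval "$(fio_cmd randwrite 4 /tmp)"
```

Starting with a small job count and raising it gradually makes it easier to see at which point %util and await start to climb.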