[Repost] Check Linux Server Performance in 1 Minute with 10 Commands

Reposted from: http://www.infoq.com/cn/news/2015/12/linux-performance

 

If the load on your Linux server suddenly spikes and alert text messages are flooding your phone, how do you track down the performance problem in the shortest possible time? This blog post from the Netflix Performance Engineering team presents 10 commands for diagnosing machine performance issues in under a minute.

Overview

Running the following commands gives you a general picture of system resource usage within one minute.

  • uptime
  • dmesg | tail
  • vmstat 1
  • mpstat -P ALL 1
  • pidstat 1
  • iostat -xz 1
  • free -m
  • sar -n DEV 1
  • sar -n TCP,ETCP 1
  • top

Some of these commands require the sysstat package to be installed; others are provided by the procps package. Their output helps you locate performance bottlenecks quickly by checking the utilization, saturation, and error metrics of every resource (CPU, memory, disk IO, and so on), an approach known as the USE method.
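Before an incident it can help to confirm that all ten tools are actually installed. The following is a minimal sketch, not part of the original article: the tool-to-package mapping is an assumption based on common distributions (package names vary, and dmesg usually ships in util-linux rather than in either package named above), and missing_tools is a hypothetical helper name.

```python
import shutil

# Assumed mapping of each tool to the package that usually ships it.
TOOL_PACKAGES = {
    "uptime": "procps",
    "dmesg": "util-linux",
    "vmstat": "procps",
    "mpstat": "sysstat",
    "pidstat": "sysstat",
    "iostat": "sysstat",
    "free": "procps",
    "sar": "sysstat",
    "top": "procps",
}


def missing_tools(which=shutil.which):
    """Return {tool: package} for every tool that `which` cannot find on PATH."""
    return {tool: pkg for tool, pkg in TOOL_PACKAGES.items() if which(tool) is None}


for tool, pkg in missing_tools().items():
    print(f"{tool}: not found, try installing the {pkg} package")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default is enough.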

The sections below introduce these commands one by one. For more options and details, see each command's man page.

uptime

$ uptime
23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

This command gives a quick view of the machine's load. On Linux, the load averages count the processes that want to run on a CPU as well as the processes blocked in uninterruptible IO (process state D). They give a high-level picture of resource demand.

The three numbers are the load averages over the last 1, 5, and 15 minutes. Together they show whether the load on the server is rising or easing. If the 1-minute average is high but the 15-minute average is low, the load has risen recently and you need to investigate where the CPU resources are going. Conversely, if the 15-minute average is high and the 1-minute average is low, the period of CPU pressure may already have passed.

In the example output above, the 1-minute load average is very high and well above the 15-minute average, so we need to keep investigating which processes are consuming so many resources. The vmstat, mpstat, and other commands described below help with that.
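The comparison between the 1-minute and 15-minute averages can be sketched in a few lines of Python. This is only an illustration: the function names and the 10% tolerance band are assumptions, not something the original article prescribes.

```python
import re


def parse_load_averages(uptime_line):
    """Extract the 1-, 5- and 15-minute load averages from `uptime` output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found")
    return tuple(float(x) for x in match.groups())


def load_trend(one, fifteen, tolerance=0.1):
    """Compare the 1-minute and 15-minute averages.

    `tolerance` is a hypothetical 10% band so small wiggles are not
    reported as a trend.
    """
    if one > fifteen * (1 + tolerance):
        return "rising"
    if one < fifteen * (1 - tolerance):
        return "easing"
    return "steady"


sample = "23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02"
one, five, fifteen = parse_load_averages(sample)
print(load_trend(one, fifteen))  # rising
```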

dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.

This command prints the last 10 lines of the kernel message buffer. In the example output, you can see the kernel OOM killer terminating a process and TCP dropping a request because of possible SYN flooding. Messages like these can point directly at the cause of a performance problem, so don't skip this step.
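If you want to scan a longer log for the same kinds of messages, a small filter along these lines may help. It is a sketch only: the two patterns are taken from the sample output above, and the labels and function name are made up for illustration.

```python
import re

# Patterns taken from the sample output above; extend the table as needed.
PATTERNS = {
    "oom": re.compile(r"invoked oom-killer|Out of memory", re.IGNORECASE),
    "syn_flood": re.compile(r"Possible SYN flooding", re.IGNORECASE),
}


def classify_dmesg(lines):
    """Return (label, line) pairs for lines that match a known pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line))
                break
    return hits


sample = [
    "[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0",
    "[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child",
    "[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB",
    "[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.",
]
for label, line in classify_dmesg(sample):
    print(label, "->", line)
```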

vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5    6   10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284 4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0 9501 2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900 2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898 4840 98  1  1  0  0
^C

Each line of vmstat(8) output reports key system-wide statistics, which give a more detailed picture of system state. The trailing argument 1 makes vmstat print a new line once per second. The header row names each column. The columns most relevant to performance tuning are:

  • r: The number of processes running on a CPU or waiting for one. This is a better indicator of CPU load than the load average, because it does not include processes waiting for IO. If this value is greater than the number of CPU cores, the CPUs are saturated.
  • free: The amount of free memory, in kilobytes. Running short of free memory also causes performance problems. The free command described below gives a more detailed view of memory usage.
  • si, so: Swap-ins and swap-outs, that is, memory read back from and written out to the swap area. If these values are not 0, the system is swapping and the machine is short of physical memory.
  • us, sy, id, wa, st: The breakdown of CPU time, averaged across all CPUs: user time (us), system (kernel) time (sy), idle time (id), IO wait time (wa), and stolen time (st, generally time consumed by other virtual machines).

These CPU time columns show at a glance whether the CPUs are busy. In general, if the sum of user time and system time is very large, the CPUs are busy executing instructions. If the IO wait time stays high, the bottleneck may be disk IO.

In the example output, most of the CPU time is spent in user mode, that is, in application code. This is not necessarily a performance problem; it needs to be read together with the r column.
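The rules of thumb above can be applied to a vmstat row mechanically. The sketch below assumes the 17-column layout shown in the sample header. The 90% cut-off for "very large" user plus system time is a hypothetical value chosen for illustration.

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat_row(row):
    """Map one vmstat data row onto the column names from its header."""
    values = [int(v) for v in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError("unexpected number of columns")
    return dict(zip(VMSTAT_COLUMNS, values))


def vmstat_findings(stats, cpu_count):
    """Apply the rules of thumb from the text to one parsed row."""
    findings = []
    if stats["r"] > cpu_count:
        findings.append("CPU saturated: run queue exceeds CPU count")
    if stats["si"] > 0 or stats["so"] > 0:
        findings.append("swapping: physical memory is short")
    if stats["us"] + stats["sy"] >= 90:  # hypothetical threshold
        findings.append("CPU busy executing instructions")
    return findings


row = "34  0    0 200889792  73708 591828    0    0     0     5    6   10 96  1  3  0  0"
print(vmstat_findings(parse_vmstat_row(row), cpu_count=32))
```

On the 32-CPU machine from the sample, this row is reported as both saturated (r is 34) and busy, while the later rows with r equal to 32 are reported only as busy.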

mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)
07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00    0.00    0.00    0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00    0.00    0.00    0.00   0.99
07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   2.00
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00    0.00    0.00    0.00   3.03
[...]

This command shows the CPU time breakdown for each CPU. If a single CPU shows much higher utilization than the others, a single-threaded application may be the cause.
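One way to spot such an imbalance is to compare each CPU's busy percentage with the average. The sketch below is illustrative only: hot_cpus is a hypothetical helper, and the 30-point margin is an arbitrary assumption.

```python
def hot_cpus(idle_by_cpu, margin=30.0):
    """Return CPUs whose busy % exceeds the mean busy % by more than `margin`.

    `idle_by_cpu` maps a CPU id to its %idle value; `margin` is a
    hypothetical cut-off in percentage points.
    """
    if not idle_by_cpu:
        return []
    busy = {cpu: 100.0 - idle for cpu, idle in idle_by_cpu.items()}
    mean_busy = sum(busy.values()) / len(busy)
    return sorted(cpu for cpu, pct in busy.items() if pct - mean_busy > margin)


# %idle for CPUs 0-3 from the sample output: evenly loaded, nothing stands out.
print(hot_cpus({0: 0.99, 1: 2.00, 2: 1.00, 3: 3.03}))  # []

# A made-up imbalanced case: CPU 2 is pegged while the rest are mostly idle.
print(hot_cpus({0: 95.0, 1: 97.0, 2: 1.0, 3: 96.0}))  # [2]
```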

pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015    _x86_64_    (32 CPU)
07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat
07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
07:41:04 PM   108      6718    1.00    0.00    0.00    1.00     0  snmp-pass
07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
^C

The pidstat command prints CPU usage per process. Unlike top, it prints rolling output instead of clearing the screen, which makes it easier to watch how the system changes over time. In the output above, the two java processes each use nearly 1600% CPU. Because %CPU is summed across all CPUs, each process is consuming about 16 CPU cores.
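The conversion from %CPU to cores is simple arithmetic, sketched below on three rows from the sample output. The parser assumes the 12-hour time format shown above, so the time takes two whitespace-separated fields; the function names are hypothetical.

```python
def parse_pidstat_row(row):
    """Parse one pidstat data row (12-hour clock format, as in the sample)."""
    fields = row.split()
    # fields: time, AM/PM, UID, PID, %usr, %system, %guest, %CPU, CPU, Command
    return {
        "pid": int(fields[3]),
        "cpu_pct": float(fields[7]),
        "command": fields[9],
    }


def cores_used(cpu_pct):
    """%CPU is summed over all CPUs, so 100% corresponds to one full core."""
    return cpu_pct / 100.0


rows = [
    "07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave",
    "07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java",
    "07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java",
]
procs = sorted(map(parse_pidstat_row, rows), key=lambda p: p["cpu_pct"], reverse=True)
for proc in procs:
    print(proc["pid"], proc["command"], f"{cores_used(proc['cpu_pct']):.1f} cores")
```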

iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21
Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
[...]
^C

The iostat command is mainly used to view the IO status of the machine's disks. The most important columns are:

  • r/s, w/s, rkB/s, wkB/s: The number of reads and writes per second and the kilobytes read and written per second, respectively. An excessive read or write load can cause performance problems.
  • await: The average time an IO operation takes, in milliseconds. This is the time the application spends on the disk interaction, including both time queued and time being serviced. If this value is much larger than expected, the device may be saturated or faulty.
  • avgqu-sz: The average number of requests queued to the device. A value greater than 1 can indicate saturation, although devices can often serve requests in parallel, especially virtual devices that front multiple back-end disks.
  • %util: Device utilization, the percentage of time the device was busy. As a rule of thumb, values above 60% tend to hurt IO performance (which shows up in await), and values close to 100% usually mean the device is saturated.

If the device shown is a logical device that fronts several back-end disks, 100% utilization does not necessarily mean the underlying hardware is saturated. Also note that poor IO performance does not necessarily mean poor application performance: strategies such as read-ahead and write buffering can keep the application from waiting on disk IO.
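The column checks described above can be sketched as a small filter over iostat device rows. The column list matches the sample header, and the avgqu-sz and %util limits come from the text. The 100 ms await limit is a hypothetical value, since what counts as too slow depends on the device.

```python
IOSTAT_COLUMNS = ("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz "
                  "await r_await w_await svctm %util").split()


def parse_iostat_row(row):
    """Return (device, {column: value}) for one extended-stats device row."""
    device, *values = row.split()
    return device, dict(zip(IOSTAT_COLUMNS, map(float, values)))


def iostat_findings(stats, await_ms=100.0):
    """Flag a device using the rules of thumb from the text.

    `await_ms` is a hypothetical threshold; tune it for your hardware.
    """
    findings = []
    if stats["await"] > await_ms:
        findings.append("high await")
    if stats["avgqu-sz"] > 1:
        findings.append("queue depth above 1")
    if stats["%util"] > 60:
        findings.append("utilization above 60%")
    return findings


rows = [
    "xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25",
    "dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00",
]
for row in rows:
    device, stats = parse_iostat_row(row)
    print(device, iostat_findings(stats))
```

In the sample, only dm-1 is flagged, because of its await of about 346 ms. Its utilization is close to zero, which is a reminder to read these columns together rather than one at a time.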

free -m

$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0

The free command shows system memory usage; the -m option displays the values in megabytes. The last two columns show the memory used for the buffer cache (block device IO) and for the file system page cache, respectively. The second line, -/+ buffers/cache, shows used and free memory with buffers and cache counted as free rather than used. At first glance the caches can seem to take up a lot of memory, but this is by design: Linux uses otherwise idle memory for caching, and when an application needs memory, the cache is reclaimed and handed over quickly. This part of memory is therefore generally counted as available.

If available memory is very low, the system may start using the swap area (if one is configured), which adds IO overhead (visible in iostat) and degrades system performance.
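The -/+ buffers/cache line can be recomputed from the Mem: line, which makes the relationship explicit. The sketch below uses the megabyte values from the sample; adjust_for_cache is a hypothetical name. The recomputed "used" figure differs from the sample by 1 MB, presumably because free works in kilobytes and rounds each value separately.

```python
def adjust_for_cache(used, free, buffers, cached):
    """Recompute the -/+ buffers/cache line from the Mem: line (all in MB)."""
    return used - buffers - cached, free + buffers + cached


# Values from the Mem: line of the sample output.
used_adj, free_adj = adjust_for_cache(used=24545, free=221453, buffers=59, cached=541)
print(used_adj, free_adj)  # 23945 222053
```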

sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015     _x86_64_    (32 CPU)
12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
^C

Here the sar command is used to view network interface throughput. When troubleshooting performance problems, you can use this to judge whether a network interface is saturated. In the example output, eth0 is receiving about 22 Mbytes/s, which is 176 Mbits/sec, well below a 1 Gbit/sec hardware limit.
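The unit conversion behind that estimate is shown below as a short sketch. It assumes 1 kB = 1000 bytes so that the result matches the rounding in the text, and it assumes a 1 Gbit/s link; both are assumptions to adjust for your sysstat version and hardware.

```python
def kb_per_s_to_mbit_per_s(kb_per_s):
    """Convert a sar rxkB/s or txkB/s value to Mbit/s.

    Assumes 1 kB = 1000 bytes; with 1024-byte units the result would be
    about 2.4% higher.
    """
    return kb_per_s * 8 / 1000.0


def link_utilization(kb_per_s, link_mbit_per_s=1000.0):
    """Fraction of an assumed link speed (default: 1 Gbit/s) in use."""
    return kb_per_s_to_mbit_per_s(kb_per_s) / link_mbit_per_s


rx = 21999.10  # eth0 rxkB/s from the second sample interval
print(f"{kb_per_s_to_mbit_per_s(rx):.0f} Mbit/s, {link_utilization(rx):.1%} of 1 Gbit/s")
```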

sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015    _x86_64_    (32 CPU)
12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00  18846.00
12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00
12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00   8359.00   6039.00
12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
^C

Here the sar command is used to view key TCP metrics, including:

  • active/s: The number of locally initiated TCP connections per second, that is, connections created by a connect() call.
  • passive/s: The number of remotely initiated TCP connections per second, that is, connections accepted by an accept() call.
  • retrans/s: The number of TCP retransmissions per second.

The connection rates help determine whether a performance problem stems from too many connections being established, and whether those connections are initiated locally or accepted from remote hosts. TCP retransmissions can indicate a poor network environment or a server so overloaded that it is dropping packets.
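Because sar prints the TCP and ETCP metrics as separate header and value lines, pairing them up by name makes them easier to inspect. The sketch below does that for the first sample interval. The helper names and the "mostly outbound" and "mostly inbound" labels are made up for illustration, and the parser assumes the 12-hour time format shown above.

```python
def parse_sar_block(header, values):
    """Pair the metric names in a sar header line with the following value line."""
    names = header.split()[2:]  # skip the time and AM/PM tokens
    numbers = [float(v) for v in values.split()[2:]]
    return dict(zip(names, numbers))


def connection_summary(tcp):
    """Say which side opens most of the new connections."""
    if tcp["active/s"] == tcp["passive/s"] == 0:
        return "no new connections"
    return "mostly outbound" if tcp["active/s"] >= tcp["passive/s"] else "mostly inbound"


tcp = parse_sar_block(
    "12:17:19 AM  active/s passive/s    iseg/s    oseg/s",
    "12:17:20 AM      1.00      0.00  10233.00  18846.00",
)
etcp = parse_sar_block(
    "12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s",
    "12:17:20 AM      0.00      0.00      0.00      0.00      0.00",
)
print(connection_summary(tcp), "| retrans/s =", etcp["retrans/s"])
```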

top

$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2  29812:58 java
  4213 root      20   0 2722544  64640  44232 S  23.5  0.0 233:35.37 mesos-slave
 66128 titancl+  20   0   24344   2332   1172 R   1.0  0.0   0:00.07 top
  5235 root      20   0 38.227g 547004  49996 S   0.7  0.2   2:02.74 java
  4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1  33:14.42 java
     1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
     3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
     8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched

The top command covers many of the metrics checked by the previous commands, such as system load (uptime), memory usage (free), and CPU usage (vmstat). It therefore gives a fairly comprehensive view of where the load comes from. It also supports sorting by different columns, which makes it easy to find the processes using the most memory or the most CPU.

However, unlike the rolling output of the previous commands, top shows a snapshot that is overwritten on every refresh, so clues are easy to miss if you are not watching closely. You may need to pause the refresh to record and compare the data.

Summary

There are many tools for troubleshooting Linux server performance problems. The commands described above help locate a problem quickly. In the example output, for instance, several pieces of evidence show that java processes are consuming a large share of the CPU, so the next step would be performance tuning of that application.
