Check Linux Server Performance in One Minute with Ten Commands


When the load on a Linux server suddenly spikes and alert text messages are flooding your phone, how do you track down the performance problem in the shortest possible time? Brendan Gregg of the Netflix performance engineering team wrote a blog post on exactly this question.

Let's see how they diagnose machine performance problems in under a minute with ten commands.

  Overview

  By running the following commands, you can get a general picture of system resource usage within one minute.

  uptime

  dmesg | tail

  vmstat 1

  mpstat -P ALL 1

  pidstat 1

  iostat -xz 1

  free -m

  sar -n DEV 1

  sar -n TCP,ETCP 1

  top

  Let's go through these commands one by one. For more options and details, refer to each command's manual page.

  (1) uptime

  $ uptime

  23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02

  This command gives a quick view of the machine's load. On Linux, the load averages count the processes that are running or waiting for CPU time as well as processes blocked in uninterruptible IO (process state D). They give a high-level picture of system resource usage.

  The three numbers in the output are the load averages over the last 1, 5, and 15 minutes. Comparing them shows whether the load is rising or easing. If the 1-minute average is high but the 15-minute average is low, the server has only recently come under high load, and you need to investigate further where the CPU resources are going. Conversely, if the 15-minute average is high and the 1-minute average is low, the period of CPU pressure may already have passed.

  In the output of the above example, you can see that the average load in the last minute is very high, and it is much higher than the load in the last 15 minutes, so we need to continue to investigate what processes are consuming a lot of resources in the current system. You can use the vmstat, mpstat and other commands described below to further troubleshoot.
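  As a rough illustration of the comparison described above, the following Python sketch parses the load averages out of an uptime line and compares the 1-minute and 15-minute values. The function names and the plain greater-than comparison are assumptions made for this example; they are not part of uptime itself.

```python
import re


def parse_load_averages(uptime_line):
    """Return the (1-min, 5-min, 15-min) load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_trend(one_min, fifteen_min):
    """Classify the trend by comparing the 1-minute and 15-minute averages."""
    if one_min > fifteen_min:
        return "rising"
    if one_min < fifteen_min:
        return "easing"
    return "steady"


# Sample line taken from the uptime output shown above.
sample = "23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02"
one, five, fifteen = parse_load_averages(sample)
print(load_trend(one, fifteen))
```

  For the sample line the 1-minute value (30.02) is above the 15-minute value (19.02), so the sketch reports a rising load, which matches the reading given in the text.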

  (2) dmesg | tail

  $ dmesg | tail
  [1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
  [...]
  [1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
  [1880958.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
  [2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.

  This command prints the last 10 lines of the kernel message log. In the example output you can see a kernel OOM kill and TCP dropping a request. These messages can help troubleshoot performance issues, so don't skip this step.

  (3) vmstat 1

  $ vmstat 1

  procs -----------memory----------- ---swap-- -----io---- --system--- ------cpu-----
   r  b swpd      free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
  34  0    0 200889792  73708 591828    0    0     0     5     6    10 96  1  3  0  0
  [...]
  32  0    0 200889568  73712 591856    0    0     0    48 11900  2459 99  0  0  0  0
  32  0    0 200890208  73712 591860    0    0     0     0 15898  4840 98  1  1  0  0
  ^C

  With vmstat(8), each output line reports a set of core system indicators, giving a more detailed view of system status. The trailing argument 1 tells vmstat to print statistics once per second. The header row names each column. The columns most relevant to performance tuning are:

  r: The number of processes running on a CPU or waiting for CPU time. This is a better indicator of CPU load than the load average, because it does not include processes waiting for IO. If this value is greater than the number of CPU cores, the machine's CPU resources are saturated.

  free: The amount of available memory in the system (in kilobytes). If the remaining memory is insufficient, it will also cause system performance problems. The free command described below can provide a more detailed understanding of system memory usage.

  si, so: The amount of memory swapped in from disk and swapped out to disk per second. If these values are not 0, the system is actively swapping, which means the machine is short of physical memory.

  us, sy, id, wa, st: A breakdown of CPU time into user time (user), system (kernel) time (sys), idle time (idle), IO wait time (wait), and stolen time (stolen, generally consumed by other virtual machines).

  The above CPU time allows us to quickly understand whether the CPU is in a busy state. In general, if the sum of user time and system time is very large, the CPU is busy executing instructions. If the IO wait time is very long, then the bottleneck of the system may be the disk IO.

  As the example output shows, most of the CPU time is spent in user mode, that is, in application code. This is not necessarily a performance problem by itself; it needs to be read together with the r column.
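  The checks described for the r, si, and so columns can be sketched in Python as follows. This is only an illustration: the function names are made up for this example, the sample row is the first data row of the vmstat output above, and the CPU count of 32 is taken from the "(32 CPU)" header that the other commands in this article print.

```python
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
SAMPLE_ROW = "34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0"


def parse_vmstat_row(header_line, data_line):
    """Map vmstat column names to integer values for one sample row."""
    names = header_line.split()
    values = [int(value) for value in data_line.split()]
    if len(names) != len(values):
        raise ValueError("header and data row have different column counts")
    return dict(zip(names, values))


def cpu_saturated(row, cpu_count):
    """True when more processes are runnable than there are CPUs."""
    return row["r"] > cpu_count


def swapping(row):
    """True when any memory is being swapped in or out."""
    return row["si"] > 0 or row["so"] > 0


row = parse_vmstat_row(HEADER, SAMPLE_ROW)
print("CPU saturated:", cpu_saturated(row, cpu_count=32))
print("swapping:", swapping(row))
```

  On a live system the CPU count could come from os.cpu_count() instead of a hard-coded 32.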

  (4) mpstat -P ALL 1

  $ mpstat -P ALL 1

  Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

  07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
  07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00    0.00    0.00    0.00   0.78
  07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00    0.00    0.00    0.00   0.99
  07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   2.00
  07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   1.00
  07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00    0.00    0.00    0.00   3.03
  [...]

  This command shows the utilization of each CPU. If a single CPU is much busier than the others, the cause may be a single-threaded application.

  (5) pidstat 1

  $ pidstat 1

  Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

  07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
  07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
  07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
  07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
  07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
  07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
  07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat

  07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
  07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
  07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
  07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
  07:41:04 PM   108      6718    1.00    0.00    0.00    1.00     0  snmp-pass
  07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
  ^C

  The pidstat command reports CPU usage per process. It prints a rolling summary instead of clearing the screen, which makes it convenient for watching how the system changes over time. In the output above, the two java processes each use nearly 1600% CPU, which means each one is consuming about 16 CPU cores.
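  The conversion from a pidstat %CPU value to a number of cores is a simple division by 100, because %CPU is summed across all CPUs. A minimal Python sketch, using the two java PIDs and %CPU values from the second sample above (the function name is an assumption for this example):

```python
def cores_in_use(cpu_percent):
    """Convert a pidstat %CPU value (summed over all CPUs) into CPU cores."""
    return cpu_percent / 100.0


# (PID, %CPU) pairs for the two java processes in the 07:41:04 PM sample.
for pid, cpu_percent in [(6521, 1591.00), (6564, 1583.00)]:
    print(pid, round(cores_in_use(cpu_percent), 2))
```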

  (6) iostat -xz 1

  $ iostat -xz 1

  Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            73.96    0.00    3.73    0.03    0.06   22.21

  Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
  xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
  xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25
  xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
  dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
  [...]
  dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
  [...]
  ^C

  The iostat command is mainly used to examine disk IO. The key columns in its output are:

  r/s, w/s, rkB/s, wkB/s: The number of reads and writes per second, and the amount of data read and written per second (in kilobytes). An excessive IO volume may cause performance problems.

  await: The average time for an IO operation, in milliseconds. This is the time the application spends waiting when it interacts with the disk, including both queueing time and the actual service time. If this value is too large, the device may be saturated or faulty.

  avgqu-sz: The average number of requests queued to the device. If this value is greater than 1, the device may be saturated (although devices that front multiple back-end disks can usually process requests in parallel).

  %util: Device utilization, that is, how busy the device is. As a rule of thumb, values above 60% may degrade IO performance (check the average IO time in await). A value of 100% means the device is saturated.

  If the figures are for a logical device, its utilization does not necessarily mean that the physical devices behind it are saturated. Note also that poor IO performance does not necessarily mean poor application performance: strategies such as read-ahead and write caching can improve application performance.
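  The rules of thumb above (%util over 60, %util at 100, avgqu-sz over 1) can be sketched as a small Python check. The function name, the warning texts, and the two sample devices are hypothetical and exist only for this illustration; the thresholds are the ones quoted in the text and should be treated as starting points, not hard limits.

```python
def flag_device(stats, util_threshold=60.0, queue_threshold=1.0):
    """Return a list of warnings for one device's iostat -x values."""
    warnings = []
    if stats["%util"] >= 100.0:
        warnings.append("device saturated")
    elif stats["%util"] > util_threshold:
        warnings.append("high utilization")
    if stats["avgqu-sz"] > queue_threshold:
        warnings.append("requests queuing")
    return warnings


# Hypothetical values, not taken from the example output above.
busy_disk = {"avgqu-sz": 2.5, "%util": 85.0}
idle_disk = {"avgqu-sz": 0.0, "%util": 1.0}

print(flag_device(busy_disk))
print(flag_device(idle_disk))
```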

  (7) free -m

  $ free -m

               total       used       free     shared    buffers     cached
  Mem:        245998      24545     221453         83         59        541
  -/+ buffers/cache:      23944     222053
  Swap:            0          0          0

  The free command shows how system memory is being used. The -m option displays the values in megabytes. The last two columns, buffers and cached, show the memory used for the block device IO buffer cache and for the file system page cache, respectively. The caches may appear to take up a lot of memory, but this is simply how Linux manages memory: it uses otherwise idle memory for caching. If an application needs memory, the cache can be reclaimed and handed to the application quickly, so this memory is generally counted as available. The second line, -/+ buffers/cache, shows used and free memory with the caches counted as free.

  If available memory is very low, the system may start using the swap area (if one is configured), which increases IO overhead (visible in the iostat output) and reduces system performance.
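  The "free" value on the -/+ buffers/cache line is the sum of free, buffers, and cached from the Mem line. A minimal Python sketch of that arithmetic, reading the example's buffers and cached columns as 59 and 541 (the function name is an assumption for this example):

```python
def available_mb(free, buffers, cached):
    """Memory that is free or can be reclaimed from the caches, in MB."""
    return free + buffers + cached


# Values from the Mem: line of the example output above.
print(available_mb(free=221453, buffers=59, cached=541))  # prints 222053
```

  The result matches the 222053 shown on the -/+ buffers/cache line of the example.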

  (8) sar -n DEV 1

  $ sar -n DEV 1

  Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

  12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
  12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
  12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
  12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

  12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
  12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
  12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
  12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
  ^C

  Here the sar command is used to check network interface throughput. When troubleshooting, you can compare this throughput with the link capacity to judge whether the network device is saturated. In the example output, eth0 receives about 22 Mbytes/s, which is 176 Mbits/sec, well below the hardware limit of 1 Gbit/sec.
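  The unit conversion in the paragraph above can be sketched in Python. The function names are assumptions for this example, and so is the default of 1000 bytes per kB, which reproduces the article's rounded figure of 176 Mbits/sec; pass bytes_per_kb=1024 if your sar build reports kibibytes, which gives a slightly higher number.

```python
def kb_per_s_to_mbit_per_s(kb_per_s, bytes_per_kb=1000):
    """Convert a sar rxkB/s or txkB/s value into megabits per second."""
    return kb_per_s * bytes_per_kb * 8 / 1_000_000


def link_utilization(kb_per_s, link_mbit_per_s, bytes_per_kb=1000):
    """Fraction of the link capacity in use (0.0 to 1.0 and beyond)."""
    return kb_per_s_to_mbit_per_s(kb_per_s, bytes_per_kb) / link_mbit_per_s


# rxkB/s for eth0 in the second sample above, on an assumed 1 Gbit/sec link.
rx_mbit = kb_per_s_to_mbit_per_s(21999.10)
print(round(rx_mbit))
print(round(link_utilization(21999.10, 1000.0), 3))
```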

  (9) sar -n TCP,ETCP 1

  $ sar -n TCP,ETCP 1

  Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

  12:17:19 AM  active/s passive/s    iseg/s    oseg/s
  12:17:20 AM      1.00      0.00  10233.00  18846.00

  12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
  12:17:20 AM      0.00      0.00      0.00      0.00      0.00

  12:17:20 AM  active/s passive/s    iseg/s    oseg/s
  12:17:21 AM      1.00      0.00   8359.00   6039.00

  12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
  12:17:21 AM      0.00      0.00      0.00      0.00      0.00
  ^C

  Here the sar command is used to check TCP connection statistics, including:

  active/s: the number of locally initiated TCP connections per second, that is, connections created by the connect call;

  passive/s: the number of remotely initiated TCP connections per second, that is, connections created by the accept call;

  retrans/s: the number of TCP retransmissions per second.

  The TCP connection counts can be used to judge whether a performance problem is caused by too many connections being established, and further whether those are actively initiated or passively accepted connections. TCP retransmissions may be caused by a poor network environment or by an overloaded server dropping packets.

  (10) top

  $ top

  top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
  Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
  %Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
  KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2  29812:58 java
    4213 root      20   0 2722544  64640  44232 S  23.5  0.0 233:35.37 mesos-slave
   66128 titancl+  20   0   24344   2332   1172 R   1.0  0.0   0:00.07 top
    5235 root      20   0 38.227g 547004  49996 S   0.7  0.2   2:02.74 java
    4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1  33:14.42 java
       1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
       2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
       3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
       5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
       6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
       8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched

  The top command includes many of the metrics checked by the previous commands, such as system load (uptime), memory usage (free), and CPU usage (vmstat). It therefore gives a fairly comprehensive view of where the system load comes from. The top command also supports sorting by different columns, which makes it easy to find the processes using the most memory or the most CPU.

  However, compared to the previous commands, the output of the top command is an instantaneous value, and if you don't keep staring, you may miss some clues. At this time, it may be necessary to suspend the refresh of the top command to record and compare data.

  Summary

  There are many tools for troubleshooting Linux server performance problems. The commands introduced above can help us locate a problem quickly. For example, in the example output above, several pieces of evidence point to a java process consuming a large amount of CPU, so subsequent performance tuning can focus on that application.

