10 one-line commands to quickly locate performance bottlenecks in 60 seconds

Today we translate a post from the Netflix technology blog, Linux Performance Analysis in 60,000 Milliseconds, written by the well-known Linux kernel performance expert Brendan Gregg and the Netflix performance engineering team. The article teaches you how to use 10 common Linux tools to complete an initial diagnosis of a performance problem within 60 seconds.

When you log in to a Linux server to deal with a performance problem, what do you check in the first minute?

Netflix runs a massive fleet of EC2 cloud instances and has many tools for detecting and troubleshooting performance problems: cloud-wide monitoring tools such as Atlas and on-instance analysis tools such as Vector. These tools solve most performance problems, but sometimes we still need to log in to an instance and run some standard Linux performance tools.

Overview

In this article, the Netflix performance engineering team shows how to complete an initial performance investigation within 60 seconds using readily available Linux command-line tools. By running the following 10 commands, you can get a high-level view of system resource usage and running processes in under a minute. Look for error and saturation metrics first, since both are easy to interpret, then look at resource utilization. Saturation means a resource has more load than it can handle, and shows up either as the length of a request queue or as time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Note that some of these commands require the sysstat package (on most distributions, install it with apt-get install sysstat or yum install sysstat). The data these commands expose helps you apply the USE method: a methodology for locating performance bottlenecks by checking the utilization, saturation, and error metrics of every resource (CPUs, memory, disks, and so on). The commands also let you check all these places and, by a process of elimination, narrow down the scope and direction of subsequent analysis.

The following sections walk through each command with examples from a production system. For more detail about any of them, refer to its man page.

1. uptime

$ uptime 
23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02

uptime is a quick way to view the load averages. On Linux, these numbers are the average counts of tasks that are running, waiting to run, or blocked in uninterruptible sleep (usually disk I/O). They give a high-level idea of overall resource demand, but can't really be understood without other tools; still, uptime is well worth a quick glance.

So what do the three numbers mean? They are the 1-minute, 5-minute, and 15-minute load averages, computed as exponentially damped moving sums over those time windows. Together they show how load has changed over the recent past. For example, if you log in to check a reported performance problem and see load1 far below load15, you may have logged in too late and missed the issue.

In the example above, the load has been rising: load1 is up to 30 while load15 is only 19. That tells us something important — likely high CPU demand — which we confirm with vmstat and mpstat, commands 3 and 4 below.
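If you only need the raw numbers, the load averages can also be read straight from /proc/loadavg. A minimal sketch (the fourth field of that file is runnable/total tasks and the fifth is the most recently created PID):

```shell
# /proc/loadavg holds: load1 load5 load15 runnable/total last_pid
awk '{ printf "load1=%s load5=%s load15=%s\n", $1, $2, $3 }' /proc/loadavg
```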

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.

This command shows the last 10 kernel messages, if there are any. Look for errors that can cause performance problems. The example above includes the oom-killer terminating a process, and TCP dropping a request because of a possible SYN flood.

Don't skip this step: dmesg is always worth checking.

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5    6   10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284 4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0 9501 2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900 2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898 4840 98  1  1  0  0
^C

vmstat, short for virtual memory statistics, is available on virtually every Linux system (it was first created for BSD decades ago). It prints one line of key server statistics per interval.

The trailing argument 1 prints one-second summaries. Note that the first line of output shows averages since boot rather than the previous second, so skip the first line.

The meaning of each column

  • r: The number of processes running on a CPU plus those waiting for a turn. This is a better signal of CPU saturation than the load averages, because it does not include I/O-blocked tasks. An r value greater than the CPU core count means the CPUs are saturated.
  • free: Free memory in kB. If the number is large, you have plenty of free memory. free -m (command 7 below) shows the state of free memory more clearly.
  • si, so: Pages swapped in and out. If these are non-zero, the system is out of memory.
  • us, sy, id, wa, st: Breakdowns of CPU time, averaged across all CPUs: user time, system (kernel) time, idle, I/O wait, and stolen time (taken by other guests or, under Xen, by the guest's own isolated driver domain).

Adding user and system time tells you whether the CPUs are busy. A constant level of I/O wait points at a disk bottleneck: the CPUs are idle because tasks are blocked waiting for disk I/O. You can treat I/O wait as another form of CPU idle, one that gives a clue about why the CPUs are idle.

System (kernel) time is essential for I/O-heavy workloads. Average system time above 20% is worth exploring further: the kernel may be processing I/O inefficiently.

In the example above, CPU time is almost entirely user-level, pointing at the applications rather than the kernel. Average CPU utilization is well over 90%. That isn't necessarily a problem by itself; check the degree of saturation using the r column.
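vmstat's CPU columns are derived from the counters in /proc/stat. As a rough illustration of the same arithmetic (a sketch, not how vmstat is actually implemented), you can sample the aggregate cpu line twice and compute the busy percentage yourself:

```shell
# The "cpu" line of /proc/stat lists jiffies: user nice system idle iowait ...
# Two samples one second apart give the current busy percentage; a single
# sample would, like vmstat's first output line, cover time since boot.
read_cpu() {
    awk '/^cpu / { total = 0; for (i = 2; i <= NF; i++) total += $i; print total, $5 }' /proc/stat
}
s1=$(read_cpu); sleep 1; s2=$(read_cpu)
set -- $s1; t1=$1; i1=$2
set -- $s2; t2=$1; i2=$2
awk -v dt="$((t2 - t1))" -v di="$((i2 - i1))" \
    'BEGIN { if (dt > 0) printf "%.0f%% busy\n", 100 * (dt - di) / dt }'
```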

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00    0.00    0.00    0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00    0.00    0.00    0.00   0.99
07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   2.00
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00    0.00    0.00    0.00   3.03
[...]

This command breaks CPU time down per core, which lets you check whether the load is balanced across cores. A single hot core can indicate a single-threaded application.
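The per-core numbers come from the cpu0, cpu1, … lines in /proc/stat. A quick way to eyeball core balance when sysstat isn't installed (a rough sketch showing since-boot totals, not a one-second rate like mpstat):

```shell
# For each core, print the idle share of its jiffies since boot;
# one core far less idle than the rest is a hot spot.
awk '/^cpu[0-9]/ {
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    printf "%s: %d%% idle since boot\n", $1, (total ? 100 * $5 / total : 0)
}' /proc/stat
```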

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015    _x86_64_    (32 CPU)

07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat

07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
07:41:04 PM   108      6718    1.00    0.00    0.00    1.00     0  snmp-pass
07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
^C

pidstat is a bit like top's per-process statistics, but it prints a rolling summary instead of clearing the screen. This makes it easy to watch patterns change over time, and easy to copy and paste what you saw into your troubleshooting notes.

The example above identifies two java processes consuming most of the CPU: 1591% means that java process is consuming almost 16 cores.
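Since pidstat sums %CPU across all cores, values over 100% just mean more than one core; dividing by 100 converts the figure to a core count. For the example above:

```shell
# 1591 %CPU summed across all cores is roughly how many whole cores?
awk 'BEGIN { printf "%.1f cores\n", 1591 / 100 }'
# prints: 15.9 cores
```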

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
[...]
^C

iostat is the best tool for seeing block-device (disk) workload and performance. Look at:

  • r/s, w/s, rkB/s, wkB/s: Reads, writes, kilobytes read, and kilobytes written per second, as delivered to the device. These quantify the device workload; a performance problem may simply be caused by excessive load.
  • await: Average I/O time in milliseconds. This is the latency the application suffers, including both time queued and time being serviced. Larger-than-expected values are a sign of device saturation or device problems.
  • avgqu-sz: Average number of requests issued to the device. A value greater than 1 can indicate saturation (although devices typically service requests in parallel, especially virtual devices that front multiple back-end disks).
  • %util: Device utilization, the fraction of each second the device was actually doing work. Values over 60% typically lead to poor performance, though it depends on the device. Values close to 100% mean the device is saturated.

If the storage device is a logical device fronting many back-end physical disks, then 100% utilization may only mean that some I/O is being processed 100% of the time; the back-end disks may be far from their performance limits.

Bear in mind that poor disk I/O performance isn't necessarily an application problem. Many techniques, such as read-ahead and buffered writes, use asynchronous I/O so that the application doesn't block on, or directly suffer, the disk latency.
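iostat's per-device figures are computed from /proc/diskstats. For a rough since-boot view of how much each block device has read and written (a sketch of the underlying counters, not how iostat computes its per-second rates; sectors are 512 bytes in this file):

```shell
# /proc/diskstats: field 3 = device name, 6 = sectors read, 10 = sectors written
awk '{ printf "%-8s %10.0f MB read %10.0f MB written\n",
       $3, $6 * 512 / 1048576, $10 * 512 / 1048576 }' /proc/diskstats
```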

7. free -m

$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0

The two right-most columns are buffers and cached:

  • buffers: The buffer cache, used for block device I/O.
  • cached: The page cache, used by file systems.

We just want to check that these aren't near zero; near-zero values can lead to higher disk I/O (confirm with iostat) and worse performance. The example above looks fine, with many megabytes in each.

The "-/+ buffers/cache" line gives less confusing values for used and free memory. Linux uses free memory for caches but reclaims it quickly if applications need it, so cached memory should in a way be counted as free — which is exactly what this line does. This is confusing enough that there is a website about it: linuxatemyram.com/.

Things get more confusing if ZFS is in use, since ZFS has its own file system cache that free -m does not reflect. The system can appear to be short on free memory when that memory is in fact available on demand from the ZFS cache.
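On kernels 3.14 and later, /proc/meminfo exposes a MemAvailable field that already accounts for reclaimable caches, so you don't need to do the "-/+ buffers/cache" arithmetic yourself (a sketch; the field is absent on older kernels):

```shell
# Kernel's estimate of memory available to applications without swapping, in MB
awk '/^MemAvailable:/ { printf "%d MB available\n", $2 / 1024 }' /proc/meminfo
```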

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015     _x86_64_    (32 CPU)

12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
^C

Use this command to check network interface throughput: rxkB/s and txkB/s measure the workload, and also show whether any limit has been reached. In the example above, eth0 is receiving about 22 MB/s, i.e. 176 Mbit/s (well below a 1 Gbit/s limit).
This version also has a %ifutil column for interface utilization (maximum of both directions for full duplex), which is something we also measure with Brendan's nicstat tool. And, as with nicstat, it's hard to get right, and it doesn't seem to be working in this example (0.00).

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015    _x86_64_    (32 CPU)

12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00  18846.00

12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00

12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00   8359.00   6039.00

12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
^C

This command views key TCP metrics, including:

  • active/s: Locally initiated TCP connections per second (e.g., via connect()).
  • passive/s: Remotely initiated TCP connections per second (e.g., via accept()).
  • retrans/s: TCP retransmits per second.

The active and passive counts are often useful as a rough measure of server load: newly accepted connections (passive) and downstream connections made (active). It can help to think of active as outbound and passive as inbound, but that's not strictly true (consider a localhost-to-localhost connection).
Retransmits are a sign of a network or server problem: either an unreliable network (e.g., the public Internet) or an overloaded server dropping packets. The example above shows just one new connection per second.

10. top

$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2  29812:58 java
  4213 root      20   0 2722544  64640  44232 S  23.5  0.0 233:35.37 mesos-slave
 66128 titancl+  20   0   24344   2332   1172 R   1.0  0.0   0:00.07 top
  5235 root      20   0 38.227g 547004  49996 S   0.7  0.2   2:02.74 java
  4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1  33:14.42 java
     1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
     3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
     8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched

The top command includes many of the metrics we've already checked; it's convenient to run it and see whether anything looks wildly different from the earlier commands.
A downside of top is that it refreshes the screen rather than scrolling, so tools like vmstat and pidstat make it easier to see patterns over time. Evidence of an intermittent issue can also be lost if you don't pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue) before the screen is cleared. (top -b prints batch, scrolling output if you need it.)

Follow-on analysis

There are many more command-line tools and methodologies you can apply to dig deeper. See Brendan's Linux Performance Tools guide, which lists over 40 tools covering observability, benchmarking, tuning, static tuning, profiling, and tracing.

Origin blog.csdn.net/xindoo/article/details/104182666