Linux performance monitoring: CPU, Memory, IO, Network


index    | good-state indicator                                  | tool
---------|-------------------------------------------------------|--------------------------------------------
CPU      | usr <= 70%, sys <= 35%, usr + sys <= 70%              | top
memory   | si == so == 0, free >= 30%                            | vmstat 1; free; /proc/meminfo
disk IO  | iowait% < 20%                                         | iostat -x
network  | UDP: no buffer backlog, no packet loss;               | netstat -lunp; netstat -su; /proc/net/snmp
         | TCP: retransmission rate                              |

1. CPU

1. Good condition indicator

  • CPU utilization: User Time <= 70%, System Time <= 35%, User Time + System Time <= 70%
  • Context switches: evaluate together with CPU utilization; a large number of context switches is acceptable as long as CPU utilization stays healthy
  • Runnable Queue: Runnable Queue <= 3 threads per processor

2. Monitoring tools

  • top, the most commonly used
  • vmstat
$ vmstat 1 (the argument 1 means print a report every second)
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
14  0    140 2904316 341912 3952308  0    0     0   460 1106 9593 36 64  1  0  0
17  0    140 2903492 341912 3951780  0    0     0     0 1037 9614 35 65  1  0  0
20  0    140 2902016 341912 3952000  0    0     0     0 1046 9739 35 64  1  0  0
17  0    140 2903904 341912 3951888  0    0     0    76 1044 9879 37 63  0  0  0
16  0    140 2904580 341912 3952108  0    0     0     0 1055 9808 34 65  1  0  0

Important parameters:

  • r: run queue, the number of threads in the runnable queue; these threads are runnable but no CPU is available yet
  • b: the number of blocked processes, waiting on IO requests
  • in: interrupts, the number of interrupts handled
  • cs: context switches performed on the system
  • us: percentage of CPU time spent in user space
  • sy: percentage of CPU time spent in the kernel and handling interrupts
  • id: percentage of CPU time completely idle

The above example gives:

sy is high while us is low, combined with a high context-switch rate (cs), which indicates that the application is making a large number of system calls;

On this 4-core machine, r should stay within 4 × 3 = 12; here r is 14 and above, so the CPU is heavily loaded.

  • View the CPU resources occupied by a process

$ while :; do ps -eo pid,ni,pri,pcpu,psr,comm | grep 'db_server_login'; sleep 1; done
  PID  NI PRI %CPU PSR COMMAND
28577   0  23  0.0   0 db_server_login
28578   0  23  0.0   3 db_server_login
28579   0  23  0.0   2 db_server_login
28581   0  23  0.0   2 db_server_login
28582   0  23  0.0   3 db_server_login
28659   0  23  0.0   0 db_server_login
……
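The "r <= 3 threads per processor" rule can be sketched as a small check; a Python sketch that parses a vmstat data row (the function name and sample line handling are illustrative, not part of vmstat):

```python
def run_queue_ok(vmstat_line: str, cpus: int) -> bool:
    """Rule of thumb from above: the runnable queue (first vmstat column)
    should not exceed 3 threads per processor."""
    r = int(vmstat_line.split()[0])
    return r <= 3 * cpus

# A data row from the vmstat output above (4-core machine): r = 14 > 12.
line = "14  0    140 2904316 341912 3952308  0    0     0   460 1106 9593 36 64  1  0  0"
print(run_queue_ok(line, cpus=4))  # False: the CPU is overloaded
```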

2. Memory

1. Good condition indicator

  • swap in (si) == 0,swap out (so) == 0
  • Memory actually used by applications / total physical memory <= 70% (i.e. free memory >= 30%)

2. Monitoring tools

  • vmstat
$ vmstat 1

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 2085940 402932 245856 8312440    1    1    15   172    0    0  7  5 88  0  0
 1  0 2085940 400196 245856 8315148    0    0   532     0 4828 9087 30  6 63  1  0
 2  0 2085940 397716 245868 8317908    0    0   464   552 4311 8427 29  6 65  0  0
 2  0 2085940 393144 245876 8322484    0    0   584  2556 4620 9054 30  5 65  0  0
 2  0 2085940 389368 245876 8325408    0    0   592     0 4460 8779 30  5 65  0  0

Important parameters:

  • swpd: amount of SWAP space in use, in KB
  • free: available physical memory, in KB
  • buff: physical memory used to buffer block-device read/write operations, in KB
  • cache: physical memory used to cache process address space and file data, in KB
  • si: amount of data read from SWAP into RAM (swap in), in KB
  • so: amount of data written from RAM to SWAP (swap out), in KB

The next trace tells a different story: free physical memory shows essentially no significant change while swpd increases steadily. The system keeps only about 2.5MB free on this 256MB machine, and once dirty pages reach about 10% it starts using swap heavily (note the large si/so columns).

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0  3 252696   2432    268   7148 3604 2368  3608  2372  288  288  0  0 21 78  1
0  2 253484   2216    228   7104 5368 2976  5372  3036  930  519  0  0  0 100  0
0  1 259252   2616    128   6148 19784 18712 19784 18712 3821 1853  0  1  3 95  1
1  2 260008   2188    144   6824 11824 2584 12664  2584 1347 1174 14  0  0 86  0
2  1 262140   2964    128   5852 24912 17304 24952 17304 4737 2341 86 10  0  0  4
^C
  • free
$ free
             total       used       free     shared    buffers     cached
Mem:      15728640   15600636     128004          0          0   13439080
-/+ buffers/cache:    2161556   13567084
Swap:      2104508     276276    1828232

Unit KB. Output in MB can be requested with the -m option.

We use names such as total1, used1, free1, used2, free2 for the statistics above, where the suffixes 1 and 2 denote the first and second data rows, respectively.

  • total1: total physical memory.
  • used1: total memory allocated (including buffers and cache), some of which may not be actively in use.
  • free1: unallocated memory.
  • shared1: shared memory; generally not used by the system, not discussed here.
  • buffers1: buffers allocated by the system but not currently in use.
  • cached1: cache allocated by the system but not currently in use. The difference between buffer and cache is described later.
  • used2: the buffers and cache actually in use, i.e. the total amount of memory actually used.
  • free2: the sum of unused buffers, unused cache, and unallocated memory; this is the memory actually available to the system.
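The relationship between the two rows is simple arithmetic; a sketch in Python using the numbers from the free output above (the function name is illustrative):

```python
def second_row(total, used, free, buffers, cached):
    """Derive the '-/+ buffers/cache' row of `free` from the Mem row:
    used2 subtracts reclaimable buffers/cache, free2 adds them back."""
    used2 = used - buffers - cached
    free2 = free + buffers + cached
    return used2, free2

# Values from the sample `free` output above (KB).
used2, free2 = second_row(15728640, 15600636, 128004, 0, 13439080)
print(used2, free2)  # 2161556 13567084, matching the -/+ buffers/cache row
```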

The difference between cache and buffer:

Cache: Cache is a small but high-speed memory located between the CPU and main memory.

Since the speed of the CPU is much higher than that of the main memory, the CPU needs to wait for a certain period of time to directly access data from the memory. The Cache stores a part of the data that the CPU has just used or recycled. When the CPU uses this part of the data again, it can be retrieved from the Cache In this way, the waiting time of the CPU is reduced and the efficiency of the system is improved.

Cache is further divided into Level 1 Cache (L1 Cache) and Level 2 Cache (L2 Cache). L1 Cache is integrated inside the CPU. In the early days, L2 Cache was usually soldered onto the motherboard; now it is also integrated inside the CPU, with common capacities of 256KB or 512KB.

Buffer: an area used to hold data transferred between devices with different speeds or priorities. A buffer reduces the time processes spend waiting on each other, so that while data is being read from a slow device, the fast device's work is not interrupted.

Buffer and cache in free (both occupy physical memory):

buffer: memory used as the buffer cache, the read/write buffer for block devices

cache: memory used as the page cache, the cache for the file system

A large cache value means many files are cached. If frequently accessed files can be served from the cache, disk read IO stays very small.

  • cat /proc/meminfo

$ cat /proc/meminfo
MemTotal:     15728640 kB
MemFree:        116196 kB
Buffers:             0 kB
Cached:       13448268 kB
……

This server has about 15GB of physical memory (MemTotal), about 113MB of free memory (MemFree), essentially nothing in the block-device buffer cache (Buffers), and about 12.8GB of file cache (Cached). The same values can be obtained with the free command; note that the unit is kB.

3. Disk IO

1. Good condition indicator

  • iowait% < 20%

A simple way to improve the cache hit rate is to enlarge the file cache area: the larger the cache, the more pages it holds and the higher the hit rate. The Linux kernel tries to serve reads via minor page faults (from the file cache) and to avoid major page faults (reads from disk), so the file cache grows over time; Linux only begins freeing unused pages when the system is left with little free physical memory.

2. Monitoring tools

  • sar
$ sar -d 2 3 (3 times in 2 seconds)
Linux 3.10.83-1-tlinux2-0021.tl1 (xgame_9_zone1)        06/22/17        _x86_64_        (48 CPU)

16:50:05          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
16:50:07       dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

16:50:07          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
16:50:09       dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

16:50:09          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
16:50:11       dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:       dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Important parameters:

  • await: average wait time for each device I/O operation, in milliseconds
  • svctm: average service time per device I/O operation, in milliseconds
  • %util: percentage of elapsed time during which the device was busy with I/O

If the value of svctm is close to await, it means that there is almost no I/O waiting, and the disk performance is good. If the value of await is much higher than the value of svctm, it means that the I/O queue is waiting too long, and the application running on the system will slow down.

If %util is close to 100%, it means that the disk generates too many I/O requests, the I/O system is already working at full capacity, and the disk may have a bottleneck.
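These rules of thumb can be sketched as a small helper; the exact thresholds (90% for %util, 2× for await vs svctm) are illustrative assumptions, not fixed values:

```python
def disk_health(await_ms: float, svctm_ms: float, util_pct: float) -> str:
    """Classify disk state from sar/iostat fields using the rules above."""
    if util_pct >= 90:              # %util close to 100%: device saturated
        return "saturated"
    if await_ms > 2 * svctm_ms:     # await much higher than svctm: queueing dominates
        return "long I/O queue"
    return "healthy"

# svctm close to await and low %util: almost no I/O waiting.
print(disk_health(await_ms=0.5, svctm_ms=0.47, util_pct=0.0))  # healthy
```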

  • iostat

$ iostat -x (option -x displays extended IO statistics)
Linux 3.10.83-1-tlinux2-0021.tl1 (xgame_9_zone1)        06/22/17        _x86_64_        (48 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.46    0.00    2.75    0.01    0.00   94.78

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00   153.95     0.00    1.02   0.47   0.00

Use iotop to see which processes are generating IO (analogous to watching CPU with top). The files opened by a process are listed under /proc/${pid}/fd.

4. Network IO

for UDP

1. Good condition indicator

The receive and send buffers should not hold packets waiting to be processed for any length of time

2. Monitoring tools

  • netstat

For UDP services, view the status of all listening UDP ports:

$ watch netstat -lunp

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
udp        0      0 0.0.0.0:64000           0.0.0.0:*                           -
udp        0      0 0.0.0.0:38400           0.0.0.0:*                           -
udp        0      0 0.0.0.0:38272           0.0.0.0:*                           -
udp        0      0 0.0.0.0:36992           0.0.0.0:*                           -
udp        0      0 0.0.0.0:17921           0.0.0.0:*                           -
udp        0      0 0.0.0.0:11777           0.0.0.0:*                           -
udp        0      0 0.0.0.0:14721           0.0.0.0:*                           -
udp        0      0 0.0.0.0:36225           0.0.0.0:*                           -

It is normal for Recv-Q and Send-Q to be 0, or to be non-zero only briefly.

For UDP services, check for packet loss (i.e. the NIC received the packet, but the application layer failed to process it in time):

$ watch netstat -su
Udp:
    278073881 packets received
    4083356897 packets to unknown port received.
    2474435364 packet receive errors
    1079038030 packets sent

An increase in the value of packet receive errors indicates packet loss.
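Growth of that counter can be tracked programmatically; a sketch in Python that extracts it for trending (the parsing is naive and assumes the netstat -su format shown above):

```python
import re

def udp_receive_errors(netstat_su_output: str) -> int:
    """Extract the 'packet receive errors' counter from `netstat -su` output."""
    m = re.search(r"(\d+) packet receive errors", netstat_su_output)
    return int(m.group(1)) if m else 0

sample = """Udp:
    278073881 packets received
    4083356897 packets to unknown port received.
    2474435364 packet receive errors
    1079038030 packets sent
"""
print(udp_receive_errors(sample))  # 2474435364
# Sample the counter twice; if the second reading is larger, packets are being dropped.
```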

For TCP (from Davidshan Shanwei's experience, thx~)

1. Good condition indicator

For TCP, packet loss due to insufficient buffer space is not the concern: even when packets are lost in the network for other reasons, the protocol layer's retransmission mechanism ensures they eventually reach the peer.

TCP monitoring therefore focuses on the retransmission rate.

2. Monitoring tools

Through snmp, you can view the sending and receiving packets of each layer of network protocols

$ cat /proc/net/snmp | grep Tcp
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 78447 413 50234 221 3 5984652 5653408 156800 0 849

Retransmission rate = RetransSegs / OutSegs
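A sketch in Python of this calculation, parsing the two Tcp lines shown above (the function name is illustrative):

```python
def tcp_retrans_rate(snmp_text: str) -> float:
    """Compute RetransSegs / OutSegs from the Tcp lines of /proc/net/snmp."""
    lines = [l for l in snmp_text.splitlines() if l.startswith("Tcp:")]
    header, values = lines[0].split()[1:], lines[1].split()[1:]
    stats = dict(zip(header, (int(v) for v in values)))
    return stats["RetransSegs"] / stats["OutSegs"]

# The two Tcp lines from the sample above.
sample = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts\n"
    "Tcp: 1 200 120000 -1 78447 413 50234 221 3 5984652 5653408 156800 0 849\n"
)
print(f"{tcp_retrans_rate(sample):.2%}")  # about 2.77%
```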

There is no universal healthy range for this value; it depends on the specific business.

The business side cares more about the response time.
