Linux performance monitoring: CPU, Memory, IO, Network
|  | Good condition index | Monitoring tool |
|---|---|---|
| CPU | usr <= 70%, sys <= 35%, usr + sys <= 70% | top; vmstat |
| Memory | si == so == 0; free memory >= 30% | vmstat 1; free; /proc/meminfo |
| Disk IO | iowait% < 20% | iostat -x; sar -d |
| Network | UDP: buffers not backed up, no packet loss; TCP: low retransmission rate | netstat -lunp; netstat -su; /proc/net/snmp |
1. CPU
1. Good condition indicator
- CPU utilization: User Time <= 70%, System Time <= 35%, User Time + System Time <= 70%
- Context switches: correlated with CPU utilization; a large number of context switches is acceptable as long as CPU utilization is otherwise healthy
- Runnable Queue: Runnable Queue <= 3 threads per processor
2. Monitoring tools
- top most commonly used
- vmstat
$ vmstat 1    (1 means output once every 1s)
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free    buff    cache  si  so  bi  bo   in    cs us sy id wa st
14  0    140 2904316  341912 3952308   0   0   0 460 1106  9593 36 64  1  0  0
17  0    140 2903492  341912 3951780   0   0   0   0 1037  9614 35 65  1  0  0
20  0    140 2902016  341912 3952000   0   0   0   0 1046  9739 35 64  1  0  0
17  0    140 2903904  341912 3951888   0   0   0  76 1044  9879 37 63  0  0  0
16  0    140 2904580  341912 3952108   0   0   0   0 1055  9808 34 65  1  0  0
Important parameters:
- r: run queue, the number of runnable threads waiting for a CPU
- b: the number of blocked processes waiting on IO requests
- in: interrupts, the number of interrupts processed
- cs: context switch, the number of context switches performed on the system
- us: the percentage of CPU time spent in user space
- sy: the percentage of CPU time spent in the kernel and in interrupts
- id: the percentage of CPU time completely idle
The above example gives:
sy is high while us is low, together with frequent context switching (cs), indicating that the application makes a large number of system calls;
On this 4-core machine r should stay within 12 (3 per core); here r is 14 or more, so the CPU is heavily loaded.
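The checks above can be sketched as a small helper; `check_cpu` and its thresholds are illustrative (taken from the indicators in this section), not part of any standard tool:

```python
# Minimal sketch: flag CPU problems from one vmstat data line.
# Field order follows `vmstat 1`: r b swpd free buff cache si so bi bo in cs us sy id wa st
def check_cpu(vmstat_line, ncpu):
    f = [int(x) for x in vmstat_line.split()]
    r, us, sy = f[0], f[12], f[13]
    issues = []
    if r > 3 * ncpu:            # runnable queue <= 3 threads per processor
        issues.append("run queue too long")
    if us + sy > 70:            # User Time + System Time <= 70%
        issues.append("CPU busy")
    if sy > us:                 # kernel time dominates: many system calls?
        issues.append("high system time")
    return issues

# First sample line of the output above, on an assumed 4-core machine:
line = "14 0 140 2904316 341912 3952308 0 0 0 460 1106 9593 36 64 1 0 0"
print(check_cpu(line, 4))   # all three rules fire for this sample
```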
- View the CPU resources occupied by a process
$ while :; do ps -eo pid,ni,pri,pcpu,psr,comm | grep 'db_server_login'; sleep 1; done
  PID  NI PRI %CPU PSR COMMAND
28577   0  23  0.0   0 db_server_login
28578   0  23  0.0   3 db_server_login
28579   0  23  0.0   2 db_server_login
28581   0  23  0.0   2 db_server_login
28582   0  23  0.0   3 db_server_login
28659   0  23  0.0   0 db_server_login
……
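To see how a process's instances are spread across processors, the PSR column can be summarized; `psr_distribution` is an illustrative helper, and the sample text is the ps output above:

```python
# Minimal sketch: count how many matching processes run on each processor (PSR),
# given `ps -eo pid,ni,pri,pcpu,psr,comm` output.
from collections import Counter

ps_output = """\
28577 0 23 0.0 0 db_server_login
28578 0 23 0.0 3 db_server_login
28579 0 23 0.0 2 db_server_login
28581 0 23 0.0 2 db_server_login
28582 0 23 0.0 3 db_server_login
28659 0 23 0.0 0 db_server_login"""

def psr_distribution(text, comm):
    counts = Counter()
    for line in text.splitlines():
        pid, ni, pri, pcpu, psr, name = line.split()
        if name == comm:
            counts[int(psr)] += 1
    return dict(counts)

print(psr_distribution(ps_output, "db_server_login"))
```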
2. Memory
1. Good condition indicator
- swap in (si) == 0,swap out (so) == 0
- Memory used by applications / system physical memory <= 70% (i.e. free memory >= 30%)
2. Monitoring tools
- vmstat
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b    swpd   free   buff    cache  si  so  bi   bo   in   cs us sy id wa st
 1  0 2085940 402932 245856 8312440   1   1  15  172    0    0  7  5 88  0  0
 1  0 2085940 400196 245856 8315148   0   0 532    0 4828 9087 30  6 63  1  0
 2  0 2085940 397716 245868 8317908   0   0 464  552 4311 8427 29  6 65  0  0
 2  0 2085940 393144 245876 8322484   0   0 584 2556 4620 9054 30  5 65  0  0
 2  0 2085940 389368 245876 8325408   0   0 592    0 4460 8779 30  5 65  0  0
Important parameters:
- swpd: the size of the used swap space, in KB
- free: the available physical memory, in KB
- buff: physical memory used to buffer block-device reads and writes, in KB
- cache: physical memory used as page cache for file data, in KB
- si: data read from swap into RAM per second (swap in), in KB
- so: data written from RAM to swap per second (swap out), in KB
The above example gives:
The free physical memory shows no significant change while swpd grows gradually, indicating that the minimum free memory is kept at around 256MB (physical memory size) * 10% = 25.6MB; once dirty pages reach 10%, the system starts to use swap heavily.
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd  free buff  cache    si    so    bi    bo   in   cs us sy id wa st
 0  3  252696  2432  268  7148  3604  2368  3608  2372  288  288  0  0 21 78  1
 0  2  253484  2216  228  7104  5368  2976  5372  3036  930  519  0  0  0 100 0
 0  1  259252  2616  128  6148 19784 18712 19784 18712 3821 1853  0  1  3 95  1
 1  2  260008  2188  144  6824 11824  2584 12664  2584 1347 1174 14  0  0 86  0
 2  1  262140  2964  128  5852 24912 17304 24952 17304 4737 2341 86 10  0  0  4
^C
- free
$ free
             total       used       free     shared    buffers     cached
Mem:      15728640   15600636     128004          0          0   13439080
-/+ buffers/cache:    2161556   13567084
Swap:      2104508     276276    1828232
Unit KB. Output in MB can be requested with the -m option.
We use names such as total1, used1, free1, used2, free2 to represent the values of the above statistics, and 1 and 2 represent the data in the first and second rows, respectively.
- total1: the total amount of physical memory.
- used1: the total amount allocated, including buffers and cache; some of it may not be actively in use.
- free1: unallocated memory.
- shared1: shared memory, generally not used by the system; not discussed here.
- buffers1: buffers allocated by the system but reclaimable.
- cached1: cache allocated by the system but reclaimable. The difference between buffer and cache is described later.
- used2: the amount of memory actually in use (used1 - buffers1 - cached1).
- free2: the sum of unused buffers, cache, and unallocated memory (free1 + buffers1 + cached1); this is the memory actually available to the system.
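The relationship between the two rows can be checked with a few lines of arithmetic, using the sample `free` output above (all values in KB):

```python
# Minimal sketch of how the -/+ buffers/cache row is derived from the Mem row.
total1, used1, free1 = 15728640, 15600636, 128004
buffers1, cached1 = 0, 13439080

used2 = used1 - buffers1 - cached1   # memory actually used by applications
free2 = free1 + buffers1 + cached1   # memory actually available

print(used2, free2)   # matches the -/+ buffers/cache row: 2161556 13567084
assert used2 + free2 == total1
```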
The difference between cache and buffer:
Cache: Cache is a small but high-speed memory located between the CPU and main memory.
Since the CPU is much faster than main memory, it would have to wait a certain amount of time to fetch data directly from memory. The Cache stores data that the CPU has just used or will soon reuse; when the CPU needs that data again, it can be fetched from the Cache, which reduces the CPU's waiting time and improves system efficiency.
Cache is further divided into Level 1 Cache (L1 Cache) and Level 2 Cache (L2 Cache). L1 Cache is integrated inside the CPU. In the early days, L2 Cache was generally soldered on the motherboard, and now it is also integrated inside the CPU. The common capacity is 256KB. or 512KB L2 Cache.
Buffer: Buffer, an area used to store data transferred between devices with different speeds or different priorities. Through the buffer, the mutual waiting between the processes can be reduced, so that when the data is read from the slow device, the operation process of the fast device will not be interrupted.
Buffer and cache in Free: (they all occupy memory)
buffer : As the memory of the buffer cache, it is the read and write buffer of the block device
cache: as the memory of the page cache, the cache of the file system
If the value of cache is large, it means that there are many files in the cache. If frequently accessed files can be cached, then the read IO of the disk will be very small.
- $ cat /proc/meminfo
MemTotal:     15728640 kB
MemFree:        116196 kB
Buffers:             0 kB
Cached:       13448268 kB
……
This server has about 15GB of physical memory in total (MemTotal), only about 113MB free (MemFree), 0 used for block-device buffers (Buffers), and about 12.8GB used for file cache (Cached). The same values can also be read from the free command; note that the unit is kB.
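A simple parser over this format lets the "free memory >= 30%" rule from the indicator list be checked automatically; `parse_meminfo` is an illustrative helper and the sample text is the output above:

```python
# Minimal sketch: parse /proc/meminfo-style text and compute effectively
# free memory (MemFree + Buffers + Cached, since buffers/cache are reclaimable).
meminfo = """\
MemTotal:     15728640 kB
MemFree:        116196 kB
Buffers:             0 kB
Cached:       13448268 kB"""

def parse_meminfo(text):
    info = {}
    for line in text.splitlines():
        key, value = line.split(":")
        info[key] = int(value.split()[0])   # values are in kB
    return info

info = parse_meminfo(meminfo)
available = info["MemFree"] + info["Buffers"] + info["Cached"]
print(available * 100 // info["MemTotal"], "% effectively free")
```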
3. Disk IO
1. Good condition indicator
- iowait% < 20%
- A simple way to improve the cache hit rate is to enlarge the file cache area: the larger the cache, the more pages it holds and the higher the hit rate. The Linux kernel tries to serve as many minor page faults as possible (reads from the file cache) and to avoid major page faults (reads from disk), so the file cache grows as minor page faults accumulate; Linux only starts freeing unused pages once little free physical memory remains.
2. Monitoring tools
- sar
$ sar -d 2 3    (sample every 2 seconds, 3 times)
Linux 3.10.83-1-tlinux2-0021.tl1 (xgame_9_zone1)  06/22/17  _x86_64_  (48 CPU)

16:50:05  DEV     tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
16:50:07  dev8-0  0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00

16:50:07  DEV     tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
16:50:09  dev8-0  0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00

16:50:09  DEV     tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
16:50:11  dev8-0  0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00

Average:  DEV     tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
Average:  dev8-0  0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00
Important parameters:
- await: the average wait time for each device I/O operation (in milliseconds), including time spent queued
- svctm: the average service time per device I/O operation (in milliseconds)
- %util: the percentage of each second that the device spent doing I/O
If the value of svctm is close to await, it means that there is almost no I/O waiting, and the disk performance is good. If the value of await is much higher than the value of svctm, it means that the I/O queue is waiting too long, and the application running on the system will slow down.
If %util is close to 100%, it means that the disk generates too many I/O requests, the I/O system is already working at full capacity, and the disk may have a bottleneck.
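These interpretation rules can be sketched as a checker; `disk_health` and its thresholds (90% utilization, await more than 4x svctm) are illustrative judgment calls, not kernel-defined limits:

```python
# Minimal sketch: flag disk bottlenecks from sar/iostat-style metrics.
def disk_health(await_ms, svctm_ms, util_pct):
    issues = []
    if util_pct >= 90:                            # device nearly saturated
        issues.append("disk near 100% busy")
    if svctm_ms > 0 and await_ms > 4 * svctm_ms:  # queueing dwarfs service time
        issues.append("long I/O queue wait")
    return issues

# sda from the iostat output below: await close to svctm, low %util -> healthy.
print(disk_health(await_ms=1.02, svctm_ms=0.47, util_pct=0.0))
```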
- $ iostat -x    (option -x displays extended IO statistics)
Linux 3.10.83-1-tlinux2-0021.tl1 (xgame_9_zone1)  06/22/17  _x86_64_  (48 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.46    0.00    2.75    0.01    0.00   94.78

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    0.00  0.00  0.00    0.00    0.00    153.95      0.00   1.02   0.47   0.00
To see which process accounts for the IO, use iotop; it works like top does for CPU. To see the files a process has open, look at /proc/${pid}/fd.
4. Network IO
For UDP
1. Good condition indicator
The receive and send buffers should not hold packets waiting to be processed for long periods
2. Monitoring tools
- netstat
For UDP services, view the network conditions of all listening UDP ports:
$ netstat -lunp
Proto Recv-Q Send-Q Local Address     Foreign Address   State   PID/Program name
udp        0      0 0.0.0.0:64000     0.0.0.0:*                 -
udp        0      0 0.0.0.0:38400     0.0.0.0:*                 -
udp        0      0 0.0.0.0:38272     0.0.0.0:*                 -
udp        0      0 0.0.0.0:36992     0.0.0.0:*                 -
udp        0      0 0.0.0.0:17921     0.0.0.0:*                 -
udp        0      0 0.0.0.0:11777     0.0.0.0:*                 -
udp        0      0 0.0.0.0:14721     0.0.0.0:*                 -
udp        0      0 0.0.0.0:36225     0.0.0.0:*                 -
It is normal for Recv-Q and Send-Q to be 0, or to be nonzero only briefly.
For UDP services, check for packet loss (here meaning the NIC has received the packet but the application layer has not processed it in time):
$ watch netstat -su
Udp:
    278073881 packets received
    4083356897 packets to unknown port received.
    2474435364 packet receive errors
    1079038030 packets sent
A continuously increasing packet receive errors counter indicates packet loss.
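Loss over an interval can be measured by diffing this counter between two snapshots; `receive_errors` and the sample snapshots are illustrative:

```python
# Minimal sketch: detect UDP packet loss by diffing "packet receive errors"
# between two `netstat -su` snapshots.
def receive_errors(netstat_su_text):
    for line in netstat_su_text.splitlines():
        line = line.strip()
        if line.endswith("packet receive errors"):
            return int(line.split()[0])
    return 0

before = "Udp:\n    100 packets received\n    7 packet receive errors"
after  = "Udp:\n    250 packets received\n    19 packet receive errors"
delta = receive_errors(after) - receive_errors(before)
print("UDP packets dropped in interval:", delta)   # 12
```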
For TCP (from Davidshan Shanwei's experience, thx~)
1. Good condition indicator
For TCP, packets are not dropped because of insufficient buffers in the same way; when a packet is lost for other reasons such as network conditions, the protocol layer's retransmission mechanism ensures that it eventually reaches the peer.
Therefore, for TCP we focus on the retransmission rate.
2. Monitoring tools
Through snmp, you can view the sending and receiving packets of each layer of network protocols
$ cat /proc/net/snmp | grep Tcp
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 78447 413 50234 221 3 5984652 5653408 156800 0 849
Retransmission rate = RetransSegs / OutSegs
There is no universal acceptable range for this value; it depends on the specific business.
What the business side ultimately cares about is response time.
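The retransmission-rate formula can be computed directly from the two `Tcp:` lines of /proc/net/snmp (header line plus value line); the parsing is an illustrative sketch using the sample output above:

```python
# Minimal sketch: Retransmission rate = RetransSegs / OutSegs from /proc/net/snmp.
snmp = ("Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
        "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts\n"
        "Tcp: 1 200 120000 -1 78447 413 50234 221 3 5984652 5653408 156800 0 849")

header, values = snmp.splitlines()
# Zip the field names with their values, skipping the leading "Tcp:" token.
tcp = dict(zip(header.split()[1:], map(int, values.split()[1:])))
rate = tcp["RetransSegs"] / tcp["OutSegs"]
print(f"TCP retransmission rate: {rate:.2%}")
```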