Linux CPU analysis: vmstat

Reprinted: https://blog.csdn.net/ty_hf/article/details/63394960

1. Introduction
To make this article easier to follow, please read through the following concepts first, tedious as they are.
Without them you can still run vmstat and read what Linux reports, but you will know only what the numbers are, not why they are what they are.
 
Let's start with the concept of memory.
Why not start with the CPU? Because moving data between the two kinds of memory described below consumes CPU. As for why it has to be moved at all, be patient and read on.
In this article, Linux memory is divided into physical memory and virtual memory. Physical memory is the real thing, that is, the RAM modules installed in the machine. Virtual memory here means hard disk space used to supplement physical memory (an important point: disk is far slower than RAM). The kernel writes memory pages that are temporarily unused to disk to free physical memory for processes that need it. When those pages are needed again, they are read back from disk into memory. All of this is transparent to the user. On Linux, this disk space is usually the swap partition.
 
Now for the main topic: vmstat.
vmstat (Virtual Memory Statistics) is a common Linux monitoring tool that reports on the overall state of the operating system's virtual memory, processes, and CPU. It gives a brief summary of how the system's resources are performing relative to one another; here we mainly use it to look at the CPU load.
 
Every process running on the system needs physical memory, but not every process uses all of its allocated memory all the time. When the memory the system needs exceeds the physical memory actually installed, the kernel releases some or all of the physical memory that processes hold but are not using, stores that data on disk until the owning process next needs it, and gives the freed memory to processes that need it now. [This is the memory conversion mentioned above]
In Linux memory management, this scheduling is done mainly through "Paging" and "Swapping". Paging writes pages that have not been used recently to disk and keeps the active pages in memory for processes to use. Swapping moves an entire process to disk, rather than only some of its pages. Writing a page to disk is called Page-Out, and reading a page back from disk into memory is called Page-In.
When the kernel finds that available memory is running low, it frees physical memory through Page-Out. Occasional Page-Out is normal, but if it happens frequently and continuously, system performance drops sharply, until the kernel spends more time managing paging than running programs. At that point the system runs very slowly or appears to hang, a state known as thrashing. [This is why the above consumes CPU]
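As a rough illustration of how sustained Page-In/Page-Out activity might be spotted, the sketch below totals the si (swap-in) and so (swap-out) columns of vmstat output. The function name swap_activity is made up for this example, and it assumes the usual procps header row that begins with r; it is one possible approach, not an established tool.

```shell
# Hypothetical helper: total the si/so columns of vmstat output read from stdin.
# Column positions are looked up by name in the header row, not hard-coded.
swap_activity() {
  awk '
    $1 == "r" {                           # header row holding the column names
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("si" in col) && $1 ~ /^[0-9]+$/ {    # data rows start with a number
      si += $(col["si"]); so += $(col["so"]); n++
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d si_total=%d so_total=%d\n", n, si, so
    }'
}

# Example on a Linux host (not run here):
#   vmstat 3 5 | swap_activity
```

Totals that keep growing from one run to the next would suggest the frequent, continuous paging described above. Keep in mind that the first report vmstat prints is an average since boot, so it may be worth ignoring when judging current behavior.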
 
2. Effect display
vmstat 3 5    # print one report every 3 seconds, 5 reports in total
 
For a newcomer the output can be a little confusing, let alone combining the columns to locate a bottleneck, so let's first go through what each field means.
Put another way, the following notes are very important:
 
 
3. Practical analysis
1. r: the number of processes in the run queue
 
r (run) is the number of runnable processes, that is, processes running or waiting for CPU time in the run queue; b (blocked) is the number of processes in uninterruptible sleep, usually waiting for I/O. When r exceeds the number of CPUs, there is a CPU bottleneck.
 
Check the number of CPU cores: cat /proc/cpuinfo|grep processor|wc -l
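The comparison between r and the core count can be sketched as a small filter. The names cpu_count and run_queue_check are invented for this example; the sketch assumes r is the first column of each vmstat data row, as in the procps layout.

```shell
# Count logical CPUs, same idea as the command above without the extra pipes.
cpu_count() {
  grep -c '^processor' /proc/cpuinfo
}

# Hypothetical helper: read vmstat output on stdin and count the samples whose
# r column (first field) exceeds the CPU count given as $1.
run_queue_check() {
  awk -v ncpu="$1" '
    $1 == "r" { next }                    # skip the column-name header row
    $1 ~ /^[0-9]+$/ {                     # data rows start with the r value
      n++
      if (($1 + 0) > (ncpu + 0)) over++
    }
    END { printf "samples=%d over_cpu_count=%d\n", n, over }'
}

# Example on a Linux host (not run here):
#   vmstat 3 5 | run_queue_check "$(cpu_count)"
```

If most samples land in over_cpu_count, the run queue is persistently longer than the number of CPUs, which is the bottleneck condition described here.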
 
When evaluating CPU performance, do not simply copy the multiples quoted on the Internet, and do not look only at the figures in top. Look at the r (run) and b (blocked) values reported by vmstat yourself. An r value that stays above the number of CPUs indicates a CPU bottleneck, while a large b value means many processes are stuck waiting, usually on I/O. The load average shown by the top and uptime commands is only a reference for how busy the system was over a recent period and has little to do with CPU performance.
 
When the r value exceeds the number of CPUs, there is a CPU bottleneck. There are several remedies:
1. The simplest is to add CPUs or cores.
2. Adjust when tasks run, for example by running large jobs when the system is not busy, so that the load is spread evenly.
3. Adjust the priority of existing tasks.
 
(Tip: vmstat reports CPU usage as percentages. When us+sy is close to 100, the CPU is working at full capacity.
Note, however, that a fully loaded CPU does not by itself indicate a problem: Linux always tries to keep the CPU as busy as possible to maximize task throughput.
The only thing that identifies a CPU bottleneck is the value of r, the run queue.)
 
2. CPU usage
If the CPU's id (idle percentage) stays below 10% for a long time, CPU resources are already very tight, and you should consider optimizing processes or adding CPUs.
wa (waiting for I/O) is the time the CPU is forced to sit idle because it is waiting for I/O. During that time the CPU is doing no computation and is simply wasted, so wa should be as small as possible.
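The us+sy, id, and wa checks above can be sketched as a filter over vmstat output. The function name cpu_flags and the "low-idle" label are made up for this example, and the 10% default simply mirrors the figure in the text. The sketch looks columns up by name, so it assumes the standard vmstat header row.

```shell
# Hypothetical helper: print us+sy, id and wa for each vmstat sample on stdin
# and mark samples whose idle percentage is below the threshold ($1, default 10).
cpu_flags() {
  awk -v idle_min="${1:-10}" '
    $1 == "r" {                           # header row: remember column positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("id" in col) && $1 ~ /^[0-9]+$/ {
      us = $(col["us"]); sy = $(col["sy"])
      id = $(col["id"]); wa = $(col["wa"])
      note = ((id + 0) < (idle_min + 0)) ? "low-idle" : "ok"
      printf "us+sy=%d id=%d wa=%d %s\n", us + sy, id, wa, note
    }'
}

# Example on a Linux host (not run here):
#   vmstat 3 5 | cpu_flags 10
```

A single low-idle sample means little on its own; as the text says, what matters is id staying low for a long time together with a long run queue.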
 
 
 
 
 
2. sar command
The second tool for checking CPU performance is sar. sar is very powerful and can collect separate statistics on every aspect of the system. Running sar does add some system overhead, but the cost is measurable and does not significantly affect the statistics.
The following is the CPU statistics output of the sar command on one system:
 
[root@webserver ~]# sar -u 3 5
Linux 2.6.9-42.ELsmp (webserver) 11/28/2008 _i686_ (8 CPU)
11:41:24 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
11:41:27 AM  all   0.88   0.00     0.29     0.00    0.00  98.83
11:41:30 AM  all   0.13   0.00     0.17     0.21    0.00  99.50
11:41:33 AM  all   0.04   0.00     0.04     0.00    0.00  99.92
11:41:36 AM  all   0.29   0.00     0.13     0.00    0.00  99.58
11:41:39 AM  all   0.38   0.00     0.17     0.04    0.00  99.41
Average:     all   0.34   0.00     0.16     0.05    0.00  99.45
 
Each item in the output above is interpreted as follows:
The %user column shows the percentage of CPU time consumed by user processes.
The %nice column shows the percentage of CPU time consumed by user processes running with an adjusted (nice) priority.
The %system column shows the percentage of CPU time consumed at the system (kernel) level.
The %iowait column shows the percentage of time the CPU was idle while waiting for I/O.
The %steal column shows the percentage of time a virtual CPU spent waiting involuntarily while the hypervisor was servicing another virtual processor.
The %idle column shows the percentage of time the CPU was idle.
This output is the overall CPU usage of the system. Each item is fairly intuitive, and the last line, Average, is a summary line that averages the statistics above it.
Note that the first row of statistics includes the resources consumed by sar itself, so its %user value is slightly higher, but this has little effect on the results.
On a multi-CPU system, a single-threaded program can produce a situation where overall CPU utilization is low but the application responds slowly. A single thread uses only one CPU, so that CPU runs at 100% and cannot handle other requests while the remaining CPUs sit idle. The result is low overall CPU usage and a slow application.
To investigate this, query each CPU of the system separately and look at the usage of each one:
 
[root@webserver ~]# sar -P 0 3 5
Linux 2.6.9-42.ELsmp (webserver) 11/29/2008 _i686_ (8 CPU)
06:29:33 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
06:29:36 PM    0   3.00   0.00     0.33     0.00    0.00  96.67
06:29:39 PM    0   0.67   0.00     0.33     0.00    0.00  99.00
06:29:42 PM    0   0.00   0.00     0.33     0.00    0.00  99.67
06:29:45 PM    0   0.67   0.00     0.33     0.00    0.00  99.00
06:29:48 PM    0   1.00   0.00     0.33     0.33    0.00  98.34
Average:       0   1.07   0.00     0.33     0.07    0.00  98.53
This output shows the statistics for the first CPU of the system. Note that sar numbers CPUs from 0, so "sar -P 0 3 5" reports on the first CPU and "sar -P 4 3 5" reports on the fifth, and so on. The header line shows that this system has eight CPUs.
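Rather than querying CPUs one at a time, sar also accepts "-P ALL" to report every CPU in one run. The sketch below scans the Average lines of such output for CPUs that are almost fully busy, the single-thread pattern described earlier. The function name busy_cpus is made up for this example, and it assumes the same column layout as the samples here (CPU followed by six percentage columns ending in %idle) and an English-locale "Average:" label.

```shell
# Hypothetical helper: read "sar -P ALL" output on stdin and list the CPUs whose
# average %idle is below the threshold ($1, default 10).
# Assumes the layout: CPU %user %nice %system %iowait %steal %idle
busy_cpus() {
  awk -v idle_min="${1:-10}" '
    NF >= 8 && $1 == "Average:" &&
    $(NF-6) ~ /^[0-9]+$/ &&               # numeric CPU id: skips header and "all"
    ($NF + 0) < (idle_min + 0) {
      printf "cpu=%s idle=%s\n", $(NF-6), $NF
    }'
}

# Example on a Linux host with sysstat installed (not run here):
#   LC_ALL=C sar -P ALL 3 5 | busy_cpus 10
```

If exactly one CPU is listed while the "all" line shows plenty of idle time, a single-threaded program is a likely explanation.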
 
3. iostat command
The iostat command is mainly used to report disk I/O statistics, but it can also show CPU usage. Its limitation is that it shows only the average across all CPUs in the system. See the following output:
 
[root@webserver ~]# iostat -c
Linux 2.6.9-42.ELsmp (webserver) 11/29/2008 _i686_ (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           2.52   0.00     0.30     0.24    0.00  96.96
Here the "-c" option limits the output to CPU statistics. Each item means exactly the same as the corresponding item in the sar output, so it is not described again.
 
 
4. uptime command
uptime is one of the most commonly used commands for monitoring system performance. It reports the current running state of the system: the current time, how long the system has been up since it was last booted, how many users are currently logged in, and the system load average over the last one, five, and fifteen minutes. See the output below:
 
[root@webserver ~]# uptime
18:52:11 up 27 days, 19:44, 2 users, load average: 0.12, 0.08, 0.08
 
Note that the load average should generally not exceed the number of CPUs in the system. The system in this example has 8 CPUs. If the three load average values stay above 8 for a long time, the CPU is very busy and the load is very high, which may affect system performance. If they only occasionally exceed 8, there is no need to worry, as this generally does not affect performance. Conversely, if the load average is below the number of CPUs, the CPUs still have idle time slices. In this example the CPUs are almost entirely idle.
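The comparison described here is simple arithmetic and can be sketched as follows. The function name load_check and its output labels are invented for this example. On Linux, the three load averages can also be read from /proc/loadavg instead of parsing the uptime output.

```shell
# Hypothetical helper: compare a load average ($1) with the number of CPUs ($2).
load_check() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN {
    if (ncpu + 0 <= 0) { print "invalid cpu count"; exit 1 }
    ratio = load / ncpu                   # load per CPU
    verdict = (ratio > 1) ? "overloaded" : "ok"
    printf "load=%.2f cpus=%d per_cpu=%.2f %s\n", load, ncpu, ratio, verdict
  }'
}

# Example on a Linux host (not run here); field 3 of /proc/loadavg is the
# fifteen-minute average:
#   load_check "$(cut -d' ' -f3 /proc/loadavg)" "$(grep -c '^processor' /proc/cpuinfo)"
```

Using the fifteen-minute value fits the advice above: a brief spike is not a concern, only a load that stays above the CPU count for a long time.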

 
