Linux CPU load

Introduction

Understanding CPU load

Here we need to distinguish between CPU load and CPU utilization. They are two different concepts, although the top command shows both.
CPU utilization is the percentage of CPU time a program occupies while it runs, measured in real time. CPU load is the average number of tasks that are using or waiting to use the CPU over a period of time. High CPU utilization does not necessarily mean high load. An article on the Internet gave an interesting analogy, using a phone call to illustrate the difference between the two.
In a public telephone booth, one person is calling and four people are waiting. Each person may use the phone for at most one minute. Anyone who has not finished within one minute must hang up and queue again for the next round. The phone is equivalent to the CPU, and the number of people who are calling or waiting to call is equivalent to the number of tasks.
While the booth is in use, some people leave after their call, some re-queue because they did not finish, and new people join the queue. The change in the number of people is equivalent to the increase or decrease in the number of tasks. To measure the average load, we count the people once every 5 seconds and average the counts over the past 1, 5, and 15 minutes, which gives the 1-, 5-, and 15-minute load averages.
Some people pick up the phone and talk for the whole minute. Others may spend the first 30 seconds looking up a number or hesitating, and only actually talk for the last 30 seconds. If the phone is the CPU and each person is a task, the first person (task) has high CPU utilization and the second has low CPU utilization.
Of course, a CPU does not really work for the first 30 seconds and rest for the next 30. The point is that some programs involve a lot of computation, so their CPU utilization is high, while others involve very little, so their CPU utilization is naturally low. Either way, whether CPU utilization is high or low has no necessary connection to how many tasks are queued behind.
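
The 5-second sampling in the analogy can be sketched in code. The Linux kernel does not keep a plain arithmetic mean of the samples. It folds each sample into an exponentially damped moving average, so recent samples weigh more. The sketch below uses floating point where the kernel uses fixed-point arithmetic, and the function names are illustrative, not kernel names.

```python
import math

SAMPLE_INTERVAL = 5  # seconds between head counts, as in the analogy


def decay(window_seconds):
    """Weight the old average keeps after one sample interval."""
    return math.exp(-SAMPLE_INTERVAL / window_seconds)


def update(load, active, window_seconds):
    """Fold one sample of the active-task count into the running average."""
    d = decay(window_seconds)
    return load * d + active * (1.0 - d)


def simulate(samples, window_seconds):
    """Replay a list of per-sample task counts, starting from an idle system."""
    load = 0.0
    for active in samples:
        load = update(load, active, window_seconds)
    return load


if __name__ == "__main__":
    # Five people at the booth (1 calling + 4 waiting) for 10 minutes.
    samples = [5] * (10 * 60 // SAMPLE_INTERVAL)
    for name, window in (("1 min", 60), ("5 min", 300), ("15 min", 900)):
        print(name, round(simulate(samples, window), 2))
```

With a steady queue of five, the 1-minute figure approaches 5 quickly, while the 5- and 15-minute figures lag behind. This is why the three numbers together show whether load is rising or falling.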

Viewing machine load with top

Tasks: 2244 total,   1 running, 2151 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us,  1.6 sy,  0.0 ni, 92.2 id,  1.7 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 49199592 total, 21400612 free, 21820432 used,  5978548 buff/cache
KiB Swap: 16653308 total,  3981452 free, 12671856 used. 26274108 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                           
18142 mfeng     20   0 19.715g 3.565g   8704 S  46.7  7.6   9:22.22 java                                                              
 6986 root       0 -20       0      0      0 I  12.1  0.0   4:44.67 kworker/11:1H                                                     
20692 mfeng     20   0 11.883g 0.012t   2168 D  11.8 25.3   1:12.41 ld                                                                
 9889 root      20   0       0      0      0 I   2.0  0.0   0:24.43 kworker/11:2                                                      
21495 root      20   0   37512   5672   3048 R   1.6  0.0   0:00.15 top                                                               
20656 root      20   0       0      0      0 I   0.7  0.0   0:00.23 kworker/4:0                                                       
21341 root      20   0       0      0      0 I   0.7  0.0   0:00.10 kworker/4:2                                                       
    8 root      20   0       0      0      0 I   0.3  0.0   0:06.17 rcu_sched      
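`load average` is the average CPU load. The figures come from the file `/proc/loadavg`, and the kernel is responsible for computing them. The values shown by `top` and `uptime` are read from this file. According to the proc man page, each value is **the average number of jobs that are in the running state or waiting for disk I/O** over the given period.
- 1. To the kernel, processes and threads are both jobs
- 2. A job in the running state is a job on the kernel's run queue, either running or waiting to be scheduled onto the `CPU` (a user-space process that is "running" does not necessarily need the CPU; it may be waiting for I/O, sleeping, and so on)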
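
As a small illustration of what `/proc/loadavg` holds, here is a sketch that parses one line of it. The file has five fields: the 1-, 5-, and 15-minute load averages, a `runnable/total` count of scheduling entities, and the most recently created PID. The class and function names are made up for this example, and the sample line is illustrative rather than taken from the machine shown above.

```python
from dataclasses import dataclass


@dataclass
class LoadAvg:
    one: float       # 1-minute load average
    five: float      # 5-minute load average
    fifteen: float   # 15-minute load average
    runnable: int    # currently runnable scheduling entities
    total: int       # scheduling entities that exist on the system
    last_pid: int    # PID of the most recently created process


def parse_loadavg(text):
    """Parse the single line found in /proc/loadavg."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return LoadAvg(
        float(fields[0]), float(fields[1]), float(fields[2]),
        int(runnable), int(total), int(fields[4]),
    )


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())


if __name__ == "__main__":
    # Illustrative sample line, not real measurements.
    print(parse_loadavg("0.20 0.18 0.12 1/80 11206"))
```

On a Linux machine, `read_loadavg()` returns the same three averages that `uptime` prints.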

Single-core example

  • 1. Less than 1: on average, fewer than one job is busy at any moment. A single-core CPU can keep up.
  • 2. Equal to 1: on average, exactly one job is busy at any moment. A single-core CPU can only just keep up.
  • 3. Greater than 1: on average, more than one job is busy at any moment. A single-core CPU can run only one task at a time, so some tasks must wait. The system is overloaded and jobs cannot all be scheduled in time.
    For a single-core CPU, a value above 1 means jobs are not scheduled in time and system performance suffers. With 2 cores the threshold is 2: a value above 2 means the CPU is too busy.
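
The same rule extends to any core count by dividing the load average by the number of cores. A minimal sketch, assuming that simple per-core comparison is the only criterion; the function names and status labels are invented for illustration:

```python
import os


def load_per_core(load, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def describe(load, cores):
    """Classify a load average using the per-core threshold of 1."""
    ratio = load_per_core(load, cores)
    if ratio < 1:
        return "spare capacity"
    if ratio == 1:
        return "fully used"
    return "overloaded"


def current_status():
    """Classify the live 1-minute load average (Unix-like systems only)."""
    one_minute = os.getloadavg()[0]
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return describe(one_minute, cores)


if __name__ == "__main__":
    for load, cores in ((0.5, 1), (1.0, 1), (1.5, 1), (1.5, 2), (2.5, 2)):
        print(load, cores, describe(load, cores))
```

A load of 1.5 overloads one core but leaves spare capacity on two, which is why a load average only makes sense alongside the core count.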

%Cpu(s)

load average infers how busy the CPU is from the average number of jobs running or waiting to run, while %Cpu(s) directly reports how much time the CPU spends in each state. That is more intuitive than load average, so in practice it is used more often.
Generally speaking, the CPU is in one of the following three states:
- 1. Idle: no tasks need to be scheduled
- 2. User space: running user-space code (user mode)
- 3. Kernel: running kernel code (kernel mode)
These three states are further subdivided, the kernel state in particular. The 8 states in the top output above are:
- 1. us: the CPU spends 4.3% of its time running user-mode code (that is, user-mode programs are running)
- 2. sy: the CPU spends 1.6% of its time running kernel-mode code. The kernel manages all processes and hardware resources of the system, and all kernel code runs in kernel mode. When a user-mode process needs a hardware resource, for example to allocate memory or perform I/O, it enters kernel mode through a system call and runs kernel code. A high %sy means that the kernel itself is consuming too many resources, or that user processes are making too many system calls.
- 3. ni: the CPU spends 0.0% of its time running code of processes whose niceness has been raised above 0 (Linux counts only positive niceness here; time used by processes with negative niceness is reported under us). By default a process has a niceness of 0, but you can start a process with a specific niceness using the nice command. Niceness ranges from -20 to 19. The smaller the value, the higher the priority, and the sooner the kernel schedules the process.
- 4. id: the CPU is idle 92.2% of the time
- 5. wa: the CPU is in the I/O wait state 1.7% of the time. Normally, when the CPU hits an I/O operation, it starts the operation and then does other work, returning once the I/O completes. If the system is otherwise idle and the CPU has nothing else to do, it simply waits, and that waiting time is counted as I/O wait. In other words, I/O wait is time when the CPU is idle with nothing to run while I/O is outstanding, so it is almost the same as idle. A high value means the CPU is idle while many or slow I/O operations are in flight. A low value does not mean there is little I/O or that I/O is fast; the CPU may simply be busy with other work. Treat it as a reference value and analyze it together with the other statistics.
- 6. hi & si: these two values show how much CPU time is spent on interrupt handling; hi is hardware interrupts and si is software interrupts (softirqs). Hardware interrupts are generally raised by I/O devices such as network cards and disks, and the CPU must handle them immediately. When a hardware interrupt involves a lot of work, the kernel raises a corresponding soft interrupt and moves the time-consuming work that does not need immediate handling into it. For example, when the network card receives a packet, the CPU must copy the data to memory right away, because the card's buffer is small and later packets would be dropped if it were not drained in time. Once the data is in memory there is no such hurry, so the packet-processing code (the protocol stack) can run in the soft interrupt.
- 7. st: %st relates to virtual machines. When the system runs in a virtual machine, it shares the physical CPU with the host and other virtual machines. %st is the share of time the current virtual machine spends waiting for a physical CPU to serve it. The larger the value, the more the physical CPU is occupied by the host and other virtual machines, leaving too little for the current one. If %st stays above 0 for a long time, the virtual machine is not getting the CPU it needs. Consider moving it to another host or reducing the number of virtual machines on the current host.
These items add up to 100% (allowing for rounding). Apart from %id, a persistently high value in any of them suggests a problem with the system.
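
To make the "time in each state" idea concrete, here is a sketch that derives the same eight percentages from two snapshots of the aggregate `cpu` line in `/proc/stat`, whose counters are cumulative ticks per state. The function names are invented for this example, and the snapshot values in the demo are made up so it runs anywhere. It is an approximation of the idea, not a description of how top itself is implemented.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of
# /proc/stat, named after the matching columns in top.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) != len(FIELDS):
        raise ValueError("too few counters on the cpu line")
    return dict(zip(FIELDS, values))


def percentages(before, after):
    """Share of elapsed CPU time spent in each state between two snapshots."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between snapshots")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the live aggregate line (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def sample(interval=1.0):
    """Measure live percentages over `interval` seconds (Linux only)."""
    before = read_cpu_line()
    time.sleep(interval)
    return percentages(before, read_cpu_line())


if __name__ == "__main__":
    # Made-up snapshots, not real measurements.
    before = parse_cpu_line("cpu  1000 0 500 8000 200 0 50 0 0 0")
    after = parse_cpu_line("cpu  1040 0 520 8900 230 0 60 0 0 0")
    for name, pct in percentages(before, after).items():
        print(f"{pct:5.1f} {name}")
```

Because every tick is assigned to exactly one state, the percentages always sum to 100, which matches the statement above.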

Troubleshooting

  • 1. %us is too high: a user-mode process is using too much CPU. The top command shows clearly which process it is. If this is not expected behavior, kill the process with the kill command or restart it.
  • 2. %sy is too high: an occasional spike is nothing to worry about, but a sustained rise needs attention. Some process may be making system calls too frequently, for example by continuously writing logs to the console. If the user-mode processes look fine, the problem may be in kernel code, especially a poorly written driver module.
  • 3. %ni is too high: someone has used nice to run a CPU-hungry process. With a niceness greater than 0 there is little to worry about, because its priority is below the default and it will not take CPU away from default-priority processes. It is still worth confirming that the process does not crowd out other system resources such as memory or disk I/O, so that overall performance is not affected. If the niceness is less than 0, the process has a high priority and can take a lot of CPU (on Linux that time is reported under %us rather than %ni). Make sure its CPU usage is as expected, and if not, find it with top and kill or restart it.
  • 4. %wa is too high: some processes are doing a lot of I/O, or an I/O device is slow, for example under frequent disk reads and writes. Use the iotop command to see which processes account for the I/O and deal with each accordingly. Another possibility is that the system is swapping heavily, in which case the thing to fix is memory, not I/O.
  • 5. %hi or %si is too high: a high %hi usually points to a hardware problem, and a high %si usually points to a problem in kernel code.
  • 6. %st is too high: as mentioned above, a high %st means the current virtual machine cannot get enough CPU. Consider moving it to another host, or reduce the load on the current host, for example by shutting down some other virtual machines.
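
The checklist above can be turned into a simple lookup. The sketch below is only an illustration: the threshold values are hypothetical numbers chosen for the example, not recommendations from the article or from any tool, and the hint strings paraphrase the list above.

```python
# Hypothetical thresholds in percent; tune them for the workload at hand.
THRESHOLDS = {
    "us": 80.0, "sy": 30.0, "ni": 50.0, "wa": 20.0,
    "hi": 10.0, "si": 10.0, "st": 5.0,
}

# Short reminders of the advice given for each state.
HINTS = {
    "us": "find the busy user-mode process with top",
    "sy": "look for very frequent system calls, then suspect kernel or driver code",
    "ni": "check the niceness and resource use of the niced process",
    "wa": "check per-process I/O with iotop, and check swap usage",
    "hi": "suspect a hardware problem",
    "si": "suspect a problem in kernel code",
    "st": "this virtual machine is not getting enough physical CPU",
}


def diagnose(cpu):
    """Return one hint per CPU state whose percentage exceeds its threshold."""
    findings = []
    for name, limit in THRESHOLDS.items():
        value = cpu.get(name, 0.0)
        if value > limit:
            findings.append(f"%{name} = {value:.1f}: {HINTS[name]}")
    return findings


if __name__ == "__main__":
    # Values from the %Cpu(s) line in the top output shown earlier.
    sample = {"us": 4.3, "sy": 1.6, "ni": 0.0, "id": 92.2,
              "wa": 1.7, "hi": 0.0, "si": 0.1, "st": 0.0}
    print(diagnose(sample) or "nothing above the example thresholds")
```

For the machine shown earlier nothing crosses these example thresholds, which agrees with its 92.2% idle time.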


Origin blog.csdn.net/qq_25562325/article/details/111589589