A brief discussion on why slow disks cause Linux load to soar

Conclusions and reasons first

On Linux systems, the load average is basically useless on its own, because you do not know what it means. When you see a high load average, you cannot tell whether there are too many runnable processes or too many processes in uninterruptible sleep. In other words, the number alone cannot tell you whether the system is short of CPU or an IO device has become a bottleneck.

This also explains why the load soars when the disk is slow (for example, under heavy disk use). In my experience, a high load comes from one of two situations:

  1. The CPU itself has too many tasks to handle, often together with overly frequent soft interrupts and context switches, which results in high load;
  2. The disk is too slow, so too many processes sit in uninterruptible sleep, which also results in high load.

Understanding uninterruptible sleep processes

A process enters the uninterruptible sleep state because it is waiting for IO, such as disk IO or network IO. When an IO request has been issued and no response has come back yet, the process generally goes into uninterruptible sleep. For example, if an NFS server is shut down while a client still has its directories mounted, running df on the client will hang the entire session. If you inspect it with ps axf, you will find that the status of the df process has changed to D.
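As a rough illustration of what ps reports, the sketch below reads the state letter from /proc/<pid>/stat and lists the processes currently in state D. The function names are made up for this example, and the code assumes a Linux /proc layout.

```python
from pathlib import Path


def process_state(stat_line: str) -> str:
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def uninterruptible_pids(proc_root: str = "/proc") -> list[int]:
    """List PIDs whose state is D (uninterruptible sleep)."""
    root = Path(proc_root)
    if not root.is_dir():
        return []  # not a Linux system, or /proc is not mounted
    pids = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            state = process_state((entry / "stat").read_text())
        except (OSError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            pids.append(int(entry.name))
    return sorted(pids)


if __name__ == "__main__":
    print("PIDs in uninterruptible sleep:", uninterruptible_pids())
```

On a healthy machine the list is usually empty or changes from run to run. PIDs that stay in it for a long time point at a stuck or slow IO path.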

1. The difference between CPU utilization and CPU load

Here we need to distinguish between CPU load and CPU utilization. They are two different concepts, although both are displayed by the same top command. CPU utilization is the percentage of CPU time that programs actually occupy while running, measured over a period of time. It tells you how busy the CPU was during that period; if it stays very high, you should consider whether the CPU is overloaded. CPU load is a statistic, over a period of time, of the number of processes the CPU is running plus the number waiting for the CPU. In other words, it measures the length of the CPU run queue.

High CPU utilization does not necessarily mean high load: a single CPU-intensive task can keep a core busy with nothing queued behind it. Conversely, can a high load average occur while CPU utilization is low? Yes. Being allocated CPU time and actually using it are different things. Once the CPU hands a task a time slice, whether the slice is used depends entirely on the task, so low utilization together with a high load average is entirely possible. In addition, IO devices can also drive the load up.
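To make the utilization side concrete, here is a minimal sketch that derives CPU utilization from two snapshots of the aggregate cpu line in /proc/stat. The function names are illustrative, and iowait is treated as idle time, which is a common convention rather than the only possible one.

```python
def cpu_times(stat_cpu_line: str) -> tuple[int, int]:
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in stat_cpu_line.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    total = sum(fields[:8])  # guest time is already included in user/nice
    return total - idle, total


def utilization(before: str, after: str) -> float:
    """Fraction of CPU time spent busy between two snapshots (0.0 to 1.0)."""
    busy0, total0 = cpu_times(before)
    busy1, total1 = cpu_times(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed else 0.0


if __name__ == "__main__":
    # Two made-up snapshots: most of the interval was spent idle or in iowait.
    snapshot_a = "cpu 100 0 100 800 0 0 0 0 0 0"
    snapshot_b = "cpu 150 0 150 1000 100 0 0 0 0 0"
    print(f"utilization: {utilization(snapshot_a, snapshot_b):.0%}")
```

In the made-up snapshots, time spent waiting on the disk counts toward iowait rather than busy time, so utilization stays low even if many tasks are stuck behind a slow device.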

From this point of view, CPU utilization alone is not enough to judge whether the CPU is overloaded. It must be combined with the load average to get the full picture.

An example often quoted online illustrates the difference. At a public phone booth, one person is on the phone and four people are waiting. Each person may use the phone for one minute; anyone who has not finished within that minute must hang up and rejoin the queue to wait for the next round. The phone is the CPU, and the people calling or waiting to call are the tasks. While the booth is in use, some people leave after finishing their calls, some rejoin the queue without having finished, and new people arrive. These changes in head count correspond to tasks being added and removed. To compute the load average, we count the people every 5 seconds and average those counts over the last 1, 5, and 15 minutes.

Some people pick up the phone and start talking immediately, and their call lasts the full minute. Others spend the first thirty seconds looking up the number or hesitating, and only talk during the last thirty seconds. With the phone as the CPU and the people as tasks, the first person (task) has high CPU utilization and the second has low CPU utilization. Of course, a real CPU does not idle for thirty seconds and then work for thirty; it keeps working. It is simply that some programs do a lot of computation, so their CPU utilization is high, while others do very little, so theirs is naturally low.

But whether CPU utilization is high or low, it has nothing to do with how many tasks are queued behind.

The number of CPUs and the number of cores per CPU both affect how much load a machine can carry, because tasks are ultimately assigned to CPU cores for processing. Two CPUs are better than one, and dual cores are better than a single core. So, setting aside differences in CPU performance, the load should be judged against the number of cores: "as many cores as there are, that much load the machine can take". For example, a single core should ideally not exceed 100%, that is, a load of 1.00; two cores should not exceed 2.00; and so on.
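The rule of thumb above can be sketched as a per-core figure: divide the load average by the number of logical CPUs and compare the result with 1.00. The helper name is made up for this example; os.getloadavg() and os.cpu_count() are standard library calls.

```python
import os


def load_per_core(load: float, cores: int) -> float:
    """Normalize a load average by the number of logical CPUs."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()  # available on Unix-like systems
    cores = os.cpu_count() or 1
    for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
        print(f"{label}: load {value:.2f}, per core {load_per_core(value, cores):.2f}")
```

A per-core value persistently above 1.00 suggests more demand than the cores can serve. On Linux, though, that demand may be waiting on IO rather than on the CPU.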

Linux has a /proc directory, a virtual filesystem that exposes the state of the running system. It contains a file called cpuinfo, which holds CPU information. The /proc/cpuinfo file is organized in sections by logical CPU rather than by physical CPU. Each logical CPU gets its own section, and logical CPU identifiers start from 0.

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x36
cpu MHz         : 2399.998
cache size      : 20480 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr ......
bogomips        : 4799.99
clflush size    : 64
cache_alignment : 64
address sizes   : 42 bits physical, 48 bits virtual
power management:

To understand the CPU information in this file, a few fields matter most: processor is the identifier of the logical CPU, model name is the model of the physical CPU, physical id is the identifier of the physical CPU (socket), and cpu cores is the number of cores in that physical CPU.
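As a small sketch of how these fields fit together, the code below counts logical CPUs, physical CPUs (sockets), and distinct cores from the text of /proc/cpuinfo. The function name is hypothetical, and the code assumes the x86 field names shown above; other architectures may omit physical id and core id.

```python
import re


def summarize_cpuinfo(text: str) -> dict:
    """Count logical CPUs, sockets, and physical cores in /proc/cpuinfo text."""
    logical = 0
    sockets = set()
    cores = set()
    # Sections are separated by blank lines, one section per logical CPU.
    for section in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in section.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        socket = fields.get("physical id", "0")
        sockets.add(socket)
        # Fall back to the logical id when 'core id' is absent.
        cores.add((socket, fields.get("core id", fields["processor"])))
    return {
        "logical_cpus": logical,
        "sockets": len(sockets),
        "physical_cores": len(cores),
    }


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(summarize_cpuinfo(handle.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

When logical_cpus is larger than physical_cores, several logical CPUs share one core, which is what Hyper-Threading produces.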

About logical CPUs: today's servers generally use "Hyper-Threading" (HT) technology to improve CPU performance. Hyper-Threading lets one CPU core execute multiple threads at the same time while sharing the core's resources, so in theory it behaves like two CPUs running two threads. In practice it does not, because two real CPUs each have their own independent resources. When two threads need the same resource at the same time, one of them must pause and give it up until the resource is free again. The performance of a hyper-threaded core is therefore not equal to that of two CPUs. Hyper-Threading has other limitations as well.

2. How the CPU load average is calculated

The concept of load average originates from UNIX. Although each vendor's formula differs, they all measure the number of processes using the CPU plus the number of processes waiting for the CPU. In short, it is the number of runnable processes. The load average can therefore serve as a reference indicator for CPU bottlenecks: if it is greater than the number of CPUs, the CPU may be insufficient.

However, it's a little different on Linux!

In addition to the processes using the CPU and those waiting for it, the load average on Linux also counts processes in uninterruptible sleep. A process is usually in this state while waiting on an IO device or the network. The reasoning of the Linux designers was that uninterruptible sleep should be very short-lived and the process will resume soon, so it is treated as equivalent to runnable. However, uninterruptible sleep is still sleep, however short, and in the real world it is not always short. A large number of uninterruptible sleeps, or long ones, usually means that an IO device has hit a bottleneck.

A sleeping process does not need the CPU: even if every CPU is idle, it cannot run. The number of sleeping processes is therefore not a suitable measure of CPU load, and by counting uninterruptible sleep in the load average, Linux departs from the original meaning of the metric. As a result, on Linux the load average is basically useless on its own because you do not know what it means. When you see a high load average, you cannot tell whether there are too many runnable processes or too many processes in uninterruptible sleep, so you cannot determine whether the CPU is insufficient or an IO device has become a bottleneck.
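To see why the two cases are indistinguishable, here is a simplified floating-point sketch of the calculation. The Linux kernel samples about every five seconds and keeps exponentially damped moving averages in fixed-point arithmetic; the function names and the sample data below are made up for illustration.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples


def update_load(load: float, active: int, window: float) -> float:
    """Fold one sample into an exponentially damped moving average."""
    decay = math.exp(-SAMPLE_INTERVAL / window)
    return load * decay + active * (1.0 - decay)


def simulate(samples, window: float = 60.0) -> float:
    """samples: iterable of (runnable, uninterruptible) process counts."""
    load = 0.0
    for runnable, uninterruptible in samples:
        # Linux adds both kinds of process into the same number.
        load = update_load(load, runnable + uninterruptible, window)
    return load


if __name__ == "__main__":
    cpu_bound = [(4, 0)] * 200  # four runnable tasks, nothing waiting on IO
    io_bound = [(0, 4)] * 200   # four tasks stuck in uninterruptible sleep
    print(f"cpu-bound load: {simulate(cpu_bound):.2f}")
    print(f"io-bound load:  {simulate(io_bound):.2f}")
```

Both runs converge on the same load, which is exactly the ambiguity described above: the final number carries no trace of which kind of process produced it.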

This also explains why the load soars when the disk is slow (for example, under heavy disk use). In my experience, a high load comes from one of two situations: either the CPU itself has too many tasks to handle and soft interrupts and context switches are too frequent, or the disk is too slow and too many processes sit in uninterruptible sleep. Both drive the load up.
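One way to tell the two situations apart is to look at the procs_running and procs_blocked counters in /proc/stat, which report runnable processes and processes blocked on IO separately. The sketch below parses them and applies a crude rule of thumb; the function names and thresholds are assumptions for this example, not an established diagnostic.

```python
def run_queue_breakdown(proc_stat_text: str) -> dict:
    """Extract procs_running and procs_blocked from /proc/stat text."""
    counts = {"procs_running": 0, "procs_blocked": 0}
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in counts:
            counts[parts[0]] = int(parts[1])
    return counts


def likely_bottleneck(running: int, blocked: int, cores: int) -> str:
    """Crude guess at what is driving the load (illustrative thresholds)."""
    if blocked > 0 and blocked >= running:
        return "io"   # most active processes are waiting on a device
    if running > cores:
        return "cpu"  # more runnable processes than cores to run them
    return "none"


if __name__ == "__main__":
    import os

    try:
        with open("/proc/stat") as handle:
            counts = run_queue_breakdown(handle.read())
    except OSError:
        counts = {"procs_running": 0, "procs_blocked": 0}
    verdict = likely_bottleneck(
        counts["procs_running"], counts["procs_blocked"], os.cpu_count() or 1
    )
    print(counts, "->", verdict)
```

A single snapshot is noisy, so in practice these counters are worth sampling repeatedly over the same interval as the load average.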


Origin blog.csdn.net/tian830937/article/details/132636387