【Online Production Troubleshooting - CPU】


Foreword

Whenever a system slows down, the first thing we usually do is run top or uptime to check the system load. For example, running uptime on the command line returns its result immediately, as shown below.

[Figure: output of the uptime command]

I believe everyone is familiar with the first three columns, which show the current time, how long the system has been running, and the number of logged-in users. The last three numbers are the load averages (Load Average) over the past 1, 5, and 15 minutes. Load average is the average number of processes that are runnable or uninterruptible, that is, the average number of active processes; it is not directly related to CPU usage.
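As a minimal sketch of the fields described above, the following Python pulls the three load averages out of one line of uptime output. The sample line is made up for illustration (it is not the output from the screenshot), and parse_load_averages is a hypothetical helper. The wording of uptime output varies between systems, so the pattern only targets the trailing load averages.

```python
import re

def parse_load_averages(uptime_line):
    """Extract the 1-, 5- and 15-minute load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())

# Hypothetical sample line, for illustration only.
sample = " 10:32:15 up 3 days,  2:41,  2 users,  load average: 0.90, 0.65, 0.40"
one_minute, five_minutes, fifteen_minutes = parse_load_averages(sample)
print(one_minute, five_minutes, fifteen_minutes)
```

On Unix systems, Python's os.getloadavg() returns the same three values without parsing any text.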

1. What is the average load?

What is load average? Many people think it is CPU usage per unit of time, so that the 0.90 above would mean an average CPU usage of 90% over the past minute. That is not the case. Put simply, load average is the average number of processes in the runnable or uninterruptible state per unit of time, that is, the average number of active processes. It is not directly related to CPU usage.

A process in the runnable state is one that is using a CPU or waiting for a CPU; the system shows this state as R (Running or Runnable). A process in the uninterruptible state is in the middle of a critical kernel code path that cannot be interrupted, most commonly while it waits for I/O from a hardware device; the system shows this state as D (Uninterruptible Sleep, also called Disk Sleep).
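To see how many processes are in these two states at a given moment, one option is to read the state letter from /proc/<pid>/stat on Linux, where R marks a runnable process and D an uninterruptible one. The sketch below is only a rough snapshot under that assumption: the helper names are hypothetical, it looks at processes rather than individual threads, and a single snapshot is not the time-averaged value that uptime reports.

```python
import os
from collections import Counter

def state_from_stat(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ")", so split on the last ")" before reading the state.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]

def count_process_states(proc_root="/proc"):
    """Count processes by state letter (R, S, D, ...) on a Linux system."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                counts[state_from_stat(handle.read())] += 1
        except (OSError, IndexError, ValueError):
            # The process exited or the file was unreadable; skip it.
            continue
    return counts

if __name__ == "__main__" and os.path.isdir("/proc"):
    states = count_process_states()
    print("runnable (R):", states["R"], "uninterruptible (D):", states["D"])
```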

Since load average is the average number of active processes, the ideal case is exactly one process running on each CPU, so that every CPU is fully utilized. For example, what does a load average of 2 mean?

On a system with 2 CPUs, it means that all CPUs are exactly fully utilized.

On a system with 4 CPUs, it means that 50% of CPU capacity is idle.

On a system with only 1 CPU, it means that half of the processes are competing for the CPU and cannot get it.
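The three cases above follow from simple arithmetic on the load average and the CPU count. The sketch below reproduces them; describe_load is a hypothetical helper written for this illustration and treats the load average as an exact process count, which is a simplification.

```python
def describe_load(load_average, cpu_count):
    """Relate a load average to the number of CPUs, as in the examples above."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if load_average <= cpu_count:
        # Every active process has a CPU; the remainder of the CPUs is idle.
        idle_share = (cpu_count - load_average) / cpu_count
        return "idle capacity: {:.0%}".format(idle_share)
    # More active processes than CPUs; the excess cannot run right now.
    waiting_share = (load_average - cpu_count) / load_average
    return "processes without a CPU: {:.0%}".format(waiting_share)

for cpus in (2, 4, 1):
    print(cpus, "CPU(s):", describe_load(2.0, cpus))
```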

Load average therefore means different things on different systems, and judging whether it is too high requires knowing how many CPUs the system has. You can count them with the grep 'model name' /proc/cpuinfo | wc -l command, which reports the number of logical CPUs.
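The same count can be sketched in Python. The function below mirrors the grep and wc pipeline by counting 'model name' lines; the two-CPU excerpt is a made-up sample in the /proc/cpuinfo layout, and count_model_name_lines is a hypothetical name. os.cpu_count() is shown alongside it because it does not depend on the layout of that file.

```python
import os

def count_model_name_lines(cpuinfo_text):
    """Count 'model name' lines, like grep 'model name' /proc/cpuinfo | wc -l."""
    return sum(1 for line in cpuinfo_text.splitlines() if "model name" in line)

# Hypothetical two-CPU excerpt, for illustration only.
sample_cpuinfo = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "\n"
    "processor\t: 1\n"
    "model name\t: Example CPU\n"
)
print(count_model_name_lines(sample_cpuinfo))

# Logical CPUs as reported by the operating system.
print(os.cpu_count())
```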

What load average is reasonable? There is no exact value. Judge it against historical monitoring data and look for an obvious upward trend. As a rule of thumb, when load average exceeds 70% of the number of CPUs, you should start analyzing and troubleshooting the high load.
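The 70% rule of thumb can be written as a small check. This is a sketch, not a monitoring tool: load_is_high and check_current_load are hypothetical names, the 0.7 default simply encodes the figure from the text, and os.getloadavg() is only available on Unix-like systems.

```python
import os

def load_is_high(load_average, cpu_count, threshold=0.7):
    """True when the load average exceeds the given fraction of the CPU count."""
    return load_average > threshold * cpu_count

def check_current_load(threshold=0.7):
    """Apply the check to the live 1-, 5- and 15-minute load averages."""
    cpu_count = os.cpu_count() or 1
    return [load_is_high(value, cpu_count, threshold) for value in os.getloadavg()]

if hasattr(os, "getloadavg"):
    print(check_current_load())
```

Comparing the three results gives a rough sense of trend: a high 1-minute value with normal 5- and 15-minute values suggests the load has only just started to rise.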

2. Average load and CPU usage

Load average is the number of processes in the runnable or uninterruptible state per unit of time. It therefore includes not only processes that are using the CPU, but also processes waiting for the CPU and processes waiting for I/O.

CPU usage measures how busy the CPU is per unit of time, and it does not necessarily track load average. For example:

A CPU-intensive process uses a lot of CPU and raises load average; here the two agree.

An I/O-intensive process raises load average while it waits for I/O, but CPU usage is not necessarily high.

A large number of processes waiting to be scheduled on the CPU also raises load average, and CPU usage is high in this case too.
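To tell these cases apart, it helps to look at CPU busy time and iowait separately. On Linux both can be derived from two samples of the aggregate cpu line in /proc/stat, whose leading fields are user, nice, system, idle, iowait, irq, softirq and steal. The sketch below assumes that field order; the function names and the sample lines are made up for illustration.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]

def usage_between(before, after):
    """Return (busy %, iowait %) between two parsed samples."""
    # guest and guest_nice (fields 9 and 10) are already included in user
    # and nice, so only the first eight fields are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    idle, iowait = deltas[3], deltas[4]
    busy = total - idle - iowait
    return 100.0 * busy / total, 100.0 * iowait / total

# Hypothetical samples, for illustration only.
first = parse_cpu_line("cpu  1000 0 500 8000 500 0 0 0 0 0")
cpu_bound = parse_cpu_line("cpu  1900 0 600 8000 500 0 0 0 0 0")
io_bound = parse_cpu_line("cpu  1050 0 550 8100 1300 0 0 0 0 0")

print(usage_between(first, cpu_bound))
print(usage_between(first, io_bound))
```

In the first pair almost all elapsed time is busy time, which matches the CPU-intensive case; in the second pair most of it is iowait, which matches the I/O-intensive case.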

Case analysis

CPU-intensive process

Open three terminals in the virtual machine. In the first terminal, run the stress --cpu 2 --timeout 600 command to simulate high CPU usage.

Then run uptime in the second terminal to watch the load average change:

[Figure: uptime output]

Run mpstat in the third terminal to watch the CPU usage change:

[Figure: mpstat output]

The second terminal shows the 1-minute load average slowly rising to 2. The third terminal shows CPU usage at 100% while iowait stays at 0, so the rise in load average is caused by CPU usage reaching 100%.

[Figure: per-process CPU usage]

The CPU usage of the stress process is 100%.

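Summary

This article introduced an approach to troubleshooting CPU problems in production: start from the load average, then compare it with CPU usage and iowait to find the cause.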

Origin blog.csdn.net/weixin_44821965/article/details/129369158