Causes and troubleshooting of high CPU load (reposted)

Reprinted from: http://m.blog.csdn.net/canot/article/details/78079085

 

What is the CPU load value

The load average displayed by the top command is the average system load over the last 1, 5, and 15 minutes.
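For illustration (the numbers are just examples, chosen to match the ones discussed below), the summary line of top looks like this:

top - 17:57:02 up 22 days,  8:29,  3 users,  load average: 0.21, 0.10, 0.03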

The system load average is defined as the average number of processes in the run queue (processes that are running on a CPU or waiting to run) over a given time interval. A process is placed in the run queue if it satisfies the following conditions:

  • It is not waiting for the result of an I/O operation
  • It has not voluntarily entered a wait state (i.e., it has not called wait)
  • It has not been stopped (e.g., it is not waiting to be terminated)

In Linux, a process is in one of three states: blocked (waiting for data from an I/O device or for a system call to return), runnable, or running.
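To see which state each process is currently in, the STAT column of ps can be used (a quick sketch; R = running or runnable, S = interruptible sleep, D = uninterruptible sleep, typically blocked on I/O):

ps -eo stat,pid,comm | head    # first column shows the process state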

When a process is runnable, it sits in the run queue and competes with other runnable processes for CPU time. The system load is the total number of processes that are running or ready to run. For example, if the system currently has 2 running processes and 3 runnable processes, the load is 5. The load average is this load averaged over a period of time.
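On Linux the same numbers can also be read straight from /proc/loadavg; the fourth field (shown here as 2/345, values illustrative) is the number of currently runnable scheduling entities over the total number:

cat /proc/loadavg
0.21 0.10 0.03 2/345 12034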

What determines the size of the CPU load

The metric used to measure how busy the CPU is is the load, which reflects how much work the system is carrying. Put simply, it is the length of the run queue: if requests arrive faster than the system can process them, processes have to wait and the load rises.
Take the load average shown at the beginning of this article: 0.21 0.10 0.03.

Many people read the load averages like this: the three numbers represent the average system load over different time windows (one minute, five minutes, and fifteen minutes), and of course smaller is better, while higher numbers mean the server is more heavily loaded and may point to some kind of problem. That is not quite the whole picture. What makes up the magnitude of the load average, and how do you tell whether the current state is "good" or "bad"? At which values should you start paying attention?

Before answering these questions, we first need to understand the ideas behind these values. Let's start with the simplest case: a server with a single single-core processor.

Driving across the bridge

  A single-core processor can be likened to a single-lane bridge. Imagine you are the toll collector for this bridge, busy dealing with the vehicles that want to cross. You first need some information, such as how heavy each vehicle is and how many vehicles are waiting to cross. If no vehicle is waiting, you can wave the next driver through; if many vehicles are queued, they need to be told it may take a while.

  Therefore, some specific numbers are needed to describe the current traffic situation, for example:

  0.00 means there is currently no traffic on the bridge. In fact, anything between 0.00 and 1.00 is the same: traffic is flowing smoothly and vehicles can cross without waiting at all.

  1.00 means the bridge is exactly at capacity. The situation is not yet bad, but traffic is starting to bunch up, and from here it may well keep getting slower.
   
  Anything above 1.00 means the bridge is overloaded and traffic is seriously backed up. How bad is it? A value of 2.00 means the total traffic is twice what the bridge can carry: one bridge's worth of vehicles is crossing and another full bridge's worth is waiting anxiously. 3.00 is worse still: one bridge's worth is crossing and two more bridges' worth are queued behind it.
  The above situation maps very closely to processor load. A car's time on the bridge is like the actual time the processor spends executing a thread, and Unix defines a process's running time as the processing time across all CPU cores plus the time its threads spend waiting in the run queue.

  Like the toll collector, you obviously don't want cars (processes) stuck waiting anxiously. So ideally the load average should stay below 1.00. The occasional peak above 1.00 is not a problem, but if the load stays there for a long time, something is wrong and you should be worried.
"So the ideal load is 1.00?" Well, not quite. A load of 1.00 means the system has no headroom left. In practice, experienced system administrators draw the line at 0.70: if the load stays around 0.70 or above for an extended period, take the time to find out why before things get worse.

Multi-core processors: the load average must be read against the number of cores
On a multi-core system, the load average should be compared with the total number of processor cores. Returning to the bridge analogy, with a dual-core CPU the "bridge" is full at 2.00, not 1.00. Here is example output from a dual-core machine:

uptime
17:57  up 22 days,  8:29, 3 users, load averages: 2.04 2.04 2.01
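On a multi-core machine it is the load per core that matters, so a sketch like the following (nproc reports the number of logical cores) normalizes the first figure:

awk -v cores=$(nproc) '{ printf "1-minute load per core: %.2f\n", $1 / cores }' /proc/loadavg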

Causes of high CPU load and how to troubleshoot

Why does the CPU load climb? At the programming-language level, an increase in the number of full GCs or an accidental infinite loop can both drive the CPU load up.
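If the process in question is a JVM, one way to test the full-GC theory is jstat, which ships with the JDK; an illustrative invocation (the PID 12345 is a placeholder):

jstat -gcutil 12345 1000 10    # sample GC counters every second, 10 times; a fast-growing FGC column means frequent full GCs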

In one sentence, the troubleshooting procedure is:

First find out which threads are occupying the CPU, then use those thread IDs to locate the corresponding threads in the stack dump and see what is going wrong.

Find the most CPU-consuming process

  • Use the ps ux command.
  • Use top -c to display the list of running processes (press P to sort by CPU usage); see the example below.
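For example, a one-liner (GNU procps syntax assumed) that lists the most CPU-hungry processes first:

ps aux --sort=-%cpu | head -n 5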

Find the thread that consumes the most CPU

  • top -Hp <process ID> displays the threads of that process (press P to sort by CPU usage).
    If the process is a Java process, you then need to find out which code is causing the high CPU load: take the thread ID obtained above and use jstack, which ships with the JDK, to inspect the stack.

Since thread IDs in the stack dump are shown in hexadecimal (the nid field), the thread ID obtained above must first be converted to hexadecimal before searching the dump, as in the sketch below.
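Putting the steps together, a sketch of the whole procedure (the PID 12345 and thread ID 12377 are placeholders):

top -Hp 12345                             # find the busiest thread ID (decimal), e.g. 12377
printf '%x\n' 12377                       # convert it to hex: 3059
jstack 12345 | grep -A 20 'nid=0x3059'    # locate that thread's stack frames in the dump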
