High CPU and frequent GC: how to troubleshoot

0. Foreword

Anyone who has handled production incidents has probably run into a system that suddenly slows down, CPU pinned at 100%, or an excessive number of Full GCs. The most visible symptom of all of these problems is the same: the system runs slowly and triggers a flood of alarms.

This article focuses on the problem of a slow-running system. It lays out a troubleshooting approach for locating the offending code, and from there, ideas for fixing the problem.

When an online system suddenly becomes so slow that it is effectively unavailable, the first thing to do is export the jstack output and the memory information, and then restart the system so that availability is restored as quickly as possible. There are two main possible causes for this situation:

  • Some piece of code reads a large amount of data at once, exhausting memory, which leads to an excessive number of Full GCs and a slow system;

  • Some piece of code performs CPU-intensive work, driving CPU usage up and slowing the system down.

These are, relatively speaking, the two most frequent online problems, and both can make the system completely unavailable. Several other situations will make a particular function slow without bringing the whole system down:

  • Some position in the code performs a blocking operation, so the call as a whole becomes time-consuming, but the slowness occurs somewhat randomly;

  • A thread enters the WAITING state for some reason, and the function it serves becomes unavailable, yet the problem cannot be reproduced on demand;

  • Improper use of locks causes multiple threads to deadlock, making the system slow as a whole.

In these three cases, the specific problem cannot be found by checking CPU and memory, because they are essentially blocking problems: CPU and memory usage are not high, yet the function is very slow. The following sections work through the system logs to identify each of these problems step by step.

1. Too many Full GCs

Relatively speaking, this is the easiest situation to run into, especially right after new features go live. Frequent Full GCs have two main characteristics:

  • The CPU usage of multiple threads on the machine exceeds 100%, and jstack shows that these threads are mainly garbage collection threads;

  • Monitoring GC with the jstat command shows that the Full GC count is very large and keeps growing.

First, use the top command to check the system's CPU usage. The following is an example of a system with high CPU:
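With procps-ng top, a one-shot, non-interactive snapshot can be captured like this (flags differ slightly between top implementations, so treat this as a convenient sketch):

top -b -n 1 | head -n 20    # one batch iteration; by default processes are sorted by %CPU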

There is a Java process whose CPU usage has reached 98.8%. Copy its process id (9 in this example) and use the following command to view the running state of each thread in that process:

top -Hp 9

The running state of each thread in this process looks like this:

This shows the CPU usage of every thread in the Java process with pid 9. Next, we can use the jstack command to find out why the thread with thread id 10 is consuming the most CPU.

Note: in the output of the jstack command, thread ids are shown in hexadecimal. You can convert the id with the following command, or simply use a calculator in programmer mode:

root@a39de7e7934b:/# printf "%x\n" 10
a

The result tells us that this thread will appear in the jstack output as 0xa. Running jstack shows the following:

The end of the VM Thread line shows nid=0xa, where nid is the operating system thread id, and VM Thread is the garbage collection thread. At this point we can basically confirm that the system is slow because garbage collection runs too frequently, leading to long GC pauses. We can check the GC activity with the jstat command:
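For example, assuming the pid 9 from this example, a sampling interval of 1000 ms, and 4 samples (these values are only illustrative):

jstat -gcutil 9 1000 4
# the FGC column is the cumulative Full GC count; watch whether it keeps climbing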

In the jstat output, the FGC column is the Full GC count, which has reached 6793 here and is still growing. This further confirms that the system is slow because memory is being exhausted. The memory problem is now confirmed, but how do we find out which objects are responsible? Dump the heap and analyze it with the Eclipse MAT tool.
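A heap dump can be taken with jmap before loading it into MAT; a minimal sketch, assuming the example pid 9 and a placeholder output path:

jmap -dump:format=b,file=/tmp/heap.hprof 9
# open /tmp/heap.hprof in Eclipse MAT (e.g. in the histogram or dominator tree view)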

MAT then displays an object tree; from it we can basically determine which objects in memory consume the most, find where those objects are created, and deal with the code there.

In this example PrintStream accounts for the largest share, yet its memory consumption is only 12.2%. In other words, it is not enough by itself to cause a large number of Full GCs. At this point we need to consider another possibility: the code, or a third-party dependency, contains explicit System.gc() calls.

In that case, we can check the GC log that we dump, because it records the cause of each GC:
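As a lightweight supplement to the GC log, jstat -gccause also reports GC causes; the pid 9 and the 1000 ms interval below are just the example values:

jstat -gccause 9 1000
# the LGCC column shows the cause of the last GC, e.g. "System.gc()" versus "Allocation Failure"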

For example, in the GC log shown here, the first GC is caused by an explicit System.gc() call, while the second one is initiated by the JVM itself. To summarize, an excessive number of Full GCs has two main causes:

  • The code acquires a large number of objects at once, exhausting memory; in this case, Eclipse MAT can show which objects dominate the heap;

  • Memory usage is not high, but the Full GC count is still large; in this case, explicit System.gc() calls may be producing the extra collections, and adding -XX:+DisableExplicitGC disables the JVM's response to explicit GC requests (see the sketch below).
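For reference, the flag goes on the JVM command line; app.jar here is only a placeholder for your own application:

java -XX:+DisableExplicitGC -jar app.jar
# with this flag, explicit System.gc() calls become no-ops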

2. The CPU is too high

As mentioned in the previous section, high CPU may mean the system is performing Full GCs too frequently and slowing down. But it is just as common for time-consuming computation in the code to drive the CPU up. The way to investigate is very similar to the above.

First, use the top command to find which process is consuming too much CPU and note its process id; then use top -Hp <pid> to see which threads in that process have high CPU usage (generally anything above 80% counts as high, while around 80% is still reasonable). This gives us the id of the thread with the highest CPU consumption. Finally, look up that thread in the jstack log by the hexadecimal form of its thread id to see its stack.
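Putting these steps together, a minimal command sketch; the pid 9, thread id 10, and hex value 0xa are the values from the earlier example and will differ on your system:

top                                # find the Java process with the highest CPU, e.g. pid 9
top -Hp 9                          # find the busiest thread in that process, e.g. thread id 10
printf "%x\n" 10                   # convert the thread id to hex -> a
jstack 9 | grep -A 30 'nid=0xa'    # print that thread's header plus roughly 30 lines of stack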

At this point we can tell whether the high CPU comes from too many Full GCs or from time-consuming computation in the code. If it is Full GCs, the thread found through jstack will be something like VM Thread; if it is expensive computation, the result will be the stack of a specific application thread. The following is the stack of a thread whose time-consuming computation drove the CPU up:

Here, while UserController is handling a request, the thread's CPU stays at 100% because the controller makes a time-consuming call. From the stack trace we can go straight to line 34 of UserController and see what is generating so much computation.

3. Occasionally slow interface calls

A typical example of this case is an interface that occasionally takes 2~3 seconds to return. This is a rather troublesome situation, because it generally consumes little CPU and little memory, which means the two troubleshooting approaches above cannot solve it.

Moreover, because the slowness appears only from time to time, even if we capture the thread stacks with the jstack command, we cannot tell which thread is performing the time-consuming operation.

For interface slowness that appears from time to time, the basic approach is: find the interface and keep increasing the load on it with a stress-testing tool. If some position in the interface is time-consuming, then because the request rate is very high, most threads will eventually block at that point; once many threads share the same stack trace, we can basically locate the slow code in the interface. The following is the thread stack, obtained under load, of a blocking, time-consuming operation in the code:

From the log above you can see that multiple threads are blocked on line 18 of UserController, which tells us this is the blocking point and the reason the interface is slow.
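One rough way to confirm such a blocking point from a jstack snapshot taken under load is to count how often the suspect class appears in thread stacks; a sketch, assuming the example pid 9 and grepping simply by class name:

jstack 9 > stack1.log                                    # capture a snapshot while the load test is running
grep 'UserController' stack1.log | sort | uniq -c | sort -rn
# the frame (file and line number) with the highest count is the likely blocking point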

4. A thread enters the WAITING state

This situation is relatively rare, but it does happen, and because of its "irreproducibility" it is very hard to pin down during an investigation.

The author once ran into a similar case. The scenario: CountDownLatch was used so that the main thread would only be woken up after every parallel task had finished. We used CountDownLatch to control multiple threads that connected to users' Gmail mailboxes and exported their data. One thread connected to a user's mailbox, but the connection was left hanging by the server, so the thread kept waiting for the server's response. In the end our main thread and all the other threads were stuck in the WAITING state.

Readers who have inspected jstack logs will know that under normal circumstances most threads on a production system are in the TIMED_WAITING state, and the problematic thread's state looks exactly the same, which makes it very easy to misjudge. The main approach to this problem is as follows:

  • Use grep to extract all threads in the TIMED_WAITING state from the jstack log and export them to a file, such as a1.log. The following is an example of such an exported file:

  • After waiting for a while, say 10 seconds, grep the jstack output again and export it to another file, such as a2.log. The result looks like this:

  • Repeat step 2 until three or four files have been exported, then compare the files and pick out the user threads that appear in every one of them. These are almost certainly the threads stuck in a waiting state, because a normal request thread will not still be waiting after 20~30 seconds (a command-line sketch of this comparison follows the list).

  • Once these threads are identified, examine their stack traces. If a thread is supposed to be waiting, for example an idle thread in a thread pool created by the user, its stack will not contain any user-defined classes, so it can be ruled out. The remaining threads are almost certainly the problematic ones we are looking for, and their stack traces show exactly where the code leaves them waiting.
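A command-line sketch of the snapshot-and-compare approach, assuming the example pid 9; adjust the grep pattern if the suspect thread shows WAITING rather than TIMED_WAITING:

jstack 9 | grep -B 1 'java.lang.Thread.State: TIMED_WAITING' > a1.log
sleep 10
jstack 9 | grep -B 1 'java.lang.Thread.State: TIMED_WAITING' > a2.log
# extract the quoted thread names and keep only those present in both snapshots
grep -o '"[^"]*"' a1.log | sort > t1.txt
grep -o '"[^"]*"' a2.log | sort > t2.txt
comm -12 t1.txt t2.txt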

Note that when deciding whether a thread is a user thread, the thread name at the front of the entry is a good indicator, because frameworks generally name their threads in a very standardized way. Threads that can be recognized by name as belonging to a framework can basically be ruled out; the rest, such as Thread-0 above and the custom thread names we can identify, are the ones we need to examine.

After checking in this way, we can basically conclude that Thread-0 is the thread we are looking for, and its stack trace shows exactly where it is left waiting. In the following example, line 8 of SyncTask causes the thread to wait.

 

5. Deadlock

Deadlocks are basically the easiest case to find, because jstack detects them for us and prints the details of the deadlocked threads in its output. The following is an example of a jstack log containing a deadlock:

At the bottom of the jstack output, it directly reports which deadlocks exist, along with the thread stack of each deadlocked thread. Here two user threads are each waiting for the other to release a lock, and the blocked position is line 5 of ConnectTask. We can go straight to that location and analyze the code to find the cause of the deadlock.
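jstack prints a "Found one Java-level deadlock:" section when it detects one, so a quick check might look like this (pid 9 is again just the example value):

jstack 9 | grep -A 30 'Found one Java-level deadlock'
# no output means jstack did not detect a Java-level deadlock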

6. Summary

This article has explained five situations that can make an online system slow, analyzed in detail the symptoms each one produces, and shown which methods can be used, based on those symptoms, to locate the cause. Briefly, when analyzing logs on a live system, the steps are roughly the following:

  • Use the top command to check CPU usage. If CPU is high, use top -Hp <pid> to check the running state of each thread of the current process. After finding the thread with excessive CPU, convert its thread id to hexadecimal and look it up in the jstack log to see what the thread is mainly doing. There are two cases:

  • If it is a normal user thread, its stack shows where in the user code the CPU is being spent;

  • If the thread is VM Thread, monitor the GC state with jstat -gcutil <pid> <period> <times>, then export the current heap with jmap -dump:format=b,file=<filepath> <pid>. Load the dump into the Eclipse MAT tool to find out which objects in memory are consuming the most, and then fix the related code;

  • If top shows that CPU is not high and memory usage is also low, consider whether the problem is one of the other three situations, analyzed case by case:

  • If interface calls are time-consuming and occur irregularly, use a stress-testing tool to raise the hit rate on the blocking point, then find it by inspecting the stack traces in jstack;

  • If a function suddenly stalls and cannot be reproduced, export the jstack log several times and compare the snapshots to find the user threads that stay in a waiting state the whole time; these are the threads that may have the problem;

  • If jstack shows a deadlock, check the specific blocking points of the deadlocked threads, and fix the corresponding problem.

This article has presented five common problems that make online functions slow, along with troubleshooting ideas for each. Of course, production problems come in many forms and are by no means limited to these situations; as long as we carefully analyze the scenario in which a problem occurs, we can adapt the analysis to the specific case and solve the corresponding problem.


Origin blog.csdn.net/u011487470/article/details/127674403