The CPU spiked to 100% during a meeting, and we scrambled through the emergency troubleshooting process

Alert

I was in a meeting when the DingTalk alerts suddenly started firing non-stop. At the same time, the marketing team reported that customers could not log in to the system and were complaining about 504 errors. Looking at the alert details in DingTalk, several nodes of the same business service had CPU usage above the alert threshold, hitting 100%.

I hurried out of the meeting, logged in to the servers over SSH, and ran the top command: the CPU usage of several Java processes was at 180% to 190%. These Java processes correspond to several Pods (containers) of the same business service.

Diagnosis

  1. Run docker stats to check the resource usage of the containers on this node, then use docker exec -it <container ID> bash to enter the container with the high CPU usage.
  2. Run top inside the container to find the process ID with high CPU usage, then run top -Hp <process ID> to find the thread IDs consuming the most CPU.
  3. Run jstack <process ID> > jstack.txt to dump the process's thread stacks to a file.
  4. Exit the container and run docker cp <container ID>:/usr/local/tomcat/jstack.txt ./ to copy the jstack file to the host for easier inspection. Once the jstack output is saved, restart the service quickly to bring it back online.
  5. For the high-CPU thread ID found in step 2, run printf '%x\n' <thread ID> to convert it to hexadecimal. For example, thread ID 133 converts to 0x85. Then search jstack.txt for nid=0x85: that entry is the execution stack of the thread hogging the CPU, as shown in the figure below (a small Java alternative for the conversion follows the figure).

[Figure: jstack output showing the stack of the high-CPU thread (nid=0x85)]
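
If a shell isn't handy, the same decimal-to-hex conversion from step 5 can be done with a throwaway Java snippet (purely illustrative; 133 is just the example thread ID from above):

    // Quick check that decimal thread ID 133 is 0x85 in hex (the value jstack shows as nid=0x85)
    public class TidToHex {
        public static void main(String[] args) {
            int tid = 133;  // example thread ID from step 5
            System.out.println("nid=0x" + Integer.toHexString(tid));  // prints nid=0x85
        }
    }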

  6. After checking with colleagues, I confirmed the stack belonged to the Excel export feature, which is built on a framework and exports with no pagination and no limit on the result size! ! ! The SQL query log showed that a single export pulled 500,000 rows, each of which needed extra conversion. Worse, because the export never seemed to respond, the operator kept clicking and fired off more than 10 export requests within a few minutes. So the CPU was pegged, the service crashed, and so did I. .

Solution

Resource-intensive operations like this need guardrails. For example, limit how much data a single request can pull, cap the maximum page size, and throttle the access frequency, such as how many export requests the same user may issue in one minute. A sketch of such limits is shown below.
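
As a rough illustration only (this is not the project's actual code; ExportGuard, its limits, and its method names are invented for the example), a page-size cap plus a simple per-user, per-minute throttle could look like this:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical guard for the export endpoint: caps the page size and
    // throttles how many exports a single user may start per minute.
    public class ExportGuard {

        private static final int MAX_PAGE_SIZE = 5_000;       // upper bound for one query
        private static final int MAX_EXPORTS_PER_MINUTE = 2;  // per-user request frequency limit

        private final Map<String, Window> windows = new ConcurrentHashMap<>();

        /** Clamp the requested page size to a safe range. */
        public int clampPageSize(int requested) {
            return Math.min(Math.max(requested, 1), MAX_PAGE_SIZE);
        }

        /** Returns false if this user has already started too many exports in the current minute. */
        public boolean tryAcquire(String userId) {
            long currentMinute = System.currentTimeMillis() / 60_000;
            Window w = windows.compute(userId, (id, old) ->
                    (old == null || old.minute != currentMinute) ? new Window(currentMinute) : old);
            return w.count.incrementAndGet() <= MAX_EXPORTS_PER_MINUTE;
        }

        private static final class Window {
            final long minute;
            final AtomicInteger count = new AtomicInteger();
            Window(long minute) { this.minute = minute; }
        }
    }

In a real deployment the throttle would more likely live in a gateway or be backed by Redis so that it holds across all nodes, but the idea is the same.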

Recurrence

The service recovered after the restart, but in the afternoon the CPU alert fired on another node. Following the same steps as before, I located the threads with high CPU usage, which looked like this:

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007fa114020800 nid=0x10 runnable 

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007fa114022000 nid=0x11 runnable 

Run jstat -gcutil <process ID> 2000 10 to check the GC statistics (sample every 2000 ms, 10 times), as shown below.

[Figure: jstat -gcutil output]

The Full GC count had already passed 1000 and was still climbing, and both the Eden and Old generations were essentially full (you can also run jmap -heap <process ID> to see how much of each heap region is used). I then used jmap to dump the heap:

jmap -dump:format=b,file=./jmap.dump 13

Exit the container, run docker cp <container ID>:/usr/local/tomcat/jmap.dump ./ to copy the dump file to the host, download it locally, and open it with Eclipse Memory Analyzer (MAT, download: www.eclipse.org/mat/downloa... ), as shown in the figure.

[Figure: MAT overview of the heap dump]

If the dump file is large, you may need to increase the -Xmx value (the heap size for MAT itself) in the MemoryAnalyzer.ini configuration file.
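
For reference, the heap setting sits after the -vmargs switch in MemoryAnalyzer.ini; the value below is only an example, size it to the dump you need to open:

    -vmargs
    -Xmx4g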

char[] and String objects occupied the most memory. Right-clicking lets you inspect the objects referencing them, but clicking through didn't reveal anything obvious, so I opened the memory leak suspects report, as shown in the figure.

[Figure: MAT memory leak suspects report]

This report breaks down the heap memory usage and lists the suspected leaks. Clicking the "see stacktrace" link in the report opens the corresponding thread stack page:

[Figure: thread stack of the leak suspect]

A familiar sight: it was the Excel export again. Too much data was loaded into memory at once, which filled up the heap, which made GC run constantly, which in turn maxed out the CPU. Same root cause as before.
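
The longer-term fix on the export side is to stop materializing all 500,000 rows in memory at once. The post doesn't name the export framework, so the sketch below is only an assumption: it pages through the database and streams rows out with Apache POI's SXSSFWorkbook, which keeps just a small window of rows in memory (queryPage and ExportRow are made-up placeholders):

    import java.io.OutputStream;
    import java.util.Collections;
    import java.util.List;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.xssf.streaming.SXSSFWorkbook;

    // Sketch of a memory-friendly export: page through the database and stream rows to the output.
    public class StreamingExcelExport {

        private static final int PAGE_SIZE = 5_000;  // rows fetched per database query
        private static final int WINDOW = 100;       // rows SXSSFWorkbook keeps in memory at a time

        public void export(OutputStream out) throws Exception {
            try (SXSSFWorkbook workbook = new SXSSFWorkbook(WINDOW)) {
                Sheet sheet = workbook.createSheet("export");
                int rowIndex = 0;
                int pageNo = 0;
                List<ExportRow> page;
                while (!(page = queryPage(pageNo++, PAGE_SIZE)).isEmpty()) {
                    for (ExportRow data : page) {
                        Row row = sheet.createRow(rowIndex++);
                        row.createCell(0).setCellValue(data.name);
                        row.createCell(1).setCellValue(data.amount);
                    }
                }
                workbook.write(out);
                workbook.dispose();  // clean up the temporary files SXSSF spills to disk
            }
        }

        // Placeholder for a paged query (LIMIT/OFFSET or, better, a keyset query).
        private List<ExportRow> queryPage(int pageNo, int pageSize) {
            return Collections.emptyList();
        }

        // Placeholder row type; the real project would map its own entity here.
        private static final class ExportRow {
            final String name;
            final double amount;
            ExportRow(String name, double amount) { this.name = name; this.amount = amount; }
        }
    }

Combined with the request throttle above, this keeps a single export from exhausting the heap or pinning the CPU even when someone asks for the full data set.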

Summary

This article walked through the real process of handling a production incident in which the service CPU hit 100%, and used it to illustrate a general approach to diagnosing high CPU usage and memory overflow in Java services. I hope it serves as a reference for locating similar production problems. At the same time, think further ahead when designing and implementing a feature: don't stop at solving the current scenario, and ask whether your implementation still holds up as the data volume keeps growing. As the saying goes, junior programmers solve today's problems, intermediate programmers solve the problems of two years from now, and senior programmers solve the problems of five years from now ^_^.

Author: Rain Song file


Source: blog.csdn.net/weixin_46577306/article/details/107346137