Notes on troubleshooting a production out-of-memory error (high availability)

Background

The app project I am responsible for is referred to here as System A. System T belongs to another project and runs on a different server. We found that whenever System T restarted, the number of concurrent requests to System A spiked, and System A eventually ran out of memory and went down.

Troubleshooting

1. Check the server log

I connected with XShell 6, downloaded the service log, and checked the exceptions logged around the time the server went down. The exception messages did not point to any specific interface or method, so I asked the operations team for the heap dump file and analyzed it with jvisualvm, the tool that ships with the JDK.

2. Analyze the heap dump with the tool

jvisualvm is in the bin directory of the JDK; for details on how to use it, refer to the tool's own documentation.
After I imported the heap dump file, the analysis pointed to two interfaces.

3. Stress-test the interfaces

I handed the two interfaces over to the testers for stress testing. The result: one of them failed at a concurrency of 50 and had to be optimized before going live.
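
As a rough illustration of what "a concurrency of 50" means in such a test, here is a minimal sketch that releases N calls at the same moment and counts how many fail. The class name ConcurrencyProbe and the placeholder call are assumptions made for this example; the post does not say which tool the testers actually used.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class ConcurrencyProbe {

    /**
     * Runs `call` on `concurrency` threads that are released together and
     * returns how many of the calls threw an exception or error.
     */
    static int countFailures(int concurrency, Runnable call) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        CountDownLatch startGate = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(concurrency);
        AtomicInteger failures = new AtomicInteger();

        for (int i = 0; i < concurrency; i++) {
            pool.submit(() -> {
                try {
                    startGate.await();      // wait until the gate opens
                    call.run();
                } catch (Throwable t) {     // also catches OutOfMemoryError
                    failures.incrementAndGet();
                } finally {
                    done.countDown();
                }
            });
        }

        startGate.countDown();              // release all workers together
        try {
            done.await();                   // wait for every call to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return failures.get();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real HTTP call to the interface under test.
        Runnable placeholderCall = () -> { };
        System.out.println("failures: " + countFailures(50, placeholderCall));
    }
}
```

A dedicated load-testing tool gives far better reporting; the sketch only shows the idea of holding all workers at a gate so that the calls really do arrive together.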

4. Wait for System T's next release

After that release, System A still went down.

5. Count interface calls during System A's outage

Command used for the count (the grep pattern is the log text written when the third party fetches work order details):
sed -n '/2021-12-22 21:27:01/,/2021-12-22 21:29:47/p' catalina.out | grep 'third party to obtain work order details' | wc -l
The count showed that the call volume had again risen sharply in that window. I then asked the System T team how they use these interfaces and what their own investigation had found. It turned out that they were using their middleware incorrectly, so the normal ack mechanism never ran, and every accumulated call was replayed at the moment of restart.
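
The same count can be sketched in Java for cases where the log has already been loaded into a program. This is only an illustration: the class name LogWindowCounter is an assumption, and it uses plain substring matching rather than regular expressions. Like the sed range, it depends on log lines that contain the exact start and end timestamps.

```java
import java.util.List;

final class LogWindowCounter {

    /**
     * Counts lines containing `keyword` inside each window that opens at a
     * line containing `startMarker` and closes at the next later line
     * containing `endMarker` (both lines included), in the manner of
     * sed -n '/start/,/end/p' file | grep keyword | wc -l
     */
    static long count(List<String> lines, String startMarker,
                      String endMarker, String keyword) {
        boolean inWindow = false;
        long hits = 0;
        for (String line : lines) {
            boolean justOpened = false;
            if (!inWindow && line.contains(startMarker)) {
                inWindow = true;
                justOpened = true;
            }
            if (inWindow) {
                if (line.contains(keyword)) {
                    hits++;
                }
                // Like sed, only look for the end marker after the opening line.
                if (!justOpened && line.contains(endMarker)) {
                    inWindow = false;
                }
            }
        }
        return hits;
    }
}
```

For a large catalina.out it is better to stream the file line by line than to read it into a list first; the list parameter just keeps the sketch short.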

6. Outcome

1. The interface was optimized; in the follow-up stress test, 1000 concurrent requests produced no errors.
2. The System T team fixed their bug.
3. The operations team changed the server configuration:

#net.ipv4.tcp_tw_recycle = 1    (this line was commented out)

net.ipv4.tcp_keepalive_time = 500
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
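
A rough worked example of what these three keepalive values add up to, assuming standard Linux behavior (the first probe is sent after tcp_keepalive_time seconds of idleness, then one probe every tcp_keepalive_intvl seconds, and the connection is dropped after tcp_keepalive_probes unanswered probes). The settings only affect sockets that enable SO_KEEPALIVE. The class and method names are made up for this illustration.

```java
final class KeepaliveTimeout {

    /** Approximate seconds until the kernel drops an idle, unresponsive peer. */
    static int secondsUntilDropped(int keepaliveTime, int keepaliveIntvl, int keepaliveProbes) {
        return keepaliveTime + keepaliveIntvl * keepaliveProbes;
    }

    public static void main(String[] args) {
        // Values from the configuration above: 500 + 30 * 5
        System.out.println(secondsUntilDropped(500, 30, 5)); // prints 650
    }
}
```

So with this configuration a dead peer is detected after roughly 650 seconds, a little under 11 minutes.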

Afterwards, System T was restarted twice and the app project showed no problems. The plan is to keep monitoring System A the next time System T ships a release and restarts. To be continued.


Follow-up: System A still occasionally ran out of memory. Java VisualVM offers only basic features and could not pinpoint the problem, so I switched to Eclipse Memory Analyzer (MAT).

The investigation showed that when the third party restarts, it still calls one interface many times and occasionally calls it without passing a parameter. Without the parameter, the interface queries the entire table, and the result set is large enough to cause the out-of-memory error.

Eclipse Memory Analyzer download address:
https://www.eclipse.org/mat/downloads.php

Troubleshooting process

Get the dump file
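
In this case the dump came from the operations team. For reference, one way to produce a heap dump from inside a running application is sketched below, assuming a HotSpot-based JVM where the com.sun.management API is available. The class name HeapDumper and the output file name are assumptions for this example. The standard JVM option -XX:+HeapDumpOnOutOfMemoryError is another common way to get a dump automatically when the error occurs.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

final class HeapDumper {

    /**
     * Writes a heap dump of the current JVM to `path`.
     * The target file must not already exist, and recent JDKs
     * require the name to end in ".hprof".
     */
    static void dump(String path, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        dump("app-heap.hprof", true);   // hypothetical output file name
    }
}
```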


Increase MAT's memory

The dump file is larger than 1 GB, and MAT's default maximum heap is 1024 MB (the -Xmx setting in MemoryAnalyzer.ini), so the file cannot be opened until that limit is raised.

Open the file

The total memory in use is 1.2 GB, and a Thread object accounts for 755.2 MB of it.
Click Leak Suspects to view the detailed memory leak report, then click Details.
Click the circled entry to view the reference chain. It clearly shows that the memory is held by an ArrayList containing about 180,000 byte arrays.
Go back to Details and select Java Basics -> Thread Details to see where each thread was executing when the out-of-memory error could have occurred. Checking the threads one by one finally located the method responsible.
The method requires the parameter id, but the third party did not pass it, and the method had no required-parameter validation. As a result it queried the whole table (more than 180,000 rows), which is why the ArrayList held 180,000 byte arrays and memory eventually ran out.

Confirming the cause

I configured my local environment with the same heap settings as production: -Xms1536m -Xmx1536m
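
Before trying to reproduce the error, it can help to confirm that the local JVM really picked up the heap limit. A minimal sketch using the standard Runtime API follows; the class name HeapSettings is an assumption for this example. The reported value can be somewhat below the -Xmx figure, depending on the garbage collector in use.

```java
final class HeapSettings {

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB): " + maxHeapMiB());
    }
}
```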

I then called the method locally without passing the parameter, and the same out-of-memory error occurred on the second call.

Solution

I added required-parameter validation to this method. After the fix went live, the out-of-memory problem did not recur.
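
A minimal sketch of this kind of check, assuming a service method that takes the id and delegates to a data-access object. All names here (WorkOrderService, WorkOrderDao, getWorkOrderDetails) are hypothetical; the post does not show the real method.

```java
import java.util.List;

final class WorkOrderService {

    /** Hypothetical stand-in for the data-access layer. */
    interface WorkOrderDao {
        List<String> findById(long id);
    }

    private final WorkOrderDao dao;

    WorkOrderService(WorkOrderDao dao) {
        this.dao = dao;
    }

    /** Rejects a missing id up front instead of letting the query run unfiltered. */
    List<String> getWorkOrderDetails(Long id) {
        if (id == null) {
            throw new IllegalArgumentException("id is required");
        }
        return dao.findById(id);
    }
}
```

Failing fast at the entry point means a caller that omits the parameter gets an error back immediately, and the full-table query never runs.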



Origin blog.csdn.net/RoyRaoHR/article/details/122197013