A strange memory leak investigation, and the thought-provoking cause behind it

01 Background

A few days ago, a student in the group sent a message saying that during a stress test he had found that the JVM of a certain application was doing far too many FullGCs, and he wasn't sure whether that was normal. He posted a GC screenshot, which looked like this:
[Screenshot: jstat GC monitoring output]

This is a screenshot of jstat monitoring. The number in the FGC column is the cumulative count of FullGCs since the application started, and one line is printed per second. The figure shows the count going from 2735 to 2736 in only about 10 seconds; in other words, a FullGC occurs roughly every 10 seconds, which is far too frequent for a Java program. Every FullGC pauses the application, lengthening response times and seriously degrading performance. The figure confirms this as well: each FullGC takes more than 400ms.
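The exact command is not shown in the screenshot, but per-second output like this is typically produced with jstat's -gcutil mode, roughly like this (the pid and interval are placeholders):

    # sample GC statistics of process <pid> once per second
    # columns include O (old gen usage %), M (metaspace usage %),
    # FGC (FullGC count) and FGCT (total FullGC time in seconds)
    jstat -gcutil <pid> 1000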

02 Analysis

There is also a strange phenomenon in the screenshot. Normally, FullGC happens because the old generation in the JVM is running out of space. The fourth column in the figure shows old generation usage, and it sits at only about 3%, nowhere near the 100% limit.

The student said that in this state, testing the interface with 10 concurrent users gave a TPS of about 900, so performance looked acceptable; still, FullGC this frequent clearly indicates a problem, and it needed to be investigated.

I didn't overthink it. Since FullGC was so frequent, the first step was to look at how objects were allocated in the JVM, so I asked the student to use jmap to print a histogram of the objects in heap memory. The results were as follows:
[Screenshot: jmap heap object histogram]
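For reference, a class histogram like this is usually obtained with jmap; a sketch (the exact options used are not shown in the original):

    # print a histogram of live heap objects of process <pid>, sorted by size
    # note: the :live option forces a FullGC before counting
    jmap -histo:live <pid> | head -n 20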

The first four rows are objects that come with the JDK: char[], byte[], int[] and String. These are normally at the top of the histogram and are not a problem, so they can be ignored.

Rows 5 and 6, judging by the class names, are fastjson objects. fastjson is a commonly used JSON serialization component in Java, mainly used for JSON conversion. Apart from these, all the other objects come from the JDK and there are no business-related objects, so the most likely suspect is the fastjson component.

Looking at the other screenshots the student provided, I found that the application's JVM parameters were not very reasonable.

First, the size of the young generation was not configured, so the JVM falls back to the default value.

Second, the permanent generation parameters were wrong: since JDK 8 the permanent generation has been removed, and PermSize should be replaced by the Metaspace parameters.
[Screenshot: the application's JVM startup parameters]
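The original startup parameters are only visible in the screenshot, but a corrected configuration along these lines would look roughly like this (the sizes are illustrative, not taken from the original):

    # size the heap and the young generation explicitly instead of relying on defaults
    -Xms4g -Xmx4g -Xmn2g
    # since JDK 8 the permanent generation is gone; use the metaspace parameters instead of PermSize
    -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m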

So I asked the student to fix these parameters first. After all, troubleshooting has to start from a reasonable configuration, and the problem might simply have been caused by the parameters.

After the student made the changes and re-ran the stress test, the FullGC frequency dropped a lot, to about one FullGC every 30 seconds, and the JVM's collection pattern became very regular.

[Screenshot: GC monitoring after the parameter changes]

The parameter change clearly had some effect, but the student said the TPS had dropped: JMeter now showed a TPS between 100 and 200, much lower than the 900 before the change.

[Screenshot: JMeter results showing TPS between 100 and 200]

This was awkward: the more we tuned, the lower the TPS went. So I turned my attention back to fastjson. The student happened to have access to the server code, so I asked him to look at where fastjson was used in the code.

The student said the logic of this interface was very simple: query user information from the database, convert the data into a JSON string, and return it:
[Screenshot: interface code that queries user information and returns it as JSON]

fastjson is indeed used in the handle method:
[Screenshot: the handle method, which uses fastjson]

The code is quite simple and at first glance looks fine. To confirm that the handle method really was the problem, we used module isolation to verify it: the student asked the developers to bypass the handle method, return a hard-coded JSON string directly, and then re-test.

After the code was modified and the test re-run with JMeter, the interface's TPS reached 1400+ and FullGC returned to normal, which confirmed that the problem was indeed in the handle method's code.

The code was restored to its original state and the stress test run again. This time the student sent another JVM monitoring screenshot, pointing out that the fluctuation curve of the metaspace matched the fluctuation curve of class loading. Could it be that the code was generating something into the metaspace that was never released?
[Screenshot: JVM monitoring showing the metaspace curve fluctuating in step with class loading]

When I saw this picture, it suddenly clicked. The rising curve shows that the metaspace keeps loading classes; when usage approaches the upper limit, a collection is triggered and some classes are unloaded, producing a trough. The student also observed that the frequency of the metaspace troughs matched the frequency of FullGC, which means the FullGCs were being triggered by metaspace collection.

This is actually fairly rare, so let me explain. The metaspace is a memory area in the JVM that mainly stores class metadata, static variables, constants and so on. Normally, while a Java program runs, a class's information is loaded into the metaspace the first time an object of that class is created, and only that once; no matter how many concurrent or subsequent requests arrive, the class is not loaded again. Metaspace usage is therefore very stable and basically does not change with concurrency. In most stress tests the metaspace memory curve is a flat line, and as long as the metaspace is given a reasonably large initial size at startup, it will not cause FullGC.

Of course, there is one exception: if the code dynamically loads classes, then new classes keep being loaded into the metaspace as that code executes. This typically happens in scenarios involving reflection and dynamic class generation.

Looking back at the handle method, there is indeed a line of code that leads to classes being loaded dynamically:
[Screenshot: the config creation and Long.class registration inside the handle method]
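The code itself only appears as a screenshot, but the pattern described is roughly the following (class and variable names are reconstructed for illustration):

    import com.alibaba.fastjson.JSON;
    import com.alibaba.fastjson.serializer.SerializeConfig;
    import com.alibaba.fastjson.serializer.ToStringSerializer;

    public class UserHandler {

        // called once per request
        public String handle(Object userInfo) {
            // a brand-new SerializeConfig is created on every call; each new
            // instance makes fastjson generate and load fresh serializer
            // classes (via ASM) into the metaspace
            SerializeConfig config = new SerializeConfig();
            // serialize Long values as strings, e.g. to avoid precision loss in JS
            config.put(Long.class, ToStringSerializer.instance);
            return JSON.toJSONString(userInfo, config);
        }
    }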

After the config object is created, Long.class is registered in it, and on each call this causes new classes to be loaded into the metaspace. Since handle is invoked for every request, the metaspace grows very fast. Merely creating an ordinary object each time would not overflow memory, which is why old generation usage stayed low; the problem is that classes keep being loaded into the metaspace on every call.

In the code, config is just a configuration object; there is no need to create it and register the configuration again every time the interface is called. It can be made a global static variable and initialized once.

The code was modified as follows: config is defined as a static variable and initialized in a static block:
[Screenshot: the modified code, with config as a static variable initialized in a static block]
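Again reconstructed for illustration, the fix moves the config into a static field that is initialized once and reused by every request:

    import com.alibaba.fastjson.JSON;
    import com.alibaba.fastjson.serializer.SerializeConfig;
    import com.alibaba.fastjson.serializer.ToStringSerializer;

    public class UserHandler {

        // one shared config for the whole application, created only once
        private static final SerializeConfig CONFIG;

        static {
            CONFIG = new SerializeConfig();
            CONFIG.put(Long.class, ToStringSerializer.instance);
        }

        public String handle(Object userInfo) {
            // the serializer classes are generated and cached once, so the
            // metaspace no longer grows with every request
            return JSON.toJSONString(userInfo, CONFIG);
        }
    }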

Re-verifying with JMeter, under 10 concurrent users the TPS now reached about 1400 and there were no more FullGCs. The problem was solved, and since handle is a shared method in the project, fixing it improves the performance of every interface in the project.

03 Summary

In retrospect, this problem should have been spotted much earlier: the earliest GC screenshots already showed that the metaspace was causing the FullGCs. But I didn't pay much attention to the metaspace at the time, and plain text output is never as intuitive as a curve on a chart.
[Screenshot: the earliest GC output, where the metaspace-triggered FullGC was already visible]

This is also a reminder to everyone who uses fastjson: when using the serialization configuration feature, remember to define the configuration object as a global static. Otherwise it will keep filling the metaspace, which triggers FullGC and degrades the application's performance.

Source: blog.csdn.net/Testfan_zhou/article/details/124037430