A Record of Troubleshooting a Native Memory Leak | JD Cloud Technical Team

1 Problem phenomenon

The routing calculation service is the core service of the routing system, responsible for calculating waybill routing plans and matching actual operations against those plans. During operation and maintenance, we found that TP99 climbs slowly when the service has not been restarted for a long time. In addition, during the weekly scheduled trial calculation, a clear increase in memory usage can be observed. The screenshots below show the monitoring of these two anomalies.

TP99 climbing

Memory ramp

The machine configuration is as follows:

CPU: 16C RAM: 32G

The JVM configuration is as follows:

-Xms20480m (later switched to 8GB) -Xmx20480m (later switched to 8GB) -XX:MaxPermSize=2048m -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -XX:+PrintReferenceGC -XX:+UseG1GC -Xss256k -XX:ParallelGCThreads=16 -XX:ConcGCThreads=4 -XX:MaxDirectMemorySize=2g -Dsun.net.inetaddr.ttl=600 -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j2.asyncQueueFullPolicy=Discard -XX:MetaspaceSize=1024M -XX:G1NewSizePercent=35 -XX:G1MaxNewSizePercent=35

Routine task scheduling:

Execution is triggered every Monday at 2:00 am. The screenshot above covers two task cycles. During the first execution, memory climbed directly from 33% to 75%. During the second execution, after climbing to 88%, the process exited abnormally with an OOM.

2 Troubleshooting

Since there are two phenomena, there are two lines of investigation. The first is to examine memory usage in order to track down the cause of the OOM, referred to below as the memory problem investigation. The second is to investigate the cause of the slow TP99 growth, referred to below as the performance degradation investigation.

2.1 Troubleshooting performance degradation

Since TP99 climbs slowly and the climb cycle is directly tied to service restarts, external interface performance problems can be ruled out, so we first looked for the cause inside our own program, starting with GC behavior and memory usage. The GC log below is from a machine that had not been restarted for a long time. It shows a single YGC that took 1.16 seconds in total; within that, the Ref Proc phase consumed 1150.3 ms, and processing JNI Weak References alone consumed 1.1420596 seconds. On a freshly restarted machine, JNI Weak Reference processing takes 0.0000162 seconds. From this we can conclude that the TP99 increase is caused by the growing JNI Weak Reference processing time.
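Since -XX:+PrintReferenceGC is already among the JVM flags above, a quick way to watch this metric drift over time is simply to grep the GC log; the log path below is a placeholder, not the one used in this incident:

    grep "JNI Weak Reference" /path/to/gc.log | tail -n 20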

JNI Weak Reference, as the name suggests, should be related to native memory usage. However, native memory is difficult to troubleshoot directly, so it is better to start from heap usage and see whether any clues turn up along the way.

2.2 Troubleshooting memory problems

Back to the memory problem. After a reminder from Jiange, the first step was to reproduce the issue. Since the weekly scheduled task reliably reproduces the memory growth, it is easiest to investigate from the direction of the scheduled task. With the help of @柳岩, we gained the ability to reproduce the problem at any time in the trial calculation environment.

The memory investigation still starts from the heap. After multiple dumps, although the total memory usage of the Java process kept rising, heap usage did not increase significantly. After applying for root permission and deploying Arthas, the dashboard view in Arthas clearly showed that both heap and non-heap usage remained stable.

arthas dashboard

The memory usage has doubled
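For reference, attaching Arthas and opening the dashboard looks roughly like this (commands as documented by the Arthas project; the target process is selected interactively after startup):

    curl -O https://arthas.aliyun.com/arthas-boot.jar
    java -jar arthas-boot.jar          # pick the target Java process from the list
    dashboard                          # inside the Arthas console: heap, non-heap, GC stats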

Since heap and non-heap are both stable, it can be concluded that growth in native memory is driving up the total memory usage of the Java application. The first step in analyzing native memory is to enable NMT with the JVM flag -XX:NativeMemoryTracking=detail.
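NMT is a startup flag, so the service has to be restarted with it, and detail tracking adds some overhead (Oracle documents it as roughly 5-10%), so it is usually enabled only temporarily. A minimal sketch of the startup line, with the jar name as a placeholder:

    java -XX:NativeMemoryTracking=detail -Xms8g -Xmx8g ... -jar routing-calc.jar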

2.2.1 Using jcmd to view the overall memory picture

jcmd can print all memory allocations of a Java process, and when the NativeMemoryTracking=detail flag is enabled it can also show the native call stacks behind each allocation. After applying for root permission, it can be installed directly with yum.

After installation, execute the following command:

jcmd <pid> VM.native_memory detail

jcmd result display

The figure above has two parts. The first part is a summary of the overall memory situation, including total usage and usage by category. The categories include: Java Heap, Class, Thread, Code, GC, Compiler, Internal, Symbol, Native Memory Tracking, Arena Chunk, and Unknown; the meaning of each category can be found in the Native Memory Tracking documentation. The second part is the detail section, listing the start and end address of each memory segment, its size, and the category it belongs to. For example, the part in the screenshot describes 8GB of memory allocated for the Java heap (to reproduce the problem faster, the heap was reduced from 20GB to 8GB). The indented lines that follow show how that memory was allocated in detail.

Dumping with jcmd twice at a 2-hour interval and comparing the output showed that the Internal section had grown significantly. What is Internal, and why does it grow? Searching on Google turned up very little: it is roughly described as memory used for command-line parsing, JVMTI, and similar calls. After consulting @崔立园, I learned that JVMTI may be related to Java agents. In the routing calculation service, only pfinder should involve a Java agent, but a problem in shared middleware would not affect only routing, so I simply asked the pfinder developers about it and did not invest further in this lead.
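Incidentally, for the two-point comparison above, NMT can compute the diff itself: take a baseline first and ask for a diff later. These are standard jcmd subcommands rather than what was actually run here:

    jcmd <pid> VM.native_memory baseline
    # ... wait for the suspected growth window (e.g. 2 hours) ...
    jcmd <pid> VM.native_memory detail.diff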

2.2.2 Using pmap and gdb to analyze memory

The conclusion of this method first: since the analysis involves a fair amount of guesswork, it is not recommended as a first attempt. The overall idea is to use pmap to print all memory regions allocated by the Java process, pick out suspicious address ranges, dump them with gdb, and then decode the content into something readable for analysis (see the command sketch below).
Many blog posts on the Internet locate connection-leak cases by finding large numbers of roughly 64MB memory blocks. Our process did indeed contain many memory regions of about 64MB. Following the approach in those posts, after decoding the dumped memory, most of the content turned out to be JSF-related, which suggests it is the memory pool used by JSF's Netty. The JSF 1.7.4 version we use has no known memory pool leak, so this should be unrelated.
pmap: https://docs.oracle.com/cd/E56344_01/html/E54075/pmap-1.html
gdb: https://segmentfault.com/a/1190000024435739
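A sketch of this workflow, with placeholder pid and addresses, adapted from the linked references rather than the exact commands used here:

    pmap -x <pid> | sort -n -k3 | tail -20      # largest regions by RSS
    gdb --batch --pid <pid> \
        -ex "dump memory /tmp/mem.bin 0x7f0000000000 0x7f0004000000"
    strings /tmp/mem.bin | less                 # look for recognizable content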

2.2.3 Using strace to analyze system calls

This approach relies mostly on luck. The idea is to use strace to record the system calls that allocate memory, and then match them against the threads in a jstack dump, so as to determine which Java thread allocated the native memory. It is the least efficient approach: system calls are extremely frequent, especially on services with heavy RPC traffic, so apart from very obvious memory leaks it is hard to get anywhere this way. A slow leak like the one in this article is basically drowned out by normal calls and is difficult to observe.
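For completeness, a sketch of what that would look like (pid, tid, and file paths are placeholders): trace memory-related syscalls, then convert the kernel thread id to hex and look for the matching nid in the jstack output.

    strace -f -e trace=mmap,brk,mprotect -p <pid> -o /tmp/strace.out
    jstack <pid> > /tmp/jstack.out
    printf '%x\n' <tid>        # jstack prints the same thread id as nid=0x...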

2.3 Problem location

After this series of attempts, the root cause had still not been located, so we could only return to the Internal memory growth detected by jcmd. One clue remained unanalyzed: the memory allocation details. Although there were about 12,000 lines of records, the only option was to go through them, hoping to find something related to Internal.

The excerpt below shows a 32KB Internal allocation, followed in the details section by two JNIHandleBlock-related allocation sites of roughly 4GB and 2.5GB, and MemberNameTable-related calls accounting for roughly 7GB of memory.

[0x00007fa4aa9a1000 - 0x00007fa4aa9a9000] reserved and committed 32KB for Internal from
    [0x00007fa4a97be272] PerfMemory::create_memory_region(unsigned long)+0xaf2
    [0x00007fa4a97bcf24] PerfMemory::initialize()+0x44
    [0x00007fa4a98c5ead] Threads::create_vm(JavaVMInitArgs*, bool*)+0x1ad
    [0x00007fa4a952bde4] JNI_CreateJavaVM+0x74

[0x00007fa4aa9de000 - 0x00007fa4aaa1f000] reserved and committed 260KB for Thread Stack from
    [0x00007fa4a98c5ee6] Threads::create_vm(JavaVMInitArgs*, bool*)+0x1e6
    [0x00007fa4a952bde4] JNI_CreateJavaVM+0x74
    [0x00007fa4aa3df45e] JavaMain+0x9e
Details:

[0x00007fa4a946d1bd] GenericGrowableArray::raw_allocate(int)+0x17d
[0x00007fa4a971b836] MemberNameTable::add_member_name(_jobject*)+0x66
[0x00007fa4a9499ae4] InstanceKlass::add_member_name(Handle)+0x84
[0x00007fa4a971cb5d] MethodHandles::init_method_MemberName(Handle, CallInfo&)+0x28d
                             (malloc=7036942KB #10)

[0x00007fa4a9568d51] JNIHandleBlock::allocate_handle(oopDesc*)+0x2f1
[0x00007fa4a9568db1] JNIHandles::make_weak_global(Handle)+0x41
[0x00007fa4a9499a8a] InstanceKlass::add_member_name(Handle)+0x2a
[0x00007fa4a971cb5d] MethodHandles::init_method_MemberName(Handle, CallInfo&)+0x28d
                             (malloc=4371507KB #14347509)

[0x00007fa4a956821a] JNIHandleBlock::allocate_block(Thread*)+0xaa
[0x00007fa4a94e952b] JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0x6b
[0x00007fa4a94ea3f4] JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x884
[0x00007fa4a949dea1] InstanceKlass::register_finalizer(instanceOopDesc*, Thread*)+0xf1
                             (malloc=2626130KB #8619093)

[0x00007fa4a98e4473] Unsafe_AllocateMemory+0xc3
[0x00007fa496a89868]
                             (malloc=239454KB #723)

[0x00007fa4a91933d5] ArrayAllocator<unsigned long, (MemoryType)7>::allocate(unsigned long)+0x175
[0x00007fa4a9191cbb] BitMap::resize(unsigned long, bool)+0x6b
[0x00007fa4a9488339] OtherRegionsTable::add_reference(void*, int)+0x1c9
[0x00007fa4a94a45c4] InstanceKlass::oop_oop_iterate_nv(oopDesc*, FilterOutOfRegionClosure*)+0xb4
                             (malloc=157411KB #157411)

[0x00007fa4a956821a] JNIHandleBlock::allocate_block(Thread*)+0xaa
[0x00007fa4a94e952b] JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0x6b
[0x00007fa4a94ea3f4] JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x884
[0x00007fa4a94eb0d1] JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
                             (malloc=140557KB #461314)

Comparing the jcmd output from the two time points shows that the JNIHandleBlock-related memory allocation does indeed keep growing. We can therefore conclude that the leak comes from memory allocated by JNIHandles::make_weak_global. So what is this logic doing, and why does it leak?

A Google search turned up an article by a JVM expert that answered the whole question for us; the symptoms described there are essentially identical to ours. Blog: https://blog.csdn.net/weixin_45583158/article/details/100143231

In it, Han Quanzi (the blog's author) gives a piece of code that reproduces the problem, and our code contains an almost identical fragment. Finding it really did involve some luck.

// Code from the blog post. The imports, class wrapper, and the lookup field are
// added here so the snippet compiles; the original excerpt omitted them.
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MethodHandleLeak {
    private static final MethodHandles.Lookup lookup = MethodHandles.lookup();

    public static void main(String[] args) {
        while (true) {
            MethodType type = MethodType.methodType(double.class, double.class);
            try {
                // Each findStatic call registers a new MemberName; on the affected
                // JDK 8 builds the backing JNI weak global handles are never freed.
                MethodHandle mh = lookup.findStatic(Math.class, "log", type);
            } catch (NoSuchMethodException | IllegalAccessException e) {
                e.printStackTrace();
            }
        }
    }
}

JVM bug: https://bugs.openjdk.org/browse/JDK-8152271

It is exactly the bug above: frequent use of MethodHandles-related reflection causes stale objects to never be reclaimed, and also causes the YGC reference-scanning time to grow, which leads to the performance degradation.

3 Problem solving

The JDK team has stated clearly that this problem will not be fixed in JDK 8; the relevant code was reworked in a later Java release. Since we cannot upgrade the JDK in the short term, fixing it by simply upgrading the JVM is not an option. Because the root cause is the high frequency of reflective lookups, we added a cache to reduce that frequency, which addresses both the performance degradation and the memory leak. To keep things thread-safe, the cache lives in a ThreadLocal, with an LRU eviction rule so that the cache itself does not become another leak.
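As an illustration of that fix, here is a minimal sketch of a per-thread LRU cache around MethodHandle lookups. The class name, constant name, and capacity (MethodHandleCache, MAX_ENTRIES, 1024) are assumptions for the example, not the actual production code.

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.LinkedHashMap;
import java.util.Map;

public class MethodHandleCache {
    // Assumed capacity; tune it to the number of distinct lookups in the workload.
    private static final int MAX_ENTRIES = 1024;

    // One cache per thread avoids synchronization; an access-ordered LinkedHashMap
    // with removeEldestEntry gives simple LRU eviction.
    private static final ThreadLocal<Map<String, MethodHandle>> CACHE =
            ThreadLocal.withInitial(() -> new LinkedHashMap<String, MethodHandle>(64, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, MethodHandle> eldest) {
                    return size() > MAX_ENTRIES;
                }
            });

    public static MethodHandle findStatic(MethodHandles.Lookup lookup, Class<?> owner,
                                          String name, MethodType type) throws ReflectiveOperationException {
        String key = owner.getName() + "#" + name + type;
        Map<String, MethodHandle> cache = CACHE.get();
        MethodHandle mh = cache.get(key);
        if (mh == null) {
            // Only hit the leaky lookup path on a cache miss.
            mh = lookup.findStatic(owner, name, type);
            cache.put(key, mh);
        }
        return mh;
    }
}

Call sites then go through MethodHandleCache.findStatic(lookup, Math.class, "log", type) instead of calling lookup.findStatic directly, so repeated lookups of the same method no longer add MemberName entries.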

The effect of the fix is shown below. Memory growth stays within the configured heap size (8GB) and grows at a moderate rate. Two days after a restart, the JNI Weak Reference processing time is 0.0001583 seconds, as expected.

4 Summary

The troubleshooting approach for native memory leaks is similar to that for heap memory: take dumps at different times, compare them, and identify the cause by looking for outliers or abnormal growth. Because the tools differ and the native memory troubleshooting process makes it hard to associate a leak directly with a thread, you can try your luck with strace. In addition, searching for the limited clues you have may turn up someone else's investigation of the same problem, which can be a pleasant surprise. After all, the JVM is very reliable software; if it has a serious problem, related solutions should be easy to find online. If there is very little about it online, it may be worth asking whether you are depending on software that is too niche.

On the development side, try to stick to mainstream development and design patterns. Although technologies are not inherently good or bad, mechanisms such as reflection and AOP should be limited in scope, because they hurt code readability and performance gradually degrades as layers of AOP accumulate. Also, when trying new technologies, start with edge businesses; in core applications, stability comes first. This kind of awareness helps avoid pitfalls that few others hit, and saves a lot of unnecessary trouble.

Author: JD Logistics Chen Haolong

Source: JD Cloud Developer Community


Origin my.oschina.net/u/4090830/blog/10085734