A Record of Troubleshooting an Online FGC Incident

Introduction

This article records the troubleshooting process and reasoning behind an online GC problem, in the hope that it will be useful to readers. I took a few detours along the way; now that I have had time to sit down, think it over, and write it up, I am sharing it so that it may help you troubleshoot online GC problems of your own.

Background

One afternoon, about a week after a new feature of the service went live, I suddenly received a CMS GC alarm, and a node was pulled out of the cluster because of it. The other nodes in the cluster then hit CMS GC one after another. Each time a node was pulled out, traffic returned to normal once garbage collection finished.

The alarm message was as follows (sensitive details redacted; screenshot omitted):

GC problems occurred on multiple nodes at almost the same time, and the natural-traffic monitoring showed no significant increase, so it was basically certain that this was a real GC problem that needed to be resolved.

Troubleshooting Process

GC Log Troubleshooting

The first thing to check for any GC problem is the GC log. At the moment a GC occurs, the log states clearly what kind of collection happened and what triggered it, and analyzing it usually reveals which part of memory is in trouble. Here is an example GC log:

0.514: [GC (Allocation Failure) [PSYoungGen: 4445K->1386K(28672K)] 168285K->165234K(200704K), 0.0036830 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
0.518: [Full GC (Ergonomics) [PSYoungGen: 1386K->0K(28672K)] [ParOldGen: 163848K->165101K(172032K)] 165234K->165101K(200704K), [Metaspace: 3509K->3509K(1056768K)], 0.0103061 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
0.528: [GC (Allocation Failure) [PSYoungGen: 0K->0K(28672K)] 165101K->165101K(200704K), 0.0019968 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
0.530: [Full GC (Allocation Failure) [PSYoungGen: 0K->0K(28672K)] [ParOldGen: 165101K->165082K(172032K)] 165101K->165082K(200704K), [Metaspace: 3509K->3509K(1056768K)], 0.0108352 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]

From the above GC log, the problem behind the Full GCs is obvious: after a Full GC, the young generation is unchanged and the old generation only shrinks from 165101K to 165082K (almost no change), meaning nearly everything is still live. The program eventually ran out of memory because there was no heap space left to allocate a large object of about 70 MB.

However, production environments always throw in a twist. The service is deployed in a K8s container, and operations runs heartbeat checks against it. When the program triggers a Full GC, the whole JVM stops the world; if the heartbeat check then fails several times in a row, the platform concludes the node may be faulty (hardware, network, a bug, etc.), pulls it out of the cluster, and rebuilds it immediately. The GC logs printed up to that point live only in the container's volume, so they are destroyed together with the container.

JVM Metrics Troubleshooting

This loss of the GC logs is essentially unavoidable here: the node is rebuilt the moment a GC problem occurs, so short of human intervention it is hard to capture the logs of the incident, and hard to predict when the next one will strike. (If the GC logs had been shipped off the node, there would have been no such problem; I only discovered that option after the fact.)
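For reference, on a JDK 8 HotSpot JVM, detailed GC logging with file rotation can be pointed at a path backed by persistent storage so that the logs survive a container rebuild. A minimal sketch; the mount path and jar name are placeholders:

java -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -Xloggc:/persistent-volume/logs/gc-%t.log \
     -XX:+UseGCLogFileRotation \
     -XX:NumberOfGCLogFiles=5 \
     -XX:GCLogFileSize=20M \
     -jar app.jar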

When that fails, the other way in is JVM metrics instrumentation. An enterprise application should be equipped with a complete JVM monitoring dashboard precisely so that the "scene of the accident" can be seen clearly. Through monitoring, you can see on a timeline how the JVM allocates and reclaims memory internally.

JVM monitoring tracks the important JVM metrics, including heap memory, non-heap memory, direct buffers, memory-mapped buffers, cumulative GC information, thread counts, and so on.

The core indicators to focus on are as follows:

  • GC (garbage collection) instantaneous and cumulative details
    • Full GC count
    • Young GC count
    • Full GC duration
    • Young GC duration
  • Heap details
    • Total heap memory
    • Old-generation bytes
    • Young-generation Survivor-space bytes
    • Young-generation Eden-space bytes
    • Committed bytes
  • Metaspace bytes
  • Non-heap memory
    • Committed non-heap bytes
    • Initial non-heap bytes
    • Maximum non-heap bytes
  • Direct buffers
    • Total DirectBuffer capacity (bytes)
    • Used DirectBuffer size (bytes)
  • JVM thread counts
    • Total threads
    • Deadlocked threads
    • New threads
    • Blocked threads
    • Runnable threads
    • Terminated threads
    • Timed-waiting threads
    • Waiting threads

When a GC problem occurs, focusing on these indicators will usually let you roughly delineate the problem.
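The dashboard in this article was built on internal instrumentation, but as an illustration, all of these metrics can be pulled from inside the JVM through the standard java.lang.management API. A minimal sketch:

import java.lang.management.*;

public class JvmMetricsSnapshot {
    public static void main(String[] args) {
        // GC counts and accumulated pause time, per collector
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }

        // Heap and non-heap usage (init / used / committed / max)
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        System.out.println("heap: " + mem.getHeapMemoryUsage());
        System.out.println("non-heap: " + mem.getNonHeapMemoryUsage());

        // Per-pool breakdown: Eden, Survivor, old generation, Metaspace
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getName() + ": " + pool.getUsage());
        }

        // Direct and memory-mapped buffer pools
        for (BufferPoolMXBean buf : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s buffers: used=%d, capacity=%d%n",
                    buf.getName(), buf.getMemoryUsed(), buf.getTotalCapacity());
        }

        // Thread count and deadlock detection
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("threads: " + threads.getThreadCount());
        long[] deadlocked = threads.findDeadlockedThreads();
        System.out.println("deadlocked: " + (deadlocked == null ? 0 : deadlocked.length));
    }
}

In a real service these readings would be exported to a time-series system rather than printed, but the data source is the same.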

Heap Memory Troubleshooting

First, check the heap memory to confirm whether there is a memory overflow (i.e. the JVM cannot obtain enough heap for new allocations). The monitoring looked as follows (chart omitted):

You can see that heap memory drops sharply after a Full GC, and that even during stretches without Full GCs, ordinary collections bring memory back down to roughly the same level. So heap memory is being reclaimed normally, and it is not the culprit behind the flood of Full GCs.
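Without a dashboard, the same conclusion can be cross-checked on a live node using jstat, which ships with the JDK; <pid> is a placeholder:

# utilization of Survivor/Eden/old generation/Metaspace, plus GC counts and times, every second
jstat -gcutil <pid> 1000
# the same columns plus the cause of the last and current GC
jstat -gccause <pid> 1000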

Non-heap memory troubleshooting

Non-heap memory here refers to the Metaspace area; the monitoring looked as follows (chart omitted):

You can see that right after the alarm, a large amount of non-heap memory is reclaimed all at once (because the node failed the health checks and was rebuilt, which amounts to restarting and reinitializing the JVM). Anyone with GC troubleshooting experience will immediately suspect that Metaspace is the problem.

What is Metaspace for? Since JDK 8, the JVM no longer has a PermGen (permanent generation). Class metadata still exists, but it is no longer stored in contiguous heap space; it has moved to native memory, in an area called Metaspace.

So when does class information get loaded?

  • At program startup: when a Java program starts, the classes and methods it needs are loaded.
  • On first reference: when a program references a class for the first time, that class is loaded.
  • Reflection: when a class is accessed through the reflection API, it is loaded.
  • Dynamic proxies: when a proxy object is created with a dynamic proxy, the classes the proxy requires are generated and loaded.

It follows that if a service has no heavy class-loading demands such as reflection or dynamic proxies, then after startup the number of loaded classes should fluctuate very little (it cannot be ruled out that things like reflective access in exception stacks will also add a few loaded classes).
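One cheap way to confirm this on a running instance is to watch the loaded-class counters exposed by the JVM; a minimal sketch:

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassCountWatcher {
    public static void main(String[] args) throws InterruptedException {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        // On a healthy service these numbers plateau shortly after startup;
        // a steadily climbing total hints at runtime class generation
        // (dynamic proxies, scripts, bytecode generation).
        while (true) {
            System.out.printf("loaded=%d, total=%d, unloaded=%d%n",
                    cl.getLoadedClassCount(),
                    cl.getTotalLoadedClassCount(),
                    cl.getUnloadedClassCount());
            Thread.sleep(10_000);
        }
    }
}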

Checking the JVM class-loading monitoring (chart omitted):

The monitoring showed that a large number of classes were indeed being loaded, and the trend of the class count matched the trend of non-heap usage.

Next, check the non-heap memory sizes configured for the current JVM:
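The original screenshot is not reproduced here, but for reference, these flags can be read from a running JVM with the standard JDK tools; <pid> is a placeholder:

# query individual Metaspace sizing flags
jinfo -flag MetaspaceSize <pid>
jinfo -flag MaxMetaspaceSize <pid>
# or dump all flags and filter
jcmd <pid> VM.flags | tr ' ' '\n' | grep Metaspace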

MetaspaceSize and MaxMetaspaceSize were both 1024 MB, and the non-heap usage monitoring above showed usage approaching 1000 MB. The JVM could not allocate enough Metaspace to load further classes, which ultimately caused the Full GC problem.

Program Code Troubleshooting

The investigation above leads to this conclusion: the program creates classes in large numbers, blowing up non-heap memory. Since the service makes heavy use of Groovy dynamic scripts, the script-creation path is the prime suspect. The code that creates classes from dynamic scripts is as follows:

public static GroovyObject buildGroovyObject(String script) {
    // Each call creates a fresh class loader; every parseClass() defines a brand-new class
    GroovyClassLoader classLoader = new GroovyClassLoader();
    try {
        // Compile the script source into a Class and instantiate it
        Class<?> groovyClass = classLoader.parseClass(script);
        GroovyObject groovyObject = (GroovyObject) groovyClass.newInstance();
        classLoader.clearCache();

        log.info("groovy buildScript success: {}", groovyObject);
        return groovyObject;
    } catch (Exception e) {
        throw new RuntimeException("buildScript error", e);
    } finally {
        try {
            classLoader.close();
        } catch (IOException e) {
            log.error("close GroovyClassLoader error", e);
        }
    }
}

Turning up the logging online confirmed that classes were indeed being created continuously.

Before blaming script compilation for blowing up Metaspace, there is one more wrinkle: for an identical script (same MD5 value), the object is taken straight from a cache and the class is not created again. The cache-checking logic is as follows:

public static GroovyObject buildScript(String scriptId, String script) {
    Validate.notEmpty(scriptId, "scriptId is empty");
    Validate.notEmpty(script, "script is empty");

    // Try the cache first: hit only if both the scriptId and the script's MD5 match
    String currScriptMD5 = DigestUtils.md5DigestAsHex(script.getBytes());
    if (GROOVY_OBJECT_CACHE_MAP.containsKey(scriptId)
            && currScriptMD5.equals(GROOVY_OBJECT_CACHE_MAP.get(scriptId).getScriptMD5())) {
        log.info("groovyObjectCache hit, scriptId: {}", scriptId);
        return GROOVY_OBJECT_CACHE_MAP.get(scriptId).getGroovyObject();
    }

    // Cache miss: compile the script into a new class
    try {
        GroovyObject groovyObject = buildGroovyObject(script);

        // Store it in the cache, replacing any previous entry for this scriptId
        GROOVY_OBJECT_CACHE_MAP.put(scriptId, GroovyCacheData.builder()
                .scriptMD5(currScriptMD5)
                .groovyObject(groovyObject)
                .build());
    } catch (Exception e) {
        throw new RuntimeException(String.format("scriptId: %s buildGroovyObject error", scriptId), e);
    }

    return GROOVY_OBJECT_CACHE_MAP.get(scriptId).getGroovyObject();
}

This logic had been verified repeatedly in earlier tests and should not misbehave, so the only remaining suspect was the cache key causing repeated class loading. Checking the recently launched changes revealed that scriptId values could be duplicated: different scripts sharing the same scriptId kept failing the MD5 comparison and being recompiled (scripts refresh every 10 minutes, which is why non-heap usage climbed slowly rather than spiking).
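One possible hardening, sketched here with the same names as the snippet above (an illustration under those assumptions, not the fix that actually shipped), is to derive the cache key from the script content itself, so that duplicate scriptIds can no longer fight over a single entry:

// Hypothetical variant: key the cache by the script's MD5 instead of the
// caller-supplied scriptId, so two different scripts can never collide on a key
// and identical content is compiled at most once.
public static GroovyObject buildScriptByContentKey(String script) {
    Validate.notEmpty(script, "script is empty");

    String cacheKey = DigestUtils.md5DigestAsHex(script.getBytes(StandardCharsets.UTF_8));
    GroovyCacheData cached = GROOVY_OBJECT_CACHE_MAP.get(cacheKey);
    if (cached == null) {
        cached = GroovyCacheData.builder()
                .scriptMD5(cacheKey)
                .groovyObject(buildGroovyObject(script))
                .build();
        GROOVY_OBJECT_CACHE_MAP.put(cacheKey, cached);
    }
    return cached.getGroovyObject();
}

If the map is a ConcurrentHashMap, computeIfAbsent would make the check-then-put atomic; the trade-off is that entries can no longer be evicted by scriptId.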

A small trap is buried here: the loaded classes are held in a Map, so calling Map.put() with the same cacheKey simply replaces the earlier entry with the later one. The class loaded earlier is then no longer "held" by the Map and ought to be collectable by the garbage collector. By that reasoning, Metaspace should not keep growing, should it?!

Tip: ordinary class loading vs. Groovy class loading, and when Metaspace is actually reclaimed.

For reasons of space, this article will not go into those details here. Interested readers can search for the topic or follow me; I will explain the reasons in detail in a separate article.

Summary

Know the what, but also know the why.

If you want to master GC troubleshooting systematically, you still have to understand the GC fundamentals: basic concepts, memory layout, object allocation, object collection, the collectors themselves, and so on. Also get familiar with the common tools for analyzing GC problems, such as the online GC log analyzer gceasy.io. I learned a great deal from the Meituan technical team's article on analyzing and resolving nine common CMS GC problems in Java, and I recommend everyone read it.


You are welcome to follow my official account: Cuckoo Chicken Technology Column
Personal tech blog: https://jifuwei.github.io/

Source: https://blog.csdn.net/weixin_43975482/article/details/128820266