Troubleshooting skyrocketing heap memory caused by Netty under JDK 17 | JD Cloud Technical Team

Background

Introduction

The Skynet Risk Control Lingji system is a high-throughput, low-latency online computing service built on in-memory computing. It provides online statistical computations such as count, distinctCount, max, min, avg, sum, std, and interval distribution over sliding or tumbling windows. The client and server communicate directly over TCP using Netty, and the server also replicates data to the corresponding slave cluster via Netty.

Low latency bottleneck

The first version of Lingji had already been heavily optimized and could deliver fairly high throughput. Yet with a 10 ms client timeout, availability could only be held at around 98.9% under 10,000 QPS per core on the server side, and under high concurrency the drop in availability was mainly caused by GC. With the CMS collector, after the second round of optimization an 8c16g machine running above 200,000 QPS triggered a GC roughly every 4 seconds. If each GC pause takes 30 ms, then at minute granularity at least (15*30/1000/60) = 0.0075 of the time is spent in GC, which means the minute-level TP992 is at least 30 ms. That does not meet the needs of the related businesses.
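Spelling that estimate out: one GC every 4 seconds is about 60 / 4 = 15 GCs per minute, each pausing roughly 30 ms, so the paused fraction of each minute is (15 * 30 ms) / 60,000 ms = 0.0075, about 0.75%. That is close to the 0.8% of requests that lie above the 99.2th percentile, so the minute-level TP992 is pushed up to at least one pause length, roughly 30 ms.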

JDK 17 + ZGC

To address the latency problems described above, JDK 11 introduced ZGC, a low-latency garbage collector. ZGC uses new techniques and optimized algorithms to keep GC pause times within 10 milliseconds, and on JDK 17 its pauses can even reach the sub-millisecond level; in our measurements the average pause was around 10 us. It achieves concurrency in most GC phases mainly through colored pointers and load (read) barriers. Interested readers can dig into the details; note that JDK 17 is an LTS release.
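For reference, a hedged example of the flags used to switch a service onto ZGC (the heap size and GC log path are illustrative, not the article's actual settings):

java -XX:+UseZGC -Xmx16g -Xlog:gc*:file=gc.log -jar app.jar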

The problem

After switching to JDK 17 + ZGC and passing the relevant stress tests, everything was moving in the right direction. However, during a stress test of a special scenario, where data had to be synchronized from the Beijing data center to the Suqian data center, some strange things showed up:

  • The memory of the server container skyrocketed, and after the stress test was stopped the memory dropped only very slowly.

  • The CPU of the affected machines stayed at around 20% even though there were no incoming requests.

  • GC kept occurring, roughly once every 10 seconds.

The troubleshooting journey

Memory leak troubleshooting

The first reaction to skyrocketing memory that will not release is to classify it as a memory leak, which seemed simple and clear-cut. So the memory leak investigation began: a heap dump showed that Netty-related objects occupied most of the heap. A colleague had also recently shared a case of a memory leak caused by misuse of Netty's ByteBuf, which further strengthened the suspicion of a Netty leak. So Netty's strictest leak detection mode was enabled (by adding the JVM parameter -Dio.netty.leakDetection.level=PARANOID) and the stress test was rerun, but no leak-related logs appeared. Fine. The preliminary conclusion was that it was not a Netty memory leak.
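As a side note, the same detection level can also be set programmatically instead of via the JVM flag; a minimal sketch (the class name is illustrative):

import io.netty.util.ResourceLeakDetector;

public class LeakDetectionConfig {
    public static void main(String[] args) {
        // Equivalent to -Dio.netty.leakDetection.level=PARANOID: track every ByteBuf and
        // report any that are garbage-collected without having been released.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
        System.out.println("leak detection level: " + ResourceLeakDetector.getLevel());
    }
}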

JDK and Netty version bug troubleshooting

Could it be a bug caused by poor compatibility between Netty and JDK 17? After rolling back to JDK 8, the problem disappeared. JDK 17.0.7 was in use at the time, and JDK 17.0.8 had just been released with a number of bug fixes listed in its release notes, so the JDK was bumped a minor version, but the problem remained. Could the Netty version be too old? A similar issue related to WriteBufferWaterMark was found on GitHub (https://github.com/netty/netty/issues/6125) and appeared to have been fixed in a later version, so several newer Netty versions were tried and retested, but the problem still persisted.

Direct cause location and solution

After these two rounds of investigation it was clear the problem was more complicated than expected and deserved a deeper analysis, so the clues were reorganized:

  • When rolling back to JDK 8, the amount of backup data received by the cluster in the Suqian center was far lower than the amount of data sent from the Beijing center.

  • Why was there still GC when there was no traffic? The high CPU was presumably caused by GC (at the time this was attributed to some characteristic of ZGC's memory management).

  • Memory analysis: why did Netty's MpscUnboundedArrayQueue reference a huge number of AbstractChannelHandlerContext$WriteTask objects? MpscUnboundedArrayQueue is the producer-consumer queue for writeAndFlush tasks, and WriteTask is the corresponding writeAndFlush task object. It was precisely this huge number of WriteTask objects and the objects they referenced that drove memory usage so high.

  • The problem only occurred across data centers; stress tests within the same data center did not reproduce it.

Based on this analysis we formed a basic conjecture: because latency between data centers is higher, a single channel could not keep up with the data synchronization, Netty's eventLoop could not consume the tasks fast enough, and a backlog built up.

Solution: add more channel connections to the backup data node, i.e. use a connection pool, and for each batch synchronization randomly pick a live channel for the transfer (a sketch follows below). After this change the problem appeared to be solved.
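A minimal sketch of that idea, assuming one fixed-size pool per backup node; the class and method names (BackupChannelPool, next) are illustrative, not the actual Lingji implementation:

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadLocalRandom;

public class BackupChannelPool {
    private final List<Channel> channels = new CopyOnWriteArrayList<>();

    public BackupChannelPool(Bootstrap bootstrap, String host, int port, int poolSize) {
        for (int i = 0; i < poolSize; i++) {
            // connect() is asynchronous; this sketch simply blocks until each channel is ready
            channels.add(bootstrap.connect(host, port).syncUninterruptibly().channel());
        }
    }

    /** Pick a random live channel for the next batch of backup data. */
    public Channel next() {
        List<Channel> alive = channels.stream().filter(Channel::isActive).toList();
        if (alive.isEmpty()) {
            throw new IllegalStateException("no active channel to the backup node");
        }
        return alive.get(ThreadLocalRandom.current().nextInt(alive.size()));
    }
}

Spreading batches across several channels keeps a single eventLoop's write queue from becoming the bottleneck when cross-data-center latency rises.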

Root cause location and solution

Root cause location

Although the change above seemed to solve the problem, the root cause had still not been found.

  • 1. If the eventLoop's consumption capacity was insufficient, why did memory drop only slowly after the stress test was stopped? Logically it should have dropped rapidly.

  • 2. Why did the CPU stay at around 23%? Based on past stress test data, data synchronization is a batch transfer that consumes at most about 5% CPU. The extra CPU was presumably caused by GC, but the synchronized data volume was not large and should not have created that much GC pressure.

  • 3. Why did this problem not exist under JDK 8?

The speculation was that some time-consuming or blocking operation inside the Netty eventLoop had drastically reduced its consumption capacity. Suspecting Netty itself, its debug logging was enabled, and a key log line turned up:

[2023-08-23 11:16:16.163] DEBUG [] - io.netty.util.internal.PlatformDependent0 - direct buffer constructor: unavailable: Reflective setAccessible(true) disabled

Following this log, the root cause was found. Why would being unable to use the direct-buffer constructor cause our system's WriteTask consumption to block? With this question in mind, we went to look at the relevant source code.

Source code analysis

  • By default Netty uses PooledByteBufAllocator to allocate direct memory, with a memory pooling mechanism similar to jemalloc. Whenever the pool runs short of memory, io.netty.buffer.PoolArena.DirectArena#newChunk is called to reserve a new chunk of the requested memory.
protected PoolChunk<ByteBuffer> newChunk(/* ... */) {
    // key code: allocate the direct memory that backs the new chunk
    ByteBuffer memory = allocateDirect(chunkSize);
    // ...
}
  • allocateDirect() contains the logic for requesting direct memory. Roughly: if the underlying Unsafe can be used to allocate and free direct memory and a ByteBuffer object can be created via reflection, then Unsafe is used; otherwise it falls back to the plain Java API ByteBuffer.allocateDirect and relies on the built-in Cleaner to release the memory. The key here is PlatformDependent.useDirectBufferNoCleaner(), which reflects the USE_DIRECT_BUFFER_NO_CLEANER configuration.
PlatformDependent.useDirectBufferNoCleaner()
        ? PlatformDependent.allocateDirectNoCleaner(capacity)
        : ByteBuffer.allocateDirect(capacity);
  • The USE_DIRECT_BUFFER_NO_CLEANER flag is configured in the static {} block of the PlatformDependent class.

    Key logic: under JDK 17 without special configuration, neither maxDirectMemory == 0 nor !hasUnsafe() is true, so the deciding factor is the PlatformDependent0.hasDirectBufferNoCleanerConstructor() check.

if (maxDirectMemory == 0 || !hasUnsafe() || !PlatformDependent0.hasDirectBufferNoCleanerConstructor()) {
    USE_DIRECT_BUFFER_NO_CLEANER = false;
} else {
    USE_DIRECT_BUFFER_NO_CLEANER = true;
}
  • PlatformDependent0.hasDirectBufferNoCleanerConstructor() simply checks whether PlatformDependent0's DIRECT_BUFFER_CONSTRUCTOR is null. Going back to the debug log we just enabled, we can see that by default the constructor is "unavailable", i.e. DIRECT_BUFFER_CONSTRUCTOR is null. The conditions for it to be set are listed below, followed by pseudocode.

1. Enabling condition one: on JDK 9 and above the JVM parameter -Dio.netty.tryReflectionSetAccessible=true must be set explicitly (below JDK 9 it defaults to enabled).

2. Enabling condition two: a private DirectByteBuffer constructor that builds a DirectByteBuffer from a memory address and size must be obtainable via reflection. (Note: on JDK 9 and above java.nio is protected by module access rules, so the JVM startup parameter --add-opens=java.base/java.nio=ALL-UNNAMED must be added; otherwise the error "Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not \"opens java.nio\" to unnamed module" is reported.)

Since neither of these two JVM parameters is enabled by default, DIRECT_BUFFER_CONSTRUCTOR is null, and correspondingly PlatformDependent.useDirectBufferNoCleaner() in step 2 is false.

    // Pseudocode; the actual implementation differs slightly.
    ByteBuffer direct = ByteBuffer.allocateDirect(1);

    if (SystemPropertyUtil.getBoolean("io.netty.tryReflectionSetAccessible",
            javaVersion() < 9 || RUNNING_IN_NATIVE_IMAGE)) {
        DIRECT_BUFFER_CONSTRUCTOR =
                direct.getClass().getDeclaredConstructor(long.class, int.class);
    }
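A quick way to confirm which allocation path a given runtime actually takes is to print Netty's own probe results; a minimal sketch using the public PlatformDependent helpers (the class name is illustrative):

import io.netty.util.internal.PlatformDependent;

public class NettyAllocationPathCheck {
    public static void main(String[] args) {
        // If useDirectBufferNoCleaner() prints false, pooled chunks fall back to ByteBuffer.allocateDirect.
        System.out.println("hasUnsafe                = " + PlatformDependent.hasUnsafe());
        System.out.println("useDirectBufferNoCleaner = " + PlatformDependent.useDirectBufferNoCleaner());
        System.out.println("maxDirectMemory          = " + PlatformDependent.maxDirectMemory());
    }
}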
  • Now go back to step 2: on newer JDK versions PlatformDependent.useDirectBufferNoCleaner() defaults to false, so every direct memory request goes through ByteBuffer.allocateDirect. At this point the root cause is located: direct memory is requested via ByteBuffer.allocateDirect, and when direct memory runs short the JVM forces a System.gc() and then synchronously waits for DirectByteBuffer memory to be reclaimed through the Cleaner's phantom references. Below is the key code ByteBuffer.allocateDirect uses to reserve memory (Bits.reserveMemory). Roughly: once the requested direct memory hits the maximum -> check whether related objects are currently being reclaimed by GC -> if not, actively trigger System.gc() -> then loop synchronously, waiting at most MAX_SLEEPS times, for enough direct memory to be freed. In my own tests on JDK 17, this synchronous wait can last up to about 1 second.

So the most fundamental reason: if our Netty consumer EventLoop requests direct memory while processing tasks and the maximum direct memory has already been reached, a large number of task consumptions end up synchronously waiting for direct memory; and if not enough direct memory is freed, consumption blocks across the board.

static void reserveMemory(long size, long cap) {  
  
    if (!MEMORY_LIMIT_SET && VM.initLevel() >= 1) {  
        MAX_MEMORY = VM.maxDirectMemory();  
        MEMORY_LIMIT_SET = true;  
    }  
  
    // optimist!  
    if (tryReserveMemory(size, cap)) {  
        return;  
    }  
  
    final JavaLangRefAccess jlra = SharedSecrets.getJavaLangRefAccess();  
    boolean interrupted = false;  
    try {  
  
        // Retry until the reservation succeeds, or reference processing
        // (including Cleaners that might free direct memory) is exhausted.
        boolean refprocActive;
        do {
            try {  
                refprocActive = jlra.waitForReferenceProcessing();  
            } catch (InterruptedException e) {  
                // Defer interrupts and keep trying.  
                interrupted = true;  
                refprocActive = true;  
            }  
            if (tryReserveMemory(size, cap)) {  
                return;  
            }  
        } while (refprocActive);  
  
        // trigger VM's Reference processing  
        System.gc();  
  
        long sleepTime = 1;
        int sleeps = 0;
        while (true) {  
            if (tryReserveMemory(size, cap)) {  
                return;  
            }  
            if (sleeps >= MAX_SLEEPS) {  
                break;  
            }  
            try {  
                if (!jlra.waitForReferenceProcessing()) {  
                    Thread.sleep(sleepTime);  
                    sleepTime <<= 1;  
                    sleeps++;  
                }  
            } catch (InterruptedException e) {  
                interrupted = true;  
            }  
        }  
  
        // no luck  
        throw new OutOfMemoryError  
            ("Cannot reserve "  
             + size + " bytes of direct buffer memory (allocated: "  
             + RESERVED_MEMORY.get() + ", limit: " + MAX_MEMORY +")");  
  
    } finally {  
        if (interrupted) {  
            // don't swallow interrupts  
            Thread.currentThread().interrupt();  
        }  
    }  
}  
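To see this synchronous wait in isolation, here is a small hedged demo (not from the article); run it with a deliberately small limit such as -XX:MaxDirectMemorySize=64m, and the last allocation stalls inside Bits.reserveMemory before the error is thrown:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectMemoryStallDemo {
    public static void main(String[] args) {
        List<ByteBuffer> hold = new ArrayList<>(); // keep every buffer reachable so nothing can be freed
        long start = System.nanoTime();
        try {
            while (true) {
                hold.add(ByteBuffer.allocateDirect(1024 * 1024)); // 1 MB per allocation
                start = System.nanoTime();                        // reset before the next attempt
            }
        } catch (OutOfMemoryError e) {
            long blockedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("last allocateDirect blocked ~" + blockedMs + " ms before: " + e.getMessage());
        }
    }
}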
  • Having seen the reason for the blocking, why does it not block under JDK 8? From the pseudocode above, below Java 9 the default for io.netty.tryReflectionSetAccessible is true, so DIRECT_BUFFER_CONSTRUCTOR gets set and PlatformDependent.allocateDirectNoCleaner is used for memory allocation. A concrete description and the key code follow.

Step 1: before allocating, every request increments the global memory counter DIRECT_MEMORY_COUNTER by the requested size via incrementMemoryCounter. If DIRECT_MEMORY_LIMIT (by default taken from -XX:MaxDirectMemorySize) is reached, an exception is thrown immediately instead of falling into the time-consuming synchronous GC wait.

Step 2: allocateDirectNoCleaner allocates the memory through Unsafe and then uses the DIRECT_BUFFER_CONSTRUCTOR to build a DirectByteBuffer from the memory address and size. Release is likewise done through Unsafe.freeMemory, freeing by address rather than relying on Java's own Cleaner.

public static ByteBuffer allocateDirectNoCleaner(int capacity) {  
    assert USE_DIRECT_BUFFER_NO_CLEANER;  
  
    incrementMemoryCounter(capacity);  
    try {  
        return PlatformDependent0.allocateDirectNoCleaner(capacity);  
    } catch (Throwable e) {  
        decrementMemoryCounter(capacity);  
        throwException(e);  
        return null;
    }
}  
  
private static void incrementMemoryCounter(int capacity) {  
    if (DIRECT_MEMORY_COUNTER != null) {  
        long newUsedMemory = DIRECT_MEMORY_COUNTER.addAndGet(capacity);  
        if (newUsedMemory > DIRECT_MEMORY_LIMIT) {  
            DIRECT_MEMORY_COUNTER.addAndGet(-capacity);  
            throw new OutOfDirectMemoryError("failed to allocate " + capacity  
                    + " byte(s) of direct memory (used: " + (newUsedMemory - capacity)  
                    + ", max: " + DIRECT_MEMORY_LIMIT + ')');  
        }  
    }  
}  
  
static ByteBuffer allocateDirectNoCleaner(int capacity) {  
  return newDirectBuffer(UNSAFE.allocateMemory(Math.max(1, capacity)), capacity);  
}  
  
  • After the source code analysis above, the root cause is clear: ByteBuffer.allocateDirect waits synchronously on GC for direct memory to be released, so consumption capacity collapses. Moreover, once the maximum direct memory is exhausted, consumption blocks on a large scale while waiting for direct memory, the WriteTask consumption rate drops to nearly zero, and memory cannot come down.

Summary

1. Flowchart:

2. Direct cause:

  • When synchronizing data across data centers, a single channel did not have enough throughput, so the TCP connection became congested. As a result, in Netty's eventLoop the rate at which WriteTask tasks (writeAndFlush) wrote data exceeded the rate at which data could be flushed, so large amounts of requested direct memory sat in the ChannelOutboundBuffer#unflushedEntry linked list and could not be flushed.

3. Root cause:

  • On newer JDK versions, Netty needs the JVM parameters --add-opens=java.base/java.nio=ALL-UNNAMED and -Dio.netty.tryReflectionSetAccessible=true to be added manually so that it can call the underlying Unsafe directly to allocate memory. If they are not enabled, Netty allocates direct memory through ByteBuffer.allocateDirect. When the direct memory requested by EventLoop consumption tasks reaches the maximum direct memory, a large number of task consumptions wait synchronously for direct memory; if not enough direct memory is released, consumption blocks on a large scale, and huge numbers of objects pile up in Netty's unbounded queue MpscUnboundedArrayQueue.
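For reference, a hedged example of a startup command with the flags discussed above (the jar name and the heap/direct-memory sizes are illustrative only):

java --add-opens=java.base/java.nio=ALL-UNNAMED \
     -Dio.netty.tryReflectionSetAccessible=true \
     -XX:MaxDirectMemorySize=2g \
     -XX:+UseZGC \
     -jar lingji-server.jar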

4. Retrospective: why the problem was slow to surface and hard to locate:

  • By default we assumed the synchronized data would never be a system bottleneck, so there was no check of the low/high write-buffer water marks (socketChannel.isWritable()). If synchronized data does hit a system bottleneck, an exception should be raised early (see the sketch after this list).

  • When calling writeAndFlush for data synchronization, a listener for exceptions should be added (signature 2 below). If an OutOfMemoryError can be sensed early, troubleshooting this kind of problem becomes much easier (also covered in the sketch after this list).

(1)ChannelFuture writeAndFlush(Object msg)  
(2)ChannelFuture writeAndFlush(Object msg, ChannelPromise promise);  
  • Under JDK 17, the non-heap memory shown by the monitoring system did not match the direct memory actually used by the process, so during troubleshooting we could not tell that direct memory had already hit its maximum, and this avenue was not pursued.

  • Middleware referenced by the project also relies on Netty for its underlying communication, so similar data synchronization could trigger similar problems. In particular, higher versions of ump, and titan, shade Netty when packaging, so the relevant JVM parameter names change as well; even though this bug is not triggered there, they may still cause System.gc().

ump (higher versions): JVM parameter renamed (lower versions used raw socket communication directly, without Netty or ByteBuffer creation): io.netty.tryReflectionSetAccessible -> ump.profiler.shade.io.netty.tryReflectionSetAccessible

titan: JVM parameter renamed: io.netty.tryReflectionSetAccessible -> titan.profiler.shade.io.netty.tryReflectionSetAccessible
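Putting the first two points of this list together, a minimal hedged sketch (the class name SafeBackupWriter and the 32 KB / 64 KB water marks are illustrative): check isWritable() before queuing more data, and attach a listener to writeAndFlush so that failures such as OutOfMemoryError surface immediately.

import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

public final class SafeBackupWriter {

    public static void configure(Channel backupChannel) {
        // Once pending outbound bytes exceed the high water mark, isWritable() turns false
        // until they drain back below the low water mark.
        backupChannel.config().setOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
                new WriteBufferWaterMark(32 * 1024, 64 * 1024));
    }

    public static void send(Channel backupChannel, Object batch) {
        if (!backupChannel.isWritable()) {
            // Back-pressure: fail fast (or switch channel / buffer / drop) instead of piling up WriteTasks.
            throw new IllegalStateException("backup channel is congested, refusing to queue more data");
        }
        backupChannel.writeAndFlush(batch).addListener((ChannelFutureListener) future -> {
            if (!future.isSuccess()) {
                // An OutOfMemoryError or OutOfDirectMemoryError raised in the event loop shows up here.
                future.cause().printStackTrace();
            }
        });
    }
}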


Source: blog.csdn.net/weiweiqiao/article/details/132793119