A lesson in high off-heap memory usage caused by compression



1. Project introduction

Project: lz_rec_push_kafka_consume
The project exchanges data with the algorithm side through Kafka, and the message bodies are pre-generated by the push recommendation platform (lz_rec_push_platform).

2. Problem background

The project's k8s containers were found to keep restarting, and the restarts coincided exactly with a push traffic expansion in which the hourly push data volume grew about fivefold.
Container configuration when the problem occurred: CPU: 4 cores; memory: 3G heap plus 1G off-heap.

3. Troubleshooting process: look-smell-ask-cut

Look: check the monitoring system and observe the container instance's resource usage at the moments the restarts happened.



Note on the container restart mechanism: when k8s finds that an "instance" is using more memory than it requested, it restarts the container. This is done directly with kill -9 rather than restarting the JVM gracefully, so do not count on getting a heap dump at that point.




At first memory was suspected, but insufficient heap memory should show up as an OOM, so the first step was to rule that out: the instance memory was expanded to 6G (5G heap, 1G off-heap). The restarts did not improve at all.

Smell: Check the health of the project: threads, on-heap memory usage, off-heap memory usage.

  1. Checked the project's thread state and garbage collection with jstack and jstat: no sudden growth in threads, no full GC, and no abnormally frequent young GC.

  2. The top command showed a RES value much larger than the heap size reported by jstat (unfortunately the output was not saved). At this point an off-heap memory leak was suspected. To confirm that the leak was outside the heap rather than inside it, the GC log file was analyzed.

  • Analyzed the GC log with easygc: there was no full GC (the four full GCs in the figure were triggered manually with jmap -histo:live for testing), and every young GC reclaimed objects normally.

  • Modified the startup script: set -Xmx and -Xms to 4G and added the heap dump parameters (-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/logs/), so that if an OOM occurred inside the heap we would get a heap file to analyze.
    But things did not go as hoped: the container restarted many times without any in-heap OOM, so there was never a dump to analyze. This made it even more certain that the leak was in off-heap memory.

  • Configured the off-heap parameter -XX:MaxDirectMemorySize to limit off-heap memory usage, yet the instance's memory still swelled to 11G. People online say this parameter limits off-heap usage, so perhaps I was not using it correctly. The original idea was to use it to force an off-heap out-of-memory error and thereby confirm the direction of the off-heap leak. (In hindsight, this flag only limits NIO DirectByteBuffer allocations; native memory allocated through JNI, which is what this leak turned out to be, is not counted against it. A small sketch of the difference follows below.)
    Since this approach did not work, the next step was to enlarge the off-heap space and see whether the leaked memory could eventually be reclaimed or whether it was a permanent leak.
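    To illustrate the distinction, here is a minimal sketch of my own (not part of the original troubleshooting): the flag only tracks NIO direct buffers, so a loop like the one below trips it quickly, while JNI-allocated native memory is never counted.

        // Run with e.g.: java -XX:MaxDirectMemorySize=64m DirectLimitDemo
        import java.nio.ByteBuffer;
        import java.util.ArrayList;
        import java.util.List;

        public class DirectLimitDemo {
            public static void main(String[] args) {
                List<ByteBuffer> buffers = new ArrayList<>();
                try {
                    while (true) {
                        // Each direct buffer is charged against MaxDirectMemorySize.
                        buffers.add(ByteBuffer.allocateDirect(8 * 1024 * 1024));
                    }
                } catch (OutOfMemoryError e) {
                    // Fires once the 64m budget is used up; memory that a library
                    // mallocs through JNI never shows up in this accounting.
                    System.out.println("direct buffer limit reached: " + e.getMessage());
                }
            }
        }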

  • Off-heap memory leaks are generally caused by object references held in the heap (NIO is the most common culprit, but NIO is not to blame this time), my guess being that those heap references were not getting collected. From the fourth figure, after a young GC under natural conditions, or after a manually triggered full GC, memory usage returns to a normal level. The judgment here: the leaked off-heap memory is held by references that are themselves collectible.
    Then the question becomes: before a garbage collection happens, this kind of reference piles up in large numbers, the off-heap space runs out, and k8s kills the container. That was the guess; the next step was to verify it.

    • Asked the ops colleague to enlarge the k8s instance to 12G, since the container's memory usage stabilized at roughly 11G before each restart. (In fact, the ops colleague had noticed the constant restarts and proactively offered the expansion to help with the investigation. Much appreciated.)

    • Limited the heap memory to 7G, with about 6G actually used in the heap, leaving as much room as possible for off-heap memory.

  • After the memory adjustment, the project's three instances ran for two days without restarting, and memory was reclaimed normally after each round of "pre-generated data". This confirmed that the leaked off-heap memory is recyclable rather than permanently leaked: it is released once the corresponding heap references are collected.

  • The picture above is the k8s instance resource monitoring chart; it only reflects the container's resources, not the heap of the project running inside it, so it only proves that the off-heap memory can be reclaimed and is not permanently leaked. Now that the restarts have stopped, is the problem solved and can we walk away? Naive. A 12G node is an unnecessary waste, and the ops colleague would have my head.
    Observation with jstat, and the conclusions from the GC log, show that heap usage basically stabilizes within 4G, so there is no need to waste 12G of space.

  • Ask: the remaining problem to solve is to find the cause of the off-heap memory leak.

    1. Searched Google for articles on off-heap memory troubleshooting, for example "Let's talk about how to find JVM off-heap memory leak bugs" and "A process of troubleshooting off-heap memory leaks".

    2. With the help of arthas it could be observed that a round of young GC happens once the Eden area grows past roughly 85%. So the plan was to watch the monitoring and dump the heap once Eden usage reached 80% (jmap -dump:format=b,file=heap.hprof).
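    Rather than staring at the dashboard, the trigger can also be automated from inside the application. A minimal sketch under my own assumptions (HotSpot JVM, an Eden pool whose name contains "Eden", dump path borrowed from the startup script above); this is not the exact procedure used during the investigation:

        import com.sun.management.HotSpotDiagnosticMXBean;
        import java.lang.management.ManagementFactory;
        import java.lang.management.MemoryPoolMXBean;
        import java.lang.management.MemoryUsage;

        public final class EdenDumpTrigger {

            // Starts a daemon thread inside the application JVM that dumps the heap
            // once Eden usage crosses 80% of its committed size.
            public static void start(final String dumpPath) {
                Thread watcher = new Thread(() -> {
                    try {
                        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                                ManagementFactory.getPlatformMBeanServer(),
                                "com.sun.management:type=HotSpotDiagnostic",
                                HotSpotDiagnosticMXBean.class);
                        while (true) {
                            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                                if (!pool.getName().contains("Eden")) {
                                    continue;
                                }
                                MemoryUsage usage = pool.getUsage();
                                if (usage.getUsed() > usage.getCommitted() * 0.8) {
                                    diag.dumpHeap(dumpPath, true); // true = dump live objects only
                                    return;
                                }
                            }
                            Thread.sleep(500);
                        }
                    } catch (Exception ignored) {
                        // best-effort helper: give up silently if anything goes wrong
                    }
                }, "eden-dump-trigger");
                watcher.setDaemon(true);
                watcher.start();
            }
        }

    It would be started once at application startup, e.g. EdenDumpTrigger.start("/data/logs/heap.hprof").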

    Cut: analyze the heap file with analysis tools: JProfiler (used later) and Memory Analyzer (MAT).

    1. Open the heap file with the Memory Analyzer (MAT) tool; how to use it is easy to find online and is not detailed here.

    • Open the heap file first


      • The analysis report showed three obvious problems. Problems one and two were caused by introducing arthas, so they can be skipped.


      • The third problem immediately stood out: java.lang.ref.Finalizer, an old acquaintance from when we first learned Java. If you are interested, Google it, or read: "JVM finalize implementation principle and the murder case it caused".
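        As a quick refresher, a generic sketch (not the project's code) of what overriding finalize() implies:

          // Any class that overrides finalize() gets a java.lang.ref.Finalizer
          // registered for it when an instance is allocated.
          class HoldsNativeResources {
              @Override
              protected void finalize() throws Throwable {
                  // Runs on the single "Finalizer" thread only after the object has
                  // become unreachable and a GC has enqueued it. Until then the object,
                  // and anything it guards (native memory included), stays alive, and
                  // the object itself needs a further GC before it is actually freed.
                  System.out.println("releasing resources");
              }
          }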


      • java.lang.ref.Finalizer basically confirms the problem lies in the reclamation phase, so the next step was to look at the objects waiting to be reclaimed. At this point the question is not how many objects have not been reclaimed, or why, but whether these unreclaimed objects point to off-heap memory.


        • Clicking into the instances to view their classes shows 3500+ unreclaimed objects pointing to java.util.zip.ZipFile$ZipFileInflaterInputStream. A quick Google search shows plenty of people have hit the same problem, for example: "Memory leak caused by the Java compression stream GZIPStream".


        • Seeing ZipFileInflaterInputStream, I immediately remembered where compression is used: after the push messages are pre-generated in batches, they are compressed and stored in redis, using zip compression.
          Unfortunately, the compression tool used in the project is the zip compression that ships with the JDK; if you are interested, read up on compression based on Deflater and Inflater. (For the exact usage, the sample code in the javadoc of these two classes should be the most authoritative reference.) The following is how it was used in the project:

      
    // Compress the message and base64-encode it for storage in redis.
    // (Signature reconstructed to match zipDecompress below; the original listing started mid-method.)
    public static String zipCompress(final String log) throws Exception {

        byte[] input = log.getBytes();

        try (final ByteArrayOutputStream outputStream = new ByteArrayOutputStream(input.length)) {
            final Deflater compressor = new Deflater();
            compressor.setInput(input);
            compressor.finish();

            // Deflate into 1 KB chunks and append them to the output stream.
            byte[] buffer = new byte[1024];
            int offset = 0;
            for (int length = compressor.deflate(buffer, offset, buffer.length); length > 0; length = compressor.deflate(buffer, offset, buffer.length)) {
                outputStream.write(buffer, 0, length);
                outputStream.flush();
            }
            //compressor.end();
            return Base64Utils.encodeToString(outputStream.toByteArray());
        }
    }

    // Base64-decode and inflate the stored message back to the original string.
    public static String zipDecompress(final String str) throws Exception {

        byte[] input = Base64Utils.decodeFromString(str);

        try (final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(input.length)) {

            final Inflater decompressor = new Inflater();
            decompressor.setInput(input);

            byte[] buffer = new byte[1024];
            for (int length = decompressor.inflate(buffer); length > 0 || !decompressor.finished(); length = decompressor.inflate(buffer)) {
                byteArrayOutputStream.write(buffer, 0, length);
            }
            //decompressor.end();
            return new String(byteArrayOutputStream.toByteArray());
        }
    }
      
      
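    For context, a hypothetical round trip through these two helpers (the message content is made up for illustration):

        String json = "{\"uid\":123,\"msg\":\"hello\"}";
        String packed = zipCompress(json);        // base64 text that gets stored in redis
        String restored = zipDecompress(packed);  // read back on the consuming side
        System.out.println(json.equals(restored)); // true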
      1. The odd thing is that both compression and decompression are written with try-with-resources, so in theory the streams are closed properly. Some people online recommend replacing zip with Snappy, but I still wanted to figure out why the resources were not released immediately after the method returned. (The catch: Deflater and Inflater, at least in the JDK version used here, do not implement AutoCloseable, so try-with-resources only closes the ByteArrayOutputStream, whose close() is a no-op, and never touches the compressor's native state.)

      2. Drilling into Deflater's deflate method and Inflater's inflate method, both end up calling "native" methods (see the JDK source for details). Both classes also have an end() method, whose javadoc reads as follows:

    /**
     * Closes the compressor and discards any unprocessed input.
     * This method should be called when the compressor is no longer
     * being used, but will also be called automatically by the
     * finalize() method. Once this method is called, the behavior
     * of the Deflater object is undefined.
     */
      
      1. So the fix in the code above is simply to stop omitting the end() call, i.e. the two commented-out lines (they are shown commented out because they were only added after the problem was located). Calling end() releases the off-heap memory immediately, instead of waiting for the next JVM garbage collection to collect the reference and only then, indirectly, free the off-heap buffer. Reading further into the source, it is easy to see that Deflater and Inflater do indeed override the finalize() method, and its implementation is simply a call to end(), which confirms the conjecture above. As we know, finalize() is called when the object is collected, and only once; therefore the off-heap space it references cannot be reclaimed before the object itself is reclaimed. (A corrected version of the compress helper is sketched after the source excerpt below.)

    /**
     * Closes the compressor and discards any unprocessed input.
     * This method should be called when the compressor is no longer
     * being used, but will also be called automatically by the
     * finalize() method. Once this method is called, the behavior
     * of the Deflater object is undefined.
     */
    public void end() {
        synchronized (zsRef) {
            long addr = zsRef.address();
            zsRef.clear();
            if (addr != 0) {
                end(addr);
                buf = null;
            }
        }
    }

    /**
     * Closes the compressor when garbage is collected.
     */
    protected void finalize() {
        end();
    }
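    Simply un-commenting the end() calls fixes the leak; moving the call into a finally block also covers the case where an exception is thrown midway. A sketch of the compress side, under the same assumptions as the listing above (the zipCompress name and the Base64Utils helper):

        public static String zipCompress(final String log) throws Exception {
            byte[] input = log.getBytes();
            final Deflater compressor = new Deflater();
            try (final ByteArrayOutputStream outputStream = new ByteArrayOutputStream(input.length)) {
                compressor.setInput(input);
                compressor.finish();

                byte[] buffer = new byte[1024];
                for (int length = compressor.deflate(buffer); length > 0; length = compressor.deflate(buffer)) {
                    outputStream.write(buffer, 0, length);
                }
                return Base64Utils.encodeToString(outputStream.toByteArray());
            } finally {
                // Release the native zlib buffers right away instead of leaving the
                // work to finalize() at some later garbage collection.
                compressor.end();
            }
        }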
      
      1. As for redis storage space: even at the data peak the volume is not large, so the compression was over-thinking it in the first place.

Thinking: the restarts only appeared after the Kafka data volume was expanded, so why did the problem not show up before the expansion? In fact, the problem had always existed, but with a small data volume the off-heap memory was released soon enough once the references were garbage-collected. After the expansion, the instantaneous traffic grew, a large number of references holding off-heap memory were created, and before the next garbage collection a large backlog of references had piled up in the ReferenceQueue, blowing past the container's off-heap memory.
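A rough way to reproduce the effect in isolation (my own illustration, not taken from the incident): allocate Deflater instances in a tight burst without calling end(), and compare the process RSS shown by top with the Java heap size while it runs.

    import java.util.zip.Deflater;

    public class NativeBurstDemo {
        public static void main(String[] args) {
            byte[] payload = new byte[8 * 1024];
            byte[] out = new byte[16 * 1024];
            for (int i = 0; i < 100_000; i++) {
                Deflater d = new Deflater();   // allocates native zlib state
                d.setInput(payload);
                d.finish();
                d.deflate(out);
                // d.end() is deliberately omitted: the native memory is only freed
                // after a GC enqueues the object and the finalizer thread has run
                // finalize(), so between collections the off-heap usage keeps climbing.
            }
            System.out.println("compare RSS in top with -Xmx while this runs");
        }
    }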

4. Medicine: remove the compression and decompression steps

After removing the compression and decompression, a new version was released for observation; the project's k8s instance resource usage stayed within a reasonable range.




At this point, the off-heap memory problem has been solved.

5. Thinking and review

Lesson: when using resources, keep the habit of releasing them promptly after use. This problem was caused by incorrect use of the compression classes, which counts as a fairly low-level mistake.

Since this was my first time troubleshooting an off-heap memory leak, I had no experience for quickly locking onto the problem, but no major detours were taken either. The article is a bit long-winded, as the main purpose is to record the troubleshooting process. This is my first blog post and the writing may be a little disorganized; please forgive me, point out any improper wording, and feel free to share suggestions.


Source: blog.csdn.net/yunzhaji3762/article/details/108878553