A detailed explanation of OOM Killed in Flink containerized environments

In production, Flink is usually deployed on a resource management system such as YARN or Kubernetes. The process runs in a container (a YARN container or Docker container), and its resources are strictly limited by the resource manager. On the other hand, Flink runs on the JVM, and the JVM does not cooperate particularly well with containerized environments. In particular, the JVM's complex and not fully controllable memory model easily causes the process to be killed for using more resources than allowed, making the Flink application unstable or even unavailable.

In response to this problem, Flink refactored its memory management module in version 1.10 and designed new memory parameters. In most scenarios, Flink's memory model and its defaults are good enough to shield users from the complex memory structure underneath. However, once a memory problem does occur, troubleshooting and fixing it requires quite a bit of domain knowledge, which tends to scare off ordinary users.

To this end, this article analyzes the memory models of the JVM and Flink, and summarizes the common causes of Flink memory usage exceeding the container limit that the author has encountered at work and learned about from community discussions. Since Flink's memory usage is closely related to user code, deployment environment, versions of various dependencies, and other factors, this article mainly discusses YARN deployments with Oracle JDK/OpenJDK 8 and Flink 1.10+. Special thanks to @宋辛童 (the main author of the new memory architecture in Flink 1.10+) and @唐云 (RocksDB StateBackend expert) for answering questions in the community, from which the author benefited a lot.

JVM memory partitions

In daily development, most Java users deal with the JVM Heap far more often than with other JVM memory partitions, so the other partitions are often collectively referred to as Off-Heap memory. For Flink, excessive memory usage usually comes from Off-Heap memory, so a deeper understanding of the JVM memory model is necessary.

According to the JVM 8 Spec[1], the memory partitions managed by the JVM are as follows:

JVM 8 memory model

In addition to the standard partitions required by the Spec, JVM implementations often add extra partitions for advanced features. Taking the HotSpot JVM as an example, following the categories of Oracle NMT (Native Memory Tracking)[5], JVM memory can be subdivided into the following areas (a small inspection sketch follows the list):

  • Heap: The memory area shared by all threads, which mainly stores objects created with the new operator. Its release is managed by the GC, and it can be used by user code or by the JVM itself.
  • Class: Metadata of classes, corresponding to the Method Area in the Spec (excluding the Constant Pool); the Metaspace in Java 8.
  • Thread: Thread-level memory areas, corresponding to the sum of the PC Register, Stack and Native Stack in the Spec.
  • Compiler: The memory used by the JIT (Just-In-Time) compiler.
  • Code Cache: The cache used to store code generated by the JIT compiler.
  • GC: The memory used by the garbage collector.
  • Symbol: The memory for storing symbols (such as field names, method signatures, and interned Strings), corresponding to the Constant Pool in the Spec.
  • Arena Chunk: Temporary buffers that the JVM requests from the operating system.
  • NMT: The memory used by NMT itself.
  • Internal: Other memory that does not fit the above categories, including Native/Direct memory requested by user code.
  • Unknown: Unknown memory.
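
Some of these areas can be observed from inside a running JVM. The following is a minimal sketch (plain Java, not Flink code) that uses the standard java.lang.management MXBeans to print heap usage, the non-heap pools such as Metaspace and Code Cache, and the Direct buffer pool; purely Native areas such as Thread, GC and Internal are only visible through tools like NMT.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class JvmMemoryPartitions {
    public static void main(String[] args) {
        // Heap vs. non-heap (Metaspace, Code Cache, Compressed Class Space, ...)
        System.out.println("Heap:     " + ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
        System.out.println("Non-heap: " + ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage());

        // Individual memory pools, e.g. Metaspace and Code Cache
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getType() + " / " + pool.getName() + ": " + pool.getUsage());
        }

        // Direct and mapped buffer pools (part of Off-Heap memory)
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.println("BufferPool " + pool.getName()
                    + ": used=" + pool.getMemoryUsed() + " bytes, capacity=" + pool.getTotalCapacity());
        }
    }
}
```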

Ideally, we could strictly control the upper limit of each partition to ensure that the overall memory of the process stays within the container limit. However, overly strict management brings extra usage costs and lacks flexibility. Therefore, in practice the JVM only provides hard upper limits for a few of the partitions exposed to users, while the remaining partitions as a whole can be viewed as memory consumed by the JVM itself.

The JVM parameters that can be used to cap the memory of each partition are summarized below (it is worth noting that the industry has no precise definition of JVM Native memory; in this article, Native memory refers to the non-Direct part of Off-Heap memory and is used interchangeably with Non-Direct):

  • Heap: -Xms / -Xmx
  • Metaspace: -XX:MaxMetaspaceSize
  • Direct: -XX:MaxDirectMemorySize
  • Native (non-Direct): no JVM parameter to cap it

As can be seen, it is relatively safe to use Heap, Metaspace and Direct memory, while the situation of non-Direct Native memory is more complicated: it may be memory used internally by the JVM itself (such as the MemberNameTable mentioned below), it may come from JNI dependencies introduced by user code, or it may be Native memory requested directly by user code through sun.misc.Unsafe. In theory, Native memory requested by user code or third-party libraries needs to be planned by the user, while the rest of the Internal usage can be counted as memory consumed by the JVM itself. In fact, Flink's memory model follows a similar principle.
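
To illustrate why the Direct partition is on the "safer" side, here is a small standalone sketch (class name and sizes are made up for illustration): when Direct allocations exceed -XX:MaxDirectMemorySize, the JVM fails with a well-defined OutOfMemoryError instead of letting the process grow past the container limit, whereas Native memory obtained via sun.misc.Unsafe or JNI is not covered by this flag.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with e.g.: java -XX:MaxDirectMemorySize=64m DirectMemoryLimitDemo
public class DirectMemoryLimitDemo {
    public static void main(String[] args) {
        List<ByteBuffer> buffers = new ArrayList<>();
        try {
            while (true) {
                // Each allocation consumes Direct (Off-Heap) memory, counted against MaxDirectMemorySize.
                buffers.add(ByteBuffer.allocateDirect(16 * 1024 * 1024));
            }
        } catch (OutOfMemoryError e) {
            // Expected once the configured limit is exceeded: "Direct buffer memory"
            System.out.println("Hit the Direct memory hard limit after "
                    + buffers.size() + " buffers: " + e.getMessage());
        }
    }
}
```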

Flink TaskManager memory model

Let's first review the TaskManager memory model of Flink 1.10+.

Flink TaskManager memory model

Obviously, the Flink framework itself not only uses Heap memory managed by the JVM, but also requests Native and Direct Off-Heap memory on its own. In the author's view, Flink's management strategies for Off-Heap memory can be divided into three types (a configuration sketch follows the list):

  • Hard Limit: A hard-limited memory partition is Self-Contained; Flink guarantees that its usage will not exceed the configured threshold (if memory is insufficient, an OOM-like exception is thrown).
  • Soft Limit: A soft limit means that memory usage stays below the threshold in the long run, but may temporarily exceed the configured threshold.
  • Reserved: Reserved means that Flink does not limit the partition's memory usage at all; it only sets aside some space for it when planning memory, and there is no guarantee that actual usage stays within the reservation.
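
To make the three strategies more concrete, below is a hedged configuration sketch for Flink 1.10+. The option keys are the documented ones, the values are purely illustrative, and in practice they are normally set in flink-conf.yaml rather than built in code.

```java
import org.apache.flink.configuration.Configuration;

// Illustrative values only; these keys normally live in flink-conf.yaml.
public class FlinkMemoryConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();

        // Counted into -XX:MaxDirectMemorySize, so overuse surfaces as a Direct OOM rather than a container kill.
        conf.setString("taskmanager.memory.task.off-heap.size", "256m");
        conf.setString("taskmanager.memory.framework.off-heap.size", "128m");

        // Managed Memory, used e.g. by the RocksDB state backend; normally hard-limited by Flink,
        // although the Block Cache bug discussed later can make it behave like a soft limit.
        conf.setString("taskmanager.memory.managed.fraction", "0.4");

        // JVM Overhead is only reserved: Flink plans for it but does not enforce it.
        conf.setString("taskmanager.memory.jvm-overhead.min", "192m");
        conf.setString("taskmanager.memory.jvm-overhead.max", "1g");
        conf.setString("taskmanager.memory.jvm-overhead.fraction", "0.1");

        return conf;
    }
}
```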

Combined with the JVM's own memory management, what happens when a Flink memory partition overflows? The judgment logic is as follows:

1. If the partition is hard-limited by Flink, Flink reports that this partition has insufficient memory. Otherwise, go to the next step.
2. If the partition belongs to a partition managed by the JVM, then when its actual usage grows and also exhausts the JVM partition it belongs to, the JVM reports an OOM for that JVM partition (such as java.lang.OutOfMemoryError: Java heap space). Otherwise, go to the next step.
3. The partition's memory keeps growing and eventually causes the overall process memory to exceed the container memory limit. In an environment with strict resource control enabled, the resource manager (YARN/Kubernetes, etc.) kills the process.

To visualize the relationship between Flink's memory partitions and the JVM's memory partitions, the author has compiled the following mapping table:

Flink partition and JVM partition memory limit relationship

According to the logic above, among all Flink memory partitions, only JVM Overhead, which is neither Self-Contained nor backed by a JVM partition with a hard memory limit parameter, can cause the process to be OOM killed. As a hodgepodge reserved for all kinds of purposes, JVM Overhead is indeed prone to problems, but at the same time it can also serve as an isolation buffer that mitigates memory problems originating in other partitions.

For example, the Flink memory model has a trick when calculating Native Non-Direct memory:

Although, native non-direct memory usage can be accounted for as a part of the framework off-heap memory or task off-heap memory, it will result in a higher JVM’s direct memory limit in this case.

That is, although the Task/Framework Off-Heap partitions may contain Native Non-Direct memory, and that part of memory strictly speaking belongs to JVM Overhead and is not limited by the JVM's -XX:MaxDirectMemorySize parameter, Flink still counts it toward MaxDirectMemorySize. This part of the reserved Direct memory quota will not actually be used, so it is in effect left to the JVM Overhead, which has no hard upper limit, achieving the effect of reserving space for Native Non-Direct memory.
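
As a minimal arithmetic sketch of this accounting, assuming the Flink 1.10+ rule that -XX:MaxDirectMemorySize is derived as the sum of Framework Off-Heap, Task Off-Heap and Network memory (all sizes below are made up):

```java
public class MaxDirectMemorySketch {
    public static void main(String[] args) {
        long frameworkOffHeap = mb(128); // taskmanager.memory.framework.off-heap.size
        long taskOffHeap      = mb(256); // taskmanager.memory.task.off-heap.size, may actually be Native Non-Direct
        long network          = mb(512); // network buffers, always real Direct memory

        // Flink derives -XX:MaxDirectMemorySize from the sum of the three partitions.
        long maxDirect = frameworkOffHeap + taskOffHeap + network;

        // If part of taskOffHeap is really Native Non-Direct memory, that slice of the Direct
        // quota is never claimed by real Direct buffers, which (per the text above) effectively
        // leaves reserved room for those Native allocations even though the JVM cannot cap them.
        System.out.println("-XX:MaxDirectMemorySize=" + maxDirect / (1024 * 1024) + "m");
    }

    private static long mb(long v) {
        return v * 1024 * 1024;
    }
}
```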

Common causes of OOM Killed

Consistent with the analysis above, the common causes of OOM Killed in practice basically stem from leaks or overuse of Native memory. Because OOM Killed on virtual memory is easily avoided through resource manager configuration and usually does not cause big problems, the following only discusses OOM Killed on physical memory.

Uncertainty of RocksDB Native memory

As we all know, RocksDB requests Native memory directly through JNI, outside of Flink's control, so Flink can only indirectly influence its memory usage by setting RocksDB's memory parameters. However, Flink currently can only estimate these parameters rather than compute them precisely, for several reasons.

The first reason is that part of the memory is difficult to account for accurately. RocksDB's memory consists of four parts[6]:

  • Block Cache: A cache above the OS page cache that stores uncompressed data blocks.
  • Indexes and filter blocks: Indexes and Bloom filters used to optimize read performance.
  • Memtable: Similar to a write cache.
  • Blocks pinned by Iterator: When a RocksDB traversal operation is triggered (such as traversing all keys of a RocksDBMapState), the Iterator prevents the blocks and memtables it references from being released during its lifetime, resulting in additional memory usage[10].

The memory of the first three areas is configurable, but the resources pinned by Iterators depend on the application's usage pattern and have no hard limit. Therefore, Flink does not take this part into account when calculating the memory of the RocksDB StateBackend.

The second reason is a bug in the RocksDB Block Cache[8][9]: the cache size cannot be strictly controlled and may exceed the configured capacity for a short time, which makes it effectively a soft limit.

For this problem, it is usually enough to increase the JVM Overhead threshold so that Flink reserves more memory, because RocksDB's memory overuse is only temporary.
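
A hedged sketch of that mitigation, using documented configuration keys with purely illustrative values: keep RocksDB inside Flink's Managed Memory budget and enlarge the JVM Overhead reservation so that temporary overuse stays within the container limit.

```java
import org.apache.flink.configuration.Configuration;

// Illustrative values; normally configured in flink-conf.yaml.
public class RocksDbMemorySketch {
    public static Configuration build() {
        Configuration conf = new Configuration();

        // Let RocksDB budget its Block Cache / memtables from Flink's Managed Memory (the default in 1.10+).
        conf.setString("state.backend.rocksdb.memory.managed", "true");

        // Reserve more JVM Overhead so short-lived overuse (Block Cache bug, blocks pinned
        // by Iterators) does not push the process over the container limit.
        conf.setString("taskmanager.memory.jvm-overhead.fraction", "0.2");
        conf.setString("taskmanager.memory.jvm-overhead.max", "2g");

        return conf;
    }
}
```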

glibc Thread Arena issues

Another common problem is glibc's famous 64 MB problem, which can significantly increase the memory usage of the JVM process and eventually cause it to be killed by YARN.

Specifically, the JVM requests memory through glibc, and in order to improve allocation efficiency and reduce fragmentation, glibc maintains memory pools called Arenas, including one shared Main Arena and per-thread Thread Arenas. When a thread needs to allocate memory but the Main Arena is locked by another thread, glibc allocates a Thread Arena of roughly 64 MB (on 64-bit machines) for that thread. These Thread Arenas are invisible to the JVM, but they are counted into the process's overall virtual memory (VIRT) and physical memory (RSS).

By default, the maximum number of Arenas is 8 times the number of CPU cores. For an ordinary 32-core server that is up to 16 GB, which is not negligible. To control the total amount of memory consumed, glibc provides the environment variable MALLOC_ARENA_MAX to limit the total number of Arenas; Hadoop, for example, sets this value to 4 by default. However, this parameter is only a soft limit: when all Arenas are locked, glibc will still create a new Thread Arena to allocate memory[11], causing unexpected memory usage.

Generally speaking, this problem occurs in applications that frequently create threads. For example, the HDFS client creates a DataStreamer thread for each file being written, so it is relatively easy to run into the Thread Arena problem there. If you suspect that your Flink application has hit this problem, a simple way to verify it is to check whether the process's pmap output contains many contiguous anon segments whose sizes are multiples of 64 MB; for example, the 65536 KB segments highlighted in blue in the figure below are very likely Arenas.

pmap 64 MB arena
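
If running pmap is inconvenient, a rough equivalent can be done from inside the JVM. The sketch below (Linux only, a heuristic rather than a diagnosis; the class name is made up) scans /proc/self/maps for anonymous mappings close to 64 MiB, the typical footprint of a glibc Thread Arena.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ArenaScan {
    public static void main(String[] args) throws IOException {
        List<String> maps = Files.readAllLines(Paths.get("/proc/self/maps"));
        long suspects = 0;
        for (String line : maps) {
            String[] parts = line.trim().split("\\s+");
            // Anonymous mappings have no path component (only 5 fields per line).
            if (parts.length >= 6) {
                continue;
            }
            String[] range = parts[0].split("-");
            long size = Long.parseLong(range[1], 16) - Long.parseLong(range[0], 16);
            // glibc Thread Arenas reserve close to 64 MiB per arena.
            if (size >= 60L * 1024 * 1024 && size <= 64L * 1024 * 1024) {
                suspects++;
            }
        }
        System.out.println("Anonymous mappings close to 64 MiB (possible glibc arenas): " + suspects);
    }
}
```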

The fix is relatively simple: set MALLOC_ARENA_MAX to 1, that is, disable Thread Arenas and use only the Main Arena. The cost, of course, is lower thread memory allocation efficiency. It is worth mentioning, however, that overriding the default MALLOC_ARENA_MAX through Flink's process environment variable parameters (such as containerized.taskmanager.env.MALLOC_ARENA_MAX=1) may not work: when the variable conflicts with an existing one that is not in the whitelist (yarn.nodemanager.env-whitelist), the NodeManager merges the original value and the newly added value the way it merges URLs, producing a result like MALLOC_ARENA_MAX="4:1".

Finally, there is a more thorough alternative: replace glibc with Google's tcmalloc or Facebook's jemalloc[12]. Besides eliminating the Thread Arena problem, they also offer better memory allocation performance and less fragmentation. In fact, the official Flink 1.12 image switched the default memory allocator from glibc to jemalloc[17].

JDK8 Native memory leak

Oracle JDK versions earlier than 8u152 have a Native memory leak bug[13] that causes the JVM's Internal memory partition to keep growing.

Specifically, the JVM caches the mappings from string symbols (Symbol) to methods (Method) and member fields (Field) to speed up lookups. Each such mapping pair is called a MemberName, and the whole mapping table is called the MemberNameTable, which is maintained by the java.lang.invoke.MethodHandles class. Before JDK 8u152, the MemberNameTable used Native memory, so obsolete MemberNames were not automatically cleaned up by the GC, causing a memory leak.
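
For context, MemberNames are created when method handles are resolved, as in the minimal sketch below; each resolved handle is backed by a MemberName entry tracked per class in the MemberNameTable, which before 8u152 lived in Native memory (showing up as the growing Internal section in NMT). This only illustrates the code path involved; it is not a reproducer of the leak.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MemberNameExample {
    public static void main(String[] args) throws Throwable {
        // Resolving a method handle creates a MemberName that records the resolved
        // method; these entries are tracked in the declaring class's MemberNameTable.
        MethodHandle length = MethodHandles.lookup()
                .findVirtual(String.class, "length", MethodType.methodType(int.class));
        System.out.println((int) length.invokeExact("flink"));
    }
}
```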

To confirm this problem, you need to examine the JVM memory through NMT. For example, the author once encountered an online TaskManager whose MemberNameTable exceeded 400 MB.

JDK8 MemberNameTable Native memory leak

With JDK-8013267[14], the MemberNameTable was moved from Native memory to the Java heap, which fixed this problem. However, this is not the only Native memory leak in the JVM; there are others, such as the C2 compiler memory leak[15]. So for users who, like the author, do not have a dedicated JVM team, upgrading to the latest version of the JDK is the best way to fix such problems.

YARN mmap memory algorithm

As we all know, YARN calculates the total memory of a container's process tree based on the process information under /proc/${pid}, but mmap shared memory is a special case. mmap memory is entirely counted into the process's VIRT, which is not controversial, but there are different standards for how it contributes to RSS. Following the rules used by YARN and Linux smaps, pages are classified along two dimensions:

  • Private Pages: pages mapped only by the current process
  • Shared Pages: pages shared with other processes
  • Clean Pages: pages that have not been modified since being mapped
  • Dirty Pages: pages that have been modified since being mapped

In the default implementation, YARN calculates the total memory based on /proc/${pid}/status: all Shared Pages are counted into the process's RSS, even if these pages are mapped by multiple processes at the same time[16]. This deviates from the physical memory actually used at the operating system level, and may cause the Flink process to be killed by mistake (provided, of course, that the user code uses mmap and does not reserve enough space for it).

For this reason, YARN provides the yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled configuration option. When it is set to true, YARN calculates memory usage based on the more accurate /proc/${pid}/smaps. One of the key concepts there is PSS. In simple terms, the difference with PSS is that Shared Pages are split evenly among all processes that use them when calculating memory. For example, if a process holds 1000 Private Pages and 1000 Shared Pages that are shared with one other process, the total number of pages attributed to that process is 1500.

Back to YARN's memory calculation: the process RSS equals the sum of the RSS of all pages it maps. By default, YARN calculates the RSS of a page as:

```
Page RSS = Private_Clean + Private_Dirty + Shared_Clean + Shared_Dirty
```

Because a page is either Private or Shared, and either Clean or Dirty, at least three of the terms on the right-hand side are 0. After the smaps option is enabled, the formula becomes:

```
Page RSS = Min(Shared_Dirty, PSS) + Private_Clean + Private_Dirty
```

Simply put, the new formula removes the double counting of the Shared_Clean part. Although enabling the smaps-based calculation makes the result more accurate, it introduces the overhead of traversing all pages to compute total memory, which is slower than directly reading the statistics in /proc/${pid}/status. Therefore, if you encounter mmap-related problems, it is recommended to increase the capacity of Flink's JVM Overhead partition instead.
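
The effect of the two formulas can be approximated from inside a process by comparing the Rss and Pss fields of /proc/self/smaps, as in the hedged sketch below (Linux only, class name made up); a large gap between the two sums indicates heavy use of shared mappings that the default accounting would count repeatedly.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RssVsPss {
    public static void main(String[] args) throws IOException {
        long rssKb = 0;
        long pssKb = 0;
        // Each mapping in smaps reports "Rss: <n> kB" and "Pss: <n> kB".
        for (String line : Files.readAllLines(Paths.get("/proc/self/smaps"))) {
            if (line.startsWith("Rss:")) {
                rssKb += parseKb(line);
            } else if (line.startsWith("Pss:")) {
                pssKb += parseKb(line);
            }
        }
        System.out.println("Sum of Rss = " + rssKb + " kB, sum of Pss = " + pssKb + " kB");
    }

    private static long parseKb(String line) {
        String[] parts = line.split("\\s+");
        return Long.parseLong(parts[1]);
    }
}
```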

Summary

This article first introduced the JVM memory model and the Flink TaskManager memory model, then explained why a process being OOM Killed usually originates from Native memory leaks, and finally listed several common causes of Native memory leaks along with their solutions, including the uncertainty of RocksDB memory usage, glibc's 64 MB problem, the JDK8 MemberNameTable leak, and YARN's inaccurate accounting of mmap memory. Due to the author's limited knowledge, the correctness of all the content cannot be guaranteed; if readers have different opinions, please leave a comment and discuss.

 
