What should you do when a container uses up its memory resources? Let's work through the problem.

1. The mechanism and characteristics of Linux memory management

OOM Killer

  1. When memory runs short on a Linux system, the OOM Killer kills a running process to free some memory. If that process happens to be the container's entrypoint, the container exits. Inspecting the container with the docker inspect command then shows it in the "exited" state with "OOMKilled" set to true.
  2. Every program on Linux requests memory by calling malloc(). If memory is insufficient, why not simply have malloc() return failure instead of killing a running process? Because Linux lets processes request more memory than the physical memory actually available: malloc() hands out virtual addresses, so the system only gives the program an address range. As long as no data is written there, the program gets no real physical memory; physical pages are allocated only when the program actually writes data to those addresses. PS: virtual memory shows up in the VSZ column of ps aux; physical memory shows up in the RSS column (displayed as RES in top).
  3. When OOM occurs, what criteria does Linux use to pick the process to kill? The Linux kernel has an oom_badness() function that defines the selection criteria.
  4. How can we quickly determine that a container hit OOM? Check the kernel log in time: use the journalctl -k command, or read the log file /var/log/messages directly (see the sketch below).
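
Putting these checks together, here is a minimal sketch; the container name `myapp` and the PID lookup are placeholders:

```bash
# Did the OOM Killer end this container? ("myapp" is a placeholder name)
docker inspect -f '{{.State.Status}} OOMKilled={{.State.OOMKilled}}' myapp

# Find the kernel's OOM report
journalctl -k | grep -i 'out of memory'

# Per-process inputs to the kill decision: oom_score is derived from
# oom_badness(); oom_score_adj (-1000..1000) lets you bias the choice
cat /proc/$(pidof myapp)/oom_score
cat /proc/$(pidof myapp)/oom_score_adj
```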

Memory Cgroup

The Memory Cgroup interface files live under /sys/fs/cgroup/memory (cgroups v1):

  1. memory.limit_in_bytes, the maximum amount of memory available to all processes in a control group
  2. memory.oom_control determines whether the OOM Killer is triggered when the memory usage of the processes in the control group reaches the upper limit (it is enabled by default); of course, only processes inside the control group can be killed. After echo 1 > memory.oom_control (which disables the OOM Killer), even if the memory used by all processes in the control group reaches the upper limit set by memory.limit_in_bytes, the control group will not kill any process inside it. It does, however, affect processes in the group that are requesting physical memory pages: those processes are put into a stopped state and cannot make progress (see the sketch after this list).
  3. memory.usage_in_bytes, read-only, is the total memory actually used by all processes in the current control group.
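
A minimal hands-on sketch of these three interfaces, assuming the v1 memory hierarchy is mounted at /sys/fs/cgroup/memory; the group name `demo` is arbitrary:

```bash
# Create a control group and cap it at 512 MiB
mkdir /sys/fs/cgroup/memory/demo
echo $((512*1024*1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

# Disable the OOM Killer for this group: processes that hit the limit
# pause while requesting pages instead of being killed
echo 1 > /sys/fs/cgroup/memory/demo/memory.oom_control

# Move the current shell into the group, then watch actual usage
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs
cat /sys/fs/cgroup/memory/demo/memory.usage_in_bytes
```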

Linux memory types

  1. The kernel needs memory for page tables, kernel stacks, and the slab, which is the cache pool for the kernel's various data structures;
  2. User-space process memory
    1. RSS memory, which covers the process's code segment, stack, heap, and shared library memory
    2. Page Cache for file reads and writes. It is a mechanism that uses free physical memory to improve disk file I/O performance, because by default the read() and write() system calls keep the pages they read or write in the Page Cache.
  3. Linux memory management has a page reclamation mechanism (page frame reclaim), which decides whether to start reclaiming based on whether the system's free physical memory has fallen below a certain threshold (watermark). The reclamation algorithm decides which pages to release first according to the memory type and the least-recently-used principle, i.e. the LRU (Least Recently Used) algorithm. Because Page Cache pages only serve as a cache, they are naturally released first (see the experiment after this list).
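
To watch the Page Cache behave like a reclaimable cache, a small experiment (run as root; the file path and sizes are arbitrary):

```bash
free -m                                        # note the buff/cache and available columns
dd if=/dev/zero of=/tmp/big bs=1M count=1024   # write a 1 GiB file
free -m                                        # buff/cache grows by roughly the file size

# Under memory pressure these pages go first; for the demo, force the release:
sync
echo 3 > /proc/sys/vm/drop_caches
free -m                                        # buff/cache shrinks back
rm /tmp/big
```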

Memory Cgroup does not limit kernel memory (page tables, slab, etc.). It only limits the two user-space memory types: RSS (Resident Set Size) and Page Cache. When a process in the control group needs new physical memory and memory.usage_in_bytes would exceed the group's upper limit memory.limit_in_bytes, the Linux memory reclamation (page frame reclaim) mentioned earlier is invoked. Part of the Page Cache in this control group is then released, sized to the new allocation, so the request for new physical memory can still succeed and the group's total physical memory usage memory.usage_in_bytes never exceeds the limit memory.limit_in_bytes. PS: this is why you often see a container's memory usage hovering right at the limit.

Memory Cgroup also has a memory.stat parameter that breaks down the actual usage of each memory type in the current control group. To judge how much memory the processes really occupy, we should not rely on memory.usage_in_bytes but use the rss value in memory.stat. This is much like using the free command to check a node's available memory: instead of the value under the "free" field, we look at the value under the "available" field, which accounts for the Page Cache that can be reclaimed.
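
Concretely, continuing with the hypothetical `demo` group from above:

```bash
# usage_in_bytes counts RSS plus Page Cache; the rss line in memory.stat
# is what the processes themselves actually occupy
cat /sys/fs/cgroup/memory/demo/memory.usage_in_bytes
grep -E '^(rss|cache) ' /sys/fs/cgroup/memory/demo/memory.stat

# Node-level analogy: trust "available", not "free"
free -m
```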

Swap

Swap is a piece of disk space. When memory is full, data in memory that is not frequently used can be temporarily written out to this Swap space, freeing memory to satisfy new allocation requests.

  1. If Swap space is enabled on the host node, Swap can be used inside containers as well.
    1. Because of the Swap space, a container that would otherwise have been OOM Killed can keep running (its RSS never exceeds the limit). This undermines a useful behavior: if a program in a container has a memory leak, the Memory Cgroup would normally kill the process promptly so it does not affect the other applications on the node. With Swap, the leaking process is not killed and keeps reading and writing the Swap disk, which degrades the performance of the entire node.
    2. When memory is tight, how does Linux decide whether to release Page Cache first or to release anonymous memory first (writing it into the Swap space)? If the system releases all the Page Cache first, then any node with frequent file reads and writes will see its performance drop. If the system releases anonymous memory first and writes it to Swap, then anonymous memory that is needed again right away must be read back from Swap into memory, and the frequent Swap (in fact, disk) reads and writes degrade system performance.
  2. Obviously, when reclaiming memory we need to balance releasing Page Cache against releasing anonymous memory. This is what the Linux swappiness parameter is for: once the system has Swap space, it controls whether memory reclamation prefers to release Page Cache or to release anonymous memory (that is, write it to Swap).
  3. Each Memory Cgroup control group also has its own memory.swappiness. The difference is that when memory.swappiness is set to 0 in a control group, anonymous memory in that group stops being written to Swap entirely (see the sketch below).
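
A sketch of both knobs; the `demo` group path is the hypothetical one used earlier:

```bash
# Global balance between dropping Page Cache and swapping anonymous memory
cat /proc/sys/vm/swappiness            # commonly defaults to 60
sysctl vm.swappiness=10                # lean toward releasing Page Cache

# Per-cgroup override (v1): 0 completely stops this group's anonymous
# memory from going to Swap
echo 0 > /sys/fs/cgroup/memory/demo/memory.swappiness
```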

2. Problems in Linux memory management and their solutions

Since the Linux kernel's principle is to use memory as much as possible rather than reclaim it continuously, memory usage tends to keep climbing as processes in a container request memory. When a container's usage approaches its Limit, container-level direct memory reclamation (direct reclaim) is triggered to reclaim clean file pages. This happens in the context of the process requesting memory, so it causes the application in the container to stall; if the memory allocation rate is high, it may even get the container OOM (Out of Memory) Killed, interrupting and restarting the application.

When memory is tight on the whole machine, the kernel triggers reclamation according to free-memory watermarks (the Free figure in kernel interface statistics): when free memory drops to the Low watermark, background reclamation starts; it is carried out by the kernel thread kswapd, does not block application processes, and can reclaim dirty pages. When free memory drops to the Min watermark (Min < Low), global direct memory reclamation is triggered; this happens in the context of the allocating process and scans more pages, so performance suffers badly and every container on the node may be disturbed. When the machine-wide memory allocation rate exceeds the reclamation rate, a wider-scope OOM is triggered and resource availability drops.

Fairness issue

Containers that overuse resources (Usage > Request) can compete for memory with containers that do not. At the Request level, the Kubelet sets the cgroups interface cpu.shares according to the CPU Request as the relative weight containers get when competing for CPU; when CPU is tight, CPU time is shared between containers in proportion to their Requests, which preserves fairness. Memory Request, however, is not mapped to any cgroups interface by default; it is mainly used for scheduling and eviction decisions. When node memory gets tight, since the Memory Request is not mapped to a cgroups interface, available memory is not divided between containers in proportion to Requests the way CPU is, so there is no fairness guarantee for memory (see the sketch below).
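
The CPU mapping just described can be checked by hand; the formula is the standard Kubelet conversion, and the pod cgroup path in the comment is illustrative (it varies with cgroup driver and Kubernetes version):

```bash
# Kubelet's CPU Request -> cpu.shares conversion: shares = milliCPU * 1024 / 1000
req_millicpu=500
echo $(( req_millicpu * 1024 / 1000 ))   # prints 512

# Verify on a node (path is illustrative):
#   cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<uid>/cpu.shares
# Memory Request sets no cgroup v1 interface; only the Limit appears,
# as memory.limit_in_bytes
```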

Starting with Kubernetes 1.22, the Kubelet provides the MemoryQoS feature, which uses the memcg QoS capability of Linux cgroups v2 to further guarantee the memory resource quality of containers, including:

  1. Setting the container's Memory Request on the cgroups v2 interface memory.min, which locks the requested memory so it is not taken by global memory reclamation.
  2. Setting the cgroups v2 interface memory.high based on the container's Memory Limit, so that when a Pod overuses memory (Memory Usage > Request), throttling kicks in first instead of letting unbounded overuse end in OOM (see the sketch below).
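
The same two knobs can be exercised by hand on a host with the cgroups v2 unified hierarchy mounted at /sys/fs/cgroup; the group name `qos-demo` and the sizes are arbitrary:

```bash
mkdir /sys/fs/cgroup/qos-demo

# memory.min: lock this much memory against global reclaim
# (what MemoryQoS derives from the Memory Request)
echo $((256*1024*1024)) > /sys/fs/cgroup/qos-demo/memory.min

# memory.high: throttle allocations above this level before OOM can happen
# (what MemoryQoS derives from the Memory Limit)
echo $((768*1024*1024)) > /sys/fs/cgroup/qos-demo/memory.high
```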

However, from the user's perspective there are still some shortcomings:

  1. When a Pod declares Memory Request = Limit, the container can still come under memory pressure, and the memcg-level direct memory reclamation that gets triggered may affect the response time (RT) of the application service.
  2. The solution does not yet address compatibility with cgroups v1, where the memory fairness problem remains unresolved.

Guaranteeing (locking) memory for high-priority Pods during memory reclamation

In a Kubernetes cluster there can be a need for priority guarantees between Pods: high-priority Pods need better resource stability, and when machine-wide resources get tight, the impact on high-priority Pods should be minimized. In practice, however, it is often the low-priority Pods that run resource-hungry tasks, which makes them more likely to cause wide-ranging memory pressure and interfere with the resource quality of high-priority Pods; they are the real "troublemakers". Kubernetes currently deals with this mainly by having the Kubelet evict low-priority Pods, but that response may come only after global memory reclamation has already kicked in.

Alibaba Cloud Container Service (ACK) builds on the enhanced memory subsystem of Alibaba Cloud Linux 2, so users can enable the more complete container Memory QoS capabilities on cgroups v1 ahead of upstream, as follows:

  1. Guarantee fairness of memory reclamation among Pods: when machine-wide memory is tight, reclaim memory first from Pods that overuse it (Usage > Request). Memory QoS supports setting an active-reclamation watermark for such Pods so their memory usage is kept near that watermark, restraining the troublemakers and preventing machine-wide resource quality from degrading.
  2. When a Pod's memory usage approaches the Limit, part of its memory is first reclaimed asynchronously in the background, alleviating the performance hit of direct memory reclamation.
  3. When node memory is tight, the memory service quality of Guaranteed/Burstable Pods is prioritized. Memory QoS enables global minimum-watermark classification and kernel memcg QoS: when machine-wide memory is tight, memory is reclaimed first from BE (best-effort) containers, reducing the impact of global reclamation on LS (latency-sensitive) containers; overused memory is also reclaimed first, ensuring fairness of memory resources.

Origin: blog.csdn.net/m0_37723088/article/details/130576961