Cloud native in depth: how to debug network latency issues in Kubernetes clusters

1. Introduction

  • As Kubernetes clusters continue to grow, the latency requirements placed on services become more and more stringent. We observed that some services running on our Kubernetes platform occasionally suffered from latency spikes, and these intermittent problems were not caused by performance issues in the applications themselves.
  • The latency problems affecting applications on the Kubernetes cluster appeared to be random. Establishing some network connections could take more than 100ms, causing downstream services to time out or retry. The services themselves kept their business response times well within 100ms, but spending more than 100ms just to establish a connection is intolerable. In addition, some SQL queries that should execute very quickly (on the order of milliseconds) took more than 100ms from the application's perspective, yet from the MySQL server's perspective everything was normal and no slow queries could be found.
  • Through troubleshooting, the problem could be narrowed down to the connection-establishment step with a Kubernetes node, whether for requests inside the cluster or for requests involving external resources and external visitors. The simplest way to reproduce the problem is to run Vegeta on any internal node and launch an HTTP stress test against a service exposed via NodePort: some high-latency requests show up from time to time. So, how do we track down and locate this problem?

2. Problem analysis

  • To reproduce the problem with a simple example, we wanted to narrow the scope and strip away unnecessary complexity. At first, there were too many components involved in the data flow between Vegeta and the Kubernetes Pods to tell whether this was a deeper network problem, so we needed to subtract components one by one:

[Figure: the request path from Vegeta, through the NodePort and NAT on a kube node, to the Pod on the overlay network]

  • The Vegeta client initiates TCP requests to one of the kube nodes in the cluster. The Kubernetes cluster in our data center uses an overlay network (running on top of the existing data center network), which encapsulates the overlay network's IP packets inside the data center's IP packets. When a request reaches the first kube node, it performs NAT, translating the kube node's IP and port into the overlay network address, i.e. the IP and port of the Pod running the application; for the response, the corresponding reverse translation (SNAT/DNAT) takes place. This is a complex system that maintains a large amount of mutable state, constantly updated as services are deployed.
  • When we first ran stress tests with Vegeta, we saw the latency during the TCP handshake (between SYN and SYN-ACK). To strip away the complexity introduced by HTTP and Vegeta, we used hping3 to send SYN packets, observed whether the response packets were delayed, then closed the connection, and filtered out the packets whose latency exceeded 100ms. This reproduces the problem more simply than Vegeta's layer-7 stress test, and is similar to exposing the service to a SYN flood. The following log shows the result of sending TCP SYN/SYN-ACK packets to port 30927 of a kube node at 10ms intervals and filtering out the slow responses:
theojulienne@shell ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms
len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms
  • Judging from the sequence numbers and timings in the log, the first thing to observe is that this latency is not a one-off event; it often occurs in clusters, as if a backlog of requests were finally being processed in one go.
  • Next, we wanted to pinpoint which component might be at fault. Was it kube-proxy's NAT rules, which run to hundreds of lines? Or was the IPIP tunnel or a similar network component performing poorly? One way to troubleshoot is to test each step in the system. What happens if we remove the NAT rules and the firewall logic and use only the IPIP tunnel?

[Figure: simplified path with the NAT rules and firewall logic removed, leaving only the IPIP tunnel between nodes]

  • If you are on a kube node, Linux lets you talk to the Pod directly, which is very simple:
theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms
len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms
  • As the results show, the problem is still there, which rules out iptables and NAT. Is TCP itself the problem? Let's see what happens with ICMP requests:
theojulienne@kube-node-client ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms
len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms
len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms
len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms
len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms
len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms
len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms
  • The results show that ICMP still reproduces the problem. Is the IPIP tunnel the cause? Let's simplify the problem further:

[Figure: IPIP tunnel removed, testing plain node-to-node communication]

  • So, will any communication between these nodes trigger the problem?
theojulienne@kube-node-client ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms
len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms
len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms
len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms
len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms
  • Behind all this complexity, the simple conclusion is that any network communication between two kube nodes, including plain ICMP, exhibits the problem. If the target node is "abnormal" (some nodes are worse than others, with higher latency and more frequent occurrences), we still see similar delays when the problem occurs.
  • The question now is: clearly the problem does not occur on all machines, so why does it only appear on the kube-node servers? Does it occur when the kube node is the sender or the receiver of the request? Fortunately it is now easy to narrow this down: we used a machine outside the cluster as the sender and the same "known bad" machine as the target, and found that requests in this direction still exhibit the problem:
theojulienne@shell ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms
len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms
  • Then we repeated the test, this time sending requests from the kube node to an external node:
theojulienne@kube-node-client ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms
  • Looking at the latency in packet captures gives more information. Specifically, latency is observed on the sending side (below), while the receiving server sees no latency (above). Note the Delta column in the figure (in seconds):

[Figure: packet captures from both ends; note the Delta column (in seconds)]

  • In addition, by looking at the ordering of TCP and ICMP packets at the receiving end (based on sequence IDs), we found that ICMP packets always arrive in the order they were sent, but their delivery times are irregular, while the sequence IDs of TCP packets are sometimes interleaved and part of them stall. In particular, if you track the ports on which SYN packets are sent/received, those ports are not sequential on the receiving end even though they are sequential on the sending end.
  • The NICs in our servers, like those used in most data centers, have a subtle difference in how they handle TCP and ICMP packets. When a packet arrives, the NIC hashes it per connection and tries to distribute different connections across different receive (RX) queues, assigning (roughly) one CPU core to each queue. For TCP packets, this hash covers both the source IP and port and the destination IP and port; in other words, the hash is very likely to differ per connection. For ICMP packets there are no ports, so the hash covers only the source and destination IPs, which explains the observation above, as the toy sketch below illustrates.
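  • A toy illustration only (explicitly not the NIC's real RSS/Toeplitz hash, and the queue count below is a made-up example): TCP hashes the full 4-tuple, so connections spread across RX queues, while ICMP has no ports, so all traffic between the same two hosts lands on one queue and therefore one CPU core:
import zlib

NUM_RX_QUEUES = 8  # hypothetical queue count for illustration

def rx_queue_tcp(src_ip, sport, dst_ip, dport):
    # TCP: the hash covers the full 4-tuple, so each connection can differ.
    key = ("%s:%d->%s:%d" % (src_ip, sport, dst_ip, dport)).encode()
    return zlib.crc32(key) % NUM_RX_QUEUES

def rx_queue_icmp(src_ip, dst_ip):
    # ICMP: no ports, so the hash covers only the two IPs -- every ICMP
    # packet between the same pair of hosts maps to the same queue.
    key = ("%s->%s" % (src_ip, dst_ip)).encode()
    return zlib.crc32(key) % NUM_RX_QUEUES

if __name__ == "__main__":
    for sport in (40000, 40001, 40002):
        print("TCP from port %d -> RX queue %d"
              % (sport, rx_queue_tcp("172.16.33.44", sport, "172.16.47.27", 30927)))
    print("ICMP            -> RX queue %d"
          % rx_queue_icmp("172.16.33.44", "172.16.47.27"))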
  • Another new finding was that ICMP packets between the two hosts stalled for a period during which TCP packets between the same hosts were fine. This tells us that the RX queue hashing on the receiving NIC is involved, and makes it almost certain that the pause happens while processing RX packets on the receiving side, not on the sending side. This rules out transmission problems between the kube nodes, so we now know it is a stall in the packet-processing stage, on a kube node acting as the receiver.

3. How the Linux kernel processes network packets

  • To understand why the problem shows up on the receiving side of the kube node, let's look at how Linux handles network packets. In the simplest, most primitive implementation, after receiving a packet the network card raises an interrupt to the Linux kernel to say that a packet needs processing. The kernel stops whatever it is doing, context-switches to the interrupt handler, processes the packet, and then switches back to the previous task.

[Figure: the traditional interrupt-driven model, one hardware interrupt per packet]

  • Context switching is slow. That may have been acceptable for 10Mbit NICs in the 1990s, but many servers now have 10G NICs whose maximum packet rate can reach 15 million packets per second: on a small 8-core server this would mean millions of interrupts per second.
  • Many years ago, Linux added NAPI (the "New API") to replace the traditional approach, and modern NIC drivers use it to dramatically improve performance at high packet rates. At low rates, the kernel still accepts interrupts from the NIC as described above. Once enough packets arrive to cross a threshold, the kernel disables interrupts and starts polling the NIC to pick up packets in batches. This processing happens in a "softirq", also called a software-interrupt context. It occurs at the end of system calls and hardware interrupts, at moments when the kernel (rather than user space) is already running.

[Figure: NAPI: interrupts are disabled and the kernel polls the NIC, processing packets in batches in softirq context]

  • This approach is much faster than the traditional one, but it brings another problem. If the number of packets is so large that all CPU time is spent processing packets received from the NIC, user-space programs never get to actually drain those queues (for example by reading data from their TCP connections), and eventually the queues fill up and packets start to be dropped. To balance the time spent in user mode and kernel mode, the kernel limits the amount of packet processing done in a given softirq context by assigning it a "budget". Once the budget is exceeded, it wakes up a separate thread called "ksoftirqd" (you can see one per core in ps), which continues processing softirqs outside the normal syscall/interrupt path. This thread is scheduled by the standard process scheduler, so scheduling stays fair.

[Figure: when the softirq budget is exhausted, ksoftirqd takes over the remaining packet processing under the normal scheduler]

  • Tracing the path the Linux kernel uses to process network packets shows that this processing can indeed pause. If the interval between softirq processing calls grows, packets may sit in the NIC's RX queue for a while. This could be caused by a CPU core deadlocking, or by some slower task blocking the kernel from processing softirqs.
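  • As an aside, one way to see whether a core is hitting this budget limit is the per-CPU "time squeeze" counter in /proc/net/softnet_stat, which counts how many times net_rx_action stopped early because the budget or time slice ran out. A minimal Python sketch (the column layout is assumed to follow the common format, with time_squeeze as the third hex field; verify against your kernel's documentation):
def softnet_time_squeeze():
    # Each line of /proc/net/softnet_stat corresponds to one CPU.
    with open("/proc/net/softnet_stat") as f:
        for cpu, line in enumerate(f):
            fields = line.split()
            processed = int(fields[0], 16)
            dropped = int(fields[1], 16)
            squeezed = int(fields[2], 16)  # budget/time exhausted in net_rx_action
            if squeezed:
                print("cpu%-3d processed=%-10d dropped=%-6d time_squeeze=%d"
                      % (cpu, processed, dropped, squeezed))

if __name__ == "__main__":
    softnet_time_squeeze()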

4. Narrowing the problem down to a CPU core and a method

  • So far, we believe this kind of delay is plausible, and we seem to be observing signs that match it. The next step is to confirm the theory and understand what is actually causing the problem.
  • Let's take another look at the problematic network requests:
len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms
len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms
len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms
len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms
len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms
len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms
  • As discussed earlier, these ICMP packets are hashed to one specific NIC RX queue and processed by one specific CPU core. If we want to understand what the kernel is doing, we first need to know which CPU core that is, and how softirq and ksoftirqd process these packets; that will be very helpful for locating the problem. One quick way to see which cores service the NIC's queue interrupts is sketched below.
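  • For example, the per-CPU interrupt counts in /proc/interrupts show which cores have been servicing each of the NIC's RX/TX queue interrupts ("eth0" below is a placeholder for your interface name). A minimal sketch:
import sys

def nic_irq_cpus(nic="eth0"):
    with open("/proc/interrupts") as f:
        cpus = f.readline().split()              # header row: CPU0 CPU1 ...
        for line in f:
            if nic not in line:
                continue
            parts = line.split()
            irq = parts[0].rstrip(":")
            counts = [int(c) for c in parts[1:1 + len(cpus)]]
            desc = " ".join(parts[1 + len(cpus):])
            busiest = max(range(len(cpus)), key=lambda i: counts[i])
            print("IRQ %s (%s): busiest on %s (%d interrupts)"
                  % (irq, desc, cpus[busiest], counts[busiest]))

if __name__ == "__main__":
    nic_irq_cpus(sys.argv[1] if len(sys.argv) > 1 else "eth0")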
  • There are tools for tracing the running state of the Linux kernel in real time; here we can use bcc. bcc lets you write a small C program, attach it to an arbitrary kernel function, capture events, and stream them to a user-space Python program that summarizes them and returns the results to you. "Attaching to an arbitrary kernel function" is the tricky part, but the tooling is designed to be as safe as possible, precisely because it is meant for tracking down exactly this kind of production problem, which generally cannot be reproduced easily in a test or development environment.
  • We know the kernel is processing these ICMP ping packets, so let's hook the kernel's icmp_echo function, which takes an inbound ICMP "echo request" packet and sends back the ICMP "echo reply". We can identify these packets by the icmp_seq sequence number that hping3 displays. The code of this bcc script may look complicated, but broken down it is not that scary: the icmp_echo function is passed a pointer to the sk_buff (struct sk_buff *skb) containing the echo request, so we can dig into it, extract echo.sequence (the icmp_seq shown by hping3 above), and send it back to user space, along with the current process name and process id, as sketched below.
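  • The following is a minimal sketch in the spirit of that script, not the original code: it attaches a kprobe to icmp_echo (assuming that symbol is available on your kernel), reads the ICMP header through the skb's transport-header offset (field layout assumed per typical 4.x kernels), and reports the process context and icmp_seq for each echo request. It requires root and the bcc Python bindings:
import socket
from bcc import BPF

bpf_text = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/skbuff.h>
#include <linux/icmp.h>

struct event_t {
    u32 tgid;
    u32 pid;
    char comm[TASK_COMM_LEN];
    u16 seq;    /* still big-endian here; converted in user space */
};
BPF_PERF_OUTPUT(events);

/* the kprobe__ prefix makes bcc auto-attach this to the icmp_echo function */
int kprobe__icmp_echo(struct pt_regs *ctx, struct sk_buff *skb)
{
    struct event_t ev = {};
    u64 id = bpf_get_current_pid_tgid();
    ev.tgid = id >> 32;
    ev.pid = (u32)id;
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));

    /* transport_header is a u16 offset from skb->head on modern kernels */
    unsigned char *head;
    u16 off;
    struct icmphdr ih = {};
    bpf_probe_read(&head, sizeof(head), &skb->head);
    bpf_probe_read(&off, sizeof(off), &skb->transport_header);
    bpf_probe_read(&ih, sizeof(ih), head + off);
    ev.seq = ih.un.echo.sequence;

    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
"""

b = BPF(text=bpf_text)

def handle_event(cpu, data, size):
    ev = b["events"].event(data)
    print("%-7d %-7d %-15s %d" % (ev.tgid, ev.pid,
                                  ev.comm.decode("utf-8", "replace"),
                                  socket.ntohs(ev.seq)))

print("TGID    PID     PROCESS NAME    ICMP_SEQ")
b["events"].open_perf_buffer(handle_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        break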
  • When the kernel processes these packets, you can see the following results:
TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   prometheus      775
0       0       swapper/11      776
0       0       swapper/11      777
0       0       swapper/11      778
4512    4542   spokes-report-s  779
  • One thing to note about the process name: in a softirq context that runs at the end of a system call, the process that issued that system call shows up as the "process", even though it is really the kernel doing the processing in kernel context.
  • With this running, we can now correlate the stalled packets observed by hping3 with the process that handled them. A simple grep over the captured icmp_seq values gives the context of what happened just before these packets were processed. The packets matching the icmp_seq values shown by hping3 above are marked, together with the observed RTT values (the values in brackets are the RTTs we would have expected for requests that were filtered out because they came in under 50ms):
TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
76      76      ksoftirqd/11    2077 ** (5ms)
  • From these results: first, the packets were handled by the ksoftirqd/11 process, which conveniently tells us that this particular pair of machines hashes its ICMP packets to CPU core 11 on the receiver. Second, every time we see a stall, we always see some packets being processed in the softirq context of cadvisor's system call, after which ksoftirqd takes over and works through the backlog, and this corresponds exactly to the stalled packets we found.
  • The fact that cAdvisor always runs immediately before the stalled requests suggests that it may be related to the problem we are troubleshooting. Ironically, cAdvisor, which exists precisely to "analyze resource usage and performance characteristics of running containers", as its homepage says, is what triggered this performance issue. As with many things in the container world, these are relatively cutting-edge tools, and there are situations in which they can cause unexpected performance degradation.

5. What did cAdvisor do to cause the pause?

  • Now that we understand how the stall happens, the process that causes it, and the CPU core it happens on, we have a pretty good picture. For the kernel to block hard instead of scheduling ksoftirqd earlier, and given that we saw packets being processed in cAdvisor's softirq context, the likely explanation is that cAdvisor makes a slow syscall, and only after it completes can the rest of the queued packets be processed normally:

[Figure: a slow cAdvisor syscall holds up softirq processing; the backlog of packets is drained only after it returns]

  • This is just a theory; how do we verify that this is actually what happens? We can trace what is running on this CPU core throughout the process, catch the point where packets exceed the "budget" and ksoftirqd is woken up, and then look back to see what was running on the core. Think of it as taking an X-ray of the CPU every few milliseconds. It looks like this:

[Figure: sampling the CPU's call stack every millisecond around the point where ksoftirqd is woken up]

  • Conveniently, most of this is already implemented. perf record can sample a given CPU core at a specified frequency and produce call graphs in real time, covering both user space and the kernel. The recording can be processed with a fork of Brendan Gregg's FlameGraph tooling that preserves stack-trace ordering: we keep a single-line stack trace sampled every 1ms, and then grab the 100 samples (roughly 100ms) leading up to the point where ksoftirqd appears in the trace:
# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100
  • The results look like this (there are hundreds of similar traces):
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_r
  • That is a lot of output, but if you look carefully you may spot a fixed pattern: cAdvisor, then ksoftirqd. What does that mean? Each line is a trace sampled at one instant, with the functions on the call stack separated by semicolons. In the middle of the lines you can see that the syscall being made is read(): ...;do_syscall_64;sys_read;... So cAdvisor is spending a lot of time in the read() system call, related to the mem_cgroup* functions at the bottom of the stack. The stack traces cannot easily show what is being read, so we can use strace to see what cAdvisor is doing and find the system calls that take longer than 100ms:
theojulienne@kube-node-bad ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0\.[1-9]'
[pid 10436] <... futex resumed> )       = 0 <0.156784>
[pid 10432] <... futex resumed> )       = 0 <0.258285>
[pid 10137] <... futex resumed> )       = 0 <0.678382>
[pid 10384] <... futex resumed> )       = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> )       = 0 <0.104614>
[pid 10436] <... futex resumed> )       = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> )   = 0 <0.118113>
[pid 10382] <... pselect6 resumed> )    = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> )       = 0 <0.917495>
[pid 10436] <... futex resumed> )       = 0 <0.208172>
[pid 10417] <... futex resumed> )       = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>
  • At this point we can be pretty sure that the read() system call itself is very slow. Judging from what read() returns and the mem_cgroup context, those read() calls are reading memory.stat files, which describe memory usage and cgroup limits. cAdvisor polls this file to obtain resource-usage details for each container. We can check whether the problem is in the kernel or in cAdvisor by making the same call by hand:
theojulienne@kube-node-bad ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null

real    0m0.153s
user    0m0.000s
sys    0m0.152s
theojulienne@kube-node-bad ~ $
  • Since the slowness can be reproduced this way, it points to a "pathological" code path being triggered in the kernel rather than to cAdvisor itself.

6. Why is reading so slow?

  • At this stage it is much easier to find similar problems reported by others. As it turned out, this had already been reported to cAdvisor as an issue of excessive CPU usage; it just had not been noticed that the latency was also randomly affecting the network stack. In fact, some internal developers had noticed that cAdvisor was consuming more CPU than expected, but it did not appear to cause problems since our servers have plenty of spare CPU capacity, so the CPU usage had not been investigated.
  • The issue concerns the memory cgroup, which manages and accounts for memory usage within a namespace (container). When all processes in the cgroup exit, Docker releases the memory cgroup. However, "memory" is not just process memory: even though the processes' own memory usage is gone, the kernel also charges cached content such as dentries and inodes (directory and file metadata) to the memory cgroup. As that issue describes, "zombie" cgroups, cgroups that have no running processes and have been deleted, still hold a certain amount of memory (in our case these cached objects were directory metadata, but they could also be page cache or tmpfs).
  • Rather than walking every cached page when the cgroup is released, which could itself be very slow, the kernel lazily waits for that memory to be needed before reclaiming it; only when all of the pages have been reclaimed is the cgroup finally freed. In the meantime, these cgroups are still counted in the statistics.
  • From a performance point of view, this amortizes the huge cost of reclaiming everything at once by reclaiming page by page, making the initial cleanup fast at the price of keeping some cache in memory. That is usually fine: since the kernel eventually cleans up the cgroup when the last cached page is reclaimed, it is not a "leak". Unfortunately, the problem lies in how memory.stat performs its search. On some of our servers the kernel is still version 4.9, whose implementation of this is problematic, and our servers generally have a lot of memory, which means the final cache reclaim and zombie-cgroup cleanup can take a very long time.
  • It turned out that these nodes had huge numbers of zombie cgroups, and some had reads/stalls of more than a second. The temporary workaround for the cAdvisor issue was to immediately free the system-wide dentry/inode cache, which instantly eliminated the read latency; the network latency disappeared as well, because dropping the cache also releases the cached pages held by the "zombie" cgroups. This is not the final solution, but it confirmed the cause of the problem; a minimal sketch of the mitigation follows.
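  • A minimal sketch of that mitigation, assuming root access: writing "2" to /proc/sys/vm/drop_caches asks the kernel to reclaim reclaimable slab objects such as dentries and inodes, which also releases the ones pinned by zombie memory cgroups (a blunt instrument for verification, not a fix):
def drop_dentry_inode_cache():
    # Equivalent to: echo 2 > /proc/sys/vm/drop_caches (requires root).
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("2\n")

if __name__ == "__main__":
    drop_dentry_inode_cache()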
  • Newer kernel versions (4.19+) improve the performance of the memory.stat call, so after upgrading to such a kernel this is no longer a problem. In the meantime, we used existing tooling to detect nodes in the Kubernetes cluster with this problem and gracefully drain and restart them: the tooling detects latency high enough to trigger the issue and handles it with an ordinary restart. This bought breathing room in which to upgrade the operating systems and kernels of the remaining servers. A sketch of that kind of latency check follows.
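  • A hedged sketch of that kind of node check: time a read of the root memory.stat file and flag the node if it stalls. The path assumes the cgroup v1 layout shown earlier, and the 100ms threshold is an illustrative assumption:
import time

MEMORY_STAT = "/sys/fs/cgroup/memory/memory.stat"
THRESHOLD_SECONDS = 0.100  # flag reads slower than 100ms (illustrative)

def memory_stat_latency(path=MEMORY_STAT):
    # Time a single read of memory.stat, mirroring the manual `time cat` check.
    start = time.monotonic()
    with open(path) as f:
        f.read()
    return time.monotonic() - start

if __name__ == "__main__":
    latency = memory_stat_latency()
    status = "slow -- candidate for drain/restart" if latency > THRESHOLD_SECONDS else "ok"
    print("memory.stat read took %.1f ms (%s)" % (latency * 1000, status))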

7. Summary

  • Because this problem manifested as NIC RX queues stalling for hundreds of milliseconds, it caused both high latency on short-lived connections and latency in the middle of established connections (for example between a MySQL query and its response packets). Understanding and maintaining the performance of our most fundamental systems, such as Kubernetes, is critical to the reliability and speed of every service built on top of them.

Source: blog.csdn.net/Forever_wj/article/details/134948470