Cloud Native In-Depth Analysis: The Flow Path of Kubernetes Network Traffic

1. Kubernetes network requirements

  • The Kubernetes networking model defines a basic set of rules:
    • Pods in the cluster are able to communicate with any other Pod without using Network Address Translation (NAT);
    • A program running on a cluster node can communicate with any Pod on the same node without using network address translation (NAT);
    • Each Pod has its own IP address (IP-per-Pod), and any other Pod can access it through the same address.
  • These requirements do not limit the implementation to a certain solution. Instead, they describe properties of cluster networks in general terms, and to satisfy these constraints, the following challenges must be addressed:
    • How can I ensure containers in the same Pod behave as if they were on the same host?
    • Can Pods in the cluster access other Pods?
    • Can the pod access the service? Are services load balanced?
    • Can Pods receive traffic from outside the cluster?

2. How does the Linux network namespace work in a Pod?

  • Let's look at a main container running an application alongside a second container. In this example, there is a Pod with an nginx container and a busybox container:
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
    - name: container-1
      image: busybox
      command: ['/bin/sh', '-c', 'sleep 1d']
    - name: container-2
      image: nginx
  • When deployed, the following things happen:
    • The Pod gets its own network namespace on the node;
    • An IP address is assigned to the Pod and shared by both containers, which also share the same port space;
    • Both containers share the same network namespace and can see each other on localhost.
  • Network configuration happens quickly in the background. But let's take a step back: why are the steps above needed to run a container? In Linux, a network namespace is an independent, isolated logical space. You can think of network namespaces as slices carved out behind the physical network interface: each slice can be configured separately and has its own network rules and resources, including firewall rules, interfaces (virtual or physical), routes, and everything else related to networking.
  • The physical network interface holds the root network namespace:

  • You can also use the Linux network namespace to create independent networks. Each network is independent and will not communicate with other networks by default unless configured:

  • But in the end, the physical interface still has to handle all the real packets; every virtual interface is built on top of it. Network namespaces can be managed with ip netns, and the namespaces on a host can be listed with ip netns list.
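  • As a quick hands-on sketch (assuming root access on any Linux host, nothing Kubernetes-specific), you can create, inspect and remove a namespace by hand; a freshly created namespace contains only a loopback interface, and even that is down:
$ sudo ip netns add demo-ns                # create an isolated network namespace
$ sudo ip netns exec demo-ns ip addr       # run a command inside it

1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
$ sudo ip netns delete demo-ns             # remove it again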
  • Note that a created network namespace appears under /var/run/netns (Docker, however, does not follow this convention). For example, here are some namespaces on a Kubernetes node (the cni- prefix means the namespace was created by a CNI plugin):
$ ip netns list

cni-0f226515-e28b-df13-9f16-dd79456825ac (id: 3)
cni-4e4dfaac-89a6-2034-6098-dd8b2ee51dcd (id: 4)
cni-7e94f0cc-9ee8-6a46-178a-55c73ce58f2e (id: 2)
cni-7619c818-5b66-5d45-91c1-1c516f559291 (id: 1)
cni-3004ec2c-9ac2-2928-b556-82c7fb37a4d8 (id: 0)
  • When a Pod is created and the Pod is assigned to a node, the CNI will assign an IP address and connect the container to the network. If a Pod contains multiple containers, they will all be placed in the same namespace.
  • When a Pod is created, the container runtime creates a network namespace for the container:

  • The CNI is then responsible for assigning an IP address to the Pod:

  • Finally CNI connects the container to the rest of the network:

  • So what do the namespaces of the containers on the node look like? You can connect to a Kubernetes node via SSH and inspect them:
$ lsns -t net

        NS TYPE NPROCS   PID USER     NETNSID NSFS                           COMMAND
4026531992 net     171     1 root  unassigned /run/docker/netns/default      /sbin/init noembed norestore
4026532286 net       2  4808 65535          0 /run/docker/netns/56c020051c3b /pause
4026532414 net       5  5489 65535          1 /run/docker/netns/7db647b9b187 /pause
  • lsns is a command that lists all namespaces available on the host machine (Linux has several namespace types). So where is the nginx container? And what are those pause containers?

3. In the Pod, the pause container creates a network namespace

  • First list all namespaces on the node to see if the Nginx container can be found:
$ lsns
        NS TYPE   NPROCS   PID USER            COMMAND
# truncated output
4026532414 net         5  5489 65535           /pause
4026532513 mnt         1  5599 root            sleep 1d
4026532514 uts         1  5599 root            sleep 1d
4026532515 pid         1  5599 root            sleep 1d
4026532516 mnt         3  5777 root            nginx: master process nginx -g daemon off;
4026532517 uts         3  5777 root            nginx: master process nginx -g daemon off;
4026532518 pid         3  5777 root            nginx: master process nginx -g daemon off;
  • The nginx container appears in the mount (mnt), UNIX time-sharing (uts) and PID (pid) namespaces, but no network (net) namespace is listed for it. Unfortunately, lsns only shows the lowest PID for each namespace, but you can filter further based on the process ID.
  • Retrieve the Nginx container in all namespaces with the following command:
$ sudo lsns -p 5777

       NS TYPE   NPROCS   PID USER  COMMAND
4026531835 cgroup    178     1 root  /sbin/init noembed norestore
4026531837 user      178     1 root  /sbin/init noembed norestore
4026532411 ipc         5  5489 65535 /pause
4026532414 net         5  5489 65535 /pause
4026532516 mnt         3  5777 root  nginx: master process nginx -g daemon off;
4026532517 uts         3  5777 root  nginx: master process nginx -g daemon off;
4026532518 pid         3  5777 root  nginx: master process nginx -g daemon off;
  • The pause process shows up again, and it is the one holding the network namespace. What's going on? Every Pod in the cluster has an additional hidden container running in the background, called the pause container. List the running containers on the node and find the pause containers:
$ docker ps | grep pause

fa9666c1d9c6   k8s.gcr.io/pause:3.4.1  "/pause"  k8s_POD_kube-dns-599484b884-sv2js…
44218e010aeb   k8s.gcr.io/pause:3.4.1  "/pause"  k8s_POD_blackbox-exporter-55c457d…
5fb4b5942c66   k8s.gcr.io/pause:3.4.1  "/pause"  k8s_POD_kube-dns-599484b884-cq99x…
8007db79dcf2   k8s.gcr.io/pause:3.4.1  "/pause"  k8s_POD_konnectivity-agent-84f87c…
  • As you can see, every Pod on the node has a corresponding pause container, and it is this pause container that creates and holds the network namespace. The namespace itself is created by the underlying container runtime, usually containerd or CRI-O, before the Pod's containers are started; the runtime does this automatically, so there is no need to run ip netns manually to create it.
  • The pause container contains very little code and goes to sleep immediately after deployment. However, it is essential and plays a vital role in the Kubernetes ecosystem.
  • When a Pod is created, the container runtime first creates the network namespace, holding only the sleeping pause container.

  • All other containers in the Pod join the network namespace created by the pause container:

  • At this point, the CNI assigns an IP address and connects the container to the network:

  • So what is the sleeping container for? To understand its purpose, imagine a Pod with two containers, as in the previous example, but without the pause container. Once nginx starts, the CNI plugin would have to:
    • Make the busybox container join nginx's network namespace;
    • Assign the IP address;
    • Connect the containers to the network.
  • What if nginx crashed? The CNI plugin would have to run through all of those steps again, and networking would break for both containers. Since the sleeping container is very unlikely to fail, creating the network namespace from it is the safer and more robust option: if one container in the Pod crashes, the rest can still answer network requests.
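  • You can reproduce the pause pattern with plain Docker; a sketch that assumes Docker is available and that the pause image can be pulled. A pause container is started first, then other containers attach to its network namespace with --network container:<name>:
$ docker run -d --name pause k8s.gcr.io/pause:3.4.1
$ docker run -d --name web --network container:pause nginx
$ docker run -d --name shell --network container:pause busybox sleep 1d
$ docker exec shell netstat -ln        # nginx's port 80 is visible from the busybox container
  • If the web container crashes and is restarted, the namespace held by the pause container survives and the shell container keeps its networking untouched.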

4. Assign an IP address to the Pod

  • Now we know that the Pod, and both of its containers, use the same IP address. How is that configured? Inside the Pod's network namespace, an interface is created and an IP address is assigned to it.
  • Let's verify it:
    • First, find the IP address of the Pod:
$ kubectl get pod multi-container-pod -o jsonpath='{.status.podIP}'

10.244.4.40
    • Next, find the corresponding network namespace. Network namespaces live on the node, so you first need access to the cluster node: if you are running minikube, use minikube ssh; if the cluster runs on a cloud provider, there should be some way to reach the node via SSH. Once inside, find the most recently created network namespace:
$ ls -lt /var/run/netns

total 0
-r--r--r-- 1 root root 0 Sep 25 13:34 cni-0f226515-e28b-df13-9f16-dd79456825ac
-r--r--r-- 1 root root 0 Sep 24 09:39 cni-4e4dfaac-89a6-2034-6098-dd8b2ee51dcd
-r--r--r-- 1 root root 0 Sep 24 09:39 cni-7e94f0cc-9ee8-6a46-178a-55c73ce58f2e
-r--r--r-- 1 root root 0 Sep 24 09:39 cni-7619c818-5b66-5d45-91c1-1c516f559291
-r--r--r-- 1 root root 0 Sep 24 09:39 cni-3004ec2c-9ac2-2928-b556-82c7fb37a4d8
    • In the example, that is cni-0f226515-e28b-df13-9f16-dd79456825ac. You can then run commands inside that namespace with ip netns exec:
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip a

# output truncated
3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 16:a4:f8:4f:56:77 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.4.40/32 brd 10.244.4.40 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::14a4:f8ff:fe4f:5677/64 scope link
       valid_lft forever preferred_lft forever
  • This IP is the Pod's IP address! Now find the other end of the veth pair by looking for the index 12 from the @if12 suffix:
$ ip link | grep -A1 ^12

12: vethweplb3f36a0@if16: mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
    link/ether 72:1c:73:d9:d9:f6 brd ff:ff:ff:ff:ff:ff link-netnsid 1
  • You can also verify that the Nginx container is listening for HTTP traffic from within this namespace:
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac netstat -lnp

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      692698/nginx: master
tcp6       0      0 :::80                   :::*                    LISTEN      692698/nginx: master
  • If you don't have access to the worker nodes in your cluster via SSH, you can use kubectl exec to get a shell to the busybox container and use the ip and netstat commands directly inside.
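  • For example, a sketch using the busybox container from the manifest above (busybox ships with the ip and netstat applets):
$ kubectl exec multi-container-pod -c container-1 -- ip addr show eth0    # shows the same 10.244.4.40 address
$ kubectl exec multi-container-pod -c container-1 -- netstat -ln          # shows nginx listening on port 80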

5. View the traffic from Pod to Pod in the cluster

  • There are two possible scenarios for Pod-to-Pod communication:
    • Pod traffic is destined for Pods on the same node;
    • Pod traffic is destined for Pods on different nodes.
  • The whole workflow relies on virtual interface pairs and bridges. For a Pod to communicate with other Pods, it must first reach the node's root namespace. The Pod and the root namespace are connected by a virtual Ethernet pair; these virtual interface devices (the "v" in veth) act as a tunnel between the two namespaces: one end of the veth device is placed in the Pod's namespace, and the other end is connected to the root namespace.

  • CNI can perform these actions, but they can also be done manually:
$ ip link add veth1 netns Pod-namespace type veth peer name veth2
  • Now the Pod's namespace has a tunnel to the root namespace. Every newly created Pod on the node gets such a veth pair. The remaining steps are to assign addresses to the Ethernet devices and to configure the default route.
  • Set up the veth1 interface in the Pod's namespace as follows:
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip addr add 10.244.4.40/24 dev veth1
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip link set veth1 up
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip route add default via 10.244.4.40
  • On the node, configure the other end of the pair, veth2:
$ ip addr add 169.254.132.141/16 dev veth2
$ ip link set veth2 up
  • Existing veth pairs can be inspected the same way as before. In the Pod's namespace, retrieve the index suffix of the eth0 interface:
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip link show type veth

3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 16:a4:f8:4f:56:77 brd ff:ff:ff:ff:ff:ff link-netnsid 0
  • In this case the suffix is @if12, so on the node you can find the other end with ip link | grep -A1 ^12 (or simply scroll to where the target is):
$ ip link show type veth

# output truncated
12: cali97e50e215bd@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-0f226515-e28b-df13-9f16-dd79456825ac
  • You can also use the command ip -n cni-0f226515-e28b-df13-9f16-dd79456825ac link show type veth.
  • Note the interface names 3: eth0@if12 and 12: cali97e50e215bd@if3: from the Pod's namespace, the eth0 interface (index 3) is connected to interface 12 in the root namespace, hence @if12; on the other end, the root namespace's interface 12 is connected to interface 3 in the Pod's namespace, hence @if3. Next, a bridge connects the node-side ends of these veth pairs.
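  • Another way to confirm which interfaces form a pair is to ask the veth driver for its peer index; a sketch that assumes the ethtool utility is installed on the node:
$ ethtool -S cali97e50e215bd | grep peer_ifindex

     peer_ifindex: 3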

6. The Pod network namespace is connected to the Ethernet bridge

  • A bridge aggregates the virtual interfaces sitting in the root namespace, allowing traffic to flow between the veth pairs and across the shared root namespace. The principle behind it: an Ethernet bridge sits at layer 2 of the OSI model and can be viewed as a virtual switch that accepts connections from different namespaces and interfaces; it is what connects the networks available on a node.
  • Thus, a bridge can be used to connect the two ends: the veth from one Pod's namespace is connected, through the bridge, to the veth of another Pod on the same node.
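  • Continuing the manual example from the previous section, the bridge can be created and the node-side veth attached to it as follows (a sketch; cni0 is only an illustrative bridge name):
$ ip link add cni0 type bridge        # create the bridge in the root namespace
$ ip link set cni0 up
$ ip link set veth2 master cni0       # attach the node-side end of the Pod's veth pair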

7. Track Pod-to-Pod traffic on the same node

  • Suppose two Pods are on the same node and Pod-A sends a message to Pod-B. Since the destination is not in Pod-A's namespace, Pod-A sends the packet out of its default interface eth0, which is bound to one end of the veth pair and acts as a tunnel. This way, the packet is forwarded to the root namespace on the node.

  • The Ethernet bridge acts as a virtual switch and needs the MAC address of the target Pod-B to work:

  • The ARP protocol solves this problem. When the frame arrives at the bridge, an ARP broadcast is sent to all connected devices, asking which one holds Pod-B's IP address.

  • The device holding Pod-B's IP replies with its MAC address, and the answer is stored in the bridge's ARP cache (its lookup table).

  • Once the mapping between the IP address and the MAC address is stored, the bridge looks it up in the table and forwards the packet to the correct endpoint. The packet reaches Pod-B's veth in the root namespace and, from there, immediately arrives at the eth0 interface inside Pod-B's namespace.

  • So far, the communication between Pod-A and Pod-B is successful.
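  • You can watch this process on the node: the neighbour table holds the IP-to-MAC mappings learned via ARP, and the bridge's forwarding database records which port each MAC address sits behind. A sketch (the bridge name cni0 depends on the CNI plugin in use):
$ ip neigh show               # ARP / neighbour cache of the root namespace
$ bridge fdb show br cni0     # MAC addresses learned on each bridge port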

8. Track Pod-to-Pod communication on different nodes

  • Communication between Pods on different nodes requires additional hops. The first steps are the same as for Pod-to-Pod traffic on the same node, up to the point where the packet reaches the root namespace and must be sent on to Pod-B.

  • When the destination IP is not in the local network, the packet is forwarded to the node's default gateway (the egress gateway), which usually sits behind the physical interface eth0 that connects the node to the network.

  • ARP resolution for Pod-B does not happen at this point, because the source IP and the destination IP are not on the same network segment; this check is done with a bitwise operation. When the destination IP is not in the current network segment, the packet is forwarded to the node's default gateway.
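  • You can ask the kernel directly which path it would pick for a given destination with ip route get; a sketch in which the addresses are purely illustrative:
$ ip route get 10.244.5.12

10.244.5.12 via 192.168.1.1 dev eth0 src 192.168.1.10 uid 0
    cache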

9. How the bitwise AND operation works

  • To determine where to forward a packet, the source node performs a bitwise operation, also known as an AND operation.
  • The rule of the bitwise AND is simple: the result is 1 only when both bits are 1, otherwise it is 0:
0 AND 0 = 0
0 AND 1 = 0
1 AND 0 = 0
1 AND 1 = 1
  • If the source node's IP is 192.168.1.1 with a /24 subnet mask and the destination IP is 172.16.1.1/16, the bitwise AND shows that they are on different network segments. This means that the destination IP is not on the same network as the packet's source, and the packet will be forwarded through the default gateway.
  • The AND is performed on the binary 32-bit addresses. First, work out the source network and the destination network:
Type              Binary                                Converted
Src. IP Address   11000000.10101000.00000001.00000001  192.168.1.1
Src. Subnet Mask  11111111.11111111.11111111.00000000  255.255.255.0 (/24)
Src. Network      11000000.10101000.00000001.00000000  192.168.1.0
Dst. IP Address   10101100.00010000.00000001.00000001  172.16.1.1
Dst. Subnet Mask  11111111.11111111.00000000.00000000  255.255.0.0 (/16)
Dst. Network      10101100.00010000.00000000.00000000  172.16.0.0
  • Next, AND the destination IP with the source node's subnet mask and compare the result with the source network:
Type              Binary                                Converted
Dst. IP Address   10101100.00010000.00000001.00000001  172.16.1.1
Src. Subnet Mask  11111111.11111111.11111111.00000000  255.255.255.0 (/24)
Network Result    10101100.00010000.00000001.00000000  172.16.1.0
  • The result of the operation is 172.16.1.0, which is not equal to 192.168.1.0 (the source node's network), so the source and destination IP addresses are not on the same network. If the destination IP were 192.168.1.2 instead, i.e. in the same subnet as the sender, the AND operation would yield the node's local network:
Type              Binary                                Converted
Dst. IP Address   11000000.10101000.00000001.00000010  192.168.1.2
Src. Subnet Mask  11111111.11111111.11111111.00000000  255.255.255.0 (/24)
Network Result    11000000.10101000.00000001.00000000  192.168.1.0
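  • The same AND operations can be reproduced in a shell by writing the addresses as 32-bit hexadecimal values; a throw-away sketch, not something any Kubernetes component runs:
$ src=0xC0A80101; dst=0xAC100101; mask=0xFFFFFF00     # 192.168.1.1, 172.16.1.1, 255.255.255.0
$ printf '%08X\n' $(( src & mask ))                   # source network: C0A80100 = 192.168.1.0
C0A80100
$ printf '%08X\n' $(( dst & mask ))                   # result network: AC100100 = 172.16.1.0, not equal
AC100100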
  • After the bitwise comparison, ARP looks up the MAC address of the default gateway in its lookup table. If an entry exists, the packet is forwarded immediately; otherwise, a broadcast is sent first to discover the gateway's MAC address.
  • Now, the packet is routed to the default interface of another node, called Node-B:

  • On Node-B the same steps now run in reverse: the packet arrives in Node-B's root namespace and reaches the bridge, where ARP resolution occurs.

  • The routing system will return the MAC address of the interface connected to Pod-B:

  • The bridge forwards the frame through Pod-B's veth device, and the frame reaches Pod-B's namespace.

  • By now, everyone should be familiar with how traffic flows between Pods.

10. Container Network Interface - CNI

  • The Container Network Interface (CNI) is mainly concerned with networking on the current node.

  • Think of the CNI as a set of rules that must be followed to address Kubernetes networking needs. Several implementations are available: Calico, Cilium, Flannel, Weave Net and other network plugins, all following the same CNI standard.
  • If there is no CNI, you need to manually complete the following operations:
    • Create an interface;
    • Create a veth pair;
    • Set the network namespace;
    • Set static routing;
    • Configure the ethernet bridge;
    • assign IP addresses;
    • Create NAT rules;
    • And tons of other things.
  • This does not include all of the similar operations that need to be done when deleting or restarting Pods.
  • A CNI must support four different operations:
    • ADD - add a container to the network;
    • DEL - delete a container from the network;
    • CHECK - returns an error if there is a problem with the container's networking;
    • VERSION - Displays the version of the plugin.
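  • A CNI plugin is just an executable: it is invoked with the operation in the CNI_COMMAND environment variable and the JSON network configuration on stdin. As a sketch, you can call one by hand, assuming the standard reference plugins are installed under /opt/cni/bin (the exact output depends on the plugin and its version):
$ echo '{"cniVersion": "0.3.1", "name": "test", "type": "loopback"}' | CNI_COMMAND=VERSION /opt/cni/bin/loopback

{"cniVersion":"0.3.1","supportedVersions":["0.1.0","0.2.0","0.3.0","0.3.1","0.4.0"]}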
  • So let's take a look: how does the CNI work? When a Pod is scheduled to a specific node, the kubelet itself does not initialize the network; instead, it delegates this task to the CNI plugin, passing it the configuration in JSON format. You can go to the /etc/cni/net.d folder on the node and view the current CNI configuration file with the following command:
$ cat 10-calico.conflist

{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "datastore_type": "kubernetes",
      "mtu": 0,
      "nodename_file_optional": false,
      "log_level": "Info",
      "log_file_path": "/var/log/calico/cni/cni.log",
      "ipam": {
        "type": "calico-ipam",
        "assign_ipv4": "true",
        "assign_ipv6": "false"
      },
      "container_settings": {
        "allow_ip_forwarding": false
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "k8s_api_root": "https://10.96.0.1:443",
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "bandwidth",
      "capabilities": {
        "bandwidth": true
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
  • Each CNI plugin uses a different type of network configuration. For example, Calico uses a BGP-based layer 3 network to connect Pods, and Cilium uses an eBPF-based overlay network from layer 3 to layer 7. Like Calico, Cilium also supports configuring network policies to limit traffic. So which one should you use? There are two main types of CNIs:
    • The first category uses a basic network setup (also known as a flat network) and allocates Pod IP addresses from the cluster's IP pool; this can quickly exhaust the available addresses and become a burden;
    • The other category uses an overlay network. Simply put, an overlay network is a secondary network built on top of the main (underlay) network. It works by encapsulating packets coming from the underlay network that are destined for a Pod on another node. A popular technology for overlay networks is VXLAN, which can tunnel L2 domains over an L3 network.
  • So which is better? There is no single answer; it depends on your needs. Are you building large clusters with tens of thousands of nodes? Perhaps an overlay network is better. Do you care about easier configuration and about being able to inspect network traffic without losing that ability in a complex encapsulated network? Then a flat network is better for you.
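  • As an illustration of the overlay idea, a VXLAN device that tunnels L2 frames over the node's L3 network can be created by hand; a sketch not tied to any particular CNI plugin:
$ ip link add vxlan0 type vxlan id 42 dev eth0 dstport 4789    # VNI 42, encapsulated in UDP port 4789
$ ip link set vxlan0 up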

11. Check the traffic from Pod to Service

  • Pods in Kubernetes are dynamic: the IP address assigned to a Pod is not static, it is ephemeral and changes whenever Pods are created or deleted. Kubernetes Services solve this problem by providing a stable mechanism for reaching a set of Pods.

  • By default, when a Service is created in Kubernetes it is assigned a virtual IP. Inside the Service, a selector associates it with the target Pods. What happens when a Pod is removed or added? The Service's virtual IP stays the same, yet traffic still reaches the newly created Pods without any intervention. In other words, a Kubernetes Service is similar to a load balancer. But how does it work?
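  • For reference, this is roughly what such a Service definition looks like; a sketch in which the name web-svc and the app: web selector label are made up and assume the target Pods carry that label:
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: web            # traffic is sent to Pods carrying this label
  ports:
    - port: 80          # port exposed on the Service's virtual IP
      targetPort: 80    # container port on the selected Pods
EOF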

12. Use Netfilter and Iptables to intercept and rewrite traffic

  • Services in Kubernetes are built on two components of the Linux kernel's networking stack: Netfilter and iptables. Netfilter is a framework for configuring packet filtering, NAT and port-forwarding rules and for managing traffic in the network; it can also screen and block unauthorized access. iptables, on the other hand, is a userspace program used to configure the IP packet-filtering rules of the Linux kernel firewall.
  • iptables is implemented on top of Netfilter modules; filter rules can be modified on the fly with the iptables CLI and are attached at Netfilter's hook points. The rules are organized in tables, which contain chains for processing network packets. Different protocols use different kernel modules and tools: when iptables is mentioned it usually refers to IPv4; for IPv6 the tool is ip6tables.
  • There are five chains in iptables, each of which maps directly to Netfilter's hooks. From the perspective of iptables, they are:
    • PREROUTING
    • INPUT
    • FORWARD
    • OUTPUT
    • POSTROUTING
  • They map to Netfilter hooks accordingly:
    • NF_IP_PRE_ROUTING
    • NF_IP_LOCAL_IN
    • NF_IP_FORWARD
    • NF_IP_LOCAL_OUT
    • NF_IP_POST_ROUTING
  • When a packet arrives, depending on the phase it is in, a Netfilter hook is "triggered" which executes specific iptables filtering rules:

  • It may look complicated, but there is no need to worry: Kubernetes abstracts all of this behind the Service, and a simple YAML definition sets these rules up automatically.
  • To look at the iptables rules, you can connect to the node and run:
$ iptables-save
  • There are also tools for visualizing the iptables chains on a node (see, for example, the "Visualizing iptables chains on a GKE node" diagram).

  • So far we have seen how Pods communicate with each other, both on the same node and across nodes. For traffic from a Pod to a Service, the first half of the path is the same.

  • When a request goes from Pod-A to Pod-B and Pod-B sits "behind" a Service, the journey changes slightly. The original request leaves the eth0 interface of Pod-A's namespace and reaches the bridge in the root namespace through the veth pair. On the bridge, the host performs the same bitwise comparison as in the Pod-to-Pod case; since the Service's virtual IP is not part of the node's CIDR, the packet is sent towards the default gateway (if the gateway's MAC address is not yet in the lookup table, ARP resolution happens first). Now the magic happens: before the packet is routed off the node, Netfilter's NF_IP_PRE_ROUTING hook is triggered and executes iptables rules that rewrite the destination IP address of Pod-A's packet (DNAT).
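  • You can see these rewrite rules on a node: kube-proxy, in iptables mode, programs chains named KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-* in the nat table. A sketch of how to list them (the hashed chain names and the addresses below are only examples and differ per cluster):
$ sudo iptables-save -t nat | grep -E 'KUBE-(SERVICES|SVC|SEP)' | head -n 3

-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.96.12.34/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-EXAMPLEHASH
-A KUBE-SEP-EXAMPLEHASH -p tcp -m tcp -j DNAT --to-destination 10.244.4.41:80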

  • The Service's virtual IP is rewritten to the IP address of Pod-B, and from that point on the packet follows the same routing process as Pod-to-Pod communication.

  • After the packet has been rewritten, the communication proceeds Pod to Pod. The rewriting relies on a kernel feature called conntrack (connection tracking): when Pod-B sends back a response, conntrack associates the packet with the existing connection and recalls its origin. NAT depends heavily on conntrack; without connection tracking, the kernel would not know where to deliver the packet containing the response. With conntrack, the return path of the packet is easily given the reverse of the same source or destination NAT changes. The second half of the communication is the mirror image of the first: Pod-B has received and processed the request and now sends data back to Pod-A.
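  • The conntrack table can be inspected directly on the node if the conntrack CLI from conntrack-tools is installed; a sketch in which the addresses are illustrative:
$ sudo conntrack -L -d 10.96.12.34 | head -n 1

tcp      6 86397 ESTABLISHED src=10.244.4.40 dst=10.96.12.34 sport=43210 dport=80 src=10.244.5.12 dst=10.244.4.40 sport=80 dport=43210 [ASSURED] mark=0 use=1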

13. Check the response from the service

  • Pod-B sends a response, setting its IP address as the source address and Pod-A's IP address as the destination address:

  • When the packet arrives at the interface of the node where Pod-A resides, another NAT occurs:

  • At this point conntrack kicks in: the iptables rules perform SNAT and rewrite the source IP address from Pod-B's IP back to the virtual IP of the original Service.

  • As far as Pod-A is concerned, the response comes from the Service rather than from Pod-B. The rest is the same as before: once the SNAT is done, the packet reaches the bridge in the root namespace and is forwarded to Pod-A through the veth pair.

Origin blog.csdn.net/Forever_wj/article/details/131450305