The Kubernetes networking model defines a basic set of rules:
Pods in the cluster are able to communicate with any other Pod without using Network Address Translation (NAT);
A program running on a cluster node can communicate with any Pod on the same node without using network address translation (NAT);
Each Pod has its own IP address (IP-per-Pod), and any other Pod can access it through the same address.
These requirements do not constrain the implementation to a particular solution. Instead, they describe the properties of the cluster network in general terms. To satisfy these constraints, the following challenges must be addressed:
How can I ensure containers in the same Pod behave as if they were on the same host?
Can Pods in the cluster access other Pods?
Can a Pod reach a Service? Are Services load balanced?
Can Pods receive traffic from outside the cluster?
2. How does the Linux network namespace work in a Pod?
Let's look at a main container running an application, accompanied by a sidecar container. In this example, the Pod holds an nginx container and a busybox container:
Each Pod gets its own network namespace on the node;
The Pod is assigned an IP address, and the two containers share its port space;
Both containers share the same network namespace and can see each other on localhost.
Network configuration happens quickly behind the scenes, but let's take a step back and ask why these actions are needed to run a container. In Linux, a network namespace is an independent, isolated logical space. You can think of a network namespace as a slice carved out of the host's networking stack: each slice can be configured separately and has its own network rules and resources, including firewall rules, interfaces (virtual or physical), routes, and everything else related to networking.
The physical network interface holds the root network namespace:
You can also use the Linux network namespace to create independent networks. Each network is independent and will not communicate with other networks by default unless configured:
But in the end, a physical interface is still required to handle all the real packets; all virtual interfaces are created on top of it. Network namespaces can be managed with ip netns, and the namespaces on a host can be listed with ip netns list.
It should be noted that the created network namespace will appear under /var/run/netns, but Docker does not follow this rule. For example, here are some namespaces for Kubernetes nodes (the cni- prefix means that the namespace is created by the CNI plugin):
$ ip netns list
cni-0f226515-e28b-df13-9f16-dd79456825ac (id:3)
cni-4e4dfaac-89a6-2034-6098-dd8b2ee51dcd (id:4)
cni-7e94f0cc-9ee8-6a46-178a-55c73ce58f2e (id:2)
cni-7619c818-5b66-5d45-91c1-1c516f559291 (id:1)
cni-3004ec2c-9ac2-2928-b556-82c7fb37a4d8 (id:0)
When a Pod is created and the Pod is assigned to a node, the CNI will assign an IP address and connect the container to the network. If a Pod contains multiple containers, they will all be placed in the same namespace.
When a Pod is created, the container runtime creates a network namespace for the container:
The CNI is then responsible for assigning an IP address to the Pod:
Finally CNI connects the container to the rest of the network:
So what happens when you list the namespaces of the containers on the node? It is possible to connect to a Kubernetes node via SSH and view the namespace:
$ lsns -t net
        NS TYPE NPROCS   PID USER  NETNSID NSFS                           COMMAND
4026531992 net     171     1 root  unassigned /run/docker/netns/default   /sbin/init noembed norestore
4026532286 net       2  4808 65535       0 /run/docker/netns/56c020051c3b /pause
4026532414 net       5  5489 65535       1 /run/docker/netns/7db647b9b187 /pause
lsns is a command that lists all the namespaces available on a host. Linux has several namespace types, so where is the nginx container? And what are those pause containers?
3. In the Pod, the pause container creates a network namespace
First, list all the namespaces on the node and look for the nginx container:
The nginx container is listed in the mount (mnt), UNIX time-sharing (uts), and PID (pid) namespaces, but not in a network (net) namespace. Unfortunately, lsns shows only the lowest PID for each namespace, but you can filter further based on that process ID.
Retrieve all the namespaces the nginx process belongs to by filtering lsns on its process ID:
The pause process shows up again, holding the network namespace. What is going on? Every Pod in the cluster has an additional hidden container running in the background, called the pause container. List the running containers on a node and find the pause container:
As you can see, every Pod on the node has a corresponding pause container, which creates and maintains the network namespace. The network namespace itself is created by the underlying container runtime, usually containerd or CRI-O: just before Pods are deployed and containers created, the runtime creates the namespace automatically, so there is no need to run ip netns by hand.
The pause container contains very little code and goes to sleep immediately after deployment. However, it is essential and plays a vital role in the Kubernetes ecosystem.
When a Pod is created, the container runtime first creates the network namespace, holding the sleeping pause container:
All other containers in the Pod join the network namespace created by the pause container:
At this point, the CNI assigns an IP address and connects the container to the network:
What is the point of a container that just sleeps? To understand its purpose, imagine a Pod with two containers, as in the previous example, but without a pause container. Once the containers start, the CNI would:
Make the busybox container join the network namespace created earlier;
Assign the IP address;
Connect the containers to the network.
What if nginx crashes? The CNI would have to go through all those steps again, and networking would break for both containers. Since the sleeping container is unlikely to ever fail, letting it create the network namespace is a safer and more robust option: if one container in the Pod crashes, the rest can still answer network requests.
4. Assign an IP address to the Pod
Now we know that the Pod and both of its containers share one IP address. How is that address configured? Inside the Pod's network namespace, an interface is created and an IP address is assigned to it.
Let's verify that.
First, find the IP address of the Pod:
$ kubectl get pod multi-container-pod -o jsonpath={.status.podIP}
10.244.4.40
Next, find the relevant network namespace. Since namespaces exist on the node itself, you need to access the cluster node first. If you are running minikube, use minikube ssh to reach the node; if the cluster runs at a cloud provider, there should be a way to SSH into the node. Once inside, find the most recently created network namespace:
In the example, this is cni-0f226515-e28b-df13-9f16-dd79456825ac. You can then run exec commands within that namespace:
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip a
# output truncated
3: eth0@if12:<BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 16:a4:f8:4f:56:77 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.244.4.40/32 brd 10.244.4.40 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::14a4:f8ff:fe4f:5677/64 scope link
valid_lft forever preferred_lft forever
This is the Pod's IP address! Now find the other end of the veth pair by looking for interface 12, as indicated by @if12:
$ ip link | grep -A1 ^12
12: vethweplb3f36a0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP mode DEFAULT group default
link/ether 72:1c:73:d9:d9:f6 brd ff:ff:ff:ff:ff:ff link-netnsid 1
You can also verify that the Nginx container is listening for HTTP traffic from within this namespace:
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac netstat -lnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address    State   PID/Program name
tcp        0      0 0.0.0.0:80         0.0.0.0:*          LISTEN  692698/nginx: master
tcp6       0      0 :::80              :::*               LISTEN  692698/nginx: master
If you don't have access to the worker nodes in your cluster via SSH, you can use kubectl exec to get a shell to the busybox container and use the ip and netstat commands directly inside.
5. View the traffic from Pod to Pod in the cluster
There are two possible scenarios for Pod-to-Pod communication:
Pod traffic is destined for Pods on the same node;
Pod traffic is destined for Pods on different nodes.
The whole workflow relies on virtual interface pairs and bridges. For a Pod to communicate with other Pods, its traffic must first reach the node's root namespace. Pods and the root namespace are connected by virtual Ethernet pairs: these virtual interface devices (the v in veth) act as a tunnel between two namespaces. One end of the veth device is placed in the Pod's namespace, and the other end is connected to the root namespace.
CNI can perform these actions, but they can also be done manually:
$ ip link add veth1 netns pod-namespace type veth peer name veth2 netns root
Now the Pod's namespace has a tunnel to the root namespace. Every Pod newly created on the node gets such a veth pair, set up in two steps: one creates the interface pair, the other assigns addresses to the Ethernet devices and configures the default route.
Set up the veth1 interface in the Pod's namespace as follows:
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip addr add 10.244.4.40/24 dev veth1
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip link set veth1 up
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip route add default via 10.244.4.40
On the node, configure veth2, the other end of the pair:
$ ip addr add 169.254.132.141/16 dev veth2
$ ip link set veth2 up
Existing veth pairs can be inspected the same way. In the Pod's namespace, retrieve the suffix of the eth0 interface:
$ ip netns exec cni-0f226515-e28b-df13-9f16-dd79456825ac ip link show type veth
3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
link/ether 16:a4:f8:4f:56:77 brd ff:ff:ff:ff:ff:ff link-netnsid 0
In this case, you can search with grep -A1 ^12 (or just scroll to the target):
$ ip link show type veth
# output truncated
12: cali97e50e215bd@if3:<BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-0f226515-e28b-df13-9f16-dd79456825ac
You can also use the command ip -n cni-0f226515-e28b-df13-9f16-dd79456825ac link show type veth.
Note the 3: eth0@if12 and 12: cali97e50e215bd@if3 markings on the interfaces: from the Pod's namespace, the eth0 interface is connected to interface number 12 in the root namespace, hence @if12; at the other end of the veth pair, the root namespace's interface connects to interface number 3 in the Pod's namespace. Next, a bridge connects the veth pairs to each other.
6. The Pod network namespace is connected to the Ethernet bridge
A bridge aggregates all the virtual interfaces in the root namespace, letting traffic flow between veth pairs and across the shared root namespace. An Ethernet bridge sits at layer 2 of the OSI model: think of it as a virtual switch that accepts connections from different namespaces and interfaces and can join the networks available on a node.
Thus, a bridge can connect two interfaces: the veth of one Pod's namespace and the veth of another Pod on the same node:
7. Track Pod-to-Pod traffic on the same node
Suppose two Pods are on the same node, and Pod-A sends a message to Pod-B. Since the destination is outside Pod-A's own namespace, Pod-A sends the packet to its default interface eth0. That interface is bound to one end of the veth pair and acts as a tunnel, so the packet is forwarded to the root namespace on the node:
The Ethernet bridge acts as a virtual switch and needs the MAC address of the target Pod-B to work:
The ARP protocol solves this. When the frame arrives at the bridge, an ARP broadcast is sent to all connected devices, asking which of them holds Pod-B's IP address:
The device with Pod-B's IP replies with its MAC address, and the answer is stored in the bridge's ARP cache (lookup table):
Once the IP-to-MAC mapping is stored, the bridge looks it up in the table and forwards the packet to the correct endpoint. The packet reaches Pod-B's veth in the root namespace and, shortly after, arrives at the eth0 interface inside Pod-B's namespace:
So far, the communication between Pod-A and Pod-B is successful.
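The bridge's lookup behavior described in the last few steps can be modeled as a toy sketch; the IP and MAC addresses below are made-up examples, not real bridge state:

```python
# Toy model of the bridge behaviour just described: keep an ARP cache mapping
# IPs to MACs, broadcast a query on a miss, and answer later lookups from cache.
arp_cache = {}
broadcasts = []

def arp_broadcast(ip):
    """Stand-in for the 'who has <ip>?' query sent to every connected device."""
    broadcasts.append(ip)
    return "de:ad:be:ef:00:01"  # the device holding the IP replies with its MAC

def resolve_mac(dst_ip):
    if dst_ip not in arp_cache:              # miss: ask everyone, cache the reply
        arp_cache[dst_ip] = arp_broadcast(dst_ip)
    return arp_cache[dst_ip]

first = resolve_mac("10.244.4.41")   # triggers the ARP broadcast
second = resolve_mac("10.244.4.41")  # served straight from the cache
```

Only the first lookup triggers a broadcast; subsequent frames to the same Pod are forwarded straight from the cache.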
8. Track Pod-to-Pod communication on different nodes
Communication between Pods on different nodes takes an extra hop. Follow the same steps as in "Track Pod-to-Pod traffic on the same node" until the packet reaches the root namespace and needs to be sent on to Pod-B:
When the destination IP is not on the local network, the packet is forwarded to the node's default gateway (the egress gateway), which usually sits on the physical interface eth0 connecting the node to the network:
No ARP resolution happens at this point, because the source and destination IPs are not on the same network segment. That check is done with a bitwise operation: when the destination IP is not on the current network segment, the packet is forwarded to the node's default gateway.
9. How the bitwise AND operation works
When determining where to forward a packet, the source node must perform a bitwise operation, also known as an AND operation.
The rules of the bitwise AND operation are as follows; every combination yields 0 except 1 AND 1:
0 AND 0 = 0
0 AND 1 = 0
1 AND 0 = 0
1 AND 1 = 1
If the source node's IP is 192.168.1.1 with a /24 subnet mask, and the destination IP is 172.16.1.1/16, the bitwise AND operation shows that they are on different networks. This means the destination IP is not on the same network as the packet's source, so the packet will be forwarded through the default gateway.
The AND operation is performed on the 32-bit binary addresses. First, work out the network segments of the source IP and the destination IP:
Type              Binary                               Converted
Src. IP Address   11000000.10101000.00000001.00000001  192.168.1.1
Src. Subnet Mask  11111111.11111111.11111111.00000000  255.255.255.0 (/24)
Src. Network      11000000.10101000.00000001.00000000  192.168.1.0
Dst. IP Address   10101100.00010000.00000001.00000001  172.16.1.1
Dst. Subnet Mask  11111111.11111111.00000000.00000000  255.255.0.0 (/16)
Dst. Network      10101100.00010000.00000000.00000000  172.16.0.0
Next, the destination IP is ANDed with the subnet mask of the packet's source node:
Type              Binary                               Converted
Dst. IP Address   10101100.00010000.00000001.00000001  172.16.1.1
Src. Subnet Mask  11111111.11111111.11111111.00000000  255.255.255.0 (/24)
Network Result    10101100.00010000.00000001.00000000  172.16.1.0
The result, 172.16.1.0, is not equal to 192.168.1.0 (the source node's network), indicating that the source and destination IP addresses are not on the same network. If instead the destination IP were 192.168.1.2, i.e., in the same subnet as the sender, the AND operation would yield the node's local network:
Type              Binary                               Converted
Dst. IP Address   11000000.10101000.00000001.00000010  192.168.1.2
Src. Subnet Mask  11111111.11111111.11111111.00000000  255.255.255.0 (/24)
Network           11000000.10101000.00000001.00000000  192.168.1.0
After the bit-by-bit comparison, ARP looks up the MAC address of the default gateway in its lookup table. If an entry exists, the packet is forwarded immediately; otherwise, a broadcast is sent first to discover the gateway's MAC address.
Now, the packet is routed to the default interface of another node, called Node-B:
In reverse order, the packet is now in the root namespace of Node-B, and arrives at the bridge, where ARP resolution occurs:
The routing system will return the MAC address of the interface connected to Pod-B:
The bridge forwards the frame through Pod-B's veth device and reaches Pod-B's namespace:
By now, everyone should be familiar with how traffic flows between Pods.
10. Container Network Interface - CNI
The container network interface (CNI) is mainly concerned with networking in the current node:
Think of the CNI as a set of rules to follow to meet Kubernetes' networking requirements. There are many CNI implementations available: Calico, Cilium, Flannel, Weave Net, and other network plugins, all following the same CNI standard.
If there is no CNI, you need to manually complete the following operations:
Create an interface;
Create a veth pair;
Set the network namespace;
Set static routing;
Configure the ethernet bridge;
Assign IP addresses;
Create NAT rules;
And tons of other things.
This does not include all of the similar operations that need to be done when deleting or restarting Pods.
A CNI must support four different operations:
ADD - add a container to the network;
DEL - delete a container from the network;
CHECK - returns an error if there is a problem with the container's networking;
VERSION - Displays the version of the plugin.
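As a hedged illustration, a toy plugin honoring those four operations might dispatch like this; the IP address and version strings are placeholders, not output of a real IPAM plugin:

```python
# A toy sketch of the CNI contract: the runtime invokes the plugin with a
# command (ADD/DEL/CHECK/VERSION) and the JSON network config. The address
# and versions below are made-up placeholders.

def handle(command, config):
    if command == "ADD":
        # A real plugin would create the veth pair, move one end into the
        # Pod's namespace, and ask its IPAM plugin for an address.
        return {"cniVersion": config.get("cniVersion", "1.0.0"),
                "ips": [{"address": "10.244.4.40/24"}]}
    if command == "DEL":
        return {}  # tear down the interfaces and release the IP
    if command == "CHECK":
        return {}  # an error here would mean the container's networking is broken
    if command == "VERSION":
        return {"cniVersion": "1.0.0", "supportedVersions": ["0.4.0", "1.0.0"]}
    raise ValueError("unknown CNI command: " + command)
```

A real plugin is a binary that receives the command via the CNI_COMMAND environment variable and the config on stdin, but the dispatch structure is the same.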
Let's take a look at how the CNI works. When a Pod is scheduled to a node, the kubelet does not initialize the network itself; it delegates that task to the CNI plugin, passing it the configuration in JSON format. You can inspect the current CNI configuration files in the /etc/cni/net.d folder on the node.
Each CNI plugin uses a different type of network setup. For example, Calico uses a BGP-based layer 3 network to connect Pods, while Cilium builds on eBPF and operates from layer 3 up to layer 7. Like Calico, Cilium also supports network policies to restrict traffic. So which one should you use? There are two main categories of CNIs:
In the first category, the CNI allocates IP addresses to Pods from the cluster's IP pool over a basic network setup (also known as a flat network); this can quickly exhaust the available IP addresses and become a burden;
The other category uses an overlay network. Simply put, an overlay network is a secondary network built on top of the main (underlay) network. It works by encapsulating packets from the underlay network that are destined for a Pod on another node. A popular overlay technology is VXLAN, which tunnels L2 domains over an L3 network.
So which is better? There is no single answer; it depends on your needs. Are you building a large cluster with tens of thousands of nodes? Perhaps an overlay network works better. Do you value a simpler setup and the ability to inspect network traffic without losing that visibility in a complex network? Then a flat network is a better fit.
11. Check the traffic from Pod to Service
Because Pods in Kubernetes are dynamic, the IP address assigned to a Pod is not static: it is ephemeral and changes every time a Pod is created or deleted. Services in Kubernetes solve this problem by providing a reliable mechanism for reaching a set of Pods:
By default, when a Service is created in Kubernetes, it is assigned a virtual IP. Within the Service, a selector associates the Service with its target Pods. What happens when a Pod is removed or added? The Service's virtual IP remains static, yet traffic still reaches the newly created Pods without any intervention. In other words, a Service in Kubernetes is similar to a load balancer. But how does it work?
12. Use Netfilter and Iptables to intercept and rewrite traffic
Services in Kubernetes are built on two components of the Linux kernel: Netfilter and iptables. Netfilter is a framework for configuring packet filtering, NAT, and port-forwarding rules, and for managing traffic in the network; it can also screen out and block unauthorized access. iptables, on the other hand, is a userspace program for configuring the IP packet filtering rules of the Linux kernel firewall.
iptables is implemented as a set of Netfilter modules, and its filter rules, which can be modified on the fly with the iptables CLI, are attached at Netfilter's hook points. Rules are organized into tables, which contain chains for processing network packets. Different protocols use different kernel modules and tools: when iptables is mentioned, it usually refers to IPv4; for IPv6, the tool is ip6tables.
There are five chains in iptables, each of which maps directly to Netfilter's hooks. From the perspective of iptables, they are:
PRE_ROUTING
INPUT
FORWARD
OUTPUT
POST_ROUTING
They map to Netfilter hooks accordingly:
NF_IP_PRE_ROUTING
NF_IP_LOCAL_IN
NF_IP_FORWARD
NF_IP_LOCAL_OUT
NF_IP_POST_ROUTING
When a packet arrives, depending on the phase it is in, a Netfilter hook is "triggered" which executes specific iptables filtering rules:
Phew! That may look complicated, but there is nothing to worry about: this is precisely what Kubernetes abstracts away with Services, and a simple YAML definition sets all of these rules automatically.
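The chain-to-hook mapping above determines which hooks fire for each kind of packet; as a rough sketch (the path labels are our own, only the hook names are Netfilter's):

```python
# Which Netfilter hooks fire, in order, for each packet path. The path labels
# are our own descriptions; the hook names match the list above.
HOOKS = {
    "arriving, for a local process": ["NF_IP_PRE_ROUTING", "NF_IP_LOCAL_IN"],
    "arriving, forwarded onward":    ["NF_IP_PRE_ROUTING", "NF_IP_FORWARD",
                                      "NF_IP_POST_ROUTING"],
    "sent by a local process":       ["NF_IP_LOCAL_OUT", "NF_IP_POST_ROUTING"],
}

def hooks_for(path):
    """Return the ordered list of hooks triggered for a given packet path."""
    return HOOKS[path]
```

Traffic forwarded from a Pod to a Service backend on another node takes the middle path, which is why the Service rewrite happens at NF_IP_PRE_ROUTING.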
To look at the iptables rules, connect to a node and run:
$ iptables-save
You can also use this tool to visualize the iptables chain on a node, here is an example diagram from Visualizing iptables chain on a GKE node:
So far, we have understood how Pods on the same node communicate with Pods on different nodes. In the communication between Pod and Service, the first half of the link is the same:
When the request travels from Pod-A to Pod-B, there are some differences along the way because Pod-B sits "behind" the Service. The original request is sent out of the eth0 interface in Pod-A's namespace and reaches the bridge in the root namespace through the veth pair. As in the Pod-to-Pod case, the host performs a bitwise comparison; since the Service's virtual IP is not part of the node's CIDR, the packet is forwarded through the default gateway (performing ARP resolution first if the gateway's MAC address is not yet in the lookup table). Now the magic happens: before the packet is routed out of the node, Netfilter's NF_IP_PRE_ROUTING hook is triggered and executes iptables rules that rewrite the destination IP address of Pod-A's packet (DNAT).
The Service's virtual IP is rewritten to Pod-B's IP address, and from there the packet is routed just as in Pod-to-Pod communication:
After the packet is rewritten, the communication proceeds Pod to Pod. A third feature is at work throughout: conntrack, Linux's connection tracking. When Pod-B sends back a response, conntrack associates the packet with an existing connection and traces its origin. NAT relies heavily on conntrack: without connection tracking, the kernel would not know where to send the packet containing the response; with it, the return path of the packet can easily have the same source or destination NAT changes undone. The other half of the communication is the reverse of what we have seen: Pod-B has received and processed the request and now sends data back to Pod-A.
13. Check the response from the service
Pod-B sends a response, setting its IP address as the source address and Pod-A's IP address as the destination address:
When the packet arrives at the interface of the node where Pod-A resides, another NAT occurs:
At this point, conntrack kicks in to modify the source IP address: the iptables rules perform SNAT, rewriting the source IP from Pod-B's address to the original Service's virtual IP:
To Pod-A, the response appears to come from the Service rather than from Pod-B. The rest is the same as before: once the SNAT is done, the packet reaches the bridge in the root namespace and is forwarded to Pod-A through the veth pair.
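The DNAT/SNAT round trip described in the last two sections can be modeled as a toy sketch; the Service VIP and Pod IPs below are illustrative, not real cluster addresses:

```python
# Toy model of the Service DNAT/SNAT round trip: iptables rewrites the Service
# virtual IP to a backend Pod IP on the way in (DNAT), and conntrack lets the
# reply be rewritten back (SNAT) so the client only ever sees the Service VIP.
conntrack = {}  # (client_ip, backend_ip) -> service_vip

def dnat(packet, service_vip, backend_ip):
    """NF_IP_PRE_ROUTING: send traffic addressed to the VIP to a real Pod."""
    if packet["dst"] == service_vip:
        conntrack[(packet["src"], backend_ip)] = service_vip
        return {**packet, "dst": backend_ip}
    return packet

def snat_reply(packet):
    """Reply path: restore the VIP as the source, using the tracked connection."""
    vip = conntrack.get((packet["dst"], packet["src"]))
    return {**packet, "src": vip} if vip is not None else packet

# Pod-A (10.244.4.40) talks to a Service VIP, backed by Pod-B (10.244.5.12):
request = dnat({"src": "10.244.4.40", "dst": "10.96.0.10"},
               "10.96.0.10", "10.244.5.12")
reply = snat_reply({"src": "10.244.5.12", "dst": "10.244.4.40"})
```

Without the conntrack table, the reply's source could not be mapped back to the VIP, and Pod-A would see a response from an address it never contacted.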