K8s network scheme -- Flannel's "routine"

Preface

Anyone with a little knowledge of docker will know that a container's IP only exists inside the network namespaces of its host. It has no presence in the real physical network, that is, there is no route or MAC address for it out there, so its packets cannot be routed or forwarded beyond the host. The result is that containers on different hosts cannot talk to each other.

The clever reader will think of NAT port mapping as a workaround. In a simple scenario with only a few servers we can indeed do that: map the application port inside docker to a port on the host, and the service becomes reachable from outside through the host's port. But if you are facing hundreds of servers running docker, this turns into an enormous amount of manual work; after a few rounds of heroic operations the process becomes truly painful, and even the most brute-force-minded of us will start looking for a lazier way.
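
To make the port-mapping idea concrete, here is a minimal sketch (the image name and port numbers are just placeholders): the container's port 80 is published on host port 8080, so other machines can only reach the service through the host's IP and that mapped port.

# publish container port 80 on host port 8080; image and ports are illustrative
docker run -d --name web -p 8080:80 nginx
# from any other machine the service is reachable only via <host-ip>:8080
curl http://<host-ip>:8080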

CNI

To solve the above problems, k8s put forward a network standard adapted to its own architecture, CNI:

  • Pods can communicate with each other over the network without setting up NAT
  • The IP address a Pod sees for itself is the same IP address that other Pods see for it
  • Nodes in a Kubernetes cluster can be physical machines, virtual machines, or any environment capable of running Kubernetes, and these nodes can also communicate with all Pods without NAT

The official wording is a bit long-winded. What it really says is that dockers must face each other directly: no tricks, no "proxy-like" schemes such as NAT, only their own real IPs for communication. In other words, what k8s wants is a flat network in which containers do not depend on the host's IP and talk to each other using nothing but their own IPs. However, following the principle of only stating requirements and never doing the work itself, k8s did not implement the underlying network; it leaves that to third-party open-source solutions.
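
As a concrete illustration, a node running the flannel CNI plugin typically carries a config file under /etc/cni/net.d; the exact file name and fields vary with the flannel/CNI version, but it looks roughly like this:

root@ip-172-25-33-13:~# cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}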

At present, the solutions commonly used in the industry are Flannel, Calico, Weave and Canal. In terms of implementation principle and popularity, the most representative ones are Flannel and Calico.

  • Flannel
    builds a Layer 2 overlay network, letting docker packets be carried inside the overlay tunnel

  • Calico
    treats each host as a router and builds a BGP network; docker traffic is forwarded by Layer 3 routing

The following sections try to analyze the principles behind the Flannel architecture.

Flannel is an overlay network architecture designed by CoreOS for k8s. The so-called overlay is essentially packet-in-packet nesting: the original container packet is wrapped inside an outer UDP/IP packet. The following figure shows the vxlan overlay packet format.

[Figure: VXLAN overlay packet format]

Depending on the encapsulation protocol used, flannel offers several backends

  • udp
    encapsulation and decapsulation are done in user space, which costs performance
  • host-gw
    only works when the hosts share the same Layer 2 network; it cannot cross Layer 3
  • vxlan
    decapsulation is carried out in kernel space, so performance is good, and it can cross Layer 3 networks
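
Which backend a cluster uses is declared in flannel's network configuration. A sketch of reading it back, assuming the etcd v2 API that flannel traditionally uses, its default key prefix /coreos.com/network, and the 10.244.0.0/16 network that appears later in this article:

root@ip-172-25-33-13:~# etcdctl get /coreos.com/network/config
{"Network": "10.244.0.0/16", "Backend": {"Type": "vxlan"}}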

The components involved in Flannel's vxlan mode mainly include etcd, flanneld, and so on.

Components

ETCD

If you look carefully, you may notice that docker assigns containers the network segment 172.17.0.0/16 by default. Start docker on another host and you will find it assigns the very same segment, so docker IPs can easily collide across hosts. A central store is therefore needed to guarantee that docker IPs are globally unique, and etcd plays exactly that role: it holds the network segments allocated to all nodes. Of course, etcd is really just the carrier of Flannel's metadata; the component actually responsible for dividing the segments and registering and reporting subnet and VTEP information is flanneld.
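
A hedged sketch of what those records look like, again assuming the etcd v2 API and the default /coreos.com/network prefix; the two subnets match the nodes shown in the next section, and each lease value carries the node's public IP and VTEP MAC:

root@ip-172-25-33-13:~# etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/192.168.1.0-24
/coreos.com/network/subnets/192.168.2.0-24
root@ip-172-25-33-13:~# etcdctl get /coreos.com/network/subnets/192.168.2.0-24
{"PublicIP":"172.25.34.198","BackendType":"vxlan","BackendData":{"VtepMAC":"26:1c:b0:2b:17:31"}}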

Flanneld

flanneld runs on each host node as the agent in front of etcd and has the following functions:

  • Obtain the network configuration from etcd
  • Divide a subnet for the node and register it in etcd
  • Record the node's subnet information on the host in /run/flannel/subnet.env, as shown below
root@ip-172-25-34-198:~# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=192.168.2.1/24
FLANNEL_MTU=8951
FLANNEL_IPMASQ=true

ubuntu@ip-172-25-33-13:~$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=192.168.1.1/24
FLANNEL_MTU=8951
FLANNEL_IPMASQ=true

As shown above, the two nodes are assigned different subnets; this information is reported to etcd and injected into docker's startup parameters, which guarantees that docker IPs are unique across the cluster.
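
How that injection works: flannel ships a helper script (mk-docker-opts.sh) that turns /run/flannel/subnet.env into docker daemon options. Doing the same thing by hand would look roughly like the following sketch:

# a sketch only: source the values flanneld wrote, then start dockerd with them
. /run/flannel/subnet.env
dockerd --bip=${FLANNEL_SUBNET} --mtu=${FLANNEL_MTU}
# on this node docker now hands out addresses from 192.168.1.0/24 instead of 172.17.0.0/16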

VXLAN encapsulation process

The IPs are now unique, but how do dockers on different nodes actually communicate? To get at the underlying mechanism, let's walk through an example.

To keep things easy to follow, let's analyze, in vxlan mode, the forwarding path of data going from docker 192.168.1.4 to docker 192.168.2.2 across nodes.

[Figure: cross-node forwarding path in Flannel vxlan mode]
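
Before stepping through the details, a quick end-to-end check (bd69 is the source container 192.168.1.4 used below, and the example assumes ping is available in the container image):

root@ip-172-25-33-13:~# docker exec bd69 ping -c 3 192.168.2.2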

1. The data starts from the docker whose IP is 192.168.1.4. According to the container's routing table, a packet destined for 192.168.2.2 matches the default route, so it leaves through the container's eth0 interface


root@ip-172-25-33-13:/etc/cni/net.d# docker exec bd69 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 8951 qdisc noqueue state UP
    link/ether 96:53:14:4a:4e:bd brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.4/24 scope global eth0
       valid_lft forever preferred_lft forever
root@ip-172-25-33-13:/etc/cni/net.d# docker exec bd69 ip route
default via 192.168.1.1 dev eth0
10.244.0.0/16 via 192.168.1.1 dev eth0
192.168.1.0/24 dev eth0  proto kernel  scope link  src 192.168.1.4

2. Where does the data go after it leaves docker? In theory docker and the host sit in two different network namespaces whose networks are not connected, but the veth pair changes the picture. A veth pair is the "network cable" that joins two namespaces: it links the container's eth0 interface to the host's cni0 bridge, so the packet is next handed over to cni0 on the host
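
You can see this wiring on the host: cni0 is a Linux bridge that holds the host side of every container's veth pair. A sketch (the veth name below is illustrative, but the interface indexes line up with the eth0@if8 shown inside the container):

root@ip-172-25-33-13:~# ip addr show cni0 | grep 'inet '
    inet 192.168.1.1/24 scope global cni0
root@ip-172-25-33-13:~# ip link show master cni0
8: veth1a2b3c4d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue master cni0 state UP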

3. In the host's routing table there is an entry for 192.168.2.0/24, so our wandering packet is forwarded on to the flannel.1 interface

root@ip-172-25-33-13:/etc/cni/net.d# ip route
default via 172.25.32.1 dev eth0
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 linkdown
172.25.32.0/20 dev eth0  proto kernel  scope link  src 172.25.33.13
192.168.0.0/24 via 192.168.0.0 dev flannel.1 onlink
192.168.1.0/24 dev cni0  proto kernel  scope link  src 192.168.1.1
192.168.2.0/24 via 192.168.2.0 dev flannel.1 onlink

4. flannel.1 is a virtual VTEP (VXLAN Tunnel End Point) device, responsible for encapsulating and decapsulating vxlan packets. Once the packet reaches flannel.1, it enters the vxlan "routine"

When the packet arrives at flannel.1, it has to be encapsulated according to the vxlan protocol. The destination IP is 192.168.2.2, so flannel.1 needs the MAC address to use as the inner destination, namely the MAC of the flannel.1 device on the host whose subnet contains 192.168.2.2. Unlike traditional Layer 2 addressing, flannel.1 does not broadcast an ARP request for it; instead the Linux kernel raises an "L3 miss" event to the flanneld process in user space. When flanneld receives the event, it looks up in etcd the VTEP MAC registered for the subnet matching the destination address, i.e. the MAC of the flannel.1 device on the host where the destination pod lives. With that, the complete vxlan inner packet takes the following form

[Figure: VXLAN inner packet layout]
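
What flanneld programs into the kernel can be inspected directly. A sketch, assuming flannel's default VNI 1 and UDP port 8472 (once flanneld has answered the miss, or on newer flannel versions that simply pre-program these entries, the remote VTEP shows up as a permanent neighbour of flannel.1):

root@ip-172-25-33-13:~# ip -d link show flannel.1 | grep vxlan
    vxlan id 1 local 172.25.33.13 dev eth0 srcport 0 0 dstport 8472 nolearning
root@ip-172-25-33-13:~# ip neigh show dev flannel.1
192.168.2.0 lladdr 26:1c:b0:2b:17:31 PERMANENT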

5. Following the protocol stack's top-down encapsulation, to build a complete frame that can actually be transmitted we still need the outer destination IP and the outer destination MAC. How are they found?
For the outer destination IP, flannel.1 uses the remote VTEP's MAC address to query the FDB (forwarding database): as shown below, the MAC 26:1c:b0:2b:17:31 maps to the IP 172.25.34.198

root@ip-172-25-33-13:~# bridge fdb show dev flannel.1
22:5b:16:5a:1b:fc dst 172.25.42.118 self permanent
26:1c:b0:2b:17:31 dst 172.25.34.198 self permanent

The MAC address of 172.25.34.198 can then be resolved through the ordinary ARP protocol, so a complete frame finally emerges, as follows:

[Figure: complete VXLAN outer frame]

Finally the frame, with outer destination IP 172.25.34.198 and destination UDP port 8472, is sent out through the host's eth0 interface
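
If you want to see this on the wire, the cross-node traffic now appears on the underlay as plain UDP. A sketch of how to check the outer addressing and capture it (interface names assume this example's hosts):

# the outer destination MAC is resolved by ordinary ARP on the underlay network
root@ip-172-25-33-13:~# ip neigh show 172.25.34.198
# the encapsulated docker-to-docker traffic shows up as UDP packets to port 8472
root@ip-172-25-33-13:~# tcpdump -ni eth0 udp port 8472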

6. The frame is forwarded across the underlay network to port 8472 of the peer host 172.25.34.198. This UDP port is the VXLAN port the flannel.1 VTEP listens on, so the kernel decapsulates the vxlan packet, finds that the inner destination IP is 192.168.2.2, matches it against the local routing table, and forwards the packet to the cni0 interface

root@ip-172-25-34-198:~# ip route
default via 172.25.32.1 dev eth0
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 linkdown
172.25.32.0/20 dev eth0  proto kernel  scope link  src 172.25.34.198
192.168.0.0/24 via 192.168.0.0 dev flannel.1 onlink
192.168.1.0/24 via 192.168.1.0 dev flannel.1 onlink
192.168.2.0/24 dev cni0  proto kernel  scope link  src 192.168.2.1

7. As mentioned earlier, the cni0 bridge is wired straight into the docker namespaces, so the packet is finally delivered through the veth pair to the destination docker, 192.168.2.2

After so much effort and so many "routines", the packet has finally achieved communication between dockers on different hosts

To sum up

Flanneld and etcd together build an overlay network that covers all the nodes of the cluster, so that inside this layer of "tunnel" dockers can communicate with each other using their own IPs.



Origin blog.51cto.com/3379770/2638236