K8S+DevOps Architect Practical Course | Network Implementation of Kubernetes Cluster

Video source: the Bilibili course "Docker & k8s Tutorial Ceiling — the best Docker & k8s course on Bilibili; everything you need to master the core knowledge of Docker and k8s is here"

These notes organize the teacher's course content together with my own test notes taken while studying, and are shared here with everyone. Any infringement will be removed on request. Thank you for your support!

Attach a summary post: K8S+DevOps Architect Practical Course | Summary


CNI introduction and cluster network selection

The Container Network Interface (CNI) implements Pod network communication and management for the Kubernetes cluster. It consists of:

  • CNI Plugin is responsible for configuring the container's network and includes two basic interfaces: configure the network, AddNetwork(net NetworkConfig, rt RuntimeConf) (types.Result, error), and clean up the network, DelNetwork(net NetworkConfig, rt RuntimeConf) error
  • IPAM Plugin is responsible for assigning IP addresses to containers, and the main implementations include host-local and dhcp.

These two kinds of plug-ins allow the k8s network to support a variety of management modes. There are many solutions in the industry today, among which flannel and calico are the most popular.

After kubernetes configures the cni network plug-in, its container network creation process is as follows:

  • Kubelet first creates the pause container to generate the corresponding network namespace
  • The network driver is then called; because CNI is configured, the CNI-related code is invoked, and the CNI configuration is read from the directory /etc/cni/net.d (see the example configuration after this list)
  • The CNI driver invokes the specific CNI plug-in according to that configuration by executing the plug-in binary, which lives in the directory /opt/cni/bin
  • The CNI plug-in configures the correct network for the pause container, and all other containers in the Pod share the pause container's network
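
A typical flannel CNI configuration under /etc/cni/net.d might look like the following; this is only a sketch of the format, since the exact file name and contents depend on the flannel manifest deployed in your cluster:

$ cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}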

You can view the community CNI implementations here:
https://github.com/containernetworking/cni

General types: flannel, calico, etc., easy to deploy and use

Others: Select according to the specific network environment and network requirements, such as

  • For machines on public clouds, vendor-specific backends and network plug-ins can be chosen; for example AWS, Alibaba Cloud, and Tencent Cloud all provide their own flannel backends, and there is also the AWS ECS CNI plug-in
  • For private clouds, vendor solutions such as VMware NSX-T
  • Where network performance matters, options such as MacVlan

Analysis of Flannel Network Model Implementation

There are multiple implementations of flannel's network:

  • udp
  • vxlan
  • host-gw

If not specified, the vxlan technology will be used as the Backend by default, which can be viewed as follows:

$ kubectl -n kube-system exec  kube-flannel-ds-amd64-cb7hs cat /etc/kube-flannel/net-conf.json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}

Introduction to vxlan and implementing point-to-point communication

VXLAN stands for Virtual eXtensible Local Area Network. It is an overlay technology that builds a virtual layer-2 network on top of a layer-3 network.

It is built on top of the existing IP (layer-3) network, and vxlan can be deployed wherever layer-3 reachability exists (the hosts can communicate with each other over IP). On each endpoint there is a VTEP responsible for encapsulating and decapsulating vxlan packets, i.e. wrapping the virtual network's frames with the headers used for VTEP-to-VTEP communication. Multiple vxlan networks can be created on the same physical network; each of them can be regarded as a tunnel through which virtual machines on different nodes are connected directly. Every vxlan network is identified by a unique VNI, and different vxlan networks do not interfere with each other.

  • VTEP (VXLAN Tunnel Endpoint): the edge device of a vxlan network, which processes vxlan packets (encapsulation and decapsulation). A VTEP can be a network device (such as a switch) or a machine (such as a host in a virtualization cluster)
  • VNI (VXLAN Network Identifier): the identifier of each vxlan network, with 2^24 = 16,777,216 possible values in total. Generally each VNI corresponds to one tenant, which means a public cloud built on vxlan can in theory support tens of millions of tenants

Demonstration: between the two machines k8s-slave1 and k8s-slave2, use vxlan's point-to-point capability to build a virtual layer-2 network for communication

k8s-slave1 node:

# Create the VTEP device; the remote end points to the k8s-slave2 node, specifying the VNI and the NIC used by the underlay network
$ ip link add vxlan20 type vxlan id 20 remote 172.21.51.69 dstport 4789 dev eth0

$ ip -d link show vxlan20

# Bring the device up
$ ip link set vxlan20 up 

# Assign an IP address
$ ip addr add 10.0.136.11/24 dev vxlan20

k8s-slave2 node:

# Create the VTEP device; the remote end points to the k8s-slave1 node, specifying the VNI and the NIC used by the underlay network
$ ip link add vxlan20 type vxlan id 20 remote 172.21.51.68 dstport 4789 dev eth0

# Bring the device up
$ ip link set vxlan20 up 

# Assign an IP address
$ ip addr add 10.0.136.12/24 dev vxlan20

On the k8s-slave1 node:

$ ping 10.0.136.12

A tunnel is a logical concept; there is no concrete physical entity corresponding to it in the vxlan model. The tunnel can be regarded as a virtual channel: the two sides of the vxlan communication (the virtual machines) believe they are communicating directly and are unaware of the underlying network. Overall, each vxlan network appears to provide a separate communication channel, i.e. a tunnel, for the communicating virtual machines.

How it works:

The virtual machine's packet is encapsulated by the VTEP with the vxlan header and the outer headers and then sent out; when the peer VTEP receives it, it strips the vxlan header and delivers the original packet to the destination virtual machine according to the VNI.
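
The extra headers are what cost MTU: the outer Ethernet (14 bytes), IP (20), UDP (8) and VXLAN (8) headers add roughly 50 bytes per packet, which is also why flannel's subnet.env later in this article shows FLANNEL_MTU=1450. A quick check on the device created above (a hedged sketch; the exact value depends on the MTU of eth0):

# outer Ethernet(14) + IP(20) + UDP(8) + VXLAN(8) ≈ 50 bytes of overhead,
# so a vxlan device on a 1500-byte NIC usually ends up with MTU 1450
$ ip link show vxlan20 | grep -o 'mtu [0-9]*'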

# Check the host routing table on k8s-slave1
$ route -n
10.0.136.0       0.0.0.0         255.255.255.0   U     0      0        0 vxlan20

# After the traffic reaches the vxlan device,
$ ip -d link show vxlan20
    vxlan id 20 remote 172.21.51.69 dev eth0 srcport 0 0 dstport 4789 ...

# Check the fdb forwarding table, which mainly consists of MAC address, VLAN number, port number and some flag fields. The remote VTEP address is 172.21.51.69; in other words, packets that get the vxlan header added are all sent to 172.21.51.69
$ bridge fdb show | grep vxlan20
00:00:00:00:00:00 dev vxlan20 dst 172.21.51.69 via eth0 self permanent

Capture packets on the k8s-slave2 machine and check the vxlan-encapsulated packets:

# Run on the k8s-slave2 machine
$ tcpdump -i eth0 host 172.21.51.68 -w vxlan.cap

# Run on the k8s-slave1 machine
$ ping 10.0.136.12

Use Wireshark to analyze the captured ICMP packets
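
If a GUI is not handy, the capture can also be inspected on the command line; a hedged sketch using tshark (shipped with the Wireshark package). The -d option forces UDP port 4789 to be decoded as VXLAN, although recent Wireshark versions already do this by default since 4789 is the IANA-assigned VXLAN port:

$ tshark -r vxlan.cap -d udp.port==4789,vxlan -Y icmp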

Cross-host container network communication

Thinking: in a container network, where should the vxlan device be attached?

Basic requirement: traffic destined for the remote container must be forwarded through the VTEP device!

Demonstration: Using vxlan to realize cross-host container network communication

To avoid affecting the existing network, a new bridge is created and the containers are attached to it for the demonstration.

On the k8s-slave1 node:

$ docker network ls

# Create a new bridge with the specified CIDR
$ docker network create --subnet 172.18.0.0/16  network-luffy
$ docker network ls

# Create a new container attached to the new bridge
$ docker run -d --name vxlan-test --net network-luffy --ip 172.18.0.2 nginx:alpine

$ docker exec vxlan-test ifconfig

$ brctl show network-luffy

On the k8s-slave2 node:

# Create a new bridge with the specified CIDR
$ docker network create --subnet 172.18.0.0/16  network-luffy

# Create a new container attached to the new bridge
$ docker run -d --name vxlan-test --net network-luffy --ip 172.18.0.3 nginx:alpine

Perform a ping test at this point:

$ docker exec vxlan-test ping 172.18.0.3

Analysis: the ping fails because once the data reaches the bridge it has no way to leave the host. Combining this with the previous example, the traffic should be forwarded by a VTEP device. Recall how a bridge works: a port attached to the bridge has its traffic forwarded by the bridge. So if the VTEP is attached to the bridge as a port, the data sent by every container passes through the vxlan port, vxlan carries the traffic to the peer VTEP endpoint, and the bridge on the other side delivers it to the target container.

k8s-slave1 node:

# Delete the old VTEP
$ ip link del vxlan20

# Create a new VTEP
$ ip link add vxlan_docker type vxlan id 100 remote 172.21.51.69 dstport 4789 dev eth0
$ ip link set vxlan_docker up
# No IP address is needed, because the goal is only to forward the containers' traffic

# Attach it to the bridge
$ brctl addif br-904603a72dcd vxlan_docker

k8s-slave2 node:

# Delete the old VTEP
$ ip link del vxlan20

# Create a new VTEP
$ ip link add vxlan_docker type vxlan id 100 remote 172.21.51.68 dstport 4789 dev eth0
$ ip link set vxlan_docker up
# No IP address is needed, because the goal is only to forward the containers' traffic

# Attach it to the bridge
$ brctl addif br-c6660fe2dc53 vxlan_docker
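
The bridge names above (br-904603a72dcd, br-c6660fe2dc53) are derived from the docker network ID and will differ on every machine. A hedged sketch for looking the name up instead of hard-coding it, assuming the network-luffy network created earlier:

# Docker names the Linux bridge "br-" plus the first 12 characters of the network ID
$ BRIDGE="br-$(docker network inspect network-luffy -f '{{.Id}}' | cut -c1-12)"
$ brctl addif "$BRIDGE" vxlan_docker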

Perform the ping test again:

$ docker exec vxlan-test ping 172.18.0.3

The essence of Flannel's vxlan implementation

Thinking: how does the network environment of the k8s cluster differ from the manually implemented cross-host container communication above?

  1. CNI requires that each Pod in the cluster must be assigned a unique Pod IP
  2. Communication in the k8s cluster is not point-to-point vxlan, because all nodes in the cluster need to be interconnected with one another, which a point-to-point vxlan model cannot provide (see the sketch after this list)
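
To illustrate point 2, a non-point-to-point vxlan device can be created by omitting the remote address and instead maintaining one forwarding (fdb) entry per peer; this is essentially the model flanneld maintains for flannel.1. A manual sketch, not flannel's own code, and the second peer IP is hypothetical:

$ ip link add vxlan100 type vxlan id 100 dstport 4789 dev eth0 nolearning
$ ip link set vxlan100 up
# one default-destination entry per remote VTEP (head-end replication for unknown/broadcast traffic)
$ bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 172.21.51.69
$ bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 172.21.51.70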

How flannel assigns Pod address segments to each node:

$ kubectl -n kube-system exec kube-flannel-ds-amd64-cb7hs cat /etc/kube-flannel/net-conf.json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}

# Check the Pods' IPs on each node
[root@k8s-master bin]# kd get po -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE        
myblog-5d9ff54d4b-4rftt   1/1     Running   1          33h     10.244.2.19   k8s-slave2  
myblog-5d9ff54d4b-n447p   1/1     Running   1          33h     10.244.1.32   k8s-slave1

# Check the subnet assigned to the k8s-slave1 host
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

# When kubelet starts a container, it can assign the Pod an IP address according to the node's subnet configuration

Where is the VTEP device?

$ ip -d link show flannel.1
# There is no remote IP, so it is not point-to-point

How Pod traffic gets to the VTEP device:

$ brctl show cni0

# Each Pod uses a veth pair to get its traffic onto the cni0 bridge

$ route -n
10.244.0.0      10.244.0.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.1.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.2.0      10.244.2.0      255.255.255.0   UG    0      0        0 flannel.1
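
To see which veth on cni0 belongs to a given Pod, one common trick is to read the peer interface index from inside the Pod and match it against the host's interface list. A hedged sketch, assuming the container image ships cat and reusing the Pod name from the listing above:

# on the master node: print eth0's peer interface index from inside the Pod
$ kubectl -n luffy exec myblog-5d9ff54d4b-n447p -- cat /sys/class/net/eth0/iflink
# on k8s-slave1: the host interface whose index equals that number
# (e.g. ip -o link | grep '^<index>:') is the Pod's veth endpoint attached to cni0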

When the VTEP encapsulates a packet, how does it get the IP and MAC of the destination VTEP?

# When flanneld starts it is configured with --iface=eth0; through this option the NIC's IP and MAC information is stored in etcd,
# so flannel knows the IP segment assigned to every node as well as the IP and MAC of every VTEP device. The flanneld on each node is also aware of node additions and removals, so it can dynamically update the local forwarding configuration
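
In a kubeadm-deployed cluster like this one, flanneld typically runs with the kube subnet manager, so the same per-node VTEP information is also visible as node annotations. A hedged sketch, assuming the annotation keys used by the standard kube-flannel manifest:

$ kubectl get node k8s-slave2 -o jsonpath='{.metadata.annotations}'
# look for flannel.alpha.coreos.com/backend-data (the VTEP MAC),
# flannel.alpha.coreos.com/backend-type and flannel.alpha.coreos.com/public-ip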

Demonstrate the detailed flow process of cross-host Pod communication:

$ kubectl -n luffy get po -o wide
myblog-5d9ff54d4b-4rftt   1/1     Running   1          25h    10.244.2.19   k8s-slave2
myblog-5d9ff54d4b-n447p   1/1     Running   1          25h    10.244.1.32   k8s-slave1

$ kubectl -n luffy exec myblog-5d9ff54d4b-n447p -- ping 10.244.2.19 -c 2
PING 10.244.2.19 (10.244.2.19) 56(84) bytes of data.
64 bytes from 10.244.2.19: icmp_seq=1 ttl=62 time=0.480 ms
64 bytes from 10.244.2.19: icmp_seq=2 ttl=62 time=1.44 ms

--- 10.244.2.19 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.480/0.961/1.443/0.482 ms

# Check the routing table inside the Pod
$ kubectl -n luffy exec myblog-5d9ff54d4b-n447p -- route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.244.1.1      0.0.0.0         UG    0      0        0 eth0
10.244.0.0      10.244.1.1      255.255.0.0     UG    0      0        0 eth0
10.244.1.0      0.0.0.0         255.255.255.0   U     0      0        0 eth0

# Check the veth pairs and the bridge on k8s-slave1
$ brctl show
bridge name     bridge id               STP enabled     interfaces
cni0            8000.6a9a0b341d88       no              veth048cc253
                                                        veth76f8e4ce
                                                        vetha4c972e1
# After the traffic reaches cni0, check the route table of the slave1 node
$ route -n
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.136.2   0.0.0.0         UG    100    0        0 eth0
10.0.136.0      0.0.0.0         255.255.255.0   U     0      0        0 vxlan20
10.244.0.0      10.244.0.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.1.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.2.0      10.244.2.0      255.255.255.0   UG    0      0        0 flannel.1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.136.0   0.0.0.0         255.255.255.0   U     100    0        0 eth0

# The traffic is forwarded to the flannel.1 interface; inspecting it shows it is actually a VTEP device
$ ip -d link show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether 8a:2a:89:4d:b0:31 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vxlan id 1 local 172.21.51.68 dev eth0 srcport 0 0 dstport 8472 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    
# Where should it be forwarded? The data is queried from etcd and cached locally, so the traffic does not need to be sent by multicast
$ bridge fdb show dev flannel.1
a6:64:a0:a5:83:55 dst 192.168.136.10 self permanent
86:c2:ad:4e:47:20 dst 172.21.51.69 self permanent

# The peer VTEP device decapsulates the received packet and extracts the original payload; check the routing table of k8s-slave2
$ route -n
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.136.2   0.0.0.0         UG    100    0        0 eth0
10.0.136.0      0.0.0.0         255.255.255.0   U     0      0        0 vxlan20
10.244.0.0      10.244.0.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.2.0      0.0.0.0         255.255.255.0   UG    0      0        0 cni0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.136.0   0.0.0.0         255.255.255.0   U     100    0        0 eth0

# According to the routing rules it is forwarded to the cni0 bridge, and the bridge then delivers it to the specific Pod

The actual request flow:

  • The IP packets in pod-a on the k8s-slave1 node destined for 10.244.2.19 are sent out through eth0 via pod-a's routing table, and then forwarded through the veth pair to the cni0 bridge on the host
  • By matching the routing table of node k8s-slave1, the IP packet arriving at cni0 finds that packets for 10.244.2.19 should be handed to the flannel.1 interface
  • flannel.1, as a VTEP device, encapsulates the packet according to its VTEP configuration after receiving it. On the first lookup it queries etcd and learns that the VTEP responsible for 10.244.2.19 is the k8s-slave2 machine with IP 172.21.51.69, and it obtains that VTEP's MAC address to build the VXLAN packet.
  • Through the network connection between node k8s-slave2 and k8s-slave1, the VXLAN packet reaches the eth0 interface of k8s-slave2
  • The VXLAN packet arriving on UDP port 8472 is handed to the VTEP device flannel.1 for decapsulation
  • The decapsulated IP packet matches the routing table (10.244.2.0) in node k8s-slave2, and the kernel forwards the IP packet to cni0
  • cni0 forwards the IP packet to pod-b connected to cni0
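
To confirm this flow on the wire, a hedged capture sketch: flannel's vxlan backend uses UDP port 8472 by default (visible in the ip -d link output above), so the encapsulated traffic can be watched on either node's eth0 while repeating the Pod-to-Pod ping:

# run on k8s-slave2 while pinging 10.244.2.19 from the Pod on k8s-slave1
$ tcpdump -i eth0 -nn udp port 8472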

Use host-gw mode to improve cluster network performance

The vxlan mode works in any layer-3 reachable network environment and places very loose requirements on the cluster's network, but it incurs extra performance overhead because every packet has to be encapsulated and decapsulated by the VTEP device.

What the network plug-in really does is deliver traffic from the local machine's cni0 bridge to the cni0 bridge of the destination host. In practice, many clusters are deployed within the same layer-2 network, where the hosts themselves can act as gateways and forward the traffic directly through the routing table, with no encapsulation or decapsulation at all.

Why can't a network that is only layer-3 reachable simply use the peer node as a gateway to forward the traffic?

According to the kernel's routing rules, a gateway must be on the same subnet as at least one of the host's IP addresses.
Since every node in the k8s cluster needs Pod-to-Pod connectivity with every other node, host-gw mode therefore requires that all cluster nodes sit in the same layer-2 network.
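
A quick way to check whether two nodes are on the same layer-2 segment is to ask the kernel how it would reach the peer: if the output contains no "via <gateway>", the peer is directly reachable and host-gw can work. A hedged sketch, run on k8s-slave1 against k8s-slave2's address:

$ ip route get 172.21.51.69
# directly connected (same layer 2):  172.21.51.69 dev eth0 src 172.21.51.68 ...
# routed through a gateway (layer 3): 172.21.51.69 via <gateway> dev eth0 ...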

Modify flannel's network backend:

$ kubectl edit cm kube-flannel-cfg -n kube-system
...
net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "host-gw"
      }
    }
kind: ConfigMap
...

Rebuild Flannel Pods

$ kubectl -n kube-system get po |grep flannel
kube-flannel-ds-amd64-5dgb8          1/1     Running   0          15m
kube-flannel-ds-amd64-c2gdc          1/1     Running   0          14m
kube-flannel-ds-amd64-t2jdd          1/1     Running   0          15m

$ kubectl -n kube-system delete po kube-flannel-ds-amd64-5dgb8 kube-flannel-ds-amd64-c2gdc kube-flannel-ds-amd64-t2jdd
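
Instead of listing every Pod name, the flannel Pods can also be restarted by label selector; a hedged one-liner, assuming the standard kube-flannel manifest which labels the DaemonSet Pods with app=flannel:

$ kubectl -n kube-system delete po -l app=flannel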

# After the new Pods start, check the logs and look for the words "Backend type: host-gw"
$  kubectl -n kube-system logs -f kube-flannel-ds-amd64-4hjdw
I0704 01:18:11.916374       1 kube.go:126] Waiting 10m0s for node controller to sync
I0704 01:18:11.916579       1 kube.go:309] Starting kube subnet manager
I0704 01:18:12.917339       1 kube.go:133] Node controller sync successful
I0704 01:18:12.917848       1 main.go:247] Installing signal handlers
I0704 01:18:12.918569       1 main.go:386] Found network config - Backend type: host-gw
I0704 01:18:13.017841       1 main.go:317] Wrote subnet file to /run/flannel/subnet.env

View node routing table:

$ route -n 
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.136.2   0.0.0.0         UG    100    0        0 eth0
10.244.0.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.1.0      172.21.51.68    255.255.255.0   UG    0      0        0 eth0
10.244.2.0      172.21.51.69    255.255.255.0   UG    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.136.0   0.0.0.0         255.255.255.0   U     100    0        0 eth0

  • The IP packets in pod-a on the k8s-slave1 node destined for 10.244.2.19 are sent out through eth0 via pod-a's routing table, and then forwarded through the veth pair to the cni0 bridge on the host
  • By matching the routing table of node k8s-slave1, the IP packet arriving at cni0 finds that packets for 10.244.2.19 should be forwarded via the gateway 172.21.51.69
  • The packet arrives at the eth0 network card of the k8s-slave2 node (172.21.51.69), and forwards it to the cni0 network card according to the routing rules of the node
  • cni0 forwards the IP packet to pod-b connected to cni0
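
A simple way to verify that host-gw really skips encapsulation is to capture on the node NIC: with host-gw the Pod IPs appear directly on eth0 and no UDP 8472 traffic shows up, whereas with the vxlan backend the inner Pod IPs would be hidden inside UDP 8472 packets. A hedged sketch, run on k8s-slave2 while repeating the Pod-to-Pod ping:

$ tcpdump -i eth0 -nn icmp and host 10.244.2.19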

Origin: blog.csdn.net/guolianggsta/article/details/131609679