[Reprint] Getting Started with K8s from Scratch | The Kubernetes Advanced Network Model


https://www.kubernetes.org.cn/6838.html

 

Author | Ye Lei (alias "Daonong"), Senior Technical Expert at Alibaba

Overview: Building on the earlier introduction to the basic network model, this article goes a step deeper, aiming to give readers a broader and more thorough understanding. It first briefly reviews the history of container networking and analyzes where the Kubernetes network model came from; it then walks through a concrete implementation (Flannel host-gw) to show how a packet is transformed on its way from a container to the host; finally, it takes a closer look at Service, which is closely related to networking, covering its mechanism and usage, and illustrates how Service works with a simple example.

How the Kubernetes network model came about

[Figure 1]

Container networking originated with Docker's network. Docker uses a relatively simple network model: an internal bridge plus an internally reserved IP range. The advantage of this design is that the container network is decoupled from the outside world and does not consume host IPs or host resources; it is entirely virtual. The original intent was: when a container needs to reach the outside world, it borrows the Node's IP via SNAT to access external services; when a container needs to provide a service, DNAT is used, that is, a port is opened on the Node and traffic is steered into the container's process through iptables or some other mechanism.

The problem with this model is that the external network cannot tell container traffic apart from host traffic. For example, when building a highly available setup where 172.16.1.1 and 172.16.1.2 are two containers providing the same functionality, we need to bind them into one group to serve traffic together. But from the outside the two have nothing in common: both are borrowing ports on their host's IP, so it is hard to pull them together and treat them as one service.

[Figure 2]

On this basis, Kubernetes proposes the following mechanism: every Pod, i.e. a small group of containers aggregated by function, should have its own "identity card", or ID. At the level of the TCP/IP protocol stack, this ID is an IP address.

This IP truly belongs to the Pod, and the outside world must deliver packets to it by whatever means. Accessing this Pod IP is accessing its real service; nothing in between is allowed to reject or rewrite it. For example, if 10.1.1.1 accesses the Pod at 10.1.2.1, and 10.1.2.1 discovers that what arrives is actually a borrowed host IP rather than the real source IP, that is not allowed. The containers inside a Pod are required to share this IP, which solves the problem of how a set of functionally cohesive containers becomes an atomic unit of deployment.

What remains is the question of deployment. Kubernetes places essentially no restrictions on how this model is implemented: using an underlay network and steering traffic via external routers is fine; if you want decoupling, adding an overlay network on top of the underlying network is also fine. In short, any approach that achieves what the model requires will do.

How exactly does a Pod get on the network

How exactly is a network packet carried across the container network?

[Figure 3]

We can look at this from two dimensions:

  • Protocol layers
  • Network topology

The first dimension is the protocol layer.

It follows the same concept as the TCP/IP protocol stack: headers are stacked up layer by layer across layers 2, 3 and 4. When sending, the packet is built from the top down: first comes the application data, which is handed to the layer-4 protocol (TCP or UDP) and passed further down, where the IP header is added and finally the MAC header, after which the packet can be sent out. When receiving, the order is reversed: the MAC header is stripped first, then the IP header, and finally the receiving process is located by protocol and port number.

The second dimension is the network topology.

Sending a packet out of a container has to solve the problem in two steps: first, how to jump from the container's network namespace (c1) to the host's namespace (infra); second, how to get from the host to the remote end.

My personal understanding is that container network solutions can be thought of in terms of three levels: access, flow control, and the channel.

  • The first is access: which mechanism connects the container to the host, for example the classic veth pair plus bridge approach, or newer mechanisms available in recent kernels (such as macvlan/ipvlan) that deliver the packet into the host's network namespace; a minimal sketch of the classic wiring is shown after this list;
  • The second is flow control: does the solution support Network Policy, and if so, how is it implemented. Note that the implementation must sit at a point the data path is guaranteed to pass through; if the data path does not go through that hook point, the policy will not take effect;
  • The third is the channel: how packets are carried between the two hosts. There are many options, such as routing (which can be further divided into BGP routing or direct routes) and a variety of tunneling technologies. The ultimate goal is that a packet produced in a container leaves the container, reaches the host through the access layer, traverses the host's flow-control module (if any), and is then carried over the channel to the peer host.
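To make the access layer concrete, here is a minimal, hand-rolled sketch of the classic veth + bridge wiring using iproute2. The namespace name c1, the bridge name cni0 and the addresses are illustrative assumptions borrowed from the host-gw example below; this is not what any particular CNI plugin executes verbatim.

```bash
# Create the bridge standing in for cni0 (if it does not already exist) and
# give it the address that will act as the Pods' gateway.
ip link add cni0 type bridge 2>/dev/null || true
ip addr add 10.244.0.1/24 dev cni0
ip link set cni0 up

# Create a network namespace standing in for the container.
ip netns add c1

# Create a veth pair: one end stays on the host, the other moves into the container.
ip link add veth-host type veth peer name veth-c1
ip link set veth-c1 netns c1

# Attach the host end to the bridge and bring it up.
ip link set veth-host master cni0
ip link set veth-host up

# Give the container end an address from the node's Pod subnet and a default
# route pointing at the bridge address.
ip netns exec c1 ip link set veth-c1 up
ip netns exec c1 ip addr add 10.244.0.2/24 dev veth-c1
ip netns exec c1 ip route add default via 10.244.0.1
```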

One of the simplest routing schemes: Flannel host-gw

In this scheme each Node gets an exclusive network segment: each subnet is bound to one Node, and the gateway is set locally as well, i.e. directly on the internal port of the cni0 bridge. The benefit is simple management; the downside is that a Pod cannot be migrated across Nodes, because once an IP and its segment belong to a Node, they cannot move to another Node.

[Figure 4]

The essence of this scheme lies in how the route table is set up, as shown in the figure. Let's go through it entry by entry; a sketch of the corresponding routes, expressed as commands, follows the list.

  • The first entry is very simple and is added whenever the network card is configured: it specifies which IP the default route goes out through and what the default device is;
  • The second entry reflects the rule for the local subnet: the segment is 10.244.0.0 with a /24 mask, and its gateway address sits on the bridge, namely 10.244.0.1. In other words, every packet destined for this segment is delivered via the IP on the bridge;
  • The third entry describes the remote end: if the segment is 10.244.1.0 (the subnet on the right of the figure), we use the IP on its host's network card (10.168.0.3) as the gateway. That is, packets sent to the 10.244.1.0 segment should use 10.168.0.3 as their gateway.
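As a rough illustration, the three entries above could be expressed as iproute2 commands on the left-hand host as follows; the host gateway address 10.168.0.1 is an assumption, and the other names and subnets come from the example in the figure.

```bash
# 1) Default route: traffic for unknown destinations leaves via the host gateway on eth0
#    (10.168.0.1 is an assumed gateway address).
ip route add default via 10.168.0.1 dev eth0

# 2) Local Pod subnet: 10.244.0.0/24 is reachable directly over the cni0 bridge,
#    whose own address (10.244.0.1) acts as the Pods' gateway.
ip route add 10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1

# 3) Remote Pod subnet: packets for 10.244.1.0/24 are sent to the peer host's
#    eth0 address (10.168.0.3), which acts as the gateway for that subnet.
ip route add 10.244.1.0/24 via 10.168.0.3 dev eth0
```

The peer host carries the mirror image of these routes for its own subnet.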

Now let's see how a packet actually travels.

Suppose the container at 10.244.0.2 wants to send a packet to 10.244.1.3. It first produces a TCP or UDP packet locally, then fills in the destination IP, its own Ethernet MAC as the source MAC, and the destination MAC. The container normally has a default route that uses the IP on cni0 as its default gateway, so the destination MAC is the MAC address of that gateway. The packet is then delivered onto the bridge. If the destination were in the same segment on this bridge, switching at the MAC layer would be enough.

In our case the destination IP is not in the local segment, so the bridge hands the packet to the host's protocol stack for processing. The host stack then has to find the MAC address of the next hop: using 10.168.0.3 as the gateway, a local ARP probe yields the MAC address of 10.168.0.3. So, as the stack assembles the headers layer by layer, the destination MAC (Dst-MAC) is filled in with the MAC of the peer host's network card, and the packet is sent out of the local eth0 toward the peer's eth0.

This reveals an implicit limitation: once that destination MAC is filled in, it must be directly reachable. If the two hosts are not connected at layer 2, but are instead separated by intermediate gateways or some complicated routing, the MAC cannot be reached directly and this scheme cannot be used. When the packet arrives at the host owning that MAC, the host sees that the frame is indeed addressed to it but the IP is not its own, so the forwarding process begins: the packet is pushed up to the protocol stack and goes through routing again, which finds that 10.244.1.0/24 should be sent to the gateway 10.244.1.1, thereby reaching the cni0 bridge. cni0 then resolves the MAC address corresponding to 10.244.1.3, and through the bridging mechanism the packet finally arrives at the destination container.

As you can see, the whole journey keeps alternating between layer 2 and layer 3: encapsulate at layer 2 to send, route at layer 3, then back to layer 2, like a big loop nested with small loops. This is a relatively simple scheme. If a tunnel is used in the middle, there would be a tunnel device such as a VXLAN interface; in that case the route does not point at a directly reachable gateway but at the tunnel towards the peer.
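For comparison, a very rough sketch of creating such a VXLAN tunnel device by hand is shown below. The device name, VNI, VTEP address and port are assumptions for illustration, not the exact configuration flannel's VXLAN backend applies, and the FDB and neighbor entries that make the tunnel actually deliver packets are omitted.

```bash
# Create a VXLAN device on top of eth0 (VNI 1; 8472 is the UDP port flannel's
# VXLAN backend conventionally uses). Names and numbers here are illustrative.
ip link add vxlan0 type vxlan id 1 dev eth0 dstport 8472 nolearning
ip addr add 10.244.0.0/32 dev vxlan0      # assumed VTEP address for this node
ip link set vxlan0 up

# Point the remote Pod subnet at the tunnel device rather than at a layer-2
# reachable gateway. The FDB/neighbor entries mapping the remote VTEP to the
# peer host are omitted; in practice the CNI plugin programs those.
ip route add 10.244.1.0/24 via 10.244.1.0 dev vxlan0 onlink
```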

How exactly does Service work

Service is essentially a load balancing (Load Balance) mechanism.

We think of it as client-side (Client Side) load balancing: the translation from VIP to RIP is already completed on the client side, without traffic having to reach a centralized component such as an NGINX or an ELB to make the decision.

[Figure 5]

Its implementation works like this: a group of Pods forms the functional backend, and a virtual IP is defined on the front end as the access entry point. Since an IP is hard to remember, a DNS domain name is usually attached as well; the client first resolves the domain name to get the virtual IP, which is then translated into a real IP. Kube-proxy is the core of the whole mechanism and hides a great deal of complexity. It works by watching Pod/Service changes through the apiserver (for example, whether a Service or a Pod has been added) and reflecting them into local rules or a user-space process.

An LVS version of Service

Let's actually build an LVS version of Service by hand. LVS is a kernel mechanism dedicated to load balancing; it works at layer 4 and performs better than an iptables-based implementation.

Suppose we are a kube-proxy and we receive a Service configuration, as shown below: it has a Cluster IP whose port is 9376, which needs to be forwarded to port 80 on the containers, and there are three working Pods whose IPs are 10.1.2.3, 10.1.14.5 and 10.1.3.8.

[Figure 6]

What it needs to do is:

[Figure 7]

  • Step 1: bind the VIP locally (deceive the kernel);

First you need to make the kernel believe it owns this virtual IP. This is dictated by how LVS works: since it operates at layer 4, it does not care about IP forwarding; only if the kernel considers the IP its own will it unpack the packet up to the TCP or UDP layer. So in the first step we install the IP into the kernel to tell it that it really does own this IP. There are many ways to do this; here we use `ip route` to add a local route directly, but putting the IP on a dummy device works as well.
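A minimal sketch of this step, assuming a placeholder Cluster IP of 10.0.0.10 (the real value appears only in the figure, so treat the address purely as an example):

```bash
# Placeholder value: the actual Cluster IP is shown in the figure, not in the text.
VIP=10.0.0.10

# Option A: install a local route for the VIP so the kernel treats it as its own.
ip route add local ${VIP}/32 dev lo

# Option B (equivalent): put the VIP on a dummy device instead.
# ip link add dev dummy0 type dummy
# ip addr add ${VIP}/32 dev dummy0
```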

  • Step 2: create an IPVS virtual server for this virtual IP;

We need to tell IPVS which IP it should load-balance and distribute traffic for; the remaining parameters are the distribution (scheduling) policy and so on. The virtual server's IP is in fact our Cluster IP.
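Sketched with the ipvsadm tool, reusing the placeholder VIP above and port 9376 from the example; round-robin scheduling is an assumed choice:

```bash
# Create an IPVS virtual server on VIP:9376 with round-robin scheduling.
ipvsadm -A -t ${VIP}:9376 -s rr
```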

  • Step 3: create the corresponding real servers for the IPVS service.

We need to configure the corresponding real servers for the virtual server, i.e. the backends that actually provide the service. For example, we just saw three Pods, so we attach those three IPs to the virtual server, one rule per Pod. What kube-proxy does is similar: it only needs to watch for Pod changes. If the number of Pods grows to five, the rules should become five; if a Pod dies or is killed, the corresponding rule should be removed; and if the whole Service is deleted, all of these rules must be deleted. So kube-proxy is really doing management-level work.
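The last step, again sketched with ipvsadm; the three Pod IPs and container port 80 come from the example, while masquerading (NAT) mode is an assumption:

```bash
# Add the three Pods as real servers behind VIP:9376, forwarding to port 80
# on each Pod in masquerading (NAT) mode.
ipvsadm -a -t ${VIP}:9376 -r 10.1.2.3:80  -m
ipvsadm -a -t ${VIP}:9376 -r 10.1.14.5:80 -m
ipvsadm -a -t ${VIP}:9376 -r 10.1.3.8:80  -m

# Inspect the resulting virtual/real server table.
ipvsadm -L -n
```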

What? Load balancing also comes in internal and external flavors

Finally, let's look at the types of Service, which fall into the following four categories.

ClusterIP

A virtual IP inside the cluster that is bound to the group of Pods backing the service; this is the default Service type. Its drawback is that it can only be used inside the cluster, i.e. from the Nodes.

NodePort

For calls from outside the cluster. The Service is exposed on a static port on each Node, with a one-to-one correspondence between the port number and the Service, so users outside the cluster can reach the Service via <NodeIP>:<NodePort>.

LoadBalancer

An extension interface for cloud vendors. Providers such as Alibaba Cloud and Amazon already have mature load-balancing (LB) mechanisms, possibly implemented by very large clusters; to avoid wasting that capability, a vendor can extend Kubernetes through this interface. It first automatically creates the NodePort and ClusterIP mechanisms, and the cloud vendor can choose to hook its LB directly onto either of them, or skip both and attach the Pods' RIPs directly to the backend of the vendor's ELB.

ExternalName

Abandon the internal mechanisms and rely on external facilities. For example, a particularly capable user may feel that none of what we provide is useful and insist on implementing everything themselves; in that case a Service is simply mapped one-to-one to a domain name, and all the load-balancing work is done externally.
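To make the four types a bit more concrete, here is a rough sketch of how each could be created with kubectl; the deployment name web, the Service names and the external domain are illustrative assumptions, not taken from the article:

```bash
# ClusterIP (the default): expose a deployment on an in-cluster virtual IP.
kubectl expose deployment web --port=9376 --target-port=80

# NodePort: additionally expose the Service on a static port of every Node.
kubectl expose deployment web --name=web-nodeport --port=9376 --target-port=80 --type=NodePort

# LoadBalancer: ask the cloud provider to front the Service with its own LB.
kubectl expose deployment web --name=web-lb --port=9376 --target-port=80 --type=LoadBalancer

# ExternalName: map the Service to an external DNS name (no in-cluster proxying).
kubectl create service externalname web-ext --external-name=example.com
```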

Below is an example. It flexibly applies ClusterIP, NodePort and other Service types, and combines them with the cloud vendor's ELB, forming a highly elastic, very scalable and genuinely production-ready system.

[Figure 8]

First, we use ClusterIP as the service entry for the functional Pods. As you can see, if there are three kinds of Pods, there are three Service Cluster IPs acting as their service entries. All of this is on the client side; how do we exert some control on the server side?

First, some Ingress Pods are started (Ingress is a service Kubernetes added later; in essence it is still a bunch of homogeneous Pods). These Pods are then organized and exposed on a NodePort; Kubernetes' part of the work ends here.

Any user hitting port 23456 on an Ingress Pod reaches the service. Behind it sits a controller that manages the Service IPs behind the Ingress, eventually translating them into ClusterIPs and then into our functional Pods. In front of this we attach the cloud vendor's ELB mentioned earlier: we let the ELB listen on port 23456 of all cluster nodes, and as long as something is serving on port 23456, an instance of the Ingress is considered to be running there.

The overall traffic flow is: an external domain name is resolved to the cloud vendor's ELB, which splits the traffic; the ELB load-balances it onto the Ingress's NodePort; and the Ingress then calls back to the real Pods via ClusterIP. Such a system looks fairly rich and is quite robust: no link in the chain is a single point of failure, and every link has management and feedback.

Summary

That concludes the main content of this article; here is a brief summary:

  • We traced the origin and evolution of the Kubernetes network model from first principles, and understood the intent behind the per-Pod-per-IP design;
  • Networking always boils down to the same essentials: following the model downward from layer 4 is the sending process, and peeling the layers off in reverse is the receiving process; container networking is no different;
  • Mechanisms such as Ingress work at a higher level (service <-> port) to make it easier to expose a cluster's services externally; by walking through a genuinely usable example deployment, we hope you can relate concepts such as Ingress + Cluster IP + Pod IP to one another and understand the community's approach when it introduces new mechanisms and new resource objects.


Source: www.cnblogs.com/jinanxiaolaohu/p/12493031.html