[Cloud native | Learn Kubernetes from scratch] 20. A detailed explanation of the Service proxy component, kube-proxy

This article is included in the column "Learn k8s from scratch".
Previous article: Kubernetes core technology Service in practice


Introduction to the kube-proxy component

A Kubernetes Service only abstracts the way an application provides services to the outside world; the real application runs in containers inside Pods. When our request reaches the nodePort on a Kubernetes node, how does it travel onward to the Pods that actually provide the backend service? That is achieved through kube-proxy.

kube-proxy is deployed on every Node of a k8s cluster and is a core Kubernetes component. When we create a Service, kube-proxy adds rules to iptables to implement routing and load balancing for us. kube-proxy has defaulted to iptables mode, implementing Service load balancing through the iptables rules on each node; but as the number of Services grows, iptables mode degrades significantly because its rules are matched linearly and updated in full. Starting with k8s 1.8, kube-proxy introduced IPVS mode. Like iptables, IPVS is based on netfilter, but it uses a hash table, so once the number of Services reaches a certain scale, the speed advantage of hash-table lookup shows and Service performance improves.

kube-proxy runs on every node; you can see its Pods in the kube-system namespace:

kubectl get pods -n kube-system -o wide
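
To confirm which proxy mode a running kube-proxy instance is using, two options (a sketch; the Pod name is a placeholder, and the metrics port is assumed to be the default 10249):

# grep the startup log line that names the proxier in use
kubectl logs -n kube-system kube-proxy-xxxxx | grep -i proxier

# or, from the node itself, query kube-proxy's metrics endpoint
curl http://127.0.0.1:10249/proxyMode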

A Service is a service abstraction for a group of Pods, equivalent to a load balancer (LB) in front of them, responsible for distributing requests to the corresponding Pods. The Service provides an IP for this LB, generally called the cluster IP. kube-proxy's role is to implement the Service: concretely, it realizes access to the Service from Pods inside the cluster and access to the Service from node ports outside.

1. kube-proxy is in effect the managed entry point for Services, covering both access to Services from Pods inside the cluster and access to Services from outside the cluster.

2. kube-proxy manages the endpoints of a Service. A Service exposes a virtual IP, also called the Cluster IP; by accessing Cluster IP:Port within the cluster, you reach the Pods behind the corresponding Service.
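
This mapping is directly visible through the Endpoints object, which lists the Pod IP:Port pairs behind a Service. A quick check (using the my-nginx Service from the examples later in this article):

kubectl get svc my-nginx          # shows the virtual Cluster IP
kubectl get endpoints my-nginx    # lists the backend Pod IP:Port pairs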

Three working modes of kube-proxy

1. userspace mode:


When a Client Pod wants to access a Server Pod, the request first hits the Service iptables rules in kernel space, which redirect it to the port that kube-proxy is listening on. kube-proxy, running in user space, accepts the request, chooses a backend Server Pod, and then proxies the request back through kernel space to that Pod.

This mode has a big problem: a client request first enters kernel space, is then copied up to kube-proxy in user space, and after kube-proxy has repackaged it, re-enters kernel-space iptables to be distributed to the Pods. Because traffic must bounce back and forth between user space and kernel space, this mode is very inefficient. Until Kubernetes 1.2, userspace was the default proxy mode.

2. iptables mode:


When a client sends a request, it goes straight to the Service IP in the local kernel, and the iptables rules forward the request directly to a Pod; no user-space hop is involved. Because the forwarding is done with iptables NAT, there is still a non-negligible performance cost. In addition, if the cluster has tens of thousands of Services/Endpoints, the iptables rule set on each Node becomes very large (rules are matched linearly), and performance drops further, as the check below illustrates. The iptables proxy mode was introduced in Kubernetes 1.1 and has been the default since version 1.2.
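
A rough way to watch this rule growth on a node (a sketch; the counts depend entirely on your cluster):

iptables -t nat -S | wc -l            # total NAT rules on this node
iptables -t nat -S | grep -c KUBE-    # roughly the kube-proxy-managed portion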

3. ipvs mode:


Kubernetes introduced the ipvs proxy mode as alpha in version 1.8, beta in 1.9, and GA in 1.11 (iptables remains the default mode, so ipvs must be enabled explicitly). When a client request reaches kernel space, it is distributed directly to a Pod according to the ipvs rules. kube-proxy watches Kubernetes Service objects and Endpoints, calls the netlink interface to create ipvs rules accordingly, and periodically synchronizes the ipvs rules with the Service and Endpoints objects to ensure that the ipvs state matches expectations. When the Service is accessed, traffic is redirected to one of the backend Pods. Like iptables, ipvs is based on netfilter's hook functionality, but it uses a hash table as the underlying data structure and works in kernel space. This means ipvs can redirect traffic faster and has better performance when synchronizing proxy rules. Additionally, ipvs provides more load-balancing algorithm options (a configuration sketch follows the list):

rr: round-robin
lc: least connections
dh: destination hashing
sh: source hashing
sed: shortest expected delay
nq: never queue
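
To switch to ipvs mode, the kernel modules must be available and kube-proxy must be configured for it. A minimal sketch for a kubeadm-installed cluster, assuming the usual kube-proxy ConfigMap layout (field names come from KubeProxyConfiguration; verify against your version):

# check that the ipvs kernel modules are loaded on each node
lsmod | grep -E 'ip_vs|nf_conntrack'

# edit the kube-proxy ConfigMap and set:
#   mode: "ipvs"
#   ipvs:
#     scheduler: "rr"    # one of the algorithms listed above
kubectl edit configmap kube-proxy -n kube-system

# recreate the kube-proxy Pods so they pick up the new mode (kubectl >= 1.15)
kubectl rollout restart daemonset kube-proxy -n kube-system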

If a Service's backend Pods change (say its label selector now matches one more Pod), that information is reflected on the apiserver immediately. kube-proxy watches the apiserver for such changes and immediately converts them into ipvs or iptables rules; all of this is dynamic and real-time, and the same applies when a Pod is deleted.
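
You can watch this happen yourself: keep a watch on the Endpoints object while scaling the backing Deployment (names follow the my-nginx example used later in this article):

kubectl get endpoints my-nginx -w                 # run in one terminal
kubectl scale deployment my-nginx --replicas=3    # run in another terminal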


In every one of the modes above, kube-proxy uses a watch on the apiserver to track the latest Pod state (which the apiserver persists to etcd). Once it detects that a Pod resource has been deleted or newly created, it immediately reflects the change in the iptables or ipvs rules, so iptables and ipvs never schedule a Client Pod's request to a Server Pod that no longer exists. Since k8s 1.11, the ipvs mode has been GA; if kube-proxy is configured to use ipvs but the kernel's ipvs support is not available, it falls back to iptables rules.
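
When kube-proxy is running in ipvs mode, the rules it programs can be inspected on a node with the ipvsadm tool (a sketch; ipvsadm must be installed separately):

ipvsadm -Ln    # lists each Service virtual address and its backend Pod real servers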

Analysis of iptables rules generated by kube-proxy

1. iptables rule analysis for a Service of type ClusterIP

Although a Service created in k8s has an IP address, the Service's IP is virtual: it does not exist on any physical machine; it lives only in the iptables or ipvs rules.
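
The two manifests applied below are not included in the post. A minimal sketch consistent with the output that follows (a Deployment named my-nginx with two replicas labeled run=my-nginx, and a ClusterIP Service on port 80; the image and other details are assumptions):

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      run: my-nginx
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx          # assumed image
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  labels:
    run: my-nginx
spec:
  selector:
    run: my-nginx
  ports:
  - port: 80
    protocol: TCP
EOF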

[root@k8smaster service]# kubectl apply -f pod_test.yaml 
[root@k8smaster service]# kubectl apply -f service_test.yaml 
[root@k8smaster node]# kubectl get svc -l run=my-nginx
NAME       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
my-nginx   ClusterIP   10.105.254.244   <none>        80/TCP    15s
[root@k8smaster node]# kubectl get pods -l run=my-nginx -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE       NOMINATED NODE  
my-nginx-5898cf8d98-5trvw   1/1     Running   0          40s   10.244.1.5   k8snode2   <none>          
my-nginx-5898cf8d98-phfqr   1/1     Running   0          40s   10.244.1.4   k8snode2   <none>          
[root@k8smaster node]# iptables -t nat -L | grep 10.105.254.244
KUBE-MARK-MASQ  tcp  -- !10.244.0.0/16        10.105.254.244       /* default/my-nginx: cluster IP */ tcp dpt:http
KUBE-SVC-BEPXDJBUHFCSYIC3  tcp  --  anywhere             10.105.254.244       /* default/my-nginx: cluster IP */ tcp dpt:http

[root@k8smaster node]# iptables -t nat -L | grep KUBE-SVC-BEPXDJBUHFCSYIC3
KUBE-SVC-BEPXDJBUHFCSYIC3  tcp  --  anywhere             10.105.254.244       /* default/my-nginx: cluster IP */ tcp dpt:http			# forwards to the Pods associated with the Service
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (1 references)		
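
To see the whole chain instead of grepping line by line, list the KUBE-SVC chain directly; it contains one jump per backend Pod into a KUBE-SEP-xxx (service endpoint) chain:

iptables -t nat -L KUBE-SVC-BEPXDJBUHFCSYIC3 -n    # one KUBE-SEP-xxx target per endpoint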

[root@k8smaster node]# iptables -t nat -L | grep 10.244.1.5
KUBE-MARK-MASQ  all  --  10.244.1.5           anywhere             /* default/my-nginx: */
DNAT       tcp  --  anywhere             anywhere             /* default/my-nginx: */ tcp to:10.244.1.5:80
# DNAT forwarding: the KUBE-SVC chain receives the request; a KUBE-MARK-MASQ rule marks traffic coming from the Pod IP, and a DNAT rule then forwards the request to the Pod at 10.244.1.5:80

From the above you can see that for the Service created earlier, kube-proxy generates iptables rules to route its traffic: a series of rules whose targets are KUBE-SVC-xxx chains, each matching a destination IP and port. In other words, a request to a Service's IP and port is handed to a KUBE-SVC-xxx chain and DNATed to a corresponding Pod IP and port.
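
You can verify the whole path by requesting the ClusterIP from a cluster node (10.105.254.244 is the ClusterIP shown above; assuming the Pods run nginx, the response is nginx's default page):

curl http://10.105.254.244:80    # DNATed by the KUBE-SVC chain to one of the two Pods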

2. iptables rule analysis for a Service of type NodePort

[root@k8smaster node]# kubectl apply -f pod_nodeport.yaml 
deployment.apps/my-nginx-nodeport created
[root@k8smaster node]# kubectl apply -f service_nodeport.yaml 
service/my-nginx-nodeport created

[root@k8smaster node]# kubectl get pods -l run=my-nginx-nodeport -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP           NODE       NOMINATED
my-nginx-nodeport-5fccbb754b-m4csx   1/1     Running   0          34s   10.244.1.7   k8snode2   <none>      
my-nginx-nodeport-5fccbb754b-rg48l   1/1     Running   0          34s   10.244.1.6   k8snode2   <none>
[root@k8smaster node]# kubectl get svc -l run=my-nginx-nodeport 
NAME                TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
my-nginx-nodeport   NodePort   10.105.58.82   <none>        80:30380/TCP   39s
 
[root@k8smaster node]# iptables -t nat -S | grep 30380 
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/my-nginx-nodeport:" -m tcp --dport 30380 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/my-nginx-nodeport:" -m tcp --dport 30380 -j KUBE-SVC-6JXEEPSEELXY3JZG
# one is the mark chain, the other the svc chain; a request to the node's IP and port traverses these two chains first

[root@k8smaster node]# iptables -t nat -S | grep KUBE-SVC-6JXEEPSEELXY3JZG
-N KUBE-SVC-6JXEEPSEELXY3JZG
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/my-nginx-nodeport:" -m tcp --dport 30380 -j KUBE-SVC-6JXEEPSEELXY3JZG
-A KUBE-SERVICES -d 10.105.58.82/32 -p tcp -m comment --comment "default/my-nginx-nodeport: cluster IP" -m tcp --dport 80 -j KUBE-SVC-6JXEEPSEELXY3JZG
-A KUBE-SVC-6JXEEPSEELXY3JZG -m comment --comment "default/my-nginx-nodeport:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-36FBCF7ZW3VDH33Q
-A KUBE-SVC-6JXEEPSEELXY3JZG -m comment --comment "default/my-nginx-nodeport:" -j KUBE-SEP-K2MGI3AJIGBK3IJ5
# iptables' statistic/probability mechanism sends the request into the KUBE-SEP-36FBCF7ZW3VDH33Q chain with probability 0.50; the remaining 50% falls through to the last rule, the ...GBK3IJ5 chain. (In general, with n endpoints the k-th rule uses probability 1/(n-k+1), so each endpoint receives 1/n of the traffic.)

[root@k8smaster node]# iptables -t nat -S | grep KUBE-SEP-36FBCF7ZW3VDH33
-N KUBE-SEP-36FBCF7ZW3VDH33Q
-A KUBE-SEP-36FBCF7ZW3VDH33Q -s 10.244.1.6/32 -m comment --comment "default/my-nginx-nodeport:" -j KUBE-MARK-MASQ
-A KUBE-SEP-36FBCF7ZW3VDH33Q -p tcp -m comment --comment "default/my-nginx-nodeport:" -m tcp -j DNAT --to-destination 10.244.1.6:80
-A KUBE-SVC-6JXEEPSEELXY3JZG -m comment --comment "default/my-nginx-nodeport:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-36FBCF7ZW3VDH33Q

# The rule '-A KUBE-SEP-36FBCF7ZW3VDH33Q -p tcp -m comment --comment "default/my-nginx-nodeport:" -m tcp -j DNAT --to-destination 10.244.1.6:80' performs a DNAT that hands the request to port 80 of the Pod 10.244.1.6; the listing below shows the same IP on one of the Pods
[root@k8smaster node]# kubectl get pods -l run=my-nginx-nodeport -o wide
NAME                                 READY   STATUS    RESTARTS   AGE     IP           NODE       NOMINATED
my-nginx-nodeport-5fccbb754b-m4csx   1/1     Running   0          8m24s   10.244.1.7   k8snode2   <none>    
my-nginx-nodeport-5fccbb754b-rg48l   1/1     Running   0          8m24s   10.244.1.6   k8snode2   <none>    

 
[root@k8smaster node]# iptables -t nat -S | grep KUBE-SEP-K2MGI3AJIGBK3IJ5
-N KUBE-SEP-K2MGI3AJIGBK3IJ5
-A KUBE-SEP-K2MGI3AJIGBK3IJ5 -s 10.244.1.7/32 -m comment --comment "default/my-nginx-nodeport:" -j KUBE-MARK-MASQ
-A KUBE-SEP-K2MGI3AJIGBK3IJ5 -p tcp -m comment --comment "default/my-nginx-nodeport:" -m tcp -j DNAT --to-destination 10.244.1.7:80
-A KUBE-SVC-6JXEEPSEELXY3JZG -m comment --comment "default/my-nginx-nodeport:" -j KUBE-SEP-K2MGI3AJIGBK3IJ5
# also a DNAT, handing the request to the other Pod; between these two chains traffic is split 50/50 across the two Pods
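
As in the ClusterIP case, the path can be verified end to end by hitting the NodePort from outside the cluster (a sketch; replace <node-ip> with one of your nodes' addresses, 30380 is the NodePort shown above):

curl http://<node-ip>:30380    # enters KUBE-NODEPORTS, then KUBE-SVC-6JXEEPSEELXY3JZG, then a KUBE-SEP DNAT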

Write at the end

Creating this content is not easy. If you find it helpful, please like, favorite, and follow to support me! If there are any mistakes, please point them out in the comments and I will fix them promptly!
The series currently being updated: Learn k8s from scratch.
Thank you for reading. The article includes my personal understanding; if anything is wrong, please contact me and point it out~

Origin blog.csdn.net/qq_45400861/article/details/126850632