[kubernetes/k8s source code analysis] kube-proxy IPVS source code analysis

Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/zhonglinzhang/article/details/84643387

Kubernetes version: 1.12.1. Source path: pkg/proxy/ipvs/proxier.go

This article covers only the IPVS-related parts; the kube-proxy startup flow was covered in an earlier article:

https://blog.csdn.net/zhonglinzhang/article/details/80185053

WHY IPVS

    Although Kubernetes has supported 5000 nodes since v1.6, kube-proxy in iptables mode is in practice a bottleneck for scaling a cluster to that size. In a 5000-node cluster using NodePort Services, with 2000 Services each backed by 10 pods, at least 20000 iptables rules are generated on every worker node, which can keep the kernel very busy.

WHAT ?

   kube-proxy therefore introduced an IPVS mode. Both IPVS and iptables are built on Netfilter, but IPVS uses hash tables for its lookups, so when the number of Services grows very large the advantage of hashed lookup (roughly O(1) versus scanning a linear rule chain) becomes significant and Service lookup performance improves.

HOW IPVS?

  kube-proxy startup flags

  /usr/bin/kube-proxy --bind-address=10.12.51.172 --hostname-override=10.12.51.172 --cluster-cidr=10.254.0.0/16 --kubeconfig=/etc/kubernetes/kube-proxy.kubeconfig --logtostderr=true --v=2

--ipvs-scheduler=wrr --ipvs-min-sync-period=5s --ipvs-sync-period=5s --proxy-mode=ipvs

Flag --masquerade-all=true

    If set, IPVS masquerades all traffic that accesses a Service cluster IP; the behavior is then the same as in iptables mode.

Flag --cluster-cidr=<cidr>: the pod CIDR of the cluster, used to tell in-cluster traffic from off-cluster traffic when deciding what to masquerade (see the KUBE-MARK-MASQ rule in section 5).

Flag --cleanup-ipvs: if true, clean up the IPVS configuration and iptables rules that were created in IPVS mode.

Flag --ipvs-sync-period: the maximum interval between IPVS rule syncs (e.g. '5s', '1m').

Flag --ipvs-min-sync-period: the minimum interval between IPVS rule syncs (e.g. '5s', '1m').

Flag --ipvs-scheduler: the IPVS scheduling algorithm, rr by default; the startup command above uses wrr. A toy sketch of weighted round-robin follows the list below.

  • rr: round-robin
  • wrr: weighted round-robin (used in the startup command above)
  • lc: least connection
  • dh: destination hashing
  • sh: source hashing
  • sed: shortest expected delay
  • nq: never queue
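As a rough illustration of what wrr means, here is a toy "smooth" weighted round-robin picker. It is only a sketch of the idea; the kernel's ip_vs_wrr scheduler is implemented differently, and realServer is a hypothetical stand-in for an IPVS destination.

package main

import "fmt"

// realServer is a hypothetical stand-in for an IPVS destination (address + weight).
type realServer struct {
	addr    string
	weight  int
	current int
}

// nextWRR picks the next backend using smooth weighted round-robin:
// every call bumps each server by its weight, picks the largest,
// and subtracts the total weight from the winner.
func nextWRR(servers []*realServer) *realServer {
	total := 0
	var best *realServer
	for _, s := range servers {
		s.current += s.weight
		total += s.weight
		if best == nil || s.current > best.current {
			best = s
		}
	}
	best.current -= total
	return best
}

func main() {
	servers := []*realServer{
		{addr: "172.30.3.29:5566", weight: 2},
		{addr: "172.30.46.41:5566", weight: 1},
	}
	// Picks come out in a 2:1 ratio: .29, .41, .29, .29, .41, .29
	for i := 0; i < 6; i++ {
		fmt.Println(nextWRR(servers).addr)
	}
}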

  

How IPVS works

        Quoted from an article found online; it sums the split up at a glance:

        ipvs: runs in kernel space and is what actually enforces the user-defined policies;

        ipvsadm: runs in user space and is the tool used to define and manage virtual-server (cluster) services;

 

IPVS supports three packet-forwarding modes:

       NAT (masquerade), IPIP, and DR.

        Only NAT mode supports port mapping, so kube-proxy uses NAT mode.

How IPVS DR mode works

How ipset works

   ipset is an extension to iptables that lets a single rule match an entire set of addresses, whereas an ordinary iptables rule matches only one IP at a time. The IP sets are stored in indexed data structures, so lookups stay efficient even when a set is large. Official site: http://ipset.netfilter.org/

   The IPVS proxier still uses iptables for packet filtering, SNAT, and masquerading. Specifically, it uses ipset to store the source or destination addresses of traffic that must be dropped or masqueraded, which keeps the number of iptables rules constant no matter how many Services exist.

 

Kernel modules

    Make sure the kernel modules IPVS depends on are available: ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, nf_conntrack_ipv4. A small sketch of such a check follows the module list below.

var ipvsModules = []string{
	"ip_vs",
	"ip_vs_rr",
	"ip_vs_wrr",
	"ip_vs_sh",
	"nf_conntrack_ipv4",
}
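One way to verify these modules is simply to scan /proc/modules. The following is a standalone, illustrative sketch only (not the proxier's own check); note that modules built into the kernel rather than loaded as modules will not appear in /proc/modules.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// requiredModules mirrors the ipvsModules list above.
var requiredModules = []string{"ip_vs", "ip_vs_rr", "ip_vs_wrr", "ip_vs_sh", "nf_conntrack_ipv4"}

// loadedModules returns the set of module names listed in /proc/modules.
func loadedModules() (map[string]bool, error) {
	f, err := os.Open("/proc/modules")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	mods := make(map[string]bool)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line starts with the module name, followed by size, refcount, ...
		fields := strings.Fields(scanner.Text())
		if len(fields) > 0 {
			mods[fields[0]] = true
		}
	}
	return mods, scanner.Err()
}

func main() {
	mods, err := loadedModules()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range requiredModules {
		if !mods[m] {
			fmt.Printf("missing kernel module: %s\n", m)
		}
	}
}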

    

1. The NewProxier function

  1.1 Set kernel parameters

  • net/ipv4/conf/all/route_localnet: whether traffic from outside is allowed to reach localhost addresses
  • net/bridge/bridge-nf-call-iptables: when set to 1, packets forwarded by a Layer 2 bridge are also run through the iptables FORWARD rules, i.e. L3 iptables rules end up filtering L2 frames
  • net/ipv4/vs/conntrack: keep connection-tracking entries for IPVS connections (needed by the SNAT/masquerade rules)
  • net/ipv4/ip_forward: whether IPv4 forwarding is enabled (0: disabled, 1: enabled)
	// Set the route_localnet sysctl we need for
	if err := sysctl.SetSysctl(sysctlRouteLocalnet, 1); err != nil {
		return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlRouteLocalnet, err)
	}

	// Proxy needs br_netfilter and bridge-nf-call-iptables=1 when containers
	// are connected to a Linux bridge (but not SDN bridges).  Until most
	// plugins handle this, log when config is missing
	if val, err := sysctl.GetSysctl(sysctlBridgeCallIPTables); err == nil && val != 1 {
		glog.Infof("missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended")
	}

	// Set the conntrack sysctl we need for
	if err := sysctl.SetSysctl(sysctlVSConnTrack, 1); err != nil {
		return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlVSConnTrack, err)
	}

	// Set the ip_forward sysctl we need for
	if err := sysctl.SetSysctl(sysctlForward, 1); err != nil {
		return nil, fmt.Errorf("can't set sysctl %s: %v", sysctlForward, err)
	}
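These sysctl helpers boil down to writing the value into the matching file under /proc/sys. A minimal standalone equivalent, assuming the slash-separated names used above (requires root; illustrative only, not the utilsysctl helper itself):

package main

import (
	"fmt"
	"os"
	"path"
	"strings"
)

// setSysctl writes an integer value to /proc/sys/<name>. The name may be
// given in the slash form used above ("net/ipv4/ip_forward") or dotted form.
func setSysctl(name string, value int) error {
	p := path.Join("/proc/sys", strings.ReplaceAll(name, ".", "/"))
	return os.WriteFile(p, []byte(fmt.Sprintf("%d", value)), 0o644)
}

func main() {
	for _, s := range []string{
		"net/ipv4/conf/all/route_localnet",
		"net/ipv4/vs/conntrack", // only present once the ip_vs module is loaded
		"net/ipv4/ip_forward",
	} {
		if err := setSysctl(s, 1); err != nil {
			fmt.Fprintf(os.Stderr, "can't set sysctl %s: %v\n", s, err)
		}
	}
}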

  1.2 Initialize the ipset list

    Load the IP sets defined in ipsetInfo into the ipsetList map

	// initialize ipsetList with all sets we needed
	proxier.ipsetList = make(map[string]*IPSet)
	for _, is := range ipsetInfo {
		if is.isIPv6 {
			proxier.ipsetList[is.name] = NewIPSet(ipset, is.name, is.setType, isIPv6, is.comment)
		}
		proxier.ipsetList[is.name] = NewIPSet(ipset, is.name, is.setType, false, is.comment)
	}

  1.3 Initialize syncRunner; the core callback is syncProxyRules

	proxier.syncRunner = async.NewBoundedFrequencyRunner("sync-runner", proxier.syncProxyRules, minSyncPeriod, syncPeriod, burstSyncs)
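The runner calls syncProxyRules at most once per minSyncPeriod (--ipvs-min-sync-period) and at least once per syncPeriod (--ipvs-sync-period). A stripped-down standalone sketch of that behavior (not the real BoundedFrequencyRunner in pkg/util/async, which rate-limits with a token bucket, hence the burstSyncs argument):

package main

import (
	"fmt"
	"time"
)

// boundedSync calls sync at least once every maxPeriod and, when requests
// arrive on trigger, no more often than once every minPeriod.
func boundedSync(sync func(), minPeriod, maxPeriod time.Duration, trigger <-chan struct{}, stop <-chan struct{}) {
	lastRun := time.Now().Add(-minPeriod) // allow an immediate first run
	maxTimer := time.NewTimer(maxPeriod)
	defer maxTimer.Stop()
	for {
		select {
		case <-stop:
			return
		case <-maxTimer.C:
			// Periodic full resync.
		case <-trigger:
			// Service/endpoint change: rate-limit to minPeriod.
			if wait := minPeriod - time.Since(lastRun); wait > 0 {
				time.Sleep(wait)
			}
		}
		sync()
		lastRun = time.Now()
		if !maxTimer.Stop() {
			select {
			case <-maxTimer.C:
			default:
			}
		}
		maxTimer.Reset(maxPeriod)
	}
}

func main() {
	trigger := make(chan struct{}, 1)
	stop := make(chan struct{})
	go boundedSync(func() { fmt.Println("sync at", time.Now().Format("15:04:05.000")) },
		2*time.Second, 10*time.Second, trigger, stop)
	trigger <- struct{}{} // simulate a Service update
	time.Sleep(5 * time.Second)
	close(stop)
}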

2. The syncProxyRules function

  2.1 Reset the four buffers

   Write *filter and *nat at the head of the buffers to mark the start of each table

	// Reset all buffers used later.
	// This is to avoid memory reallocations and thus improve performance.
	proxier.natChains.Reset()
	proxier.natRules.Reset()
	proxier.filterChains.Reset()
	proxier.filterRules.Reset()

	// Write table headers.
	writeLine(proxier.filterChains, "*filter")
	writeLine(proxier.natChains, "*nat")

  2.2 Create the dummy device

    # ip route show table local type local proto kernel

  • 10.12.51.172 dev eth0  scope host  src 10.12.51.172 
  • 10.254.0.1 dev kube-ipvs0  scope host  src 10.254.0.1 
  • 10.254.0.2 dev kube-ipvs0  scope host  src 10.254.0.2 
  • 10.254.69.27 dev kube-ipvs0  scope host  src 10.254.69.27 
  • 10.254.86.39 dev kube-ipvs0  scope host  src 10.254.86.39 
  • 127.0.0.0/8 dev lo  scope host  src 127.0.0.1 
  • 127.0.0.1 dev lo  scope host  src 127.0.0.1 
  • 172.30.46.1 dev docker0  scope host  src 172.30.46.1
	// make sure dummy interface exists in the system where ipvs Proxier will bind service address on it
	_, err := proxier.netlinkHandle.EnsureDummyDevice(DefaultDummyDevice)
	if err != nil {
		glog.Errorf("Failed to create dummy interface: %s, error: %v", DefaultDummyDevice, err)
		return
	}
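What EnsureDummyDevice and the address binding (section 7 below) achieve can be reproduced by hand with the ip tool. A sketch that shells out, using the 10.254.0.1 address from the route table above; kube-proxy itself does this through netlink (proxier.netlinkHandle), not by exec'ing commands:

package main

import (
	"fmt"
	"os/exec"
)

func run(args ...string) {
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		fmt.Printf("%v failed: %v\n%s", args, err, out)
	}
}

func main() {
	// Create the dummy interface that kube-proxy binds Service VIPs to.
	run("ip", "link", "add", "kube-ipvs0", "type", "dummy")
	// Bind a Service cluster IP to it, as seen in the local route table above.
	run("ip", "addr", "add", "10.254.0.1/32", "dev", "kube-ipvs0")
}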

  2.3 Create the ipsets

	// make sure ip sets exists in the system.
	for _, set := range proxier.ipsetList {
		if err := ensureIPSet(set); err != nil {
			return
		}
		set.resetEntries()
	}

  The resulting sets look like this (abridged ipset list output):

Name: KUBE-LOOP-BACK
Type: hash:ip,port,ip
Revision: 2
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 16824
References: 1
Members:
172.30.46.39,tcp:6379,172.30.46.39
172.30.3.15,udp:53,172.30.3.15

Name: KUBE-NODE-PORT-TCP
Type: bitmap:port
Revision: 1
Header: range 0-65535
Size in memory: 524432
References: 1
Members:
31011
32371

Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 2
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 16688
References: 2
Members:
10.254.0.2,tcp:53
10.254.0.2,udp:53
10.254.86.39,tcp:6379
10.254.0.1,tcp:443
10.254.69.27,tcp:443
 

3. Build IPVS rules for each Service

// Build IPVS rules for each service.
	for svcName, svc := range proxier.serviceMap {
		svcInfo, ok := svc.(*serviceInfo)
		if !ok {
			glog.Errorf("Failed to cast serviceInfo %q", svcName.String())
			continue
		}

  3.1 Update the KUBE-LOOP-BACK entries

    For example, the members look like this:

  • 172.30.46.39,tcp:6379,172.30.46.39
  • 172.30.3.15,udp:53,172.30.3.15
  • 172.30.3.27,tcp:6379,172.30.3.27
  • 172.30.3.15,tcp:53,172.30.3.15
  • 10.12.51.171,tcp:6443,10.12.51.17
		// Handle traffic that loops back to the originator with SNAT.
		for _, e := range proxier.endpointsMap[svcName] {
			ep, ok := e.(*proxy.BaseEndpointInfo)
			if !ok {
				glog.Errorf("Failed to cast BaseEndpointInfo %q", e.String())
				continue
			}
			epIP := ep.IP()
			epPort, err := ep.Port()
			// Error parsing this endpoint has been logged. Skip to next endpoint.
			if epIP == "" || err != nil {
				continue
			}
			entry := &utilipset.Entry{
				IP:       epIP,
				Port:     epPort,
				Protocol: protocol,
				IP2:      epIP,
				SetType:  utilipset.HashIPPortIP,
			}
			if valid := proxier.ipsetList[kubeLoopBackIPSet].validateEntry(entry); !valid {
				glog.Errorf("%s", fmt.Sprintf(EntryInvalidErr, entry, proxier.ipsetList[kubeLoopBackIPSet].Name))
				continue
			}
			proxier.ipsetList[kubeLoopBackIPSet].activeEntries.Insert(entry.String())
		}

   3.2 Update the KUBE-CLUSTER-IP entries, for example:

  • 10.254.0.2,tcp:53
  • 10.254.0.2,udp:53
  • 10.254.86.39,tcp:6379
  • 10.254.0.1,tcp:443
  • 10.254.69.27,tcp:443
		// Capture the clusterIP.
		// ipset call
		entry := &utilipset.Entry{
			IP:       svcInfo.ClusterIP.String(),
			Port:     svcInfo.Port,
			Protocol: protocol,
			SetType:  utilipset.HashIPPort,
		}
		// add service Cluster IP:Port to kubeServiceAccess ip set for the purpose of solving hairpin.
		// proxier.kubeServiceAccessSet.activeEntries.Insert(entry.String())
		if valid := proxier.ipsetList[kubeClusterIPSet].validateEntry(entry); !valid {
			glog.Errorf("%s", fmt.Sprintf(EntryInvalidErr, entry, proxier.ipsetList[kubeClusterIPSet].Name))
			continue
		}
		proxier.ipsetList[kubeClusterIPSet].activeEntries.Insert(entry.String())

The handling of externalIPs, load-balancer ingress IPs, and the node-port sets (KUBE-NODE-PORT-LOCAL-TCP, KUBE-NODE-PORT-LOCAL-UDP, and so on) follows the same pattern and is skipped here.

4. ipsetWithIptablesChain

If a packet in KUBE-POSTROUTING matches the KUBE-LOOP-BACK ipset, it is masqueraded: -A KUBE-POSTROUTING -m comment --comment "Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose" -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE

-A KUBE-SERVICES -m addrtype --dst-type LOCAL -j KUBE-NODE-PORT
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT

// ipsetWithIptablesChain is the ipsets list with iptables source chain and the chain jump to
// `iptables -t nat -A <from> -m set --match-set <name> <matchType> -j <to>`
// example: iptables -t nat -A KUBE-SERVICES -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-NODE-PORT
// ipsets with other match rules will be created Individually.
// Note: kubeNodePortLocalSetTCP must be prior to kubeNodePortSetTCP, the same for UDP.
var ipsetWithIptablesChain = []struct {
	name          string
	from          string
	to            string
	matchType     string
	protocolMatch string
}{
	{kubeLoopBackIPSet, string(kubePostroutingChain), "MASQUERADE", "dst,dst,src", ""},
	{kubeLoadBalancerSet, string(kubeServicesChain), string(KubeLoadBalancerChain), "dst,dst", ""},
	{kubeLoadbalancerFWSet, string(KubeLoadBalancerChain), string(KubeFireWallChain), "dst,dst", ""},
	{kubeLoadBalancerSourceCIDRSet, string(KubeFireWallChain), "RETURN", "dst,dst,src", ""},
	{kubeLoadBalancerSourceIPSet, string(KubeFireWallChain), "RETURN", "dst,dst,src", ""},
	{kubeLoadBalancerLocalSet, string(KubeLoadBalancerChain), "RETURN", "dst,dst", ""},
	{kubeNodePortLocalSetTCP, string(KubeNodePortChain), "RETURN", "dst", "tcp"},
	{kubeNodePortSetTCP, string(KubeNodePortChain), string(KubeMarkMasqChain), "dst", "tcp"},
	{kubeNodePortLocalSetUDP, string(KubeNodePortChain), "RETURN", "dst", "udp"},
	{kubeNodePortSetUDP, string(KubeNodePortChain), string(KubeMarkMasqChain), "dst", "udp"},
	{kubeNodePortSetSCTP, string(kubeServicesChain), string(KubeNodePortChain), "dst", "sctp"},
	{kubeNodePortLocalSetSCTP, string(KubeNodePortChain), "RETURN", "dst", "sctp"},
}

5. writeIptablesRules

    The rules are written into the nat rule buffer and the filter rule buffer; most of the code that follows repeats this pattern.

	for _, set := range ipsetWithIptablesChain {
		if _, find := proxier.ipsetList[set.name]; find && !proxier.ipsetList[set.name].isEmpty() {
			args = append(args[:0], "-A", set.from)
			if set.protocolMatch != "" {
				args = append(args, "-p", set.protocolMatch)
			}
			args = append(args,
				"-m", "comment", "--comment", proxier.ipsetList[set.name].getComment(),
				"-m", "set", "--match-set", set.name,
				set.matchType,
			)
			writeLine(proxier.natRules, append(args, "-j", set.to)...)
		}
	}

-A KUBE-SERVICES ! -s 10.254.0.0/16 -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ

	if !proxier.ipsetList[kubeClusterIPSet].isEmpty() {
		args = append(args[:0],
			"-A", string(kubeServicesChain),
			"-m", "comment", "--comment", proxier.ipsetList[kubeClusterIPSet].getComment(),
			"-m", "set", "--match-set", kubeClusterIPSet,
		)
		if proxier.masqueradeAll {
			writeLine(proxier.natRules, append(args, "dst,dst", "-j", string(KubeMarkMasqChain))...)
		} else if len(proxier.clusterCIDR) > 0 {
			// This masquerades off-cluster traffic to a service VIP.  The idea
			// is that you can establish a static route for your Service range,
			// routing to any node, and that node will bridge into the Service
			// for you.  Since that might bounce off-node, we masquerade here.
			// If/when we support "Local" policy for VIPs, we should update this.
			writeLine(proxier.natRules, append(args, "dst,dst", "! -s", proxier.clusterCIDR, "-j", string(KubeMarkMasqChain))...)
		} else {
			// Masquerade all OUTPUT traffic coming from a service ip.
			// The kube dummy interface has all service VIPs assigned which
			// results in the service VIP being picked as the source IP to reach
			// a VIP. This leads to a connection from VIP:<random port> to
			// VIP:<service port>.
			// Always masquerading OUTPUT (node-originating) traffic with a VIP
			// source ip and service port destination fixes the outgoing connections.
			writeLine(proxier.natRules, append(args, "src,dst", "-j", string(KubeMarkMasqChain))...)
		}
	}

-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ

	// mark drop for KUBE-LOAD-BALANCER
	writeLine(proxier.natRules, []string{
		"-A", string(KubeLoadBalancerChain),
		"-j", string(KubeMarkMasqChain),
	}...)

	// mark drop for KUBE-FIRE-WALL
	writeLine(proxier.natRules, []string{
		"-A", string(KubeFireWallChain),
		"-j", string(KubeMarkDropChain),
	}...)

-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT

	// If the masqueradeMark has been added then we want to forward that same
	// traffic, this allows NodePort traffic to be forwarded even if the default
	// FORWARD policy is not accept.
	writeLine(proxier.filterRules,
		"-A", string(KubeForwardChain),
		"-m", "comment", "--comment", `"kubernetes forwarding rules"`,
		"-m", "mark", "--mark", proxier.masqueradeMark,
		"-j", "ACCEPT",
	)

    This mainly creates the following rules:

-A KUBE-FORWARD -s 10.254.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 10.254.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

	// The following rules can only be set if clusterCIDR has been defined.
	if len(proxier.clusterCIDR) != 0 {
		// The following two rules ensure the traffic after the initial packet
		// accepted by the "kubernetes forwarding rules" rule above will be
		// accepted, to be as specific as possible the traffic must be sourced
		// or destined to the clusterCIDR (to/from a pod).
		writeLine(proxier.filterRules,
			"-A", string(KubeForwardChain),
			"-s", proxier.clusterCIDR,
			"-m", "comment", "--comment", `"kubernetes forwarding conntrack pod source rule"`,
			"-m", "conntrack",
			"--ctstate", "RELATED,ESTABLISHED",
			"-j", "ACCEPT",
		)
		writeLine(proxier.filterRules,
			"-A", string(KubeForwardChain),
			"-m", "comment", "--comment", `"kubernetes forwarding conntrack pod destination rule"`,
			"-d", proxier.clusterCIDR,
			"-m", "conntrack",
			"--ctstate", "RELATED,ESTABLISHED",
			"-j", "ACCEPT",
		)
	}

6. Load the Linux firewall rules in one batch with iptables-restore

	// Sync iptables rules.
	// NOTE: NoFlushTables is used so we don't flush non-kubernetes chains in the table.
	proxier.iptablesData.Reset()
	proxier.iptablesData.Write(proxier.natChains.Bytes())
	proxier.iptablesData.Write(proxier.natRules.Bytes())
	proxier.iptablesData.Write(proxier.filterChains.Bytes())
	proxier.iptablesData.Write(proxier.filterRules.Bytes())

	glog.V(5).Infof("Restoring iptables rules: %s", proxier.iptablesData.Bytes())
	err = proxier.iptables.RestoreAll(proxier.iptablesData.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)
	if err != nil {
		glog.Errorf("Failed to execute iptables-restore: %v\nRules:\n%s", err, proxier.iptablesData.Bytes())
		// Revert new local ports.
		utilproxy.RevertPorts(replacementPortsMap, proxier.portsMap)
		return
	}
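RestoreAll hands the concatenated buffers to iptables-restore. A standalone sketch of what such a payload looks like and the roughly equivalent manual invocation (rules abridged from the ones shown above; in the iptables-restore format each table must end with COMMIT):

package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Mimic the nat-chains + nat-rules + filter-chains + filter-rules layout.
	var data bytes.Buffer
	data.WriteString("*nat\n")
	data.WriteString(":KUBE-SERVICES - [0:0]\n")
	data.WriteString(":KUBE-POSTROUTING - [0:0]\n")
	data.WriteString("-A KUBE-POSTROUTING -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE\n")
	data.WriteString("COMMIT\n")
	data.WriteString("*filter\n")
	data.WriteString(":KUBE-FORWARD - [0:0]\n")
	data.WriteString("-A KUBE-FORWARD -m mark --mark 0x4000/0x4000 -j ACCEPT\n")
	data.WriteString("COMMIT\n")

	// Restoring with NoFlushTables/RestoreCounters corresponds roughly to:
	//   iptables-restore --noflush --counters < payload
	// so chains that kubernetes does not own are left untouched.
	fmt.Print(data.String())
}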

7. Get the currently bound addresses

	// Clean up legacy bind address
	// currentBindAddrs represents ip addresses bind to DefaultDummyDevice from the system
	currentBindAddrs, err := proxier.netlinkHandle.ListBindAddress(DefaultDummyDevice)
	if err != nil {
		glog.Errorf("Failed to get bind address, err: %v", err)
	}
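ListBindAddress returns every address currently bound to the dummy device; addresses that no longer belong to any Service are then unbound. A simplified standalone sketch of that set difference (illustrative only; the real cleanup logic in the proxier is more involved):

package main

import "fmt"

// legacyAddrs returns the addresses bound to the dummy device that no longer
// correspond to an active Service VIP and can therefore be unbound.
func legacyAddrs(currentBindAddrs, activeAddrs []string) []string {
	active := make(map[string]bool, len(activeAddrs))
	for _, a := range activeAddrs {
		active[a] = true
	}
	var stale []string
	for _, a := range currentBindAddrs {
		if !active[a] {
			stale = append(stale, a)
		}
	}
	return stale
}

func main() {
	current := []string{"10.254.0.1", "10.254.0.2", "10.254.99.99"}
	active := []string{"10.254.0.1", "10.254.0.2"}
	fmt.Println(legacyAddrs(current, active)) // [10.254.99.99]
}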

In IPVS mode, every cluster IP created for a Service is bound to the kube-ipvs0 dummy interface. Creating a ClusterIP Service performs three actions (a manual equivalent is sketched after the ipvsadm output below):

  • make sure the kube-ipvs0 dummy interface exists on the node
  • bind the Service IP address to the dummy interface
  • create an IPVS virtual server for each Service IP address

TCP  10.254.23.85:5566 wrr
  -> 172.30.3.29:5566             Masq    1      0          0         
  -> 172.30.46.41:5566            Masq    1      0          0
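For comparison, the virtual server and real servers in the output above could be created by hand with ipvsadm. A sketch that shells out, using the addresses from the output (illustrative only; kube-proxy programs IPVS through a netlink-based library, not by calling ipvsadm):

package main

import (
	"fmt"
	"os/exec"
)

func run(args ...string) {
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		fmt.Printf("%v failed: %v\n%s", args, err, out)
	}
}

func main() {
	// Add the virtual server for the Service VIP with the wrr scheduler.
	run("ipvsadm", "-A", "-t", "10.254.23.85:5566", "-s", "wrr")
	// Add the two endpoints as real servers in masquerade (NAT) mode, weight 1.
	run("ipvsadm", "-a", "-t", "10.254.23.85:5566", "-r", "172.30.3.29:5566", "-m", "-w", "1")
	run("ipvsadm", "-a", "-t", "10.254.23.85:5566", "-r", "172.30.46.41:5566", "-m", "-w", "1")
}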
