Problems and solutions encountered in scaling Kubernetes to 2500 nodes

Kubernetes has claimed support for more than 5,000 nodes since version 1.6, but on the way from a few dozen nodes to that scale, running into problems is inevitable.

This article shares OpenAI's experience on the road to 2,500 nodes: the problems encountered, the attempted fixes, and the real causes that were eventually found.

Problems encountered and how to solve them

Problem 1: around 500 nodes

Problem:

kubectl sometimes times out (tip: kubectl -v=6 prints every API request kubectl makes; see the sketch below)
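
For example, a quick sketch of using the verbosity flag (the get nodes target is just an illustration):

kubectl -v=6 get nodes

At -v=6, kubectl logs each HTTP request it sends to the apiserver together with the response code and latency, which makes it easy to see which call is timing out.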

Attempted fix:

  • At first we suspected the kube-apiserver could not handle the load, and tried adding replicas behind a proxy to share it
  • But even with more than 10 master replicas, the problem remained, so kube-apiserver load was not the cause; for comparison, GKE serves 500 nodes with a single 32-core VM

Cause:

  • Having ruled out the above, we began checking the remaining services on the master (etcd, kube-proxy)
  • Started trying to tune etcd
  • Watching etcd throughput with Datadog showed abnormal latency (spiking to ~100 ms)
  • A performance check with the fio tool showed only about 10% of the available IOPS being used, yet write latency was around 2 ms, so latency rather than throughput was the bottleneck (a fio sketch follows this list)
  • Moved etcd's storage from a network-attached disk to a local temp SSD on each machine
  • Result: write latency dropped from ~100 ms to ~200 µs
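
A sketch of the kind of fio check used here, assuming etcd's data lives under /var/lib/etcd (the path, size and block size are assumptions; the fdatasync option matters because etcd syncs its write-ahead log on every write):

fio --name=etcd-wal-test --directory=/var/lib/etcd --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300

The figure to watch in the output is the fsync/fdatasync latency percentiles rather than raw IOPS, which is exactly the distinction that exposed the network-disk problem above.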

Problem 2: around 1,000 nodes

Problem:

  • Found that kube-apiserver was reading 500MB/s from etcd

Attempted fix:

  • Inspected network traffic between containers with Prometheus (see the query sketch below)
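
For example, a rough PromQL sketch over the cAdvisor metrics exposed by the kubelet (the label to group by varies across Kubernetes versions, so treat pod as an assumption):

sum(rate(container_network_transmit_bytes_total[5m])) by (pod)

A query like this shows which pods generate the most outbound traffic, which helps spot chatty agents like the ones described below.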

Cause:

  • Found that the Fluentd and Datadog agents on every node were polling the apiservers far too frequently
  • After reducing the polling frequency of these two services, the traffic dropped from 500MB/s to almost nothing
  • etcd tip: with --etcd-servers-overrides, Kubernetes Event data can be split off and written to a separate etcd cluster running on different machines, as shown below
--etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381

Problem 3: 1000 to 2000 nodes

Problem:

  • etcd could not accept any more writes and began returning errors, which triggered a cascading failure
  • kubernetes-ec2-autoscaler made things worse: once all the etcd members had stopped responding, it concluded the cluster was broken and shut everything down

Attempted fix:

  • Guessed that the etcd disk was full, but checking the SSD showed plenty of free space
  • Checked for a preset space limit and found etcd's default 2GB storage quota

Solution:

  • Added the --quota-backend-bytes flag to the etcd startup parameters to raise the storage quota (a sketch follows this list)
  • Modified the kubernetes-ec2-autoscaler logic so that it no longer tears everything down when more than 50% of the cluster appears unhealthy
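
A minimal sketch of the etcd flag (the 8GB value is an assumption for illustration, not necessarily what OpenAI used):

etcd --quota-backend-bytes=8589934592

The default quota is around 2GB, which is what was being hit above; once the quota is exceeded, etcd raises a NOSPACE alarm and refuses further writes.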

Optimization of various services

High availability of Kube masters

Generally speaking, the architecture is one kube-master (the main Kubernetes control node, running kube-apiserver, kube-scheduler and kube-controller-manager) plus multiple worker nodes. To achieve high availability, refer to the following approaches:

  • kube-apiserver needs to run as multiple instances; set the --apiserver-count flag accordingly and restart
  • kubernetes-ec2-autoscaler can automatically shut down idle resources, which runs counter to the default Kubernetes scheduler's spreading behavior, but the scheduler policy below helps concentrate workloads onto as few nodes as possible
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "GeneralPredicates"},
    {"name": "MatchInterPodAffinity"},
    {"name": "NoDiskConflict"},
    {"name": "NoVolumeZoneConflict"},
    {"name": "PodToleratesNodeTaints"}
  ],
  "priorities": [
    {"name": "MostRequestedPriority", "weight": 1},
    {"name": "InterPodAffinityPriority", "weight": 2}
  ]
}

The above is an example of adjusting the Kubernetes scheduler policy; by increasing the weight of InterPodAffinityPriority, workloads are packed together as intended. See the reference examples for more demonstrations.

Note that the Kubernetes Scheduler Policy currently does not support dynamic switching; kube-apiserver needs to be restarted for changes to take effect (issue #41600).
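
For reference, a hedged sketch of how a policy file like the one above is typically handed to the scheduler (the file path is an assumption):

kube-scheduler --policy-config-file=/etc/kubernetes/scheduler-policy.json

The policy is only read at startup, in line with the restart requirement noted above.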

The impact of adjusting the scheduler policy

OpenAI used KubeDNS, but ran into the following problems shortly afterwards:

Problem:

  • DNS lookups frequently and randomly failed to resolve
  • DNS lookup volume exceeded ~200 QPS

Attempted fix:

  • Looked into why this was happening and found that some nodes were running more than 10 KubeDNS replicas

Solution:

  • The scheduler policy above concentrates many pods onto the same nodes
  • KubeDNS is very lightweight, so its replicas were easily scheduled onto the same node, concentrating all DNS lookups there
  • The fix is to add pod anti-affinity (see the related introduction) so that KubeDNS replicas are spread across different nodes, as shown below
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      topologyKey: kubernetes.io/hostname

Docker image pulls are slow when creating new nodes

Problem:

  • Every time a new node comes up, pulling the Docker images takes about 30 minutes

Attempted fix:

  • There is one very large container image (Dota, almost 17GB) whose pull delays image pulls for the whole node
  • Started checking the kubelet for other image-pull options

Solution:

  • Added --serialize-image-pulls=false to the kubelet so that images are pulled in parallel and other services can start pulling sooner (see: kubelet startup options)
  • This option requires the Docker storage driver to be switched to overlay2 (see the Docker documentation)
  • Also store Docker images on SSD, which makes pulls faster (a sketch of these settings follows this list)
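
A minimal sketch of the two settings described above (the daemon.json path is Docker's default location; the exact layout is an assumption):

kubelet --serialize-image-pulls=false

/etc/docker/daemon.json:
{
  "storage-driver": "overlay2"
}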

Addendum: the relevant kubelet source

// serializeImagePulls when enabled, tells the Kubelet to pull images one
// at a time. We recommend *not* changing the default value on nodes that
// run docker daemon with version  < 1.9 or an Aufs storage backend.
// Issue #10959 has more details.
SerializeImagePulls *bool `json:"serializeImagePulls"`

Improve the speed of docker image pull

In addition, pull speed can be improved in the following ways:

Increase the kubelet parameter --image-pull-progress-deadline to 30 minutes, and set the Docker daemon's max-concurrent-downloads option to 10 so that image layers are downloaded concurrently.
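
A minimal sketch of these two settings (paths are the usual defaults, values follow the text above):

kubelet --image-pull-progress-deadline=30m

/etc/docker/daemon.json:
{
  "max-concurrent-downloads": 10
}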

Network performance improvement

Flannel performance limitations

Network traffic between OpenAI nodes can reach 10-15 Gbit/s, but with Flannel in the path it drops to ~2 Gbit/s

The solution is to remove Flannel and use the host network directly by setting the following on the pod spec (a full sketch follows this list):

  • hostNetwork: true
  • dnsPolicy: ClusterFirstWithHostNet
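
A minimal sketch of a pod spec with these two fields set (the pod and container names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: example-hostnet-pod
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: app
    image: nginx

With hostNetwork: true the pod uses the node's network stack directly, bypassing the overlay; ClusterFirstWithHostNet keeps cluster DNS working for such pods.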



