The right way to configure Kubelet resource reservation, as learned from a cluster avalanche

Author: [email protected]

Kubelet Node Allocatable

  • Kubelet Node Allocatable is used to reserve resources for Kube components and System processes, so as to ensure that Kube and System processes have sufficient resources even when the nodes are fully loaded.
  • Currently, three resource reservations are supported: cpu, memory, and ephemeral-storage.
  • Node Capacity is the total hardware resource of the node; kube-reserved is the resource reserved for kube components; system-reserved is the resource reserved for system processes; eviction-threshold is the kubelet eviction threshold; and Allocatable is the value the scheduler actually uses when placing Pods (it ensures that the resource requests of all Pods on the node do not exceed Allocatable).
  • Node Allocatable = Node Capacity - kube-reserved - system-reserved - eviction-threshold

[Figure: Node Capacity divided into kube-reserved, system-reserved, eviction-threshold, and Allocatable]

How to configure

  • --enforce-node-allocatable: the default is pods. To also reserve resources for kube components and system processes, set it to pods,kube-reserved,system-reserved.
  • --cgroups-per-qos: enables QoS- and Pod-level cgroups; enabled by default. When enabled, the kubelet manages the cgroups of all workload Pods.
  • --cgroup-driver: the default is cgroupfs; the other option is systemd. The kubelet must use the same cgroup driver as the container runtime. For example, if docker is configured to use the systemd cgroup driver, the kubelet also needs --cgroup-driver=systemd (see the quick check after this list).
  • --kube-reserved: configures the amount of resources reserved for kube components (kubelet, kube-proxy, dockerd, etc.), for example --kube-reserved=cpu=1000m,memory=8Gi,ephemeral-storage=16Gi.
  • --kube-reserved-cgroup: if you set --kube-reserved, be sure to also set the corresponding cgroup, and the cgroup directory must be created in advance; the kubelet will not create it automatically and will fail to start otherwise. For example, set it to --kube-reserved-cgroup=/kubelet.service.
  • --system-reserved: configures the amount of resources reserved for system processes, for example --system-reserved=cpu=500m,memory=4Gi,ephemeral-storage=4Gi.
  • --system-reserved-cgroup: if you set --system-reserved, be sure to also set the corresponding cgroup, and the cgroup directory must be created in advance; the kubelet will not create it automatically and will fail to start otherwise. For example, set it to --system-reserved-cgroup=/system.slice.
  • --eviction-hard: configures the kubelet's hard eviction thresholds. Only the two incompressible resources, memory and ephemeral-storage, are supported. When MemoryPressure occurs, the scheduler stops scheduling new Best-Effort QoS Pods to that node; when DiskPressure occurs, the scheduler stops scheduling any new Pods to that node. For more on Kubelet eviction, please refer to my related blog post.
  • The code behind Kubelet Node Allocatable is quite simple; the main logic lives in pkg/kubelet/cm/node_container_manager.go, and interested readers should go through it themselves.
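
To pick the right --cgroup-driver value, first check which driver the container runtime actually uses. A quick check for a docker-based node (the exact output wording may differ between docker versions):

docker info 2>/dev/null | grep -i "cgroup driver"
# Cgroup Driver: systemd   -> start the kubelet with --cgroup-driver=systemd
# Cgroup Driver: cgroupfs  -> start the kubelet with --cgroup-driver=cgroupfs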

For how to plan the Cgroup structure of Node, please refer to the official recommendation: recommended-cgroups-setup
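
For the sample configuration below, the resulting cgroup v1 layout looks roughly like this sketch (simplified, assuming the cgroupfs driver; kubepods and its QoS sub-cgroups are created by the kubelet, while the two reserved cgroups must exist beforehand):

/sys/fs/cgroup/<subsystem>/
    kubepods/           # created by the kubelet; holds all workload Pods
        burstable/
        besteffort/
    kubelet.service/    # --kube-reserved-cgroup; must be created in advance
    system.slice/       # --system-reserved-cgroup; must be created in advance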

Sample

Take the following kubelet resource reservation as an example. Node Capacity is memory=32Gi, cpu=16, ephemeral-storage=100Gi, and we configure the kubelet as follows:

--enforce-node-allocatable=pods,kube-reserved,system-reserved
--kube-reserved-cgroup=/kubelet.service
--system-reserved-cgroup=/system.slice
--kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi
--system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--eviction-hard=memory.available<500Mi,nodefs.available<10%

NodeAllocatable = NodeCapacity - kube-reserved - system-reserved - eviction-threshold, i.e. cpu = 16 - 1 - 0.5 = 14.5, memory = 32Gi - 2Gi - 1Gi - 500Mi = 28.5Gi, ephemeral-storage = 100Gi - 1Gi - 1Gi = 98Gi.

The scheduler ensures that the sum of the resource requests of all Pods on the node does not exceed NodeAllocatable. Kubelet eviction is triggered when the memory and storage actually used by Pods exceed NodeAllocatable.
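
You can verify the values the node reports with kubectl describe node. For the sample node above the output would look roughly like this (abridged and illustrative; real output prints memory and ephemeral-storage in raw Ki units):

kubectl describe node <node-name>
# Capacity:
#  cpu:                16
#  ephemeral-storage:  100Gi
#  memory:             32Gi
# Allocatable:
#  cpu:                14500m
#  ephemeral-storage:  98Gi
#  memory:             28.5Gi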

The pit I stepped into

kube-reserved-cgroup and system-reserved-cgroup configuration

At first, I only configured --kube-reserved and --system-reserved for the kubelet. I assumed the kubelet would automatically create the corresponding cgroups for kube and system, set the corresponding cpu shares, memory limit, and so on, and that I could then sit back and relax.

However, that is not how it works, which I only discovered when a TensorFlow worker misbehaved in production. Its unrestricted CPU consumption drove the node's CPU usage to a sustained 100% and squeezed out the CPU available to the kubelet, so the heartbeat between the kubelet and the APIServer was interrupted and the node went NotReady.

Kubernetes then restarted the greedy worker on the next best Ready node, filled up that node's CPU as well, and that node also went NotReady.

In this way a cluster avalanche occurred: the nodes in the cluster went NotReady one after another, with very serious consequences.

After adding the following configuration to the kubelet, even when the node is under heavy load, the kubelet is guaranteed at least the CPU cores set by --kube-reserved whenever it needs them.

--enforce-node-allocatable=pods,kube-reserved,system-reserved
--kube-reserved-cgroup=/kubelet.service
--system-reserved-cgroup=/system.slice

Note that the CPU set by kube-reserved is actually written as cpu shares on the kube-reserved-cgroup. As anyone familiar with cpu shares knows, shares only take effect when the node's CPU is saturated and processes have to compete for it, so you may still see the node's CPU usage reach 100%. That is fine: components such as the kubelet are not affected, because if the kubelet needs more CPU at that moment it can claim more time slices, up to the number of CPUs set by kube-reserved.
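
You can see this by reading the cpu.shares the kubelet writes for the kube-reserved cgroup. With the sample --kube-reserved=cpu=1 above, 1 CPU maps to 1024 shares under cgroup v1 (the path assumes the cgroupfs mount layout and the /kubelet.service cgroup used in this post):

cat /sys/fs/cgroup/cpu/kubelet.service/cpu.shares
# 1024  <- cpu=1 from --kube-reserved: a guaranteed share of roughly one core under contention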

cgroup subsystems that Kubernetes will check

  • In Kubernetes 1.7, Kubelet startup checks for the existence of the following cgroup subsystems:

[Figure: cgroup subsystems checked by the kubelet at startup in Kubernetes 1.7]

  • In Kubernetes 1.8 and 1.9, Kubelet startup checks for the existence of the following cgroup subsystems:

[Figure: cgroup subsystems checked by the kubelet at startup in Kubernetes 1.8 and 1.9]

On CentOS systems, the cpuset and hugetlb subsystems do not have a system.slice cgroup by default, so it has to be created manually; otherwise an error like Failed to start ContainerManager Failed to enforce System Reserved Cgroup Limits on "/system.slice": "/system.slice" cgroup does not exist will be logged.

We can do this by adding ExecStartPre commands to the kubelet service unit.

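A minimal sketch of such a systemd drop-in (the file name is hypothetical; it assumes cgroup v1 mounted under /sys/fs/cgroup and the two reserved cgroups used in this post, so adjust the subsystem list to whatever your kubelet version checks):

# /etc/systemd/system/kubelet.service.d/10-cgroup-dirs.conf
[Service]
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/hugetlb/system.slice
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/kubelet.service
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/hugetlb/kubelet.service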
