kubelet: reserving system and kube resources

Kubernetes nodes can be scheduled up to their Capacity. By default, pods can consume all of the available capacity on a node. This is a problem because a node typically also runs quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons will compete for resources, leading to resource starvation issues on the node.

The kubelet exposes a feature named Node Allocatable that helps reserve compute resources for system daemons. Kubernetes recommends that cluster administrators configure Node Allocatable based on the workload density on each node.

Node Allocatable

      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------

Allocatable on a Kubernetes node is defined as the amount of compute resources that are available for pods. The scheduler does not over-subscribe Allocatable. CPU, memory, and storage are the resources supported at present.

Node Allocatable is exposed as part of the v1.Node object in the API and as part of kubectl describe node output in the CLI.
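
For example, both quantities can be inspected on a live node. The output below is only an illustrative fragment; the actual values and units depend on your cluster:

  kubectl describe node <node-name>
  # ...
  # Capacity:
  #  cpu:     16
  #  memory:  32Gi
  # Allocatable:
  #  cpu:     14500m
  #  memory:  28Gi
  # ...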

In the kubelet, resources can be reserved for two categories of system daemons.

Enabling QoS and Pod-level cgroups

To properly enforce node allocatable constraints on a node, you must enable the new cgroup hierarchy via the --cgroups-per-qos flag. This flag is enabled by default. When enabled, the kubelet will place all end-user pods under a cgroup hierarchy that it manages.
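
Since the flag defaults to true, no extra configuration is normally required; spelling it out explicitly would look like the following illustrative fragment (all other required kubelet flags omitted):

  kubelet --cgroups-per-qos=true ...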

Configuring a cgroup driver

The kubelet supports manipulating the cgroup hierarchy on the host by means of a cgroup driver, configured via the --cgroup-driver flag.

The supported values are as follows:

  • cgroupfs is the default driver, which directly manipulates the cgroup filesystem on the host in order to manage cgroup sandboxes.
  • systemd is an alternative driver, which manages cgroup sandboxes using transient slices for the resources supported by that init system.

Depending on the configuration of the associated container runtime, operators may need to choose a particular cgroup driver to ensure proper system behavior. For example, if operators use the systemd cgroup driver provided by the docker runtime, the kubelet must be configured to use the systemd cgroup driver.
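
As a hedged sketch of that docker example: docker's cgroup driver can be switched through its exec-opts daemon configuration, and the kubelet is then pointed at the same driver. The file path and option below are assumptions to verify against your docker version:

  # /etc/docker/daemon.json
  {
    "exec-opts": ["native.cgroupdriver=systemd"]
  }

  # matching kubelet flag
  kubelet --cgroup-driver=systemd ...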

Kube Reserved

  • Kubelet Flag: --kube-reserved=[cpu=100m][,][memory=100Mi][,][storage=1Gi]
  • Kubelet Flag: --kube-reserved-cgroup=

kube-reserved is meant to capture resource reservations for kubernetes system daemons such as the kubelet, the container runtime, the node problem detector, and so on. It is not meant to reserve resources for system daemons that run as pods. kube-reserved is typically a function of pod density on the node. A performance dashboard exposes the cpu and memory usage profiles of the kubelet and the docker engine at multiple levels of pod density.

To optionally enforce kube-reserved on kubernetes system daemons, specify the parent control group of the kube daemons as the value of the --kube-reserved-cgroup kubelet flag.

It is recommended that the kubernetes system daemons be placed under a top-level control group (for example, runtime.slice on systemd machines). Ideally, each system daemon should run in its own child control group. Refer to this document for more details on the recommended control group hierarchy.

Please note that the kubelet does not create --kube-reserved-cgroup if it does not exist. The kubelet will fail if an invalid cgroup is specified.
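
Putting the two flags together, a kubelet that reserves resources for kubernetes daemons under runtime.slice might be started as sketched below. The reservation values are examples rather than recommendations, and the cgroup must already exist (for instance, created by systemd):

  kubelet --kube-reserved=cpu=100m,memory=100Mi,storage=1Gi \
          --kube-reserved-cgroup=/runtime.slice ...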

System Reserved

  • Kubelet Flag: --system-reserved=[cpu=100m][,][memory=100Mi][,][storage=1Gi]
  • Kubelet Flag: --system-reserved-cgroup=

system-reserved is meant to capture resource reservations for OS system daemons such as sshd, udev, and others. system-reserved should also reserve memory for the kernel, since kernel memory is currently not attributed to pods in Kubernetes. Reserving resources for user login sessions is recommended as well (user.slice in the systemd world).

To optionally enforce system-reserved on these system daemons, specify the parent control group of the OS system daemons as the value of the --system-reserved-cgroup kubelet flag.

It is recommended that the OS system daemons be placed under a top-level control group (for example, system.slice on systemd machines).

Please note that the kubelet does not create --system-reserved-cgroup if it does not exist. The kubelet will fail if an invalid cgroup is specified.
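
The system-reserved flags mirror the kube-reserved pair; a minimal sketch with example values and the system.slice recommendation from above:

  kubelet --system-reserved=cpu=500m,memory=1Gi \
          --system-reserved-cgroup=/system.slice ...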

Eviction Thresholds

  • Kubelet Flag: --eviction-hard=[memory.available<500Mi]

Memory pressure at the node level leads to system OOMs, which affect the entire node and all pods running on it. A node can go offline temporarily until memory has been reclaimed. To avoid (or reduce the probability of) system OOMs, the kubelet provides Out of Resource management. Evictions are supported for memory and storage only. By reserving some memory via the --eviction-hard flag, the kubelet attempts to evict pods whenever the available memory on the node drops below the reserved amount. Hypothetically, if system daemons did not exist on a node, pods could not use more than capacity - eviction-hard. For this reason, resources reserved for evictions are not available to pods.
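
A minimal sketch of such a threshold; multiple thresholds can be combined with commas, as the example scenario later in this article also shows:

  kubelet --eviction-hard=memory.available<500Mi,nodefs.available<10% ...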

Enforcing Node Allocatable

  • Kubelet Flag: --enforce-node-allocatable=pods[,][system-reserved][,][kube-reserved]

The scheduler treats Allocatable as the capacity available to pods.

By default, the kubelet enforces Allocatable across pods: whenever the overall usage of all pods exceeds Allocatable, pods are evicted. More details on the eviction policy can be found here. This enforcement is controlled by including pods in the value of the --enforce-node-allocatable kubelet flag.

Optionally, the kubelet can also be made to enforce kube-reserved and system-reserved by specifying the kube-reserved and system-reserved values in the same flag. Note that to enforce kube-reserved or system-reserved, --kube-reserved-cgroup or --system-reserved-cgroup must be specified, respectively.
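
Combining the pieces above, a kubelet enforcing all three targets might be started roughly as follows, assuming both parent cgroups already exist (values are again examples only):

  kubelet --enforce-node-allocatable=pods,kube-reserved,system-reserved \
          --kube-reserved=cpu=1,memory=2Gi,storage=1Gi \
          --kube-reserved-cgroup=/runtime.slice \
          --system-reserved=cpu=500m,memory=1Gi,storage=1Gi \
          --system-reserved-cgroup=/system.slice ...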

General guidelines

System daemons are expected to be treated similarly to Guaranteed pods. System daemons can burst within their bounding control groups, and this behavior needs to be managed as part of a kubernetes deployment. For example, the kubelet should have its own control group and share kube-reserved resources with the container runtime. However, if kube-reserved is enforced, the kubelet cannot burst and use up all available node resources.

Be extra careful when enforcing system-reserved reservations, since it can lead to critical system services on the node being starved of CPU or terminated due to lack of memory (OOM killed).

  • Enforce Allocatable on pods to begin with.
  • Once adequate monitoring and alerting mechanisms are in place to track the kube system daemons, attempt to enforce kube-reserved based on usage heuristics.
  • If absolutely necessary, enforce system-reserved over time.

The resource requirements of the kube system daemons may grow over time as more and more features are added. The kubernetes project will try to bring down the node resource usage of system daemons, but that is not a priority for now. So expect a drop in Allocatable capacity in future releases.

Sample scenario

Here is an example that illustrates the Node Allocatable computation:

  • The node has 32Gi of memory, 16 CPUs, and 100Gi of storage
  • --kube-reserved is set to cpu=1,memory=2Gi,storage=1Gi
  • --system-reserved is set to cpu=500m,memory=1Gi,storage=1Gi
  • --eviction-hard is set to memory.available<500Mi,nodefs.available<10%

Under this scenario, Allocatable will be 14.5 CPUs, 28.5Gi of memory, and 98Gi of storage. The scheduler ensures that the total memory requests of all pods on this node do not exceed 28.5Gi and that storage requests do not exceed 88Gi. The kubelet evicts pods whenever the overall memory usage of pods exceeds 28.5Gi, or whenever overall disk usage exceeds 88Gi. If every process on the node consumes as much CPU as it can, pods together cannot consume more than 14.5 CPUs.
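
Written out, the arithmetic behind those numbers is as follows (the memory eviction threshold counts against Allocatable as of 1.6, as noted below, while the storage threshold here is applied at eviction time, hence the 88Gi bound):

  # Allocatable = Capacity - kube-reserved - system-reserved - eviction-hard
  # cpu:      16    - 1    - 0.5           = 14.5 CPUs
  # memory:   32Gi  - 2Gi  - 1Gi  - 500Mi  = 28.5Gi
  # storage:  100Gi - 1Gi  - 1Gi           = 98Gi
  # storage eviction bound: 98Gi - (10% of 100Gi) = 88Gi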

If kube-reserved and/or system-reserved are not enforced and the system daemons exceed their reservations, the kubelet evicts pods whenever the overall node memory usage exceeds 31.5Gi or storage usage exceeds 90Gi.

Feature availability

As of Kubernetes version 1.2, it has been possible to optionally specify kube-reserved and system-reserved reservations. The scheduler switched to using Allocatable instead of Capacity when it became available in the same release.

As of Kubernetes version 1.6, eviction thresholds are taken into account when computing Allocatable. To revert to the old behavior, set the --experimental-allocatable-ignore-eviction kubelet flag to true.

As of Kubernetes version 1.6, the kubelet enforces Allocatable on pods using control groups. To revert to the old behavior, unset the --enforce-node-allocatable kubelet flag. Note that unless --kube-reserved, --system-reserved, or --eviction-hard are set to non-default values, the enforcement of Allocatable does not affect existing deployments.

As of Kubernetes version 1.6, the kubelet launches pods in their own cgroup sandbox, in a dedicated part of the cgroup hierarchy that it manages. Operators are required to drain their nodes before upgrading the kubelet from prior versions, to ensure that pods and their associated containers are launched in the proper part of the cgroup hierarchy.
