The Right Way to Do k8s Node Resource Reservation

QoS classes

request and limit in CPU quotas

Parameters related to node resource reservation

Why the enforce-node-allocatable flag should include kube-reserved and system-reserved (the CPU part of the reservation settings is enforced via cpu.shares)

Problems encountered during configuration (the reserved cgroups must be created manually; with code)

What I consider a reasonable configuration

Notes

When systemd is used as the cgroup driver, the top-level cgroup for all pods on the node is kubepods.slice. Note that the memory limit (memory.limit_in_bytes) set on kubepods.slice is Allocatable + Eviction Hard Thresholds, which equals Capacity - KubeReserved - SystemReserved. Why is it designed this way?
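To check this on a node, the enforced value can be read straight from the cgroup filesystem. A minimal sketch, assuming cgroup v1 and the default path used by the systemd driver (the path may differ on your distribution):

```python
# Read the memory limit the kubelet enforces on the pods' top-level cgroup.
# Assumption: cgroup v1 with the systemd cgroup driver, so the memory
# controller for all pods lives under /sys/fs/cgroup/memory/kubepods.slice.
from pathlib import Path

limit_file = Path("/sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes")
limit_bytes = int(limit_file.read_text().strip())

# This should equal Capacity - KubeReserved - SystemReserved,
# i.e. Allocatable + the hard eviction threshold.
print(f"kubepods.slice memory limit: {limit_bytes / 1024**3:.2f} Gi")
```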

The node-allocatable design proposal (referenced below) explains the reasoning:

If we enforce Node Allocatable (28.9Gi) via top level cgroups, then pods can never exceed 28.9Gi in which case evictions will not be performed unless kernel memory consumption is above 100Mi.

In order to support evictions and avoid memcg OOM kills for pods, we will set the top level cgroup limits for pods to be Node Allocatable + Eviction Hard Thresholds.

However, the scheduler is not expected to use more than 28.9Gi and so Node Allocatable on Node Status will be 28.9Gi.

If kube and system components do not use up all their reservation, with the above example, pods will face memcg OOM kills from the node allocatable cgroup before kubelet evictions kick in. To better enforce QoS under this situation, Kubelet will apply the hard eviction thresholds on the node allocatable cgroup as well, if node allocatable is enforced. The resulting behavior will be the same for user pods. With the above example, Kubelet will evict pods whenever pods consume more than 28.9Gi which will be <100Mi from 29Gi which will be the memory limits on the Node Allocatable cgroup.
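The arithmetic behind the quoted example, as a minimal sketch. The 32Gi capacity and the 2Gi/1Gi reservations below are assumptions, chosen only to be consistent with the 28.9Gi / 29Gi / 100Mi figures in the quote:

```python
# Reproduce the numbers from the quoted example.
# Assumed inputs (hypothetical): 32Gi capacity, 2Gi kube-reserved,
# 1Gi system-reserved, 100Mi hard eviction threshold.
GI = 1024 ** 3
MI = 1024 ** 2

capacity        = 32 * GI
kube_reserved   = 2 * GI    # e.g. --kube-reserved=memory=2Gi
system_reserved = 1 * GI    # e.g. --system-reserved=memory=1Gi
eviction_hard   = 100 * MI  # e.g. --eviction-hard=memory.available<100Mi

# What the scheduler uses and what appears in Node Status:
allocatable = capacity - kube_reserved - system_reserved - eviction_hard

# What the kubelet writes into memory.limit_in_bytes of kubepods.slice:
kubepods_limit = allocatable + eviction_hard
assert kubepods_limit == capacity - kube_reserved - system_reserved

print(f"Node Allocatable:          {allocatable / GI:.1f} Gi")     # ~28.9 Gi
print(f"kubepods.slice memory cap: {kubepods_limit / GI:.1f} Gi")  # 29.0 Gi
```

With these numbers, the kubelet starts evicting once pods collectively use more than 28.9Gi, while the cgroup limit itself only triggers memcg OOM kills at 29Gi.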

Reference:

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-allocatable.md


Reposted from www.cnblogs.com/orchidzjl/p/12684761.html