Koordinator's first anniversary: v1.2.0 adds node resource reservation and compatibility with community rescheduling policies

Authors: You Yi, Lu Feng

Background

Koordinator is an open-source project born out of Alibaba's years of experience in container scheduling. It improves container performance and reduces cluster resource costs. Through capabilities such as colocation, resource profiling, and scheduling optimization, it improves the efficiency and reliability of latency-sensitive workloads and batch jobs while optimizing cluster resource usage.


Since its release in April 2022, Koordinator has shipped 10 releases, attracting contributions from many outstanding engineers at Alibaba, Xiaomi, Xiaohongshu, iQiyi, 360, Youzan, and elsewhere. With the arrival of spring 2023, Koordinator celebrates its first anniversary, and we are happy to announce that Koordinator v1.2 is officially released. The new version supports node resource reservation, is compatible with the K8s community's rescheduling policies, and adds node-side support for L3 cache and memory bandwidth isolation on AMD platforms.

In this release, 12 new developers joined the Koordinator community: @Re-Grh, @chengweiv5, @kingeasternsun, @shelwinnn, @yuexian1234, @Syulin7, @tzzcfrank, @Dengerwei, @complone, @AlbeeSo, @xigang, and @leason00. Thank you all for your contributions and participation.

New features at a glance

Node resource reservation

Colocation scenarios involve applications of many kinds. Besides cloud-native containers, many applications have not yet been containerized and run as host processes alongside K8s containers. To reduce resource contention between K8s applications and these other applications on the node, Koordinator supports reserving resources so that they participate in neither the scheduler's resource scheduling nor node-side resource allocation, achieving resource separation. In v1.2, Koordinator supports reserving CPU and memory resources, and allows the reserved CPU cores to be specified directly, as follows.

Declaring node resource reservations

The amount of resources to reserve, or the specific CPU cores, can be configured on the Node, for example:

apiVersion: v1
kind: Node
metadata:
  name: fake-node
  annotations: # 5 cores will be selected automatically, e.g. 0, 1, 2, 3, 4, and those cores will be reserved.
    node.koordinator.sh/reservation: '{"resources":{"cpu":"5"}}'
---
apiVersion: v1
kind: Node
metadata:
  name: fake-node
  annotations: # the cores 0, 1, 2, 3 will be reserved.
    node.koordinator.sh/reservation: '{"reservedCPUs":"0-3"}'

When the node-side component koordlet reports the node resource topology, it writes the specific reserved CPU cores into the annotations of the NodeResourceTopology object.
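
As a rough illustration (a hypothetical sketch; the exact annotation key and payload are specified in the design document linked below), the reported result could look like this:

apiVersion: topology.node.k8s.io/v1alpha1
kind: NodeResourceTopology
metadata:
  name: fake-node
  annotations:
    # written back by koordlet after it computes the concrete cores to reserve
    node.koordinator.sh/reservation: '{"reservedCPUs":"0-4"}'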

Scheduling and rescheduling adaptation

When allocating resources, the scheduler performs resource validation in various scenarios, including quota management, node capacity validation, and CPU topology validation. These scenarios must now take the node's reserved resources into account. For example, when the scheduler calculates a node's CPU capacity, it must deduct the resources reserved on the node:

cpus(alloc) = cpus(total) - cpus(allocated) - cpus(kubeletReserved) - cpus(nodeAnnoReserved)
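
As a worked example with hypothetical numbers: on a 32-core node with 10 cores already allocated, 2 cores reserved by kubelet, and 5 cores reserved via the node annotation:

cpus(alloc) = 32 - 10 - 2 - 5 = 15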

In addition, the calculation of overcommitted Batch resources in colocation must also deduct these reserved resources. Since some system processes on the node also consume resources, koord-manager takes the maximum of the node reservation and the system usage. Specifically:

reserveRatio = (100 - thresholdPercent) / 100.0
node.reserved = node.alloc * reserveRatio
system.used = max(node.used - pod.used, node.anno.reserved)
node(BE).alloc = node.alloc - node.reserved - system.used - pod(LS).used
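
For instance, with hypothetical numbers node.alloc = 32, thresholdPercent = 70, node.used = 20, pod.used = 18, node.anno.reserved = 5, and pod(LS).used = 12:

reserveRatio = (100 - 70) / 100.0 = 0.3
node.reserved = 32 * 0.3 = 9.6
system.used = max(20 - 18, 5) = 5
node(BE).alloc = 32 - 9.6 - 5 - 12 = 5.4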

For rescheduling, each plugin policy needs to be aware of the node's reserved resources in scenarios such as node capacity and utilization calculation. In addition, if containers already occupy a node's reserved resources, rescheduling should consider evicting them so that node capacity is properly managed and resource contention is avoided. These rescheduling-related capabilities will be supported in subsequent versions, and community members are welcome to participate in building them.

Node-side resource management

For LS-class Pods, the node-side koordlet component dynamically computes the shared CPU pool based on CPU allocations and excludes the CPU cores reserved on the node, isolating LS Pods from non-containerized processes. Likewise, node-side QoS policies, such as the CPUSuppress policy, take the reserved resources into account when calculating node utilization:

suppress(BE) := node.Total * SLOPercent - pod(LS).Used - max(system.Used, node.anno.reserved)
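
As a worked example with hypothetical numbers node.Total = 32 cores, SLOPercent = 65, pod(LS).Used = 10, system.Used = 3, and node.anno.reserved = 5:

suppress(BE) := 32 * 65% - 10 - max(3, 5) = 20.8 - 10 - 5 = 5.8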

For a detailed description of the node resource reservation feature, refer to the design document: https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20221227-node-resource-reservation.md

Compatible with community rescheduling policies

Thanks to the increasingly mature framework of Koordinator Descheduler, Koordinator v1.2 introduces an interface adaptation mechanism that makes it seamlessly compatible with the existing plugins of Kubernetes Descheduler. You only need to deploy Koordinator Descheduler to use all the functionality of the upstream project.

In terms of implementation, Koordinator Descheduler imports the upstream code without any intrusive changes, ensuring full compatibility with all upstream plugins, their parameter configurations, and their operating policies. At the same time, Koordinator allows users to specify an enhanced evictor for upstream plugins, thereby reusing the safeguards provided by Koordinator, such as resource reservation, workload availability guarantees, and global flow control.

List of compatible plugins:

  • HighNodeUtilization
  • LowNodeUtilization
  • PodLifeTime
  • RemoveFailedPods
  • RemoveDuplicates
  • RemovePodsHavingTooManyRestarts
  • RemovePodsViolatingInterPodAntiAffinity
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingTopologySpreadConstraint
  • DefaultEvictor

When using it, refer to the following configuration, which takes RemovePodsHavingTooManyRestarts as an example:

apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
clientConnection:
  kubeconfig: "/Users/joseph/asi/koord-2/admin.kubeconfig"
leaderElection:
  leaderElect: false
  resourceName: test-descheduler
  resourceNamespace: kube-system
deschedulingInterval: 10s
dryRun: true
profiles:
- name: koord-descheduler
  plugins:
    evict:
      enabled:
        - name: MigrationController
    deschedule:
      enabled:
        - name: RemovePodsHavingTooManyRestarts
  pluginConfig:
    - name: RemovePodsHavingTooManyRestarts
      args:
        apiVersion: descheduler/v1alpha2
        kind: RemovePodsHavingTooManyRestartsArgs
        podRestartThreshold: 10
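
In this profile, the upstream RemovePodsHavingTooManyRestarts plugin makes the descheduling decision, while eviction is handled by Koordinator's MigrationController (enabled under evict) instead of the upstream DefaultEvictor, so evictions benefit from the resource reservation, availability, and flow-control safeguards described above.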

Enhanced resource reservation and scheduling capabilities

Koordinator introduced the Reservation mechanism in an earlier version. It helps make resource delivery deterministic by reserving resources and reusing them for Pods with specified characteristics. For example, in rescheduling scenarios, an evicted Pod should be guaranteed resources to run on, rather than being left with none after eviction and causing stability problems. Similarly, when scaling out, some PaaS platforms want to first confirm that the resources needed for application scheduling and orchestration are available before deciding to scale, or to do some preparatory work in advance.

Koordinator Reservation is defined via a CRD. Inside koord-scheduler, each Reservation object is faked as a Pod, called a Reserve Pod, for scheduling purposes. A Reserve Pod reuses the existing scheduling and scoring plugins to find a suitable node, and finally occupies the corresponding resources in the scheduler's internal state. A Reservation specifies at creation time which Pods may use the reserved resources in the future: a specific Pod, certain workload objects, or Pods carrying certain labels. When these Pods are scheduled by koord-scheduler, the scheduler finds the Reservation available to the Pod and allocates the Reservation's resources first. The Reservation status records which Pod consumed it, and the Pod's annotations record which Reservation was used. After a Reservation is consumed, its internal state is cleaned up automatically so that other Pods do not become unschedulable because of it.
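
As a minimal sketch (based on the scheduling.koordinator.sh/v1alpha1 Reservation API; field names may differ across versions, so treat this as illustrative), a Reservation that reserves 2 CPUs and 4Gi of memory for Pods labeled app: app-demo could look like:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  allocateOnce: true # v1.2 default: discard the reservation once a Pod consumes it
  ttl: 1h # expire the reservation if it stays unused
  owners: # which Pods may consume this reservation
    - labelSelector:
        matchLabels:
          app: app-demo
  template: # the resources to reserve, scheduled like a normal Pod
    spec:
      schedulerName: koord-scheduler
      containers:
        - name: stub
          image: busybox
          resources:
            requests:
              cpu: "2"
              memory: 4Gi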

In Koordinator v1.2, we made a number of optimizations. First, we relaxed the restriction that a Pod may only use the resources held by its Reservation: a Pod can now cross the Reservation's resource boundary and use both the resources reserved by the Reservation and the remaining free resources on the node. Second, we extended the Kubernetes Scheduler Framework in a non-intrusive way to support reserving fine-grained resources, i.e. specific CPU cores and GPU devices. We also changed the default behavior from reusable Reservations to AllocateOnce: once a Reservation is used by a Pod, it is discarded. This change reflects the observation that AllocateOnce covers most scenarios better, so making it the default makes the feature easier to use.

Support for L3 cache and memory bandwidth isolation in AMD environments

Koordinator has supported L3 cache and memory bandwidth isolation in Intel environments since v0.3.0. The latest v1.2.0 adds support for AMD environments.

The Linux kernel provides L3 cache and memory bandwidth isolation through the unified resctrl interface, which supports both Intel and AMD. The main difference is that Intel expresses the memory bandwidth limit as a percentage, while AMD expresses it as an absolute value, as follows:

# Intel Format
# resctrl schema
L3:0=3ff;1=3ff
MB:0=100;1=100

# AMD Format
# resctrl schema
L3:0=ffff;1=ffff;2=ffff;3=ffff;4=ffff;5=ffff;6=ffff;7=ffff;8=ffff;9=ffff;10=ffff;11=ffff;12=ffff;13=ffff;14=ffff;15=ffff
MB:0=2048;1=2048;2=2048;3=2048;4=2048;5=2048;6=2048;7=2048;8=2048;9=2048;10=2048;11=2048;12=2048;13=2048;14=2048;15=2048

The interface format consists of two parts. The L3 line describes the cache "ways" available to each socket or CCD, in hexadecimal, with one bit per way. The MB line describes the memory bandwidth each socket or CCD may use: on Intel it is a percentage from 0 to 100, while on AMD it is an absolute value in Gb/s, where 2048 means unlimited. Koordinator uniformly exposes a percentage-based interface, automatically detects whether the node is an AMD environment, and fills in the resctrl interface in the corresponding format.

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
           "resctrlQOS": {
             "enable": true,
             "catRangeStartPercent": 0,
             "catRangeEndPercent": 100,
             "MBAPercent": 100
           }
         },
        "beClass": {
           "resctrlQOS": {
             "enable": true,
             "catRangeStartPercent": 0,
             "catRangeEndPercent": 30,
             "MBAPercent": 100
           }
         }
      }
    }

Other features

More of the new features included in this version can be found on the v1.2 release [1] page.

Future plans

In the next version, Koordinator will focus on the following features:

  • Hardware topology-aware scheduling, which comprehensively considers the topological relationship of multiple resource dimensions such as node CPU, memory, and GPU, and performs scheduling optimization within the cluster.
  • Enhancements to observability and traceability of the rescheduler.
  • Enhancements to GPU resource scheduling capabilities.


Koordinator is an open community, and cloud-native enthusiasts are warmly welcome to participate in building it in any way they like. Whether you are a beginner or an expert in the cloud-native field, we look forward to hearing your voice! You can also join the Koordinator community DingTalk group by searching for group number 33383887.

Related Links:

[1] v1.2 release

https://github.com/koordinator-sh/koordinator/releases/tag/v1.2.0

Click here to learn about the Koordinator project now!
