How to customize the Kubernetes scheduling algorithm?

With the development of cloud computing and container technology, Docker-based container technology has been rapidly adopted by developers and technology companies, and Kubernetes, with its rich enterprise-grade, production-ready features, has become the de facto container cluster management system. However, the generality of Kubernetes weakens the customizability of its scheduling algorithm. This article investigates how to customize the scheduling algorithm and presents an open source implementation.

k8s and scheduler architecture

Figure 1-1 is the overall architecture diagram of Kubernetes. Cluster nodes take one of two roles: Master nodes and Node (worker) nodes. The Master node is the management center of the whole cluster; the components responsible for cluster management, container scheduling, state storage and so on run on the Master node. Node nodes are the actual worker nodes and are responsible for running the containers.


Figure 1-1 The overall architecture of Kubernetes

The Kubernetes scheduler runs as an independent process, and its internals can be logically divided into several modules. Figure 1-2 shows the modules that make up the default scheduler. The configuration module reads the scheduler's configuration and initializes the scheduler according to its contents.

  • The priority queue module is a priority heap that keeps the pods waiting to be scheduled sorted by priority, with the highest-priority pod at the front. The scheduler polls this queue, and whenever a pod is waiting in the queue the scheduling process is executed [1]. (A small illustrative sketch of such a heap appears after Figure 1-2.)

  • The scheduling module consists of three parts: the algorithm module, the Node cache, and the scheduling extension points. The algorithm module provides a set of basic algorithms for scoring Nodes, such as NodeResourcesBalancedAllocation, which balances the CPU and memory usage of nodes; it is extensible, so users can modify existing algorithms or add their own. The Node cache module caches the latest state of the cluster nodes and provides the data the scheduling algorithms work on. The scheduling extension points are a series of extension points, each responsible for a different part of the process; the most important ones are Filter, Score, and Bind.

  • Finally, the binding module binds the Pod to the Node selected by the scheduler.


Figure 1-2 Kubernetes scheduler architecture
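
To make the queue's ordering concrete, below is a minimal Go sketch of such a priority heap. It only mirrors the idea described above and is not the actual kube-scheduler implementation.

// Illustrative sketch only (not kube-scheduler code): a minimal priority
// heap that keeps pending pods ordered so the highest-priority pod is
// popped first, mirroring the behaviour of the priority queue module.
package main

import (
    "container/heap"
    "fmt"
)

type queuedPod struct {
    name     string
    priority int32
}

type podHeap []queuedPod

func (h podHeap) Len() int           { return len(h) }
func (h podHeap) Less(i, j int) bool { return h[i].priority > h[j].priority } // higher priority first
func (h podHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(queuedPod)) }
func (h *podHeap) Pop() interface{} {
    old := *h
    item := old[len(old)-1]
    *h = old[:len(old)-1]
    return item
}

func main() {
    q := &podHeap{}
    heap.Init(q)
    heap.Push(q, queuedPod{name: "batch-job", priority: 10})
    heap.Push(q, queuedPod{name: "critical-api", priority: 1000})

    // The scheduler would poll the queue and schedule the front pod first.
    next := heap.Pop(q).(queuedPod)
    fmt.Println(next.name) // critical-api
}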

The Kubernetes scheduler code follows a pluggable plug-in design, with a core part and pluggable parts. The configuration module, priority queue, and Node cache in Figure 1-2 belong to the core, while the algorithm module and the scheduling extension points are pluggable. This design lets parts of the scheduler be implemented as plug-ins, which makes the code easy to modify and extend while keeping the scheduler's core code simple and maintainable [2].

Figure 1-3 lists the specific extension points of the scheduler extension point module. The scheduling process of a Pod is divided into a scheduling cycle and a binding cycle, which together make up the Pod's scheduling context. The scheduling context consists of a series of extension points, each responsible for part of the work. The most important ones are the Filter and Score extension points in the scheduling cycle and the Bind extension point in the binding cycle.

The Filter extension point judges whether each node can satisfy the Pod's resource requests and filters out the nodes that cannot. The Score extension point runs the default scoring algorithms for the Pod on every remaining node and aggregates the weighted scores into a final comprehensive score for each node; the scheduler then selects the node with the highest comprehensive score. If several nodes tie for the highest score, the scheduler uses reservoir sampling to pick one of them at random as the scheduling result, and then reserves the resources requested by the Pod on that node so they cannot be claimed by other Pods. In the binding cycle, the scheduler binds the Pod to the node with the highest score; in essence, this step writes the node information into the Pod object and updates it in the storage component etcd. (A simplified sketch of the Filter and Score interfaces follows Figure 1-3.)


Figure 1-3 Kubernetes scheduler extension point architecture
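
The sketch below shows a simplified shape of the Filter and Score extension points. The real interfaces live in the kube-scheduler scheduling framework and carry extra context and status types; the placeholder types here are for illustration only.

// Simplified sketch of the Filter and Score extension points. The real
// interfaces belong to the kube-scheduler scheduling framework and take
// additional context; the types below are illustrative placeholders.
package sketch

// Pod and NodeInfo stand in for the real API objects.
type Pod struct{ Name string }
type NodeInfo struct{ Name string }

// FilterPlugin decides whether a node can run the pod at all;
// nodes that fail any filter are removed from the candidate list.
type FilterPlugin interface {
    Filter(pod *Pod, node *NodeInfo) (feasible bool, reason string)
}

// ScorePlugin ranks each feasible node; the scheduler multiplies each
// plugin's score by its weight and sums them into the node's final score.
type ScorePlugin interface {
    Score(pod *Pod, nodeName string) int64
}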

Customized Algorithm Solution

If you want to implement a custom scheduling algorithm, there are three main options:

  1. Modify the source code of the default scheduler, add your own scheduling algorithm, and then recompile and deploy the scheduler. The schedulers studied in the papers kcss [3] and kubecg [4] are based on this scheme;

  2. Develop your own scheduler and run it in the cluster at the same time as the default scheduler;

  3. Based on the Kubernetes Scheduler Extender mechanism [5], implement the custom algorithm in an extended scheduler. The algorithm in the paper dynamic IO [6] is based on this scheme.

The advantages and disadvantages of the above three custom scheduling algorithm implementation schemes are shown in Table 2-1. On the whole,

  • Solution 1 requires the smallest changes, but it harms the maintainability of the open source software: every time the Kubernetes mainline code is updated, the modified scheduler must be brought back in line with the upstream code, which means a lot of maintenance and testing work.

  • Solution 2 implements a separate scheduler and runs multiple schedulers in the cluster at the same time. The schedulers do not synchronize cluster resource data with each other, so concurrent scheduling can cause data races and inconsistent state.

  • Solution 3 requires the default scheduler to interact with the Extender through an API, and the extra network requests lengthen the overall scheduling process.

Table 2-1 Comparison of self-developed scheduling algorithm solutions

Solution | Advantage | Shortcoming
Solution 1: Modify the scheduler source code | Minimal changes | Diverges from upstream source, hard to maintain
Solution 2: Run multiple schedulers | No source code changes | Data races and inconsistency
Solution 3: Develop an extended scheduler | No source code changes | Extra network latency

The scheduler in this article adopts Solution 3: an extended scheduler that conforms to the Scheduler Extender mechanism and its API specification was designed and developed, named Liang. Code 2-1 is the JSON policy configuration file of the extended scheduler. The policy file is passed to the default Kubernetes scheduler through its configuration parameters. urlPrefix is the API address that the extended scheduler Liang listens on once it is running, and prioritizeVerb is the route of the Score (prioritize) extension point inside the extended scheduler. After the default scheduler has run the scoring plug-ins of its Score extension point, it sends an HTTP POST request to Liang's API address, carrying the Pod and the candidate nodes in the HTTP body. On receiving the POST request, the extended scheduler Liang scores the nodes with its own scoring algorithm and returns the result to the default scheduler.

{
    "kind": "Policy",
    "apiVersion": "v1",
    "extenders": [
        {
            "urlPrefix": "http://localhost:8000/v1",
            "prioritizeVerb": "prioritizeVerb",
            "weight": 1,
            "enableHttps": false,
            "httpTimeout": 1000000000,
            "nodeCacheCapable": true,
            "ignorable": false
        }
    ]
}

Code 2-1
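
For reference, the request and response bodies exchanged on this route roughly follow the Scheduler Extender API's ExtenderArgs and HostPriorityList types [5]. The trimmed-down Go structs below are illustrative only and keep just the fields relevant here.

// Trimmed-down, illustrative view of the extender payloads; the real types
// are ExtenderArgs and HostPriorityList in the Scheduler Extender API [5].
package sketch

// extenderArgs is what the default scheduler POSTs to the prioritize route.
// Because nodeCacheCapable is true in Code 2-1, candidate nodes are expected
// to arrive as a list of node names rather than as full Node objects.
type extenderArgs struct {
    Pod       map[string]interface{} `json:"pod"`
    NodeNames *[]string              `json:"nodenames,omitempty"`
}

// hostPriority is one node's score; the extender replies with a list of them.
type hostPriority struct {
    Host  string `json:"host"`
    Score int64  `json:"score"`
}

type hostPriorityList []hostPriority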

Figure 2-1 shows the startup process of the default scheduler (kube-scheduler) with the extender configured. The configuration of the extended scheduler Liang is passed to the default scheduler through the kube-policy.json configuration file.


Figure 2-1 The extended scheduler is started by passing the configuration file to the default scheduler

Extended Scheduler Liang

The extended scheduler Liang runs independently of the default Kubernetes scheduler. Liang's module design and organization are shown in Figure 3-1 and comprise two parts: multi-dimensional resource collection and storage, and the API service. Multi-dimensional resource data is collected by running Prometheus and node-exporter in the cluster; Liang fetches the multi-dimensional metrics from Prometheus, runs the scheduling algorithm, and returns the result to the default scheduler.


Figure 3-1 The overall architecture of the extended scheduler Liang

  1. The API server module implements the API that conforms to the extended scheduler's data format and transmission specification. After receiving a scoring request from Kubernetes, Liang parses the Pod and candidate node information from the request, passes it as parameters to the internal scheduling algorithm, and returns the resulting node scores to the default scheduler. (A minimal sketch of this endpoint follows this list.)

  2. The scheduling algorithm module is the core module of the extended scheduler Liang and is responsible for implementing the custom scheduling algorithms. Thanks to the extender mechanism, multiple custom scheduling algorithms can be implemented in Liang; this article designs and implements two of them, BNP and CMDN.

  3. The data cache module has two main functions:

    1. Get the status data of all nodes in the entire Kubernetes cluster by requesting the API of Prometheus.

    2. Implement a memory-based index data caching mechanism, provide an interface for writing and reading index data, and improve the speed of obtaining multi-dimensional index data when the algorithm is running.
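
Under the assumptions already stated (the urlPrefix and prioritizeVerb from Code 2-1, and a placeholder scoring function standing in for BNP or CMDN), a minimal sketch of the prioritize endpoint could look as follows; error handling is abbreviated.

// Minimal sketch of the prioritize endpoint described in item 1 above.
package main

import (
    "encoding/json"
    "net/http"
)

// Same shapes as sketched after Code 2-1; the pod field is omitted here.
type extenderArgs struct {
    NodeNames *[]string `json:"nodenames"`
}

type hostPriority struct {
    Host  string `json:"host"`
    Score int64  `json:"score"`
}

// scoreNode is a placeholder for a custom algorithm (e.g. BNP or CMDN) that
// would read cached node metrics and return a score for the candidate node.
func scoreNode(nodeName string) int64 { return 50 }

func prioritize(w http.ResponseWriter, r *http.Request) {
    var args extenderArgs
    if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    scores := []hostPriority{}
    if args.NodeNames != nil {
        for _, name := range *args.NodeNames {
            scores = append(scores, hostPriority{Host: name, Score: scoreNode(name)})
        }
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(scores)
}

func main() {
    // Matches "urlPrefix": "http://localhost:8000/v1" and
    // "prioritizeVerb": "prioritizeVerb" in Code 2-1.
    http.HandleFunc("/v1/prioritizeVerb", prioritize)
    http.ListenAndServe(":8000", nil)
}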

Liang is developed in Go, with roughly 3,400 lines of code; it is open source at the Liang open source address [7].

Table 3-1 compares the scheduling decision time of the default scheduler with that of the extended scheduler with and without the caching mechanism. The scheduling time was measured by printing timestamps in the Kubernetes scheduler source code; each configuration was run 9 times and the results averaged. As Table 3-1 shows, the default scheduler makes a scheduling decision very quickly, in under 1 ms. With the extended scheduler and the caching mechanism, the average decision time is 4.439 ms, about 3 ms more than the default scheduler; the extra time comes mainly from the network request between the default scheduler and the extended scheduler Liang, plus the time Liang spends running the scheduling algorithm.

Without the caching mechanism, the extended scheduler's average decision time is 1110.439 ms, more than 100 times longer, mainly because every scheduling decision queries Prometheus to compute and fetch the cluster's metric data. The caching mechanism therefore avoids the network requests to Prometheus, reduces the extended scheduler's decision time, and improves its performance.

Table 3-1 Decision-making time of different scheduler architectures

Scheduler type | Average decision time
Default scheduler | 0.945 ms
Extended scheduler, with cache | 4.439 ms
Extended scheduler, without cache | 1110.439 ms
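
The caching mechanism discussed above can be sketched as a small in-memory store that is refreshed in the background. The sketch below assumes a caller-supplied fetch function that queries Prometheus for per-node metrics; it is an illustration, not Liang's actual code.

// Minimal sketch of an in-memory metrics cache. It is refreshed in the
// background so a scheduling decision never waits on a round trip to
// Prometheus.
package cache

import (
    "sync"
    "time"
)

// NodeMetrics holds the multi-dimensional utilization values used by the
// scoring algorithms (values assumed to be ratios in [0, 1]).
type NodeMetrics struct {
    CPU, Mem, DiskIO, NetIO, NetBandwidth float64
}

type MetricsCache struct {
    mu   sync.RWMutex
    data map[string]NodeMetrics
}

// Refresh periodically replaces the cached snapshot with fresh data, e.g.
// the result of a set of PromQL queries aggregated per node.
func (c *MetricsCache) Refresh(interval time.Duration, fetch func() map[string]NodeMetrics) {
    for {
        latest := fetch()
        c.mu.Lock()
        c.data = latest
        c.mu.Unlock()
        time.Sleep(interval)
    }
}

// Get returns the cached metrics for a node; scoring reads from memory
// instead of querying Prometheus on every decision.
func (c *MetricsCache) Get(node string) (NodeMetrics, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    m, ok := c.data[node]
    return m, ok
}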

BNP algorithm

The BNP algorithm implemented in Liang incorporates network IO usage into the Kubernetes scheduling decision and balances network IO usage across the cluster.
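
As an illustration of the idea (the exact BNP formula is not reproduced here), one plausible way to score nodes for such a balancing goal is to prefer nodes with lower network IO utilization:

// Illustrative sketch only: one plausible way to turn a node's network IO
// utilization into a score for a BNP-style balancing goal. The actual BNP
// formula in Liang may differ; utilization is assumed to be a ratio in [0, 1].
package bnp

// ScoreNetworkIO gives nodes with lower network IO utilization a higher
// score (0-100), so new Pods drift toward lightly loaded nodes and network
// IO usage stays balanced across the cluster.
func ScoreNetworkIO(netIOUtilization float64) int64 {
    if netIOUtilization < 0 {
        netIOUtilization = 0
    }
    if netIOUtilization > 1 {
        netIOUtilization = 1
    }
    return int64((1 - netIOUtilization) * 100)
}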

Figure 3-2 shows how the network IO usage of the whole cluster changes under the default scheduling algorithm and under BNP in the experiment. Data was collected after each Pod was deployed, for a total of nine Pods. The distribution of network IO is clearly more balanced under BNP than under the default scheduling algorithm.


Figure 3-2 Changes in network IO usage under the BNP algorithm

CMDN algorithm

The CMDN algorithm implemented in Liang aims to make the allocation of multi-dimensional resources in the cluster more balanced or more compact. Its core step is to rank candidate nodes comprehensively across five metrics, CPU, memory, disk IO, network IO, and network card bandwidth, and pick the best Node on which to deploy the Pod. Figure 3-3 compares the changes in CPU usage in the experiment; CPU usage under the CMDN balancing strategy is clearly more balanced than under the default scheduling algorithm.


Figure 3-3 Changes in CPU usage under the CMDN algorithm's balancing strategy
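
For illustration, a CMDN-style comprehensive score can be sketched as a weighted aggregation of the five normalized metrics; the weights and aggregation below are assumptions, not the exact algorithm implemented in Liang.

// Illustrative sketch only: a CMDN-style comprehensive score over five
// normalized metrics (CPU, memory, disk IO, network IO, NIC bandwidth).
package cmdn

// Metrics holds per-node utilizations, each assumed to be a ratio in [0, 1].
type Metrics struct {
    CPU, Mem, DiskIO, NetIO, NetBandwidth float64
}

// Score returns a 0-100 score for a node. With balanced=true, less-used
// nodes score higher (spreading load); with balanced=false, more-used nodes
// score higher (packing Pods compactly).
func Score(m Metrics, balanced bool) int64 {
    weights := []float64{0.3, 0.3, 0.1, 0.2, 0.1} // assumed weights, sum to 1
    values := []float64{m.CPU, m.Mem, m.DiskIO, m.NetIO, m.NetBandwidth}

    var weighted float64
    for i, v := range values {
        weighted += weights[i] * v
    }
    if balanced {
        weighted = 1 - weighted
    }
    return int64(weighted * 100)
}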

Summary

The generality of the Kubernetes scheduling algorithm weakens its customizability. This article studied the architecture and extension mechanisms of the Kubernetes scheduler, compared three schemes for customizing the scheduling algorithm, chose the extender scheme to implement the extended scheduler Liang, and implemented two scheduling algorithms, BNP and CMDN, in Liang to demonstrate the ability to customize algorithms.

The extender scheme greatly enriches what a custom scheduling algorithm can do and meets the needs of many customization scenarios. At the same time, note that custom scheduling algorithms often need more data, which requires deploying additional data collection modules in the Kubernetes cluster; this increases operation and maintenance costs and reduces the generality of the custom scheduling algorithm.

References

[1] When there is a Pod to be scheduled in the queue, the scheduling process will be executed: http://dx.doi.org/10.24138/jcomss.v16i1.1027

[2] Keeping the scheduler core code simple and maintainable: https://v1-20.docs.kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/

[3] kcss: https://doi.org/10.1007/s11227-020-03427-3

[4] kubecg: https://doi.org/10.1002/spe.2898

[5] Kubernetes Scheduler Extender mechanism: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md

[6] dynamic IO: https://doi.org/10.1145/3407947.3407950

[7] Liang open source address: https://github.com/adolphlwq/liang
