This article is shared from the Huawei Cloud Community post "How kube-scheduler completes scheduling and adjusts scheduling weights", author: You can make friends.
1. Overview
kube-scheduler is the default scheduler of a Kubernetes cluster. It watches kube-apiserver (via the watch mechanism) for Pods that have not yet been scheduled and, according to its scheduling policy, assigns each Pod to the most suitable node in the cluster.
2. Scheduling process
First, we create a Pod through the API or the kubectl tool. kube-apiserver receives the request and stores the object in etcd. The scheduler watches the apiserver for the list of Pods that have not yet been scheduled and loops through them, trying to assign a node to each Pod. The process is as follows:
- The informer component in kube-scheduler list-watches the apiserver and uses `spec.nodeName=""` to filter out Pods that have not yet been scheduled.
- Pre-selection (predicate): the scheduler filters out nodes that do not meet the conditions using the predicate algorithms.
- Priority: among the nodes that passed pre-selection, a scoring mechanism selects the highest-scoring nodes.
- After the scheduler selects a suitable node for the Pod, it binds the Pod to that node (by writing the node name into the Pod's `spec.nodeName` field).
Note: `Pod.spec.nodeName` forcibly constrains a Pod to the specified node. Setting `nodeName` directly bypasses the scheduler entirely, so no resource filtering or checks are performed.
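For illustration, a minimal sketch of such a pinned Pod (the Pod and node names here are hypothetical):

```yaml
# Hypothetical example: this Pod never passes through kube-scheduler,
# because spec.nodeName is already set at creation time.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pinned
spec:
  nodeName: k8s-0001   # bypasses the scheduler; no resource checks are performed
  containers:
    - name: nginx
      image: nginx:latest
```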
3. kube-scheduler scheduling principle
The scheduling framework of kube-scheduler is called the Scheduler Framework in Kubernetes. During scheduling, a Pod goes through the following stages in sequence. Each stage runs scheduling algorithms provided by plug-ins, and you can also develop your own plug-in for a given stage. Each plug-in implements a specific scheduling algorithm at a specified stage; for example, the NodeAffinity plug-in filters out nodes that are incompatible with the Pod in the Filter stage.
- PreFilter: pre-processes Pod-related information, or checks certain conditions that the cluster or the Pod must meet. If a PreFilter plug-in returns an error, the scheduling cycle terminates.
- Filter: filters out nodes that cannot run the Pod. For each node, the scheduler calls the filter plug-ins in their configured order. If any filter plug-in marks a node as infeasible, the node is excluded immediately and the remaining filter plug-ins are not called for that node.
- PostFilter: called after the Filter stage, but only when no feasible node was found for the Pod. A typical PostFilter implementation is preemption, which tries to make the Pod schedulable by preempting resources from other Pods.
- PreScore: performs pre-scoring work and generates shared state for the Score plug-ins to use. If a PreScore plug-in returns an error, the scheduling cycle terminates.
- Score: scores the schedulable nodes by calling each scoring plug-in.
- NormalizeScore: normalizes each plug-in's scores into the range [0, 100].
- Reserve: reserves the selected node before the binding cycle.
- Permit: approves or rejects the result of the Pod's scheduling cycle.
- PreBind: performs any work required before the Pod is bound. For example, a PreBind plug-in might provision a network volume and mount it on the target node before allowing the Pod to run there.
- Bind: binds the Pod to the node. Bind plug-ins are not called until all PreBind plug-ins have completed.
- PostBind: an informational extension point. PostBind plug-ins are called after a Pod is successfully bound. This ends the binding cycle and can be used to clean up associated resources.
The scheduler's pre-selection stage corresponds to Filter, which filters out nodes that do not meet the Pod's scheduling conditions; the optimization (priority) stage corresponds to Score, which scores each node. Node score = plug-in score × plug-in weight, summed over the plug-ins; the nodes are then sorted and the highest-scoring one is selected.
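The scoring aggregation above can be sketched in a few lines of Python. This is a simplified illustration, not the actual kube-scheduler code: each plug-in returns a normalized score in [0, 100] per node, which is multiplied by the plug-in's weight and summed; the node with the highest total wins. The node and plug-in names in the example are taken from this article's scenario.

```python
# Simplified sketch of the Score stage: total = sum(plugin_score * plugin_weight).
def node_score(plugin_scores, plugin_weights):
    """plugin_scores: {plugin_name: {node_name: score in [0, 100]}}"""
    totals = {}
    for plugin, scores in plugin_scores.items():
        weight = plugin_weights.get(plugin, 1)  # default weight 1 if unset
        for node, score in scores.items():
            totals[node] = totals.get(node, 0) + score * weight
    return totals

def pick_node(plugin_scores, plugin_weights):
    """Return the highest-scoring node."""
    totals = node_score(plugin_scores, plugin_weights)
    return max(totals, key=totals.get)

# Hypothetical example: node affinity favors k8s-0001, pod affinity favors k8s-0002.
scores = {
    "NodeAffinity":     {"k8s-0001": 100, "k8s-0002": 0},
    "InterPodAffinity": {"k8s-0001": 0,   "k8s-0002": 100},
}
```

Raising one plug-in's weight (as done in section 3.2 below) directly tilts this sum toward the node that plug-in favors.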
| Scheduling phase | Plug-in | Function |
|---|---|---|
| Filter | PodTopologySpread | Checks whether the node satisfies the Pod's topology spread constraints; if not, the node is filtered out. |
| | InterPodAffinity | Checks whether the node satisfies the Pod's inter-Pod affinity configuration; if not, the node is filtered out. |
| | NodePorts | Checks whether the node can satisfy the Pod's port request; if not, the node is filtered out. |
| | NodeAffinity | Checks whether the node satisfies the Pod's node-affinity configuration; if not, the node is filtered out. |
| | VolumeBinding | Checks whether the node satisfies the PV's node affinity, and records nodes eligible for dynamic PVC provisioning (e.g. topology) for use in later stages. |
| | TaintToleration | Filters nodes with NoSchedule and NoExecute taints based on the Pod's tolerations. |
| Score | NodeAffinity | Score is calculated per the plug-in weight, then the node score is derived from the preference weight ratios; range 0–100, default weight 2. |
| | NodeResourcesBalancedAllocation | Score is based on the proportion of each resource (CPU, memory, volume) relative to node capacity, combined with per-resource weights; range 0–100, default weight 1. |
| | ImageLocality | Score is based on the size of the Pod's images and their distribution across nodes; range 0–100, default weight 1. |
| | InterPodAffinity | Score is calculated per the plug-in weight, then the node score is derived from the term weight ratios; range 0–100, default weight 2. |
| | TaintToleration | Score is calculated according to PreferNoSchedule taints; range 0–100, default weight 3. |
| | NodeResourcesFit | Three strategies: LeastAllocated (the less allocated, the higher the score), MostAllocated (the more allocated, the higher the score), RequestedToCapacityRatio (score by the ratio of requested resources to capacity). |
| | PodTopologySpread | Score is based on topology matching degree and weight; range 0–100, default weight 2. |
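As an illustration of one of the NodeResourcesFit strategies above, here is a hedged sketch of LeastAllocated scoring: for each resource, the fraction of allocatable capacity left free is scaled to [0, 100], and the per-resource scores are combined as a weighted average. This is a simplification for intuition, not the plug-in's actual source code; resource names and weights below are assumptions.

```python
# Sketch of the "LeastAllocated" scoring idea: nodes with more free
# capacity score higher, so Pods spread onto less-loaded nodes.
def least_allocated_score(requested, allocatable, resource_weights):
    """requested/allocatable: {resource: amount}; returns a score in [0, 100]."""
    total, weight_sum = 0.0, 0
    for res, weight in resource_weights.items():
        capacity = allocatable[res]
        free = max(capacity - requested.get(res, 0), 0)
        total += (free / capacity) * 100 * weight
        weight_sum += weight
    return total / weight_sum
```

A MostAllocated strategy would invert the fraction (`requested / capacity`), favoring bin-packing onto already-busy nodes instead.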
3.1 kubernetes 1.23 version scheduler filter stage and score stage source code analysis
3.2 Example of modifying the default weight of the scheduler plug-in
3.2.1 Environment preparation
Environment: the cluster has two nodes, k8s-0001 and k8s-0002. An existing workload nginx is already scheduled on node k8s-0002. We create a workload named test with the following YAML:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: container-1
          image: nginx:latest
      dnsPolicy: ClusterFirst
      affinity:
        nodeAffinity:   # node affinity pulls the Pod toward k8s-0001
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - k8s-0001
        podAffinity:    # pod affinity pulls the Pod toward k8s-0002 (where nginx runs)
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - nginx
                namespaces:
                  - default
                topologyKey: kubernetes.io/hostname
```
3.2.2 Adjust the InterPodAffinity weight so that the workload test is scheduled to node k8s-0002
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3  # v1beta3 is available for clusters on 1.23+
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        plugins:
          score:
            disabled:
              - name: InterPodAffinity
              - name: NodeAffinity
            enabled:
              - name: InterPodAffinity  # raise the inter-Pod affinity weight
                weight: 100
              - name: NodeAffinity
                weight: 1
```
Checking the kube-scheduler scheduling logs, the k8s-0002 score is score 100 × weight 100 = 10,000 points, so the Pod is scheduled to node k8s-0002.
3.2.3 Adjust the NodeAffinity weight so that the workload test is scheduled to node k8s-0001
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        plugins:
          score:
            disabled:
              - name: InterPodAffinity
              - name: NodeAffinity
            enabled:
              - name: InterPodAffinity
                weight: 1
              - name: NodeAffinity  # raise the node affinity weight
                weight: 100
```
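For either ConfigMap to take effect, kube-scheduler must be started with the configuration file via its `--config` flag. How the file reaches the scheduler depends on how your cluster deploys kube-scheduler (static pod manifest, systemd unit, or a managed control plane); the path below is an assumption for illustration:

```shell
# Hypothetical invocation: point kube-scheduler at the configuration file.
kube-scheduler --config=/etc/kubernetes/scheduler-config.yaml
```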