Let’s talk about how kube-scheduler completes scheduling and adjusts scheduling weights

This article is shared from Huawei Cloud Community "How kube-scheduler completes scheduling and adjusts scheduling weights", author: You can make friends.

1. Overview

kube-scheduler is the default scheduler of a Kubernetes cluster. It watches kube-apiserver for Pods that have not yet been scheduled and, following its scheduling policy, assigns each Pod to the most suitable Node in the cluster.

2. Scheduling process

First, a Pod is created through the API or the kubectl tool. kube-apiserver receives the request and stores the Pod in etcd. The scheduler watches the apiserver for Pods that have not yet been scheduled, then loops over them and tries to assign a node to each one. The allocation process is as follows:

  • The Informer component in kube-scheduler list-watches the apiserver and uses spec.nodeName="" to select Pods that have not been scheduled yet.

  • Preselection (Predicate): The scheduler uses predicate algorithms to filter out nodes that do not meet the Pod's requirements.

  • Priority: The nodes that pass preselection are scored, and the node with the highest score is selected.

  • After the scheduler selects a suitable node for the Pod, it binds the Pod to the node (it writes the node name to the Pod's spec.nodeName field).

Note: Pod.spec.nodeName forces a Pod onto the specified Node. Setting nodeName bypasses the scheduler entirely, so no resource filtering or checks are performed.
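
The allocation steps above can be sketched in a few lines of Python. This is only an illustrative model, not kube-scheduler code: the Node and Pod classes, the free_cpu and cpu_request fields, and the "most free CPU wins" scoring rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    free_cpu: int  # unallocated millicores (illustrative field)


@dataclass
class Pod:
    name: str
    cpu_request: int
    node_name: str = ""  # mirrors spec.nodeName; empty means unscheduled


def schedule_one(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Filter, score, then bind a single pod; return the chosen node name."""
    # Preselection: drop nodes that cannot run the pod at all.
    feasible = [n for n in nodes if n.free_cpu >= pod.cpu_request]
    if not feasible:
        return None  # the pod stays pending
    # Priority: pick the highest-scoring node (assumed rule: most free CPU).
    best = max(feasible, key=lambda n: n.free_cpu)
    # Bind: record the decision in node_name and reserve the CPU.
    pod.node_name = best.name
    best.free_cpu -= pod.cpu_request
    return best.name


def schedule_pending(pods: List[Pod], nodes: List[Node]) -> Dict[str, Optional[str]]:
    """Process only pods whose node_name is still empty."""
    pending = [p for p in pods if p.node_name == ""]
    return {p.name: schedule_one(p, nodes) for p in pending}
```

A Pod created with node_name already set never enters schedule_pending, which mirrors how specifying nodeName skips filtering and scoring.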

3. kube-scheduler scheduling principle

The scheduling framework of kube-scheduler is called the Scheduler Framework in Kubernetes. During scheduling, a Pod passes through the following stages in sequence. Each stage runs scheduling algorithms supplied by plug-ins, and you can develop your own plug-ins for these stages; each plug-in implements a specific scheduling algorithm at a specified stage. For example, the NodeAffinity plug-in filters out nodes that do not match the Pod's node affinity in the Filter stage.

  • PreFilter: Preprocesses Pod-related information, or checks conditions that the cluster or the Pod must meet. If a PreFilter plug-in returns an error, the scheduling cycle terminates.

  • Filter: Filters out nodes that cannot run the Pod. For each node, the scheduler calls the filter plug-ins in the order they are configured. If any filter plug-in marks a node as infeasible, the node is excluded immediately and the remaining filter plug-ins are not called for that node.

  • PostFilter: Called after the Filter stage, but only when no feasible node was found for the Pod. A typical PostFilter implementation is preemption, which tries to make the Pod schedulable by preempting resources from other Pods.

  • PreScore: Performs pre-scoring work and generates shared state for the Score plug-ins to use. If a PreScore plug-in returns an error, the scheduling cycle terminates.

  • Score: Scores the schedulable nodes by calling each scoring plug-in.

  • NormalizeScore: Normalizes the scores of each plug-in into the range [0, 100].

  • Reserve: Reserves resources on the selected node for the Pod before the binding cycle starts.

  • Permit: Approves or rejects the result of the Pod's scheduling cycle.

  • PreBind: Performs any work required before the Pod is bound. For example, a PreBind plug-in might provision a network volume and mount it on the target node before allowing the Pod to run on that node.

  • Bind: Binds the Pod to the node. Bind plug-ins are not called until all PreBind plug-ins have completed.

  • PostBind: An informational extension point. PostBind plug-ins are called after the Pod has been bound successfully. This is the end of the binding cycle and can be used to clean up related resources.
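
The Filter and PostFilter behavior described in the list can be modelled as follows. This is a minimal sketch under assumed names: the plug-ins are plain (name, predicate) pairs, Pods and nodes are dictionaries, and the calls list exists only to show which plug-ins actually ran. None of this is the real framework API.

```python
from typing import Callable, List, Tuple

# A filter plug-in is modelled as (name, predicate). The predicate returns
# True when the node is feasible for the pod. All names are hypothetical.
FilterPlugin = Tuple[str, Callable[[dict, dict], bool]]


def run_filters(pod: dict, node: dict, plugins: List[FilterPlugin],
                calls: List[str]) -> bool:
    """Call plug-ins in configured order and stop at the first rejection."""
    for name, predicate in plugins:
        calls.append(f"{name}:{node['name']}")
        if not predicate(pod, node):
            return False  # remaining plug-ins are skipped for this node
    return True


def find_feasible(pod: dict, nodes: List[dict], plugins: List[FilterPlugin],
                  calls: List[str]) -> Tuple[List[str], bool]:
    """Return feasible node names and whether PostFilter would be needed."""
    feasible = [n["name"] for n in nodes if run_filters(pod, n, plugins, calls)]
    # PostFilter (for example preemption) only runs when nothing is feasible.
    post_filter_needed = not feasible
    return feasible, post_filter_needed


plugins = [
    ("ports", lambda pod, node: pod["port"] not in node["used_ports"]),
    ("zone", lambda pod, node: pod["zone"] == node["zone"]),
]
```

With a node whose port is already taken, the "zone" predicate is never evaluated for that node, which is the short-circuit the Filter stage describes.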

The scheduler's preselection stage corresponds to Filter, which removes nodes that do not meet the Pod's scheduling conditions. The priority stage corresponds to Score, which scores each remaining node. Each plug-in contributes its score multiplied by its weight, and a node's final score is the sum of these contributions across all score plug-ins. The nodes are then ranked and the one with the highest final score is selected.
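
As a rough illustration of the scoring arithmetic, the sketch below normalizes one plug-in's raw scores and then sums score times weight per node. The normalization shown (scaling so the best node gets 100) is an assumed, simplified scheme; in practice each plug-in defines its own normalization. The function names are hypothetical.

```python
from typing import Dict

MAX_NODE_SCORE = 100


def normalize(raw: Dict[str, float]) -> Dict[str, int]:
    """Scale one plug-in's raw scores so the best node gets 100."""
    highest = max(raw.values(), default=0)
    if highest <= 0:
        return {node: 0 for node in raw}
    return {node: int(score * MAX_NODE_SCORE / highest)
            for node, score in raw.items()}


def total_scores(plugin_scores: Dict[str, Dict[str, int]],
                 weights: Dict[str, int]) -> Dict[str, int]:
    """Sum (plug-in score * plug-in weight) for every node."""
    totals: Dict[str, int] = {}
    for plugin, per_node in plugin_scores.items():
        for node, score in per_node.items():
            totals[node] = totals.get(node, 0) + score * weights[plugin]
    return totals


scores = {
    "NodeAffinity": normalize({"node-a": 4, "node-b": 2}),
    "ImageLocality": {"node-a": 0, "node-b": 40},
}
totals = total_scores(scores, {"NodeAffinity": 2, "ImageLocality": 1})
winner = max(totals, key=totals.get)
```

Here node-a totals 100 * 2 + 0 * 1 = 200 and node-b totals 50 * 2 + 40 * 1 = 140, so node-a is selected.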


| Scheduling phase | Plug-in name | Plug-in function |
|---|---|---|
| Filter | PodTopologySpread | Checks whether the node satisfies the Pod's topology spread constraints. If not, the node is filtered out. |
| Filter | InterPodAffinity | Checks whether the node satisfies the Pod's affinity configuration. If not, the node is filtered out. |
| Filter | NodePorts | Checks whether the node can satisfy the Pod's port requests. If not, the node is filtered out. |
| Filter | NodeAffinity | Checks whether the node satisfies the Pod's node affinity configuration. If not, the node is filtered out. |
| Filter | VolumeBinding | Checks whether the node satisfies the node affinity of the PV, and records the nodes that meet the conditions (such as topology) for dynamically provisioned PVCs, for use in later stages. |
| Filter | TaintToleration | Filters nodes based on the Pod's tolerations and the node's NoSchedule and NoExecute taints. |
| Score | NodeAffinity | Scores the node from the weights of the Pod's preferred node affinity terms that the node matches. The score range is 0~100, and the plug-in weight defaults to 2. |
| Score | NodeResourcesBalancedAllocation | Scores the node from the proportion of each resource (cpu, mem, volume) to the node capacity, combined with the weight of the corresponding resource. The score range is 0~100, and the plug-in weight defaults to 1. |
| Score | ImageLocality | Scores the node based on the size of the Pod's images and how those images are distributed across all nodes. The score range is 0~100, and the plug-in weight defaults to 1. |
| Score | InterPodAffinity | Scores the node from the weights of the Pod's preferred pod affinity terms that the node satisfies. The score range is 0~100, and the plug-in weight defaults to 2. |
| Score | TaintToleration | Scores the node according to its PreferNoSchedule taints. The score range is 0~100, and the plug-in weight defaults to 3. |
| Score | NodeResourcesFit | Three strategies: LeastAllocated (the less allocated, the higher the score), MostAllocated (the more allocated, the higher the score), RequestedToCapacityRatio (scored from the ratio of requested resources to capacity). |
| Score | PodTopologySpread | Scores the node by how well it matches the Pod's topology spread constraints. The score range is 0~100, and the plug-in weight defaults to 2. |
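
To make the LeastAllocated and MostAllocated strategies of NodeResourcesFit more concrete, here is a simplified sketch. It assumes that "requested" is the total requested on the node including the incoming Pod, that each resource is scored on a 0~100 scale, and that resources are combined as a weighted average. The function names and the resource weights are illustrative, and edge-case handling is an assumption rather than a copy of the upstream code.

```python
from typing import Callable, Dict


def least_allocated(requested: int, capacity: int) -> int:
    """The less of the resource is allocated, the higher the score."""
    if capacity == 0 or requested > capacity:
        return 0
    return (capacity - requested) * 100 // capacity


def most_allocated(requested: int, capacity: int) -> int:
    """The more of the resource is allocated, the higher the score."""
    if capacity == 0 or requested > capacity:
        return 0
    return requested * 100 // capacity


def node_score(strategy: Callable[[int, int], int],
               requested: Dict[str, int],
               capacity: Dict[str, int],
               resource_weights: Dict[str, int]) -> int:
    """Weighted average of the per-resource scores."""
    weighted = sum(strategy(requested[r], capacity[r]) * w
                   for r, w in resource_weights.items())
    return weighted // sum(resource_weights.values())


requested = {"cpu": 1000, "memory": 4}   # millicores, GiB
capacity = {"cpu": 4000, "memory": 8}
weights = {"cpu": 1, "memory": 1}
spread_score = node_score(least_allocated, requested, capacity, weights)
pack_score = node_score(most_allocated, requested, capacity, weights)
```

For this node, LeastAllocated gives (75 + 50) // 2 = 62 while MostAllocated gives (25 + 50) // 2 = 37, so the first strategy favors spreading Pods across emptier nodes and the second favors packing them onto fuller ones.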

 

3.1 Source code analysis of the scheduler's Filter and Score stages in Kubernetes 1.23

 

3.2 Example of modifying the default weight of the scheduler plug-in

3.2.1 Environment preparation

Environment: The cluster has two nodes, k8s-0001 and k8s-0002. An existing workload, nginx, is already scheduled to node k8s-0002. The YAML file of the workload test is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: container-1
          image: nginx:latest
      dnsPolicy: ClusterFirst
      affinity:
        nodeAffinity: # Prefer node k8s-0001 through node affinity
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - k8s-0001
        podAffinity: # Prefer node k8s-0002, where nginx runs, through workload (pod) affinity
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - nginx
                namespaces:
                  - default
                topologyKey: kubernetes.io/hostname

3.2.2 Adjust the InterPodAffinity weight so that the workload test is scheduled to node k8s-0002

apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3 # v1beta3 is available in Kubernetes 1.23 and later
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: default-scheduler
      plugins:
        score:
          disabled:
          - name: InterPodAffinity
          - name: NodeAffinity
          enabled:
          - name: InterPodAffinity # Increase the workload (pod) affinity weight
            weight: 100
          - name: NodeAffinity
            weight: 1

Check the kube-scheduler scheduling log. For node k8s-0002, the InterPodAffinity result is score 100 * weight 100, a total of 10,000 points, so the Pod is scheduled to node k8s-0002.

3.2.3 Adjust the NodeAffinity weight so that the workload test is scheduled to node k8s-0001

apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: default-scheduler
      plugins:
        score:
          disabled:
          - name: InterPodAffinity
          - name: NodeAffinity
          enabled:
          - name: InterPodAffinity
            weight: 1
          - name: NodeAffinity # Increase the node affinity weight
            weight: 100
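
The effect of the two weight settings in 3.2.2 and 3.2.3 can be checked with simple arithmetic. The sketch below assumes that, for the workload test, NodeAffinity gives k8s-0001 a normalized score of 100 and k8s-0002 a score of 0, that InterPodAffinity gives the opposite, and that all other score plug-ins contribute equally to both nodes and can be left out. These per-plug-in scores are assumptions for illustration, not values taken from a scheduler log.

```python
from typing import Dict, Tuple

# Assumed normalized scores for the "test" Pod (illustrative values).
PLUGIN_SCORES = {
    "NodeAffinity": {"k8s-0001": 100, "k8s-0002": 0},
    "InterPodAffinity": {"k8s-0001": 0, "k8s-0002": 100},
}


def pick_node(weights: Dict[str, int]) -> Tuple[str, Dict[str, int]]:
    """Return the winning node and the weighted totals per node."""
    totals: Dict[str, int] = {}
    for plugin, per_node in PLUGIN_SCORES.items():
        for node, score in per_node.items():
            totals[node] = totals.get(node, 0) + score * weights[plugin]
    return max(totals, key=totals.get), totals


# Weights from 3.2.2: InterPodAffinity dominates.
node_a, totals_a = pick_node({"InterPodAffinity": 100, "NodeAffinity": 1})
# Weights from 3.2.3: NodeAffinity dominates.
node_b, totals_b = pick_node({"InterPodAffinity": 1, "NodeAffinity": 100})
```

Under these assumptions the first configuration yields 10,000 points for k8s-0002 against 100 for k8s-0001, and the second configuration reverses the totals, so the Pod moves to k8s-0001.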

 

Origin my.oschina.net/u/4526289/blog/10322283