Introduction to Kubernetes Scheduling Algorithms

Scheduling Process

The scheduler is an independent process. It continuously pulls unscheduled pods and the list of schedulable nodes from the apiserver, runs the nodes through a series of algorithms, selects one node, binds the pod to it, and writes the binding result back to the apiserver.
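
As a rough picture of that loop, here is a minimal Go sketch. The Pod, Node, and Cluster types and the function names are illustrative assumptions, not the real kube-scheduler API.

```go
package scheduler

// Minimal sketch of the loop described above. The types, interface, and
// function names are illustrative assumptions, not the real kube-scheduler API.

type Pod struct{ Name string }
type Node struct{ Name string }

// Cluster abstracts the scheduler's view of the apiserver.
type Cluster interface {
	NextUnscheduledPod() Pod       // next pod popped from the queue of unscheduled pods
	SchedulableNodes() []Node      // current list of schedulable nodes
	Bind(pod Pod, node Node) error // writes the pod/node binding back to the apiserver
}

// Run repeatedly takes one unscheduled pod, picks a node for it with the
// filtering/scoring algorithm (pickNode), and writes the binding back.
func Run(c Cluster, pickNode func(Pod, []Node) (Node, bool)) {
	for {
		pod := c.NextUnscheduledPod()
		node, ok := pickNode(pod, c.SchedulableNodes())
		if !ok {
			continue // no feasible node: the pod stays pending and will be retried
		}
		_ = c.Bind(pod, node)
	}
}
```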

Scheduling Algorithm

The following explanation is based on the Kubernetes v1.6.6 source code.

The algorithm runs in two stages: filtering and scoring. First, some nodes are filtered out so that every remaining node is schedulable for the pod; then, in the scoring stage, the highest-scoring remaining node is selected, and that node is the scheduler's output.

Algorithm flow:

Filter

The filtering stage is a filter chain made up of multiple filters. Each filter is effectively a function that takes a node and the pod to be scheduled as parameters and returns a bool indicating whether the pod can be scheduled there. Combining multiple such functions yields an extensible filter chain. The filter functions currently registered in Kubernetes are as follows:
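
A minimal sketch of such a chain, continuing the toy Pod and Node types from the sketch above (the real predicates in v1.6 also receive cached node information and return failure reasons rather than a bare bool):

```go
// FilterFunc is a simplified stand-in for a scheduler predicate: it answers
// whether the given pod may be placed on the given node.
type FilterFunc func(pod Pod, node Node) bool

// filterNodes applies the whole predicate chain to every node and keeps only
// the nodes that pass all predicates.
func filterNodes(pod Pod, nodes []Node, filters []FilterFunc) []Node {
	feasible := make([]Node, 0, len(nodes))
	for _, node := range nodes {
		ok := true
		for _, filter := range filters {
			if !filter(pod, node) {
				ok = false // a single failing predicate rejects the node
				break
			}
		}
		if ok {
			feasible = append(feasible, node)
		}
	}
	return feasible
}
```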

Each entry below lists the algorithm name, whether it is enabled by default, and a detailed description.

NoVolumeZoneConflict (default: yes): The pod can be scheduled when the node's zone labels cover the zone labels of the PersistentVolumes the pod uses. A node with no zone label imposes no zone restriction, so the pod can also be scheduled there.
MaxEBSVolumeCount (default: yes): The pod is not scheduled to a node if doing so would exceed the default limit of 39 AWS EBS volumes attached to that node.
MaxGCEPDVolumeCount (default: yes): The pod is not scheduled to a node if doing so would exceed the default limit of 16 GCE Persistent Disks attached to that node.
MaxAzureDiskVolumeCount (default: yes): The pod is not scheduled to a node if doing so would exceed the default limit of 16 Azure Disk volumes attached to that node.
NoDiskConflict (default: yes): The pod is not scheduled to a node if any volume used by a pod already on that node conflicts with a volume used by the pod to be scheduled. The check only applies to GCE PD, AWS EBS, Ceph RBD, and iSCSI; the specific rules are:

  •	GCE PersistentDisk allows multiple read-only mounts of the same volume
  •	EBS forbids two pods from mounting a volume with the same ID
  •	Ceph RBD forbids two pods from sharing the same monitors, pool, and image
  •	iSCSI forbids two pods from sharing the same IQN

MatchInterPodAffinity (default: yes): Inter-pod affinity check. Let the pod to be scheduled be X; X can be scheduled to a node when no pod already running on that node is mutually exclusive with X.
PodToleratesNodeTaints (default: yes): The pod can be scheduled when it tolerates all of the node's taints (a pod tolerates a taint by declaring a matching toleration).
CheckNodeMemoryPressure (default: yes): BestEffort pods cannot be scheduled to a node that is under memory pressure.
CheckNodeDiskPressure (default: yes): Pods cannot be scheduled to a node that is under disk pressure.
PodFitsHostPorts (default: yes): The pod is not scheduled to a node when a HostPort used by any of its containers conflicts with a port already in use on that node (a minimal sketch of this kind of check appears after the table).
PodFitsPorts (default: yes): Superseded by PodFitsHostPorts.
PodFitsResources (default: yes): The pod is not scheduled to a node when the node's total resources minus the resources requested by all pods already on it are less than the resources requested by the pod to be scheduled. CPU, memory, and GPU are currently checked.
HostName (default: yes): If the pod to be scheduled specifies pod.Spec.Host, only that host passes the filter.
MatchNodeSelector (default: yes): The pod can be scheduled when the node's labels match the pod's nodeSelector and its scheduler.alpha.kubernetes.io/affinity annotation.
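
As a concrete illustration of a single predicate, here is a minimal host-port conflict check in the spirit of PodFitsHostPorts. It assumes the set of ports already used on the node is known up front; the real predicate also distinguishes protocols and host IPs.

```go
// podFitsHostPorts reports whether none of the pod's requested host ports are
// already in use on the node. usedHostPorts is assumed to be the set of host
// ports taken by pods already running there.
func podFitsHostPorts(podHostPorts []int32, usedHostPorts map[int32]bool) bool {
	for _, p := range podHostPorts {
		if p == 0 {
			continue // 0 means the container did not request a host port
		}
		if usedHostPorts[p] {
			return false // conflict: this host port is already taken on the node
		}
	}
	return true
}
```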


Score

The scoring stage is likewise a chain of multiple scoring functions. Each scoring function takes a node and the pod to be scheduled as parameters and returns a score between 0 and 10; each scoring function also has a weight. A node's total score is the sum, over all scoring functions, of score * weight. The node with the highest total score (chosen at random if there is a tie) is the node the pod is finally scheduled to.

Example: suppose there is a node nodeA and two scoring functions priorityFunc1 and priorityFunc2 (each returns a score), with weight factors weight1 and weight2 respectively. Then nodeA's total score is: finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
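
Continuing the earlier sketch, the scoring stage and the final selection could be written roughly as follows. This is a simplification: the names are assumptions, and ties are broken here by keeping the first best node rather than picking one at random.

```go
// PriorityFunc is a simplified stand-in for a scoring function: it returns a
// score between 0 and 10 for placing the pod on the node.
type PriorityFunc func(pod Pod, node Node) int

// weightedPriority pairs a scoring function with its weight.
type weightedPriority struct {
	fn     PriorityFunc
	weight int
}

// pickBestNode computes sum(score * weight) over all scoring functions for
// each feasible node and returns the node with the highest total.
func pickBestNode(pod Pod, nodes []Node, priorities []weightedPriority) (Node, bool) {
	if len(nodes) == 0 {
		return Node{}, false
	}
	best, bestScore := nodes[0], -1
	for _, node := range nodes {
		total := 0
		for _, p := range priorities {
			total += p.weight * p.fn(pod, node)
		}
		if total > bestScore {
			best, bestScore = node, total
		}
	}
	return best, true
}
```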

The currently registered scoring functions in k8s are as follows:

Each entry below lists the algorithm name, whether it is enabled by default, its weight, and a detailed description.

SelectorSpreadPriority (default: yes, weight: 1): The more spread out across nodes the pods of the same service/RC are, the higher the score.
ServiceSpreadingPriority (default: no, weight: 1): The more spread out the pods of the same service are, the higher the score. Replaced by SelectorSpreadPriority; it is kept in the system but no longer used.
InterPodAffinityPriority (default: yes, weight: 1): The higher the affinity between the pod and the pods already running on the node, the higher the score.
LeastRequestedPriority (default: yes, weight: 1): The more resources remain on the node, the higher the score: (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2 (a sketch of this calculation appears after the table).
BalancedResourceAllocation (default: yes, weight: 1): The closer the node's CPU and memory utilization fractions are to each other, the higher the score: 10 - abs(cpuFraction - memoryFraction) * 10.
NodePreferAvoidPodsPriority (default: yes, weight: 10000): When the node carries the annotation scheduler.alpha.kubernetes.io/preferAvoidPods, it is declaring that it does not want pods scheduled onto it and scores low; without the annotation it scores high. The weight is so large because once the annotation is set this node's contribution is 0 while every node without it gets the full score multiplied by 10000, which effectively filters the annotated node out. Note: this could arguably be handled in the filter stage instead.
NodeAffinityPriority (weight: 1): The better the pod's node-affinity rules match the node, the higher the score.
TaintTolerationPriority (weight: 1): The more of the node's taints the pod tolerates, the higher the score.
EqualPriority (weight: 1): All nodes receive the same score.
ImageLocalityPriority (weight: 1): The pod to be scheduled uses certain images; the more of those images a node already holds, the higher its score.
MostRequestedPriority (weight: 1): The more resources are requested, the higher the score; the opposite of LeastRequestedPriority: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2.
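
To make the LeastRequestedPriority formula above concrete, here is a small sketch of that per-resource calculation. The parameter names are illustrative; the real implementation works on millicores and bytes derived from the node's resources and the pods' requests.

```go
// leastRequestedScore computes (capacity - requested) * 10 / capacity for a
// single resource, the per-resource term of the formula above.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// leastRequestedPriority averages the CPU and memory terms, so nodes with
// more free resources score higher.
func leastRequestedPriority(cpuRequested, cpuCapacity, memRequested, memCapacity int64) int64 {
	cpuScore := leastRequestedScore(cpuRequested, cpuCapacity)
	memScore := leastRequestedScore(memRequested, memCapacity)
	return (cpuScore + memScore) / 2
}
```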


Official Version History

References
