scheduling process
The scheduler is an independent process. It continuously pulls unscheduled pods and the list of schedulable nodes from the apiserver, runs them through a series of algorithms to select a node, binds the pod to that node, and writes the binding result back to the apiserver.
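The loop described above can be sketched as follows. This is a minimal illustration, not the real implementation: the helper names (`schedule_pending_pods`, `pick_node`) are hypothetical, and the real scheduler watches the apiserver and creates Binding objects via the API rather than returning a dict.

```python
# Hypothetical sketch of the scheduler's control loop: for each unscheduled
# pod, pick a node and record the binding (in k8s this is written back to
# the apiserver as a Binding object).

def schedule_pending_pods(pending_pods, nodes, pick_node):
    """Assign each unscheduled pod to a node and return the bindings."""
    bindings = {}
    for pod in pending_pods:
        node = pick_node(pod, nodes)   # runs the filter + score algorithms
        if node is not None:
            bindings[pod] = node       # would be written back to the apiserver
    return bindings

# Toy example: a trivial pick_node that always takes the first node.
print(schedule_pending_pods(["pod-a"], ["node-1", "node-2"],
                            lambda pod, nodes: nodes[0]))
# {'pod-a': 'node-1'}
```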
Scheduling Algorithm
The following is based on the k8s v1.6.6 source code.
The algorithm runs in two stages: filtering and scoring. First, unsuitable nodes are filtered out, guaranteeing that every remaining node is schedulable; then, in the scoring stage, the remaining nodes are scored and the highest-scoring node is selected as the scheduler's output.
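The two stages can be sketched as a small function. This is an illustrative toy, not the k8s implementation; the predicate/priority shapes here are assumptions made for the example.

```python
# Sketch of the two-stage algorithm: filter out infeasible nodes, then
# score the survivors and pick the highest-scoring one.

def select_node(pod, nodes, predicates, priorities):
    # Stage 1 (filter): a node survives only if every predicate returns True.
    feasible = [n for n in nodes
                if all(pred(pod, n) for pred in predicates)]
    if not feasible:
        return None  # no schedulable node; the pod stays pending
    # Stage 2 (score): total score = sum of weight * score over all priorities.
    def total(node):
        return sum(w * f(pod, node) for f, w in priorities)
    return max(feasible, key=total)

# Toy example: one predicate (enough free cpu) and one priority (most free cpu).
nodes = [{"name": "n1", "free_cpu": 1}, {"name": "n2", "free_cpu": 4}]
pod = {"cpu": 2}
predicates = [lambda p, n: n["free_cpu"] >= p["cpu"]]
priorities = [(lambda p, n: n["free_cpu"], 1)]
print(select_node(pod, nodes, predicates, priorities)["name"])  # n2
```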
Algorithm flow:
filter
The filtering stage is a filter chain made up of multiple filters. Each filter is essentially a function that takes a node and the pod to be scheduled as parameters and returns a bool indicating whether the pod can be scheduled there. Composing multiple such functions yields an extensible filter chain. The filter functions currently registered in k8s are as follows:
Algorithm name | default | Detailed description |
---|---|---|
NoVolumeZoneConflict | Yes | When the zone-label (region) on the host contains the zone label under the PersistentVolume in the pod, it can be scheduled. When the host has no zone-label, it means that there is no zone restriction, and it can also be scheduled |
MaxEBSVolumeCount | Yes | When the number of AWS EBS Volumes mounted on the host would exceed the default limit of 39, the pod will not be scheduled to that host |
MaxGCEPDVolumeCount | Yes | When the number of GCE Persistent Disks mounted on the host would exceed the default limit of 16, the pod will not be scheduled to that host |
MaxAzureDiskVolumeCount | Yes | When the number of Azure Disk Volumes mounted on the host would exceed the default limit of 16, the pod will not be scheduled to that host |
NoDiskConflict | Yes | When a volume used by any pod already on the host conflicts with a volume used by the pod to be scheduled, the pod will not be scheduled to that host. This check only applies to GCE PD, Amazon EBS, Ceph RBD, and iSCSI volumes |
MatchInterPodAffinity | Yes | Inter-pod affinity check: let X be the pod to be scheduled; the host is schedulable when no pod already running on it is mutually exclusive with X |
PodToleratesNodeTaints | Yes | When a pod can tolerate (tolerate) all the taints (taints) of the host, it can be scheduled (the way to tolerate the taint label is to label itself with the corresponding tolerations label) |
CheckNodeMemoryPressure | Yes | When the host is under memory pressure, BestEffort pods cannot be scheduled to it |
CheckNodeDiskPressure | Yes | When the host is under disk pressure, no pods can be scheduled to it |
PodFitsHostPorts | Yes | When a HostPort required by any container in the pod to be scheduled conflicts with a port already in use on the node, the pod will not be scheduled to that host |
PodFitsPorts | Yes | Superseded by PodFitsHostPorts |
PodFitsResources | Yes | When the node's total resources minus the sum of resources requested by all pods already on the node is less than the amount requested by the pod to be scheduled, the pod will not be scheduled to that host. CPU, memory, and GPU are currently checked |
HostName | Yes | If the pod to be scheduled specifies pod.Spec.NodeName, it can only be scheduled to that host |
MatchNodeSelector | Yes | When the host's labels match the pod's nodeSelector and its scheduler.alpha.kubernetes.io/affinity annotation, the pod can be scheduled |
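Since each filter is just a boolean function, the chain is simply their conjunction. Below is a toy version of the PodFitsHostPorts idea from the table; the function and field names are invented for illustration and do not match the k8s source.

```python
# A filter is a function (pod, node) -> bool; the chain passes only if
# every filter passes. Toy PodFitsHostPorts: reject the node if any
# hostPort the pod needs is already taken there.

def pod_fits_host_ports(pod, node):
    return not (set(pod["host_ports"]) & set(node["used_ports"]))

def passes_all(pod, node, filters):
    return all(f(pod, node) for f in filters)

pod = {"host_ports": [8080]}
print(passes_all(pod, {"used_ports": [80, 443]}, [pod_fits_host_ports]))  # True
print(passes_all(pod, {"used_ports": [8080]}, [pod_fits_host_ports]))     # False
```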
score
The scoring stage is likewise a chain of multiple scoring functions. Each scoring function takes the node and the pod to be scheduled as parameters and returns a score from 0 to 10; each also has a weight. A node's total score is the sum of score * weight over all scoring functions, and the node with the highest total (if there are several, one is chosen at random) is the final node to schedule onto.
Example: suppose there is a node nodeA and two scoring functions priorityFunc1 and priorityFunc2 (each returns a score), with weights weight1 and weight2 respectively. Then nodeA's total score is: finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
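Plugging made-up numbers into that formula (the scores 7 and 3 and weights 1 and 2 below are arbitrary, chosen only to make the arithmetic concrete):

```python
# Worked version of: finalScoreNodeA = weight1*priorityFunc1 + weight2*priorityFunc2
priority_func1 = lambda node: 7   # each scoring function returns a 0-10 score
priority_func2 = lambda node: 3
weight1, weight2 = 1, 2

final_score = weight1 * priority_func1("nodeA") + weight2 * priority_func2("nodeA")
print(final_score)  # 1*7 + 2*3 = 13
```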
The currently registered scoring functions in k8s are as follows:
Algorithm name | default | Weights | Detailed description |
---|---|---|---|
SelectorSpreadPriority | Yes | 1 | The more scattered the pods of the same service/rc, the higher the score |
ServiceSpreadingPriority | No | 1 | The more widely the pods of the same service are spread across nodes, the higher the score. Superseded by SelectorSpreadPriority; kept in the system but unused |
InterPodAffinityPriority | Yes | 1 | The higher the affinity between a pod and other pods running on the node, the higher the score |
LeastRequestedPriority | Yes | 1 | The more resources remaining, the higher the score: (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2 |
BalancedResourceAllocation | Yes | 1 | The closer the cpu and memory utilization are, the higher the score. 10 - abs(cpuFraction-memoryFraction)*10 |
NodePreferAvoidPodsPriority | Yes | 10000 | When the node's scheduler.alpha.kubernetes.io/preferAvoidPods annotation is set, the node is signalling that it does not want pods scheduled to it, and it scores 0 on this function; nodes without the annotation score 10. The weight is so large because, multiplied by 10000, this gap dwarfs every other priority, effectively filtering the annotated node out. (Arguably this could instead be handled in the filtering stage.) |
NodeAffinityPriority | Yes | 1 | The better the pod's node affinity matches the node, the higher the score |
TaintTolerationPriority | Yes | 1 | The more of the node's taints the pod tolerates, the higher the score |
EqualPriority | No | 1 | Every node gets the same score |
ImageLocalityPriority | No | 1 | The pod to be scheduled needs certain images; the more of them a node already has, the higher the score |
MostRequestedPriority | No | 1 | The more resources requested, the higher the score; the opposite of LeastRequestedPriority: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2 |
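The three resource-based formulas from the table can be checked with concrete numbers. The capacities and requests below are made up for illustration; the formulas themselves are the ones stated in the table.

```python
# LeastRequestedPriority, MostRequestedPriority, and
# BalancedResourceAllocation, as given in the table above.

def least_requested(cpu_cap, cpu_req, mem_cap, mem_req):
    cpu = (cpu_cap - cpu_req) * 10 / cpu_cap
    mem = (mem_cap - mem_req) * 10 / mem_cap
    return (cpu + mem) / 2

def most_requested(cpu_cap, cpu_req, mem_cap, mem_req):
    cpu = 10 * cpu_req / cpu_cap
    mem = 10 * mem_req / mem_cap
    return (cpu + mem) / 2

def balanced_resource_allocation(cpu_cap, cpu_req, mem_cap, mem_req):
    cpu_fraction = cpu_req / cpu_cap
    mem_fraction = mem_req / mem_cap
    return 10 - abs(cpu_fraction - mem_fraction) * 10

# A node with half its cpu and half its memory requested scores 5.0 on the
# first two, and a perfect 10.0 on balance (the fractions are equal).
print(least_requested(4000, 2000, 8192, 4096))              # 5.0
print(most_requested(4000, 2000, 8192, 4096))               # 5.0
print(balanced_resource_allocation(4000, 2000, 8192, 4096)) # 10.0
```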
Official version evolution
References
- https://github.com/kubernetes/community/blob/release-1.6/contributors/devel/scheduler.md
- https://github.com/kubernetes/community/blob/release-1.6/contributors/devel/scheduler_algorithm.md
- http://cizixs.com/2017/03/10/kubernetes-intro-scheduler
- https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
- https://book.douban.com/subject/26894736/