Kubernetes scheduling
1 Introduction
- The scheduler uses the Kubernetes watch mechanism to discover newly created Pods in the cluster that have not yet been bound to a Node, and schedules each such Pod onto a suitable Node.
- kube-scheduler is the default scheduler for Kubernetes clusters and is part of the cluster control plane. It is designed so that, if you need to, you can write your own scheduling component and use it in place of the default kube-scheduler.
- Factors considered in scheduling decisions include: individual and overall resource requests, hardware/software/policy constraints, affinity and anti-affinity requirements (the most commonly used), data locality, interference between workloads, and so on. For the default policies, refer to the scheduling framework documentation.
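The replaceable-scheduler design mentioned above is exposed per Pod through spec.schedulerName. A minimal sketch, assuming a custom scheduler registered under the placeholder name my-scheduler is running in the cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-custom
spec:
  # my-scheduler is a hypothetical name; if no scheduler with this
  # name is running, the Pod stays Pending indefinitely.
  schedulerName: my-scheduler
  containers:
  - name: nginx
    image: myapp:v1
```

Omitting schedulerName leaves the Pod to the default kube-scheduler.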
2 Factors affecting Kubernetes scheduling
2.1 nodeName
- nodeName is the simplest form of node selection constraint, but it is generally not recommended. If nodeName is specified in the PodSpec, it takes precedence over all other node selection methods.
- Limitations of selecting a node with nodeName (these cause errors):
if the specified node does not exist, the Pod cannot run;
if the specified node lacks the resources to accommodate the Pod, scheduling fails;
node names in cloud environments are not always predictable or stable.
Example
[root@server2 ~]# vim pod1.yml
[root@server2 ~]# cat pod1.yml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: myapp:v1
  nodeName: server3 ## pin the Pod to server3
[root@server2 ~]# kubectl apply -f pod1.yml
pod/nginx created
[root@server2 ~]# kubectl get pod -o wide ## view details
2.2 nodeSelector
- nodeSelector is the simplest recommended form of node selection constraint. Pods are scheduled to whichever node carries the matching label; if two nodes both carry the label but one has insufficient resources, the Pod is scheduled to the other host.
- Add a label to the chosen node:
kubectl label nodes server2 disktype=ssd
Example
[root@server2 ~]# kubectl label nodes server4 disktype=ssd ## add the label
[root@server2 ~]# kubectl get node --show-labels
[root@server2 ~]# vim pod1.yml
[root@server2 ~]# cat pod1.yml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: myapp:v1
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 100m
  nodeSelector:
    disktype: ssd
2.3 Affinity and Anti-Affinity
If node affinity and pod affinity are both specified and conflict, the Pod will fail to schedule!
2.3.1 node affinity
- nodeSelector provides a very simple way to constrain Pods to nodes with specific labels. The affinity/anti-affinity feature greatly expands the types of constraints you can express. Rules can be "soft" preferences rather than hard requirements, so the Pod can still be scheduled even when the scheduler cannot satisfy them. You can also constrain against the labels of Pods already running on a node, rather than the node's own labels, to control which Pods may or may not be placed together.
- Node affinity (takes effect only at scheduling time):
requiredDuringSchedulingIgnoredDuringExecution: must be satisfied
preferredDuringSchedulingIgnoredDuringExecution: preferred, not mandatory
- IgnoredDuringExecution means that if a Node's labels change while the Pod is running so that the affinity rule is no longer satisfied, the Pod continues to run on that Node.
- nodeAffinity supports several operators for match conditions:
In: the label value is in the given list
NotIn: the label value is not in the given list
Gt: the label value is greater than the given value (node affinity only; not supported for pod affinity)
Lt: the label value is less than the given value (node affinity only; not supported for pod affinity)
Exists: the label key exists
DoesNotExist: the label key does not exist
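To illustrate the numeric operators, a minimal nodeAffinity sketch using Gt (the node label cpu-count and its value are assumptions for illustration; the node must carry an integer-valued label for the comparison to apply):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cpu-count       # hypothetical node label, e.g. cpu-count=8
          operator: Gt
          values:
          - "4"                # only nodes whose cpu-count label is greater than 4
```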
Example 1
[root@server2 ~]# kubectl label nodes server3 disktype=ssd ## same label as server4
[root@server2 ~]# kubectl label nodes server4 role=db ## add a role label to server4
[root@server2 ~]# cat pod1.yml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: myapp:v1
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 100m
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: role
            operator: In
            values:
            - db
[root@server2 ~]# kubectl apply -f pod1.yml
[root@server2 ~]# kubectl get pod -o wide
2.3.2 Pod affinity
- podAffinity addresses which Pods a Pod may be deployed alongside in the same topology domain (topology domains are defined by node labels and can be a single node, or a zone or cluster made up of several nodes).
- podAntiAffinity addresses which Pods a Pod must not share a topology domain with. Both deal with Pod-to-Pod relationships within the Kubernetes cluster.
- Inter-Pod affinity and anti-affinity are most useful with higher-level controllers such as ReplicaSets, StatefulSets, and Deployments: they make it easy to place a set of workloads in the same defined topology (for example, on the same node).
- Inter-Pod affinity and anti-affinity require significant processing and may noticeably slow down scheduling in large clusters.
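As a hedged sketch of the Deployment use case above: adding podAntiAffinity against a Deployment's own Pod label spreads its replicas across nodes (the app: web label and names are assumptions for illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            # with hostname as the topology key, at most one replica per node
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: myapp:v1
```

With required anti-affinity, a third replica on a two-node cluster would stay Pending; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative.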
Example: Affinity
[root@server2 ~]# kubectl run demo --image=busyboxplus -it ## run a Pod
[root@server2 ~]# kubectl get pod -o wide ## check which server it landed on
[root@server2 ~]# vim pod2.yaml
[root@server2 ~]# cat pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: myapp:v1
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 100m
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: run
            operator: In
            values:
            - demo
        topologyKey: kubernetes.io/hostname
[root@server2 ~]# kubectl apply -f pod2.yaml
[root@server2 ~]# kubectl get pod -o wide ## both Pods are on the same node
Example: Anti-Affinity
[root@server2 ~]# vim pod2.yaml
[root@server2 ~]# cat pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: myapp:v1
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 100m
  affinity:
    podAntiAffinity: ## for anti-affinity, only this field changes
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: run
            operator: In
            values:
            - demo
        topologyKey: kubernetes.io/hostname
[root@server2 ~]# kubectl apply -f pod2.yaml
[root@server2 ~]# kubectl get pod -o wide ## check that the Pods are no longer on the same node
2.4 Taints
- NodeAffinity is a property defined on a Pod that attracts it to Nodes matching our requirements; Taints are the opposite: they let a Node repel Pods, or even evict them.
- A Taint is an attribute of a Node. Once a taint is set, Kubernetes will not schedule Pods onto that Node. To counter this, Kubernetes gives Pods a Tolerations attribute: as long as a Pod tolerates a Node's taints, Kubernetes ignores those taints and may (but is not guaranteed to) schedule the Pod there.
- Use the kubectl taint command to add a taint to a node:
$ kubectl taint nodes node1 key=value:NoSchedule //Create
$ kubectl describe nodes server1 |grep Taints //Query
$ kubectl taint nodes node1 key:NoSchedule- //Delete
- Possible values of effect: [NoSchedule | PreferNoSchedule | NoExecute]
NoSchedule: Pods will not be scheduled onto the tainted node.
PreferNoSchedule: the soft-policy version of NoSchedule.
NoExecute: once the taint takes effect, Pods already running on the node that do not have a matching toleration are evicted immediately.
- The key, value, and effect defined in tolerations must match the taint set on the node:
if the operator is Exists, value can be omitted;
if the operator is Equal, the key and value must match exactly;
if operator is not specified, it defaults to Equal.
There are also two special cases:
when no key is specified, operator Exists matches all keys and values, tolerating every taint;
when no effect is specified, all effects are matched.
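One related field worth noting: for NoExecute taints, a toleration may also set tolerationSeconds to bound how long the Pod keeps running on the tainted node before eviction. A minimal sketch (the key1=v1 taint is an assumption for illustration):

```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "v1"
  effect: "NoExecute"
  tolerationSeconds: 3600   # evicted 3600s after the taint appears
```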
Example: a taint and a toleration
[root@server2 ~]# kubectl describe nodes server2 | grep Taints ## the master carries a taint
Taints: node-role.kubernetes.io/master:NoSchedule
[root@server2 ~]# kubectl describe nodes server3 | grep Taints
Taints: <none>
[root@server2 ~]# kubectl describe nodes server4 | grep Taints
[root@server2 ~]# kubectl taint node server3 key1=v1:NoExecute ## add a taint to server3
[root@server2 ~]# vim pod.yml
[root@server2 ~]# cat pod.yml ## add a tolerations: stanza
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      run: nginx
  template:
    metadata:
      labels:
        run: nginx
    spec:
      hostNetwork: true
      containers:
      - name: nginx
        image: myapp:v1
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 0.5
            memory: 512Mi
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "v1"
        effect: "NoExecute"
[root@server2 ~]# kubectl apply -f pod.yml
[root@server2 ~]# kubectl get pod -o wide ## both replicas run on server3 because the calico network plugin is in use
Example: Two special values
## set taints on both server3 and server4
[root@server2 ~]# kubectl describe nodes server3 | grep Taints
Taints: key1=v1:NoExecute
[root@server2 ~]# kubectl taint node server4 key2=v2:NoSchedule
node/server4 tainted
[root@server2 ~]# kubectl describe nodes server4 | grep Taints
Taints: key2=v2:NoSchedule
[root@server2 ~]# vim pod.yml
[root@server2 ~]# cat pod.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      run: nginx
  template:
    metadata:
      labels:
        run: nginx
    spec:
      hostNetwork: true
      containers:
      - name: nginx
        image: myapp:v1
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 0.5
            memory: 200Mi
      tolerations:
      - operator: "Exists"
[root@server2 ~]# kubectl apply -f pod.yml
[root@server2 ~]# kubectl get pod
[root@server2 ~]# kubectl get pod -o wide ## check that Pods can run on both hosts
2.5 Commands that affect Pod scheduling
- The commands cordon, drain, and delete also affect Pod scheduling: with each of them, newly created Pods are no longer scheduled to the node, but they differ in how disruptive they are.
2.5.1 cordon
- cordon stops scheduling. It has the least impact: the node is only marked SchedulingDisabled, newly created Pods are not scheduled to it, existing Pods on the node are unaffected, and the node continues to serve traffic normally.
$ kubectl cordon server3
$ kubectl get node
NAME STATUS ROLES AGE VERSION
server1 Ready 29m v1.17.2
server2 Ready 12d v1.17.2
server3 Ready, SchedulingDisabled 9d v1.17.2
$ kubectl uncordon server3 //Restore
[root@server2 ~]# kubectl taint node server3 key1=v1:NoExecute- ## remove the taint
[root@server2 ~]# kubectl taint node server4 key2=v2:NoSchedule- ## remove the taint
[root@server2 ~]# kubectl cordon server3 ## stop scheduling to server3
[root@server2 ~]# kubectl get node ## check node status
[root@server2 ~]# cat pod.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      run: nginx
  template:
    metadata:
      labels:
        run: nginx
    spec:
      hostNetwork: true
      containers:
      - name: nginx
        image: myapp:v1
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 0.5
            memory: 200Mi
[root@server2 ~]# kubectl apply -f pod.yml
[root@server2 ~]# kubectl get pod -o wide ## all replicas ended up on server4
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-b49457b9-7h2q5 1/1 Running 0 3s 172.25.13.4 server4 <none> <none>
nginx-b49457b9-b6bbl 1/1 Running 0 3s 172.25.13.4 server4 <none> <none>
[root@server2 ~]# kubectl uncordon server3 ## resume scheduling
[root@server2 ~]# kubectl uncordon server4
2.5.2 drain
- drain evicts the node: Pods on the node are evicted first and recreated on other nodes, then the node is marked SchedulingDisabled.
$ kubectl drain server3 ## evict the node
node/server3 cordoned
evicting pod "web-1"
evicting pod "coredns-9d85f5447-mgg2k"
pod/coredns-9d85f5447-mgg2k evicted
pod/web-1 evicted
node/server3 evicted
$ kubectl uncordon server3 ## resume scheduling
[root@server2 ~]# kubectl drain server4 --ignore-daemonsets ## ignore DaemonSet-managed Pods
[root@server2 ~]# kubectl get node
server4 Ready,SchedulingDisabled <none> 10d v1.20.2
[root@server2 ~]# kubectl apply -f pod.yml
[root@server2 ~]# kubectl get pod -o wide
[root@server2 ~]# kubectl uncordon server4 ## resume scheduling
2.5.3 delete
- delete removes the node. This is the most drastic option: Pods on the node are evicted first and recreated on other nodes, then the node is deleted from the cluster and the control plane loses control of it. To resume scheduling, log in to the node and restart the kubelet service.
$ kubectl delete node server3
$ systemctl restart kubelet //the node re-registers itself and comes back into use
[root@server2 ~]# kubectl delete nodes server3 ## delete node server3
[root@server2 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
server2 Ready control-plane,master 10d v1.20.2
server4 Ready <none> 10d v1.20.2
[root@server3 ~]# systemctl restart kubelet.service ## restart kubelet on the deleted node
[root@server2 ~]# kubectl get node ## node server3 is back
NAME STATUS ROLES AGE VERSION
server2 Ready control-plane,master 10d v1.20.2
server3 Ready <none> 11s v1.20.2
server4 Ready <none> 10d v1.20.2