k8s-cluster scheduling
One Introduction to the scheduler
1.1 Introduction
The Scheduler is the Kubernetes scheduler. Its main task is to assign the defined Pods to the nodes of the cluster. This sounds simple, but there are many issues to consider:
Fairness: how to ensure that every node can be allocated resources
Efficient use of resources: all resources in the cluster should be used as fully as possible
Efficiency: scheduling should perform well and be able to place large batches of Pods as quickly as possible
Flexibility: users should be able to control the scheduling logic according to their own needs
The Scheduler runs as a separate program. Once started, it continuously watches the API Server for Pods whose PodSpec.NodeName is empty, and for each such Pod it creates a binding that indicates which node the Pod should be placed on.
1.2 Scheduling process
Scheduling is divided into several steps: first, nodes that do not meet the conditions are filtered out; this step is called predicate. Then the nodes that pass are ranked; this is the priority step. Finally, the node with the highest priority is selected. If any step returns an error, the error is returned directly. If no node passes the predicate step, the Pod stays in the Pending state and scheduling is retried until some node meets the conditions.
After the predicate step, if multiple nodes still meet the conditions, the priority step sorts them by priority score and the best one is chosen.
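The two-phase flow described above can be sketched as follows. This is an illustrative toy, not the real scheduler: the field names (free_cpu) and the single predicate/priority function are made-up assumptions.

```python
def schedule(pod, nodes, predicates, priorities):
    """Return the best node for `pod`, or None (the pod stays Pending)."""
    # Predicate phase: filter out nodes that fail any hard condition.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        return None  # no node fits: the pod remains Pending and is retried
    # Priority phase: score the remaining nodes and pick the highest.
    return max(feasible, key=lambda n: sum(f(pod, n) for f in priorities))

# Toy example: one predicate (enough free CPU), one priority (most free CPU).
nodes = [{"name": "node1", "free_cpu": 2}, {"name": "node2", "free_cpu": 8}]
pod = {"cpu": 4}
predicates = [lambda pod, n: n["free_cpu"] >= pod["cpu"]]
priorities = [lambda pod, n: n["free_cpu"]]
print(schedule(pod, nodes, predicates, priorities)["name"])  # node2
```

If the pod asked for more CPU than any node has, `schedule` would return None, which corresponds to the Pending-and-retry behavior described above.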
Two Node affinity
-
pod.spec.nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution: soft strategy
requiredDuringSchedulingIgnoredDuringExecution: hard strategy (must be satisfied)
-
Key-value operator relationships
In: The value of label is in a list
NotIn: The value of label is not in a list
Gt: The value of label is greater than a certain value
Lt: The value of label is less than a certain value
Exists: a label exists
DoesNotExist: a label does not exist
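The operators above can be sketched as a small evaluation function over a node's labels. This is illustrative only (not the real implementation); the sample labels are made up.

```python
def match(labels: dict, key: str, operator: str, values=None):
    """Evaluate one matchExpressions entry against a node's labels."""
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    if key not in labels:
        return False
    if operator == "In":
        return labels[key] in values
    if operator == "NotIn":
        return labels[key] not in values
    if operator == "Gt":  # values are compared as integers
        return int(labels[key]) > int(values[0])
    if operator == "Lt":
        return int(labels[key]) < int(values[0])
    raise ValueError(f"unknown operator {operator}")

labels = {"kubernetes.io/hostname": "k8s-node2", "cpus": "8"}
print(match(labels, "kubernetes.io/hostname", "In", ["k8s-node2"]))  # True
print(match(labels, "cpus", "Gt", ["4"]))                            # True
print(match(labels, "disk", "DoesNotExist"))                         # True
```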
2.1 Hard strategy
The Pod must run on k8s-node2:
apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: wangyanglinux/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - k8s-node2
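Note the nesting in the manifest above: multiple nodeSelectorTerms are ORed together, while multiple matchExpressions inside one term are ANDed. A minimal sketch of that semantics (the helper names and the toy In-only matcher are assumptions for illustration):

```python
def node_matches(labels, node_selector_terms, match_expr):
    # Terms are ORed; expressions within a single term are ANDed.
    return any(
        all(match_expr(labels, e) for e in term["matchExpressions"])
        for term in node_selector_terms
    )

# Toy expression matcher supporting only the In operator.
def match_expr(labels, e):
    return e["operator"] == "In" and labels.get(e["key"]) in e["values"]

terms = [{"matchExpressions": [
    {"key": "kubernetes.io/hostname", "operator": "In", "values": ["k8s-node2"]}
]}]
print(node_matches({"kubernetes.io/hostname": "k8s-node2"}, terms, match_expr))  # True
print(node_matches({"kubernetes.io/hostname": "k8s-node1"}, terms, match_expr))  # False
```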
2.2 Soft strategy
Prefer to run the Pod on k8s-node3, but allow other nodes if k8s-node3 is not available:
apiVersion: v1
kind: Pod
metadata:
  name: affinity1
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: wangyanglinux/myapp:v1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - k8s-node3
Three Pod affinity
pod.spec.affinity.podAffinity/podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution: soft strategy
requiredDuringSchedulingIgnoredDuringExecution: hard strategy
3.1 Hard strategy
The pod-3 Pod must run on the same node as the Pods labeled app=node-affinity-pod:
apiVersion: v1
kind: Pod
metadata:
  name: pod-3
  labels:
    app: pod-3
spec:
  containers:
  - name: pod-3
    image: wangyanglinux/myapp:v1
  affinity:
    podAffinity:   # co-locate with matching Pods
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - node-affinity-pod
        topologyKey: kubernetes.io/hostname
kubectl get pod --show-labels
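topologyKey decides what counts as "the same place": nodes sharing the same value for that label form one topology domain, and with kubernetes.io/hostname each domain is a single node. A sketch of how the allowed domains could be derived (the data and helper names are made-up assumptions):

```python
def affinity_domains(node_labels, topology_key, pods_on_node, selector):
    """Return the topology-domain values that contain a Pod matched by
    `selector` (a predicate over Pod labels)."""
    domains = set()
    for node, pods in pods_on_node.items():
        if any(selector(p) for p in pods):
            domains.add(node_labels[node][topology_key])
    return domains

node_labels = {
    "k8s-node1": {"kubernetes.io/hostname": "k8s-node1"},
    "k8s-node2": {"kubernetes.io/hostname": "k8s-node2"},
}
pods_on_node = {
    "k8s-node1": [],
    "k8s-node2": [{"app": "node-affinity-pod"}],
}
selector = lambda p: p.get("app") == "node-affinity-pod"
# With required podAffinity, pod-3 may only be placed in these domains:
print(affinity_domains(node_labels, "kubernetes.io/hostname",
                       pods_on_node, selector))  # {'k8s-node2'}
```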
3.2 Soft strategy
Prefer that the pod-4 Pod not run on the same node as pod-3:
[root@k8s-master01 diaodu]# cat pod4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-4
  labels:
    app: pod-4
spec:
  containers:
  - name: pod-4
    image: wangyanglinux/myapp:v1
  affinity:
    podAntiAffinity:   # avoid nodes with matching Pods
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - pod-3
          topologyKey: kubernetes.io/hostname
As a result, the two Pods land on different nodes: one on node1, one on node2.
Four Taints and tolerations
4.1 Introduction
Node affinity is a property of Pods (a preference or a hard requirement) that attracts Pods to a particular class of nodes. A taint is the opposite: it lets a node repel a particular class of Pods.
Taints and tolerations work together to keep Pods off inappropriate nodes. One or more taints can be applied to a node, meaning the node will not accept Pods that cannot tolerate those taints. A toleration applied to a Pod means the Pod can (but is not required to) be scheduled onto nodes with matching taints.
4.2 Taints
The composition of a taint
The kubectl taint command sets a taint on a Node. Once tainted, a repulsive relationship exists between the Node and Pods: the Node can refuse to schedule certain Pods, and can even evict Pods already running on it.
Each taint has a key and a value as its label (the value may be empty), plus an effect that describes what the taint does. Currently the taint effect supports the following three options:
NoSchedule: indicates that k8s will not schedule the Pod to the Node with this stain
PreferNoSchedule: indicates that k8s will try to avoid scheduling Pods to Nodes with this stain
NoExecute: k8s will not schedule the Pod onto a Node with this taint, and Pods already running on the Node will be evicted
-
View taints
[root@k8s-master01 diaodu]# kubectl describe nodes k8s-master01|grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
-
Set a taint
kubectl taint nodes k8s-node1 key=hu:NoExecute
After the taint is set, the Pods that were running on node1 have all disappeared (compare the Pod list before and after).
-
Remove a taint
kubectl taint nodes k8s-node1 key:NoExecute-
Describe the node again and the taint is gone.
4.3 Tolerations
In one sentence: if a Pod declares a toleration, it can be scheduled onto a node even if that node has a matching taint.
A tainted Node repels Pods according to the taint's effect (NoSchedule, PreferNoSchedule, NoExecute), so Pods will, to some degree, not be scheduled onto it. But we can set a toleration on a Pod, meaning the Pod tolerates the taint's existence and may be scheduled onto a Node that has it.
Example:
apiVersion: v1
kind: Pod
metadata:
  name: pod-3
  labels:
    app: pod-3
spec:
  containers:
  - name: pod-3
    image: wangyanglinux/myapp:v1
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "hu"
    effect: "NoExecute"
    tolerationSeconds: 3600   # the Pod is evicted only after 3600 seconds
The key, value, and effect must match the taint set on the Node.
If operator is Exists, the value is ignored.
tolerationSeconds describes how long a Pod that should be evicted may keep running on the node.
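The matching rules above can be sketched as a small predicate. This is illustrative only (the real logic lives in the scheduler and kubelet); the taint data mirrors the example taint key=hu:NoExecute set earlier.

```python
def tolerates(toleration, taint):
    """Does one toleration match one taint?"""
    # An omitted key (with Exists) tolerates every taint key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    # An omitted effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration["operator"] == "Exists":
        return True  # value is ignored
    return toleration.get("value") == taint["value"]  # operator Equal

taint = {"key": "key", "value": "hu", "effect": "NoExecute"}
print(tolerates({"key": "key", "operator": "Equal",
                 "value": "hu", "effect": "NoExecute"}, taint))  # True
print(tolerates({"operator": "Exists"}, taint))                  # True (see 4.3.1)
print(tolerates({"key": "key", "operator": "Exists"}, taint))    # True (see 4.3.2)
```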
4.3.1 When no key is specified, all taint keys are tolerated:
tolerations:
- operator: "Exists"
4.3.2 When no effect is specified, all taint effects are tolerated:
tolerations:
- key: "key"
  operator: "Exists"
4.3.3 When multiple Masters exist, to avoid wasting resources, you can set:
Pods are then scheduled on the master only as a last resort: they prefer other nodes, and fall back to the master when the worker nodes do not have enough capacity.
kubectl taint nodes k8s-master01 node-role.kubernetes.io/master=:PreferNoSchedule
Five Fixed nodes
5.1 Selecting by node hostname
Pod.spec.nodeName schedules the Pod directly onto the specified Node, skipping the Scheduler's scheduling policies; the match is mandatory.
[root@k8s-master01 diaodu]# cat pod5.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myweb
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: myweb
    spec:
      nodeName: k8s-node1
      containers:
      - name: myweb
        image: wangyanglinux/myapp:v1
        ports:
        - containerPort: 80
Then check the result: all three replicas run on k8s-node1.
5.2 Selecting by node labels
Set a label:
kubectl label node k8s-node2 disk=ssd
View labels:
kubectl get nodes --show-labels
Example:
[root@k8s-master01 diaodu]# cat label.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myweb
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: myweb
    spec:
      nodeSelector:
        disk: ssd   # must match the node's label
      containers:
      - name: myweb
        image: wangyanglinux/myapp:v1
        ports:
        - containerPort: 80
Check the result: all replicas run on k8s-node2.