[Cloud Native] K8S Cluster

1. Scheduling constraints

Kubernetes uses the List-Watch mechanism so that its components cooperate to keep data synchronized while remaining decoupled from each other.

1.1 Pod creation process


(1) There are three List-Watch clients here: the Controller Manager (running on the Master), the Scheduler (running on the Master), and the kubelet (running on each Node). After their processes start, they all watch for events sent by the APIServer.

(2) A user submits a request to the APIServer through kubectl or another API client to create a Pod object.

(3) The APIServer attempts to store the Pod object's metadata in etcd. After the write operation completes, the APIServer returns a confirmation to the client.

(4) When etcd accepts the Pod creation information, it sends a Create event to the APIServer.

(5) Because the Controller Manager has been watching the APIServer (over HTTPS on port 6443), the APIServer forwards the Create event it receives to the Controller Manager.

(6) After receiving the Create event, the Controller Manager calls the Replication Controller to ensure the required number of replicas is created on the Nodes. Whenever the number of replicas is less than the number defined in the RC, the RC automatically creates replicas. In short, it is the controller that guarantees the replica count (i.e., it is responsible for scaling out and in).

(7) After the Controller Manager creates the Pod replicas, the APIServer records the Pod's details in etcd, for example the number of replicas and the contents of the containers.

(8) etcd in turn sends the Pod creation information to the APIServer through events.

(9) Because the Scheduler is watching the APIServer, it plays the role of "connecting the upper and lower layers" in the system. "Connecting the upper layer" means it receives the created Pod events and assigns Nodes to them; "connecting the lower layer" means that after placement is complete, the kubelet process on the Node takes over the subsequent work and is responsible for the "second half" of the Pod's life cycle. In other words, the Scheduler binds the Pods to be scheduled to Nodes in the cluster according to its scheduling algorithms and policies.

(10) After scheduling is completed, the Scheduler updates the Pod information, which is now richer: in addition to the number of replicas and their contents, it also records which Node the Pod will be deployed on. The Scheduler sends the updated Pod information to the APIServer, which persists it to etcd.

(11) etcd sends the update-success event to the APIServer, and the APIServer then reflects the scheduling result of this Pod object.

(12) The kubelet is a process running on each Node. It also watches, through List-Watch (over HTTPS on port 6443), for Pod update events sent by the APIServer. The kubelet calls Docker on its node to start the containers and reports the resulting status of the Pod and its containers back to the APIServer.

(13) The APIServer stores the Pod status information in etcd. After etcd confirms that the write succeeded, the APIServer sends a confirmation to the relevant kubelet, through which the event is acknowledged.

#Note: Now that the Pod has been created, why does the kubelet keep watching? The reason is simple: if kubectl later issues a command to scale the number of Pod replicas, the above process is triggered again and the kubelet adjusts the Node's resources according to the latest Pod deployment. Or the number of replicas may stay the same while the image is upgraded, in which case the kubelet automatically pulls the latest image and loads it.
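
The watch stream that these components consume can also be observed from the command line. This is a minimal sketch for verifying the flow above (the deployment name nginx-test and the nginx image are illustrative, not part of the original example):

#In one terminal, watch cluster events; the Scheduled, Pulled, Created and Started events from the steps above will appear here
kubectl get events --watch

#In another terminal, create a Deployment and watch its Pods get scheduled and started
kubectl create deployment nginx-test --image=nginx --replicas=2
kubectl get pods -l app=nginx-test -o wide --watch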

1.2 Scheduling process

  • The Scheduler is the scheduler of Kubernetes. Its main task is to assign the defined Pods to the nodes of the cluster. The main issues it considers are as follows:

●Fairness: ensure that every node can be allocated resources
●Efficient resource utilization: all resources in the cluster are used as fully as possible
●Efficiency: scheduling performance should be good, so large numbers of Pods can be scheduled as quickly as possible
●Flexibility: allow users to control the scheduling logic according to their own needs

  • The Scheduler runs as a separate program. After it starts, it keeps watching the APIServer, obtains the Pods whose spec.nodeName is empty, and creates a binding for each of them to indicate which node the Pod should be placed on.

  • Scheduling is divided into several stages: first, nodes that do not meet the conditions are filtered out; this stage is called the predicate (filtering) stage. Then the remaining nodes are ranked by priority; this is the priorities (ranking) stage. Finally, the node with the highest priority is selected. If any of the intermediate steps returns an error, the error is returned directly.

The predicate stage has a series of common algorithms that can be used:

●PodFitsResources: whether the remaining resources on the node are greater than the resources requested by the Pod.
●PodFitsHost: if the Pod specifies a NodeName, whether the node's name matches that NodeName.
●PodFitsHostPorts: whether the ports already in use on the node conflict with the ports requested by the Pod.
●PodSelectorMatches: filters out nodes that do not match the labels specified by the Pod.
●NoDiskConflict: the volumes already mounted on the node do not conflict with the volumes specified by the Pod, unless both are read-only.

  • If no suitable node is found during the predicate stage, the Pod remains in the Pending state and scheduling keeps being retried until some node meets the conditions. After this step, if multiple nodes meet the conditions, the priorities stage continues: the nodes are ranked by priority.

A priority consists of a series of key-value pairs, where the key is the name of the priority item and the value is its weight (the importance of that item). Common priority options include:

●LeastRequestedPriority: the weight is determined by computing CPU and memory usage; the lower the usage, the higher the weight. In other words, this priority favors nodes with a lower ratio of resource usage.
●BalancedResourceAllocation: the closer the CPU and memory usage on a node are to each other, the higher the weight. This is generally used together with the previous priority rather than alone. For example, if node01's CPU:Memory usage is 20:60 and node02's is 50:50, then although node01's total usage is lower, node02's CPU and memory usage are closer to each other, so node02 is preferred during scheduling.
●ImageLocalityPriority: favors nodes that already have the images the Pod wants to use; the larger the total size of those images, the higher the weight.

  • All the priority items and their weights are processed by the algorithm to obtain the final result.
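
As a concrete illustration of where these predicate and priority functions look, here is a minimal Pod sketch (the name resource-demo and the request/limit values are illustrative): PodFitsResources filters out nodes whose free capacity is smaller than the requests, and LeastRequestedPriority / BalancedResourceAllocation then rank the remaining nodes by how lightly and how evenly they would be loaded.

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: soscscs/myapp:v1
    resources:
      requests:            #PodFitsResources: nodes with less free CPU/memory than this are filtered out
        cpu: "250m"
        memory: "128Mi"
      limits:              #runtime upper bound; not used for the scheduling decision
        cpu: "500m"
        memory: "256Mi"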

2. Scheduling to a designated node

  • pod.spec.nodeName schedules the Pod directly onto the specified Node, skipping the Scheduler's scheduling policy. This matching rule is a mandatory match.
vim myapp.yaml
apiVersion: apps/v1  
kind: Deployment  
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      nodeName: node01
      containers:
      - name: myapp
        image: soscscs/myapp:v1
        ports:
        - containerPort: 80
		
kubectl apply -f myapp.yaml

kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
myapp-6bc58d7775-6wlpp   1/1     Running   0          14s   10.244.1.25   node01   <none>           <none>
myapp-6bc58d7775-szcvp   1/1     Running   0          14s   10.244.1.26   node01   <none>           <none>
myapp-6bc58d7775-vnxlp   1/1     Running   0          14s   10.244.1.24   node01   <none>           <none>


View the detailed events (note that the Pods were not assigned by the scheduler)

kubectl describe pod myapp-6bc58d7775-6wlpp
......
 Type    Reason   Age   From             Message
  ----    ------   ----  ----             -------
  Normal  Pulled   95s   kubelet, node01  Container image "soscscs/myapp:v1" already present on machine
  Normal  Created  99s   kubelet, node01  Created container nginx
  Normal  Started  99s   kubelet, node01  Started container nginx


2.1 Select nodes by label

  • pod.spec.nodeSelector: selects nodes through the label-selector mechanism of Kubernetes. The scheduler matches the labels against its scheduling policy and then schedules the Pod onto the target node. This matching rule is a mandatory constraint.
//Get help for the label command
kubectl label --help
Usage:
  kubectl label [--overwrite] (-f FILENAME | TYPE NAME) KEY_1=VAL_1 ... KEY_N=VAL_N [--resource-version=version] [options]

//Get the NAME of each node
kubectl get node
NAME     STATUS   ROLES    AGE   VERSION
master   Ready    master   30h   v1.20.11
node01   Ready    <none>   30h   v1.20.11
node02   Ready    <none>   30h   v1.20.11

//Label the corresponding nodes with kgc=a and kgc=b respectively
kubectl label nodes node01 kgc=a

kubectl label nodes node02 kgc=b

//View the labels
kubectl get nodes --show-labels
NAME     STATUS   ROLES    AGE   VERSION   LABELS
master   Ready    master   30h   v1.20.11   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
node01   Ready    <none>   30h   v1.20.11   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kgc=a,kubernetes.io/arch=amd64,kubernetes.io/hostname=node01,kubernetes.io/os=linux
node02   Ready    <none>   30h   v1.20.11   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kgc=b,kubernetes.io/arch=amd64,kubernetes.io/hostname=node02,kubernetes.io/os=linux

//Change to the nodeSelector scheduling method
vim myapp1.yaml
apiVersion: apps/v1
kind: Deployment  
metadata:
  name: myapp1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp1
  template:
    metadata:
      labels:
        app: myapp1
    spec:
      nodeSelector:
        kgc: a
      containers:
      - name: myapp1
        image: soscscs/myapp:v1
        ports:
        - containerPort: 80


kubectl apply -f myapp1.yaml 

kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
myapp1-58cff4d75-52xm5   1/1     Running   0          24s   10.244.1.29   node01   <none>           <none>
myapp1-58cff4d75-f747q   1/1     Running   0          24s   10.244.1.27   node01   <none>           <none>
myapp1-58cff4d75-kn8gk   1/1     Running   0          24s   10.244.1.28   node01   <none>           <none>

//View the detailed events (the events show that the Pod was first assigned by the scheduler)
kubectl describe pod myapp1-58cff4d75-52xm5
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  57s   default-scheduler  Successfully assigned default/myapp1-58cff4d75-52xm5 to node01
  Normal  Pulled     57s   kubelet, node01    Container image "soscscs/myapp:v1" already present on machine
  Normal  Created    56s   kubelet, node01    Created container myapp1
  Normal  Started    56s   kubelet, node01    Started container myapp1




//To modify the value of an existing label, add the --overwrite flag
kubectl label nodes node02 kgc=a --overwrite

//To delete a label, just append the label's key name followed by a minus sign at the end of the command:
kubectl label nodes node02 kgc-

//Query nodes by the specified label
kubectl get node -l kgc=a

3. Affinity

https://kubernetes.io/zh/docs/concepts/scheduling-eviction/assign-pod-node/

(1) Node affinity

pod.spec.nodeAffinity
●preferredDuringSchedulingIgnoredDuringExecution: soft policy (preference)
●requiredDuringSchedulingIgnoredDuringExecution: hard policy (requirement)

(2) Pod affinity

pod.spec.affinity.podAffinity/podAntiAffinity
●preferredDuringSchedulingIgnoredDuringExecution: soft policy (preference)
●requiredDuringSchedulingIgnoredDuringExecution: hard policy (requirement)

You can think of yourself as a Pod. When you sign up to learn cloud computing, if you prefer to go to the class taught by teacher Zhangsan, and you treat the classes taught by different teachers as nodes, that is node affinity. If you absolutely must attend teacher Zhangsan's class, that is a hard policy; if you merely say you would prefer to attend teacher Zhangsan's class, that is a soft policy.
If you have a very good friend named Lisi and you tend to want to be in the same class as Lisi, that is Pod affinity. If you absolutely must be in Lisi's class, that is a hard policy; if you merely prefer to be in Lisi's class, that is a soft policy. With a soft policy you will still attend even if the preference cannot be satisfied, while with a hard policy you simply will not attend if the requirement cannot be met.

Key-value operators

●In: the label's value is in a given list
●NotIn: the label's value is not in a given list
●Gt: the label's value is greater than a given value
●Lt: the label's value is less than a given value
●Exists: a given label exists
●DoesNotExist: a given label does not exist
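
As a sketch of how these operators are written in practice (the numeric node label cpu-cores is hypothetical; Gt/Lt are only valid for node affinity and compare the label value as an integer):

apiVersion: v1
kind: Pod
metadata:
  name: operator-demo
spec:
  containers:
  - name: app
    image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kgc             #Exists only checks that the label key is present; no values list is given
            operator: Exists
          - key: cpu-cores       #hypothetical numeric node label
            operator: Gt         #the node's label value must be an integer greater than 4
            values:
            - "4"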

kubectl get nodes --show-labels
NAME     STATUS   ROLES    AGE   VERSION   LABELS
master   Ready    master   11d   v1.20.11   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
node01   Ready    <none>   11d   v1.20.11   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node01,kubernetes.io/os=linux
node02   Ready    <none>   11d   v1.20.11   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node02,kubernetes.io/os=linux

3.1 requiredDuringSchedulingIgnoredDuringExecution: hard policy

mkdir /opt/affinity
cd /opt/affinity

vim pod1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname    #specify the node label
            operator: NotIn     #schedule the Pod onto nodes whose kubernetes.io/hostname label value is NOT in the values list
            values:
            - node02
			

kubectl apply -f pod1.yaml

kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
affinity   1/1     Running   0          13s   10.244.1.30   node01   <none>           <none>

kubectl delete pod --all && kubectl apply -f pod1.yaml && kubectl get pods -o wide

#If the hard policy cannot be satisfied, the Pod will stay in the Pending state.

3.2 preferredDuringSchedulingIgnoredDuringExecution: soft policy

vim pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1   #if there are multiple soft-policy options, the higher the weight, the higher the priority
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node03


kubectl apply -f pod2.yaml

kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
affinity   1/1     Running   0          5s    10.244.2.35   node02   <none>           <none>

//If the value under values: is changed to node01, the Pod will preferentially be created on node01
kubectl delete pod --all && kubectl apply -f pod2.yaml && kubectl get pods -o wide

//If a hard policy and a soft policy are used together, the hard policy must be satisfied first before the soft policy is considered
//Example:
apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   #satisfy the hard policy first: exclude nodes with the label kubernetes.io/hostname=node02
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - node02
      preferredDuringSchedulingIgnoredDuringExecution:  #then satisfy the soft policy: prefer nodes with the label kgc=a
      - weight: 1
        preference:
          matchExpressions:
          - key: kgc
            operator: In
            values:
            - a

3.3 Pod affinity and anti-affinity

Scheduling policy | Match target | Operators                               | Topology domain support | Scheduling goal
nodeAffinity      | Node (host)  | In, NotIn, Exists, DoesNotExist, Gt, Lt | No                      | Schedule onto the specified host
podAffinity       | Pod          | In, NotIn, Exists, DoesNotExist         | Yes                     | The Pod is placed in the same topology domain as the specified Pod
podAntiAffinity   | Pod          | In, NotIn, Exists, DoesNotExist         | Yes                     | The Pod is not placed in the same topology domain as the specified Pod

kubectl label nodes node01 kgc=a
kubectl label nodes node02 kgc=a

//Create a Pod labeled app=myapp01
vim pod3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp01
  labels:
    app: myapp01
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1
	

kubectl apply -f pod3.yaml

kubectl get pods --show-labels -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES   LABELS
myapp01   1/1     Running   0          37s   10.244.2.3   node01   <none>           <none>            app=myapp01

//Use Pod affinity scheduling to create multiple Pod resources
vim pod4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp02
  labels:
    app: myapp02
spec:
  containers:
  - name: myapp02
    image: soscscs/myapp:v1
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myapp01
        topologyKey: kgc
		
#The Pod can be scheduled onto a node only if that node is in the same topology domain as at least one already-running Pod that has a label with key "app" and value "myapp01". (More precisely, the Pod is eligible to run on node N if node N has a label with key kgc and some value V, and at least one node in the cluster with key kgc and value V is running a Pod with the label app=myapp01.)
#topologyKey is the key of a node label. If two nodes are labeled with this key and have the same label value, the scheduler treats them as being in the same topology domain. The scheduler tries to place a balanced number of Pods in each topology domain.
#If the values of kgc differ, the nodes are in different topology domains. For example, if Pod1 is on a node with kgc=a, Pod2 is on a node with kgc=b, and Pod3 is on a node with kgc=a, then Pod2 is not in the same topology domain as Pod1 and Pod3, while Pod1 and Pod3 are in the same topology domain.

kubectl apply -f pod4.yaml

kubectl get pods --show-labels -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES   LABELS
myapp01   1/1     Running   0          15m   10.244.1.3   node01   <none>           <none>            app=myapp01
myapp02   1/1     Running   0          8s    10.244.1.4   node01   <none>           <none>            app=myapp02
myapp03   1/1     Running   0          52s   10.244.2.53  node02   <none>           <none>            app=myapp03
myapp04   1/1     Running   0          44s   10.244.1.51  node01   <none>           <none>            app=myapp03
myapp05   1/1     Running   0          38s   10.244.2.54  node02   <none>           <none>            app=myapp03
myapp06   1/1     Running   0          30s   10.244.1.52  node01   <none>           <none>            app=myapp03
myapp07   1/1     Running   0          24s   10.244.2.55  node02   <none>           <none>            app=myapp03


3.4 Use Pod anti-affinity scheduling

Example 1


vim pod5.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp10
  labels:
    app: myapp10
spec:
  containers:
  - name: myapp10
    image: soscscs/myapp:v1
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - myapp01
          topologyKey: kubernetes.io/hostname

#If a node is in the same topology domain as a Pod that has a label with key "app" and value "myapp01", this Pod should not be scheduled onto that node. (When topologyKey is kubernetes.io/hostname, this means that if a node is in the same topology domain as a Pod labeled app=myapp01, the new Pod cannot be scheduled onto that node.)

kubectl apply -f pod5.yaml

kubectl get pods --show-labels -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES   LABELS
myapp01   1/1     Running   0          44m   10.244.1.3   node01   <none>           <none>            app=myapp01
myapp02   1/1     Running   0          29m   10.244.1.4   node01   <none>           <none>            app=myapp02
myapp10   1/1     Running   0          75s   10.244.2.4   node02   <none>           <none>            app=myapp03


Example 2:

vim pod6.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp20
  labels:
    app: myapp20
spec:
  containers:
  - name: myapp20
    image: soscscs/myapp:v1
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myapp01
        topologyKey: kgc
		
//Because node01, where the specified Pod runs, has a label with key kgc and value a, and node02 also has the kgc=a label, node01 and node02 are in the same topology domain. Anti-affinity requires the new Pod not to be in the same topology domain as the specified Pod, so there is no available node for the new Pod and it stays in the Pending state.
kubectl get pod --show-labels -owide
NAME          READY   STATUS    RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES   LABELS
myapp01       1/1     Running   0          43s     10.244.1.68   node01   <none>           <none>            app=myapp01
myapp20       0/1     Pending   0          4s      <none>        <none>   <none>           <none>            app=myapp03

kubectl label nodes node02 kgc=b --overwrite

kubectl get pod --show-labels -o wide
NAME          READY   STATUS    RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES   LABELS
myapp01       1/1     Running   0          7m40s   10.244.1.68   node01   <none>           <none>            app=myapp01
myapp21       1/1     Running   0          7m1s    10.244.2.65   node02   <none>           <none>            app=myapp03


4. Taints and Tolerations

4.1 Taint

Node affinity is a property of Pods (a preference or a hard requirement) that attracts Pods to a particular class of nodes. A taint, on the other hand, enables a node to repel a particular class of Pods.
Taints and tolerations work together to prevent Pods from being scheduled onto inappropriate nodes. One or more taints can be applied to a node, which means the node will not accept Pods that cannot tolerate those taints. If tolerations are applied to Pods, it means those Pods can (but are not required to) be scheduled onto nodes with matching taints.

Use the kubectl taint command to set a taint on a Node. After a taint is set, there is an exclusive relationship between the Node and Pods, which allows the Node to refuse to schedule and run Pods, and even to evict Pods that are already on the Node.

A taint has the following format:

key=value:effect

Each taint has a key and a value as its label, where the value can be empty, and effect describes the effect of the taint.

Currently, the taint effect supports the following three options:

●NoSchedule: Kubernetes will not schedule the Pod onto the Node with this taint
●PreferNoSchedule: Kubernetes will try to avoid scheduling the Pod onto the Node with this taint
●NoExecute: Kubernetes will not schedule the Pod onto the Node with this taint, and Pods already running on the Node will also be evicted

kubectl get nodes
NAME     STATUS   ROLES    AGE   VERSION
master   Ready    master   11d   v1.20.11
node01   Ready    <none>   11d   v1.20.11
node02   Ready    <none>   11d   v1.20.11

//It is precisely because the master has the NoSchedule taint that k8s does not schedule Pods onto the master node
kubectl describe node master
......
Taints:             node-role.kubernetes.io/master:NoSchedule


#Set a taint
kubectl taint node node01 key1=value1:NoSchedule

#In the node description, look for the Taints field
kubectl describe node node-name  

#Remove the taint
kubectl taint node node01 key1:NoSchedule-


kubectl get pods -o wide
NAME      READY   STATUS    RESTARTS   AGE     IP           NODE     NOMINATED NODE   READINESS GATES
myapp01   1/1     Running   0          4h28m   10.244.2.3   node02   <none>           <none>
myapp02   1/1     Running   0          4h13m   10.244.2.4   node02   <none>           <none>
myapp03   1/1     Running   0          3h45m   10.244.1.4   node01   <none>           <none>

kubectl taint node node02 check=mycheck:NoExecute

//Check the Pod status; all the Pods on node02 have been evicted (Note: for Deployment or StatefulSet resource types, new Pods will be created on other Nodes to maintain the replica count)
kubectl get pods -o wide
NAME      READY   STATUS    RESTARTS   AGE     IP           NODE     NOMINATED NODE   READINESS GATES
myapp03   1/1     Running   0          3h48m   10.244.1.4   node01   <none>           <none>


4.2 Tolerations

Because of the taint effects NoSchedule, PreferNoSchedule and NoExecute, a tainted Node has a mutually exclusive relationship with Pods, and Pods will, to varying degrees, not be scheduled onto that Node. However, we can set tolerations on Pods, meaning that a Pod with a toleration can tolerate the taint and can be scheduled onto a Node that has it.

kubectl taint node node01 check=mycheck:NoExecute

vim pod3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp01
  labels:
    app: myapp01
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1
	
kubectl apply -f pod3.yaml

//With taints set on both Nodes, the Pod cannot be created successfully
kubectl get pods -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
myapp01   0/1     Pending   0          17s   <none>   <none>   <none>           <none>

vim pod3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp01
  labels:
    app: myapp01
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1
  tolerations:
  - key: "check"
    operator: "Equal"
    value: "mycheck"
    effect: "NoExecute"
    tolerationSeconds: 3600
	
#The key, value, and effect must match the taint set on the Node
#If operator is Exists, the value is ignored; the taint key only needs to exist
#tolerationSeconds describes how long the Pod can keep running on the Node when it needs to be evicted

kubectl apply -f pod3.yaml

//After the toleration is set, the Pod is created successfully
kubectl get pods -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES
myapp01   1/1     Running   0          10m   10.244.1.5   node01   <none>           <none>


//Other notes
(1) When no key is specified, all taint keys are tolerated
  tolerations:
  - operator: "Exists"
  
(2) When no effect is specified, all taint effects are tolerated
  tolerations:
  - key: "key"
    operator: "Exists"

(3) When there are multiple Masters, to avoid wasting resources you can set:
kubectl taint node Master-Name node-role.kubernetes.io/master=:PreferNoSchedule

//If a Node needs to update or upgrade system components, to prevent a long service interruption you can first set a NoExecute taint on that Node to evict all of its Pods
kubectl taint node node01 check=mycheck:NoExecute

//If the other Nodes do not have enough resources at this point, you can temporarily set a PreferNoSchedule taint on the Master so that Pods can temporarily be created on the Master
kubectl taint node master node-role.kubernetes.io/master=:PreferNoSchedule

//After the update operations on all Nodes are complete, remove the taint
kubectl taint node node01 check=mycheck:NoExecute-


//cordon and drain
##Performing maintenance operations on a node:
kubectl get nodes

//Mark the Node as unschedulable so that newly created Pods will not run on this Node
kubectl cordon <NODE_NAME> 		 #the node will change to the SchedulingDisabled state

//kubectl drain makes the Node release all of its Pods and stop accepting new ones. drain literally means "to drain water"; here it means moving the Pods off the problematic Node to run on other Nodes
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --force

--ignore-daemonsets: ignore Pods managed by a DaemonSet.
--delete-emptydir-data: if there are Pods mounting local volumes (emptyDir), they will be forcibly killed.
--force: forcibly release Pods that are not managed by a controller.

Note: executing the drain command automatically does two things:
(1) marks the node as unschedulable (cordon)
(2) evicts the Pods on it

//kubectl uncordon marks the Node as schedulable again
kubectl uncordon <NODE_NAME>

5. Pod startup phase (phase)

  • After a Pod is created and before it runs stably, there are many steps in between and many possibilities for error, so it can be in many different states.

Generally speaking, the Pod lifecycle includes the following steps:
(1) Scheduling to a node: Kubernetes selects a node based on a certain priority algorithm and uses it to run the Pod
(2) Pulling the image
(3) Mounting storage, configuration, and so on
(4) Running the containers; if a health check is configured, the status is set according to the check results

// The possible states of phase are:

●Pending: the APIServer has created the Pod resource object and stored it in etcd, but the Pod has not been scheduled yet (for example, it has not been scheduled to a node), or it is still downloading the image from the registry.

●Running: the Pod has been scheduled to a node, and all containers in the Pod have been created by the kubelet. At least one container is running, or is in the process of starting or restarting (note that a Pod in the Running state may still not be accessible).

●Succeeded: some Pods are not long-running, for example Jobs and CronJobs. After some time, all containers in the Pod terminate successfully and are not restarted. This state reports the result of task execution.

●Failed: all containers in the Pod have terminated, and at least one container terminated due to failure, i.e. it exited with a non-zero status or was killed by the system, for example because its command was written incorrectly.

●Unknown: the status of the Pod cannot be read, usually because kube-controller-manager cannot communicate with the Pod: the Node the Pod is on has a problem or has lost its connection, so the Pod's status is Unknown.
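
A quick way to check which phase a Pod is in (the Pod name myapp01 is taken from the earlier examples):

#Print only the phase of a single Pod
kubectl get pod myapp01 -o jsonpath='{.status.phase}'

#List all Pods with their phase and node
kubectl get pods -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.spec.nodeName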

How to delete a Pod in Unknown status?

●Remove the problematic Node from the cluster. In public clouds, kube-controller-manager automatically deletes the corresponding Node after the VM is deleted. In a cluster deployed on physical machines, the administrator needs to delete the Node manually (kubectl delete node <node_name>).

●Passively wait for the Node to return to normal. The kubelet will communicate with kube-apiserver again to confirm the expected state of these Pods, and then decide whether to delete them or keep them running.

●Proactively delete the Pod by executing kubectl delete pod <pod_name> --grace-period=0 --force to force deletion. Note, however, that this method is not recommended unless it is clearly known that the Pod has really stopped (for example, the VM or physical machine the Node runs on has been shut down). Especially for Pods managed by a StatefulSet, forced deletion can easily lead to problems such as split-brain or data loss.

Troubleshooting steps:

//View the Pod's events
kubectl describe TYPE NAME_PREFIX  

//View the Pod's logs (when in the Failed state)
kubectl logs <POD_NAME> [-c Container_NAME]

//Enter the Pod (status is Running, but the service is not being provided)
kubectl exec -it <POD_NAME> bash

//View cluster information
kubectl get nodes

//Confirm that the cluster status is normal
kubectl cluster-info

//Check the kubelet logs
journalctl -xefu kubelet

Origin blog.csdn.net/wang_dian1/article/details/132212301