In-depth k8s: the Kubernetes DaemonSet controller and source code analysis

Recently I have been working overtime to deal with things in the project. The more problems I find, the more inadequate I feel, and the more I hope to keep learning. I think the meaning of life is to constantly seek breakthroughs.

This article talks about DaemonSet, with Job and CronJob to be covered together afterwards. When I discuss a topic I will also bring in related content so that readers can understand as much as possible, and starting from this article I will also add analysis of the main source code.

Daemon Pod has three main characteristics:

  1. This Pod runs on every node (Node) in the Kubernetes cluster;
  2. There is only one such Pod instance on each node;
  3. When a new node joins the Kubernetes cluster, the Pod will be automatically created on the new node; and when the old node is deleted, the Pod on it will be recycled accordingly.

Daemon Pods are commonly used for per-node agent components such as network plugin agents, log collectors, and monitoring agents.

Create a DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: mirrorgooglecontainers/fluentd-elasticsearch:v2.4.0
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

This DaemonSet manages Pods running the fluentd-elasticsearch image, which uses fluentd to forward logs from the Docker containers on each node to Elasticsearch.

In this DaemonSet, the selector selects and manages all Pods that carry the label name=fluentd-elasticsearch, and the template field defines the Pod template.

After this DaemonSet is created, the DaemonSet Controller fetches the list of all Nodes from etcd and traverses them, checking whether a Pod carrying the name=fluentd-elasticsearch label is running on each Node.

If there is no such Pod on a node, one is created there; if there is more than one such Pod on a node, the extra Pods are deleted.

Run it:

$ kubectl apply -f ds-els.yaml

Then check that the Pod is running:

$ kubectl get pod -n kube-system -l name=fluentd-elasticsearch

NAME                          READY   STATUS    RESTARTS   AGE
fluentd-elasticsearch-nwqph   1/1     Running   0          4m11s

Since my cluster has only a single node, only one Pod is running.

Then look at the DaemonSet object in the Kubernetes cluster:

$ kubectl get ds -n kube-system fluentd-elasticsearch
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
fluentd-elasticsearch   1         1         1       1            1           <none>          27m

Now let's look at the source code. The DaemonSet controller handles Pod creation and deletion through the manage method in daemon_controller.go:

The manage method first obtains the mapping between existing daemon Pods and nodes, then determines for each node whether it should run a daemon Pod. After traversing all nodes, the list of nodes that need a Pod created and the list of Pods that need to be deleted are handed to syncNodes for execution.

func (dsc *DaemonSetsController) manage(ds *apps.DaemonSet, nodeList []*v1.Node, hash string) error { 
    // Get the mapping between existing daemon pods and nodes
    nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
    if err != nil {
        return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
    }

    // Determine whether each node should run a daemon pod
    var nodesNeedingDaemonPods, podsToDelete []string
    for _, node := range nodeList {
        nodesNeedingDaemonPodsOnNode, podsToDeleteOnNode, err := dsc.podsShouldBeOnNode(
            node, nodeToDaemonPods, ds)

        if err != nil {
            continue
        }
        // Record the pods that need to be deleted and the nodes that need a pod created
        nodesNeedingDaemonPods = append(nodesNeedingDaemonPods, nodesNeedingDaemonPodsOnNode...)
        podsToDelete = append(podsToDelete, podsToDeleteOnNode...)
    }

    podsToDelete = append(podsToDelete, getUnscheduledPodsWithoutNode(nodeList, nodeToDaemonPods)...)

    // Create daemon pods on the corresponding nodes and delete the redundant pods
    if err = dsc.syncNodes(ds, podsToDelete, nodesNeedingDaemonPods, hash); err != nil {
        return err
    }

    return nil
}

Let's take a look at how the podsShouldBeOnNode method determines which Pods need to be created and deleted:

In podsShouldBeOnNode, the nodeShouldRunDaemonPod method is first called to determine whether the node should run the daemon Pod (shouldRun) and whether a daemon Pod already on the node should be allowed to keep running (shouldContinueRunning). The method then looks up whether a daemon Pod has already been created on that node.

Based on shouldRun and shouldContinueRunning, it builds the list of nodes that need a daemon Pod created and the list of Pods that need to be deleted.

func (dsc *DaemonSetsController) podsShouldBeOnNode(
    node *v1.Node,
    nodeToDaemonPods map[string][]*v1.Pod,
    ds *apps.DaemonSet,
) (nodesNeedingDaemonPods, podsToDelete []string, err error) {
    // Determine whether this node should run the daemon pod and whether it can be scheduled successfully
    shouldRun, shouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(node, ds)
    if err != nil {
        return
    }
    // Get the list of this DaemonSet's pods on the node
    daemonPods, exists := nodeToDaemonPods[node.Name]

    switch {
    // If the daemon pod should run on this node but has not been created yet, create one
    case shouldRun && !exists: 
        nodesNeedingDaemonPods = append(nodesNeedingDaemonPods, node.Name)
    // The pod should keep running on this node
    case shouldContinueRunning: 
        var daemonPodsRunning []*v1.Pod
        for _, pod := range daemonPods {
            if pod.DeletionTimestamp != nil {
                continue
            }
            // If the pod's phase is Failed, delete it
            if pod.Status.Phase == v1.PodFailed { 
                ...
                podsToDelete = append(podsToDelete, pod.Name)
            } else {
                daemonPodsRunning = append(daemonPodsRunning, pod)
            }
        } 
        // If more than one daemon pod is running on the node, keep the oldest and delete the rest
        if len(daemonPodsRunning) > 1 {
            sort.Sort(podByCreationTimestampAndPhase(daemonPodsRunning))
            for i := 1; i < len(daemonPodsRunning); i++ {
                podsToDelete = append(podsToDelete, daemonPodsRunning[i].Name)
            }
        }
    // If the pod should not continue running but already exists, delete it
    case !shouldContinueRunning && exists: 
        for _, pod := range daemonPods {
            if pod.DeletionTimestamp != nil {
                continue
            }
            podsToDelete = append(podsToDelete, pod.Name)
        }
    }

    return nodesNeedingDaemonPods, podsToDelete, nil
}

Like the StatefulSet, the DaemonSet's update behavior can be set through .spec.updateStrategy.type. Two strategies are currently supported:

  • OnDelete: after the template is updated, a new Pod is created on a node only after the old Pod there has been manually deleted;
  • RollingUpdate: the default strategy for apps/v1; after the DaemonSet template is updated, old Pods are automatically deleted and new Pods are created.

For the details of rolling updates, see the earlier article: In-depth k8s: the Kubernetes StatefulSet controller and source code analysis.
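
As a hedged sketch, the strategy is set under .spec.updateStrategy in the DaemonSet manifest; for the fluentd-elasticsearch example above it could look like this (the maxUnavailable value of 1 is just an illustrative choice):

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # replace the daemon Pod on at most one node at a time
      maxUnavailable: 1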

Only run Pods on certain nodes

If you want a DaemonSet to run its Pods only on specific nodes, you can use nodeAffinity.

For example:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - node1

For the Pod above, we specify nodeAffinity. The matchFields term means that this Pod can only run on the node whose metadata.name is node1, and operator: In means the field's value must be one of the listed values. For label-based matchExpressions terms, the operator can also be NotIn, Exists, DoesNotExist, Gt, or Lt.

requiredDuringSchedulingIgnoredDuringExecution describes rules that must be met for the Pod to be scheduled onto a node. There is also preferredDuringSchedulingIgnoredDuringExecution, which the scheduler tries to satisfy but will still place the Pod on a node that does not meet the rule.
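
For comparison, here is a minimal sketch of the preferred form. The disktype node label is a hypothetical example; substitute a label that actually exists on your nodes:

apiVersion: v1
kind: Pod
metadata:
  name: with-preferred-node-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1                 # relative preference, 1-100
        preference:
          matchExpressions:
          - key: disktype         # hypothetical node label
            operator: In
            values:
            - ssd
  containers:
  - name: with-preferred-node-affinity
    image: busybox:latest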

When we use the following command:

$ kubectl edit pod -n kube-system fluentd-elasticsearch-nwqph

...
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - node1
...

You can see that the DaemonSet controller automatically added a nodeAffinity rule to pin the Pod to its node. We can also set affinity in the YAML ourselves to override this default.
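
For instance, here is a hedged sketch of restricting the fluentd-elasticsearch DaemonSet above to nodes carrying a hypothetical logging=enabled label, by adding affinity to the Pod template in its spec:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: logging      # hypothetical node label
                operator: In
                values:
                - enabled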

Taints and Tolerations

In a k8s cluster, we can add taints to a Node so that Pods avoid unsuitable nodes. Once one or more taints are set on a node, a Pod cannot run on that node unless it explicitly declares that it tolerates those taints.

For example:

kubectl taint nodes node1 key=value:NoSchedule

This puts a taint on node1, which prevents Pods from being scheduled onto node1.

If you want to remove this taint, you can do so like this:

kubectl taint nodes node1 key:NoSchedule-

If we want a Pod to run on a tainted node, we need to declare a toleration on the Pod, indicating that it can tolerate the node's taint.

For example, we can declare the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: pod-taints
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  containers:
    - name: pod-taints
      image: busybox:latest

The operator here can be Exists, in which case no value needs to be specified, or Equal, which means the toleration's value must be equal to the taint's value.

NoSchedule means that if a Pod does not declare that it tolerates this taint, the system will not schedule the Pod onto nodes carrying this taint. Besides NoSchedule, the effect can also be PreferNoSchedule, meaning that if a Pod does not tolerate this taint, the system tries to avoid scheduling it onto the node, but this is not mandatory.
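
For example, a minimal sketch of an Exists toleration that matches the taint added earlier (kubectl taint nodes node1 key=value:NoSchedule) regardless of its value:

tolerations:
- key: "key"
  operator: "Exists"     # tolerate the taint no matter what its value is
  effect: "NoSchedule"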

In the fluentd-elasticsearch DaemonSet above, we added the following:

tolerations:
- key: node-role.kubernetes.io/master
  effect: NoSchedule

This is because, by default, a Kubernetes cluster does not allow users to deploy Pods on the master node: the master node carries a taint named node-role.kubernetes.io/master. So, to be able to deploy the DaemonSet Pod on the master node as well, the Pod must tolerate this taint.

Origin blog.csdn.net/weixin_45784983/article/details/109100066