Analysis of Kubernetes StatefulSet

The difference between StatefulSet and Deployment

"Deployment is used to deploy stateless services, and StatefulSet is used to deploy stateful services".

So which scenarios specifically call for a StatefulSet? The official recommendation is to use a StatefulSet if the application you deploy has one or more of the following requirements:

  • Stable, unique network identifiers.
  • Stable, persistent storage.
  • Ordered, graceful deployment and scaling.
  • Ordered, graceful deletion and termination.
  • Ordered, automated rolling updates.

Here "stable" mainly means that a Pod keeps its previous network identity and persistent storage after it is rescheduled. The network identity includes the Pod's hostname and its A record in the cluster DNS; it does not guarantee that the Pod IP stays the same after rescheduling. To keep the Pod IP fixed, we can use the stable Pod hostname in a custom IPAM to obtain a fixed Pod IP. With the StatefulSet's stable, unique network identity, it is easy to meet a fixed-Pod-IP requirement, whereas with a Deployment it would be much more complicated: you would have to consider the parameters that control the rolling update (maxSurge, maxUnavailable), the IP waste caused by reserving an IP pool, and so on.

Therefore, I want to add another StatefulSet usage scenario:

  • If you need to implement a fixed Pod IP solution, prefer StatefulSet.

Best Practices

  • The storage of the Pods in a StatefulSet is best created dynamically through a StorageClass: each Pod gets a PVC created from the VolumeClaimTemplate defined in the StatefulSet, and the PVC then automatically gets a corresponding PV created through the StorageClass and mounted into the Pod. For this to work, you need to create the StorageClass in advance. Alternatively, an administrator can create the PVs manually ahead of time, as long as the automatically created PVCs can match those PVs. (See the sketch after this list.)

  • For data safety, when Pods in a StatefulSet are deleted or the StatefulSet is scaled down, Kubernetes does not automatically delete the PVs corresponding to the StatefulSet, and by default these PVs cannot be bound by other PVCs. When you manually delete a PV after confirming that its data is no longer needed, whether the underlying data is actually removed depends on the PV's ReclaimPolicy, which supports the following three values:

    • Retain: the data must be cleaned up manually;

    • Recycle: equivalent to rm -rf /thevolume/*;

    • Delete: the default; the deletion is implemented by the backend storage system itself.

      Notice:

      • Currently only NFS and HostPath support Recycle;
      • EBS, GCE PD, Azure Disk, and OpenStack Cinder support Delete.
  • Delete the PVCs corresponding to a StatefulSet carefully. First make sure the Pods have fully terminated, and only consider deleting the PVs after confirming that the data in the volumes is no longer needed, because deleting a PVC may trigger automatic deletion of the corresponding PV, and depending on the reclaimPolicy configured in the StorageClass, the data in the volume may be lost.

  • Because we are deploying a stateful application, we need to create the corresponding Headless Service ourselves; note that its Label selector must match the Labels of the Pods in the StatefulSet (see the sketch after this list). Kubernetes creates SRV Records for the Headless Service that cover all backend Pods, and KubeDNS resolves among them using a round-robin algorithm.

  • In Kubernetes 1.8+, you must ensure that the StatefulSet's spec.selector matches .spec.template.metadata.labels, otherwise the StatefulSet creation will fail. Before Kubernetes 1.8, StatefulSet's spec.selector defaulted to .spec.template.metadata.labels if not specified.

  • Before scaling down a StatefulSet, make sure the corresponding Pods are Ready; otherwise, even if you trigger the scale-down, Kubernetes will not actually perform it.
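
To make these practices concrete, here is a minimal sketch of a StatefulSet together with its Headless Service and a StorageClass. All names (fast-ssd, nginx, web), the provisioner, the image, and the storage size are illustrative assumptions, not anything prescribed by Kubernetes; the point is the matching Labels, the clusterIP: None Service, the non-zero termination grace period, and the volumeClaimTemplates that drive dynamic PVC/PV creation.

```yaml
# Illustrative sketch only: names, provisioner, image and sizes are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                            # assumed StorageClass name
provisioner: kubernetes.io/aws-ebs          # use the provisioner of your own environment
reclaimPolicy: Retain                       # keep the data when a PV is released
---
apiVersion: v1
kind: Service
metadata:
  name: nginx                               # Headless Service for the StatefulSet
  labels:
    app: nginx
spec:
  clusterIP: None                           # headless: DNS records only, no virtual IP
  selector:
    app: nginx                              # must match the Pod Labels below
  ports:
  - name: web
    port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx                        # ties the Pods' DNS records to the Headless Service
  replicas: 3
  selector:
    matchLabels:
      app: nginx                            # must match .spec.template.metadata.labels (required in 1.8+)
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10     # never set this to 0 for stateful applications
      containers:
      - name: nginx
        image: nginx:1.21                   # assumed image
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:                     # one PVC per Pod: www-web-0, www-web-1, www-web-2
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 1Gi
```

Applied as-is, this would create Pods web-0, web-1 and web-2, each with its own PVC www-web-<ordinal> bound to a dynamically provisioned PV.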

How to understand stable network identity

The "stable network identifier" repeatedly emphasized in StatefulSet mainly refers to the hostname of Pods and the corresponding DNS Records.

  • HostName: the hostname of a StatefulSet's Pods is generated in the format $(statefulset name)-$(ordinal), where ordinal runs from 0 to N-1 (N being the desired number of replicas).
    • When the StatefulSet Controller creates a Pod, it adds a pod name label to it, statefulset.kubernetes.io/pod-name, whose value is set to the Pod's name (which is also used as the Pod's hostname).
    • What is the pod name label useful for? It lets us create a dedicated Service that matches exactly one specified Pod, which is convenient for debugging that Pod individually (see the sketch after this list).
  • DNS Records
    • DNS resolution of the Headless Service: $(service name).$(namespace).svc.cluster.local resolves, via DNS round-robin, to one of the backend Pods. The SRV Records only contain Pods that are Running and Ready; Pods that are not Ready are not included in the SRV Records.
    • DNS resolution of a Pod: $(hostname).$(service name).$(namespace).svc.cluster.local resolves to the Pod with that hostname.
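
As a concrete example of the pod name label, here is a minimal sketch of a Service that exposes a single Pod of the assumed StatefulSet web from the earlier sketch; the Service name and ports are also assumptions:

```yaml
# Illustrative: selects only the Pod web-0 via the label the StatefulSet Controller adds automatically.
apiVersion: v1
kind: Service
metadata:
  name: web-0-debug                              # assumed Service name
spec:
  selector:
    statefulset.kubernetes.io/pod-name: web-0    # label value equals the Pod name
  ports:
  - port: 80
    targetPort: 80
```

With the Headless Service nginx in the default namespace, the same Pod is also resolvable directly as web-0.nginx.default.svc.cluster.local.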

How to understand stable persistent storage

  • Each Pod corresponds to one PVC, whose name has the form $(volumeClaimTemplate.name)-$(pod name), so PVCs and Pods map one-to-one (see the illustration after this list).
  • When a Pod is rescheduled (in fact, recreated), the PV bound to its PVC is automatically mounted into the new Pod again.
  • Kubernetes creates N PVCs from the VolumeClaimTemplate (N being the desired number of replicas), and the PVCs automatically get PVs created according to the specified StorageClass.
  • When the StatefulSet is cascade-deleted, the corresponding PVCs are not deleted automatically; they must be deleted manually.
  • When the StatefulSet is cascade-deleted, or its Pods are deleted directly, the corresponding PVs are not deleted automatically either; you need to delete the PVs manually.
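
As an illustration of the naming rule above, and assuming the StatefulSet web with the volumeClaimTemplate www from the earlier sketch, the PVC that the controller generates for Pod web-0 looks roughly like this (you would not normally write it by hand):

```yaml
# Roughly what the StatefulSet Controller creates for Pod web-0 (illustrative).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: www-web-0                 # $(volumeClaimTemplate.name)-$(pod name)
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: fast-ssd      # assumed, carried over from the earlier sketch
  resources:
    requests:
      storage: 1Gi
```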

Differences from Deployment when deploying and scaling

  • When deploying a StatefulSet application with N replicas, the Pods are created strictly in increasing ordinal order from 0 to N-1, and each Pod is only created once its predecessor is Ready.
  • When deleting a StatefulSet application with N replicas, the Pods are deleted strictly in decreasing ordinal order from N-1 to 0, and each Pod is only deleted once the previously deleted Pod (the one with the higher ordinal) has been shut down and completely removed.
  • When scaling a StatefulSet up, each new Pod is only created once the previous Pod is Ready.
  • When scaling a StatefulSet down, a Pod is only deleted once the previously deleted Pod (the one with the higher ordinal) has been shut down and completely removed.
  • Note that pod.Spec.TerminationGracePeriodSeconds of a StatefulSet should not be set to 0; forcing immediate termination is unsafe for stateful applications.

What should I do if the Node network is abnormal?

  • Under normal circumstances, the StatefulSet Controller guarantees that, within the same namespace, the cluster never contains multiple StatefulSet Pods with the same network identity.
  • If this situation does occur, it can lead to fatal problems: the stateful application may stop working correctly, or data may even be lost.

So under what circumstances can there be multiple StatefulSet Pods with the same network identity in the same namespace? Consider the case where a Node's network becomes unreachable:

  • If you are using a Kubernetes version earlier than 1.5, when a Node's Condition is NetworkUnavailable, the node controller force-deletes the Pod objects on that Node from the apiserver, and the StatefulSet Controller then automatically recreates Pods with the same identity on other Ready Nodes. This is actually quite risky: for some period of time there may be multiple StatefulSet Pods with the same network identity, which can prevent the stateful application from working properly. So try not to use StatefulSets on versions earlier than Kubernetes 1.5, or only do so if you understand and accept the risk.
  • If you are using Kubernetes 1.5+, when a Node's Condition is NetworkUnavailable, the node controller does not force-delete the Pod objects on that Node from the apiserver; their state in the apiserver is marked as Terminating or Unknown, so the StatefulSet Controller does not recreate Pods with the same identity on other Nodes. Only once you have determined that the StatefulSet Pods on that Node have been shut down, or can no longer communicate with the other Pods of the StatefulSet, should you force-delete those unreachable Pod objects from the apiserver; the StatefulSet Controller can then recreate Pods with the same identity on other Ready Nodes, and the StatefulSet continues to work healthily.

So, on Kubernetes 1.5+, how do you force-delete those StatefulSet Pods from the apiserver? There are three approaches:

  • If the Node is permanently unable to reach the network or is shut down, you can be sure that the Pods on it cannot communicate with the other Pods and will not affect the availability of the StatefulSet application. In that case it is recommended to manually delete the NetworkUnavailable Node from the apiserver; Kubernetes will then automatically delete the Pod objects on it from the apiserver.
  • If the Node is unreachable because of a split-brain in the cluster network, it is recommended to troubleshoot the network problem and restore connectivity; since the Pods' state is already Terminating or Unknown, the kubelet will automatically delete these Pods once it gets this information from the apiserver.
  • In other cases, you may consider force-deleting these Pod objects directly from the apiserver. However, at that point you cannot be sure whether the corresponding Pods have really been shut down or whether they still affect the StatefulSet application, and a forced deletion may result in multiple StatefulSet Pods with the same network identity in the same namespace, so avoid this method whenever possible:
    • kubectl delete pods <pod> --grace-period=0 --force

A bit of background: there are currently six Node Condition types:

  • OutOfDisk: True if there is insufficient free space on the node for adding new pods, otherwise False.
  • Ready: True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last 40 seconds.
  • MemoryPressure: True if pressure exists on the node memory, i.e. the node memory is low; otherwise False.
  • DiskPressure: True if pressure exists on the disk size, i.e. the disk capacity is low; otherwise False.
  • NetworkUnavailable: True if the network for the node is not correctly configured, otherwise False.
  • ConfigOK: True if the kubelet is correctly configured, otherwise False.

Pod management strategy of StatefulSet

Since Kubernetes 1.7, StatefulSet supports a Pod Management Policy configuration (.spec.podManagementPolicy) with two options:

  • OrderedReady, the default: Pods are deployed, deleted, and scaled one by one, in order, as described above.
  • Parallel: all Pods of the StatefulSet are created or deleted in parallel, without waiting for the previous operation to succeed before moving on to the next Pod. In practice this policy is rarely needed.
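
A minimal sketch of how the policy is set; the field is .spec.podManagementPolicy, and everything else (names, image) is an assumption carried over from the earlier example:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx
  podManagementPolicy: Parallel   # the default is OrderedReady
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
```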

Update strategy for StatefulSet

The StatefulSet's update strategy (specified by .spec.updateStrategy.type) supports the following two values:

  • OnDelete: the controller does not update Pods proactively; a Pod is only recreated from the new template after you delete it manually. This preserves the pre-1.7 behavior.
  • RollingUpdate: the rolling update process is roughly the same as for a Deployment, with these differences:
    • It is equivalent to a Deployment with maxSurge=0 and maxUnavailable=1 (in fact, StatefulSet has neither of these two settings).
    • The rolling update is ordered (in reverse), proceeding one Pod at a time from ordinal N-1 down to 0; the next Pod is only recreated after the previously updated Pod is Ready, and the next Pod is only deleted after the previous one has been shut down and completely removed.
    • Partial rolling updates are supported, with some instances updated and the rest left untouched, by specifying an ordinal demarcation point via .spec.updateStrategy.rollingUpdate.partition:
      • All Pods with an ordinal greater than or equal to the partition value are rolled over.
      • All Pods with an ordinal less than the partition value remain unchanged; even if such a Pod is recreated, it is created from the original Pod template and is not updated to the latest version.
      • In particular, if the partition value is greater than the StatefulSet's desired replica count N, no Pods are rolled over at all.
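
A hedged sketch of the RollingUpdate strategy with a partition, reusing the assumed names from the earlier example; with partition: 2 and 5 replicas, only web-4, web-3 and web-2 are updated to the new template, while web-1 and web-0 keep the old one:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx
  replicas: 5
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2                # Pods with ordinal >= 2 are rolled over
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.22         # assumed new image that triggers the rolling update
```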

Thinking: what happens if a Pod fails to update while the StatefulSet is rolling over?
We will keep the answer as a teaser for now, and address it in the next article when we analyze the source code of the StatefulSet Controller.
