Understanding the K8s persistent storage flow in one article


Author | Sun Zhiheng (Huizhi), Development Engineer at Alibaba

Introduction: As we all know, K8s persistent storage ensures that application data exists independently of the application's life cycle, but its internal implementation is rarely discussed. How does the storage flow inside K8s actually work? What are the calling relationships among the PV, PVC, StorageClass, Kubelet, CSI plugin, and other components? This article unravels these mysteries one by one.

K8s persistent storage basics

Before explaining the K8s storage flow, let's first review the basic concepts of persistent storage in K8s.

1. Terminology

  • in-tree : the code logic lives in the official K8s repository;

  • out-of-tree : the code logic lives outside the official K8s repository and is decoupled from the K8s codebase;

  • PV : PersistentVolume, a cluster-level resource created by the cluster administrator or by an External Provisioner. A PV's life cycle is independent of the Pods that use it, and the storage device's details are recorded in the PV's Spec;

  • PVC : PersistentVolumeClaim, a namespace-level resource created by the user or by the StatefulSet controller (from a VolumeClaimTemplate). A PVC is analogous to a Pod: Pods consume Node resources while PVCs consume PV resources. A Pod can request specific amounts of resources (CPU and memory), while a PVC can request a specific storage size and access mode (Access Mode);

  • StorageClass : a cluster-level resource created by the cluster administrator. An SC gives administrators a "class" template for dynamically provisioning storage volumes; its .Spec describes in detail the quality-of-service levels, backup policies, and so on of the PVs it provisions;

  • CSI : Container Storage Interface, whose purpose is to define an industry-standard container storage interface so that plugins developed by storage providers (SPs) against the CSI standard can work across different container orchestration (CO) systems, such as Kubernetes, Mesos, and Swarm.

2. Components

  • PV Controller : responsible for PV/PVC binding and life cycle management, and for performing the **Provision/Delete** operations on data volumes as needed;

  • AD Controller : responsible for the **Attach/Detach** operations on data volumes, attaching devices to the target node;

  • Kubelet : the main "node agent" running on each Node; its duties include Pod life cycle management, container health checks, container monitoring, and so on;

  • Volume Manager : a component inside Kubelet, responsible for the **Mount/Unmount** operations on data volumes (it can also take over the **Attach/Detach** operations if the corresponding Kubelet parameter is configured), as well as formatting volume devices and the like;

  • Volume Plugins : storage plugins, developed by storage providers, that extend the volume management capabilities for various storage types and implement the operations above (shown in blue in the diagrams). Volume Plugins come in in-tree and out-of-tree flavors;

  • External Provisioner : a sidecar container whose job is to call the CreateVolume and DeleteVolume functions of a Volume Plugin to perform the **Provision/Delete** operations. Because the K8s PV controller cannot call out-of-tree Volume Plugin functions directly, the External Provisioner invokes them via gRPC;

  • External Attacher : a sidecar container whose job is to call the ControllerPublishVolume and ControllerUnpublishVolume functions of a Volume Plugin to perform the **Attach/Detach** operations. Because the K8s AD controller cannot call out-of-tree Volume Plugin functions directly, the External Attacher invokes them via gRPC. (A sketch of how these sidecars are typically deployed follows this list.)
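To make the sidecar pattern concrete, here is a minimal sketch of a CSI controller Deployment pairing a vendor's controller plugin with the external-provisioner and external-attacher sidecars. All names, images, and versions are illustrative assumptions, not taken from this article; the sidecars talk to the plugin over a shared Unix socket and translate K8s events into CSI gRPC calls.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: csi-example-controller              # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: csi-example-controller
  template:
    metadata:
      labels:
        app: csi-example-controller
    spec:
      serviceAccountName: csi-example-sa    # needs RBAC for PVs, PVCs, VolumeAttachments
      containers:
        - name: csi-provisioner             # sidecar: watches PVCs, calls CreateVolume/DeleteVolume
          image: quay.io/k8scsi/csi-provisioner:v1.0.1   # illustrative image/version
          args:
            - "--provisioner=csi.example.com"            # must match the StorageClass provisioner field
            - "--csi-address=/csi/csi.sock"
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-attacher                # sidecar: watches VolumeAttachments, calls ControllerPublish/UnpublishVolume
          image: quay.io/k8scsi/csi-attacher:v1.0.1      # illustrative image/version
          args:
            - "--csi-address=/csi/csi.sock"
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-plugin                  # the storage vendor's controller plugin
          image: example.com/csi-driver:v0.1             # placeholder
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
      volumes:
        - name: socket-dir
          emptyDir: {}                      # shared Unix socket between sidecars and plugin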

3. Using persistent volumes

Kubernetes introduced PV and PVC so that applications and their developers can request storage resources in a standard way without having to handle storage infrastructure details. There are two ways to create a PV:

  • static: the cluster administrator manually creates the PVs the application requires;

  • dynamic: the user manually creates a PVC, and a Provisioner component dynamically creates the corresponding PV.

Let's take NFS shared storage as an example to see the difference between the two.

Statically creating a storage volume

The process of statically creating a storage volume is shown in the following figure:

[Figure: statically provisioning a storage volume]

Step 1 : The cluster administrator creates an NFS PV; NFS is one of the in-tree storage types natively supported by K8s. The YAML file is as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.4.1
    path: /nfs_storage

Step 2 : The user creates a PVC; the YAML file is as follows:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Running kubectl get pvc shows that the PV and PVC are bound:

[root@huizhi ~]# kubectl get pvc
NAME      STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-pvc   Bound    nfs-pv   10Gi       RWO                           4s

Step 3 : The user creates an application that uses the PVC created in step 2.

apiVersion: v1
kind: Pod
metadata:
  name: test-nfs
spec:
  containers:
  - image: nginx:alpine
    imagePullPolicy: IfNotPresent
    name: nginx
    volumeMounts:
    - mountPath: /data
      name: nfs-volume
  volumes:
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: nfs-pvc

At this point, the remote NFS storage is mounted to the /data directory of the nginx container in the Pod.

Dynamically creating a storage volume

To dynamically create storage volumes, an **nfs-client-provisioner** and a corresponding StorageClass must be deployed in the cluster.

Compared with static creation, dynamic creation of storage volumes reduces the cluster administrator's involvement. The process is shown in the following figure:

[Figure: dynamically provisioning a storage volume]

The cluster administrator only needs to make sure a StorageClass for NFS exists in the environment:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nfs-sc
provisioner: example.com/nfs
mountOptions:
  - vers=4.1

Step 1 : The user creates a PVC whose storageClassName is set to the name of the NFS StorageClass above:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
  annotations:
    volume.beta.kubernetes.io/storage-class: "nfs-sc"  # legacy equivalent of spec.storageClassName
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Mi
  storageClassName: nfs-sc

Step 2 : The nfs-client-provisioner in the cluster dynamically creates the corresponding PV. At this point you can see that a PV has been created in the environment and bound to the PVC:

[root@huizhi ~]# kubectl get pv
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM         REASON    AGE
pvc-dce84888-7a9d-11e6-b1ee-5254001e0c1b   10Mi        RWX           Delete          Bound       default/nfs             4s

Step 3 : The user creates the application and uses the PVC created in step 2, exactly as in step 3 of static provisioning.

K8s persistent storage flow

1. Process Overview

The following refers to the flowchart by @Junbao from the cloud-native storage course.

[Figure: overview of the K8s persistent storage flow]

The process is as follows:

  1. The user creates a Pod that references a PVC, which requires a dynamically provisioned storage volume;

  2. **Scheduler** schedules the Pod to an appropriate Worker node based on the Pod spec, node status, PV configuration, and other information;

  3. **PV controller** watches until the PVC used by the Pod enters the Pending state, then calls the Volume Plugin (in-tree) to create the storage volume and creates a PV object (in the out-of-tree case this is handled by the External Provisioner);

  4. **AD controller** finds that the Pod and PVC are in the Pending state, so it calls the Volume Plugin to attach the storage device to the target Worker node;

  5. On the Worker node, the Volume Manager inside Kubelet waits for the storage device to be attached, then mounts the device through the Volume Plugin to the global directory /var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PV name] (taking iSCSI as an example);

  6. **Kubelet** starts the Pod's containers via Docker and maps the volume already mounted at the local global directory into the container with a bind mount.

The more detailed process is as follows:

[Figure: detailed K8s persistent storage flow]

2. Detailed process

Different K8s versions have slightly different persistent storage flows. This article is based on Kubernetes 1.14.8.

As the flowchart above shows, a storage volume goes through three phases between creation and use by an application: Provision/Delete, Attach/Detach, and Mount/Unmount.

Provisioning volumes

[Figure: the provisioning flow in the PV controller]

There are two workers in the PV controller:

  • ClaimWorker : handles PVC add/update/delete events and PVC state transitions;
  • VolumeWorker : responsible for PV state transitions.

PV state transitions (UpdatePVStatus):

  • A PV starts out in the Available state; when the PV is bound to a PVC, it becomes Bound;
  • After the PVC bound to the PV is deleted, the PV becomes Released;
  • When the PV's reclaim policy is Recycle and recycling succeeds, or the PV's .Spec.ClaimRef is deleted manually, the PV becomes Available again;
  • When the reclaim policy is unknown, or Recycle fails, or deleting the backing storage volume fails, the PV becomes Failed (see the PV sketch after this list).
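To ground these transitions, here is a hedged sketch of the PV fields they hinge on; the names and values are illustrative, reusing the earlier NFS example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  persistentVolumeReclaimPolicy: Retain   # Retain / Delete / Recycle drive the transitions above
  claimRef:                               # manually deleting this stanza returns the PV to Available
    namespace: default
    name: nfs-pvc
status:
  phase: Released                         # e.g. after the bound PVC was deleted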

PVC state transitions (UpdatePVCStatus):

  • When no PV in the cluster satisfies the PVC, the PVC is Pending; after the PV and PVC are bound, the PVC changes from Pending to Bound;
  • If the PV bound to the PVC is deleted from the environment, the PVC becomes Lost;
  • If the PVC is then bound again to a PV of the same name, it returns to Bound.

The provisioning flow is as follows (simulating a user creating a new PVC):

Static storage volume flow (FindBestMatch) : the PV controller first tries to match the new PVC with a PV in the Available state in the environment.

  • DelayBinding : the PV controller determines whether the PVC requires delayed binding: 1. check whether the PVC annotations contain volume.kubernetes.io/selected-node; if present, the scheduler has already picked a node for this PVC (this belongs to ProvisionVolume), so no delayed binding is needed; 2. if that annotation is absent and the PVC has no StorageClass, no delayed binding is needed by default; if it does have a StorageClass, check its VolumeBindingMode field: WaitForFirstConsumer means delayed binding is required, Immediate means it is not (see the StorageClass sketch after this list);

  • FindBestMatchPVForClaim : the PV controller tries to find an existing PV in the environment that satisfies the PVC. It filters **all PVs** once and picks the best match among the qualifying ones. Filtering rules: 1. whether the VolumeMode matches; 2. whether the PV is already bound to a PVC; 3. whether the PV's .Status.Phase is Available; 4. a LabelSelector check: the PV's and PVC's labels must be consistent; 5. whether the PV's and PVC's StorageClass are the same; 6. on each iteration, keep the smallest PV that still satisfies the PVC's requested size, and return it as the final result;

  • Bind : the PV controller binds the selected PV and PVC: 1. update the PV's .Spec.ClaimRef to the current PVC; 2. update the PV's .Status.Phase to Bound; 3. add the PV annotation pv.kubernetes.io/bound-by-controller: "yes"; 4. update the PVC's .Spec.VolumeName to the PV's name; 5. update the PVC's .Status.Phase to Bound; 6. add the PVC annotations pv.kubernetes.io/bound-by-controller: "yes" and pv.kubernetes.io/bind-completed: "yes";
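As a concrete illustration of delayed binding, a StorageClass like the following (the name and provisioner are hypothetical placeholders, not from this article) makes the PV controller wait until the scheduler has placed the consuming Pod before binding or provisioning:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: example-wffc-sc                    # hypothetical name
provisioner: csi.example.com               # hypothetical out-of-tree provisioner
volumeBindingMode: WaitForFirstConsumer    # Immediate (the default) would bind right away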

Dynamic storage volume flow (ProvisionVolume): if no suitable PV exists in the environment, the flow enters the dynamic provisioning scenario:

  • Before provisioning : 1. the PV controller first determines whether the StorageClass used by the PVC is in-tree or out-of-tree, by checking whether the StorageClass's Provisioner field carries the **"kubernetes.io/"** prefix; 2. the PV controller updates the PVC's annotations: claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner;

  • in-tree provisioning (internal provisioning): 1. an in-tree Provisioner implements the NewProvisioner method of the ProvisionableVolumePlugin interface, which returns a new Provisioner; 2. the PV controller calls the Provisioner's Provision function, which returns a PV object; 3. the PV controller creates the returned PV object and binds it to the PVC: .Spec.ClaimRef is set to the PVC, .Status.Phase is set to Bound, and .Spec.StorageClassName is set to the PVC's StorageClassName; it also adds the annotations pv.kubernetes.io/bound-by-controller: "yes" and pv.kubernetes.io/provisioned-by: plugin.GetPluginName();

  • out-of-tree provisioning (external provisioning): 1. the External Provisioner checks whether the PVC's claim.Spec.VolumeName is empty, and skips the PVC if it is not; 2. the External Provisioner checks whether the PVC's claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] equals its own provisioner name (the External Provisioner takes a --provisioner argument at startup to determine its name); 3. if the PVC's VolumeMode is Block, it checks whether the External Provisioner supports block devices; 4. the External Provisioner calls its Provision function, which calls the CSI storage plugin's CreateVolume interface via gRPC; 5. the External Provisioner creates a PV to represent the volume and binds it to the PVC (an illustrative sketch of the resulting annotations follows this list).
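To make the controller's bookkeeping concrete, here is a hedged sketch of what a dynamically provisioned, bound PVC might look like afterwards; the volume name is illustrative and mirrors the earlier NFS example rather than real output:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs
  annotations:
    pv.kubernetes.io/bind-completed: "yes"                           # set by the PV controller at bind time
    pv.kubernetes.io/bound-by-controller: "yes"                      # set by the PV controller at bind time
    volume.beta.kubernetes.io/storage-provisioner: example.com/nfs   # copied from the StorageClass
spec:
  storageClassName: nfs-sc
  volumeName: pvc-dce84888-7a9d-11e6-b1ee-5254001e0c1b               # filled in when the PV is bound
status:
  phase: Bound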

Deleting volumes

The deleting flow is the inverse of provisioning:

When the user deletes the PVC, the PV controller changes PV.Status.Phase to Released.

When PV.Status.Phase == Released, the PV controller first checks the value of Spec.PersistentVolumeReclaimPolicy: if it is Retain, the PV is skipped; if it is Delete, then:

  • in-tree deleting: 1. an in-tree Provisioner implements the NewDeleter method of the DeletableVolumePlugin interface, which returns a new Deleter; 2. the controller calls the Deleter's Delete function to delete the backing volume; 3. once the volume is deleted, the PV controller deletes the PV object;

  • out-of-tree deleting: 1. the External Provisioner calls its Delete function, which calls the CSI plugin's DeleteVolume interface via gRPC; 2. once the volume is deleted, the External Provisioner deletes the PV object.

Attaching volumes

Both the Kubelet and the AD controller can perform attach/detach operations. When the Kubelet's --enable-controller-attach-detach startup parameter is true (the default), the AD controller handles them; when it is set to false, the Kubelet performs them itself. The explanation below takes the AD controller as an example. A minimal configuration sketch follows.
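For reference, a minimal sketch of the config-file form of this switch, assuming the kubelet.config.k8s.io/v1beta1 KubeletConfiguration style (the field below is the file equivalent of the command-line flag):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# false: this Kubelet performs attach/detach for its node itself;
# true (the default): the AD controller in kube-controller-manager does it.
enableControllerAttachDetach: false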

[Figure: the attach/detach flow in the AD controller]

There are two core data structures in the AD controller:

  • DesiredStateOfWorld (DSW) : the expected volume attachment state in the cluster, holding nodes -> volumes -> pods information;
  • ActualStateOfWorld (ASW) : the actual volume attachment state in the cluster, holding volumes -> nodes information.

The attaching flow is as follows:

The AD controller initializes the DSW and ASW from the resource information in the cluster.

Three components inside the AD controller periodically update the DSW and ASW:

  • Reconciler : runs a GoRoutine periodically to ensure volumes are attached/detached, updating the ASW along the way:

in-tree attaching: 1. an in-tree Attacher implements the NewAttacher method of the AttachableVolumePlugin interface, which returns a new Attacher; 2. the AD controller calls the Attacher's Attach function to attach the device; 3. the ASW is updated.

out-of-tree attaching: 1. the in-tree CSIAttacher is called to create a VolumeAttachment (VA) object, which carries the attacher name, the node name, and the PV to be attached; 2. the External Attacher watches VolumeAttachment resources in the cluster; when it finds a volume that needs attaching, it calls its Attach function, which calls the CSI plugin's ControllerPublishVolume interface via gRPC (see the VolumeAttachment sketch after this list).

  • DesiredStateOfWorldPopulator : runs periodically in a GoRoutine; its main job is to update the DSW:

findAndRemoveDeletedPods - iterates over all Pods in the DSW and removes from the DSW any that have been deleted from the cluster;
findAndAddActivePods - iterates over all Pods in the PodLister and adds to the DSW any Pod not yet in it.

  • PVC Worker : watches PVC add/update events, processes the Pods related to the PVC, and updates the DSW in real time.
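As a hedged illustration, the VolumeAttachment object created by the CSIAttacher might look roughly like this; the object name, driver name, node name, and PV name are all placeholders:

apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-123abc                 # in practice a hash-based name generated by the CSIAttacher
spec:
  attacher: csi.example.com        # the CSI driver name the External Attacher filters on
  nodeName: worker-node-1          # the node the volume should be attached to
  source:
    persistentVolumeName: pvc-dce84888-7a9d-11e6-b1ee-5254001e0c1b
status:
  attached: false                  # flipped to true after ControllerPublishVolume succeeds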

Detaching volumes

The detaching flow is as follows:

  • When the Pod is deleted, the AD controller watches the event. It first checks whether the Node hosting the Pod carries the "volumes.kubernetes.io/keep-terminated-pod-volumes" label; if it does, no operation is performed; if it does not, the volume is removed from the DSW;

  • The AD controller drives the ActualStateOfWorld toward the DesiredStateOfWorld through the **Reconciler**. When it finds a volume that exists in the ASW but not in the DSW, it performs a Detach:

in-tree detaching: 1. an in-tree Attacher implements the NewDetacher method of the AttachableVolumePlugin interface, which returns a new Detacher; 2. the controller calls the Detacher's Detach function to detach the corresponding volume; 3. the AD controller updates the ASW.

out-of-tree detaching: 1. the AD controller calls the in-tree CSIAttacher to delete the corresponding VolumeAttachment object; 2. the External Attacher watches VolumeAttachment (VA) resources in the cluster; when it finds a volume that needs detaching, it calls its Detach function, which calls the CSI plugin's ControllerUnpublishVolume interface via gRPC; 3. the AD controller updates the ASW.

Mounting/unmounting volumes

[Figure: the mount/unmount flow in the Volume Manager]

The **Volume Manager** likewise has two core data structures:

  • DesiredStateOfWorld (DSW) : the expected volume mount state on the node, holding volumes -> pods information;
  • ActualStateOfWorld (ASW) : the actual volume mount state on the node, holding volumes -> pods information.

The mounting/unmounting flow is as follows:

The purpose of the global directory (global mount path): on Linux, a block device can only be mounted once, but in the K8s scenario a PV may be used by multiple Pod instances on the same Node. If the block device is formatted and mounted to a temporary global directory on the Node, and Linux bind mount technology is then used to mount this global directory into the corresponding directory of each Pod, the requirement is met. In the flowchart above, the global directory is /var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PV name]. A minimal shell sketch of the idea follows.
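The sketch below shows the mechanism only; the device path and directory names are illustrative and are not the exact paths Kubelet uses:

# Format the block device and mount it once, to a global directory
mkfs.ext4 /dev/sdb                                      # one-time format (illustrative device)
mkdir -p /var/lib/kubelet/global/pv-demo
mount /dev/sdb /var/lib/kubelet/global/pv-demo          # the single real mount of the device

# Each Pod directory then receives a bind mount of the global directory
mkdir -p /var/lib/kubelet/pods/pod-a/volumes/pv-demo
mount --bind /var/lib/kubelet/global/pv-demo /var/lib/kubelet/pods/pod-a/volumes/pv-demo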

The VolumeManager initializes the DSW and ASW from the resource information in the cluster.

Two components inside the VolumeManager periodically update the DSW and ASW:

  • DesiredStateOfWorldPopulator : runs periodically in a GoRoutine; its main job is to update the DSW;
  • Reconciler : runs periodically in a GoRoutine to ensure volumes are mounted/unmounted, updating the ASW along the way:

unmountVolumes : ensures volumes are unmounted after their Pod is deleted. It iterates over all Pods in the ASW; if one is not in the DSW (meaning the Pod has been deleted), then, taking VolumeMode=FileSystem as an example, it performs the following steps:

  1. Remove all bind mounts: call the Unmounter's TearDown interface (or, if out-of-tree, the CSI plugin's NodeUnpublishVolume interface);
  2. Unmount the volume: call the DeviceUnmounter's UnmountDevice function (or, if out-of-tree, the CSI plugin's NodeUnstageVolume interface);
  3. Update the ASW.

mountAttachVolumes : ensures the volumes a Pod needs are mounted successfully. It iterates over all Pods in the DSW; if one is not in the ASW (meaning the global directory has not yet been mounted and mapped into the Pod), then, taking VolumeMode=FileSystem as an example, it performs the following steps:

  1. Wait for the volume to be attached to the node (by the External Attacher or by the Kubelet itself);
  2. Mount the volume to the global directory: call the DeviceMounter's MountDevice function (or, if out-of-tree, the CSI plugin's NodeStageVolume interface);
  3. Update the ASW: the volume has been mounted to the global directory;
  4. Bind-mount the volume into the Pod: call the Mounter's SetUp interface (or, if out-of-tree, the CSI plugin's NodePublishVolume interface);
  5. Update the ASW.

unmountDetachDevices : ensures the volumes that need unmounting are unmounted. It iterates over all UnmountedVolumes in the ASW; if one is not in the DSW (meaning the volume is no longer in use), it performs the following steps:

  1. Unmount the volume from the global directory: call the DeviceUnmounter's UnmountDevice function (or, if out-of-tree, the CSI plugin's NodeUnstageVolume interface);
  2. Update the ASW.

Summary

This article first introduced the basic concepts and usage of K8s persistent storage, then analyzed K8s' internal storage flow in depth. On K8s, any use of storage goes through the flow above (some scenarios skip attach/detach), and any storage problem in an environment must be a failure in one of these links.

Container storage has many pitfalls, especially in proprietary cloud environments. But the more challenges, the more opportunities! The domestic proprietary cloud market is currently also leading the way in the storage field. Our agile PaaS container team welcomes heroes to join us and create a better future together!

Reference links

  1. Kubernetes community source code
  2. [Cloud Native Open Course] Kubernetes storage architecture and plugin usage (Junbao)
  3. [Cloud Native Open Course] Application storage and persistent data volumes: core knowledge (Part 1)
  4. [kubernetes design proposals] volume-provisioning
  5. [kubernetes design proposals] CSI Volume Plugins in Kubernetes Design Doc

The cloud native application team is hiring!

Alibaba Cloud's native application platform team is eagerly looking for talent. If you:

  • are passionate about cloud-native technologies in the container and infrastructure fields (production landings, innovative technical implementations, open-source contributions, or leading academic achievements in related areas);

  • have excellent presentation, communication, and teamwork skills, think ahead about technology and business, and bring strong ownership, result orientation, and sound decision-making;

  • are familiar with at least one of the Java and Golang programming languages;

  • hold a bachelor's degree or above and have more than 3 years of working experience;

then send your resume to: [email protected]. For any questions, add WeChat for consultation: The Beatles1994


" Alibaba Cloud Native focuses on microservices, serverless, containers, service mesh and other technical fields, focuses on cloud native popular technology trends, cloud native large-scale landing practices, and is the public number that understands cloud native developers best.
