Cloud Native In-Depth Analysis: Best Practices for the Kubernetes Local Persistent Storage Solution OpenEBS LocalPV

1. K8s local storage

  • K8s supports more than 20 types of persistent storage, such as the common CephFS, GlusterFS, etc., but most of these are distributed storage. As the community developed, more and more users wanted to make use of the data disks mounted on the worker nodes of a K8s cluster, so support for local persistent volumes was added.
  • A local persistent volume is called a Local Persistent Volume, or LocalPV for short. A LocalPV represents a storage device mounted locally on a worker node, such as a disk, partition, or directory. A LocalPV is therefore not as reliable as distributed storage, but it is extremely fast, which determines its usage scenarios: workloads that are sensitive to I/O performance and can tolerate a small probability of data loss.
  • K8s official documentation has a simple example of using LocalPV. Here is a brief summary of the characteristics of K8s LocalPV:
    • It can only be used as a statically created persistent volume and does not support dynamic provisioning, which means that the PV must be created manually;
    • Compared to hostPath volumes, LocalPV can be used in a durable and portable manner without manually scheduling Pods to nodes; the system understands the node constraints of the volume from the PV's node affinity (nodeAffinity) configuration;
    • If you want to use a storage class to automatically bind PVCs and PVs, you must configure the StorageClass for delayed binding.
  • An example is as follows:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
  • The volumeBindingMode: WaitForFirstConsumer attribute enables delayed volume binding, which allows the scheduler to take the Pod's scheduling constraints into account when selecting a suitable PV for a PVC. For example, suppose you create two PVs, PV1 and PV2, and then create a Pod that claims a PVC called PVC1, and both PV1 and PV2 satisfy PVC1. At this point the storage class cannot immediately bind PVC1 to either PV; before binding any PV, the Pod's scheduling constraints must be considered. If the Pod specifies node affinity and must be deployed to the node where PV1 is located, then PVC1 must be bound to PV1 rather than PV2.
  • It follows that when a Pod uses a LocalPV, the binding of PVC to PV must take the Pod's scheduling into account. Therefore, a LocalPV storage class cannot support immediate binding and can only delay binding until the Pod is scheduled (WaitForFirstConsumer).
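  • For reference, a statically created LocalPV with a node affinity constraint looks roughly like the following (adapted from the pattern in the K8s documentation; the path and node name are placeholders):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
spec:
  capacity:
    storage: 4Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    # a disk, partition, or directory already mounted on the node
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1   # the node that physically holds the device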

2. OpenEBS local storage

① Introduction to OpenEBS

  • Since the usage limitations of K8s LocalPV cannot meet production needs, it is necessary to find alternatives. Fortunately, someone has already implemented a more powerful LocalPV storage solution: OpenEBS LocalPV.
  • OpenEBS official site: https://openebs.io/.
  • OpenEBS can convert any available storage on a K8s worker node into a local or distributed (also known as replicated) persistent volume.
  • OpenEBS was originally built by MayaData and donated to CNCF, and is now a CNCF sandbox project.

② Volume type

  • OpenEBS supports two volume types: local volumes and replicated volumes. The architecture is as follows:

[Figure: OpenEBS local and replicated (distributed) volume architecture]

  • Local volumes directly abstract a data disk attached to a worker node into a persistent volume for Pods on that node. Replicated volumes are more complex: OpenEBS uses its internal engine to create a microservice for each replicated volume. When used, the stateful service writes data to the OpenEBS engine, and the engine synchronously replicates the data to multiple nodes in the cluster, thereby achieving high availability.

③ Local volume

  • Since this practice targets local persistent storage, we mainly analyze the use of OpenEBS local volumes. OpenEBS local volumes come in several types: Hostpath, Device, LVM, ZFS, and Rawfile. Each type has its own characteristics and applicable scenarios. For example, compared with the native K8s hostPath volume, OpenEBS Hostpath can use a directory on an external data disk as the hostPath directory, avoiding the risk of Pods filling up the host directory. Device makes it possible to use raw block devices for LocalPV, which is extremely fast. With the help of LVM, LocalPV can be used more flexibly, for example supporting dynamic expansion and shrinking of PVs.
  • OpenEBS implements a separate project for each supported type. Taking block devices as an example, we introduce the practice of using OpenEBS LocalPV. The project address is https://github.com/openebs/device-localpv; below, Device-LocalPV is used to refer to it.

④ Practice

  • Environment preparation (a sketch of the setup commands follows this list):
    • Use minikube to build a K8s cluster. OpenEBS officially requires K8s 1.20+; in actual tests version 1.19 also works, but it is better to stick to the recommended version;
    • Use VirtualBox as the driver to start two minikube nodes: minikube and minikube-m02 (refer to: https://minikube.sigs.k8s.io/docs/drivers/virtualbox/);
    • The minikube node has one additional 4GB disk attached, and the minikube-m02 node has a 4GB and an 8GB disk attached.
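    • A possible way to bring up such an environment is sketched below; the flags, disk sizes, VDI file names, and the VirtualBox controller name are assumptions that depend on your local setup:
## start a two-node cluster with the VirtualBox driver
minikube start --driver=virtualbox --nodes=2 --kubernetes-version=v1.20.0

## stop the cluster, then create and attach a data disk to the powered-off VM
## (the controller name "SATA" and the port number may differ in your environment)
minikube stop
VBoxManage createmedium disk --filename minikube-disk1.vdi --size 4096
VBoxManage storageattach minikube --storagectl "SATA" --port 2 --device 0 --type hdd --medium minikube-disk1.vdi
minikube start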
  • Install Device-LocalPV:
    • Since OpenEBS Device-LocalPV itself is an application developed for cloud native, installation is very simple and only requires a kubectl apply command.
kubectl apply -f https://raw.githubusercontent.com/openebs/device-localpv/develop/deploy/device-operator.yaml
  • After executing the above command, you will get the following related Pods:
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE
kube-system   openebs-device-controller-0        2/2     Running   0          2m23s
kube-system   openebs-device-node-4wld7          2/2     Running   0          2m23s
kube-system   openebs-device-node-p2r6m          2/2     Running   0          2m23s
    • Make sure these Pods are all in the Running state, which means Device-LocalPV is installed successfully. If the installation fails, troubleshoot based on the output of kubectl describe.
  • Prepare the disk:
    • Device-LocalPV can directly take over block devices on a node. A node may have several data disks attached at the same time, and we may not want all of them to be used for LocalPV. To mark which block devices may be used by Device-LocalPV, a small (~10MiB) Meta partition must be created on each such block device to store identification information. The Meta partition has the following requirements:
      • Is the first partition of the block device (ID_PART_ENTRY_NUMBER=1);
      • Cannot be formatted into any file system;
      • No partition flags may be set.
    • The operation command is as follows:
## Run the following commands on the minikube node
$ sudo parted /dev/sdb mklabel gpt
$ sudo parted /dev/sdb mkpart test-device 1MiB 10MiB

## Run the following commands on the minikube-m02 node
$ sudo parted /dev/sdb mklabel gpt
$ sudo parted /dev/sdb mkpart test-device 1MiB 10MiB
$ sudo parted /dev/sdc mklabel gpt
$ sudo parted /dev/sdc mkpart test-device 1MiB 10MiB
    • The commands above create the Meta partitions on the block devices of the minikube and minikube-m02 nodes. /dev/sdb and /dev/sdc are the names of the block devices attached to the nodes and must be adjusted to your actual environment. The Meta partitions of all three disks are named test-device; this is intentional and will be used when creating the storage class next.
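    • You can optionally verify the result with parted (the output will vary with your disk sizes):
$ sudo parted /dev/sdb unit MiB print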
  • Create storage class:
    • Since Device-LocalPV supports dynamic provisioning, a storage class has to be created. Save the following yaml as sc.yaml and create the storage class with kubectl apply -f sc.yaml:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-device-sc
allowVolumeExpansion: false
parameters:
  devname: "test-device"
provisioner: device.csi.openebs.io
volumeBindingMode: WaitForFirstConsumer
    • The parameters field of the storage class must specify devname, whose value is the partition name test-device given when partitioning the block devices. When the storage class creates a PV, it matches this partition name to determine which block devices may be used by Device-LocalPV.
  • Create a StatefulSet to apply for the use of LocalPV:
    • StatefulSet and related resources are defined as follows:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
    - port: 80
      name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hello
spec:
  selector:
    matchLabels:
      app: hello
  serviceName: "nginx"
  replicas: 2
  template:
    metadata:
      labels:
        app: hello
    spec:
      terminationGracePeriodSeconds: 1
      containers:
        - name: html
          image: busybox
          imagePullPolicy: IfNotPresent
          command:
            - sh
            - -c
            - 'while true; do echo "`date` [`hostname`] Hello from OpenEBS Local PV." >> /mnt/store/index.html; sleep $(($RANDOM % 5 + 300)); done'
          volumeMounts:
            - mountPath: /mnt/store
              name: csi-devicepv
        - name: web
          image: k8s.gcr.io/nginx-slim:0.8
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
              name: web
          volumeMounts:
            - name: csi-devicepv
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: csi-devicepv
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "openebs-device-sc"
        resources:
          requests:
            storage: 1Gi
    • This creates a Service and a StatefulSet named hello. Focus on the StatefulSet: it starts two replicas and requests a 1Gi PVC for each of them through volumeClaimTemplates. After applying the above resources with kubectl apply, you can see that the two Pods are scheduled to different nodes:
# Check how the Pods were scheduled
➜ kubectl get pod -o wide
NAME      READY   STATUS    RESTARTS   AGE     IP           NODE           NOMINATED NODE   READINESS GATES
hello-0   2/2     Running   0          4m13s   10.244.1.3   minikube-m02   <none>           <none>
hello-1   2/2     Running   0          2m42s   10.244.0.3   minikube       <none>           <none>

# Check the PVC resources
➜ kubectl get pvc -o wide
NAME                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE     VOLUMEMODE
csi-devicepv-hello-0   Bound    pvc-042661c8-c000-4dde-9950-2b6859d5f273   1Gi        RWO            openebs-device-sc   4m46s   Filesystem
csi-devicepv-hello-1   Bound    pvc-26f92829-e0d4-4520-86da-2d7741cd68c2   1Gi        RWO            openebs-device-sc   3m15s   Filesystem

# Check the PV resources
➜ kubectl get pv -o wide
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                          STORAGECLASS        REASON   AGE   VOLUMEMODE
pvc-042661c8-c000-4dde-9950-2b6859d5f273   1Gi        RWO            Delete           Bound    default/csi-devicepv-hello-0   openebs-device-sc            18m   Filesystem
pvc-26f92829-e0d4-4520-86da-2d7741cd68c2   1Gi        RWO            Delete           Bound    default/csi-devicepv-hello-1   openebs-device-sc            17m   Filesystem
    • Now log in to the two minikube host nodes respectively and use the fdisk command to check the disk usage of the two nodes:

[Figure: fdisk output on the minikube node (left) and the minikube-m02 node (right)]

    • The left side is the minikube node and the right side is the minikube-m02 node. You can see that after a PV is created successfully, a partition of the same size as the PV is created on the block device of the corresponding node to back the LocalPV. The life cycle of this partition is entirely managed by Device-LocalPV: it is created when the PV is created and destroyed when the PV is deleted.
    • At this point, the OpenEBS Device-LocalPV practical exercise is completed. Normally, through the above practical steps, you should be able to successfully build and use OpenEBS Device-LocalPV. But you may encounter some strange problems. Next, we will analyze several pitfalls that may be encountered during the implementation of OpenEBS Device-LocalPV.

⑤ Problem analysis

  • parted command version issue:
    • In practice, different versions of the parted command may behave differently. As shown below, running the same command with two different versions of parted to create the Meta partition on a block device produces different results:

[Figures: Meta partition created with parted 3.3 (partition flag msftdata set) vs. parted 3.1 (no flag)]

    • As the screenshots show, the parted 3.3 partition carries a flag (msftdata), while the parted 3.1 partition has no flag. OpenEBS Device-LocalPV requires the Meta partition to have no flag; if a flag is present, Device-LocalPV ignores the block device and does not use it.
    • To make sure the parted command produces the expected result, you can always exec into the container of the OpenEBS DaemonSet and use the parted binary shipped with OpenEBS to do the partitioning. This guarantees that the parted version is the same one Device-LocalPV uses internally, so no unexpected problems occur. The DaemonSet Pods are the two Pods openebs-device-node-4wld7 and openebs-device-node-p2r6m shown above; when OpenEBS starts, it launches a DaemonSet workload on every node to operate the node's block devices and support LocalPV.
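    • For example (the Pod name comes from this environment, and the container name is an assumption; check the actual names with kubectl get pod -n kube-system <pod-name> -o jsonpath='{.spec.containers[*].name}'):
kubectl exec -it -n kube-system openebs-device-node-4wld7 -c openebs-device-plugin -- \
  parted /dev/sdb mklabel gpt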
  • PV dynamic expansion problem: OpenEBS Device-LocalPV currently does not support expansion operations, so when creating a storage class, you need to specify the allowVolumeExpansion attribute value as false to mark that the PV created by this storage class does not support dynamic expansion operations.
  • PVs may fail to be created:
    • You may encounter the problem that when the capacity requested by the PVC is exactly equal to the remaining capacity of the block device, the PV cannot be created successfully and the PVC is always in the Pending state.
    • Assume that there is only one node with a single 10Gi block device available to Device-LocalPV. If you then create four PVCs in succession with capacities of 1G, 3G, 3G, and 1G, they correspond to the four partitions sdb2, sdb3, sdb4, and sdb5 in the fdisk output below.

[Figure: fdisk output showing the partitions sdb2 to sdb5]

    • Now, if you delete the first 3G PVC, Device-LocalPV automatically deletes the /dev/sdb3 partition. However, if you then try to create a 3G PVC again, the PVC is never created successfully and stays in the Pending state. Whenever a PVC is created, the storage class has Device-LocalPV create a new partition on the node's block device whose size equals the capacity requested by the PVC; as the screenshot shows, the Start and End of these partitions increase monotonically and are contiguous. The newly requested 3G PVC equals the capacity of the freed /dev/sdb3 partition, so in theory it should be created successfully.
    • Analyzing the Device-LocalPV source code, we find that when checking whether the remaining available capacity of a block device satisfies the capacity requested by a PVC, Device-LocalPV uses the condition if tmp.SizeMiB > partSize, where tmp.SizeMiB is the remaining available capacity of the block device and partSize is the capacity requested by the PVC. The partition is created only when the remaining capacity is strictly greater than the requested capacity; if no available gap satisfies this condition, the PVC stays in the Pending state forever.

[Figure: Device-LocalPV source code with the if tmp.SizeMiB > partSize capacity check]

    • It follows that changing if tmp.SizeMiB > partSize to if tmp.SizeMiB >= partSize solves this problem. In addition, while reading the source code you can also find that when Device-LocalPV calculates the available gaps on a block device, computing the partition size involves a unit conversion from Bytes to MiB:

[Figure: Device-LocalPV source code converting partition boundaries from Bytes to MiB]

    • beginBytes and endBytes correspond to the Start and End values in the fdisk screenshot above. If the partitions are not MiB-aligned, floating-point rounding loses precision: beginMib is rounded up with math.Ceil and endMib is rounded down with math.Floor, so the resulting sizeMib can be smaller than the actual remaining space. This can lead to the situation where the remaining disk capacity satisfies the PVC request, yet the PVC still cannot be created. Therefore, when creating a PVC, try to request a capacity that is an integer multiple of 1024, i.e. a whole number of MiB or GiB.
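    • The following is a simplified Go sketch (illustrative only, not the actual Device-LocalPV source; the variable names are made up) of the two behaviours described above: the strict > comparison rejects an exact fit, and the ceil/floor rounding on an unaligned boundary makes the computed free space one MiB smaller than the real gap:
package main

import (
	"fmt"
	"math"
)

const MiB = 1024 * 1024

// freeSizeMiB mirrors the rounding described above: the start offset is rounded
// up with math.Ceil and the end offset is rounded down with math.Floor, so an
// unaligned gap can appear smaller than it really is.
func freeSizeMiB(beginBytes, endBytes uint64) uint64 {
	beginMiB := uint64(math.Ceil(float64(beginBytes) / MiB))
	endMiB := uint64(math.Floor(float64(endBytes) / MiB))
	return endMiB - beginMiB
}

func main() {
	partSize := uint64(3 * 1024) // the PVC asks for exactly 3GiB (expressed in MiB)

	// case 1: a freed gap that is MiB-aligned and exactly 3GiB large
	aligned := freeSizeMiB(uint64(4096)*MiB, uint64(4096+3*1024)*MiB)
	fmt.Println(aligned, aligned > partSize, aligned >= partSize) // 3072 false true

	// case 2: the same 3GiB gap, but starting 512KiB past a MiB boundary
	begin := uint64(4096)*MiB + 512*1024
	unaligned := freeSizeMiB(begin, begin+uint64(3*1024)*MiB)
	fmt.Println(unaligned, unaligned >= partSize) // 3071 false
}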

3. CSI

  • In K8s, if the built-in storage features do not meet production needs, they can be extended through a plug-in mechanism called CSI, and Device-LocalPV is implemented on top of this mechanism.
  • CSI is short for Container Storage Interface. It is the container storage interface specification officially adopted by K8s, and it aims to define a unified industry standard for extending the storage capabilities of container orchestration systems.

① Basic architecture

  • A CSI plug-in consists of two main parts: External Components and Custom Components. The External Components are officially provided by K8s, while the Custom Components are provided by the plug-in author. Each part contains three components, and they work together.
  • The basic architecture of CSI is as follows:

[Figure: basic architecture of a CSI plug-in]

② Workflow

  • First, when the CSI plug-in starts, the Driver Registrar component in the External Components starts working. It obtains the basic information of the CSI plug-in by communicating with the Identity component in the Custom Components and registers the plug-in with the kubelet.
  • The Provisioner component in External Components monitors the creation of PVC objects in APIServer through the Watch mechanism. Once a new PVC is created, Provisioner will communicate with the Controller component in Custom Components and let it create PV-related resources.
  • After creating the PV, the next step is the Attach stage, and the Attach operation corresponds to the Attacher component in External Components. It will also communicate with the Controller component in Custom Components to complete the Attach operation together.
  • The final Mount operation is performed by the kubelet on the node directly calling the Node component in the Custom Components. After this, applications inside the Pod can use the LocalPV mounted on the host node.

4. Device-LocalPV

① Deployment

  • To analyze how a program executes, you should of course start from its entry point. And since OpenEBS is a cloud-native application, the first thing to look at is how the project is deployed. The deployment yaml file of the OpenEBS Device-LocalPV project is in its Git repository.
  • As you can see, the two most important resources in the deployment file are a StatefulSet named openebs-device-controller and a DaemonSet named openebs-device-node.
  • Two containers are started in the StatefulSet, namely the External Provisioner component officially provided by K8s and the Custom Controller component developed by OpenEBS. These two components are placed in a Pod to work together.
  • Two containers are also started in the DaemonSet, namely the External Driver Registrar component officially provided by K8s and the Custom Node component developed by OpenEBS. So where is the External Attacher component of the CSI mechanism? In fact, LocalPV has no Attach operation: after it is created, only a Mount operation is needed to use it, so this component does not have to be deployed.
  • You may wonder how the External components communicate with their Custom counterparts. Looking at the Device-LocalPV deployment yaml mentioned above, it is not hard to find entries like unix:///xxx/csi.sock: the components communicate via gRPC over a Unix domain socket, which decouples them from each other while still allowing efficient communication.
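  • As a minimal sketch (not the Device-LocalPV source; the socket path is an example and error handling is simplified), serving gRPC on such a Unix socket looks roughly like this; a real driver would register its csi.IdentityServer/ControllerServer/NodeServer implementations before serving:
package main

import (
	"net"
	"os"

	"google.golang.org/grpc"
)

func main() {
	// the endpoint referenced in the deployment yaml, e.g. unix:///plugin/csi.sock
	const sockPath = "/plugin/csi.sock"
	_ = os.Remove(sockPath) // remove a stale socket left over from a previous run

	lis, err := net.Listen("unix", sockPath)
	if err != nil {
		panic(err)
	}

	srv := grpc.NewServer()
	// a real CSI driver registers its Identity/Controller/Node services on srv here
	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}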

② Components

  • Now that we know which components the OpenEBS Device-LocalPV project deploys, let's analyze the execution flow by reading the source code. Whether for the StatefulSet or the DaemonSet, the Device-LocalPV entry file is the same: after the program starts, the main function calls the run function, which is defined as follows:
func run(config *config.Config) {
	if config.Version == "" {
		config.Version = version.Current()
	}

	klog.Infof("Device Driver Version :- %s - commit :- %s", version.Current(), version.GetGitCommit())
	klog.Infof(
		"DriverName: %s Plugin: %s EndPoint: %s NodeID: %s",
		config.DriverName,
		config.PluginType,
		config.Endpoint,
		config.NodeID,
	)

	if len(config.IgnoreBlockDevicesRegex) > 0 {
		device.DeviceConfiguration.IgnoreBlockDevicesRegex = regexp.MustCompile(config.IgnoreBlockDevicesRegex)
	}

	err := driver.New(config).Run()
	if err != nil {
		log.Fatalln(err)
	}
	os.Exit(0)
}
  • It is worth noting that the code err := driver.New(config).Run() starts a Driver through the config parameter and executes the Run method. You can trace it inside the New function to see its implementation:
func New(config *config.Config) *CSIDriver {
	driver := &CSIDriver{
		config: config,
		cap:    GetVolumeCapabilityAccessModes(),
	}

	switch config.PluginType {
	case "controller":
		driver.cs = NewController(driver)
	case "agent":
		driver.ns = NewNode(driver)
	}

	driver.ids = NewIdentity(driver)
	return driver
}
  • As can be seen, the New function internally creates a CSIDriver object from config. Depending on the configuration parameter, this object registers either the Controller component or the Node component (agent and Node are equivalent terms in the Device-LocalPV project), and these two components correspond to the StatefulSet and the DaemonSet respectively; in other words, the Controller component is started as the StatefulSet and the Node component is started as the DaemonSet.
  • Identity component:
    • From the source above you can see that regardless of whether the Controller or the Node component is started, the Identity component is always registered (driver.ids = NewIdentity(driver)). This is because the CSI plug-in implemented by Device-LocalPV is split across two workloads, and both of them need to register themselves with K8s, which is exactly what the Identity component does.
    • Identity is implemented as follows, and three methods are defined, namely GetPluginInfo, Probe, and GetPluginCapabilities:
package driver

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
	"github.com/openebs/device-localpv/pkg/version"
	"golang.org/x/net/context"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// identity is the server implementation
// for CSI IdentityServer
type identity struct {
	driver *CSIDriver
}

// NewIdentity returns a new instance of CSI
// IdentityServer
func NewIdentity(d *CSIDriver) csi.IdentityServer {
	return &identity{
		driver: d,
	}
}

// GetPluginInfo returns the version and name of
// this service
//
// This implements csi.IdentityServer
func (id *identity) GetPluginInfo(
	ctx context.Context,
	req *csi.GetPluginInfoRequest,
) (*csi.GetPluginInfoResponse, error) {

	if id.driver.config.DriverName == "" {
		return nil, status.Error(codes.Unavailable, "missing driver name")
	}

	if id.driver.config.Version == "" {
		return nil, status.Error(codes.Unavailable, "missing driver version")
	}

	return &csi.GetPluginInfoResponse{
		Name: id.driver.config.DriverName,
		// TODO
		// verify which version needs to be used:
		// config.version or version.Current()
		VendorVersion: version.Current(),
	}, nil
}

// TODO
// Need to implement this
//
// Probe checks if the plugin is running or not
//
// This implements csi.IdentityServer
func (id *identity) Probe(
	ctx context.Context,
	req *csi.ProbeRequest,
) (*csi.ProbeResponse, error) {
	return &csi.ProbeResponse{}, nil
}

// GetPluginCapabilities returns supported capabilities
// of this plugin
//
// Currently it reports whether this plugin can serve
// the Controller interface. Controller interface methods
// are called dependant on this
//
// This implements csi.IdentityServer
func (id *identity) GetPluginCapabilities(
	ctx context.Context,
	req *csi.GetPluginCapabilitiesRequest,
) (*csi.GetPluginCapabilitiesResponse, error) {
	return &csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{
			{
				Type: &csi.PluginCapability_Service_{
					Service: &csi.PluginCapability_Service{
						Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
					},
				},
			},
			{
				Type: &csi.PluginCapability_Service_{
					Service: &csi.PluginCapability_Service{
						Type: csi.PluginCapability_Service_VOLUME_ACCESSIBILITY_CONSTRAINTS,
					},
				},
			},
		},
	}, nil
}
    • The GetPluginInfo method returns the name and version number of the plug-in. Probe, as its name implies, is a probe program. K8s can use this probe to check whether the plug-in is working properly.
    • The GetPluginCapabilities method returns the capabilities of the plug-in, telling K8s which features this CSI plug-in implements. For example, the Device-LocalPV project does not implement the Attach feature, so when this CSI plug-in is specified as the provisioner of the storage class used by a PVC, K8s automatically skips the Attach phase and goes straight to the Mount phase.
    • Careful readers may have noticed that each method carries a // This implements csi.IdentityServer comment. The names of these methods are fixed, having been defined by the CSI specification; the author of a CSI plug-in only needs to implement the corresponding methods according to that specification.
  • Controller component:
    • When analyzing the basic architecture of CSI above, we mentioned that the External Provisioner component watches for the creation of PVC objects in the APIServer. Once a new PVC is created, the Provisioner communicates with the Custom Controller component and asks it to create the PV-related resource. Note that this means creating a PV-related resource, not the PV itself. The PV-related resource is a custom resource (defined by a CRD) created by the developers of the CSI components; one such resource corresponds to one PV, and their life cycles are essentially the same. The reason for this design is that PV is an internal K8s resource, while CSI is a general specification that applies not only to K8s but to any container orchestration system, so the CSI plug-in defines its own resource type to correspond to the PV.
    • The CRD corresponding to the PV is called devicevolumes.local.openebs.io, which is defined in the Device-LocalPV project deployment file and is equivalent to the PV resources managed by Device-LocalPV itself. A typical devicevolumes.local.openebs.io resource is defined as follows:
apiVersion: v1
items:
- apiVersion: local.openebs.io/v1alpha1
  kind: DeviceVolume
  metadata:
    creationTimestamp: "2022-08-01T08:45:51Z"
    finalizers:
    - device.openebs.io/finalizer
    generation: 3
    labels:
      kubernetes.io/nodename: minikube-m02
    name: pvc-8e659633-e052-439a-85eb-30a6d385a12b
    namespace: openebs
    resourceVersion: "1715"
    uid: 540aa699-4f6f-487c-b38b-38f8b414bd7c
  spec:
    capacity: "1073741824"
    devname: device-localpv
    ownerNodeID: minikube
  status:
    state: Ready
kind: List
metadata:
  resourceVersion: ""
    • To make it easier to manage node block devices, Device-LocalPV also defines a CRD called devicenodes.local.openebs.io. This CRD corresponds to a K8s worker node: for every node used by Device-LocalPV there is one devicenodes.local.openebs.io resource, which records all the block devices available on that node. So when is devicevolumes.local.openebs.io created? The Custom Controller component has a key method called CreateVolume, which is responsible for creating the PV-related custom resource.
    • When the External Provisioner component detects a newly created PVC, it executes its Provision method. Inside Provision, the CreateVolume method of the Custom Controller component is called over gRPC to create the custom resource; once the resource is created, the Provisioner then creates the PV object. The relevant code is implemented as follows:

[Figure: the Provisioner's Provision method calling the Controller's CreateVolume over gRPC]

  • Agent component:
    • The last Device-LocalPV component is the Agent, which is the Node component of the CSI plug-in. As the name suggests, all operations on the node host are performed through this component.
    • After the Provisioner component and the Controller component have created the PV and the custom resource respectively, two steps remain: creating a partition on the block device and mounting that partition into the container. Both are the responsibility of the Agent component.
    • The constructor of the Agent component is as follows:
// NewNode returns a new instance
// of CSI NodeServer
func NewNode(d *CSIDriver) csi.NodeServer {
	var ControllerMutex = sync.RWMutex{}

	// set up signals so we handle the first shutdown signal gracefully
	stopCh := signals.SetupSignalHandler()

	// start the device node resource watcher
	go func() {
		err := devicenode.Start(&ControllerMutex, stopCh)
		if err != nil {
			klog.Fatalf("Failed to start Device node controller: %s", err.Error())
		}
	}()

	// start the device volume watcher
	go func() {
		err := volume.Start(&ControllerMutex, stopCh)
		if err != nil {
			klog.Fatalf("Failed to start Device volume management controller: %s", err.Error())
		}
	}()

	if d.config.ListenAddress != "" {
		exposeMetrics(d.config, stopCh)
	}

	return &node{
		driver: d,
	}
}
    • As can be seen, the Agent component starts two Goroutines internally, which watch the devicenodes.local.openebs.io and devicevolumes.local.openebs.io custom resources respectively. Both watchers implement a syncHandler method and act according to the state of the resource.
    • The main logic of the volume watcher's syncHandler is as follows:
func (c *VolController) syncHandler(key string) error {
	...
	// Get the Vol resource with this namespace/name
	Vol, err := c.VolLister.DeviceVolumes(namespace).Get(name)
	if k8serror.IsNotFound(err) {
		runtime.HandleError(fmt.Errorf("devicevolume '%s' has been deleted", key))
		return nil
	}
	if err != nil {
		return err
	}
	VolCopy := Vol.DeepCopy()
	err = c.syncVol(VolCopy)
	return err
}

func (c *VolController) syncVol(vol *apis.DeviceVolume) error {
	...
	// if the status Pending means we will try to create the volume
	if vol.Status.State == device.DeviceStatusPending {
		err = device.CreateVolume(vol)
		if err == nil {
			err = device.UpdateVolInfo(vol, device.DeviceStatusReady)
		} else if custError, ok := err.(*apis.VolumeError); ok && custError.Code == apis.InsufficientCapacity {
			vol.Status.Error = custError
			return device.UpdateVolInfo(vol, device.DeviceStatusFailed)
		}
	}
	return err
}
    • As can be seen, syncHandler passes the devicevolumes.local.openebs.io resource it looked up to the syncVol method, and inside that method there is a critical line of code, device.CreateVolume(vol), which creates the real partition on the block device based on the information in the custom resource.
    • Once the partition is created successfully, only the final step remains: the Mount operation, which is performed by the kubelet of the node hosting the PV directly calling the Agent component's NodePublishVolume method. It is worth noting that in the yaml file deploying the Device-LocalPV project, the volumeMounts of the container running the Agent component include a mountPropagation: "Bidirectional" setting. Its purpose is to let Mount commands executed inside the container propagate back to the host, so even though the Agent component runs in a container, the mounts it performs still take effect on the node.
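    • The relevant fragment of the Agent container spec looks roughly like the following (the volume name and mount path are illustrative; the real deployment file mounts the kubelet directory):
volumeMounts:
  - name: pods-mount-dir
    mountPath: /var/lib/kubelet/
    # mounts made inside this container are propagated back to the host
    mountPropagation: "Bidirectional"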

③ Scheduling strategy

  • Device-LocalPV provides two scheduling strategies: CapacityWeighted and VolumeWeighted. The scheduling strategy can be specified through parameters in the storage class:
parameters:
 scheduler: "VolumeWeighted"
 devname: "test-device"
  • CapacityWeighted
    • CapacityWeighted is the default scheduling policy, that is, scheduling based on usage capacity. It will find nodes where OpenEBS has been deployed, score them based on the used capacity of the block device on the node, and prioritize nodes with smaller used capacity.
    • As shown in the figure below, 3 PVs already exist on the Node1 node, 2 PVs exist on the Node2 node, and OpenEBS is not deployed on the Node3 node. If a new PVC is created at this time, how will the PV be scheduled?

[Figure: Node1 with 3 existing PVs, Node2 with 2 existing PVs, Node3 without OpenEBS deployed]

    • Under the CapacityWeighted scheduling policy, the Node3 node is excluded first. Although Node1 already holds more PVs than Node2, its used capacity is smaller, and nodes are ranked by used capacity, so the newly created PV is scheduled to Node1.
  • VolumeWeighted
    • The VolumeWeighted scheduling policy schedules based on the number of used volumes, searches for nodes where OpenEBS has been deployed, scores based on the number of volumes allocated to the block device on the node, and prioritizes scheduling to nodes with a smaller number of used volumes.
    • By the same analysis, in the situation shown above, with the VolumeWeighted scheduling policy the newly created PV is scheduled to the Node2 node:

[Figure: with VolumeWeighted scheduling, the new PV is placed on Node2]
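  • A toy Go sketch of the two policies (illustrative only, not the actual Device-LocalPV scoring code; node names and numbers are made up):
package main

import (
	"fmt"
	"sort"
)

type nodeUsage struct {
	name        string
	usedMiB     uint64 // capacity already allocated to Device-LocalPV volumes
	volumeCount int    // number of volumes already carved out on the node
}

// pickNode sorts the candidate nodes (only those running OpenEBS) by used
// capacity (CapacityWeighted) or by volume count (VolumeWeighted) and takes
// the least loaded one.
func pickNode(nodes []nodeUsage, byCapacity bool) string {
	sort.Slice(nodes, func(i, j int) bool {
		if byCapacity {
			return nodes[i].usedMiB < nodes[j].usedMiB
		}
		return nodes[i].volumeCount < nodes[j].volumeCount
	})
	return nodes[0].name
}

func main() {
	nodes := []nodeUsage{
		{"node1", 3 * 1024, 3}, // three small volumes
		{"node2", 8 * 1024, 2}, // two larger volumes
	}
	fmt.Println(pickNode(nodes, true))  // CapacityWeighted -> node1
	fmt.Println(pickNode(nodes, false)) // VolumeWeighted   -> node2
}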

  • Customized scheduling policy:
    • If things develop in an ideal direction, then there is no problem with the above two scheduling strategies provided by Device-LocalPV. However, in real scenarios, you may encounter the following situations:

[Figure: the cluster grown to three nodes, where Node3 runs OpenEBS but has no block device attached]

    • The number of nodes running OpenEBS has now grown from two to three, but OpenEBS alone is deployed on Node3; no block device has been attached for it to use.
      In this case, if a new PVC is created, it stays in the Pending state forever and the PV cannot be created. No matter which scheduling policy is used, Device-LocalPV always ranks nodes with zero block-device usage first during the scoring stage, so both policies end up scheduling the PV to Node3; and because Node3 has no block device available to OpenEBS, the disk cannot be partitioned and the PV cannot be created successfully.

④ Troubleshooting

  • Node failure:
    • If a node that LocalPV is using fails, you can restore the LocalPV by migrating the block device. The specific steps are as follows (example commands follow the list):
      • Move the data disk to the new node;
      • If OpenEBS is not deployed on the new node, you need to deploy the DaemonSet of Device-LocalPV to the new node first;
      • Modify the node to which the CRD resource devicevolumes.local.openebs.io belongs, that is, modify the spec.ownerNodeID attribute to the new node;
      • Modify the node information of PV, that is, modify the spec.nodeAffinity attribute to the new node;
      • Delete the Pod using the PV to automatically restart and schedule it to the new node specified by the PV.
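    • Example commands for the CRD and PV edits (the resource names below are illustrative and must be replaced with your own):
# point the DeviceVolume resource at the new node (edit spec.ownerNodeID)
kubectl -n openebs edit devicevolumes.local.openebs.io pvc-8e659633-e052-439a-85eb-30a6d385a12b

# point the PV at the new node (edit spec.nodeAffinity)
kubectl edit pv pvc-8e659633-e052-439a-85eb-30a6d385a12b

# delete the Pod so its controller recreates it on the node the PV now points to
kubectl delete pod hello-0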
  • Disk failure: If a block device being used by LocalPV fails, data loss will occur.
    • Note that when the disk partition used by a PV fails, the PV itself cannot sense it; only the program running in the Pod can. If you try to read or write inside the affected Pod container, you will get an error like the following:
/mnt/store # echo abc > a.txt
sh: can't create a.txt: Input/output error
    • Therefore, programs using LocalPV must have a relatively complete exception handling mechanism to deal with possible failures.

Source: blog.csdn.net/Forever_wj/article/details/135017504