A detailed explanation of how CSI works and of the architecture design of the JuiceFS CSI Driver

Container Storage Interface (CSI) establishes an industry-standard interface specification: with CSI, a Container Orchestration system (CO) such as Kubernetes can expose arbitrary storage systems to its container workloads. By implementing the CSI interfaces, the JuiceFS CSI Driver enables applications on Kubernetes to use JuiceFS through a PVC (PersistentVolumeClaim). This article introduces how CSI works and the architecture design of the JuiceFS CSI Driver in detail.

Basic Components of CSI

There are two types of storage plug-ins in Kubernetes: in-tree and out-of-tree. The former runs inside the K8s core components; the latter runs independently, outside of the K8s components. This article mainly covers out-of-tree plug-ins.

Out-of-tree plug-ins interact with K8s components through gRPC interfaces, and K8s provides a number of sidecar components that cooperate with CSI plug-ins to implement rich functionality. For out-of-tree plug-ins, the components involved fall into two groups: the sidecar components provided by K8s, and the plug-ins that need to be implemented by third parties.

Sidecar Components

external-attacher

It watches VolumeAttachment objects and calls the ControllerPublishVolume and ControllerUnpublishVolume interfaces of the CSI driver's Controller service to attach a volume to a node or detach it from a node.

If the storage system needs an attach/detach step, this component is required, because the Attach/Detach Controller inside K8s does not call the CSI driver's interfaces directly.

external-provisioner

It watches PVC objects and calls the CreateVolume and DeleteVolume interfaces of the CSI driver's Controller service to provision new volumes, provided that the provisioner field of the StorageClass specified in the PVC matches the driver name returned by the GetPluginInfo interface. Once a new volume is provisioned, K8s creates the corresponding PV.

If the reclaim policy of the PV bound to the PVC is Delete, the external-provisioner component calls the DeleteVolume interface. Once the volume is successfully deleted, the component also deletes the corresponding PV.

The component also supports creating volumes from a snapshot data source. If a Snapshot CRD is specified as the data source in the PVC, the component obtains information about the snapshot through the VolumeSnapshotContent object and passes it to the CSI driver when calling the CreateVolume interface; the CSI driver then needs to create the volume based on that snapshot.

external-resizer

It watches PVC objects. If a user requests more storage on a PVC, this component calls the ControllerExpandVolume interface of the CSI driver's Controller service to expand the volume (the node-side file system expansion via NodeExpandVolume is then handled by the kubelet).

external-snapshotter

This component works together with the Snapshot Controller. The Snapshot Controller creates the corresponding VolumeSnapshotContent for each Snapshot object created in the cluster, while external-snapshotter is responsible for watching VolumeSnapshotContent objects. When it sees a new VolumeSnapshotContent, it passes the corresponding parameters in a CreateSnapshotRequest to the CSI driver's Controller service by calling its CreateSnapshot interface. The component is also responsible for calling the DeleteSnapshot and ListSnapshots interfaces.

livenessprobe

It monitors the health of the CSI driver and exposes it to K8s through the Liveness Probe mechanism, so that the driver's pod is restarted when an abnormality in the CSI driver is detected.

node-driver-registrar

It calls the NodeGetInfo interface directly and registers the CSI driver's information with the kubelet on the corresponding node through the kubelet's plug-in registration mechanism.

external-health-monitor-controller

It checks the health of CSI volumes by calling the ListVolumes or ControllerGetVolume interface of the CSI driver's Controller service, and reports it as events on the PVC.

external-health-monitor-agent

It checks the health of CSI volumes by calling the NodeGetVolumeStats interface and reports it as events on the pod.
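On the driver side, NodeGetVolumeStats usually just reports filesystem statistics for the mounted volume path. Below is a minimal, Linux-only sketch in Go using statfs; the nodeServer type and the simplified error handling are illustrative assumptions, not code from any real driver.

package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"golang.org/x/sys/unix"
)

// nodeServer is a placeholder type standing in for a real CSI Node plugin.
type nodeServer struct{}

// NodeGetVolumeStats reports capacity and usage of the filesystem mounted at
// the volume path; external-health-monitor-agent (and the kubelet) consume this.
func (s *nodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(req.GetVolumePath(), &st); err != nil {
		return nil, err
	}
	total := int64(st.Blocks) * st.Bsize
	available := int64(st.Bavail) * st.Bsize
	return &csi.NodeGetVolumeStatsResponse{
		Usage: []*csi.VolumeUsage{{
			Unit:      csi.VolumeUsage_BYTES,
			Total:     total,
			Available: available,
			Used:      total - available,
		}},
	}, nil
}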

Third-party Plugins

A third-party storage provider (SP, Storage Provider) needs to implement two plug-ins: Controller and Node. The Controller plug-in is responsible for volume management and is typically deployed as a StatefulSet; the Node plug-in is responsible for mounting volumes into pods and is deployed on every node as a DaemonSet.

The CSI plugin, the kubelet, and the K8s external components communicate with each other via gRPC over a Unix Domain Socket. CSI defines three sets of RPC interfaces that an SP needs to implement in order to work with the K8s external components: CSI Identity, CSI Controller, and CSI Node. Let's look at these interface definitions in detail.
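As a concrete illustration, the following Go sketch shows how an out-of-tree plug-in typically serves the three gRPC services on a Unix Domain Socket. The socket path and the three server implementations are assumptions; real drivers usually read the endpoint from a command-line flag.

package main

import (
	"log"
	"net"
	"os"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// serve exposes the Identity, Controller and Node services over a Unix Domain
// Socket. ids, cs and ns are assumed to be your own implementations of the
// generated csi.IdentityServer, csi.ControllerServer and csi.NodeServer interfaces.
func serve(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
	_ = os.Remove(endpoint) // clean up a stale socket file from a previous run
	listener, err := net.Listen("unix", endpoint)
	if err != nil {
		log.Fatalf("failed to listen on %s: %v", endpoint, err)
	}
	srv := grpc.NewServer()
	csi.RegisterIdentityServer(srv, ids)  // called by node-driver-registrar, livenessprobe
	csi.RegisterControllerServer(srv, cs) // called by external-provisioner, external-attacher, ...
	csi.RegisterNodeServer(srv, ns)       // called by the kubelet
	if err := srv.Serve(listener); err != nil {
		log.Fatalf("gRPC server stopped: %v", err)
	}
}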

CSI Identity

It provides the identity information of the CSI driver, and both the Controller and Node plug-ins need to implement it. The interface is as follows:

service Identity {
  rpc GetPluginInfo(GetPluginInfoRequest)
    returns (GetPluginInfoResponse) {}

  rpc GetPluginCapabilities(GetPluginCapabilitiesRequest)
    returns (GetPluginCapabilitiesResponse) {}

  rpc Probe (ProbeRequest)
    returns (ProbeResponse) {}
}

GetPluginInfo must be implemented; the node-driver-registrar component calls this interface to register the CSI driver with the kubelet. GetPluginCapabilities is used to indicate which capabilities the CSI driver provides.
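For reference, a minimal Identity service in Go might look like the sketch below. The driver name csi.example.com and the version string are illustrative assumptions; whatever GetPluginInfo returns is the name that node-driver-registrar registers with the kubelet and that a StorageClass's provisioner field must match.

package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// identityServer is a minimal sketch of the CSI Identity service.
type identityServer struct{}

func (s *identityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error) {
	// The name returned here must match the StorageClass provisioner field.
	return &csi.GetPluginInfoResponse{Name: "csi.example.com", VendorVersion: "0.1.0"}, nil
}

func (s *identityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	// Advertise that the driver also provides a Controller service, so the
	// sidecars know CreateVolume/DeleteVolume etc. are available.
	return &csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{{
			Type: &csi.PluginCapability_Service_{
				Service: &csi.PluginCapability_Service{Type: csi.PluginCapability_Service_CONTROLLER_SERVICE},
			},
		}},
	}, nil
}

func (s *identityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error) {
	// An empty response indicates the plugin is healthy; livenessprobe relies on this.
	return &csi.ProbeResponse{}, nil
}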

CSI Controller

It is used to implement functions such as creating/deleting volumes, attaching/detaching volumes, volume snapshots, and volume expansion. The Controller plug-in needs to implement this set of interfaces. The interface is as follows:

service Controller {
  rpc CreateVolume (CreateVolumeRequest)
    returns (CreateVolumeResponse) {}

  rpc DeleteVolume (DeleteVolumeRequest)
    returns (DeleteVolumeResponse) {}

  rpc ControllerPublishVolume (ControllerPublishVolumeRequest)
    returns (ControllerPublishVolumeResponse) {}

  rpc ControllerUnpublishVolume (ControllerUnpublishVolumeRequest)
    returns (ControllerUnpublishVolumeResponse) {}

  rpc ValidateVolumeCapabilities (ValidateVolumeCapabilitiesRequest)
    returns (ValidateVolumeCapabilitiesResponse) {}

  rpc ListVolumes (ListVolumesRequest)
    returns (ListVolumesResponse) {}

  rpc GetCapacity (GetCapacityRequest)
    returns (GetCapacityResponse) {}

  rpc ControllerGetCapabilities (ControllerGetCapabilitiesRequest)
    returns (ControllerGetCapabilitiesResponse) {}

  rpc CreateSnapshot (CreateSnapshotRequest)
    returns (CreateSnapshotResponse) {}

  rpc DeleteSnapshot (DeleteSnapshotRequest)
    returns (DeleteSnapshotResponse) {}

  rpc ListSnapshots (ListSnapshotsRequest)
    returns (ListSnapshotsResponse) {}

  rpc ControllerExpandVolume (ControllerExpandVolumeRequest)
    returns (ControllerExpandVolumeResponse) {}

  rpc ControllerGetVolume (ControllerGetVolumeRequest)
    returns (ControllerGetVolumeResponse) {
        option (alpha_method) = true;
    }
}

As mentioned in the introduction of the K8s external components above, different interfaces are called by different components to implement different functions. For example, CreateVolume / DeleteVolume cooperate with external-provisioner to implement creating/deleting volumes; ControllerPublishVolume / ControllerUnpublishVolume cooperate with external-attacher to implement attaching/detaching volumes, and so on.
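To make this cooperation concrete, here is a minimal sketch of a CreateVolume implementation, the interface that external-provisioner calls once a PVC is routed to this driver. The volume ID scheme and the omitted backend call are assumptions for illustration only.

package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// controllerServer is a placeholder type standing in for a real Controller plugin.
type controllerServer struct{}

func (s *controllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	if req.GetName() == "" {
		return nil, status.Error(codes.InvalidArgument, "volume name is required")
	}
	capacity := req.GetCapacityRange().GetRequiredBytes()
	// ... here a real driver would create the volume in its storage backend ...
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      "vol-" + req.GetName(), // becomes spec.csi.volumeHandle of the PV
			CapacityBytes: capacity,
			VolumeContext: req.GetParameters(), // StorageClass parameters, handed back to the Node plugin later
		},
	}, nil
}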

CSI Node

It is used to implement functions such as mounting/unmounting volumes and checking volume status. The Node plug-in needs to implement this set of interfaces. The interface is as follows:

service Node {
  rpc NodeStageVolume (NodeStageVolumeRequest)
    returns (NodeStageVolumeResponse) {}

  rpc NodeUnstageVolume (NodeUnstageVolumeRequest)
    returns (NodeUnstageVolumeResponse) {}

  rpc NodePublishVolume (NodePublishVolumeRequest)
    returns (NodePublishVolumeResponse) {}

  rpc NodeUnpublishVolume (NodeUnpublishVolumeRequest)
    returns (NodeUnpublishVolumeResponse) {}

  rpc NodeGetVolumeStats (NodeGetVolumeStatsRequest)
    returns (NodeGetVolumeStatsResponse) {}

  rpc NodeExpandVolume(NodeExpandVolumeRequest)
    returns (NodeExpandVolumeResponse) {}

  rpc NodeGetCapabilities (NodeGetCapabilitiesRequest)
    returns (NodeGetCapabilitiesResponse) {}

  rpc NodeGetInfo (NodeGetInfoRequest)
    returns (NodeGetInfoResponse) {}
}

NodeStageVolume is used to let multiple pods share one volume: the volume is first mounted (staged) to a temporary directory on the node, and then bind-mounted into each pod through NodePublishVolume; NodeUnstageVolume is the inverse operation. A minimal sketch of this staging pattern follows.
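The following Go sketch shows the staging pattern described above, using the k8s.io/mount-utils package. The device path, filesystem type, and directory permissions are illustrative assumptions; a real Node plug-in derives them from the request and its own configuration.

package main

import (
	"context"
	"os"

	"github.com/container-storage-interface/spec/lib/go/csi"
	mount "k8s.io/mount-utils"
)

type stagingNodeServer struct {
	mounter mount.Interface // e.g. mount.New("")
}

// NodeStageVolume mounts the volume once into a node-global staging directory.
func (s *stagingNodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	staging := req.GetStagingTargetPath()
	if err := os.MkdirAll(staging, 0750); err != nil {
		return nil, err
	}
	// "/dev/example" and "ext4" are assumptions standing in for the real device and fstype.
	if err := s.mounter.Mount("/dev/example", staging, "ext4", nil); err != nil {
		return nil, err
	}
	return &csi.NodeStageVolumeResponse{}, nil
}

// NodePublishVolume bind-mounts the staged directory into the pod's target path,
// so multiple pods on the node share a single staged mount.
func (s *stagingNodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	target := req.GetTargetPath()
	if err := os.MkdirAll(target, 0750); err != nil {
		return nil, err
	}
	if err := s.mounter.Mount(req.GetStagingTargetPath(), target, "", []string{"bind"}); err != nil {
		return nil, err
	}
	return &csi.NodePublishVolumeResponse{}, nil
}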

Workflow

Let's look at the entire workflow of mounting a volume into a pod. The whole process has three stages: Provision/Delete, Attach/Detach, and Mount/Unmount, but not every storage solution goes through all three. For example, NFS has no Attach/Detach stage.

The whole process involves not only the components introduced above, but also the AttachDetachController and PVController inside the ControllerManager, as well as the kubelet. The three stages of Provision, Attach, and Mount are analyzed in detail below.

Provision

Let's first look at the Provision stage; the whole process is shown in the figure above. Both external-provisioner and PVController watch PVC resources. A condensed sketch of steps 2 and 3 follows the list below.

  1. When PVController sees that a PVC has been created in the cluster, it checks whether an in-tree plugin matches it. If not, it determines that the storage type is out-of-tree and annotates the PVC with volume.beta.kubernetes.io/storage-provisioner={csi driver name};
  2. When external-provisioner sees that the CSI driver name in the PVC's annotation matches its own, it calls the CreateVolume interface of the CSI Controller service;
  3. When the CreateVolume call returns successfully, external-provisioner creates the corresponding PV in the cluster;
  4. When PVController sees the new PV in the cluster, it binds the PV to the PVC.
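The Go sketch below condenses steps 2 and 3: checking the PVC annotation and calling CreateVolume over the driver's Controller connection. The driver name, the established gRPC connection, and the omitted PV creation are assumptions made for illustration.

package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	corev1 "k8s.io/api/core/v1"
)

const driverName = "csi.example.com" // illustrative; in reality this comes from GetPluginInfo

// provisionPVC condenses steps 2 and 3 above: if PVController has annotated the
// PVC for this driver, call CreateVolume; external-provisioner would then build
// a PV whose volumeHandle is the returned VolumeId.
func provisionPVC(ctx context.Context, conn *grpc.ClientConn, pvc *corev1.PersistentVolumeClaim) (*csi.Volume, error) {
	if pvc.Annotations["volume.beta.kubernetes.io/storage-provisioner"] != driverName {
		return nil, nil // not handled by this driver
	}
	request := pvc.Spec.Resources.Requests[corev1.ResourceStorage]
	client := csi.NewControllerClient(conn)
	resp, err := client.CreateVolume(ctx, &csi.CreateVolumeRequest{
		Name:          "pvc-" + string(pvc.UID),
		CapacityRange: &csi.CapacityRange{RequiredBytes: request.Value()},
	})
	if err != nil {
		return nil, err
	}
	return resp.Volume, nil
}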

Attach

The Attach stage refers to attaching the volume to the node. The whole process is shown in the figure above, and a condensed sketch of steps 2 and 3 follows the list below.

  1. When ADController (the AttachDetachController) observes that a pod using a CSI-type PV has been scheduled to a node, it calls the interface of the internal in-tree CSI plug-in, which creates a VolumeAttachment resource in the cluster;
  2. The external-attacher component watches VolumeAttachment objects and, when it sees one, calls the ControllerPublishVolume interface;
  3. When the ControllerPublishVolume call succeeds, external-attacher sets the Attached status of the corresponding VolumeAttachment object to true;
  4. When ADController sees that the Attached status of the VolumeAttachment object is true, it updates its internal state, ActualStateOfWorld.
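The sketch below condenses steps 2 and 3 of the Attach stage: calling ControllerPublishVolume and then marking the VolumeAttachment as attached. The access mode, volume capability, and the omitted status update through the API server are illustrative assumptions.

package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	storagev1 "k8s.io/api/storage/v1"
)

// attachVolume condenses steps 2 and 3 above: publish the volume to the node,
// then mark the VolumeAttachment as attached (external-attacher persists this
// by updating the VolumeAttachment status through the API server).
func attachVolume(ctx context.Context, conn *grpc.ClientConn, va *storagev1.VolumeAttachment, volumeID, nodeID string) error {
	client := csi.NewControllerClient(conn)
	_, err := client.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{
		VolumeId: volumeID,
		NodeId:   nodeID,
		VolumeCapability: &csi.VolumeCapability{
			AccessType: &csi.VolumeCapability_Mount{Mount: &csi.VolumeCapability_MountVolume{}},
			AccessMode: &csi.VolumeCapability_AccessMode{Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER},
		},
	})
	if err != nil {
		return err
	}
	va.Status.Attached = true
	return nil
}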

Mount

The final step, mounting the volume into the pod, involves the kubelet. In short, when the kubelet on the target node creates a pod, it calls the CSI Node plug-in to perform the mount operation. Below we break down the components inside the kubelet to analyze the process.

First, in its main sync function syncPod, the kubelet calls the WaitForAttachAndMount method of its sub-component volumeManager and waits for the volume mount to complete:

func (kl *Kubelet) syncPod(o syncPodOptions) error {
...
	// Volume manager will not mount volumes for terminated pods
	if !kl.podIsTerminated(pod) {
		// Wait for volumes to attach/mount
		if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
			kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)
			klog.Errorf("Unable to attach or mount volumes for pod %q: %v; skipping pod", format.Pod(pod), err)
			return err
		}
	}
...
}

The volumeManager contains two components: desiredStateOfWorldPopulator and reconciler. These two components cooperate to complete the mount and unmount of the volumes used by pods. The whole process is as follows:

desiredStateOfWorldPopulator and reconciler collaborate in a producer-consumer pattern. Two queues are maintained in the volumeManager (strictly speaking they are interfaces, but they act as queues here): DesiredStateOfWorld and ActualStateOfWorld. The former maintains the desired state of the volumes on the current node; the latter maintains their actual state.

The desiredStateOfWorldPopulator does only two things in its own loop. First, it gets the newly created pods of the current node from the kubelet's podManager and records the volumes to be mounted in DesiredStateOfWorld. Second, for pods on the current node that have been deleted, it checks whether their volumes are still recorded in ActualStateOfWorld and, if not, removes them from DesiredStateOfWorld, ensuring that DesiredStateOfWorld records the desired state of all volumes on the node. The relevant code is as follows (some code has been omitted to simplify the logic):

// Iterate through all pods and add to desired state of world if they don't
// exist but should
func (dswp *desiredStateOfWorldPopulator) findAndAddNewPods() {
	// Map unique pod name to outer volume name to MountedVolume.
	mountedVolumesForPod := make(map[volumetypes.UniquePodName]map[string]cache.MountedVolume)
	...
	processedVolumesForFSResize := sets.NewString()
	for _, pod := range dswp.podManager.GetPods() {
		dswp.processPodVolumes(pod, mountedVolumesForPod, processedVolumesForFSResize)
	}
}

// processPodVolumes processes the volumes in the given pod and adds them to the
// desired state of the world.
func (dswp *desiredStateOfWorldPopulator) processPodVolumes(
	pod *v1.Pod,
	mountedVolumesForPod map[volumetypes.UniquePodName]map[string]cache.MountedVolume,
	processedVolumesForFSResize sets.String) {
	uniquePodName := util.GetUniquePodName(pod)
    ...
	for _, podVolume := range pod.Spec.Volumes {   
		pvc, volumeSpec, volumeGidValue, err :=
			dswp.createVolumeSpec(podVolume, pod, mounts, devices)

		// Add volume to desired state of world
		_, err = dswp.desiredStateOfWorld.AddPodToVolume(
			uniquePodName, pod, volumeSpec, podVolume.Name, volumeGidValue)
		dswp.actualStateOfWorld.MarkRemountRequired(uniquePodName)
    }
}

The reconciler is the consumer, and it mainly does three things:

  1. unmountVolumes(): traverse the volumes in ActualStateOfWorld and check whether each one is still in DesiredStateOfWorld; if not, call the CSI Node interface to unmount it and record that in ActualStateOfWorld;
  2. mountAttachVolumes(): get the volumes to be mounted from DesiredStateOfWorld, call the CSI Node interface to mount or expand them, and record the result in ActualStateOfWorld;
  3. unmountDetachDevices(): traverse the volumes in ActualStateOfWorld; if a volume has been attached but no pod is using it and it has no record in DesiredStateOfWorld, unmount/detach it.

Let's take mountAttachVolumes() as an example to see how it calls the CSI Node interfaces.

func (rc *reconciler) mountAttachVolumes() {
	// Ensure volumes that should be attached/mounted are attached/mounted.
	for _, volumeToMount := range rc.desiredStateOfWorld.GetVolumesToMount() {
		volMounted, devicePath, err := rc.actualStateOfWorld.PodExistsInVolume(volumeToMount.PodName, volumeToMount.VolumeName)
		volumeToMount.DevicePath = devicePath
		if cache.IsVolumeNotAttachedError(err) {
			...
		} else if !volMounted || cache.IsRemountRequiredError(err) {
			// Volume is not mounted, or is already mounted, but requires remounting
			err := rc.operationExecutor.MountVolume(
				rc.waitForAttachTimeout,
				volumeToMount.VolumeToMount,
				rc.actualStateOfWorld,
				isRemount)
			...
		} else if cache.IsFSResizeRequiredError(err) {
			err := rc.operationExecutor.ExpandInUseVolume(
				volumeToMount.VolumeToMount,
				rc.actualStateOfWorld)
			...
		}
	}
}

The actual mount operations are all performed by rc.operationExecutor. Let's look at the code of operationExecutor:

func (oe *operationExecutor) MountVolume(
	waitForAttachTimeout time.Duration,
	volumeToMount VolumeToMount,
	actualStateOfWorld ActualStateOfWorldMounterUpdater,
	isRemount bool) error {
	...
	var generatedOperations volumetypes.GeneratedOperations
		generatedOperations = oe.operationGenerator.GenerateMountVolumeFunc(
			waitForAttachTimeout, volumeToMount, actualStateOfWorld, isRemount)

	// Avoid executing mount/map from multiple pods referencing the
	// same volume in parallel
	podName := nestedpendingoperations.EmptyUniquePodName

	return oe.pendingOperations.Run(
		volumeToMount.VolumeName, podName, "" /* nodeName */, generatedOperations)
}

This function first constructs the operation function and then executes it. Let's look at the constructor:

func (og *operationGenerator) GenerateMountVolumeFunc(
	waitForAttachTimeout time.Duration,
	volumeToMount VolumeToMount,
	actualStateOfWorld ActualStateOfWorldMounterUpdater,
	isRemount bool) volumetypes.GeneratedOperations {

	volumePlugin, err :=
		og.volumePluginMgr.FindPluginBySpec(volumeToMount.VolumeSpec)

	mountVolumeFunc := func() volumetypes.OperationContext {
		// Get mounter plugin
		volumePlugin, err := og.volumePluginMgr.FindPluginBySpec(volumeToMount.VolumeSpec)
		volumeMounter, newMounterErr := volumePlugin.NewMounter(
			volumeToMount.VolumeSpec,
			volumeToMount.Pod,
			volume.VolumeOptions{})
		...
		// Execute mount
		mountErr := volumeMounter.SetUp(volume.MounterArgs{
			FsUser:              util.FsUserFrom(volumeToMount.Pod),
			FsGroup:             fsGroup,
			DesiredSize:         volumeToMount.DesiredSizeLimit,
			FSGroupChangePolicy: fsGroupChangePolicy,
		})
		// Update actual state of world
		markOpts := MarkVolumeOpts{
			PodName:             volumeToMount.PodName,
			PodUID:              volumeToMount.Pod.UID,
			VolumeName:          volumeToMount.VolumeName,
			Mounter:             volumeMounter,
			OuterVolumeSpecName: volumeToMount.OuterVolumeSpecName,
			VolumeGidVolume:     volumeToMount.VolumeGidValue,
			VolumeSpec:          volumeToMount.VolumeSpec,
			VolumeMountState:    VolumeMounted,
		}

		markVolMountedErr := actualStateOfWorld.MarkVolumeAsMounted(markOpts)
		...
		return volumetypes.NewOperationContext(nil, nil, migrated)
	}

	return volumetypes.GeneratedOperations{
		OperationName:     "volume_mount",
		OperationFunc:     mountVolumeFunc,
		EventRecorderFunc: eventRecorderFunc,
		CompleteFunc:      util.OperationCompleteHook(util.GetFullQualifiedPluginNameForVolume(volumePluginName, volumeToMount.VolumeSpec), "volume_mount"),
	}
}

Here, the kubelet first looks up the corresponding plug-in in the list of CSI plug-ins registered with it, then executes volumeMounter.SetUp, and finally updates the ActualStateOfWorld record. The component responsible for calling the external CSI plug-in is csiMountMgr; the code is as follows:

func (c *csiMountMgr) SetUp(mounterArgs volume.MounterArgs) error {
	return c.SetUpAt(c.GetPath(), mounterArgs)
}

func (c *csiMountMgr) SetUpAt(dir string, mounterArgs volume.MounterArgs) error {
	csi, err := c.csiClientGetter.Get()
	...

	err = csi.NodePublishVolume(
		ctx,
		volumeHandle,
		readOnly,
		deviceMountPath,
		dir,
		accessMode,
		publishContext,
		volAttribs,
		nodePublishSecrets,
		fsType,
		mountOptions,
	)
    ...
	return nil
}

As you can see, the csiMountMgr inside the kubelet's volumeManager is what calls the CSI Node NodePublishVolume / NodeUnpublishVolume interfaces. At this point, the whole process of mounting a volume into a pod has been traced.

How JuiceFS CSI Driver works

Next, let's take a look at how the JuiceFS CSI Driver works. The architecture diagram is as follows:

In its implementation of the CSI Node NodePublishVolume interface, JuiceFS creates a pod to execute juicefs mount xxx, which ensures that the JuiceFS client runs inside a pod. If multiple business pods share one volume, the mount pod is reference-counted through its annotations to ensure it is not created repeatedly. The specific code is as follows (logs and other irrelevant code are omitted for readability):

func (p *PodMount) JMount(jfsSetting *jfsConfig.JfsSetting) error {
	if err := p.createOrAddRef(jfsSetting); err != nil {
		return err
	}
	return p.waitUtilPodReady(GenerateNameByVolumeId(jfsSetting.VolumeId))
}

func (p *PodMount) createOrAddRef(jfsSetting *jfsConfig.JfsSetting) error {
	...
	
	for i := 0; i < 120; i++ {
		// wait for old pod deleted
		oldPod, err := p.K8sClient.GetPod(podName, jfsConfig.Namespace)
		if err == nil && oldPod.DeletionTimestamp != nil {
			time.Sleep(time.Millisecond * 500)
			continue
		} else if err != nil {
			if K8serrors.IsNotFound(err) {
				newPod := r.NewMountPod(podName)
				if newPod.Annotations == nil {
					newPod.Annotations = make(map[string]string)
				}
				newPod.Annotations[key] = jfsSetting.TargetPath
				po, err := p.K8sClient.CreatePod(newPod)
				...
				return err
			}
			return err
		}
      ...
		return p.AddRefOfMount(jfsSetting.TargetPath, podName)
	}
	return status.Errorf(codes.Internal, "Mount %v failed: mount pod %s has been deleting for 1 min", jfsSetting.VolumeId, podName)
}

func (p *PodMount) waitUtilPodReady(podName string) error {
	// Wait until the mount pod is ready
	for i := 0; i < 60; i++ {
		pod, err := p.K8sClient.GetPod(podName, jfsConfig.Namespace)
		...
		if util.IsPodReady(pod) {
			return nil
		}
		time.Sleep(time.Millisecond * 500)
	}
	...
	return status.Errorf(codes.Internal, "waitUtilPodReady: mount pod %s isn't ready in 30 seconds: %v", podName, log)
}

Whenever a business pod exits, the CSI Node plug-in deletes the corresponding reference in NodeUnpublishVolume, and the mount pod itself is deleted only when the last reference is removed. The specific code is as follows (logs and other irrelevant code are omitted for readability):

func (p *PodMount) JUmount(volumeId, target string) error {
   ...
	err = retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		po, err := p.K8sClient.GetPod(pod.Name, pod.Namespace)
		if err != nil {
			return err
		}
		annotation := po.Annotations
		...
		delete(annotation, key)
		po.Annotations = annotation
		return p.K8sClient.UpdatePod(po)
	})
	...

	deleteMountPod := func(podName, namespace string) error {
		return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
			po, err := p.K8sClient.GetPod(podName, namespace)
			...
			shouldDelay, err = util.ShouldDelay(po, p.K8sClient)
			if err != nil {
				return err
			}
			if !shouldDelay {
				// do not set delay delete, delete it now
				if err := p.K8sClient.DeletePod(po); err != nil {
					return err
				}
			}
			return nil
		})
	}

	newPod, err := p.K8sClient.GetPod(pod.Name, pod.Namespace)
	...
	if HasRef(newPod) {
		return nil
	}
	return deleteMountPod(pod.Name, pod.Namespace)
}

The CSI Driver is decoupled from the JuiceFS client, so upgrading the driver does not affect business containers. Running the client in an independent pod puts it under the management of K8s and makes it more observable. At the same time, we also enjoy the benefits that pods bring, such as stronger isolation and the ability to set a resource quota for the client separately (see the sketch below).
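For example, the resource quota of the mount pod can be set per StorageClass. The Go sketch below builds such a StorageClass; the juicefs/mount-* parameter keys and secret names follow the JuiceFS CSI Driver documentation conventions as I recall them, so treat them as assumptions and verify them against the version you deploy.

package main

import (
	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// juicefsStorageClass sketches a StorageClass that sets a separate resource
// quota for the JuiceFS client (the mount pod).
func juicefsStorageClass() *storagev1.StorageClass {
	reclaim := corev1.PersistentVolumeReclaimRetain
	return &storagev1.StorageClass{
		ObjectMeta:    metav1.ObjectMeta{Name: "juicefs-sc"},
		Provisioner:   "csi.juicefs.com", // the driver name returned by GetPluginInfo
		ReclaimPolicy: &reclaim,
		Parameters: map[string]string{
			"csi.storage.k8s.io/provisioner-secret-name":       "juicefs-secret",
			"csi.storage.k8s.io/provisioner-secret-namespace":  "default",
			"csi.storage.k8s.io/node-publish-secret-name":      "juicefs-secret",
			"csi.storage.k8s.io/node-publish-secret-namespace": "default",
			// Resource quota for the mount pod that runs the JuiceFS client.
			"juicefs/mount-cpu-limit":      "1",
			"juicefs/mount-memory-limit":   "1Gi",
			"juicefs/mount-cpu-request":    "500m",
			"juicefs/mount-memory-request": "500Mi",
		},
	}
}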

Summary

This article analyzed the workflow of the whole CSI system from three angles: the CSI components, the CSI interfaces, and how a volume is mounted into a pod, and then introduced how the JuiceFS CSI Driver works. CSI is the standard storage interface of the container ecosystem, and the CO communicates with CSI plug-ins through gRPC. To stay general, K8s designed many external components that cooperate with CSI plug-ins to implement different functions, which keeps the internal logic of K8s clean and makes CSI plug-ins easy to use.

If it is helpful, please follow our project Juicedata/JuiceFS ! (0ᴗ0✿)
