Kubernetes data volume management source code analysis

Overview

Volumes are an important part of k8s: they store the system and business data produced by pods. The volume management logic lives in the kubelet.

Source code analysis

We start with the kubelet startup method:

func main() {
       s := options.NewKubeletServer()
       s.AddFlags(pflag.CommandLine)

       flag.InitFlags()
       logs.InitLogs()
       defer logs.FlushLogs()

       verflag.PrintAndExitIfRequested()

       if err := app.Run(s, nil); err != nil {
              fmt.Fprintf(os.Stderr, "error: %v\n", err)
              os.Exit(1)
       }
}

main hands control to app.Run, which calls run. The run method wires together all the important parts of the kubelet:

func run(s *options.KubeletServer, kubeDeps *kubelet.KubeletDeps) (err error) {
        
       // validate the configuration
       ...

       if kubeDeps == nil {
              ...

              kubeDeps, err = UnsecuredKubeletDeps(s)

              ...
       }

       // initialize cAdvisor, containerManager, and other managers
       ...


       if err := RunKubelet(&s.KubeletConfiguration, kubeDeps, s.RunOnce, standaloneMode); err != nil {
              return err
       }

       ...
}

Two of the methods called here matter for volume management.

  • UnsecuredKubeletDeps: initializes the docker client, the network plugins, the volume plugins, and other core components. It is named unsecured because the dependencies it returns carry no authentication or authorization (Auth is nil in the struct below). What we care about here is how it initializes the volume plugins

    	func UnsecuredKubeletDeps(s *options.KubeletServer) (*kubelet.KubeletDeps, error) {
    
    	    ...
    
    		return &kubelet.KubeletDeps{
    			Auth:               nil, 
    			CAdvisorInterface:  nil, 
    			Cloud:              nil, 
    			ContainerManager:   nil,
    			DockerClient:       dockerClient,
    			KubeClient:         nil,
    			ExternalKubeClient: nil,
    			Mounter:            mounter,
    			NetworkPlugins:     ProbeNetworkPlugins(s.NetworkPluginDir, s.CNIConfDir, s.CNIBinDir),
    			OOMAdjuster:        oom.NewOOMAdjuster(),
    			OSInterface:        kubecontainer.RealOS{},
    			Writer:             writer,
    			VolumePlugins:      ProbeVolumePlugins(s.VolumePluginDir),
    			TLSOptions:         tlsOptions,
    		}, nil
    	}
    

    When the volume plugins are initialized, VolumePluginDir is passed in as the directory that holds custom plugins. The default path is **/usr/libexec/kubernetes/kubelet-plugins/volume/exec/**

    	func ProbeVolumePlugins(pluginDir string) []volume.VolumePlugin {
    		allPlugins := []volume.VolumePlugin{}
    		allPlugins = append(allPlugins, aws_ebs.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, empty_dir.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, gce_pd.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, git_repo.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, host_path.ProbeVolumePlugins(volume.VolumeConfig{})...)
    		allPlugins = append(allPlugins, nfs.ProbeVolumePlugins(volume.VolumeConfig{})...)
    		allPlugins = append(allPlugins, secret.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, iscsi.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, glusterfs.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, rbd.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, cinder.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, quobyte.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, cephfs.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, downwardapi.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, fc.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, flocker.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, flexvolume.ProbeVolumePlugins(pluginDir)...)
    		allPlugins = append(allPlugins, azure_file.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, configmap.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, vsphere_volume.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, azure_dd.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, photon_pd.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, projected.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, portworx.ProbeVolumePlugins()...)
    		allPlugins = append(allPlugins, scaleio.ProbeVolumePlugins()...)
    		return allPlugins
    	}
    
    

    Among these plugins, only flexvolume takes the pluginDir parameter, which means it is the only one that supports custom implementations. To see how the kubelet interacts with the plugins and which interfaces they provide, we need to keep reading the code
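
As a rough illustration of what probing a custom plugin directory can look like, here is a minimal sketch. It assumes the conventional flexvolume layout, which the excerpt above does not show: each driver lives in a subdirectory named vendor~driver, and the executable inside it is named after the driver. The driver type and the driverFor and probeDrivers functions are hypothetical names, not the kubelet's own code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// driver is a hypothetical record for one discovered flexvolume driver.
type driver struct {
	Name     string // plugin name, e.g. "kubernetes.io/lvm"
	ExecPath string // path of the executable to run
}

// driverFor maps a subdirectory name such as "kubernetes.io~lvm" to its
// plugin name and to the executable expected inside that subdirectory.
// The "vendor~driver" naming is an assumption about the directory layout.
func driverFor(pluginDir, dirName string) driver {
	name := strings.Replace(dirName, "~", "/", -1)
	base := name[strings.LastIndex(name, "/")+1:]
	return driver{Name: name, ExecPath: filepath.Join(pluginDir, dirName, base)}
}

// probeDrivers treats every subdirectory of pluginDir as one driver.
func probeDrivers(pluginDir string) ([]driver, error) {
	entries, err := os.ReadDir(pluginDir)
	if err != nil {
		return nil, err
	}
	var found []driver
	for _, e := range entries {
		if e.IsDir() {
			found = append(found, driverFor(pluginDir, e.Name()))
		}
	}
	return found, nil
}

func main() {
	drivers, err := probeDrivers("/usr/libexec/kubernetes/kubelet-plugins/volume/exec/")
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	for _, d := range drivers {
		fmt.Println(d.Name, "->", d.ExecPath)
	}
}
```

The built-in plugins need no directory because they are compiled into the kubelet; only a driver found on disk has to be located at startup.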

  • RunKubelet: the startup method of the kubelet service. The most important work happens in startKubelet

    	func RunKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *kubelet.KubeletDeps, runOnce bool, standaloneMode bool) error {
    
    		// initialize the bootstrapper
    		...
    
    		if runOnce {
    			if _, err := k.RunOnce(podCfg.Updates()); err != nil {
    				return fmt.Errorf("runonce failed: %v", err)
    			}
    			glog.Infof("Started kubelet %s as runonce", version.Get().String())
    		} else {
    			startKubelet(k, podCfg, kubeCfg, kubeDeps)
    			glog.Infof("Started kubelet %s", version.Get().String())
    		}
    		return nil
    	}
    

    startKubelet does two things:

    • Continuously syncs pod information from the apiserver and updates volume state as pods are added and deleted
    • Starts the kubelet server and listens for requests. The controller manager can assist the kubelet in managing volumes, and the user can choose to disable that assistance

    	func startKubelet(k kubelet.KubeletBootstrap, podCfg *config.PodConfig, kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *kubelet.KubeletDeps) {
    		// sync pod information
    		go wait.Until(func() { k.Run(podCfg.Updates()) }, 0, wait.NeverStop)
    
    		// start the kubelet server
    		if kubeCfg.EnableServer {
    			go wait.Until(func() {
    				k.ListenAndServe(net.ParseIP(kubeCfg.Address), uint(kubeCfg.Port), kubeDeps.TLSOptions, kubeDeps.Auth, kubeCfg.EnableDebuggingHandlers, kubeCfg.EnableContentionProfiling)
    			}, 0, wait.NeverStop)
    		}
    		if kubeCfg.ReadOnlyPort > 0 {
    			go wait.Until(func() {
    				k.ListenAndServeReadOnly(net.ParseIP(kubeCfg.Address), uint(kubeCfg.ReadOnlyPort))
    			}, 0, wait.NeverStop)
    		}
    	}
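
Each long-running piece of startKubelet is wrapped in wait.Until with a period of 0 and wait.NeverStop, so the wrapped function is restarted as soon as it returns and is never stopped. The sketch below shows that looping behaviour with the standard library only. The until and countRuns functions are hypothetical stand-ins written for this illustration, not the real wait package.

```go
package main

import (
	"fmt"
	"time"
)

// until calls f again and again, pausing for period between calls,
// and returns once stop is closed. It is a simplified stand-in for
// wait.Until, not the real implementation.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		// Check stop first so that f never runs after stop is closed.
		select {
		case <-stop:
			return
		default:
		}
		f()
		// Wait for the next period, or leave early when stop is closed.
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

// countRuns runs a function under until and closes the stop channel
// from inside the function after n calls.
func countRuns(n int) int {
	stop := make(chan struct{})
	runs := 0
	until(func() {
		runs++
		if runs == n {
			close(stop)
		}
	}, 0, stop)
	return runs
}

func main() {
	fmt.Println("runs before stop:", countRuns(3))
}
```

With a stop channel that is never closed, as in startKubelet, the loop only ends when the process exits.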
    

    Tracing the Run method that syncs pod information leads to this code

    	func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    
    	    ...
    
    		go kl.volumeManager.Run(kl.sourcesReady, wait.NeverStop)
    
    		if kl.kubeClient != nil {
    			// sync node status
    			go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
    		}
    
    		// sync pod information
    		kl.pleg.Start()
    		kl.syncLoop(updates, kl)
    	}
    
    

    kl.volumeManager is the core interface of kubelet for data volume management

    	type VolumeManager interface {
    		Run(sourcesReady config.SourcesReady, stopCh <-chan struct{})
    
    		WaitForAttachAndMount(pod *v1.Pod) error
    
    		GetMountedVolumesForPod(podName types.UniquePodName) container.VolumeMap
    
    		GetExtraSupplementalGroupsForPod(pod *v1.Pod) []int64
    
    		GetVolumesInUse() []v1.UniqueVolumeName
    
    		ReconcilerStatesHasBeenSynced() bool
    
    		VolumeIsAttached(volumeName v1.UniqueVolumeName) bool
    
    		MarkVolumesAsReportedInUse(volumesReportedAsInUse []v1.UniqueVolumeName)
    	}
    

    VolumeManager.Run starts asynchronous loops. Once a pod is scheduled to the node, the volume manager checks every volume the pod requests and performs attach, detach, mount, or unmount operations according to the relationship between those volumes and the pod.

    	func (vm *volumeManager) Run(sourcesReady config.SourcesReady, stopCh <-chan struct{}) {
    		defer runtime.HandleCrash()
    
    		go vm.desiredStateOfWorldPopulator.Run(sourcesReady, stopCh)
    		glog.V(2).Infof("The desired_state_of_world populator starts")
    
    		glog.Infof("Starting Kubelet Volume Manager")
    		go vm.reconciler.Run(stopCh)
    
    		<-stopCh
    		glog.Infof("Shutting down Kubelet Volume Manager")
    	}
    

    The two calls to focus on are vm.desiredStateOfWorldPopulator.Run and vm.reconciler.Run. Before looking at them, one piece of background is needed to understand both.

    The kubelet manages volumes by comparing two states:

    • DesiredStateOfWorld: how pods are expected to use volumes, called the desired (expected) state. Once a pod.yaml that declares volumes has been submitted successfully, the desired state is determined
    • ActualStateOfWorld: how pods actually use volumes, called the actual state. The actual state is what the kubelet's background threads observe

    With these two states in mind, it is easy to guess what vm.desiredStateOfWorldPopulator.Run does: it updates DesiredStateOfWorld from the pod information synced from the apiserver. The other method, vm.reconciler.Run, is the coordinator between the two states and is responsible for driving the actual state toward the desired state. To see how the desired state is updated and how the coordinator works, we need to keep reading the code
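
The relationship between the two states can be reduced to a set difference. The sketch below is a simplified illustration, not kubelet code: mountKey, diff, and sortKeys are hypothetical names, and each state is modelled as a plain set of (pod, volume) pairs.

```go
package main

import (
	"fmt"
	"sort"
)

// mountKey identifies one (pod, volume) pair. Hypothetical type.
type mountKey struct {
	Pod    string
	Volume string
}

// diff compares the desired and the actual state and returns the pairs
// that still have to be mounted and the pairs that have to be unmounted.
func diff(desired, actual map[mountKey]bool) (toMount, toUnmount []mountKey) {
	for k := range desired {
		if !actual[k] {
			toMount = append(toMount, k)
		}
	}
	for k := range actual {
		if !desired[k] {
			toUnmount = append(toUnmount, k)
		}
	}
	sortKeys(toMount)
	sortKeys(toUnmount)
	return toMount, toUnmount
}

// sortKeys orders pairs by pod and then by volume so results are stable.
func sortKeys(keys []mountKey) {
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Pod != keys[j].Pod {
			return keys[i].Pod < keys[j].Pod
		}
		return keys[i].Volume < keys[j].Volume
	})
}

func main() {
	desired := map[mountKey]bool{
		{Pod: "web", Volume: "data"}:  true,
		{Pod: "db", Volume: "pgdata"}: true,
	}
	actual := map[mountKey]bool{
		{Pod: "web", Volume: "data"}:  true,
		{Pod: "old", Volume: "cache"}: true,
	}
	toMount, toUnmount := diff(desired, actual)
	fmt.Println("to mount:", toMount)
	fmt.Println("to unmount:", toUnmount)
}
```

Running such a comparison repeatedly is what lets the actual state converge on the desired state even when individual operations fail and have to be retried.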

    Tracing vm.desiredStateOfWorldPopulator.Run leads to this logic

    	func (dswp *desiredStateOfWorldPopulator) findAndAddNewPods() {
    		for _, pod := range dswp.podManager.GetPods() {
    			if dswp.isPodTerminated(pod) {
    				continue
    			}
    			dswp.processPodVolumes(pod)
    		}
    	}
    

    The kubelet syncs newly added pods into the podManager used by desiredStateOfWorldPopulator. This code iterates over the pods that have not terminated and hands each one to processPodVolumes

    	func (dswp *desiredStateOfWorldPopulator) processPodVolumes(pod *v1.Pod) {
    
    		...
    
    		for _, podVolume := range pod.Spec.Volumes {
    			volumeSpec, volumeGidValue, err :=
    				dswp.createVolumeSpec(podVolume, pod.Namespace)
    			if err != nil {
    				glog.Errorf(
    					"Error processing volume %q for pod %q: %v",
    					podVolume.Name,
    					format.Pod(pod),
    					err)
    				continue
    			}
    
    
    			_, err = dswp.desiredStateOfWorld.AddPodToVolume(
    				uniquePodName, pod, volumeSpec, podVolume.Name, volumeGidValue)
    			if err != nil {
    				glog.Errorf(
    					"Failed to add volume %q (specName: %q) for pod %q to desiredStateOfWorld. err=%v",
    					podVolume.Name,
    					volumeSpec.Name(),
    					uniquePodName,
    					err)
    			}
    
    			glog.V(10).Infof(
    				"Added volume %q (volSpec=%q) for pod %q to desired state.",
    				podVolume.Name,
    				volumeSpec.Name(),
    				uniquePodName)
    		}
    
    		dswp.markPodProcessed(uniquePodName)
    	}
    

    desiredStateOfWorldPopulator does no heavy lifting itself. It acts as a proxy: it passes each pod's volumes to desiredStateOfWorld, which records the desired state, and then marks the pod as processed

    	func (dsw *desiredStateOfWorld) AddPodToVolume(
    		podName types.UniquePodName,
    		pod *v1.Pod,
    		volumeSpec *volume.Spec,
    		outerVolumeSpecName string,
    		volumeGidValue string) (v1.UniqueVolumeName, error) {
    
    		...
    
    		dsw.volumesToMount[volumeName].podsToMount[podName] = podToMount{
    			podName:             podName,
    			pod:                 pod,
    			spec:                volumeSpec,
    			outerVolumeSpecName: outerVolumeSpecName,
    		}
    
    		return volumeName, nil
    	}
    

    Setting aside the elided preprocessing, the core is simple: the desired state is a nested mapping table that binds volumes to pods. The reconciler later consults this table to decide which pods should have which volumes mounted
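
The nested mapping table can be sketched as a map of volumes, each holding a map of the pods that want it. This is a simplified illustration: the desiredState and volumeToMount types and their fields are assumptions reduced from the excerpt above, with plain strings in place of the real types.

```go
package main

import "fmt"

// volumeToMount holds the pods that want one volume. Simplified,
// hypothetical shape; the real struct carries more fields.
type volumeToMount struct {
	volumeName  string
	podsToMount map[string]string // pod name -> outer volume spec name
}

// desiredState is the outer table, keyed by volume name.
type desiredState struct {
	volumesToMount map[string]*volumeToMount
}

func newDesiredState() *desiredState {
	return &desiredState{volumesToMount: map[string]*volumeToMount{}}
}

// addPodToVolume records that pod wants volume. Calling it twice with
// the same arguments leaves the table unchanged.
func (d *desiredState) addPodToVolume(pod, volume, specName string) {
	v, ok := d.volumesToMount[volume]
	if !ok {
		v = &volumeToMount{volumeName: volume, podsToMount: map[string]string{}}
		d.volumesToMount[volume] = v
	}
	v.podsToMount[pod] = specName
}

// podExistsInVolume reports whether pod is bound to volume.
func (d *desiredState) podExistsInVolume(pod, volume string) bool {
	v, ok := d.volumesToMount[volume]
	if !ok {
		return false
	}
	_, ok = v.podsToMount[pod]
	return ok
}

// deletePodFromVolume removes the binding and drops the volume entry
// once no pod wants it any more.
func (d *desiredState) deletePodFromVolume(pod, volume string) {
	v, ok := d.volumesToMount[volume]
	if !ok {
		return
	}
	delete(v.podsToMount, pod)
	if len(v.podsToMount) == 0 {
		delete(d.volumesToMount, volume)
	}
}

func main() {
	dsw := newDesiredState()
	dsw.addPodToVolume("web-1", "vol-a", "data")
	dsw.addPodToVolume("web-2", "vol-a", "data")
	fmt.Println(dsw.podExistsInVolume("web-1", "vol-a"))
	dsw.deletePodFromVolume("web-1", "vol-a")
	fmt.Println(dsw.podExistsInVolume("web-1", "vol-a"))
	fmt.Println(len(dsw.volumesToMount))
}
```

Keying the outer map by volume makes it cheap to answer both questions the reconciler asks: whether a given pod is bound to a volume, and whether anyone still wants the volume at all.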

    With desiredStateOfWorldPopulator covered, we move on to the other core interface, the reconciler. It is the most important controller in the volume manager

    Tracing the reconciler's Run method leads to this core piece of code

    	func (rc *reconciler) reconcile() {
    
    		// unmount
    		for _, mountedVolume := range rc.actualStateOfWorld.GetMountedVolumes() {
    			if !rc.desiredStateOfWorld.PodExistsInVolume(mountedVolume.PodName, mountedVolume.VolumeName) {
    
    				...
    
    				err := rc.operationExecutor.UnmountVolume(
    					mountedVolume.MountedVolume, rc.actualStateOfWorld)
    
    				...
    			}
    		}
    
    		// attach/mount
    		for _, volumeToMount := range rc.desiredStateOfWorld.GetVolumesToMount() {
    			volMounted, devicePath, err := rc.actualStateOfWorld.PodExistsInVolume(volumeToMount.PodName, volumeToMount.VolumeName)
    			volumeToMount.DevicePath = devicePath
    			if cache.IsVolumeNotAttachedError(err) {
    
    				...
    
    				err := rc.operationExecutor.AttachVolume(volumeToAttach, rc.actualStateOfWorld)
    
    				...
    
    			} else if !volMounted || cache.IsRemountRequiredError(err) {
    
    				...
    
    				err := rc.operationExecutor.MountVolume(
    					rc.waitForAttachTimeout,
    					volumeToMount.VolumeToMount,
    					rc.actualStateOfWorld)
    
    				...
    			}
    		}
    
    		// unmount device / detach
    		for _, attachedVolume := range rc.actualStateOfWorld.GetUnmountedVolumes() {
    			if !rc.desiredStateOfWorld.VolumeExists(attachedVolume.VolumeName) &&
    				!rc.operationExecutor.IsOperationPending(attachedVolume.VolumeName, nestedpendingoperations.EmptyUniquePodName) {
    				if attachedVolume.GloballyMounted {
    
    					...
    
    					err := rc.operationExecutor.UnmountDevice(
    						attachedVolume.AttachedVolume, rc.actualStateOfWorld, rc.mounter)
    					...
    
    				} else {
    
    					...
    
    					err := rc.operationExecutor.DetachVolume(
    						attachedVolume.AttachedVolume, false, rc.actualStateOfWorld)
    
    					...
    				}
    			}
    		}
    	}
    

    Redundant code is omitted here and only the core is kept. This control logic is the coordinator: it issues operations based on the difference between the actual state and the desired state.

    • If the desired state has no binding between a volume and a pod, unmount the volume from the pod, and detach the volume once nothing uses it
    • If the desired state has a binding between a volume and a pod, attach the volume and then mount it for the pod
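
The ordering of the three loops in reconcile can be sketched as a function that picks the next operation for a single volume. This is a simplified model with hypothetical names (volumeState, nextOperation); it follows the order of the excerpt above but ignores pending operations, remounts, and errors.

```go
package main

import "fmt"

// volumeState is a hypothetical, simplified view of one volume in the
// actual state of the world.
type volumeState struct {
	attached        bool
	globallyMounted bool
	mountedPods     map[string]bool
}

// nextOperation returns the name of the operation a reconcile pass would
// start for this volume, or "" when desired and actual state agree.
func nextOperation(desiredPods map[string]bool, actual volumeState) string {
	// Phase 1: unmount from pods that no longer want the volume.
	for pod := range actual.mountedPods {
		if !desiredPods[pod] {
			return "UnmountVolume"
		}
	}
	// Phase 2: attach first, then mount for every pod that wants it.
	for pod := range desiredPods {
		if !actual.attached {
			return "AttachVolume"
		}
		if !actual.mountedPods[pod] {
			return "MountVolume"
		}
	}
	// Phase 3: nobody wants or uses the volume, so release the device.
	if len(desiredPods) == 0 && len(actual.mountedPods) == 0 {
		if actual.globallyMounted {
			return "UnmountDevice"
		}
		if actual.attached {
			return "DetachVolume"
		}
	}
	return ""
}

func main() {
	wanted := map[string]bool{"web": true}
	nobody := map[string]bool{}

	fmt.Println(nextOperation(wanted, volumeState{}))
	fmt.Println(nextOperation(wanted, volumeState{attached: true}))
	fmt.Println(nextOperation(nobody, volumeState{
		attached: true, globallyMounted: true,
		mountedPods: map[string]bool{"web": true},
	}))
	fmt.Println(nextOperation(nobody, volumeState{attached: true, globallyMounted: true}))
	fmt.Println(nextOperation(nobody, volumeState{attached: true}))
}
```

Releasing the device only after every pod-level mount is gone is what keeps a volume shared by several pods usable while one of them is being deleted.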

    If a custom flexvolume plugin is used, these operations run the plugin's driver with the corresponding command

    • AttachVolume: calls attach
    • DetachVolume: calls detach
    • MountVolume: calls mountdevice, then mount
    • UnmountVolume: calls unmount
    • UnmountDevice: calls unmountdevice
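
Each of the calls listed above amounts to running the driver executable with an operation name and reading one JSON object from its output. The sketch below illustrates that exchange under stated assumptions: driverStatus, parseStatus, and callDriver are hypothetical names, and the field names follow the JSON printed by the lvm driver script in this article. It is not the kubelet's own driver-call code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// driverStatus holds the JSON fields printed by the lvm driver script in
// this article. Hypothetical type for illustration.
type driverStatus struct {
	Status   string `json:"status"`
	Message  string `json:"message,omitempty"`
	Device   string `json:"device,omitempty"`
	Attached bool   `json:"attached,omitempty"`
}

// parseStatus decodes the driver output and turns every status other
// than "Success" into an error.
func parseStatus(out []byte) (driverStatus, error) {
	var st driverStatus
	if err := json.Unmarshal(out, &st); err != nil {
		return st, fmt.Errorf("driver printed invalid JSON: %w", err)
	}
	if st.Status != "Success" {
		return st, fmt.Errorf("driver reported %q: %s", st.Status, st.Message)
	}
	return st, nil
}

// callDriver runs "<execPath> <op> <args...>" and parses what it prints.
// Output on stdout and stderr is read together.
func callDriver(execPath, op string, extra ...string) (driverStatus, error) {
	args := append([]string{op}, extra...)
	out, runErr := exec.Command(execPath, args...).CombinedOutput()
	st, err := parseStatus(out)
	if err == nil && runErr != nil {
		err = runErr
	}
	return st, err
}

func main() {
	// Parse a sample reply instead of running a real driver.
	st, err := parseStatus([]byte(`{"status": "Success", "device": "/dev/mapper/vg0-vol1"}`))
	fmt.Println(st.Device, err)

	_, err = parseStatus([]byte(`{"status": "Failure", "message": "Volume vol1 does not exist"}`))
	fmt.Println(err)
}
```

In this sketch a "Not supported" reply is also reported as an error; a caller that wants to fall back to default behaviour for unsupported commands would have to check st.Status itself.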

    Below is the lvm driver script that ships as a flexvolume example. It implements only the device-level commands; if you need mount and unmount, you can add them to this script

    	#!/bin/bash
    
    	# Copyright 2015 The Kubernetes Authors.
    	#
    	# Licensed under the Apache License, Version 2.0 (the "License");
    	# you may not use this file except in compliance with the License.
    	# You may obtain a copy of the License at
    	#
    	#     http://www.apache.org/licenses/LICENSE-2.0
    	#
    	# Unless required by applicable law or agreed to in writing, software
    	# distributed under the License is distributed on an "AS IS" BASIS,
    	# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    	# See the License for the specific language governing permissions and
    	# limitations under the License.
    
    	# Notes:
    	#  - Please install "jq" package before using this driver.
    	usage() {
    		err "Invalid usage. Usage: "
    		err "\t$0 init"
    		err "\t$0 attach <json params> <nodename>"
    		err "\t$0 detach <mount device> <nodename>"
    		err "\t$0 waitforattach <mount device> <json params>"
    		err "\t$0 mountdevice <mount dir> <mount device> <json params>"
    		err "\t$0 unmountdevice <mount dir>"
    		err "\t$0 isattached <json params> <nodename>"
    		exit 1
    	}
    
    	err() {
    		echo -ne $* 1>&2
    	}
    
    	log() {
    		echo -ne $* >&1
    	}
    
    	ismounted() {
    		MOUNT=`findmnt -n ${MNTPATH} 2>/dev/null | cut -d' ' -f1`
    		if [ "${MOUNT}" == "${MNTPATH}" ]; then
    			echo "1"
    		else
    			echo "0"
    		fi
    	}
    
    	getdevice() {
    		VOLUMEID=$(echo ${JSON_PARAMS} | jq -r '.volumeID')
    		VG=$(echo ${JSON_PARAMS}|jq -r '.volumegroup')
    
    		# LVM substitutes - with --
    		VOLUMEID=`echo $VOLUMEID|sed s/-/--/g`
    		VG=`echo $VG|sed s/-/--/g`
    
    		DMDEV="/dev/mapper/${VG}-${VOLUMEID}"
    		echo ${DMDEV}
    	}
    
    	attach() {
    		JSON_PARAMS=$1
    		SIZE=$(echo $1 | jq -r '.size')
    
    		DMDEV=$(getdevice)
    		if [ ! -b "${DMDEV}" ]; then
    			err "{\"status\": \"Failure\", \"message\": \"Volume ${VOLUMEID} does not exist\"}"
    			exit 1
    		fi
    		log "{\"status\": \"Success\", \"device\":\"${DMDEV}\"}"
    		exit 0
    	}
    
    	detach() {
    		log "{\"status\": \"Success\"}"
    		exit 0
    	}
    
    	waitforattach() {
    		shift
    		attach $*
    	}
    
    	domountdevice() {
    		MNTPATH=$1
    		DMDEV=$2
    		FSTYPE=$(echo $3|jq -r '.["kubernetes.io/fsType"]')
    
    		if [ ! -b "${DMDEV}" ]; then
    			err "{\"status\": \"Failure\", \"message\": \"${DMDEV} does not exist\"}"
    			exit 1
    		fi
    
    		if [ $(ismounted) -eq 1 ] ; then
    			log "{\"status\": \"Success\"}"
    			exit 0
    		fi
    
    		VOLFSTYPE=`blkid -o udev ${DMDEV} 2>/dev/null|grep "ID_FS_TYPE"|cut -d"=" -f2`
    		if [ "${VOLFSTYPE}" == "" ]; then
    			mkfs -t ${FSTYPE} ${DMDEV} >/dev/null 2>&1
    			if [ $? -ne 0 ]; then
    				err "{ \"status\": \"Failure\", \"message\": \"Failed to create fs ${FSTYPE} on device ${DMDEV}\"}"
    				exit 1
    			fi
    		fi
    
    		mkdir -p ${MNTPATH} &> /dev/null
    
    		mount ${DMDEV} ${MNTPATH} &> /dev/null
    		if [ $? -ne 0 ]; then
    			err "{ \"status\": \"Failure\", \"message\": \"Failed to mount device ${DMDEV} at ${MNTPATH}\"}"
    			exit 1
    		fi
    		log "{\"status\": \"Success\"}"
    		exit 0
    	}
    
    	unmountdevice() {
    		MNTPATH=$1
    		if [ ! -d ${MNTPATH} ]; then
    			log "{\"status\": \"Success\"}"
    			exit 0
    		fi
    
    		if [ $(ismounted) -eq 0 ] ; then
    			log "{\"status\": \"Success\"}"
    			exit 0
    		fi
    
    		umount ${MNTPATH} &> /dev/null
    		if [ $? -ne 0 ]; then
    			err "{ \"status\": \"Failure\", \"message\": \"Failed to unmount volume at ${MNTPATH}\"}"
    			exit 1
    		fi
    
    		log "{\"status\": \"Success\"}"
    		exit 0
    	}
    
    	isattached() {
    		log "{\"status\": \"Success\", \"attached\":true}"
    		exit 0
    	}
    
    	op=$1
    
    	if [ "$op" = "init" ]; then
    		log "{\"status\": \"Success\"}"
    		exit 0
    	fi
    
    	if [ $# -lt 2 ]; then
    		usage
    	fi
    
    	shift
    
    	case "$op" in
    		attach)
    			attach $*
    			;;
    		detach)
    			detach $*
    			;;
    		waitforattach)
    			waitforattach $*
    			;;
    		mountdevice)
    			domountdevice $*
    			;;
    		unmountdevice)
    			unmountdevice $*
    			;;
    		isattached)
    			isattached $*
    			;;
    		*)
    			log "{ \"status\": \"Not supported\" }"
    			exit 0
    	esac
    
    	exit 1
    

    One question remains: why are there two mount operations, mountdevice and mount, and what does each do?

    In k8s a volume can be mounted by multiple pods. If one device backs the volume of several pods, it has to appear in several places, but k8s mounts the device itself only once. So k8s first mounts the device to a global directory with mountdevice, and then mounts that global directory into each pod's volume directory as many times as needed. This is how multiple pods can share the same volume
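
The two-level scheme can be modelled with a small in-memory bookkeeping structure. The sketch below performs no real mounts: node and its methods are hypothetical names, and the per-pod step stands for what on Linux would typically be a bind mount of the global directory.

```go
package main

import "fmt"

// node tracks mounts in memory only. Hypothetical model for illustration.
type node struct {
	globalMounts map[string]string          // device -> global directory
	podMounts    map[string]map[string]bool // device -> pod volume directories
}

func newNode() *node {
	return &node{globalMounts: map[string]string{}, podMounts: map[string]map[string]bool{}}
}

// mountDevice mounts the device once at a global directory.
func (n *node) mountDevice(device, globalDir string) error {
	if dir, ok := n.globalMounts[device]; ok {
		if dir == globalDir {
			return nil // already mounted there, nothing to do
		}
		return fmt.Errorf("device %s is already mounted at %s", device, dir)
	}
	n.globalMounts[device] = globalDir
	return nil
}

// mount exposes the global directory inside one pod's volume directory.
func (n *node) mount(device, podDir string) error {
	if _, ok := n.globalMounts[device]; !ok {
		return fmt.Errorf("device %s is not mounted globally", device)
	}
	if n.podMounts[device] == nil {
		n.podMounts[device] = map[string]bool{}
	}
	n.podMounts[device][podDir] = true
	return nil
}

// unmount removes one pod's view of the volume.
func (n *node) unmount(device, podDir string) {
	delete(n.podMounts[device], podDir)
}

// unmountDevice releases the global mount, but only when no pod uses it.
func (n *node) unmountDevice(device string) error {
	if len(n.podMounts[device]) > 0 {
		return fmt.Errorf("device %s is still used by %d pod(s)", device, len(n.podMounts[device]))
	}
	delete(n.globalMounts, device)
	return nil
}

func main() {
	n := newNode()
	dev := "/dev/mapper/vg0-vol1"
	fmt.Println(n.mountDevice(dev, "/global/vol1"))
	fmt.Println(n.mount(dev, "/pods/a/volumes/vol1"))
	fmt.Println(n.mount(dev, "/pods/b/volumes/vol1"))
	fmt.Println(n.unmountDevice(dev))
	n.unmount(dev, "/pods/a/volumes/vol1")
	n.unmount(dev, "/pods/b/volumes/vol1")
	fmt.Println(n.unmountDevice(dev))
}
```

Splitting the work this way also explains the order seen in reconcile: pod-level unmounts come first, and the device-level unmount follows only when no pod is left.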

Summary

Understanding the volume manager code makes it much easier to use the built-in volume plugins or to implement a custom flexvolume plugin. The code above is based on k8s v1.6.6

