Kubelet Device Manager source code analysis

Author: [email protected]

Creation of Device Manager

When is the Device Manager created

Like the Volume Manager, QoS Container Manager, and others, the Device Manager is one of the many managers run by the kubelet. It is created in NewContainerManager when the kubelet starts.

pkg/kubelet/cm/container_manager_linux.go:197

func NewContainerManager(mountUtil mount.Interface, cadvisorInterface cadvisor.Interface, nodeConfig NodeConfig, failSwapOn bool, devicePluginEnabled bool, recorder record.EventRecorder) (ContainerManager, error) {
	
	...

	glog.Infof("Creating device plugin manager: %t", devicePluginEnabled)
	if devicePluginEnabled {
		cm.deviceManager, err = devicemanager.NewManagerImpl()
	} else {
		cm.deviceManager, err = devicemanager.NewManagerStub()
	}
	if err != nil {
		return nil, err
	}
	...
}	

ManagerImpl structure

First, it is necessary to understand the structure of the Device Manager:

// ManagerImpl is the structure in charge of managing Device Plugins.
type ManagerImpl struct {
	socketname string
	socketdir  string

	endpoints map[string]endpoint // Key is ResourceName
	mutex     sync.Mutex

	server *grpc.Server

	// activePods is a method for listing active pods on the node
	// so the amount of pluginResources requested by existing pods
	// could be counted when updating allocated devices
	activePods ActivePodsFunc

	// sourcesReady provides the readiness of kubelet configuration sources such as apiserver update readiness.
	// We use it to determine when we can purge inactive pods from checkpointed state.
	sourcesReady config.SourcesReady

	// callback is used for updating devices' states in one time call.
	// e.g. a new device is advertised, two old devices are deleted and a running device fails.
	callback monitorCallback

	// healthyDevices contains all of the registered healthy resourceNames and their exported device IDs.
	healthyDevices map[string]sets.String

	// unhealthyDevices contains all of the unhealthy devices and their exported device IDs.
	unhealthyDevices map[string]sets.String

	// allocatedDevices contains allocated deviceIds, keyed by resourceName.
	allocatedDevices map[string]sets.String

	// podDevices contains pod to allocated device mapping.
	podDevices podDevices
	store      utilstore.Store
	pluginOpts map[string]*pluginapi.DevicePluginOptions
}

The following describes the core fields:

  • socketname: the name of the socket exposed by the kubelet, i.e. kubelet.sock.

  • socketdir: the directory where the device plugins' sockets are stored, /var/lib/kubelet/device-plugins/.

  • endpoints: map object; the key is the Resource Name and the value is an endpoint interface (with methods run, stop, allocate, preStartContainer, getDevices, callback, isStopped, and stopGracePeriodExpired; a sketch of the interface follows this list). Each endpoint corresponds to a registered device plugin; it is responsible for the gRPC communication with the device plugin and caches the device states the plugin reports.

  • server: the gRPC server that exposes the Register service.

  • activePods: used to get all active pods on the node, that is, pods in a non-terminated state. In the kubelet's initializeRuntimeDependentModules, the ActivePodsFunc is registered as the following function:

    	// GetActivePods returns non-terminal pods
    	func (kl *Kubelet) GetActivePods() []*v1.Pod {
    		allPods := kl.podManager.GetPods()
    		activePods := kl.filterOutTerminatedPods(allPods)
    		return activePods
    	}
    
  • callback: the callback function invoked when the kubelet receives a device state change over a device plugin's ListAndWatch gRPC stream, covering new device additions, old device deletions, and device state changes. Through the ListAndWatch interface and this callback, devices can be discovered automatically and hot-plugged.

    	type monitorCallback func(resourceName string, added, updated, deleted []pluginapi.Device)
    
    
  • healthyDevices: map object, the key is the Resource Name, and the value is the corresponding healthy device IDs.

  • unhealthyDevices: map object, the key is the Resource Name, and the value is the corresponding unhealthy device IDs.

  • allocatedDevices: map object, the key is the Resource Name, and the value is the device IDs that have been allocated.

  • podDevices: Records the device allocation for each container in each pod.

    	// ContainerAllocateResponse is the allocation information for a device in a container, including injected environment variables, mounts, device files, and annotations.
    	type ContainerAllocateResponse struct {
    		Envs map[string]string 
    		Mounts []*Mount 
    		Devices []*DeviceSpec 
    		Annotations map[string]string 
    	}
    
    	// deviceAllocateInfo
    	type deviceAllocateInfo struct {
    		deviceIds sets.String
    		allocResp *pluginapi.ContainerAllocateResponse
    	}
    
    	type resourceAllocateInfo map[string]deviceAllocateInfo // Keyed by resourceName.
    	type containerDevices map[string]resourceAllocateInfo   // Keyed by containerName.
    	type podDevices map[string]containerDevices             // Keyed by podUID.
    
  • store: the file store (/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint) for checkpointData, which records the devices allocated to each pod (PodDeviceEntries) as well as the registered Resource Names and their device IDs.

    	type checkpointData struct {
    		PodDeviceEntries  []podDevicesCheckpointEntry
    		RegisteredDevices map[string][]string // key is the Resource Name, value is the device IDs
    	}
    
    	type podDevicesCheckpointEntry struct {
    		PodUID        string
    		ContainerName string
    		ResourceName  string
    		DeviceIDs     []string
    		AllocResp     []byte
    	}
    


  • pluginOpts: map object; the key is the Resource Name and the value is DevicePluginOptions. Currently it contains only one field, PreStartRequired bool, which indicates whether the device plugin's PreStartContainer interface must be called before the container starts. In nvidia-k8s-plugin, PreStartContainer is an empty implementation.
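
For reference, the endpoint interface mentioned above (each registered device plugin is wrapped in one) looks roughly as follows; this is a sketch based on the method list, and exact signatures may vary slightly between versions:

// Sketch of the endpoint interface from pkg/kubelet/cm/devicemanager/endpoint.go
// (method set as listed above; signatures are approximate for this version).
type endpoint interface {
	run()
	stop()
	allocate(devs []string) (*pluginapi.AllocateResponse, error)
	preStartContainer(devs []string) (*pluginapi.PreStartContainerResponse, error)
	getDevices() []pluginapi.Device
	callback(resourceName string, added, updated, deleted []pluginapi.Device)
	isStopped() bool
	stopGracePeriodExpired() bool
}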

NewManagerImpl

Let's take a look at NewManagerImpl, where the Device Manager is actually created.

pkg/kubelet/cm/devicemanager/manager.go:97

// NewManagerImpl creates a new manager.
func NewManagerImpl() (*ManagerImpl, error) {

	// Interact with device plugins through /var/lib/kubelet/device-plugins/kubelet.sock
	return newManagerImpl(pluginapi.KubeletSocket)
}

func newManagerImpl(socketPath string) (*ManagerImpl, error) {
	glog.V(2).Infof("Creating Device Plugin manager at %s", socketPath)

	if socketPath == "" || !filepath.IsAbs(socketPath) {
		return nil, fmt.Errorf(errBadSocket+" %v", socketPath)
	}

	dir, file := filepath.Split(socketPath)
	manager := &ManagerImpl{
		endpoints:        make(map[string]endpoint),
		socketname:       file,
		socketdir:        dir,
		healthyDevices:   make(map[string]sets.String),
		unhealthyDevices: make(map[string]sets.String),
		allocatedDevices: make(map[string]sets.String),
		pluginOpts:       make(map[string]*pluginapi.DevicePluginOptions),
		podDevices:       make(podDevices),
	}
	manager.callback = manager.genericDeviceUpdateCallback

	// The following structs are populated with real implementations in manager.Start()
	// Before that, initializes them to perform no-op operations.
	manager.activePods = func() []*v1.Pod { return []*v1.Pod{} }
	manager.sourcesReady = &sourcesReadyStub{}
	var err error
	
	// Create a file-store-backed key-value file kubelet_internal_checkpoint under /var/lib/kubelet/device-plugins/, used as the checkpoint for the kubelet's device plugins.
	manager.store, err = utilstore.NewFileStore(dir, utilfs.DefaultFs{})
	if err != nil {
		return nil, fmt.Errorf("failed to initialize device plugin checkpointing store: %+v", err)
	}

	return manager, nil
}
  • The kubelet Device Manager interacts with device plugins through /var/lib/kubelet/device-plugins/kubelet.sock.
  • The registered callback genericDeviceUpdateCallback handles the add, delete, and update events of the corresponding devices.
  • A file-store-backed key-value file, kubelet_internal_checkpoint, is created under /var/lib/kubelet/device-plugins/ and used as the checkpoint for the kubelet's device plugins (a sketch of the checkpoint write follows this list).
    • When a devices add/delete/update event is observed, it is written to the kubelet_internal_checkpoint file.
    • When a device plugin has been stopped for longer than the grace period (hard-coded to 5 minutes, not configurable), its devices are deleted from the checkpoint. Within that window, the Device Manager keeps caching the endpoint and its devices.
    • After devices are allocated to a container, the PodDevices are also written to the checkpoint.
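
The following is a condensed sketch of the checkpoint write path, assuming the checkpointData layout shown earlier; details such as helper names and the checkpoint key constant are approximate for this version:

func (m *ManagerImpl) writeCheckpoint() error {
	m.mutex.Lock()
	data := checkpointData{
		// Snapshot of pod -> container -> resource -> deviceIDs (+ AllocResp).
		PodDeviceEntries:  m.podDevices.toCheckpointData(),
		RegisteredDevices: make(map[string][]string),
	}
	for resource, devices := range m.healthyDevices {
		data.RegisteredDevices[resource] = devices.UnsortedList()
	}
	m.mutex.Unlock()

	dataJSON, err := json.Marshal(data)
	if err != nil {
		return err
	}
	// Persisted as kubelet_internal_checkpoint under /var/lib/kubelet/device-plugins/.
	return m.store.Write("kubelet_internal_checkpoint", dataJSON)
}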

Let's take a look at the implementation of the callback genericDeviceUpdateCallback to understand how the Device Manager handles add/delete/update events for devices.

pkg/kubelet/cm/devicemanager/manager.go:134

func (m *ManagerImpl) genericDeviceUpdateCallback(resourceName string, added, updated, deleted []pluginapi.Device) {
	kept := append(updated, added...)
	m.mutex.Lock()
	if _, ok := m.healthyDevices[resourceName]; !ok {
		m.healthyDevices[resourceName] = sets.NewString()
	}
	if _, ok := m.unhealthyDevices[resourceName]; !ok {
		m.unhealthyDevices[resourceName] = sets.NewString()
	}
	for _, dev := range kept {
		if dev.Health == pluginapi.Healthy {
			m.healthyDevices[resourceName].Insert(dev.ID)
			m.unhealthyDevices[resourceName].Delete(dev.ID)
		} else {
			m.unhealthyDevices[resourceName].Insert(dev.ID)
			m.healthyDevices[resourceName].Delete(dev.ID)
		}
	}
	for _, dev := range deleted {
		m.healthyDevices[resourceName].Delete(dev.ID)
		m.unhealthyDevices[resourceName].Delete(dev.ID)
	}
	m.mutex.Unlock()
	m.writeCheckpoint()
}
  • If a device received in the callback is Healthy, its device ID is inserted into healthyDevices of ManagerImpl and removed from unhealthyDevices.
  • If a device received in the callback is Unhealthy, its device ID is inserted into unhealthyDevices of ManagerImpl and removed from healthyDevices.
  • Devices that the device plugin reports as deleted are removed from both healthyDevices and unhealthyDevices.
  • Finally, the data in ManagerImpl is written to the checkpoint file.

Startup of Device Manager

The previous section analyzed how the Device Manager is created, including its checkpoint and callback mechanisms. Next, we continue with the Device Manager's Start process.

Start Device Manager

The Device Manager is started in the Start method of containerManagerImpl.

pkg/kubelet/cm/container_manager_linux.go:527

func (cm *containerManagerImpl) Start(node *v1.Node,
	activePods ActivePodsFunc,
	sourcesReady config.SourcesReady,
	podStatusProvider status.PodStatusProvider,
	runtimeService internalapi.RuntimeService) error {

	...
	
	// Starts device manager.
	if err := cm.deviceManager.Start(devicemanager.ActivePodsFunc(activePods), sourcesReady); err != nil {
		return err
	}

	return nil
}
  • The first parameter of deviceManager.Start is the function for getting the active (non-terminated) pods on this node.
  • sourcesReady is used to track the readiness of the pod sources configured for the kubelet; before Start it is only a stub (see the sketch after this list). The sources include:
    • file: create static pods from static files.
    • http: get pod specs through an HTTP endpoint.
    • api: get pod specs from the Kubernetes API Server; this is the default mechanism in Kubernetes.
    • *: stands for all of the above source types.
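
For reference, until Start injects the real config.SourcesReady, the Device Manager runs with the stub set in newManagerImpl above. A minimal sketch of that stub, which reports ready unconditionally:

type sourcesReadyStub struct{}

// AddSource ignores the source; there is nothing to track in the stub.
func (s *sourcesReadyStub) AddSource(source string) {}

// AllReady always returns true, so inactive pods can be purged from the
// checkpointed state without waiting on source readiness.
func (s *sourcesReadyStub) AllReady() bool { return true }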

ManagerImpl Start

ManagerImpl.Start is responsible for starting the Device Manager and exposing its gRPC service.

pkg/kubelet/cm/devicemanager/manager.go:204

// Start starts the Device Plugin Manager, initializes podDevices and
// allocatedDevices information from checkpointed state, and starts the
// device plugin registration service.
func (m *ManagerImpl) Start(activePods ActivePodsFunc, sourcesReady config.SourcesReady) error {

	m.activePods = activePods
	m.sourcesReady = sourcesReady

	// Loads in allocatedDevices information from disk.
	err := m.readCheckpoint()
	...

	socketPath := filepath.Join(m.socketdir, m.socketname)
	os.MkdirAll(m.socketdir, 0755)

	// Removes all stale sockets in m.socketdir. Device plugins can monitor
	// this and use it as a signal to re-register with the new Kubelet.
	if err := m.removeContents(m.socketdir); err != nil {
		glog.Errorf("Fail to clean up stale contents under %s: %+v", m.socketdir, err)
	}

	s, err := net.Listen("unix", socketPath)
	if err != nil {
		glog.Errorf(errListenSocket+" %+v", err)
		return err
	}

	m.server = grpc.NewServer([]grpc.ServerOption{}...)

	pluginapi.RegisterRegistrationServer(m.server, m)
	go m.server.Serve(s)

	glog.V(2).Infof("Serving device plugin registration server on %q", socketPath)

	return nil
}
  • First, read the data in the checkpoint file and restore the relevant state of ManagerImpl (a condensed sketch of readCheckpoint follows this list), including:
    • podDevices;
    • allocatedDevices;
    • healthyDevices;
    • unhealthyDevices;
    • endpoints; note that each restored endpoint's stop time is set to the current time, which means that after a kubelet restart the device plugin must re-register before its resources are considered available again.
  • Then clear everything under /var/lib/kubelet/device-plugins/, except of course the checkpoint file; that is, remove all socket files, including the kubelet's own kubelet.sock and the socket files left by previously registered device plugins. Device plugins watch the kubelet.sock file; if it is deleted, they trigger their own re-registration with the kubelet.
  • Create kubelet.sock and start the gRPC server. Currently only the Register service is registered, which device plugins call to register themselves with the kubelet.
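
A condensed sketch of readCheckpoint, assuming the checkpointData layout shown earlier; details are approximate for this version:

func (m *ManagerImpl) readCheckpoint() error {
	content, err := m.store.Read("kubelet_internal_checkpoint")
	if err != nil {
		if err == utilstore.ErrKeyNotFound {
			return nil // no checkpoint yet: first start
		}
		return fmt.Errorf("failed to read checkpoint file: %v", err)
	}
	var data checkpointData
	if err := json.Unmarshal(content, &data); err != nil {
		return fmt.Errorf("failed to unmarshal checkpoint data: %v", err)
	}

	m.mutex.Lock()
	defer m.mutex.Unlock()
	// Restore pod -> container -> resource -> deviceIDs and the derived
	// allocatedDevices set.
	m.podDevices.fromCheckpointData(data.PodDeviceEntries)
	m.allocatedDevices = m.podDevices.devices()
	for resource, devices := range data.RegisteredDevices {
		m.healthyDevices[resource] = sets.NewString()
		for _, dev := range devices {
			m.healthyDevices[resource].Insert(dev)
		}
		// The endpoint restored here is created already stopped (stopTime set
		// to now), so the resource stays unavailable until the plugin re-registers.
	}
	return nil
}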

Register service

Let's take a look at Register, the only gRPC interface provided by kubelet Device Manager.

Register

pkg/kubelet/cm/devicemanager/manager.go:289

// Register registers a device plugin.
func (m *ManagerImpl) Register(ctx context.Context, r *pluginapi.RegisterRequest) (*pluginapi.Empty, error) {
	glog.Infof("Got registration request from device plugin with resource name %q", r.ResourceName)
	metrics.DevicePluginRegistrationCount.WithLabelValues(r.ResourceName).Inc()
	var versionCompatible bool
	for _, v := range pluginapi.SupportedVersions {
		if r.Version == v {
			versionCompatible = true
			break
		}
	}
	if !versionCompatible {
		errorString := fmt.Sprintf(errUnsupportedVersion, r.Version, pluginapi.SupportedVersions)
		glog.Infof("Bad registration request from device plugin with resource name %q: %v", r.ResourceName, errorString)
		return &pluginapi.Empty{}, fmt.Errorf(errorString)
	}

	if !v1helper.IsExtendedResourceName(v1.ResourceName(r.ResourceName)) {
		errorString := fmt.Sprintf(errInvalidResourceName, r.ResourceName)
		glog.Infof("Bad registration request from device plugin: %v", errorString)
		return &pluginapi.Empty{}, fmt.Errorf(errorString)
	}

	// TODO: for now, always accepts newest device plugin. Later may consider to
	// add some policies here, e.g., verify whether an old device plugin with the
	// same resource name is still alive to determine whether we want to accept
	// the new registration.
	go m.addEndpoint(r)

	return &pluginapi.Empty{}, nil
}
  • The registration request is sent by the device plugin to the kubelet; the RegisterRequest is:

    	type RegisterRequest struct {
    		Version string  // the device plugin API version for Kubernetes 1.10 is v1beta1
    		Endpoint string // the socket name of this device plugin
    		ResourceName string 
    		Options *DevicePluginOptions 
    	}
    
  • Here, the kubelet checks whether the registered Resource Name conforms to the Extended Resource rules (a simplified sketch of the check follows this list):

    • The Resource Name must not be in the kubernetes.io domain; it must have its own domain, such as nvidia.com.
    • The Resource Name must not carry the requests. prefix.
    • The corresponding resource value can only be an integer.
  • Call addEndpoint for plugin registration.
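
A simplified sketch of the Extended Resource naming check; the authoritative implementation is v1helper.IsExtendedResourceName, and this stand-in ignores some corner cases:

// isExtendedResourceName illustrates the naming rules above; it is a
// simplified stand-in for v1helper.IsExtendedResourceName.
func isExtendedResourceName(name v1.ResourceName) bool {
	n := string(name)
	// Must have its own domain (e.g. nvidia.com/gpu): names without a domain,
	// or in the kubernetes.io domain, are native resources.
	if !strings.Contains(n, "/") || strings.HasPrefix(n, "kubernetes.io/") {
		return false
	}
	// Must not carry the quota "requests." prefix.
	if strings.HasPrefix(n, "requests.") {
		return false
	}
	return true
}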

addEndpoint for device plugin registration

As can be seen from the Register method above, the logic of the real plugin registration is implemented in addEndpoint.

pkg/kubelet/cm/devicemanager/manager.go:332

func (m *ManagerImpl) addEndpoint(r *pluginapi.RegisterRequest) {
	existingDevs := make(map[string]pluginapi.Device)
	m.mutex.Lock()
	old, ok := m.endpoints[r.ResourceName]
	if ok && old != nil {
		// Pass devices of previous endpoint into re-registered one,
		// to avoid potential orphaned devices upon re-registration
		devices := make(map[string]pluginapi.Device)
		for _, device := range old.getDevices() {
			devices[device.ID] = device
		}
		existingDevs = devices
	}
	m.mutex.Unlock()

	socketPath := filepath.Join(m.socketdir, r.Endpoint)
	e, err := newEndpointImpl(socketPath, r.ResourceName, existingDevs, m.callback)
	if err != nil {
		glog.Errorf("Failed to dial device plugin with request %v: %v", r, err)
		return
	}
	m.mutex.Lock()
	if r.Options != nil {
		m.pluginOpts[r.ResourceName] = r.Options
	}
	// Check for potential re-registration during the initialization of new endpoint,
	// and skip updating if re-registration happens.
	// TODO: simplify the part once we have a better way to handle registered devices
	ext := m.endpoints[r.ResourceName]
	if ext != old {
		glog.Warningf("Some other endpoint %v is added while endpoint %v is initialized", ext, e)
		m.mutex.Unlock()
		e.stop()
		return
	}
	// Associates the newly created endpoint with the corresponding resource name.
	// Stops existing endpoint if there is any.
	m.endpoints[r.ResourceName] = e
	glog.V(2).Infof("Registered endpoint %v", e)
	m.mutex.Unlock()

	if old != nil {
		old.stop()
	}

	go func() {
		e.run()
		e.stop()
		m.mutex.Lock()
		if old, ok := m.endpoints[r.ResourceName]; ok && old == e {
			m.markResourceUnhealthy(r.ResourceName)
		}
		glog.V(2).Infof("Unregistered endpoint %v", e)
		m.mutex.Unlock()
	}()
}
  • First check whether this device plugin has already been registered; if so, fetch the devices cached by the old endpoint.

  • Then check whether the device plugin's socket can be dialed successfully. If the dial fails, the device plugin has not started correctly. If it succeeds, a new endpoint is initialized with the cached devices. The definition of endpointImpl is as follows:

    	type endpointImpl struct {
    		client     pluginapi.DevicePluginClient
    		clientConn *grpc.ClientConn
    
    		socketPath   string
    		resourceName string
    		stopTime     time.Time
    
    		devices map[string]pluginapi.Device
    		mutex   sync.Mutex
    
    		cb monitorCallback
    	}
    
  • To guard against the device plugin re-registering while the new endpointImpl is being initialized, the endpoint cached for this plugin is fetched again after initialization and compared with the endpoint object observed before initialization:

    • If they are not the same object, a re-registration happened during initialization; the new endpoint's stop is invoked, closing its gRPC connection and setting its stopTime to the current time, and the Register flow ends in failure.
    • Otherwise, continue with the following flow.
  • If the device plugin had been registered before, the old endpoint's stop is called, closing its gRPC connection and setting its stopTime to the current time, before the new endpoint's run() is started.

  • Then a goroutine is started to execute the endpoint's run() (a condensed sketch follows this list). In the run method:

    • The device plugin's ListAndWatch gRPC interface is called, and the ListAndWatch stream is consumed continuously over a long-lived connection.
    • The devices received from the stream are compared with the devices cached in the endpoint to derive the devices that need add/delete/update.
    • The endpoint's callback (that is, the genericDeviceUpdateCallback registered by ManagerImpl) is then invoked to update the Device Manager's cache and write it to the checkpoint file.
  • When a ListAndWatch error occurs on the gRPC connection to the device plugin, run() breaks out of its receive loop and returns; the endpoint's stop is then called, closing the gRPC connection and setting the endpoint's stopTime to the current time.

  • After stop, all devices of this device plugin are marked unhealthy: the resource's healthyDevices set is emptied and all previously healthy devices are added to unhealthyDevices, which effectively unregisters the plugin's resources until it registers again.
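
A condensed sketch of the endpoint's run() loop (pkg/kubelet/cm/devicemanager/endpoint.go; simplified, with logging and some bookkeeping trimmed):

func (e *endpointImpl) run() {
	// Long-lived ListAndWatch stream to the device plugin.
	stream, err := e.client.ListAndWatch(context.Background(), &pluginapi.Empty{})
	if err != nil {
		return
	}
	devices := make(map[string]pluginapi.Device)
	e.mutex.Lock()
	for _, d := range e.devices {
		devices[d.ID] = d
	}
	e.mutex.Unlock()

	for {
		response, err := stream.Recv()
		if err != nil {
			// Stream broken: run() returns, and the caller stops the endpoint.
			return
		}
		newDevs := make(map[string]*pluginapi.Device)
		var added, updated []pluginapi.Device
		for _, d := range response.Devices {
			newDevs[d.ID] = d
			old, ok := devices[d.ID]
			if !ok {
				devices[d.ID] = *d
				added = append(added, *d)
			} else if old.Health != d.Health {
				devices[d.ID] = *d
				updated = append(updated, *d)
			}
		}
		var deleted []pluginapi.Device
		for id, d := range devices {
			if _, ok := newDevs[id]; !ok {
				deleted = append(deleted, d)
				delete(devices, id)
			}
		}
		e.mutex.Lock()
		e.devices = devices
		e.mutex.Unlock()
		// genericDeviceUpdateCallback updates the manager cache and checkpoint.
		e.callback(e.resourceName, added, updated, deleted)
	}
}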

Call the Allocate interface of Device Plugin

Register UpdatePluginResources as Pod Admit Handler

The kubelet registers a series of Pod Admit Handlers in NewMainKubelet. When a pod is to be created, these Pod Admit Handlers are invoked first. Among them is klet.containerManager.UpdatePluginResources, through which the kubelet Device Manager allocates devices for the pod.

pkg/kubelet/kubelet.go:893

func NewMainKubelet( ... ) (*Kubelet, error) {
	...
	
	klet.admitHandlers.AddPodAdmitHandler(lifecycle.NewPredicateAdmitHandler(klet.getNodeAnyWay, criticalPodAdmissionHandler, klet.containerManager.UpdatePluginResources))
	
	...
}
	
pkg/kubelet/cm/container_manager_linux.go:618

func (cm *containerManagerImpl) UpdatePluginResources(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {
	return cm.deviceManager.Allocate(node, attrs)
}

Allocate

Before creating a pod, the kubelet invokes the Device Manager's Allocate method to allocate the devices requested by each container in the pod. The kubelet forwards each request to the Allocate method of the corresponding endpoint, which in turn is handled by the corresponding device plugin.

pkg/kubelet/cm/devicemanager/manager.go:259

func (m *ManagerImpl) Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {
	pod := attrs.Pod
	devicesToReuse := make(map[string]sets.String)
	// TODO: Reuse devices between init containers and regular containers.
	for _, container := range pod.Spec.InitContainers {
		if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
			return err
		}
		m.podDevices.addContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
	}
	for _, container := range pod.Spec.Containers {
		if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
			return err
		}
		m.podDevices.removeContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
	}

	m.mutex.Lock()
	defer m.mutex.Unlock()

	// quick return if no pluginResources requested
	if _, podRequireDevicePluginResource := m.podDevices[string(pod.UID)]; !podRequireDevicePluginResource {
		return nil
	}

	m.sanitizeNodeAllocatable(node)
	return nil
}
  • Call allocateContainerResources to allocate devices to the init containers in the pod, and update the podDevices cache in ManagerImpl.
  • Call allocateContainerResources to allocate devices to the regular containers in the pod, and update the podDevices cache in ManagerImpl.
  • Call sanitizeNodeAllocatable to update the Allocatable amounts of the corresponding Resource Names on the node in the scheduler cache (a sketch follows this list).
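
A sketch of sanitizeNodeAllocatable, which raises the Allocatable amount of each device plugin resource in the scheduler cache to at least what has already been allocated; the exact schedulercache API is approximate for this version:

func (m *ManagerImpl) sanitizeNodeAllocatable(node *schedulercache.NodeInfo) {
	var newAllocatableResource *schedulercache.Resource
	allocatableResource := node.AllocatableResource()
	if allocatableResource.ScalarResources == nil {
		allocatableResource.ScalarResources = make(map[v1.ResourceName]int64)
	}
	for resource, devices := range m.allocatedDevices {
		needed := devices.Len()
		quant, ok := allocatableResource.ScalarResources[v1.ResourceName(resource)]
		if ok && int(quant) >= needed {
			continue
		}
		// The cached Allocatable under-reports this resource (e.g. right after
		// a kubelet restart, before the plugin re-registers): bump it so the
		// admit check does not fail for devices that are in fact in use.
		if newAllocatableResource == nil {
			newAllocatableResource = allocatableResource.Clone()
		}
		newAllocatableResource.ScalarResources[v1.ResourceName(resource)] = int64(needed)
	}
	if newAllocatableResource != nil {
		node.SetAllocatableResource(newAllocatableResource)
	}
}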

allocateContainerResources

pkg/kubelet/cm/devicemanager/manager.go:608

func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
	podUID := string(pod.UID)
	contName := container.Name
	allocatedDevicesUpdated := false
	// Extended resources are not allowed to be overcommitted.
	// Since device plugin advertises extended resources,
	// therefore Requests must be equal to Limits and iterating
	// over the Limits should be sufficient.
	for k, v := range container.Resources.Limits {
		resource := string(k)
		needed := int(v.Value())
		glog.V(3).Infof("needs %d %s", needed, resource)
		if !m.isDevicePluginResource(resource) {
			continue
		}
		// Updates allocatedDevices to garbage collect any stranded resources
		// before doing the device plugin allocation.
		if !allocatedDevicesUpdated {
			m.updateAllocatedDevices(m.activePods())
			allocatedDevicesUpdated = true
		}
		allocDevices, err := m.devicesToAllocate(podUID, contName, resource, needed, devicesToReuse[resource])
		if err != nil {
			return err
		}
		if allocDevices == nil || len(allocDevices) <= 0 {
			continue
		}

		startRPCTime := time.Now()
		
		m.mutex.Lock()
		e, ok := m.endpoints[resource]
		m.mutex.Unlock()
		if !ok {
			m.mutex.Lock()
			m.allocatedDevices = m.podDevices.devices()
			m.mutex.Unlock()
			return fmt.Errorf("Unknown Device Plugin %s", resource)
		}

		devs := allocDevices.UnsortedList()
		
		glog.V(3).Infof("Making allocation request for devices %v for device plugin %s", devs, resource)
		resp, err := e.allocate(devs)
		metrics.DevicePluginAllocationLatency.WithLabelValues(resource).Observe(metrics.SinceInMicroseconds(startRPCTime))
		if err != nil {
			m.mutex.Lock()
			m.allocatedDevices = m.podDevices.devices()
			m.mutex.Unlock()
			return err
		}

		// Update internal cached podDevices state.
		m.mutex.Lock()
		m.podDevices.insert(podUID, contName, resource, allocDevices, resp.ContainerResponses[0])
		m.mutex.Unlock()
	}

	// Checkpoints device to container allocation information.
	return m.writeCheckpoint()
}
  • The resources provided by device plugins are Kubernetes Extended Resources, which may not be overcommitted: requests must equal limits, so iterating over the limits is sufficient.
  • Each time before allocating devices to a pod, the active pods are compared with the pods in the podDevices cache, and the devices of terminated pods are removed from podDevices; in other words, a devices GC is performed first (updateAllocatedDevices).
  • The required number of devices is then picked, effectively at random, from healthyDevices; allocatedDevices must be updated at the same time, otherwise a device could be assigned to multiple pods (see the devicesToAllocate sketch after the note below).
  • With the devices picked, the endpoint's allocate method is called (which in turn calls the device plugin's Allocate gRPC service), and the device plugin returns a ContainerAllocateResponse (injected environment variables, mount information, device files, and Annotations).
  • The podDevices cache is updated, and the cached data in ManagerImpl is written to the checkpoint file.

Thinking: when an init container finishes, are its allocated devices released? Currently they are not: before Allocate, only the devices of terminated pods are reclaimed; the devices of finished init containers are not. Optimizing this is relatively simple: extend the logic of the updateAllocatedDevices method in the code above with reclaim logic for init container devices. So with the current version it is best not to request devices in init containers, although that scenario hardly exists.
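
A condensed sketch of devicesToAllocate, the allocation core referenced above (simplified; logging and some checks trimmed):

func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, required int, reusableDevices sets.String) (sets.String, error) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	needed := required
	// If the container restarts, its devices were already allocated: reuse them.
	if devices := m.podDevices.containerDevices(podUID, contName, resource); devices != nil {
		if needed != devices.Len() {
			return nil, fmt.Errorf("pod %q container %q changed request for %q from %d to %d",
				podUID, contName, resource, devices.Len(), required)
		}
		return nil, nil // nothing new to allocate
	}
	devices := sets.NewString()
	// Allocate from the reusable set first.
	for device := range reusableDevices {
		devices.Insert(device)
		needed--
		if needed == 0 {
			return devices, nil
		}
	}
	if _, ok := m.healthyDevices[resource]; !ok {
		return nil, fmt.Errorf("can't allocate unregistered device %s", resource)
	}
	if m.allocatedDevices[resource] == nil {
		m.allocatedDevices[resource] = sets.NewString()
	}
	// Pick the rest from healthy devices not currently in use; UnsortedList
	// makes the choice effectively random.
	available := m.healthyDevices[resource].Difference(m.allocatedDevices[resource])
	if available.Len() < needed {
		return nil, fmt.Errorf("requested number of devices unavailable for %s. Requested: %d, Available: %d",
			resource, needed, available.Len())
	}
	for _, device := range available.UnsortedList()[:needed] {
		// Record immediately so the same device cannot go to another pod.
		m.allocatedDevices[resource].Insert(device)
		devices.Insert(device)
	}
	return devices, nil
}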

Maintain Resource Capacity managed by Device Plugin in NodeStatus

When the kubelet updates the node status, it will call GetCapacity to update the Resource information corresponding to the device plugins.

pkg/kubelet/kubelet_node_status.go:599

func (kl *Kubelet) setNodeStatusMachineInfo(node *v1.Node) {
	...
	devicePluginCapacity, devicePluginAllocatable, removedDevicePlugins = kl.containerManager.GetDevicePluginResourceCapacity()
	...
}	


pkg/kubelet/cm/container_manager_linux.go:881

func (cm *containerManagerImpl) GetDevicePluginResourceCapacity() (v1.ResourceList, v1.ResourceList, []string) {
	return cm.deviceManager.GetCapacity()
}

The following is the concrete implementation of GetCapacity; the logic is straightforward:

  • Check whether the endpoint corresponding to each resource in healthyDevices has been deleted from the cache or has been stopped for more than 5 minutes. If either condition holds, delete those devices from the endpoints and healthyDevices caches.
  • Check whether the endpoint corresponding to each resource in unhealthyDevices has been deleted from the cache or has been stopped for more than 5 minutes. If either condition holds, delete those devices from the endpoints and unhealthyDevices caches.
  • If the cache changed, write it to the checkpoint file.

pkg/kubelet/cm/devicemanager/manager.go:414

func (m *ManagerImpl) GetCapacity() (v1.ResourceList, v1.ResourceList, []string) {
	needsUpdateCheckpoint := false
	var capacity = v1.ResourceList{}
	var allocatable = v1.ResourceList{}
	deletedResources := sets.NewString()
	m.mutex.Lock()
	for resourceName, devices := range m.healthyDevices {
		e, ok := m.endpoints[resourceName]
		if (ok && e.stopGracePeriodExpired()) || !ok {
		
			if !ok {
				glog.Errorf("unexpected: healthyDevices and endpoints are out of sync")
			}
			delete(m.endpoints, resourceName)
			delete(m.healthyDevices, resourceName)
			deletedResources.Insert(resourceName)
			needsUpdateCheckpoint = true
		} else {
			capacity[v1.ResourceName(resourceName)] = *resource.NewQuantity(int64(devices.Len()), resource.DecimalSI)
			allocatable[v1.ResourceName(resourceName)] = *resource.NewQuantity(int64(devices.Len()), resource.DecimalSI)
		}
	}
	for resourceName, devices := range m.unhealthyDevices {
		e, ok := m.endpoints[resourceName]
		if (ok && e.stopGracePeriodExpired()) || !ok {
			if !ok {
				glog.Errorf("unexpected: unhealthyDevices and endpoints are out of sync")
			}
			delete(m.endpoints, resourceName)
			delete(m.unhealthyDevices, resourceName)
			deletedResources.Insert(resourceName)
			needsUpdateCheckpoint = true
		} else {
			capacityCount := capacity[v1.ResourceName(resourceName)]
			unhealthyCount := *resource.NewQuantity(int64(devices.Len()), resource.DecimalSI)
			capacityCount.Add(unhealthyCount)
			capacity[v1.ResourceName(resourceName)] = capacityCount
		}
	}
	m.mutex.Unlock()
	if needsUpdateCheckpoint {
		m.writeCheckpoint()
	}
	return capacity, allocatable, deletedResources.UnsortedList()
}

GetCapacity updates NodeStatus with the following data:

  • registered device plugin resource Capacity
  • registered device plugin resource Allocatable
  • previously registered resources that are no longer active

Call the PreStartContainer interface of Device Plugin

In the kubelet's GetResources, the Device Manager's GetDeviceRunContainerOptions is called, and the returned options are merged into kubecontainer.RunContainerOptions. RunContainerOptions includes information such as Envs, Mounts, Devices, PortMappings, and Annotations.

pkg/kubelet/cm/container_manager_linux.go:601

// TODO: move the GetResources logic to PodContainerManager.
func (cm *containerManagerImpl) GetResources(pod *v1.Pod, container *v1.Container) (*kubecontainer.RunContainerOptions, error) {
	opts := &kubecontainer.RunContainerOptions{}
	// Allocate should already be called during predicateAdmitHandler.Admit(),
	// just try to fetch device runtime information from cached state here
	devOpts, err := cm.deviceManager.GetDeviceRunContainerOptions(pod, container)
	if err != nil {
		return nil, err
	} else if devOpts == nil {
		return opts, nil
	}
	opts.Devices = append(opts.Devices, devOpts.Devices...)
	opts.Mounts = append(opts.Mounts, devOpts.Mounts...)
	opts.Envs = append(opts.Envs, devOpts.Envs...)
	opts.Annotations = append(opts.Annotations, devOpts.Annotations...)
	return opts, nil
}
  • The Device Manager's GetDeviceRunContainerOptions decides whether to call the device plugin's PreStartContainer gRPC service based on whether PreStartRequired in pluginOpts is true.

Note: if a device plugin's PreStartRequired is true, the kubelet Device Manager calls the device plugin's PreStartContainer interface with a 30s timeout; that is, the PreStartContainer logic must complete and return within 30s.

pkg/kubelet/cm/devicemanager/manager.go:688

// GetDeviceRunContainerOptions checks whether we have cached containerDevices
// for the passed-in <pod, container> and returns its DeviceRunContainerOptions
// for the found one. An empty struct is returned in case no cached state is found.
func (m *ManagerImpl) GetDeviceRunContainerOptions(pod *v1.Pod, container *v1.Container) (*DeviceRunContainerOptions, error) {
	podUID := string(pod.UID)
	contName := container.Name
	for k := range container.Resources.Limits {
		resource := string(k)
		if !m.isDevicePluginResource(resource) {
			continue
		}
		err := m.callPreStartContainerIfNeeded(podUID, contName, resource)
		if err != nil {
			return nil, err
		}
	}
	m.mutex.Lock()
	defer m.mutex.Unlock()
	return m.podDevices.deviceRunContainerOptions(string(pod.UID), container.Name), nil
}
  • Then podDevices.deviceRunContainerOptions is responsible for packaging the container's Envs, mount points, device files, and Annotations (a condensed sketch follows).
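
A condensed sketch of podDevices.deviceRunContainerOptions, showing how the cached ContainerAllocateResponse is unpacked into run options (simplified; the real method also de-duplicates entries across resources):

func (pdev podDevices) deviceRunContainerOptions(podUID, contName string) *DeviceRunContainerOptions {
	containers, exists := pdev[podUID]
	if !exists {
		return nil
	}
	resources, exists := containers[contName]
	if !exists {
		return nil
	}
	opts := &DeviceRunContainerOptions{}
	// Each resource's cached AllocateResponse contributes env vars, device
	// files, mounts, and annotations for the container.
	for _, devInfo := range resources {
		resp := devInfo.allocResp
		for k, v := range resp.Envs {
			opts.Envs = append(opts.Envs, kubecontainer.EnvVar{Name: k, Value: v})
		}
		for _, dev := range resp.Devices {
			opts.Devices = append(opts.Devices, kubecontainer.DeviceInfo{
				PathOnHost:      dev.HostPath,
				PathInContainer: dev.ContainerPath,
				Permissions:     dev.Permissions,
			})
		}
		for _, mount := range resp.Mounts {
			opts.Mounts = append(opts.Mounts, kubecontainer.Mount{
				Name:          mount.ContainerPath,
				ContainerPath: mount.ContainerPath,
				HostPath:      mount.HostPath,
				ReadOnly:      mount.ReadOnly,
			})
		}
		for k, v := range resp.Annotations {
			opts.Annotations = append(opts.Annotations, kubecontainer.Annotation{Name: k, Value: v})
		}
	}
	return opts
}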

Summary

This article walked through the core code of the kubelet Device Manager to build a complete picture of its workflow. It also analyzed the kubelet's Register service and the kubelet's calls to the device plugin's Allocate interface, paying particular attention to the checkpoint mechanism of kubelet device plugins (/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint).
