How Kubernetes uses NVIDIA GPUs with Device Plugins

Device Plugins

Device Plugins, introduced in Kubernetes 1.8 and promoted to beta in Kubernetes 1.10, allow third-party device vendors to connect their device resources to Kubernetes through plugins and expose them to containers as Extended Resources.

With Device Plugins, users do not need to modify the Kubernetes code; third-party device vendors only need to develop plugins that implement the Kubernetes Device Plugin interfaces.

Among the Device Plugin implementations available today, the NVIDIA GPU Device Plugin (github.com/NVIDIA/k8s-device-plugin) is the one that gets the most attention, and it is the focus of this article.

When a Device Plugin starts, it exposes several gRPC services and registers itself with the kubelet through /var/lib/kubelet/device-plugins/kubelet.sock.

Device Plugins Registration

  • In versions before Kubernetes 1.10, DevicePlugins is disabled by default and must be enabled through the feature gate.

  • In Kubernetes 1.10, DevicePlugins is enabled by default and can be disabled through the feature gate.

  • When the DevicePlugins feature gate is enabled, the kubelet exposes a Register gRPC interface, and a Device Plugin completes device registration by calling it.

  • The Register interface is described as follows:

    	pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:440
    	type RegistrationServer interface {
    		Register(context.Context, *RegisterRequest) (*Empty, error)
    	}
    
    
    	pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:87
    	type RegisterRequest struct {
    		// Version of the API the Device Plugin was built against
    		Version string `protobuf:"bytes,1,opt,name=version,proto3" json:"version,omitempty"`
    		// Name of the unix socket the device plugin is listening on
    		// PATH = path.Join(DevicePluginPath, endpoint)
    		Endpoint string `protobuf:"bytes,2,opt,name=endpoint,proto3" json:"endpoint,omitempty"`
    		// Schedulable resource name. As of now it's expected to be a DNS Label
    		ResourceName string `protobuf:"bytes,3,opt,name=resource_name,json=resourceName,proto3" json:"resource_name,omitempty"`
    		// Options to be communicated with Device Manager
    		Options *DevicePluginOptions `protobuf:"bytes,4,opt,name=options" json:"options,omitempty"`
    	}
    
  • The parameters of RegisterRequest are as follows (a minimal registration sketch in Go follows this list):

    • Version: there are currently two versions, v1alpha and v1beta1.
    • Endpoint: the name of the socket exposed by the device plugin. During registration, the plugin's socket, derived from Endpoint, is placed in the /var/lib/kubelet/device-plugins/ directory; for the NVIDIA GPU Device Plugin this is /var/lib/kubelet/device-plugins/nvidia.sock.
    • ResourceName: must follow the Extended Resource naming scheme vendor-domain/resource, such as nvidia.com/gpu.
    • DevicePluginOptions: extra options passed along when the kubelet communicates with the device plugin.
      • For the NVIDIA GPU plugin, there is only one option, PreStartRequired, which indicates whether the Device Plugin's PreStartContainer interface (one of the Device Plugin interfaces in Kubernetes 1.10) should be called before each container starts; the default is false.

        	github.com/NVIDIA/k8s-device-plugin/server.go:80
        	func (m *NvidiaDevicePlugin) GetDevicePluginOptions(context.Context, *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
        		return &pluginapi.DevicePluginOptions{}, nil
        	}
        
        	vendor/k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:71
        	type DevicePluginOptions struct {
        		// Indicates if PreStartContainer call is required before each container start
        		PreStartRequired bool `protobuf:"varint,1,opt,name=pre_start_required,json=preStartRequired,proto3" json:"pre_start_required,omitempty"`
        	}
        
  • As mentioned earlier, the Device Plugin Interface currently has two versions, v1alpha and v1beta1. The interfaces corresponding to each version are as follows:

    • v1alpha :

      • /deviceplugin.Registration/Register

        	pkg/kubelet/apis/deviceplugin/v1alpha/api.pb.go:374
        	var _Registration_serviceDesc = grpc.ServiceDesc{
        		ServiceName: "deviceplugin.Registration",
        		HandlerType: (*RegistrationServer)(nil),
        		Methods: []grpc.MethodDesc{
        			{
        				MethodName: "Register",
        				Handler:    _Registration_Register_Handler,
        			},
        		},
        		Streams:  []grpc.StreamDesc{},
        		Metadata: "api.proto",
        	}
        
      • /deviceplugin.DevicePlugin/Allocate

      • /deviceplugin.DevicePlugin/ListAndWatch

        	pkg/kubelet/apis/deviceplugin/v1alpha/api.pb.go:505
        	var _DevicePlugin_serviceDesc = grpc.ServiceDesc{
        		ServiceName: "deviceplugin.DevicePlugin",
        		HandlerType: (*DevicePluginServer)(nil),
        		Methods: []grpc.MethodDesc{
        			{
        				MethodName: "Allocate",
        				Handler:    _DevicePlugin_Allocate_Handler,
        			},
        		},
        		Streams: []grpc.StreamDesc{
        			{
        				StreamName:    "ListAndWatch",
        				Handler:       _DevicePlugin_ListAndWatch_Handler,
        				ServerStreams: true,
        			},
        		},
        		Metadata: "api.proto",
        	}
        
    • v1beta1 :

      • /v1beta1.Registration/Register

        	pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:466
        	var _Registration_serviceDesc = grpc.ServiceDesc{
        		ServiceName: "v1beta1.Registration",
        		HandlerType: (*RegistrationServer)(nil),
        		Methods: []grpc.MethodDesc{
        			{
        				MethodName: "Register",
        				Handler:    _Registration_Register_Handler,
        			},
        		},
        		Streams:  []grpc.StreamDesc{},
        		Metadata: "api.proto",
        	}
        
      • /v1beta1.DevicePlugin/ListAndWatch

      • /v1beta1.DevicePlugin/Allocate

      • /v1beta1.DevicePlugin/PreStartContainer

      • /v1beta1.DevicePlugin/GetDevicePluginOptions

        	pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:665
        	var _DevicePlugin_serviceDesc = grpc.ServiceDesc{
        		ServiceName: "v1beta1.DevicePlugin",
        		HandlerType: (*DevicePluginServer)(nil),
        		Methods: []grpc.MethodDesc{
        			{
        				MethodName: "GetDevicePluginOptions",
        				Handler:    _DevicePlugin_GetDevicePluginOptions_Handler,
        			},
        			{
        				MethodName: "Allocate",
        				Handler:    _DevicePlugin_Allocate_Handler,
        			},
        			{
        				MethodName: "PreStartContainer",
        				Handler:    _DevicePlugin_PreStartContainer_Handler,
        			},
        		},
        		Streams: []grpc.StreamDesc{
        			{
        				StreamName:    "ListAndWatch",
        				Handler:       _DevicePlugin_ListAndWatch_Handler,
        				ServerStreams: true,
        			},
        		},
        		Metadata: "api.proto",
        	}
        
  • When the Device Plugin registers successfully, it sends the list of devices it manages to the kubelet through ListAndWatch. The kubelet then updates the corresponding node's status (persisted in etcd) through the API Server.

  • The user can then request the device in the Container spec, subject to the following restrictions:

    • Extended Resources only support requesting an integer number of devices; fractional requests are not supported.
    • Overcommitting is not supported, i.e. the Resource QoS can only be Guaranteed.
    • The same Device cannot be shared by multiple Containers.
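
To make the registration flow concrete, here is a minimal Go sketch of a plugin registering itself with the kubelet over the v1beta1 API. This is a sketch, not NVIDIA's actual code: it assumes the pluginapi package (k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1) and a gRPC version contemporary with Kubernetes 1.10, and in a real plugin the gRPC server listening on Endpoint must already be serving before Register is called.

package main

import (
	"log"
	"net"
	"time"

	"golang.org/x/net/context"
	"google.golang.org/grpc"
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// register dials the kubelet's registration socket and announces this plugin.
func register(endpoint, resourceName string) error {
	// pluginapi.KubeletSocket is /var/lib/kubelet/device-plugins/kubelet.sock.
	conn, err := grpc.Dial(pluginapi.KubeletSocket, grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}))
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pluginapi.NewRegistrationClient(conn)
	_, err = client.Register(context.Background(), &pluginapi.RegisterRequest{
		Version:      pluginapi.Version, // "v1beta1"
		Endpoint:     endpoint,          // socket name, e.g. "nvidia.sock"
		ResourceName: resourceName,      // e.g. "nvidia.com/gpu"
	})
	return err
}

func main() {
	if err := register("nvidia.sock", "nvidia.com/gpu"); err != nil {
		log.Fatalf("device plugin registration failed: %v", err)
	}
}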

Device Plugins Workflow

The workflow of Device Plugins is as follows:

  • Initialization : After the Device Plugin is started, some plugin-specific initialization work is performed to ensure that the corresponding Devices are in the Ready state. For Nvidia GPUs, the NVML Library is loaded.

  • Start the gRPC service : the plugin exposes its gRPC service externally through /var/lib/kubelet/device-plugins/${Endpoint}. As mentioned earlier, different API versions correspond to different service interfaces; each is described below.

    • v1alpha
      • ListAndWatch

      • Allocate

        	pkg/kubelet/apis/deviceplugin/v1alpha/api.proto
        	// DevicePlugin is the service advertised by Device Plugins
        	service DevicePlugin {
        		// ListAndWatch returns a stream of List of Devices
        		// Whenever a Device state changes or a Device disappears, ListAndWatch
        		// returns the new list
        		rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
        
        		// Allocate is called during container creation so that the Device
        		// Plugin can run device specific operations and instruct Kubelet
        		// of the steps to make the Device available in the container
        		rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
        	}
        
        
    • v1beta1
      • ListAndWatch

      • Allocate

      • GetDevicePluginOptions

      • PreStartContainer

        	pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
        	// DevicePlugin is the service advertised by Device Plugins
        	service DevicePlugin {
        		// GetDevicePluginOptions returns options to be communicated with Device
        	        // Manager
        		rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}
        
        		// ListAndWatch returns a stream of List of Devices
        		// Whenever a Device state change or a Device disapears, ListAndWatch
        		// returns the new list
        		rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
        
        		// Allocate is called during container creation so that the Device
        		// Plugin can run device specific operations and instruct Kubelet
        		// of the steps to make the Device available in the container
        		rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
        
            // PreStartContainer is called, if indicated by Device Plugin during registeration phase,
            // before each container start. Device plugin can run device specific operations
            // such as reseting the device before making devices available to the container
        		rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
        	}
        
  • Register with the kubelet : the Device Plugin registers itself with the kubelet through /var/lib/kubelet/device-plugins/kubelet.sock.

  • After registration succeeds, the Device Plugin enters Serving mode and provides the gRPC services described above. Below is a closer look at each v1beta1 interface (a minimal server sketch in Go follows this list):

    • ListAndWatch : monitors state changes and disappearance of the managed Devices, and returns a ListAndWatchResponse to the kubelet whenever the list changes. The ListAndWatchResponse carries the Device list.

      	type ListAndWatchResponse struct {
      		Devices []*Device `protobuf:"bytes,1,rep,name=devices" json:"devices,omitempty"`
      	}
      
      	type Device struct {
      		// A unique ID assigned by the device plugin used
      		// to identify devices during the communication
      		// Max length of this field is 63 characters
      		ID string `protobuf:"bytes,1,opt,name=ID,json=iD,proto3" json:"ID,omitempty"`
      		// Health of the device, can be healthy or unhealthy, see constants.go
      		Health string `protobuf:"bytes,2,opt,name=health,proto3" json:"health,omitempty"`
      	}
      

    Here is a sample Device for a GPU:

    Device {
        ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e",
        Health: "Healthy",
    }
    
    • Allocate : the Device Plugin performs device-specific operations and returns an AllocateResponse to the kubelet, which passes it on to dockerd; dockerd (via nvidia-docker) uses this information to attach the devices when creating the container. The Request and Response of this interface are described below.

      • Allocate is expected to be called during Pod creation, since an allocation failure for any container results in Pod startup failure.

      • Allocate allows the kubelet to expose additional artifacts in a Pod's environment, as directed by the plugin.

      • Allocate allows the Device Plugin to run device-specific operations on the requested Devices.

        	type AllocateRequest struct {
        		ContainerRequests []*ContainerAllocateRequest `protobuf:"bytes,1,rep,name=container_requests,json=containerRequests" json:"container_requests,omitempty"`
        	}
        
        	type ContainerAllocateRequest struct {
        		DevicesIDs []string `protobuf:"bytes,1,rep,name=devicesIDs" json:"devicesIDs,omitempty"`
        	}
        
        	// AllocateResponse includes the artifacts that needs to be injected into
        	// a container for accessing 'deviceIDs' that were mentioned as part of
        	// 'AllocateRequest'.
        	// Failure Handling:
        	// if Kubelet sends an allocation request for dev1 and dev2.
        	// Allocation on dev1 succeeds but allocation on dev2 fails.
        	// The Device plugin should send a ListAndWatch update and fail the
        	// Allocation request
        	type AllocateResponse struct {
        		ContainerResponses []*ContainerAllocateResponse `protobuf:"bytes,1,rep,name=container_responses,json=containerResponses" json:"container_responses,omitempty"`
        	}
        
        	type ContainerAllocateResponse struct {
        		// List of environment variable to be set in the container to access one of more devices.
        		Envs map[string]string `protobuf:"bytes,1,rep,name=envs" json:"envs,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
        		// Mounts for the container.
        		Mounts []*Mount `protobuf:"bytes,2,rep,name=mounts" json:"mounts,omitempty"`
        		// Devices for the container.
        		Devices []*DeviceSpec `protobuf:"bytes,3,rep,name=devices" json:"devices,omitempty"`
        		// Container annotations to pass to the container runtime
        		Annotations map[string]string `protobuf:"bytes,4,rep,name=annotations" json:"annotations,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
        	}
        
        	// DeviceSpec specifies a host device to mount into a container.
        	type DeviceSpec struct {
        		// Path of the device within the container.
        		ContainerPath string `protobuf:"bytes,1,opt,name=container_path,json=containerPath,proto3" json:"container_path,omitempty"`
        		// Path of the device on the host.
        		HostPath string `protobuf:"bytes,2,opt,name=host_path,json=hostPath,proto3" json:"host_path,omitempty"`
        		// Cgroups permissions of the device, candidates are one or more of
        		// * r - allows container to read from the specified device.
        		// * w - allows container to write to the specified device.
        		// * m - allows container to create device files that do not yet exist.
        		Permissions string `protobuf:"bytes,3,opt,name=permissions,proto3" json:"permissions,omitempty"`
        	}
        
      • AllocateRequest is a list of DeviceIDs.

      • The AllocateResponse includes the Envs that need to be injected into the Container, the mount information of the Devices (including the device's cgroup permissions), and custom Annotations.

    • PreStartContainer :

      • PreStartContainer is expected to be called before each container start if indicated by plugin during registration phase.

      • PreStartContainer allows kubelet to pass reinitialized devices to containers.

      • PreStartContainer allows Device Plugin to run device specific operations on the Devices requested.

        	type PreStartContainerRequest struct {
        		DevicesIDs []string `protobuf:"bytes,1,rep,name=devicesIDs" json:"devicesIDs,omitempty"`
        	}
        
        	// PreStartContainerResponse will be send by plugin in response to PreStartContainerRequest
        	type PreStartContainerResponse struct {
        	}
        
    • GetDevicePluginOptions : currently there is only one field, PreStartRequired.

      type DevicePluginOptions struct {
      	// Indicates if PreStartContainer call is required before each container start
      	PreStartRequired bool `protobuf:"varint,1,opt,name=pre_start_required,json=preStartRequired,proto3" json:"pre_start_required,omitempty"`
      }
      

Exception Handling

  • Every time the kubelet starts (or restarts), all socket files under /var/lib/kubelet/device-plugins are deleted.

  • The Device Plugin is responsible for detecting that its socket has been deleted, then re-registering and recreating its socket.

  • But what should a Device Plugin do when its socket is deleted by mistake?

Let's see how the NVIDIA Device Plugin handles this. The relevant code is as follows:

github.com/NVIDIA/k8s-device-plugin/main.go:15

func main() {
	...
	
	log.Println("Starting FS watcher.")
	watcher, err := newFSWatcher(pluginapi.DevicePluginPath)
	
    ...

	restart := true
	var devicePlugin *NvidiaDevicePlugin

L:
	for {
		if restart {
			if devicePlugin != nil {
				devicePlugin.Stop()
			}

			devicePlugin = NewNvidiaDevicePlugin()
			if err := devicePlugin.Serve(); err != nil {
				log.Println("Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?")
				log.Printf("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")
				log.Printf("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")
			} else {
				restart = false
			}
		}

		select {
		case event := <-watcher.Events:
			if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
				log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
				restart = true
			}

		case err := <-watcher.Errors:
			log.Printf("inotify: %s", err)

		case s := <-sigs:
			switch s {
			case syscall.SIGHUP:
				log.Println("Received SIGHUP, restarting.")
				restart = true
			default:
				log.Printf("Received signal \"%v\", shutting down.", s)
				devicePlugin.Stop()
				break L
			}
		}
	}
}	
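
main.go relies on a small helper, newFSWatcher, whose body is elided above, to set up the directory watch. A sketch of what it does (modeled on the plugin's watcher.go, assuming the fsnotify package):

import "github.com/fsnotify/fsnotify"

// newFSWatcher returns an fsnotify.Watcher watching the given directories.
func newFSWatcher(files ...string) (*fsnotify.Watcher, error) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, err
	}
	for _, f := range files {
		if err := watcher.Add(f); err != nil {
			watcher.Close()
			return nil, err
		}
	}
	return watcher, nil
}
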
  • An fsnotify.Watcher monitors the /var/lib/kubelet/device-plugins/ directory.
  • If the watcher's Events channel receives a Create event for kubelet.sock (indicating that the kubelet has restarted), a restart of the NVIDIA Device Plugin is triggered.
  • The restart logic is: first check whether the devicePlugin object is non-nil (meaning a previous instance finished initializing):
    • If non-nil, stop that instance's gRPC server first.
    • Then call NewNvidiaDevicePlugin() to build a new DevicePlugin instance.
    • Call Serve() to start the gRPC server and register with the kubelet again.

Therefore, only the Create event for kubelet.sock is watched. This handles kubelet restarts well, but the plugin does not watch for deletion of its own socket. If the NVIDIA Device Plugin's socket is deleted by mistake, the kubelet can no longer communicate with the node's NVIDIA Device Plugin, which means none of the Device Plugin gRPC interfaces can be called:

  • ListAndWatch can no longer report the node's device list and health, so Devices information cannot be synchronized.
  • Allocate can no longer be called, so container creation fails.

Therefore, it is recommended to also watch for the deletion event of the plugin's own socket; once deletion is detected, a restart should be triggered:

select {
    case event := <-watcher.Events:
    	if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
    		log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
    		restart = true
    	}
    	
    	// Also watch for the deletion of the plugin's own socket (e.g. nvidia.sock)
    	if event.Name == serverSocket && event.Op&fsnotify.Delete == fsnotify.Delete {
    		log.Printf("inotify: %s deleted, restarting.", serverSocket)
    		restart = true
    	}
    	
    	...
}

Extended Resources

  • A Device Plugin exposes host resources as Extended Resources. Kubernetes' built-in resources belong to the kubernetes.io domain, so Extended Resources must not be advertised under the kubernetes.io domain.

  • Node-level Extended Resource

    • Resources managed by Device plugin

    • Other resources

      • Submit a PATCH request to the API Server to add the new resource name and quantity to the node's status.capacity.
      • The kubelet periodically syncs the node's status.allocatable (which now includes the PATCHed resource) to the API Server. A Pod requesting the new resource is then matched by the scheduler against the node's status.allocatable in the PodFitsResources predicate.
      • Note: the kubelet's sync interval is configured by --node-status-update-frequency (default 10s), so in the worst case it can take about 10s after the PATCH before the scheduler can discover and use the resource.
      curl --header "Content-Type: application/json-patch+json" \
      --request PATCH \
      --data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "5"}]' \
      http://k8s-master:8080/api/v1/nodes/k8s-node-1/status
      

    Note: ~1 is the encoding for the character / in the patch path.

  • Cluster-level Extended Resources

    • Cluster-level Extended Resources are usually used together with a scheduler extender for quota management of those resources.

    • When a Pod requests such an extended resource, the default scheduler forwards the Pod to the corresponding scheduler extender for further scheduling.

    • If the ignoredByScheduler field is set to true, the default scheduler skips the PodFitsResources predicate check for that resource. It is usually set to true, because cluster-level resources are not tied to any particular node, so a node-resource check with PodFitsResources makes no sense.

      {
        "kind": "Policy",
        "apiVersion": "v1",
        "extenders": [
          {
            "urlPrefix":"<extender-endpoint>",
            "bindVerb": "bind",
            "ManagedResources": [
              {
                "name": "example.com/foo",
                "ignoredByScheduler": true
              }
            ]
          }
        ]
      }
      
  • The API Server restricts Extended Resource quantities to whole numbers, such as 2, 2000m, or 2Ki, but not 1.5 or 1500m.

  • Extended Resources configured in a Container's resources field must be Guaranteed QoS: either set only limits (requests then default to the same value), or explicitly set requests and limits to the same value, as in the example below.
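
For illustration, a container resources block that satisfies this rule for a hypothetical extended resource example.com/foo:

resources:
  requests:
    example.com/foo: 1
  limits:
    example.com/foo: 1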

Scheduling GPUs

https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

This section only discusses how GPUs are scheduled and used in Kubernetes 1.10.

Before Kubernetes 1.8, the official recommendation was to enable the alpha feature gate Accelerators and request GPUs through the alpha.kubernetes.io/nvidia-gpu resource, which also required mounting the host's nvidia libraries and driver into the container. For that setup, see my earlier post: How to use GPU for AI training in Kubernetes cluster.

  • Starting with Kubernetes 1.8, the official recommendation is to use Device Plugins for GPUs.
  • The NVIDIA driver must be pre-installed on the node, and the NVIDIA Device Plugin is best deployed as a DaemonSet. Once it is running, Kubernetes discovers the nvidia.com/gpu resource.
  • Because the device plugin exposes GPUs as Extended Resources, a container requesting GPUs must use Guaranteed-style resource settings (requests equal to limits).
  • Containers still cannot share a GPU card. A container may request multiple GPUs, but fractional GPUs are not supported.

In addition to the above, note the following when using the official NVIDIA device plugin:

  • nvidia-docker 2.0 must be pre-installed on the node, and nvidia must replace runc as Docker's default runtime (see the daemon.json sketch after the Pod example below).

  • On CentOS, install nvidia docker 2.0 as follows:

    	# Add the package repositories
    	distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    	curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
    	  sudo tee /etc/yum.repos.d/nvidia-docker.repo
    
    	# Install nvidia-docker2 and reload the Docker daemon configuration
    	sudo yum install -y nvidia-docker2
    	sudo pkill -SIGHUP dockerd
    
    	# Test nvidia-smi with the latest official CUDA image
    	docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
    
  • After the above work is completed, a Container can request GPU resources just like built-in resources:

    	apiVersion: v1
    	kind: Pod
    	metadata:
    	  name: cuda-vector-add
    	spec:
    	  restartPolicy: OnFailure
    	  containers:
    	    - name: cuda-vector-add
    	      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
    	      image: "k8s.gcr.io/cuda-vector-add:v0.1"
    	      resources:
    	        limits:
    	          nvidia.com/gpu: 2 # requesting 2 GPU
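
On the default-runtime point above: with nvidia-docker 2.0 this is typically configured in /etc/docker/daemon.json, roughly as follows (the runtime path may differ by distribution):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}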
    

Use NodeSelector to distinguish different models of GPU servers

If your cluster contains different types of GPU servers, such as NVIDIA Tesla K80, P100, and V100, and different training jobs need to match different GPU models, first add the corresponding label to the nodes:

# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100

Then specify the required GPU model in the Pod through a nodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.

Thinking: NodeSelector alone is not a great solution to this problem, because it requires every Pod to carry the right NodeSelector. For expensive, scarce cards such as the V100, you usually want to keep ordinary training jobs off those nodes and admit only specific workloads. In that case, add a corresponding Taint to the node and the matching Toleration to the Pods that need it (see the example below); combined with the NodeSelector, this meets the requirement nicely.
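
For example (a sketch; the taint key and value below simply mirror the accelerator label used above and are illustrative):

# Taint the V100 nodes so that only Pods tolerating the taint can land on them.
kubectl taint nodes <node-with-v100> accelerator=nvidia-tesla-v100:NoSchedule

And in the Pods that are allowed to use V100s:

tolerations:
- key: "accelerator"
  operator: "Equal"
  value: "nvidia-tesla-v100"
  effect: "NoSchedule"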

Deploy

  • It is recommended to deploy the Device Plugin as a DaemonSet, which makes failure recovery easy.
  • The Device Plugin Pod must run privileged to access /var/lib/kubelet/device-plugins.
  • The Device Plugin Pod must mount the host's /var/lib/kubelet/device-plugins directory (via hostPath) to the same path in the container.

Kubernetes 1.8

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: nvidia/k8s-device-plugin:1.8
        name: nvidia-device-plugin-ctr
        securityContext:
          privileged: true
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Kubernetes 1.10

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure.  This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.10
        name: nvidia-device-plugin-ctr
        securityContext:
          privileged: true
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins    

Kubernetes' handling of critical pods is getting more and more interesting; I'll find time to write a separate post discussing it in detail.

Device Plugins schematic

(Figure omitted: schematic of the Device Plugin registration and serving workflow.)

Summary

A few months ago, I analyzed how Kubernetes 1.8 uses GPUs in my post How to use GPUs for AI training in Kubernetes clusters. In Kubernetes 1.10, Device Plugins are the recommended way to use GPUs. This article analyzed the principles and working mechanism of Device Plugins, introduced Extended Resources, examined the NVIDIA Device Plugin's exception handling and a possible improvement, and showed how to use and schedule GPUs. In the next post, I will walk through the source code of NVIDIA/k8s-device-plugin and the kubelet's device plugin manager to dig into the interaction details between the kubelet and the NVIDIA device plugin.
