Manage Kubernetes GPU resources with Elastic GPU

Author

Xu Bei, Tencent Cloud container technology expert and lead for heterogeneous computing containers at Tencent Cloud, with many years of front-line cloud computing architecture design and R&D experience and a long-term focus on Kubernetes, offline co-location, and GPU containerization. He is the author of the Kubernetes Memory QoS KEP and an active Kubernetes contributor.

Current problems

With a large number of cores and high-speed memory, GPUs are good at parallel computing and are ideal for training and running machine learning models. As AI technology has matured in recent years and real-world use cases have multiplied, demand for GPUs has grown explosively. Meanwhile, Kubernetes has become the de facto standard for resource management and scheduling, so many customers choose to run GPU-based AI computing tasks in Kubernetes.

Kubernetes provides the device plugin mechanism, which lets a node discover and report device resources for Pods to use. GPU resources are provided in the same way. Taking NVIDIA GPUs as an example, the user deploys nvidia-device-plugin on a Kubernetes node; the plugin scans the node's GPU cards and registers them with the node through the extended resource mechanism in a form like nvidia.com/gpu: 8. The user requests this resource name when creating a Pod; after the scheduler binds the Pod to a node, the required GPU devices are finally mounted into the container by a set of tools provided by nvidia-docker.
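
For reference, here is a minimal sketch in Go, using the standard Kubernetes API types, of how a Pod requests a whole card through this extended resource (the Pod name and image are placeholders):

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildGPUPod builds a Pod that asks for one whole GPU card through the
// extended resource registered by nvidia-device-plugin.
func buildGPUPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cuda-job"}, // placeholder name
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "nvidia/cuda:11.0-base", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Extended resources must be requested as whole integers.
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
}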

The Kubernetes device plugin provides a convenient way to integrate third-party devices. When applied to GPU scenarios, however, it still has the following shortcomings:

  • Cluster GPU resources lack a global view. There is no intuitive way to obtain cluster-level GPU information, such as the binding relationship between Pods/containers and GPU cards or the number of GPU cards already in use.

  • Multiple GPU backends are not well supported. Different GPU technologies (nvidia-docker, qGPU, vCUDA, GPU share, GPU pooling) each require their own components to be deployed and cannot be scheduled and managed in a unified way.

Problem 1: Lack of a global view of GPU resources

Today Kubernetes allocates and schedules GPU resources through extended resources, which are simple integer counts of the cards on each node. If users want to know how GPU cards are allocated across the cluster, they have to traverse the nodes and aggregate the information themselves. And because the resource is a scalar, it is impossible to recover the binding relationship between Pods/containers and cards. These problems are not so prominent in whole-card mode, but they become especially serious in fine-grained sharing mode.
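
To make the traversal described above concrete, the rough sketch below (assuming a client-go clientset has already been constructed) walks every node and Pod to add up the requested nvidia.com/gpu per node; even then, it cannot recover which Pod holds which physical card:

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const gpuResource = corev1.ResourceName("nvidia.com/gpu")

// gpuUsageByNode walks every node and every running Pod to reconstruct how
// many GPU cards are requested on each node; this is the kind of ad-hoc
// aggregation an administrator has to do today.
func gpuUsageByNode(ctx context.Context, cs kubernetes.Interface) (map[string]int64, error) {
	usage := map[string]int64{}
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, n := range nodes.Items {
		usage[n.Name] = 0
	}
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, p := range pods.Items {
		if p.Spec.NodeName == "" || p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			continue
		}
		for _, c := range p.Spec.Containers {
			if q, ok := c.Resources.Requests[gpuResource]; ok {
				usage[p.Spec.NodeName] += q.Value()
			}
		}
	}
	return usage, nil
}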

Since GPU cards are relatively expensive and some AI workloads cannot consume the computing power of an entire card, GPU sharing technologies have emerged. In Kubernetes, we pack several AI workloads onto the same GPU card to increase deployment density, improve GPU utilization, and save cost. Taking TKE qGPU as an example, in GPU sharing mode the extended resources change from the number of GPU cards to a percentage of qGPU cores and MB of qGPU memory. In other words, with qGPU container virtualization users can request a qGPU virtual device smaller than one card; these devices are virtualized on a single physical card with strong resource isolation. Besides qGPU, technologies such as vCUDA and GPU share also allow multiple Pods/containers to share the same GPU card. With the existing Kubernetes architecture, there is no way to know how the slice resources on a GPU card (which I define as the combination of GPU core and memory) are distributed. Cluster resource distribution is a black box for both administrators and users: administrators cannot see how GPU slice resources are allocated across the cluster, and users do not know whether there are enough resources for a newly deployed service.

Problem 2: No support for multiple GPU backends

Besides allocating and mounting whole cards, GPU sharing technologies such as TKE qGPU, vCUDA, GPU share, and GPU pooling are increasingly adopted by users. Each solution has its own independent Kubernetes integration. For TKE qGPU, for example, we developed tke-qgpu-scheduler for fine-grained scheduling of GPU compute and video memory, together with tke-qgpu-manager for node initialization, registration and reporting of qGPU resources, and qGPU container virtualization. vCUDA and GPU share follow a similar architecture, also consisting of a scheduler plus a device plugin. These solutions are independent of one another, follow no unified standard, and cannot share components, which makes it difficult to use several GPU backend technologies in a single cluster. For example, some workloads in a user's cluster are online inference: they do not need a whole card and want to request TKE qGPU slice resources. Other workloads are training jobs that need whole cards. Still other simulation and model-debugging workloads want to dynamically request resources from a remote GPU pool for cost and flexibility. Existing solutions can hardly satisfy all of these requirements at once, which makes it much harder to build a unified AI infrastructure platform on Kubernetes.

These are real problems TKE has encountered while helping customers build AI computing platforms on Kubernetes. As AI businesses keep growing, customers are no longer satisfied with merely "being able to use Kubernetes GPU resources". Visibility into GPU cost, overall control of GPU resources, and precise use of different GPU backends have all become prerequisites for making good use of GPU computing power. Since the existing system cannot meet these needs, we have to take a different path and rethink the position of GPUs in Kubernetes.

A New Kubernetes GPU Solution

Inspiration from PV / PVC

In Kubernetes, resources are generally designed and defined around Pods. Broadly speaking, a cluster offers two kinds of resources. Core resources are those indispensable for keeping a Pod running, including CPU, memory, ephemeral storage, and network cards; the kubelet scans the node for these and reports them to the cluster. The other kind is external resources, mostly external storage and other devices such as data disks, GPUs, and FPGAs. These devices may be mounted locally or remotely, and they make a Pod run better: a data disk increases the Pod's storage capacity, while a GPU/FPGA accelerates its computation. From this perspective, storage and GPUs are quite similar.

For storage, Kubernetes abstracts a set of resources such as PV/PVC/StorageClass and provides a set of APIs and interaction patterns that standardize storage and separate its provisioning from its consumption.

  • PV: A PV is an actual storage resource in the cluster, created manually by an administrator or dynamically through a StorageClass. A PV is analogous to resources such as CPU, memory, and network cards on a node. PVs can have various backends, such as public cloud storage services, self-built shared storage, or local storage.
  • PVC: A PVC is a user's claim on PV storage resources. It is analogous to a Pod: a Pod consumes CPU, memory, and network resources on a node, and a PVC consumes storage resources, that is, PVs.
  • StorageClass: A StorageClass gives administrators a way to describe a "class" of storage, for example how the backing PV is created and how it is mounted.

Storage is created on the specified backend through a PV; the user then claims the created PV storage via a PVC, or specifies a StorageClass to have a PV dynamically provisioned from the backend.
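
To recall the consumption side of this model, here is a minimal PVC sketch in Go (assuming a k8s.io/api version in which the PVC resources field is still a ResourceRequirements; the class name "standard" is a placeholder):

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePVC shows the consumption side of the storage abstraction: the user
// only states how much storage is needed and which "class" should provision
// it, while the backend details stay hidden behind the StorageClass.
func examplePVC() *corev1.PersistentVolumeClaim {
	className := "standard" // placeholder StorageClass name
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &className,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Gi"),
				},
			},
		},
	}
}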

Referring to the way Kubernetes storage is designed, we believe a similar abstraction can be defined and implemented for GPUs.

Elastic GPU CRD

We define three new Kubernetes CRDs that represent different abstractions for GPU resources:

  • ElasticGPU: An ElasticGPU is an actually usable GPU resource in the cluster, which can be a local physical GPU card, a GPU slice resource (a combination of GPU computing power and video memory), or a remote GPU device.
  • ElasticGPUClaim: An ElasticGPUClaim is a user's request for ElasticGPU resources. A claim can request a number of whole cards, an amount of GPU cores/video memory, or an amount of TFLOPS of computing power.
  • ElasticGPUClass: An ElasticGPUClass describes how an ElasticGPU is provisioned and mounted, for example with qGPU virtualization, vCUDA, or GPU remote pooling technology.

The corresponding CRD definitions (abridged) look like this:

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Referenced sub-types (ElasticGPUStatus, GPUNodeAffinity, the backend sources,
// and the claim spec/status) are omitted here for brevity.

// ElasticGPU represents an actually usable GPU resource in the cluster.
type ElasticGPU struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
	Spec              ElasticGPUSpec   `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
	Status            ElasticGPUStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
}

type ElasticGPUSpec struct {
	Capacity         v1.ResourceList `json:"capacity,omitempty" protobuf:"bytes,1,rep,name=capacity,casttype=ResourceList,castkey=ResourceName"`
	ElasticGPUSource `json:",inline" protobuf:"bytes,2,opt,name=elasticGPUSource"`
	ClaimRef         v1.ObjectReference `json:"claimRef,omitempty" protobuf:"bytes,3,opt,name=claimRef"`
	NodeAffinity     GPUNodeAffinity    `json:"nodeAffinity,omitempty" protobuf:"bytes,4,opt,name=nodeAffinity"`
	NodeName         string             `json:"nodeName,omitempty" protobuf:"bytes,5,opt,name=nodeName"`
}

// ElasticGPUSource points to the backend that actually provides the GPU resource.
type ElasticGPUSource struct {
	QGPU        *QGPUElasticGPUSource        `json:"qGPU,omitempty" protobuf:"bytes,1,opt,name=qGPU"`
	PhysicalGPU *PhysicalGPUElasticGPUSource `json:"physicalGPU,omitempty" protobuf:"bytes,2,opt,name=physicalGPU"`
	GPUShare    *GPUShareElasticGPUSource    `json:"gpuShare,omitempty" protobuf:"bytes,3,opt,name=gpuShare"`
}

// ElasticGPUClaim is a user's request for ElasticGPU resources.
type ElasticGPUClaim struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
	Spec              ElasticGPUClaimSpec   `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
	Status            ElasticGPUClaimStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
}

// ElasticGPUClass describes how an ElasticGPU is provisioned and mounted.
type ElasticGPUClass struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
	Provisioner       string            `json:"provisioner" protobuf:"bytes,2,opt,name=provisioner"`
	Parameters        map[string]string `json:"parameters,omitempty" protobuf:"bytes,3,rep,name=parameters"`
}
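
The claim's Spec and Status referenced above are not shown in the snippet. Purely to illustrate how a claim could mirror a PVC, a hypothetical sketch might look like the following (field names are assumptions, not the project's actual definition):

// Hypothetical sketch only; the actual field names in the Elastic GPU project may differ.
type ElasticGPUClaimSpec struct {
	// Resources describes the requested GPU resources, e.g. whole cards,
	// qGPU core percentage and memory, or TFLOPS of computing power.
	Resources v1.ResourceRequirements `json:"resources,omitempty"`
	// ElasticGPUClassName selects the ElasticGPUClass used to provision the
	// ElasticGPU, analogous to a PVC's storageClassName.
	ElasticGPUClassName *string `json:"elasticGPUClassName,omitempty"`
}

// Hypothetical sketch only.
type ElasticGPUClaimStatus struct {
	// Phase reports whether the claim is still pending or already bound to an ElasticGPU.
	Phase string `json:"phase,omitempty"`
}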

The following takes TKE qGPU as an example to walk through the entire resource scheduling and allocation flow under the Elastic GPU solution.

qGPU resource application

The user creates an ElasticGPUClass in the cluster, specifying qGPU as the GPU backend.

apiVersion: elasticgpu.io/v1alpha
kind: ElasticGPUClass
metadata:
  name: qgpu-class
provisioner: elasticgpu.io/qgpu
reclaimPolicy: Retain
eGPUBindingMode: Immediate

The user then creates an ElasticGPUClaim to describe the qGPU resource request: tke.cloud.tencent.com/qgpu-core: 10 requests 10% of one card's computing power, and tke.cloud.tencent.com/qgpu-memory: 4GB requests 4 GB of video memory.

apiVersion: elasticgpu.io/v1alpha
kind: ElasticGPUClaim
metadata:
  name: qgpu-egpuc
spec:
  storageClassName: qgpu-class
  resources:
    requests:
      tke.cloud.tencent.com/qgpu-core: 10
      tke.cloud.tencent.com/qgpu-memory: 4GB

Finally, the user references the ElasticGPUClaim when creating the Pod to complete the qGPU resource request.

apiVersion: v1
kind: Pod
metadata:
  name: qgpu-pod
  annotations:
    elasticgpu.io/egpuc-<container-name>: qgpu-egpuc
spec:
  containers:
  - name: test

qGPU resource scheduling

To keep the design out-of-tree, qGPU resource discovery, reporting, and scheduling still rely on the original device plugin and extended resource mechanisms.

We use elastic-gpu-admission-hook to recognize the elasticgpu.io/egpuc-<container-name> annotation when a Pod is created and to set the requested resources on the corresponding containers. The mutated Pod looks like this (a simplified sketch of the mutation logic follows the example):

apiVersion: v1
kind: Pod
metadata:
  name: qgpu-pod
  annotations:
    elasticgpu.io/egpuc-test: qgpu-egpuc
spec:
  containers:
  - name: test
    resources:
      requests:
        tke.cloud.tencent.com/qgpu-core: 10
        tke.cloud.tencent.com/qgpu-memory: 4GB
      limits:
        tke.cloud.tencent.com/qgpu-core: 10
        tke.cloud.tencent.com/qgpu-memory: 4GB
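
The simplified sketch below shows the essence of this mutation; it is not the actual elastic-gpu-admission-hook code, and the lookup from claim name to resource list is assumed to have been done already:

package example

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

const claimAnnotationPrefix = "elasticgpu.io/egpuc-"

// applyClaimResources copies, for every container pointed at a claim through
// the elasticgpu.io/egpuc-<container-name> annotation, the resources requested
// by that claim into the container's requests and limits, so the existing
// extended-resource scheduling path keeps working. claimResources maps claim
// name to its resource list (e.g. qgpu-core / qgpu-memory) and is assumed to
// have been resolved from the ElasticGPUClaim objects beforehand.
func applyClaimResources(pod *corev1.Pod, claimResources map[string]corev1.ResourceList) {
	for key, claimName := range pod.Annotations {
		if !strings.HasPrefix(key, claimAnnotationPrefix) {
			continue
		}
		containerName := strings.TrimPrefix(key, claimAnnotationPrefix)
		res, ok := claimResources[claimName]
		if !ok {
			continue
		}
		for i := range pod.Spec.Containers {
			c := &pod.Spec.Containers[i]
			if c.Name != containerName {
				continue
			}
			if c.Resources.Requests == nil {
				c.Resources.Requests = corev1.ResourceList{}
			}
			if c.Resources.Limits == nil {
				c.Resources.Limits = corev1.ResourceList{}
			}
			for name, qty := range res {
				c.Resources.Requests[name] = qty
				c.Resources.Limits[name] = qty
			}
		}
	}
}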

The qgpu-scheduler extender performs qGPU resource scheduling and returns the nodes that can satisfy the request. After the Pod is bound to a node, qgpu-provisioner updates the ElasticGPU CRD with information such as the node and the GPU card index, which binds the qGPU device.
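
For readers unfamiliar with scheduler extenders, the sketch below shows the general shape of such a filter endpoint. It is illustrative only, not the actual qgpu-scheduler implementation, and hasEnoughQGPU stands in for the real slice-fitting logic:

package example

import (
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// hasEnoughQGPU is a placeholder for the real fitting logic, which would check
// the node's remaining qGPU core/memory slices against the Pod's request.
func hasEnoughQGPU(pod *corev1.Pod, node corev1.Node) bool { return true }

// filterHandler receives the Pod plus the candidate nodes from the default
// scheduler and returns only the nodes that can still fit the requested GPU
// slices.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := &extenderv1.ExtenderFilterResult{
		Nodes:       &corev1.NodeList{},
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	if args.Nodes != nil {
		for _, node := range args.Nodes.Items {
			if hasEnoughQGPU(args.Pod, node) {
				result.Nodes.Items = append(result.Nodes.Items, node)
			} else {
				result.FailedNodes[node.Name] = "insufficient qGPU core/memory"
			}
		}
	}
	_ = json.NewEncoder(w).Encode(result)
}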

qGPU resource creation

qgpu-manager watches changes to the ElasticGPU CRD. After the binding to a node succeeds, it creates the qGPU device: based on the requested computing power and video memory recorded in the CRD and the index of the GPU card chosen by the scheduler, qgpu-manager creates the qGPU device at the underlying layer.
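
A rough sketch of such a watch loop with a dynamic informer is shown below. The GroupVersionResource is derived from the apiVersion used in the manifests above, the plural name "elasticgpus" is a guess, and the handler body only hints at what the real qgpu-manager does:

package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

// elasticGPUGVR is assumed from the manifests above; the resource plural is a guess.
var elasticGPUGVR = schema.GroupVersionResource{
	Group:    "elasticgpu.io",
	Version:  "v1alpha",
	Resource: "elasticgpus",
}

// watchElasticGPUs watches ElasticGPU objects and reacts once an object is
// bound to the local node, which is where a qgpu-manager style component would
// create the underlying qGPU device.
func watchElasticGPUs(client dynamic.Interface, nodeName string, stop <-chan struct{}) {
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 30*time.Second)
	informer := factory.ForResource(elasticGPUGVR).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			u := obj.(*unstructured.Unstructured)
			bound, _, _ := unstructured.NestedString(u.Object, "spec", "nodeName")
			if bound != nodeName {
				return
			}
			// The real component would read the requested core/memory and the
			// chosen GPU card index here, then create the qGPU device.
			fmt.Printf("ElasticGPU %s bound to this node\n", u.GetName())
		},
	})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}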

qGPU device mount

qgpu-manager is also a device plugin, so the kubelet calls it through the standard interfaces when allocating devices. In Allocate and PreStartContainer we mount the necessary qGPU and nvidia devices and set the environment variables. Finally, we rely on qgpu-container-runtime to bind the qGPU device to the container.
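
The sketch below shows the general shape of the Allocate step in a device plugin; the device paths and environment variable names are illustrative, not the actual qGPU ones:

package example

import (
	"context"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type qgpuDevicePlugin struct{}

// Allocate tells the kubelet which device nodes to expose in the container and
// which environment variables the container runtime hook needs.
func (p *qgpuDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, req := range reqs.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				// Consumed later by the runtime hook (illustrative variable name).
				"QGPU_ALLOCATED_DEVICES": strings.Join(req.DevicesIDs, ","),
			},
			Devices: []*pluginapi.DeviceSpec{{
				// Illustrative device node; the real plugin mounts the qGPU and
				// required nvidia device files.
				ContainerPath: "/dev/nvidiactl",
				HostPath:      "/dev/nvidiactl",
				Permissions:   "rw",
			}},
		})
	}
	return resp, nil
}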

Next steps

With the large-scale adoption of AI services, more and more users run GPU-based AI computing in Kubernetes. The existing extended resource and device plugin mechanisms can hardly satisfy customers' need for fine-grained control and allocation of GPU resources, so a new technical framework is imperative. Elastic GPU abstracts a native GPU resource in the Kubernetes cluster around three custom CRDs. While standardizing the interaction with different GPU technologies, it also provides a cluster-level global view of GPU resources, allowing users to better observe and manage them.

The first step for Elastic GPU focuses on the CRD definitions and on standardizing the interaction flow, and it will first be adapted to TKE qGPU. At this stage we want to follow the design philosophy of PV/PVC/CSI: provide a Kubernetes-native abstraction of GPU resources, standardize the processes of resource allocation, scheduling, and mounting, and expose flexible interfaces for integrating with other GPU technologies. By supporting TKE qGPU in production first, we will keep polishing the framework and release the first alpha version. Next, we will work with the community to integrate mainstream GPU technologies, including nvidia docker, GPU share, and vCUDA, and horizontally extend the framework's applicable scenarios. A standard framework with unified interfaces and processes reduces customers' management costs, while technologies such as GPU sharing and remote GPU improve flexibility, increase utilization, and reduce the cost of GPU cards. We hope the Elastic GPU framework will eventually give customers an out-of-the-box way to use GPU resources in Kubernetes.

TKE qGPU: https://cloud.tencent.com/document/product/457/61448
