Chapter 20: GPU management mechanism and Device Plugin


This article covers the following topics:

  1. Where the demand comes from
  2. Running GPUs in containers
  3. Managing GPUs with Kubernetes
  4. How it works
  5. Follow-up thinking and practice

Where the Demand Comes From

In 2016, with the sudden rise of AlphaGo and the TensorFlow project, a technological wave called AI spread rapidly from academia to industry; the so-called AI revolution had begun.

After three years of development, AI has landed in many scenarios, including intelligent customer service, face recognition, machine translation, and image search. In fact, machine learning and artificial intelligence are not new concepts. Behind this round of enthusiasm, the popularity of cloud computing is widely considered a major driving force, the real force that brought artificial intelligence out of the ivory tower and into industry.


Correspondingly, starting in 2016 the Kubernetes community kept receiving requests from different channels: people hoped to run machine learning frameworks such as TensorFlow on Kubernetes clusters. Beyond the challenges already described in earlier lessons, such as Job management for offline tasks, these requests brought a new one: deep learning depends on heterogeneous devices, most notably NVIDIA GPUs.

We cannot help but ask: what good does it do for Kubernetes to manage GPUs? Essentially it comes down to cost and efficiency. Compared with CPUs, GPUs are expensive: on the cloud a single CPU usually costs a few cents an hour, while a single GPU costs roughly 10 to 30 per hour, so it is necessary to find ways to improve GPU utilization. Why should Kubernetes manage heterogeneous resources, with the GPU as the representative example?

Specifically, there are three benefits:

  • Faster deployment: containerization avoids repeatedly setting up complex machine learning environments;
  • Higher cluster resource utilization: unified scheduling and allocation of cluster resources;
  • Exclusive resource access: containers isolate heterogeneous devices so that workloads do not affect each other.

The first point is accelerating deployment and avoiding time wasted on environment preparation. Container images capture the entire deployment process so that it can be reused; if you follow the machine learning field, you will find that many frameworks already provide container images. The second point is improving GPU utilization: through time-division multiplexing, the efficiency of GPU use goes up, and once the number of GPU cards reaches a certain scale, you need Kubernetes' unified scheduling capability so that consumers apply for resources when they need them and release them as soon as they are done, thereby keeping the whole GPU resource pool busy. At the same time, this also relies on the device isolation that Docker provides, so that processes of different applications do not run on the same device and interfere with each other; this improves cost efficiency while also protecting system stability.

Running GPUs in Containers

We have now seen the benefits of running GPU applications through Kubernetes. From earlier lessons we also know that Kubernetes is a container management platform whose scheduling unit is the container, so before learning how Kubernetes manages GPUs, let us first look at how to run a GPU application in a container environment.

Using GPU applications in a container environment

Running a GPU application in a container environment is actually not complicated. It takes two steps:

  • build a container image with GPU support;
  • run the image with Docker, mapping the GPU devices and the dependent driver libraries into the container.

How to prepare a GPU container image

There are two ways to prepare one:

  • Use an official deep-learning container image directly

For example, you can look for official GPU images directly on Docker Hub or in the Alibaba Cloud image service. Popular machine learning frameworks such as TensorFlow, Caffe, and PyTorch all provide standard images there. The advantage is that this is simple, convenient, safe, and reliable.

  • Build your own image on top of an NVIDIA CUDA base image

Of course, if the official images cannot meet your requirements, for example because you have made custom modifications to the TensorFlow framework, you need to recompile and build your own TensorFlow image. In that case, our best practice is to keep building on top of NVIDIA's official base image rather than starting from scratch.

TensorFlow is an example of this: its GPU image starts from a CUDA base image and builds the framework on top.
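As a minimal sketch of what such a Dockerfile might look like (the base-image tag and the build steps below are illustrative assumptions, not the official TensorFlow Dockerfile):

```dockerfile
# Illustrative sketch only: the tag and build steps are assumptions,
# not the official TensorFlow Dockerfile.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

# Install Python and the build tools needed to compile the customized framework.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip git && \
    rm -rf /var/lib/apt/lists/*

# Copy the (customized) framework source into the image and install it.
COPY . /opt/my-tensorflow
WORKDIR /opt/my-tensorflow
RUN pip3 install .

ENTRYPOINT ["python3"]
```

The point is simply that the image inherits the CUDA toolkit layers from the NVIDIA base image instead of rebuilding them from scratch.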


How GPU container images work

To understand how to build a GPU container image, you first need to know how a GPU application is installed on a host.

As shown on the left of the figure, the bottom layer is the NVIDIA hardware driver; above that is the CUDA toolkit; and at the top are machine learning frameworks such as PyTorch and TensorFlow. The CUDA toolkit and the applications are tightly coupled: when an application version changes, the corresponding CUDA version very likely has to be updated too. The NVIDIA driver at the bottom, by contrast, is relatively stable and is not updated as frequently as CUDA and the applications.

(Figure: left, the driver / CUDA / framework stack on a GPU host; right, NVIDIA's container scheme with the driver on the host and CUDA inside the image.)

Meanwhile, the NVIDIA kernel driver has to be compiled against the kernel source. So, as shown on the right of the figure, NVIDIA's GPU container scheme is: install the NVIDIA driver on the host, put the software from CUDA upward into the container image, and map the NVIDIA driver's library files into the container with bind mounts. One benefit of this is that after installing a new NVIDIA driver, you can run container images with different CUDA versions on the same machine node.

How to run a GPU program in a container

With this foundation, it is easier to understand how a GPU container works. The figure below is an example of using Docker to run a GPU container.

(Figure: running a GPU container with Docker; the right side shows the mapped GPU devices and the bind-mounted driver libraries inside the started container.)

Comparing a GPU container with an ordinary container at run time, the only difference is that the host's GPU devices and the NVIDIA driver libraries need to be mapped into the container. The right side of the figure shows the GPU configuration of the container after it has started: the upper right shows the result of the device mapping, and the lower right shows the driver libraries that were bind-mounted into the container. GPU containers are usually run with nvidia-docker, and what nvidia-docker actually does is automate these two steps. Mounting the devices is relatively simple; the harder part is figuring out which driver libraries a GPU application depends on. Different scenarios, such as deep learning and video processing, use different driver libraries. This relies on NVIDIA's domain knowledge, which is exactly what NVIDIA has packaged into its container tooling.
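As a hedged sketch of what this looks like in practice (the image tag is just an example; the --gpus flag requires Docker 19.03+ with the NVIDIA container toolkit, while older setups go through the nvidia-docker wrapper):

```shell
# With Docker 19.03+ and the NVIDIA container toolkit installed,
# --gpus maps the GPU devices and driver libraries into the container.
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi

# Older setups achieve the same through the nvidia-docker wrapper:
#   nvidia-docker run --rm nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi

# Doing it by hand would mean passing the device nodes and driver libraries
# yourself, e.g. --device /dev/nvidia0 --device /dev/nvidiactl plus the
# driver-library mounts; this is exactly the work the tooling automates.
```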

Managing GPUs with Kubernetes

How to deploy GPU support in Kubernetes

First, let us look at how to add GPU capability to a Kubernetes node, taking a CentOS node as an example.


There are three steps:

  • The first step is to install the NVIDIA driver;

Because the NVIDIA driver needs to be compiled against the kernel, you have to install gcc and the kernel source code before installing the driver.

  • The second step is to install nvidia-docker2 through the yum source;

Installing nvidia-docker2 requires restarting Docker. You can check in Docker's daemon.json whether the default runtime has been switched to nvidia, or run docker info to see whether the runC being used is NVIDIA's runC.

  • The third step is to deploy the NVIDIA Device Plugin.

Download the Device Plugin's deployment declaration file from NVIDIA's git repo and deploy it with the kubectl create command (a sketch follows below). The Device Plugin is deployed as a DaemonSet. Knowing these components tells us where to start troubleshooting when a Kubernetes node cannot schedule GPU applications: for example, check the Device Plugin's logs, check whether NVIDIA's runC is configured as Docker's default runC, and check whether the NVIDIA driver was installed successfully.
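A hedged sketch of these checks and of the Device Plugin deployment (the manifest URL, version, and label below are examples; use the release that matches your cluster):

```shell
# nvidia-docker2 switches Docker's default runtime to nvidia; verify it.
cat /etc/docker/daemon.json
docker info | grep -i runtime

# Deploy the NVIDIA Device Plugin as a DaemonSet
# (URL and version are illustrative).
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml

# Troubleshooting starts with the plugin's logs
# (the label selector may differ between plugin versions).
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds
```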

Verifying the GPU deployment in Kubernetes

When the GPU node has been deployed successfully, we can find the GPU-related information in the node's status (a command-line check is sketched after the figure):

  • one is the name of the GPU resource, here nvidia.com/gpu;
  • the other is its count; in the figure below it is 2, which means the node contains two GPUs.

(Figure: node status showing the nvidia.com/gpu resource with a count of 2.)
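A quick way to check this from the command line might look like the sketch below (the node name is a placeholder and the exact output depends on the cluster):

```shell
# Look for the GPU extended resource in the node's Capacity/Allocatable.
kubectl describe node <node-name> | grep -A 5 -i capacity

# Expect a line similar to:
#   nvidia.com/gpu:  2
```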

A sample yaml for using a GPU in Kubernetes

From the user's point of view, using a GPU in a Kubernetes container is very simple: you only need to specify nvidia.com/gpu and the number of GPUs in the resource limits field of the Pod configuration; in the sample below the number is set to 1. Then the GPU Pod is deployed with the kubectl create command.

(Figure: a sample Pod spec requesting one nvidia.com/gpu.)
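A minimal sketch of such a manifest (the Pod name and image are placeholders; the essential part is the nvidia.com/gpu limit):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                 # placeholder name
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.8.0-base-ubuntu20.04   # illustrative image
      command: ["sleep", "infinity"]   # keep the Pod running so we can exec into it
      resources:
        limits:
          nvidia.com/gpu: 1     # request exactly one GPU
```

Save it to a file and deploy it with kubectl create -f <file>.yaml.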

Viewing the result

After the deployment is complete, you can log in to the container and run the nvidia-smi command to look at the result, as sketched after the figure: you can see that a T4 GPU card is in use inside the container. In other words, one of the node's two GPU cards is usable inside this container, while the other card is completely invisible to it and cannot be accessed; this demonstrates GPU isolation.

(Figure: nvidia-smi output inside the container, showing a single T4 card.)
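For reference, a hedged sketch of that check (the Pod name gpu-pod is the placeholder from the sample above):

```shell
# Run nvidia-smi inside the GPU Pod; only the single GPU allocated to
# this container should be visible in the output.
kubectl exec -it gpu-pod -- nvidia-smi
```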

How It Works

Managing GPU resources through extension mechanisms

Kubernetes itself manages GPU resources through a plug-in extension mechanism. Specifically, there are two separate internal mechanisms:


  • The first is Extended Resources, which allows users to define custom resource names; the resources are measured at integer granularity. The goal is to support different heterogeneous devices through a common pattern, including RDMA, FPGA, AMD GPUs and the like, not just NVIDIA GPUs;
  • The second is the Device Plugin Framework, which allows third-party device vendors to manage the full lifecycle of their devices in their own way. The Device Plugin Framework builds the bridge between Kubernetes and the Device Plugin module: on the one hand it is responsible for reporting device information to Kubernetes, and on the other hand it is responsible for choosing the concrete devices during scheduling.

Extended Resource reporting

Extended Resources is a Node-level API and can be used independently of a Device Plugin. To report an Extended Resource, you only need to update the status portion of the Node object through a PATCH API call, and this PATCH can be done with a simple curl command. After that, the Kubernetes scheduler records that the node has a GPU-typed resource with, in this example, a quantity of 1.
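A hedged sketch of that PATCH, assuming a local kubectl proxy and a placeholder node name (the '~1' encodes the '/' of nvidia.com/gpu in JSON Patch):

```shell
# In one terminal, open a proxy to the API server:
#   kubectl proxy
# Then add the extended resource to the node's status capacity.
curl --header "Content-Type: application/json-patch+json" \
     --request PATCH \
     --data '[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "1"}]' \
     http://localhost:8001/api/v1/nodes/<node-name>/status
```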


Of course, if you use a Device Plugin, you do not need to do this PATCH yourself. You only need to follow the Device Plugin's programming model, and the Device Plugin will complete this operation as part of its device reporting.

Device Plugin mechanism

Now let us look at the working mechanism of the Device Plugin. The whole Device Plugin workflow can be divided into two parts:

  • resource reporting at startup time;
  • user scheduling and running at run time.


Developing a Device Plugin is very simple. The two core methods we care most about are the following (a sketch of the gRPC definitions follows this list):

  • ListAndWatch corresponds to resource reporting and also provides a health-check mechanism. When a device becomes unhealthy, the ID of the unhealthy device is reported to Kubernetes, and the Device Plugin Framework removes that device from the schedulable device list;
  • Allocate is called on the Device Plugin when a container is deployed. The core parameter passed in is the device IDs the container will use; the returned parameters are the devices, data volumes, and environment variables the container needs when it starts.
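For reference, here is an abridged sketch of the corresponding gRPC definitions (kubelet device plugin API v1beta1, trimmed to the two methods discussed above):

```protobuf
// Abridged sketch of the kubelet device plugin API (v1beta1); not complete.
service DevicePlugin {
  // Streams the device list and device health back to the kubelet.
  rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
  // Called when a container is created, with the device IDs it will use;
  // returns the devices, mounts and environment variables it needs.
  rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}

message ListAndWatchResponse {
  repeated Device devices = 1;
}

message Device {
  string ID = 1;       // e.g. a GPU's UUID
  string health = 2;   // "Healthy" or "Unhealthy"
}
```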

Resource reporting and monitoring

Each hardware device is managed by its corresponding Device Plugin. The Device Plugin connects, as a client, to the Device Plugin Manager in the kubelet over gRPC, and reports the Unix socket it listens on, its API version number, and the device name (for example GPU) to the kubelet. Let us walk through the whole resource-reporting process of a Device Plugin. Overall it is divided into four steps: the first three happen on the node, and the fourth is the interaction between the kubelet and the api-server.


The first step is registering the Device Plugin. Kubernetes needs to know which Device Plugin to interact with, because there may be multiple devices on one node. The Device Plugin, acting as the client, reports three things to the kubelet (the corresponding request message is sketched after this list):

  • Who am I? The name of the device the Device Plugin manages, for example GPU or RDMA;
  • Where am I? The file location of the Unix socket the plugin itself listens on, so that the kubelet can call back into it;
  • The interaction protocol, that is, the API version number.
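These three pieces of information map onto the fields of the registration request; an abridged sketch of the v1beta1 definitions:

```protobuf
// Abridged sketch: the plugin calls Register on the kubelet's socket
// (/var/lib/kubelet/device-plugins/kubelet.sock).
service Registration {
  rpc Register(RegisterRequest) returns (Empty) {}
}

message RegisterRequest {
  string version = 1;        // interaction protocol: API version, e.g. "v1beta1"
  string endpoint = 2;       // "where am I": the plugin's own Unix socket name
  string resource_name = 3;  // "who am I": e.g. "nvidia.com/gpu"
}
```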

The second step is starting the service: the Device Plugin starts a gRPC server. From then on, the Device Plugin serves the kubelet in the role of that server, while the listening address and the API version were already provided to the kubelet in the first step.

The third step: after the gRPC server starts, the kubelet establishes a long-lived ListAndWatch connection to the Device Plugin to discover the device IDs and the health of the devices. When the Device Plugin detects that a device has become unhealthy, it proactively notifies the kubelet. If the device is idle at that moment, the kubelet removes it from the allocatable list; but if the device is already in use by a Pod, the kubelet does nothing, because killing the Pod at that point would be a very dangerous operation.

The fourth step: the kubelet exposes these devices in the Node status and sends the device count to the api-server in Kubernetes. The scheduler can then make scheduling decisions based on this information.

Note that when reporting to the api-server, the kubelet only reports the number of GPUs. The kubelet's own Device Plugin Manager keeps the list of GPU IDs and uses it to assign concrete devices. The Kubernetes global scheduler, however, never sees this list of GPU IDs; it only knows the GPU count. This means that under the existing Device Plugin mechanism, the global scheduler cannot perform more sophisticated scheduling. For example, suppose you want affinity between two GPUs in a Pod: on a node with several GPUs, the two chosen GPUs may need to communicate over NVLINK rather than PCIe to achieve better data transfer. The current Device Plugin scheduling mechanism cannot express this requirement.

Pod scheduling and running


When a Pod wants to use a GPU, it only needs to declare the GPU resource and its count (for example nvidia.com/gpu: 1) in the limits field of its resources, exactly as in the earlier example. Kubernetes finds a node whose GPU count satisfies the request, decrements that node's GPU count by one, and completes the binding of the Pod to the node.

After the binding succeeds, the kubelet on the bound node naturally creates the corresponding container. When the kubelet finds that the container's resource request includes a GPU, it delegates to its internal Device Plugin Manager module, which picks an available GPU from the GPU ID list it holds and assigns it to the container.

At this point the kubelet sends an Allocate request to the Device Plugin on that machine; the parameter carried in the request is the list of device IDs that are about to be assigned to the container.

After the Device Plugin receives the AllocateRequest, it uses the device IDs passed by the kubelet to look up the device paths, the driver directory, and the environment variables corresponding to those IDs, and returns them to the kubelet as an AllocateResponse. Once the device paths and driver directory carried in the AllocateResponse are returned, the kubelet performs the actual GPU allocation for the container: it instructs Docker to create the container, the GPU devices appear inside that container, and the driver directory is mounted into it. At this point, the process of Kubernetes assigning a GPU to a Pod is complete.
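For reference, an abridged sketch of what the Allocate response carries (v1beta1, trimmed):

```protobuf
// Abridged sketch; one ContainerAllocateResponse per container.
message AllocateResponse {
  repeated ContainerAllocateResponse container_responses = 1;
}

message ContainerAllocateResponse {
  map<string, string> envs = 1;     // environment variables for the container
  repeated Mount mounts = 2;        // driver libraries/directories to mount
  repeated DeviceSpec devices = 3;  // device nodes such as /dev/nvidia0
}
```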

Follow-up Thinking and Practice

Learning Summary

In this lesson, we learned how to use GPUs with both Docker and Kubernetes.

GPU containerization:

  • How to build a GPU container image
  • How to run a GPU container directly with Docker

Using Kubernetes to manage GPU resources:

  • How Kubernetes supports GPU scheduling
  • How to verify the GPU configuration in Kubernetes
  • The scheduling process of a GPU container

The Device Plugin working mechanism:

  • Resource reporting and monitoring
  • Pod scheduling and running

Thinking:

  • Current shortcomings of the Device Plugin mechanism
  • Common Device Plugins in the community

Shortcomings of the Device Plugin mechanism

Finally, let us think about a question: is today's Device Plugin perfect?

It must be pointed out that the Device Plugin's working mechanism and workflow actually differ quite a bit from real scenarios in both academia and industry. The biggest problem is that the GPU resource scheduling work is actually done on the kubelet. The global scheduler's participation is very limited: as a traditional Kubernetes scheduler, it can only handle GPU counts. Once devices are heterogeneous and their requirements cannot be described simply by a number, for example "my Pod wants to run on two GPUs connected by NVLINK", the Device Plugin cannot handle it at all. Not to mention the many scenarios in which we hope the scheduler makes decisions based on a cluster-wide, global view of the devices; these scenarios cannot be met by the Device Plugin either.

What is even harder is that, in the design and implementation of the Device Plugin, APIs such as Allocate and ListAndWatch offer no way to add extensible parameters. So when we want to use some more complex device capabilities, there is effectively no way to do it by extending the Device Plugin API. The scenarios covered by the current Device Plugin design are therefore rather narrow, leaving it in a state of "usable but not good enough". This explains why companies like NVIDIA have implemented their own solutions based on a fork of upstream Kubernetes; they had no better choice.

Heterogeneous resource scheduling solutions in the community


  • The first is the scheduling solution contributed by NVIDIA, which is the most commonly used one;
  • The second is the shared GPU scheduling solution contributed by the Alibaba Cloud service team, which addresses users' need to schedule shared GPUs; you are welcome to use it and help improve it;
  • The remaining two are RDMA and FPGA scheduling solutions provided by specific vendors.
