The technical architecture behind the Star Computing Power training platform, based on Kubeflow

The Star Computing Platform manages and schedules millions of CPU cores, supports high- and low-priority scheduling and oversubscription across large GPU fleets, and delivers large-scale, efficient, low-cost CPU and GPU computing services in a cloud-native way.

Tencent's business and organizational structure

Let me briefly introduce the current state of Tencent's internal businesses and the related organizational structure, which will help explain why we designed the overall solution the way we did.

The applications in the picture below, such as WeChat, Tencent Video, and games, are used by most people every day. The technologies behind them differ, spanning AI fields such as NLP, computer vision, reinforcement learning, and speech.

For example, when we play "King of Glory" or Go, there is an agent trained with reinforcement learning behind the scenes; it can play against us or cooperate with us when human teammates are not available.

Different business units have different requirements for their apps, and each customizes its AI platform for its own scenarios. What we provide is the underlying computing power. When serving these business units, we need to make per-business customization convenient while keeping overall resource utilization high. This is the multi-tenant situation inside Tencent.

Business characteristics and scale

Next, I will introduce some characteristics and the approximate scale of Tencent's internal business.

The current environment is based on the open source project TKEStack, the open source version of Tencent Cloud's TKE and a complete open source container platform solution. We run Kubernetes v1.14. The operating system is Tencent's in-house Linux distribution, which has already been performance-tuned and bug-fixed for GPU workloads and Tencent's internal business.

The GPU nodes are mainly NVIDIA V100 and P100, with some M40 machines in certain cases. The nodes are interconnected over 100G RoCE, which provides both Ethernet and RDMA support; this is a multiplier for users optimizing multi-node communication and also guarantees overall efficiency at the hardware level.

Design of the end-to-end process

Next, let's look at how we designed, developed, and refined this end-to-end process.

What is Kubeflow

Let me first introduce Kubeflow and some of its main components, to help you understand the specific workloads and design decisions that follow.

Since its release at the end of 2017, Kubeflow has gradually become the mainstream way to run machine learning and deep learning workloads, both training and inference, on Kubernetes.

Kubeflow contains many components, most notably the various Operators, as well as tools such as automatic hyperparameter tuning.

Operator

Let's start with the Operator. It is a Kubernetes concept used to package, deploy, and manage user workloads; in Kubeflow, Operators are mainly used to manage machine learning and deep learning tasks.

So what can it do for users?

Take a multi-node task as an example: the Operator manages and maintains the task's multiple nodes and how they communicate with each other, and it manages the Pods and the life cycle of the whole task. If one Pod fails, should the entire task be terminated, or should some other fault-tolerance strategy apply? All of this is handled by the Operator.

There are currently several mainstream Operators, one per framework: the most widely used TF-Operator for TensorFlow, MPI-Operator for Horovod, and the PyTorch and Caffe2 operators for their respective frameworks. Each has customizations for its own framework's scenarios.

Here I will focus on the two Operators we use most, MPI-Operator and TF-Operator, and the key optimizations we have made to them.

MPI-Operator

MPI-Operator provides multi-node management for MPI and Horovod tasks. It has v1alpha1, v1alpha2, and v1 API versions. Since v1 is still under heavy development, we currently recommend v1alpha2; once v1 is released it will become the recommended version.

In this version there are two roles: the Launcher and the Workers. The Launcher acts as a starter: it waits until all Workers are in place and then launches the MPI task. The Launcher itself, however, occupies CPU resources, so in practice it can be merged with a Worker. We have made this improvement and submitted it to the community, removing the redundant CPU-only Pod and letting one of the GPU nodes take on the Launcher role.
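For reference, a minimal MPIJob under the v1alpha2 API might look like the sketch below. The image name, replica counts, and GPU counts are placeholders for illustration, not our production values.

```yaml
# Minimal MPIJob sketch under the kubeflow.org/v1alpha2 API.
# Image names, replica counts and GPU counts are placeholders.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: horovod-example
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: example.registry/horovod-train:latest   # placeholder image
            command: ["mpirun", "python", "/train.py"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: mpi-worker
            image: example.registry/horovod-train:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1
```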

In addition, we have made several optimizations to speed up Pod creation and job startup.

For example, while the Launcher waits for the Workers to come up, we replaced the original shell-script polling with a multi-threaded tool, which greatly reduces the overall waiting time.

We also added an extra init container that downloads the user's Docker image ahead of time, so the image pull happens in parallel instead of serially.
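One way to express this idea in a plain Pod spec is sketched below, assuming the Launcher pre-pulls the same large training image that its main container and the Workers use, so the pull overlaps with the Workers' own image pulls. The names, images, and the wait step are illustrative, not the operator's actual generated spec.

```yaml
# Sketch: pre-pull the user's training image with a no-op init container so the
# (usually large) image is fetched in parallel with the workers pulling theirs,
# instead of only after all workers are ready. Names and images are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-launcher-prepull-sketch
spec:
  initContainers:
  - name: prepull-user-image
    image: example.registry/horovod-train:latest   # same large image as the main container
    command: ["sh", "-c", "true"]                  # does nothing; pulling the image is the point
  - name: wait-for-workers
    image: busybox:1.32                            # stand-in for the operator's own wait/delivery step
    command: ["sh", "-c", "echo waiting for workers to be ready"]
  containers:
  - name: mpi-launcher
    image: example.registry/horovod-train:latest   # already cached by the init container
    command: ["mpirun", "python", "/train.py"]
```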

For more details, we recommend going straight to the Kubeflow MPI-Operator repository on GitHub. The general flow is also shown on the right side of the figure above: the MPIJob is translated into resources that Kubernetes understands, such as the CRD itself, a ConfigMap, and RBAC objects for permission control.

TF-Operator

Besides MPI-Operator, the other heavily used Operator is TF-Operator, which mainly helps users start multi-node tasks with the PS-Worker architecture. Because these clusters are all GPU clusters, we recommend that users start the PS and the Worker inside the same Pod to reduce communication overhead.

We also made some additional optimizations, such as deploying related Pods close to each other and similar scheduling strategies. To shorten image loading time, the image pull policy IfNotPresent is used so that images already cached on a node are not pulled again.
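A minimal TFJob in the PS-Worker layout, with IfNotPresent as the image pull policy, might look like the sketch below. The image name, replica counts, and GPU count are placeholders, and this sketch shows the standard separate-Pod layout rather than our PS-and-Worker-in-one-Pod variant.

```yaml
# Sketch of a TFJob with the PS-Worker architecture. imagePullPolicy: IfNotPresent
# avoids re-pulling images that are already cached on the node.
# Image names, replica counts and GPU counts are placeholders.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-ps-worker-example
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow                 # TF-Operator expects this container name
            image: example.registry/tf-train:latest
            imagePullPolicy: IfNotPresent
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: example.registry/tf-train:latest
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                nvidia.com/gpu: 1
```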

TF-Operator is relatively mature at this point, and other companies are also investing heavily in it.

Building a training platform with Kubeflow in a multi-tenant scenario

Having introduced Kubeflow's main Operators, let's get back to today's topic: using Kubeflow to build a training platform in a multi-tenant scenario.

The conventional approach

Let's first look at how a multi-tenant Kubeflow platform is usually built.

Currently there are two approaches. One is to use native Kubernetes RBAC (role-based access control) to control permissions. The other is to use Istio, which leans more toward inference scenarios, to control user access.

As the picture shows, when we hand a cluster to users there are generally two entry points: the command line and a web UI. The web UI reaches the cluster through Istio's gateway, and Istio RBAC is then used to control permissions, distribute traffic, and maintain access control for the whole cluster.

For command-line clients, besides the integrated client itself, Kubernetes RBAC is used to allow or deny operations and handle permission control for users.
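A minimal sketch of the native RBAC route is shown below: a Role scoped to a tenant namespace that grants access to training jobs, bound to that tenant's user group. The namespace, group, and resource list are hypothetical and only illustrate the pattern.

```yaml
# Sketch: native Kubernetes RBAC for one tenant, limiting a user group to
# training-job resources in its own namespace. Names and groups are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job-editor
  namespace: tenant-a
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["tfjobs", "mpijobs"]
  verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-training-users
  namespace: tenant-a
subjects:
- kind: Group
  name: tenant-a-users              # hypothetical user group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: training-job-editor
  apiGroup: rbac.authorization.k8s.io
```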

This approach has a drawback, though: all users share the same components, such as the Operators and Controllers, and can only use the resource types that have already been defined. If we design a particular Job or Operator type, every user has to use it exactly as designed.

Across multiple business groups, however, each team has its own customized requirements or even its own customized Operators, and this approach starts to fall short. For this reason we introduced some other mechanisms to meet this need.

The optimized approach

User hierarchy

In the multi-tenant Kubeflow training platform we are building now, the resource layer first aggregates GPU resources into one or more compute clusters, and on top of these GPU clusters we provide multiple user clusters, which are also Kubernetes clusters. Users reach the underlying compute clusters through Virtual Kubelet. The design is split into two layers:

1. In the user clusters, the tenant administrators manage and create the clusters, define their own Operators or Controllers, and users access the Kubernetes-native APIs of their own tenant;

2. In the underlying compute clusters, scheduling is unified and centralized to improve resource utilization.

When a task is submitted, it is forwarded to the compute cluster through the Virtual Kubelet, the virtual node we implemented on Kubernetes. At the compute layer, different tenants and their tasks are isolated by Namespace; for specific requirements some nodes can also be carved out into small dedicated resource pools, but most isolation is done with Namespaces, as sketched below.
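A minimal sketch of this per-tenant isolation is a Namespace plus a ResourceQuota capping GPU usage in the compute cluster. The names and quota values are hypothetical.

```yaml
# Sketch: per-tenant isolation in the compute cluster. A Namespace separates the
# tenant's workloads and a ResourceQuota caps its GPU footprint.
# Names and quota values are hypothetical.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-gpu-quota
  namespace: tenant-a
spec:
  hard:
    requests.nvidia.com/gpu: "64"
    pods: "500"
```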

Upper-level users, on the other hand, keep their own custom permissions and can develop their own components, so permissions are effectively separated between the two layers.

The virtual node here is our implementation of Virtual Kubelet. Virtual Kubelet is an open source implementation of the Kubernetes kubelet interface, originally developed and maintained by a Microsoft team and later donated to the CNCF as a sandbox project. Behind a Virtual Kubelet sits a pluggable provider, which can connect to ACI, AWS Fargate, IoT scenarios, and so on; OpenStack can also be plugged in.

What we did is essentially add a new, fairly simple provider: when a user creates a Pod or requests other resources, we forward the request directly to the underlying Kubernetes cluster. The Virtual Kubelet also watches the state of the resources it cares about in the lower cluster and reports that state back to the upper layer. It acts as a bridge, connecting the two layers and keeping the overall state in sync.
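In terms of plain Kubernetes objects, a Pod bound for the compute cluster simply needs to land on the virtual node, for example via a node selector and a toleration for the virtual node's taint. The label and taint keys below follow common Virtual Kubelet conventions and are assumptions about this particular setup rather than our exact configuration.

```yaml
# Sketch: steering a Pod onto the virtual node so it is forwarded to the underlying
# compute cluster. The label and taint keys are the usual Virtual Kubelet
# conventions, assumed here for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: forwarded-training-pod
spec:
  nodeSelector:
    type: virtual-kubelet                 # label assumed to be set on the virtual node
  tolerations:
  - key: virtual-kubelet.io/provider      # default Virtual Kubelet taint key
    operator: Exists
    effect: NoSchedule
  containers:
  - name: worker
    image: example.registry/tf-train:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```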

This picture is a fairly complete architecture diagram of the user clusters. The user clusters on the left connect through Virtual Kubelet to the different compute clusters at the bottom. These compute clusters all hold the GPU resources and are themselves Kubernetes clusters.

If a user's needs are simple, they can directly use the recommended components and get a simple overall control strategy. If a user wants to run tasks with their own controller defining all the rules, they can also plug in Operator resources such as MPI-Operator and TF-Operator.

One more point: when a user submits an MPIJob, the MPI-Operator converts it into resources such as a ConfigMap, a Secret, RoleBindings, and multiple Pods.

Once this conversion is done, the scheduler places the Pods onto a specific virtual node. When the Virtual Kubelet sees resources bound to its own node, it forwards them to the corresponding underlying Kubernetes cluster and then watches the status of those Pods and other resources, completing the forwarding and synchronization loop.

Administrators of the upper-level clusters no longer need to care about the state of the lower-level clusters or their permission control; they only look after the upper layer. Administrators of the lower layer focus on overall resource usage. This creates a clean separation between the two layers.

Improve resource utilization

Having covered the separation between the two layers and the multi-tenant design, let's turn to why we pooled all the resources in the first place: to improve resource utilization. Once all resources are pooled, how do we actually raise utilization? And thanks to the multi-tenant mechanism introduced above, users do not need to be aware of any of this.

In deep learning and machine learning scenarios, most tasks require batch scheduling, meaning multiple Pods must be scheduled at the same time. The core algorithm is all-or-nothing: either the whole job gets its resources, or nothing is scheduled and the job queues up, so resources are never held by a half-scheduled job that then starves. This is a fairly common requirement.

We mainly use Volcano to do this gang scheduling.
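A minimal sketch of what that looks like with Volcano is below: a PodGroup declares the minimum number of Pods that must be schedulable together, and the job's Pods use the Volcano scheduler and reference that group. The names and sizes are illustrative, and the group-name annotation key is the kube-batch/Volcano convention assumed here.

```yaml
# Sketch: gang scheduling with Volcano. The PodGroup's minMember enforces
# all-or-nothing scheduling; each of the job's pods uses schedulerName: volcano
# and points at the group. Names and sizes are illustrative.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: horovod-example
  namespace: tenant-a
spec:
  minMember: 5                     # e.g. 1 launcher + 4 workers, scheduled together or not at all
---
apiVersion: v1
kind: Pod
metadata:
  name: horovod-worker-0
  namespace: tenant-a
  annotations:
    scheduling.k8s.io/group-name: horovod-example   # assumed annotation key for group membership
spec:
  schedulerName: volcano
  containers:
  - name: worker
    image: example.registry/horovod-train:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```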

Introducing task priorities

We also introduced task priorities. As the name implies, each user or tenant is given high-priority tasks that are guaranteed to be scheduled within a fixed time. When overall cluster utilization is low, or some allocated capacity sits idle, we also open up low-priority tasks: users can submit these elastic, low-priority tasks.

Once such tasks are submitted, the idle resources are taken up by low-priority tasks. When high-priority tasks arrive, they preempt the low-priority ones, keeping the whole resource pool as full as possible.
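In Kubernetes terms this maps naturally onto two PriorityClasses, one for guaranteed tasks and one for preemptible elastic tasks, which Pods then reference via priorityClassName. The names and values below are illustrative, not our actual settings.

```yaml
# Sketch: a high-priority class for guaranteed tasks and a low-priority class for
# elastic tasks that may be preempted when guaranteed work arrives.
# Class names and values are illustrative.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high
value: 100000
globalDefault: false
description: "Guaranteed training tasks; may preempt low-priority tasks."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-low
value: 1000
globalDefault: false
description: "Elastic tasks that only consume otherwise idle resources."
```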

Strategy optimization

The last point is a set of optimization strategies, such as topology-aware scheduling based on the network topology or the GPU topology, and using binpack to reduce fragmentation in the underlying cluster so that more tasks can be scheduled as quickly as possible.
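As one example of the binpack idea, the sketch below enables Volcano's binpack plugin in the scheduler configuration so that Pods are packed onto fewer nodes and GPU fragmentation shrinks. The plugin list and weight are illustrative, not a recommended production configuration.

```yaml
# Sketch: turning on the binpack plugin in the Volcano scheduler configuration.
# The plugin tiers and the weight value are illustrative only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
    - plugins:
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
        arguments:
          binpack.weight: 10
```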

Other optimizations, such as speeding up MPIJob startup, get tasks running sooner and reduce idle computing resources in the lower layer.

In addition, as we all know, GPU tasks are generally run with nvidia-docker2. Our analysis showed that with nvidia-docker2, Pod startup is slower than with plain runC. After looking into many possible causes, the main one is that every pod or container that starts queries the CUDA and NVIDIA driver version information and then goes through the NVIDIA container CLI prestart hook.

In our scenario this is largely unnecessary: we run a private cloud, and a private cloud rarely sees hardware swaps or frequent driver version changes. So we can simplify.

How do we simplify it? The easiest way is to cache all the CUDA and NVIDIA driver information in a fixed file, and only re-query and refresh the cache when the machine restarts or the driver changes. This greatly reduces how often the information has to be fetched each time a Pod is created.

As the figure shows, fetching this information takes a few hundred milliseconds.

With this optimization, overall Pod startup time improves by roughly 30% to 40%, which makes a real difference to the user experience.

Of course, this only saves a few hundred milliseconds. In deep learning scenarios the images themselves are large, since they bundle CUDA and the NVIDIA libraries, so optimizing Docker image loading, reducing image pulls, and doing some pre-distribution and pre-deployment is another point we care about a lot. Beyond what can be done at the scheduling level, a popular approach in the industry is lazy loading.

A Docker image consists of multiple layers plus metadata. Studies have found that most of the content in an image is never used, and typically only 10% to 20% of it is actually needed.

So we do lazy loading: content is loaded only when it is actually used. This is still a fairly cutting-edge, experimental feature, and we are heavily involved in its development. There will be more progress to share later, so stay tuned.

Summary

Based on Kubeflow's current architecture, the existing components that support multi-tenancy, and the optimization strategies described above, there is still a lot of work to do to improve the overall user experience. For example, elastic training tasks, built on Kubeflow and Horovod itself, can scale dynamically to take up more resources and shorten users' training time, which is very important.

In addition, we are also actively contributing to the v1 release of MPI-Operator itself, and we hope it can be released as soon as possible.
