qGPU container product fully launched, with GPU online/offline co-location released

Author

Xu Bei, Tencent Cloud container technology expert and lead for heterogeneous computing containers at Tencent Cloud, with many years of front-line experience in cloud computing architecture design and R&D. Long focused on Kubernetes, online/offline co-location, and GPU containerization; author of the Kubernetes Memory QoS KEP and an active Kubernetes contributor.

Summary

qGPU is a GPU sharing technology launched by Tencent Cloud. It lets multiple containers share a GPU card, allocates compute by percentage and video memory at MB granularity with strong isolation, and comes with the industry's only GPU online/offline co-location technology. It maximizes GPU utilization while fully guaranteeing business security and stability.

qGPU has served a large number of internal and external customers and has helped many AI companies significantly cut GPU costs. The qGPU container virtualization product is now fully launched on Tencent Cloud TKE.

Apart from NVIDIA itself, Tencent Cloud is the first in the industry to support fine-grained compute allocation with strong isolation. qGPU can limit compute at a granularity of 1%, and it allocates and limits compute strictly according to the configured ratios. Even when GPU resources are very tight, the compute allocated to each workload is guaranteed to remain unaffected. Relying on this capability, enterprise users can deploy workloads as densely as possible and make full use of GPU resources without worrying about negative impact on the business.

qGPU relies on TKE's self-developed scheduler and device manager to support allocation and scheduling of per-card percentage compute and MB-level video memory on TKE Kubernetes clusters. While keeping cluster resource allocation and load optimal, enterprise AI tasks can use GPU resources at a finer granularity.
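To make the idea of fractional GPU scheduling concrete, here is a minimal sketch of how a scheduler might place a request for a compute percentage and an MB amount of video memory onto a card. It is not TKE's scheduler: the `GpuCard` class, the `place` function, and the best-fit policy are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuCard:
    """Hypothetical view of one physical GPU card and what is already allocated."""
    name: str
    memory_mb: int            # total video memory on the card, in MB
    used_percent: int = 0     # compute already allocated, in percent (0-100)
    used_memory_mb: int = 0   # video memory already allocated, in MB

    def fits(self, percent: int, memory_mb: int) -> bool:
        # A request fits only if both compute and memory stay within the card.
        return (self.used_percent + percent <= 100
                and self.used_memory_mb + memory_mb <= self.memory_mb)


def place(cards: List[GpuCard], percent: int, memory_mb: int) -> Optional[GpuCard]:
    """Best-fit placement (an assumed policy): choose the fitting card that
    would have the least free compute left, to keep other cards empty."""
    if not 1 <= percent <= 100 or memory_mb <= 0:
        raise ValueError("percent must be 1-100 and memory_mb must be positive")
    candidates = [c for c in cards if c.fits(percent, memory_mb)]
    if not candidates:
        return None  # no card can host this request
    best = min(candidates, key=lambda c: 100 - c.used_percent - percent)
    best.used_percent += percent
    best.used_memory_mb += memory_mb
    return best
```

With two empty 16384 MB cards, requests of 30% and then 50% would both land on the first card under this policy, and a further 30% request would go to the second card because the first has only 20% compute left.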

qGPU implements QoS at the GPU hardware level (rather than by intercepting and controlling calls at the CUDA API level). It controls video memory allocation at MB granularity and provides fine-grained, strong compute isolation, minimizing the performance loss that GPU sharing causes to the business. With this technology, qGPU addresses isolation across all three dimensions: faults, video memory, and compute.

In addition, Tencent Cloud qGPU combines online/offline co-location with GPUs and is the first in the industry to propose the concept of GPU online/offline co-location, advancing GPU container sharing technology to the next era.

Online services are usually inference services, while offline workloads may be inference or training. The main forms of online/offline co-location are therefore inference + inference and inference + training. Without effective technical means, guaranteeing the QoS of an online service requires giving it exclusive use of a GPU card, which leads to low utilization. With qGPU's online/offline co-location capability, users can safely deploy online services and other workloads on the same GPU card. While the resources are shared and multiplexed, the healthy and stable operation of the online services is fully guaranteed.

Tencent Cloud qGPU is a breakthrough in improving GPU utilization through online/offline co-location. Using its fine-grained compute isolation technology and its original compute-optimal scheduling technology, it can effectively raise GPU utilization to as much as 100% while guaranteeing the compute QoS of online tasks, which greatly reduces wasted compute.
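The co-location idea can be illustrated with a small sketch: within one scheduling period, the online task is served first up to its guaranteed share, and the offline task only receives whatever compute is left. This is an assumed, simplified model, not qGPU's actual scheduling algorithm; the function name `share_compute` and its parameters are hypothetical.

```python
from typing import Tuple


def share_compute(online_guarantee: int,
                  online_demand: int,
                  offline_demand: int) -> Tuple[int, int]:
    """Split one card's compute (in percent) for a single scheduling period.

    Assumption: the online task is always served first, up to its guarantee;
    the offline task may only use the compute the online task leaves idle.
    """
    online = min(online_demand, online_guarantee)
    offline = min(offline_demand, 100 - online)
    return online, offline
```

Under this model, an online task with a 60% guarantee that currently needs only 20% leaves 80% for a busy offline task, so the card is fully used. When the online task's demand rises back to 60%, the offline share shrinks to 40% in the next period.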

Conclusion

Heterogeneous computing has become an industry consensus. Within it, the GPU occupies a dominant position in AI heterogeneous computing thanks to its powerful compute and mature ecosystem. Facing expensive AI computing resources, enterprises are eager for technical means to reduce costs and increase efficiency.

Tencent Cloud qGPU is rooted in the AI field and, relying on technical products such as fine-grained GPU resource scheduling, strong GPU resource isolation, and GPU online/offline co-location, carries huge commercial value.

qGPU container virtualization: https://cloud.tencent.com/document/product/560/66232

About us

For more cloud native cases and knowledge, follow the public account of the same name [Tencent Cloud Native].

Benefits:

① Reply [Manual] in the official account's chat to get the "Tencent Cloud Native Roadmap Manual" and "Tencent Cloud Native Best Practices".

② Reply [series] in the official account's chat to get the collection "15 series, 100+ highly practical original cloud native articles", covering Kubernetes cost reduction and efficiency improvement, K8s performance optimization practices, best practices, and other series.

③ Reply [White Paper] in the official account's chat to get the "Tencent Cloud Container Security White Paper" and "The Source of Cost Reduction - Cloud Native Cost Management White Paper v1.0".

④ Reply [Introduction to the Speed of Light] in the official account's chat to get a 50,000-word tutorial by Tencent Cloud experts: a quick introduction to Prometheus and Grafana.

⑤ Reply [Selected Collection] in the official account's chat to get the 40,000-word "Tencent Cloud Technology Practice Collection 2021", featuring talks from 24 Tencent Cloud experts.

[Tencent Cloud Native] New products, new techniques, new events, and news: scan the code to follow the public account of the same name and get more practical content in time!
