An article explaining the infrastructure behind large AI models

Author: Wang Jin

Most people are already familiar with the infrastructure behind cloud computing. Since large AI models entered the public eye, the infrastructure behind artificial intelligence algorithms has also become a frequent topic of discussion. So what kind of infrastructure supports the training and operation of these models? Let's start with how the computing power of that infrastructure is measured.

Starting with the computing power of the blockchain

Before Bitcoin appeared, most people had no particular concept of computing power. When buying a computer, they only vaguely knew that the higher the CPU frequency, the more instructions it could execute per second and the better the machine performed. It was only when Bitcoin tied Proof of Work computation directly to money that the concept of computing power began to appear explicitly in all kinds of projects and software, and people started using it to compare hardware. Bitcoin's computation is mainly hashing, and the computing power of proof-of-work is measured by the hash rate, that is, how many hash values can be computed per second.
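
To make "hash rate" concrete, here is a minimal sketch that uses Python's standard hashlib to compute SHA-256 hashes in a loop on one CPU core and report how many it manages per second; the numbers it prints are purely illustrative, and real mining hardware is built very differently.

```python
# A minimal sketch of measuring a hash rate: how many SHA-256 hashes one CPU
# core computes per second (illustrative only; real miners use dedicated ASICs).
import hashlib
import time

def measure_hash_rate(duration_s=1.0):
    count, start = 0, time.perf_counter()
    while time.perf_counter() - start < duration_s:
        hashlib.sha256(count.to_bytes(8, "little")).digest()  # one hash
        count += 1
    return count / duration_s

print(f"~{measure_hash_rate():,.0f} hashes/second on one CPU core")
```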

So what does the computing power of hash calculation have to do with the computing power of large AI models? Bear with me. Let me first use an everyday cooking analogy to explain the hashing process:

To make beef fried rice, you first prepare the beef and the rice separately: cut the beef into pieces and fry it in a pan, adding condiments such as salt, pepper, and soy sauce along the way to adjust the texture and flavor. Finally, add the cooked rice to the pan and stir-fry it with the beef, and you get a plate of delicious beef fried rice.

This process of mixing various ingredients and condiments is like hash calculation: a one-way, irreversible process, since beef fried rice cannot be turned back into beef and rice. And what does Bitcoin have to do with it? Continuing the fried rice example, the entire Bitcoin network releases a task every so often: whoever combines the fresh ingredients from the last 10 minutes (collects the transaction ledger) and is the first to fry a plate of rice with the magic flavor gets a bitcoin. So what do you do? Fry a few more plates, each with slightly different ingredients, and taste each one for the magic flavor. The first person to fry the magic-flavored rice has proved their workload and receives the corresponding reward.

From this example, we can see that the more plates of fried rice you can make per unit time, the higher your chance of hitting the magic flavor.

At this point, someone will think of a trick: if I set up more stoves and fry in parallel, won't it be faster? Yes, and this is exactly the difference between CPU and GPU computing power. The CPU is like the gas stove at home: there are only a few burners (few cores), but the flame is strong and it can handle all kinds of complicated dishes, though it can only cook a few per unit time. The GPU is like a bank of small stoves with weak flames, each just enough to cook fried rice; because there are so many of them (many cores), the number of plates it can fry per unit time is huge. Remember the GPU's large core count, we will come back to it later.

At the same time, the fried rice perspective also shows why Proof of Work in the blockchain is so wasteful: in the end, only the plate with the magic flavor is useful, and every other plate is tasted and thrown away.

GPU computing basics

The deep learning models behind large AI models run on neural networks, and the computation of a neural network can be viewed as a combination of a huge number of matrix operations. Matrix computation differs from ordinary logic computation in that multi-core parallel computing can greatly speed it up, and this is where the GPU's many cores come in handy: a 10x10 matrix computation is like 100 stoves, each with a pot on it, into which you can put 100 ingredients and cook them all at once. The biggest difference from proof-of-work computation is that the dishes cooked by these 100 stoves are all different and all useful, unlike the proof-of-work scenario in the blockchain, where 99.999% of the rice is thrown away.
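
To make the "many stoves" point concrete, here is a minimal sketch, assuming PyTorch and an NVIDIA GPU with CUDA are available, that times the same large matrix multiplication on the CPU and on the GPU; the matrix size is arbitrary and the exact speedup depends entirely on the hardware.

```python
# A minimal sketch comparing the same matrix multiplication on CPU and GPU
# (assumes PyTorch and a CUDA-capable NVIDIA GPU; sizes are arbitrary).
import time
import torch

n = 4096
a_cpu, b_cpu = torch.randn(n, n), torch.randn(n, n)

t0 = time.perf_counter()
a_cpu @ b_cpu                       # CPU: a few strong "burners"
cpu_s = time.perf_counter() - t0

a_gpu, b_gpu = a_cpu.cuda(), b_cpu.cuda()
torch.cuda.synchronize()
t0 = time.perf_counter()
a_gpu @ b_gpu                       # GPU: thousands of small "stoves" working in parallel
torch.cuda.synchronize()            # wait for the asynchronous kernel to finish
gpu_s = time.perf_counter() - t0

print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s")
```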

However, the GPU was not always this capable. Before NVIDIA's CUDA appeared, the GPU could be said to have no computing advantage over the CPU. CUDA (Compute Unified Device Architecture), launched by NVIDIA in June 2007, is a parallel computing platform and programming model built on NVIDIA GPUs. Developed to support NVIDIA's high-performance computing GPUs, it lets developers write parallel programs in languages such as C/C++ and Python and make full use of the GPU's parallel computing capability, that is, the many-core capability mentioned above.
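
As a small taste of the CUDA programming model from Python, here is a hedged sketch using the Numba library (an assumption, not something the original article uses) to launch a vector-addition kernel across many GPU threads at once, one element per thread.

```python
# A minimal sketch of CUDA-style parallelism from Python, assuming the Numba
# library and a CUDA-capable NVIDIA GPU are installed.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)               # global thread index
    if i < out.size:               # guard threads that fall past the end
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

d_a, d_b = cuda.to_device(a), cuda.to_device(b)   # copy inputs to the GPU
d_out = cuda.device_array_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](d_a, d_b, d_out)  # one thread per element

assert np.allclose(d_out.copy_to_host(), a + b)
```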

In 2017, to further improve the computational efficiency of deep learning models, NVIDIA added a new computing unit called Tensor Cores to the Volta architecture. It has dedicated circuitry for the tensor operations used in deep learning, such as matrix multiplication and convolution, which improves computational efficiency. Tensor Cores can therefore be regarded as hardware accelerators dedicated to deep learning computation.
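
A common way to let suitable NVIDIA GPUs route matrix work onto Tensor Cores is mixed precision. The sketch below, assuming PyTorch and a Tensor Core-equipped GPU, runs a matrix multiplication under autocast so it executes in FP16; the sizes are arbitrary and chosen only for illustration.

```python
# A minimal sketch of mixed-precision matrix multiplication with PyTorch;
# on Volta-and-later GPUs this kind of FP16 work is eligible for Tensor Cores.
import torch

assert torch.cuda.is_available()
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b            # executed in FP16 under autocast

print(c.dtype)           # torch.float16
```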

A visual demonstration of the acceleration effect is shown in the animation below, and it is quite striking:

Seen this way, the GPU is the basis of the computing power that supports the training of general-purpose large AI models. Let's break down the components of a GPU and see which of them have a decisive impact on computing power.

  • CUDA Cores: general-purpose computing units that mainly perform standard floating-point operations such as addition, multiplication, and vector operations. They are designed for highly concurrent, data-parallel computing and can perform a large number of floating-point operations at the same time. The number of CUDA cores determines the GPU's parallel processing capability.
  • Tensor Cores: dedicated execution units designed for tensor and matrix operations. They use low-precision floating point (such as FP16) to perform matrix multiply-accumulate operations, which can greatly improve the training speed and efficiency of deep learning models. Introduced with the Volta architecture, this compute unit is designed for deep learning and has continued to evolve in the later Turing and Ampere architectures.
  • Video memory: temporarily stores the data the GPU is about to process and the data it has already processed. The larger the video memory, the larger the data and programs the GPU can handle, thereby increasing the usable computing power (a short sketch of querying these properties from code follows this list).
  • Video memory bandwidth: the data transfer rate between the GPU chip and its video memory, measured in bytes per second. Memory bandwidth is one of the most important factors determining a graphics card's performance and speed.
  • NVLink: a high-speed interconnect technology introduced by NVIDIA for data transfer between multiple GPUs. NVLink is much faster than PCIe: a PCIe 3.0 x16 link offers roughly 16 GB/s in each direction (PCIe 4.0 doubles that), while NVLink 2.0 provides up to 300 GB/s of aggregate GPU-to-GPU bandwidth, dozens of times more. NVLink also lets a GPU directly access the video memory of other GPUs, greatly improving the performance and efficiency of multi-GPU systems.
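
As promised in the list above, here is a hedged sketch, assuming an NVIDIA GPU and a CUDA-enabled PyTorch build, of reading a few of these figures (SM count and total video memory) at runtime.

```python
# A minimal sketch of querying a GPU's basic properties with PyTorch.
# The per-SM CUDA core count depends on the architecture and is not
# exposed directly here.
import torch

props = torch.cuda.get_device_properties(0)
print("name:              ", props.name)
print("SM count:          ", props.multi_processor_count)
print("video memory (GB): ", props.total_memory / 1024**3)
```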

Schematic of the structure of a typical GPU server with NVIDIA graphics cards (image source: reference [2], Understanding and basic use of GPU memory (video memory))

As the diagram shows, with the continued growth of GPU parallel computing, the PCIe bus has become something of a bottleneck in high-performance computing. In the traditional CPU-centric architecture, every computing unit talks to the CPU, memory, disks, and other devices over the PCIe bus on the motherboard, and its limited bandwidth often cannot meet the demands of large-scale parallel computing. NVIDIA therefore opened another high-speed interconnect channel at the physical level, NVLink, which connects multiple GPUs directly into a high-performance computing cluster. NVLink also shows that raising communication bandwidth is the key to raising computing power in the infrastructure, a point we will return to later.
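
Whether one GPU can reach another's memory directly (over NVLink or PCIe, depending on the machine's topology) can be checked from code. A minimal sketch, assuming at least two NVIDIA GPUs and PyTorch:

```python
# A minimal sketch of checking peer-to-peer access between GPU 0 and GPU 1;
# the interconnect actually used (NVLink vs PCIe) depends on the hardware topology.
import torch

if torch.cuda.device_count() >= 2:
    p2p = torch.cuda.can_device_access_peer(0, 1)
    print(f"GPU 0 -> GPU 1 peer access: {p2p}")
else:
    print("fewer than two GPUs visible")
```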

Computing infrastructure

Having broken down the basic components of GPU computing power, let's look at the computing power infrastructure. Do a few graphics cards constitute computing power infrastructure? No. Infrastructure is a complex, systematic engineering effort. For example, turning natural water into tap water that flows as soon as you open the tap requires a complete water supply project covering the water source, transport, purification, and distribution.

The same is true for GPUs. A computing cluster built from many individual cards provides enormous computing power, and building it is likewise a huge project combining software and hardware. The general-purpose cluster part will not be covered here for reasons of length; we will focus on the GPU-related part. Like all infrastructure, the most critical issue for computing power infrastructure is how to distribute the computing power. Individual cards keep getting more powerful and more expensive, and if computing power is allocated only at the granularity of a whole card, a great deal of it is wasted. It is therefore very necessary to let multiple training tasks run on one graphics card.

vGPU is NVIDIA's virtual GPU software for virtual machine (VM) applications. It supports creating many small virtual GPUs on one physical GPU and making them available to different users. Its internal principle is shown in the following figure:

Virtualization of the physical hardware falls mainly into two parts: storage virtualization and compute virtualization.

  • Storage: a dedicated framebuffer is created in advance, reserving the virtual GPU's storage space.
  • Compute: a time-slice scheduler controls how long each task may use the engines of the physical GPU device.

Time-sliced vGPU essentially lets tasks share the physical device (engine) in turns, and controls each task's share of the whole device by adjusting the virtual GPU's time slice. This approach satisfies some application scenarios, but because the resources on the physical card are shared and all tasks must take turns, it is difficult to guarantee good QoS for compute, bandwidth, and task switching once a whole card has been split up.
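
To illustrate the idea only (this is a conceptual sketch, not NVIDIA's implementation), the snippet below simulates time-sliced sharing: tasks take turns on a single "engine", and each task's share of the device is governed by the slice length.

```python
# A conceptual sketch of time-sliced sharing: tasks queue up, run for at most
# one time slice, and go back to the end of the queue until finished.
from collections import deque

def time_sliced_schedule(tasks, slice_ms, total_ms):
    """tasks: dict of name -> remaining work in ms; returns an execution log."""
    queue = deque(tasks.items())
    log, clock = [], 0
    while queue and clock < total_ms:
        name, remaining = queue.popleft()
        run = min(slice_ms, remaining)          # run for at most one slice
        log.append((clock, name, run))
        clock += run
        if remaining - run > 0:                 # unfinished work re-queues
            queue.append((name, remaining - run))
    return log

for entry in time_sliced_schedule({"train-A": 30, "train-B": 50}, slice_ms=10, total_ms=100):
    print(entry)
```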

Domestic GPU virtualization approaches basically follow the vGPU model. GPU virtualization is generally designed around three questions: is the data secure, is the isolation sufficient, and can QoS be guaranteed. These virtualization schemes have undeniably achieved very good results and have greatly improved cluster resource utilization. But since the underlying hardware cannot be modified or constrained, there are always limits to security and to how fairly resources can be allocated. In addition, splitting and isolating in software wastes some of the physical card's resources, and the finer the split, the more serious the waste.

However, the Ampere-based chips NVIDIA launched in 2020 solve these problems well. Take the much-discussed A100 as an example: the Ampere architecture lets the GPU create sub-GPUs in hardware (such a virtualized GPU is commonly called a GPU Instance, or GI). By partitioning and recombining physical GPU resources such as the system channels, control bus, compute units (TPCs), global video memory, L2 cache, and data bus, each GI achieves data protection, independent fault isolation, and stable quality of service (QoS).

  • GI: creating a GI can be thought of as splitting a large GPU into multiple smaller GPUs, each with dedicated compute and memory resources. Each GI behaves like a smaller, fully functional, independent GPU that includes a predefined number of GPCs (Graphics Processing Clusters), SMs, L2 cache slices, memory controllers, and framebuffer memory.
  • CI: a Compute Instance (CI) is a group within a GI that can be configured with different levels of compute capability; it encapsulates all of the compute resources in the GI that can perform work (GPCs, copy engines, NVDEC units, and so on). The number of CIs in each GI can vary, but given our existing scheduling granularity, we tend to create only one CI per GI and then implement concurrency within that CI by existing means.

The picture below on the left shows the A100 used as a whole card, and the one on the right shows the A100 with MIG enabled:

Because video memory and SMs are partitioned in hardware, a GI's memory and compute can only take a limited number of combinations, and the possible GI partitioning schemes can be enumerated as the 18 types below.

Taking the A100 (40GB) as an example, in the table 7 stands for 7g.40gb, 4 for 4g.20gb, 3 for 3g.20gb, 2 for 2g.10gb, and 1 for 1g.5gb. Following the MIG device naming rules, "1g" means the GI contains one compute slice and "5gb" is the size of the GI's video memory; for the A100 (80GB), only the memory sizes change accordingly.
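
As a hedged illustration of how such a partition is created in practice, the sketch below drives MIG by shelling out to nvidia-smi from Python. The commands follow NVIDIA's MIG user guide, but the exact flags and accepted profile names should be verified against the local driver version; the 3g.20gb profile is just an example, and enabling MIG requires administrative privileges on a MIG-capable GPU.

```python
# A hedged sketch of creating a MIG GPU instance via nvidia-smi; verify the
# commands against NVIDIA's MIG documentation for your driver version.
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Enable MIG mode on GPU 0 (needs admin rights; some setups require a GPU reset).
run(["nvidia-smi", "-i", "0", "-mig", "1"])

# List the GPU instance profiles the card supports (the "18 types" above).
print(run(["nvidia-smi", "mig", "-lgip"]))

# Create a 3g.20gb GPU instance and a default compute instance inside it (-C).
print(run(["nvidia-smi", "mig", "-cgi", "3g.20gb", "-C"]))
```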

From OpenAI's recently disclosed article "Scaling Kubernetes to 7,500 nodes" and its 2018 article "Scaling Kubernetes to 2,500 nodes", we can see that OpenAI's training cluster is also a Kubernetes cluster, and the virtualized GPU computing power is ultimately orchestrated and scheduled by k8s. Kubernetes clusters and cloud native will not be expanded on here for reasons of space, but a few points from the OpenAI articles are worth highlighting (a minimal sketch of submitting a GPU job to such a cluster follows the list):

  1. Model production process: the article mentions that this cluster does not rely heavily on load balancing, so the 7,500-node k8s cluster is a training cluster, not the cluster that ultimately serves end users. The production process of a model is expanded on in the last section.
  2. k8s API Server: as the cluster grew, the API Server was the first component that could not keep up, which is one of the most common scaling problems; switching its backing storage from network-attached SSDs to local SSDs solved it.
  3. Large model images: the container images for large models are huge and the image distribution network could not keep up. Image pre-warming relieved part of the bandwidth pressure, and the remaining problems were left for further optimization. It is worth trying Dragonfly, Alibaba's open-source distributed image distribution solution.
  4. Container networking: in 2018 OpenAI was still weighing how to switch from the Flannel network (~2 Gbit/s) to hostNetwork (10-15 Gbit/s). The more recent article mentions that they have since switched to the CNI plugin that comes with Azure's VMSSes. Public figures suggest that with CNI a single node in Azure Kubernetes Service (AKS) can reach 40 Gbit/s.
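
As promised above, here is a minimal sketch, using the official Kubernetes Python client, of submitting a pod that requests one NVIDIA GPU from such a cluster; the image and pod names are hypothetical placeholders, and the cluster is assumed to run the NVIDIA device plugin that exposes the nvidia.com/gpu resource.

```python
# A minimal sketch of requesting a GPU-backed training pod with the official
# Kubernetes Python client; names and image are placeholders.
from kubernetes import client, config

config.load_kube_config()                          # or load_incluster_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-demo"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="my-registry/trainer:latest",   # hypothetical image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # One whole GPU; a MIG slice would be requested via its
                    # own resource name exposed by the device plugin.
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```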

High-performance computing

Step by step, we have gone from principles to structure in understanding what makes up computing power infrastructure. Meanwhile, with the explosion of large AI models, training data sets keep growing and training cycles keep lengthening, so almost all training is now multi-machine and multi-card. At this point the communication bandwidth mentioned earlier becomes the bottleneck that limits computing power.

The professional field that tackles this problem is called High-Performance Computing (HPC), and it aims to improve communication efficiency between computing units. Since OpenAI has not disclosed these details, let's look at the practice of Alibaba Cloud's high-performance network, EFlops.

EFlops optimizes network performance at two layers: communication within a server and network communication between servers.

Within a server, the optimization mainly addresses PCIe communication congestion in various scenarios and improves communication efficiency.

Between servers, EFlops uses an RDMA (Remote Direct Memory Access) network as the communication fabric, while its self-developed communication library ACCL (Alibaba Collective Communication Library) provides general-purpose distributed multi-machine, multi-card collective communication. ACCL implements an innovative congestion-free algorithm over high-speed network transport, and its performance approaches or even surpasses other industry-leading communication libraries. ACCL is also compatible with NVIDIA's NCCL communication library to improve its generality.

Like other collective communication libraries, ACCL supports a variety of commonly used deep learning frameworks and platforms. For the underlying data transport, because RDMA offers low latency and high throughput, data moves between nodes mainly via RDMA / GPUDirect RDMA (GDR); within a node, interconnects such as PCIe and NVLink are also supported.
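
To show what "collective communication" looks like from the framework side, here is a minimal sketch using PyTorch's distributed package with the NCCL backend (ACCL exposes a compatible collective interface, per the text above); it assumes the script is launched with torchrun so that the rank and rendezvous environment variables are set, and that one GPU is available per process.

```python
# A minimal sketch of an all-reduce across GPUs with torch.distributed and NCCL.
# Launch, for example, with: torchrun --nproc_per_node=2 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")        # NCCL for GPU collectives
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes a tensor; all_reduce sums them across all GPUs,
    # which is the core communication pattern of data-parallel training.
    t = torch.ones(4, device="cuda") * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```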

AI full-stack platform PAI

That basically covers the infrastructure itself. In reality, though, computing power infrastructure cannot simply be compared to utilities such as water, electricity, or coal. Computing power is a very abstract thing; this kind of resource does not exist ready-made in nature.

If you insist on comparing it to a waterworks, you could say the input is massive amounts of data plus the thinking and ideas of data scientists; the intermediate process uses the engineering the scientists carefully design, together with the computing power of a large number of graphics cards, to carry out large-scale model training; and the final output is a model interface that can provide question-answering services. Model training is therefore only the most important intermediate link of the whole AI algorithm production process. The complete process includes model development, model training, and model deployment, and the question-answering services everyone uses online are the product of the last link, model deployment.

So if a very capable AI model is to be produced, a complete production pipeline is essential, which is why Alibaba Cloud provides the full-stack AI platform PAI.

Alibaba Cloud's machine learning product PAI (Platform of Artificial Intelligence) gives enterprise customers and developers an easy-to-use, cost-effective, high-performance, and scalable machine learning / deep learning engineering platform with plug-ins for various industry scenarios. It has more than 140 built-in optimized algorithms and provides end-to-end AI engineering capabilities covering data labeling (PAI-iTAG), model building (PAI-Designer, PAI-DSW), model training (PAI-DLC), compilation optimization, and inference deployment (PAI-EAS).

In the model development stage, modeling can be done with two development tools, PAI-Designer and PAI-DSW.

  • PAI-Designer is a code-free development tool that provides classic machine learning algorithm components such as regression, classification, clustering, and text analysis.
  • PAI-DSW is an interactive machine learning programming and development tool suitable for developers of all levels.

In the model training phase, PAI-DLC provides a one-stop cloud-native deep learning training platform.

  • PAI-DLC supports multiple algorithm frameworks, ultra-large-scale distributed deep learning jobs, and custom algorithm frameworks, and is flexible, stable, easy to use, and high-performance.

In the model deployment phase, PAI-EAS provides online prediction services, and PAI-Blade provides inference optimization services.

  • PAI-EAS lets users deploy machine learning models (PMML, PAI-OfflineModel, TensorFlow, Caffe, etc.) as services with one click, and also lets users develop customized online services that follow the interface specification defined by EAS.
  • PAI-Blade: all of Blade's optimization techniques are designed to be general-purpose and can be applied to different business scenarios; through joint model and system optimization, a model can achieve optimal inference performance.

In the entire PAI platform, we can divide the architecture into three layers:

  • The top layer is the Model-as-a-Service that has been much discussed recently, which provides directly usable model services to end users.
  • The PaaS layer is the platform layer: the product platforms built around the model life cycle, from model development through model training to final model deployment.
  • The IaaS layer is the infrastructure layer this article has spent most of its time on; in PAI it consists mainly of Lingjun intelligent computing and elastic ECS clusters, which provide computing power, networking, storage, and so on.

Finally, as large AI models gradually penetrate every industry, many fields are undergoing dramatic change. Powered by the computing power behind large models, qualitative change is taking place across industries, with efficiency gains and digital-intelligence upgrades continually being realized. Although there is a gap between domestic computing power infrastructure and the international state of the art, it is not that large; we have been advancing on multiple fronts, gradually breaking through performance bottlenecks and continuously pushing AI computing power upward. In the process, everyone is welcome to imagine all kinds of digital-intelligence scenarios and put forward higher demands on the computing power our platform provides. We are happy to take on these challenges, and together let us welcome the arrival of the artificial intelligence era.

References:

[1] Development and characteristic analysis of GPU hardware---Summary of Tesla series
https://zhuanlan.zhihu.com/p/515584277

[2] Understanding and basic use of GPU memory (video memory)
https://zhuanlan.zhihu.com/p/462191421

[3] Introduction to Eflops Hardware Cluster_Product Introduction_Machine Learning_Agile Edition General Version
https://help.aliyun.com/apsara/agile/v_3_6_0_20210705/learn/ase-product-introduction/hardware-cluster.html

[4] EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform
https://ieeexplore.ieee.org/document/9065603

[5] ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library
https://ieeexplore.ieee.org/document/9462480

[6] Scaling Kubernetes to 7,500 nodes
https://openai.com/research/scaling-kubernetes-to-7500-nodes

[7] Scaling Kubernetes to 2,500 nodes
https://openai.com/research/scaling-kubernetes-to-2500-nodes

[8] Visualized Modeling in Machine Learning Designer - Machine Learning Platform for AI - Alibaba Cloud Documentation Center
https://www.alibabacloud.com/help/en/machine-learning-platform-for-ai/latest/visualized-modeling-in-machine-learning-studio

[9] DSW Notebook Service - Machine Learning Platform for AI - Alibaba Cloud Documentation Center
https://www.alibabacloud.com/help/en/machine-learning-platform-for-ai/latest/dsw-notebook-service

[10] EAS Model Serving - Machine Learning Platform for AI - Alibaba Cloud Documentation Center
https://www.alibabacloud.com/help/en/machine-learning-platform-for-ai/latest/eas-model-serving

[11] Inference Acceleration (Blade) - Machine Learning Platform for AI - Alibaba Cloud Documentation Center
https://www.alibabacloud.com/help/en/machine-learning-platform-for-ai/latest/pai-blade-and-inference-optimization-agile-edition
