Multi-GPU training of large models: resource allocation and optimization techniques | NVIDIA will launch improved chips HGX H20, L20 PCIe, and L2 PCIe for China

★Keywords: large models; artificial intelligence; data parallelism; model parallelism; pipeline parallelism; mixed precision training; gradient accumulation; CPU offloading; activation recomputation; model compression; memory-efficient optimizers; NVIDIA; A100; H100; A800; H800; L40S; mixture of experts (MoE); 910B; HGX H20; L20 PCIe; L2 PCIe

In the field of artificial intelligence, large models have attracted wide attention for their powerful predictive and generalization capabilities. However, as model scale keeps expanding, computing resources and training time have become the main constraints on their development, and after the NVIDIA ban in particular, China's AI computing industry faces unprecedented difficulties. In response, NVIDIA plans to launch new AI chips for the Chinese market that comply with U.S. export restrictions. This article explores how to train large models on multiple GPUs and analyzes the impact of the NVIDIA ban on China's AI computing industry.

How to train large models on multiple GPUs?

The training of a neural network is an iterative process. In each iteration, data is first propagated forward through the layers of the model to compute the output for each training sample. Gradients are then propagated backward to measure how much each parameter affects the final output. The averaged gradients and the current optimizer state are passed to an optimization algorithm such as Adam, which computes the new parameters and optimizer state for the next iteration. As training proceeds, the model gradually evolves to produce more accurate outputs.

However, with the advent of large models, training can no longer be completed on a single machine. Parallelization techniques have emerged to divide the training process along different dimensions, based on strategies such as data parallelism, pipeline parallelism, tensor parallelism, and mixture of experts. In addition, because machine and memory resources are limited, strategies such as mixed precision training, gradient accumulation, offloading the model to the CPU, recomputation, model compression, and memory-efficient optimizers have appeared.

To further speed up training, parallel processing can be applied along both the data and the model dimensions. A common approach is to split the data and copy the same model to multiple devices so that each processes a different data shard; this is called data parallelism. Another approach is model parallelism, which distributes the operators of the model across multiple devices (including pipeline parallelism and tensor parallelism). When training very large language models, data and model must be split simultaneously to reach a higher degree of parallelism; this is often called hybrid parallelism. Through these parallel strategies, the training speed and efficiency of neural networks can be significantly improved.

1. Data parallelism

In a data-parallel system, each computing device holds a complete copy of the neural network model. In each iteration, a device is responsible only for its subset of the batch and performs the forward computation on that subset. Assume a batch contains N training samples and M devices are used in parallel; each device then processes N/M samples. After the forward pass, each device computes the error gradient Gi (where i is the accelerator card index) from its local samples and exchanges it with the other devices. Every device aggregates the gradients provided by the other accelerator cards and uses the averaged gradient (G1 + G2 + ... + GM)/M to update the model, completing the training of that batch.

A data-parallel training system can significantly increase overall training throughput (global batches processed per second) by adding computing devices. Compared with training on a single device, the main difference is that the gradients computed in the backward pass must be synchronized across all devices, so that every device ends up with the same averaged gradient.
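
As a rough sketch of this scheme (assuming PyTorch, the NCCL backend, a toy model, and a launch via `torchrun`; the function and variable names are illustrative), each process holds a full model replica, computes gradients on its own data shard, and averages them with an all-reduce before the update:

```python
# Minimal data-parallel sketch (one process per GPU, launched e.g. with
# `torchrun --nproc_per_node=M train.py`).
import torch
import torch.distributed as dist
from torch import nn

def train_step(model, batch, optimizer, world_size):
    inputs, targets = batch
    outputs = model(inputs)                      # forward pass on the local shard (N/M samples)
    loss = nn.functional.cross_entropy(outputs, targets)
    optimizer.zero_grad()
    loss.backward()                              # local gradient G_i
    for p in model.parameters():                 # synchronize: sum gradients across devices
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size                     # averaged gradient (G_1 + ... + G_M) / M
    optimizer.step()                             # identical update on every replica
    return loss.item()

if __name__ == "__main__":
    dist.init_process_group("nccl")              # one process per GPU
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)
    model = nn.Linear(1024, 10).cuda()           # every device holds a full model copy
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(32, 1024).cuda()             # stand-in for this rank's data shard
    y = torch.randint(0, 10, (32,)).cuda()
    train_step(model, (x, y), optimizer, world_size)
```

In practice, torch.nn.parallel.DistributedDataParallel performs the same gradient all-reduce automatically and overlaps it with the backward pass.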

2. Model parallelism

From the perspective of the computational graph, model parallelism can be divided into two approaches: pipeline parallelism and tensor parallelism.

1. Pipeline parallelism

Pipeline parallelism (PP) is a computing strategy that partitions the layers of the model into multiple stages and processes them on different computing devices, so that earlier and later stages can work in a pipelined fashion. PP is widely used in parallel systems for large-scale models to solve the problem of insufficient memory on a single device. The figure below shows a PP system consisting of four computing devices, covering both forward and backward computation: F1, F2, F3, and F4 denote the four forward stages, each on a different device, and B4, B3, B2, and B1 denote the corresponding backward stages, also on four different devices. Because a downstream device must wait for the upstream device to finish before it can start its own computation, average device utilization drops, forming what are called model-parallel bubbles or pipeline bubbles.

This naive pipeline strategy produces parallel bubbles, so the system cannot fully utilize its computing resources and overall efficiency drops. To reduce the bubbles, the mini-batch can be further divided into smaller micro-batches, and each micro-batch is processed through the pipeline. As soon as the current stage finishes a micro-batch and obtains its result, it sends the result to the downstream device and immediately starts processing the next micro-batch, which reduces the bubbles to a certain extent. As shown in the figure below, the forward computation F1 is broken down into F11, F12, F13, and F14. After F11 finishes on computing device 1, F21 starts on computing device 2 while F12 starts in parallel on computing device 1. Compared with the naive pipeline-parallel method, the parallel bubbles are effectively reduced.
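
The following sketch illustrates only the data flow of this micro-batch scheme (assuming PyTorch, four stages, and toy layer sizes); a real pipeline runtime such as GPipe or PipeDream would run the micro-batches of different stages concurrently instead of in a plain Python loop:

```python
# GPipe-style data-flow sketch: the model is split into stages and each mini-batch
# is split into micro-batches, so a downstream stage can start as soon as the first
# micro-batch is ready (the concurrency itself is handled by a real pipeline runtime).
import torch
from torch import nn

num_micro_batches = 4
stages = nn.ModuleList([nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(4)])

# Place each stage on its own device if four GPUs are available, otherwise fall back to CPU.
devices = [f"cuda:{i}" if torch.cuda.device_count() >= 4 else "cpu" for i in range(4)]
for stage, dev in zip(stages, devices):
    stage.to(dev)

def pipeline_forward(x):
    micro_batches = x.chunk(num_micro_batches)       # F11, F12, F13, F14 ...
    outputs = []
    for mb in micro_batches:
        act = mb
        for stage, dev in zip(stages, devices):      # each hop would overlap with the next
            act = stage(act.to(dev))                 # micro-batch on the previous stage
        outputs.append(act)
    return torch.cat([o.to(devices[0]) for o in outputs])

x = torch.randn(32, 512)
y = pipeline_forward(x)
loss = y.pow(2).mean()
loss.backward()                                      # backward runs B4 ... B1 per micro-batch
```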

 

2. Tensor parallelism

Tensor parallelism must decide how parameters are split across devices according to the model structure and operator type, and it must guarantee mathematical consistency after the split. Large language models are based on the Transformer architecture, which mainly involves three kinds of operators: the embedding lookup, matrix multiplication, and the cross-entropy loss. These operators differ considerably, so corresponding tensor-parallel strategies are needed to distribute their parameters across devices. The parameters of the embedding layer can be split along the vocabulary dimension: each computing device stores only part of the word vectors, and the complete embedding is obtained by aggregating the partial results from all devices.

 

Tensor parallelism for matrix multiplication exploits the principle of block matrix multiplication. Take Y = X × A as an example, where X is an M × N input matrix, A is an N × K parameter matrix, and Y is the M × K result matrix. When A is too large to fit in the memory of a single card, it can be split across multiple cards, and the partial results are combined through collective communication so that the final result is mathematically equivalent to what a single device would compute. There are two ways to split the parameter matrix A:

1) Split by columns

Split matrix A by columns into A1 and A2 and place them on two computing devices. The two devices compute Y1 = X × A1 and Y2 = X × A2 respectively. After the computation, the devices communicate and concatenate their partial results to obtain the final result matrix Y = [Y1, Y2], which is mathematically equivalent to the single-device result.

2) Split by row

Split matrix A by rows into B1, B2, ..., Bn, where each Bi has dimensions (N/n) × K. Correspondingly, the input X is split by columns into X1, X2, ..., Xn, each of dimensions M × (N/n). The n segments are placed on n GPUs, and GPU i computes the partial product Yi = Xi × Bi. After the parallel computation finishes, the GPUs communicate and sum the partial results to obtain the final result matrix Y = Y1 + Y2 + ... + Yn.
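
A single-process numerical sketch of the two splits (assuming PyTorch and arbitrary small dimensions); in a real tensor-parallel system each chunk lives on its own GPU, the column split ends with an all-gather, and the row split ends with an all-reduce:

```python
# Numerical check of the two ways to split A in Y = X @ A (single process for clarity).
import torch

M, N, K, n = 8, 16, 12, 4
X = torch.randn(M, N)
A = torch.randn(N, K)
Y_ref = X @ A

# 1) Column split: A -> [A1 | A2 | ... | An]; each device computes X @ Ai and the
#    results are concatenated along the column dimension (all-gather).
A_cols = A.chunk(n, dim=1)
Y_col = torch.cat([X @ Ai for Ai in A_cols], dim=1)

# 2) Row split: A -> [B1; B2; ...; Bn]; X is split along columns to match, each
#    device computes Xi @ Bi, and the partial results are summed (all-reduce).
A_rows = A.chunk(n, dim=0)
X_cols = X.chunk(n, dim=1)
Y_row = sum(Xi @ Bi for Xi, Bi in zip(X_cols, A_rows))

print(torch.allclose(Y_ref, Y_col, atol=1e-5),   # True: column split is exact
      torch.allclose(Y_ref, Y_row, atol=1e-5))   # True: row split is exact
```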

 

The FFN block in the Transformer contains two fully connected (FC) layers, i.e. two matrix multiplications, and these two multiplications use the two splitting methods above respectively: the parameter matrix of the first FC layer is split by columns, while that of the second FC layer is split by rows. With this arrangement, the column-split output of the first FC layer directly matches the input layout required by the second, row-split FC layer, eliminating the need for an aggregation communication step between the two layers.
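
A single-process numerical sketch of this column-then-row arrangement (assuming PyTorch and toy sizes, in the spirit of the Megatron-style FFN split); in a real system each shard lives on its own GPU and the final summation is an all-reduce:

```python
# The first FC weight is split by columns, the second by rows, so the intermediate
# activation stays partitioned and no communication is needed between the two layers.
import torch

d_model, d_ff, n = 64, 256, 4
X  = torch.randn(8, d_model)
W1 = torch.randn(d_model, d_ff)
W2 = torch.randn(d_ff, d_model)
Y_ref = torch.relu(X @ W1) @ W2

W1_cols = W1.chunk(n, dim=1)            # column split of the first FC layer
W2_rows = W2.chunk(n, dim=0)            # row split of the second FC layer
# each "device" i keeps its partial activation relu(X @ W1_i) local and multiplies
# it by its own W2_i; only the final partial outputs need to be summed (all-reduce)
Y_par = sum(torch.relu(X @ W1_i) @ W2_i for W1_i, W2_i in zip(W1_cols, W2_rows))

print(torch.allclose(Y_ref, Y_par, atol=1e-4))   # True
```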

 

Tensor parallelism for the multi-head self-attention mechanism is similar to that for the FFN. Because the attention heads are independent of one another, it is even easier to parallelize than the FFN: the parameter matrices are split by head, as shown in the figure.

 

In the last layer of a classification network, the Softmax and Cross_entropy operators are usually used to compute the cross-entropy loss. When the number of categories is very large, however, the logit matrix may not fit in the memory of a single computing device. In this situation these operators can be split along the category dimension, and the final global cross-entropy loss is obtained by communicating intermediate results. The first quantity to compute is the softmax value, whose formula is as follows:
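
For logits x_1, ..., x_C over the C categories (in the numerically stable form that first subtracts the maximum logit):

softmax(x_i) = exp(x_i - x_max) / Σ_{j=1..C} exp(x_j - x_max),  where x_max = max_j x_j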

When computing the cross-entropy loss, tensor parallelism splits the softmax values and the target labels along the category dimension, and each device computes its part of the loss. A final communication then gathers the losses across all categories. Over the entire process, only three small communications are required to complete the cross-entropy computation.
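
A sketch of this category-parallel computation (assuming PyTorch with torch.distributed already initialized; parallel_cross_entropy and class_offset are illustrative names rather than an existing API), showing that exactly three small all-reduce operations are needed:

```python
# Category-parallel (vocabulary-parallel) cross-entropy sketch: each rank holds
# logits for only a slice of the categories.
import torch
import torch.distributed as dist

def parallel_cross_entropy(local_logits, targets, class_offset):
    # local_logits: [batch, C / world_size] slice of the full logit matrix on this rank
    # targets:      [batch] global class indices
    # class_offset: first global class index owned by this rank

    # (1) global max for numerical stability
    local_max = local_logits.max(dim=-1).values
    dist.all_reduce(local_max, op=dist.ReduceOp.MAX)
    shifted = local_logits - local_max.unsqueeze(-1)

    # (2) global sum of exponentials (the softmax denominator)
    sum_exp = shifted.exp().sum(dim=-1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM)

    # pick out the shifted logit of the target class if this rank owns it, else 0
    local_classes = local_logits.size(-1)
    in_range = (targets >= class_offset) & (targets < class_offset + local_classes)
    idx = (targets - class_offset).clamp(0, local_classes - 1)
    target_logit = torch.where(in_range,
                               shifted.gather(1, idx.unsqueeze(1)).squeeze(1),
                               torch.zeros_like(sum_exp))

    # (3) sum the per-rank contributions of the target logits
    dist.all_reduce(target_logit, op=dist.ReduceOp.SUM)

    # cross-entropy per sample: log(sum_exp) - target_logit
    return (sum_exp.log() - target_logit).mean()
```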

3. Combining pipeline and tensor parallelism

Pipeline parallelism splits the model "vertically" by layers, while certain operations within a layer can also be split "horizontally", which is tensor-parallel training. The computational bottleneck of modern models such as the Transformer is multiplying an activation batch matrix by a large weight matrix; the independent dot products (or pieces of each dot product) can therefore be computed on different GPUs and the results summed. Whichever strategy is used, the weight matrix can be split into uniformly sized shards hosted on different GPUs, each shard used to compute its part of the full matrix product, with the results combined through communication. Megatron-LM is an example that parallelizes the matrix multiplications in the Transformer self-attention and MLP layers in this way. PTD-P combines tensor, data, and pipeline parallelism, assigning multiple non-contiguous layers to each device to reduce bubble overhead at the cost of more network communication. Sometimes the input itself can be partitioned and computed over finer-grained examples to reduce peak memory consumption: sequence parallelism splits an input sequence along the time dimension into multiple sub-examples, reducing memory consumption proportionally.

4. Mixture of Experts (MoE)

Mixture of Experts (MoE) methods are attracting widespread attention as researchers try to push past model-size limits. The core idea comes from ensemble learning: combining multiple weak learners can produce a strong learner. With MoE, only a small portion of the network is used to compute the output for any given input. One example approach is to keep multiple sets of weights and let the network choose, through a gating mechanism, which set to use at inference time; this allows many more parameters without increasing computational cost. Each set of weights is called an "expert", and the hope is that the network learns to assign specialized computations and skills to each expert. Different experts can be hosted on different GPUs, providing a clear way to scale the number of GPUs a model uses. A single MoE layer consists of n expert feed-forward networks {E_i}^n_{i=1} and a trainable gating network G that learns a probability distribution over the experts in order to route each input to a few selected experts. When there are too many experts, a two-level hierarchical MoE can be used.
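
A minimal top-1 gated MoE layer sketch (assuming PyTorch; SimpleMoE and its sizes are illustrative, and it omits the load-balancing losses and expert-parallel communication used in real systems):

```python
# Toy MoE layer: n expert FFNs plus a trainable gate that routes each token to one expert.
import torch
from torch import nn

class SimpleMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=4):
        super().__init__()
        # n expert feed-forward networks E_1 .. E_n
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # trainable gating network G producing a probability distribution over experts
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                                # x: [tokens, d_model]
        gate_probs = self.gate(x).softmax(dim=-1)
        top_prob, top_idx = gate_probs.max(dim=-1)       # route each token to one expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                          # tokens assigned to expert i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
moe = SimpleMoE()
print(moe(tokens).shape)                                 # torch.Size([16, 512])
```

In a distributed setting the experts would live on different GPUs and tokens would be dispatched to them with all-to-all communication, which is what frameworks such as GShard implement.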

 

GShard (a distributed training framework developed by the Google Brain team) scales the MoE Transformer model to 600 billion parameters through sharding. In the MoE Transformer, every other feed-forward layer is replaced with an MoE layer. In the sharded MoE Transformer, only the MoE layers are sharded across multiple machines; all other layers are simply replicated.

Switch Transformer (a trillion-parameter Transformer-class model) replaces the dense feed-forward layer with a sparse Switch FFN layer in which each input is routed to only one expert network, extending model size to trillions of parameters with higher sparsity.

5. Other memory-saving designs

1. Mixed Precision Training

Mixed precision training uses both 16-bit and 32-bit floating-point types when training the model, to speed up computation and reduce memory usage. On NVIDIA GPUs, float16 computation is more than twice as fast as float32, which greatly raises the ceiling on usable computing power. However, simply converting the model's operations to FP16 is not enough, because FP16 has a much narrower numerical range than FP32 and TF32, which can cause values to overflow or underflow. To make the model converge to the same result as FP32 training, additional techniques are required.


1) Weight Backup

One technique to avoid losing critical information with half precision is weight backup. During training, weights, activations, and gradients are all computed in FP16, but an additional FP32 copy of the weights (the master weights) is kept. Gradient updates are applied to the FP32 weights, and at the next training step the FP32 weights are cast back to FP16 for the forward and backward computations.

2) Loss Scaling


When training the model, gradient magnitudes are often very small, and storing them in FP16 can cause many tiny gradients to be flushed to zero, because a large share of the small non-zero gradient values fall below the range FP16 can represent. Since the upper part of the FP16 range goes unused, the entire gradient distribution can be shifted into the representable range by multiplying the gradients by a larger coefficient. A simple way to do this is to multiply the loss by a large scaling factor before computing the gradients, which scales every gradient by the same amount. Before the weight update, the gradients are scaled back down to their original magnitude and the update is applied to the FP32 master weights.

3) Precision Accumulation


In an FP16 model, some arithmetic operations such as matrix multiplication need to accumulate their partial products in FP32 and only then convert the result to FP16. For example, Tensor Cores in NVIDIA GPUs support FP16 mixed-precision acceleration while preserving accuracy: they perform FP16 matrix multiplications but accumulate in FP32, greatly reducing the accuracy loss of mixed precision training.
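
A sketch of these techniques using PyTorch's automatic mixed precision (assuming a CUDA device and a toy model): torch.cuda.amp keeps the model weights in FP32 and casts on the fly inside autocast, which plays the role of the FP32 weight backup, while GradScaler implements loss scaling.

```python
# Mixed precision training sketch with PyTorch AMP.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # eligible ops run in FP16
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                # scale the loss so gradients are scaled too
    scaler.step(optimizer)                       # unscale gradients, skip the step on inf/nan
    scaler.update()                              # adjust the scale factor dynamically
```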

 

2. Gradient Accumulation

Gradient accumulation is a training technique that splits the data into several small batches that are computed sequentially; the gradient of each mini-batch is computed and accumulated, and after the last one the accumulated gradient is averaged and used to update the model parameters. In a neural network, the sample data passes through all layers to produce predictions, the loss function computes the loss (error) for each sample, and backpropagation computes the gradient of the loss with respect to the model parameters, which is then used to update the network. With gradient accumulation, the network takes one micro-batch at a time, runs the forward pass and computes the gradient, keeps accumulating gradients over several micro-batches, updates the parameters from the accumulated gradient after a fixed number of accumulations, and then clears all gradients for the next cycle.
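
A sketch of the accumulation loop (assuming PyTorch and a toy model; accumulation_steps is an illustrative setting):

```python
# Gradient accumulation: gradients from several micro-batches are summed in the
# .grad buffers, and the optimizer steps only once per accumulation window.
import torch
from torch import nn

model = nn.Linear(256, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4                              # effective batch = 4 x micro-batch

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 256)                         # one micro-batch
    y = torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accumulation_steps).backward()          # scale so the sum equals the mean
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                            # update with the accumulated gradient
        optimizer.zero_grad()                       # clear gradients for the next window
```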

 

3. CPU Offloading

CPU offloading means temporarily moving data that is not currently needed to the CPU or to other devices and reading it back when required. Because CPU memory is larger and cheaper than GPU memory, this two-tier storage greatly expands the memory available during training. A naive implementation, however, slows training down, while a sophisticated one must prefetch data so that devices never wait for it. ZeRO is one implementation of this idea: it distributes parameters, gradients, and optimizer states across all available hardware and materializes them only when needed.
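
A hand-rolled sketch of the offloading idea itself, not of ZeRO (assuming PyTorch; offload and reload are illustrative helpers): a tensor that will be needed later is parked in pinned CPU memory and copied back just before use, with non-blocking copies so the transfers can overlap with computation.

```python
# Park a large activation in pinned CPU memory and bring it back when needed.
import torch

def offload(t):
    cpu_buf = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    cpu_buf.copy_(t, non_blocking=True)              # GPU -> CPU copy
    return cpu_buf

def reload(t, device="cuda"):
    return t.to(device, non_blocking=True)           # CPU -> GPU when needed again

if torch.cuda.is_available():
    activation = torch.randn(4096, 4096, device="cuda")
    parked = offload(activation)                     # free GPU memory for other tensors
    del activation
    torch.cuda.empty_cache()
    activation = reload(parked)                      # prefetch before the backward step
```

Libraries such as DeepSpeed (ZeRO-Offload) automate this bookkeeping and schedule the transfers so that devices are rarely left waiting.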

4. Activation Recomputation

Recomputation (also called activation checkpointing) releases tensors produced in the forward pass and recomputes them during backpropagation. It is suitable for tensors that occupy a lot of memory but are cheap to recompute. There are three recomputation strategies:

Speed-centric: keep the computed tensor for later reuse rather than recomputing it;

Memory-centric: release the tensor once the computation is complete and recompute it when needed;

Cost-aware: decide after the computation whether to keep the tensor, releasing it if keeping it could cause a memory peak.

Swapping and recomputation can also be combined, applying different methods to specific ops. One can also run a few iterations in advance to collect memory and timing information and decide which tensors should be swapped out and which should be recomputed.
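
A sketch of recomputation using PyTorch's built-in activation checkpointing (the block and sizes are illustrative):

```python
# Activations inside the checkpointed block are not stored during the forward pass
# and are recomputed during backpropagation, trading extra compute for lower memory.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
head = nn.Linear(1024, 10)

x = torch.randn(32, 1024, requires_grad=True)
h = checkpoint(block, x, use_reentrant=False)    # forward without saving intermediates
loss = nn.functional.cross_entropy(head(h), torch.randint(0, 10, (32,)))
loss.backward()                                  # block's forward is re-run here for gradients
```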

 

5. Model Compression

Model compression reduces the number of parameters of a large model through methods such as pruning and weight sharing. Because these methods can easily reduce model accuracy, they are used relatively rarely in training. Common model compression methods include pruning, weight sharing, low-rank decomposition, weight quantization and binarization, and knowledge distillation.

Pruning can remove connections, kernels, or channels; weight sharing reduces the parameter count by sharing parameters across the model; low-rank decomposition factorizes weight matrices into low-rank forms, again reducing the parameter count; weight quantization reduces weights from 32 bits to 16 or 8 bits (binarization goes as far as 1 bit); knowledge distillation uses a trained teacher model to guide the training of a smaller student model.
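
As a sketch of one of these methods, knowledge distillation, the following loss combines the hard-label cross-entropy with a temperature-softened KL term against the teacher's outputs (assuming PyTorch; the tiny linear "teacher" and "student" are stand-ins for real models, and T and alpha are illustrative hyperparameters):

```python
# Knowledge distillation loss: hard-label loss plus a soft-target loss that matches
# the temperature-softened teacher distribution.
import torch
from torch import nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

teacher = nn.Linear(128, 10)                      # stand-ins for a large teacher / small student
student = nn.Linear(128, 10)
x = torch.randn(16, 128)
labels = torch.randint(0, 10, (16,))
with torch.no_grad():
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, labels)
loss.backward()
```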

 

6. Memory Efficient Optimizer

The memory consumed by the optimizer is an important issue in model training. Taking Adam as an example, it must store momentum and variance estimates, each the same size as the gradients and model parameters, so the memory requirement grows substantially. To reduce this footprint, optimizers such as Adafactor and SM3 have been proposed; they estimate the second moment differently or otherwise significantly reduce memory usage.
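
A back-of-the-envelope sketch of this overhead, using the common accounting for Adam with mixed precision (FP16 weights and gradients plus FP32 master weights, momentum, and variance, i.e. about 16 bytes per parameter; the 7B-parameter figure is an arbitrary example):

```python
# Approximate training-state memory for Adam with mixed precision.
def adam_mixed_precision_bytes(num_params):
    fp16_weights  = 2 * num_params
    fp16_grads    = 2 * num_params
    fp32_weights  = 4 * num_params    # master copy kept for the optimizer update
    fp32_momentum = 4 * num_params
    fp32_variance = 4 * num_params
    return fp16_weights + fp16_grads + fp32_weights + fp32_momentum + fp32_variance

params = 7_000_000_000                # a hypothetical 7B-parameter model
print(adam_mixed_precision_bytes(params) / 2**30, "GiB")   # roughly 104 GiB of state
```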

The ZeRO optimizer is a memory optimization method for large-model training. Observing that memory is consumed on one hand by model states and on the other by residual states (activations, temporary buffers, and unusable fragmented memory), it uses two sets of techniques: ZeRO-DP and ZeRO-R. ZeRO-DP removes the redundancy of model states across data-parallel processes with a dynamic communication schedule, while ZeRO-R optimizes the memory consumed by residual states through partitioned activation recomputation, constant-size buffers, and dynamic memory defragmentation.

After the NVIDIA ban, where will China’s AI computing go?

On October 17, the United States tightened its ban on AI chips for the Chinese market, using performance and performance density as export-control criteria and prohibiting the export of chips whose single-chip computing power exceeds 300 teraflops and whose performance density exceeds 370 gigaflops per square millimeter. Because the restrictions cover high-end AI chips from NVIDIA, AMD, Intel, and others, and above all NVIDIA's mainstream AI training GPUs, the A100 and H100, the ban is also known as the "NVIDIA ban".

The new chip ban has prompted much discussion in the AI industry, focused mainly on the implementation date, the buffer period, the GPU models involved, and the duration of the ban. Despite the controversy, the ban on selling high-end AI chips to China is being firmly implemented.

Now the AI industry needs to form a consensus on how to deal with the challenge. Rather than dwelling on the banned GPUs, we should think harder about the future development path of China's AI computing in the era of the "chip iron curtain". Below we look at the current state of the industry and then discuss the way forward for AI computing.

1. Current Situation

Compared with previous rounds, the reaction of public opinion and the AI industry to the new NVIDIA ban seems calmer; only the question of whether the consumer graphics card RTX 4090 is covered has sparked debate among gamers and merchants. Although the industry does not want to see high-end AI chips banned, it had anticipated this outcome. The U.S. blockade of Chinese chips has lasted for years, some of NVIDIA's high-end GPUs were already banned from sale, and the industry's reaction has shifted from surprise to calm response. Meanwhile, with the popularity of ChatGPT driving up the global market for high-end GPUs, the United States has repeatedly signaled that it would push for a comprehensive ban on selling high-end AI chips to China.

In response to the ban and driven by the development of large models, from the end of last year to the first half of this year many Chinese technology, finance, automotive, and other companies rushed to purchase NVIDIA's high-end GPUs, creating a shortage in the market. For many small and medium-sized Chinese technology companies and AI startups, high-end GPUs were already hard to buy, so the ban has not changed much. In fact, the domestic AI chip industry began to accelerate in the early days of the trade friction; although NVIDIA's high-end GPUs are hard to replace for AI training, they are not irreplaceable.

In addition, AI chips, unlike smartphone chips, do not directly touch mass consumers, and Huawei has already achieved breakthroughs in smartphone chips. Both the public and the industry therefore take a calm attitude toward the ban, even treating it as a matter of course. It must be admitted, however, that the ban still does a certain degree of harm to China's AI industry: replacing NVIDIA GPUs in the short term faces difficulties such as chip production capacity and ecosystem compatibility, and the ban also directly hurts manufacturers, such as AI server vendors, that build on NVIDIA products.

A long-term ban may decouple China's AI computing from global high-end chips, with complex negative consequences: China's AI computing power may lag behind the iteration of NVIDIA's high-end GPUs; as the underlying compute diverges, China's AI industry may fall behind in software technology; and the technology blockade may extend from AI chips to basic digital capabilities such as general-purpose computing, storage, and basic software. It is therefore necessary to pursue three "breakout plans" simultaneously: accelerate the independent research, development, and ecosystem building of domestic AI chips; increase investment in software technologies such as large models to reduce dependence on companies like NVIDIA; and strengthen international scientific and technological cooperation to promote the global development of China's AI computing.

2. Solution 1: Make good use of buyer identity

As the largest buyer in the global chip market, Chinese companies should make good use of this status and shed some misconceptions about Sino-US technology trade. We tend to assume that the rules of the game are set by the U.S. government and U.S. companies and can only be passively accepted, but in fact buyers should have more say. A ban on AI chips in the Chinese market most directly harms the American technology giants represented by NVIDIA, because the Chinese market has the greatest demand for their AI chips. NVIDIA CEO Jensen Huang once said that if the company were deprived of the Chinese market, it would have no contingency plan, because there is no other China in the world. We should therefore recognize the power we hold as buyers and use this status to protect our own interests.

 

We can also see the contradiction between American technology companies and the U.S. government: companies pursue commercial interests, while the government pursues political interests. U.S. technology companies have been trying to oppose and work around the ban, for example NVIDIA launching special versions of its GPUs for the Chinese market.

3. Solution 2: Use the cloud instead of cards to centralize computing power

For the foreseeable future, the U.S. ban on AI chips for China will only tighten, which poses challenges to the development of large AI models. Many in the industry believe that although large models are developing quickly, they have not shown the explosive growth of previous technology waves, mainly because of a shortage of investment capital and a shortage of compute cards.

To close the computing power gap that the ban creates for China's AI industry, companies need to increase their allocation of and investment in cloud AI computing power, replacing cards with the cloud. In fact, anticipating that high-end AI chips might be banned, China's major public cloud providers had already begun stockpiling NVIDIA's high-end GPUs, not only because they need to invest in large models and open up the MaaS market, but also because they have direct demand for AI computing power. Moreover, GPUs placed into cloud resource pools can be reused over a long period, giving cloud vendors the flexibility to press forward or fall back as conditions change. This is why, in the first half of this year, high-end AI chips flowed to cloud providers while small and medium-sized enterprises found them hard to obtain.

Objectively, this shift of high-end AI chips to the cloud helps the Chinese market respond to the AI chip ban as a whole, and it also fits the strategic thinking of the national "Eastern Data, Western Computing" initiative. Another trend is that as model parameters and training data keep growing, locally hosted GPU pools are increasingly stretched; cloud-based training on thousands or tens of thousands of cards is becoming the main direction, so enterprise users will move to the cloud more actively.

 

At the same time, cloud AI computing power is not limited to stockpiled NVIDIA GPUs. As policy support and procurement of domestic AI chips increase, AI computing power that combines cloudification with autonomy will become a trend. According to IDC, China's AI servers used 500,000 domestically developed AI accelerator chips in the first half of 2023. Huawei has launched the Ascend AI cloud service to provide domestic AI computing power, and under the "Eastern Data, Western Computing" initiative a number of AI computing centers based on domestic AI chips have been established around the country to ensure a stable and reliable supply of cloud AI computing power.

However, many companies still prefer to purchase local AI computing power. On the one hand, NVIDIA GPUs are in short supply and hold their value well, and can even serve as core corporate assets; on the other hand, cloud AI computing power still suffers from queuing, downtime, and a lack of software services, which hurts the developer experience. Public cloud vendors need to do more to improve developers' experience with cloud AI computing power.

4. Solution 3: Let domestic AI computing power grow explosively

Facing a new round of AI chip bans, China's AI industry is not left with nothing but NVIDIA's high-end GPUs: after years of development, the domestic AI chip industry has made enormous progress. Although NVIDIA still dominates the market, domestic AI computing power has won a certain share, but it still needs to keep improving in core performance, software ecosystem, and shipment capability. Objectively, the ban will accelerate the growth and maturation of domestic AI computing power.

In order to achieve this goal, several things are very important:

1. Form industry consensus and avoid conceptual confusion

Although the AI chip market features many brands and types of players, its problems cannot be ignored. Cutting-edge technologies such as brain-inspired chips are still at the conceptual stage, some AI chip makers can only use their chips internally and cannot ship them to the market, and a large number of manufacturers are still in the early construction stage, so their contribution to the autonomy of AI computing will be limited in the short term.

To deal with the ban on NVIDIA's high-end GPUs, the industry should focus on feasible, effective GPU alternatives and avoid excessive speculation and distraction. Only by forming an industry consensus can the problem be solved effectively.

2. Move toward large-scale commercial use and avoid "PowerPoint chipmaking"

At present, the domestic AI chip makers that can actually ship products are concentrated in a few companies such as Huawei, Baidu, Suiyuan Technology, and Haiguang Information. A large number of semiconductor manufacturers and AI companies remain stuck at the stage of plans and visions for building chips, which falls far short of the development of domestic AI chips that policy support and the investment market expect. Some companies may even only be enjoying financial-market dividends at this stage without making substantive progress.

To promote industrial development, the industry's future orientation should focus on moving AI chips from planning to shipment, helping manufacturers obtain direct business feedback, letting products and production capacity face the test of the market, and gradually building positive cash flow.

3. Strengthen the software ecosystem and migration capabilities

The importance of NVIDIA GPUs lies not only in hardware performance but also in the strength of the surrounding software ecosystem, such as CUDA and PyTorch. The development of domestic AI chips therefore cannot ignore software capabilities. While building an independent software ecosystem, attention must also be paid to the ability, and the cost, of migrating AI models out of the NVIDIA ecosystem.

Many manufacturers have already explored this. For example, Haiguang Information's DCU is highly compatible with CUDA in its ecosystem and programming environment, allowing CUDA users to migrate quickly to Haiguang's ROCm platform at low cost. PyTorch 2.1 announced support for Huawei Ascend, showing that domestic AI chips already carry enough weight to be integrated into the global software ecosystem. A future explosion of domestic AI computing will be inseparable from the vigorous development of the domestic AI basic software ecosystem.

 

4. Increase support for “main brands” to create scale effects

In China, to accelerate the maturation of AI computing and achieve independent substitution, a market structure with one leading player and several strong ones should be formed as soon as possible, avoiding ecosystem fragmentation and wasted IT investment. The market mechanism will play the decisive role in this process, but under the current chip ban the rise of domestic AI computing is urgent, and the formation of a "main brand" should be accelerated to replace imported chips such as NVIDIA's as quickly as possible.

At present, Huawei's Ascend series looks like the most likely candidate to become the main brand of domestic AI computing power. Liu Qingfeng, chairman of iFlytek, has said that Huawei's GPUs have caught up with the NVIDIA A100. Published figures show that the integer-precision computing power of the Ascend 310 reaches 16 TOPS, while that of the Ascend 910 is as high as 640 TOPS, meaning the Ascend 910's performance is close to that of the NVIDIA A100.

At the same time, Ascend is currently the only domestic AI computing brand with a meaningful market share, and on the software side it has cultivated the CANN heterogeneous computing architecture, analogous to NVIDIA's CUDA, together with the MindSpore AI framework. In terms of core performance, software ecosystem, and market share, Ascend has the potential to accelerate its growth and achieve large-scale domestic substitution of AI computing power.

In the short term, the main ways to promote the rapid growth of domestic AI computing power are to standardize industry norms, strengthen software construction, and increase support for independent brands. The NVIDIA ban is an issue the Chinese AI industry would rather not face, but it is one that cannot be avoided.

NVIDIA will launch new AI chips for the Chinese market to cope with U.S. export restrictions

According to people familiar with the matter, NVIDIA has developed a new series of modified AI chips tailored to the Chinese market, including the HGX H20, L20 PCIe, and L2 PCIe. Against the backdrop of tightening U.S. export restrictions on China's high-tech industry, the move is widely seen as a direct response to the policy adjustments and suggests the company is looking for ways to comply with the rules while remaining competitive in the market.

 

According to industry insiders, the new chips in this series, the HGX H20, L20 PCIe, and L2 PCIe, are derived from NVIDIA's existing product families and are built on different architectures.

The HGX H20 uses the NVIDIA Hopper architecture and is equipped with up to 96 GB of HBM3 memory delivering 4 TB/s of bandwidth, making it suitable for extremely demanding computing scenarios.

The L20 PCIe and L2 PCIe use the NVIDIA Ada Lovelace architecture and provide options for different computing needs: the L20 PCIe comes with 48 GB of GDDR6 memory with ECC, while the L2 PCIe has 24 GB of GDDR6 with ECC. Notably, the H20 has no RT Cores, while the L20 and L2 PCIe do include them, indicating stronger ray-tracing capabilities.

These new series of chips may adjust performance parameters to meet the special requirements of the Chinese market and circumvent certain export bans on sensitive technologies. Although such product customization may bring technological innovation, it may also bring the risk of technological fragmentation, triggering industry concerns about the differentiation of technical standards.

Analysts believe that NVIDIA's move is an important part of its global supply chain strategy and reflects the company's flexible adaptation to the global economic situation. This move will help NVIDIA maintain its business activities and customer relationships in the Chinese market, and may also promote local Chinese manufacturers to accelerate the pace of technological self-reliance.

Although U.S. export restrictions have brought challenges to technology products in the Chinese market, according to people familiar with the matter, Nvidia has adopted targeted technical adjustments to comply with export rules and ensure that its products can smoothly enter the Chinese market. It is reported that Nvidia is expected to announce this new series of products after November 16, and more details will be announced by then. Although NVIDIA has not yet made an official response to this news, the market is already full of expectations for these possible new products.

Blue Ocean Brain Large Model Training Platform

The Blue Ocean Brain large-model training platform provides powerful computing support, including AI accelerators interconnected at high speed through open acceleration modules. It is configured with high-speed memory and a fully interconnected topology to meet the communication requirements of tensor parallelism in large-model training, and it supports high-performance I/O expansion that can scale out to a 10,000-card AI cluster to meet the communication needs of pipeline and data parallelism. It features a hot-swappable liquid cooling system and intelligent power management: when the BMC receives a PSU failure or error warning (such as a power outage, surge, or overheating), it automatically forces the system CPU into ULFM (ultra-low frequency mode) to minimize power consumption. The platform is committed to providing customers with environmentally friendly, low-carbon, energy-saving high-performance computing solutions, and is mainly used in deep learning, academic education, biomedicine, earth exploration, meteorology and oceanography, supercomputing centers, AI, and big data.

 

1. Why do we need large models?

1. The model effect is better

Large models perform better than ordinary models across a wide range of scenarios.

2. Stronger creative ability

Large models can perform content generation (AIGC) to facilitate large-scale content production

3. Flexible customization of scenarios

By providing a few examples, large models can be adapted to a wide range of application scenarios.

4. Less labeled data

By learning a small amount of industry data, large models can cope with the needs of specific business scenarios.

2. Platform features

1. Heterogeneous computing resource scheduling

A comprehensive solution based on general-purpose servers and dedicated hardware for scheduling and managing multiple heterogeneous computing resources, including CPUs, GPUs, etc. Through powerful virtualization management functions, underlying computing resources can be easily deployed and various models can be run efficiently. At the same time, the hardware acceleration capabilities of different heterogeneous resources are fully utilized to speed up the running and generation speed of the model.

2. Stable and reliable data storage

Supports multiple storage type protocols, including block, file and object storage services. Pool storage resources to achieve free circulation of models and generated data, improving data utilization. At the same time, data protection mechanisms such as multiple copies, multi-level fault domains, and fault self-recovery are adopted to ensure the safe and stable operation of models and data.

3. High-performance distributed network

Provides networking for compute and storage resources, forwarding traffic through a distributed network mechanism that transparently passes through the performance of the physical network, significantly improving the efficiency and performance of model computation.

4. Comprehensive security guarantee

In terms of model hosting, a strict permission management mechanism is adopted to ensure the security of the model warehouse. In terms of data storage, measures such as privatized deployment and data disk encryption are provided to ensure the security and controllability of data. At the same time, during the model distribution and operation process, comprehensive account authentication and log audit functions are provided to fully ensure the security of the model and data.

3. Common configurations

1. Processor CPU:

  • Intel Xeon Gold 8358P, 32C/64T, 2.6 GHz, 48 MB cache, DDR4-3200, Turbo, HT, 240 W

  • Intel Xeon Platinum 8350C, 32C/64T, 2.6 GHz, 48 MB cache, DDR4-3200, Turbo, HT, 240 W

  • Intel Xeon Platinum 8458P, 28C/56T, 2.7 GHz, 38.5 MB cache, DDR4-2933, Turbo, HT, 205 W

  • Intel Xeon Platinum 8468, 48C/64T, 2.1 GHz, 105 MB cache, 350 W

  • AMD EPYC™ 7742, 64C/128T, 2.25 GHz (up to 3.4 GHz), 256 MB cache, DDR4 3200 MT/s, 225 W

  • AMD EPYC™ 9654, 96C/192T, 2.4 GHz (up to 3.55/3.7 GHz), 384 MB cache, DDR5 4800 MT/s, 360 W

2. Graphics card GPU:

  • NVIDIA L40S GPU 48GB

  • NVIDIA NVLink-A100-SXM640GB

  • NVIDIA HGX A800 80GB

  • NVIDIA Tesla H800 80GB HBM2

  • NVIDIA A800-80GB-400Wx8-NvlinkSW
