Deep learning model deployment and optimization: strategies and practices; comparative analysis of L40S, A100, and H100

★ Keywords: deep learning, machine learning, generative AI, deep neural network, abstract learning, Seq2Seq, VAE, GAN, GPT, BERT, pre-trained language model, Transformer, ChatGPT, GenAI, multi-modal large model, visual large model, TensorFlow, PyTorch, Batchnorm, Scale, Crop operator, L40S, A100, H100, A800, H800

With the rapid development of generative AI applications, we are in an unprecedented era of explosion, and deploying deep learning models has become an urgent problem. Although GPUs play a key role in training and inference, misconceptions about their role in generative AI persist. NVIDIA's L40S GPU architecture has recently become a hot topic, so what advantages does the L40S offer compared with the A100 and H100?

Generative AI applications enter the era of explosion

1. Driving factors: resonance between large models, computing power, and the ecosystem

Generative AI originated from rule-based expert systems in the 1950s. Although such systems could handle simple tasks such as string matching and word-frequency statistics, they had limited vocabularies, poor context understanding, and very weak ability to generate creative content.

The rise of machine learning in the 1980s injected new impetus into AI, and the emergence of neural networks in the 1990s allowed AI to begin imitating the human brain and learning from data to generate more realistic content. The core technology of contemporary generative AI originates from the continuous deepening of deep neural network structures after 2012: through layers of abstraction, the model learns complex feature representations of a task, thereby improving accuracy and authenticity.

In the following years, the maturing of a series of algorithms such as Seq2Seq, VAE, and GAN, together with growth in computing power and data scale, made large-model training possible and marked a qualitative leap in the development of generative AI. In particular, the birth of pre-trained language models such as GPT and BERT marked a new stage in the field of text generation.

[Figure: The development wave of the artificial intelligence industry]

1. Models, computing power, and ecology drive AI applications into an era of explosion

1) Rapid progress in algorithms and models

GenAI has made significant progress in the text domain. With the release of the Transformer model in 2017 and the launch of ChatGPT in 2022, many of its capabilities have surpassed human benchmarks. Going forward, more powerful large language models such as GPT-5, along with technological breakthroughs in multi-modal and visual large models, will continue to drive the evolution of AI applications.

2) Computing infrastructure will be faster and cheaper

Although the short-term surge in demand for large-model training has driven computing costs upward, NVIDIA's compute chips keep being updated and iterated, and Microsoft, Amazon, Google, and others continue to increase capital expenditure on AI cloud services, so the development of AI applications will receive stronger support.

3) The AI ecosystem gradually matures

The improvement of the AI component layer (the AI stack) and the refinement of the industrial division of labor provide full life-cycle support for AI applications across model training, data integration, application development, and application deployment.

[Figure: Models, computing power, and the ecosystem drive AI applications into an era of explosion]

Currently, GPT-4 has surpassed human levels along multiple dimensions such as reasoning, generation, and dialogue. Its power lies not only in text: the large-model framework itself can be applied to a variety of tasks, becoming a unified technical basis for multi-modal generation spanning images, code, audio, and more.

[Figure: GPT-4 is currently the most powerful large model]

2. Industry status: the evolution of AI applications from a primary- and secondary-market perspective

The development of generative AI applications has gone through three stages:

1. GPT-3 birth period

GPT-3's powerful language generation capabilities enabled the first batch of generative AI applications, such as Jasper AI, to appear in 2021 and to grow users and revenue rapidly.

2. Explosion period in 2022

AI painting applications emerged one after another in 2022, with representative products such as MidJourney and Stable Diffusion. At the same time, the release of ChatGPT marked a new level of text generation, and generation capabilities in other modalities such as video and 3D also kept improving.

3. Commercialization period in 2023

GPT-4 further improves language model capabilities, while the open source model Llama provides a low-cost option. AI applications are springing up in various industries. Giants such as Microsoft and Salesforce have announced the commercial pricing of AI products, indicating that AI applications have officially entered commercialization.

[Figure: Development stages of generative AI applications]

3. Application framework: four major application tracks and industrial logic

Generative AI applications can be divided into tool, general software, industry software, and intelligent hardware categories. Tool applications such as chatbots mainly serve consumers (the C side), are highly homogeneous, and depend heavily on the underlying model. For general software applications such as office software, benchmark products have emerged and the category is entering commercialization. Industry software targets specific vertical fields such as finance and healthcare; these industries vary greatly, and mining domain data is the core competitive advantage. For intelligent hardware applications such as autonomous driving, perception and decision-making are the bottlenecks and require breakthroughs in underlying technology. From tools to intelligent hardware, applications are evolving from general to specialized and from the C side to the B side, and the core of competition has shifted from reliance on models to data applications and underlying technological innovation. This indicates that generative AI applications are maturing toward real-world deployment.

[Figure: Generative AI application industry map]

The three underlying meta-capabilities of generative AI are perception, analysis, and generation. Perception currently focuses on text understanding and will extend to visual perception of images. Analytical capability will move from information integration to complex reasoning. Generation capability will expand from text to multi-modal content such as images and videos. Through the coordinated improvement of these three capabilities, generative AI will achieve unified perception, understanding, and creation across modalities, evolving toward a comprehensive understanding of the real world and creative responses, and thus a more powerful level of intelligence.

There are four major directions for the future development of generative AI: content generation, knowledge insights, intelligent assistants and digital agents. Content generation is a core value, and automated generation of a wider range of content is achieved by improving large models and multi-modal technology. Knowledge insights use large models to analyze data to provide insights and serve decision-making. Intelligent assistants embed AI capabilities into scenarios to proactively collaborate. Digital agents make autonomous decisions and actions based on their environment to achieve goals. From passive services to proactive actions, generative AI is evolving towards higher intelligence with content creation, world insight, intelligent collaboration, and autonomous actions.

[Figure: Benchmark products and development paths for segmented AI applications]

How to deploy deep learning models

A trained model needs to be optimized, adjusted, and converted into a standard-format inference model before it can be deployed. Optimization includes operations such as operator fusion and constant folding to improve inference performance. Because deployment scenarios vary, the appropriate model format must be chosen according to hardware constraints; to fit those constraints, model compression or reduced precision may be required. After the model is deployed, inference latency and resource usage are the key indicators, and they can be improved through customized chips and hardware-software co-optimization. On the software side, factors such as data layout and computational parallelism need to be considered and designed around the processor architecture. Models are also important enterprise assets, so they must remain secure after deployment. In short, a model goes through multiple processes from training to deployment, including conversion and optimization, size adjustment, and hardware-software co-design, to fit the resource constraints, performance targets, and security requirements of different scenarios and to fully realize its value.

To address the above challenges in model deployment, the industry has some common methods:

  • Operator fusion

Operator fusion combines multiple operators into one through expression simplification, attribute fusion, and similar techniques. Fusion reduces both the computational complexity and the size of the model.

  • Constant folding

The forward computation of eligible operators is completed ahead of time, in the offline stage, thereby reducing the computational complexity and size of the model.

  • Model compression

Technologies that reduce model volume and computational complexity through quantization, pruning and other means can be divided into compression technologies that require retraining and compression technologies that do not require retraining.

  • Data layout

Based on the support of the back-end operator library and hardware constraints, the optimal data layout format for each layer in the network is searched, and data-rearrangement operators are inserted where needed to reduce inference latency during deployment.

  • Model obfuscation

Obfuscating a trained model mainly involves adding network nodes and branches and replacing operator names. Even if an attacker steals the obfuscated model, they cannot understand the structure of the original model. In addition, the obfuscated model can be executed directly in the deployment environment in its obfuscated state, ensuring the security of the model at run time.

1. Conversion and optimization from training model to inference model

1. Model conversion

Different training frameworks such as TensorFlow and PyTorch have their own data structure definitions, which makes model deployment inconvenient. To solve this problem, the industry developed the Open Neural Network Exchange (ONNX) format. It has strong expressive power, supports a wide range of operators, and provides converters from the various frameworks to ONNX.

The essence of model conversion is transferring structured information between different data structures. It is therefore necessary to analyze the similarities and differences between the two structures, map the similar parts directly, and find a reasonable, semantics-based conversion for the parts that differ significantly. If the structures are incompatible, the conversion cannot be completed.

The advantage of ONNX lies in its powerful expression ability. Models of most frameworks can be converted to ONNX, thus avoiding incompatibility problems. As an intermediate representation format, ONNX becomes a bridge between different frameworks and deployment environments, enabling seamless migration of models.
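As a minimal sketch of this workflow (assuming PyTorch and the onnx package are installed; the toy model and file names are placeholders, not a specific production setup), a trained PyTorch model can be exported to ONNX and checked as follows:

```python
import numpy as np
import torch
import torch.nn as nn
import onnx

# A toy trained model standing in for a real network (placeholder).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
model.eval()  # inference mode: BatchNorm uses fixed running statistics

dummy_input = torch.randn(1, 3, 224, 224)  # example input that fixes the graph's shapes
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # keep batch size flexible
    opset_version=13,
)

# Verify that the exported graph is a well-formed ONNX model and inspect the converted operators.
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)
print([node.op_type for node in onnx_model.graph.node])
```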

The model can be abstracted as a graph, so the data structure of the model can be deconstructed into the following two main points:

1) Model topology connection

From the perspective of graph theory, these are the edges of the graph, corresponding to the data flow and control flow in the model. They determine subgraph expression, inputs and outputs, and control-flow structures. Different frameworks represent these differently and require equivalent conversion. For example, TensorFlow's loop control flow is converted to ONNX's Loop and If operators to avoid introducing cycles into the graph.

2) Definition of operator prototype

From the perspective of graph theory, these are the vertices of the graph, corresponding to the model's computing units, including the operator type, inputs and outputs, and attributes. Operators in different frameworks may share a name yet differ in semantics, so semantic equivalence must be analyzed and an appropriate mapping chosen. For example, Caffe's Slice operator needs to be converted to ONNX's Split. When no fully equivalent operator exists, a combination of operators must be used to express it.

After the conversion is complete, various optimizations are performed: some constant computations are finished ahead of time, related operators are merged, several simple operators are replaced by a single more powerful one, and operators are reordered according to their dependencies. These techniques of operator fusion, folding, replacement, and rearrangement closely resemble compiler optimizations; the goal is to compute in advance, reduce the amount of computation, and increase parallelism. Thorough optimization can only be carried out at the deployment stage because only then are the hardware backend and runtime environment known. Optimization in the deployment phase can effectively shrink the model, reduce computation, and improve execution efficiency, following the same compiler-style strategy of pre-computing, merging operations, and adjusting layout to better exploit the hardware.
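One hedged way to see such deployment-time optimizations in practice (assuming onnxruntime is installed and "model.onnx" is the placeholder file exported above) is to let ONNX Runtime apply its graph optimizations, which include constant folding and operator fusion, and save the optimized graph for inspection:

```python
import numpy as np
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Enable all graph optimizations: constant folding, node fusion, layout transformations, etc.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Persist the optimized graph so the effect of fusion/folding can be inspected offline.
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})
print(outputs[0].shape)
```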

2. Operator fusion 

Operator fusion is a model compression technique whose core idea is to merge several related operators into one. In deep learning models, a large amount of repeated computation and frequent memory access lead to inference latency and energy consumption problems. Operator fusion analyzes the model topology to find operator nodes that can be merged and "fuses" them into new operators according to certain rules. This is analogous to edge contraction in graph theory and effectively reduces the number of nodes and edges: fewer nodes mean fewer computational steps, and fewer edges mean fewer memory accesses. Operator fusion therefore reduces the model's computational complexity and storage footprint, lowering latency and power consumption during inference.

[Figure: Tiered storage architecture of a computer]

The performance improvement of operator fusion mainly comes from two points:

1) Make full use of cache and reduce memory access

CPU registers and multi-level caches are much faster than main memory. After fusion, the result of one computation can be kept in cache and read directly by the next computation, eliminating the cost of reading and writing memory.

2) Calculate in advance to eliminate redundancy

Multiple identical computations can be completed ahead of time to avoid repeating them during forward inference, especially inside loops. There is a huge speed gap between the CPU and memory: caches closer to the CPU are smaller and faster, while memory is slower and larger. By merging operations, operator fusion makes full use of the cache, eliminates redundancy, and reduces the number of memory accesses, which is the fundamental reason it can significantly reduce latency and power consumption.

3. Operator replacement

Operator replacement aims to replace operators in the model with operators that have the same computational logic but are better suited to online deployment. The technique simplifies an operator's formula through mathematical methods such as combining like terms and factoring out common factors, and maps the simplified formula to a more efficient class of operators. Operator replacement can reduce both the amount of computation and the model size.

[Figure: Batchnorm operator replacement]

By replacing the Batchnorm operator with a Scale operator, power consumption and performance can be optimized in the deployment phase. Note that this optimization applies only to deployment, because during training the parameters of the Batchnorm operator are treated as variables rather than constants. In addition, this optimization changes the structure of the model, reduces its expressive power, and may affect convergence and accuracy during training.
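The replacement itself is just arithmetic on the trained parameters. A small NumPy sketch (parameter values are illustrative placeholders) shows how an inference-time Batchnorm collapses into an equivalent per-channel scale and shift:

```python
import numpy as np

def batchnorm_to_scale(gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold inference-time Batchnorm y = gamma*(x-mean)/sqrt(var+eps) + beta
    into an equivalent per-channel Scale operator: y = scale*x + shift."""
    scale = gamma / np.sqrt(running_var + eps)
    shift = beta - running_mean * scale
    return scale, shift

# Toy per-channel Batchnorm parameters (placeholders for trained values).
gamma, beta = np.array([1.2, 0.8]), np.array([0.1, -0.3])
mean, var = np.array([0.5, -0.2]), np.array([0.25, 0.04])

scale, shift = batchnorm_to_scale(gamma, beta, mean, var)

x = np.random.randn(4, 2)                              # (batch, channels)
bn_out = gamma * (x - mean) / np.sqrt(var + 1e-5) + beta
scale_out = scale * x + shift
print(np.allclose(bn_out, scale_out))                  # True: the two operators are equivalent
```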

4. Operator rearrangement

Operator rearrangement reduces the computational cost of inference by reordering the topological order of operators in the model while preserving the inference results. Common techniques include moving slicing operators (such as Slice, StridedSlice, and Crop) earlier in the graph, and rearranging Reshape, Transpose, and BinaryOp operators. The goal is to reorganize the computation order so as to remove unnecessary computation and data movement and thus improve inference efficiency.

[Figure: Crop operator rearrangement]

By moving the Crop operator earlier, that is, cropping the feature map in advance, the computation of subsequent operators can be reduced, improving inference performance in the deployment stage. The gain from this optimization depends on the parameters of the Crop operator. Note, however, that the Crop operator can only be moved forward across element-wise operators. Previous experimental data show that such pre-deployment optimization can significantly reduce inference latency, power consumption, and memory usage.
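A tiny NumPy sketch of why this is valid for element-wise operators: cropping before or after ReLU gives the same result, but cropping first means the element-wise operator touches far fewer elements.

```python
import numpy as np

x = np.random.randn(1, 16, 64, 64)          # (N, C, H, W) feature map
relu = lambda t: np.maximum(t, 0)           # an element-wise operator

# Original order: ReLU over the full map, then crop a 32x32 window.
out_a = relu(x)[:, :, 16:48, 16:48]

# Rearranged order: crop first, then ReLU over only a quarter of the data.
out_b = relu(x[:, :, 16:48, 16:48])

print(np.allclose(out_a, out_b))            # True: results are identical
print(x.size, x[:, :, 16:48, 16:48].size)   # the rearranged path processes 4x fewer elements
```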

2. Model compression

Different hardware environments have different requirements. For example, when deploying models on resource-constrained devices such as mobile phones, there are usually strict limits on model size, typically a few megabytes. Larger models therefore usually require model compression techniques to fit different computing hardware. These techniques work by reducing the number of parameters, reducing model complexity, or applying quantization. Model compression reduces the model's storage footprint and computational overhead while preserving its performance, allowing it to adapt to different hardware platforms.

1. Quantization

Model quantization represents the continuously valued floating-point weights with lower-bit values at a small loss of inference accuracy. By using fewer bits to represent floating-point data, quantization shrinks the model and reduces memory consumption during inference. It can also speed up inference on processors that support low-precision operations.

[Figure: Quantization principle]

In computers, different data types occupy different numbers of bits and have different representable ranges. By quantizing model parameters into data types of different bit widths, the storage size of the model can be reduced as needed. Parameters in deep neural networks are usually represented as single-precision floating-point numbers; if they can be approximated with 8-bit signed integers, the storage size of the quantized weights shrinks to a quarter of the original. The fewer the quantization bits, the higher the compression ratio.

Quantization methods can further be divided into linear and nonlinear quantization according to whether the mapping is linear. In real deep neural networks, weights and activation values are usually unevenly distributed, so in theory nonlinear quantization can represent them more accurately and reduce accuracy loss. In practice, however, nonlinear quantization is computationally expensive at inference time, so linear quantization is usually used.
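A minimal sketch of asymmetric linear quantization to 8 bits (the scale/zero-point formulas follow the common uniform-quantization convention; this is an illustration, not any specific framework's implementation):

```python
import numpy as np

def quantize_linear(x, num_bits=8):
    """Map float values onto the integer range [0, 2^num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)   # float step per integer level
    zero_point = int(round(qmin - x.min() / scale))          # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_linear(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(1000).astype(np.float32)                 # float32 weights: 4000 bytes
q, scale, zp = quantize_linear(w)                            # uint8 weights: 1000 bytes (1/4 size)
w_hat = dequantize_linear(q, scale, zp)
print("max abs error:", np.abs(w - w_hat).max())
```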

1) Quantization-aware training

Quantization-aware training simulates quantization during training by inserting pseudo-quantization operators into the model. In each training iteration, it computes the ranges of the weights and activation values of the quantized network layers and introduces the quantization loss into forward inference and backpropagation. In this way, the optimizer can minimize the quantization error during training, yielding higher model accuracy.

Specifically, the process of quantization-aware training is as follows:

  • Initialization: set the ranges of the weights and activation values, and initialize the weights and activation values of the network.

  • Construct a simulated quantization network: insert pseudo-quantization operators after the weights and activation values that need to be quantized to simulate actual quantization operations.

  • Quantization training: repeat the following steps until the network converges. a. Compute the ranges of the weights and activation values of the quantized network layers. b. Based on those ranges, introduce the quantization loss into forward inference and backpropagation, and update the network parameters.

  • Export the quantized network: obtain the ranges of the weights and activation values of the quantized network layers and compute the quantization parameters. Substitute the quantization parameters into the quantization formula to convert the network's weights into quantized integer values. Finally, delete the pseudo-quantization operators and insert quantization and dequantization operators before and after the quantized network layers to obtain the final quantized network.

Through quantization-aware training, the model size can be reduced while preserving accuracy as much as possible, and inference is faster on processors with fast low-precision operations.
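A hedged PyTorch sketch of the pseudo-quantization ("fake quant") operator that QAT inserts: the forward pass simulates quantize-dequantize, and a straight-through estimator lets gradients flow through the non-differentiable rounding step. This is a simplified illustration, not a full QAT pipeline.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulate symmetric int8 quantize->dequantize in the forward pass;
    pass gradients straight through the rounding in the backward pass."""
    @staticmethod
    def forward(ctx, x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax          # per-tensor scale
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                              # straight-through estimator

w = torch.randn(4, 4, requires_grad=True)
loss = FakeQuant.apply(w).sum()
loss.backward()
print(w.grad)   # gradients exist, so the "quantized" weight can still be trained
```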

2) Post-training quantization

In post-training quantization, there are two common methods: weight quantization and full quantization.

Weight quantization quantizes only the model weights to reduce model size. During inference, the quantized weights are dequantized back to float32 and ordinary float32 operators are used. Its advantages are that no calibration dataset is needed, no quantized operators have to be implemented, and the accuracy error is small. However, because inference still runs on float32 operators, inference performance does not improve.

Full quantization quantizes not only the model weights but also the activation values, and accelerates inference by executing quantized operators. To quantize activation values, a calibration dataset must be provided to measure the distribution of activations in each layer and to calibrate the quantized operators. The calibration data can come from the training set or from real-scenario inputs, and usually only a small amount is needed.

When quantizing activation values, a histogram of the original float32 data is computed first. Candidate quantization parameters are then chosen from a given search space and used to quantize the activations; a histogram of the quantized data is computed, and the distribution difference before and after quantization is measured to select the quantization parameters.
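A rough NumPy sketch of the calibration idea (illustrative only, with placeholder data): histogram the activations collected from a few calibration batches, then pick the clipping threshold whose quantized distribution is closest to the original, here measured with KL divergence as one common choice of distance.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return np.sum(p * np.log((p + eps) / (q + eps)))

# Activations collected from a small calibration set (placeholder data).
acts = np.abs(np.random.randn(100_000).astype(np.float32))

ref_hist, _ = np.histogram(acts, bins=128, range=(0, acts.max()))

best_t, best_kl = None, np.inf
for t in np.linspace(acts.max() * 0.3, acts.max(), 20):       # candidate clipping thresholds
    clipped = np.clip(acts, 0, t)
    q = np.round(clipped / t * 127) / 127 * t                  # simulate 8-bit quantization
    q_hist, _ = np.histogram(q, bins=128, range=(0, acts.max()))
    kl = kl_divergence(ref_hist.astype(np.float64), q_hist.astype(np.float64))
    if kl < best_kl:
        best_t, best_kl = t, kl

print("chosen clipping threshold:", best_t)
```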

Furthermore, because quantization introduces inherent error, that error needs to be corrected. For example, in matrix multiplication, the quantized mean and variance must be corrected so that they are consistent with the float32 operator. By correcting the quantized data, the distribution after quantization can be kept consistent with the distribution before it.

As a general model compression method, quantization can significantly reduce the storage and computation of neural networks and has been widely used.

2. Model sparsification

Model sparsification reduces storage and computation costs by removing components of a neural network. It is a strong inductive bias introduced to reduce the model's computational complexity, similar to methods such as weight quantization, weight sharing, and pooling.

1) Motivation for model sparsification

The rationale for sparsification can be explained from two aspects:

  • Existing neural network models often have too many parameters, which makes the model overly complex and redundant.

  • For many visual tasks, the useful information in the activation value feature map only accounts for a small part of the entire image.

Based on these observations, sparsification reduces redundancy by removing weak connections in the weights or activations. Specifically, it prunes the weaker connections to zero based on connection strength (usually the absolute magnitude of the weight or activation), increasing the sparsity of the model and reducing its computation and storage requirements.

However, it should be noted that the higher the sparsity of a sparse model, the greater the accuracy drop of the model may be. Therefore, the goal of sparse models is to minimize the loss of accuracy while increasing sparsity.

2) Structured and unstructured sparsity

Weight sparsity can be divided into structured and unstructured sparsity. Structured sparsity prunes the model at the channel or convolution-kernel level, producing a regular, smaller weight matrix; this approach lends itself to accelerated computation on CPUs and GPUs, but the accuracy drop is significant. Unstructured sparsity can prune weights at any position in the weight matrix with a smaller accuracy drop, but its irregularity makes hardware acceleration difficult and causes problems such as reduced instruction-level and thread-level parallelism and lower memory-access efficiency. To overcome these issues, recent research has proposed methods that combine structured and unstructured sparsity to obtain the advantages of both while addressing their shortcomings.

3) Sparsification strategy

Concrete sparsification strategies involve pre-training, pruning, and fine-tuning. Pre-training means first training a dense model and then sparsifying and fine-tuning it. Pruning removes redundant weights from the model according to certain rules or criteria to make it sparser; it can be done all at once or alternated with training to gradually discover the redundant parts. Fine-tuning means continuing to train the pruned model to recover the accuracy of the sparse model.
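A minimal sketch of one-shot unstructured magnitude pruning, the simplest rule mentioned above: zero out the weights with the smallest absolute values and keep a mask so that fine-tuning can leave them at zero (the layer shape and sparsity level are illustrative).

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.7):
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)   # magnitudes below this get pruned
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

w = np.random.randn(256, 256).astype(np.float32)         # a dense layer's weight matrix
w_sparse, mask = magnitude_prune(w, sparsity=0.7)

print("sparsity:", 1.0 - mask.mean())                    # ~70% of the weights are now zero
# During fine-tuning, gradients would be multiplied by `mask` so pruned weights stay zero.
```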

Taking Deep Compression as an example, the sparse model after pruning can be further quantized, that is, using lower bit data to represent the weight. In addition, combined with Huffman coding, the storage space of the model can be further reduced. The combined application of these strategies can significantly reduce the storage requirements of the model while maintaining high accuracy.

[Figure: Deep Compression]

In addition to directly removing redundant neurons, dictionary-learning-based methods can remove useless weights in deep convolutional neural networks. Such a method transforms the original convolution kernels into a coefficient domain by learning a set of convolution-kernel bases and makes the coefficients sparse. For example, Bagherinezhad et al. proposed decomposing the original convolution kernel into a weighted linear combination of kernel bases and sparse coefficients. In this way, unnecessary weights can be removed effectively, reducing the model's storage requirements and computational complexity.

3. Knowledge distillation

The teacher-student neural network learning algorithm is also known as knowledge distillation. In practice, large deep networks often achieve excellent performance because over-parameterization helps generalization. In knowledge distillation, a larger, pre-trained, high-accuracy neural network is used as the teacher network, and a new, deeper and narrower neural network is used as the student network; the teacher's knowledge is transferred to the student through supervised learning. The key question in knowledge distillation is how to transfer the teacher's knowledge effectively so as to improve the student's performance and generalization ability.

[Figure: An attention-based teacher-student neural network learning algorithm]

Knowledge distillation is an effective method that can help optimize small networks and can be used in conjunction with other compression methods such as pruning and quantization. Efficient models with high accuracy and small computational load can be trained through knowledge distillation. This comprehensive application can significantly reduce model complexity while maintaining high performance.
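A hedged PyTorch sketch of the standard soft-target distillation loss (Hinton-style): a KL term between temperature-softened teacher and student outputs is blended with ordinary cross-entropy. The temperature and weighting are illustrative hyperparameters, not values from the source.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a soft-target KL term (teacher knowledge) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                    # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10, requires_grad=True)   # student outputs (batch=8, classes=10)
teacher_logits = torch.randn(8, 10)                        # pre-trained teacher outputs
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```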

4. Model inference

When deploying a trained model to computing hardware for inference, there are several key steps to go through:

  • Preprocessing: Preprocess the original data to make it suitable for input into the network for inference.

  • Inference execution: Deploy the trained and converted model to the device, perform calculations on the input data, and obtain the output data.

  • Post-processing: further process the model output results, such as applying filtering thresholds or other post-processing operations, to obtain the final results.

These steps are key to applying the trained model to actual scenarios, ensuring that the model can correctly process input data and generate accurate output.

1. Pre-processing and post-processing

1) Pre-processing

In machine learning, data preprocessing is a key step. Its purpose is to convert raw data into a form suitable for model input and eliminate noise and irrelevant information to improve the performance and reliability of the model.

Data preprocessing usually includes the following aspects:

  • Feature encoding: Converting raw data into a numerical form that can be processed by machine learning models. This involves converting different types of data (such as text, images, audio, etc.) into numerical representations, such as using one-hot encoding, ordinal encoding, etc.

  • Data normalization: Standardize the data so that it has the same scale and range, eliminating dimensional differences between different features. Common normalization methods include min-max scaling and Z-score normalization.

  • Handle outliers: outliers are data points that differ significantly from the rest and can negatively affect the model's performance. Detecting and handling them improves the accuracy and robustness of the model.

Through data preprocessing, we can extract and highlight valuable features while removing irrelevant information and noise, providing better input to the model, thereby improving the model's performance and generalization ability.
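A small NumPy sketch of the two normalization schemes mentioned above (min-max scaling and Z-score standardization), applied column-wise to a toy feature matrix with made-up values:

```python
import numpy as np

X = np.array([[180.0, 75.0],      # toy features: height (cm), weight (kg)
              [165.0, 60.0],
              [172.0, 68.0]])

# Min-max scaling: map each feature to the range [0, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Z-score standardization: zero mean, unit variance per feature.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_zscore.mean(axis=0).round(6), X_zscore.std(axis=0))   # ~0 mean, 1 std per feature
```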

2) Post-processing

After model inference is completed, the output data usually needs to be post-processed to obtain more reliable and useful results. Common data post-processing methods include:

  • Discretize continuous data: If the output of the model is a continuous value, but the actual application requires discrete values, you can use methods such as rounding and thresholding to convert the continuous data into discrete data to obtain actually usable results.

  • Data visualization: By presenting data in the form of graphics or tables, you can more intuitively understand the relationships and trends between data, thereby deciding the next analysis strategy.

  • Manually adjust the prediction range: In some cases, a regression model may not predict extreme values ​​accurately and the results are concentrated in the middle range. In this case, you can manually adjust the forecast range by multiplying it by a factor to zoom in or out to get a more accurate forecast.

These post-processing methods can further optimize the output results of the model to make it more consistent with actual application needs and provide more useful information for users to make decisions and analysis.

2. Parallel computing

To improve inference performance, inference frameworks often use multi-threading to exploit multi-core processors. The main idea is to split the input data into multiple small blocks and use multiple threads to perform operator computation in parallel. Each thread processes a different data block, realizing parallel operator computation and greatly improving performance. In this way, the computing power of a multi-core processor can be fully utilized, accelerating inference and improving system responsiveness.

[Figure: Data partitioning for matrix multiplication]

To perform matrix multiplication with multiple threads, the computation can be partitioned along the rows of the left matrix and distributed across a thread pool for parallel operator computation (a simplified sketch follows the two approaches below). In industry there are two common approaches:

1) Use the OpenMP programming interface, which provides a set of cross-platform shared-memory multi-threaded programming APIs. With OpenMP's "parallel for" directive, the code in the body of a for loop can be executed by multiple threads in parallel.

2) The inference framework implements its own thread-pool mechanism for parallel operator computation. Compared with the OpenMP interface, the framework's own thread pool can parallelize operators in a more targeted manner, offering higher performance and a lighter-weight implementation.
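As a simplified Python sketch of the row-splitting idea (illustrative only: a real inference framework would do this in C/C++ with its own thread pool or OpenMP; here NumPy's matmul releases the GIL, so the threads can actually run in parallel):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_matmul(A, B, num_threads=4):
    """Split the left matrix by rows and compute each block on its own thread."""
    row_blocks = np.array_split(A, num_threads, axis=0)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial = pool.map(lambda block: block @ B, row_blocks)
    return np.vstack(list(partial))

A = np.random.rand(1024, 512).astype(np.float32)
B = np.random.rand(512, 256).astype(np.float32)

C = parallel_matmul(A, B)
print(np.allclose(C, A @ B))   # True: same result as the single-threaded product
```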

3. Operator optimization

For deep learning networks, framework scheduling often takes up very little time, and performance bottlenecks usually occur during the execution of operators. To improve performance, operators can be optimized from both the hardware instruction and algorithm perspectives.

From a hardware instruction perspective, specific hardware instruction sets, such as SIMD (Single Instruction Multiple Data) instruction sets, can be used to process multiple data in parallel. This can reduce the number of instruction executions and improve calculation efficiency. In addition, hardware accelerators, such as GPU (Graphics Processing Unit) or TPU (Tensor Processing Unit), can be used to accelerate the execution of operators.

From an algorithmic perspective, performance can be improved by optimizing the implementation of the algorithm. For example, a more efficient matrix multiplication algorithm can be used to reduce the number of multiplications and additions. Approximate computing methods such as quantization (converting floating point numbers to integers) or pruning (removing unimportant weights) can also be used to reduce computational effort and storage requirements.

1) Hardware instruction optimization

Most devices have a CPU, so the time the operator spends on the CPU is particularly important.

  • Assembly language

High-level programming languages rely on a compiler to translate code into sequences of machine instructions, so their efficiency is limited by the compiler's capabilities. Assembly language, by contrast, is closer to machine language and allows specific instruction sequences to be written directly. Programs written in assembly take up less storage, execute faster, and are more efficient.

In practice, most code is written in high-level languages, while the performance-critical parts are written in assembly, combining the advantages of both. In deep learning, operators such as convolution and matrix multiplication involve a large amount of computation; writing them in assembly can significantly improve the performance of training and inference, often by tens to hundreds of times.

  • Registers and NEON instructions

ARMv8-series CPUs have 32 NEON registers, each 128 bits wide. This means each register can hold 4 float32 values, 8 float16 values, or 16 int8 values. These registers can be used to process multiple data elements in parallel, increasing computational efficiency.

[Figure: Structure of NEON register v0 in an ARMv8 processor]

The NEON instruction set is the SIMD instruction set of ARMv8-series processors. It operates on multiple data elements at once, improving the speed of data access and computation. Compared with traditional single-data instructions, one NEON instruction can process several data elements at a time.

For example, NEON's fmla instruction performs multiply-accumulate on floating-point values in multiple registers at the same time. Its usage looks like "fmla v0.4s, v1.4s, v2.4s", where v0, v1, and v2 are NEON registers and ".4s" means each register holds 4 float32 values. The instruction multiplies the values at corresponding positions in v1 and v2 and accumulates the results into the corresponding positions in v0.

By using the NEON instruction set, developers can make full use of the processor's parallel computing capabilities to process large amounts of data in a more efficient manner. The advantages of this parallel computing can be reflected in many applications, especially in areas involving large-scale matrix operations, image processing, and signal processing.

[Figure: Computation performed by the fmla instruction]
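As a rough Python analogy (not actual NEON code), the following sketch mimics the lane-wise multiply-accumulate that one fmla instruction performs over four float32 lanes, and how a dot product is built from repeated fmla steps in a matmul-style kernel:

```python
import numpy as np

# Three "registers", each holding 4 float32 lanes, as in v0.4s, v1.4s, v2.4s.
v0 = np.zeros(4, dtype=np.float32)
v1 = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
v2 = np.array([0.5, 0.5, 0.5, 0.5], dtype=np.float32)

v0 += v1 * v2          # one simulated fmla: lane-wise multiply, accumulate into v0
print(v0)              # [0.5 1.  1.5 2. ]

# A dot product processed 4 lanes at a time, mirroring how fmla is used in kernels.
a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)
acc = np.zeros(4, dtype=np.float32)
for i in range(0, a.size, 4):
    acc += a[i:i + 4] * b[i:i + 4]     # one simulated fmla per 4-element chunk
print(np.isclose(acc.sum(), np.dot(a, b), rtol=1e-4))
```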

  • Assembly language optimization

When writing assembly language programs, some optimization strategies can be adopted to improve the performance of the program, especially by increasing the cache hit rate to speed up data access.

A common optimization strategy is loop unrolling, which improves performance by using more registers to reduce memory accesses. In addition, instruction rearrangement is also an effective optimization method. By rearranging the execution order of instructions, the pipeline can be more fully utilized and delays can be reduced.

When using NEON registers, blocking the registers sensibly reduces register idle time and increases register reuse. In addition, rearranging the storage order of the computation data improves the cache hit rate and keeps the memory accessed by load and store instructions contiguous.

Finally, prefetch instructions can load data that is about to be used from main memory into the cache in advance, reducing access latency. Applied together, these optimization strategies can improve program performance significantly, by tens to hundreds of times.

5. Security protection of models

The security protection of the model can be divided into two aspects: static protection and dynamic protection. Static protection mainly focuses on the security of the model during transmission and storage. Currently, the commonly used method is to encrypt the model file, transmit and store it in the form of ciphertext, and decrypt it in the memory for inference. However, this approach runs the risk of the plaintext model in memory being stolen.

Dynamic protection protects the model while it is running. There are currently three common technical routes. The first is protection based on a TEE (Trusted Execution Environment), which isolates a secure area with trusted hardware and decrypts and runs the model inside it. This approach has little impact on inference latency, but it requires specific hardware support and has limitations when protecting large-scale deep models. The second is protection based on encrypted (ciphertext) computation, which uses cryptographic methods to keep the model in ciphertext during transmission, storage, and execution.

This approach does not depend on specific hardware, but it incurs large computational and communication overhead and cannot protect the structural information of the model. The third is protection based on obfuscation: by scrambling the model's computational logic, an adversary cannot understand its internal structure even after obtaining the model. Compared with the first two approaches, obfuscation has lower performance overhead and smaller accuracy loss, does not rely on specific hardware, and can support the protection of large models.

Model obfuscation technology is a method that can automatically obfuscate the calculation logic of plaintext AI models, making it impossible for attackers to understand its internal logic even if they obtain the model during transmission and storage. At the same time, this technology can also ensure the confidentiality of the model during runtime, and will not affect the original inference results of the model, only causing a small inference performance overhead.

[Figure: Model obfuscation implementation steps]

With reference to the figure above, the execution steps of model obfuscation are explained in detail:

For a trained model, you first need to parse its model file and obtain the graphical expression (computation graph) of the model's calculation logic for subsequent operations. This calculation graph includes information such as node identifiers, node operator types, node parameter weights, and network structure.

After obtaining the computation graph, techniques such as graph compression and graph expansion can be used to obfuscate the dependencies between nodes and hide the true computational logic of the model. Graph compression scans the whole graph to match key subgraph structures in the original network and compresses and replaces these subgraphs with single new computing nodes. On the compressed graph, graph expansion further hides the true dependencies between nodes by adding new input/output edges to the network structure; these newly added edges can originate from or point to existing nodes, or to the obfuscation nodes added in this step.

Then, to strengthen the obfuscation, the computation graph can also be scrambled. Common scrambling methods include adding redundant nodes and edges and fusing parts of subgraphs; all of these operations serve the purpose of model obfuscation.

Next, the processed computation graph is traversed and the nodes that need protection are selected. For these nodes, the node identifier, operator type, and other attributes that describe the node's computational logic are replaced with symbols carrying no semantic information. For identifier anonymization, the anonymized identifiers must remain unique so that different nodes can still be distinguished. For operator-type anonymization, to avoid an explosion of operator types when anonymizing large computation graphs, nodes of the same operator type can be divided into several disjoint sets, and the operator types of nodes within the same set are replaced with the same anonymous symbol. This ensures the model can still be identified and executed after anonymization.

Afterwards, each weight that needs protection is scrambled with random noise and a mapping function. Each weight can use different random noise and a different mapping function, and the scrambling must not affect the correctness of the model's execution results.
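A toy NumPy sketch of this idea (purely illustrative, not the actual obfuscation scheme): a weight tensor is scrambled with a random affine mapping before being stored, and the matching obfuscated operator applies the inverse mapping at run time so the computed result is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.standard_normal((64, 32)).astype(np.float32)      # an operator's weight (placeholder)

# Offline: scramble the stored weight with random per-tensor noise a, b.
a = rng.uniform(0.5, 2.0)
b = rng.standard_normal()
w_scrambled = w * a + b                                    # what gets written into the model file

def obfuscated_matmul(x, w_stored, a, b):
    """Obfuscated operator: undo the scrambling on the fly, then compute as usual."""
    w_real = (w_stored - b) / a                            # inverse of the scrambling mapping
    return x @ w_real

x = rng.standard_normal((8, 64)).astype(np.float32)
print(np.allclose(obfuscated_matmul(x, w_scrambled, a, b), x @ w, atol=1e-4))  # results match
```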

After completing the above steps, the processed computation graph is saved as a model file for later use. At the same time, for each operator type that needs protection, a morphological transformation is performed to generate several candidate obfuscated operators. There is a one-to-many correspondence between an original operator and its obfuscated operators, and the number of candidates equals the number of node sets created in the previous step. On this basis, using the anonymized operator types, operator input/output relationships, and other information obtained in steps (2), (3), and (4) above, the interface of the corresponding operator can be transformed. Interface transformations include, but are not limited to, input/output transformation and interface-name transformation: the input/output transformation modifies the input and output data of the original operator, while the name transformation replaces the original operator name with the anonymized name generated earlier. This ensures the model can still be identified and executed after anonymization, and the operator name no longer reveals its computational logic.

In addition, the code implementation of the operator also needs to be transformed. Code transformations include, but are not limited to, string encryption, redundant code, and other software code obfuscation techniques. These techniques ensure that the obfuscated operator implements computational logic semantically identical to the original operator while making it difficult to read and understand. Note that different operators can use different combinations of code obfuscation techniques. Moreover, the obfuscated operator of any operator whose parameters were scrambled in the earlier weight-scrambling step must also implement the inverse of the weight-scrambling mapping, so that the noise is removed dynamically during execution and the results of the obfuscated model stay consistent with the original model.

The generated obfuscated operators are then saved as library files for later use. These library files contain all the required obfuscated operators and their code implementations.

After completing all the above steps, the obfuscated model file and the corresponding obfuscated operator library are deployed to the target device, which is then ready to perform model inference.

Before executing an inference task, the obfuscated model file is first parsed according to the model structure to obtain the graphical expression of the model's computational logic (the obfuscated computation graph).

This obfuscated computation graph then needs to be initialized to generate an execution task sequence. Depending on the security configuration options: if the model must remain protected at run time, the obfuscated computation graph is initialized directly; if protection is only needed during transmission and storage, the obfuscated graph can first be restored in memory to the original computation graph, which is then initialized to generate the execution task sequence. Each computing unit in the task sequence corresponds to the execution of an obfuscated or original operator. This can further reduce the performance overhead of inference.

Finally, given the inference data supplied by the AI application, each computing unit in the execution task sequence is traversed to obtain the inference results. If the operator corresponding to the current computing unit is an obfuscated operator, the obfuscated operator library is called; otherwise, the original operator library is called. This completes the entire inference task.

Misunderstandings about GPUs in the field of generative AI

Generative AI has fired our imaginations, showing unprecedented potential in everything from predictive maintenance to patient diagnostics to customer support. This potential relies on the accelerated computing capabilities provided by GPUs, enabling them to efficiently train complex language models. However, oversimplification or blind faith in GPUs can lead to unexpected problems, delays or even failure of data science projects. Here are five misconceptions about GPUs to avoid when building AI projects.

1. GPUs are delivering the fastest results

Before GPUs, approximately 70% of each time step was spent copying data between the various stages of the data pipeline. By enabling massively parallel computing, GPUs can significantly reduce infrastructure costs and deliver superior performance for end-to-end data science workflows: 12 NVIDIA GPUs can deliver the same deep learning performance as 2,000 modern CPUs, and adding 8 additional GPUs to the same server provides up to 55,000 additional cores. Although GPUs speed up computation, research points out that they can spend half their time waiting for data, which means you still end up waiting for results. Fully utilizing GPU computing power requires correspondingly powerful network and storage support.

2. Bandwidth is king, and bandwidth is all that matters

While bandwidth is a key metric for optimizing GPU utilization, it does not capture all the characteristics of AI workloads. Optimizing data flow requires thinking about more than transferring large amounts of data to the GPU: IOPS and metadata are equally important, and different steps of the data pipeline have different IO requirements that traditional storage may not meet.

In addition to bandwidth, performance characteristics such as IOPS, latency, and metadata operations also need to be considered. Some steps require low latency and random small IO, some require massive streaming bandwidth, and some require a concurrent mix of both. Multiple data processes run simultaneously, increasing the need to process different IO profiles simultaneously.

3. GPU-driven AI workloads always face challenges when processing small files.

Training large language models for most generative AI applications involves large numbers of small files, such as millions of small images and logs from IoT devices. ETL workflows normalize the data and train the model with stochastic gradient descent, which leads to heavy metadata and random-read loads, especially in the first part of the deep learning pipeline, where many small IO requests arise that many storage platforms cannot handle efficiently.

4. Storage? The focus of GPUs is computing power

AI workloads have special requirements for performance, availability, and flexibility, which are difficult to meet with traditional storage platforms. Choosing the right storage solution for AI workloads can have a significant impact on meeting business needs. Successful AI projects tend to grow rapidly in terms of compute and storage needs, so the impact of storage choices needs to be carefully considered. However, most of the attention and investment in AI infrastructure is focused on GPUs and networks, leaving little budget for storage devices. For AI storage, performance is equally important at scale, not just meeting traditional requirements.

5. Local storage is the fastest storage method for GPU

As AI datasets continue to grow, data loading time has become a workload performance bottleneck. Retrieving data from local NVMe storage avoids transfer bottlenecks and delays, but server hosts can no longer keep up with the growth in GPU speed, and GPUs end up constrained by slow IO.

NVIDIA L40S GPU architecture and comparison with the A100 and H100

At SIGGRAPH 2023, NVIDIA launched the new NVIDIA L40S GPU and NVIDIA OVX server equipped with L40S. These products are mainly aimed at the training and inference of generative artificial intelligence models, and are expected to further improve the computing efficiency of training and inference scenarios of generative artificial intelligence models.

The L40S is based on the Ada Lovelace architecture and is equipped with 48GB of GDDR6 memory and 864GB/s of bandwidth. With fourth-generation Tensor Cores and the FP8 Transformer Engine, it can provide more than 1.45 PFLOPS of tensor processing power. According to data given by NVIDIA, in fine-tuning and inference test cases the L40S shows improved computing efficiency compared with the A100.

Compared with the A100 GPU, the L40S differs in the following aspects:

1. Video memory type

The L40S uses the more mature GDDR6 memory technology. Although its memory bandwidth is lower than that of the HBM used by the A100 and H100, GDDR6 is a highly mature technology with ample market supply.

2. Computing power performance

The L40S improves on the A100 in FP16 compute (AI compute) and improves even more clearly in FP32 compute (general-purpose compute), making it better suited to scientific computing and similar scenarios.

3. Energy consumption performance

Compared with the A100, the L40S has reduced power, which is beneficial to reducing the relevant energy consumption of the data center and improving energy efficiency.

4. Cost-effectiveness comparison

According to data from Super Micro, the L40S has a cost-performance advantage over the A100, giving users who want to deploy efficient and competitive generative AI solutions more choices.

[Figure: The L40S has a differentiated design from the A100 and H100]

Like the A100, the L40S communicates with the CPU over a 16-lane PCIe Gen 4 interface with a maximum bidirectional transfer rate of 64 GB/s. By contrast, NVIDIA Grace Hopper uses NVLink-C2C technology to connect the Hopper-architecture GPU to the Grace-architecture CPU, achieving a total CPU-to-GPU and GPU-to-GPU bandwidth of up to 900 GB/s, 7 times faster than PCIe Gen 5.

[Figure: The PCIe protocol limits the communication bandwidth of the L40S]

The L40S GPU, based on the Ada Lovelace architecture, is equipped with GDDR6 memory and 864GB/s of bandwidth, and provides more than 1.45 PFLOPS of tensor processing power through fourth-generation Tensor Cores and the FP8 Transformer Engine. For compute-intensive tasks, the L40S's 18,176 CUDA cores can provide nearly 5 times the single-precision floating-point performance of the A100, accelerating complex calculations and data-intensive analysis.

In addition, to support professional visual workloads such as real-time rendering, product design, and 3D content creation, the L40S is equipped with 142 third-generation RT Cores delivering 212 TFLOPS of ray-tracing performance; its power consumption is 350 watts. For generative AI workloads, the L40S can achieve up to 1.2x higher inference performance and up to 1.7x higher training performance than the A100. Alongside the L40S GPU, NVIDIA also launched an OVX server that can carry up to 8 L40S GPUs. NVIDIA announced that for a GPT3-40B model with 860 million tokens, the OVX server can complete fine-tuning in just 7 hours, and for the Stable Diffusion XL model it can generate 80 images per minute.

Blue Ocean Brain Large Model Training Platform

The Blue Ocean Brain large-model training platform provides powerful computing support, including AI accelerators based on high-speed interconnection of open acceleration modules. It is configured with high-speed memory and supports a fully interconnected topology to meet the communication requirements of tensor parallelism in large-model training. It supports high-performance I/O expansion and can scale out to a 10,000-card AI cluster to meet the communication needs of pipeline and data parallelism for large models. It features a powerful hot-swappable liquid cooling system and intelligent power management: when the BMC receives a PSU failure or error warning (such as a power outage, surge, or overheating), it automatically forces the system CPU into ULFM (ultra-low-frequency mode) to minimize power consumption. The platform is committed to providing customers with environmentally friendly, green high-performance computing solutions through low-carbon, energy-saving design, and is mainly used in deep learning, academic education, biomedicine, earth exploration, meteorology and oceanography, supercomputing centers, AI, big data, and other fields.

1. Why do we need large models?

1. The model effect is better

Large models perform better than ordinary models across a wide range of scenarios.

2. Stronger creative ability

Large models can perform content generation (AIGC) to facilitate large-scale content production

3. Flexible customization of scenarios

By providing examples, large models can be customized for a wide range of application scenarios.

4. Less labeled data

By learning a small amount of industry data, large models can cope with the needs of specific business scenarios.

2. Platform features

1. Heterogeneous computing resource scheduling

A comprehensive solution based on general-purpose servers and dedicated hardware for scheduling and managing multiple heterogeneous computing resources, including CPUs, GPUs, etc. Through powerful virtualization management functions, underlying computing resources can be easily deployed and various models can be run efficiently. At the same time, the hardware acceleration capabilities of different heterogeneous resources are fully utilized to speed up the running and generation speed of the model.

2. Stable and reliable data storage

Supports multiple storage type protocols, including block, file and object storage services. Pool storage resources to achieve free circulation of models and generated data, improving data utilization. At the same time, data protection mechanisms such as multiple copies, multi-level fault domains, and fault self-recovery are adopted to ensure the safe and stable operation of models and data.

3. High-performance distributed network

Provides networking for compute and storage resources, forwarding traffic through a distributed network mechanism that passes physical network performance through transparently, significantly improving the efficiency and performance of model computation.

4. Comprehensive security guarantee

In terms of model hosting, a strict permission management mechanism is adopted to ensure the security of the model warehouse. In terms of data storage, measures such as privatized deployment and data disk encryption are provided to ensure the security and controllability of data. At the same time, during the model distribution and operation process, comprehensive account authentication and log audit functions are provided to fully ensure the security of the model and data.

3. Common configurations

1. Processor CPU:

Intel Xeon Gold 8358P 32C/64T 2.6GHz 48MB, DDR4 3200, Turbo, HT, 240W

Intel Xeon Platinum 8350C 32C/64T 2.6GHz 48MB, DDR4 3200, Turbo, HT, 240W

Intel Xeon Platinum 8458P 28C/56T 2.7GHz 38.5MB, DDR4 2933, Turbo, HT, 205W

Intel Xeon Platinum 8468 Processor 48C/64T 2.1GHz 105MB Cache, 350W

AMD EPYC™ 7742 64C/128T, 2.25GHz to 3.4GHz, 256MB, DDR4 3200MT/s, 225W

AMD EPYC™ 9654 96C/192T, 2.4GHz to 3.55GHz to 3.7GHz, 384MB, DDR5 4800MT/s, 360W

2. Graphics card GPU:

NVIDIA L40S GPU 48GB×8

NVIDIA NVLink-A100-SXM640GB

NVIDIA HGX A800 80GB×8

NVIDIA Tesla H800 80GB HBM2

NVIDIA A800-80GB-400Wx8-NvlinkSW×8


Origin blog.csdn.net/LANHYGPU/article/details/133608330