[Translation] DeepSpeed: Extreme-scale model training for everyone

We released DeepSpeed in February this year. It is an open-source deep learning training optimization library containing a new memory optimization technology, ZeRO (Zero Redundancy Optimizer), which greatly advances large-model training by improving scale, speed, cost, and usability. DeepSpeed helped researchers develop the Turing Natural Language Generation model (Turing-NLG), which at the time of publication was the world's largest language model (17 billion parameters) with state-of-the-art accuracy. In May we released ZeRO-2, which supports training models with 200 billion parameters up to 10 times faster than the state of the art, together with a set of compute, I/O, and convergence optimizations that power the fastest BERT training. Since then, we have continued to innovate at a rapid pace, pushing the boundaries of speed and scale in deep learning training.

Today, we are very happy to share new developments that not only push deep learning training to the extreme, but also make the technology available to many more people, from data scientists training on supercomputers to those training on low-end clusters or even a single GPU. Specifically, DeepSpeed adds four new system technologies that further our AI at Scale initiative and drive innovation in Microsoft's AI products and platforms. These technologies make extremely efficient use of compute, memory, and communication, and enable training models with billions to trillions of parameters. They also support very long input sequences and can be used on high-end clusters with thousands of GPUs, on low-end clusters with slow Ethernet, or even on a single GPU.

  • Achieving trillion-parameter model training with 3D parallelism:  DeepSpeed implements a flexible combination of three parallelism approaches: ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism. 3D parallelism adapts to the needs of different workloads, supporting very large models with trillions of parameters while achieving near-perfect memory scalability and throughput scaling efficiency. In addition, its improved communication efficiency lets users train multi-billion-parameter models 2-7 times faster on conventional clusters with limited network bandwidth.
  • ZeRO-Offload enables training 10x larger models on a single GPU:  To leverage both CPU and GPU memory for training large models, we extend ZeRO-2. Using a machine with a single NVIDIA V100 GPU, our users can run models with up to 13 billion parameters  without running out of GPU memory , 10 times larger than existing methods allow, while maintaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens a window for many deep learning practitioners to explore bigger and better models.
  • Execute 10x longer sequences 6x faster with DeepSpeed Sparse Attention:  DeepSpeed provides sparse attention kernels, a key technology for supporting long sequences of model inputs, whether text, image, or speech. Compared with the classic dense Transformer, it supports input sequences an order of magnitude longer and achieves up to 6x faster execution with comparable accuracy. It is also 1.5–3x faster than state-of-the-art sparse implementations. Moreover, our sparse kernels flexibly support arbitrary sparse formats, enabling users to innovate with custom sparse structures.
  • 1-bit Adam reduces communication by up to 5x:  Adam is an effective (and perhaps the most widely used) optimizer for training large-scale deep learning models. However, it is generally incompatible with communication-efficient optimization algorithms, so communication overhead can become a bottleneck when scaling across distributed devices. We introduce the new 1-bit Adam algorithm and its efficient implementation, which reduces communication volume by up to 5 times while achieving convergence similar to Adam. In communication-constrained scenarios we observed up to 3.5x faster distributed training, which allows the algorithm to scale to different types of GPU clusters and network environments.


This blog post dives deeper into these four technologies, all of which have been released in the open-source DeepSpeed project.

3D Parallelism: Scaling to Trillion Parameter Models

With the rapid growth of compute available on modern GPU clusters, training a powerful trillion-parameter model with astonishing capabilities is no longer out of reach and may be achievable in the near future. DeepSpeed combines three powerful technologies to train trillion-scale models and scale to thousands of GPUs: data-parallel training, model-parallel training, and pipeline-parallel training. Their symbiosis scales deep learning training far beyond what each strategy can offer alone. 3D parallelism simultaneously addresses the two fundamental challenges of training trillion-parameter models: memory efficiency and computational efficiency. As a result, DeepSpeed can scale to fit the most massive models in GPU memory without sacrificing speed.

Understand the memory and computational efficiency challenges of training huge models

Memory efficiency: The memory required to train a trillion-parameter model far exceeds what a single GPU provides. Mixed-precision training with the Adam optimizer requires approximately 16 TB of memory just to store the model states (parameters, gradients, and optimizer states). For comparison, the most advanced NVIDIA A100 GPU has only 40 GB of memory; 400 such GPUs would be needed just to hold the model states.
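
As a sanity check on these numbers, here is a quick back-of-the-envelope calculation; the 16-bytes-per-parameter breakdown (fp16 weights and gradients plus fp32 master weights, momentum, and variance), following the ZeRO paper, is an assumption of this sketch.

```python
# Rough bookkeeping for mixed-precision Adam model states, assuming 16 bytes per parameter:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4) + fp32 momentum (4) + fp32 variance (4).
params = 1.0e12                                   # one trillion parameters
bytes_per_param = 2 + 2 + 4 + 4 + 4
total_bytes = params * bytes_per_param
print(f"model states: {total_bytes / 1e12:.0f} TB")        # ~16 TB
print(f"A100s (40 GB) needed: {total_bytes / 40e9:.0f}")   # ~400 GPUs just for the model states
```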

Activations consume additional memory that grows with the batch size. With a batch size of just 1, training a trillion-parameter model produces more than 1 TB of activation memory. Activation checkpointing, which trades recomputation for memory, can reduce this to about 20 GB, but that is still too much for training.

The model states and activation memory must therefore be efficiently partitioned across multiple GPU devices for such a large model to train without running out of memory.

Computational efficiency: End-to-end training of a trillion-parameter model is estimated to require approximately 5,000 zettaflops (a 5 followed by 24 zeros; this estimate is based on OpenAI's scaling-laws research). Training such a model would take 4,000 A100 GPUs running at 50% compute efficiency about 100 days.
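
The 100-day figure can be reproduced with a one-line estimate; the 312-teraflops fp16 peak of the A100 is an assumption used here purely for the arithmetic.

```python
# End-to-end compute estimate: ~5,000 zettaflops at 50% efficiency on 4,000 A100 GPUs.
total_flops = 5e24                                # 5,000 zettaflops
gpus, peak_flops, efficiency = 4000, 312e12, 0.5  # assumed A100 fp16 tensor-core peak
seconds = total_flops / (gpus * peak_flops * efficiency)
print(f"{seconds / 86400:.0f} days")              # roughly 93 days, i.e. about 100 days
```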

Although large-scale supercomputing GPU clusters can have more than 4000 GPUs, it is still a challenge to achieve high computing efficiency at this scale due to batch size limitations. Computational efficiency increases with the ratio of computation time to communication time. This ratio is proportional to the batch size. However, there is an upper limit to the batch size for training a model - beyond which convergence can deteriorate significantly.

In fact, one of the largest models, GPT-3, was trained with a batch size of about 1,500. Even with a generous batch size of 4,000 spread over roughly 4,000 GPUs, each GPU would get a batch size of only 1, which limits scalability.

Understand the trade-offs between data parallelism, model parallelism, and pipeline parallelism

Data parallelism is a commonly used technique in deep learning. Each batch of input training data is split evenly among the data-parallel workers. After backpropagation, the gradients are communicated and reduced to ensure that the optimizer makes the same update on every worker. Data parallelism has several clear advantages, including high computational efficiency and little implementation effort. However, its batch size grows with the number of workers, and we cannot keep increasing the batch size indefinitely without affecting convergence.

  • Memory efficiency: Data parallelism replicates the model and optimizer across all workers, so it is not memory efficient. DeepSpeed developed  ZeRO , a set of optimizations that improve the memory efficiency of data parallelism. This work relies on ZeRO Stage 1, which partitions the optimizer states across workers to eliminate redundancy.
  • Computational efficiency: The computation performed by each worker stays constant as we increase the degree of parallelism, so data parallelism achieves near-linear scaling at small scales. However, the communication cost of reducing gradients across workers grows with the model size, so computational efficiency is limited when the model is large or the network bandwidth is low. Gradient accumulation is a common strategy for amortizing this communication cost: each worker performs forward and backward passes on several micro-batches locally, accumulating the gradients, before the gradient reduction and optimizer update, which further increases the effective batch size (see the sketch below).
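
A minimal PyTorch sketch of gradient accumulation; the model, data, and accumulation count below are toy placeholders, not values from the text.

```python
import torch
from torch import nn

model = nn.Linear(128, 10)                           # toy stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
accum_steps = 8                                      # micro-batches per optimizer update

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(4, 128)                          # one local micro-batch
    y = torch.randint(0, 10, (4,))
    loss = criterion(model(x), y) / accum_steps      # scale so accumulated grads average out
    loss.backward()                                  # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        # In data-parallel training, the gradient all-reduce would happen here, once per
        # accum_steps micro-batches, amortizing the communication cost.
        optimizer.step()
        optimizer.zero_grad()
```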

Model parallelism is a broad class of techniques that partitions the individual layers of the model across multiple workers. By its nature, the computation and communication of model parallelism are specific to the model architecture, so it can require significant implementation effort. DeepSpeed leverages NVIDIA's  Megatron-LM  to provide large-scale model parallelism for Transformer-based language models. Model parallelism reduces memory usage proportionally to the number of workers and is the most memory-efficient of the three forms of parallelism, but it pays for this with the lowest computational efficiency.

  • Memory efficiency: Model parallelism reduces memory usage proportionally to the number of workers. Crucially, it is the only approach that reduces the activation memory of an individual layer. DeepSpeed further improves memory efficiency by partitioning the activation memory among the model-parallel workers.
  • Computational efficiency: Because activations must additionally be communicated in every forward and backward pass, model parallelism has low computational efficiency. It requires high communication bandwidth and does not scale well beyond a node where that bandwidth is limited. In addition, each model-parallel worker performs less computation between communication stages, which also hurts computational efficiency. Model parallelism is often used together with data parallelism to trade off between memory and computational efficiency.

This release of DeepSpeed also includes a pipeline-parallel training engine! Pipeline parallelism divides the layers of the model into stages that can be processed in parallel. When a stage completes the forward pass of a micro-batch, its activations are communicated to the next stage of the pipeline. Similarly, when the next stage completes its backward pass, gradients are communicated backward through the pipeline. Multiple micro-batches must be kept in flight so that the pipeline stages can compute in parallel. Several approaches, such as  PipeDream , have been developed to trade off memory, computational efficiency, and convergence behavior. DeepSpeed's approach extracts parallelism from gradient accumulation and maintains the same convergence behavior as traditional data-parallel and model-parallel training at the same total batch size.
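
For reference, here is a minimal sketch of how a model might be handed to this engine. The API names follow DeepSpeed's pipeline-parallelism tutorial, the config values are hypothetical, and the script would be launched with the deepspeed launcher on multiple GPUs.

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# Express the model as a flat list of layers so the engine can split it into stages.
layers = [nn.Linear(1024, 1024) for _ in range(24)]
model = PipelineModule(layers=layers, num_stages=4, loss_fn=nn.MSELoss())

ds_config = {                                   # hypothetical values for illustration
    "train_batch_size": 256,                    # = micro-batch x accumulation x data-parallel degree
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

# engine.train_batch(data_iter) consumes enough micro-batches to fill the pipeline,
# accumulates their gradients, and then takes one optimizer step.
```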

  • Memory efficiency: Pipeline parallelism reduces memory proportionally to the number of pipeline stages, allowing the model size to scale linearly with the number of workers. However, it does not reduce the memory footprint of each layer's activations. In addition, each worker must store the activations of all micro-batches in flight. In effect, the activation memory on the first pipeline stage is roughly the same as the total activation memory of a single micro-batch. A trillion-parameter model needs about 19 GB of activation memory for one micro-batch, almost half of the total memory of the new NVIDIA A100 GPU.
  • Computational efficiency: Pipeline parallelism has the lowest communication volume of the three, since it only communicates the activations at the boundaries between stages. However, it cannot scale indefinitely. Like model parallelism, increasing the pipeline depth decreases the computation per pipeline stage, which lowers the computation-to-communication ratio. Pipeline parallelism also requires the computational load of each stage to be perfectly balanced to achieve good efficiency.

In addition, pipeline parallelism incurs a bubble overhead from filling and draining the pipeline at the beginning and end of each batch. Training with a number of gradient accumulation steps (and thus a batch size) equal to 4x or 8x the number of pipeline stages achieves 81% and 90% scaling efficiency, respectively, relative to a single pipeline stage (see the arithmetic below).
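
These scaling-efficiency numbers are consistent with the usual bubble model, in which a pipeline with p stages processing m micro-batches per batch spends roughly m / (m + p - 1) of its time doing useful work; this is an assumed simplification, not DeepSpeed's exact schedule.

```python
def pipeline_efficiency(num_stages: int, micro_batches: int) -> float:
    # Fraction of useful work under the simple fill-and-drain bubble model.
    return micro_batches / (micro_batches + num_stages - 1)

stages = 32                                    # hypothetical pipeline depth
for multiple in (4, 8):
    m = multiple * stages
    print(f"{multiple}x micro-batches: {pipeline_efficiency(stages, m):.0%}")
# Prints 81% and 89%, close to the 81% / 90% figures quoted above.
```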

Simultaneously achieve high memory efficiency and high computational efficiency through 3D parallelism

Data, model, and pipeline parallelism all play specific roles in improving memory and computational efficiency. Figure 1 illustrates our 3D strategy.

Memory efficiency: The layers of the model are first divided into pipeline stages, and the layers of each stage are then further partitioned via model parallelism. This 2D combination simultaneously reduces the memory consumed by the model, the optimizer, and the activations. However, we cannot partition the model indefinitely without incurring communication overhead, which would limit computational efficiency.

Computational efficiency: To scale the number of workers beyond what model and pipeline parallelism support without sacrificing computational efficiency, we use ZeRO-powered data parallelism (ZeRO-DP). ZeRO-DP not only further improves memory efficiency by partitioning the optimizer states, but also scales to an arbitrary number of GPUs with minimal communication overhead thanks to topology-aware mapping.

3D mapping based on communication topology (Figure 2) : By leveraging two key architectural properties, we carefully map each dimension in 3D parallelism onto workers to achieve maximum computational efficiency.

  1. Optimizing intra-node and inter-node communication bandwidth: Model parallelism has the largest communication overhead of the three strategies, so we prioritize placing model-parallel groups within a node to exploit the larger intra-node bandwidth. Here we apply tensor-slicing model parallelism based on NVIDIA Megatron-LM. When the model-parallel group does not occupy all the workers in a node, we place data-parallel groups within the node as well; otherwise, data parallelism runs across nodes. Pipeline parallelism has the lowest communication volume, so we can schedule pipeline stages across nodes without being limited by the communication bandwidth.
  2. Increasing bandwidth through parallel communication: The volume of gradients each data-parallel group needs to communicate decreases linearly with the degree of pipeline and model parallelism, so the total communication volume is smaller than with pure data parallelism. In addition, each data-parallel group communicates independently among a small set of localized workers, and the groups communicate in parallel with one another. As a result, the effective bandwidth for data-parallel communication is amplified by the combination of reduced communication volume, increased locality, and parallelism (the short script after Figure 2 reproduces this mapping).


Figure 1: An example of 3D parallelism with 32 workers. The layers of the neural network are divided into four pipeline stages. The layers in each pipeline stage are further divided between four model parallel workers. Finally, there are two data parallel instances per pipeline stage, and ZeRO divides the optimizer state between the two replicas.


Figure 2: Mapping of workers from Figure 1 to GPUs on a system of eight nodes with four GPUs per node. GPUs of the same color are on the same node.
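
As an illustration, the mapping in Figures 1 and 2 can be reproduced with a few lines; the ordering of the dimensions (model parallelism varying fastest so that each model-parallel group stays inside a node) is an assumption chosen to match the figures, not code from DeepSpeed.

```python
NODES, GPUS_PER_NODE = 8, 4
PIPE, MODEL, DATA = 4, 4, 2            # 4 pipeline stages x 4-way model x 2-way data = 32 workers

for rank in range(NODES * GPUS_PER_NODE):
    model_rank = rank % MODEL                  # fastest-varying: model-parallel group fits in a node
    data_rank = (rank // MODEL) % DATA
    pipe_rank = rank // (MODEL * DATA)         # slowest-varying: pipeline stages span nodes
    node, gpu = divmod(rank, GPUS_PER_NODE)
    print(f"worker {rank:2d} -> node {node}, gpu {gpu} | "
          f"pipe {pipe_rank}, model {model_rank}, data {data_rank}")
```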

Learn more about 3D parallel training of trillion-parameter models

Using 8-way model parallelism, 64-way pipeline parallelism, and 8-way data parallelism, a trillion-parameter model can be trained scalably on 4096 NVIDIA A100 GPUs.

By combining model parallelism and pipeline parallelism, 3D parallelism achieves excellent memory efficiency and high computational efficiency across multiple nodes. Model parallelism reduces the memory consumed by activations and model states within a node, while pipeline parallelism (compared with using model parallelism alone) stores model states efficiently across nodes without sacrificing computational efficiency. In our trillion-parameter example with a micro-batch size of 1, using activation checkpointing and the 3D parallelism described above, the model states consume 30 GB of GPU memory and the partitioned activations consume another 2.5 GB. The total footprint of 32.5 GB allows a 40 GB NVIDIA A100 GPU to hold and train such a model.
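
The 30 GB figure for the model states follows directly from the 16 bytes of model state per parameter assumed in the memory discussion earlier, divided across the 8-way model-parallel and 64-way pipeline-parallel dimensions.

```python
params, bytes_per_param = 1.0e12, 16
model_parallel, pipeline_stages = 8, 64
per_gpu = params * bytes_per_param / (model_parallel * pipeline_stages)
print(f"{per_gpu / 1e9:.1f} GB of model states per GPU")   # ~31 GB, i.e. roughly the 30 GB quoted above
```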

Combining model parallelism with pipeline parallelism also lets pipeline parallelism achieve high computational efficiency with minimal bubble overhead even at very small batch sizes. With 8-way model parallelism, a micro-batch of 1 per model corresponds to an effective micro-batch of 1/8 per GPU. Therefore, using a gradient accumulation count of 8x the pipeline-parallel degree yields an aggregate batch size of only 1 per GPU, yet pipeline parallelism still reaches 90% computational efficiency. Combined with data parallelism, this gives a total effective batch size of 4,096 on 4,096 GPUs while maintaining 90% pipeline efficiency.

But how does data parallelism affect computational efficiency? Doesn't data parallelism require each GPU to have a large batch in order to remain efficient?

Model parallelism can reduce the effective batch size per GPU to below 1, which allows pipeline parallelism to hide the pipeline bubble even with small batches. Note that by using pipeline parallelism across nodes, the data-parallel nodes of each pipeline stage can communicate independently of, and in parallel with, the other pipeline stages. In the fully connected network topologies common in high-end GPU clusters, this has an important implication for the effective communication bandwidth available to data-parallel training. Since every node in a pipeline stage can communicate in parallel with its corresponding data-parallel nodes, the effective communication bandwidth is proportional to the number of pipeline stages. With 64 pipeline stages, the effective bandwidth is 64 times the bandwidth to and from a single node. With such a large effective bandwidth, data parallelism scales efficiently even at the small batch sizes where the computation-to-communication ratio is very low.

Train Trillion Parameter Models with Linear Scalability

DeepSpeed can train a language model with one trillion parameters using as few as 800 NVIDIA V100 GPUs (Figure 3). Scaling both the model size and the training throughput, we observe linear growth in each, demonstrating simultaneous memory and computational efficiency. Across the various configurations, we can train approximately 1.4 billion parameters per GPU, which is the most a single GPU can support without running out of memory, indicating near-perfect memory scaling. We also achieve close-to-perfect linear scaling of compute, with a throughput of 47 teraflops per V100 GPU. This is impressive scalability and throughput for the given hardware.

Figure 3: Model size (in billions of parameters) and training throughput (in petaflops) as a function of the number of GPUs. DeepSpeed can train a model with 1 trillion parameters using 800 NVIDIA V100 Tensor Core GPUs with 32 GB of memory. Each configuration uses 16-way model parallelism provided by NVIDIA Megatron-LM, with the remaining GPUs used for pipeline parallelism. The trillion-parameter model has 298 Transformer layers with a hidden size of 17,408, a sequence length of 2,048, and a batch size of 2,048. For smaller models, we reduce the number of Transformer layers and the batch size in proportion to the number of GPUs.

An in-depth look at how 3D parallelism can speed up training models at GPT-3 scale

Figure 4: System performance of 2D and 3D parallelism when training a GPT-3-scale model with 180 billion parameters on 800 GPUs. The model has 100 Transformer layers, a hidden size of 12,288, and 96 attention heads, trained with a batch size of 2,048 and a sequence length of 2,048. ZeRO-1 is enabled alongside data parallelism. P, M, and D denote the pipeline, model, and data parallel dimensions, respectively.

In Figure 4, we use the latest  GPT-3  model architecture with over 175 billion parameters as a baseline for 3D parallelism:

  • We first evaluate the 2D configurations (C1-C3). Configurations C1 and C2 use only pipeline and model parallelism; they can train the model but achieve poor throughput and GPU utilization because the model is over-decomposed. C3 attempts to use only pipeline and data parallelism, but cannot fit the model in memory without Megatron's model parallelism to reduce the activation footprint.
  • The 3D configurations (C4-C10) successively increase the degree of pipeline parallelism; the middle configurations, which balance the three forms of parallelism, perform best by achieving memory, compute, and communication efficiency at the same time.
  • The best 3D configurations achieve 49 teraflops per GPU, about 40% of the theoretical hardware peak.

See how hybrid parallelism speeds up training GPT-2 by 7x on a low-bandwidth cluster

We trained a 1.5-billion-parameter GPT-2 model and show the communication advantage of hybrid parallelism in Figure 5. To emphasize the communication stages of training, we train on a four-node cluster with low inter-node bandwidth:

  • Model parallelism offers no advantage here because of the smaller model and the low intra-node bandwidth.
  • Pipeline parallelism communicates an order of magnitude less data than the data-parallel and model-parallel configurations, and is 7 times faster at small batch sizes.
  • Data parallelism uses gradient accumulation to amortize its communication overhead with a larger batch size, but even at larger batch sizes the pipeline-parallel configuration is still twice as fast as data parallelism.
  • The hybrid pipeline and data-parallel configuration avoids gradient communication bottlenecks by confining data-parallel groups to GPUs within a node, so gradient communication benefits from faster intra-node bandwidth.

Figure 5: Throughput vs. batch size when training GPT-2 (1.5B parameters) with a sequence length of 1,024. Training uses four nodes, each with four V100 GPUs with 16 GB of memory. The GPUs are connected with 50 Gbps intra-node bandwidth and 4 Gbps inter-node bandwidth. DP denotes data parallelism with ZeRO-1 enabled. All methods scale the batch size by increasing the number of gradient accumulation steps.

ZeRO-Offload: Train a 10x larger model on a single GPU

ZeRO-Offload increases the maximum model size that can be trained efficiently with fewer GPU resources by exploiting the compute and memory resources of both the GPU and the host CPU. It lets us train models of up to 13 billion parameters on a single V100, 10 times larger than the state of the art, while sustaining a high training throughput of 30 teraflops per GPU.

By enabling a single GPU to train models with billions of parameters, ZeRO-Offload makes large model training accessible and allows deep learning practitioners with limited hardware resources to participate.


Figure 6: Maximum model size that can be trained on a single GPU using default PyTorch and ZeRO-Offload.

The key technology behind ZeRO-Offload is offloading the optimizer states and gradients to CPU memory, building on ZeRO-2. This approach lets ZeRO-Offload minimize the computational efficiency lost to copying data to the CPU while achieving the same, and sometimes even better, efficiency than ZeRO-2. The figure below shows the architecture of ZeRO-Offload:

Figure 7: ZeRO-Offload overview.

Learn how ZeRO-Offload trains billion-parameter models on a single GPU

Training models with billions of parameters, such as GPT and T5, requires many GPUs just to hold the model and its states. Large-model training usually works around this memory limit with model parallelism across GPUs. Recently we released ZeRO, a memory-efficient optimizer that partitions the model states (optimizer states, gradients, and parameters) across data-parallel GPUs, allowing multi-billion-parameter models to be trained without model parallelism. However, ZeRO still requires a large number of data-parallel GPUs to hold the partitioned model states, so only a few people have access to the resources needed to train such models.

ZeRO-Offload makes large-model training possible on a single GPU, democratizing this kind of training. To train multi-billion-parameter models without multiple GPUs, ZeRO-Offload inherits ZeRO-2's partitioning of optimizer states and gradients. Unlike ZeRO-2, however, instead of keeping a partition of the optimizer states and gradients on each GPU, ZeRO-Offload moves both to host memory. The optimizer states stay in CPU memory for the entire training process. Gradients are computed on the GPU during the backward pass and averaged with reduce-scatter; each data-parallel process then offloads its share of the averaged gradients to the CPU (g offload in Figure 7) and discards the portions it is not responsible for.

Once the gradients are on the CPU, the partitioned optimizer states are updated in parallel on the CPU (p update in Figure 7). After the update, the partitioned parameters are moved back to the GPU and an all-gather operation updates the full set of parameters (g swap in Figure 7). ZeRO-Offload also overlaps communication (such as g offload and g swap) with computation (such as backpropagation and p update) on separate CUDA streams to improve training efficiency.
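
To make the data flow concrete, here is a toy single-GPU sketch of the idea (not DeepSpeed's implementation): gradients are copied to a CPU-resident fp32 master copy, the Adam step runs on the CPU, and the updated parameters are copied back to the GPU.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
model = nn.Linear(1024, 1024).to(device=device, dtype=dtype)          # parameters live on the GPU
cpu_master = [p.detach().float().cpu() for p in model.parameters()]   # fp32 master copy on the CPU
cpu_optimizer = torch.optim.Adam(cpu_master, lr=1e-3)                 # optimizer states stay in CPU memory

x = torch.randn(8, 1024, device=device, dtype=dtype)
loss = model(x).float().pow(2).mean()
loss.backward()

for p, m in zip(model.parameters(), cpu_master):
    m.grad = p.grad.detach().float().cpu()        # "g offload": move gradients to the CPU
cpu_optimizer.step()                              # "p update": Adam update runs on the CPU
with torch.no_grad():
    for p, m in zip(model.parameters(), cpu_master):
        p.copy_(m.to(device=device, dtype=dtype)) # "g swap": updated parameters back to the GPU
model.zero_grad()
```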

The advantages of ZeRO-Offload in terms of model size, training speed and scalability

10x larger models: Figure 6 shows that on a single 32 GB V100 GPU, PyTorch can train a model with at most 1.3 billion parameters, while ZeRO-Offload can train a model with 13 billion parameters, 10 times larger. This is because ZeRO-Offload keeps the optimizer states, which consume most of the GPU memory, in host memory for the entire training process, and also moves gradients to the CPU as they are computed during the backward pass. The GPU memory saved this way can then be used to train larger models.

Efficient training throughput: As Figure 8 shows, when training a 10-billion-parameter model, ZeRO-Offload sustains more than 30 teraflops per GPU even when only a single GPU is used, and its throughput grows almost perfectly linearly with the number of GPUs.

ZeRO-Offload complements ZeRO-2 nicely, enabling efficient training of large models on a small number of GPUs. By using CPU memory to reduce the GPU memory a model needs, ZeRO-Offload makes it feasible to train large models on 1 to 16 GPUs. On 32 GPUs, ZeRO-Offload performs slightly better than ZeRO-2; the improvement comes from the GPU memory that ZeRO-Offload saves, which allows training with larger batches, so GPU efficiency improves despite the overhead of copying to the CPU. With more GPUs (such as 64 and 128), ZeRO-2 outperforms ZeRO-Offload, because both can now run similarly sized batches, while ZeRO-2 has no overhead of moving data to the CPU and optimizer updates are much faster on the GPU than on the CPU. In summary, ZeRO-Offload complements ZeRO-2 and extends the ZeRO family of optimizations for large-model training across the full spectrum, from a single device to thousands of devices.


Figure 8: Comparison of training throughput between ZeRO-Offload and ZeRO-2 using 128 GPUs to train the 10 billion parameter GPT-2 model.

DeepSpeed Sparse Attention Mechanism: Execute 10x Longer Sequences 6x Faster

Attention-based deep learning models such as Transformers are highly effective at capturing relationships between tokens in an input sequence, even across long distances. They are therefore often used with text, image, and speech inputs, where the sequence length can reach thousands of tokens. However, although the attention module effectively captures dependencies within long sequences, in practice support for long inputs is limited by compute and memory: the computation and memory requirements grow quadratically with the sequence length \(n\).

To address this limitation, DeepSpeed provides sparse attention kernels, a key technology that reduces the compute and memory requirements of attention by orders of magnitude via block-sparse computation. The toolkit not only relieves the memory bottleneck of attention, but also performs the sparse computation very efficiently. Its API integrates easily into any Transformer-based model. In addition to offering a variety of sparse structures, it can flexibly handle any user-defined block-sparse structure.

More specifically, sparse attention (SA) can be designed to compute local attention between nearby tokens, or global attention via summary tokens computed with local attention. SA also supports random attention and any combination of local, global, and random attention, as shown by the blue, orange, and green blocks in Figure 10. This lets SA reduce the memory footprint to \(O(wn)\), where \(1 < w \le n\) is a parameter that depends on the attention structure (a toy construction of such a layout follows Figure 10).


Figure 10: Variable sparse structure
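
Here is a toy construction of such a block-level layout mixing local, global, and one random block per row, purely for illustration; DeepSpeed expresses these layouts through its sparsity-config objects rather than a raw mask like this.

```python
import torch

seq_len, block = 2048, 64
nb = seq_len // block                        # number of blocks along each axis
layout = torch.zeros(nb, nb, dtype=torch.bool)

for i in range(nb):                          # local window: each block attends to its neighbors
    layout[i, max(0, i - 1): i + 2] = True
layout[:, 0] = True                          # global attention to and from the first block
layout[0, :] = True
layout[torch.arange(nb), torch.randint(0, nb, (nb,))] = True   # one random block per row

kept = int(layout.sum())
print(f"{kept} of {nb * nb} blocks kept ({kept / (nb * nb):.1%} dense)")
```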

Efficient implementation on GPUs: Although a basic implementation of sparse attention saves memory, it can be computationally even worse than the dense computation, mainly because sparsity fragments memory accesses. Developing efficient sparse kernels is generally challenging, especially on GPUs. DeepSpeed provides efficient sparse attention kernels developed in Triton. These kernels follow a block-sparse paradigm that enables aligned memory accesses, reduces GPU thread divergence, and balances the workload across processors.

System performance: As shown in Figure 11, SA supports 10x longer sequences and up to 6.3x faster computation. The left plot shows the longest sequence length that can be run with BERT-Base and BERT-Large under three settings: dense, dense with activation checkpointing, and sparse (SA) with activation checkpointing. Compared with dense, SA supports 10x and 16x longer sequences for BERT-Base and BERT-Large, respectively. Furthermore, compared with dense, SA reduces the total computation and improves training speed, with gains that grow with sequence length: up to 6.3x faster for BERT-Base and up to 5.3x faster for BERT-Large.

Figure 11: Maximum supported sequence length of the BERT model (left); time to train BERT-Base (middle) and BERT-Large (right) with different sequence lengths on a single NVIDIA V100 GPU.

Learn how SA can achieve accuracy comparable to or better than full dense attention

Related sparse attention work (Sparse Transformer, Longformer, BigBird) shows comparable or higher accuracy than full attention, which matches our experience. Beyond lower memory overhead and faster computation, we also observe higher accuracy and faster convergence with SA in production models. The figure below illustrates the accuracy of training a BERT-based production model for long-text understanding (sequence length 2,048) under three settings: dense training from scratch, SA training from scratch, and SA training continued from a dense checkpoint pretrained with sequence length 512. We observe that when pretraining from scratch, SA converges faster and reaches better accuracy than the dense setting. Continuing training from a pretrained checkpoint with SA performs even better, in both time and accuracy.

Figure 12: Accuracy of long text understanding application

See how SA compares to the state-of-the-art Longformer

We compared SA with Longformer, a recent sparse structure and implementation. In our experiments, SA uses "Fixed" sparsity. The two implementations achieve comparable accuracy. In terms of system performance, SA outperforms Longformer in both training and inference:

  • 1.5x faster execution of MLM pretraining on Wikitext103
  • 3x faster inference for BERT-Base (batch size 1, sequence length 2,048)

Flexible handling of any block-sparse structure:  The DeepSpeed sparse attention suite does not target any specific sparse structure, so it effectively supports model researchers exploring arbitrary block-sparse structures. Currently we include popular structures such as Fixed (from OpenAI Sparse Transformer), BigBird (from Google, https://arxiv.org/pdf/2007.14062.pdf), and BSLongformer (a block-sparse implementation of AI2 Longformer). We also provide the "variable" template shown in Figure 10, which can be used to customize a block-sparse structure with any combination of random, local, and global attention patterns.

1-bit Adam: 5x less communication and 3.4x faster training

Scalable training of large models such as BERT and GPT-3 requires careful optimization spanning model design, architecture, and system capabilities. From a systems standpoint, communication efficiency has become a major bottleneck, especially on commodity systems that use standard TCP and have limited network bandwidth.

Communication compression is an important technique for reducing training time on such systems. One of the most effective ways to compress communication is error-compensated compression, which provides robust convergence even with 1-bit compression. However, state-of-the-art error compensation techniques only apply to simple optimizers that depend linearly on the gradient, such as stochastic gradient descent (SGD) and momentum SGD. They cannot be combined with nonlinear optimizers like Adam, which delivers the best convergence and accuracy on many tasks, including training BERT-like models.

Developing error-compensated compression for a powerful optimizer like Adam is challenging because Adam depends nonlinearly on the gradient (through its variance term), which has limited the practical value of advanced communication compression techniques.

Understand the background of classic compression techniques

One way to compress communication is 1-bit compression, which essentially transmits only the sign of each element (together with a scaling factor), so that each number is represented with 1 bit instead of 32, a 32x reduction. The problem is that this naive approach slows convergence considerably and has little practical value. Recent studies show that by using error-compensated compression, we can expect nearly the same convergence rate as without compression.

The idea of error compensation can be summarized as: 1) compress, 2) memorize the compression error, and then 3) add the compression error back in the next iteration. For SGD, error-compensated compression gives

\( x_t = x_{t-1} - \gamma\, C(g_t + e_{t-1}), \qquad e_t = (g_t + e_{t-1}) - C(g_t + e_{t-1}) \)

where \(C(\cdot)\) is the 1-bit compression operator. The advantage of this kind of error compensation is that the compression errors \(e_t\) and \(e_{t-1}\) eventually cancel each other out, since

\( x_t = x_{t-1} - \gamma\,(g_t + e_{t-1} - e_t) \)
This strategy has been proven to work for all optimization algorithms that linearly depend on gradients, such as SGD and Momentum SGD.
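
The following toy example illustrates the three steps numerically for SGD on a simple quadratic. The scaled-sign compressor used here (scale by the mean absolute value) is one common choice and an assumption of this sketch, not necessarily the exact operator described above.

```python
import torch

def one_bit(x: torch.Tensor) -> torch.Tensor:
    # 1 bit per element (the sign) plus a single scale factor.
    return x.abs().mean() * x.sign()

torch.manual_seed(0)
w = torch.randn(1000)
error = torch.zeros_like(w)               # compression error carried to the next step
lr = 0.1

for _ in range(200):
    grad = w                              # gradient of the quadratic 0.5 * ||w||^2
    compressed = one_bit(grad + error)    # 1) compress the gradient plus the remembered error
    error = (grad + error) - compressed   # 2) memorize the new compression error
    w = w - lr * compressed               # 3) update with the compressed gradient
print(f"final loss: {0.5 * w.pow(2).sum().item():.4f}")   # the loss shrinks steadily despite 1-bit updates
```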

Understand the challenges of applying error compensation to Adam

We give an overview of the Adam algorithm below. Its update rules are:

\( m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t \)
\( v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, (g_t)^2 \)
\( x_t = x_{t-1} - \gamma\, \dfrac{m_t}{\sqrt{v_t} + \eta} \)
As the formulas above show, the variance term \(v_t\) depends nonlinearly on the gradient \(g_t\). If we apply ordinary error compensation to Adam, we find that Adam fails to converge (see Figure 13).

Figure 13: Error compensation compression does not work with Adam due to non-linear dependence on gradients

Compress communication with 1-bit Adam

To compress communication when using the Adam optimizer, we developed  1-bit Adam , which addresses the nonlinear dependence on the gradient through preprocessing. We observe that the variance term \(v_t\) changes very little after a few training epochs, so setting \(v_t\) to a constant afterwards does not change the convergence rate. The proposed 1-bit Adam optimizer therefore consists of two parts (shown in Figure 14): a warm-up phase, which is essentially the vanilla Adam algorithm, and a compression phase, which keeps the variance term constant and compresses the remaining linear term (the momentum) into a 1-bit representation.

The switch to the compression phase is controlled by a threshold parameter (shown in Figure 14): when we detect that the change in the "variance" has fallen below this threshold, we enter the compression phase. Our research shows that the warm-up phase only needs 15-20% of the total training steps.

Learn more about the underlying mechanism of 1-bit Adam

The weights in 1-bit Adam are updated as follows: during the compression phase, each worker \(i\) updates its momentum with its local gradient, compresses the momentum to 1 bit using error compensation, and the compressed momenta are aggregated across workers; the weights are then updated from the aggregated momentum and the frozen variance term, as in Adam with a constant \(v\).

Figure 14: Comparison of distributed training processes using the classic Adam algorithm and the 1-bit compressed Adam algorithm
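
Below is a single-process toy sketch of the two-phase idea (warm-up with full Adam statistics, then freeze the variance and apply 1-bit compression with error feedback to the momentum). The distributed compressed all-reduce and the server-side error compensation are omitted, so this is conceptual rather than DeepSpeed's implementation.

```python
import torch

def one_bit(x: torch.Tensor) -> torch.Tensor:
    return x.abs().mean() * x.sign()

torch.manual_seed(0)
w = torch.randn(1000)
m = torch.zeros_like(w)
v = torch.zeros_like(w)
error = torch.zeros_like(w)
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
freeze_step = 100                                 # end of the warm-up phase

for t in range(1, 501):
    g = w                                         # gradient of 0.5 * ||w||^2
    m = beta1 * m + (1 - beta1) * g
    if t <= freeze_step:
        v = beta2 * v + (1 - beta2) * g * g       # warm-up: ordinary Adam statistics
        update = m
    else:
        compressed = one_bit(m + error)           # compression phase: only a 1-bit momentum
        error = (m + error) - compressed          # would need to be communicated
        update = compressed                       # the variance v stays frozen
    w = w - lr * update / (v.sqrt() + eps)
print(f"final loss: {0.5 * w.pow(2).sum().item():.4f}")
```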

Addressing the System Challenge of 1-bit Adam

Besides the algorithmic challenge, applying 1-bit Adam in a training system poses two system challenges. First, we need an efficient kernel that converts the momentum into its 1-bit representation. Second, we need an efficient communication scheme to exchange the compressed momentum across GPUs. The goal of compression is to reduce overall training time so that bandwidth-constrained commodity systems can be used to train large models. We address these challenges in DeepSpeed and fully optimized the 1-bit Adam implementation for training on systems with limited communication bandwidth.

Advantages of 1-bit Adam on communication-constrained systems

1-bit Adam converges like Adam while reducing  communication volume by up to 5 times . It achieves up to 3.5x higher throughput for BERT-Large pre-training and up to  2.7x higher throughput  for SQuAD fine-tuning. The end-to-end throughput gains come from the 6.6x (Figure 15, left) and 6.2x (Figure 15, right) speedups observed during the compression phase. Notably, our 1-bit Adam optimizer scales very well on a 40 Gb Ethernet system, where its performance is comparable to Adam's scalability on a 40 Gb InfiniBand QDR system. We note that the effective bandwidth on 40 Gb Ethernet is 4.1 Gbps based on the iPerf benchmark, whereas InfiniBand provides a near-peak bandwidth of 32 Gbps based on the InfiniBand perftest microbenchmark.

Figure 15: 1-bit Adam scalability for BERT-Large pretraining (left) and SQuAD fine-tuning (right) on NVIDIA V100 GPUs. The batch size of BERT pre-training is 16/GPU, and the batch size of SQuAD fine-tuning is 3/GPU.

Dive deeper into 1-bit Adam's evaluation results

Same convergence as Adam: One major question about 1-bit Adam is its convergence speed. We find that 1-bit Adam achieves the same convergence rate and comparable performance using the same number of training samples; see Figure 16.

Figure 16: Using the same number of training samples, 1-bit Adam can converge as well as Adam.

Table 1 shows detailed results for BERT-Base and BERT-Large. For both the uncompressed and compressed cases, 1-bit Adam performs on par with the original model, and in some cases outperforms it.

Table 1: Verification of the correctness of 1-bit Adam on various test tasks

Up to 5x less communication:  1-bit Adam provides the same convergence as Adam while reducing communication by 16x during the compression phase of 16-bit (FP16) training. For BERT pre-training, this yields an overall 5x reduction, since we observe that the warm-up phase accounts for only 15% of the end-to-end training time.

The ratio of the communication volume of original Adam to that of 1-bit Adam is:

1 / (warmup + (1 - warmup) / 16)

With warmup = 0.15, this gives 1 / (0.15 + 0.85/16) ≈ 4.9, i.e., roughly a 5x reduction.

1-bit Adam makes BERT-Large training 3.5x faster:  We present BERT-Large training results on two bandwidth-limited systems: 1) 40 Gbps Ethernet (Figure 17, left) and 2) 40 Gbps InfiniBand QDR (Figure 17, right). During the compression phase, we observe 6.6x higher system throughput over Ethernet and 2x higher throughput over InfiniBand, with end-to-end speedups (including warm-up and compression phases) of 3.5x and 2.7x, respectively. 1-bit Adam benefits mainly from the reduced communication volume (thanks to compressed momentum exchange) and our custom  allreduce  operation, implemented with an efficient non-blocking 1-bit gather followed by an  allgather  operation.

It is worth noting that one can also reduce communication volume for BERT pre-training by using LAMB instead of Adam and increasing the total batch size. 1-bit Adam, however, avoids this kind of demanding hyperparameter tuning, which in our experience is usually harder at large batch sizes. Moreover, 1-bit Adam also suits workloads with a small critical batch size (which do not converge well with large batches), such as many fine-tuning tasks.

Figure 17: Performance of BERT-Large training with 1-bit Adam on 40 Gbps Ethernet (left) and InfiniBand (right) during the compression phase

1-bit Adam makes SQuAD fine-tuning 2.7x faster:  1-bit Adam offers scalability not only on large-scale training tasks but also on tasks such as SQuAD fine-tuning. As Figure 18 shows, 1-bit Adam scales well on both Ethernet-based and InfiniBand-based systems and delivers up to 6.2x higher throughput (during the compression phase) on the Ethernet-based system, resulting in a 2.7x end-to-end speedup (with 25% of training in the warm-up phase and 75% in the compression phase). For SQuAD fine-tuning, we observe the best F1 score at a total batch size of 96; larger batch sizes lower the convergence rate and require additional hyperparameter tuning. To scale to 32 GPUs, we therefore run small batches of 3-4 per GPU, which makes fine-tuning communication-intensive and hard to scale. 1-bit Adam addresses this scalability problem well: without enlarging the batch size, it reduces communication volume by 3.4x and thereby achieves a 2.7x end-to-end speedup.

Figure 18: Performance of the compression phase using 1-bit Adam in SQuAD fine-tuning tasks on 40 Gbps Ethernet (left) and InfiniBand (right).


Check out the DeepSpeed website and GitHub repository for the code, tutorials, and documentation for these new technologies! We have also integrated some of these techniques into  ONNX Runtime .

About our amazing collaborators:

  • We would like to thank our academic collaborator Philippe Tillet of Harvard University, who co-developed the sparse attention kernels with us using the Triton compiler.
  • ZeRO-Offload was co-developed with Jie Ren, an intern from UC Merced. We also thank Dong Li of UC Merced, as well as Bharadwaj Pudipeddi and Maral Mesmakhouroshahi of Microsoft (of the  L2L work ), for their discussions on the topic.
  • 1-bit Adam was co-developed by Hanlin Tang, an intern from the University of Rochester.
  • We also appreciate the strong cooperation from NVIDIA, especially the Megatron-LM team.

About the DeepSpeed team:

We are a group of researchers and engineers passionate about large-scale system performance optimization: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Reza Yazdani Aminabadi, Elton Zheng, Arash Ashari, Jing Zhao, Minjia Zhang, Niranjan Uma Naresh, Shaden Smith, Ammar Ahmad Awan, Conglong Li, Yuxiong He (team lead). Recently we have been focusing on deep learning systems, optimizing their training speed, convergence speed, and development speed!
