Extreme optimization of data parallelism in distributed training: ZeRO


Preface

With the explosion of ChatGPT, large models have become a research hotspot in artificial intelligence. Their capabilities are impressive, but so is the cost of training them. As the name suggests, the defining feature of a large model is that it is "large", which usually refers to its enormous number of parameters. In distributed training, the key question is therefore how to train larger models with limited GPU memory. Common paradigms for distributed training include data parallelism and model parallelism, and model parallelism further divides into tensor parallelism and pipeline parallelism. Tensor parallelism, as implemented in frameworks such as Megatron-LM, is already standard for training large models, but data parallelism, as the most concise, easiest-to-understand, and easiest-to-implement distributed training paradigm, has also seen significant optimization in recent years. This article introduces the extreme optimization of data parallelism in distributed training: ZeRO.

In data parallelism, an obvious problem is that each GPU keeps a complete copy of the model and its optimization state (model gradients, Adam states, etc.), which is highly redundant. Could each GPU instead keep only part of the model state? ZeRO (Zero Redundancy Optimizer) is an efficient data-parallel solution proposed by Microsoft in 2019. ZeRO eliminates this redundancy in data-parallel distributed training while maintaining low communication volume and high computational granularity, which allows us to train larger models with limited GPU memory. In recent years, well-known distributed training frameworks such as Microsoft's DeepSpeed and PyTorch's FSDP have built their data parallelism on ZeRO.

Data Parallel Space Complexity Analysis

Let's take the commonly used ADAM optimizer and mixed precision training as an example to analyze the memory usage during training.

ADAM maintains the first moment (momentum) and second moment (variance) of the gradients and thereby has a dynamic per-parameter learning rate; it is one of the most commonly used optimizers today. From the perspective of memory usage, the ADAM optimizer therefore needs to maintain momentum and variance in addition to the model parameters and their gradients.

Mixed-precision training is now the standard configuration for training large models. It reduces memory usage and speeds up training with little loss in accuracy. During mixed-precision training there are generally two precision types, fp16 and fp32: the fp16 values are the model parameters and their gradients, while the fp32 values are the fp32 backup of the model parameters plus whatever the optimizer needs to maintain, such as momentum and variance in ADAM.
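For orientation, this is roughly what a mixed-precision training step looks like in PyTorch (a minimal sketch using torch.cuda.amp; note that native AMP keeps the weights in fp32 and casts on the fly, whereas the accounting below follows the classic fp16-weights + fp32-master-copy scheme described in the ZeRO paper):

```python
import torch
import torch.nn as nn

# Minimal mixed-precision training step (toy model and data for illustration).
model = nn.Linear(1024, 1024).cuda()            # weights stored in fp32
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()            # loss scaling to avoid fp16 underflow

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

with torch.cuda.amp.autocast():                 # forward pass runs in fp16 where safe
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()                   # scaled backward pass
scaler.step(optimizer)                          # unscale, then Adam updates fp32 weights
scaler.update()
```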

That covers the memory usage of the model state under the ADAM optimizer + mixed-precision training. In addition, training also involves activation values, temporary buffers, and GPU memory fragmentation.

To sum up, the GPU memory usage during training can be divided into two parts:

  1. Model state: denote the number of model parameters as $\Phi$. With Adam + mixed-precision training, the model state consists of the fp16 model parameters ($2\Phi$ bytes) and fp16 parameter gradients ($2\Phi$ bytes), plus the fp32 backup of the model parameters ($4\Phi$ bytes), the fp32 momentum ($4\Phi$ bytes), and the fp32 variance ($4\Phi$ bytes), for a total of $2\Phi+2\Phi+4\Phi+4\Phi+4\Phi=16\Phi$ bytes. (Note that fp16 occupies two bytes per value and fp32 occupies four.)
  2. Remaining state: activation values during training, temporary buffers, memory fragmentation, etc.

Taking GPT-2 as an example: the GPT-2 model contains 1.5B parameters. In fp16 format the model itself occupies only 3 GB of memory, yet the model state during actual training consumes 24 GB! As you can see, the model state is eight times the size of the fp16 model itself and accounts for the bulk of memory consumption. Moreover, for the activations in the remaining state there are already optimizations such as activation checkpointing that trade time for space and can effectively reduce that part of memory consumption. Therefore, optimizing the memory usage of the model state is the focus.
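As a quick sanity check of these figures, the arithmetic can be spelled out as below (the 1.5e9 parameter count is the GPT-2 figure cited above):

```python
# Memory accounting for Adam + mixed-precision training (model state only).
num_params = 1.5e9  # GPT-2 parameter count cited above

fp16_params   = 2 * num_params   # fp16 weights, 2 bytes each
fp16_grads    = 2 * num_params   # fp16 gradients
fp32_params   = 4 * num_params   # fp32 master copy of the weights
fp32_momentum = 4 * num_params   # Adam first moment
fp32_variance = 4 * num_params   # Adam second moment

total = fp16_params + fp16_grads + fp32_params + fp32_momentum + fp32_variance
print(f"model alone (fp16): {fp16_params / 1e9:.1f} GB")  # ~3 GB
print(f"full model state:   {total / 1e9:.1f} GB")        # ~24 GB (16 bytes per parameter)
```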

ZeRO consists of ZeRO-DP and ZeRO-R, which are memory optimizations for model state and remaining state, respectively.

ZeRO-DP

The model state is the focus of ZeRO's memory optimization. As mentioned in the preface, in data-parallel distributed training each GPU keeps an independent, complete copy of the model state, i.e. the full $16\Phi$ bytes of memory. Obviously there is a lot of redundancy here: in principle only one copy of the model state needs to exist in total. This is exactly the idea behind ZeRO: partitioning (sharding). Among $N$ GPUs, each GPU stores only $\frac{1}{N}$ of the model state; when a computation needs parameters held by another GPU, they are communicated over. In other words, bandwidth is traded for memory.

The following figure, taken from the original ZeRO paper, shows the idea of ZeRO's memory optimization more intuitively.

[Figure from the ZeRO paper: per-GPU memory consumption of the model state under the baseline and the three partitioning stages $P_{os}$, $P_{os+g}$, $P_{os+g+p}$]

ZeRO-DP's memory optimization has three incremental levels, generally called ZeRO-1, ZeRO-2, and ZeRO-3, corresponding to $P_{os}$, $P_{os+g}$, and $P_{os+g+p}$ in the figure. Without any optimization, the memory usage is $(2+2+K)\Phi$, where $K$ denotes the bytes of optimizer state per parameter ($K=12$ for Adam with mixed precision, per the analysis above).

  • ZeRO-1: According to the previous analysis, the Adam optimizer states (Optimizer States, os) occupy the most memory, corresponding to the green part in the figure. ZeRO-1 shards the optimizer states and maintains each shard on a different GPU. The memory usage therefore becomes $(2+2+\frac{K}{N})\Phi$, which approaches $4\Phi$ as $N\rightarrow\infty$.
  • ZeRO-2: The second thing to shard is the gradients (Gradients, g), corresponding to the orange part in the figure. They are sharded across GPUs in the same way, giving a memory usage of $(2+\frac{2+K}{N})\Phi$, which approaches $2\Phi$ as $N\rightarrow\infty$.
  • ZeRO-3: The last thing to shard is the model parameters (Parameters, p), corresponding to the remaining (parameter) part of the figure. The memory usage becomes $\frac{2+2+K}{N}\Phi$; as $N\rightarrow\infty$, the memory occupied by the model state approaches zero. (The per-stage usage is worked out in the sketch after this list.)
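A minimal sketch of these formulas, assuming Adam with mixed precision ($K=12$) and the 1.5B-parameter GPT-2 example from earlier:

```python
# Per-GPU model-state memory (bytes per parameter) for the baseline and each ZeRO stage.
# Assumes Adam + mixed precision: 2 (fp16 params) + 2 (fp16 grads) + K, with K = 12.
def model_state_bytes_per_param(stage: int, n_gpus: int, k: int = 12) -> float:
    if stage == 0:   # plain data parallelism, everything replicated
        return 2 + 2 + k
    if stage == 1:   # optimizer states sharded
        return 2 + 2 + k / n_gpus
    if stage == 2:   # optimizer states + gradients sharded
        return 2 + (2 + k) / n_gpus
    if stage == 3:   # optimizer states + gradients + parameters sharded
        return (2 + 2 + k) / n_gpus
    raise ValueError("stage must be 0, 1, 2, or 3")

num_params = 1.5e9  # GPT-2 example from above
for stage in range(4):
    gb = model_state_bytes_per_param(stage, n_gpus=64) * num_params / 1e9
    print(f"ZeRO-{stage}: ~{gb:.1f} GB per GPU")  # 24.0, 6.3, 3.3, 0.4
```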

It can be seen that with the ZeRO strategy the model state is stored in shards: as the number of GPUs increases, each shard becomes smaller, the per-GPU memory usage of this part keeps shrinking, and in theory it even tends to zero.

In practice, however, we also need to consider the overhead of inter-GPU communication; don't forget that the memory savings above are "bought" with bandwidth. The conclusion is that ZeRO-1 and ZeRO-2 have the same communication volume as conventional data parallelism without the ZeRO strategy, while ZeRO-3 requires additional communication; the detailed analysis comes later. To balance memory usage against communication overhead, ZeRO-1 or ZeRO-2 is usually chosen in practice. The ZeRO stage (1/2/3) can be selected in DeepSpeed, while PyTorch's FSDP (Fully Sharded Data Parallel), as the "fully sharded" in its name suggests, shards everything and is therefore equivalent to ZeRO-3.
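For reference, a minimal sketch of how the ZeRO stage is typically selected through a DeepSpeed config dict (the field names follow DeepSpeed's JSON config schema; the model here is a toy stand-in):

```python
import torch
import deepspeed

# Minimal DeepSpeed config sketch: ZeRO stage 2 shards optimizer states and gradients.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 2},  # 1 / 2 / 3 selects ZeRO-1/2/3
}

model = torch.nn.Linear(1024, 1024)     # stand-in for a real model

# Launch with the deepspeed launcher, e.g.: deepspeed train.py
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```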

ZeRO-R

ZeRO-DP optimizes the memory usage of the model state, while ZeRO-R optimizes the remaining state, that is, activation, buffer, and fragmentation.

  • Activations are also partitioned (sharded) across GPUs, and this is combined with activation checkpointing to further reduce memory usage (a minimal checkpointing sketch in PyTorch follows this list);
  • Training frequently creates temporary buffers of varying sizes, for example for gradient AllReduce. The solution is to allocate fixed-size buffers in advance rather than creating them dynamically during training; when the data to be transmitted is small, multiple chunks are bucketed together and sent at once to improve efficiency;
  • One major cause of memory fragmentation is that, with gradient (activation) checkpointing, the activations that are not saved are repeatedly created and destroyed. The solution is to pre-allocate a contiguous block of memory, keep the long-lived model state and checkpointed activations in it, and leave the rest of the memory for dynamically creating and destroying the short-lived activations.
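Activation checkpointing itself is available directly in PyTorch; here is a minimal sketch (the two blocks are illustrative toy modules):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy two-block network; intermediate activations inside each block are not stored
# during the forward pass and are recomputed during the backward pass instead.
block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

x = torch.randn(8, 1024, requires_grad=True)
h = checkpoint(block1, x, use_reentrant=False)   # recompute block1 in backward
y = checkpoint(block2, h, use_reentrant=False)
y.sum().backward()
```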

The techniques in ZeRO-R are fairly common buffering and memory-management tricks in computer systems.

Sharding Communication Analysis

Collective Communication Primitives Review

Before analyzing the communication volume of ZeRO's sharding strategy, let's first review the common collective communication primitives: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter. This review follows the official NVIDIA NCCL documentation.

AllReduce

The AllReduce operation performs reduction operations (such as sum, min, max, etc.) on the data on all nodes, and saves the result in the buffer of each node.

Take the sum operation over $k$ nodes as an example: each node provides a vector $V_i$ of $N$ elements, and the result of summing the $V_i$ over all nodes is again a vector $S$ of $N$ elements, i.e. $S[i]=V_0[i]+V_1[i]+\dots+V_{k-1}[i]$.

AllReduce is the basis of data-parallel communication. Ring AllReduce is the implementation commonly used in distributed training today; if you are interested, see Yuan Jinhui's step-by-step derivation of the mathematical properties of Ring AllReduce.
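For reference, a minimal AllReduce sketch with torch.distributed (assumes the script is launched with torchrun; the gloo backend is used here so it also runs on CPU):

```python
import torch
import torch.distributed as dist

# Run with e.g.: torchrun --nproc_per_node=4 allreduce_demo.py
dist.init_process_group(backend="gloo")  # use "nccl" on GPUs
rank = dist.get_rank()

# Each rank contributes its own vector; after all_reduce every rank holds the sum.
v = torch.full((8,), float(rank))
dist.all_reduce(v, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {v.tolist()}")      # every rank prints the same summed vector
```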


Broadcast

Broadcast replicates a vector on one node to all other nodes.


Reduce

The calculation process of the Reduce operation is consistent with that of AllReduce, except that the result is only written to one node.

Note: Reduce + Broadcast is equivalent to AllReduce.


AllGather

The AllGather operation collects the $N$ values from each of the $k$ nodes, producing a $k*N$ result that is distributed to all nodes.

Note: executing ReduceScatter followed by AllGather is equivalent to AllReduce.


ReduceScatter

The calculation process of ReduceScatter is the same as that of Reduce, except that the result is split into equal chunks that are scattered to the different nodes according to their ranks.
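To make the earlier note concrete, the following sketch checks that ReduceScatter followed by AllGather gives the same result as a single AllReduce (torch.distributed, NCCL backend assumed since ReduceScatter is generally not available on gloo; launch with torchrun on a multi-GPU machine):

```python
import torch
import torch.distributed as dist

# Run with e.g.: torchrun --nproc_per_node=2 rs_ag_demo.py
dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)
device = torch.device("cuda", rank)

# Each rank holds a "gradient" of world * chunk elements.
chunk = 4
grad = torch.arange(world * chunk, dtype=torch.float32, device=device) + rank

# Step 1: ReduceScatter - each rank ends up with the sum of its own chunk.
my_chunk = torch.empty(chunk, device=device)
dist.reduce_scatter(my_chunk, list(grad.chunk(world)), op=dist.ReduceOp.SUM)

# Step 2: AllGather - every rank collects all the reduced chunks.
gathered = [torch.empty(chunk, device=device) for _ in range(world)]
dist.all_gather(gathered, my_chunk)
result = torch.cat(gathered)

# Reference: a single AllReduce over the original tensor gives the same result.
ref = grad.clone()
dist.all_reduce(ref, op=dist.ReduceOp.SUM)
assert torch.allclose(result, ref)
```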


Communication Volume Analysis

As mentioned earlier: ZeRO-1 and ZeRO-2 have the same communication volume as traditional data parallelism without the ZeRO strategy, while ZeRO-3 requires additional communication.

After the gradients are computed at each step (iteration) of traditional data parallelism, an AllReduce operation is required to average the gradients across GPUs. The commonly used Ring AllReduce is divided into two steps, ReduceScatter and AllGather, and the communication volume (send + receive) of each GPU is approximately $2\Phi$.

Let's analyze $P_{os+g}$ directly. Each GPU stores only $\frac{1}{N}$ of the optimizer states and gradients. For gpu0 to compute the mean of its $\frac{1}{N}$ of the gradients, a Reduce operation is needed with a communication volume of $\frac{1}{N}\Phi*N=\Phi$; after that, the other GPUs no longer need to keep this part of the gradients. A bucketing strategy is used in the implementation to ensure that each $\frac{1}{N}$ slice of the gradients is sent only once per GPU.

One more thing to note: suppose the gradient shard for the last two layers of the model belongs to gpu0. If the other GPUs immediately delete their copies of those two layers' gradients to save memory, how can they compute the gradient of the third-to-last layer? Again thanks to bucketing, the other GPUs can send those gradients and compute the third-to-last layer's gradient at the same time; only when both are done do they safely delete the last two layers' gradients.

After gpu0 has computed the averaged gradients, it can update the optimizer state it holds locally (covering its $\frac{1}{N}\Phi$ of the parameters). When backpropagation ends, an AllGather operation is performed to obtain the remaining $(1-\frac{1}{N})\Phi$ updated model parameters, and the communication volume is again $\frac{1}{N}\Phi*N=\Phi$.

Viewed globally, this is equivalent to performing ReduceScatter and AllGather in two steps, so the communication volume is consistent with traditional data parallelism.

As for ZeRO-3 ($P_{os+g+p}$), each GPU keeps only $\frac{1}{N}$ of the model parameters themselves, so both the forward computation and the backpropagation additionally involve a Broadcast of the parameters held by other GPUs, which is where the extra communication comes from.
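To make the comparison concrete, here is a back-of-the-envelope tally of per-GPU communication volume per training step (this follows the accounting in the ZeRO paper, where gathering or reduce-scattering the full parameter/gradient set costs roughly $\Phi$ each; the ZeRO-3 line assumes parameters are gathered once for the forward pass and once for the backward pass):

```python
# Rough per-GPU communication volume per training step, in units of Phi
# (Phi = number of model parameters); numbers follow the ZeRO paper's accounting.
phi = 1.0

baseline_dp = 2 * phi        # AllReduce of gradients = ReduceScatter + AllGather
zero_1_and_2 = phi + phi     # ReduceScatter of gradients + AllGather of updated params
zero_3 = phi + phi + phi     # gather params (forward) + gather params (backward)
                             # + ReduceScatter of gradients => ~1.5x the baseline

print(baseline_dp, zero_1_and_2, zero_3)  # 2.0 2.0 3.0
```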

ZeRO-Offload

GPU memory is a key factor limiting the size of the model that can be trained, and CPU memory is much cheaper than GPU memory. The idea of ZeRO-Offload is to move temporarily unused tensors to CPU memory in order to expand the scale of trainable models, somewhat like an operating system using disk as swap space for RAM.
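In DeepSpeed, offloading is switched on inside the ZeRO section of the config; a minimal sketch is shown below (treat the exact fields as illustrative, since the available knobs depend on the DeepSpeed version):

```python
# Sketch: extend the earlier ZeRO config to offload optimizer states to CPU memory.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # keep Adam states in CPU RAM
    },
}
```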

ZeRO-Infinity

Both are offloading techniques, but ZeRO-Offload focuses more on single-GPU or small-scale scenarios, while ZeRO-Infinity is typical industrial engineering, aiming to break the GPU memory wall and push toward extremely large-scale training.



Origin blog.csdn.net/weixin_44966641/article/details/131951696