Data Parallel - DP/DDP/ZeRO

Data Parallel (DP)

The core idea of data parallelism is to keep a full copy of the model on each GPU, have each GPU consume a different slice of the data and compute its own gradients, and finally aggregate those gradients to update the model. The concept itself is not complicated, but in large-model scenarios the huge volume of storage and inter-GPU communication becomes the focus of system design. This article progressively introduces three mainstream data-parallel implementations:

  1. DP (Data Parallelism): the earliest data-parallel mode, generally built on the Parameter Server programming framework. In practice it is mostly used for single-machine multi-GPU training.
  2. DDP (Distributed Data Parallelism): distributed data parallelism using Ring-AllReduce communication, mostly used in multi-machine scenarios in practice.
  3. ZeRO (Zero Redundancy Optimizer): launched by Microsoft and used in its DeepSpeed framework. Strictly speaking, ZeRO reduces memory by combining data parallelism with partitioning ideas borrowed from tensor parallelism.

1) There are several computing GPUs, for example GPU0-GPU2, plus one gradient-collection GPU (the GPU on which the AllReduce aggregation happens).
2) A complete copy of the model parameters is kept on each computing GPU.
3) A batch of data X is split evenly across the computing GPUs.
4) Each computing GPU runs one round of FWD and BWD and obtains its gradient G.
5) Each computing GPU pushes its gradient to the gradient-collection GPU for aggregation. Aggregation here usually means summing the gradients, although user-defined reductions are also supported.
6) Once aggregation is finished, each computing GPU pulls the complete aggregated gradient back from the gradient-collection GPU and uses it to update its model parameters W. After the update, the model parameters on all computing GPUs are identical again.
7) This aggregate-then-redistribute operation on gradients is called AllReduce.

  • To summarize: scatter the data, collect (reduce) the gradients, then distribute the update back to every GPU. A minimal sketch of this flow is shown below.
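To make the push/aggregate/pull flow above concrete, here is a minimal sketch in plain PyTorch. It simulates the workers as model replicas inside one process; the worker count, the toy Net, and the random data are made-up illustrations, not something from the original article.

```python
import copy
import torch
import torch.nn as nn

n_workers = 3                          # hypothetical number of computing GPUs / Workers
torch.manual_seed(0)

class Net(nn.Module):                  # toy model standing in for the real network
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 1)
    def forward(self, x):
        return self.fc(x)

master = Net()
replicas = [copy.deepcopy(master) for _ in range(n_workers)]   # step 2: full copy per Worker

X = torch.randn(12, 8)                 # one global batch
y = torch.randn(12, 1)
shards = list(zip(X.chunk(n_workers), y.chunk(n_workers)))     # step 3: split the batch

# Steps 4-5: each Worker does FWD/BWD on its shard and "pushes" its gradient.
grads_per_worker = []
for model, (xb, yb) in zip(replicas, shards):
    model.zero_grad()
    loss = nn.functional.mse_loss(model(xb), yb)
    loss.backward()
    grads_per_worker.append([p.grad.clone() for p in model.parameters()])

# Steps 5-6: the "Server" aggregates (here a simple mean), then every Worker pulls it.
avg_grads = [torch.stack(gs).mean(dim=0) for gs in zip(*grads_per_worker)]

lr = 0.1
for model in replicas:                 # step 6: identical update on every replica
    with torch.no_grad():
        for p, g in zip(model.parameters(), avg_grads):
            p -= lr * g
```

In real DP (e.g. torch.nn.DataParallel) the scatter, per-replica execution and gradient reduction are handled by the framework; the sketch only mirrors the logical steps.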

A classic programming framework for implementing DP is the parameter server. In this framework the computing GPUs are called Workers and the gradient-aggregation GPU is called the Server. In practice, to reduce communication volume, one Worker is usually chosen to double as the Server; for example, all gradients can be sent to GPU0 for aggregation. A few additional points:

  • A Worker or a Server can consist of more than one GPU.
  • The Server can perform gradient aggregation only, or gradient aggregation plus the full parameter update.
  • In parameter-server terms, the DP process is exactly the push / aggregate / pull flow described above.

So the problems are:

  • Storage overhead is large. A complete copy of the model is stored on every GPU, which is redundant.
  • Communication overhead is high. The Server has to exchange gradients with every Worker, and when the Server and Workers are not on the same machine, the Server's bandwidth becomes the efficiency bottleneck of the whole system.

To summarize: each Worker finishes its own computation, submits the result, and then waits for the Server's update. The waiting wastes time, and the pressure on the Server is very high.


Hence the idea of asynchronous gradient updates.

With asynchronous gradient updates, a given Worker's computation proceeds as follows:

  • In round 10, the Worker computes its gradient as usual and sends push & pull gradient requests to the Server.
  • However, the Worker does not wait for the aggregated gradient to come back and update the parameters W before continuing. Instead, it takes the old W, consumes new data, and goes straight into round 11. This keeps the Worker computing during the communication time and improves the compute-to-communication ratio.
  • Of course, asynchrony cannot be taken too far: if gradients are only computed and the weights are never updated, the model will not converge. Consider an asynchronous update with a delay of 1: before starting round 12, W must already have been updated twice, using the gradients of rounds 10 and 11.

This means the Worker's parameters are updated in stages, and the gap between updates is determined by the delay (staleness) in steps. There are three update modes:
(a) No delay.
(b) Delay without a specified number of delay steps. In iteration 2 either the old weights or the new weights might be used: there is no guarantee which.
(c) Delay with the number of delay steps fixed at 1. For example, when doing iteration 3 you do not need to have retrieved the gradient of iteration 2 yet, but you must make sure the gradients of iterations 0 and 1 have been retrieved and applied to the parameters.

To sum up, asynchrony sounds appealing, but for a single Worker it just means consuming more batches while W stays stale, which under SGD slows down the model's overall convergence. The underlying idea of asynchrony is: rather than letting the Worker sit idle, let it consume more data; the feedback arrives late, but at least the Worker keeps working and learning.
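A tiny sketch of mode (c), bounded staleness of 1, can make the rule concrete. Everything here is a made-up illustration (a scalar "model", a fake gradient function, a single simulated Worker): the invariant enforced is that before round t starts, every gradient up to round t-2 has already been applied, while the gradient of round t-1 may still be in flight.

```python
# Simulated bounded-staleness (delay = 1) loop for one Worker.
# compute_grad and the scalar weight w are hypothetical stand-ins.

def compute_grad(w, batch):
    return w - batch            # fake gradient of the quadratic loss 0.5 * (w - batch)^2

w = 0.0                         # the Worker's (possibly stale) copy of W
lr = 0.1
pending = []                    # gradients pushed to the "Server" but not yet applied
data = [1.0, 2.0, 1.5, 0.5, 1.2, 0.8]

for t, batch in enumerate(data):
    # Bounded staleness 1: before starting round t, at most one gradient
    # (the one from round t-1) may still be outstanding.
    while len(pending) > 1:
        w -= lr * pending.pop(0)        # "pull" an older aggregated gradient and apply it

    g = compute_grad(w, batch)          # FWD/BWD on the possibly stale w
    pending.append(g)                   # "push" to the Server and keep going, no waiting

while pending:                          # drain the remaining gradients at the end
    w -= lr * pending.pop(0)

print(f"final w = {w:.3f}")
```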

Distributed Data Parallel (DDP)

Because of its uneven communication load, DP is generally limited to single-machine multi-GPU scenarios.
DDP therefore emerged as a more general solution that works across machines as well as on a single machine.

  • The first thing DDP has to solve is the communication problem: spread the Server's communication pressure evenly across the Workers. Once that is done, the Server can be removed entirely, leaving only Workers.

This combined round of aggregating and then redistributing gradients is, as before, called AllReduce.

Next, we introduce the most common AllReduce implementation today: Ring-AllReduce. It was first proposed by Baidu and very effectively solves the uneven communication load of data parallelism, making DDP practical.

  • The process is easiest to follow on a concrete example: 4 GPUs arranged in a ring, with each GPU's local gradient split into 4 chunks. In every step, each GPU sends one chunk to its neighbor and accumulates the chunk it receives.

  • After these chunk-wise send-and-accumulate steps, each GPU holds one fully aggregated chunk, i.e. 1/4 of the complete gradient. This phase is the Reduce-Scatter.

  • Next, each GPU's fully aggregated 1/4 chunk is passed around the ring to the other GPUs. This phase only overwrites (no further accumulation). This is the All-Gather.

  • After two more rounds of passing, every GPU holds the complete aggregated gradient.

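Here is a small simulation of the two phases (Reduce-Scatter, then All-Gather) on plain Python lists. The 4 simulated "GPUs", the toy gradient values, and the particular chunk-indexing convention are illustrative assumptions, not taken from the article.

```python
# Simulated Ring-AllReduce over N "GPUs"; each GPU's gradient is split into N chunks.
N = 4
grads = [[float(10 * i + c) for c in range(N)] for i in range(N)]   # toy local gradients

# Phase 1: Reduce-Scatter. At step s, GPU i sends chunk (i - s) % N to GPU (i + 1) % N,
# which adds it to its own copy. After N-1 steps, GPU i owns the fully summed chunk (i + 1) % N.
for s in range(N - 1):
    sends = [(i, (i - s) % N, grads[i][(i - s) % N]) for i in range(N)]
    for src, chunk, value in sends:
        grads[(src + 1) % N][chunk] += value

# Phase 2: All-Gather. Same ring pattern, but the receiver overwrites instead of adding.
for s in range(N - 1):
    sends = [(i, (i + 1 - s) % N, grads[i][(i + 1 - s) % N]) for i in range(N)]
    for src, chunk, value in sends:
        grads[(src + 1) % N][chunk] = value

expected = [float(sum(10 * i + c for i in range(N))) for c in range(N)]
assert all(g == expected for g in grads)    # every GPU now holds the full summed gradient
print(grads[0])
```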
To summarize:

Naive Data Parallel (DP) and Distributed Data Parallel (DDP) have the same total communication volume, but DP's load is uneven: most of the communication pressure is concentrated on the Server, and the Server's communication volume grows linearly with the number of GPUs, which is why DP is generally limited to single-machine multi-GPU setups. DDP uses Ring-AllReduce (the operation NCCL implements) to spread the communication evenly across all GPUs; the per-GPU communication volume is essentially a constant, independent of the number of GPUs, so cross-machine training becomes practical.
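For reference, here is a minimal single-node PyTorch DDP sketch. It assumes CUDA GPUs and a launch via torchrun; the toy model, data and hyperparameters are placeholders, not from the article.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; NCCL runs Ring-AllReduce under the hood.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(8, 1).cuda(local_rank)        # toy model; a full copy lives on every GPU
    model = DDP(model, device_ids=[local_rank])     # gradients are AllReduced automatically
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):                             # each rank feeds its own shard of data
        x = torch.randn(4, 8, device=local_rank)
        y = torch.randn(4, 1, device=local_rank)
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                             # AllReduce of gradients happens here
        opt.step()                                  # identical update on every rank

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```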

  • DDP fixes the uneven communication load, but one problem remains: memory overhead. In data parallelism, a complete copy of the model sits on every GPU, and once the model gets large it easily exhausts GPU memory.

ZeRO

ZeRO (Zero Redundancy Optimizer), developed by Microsoft, is the core of the distributed training framework DeepSpeed and is designed to solve the memory-overhead problem of large-model training.

  • The idea of ZeRO is to trade communication for GPU memory.

Storage is mainly divided into two parts: Model States and Residual States.
Model States refer to the content that is closely related to the model itself and must be stored, specifically including:

optimizer states: the momentum and variance maintained by the Adam optimizer
gradients: the model gradients G
parameters: the model parameters W

Residual States refer to content that is not part of the model itself but is additionally generated during training, including:

activations: the activation values, introduced in detail in the Pipeline Parallelism article. They are used when applying the chain rule in the backward pass; keeping them makes the gradient computation faster, but they do not have to be stored, because they can be recomputed by redoing the forward pass.
temporary buffers: temporary storage, for example the buffers created when gradients are sent to a GPU for summation and aggregation.
unusable fragmented memory: fragmented storage space. Even when total memory is sufficient, an allocation can still fail if a contiguous block cannot be found. This kind of waste can be addressed by memory defragmentation.


Mixed precision training
For the model we obviously want the parameters to be as accurate as possible, so the parameters W are stored in fp32 (single-precision floating point, 4 bytes each). However, the computational cost of running forward and backward entirely in fp32 is huge. Can fp16 or bf16 (half-precision floating point, 2 bytes each) be introduced for the computation to reduce the pressure?

Thus mixed-precision training was born.


  • Keep an fp32 copy of the parameters, momentum and variance (collectively, the model states).
  • Before the forward pass, allocate extra storage and cast the fp32 parameters down to fp16.
  • Run forward and backward as usual; the activations and gradients produced along the way are all stored in fp16.
  • Use the fp16 gradients to update the fp32 model states.
  • When the model converges, the fp32 parameters are the final output.

In short, fp32 is used for storing the master model states, and fp16 is used for the forward and backward computation. A minimal sketch of this scheme is given below.
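This is only a sketch of the fp32-master / fp16-compute scheme described above; the toy model and data are made up, and real implementations (e.g. torch.cuda.amp or DeepSpeed) also add loss scaling and other details omitted here.

```python
import copy
import torch
import torch.nn as nn

device = "cuda"   # assumes a CUDA GPU; fp16 matmul support on CPU is limited

master = nn.Linear(8, 1).to(device)                     # fp32 master weights
opt = torch.optim.Adam(master.parameters(), lr=1e-3)    # Adam keeps fp32 momentum / variance

for step in range(10):
    # Cast an fp16 working copy of the parameters before the forward pass.
    work = copy.deepcopy(master).half()

    x = torch.randn(4, 8, device=device, dtype=torch.float16)
    y = torch.randn(4, 1, device=device, dtype=torch.float16)

    # Forward and backward run in fp16, so activations and gradients are fp16.
    loss = nn.functional.mse_loss(work(x), y)
    loss.backward()

    # Copy the fp16 gradients onto the fp32 master parameters and update in fp32.
    for mp, wp in zip(master.parameters(), work.parameters()):
        mp.grad = wp.grad.float()
    opt.step()
    opt.zero_grad()
```

In practice a loss scaler (e.g. torch.cuda.amp.GradScaler) is also used so that small fp16 gradients do not underflow.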

Let the number of model parameters be Φ. Under mixed-precision Adam training, the model states then occupy roughly: 2Φ bytes for the fp16 parameters, 2Φ bytes for the fp16 gradients, and 4Φ + 4Φ + 4Φ = 12Φ bytes for the fp32 parameter copy, momentum and variance, for a total of about 16Φ bytes.

  • Because Adam is used as the optimizer, momentum and variance appear, and the optimizer states alone account for most of that memory.
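As a quick sanity check of the 16Φ figure above (the 7.5B parameter count is just an example, not a number from the article):

```python
phi = 7.5e9                      # example: a 7.5B-parameter model
bytes_per_param = 2 + 2 + 12     # fp16 params + fp16 grads + fp32 copy/momentum/variance
total_gb = phi * bytes_per_param / 1e9
print(f"model states alone: ~{total_gb:.0f} GB")   # ~120 GB, before activations and buffers
```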

Activations are left out of this accounting for now because:

  1. Activations depend not only on the model parameters but also on the batch size.
  2. Storing activations is not mandatory. They are kept only to make the chain-rule gradient computation faster during backward; you could always recover each layer's activations by keeping only the initial input X and redoing the forward pass (although in practice it is never taken to that extreme).

Because activations are this flexible, including them would make it hard to measure how the system's real memory footprint changes as the model grows, so we do not count them here. Activation optimization is explained in a separate section later; a small illustration of the recompute-instead-of-store idea follows.
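As a concrete illustration of recomputing activations instead of storing them, here is a standard activation (gradient) checkpointing sketch in PyTorch. The deep toy MLP is made up, and this only previews the idea the article discusses later.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A toy deep stack of blocks whose inner activations we choose not to keep.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)]
)

x = torch.randn(32, 256, requires_grad=True)
h = x
for blk in blocks:
    # checkpoint() discards the block's internal activations after forward
    # and recomputes them by re-running the block during backward.
    h = checkpoint(blk, h, use_reentrant=False)

loss = h.sum()
loss.backward()   # each block's forward is re-executed here to rebuild its activations
```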


Now that we know what takes up memory and how much, we can talk about how to optimize it.
Note that throughout training, many of these states are not needed at every moment. For example:

  1. Under Adam, the optimizer states are only used when doing the final parameter update.
  2. In data parallelism, the gradients are only used when doing the AllReduce and the update at the end.
  3. The parameters W are only used during forward and backward.

So ZeRO adopts a simple, even crude, approach: discard the data once it has been used, and when it is needed again, find a way to get it back from somewhere. Wouldn't that save a lot of memory?


Optimizer state partitioning (ZeRO Stage 1)


  • The optimizer states exist only to update the model parameters W. Partitioning the optimizer states means cutting them apart: each GPU keeps just its own shard, uses it to update the corresponding slice of W, and the updated slices of W are then synchronized across GPUs.


  • The drop in GPU memory is very noticeable: at the cost of roughly 1.5x the per-GPU communication, per-GPU storage is cut by about 4x. That looks like a pretty good trade-off.

Then keep cutting. In the same way, the gradients G can be partitioned as well (ZeRO Stage 2): each GPU uses its own shard of O and G to update the corresponding slice of W. After the update, each GPU holds an up-to-date slice of W, and as before an All-Gather on W pulls in the slices computed by the other GPUs.

Finally, cut everything (ZeRO Stage 3): the parameters W are partitioned too.

Now each GPU maintains only its own shard of the optimizer states, gradients and parameters.
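To see why the memory drop is so large, here is a back-of-the-envelope calculation of the per-GPU model-state memory at each partitioning level, following the 16Φ accounting above. The Φ = 7.5B and N = 64 values are examples, not numbers from the article.

```python
phi, n = 7.5e9, 64                        # example model size and GPU count
fp16_params, fp16_grads, fp32_states = 2 * phi, 2 * phi, 12 * phi

plain_dp = fp16_params + fp16_grads + fp32_states         # everything replicated
stage1   = fp16_params + fp16_grads + fp32_states / n     # partition optimizer states
stage2   = fp16_params + (fp16_grads + fp32_states) / n   # ... plus gradients
stage3   = (fp16_params + fp16_grads + fp32_states) / n   # ... plus parameters

for name, b in [("DP", plain_dp), ("stage 1", stage1), ("stage 2", stage2), ("stage 3", stage3)]:
    print(f"{name:>8}: {b / 1e9:6.1f} GB per GPU")
```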

The final data-parallel process is as follows:
(1) Each GPU stores only a shard of the parameters W. A batch of data is split into 3 parts, and each GPU consumes one part.
(2) When doing forward, do an All-Gather on W to fetch the shards held by the other GPUs and obtain the complete W (per-GPU communication volume ≈ Φ). As soon as forward is done, immediately discard the parts of W you do not maintain.
(3) When doing backward, do another All-Gather on W to obtain the complete W (per-GPU communication volume ≈ Φ). As soon as backward is done, immediately discard the parts of W you do not maintain.
(4) After backward, each GPU holds a complete gradient G. Do a Reduce-Scatter on G to aggregate, from the other GPUs, the gradient shard you maintain (per-GPU communication volume ≈ Φ). Once aggregation is done, immediately discard the parts of G you do not maintain.
(5) Update your shard of W using the O and G you maintain. Since each GPU keeps only part of W, no AllReduce on W is needed.
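In practice these partitioning stages are enabled through a framework rather than written by hand. A rough DeepSpeed-style configuration sketch is shown below as a Python dict; the batch size is a placeholder, and the exact options should be checked against the DeepSpeed documentation.

```python
# Sketch of a DeepSpeed ZeRO configuration (normally written to ds_config.json
# and passed to deepspeed.initialize).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,    # placeholder value
    "fp16": {"enabled": True},              # the mixed-precision scheme described earlier
    "zero_optimization": {
        "stage": 3,   # 1: partition optimizer states, 2: + gradients, 3: + parameters
    },
}
```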



ZeRO-R

Now let's look at how the residual states are optimized.

Fixed-size memory buffers:

  • Improve bandwidth utilization. As the number of GPUs grows, the number of inter-GPU communications grows, and the amount of data per communication may shrink (the total communication volume stays the same). If each slice is small, bandwidth cannot be fully used, so the buffer accumulates data and only communicates once a certain size is reached.
  • Make the storage size controllable. The buffer size accumulated before each communication is a known constant, which makes it easier for users to estimate memory consumption and communication time during training.

Memory defragmentation: set up a mechanism to consolidate fragmented memory into contiguous blocks.

This prevents allocation failures that happen when total memory is sufficient but contiguous memory is not.

ZeRO-Offload

Finally, a brief introduction to ZeRO-Offload. Its core idea: when GPU memory is not enough, make up for it with CPU memory. If the bulk of the data to be stored is offloaded to CPU memory while the computation stays on the GPU, we not only save GPU memory but also avoid some of the communication pressure of going cross-machine.

  • Forward and backward are compute-intensive, so everything they need (the fp16 parameters W and the activations) stays on the GPU.
  • The parameter update is computationally cheap, so everything it needs (W in fp32, the optimizer states in fp32, and the gradients in fp16) is placed on the CPU.

In short: when GPU memory is not enough, use CPU memory to make up for it.
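A rough sketch of how this is typically switched on in a DeepSpeed-style configuration (again as a Python dict; the exact option names should be verified against the DeepSpeed docs):

```python
# Sketch: keep fp16 compute on the GPU, move optimizer states and the update step to CPU memory.
ds_config_offload = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                               # offloading is commonly combined with stage 2 or 3
        "offload_optimizer": {"device": "cpu"},   # fp32 states and the update live in CPU memory
    },
}
```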


Source: blog.csdn.net/RandyHan/article/details/132612156