Pipeline parallelism, tensor parallelism and 3D parallelism

A detailed look at three distributed training methods — pipeline parallelism, tensor parallelism, and 3D parallelism — from principles to concrete examples.

Pipeline parallelism and tensor parallelism both partition the model itself, with the goal of training larger models within the limited memory of a single GPU. Simply put, pipeline parallelism divides the model horizontally, i.e., by layers, while tensor parallelism divides the model vertically. 3D parallelism applies pipeline parallelism, tensor parallelism, and data parallelism to model training at the same time.

Pipeline parallelism

The goal of pipeline parallelism is to train larger models. This section first introduces the intuitive naive layer-parallel approach and analyzes its limitations, and then introduces the pipeline-parallel algorithms GPipe and PipeDream.

Naive layer parallelism

When a model is too large to be trained on a single GPU, the most straightforward idea is to split the model by layers and place the resulting parts on different GPUs. Taking a 4-layer sequential model as an example (a minimal sketch is given after the list below), naive layer parallelism has the following shortcomings:

  • Low GPU utilization. At any given time only one GPU is working; the others are idle.

  • No overlap between computation and communication. The GPU also sits idle while sending intermediate results of forward propagation (FWD) or backward propagation (BWD).

  • High memory usage. GPU0 must keep the activations of the entire minibatch until the final parameter update completes. If the batch size is large, this puts enormous pressure on GPU memory.
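A minimal sketch of the naive split, assuming a 4-layer model and (hypothetically) 4 visible CUDA devices; each stage lives on its own GPU and activations are copied between devices:

```python
import torch
import torch.nn as nn

# Naive layer parallelism: a 4-layer model split across 4 GPUs
# (device ids are assumed; adjust to your hardware).
layers = [nn.Linear(1024, 1024) for _ in range(4)]
devices = [torch.device(f"cuda:{i}") for i in range(4)]
for layer, dev in zip(layers, devices):
    layer.to(dev)

def forward(x: torch.Tensor) -> torch.Tensor:
    # Only one GPU works at a time; the others wait for the hand-off.
    for layer, dev in zip(layers, devices):
        x = layer(x.to(dev))        # copy activations to the next device
    return x

out = forward(torch.randn(32, 1024))
loss = out.sum()
loss.backward()                     # backward also runs one GPU at a time
```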

GPipe

The principle of GPipe

GPipe improves efficiency by splitting each minibatch into smaller, equally sized microbatches. Each microbatch independently runs forward and backward propagation, and the gradients of all microbatches are then summed to obtain the gradient of the whole minibatch. Since each layer lives on only one GPU, summing the microbatch gradients is done locally and requires no communication.
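The idea of summing microbatch gradients locally can be sketched as plain gradient accumulation (a simplified, single-device illustration of the arithmetic, not GPipe's scheduler):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)
minibatch = torch.randn(32, 1024)
target = torch.randn(32, 1024)

# Split the minibatch into 4 microbatches and accumulate gradients locally.
for mb_x, mb_y in zip(minibatch.chunk(4), target.chunk(4)):
    loss = nn.functional.mse_loss(model(mb_x), mb_y)
    (loss / 4).backward()   # .grad fields sum across microbatches

# model.weight.grad now equals the gradient of the full-minibatch loss,
# and no cross-device communication was needed for this summation.
```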

Assume there are 4 GPUs and the model is split into 4 parts by layer. The schedule of naive layer parallelism is as follows:

| Timestep | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
|----------|-----|-----|-----|-----|-----|-----|-----|-----|
| GPU3     |     |     |     | FWD | BWD |     |     |     |
| GPU2     |     |     | FWD |     |     | BWD |     |     |
| GPU1     |     | FWD |     |     |     |     | BWD |     |
| GPU0     | FWD |     |     |     |     |     |     | BWD |

As you can see, only one GPU is working at any given time, and each timestep is relatively long because the GPU has to run forward propagation for the entire minibatch.

GPipe divides the minibatch into 4 microbatches and feeds them to GPU0 in sequence. After forward propagation on GPU0, the results are sent to GPU1, and so on. The whole process is as follows:

| Timestep | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 |
|----------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| GPU3     |    |    |    | F1 | F2 | F3 | F4 | B4 | B3 | B2 | B1 |    |    |    |
| GPU2     |    |    | F1 | F2 | F3 | F4 |    |    | B4 | B3 | B2 | B1 |    |    |
| GPU1     |    | F1 | F2 | F3 | F4 |    |    |    |    | B4 | B3 | B2 | B1 |    |
| GPU0     | F1 | F2 | F3 | F4 |    |    |    |    |    |    | B4 | B3 | B2 | B1 |

F1 means forward propagation of microbatch 1 using the layers currently on that GPU. In GPipe's schedule, each timestep is shorter than in naive layer parallelism because each GPU only has to process a microbatch at a time.

The GPipe bubble problem

Bubbles are slots in the pipeline where no useful work is being done. They are caused by dependencies between operations: for example, GPU3 cannot start F1 until GPU2 has finished F1. The bubbles appear as the empty cells in the schedule above. Increasing the number of microbatches m reduces the proportion of time spent in bubbles.
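With p pipeline stages and m microbatches, the idle fraction of the GPipe schedule is commonly written as:

```latex
\text{bubble fraction} = \frac{p - 1}{m + p - 1}
```

For the schedule above (p = 4, m = 4) this is 3/7; doubling the number of microbatches to m = 8 reduces it to 3/11.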

GPU memory requirements of GPipe

Increasing the batch size linearly increases the memory needed to cache activations. In GPipe, each GPU must keep the activations produced during forward propagation until the corresponding backward propagation. Taking GPU0 as an example, the activations of microbatch 1 must be kept from timestep 0 until timestep 13.

To address this memory problem, GPipe uses gradient checkpointing. Instead of caching all activations, activations are recomputed during backpropagation. This reduces GPU memory requirements at the cost of extra computation.
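As an illustration (not GPipe's actual implementation), PyTorch exposes this technique via torch.utils.checkpoint; the sketch below wraps one hypothetical pipeline stage so that its internal activations are recomputed during the backward pass:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A hypothetical pipeline stage: only its input is kept; internal
# activations are recomputed during the backward pass.
stage = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

x = torch.randn(32, 1024, requires_grad=True)
y = checkpoint(stage, x, use_reentrant=False)  # forward without caching internals
y.sum().backward()                             # internals are recomputed here
```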

Assuming all layers are roughly equal in size, the GPipe paper estimates the per-GPU activation memory with checkpointing as O(N + (L/K) × (N/M)), where N is the minibatch size, L the number of layers, K the number of pipeline stages, and M the number of microbatches, compared with O(N × L) without checkpointing.

PipeDream

GPipe waits until the forward propagation of all microbatches has finished before starting backward propagation. PipeDream instead starts the backward propagation of a microbatch as soon as its forward propagation has completed. In principle, the cached activations of a microbatch can be discarded once its backward propagation finishes. Because PipeDream finishes backward passes earlier than GPipe, it also reduces the demand for GPU memory.

PipeDream's schedule can be drawn the same way for 4 GPUs and 8 microbatches: blue squares represent forward propagation, green squares represent backward propagation, and the numbers are microbatch IDs. PipeDream is no different from GPipe in terms of bubbles, but because it frees activation memory earlier, it reduces the demand for GPU memory.

Merging data parallelism and pipeline parallelism

Data parallelism and pipeline parallelism are orthogonal and can be used simultaneously.

  • For pipeline parallelism, each GPU needs to communicate with the next pipeline stage (during forward propagation) and the previous pipeline stage (during backward propagation).

  • For data parallelism, each GPU needs to communicate with the GPUs holding the same layers. All replicas of a layer participate in an AllReduce to average gradients.

This forms subgroups among all GPUs, with collective communication used within each subgroup. Any given GPU therefore has two kinds of communication: one with the GPUs holding the same layers (data parallelism), and one with GPUs holding different layers (pipeline parallelism). As an example, with pipeline parallelism of 2 and data parallelism of 2, the horizontal direction forms a complete model and the vertical direction holds different copies of the same layers.
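A minimal sketch of how such subgroups might be created with torch.distributed, assuming 4 processes, pipeline parallelism of 2 and data parallelism of 2; the rank layout here is illustrative, not the exact layout of any particular framework:

```python
import torch.distributed as dist

def build_groups(rank: int, pp_size: int = 2, dp_size: int = 2):
    """Create one pipeline group and one data-parallel group per rank.

    Assumed layout (4 ranks): ranks 0,1 hold stage 0, ranks 2,3 hold stage 1.
    Pipeline groups:      [0, 2], [1, 3]  (one full model copy each)
    Data-parallel groups: [0, 1], [2, 3]  (replicas of the same stage)
    Assumes dist.init_process_group(...) has already been called.
    """
    pp_group = dp_group = None
    # Every rank must call new_group for every group, in the same order.
    for i in range(dp_size):
        ranks = [i + s * dp_size for s in range(pp_size)]   # e.g. [0, 2]
        g = dist.new_group(ranks=ranks)
        if rank in ranks:
            pp_group = g
    for s in range(pp_size):
        ranks = [s * dp_size + i for i in range(dp_size)]   # e.g. [0, 1]
        g = dist.new_group(ranks=ranks)
        if rank in ranks:
            dp_group = g
    return pp_group, dp_group

# Gradients are then averaged only within dp_group, e.g.
# dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_group)
```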

Tensor parallelism

The main components of a Transformer are fully connected layers and the attention mechanism, and the core of both is matrix multiplication. The core idea of tensor parallelism is to split matrix multiplications, thereby reducing the model's memory footprint on a single GPU.

(1) Matrix multiplication perspective

For Y = XA, there are two ways to split the weight A across GPUs. Column parallelism splits A by columns, A = [A1, A2], so that Y = [XA1, XA2] and each GPU computes one column block. Row parallelism splits A by rows (and X by columns correspondingly), so that Y = X1A1 + X2A2 and each GPU computes one partial product.

(2) Activation functions and communication

Looking only at the formulas above, both row parallelism and column parallelism need just one communication step after each part has been computed: column parallelism concatenates (all-gathers) the partial results, while row parallelism sums (all-reduces) them.
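A quick numerical check of the two splitting schemes, with plain NumPy arrays standing in for the shards on two GPUs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
A = rng.standard_normal((16, 32))

# Column parallelism: split A by columns; "GPU i" computes X @ A_i,
# and the partial results are concatenated (an all-gather).
A1, A2 = np.hsplit(A, 2)
Y_col = np.concatenate([X @ A1, X @ A2], axis=1)

# Row parallelism: split A by rows and X by columns; "GPU i" computes
# X_i @ A_i, and the partial results are summed (an all-reduce).
X1, X2 = np.hsplit(X, 2)
A1r, A2r = np.vsplit(A, 2)
Y_row = X1 @ A1r + X2 @ A2r

assert np.allclose(Y_col, X @ A) and np.allclose(Y_row, X @ A)
```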

Now add the nonlinear activation GeLU and simulate two stacked fully connected layers. Let X be the input, and let A and B be the weights of the two layers. If the first layer is split by columns and the second by rows, GeLU can be applied locally to each column block, so when the two layers are stacked only one communication is needed, on the final output.
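A sketch of this pattern in NumPy (a Megatron-style MLP split across two simulated GPUs; the GeLU here is the tanh approximation, for illustration only):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU, good enough for this illustration
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # input
A = rng.standard_normal((16, 64))  # first FC layer (column-parallel)
B = rng.standard_normal((64, 16))  # second FC layer (row-parallel)

# "GPU i" holds one column block of A and the matching row block of B.
A1, A2 = np.hsplit(A, 2)
B1, B2 = np.vsplit(B, 2)

# Each GPU computes its shard end to end; GeLU is applied locally because
# column parallelism keeps whole output columns on one device.
Z1 = gelu(X @ A1) @ B1
Z2 = gelu(X @ A2) @ B2

# The only communication: a single all-reduce (sum) on the output.
Y_parallel = Z1 + Z2
Y_serial = gelu(X @ A) @ B
assert np.allclose(Y_parallel, Y_serial)
```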

(3) Multi-head attention parallelism

Multi-head attention parallelism is not 1D tensor parallelism in the strict sense, but since it was proposed in Megatron-LM together with 1D tensor parallelism, it is briefly introduced here. Because each head of multi-head attention is essentially independent, the heads can be computed in parallel.
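A NumPy sketch of head-level parallelism (two simulated devices, each computing half of 8 heads; the projections before and after the attention are omitted for brevity):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq, n_heads, d_head = 10, 8, 16
# Per-head queries/keys/values, shape (n_heads, seq, d_head)
Q = rng.standard_normal((n_heads, seq, d_head))
K = rng.standard_normal((n_heads, seq, d_head))
V = rng.standard_normal((n_heads, seq, d_head))

# Serial: compute every head, then concatenate along the feature dim.
serial = np.concatenate([attention(Q[h], K[h], V[h]) for h in range(n_heads)], axis=-1)

# Tensor parallel: "GPU 0" owns heads 0-3, "GPU 1" owns heads 4-7;
# each device works independently and the results are gathered at the end.
gpu0 = np.concatenate([attention(Q[h], K[h], V[h]) for h in range(0, 4)], axis=-1)
gpu1 = np.concatenate([attention(Q[h], K[h], V[h]) for h in range(4, 8)], axis=-1)
assert np.allclose(np.concatenate([gpu0, gpu1], axis=-1), serial)
```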

NOTE: Tensor parallelism (TP) requires a very fast network, so tensor parallelism across nodes is not recommended. In practice, if a node has 4 GPUs, the maximum practical tensor-parallel degree is 4.

2D, 2.5D tensor parallelism

After 1D tensor parallelism, 2D, 2.5D and 3D tensor parallelism were gradually proposed. Here is a brief introduction to 2D and 2.5D tensor parallelism:

(1) 2D tensor parallelism

1D tensor parallelism does not partition activations (the intermediate outputs of the model's layers), so they consume a large amount of GPU memory. 2D tensor parallelism, based on the SUMMA algorithm, partitions both the weights and the activations across a 2D grid of devices. For Y = XA on a 2 × 2 grid, both X and A are split into 2 × 2 blocks, and the product is computed in two steps: in step t, the t-th block-column of X is broadcast along each processor row and the t-th block-row of A is broadcast along each processor column; each processor multiplies the two blocks it receives, and the partial results of the two steps are added to give its block of Y.
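In formulas, for a 2 × 2 processor grid where processor (i, j) owns the output block Y_ij:

```latex
Y_{ij} = \sum_{t=0}^{1} X_{it} A_{tj}
       = \underbrace{X_{i0} A_{0j}}_{\text{step } 0} + \underbrace{X_{i1} A_{1j}}_{\text{step } 1},
\qquad i, j \in \{0, 1\}
```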

(2) 2.5D tensor parallelism

2.5D tensor parallelism generalizes the 2D scheme by adding a depth dimension: the p = q × q × d processors are arranged as d layers of q × q grids, trading some extra memory for less communication than pure 2D tensor parallelism.

  
3D parallelism

In general, 3D parallelism is composed of data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). TP and PP have been introduced above; the data-parallel component is typically ZeRO-DP, a memory-efficient data-parallel strategy.

The following describes how to combine three parallel technologies to form 3D parallel technology.

A 3D parallelism example

Suppose there are two nodes, Node1 and Node2, each with 8 GPUs, for 16 GPUs in total, numbered Rank0 through Rank15. Assume the user sets pipeline parallelism to 4 and tensor parallelism to 2, so the data-parallel degree is 16 / (4 × 2) = 2.

Pipeline parallelism. Pipeline parallelism splits the whole model into 4 parts, called sub_model_1 through sub_model_4 here. Every 4 consecutive GPUs are responsible for one sub_model; in the accompanying figure, GPUs of the same color hold the same sub_model.

Tensor parallelism. Tensor parallelism further splits the tensors of each sub_model produced by pipeline parallelism. That is, Rank0 and Rank1 hold one copy of sub_model_1 and Rank2 and Rank3 hold another copy; Rank4 and Rank5 hold one copy of sub_model_2 and Rank6 and Rank7 hold another; and so on. In the accompanying figure, each green line represents one tensor-parallel group, which is jointly responsible for a specific sub_model.

Data parallelism. The purpose of data parallelism is to have GPUs holding the same model parameters process different data and average their gradients. After pipeline parallelism and tensor parallelism are applied, Rank0 and Rank2 hold the same model parameters, so Rank0 and Rank2 form one data-parallel group. The red lines in the accompanying figure represent the data-parallel groups.
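The sketch below enumerates the resulting groups for this example in plain Python (16 ranks, TP = 2, PP = 4, DP = 2); the grouping mirrors the description above and is not the exact initialization code of any specific framework:

```python
WORLD_SIZE, TP, PP = 16, 2, 4
DP = WORLD_SIZE // (TP * PP)              # = 2 model replicas
RANKS_PER_STAGE = WORLD_SIZE // PP        # = 4 ranks per pipeline stage

# Tensor-parallel groups: consecutive pairs, e.g. [0, 1], [2, 3], ...
tp_groups = [list(range(r, r + TP)) for r in range(0, WORLD_SIZE, TP)]

# Data-parallel groups: ranks in the same stage holding the same shard,
# e.g. [0, 2], [1, 3], [4, 6], ...
dp_groups = [[stage * RANKS_PER_STAGE + t + d * TP for d in range(DP)]
             for stage in range(PP) for t in range(TP)]

# Pipeline groups: one full model copy each, e.g. [0, 4, 8, 12], [1, 5, 9, 13], ...
pp_groups = [[d * TP + t + s * RANKS_PER_STAGE for s in range(PP)]
             for d in range(DP) for t in range(TP)]

print(tp_groups)  # [[0, 1], [2, 3], ..., [14, 15]]
print(dp_groups)  # [[0, 2], [1, 3], [4, 6], [5, 7], ...]
print(pp_groups)  # [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
```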

3D parallelism analysis

Why does 3D parallelism divide the GPUs in this way? First, model (tensor) parallelism has the largest communication overhead of the three strategies, so the tensor-parallel group is preferably placed within a node to exploit the larger intra-node bandwidth. Second, pipeline parallelism has the smallest communication volume, so pipeline stages can be scheduled across nodes without being limited by communication bandwidth. Finally, if tensor parallelism does not cross nodes, data parallelism does not need to cross nodes either; otherwise, the data-parallel groups must also span nodes.

Pipeline parallelism and tensor parallelism reduce the memory consumption of a single GPU and improve memory efficiency. However, splitting the model too finely increases communication overhead and thus reduces compute efficiency. ZeRO-DP improves memory efficiency by partitioning the optimizer state, and does so without significantly increasing communication overhead.

References

https://siboehm.com/articles/22/pipeline-parallel-training

https://arxiv.org/pdf/1811.06965.pdf

https://huggingface.co/docs/transformers/v4.18.0/en/parallelism

https://www.colossalai.org/zh-Hans/docs/features/1D_tensor_parallel

https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
