Large Model Distributed Training Parallel Technology (1) - Overview

From: Eat jelly without spitting out jelly skin


In recent years, with the introduction of the Transformer and MoE architectures, deep learning models can easily exceed a trillion parameters. Traditional single-machine, single-GPU training can no longer meet the needs of such super-large models, so we have to train large models in a distributed fashion on a single machine with multiple GPUs, or even across multiple machines with multiple GPUs.

The primary goal of distributed machine learning is to use AI clusters so that deep learning algorithms can efficiently train large, high-quality models from massive amounts of data. To achieve this, computing tasks, training data, and the model generally need to be partitioned according to how the hardware resources match the data and model scale, enabling distributed storage and distributed training. The mechanisms behind these distributed training technologies are therefore well worth analyzing in depth.

The following mainly explains the parallel technologies used for distributed training of large models. The series is roughly divided into nine articles:

  • Large Model Distributed Training Parallel Technology (1) - Overview

  • Large Model Distributed Training Parallel Technology (2) - Data Parallelism

  • Large Model Distributed Training Parallel Technology (3) - Pipeline Parallelism

  • Large Model Distributed Training Parallel Technology (4) - Tensor Parallelism

  • Large Model Distributed Training Parallel Technology (5) - Sequence Parallelism

  • Large Model Distributed Training Parallel Technology (6) - Multidimensional Hybrid Parallelism

  • Large Model Distributed Training Parallel Technology (7) - Automatic Parallelism

  • Large Model Distributed Training Parallel Technology (8) - MoE Parallelism

  • Large Model Distributed Training Parallel Technology (9) - Summary

This article, the first in the series, briefly introduces the common parallel technologies used for distributed training of large models.

Data Parallelism

Data parallelism is the most common form of parallelism because of its simplicity. In data-parallel training, the dataset is split into several shards, and each shard is assigned to a device. This is equivalent to parallelizing the training process along the batch dimension. Each device holds a full copy of the model and trains on its own shard of the dataset. After backpropagation, the model's gradients are all-reduced so that the model parameters on different devices stay in sync. A typical data-parallel implementation is PyTorch DDP.
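To make this concrete, here is a minimal sketch of data-parallel training with PyTorch DDP. It assumes being launched with torchrun (one process per device); the model, data, and hyperparameters are toy placeholders rather than a real training recipe.

```python
# Minimal data-parallel sketch with PyTorch DistributedDataParallel (DDP).
# Assumed launch: torchrun --nproc_per_node=N ddp_demo.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")   # use "nccl" on GPU clusters
    model = DDP(torch.nn.Linear(10, 1))       # every rank holds a full model copy
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(5):
        # Each rank would read its own data shard; random tensors stand in here.
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()                       # DDP all-reduces gradients here
        optimizer.step()                      # parameters stay in sync across ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```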


Model Parallelism

A notable feature of data-parallel training is that each GPU holds a copy of the entire model weights, which introduces redundancy. Another mode of parallelism is model parallelism, where the model is partitioned and distributed across an array of devices.

There are generally two types of model parallelism: tensor parallelism and pipeline parallelism.

  • Tensor parallelism parallelizes computation within a single operation, such as a matrix-matrix multiplication.

  • Pipeline parallelism parallelizes computation between layers.

Therefore, from another perspective, tensor parallelism can be seen as intra-layer parallelism, and pipeline parallelism can be seen as inter-layer parallelism.

Tensor Parallelism

Tensor-parallel training splits a tensor into N chunks along a specific dimension, so that each device holds only 1/N of the whole tensor without affecting the correctness of the computation graph. Extra communication is required to assemble the full results.

Taking a general matrix multiplication as an example, suppose we have C = AB. We can split B along its columns into [B0 B1 B2 ... Bn] and place one column block on each device. Multiplying A by its column block on every device yields [AB0 AB1 AB2 ... ABn]. At this point each device still holds only a part of the result, e.g., the device with rank=0 holds AB0. To obtain the correct final result, we need to gather all the partial results and concatenate them along the column dimension. In this way, tensors can be distributed across devices while the computation remains correct.
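The following single-process sketch simulates this column split in plain PyTorch: B is chunked along its columns, each partial product A @ Bi is computed separately, and the pieces are concatenated, matching the full product. In a real tensor-parallel setup the chunks live on different devices and the concatenation is an all-gather.

```python
# Single-process simulation of column-wise tensor parallelism for C = A @ B.
import torch

A = torch.randn(4, 8)
B = torch.randn(8, 6)

world_size = 2                                # pretend we have 2 devices
B_shards = torch.chunk(B, world_size, dim=1)  # split B along columns: [B0, B1]

partials = [A @ B_i for B_i in B_shards]      # each "device" computes A @ B_i
C = torch.cat(partials, dim=1)                # all-gather: concat along columns

assert torch.allclose(C, A @ B, atol=1e-5)    # matches the unsplit multiplication
```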


Typical tensor parallel implementations: Megatron-LM (1D), Colossal-AI (2D, 2.5D, 3D).

Pipeline Parallelism

The core idea of pipeline parallelism is to split the model into several chunks by layer and assign each chunk to a device.

  • During forward propagation, each device passes intermediate activations to the next stage.

  • During backpropagation, each device passes the gradient of the input tensor back to the previous pipeline stage.

This allows devices to perform computations simultaneously, increasing training throughput.
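Below is a naive sketch of a two-stage pipeline: the layers are split across two devices and the activations are handed from stage 0 to stage 1 (the device choice falls back to CPU when two GPUs are not available). Real pipeline schedules such as GPipe or 1F1B additionally split each batch into micro-batches to reduce idle time.

```python
# Naive two-stage pipeline sketch: layers split across two devices,
# activations passed forward, gradients flowing back via autograd.
import torch
import torch.nn as nn

two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

stage0 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev0)  # first half of the layers
stage1 = nn.Sequential(nn.Linear(32, 4)).to(dev1)              # second half of the layers

x = torch.randn(8, 16, device=dev0)
h = stage0(x)              # forward pass on stage 0
h = h.to(dev1)             # send intermediate activations to the next stage
y = stage1(h)              # forward pass on stage 1

y.sum().backward()         # backward: gradients flow from stage 1 back to stage 0
```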


An obvious disadvantage of pipeline-parallel training is that devices easily sit idle (because a later stage must wait for the earlier stages to finish), which wastes computing resources; its scaling efficiency is therefore not as high as that of data parallelism.


Typical pipeline parallel implementations: GPipe, PipeDream, PipeDream-2BW, PipeDream Flush (1F1B).

Optimizer-Related Parallelism

Today, as models grow larger and larger, the memory of a single GPU can usually no longer hold them, so we must find ways to optimize how GPU memory is used.

Generally speaking, during training the data that must be kept in GPU memory includes the model parameters themselves, the optimizer states, the activations, the gradients, and some temporary buffers. The proportion of each kind of data is shown in the figure below:

[Figure: breakdown of GPU memory consumption during training (parameters, gradients, optimizer states, activations, buffers)]

As the figure shows, the model parameters account for only part of the data held during training. Under mixed-precision training, the model states (optimizer states + gradients + model parameters) account for more than half. Therefore, we need a way to remove redundant data during model training.

Optimizer-related parallelism is a parallel scheme for removing this redundant data. The most popular approach of this kind today is ZeRO (Zero Redundancy Optimizer). To optimize the storage of the model states (i.e., remove redundancy), ZeRO uses sharding: each card stores only 1/N of the model states, so that only a single copy of the model states is maintained across the whole system. ZeRO has three levels, which shard the model states to different degrees (a minimal configuration sketch follows the list below):

  • ZeRO-1 : Optimizer States Sharding

  • ZeRO-2: Optimizer States & Gradients Sharding

  • ZeRO-3: Optimizer States, Gradients & Parameters Sharding
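As an illustration, the ZeRO stage is typically chosen through a DeepSpeed-style configuration; the minimal sketch below uses placeholder batch-size and optimizer values, and the exact keys should be checked against the DeepSpeed documentation.

```python
# Sketch of selecting a ZeRO stage via a DeepSpeed-style configuration dict.
# Values are illustrative placeholders; consult the DeepSpeed docs for the
# authoritative schema and options (e.g. offloading, bucket sizes).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,   # 1: optimizer states; 2: + gradients; 3: + parameters
    },
}

# The dict would normally be handed to deepspeed.initialize(...), roughly:
#   import deepspeed
#   engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```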


Heterogeneous System Parallelism

The methods above usually require a large number of GPUs to train a large model. What is often overlooked, however, is that a CPU has far more memory than a GPU. On a typical server, the CPU can easily have hundreds of gigabytes or even terabytes of memory, while a single GPU card usually has only 48 GB or 80 GB. This raises the question of why CPU memory is not used for distributed training.

More recent advances rely on CPU memory or even NVMe disks to train large models. The main idea is to offload tensors to CPU memory or an NVMe disk when they are not in use.

By using a heterogeneous system architecture, it is possible to accommodate a huge model on a single machine.
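The hand-rolled sketch below shows only the core offloading idea: park a tensor in CPU memory while it is not needed and bring it back to the GPU right before it is used again. Systems such as ZeRO-Offload and ZeRO-Infinity automate this bookkeeping for optimizer states, gradients, and parameters.

```python
# Minimal sketch of tensor offloading between GPU and CPU memory.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

activation = torch.randn(1024, 1024, device=device)  # produced during the forward pass

# Offload: keep the tensor in host memory to free GPU memory
# (pinned host memory would allow faster, asynchronous copies).
cpu_copy = activation.to("cpu")
del activation

# ... other layers keep running on the GPU in the meantime ...

# Reload just before the tensor is needed again (e.g. for the backward pass).
activation = cpu_copy.to(device)
```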


Multidimensional Hybrid Parallelism

Multidimensional hybrid parallelism refers to the combination of multiple parallel technologies such as data parallelism, model parallelism, and pipeline parallelism for distributed training.


Usually, multi-dimensional hybrid parallelism is required for pre-training and full-parameter fine-tuning of very large-scale models.


To make full use of bandwidth: in general, tensor parallelism requires the most communication, while data parallelism and pipeline parallelism require comparatively little. Therefore, tensor parallelism is used within a single server (where inter-GPU bandwidth is highest), while data parallelism and pipeline parallelism are used across servers.
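The sketch below illustrates that placement rule with arbitrary placeholder sizes: flat ranks are mapped onto tensor-, pipeline-, and data-parallel groups in the common layout where tensor-parallel ranks are adjacent, i.e. on the same server. Real frameworks (Megatron-LM, DeepSpeed, etc.) construct these process groups for you.

```python
# Sketch: map flat ranks onto (data, pipeline, tensor) parallel groups, with
# tensor-parallel ranks adjacent so they land on the same server.
tp_size, pp_size, dp_size = 4, 2, 2            # placeholder sizes: 16 GPUs total
world_size = tp_size * pp_size * dp_size

def coords(rank):
    """Return (dp, pp, tp) coordinates of a flat rank, tp varying fastest."""
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return dp, pp, tp

# Tensor-parallel groups: consecutive ranks (same node, fast NVLink/PCIe links).
tp_groups = [list(range(r, r + tp_size)) for r in range(0, world_size, tp_size)]
print("tensor-parallel groups:", tp_groups)

# Data-parallel groups: ranks sharing the same (pp, tp) coordinates, across nodes.
dp_groups = {}
for rank in range(world_size):
    dp, pp, tp = coords(rank)
    dp_groups.setdefault((pp, tp), []).append(rank)
print("data-parallel groups:", list(dp_groups.values()))
```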


Automatic Parallelism

The multidimensional hybrid parallelism described above (data parallelism, tensor parallelism, pipeline parallelism, and so on) requires partitioning the model across multiple AI accelerator cards. Implementing this by hand is very difficult for developers, who must also weigh performance, memory, communication, and training quality. If the model could instead be partitioned automatically across accelerator cards, operator by operator or layer by layer, the burden on developers would be greatly reduced. This is how automatic parallelism came into being.


MoE Parallelism / Expert Parallelism

Generally speaking, growth in model scale leads to a sharp increase in training cost, and the limit on computing resources has become the bottleneck for training large dense models. To address this, a deep learning architecture built on sparse MoE layers was proposed: the large model is split into multiple small models (experts), and in each iteration only a subset of the experts is activated for computation, depending on the input sample, which saves computing resources; a trainable gating (gate) mechanism that enforces sparsity decides which experts are activated, keeping the use of compute efficient.

Using the MoE structure, very large-scale model training can be achieved while the computational cost grows only sub-linearly, bringing huge gains for a fixed compute budget. MoE parallelism is essentially a form of model parallelism. The figure below shows a model with six expert networks trained with two-way expert parallelism: experts 1-3 are placed on the first compute unit, and experts 4-6 are placed on the second (a minimal code sketch of a gated MoE layer follows the figure).

[Figure: six expert networks distributed across two compute units under expert parallelism]
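Below is a minimal single-device sketch of a top-1 gated MoE layer in PyTorch; the expert count, layer sizes, and routing are simplified placeholders. Expert parallelism would place subsets of the experts on different devices and exchange the routed tokens with all-to-all communication, which is omitted here.

```python
# Minimal single-device sketch of a top-1 gated MoE layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=6):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # trainable router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                             # x: [num_tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)       # routing probabilities
        top_p, top_idx = probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                       # tokens routed to expert e
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(Top1MoE()(tokens).shape)                        # -> torch.Size([10, 64])
```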

Epilogue

This article briefly introduced the common parallel technologies for distributed training of large models. Subsequent articles in the series will explain the different solutions behind each of them in detail.




Source: blog.csdn.net/qq_27590277/article/details/132486425