Data Parallel / Model Parallel (Inter-Layer, Intra-Layer) / Pipeline Parallel -> ZeRO, LoRA

  • Data parallelism (DP): assuming there are N cards, each card keeps a full replica of the model. Each iteration (step) splits the batch into equal-sized micro-batches; each card independently computes gradients on its own micro-batch, AllReduce is then called to average the gradients, and every card updates its parameters independently (see the DDP sketch after this list).
  • Model parallelism / tensor parallelism (MP/TP): some tensors/layers are too large to fit on one card, so the tensor is split into multiple pieces, with each card storing one piece.
  • Pipeline parallelism (PP): split the network into multiple groups of layers, with each card storing one group.
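
A minimal data-parallel sketch using PyTorch DistributedDataParallel, assuming the script is launched with torchrun so that RANK/WORLD_SIZE/LOCAL_RANK are set; the model and data here are toy placeholders, not from the original post:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per card
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()        # full model replica on each card
    model = DDP(model, device_ids=[local_rank])       # gradients get AllReduce-averaged
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):
        x = torch.randn(32, 1024).cuda()              # this rank's micro-batch
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()                               # DDP averages gradients across cards here
        opt.step()                                    # every card applies the same update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```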

Model Parallelism

Model parallelism comes in two flavors: inter-layer model parallelism and intra-layer model parallelism.
The NUS papers mostly target intra-layer model parallelism, which can also be called tensor parallelism.
Intra-layer model parallelism partitions the input matrix or the parameter matrix along one dimension and places the slices on different devices; a typical example is 1D Megatron.
Inter-layer model parallelism partitions the model by layers. Many framework vendors in the industry call this pipeline parallelism, but in my view inter-layer model parallelism only deserves the name pipeline parallelism when execution is actually pipelined, e.g. over micro-batches (see the sketch below).
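
A toy sketch of inter-layer splitting with micro-batch pipelining (GPipe-style), simulated sequentially on one device; the stage names and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # layers placed on "card 0"
stage1 = nn.Sequential(nn.Linear(32, 8))                # layers placed on "card 1"

batch = torch.randn(8, 16)
micro_batches = torch.chunk(batch, 4, dim=0)            # split the batch into micro-batches

# Once card 0 finishes micro-batch i and hands it to card 1, it can start on
# micro-batch i+1; that overlap is what turns layer splitting into a pipeline.
outputs = [stage1(stage0(mb)) for mb in micro_batches]
y = torch.cat(outputs, dim=0)
print(y.shape)   # torch.Size([8, 8])
```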
Detailed introduction to 1D/2D/2.5D/3D model parallelism: how the embedding, self-attention, and MLP layers of GPT are parallelized.

ZeRO

ZeRO has three stages: stage 1 partitions the optimizer states across cards, stage 2 additionally partitions the gradients, and stage 3 additionally partitions the parameters themselves.

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models was published at SC 20; the DeepSpeed project was originally the official implementation of the ZeRO method from this paper.
ZeRO-Offload: Democratizing Billion-Scale Model Training was published at ATC 21.
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning was published at SC 21. It also performs offloading; ZeRO-Offload focuses more on single-card scenarios, while ZeRO-Infinity is typical industrial style, aiming for extremely large-scale training.
Reference: https://zhuanlan.zhihu.com/p/428117575
DeepSpeed mainly uses this method (see the config sketch below).
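
A minimal sketch of enabling ZeRO through DeepSpeed, assuming DeepSpeed is installed, the script is launched with the deepspeed launcher on GPU(s), and the model, batch size, and learning rate are placeholders chosen for illustration:

```python
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 2,                              # 1: optimizer states, 2: + gradients, 3: + parameters
        "offload_optimizer": {"device": "cpu"},  # optional ZeRO-Offload-style CPU offload
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
}

model = torch.nn.Linear(1024, 1024)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(4, 1024).to(engine.device)
loss = engine(x).square().mean()
engine.backward(loss)   # gradient partitioning/reduction is handled by ZeRO
engine.step()
```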

LoRA

Low-Rank Adaptation (LoRA): freeze the pretrained weights and train only a low-rank update ΔW = B·A added on top of them, which drastically reduces the number of trainable parameters.
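
A minimal LoRA sketch in plain PyTorch; the class name, rank, and scaling are illustrative assumptions rather than any particular library's API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: ΔW starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # base output plus the low-rank update x A^T B^T, scaled by alpha/r
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")   # only the low-rank A and B matrices
```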

Tensor Parallelism

Tensor model parallelism; Megatron uses this method.
Consider the GEMM Y = XA, where X is the input activation and A is the parameter matrix.

Row Parallelism
Split A by rows into A1 and A2; X is correspondingly split by columns into [X1, X2], and the partial products are summed: Y = X1 A1 + X2 A2 (an all-reduce across cards).

Column Parallelism
Split A by columns into [A1, A2]; each card computes its own slice of the output: Y = X[A1, A2] = [XA1, XA2] = [Y1, Y2] (see the numeric check below).
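
A single-device numeric check of the two ways to split Y = XA, simulating the two "cards" with torch.split; shapes are arbitrary illustrative choices:

```python
import torch

X = torch.randn(4, 8)
A = torch.randn(8, 6)
Y = X @ A

# Row parallelism: split A by rows and X by columns; the partial products are
# summed (on real hardware this sum is an all-reduce across cards).
A1, A2 = torch.split(A, 4, dim=0)
X1, X2 = torch.split(X, 4, dim=1)
Y_row = X1 @ A1 + X2 @ A2
assert torch.allclose(Y, Y_row, atol=1e-5)

# Column parallelism: split A by columns; each card holds one slice of Y, and
# the full output is the concatenation [Y1, Y2] (an all-gather on real hardware).
A1c, A2c = torch.split(A, 3, dim=1)
Y_col = torch.cat([X @ A1c, X @ A2c], dim=1)
assert torch.allclose(Y, Y_col, atol=1e-5)
```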

MLP
For the first fully connected layer, use a column split, so the nonlinearity (GeLU) can be applied independently on each shard.
For the second fully connected layer, use a row split, so the whole MLP needs only a single all-reduce at the end (see the sketch below).
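
A sketch of this Megatron-style MLP sharding, simulated on one device with assumed toy shapes: the first GEMM is column-parallel (GeLU applies shard-locally), the second is row-parallel, and the only cross-card communication is the final sum:

```python
import torch
import torch.nn.functional as F

X = torch.randn(4, 16)            # input activations
A = torch.randn(16, 64)           # first FC weight, split by columns
B = torch.randn(64, 16)           # second FC weight, split by rows

A1, A2 = torch.split(A, 32, dim=1)
B1, B2 = torch.split(B, 32, dim=0)

# "card 1" and "card 2" each compute their shard end to end
Z1 = F.gelu(X @ A1) @ B1
Z2 = F.gelu(X @ A2) @ B2
Y_parallel = Z1 + Z2              # the single all-reduce

Y_reference = F.gelu(X @ A) @ B   # unsharded reference
assert torch.allclose(Y_reference, Y_parallel, atol=1e-4)
```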


Origin blog.csdn.net/weixin_36378508/article/details/129838193