Distributed Training and Quantization of Open-Source LLMs

Following up on the previous blog post, this post mainly covers the currently popular training methods and quantization techniques for LLMs.


(Figure from Towards a Unified View of Parameter-Efficient Transfer Learning)

Tuning Strategies

The most common way to adapt a general LLM to downstream tasks is to fine-tune all model parameters, or to freeze most of the network and fine-tune only the last few layers (Freeze). However, this produces a separate full set of fine-tuned parameters for every task, and the training cost is high. The common parameter-efficient alternatives are:

  • Adapter. Freeze the original parameters and insert adapter layers for fine-tuning. An adapter layer generally projects down, applies a nonlinear activation, projects back up, and adds a residual connection; in a Transformer block, one adapter is placed after the MHA and one after the FFN (a PyTorch sketch follows this list). The newly added parameters account for roughly 0.5%–8% of the original model.
    $h = h + f(hW_{down})W_{up}$

  • P-Tuning. A soft-prompt method. P-tuning v1 inserts learnable prompt tokens into the input embedding, so the number of trainable parameters is limited by the sequence length and the approach does not generalize well across tasks and model scales. P-tuning v2 prepends learnable prompts to the sequence at every layer of the model, not just the input.

  • LoRA. The original parameters stay unchanged; only additional low-rank matrices are trained to approximate the weight update (i.e., $W_0$ is frozen, A and B are trainable, and B is initialized to 0). When the rank r is much smaller than the original dimensions, the number of newly added parameters is small (see the sketch after this list).
    $h = W_0x + \Delta W x = W_0x + BAx$
    For deployment, the low-rank update can be computed explicitly and merged into the weights, so inference runs as usual. To switch to another downstream task, $W_0$ can be recovered by subtracting the merged low-rank matrix and then adding the one for the new task.

  • BitFit fine-tunes only the bias vectors of the pre-trained model; diff-pruning learns a sparse parameter-update vector; and so on.
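To make the Adapter and LoRA formulas above concrete, here is a minimal PyTorch sketch of both modules. The layer sizes, rank r, scaling alpha, and initialization details are illustrative choices, not taken from any particular implementation:

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """h <- h + f(h W_down) W_up, inserted after MHA / FFN with the base model frozen."""
        def __init__(self, d_model: int, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(d_model, bottleneck)   # project down
            self.up = nn.Linear(bottleneck, d_model)     # project back up
            self.act = nn.ReLU()                         # nonlinear activation

        def forward(self, h):
            return h + self.up(self.act(self.down(h)))   # residual connection

    class LoRALinear(nn.Module):
        """h = W0 x + B A x, with W0 frozen; only A and B are trained (B starts at 0)."""
        def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = nn.Linear(d_in, d_out, bias=False)
            self.base.weight.requires_grad_(False)       # freeze W0
            self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
            self.B = nn.Parameter(torch.zeros(d_out, r)) # B = 0, so delta-W = BA starts at 0
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

For deployment, the update B @ A (times the scale) can be merged directly into base.weight, and subtracted again when switching to another task's low-rank matrices.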

Optimization

How to train large models with lower memory and time?

  • Storage: the parameters and training data of a large model do not fit in the memory of a single GPU card. –> distributed parallelism and mixed precision training
  • Computation: operating on so many parameters leads to long training times and slow inference. –> model quantization

Parallelism

  • Data Parallelism. The model is replicated, each copy is placed on a different GPU, and the input data is sharded across them. Each replica runs the forward and backward passes in parallel, and at the end of every training step all replicas are synchronized, i.e., an AllReduce that aggregates gradients and sends them back.
  • Tensor Parallelism. Tensor parallelism splits matrix multiplications into blocks, so a large weight matrix is divided into smaller matrices that can be placed on different GPUs. Instead of keeping an entire activation or gradient tensor on one GPU, each GPU holds only a shard of that tensor (a toy illustration follows this list).
  • Pipeline Parallelism. Pipeline parallelism splits a single large model across multiple cards by placing different layers on different GPUs, and is also known as vertical parallelism. To improve utilization, the concept of chunks is introduced: the sequential input data blocks processed by the same pipeline stage. For example, GPU0 runs the forward path on chunks 0, 1, 2, and 3 (F0,0, F0,1, F0,2, F0,3), waits until the other GPUs have finished their work, and then runs the backward path on chunks 3, 2, 1, and 0 (B0,3, B0,2, B0,1, B0,0). This is similar to DeepSpeed's gradient accumulation steps (GAS).
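As a toy illustration of the tensor-parallel idea in the second bullet, the snippet below splits one weight matrix into column blocks and computes each block's matmul separately. Everything runs on one device here; in a real setup each block would live on a different GPU, with a communication step recombining the outputs:

    import torch

    torch.manual_seed(0)
    x = torch.randn(4, 16)          # activations: (batch, d_in)
    W = torch.randn(16, 32)         # full weight: (d_in, d_out)

    # Column parallelism: each "GPU" holds a slice of W's output columns.
    W_shards = W.chunk(4, dim=1)                 # 4 shards of shape (16, 8)
    partial = [x @ w for w in W_shards]          # each device computes its own slice
    y_parallel = torch.cat(partial, dim=1)       # gather along the output dimension

    assert torch.allclose(x @ W, y_parallel, atol=1e-6)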

When PP and DP are combined: with a global batch size of 1024 and 4 DP ranks, each rank gets 1024 / 4 = 256 samples; if the number of chunks (or GAS) is 32, the micro-batch size (MBS) is 256 / 32 = 8, i.e., each pipeline stage processes one micro-batch at a time.
Combining DP + PP + TP gives 3D parallelism.

In practice, Megatron-DeepSpeed is widely used for training (e.g., GPT-3-style training based on Megatron-DeepSpeed), and its core is ZeRO, which optimizes DP memory usage (similar to FSDP).

ZeRO

DP needs to replicate the full model and then AllReduce. Each replica stores three kinds of data: the model parameters P, the model gradients G, and the optimizer states O, i.e., the OPG states. Keeping a copy of all of them on every GPU is redundant, and as the model grows it easily blows past the GPU memory.

  • Parameters are only used when doing forward and backward
  • Gradients are only used when doing AllReduce and updates at the end
  • Adam's optimizer states are only used in the final update

In fact, each GPU only needs to store its own partition of this data; whenever other partitions are needed, they can be fetched from the other GPUs. In other words, communication time is traded for GPU memory.
($\Phi$ is the number of model parameters W; K refers to the fp32 parameter, momentum, and variance copies, which take 4+4+4 bytes; p and g are stored in 2+2 bytes under mixed precision)

  • forward. Do an All-Gather on W to fetch the shards held by the other GPUs and obtain the complete W. Once the forward pass is done, immediately discard the shards of W not maintained locally.
  • backward. Do an All-Gather on W again to rebuild the complete W. Once the backward pass is done, immediately discard the shards not maintained locally.
  • gradients. The complete gradient G needs a Reduce-Scatter, so that each GPU aggregates, from the other GPUs, the gradient shard it maintains. After aggregation, immediately discard the parts of G not maintained locally.
  • Adam. Use the locally maintained O and G to update the local shard of W. Since each GPU maintains only part of W, no further AllReduce is needed. A sketch of this communication pattern is given below.
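The steps above map onto two collectives. A minimal sketch of the pattern with torch.distributed (the function names and the even-sharding assumption are mine, not DeepSpeed's API), assuming the process group is already initialized, e.g., via torchrun:

    import torch
    import torch.distributed as dist

    def gather_full_weight(my_shard: torch.Tensor, world_size: int) -> torch.Tensor:
        # forward / backward: All-Gather the shards from every rank to rebuild the full W;
        # the full W is discarded again right after it is used
        shards = [torch.empty_like(my_shard) for _ in range(world_size)]
        dist.all_gather(shards, my_shard)
        return torch.cat(shards)

    def keep_my_gradient(full_grad: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        # gradients: Reduce-Scatter so each rank ends up with only the (summed) shard it maintains,
        # which is then consumed by the locally held optimizer states (Adam) to update the local W shard
        chunks = list(full_grad.chunk(world_size))     # assumes the gradient divides evenly
        my_grad = torch.empty_like(chunks[rank])
        dist.reduce_scatter(my_grad, chunks, op=dist.ReduceOp.SUM)
        return my_grad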

Therefore, ZeRO has three different levels, corresponding to different degrees of segmentation:

  • ZeRO-1: Split Optimizer States.
  • ZeRO-2: Split Optimizer States and Gradients.
  • ZeRO-3: Split Optimizer States, Gradients and Parameters.
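With DeepSpeed, the stage is selected in the `zero_optimization` section of the config. A minimal sketch (the model, batch sizes, and hyperparameters are placeholders, not recommendations); it is typically launched with the `deepspeed` launcher:

    import deepspeed
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "gradient_accumulation_steps": 32,
        "bf16": {"enabled": True},        # or "fp16": {"enabled": True, "loss_scale": 0}
        "zero_optimization": {
            "stage": 3,                   # 1: optimizer states, 2: +gradients, 3: +parameters
            "overlap_comm": True,
        },
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    }

    # deepspeed.initialize wraps the model and optimizer with the chosen ZeRO stage
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )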

Mixed Precision Training

Although we would like the parameters to be as precise as possible, fp32 means a large compute and memory overhead per parameter, so mixed precision training introduces fp16. However, the representable range of fp16 is much narrower than that of fp32, which causes the following problems:

  • Training fully in fp16 loses about 80% of the accuracy.
  • Many accumulation operations, such as gradient accumulation and softmax, are prone to overflow. –> Use fp32 for accumulations and fp16 for the other operations.
  • In the later stages of training, gradients become very small and underflow in fp16. –> Keep an fp32 master copy of the weights (fp16 is used only in the forward and backward passes) and apply loss scaling (see the AMP sketch below).
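The two remedies above (an fp32 master copy of the weights plus loss scaling) are what PyTorch's AMP utilities implement. A minimal sketch with a placeholder model and random data, assuming a CUDA device:

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()        # weights stay in fp32 (master copy)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling

    x = torch.randn(8, 1024, device="cuda")
    target = torch.randn(8, 1024, device="cuda")

    for step in range(10):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():               # forward runs in fp16 (bf16 via dtype=torch.bfloat16)
            loss = torch.nn.functional.mse_loss(model(x), target)
        scaler.scale(loss).backward()   # scale the loss so small gradients don't underflow
        scaler.step(optimizer)          # unscale, skip the step on inf/nan, update fp32 weights
        scaler.update()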

However, many large-model reports show that fp16 training is unstable, so they all ended up choosing bf16. It has the same size as fp16, but its representable range matches fp32. Although its precision is lower, when combined with an fp32 master copy of the weights, the loss of precision can be treated as a bit of stochastic-gradient noise that gets corrected in later iterations.

Other Optimizations

  • Gradient Accumulation. Gradient accumulation addresses the problem that memory is insufficient to train with a large batch size. The method is to accumulate gradients over N mini-batches and only then update with the accumulated gradient, which has the same effect as training with an N-times larger batch (a sketch follows this list).
  • Activation Checkpointing. Keep only the inputs and outputs of each layer, discard the intermediate activations, and recompute them during the backward pass.
  • Fused Kernels. Minimize data movement. A GPU mainly reads/writes memory and performs computation; while it is busy reading and writing data, its compute units sit idle. Kernel fusion combines multiple operations into a single GPU kernel, so intermediate results stay in registers instead of being written back to GPU memory:
    c = torch.add(a, b)      # read a, b from GPU memory, run the kernel, write c back to GPU memory
    e = torch.maximum(c, d)  # read c, d from GPU memory, run the kernel, write e back to GPU memory
    After the two kernels are fused, only the fused kernel is launched, so c is never written to GPU memory, which reduces GPU idle time. Megatron-LM provides several custom fused CUDA kernels, such as various fused combinations of LayerNorm, scaling, masking, and softmax operations.
  • Hardware failures. In long training runs there are typically one or two GPU failures per week; use backup nodes and save a checkpoint every ~3 hours.
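A minimal sketch of gradient accumulation as described in the first bullet (the model, the dummy loss, and N = accum_steps are placeholders):

    import torch

    model = torch.nn.Linear(512, 512)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    accum_steps = 4                                   # N mini-batches per optimizer step

    optimizer.zero_grad(set_to_none=True)
    for step in range(100):
        x = torch.randn(16, 512)                      # stand-in for a mini-batch
        loss = model(x).pow(2).mean() / accum_steps   # scale so the summed gradients average out
        loss.backward()                               # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()                          # one update, equivalent to batch size 16 * N
            optimizer.zero_grad(set_to_none=True)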

Quantization

Quantization is a common model compression technique. The core idea is to convert model parameters from high precision to low precision, reducing the memory needed for inference so that large models can run on consumer-grade GPUs. INT8 quantization is the most popular post-training quantization method.

  • Symmetric vs. asymmetric quantization. Symmetric quantization maps values with a single scale factor around zero (zero point = 0), while asymmetric quantization adds a zero point so the integer range can cover an asymmetric value range (see the sketch after this list).

  • Outlier problem. A few activation dimensions have values far larger than the rest; they stretch the quantization range and destroy the precision of the normal values.

  • LLM.int8(). An adaptive mixed-precision quantization method: the quantization resolution is set per region, with outlier dimensions handled in higher precision while the rest are quantized to INT8, eliminating the negative impact of outliers on quantization.

  • INT4 quantization. At the INT4 quantization level, GLM-130B requires as little as 6 GB of GPU memory, and inference can run on a single card without a great loss of accuracy.
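A toy sketch of symmetric INT8 quantization (round-to-nearest with a single scale), which also shows how a single outlier inflates the scale and degrades the precision of everything else; the tensor and values are made up for illustration:

    import torch

    def quantize_int8(x: torch.Tensor):
        scale = x.abs().max() / 127.0                  # symmetric: one scale, zero point = 0
        q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor):
        return q.to(torch.float32) * scale

    x = torch.randn(1024)
    q, s = quantize_int8(x)
    print("mean error without outlier:", (x - dequantize(q, s)).abs().mean().item())

    x[0] = 100.0                                       # a single outlier inflates the scale
    q, s = quantize_int8(x)
    print("mean error with outlier:   ", (x - dequantize(q, s)).abs().mean().item())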
