[Deep Learning] Mixed Precision Training and Video Memory Analysis

For an introduction to the numerical precision of parameters, see https://zhuanlan.zhihu.com/p/604338403

Related Blog Posts
[Megatron-DeepSpeed] Tensor Parallel Tool Code Mpu Detailed Explanation (3): Implementation and Testing of Tensor Parallel Layer
[Megatron-DeepSpeed] Tensor Parallel Tool Code Mpu Detailed Explanation (1): Parallel Environment Initialization
[Megatron-DeepSpeed] Tensor Parallel Tool Code mpu Detailed Explanation (2): Encapsulation mappings of Collective Communication Operations
[Deep Learning] [Distributed Training] DeepSpeed: AllReduce and ZeRO-DP
[Deep Learning] Mixed Precision Training and Video Memory Analysis
[Deep Learning] [Distributed Training] Collective Communication Operations and PyTorch Examples
[Natural Language Processing] [Large Model] Large language model BLOOM reasoning tool test
[Natural Language Processing] [Large Model] GLM-130B: an open source bilingual pre-trained language model
[Natural Language Processing] [Large Model] Introduction to 8-bit Matrix Multiplication for Large Transformers
[Natural Language Processing] [Large Model] BLOOM: A multilingual model with 176B parameters and open access

1. How is the model trained?

This section briefly reviews forward propagation, backpropagation, and the optimization step, which will help in understanding the mixed precision training and video memory analysis that follow.

1. Forward propagation

Neural networks can be viewed as large fitting functions. Assume the network is $f(x;\theta) = g(z)$ with $z = h(x)$. Forward propagation then proceeds as follows: the sample $x$ is fed into the function $h$ to obtain the output $z = h(x)$; $z$ is then fed into the function $g$ to obtain the final output $g(z)$. The whole process is written compactly as $f(x;\theta)$, where $\theta$ denotes the learnable parameters of the model.
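
As a minimal illustration of this composition (a toy sketch; the layer sizes and module choices here are arbitrary and not from the text), $f(x;\theta)=g(h(x))$ can be written directly in PyTorch:

import torch
import torch.nn as nn

# f(x; theta) = g(h(x)): h produces the intermediate output z, g produces the final output
h = nn.Linear(4, 8)    # z = h(x)
g = nn.Linear(8, 2)    # output = g(z)

x = torch.randn(1, 4)  # one input sample
z = h(x)               # forward through h
output = g(z)          # forward through g; theta = the parameters of h and g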

2. Backpropagation

Backpropagation follows the assumptions above: the neural network is $f(x;\theta)$, where $x$ is the input and $\theta$ are the parameters. Furthermore, suppose there are $N$ labeled samples $\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i$ is the $i$-th sample and $y_i$ is its label. Now select $m$ of the $N$ samples to form a batch $\{(x_1',y_1'),(x_2',y_2'),\dots,(x_m',y_m')\}$. The gradient of the model on these $m$ samples is
$$\hat{g}=\frac{1}{m}\nabla_{\theta}\sum_{i=1}^m L(f(x_i';\theta),y_i')$$

  • SGD

    $lr$ is the learning rate of the model; the optimization step is then:
    $$\theta \leftarrow \theta - lr\times\hat{g}$$

  • Adam

    Compared with SGD, Adam addresses gradient oscillation and adapts the learning rate by introducing two state variables. Specifically, initialize $v=0$ and $r=0$, and choose two hyperparameters $\beta_1$ and $\beta_2$. Suppose we are at the $(t+1)$-th update step with batch gradient $\hat{g}$; then:
    $$v=\beta_1\cdot v + (1-\beta_1)\cdot\hat{g} \\ r=\beta_2\cdot r+(1-\beta_2)\cdot \hat{g}\odot\hat{g} \\ \hat{v}=\frac{v}{1-\beta_1^t} \\ \hat{r}=\frac{r}{1-\beta_2^t} \\ \Delta\theta=\frac{\hat{v}}{\sqrt{\hat{r}}+\delta}$$
    Here $\delta$ is a small constant, usually set to $10^{-8}$. The model parameters are then updated as:
    $$\theta = \theta - lr\times \Delta\theta$$
    (A minimal code sketch of both update rules follows this list.)
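
To make the two update rules concrete, here is a minimal NumPy sketch of one SGD step and one Adam step (an illustrative sketch only; the function and variable names are made up for this example):

import numpy as np

def sgd_step(theta, grad, lr):
    # theta <- theta - lr * g_hat
    return theta - lr * grad

def adam_step(theta, grad, v, r, t, lr, beta1=0.9, beta2=0.999, delta=1e-8):
    # first and second moment estimates
    v = beta1 * v + (1 - beta1) * grad
    r = beta2 * r + (1 - beta2) * grad * grad
    # bias correction (t is the current step index, starting from 1)
    v_hat = v / (1 - beta1 ** t)
    r_hat = r / (1 - beta2 ** t)
    # parameter update
    theta = theta - lr * v_hat / (np.sqrt(r_hat) + delta)
    return theta, v, r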

2. Mixed precision training

1. Precision

Usually a model is trained in float32 precision, but as models grow larger, the hardware and time costs of training rise sharply. So can we simply train in float16 instead? The answer is no.

The representable range of float16 is roughly $[-65504, 65504]$, and its smallest positive (subnormal) value is $2^{-24}$.

  • Advantages of float16
    • Lower video memory usage: float16 takes half the space of float32, so memory usage can be roughly halved;
    • Lower network communication overhead;
    • Hardware is optimized for float16, so computation is faster;
  • Disadvantages of float16
    • Underflow. For deep learning, the biggest problem with float16 is underflow. The weight update is roughly $\text{gradient}\times lr$; as training progresses this value tends to become very small and can fall below the smallest value float16 can represent. As a result, most of the model weights stop being updated and the model struggles to converge.
    • Rounding error. The magnitude gap between a weight and its gradient update can be too large: when the update is added to the weight and the result is rounded back to float16, the weight may not change at all. (Both effects are demonstrated in the short snippet after this list.)
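
Both effects are easy to reproduce directly in PyTorch (a short demonstration; the exact printed values depend on rounding):

import torch

# Underflow: a typical update gradient * lr can fall below the smallest
# positive float16 value (about 6e-8) and become exactly zero.
update = torch.tensor(1e-5, dtype=torch.float16) * torch.tensor(1e-4, dtype=torch.float16)
print(update)         # tensor(0., dtype=torch.float16) -- the update is lost

# Rounding error: adding a small update to a much larger weight leaves
# the weight unchanged after rounding back to float16.
weight = torch.tensor(1.0, dtype=torch.float16)
print(weight + 1e-4)  # tensor(1., dtype=torch.float16) -- no change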

2. Principle

Mixed precision training was proposed to keep the advantages of float16 while avoiding its disadvantages. In general, float16 is used for the model weights and gradients, while the optimizer states are kept in float32. In addition, the optimizer maintains a float32 master copy of the weights.

The specific process of mixed precision training is as follows (a conceptual code sketch of this loop is given right after the list):

  • Forward propagation with the float16 weights;
  • Backpropagation produces float16 gradients;
  • The optimizer computes the weight update in float32 precision;
  • The float32 master weights are updated;
  • The updated float32 weights are copied back to float16.
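
Conceptually, one training step can be sketched as follows (a hand-written illustration of the idea only, with a fixed loss scale added so that small gradients survive in float16; the optimizer is assumed to have been constructed over the float32 master weights, and in practice one would simply use apex or torch.cuda.amp as shown in the next section):

import torch

def mixed_precision_step(model_fp16, master_params_fp32, optimizer,
                         x, y, loss_fn, loss_scale=1024.0):
    # 1. forward propagation with the float16 weights
    loss = loss_fn(model_fp16(x), y)
    # 2. backpropagation on the scaled loss -> float16 gradients
    (loss * loss_scale).backward()
    # 3. copy the float16 gradients onto the float32 master weights and unscale them
    for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
        p32.grad = p16.grad.float() / loss_scale
    # 4. the optimizer (e.g. Adam) updates the float32 master weights
    optimizer.step()
    optimizer.zero_grad()
    model_fp16.zero_grad()
    # 5. copy the updated float32 weights back into the float16 model
    with torch.no_grad():
        for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
            p16.copy_(p32.half())
    return loss.item()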

3. In practice

  • apex

    Apex is a mixed precision training library developed by NVIDIA that makes it easy to enable mixed precision training. The following shows how to use apex:

from apex import amp

##############
# other code #
##############

# re-wrap model and optimizer with amp.initialize
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# other training code

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # gradients are scaled automatically
optimizer.step()  # the optimizer applies the update
optimizer.zero_grad()

##############
# other code #
##############

In amp.initialize(model, optimizer, opt_level="O1"), the opt_level argument specifies the mixed precision level; there are 4 levels in total:

O0: pure float32 training, which can serve as a reference baseline;

O1: automatically decide between float16 and float32 for each operation according to a black/white list (recommended);

O2: almost everything in float16, except for batch norm;

O3: pure float16, training is unstable;

  • pytorch-native

    PyTorch natively supports mixed precision training since version 1.6. Below is sample code:

    from torch.cuda.amp import autocast, GradScaler

    ##############
    # other code #
    ##############

    scaler = GradScaler()

    ##############
    # other code #
    ##############

    # enable autocast during the forward pass
    with autocast():
        output = model(input)
        loss = loss_fn(output, target)

    # float16 has a limited representable range, so the loss is scaled up before backward
    scaler.scale(loss).backward()
    # unscale the gradients and, if no inf/nan is found, step the optimizer
    scaler.step(optimizer)
    # update the scale factor for the next iteration
    scaler.update()
    optimizer.zero_grad()

    ##############
    # other code #
    ##############
    

3. Where does the video memory go?

At present, large models are essentially always trained with mixed precision. Building on the introduction above, we now analyze where the video memory actually goes.

1. Main video memory consumption

Suppose the model has $\Psi$ parameters and is trained with the Adam optimizer. First, since the model parameters and gradients are stored in float16, each consumes $2\Psi$ bytes. Adam maintains a float32 copy of the model weights, which consumes $4\Psi$ bytes. In addition, as introduced above, Adam maintains two state variables $v$ and $r$; since both are float32, they consume $4\Psi + 4\Psi$ bytes. Overall, the model itself consumes $2\Psi + 2\Psi = 4\Psi$ bytes of video memory and the Adam optimizer consumes $4\Psi + 4\Psi + 4\Psi = 12\Psi$ bytes, so the total video memory consumption is $4\Psi + 12\Psi = 16\Psi$ bytes. For a model with 1.5B parameters such as GPT-2, the memory consumption is therefore at least 24 GB.
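
A quick back-of-the-envelope check of the $16\Psi$ figure (a simple sketch, using the 1.5B parameter count quoted above and 1 GB = $10^9$ bytes):

# bytes per parameter in mixed precision training with Adam:
#   fp16 weights (2) + fp16 gradients (2)
#   + fp32 master weights (4) + fp32 Adam state v (4) + fp32 Adam state r (4)
bytes_per_param = 2 + 2 + 4 + 4 + 4   # = 16

num_params = 1.5e9                    # GPT-2, 1.5B parameters
print(num_params * bytes_per_param / 1e9, "GB")  # 24.0 GB, matching the figure above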

2. The remaining video memory consumption

Activations. An activation is the intermediate output $z = h(x)$ introduced in the "forward propagation" section above: before $g(z)$ has been computed (and until the backward pass has used it), the GPU needs to keep $z$ in memory. Clearly, activations also consume a lot of video memory during training. As a concrete example, for the 1.5B-parameter GPT-2 with a sequence length of 1K and a batch size of 32, the activations consume about 60 GB. Activation checkpointing (also called activation recomputation) is a common way to reduce this: it cuts the activation memory to roughly the square root of the total activation memory at the cost of about 33% extra recomputation. In this example, the activation memory drops from 60 GB to about 8 GB.

Although activation checkpointing greatly reduces the memory footprint of activations, it can still be very large for bigger models. For example, for a GPT-like model with 100B parameters and a batch size of 32, even with activation checkpointing the activations still require about 60 GB of video memory.
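
In PyTorch, activation checkpointing is available through torch.utils.checkpoint; below is a minimal sketch (the module, hidden size, and block count are illustrative placeholders, not from the text):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, hidden=1024, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
            for _ in range(num_blocks)
        )

    def forward(self, x):
        for block in self.blocks:
            # activations inside each block are not stored; they are
            # recomputed during the backward pass, trading compute for memory
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
out = model(torch.randn(8, 1024, requires_grad=True))
out.sum().backward()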

Temporary buffers. For large models, the temporary buffers used to store intermediate results also consume a lot of video memory. For example, operations such as gradient all-reduce tend to fuse all gradients into a single flattened buffer before the operation in order to improve throughput, since the bandwidth of cross-device all-reduce improves with larger message sizes. Although the gradients themselves are fp16, the fused buffer may need to be fp32 depending on the operation. When the model is large, such temporary buffers are not small: for a model with 1.5B parameters, a flattened fp32 buffer requires 6 GB of video memory.
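
For reference, the 6 GB figure follows directly from $1.5\times 10^{9}\ \text{parameters} \times 4\ \text{bytes (fp32)} = 6\times 10^{9}\ \text{bytes} \approx 6\ \text{GB}$.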

Video memory fragmentation. Out-of-memory errors can occur even when there is enough total video memory, due to fragmentation: when a memory request arrives and no contiguous block is large enough to satisfy it, the request fails even though the total free memory is sufficient. Significant fragmentation can be observed when training very large models; in extreme cases, around 30% of the video memory can be lost to fragmentation.

References

https://arxiv.org/pdf/1910.02054.pdf

https://zhuanlan.zhihu.com/p/103685761

https://zhuanlan.zhihu.com/p/604338403

https://blog.csdn.net/flyingluohaipeng/article/details/128095936

https://zhuanlan.zhihu.com/p/406319979

Reprinted from: https://blog.csdn.net/bqw18744018044/article/details/131030255