[Pytorch] Understanding automatic mixed precision training

  Larger deep learning models require more compute and memory. Several techniques have been proposed to train deep neural networks faster; one of them is to use FP16 (half-precision floating point) in place of FP32 (single-precision floating point), and in practice combining the two formats turns out to work better than using FP16 alone. Some GPUs, such as NVIDIA Ampere GPUs (the kind offered on Paperspace), can even take advantage of still lower precision such as INT8.

  Mixed precision allows much of training to run in half precision while retaining nearly all of the accuracy of a single-precision network. The term "mixed precision" refers to this combined use of single-precision and half-precision representations.

  In this overview of automatic mixed precision (AMP) training with PyTorch, we demonstrate how the technique works, walk through the process of using AMP, and discuss its applications through code.

Mixed Precision Overview

  In the world of deep learning, using FP16 for calculations can not only significantly improve performance, but also save memory. However, this approach also introduces two major problems: precision overflow and rounding errors. These two issues are key challenges for FP16 computing in deep learning.

   Precision Overflow:

  Because FP16 uses fewer bits, its representable numerical range is much narrower than that of FP32 or FP64, so values that are too large or too small simply cannot be represented. In deep learning this shows up as vanishing or exploding gradients: small gradient values may become zero (underflow), while large gradient values may become infinite (overflow). Either case seriously harms the stability and final quality of training.
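
  These limits are easy to see directly. The half-precision format has a largest finite value of 65504 and a smallest positive (subnormal) value of about 6e-8, so casting values outside that range loses them entirely:

import torch

# FP16 can represent (in magnitude) roughly 6e-8 up to 65504; values outside
# this range underflow to zero or overflow to infinity.
print(torch.finfo(torch.float16).max)              # 65504.0
print(torch.tensor(70000.0, dtype=torch.float16))  # tensor(inf, dtype=torch.float16)
print(torch.tensor(1e-8, dtype=torch.float16))     # tensor(0., dtype=torch.float16)
print(torch.tensor(1e-8, dtype=torch.float32))     # tensor(1.0000e-08)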

  Rounding Error:

  FP16 also suffers from much larger rounding error than FP32 or FP64 because of its 16-bit representation. In deep learning these errors accumulate with every operation, especially across many layers and complex computations, so the model's output can differ noticeably from what higher-precision arithmetic would produce. For accuracy-critical applications such as finance or medicine, this error may be unacceptable.
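
  A small, self-contained illustration of the accumulation problem: FP16 has only 10 mantissa bits, so near 2048 the gap between adjacent representable values is 2.0, and a small increment added to a large running sum is rounded away completely.

import torch

# Near 2048 the gap between adjacent FP16 values is 2.0, so adding 0.4
# rounds away every time; FP32 accumulates the increments correctly.
total_fp16 = torch.tensor(2048.0, dtype=torch.float16)
total_fp32 = torch.tensor(2048.0, dtype=torch.float32)
for _ in range(1000):
    total_fp16 += torch.tensor(0.4, dtype=torch.float16)
    total_fp32 += 0.4
print(total_fp16)  # tensor(2048., dtype=torch.float16) -- the updates were all lost
print(total_fp32)  # roughly tensor(2448.)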

  To alleviate these issues, mixed-precision training uses FP32 in the critical parts (such as weight updates) to maintain accuracy, and FP16 in other operations (such as the forward pass) to improve efficiency; the short check after this paragraph shows how PyTorch's autocast applies this split at the operator level. In mixed precision training, I found three techniques particularly important: Weight Backup, Loss Scaling, and Precision Accumulation.
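
  A quick check of the dtypes autocast produces (this toy snippet is not part of the experiment code and assumes a CUDA device): matmul-type ops such as nn.Linear run in FP16, while loss computations are kept in FP32.

import torch
import torch.nn.functional as F

# Under autocast, matmul-type ops (e.g. nn.Linear) run in FP16 while
# loss computations are kept in FP32. Assumes a CUDA device.
x = torch.randn(8, 4, device='cuda')
layer = torch.nn.Linear(4, 1).to('cuda')
target = torch.randn(8, 1, device='cuda')

with torch.autocast(device_type='cuda', dtype=torch.float16):
    out = layer(x)
    loss = F.mse_loss(out, target)

print(out.dtype)   # torch.float16
print(loss.dtype)  # torch.float32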

  Weight Backup:

  In mixed precision training, the model's weights are usually maintained in both FP16 and FP32. Weight backup means keeping a master copy of the weights in FP32: even if numerical instability occurs in the FP16 computations, we can still rely on the FP32 copy to remain stable and accurate. This is critical for accuracy when updating model parameters.
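
  PyTorch's AMP handles this bookkeeping for you, but the idea can be sketched by hand. Below is a minimal illustration (not AMP's actual implementation; the tiny model and the name master_params are made up for this example, and a CUDA device is assumed): the compute copy of the model is FP16, while an FP32 master copy of every parameter is kept as the source of truth.

import torch
from torch import nn

# Illustrative sketch of weight backup (not AMP's internal implementation):
# the compute copy of the model lives in FP16, while an FP32 master copy of
# every parameter is kept on the side as the source of truth for updates.
fp16_model = nn.Linear(10, 1).to('cuda').half()                                 # FP16 compute copy
master_params = [p.detach().clone().float() for p in fp16_model.parameters()]  # FP32 backup

x = torch.randn(4, 10, device='cuda', dtype=torch.float16)
out = fp16_model(x)                # forward pass runs entirely in FP16
print(out.dtype)                   # torch.float16
print(master_params[0].dtype)      # torch.float32 -- the authoritative copy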

  Loss Scaling:

  Because of FP16's limited range, gradient values may be too small to be represented in FP16 at all, so the effective gradient becomes zero. Loss scaling multiplies the loss by a large constant (the scale factor) before the backward pass, which scales every gradient by the same factor and keeps them representable and non-zero in FP16. After backpropagation, the scaled gradients are divided by the same factor to restore their original magnitude. This effectively mitigates gradient underflow.
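
  This is exactly what torch.cuda.amp.GradScaler automates in the experiment code further below. A rough manual sketch of the same idea (illustrative only: the fixed scale factor of 1024 is an arbitrary choice here, whereas GradScaler adjusts its factor dynamically and skips steps whose gradients contain inf/NaN):

import torch
from torch import nn

# Manual loss scaling, for illustration only; GradScaler picks and updates
# the scale factor dynamically and skips steps with inf/NaN gradients.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scale = 1024.0                                   # arbitrary fixed scale factor for this sketch

x = torch.randn(4, 10)
y = torch.randn(4, 1)

loss = nn.functional.mse_loss(model(x), y)
(loss * scale).backward()                        # gradients come out 1024x too large
for p in model.parameters():
    p.grad.div_(scale)                           # unscale to restore the true gradient magnitudes
optimizer.step()
optimizer.zero_grad()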

  Precision Accumulation:

  Precision accumulation means that even though gradients are computed in FP16, the weight update itself is performed in FP32. This reduces rounding and accumulation error, which matters when training involves a large number of accumulation operations: FP32's higher precision and wider range let small gradient values be accumulated accurately, avoiding numerical instability in the weight update.
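
  Continuing the hand-rolled sketch above (again only an illustration of the idea, not how AMP is implemented, and again assuming a CUDA device): FP16 gradients are cast up to FP32, the update is accumulated into the FP32 master weights, and the FP16 compute copy is refreshed from the updated master.

import torch
from torch import nn

# Illustrative sketch of precision accumulation: gradients are produced in
# FP16, but the weight update is accumulated into the FP32 master copy.
# fp16_model / master are ad hoc names for this example.
fp16_model = nn.Linear(10, 1).to('cuda').half()
master = [p.detach().clone().float() for p in fp16_model.parameters()]
lr = 1e-3

x = torch.randn(4, 10, device='cuda', dtype=torch.float16)
y = torch.randn(4, 1, device='cuda', dtype=torch.float16)

loss = nn.functional.mse_loss(fp16_model(x), y)
loss.backward()                                  # gradients are FP16

with torch.no_grad():
    for p, m in zip(fp16_model.parameters(), master):
        m -= lr * p.grad.float()                 # accumulate the update in FP32
        p.copy_(m.half())                        # refresh the FP16 compute copy
        p.grad = None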

  To sum up, by combining these techniques, mixed precision training can exploit the performance benefits of FP16 while keeping accuracy loss and numerical instability to a minimum.

Experimental Comparison

  To further verify the effectiveness of these techniques, I designed two experiments to compare training using mixed precision and traditional FP32. Here are the code snippets I used in both experiments:

FP16 and FP32 mixed training code:

import torch
from tensorboardX import SummaryWriter
from torch import optim, nn
import time

from torch.cuda.amp import GradScaler, autocast


class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(2, 20000),

            nn.Linear(20000, 20000),
            # nn.Dropout(0.1),

            nn.Linear(20000, 200),
            # nn.LayerNorm(20),

            nn.Linear(200, 20),
            # nn.LayerNorm(20),

            nn.Linear(20, 1),
        )

    def forward(self, x):
        return self.linears(x)

lr = 0.0001
iteration = 1000


x1 = torch.arange(-1000, 1000).float().to('cuda')
x2 = torch.arange(0, 2000).float().to('cuda')
x = torch.cat((x1.unsqueeze(1), x2.unsqueeze(1)), dim=1)
y = (2*x1 - x2 + 1).to('cuda')

model = Model().to('cuda')
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=0.01)
loss_function = torch.nn.MSELoss()

scaler = GradScaler()

start_time = time.time()
writer = SummaryWriter(comment='_FP16')

for iter in range(iteration):
    # Run the forward pass and loss computation under autocast so that
    # eligible ops execute in FP16.
    with autocast():
        y_pred = model(x)
        loss = loss_function(y, y_pred.squeeze())
    # Scale the loss before backward to avoid FP16 gradient underflow, then
    # let the scaler unscale the gradients, apply the update, and adjust
    # its scale factor for the next step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    writer.add_scalar('loss', loss, iter)
    optimizer.zero_grad()

    if iter % 100 == 0:
        # Report GPU memory usage
        print("GPU Memory Allocated:", torch.cuda.memory_allocated())
        print("GPU Memory Cached:   ", torch.cuda.memory_reserved())

print("Time: ", time.time() - start_time)
torch.save(model.state_dict(), 'model_state_dict_fp16.pth')


FP32 training code:

import torch
from tensorboardX import SummaryWriter
from torch import optim, nn
import time


class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(2, 20000),

            nn.Linear(20000, 20000),
            # nn.Dropout(0.1),

            nn.Linear(20000, 200),
            # nn.LayerNorm(20),

            nn.Linear(200, 20),
            # nn.LayerNorm(20),

            nn.Linear(20, 1),
        )

    def forward(self, x):
        return self.linears(x)

lr = 0.0001
iteration = 1000


x1 = torch.arange(-1000, 1000).float().to('cuda')
x2 = torch.arange(0, 2000).float().to('cuda')
x = torch.cat((x1.unsqueeze(1), x2.unsqueeze(1)), dim=1)
y = (2*x1 - x2 + 1).to('cuda')

model = Model().to('cuda')
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=0.01)
loss_function = torch.nn.MSELoss()

start_time = time.time()
writer = SummaryWriter(comment='_FP32')

for iter in range(iteration):
    # Standard full-precision training step: no autocast, no gradient scaling.
    y_pred = model(x)
    loss = loss_function(y, y_pred.squeeze())
    loss.backward()
    optimizer.step()

    writer.add_scalar('loss', loss, iter)
    optimizer.zero_grad()

    if iter % 100 == 0:
        # Report GPU memory usage
        print("GPU Memory Allocated:", torch.cuda.memory_allocated())
        print("GPU Memory Cached:   ", torch.cuda.memory_reserved())

print("Time: ", time.time() - start_time)
torch.save(model.state_dict(), 'model_state_dict_fp32.pth')


  The final results of the two runs are as follows:

Experiment                   GPU memory allocated     Training time
FP16/FP32 mixed training     4,867,271,680 bytes      78.73 s
FP32 training                4,867,274,752 bytes      140.18 s

  Experimental analysis is as follows:

  1. In terms of memory usage, the two training runs are almost identical. This is expected for this setup: with torch.cuda.amp the model parameters and the Adam optimizer state are kept in FP32 (autocast only casts tensors to FP16 on the fly for eligible ops), and for this model those FP32 buffers dominate memory. The savings from FP16 come mainly from activations, which are comparatively small here; in workloads where activations dominate, mixed precision reduces the memory footprint noticeably.

  2. In terms of training time, mixed precision training is significantly faster than pure FP32 training: FP16 arithmetic runs faster on modern GPUs and each tensor moves half as many bytes, so every step completes sooner. Mixed precision training combines the efficiency of FP16 with the numerical stability of FP32, giving a balanced solution.

[Figure: training loss curves, full precision FP32 (blue) vs. mixed precision FP16/FP32 (orange)]

  In the loss plot, the blue curve is the loss of the model trained in full precision (FP32) and the orange curve is the loss of the model trained with mixed precision (FP16 and FP32). Looking more closely at the two curves:

  · After the initial drop, both curves flatten out, indicating that both models have essentially converged; in this plateau the loss barely changes, so performance on the training set has stabilized.

  · There is no clear sign of overfitting, since neither loss curve trends upward again.

  · Mixed precision training comes out slightly ahead in time efficiency. The final loss values are close, so when fast iteration and efficient training matter, mixed precision is the more attractive choice. In terms of convergence, the mixed precision run converged at around epoch 140 while the full precision run converged as early as epoch 100: full precision converges faster per step, but mixed precision is still faster in wall-clock time.

Source: blog.csdn.net/qq_43592352/article/details/134835677