Deep Learning Model Accuracy and PyTorch Model Quantization

Table of contents

1. What is model quantization

2. PyTorch model quantization

2.1 Quantization of Tensor

2.2 Post Training Dynamic Quantization

2.3 Post Training Static Quantization

2.4 Quantization Aware Training (QAT)

3. Mixed precision training (Automatic Mixed Precision)

3.1 Solutions

3.2 How PyTorch and TensorFlow apply mixed precision training


1. What is model quantization

PyTorch model quantization official documentation: Quantization — PyTorch 2.0 documentation

Most current deep learning frameworks use fp32 to store weight parameters. For comparison, Python's built-in float is a double-precision floating-point number (fp64), while the default dtype of a PyTorch tensor is single-precision fp32.

The main problems with using fp32 are:

  1. The model is large, and GPU memory requirements during training are high.
  2. Model training is slow.
  3. Model inference is slow.

PyTorch supports int8 quantization, giving a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for int8 computation is typically 2 to 4 times faster than FP32 computation. Quantization is primarily a technique to speed up inference, and quantized operators support only the forward pass. Put simply, quantization in deep learning means using fewer bits to store tensors that would otherwise be stored as floating-point numbers, and using fewer bits to perform calculations that would otherwise be done in floating point.

Data type   Range
float16     -65504 ~ 65504
float32     about -3.4e38 ~ 3.4e38
int8        -2^7 ~ 2^7-1 (-128 ~ 127)
uint8       0 ~ 2^8-1 (0 ~ 255)
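Concretely, int8/uint8 quantization maps a float value x to an integer q through an affine transform, q = round(x / scale) + zero_point (clamped to the integer range), and maps it back with x ≈ scale * (q - zero_point). A minimal pure-Python sketch of this formula (not PyTorch's internal implementation, just the standard mapping):

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization of a single float value to int8."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))      # clamp into the integer range

def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original float value."""
    return scale * (q - zero_point)

# 0.7451 stored with scale=0.5, zero_point=8 becomes 9, which maps back to 0.5
print(quantize(0.7451, 0.5, 8), dequantize(9, 0.5, 8))   # -> 9 0.5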

Benefits of model quantization:

  • Less storage overhead and bandwidth requirements
  • Faster calculation speed (calculation speed is 2~4 times faster due to less memory access and faster int8 calculation)
  • Lower energy consumption and footprint
  • Acceptable loss of accuracy (quantization is equivalent to adding noise to the model weights; CNNs are not very sensitive to such noise, and during training the noise introduced by simulated quantization even helps prevent overfitting, so quantization at a reasonable bit width does not cause a serious accuracy drop)

For a quantized model, some or all of its tensor operations will be calculated using the int type instead of the float type before quantization. Of course, quantization also requires underlying hardware support. Mainstream hardware such as x86 CPU (supporting AVX2), ARM CPU, Google TPU, Nvidia Volta/Turing/Ampere, and Qualcomm DSP all provide support for quantization.

Note: The mainstream model compression methods in deep learning are quantization, model pruning, and knowledge distillation; among them, model quantization is the most widely used form of model compression.

2. PyTorch model quantization

PyTorch currently supports quantization in the following three ways:

  • Post Training Dynamic Quantization, dynamic quantization after model training
  • Post Training Static Quantization, static quantization after model training
  • QAT (Quantization Aware Training), quantization performed during model training

2.1 Quantization of Tensor

>>> x = torch.rand(2,3, dtype=torch.float32) 
>>> x
tensor([[0.6839, 0.4741, 0.7451],
        [0.9301, 0.1742, 0.6835]])

>>> xq = torch.quantize_per_tensor(x, scale = 0.5, zero_point = 8, dtype=torch.quint8)
>>> xq
tensor([[0.5000, 0.5000, 0.5000],
        [1.0000, 0.0000, 0.5000]], size=(2, 3), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.5, zero_point=8)

>>> xq.int_repr()
tensor([[ 9,  9,  9],
        [10,  8,  9]], dtype=torch.uint8)

>>> xdq = xq.dequantize()
>>> xdq
tensor([[0.5000, 0.5000, 0.5000],
        [1.0000, 0.0000, 0.5000]])
  • quantize_per_tensor: converts a float tensor into a quantized tensor using the given scale and zero_point (zp)
  • dequantize: the inverse of quantize_per_tensor; converts a quantized tensor back to a float tensor

The values of xdq and x deviate from each other, which tells us two things:

  • There will be a loss of precision in quantization ;
  • Choosing an appropriate scale and zp can effectively reduce the loss of precision (for example, scale = 0.0036, zero_point = 0)
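For instance, re-quantizing the same x with the finer scale suggested above keeps the dequantized values much closer to the originals; a quick check might look like this (output omitted since x is random):

>>> xq2 = torch.quantize_per_tensor(x, scale=0.0036, zero_point=0, dtype=torch.quint8)
>>> (x - xq2.dequantize()).abs().max()   # far smaller worst-case error than with scale=0.5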

In PyTorch, the work of selecting an appropriate scale and zp is done by various observers. Tensor quantization supports two modes: per tensor and per channel. Per tensor means that all values in a tensor are scaled and offset in the same way; per channel means that the values along one dimension of the tensor (usually the channel dimension) share their own scale and offset, so a single tensor carries many scale/offset pairs (forming a vector), which introduces less quantization error than the per-tensor mode. PyTorch currently supports per-channel quantization for conv2d(), conv3d(), and linear().
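A per-channel variant of the earlier example might look like the following sketch, where each slice along the chosen axis gets its own scale and zero_point (the values here are purely illustrative):

import torch

x = torch.rand(2, 3, dtype=torch.float32)
scales = torch.tensor([0.1, 0.01])         # one scale per row (axis=0)
zero_points = torch.tensor([10, 0])        # one zero_point per row
xq = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.quint8)
print(xq.int_repr())                       # row 0 quantized with scale 0.1, row 1 with scale 0.01
xdq = xq.dequantize()                      # back to float32, with per-row rounding error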

2.2 Post Training Dynamic Quantization

This is the simplest form of quantization, where weights are quantized ahead of time and activations are quantized dynamically during inference. This approach is used when model execution time is dominated by loading weights from memory rather than computing matrix multiplications. Applying dynamic quantization to the entire model requires only one call to the torch.quantization.quantize_dynamic() function.

torch.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)

The quantize_dynamic API converts a float model into a dynamic quantized model, that is, a model in which only weights are quantized. The dtype parameter can take the value float16 or qint8. When converting the entire model, only the following ops are converted by default:

  • Linear
  • LSTM
  • LSTMCell
  • RNNCell
  • GRUCell

Why? Because dynamic quantization only quantizes the weight parameters, and these layers generally have a large number of parameters that account for a very high proportion of the whole model, so the marginal benefit is high. Dynamically quantizing other layers has little practical value.

For a quantize_dynamic call with default behavior, an example is as follows:

# Original network
Net(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (fc): Linear(in_features=3, out_features=2, bias=False)
  (relu): ReLU()
)

# After quantize_dynamic
Net(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (fc): DynamicQuantizedLinear(in_features=3, out_features=2, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
  (relu): ReLU()
)
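The before/after printout above can be reproduced with a sketch along these lines (the Net definition is an assumption inferred from the printout):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=1, stride=1, bias=False)
        self.fc = nn.Linear(3, 2, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc(self.relu(self.conv(x)))

model_fp32 = Net()
# Only nn.Linear is in the default dynamic-quantization mapping, so only fc gets swapped
model_int8 = torch.quantization.quantize_dynamic(model_fp32, dtype=torch.qint8)
print(model_int8)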

It can be seen that, except for Linear, none of the ops changed. Linear was converted to DynamicQuantizedLinear, which is the torch.nn.quantized.dynamic.modules.linear.Linear class. In essence, the quantize_dynamic API checks the type of each op in the model: if an op's type is a key of the dictionary DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS, the op is replaced with the value corresponding to that key:

# Default map for swapping dynamic modules
DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS = {
    nn.GRUCell: nnqd.GRUCell,
    nn.Linear: nnqd.Linear,
    nn.LSTM: nnqd.LSTM,
    nn.LSTMCell: nnqd.LSTMCell,
    nn.RNNCell: nnqd.RNNCell,
}

Summary: Post Training Dynamic Quantization, often shortened to Dynamic Quantization or weight-only quantization, quantizes the parameters of certain ops to INT8 ahead of time, dynamically quantizes the inputs to INT8 at run time, and requantizes the result back to float32 at the op's output. By default, dynamic quantization applies only to Linear and the RNN variants.


2.3 Post Training Static Quantization

This is the most commonly used form of quantization, where weights are quantized ahead of time, and activation tensor scale factors and biases are precomputed based on observing the behavior of the model during calibration. CNNs are a typical use case, where post-training quantization is often done when both memory bandwidth and computational savings are important.

Similarities and differences with dynamic quantization:

  • The same point is that the weight parameters of the network are converted from float32 to int8
  • The difference is that the training set, or data with a similar distribution, must be fed through the model (note: no backpropagation is performed), and the quantization parameters (scale and zp) of the activations are then computed from the distribution of each op's input. This step is called calibration.

Static quantization also covers activations, i.e. the post-processing after each op's forward pass. Why does static quantization need to handle activations? Because the forward inference of a statically quantized model runs in INT from (start + 1) to (end - 1), the activation quantization parameters must ensure that the output of one op matches the expected input of the next op.

The general procedure for performing post-training quantization is as follows:

  • Step 1 - Prepare the model: specify where activations are explicitly quantized and dequantized by adding QuantStub and DeQuantStub modules; ensure modules are not reused; convert any operations that require output requantization from functional form into module form;
  • Step 2 - Fuse combined operations like conv+relu or conv+batchnorm+relu to improve model accuracy and performance;
  • Step 3 - Specify the configuration of the quantization method, such as choosing symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques;
  • Step 4 - Plug in the torch.quantization.prepare() module to observe activation tensors during calibration;
  • Step 5 - Perform the calibration operation on the model using the calibration dataset;
  • Step 6 - Use the torch.quantization.convert() module to convert the model, which includes computing and storing the scale and bias values to be used for each activation tensor and replacing key operators with their quantized implementations, etc.
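Putting the six steps together, a minimal eager-mode sketch might look like the following (the toy model, the 'fbgemm' backend choice, and the random calibration data are all assumptions for illustration):

import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # Step 1: float -> int8 at the input
        self.conv = nn.Conv2d(1, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # Step 1: int8 -> float at the output

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model_fp32 = M().eval()
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')          # Step 3: x86 backend ('qnnpack' for ARM)
model_fused = torch.quantization.fuse_modules(model_fp32, [['conv', 'relu']])  # Step 2: fuse conv+relu
model_prepared = torch.quantization.prepare(model_fused)                       # Step 4: insert observers
with torch.no_grad():                                                           # Step 5: calibrate on representative data
    for _ in range(10):
        model_prepared(torch.randn(1, 1, 28, 28))
model_int8 = torch.quantization.convert(model_prepared)                        # Step 6: convert to the quantized model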

For more details, see the Zhihu (知乎) article "Quantization of PyTorch".

The biggest difference between dynamic quantization and static quantization:

  • With static quantization, the float input is converted to int by QuantStub and stays int all the way to the output
  • With dynamic quantization, the float input is quantized to int with a dynamically computed scale and zp, and each op's output is converted back to float

2.4 Quantization Aware Training (QAT)

In the rare cases where post-training quantization does not provide sufficient accuracy, the torch.quantization.FakeQuantize() module can be inserted to perform training-time quantization. Computation is done in FP32, but values are clamped and rounded to simulate the effect of INT8 quantization. The specific steps are as follows:

  • Step 1 - Prepare the model: specify where activations are explicitly quantized and dequantized by adding QuantStub and DeQuantStub modules; ensure modules are not reused; convert any operations that require output requantization from functional form into module form;
  • Step 2 - Fuse combined operations like conv+relu or conv+batchnorm+relu to improve model accuracy and performance;
  • Step 3 - Specify the configuration of the pseudo-quantization method, such as choosing symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques;
  • Step 4 - Insert the torch.quantization.prepare_qat() module, which is used for analog quantization during training;
  • Step 5 - train or fine tune the model;
  • Step 6 - Use the torch.quantization.convert() module to convert the model, which includes computing and storing the scale and bias values to be used for each activation tensor and replacing key operators with their quantized implementations, etc.
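A minimal QAT sketch under the same kind of assumptions (toy model with QuantStub/DeQuantStub, 'fbgemm' backend, random tensors standing in for a real training loop; fusion from Step 2 is omitted for brevity):

import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(1, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model_fp32 = M().train()
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')   # Step 3: QAT qconfig
model_prepared = torch.quantization.prepare_qat(model_fp32)                 # Step 4: insert fake-quant modules

optimizer = torch.optim.SGD(model_prepared.parameters(), lr=0.01)
for _ in range(10):                                                          # Step 5: train / fine-tune as usual
    out = model_prepared(torch.randn(8, 1, 28, 28))
    loss = out.mean()                                                        # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model_int8 = torch.quantization.convert(model_prepared.eval())               # Step 6: convert to int8 for inference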

For more details, see the Zhihu (知乎) article "Quantization of PyTorch".

QAT summary:

# Original model: all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# During training, the fake_quant modules take effect
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# Inference with the quantized model: weights and inputs are both int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8

3. Mixed precision training (Automatic Mixed Precision)

Low-precision computation can be used to optimize a model. For inference, the mature optimization schemes are fp16 quantization and int8 quantization; for training, the scheme is mixed-precision training. Its basic idea is simple: halve the precision (fp32 → fp16) and roughly halve the training time. Compared with the single-precision floating-point type float32 (32 bits, 4 bytes), the half-precision type float16 is only 16 bits, i.e. 2 bytes.

Replacing FP32 with FP16 during training has two benefits :

  • Lower GPU memory usage: FP16 takes only half the memory of FP32, so a larger batch size can be used
  • Faster training: with FP16, model training can be nearly twice as fast

During the training process, directly using half-precision calculations will cause two problems :

  • Rounding error: operations on sufficiently small floating-point numbers round the result toward zero. Many, even most, gradient update values in backpropagation are very small, and accumulated rounding error can turn them into 0 or nan, which makes the gradient updates inaccurate and hurts the convergence of the network
  • Overflow / underflow error (Grad Overflow / Underflow): FP16 has a much narrower dynamic range (its normal values span roughly 6e-5 to 65504), so computed values can exceed it (overflow) or fall below it (underflow)

Mixed precision (Automatic Mixed Precision, AMP) can halve memory usage and roughly double training speed with just a few lines of code. AMP was proposed by the Baidu and NVIDIA teams in 2017 (the paper "Mixed Precision Training", published at ICLR). Before PyTorch 1.6, people used NVIDIA's apex library for AMP training; since version 1.6, PyTorch ships with AMP out of the box.

Mixed-precision training is to use half-precision floating-point numbers to accelerate training while reducing precision loss as much as possible. It uses FP16, that is, half-precision floating-point numbers to store weights and gradients, which accelerates training while reducing memory usage . The biggest problem of using FP16 to replace the original FP32 neural network calculation is the loss of precision .

3.1 Solutions

1) Keep an FP32 copy of each weight

To train in FP16, the model weights and input data are converted to FP16, and backpropagation then produces FP16 gradients. If the weights are updated directly at this point, rounding errors can occur, because gradient * learning rate is often tiny compared with the weight itself.

So the solution is: store the model weights, activations, gradients and other data in FP16, while maintaining an FP32 copy of the model weights for the update. After backpropagation produces the FP16 gradients, they are converted to FP32 and unscaled, and finally the FP32 weights are updated. Since the entire update is carried out in FP32, there are no rounding errors.

The FP32 weight backup solves the rounding-error problem in the weight update.
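A conceptual sketch of the FP32 master-weight idea (this mirrors what AMP frameworks do internally; the numbers are made up for illustration):

import torch

w_fp32 = torch.randn(10)                     # FP32 master copy, used only for the update
w_fp16 = w_fp32.half()                       # FP16 working copy, used in forward/backward
grad_fp16 = (torch.randn(10) * 1e-4).half()  # pretend FP16 gradient from the backward pass
lr = 1e-3

# lr * grad (~1e-7) is far below FP16 resolution near a weight of magnitude ~1 (~1e-3),
# so updating w_fp16 directly would often be a no-op; the FP32 update is not.
w_fp32 -= lr * grad_fp16.float()
w_fp16 = w_fp32.half()                       # refresh the FP16 working copy for the next step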

2) Loss scaling

To solve the underflow problem, the computed loss value is scaled up. Because of the chain rule, scaling the loss scales every gradient, shifting the scaled gradients into FP16's representable range so they can be stored in FP16 without underflowing. Before the update, the gradients are converted to FP32 and then unscaled.

The scaling factor (loss_scale) is generally determined automatically by the framework. As long as no inf or nan occurs, the larger the loss_scale, the better. Because as the training progresses, the gradient of the network will become smaller and smaller, and a larger loss_scale can make full use of the representation range of FP16.
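A quick REPL illustration of the underflow problem and how scaling avoids it (the smallest positive FP16 subnormal is about 6e-8):

>>> import torch
>>> torch.tensor(1e-8, dtype=torch.float16)          # below FP16's smallest subnormal, flushes to zero
tensor(0., dtype=torch.float16)
>>> torch.tensor(1e-8 * 1024, dtype=torch.float16)   # after scaling by 1024 it becomes representable
tensor(1.0252e-05, dtype=torch.float16)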

3) Improved arithmetic: multiply in FP16, accumulate in FP32 (FP16 * FP16 + FP32)

Modules that are numerically unstable in FP16 are forced to run in FP32. For example, a BN layer that computes batch statistics should run in FP32, otherwise rounding errors occur: its weights are kept in FP32 and its input is converted from FP16 to FP32, so the whole module is guaranteed to run in FP32.

3.2 How PyTorch and TensorFlow apply mixed precision training

PyTorch can use NVIDIA's open-source apex library to support mixed-precision and distributed training:

from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # "O1" = standard mixed-precision mode
with amp.scale_loss(loss, optimizer) as scaled_loss:                 # scales the loss for the backward pass
    scaled_loss.backward()
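Since PyTorch 1.6 ships AMP natively, the same thing can be done without apex via torch.cuda.amp; a minimal sketch (model, optimizer, loader and loss_fn are assumed to exist already, with the model on a CUDA device):

import torch

scaler = torch.cuda.amp.GradScaler()           # handles loss scaling automatically
for data, target in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # ops run in FP16 or FP32 as appropriate
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()              # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                     # unscales the gradients; skips the step on inf/nan
    scaler.update()                            # adjusts the scale factor for the next iteration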

TensorFlow is even simpler; there is already official support, just add one line before training:

export TF_ENABLE_AUTO_MIXED_PRECISION=1
# or
import os
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'

Origin: blog.csdn.net/daydayup858/article/details/128373747