DL model quantization

Compiled from the original article on the AIWalker public account.

Contents of this article

1 Deep Neural Network Quantization White Paper
(from Google)
1.1 The practical significance of model quantization
1.2 Uniform Affine quantization
1.3 Uniform symmetric quantization
1.4 Random quantization
1.5 Fake quantization
1.6 Determining quantization parameters
1.7 Quantization granularity
1.8 Post-training quantization
1.9 Quantization-aware training

TL;DR version

This article introduces the basic concepts and methods of deep neural network quantization. Deep neural networks are increasingly used in edge computing, and edge computing devices usually have low computing power and are limited by memory and power consumption. Furthermore, it is necessary to reduce the communication bandwidth and network connection resources required to download models from the cloud to mobile devices. Therefore, techniques are needed to optimize the model size in order to achieve faster inference and lower power consumption. The first few sections introduce linear quantization operations, and the last two sections introduce the two mainstream quantization techniques: post-training quantization and quantization-aware training.

1 Deep Neural Network Quantization White Paper

Paper title: Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper

Paper address:

https://arxiv.org/pdf/1806.08342.pdf

1.1 The practical significance of model quantization

Deep neural networks are increasingly used in edge computing, and edge computing devices usually have low computing power and are limited by memory and power consumption. Furthermore, it is necessary to reduce the communication bandwidth and network connection resources required to download models from the cloud to mobile devices. Therefore, some techniques are needed to optimize the model size to achieve faster inference and lower power consumption.

The practical significance of model quantization is:

  1. Applicable to a wide range of models and usage scenarios: no special model design is required. Any floating-point model can be quickly quantized to a fixed-point model with essentially no loss of accuracy, and no retraining is required. Many hardware platforms and libraries are supported.

  2. Smaller model size (weight quantization): with 8-bit quantization the model is 4x smaller, and downloads are faster.

  3. Less working memory and cache needed to store inference activations (activation quantization): intermediate results during inference are usually kept in cache for reuse by subsequent layers; reducing the precision of these data reduces the required working memory and cache.

  4. Faster calculations: Most processors allow faster processing of 8-bit data.

  5. Lower power consumption: moving 8-bit data requires far less power than moving 32-bit data. In many architectures, data movement dominates power consumption, so quantization directly reduces it.

1.2 Uniform affine quantization

Given a floating-point number $x$ in the range $(x_{min}, x_{max})$, it needs to be quantized to the integer range $(0, N_{levels} - 1)$, where $N_{levels} = 256$ for 8-bit quantization.

Two parameters are defined below: Scale and Zero-Point.

Scale ($\Delta$) specifies the step size of the quantizer, i.e. the amount by which the floating-point value increases for each increment of one quantization level.

Zero-Point ($z$) is an integer in the quantized domain that indicates the quantized value to which the floating-point zero is mapped, so that zero is represented exactly.

For data distributed on one side of zero, the range $(x_{min}, x_{max})$ is relaxed to include the value 0. For example, a floating-point variable distributed in $(2.1, 3.5)$ is relaxed to $(0, 3.5)$ and then quantized. Note that for extremely one-sided distributions this can cause a loss of precision.

Once Scale and Zero-Point are determined, the quantization process is as follows:

$x_{int} = \mathrm{round}\left(\frac{x}{\Delta}\right) + z, \qquad x_Q = \mathrm{clamp}(0,\ N_{levels}-1,\ x_{int})$

where $x$ is the floating-point number before quantization, $x_Q$ is the integer after quantization, and $\mathrm{clamp}(a, b, x) = \min(\max(x, a), b)$.

We can also see that when $x = 0$, $x_Q = z$; that is, the floating-point zero is quantized exactly to the Zero-Point.

The de-quantization operation is:

$x_{float} = (x_Q - z)\,\Delta$

While uniform affine quantization allows weights and activations to be stored with 8-bit precision, there is an additional cost due to the Zero-Point. Consider the 2D convolution between a weight and an activation:

$y(k, l, n) = \Delta_w \Delta_x\, \mathrm{conv}\big(w_Q(k, l, m; n) - z_w,\ x_Q(k, l, m) - z_x\big)$

Expanding this product introduces cross terms involving $z_w \sum x_Q$ and $z_x \sum w_Q$, which is the extra computation incurred by the Zero-Point.
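To make the mapping above concrete, here is a minimal NumPy sketch of uniform affine quantization under the assumptions above (8-bit, unsigned range [0, 255]); the function names are illustrative and not taken from any library.

```python
# Minimal sketch of uniform affine (asymmetric) quantization, assuming
# 8-bit unsigned quantization (N_levels = 256).
import numpy as np

def affine_quant_params(x_min, x_max, n_levels=256):
    # Relax the range to include 0 so that zero is representable exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (n_levels - 1)        # Delta
    zero_point = int(round(-x_min / scale))         # z, an integer in [0, 255]
    return scale, zero_point

def quantize(x, scale, zero_point, n_levels=256):
    x_int = np.round(x / scale) + zero_point
    return np.clip(x_int, 0, n_levels - 1).astype(np.uint8)   # x_Q

def dequantize(x_q, scale, zero_point):
    return (x_q.astype(np.float32) - zero_point) * scale      # x_float

x = np.array([-0.7, 0.0, 0.3, 1.2], dtype=np.float32)
scale, zp = affine_quant_params(x.min(), x.max())
x_q = quantize(x, scale, zp)
print(x_q, dequantize(x_q, scale, zp))   # zero maps exactly to the Zero-Point
```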

1.3 Uniform symmetric quantization

Uniform symmetric quantization forces the Zero-Point to 0; that is, a floating-point value of 0 is quantized to 0. The symmetric quantization process is as follows:

$x_{int} = \mathrm{round}\left(\frac{x}{\Delta}\right), \qquad x_Q = \mathrm{clamp}\left(-\frac{N_{levels}}{2},\ \frac{N_{levels}}{2}-1,\ x_{int}\right)$

For faster SIMD implementations, the range of the weights can be further restricted. In this case, the clamp is modified to:

$x_Q = \mathrm{clamp}\left(-\left(\frac{N_{levels}}{2}-1\right),\ \frac{N_{levels}}{2}-1,\ x_{int}\right)$, i.e. $[-127, 127]$ for 8-bit quantization.

See [1] for further details.

The de-quantization operation is:

$x_{float} = x_Q\,\Delta$
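A corresponding NumPy sketch of the symmetric case; this is an illustration rather than any library's API, and the `restrict_range` flag implements the SIMD-friendly clamp to [-127, 127] mentioned above.

```python
# Minimal sketch of uniform symmetric quantization (Zero-Point fixed to 0),
# assuming signed 8-bit values.
import numpy as np

def symmetric_quantize(x, n_levels=256, restrict_range=False):
    scale = np.max(np.abs(x)) / (n_levels // 2 - 1)           # Delta from max |x|
    lo = -(n_levels // 2 - 1) if restrict_range else -(n_levels // 2)
    hi = n_levels // 2 - 1                                    # 127 for 8 bits
    x_q = np.clip(np.round(x / scale), lo, hi).astype(np.int8)
    return x_q, scale

def symmetric_dequantize(x_q, scale):
    return x_q.astype(np.float32) * scale                     # x_float = x_Q * Delta

w = np.random.randn(16).astype(np.float32)
w_q, delta = symmetric_quantize(w, restrict_range=True)       # values in [-127, 127]
w_hat = symmetric_dequantize(w_q, delta)
```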

1.4 Random quantization

In random (stochastic) quantization, additive uniform noise is injected before rounding. The process of random quantization is as follows:

$x_{int} = \mathrm{round}\left(\frac{x}{\Delta} + \epsilon\right) + z, \quad \epsilon \sim \mathrm{Unif}\left(-\tfrac{1}{2},\ \tfrac{1}{2}\right), \qquad x_Q = \mathrm{clamp}(0,\ N_{levels}-1,\ x_{int})$

The de-quantization operation is the same as for uniform affine quantization. Random quantization is not considered for inference, since most inference hardware does not support it.
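A small sketch of the stochastic variant under the same assumptions as the affine sketch above; the uniform noise in [-1/2, 1/2] makes rounding up or down probabilistic in proportion to the fractional part.

```python
# Sketch of stochastic (random) quantization: uniform noise is added before
# rounding. Reuses the affine quantizer parameters from the earlier sketch.
import numpy as np

def stochastic_quantize(x, scale, zero_point, n_levels=256, rng=np.random):
    noise = rng.uniform(-0.5, 0.5, size=np.shape(x))
    x_int = np.round(x / scale + noise) + zero_point
    return np.clip(x_int, 0, n_levels - 1).astype(np.uint8)
```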

1.5 Fake quantization

Fake quantization (also called simulated quantization or pseudo-quantization) is a common technique in quantization-aware training. During quantization-aware training, simulated quantization operations are used, each consisting of a quantizer followed by a de-quantizer:

$x_{out} = \mathrm{SimQuant}(x) = \Delta\left(\mathrm{clamp}\left(0,\ N_{levels}-1,\ \mathrm{round}\left(\frac{x}{\Delta}\right) + z\right) - z\right)$

Because the quantization process contains a rounding operation whose derivative is zero almost everywhere, the quantizer must be modeled differently during backpropagation, using the operation below, as shown in Figure 1.

Backpropagation is modeled with the Straight-Through Estimator (STE) [2] method:

$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial x_{out}} \cdot \mathbb{1}_{x \in S}, \qquad S = \{x : x_{min} \le x \le x_{max}\}$

In the formula, $\frac{\partial L}{\partial x_{out}}$ is the derivative of the loss function with respect to the output of the simulated quantizer.

Figure 1: (Top) the round operation, whose derivative is zero almost everywhere. (Bottom) the clamp operation, through which gradients can be propagated.
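Below is an illustrative PyTorch sketch of a fake-quantization (SimQuant) node with an STE backward pass. It is a re-implementation for clarity, not the paper's code, and the scale and zero-point in the usage line are arbitrary values.

```python
# Fake quantization with a straight-through estimator (STE) backward pass.
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin, qmax):
        x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        x_out = (x_q - zero_point) * scale            # quantize then de-quantize
        # Remember which elements fell inside the representable range.
        x_min, x_max = (qmin - zero_point) * scale, (qmax - zero_point) * scale
        ctx.save_for_backward((x >= x_min) & (x <= x_max))
        return x_out

    @staticmethod
    def backward(ctx, grad_output):
        inside, = ctx.saved_tensors
        # STE: pass the gradient through unchanged inside the range, zero outside.
        return grad_output * inside.to(grad_output.dtype), None, None, None, None

x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.05, 128, 0, 255)
y.sum().backward()        # x.grad is 1 where x lay inside the quantization range
```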

1.6 Determining quantization parameters

Quantizer parameters can be determined using several criteria. For example, TensorRT [3] minimizes the KL divergence between the original distribution and the quantized distribution to determine the step size. This article takes a simpler approach:

For weights: use the actual minimum and maximum values of the weight tensor to determine the quantizer's Scale parameter.

For activations: use a moving average of the minimum and maximum values observed over mini-batches to determine the quantizer's Scale parameter.
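A minimal sketch of this range tracking: weights use the observed min/max directly, while activations use an exponential moving average of per-mini-batch minima and maxima. The momentum value here is an assumption, not taken from the paper.

```python
# Exponential-moving-average min/max observer for activation ranges.
import numpy as np

class EMAMinMaxObserver:
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.x_min = None
        self.x_max = None

    def update(self, x):
        batch_min, batch_max = float(np.min(x)), float(np.max(x))
        if self.x_min is None:                        # first mini-batch
            self.x_min, self.x_max = batch_min, batch_max
        else:
            m = self.momentum
            self.x_min = m * self.x_min + (1 - m) * batch_min
            self.x_max = m * self.x_max + (1 - m) * batch_max
        return self.x_min, self.x_max
```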

1.7 Quantization granularity

Per-Layer Quantization: Set a Scale parameter and a Zero-Point parameter for the entire weight tensor.

Per-Channel Quantization: Set a Scale parameter and a Zero-Point parameter for each convolution kernel.

The weight tensor of a convolution layer is 4-dimensional and is a collection of convolution kernels, each responsible for producing one output feature map. Per-Channel Quantization uses a different Scale and Zero-Point for each convolution kernel. Per-channel quantization of activations is not considered, as it would complicate the inner-product computations at the core of convolution and matrix multiplication.
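The difference in granularity amounts to how many scale parameters are derived from the weight tensor, as in the NumPy sketch below; a weight layout of `(out_channels, in_channels, kH, kW)` and symmetric weight quantization are assumed.

```python
# Per-layer vs. per-channel scale computation for a conv weight tensor.
import numpy as np

def per_layer_scale(w, n_levels=256):
    # One scale for the whole tensor.
    return np.max(np.abs(w)) / (n_levels // 2 - 1)

def per_channel_scales(w, n_levels=256):
    # One scale per output channel, i.e. per convolution kernel.
    return np.max(np.abs(w.reshape(w.shape[0], -1)), axis=1) / (n_levels // 2 - 1)

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
print(per_layer_scale(w), per_channel_scales(w).shape)   # scalar vs. (64,)
```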

1.8 Post-training quantization

Post-training quantization enables faster inference and a smaller model by quantizing weights and activations without retraining the model, and it can be performed with only a limited amount of data.

Quantizing only the weights

A simple approach is to quantize only the floating-point weights to 8-bit, leaving the activations in FP32. Since only the weights are quantized, this can be done without any validation data. This setup is useful when you only need to reduce the model size for transfer and storage and do not mind the cost of performing inference in FP32.

Quantizing weights and activations

A floating-point model can be quantized to 8-bit precision by computing the quantizer parameters for every quantity to be quantized. Since activations must also be quantized, a calibration set is required in order to estimate the dynamic range of the activation values. Typically, about 100 mini-batches are sufficient for the estimated activation ranges to converge. Through experiments, the paper found that the main cause of the large accuracy drop with per-layer weight quantization is Batch Normalization, which produces extreme variation in the dynamic ranges of the individual convolution kernels. Per-channel weight quantization avoids this problem, making the accuracy of per-channel quantization independent of the BN scaling factors.

Activation Quantization still uses the per-layer symmetric quantization strategy.
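A high-level sketch of the calibration loop described above for quantizing weights and activations. `model`, `calibration_loader`, and `forward_with_activations` are hypothetical placeholders, and the helpers `EMAMinMaxObserver` and `affine_quant_params` are the sketches from the earlier sections.

```python
# Post-training calibration: estimate activation ranges from ~100 mini-batches,
# then turn the observed ranges into affine quantizer parameters.
def calibrate_activations(model, calibration_loader, num_batches=100):
    observers = {}                                    # layer name -> EMAMinMaxObserver
    for i, batch in enumerate(calibration_loader):
        if i >= num_batches:
            break
        activations = model.forward_with_activations(batch)   # hypothetical hook
        for name, act in activations.items():
            observers.setdefault(name, EMAMinMaxObserver()).update(act)
    # Convert the observed ranges into (scale, zero_point) pairs per layer.
    return {name: affine_quant_params(obs.x_min, obs.x_max)
            for name, obs in observers.items()}
```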

1.9 Quantization-aware training

Quantization-aware training performs quantization during training and can achieve higher accuracy than the post-training quantization scheme. The paper uses simulated quantization operations to model the effect of quantization on weights and activations, and uses the Straight-Through Estimator to backpropagate through the simulated quantizers. Simulated quantization is applied in both the forward and backward passes, while a set of FP32 master weights is maintained and optimized with the gradient updates; this ensures that small gradient updates are not lost, because they act directly on the floating-point weights. The updated weights are then used in the following forward and backward passes.

Here, $w_{out} = \mathrm{SimQuant}(w)$ is the output of the simulated quantization process and $w$ is its input (the FP32 weight); the gradient reaching $w$ follows the STE rule from Section 1.5, $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial w_{out}} \cdot \mathbb{1}_{w \in (w_{min}, w_{max})}$.

Figure 2 below is the calculation diagram of Quantization aware training.

Figure 2: Quantization aware training calculation diagram
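The sketch below shows how a layer might apply simulated quantization to its weights during quantization-aware training while the optimizer keeps updating the FP32 master weights. It reuses the `FakeQuantSTE` sketch from Section 1.5, and the fixed `w_scale` is a simplification for illustration (in practice the scale would track the observed weight range).

```python
# Conv layer whose weights are fake-quantized in the forward pass; the FP32
# master weights are what the optimizer actually updates.
# Assumes the FakeQuantSTE class from the Section 1.5 sketch is in scope.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QATConv2d(nn.Conv2d):
    def __init__(self, *args, w_scale=0.01, **kwargs):
        super().__init__(*args, **kwargs)
        self.w_scale = w_scale     # fixed scale for simplicity

    def forward(self, x):
        # Symmetric fake quantization of the weights (zero_point = 0, [-127, 127]).
        w_q = FakeQuantSTE.apply(self.weight, self.w_scale, 0, -127, 127)
        return F.conv2d(x, w_q, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

layer = QATConv2d(16, 32, 3, padding=1)
out = layer(torch.randn(1, 16, 8, 8))
out.sum().backward()        # gradients reach the FP32 weights through the STE
```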

The author also describes how Batch Normalization is quantized. BN is defined by the following equations:

Training: $x_{bn} = \gamma\left(\frac{x - \mu_B}{\sigma_B}\right) + \beta$

Inference: $x_{bn} = \gamma\left(\frac{x - \mu}{\sigma}\right) + \beta$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the current mini-batch, and $\mu$ and $\sigma$ are the long-term mean and standard deviation, computed as moving averages of the batch statistics during training.

For inference, the author folds Batch Normalization into the weights, as $W_{inf} = \frac{\gamma W}{\sigma}$ (together with a correspondingly folded bias). Therefore, there is no explicit Batch Normalization operation at inference time.


Figure 3: Schematic diagram of training and inference of Batch Normalization

The training and inference diagram of Batch Normalization is shown in Figure 3 above.

During training: the INT8 output of the previous layer is convolved with weights that have first been scaled by the folding factor $\gamma / \sigma_B$ and passed through SimQuant fake quantization, producing the INT8 convolution result. The folded bias term is then added to obtain the equivalent output, and the layer's final result is obtained through another SimQuant fake quantization. Here $\sigma_B$ is the mini-batch standard deviation, computed from a convolution that uses the floating-point weights.

During inference: the INT8 output of the previous layer is convolved with weights that have been scaled by $\gamma / \sigma$ (using the long-term moving averages) and passed through SimQuant fake quantization, producing the INT8 convolution result. The folded bias term is then added to obtain the equivalent output, and the layer's final result is obtained through another SimQuant fake quantization.
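A minimal PyTorch sketch of folding Batch Normalization into the preceding convolution for inference, following $W_{inf} = \gamma W / \sigma$ with the corresponding folded bias; the layer shapes and the equivalence check are for illustration only.

```python
# Fold BatchNorm (using its long-term running statistics) into the conv weights
# and bias, so no explicit BN is needed at inference time.
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    sigma = torch.sqrt(bn.running_var + bn.eps)                # per-channel std
    scale = bn.weight / sigma                                  # gamma / sigma
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(bn.bias + scale * (conv_bias - bn.running_mean))
    return fused

conv, bn = nn.Conv2d(8, 16, 3, bias=False), nn.BatchNorm2d(16)
bn.eval()
fused = fold_bn_into_conv(conv, bn)
x = torch.randn(1, 8, 5, 5)
print(torch.allclose(fused(x), bn(conv(x)), atol=1e-5))        # True
```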

