[1] Google 2021 model quantization white paper "A White Paper on Neural Network Quantization"

I recently started learning about model quantization and came across an introductory paper, A White Paper on Neural Network Quantization.

While studying it I am taking some notes to deepen my understanding and keep myself from forgetting (a poor memory is a real headache), and adding some of my own interpretation along the way. It would be my pleasure if these notes help you as well.

Summary

Neural networks have made progress in many applications, but this progress usually comes with high computational cost. Reducing the power and latency of neural network inference is key if we want to deploy neural networks on edge devices with stringent power and compute budgets. Model quantization is one of the most effective ways to meet these requirements, but the noise introduced by quantization also causes a drop in accuracy.

In this paper, the authors introduce state-of-the-art quantization algorithms that try to minimize the impact of quantization noise on model performance while keeping weights and activations in low-bit formats. The article first introduces model quantization from the hardware perspective, and then discusses the two mainstream classes of quantization algorithms: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ requires neither retraining nor labeled data, so it is a lightweight, push-button quantization method. In most cases, PTQ achieves 8-bit quantization with accuracy close to floating point. QAT requires fine-tuning and labeled data, but it enables lower bit widths and more competitive results. For both schemes, the authors provide well-tested pipelines based on the existing literature and extensive experiments, achieving state-of-the-art performance for common deep learning models and tasks.

Personal understanding: In the pursuit of accuracy, neural network models keep growing in size, which makes them hard to deploy on edge devices. Techniques are therefore needed to make models lightweight, speed up inference, and reduce power consumption. Model quantization is a very effective such technique. It accelerates inference mainly by converting the weight and activation parameters of a model to cheaper representations (for example, floating point to low-bit integer, or mixed precision). However, this coarse conversion introduces noise and reduces model accuracy. Mainstream quantization algorithms fall into two categories: PTQ and QAT. PTQ converts precision after the model parameters have been trained; it can solve most problems. QAT quantizes during training; the procedure is more involved and requires fine-tuning and labeled data, but its accuracy is better than PTQ.

Introduction

As deep learning is increasingly used as a general-purpose solution for adding intelligence to electronic devices, neural network solutions with small size, low latency and good performance have become a clear trend. Today, neural networks can be found in many electronic products and services, such as smartphones, smart glasses, smart homes, robots, autonomous driving, and more. These devices typically require neural networks to obey strict latency constraints and to reduce inference power consumption for long-running operation.

Model quantization is one of the best ways to reduce computation time and energy consumption. During model training, weight and activation tensors are usually stored in 16- or 32-bit floating-point formats, while quantization stores them in low-precision formats. When weights and activations are converted from 32 bits to 8 bits, the memory needed to store the tensors shrinks by a factor of 4, and the computational cost of matrix multiplication drops by a factor of 16. It has been shown that for many networks, quantization to fairly low bit widths affects accuracy only within an acceptable range. In addition, quantization is often combined with other model optimization methods, such as neural architecture search, model compression, and pruning. Model quantization is therefore a core step in the practical deployment of deep learning. But it also has drawbacks: low-bit quantization introduces noise into the model, leading to a loss of accuracy. While some networks are robust to this noise, others require additional work to get the full benefit of quantization.

In this paper, the authors introduce state-of-the-art model quantization algorithms. They first introduce quantization and discuss the hardware background and practical deployment considerations. Then they discuss the two mainstream quantization approaches, PTQ and QAT. PTQ works on an already-trained network, using little or no data for quantization; it requires few hyperparameters to tune and no end-to-end training. This keeps the engineering and compute cost of PTQ low, and it can be regarded as a push-button method (my understanding: quantization does not need to be considered during training at all; once the model is trained, you simply apply PTQ if you need it and skip it if you don't, like pressing a button only when you want to use it). QAT relies on retraining the neural network with simulated quantization in the training pipeline. While this requires more effort in training and potentially hyperparameter tuning, it usually closes the gap to full-precision accuracy further than PTQ at low bit widths. For both schemes, the authors introduce standard pipelines based on existing literature and extensive experiments, leading to state-of-the-art performance for common computer vision and natural language processing models. They also propose a debugging workflow to identify and resolve common problems when quantizing new models.

Theoretical Basis of Model Quantization

In this section, the authors introduce the fundamentals of neural network quantization and the fixed-point accelerators for running quantized networks. This section starts with a hardware background and then introduces standard quantization schemes and their properties. Practical considerations related to layers common in modern neural networks and their impact on fixed-point accelerators are discussed later.

Hardware background

Before diving into the technical details, let us first look at the hardware background of quantization and how it enables efficient on-device inference. Figure 1 is a schematic of how the matrix-vector multiplication $y = Wx + b$ is computed in a neural network accelerator (hardware block). This is the basic building block of the larger matrix-matrix multiplications and convolutions found in neural networks. Such hardware blocks aim to increase inference speed through parallel computation. The two basic components of the accelerator in Figure 1 are the processing elements $C_{n,m}$ and the accumulators $A_n$.

Each processing element $C_{n,m}$ performs a multiplication, $C_{n,m} = W_{n,m} \cdot x_m$, and each accumulator computes

$$A_n = b_n + \sum_{i=1}^{4} C_{n,i},$$

where $b_n$ is the bias. By repeating such calculation steps, the full matrix multiplication is completed. Once all input elements have been processed, the values in the accumulators $A_n$ are moved back to memory to serve as input to the next layer of the network.

Neural networks are typically trained with 32-bit floating-point weights and activations. If we were to perform inference in 32-bit floating point, the processing elements $C_{n,m}$ and accumulators $A_n$ would have to support floating-point logic, and we would need to transfer 32-bit data from memory to the processing elements. The compute flow and data transfers shown in Figure 1 consume most of the energy spent during neural network inference. Significant savings can therefore be achieved by using lower-bit fixed-point or quantized representations. A low-bit fixed-point representation such as INT8 not only reduces the amount of data transferred, but also reduces the size and energy cost of the operations in Figure 1, because the cost of digital arithmetic typically scales quadratically with the number of bits, and fixed-point addition is more efficient than its floating-point counterpart.
Figure 1. Schematic of the matrix-vector multiplication logic in a neural network accelerator.

In order to move from floating-point to the more efficient fixed-point operations, we need a scheme for converting floating-point vectors to integer vectors: a floating-point vector $\mathbf{x}$ can be approximately represented as a scale factor multiplied by a vector of integers,

$$\hat{\mathbf{x}} = s_x \cdot \mathbf{x}_{int} \approx \mathbf{x},$$

where $s_x$ is a floating-point scale factor and $\mathbf{x}_{int}$ is an integer vector (e.g., INT8).
Based on this conversion scheme, the computation in the accumulator $A_n$ can be approximated as:

$$\hat{A}_n = \hat{b}_n + \sum_{m} \hat{W}_{n,m}\hat{x}_m = \hat{b}_n + \sum_{m} (s_w W^{int}_{n,m})(s_x x^{int}_m) = \hat{b}_n + s_w s_x \sum_{m} W^{int}_{n,m} x^{int}_m$$
Note that separate scale factors $s_w$ and $s_x$ are used to approximate the weights $W$ and the activations $x$. This is more flexible and reduces the quantization error. The reason we can pull $s_w$ and $s_x$ out of the summation is that the same scale factor is shared by all elements of each tensor. Bias quantization is intentionally ignored for now, since biases are usually stored at a higher bit width (32 bits) and their scale factor depends on those of the weights and activations.

Figure 2 shows what happens in the neural network accelerator when quantization is introduced, using INT8 as the example. For the accumulators, it is important to keep a higher bit width, typically 32 bits; otherwise we risk overflow as more and more products are accumulated during the computation.

The activations stored in the 32-bit accumulators need to be written to memory before they can be used as input to the next layer. To reduce data transfer and the amount of computation in the next layer, these activations are requantized back to INT8 (this requires a requantization step, see Figure 2).
Figure 2. Schematic of the matrix-multiply logic in a neural network accelerator for quantized inference.
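To make the flow in Figures 1 and 2 concrete, here is a minimal NumPy sketch of a quantized matrix-vector product with a 32-bit accumulator followed by requantization to INT8. It assumes symmetric quantization (zero points of 0) purely for simplicity; the function name and scale values are illustrative, not from the paper.

```python
import numpy as np

def quantized_matvec(W_int8, x_int8, bias_int32, s_w, s_x, s_out):
    # Multiply-accumulate in a wide (int32) accumulator to avoid overflow.
    acc = W_int8.astype(np.int32) @ x_int8.astype(np.int32) + bias_int32
    # Requantization step: rescale the int32 accumulator onto the INT8 output grid.
    out = np.round(acc * (s_w * s_x / s_out))
    return np.clip(out, -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
W_int8 = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
x_int8 = rng.integers(-128, 128, size=8, dtype=np.int8)
bias_int32 = np.zeros(4, dtype=np.int32)
y_int8 = quantized_matvec(W_int8, x_int8, bias_int32, s_w=0.02, s_x=0.1, s_out=0.3)
print(y_int8)
```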

Uniform affine quantization

In this section, the authors define the quantization scheme they will use in the paper. This scheme is called uniform quantization, and it is the most commonly used quantization scheme because it allows efficient implementation of fixed-point arithmetic.

Uniform affine quantization, also known as asymmetric quantization, is defined by three quantization parameters: the scale factor s, the zero point z, and the bit width b.

  • The scale factor and the zero point are used to map floating-point values onto an integer grid whose size depends on the bit width.
    • The scale factor s is usually expressed as a floating-point number and specifies the step size of the quantizer;
    • The zero point z is an integer that ensures that real zero is quantized without error. This is important to guarantee that common operations such as zero padding or ReLU do not introduce quantization error.

If the description of these three parameters still seems a bit abstract, it will become clearer as we work through the formulas below.

Once the three quantization parameters are defined, the quantization operation can be performed.

First, an original weight or activation vector x is mapped onto the unsigned integer grid $\{0, \ldots, 2^b-1\}$, where b is the bit width:

$$\mathbf{x}_{int} = \mathrm{clamp}\left(\left\lfloor \frac{\mathbf{x}}{s} \right\rceil + z;\ 0,\ 2^b - 1\right)$$
Here $\lfloor \cdot \rceil$ denotes the round-to-nearest operation, and the clamp(x; a, c) function is defined as:

$$\mathrm{clamp}(x;\ a, c) = \begin{cases} a, & x < a \\ x, & a \le x \le c \\ c, & x > c \end{cases}$$
Conversely, to approximately recover the real-valued data x from the quantized data, we need a dequantization step:

$$\mathbf{x} \approx \hat{\mathbf{x}} = s\,(\mathbf{x}_{int} - z)$$
Combining the quantization and dequantization steps gives the general quantization function $q(\cdot)$:

$$\hat{\mathbf{x}} = q(\mathbf{x};\ s, z, b) = s\left[\mathrm{clamp}\left(\left\lfloor \frac{\mathbf{x}}{s} \right\rceil + z;\ 0,\ 2^b - 1\right) - z\right]$$
Through the dequantization step we can also define the quantization range $(q_{min}, q_{max})$, where $q_{min} = -sz$ and $q_{max} = s(2^b - 1 - z)$. Any value of x outside this range is clipped to its limit, producing a clipping error. If we want to reduce the clipping error, we can expand the quantization range by increasing the scale factor s. However, increasing the scale factor also increases the rounding error, since the rounding error lies in the range $[-\frac{1}{2}s, \frac{1}{2}s]$. Later sections discuss how to choose the quantization parameters so as to trade off clipping error against rounding error.
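As a quick illustration of the formulas above, here is a small NumPy sketch of the asymmetric quantize, dequantize and combined quantize-dequantize steps. The function names are mine, not the paper's, and np.round is used as a stand-in for the round-to-nearest operator.

```python
import numpy as np

def quantize(x, s, z, b=8):
    """Map real values x onto the unsigned integer grid {0, ..., 2^b - 1}."""
    x_int = np.round(x / s) + z                  # round-to-nearest, then shift by the zero point
    return np.clip(x_int, 0, 2**b - 1)           # the clamp introduces clipping error

def dequantize(x_int, s, z):
    """Map integers back to approximate real values."""
    return s * (x_int - z)

def quant_dequant(x, s, z, b=8):
    """The general quantization function q(x; s, z, b)."""
    return dequantize(quantize(x, s, z, b), s, z)

x = np.array([-0.70, -0.20, 0.0, 0.41, 1.30])
print(quant_dequant(x, s=0.01, z=64, b=8))       # values outside [q_min, q_max] are clipped
```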

Symmetric Uniform Quantization

Symmetric quantization is a simplified version of the asymmetric case. A symmetric quantizer restricts the zero point to 0. This removes the computational overhead of handling the zero-point offset during the accumulation in asymmetric quantization. But the lack of an offset restricts the mapping between the integer and floating-point domains. Therefore, the choice between a signed and an unsigned integer grid matters:
$$\hat{x} = s\,x_{int}$$
$$x_{int} = \mathrm{clamp}\left(\left\lfloor \frac{x}{s} \right\rceil;\ 0,\ 2^b - 1\right) \quad \text{for unsigned integers}$$
$$x_{int} = \mathrm{clamp}\left(\left\lfloor \frac{x}{s} \right\rceil;\ -2^{b-1},\ 2^{b-1} - 1\right) \quad \text{for signed integers}$$
Unsigned symmetric quantization is well suited to one-sided distributions, such as the activations of ReLU. Signed symmetric quantization is suitable for data that is roughly symmetric about zero.

Three forms of uniform quantization (asymmetric uniform quantization, signed symmetric quantization, and unsigned symmetric quantization) are shown in Figure 3.
Figure 3. Visual explanation of the different uniform quantization grids for a bit width of 8, where s is the scale factor and z is the zero point. The floating-point grid is shown in black, the integer quantization grid in blue.
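The sketch below shows one common way the parameters of the grids in Figure 3 might be derived from an observed value range [x_min, x_max] (simple min/max calibration). It is a hedged illustration of the asymmetric vs. signed-symmetric trade-off, not a recipe prescribed by the white paper.

```python
import numpy as np

def asymmetric_params(x_min, x_max, b=8):
    s = (x_max - x_min) / (2**b - 1)                      # step size spanning the whole range
    z = int(round(-x_min / s))                            # zero point so that real 0.0 maps exactly
    return s, z

def symmetric_signed_params(x_min, x_max, b=8):
    s = max(abs(x_min), abs(x_max)) / (2**(b - 1) - 1)    # grid forced to be symmetric around 0
    return s, 0                                           # zero point fixed at 0

print(asymmetric_params(0.0, 6.0))         # one-sided (ReLU-like) range
print(symmetric_signed_params(-1.2, 0.9))  # roughly zero-centred range
```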

Power of 2 quantization

Power-of-two quantization is a special case of symmetric quantization in which the scale factor is restricted to a power of two, i.e. $s = 2^{-k}$. This choice can be very efficient in hardware, because scaling by s corresponds to a simple bit shift. However, the limited expressiveness of such scale factors complicates the trade-off between rounding and clipping errors.
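A tiny sketch of this idea, under my own simplification of rounding an arbitrary scale to the nearest power of two: multiplying an integer by $2^{-k}$ then corresponds to an arithmetic right shift (which truncates towards negative infinity).

```python
import numpy as np

def nearest_power_of_two_scale(s):
    k = int(round(-np.log2(s)))    # pick k so that s is approximated by 2^-k
    return 2.0 ** (-k), k

s_pot, k = nearest_power_of_two_scale(0.013)   # 0.013 -> 2^-6 = 0.015625
x_int = 100
print(x_int * s_pot)     # exact scaling: 1.5625
print(x_int >> k)        # shift-based integer scaling: 1 (truncated)
```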

Quantization granularity

So far we have defined a single set of quantization parameters (a quantizer) per tensor, one for the weights and one for the activations. This is called per-tensor quantization. We can also define a separate quantizer for each dimension of a tensor (e.g., the output channels of a weight tensor), thereby increasing the quantization granularity. In neural network quantization, per-tensor quantization is the most common choice of granularity because of its simpler hardware implementation: all accumulators in Figure 2 use the same scale factors $s_w$ and $s_x$. However, finer granularity can further improve performance. For example, for weight tensors we can specify a different quantizer for each output channel. This is called per-channel quantization.

Other works go beyond per-output-channel quantization and apply a separate quantizer to each group of weights or activations. Increasing the granularity in this way usually improves accuracy, but at the cost of some extra overhead. The overhead comes from handling accumulators that sum values with different scale factors. Most existing fixed-point accelerators currently do not support such logic, so these methods are not considered in this work. However, as research in this area grows, more hardware support for them can be expected in the future.
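The following NumPy sketch compares per-tensor and per-channel symmetric weight quantization on a toy weight matrix whose rows (output channels) have very different ranges; the quantization error it reports is just a rough proxy, and nothing here comes from the paper's code.

```python
import numpy as np

def quant_dequant_symmetric(W, s, b=8):
    q = np.clip(np.round(W / s), -2**(b - 1), 2**(b - 1) - 1)
    return s * q

rng = np.random.default_rng(0)
# Four output channels with very different magnitudes.
W = rng.normal(size=(4, 16)) * np.array([[0.01], [0.1], [1.0], [5.0]])

# Per-tensor: a single scale for the whole weight matrix.
s_tensor = np.abs(W).max() / 127.0
err_tensor = np.abs(W - quant_dequant_symmetric(W, s_tensor)).mean()

# Per-channel: one scale per output channel (row of W).
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127.0
err_channel = np.abs(W - quant_dequant_symmetric(W, s_channel)).mean()

print(err_tensor, err_channel)   # per-channel error is much smaller for this kind of weight matrix
```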

Quantization simulation

To test how well a neural network would perform on a quantized device, we often simulate the quantization behaviour on the same general-purpose hardware used to train the network. This is called quantization simulation: we aim to approximate fixed-point arithmetic using floating-point hardware. Such simulation is much easier to set up than running experiments on real quantized hardware or with quantized kernels. It lets users efficiently test different quantization options and also enables GPU acceleration for QAT. In this section, the authors first explain the rationale behind the quantization simulation step, and then discuss techniques that help reduce the gap between the simulation and quantized inference on real devices.

Earlier, we saw how matrix-vector multiplication is performed on fixed-point hardware. In Figure 4a, the authors generalize this operation to a convolutional layer, and also include an activation function to make it more realistic. During on-device inference, all inputs to the hardware (biases, weights, and input activations) are in fixed-point format. However, when we simulate quantization with common deep learning frameworks on general-purpose hardware, these quantities are floating-point numbers. This is why we insert quantizer blocks into the computation graph to induce quantization effects.

Figure 4b shows how the same convolutional layer can be modeled in a deep learning framework. Quantizer blocks are inserted after the weights (before the convolution) to simulate weight quantization, and after the activation function to simulate activation quantization. The bias is usually not quantized, since it is stored at higher precision. Later subsections discuss in more detail when it is appropriate to place the quantizer after the nonlinearity. The quantizer blocks implement the quantization function defined above, and each is defined by its own set of quantization parameters (scale factor, zero point, bit width). Both the input and the output of a quantizer block are in floating-point format, but the output lies on the quantization grid.
Figure 4. Schematic of the quantized forward pass of a convolutional layer: a) computation graph for quantized inference on an actual device; b) simulation of quantized inference on general-purpose floating-point hardware.
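Below is a minimal sketch of such a simulated quantizer ("fake quant") block and of how it might be wired into a floating-point layer, mirroring Figure 4b. It reuses the asymmetric quantizer from earlier; the layer itself (a 1-D convolution plus ReLU) and all names are illustrative assumptions, not an API from the paper.

```python
import numpy as np

def fake_quant(x, s, z, b=8):
    """Floating-point in, floating-point out, but the output lies on the quantization grid."""
    x_int = np.clip(np.round(x / s) + z, 0, 2**b - 1)
    return s * (x_int - z)

def simulated_conv_block(x, w, bias, s_w, z_w, s_a, z_a):
    w_q = fake_quant(w, s_w, z_w)                 # weight quantizer block before the convolution
    y = np.convolve(x, w_q, mode="same") + bias   # bias is left in higher precision
    y = np.maximum(y, 0.0)                        # ReLU
    return fake_quant(y, s_a, z_a)                # activation quantizer block after the nonlinearity

x = np.linspace(-1.0, 1.0, 16)
w = np.array([0.25, 0.5, 0.25])
print(simulated_conv_block(x, w, bias=0.1, s_w=1/255, z_w=128, s_a=1/255, z_a=0))
```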

Batch normalization folding

Batch normalization is a standard component of modern convolutional networks. Batch normalization normalizes the output of a linear layer before scaling it and adding an offset. For on-device inference, these operations are folded into the previous or next linear layer in a step called batch normalization folding. This removes the batch normalization operation entirely from the network, as its computation is absorbed into an adjacent linear layer. Besides saving the computational overhead of the extra scaling and offset, this avoids extra data movement and an extra quantization of the layer output. More formally, during inference, batch normalization is defined as an affine map of the output x:
             
$$\mathrm{BatchNorm}(x) = \gamma\left(\frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\right) + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance, computed as exponential moving averages of per-batch statistics during training, and $\gamma$ and $\beta$ are affine parameters learned per channel. If batch normalization is applied right after a linear layer, y = BatchNorm(Wx), we can rewrite the terms so that the batch normalization operation is merged with the linear layer itself. Suppose a weight matrix $W \in \mathbb{R}^{n \times m}$; batch normalization is applied to each output channel $y_k$ of the output $y$ (k = 1, 2, ..., n):
$$y_k = \mathrm{BatchNorm}(W_{k,:}\,x) = \gamma_k\,\frac{W_{k,:}\,x - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k = \widetilde{W}_{k,:}\,x + \widetilde{b}_k$$

where

$$\widetilde{W}_{k,:} = \frac{\gamma_k\,W_{k,:}}{\sqrt{\sigma_k^2 + \epsilon}}, \qquad \widetilde{b}_k = \beta_k - \frac{\gamma_k\,\mu_k}{\sqrt{\sigma_k^2 + \epsilon}}$$
Through the above transformation, the batch normalization layer and the linear layer are fused together.
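A small NumPy sketch of this folding, following the formulas above (with an optional bias b in the linear layer, which is simply treated as an extra shift of the mean; the variable names are the usual BN statistics, not a specific framework API):

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mu, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)        # one factor per output channel k
    W_folded = W * scale[:, None]             # W~_k = gamma_k / sqrt(var_k + eps) * W_k
    b_folded = beta + (b - mu) * scale        # b~_k = beta_k + (b_k - mu_k) * scale_k
    return W_folded, b_folded

# Check that BatchNorm(Wx + b) == W_folded x + b_folded on random data.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 8)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mu, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=8)

reference = gamma * (W @ x + b - mu) / np.sqrt(var + 1e-5) + beta
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mu, var)
print(np.allclose(reference, W_f @ x + b_f))  # True: BN is absorbed into the linear layer
```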

Activation function fusing

In the simple quantization accelerator described in Figure 2, we saw that the requantization of activations happens after the matrix multiplication or convolution output has been computed. However, in practice we often have a nonlinearity directly after the linear operation (as in Figure 4). Writing the linear layer's activations to memory and then reloading them into a compute core just to apply the nonlinearity would waste resources. For this reason, many hardware solutions include a unit that applies the nonlinearity before the requantization step. In that case, we only need to simulate the requantization that happens after the nonlinearity. For example, a ReLU nonlinearity is easily modeled by the requantization block itself, since the minimum representable value of the activation quantization grid can simply be set to 0.

Other, more complex activation functions, such as sigmoid or Swish, require more dedicated support. If such support is not available, we need to add a quantization step before and after the nonlinearity in the graph, which can have a large impact on the accuracy of the quantized model. Although newer activations like Swish can improve floating-point accuracy, these gains may disappear after quantization, or the activations may be inefficient to deploy on fixed-point hardware.
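To illustrate the ReLU case from the previous paragraphs, here is a hedged sketch of a requantization step with the ReLU folded in: clamping the output grid at the zero point cuts off all negative real values, so no separate ReLU is needed. The interface is an assumption of mine, not a specific hardware API.

```python
import numpy as np

def requantize(acc_int32, s_acc, s_out, z_out, b=8, fuse_relu=False):
    """Rescale a 32-bit accumulator onto an unsigned b-bit output grid."""
    x_int = np.round(acc_int32 * (s_acc / s_out)) + z_out
    lo = z_out if fuse_relu else 0        # clamping at the zero point implements ReLU exactly
    return np.clip(x_int, lo, 2**b - 1).astype(np.uint8)

acc = np.array([-500, -10, 0, 40, 4000], dtype=np.int32)
print(requantize(acc, s_acc=0.002, s_out=0.05, z_out=10, fuse_relu=True))
```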

Other network layers and their quantization

There are many other layers commonly used in neural networks. How they are modeled depends largely on the specific hardware implementation. Sometimes a mismatch between simulated quantization and on-target performance comes down to layers being quantized incorrectly. Here we provide some guidance on how to simulate quantization for several commonly used layers, so that these layers are quantized correctly and the simulation matches the target performance.

  • Max Pooling: There is no need to quantize the activation value, because the input and output are on the same quantization grid;
  • Average Pooling: The average of several integers is not necessarily an integer. For this reason, a quantization step is required after Average Pooling. However, since the quantization range does not change significantly, the same quantization module can be used for both input and output;
  • Element-wise addition: Although this operation is simple, it is hard to simulate correctly. During the addition, the quantization ranges of the two inputs have to match exactly. If they do not match, extra care is needed to make the addition work as intended (my understanding: the sum of two numbers may fall outside either input's quantization range). There is no single accepted solution for this, but adding a requantization step roughly simulates the added noise (see the sketch after this list). Another approach is to tie the quantization grids of the inputs and optimize the network with that shared grid; this avoids the requantization step, but may require fine-tuning.
  • Concatenation: The two branches being concatenated usually do not share the same quantization parameters. This means their quantization grids may not overlap, making a requantization step necessary. As with element-wise addition, it is possible to optimize the network so that the concatenated branches share quantization parameters.
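The sketch below illustrates the element-wise-addition issue just described: the two inputs live on different quantization grids, so each is requantized onto a shared output grid before the integer addition. It is a rough illustration of the requantization approach, with made-up parameters, not the paper's reference solution.

```python
import numpy as np

def requant_to(x_int, s_in, z_in, s_out, z_out, b=8):
    real = s_in * (x_int.astype(np.int32) - z_in)                       # dequantize this branch
    return np.clip(np.round(real / s_out) + z_out, 0, 2**b - 1).astype(np.int32)

def quantized_add(a_int, a_params, b_int, b_params, out_params):
    s_a, z_a = a_params
    s_b, z_b = b_params
    s_o, z_o = out_params
    a_req = requant_to(a_int, s_a, z_a, s_o, z_o)                       # both branches moved onto
    b_req = requant_to(b_int, s_b, z_b, s_o, z_o)                       # the shared output grid
    return np.clip(a_req + b_req - z_o, 0, 255).astype(np.uint8)        # integer add on that grid

a = np.array([10, 100, 200], dtype=np.uint8)
b = np.array([30, 60, 250], dtype=np.uint8)
print(quantized_add(a, (0.02, 0), b, (0.05, 10), (0.06, 0)))
```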

Practical considerations

When quantizing multilayer neural networks, we are faced with a large space of quantization choices, including quantization schemes, granularities, and bit widths. In this section, the authors explore some practical considerations that help reduce the search space.

Note that in this white paper, the authors only consider homogeneous bit widths. This means that the bit width chosen for weights or activations stays the same across all layers, because homogeneous bit widths are more widely supported by hardware; some recent work also explores heterogeneous bit-width or mixed-precision implementations.

Symmetric vs. asymmetric quantization

For each weight and activation quantizer we must choose a quantization scheme. On the one hand, asymmetric quantization is more expressive because of the extra offset parameter; on the other hand, it can incur computational overhead. To see why, consider what happens when asymmetric weights $\hat{W} = s_w(W_{int} - z_w)$ are multiplied by asymmetric activations $\hat{x} = s_x(x_{int} - z_x)$, i.e., the quantized weights of the current layer and the quantized activations of the previous layer. As shown below, they are multiplied to produce the output of the current layer:
$$\hat{W}\hat{x} = s_w(W_{int} - z_w)\,s_x(x_{int} - z_x) = s_w s_x W_{int} x_{int} - s_w z_w s_x x_{int} - s_w s_x z_x W_{int} + s_w z_w s_x z_x$$
If both operations were in symmetric form, only the first term would remain (since there are no zero-point offsets). The third and fourth terms depend only on the scale factors, zero points and weight values, which are all known in advance; they can therefore be precomputed and added to the layer's bias term at virtually no cost. The second term, however, depends on the input data x, which means that for every batch of data an extra term has to be computed during inference. This can cause significant latency and power overhead, since it amounts to adding an extra channel. For this reason, a common approach is to use asymmetric quantization for activations and symmetric quantization for weights, which avoids the additional data-dependent term.
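A quick NumPy check of this expansion (with made-up integer tensors and scales) makes the role of each term explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
W_int = rng.integers(0, 256, size=(4, 8))
x_int = rng.integers(0, 256, size=8)
s_w, z_w, s_x, z_x = 0.02, 128, 0.1, 7

direct = (s_w * (W_int - z_w)) @ (s_x * (x_int - z_x))      # dequantize first, then multiply

term1 = s_w * s_x * (W_int @ x_int)                         # symmetric part: the integer matmul
term2 = -s_w * z_w * s_x * x_int.sum()                      # depends on the input x: computed per batch
term3 = -s_w * s_x * z_x * W_int.sum(axis=1)                # known in advance: can be folded into the bias
term4 = s_w * z_w * s_x * z_x * W_int.shape[1]              # constant: also folded into the bias

print(np.allclose(direct, term1 + term2 + term3 + term4))   # True
```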

Per-tensor and per-channel quantization

In the section on quantization granularity, the authors discussed different levels of granularity. Per-tensor quantization of weights and activations has been the standard for some time because it is supported by all fixed-point accelerators. However, per-channel quantization of weights can improve accuracy, especially when the weight distribution varies significantly from channel to channel. Per-channel weight quantization can be implemented in the accelerator without rescaling, by applying a separate per-channel weight scale factor. Per-channel quantization of activations is much harder to implement, because the scale factor can then no longer be factored out of the summation, which would require rescaling the accumulator for every input channel. Although per-channel weight quantization is becoming more common, not all commercial hardware supports it, so it is important to check whether it is feasible on your intended target device.


Original post: blog.csdn.net/Just_do_myself/article/details/124350921