Basics of Deep Learning Model Quantization

Reference articles

Introduction to Neural Network Quantization – Basic Principles (Zhihu)

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

A Survey of Quantization Methods for Efficient Neural Network Inference

Summary of Model Quantization and Practical Deployment

Int8 Quantization - Introduction (1) bzdww

Recent progress in transformer model quantization:

Large Transformer Model Inference Optimization | Lil'Log

LLM (large language model) quantization methods (1)

Basic concepts of quantization

Post-Training Quantization (PTQ)

The quantization process determines the quantization parameters for weights and activations by running inference offline on a small set of sample data, without any training or fine-tuning.

PTQ can be divided into uniform and non-uniform quantization (the difference is whether the mapping between the fp32 values and the int values is linear; uniform quantization is more commonly used):

Uniform quantization enables the use of integer or fixed-point math pipelines, allowing computation to be performed in the quantized domain.

Non-uniform quantization requires dequantization, e.g. a codebook lookup, before doing computation in higher precision, limiting its benefits to model compression and bandwidth reduction. 

Uniform quantization is further divided into affine quantization and scale quantization, also called asymmetric and symmetric quantization respectively. In scale quantization the mapping between fp32 and int is a pure scaling with no offset, so the computation is simpler.

PTQ can also be divided into Dynamic quantization and Static quantization:

Dynamic quantization: the scale and zero point are computed only at inference time. Compared with static quantization, the runtime cost is higher but the accuracy may also be higher; it is often preferred for NLP models.

Static quantization: the quantization parameters are computed in advance using a calibration dataset, so the runtime cost is significantly lower than dynamic quantization.

Quantization Aware Training (QAT)

During quantization the network is trained (fine-tuned) so that its parameters adapt to the information loss introduced by quantization. This approach is more flexible, so accuracy is generally higher than post-training quantization; the drawback is that the workflow is less convenient. In some scenarios the result is not necessarily much better than partial or mixed-precision quantization.

Quantization formulas

(Figure from Introduction to Neural Network Quantization – Basic Principles, on Zhihu)
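For reference (since the figure is not reproduced here), the standard affine quantization mapping for a real value r, scale s, and zero point z is:

q = clamp(round(r / s) + z, q_min, q_max)
r ≈ s · (q − z)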

The formula above is the affine quantization form; scale quantization is the special case where the zero point z is 0.

The zero point z must guarantee that the real value 0 maps exactly onto a quantized integer, which is why z itself must be an integer.

The formula above is not universally applicable, so be careful when using it: if a tensor's values are all identical, or its min and max differ very little, the formula breaks down, because the scale becomes 0 (for example, when the whole tensor holds a single repeated value or contains only one element). Other schemes are needed in that case. For instance, choose s = r/N with N a positive integer; this is theoretically free of quantization error, and the problem becomes choosing an appropriate N.

For example, scale = r / int(max(sqrt(abs(r)), 1)), i.e. N = int(max(sqrt(abs(r)), 1)), which keeps both the scale and z at moderate magnitudes. The case r = 0 still needs special handling so that the scale is not 0.
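A minimal sketch of this special-case handling (the function name is illustrative; the scale is kept positive by using |r|, and N is additionally capped so the quantized value stays within int8 range):

import numpy as np

def degenerate_scale(r):
    """Scale for a tensor whose elements all equal r (min == max).
    Uses s = |r| / N with N = int(max(sqrt(|r|), 1)), so q = r / s = +/-N is an
    exact integer and dequantization reproduces r with no error."""
    if r == 0:
        return 1.0  # any positive scale works: 0 maps exactly to the zero point
    n = int(max(np.sqrt(abs(r)), 1))
    n = min(n, 127)  # keep q representable in int8 (assumption for 8-bit targets)
    return abs(r) / n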

Matrix multiplication and convolution quantization

Matrix multiplication

Here B is the weight, which uses scale (symmetric) quantization. If B also used affine quantization, the expansion would become considerably more complicated, so that combination is generally not used.
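Concretely, with the activation A affine-quantized (scale s_A, zero point z_A) and the weight B scale-quantized (scale s_B, zero point 0), the fp32 product expands as:

Y[i,j] = Σ_k A[i,k] · B[k,j]
       ≈ Σ_k s_A (q_A[i,k] − z_A) · s_B q_B[k,j]
       = s_A s_B ( Σ_k q_A[i,k] q_B[k,j] − z_A Σ_k q_B[k,j] )

The first sum is a pure integer matrix multiply, and the second term depends only on the constant weights, so it can be precomputed and folded into the bias. If B were affine-quantized too, extra cross terms involving z_B and the activation would appear, which is why that combination is avoided.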

Convolution can be derived similarly

Quantization techniques

Calibration

How to select the clipping range [a, b] used for quantization is one of the most important problems in model quantization and has a large impact on accuracy. Common methods are max, entropy (the KL-divergence method used in TensorRT), and percentile (e.g. taking the 99.9% or 99.99% largest-magnitude value as the threshold).

The numerical range and precision after quantization are determined by what the int type can represent, while the pre-quantization weights and activations often follow roughly Gaussian distributions. If max calibration is used, every floating-point value falls inside the representable range, but because the number of int levels is limited, the resolution is very coarse. If, on the other hand, [a, b] is chosen too small, a large amount of useful data is clipped, which also hurts accuracy. There is therefore an optimal clipping range for a given distribution and int type. Percentile calibration simply sorts all elements and takes the 99.9% or 99.99% value as the threshold. The KL-divergence method used by TensorRT shows good accuracy and generality for most CV models. Weights and activations can also use different range-selection methods: weights can usually use max or a high percentile, while activations generally cannot use max.

Reference: https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

TensorRT's slides give the KL-divergence-based calibration procedure and pseudocode. The basic idea:

It’s always a tradeoff between range and precision of the INT8 representation.

We want to minimize loss of information.
Loss of information is measured by Kullback-Leibler divergence (AKA relative entropy or information divergence).
P, Q - two discrete probability distributions.
KL_divergence(P,Q):= SUM(P[i] * log(P[i] / Q[i] ), i)
P is the histogram before quantization, and Q is the histogram after quantization.

A simple method: build P as a 2048-bin histogram of the original floating-point values computed on the calibration data; then, for each candidate threshold, quantize and dequantize the data and build the histogram Q of the result; finally compute the KL divergence between the two distributions and keep the threshold that minimizes it. This is slightly different from the procedure in the TRT slides.
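A rough Python sketch of this simplified search (not TensorRT's exact algorithm; the bin count, number of quantization levels, and scan stride are arbitrary choices here):

import numpy as np


def kl_divergence(p_hist, q_hist, eps=1e-10):
    """KL(P||Q) between two histograms, normalized to probability distributions."""
    p = p_hist.astype(np.float64) / max(p_hist.sum(), 1)
    q = q_hist.astype(np.float64) / max(q_hist.sum(), 1)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def kl_calibrate(activations, num_bins=2048, num_levels=127, stride=16):
    """Scan candidate clipping thresholds; keep the one whose quantize->dequantize
    histogram (Q) is closest in KL divergence to the original histogram (P)."""
    abs_vals = np.abs(np.asarray(activations, dtype=np.float64).reshape(-1))
    max_val = float(abs_vals.max())
    p_hist, edges = np.histogram(abs_vals, bins=num_bins, range=(0.0, max_val))
    best_kl, best_thr = float("inf"), max_val
    for i in range(num_levels, num_bins + 1, stride):
        thr = edges[i]
        if thr == 0:
            continue
        scale = thr / num_levels
        # Quantize with clipping at the candidate threshold, then dequantize.
        deq = np.clip(np.rint(abs_vals / scale), 0, num_levels) * scale
        q_hist, _ = np.histogram(deq, bins=num_bins, range=(0.0, max_val))
        kl = kl_divergence(p_hist, q_hist)
        if kl < best_kl:
            best_kl, best_thr = kl, thr
    return best_thr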

Per-tensor quantization: the entire tensor shares one set of quantization parameters.

Per-channel (Conv) or per-column (MatMul) quantization: elements within a channel share parameters, while different channels use different parameters. This usually gives higher accuracy than per-tensor quantization and interacts better with BN fusion. It avoids the severe accuracy drop seen when per-tensor quantization is applied to depthwise conv2d in MobileNet: the weight value ranges differ greatly across channels, so one shared set of parameters leaves channels with small value ranges severely short of precision.

The first and last layers of a model are usually left unquantized or quantized at higher precision. (Could this be decided automatically, e.g. from how the layer's results vary across different input data?)

Weight quantization

Weights generally use scale quantization, which is usually accurate enough; affine quantization of weights would significantly increase the computational complexity. Activations can use either affine or scale quantization; affine is slightly more expensive. (These conclusions follow from the quantized MatMul and Conv2D formulas.)

Per-channel is generally more accurate than per-tensor.

As BN parameters are learned per channel, their folding can result in significantly different weight value distributions across channels. Fortunately, as Table 3 shows, per-channel quantization granularity maintains model accuracy even with BN folding. With per-channel granularity and max calibration, the accuracy loss is very small (mostly under 0.5 percentage points), and the model size becomes 1/4 of the original.

Bias vectors are quantized as 32-bit integers.
 

Quantization simulation: Python code reference

Running this code shows that, for matrix-multiplication weights, sharing one set of quantization coefficients per column gives lower error than per row or per tensor.

import copy

import numpy as np

# Example 2-D weight tensor saved beforehand (e.g. from a MatMul/Linear layer).
weight = np.load("tensor1.npy")


def round_near(data):
    """Round a scalar to the nearest integer, halves away from zero
    (e.g. 0.1 -> 0, 0.5 -> 1, -0.5 -> -1).
    Kept for reference; the code below uses np.rint (round half to even).
    """
    if data >= 0:
        data += 0.5
    else:
        data -= 0.5
    return int(data)


def get_u8_quant_coef(np_tensor):
    """Compute affine (asymmetric) uint8 quantization parameters for a tensor.
    Assumes max_val > min_val; see the degenerate-case discussion above."""
    max_val = np.max(np_tensor)
    min_val = np.min(np_tensor)

    dst_max = 255
    dst_min = 0

    scale = (max_val - min_val) / (dst_max - dst_min)
    # Map max_val to dst_max, so the zero point is dst_max - max_val / scale.
    zero_point = dst_max - max_val / scale
    zero_point_u8 = np.clip(np.rint(zero_point), dst_min, dst_max)
    return scale, zero_point_u8


def quant_u8(np_tensor, scale, zero_point):
    """Quantize fp32 -> uint8 domain: q = clip(round(r / scale + zero_point), 0, 255)."""
    quanted_tensor = np.rint(np_tensor / scale + zero_point)
    return np.clip(quanted_tensor, 0, 255)


def dequant(np_tensor, scale, zero_point):
    """Dequantize back to fp32: r = (q - zero_point) * scale."""
    dequant_tensor = np_tensor.astype("float32")
    return (dequant_tensor - zero_point) * scale


def get_error(tensor1, tensor2):
    """Sum of absolute element-wise differences."""
    return np.sum(np.abs(tensor1 - tensor2))


def get_dequant(np_tensor):
    """Quantize then dequantize, returning the reconstruction and its parameters."""
    scale, zero_point = get_u8_quant_coef(np_tensor)
    quanted_tensor = quant_u8(np_tensor, scale, zero_point)
    dequant_tensor = dequant(quanted_tensor, scale, zero_point)
    return dequant_tensor, scale, zero_point


# Per-tensor quantization error.
dequant_tensor, scale, zero_point = get_dequant(weight)
error = get_error(weight, dequant_tensor)

weight1 = copy.deepcopy(weight)  # reconstruction with per-column parameters
weight2 = copy.deepcopy(weight)  # reconstruction with per-row parameters

row, col = weight.shape

# Per-column: each column gets its own scale / zero point.
for i in range(col):
    col_data = weight[:, i]
    dequant_tensor_i, scale_i, zero_point_i = get_dequant(col_data)
    weight1[:, i] = dequant_tensor_i

# Per-row: each row gets its own scale / zero point.
for i in range(row):
    row_data = weight[i, :]
    dequant_tensor_i, scale_i, zero_point_i = get_dequant(row_data)
    weight2[i, :] = dequant_tensor_i

error1 = get_error(weight, weight1)  # per-column error
error2 = get_error(weight, weight2)  # per-row error

print("per-tensor error:", error, "per-column error:", error1, "per-row error:", error2)

Activation quantization

Max calibration is generally not usable for activations. For CV models, entropy and percentile (99.9%, 99.99%, 99.999%, 99.9999%) calibration differ little in accuracy, with entropy better in most cases. For some NLP models, entropy calibration loses noticeably more accuracy than 99%-style percentile calibration.

99.9% percentile calibration clips the large magnitude values too aggressive and leads to significant accuracy drops on most networks.

The best post training quantization results are achieved with entropy, 99.99%, or 99.999% percentile calibrations, though no single calibration is best for all networks.

Quantization of other operators
Notable integer-only quantization works include [154], which fuses Batch Normalization into the previous convolution layer, and [113], which proposes an integer-only computation method for residual networks with batch normalization. However, both methods are limited to ReLU activation. The recent work of [132] addresses this limitation by approximating GELU [94], Softmax, and Layer Normalization [6] with integer arithmetic and further extends integer-only quantization to Transformer [243] architectures.

Quantizing certain operators, such as softmax, can have a large impact on accuracy, so they are often left unquantized.

A simple example of quantizing a model subgraph:
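(The original figure is not reproduced here; the typical pattern it illustrates is roughly:)

fp32 input → Quantize → int8 Conv2D/MatMul (accumulate in int32) → Requantize → int8 → next quantized op … → Dequantize → fp32 output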

Quantization coefficient optimization

Dyadic quantization is another class of integer-only quantization, where all the scaling is performed with dyadic numbers, i.e. rational numbers with an integer numerator and a power of 2 in the denominator. (Because the denominator is a power of 2, applying the scale needs no division, only an integer multiply and a bit shift.)
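A small illustration in Python (the helper names are made up; real kernels do this entirely in fixed-point integer arithmetic):

def dyadic_approx(scale, shift=16):
    """Approximate a float scale as multiplier / 2**shift."""
    multiplier = int(round(scale * (1 << shift)))
    return multiplier, shift


def rescale_dyadic(acc_int32, multiplier, shift):
    """Apply the scale with one integer multiply and a rounding right shift."""
    return (acc_int32 * multiplier + (1 << (shift - 1))) >> shift


m, sh = dyadic_approx(0.0123)
print(rescale_dyadic(1000, m, sh))  # ~= round(1000 * 0.0123) = 12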

The impact of different rounding modes

Techniques to Recover Accuracy

Partial Quantization: Do not quantize layers that have a large impact on accuracy.

Mixed precision quantization: Different layers use different precision quantization, such as fp16, int8, etc.

Excluding a small number of layers can reduce the accuracy loss to within 0.05%–0.1%.

Two search strategies can be considered: additive, where at each step the layer (or group of layers) with the least impact on accuracy is quantized next; or subtractive, where everything is quantized first and the layers with the greatest impact on accuracy are then excluded. The latter is usually more efficient, since the excluded layers are only a small fraction of the total (see the sketch below).
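A sketch of the subtractive strategy (quantize, evaluate, and the layer list are placeholders for whatever quantization tooling is in use):

def subtractive_partial_quantization(model, layer_names, quantize, evaluate, target_acc):
    """Quantize everything, then exclude the most accuracy-sensitive layers
    one at a time until the accuracy target is reached."""
    # Per-layer sensitivity: accuracy when only that single layer is quantized.
    sensitivity = {name: evaluate(quantize(model, only=[name])) for name in layer_names}
    # Most sensitive (lowest accuracy) layers first.
    ordered = sorted(layer_names, key=lambda name: sensitivity[name])
    excluded = []
    for name in ordered:
        candidate = quantize(model, exclude=excluded)
        if evaluate(candidate) >= target_acc:
            return candidate, excluded
        excluded.append(name)
    return model, layer_names  # target not reached even with heavy exclusion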

Data-free quantization through weight equalization and bias correction

This works for CV models and transformers in general, but the BERT model is a special case:

For BERT, sensitivity analysis does not reveal any particular layer that contributes more to the accuracy drop. As a result we cannot identify a small subset of layers to leave in floating-point.
To address this we need to consider different approaches. Section 5.2, incorporates quantization with training to recover accuracy. Additionally, Appendix D examines the GELU activation function in BERT and presents a simple augmentation to significantly improve post training quantization accuracy.

Some work studies rounding methods other than round-to-nearest and finds they can give better accuracy.

Mixed Precision Quantization Parameter Search Method

Mixed-precision quantization parameters can be searched for using reinforcement learning, genetic algorithms, NAS, and more.
Different than these exploration and regularization based approaches, HAWQ [51] introduces an automatic way to find the mixed-precision settings based on second order sensitivity of the model. 

In HAWQv2, this method was extended to mixed-precision activation quantization [50], and was shown to be more than 100x faster than RL based mixed-precision methods [246].

Recently, in HAWQv3, an integer-only, hardware-aware quantization was introduced [267] that proposed a fast Integer Linear Programming method to find the optimal bit precision for a given application-specific constraint (e.g., model size or latency).

Recommended Workflow

we recommend the following for int8 quantization:
Weights:
    – Use scale quantization with per-column/per-channel granularity
    – Use a symmetric integer range for quantization ([-127, 127]) and max calibration
Activations:
    – Use scale quantization with per-tensor granularity

Activations might be better served by affine quantization, which the paper does not evaluate in detail: the preceding layer is often followed by ReLU, so the value range is not symmetric.
We recommend the following procedure to quantize a pre-trained neural network.
• PTQ: Quantize all the computationally intensive layers (convolution, linear, matrix multiplication, etc.) and run activation calibration including max, entropy and 99.99%, 99.999% percentile. If none of the calibrations yield the desired accuracy continue to partial quantization or QAT.
• Partial Quantization: Perform sensitivity analysis to identify the most sensitive layers and leave them in floating-point. If the impact on computational performance is not acceptable or an acceptable accuracy cannot be reached, continue to QAT.
• QAT: Start from the best calibrated quantized model. Use QAT to fine-tune for around 10% of the original training schedule with an annealing learning rate schedule starting at 1% of the initial training learning rate. Refer to Appendix A.2 for specific hyperparameter choices.

Three important operators

Quantize: Quantize fp32 to int8.

Dequantize: Dequantize int8 to fp32.

Requantize: for operators such as Conv2D/MatMul, both inputs are int8 but the output is int32 to prevent overflow. Requantize converts this int32 output into int8 so it can serve as the input of the next quantized operator. In other words, it converts an integer value expressed under one set of quantization parameters into an integer value under another set such that the represented floating-point values are equal: s1(q1 - z1) = s2(q2 - z2), and q2 is solved from the other known quantities.
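A numpy sketch of requantization (in a real integer-only kernel the ratio s1/s2 would itself be applied as a fixed-point/dyadic multiplier rather than a float):

import numpy as np

def requantize(q1, s1, z1, s2, z2, qmin=-128, qmax=127):
    """Solve s1*(q1 - z1) = s2*(q2 - z2) for q2, then round and clip to int8."""
    q2 = np.rint((s1 / s2) * (q1.astype(np.int64) - z1)) + z2
    return np.clip(q2, qmin, qmax).astype(np.int8)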

Quantization tools

TensorRT Quantization

FP16 quantization: enable FP16 in the builder config; no additional data is required.

config.set_flag(trt.BuilderFlag.FP16)
 

Int8 quantization can use Polygraphy as a helper tool. It requires extra calibration data, but the data does not need labels, since no accuracy evaluation is involved.

https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy

https://developer.download.nvidia.cn/compute/redist/polygraphy/polygraphy-0.33.2-py2.py3-none-any.whl

TensorRT supports the use of 8-bit integers to represent quantized floating point values. The
quantization scheme is symmetric uniform quantization - quantized values are represented
in signed INT8, and the transformation from quantized to unquantized values is simply a
multiplication. In the reverse direction, quantization uses the reciprocal scale, followed by
rounding and clamping.
To enable the use of any quantized operations, the INT8 flag must be set in the builder
configuration.
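For example, mirroring the FP16 flag above (assuming config is the builder configuration and calibrator is an IInt8EntropyCalibrator2 implementation):

config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = calibrator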
 

TensorRT supports only per-tensor quantization for activation tensors, but supports per-channel weight quantization for convolution, deconvolution, fully connected layers, and MatMul
where the second input is constant and both input matrices are 2-dimensional


In explicitly quantized networks, the scaling operations to transform between the quantized
and unquantized values are represented explicitly by IQuantizeLayer (C++, Python) and
IDequantizeLayer (C++, Python) nodes in the graph - these will henceforth be referred
to as Q/DQ nodes.

Post-Training Quantization using Calibration

In post-training quantization, TensorRT computes a scale value for each tensor in the network.
This process, called calibration, requires you to supply representative input data on which
TensorRT runs the network to collect statistics for each activation tensor.
The amount of input data required is application-dependent, but experiments indicate that
about 500 images are sufficient for calibrating ImageNet classification networks.

TensorRT provides multiple different calibrators which calculate the scale in different ways.

IInt8EntropyCalibrator2
Entropy calibration chooses the tensor’s scale factor to optimize the quantized tensor’s
information-theoretic content, and usually suppresses outliers in the distribution. This
is the current and recommended entropy calibrator and is required for DLA. Calibration
happens before Layer fusion by default. It is recommended for CNN-based networks.
IInt8MinMaxCalibrator
This calibrator uses the entire range of the activation distribution to determine the scale
factor. It seems to work better for NLP tasks. Calibration happens before Layer fusion by
default. This is recommended for networks such as NVIDIA BERT (an optimized version of
Google's official implementation)
ONNX Support
When a model trained in PyTorch or TensorFlow using Quantization Aware Training (QAT) is
exported to ONNX, each fake-quantization operation in the framework’s graph is exported as a
pair of QuantizeLinear and DequantizeLinear ONNX operators.
When TensorRT imports ONNX models, the ONNX QuantizeLinear operator is imported as
an IQuantizeLayer instance, and the ONNX DequantizeLinear operator is imported as an
IDequantizeLayer instance. ONNX using opset 10 introduced support for QuantizeLinear/
DequantizeLinear, and a quantization-axis attribute was added in opset 13 (required for per-channel quantization).
 

Layer-level Control of Precision
The builder-flags provide permissive, coarse-grained control. However, sometimes part
of a network requires higher dynamic range or is sensitive to numerical precision. You can
constrain the input and output types per layer:
C++
layer->setPrecision(DataType::kFP16)
Python
layer.precision = trt.fp16

In practice, setting per-layer precision this way is not easy to use: when the fp16 model's quantization precision was insufficient, setting layer precision like this did not solve the problem.
Sparsity
NVIDIA Ampere Architecture GPUs support Structured Sparsity. To make use of this feature
to achieve higher inference performance, the convolution kernel weights and/or the fully connected weights must meet the following requirements:

Several ways to use TensorRT quantization

Solution 1. Use TensorRT's built-in quantization directly.

Solution 2. TensorRT 8 supports ONNX models carrying QDQ fake-int8 quantization. A model quantized in this format can be handed to TensorRT, which automatically fuses the quantization operators for int8 execution; QAT-quantized models can also be exported in this format. However, manually quantized models built from operators such as QLinearConv cannot be imported into TensorRT.

Generated ONNX graph with QuantizeLinear and DequantizeLinear ops is parsed using ONNX parser available in TensorRT. https://developer.download.nvidia.cn/video/gputechconf/gtc/2020/presentations/s21664-toward-int8-inference-deploying-quantization-aware-trained-networks-using-tensorrt.pdf

Intel® AI Quantization Tools

GitHub - IntelAI/tools

https://github.com/intel/neural-compressor

https://github.com/IntelAI/tools/blob/master/tensorflow_quantization/quantization/quantize_graph.py

ONNX model quantization

https://github.com/onnx/onnx/blob/master/docs/Operators.md

https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization provides model quantization tools

See Quantize ONNX Models - onnxruntime for usage details and https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization for examples.

ONNX quantization representation format

There are 2 ways to represent quantized ONNX models:

  • Operator Oriented. All the quantized operators have their own ONNX definitions, like QLinearConv, MatMulInteger and etc.
  • Tensor Oriented, aka Quantize and DeQuantize (QDQ). This format uses DQ(Q(tensor)) to simulate the quantize and dequantize process, and QuantizeLinear and DeQuantizeLinear operators also carry the quantization parameters. Models generated like below are in QDQ format:
    • Models quantized by quantize_static API below with quant_format=QuantFormat.QDQ.
    • Quantization-Aware training (QAT) models converted from Tensorflow or exported from PyTorch.
    • Quantized models converted from tflite and other framework.

Quantizing an ONNX model

Quantization API

Quantization has 3 main APIs, which corresponds to the 3 quantization methods:

  • quantize_dynamic: dynamic quantization
  • quantize_static: static quantization
  • quantize_qat: quantize-aware training quantization

Please refer to quantize.py for quantization options for each method.

Please refer to E2E_example_model for an example of static quantization.

In general, it is recommended to use dynamic quantization for RNN and transformer-based models, and static quantization for CNN models.
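For example, a minimal dynamic-quantization call (the file paths are placeholders, and exact arguments may differ across onnxruntime versions):

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    "model_fp32.onnx",            # input model path (placeholder)
    "model_int8.onnx",            # output model path (placeholder)
    weight_type=QuantType.QInt8,  # quantize weights to int8
)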

Data type selection

Quantization represents value with 8 bit, which can be either int8 and uint8. Combining with activation and weight, the data format can be (activation:uint8, weight:uint8), (activation:uint8, weight:int8), etc.

Transformer-based models

There are specific optimization for transformer-based models, like QAttention for quantization of attention layer. In order to leverage those specific optimization, you need to optimize your models with Transformer Model Optimization Tool before quantizing the model.

This notebook demonstrates the E2E process.

Quantization on GPU

Hardware support is required to achieve better performance with quantization on GPUs. You need a device that support Tensor Core int8 computation, like T4, A100. Older hardware won’t get benefit.

ORT leverage TRT EP for quantization on GPU now. Different with CPU EP, TRT takes in full precision model and calibration result for inputs. It decides how to quantize with their own logic. The overall procedure to leverage TRT EP quantization is:

  • Implement a CalibrationDataReader (see the sketch after this list).
  • Compute quantization parameters with the calibration data set. Our quantization tool supports 2 calibration methods: MinMax and Entropy. Note: In order to include all tensors from the model for better calibration, please run symbolic_shape_infer.py first. Please refer to here for details.
  • Save the quantization parameters into a flatbuffer file
  • Load the model and quantization parameter file and run with TRT EP.
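A minimal CalibrationDataReader sketch (the input name, shapes, and data here are placeholders):

import numpy as np
from onnxruntime.quantization import CalibrationDataReader


class MyDataReader(CalibrationDataReader):
    """Feeds calibration batches to the quantization tool, one feed dict at a time."""

    def __init__(self, batches):
        # Each element is a feed dict: {input_name: np.ndarray}.
        self._iter = iter(batches)

    def get_next(self):
        # Return None when the calibration data is exhausted.
        return next(self._iter, None)


reader = MyDataReader(
    [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)} for _ in range(10)]
)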

We have 2 E2E examples Yolo V3 and resnet50 for your reference.

ONNX ResNet50 CPU quantization example (BN folded)

However, onnxruntime's QLinearConv appears to be implemented only on CPU, so its usability is limited.

Operator Definition Reference

https://github.com/onnx/onnx/blob/main/docs/Operators.md

https://github.com/microsoft/onnxruntime/blob/master/docs/ContribOperators.md

AIMET Tools

https://github.com/quic/aimet

AIMET tool installation introduction (1)_weixin_38498942's blog-CSDN blog

Inadequacies of Existing Quantization Tools

No accuracy feedback to guide quantization: users have different accuracy tolerances, and existing tools do not guide the quantization process by accuracy, i.e. meet the accuracy requirement while obtaining the best possible performance.

Versatility and ease of use: the quantized model should remain compatible with inference in the original framework, or be convertible to TensorRT, etc.

Other Reference Articles

https://github.com/Ewenwan/MVision/tree/master/CNN/Deep_Compression/quantization

Introduction to Neural Network Quantization – Post-training Quantization (Zhihu)

Introduction to Neural Network Quantization – Quantization-Aware Training

Introduction to Neural Network Quantization – Basic Principles (Zhihu)

Post training quantization - Data free quantization bzdww

Detailed explanation of NCNN Conv quantification (1) bzdww

Detailed explanation of NCNN quantification (2) bzdww

https://github.com/google/gemmlowp

GitHub - quic/aimet: AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.

BERT Quantization

I-BERT: Integer-only BERT Quantization

PPQ Small Classroom | Principles of Quantization Computation (1)


Original article: blog.csdn.net/u013701860/article/details/121627946