Deep Learning Model Quantization Basics

Reference articles

神经网络量化入门--基本原理 - 知乎

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

A Survey of Quantization Methods for Efficient Neural Network Inference

模型量化与落地部署总结 - 知乎

Int8量化-介绍(一) - 知乎

Recent progress on transformer model quantization:

Large Transformer Model Inference Optimization | Lil'Log

LLM大语言模型量化方法(一)

Basic Quantization Concepts

Post-Training Quantization (PTQ)

In PTQ, weights and activations are quantized using only offline inference over some sample data; no training or fine-tuning is needed.

PTQ can be divided into uniform quantization and non-uniform quantization (the difference is whether the mapping between the fp32 values and the int values is linear; uniform quantization is more common):

Uniform quantization enables the use of integer or fixed-point math pipelines, allowing computation to be performed in the quantized domain.

Non-uniform quantization requires dequantization, e.g. a codebook lookup, before doing computation in higher precision, limiting its benefits to model compression and bandwidth reduction. 

Uniform quantization is further divided into affine quantization and scale quantization, also called asymmetric and symmetric quantization. In the latter, the fp32-to-int mapping is a pure scaling with no offset, so the computation is simpler.

PTQ can also be divided into dynamic quantization and static quantization:

Dynamic quantization: the scale and zero point are computed at inference time. It costs more computation than static quantization but may give higher accuracy; worth considering for NLP models.

Static quantization: the quantization parameters are computed ahead of time on a calibration dataset, so the runtime cost is clearly lower than dynamic quantization.

Quantization-Aware Training (QAT)

The network is trained while being quantized, so the parameters can adapt to the information loss that quantization introduces. This is more flexible and generally more accurate than post-training quantization, but less convenient to operate. In most cases it beats PTQ in accuracy, though in some scenarios it is not necessarily much better than partial/mixed-precision quantization.

Quantization formulas

(Formula figure taken from 神经网络量化入门--基本原理 - 知乎; not reproduced here.)
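The standard affine (asymmetric) mapping the figure shows, written here to match the uint8 code later in this article (r is the floating-point value, q the quantized integer, S the scale, Z the zero point):

q = clip(round(r / S + Z), q_min, q_max)
r = S * (q - Z)
S = (r_max - r_min) / (q_max - q_min)
Z = round(q_max - r_max / S)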

The formulas above are the affine quantization form. Scale quantization is the special case of affine quantization with Z = 0.

The zero point Z must map the real value 0 exactly onto one of the quantized values, so Z has to be an integer.

The formulas above are not universally safe; be careful when using them. If a tensor consists entirely of identical values, or its min and max are almost equal, they break down: when the whole tensor is one repeated element (or contains only one element), the scale becomes 0. Such cases need a different treatment, e.g. choose s = r/N with N a positive integer, which in theory introduces no quantization error; the problem then becomes choosing a suitable N.

For example, scale = r/int(max(sqrt(abs(r)), 1)), i.e. N = int(max(sqrt(abs(r)), 1)). This keeps both the scale and Z at a moderate magnitude; r = 0 still needs to be special-cased so that the scale is not 0.
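A small sketch of this heuristic; the function name and the r = 0 fallback are illustrative, not from any particular library:

import math

def degenerate_scale(r):
    # Scale for a tensor whose elements all equal r (min == max).
    if r == 0:
        return 1.0  # any non-zero scale works; 0 then maps exactly to the zero point
    n = int(max(math.sqrt(abs(r)), 1))
    # r quantizes to the integer n with no rounding error (note: a negative r
    # yields a negative scale, following the heuristic exactly as stated above).
    return r / n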

Quantizing Matrix Multiplication and Convolution

Matrix multiplication
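The derivation figure is not reproduced here; a sketch of the standard derivation it illustrates, with the activation X affine-quantized as x = s_x * (Qx - z_x) and the weight B scale-quantized as b = s_b * Qb:

Y_ij = sum_k X_ik * B_kj
     = s_x * s_b * sum_k (Qx_ik - z_x) * Qb_kj
     = s_x * s_b * ( sum_k Qx_ik * Qb_kj  -  z_x * sum_k Qb_kj )

The first sum is the pure integer matrix multiply; the second term depends only on the weights and can be precomputed offline. If B were also affine-quantized (z_b != 0), extra cross terms involving z_b and the activation would appear and would have to be recomputed for every input.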

Here B is the weight and uses scale quantization. If B also used affine quantization, the computation would become much more complex, so this is generally avoided.

A similar derivation applies to convolution.

Quantization Techniques

Calibration

Choosing the pre-quantization value range [b, a] is one of the most important parts of model quantization and strongly affects accuracy. Common methods are max, entropy (the KL-divergence method used in TensorRT), and percentile (e.g. taking the value at the 99.9% or 99.99% percentile as the maximum).

The post-quantization range and resolution are fixed by what the int type can represent, while the pre-quantization weights and activations often follow a roughly Gaussian distribution. With max calibration every floating-point value falls inside the representable range, but because the int type has only a limited number of levels the resolution becomes very coarse. Choosing [b, a] too small instead clips a lot of useful data, which also hurts accuracy. So for a given distribution and int type there should be an optimal pre-quantization range. Percentile calibration simply takes the value at the 99.9% or 99.99% (etc.) position of the sorted data as the threshold. The KL-divergence method used by TensorRT shows good accuracy and generality on most CV models. Weights and activations can also use different range-selection methods: weights can usually use max or percentile, often with better accuracy, whereas max calibration is generally unusable for activations.
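A minimal sketch of percentile calibration over collected activation samples; the symmetric int8 range and the function name are illustrative assumptions:

import numpy as np

def percentile_calibrate(activations, percentile=99.99, num_bits=8):
    # Pick the clipping threshold at the given percentile of |activation|,
    # then derive a symmetric int8 scale from it.
    abs_vals = np.abs(activations.reshape(-1))
    threshold = np.percentile(abs_vals, percentile)
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    scale = threshold / qmax
    return threshold, scale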

参考:https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

The TensorRT slides give the KL-divergence-based procedure and pseudocode; the basic idea:

It’s always a tradeoff between range and precision of the INT8 representation.

We want to minimize loss of information.
Loss of information is measured by Kullback-Leibler divergence (AKA relative entropy or information divergence).
P, Q - two discrete probability distributions.
KL_divergence(P,Q):= SUM(P[i] * log(P[i] / Q[i] ), i)
P is the histogram before quantization and Q is the histogram after quantization.

A simple approach: run the calibration data and build a 2048-bin histogram of the original floating-point values to get P. Then, for each candidate threshold, quantize and dequantize, build the histogram Q of the resulting floating-point values, and compute the KL divergence between the two distributions. This is slightly different from the procedure in the TensorRT slides.
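A rough numpy-only sketch of this simplified search; the bin count, the number of quantization levels, and the smoothing epsilon are illustrative choices, and the exact TensorRT procedure builds Q differently:

import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p.astype(np.float64) + eps
    q = q.astype(np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_calibrate(samples, num_bins=2048, num_levels=128):
    # Search the clipping threshold that minimizes KL(P || Q).
    abs_vals = np.abs(samples.reshape(-1))
    max_val = abs_vals.max()
    p_hist, edges = np.histogram(abs_vals, bins=num_bins, range=(0, max_val))
    best_kl, best_threshold = float("inf"), max_val
    for i in range(num_levels, num_bins + 1):
        threshold = edges[i]
        scale = threshold / (num_levels - 1)
        # Quantize/dequantize with this threshold and histogram the result.
        dequant = np.round(np.clip(abs_vals, 0, threshold) / scale) * scale
        q_hist, _ = np.histogram(dequant, bins=num_bins, range=(0, max_val))
        kl = kl_divergence(p_hist, q_hist)
        if kl < best_kl:
            best_kl, best_threshold = kl, threshold
    return best_threshold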

Per-tensor quantization: the whole tensor shares one set of quantization parameters.

Per-channel (Conv) or per-column (MatMul) quantization: elements within a channel share parameters, and different channels use different parameters. It is usually more accurate than per-tensor quantization, interacts better with BN folding, and avoids the severe accuracy drop that per-tensor quantization causes for depth-wise conv2d in MobileNet. The reason is that the weight ranges of different channels can differ greatly, so a single set of parameters leaves the channels with small ranges with far too little resolution.

The first and last layers of a model are usually left unquantized or quantized at higher precision. (Open question: could this be decided automatically, e.g. from how much the results differ across input data?)

Weight quantization

Weights generally use scale quantization, which is usually accurate enough; affine quantization of weights would clearly increase the computational complexity. Activations can use either affine or scale quantization, with affine being slightly more expensive. (These conclusions follow from the quantized MatMul and Conv2D formulas above.)

Per-channel is generally more accurate than per-tensor.

As BN parameters are learned per channel, their folding can result in significantly different weight value distributions across channels. Fortunately, as Table 3 shows, per-channel quantization granularity maintains model accuracy even with BN folding.
With per-channel granularity and max calibration for weights, the accuracy loss is very small (mostly under 0.5 percentage points), and the model size drops to 1/4 of the original.

Bias vectors are quantized as 32-bit integers.
 

Reference Python code for quantization simulation

The experiment below shows that for matrix-multiply weights, using one set of quantization coefficients per column is usually better.

import numpy as np
import copy

# Load a 2-D weight matrix dumped from a model; "tensor1.npy" is just an example path.
weight = np.load("tensor1.npy")


def get_u8_quant_coef(np_tensor):
    """Compute affine (asymmetric) uint8 quantization parameters for a tensor."""
    max_val = np.max(np_tensor)
    min_val = np.min(np_tensor)

    dst_max = 255
    dst_min = 0

    scale = (max_val - min_val) / (dst_max - dst_min)
    if scale == 0:
        # Degenerate tensor (max == min); fall back to a non-zero scale so the
        # division below does not fail (see the degenerate-case discussion above).
        scale = 1.0
    # From q_max = r_max / scale + z  =>  z = q_max - r_max / scale.
    zero_point = dst_max - max_val / scale
    zero_point_u8 = np.rint(zero_point)
    return scale, zero_point_u8


def quant_u8(np_tensor, scale, zero_point):
    """Quantize: q = clip(round(r / scale + z), 0, 255)."""
    quanted_tensor = np.rint(np_tensor / scale + zero_point)
    return np.clip(quanted_tensor, 0, 255)


def dequant(np_tensor, scale, zero_point):
    """Dequantize: r = (q - z) * scale."""
    dequant_tensor = np_tensor.astype("float32")
    return (dequant_tensor - zero_point) * scale


def get_error(tensor1, tensor2):
    """Sum of absolute reconstruction errors."""
    return np.sum(np.abs(tensor1 - tensor2))


def get_dequant(np_tensor):
    """Quantize then dequantize, to measure the round-trip error."""
    scale, zero_point = get_u8_quant_coef(np_tensor)
    quanted_tensor = quant_u8(np_tensor, scale, zero_point)
    dequant_tensor = dequant(quanted_tensor, scale, zero_point)
    return dequant_tensor, scale, zero_point


# Per-tensor quantization error.
dequant_tensor, scale, zero_point = get_dequant(weight)
error = get_error(weight, dequant_tensor)

weight1 = copy.deepcopy(weight)  # per-column reconstruction
weight2 = copy.deepcopy(weight)  # per-row reconstruction

row, col = weight.shape

# Per-column quantization: one (scale, zero_point) pair per column.
for i in range(col):
    col_data = weight[:, i]
    dequant_tensor_i, scale_i, zero_point_i = get_dequant(col_data)
    weight1[:, i] = dequant_tensor_i

# Per-row quantization: one (scale, zero_point) pair per row.
for i in range(row):
    row_data = weight[i, :]
    dequant_tensor_i, scale_i, zero_point_i = get_dequant(row_data)
    weight2[i, :] = dequant_tensor_i

error1 = get_error(weight, weight1)  # per-column error
error2 = get_error(weight, weight2)  # per-row error
print("per-tensor:", error, "per-column:", error1, "per-row:", error2)

Activation quantization

Max calibration is generally unusable. For CV models, entropy and 99.x% percentile calibration (99.9%, 99.99%, 99.999%, 99.9999%) differ little in accuracy, and entropy is better in most cases. For some NLP models, entropy calibration is clearly worse than the 99% percentile.

99.9% percentile calibration clips the large magnitude values too aggressively and leads to significant accuracy drops on most networks.

The best post training quantization results are achieved with entropy, 99.99%, or 99.999% percentile calibrations, though no single calibration is best for all networks.

Quantizing other operators
Notable integer-only quantization works include [154], which fuses Batch Normalization into the previous convolution layer, and [113], which proposes an integer-only computation method for residual networks with batch normalization. However, both methods are limited to ReLU activation. The recent work of [132] addresses this limitation by approximating GELU [94], Softmax, and Layer Normalization [6] with integer arithmetic and further extends integer-only quantization to Transformer [243] architectures.

Quantizing some operators, such as softmax, may hurt accuracy significantly; those operators are left unquantized.

Quantization of a simple model subgraph (figure not reproduced).

Optimizing the quantization coefficients

Dyadic quantization is another class of integer-only quantization, where all the scaling is performed with dyadic numbers, which are rational numbers with integer values in their numerator and a power of 2 in the denominator. (In other words, the scale is replaced by an integer numerator over a power-of-two denominator, so applying it needs no floating-point division, only an integer multiply and a bit shift.)
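A small sketch of approximating a floating-point scale by a dyadic number; the fixed 15-bit shift is an arbitrary illustrative choice:

def to_dyadic(scale, shift=15):
    # Approximate scale as multiplier / 2**shift with an integer multiplier.
    multiplier = int(round(scale * (1 << shift)))
    return multiplier, shift

def rescale(acc, multiplier, shift):
    # Apply the scale to a non-negative int32 accumulator using only
    # integer multiply, add, and shift (round to nearest).
    return (acc * multiplier + (1 << (shift - 1))) >> shift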

Effect of different rounding modes

Techniques to Recover Accuracy

Partial Quantization: leave the layers that affect accuracy the most unquantized.

Mixed-precision quantization: different layers are quantized at different precisions, e.g. fp16 and int8.

Excluding a small number of layers can bring the accuracy loss down to within 0.05% to 0.1%.

One can proceed additively: at each step quantize the cluster of layers with the smallest accuracy impact. Or subtractively: quantize everything, then un-quantize the layers with the largest accuracy impact. The latter is often more efficient, because the layers removed are a small fraction of everything quantized.

Data-free quantization through weight equalization and bias correction

Applicable to CV and transformer models; BERT is a special case:

For BERT, sensitivity analysis does not reveal any particular layer that contributes more to the accuracy drop. As a result we cannot identify a small subset of layers to leave in floating-point.
To address this we need to consider different approaches. Section 5.2, incorporates quantization with training to recover accuracy. Additionally, Appendix D examines the GELU activation function in BERT and presents a simple augmentation to significantly improve post training quantization accuracy.

Some work studies rounding schemes other than round-to-nearest, which may give better accuracy.

Search methods for mixed-precision quantization settings

Reinforcement learning, genetic algorithms, NAS, and similar methods can be used to search the mixed-precision quantization settings.
Different than these exploration and regularization-based approaches, HAWQ [51] introduces an automatic way to find the mixed-precision settings based on second-order sensitivity of the model.

In HAWQv2, this method was extended to mixed-precision activation quantization [50], and was shown to be more than 100x faster than RL based mixed-precision methods [246].

Recently, in HAWQv3, an integer-only, hardware-aware quantization was introduced [267] that proposed a fast Integer Linear Programming method to find the optimal bit precision for a given application-specific constraint (e.g., model size or latency).

Recommended Workflow

we recommend the following for int8 quantization:
Weights:
    – Use scale quantization with per-column/per-channel granularity
    – Use a symmetric integer range for quantization [-127, 127] and max calibration
Activations:
    – Use scale quantization with per-tensor granularity

Affine quantization might actually be better for activations, though the paper does not evaluate this in detail: the previous layer is often followed by ReLU, so the value range is not symmetric.
We recommend the following procedure to quantize a pre-trained neural network.
• PTQ: Quantize all the computationally intensive layers (convolution, linear, matrix multiplication, etc.) and run activation calibration including max, entropy and 99.99%, 99.999% percentile. If none of the calibrations yield the desired accuracy continue to partial quantization or QAT.
• Partial Quantization: Perform sensitivity analysis to identify the most sensitive layers and leave them in floating-point. If the impact on computational performance is not acceptable or an acceptable accuracy cannot be reached, continue to QAT.
• QAT: Start from the best calibrated quantized model. Use QAT to fine-tune for around 10% of the original training schedule with an annealing learning rate schedule starting at 1% of the initial training learning rate. Refer to Appendix A.2 for specific hyperparameter choices.

Three important operators

Quantize: converts fp32 to int8.

Dequantize: converts int8 back to fp32.

Requantize: when both inputs of Conv2D/MatMul are int8, the output type is int32 to avoid overflow. Requantize converts that int32 output back to int8 so it can feed the next quantized operator. In other words, it converts an int value expressed with one set of quantization parameters into an int value expressed with another set while keeping the represented floating-point value equivalent: s1(q1 - z1) = s2(q2 - z2); given the other parameters, solve for q2.
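Solving the identity for q2 gives q2 = z2 + (s1/s2)(q1 - z1); in practice the ratio s1/s2 is usually folded into a single fixed-point (e.g. dyadic) multiplier. A trivial floating-point sketch:

def requantize(q1, s1, z1, s2, z2):
    # Re-express the value s1*(q1 - z1) under the (s2, z2) parameters.
    q2 = round(z2 + (s1 / s2) * (q1 - z1))
    return max(-128, min(127, q2))  # clamp to the int8 range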

Quantization Tools

TensorRT quantization

fp16 quantization: set the fp16 flag in the builder config; no extra data is needed.

config.set_flag(trt.BuilderFlag.FP16)
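A minimal end-to-end sketch around this flag, assuming the TensorRT 8.x Python API and an ONNX model file model.onnx (the file names are placeholders):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow fp16 kernels where beneficial
engine_bytes = builder.build_serialized_network(network, config)
with open("model_fp16.engine", "wb") as f:
    f.write(engine_bytes)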
 

int8 quantization: Polygraphy can be used as a helper. Extra (calibration) data is needed, but no labels are required for accuracy evaluation.

https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy

https://developer.download.nvidia.cn/compute/redist/polygraphy/polygraphy-0.33.2-py2.py3-none-any.whl

TensorRT supports the use of 8-bit integers to represent quantized floating point values. The quantization scheme is symmetric uniform quantization - quantized values are represented in signed INT8, and the transformation from quantized to unquantized values is simply a multiplication. In the reverse direction, quantization uses the reciprocal scale, followed by rounding and clamping.

To enable the use of any quantized operations, the INT8 flag must be set in the builder configuration.
 

TensorRT supports only per-tensor quantization for activation tensors, but supports per-channel weight quantization for convolution, deconvolution, fully connected layers, and MatMul where the second input is constant and both input matrices are 2-dimensional.


In explicitly quantized networks, the scaling operations to transform between the quantized and unquantized values are represented explicitly by IQuantizeLayer (C++, Python) and IDequantizeLayer (C++, Python) nodes in the graph - these will henceforth be referred to as Q/DQ nodes.

Post-Training Quantization using Calibration

In post-training quantization, TensorRT computes a scale value for each tensor in the network. This process, called calibration, requires you to supply representative input data on which TensorRT runs the network to collect statistics for each activation tensor.

The amount of input data required is application-dependent, but experiments indicate that about 500 images are sufficient for calibrating ImageNet classification networks.

TensorRT provides multiple different calibrators which calculate the scale in different ways.

IInt8EntropyCalibrator2
Entropy calibration chooses the tensor's scale factor to optimize the quantized tensor's information-theoretic content, and usually suppresses outliers in the distribution. This is the current and recommended entropy calibrator and is required for DLA. Calibration happens before Layer fusion by default. It is recommended for CNN-based networks.

IInt8MinMaxCalibrator
This calibrator uses the entire range of the activation distribution to determine the scale factor. It seems to work better for NLP tasks. Calibration happens before Layer fusion by default. This is recommended for networks such as NVIDIA BERT (an optimized version of Google's official implementation).
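A rough sketch of wiring an entropy calibrator into the builder config, assuming the TensorRT 8.x Python API with pycuda for device buffers (the batch source and cache file name are placeholders):

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)          # list of np.float32 input batches
        self.cache_file = cache_file
        first = batches[0]
        self.batch_size = first.shape[0]
        self.d_input = cuda.mem_alloc(first.nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        batch = next(self.batches, None)
        if batch is None:
            return None                       # signals end of calibration data
        cuda.memcpy_htod(self.d_input, np.ascontiguousarray(batch))
        return [int(self.d_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# config is the builder config from the fp16 sketch above:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = EntropyCalibrator(calibration_batches)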
ONNX Support
When a model trained in PyTorch or TensorFlow using Quantization Aware Training (QAT) is exported to ONNX, each fake-quantization operation in the framework's graph is exported as a pair of QuantizeLinear and DequantizeLinear ONNX operators.
When TensorRT imports ONNX models, the ONNX QuantizeLinear operator is imported as an IQuantizeLayer instance, and the ONNX DequantizeLinear operator is imported as an IDequantizeLayer instance. ONNX opset 10 introduced support for QuantizeLinear/DequantizeLinear, and a quantization-axis attribute was added in opset 13 (required for per-channel quantization).
 

Layer-level Control of Precision
The builder-flags provide permissive, coarse-grained control. However, sometimes part of a network requires higher dynamic range or is sensitive to numerical precision. You can constrain the input and output types per layer:
C++
layer->setPrecision(DataType::kFP16)
Python
layer.precision = trt.fp16

In practice this per-layer precision setting did not seem to work: on a real model the fp16-quantized accuracy was insufficient, and setting layer precision this way did not solve the problem.
Sparsity
NVIDIA Ampere Architecture GPUs support Structured Sparsity. To make use of this feature to achieve higher inference performance, the convolution kernel weights and/or the fully-connected weights must meet the following requirements:

Several ways to quantize with TensorRT

Option 1: use TensorRT's built-in quantization directly.

Option 2: TensorRT 8 supports QDQ fake-int8 quantized models, so a model can be quantized in this form and then converted to TensorRT; a model manually quantized into QLinearConv-style operators cannot be converted. QAT models can also be saved in this QDQ form. TensorRT then automatically fuses the quantization operators and runs int8.

Generated ONNX graph with QuantizeLinear and DequantizeLinear ops is parsed using ONNX parser available in TensorRT.
https://developer.download.nvidia.cn/video/gputechconf/gtc/2020/presentations/s21664-toward-int8-inference-deploying-quantization-aware-trained-networks-using-tensorrt.pdf

Intel® AI Quantization Tools

GitHub - IntelAI/tools

https://github.com/intel/neural-compressor

https://github.com/IntelAI/tools/blob/master/tensorflow_quantization/quantization/quantize_graph.py

ONNX model quantization

https://github.com/onnx/onnx/blob/master/docs/Operators.md

https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization provides the model quantization tools.

See Quantize ONNX Models - onnxruntime for usage details and https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization for examples.

ONNX quantization representation format

There are 2 ways to represent quantized ONNX models:

  • Operator Oriented. All the quantized operators have their own ONNX definitions, like QLinearConv, MatMulInteger and etc.
  • Tensor Oriented, aka Quantize and DeQuantize (QDQ). This format uses DQ(Q(tensor)) to simulate the quantize and dequantize process, and QuantizeLinear and DeQuantizeLinear operators also carry the quantization parameters. Models generated like below are in QDQ format:
    • Models quantized by quantize_static API below with quant_format=QuantFormat.QDQ.
    • Quantization-Aware training (QAT) models converted from Tensorflow or exported from PyTorch.
    • Quantized models converted from tflite and other frameworks.

Quantizing an ONNX model

Quantization API

Quantization has 3 main APIs, which correspond to the 3 quantization methods:

  • quantize_dynamic: dynamic quantization
  • quantize_static: static quantization
  • quantize_qat: quantize-aware training quantization

Please refer to quantize.py for quantization options for each method.

Please refer to E2E_example_model for an example of static quantization.

In general, it is recommended to use dynamic quantization for RNN and transformer-based models, and static quantization for CNN models.
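A minimal sketch of the two PTQ entry points, assuming an ONNX model file model.onnx with a single input named "input" (the file names, input name, and shape are placeholders):

import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType,
    quantize_dynamic, quantize_static,
)

# Dynamic quantization: no calibration data; activation scales are computed at run time.
quantize_dynamic("model.onnx", "model.dynamic.int8.onnx", weight_type=QuantType.QInt8)

# Static quantization: needs a CalibrationDataReader that yields representative inputs.
class RandomDataReader(CalibrationDataReader):
    def __init__(self, num_batches=16):
        self.data = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        return next(self.data, None)  # None tells the tool calibration is done

quantize_static(
    "model.onnx", "model.static.int8.onnx",
    calibration_data_reader=RandomDataReader(),
    quant_format=QuantFormat.QDQ,
)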

Data type selection

Quantization represents values with 8 bits, which can be either int8 or uint8. Combining activation and weight, the data format can be (activation:uint8, weight:uint8), (activation:uint8, weight:int8), etc.

Transformer-based models

There are specific optimization for transformer-based models, like QAttention for quantization of attention layer. In order to leverage those specific optimization, you need to optimize your models with Transformer Model Optimization Tool before quantizing the model.

This notebook demonstrates the E2E process.

Quantization on GPU

Hardware support is required to achieve better performance with quantization on GPUs. You need a device that supports Tensor Core int8 computation, like T4 or A100. Older hardware won't benefit.

ORT leverages the TRT EP for quantization on GPU now. Unlike the CPU EP, TRT takes in the full-precision model and calibration results for the inputs, and decides how to quantize with its own logic. The overall procedure to leverage TRT EP quantization is:

  • Implement a CalibrationDataReader.
  • Compute quantization parameters with the calibration data set. Our quantization tool supports 2 calibration methods: MinMax and Entropy. Note: in order to include all tensors from the model for better calibration, please run symbolic_shape_infer.py first. Please refer to here for details.
  • Save quantization parameter into a flatbuffer file
  • Load model and quantization parameter file and run with TRT EP.

We have 2 E2E examples, Yolo V3 and resnet50, for your reference.

ONNX resnet50 CPU quantization example (BN folded)

However, onnxruntime's QLinearConv and related operators seem to have only CPU implementations, so this is of limited use.

Operator definition references

https://github.com/onnx/onnx/blob/main/docs/Operators.md

https://github.com/microsoft/onnxruntime/blob/master/docs/ContribOperators.md

AIMET tool

https://github.com/quic/aimet

AIMET工具安装介绍(1)_weixin_38498942的博客-CSDN博客

Shortcomings of existing quantization tools

No accuracy feedback to guide quantization: different users have different accuracy tolerances; the question is how to let accuracy guide quantization so that the accuracy target is met while the best performance is obtained.

Generality and ease of use: the quantized model should remain runnable in the original framework or be convertible to TensorRT and similar backends.

Other reference articles

https://github.com/Ewenwan/MVision/tree/master/CNN/Deep_Compression/quantization

神经网络量化入门--后训练量化 - 知乎

神经网络量化入门--量化感知训练 - 知乎

神经网络量化入门--基本原理 - 知乎

后训练量化——Data free quantization - 知乎

NCNN Conv量化详解(一) - 知乎

NCNN量化详解(二) - 知乎

https://github.com/google/gemmlowp

GitHub - quic/aimet: AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.

BERT quantization

I-BERT: Integer-only BERT Quantization

PPQ 小课堂 | 量化计算原理(一) - 知乎

Reposted from blog.csdn.net/u013701860/article/details/121627946