AI model linear quantization and acceleration

Model quantization

Background

Data formats for AI computation

Computers represent information with 0s and 1s; each 0 or 1 is one bit (binary digit). For AI computation, data is generally expressed in three forms:

The smallest unit of a string is the char, which occupies 8 bits (abbreviated b) of memory, equal to 1 byte (Byte, abbreviated B): 1 Byte = 8 bit.

Integers (INT): the number after INT indicates how many bits the integer type occupies in memory. Commonly used widths are INT8, INT16, INT32, and INT64.

Floating-point numbers (FP): likewise, the number after FP indicates how many bits the floating-point type occupies in memory. Commonly used formats are FP16 (half precision), FP32 (single precision), and FP64 (double precision).
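As a quick sanity check of these bit widths and ranges, a minimal sketch (assuming NumPy is available) prints each type's size and limits:

```python
import numpy as np

# Integer types: itemsize is in bytes, and 1 byte = 8 bits.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{info.dtype}: {info.dtype.itemsize * 8} bits, range [{info.min}, {info.max}]")

# Floating-point types: FP16 (half), FP32 (single), FP64 (double).
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.dtype}: {info.dtype.itemsize * 8} bits, max {info.max}")
```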

Quantization techniques

There are two commonly used quantization approaches: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ is performed after the model has been trained; if PTQ cannot meet the accuracy requirements, QAT is generally considered instead. During quantization, data overflow and rounding errors from insufficient precision occur, and single-precision and half-precision formats end up mixed. The benefit is a smaller model, but because the model structure and parameter count are unchanged, the extra work of aligning operands of different precisions can actually slow computation down. For this scenario, NVIDIA GPUs provide dedicated compute units (Tensor Cores, etc.) that perform mixed-precision computation in a single instruction to recover speed.
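As an illustration of the PTQ side of this workflow, here is a minimal sketch using PyTorch's dynamic post-training quantization; the toy model is an assumption, not something from the original text:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for an already-trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()  # PTQ operates on the frozen, trained model

# Convert Linear weights to INT8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, smaller weights
```

QAT, by contrast, inserts fake-quantization nodes before training so the network learns to compensate for the rounding error.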

Quantization tools

As quantization technology has developed and matured, very mature software tools have emerged, including NVIDIA's TensorRT. TensorRT is a GPU inference engine developed by NVIDIA: a complete toolchain covering model acquisition, model optimization and compilation, and deployment. It accepts models from mainstream training frameworks such as TensorFlow, PyTorch, and Caffe; during optimization and compilation it supports mixed precision, PTQ, and QAT; and the resulting model can ultimately be deployed to run on embedded, cloud, and in-vehicle hardware platforms.
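The usual entry point into this toolchain is an ONNX export of the trained model. A hedged sketch of that first step (the model and shapes are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 10)).eval()  # placeholder model
dummy = torch.randn(1, 128)

# Export to ONNX, the interchange format TensorRT's parser consumes.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Building the optimized INT8 engine is then done with TensorRT's own
# tooling, e.g. (exact flags may vary by TensorRT version):
#   trtexec --onnx=model.onnx --int8 --saveEngine=model.engine
```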

Linear quantization

The commonly used linear quantization process can be expressed with the following mathematical expressions:

$$
Q = \mathrm{clamp}(\mathrm{Round}(R/S+Z)) =
\begin{cases}
Q_{max}, & R \in \mathrm{float} \text{ and } R > T_{max} \\
\mathrm{Round}(R/S+Z), & R \in \mathrm{float} \text{ and } T_{min} \le R \le T_{max} \\
Q_{min}, & R \in \mathrm{float} \text{ and } R < T_{min}
\end{cases}
$$

$$
R = (Q - Z) \times S
$$

Here, Q is the fixed-point value after quantization and R is the floating-point value before quantization. Z is the zero point (zero_point), i.e., the fixed-point value that floating-point 0 maps to after the mapping; S is the scale, i.e., the scaling factor. Round() rounds to the nearest integer; clamp() limits a value between a lower and an upper bound. Tmax and Tmin are the maximum and minimum thresholds of the floating-point range; Qmax and Qmin are the maximum and minimum fixed-point values.
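Transcribed directly into code, the forward and inverse mappings might look like the following minimal sketch (the helper names are mine, not from the text):

```python
import numpy as np

def quantize(R, S, Z, Q_min=-128, Q_max=127):
    """Q = clamp(Round(R/S + Z)), clamped into [Q_min, Q_max] (int8 by default)."""
    Q = np.round(R / S + Z)
    return np.clip(Q, Q_min, Q_max).astype(np.int8)

def dequantize(Q, S, Z):
    """R = (Q - Z) * S — the inverse mapping (lossy, due to rounding and clamping)."""
    return (np.asarray(Q, dtype=np.float32) - Z) * S
```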

By rearranging, the relationship between the thresholds and the linear mapping parameters S and Z can be derived; once the thresholds are fixed, the parameters of the linear mapping are fixed as well.

$$
S = \frac{T_{max} - T_{min}}{Q_{max} - Q_{min}}
$$

$$
Z = Q_{max} - \frac{T_{max}}{S}
$$
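In code, computing S and Z from the thresholds is one line each (hypothetical helper name):

```python
def linear_params(T_min, T_max, Q_min, Q_max):
    S = (T_max - T_min) / (Q_max - Q_min)  # scale
    Z = round(Q_max - T_max / S)           # zero point, rounded to an integer
    return S, Z
```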

| Data type | Range |
| --- | --- |
| int32 | -2^31 ~ 2^31 - 1 |
| int8 | -2^7 ~ 2^7 - 1 (-128 ~ 127) |
| uint8 | 0 ~ 2^8 - 1 (0 ~ 255) |

From the above mapping relationship, once the thresholds are known, the corresponding linear mapping parameters are known as well, and the entire quantization process is determined.

So how are the thresholds determined?

Generally speaking, for weight quantization the data distribution is static, so directly taking MIN and MAX for the linear mapping is usually sufficient. For inference-time activation values, the distribution is dynamic: obtaining it requires a so-called calibration set to sample the distribution, after which a quantization algorithm selects the quantization threshold from the samples (saturating quantization). A sketch of both strategies follows.
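These are hypothetical helpers for illustration only; production toolkits use more sophisticated calibration algorithms (e.g., entropy/KL-based threshold selection):

```python
import numpy as np

def weight_thresholds(w):
    # Weights are static: take the raw min/max (non-saturating quantization).
    return float(w.min()), float(w.max())

def activation_thresholds(calibration_batches, pct=99.9):
    # Activations are dynamic: sample them over a calibration set and
    # clip to a percentile instead of the raw extremes (saturating quantization).
    samples = np.concatenate([np.ravel(a) for a in calibration_batches])
    return (float(np.percentile(samples, 100.0 - pct)),
            float(np.percentile(samples, pct)))
```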

Example:

After model training, the weights or activation values are usually distributed within a limited range. Suppose the weight range is [-2.0, 6.0], i.e., Tmax = 6.0 and Tmin = -2.0 (non-saturating quantization). Quantizing the model to int8, the fixed-point range is [-128, 127], i.e., Qmax = 127 and Qmin = -128. S and Z are then evaluated as follows:

$$
S = \frac{6.0 - (-2.0)}{127 - (-128)} = \frac{8.0}{255} \approx 0.03137255
$$

$$
Z = 127 - \frac{6.0}{0.03137255} \approx 127 - 191.25 = -64.25 \approx -64
$$

The following correspondence is obtained:

| Floating-point value | Fixed-point value |
| --- | --- |
| 6.0 | 127 |
| 0 | -64 |
| -2.0 | -128 |

With the quantization parameters S and Z in hand, the fixed-point value corresponding to any floating-point value can be computed. For example, for a weight equal to 0.28, i.e., R = 0.28:

$$
Q = \mathrm{Round}(0.28 / 0.03137255 + (-64)) = \mathrm{Round}(-55.075) = -55
$$
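The whole worked example can be reproduced in a few lines (a standalone check; the variable names are mine):

```python
import numpy as np

T_min, T_max = -2.0, 6.0
Q_min, Q_max = -128, 127

S = (T_max - T_min) / (Q_max - Q_min)  # 8.0 / 255 ≈ 0.03137255
Z = round(Q_max - T_max / S)           # round(-64.25) = -64

for r in (6.0, 0.0, -2.0, 0.28):
    q = int(np.clip(np.round(r / S + Z), Q_min, Q_max))
    print(f"{r} -> {q}")
# Prints: 6.0 -> 127, 0.0 -> -64, -2.0 -> -128, 0.28 -> -55
```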


Origin: blog.csdn.net/qq_43805944/article/details/129666724