Weight quantization methods

Weight quantization is a technique for converting the weight parameters of a neural network into a low-precision representation. In conventional neural networks, weights are usually stored as high-precision floating-point numbers, such as 32-bit floats. High-precision floating point, however, requires more storage space and computing resources. In resource-constrained settings such as embedded devices or mobile terminals, it leads to larger models and higher inference latency, so low-precision representations are a better fit.

The basic idea of weight quantization is to map the floating-point weight parameters to a discrete set of values, such as 8-bit or 4-bit integers, thereby reducing the storage and computation cost of the weights. Common approaches include symmetric quantization and asymmetric quantization. Symmetric quantization maps the weights to a symmetric range such as [-128, 127]; because the zero point is fixed at zero, no offset has to be subtracted and fast integer compute units can be used directly. Asymmetric quantization maps the weights to an asymmetric range such as [0, 255]; it adapts better to the actual distribution of the parameters, but requires an extra zero-point offset and more conversion work during computation.
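As a concrete illustration, here is a minimal NumPy sketch of the two schemes just described, assuming per-tensor scales and the usual 8-bit ranges; the exact formulas vary between frameworks.

```python
# Minimal sketch of symmetric and asymmetric 8-bit quantization (per-tensor scale).
import numpy as np

def quantize_symmetric(w, num_bits=8):
    """Map weights to [-128, 127] with the zero point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax               # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric(w, num_bits=8):
    """Map weights to [0, 255] using a scale and a zero-point offset."""
    qmin, qmax = 0, 2 ** num_bits - 1              # 0 .. 255
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = int(round(qmin - w_min / scale))  # integer offset for the value 0.0
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

w = np.random.randn(4, 4).astype(np.float32)
q_sym, s = quantize_symmetric(w)
q_asym, s2, zp = quantize_asymmetric(w)
print("symmetric dequantization error:", np.abs(w - q_sym * s).max())
print("asymmetric dequantization error:", np.abs(w - (q_asym.astype(np.float32) - zp) * s2).max())
```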

Weight quantization can significantly reduce model size and inference latency while largely preserving model accuracy. Some accuracy loss is still possible, however, so retraining or fine-tuning is often needed to recover it.

Common weight quantization methods include the following:

Symmetric quantization: quantize the weights into a symmetric interval such as [-128, 127], represented by 8-bit integers; the quantization parameters can be determined from the maximum and minimum weight values before quantization.
Asymmetric quantization: quantize the weights into an asymmetric interval such as [0, 255], represented by 8-bit integers; the quantization parameters are likewise determined from the maximum and minimum values before quantization, but the representation of zero (the zero point) has to be handled separately.
Mixed-precision quantization: split the weights into a high-precision part and a low-precision part; the high-precision part is kept in floating point and the low-precision part is stored as integers, which yields a better compression ratio.
Sparse quantization: set a portion of the weights in the network to zero so that those weights need not be quantized at all, achieving further compression.
Block quantization: divide the weights into multiple blocks and share the same quantization parameters within each block, which improves quantization efficiency (a per-block sketch follows this list).
These methods have different trade-offs in different application scenarios, and the appropriate one has to be chosen for the situation at hand.
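The per-block sketch referenced above, under the assumption of a symmetric int8 scheme and an arbitrarily chosen block size of 64:

```python
# Rough sketch of block quantization: the flattened weights are split into blocks
# and each block gets its own symmetric int8 scale.
import numpy as np

def block_quantize(w, block_size=64, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    flat = w.reshape(-1)
    pad = (-len(flat)) % block_size                      # pad so blocks divide evenly
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                            # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, w.shape, pad

def block_dequantize(q, scales, shape, pad):
    flat = (q.astype(np.float32) * scales).reshape(-1)
    if pad:
        flat = flat[:-pad]                               # drop the padding again
    return flat.reshape(shape)

w = np.random.randn(3, 100).astype(np.float32)
q, scales, shape, pad = block_quantize(w)
print("max block-quantization error:", np.abs(w - block_dequantize(q, scales, shape, pad)).max())
```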
Outlier_Remove is a weight quantization method whose core idea is to remove the outliers among the weight values and then quantize the remaining weights. This can effectively improve the accuracy and inference speed of the quantized model.

The concrete procedure is to first detect outliers in the weight values through statistical analysis (such as the z-score) or other rules (such as fixed thresholds) and remove them. The remaining weight values are then quantized, usually with fixed-point or symmetric quantization.
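A hedged sketch of this procedure, using a z-score threshold of 3.0 as an illustrative choice; note that instead of deleting the outliers it clamps them to the inlier range, which is one common way of keeping the tensor shape intact.

```python
# Sketch: detect outliers by z-score, restrict the quantization range to the
# inliers, clamp the outliers into that range, then quantize symmetrically.
# The threshold of 3.0 and the clamping strategy are assumptions.
import numpy as np

def quantize_with_outlier_removal(w, z_threshold=3.0, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    z = (w - w.mean()) / (w.std() + 1e-8)        # z-score of every weight
    inliers = np.abs(z) <= z_threshold
    clip_val = np.abs(w[inliers]).max()          # range determined by inliers only
    w_clipped = np.clip(w, -clip_val, clip_val)  # outliers are clamped, not kept
    scale = clip_val / qmax
    q = np.clip(np.round(w_clipped / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.randn(1000).astype(np.float32)
w[0] = 50.0                                      # inject an artificial outlier
q, scale = quantize_with_outlier_removal(w)
print("scale without outlier influence:", scale)
```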

The advantage of Outlier_Remove is that eliminating outliers shrinks the quantization range and therefore the quantization error, improving the accuracy and inference speed of the model. Its disadvantage is that the weights must first be analyzed and processed, which adds computation and storage cost, and removing outliers may hurt the generalization ability of the model.
KLD (Kullback-Leibler Divergence) is a weight quantization method based on distribution similarity: it compares the distribution of the original weights with the distribution of the quantized weights and selects the quantization parameters according to the difference between them.

The basic idea of the KLD method is to choose the quantization parameters that minimize the KL divergence between the original weight distribution and the quantized weight distribution. KL divergence measures how far one probability distribution is from another, so the KLD method can be viewed as a distance-based criterion.

In the KLD method, the original weight distribution is first discretized into N intervals, and the mean of the weight values within each interval is used as the quantized value. The KL divergence between the original and quantized distributions is then computed and minimized to select the optimal quantization parameters.
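The following is a simplified sketch of this idea, loosely in the style of histogram-based KL calibration: candidate clipping thresholds are scored by the KL divergence between the original weight histogram and a coarsened version of it. The bin count, the candidate thresholds, and the 128-level target are illustrative assumptions, not values prescribed by the method above.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log((p + eps) / (q + eps)))

def choose_threshold_by_kld(w, num_bins=2048, num_levels=128):
    hist, edges = np.histogram(np.abs(w), bins=num_bins)
    best_t, best_kl = None, np.inf
    for i in range(num_levels, num_bins + 1, 16):        # candidate clip points
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()                          # fold clipped mass into the last bin
        # coarsen the first i bins down to num_levels groups, then expand back,
        # spreading each group's mass over its originally non-empty bins
        groups = np.array_split(p, num_levels)
        q = np.concatenate([np.full(len(g), g.sum() / max(int((g > 0).sum()), 1)) * (g > 0)
                            for g in groups])
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

w = np.random.randn(100000).astype(np.float32)
print("chosen clipping threshold:", choose_threshold_by_kld(w))
```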

The advantage of the KLD method is that it preserves the shape of the original weight distribution, so it can perform better in certain applications. The disadvantage is that discretizing the original distribution introduces some error of its own.
The MAX weight quantization method is a quantization method based on the extreme values of the weights: the weights are first analyzed to obtain their maximum value max_w and minimum value min_w, this range is divided into N equal intervals, and each weight is quantized to the midpoint of its nearest interval. Because the method does not normalize or standardize the weights, it is flexible and general, but it may map several distinct weight values to the same quantization level, causing a loss of precision.
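A small sketch of this range-and-midpoint scheme, with N = 256 chosen arbitrarily to correspond to 8-bit storage:

```python
# Sketch of the max/min range method: split [min_w, max_w] into N equal intervals
# and map each weight to the midpoint of its interval.
import numpy as np

def max_quantize(w, num_levels=256):
    w_min, w_max = w.min(), w.max()
    step = (w_max - w_min) / num_levels
    idx = np.clip(np.floor((w - w_min) / step), 0, num_levels - 1).astype(np.uint8)
    midpoints = w_min + (idx + 0.5) * step      # dequantized value for each weight
    return idx, midpoints                       # uint8 holds the 256 level indices

w = np.random.randn(8).astype(np.float32)
idx, w_hat = max_quantize(w)
print("max reconstruction error:", np.abs(w - w_hat).max())
```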
The Euclidean weight quantization method is a weight quantization method based on the Euclidean distance: it divides each element of a weight vector by the vector's L2 norm, normalizing the weight vector to a unit vector. Specifically, for each weight vector w, its quantized value is:

w' = w / ||w||2

where ||w||2 is the L2 norm of w, namely:

||w||2 = sqrt(w1^2 + w2^2 + … + wn^2)

By normalizing the weight vectors to unit vectors, the Euclidean method reduces the distance differences between weight vectors to a certain extent and improves the accuracy of quantization.
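A minimal sketch of the normalization step just described; any subsequent discretization of the normalized values is omitted here.

```python
# L2-normalize a weight vector: w' = w / ||w||_2
import numpy as np

def l2_normalize(w, eps=1e-12):
    norm = np.sqrt(np.sum(w ** 2))          # ||w||2 = sqrt(w1^2 + ... + wn^2)
    return w / (norm + eps)

w = np.array([3.0, 4.0])
w_unit = l2_normalize(w)
print(w_unit, np.linalg.norm(w_unit))       # [0.6 0.8] with norm 1.0
```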
Cosine weight quantization is a quantization method based on cosine similarity: the cosine similarity between the weight vector before and after quantization is used to measure how well the quantization preserves the weights. In this method, the weight vector is treated as a point in a high-dimensional space, the cosine similarity between the original and quantized vectors is computed, and the quantization is chosen according to that similarity. The larger the cosine similarity between the two vectors, the closer they are in that space and the smaller the quantization error. The advantage of this approach is that it can achieve a high compression ratio with little or no reduction in model accuracy.
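One way to turn this criterion into code, sketched below, is to try several candidate clipping thresholds and keep the one whose quantize-dequantize result has the highest cosine similarity to the original weights; the candidate grid and the symmetric int8 scheme are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def quantize_by_cosine(w, num_bits=8, num_candidates=20):
    qmax = 2 ** (num_bits - 1) - 1
    best = (-1.0, None, None)
    for frac in np.linspace(0.5, 1.0, num_candidates):   # clip at a fraction of max|w|
        clip_val = frac * np.abs(w).max()
        scale = clip_val / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        sim = cosine_similarity(w, q.astype(np.float32) * scale)
        if sim > best[0]:
            best = (sim, q, scale)
    return best

w = np.random.randn(512).astype(np.float32)
sim, q, scale = quantize_by_cosine(w)
print("best cosine similarity:", sim)
```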
The Pearson weight quantization method is a weight compression method based on the Pearson correlation coefficient: the Pearson correlation coefficient measures the strength of the linear relationship between two variables and takes values between -1 and 1, where values closer to 1 indicate a stronger linear relationship between the two variables.

In the Pearson weight quantization method, the Pearson correlation coefficient between each convolution kernel and all the other convolution kernels in the network is computed, and the weights of kernels that are only weakly correlated with the others are compressed to reduce the redundant parameters in the network. Specifically, the method can be divided into the following steps:

Calculate the Pearson correlation coefficient matrix between the weight matrices of the convolution kernels in the network.
For each convolution kernel, select the set of kernels with the smallest Pearson correlation coefficients and compress their weights.
Repeat the second step until the weights of all convolution kernels in the network have been compressed.
Compared with other weight quantization methods, the Pearson method can capture the interactions between convolution kernels in the network more accurately, enabling more effective parameter compression and model acceleration.
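A hedged sketch of the first step only (the correlation matrix): how the weakly correlated kernels are then compressed is not specified in detail above, so the code merely ranks kernels by their strongest correlation with any other kernel.

```python
import numpy as np

def kernel_correlations(conv_weight):
    """conv_weight: array of shape (out_channels, in_channels, kH, kW)."""
    flat = conv_weight.reshape(conv_weight.shape[0], -1)
    corr = np.corrcoef(flat)                   # Pearson correlation matrix between kernels
    np.fill_diagonal(corr, 0.0)                # ignore self-correlation
    strongest = np.abs(corr).max(axis=1)       # each kernel's strongest partner
    return corr, np.argsort(strongest)         # weakest-correlated kernels first

conv_weight = np.random.randn(16, 3, 3, 3).astype(np.float32)
corr, order = kernel_correlations(conv_weight)
print("kernels ranked from weakest to strongest correlation:", order[:5])
```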

The Statistics quantization method does not map the weights to discrete levels; instead it compresses them by using the statistics of the weights, namely their mean and variance: each weight has the mean subtracted from it and is then divided by the square root of the variance (the standard deviation), so that the weights fall within a smaller range. The advantage of this method is that it retains more of the weight information, but it requires more storage space. Because the weights are not discretized, it is not well suited to low-bit quantization.
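A minimal sketch of this standardization, assuming the per-tensor mean and standard deviation are stored alongside the transformed weights so that they can be recovered exactly:

```python
# Standardize weights by their mean and standard deviation (no discrete levels).
import numpy as np

def standardize(w, eps=1e-8):
    mean, std = w.mean(), w.std()
    return (w - mean) / (std + eps), mean, std

def destandardize(w_norm, mean, std):
    return w_norm * std + mean

w = np.random.randn(256).astype(np.float32) * 5 + 2
w_norm, mean, std = standardize(w)
print("standardized range:", w_norm.min(), w_norm.max())
print("reconstruction error:", np.abs(w - destandardize(w_norm, mean, std)).max())
```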
