INT8 Quantization

Deep learning on mobile devices has come into full bloom over the past two years; in fact, the blooming started even earlier.

First, what is quantization? Models we currently train in frameworks such as Caffe and TensorFlow (both the forward and backward passes) use float32, which, compared with int8, needs more storage space but gives better precision.

As for the current state of quantization, there are two approaches: one quantizes by finetuning the original model during training, and the other quantizes the model and its computations directly. This article covers the latter: quantizing an already-trained model directly, without finetuning.

In deep learning, quantizing float32 to int8 makes models smaller, inference faster, and power consumption lower. The only drawback is that model accuracy drops.

First, compare float32 and int8: what is the difference?

Dynamic range (the value range of each data type):

FP32: -3.4 × 10^38 ~ +3.4 × 10^38

INT8: -128 ~ +127

The quantization process is actually simple: map the high-precision value range onto the low-precision one.

The linear mapping is:

x_float = scale * x_int8 + bias    (1)

and, since the bias turns out not to be needed:

x_float = scale * x_int8

When the quantization bias = 0, the positive and negative ranges are mapped symmetrically, which is called symmetric quantization.

(Image from "8-bit Inference with TensorRT"; all rights belong to NVIDIA.)

The quantization process can be divided into two parts. One is the model parameters (weights); for typical models these can be quantized directly with the no-saturation approach, where the maximum absolute value maps straight to 127, i.e. scale = max(|w|) / 127.
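
As a minimal sketch (my own illustration in NumPy, not code from either paper), no-saturation weight quantization might look like this:

```python
import numpy as np

def quantize_weights_no_saturation(w):
    """No-saturation (max-abs) symmetric quantization for weights:
    the largest |w| is mapped directly to 127, nothing is clipped away."""
    scale = np.abs(w).max() / 127.0          # assumes w is not all zeros
    q = np.round(w / scale).astype(np.int8)  # values land in [-127, 127]
    return q, scale

# Round-trip example: dequantize with x_float ~= scale * x_int8
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
q, scale = quantize_weights_no_saturation(w)
w_restored = scale * q.astype(np.float32)
```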

The other part is quantizing the activations, the values computed during inference. Here the saturated approach is chosen, which involves a threshold-selection process: values beyond a threshold |T| are clipped to it.
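
In contrast to the weights, a saturating quantizer clips first and then maps. A sketch (again my own illustration, assuming the symmetric mapping above):

```python
import numpy as np

def quantize_saturate(x, threshold):
    """Saturated quantization: clip activations to [-threshold, +threshold],
    then map that range linearly onto int8."""
    scale = threshold / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)
```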

In "8-bit inference with tensor RT" article, the choice of threshold of using the relative entropy (kl divergence), relative entropy in the discrete case [official]

On a validation dataset, taking one batch as an example, the activation frequency histogram of each layer is collected over the batch (an FP32 histogram H with 2048 bins: bin[0], ..., bin[2047]).
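
A simplified sketch of the calibration loop (my paraphrase of the algorithm in the NVIDIA slides, not their reference code; the bin-splitting details are simplified):

```python
import numpy as np

def kl_calibrate(hist, bin_width, num_quant_bins=128):
    """Pick a saturation threshold T by minimizing KL(P || Q), where P is the
    FP32 histogram clipped at candidate bin i and Q is P re-quantized to
    num_quant_bins levels. hist is e.g. a 2048-bin histogram of abs(acts)."""
    nbins = len(hist)
    best_kl, best_i = np.inf, nbins
    for i in range(num_quant_bins, nbins + 1):
        # P: first i bins, with all outlier counts folded into the last bin
        p = hist[:i].astype(np.float64).copy()
        p[-1] += hist[i:].sum()
        # Q: collapse those i bins into num_quant_bins levels, then spread
        # each level's total count uniformly over its originally nonzero bins
        q = np.zeros(i, dtype=np.float64)
        edges = np.linspace(0, i, num_quant_bins + 1).astype(int)
        for j in range(num_quant_bins):
            lo, hi = edges[j], edges[j + 1]
            chunk = hist[lo:hi].astype(np.float64)
            nonzero = chunk > 0
            if nonzero.any():
                q[lo:hi][nonzero] = chunk.sum() / nonzero.sum()
        if p.sum() == 0 or q.sum() == 0:
            continue
        p /= p.sum()
        q /= q.sum()
        mask = p > 0
        kl = np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12)))
        if kl < best_kl:
            best_kl, best_i = kl, i
    return (best_i + 0.5) * bin_width  # threshold T in activation units
```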

(Image from "8-bit Inference with TensorRT"; all rights belong to NVIDIA.)

Of course, the quantization method in that article is quite simple. For ResNet, GoogLeNet, and other relatively complex networks it works well, and the precision loss is small. But for already-streamlined networks such as MobileNet and ShuffleNet, the accuracy loss is quite noticeable, especially when they are used in detection architectures.

There is also a Google article, "Quantizing deep convolutional networks for efficient inference: A whitepaper". It refines the quantization approach and complements the method above in two aspects.

First, it introduces the concepts of symmetric and asymmetric quantization. In formula (1) from the NVIDIA article, bias = 0 (their article mentions the bias is not required) corresponds to symmetric quantization: FP32 zero is still zero after quantization. The case bias ≠ 0 is asymmetric quantization. Note, however, that the bias (zero-point) must be an integer: deep learning models contain a great deal of zero-padding, and if the zero-point were not an integer, all of those zero values would suffer precision loss during quantization.
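
A sketch of asymmetric quantization with an integer zero-point (my own illustration of the idea, not the whitepaper's code; the uint8 range [0, 255] is an assumption):

```python
import numpy as np

def asymmetric_quantize(x, num_bits=8):
    """Affine quantization: x_float ~= scale * (x_uint - zero_point).
    The zero-point is rounded to an integer so FP32 0.0 (e.g. zero-padding)
    maps to an exact uint8 code with no precision loss."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min = min(float(x.min()), 0.0)   # keep 0 inside the range
    x_max = max(float(x.max()), 0.0)
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def asymmetric_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)
```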

The second concept is per-layer versus per-channel quantization. The distinction is easy to understand from the names. The NVIDIA method above uses per-layer quantization: every value in a layer shares the same threshold. Per-channel quantization gives each channel of each layer its own threshold, which yields a good accuracy improvement, as the sketch below shows.
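
A sketch of the difference (my illustration; here per-channel means one max-abs scale per output channel of a convolution weight):

```python
import numpy as np

def per_layer_scale(w):
    """One symmetric scale for the whole tensor."""
    return np.abs(w).max() / 127.0

def per_channel_scales(w, channel_axis=0):
    """One symmetric scale per channel, e.g. per output filter of a conv."""
    w2d = np.moveaxis(w, channel_axis, 0).reshape(w.shape[channel_axis], -1)
    return np.abs(w2d).max(axis=1) / 127.0

w = np.random.randn(64, 3, 3, 3).astype(np.float32)  # (out_ch, in_ch, kh, kw)
print(per_layer_scale(w))            # a single scalar for the layer
print(per_channel_scales(w).shape)   # (64,), one scale per filter
```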

Then there is how the quantization thresholds are computed; the article uses a simpler method. For weights, the actual minimum and maximum values determine the quantization parameters. For activation outputs, a moving average of the minimum and maximum values across batches determines the quantization parameters.
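
A minimal sketch of that moving-average range tracker (my own illustration; the momentum value is an assumption):

```python
class MovingAvgRange:
    """Track activation min/max as an exponential moving average over batches."""

    def __init__(self, momentum=0.99):
        self.momentum, self.min, self.max = momentum, None, None

    def update(self, x):
        bmin, bmax = float(x.min()), float(x.max())
        if self.min is None:               # first batch initializes the range
            self.min, self.max = bmin, bmax
        else:
            m = self.momentum
            self.min = m * self.min + (1 - m) * bmin
            self.max = m * self.max + (1 - m) * bmax
```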

Finally, the results: the article only evaluates classification networks. How much accuracy drops on detection networks still needs experiments.

(Image from "Quantizing deep convolutional networks for efficient inference: A whitepaper"; all rights belong to Google.)

As for quantization-aware training, a later article will revisit it.


Source: blog.csdn.net/jacke121/article/details/104761170