HiSilicon Development: Accuracy Changes of Relu and Relu6 Before and After Quantization, and Their Causes

1. Introduction

When I deployed HopeNet earlier, I noticed a difference between relu6 and relu. Specifically, the accuracy of the relu model dropped by 14% after quantization, while the accuracy of the relu6 model dropped by only 2.5% after quantization (here the "relu model" means the model obtained by directly replacing relu6 with relu before quantization, and then quantizing). Apart from the backbone and the activation function, the two models are identical. So can relu6 reduce the accuracy loss from quantization? Since the backbones of those two models differ, I decided to run a comparative experiment under stricter conditions.

2. Experiment

I deliberately chose MobileNet v2, whose activation function is exactly relu6. The test set has 2 classes, with 500 positive and negative samples; "accuracy" below means classification accuracy, and per-class accuracy is also computed for each class. I trained two models, MobileNet v2-relu and MobileNet v2-relu6, which are identical except for the activation function. Training and testing were done in PyTorch, because my Caffe build does not support relu6. A demo of converting RGB images to BGR format is linked here.
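For context, here is a minimal sketch (not my actual training code) of how the two variants can be built in PyTorch, assuming torchvision's mobilenet_v2, which uses relu6 by default; the helper name build_mobilenet_v2 is just for illustration:

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

def build_mobilenet_v2(num_classes=2, use_relu6=True):
    # torchvision's MobileNet v2 uses ReLU6 as its activation by default.
    model = mobilenet_v2(num_classes=num_classes)
    if not use_relu6:
        # Swap every ReLU6 for a plain ReLU to get the "relu" variant;
        # everything else (backbone, classifier head) stays identical.
        for module in model.modules():
            for name, child in module.named_children():
                if isinstance(child, nn.ReLU6):
                    setattr(module, name, nn.ReLU(inplace=True))
    return model

model_relu6 = build_mobilenet_v2(use_relu6=True)   # MobileNet v2-relu6
model_relu  = build_mobilenet_v2(use_relu6=False)  # MobileNet v2-relu
```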
1. PyTorch inference results:

Model                Total accuracy   Class-0 accuracy   Class-1 accuracy
MobileNet v2-relu    97.9%            97.2%              98.6%
MobileNet v2-relu6   97.6%            97.6%              97.6%

2. HiSilicon NNIE inference results:

Model                Total accuracy   Class-0 accuracy   Class-1 accuracy
MobileNet v2-relu    97.7%            97.0%              98.4%
MobileNet v2-relu6   97.8%            97.6%              98.0%

Note: HiSilicon does not support relu6 (later I did manage to deploy relu6 on HiSilicon; see the related blog), so when converting from PyTorch to Caffe I directly replaced relu6 with relu.
From the data above, although the accuracy loss of MobileNet v2-relu is not large here, relu6 still has a positive effect on reducing quantization accuracy loss.
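As an aside on the note above: one common way to express relu6 using only plain relu operations is the identity relu6(x) = relu(x) - relu(x - 6). This is just the mathematical identity, not necessarily the method used in the linked blog; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLU6FromReLU(nn.Module):
    # relu6(x) = relu(x) - relu(x - 6):
    #   x <= 0      -> 0 - 0       = 0
    #   0 < x <= 6  -> x - 0       = x
    #   x > 6       -> x - (x - 6) = 6
    def forward(self, x):
        return torch.relu(x) - torch.relu(x - 6.0)

x = torch.linspace(-2.0, 10.0, 7)
print(ReLU6FromReLU()(x))   # matches F.relu6(x)
print(F.relu6(x))
```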

3. Follow-up

I feel that the gap here is not very pronounced, so the result is not very convincing. After some thought, I plan to run a few more experiments to make the argument stronger.

4. Thinking About the Causes

Since I discovered this problem, I have been thinking about the reason behind it. Although this conclusion is not yet supported by many cases, I still want to discuss it briefly. Please just treat it as a sketch; if anything is wrong, point it out and I will fix it, thank you!
At first I thought about it for a long time without getting anywhere. Later, while reading technical blogs on quantization (quantization-related blog, which is quite friendly for beginners), weight quantization was mentioned, and I thought the cause might be: relu makes the range of the weights spread too widely, and relu6 alleviates this.
So there are two questions to discuss:
1. Why can relu6 alleviate an excessively large spread in the weights?
2. Why does an excessively large spread in the weights reduce accuracy after quantization?

Let me address the first question first; time to bring out the chain rule.
[Figure: a simple tensor flow graph (top) and the relu6 activation curve (bottom)]
$$\frac{\partial \text{loss}}{\partial w} = \frac{\partial \text{loss}}{\partial y} \cdot \frac{\partial y}{\partial B} \cdot \frac{\partial B}{\partial w} = \frac{\partial \text{loss}}{\partial y} \cdot \frac{\partial y}{\partial B} \cdot A$$
This simplified chain-rule formula is enough to explain it. A large spread in the weights generally comes from the weight gradients being very different: a weight w1 with a large gradient moves a bit more each iteration, while a weight w2 with a small gradient moves a bit less, and after a few epochs the differences between individual weights can become very large.
Here $\frac{\partial y}{\partial B}$ is the gradient of relu/relu6. A and B are linearly related (in the flow graph, B = A·w, so $\frac{\partial B}{\partial w} = A$), so when B is very large, A is likely to be very large as well, which makes the gradient $\frac{\partial \text{loss}}{\partial w}$ very large; an overly large gradient eventually leads to the situation described above (when relu is used as the activation function).
With relu6, the positive range is capped. When B > 6, $\frac{\partial y}{\partial B}$ becomes 0, i.e. exactly when A is large enough to push B above 6 we get $\frac{\partial \text{loss}}{\partial w} = \frac{\partial \text{loss}}{\partial y} \cdot 0 \cdot A = 0$, which avoids producing a large gradient.
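A tiny PyTorch autograd check of this, using a toy scalar example I made up with B = A·w as in the flow graph above:

```python
import torch
import torch.nn.functional as F

def weight_grad(activation, a_value, w_value=2.0):
    # Toy graph: B = A * w, y = activation(B), loss = y; return d(loss)/dw.
    a = torch.tensor(a_value)
    w = torch.tensor(w_value, requires_grad=True)
    y = activation(a * w)
    y.backward()
    return w.grad.item()

# With relu, d(loss)/dw = A, so a large input A gives a large weight gradient.
print(weight_grad(F.relu, 100.0))   # 100.0
# With relu6, B = 200 > 6, so the relu6 gradient is 0 and the weight gradient vanishes.
print(weight_grad(F.relu6, 100.0))  # 0.0
```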

Now for the second question: why does an excessively large spread in the weights reduce accuracy after quantization? This will be clearer after reading the quantization-related blogs; here is a simple example.
Because the convolution weights are fixed at inference time, they are generally quantized in a non-saturating, asymmetric, per-channel manner. I will not explain these three terms here; please see for yourself → (related blog 1, related blog 2).
[Figure: schematic diagram of quantization]

The figure above is a schematic diagram of quantization. Suppose most of the values of w lie in the range 0-1 (float), while a few values of w lie in the range 99-100 (a 100x spread in the weights). After quantization, the data end up concentrated at basically two fixed places (around 128 and 255 in uint8), so a large amount of the information in the convolution weights is crushed together, which seriously weakens the model's expressive power. You can also see that many codes are wasted, e.g. the intervals 0-127 and 129-254. If the range of the weights is limited, the whole interval is used well: the weights map more evenly onto 0-255, more weight information is preserved, and the quantization error is reduced.
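To make this concrete, here is a small NumPy sketch I put together (a toy asymmetric, non-saturating uint8 mapping over the full observed range; the exact codes depend on the assumed weight range and differ from the figure, but the collapsing effect is the point):

```python
import numpy as np

def quantize_uint8(w):
    # Asymmetric, non-saturating uint8 quantization over the observed range of w.
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0
    zero_point = round(-w_min / scale)
    return np.clip(np.round(w / scale + zero_point), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
# Most weights live in [0, 1]; a few outliers sit in [99, 100] (a ~100x spread).
w = np.concatenate([rng.uniform(0.0, 1.0, 1000), rng.uniform(99.0, 100.0, 10)])
q = quantize_uint8(w)

print(np.unique(q[:1000]))  # the 1000 "normal" weights collapse onto only a handful of codes
print(np.unique(q[1000:]))  # the outliers occupy the few codes near 255
```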

Original post: blog.csdn.net/tangshopping/article/details/112979152