Model quantization: principle and a tflite example

Model quantization

What is quantization

Model weights are generally float32 data; quantization converts them to int8. There are actually many quantization schemes. The mainstream ones are int8/fp16 quantization, but there are others, such as:

  • Binary neural networks: networks whose weights and activations are binary at run time, while gradients are computed in full precision during training.
  • Ternary weight networks: neural networks with weights constrained to +1, 0 and -1.
  • XNOR networks: both the filters and the inputs of convolutional layers are binary. XNOR networks approximate convolutions mainly with binary operations.
Many frameworks and tools now provide quantization functionality, for example NVIDIA TensorRT, Xilinx DNNDK, TensorFlow, PyTorch and MXNet.
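For the first two schemes, here is a minimal numpy sketch of what binarizing and ternarizing a weight tensor looks like. This is a toy illustration of the idea, not code from any of the frameworks above, and the threshold value is an arbitrary choice:

import numpy as np

def binarize(w):
    # Keep only the sign of each weight, scaled by the mean magnitude
    # (the scaling trick used by XNOR-style binary networks).
    return np.sign(w) * np.abs(w).mean()

def ternarize(w, threshold=0.05):
    # Constrain each weight to +1, 0 or -1 by thresholding.
    t = np.zeros_like(w)
    t[w > threshold] = 1.0
    t[w < -threshold] = -1.0
    return t

w = np.random.normal(0.0, 0.1, size=(3, 3)).astype(np.float32)
print(binarize(w))
print(ternarize(w))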

Advantages and disadvantages of quantization

The advantages of quantization are quite obvious: int8 uses less memory and computes faster, so a quantized model runs better on low-power embedded devices and can be deployed on mobile phones, in autonomous driving, and so on.
The disadvantage is equally obvious: quantization loses precision, which lowers the model's accuracy.

The principle of quantization

First, look at how computers store floating-point and fixed-point numbers:


The most negative exponent determines the smallest absolute value of a non-zero number a float can express, while the most positive exponent determines the largest absolute value it can express; in other words, the exponent determines the range of floating-point numbers.
A float can range from roughly \(-2^{128}\) to \(+2^{128}\), so the distribution range of floats is extremely wide.
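To make the sign/exponent/mantissa layout concrete, here is a small Python sketch that splits a float32 into its three fields. It is a toy inspection helper of my own, not from any quantization library:

import struct

def float32_fields(x):
    # Pack as big-endian float32, reinterpret the 4 bytes as a uint32,
    # then split: 1 sign bit, 8 exponent bits, 23 mantissa bits.
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the bias of 127
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

print(float32_fields(1.0))    # (0, 0, 0)
print(float32_fields(-0.5))   # (1, -1, 0)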

Coming back to quantization, its essence is: find a mapping that puts float32 values into correspondence with int8 values. The problem is that float32 can express a very wide range, while unsigned int8 can only express [0, 255].
How can 256 numbers represent the huge (not actually infinite, but still enormous) set of floating-point values?

Fortunately, practice has shown that the weights of neural networks tend to be concentrated in a very narrow range, as follows:

So this problem is resolved: we do not need to map the entire range from \(-2^{128}\) to \(+2^{128}\). But even a small interval such as [-1, 1] contains a very large number of representable floats, so inevitably
many floating-point values are mapped to the same int8 integer, causing a loss of precision.
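A quick numpy check of this many-to-one effect, using the affine mapping (scale = 2/255, zero_point = 127) that is derived in the next section:

import numpy as np

xs = np.linspace(-1.0, 1.0, 100000, dtype=np.float32)    # 100000 distinct floats
q = np.clip(np.round(xs / (2.0 / 255.0)) + 127, 0, 255).astype(np.uint8)
print(xs.size, 'floats ->', np.unique(q).size, 'uint8 values')  # 100000 -> 256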

Now for the second question: why is quantization effective? Why does the model's accuracy not drop too much after the weights become int8?
After searching through a lot of material, I found that there is currently no very rigorous theoretical explanation for this.

You may ask why quantization is effective (keeps prediction accuracy sufficiently high), especially when converting FP32 to INT8 loses information. Strictly speaking, a rigorous theoretical justification has not yet appeared. An intuitive explanation is that neural networks are over-parameterized and therefore contain enough redundant information that cropping it away does not cause a significant drop in accuracy. One piece of evidence is that, for a given quantization method, the accuracy gap between the FP32 network and the INT8 network is smaller for large networks than for small ones, because large networks have a higher degree of over-parameterization.

As with many things in deep learning, we often cannot explain why a particular set of parameters works. Quantization is the same: practice shows that the loss of accuracy is not too large, and nobody knows exactly why it works. It just works.

How to quantize

The mapping between float and int8 is given by the following formula:
\(x_{float} = x_{scale} \times (x_{quantized} - x_{zero\_point})\)
where the parameters are determined by (reconstructed from the example below, which pins down both formulas):

\(x_{scale} = \frac{x_{float}^{max} - x_{float}^{min}}{x_{quantized}^{max} - x_{quantized}^{min}}\)

\(x_{zero\_point} = x_{quantized}^{max} - \frac{x_{float}^{max}}{x_{scale}}\)
For example, suppose the weights of the original fp32 model lie in [-1.0, 1.0] and we want to map them to [0, 255]. Then \(x_{scale}=2/255\) and \(x_{zero\_point}=255-1/(2/255)=127\) (taking the integer part of 127.5).
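A minimal numpy sketch of this scheme (asymmetric uint8 quantization, matching the formulas above; names like quant_params are my own, not a library API):

import numpy as np

def quant_params(fmin, fmax, qmin=0, qmax=255):
    # Parameters for x_float = scale * (x_quantized - zero_point).
    scale = (fmax - fmin) / (qmax - qmin)
    zero_point = int(qmax - fmax / scale)   # 127.5 truncated to 127, as above
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

scale, zp = quant_params(-1.0, 1.0)
print(scale, zp)                                      # 0.00784... 127
w = np.array([-1.0, 0.0, 1.0], dtype=np.float32)
print(quantize(w, scale, zp))                         # [  0 127 255]
print(dequantize(quantize(w, scale, zp), scale, zp))  # ≈ [-0.996  0.  1.004]

Note that the round trip does not recover the inputs exactly; that residual is the precision loss discussed earlier.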

Multiplication and addition on quantized values:
Continuing with the example above, we have the mappings 0.0 → 127 and 1.0 → 255.
Now take the original product 0.0 × 1.0 = 0.0. Note that we do not simply compute 127 × 255 and convert the result back with the formula; that would give float = (2/255) × (127 × 255 − 127) = 253, which is clearly wrong.
Instead, substitute \(x_{float} = x_{scale}(x_{quantized} - x_{zero\_point})\) for each operand of \(z_{float} = x_{float} \times y_{float}\) and solve for the quantized result:

\(z_{quantized} = z_{zero\_point} + \frac{x_{scale} \, y_{scale}}{z_{scale}} (x_{quantized} - x_{zero\_point})(y_{quantized} - y_{zero\_point})\)

If we assume every layer's data has the same distribution, so that all scales and zero points are equal, this formula gives \(z_{quantized}=127\); converting back to float32 yields 0.0, as expected. (A runnable sketch of both quantized multiplication and addition follows the addition formula below.)

Addition works the same way; substituting into \(z_{float} = x_{float} + y_{float}\) gives:

\(z_{quantized} = z_{zero\_point} + \frac{x_{scale}}{z_{scale}}(x_{quantized} - x_{zero\_point}) + \frac{y_{scale}}{z_{scale}}(y_{quantized} - y_{zero\_point})\)
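A numeric sketch of both operations, under the same simplifying assumption that every tensor shares scale = 2/255 and zero_point = 127. Real implementations keep everything in integer arithmetic with a fixed-point multiplier; plain Python floats are used here only for clarity:

scale, zp = 2.0 / 255.0, 127          # shared by x, y and z by assumption

def qmul(xq, yq):
    # z_q = z_zp + (x_s*y_s/z_s) * (x_q - x_zp) * (y_q - y_zp)
    return zp + scale * (xq - zp) * (yq - zp)

def qadd(xq, yq):
    # z_q = z_zp + (x_s/z_s)*(x_q - x_zp) + (y_s/z_s)*(y_q - y_zp)
    return zp + (xq - zp) + (yq - zp)

xq, yq = 127, 255                     # the quantized codes of 0.0 and 1.0
print(qmul(xq, yq), scale * (qmul(xq, yq) - zp))   # 127.0 0.0    -> 0.0 * 1.0
print(qadd(xq, yq), scale * (qadd(xq, yq) - zp))   # 255 ≈1.004   -> 0.0 + 1.0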
tflite_convert

Daily rant: tensorflow sucks. If tensorflow had not been developed by a big company, it would never have become this popular. The documentation is chaotic, voluminous and messy, and the API is hard to understand and hard to use.

In tensorflow, model quantization is done with tflite_convert. Usage:

tflite_convert \
  --output_file=/tmp/foo.cc \
  --graph_def_file=/tmp/mobilenet_v1_0.50_128/frozen_graph.pb \
  --inference_type=QUANTIZED_UINT8 \
  --input_arrays=input \
  --output_arrays=MobilenetV1/Predictions/Reshape_1 \
  --default_ranges_min=0 \
  --default_ranges_max=6 \
  --mean_values=128 \
  --std_dev_values=127
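The same conversion can also be driven from Python. A sketch assuming TensorFlow 1.x, where tf.lite.TFLiteConverter.from_frozen_graph and the attributes below are available (check the docs for your version):

import tensorflow as tf   # assumes TF 1.x

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file='/tmp/mobilenet_v1_0.50_128/frozen_graph.pb',
    input_arrays=['input'],
    output_arrays=['MobilenetV1/Predictions/Reshape_1'])
converter.inference_type = tf.uint8                       # QUANTIZED_UINT8
converter.quantized_input_stats = {'input': (128, 127)}   # (mean, std_dev)
converter.default_ranges_stats = (0, 6)                   # fallback (min, max)
with open('/tmp/foo.tflite', 'wb') as f:
    f.write(converter.convert())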

Official guide: https://www.tensorflow.org/lite/convert/cmdline_examples
For a description of each parameter, see:
https://www.tensorflow.org/lite/convert/cmdline_reference

The parameters mean_values and std_dev_values are rather confusing. The tf documentation describes them in three different forms:

  • (mean, std_dev)
  • (zero_point, scale)
  • (min,max)
The conversion relations between the forms are:
std_dev = 1.0 / scale
mean = zero_point

mean = 255.0*min / (min - max)
std_dev = 255.0 / (max - min)

Conclusion:
When the model's input tensor took values in the following ranges during training, the corresponding mean_values and std_dev_values are:

  • range (0,255) then mean = 0, std_dev = 1
  • range (-1,1) then mean = 127.5, std_dev = 127.5
  • range (0,1) then mean = 0, std_dev = 255
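These three cases are just the (min, max) → (mean, std_dev) formulas above evaluated at the given ranges. A quick sketch to verify them; TFLite then recovers the float input as (uint8_value - mean) / std_dev, since mean = zero_point and std_dev = 1/scale:

def mean_std_from_range(fmin, fmax):
    # The conversion given in the tf docs.
    mean = 255.0 * fmin / (fmin - fmax)
    std_dev = 255.0 / (fmax - fmin)
    return mean, std_dev

for rng in [(0, 255), (-1, 1), (0, 1)]:
    print(rng, '->', mean_std_from_range(*rng))
# (0, 255) -> (0.0, 1.0)
# (-1, 1)  -> (127.5, 127.5)
# (0, 1)   -> (-0.0, 255.0)   (i.e. mean 0)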

References:
https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd
https://stackoverflow.com/questions/54830869/understanding-tf-contrib-lite-tfliteconverter-quantization-parameters/58096430#58096430
https://arleyzhang.github.io/articles/923e2c40/
https://zhuanlan.zhihu.com/p/79744430
https://zhuanlan.zhihu.com/p/58182172
