Quantization

Copyright: original work, welcome to share! Please include this link when forwarding: https://blog.csdn.net/qq_26369907/article/details/90035538

Commonly used methods for model compression include pruning, decomposition, distillation, quantization, and lightweight network architectures. These are my summary notes from learning about quantization.

Background: current neural networks typically store their weights and network structure in floating-point format, which is the simplest and most effective way to preserve model accuracy, and GPUs can be used to accelerate the computation. However, as models grow, the computational and storage load grows in proportion. Quantization can effectively alleviate this problem by storing values and performing calculations in a format more compact than 32-bit floating point.

Feasibility: low-precision computation can be regarded as just another source of noise (to be confirmed).

Purpose: to reduce the storage space occupied by the model.

Quantization method and a worked example (using 8 bits):
Method 1: store the minimum and maximum values of each layer, then compress each floating-point value into an 8-bit integer. The range between the minimum and the maximum is divided linearly into 256 segments, each represented by a unique 8-bit integer standing for the real values that fall into that segment; the values are converted back to floating point for computation.
Example: a layer's parameters have minimum -10 and maximum 10, and the range is divided into 256 segments. Segment 0 represents -10 and segment 256 represents 10, so segment 128 represents the value 0, segment 64 represents -5, and so on.
The relationship between a floating-point value and its segment number is:
N = (X - X_min) / (X_max - X_min) × 256
where X is the floating-point value, N is the segment number, and X_min and X_max are the stored minimum and maximum of the layer.
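As a quick sanity check (my own arithmetic, not part of the original post), plugging X = -5 from the example into this formula gives N = (-5 - (-10)) / (10 - (-10)) × 256 = 5/20 × 256 = 64, which matches the segment number listed above.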
[Table: network parameters stored as floating-point values (left) and the corresponding segment numbers after quantization (right)]
In the table, the left column contains the real network parameters (floating point) and the right column contains the segment numbers after quantization. Storage drops to 1/4 of the original size (8-bit integers instead of 32-bit floats). When the model is run, the segment numbers are converted back to floating point with the following formula:
X = X_min + N × (X_max - X_min) / 256
where X is the recovered floating-point value and N is the segment number.
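To make the scheme concrete, here is a minimal NumPy sketch of the quantize/dequantize round trip described above (the function names, the rounding, and the clipping of segment 256 down to 255 so the result fits in 8 bits are my own assumptions, not from the original post):

```python
import numpy as np

def quantize(x, x_min, x_max, num_segments=256):
    """Map floats in [x_min, x_max] to 8-bit segment numbers."""
    n = (x - x_min) / (x_max - x_min) * num_segments
    # Round to the nearest segment and clip so the result fits in 8 bits (0..255).
    return np.clip(np.round(n), 0, 255).astype(np.uint8)

def dequantize(n, x_min, x_max, num_segments=256):
    """Map 8-bit segment numbers back to approximate float values."""
    return x_min + n.astype(np.float32) * (x_max - x_min) / num_segments

# Values based on the example in the text: layer parameters in [-10, 10].
weights = np.array([-10.0, -5.0, 0.0, 2.5, 10.0], dtype=np.float32)
q = quantize(weights, -10.0, 10.0)      # -> [0, 64, 128, 160, 255]
restored = dequantize(q, -10.0, 10.0)   # -> [-10.0, -5.0, 0.0, 2.5, ~9.92]
print(q)
print(restored)
```

Note that the per-layer minimum and maximum must be stored alongside the 8-bit values; without them the segment numbers cannot be converted back to floating point.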
