Commonly used methods for model compression: pruning, decomposition, distillation, quantization, and lightweight network architectures. These are notes taken while learning about quantization.
Background : Current neural networks typically store their weights in floating-point format, which is the simplest way to maintain model accuracy, and GPUs can be used to accelerate the computation. However, as models grow larger, the computation they require grows in proportion. Quantization can effectively address this problem: it stores values and performs calculations in a format more compact than 32-bit floating point.
Feasibility : low-precision computation can be treated as just another source of noise (to be confirmed)
Role : to reduce the space occupied by the model.
Quantization method and a worked example (the example uses 8 bits) :
Method 1: store the minimum and maximum value of each layer, then compress each floating-point value into an 8-bit integer. The range between the minimum and the maximum is divided linearly into 256 bins; each bin is represented by a unique 8-bit integer and stands for the real values that fall in that interval. The integers are converted back to floating point when the computation is run;
Example: suppose a layer's parameters have minimum -10 and maximum 10. The range is divided into 256 bins: bin 0 represents -10 and bin 255 represents 10, so bin 128 represents approximately 0, bin 64 represents approximately -5, and so on.
The relationship between a floating-point value and its bin number is:

N = round( (X - Xmin) / (Xmax - Xmin) × 255 )

where X is the floating-point value, N is the bin number, and Xmin, Xmax are the stored minimum and maximum of the layer.
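The forward quantization step can be sketched in Python (a minimal illustration of the formula above; the function name `quantize` is my own, not from any library):

```python
def quantize(x, x_min, x_max, bits=8):
    """Map a float in [x_min, x_max] to one of 2**bits linear bins."""
    levels = 2 ** bits - 1  # 255 for 8 bits: bin indices run 0..255
    x = min(max(x, x_min), x_max)  # clamp into the layer's range
    # Position of x within the range, scaled onto the integer bins.
    return round((x - x_min) / (x_max - x_min) * levels)

# With the layer range [-10, 10] from the example:
print(quantize(-10, -10, 10))  # bin 0
print(quantize(-5, -10, 10))   # bin 64
print(quantize(0, -10, 10))    # bin 128
print(quantize(10, -10, 10))   # bin 255
```

This reproduces the correspondences from the example: -10 maps to bin 0, -5 to bin 64, 0 to bin 128, and 10 to bin 255.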
The left column of the table holds the network's real parameters (floating point); the right column holds the bin numbers after quantization. Storage drops to 1/4 of the original. When the model runs, the bin numbers are converted back to floating point with the inverse formula:

X = Xmin + N × (Xmax - Xmin) / 255

where X is the floating-point value and N is the bin number.