Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective
Paper address: https://arxiv.org/pdf/2303.11906.pdf
Project address: https://github.com/bytedance/MRECG
Foreword
Post-training quantization (PTQ) is widely regarded as one of the most effective compression methods in practice, since it preserves data privacy and has a low computational cost
The authors argue that the oscillation problem has been neglected by existing PTQ methods
In this paper, they explore it in depth and give a theoretical proof of why such problems are common in PTQ
They then attempt to address the problem by introducing a principled, generalized theoretical framework
Oscillation in PTQ is first formulated, and the problem is shown to be caused by differences in module capacity
To this end, module capacity (ModCap) is defined for both data-dependent and data-free cases, and the capacity difference between adjacent modules is used to measure the degree of oscillation
The problem is then solved by selecting the Top-k differences, whose corresponding modules are jointly optimized and quantized
Extensive experiments show that our approach successfully reduces performance degradation and generalizes to different neural network and PTQ methods
For 2/4-bit ResNet-50 quantization, our method outperforms the previous state-of-the-art by 1.9%
The advantage is even more pronounced for small-model quantization, e.g. 6.61% higher than BRECQ on MobileNetV2×0.5
Introduction
Deep neural networks (DNNs) have rapidly become a research hotspot in recent years and are applied in a wide range of practical scenarios
However, better model performance usually comes with the huge resource consumption of deeper and wider networks
At the same time, research on neural network compression and acceleration, which aims to deploy models in resource-constrained scenarios, is attracting increasing attention,
including but not limited to neural architecture search, network pruning, and quantization
Among these methods, quantization proposes to convert floating-point network activations and weights to low-bit fixed-point, which can speed up inference or training with little performance degradation.
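As a toy illustration of this conversion (a minimal sketch, not from the paper), the snippet below fake-quantizes a tensor to 4-bit fixed-point and back, which is how PTQ pipelines typically simulate low-bit inference:

```python
import torch

def uniform_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Fake-quantize a float tensor to n_bits fixed-point and back.

    A minimal asymmetric (affine) scheme for illustration only; real PTQ
    toolchains calibrate scale/zero-point per tensor or per channel.
    """
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x.min() / scale).clamp(qmin, qmax)
    q = torch.round(x / scale + zero_point).clamp(qmin, qmax)  # integer grid
    return (q - zero_point) * scale  # dequantized values

x = torch.randn(64)
x_q = uniform_quantize(x, n_bits=4)  # at most 16 distinct values remain
```

The rounding and clamping steps are exactly where the quantization error that PTQ reconstruction later tries to compensate comes from.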
Generally, quantization methods are divided into quantization-aware training (QAT) and post-training quantization (PTQ)
The former reduces quantization error through quantization-aware fine-tuning
Despite remarkable results, massive data requirements and high computational costs hinder the widespread deployment of DNNs, especially on resource-constrained devices
PTQ was therefore proposed to address this problem: it requires only a small amount of calibration data, or none at all, for model reconstruction
Since there is no iterative process of quantization training, the PTQ algorithm is very efficient, and quantized models can be obtained and deployed usually within minutes
However, this efficiency often comes at the expense of accuracy
PTQ often performs worse than the corresponding full-precision model, especially for low-bit quantization of compact models
Nagel et al. (AdaRound) construct a new optimization objective via a second-order Taylor expansion of the loss before and after quantization, introducing a soft quantizer with learnable parameters to achieve adaptive weight rounding
Li et al. (BRECQ) change layer-by-layer reconstruction to block-by-block reconstruction and use the diagonal Fisher matrix to approximate the Hessian, preserving more information
Wei et al. (QDrop) find that randomly disabling some elements of activation quantization flattens the loss landscape of the quantized weights
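The random-disabling idea can be sketched in a few lines (an illustrative reimplementation, not the authors' code): during reconstruction, each activation element uses its quantized value only with probability p.

```python
import torch

def qdrop_activation(x_fp: torch.Tensor, x_q: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Randomly mix quantized and full-precision activations element-wise.

    Sketch of the idea attributed to Wei et al.: with probability p an element
    uses its quantized value, otherwise the full-precision one, which smooths
    the loss landscape seen by the quantized weights during reconstruction.
    """
    mask = (torch.rand_like(x_fp) < p).float()
    return mask * x_q + (1.0 - mask) * x_fp
```

With p=1.0 this reduces to ordinary activation quantization; with p=0.0 the activations stay at full precision.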
During reconstruction, the authors observe that all the above methods exhibit varying degrees of oscillation as layers or blocks deepen
According to the authors, this problem is critical yet has been neglected by previous PTQ approaches
In this paper, three questions about the oscillation problem are answered through rigorous mathematical definitions and proofs:
- Why does PTQ oscillate?
- How will the oscillation affect the final performance?
- How to solve the oscillation problem in PTQ?
The contributions of this paper are as follows:
- For the first time, the oscillation problem in PTQ is revealed, a problem neglected by previous algorithms, and eliminating this oscillation is shown to be crucial for PTQ optimization.
- It is proved theoretically that such oscillation is caused by the capacity difference between neighboring modules: a small module capacity exacerbates the cumulative quantization error and makes the loss rise rapidly, while a large module capacity absorbs the accumulated quantization error and lowers the loss.
- A novel Mixed REConstruction Granularity (MRECG) method is proposed, which uses loss metrics and module capacity to optimize the mixed reconstruction granularity in data-dependent and data-free scenarios. The former finds a global optimum at moderately higher overhead and thus performs best; the latter is more efficient, at a slight cost in performance.
- The effectiveness of the method is verified on various compression tasks on ImageNet. In particular, 58.49% Top-1 accuracy is achieved with 2/4-bit MobileNetV2, greatly exceeding current SOTA methods. It is also confirmed that the algorithm eliminates the oscillation of the reconstruction loss across different models, making the reconstruction process more stable.
Related Work
Module Capacity
Some common parameters affect the module capacity, such as the size of the filter, the bit width of the weight parameter and the number of convolution groups
Research shows that stride and residual links also affect module capacity
Kong et al. showed that a convolution with stride 2 can be equivalently replaced by a convolution with stride 1
The replacement convolution has a larger filter size than the original one, which implies an increase in module capacity
MobileNetV2 contains depthwise convolutions, which perform no information exchange across channels and therefore hurt model performance to some extent
According to Liu et al., the input of the full-precision residual link increases the expressive power of the quantization module
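To make these factors concrete, here is a hypothetical data-free capacity proxy; the name `modcap_proxy` and the exact weighting are assumptions for illustration and differ from the paper's ModCap definition:

```python
def modcap_proxy(c_in: int, c_out: int, kernel: int, bits: int,
                 groups: int = 1, residual: bool = False) -> float:
    """A hypothetical data-free proxy for module capacity.

    Not the paper's exact ModCap formula -- just the factors the text lists:
    larger filters and bit widths raise capacity, more groups (e.g. depthwise
    convolution) lower it, and a full-precision residual input adds capacity.
    """
    cap = (c_in // groups) * c_out * kernel * kernel * bits
    if residual:
        cap *= 1.5  # arbitrary bonus for the FP residual link (assumption)
    return cap

# A depthwise conv (groups == channels) has far less capacity than a dense one:
dense = modcap_proxy(64, 64, 3, bits=4)
depthwise = modcap_proxy(64, 64, 3, bits=4, groups=64)
```

Under this proxy, the depthwise layers of MobileNetV2 and low weight bit widths both show up as small capacities, matching the factors listed above.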
Proposed Method
First, a theorem and a corollary prove that the oscillation problem of PTQ is closely related to module capacity
Second, a capacity-difference optimization problem is constructed, and two solutions are given for the data-dependent and data-free cases, respectively
Finally, it is analyzed that enlarging the batch size of the calibration data reduces the expected approximation error, but with diminishing marginal utility
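The diminishing-marginal-utility claim is easy to reproduce numerically: the error of a batch-mean estimate shrinks roughly as 1/√B, so each doubling of the calibration batch size B buys less. A small Monte-Carlo sketch on synthetic data (not the paper's experiment):

```python
import numpy as np

# Monte-Carlo sketch: estimate E[loss] from a calibration batch of size B and
# measure how far the batch mean lands from the true mean (1.0).
rng = np.random.default_rng(0)
population = rng.normal(loc=1.0, scale=0.5, size=1_000_000)  # stand-in "losses"

errors = {}
for B in (8, 32, 128, 512):
    trials = [abs(rng.choice(population, B).mean() - 1.0) for _ in range(200)]
    errors[B] = float(np.mean(trials))

print(errors)  # mean |error| shrinks with B, with clearly diminishing returns
```

Going from B=8 to B=32 removes most of the error; going from B=128 to B=512 barely helps, which is the diminishing-returns trend the analysis describes.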
PTQ Oscillation Problem
Without loss of generality, modules are used as the basic unit of analysis
In particular, unlike BRECQ, the module granularity in this paper is more flexible: it can represent a layer, a block, or even a stage
Under this module-granularity framework, a more general reconstruction loss is proposed:

$\min_{\hat{w}^{(i)}} \left\| f^{(i)}(w^{(i)}, x^{(i)}) - f^{(i)}(\hat{w}^{(i)}, \hat{x}^{(i)}) \right\|_F^2$

where:

- $f^{(i)}$ is the i-th module of the neural network, containing n convolutional layers
- $w^{(i)}$ and $x^{(i)}$ are the weights and input of the i-th module, respectively
- $\hat{w}^{(i)}$ and $\hat{x}^{(i)}$ are the corresponding quantized versions

When $f^{(i)}$ contains only one convolutional layer, the equation degenerates into the optimization objective of AdaRound
When $f^{(i)}$ contains all convolutional layers of the i-th module, the equation degenerates into the optimization objective of BRECQ
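As a sketch, this module-wise reconstruction loss can be computed as follows (a minimal PyTorch illustration, assuming `fp_module` and `q_module` are the full-precision and quantized versions of the same module):

```python
import torch
import torch.nn as nn

def reconstruction_loss(fp_module: nn.Module, q_module: nn.Module,
                        x_fp: torch.Tensor, x_q: torch.Tensor) -> torch.Tensor:
    """Squared error between a module's FP output and its quantized output.

    If the module is a single conv layer this matches AdaRound's layer-wise
    objective; if it is a whole block it matches BRECQ's block-wise one.
    """
    with torch.no_grad():
        target = fp_module(x_fp)          # f(w, x): full-precision reference
    out = q_module(x_q)                   # f(w_hat, x_hat): quantized forward
    return (target - out).pow(2).mean()   # MSE reconstruction loss
```

Reconstruction then minimizes this loss over the quantization parameters of `q_module`, one module at a time.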
Two modules are said to be equivalent if they have the same number of convolutional layers and all associated convolutional layers have the same hyperparameters
Quantization errors accumulate: the quantization error of a layer grows under the influence of the quantization of the preceding layers
Since PTQ involves no quantization training, this cumulative effect shows up directly in the reconstruction loss
The paper explains that the cumulative effect of quantization error in PTQ leads to an incrementally increasing loss
The authors show that if two adjacent modules are equivalent, the accumulation of quantization error necessarily increases the loss; in practice, however, exact equivalence is rarely satisfied
If two adjacent modules instead merely share the same topology, then when the capacity of the latter module is large enough the loss decreases, and vice versa
The paper also explains why oscillation occurs in PTQ
Using the figure above, the authors show that the more severe the oscillation in an algorithm's loss distribution, the larger its peak loss
That is, the degree of oscillation is positively correlated with the maximum error
A further figure shows that, across models and algorithms, the maximum error is positively correlated with the final error
Meanwhile, following BRECQ and AdaRound, the authors perform a Taylor expansion of the task loss and conclude that the final error affects model accuracy
In short, extensive experiments show that the degree of oscillation of the error is negatively correlated with accuracy: the lower the oscillation, the higher the accuracy
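A toy metric for this observation (not the paper's formal definition) is to count how often the block-wise loss curve changes direction and to track its peak:

```python
def oscillation_score(losses):
    """Count direction changes in a block-wise loss curve and its peak.

    Per the observation above, smoother (more monotone) loss curves go with a
    lower peak loss and better final accuracy, so fewer flips is better.
    """
    diffs = [b - a for a, b in zip(losses, losses[1:])]
    flips = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    return flips, max(losses)

smooth = [0.1, 0.2, 0.3, 0.4, 0.5]        # no flips, low peak
oscillating = [0.1, 0.6, 0.2, 0.9, 0.3]   # many flips, high peak
```

On these toy curves the oscillating one has both more direction changes and a larger maximum loss, mirroring the positive correlation reported above.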
Solution
The analysis of the oscillation problem shows that differences in module structure and capacity cause information loss during PTQ quantization, which affects model accuracy
Since there is no training process in PTQ, this information loss cannot be recovered even if the capacity of subsequent modules is increased
Therefore, loss oscillations are smoothed by joint optimization of modules with large capacity differences, thereby reducing accuracy loss
Simply put, if two adjacent modules have the same topology but a large capacity difference, the following selection problem decides which module pairs to optimize jointly,
making the capacities within each jointly optimized group as balanced as possible and thereby reducing the error. In simplified form:

$\max_{m} \sum_i m_i d_i^2 \quad \text{s.t.} \quad \mathbf{1}^{\top} m = k, \; m_i \in \{0, 1\}$

where $d_i$ denotes the capacity difference between the i-th and (i+1)-th modules, and:

- m is a binary mask vector; $m_i = 1$ means the i-th and (i+1)-th modules are optimized jointly
- 1 is the all-ones vector
- k is a hyperparameter controlling the number of jointly optimized module pairs
- λ is a weight balancing the regularization term against the squared capacity-difference objective in the full formulation
(Figure: the overall calculation flow of MRECG.)
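The Top-k selection at the heart of the method can be sketched as follows (NumPy; an illustration under the simplifying assumption that module capacities are given as plain numbers):

```python
import numpy as np

def select_joint_modules(capacities, k):
    """Pick the k adjacent-module pairs with the largest capacity difference.

    A sketch of the Top-k selection: m is a binary mask over the n-1 adjacent
    pairs, with exactly k ones marking pairs to reconstruct jointly.
    """
    caps = np.asarray(capacities, dtype=float)
    diffs = np.abs(np.diff(caps))        # capacity difference of each pair
    m = np.zeros(len(diffs), dtype=int)  # binary mask, 1^T m = k
    m[np.argsort(diffs)[-k:]] = 1        # keep the Top-k differences
    return m

# Pairs (2,3) and (4,5) differ most, so those modules are merged for reconstruction:
mask = select_joint_modules([10, 11, 30, 29, 80, 81], k=2)
```

Each selected pair is then reconstructed as one larger module, which smooths the capacity profile and, per the analysis above, the loss oscillation.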
Experiments
References
- https://arxiv.org/pdf/2303.11906.pdf