Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Paper address: https://arxiv.org/pdf/2303.11906.pdf
Project address: https://github.com/bytedance/MRECG


Foreword

Post-training quantization (PTQ) is widely considered to be one of the most effective compression methods in practice due to its data privacy and low computational cost

The authors argue that the oscillation problem has been neglected in existing PTQ methods

In this paper, the authors explore this issue and present a theoretical proof of why such oscillations are common in PTQ

The authors then address this problem by introducing a principled, generalized theoretical framework

Oscillations in PTQ are first formulated and the problem is shown to be caused by differences in module capacities

To this end, the module capacity (ModCap) is defined for both data-dependent and data-free cases, where the difference between adjacent modules is used to measure the degree of oscillation

The problem is then solved by selecting Top-k differences, where the corresponding modules are jointly optimized and quantized

Extensive experiments show that our approach successfully reduces performance degradation and generalizes to different neural network and PTQ methods

For 2/4-bit ResNet-50 quantization, our method outperforms the previous state-of-the-art by 1.9%

The gain is even more significant for small-model quantization, e.g., 6.61% higher than the BRECQ method on MobileNetV2×0.5


Introduction

Deep neural networks (DNNs) have rapidly become a research hotspot in recent years and have been applied to a wide range of practical scenarios

However, with the development of DNNs, better model performance usually comes with the huge resource consumption of deeper and wider networks

At the same time, research on neural network compression and acceleration, which aims to deploy models in resource-constrained scenarios, is gradually attracting more attention,

including but not limited to neural architecture search, network pruning, and quantization

Among these methods, quantization proposes to convert floating-point network activations and weights to low-bit fixed-point, which can speed up inference or training with little performance degradation.
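
As a quick illustration of what converting to low-bit fixed-point means in practice, here is a minimal sketch of uniform asymmetric fake quantization in PyTorch; the function name and the 4-bit setting are illustrative choices, not part of the paper.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Uniformly quantize x to num_bits and immediately dequantize ("fake" quantization)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)   # quantization step size
    zero_point = torch.round(-x.min() / scale).clamp(qmin, qmax)  # offset for the asymmetric range
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

w = torch.randn(64, 32, 3, 3)                 # e.g. a convolution weight tensor
print((w - fake_quantize(w)).abs().mean())    # average 4-bit quantization error
```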

Generally, quantization methods are divided into quantization-aware training (QAT) and post-training quantization (PTQ)

The former reduces the quantization error by fine-tuning the network with quantization in the loop

Despite remarkable results, massive data requirements and high computational costs hinder the widespread deployment of DNNs, especially on resource-constrained devices

Therefore, PTQ is proposed to solve the above problem; it requires only a small amount of calibration data, or none at all, for model reconstruction

Since there is no iterative quantization training, the PTQ algorithm is very efficient, and a quantized model can usually be obtained and deployed within minutes

However, this efficiency often comes at the expense of accuracy

PTQ often performs worse than the full-precision model, especially for low-bit quantization of compact models

Nagel et al. construct a new optimization objective via a second-order Taylor expansion of the loss before and after quantization, and introduce soft quantization with learnable parameters to achieve adaptive weight rounding (AdaRound)

Li et al. change layer-by-layer reconstruction to block-by-block reconstruction and use the diagonal Fisher matrix to approximate the Hessian matrix to preserve more information (BRECQ)

Wei et al. find that randomly disabling some elements of the activation quantization flattens the loss landscape with respect to the quantized weights (QDrop)
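
A rough sketch of that random-disabling idea, reusing the `fake_quantize` helper from the sketch above; the drop probability `p` and the function name are illustrative, and this is a simplified reading rather than the authors' implementation.

```python
import torch

def drop_activation_quant(x: torch.Tensor, p: float = 0.5, num_bits: int = 4) -> torch.Tensor:
    """Keep each activation element at full precision with probability p and
    quantize the rest, so reconstruction sees a mixture of both."""
    x_q = fake_quantize(x, num_bits)       # quantized activations
    keep_fp = torch.rand_like(x) < p       # elements whose quantization is disabled
    return torch.where(keep_fp, x, x_q)
```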

[Figure: block-wise reconstruction loss of the above PTQ methods, which oscillates as layers or blocks deepen]
During reconstruction, the authors observe that all the above methods exhibit varying degrees of oscillation as layers or blocks deepen

According to the authors, this problem is critical yet has been neglected in previous PTQ approaches

In this paper, three questions about the oscillation problem are answered through rigorous mathematical definitions and proofs:

  1. Why does PTQ oscillate?
  2. How will the oscillation affect the final performance?
  3. How to solve the oscillation problem in PTQ?

The contributions of this paper are as follows:

For the first time, the oscillation problem in PTQ is revealed, which was neglected in previous algorithms, and it is found that eliminating this oscillation is crucial in PTQ optimization.

It is theoretically shown that such oscillations are caused by differences in the capacities of neighboring modules. A small module capacity exacerbates the cumulative effect of quantization error, causing the loss to increase rapidly, while a large module capacity reduces the cumulative quantization error and lowers the loss.

A novel mixed reconstruction granularity (MRECG) method is proposed, which uses loss metrics and module capacity to optimize the mixed reconstruction granularity in data-dependent and data-free scenarios. The former finds a global optimum with moderately higher overhead and thus has the best performance; the latter is more efficient, with a slight drop in performance.

The effectiveness of the proposed method is verified on various compression tasks on ImageNet. In particular, a Top-1 accuracy of 58.49% is achieved on 2/4-bit MobileNetV2, which greatly exceeds current SOTA methods. In addition, it is confirmed that the algorithm indeed eliminates the oscillation of the reconstruction loss on different models, making the reconstruction process more stable.


Related Work

Module Capacity

Several common factors affect module capacity, such as the filter size, the bit width of the weight parameters, and the number of convolution groups

Research shows that the stride and residual connections also affect module capacity

Kong et al. showed that a convolution with a stride of 2 can be equivalently replaced by a convolution with a stride of 1

At the same time, the filter size of the replacement convolution is larger than that of the original convolution, which means an increase in module capacity

MobileNetV2 contains depthwise convolutions, which have no information exchange across channels and therefore hurt model performance to some extent

According to Liu et al., a full-precision residual-connection input increases the expressive power of the quantized module


Method in This Paper

First, through a theorem and its corollaries, it is proved that the oscillation problem in PTQ is highly related to module capacity

Second, a capacity-difference optimization problem is constructed, and two solutions are given for the data-dependent and data-free cases, respectively

Finally, it is analyzed that enlarging the batch size of the calibration data reduces the expected approximation error, but with diminishing marginal utility

PTQ Oscillation Problem

Without loss of generality, modules are used as the basic unit of analysis

In particular, unlike BRECQ, the module granularity in this paper is more flexible: it can represent a layer, a block, or even a stage

A more general reconstruction loss is formulated under this module-granularity framework as follows:

$$\min_{\hat{w}^{(i)}}\ \mathbb{E}\left[\left\| f^{(i)}\big(w^{(i)},\, x^{(i)}\big) - f^{(i)}\big(\hat{w}^{(i)},\, \hat{x}^{(i)}\big) \right\|_F^2\right]$$
where:

  • f^(i) is the i-th module of the neural network, containing n convolutional layers
  • w^(i) and x^(i) are the weights and inputs of the i-th module, respectively
  • ŵ^(i) and x̂^(i) are the corresponding quantized weights and inputs

When f contains only one convolutional layer, the equation degenerates into the optimization function in AdaRound

When f contains all convolutional layers of a block, the equation degenerates into the optimization function in BRECQ
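
A minimal sketch of this module-wise reconstruction loss, assuming `fp_module` and `quant_module` are the full-precision and quantized versions of the i-th module and `x` / `x_q` are their calibration inputs; the names are placeholders, not the paper's API.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(fp_module: torch.nn.Module,
                        quant_module: torch.nn.Module,
                        x: torch.Tensor,
                        x_q: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius distance between f(w, x) and f(w_hat, x_hat), averaged over the batch."""
    with torch.no_grad():
        target = fp_module(x)       # full-precision module output (kept fixed)
    pred = quant_module(x_q)        # quantized module output (rounding parameters are learned)
    return F.mse_loss(pred, target)
```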

Two modules are said to be equivalent if they have the same number of convolutional layers and all associated convolutional layers have the same hyperparameters

Quantization errors accumulate: because each layer is affected by the quantization of the preceding layers, the quantization error in the network shows an increasing trend

Since PTQ does not include quantization training, this cumulative effect is more evident in the reconstruction loss

The paper explains that the cumulative effect of quantization error in PTQ leads to an increasing loss
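
To make this cumulative effect concrete, the sketch below quantizes the weights of a sequential model layer by layer (reusing the `fake_quantize` helper from earlier) and records the output error after each step; it is only an illustration of the trend, not the paper's measurement code.

```python
import copy
import torch

def cumulative_error_curve(model: torch.nn.Sequential, x: torch.Tensor, num_bits: int = 4):
    """Quantize layer weights one by one and track how the output error grows."""
    q_model = copy.deepcopy(model)
    errors = []
    for layer in q_model:
        if hasattr(layer, "weight"):
            layer.weight.data = fake_quantize(layer.weight.data, num_bits)
        with torch.no_grad():
            errors.append((q_model(x) - model(x)).pow(2).mean().item())
    return errors   # typically an increasing sequence as more layers are quantized
```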

The authors point out that if two adjacent modules are equivalent, the accumulation of quantization error will cause the loss to increase; however, this condition is difficult to satisfy in practice

Second, if two adjacent modules have the same topology and the capacity of the latter module is large enough, the resulting loss will decrease, and vice versa

The paper also explains why oscillation occurs in PTQ:

[Figure: reconstruction loss distributions of different algorithms; more severe oscillation corresponds to a larger peak loss]

The authors use the figure above to show that the more severe the oscillation in an algorithm's loss distribution, the larger its peak loss

That is, the degree of oscillation is positively correlated with the maximum error

[Figure: maximum error versus final error for different models and algorithms]
The figure above further shows that, for different models and algorithms, the maximum error is positively correlated with the final error

Meanwhile, following BRECQ and AdaRound, the authors apply a Taylor expansion to the task loss and conclude that the final error affects the accuracy of the model

In short, extensive experiments show that the degree of oscillation of the error is negatively correlated with accuracy: the lower the oscillation, the higher the accuracy


Solution

The analysis of the oscillation problem shows that differences in module structure and capacity cause information loss during PTQ, which in turn affects the accuracy of the model

Since there is no training process in PTQ, this information loss cannot be recovered even if the capacity of subsequent modules is increased

Therefore, the loss oscillation is smoothed by jointly optimizing modules with large capacity differences, thereby reducing the accuracy loss

Simply put, if two adjacent modules have the same topology but the capacity difference between them is large, the following formula is used

to make the capacities of the jointly optimized modules as close as possible and thus reduce the error:

[Equation: a 0-1 integer program over the binary mask m that selects the Top-k adjacent-module capacity differences for joint optimization, subject to 1ᵀm = k, with a λ-weighted regularization term on the squared capacity differences]
where:

  • m is a binary mask vector; m_i = 1 means that the i-th and (i+1)-th modules are jointly optimized
  • 1 is a vector whose elements are all 1
  • k is a hyperparameter controlling the number of jointly optimized modules
  • λ controls the trade-off between the regularization term and the objective on the squared capacity difference

The calculation process is as follows:
[Figure: the overall calculation flow of the MRECG method]
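
A minimal, data-free sketch of the Top-k selection described above, assuming `modcaps` holds one ModCap value per module; the largest differences between adjacent modules determine which pairs are jointly optimized. The helper is illustrative, not the official MRECG implementation, and it omits the λ regularization term.

```python
import torch

def select_joint_modules(modcaps: list, k: int) -> torch.Tensor:
    """Return a binary mask m where m[i] = 1 means modules i and i+1 are jointly optimized."""
    caps = torch.tensor(modcaps, dtype=torch.float32)
    diffs = (caps[1:] - caps[:-1]).abs()     # capacity differences of adjacent modules
    m = torch.zeros_like(diffs)
    m[diffs.topk(k).indices] = 1.0           # keep only the Top-k largest differences
    return m

# Example: four modules, jointly optimize the two most mismatched adjacent pairs
print(select_joint_modules([1.0, 0.3, 0.9, 2.5], k=2))   # tensor([1., 0., 1.])
```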


Experiments

[Figure: experimental results]


References

  • https://arxiv.org/pdf/2303.11906.pdf
