ZeroQuant-V2 LLM Weight and Activation Quantization

Reference:

ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

Why is 4-bit quantization important?

The case for 4-bit precision: k-bit Inference Scaling Laws

The research in that paper shows that 4-bit is usually the optimal quantization precision: for a fixed quantized model size, the 4-bit model achieves the highest accuracy. Lower bit widths allow a model with more parameters to fit in the same budget, but they also cause larger quantization error, so 4-bit turns out to be the best compromise between the two effects. For the same compressed size, a model with twice the parameters compressed to 4 bits is usually more accurate than a model with half the parameters compressed to 8 bits.
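As a back-of-the-envelope illustration of that trade-off (my own arithmetic, not from the paper), the hypothetical helper `weight_memory_gb` below compares the weight storage of a model twice as large at 4 bits against the smaller model at 8 bits:

```python
# Back-of-the-envelope weight-memory comparison (illustrative only).
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory needed to store `num_params` weights at `bits` bits each."""
    return num_params * bits / 8 / 1e9

# A 13B model at 4-bit and a 6.5B model at 8-bit occupy the same space,
# but the scaling-law study finds the larger 4-bit model is usually more accurate.
print(weight_memory_gb(13e9, 4))   # ~6.5 GB
print(weight_memory_gb(6.5e9, 8))  # ~6.5 GB
```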

Why Activation Quantization Matters

Methods such as GPTQ successfully compress model weights to 4 bits and are already widely used, but they target weight-only quantization. Effective activation quantization methods are still largely missing, mainly because combining weight quantization with activation quantization causes a much larger accuracy loss.

Without activation quantization, however, matrix multiplications and convolutions must still run in float/half precision, and the weights have to be dequantized from integers back to floating point first, which hurts performance and increases memory usage.
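To make that overhead concrete, here is a minimal NumPy sketch (names and shapes are illustrative, not from any released kernel) of what a weight-only path conceptually does: the INT4 codes are dequantized back to floating point before a floating-point matmul, whereas true W4A8 would keep the multiply-accumulate on integer units.

```python
import numpy as np

# Illustrative weight-only path: weights are stored as 4-bit codes (held in an
# int8 array here for simplicity), but the matmul still runs in floating point
# after dequantization.
def weight_only_matmul(x_fp16, w_int4, scale, zero_point):
    w_fp = (w_int4.astype(np.float32) - zero_point) * scale  # dequantize
    return x_fp16.astype(np.float32) @ w_fp                  # fp matmul

x = np.random.randn(2, 8).astype(np.float16)
w_q = np.random.randint(0, 16, size=(8, 4)).astype(np.int8)  # 4-bit codes
y = weight_only_matmul(x, w_q, scale=0.05, zero_point=8)
```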

Contributions of ZeroQuant-V2

The paper offers several valuable insights and proposes a method that improves the accuracy of combined weight-and-activation quantization.

We undertake an exhaustive examination of the impact of PTQ on weight-only, activation-only, and combined weight-and-activation quantization. This investigation incorporates a range of PTQ methods, including round-to-nearest (RTN), GPTQ [12], ZeroQuant [36], and their respective variants. To broaden the scope of our analysis, we focus on two distinct model families, OPT [40] and BLOOM [28], spanning model sizes from 125M to a massive 176B.

In summary, we make the following contributions:
(1) We provide a thorough sensitivity analysis to demonstrate that a) activation quantization is generally more sensitive than weight quantization, and smaller models usually tolerate activation quantization better than larger models; b) different model families show different INT8 activation quantization behaviors: for large models in particular, BLOOM-176B shows only a small accuracy drop (about 1 perplexity point, or PPL), while OPT-30B and OPT-66B degrade noticeably more.

(2) We carry out a detailed evaluation and comparison of current PTQ methods, using optimal configurations to maximize model size reduction while minimizing accuracy impact. We find that existing methods can barely achieve less than 0.1 PPL points of degradation for quantization with either INT4 weights (W4A16) or INT4 weights plus INT8 activations (W4A8). To recover that 0.1 PPL, we push the boundaries with fine-grained quantization (FGQ) techniques. We observe that FGQ can recover the <0.1 PPL degradation for large models (>13B) under INT4 weight quantization, but non-negligible model quality drops remain.

(3) Based on the above understanding, we further optimize existing methods and introduce a technique called Low Rank Compensation (LoRC), which applies low-rank matrix factorization to the quantization error matrix. Complementary to FGQ, LoRC plays a crucial role in recovering full model quality while adding very little to the model size.

Using LoRC on top of the PTQ methods from [36, 12] and fine-grained quantization, we set a new quantization Pareto frontier for LLMs.
Meanwhile, we recommend the following settings for quantizing LLMs with LoRC (note that activation quantization should only be applied if necessary):

(1) For larger models (>10B), fine-grained (block size 64–256) 4-bit weight quantization plus 8-bit activation quantization (block size 64–256) with PTQ can be used for real deployment;

(2) For middle-size models (<10B and >1B), per-row INT8 quantization plus fine-grained (block size 64–256) INT8 activation quantization can be used with PTQ from [12, 36];

(3) For smaller models (<1B), per-row W8A8 (INT8 weight and INT8 activation) RTN is enough based on [36].
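One way to read the three recommendations is as a lookup from model size to recipe. The sketch below is purely illustrative; the helper `recommended_quant_config` and its field names are made up for this note and are not part of the paper or its code release.

```python
def recommended_quant_config(num_params: float) -> dict:
    """Map a model size (in parameters) to the recipe suggested above.
    Purely illustrative; keys and strings are invented for this sketch."""
    if num_params > 10e9:
        return {"weight": "INT4, fine-grained (block 64-256)",
                "activation": "INT8, fine-grained (block 64-256)",
                "method": "PTQ + LoRC"}
    if num_params > 1e9:
        return {"weight": "INT8, per-row",
                "activation": "INT8, fine-grained (block 64-256)",
                "method": "PTQ (GPTQ / ZeroQuant) + LoRC"}
    return {"weight": "INT8, per-row",
            "activation": "INT8, per-token",
            "method": "RTN"}

print(recommended_quant_config(30e9))
```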
 

We employ both symmetric and asymmetric quantization to gauge the quantization sensitivity and highlight the advantage of asymmetric quantization.
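For concreteness, here is a minimal NumPy sketch of symmetric versus asymmetric uniform quantization (my own illustration, not the paper's implementation). On a skewed distribution the asymmetric grid typically gives lower reconstruction error, which is the advantage referred to above.

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    # Symmetric: zero point fixed at 0; the grid covers +/- max(|x|).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantize for error inspection

def quantize_asymmetric(x, bits=4):
    # Asymmetric: a zero point shifts the grid to cover [min, max] exactly.
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

x = np.random.randn(4096) * 0.1 + 0.3            # skewed distribution
for f in (quantize_symmetric, quantize_asymmetric):
    print(f.__name__, np.mean((f(x) - x) ** 2))  # asymmetric usually wins here
```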

Particularly, we implement per-row quantization [12] for weight quantization and per-token quantization for activation [36].
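A sketch of what those two granularities mean (illustrative NumPy, not the released kernels): per-row weight quantization gives each output row of the weight matrix its own scale, while per-token activation quantization gives each token, i.e. each row of the activation matrix, its own scale.

```python
import numpy as np

def quantize_per_row(w, bits=4):
    # One symmetric scale per weight row (output channel).
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

def quantize_per_token(x, bits=8):
    # One symmetric scale per token (row of the activation matrix).
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(x / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

w_q, w_scales = quantize_per_row(np.random.randn(4096, 4096))
x_q, x_scales = quantize_per_token(np.random.randn(16, 4096))
```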
 

Robustness of Weight-only Quantization for Large Models.

INT8 weight-only quantization, either symmetric or asymmetric, results in negligible accuracy loss (less than 0.05 PPL, i.e., Class-1).
For INT4 quantization, the asymmetric method outperforms the symmetric approach in accuracy, attributable to its superior utilization of the quantization range. Interestingly, larger models exhibit better tolerance to low-precision quantization (i.e., INT4) than smaller models, with a few exceptions such as OPT-66B.


Challenge Encountered in Activation Quantization for Large Models.

Activation quantization has consistently proven more difficult than weight quantization.
Compared with weight-only quantization, activation-only quantization shows that asymmetric quantization can significantly improve performance over symmetric quantization. Moreover, contrary to weight-only quantization, smaller models typically tolerate activation quantization better, since their hidden dimension is smaller and their activation dynamic range is narrower than that of larger models [36].

This raises the question of whether existing quantization methods are optimally harnessing the potential to minimize LLM sizes.

Fine-grained Quantization and Its Evaluation
The paper therefore turns to finer-grained quantization schemes [5], in which every block of k elements has its own scaling factor and/or zero point.
For models of considerable size, specifically those equal to or exceeding 1B, the application of such fine-grained activation quantization (Case-1) results in a substantial reduction in quantization error compared to per-row activation (Case-2). By implementing fine-grained activation quantization with weight quantization (Case-3), we are able to almost restore the performance to the level of their W4A16 counterparts.
A trend of superior accuracy is observed with smaller block sizes compared to larger ones. However, the improvement saturates once the block size is smaller than or equal to 256, which corresponds to the number of distinct values INT8 can represent. Even though INT8 can represent 256 distinct values, activation quantization errors persist because uniform quantization is applied.
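A sketch of the block-wise idea (hypothetical `quantize_blockwise` helper, with block handling simplified): every group of `block_size` contiguous elements gets its own scale, so an outlier only distorts its own block.

```python
import numpy as np

def quantize_blockwise(x, block_size=128, bits=8):
    """Fine-grained quantization: one scale per `block_size` contiguous elements.
    Assumes the last dimension is divisible by block_size (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax)
    return q.reshape(x.shape).astype(np.int8), scales

x = np.random.randn(16, 4096).astype(np.float32)
x_q, scales = quantize_blockwise(x, block_size=128)  # 4096/128 = 32 scales per row
```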


LoRC (Low Rank Compensation)

LoRC can be viewed as a supplementary feature to existing quantization methodologies such as RTN, GPTQ, and ZeroQuant-Local/Global, and can be seamlessly integrated with FGQ.
The low-rank dimension m can be as small as 4 or 8.
The two low-rank matrices, Û and V̂, can be quantized to 8 bits without any performance discrepancy.

The combination of fine-grained quantization with LoRC yields the most impressive results, underscoring the efficacy of LoRC when integrated with FGQ. Overall, the results emphasize the benefits of using LoRC for enhanced performance in weight quantization and its compatibility with FGQ. Notably, recovering the last 0.05–0.1 perplexity points can be challenging, but with LoRC we are able to nearly recover the original model quality for INT4 quantization.
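Based on the description above, here is a minimal reconstruction of how the two low-rank matrices could be derived from the quantization error via truncated SVD (my reading of the method, not the released implementation):

```python
import numpy as np

def lorc_factors(w, w_hat, rank=8):
    """Low Rank Compensation sketch: factor the quantization error E = W - W_hat
    into two thin matrices U_hat, V_hat via truncated SVD (rank m = 4..8).
    The paper notes U_hat and V_hat can themselves be stored in 8-bit."""
    e = w - w_hat
    u, s, vt = np.linalg.svd(e, full_matrices=False)
    root_s = np.sqrt(s[:rank])
    u_hat = u[:, :rank] * root_s          # shape (d_out, m)
    v_hat = root_s[:, None] * vt[:rank]   # shape (m, d_in)
    return u_hat, v_hat                   # W_hat + U_hat @ V_hat ≈ W

w = np.random.randn(1024, 1024).astype(np.float32)
w_hat = np.round(w / 0.05) * 0.05         # stand-in for an INT4-quantized weight
u_hat, v_hat = lorc_factors(w, w_hat, rank=8)
print(np.linalg.norm(w - w_hat))                      # error before compensation
print(np.linalg.norm(w - (w_hat + u_hat @ v_hat)))    # error after compensation
```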

How are Û and V̂ actually used after they are computed? The paper does not spell out the details!

It only mentions the approximation W ≈ Ŵ + Û V̂.

But if the low-rank product is simply added into Ŵ, there is no need to store the two matrices separately, which contradicts the paper's remark about the extra storage. Moreover, Ŵ may already be partially saturated, so a direct addition does not necessarily improve accuracy.

It is also possible that the activation is multiplied by Ŵ and by Û, V̂ separately, and the two results are then added? That would require an additional matrix multiplication step.
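If that second interpretation is right, the extra cost is a LoRA-style side branch of two thin matmuls. A purely speculative sketch, following the guess above rather than anything stated in the paper:

```python
import numpy as np

# Speculative inference path (LoRA-style side branch; not specified in the paper):
# the dense matmul uses the quantized weight, and the low-rank correction is
# applied as two thin matmuls of rank m, then the results are summed.
def lorc_forward(x, w_hat, u_hat, v_hat):
    main = x @ w_hat.T                # dense path with the quantized weight
    side = (x @ v_hat.T) @ u_hat.T    # rank-m correction, m << hidden size
    return main + side                # equals x @ (w_hat + u_hat @ v_hat).T

x = np.random.randn(4, 1024).astype(np.float32)
w_hat = np.random.randn(1024, 1024).astype(np.float32)
u_hat = np.random.randn(1024, 8).astype(np.float32)
v_hat = np.random.randn(8, 1024).astype(np.float32)
y = lorc_forward(x, w_hat, u_hat, v_hat)
```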

This method does not seem to require any training process? ZQ-Global, by contrast, should need distillation.

The paper's experiments on how much LoRC improves W4A8 accuracy are not sufficient!

It is also unclear whether LoRC improves accuracy under ordinary dynamic min/max activation quantization, or whether near-lossless results can only be achieved in combination with ZQ-Global.


Origin blog.csdn.net/u013701860/article/details/131260373