The first survey on large model compression is here!

From: Dialogue’s Algorithm House


Recently, LLMs have amazed the world with their remarkable capabilities, which come at the cost of an enormous number of parameters and heavy computation. Taking the GPT-175B model as an example, it has 175 billion parameters and requires at least 320 GB (in multiples of 1024) of storage in half-precision (FP16) format. Moreover, deploying it for inference requires at least five A100 GPUs, each with 80 GB of memory. These huge storage and computing costs make effective model compression a pressing problem.

Researchers from the Chinese Academy of Sciences and Renmin University of China conducted an in-depth study of research progress on model compression for LLMs and published the first survey in this field, "A Survey on Model Compression for Large Language Models".

Paper link: https://arxiv.org/pdf/2308.07633.pdf

Model compression converts large, resource-intensive models into compact versions suitable for storage on resource-constrained mobile devices. It can also optimize the model for faster execution with minimal latency, or strike a balance between these goals.

This survey focuses on the methods, metrics, and benchmarks of model compression for LLMs, and organizes the related research into a new taxonomy comprising:

  • Pruning

  • Knowledge Distillation

  • Quantization

  • Low-Rank Factorization


Figure 1: Classification of model compression methods for large language models.

1. Methods

Pruning

Pruning removes unnecessary or redundant components, such as parameters, to make a model more efficient. By pruning redundant parameters that contribute little, storage requirements can be reduced and memory and computational efficiency improved while keeping performance degradation minimal. The paper divides pruning into two main types: unstructured pruning and structured pruning.

Unstructured pruning: removes individual parameters without considering the overall network structure. It operates on individual weights or neurons, typically by zeroing out parameters below a threshold. Because specific parameters are removed arbitrarily, the model ends up with an irregular sparse structure, and this irregularity requires specialized compression techniques to store and compute the pruned model. Furthermore, unstructured pruning often requires extensive retraining to restore accuracy, which is especially expensive for LLMs. SparseGPT [Frantar and Alistarh, 2023] introduces a one-shot pruning strategy that requires no retraining: it treats pruning as an extensive sparse regression problem and solves it with an approximate sparse regression solver, achieving significant unstructured sparsity. LoRAPrune [Zhang et al., 2023a] combines parameter-efficient fine-tuning (PEFT) with pruning to improve performance on downstream tasks; it introduces a unique parameter-importance criterion that uses the values and gradients from Low-Rank Adaptation (LoRA). Wanda [Sun et al., 2023] proposes a new pruning metric: each weight is scored by the product of its magnitude and the norm of the corresponding input activation, approximated using a small calibration dataset. The metric is compared locally within each linear-layer output, allowing lower-priority weights to be removed from the LLM.
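
As a rough illustration of the Wanda-style metric, the sketch below scores each weight by the product of its magnitude and the L2 norm of the corresponding input activation over a calibration batch, then prunes the lowest-scoring half of each output row. The matrix shapes and sparsity level are arbitrary assumptions for demonstration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))           # weight matrix of one linear layer (out x in)
X = rng.normal(size=(256, 128))          # calibration activations (tokens x in)

# Wanda-style score: |weight| times the L2 norm of the corresponding input feature.
score = np.abs(W) * np.linalg.norm(X, axis=0)    # broadcast over output rows

# Prune per output row (local comparison within the layer output), e.g. 50% sparsity.
k = W.shape[1] // 2
order = np.argsort(score, axis=1)                # ascending scores per output row
mask = np.ones_like(W, dtype=bool)
np.put_along_axis(mask, order[:, :k], False, axis=1)
W_pruned = W * mask
print("sparsity:", float((W_pruned == 0).mean()))
```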

Structured pruning: removes connections or entire structures according to predefined rules while preserving the overall network architecture. It targets whole sets of weights at once, which reduces model complexity and memory usage while keeping the overall LLM structure intact. LLM-Pruner [Ma et al., 2023] adopts a versatile approach to compress LLMs while preserving their multi-task solving and language-generation capabilities. It introduces a dependency-detection algorithm to locate interdependent structures within the model and implements an efficient importance-estimation method that takes both first-order information and approximate Hessian information into account.
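
The first-order importance estimation mentioned above can be sketched with a toy example: a weight's (or a whole channel's) contribution is approximated by |w · ∂L/∂w| from a first-order Taylor expansion of the loss. The layer size, calibration data, and stand-in loss below are made up for illustration and omit the dependency detection and Hessian approximation used by LLM-Pruner.

```python
import torch

# Toy illustration only: a single linear layer stands in for one prunable structure group.
torch.manual_seed(0)
layer = torch.nn.Linear(16, 16, bias=False)
x = torch.randn(8, 16)                     # small calibration batch
loss = layer(x).pow(2).mean()              # stand-in loss; a real setup uses the LM loss
loss.backward()

# First-order Taylor importance: removing weight w changes the loss by roughly |w * dL/dw|.
importance = (layer.weight * layer.weight.grad).abs()

# Aggregate per output channel to score whole structures rather than individual weights.
channel_score = importance.sum(dim=1)
prune_idx = channel_score.argsort()[:4]    # e.g. drop the 4 least important channels
print("channels to prune:", prune_idx.tolist())
```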

Knowledge Distillation (KD)

KD transfers knowledge from a complex model (the teacher) to a simpler model (the student). This section outlines distillation methods that use an LLM as the teacher and classifies them according to whether they emphasize distilling the emergent abilities (EA) of LLMs into small language models (SLMs): standard KD and EA-based KD.


Figure 2: A brief classification of language model knowledge distillation.

Standard KD aims to have the student model learn the general knowledge possessed by the LLM, such as its output distributions and feature information. It resembles traditional KD, except that the teacher model is an LLM.

MINILLM [Gu et al., 2023] takes a deep look at distillation from white-box generative LLMs and chooses to minimize the reverse KLD, which prevents the student from overestimating low-probability regions of the teacher's distribution and thereby improves the quality of the generated samples.

GKD [Agarwal et al., 2023] explores distillation from autoregressive models, of which white-box generative LLMs are a subset. It handles the distribution mismatch problem by sampling output sequences from the student model during training, and addresses model under-specification by optimizing alternative divergences such as the reverse KL divergence.
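
To make the forward/reverse KLD distinction concrete, here is a minimal sketch of both objectives on token-level logits. The tensor shapes are hypothetical, and real MINILLM/GKD training involves additional components (e.g., on-policy student sampling) that are not shown here.

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits):
    """Standard KD objective: KL(p_teacher || p_student)."""
    p_t = F.softmax(teacher_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    return (p_t * (log_p_t - log_p_s)).sum(-1).mean()

def reverse_kl(teacher_logits, student_logits):
    """Reverse KLD, as in MINILLM/GKD-style objectives: KL(p_student || p_teacher).
    Penalizes the student for putting mass where the teacher assigns low probability."""
    p_s = F.softmax(student_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    return (p_s * (log_p_s - log_p_t)).sum(-1).mean()

t = torch.randn(4, 10, 32000)   # hypothetical (batch, seq, vocab) teacher logits
s = torch.randn(4, 10, 32000)   # student logits
print(forward_kl(t, s).item(), reverse_kl(t, s).item())
```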

In contrast, EA-based KD transfers not only the general knowledge of LLMs but also their unique emergent abilities. Specifically, EA-based KD covers in-context learning (ICL), chain of thought (CoT), and instruction following (IF).

ICL uses a structured natural-language prompt that contains a task description and, possibly, a few demonstration examples. Given such examples, an LLM can master and perform new tasks without explicit gradient updates. Huang et al. introduce ICL distillation to transfer the in-context few-shot learning and language-modeling abilities of LLMs to SLMs. They combine in-context learning objectives with traditional language-modeling objectives and explore ICL distillation under two few-shot learning paradigms, Meta-ICT and Multitask-ICT.
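
For reference, a few-shot ICL prompt of the kind discussed above simply concatenates a task description, demonstration examples, and the query; the sentiment-classification reviews and labels below are invented for illustration.

```python
# Build a hypothetical few-shot ICL prompt: task description + demonstrations + query.
demos = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I regretted buying a ticket.", "negative"),
]
query = "The plot was thin but the acting saved it."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in demos:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"
print(prompt)
```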

Compared with ICL, CoT incorporates intermediate reasoning steps into the prompt that lead to the final output, instead of using simple input-output pairs. MT-COT [Li et al., 2022] leverages the explanations produced by LLMs to enhance the training of smaller reasoners, using a multi-task learning framework to equip smaller models with strong reasoning ability and the ability to generate explanations. Fine-tune-CoT [Ho et al., 2023] generates multiple reasoning solutions from LLMs through random sampling; this increase in training data aids the student model's learning. Fu et al. identify a trade-off among the multidimensional capabilities of language models and propose fine-tuning an instruction-tuned model, extracting CoT reasoning paths from large teacher models to improve out-of-distribution generalization. Xie et al. use LLM rationales as additional guidance within a multi-task framework to train smaller models. SOCRATIC CoT [Shridhar et al., 2023] trains two distilled models: a problem decomposer and a sub-problem solver. The decomposer breaks the original problem into a sequence of sub-problems, while the sub-problem solver solves them. DISCO [Chen et al., 2023] introduces a fully automatic counterfactual-knowledge-distillation method based on LLMs: it generates phrase perturbations with engineered prompts and then filters the perturbed data through a task-specific teacher model to extract high-quality counterfactual data. SCOTT [Wang et al., 2023a] uses contrastive decoding to connect each rationale to its answer, encouraging the extraction of relevant rationales from the teacher; in addition, the student is trained to reason counterfactually, making predictions consistent with rationales that lead to different answers.
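
A minimal sketch of Fine-tune-CoT-style data generation is shown below, assuming a hypothetical `teacher_generate(prompt, temperature)` wrapper around whatever teacher LLM is used; the prompt template and naive answer extraction are simplifications for illustration only.

```python
def build_cot_training_set(questions, teacher_generate, n_samples=8):
    """Sample multiple reasoning paths per question from a teacher LLM (sketch)."""
    records = []
    for q in questions:
        prompt = f"Q: {q}\nA: Let's think step by step."
        for _ in range(n_samples):                    # diverse reasoning via random sampling
            rationale = teacher_generate(prompt, temperature=0.7)
            answer = rationale.strip().split()[-1]    # naive answer extraction for this sketch
            records.append({"question": q, "rationale": rationale, "answer": answer})
    # In practice the records are filtered (e.g. keeping only correct answers) before
    # fine-tuning the student on (question -> rationale + answer) pairs.
    return records
```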

IF relies only on a task description rather than on a few examples. By fine-tuning on a collection of tasks expressed as instructions, language models learn to accurately perform tasks described by previously unseen instructions. Lion [Jiang et al., 2023] exploits the adaptability of LLMs to improve student-model performance: it prompts the LLM to identify and generate "hard" instructions, and then uses these instructions to enhance the student model's capabilities.


Figure 3: Overview of EA-based KD. (a) In-context learning distillation, (b) chain-of-thought distillation, (c) instruction-following distillation.

Quantization

Quantization converts the floating-point numbers used in conventional representations into integers or other discrete forms, reducing the storage and computational burden of deep learning models. Careful quantization can achieve substantial model compression with only a slight loss of accuracy. According to the stage at which quantization is applied, methods fall into the following three categories:

Quantization-Aware Training (QAT): In QAT, the quantization objective is integrated into the model's training process. This allows the LLM to adapt to low-precision representations during training, mitigating the precision loss caused by quantization and maintaining higher performance after quantization. LLM-QAT [Liu et al., 2023] uses outputs generated by the pre-trained model to achieve data-free distillation. Furthermore, LLM-QAT quantizes not only weights and activations but also the key-value (KV) cache, a strategy designed to increase throughput and support longer sequence dependencies. LLM-QAT can distill large LLaMA models with quantized weights and KV cache down to 4-bit models, demonstrating the feasibility of producing accurate 4-bit quantized LLMs.
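
As a toy illustration of the general QAT idea (not LLM-QAT itself), the sketch below fake-quantizes a layer's weights in the forward pass and uses the straight-through estimator so gradients still reach the full-precision weights; the 4-bit setting, layer size, and regression loss are arbitrary assumptions.

```python
import torch

def fake_quant(w, num_bits=4):
    """Quantize-dequantize in the forward pass; the straight-through estimator
    makes the backward pass treat quantization as the identity."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()          # forward uses w_q, backward uses dL/dw

layer = torch.nn.Linear(64, 64, bias=False)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, y = torch.randn(32, 64), torch.randn(32, 64)
for _ in range(10):                        # toy QAT loop on a stand-in regression task
    w_q = fake_quant(layer.weight, num_bits=4)
    loss = torch.nn.functional.mse_loss(x @ w_q.t(), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```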

Quantization-Aware Fine-tuning (QAF): QAF quantizes the LLM during fine-tuning. The main goal is to ensure that the fine-tuned LLM retains its performance after being quantized to a lower bit width; by integrating quantization awareness into fine-tuning, it strikes a balance between model compression and performance preservation. PEQA [Kim et al., 2023] and QLORA [Dettmers et al., 2023a] both belong to the category of quantization-aware parameter-efficient fine-tuning (PEFT), focusing on model compression and inference acceleration. PEQA uses a two-stage process: in the first stage, the parameter matrix of each fully connected layer is quantized into a low-bit integer matrix and a scalar vector; in the second stage, only the scalar vectors are fine-tuned for each downstream task. QLORA introduces innovations such as a new data type, double quantization, and paged optimizers, which save memory without hurting performance. QLORA makes it possible to fine-tune large models on a single GPU while achieving state-of-the-art results on the Vicuna benchmark.
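
A rough sketch of the PEQA-style two-stage recipe described above: first quantize a weight matrix into a frozen low-bit integer matrix plus a per-channel scale vector, then fine-tune only the scales on the downstream objective. The shapes, bit width, and regression loss are toy assumptions, not the paper's setup.

```python
import torch

num_bits, qmax = 4, 2 ** 3 - 1
W = torch.randn(128, 128) * 0.02                       # pretrained full-precision weights
scale = W.abs().amax(dim=1, keepdim=True) / qmax       # stage 1: per-output-channel scale
W_int = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)   # frozen integer matrix

scale = torch.nn.Parameter(scale)                      # stage 2: fine-tune only the scales
opt = torch.optim.Adam([scale], lr=1e-3)
x, y = torch.randn(16, 128), torch.randn(16, 128)
for _ in range(20):
    W_hat = W_int * scale                              # dequantized weights
    loss = torch.nn.functional.mse_loss(x @ W_hat.t(), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```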

Post-Training Quantization (PTQ): PTQ quantizes the parameters of an LLM after its training phase is complete. Its main goal is to reduce the storage and computational complexity of the LLM without modifying the architecture or retraining. PTQ's main advantages are simplicity and efficiency, although it may introduce some degree of accuracy loss during quantization.

One line of PTQ work quantizes only the weights of the LLM to improve efficiency and reduce computational requirements. LUT-GEMM [Park et al., 2022] quantizes only the weights and optimizes matrix multiplication in LLMs using the BCQ format, improving computational efficiency to reduce latency and boost performance. LLM.int8() [Dettmers et al., 2022] applies 8-bit quantization to matrix multiplication in LLM Transformers, effectively reducing GPU memory usage during inference while maintaining accuracy; it employs vector-wise quantization and mixed-precision decomposition to handle outliers. ZeroQuant [Yao et al., 2022] combines a hardware-friendly quantization scheme, layer-by-layer knowledge distillation, and optimized quantization support to reduce the weight and activation precision of Transformer-based models to INT8 with minimal impact on accuracy. GPTQ [Frantar et al., 2022] proposes a novel layer-wise quantization technique based on approximate second-order information, reducing each weight to 3 or 4 bits with almost no accuracy loss compared with the uncompressed model. Dettmers and Zettlemoyer explore the trade-off between model size and bit precision for zero-shot performance by inferring scaling laws; their extensive experiments across LLM families show that 4-bit precision is almost universally the best choice for balancing total model bits against zero-shot accuracy. AWQ [Lin et al., 2023] finds that weights are not equally important for LLM performance and that protecting only 1% of the salient weights can greatly reduce quantization error. Building on this, AWQ adopts an activation-aware approach that considers weight channels corresponding to larger activation magnitudes as more important, which is key to handling important features; it uses per-channel scaling to determine the scaling factors that minimize quantization error while quantizing all weights. By analyzing how activation outliers amplify errors in weight quantization, OWQ [Lee et al., 2023] introduces a mixed-precision scheme that applies higher precision to weights that are sensitive to activation outliers. SpQR [Dettmers et al., 2023b] identifies and isolates outlier weights, stores them at higher precision, and compresses all remaining weights to 3-4 bits.

Many PTQ works also quantize both the weights and the activations of LLMs. SmoothQuant [Xiao et al., 2022] addresses the challenge of quantizing activations, which is often complicated by outliers; observing that different tokens exhibit similar variation patterns across their channels, it introduces a per-channel scaling transformation that smooths activation magnitudes and makes the model easier to quantize. RPTQ [Yuan et al., 2023] highlights the challenge of uneven ranges across channels, as well as the problems caused by outliers; it strategically groups channels into clusters for quantization, effectively mitigating cross-channel differences, and folds the channel reordering into the layer-normalization operations and linear-layer weights to minimize the associated overhead. OliVe [Guo et al., 2023] further adopts outlier-victim pair (OVP) quantization and handles outliers locally with low hardware overhead and high performance gains, based on the observation that outliers are important while the normal values adjacent to them are not. Outlier Suppression+ [Wei et al., 2023] confirms that harmful activation outliers exhibit an asymmetric distribution concentrated in specific channels; it introduces channel-wise shifting and scaling operations to correct the asymmetry and mitigate the impact of problematic channels, and it quantitatively determines the optimal shift and scale values, taking into account both the asymmetry of the outliers and the quantization errors introduced by the next layer's weights. ZeroQuant-FP [Wu et al., 2023] explores the applicability of floating-point (FP) quantization, focusing on the FP8 and FP4 formats. The study finds that for LLMs, FP8 activations consistently outperform INT8, while for weight quantization FP4 is comparable to or even better than INT4. To address the discrepancy between weights and activations, ZeroQuant-FP requires all scaling factors to be powers of two and restricts them to a single compute group; notably, it also integrates the Low-Rank Compensation (LoRC) strategy to further improve its quantization.
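
To illustrate the per-channel smoothing idea behind SmoothQuant, the sketch below migrates quantization difficulty from activations to weights with the scaling rule s_j = max|X_j|^α / max|W_j|^(1−α), leaving the product X·Wᵀ mathematically unchanged; the shapes, the injected outlier channel, and α = 0.5 are assumptions for demonstration.

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """SmoothQuant-style per-channel scales: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    act_max = np.abs(X).max(axis=0)        # per input-channel activation range
    w_max = np.abs(W).max(axis=0)          # per input-channel weight range (W is out x in)
    return act_max ** alpha / (w_max ** (1 - alpha) + 1e-8)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 128))
X[:, 7] *= 50.0                            # pretend channel 7 carries activation outliers
W = rng.normal(size=(64, 128)) * 0.02

s = smooth_scales(X, W)
X_smooth, W_smooth = X / s, W * s          # X @ W.T is mathematically unchanged
print(np.allclose(X @ W.T, X_smooth @ W_smooth.T))   # True: only the per-channel ranges moved
```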

In addition, related work can be classified by the number of bits (i.e., precision) used for the LLM weights, namely 8-bit quantization and low-bit quantization.


Figure 4: Overview of quantization methods for large language models (LLMs). We divide them into two categories, 8-bit quantization and low-bit quantization, based on the number of bits (i.e., precision) of the LLM weights.
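
A small sketch of what the 8-bit versus low-bit distinction means in practice: the same plain round-to-nearest symmetric quantizer applied at different bit widths, with lower bit widths giving smaller models at the cost of larger reconstruction error. The weight statistics below are synthetic.

```python
import numpy as np

def quantize_symmetric(w, num_bits):
    """Per-tensor symmetric round-to-nearest quantization (a minimal baseline)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                        # dequantized reconstruction

w = np.random.default_rng(0).normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
for bits in (8, 4, 3):
    err = np.abs(quantize_symmetric(w, bits) - w).mean()
    print(f"{bits}-bit: ~{16 / bits:.1f}x smaller than FP16, mean abs error {err:.5f}")
```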

Low-Rank Factorization

Low-rank factorization approximates a given weight matrix by decomposing it into two or more matrices of smaller dimensions. The core idea is to find a decomposition of a large weight matrix W into two matrices U and V such that W ≈ UV, where U is an m×k matrix, V is a k×n matrix, and k is much smaller than m and n. The product of U and V approximates the original weight matrix, greatly reducing the number of parameters and the computational overhead.
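
A minimal sketch of the factorization W ≈ UV via truncated SVD, showing the parameter reduction; the matrix here is synthetically constructed to be approximately low-rank, which real LLM weight matrices need not be.

```python
import numpy as np

def low_rank_factorize(W, k):
    """Approximate W (m x n) by U (m x k) @ V (k x n) via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k, :]      # fold the singular values into U

rng = np.random.default_rng(0)
# Synthetic weight matrix that is approximately rank 64, plus a little noise.
W = rng.normal(size=(512, 64)) @ rng.normal(size=(64, 512)) + 0.01 * rng.normal(size=(512, 512))

U, V = low_rank_factorize(W, k=64)
orig_params = W.size                        # 512 * 512 = 262,144
fact_params = U.size + V.size               # 512*64 + 64*512 = 65,536
rel_err = np.linalg.norm(W - U @ V) / np.linalg.norm(W)
print(f"parameters: {orig_params} -> {fact_params}, relative error {rel_err:.4f}")
```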

In LLM compression research, low-rank factorization is usually combined with other techniques such as pruning and quantization, as in LoRAPrune [Zhang et al., 2023a] and ZeroQuant-FP [Wu et al., 2023], to achieve more effective compression while maintaining performance. As research in this area continues, further advances in applying low-rank factorization to compress LLMs are likely to emerge, but continued exploration and experimentation will be required to realize its full potential.

2. Metrics and Benchmarks

Metrics

Number of Parameters: The total number of learnable weights or variables in the LLM, which must be optimized during training.

Model Size: Refers to the disk space or memory footprint required to store the entire LLM, including weights, biases, and other necessary components.

Compression Ratio: The ratio between the original size of the uncompressed LLM and the size of the compressed LLM.

Inference time: Measures the time it takes LLM to process input data and generate a response during inference or prediction.

Floating point operations (FLOPs): A measure of the number of arithmetic operations involving floating point numbers (usually 32 or 16 bits) performed by LLM when processing input data.
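
To connect these metrics, here is the back-of-the-envelope arithmetic relating parameter count, bit width, model size, and compression ratio, using the GPT-175B figure from the introduction; storage is reported in GiB (multiples of 1024).

```python
# Relation between parameter count, bit width, and model size for GPT-175B.
params = 175e9
fp16_gib = params * 16 / 8 / 1024**3        # bytes -> GiB
for bits, name in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    size_gib = params * bits / 8 / 1024**3
    print(f"{name}: ~{size_gib:,.0f} GiB, compression ratio vs FP16 = {fp16_gib / size_gib:.0f}x")
```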

Benchmarks

Common NLP benchmarks: GLUE, LAMBADA, LAMA, SQuAD;

HULK: Comprehensive evaluation of the energy efficiency of pre-trained language models (PLMs);

ELUE: Integrates six NLP datasets covering sentiment analysis, natural language inference, similarity, and paraphrase tasks.

3. Challenges and Future Directions

Dedicated benchmarks. First, the evaluation of model compression lacks a universally accepted standard setting. Second, existing benchmarks may not represent typical tasks on mobile devices well, and benchmarks designed for pre-trained models may not be directly applicable to such tasks.

Performance versus size trade-off. Current work still lacks theoretical and empirical insights into this trade-off.

Dynamic LLM compression. Current compression methods still rely on manual design to determine the compressed size and structure of LLMs, and this reliance on manual trial and error hinders practical adoption.

Interpretability. Integrating interpretable compression methods should become a key prerequisite for advancing LLM compression in practice. Interpretable compression not only addresses interpretability concerns but also simplifies the evaluation of compressed models, which in turn enhances their reliability and predictability in production.


