A Review of Parameter-Efficient Fine-Tuning Techniques for Large Models: LoRA, AdaLoRA, QLoRA

From: Eat jelly without spitting out jelly skin


With the explosive popularity of ChatGPT, we have entered the era of large models. For most people, however, pre-training or fully fine-tuning a large model is out of reach. This has motivated a range of parameter-efficient fine-tuning techniques that give researchers and ordinary developers a realistic way to fine-tune large models.

These techniques deserve an in-depth look at the mechanisms behind them, so this series is organized into seven articles:

  • An Overview of Parameter-Efficient Fine-Tuning of Large Models (1): Background and Introduction to Parameter-Efficient Fine-Tuning

  • An Overview of Parameter-Efficient Fine-Tuning of Large Models (2): BitFit, Prefix Tuning, Prompt Tuning

  • An Overview of Parameter-Efficient Fine-Tuning of Large Models (3): P-Tuning, P-Tuning v2

  • An Overview of Parameter-Efficient Fine-Tuning of Large Models (4): Adapter Tuning and Its Variants

  • An Overview of Parameter-Efficient Fine-Tuning of Large Models (5): LoRA, AdaLoRA, QLoRA

  • An Overview of Parameter-Efficient Fine-Tuning of Large Models (6): MAM Adapter, UniPELT

  • An Overview of Parameter-Efficient Fine-Tuning of Large Models (7): Best Practices and Summary

This article is the fifth installment in the series, covering LoRA, AdaLoRA, and QLoRA.

LoRA

Background

Neural networks contain many dense (fully connected) layers implemented as matrix multiplications, and the weight matrices of these layers are typically full rank. Prior work has shown, however, that pre-trained language models have a low "intrinsic dimension" when adapting to a specific task: the update can be randomly projected into a much smaller subspace and the model can still learn effectively. The authors therefore hypothesize that the weight update during fine-tuning also has a low "intrinsic rank", i.e. the weight matrices do not need to remain full rank to adapt to a particular downstream task.

Technical principle

LoRA (paper: LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS) approximates the change in weights during fine-tuning with a low-rank decomposition, so that a large model can be adapted indirectly by training only a very small number of parameters.

For modules that involve matrix multiplication, a bypass is added next to the original PLM weight: the product of two matrices, A and B. The first matrix A projects the input down to a low dimension, the second matrix B projects it back up, and the intermediate dimension is r, which plays the role of the assumed intrinsic rank.


The trainable bypass uses the same input and output dimension d as the pre-trained layer it sits beside. The input of dimension d is first projected down to dimension r by one linear map and then projected back up to d by another, with r << d (r is the rank of the update). The matrix computation therefore shrinks from d x d to d x r + r x d parameters, which reduces the parameter count dramatically.
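
For a sense of scale (using a hypothetical hidden size): with d = 4096 and r = 8, a full d x d update would carry 4096 × 4096 ≈ 16.8M parameters, whereas the two low-rank factors carry only 4096 × 8 + 8 × 4096 = 65,536, roughly 0.4% as many.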


During training on a downstream task, all original model parameters are frozen and only the two newly added matrices are optimized. The outputs of the PLM path and the new path are added together to form the final output (the input and output dimensions of the two paths are identical), i.e. h = Wx + BAx. Matrix A is initialized from a Gaussian distribution and matrix B is initialized to zero, so that BA = 0 at the start of training and the new path initially has no effect on the model's output.
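
To make h = Wx + BAx concrete, here is a minimal sketch of a LoRA-augmented linear layer in PyTorch. The class name LoRALinear, the shapes, and the initialization constant are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A minimal sketch of a linear layer with a LoRA bypass: h = Wx + BAx."""

    def __init__(self, base_linear: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the pretrained weight W
        d_out, d_in = base_linear.weight.shape
        # A projects down to rank r (Gaussian init); B projects back up (zero init),
        # so BA = 0 at the start of training and the model output is unchanged.
        self.lora_A = nn.Parameter(0.02 * torch.randn(r, d_in))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))

    def forward(self, x):
        # Frozen original path plus the trainable low-rank path.
        # (The official implementation also scales the update by alpha / r, omitted here.)
        return self.base(x) + (x @ self.lora_A.T) @ self.lora_B.T


# Example: wrap a hypothetical 4096 x 4096 attention projection.
lora_q = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
```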


At inference time, the two paths are simply summed: h = Wx + BAx = (W + BA)x. So after training, the product BA can be added to the original weight matrix W, and this merged matrix replaces W in the original PLM; as a result, inference incurs no additional computation or latency.
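
Continuing the hypothetical LoRALinear sketch above, the merge can be written in a few lines:

```python
# Fold the trained low-rank update into the frozen weight: W' = W + BA.
with torch.no_grad():
    lora_q.base.weight += lora_q.lora_B @ lora_q.lora_A
# After merging, lora_q.base alone computes h = (W + BA)x, so the extra path
# can be dropped and inference costs exactly what the original layer costs.
```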

In the Transformer, the weight matrices include Wq, Wk, Wv (used to compute the query, key, and value in the attention module), the output projection Wo of multi-head attention, and the weight matrices of the MLP layers. LoRA is applied only to the four weight matrices in the attention module, and ablation experiments show that adapting Wq and Wv together gives the best results.

The experiments also show that adapting more types of weight matrices matters more than increasing the rank r of a single matrix; increasing r does not necessarily cover a more meaningful subspace.


As for the choice of rank r, values of 4, 8, or 16 are commonly used.
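
In practice this is usually done with an off-the-shelf library rather than hand-written layers. The sketch below uses the Hugging Face PEFT package; the model id and the module names ("q_proj", "v_proj") are placeholders and depend on the specific architecture.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder model id

# Apply LoRA only to the attention query and value projections, as the
# ablations above suggest, with a small rank such as 8.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # module names vary per architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```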


The experiments show that on many datasets LoRA matches the performance of full fine-tuning while training only a tiny fraction of the parameters, and on some tasks it even outperforms full fine-tuning.


AdaLoRA

Background

In NLP, fine-tuning large pre-trained language models on downstream tasks has become standard practice. Typically this is done by fully fine-tuning all parameters of the pre-trained model, but this approach has two problems.

  • Training phase: when fine-tuning a pre-trained model, updating the weights requires a large amount of GPU memory to store the gradients and optimizer state. As pre-trained models keep growing, the barrier to fine-tuning them for downstream tasks keeps rising.

  • Inference phase: because all model parameters are updated during training, each downstream task needs its own independent copy of the large model, which wastes a great deal of storage in practice.

To address these issues, researchers have pursued two main directions for reducing the number of fine-tuned parameters while maintaining, or even improving, the performance of pre-trained language models.

  • Direction 1: add small network modules. Small modules are added to the PLM, and for each task only these modules are fine-tuned while the base model stays unchanged and can be shared across all tasks. Only a small number of task-specific parameters need to be introduced and updated to adapt to downstream tasks, which greatly improves the practicality of large pre-trained models. Examples include Adapter Tuning, Prefix Tuning, and Prompt Tuning. Although these methods greatly reduce memory consumption, they have drawbacks: Adapter Tuning introduces inference latency, while Prefix Tuning and Prompt Tuning are hard to optimize directly (their performance changes non-monotonically with the number of trainable parameters) and they consume part of the input sequence length.

  • Direction 2: incremental updates to the pre-trained weights. The update is modeled without modifying the model architecture, i.e. W = W0 + ΔW. Examples include Diff Pruning and LoRA. These methods can achieve performance close to full fine-tuning, but they also have problems. Diff Pruning needs low-level support to accelerate unstructured sparse matrix computation, so it cannot directly use existing frameworks, and the full ΔW matrix must be stored during training, so it does not reduce the computational cost compared with full fine-tuning. LoRA requires the intrinsic rank r of every incremental matrix to be specified in advance and to be the same, ignoring the fact that the importance of weight matrices differs significantly across modules and layers during fine-tuning, and it trains only the attention weights while leaving the FFN untrained, even though the FFN is in fact more important.

Summarizing the issues above:

  • First, we cannot pre-specify the rank of each matrix; the rank r of each incremental matrix needs to be adjusted dynamically, because the importance of the weight matrices varies significantly across modules and layers.

  • Second, we need to identify the more important matrices and allocate more parameters to them, while pruning the unimportant ones. Finding the important matrices improves model quality; pruning the unimportant ones reduces the amount of computation and lowers the risk of degrading model quality.

To bridge this gap, the authors propose AdaLoRA, which adaptively allocates parameter budgets among weight matrices according to their importance scores.

Technical principle

AdaLoRA (paper: ADAPTIVE BUDGET ALLOCATION FOR PARAMETER-EFFICIENT FINE-TUNING) is an improvement on LoRA that dynamically allocates the parameter budget across weight matrices according to importance scores. Specifically:

  • Adjust the rank allocation of incremental matrices. AdaLoRA assigns higher ranks to critical incremental matrices so they can capture finer, task-specific information, and lower ranks to less important matrices to prevent overfitting and save computational budget.

  • Parameterize the incremental update as a singular value decomposition, pruning unimportant singular values according to an importance metric while keeping the singular vectors. Because computing an exact SVD of a large matrix is expensive, this parameterization lets the method reduce a matrix's parameter budget without ever performing the decomposition, while keeping the possibility of later recovery and stabilizing training.

  • An extra penalty term is added to the training loss to enforce the orthogonality of the singular-vector matrices P and Q, avoiding the heavy computation of an exact SVD and stabilizing training (a sketch of this parameterization follows the list).
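
A minimal sketch of the parameterization described above: the increment is written as ΔW = P·diag(E)·Q, where E holds the prunable "singular values", and a penalty keeps P and Q approximately orthogonal so that an exact SVD never has to be computed. The shapes, initialization, and penalty weight here are illustrative assumptions.

```python
import torch

d_out, d_in, r = 768, 768, 8           # illustrative shapes and initial rank budget

P = torch.nn.Parameter(0.02 * torch.randn(d_out, r))   # left singular vectors
E = torch.nn.Parameter(torch.zeros(r))                  # singular values, prunable per triplet
Q = torch.nn.Parameter(0.02 * torch.randn(r, d_in))     # right singular vectors

def delta_w():
    # Incremental update parameterized in SVD-like form: ΔW = P diag(E) Q.
    return P @ torch.diag(E) @ Q

def orthogonality_penalty(P, Q):
    # R(P, Q) = ||PᵀP - I||_F² + ||QQᵀ - I||_F², added to the task loss so that
    # P and Q stay near-orthogonal without ever computing an exact SVD.
    i_r = torch.eye(P.shape[1], device=P.device)
    return ((P.T @ P - i_r) ** 2).sum() + ((Q @ Q.T - i_r) ** 2).sum()

# Training would use: loss = task_loss + gamma * orthogonality_penalty(P, Q),
# and periodically zero out entries of E belonging to less important triplets.
```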

Experiments show that AdaLoRA achieves better or comparable performance than existing approaches at every budget and on every dataset. For example, with a parameter budget of 0.3M, AdaLoRA beats the best-performing baseline by 1.8% on the RTE dataset.


QLoRA

Background

Fine-tuning large language models (LLMs) is a very effective way to improve their performance and to add desired or remove undesired behaviors. However, fine-tuning very large models is extremely expensive: for the 65B-parameter LLaMA model, regular 16-bit fine-tuning requires more than 780 GB of GPU memory.

Recent quantization methods can reduce the memory footprint of LLMs, but such techniques have so far only been applicable to inference.

Based on this, the authors propose QLoRA and demonstrate for the first time that a model quantized to 4 bits can be fine-tuned without any performance degradation.

Technical principle

QLoRA (paper: QLORA: Efficient Finetuning of Quantized LLMs) uses a novel high-precision technique to quantize a pre-trained model to 4 bits and then adds a small set of learnable low-rank adapter weights, which are fine-tuned by backpropagating gradients through the frozen quantized weights. QLoRA uses a low-precision storage data type (4-bit) together with a computation data type (BFloat16): in practice, whenever a QLoRA weight tensor is used, it is dequantized to BFloat16 and the matrix multiplication is performed in 16 bits. QLoRA introduces two techniques to achieve high-fidelity 4-bit fine-tuning, 4-bit NormalFloat (NF4) quantization and double quantization. In addition, paged optimizers are introduced to prevent the memory spikes that occur during gradient checkpointing from causing the out-of-memory errors that in the past made fine-tuning large models on a single machine difficult. The components are described below, followed by a combined sketch.

  • 4-bit NormalFloat (NF4): a new, information-theoretically optimal data type for normally distributed weights that yields better empirical results than 4-bit integers and 4-bit floats on such data.

  • Double quantization: the quantization constants produced by the first quantization are themselves quantized, further reducing storage.

  • Paged optimizers: using NVIDIA unified memory, pages are automatically transferred between CPU and GPU so that processing continues without errors when the GPU occasionally runs out of memory; this works just like ordinary paging between CPU memory and disk. The optimizer state is allocated in paged memory, automatically evicted to CPU memory when GPU memory runs low, and paged back into GPU memory when needed for the optimizer update step.
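
As a rough end-to-end sketch (assuming the Hugging Face transformers, peft, and bitsandbytes stack; the model id and hyperparameters are placeholders): the base model is loaded in NF4 with double quantization and BFloat16 compute, LoRA adapters are attached on top, and a paged optimizer is selected for training.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 storage with double quantization; computation happens in BFloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",              # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters are the only trainable weights; gradients flow through the
# frozen 4-bit weights into these low-rank matrices.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # module names vary per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# A paged AdamW keeps optimizer state in paged memory, so occasional GPU
# memory spikes (e.g. during gradient checkpointing) spill to CPU instead of OOM-ing.
args = TrainingArguments(output_dir="out", optim="paged_adamw_32bit")
```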


Experiments show that the benchmark performance of 16-bit full-parameter fine-tuning can be matched using adapters on top of 16-bit, 8-bit, or 4-bit base models. Although quantization introduces some loss, fine-tuning the adapters fully recovers that performance.


The experiments also compare the effect of different 4-bit data types (on mean zero-shot accuracy): NFloat clearly outperforms Float, and NFloat + DQ is slightly better than NFloat alone. Although double quantization barely improves accuracy, it helps considerably with memory control.


The paper also compares fine-tuning models of different sizes with different data types on the MMLU dataset: QLoRA (NFloat4 + DQ) matches LoRA (BFloat16), while QLoRA (FP4) lags about one percentage point behind the other two.


The authors also note some interesting findings. For example, although instruction tuning works well, it mainly helps on instruction-style tasks and performs poorly for chatbots; chatbots are better fine-tuned on data such as the Open Assistant dataset. Tuning on instruction datasets is more about improving the reasoning ability of large models than about chatting.

In short, QLoRA opens up some new possibilities: both fine-tuning and deploying large models will become easier. Anyone can quickly fine-tune on their own private data and then easily deploy the resulting model for inference.

Epilogue

This article described the parameter-efficient fine-tuning methods LoRA, AdaLoRA, and QLoRA. The next articles will cover the hybrid efficient fine-tuning methods MAM Adapter and UniPELT.

If this article helped you, I'd appreciate your likes, bookmarks, and follows~~



Origin blog.csdn.net/qq_27590277/article/details/131199017