[NLP classic paper intensive reading] LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

Foreword

LoRA is one of the most popular low-resource fine-tuning methods for large models in the current era of large language models. The method is simple and easy to understand, the reasoning is clear, and it is very inspiring for future work. If you want to understand the underlying principles of LoRA, it is recommended to read this paper closely; if you only want to apply it, a quick read is enough.


ABSTRACT

Full fine-tuning after large-scale pre-training is often infeasible because of the gap in computing power. This paper therefore proposes Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each layer of the Transformer, greatly reducing the number of trainable parameters for downstream tasks. On GPT-3 175B, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than fine-tuning on RoBERTa, DeBERTa, GPT-2, and GPT-3, and, unlike adapters, it adds no extra inference latency. An implementation of LoRA integrated with several models is available at https://github.com/microsoft/LoRA .

1. INTRODUCTION

The traditional fine-tuning paradigm updates all model parameters, which becomes increasingly inconvenient and challenging as models grow larger. Many approaches fine-tune only a subset of the parameters or add extra modules for new tasks, which greatly improves deployment efficiency, but these methods introduce inference latency by extending model depth or reducing the usable sequence length, and a gap remains with full fine-tuning results.
The author notes that learned over-parameterized models (models far larger than the task requires) actually reside on a low intrinsic dimension: the parameter weights are concentrated in a few information-rich directions while the other directions matter little, so the model can be represented and learned effectively in a lower-dimensional space. The author therefore hypothesizes that the weight change during adaptation also has a low "intrinsic rank", and proposes the low-rank adaptation method LoRA. LoRA indirectly trains the dense layers of a neural network by optimizing rank decomposition matrices of the dense layers' change during adaptation, while keeping the pre-trained weights frozen. As shown in the figure below:
(Figure: LoRA reparameterization. The pre-trained weight W is frozen; only the low-rank matrices A and B are trained.)
Taking GPT-3 as an example, even r = 1 or 2 in the figure above can be sufficient (while the full rank d is 12288), which shows that LoRA saves both storage and computation.
To sum up, the advantages of LoRA are as follows:

  • A pre-trained model can be shared across tasks, with a small LoRA module built for each task.
  • LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times.
  • The simple linear design allows the trainable matrices to be merged with the frozen weights at deployment time, introducing no inference latency.
  • LoRA is orthogonal to many prior methods and can be combined with them, for example with prefix-tuning.

2. PROBLEM STATEMENT

The LoRA method is a general paradigm; this paper takes language tasks as an example to demonstrate its advantages.
With full fine-tuning, every parameter must be learned, which is extremely challenging for storage and deployment. The objective function is:
$$\max _{\Phi} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log \left(P_{\Phi}\left(y_t \mid x, y_{<t}\right)\right)$$
LoRA effectively modifies this objective: the task-specific increment $\Delta \Phi = \Delta \Phi(\Theta)$ is encoded by a much smaller set of parameters $\Theta$, and only $\Theta$ is optimized:
$$\max _{\Theta} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log \left(p_{\Phi_0+\Delta \Phi(\Theta)}\left(y_t \mid x, y_{<t}\right)\right)$$

3. AREN’T EXISTING SOLUTIONS GOOD ENOUGH?

Since the advent of transfer learning, many works have sought to make model adaptation more parameter- and compute-efficient; these are reviewed in Section 6. Taking language models as an example, there are two prominent strategies for efficient adaptation: adding adapter layers, or optimizing some form of the input-layer prompt. But both approaches have limitations, especially in large-scale, latency-sensitive scenarios.

Adapter Layers Introduce Inference Latency

Adapters come in many forms. Although the overall latency can be reduced by pruning the adapter layers or by multi-task settings, the extra computation inside the adapter itself cannot be eliminated. Although the adapter has few parameters, a large neural network must process the adapter layers sequentially and cannot rely on hardware parallelism to keep latency low. In online inference scenarios the batch size is often 1, which makes the adapter's latency even more noticeable. Moreover, if the model is sharded across GPUs, the extra depth requires more synchronous GPU operations such as AllReduce and Broadcast.

Directly Optimizing the Prompt is Hard

Prefix tuning is difficult to optimize and its performance is unstable. In addition, reserving part of the sequence for the prefix necessarily reduces the sequence length available to the downstream task, which likely puts prompt tuning at a disadvantage.

4. OUR METHOD

4.1 LOW-RANK-PARAMETRIZED UPDATE MATRICES

The dense layers in a neural network perform matrix multiplication, and their weight matrices normally have full rank. Inspired by the idea of a low "intrinsic rank", the author constrains the update of a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. During training, $W_0$ is frozen and only A and B receive gradient updates. For $h = W_0 x$, the modified forward pass is:
$$h = W_0 x + \Delta W x = W_0 x + B A x$$
A is initialized with a random Gaussian and B with zeros, so $\Delta W = BA$ is zero at the beginning of training. $\Delta W x$ is then scaled by $\frac{\alpha}{r}$, where $\alpha$ can be tuned roughly like a learning rate. This scaling removes the need to re-tune the hyperparameters when r changes.
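To make the update concrete, below is a minimal PyTorch sketch of a LoRA-augmented linear layer following the description above. The class name, dimensions, and default hyperparameters are illustrative assumptions, not the authors' reference implementation (see https://github.com/microsoft/LoRA for that).

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W0 (randomly initialized here as a stand-in).
        self.weight = nn.Parameter(torch.empty(d_out, d_in))
        nn.init.normal_(self.weight, std=0.02)
        self.weight.requires_grad = False

        # Trainable low-rank factors: A ~ Gaussian, B = 0, so BA = 0 at the start of training.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r  # scale the low-rank update by alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        frozen = x @ self.weight.t()
        update = (x @ self.lora_A.t()) @ self.lora_B.t()
        return frozen + self.scaling * update


layer = LoRALinear(d_in=768, d_out=768, r=8)
h = layer(torch.randn(2, 10, 768))   # (batch, seq_len, d_model)
print(h.shape)                       # torch.Size([2, 10, 768])
```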

A Generalization of Full Fine-tuning

A more general fine-tuning paradigm is to train only a subset of the pre-trained parameters. LoRA goes further: it does not require the accumulated weight update to have full rank during training, and so can deal with new tasks more flexibly. In short, as the number of trainable parameters increases, training LoRA roughly converges to training the original model, while the adapter-based method converges to an MLP and the prefix-based method converges to a model that cannot take long inputs.
:::info
With this comparison, the advantage of LoRA becomes more apparent as the number of trainable parameters grows: at worst it degenerates to training the original model, while the other two methods either become more expensive or lose usability.
:::

No Additional Inference Latency

The inference process is the same as before. When switching to a different downstream task, one can recover $W_0$ by subtracting $BA$ and then add a different $B'A'$; this is quick, has very little memory overhead, and guarantees that inference introduces no additional latency.
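A rough sketch of this merge-and-switch idea under these assumptions, with plain tensors standing in for the trained factors:

```python
import torch

d, k, r = 768, 768, 8
alpha = 16.0
W0 = torch.randn(d, k)              # frozen pre-trained weight
B = torch.randn(d, r) * 0.01        # trained LoRA factors (random stand-ins here)
A = torch.randn(r, k) * 0.01

# Merge for deployment: W = W0 + (alpha / r) * B A, so inference is a single matmul.
W_merged = W0 + (alpha / r) * (B @ A)

# Switching tasks: subtract this task's BA to recover W0, then add another task's B'A'.
W_restored = W_merged - (alpha / r) * (B @ A)
print(torch.allclose(W_restored, W0))   # True (up to floating-point error)
```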

4.2 APPLYING LORA TO TRANSFORMER

In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, the self-attention module has four weight matrices ($W_q, W_k, W_v, W_o$) and the MLP module has two more. $W_q$ (or $W_k$, $W_v$) can be treated as a single matrix of dimension $d_{model} \times d_{model}$. For simplicity and parameter efficiency, the authors adapt only the attention weights for downstream tasks and freeze the MLP modules.
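As an illustration (not the authors' code), here is a hypothetical wrapper that adds a LoRA branch around frozen query and value projections while leaving $W_k$, $W_o$, and the MLP untouched; the module names and shapes are assumptions for the sketch.

```python
import torch
import torch.nn as nn


class LoRAWrapper(nn.Module):
    """Adds a trainable low-rank branch around a frozen nn.Linear projection."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the pre-trained projection
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.lora_A.t()) @ self.lora_B.t())


# Standalone linear layers standing in for the attention projections W_q and W_v;
# W_k and W_o (and the MLP) would simply stay frozen and unwrapped.
d_model = 768
W_q = LoRAWrapper(nn.Linear(d_model, d_model), r=4)
W_v = LoRAWrapper(nn.Linear(d_model, d_model), r=4)
x = torch.randn(2, 16, d_model)
print(W_q(x).shape, W_v(x).shape)   # torch.Size([2, 16, 768]) twice
```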

Practical Benefits and Limitations

The most significant advantage of LoRA is the large reduction in GPU memory and storage usage, which also avoids GPU I/O bottlenecks. It is also possible to switch between tasks at a much lower cost by swapping only the LoRA weights rather than all parameters. However, LoRA has its own limitations: when batching inputs from different tasks in a single forward pass, each task's samples need a different BA, which is not straightforward to implement, although the LoRA modules can be selected dynamically per sample when latency is not critical.

5. EMPIRICAL EXPERIMENTS

5.1 BASELINES

The author compared LoRA experimentally with the following methods:

  • Fine-tuning. All parameters are updated.
  • Bias-only or BitFit. Only the bias vectors are trained while all other parameters are frozen.
  • Prefix-embedding tuning (PreEmbed). Inserts special tokens among the input tokens; their embeddings are trainable and generally do not correspond to words in the vocabulary. The number of trainable parameters is $|\Theta| = d_{model} \times (l_p + l_i)$, where $l_p$ and $l_i$ are the numbers of prefix and infix tokens.
  • Prefix-layer tuning (PreLayer). Learns the activations after every Transformer layer; the number of trainable parameters is $|\Theta| = L \times d_{model} \times (l_p + l_i)$, where L is the number of Transformer layers (a rough parameter-count sketch follows this list).
  • Adapter tuning. An adapter layer is inserted between the self-attention module (and the MLP module) and the subsequent residual connection. The adapter is a two-layer fully connected network with biases, and there are many variants.
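For intuition, here is a back-of-the-envelope comparison of the trainable-parameter counts implied by the formulas above, using GPT-3-like dimensions; the prefix/infix lengths and the rank are arbitrary illustrative values.

```python
d_model = 12288        # GPT-3 175B hidden size
n_layers = 96          # number of Transformer layers
l_p, l_i = 64, 64      # prefix / infix token counts (hypothetical values)
r = 4                  # LoRA rank (illustrative)

pre_embed = d_model * (l_p + l_i)               # Prefix-embedding tuning
pre_layer = n_layers * d_model * (l_p + l_i)    # Prefix-layer tuning
# LoRA on W_q and W_v in every layer: 2 matrices per layer, each decomposed into
# a (d_model x r) factor B and an (r x d_model) factor A, i.e. 2 * d_model * r each.
lora = n_layers * 2 * 2 * d_model * r

print(f"PreEmbed: {pre_embed:,}  PreLayer: {pre_layer:,}  LoRA(r={r}): {lora:,}")
```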

5.2 ROBERTA BASE/LARGE

RoBERTa improves BERT's pre-training recipe, boosting performance without introducing many more parameters. The GLUE benchmark is used to compare the different adaptation methods. For a fair comparison, all tasks use the same batch size, and the model for each task is initialized from the original pre-trained checkpoint rather than from a model already adapted to another task, so as to verify the ability on each task independently.
(Table: results of LoRA and the baseline adaptation methods on the GLUE benchmark with RoBERTa base/large and DeBERTa XXL.)

5.3 DEBERTA XXL

DeBERTa is another variant of BERT trained at a much larger scale; its results are included in the table above as well.

5.4 GPT-2 MEDIUM/LARGE

LoRA can replace full fine-tuning on NLU models, so one hopes it also performs well on NLG models. Taking GPT-2 as an example, the following table shows the results on the E2E NLG Challenge:
(Table: GPT-2 medium/large results on the E2E NLG Challenge for different adaptation methods.)

5.5 SCALING UP TO GPT-3 175B

Scaling up further, LoRA is applied to the 175B-parameter GPT-3. The results are shown in the following table:
(Table: performance of different adaptation methods on GPT-3 175B.)
Note that more trainable parameters are not always better. As shown in the figure below, performance drops significantly when more than 256 special tokens are used for prefix-embedding tuning or more than 32 special tokens are used for prefix-layer tuning.
(Figure: validation accuracy versus the number of trainable parameters for several adaptation methods on GPT-3 175B.)

6. RELATED WORKS

Transformer Language Models

Omitted.

Prompt Engineering and Fine-Tuning

The output of GPT-3 is largely determined by its input prompt. Writing and formatting prompts to maximize a model's performance on the desired task is an empirical skill known as prompt engineering.
Fine-tuning is to retrain a pre-trained model to solve a specific task. Full fine-tuning has a high barrier to entry due to its huge memory footprint.

Parameter-Efficient Adaptation

For parameter-efficient adaptation, many works embed adapter layers into the neural network. The low-rank decomposition in this paper is similar in spirit, but LoRA has the advantage that the learned weights can be merged with the frozen weights during inference, so no latency is introduced. In addition, although prefix-tuning can stand in for fine-tuning, it occupies part of the token length available for the input.

Low-Rank Structures in Deep Learning

Low-rank structure is very common in machine learning, and even more so in over-parameterized neural networks. Many prior methods use low-rank decompositions, but they do not freeze the original weights and train only a low-rank update for adaptation. There is also theoretical work suggesting that low-rank adaptation can be useful, for example for adversarial training.

7. UNDERSTANDING THE LOW-RANK UPDATES

This section uses a series of empirical studies to answer the following questions:

  1. Given a parameter budget constraint, which subset of the weight matrix in the Transformer should be tuned to maximize downstream performance?
  2. Is the optimal adaptation matrix $\Delta W$ really low-rank, and what rank is appropriate in practice?
  3. What is the connection between $\Delta W$ and $W$? Are the two highly correlated, and how large is $\Delta W$ compared with $W$?

Questions 2 and 3 actually reveal the true principles of fine-tuning.

7.1 WHICH WEIGHT MATRICES IN TRANSFORMER SHOULD WE APPLY LORA TO?

Considering only the weight matrices in the self-attention module, the results are shown in the table below.
(Table: validation accuracy when the parameter budget is spent on different subsets of the attention weight matrices.)
Note that adapting $\Delta W_q$ and $\Delta W_k$ simultaneously leads to lower performance, while adapting $\Delta W_q$ and $\Delta W_v$ simultaneously gives the best results. Therefore, adapting more weight matrices, each with a lower rank, is preferable to adapting a single weight matrix with a higher rank.

7.2 WHAT IS THE OPTIMAL RANK r FOR LORA?

Experiments with different choices of r give the results shown in the following table:
(Table: validation accuracy for different ranks r.)
LoRA is already competitive with a very small r, which further suggests that the update matrix $\Delta W$ has a very small intrinsic rank. To further support this finding, the author ran the experiment with different random seeds and obtained similar results.

However, this does not mean that a very small intrinsic rank works well for every task; it depends on the specific downstream task. If the gap between the downstream task and the pre-training task is too large, fine-tuning is closer to retraining, and in that case a high rank can outperform a low rank.

Subspace similarity between different r

Given the adaptation matrices learned with rank 8 and rank 64 from the same pre-trained model, perform singular value decomposition and obtain the right-singular matrices. The question is: how much of the subspace spanned by the first i singular vectors of the rank-8 matrix (1 ≤ i ≤ 8) is contained in the subspace spanned by the first j singular vectors of the rank-64 matrix? The authors measure this with a normalized subspace similarity based on the Grassmann distance:
$$\phi\left(A_{r=8}, A_{r=64}, i, j\right)=\frac{\left\|U_{A_{r=8}}^{i \top} U_{A_{r=64}}^{j}\right\|_F^2}{\min (i, j)} \in[0,1]$$
$\phi(\cdot)$ ranges over [0, 1]: 1 means the two subspaces overlap completely, and 0 means they are completely separate. The following figure shows how $\phi(\cdot)$ varies.
(Figure: normalized subspace similarity between the singular directions of $A_{r=8}$ and $A_{r=64}$.)
In the figure above, there is a significant overlap between the top singular-vector directions of the rank-8 and rank-64 matrices, but not between the other directions. A likely reason is that the other directions mostly contain random noise accumulated during training. Therefore, the adaptation matrix can indeed have a very low rank.
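As a sketch (random matrices standing in for the learned $A_{r=8}$ and $A_{r=64}$, which would really come from training), the similarity measure above could be computed as follows:

```python
import torch


def phi(A_small: torch.Tensor, A_large: torch.Tensor, i: int, j: int) -> float:
    """Normalized subspace similarity between the top-i and top-j right-singular
    directions of two adaptation matrices; values lie in [0, 1]."""
    _, _, Vh_small = torch.linalg.svd(A_small, full_matrices=False)
    _, _, Vh_large = torch.linalg.svd(A_large, full_matrices=False)
    U_i = Vh_small[:i, :]    # top-i right-singular vectors (as rows)
    U_j = Vh_large[:j, :]    # top-j right-singular vectors (as rows)
    return (torch.linalg.norm(U_i @ U_j.T, ord="fro") ** 2 / min(i, j)).item()


# Random stand-ins with the right shapes (r x d_model).
A_r8 = torch.randn(8, 768)
A_r64 = torch.randn(64, 768)
print(phi(A_r8, A_r64, i=4, j=32))   # near 0 for random matrices, higher when directions are shared
```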

Subspace similarity between different random seeds

To confirm this further, the author plots the normalized subspace similarity between the adaptation matrices learned from two different random seeds at r = 64, as shown in the figure below:
(Figure: normalized subspace similarity between the adaptation matrices learned with two random seeds, r = 64.)
It can be observed from the figure above that $\Delta W_q$ has a higher intrinsic rank than $\Delta W_v$.

7.3 HOW DOES THE ADAPTATION MATRIX ∆W COMPARE TO W ?

The author further explores the connection between $\Delta W$ and $W$: mathematically, does $\Delta W$ mostly lie in the top singular directions of $W$, and how large is $\Delta W$ compared with the corresponding directions of $W$? To answer this, the author projects $W$ onto the r-dimensional subspace of $\Delta W$ by computing $U^{\top} W V^{\top}$, where $U$ and $V$ are the left and right singular matrices of $\Delta W$, and compares its Frobenius norm with that of $W$. For further comparison, $U$ and $V$ are also replaced by the first r singular vectors of $W$ itself or by a random matrix.
(Table: Frobenius norms of $U^{\top} W_q V^{\top}$ when $U$ and $V$ come from $\Delta W_q$, from $W_q$ itself, or from a random matrix.)
The following conclusions can be drawn from the table above:

  1. Compared with a random matrix, $\Delta W$ has a stronger correlation with $W$;
  2. $\Delta W$ only amplifies directions that are not emphasized in $W$;
  3. The amplification factor is quite large (6.91 / 0.32 ≈ 21.5).

In addition, the appendix suggests that the low-rank adaptation matrix potentially amplifies features that are important for a specific downstream task and that were learned but not emphasized in the general pre-trained model.
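Below is a sketch of the projection comparison described in this subsection, with random tensors standing in for $W_q$ and $\Delta W_q$; the printed ratio is illustrative only and is not the paper's 21.5.

```python
import torch

d, r = 768, 4
W = torch.randn(d, d)                             # stands in for the pre-trained W_q
Delta_W = torch.randn(d, r) @ torch.randn(r, d)   # a rank-r update standing in for BA

U, S, Vh = torch.linalg.svd(Delta_W)
U_r, Vh_r = U[:, :r], Vh[:r, :]      # top-r left / right singular directions of Delta_W

proj = U_r.T @ W @ Vh_r.T            # projection of W onto Delta_W's r-dimensional subspace
amplification = (torch.linalg.norm(Delta_W, ord="fro") / torch.linalg.norm(proj, ord="fro")).item()
print(f"||Delta_W||_F / ||U^T W V^T||_F = {amplification:.2f}")
```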

8. CONCLUSION AND FUTURE WORK

LoRA is an effective low-rank adaptation strategy that introduces no inference latency and does not reduce the input sequence length, while preserving model quality. Future directions for LoRA include the following:

  1. Combined with other effective fine-tuning methods.
  2. How to make the features learned during pre-training perform well on downstream tasks.
  3. Is there a more principled way to choose which weight matrices LoRA should be applied to?
  4. The low-rank property of $\Delta W$ suggests that $W$ itself may also be rank-deficient.

Reading Summary

This 2021 work on efficient fine-tuning became popular again in 2023, when large models took off. I think there are several important reasons:

  1. The principle is simple. It turns the earlier serial adapter into a parallel branch of two low-rank matrices, which after training can be merged into the frozen weights for inference.
  2. The advantages are obvious. Unlike the adapter, it requires no additional inference time, and unlike prefix-tuning, it does not occupy the input token sequence length.
  3. The effect is remarkable. Experiments show that LoRA is not inferior to full fine-tuning.
  4. Plug and play. A trained LoRA module can be reused directly in similar downstream task scenarios.
  5. The story is well told. The author traces the idea to its source: first proposing the method, then exploring why it works. LoRA works because the essence of large-model fine-tuning is the effective use of parameters. How is that shown? The low-rank property is demonstrated through experiments and analysis. And which parameters get used? Experiments show that the adaptation amplifies directions that are not emphasized during pre-training but are important for the downstream task.

Although the argument is not perfectly rigorous, the overall reasoning is clear and convincing, and well worth learning from. It has also inspired me: work should not stay on the surface but dig deep and see the essence through the phenomena; once the essence is grasped, inspiration will surely follow.
