A Simple Understanding of LoRA (Low-Rank Adaptation) for Parameter-Efficient Fine-Tuning of Large Models

[Paper] [Code] [ICLR 2022]

Note before reading: this post may contain inaccuracies, oversimplifications, or errors; it is for reference only.


Network structure

[Figure: LoRA structure — the frozen pretrained weight matrix with a parallel trainable low-rank branch B·A]
The parameters of the original model are frozen; the only trainable parameters are the additionally introduced LoRA parameters (implemented via nn.Parameter).


The nature of model fine-tuning

Denote the pretrained parameters of the network as $W_0 \in \mathbb{R}^{d \times k}$. After fine-tuning on a new downstream task, the parameters become $W \in \mathbb{R}^{d \times k}$, so the parameter change is $\Delta W = W - W_0$, i.e. $W = W_0 + \Delta W$. In other words, instead of updating the original parameters, fine-tuning can freeze $W_0$ directly and only learn the change $\Delta W = W - W_0$.
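To make this concrete, here is a minimal PyTorch sketch of the idea (an illustration only, not the paper's implementation; the class name `DeltaLinear` and the shapes are assumptions):

```python
import torch
import torch.nn as nn

class DeltaLinear(nn.Module):
    """Illustration: keep the pretrained W0 frozen and learn only the change delta_W."""
    def __init__(self, pretrained_weight: torch.Tensor):
        super().__init__()
        d, k = pretrained_weight.shape
        # W0 is a buffer: it is stored with the module but receives no gradient.
        self.register_buffer("W0", pretrained_weight.clone())
        # The only trainable parameter: the (still full-rank) change delta_W.
        self.delta_W = nn.Parameter(torch.zeros(d, k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Computing with W = W0 + delta_W is equivalent to fine-tuning W directly.
        return x @ (self.W0 + self.delta_W).T
```

Learning a full-rank `delta_W` like this still costs as many trainable parameters as ordinary fine-tuning; LoRA's contribution, discussed next, is to make this change low-rank.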


Why low-rank decomposition works

The LoRA paper points out that existing pretrained models are typically over-parameterized ("the learned over-parametrized models in fact reside on a low intrinsic dimension"). When such models are fine-tuned, the parameter updates mainly lie in a low-dimensional subspace; in other words, the components in many of the remaining high-dimensional directions barely move before and after fine-tuning. Based on this, the learned change $\Delta W$ does not actually need such a high dimension (rank), and we can restrict it to a much lower rank for optimization. It also follows that if a large share of the parameter update does occur in the high-dimensional subspace, the low-rank decomposition will miss that information and LoRA will fail.


How to understand low-dimensional vs. high-dimensional subspace features

Here is an analogy that may not be entirely accurate. In computer vision, downstream tasks such as segmentation, detection, or medical imaging are commonly fine-tuned from a model pretrained on ImageNet (e.g. ResNet). Features such as texture, edges, and contours in the pretrained model are needed no matter what the task is; these task-independent features are analogous to the high-dimensional subspace features mentioned above and basically do not need to change during downstream fine-tuning. Conversely, features tied to the priors of a downstream task (e.g. particular lighting conditions or object location distributions) can be regarded as the low-dimensional subspace features; if the model is to reach SOTA, it must make effective use of these task-related features.


The low-rank decomposition in mathematical form

LoRA decomposes the parameter change matrix $\Delta W$ into the product of two lower-rank matrices: $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.
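A minimal PyTorch sketch of this factorization is given below. It follows the formulas above but is not the official implementation; the $\alpha/r$ scaling follows the paper's convention, while the default values of `r`, `alpha`, and the Gaussian standard deviation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of y = (W0 + B A) x with W0 frozen and only A, B trainable."""
    def __init__(self, pretrained_weight: torch.Tensor, r: int = 8, alpha: float = 16.0):
        super().__init__()
        d, k = pretrained_weight.shape                          # W0 in R^{d x k}
        self.register_buffer("W0", pretrained_weight.clone())   # frozen pretrained weight
        self.A = nn.Parameter(torch.empty(r, k))                # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))                # B in R^{d x r}, zero init
        nn.init.normal_(self.A, std=0.02)                       # Gaussian init (assumed std)
        self.scaling = alpha / r                                # alpha/r scaling as in the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_W = self.B @ self.A                               # Delta W = B A, rank at most r
        return x @ (self.W0 + self.scaling * delta_W).T         # (..., k) -> (..., d)
```

Since $B$ has $d \times r$ entries and $A$ has $r \times k$, the trainable parameter count per matrix drops from $dk$ to $r(d + k)$.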


Why matrix B is initialized to zero while matrix A uses Gaussian initialization

The drawbacks of the two alternative setups are:

  • If both B and A are initialized to zero, the drawback is the same as all-zero initialization in a deep network: gradients can easily vanish (because all neurons start out functionally equivalent).
  • If both B and A use Gaussian initialization, there is some probability of producing an excessively large offset $\Delta W$ at the very start of training, which introduces too much noise and makes convergence harder.

Therefore, initializing one matrix to zero and the other with a Gaussian keeps the network's original output unchanged at the start of training (the initial offset is zero), while still ensuring good convergence once learning actually begins.
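The effect of this asymmetric initialization can be checked directly; the sizes below are arbitrary example values.

```python
import torch
import torch.nn as nn

d, k, r = 512, 512, 8                        # example sizes, chosen arbitrarily

B = nn.Parameter(torch.zeros(d, r))          # B = 0
A = nn.Parameter(torch.randn(r, k) * 0.02)   # A ~ Gaussian

delta_W = B @ A                              # Delta W = B A = 0 at initialization
print(torch.count_nonzero(delta_W).item())   # 0: the pretrained output is untouched at step 0
```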


How low can the rank go?

Even reducing the rank to r = 8 works well, and even r = 1 remains usable, as shown below:
[Table from the LoRA paper: downstream performance for different ranks r]
Note that performance even degrades for r = 64. Interpreted in light of the earlier conclusion, this is because the parameter updates mostly lie in a low-rank subspace; a higher-rank matrix allows updates in higher-dimensional directions, but may introduce extra, unnecessary parameter changes (i.e. noise).
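A rough back-of-the-envelope count illustrates why small ranks are attractive (the 4096 × 4096 projection size below is an assumed example, not a number from the paper):

```python
d = k = 4096                                 # assumed size of one attention projection
full = d * k                                 # full-rank Delta W: 16,777,216 parameters
for r in (1, 8, 64):
    lora = r * (d + k)                       # B is d x r, A is r x k
    print(r, lora, f"{lora / full:.4%}")     # e.g. r=8 -> 65,536 params, about 0.39% of full rank
```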


Where LoRA is inserted in the network

In the original paper, LoRA is added only to the Q, K, V, and O matrices of the self-attention layers; other positions such as the MLP are left untouched. Some follow-up experiments [1] showed that on other tasks it can be better to add LoRA only to Q and K, as shown in the figures below. Which matrices to adapt can therefore be treated as a tunable choice when applying LoRA in practice (see the injection sketch after the figures).
[Figures from [1]: ablation on which attention projection matrices to equip with LoRA]
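In practice this amounts to swapping the chosen projection layers for LoRA-wrapped ones. The sketch below reuses the `LoRALinear` sketch from earlier; the module names in `target_names` are assumptions that depend on the specific Transformer implementation, and bias terms are ignored for simplicity.

```python
import torch.nn as nn

def inject_lora(model: nn.Module,
                target_names=("q_proj", "k_proj", "v_proj", "o_proj"),
                r: int = 8) -> None:
    """Replace selected nn.Linear projections with LoRALinear and freeze everything else."""
    for module in list(model.modules()):
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and any(t in child_name for t in target_names):
                setattr(module, child_name, LoRALinear(child.weight.data, r=r))
    # Only the LoRA matrices A and B remain trainable.
    for name, param in model.named_parameters():
        param.requires_grad = name.split(".")[-1] in ("A", "B")
```

Changing `target_names` (e.g. to only the Q and K projections) is exactly the tunable choice mentioned above.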


The difference between LoRA and Adapter

In fact, from a structural point of view, the earlier Adapter also introduces only a small number of trainable parameters and likewise has a bottleneck structure that first reduces and then restores the dimension, as shown below:

[Figure: Adapter module with a down-projection/up-projection bottleneck inserted into the Transformer block]

The main differences are as follows (a code sketch of the adapter bottleneck follows the list):

  • Insertion position. LoRA is placed in parallel, as a residual branch, alongside the Transformer's Q, K, V, O matrices, whereas the Adapter is inserted serially after the feed-forward layer.
  • Inference latency. After training, LoRA's parameters can be merged directly into the original pretrained weights, recovering a single-branch structure and introducing no extra latency; the Adapter, by contrast, adds extra serial network layers and therefore extra latency.
  • Parameter storage. When fine-tuning with LoRA, only the LoRA parameters themselves need to be saved after training; with the Adapter, the parameters of the entire original model need to be saved.
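For comparison, here is a minimal sketch of a bottleneck adapter; the hidden sizes, activation, and placement are illustrative assumptions rather than the exact design of any particular adapter paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project back up."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inserted serially after a sub-layer; these extra layers on the forward
        # path are what add inference latency, unlike a merged LoRA branch.
        return x + self.up(self.act(self.down(x)))
```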

How LoRA parameters are merged with the original parameters of the model

A simple way to see it is the distributive law of matrix multiplication: $W_0 x + \Delta W x = (W_0 + \Delta W)x = Wx$, i.e. the LoRA parameter matrix $\Delta W = BA$ is simply added back onto the original weights.
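A sketch of the merge, assuming the `LoRALinear` class from earlier (including its `scaling` factor):

```python
import torch

@torch.no_grad()
def merge_lora(layer) -> torch.Tensor:
    """Fold the low-rank branch back into the frozen weight: W = W0 + scaling * B A."""
    return layer.W0 + layer.scaling * (layer.B @ layer.A)
```

After merging, inference uses a single dense weight matrix, just as before fine-tuning.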


References

[1] Customized Segment Anything Model for Medical Image Segmentation
