LoRA fine-tuning

A New Look at Large Model Fine-Tuning

Everything starts with the recently popular LoRA paper (《LoRA: Low-Rank Adaptation of Large Language Models》), presented at ICLR 2022. Its claim is that, using a low-rank adaptation method, only a small number of parameters need to be trained to adapt a large model to downstream tasks with good results.

How does LoRA fine-tune a model for a downstream task? The process is simple: LoRA trains only a small set of newly added parameters on the downstream task's data, while the pre-trained weights stay frozen. After training, the new parameters are merged into the original weights by re-parameterization, so the effect is equivalent to fine-tuning the entire model on the new task, and no extra computation or latency is introduced at inference time.
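To make the re-parameterization step concrete, here is a minimal numerical sketch (NumPy, with arbitrarily chosen shapes and no scaling factor) showing that folding a trained low-rank update back into the frozen weight gives exactly the same output as keeping the extra branch, which is why inference pays no extra cost:

```python
import numpy as np

d_in, d_out, r = 16, 16, 4          # r is much smaller than d_in and d_out
W = np.random.randn(d_out, d_in)    # frozen pre-trained weight
A = np.random.randn(r, d_in)        # stand-in for a trained LoRA factor A
B = np.random.randn(d_out, r)       # stand-in for a trained LoRA factor B
x = np.random.randn(d_in)

y_branch = W @ x + B @ (A @ x)      # training-time view: frozen path + LoRA branch
W_merged = W + B @ A                # re-parameterization: fold B @ A into W
y_merged = W_merged @ x             # inference-time view: a single matmul

assert np.allclose(y_branch, y_merged)
```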

The LoRA schematic is as follows: the blue part in the figure is the pre-trained weight matrix. LoRA adds two matrices, A and B, alongside the pre-trained weights; A is initialized from a Gaussian distribution and B is initialized to zero, so the additional contribution is zero at the start of training. A's input dimension and B's output dimension match the input and output dimensions of the original layer, while A's output dimension and B's input dimension are a rank r that is much smaller than the original dimensions. This is where "low rank" comes in (the side branch is somewhat reminiscent of a ResNet shortcut), and it greatly reduces the number of parameters to be trained. During training only A and B are updated, and the pre-trained weights are kept fixed. At inference time, the re-parameterization idea lets us fold the product of B and A into W, so no extra computation is introduced. Moreover, for a different downstream task, only a new A and B need to be retrained on top of the same pre-trained model, which also speeds up the training rhythm for large models.
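As a rough sketch of this structure (PyTorch, with hypothetical class and argument names, not the reference implementation from the paper), a LoRA-style linear layer keeps the pre-trained weight frozen, initializes A from a Gaussian and B to zero, and can fold B @ A back into the weight for inference:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight W (randomly filled here as a stand-in).
        self.weight = nn.Parameter(torch.empty(out_features, in_features),
                                   requires_grad=False)
        nn.init.kaiming_uniform_(self.weight)
        # A ~ Gaussian, B = 0, so the extra branch contributes nothing at step 0.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the low-rank side branch (only A and B receive gradients).
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    @torch.no_grad()
    def merge(self):
        # Re-parameterize for inference: fold B @ A into W, removing the extra branch.
        self.weight += (self.lora_B @ self.lora_A) * self.scaling
```

For a new downstream task, only lora_A and lora_B would be retrained while the same frozen weight is reused.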

Since this post is not a detailed introduction to LoRA, please see the original paper for the specifics. All we need here is that the experiments in the LoRA paper demonstrate the method's effectiveness. The next question is: why does LoRA's idea work so well?

The answer is the intrinsic dimension, discussed next. The LoRA paper itself mentions this point and credits the inspiration to the following two articles:

  1. MEASURING THE INTRINSIC DIMENSION OF OBJECTIVE LANDSCAPES, published at ICLR 2018; for convenience it is referred to below as [Paper 1]

  2. INTRINSIC DIMENSIONALITY EXPLAINS THE EFFECTIVENESS OF LANGUAGE MODEL FINE-TUNING, published at ACL 2021; for convenience it is referred to below as [Paper 2]

Intrinsic dimension definition

The concept of intrinsic dimension was proposed by [Paper 1].

Training a neural network often involves the following steps:

  1. For a given data set, first design the structure of the network and select the corresponding loss

  2. Randomly initialize the parameters in the network

  3. Train the network so that the loss keeps decreasing

The training phase can be considered as finding an effective path on a fixed objective landscape.

Why is the objective landscape fixed? Because once the data set and the network structure are fixed, the optimization problem is fully defined, and therefore so is the landscape.

As shown in the figure below, the key idea of [Paper 1] is to train the network not in the full D-dimensional parameter space but in a randomly chosen d-dimensional subspace of it: the full parameter vector is written as theta = theta_0 + P * theta_d, where theta_0 is the frozen initialization, P is a fixed random projection, and only the d-dimensional vector theta_d is updated. If training only these d dimensions is enough to reach the desired performance, then this d is the so-called intrinsic dimension of the model.
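Below is a rough sketch of this subspace training, under the assumption of PyTorch and a small dense random projection for readability (the original paper uses sparse or Fastfood projections to make large D feasible). Only the d-dimensional vector theta_d is trained, while theta_0 and P stay fixed:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
names_shapes = [(n, p.shape) for n, p in model.named_parameters()]
theta_0 = torch.cat([p.detach().flatten() for p in model.parameters()])  # frozen init
D, d = theta_0.numel(), 200                     # d is the candidate subspace size
P = torch.randn(D, d) / d ** 0.5                # fixed random projection, never trained
theta_d = torch.zeros(d, requires_grad=True)    # the only trainable parameters

def unflatten(flat):
    # Slice the flat vector theta_0 + P @ theta_d back into per-layer tensors.
    params, offset = {}, 0
    for name, shape in names_shapes:
        n = shape.numel()
        params[name] = flat[offset:offset + n].view(shape)
        offset += n
    return params

optimizer = torch.optim.Adam([theta_d], lr=1e-3)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))  # dummy batch
params = unflatten(theta_0 + P @ theta_d)       # theta = theta_0 + P @ theta_d
loss = nn.functional.cross_entropy(functional_call(model, params, (x,)), y)
loss.backward()                                  # gradients flow only into theta_d
optimizer.step()
```

Sweeping d upward and finding the smallest value at which this subspace training reaches roughly 90% of full-training performance gives the intrinsic dimension.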

Using the intrinsic dimension to think about the effectiveness of large-model fine-tuning

[Paper 2] uses the intrinsic dimension introduced above to examine the effectiveness of large-model fine-tuning: why can a large model now be fine-tuned effectively with only a few hundred to a few thousand data samples?

According to [Paper 1], for a given type of problem there is an intrinsic dimension at a given level of performance (for example, 90% of full-training accuracy). For a large model, measuring the intrinsic dimension tells us roughly how many parameters need to be adjusted to approximately solve a given downstream problem. If experiments show that adjusting only a small number of parameters already solves the downstream problem well, then the question above is answered: a small amount of fine-tuning (adjusting a small number of parameters) on top of the large model is enough to handle the task at hand.

Unless otherwise specified below, "the article" refers to [Paper 2].

Experiment 1: do large models have an intrinsic dimension?

The experimental results are shown in the figure below.

The upper and lower subplots correspond to the MRPC and QQP tasks respectively. Each subplot has four solid lines showing the accuracy of four models, and four dashed lines marking 90% of the accuracy of the fully fine-tuned model. The horizontal axis is the subspace dimension d used for training. The figure shows that, for both tasks and all four models, training only a relatively small number of d dimensions is enough to reach 90% of the full fine-tuning accuracy. So the concept of intrinsic dimension also holds for large models.
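A tiny helper sketch (hypothetical function names; run_subspace_training is assumed to run subspace training at dimension d, for example as in the earlier sketch, and return validation accuracy) of how the d90 value plotted in the figure could be read off, i.e. the smallest d whose accuracy reaches 90% of full fine-tuning:

```python
def estimate_d90(run_subspace_training, full_finetune_accuracy, d_grid):
    # Return the smallest subspace dimension d whose accuracy reaches 90% of
    # the fully fine-tuned model's accuracy, i.e. the d90 intrinsic dimension.
    threshold = 0.9 * full_finetune_accuracy
    for d in sorted(d_grid):
        if run_subspace_training(d) >= threshold:
            return d
    return None  # no d in the grid reached the threshold

# Example with a toy accuracy curve that improves as d grows:
print(estimate_d90(lambda d: 0.80 + d / 10000, 0.90, [100, 200, 500, 1000, 2000]))
```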

Therefore, when training on a downstream task, only a small number of parameters need to be trained to achieve good results. At this point the question posed at the beginning of the article is answered. But the authors ran some further experiments and found other interesting conclusions.

The relationship between the quality of pre-training and the intrinsic dimension

The article proposes the hypothesis that pre-training implicitly lowers the model's intrinsic dimension on each NLP task.

Based on this conjecture, the article runs the following experiment: while pre-training a RoBERTa-base model, a checkpoint is saved every 10K updates, and the intrinsic dimension of each saved checkpoint is then measured on six data sets: MRPC, QQP, Yelp Polarity, SST-2, MNLI, and ANLI.

The results are as follows: the same trend appears across all the data sets, namely that the longer the model is pre-trained, the lower its intrinsic dimension on each task. The experiments never optimize the intrinsic dimension directly; only the pre-training runs longer. This confirms that the stronger the representation ability of the pre-trained model (the better it is trained), the smaller the intrinsic dimension.

The relationship between pre-trained model parameters and intrinsic dimensions

Ideally, when studying the relationship between the number of pre-trained parameters and the intrinsic dimension, the model architecture should be held fixed, which would be more convincing. But the authors note that this would require training many large models from scratch, so for ease of comparison the experiment is done with existing pre-trained models of different architectures. Judging from the trend in the results, valid conclusions can still be drawn across architectures.

The article takes a range of existing pre-trained models and computes their intrinsic dimension on the MRPC data set.

The experimental results are as follows: the vertical axis in the figure is the intrinsic dimension, and the horizontal axis is the number of model parameters. The trend is clear: the larger the model, the smaller the intrinsic dimension; in other words, the stronger the model, the lower its intrinsic dimension.

The relationship between intrinsic dimension and generalization ability

The previous subsections covered the relationship between fine-tuning and the intrinsic dimension, and between pre-training and the intrinsic dimension, but not the relationship between the intrinsic dimension and generalization. That is, we now know how to make the intrinsic dimension small, but does a small intrinsic dimension actually improve generalization? The article's measurements show that models with a lower intrinsic dimension reach higher accuracy after fine-tuning; in other words, the lower the intrinsic dimension, the better the generalization.

Back to the opening question: why does the LoRA idea work?

Because large pre-trained models have a low intrinsic dimension, adjusting only a small number of parameters is enough to achieve good results on downstream tasks.


Origin blog.csdn.net/qq_29788741/article/details/131798272