[NLP] LLM efficient fine-tuning (PEFT)--LoRA

LoRA

Background

Neural networks contain many fully connected layers implemented as matrix multiplications, and the weight matrices of these layers are typically full rank. However, when a model is adapted to a specific task, the weight updates actually have a very low intrinsic rank. The authors of the paper therefore argue that the update to a parameter matrix can be projected into a much smaller subspace and still be learned effectively; in other words, the weight updates do not need full rank for a specific downstream task.

Technical principle

LoRA (paper: LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS) is built on the core idea of modeling the change in parameters with a low-rank decomposition, so that a large model can be trained indirectly with only a very small number of trainable parameters.

For the modules that involve matrix multiplication, a new branch is added alongside the original PLM weight, consisting of two matrices A and B multiplied together. The first matrix A reduces the dimension and the second matrix B raises it back, with the intermediate dimension being r, which simulates the so-called intrinsic rank.

The trainable branch has the same input and output dimension d as the pre-trained layer. The input is first projected from dimension d down to r through one linear layer and then mapped from r back to d through another, with r << d (r is the rank of the matrices). The computation thus changes from one d×d matrix to a d×r matrix plus an r×d matrix, which greatly reduces the number of trainable parameters.

During downstream training, the other parameters of the model are frozen and only the weights of the two newly added matrices are optimized. The output of the original weight W and of the new path are added together as the final result (the input and output dimensions of the two paths are the same), i.e. h = Wx + BAx. The first matrix A is initialized from a Gaussian distribution and the second matrix B is initialized to zero, which guarantees that the new path BA = 0 at the start of training and therefore has no influence on the model output.
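To make this concrete, here is a minimal PyTorch sketch of such a layer; the class name, the square d×d shape, and the 0.01 initialization scale are illustrative choices rather than the paper's exact implementation, and the alpha/r scaling discussed further below is omitted.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank path B(Ax)."""
    def __init__(self, d: int, r: int):
        super().__init__()
        # Pre-trained weight, shown here as a random placeholder; it stays frozen.
        self.W = nn.Parameter(torch.randn(d, d), requires_grad=False)
        # A (r x d) projects down to rank r, Gaussian-initialized.
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)
        # B (d x r) projects back up, zero-initialized so BA = 0 at the start.
        self.B = nn.Parameter(torch.zeros(d, r))

    def forward(self, x):
        # h = Wx + BAx
        return x @ self.W.T + x @ (self.B @ self.A).T

layer = LoRALinear(d=768, r=8)
h = layer(torch.randn(1, 768))  # identical to the frozen layer at step 0, since BA = 0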

During inference, the outputs of the two branches are simply added, h = Wx + BAx = (W + BA)x, so after training the matrix product BA can be added to the original weight matrix W and used as the new weight to replace W in the original PLM; inference then requires no additional computation.

Why does updating ΔW require so few parameters?

Now, let's tackle the obvious question: if we introduce new weight matrices, how can this be parameter-efficient? The new matrices WA and WB can be very small. For example, suppose the layer dimensions are 100 and 500, so ΔW has 100 × 500 = 50,000 entries. If we instead decompose this into a 100×5 matrix WA and a 5×500 matrix WB, the two matrices together have only 5×100 + 5×500 = 3,000 parameters.
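The arithmetic can be checked directly; the snippet below simply reproduces the 100×500 example from the paragraph above.

d_in, d_out, r = 100, 500, 5

full_update_params = d_in * d_out      # ΔW: 100 * 500 = 50,000 entries
lora_params = d_in * r + r * d_out     # W_A (100x5) + W_B (5x500) = 500 + 2,500 = 3,000

print(full_update_params, lora_params)      # 50000 3000
print(lora_params / full_update_params)     # 0.06 -> only 6% of the full update size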

The authors also state clearly in the abstract that, compared with full-parameter fine-tuning of GPT-3, LoRA reduces the number of trainable parameters by a factor of 10,000 and the GPU memory requirement by a factor of 3, while remaining comparable in quality to the fully fine-tuned model.

Why the constant emphasis on specific tasks? Because LoRA rests on the assumption that when fine-tuning on a specific task, the updated parameter matrix has a low intrinsic dimensionality. You can think of a LoRA module as a piece of equipment with a specific ability, and the pre-trained model as the game character itself. On top of the pre-trained model (the character), a specific LoRA module (a piece of equipment) boosts performance on a specific task, but that module will not help on other, unrelated tasks. If you want performance comparable to full-parameter fine-tuning on several tasks at once, you need to train a separate ΔW module (a separate piece of equipment) for each task and combine them in the end. If, however, you want the model (the character) itself to become stronger overall, full-parameter fine-tuning is the more appropriate choice.

As for whether LoRA is suitable as a general solution for instruction fine-tuning, there is one question I have not resolved: do general instruction samples really share a unified low-rank representation? Since the samples at the instruction fine-tuning stage are actually a mixture of multi-task instructions, whether LoRA fits this setting needs a more thorough evaluation.

In code, the low-rank branch looks like this (the snippet below follows the structure of the Hugging Face peft implementation):

# Initialize the low-rank matrices A and B
self.lora_A.update(nn.ModuleDict({adapter_name: nn.Linear(self.in_features, r, bias=False)}))
self.lora_B.update(nn.ModuleDict({adapter_name: nn.Linear(r, self.out_features, bias=False)}))
self.scaling[adapter_name] = lora_alpha / r

# Forward pass: frozen pre-trained weight plus the scaled low-rank branch
result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
result += (
    self.lora_B[self.active_adapter](
        self.lora_A[self.active_adapter](self.lora_dropout[self.active_adapter](x))
    )
    * self.scaling[self.active_adapter]
)

In addition, the Transformer weight matrices include Wq, Wk, and Wv used to compute query, key, and value in the attention module, the output projection Wo of multi-head attention, and the weight matrices of the MLP layers. LoRA is applied only to the four weight matrices in the attention module, and ablation experiments show that adapting Wq and Wv together gives the best results.
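For reference, this is roughly how the target weight matrices are selected with the Hugging Face peft library; the module names "q_proj" and "v_proj" are an assumption that holds for LLaMA-style models and differ across architectures.

from peft import LoraConfig, TaskType, get_peft_model

# Apply LoRA only to the attention query and value projections (Wq, Wv),
# following the finding that adapting these two works best.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank matrices
    lora_alpha=16,                        # scaling factor (scaling = lora_alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names for LLaMA-style models
)

# model = get_peft_model(base_model, lora_config)  # base_model: a loaded transformers model
# model.print_trainable_parameters()               # only the LoRA weights are trainable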

In pseudocode, the regular forward pass and the LoRA forward pass compare as follows:

import math
import torch
import torch.nn as nn

input_dim = 768  # e.g., the hidden size of the pre-trained model
output_dim = 768  # e.g., the output size of the layer
rank = 8  # The rank 'r' for the low-rank adaptation
alpha = 1  # scaling factor for the LoRA branch (see below)

W = ... # from pretrained network with shape input_dim x output_dim

W_A = nn.Parameter(torch.empty(input_dim, rank)) # LoRA weight A
W_B = nn.Parameter(torch.empty(rank, output_dim)) # LoRA weight B

# Initialization of LoRA weights
nn.init.kaiming_uniform_(W_A, a=math.sqrt(5))
nn.init.zeros_(W_B)

def regular_forward_matmul(x, W):
    h = x @ W  # regular matrix multiplication
    return h

def lora_forward_matmul(x, W, W_A, W_B):
    h = x @ W  # regular matrix multiplication
    h += x @ (W_A @ W_B) * alpha  # add the scaled LoRA branch
    return h

In the pseudocode above, alpha is a scaling factor that rescales the combined result (the original model output plus the low-rank adaptation). It balances the knowledge of the pre-trained model against the new task-specific adaptation; by default, alpha is usually set to 1 (in the peft snippet further up, the corresponding factor is scaling = lora_alpha / r). Also note that while W_A is initialized with small random weights, W_B is initialized to 0, so ΔW = W_A W_B = 0 at the beginning of training, which means training starts from the original weights.
   

Experiments also found that adapting more types of weight matrices matters more than increasing the rank r; increasing r does not necessarily cover a more meaningful subspace.

Setting of Rank r

A very direct question is: In practice, what is the appropriate rank to set?

The authors ran several sets of comparison experiments and found that the rank can be very low: values no higher than 8 work well, and even r = 1 performs surprisingly well.

In practice, r is usually chosen as 4, 8, or 16.

The experiments also show that, on many datasets, LoRA can match the performance of full fine-tuning while training only a small fraction of the parameters, and it even outperforms full fine-tuning on some tasks.

Reduce inference overhead

Note that in practice, if we keep the original weights W and the matrices W_A and W_B separate after training, as shown above, we incur a small efficiency loss during inference because of the extra computational step. Instead, we can update the weights after training via W' = W + W_A W_B, which corresponds to the W' = W + ΔW mentioned earlier.
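Sticking with the variable names from the pseudocode above, the merge is a one-off operation after training; the check at the end is only a sketch showing that the merged single matrix reproduces the two-branch output (it assumes W has been loaded as a real input_dim × output_dim tensor rather than the "..." placeholder).

with torch.no_grad():
    W_merged = W + (W_A @ W_B) * alpha  # W' = W + ΔW, with ΔW = W_A W_B (scaled)

# The merged weight gives the same result as keeping the two branches separate,
# up to floating-point error.
x = torch.randn(1, input_dim)
assert torch.allclose(lora_forward_matmul(x, W, W_A, W_B),
                      regular_forward_matmul(x, W_merged), atol=1e-5)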


However, there may be practical advantages to keeping the weight matrices W_A and W_B separate. For example, suppose we want to use our pre-trained model as the base model for many customers and create a fine-tuned LLM for each of them. In that case, we do not need to store a complete merged weight matrix W' for every customer; storing all the weights W' = W + W_A W_B per customer would be very expensive, since LLMs usually have billions to trillions of weight parameters. Instead, we can keep the original model W once and only store the new lightweight matrices W_A and W_B for each customer.
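In code terms this simply means saving the two small matrices per customer instead of the full model; the file names below are of course only placeholders, and the variable names are those of the sketch above.

# Per-customer checkpoint: only the LoRA factors, not the merged weight W'.
torch.save({"W_A": W_A, "W_B": W_B}, "customer_a_lora.pt")

# At load time, the shared base weight W stays as-is and the small factors are re-attached.
lora_state = torch.load("customer_a_lora.pt")
W_A, W_B = lora_state["W_A"], lora_state["W_B"]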


To illustrate this with concrete numbers, a full 7B LLaMA checkpoint requires 23 GB of storage, whereas the LoRA weights with rank r = 8 can be as small as 8 MB.
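A back-of-the-envelope check of that order of magnitude: the layer count and hidden size below are those of LLaMA-7B, while the assumption that LoRA is applied only to Wq and Wv and stored in 16-bit precision is mine.

n_layers, d_model, r = 32, 4096, 8            # LLaMA-7B: 32 layers, hidden size 4096
params_per_proj = d_model * r + r * d_model   # W_A + W_B for one projection matrix
lora_params = n_layers * 2 * params_per_proj  # q_proj and v_proj in every layer

print(lora_params)                  # 4,194,304 trainable parameters
print(lora_params * 2 / 1e6, "MB")  # ~8.4 MB at 2 bytes per parameter (16-bit)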

Using LoRA can have the following advantages:

  1. For different downstream tasks, only the low-rank matrices with few parameters need to be trained, while the pre-trained weights can be shared across tasks;
  2. Gradients and optimizer states no longer need to be computed or stored for the frozen pre-trained weights, which greatly improves training efficiency and lowers the hardware requirements;
  3. The trained low-rank matrices can be merged into the pre-trained weights, turning the multi-branch structure into a single branch, so there is no additional inference latency;
  4. It is orthogonal to earlier parameter-efficient fine-tuning methods (such as Adapter and Prefix-Tuning) and can be combined with them.

QLoRA and AdaLoRA

LoRA, the current favorite: is it the right way to fine-tune today's LLMs? - Zhihu (zhihu.com)


Origin blog.csdn.net/zwqjoy/article/details/131995503