LoRA principle analysis


Preface

As model sizes keep growing, fine-tuning all of a model's parameters (so-called full fine-tuning) becomes less and less practical. Taking GPT-3 with its 175B parameters as an example, every new domain would require fully fine-tuning and storing a separate copy of the model, which is extremely expensive.

Paper: LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
Code: https://github.com/microsoft/LoRA

Problems with existing solutions

Adapter Tuning

Simply put, the adapter approach freezes the original parameters and adds a small number of extra parameters for fine-tuning. As shown in the figure below, two adapters are inserted into each transformer block: one after the multi-head attention and one after the FFN.
[Figure: a transformer block with adapter modules inserted after the multi-head attention and after the FFN]
As the figure shows, adapters add extra layers to the model, which slows down inference.
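
To make the structure concrete, here is a minimal sketch of the bottleneck adapter idea (the class name, bottleneck size, and activation are illustrative choices, not the original adapter implementation):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Illustrative bottleneck adapter: down-project, non-linearity, up-project,
    # with a residual connection; inserted after the attention and FFN sub-layers.
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual keeps the original path; only the small bottleneck is new.
        return x + self.up(self.act(self.down(x)))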

Prefix Tuning

[Figure: Prefix Tuning - trainable prefix vectors prepended to the input at each transformer layer]

Specifically, at each transformer layer, trainable virtual token embeddings (the prefix) are prepended to the sentence representation. For autoregressive models (the GPT series), a continuous prefix is added in front of the sentence, i.e. Z = [PREFIX; x; y].
For encoder-decoder models (e.g. T5), continuous prefixes are added before both the encoder input and the decoder input, i.e. Z = [PREFIX; x | PREFIX'; y].
The process of adding the prefix is shown in the figure above.

Although prefix-tuning does not add many extra parameters, it is difficult to optimize, and the prefix occupies part of the context window, reducing the sequence length available to downstream tasks.
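
As a rough sketch of the idea (the names and initialization are illustrative, not the Prefix-Tuning reference implementation), prepending trainable virtual tokens to the input embeddings could look like this:

import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    # Learn `prefix_len` virtual token embeddings and prepend them to the
    # (frozen) token embeddings: Z = [PREFIX; x].
    def __init__(self, prefix_len: int, d_model: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # The prefix occupies part of the model's maximum context, which is why
        # the sequence length available to the downstream task shrinks.
        return torch.cat([prefix, token_embeds], dim=1)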

LoRA

Several key advantages of LoRA:

  • The pre-trained model can be shared across tasks, saving disk storage.
  • When switching tasks, only the LoRA weights need to be swapped, which is cheap.
  • During training, only the LoRA weights are updated, so memory consumption is low.

[Figure 1: LoRA - the pre-trained weight W is frozen, while a low-rank side branch (matrices A and B) is trained and its output is added to that of W]
Simple understanding: a "side branch" is added next to a Linear layer of the model; this side branch is trained in place of the original parameter matrix W, which stays frozen.

Combined with the figure above, let's walk through the process intuitively. The input is $x$, with dimension $d$. In an ordinary transformer, $x$ may be the output of the embedding layer or the output of the previous transformer layer, and $d$ is typically 768 (most BERT variants use a hidden size of 768). Along the original route, $x$ would only pass through the left part, i.e. the original model.

Under the LoRA strategy, a side branch is added on the right: a Linear layer $A$ first reduces the data from dimension $d$ down to dimension $r$. This $r$ is the rank of LoRA and is its most important hyperparameter; it is generally much smaller than $d$ (4 and 8 are common choices). This matters especially for current large models, where $d$ is no longer just 768 or 1024: in LLaMA-7B, for example, each transformer layer has 32 attention heads of dimension 128, so $d$ reaches 4096.

A second Linear layer $B$ then maps the data from dimension $r$ back to dimension $d$. Finally, the outputs of the left and right branches are added together to produce the output hidden_state.
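
A minimal sketch of this two-branch computation (illustrative only; the official loralib implementation is shown further below):

import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    # Frozen base weight W (d x d) plus a trainable low-rank side branch B @ A.
    def __init__(self, d: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(d, d, bias=False)
        self.base.weight.requires_grad = False       # W stays frozen
        self.lora_A = nn.Linear(d, r, bias=False)    # down-projection: d -> r
        self.lora_B = nn.Linear(r, d, bias=False)    # up-projection:   r -> d
        nn.init.zeros_(self.lora_B.weight)           # side branch starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # left branch (original path) + right branch (low-rank update)
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling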

Comparing the two branches, the right side can be seen as a low-rank decomposition of the original matrix $W$ on the left. The parameter count drops from $d \times d$ to $d \times r + r \times d = 2dr$; when $r \ll d$, the number of parameters is greatly reduced.
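
A quick sanity check of the numbers: with $d = 4096$ (as in LLaMA-7B) and $r = 8$, the original matrix has $4096 \times 4096 \approx 16.8$M parameters, while $A$ and $B$ together have only $2 \times 4096 \times 8 = 65{,}536$, roughly a 256x reduction for that single matrix.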

In ALBERT, the authors observed that the vocabulary dimension is very large, so they factorized the embedding matrix into two much smaller matrices that together approximate the original embedding, greatly reducing the number of parameters to train. (In practice this saves roughly 10M parameters; the main reason ALBERT is small is actually cross-layer parameter sharing.)
[Figure: factorizing the embedding matrix into two smaller matrices]
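To see the saving concretely (illustrative sizes, not ALBERT's exact configuration): with vocabulary size $V = 30{,}000$, hidden size $d = 768$ and a small intermediate dimension $e = 128$, the full embedding matrix has $V \times d \approx 23.0$M parameters, whereas the factorized form needs only $V \times e + e \times d \approx 3.9$M.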
LoRA follows a similar idea, but it is no longer limited to the embedding layer: in principle, such a decomposition can be applied wherever a large weight matrix appears.

Unlike ALBERT, however, which directly replaces the original large matrix with two small matrices, LoRA keeps the original matrix W but excludes it from training, so the only parts that need gradients are the two small side-branch matrices A and B.

Judging from the formulation in the paper, full-parameter fine-tuning optimizes the following training objective (taking an autoregressive language model as an example), i.e. maximizing the conditional likelihood:

$$\max_{\Phi} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log \left( P_{\Phi}(y_t \mid x, y_{<t}) \right)$$

where $\Phi$ denotes the parameters of the model.

A major drawback of full-parameter fine-tuning is that a different set of parameters must be learned for every downstream task. If the pre-trained model is large, e.g. GPT-3 with 175 billion parameters, storing and deploying many independent fine-tuned model instances becomes a challenge.

After adding LoRA, the optimization objective becomes:

$$\max_{\Theta} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log \left( P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t}) \right)$$

where $\Phi_0$ denotes the original (frozen) parameters of the model and $\Delta\Phi(\Theta)$ is the LoRA update, parameterized by the much smaller parameter set $\Theta$.

As the second formula shows, the total parameter count seems to increase (there is an extra $\Delta\Phi(\Theta)$), but the max objective makes clear that only $\Theta$ needs to be optimized, and by assumption $|\Theta| \ll |\Phi_0|$. This drastically reduces the gradients (and optimizer states) that must be computed and stored during training, so that under low-resource conditions a large model can be fine-tuned on a single GPU with limited memory.

After training, only the LoRA parameters (i.e. the trainable parameters) are saved. For inference, these parameters can first be merged into the original model to form a new model (the large "+" sign at the top of Figure 1), which is then loaded; compared with the original model, this adds no extra inference-time overhead.
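
A sketch of that merge step (shapes follow the loralib code below; the fan_in_fan_out transpose is omitted for simplicity):

import torch

@torch.no_grad()
def merge_lora(weight: torch.Tensor, lora_A: torch.Tensor,
               lora_B: torch.Tensor, scaling: float) -> torch.Tensor:
    # weight: (d_out, d_in), lora_A: (r, d_in), lora_B: (d_out, r)
    # The merged matrix behaves exactly like W + delta_W, so inference runs a
    # single matmul and adds no latency compared with the original model.
    return weight + (lora_B @ lora_A) * scaling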

Current LLMs are pre-trained on enormous amounts of data. LoRA can avoid collapsing the pre-trained model's generalization ability because it keeps the original weights frozen. During pre-training, the model has learned a large amount of linguistic knowledge and structure that transfers to various downstream tasks; full fine-tuning, by contrast, retrains all parameters, which may cause the model to forget previously learned knowledge and structure and thus reduce its generalization ability.

In contrast, LoRA fine-tunes only a small set of injected parameters while keeping the original weights frozen. This lets LoRA adapt to a specific task while preserving the language knowledge and structure of the original model, improving performance. In addition, the low-rank form of the injected matrices acts as a constraint that captures the common patterns in the data, which reduces the risk of over-fitting and can further improve generalization.

Therefore, by keeping the original weights frozen, LoRA avoids the collapse of pre-trained generalization ability and improves the model's generalization and performance.

Official implementation

Only the LoRA implementation for the Linear layer is shown here; for the complete code, see: https://github.com/microsoft/LoRA

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

# LoRALayer is the lightweight base class defined alongside this Linear in
# loralib/layers.py; it stores r, lora_alpha, lora_dropout, merge_weights
# and the `merged` flag.
from loralib.layers import LoRALayer


class Linear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self, 
        in_features: int, 
        out_features: int, 
        r: int = 0, 
        lora_alpha: int = 1, 
        lora_dropout: float = 0.,
        fan_in_fan_out: bool = False, # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)

        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

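    # Switching back to training un-merges the LoRA update from self.weight (if it
    # was merged), so lora_A / lora_B keep their own gradients; switching to eval
    # merges B @ A * scaling into self.weight for a single fused matmul at inference.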
    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = True       

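    # When the weights are not merged, the output is the frozen base projection
    # plus the low-rank path x @ A^T @ B^T, scaled by lora_alpha / r.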
    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.r > 0 and not self.merged:
            result = F.linear(x, T(self.weight), bias=self.bias)
            if self.r > 0:
                result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, T(self.weight), bias=self.bias)

The implementation also makes clear that LoRA freezes the PLM's parameters; the only parameters actually trained are lora_A and lora_B. However, the frozen PLM weights still participate in the forward and backward computation during training, so LoRA does not make training itself much cheaper in compute.
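
A hedged usage sketch, assuming the loralib package from the repository above is installed (it exposes lora.Linear, lora.mark_only_lora_as_trainable and lora.lora_state_dict):

import torch
import loralib as lora  # from https://github.com/microsoft/LoRA

# Replace an ordinary projection with a LoRA-enabled one (r is the rank).
layer = lora.Linear(768, 768, r=8, lora_alpha=16)
model = torch.nn.Sequential(layer)

# Freeze everything except the lora_A / lora_B parameters.
lora.mark_only_lora_as_trainable(model)

# After training, only the LoRA weights need to be saved and shared.
torch.save(lora.lora_state_dict(model), 'lora_only.pt')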

Summary

  • LoRA is parameter-efficient, not compute-efficient during training. That is, there really are far fewer trainable parameters, but the training speed on a single GPU is not significantly improved (see the parameter-counting sketch after this list).

    In LoRA, the entire PLM still participates in backpropagation, not just the side branch. To compute the gradients of the injected low-rank matrices, gradients must be propagated backwards through all the frozen layers (with respect to the activations); only the gradients with respect to the frozen weights themselves can be skipped.

  • In multi-GPU training, LoRA's speed advantage shows up in two main ways:

    1. Computational efficiency: since only the injected low-rank matrices need to be optimized, LoRA does less optimizer work than full fine-tuning. In multi-GPU training, the computation and optimization of the injected matrices can be distributed across multiple GPUs, accelerating training.

    2. Communication efficiency: in multi-GPU training, communication is usually the bottleneck. Because only the parameters and gradients of the injected matrices need to be synchronized, LoRA communicates far less than full fine-tuning, reducing communication volume and time.

    Therefore, LoRA is usually faster than full fine-tuning in multi-GPU training. In particular, it can reduce the GPU memory requirement by up to 3x, lowering the hardware threshold and improving training efficiency.
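
A quick way to see the "parameter-efficient, not compute-efficient" point is to count trainable versus total parameters (a small sketch; works for any model that wraps layers like the Linear class above):

def count_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    # With LoRA, `trainable` is tiny compared with `total`, but the forward and
    # backward passes still have to run through all `total` parameters.
    return trainable, total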
