A Plain-Language Guide to Major Fine-Tuning Methods for Large Models: From Prefix Tuning and P-Tuning V1/V2 to LoRA and QLoRA

Foreword

Anyone who has studied large models knows that PEFT methods fine-tune only a small number of (extra) model parameters while freezing most of the pre-trained LLM's parameters; examples include Prefix Tuning, P-Tuning V1/V2, LoRA, and QLoRA. There are already plenty of articles and tutorials introducing these fine-tuning methods online, and I have read many of them, yet very few explain things clearly at a glance; most leave the reader little the wiser.

In short, explaining knowledge clearly is not easy. I have been practicing the craft of "writing things clearly" for more than ten years, since 2010, and I am still keen on it. While continuing to dig into large-model technology I have written about fine-tuning various models, but I had never summarized the fine-tuning methods themselves. Since these methods matter a great deal, hence this article.

Part 1 The Development History of Parameter-Efficient Fine-Tuning

1.1 Google's Adapter Tuning: Embedded in the Transformer, the original parameters stay frozen and only the newly added Adapter is fine-tuned

In 2019, Google researchers proposed a PEFT method for BERT in the paper "Parameter-Efficient Transfer Learning for NLP", which opened the curtain on PEFT research. They pointed out that:

  • For a specific downstream task, full fine-tuning (updating all parameters of the pre-trained model) is too inefficient.
  • On the other hand, if most of the pre-trained model is frozen and only the layers closest to the downstream task are fine-tuned, it is hard to reach good results.

So they designed the Adapter structure as shown in the figure below

[Figure: Adapter Tuning architecture. Left: where the Adapter modules are inserted into the Transformer layer; right: the internal structure of an Adapter (down-projection, nonlinearity, up-projection, skip connection)]

  1. As shown on the left of the figure above, the Adapter is embedded inside the Transformer block. During training, the parameters of the original pre-trained model are frozen and only the newly added Adapter modules are fine-tuned.
  2. As shown on the right of the figure above, to keep training efficient (i.e., to introduce as few extra parameters as possible), the Adapter is designed as follows: a down-projection layer first maps the high-dimensional features to a low dimension, a nonlinearity is applied, and an up-projection layer maps the low-dimensional features back to the original high dimension; a skip connection is also added so that, in the worst case, the module can degenerate to the identity (see the sketch just below).
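
To make the structure concrete, below is a minimal PyTorch sketch of such an Adapter block (the dimensions, names, and the GELU nonlinearity are illustrative assumptions, not the paper's reference implementation):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a skip connection."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_dim, bottleneck_dim)  # high-dim -> low-dim
        self.nonlinearity = nn.GELU()                           # nonlinear layer in between
        self.up_proj = nn.Linear(bottleneck_dim, hidden_dim)    # low-dim -> high-dim

    def forward(self, x):
        # skip connection: with near-zero adapter weights the module reduces to the identity
        return x + self.up_proj(self.nonlinearity(self.down_proj(x)))

# During training, the Transformer's original weights are frozen and only the Adapter
# parameters (plus, in the paper, layer norms and the task head) are updated.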

Experimentally, this method adds only about 3.6% extra parameters (relative to the original pre-trained model) while achieving performance close to full fine-tuning (within 0.4% on GLUE).

1.2 Stanford's Prefix Tuning

The work before Prefix Tuning mainly designed discrete templates by hand or searched for them automatically. The problem is that the final performance is extremely sensitive to the hand-crafted template: adding a word, dropping a word, or changing a word's position can cause large swings in results, and the templates found by discrete search over tokens are not necessarily optimal.

Therefore, Stanford researchers proposed Prefix Tuning in the paper "Prefix-Tuning: Optimizing Continuous Prompts for Generation". It uses continuous virtual-token embeddings instead of discrete tokens and, unlike full fine-tuning, does not update all parameters, as shown in the figure (note the contrast between fine-tuning and prefix tuning in the figure):

  1. The method constructs a segment of task-related virtual tokens as a Prefix and prepends it to the input tokens, which is equivalent to inserting several consecutive trainable "virtual token" embeddings. These pseudo-tokens need not be real words from the vocabulary; they are simply a set of adjustable free parameters.

    To illustrate: for a table-to-text task, the context x is the serialized table and the output y is its textual description, generated with GPT-2; for text summarization, x is the source document and y is the summary, generated with BART.
    → For an autoregressive model, a prefix is prepended to the sequence, giving z = [PREFIX; x; y]. This works because an appropriate preceding context can guide what follows even when the LM is fixed (as in GPT-3's in-context learning).
    → For an encoder-decoder model, prefixes are added on both the Encoder and the Decoder sides, giving z = [PREFIX; x; PREFIX'; y]. The Encoder-side prefix guides the encoding of the input (what to extract from x), while the Decoder-side prefix guides the generation of the subsequent tokens (influencing the generation of y by steering the next-token distribution).

  2. During training, only the parameters of the Prefix are updated, while all other Transformer parameters remain frozen.

This method is similar to constructing a Prompt, except that a Prompt is an explicit, manually constructed prompt whose parameters cannot be updated, whereas a Prefix is an implicit prompt that can be learned.

At the same time, to avoid the training instability caused by directly updating the Prefix parameters, they place an MLP in front of the Prefix layer (equivalent to decomposing the Prefix into the combination of a smaller-dimensional input and an MLP); after training, only the Prefix parameters are kept and the MLP can be discarded.
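
Below is a minimal PyTorch sketch of this reparameterization (shapes, names, and the Tanh nonlinearity are illustrative assumptions rather than the paper's reference code): a small table of prefix embeddings is expanded by an MLP into per-layer key/value prefixes that are fed to the frozen LM.

import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    """Trainable prefix embeddings plus an MLP that expands them into per-layer key/value prefixes."""
    def __init__(self, prefix_len=20, hidden=768, num_layers=12, bottleneck=512):
        super().__init__()
        self.prefix_tokens = nn.Parameter(torch.randn(prefix_len, bottleneck))  # trainable "virtual tokens"
        self.mlp = nn.Sequential(                                  # reparameterization MLP for training stability
            nn.Linear(bottleneck, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, num_layers * 2 * hidden),        # 2 = one key prefix and one value prefix per layer
        )
        self.num_layers, self.hidden = num_layers, hidden

    def forward(self, batch_size):
        # (prefix_len, num_layers * 2 * hidden) -> (batch, num_layers, 2, prefix_len, hidden)
        past = self.mlp(self.prefix_tokens)
        past = past.view(1, -1, self.num_layers, 2, self.hidden).expand(batch_size, -1, -1, -1, -1)
        return past.permute(0, 2, 3, 1, 4)  # fed to the frozen LM as past key/values

# usage: prefixes = PrefixEncoder()(batch_size=4); the base LM's parameters stay frozen,
# and after training the expanded prefixes can be cached so the MLP is no longer needed.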

Part 2 P-Tuning V1/V2

2.1 P-Tuning V1

// to be updated

2.2 P-Tuning V2: The key lies in the introduction of Prefix-tuning

// to be updated


Part 3 LoRA: Low-Rank Adaptation of Large Language Models

3.1 What is LoRA

Such as " LLaMA Interpretation and Fine-tuning: Alpaca-LoRA/Vicuna/BELLE/Chinese LLaMA/Jiang Ziya/LLaMA 2" Section 2.2.3 Alpaca-LoRA: Fine-tuning "LLaMA-based Alpaca" on consumer-grade GPUs through the PEFT library As mentioned above, in the neural network model, the model parameters are usually expressed in the form of matrix. For a pre-trained model, its parameter matrix already contains a lot of useful information. To adapt the model to a specific task, we need to fine-tune these parameters

The core idea of LoRA is to adjust these parameter matrices in a low-rank manner. Mathematically, low rank means a matrix can be approximated by the product of two much smaller matrices, as presented in the paper "LoRA: Low-Rank Adaptation of Large Language Models".

  1. Select the target layers: first choose the layers of the pre-trained network to which LoRA will be applied. These are usually task-relevant layers such as the query Q and key K projection matrices in the self-attention mechanism.
  2. Initialize the mapping matrix and the inverse mapping matrix: create two smaller matrices A and B for the target layer.
    → A is the mapping (dimensionality-reduction) matrix, generally initialized from a random Gaussian distribution. Implementations differ in practice: when Microsoft's DeepSpeed-Chat applies LoRA, it first allocates a zero matrix as a placeholder and then applies Kaiming uniform initialization. This differs from the normal-distribution initialization in LoRA's original definition, but both work; see the DeepSpeed-Chat code below.
    → B is the inverse mapping (dimensionality-increase) matrix, initialized as a zero matrix.
    The sizes of the two matrices are determined by LoRA's rank and alpha value.
  3. Parameter transformation: transform the original parameter matrix W of the target layer through the mapping matrix A and the inverse mapping matrix B, i.e., W' = W + A * B, where W' is the transformed parameter matrix.
  4. Model fine-tuning: replace the original parameter matrix W of the target layer with the new matrix W', then fine-tune the model on task-specific training data.
  5. Gradient update: during fine-tuning, compute the gradient of the loss with respect to A and B and update them with an optimizer (e.g., Adam or SGD). Note that the original parameter matrix W stays unchanged throughout: the parameters of the original pre-trained model are frozen, and only the dimensionality-reduction matrix A and the dimensionality-increase matrix B are trained.
  6. Repeat: in each training batch, repeat steps 3-5 until the preset number of epochs or a convergence criterion is reached.

In summary, the detailed steps of LoRA are: select the target layers, initialize the mapping and inverse mapping matrices, transform the parameters, and fine-tune the model. During fine-tuning, the model learns task-specific knowledge by updating the mapping matrix A and the inverse mapping matrix B, thereby improving performance on that task; a minimal code sketch follows.
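
The sketch below is purely illustrative (names, shapes, and the initialization scale are assumptions, not the paper's or any library's reference code); it follows the steps above, with W frozen and only A and B trained.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a LoRA bypass: effective weight W' = W + A @ B, scaled by alpha / r."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_features, out_features), requires_grad=False)  # frozen W (in x out)
        self.lora_A = nn.Parameter(torch.randn(in_features, r) * 0.01)  # mapping matrix A: reduce to rank r, Gaussian init
        self.lora_B = nn.Parameter(torch.zeros(r, out_features))        # inverse mapping matrix B: zero init
        self.scaling = alpha / r

    def forward(self, x):
        # original branch + low-rank bypass; since B starts at zero, training starts exactly from W' = W
        return x @ self.weight + (x @ self.lora_A @ self.lora_B) * self.scaling

# With in_features = out_features = 4096 and r = 8, A and B together hold
# 2 * 4096 * 8 = 65,536 trainable parameters versus 4096 * 4096 ≈ 16.8M in W.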

3.2 Implementation of LoRA fine-tuning in Microsoft DeepSpeed-Chat

Moving on: LoRA is very widely used. For example, DeepSpeed-Chat, released later by Microsoft, adopted this method.

In the DeepSpeed-Chat implementation, when LoRA's low-rank dimension lora_dim is set (e.g., lora_dim=128), LoRA training is considered enabled, and every linear layer whose name contains "decoder.layers." in the original model is replaced with a LoRA layer. Concretely:

  1. Freeze the weight parameters of the original structure;
  2. Introduce two new linear layers, lora_right_weight and lora_left_weight (corresponding to the dimensionality-reduction matrix A and the dimensionality-increase matrix B above, respectively), which first reduce the dimension to lora_dim and then raise it back to the original dimension;
  3. The LoRA layer consists of two branches: one branch is the original structure with frozen weights, the other is the newly introduced pair of linear layers that reduce and then raise the dimension.
# applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py
# Check whether LoRA mode is enabled
if args.lora_dim > 0:
    """
    If enabled, a LoRA bypass (two linear layers that first reduce and then raise the dimension)
    is added to every linear layer whose name contains "decoder.layers.".
    These are essentially the attention / information-exchange inner linear layers.
    Their weight parameters are frozen, and the LoRA bypass parameters are optimized instead.
    """
    args.lora_module_name = "decoder.layers."
    model = convert_linear_layer_to_lora(model, args.lora_module_name,
                                         args.lora_dim)

# applications/DeepSpeed-Chat/training/utils/module/lora.py
def convert_linear_layer_to_lora(model,
                                 part_module_name,
                                 lora_dim=0,
                                 lora_scaling=1,
                                 lora_droppout=0):
    """
    Convert the linear layers whose names contain "decoder.layers." into LoRA layers
    """
    """Collect the linear layers whose parameter names contain decoder.layers."""
    repalce_name = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and part_module_name in name:
            repalce_name.append(name)

    for name in repalce_name:
        """recursive_getattr retrieves the original module from the model by attribute name"""
        module = recursive_getattr(model, name)

        """Instantiate a LoRA layer that takes over the original module's parameters"""
        tmp = LinearLayer_LoRA(
            module.weight, lora_dim, lora_scaling, lora_droppout,
            module.bias).to(module.weight.device).to(module.weight.dtype)

        """recursive_setattr replaces the corresponding attribute of the model with the LoRA layer instance"""
        recursive_setattr(model, name, tmp)
    return model
    
# applications/DeepSpeed-Chat/training/utils/module/lora.py
class LinearLayer_LoRA(nn.Module):
    """The concrete LoRA layer"""
    def __init__(...):
        ...
        """weight and bias here are the parameters of the original module"""
        self.weight = weight
        self.bias = bias
        ...

        """Freeze the weight parameters"""
        self.weight.requires_grad = False
        ...
        self.lora_right_weight = nn.Parameter(torch.zeros(columns, lora_dim))
        self.lora_left_weight = nn.Parameter(torch.zeros(lora_dim, rows))
        ...

        """Initialize the parameters of the LoRA linear layers"""
        self.reset_parameters()

    # reset_parameters(self) is called to do the initialization
    def reset_parameters(self):
        # Unlike the N(0, sigma^2) normal-distribution initialization in LoRA's original definition,
        # the dimensionality-reduction matrix uses Kaiming uniform initialization,
        # i.e., the uniform distribution U(-sqrt(1/in_features), +sqrt(1/in_features)),
        # where in_features is the matrix's input dimension, i.e., in_features in nn.Linear(in_features, out_features),
        # corresponding to `columns` in the code above, which equals the base model's hidden_size
        nn.init.kaiming_uniform_(self.lora_right_weight, a=math.sqrt(5))

        # The dimensionality-increase matrix is initialized to all zeros
        nn.init.zeros_(self.lora_left_weight)

    def forward(self, input):
        """Forward pass of LoRA"""
        ...
        else:
            # F.linear(input, self.weight, self.bias) applies a linear transformation to `input`
            # with the given weight self.weight and bias self.bias;
            # it is equivalent to input @ self.weight.t() + self.bias, where @ is matrix multiplication and .t() is transpose.
            # For the second branch:
            # 1. self.lora_dropout(input) applies random dropout to the input as a form of regularization;
            # 2. the result goes through two linear maps, @ self.lora_right_weight and then @ self.lora_left_weight;
            # 3. * self.lora_scaling scales the result of that branch.
            return (F.linear(input, self.weight, self.bias)
                    + (self.lora_dropout(input) @ self.lora_right_weight
                       @ self.lora_left_weight) * self.lora_scaling)

Let's additionally analyze the last part of this code:

# applications/DeepSpeed-Chat/training/utils/module/lora.py
class LinearLayer_LoRA(nn.Module):
    """The concrete LoRA layer"""
    ...
    def forward(self, input):
        """Forward pass of LoRA"""
        ...
        else:
            return F.linear(
                    input, self.weight,
                    self.bias) + (self.lora_dropout(input) @ self.lora_right_weight
                                  @ self.lora_left_weight) * self.lora_scaling

The forward pass of the regular branch is defined by transformers, while the forward pass of the LoRA branch is defined in the forward() of LinearLayer_LoRA(nn.Module); that is, the outputs of the two branches of the LoRA layer are summed. As the LoRA paper illustrates, during training the smaller weight matrices (A and B) are kept separate, but once training is complete they can be merged into a new weight matrix.

This is reflected in the code as:

F.linear(input, self.weight, self.bias) + (self.lora_dropout(input) @ self.lora_right_weight @ self.lora_left_weight) * self.lora_scaling

The left side of the plus sign is the original-structure branch, and the right side is the new branch; self.lora_right_weight and self.lora_left_weight are the parameters of the two newly introduced linear layers.
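
Once training is done, the two branches can be folded into a single weight matrix. Below is a minimal sketch of such a merge using the shapes of LinearLayer_LoRA above (purely illustrative; it is not DeepSpeed-Chat's own merge helper):

import torch

@torch.no_grad()
def fuse_lora_branch(layer):
    # weight: (rows, columns); lora_right_weight: (columns, lora_dim); lora_left_weight: (lora_dim, rows)
    # F.linear computes input @ weight.t(), so the low-rank update must be transposed before being added
    delta_w = (layer.lora_right_weight @ layer.lora_left_weight).t()   # -> (rows, columns), same shape as weight
    layer.weight.data += layer.lora_scaling * delta_w                  # W' = W + scaling * (right @ left)^T
    return layer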

3.3 Encapsulation of LoRA, Prefix Tuning, and P-Tuning by Hugging Face's PEFT library

The PEFT (Parameter-Efficient Fine-Tuning) library released by Hugging Face also encapsulates the LoRA method. PEFT lets a pre-trained language model adapt efficiently to various downstream tasks without fine-tuning all of the model's parameters, i.e., only a small number of (extra) model parameters are fine-tuned, which greatly reduces computation and storage costs.

Model | Full Finetuning | PEFT-LoRA (PyTorch) | PEFT-LoRA (DeepSpeed with CPU Offloading)
bigscience/T0_3B (3B params) | 47.14GB GPU / 2.96GB CPU | 14.4GB GPU / 2.96GB CPU | 9.8GB GPU / 17.8GB CPU
bigscience/mt0-xxl (12B params) | OOM GPU | 56GB GPU / 3GB CPU | 22GB GPU / 52GB CPU
bigscience/bloomz-7b1 (7B params) | OOM GPU | 32GB GPU / 3.8GB CPU | 18.1GB GPU / 35GB CPU

The PEFT library (peft/src/peft/peft_model.py at main · huggingface/peft · GitHub) supports the following popular methods:

  1. LoRA: PEFT's implementation of LoRA can be found in peft/src/peft/tuners/lora.py at main · huggingface/peft · GitHub, for example the weight-merging code below (which is essentially consistent with DeepSpeed-Chat's LoRA weight merging above); a minimal usage sketch is given after this list
    def merge(self):
        # If the currently active adapter is not among the keys of lora_A, do nothing
        if self.active_adapter not in self.lora_A.keys():
            return

        if self.merged:
            warnings.warn("Already merged. Nothing to do.")
            return

        # If the active adapter's r is greater than 0, there are weights to merge
        if self.r[self.active_adapter] > 0:
            # Add the computed new weights to the current weights
            self.weight.data += (
                # transpose operation
                transpose(
                    # compute the new weights via matrix multiplication
                    self.lora_B[self.active_adapter].weight @ self.lora_A[self.active_adapter].weight,

                    # flag controlling the transpose (fan_in / fan_out layout)
                    self.fan_in_fan_out,
                )

                # then scale the computed weights by the corresponding scaling factor
                * self.scaling[self.active_adapter]
            )
            self.merged = True
  2. Prefix Tuning: "Prefix-Tuning: Optimizing Continuous Prompts for Generation", and "P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks"
  3. P-Tuning: GPT Understands, Too
  4. Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
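
As a quick illustration of how the library is used, here is a minimal LoRA sketch with PEFT (the model name and target_modules below are assumptions for illustration; pick modules that actually exist in your model):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                 # LoRA rank
    lora_alpha=32,                       # scaling factor = lora_alpha / r
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # which linear layers receive the LoRA bypass
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()       # prints the trainable vs. total parameter counts
# train `model` as usual; only the LoRA parameters receive gradients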

Part 4 QLoRA

// to be updated

References and Recommended Reading

  1. Google's Adapter Tuning paper: "Parameter-Efficient Transfer Learning for NLP"
  2. Let no large model be hard to tune: an introduction to PEFT techniques
  3. PEFT: Parameter-efficient fine-tuning of billion-scale models on low-resource hardware
  4. Continuous prompts: Prefix-Tuning
  5. LLaMA Interpretation and Fine-tuning: Alpaca-LoRA/Vicuna/BELLE/Chinese LLaMA/Jiang Ziya/LLaMA 2
  6. P-Tuning v2 greatly improves small-model performance; NER can also be prompt-tuned
  7. P-tuning: automatically building templates to unleash the potential of language models
  8. Prompt-Tuning: an in-depth look at a new fine-tuning paradigm
  9. ..
