Overview of the principles of efficient fine-tuning technology for large model parameters (2) - BitFit, Prefix Tuning, Prompt Tuning

Eat jelly without spitting out jelly skin

ChatGPT's rapid rise in popularity ushered in an era of change for large models. However, for most people, pre-training or fully fine-tuning a large model is out of reach. As a result, various parameter-efficient fine-tuning techniques have been developed, giving researchers and ordinary developers the opportunity to fine-tune large models themselves.

These techniques therefore deserve an in-depth look at the mechanisms behind them. This series is divided into seven articles:

  • Overview of the principles of efficient fine-tuning technology for large model parameters (1) - Background, introduction to efficient parameter fine-tuning
  • Overview of the principles of efficient fine-tuning technology for large model parameters (2) - BitFit, Prefix Tuning, Prompt Tuning
  • Overview of the principles of efficient fine-tuning technology for large model parameters (3) - P-Tuning, P-Tuning v2
  • Overview of the principles of efficient fine-tuning technology for large model parameters (4) - Adapter Tuning and its variants
  • Overview of the principles of efficient fine-tuning technology for large model parameters (5) - LoRA, AdaLoRA, QLoRA
  • Overview of the principles of efficient fine-tuning technology for large model parameters (6) - MAM Adapter, UniPELT
  • Overview of the principles of efficient fine-tuning technology for large model parameters (7) - Best practices and summary

This article is the second part of a review of the principles of efficient fine-tuning technology for large model parameters.

BitFit

Background

While full fine-tuning is highly effective on each task, it also produces a separate full-size model for every task, which makes it hard to see what changed during fine-tuning and hard to deploy; as the number of tasks grows, such models become difficult to maintain.

Ideally, we would like to have an efficient fine-tuning method that satisfies the following conditions:

  • Match the results of full fine-tuning.
  • Change only a small set of the model's parameters.
  • Allow tasks to arrive in a stream rather than all at once, facilitating efficient hardware deployment.
  • Change the same set of parameters across different downstream tasks.

Whether these conditions can be met depends on the extent to which fine-tuning teaches the model genuinely new abilities versus surfacing abilities it already acquired during pre-training.

Although earlier efficient fine-tuning methods such as Adapter-Tuning and Diff-Pruning can partially meet these needs, the authors propose BitFit, a sparse fine-tuning method with even fewer parameters, to satisfy all of them.

Technical principles

BitFit (paper: BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models) is a sparse fine-tuning method that updates only the bias parameters, or a subset of them, during training.

For a Transformer model, most of the transformer-encoder parameters are frozen; only the bias parameters and the task-specific classification layer parameters are updated. The bias terms involved include the biases used to compute query, key, and value in the attention module and to merge the outputs of the multiple attention heads, the biases in the MLP layers, and the bias parameters of the LayerNorm layers.
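
As a rough illustration, here is a minimal PyTorch sketch of this idea, assuming a Hugging Face BERT classifier; the model name and parameter-name patterns are illustrative and not taken from the paper's code:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# BitFit-style freezing: train only the bias terms plus the task-specific head.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```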

In models such as BERT-Base/BERT-Large, bias parameters account for only 0.08%–0.09% of the model's total parameters. Yet a comparison of BitFit, Adapter, and Diff-Pruning on BERT-Large over the GLUE benchmark shows that BitFit matches the other two methods while using far fewer parameters, and even slightly outperforms them on some tasks.


The experimental results also show that, compared with full-parameter fine-tuning, BitFit achieves good results on multiple datasets while updating only a very small number of parameters. Although it does not match full fine-tuning, it is far better than the Frozen approach that fixes all of the model's parameters.


Furthermore, comparing the parameters before and after BitFit training shows that many bias parameters barely change (for example, the biases involved in computing the key). The biases that change most noticeably are those used to compute the query and those of the FFN layer (intermediate) that expands the feature dimension from N to 4N. Updating only these two types of bias parameters still achieves good results; conversely, fixing either of them causes a substantial drop in model performance.
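
Continuing the sketch above, restricting training to just those two bias groups might look like this; the parameter-name patterns follow Hugging Face's BERT naming convention and are an assumption, not the paper's code:

```python
# Train only the query biases and the FFN up-projection ("intermediate")
# biases identified by the ablation, plus the task-specific head.
for name, param in model.named_parameters():
    is_query_bias = ".attention.self.query." in name and name.endswith(".bias")
    is_ffn_bias = ".intermediate.dense." in name and name.endswith(".bias")
    param.requires_grad = is_query_bias or is_ffn_bias or name.startswith("classifier")
```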


Prefix Tuning

Background

Work before Prefix Tuning mainly involved manually designing discrete templates or automatically searching for them. With manual templates, the model's final performance is extremely sensitive to the template: adding a word, dropping a word, or changing a word's position can cause large swings. Automated template search is relatively expensive, and the discrete tokens it finds may not be optimal.

In addition, the traditional fine-tuning paradigm adapts the pre-trained model to each downstream task separately, so a fine-tuned copy of the model weights must be saved per task. Fine-tuning the entire model takes a long time on one hand, and on the other it takes up a lot of storage space.

Based on these two points, Prefix Tuning proposes freezing the pre-trained LM and attaching trainable, task-specific prefixes to it, so that a different prefix can be saved for each task at low fine-tuning cost. Such a prefix is effectively a continuous, differentiable Virtual Token (Soft Prompt / Continuous Prompt), which is easier to optimize and works better than discrete tokens.


Technical principles

Prefix Tuning (paper: Prefix-Tuning: Optimizing Continuous Prompts for Generation) constructs task-specific virtual tokens as a Prefix placed before the input tokens; during training only the Prefix parameters are updated, while the rest of the PLM's parameters remain fixed.

For different model structures, different Prefixes need to be constructed.

  • For the autoregressive architecture model: add the prefix in front of the sentence to obtain z = [PREFIX; x; y]; with the LM fixed, a suitable context can steer generation (compare GPT-3's in-context learning). A conceptual sketch of both layouts follows this list.
  • For the encoder-decoder architecture model: prefixes are added on both the Encoder and Decoder sides to obtain z = [PREFIX; x; PREFIX'; y]. The Encoder-side prefix guides the encoding of the input, while the Decoder-side prefix guides subsequent token generation.
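
A purely conceptual sketch of the two layouts (in the actual method, each PREFIX slot holds trainable continuous vectors injected at every layer, not literal token strings):

```python
def autoregressive_input(prefix, x, y):
    return prefix + x + y                    # z = [PREFIX; x; y]

def encoder_decoder_input(prefix_enc, x, prefix_dec, y):
    enc_in = prefix_enc + x                  # Encoder sees [PREFIX; x]
    dec_in = prefix_dec + y                  # Decoder sees [PREFIX'; y]
    return enc_in, dec_in
```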


This method is similar in spirit to constructing a Prompt, except that a Prompt is a human-crafted "explicit" hint whose parameters cannot be updated, whereas a Prefix is a learnable "implicit" hint.


Meanwhile, to prevent directly updating the Prefix parameters from destabilizing training and hurting performance, an MLP is placed in front of the Prefix layer, so the prefix is reparameterized through it. After training is complete, only the resulting Prefix parameters are retained.
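
A minimal sketch of this reparameterization, with illustrative sizes: the prefix is generated by an MLP from a smaller embedding during training, and afterwards the MLP can be discarded, keeping only the computed prefix tensor.

```python
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    def __init__(self, prefix_len=10, emb_dim=512, mlp_dim=512,
                 n_layers=12, n_heads=12, head_dim=64):
        super().__init__()
        # One key and one value vector per layer and per prefix position.
        out_dim = 2 * n_layers * n_heads * head_dim
        self.prefix_tokens = nn.Parameter(torch.randn(prefix_len, emb_dim))
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, mlp_dim),
            nn.Tanh(),
            nn.Linear(mlp_dim, out_dim),
        )

    def forward(self) -> torch.Tensor:
        # (prefix_len, 2 * n_layers * n_heads * head_dim); reshaped into
        # per-layer key/value states before being fed to the frozen LM.
        return self.mlp(self.prefix_tokens)

prefix_kv = PrefixEncoder()()
print(prefix_kv.shape)  # torch.Size([10, 18432])
```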


In addition, ablation experiments show that tuning the embedding layer alone is not expressive enough and leads to a significant performance drop; therefore, prefix parameters are added at every layer of the model, which is a major change.


The experiments also compared the effect of the trainable tokens' position on generation quality, and Prefix-tuning proved slightly better than Infix-tuning. Here Prefix-tuning takes the form [PREFIX; x; y], while Infix-tuning takes the form [x; INFIX; y].


Prompt Tuning

Background

Fully fine-tuning a large model requires training a separate model for each task, with relatively high overhead and deployment cost. At the same time, discrete prompting (manually designing prompts and prepending them to the model input) is costly to engineer and does not work particularly well.

Based on this, the authors proposed Prompt Tuning, which learns prompts via backpropagation instead of designing them by hand. The model's original weights are frozen and only the prompt parameters are trained; after training, the same model can serve multiple tasks at inference time.

Technical principles

Prompt Tuning (paper: The Power of Scale for Parameter-Efficient Prompt Tuning) can be viewed as a simplified version of Prefix Tuning. It defines a prompt for each task and prepends it to the input data, but adds prompt tokens only at the input layer, and it does not need an MLP reparameterization to stabilize training.
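
A minimal sketch of this idea, assuming a Hugging Face model whose embedding layer is reachable via get_input_embeddings(); the class and names are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, model, prompt_len=20):
        super().__init__()
        self.model = model
        for p in self.model.parameters():      # freeze the pre-trained LM
            p.requires_grad = False
        hidden = model.config.hidden_size
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)

    def forward(self, input_ids, **kwargs):
        embeds = self.model.get_input_embeddings()(input_ids)
        prompt = self.prompt.unsqueeze(0).expand(embeds.size(0), -1, -1)
        # Prompt tokens are prepended only here, at the input layer.
        # (Any attention mask must be extended by prompt_len as well.)
        return self.model(inputs_embeds=torch.cat([prompt, embeds], dim=1), **kwargs)
```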


Experiments show that as the pre-trained model's parameter count increases, Prompt Tuning approaches the results of full-parameter fine-tuning.


Prompt Tuning also proposes Prompt Ensembling: training several different prompts for the same task simultaneously within one batch (that is, asking the same question in multiple different ways). This amounts to training several models, but at a cost far lower than conventional model ensembling.
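
A hedged sketch of the idea: several independent soft prompts for the same task are stacked along the batch dimension, so one forward pass scores all ensemble members for a single example (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

n_prompts, prompt_len, hidden = 5, 20, 768
prompts = nn.Parameter(torch.randn(n_prompts, prompt_len, hidden) * 0.02)

def ensemble_batch(example_embeds: torch.Tensor) -> torch.Tensor:
    # example_embeds: (seq_len, hidden) for one input example.
    repeated = example_embeds.unsqueeze(0).expand(n_prompts, -1, -1)
    # Each batch row pairs the same example with a different prompt; the
    # n_prompts predictions can then be combined, e.g. by majority vote.
    return torch.cat([prompts, repeated], dim=1)
```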


In addition, the Prompt Tuning paper discusses how the prompt tokens' initialization method and length affect model performance. The ablation results show that initializing the prompt from class-label embeddings works better than random initialization or initialization from sampled vocabulary embeddings; however, this gap disappears as the model's parameter scale increases.
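
As an illustration of class-label initialization (the model name and label words here are assumptions made for the sketch):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Initialize each soft-prompt vector from a class-label token's embedding.
label_ids = tokenizer(["positive", "negative"], add_special_tokens=False).input_ids
flat_ids = torch.tensor([i for ids in label_ids for i in ids])
with torch.no_grad():
    prompt_init = model.get_input_embeddings()(flat_ids)  # (n_label_tokens, hidden)
```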

A prompt length of around 20 tokens already performs well (beyond 20, increasing the prompt length brings no significant improvement). Likewise, this gap shrinks as the model's parameter scale increases (that is, for very large models, even a very short prompt has little impact on performance).


Conclusion

This article described BitFit, an efficient fine-tuning method that updates only a subset of the existing parameters, and the soft-prompt fine-tuning methods Prefix Tuning and Prompt Tuning, which add extra parameters. The next article will cover the efficient fine-tuning methods P-Tuning and P-Tuning v2.

If you think my article can help you, please like, favorite and follow~~

Origin: blog.csdn.net/sinat_37574187/article/details/132734693