Summary of three fine-tuning techniques for pre-trained large language models: fine-tuning, parameter-efficient fine-tuning, and prompt-tuning

Pre-trained large models, especially large language models, are currently the hottest topic in AI. When Google released the BERT model in 2018 (BERT's official model card on DataLearner: https://www.datalearner.com/ai-models/pretrained-models/BERT ), few people realized how far this wave of AI would go. After BERT appeared, however, fine-tuning became popular: freezing the weights of a pre-trained model and then fine-tuning it for a specific task proved very effective and was applied in many scenarios. With the popularity of ChatGPT, parameter-efficient fine-tuning and prompt-tuning now seem poised to replace traditional fine-tuning. This article briefly describes these three fine-tuning techniques and their differences in the field of pre-trained models.

Of course, the boundaries between these three techniques are not completely clear-cut, and many current studies do not distinguish them strictly. This article combines several studies to offer one perspective; rational discussion is welcome~


1. Fine-tuning technology

Fine-tuning is a technique used in Natural Language Processing (NLP) to adapt a pre-trained language model to a specific task or domain. The basic idea of fine-tuning is to take a pre-trained language model that has been trained on a large amount of text and then continue training it on a smaller, task-specific corpus.

The concept of fine-tuning has been around for many years and has been used in various contexts. The earliest known application of fine-tuning in NLP was in neural machine translation (NMT), where researchers used a pre-trained neural network to initialize the weights of a smaller network, which was then fine-tuned on a specific translation task.

Classic fine-tuning methods consist of continuing to train a pre-trained model on a small amount of task-specific data. During this process, the weights of the pre-trained model are updated to better fit the task. The amount of fine-tuning required depends on the similarity between the pre-training corpus and the task-specific corpus: if the two are similar, only a small amount of fine-tuning may be needed; if they are not, more fine-tuning may be required.
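As a minimal sketch of what this looks like in practice, the snippet below continues training a pre-trained model on a toy labeled dataset using the Hugging Face transformers library; the checkpoint name, toy data, and hyperparameters are illustrative assumptions, not anything prescribed by this post.

```python
# Hedged sketch of classic full fine-tuning: every weight of the pre-trained
# model is updated on a small task-specific dataset. The checkpoint, toy data
# and hyperparameters are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = ["great movie", "terrible plot"]          # hypothetical task data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # all weights train
model.train()
for _ in range(3):                                # a few passes over the batch
    optimizer.zero_grad()
    loss = model(**batch, labels=labels).loss
    loss.backward()                               # gradients reach every layer
    optimizer.step()
```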

In NLP, one of the most famous examples of fine-tuning is OpenAI's GPT (Generative Pre-trained Transformer) model. GPT models are pre-trained on large amounts of text and then fine-tuned on various tasks, such as language modeling, question answering, and summarization. The fine-tuned models achieve state-of-the-art performance on these tasks.

The figure below comes from a summary by Sebastian Raschka, a professor of statistics at the University of Wisconsin-Madison.

[Figure: Sebastian Raschka's summary of fine-tuning approaches]

This has been a very popular way of using large models in recent years: "freeze" all weights except the output layer, randomly initialize the output layer's parameters, and then train via transfer learning. Only the fully connected output layer is updated; the weights of all other layers remain unchanged.
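A minimal sketch of this recipe, assuming a BERT-style classifier from the transformers library (the checkpoint and label count are illustrative):

```python
# Hedged sketch: freeze the entire pre-trained encoder and train only the
# randomly initialized fully connected output layer.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
for param in model.bert.parameters():
    param.requires_grad = False        # "freeze" all pre-trained weights

# model.classifier is a freshly (randomly) initialized linear head;
# it is the only module whose parameters the optimizer will update.
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
```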

2. Parameter-efficient fine-tuning technology

Parameter-efficient fine-tuning, abbreviated PEFT, aims to fine-tune pre-trained language models effectively while reducing the required parameters and computing resources as much as possible. It is a family of methods in natural language processing (NLP) for adapting pre-trained language models to specific tasks that requires fewer parameters and less compute than traditional fine-tuning.

From another perspective, parameter-efficient fine-tuning solves the resource-intensive problem of traditional fine-tuning by training only a small set of parameters, which may be a subset of the existing model parameters or a newly added set. These methods differ in parameter efficiency, memory efficiency, training speed, final model quality, and additional inference cost (if any).
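To make the "small set of parameters" idea concrete, here is a hedged sketch that freezes a model and re-enables gradients only for its bias terms, in the spirit of bias-only methods such as BitFit; this particular choice of subset is my example, not one the post discusses.

```python
# Hedged sketch of the general PEFT idea: freeze the full model, then mark
# only a tiny parameter subset (here, the bias terms) as trainable.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name   # train bias terms only

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```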

These techniques are important for researchers and developers who may not have access to powerful hardware or require model fine-tuning on low-resource devices.

One parameter-efficient fine-tuning technique is distillation, introduced by Hinton et al. in 2015. It involves training a smaller "student" model to mimic the behavior of a larger pre-trained "teacher" model. The teacher's predictions are used as training targets for the student, so the student can learn from the larger model's knowledge without storing all of its parameters.
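A minimal sketch of the distillation loss under the usual formulation, with the temperature and mixing weight as assumed hyperparameters:

```python
# Hedged sketch of a knowledge-distillation loss in the spirit of Hinton et
# al. (2015): the student matches the teacher's temperature-softened output
# distribution, mixed with the ordinary supervised loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened distributions, scaled by T^2
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```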

Another technique is adapter training, introduced by Houlsby et al. in 2019. Adapters are small neural networks added to a pre-trained model for task-specific fine-tuning. They are only a fraction of the original model's size, which enables faster training and lower memory requirements. Adapters can be trained for multiple tasks and then plugged into the pre-trained model to perform new tasks.
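The sketch below shows a bottleneck adapter in the spirit of Houlsby et al.; the hidden and bottleneck dimensions are illustrative assumptions.

```python
# Hedged sketch of a bottleneck adapter: a small down-project / nonlinearity /
# up-project block with a residual connection, inserted into a frozen layer.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, x):
        # The residual connection keeps the frozen layer's output intact
        # at initialization; only the adapter's few weights are trained.
        return x + self.up(self.act(self.down(x)))
```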

The third technique is progressive shrinking, introduced in 2020 by Kaplan et al. It involves gradually reducing the size of the pre-trained model during fine-tuning: start with a large model and progressively reduce the number of parameters until the desired performance is reached. This approach can produce small models that perform better than models trained from scratch.
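The paper's exact procedure is not described here, but one plausible sketch of "gradually reducing parameters during fine-tuning" uses magnitude pruning between training phases; the schedule, the toy model, and the use of torch's pruning utilities are my own assumptions.

```python
# Hedged sketch of progressive shrinking via magnitude pruning: between
# fine-tuning phases, remove an increasing fraction of small-magnitude weights.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))

def shrink_step(model, amount):
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Prune the given fraction of the remaining weights with the
            # smallest L1 magnitude.
            prune.l1_unstructured(module, name="weight", amount=amount)

for amount in (0.1, 0.2, 0.3):     # assumed schedule
    shrink_step(model, amount)
    # ... continue fine-tuning here before the next shrinking step ...
```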

A professor at UMass Lowell released a review of parameter-efficient fine-tuning in March 2023 that is worth studying: "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning".

3. Prompt-tuning technology

Prompt-tuning is a more recent approach to adapting pre-trained language models that focuses on adjusting the input prompt rather than modifying model parameters. The pre-trained model remains unchanged; only the input prompt is modified for the downstream task. By designing and optimizing a set of prompts, a pre-trained model can be made to perform a specific task.

The main difference between prompt-tuning and traditional fine-tuning is the degree to which the pre-trained model is modified. Fine-tuning modifies the model's weights, while prompt-tuning only modifies the model's input. Consequently, prompt-tuning is less computationally expensive than fine-tuning and requires fewer resources and less training time. Prompt-tuning is also more flexible, as it allows task-specific prompts to be created and adapted to a wide variety of tasks.

For large-scale models like GPT-3, full fine-tuning requires significant computational resources, which makes approaches that leave the model untouched especially attractive.
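A minimal sketch of the "leave the model frozen, learn only the prompt" idea, using a small GPT-2 checkpoint and learnable "virtual token" embeddings; the model choice and prompt length are illustrative assumptions.

```python
# Hedged sketch of soft prompt-tuning: a few learnable embeddings are
# prepended to the input embeddings while the language model stays frozen.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False            # the pre-trained model is never modified

n_prompt = 20                          # assumed prompt length
soft_prompt = nn.Parameter(torch.randn(n_prompt, model.config.n_embd) * 0.02)

def forward_with_prompt(input_ids):
    tok_emb = model.transformer.wte(input_ids)                # (B, T, D)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, tok_emb], dim=1)       # prepend prompt
    return model(inputs_embeds=inputs_embeds)                 # returns logits

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)  # only the prompt trains
```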

Some notable prompt-tuning techniques include:

Prefix tuning: proposed by Li and Liang in the paper "Prefix-Tuning: Optimizing Continuous Prompts for Generation" (2021). Prefix tuning involves learning a task-specific continuous prompt, a sequence of vectors prepended to the input during inference. By optimizing this continuous prompt, the model can be adapted to a specific task without modifying the underlying model parameters, which saves computational resources and enables efficient fine-tuning.

The following figure, taken from the paper, illustrates the difference between fine-tuning and prefix tuning:

[Figure: schematic comparison of fine-tuning and prefix tuning, from Li and Liang (2021)]

P-Tuning: proposed by Liu et al. in the paper "GPT Understands, Too" (2021). P-Tuning involves training learnable parameters called "prompt tokens" that are concatenated with the input sequence. These prompt tokens are task-specific and are optimized during fine-tuning, so the model can perform well on new tasks while the original model parameters remain unchanged.

The classic prompt-tuning approach does not involve any parameter updates to the underlying model. Instead, it focuses on crafting input prompts, or templates, that guide the pre-trained model to produce the desired output. This is often a manual trial-and-error process of choosing the most appropriate prompt for a particular task. Techniques such as prefix tuning and P-Tuning, however, provide more systematic and efficient ways to adapt input prompts and improve the performance of large pre-trained models on specific tasks.
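For contrast with the learned prompts above, here is a hedged sketch of a manually crafted, discrete prompt template; the wording is hypothetical and would normally be refined by trial and error.

```python
# Hedged sketch of classic (discrete) prompting: a hand-crafted template turns
# a task into a text-completion problem for a frozen model.
def sentiment_prompt(review: str) -> str:
    return (
        "Decide whether the movie review is positive or negative.\n"
        f"Review: {review}\n"
        "Sentiment:"
    )

print(sentiment_prompt("An unforgettable, beautifully shot film."))
# The frozen model is then asked to complete the text; the continuation it
# chooses ("positive" / "negative") serves as the prediction.
```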

4. Summary

In fact, as the summary above suggests, the development of fine-tuning techniques is closely tied to the growth in model scale. Fine-tuning first appeared as models grew larger: training the original architecture from scratch on new data was too costly and discarded the abilities of the original model. Later, because model scale grew far faster than hardware performance, more efficient fine-tuning techniques emerged. Today's hottest technique, prompt-tuning, is essentially a fallback for large models that are almost impossible to fine-tune directly; of course, the power of the models themselves is also what makes this approach effective. Following this line of thought, as large models are applied to more complex and realistic problems, automatic prompt-tuning, rather than manual prompt design, may become an important direction. After all, in scenarios that require feeding the model a large amount of input before it can understand the problem, such as text summarization and code debugging, how to present a long input to the model effectively is a crucial question. Current large models incur a high inference cost on long inputs, and the input length they support is also quite limited, so this technology is a very important direction for the future!


Source: https://blog.csdn.net/linjie_830914/article/details/131020240