A Survey of Fine-tuning Methods for Large Models

0. Introduction

I have recently become more interested in large models. I first came into contact with them in the second half of 2022, and at the time the technology struck me as amazing, even disruptive. Now that open-source models are appearing one after another, I think it is worth learning these basics so that we can apply them in our work later. In deep learning, fine-tuning is an important technique for improving the performance of pre-trained models. Beyond fine-tuning ChatGPT-style models, many other pre-trained models can be fine-tuned as well. Here are some common ways to fine-tune a pre-trained model:

  • Fine-tune all layers: update all layers of the pre-trained model to adapt it to the new task.
  • Fine-tune the top layers: update only the top layers of the pre-trained model for the new task.
  • Freeze the bottom layers: keep the bottom layers of the pre-trained model fixed and fine-tune only the top layers.
  • Layer-by-layer fine-tuning: starting from the bottom layer, fine-tune the pre-trained model layer by layer until all layers have been fine-tuned.
  • Transfer learning: transfer the knowledge of the pre-trained model to a new task to improve performance; this usually means fine-tuning the top layers or freezing the bottom layers.

At present, the first three methods are the most commonly used. As a loose analogy, the parameters of a pre-trained model are like a college graduate who has absorbed all the course material: based on past study and life experience, they already have their own way of learning and reasoning. Fine-tuning is what happens when that graduate starts a job in a particular industry and has to pick up job-specific knowledge in order to produce useful work. Below we introduce some commonly used fine-tuning methods.
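Before going further, here is a minimal sketch of the third option above (freezing the bottom layers), assuming a BERT-style encoder loaded through Hugging Face transformers; the model name and the number of frozen layers are only illustrative:

```python
from transformers import AutoModelForSequenceClassification

# Illustrative: freeze the embeddings and the lower encoder layers of a
# BERT-style model, fine-tuning only the top layers and the task head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:       # freeze 8 of the 12 layers
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```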

1. Fine tuning

Fine-tuning is a technique used in Natural Language Processing (NLP) to adapt a pre-trained language model to a specific task or domain. The basic idea is to take a language model that has been pre-trained on a large amount of text and continue training it on a smaller, task-specific corpus.

The classic fine-tuning method continues training the pre-trained model on a small amount of task-specific data; during this process the pre-trained weights are updated to better fit the task. How much fine-tuning is needed depends on how similar the pre-training corpus and the task-specific corpus are: if they are similar, a small amount of fine-tuning may suffice; if not, more fine-tuning may be required.
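As a concrete (and deliberately simplified) sketch of this classic recipe, the snippet below continues training a pre-trained encoder on a small task dataset with the Hugging Face Trainer; the model name, dataset, and hyperparameters are placeholders, not a recommendation:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Small task-specific corpus (stand-in); all model weights will be updated.
dataset = load_dataset("imdb")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
                  tokenizer=tokenizer)
trainer.train()
```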


1. Prompt tuning

The simplest way to achieve parameter-efficient fine-tuning is prompt tuning (also known as P-Tuning): the model's feed-forward parameters are frozen and only a small set of embedding parameters is updated, which makes low-cost fine-tuning of large models possible.

The classic prompt tuning approach does not update any parameters of the underlying model; instead, it focuses on crafting input cues, or templates, that guide the pre-trained model to produce the desired output. In P-tuning (v1), a prompt encoder (BiLSTM+MLP) encodes a set of pseudo prompts (continuous stand-ins for discrete tokens), and the result is concatenated with the input embeddings. The LSTM serves as a reparameterization that speeds up training, and a small number of natural-language anchor tokens (such as "Britain") are introduced to further strengthen the signal; combined with a pattern like (capital, Britain), the model generates the answer, and only the encoder part is optimized. However, P-tuning v1 has two significant shortcomings: it does not generalize across tasks or across model scales. It performs poorly on some complex natural language understanding (NLU) tasks, and it does not work well when the pre-trained model is too small.
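For reference, the Hugging Face PEFT library ships a prompt-encoder implementation in this spirit (its internals differ in detail from the original BiLSTM+MLP description). A rough sketch, with an illustrative backbone and hyperparameters:

```python
from transformers import AutoModelForSequenceClassification
from peft import PromptEncoderConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

peft_config = PromptEncoderConfig(
    task_type=TaskType.SEQ_CLS,     # an NLU-style classification task
    num_virtual_tokens=20,          # number of pseudo prompt tokens
    encoder_hidden_size=128,        # hidden size of the prompt encoder (reparameterization)
)
model = get_peft_model(base, peft_config)
model.print_trainable_parameters()  # only the prompt encoder / virtual embeddings train
```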

2. Prefix Tuning

Any discussion of P-tuning has to mention prefix tuning. Compared with full fine-tuning, prefix tuning optimizes only a small, learnable, continuous task-specific vector (the prefix) rather than the parameters of the entire model.

Prefix tuning designs different schemes for different model architectures. Taking the autoregressive model as an example, it no longer uses tokens as the prefix but uses parameters directly: a matrix $P$ of size $l \times d$ serves as the prefix. Because training such a prefix directly is unstable, an MLP layer is used to reparameterize it from a smaller dimension up to $d$, and the prefix is added not only at the embedding layer but at every other layer as well. During fine-tuning only the prefix parameters are adjusted while the large model's parameters remain unchanged, so for each task only the reparameterized prefix needs to be saved.
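A sketch of what this looks like with the PEFT library's prefix-tuning support; the backbone and prefix length are illustrative, and `prefix_projection=True` enables the MLP reparameterization mentioned above:

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,      # prefix length l
    prefix_projection=True,     # reparameterize the prefix through an MLP for stable training
)
model = get_peft_model(base, peft_config)
model.print_trainable_parameters()  # the backbone stays frozen; only the prefix (and MLP) train
```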

3. P-tuning v2

The V2 version is mainly based on P-tuning and prefix-tuning technologies, and introduces strategies such as Deep Prompt Encoding and Multi-task Learning for optimization.

Experiments show that fine-tuning only 0.1% of the parameters can achieve performance comparable to full fine-tuning on language models ranging from 330M to 10B parameters. Comparing the v1 and v2 frameworks: in P-tuning v2, the continuous prompt is added to the front of the sequence and trainable prompts are inserted at every layer, whereas in v1 the prompt is inserted only into the input embedding, so the number of trainable parameters is limited by the sentence length. In addition, P-tuning v2 includes the following improvements:

  • It removes the LSTM/MLP reparameterization used to speed up training in v1;
  • It adopts multi-task learning: prompts are first pre-trained on a multi-task dataset and then adapted to downstream tasks;
  • It abandons the verbalizer-style vocabulary mapping and returns to using [CLS] and ordinary token labels, taking the output of [CLS] or of individual tokens as in traditional fine-tuning for NLU, which improves generality and allows adaptation to sequence labeling tasks.
All in all, P-tuning v2 is a way of applying prefix tuning to NLU tasks. Because it inserts trainable tokens at every layer, it increases the capacity that can change during training, which makes it especially helpful for smaller models.
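As a purely schematic illustration (not the official implementation), the module below captures the "independent trainable prompt per layer, no LSTM reparameterization" idea behind P-tuning v2; all names and sizes are made up for the example:

```python
import torch
import torch.nn as nn

class DeepPrompts(nn.Module):
    """One trainable prompt table per transformer layer (P-tuning v2 style)."""
    def __init__(self, num_layers: int, num_virtual_tokens: int, hidden_size: int):
        super().__init__()
        self.prompts = nn.Parameter(
            torch.randn(num_layers, num_virtual_tokens, hidden_size) * 0.02
        )

    def prepend(self, layer_idx: int, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        batch = hidden_states.size(0)
        prompt = self.prompts[layer_idx].unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, hidden_states], dim=1)

deep_prompts = DeepPrompts(num_layers=12, num_virtual_tokens=16, hidden_size=768)
h = torch.randn(2, 10, 768)                 # dummy hidden states for one layer
print(deep_prompts.prepend(0, h).shape)     # torch.Size([2, 26, 768])
```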

4. LoRA

The essence of LoRA is to wrap a "shell" around the weight matrices: these shells add a learned update to the original pre-trained weights so that they better fit the downstream task, which is exactly what fine-tuning amounts to. Its assumption is that pre-trained language models have a low "intrinsic dimension", so the weight updates made while adapting to a downstream task should also have a low "intrinsic rank".

Fine-tuning a large language model can be written in simplified form as:

$W = W_0 + \Delta W, \qquad W \in R^{d \times k},\; W_0 \in R^{d \times k}$

where $W$ is the weight matrix after fine-tuning (corresponding to a dense layer in the language model; these matrices are generally full rank), $W_0$ is the pre-trained weight, and $\Delta W$ is the update accumulated during fine-tuning. LoRA rewrites $\Delta W$ as the product of two matrices:

$W = W_0 + \Delta W = W_0 + BA, \qquad B \in R^{d \times r},\; A \in R^{r \times k}$

This introduces a rank $r$ with $r \ll \min(d, k)$. During training, LoRA freezes the pre-trained weight $W_0$ and trains only $A$ and $B$, so the number of trainable parameters drops sharply. Generally speaking, setting a larger $r$ tends to improve the fine-tuning effect, at the cost of more trainable parameters.
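A quick back-of-the-envelope calculation makes the saving concrete (the dimensions below are illustrative, roughly the size of one large attention projection):

```python
d, k, r = 4096, 4096, 8

full_update = d * k              # dense delta-W: 16,777,216 parameters
lora_update = d * r + r * k      # B (d x r) plus A (r x k): 65,536 parameters

print(full_update // lora_update)  # 256x fewer trainable parameters for this layer
```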

One way to build intuition for this low-rank factorization is the SVD. Given a matrix $M$, compute $M = U \Sigma V^T$, where the columns of $U$ and $V$ are the eigenvectors of $M M^T$ and $M^T M$ respectively, and $\Sigma$ holds the singular values. Keeping only the top $k$ singular values, and taking $X$ as the first $k$ columns of $U\Sigma$ and $Y$ as the first $k$ rows of $V^T$ (where $k$ is a preset target rank), the product $M_k = X \cdot Y$ is the best rank-$k$ approximation of the original matrix, $M_k \approx M$. LoRA's $BA$ plays the same role for the weight update $\Delta W$, except that the two factors are learned directly during fine-tuning rather than computed by an SVD.

In the end, the weight file we obtain contains the $BA$ product for each adapted layer; at inference time we compute $W = W_0 + BA$ to recover the final fine-tuned weights. The benefits of this approach are as follows:

  1. It greatly reduces the number of parameters that need to be trained and stored.
  2. Compared with adding adapter layers to fine-tune the model, it introduces no extra layers; it only adds a delta to the original weights, so inference time is unchanged before and after fine-tuning.
  3. Because the final artifact is just the per-layer $BA$ matrices and the original model weights are untouched, the result behaves like a plug-in: it is plug-and-play, separate LoRA weights can be produced for different fine-tuning tasks, and they are cheap to store.

Generally, when fine-tuning a model with LoRA, only two parameters need attention: r and lora_target_modules. The former determines the rank of the matrices B and A constructed during LoRA fine-tuning; the latter determines which modules of the large language model LoRA is applied to, and the exact module names depend on the specific model.
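For instance, with the PEFT library these two knobs correspond to `r` and `target_modules` in `LoraConfig`; the module names below fit GPT-2 and are only an example, so they must be changed to match the model you actually load:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # rank of B and A
    lora_alpha=16,                # scaling applied to BA
    lora_dropout=0.05,
    target_modules=["c_attn"],    # GPT-2 attention projection; e.g. ["q_proj", "v_proj"] for LLaMA-style models
)
model = get_peft_model(base, peft_config)
model.print_trainable_parameters()   # typically well under 1% of the base model
```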

5. RLHF – Reinforcement Learning from Human Feedback

The idea of RLHF is to use reinforcement learning to optimize a language model directly with human feedback, so that a model trained on a general text corpus can be aligned with complex human values.
RLHF is a complex pipeline involving multiple models and several training stages; it is generally divided into the three steps below, which are also the stages a large model of this kind typically goes through.

  • The first step is supervised fine-tuning (SFT): a pre-trained language model (LM) is further fine-tuned on curated demonstration data.

  • The second step is to train a reward model (Reward Model, RM): question-and-answer data is collected and the RM is trained by having humans rank different outputs for the same prompt;

  • The third step is to fine-tune the LM with a reinforcement learning (RL) algorithm.

The following largely follows the Hugging Face blog's description of RLHF.

Step 1. Pre-trained language model

First, we train a language model with the classic pre-training objective. For this step, OpenAI used a smaller version of GPT-3 in its first popular RLHF model, InstructGPT; Anthropic used Transformer models ranging from 10 million to 52 billion parameters; DeepMind used its own 280-billion-parameter model Gopher.

This LM can then be further fine-tuned on additional text or conditions. For example, OpenAI fine-tuned on human-generated text judged "preferable", while Anthropic distilled the original LM on context clues according to its "helpful, honest and harmless" criteria. Expensive data augmentation can be used here, but it is not a required step of RLHF. Since RLHF is still a largely unexplored area, there is no clear answer as to which model makes the best starting point. Next, we will use this LM to generate the data for training the reward model (RM, also called a preference model), which is where human preference information enters the pipeline.




Step 2. Train the reward model


The RM takes a sequence of text and returns a scalar reward that numerically represents human preference. It can be modeled end-to-end with an LM, or as a modular system (for example, ranking several outputs and converting the ranking into rewards). Producing a single scalar reward is crucial for plugging into existing RL algorithms later.
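A minimal sketch of such a scalar reward model, assuming an encoder backbone with a linear value head and a pairwise ranking loss over (preferred, rejected) answer pairs; the model name and pooling choice are illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """LM backbone plus a linear head that outputs one scalar reward per sequence."""
    def __init__(self, backbone_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        last_idx = attention_mask.sum(dim=1) - 1                  # index of last real token
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)                # (batch,)

def pairwise_loss(reward_chosen, reward_rejected):
    # push the human-preferred answer to score higher than the rejected one
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
rm = RewardModel()
good = tokenizer("Q: ... A: a helpful answer", return_tensors="pt")
bad = tokenizer("Q: ... A: an unhelpful answer", return_tensors="pt")
loss = pairwise_loss(rm(**good), rm(**bad))
loss.backward()
```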
Regarding model selection, the RM can be another fine-tuned LM or an LM trained from scratch on preference data. For example, Anthropic uses a dedicated pre-training method, Preference Model Pretraining (PMP), in place of ordinary fine-tuning after general pre-training, because it is considered to use sample data more efficiently. There is still no consensus on which kind of RM works best.

Regarding the training text, the RM's prompt-generation pairs are sampled from a predefined dataset, and the initial LM is used to generate text for these prompts. Anthropic's data was mainly produced via a chat tool on Amazon Mechanical Turk and is available on the Hub, while OpenAI used prompts submitted by users to the GPT API.

Regarding the reward values used for training, the answers generated by the LM need to be ranked by humans. One might at first expect the RM to be trained directly on numerical scores assigned to each text, but such scores are uncalibrated and noisy because annotators apply different standards. Ranking instead lets us compare the outputs of multiple models and build a better-normalized dataset.

As for the specific ranking method, one successful approach is to compare the outputs of different LMs for the same prompt and then use the Elo system to build a complete ranking. The ranking results are then normalized into scalar reward values for training. An interesting artifact of this process is that successful RLHF systems to date have used reward models of quite different sizes relative to the generating LM; one intuition is that the preference model needs a similar capacity to understand the text presented to it as the model that generates that text. Next comes the final step: fine-tuning and optimizing the LM with reinforcement learning, using the rewards output by the RM.




Step 3. Fine-tuning with reinforcement learning


Training LMs with reinforcement learning was long considered impractical for both engineering and algorithmic reasons. The feasible recipe that several organizations have converged on is to fine-tune some or all of the initial LM's parameters with a policy-gradient RL algorithm, Proximal Policy Optimization (PPO), since the cost of fine-tuning all 10B~100B+ parameters is prohibitive (see, for example, low-rank adaptation (LoRA) for LMs or DeepMind's Sparrow LM). PPO has been around for a relatively long time and there is plenty of guidance on its principles, which makes it a favorable choice for RLHF. Many of the core RL advances in RLHF have been about figuring out how to apply a familiar algorithm to updating such a large model. Let us first formulate the fine-tuning task as an RL problem. The policy is an LM that takes a prompt and returns a sequence of text (or a probability distribution over text). The action space of this policy is the LM's entire vocabulary (typically on the order of 50k tokens), and the observation space is the set of possible input token sequences, which is also very large (roughly vocabulary size to the power of the input length). The reward function is a combination of the preference model and a constraint on how far the policy is allowed to shift.
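To illustrate that last point, here is a toy version of how the preference-model score and the policy-shift constraint are commonly combined into the reward fed to PPO; the β coefficient and the KL approximation are illustrative choices, not a fixed recipe:

```python
import torch

def rlhf_reward(rm_score: torch.Tensor,
                logprobs_policy: torch.Tensor,
                logprobs_ref: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """rm_score: (batch,) preference-model scores.
    logprobs_policy / logprobs_ref: (batch, seq_len) token log-probs under the
    current policy and the frozen reference LM."""
    approx_kl = (logprobs_policy - logprobs_ref).sum(dim=-1)  # per-sequence KL estimate
    return rm_score - beta * approx_kl                        # penalize drifting from the reference

rm_score = torch.tensor([1.2, -0.3])
lp_policy, lp_ref = torch.randn(2, 16), torch.randn(2, 16)
print(rlhf_reward(rm_score, lp_policy, lp_ref))
```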





Finally, following the PPO algorithm, we optimize using the rewards computed on the current batch of data (reflecting PPO's on-policy nature). PPO is a trust-region optimization algorithm that constrains the gradient update so that the update step does not destabilize learning. DeepMind used a similar reward setup for Gopher but optimized with the A2C (synchronous advantage actor-critic) algorithm instead.

6. DeepSpeed ZeRO – Zero Redundancy Optimizer

DeepSpeed is Microsoft's large-scale distributed training tool, designed specifically for training very large models. Its 3D parallelism addresses the two fundamental challenges of training trillion-parameter models: memory efficiency and compute efficiency. As a result, DeepSpeed can scale to fit the largest models in GPU memory without sacrificing speed.

Full-parameter fine-tuning can then be achieved by combining DeepSpeed with ZeRO. Of course, full fine-tuning with DeepSpeed demands more GPU memory and trains more slowly, but it is often the stronger option. DeepSpeed ZeRO-2 is used mainly for training, since its optimizations are not relevant to inference; ZeRO-3, on the other hand, can also be used for inference, because it allows a huge model to be sharded across multiple GPUs when it cannot fit on a single one.

In Python, the Accelerate[7] library provides simple APIs that let the same code run on any single-node or distributed setup (single CPU, single GPU, multi-GPU, and TPU), with or without mixed precision (fp16).

Here is an example of using Accelerator together with DeepSpeedPlugin, where the number of gradient accumulation steps (gradient_accumulation_steps) needs to be known in advance.
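A rough sketch of that setup follows; the model, optimizer, and data are placeholders, and the ZeRO stage and accumulation steps are arbitrary example values:

```python
import torch
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)  # add mixed_precision="fp16" for half precision

model = torch.nn.Linear(1024, 1024)                           # stand-in for a real LM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 1024), batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    loss = model(batch).pow(2).mean()                         # dummy loss
    accelerator.backward(loss)                                # handles ZeRO / precision details
    optimizer.step()
    optimizer.zero_grad()
```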

…For details, please refer to Gu Yueju


Origin blog.csdn.net/lovely_yoshino/article/details/130778577