NLP large model fine-tuning principle

1. Background

An LLM (Large Language Model) is a language model designed to understand and generate human language; it must be trained on a large amount of text data. Such models are generally based on the Transformer architecture and have parameters at the billion scale or above, e.g. GPT-3 (175B) and PaLM (540B).

Three big things happened in the NLP world:

  1. ChatGPT: an AI chatbot released by OpenAI in November 2022, based on GPT-3.5

  2. LLaMA: a pre-trained model released by Meta in February 2023, redefining what "big" means for large models

  3. Alpaca: a fine-tuned model released by Stanford in March 2023, demonstrating the feasibility of instruction fine-tuning

The technology behind ChatGPT:

  • GPT models: the base models (GPT-3, GPT-3.5-Turbo and GPT-4); large model capacity, requiring pre-training on a large amount of data.

  • IFT (Instruction Fine-Tuning): instruction fine-tuning. Instructions are input texts with a clear purpose given by the user; instruction fine-tuning teaches the model to follow them. OpenAI calls this SFT (Supervised Fine-Tuning), which means essentially the same thing.

  • CoT (Chain-of-Thought): at the data level, a special case of the instruction format that includes a step-by-step reasoning process; at the model level, the ability to reason step by step.

  • RLHF (Reinforcement Learning from Human Feedback): optimizing the language model with reinforcement learning based on human feedback.

2. Large model training method

2.1 FLAN

The paper "Finetuned Language Models Are Zero-Shot Learners" FLAN clearly proposes instruction fine-tuning. The essential purpose is to convert NLP tasks into natural language instructions and then feed them to the model for training, so as to improve the performance of zero-shot tasks.

paper:https://arxiv.org/abs/2109.01652
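
As a concrete illustration, a sentiment-classification example can be recast as a natural-language instruction. This is a minimal sketch; the template wording is illustrative and not taken from FLAN's own template collection.

```python
# Minimal sketch: turning a standard NLP example into an instruction-format
# training pair. The template wording is illustrative, not FLAN's exact one.
example = {"text": "The movie was a waste of two hours.", "label": "negative"}

instruction_input = (
    "Read the following movie review and decide whether its sentiment is "
    "positive or negative.\n\n"
    f"Review: {example['text']}\n"
    "Sentiment:"
)
target = example["label"]

# During instruction fine-tuning, the model learns to map
# instruction_input -> target with an ordinary language-modeling loss.
print(instruction_input)
print(target)
```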

2.2 T0

The paper "Multitask Prompted Training Enables Zero-shot Task Generalization" T0 explores how the generalization ability of the large model zero-shot is realized, and proves that the zero-shot generalization ability of the language model can be realized through explicit multi-task prompt training .

paper:https://arxiv.org/abs/2110.08207

1. Multi-task prompted training yields better zero-shot performance than a model of the same size trained without it.

2. The paper compares the zero-shot performance of T0 and GPT-3 models:

a. T0 exceeds GPT-3 on 9 out of 11 datasets;

b. Neither T0 nor GPT-3 is trained on NLI, yet T0 outperforms GPT-3 on all NLI datasets.

2.3 Flan-T5

The paper "Scaling Instruction-Finetuned Language Models" Flan-T5 proposes a set of multi-task fine-tuning scheme (Flan). By fine-tuning on ultra-large-scale tasks, the language model has a strong generalization performance, so that a single The model can perform well on more than 1800 NLP tasks. This means that the model can be used directly on almost all NLP tasks, realizing "One model for ALL tasks", which is very tempting!

paper:https://arxiv.org/abs/2210.11416

Flan-T5 shows the following experimental conclusions:

  1. Scaling the number of tasks (the more fine-tuning tasks, the better the performance)

  2. Scaling the model size (the more model parameters, the better the performance)

  3. Fine-tuning on chain-of-thought (CoT) data (CoT data improves reasoning ability)
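
A minimal usage sketch with the Hugging Face transformers library (the checkpoint name "google/flan-t5-base" is one of the published Flan-T5 checkpoints; the prompt here is illustrative):

```python
# Minimal sketch: querying a released Flan-T5 checkpoint with an instruction.
# Requires `pip install transformers sentencepiece`.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = "Translate to German: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```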

2.4 Chain-of-Thought(CoT)

● Few-shot CoT

The paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" proposes the method of Chain-of-Thought (CoT) thinking chain to improve the reasoning ability of large models in tasks such as mathematical calculation, common sense, and symbolic reasoning.

paper:https://arxiv.org/abs/2201.11903

The inspiration for CoT comes from the human reasoning process. The authors draw on this process and elicit reasoning in large models by designing chains of thought; thanks to this chain of logical, multi-step intermediate reasoning, the model can reach the correct final answer.
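
A minimal sketch of a few-shot CoT prompt: the demonstration contains an explicit step-by-step rationale, which nudges the model to produce intermediate reasoning before its answer (the worked example below is in the style of the paper's demonstrations):

```python
# Few-shot CoT prompt: one demonstration with a written-out rationale,
# followed by the new question the model should answer the same way.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""

# The prompt is sent to the LLM, which is expected to emit a rationale
# ending with "The answer is 9."
print(cot_prompt)
```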

● Zero-shot CoT

The paper "Large Language Models are Zero-Shot Reasoners" explored the reasoning ability of large models. The author found that for GPT3 with 175B parameters, simply adding `Let's think step by step` can improve the mathematical reasoning of the model (arithmetics reasoning) And the zero-shot ability of symbolic reasoning.

paper:https://arxiv.org/abs/2205.11916

The authors propose Zero-shot-CoT, which proceeds in two stages (a code sketch follows below):

1. Append a trigger prompt to the original question and use the LLM to generate the reasoning process.

2. Append the reasoning generated by the LLM to the original question, add an answer-extraction hint, and use the LLM to generate the final answer.

Effect: this simple nudge raises GPT-3's zero-shot accuracy on arithmetic reasoning dramatically, by roughly 61 percentage points on MultiArith.
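
A minimal sketch of the two-stage procedure, assuming a generic `generate(prompt)` function that wraps whatever LLM API is in use; the answer-extraction wording follows the paper's "Therefore, the answer is" style but is an approximation:

```python
# Zero-shot CoT in two LLM calls: (1) elicit the reasoning, (2) extract the answer.
def zero_shot_cot(question: str, generate) -> str:
    # Stage 1: append the trigger phrase and let the model reason step by step.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Stage 2: feed back the question plus the generated reasoning with an
    # answer-extraction hint, and let the model produce the final answer.
    answer_prompt = f"{reasoning_prompt}\n{reasoning}\nTherefore, the answer is"
    return generate(answer_prompt)

# `generate` is a placeholder for any text-completion call; it takes a prompt
# string and returns the model's completion as a string.
```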

2.5 Reinforcement Learning Human Feedback (RLHF)

● InstructGPT

The paper "Training language models to follow instructions with human feedback" InstructGPT is the predecessor of ChatGPT, which mainly explores the use of RLHF (Reinforcement Learning from Human Feedback) method to align human intentions in large models.

paper:https://arxiv.org/abs/2203.02155

A big problem with the prompt-based zero-shot learning paradigm of GPT-style large language models is that the pre-training task is merely predicting subsequent text, which deviates from the requirements of specific tasks, so the generated results do not necessarily conform to human intent. Some form of fine-tuning is therefore needed for alignment.

The method is three steps:

1. Supervised training of the policy model on human-written demonstration data

2. Training a reward model on human-ranked comparison data

3. Training the policy model with reinforcement learning (using the reward model)

Steps 2 and 3 can be iterated alternately to keep humans in the optimization loop; a sketch of the reward-model loss from step 2 follows.
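
A minimal PyTorch sketch of the pairwise ranking loss typically used for step 2: the reward model should score the human-preferred response higher than the rejected one. `reward_model` is a placeholder that returns a scalar score per (prompt, response) pair; this is the standard formulation rather than InstructGPT's exact code.

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, prompts, chosen, rejected):
    """Pairwise loss over a batch of human comparisons.

    reward_model(prompts, responses) is assumed to return a 1-D tensor of
    scalar rewards, one per example.
    """
    r_chosen = reward_model(prompts, chosen)
    r_rejected = reward_model(prompts, rejected)
    # -log sigmoid(r_chosen - r_rejected): minimized when the preferred
    # response scores higher than the rejected one by a large margin.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```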

In both evaluation settings shown in the paper, the models trained with reinforcement learning (PPO and PPO-ptx) are clearly better than the GPT or supervised-learning baselines.

"Good" here means that its output is more liked by annotators (true, useful, harmless). The data and evaluation criteria are consistent with the training time, all aligned to human preferences. This is where the advantage of RLHF lies, and it naturally beats GPT3.

In addition, the PPO-ptx model adds a pre-training regularization term during reinforcement learning to avoid losing points on public NLP tasks (addressing the so-called "alignment tax"), though a slight negative impact remains.

3. LLaMA

In the paper "LLaMA: Open and Efficient Foundation Language Models", the authors train on roughly 1T tokens and show that SOTA-level models can be trained using only public datasets.

Unlike Chinchilla, PaLM or GPT-3, LLaMA uses only publicly available data, making the work compatible with open-sourcing, whereas most existing models rely on data that is not publicly available.

paper:https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/

3.1 Performance

● LLaMA-13B > GPT-3 (175B): LLaMA-13B outperforms GPT-3 (175B) on most benchmarks

● LLaMA-65B ≈ PaLM-540B: LLaMA-65B is also competitive with the best models, Chinchilla-70B and PaLM-540B

LLaMA redefines "big" for large models

3.2 Motivation

1. The open-source models OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022) do not match PaLM-62B or Chinchilla; LLaMA is stronger.

2. The paper "Training Compute-Optimal Large Language Models" found that the best performance is not on the largest model, but on the model that uses more tokens, so the author believes that a smaller model takes longer to train , using more tokens, can achieve the same model effect, and is cheaper in prediction.

paper:https://arxiv.org/abs/2203.15556
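
A back-of-the-envelope sketch of this trade-off, using the commonly quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter and the standard C ≈ 6·N·D approximation for training FLOPs (both are approximations, not exact values from the paper):

```python
# Rough compute-optimal sizing: N = parameters, D = training tokens.
def compute_optimal(params_billion: float) -> dict:
    n = params_billion * 1e9
    d = 20 * n            # rule-of-thumb: ~20 tokens per parameter
    flops = 6 * n * d     # standard training-FLOPs approximation C = 6*N*D
    return {"params": n, "tokens": d, "train_flops": flops}

# A 7B model would be "compute-optimal" at roughly 140B tokens under this rule;
# LLaMA-7B deliberately trains well past that point (on the order of 1T tokens)
# to get a smaller model that is cheaper at inference time.
print(compute_optimal(7))
```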

3.3 Architecture

LLaMA also makes several improvements to the transformer architecture (a minimal RMSNorm sketch follows the list):

● Pre-normalization [GPT3]: RMSNorm normalization function

● SwiGLU activation function [PaLM].

● Rotary Embeddings [GPTNeo].
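
A minimal PyTorch sketch of RMSNorm, the pre-normalization LLaMA uses in place of standard LayerNorm (simplified; the released implementation may differ in details such as dtype handling):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescales activations by their root mean square with a learned gain,
    dropping the mean-centering and bias terms of LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * inv_rms
```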

3.4 Future Work

1. Continue to scale to larger models

2. Do instruction tuning and RLHF

4. Fine-tuning method

4.1 Adapter

In 2019, the paper "Parameter-Efficient Transfer Learning for NLP" proposed the Adapter approach, the pioneering work of the delta-tuning family of techniques.

The main idea is to keep the parameters of the large pre-trained model frozen and add a small number of trainable parameters to the original network. Concretely, the Adapter scheme inserts a small new neural network, called an adapter, after the attention sublayer and after the feed-forward sublayer of each transformer layer.

Only the adapter parameters are trained during fine-tuning to suit the specific downstream task. This greatly reduces the number of parameters that need to be updated, improving parameter efficiency and training speed.

As illustrated in the paper's figure, the Adapter scheme adds an adapter after the attention and feed-forward sublayers of each layer, which effectively reduces the number of trainable parameters. The method has been widely used in natural language processing and achieves good results.
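
A minimal PyTorch sketch of an adapter block in the spirit of the paper: a bottleneck down-projection, a nonlinearity, an up-projection, and a residual connection (dimensions and initialization are simplified):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter applied to the output of a frozen sublayer."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen model's behavior as the
        # default; only the small down/up projections are trained.
        return x + self.up(self.act(self.down(x)))
```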

4.2 LoRA

In 2021, Microsoft proposed LoRA in the paper "LoRA: Low-Rank Adaptation of Large Language Models". The paper analyzes the Transformer architecture and observes that the dense weight matrices, including the Q/K/V projection matrices of the attention layers and the MLP matrices of the feed-forward layers, account for most of the computation.

paper:https://arxiv.org/abs/2106.09685

LoRA mainly targets the attention projection matrices to improve training speed and efficiency. Its core idea is to add a matrix in parallel with the frozen weight matrix and to represent that added matrix as the product of two low-rank matrices, which effectively reduces the number of trainable parameters.

Experimental results show that LoRA significantly improves training speed and efficiency without hurting model performance. The scheme has been widely adopted in natural language processing and achieves excellent results on multiple tasks.
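
A minimal PyTorch sketch of the idea: a frozen weight matrix is augmented with a parallel low-rank update B·A, scaled by alpha/r (simplified relative to the official implementation, which also handles dropout and weight merging):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and A, B trainable."""
    def __init__(self, in_dim: int, out_dim: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)           # frozen pre-trained weight
        self.lora_a = nn.Linear(in_dim, r, bias=False)   # down-projection A
        self.lora_b = nn.Linear(r, out_dim, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)               # update starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```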

4.3 Prefix-Tuning

In 2021, Stanford proposed Prefix-Tuning in the paper "Prefix-Tuning: Optimizing Continuous Prompts for Generation". The main idea is to leave the original network unchanged and add a prompt prefix to the input. The prompt can be discrete; for example, for an NER task, "please find all entities in the sentence" can be added as the prompt.

Such prompts can be designed by hand or searched automatically. The problem is that the final performance is extremely sensitive to changes in a hand-designed prompt: adding or dropping a word, or moving it, can cause relatively large swings, and automated prompt search is also relatively expensive.

The paper therefore adopts the second approach: training a separate set of continuously tunable virtual tokens for each task, which works better than discrete tokens. To increase the number of tunable parameters, the prefix is added not only at the first layer but at every layer of the transformer. For the T5 architecture, which has both an encoder and a decoder, the prompt prefix must be added to both the encoder input and the decoder input.
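
A minimal PyTorch sketch of the core idea: a small table of trainable virtual-token embeddings is prepended to the (frozen) input embeddings. The full method also injects prefixes at every transformer layer and reparameterizes them with an MLP; that is omitted here.

```python
import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """Trainable virtual tokens prepended to frozen input embeddings."""
    def __init__(self, prefix_len: int, hidden_dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_dim)
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```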

To sum up, the idea of delta-tuning is to freeze the parameters of the original pre-trained model and add a small number of new parameters that are fine-tuned for the downstream task. As for where to put the new parameters: anywhere that works.

5. Fine-tuning implementation

5.1 Alpaca

On March 15, 2023, Stanford released the Alpaca model ("Alpaca: A Strong, Replicable Instruction-Following Model"), fine-tuned from Meta's LLaMA-7B on only 52K instruction examples; its performance is roughly comparable to GPT-3.5, and the training cost was under $600.

paper:Stanford CRFM

There are two main challenges in training such an instruction-following model:

1. A strong pre-trained base model: LLaMA-7B is used;

2. High-quality instruction data: following the automatic instruction-generation method of the SELF-INSTRUCT paper, OpenAI's text-davinci-003 is used to generate 52K instruction examples (the prompt format used for fine-tuning is sketched below).
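
A sketch of the instruction-following prompt format used for Alpaca-style fine-tuning (the wording below follows the widely circulated template; the released repository should be consulted for the exact string):

```python
# Alpaca-style prompt template for examples that include an input field.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

example = {
    "instruction": "Summarize the following text in one sentence.",
    "input": "LLaMA is a family of open foundation language models released by Meta.",
}
print(PROMPT_WITH_INPUT.format(**example))
```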

After the Alpaca model, a large number of LLaMA-based fine-tuned models appeared, including ChatLLama, FreedomGPT, Vicuna, Koala, and more; the camelid family is nearly running out of names.

Essentially, a large language model like ChatGPT achieves its "brute force works miracles" effect by burning money and compute. This brings a problem: the cost of such models will discourage many small companies, leaving the field to a monopolistic few. And for a company like Xiaohongshu or Bilibili, which can neither afford to train large models nor is willing to hand the data of its own content pool to others, the situation is quite awkward.

Alpaca and Vicuna show another possibility: reproducing 90% or even 99% of the capability of large language models at a very low price through "knowledge distillation". That means small companies can also train their own AI models.

In other words, ChatGPT opened the prelude to putting AI into production, and Vicuna suggests that a world with AI everywhere may be just around the corner.

5.2 Self-Instruct

In 2022, the University of Washington proposed the Self-Instruct framework in the paper "SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions": it uses minimal manual annotation to generate a large amount of data for instruction tuning. The authors also released a 52K instruction-tuning dataset produced with this method.

paper:https://arxiv.org/abs/2212.10560

Self-Instruct Dataset Construction Method
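
A high-level sketch of the construction loop, assuming placeholder helpers for the LLM call and a ROUGE-style similarity score (the sampling sizes and the 0.7 similarity threshold roughly follow the paper's description):

```python
import random

def self_instruct(seed_tasks, llm_generate, similarity, rounds=1000):
    """Grow an instruction pool from a small set of human-written seed tasks."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        # 1. Sample a handful of in-context demonstrations from the pool
        #    (the paper mixes human seeds with previously generated tasks).
        demos = random.sample(pool, k=min(8, len(pool)))
        # 2. Ask the LLM to write new instructions (and instances) in the
        #    same format as the demonstrations.
        candidates = llm_generate(demos)
        # 3. Keep only candidates that are not too similar to anything
        #    already in the pool (ROUGE-L below ~0.7 in the paper).
        for cand in candidates:
            if all(similarity(cand, task) < 0.7 for task in pool):
                pool.append(cand)
    return pool

# `llm_generate` and `similarity` are placeholders: the former wraps an LLM
# API call, the latter a ROUGE-L scorer over instruction strings.
```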

Effect comparison:

(Figure from the paper: answers are scored A to D, where A is best and D is worst; green marks the best results, red the worst.)

As can be seen:

  • The original GPT-3 can barely respond to user instructions, while all instruction-tuned models improve significantly

  • Even though the data generated by Self-Instruct is noisy, GPT3 + Self-Instruct is significantly better than GPT3 + T0 training and GPT3 + SuperNI training

  • GPT3 + Self-Instruct comes very close to InstructGPT-001

  • InstructGPT-003 is the most effective

Instruction data collection method:

  • The self-instruct data obtained by Alpaca from GPT-3.5;

  • The self-instruct data obtained in the Alpaca style from GPT-4;

  • ShareGPT data shared by users of ChatGPT.

6. Future directions

1. Further expand the model scale, improve model architecture and training

Improving the model architecture or training process may yield high-quality models with emergent capabilities at reduced computational cost.

One direction is sparse mixture-of-experts architectures, which are more computationally efficient while keeping the per-input cost constant; others include more localized learning strategies instead of backpropagation over all of the network's weights, and augmenting models with external memory.

2. Expand data scale

Training long enough on a sufficiently large dataset has proven to be key for language models to acquire syntactic, semantic, and other world knowledge. Recently, Hoffmann et al. argued that previous work underestimated the amount of training data needed to train an optimal model. Collecting large datasets on which models can be trained for longer allows a greater range of emergent capabilities within the constraint of a fixed model size.

3. Better prompts

While few-shot prompting is simple and effective, improvements to the generality of prompting will further expand the capabilities of language models.

For example, augmenting few-shot examples with intermediate steps enables the model to perform multi-step reasoning tasks that standard prompting cannot achieve. In addition, a better explanation of why prompting works may help elicit emergent capabilities in smaller models. A full understanding of why a technique works often lags behind its development and popularity, and best practices for prompting may change as more powerful models are developed.

4. Understanding Emergent Capabilities

In addition to studying how to further unlock emergent capabilities, a future research direction is understanding how and why emergent capabilities appear in large language models. Understanding emergence is an important direction: it helps us determine which emergent capabilities a model can have and how to train stronger language models.


Origin blog.csdn.net/shibing624/article/details/130540936