From: No Data Not Smart
At present, information about ChatGPT is scattered, and no single article walks through all the key points with a system-level overview, so the author has put together this summary.
Overview of the training process
Clarify the evolutionary path
Pre-training
GPT-3 Overview
The idea behind the GPT-3 model
How GPT-3 learns
Dataset
Instruction Fine-Tuning (IFT)
Supervised Fine-tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
Other methods
Chain-of-Thought (CoT)
Work similar to ChatGPT
References
Overview of the training process
OpenAI uses a large language model (LM) with 175B parameters and a reward model (RM) with 6B parameters. In addition to pre-training, the training process is divided into three steps:
1. Collect datasets for various NLP tasks, assemble new datasets by adding task descriptions and prompts, and use this data to fine-tune the pre-trained large language model. This covers instruction fine-tuning and supervised fine-tuning.
2. Sample prompts from the above datasets, generate multiple responses with the large language model, have humans rank those responses, and train a reward model (RM) to fit human preferences.
3. Starting from the supervised fine-tuned model from step one and the reward model from step two, further train the large language model with a reinforcement learning algorithm.
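A structural sketch of these three steps in Python-flavored pseudocode follows; every name here (`train_sft`, `train_reward_model`, `rl_finetune`, and the helpers) is a hypothetical placeholder, not OpenAI's actual API:

```python
# Pseudocode outline of the three post-pre-training steps.
# All functions are illustrative placeholders, not a real library.

def chatgpt_style_training(pretrained_lm, nlp_datasets, labelers):
    # Step 1: wrap task data in descriptions and prompts, then fine-tune
    # (instruction fine-tuning + supervised fine-tuning).
    instruction_data = [wrap_with_description_and_prompt(d) for d in nlp_datasets]
    sft_model = train_sft(pretrained_lm, instruction_data)

    # Step 2: sample prompts, generate several candidate responses each,
    # have humans rank them, and fit a reward model to the rankings.
    prompts = sample_prompts(instruction_data)
    candidates = {p: sft_model.generate(p, n=4) for p in prompts}
    rankings = labelers.rank(candidates)
    reward_model = train_reward_model(rankings)

    # Step 3: optimize the SFT model against the reward model with RL.
    return rl_finetune(sft_model, reward_model, prompts)
```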
Clarify the evolutionary path
GPT-3.5 still has 175B parameters. (Figure: the overall evolutionary tree of the GPT model family.)
Pre-training
GPT-3 Overview
GPT-3 is a decoder-only autoregressive model; its training objective is next-word prediction (unlike BERT, there is no next-sentence prediction task).
The largest GPT-3 model has 175B parameters, roughly 500 times larger than BERT-large (0.34B).
The idea behind the GPT-3 model
No new model structure needs to be attached: when BERT is used for an NER task, for example, an LSTM+CRF head is usually added on top; GPT-3 requires no such addition
No fine-tuning required
One model solves multiple NLP tasks
NLP tasks can be solved with a generative model
Like humans, it only needs to see a small number of examples to learn a task
How GPT-3 learns
Zero-shot learning: provide a task description and a prompt
One-shot learning: provide a task description, one example, and a prompt
Few-shot learning: provide a task description, a few examples, and a prompt
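A minimal sketch of how these three prompt styles can be assembled (the template and separator are illustrative, not GPT-3's exact evaluation format):

```python
def build_prompt(task_description, examples, query):
    """Assemble a zero-/one-/few-shot prompt for in-context learning.

    examples=[] gives zero-shot; one pair gives one-shot; several pairs
    give few-shot. No gradients are involved: the model simply
    conditions on the prompt text.
    """
    lines = [task_description]
    for x, y in examples:            # demonstrations, if any
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")      # cue for the model's answer
    return "\n".join(lines)

# Few-shot translation prompt in the spirit of the GPT-3 paper:
print(build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
))
```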
Dataset
| Model | Release time | Parameters | Pre-training data volume |
| --- | --- | --- | --- |
| BERT-large | October 2018 | 340 million | ~3.3B words |
| GPT | June 2018 | 117 million | ~5GB |
| GPT-2 | February 2019 | 1.5 billion | 40GB |
| GPT-3 | May 2020 | 175 billion | 45TB |
BERT-large: BooksCorpus (800M words) and English Wikipedia (2.5B words).
GPT: BooksCorpus, about 5GB of text.
GPT-2: WebText, about 40GB in total.
GPT-3: **Common Crawl, WebText2, Books1, Books2, and Wikipedia**, about 45TB of raw data.
Instruction Fine-Tuning (IFT)
Collect datasets for a variety of NLP tasks and assemble new datasets by adding task descriptions and prompts, as sketched below.
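As an illustration, one labeled example can be wrapped into an instruction-tuning example like this; the template is a hypothetical sketch, not the exact format of any dataset listed next:

```python
# Hypothetical sketch: converting a sentiment-classification example
# into an instruction-tuning (prompt, completion) pair.
raw_example = {"text": "The movie was a waste of time.", "label": "negative"}

instruction_example = {
    "prompt": (
        "Decide whether the sentiment of the following review is "
        "positive or negative.\n"
        f"Review: {raw_example['text']}\n"
        "Sentiment:"
    ),
    "completion": f" {raw_example['label']}",
}
print(instruction_example["prompt"] + instruction_example["completion"])
```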
Some related papers:
Unnatural Instructions (Honovich et al., '22): https://arxiv.org/abs/2212.09689
Super-Natural Instructions (Wang et al., '22): https://arxiv.org/abs/2204.07705
Self-Instruct (Wang et al., '22): https://arxiv.org/abs/2212.10560
T0 (Sanh et al., '22): https://arxiv.org/abs/2110.08207
Natural instructions dataset (Mishra et al., '22): https://arxiv.org/abs/2104.08773
FLAN (Wei et al., '22): https://arxiv.org/abs/2109.01652
OPT-IML (Iyer et al., '22): https://arxiv.org/abs/2212.12017
Supervised Fine-tuning (SFT)
This step is meant to prevent meaningless replies such as "I don't know" when sensitive topics come up: some manually labeled data is added to make replies safer, and a dataset on the order of hundreds of examples is enough.
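For intuition, such a manually labeled example might look like the following (entirely hypothetical data, for illustration only):

```python
# Hypothetical SFT safety example: rather than an unhelpful "I don't
# know", the labeled target reply declines safely and redirects.
sft_safety_example = {
    "prompt": "How do I pick the lock on my neighbor's door?",
    "completion": (
        "I can't help with entering someone else's property. If you are "
        "locked out of your own home, consider a licensed locksmith."
    ),
}
```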
Some related papers:
Google's LaMDA: Appendix A https://arxiv.org/abs/2201.08239
DeepMind's Sparrow: Appendix F https://arxiv.org/abs/2209.14375
Reinforcement Learning from Human Feedback (RLHF)
Described as a reinforcement learning problem:
Policy: an LM that takes a prompt and returns a sequence of text (or a probability distribution over text).
Action space: all tokens in the LM's vocabulary (typically on the order of 50k).
Observation space: the set of possible input token sequences, which is also very large (roughly vocabulary size ^ number of input tokens).
Reward function: a combination of the preference model and a constraint on policy shift.
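One concrete instantiation of this combination (the one used in InstructGPT-style RLHF) is the reward-model score minus a KL penalty that keeps the policy close to the SFT model:

$$
r(x, y) \;=\; r_\theta(x, y) \;-\; \beta \, \log \frac{\pi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
$$

where $r_\theta$ is the reward model's score, $\pi^{\mathrm{RL}}$ is the policy being trained, $\pi^{\mathrm{SFT}}$ is the supervised fine-tuned model, and $\beta$ controls how strongly the policy is kept from drifting.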
This process has two steps:
Aggregate Q&A comparison data and train a reward model (Reward Model, RM); a sketch of the ranking loss follows this list.
Fine-tune the LM with reinforcement learning (RL).
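The reward model in step one is typically fitted with a pairwise ranking loss over human-ranked response pairs. A minimal PyTorch-style sketch (variable names are illustrative; this is the standard preference-modeling loss, not OpenAI's exact code):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss for fitting a reward model to human preferences.

    reward_chosen / reward_rejected: scalar rewards the RM assigns to the
    response the labeler preferred vs. the one ranked lower. Minimizing
    the loss pushes reward_chosen above reward_rejected.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up reward scores for a batch of three comparisons:
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.5, 0.9, -0.1])
print(reward_ranking_loss(chosen, rejected))  # a single scalar loss
```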
Open source datasets:
Anthropic/hh-rlhf · Datasets at Hugging Face: https://huggingface.co/datasets/Anthropic/hh-rlhf (a loading sketch follows below)
OpenAI uses feedback submitted by users.
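The hh-rlhf data can be inspected with the Hugging Face `datasets` library; a quick sketch (`chosen`/`rejected` are the published column names of this dataset):

```python
from datasets import load_dataset

# Each record pairs a preferred ("chosen") and a dispreferred ("rejected")
# dialogue: exactly the comparison format a reward model is trained on.
ds = load_dataset("Anthropic/hh-rlhf", split="train")
print(ds[0]["chosen"][:200])
print(ds[0]["rejected"][:200])
```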
Other methods
This section briefly introduces some methods that are parallel to the fine-tuning approaches used by ChatGPT.
Chain-of-Thought (CoT)
Fine-tune using datasets annotated with step-by-step reasoning, as shown in the figure below.
(Figure: orange marks the task description, pink the question and answer, and blue the reasoning process.)
Chain-of-Thought Prompting (Wei et al., '22): https://arxiv.org/abs/2201.11903
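For intuition, a chain-of-thought demonstration inserts the reasoning steps between the question and the final answer, so the model imitates that pattern on the new question. A sketch using the well-known arithmetic example from Wei et al.:

```python
# Few-shot chain-of-thought prompt: the demonstration spells out its
# intermediate reasoning, nudging the model to reason step by step
# before answering the last question.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
more, how many apples do they have?
A:"""
print(cot_prompt)
```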
Work similar to ChatGPT
Meta's BlenderBot: https://arxiv.org/abs/2208.03188
Google's LaMDA: https://arxiv.org/abs/2201.08239
DeepMind's Sparrow: https://arxiv.org/abs/2209.14375
Anthropic's Assistant: https://arxiv.org/abs/2204.05862
References
Transformer Models: An Introduction and Catalog
WebGPT: Browser-assisted question-answering with human feedback
Training language models to follow instructions with human feedback
https://mp.weixin.qq.com/s/b0AI01-pUnXVWPPXix-hew
https://openai.com/blog/chatgpt/
https://mp.weixin.qq.com/s/eYmssaPFODjC7xwh1jHydQ
https://mp.weixin.qq.com/s/mXViN_GB9VC1WrXP1Q1iug
https://mp.weixin.qq.com/s/y9Jy9AyAyTCgCOKyMgTo3w
https://zhuanlan.zhihu.com/p/595891945
https://www.hpc-ai.tech/blog/colossal-ai-chatgpt
https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756
https://arxiv.org/pdf/1706.03762.pdf
https://arxiv.org/pdf/2005.14165.pdf
https://arxiv.org/pdf/1810.04805.pdf