An Easy-to-Understand Guide to How ChatGPT Works

From: No Data Not Smart

Information about ChatGPT is currently scattered, and there is no single article that covers all the key points with a systematic overview, so the author put together this summary.

  • Overview of the training process

  • Clarify the evolutionary path

  • Pre-training

    • GPT-3 Overview

    • The ideas behind the GPT-3 model

    • How GPT-3 learns

    • Dataset

  • Instruction Fine-Tuning (IFT)

  • Supervised Fine-tuning (SFT)

  • Reinforcement Learning from Human Feedback (RLHF)

  • Other methods

    • Chain-of-thought (CoT)

  • Work similar to ChatGPT

  • References


Overview of the training process

OpenAI uses a large language model (LM) with 175B parameters and a reward model (RM) with 6B parameters. In addition to pre-training, the training process is divided into three steps:

  1. Collect datasets for a variety of NLP tasks, add task descriptions and prompts to assemble new datasets, and use these data to fine-tune the pre-trained large language model. This covers both instruction fine-tuning and supervised fine-tuning.

  2. Sample from the above dataset, generate multiple responses using a large language model, manually rank these responses, and train a reward model (RM) to fit human preferences.

  3. Based on the supervised fine-tuning model in the first stage and the reward model in the second stage, a large language model is further trained using a reinforcement learning algorithm.


Clarify the evolutionary path

GPT-3.5 still has 175B parameters. The overall evolutionary tree of the model family is shown below:

[Figure: evolutionary tree of the GPT model family]

Pre-training

GPT-3 Overview

  • GPT-3 is an autoregressive, decoder-only model, and its training objective is still next-word prediction (unlike BERT, there is no next-sentence prediction task).

  • The largest GPT-3 model has 175B parameters, roughly 500 times more than BERT-large (0.34B).

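To make the next-word-prediction objective concrete, here is a minimal PyTorch sketch. The "model" below is just an embedding plus a linear head standing in for a real decoder-only Transformer; it only illustrates how the targets are the inputs shifted by one position, not GPT-3's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for an autoregressive LM: an embedding plus a linear head.
# A real GPT replaces this with a deep decoder-only Transformer.
vocab_size, hidden = 100, 32
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 10))   # (batch, sequence_length)
logits = lm_head(embed(tokens))                  # (2, 10, vocab_size)

# Next-token prediction: the logits at position t are scored against token t+1,
# i.e. the targets are simply the inputs shifted one step to the left.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```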

The ideas behind the GPT-3 model

  • No task-specific model structure needs to be added: when BERT is used for NER, for example, an LSTM+CRF head is usually attached on top

  • No fine-tuning required

  • One model solves multiple NLP tasks

  • NLP tasks can be solved with a generative model

  • Like humans, it only needs to see a very small number of examples to learn

How GPT-3 learns

  • Zero-shot learning: provide a task description and a prompt

  • One-shot learning: provide a task description, one example, and a prompt

  • Few-shot learning: provide a task description, a few examples, and a prompt

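The three settings differ only in how many worked examples are placed in the prompt; the model weights are never updated. A minimal sketch of how such prompts can be assembled (the English-to-French example follows the style of the GPT-3 paper; actually querying the model is out of scope here):

```python
# Build zero-shot / one-shot / few-shot prompts for the same task.
# The prompt is plain text; the model is expected to continue it.
task_description = "Translate English to French."
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
query = "peppermint"

def build_prompt(n_examples: int) -> str:
    """0 examples = zero-shot, 1 = one-shot, more = few-shot."""
    lines = [task_description, ""]
    for english, french in examples[:n_examples]:
        lines.append(f"{english} => {french}")
    lines.append(f"{query} =>")          # the model completes this line
    return "\n".join(lines)

for n in (0, 1, 2):
    print(build_prompt(n), end="\n\n")
```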


Dataset

| Model | Release date | Parameters | Pre-training data |
| --- | --- | --- | --- |
| BERT-large | October 2018 | 340 million | ~3.3B words |
| GPT | June 2018 | 117 million | ~5 GB |
| GPT-2 | February 2019 | 1.5 billion | 40 GB |
| GPT-3 | May 2020 | 175 billion | 45 TB (raw) |
  • BERT-large: BooksCorpus (800M words) and English Wikipedia (2.5B words)

  • GPT: BooksCorpus, roughly 5 GB of text

  • GPT-2: WebText, about 40 GB in total

  • GPT-3: **Common Crawl, WebText2, Books1/Books2, Wikipedia** and other datasets, about 45 TB of raw text before filtering


Instruction Fine-Tuning (IFT)

Collect datasets for a variety of NLP tasks and add task descriptions and prompts to assemble new datasets. The datasets used by ChatGPT are shown below:

[Figure: datasets used for instruction fine-tuning]
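Concretely, instruction fine-tuning rewrites ordinary labelled examples from many tasks into a shared (instruction, input, output) text format, and the model is then fine-tuned on that text with the usual language-modelling loss. A minimal sketch of such templating (the template wording and the examples are invented for illustration and are not OpenAI's exact format):

```python
# Turn plain (input, label) pairs from different NLP tasks into one
# instruction-following text format for supervised fine-tuning.
raw_examples = [
    {"instruction": "Classify the sentiment of the sentence as positive or negative.",
     "input": "The movie was a delight from start to finish.",
     "output": "positive"},
    {"instruction": "Summarize the following paragraph in one sentence.",
     "input": "GPT-3 is a 175B-parameter autoregressive language model trained on ...",
     "output": "GPT-3 is a very large autoregressive language model."},
]

def to_training_text(example: dict) -> str:
    # The prompt is the instruction plus the input; the target is the output.
    return (f"Instruction: {example['instruction']}\n"
            f"Input: {example['input']}\n"
            f"Output: {example['output']}")

for ex in raw_examples:
    print(to_training_text(ex), end="\n\n")
```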

Some related papers:

  • Unnatural Instructions (Honovich et al., '22): https://arxiv.org/abs/2212.09689

  • Super-Natural Instructions (Wang et al., '22): https://arxiv.org/abs/2204.07705

  • Self-Instruct (Wang et al., '22): https://arxiv.org/abs/2212.10560

  • T0 (Sanh et al., '22): https://arxiv.org/abs/2110.08207

  • Natural instructions dataset (Mishra et al., '22): https://arxiv.org/abs/2104.08773

  • FLAN LM (Wei et al, '22): https://arxiv.org/abs/2109.01652

  • OPT-IML (Iyer et al., '22): https://arxiv.org/abs/2212.12017

Supervised Fine-tuning (SFT)

This step is meant to prevent meaningless replies such as "I don't know" when the model encounters sensitive topics: some human-labelled data is added to make replies safer, and a dataset on the order of a few hundred examples is enough.

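One common way to implement this step: each human-written demonstration is tokenized as prompt + response, and the language-modelling loss is computed only on the response tokens. A minimal PyTorch sketch of that masking, with a toy model standing in for the real LM (whether OpenAI masks prompt tokens exactly this way is an assumption here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the language model.
vocab_size, hidden = 100, 32
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size)

# One toy demonstration: the first 6 tokens are the prompt,
# the remaining tokens are the human-written response.
tokens = torch.randint(0, vocab_size, (1, 12))
prompt_len = 6

logits = lm_head(embed(tokens))            # (1, 12, vocab_size)

labels = tokens[:, 1:].clone()             # next-token targets
labels[:, : prompt_len - 1] = -100         # ignore the loss on prompt positions

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels.reshape(-1),
    ignore_index=-100,                     # masked positions contribute nothing
)
print(loss.item())
```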

Some related papers:

  • Google's LaMDA: Appendix A https://arxiv.org/abs/2201.08239

  • DeepMind's Sparrow: Appendix F https://arxiv.org/abs/2209.14375

Reinforcement Learning from Human Feedback (RLHF)

Description:

  • Policy: an LM that takes a prompt and returns a sequence of text (or a probability distribution over text).

  • Action space: all the tokens in the LM's vocabulary (typically on the order of 50k).

  • Observation space: the set of possible input token sequences, which is also very large (vocabulary size ^ number of input tokens).

  • Reward function: a combination of the preference (reward) model and a constraint on policy shift.
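The last bullet can be written as r(x, y) = r_RM(x, y) − β · [log π_RL(y|x) − log π_SFT(y|x)]: the reward model's score for the response, minus a KL-style penalty that keeps the RL policy from drifting too far from the supervised fine-tuned model. A minimal numerical sketch (β and all the log-probabilities are made-up values for illustration):

```python
import torch

# Made-up per-token log-probabilities of one sampled response under the
# RL policy and under the frozen SFT model.
logprobs_rl = torch.tensor([-1.2, -0.7, -2.1, -0.4])
logprobs_sft = torch.tensor([-1.5, -0.9, -1.8, -0.6])

reward_model_score = 0.83   # scalar score from the reward model (made up)
beta = 0.02                 # strength of the policy-shift penalty (made up)

# KL-style penalty: how far the policy has drifted from the SFT model
# on this particular response.
kl_penalty = (logprobs_rl - logprobs_sft).sum()

total_reward = reward_model_score - beta * kl_penalty
print(total_reward.item())
```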

This is a two-step process (a sketch of the reward-model loss follows the list):

  1. Aggregate Q&A data and train a reward model (RM)

  2. Fine-tuning LMs with Reinforcement Learning (RL)
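For step 1, a common formulation (used in InstructGPT) is the pairwise ranking loss −log σ(r(x, y_chosen) − r(x, y_rejected)), which pushes the reward of the human-preferred response above that of the rejected one. A minimal PyTorch sketch, with a linear layer standing in for the real reward model and random features standing in for encoded (prompt, response) pairs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model: maps a pooled (prompt, response) representation
# to a scalar score. In practice this is a full LM with a scalar head.
feature_dim = 16
reward_model = nn.Linear(feature_dim, 1)

# Made-up features for 4 comparisons, each with a preferred and a rejected response.
chosen_features = torch.randn(4, feature_dim)
rejected_features = torch.randn(4, feature_dim)

r_chosen = reward_model(chosen_features).squeeze(-1)
r_rejected = reward_model(rejected_features).squeeze(-1)

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected),
# averaged over the batch of comparisons.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(loss.item())
```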

Open source datasets:

Anthropic/hh-rlhf · Datasets at Hugging Face: https://huggingface.co/datasets/Anthropic/hh-rlhf

OpenAI uses feedback submitted by users.


Other methods

This part briefly introduces some methods that are used alongside the fine-tuning techniques described above.

Chain-of-thought (CoT)

Fine-tune on datasets that contain step-by-step reasoning, as in the example below.

Orange is the task description, pink is the question and answer, and blue is the reasoning process.

[Figure: chain-of-thought fine-tuning example]

Chain-of-thought prompting (Wei et al., '22): https://arxiv.org/abs/2201.11903
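For comparison with ordinary few-shot prompting, here is what a chain-of-thought demonstration looks like: the example answer spells out the intermediate reasoning (the "blue" part of the figure) before the final answer. The arithmetic example is adapted from Wei et al.'s paper:

```python
# A few-shot chain-of-thought prompt: the demonstration includes the reasoning
# steps, not just the final answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""

print(cot_prompt)   # the model is expected to continue with step-by-step reasoning
```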

Work similar to ChatGPT

  • Meta's BlenderBot: https://arxiv.org/abs/2208.03188

  • Google's LaMDA: https://arxiv.org/abs/2201.08239

  • Sparrow by DeepMind: https://arxiv.org/abs/2209.14375

  • Anthropic's Assistant: https://arxiv.org/abs/2204.05862

References

  • TRANSFORMER MODELS: AN INTRODUCTION AND CATALOG

  • WebGPT: Browser-assisted question-answering with human feedback

  • Training language models to follow instructions with human feedback

  • https://mp.weixin.qq.com/s/b0AI01-pUnXVWPPXix-hew

  • https://openai.com/blog/chatgpt/

  • https://mp.weixin.qq.com/s/eYmssaPFODjC7xwh1jHydQ

  • https://mp.weixin.qq.com/s/mXViN_GB9VC1WrXP1Q1iug

  • https://mp.weixin.qq.com/s/y9Jy9AyAyTCgCOKyMgTo3w

  • https://zhuanlan.zhihu.com/p/595891945

  • https://www.hpc-ai.tech/blog/colossal-ai-chatgpt

  • https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756

  • Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf

  • Language Models are Few-Shot Learners (GPT-3): https://arxiv.org/pdf/2005.14165.pdf

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/pdf/1810.04805.pdf




Source: blog.csdn.net/qq_27590277/article/details/130023510