GPT-3 study notes

GPT-3 overview

Key facts about GPT-3:

  • Model sizes: GPT-3 comes in 8 different model sizes, with parameter counts ranging from 125 million to 175 billion.

  • Model size: The largest GPT-3 model has 175 billion parameters, roughly 470 times larger than the largest BERT model (375 million parameters).

  • Architecture: GPT-3 is an autoregressive model using a decoder-only Transformer architecture, trained with a next-word-prediction objective (see the sketch after this list).

  • Learning method: GPT-3 learns new tasks from very few examples, and no gradient updates are performed at inference time.

  • Training data: GPT-3 needs far less task-specific training data. It can learn from very few examples, which makes it applicable to domains where data is scarce.
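
To make the next-word-prediction objective concrete, here is a minimal PyTorch sketch of the autoregressive training loss. The tiny embedding-plus-linear model and all sizes are invented placeholders, not the real 96-layer GPT-3 stack.

```python
# Minimal sketch of the autoregressive (next-word) objective:
# at every position t the model is trained to predict token t+1.
# ToyLM is a stand-in, NOT the real GPT-3 architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 100, 32, 16  # toy sizes

class ToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)  # GPT-3 has 96 decoder layers in between

    def forward(self, tokens):                # tokens: (batch, seq_len)
        return self.head(self.embed(tokens))  # logits: (batch, seq_len, vocab_size)

model = ToyLM()
tokens = torch.randint(0, vocab_size, (2, seq_len))

logits = model(tokens[:, :-1])               # predictions made from the left context
targets = tokens[:, 1:]                      # the actual "next word" at each position
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                              # gradients are used only during pre-training,
                                             # not when the model is later prompted
print(float(loss))
```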
Key assumptions:

  • Increased model size and training on larger data can lead to improved performance
  • A single model can provide good performance on many NLP tasks.
  • The model can run inference on new data without fine-tuning
  • The model can solve problems on datasets it has never been trained on.

Earlier pre-trained models rely on fine-tuning:

  • GPT-3 takes a different learning approach: large amounts of labeled data are not required to handle new problems.
  • Instead, it can learn from no task-specific examples at all (Zero-Shot Learning), from a single example (One-Shot Learning), or from a few examples (Few-Shot Learning); see the sketch below.
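
To make the three settings concrete, here is a sketch of how only the prompt changes between them. The translation task and example strings are illustrative (in the style of the GPT-3 paper's figures); no gradient updates are involved in any case.

```python
# Hypothetical prompts for an English -> French translation task.
# Only the number of in-context examples changes; the model weights do not.

task = "Translate English to French:"

zero_shot = f"""{task}
cheese =>"""                      # no worked examples at all

one_shot = f"""{task}
sea otter => loutre de mer
cheese =>"""                      # exactly one worked example

few_shot = f"""{task}
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe peluche
cheese =>"""                      # a handful of worked examples

print(few_shot)
```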

Comparison with BERT:
The three most notable differences:

  1. Size: GPT-3's scale is its most outstanding feature; it is almost 470 times larger than the largest BERT model.
  2. Structure: Architecturally, BERT still has an edge for some tasks: its bidirectional training is designed to better capture latent relationships between texts across different task contexts, whereas GPT-3 is probability-based and autoregressive, producing output one token at a time (see the sketch after this list).
  3. Method: GPT-3's learning method is comparatively simple and can be applied to many problems for which sufficient data is not available. GPT-3 should therefore have broader applications than BERT.
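
A small sketch of the structural difference in point 2: BERT-style training fills in a masked word using context on both sides, while GPT-3-style training predicts the next word from the left context only. The sentence and target word are made up for illustration.

```python
# Illustrative contrast between the two training objectives (toy example).

# BERT-style masked language modeling: hide a word and predict it
# using the context on BOTH sides of the blank.
bert_input, bert_target = "the robot picked up the [MASK] ball", "red"

# GPT-3-style autoregressive modeling: given only the LEFT context,
# predict the next word, one position at a time.
gpt_input, gpt_target = "the robot picked up the", "red"

print(bert_input, "->", bert_target)
print(gpt_input, "->", gpt_target)
```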

Two breakthrough capabilities:

  • Text generation
  • Build NLP solutions with limited data

Performance on individual tasks:

  • Language Modeling: GPT-3 beats all benchmarks on pure language modeling tasks.

  • Machine Translation: The model outperforms benchmarks on translation tasks where documents are translated into English. For translation from English into other languages, however, the situation is different and GPT-3's performance becomes problematic.

  • Reading Comprehension: GPT-3's performance is far below the state of the art here.

  • Natural Language Inference: NLI focuses on the ability to understand the relationship between two sentences. The GPT-3 model performs poorly on NLI tasks.

  • Commonsense Reasoning: Commonsense reasoning datasets test physical or scientific reasoning skills. GPT-3 models perform poorly on these tasks.

Problems with GPT-3

  • GPT-3 is a general-purpose model and may lose out to custom pre-trained and fine-tuned models on specific tasks
  • Concerns about model bias and interpretability: given GPT-3's sheer size, it will be difficult for companies to explain the decisions made by the algorithm
  • Regulation is needed to prevent abuse: if not properly regulated, such a powerful text generator could easily be misused

Detailed illustrated walkthrough

Zhihu Illustrated Articles

  • Predicts the next word directly, rather than predicting a masked word from its surrounding context (as BERT does)
  • Generates one token at a time, iteratively
  • 175 billion parameters

GPT-3 has a context window of 2048 tokens. This means it has 2048 tracks along which tokens are processed.
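
A sketch of what the 2048-token context window means during iterative generation: each generated token is appended to the sequence, and only the most recent 2048 positions are fed back in. The `toy_next_token` function is a placeholder invented here, not a real GPT-3 call.

```python
# Token-by-token generation under a fixed 2048-token context window.
# toy_next_token stands in for the actual model; it is NOT the GPT-3 API.
import random

CONTEXT_WINDOW = 2048  # GPT-3's context length in tokens

def toy_next_token(context_tokens):
    """Placeholder for the model: return one predicted token id."""
    return random.randrange(50000)  # GPT-3's BPE vocabulary is ~50k tokens

def generate(prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        window = tokens[-CONTEXT_WINDOW:]       # only the last 2048 tokens are visible
        tokens.append(toy_next_token(window))   # one new token per step
    return tokens

print(generate([101, 102, 103], 5))
```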

How the processing works:

  • Let's follow one of these tracks. How does the system process the word "robotics" and produce an "A"?

Steps:

  • Convert the words into vectors (lists of numbers) that represent them
  • Compute the prediction
  • Convert the resulting vector back into a word
  • The important computations of GPT-3 happen inside its stack of 96 Transformer decoder layers, each with its own ~1.8B parameters. That's where the "magic" happens. Here's a high-level view of the process, sketched below:
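
Below is a shape-level sketch of the three steps (embed the tokens, run them through the decoder stack, un-embed back to word logits), with toy dimensions and only 4 layers instead of GPT-3's 96. PyTorch's generic encoder layer with a causal mask stands in for GPT-3's actual decoder block, and positional embeddings are omitted for brevity.

```python
# Shape-level sketch: tokens -> embeddings -> decoder stack -> next-word logits.
# Toy sizes; real GPT-3 uses d_model = 12288, 96 layers, and a 2048-token context.
import torch
import torch.nn as nn

vocab_size, d_model, n_layers, seq_len = 1000, 64, 4, 8

embed = nn.Embedding(vocab_size, d_model)        # step 1: words -> vectors
layers = nn.ModuleList([                         # step 2: the stack where the "magic" happens
    nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=4 * d_model,
                               batch_first=True)
    for _ in range(n_layers)
])
unembed = nn.Linear(d_model, vocab_size)         # step 3: vectors -> word logits

tokens = torch.randint(0, vocab_size, (1, seq_len))
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

x = embed(tokens)                                # (1, seq_len, d_model)
for layer in layers:
    x = layer(x, src_mask=causal_mask)           # each position attends only to earlier ones
logits = unembed(x)                              # (1, seq_len, vocab_size)
print(logits[0, -1].argmax().item())             # predicted id of the next token
```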

Intensive reading of the paper

Three core settings: Fine-Tuning, Few-Shot, One-Shot


Origin: blog.csdn.net/RandyHan/article/details/131470858