GPT-3 overview
Key facts about GPT-3:
- Model classification: GPT-3 comes in 8 different sizes, with parameter counts ranging from 125 million to 175 billion.
- Model size: The largest GPT-3 model has 175 billion parameters. This is roughly 470 times larger than the largest BERT model (375 million parameters).
- Architecture: GPT-3 is an autoregressive model with a decoder-only architecture, trained with a next-word-prediction objective (see the sketch after this list).
- Learning method: GPT-3 relies on in-context (few-shot) learning: no gradient updates are performed when it adapts to a new task.
- Training data requirements: GPT-3 needs very little task-specific training data. It can learn from a handful of examples, which allows it to be applied to domains with scarce data.
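As a rough illustration of the next-word-prediction objective, here is a minimal PyTorch sketch. A toy embedding and linear head stand in for the full decoder stack; all sizes are illustrative, and none of this is GPT-3's actual code.

```python
# Minimal sketch of the next-word-prediction objective (toy model, not GPT-3):
# given tokens [t0 .. tN-1], the model is trained to predict [t1 .. tN].
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32                    # toy sizes
embed = nn.Embedding(vocab_size, d_model)        # stands in for the decoder stack
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))   # one sequence of 16 token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets are inputs shifted by one

hidden = embed(inputs)                           # a real model runs decoder layers here
logits = lm_head(hidden)                         # (batch, seq_len - 1, vocab_size)

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
print(loss.item())                               # the training signal to minimize
```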
Key assumptions:
- Increased model size and training on larger data can lead to improved performance
- A single model can provide good performance on many NLP tasks.
- The model can perform inference on new data without fine-tuning
- The model can solve problems on datasets it has never been trained on.
Earlier pre-trained models relied on fine-tuning:
- GPT-3 takes a different learning approach: large amounts of labeled data are not required for it to handle new questions.
- Instead, it can learn with no task-specific examples (Zero-Shot Learning), from a single example (One-Shot Learning), or from a few examples (Few-Shot Learning), as the prompt sketch below illustrates.
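These three settings are easiest to picture as prompt formats. The sketch below uses the English-to-French translation illustration from the GPT-3 paper; the exact strings are only illustrative.

```python
# The three prompting settings from the GPT-3 paper, shown as plain strings.
# The English-to-French pairs mirror the paper's illustration.

zero_shot = "Translate English to French:\ncheese =>"   # no examples, task description only

one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"                       # exactly one worked example
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"                       # a handful of worked examples
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)
```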
Comparison with BERT:
The three most notable differences:
- Size: Scale is GPT-3's most striking feature; it is almost 470 times larger than the largest BERT model.
- Architecture: In terms of architecture, BERT arguably still leads: it is trained bidirectionally, which lets it better capture latent relationships between texts in different question contexts. GPT-3, by contrast, is probability-based and autoregressive, producing its output one token at a time (see the mask sketch after this list).
- Method: GPT-3's learning method is relatively simple and can be applied to many problems that lack sufficient data. GPT-3 should therefore have wider applicability than BERT.
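The architectural difference can be made concrete with attention masks: BERT's encoder lets every token attend to every other token, while GPT-3's decoder applies a causal mask so each position sees only earlier positions. A minimal sketch with a toy sequence length:

```python
# Bidirectional (BERT-style) vs. causal (GPT-style) attention masks.
# 1 = "may attend", 0 = "masked out". Sequence length is a toy value.
import numpy as np

seq_len = 5

bidirectional_mask = np.ones((seq_len, seq_len))      # every token sees every token
causal_mask = np.tril(np.ones((seq_len, seq_len)))    # token i sees tokens 0..i only

print(causal_mask)
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]
```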
Two breakthrough capabilities:
- Text generation (a usage sketch follows this list)
- Building NLP solutions with limited data
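As a concrete example of the text-generation capability, here is a hedged sketch using the OpenAI Python SDK of the GPT-3 era (the `openai` package's Completion endpoint). The API key, prompt, and sampling parameters are placeholders, and newer SDK versions expose a different interface.

```python
# Sketch of generating text with GPT-3 via the GPT-3-era OpenAI SDK.
# API key and prompt are placeholders; interface details vary by SDK version.
import openai

openai.api_key = "YOUR_API_KEY"                  # placeholder, not a real key

response = openai.Completion.create(
    engine="davinci",                            # the largest GPT-3 engine at launch
    prompt="Write a short product description for a solar-powered lamp:",
    max_tokens=64,                               # cap on generated tokens
    temperature=0.7,                             # sampling randomness
)
print(response.choices[0].text)
```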
Performance on each task:
- Language modeling: GPT-3 beats all benchmarks on pure language-modeling tasks.
- Machine translation: The model outperforms benchmarks on translation tasks into English. But when translating from English into a non-English language, the situation is different, and GPT-3's performance becomes problematic.
- Reading comprehension: GPT-3's performance is far below the state of the art here.
- Natural language inference: Natural Language Inference (NLI) focuses on the ability to understand the relationship between two sentences. GPT-3 performs poorly on NLI tasks.
- Commonsense reasoning: Commonsense-reasoning datasets test physical or scientific reasoning skills. GPT-3 performs poorly on these tasks.
Problems with GPT-3:
- GPT-3 is a general-purpose model and may lose out in performance to custom pre-trained, fine-tuned models.
- Concerns about model bias and interpretability: given GPT-3's sheer size, it will be difficult for companies to explain the decisions made by the algorithm.
- Regulation is needed to prevent abuse: if not properly regulated, such a powerful text generator could be misused.
Graphical detailed understanding
- Predicts the next word directly, rather than predicting masked words from their context as BERT does
- Generates one token at a time, iteratively
- 175 billion parameters
GPT-3's context window is 2048 tokens. This means it has 2048 tracks along which tokens are processed (a generation-loop sketch follows below).
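A minimal sketch of that iterative, one-token-at-a-time loop inside a fixed 2048-token window; `model_next_token` is a hypothetical stand-in for the network, not a real API:

```python
# Iterative generation within a fixed context window (toy sketch).
CONTEXT_WINDOW = 2048

def model_next_token(context):
    """Hypothetical placeholder: a real model returns the next token id."""
    return 0

def generate(prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        window = tokens[-CONTEXT_WINDOW:]    # only the last 2048 tokens are visible
        tokens.append(model_next_token(window))
    return tokens

print(generate([1, 2, 3], 5))                # e.g. [1, 2, 3, 0, 0, 0, 0, 0]
```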
Specifically, how the processing works:
- Let's follow the purple track. How does the system process the word "robotics" and produce an "A"?
Steps:
- Convert the word to a vector (a list of numbers) representing that word
- Compute a prediction
- Convert the resulting vector back to a word
- The important computations of GPT3 happen inside its stack of 96 Transformer-decoder layers. Each of these layers has its own 1.8B parameters to compute. That's where the "magic" happens. Here's a high-level view of the process:
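A toy sketch of those three steps for a single position (shapes and the `decoder_layer` placeholder are illustrative; GPT-3's real stack is 96 layers with a 12288-dimensional hidden state):

```python
# Toy sketch: word -> vector, 96 decoder layers, vector -> word.
import numpy as np

vocab_size, d_model, n_layers = 100, 64, 96      # toy vocab/width; real layer count

rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))

def decoder_layer(x):
    """Hypothetical placeholder for one Transformer-decoder layer (attention + MLP)."""
    return x

token_id = 42
x = embedding[token_id]                          # 1) convert the word to a vector
for _ in range(n_layers):                        # 2) compute through the layer stack
    x = decoder_layer(x)
scores = embedding @ x                           # 3) score every word in the vocabulary
next_token = int(np.argmax(scores))              #    pick the most likely next word
```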
Intensive reading of the paper
Three core settings: Fine-Tuning, Few-Shot, One-Shot