In-depth understanding of deep learning - GPT (Generative Pre-Trained Transformer): GPT-3 and Few-shot Learning

Category: General Catalog of "In-depth Understanding of Deep Learning"
Related Articles:
GPT (Generative Pre-Trained Transformer): Basic Knowledge
GPT (Generative Pre-Trained Transformer): Using GPT in Different Tasks
GPT (Generative Pre-Trained Transformer): GPT-2 and Zero-shot Learning
GPT (Generative Pre-Trained Transformer): GPT-3 and Few-shot Learning


GPT-3 was once the largest, most impressive, and most controversial pre-trained language model. The paper introducing GPT-3 runs to 72 pages, covering model design ideas, theoretical derivation, experimental results, and experimental design. The model itself is enormous, with 175 billion parameters; even if it were open-sourced, its size and computing requirements would make it impractical to deploy as a pre-trained language model for personal use.
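To get a rough sense of that scale, the back-of-the-envelope sketch below estimates the memory needed just to hold 175 billion parameters, assuming each weight is stored as a 16-bit float; activations, optimizer state, and serving overhead would add considerably more.

```python
# Rough back-of-the-envelope estimate of GPT-3's weight memory footprint.
# Assumption: each of the 175 billion parameters is stored as a 16-bit float.
n_params = 175e9          # 175 billion parameters
bytes_per_param = 2       # fp16 / bf16: 2 bytes per parameter
weight_bytes = n_params * bytes_per_param

print(f"Weights alone: {weight_bytes / 1e9:.0f} GB")   # ~350 GB
# A single consumer GPU offers at most tens of GB of memory, so even pure
# inference needs a multi-GPU server before counting activations or overhead.
```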

Compared with the surprising performance of GPT-2 under the Zero-shot Learning setting, introduced in the article "In-depth Understanding of Deep Learning - GPT (Generative Pre-Trained Transformer): GPT-2 and Zero-shot Learning", the performance of GPT-3 under the Few-shot Learning setting is enough to shock everyone. In evaluations on downstream natural language processing tasks, GPT-2 under the Zero-shot Learning setting falls far short of the SOTA models, whereas GPT-3 under the Few-shot Learning setting matches, and sometimes even surpasses, the SOTA models of the time. The figure below shows an example of machine translation using GPT-3 with a small number of samples. The right side of the figure shows the usual fine-tuning process: a model pre-trained on a large corpus has its gradients iteratively updated on task-specific data, and only after training converges does it translate well. The left side of the figure shows how GPT-3 learns under the N-shot Learning settings. Under the Zero-shot Learning setting, GPT-3 can translate given only a task description; under the One-shot Learning setting, a single translation example is provided in addition to the task description; under the Few-shot Learning setting, more examples are provided in addition to the task description (still a small number, far fewer than the training data required for fine-tuning, yet GPT-3 produces better translations). In general, the more examples given, the better GPT-3 performs on a given task. Moreover, to reach the same performance on the same task, GPT-3 needs far less task-specific data than a fine-tuned SOTA model.
Example of machine translation using GPT-3 with a small number of samples
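The essential difference between the three N-shot settings is only how many solved examples are placed in the prompt ahead of the new input; the model weights are never updated. The sketch below illustrates how such prompts might be assembled for English-to-French translation. The task description and example pairs are illustrative stand-ins loosely modeled on the paper's figure, not the exact prompts used in the evaluation.

```python
# A minimal sketch of how zero-, one-, and few-shot prompts differ:
# the model is frozen; only the prompt text changes.

task_description = "Translate English to French:"   # task description
examples = [                                         # solved demonstrations
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe peluche"),
]
query = "cheese"                                     # the input to translate

def build_prompt(n_shot: int) -> str:
    """Assemble an in-context learning prompt with n_shot demonstrations."""
    lines = [task_description]
    for en, fr in examples[:n_shot]:
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")   # the model is asked to continue this line
    return "\n".join(lines)

print(build_prompt(0))   # Zero-shot: task description only
print(build_prompt(1))   # One-shot: one demonstration
print(build_prompt(3))   # Few-shot: several demonstrations
```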
GPT-3 performs excellently on many natural language processing datasets, including common tasks such as question answering and text completion. Its text generation ability is good enough for its output to pass as human-written. A few examples follow.

Using a New Word in a Sentence

Given a new word and its meaning, make a sentence using that word. This is a task primary school students often encounter when learning vocabulary: by making a sentence, we can judge whether the student has grasped the word's true meaning. The paper Language Models are Few-Shot Learners gives the following example:


Input given to GPT-3: A “Burringo” is a car with very fast acceleration. An example of a sentence that uses the word Burringo is:
Sentence produced by GPT-3: In our garage we have a Burringo that my father drives to work every day.

Although the sentence fails to reflect the characteristic of fast acceleration, GPT-3 accurately grasps the most important feature of the word "Burringo": that it refers to a car.

Next, let's make up a new word that is a verb and see whether GPT-3 can grasp its meaning equally well:

Input given to GPT-3: To “screeg” something is to swing a sword at it. An example of a sentence that uses the word screeg is:
Sentence produced by GPT-3: We screeghed at each other for several minutes and then we went outside and ate ice cream.

GPT-3 also grasps the verb's meaning well, and the resulting sentence reads smoothly.
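Both examples above follow the same prompt format: the prompt states the made-up word's definition and ends with "An example of a sentence that uses the word ... is:", and GPT-3's continuation is taken as the generated sentence. Below is a minimal sketch of that template, using the two definitions shown above; how the completion is actually obtained is left out, since it depends on the model interface.

```python
# Prompt template for the "use a new word in a sentence" task.
# The model is expected to continue the text after the final colon.

TEMPLATE = '{definition} An example of a sentence that uses the word {word} is:'

def make_new_word_prompt(word: str, definition: str) -> str:
    """Build the novel-word prompt shown in the examples above."""
    return TEMPLATE.format(word=word, definition=definition)

print(make_new_word_prompt(
    "Burringo", 'A "Burringo" is a car with very fast acceleration.'))
print(make_new_word_prompt(
    "screeg", 'To "screeg" something is to swing a sword at it.'))
```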

Grammar Error Correction

GPT-3 can not only generate text from scratch but also correct existing sentences. Given a sentence with a grammatical error, GPT-3 is asked to fix it:

Poor English input: The patient was died.
Good English output: The patient died.

The verb in the original sentence is used incorrectly: there is an extra was. GPT-3 deletes the word and turns it into a grammatically correct, semantically smooth sentence. Let's look at a more difficult example:

Poor English input: Today I have went to the store to buys some many bottle of water.
Good English output: Today I went to the store to buy some bottles of water.

The original sentence here contains two types of errors: the first is tense errors, which GPT-3 fixes by deleting have; the second is word-form and number errors, which it fixes by changing buys to buy and some many bottle of water to some bottles of water. GPT-3's error correction is comparable to rule-based automatic error-correction systems, and it may even perform better on complex sentences with muddled tenses.
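The "Poor English input / Good English output" lines above are themselves the prompt format: previously corrected pairs act as demonstrations, and the model is asked to continue after a fresh "Good English output:" line. Below is a small sketch of how such a few-shot prompt could be assembled; the demonstration pairs are the ones shown above, while the final input sentence is a made-up placeholder, not an example from the paper.

```python
# Few-shot prompt for grammar correction, following the
# "Poor English input / Good English output" pattern shown above.

demonstrations = [
    ("The patient was died.", "The patient died."),
    ("Today I have went to the store to buys some many bottle of water.",
     "Today I went to the store to buy some bottles of water."),
]
# Hypothetical new sentence to correct (not from the paper).
new_input = "She no went to the market yesterday."

prompt_parts = []
for bad, good in demonstrations:
    prompt_parts.append(f"Poor English input: {bad}")
    prompt_parts.append(f"Good English output: {good}")
prompt_parts.append(f"Poor English input: {new_input}")
prompt_parts.append("Good English output:")   # the model completes this line

prompt = "\n".join(prompt_parts)
print(prompt)
```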

In addition to the examples in the original paper, after OpenAI opened limited access to the GPT-3 API, users tried other interesting tasks, such as having GPT-3 write code, design web page UIs, play chess, and even generate financial statements. GPT-3's remarkable performance across these tasks went far beyond what people had imagined. With its huge model and high training cost, GPT-3 could be described as the ceiling of generative pre-trained language models at the time.

The GPT-3 Controversy

While the huge and popular GPT-3 has won a great deal of praise, it has also been questioned by many scholars, who have analyzed its defects rationally. The following summarizes some widely accepted criticisms so that readers can understand GPT-3 more comprehensively.

  • GPT-3 does not have real logical reasoning ability: in a question-answering task, if GPT-3 receives the question "How many eyes does the sun have?", it will answer "The sun has one eye." In other words, GPT-3 does not judge whether a question is meaningful; its answers come from large-scale corpus training rather than logical derivation, and it cannot give answers beyond the scope of the training corpus.
  • GPT-3 risks generating harmful content: because its training corpus comes from the Internet, text containing racial or gender discrimination cannot be filtered out completely, so there is some probability that the text GPT-3 generates expresses discrimination and prejudice. It can even make mistakes in moral judgments and on questions of professional law.
  • GPT-3 performs poorly on questions requiring procedural knowledge: it does badly at question answering in STEM subjects (Science, Technology, Engineering, Mathematics), because GPT-3 is more inclined to acquire and memorize declarative knowledge than to understand it. Julian Togelius, an associate professor at New York University and a game AI researcher, commented that GPT-3 is like a clever student who has not done the reading, trying to bluff through an exam: it strings together known facts and falsehoods into what looks like a smooth narrative.

While the credibility of GPT-3's output has been questioned, its enormous parameter count and high training cost also prevent it from being widely used. Even so, GPT-3 was for a time the largest and best pre-trained language model, and its real significance lies in lifting a corner of the veil on general artificial intelligence. Geoffrey Hinton, the father of deep learning, commented: extrapolating from the spectacular performance of GPT-3, life and everything in the world may be just 4.398 trillion parameters. As deep learning develops, if a model structure emerges that replaces the Transformer, or if the parameter scale is expanded another 1,000 times, there may really be a general artificial intelligence model capable of learning logical reasoning and thought.

