[AI Theory Learning] Language Models: An In-Depth Look at How GPT-2 Computes Masked Self-Attention and How GPT-3 Works


The development from GPT-2 to GPT-3 follows one idea: progressively increase the model's size, performance, and generality while addressing some limitations of the previous versions. The main points of this development are:

  1. Gradually increase the model size:
    • GPT-2 : GPT-2 is the successor to OpenAI's original GPT, a Transformer-based language model. Its largest version has 1.5 billion parameters, making it one of the larger models of its time. Out of concern about misuse, OpenAI initially withheld the full-size model and released it in stages.
    • GPT-3 : Building on GPT-2, OpenAI scaled the model up further. The largest version of GPT-3 has 175 billion parameters, roughly 100 times the size of GPT-2, making it one of the largest language models to date and giving it stronger natural language processing capabilities.
  2. Increased versatility:
    • Wider application : GPT-3 aims to be a general-purpose natural language processing model that can be applied to a variety of tasks, not just text generation. This goal enables GPT-3 to perform well on multiple tasks such as question answering, dialogue, text generation, and translation.
    • Zero-shot and few-shot learning : GPT-3 can perform many tasks without being fine-tuned for them, either from a task description alone (zero-shot) or from a handful of in-context examples (few-shot). This property makes GPT-3 more general and quick to adapt to new tasks.
  3. To address abuse:
    • Enhanced safety : Because of concerns about misuse, especially automated text generation, OpenAI put safety measures in place: access to the model was restricted and applications using the API were reviewed to reduce the generation of inappropriate content.
    • Iterative improvements : OpenAI continued to improve GPT-3's safety after release, based on how the model was actually used, to reduce the risk of abuse.

In short, the path from GPT-2 to GPT-3 is one of continually increasing model size, performance, and generality while paying attention to misuse, so that the technology is deployed ethically and responsibly. This progression reflects the rapid pace of innovation in deep learning for natural language processing.

The previous article introduced GPT and GPT-2 in general terms. This article walks through, in detail, how GPT-2 computes masked self-attention, explains how GPT-3 works, and introduces some common language-model applications.

Graphical Self-Attention

In the previous article, we used the following image to show how Self-Attention is applied in a layer that is processing the word it.
Self-Attention layer
In this section, we look at how this is done in detail. Note that we describe exactly what happens to each individual word, which is why we show so many individual vectors. The actual implementation multiplies large matrices, but here we focus on the word level.

Graphical Self-Attention (without masking)

First, we introduce the original Self-Attention, as it is computed in an Encoder module. As an example, take a simple Transformer that can only process 4 tokens at a time.
Self-Attention is mainly realized through 3 steps:

  1. Create a Query, Key, Value matrix for each path.
  2. For each input token, use its Query vector to score all other Key vectors.
  3. Sum the Value vectors after multiplying each by its corresponding score.
    Self-Attention

1. Create Query, Key and Value vectors

Focus on the first path for now: we will use its Query vector and compare it against all the Key vectors, which produces a score for each Key vector. The first step of Self-Attention is to compute three vectors for each token's path. That is, each input word is multiplied by the weight matrices W^Q, W^K, and W^V to obtain a query vector, a key vector, and a value vector, as shown in the following figure:
calculate the three vectors for each token path

2. Calculate the score

Now that we have these vectors, step 2 uses only the Query and Key vectors. Because we are interested in the first token, we multiply (dot product) its Query vector with the Key vectors of all the tokens, which gives a score for each of the 4 tokens.
Multiply (dot product)

3. Sum

We can now multiply each Value vector by its score and sum them up: Value vectors with high scores make up a large portion of the resulting vector.
sum up
The lower the score, the more transparent the Value vector is drawn. This is to illustrate that multiplying by a small score dilutes that Value vector's contribution.

If we do the same operation for each path, we end up with one vector per token that carries the appropriate contextual information for that token. These vectors are fed to the next sublayer of the Transformer module (the feed-forward neural network).
do same operation for each path
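The three steps just described can be condensed into a few lines of code. Below is a minimal NumPy sketch of unmasked, single-head self-attention; the weight matrices and sizes are random stand-ins for illustration, not values from any real model.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Unmasked single-head self-attention over a sequence of token vectors X."""
    Q = X @ W_q                                      # step 1: a Query vector per token
    K = X @ W_k                                      #         a Key vector per token
    V = X @ W_v                                      #         a Value vector per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # step 2: score every Query against every Key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax each row into attention weights
    return weights @ V                               # step 3: weighted sum of the Value vectors

# toy sizes: 4 tokens, embedding size 8, head size 4 (illustrative, not GPT-2's real sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (4, 4): one context vector per token
```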

Graphical Masked Self-Attention

Now that we have seen the Transformer's Self-Attention step, let's move on to masked self-attention. Masked self-attention is identical to self-attention except in step 2. Suppose the model has only two tokens as input and we are observing (processing) the second token. In this case, the last two tokens are masked, and the model interferes in the scoring step: it always scores future tokens as 0, so the model cannot peek at future words:
Masked Self-Attention
This masking is usually implemented as a matrix called the attention mask. Imagine a sequence of 4 words (for example, "robot must obey orders"). In a language modeling scenario, this sequence is processed in 4 steps, one word per step (assuming for now that each word is a token). Since these models work in batches, we can assume that this simple model has a batch size of 4, so it processes the entire sequence (all 4 steps) as one batch.
process the entire sequence (with its four steps) as one batch
In matrix form, we compute the scores by multiplying the Query matrix by the Key matrix. Let's visualize it as follows, except that instead of the words, each cell of the grid holds the Query (or Key) vector associated with that word:
multiplying a queries matrix by a keys matrix
After the multiplication, we apply our triangular attention mask: it sets the cells we want to mask to negative infinity or a very large negative number (e.g. -1 billion in GPT-2):
mask to -infinity
Applying softmax to each row then yields the actual attention scores that we use for Self-Attention.
Softmax on each row
The meaning of this score table is as follows:

  • When the model processes the first example in the dataset (row 1), which contains only one word (robot), it puts 100% of its attention on that word.
  • When the model processes the second example (row 2), which contains the words (robot must), then while processing the word must it puts 48% of its attention on robot and 52% on must.
  • And so on, continue to process the following words.
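Putting the mask and the row-wise softmax together, here is a small NumPy sketch of the scoring step described above. The raw score values are made up for illustration; -1e9 plays the role of the "very large negative number" mentioned earlier.

```python
import numpy as np

def masked_scores(raw_scores):
    """Apply a causal (lower-triangular) mask to a square score matrix, then softmax each row."""
    n = raw_scores.shape[0]
    allowed = np.tril(np.ones((n, n), dtype=bool))       # True where a token may look
    masked = np.where(allowed, raw_scores, -1e9)         # future positions -> huge negative number
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# made-up raw scores for the 4-token example "robot must obey orders"
raw = np.array([[0.11, 0.00, 0.81, 0.79],
                [0.19, 0.50, 0.30, 0.48],
                [0.53, 0.98, 0.95, 0.14],
                [0.81, 0.86, 0.38, 0.90]])
print(masked_scores(raw).round(2))
# row 1 puts all of its attention on "robot"; row 2 splits it between "robot" and "must"; etc.
```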

GPT-2 Masked Self-Attention

Now, we can understand the masked attention mechanism of GPT-2 in more detail.

Model evaluation: processing one token at a time

We could have GPT-2 operate exactly as masked self-attention does. But during evaluation, when the model adds only one new token after each iteration, it would be inefficient to recompute self-attention along the earlier paths for tokens that have already been processed.

In this case, we process the first token (ignoring <s> for now).
process the first token
GPT-2 keeps the Key and Value vectors of the token a. Every self-attention layer holds on to its own Key and Value vectors for that token:
Every self-attention layer holds on to its respective key and value vectors for that token
Now, in the next iteration, when the model processes the word robot, it does not need to regenerate the Query, Key, and Value vectors for the token a; it simply reuses the vectors saved from the first iteration:
reuses the ones it saved from the first iteration
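This reuse of saved Key/Value vectors is what is commonly called a KV cache. The sketch below illustrates the idea only; the class and function names are hypothetical and do not mirror GPT-2's actual implementation.

```python
import numpy as np

class KVCache:
    """Per-layer store for the Key/Value vectors of tokens that were already processed."""
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def stacked(self):
        return np.stack(self.keys), np.stack(self.values)

def process_next_token(x, W_q, W_k, W_v, cache):
    """Handle ONE new token: compute its q/k/v, reuse the cached k/v of all earlier tokens."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    cache.add(k, v)                                # saved so later iterations never recompute them
    K, V = cache.stacked()
    scores = q @ K.T / np.sqrt(K.shape[-1])        # score the new Query against every stored Key
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                   # context vector for the new token

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
cache = KVCache()
for token_vec in rng.normal(size=(3, 8)):          # e.g. "a", "robot", ... one token per iteration
    context = process_next_token(token_vec, W_q, W_k, W_v, cache)
```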

GPT-2 Self-attention

1. Create Query, Key and Value matrix

Let's assume the model is processing the word it. For the bottom module, the input for this token is the embedding of it plus the positional encoding for position 9:
its input for that token would be the embedding of it + the positional encoding for slot #9
Each module in the Transformer has its own weights (they are broken down later in this article). The first weight matrix we encounter is the one used to create the Query, Key, and Value vectors.
The first we encounter is the weight matrix that we use to create the queries, keys, and values
Self-Attention multiplies its input by this weight matrix (and adds a bias vector, not drawn here); the multiplication results in a vector that is basically a concatenation of the Query, Key, and Value vectors.
basically a concatenation of the query
Multiplying the input vector by the attention weight matrix (and adding a bias vector) yields the Key, Value, and Query vectors for this token; splitting them across attention heads comes next.
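In GPT-2's released code this is implemented as a single fused matrix multiply: one weight matrix (named c_attn there) of shape 768 × (3 × 768) produces the concatenated query/key/value vector, which is then split into three. A rough sketch with random stand-in weights:

```python
import numpy as np

d_model = 768                                     # GPT-2 small hidden size
rng = np.random.default_rng(0)

x = rng.normal(size=(d_model,))                   # input vector for the token "it"
W_attn = rng.normal(size=(d_model, 3 * d_model))  # random stand-in for the fused Q/K/V weights
b_attn = np.zeros(3 * d_model)

qkv = x @ W_attn + b_attn                         # one multiply -> concatenated [query | key | value]
q, k, v = np.split(qkv, 3)                        # slice it into three 768-long vectors
print(q.shape, k.shape, v.shape)                  # (768,) (768,) (768,)
```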

1.5 split into attention heads

In the previous example, we focused on self-attention and ignored the multi-head part. It is useful to explain that concept now: self-attention is computed multiple times, on different parts of the Q, K, and V vectors. "Splitting" the attention heads simply reshapes the long vector into a matrix. The small GPT-2 has 12 attention heads, so the number of heads becomes the first dimension of the reshaped matrix:
The small GPT2 has 12 attention heads, so that would be the first dimension of the reshaped matrix
In the previous example we looked at what happens inside one attention head. One way to think about multiple attention heads is like this (visualizing only 3 of the 12 heads):
visualize three of the twelve attention heads
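Concretely, "splitting into heads" is just a reshape: the 768-long vector becomes a 12 × 64 matrix, one 64-dimensional slice per head. A minimal sketch:

```python
import numpy as np

n_heads, d_model = 12, 768            # GPT-2 small
d_head = d_model // n_heads           # 64 dimensions per head

q = np.arange(d_model, dtype=float)   # stand-in for a 768-long query vector
q_heads = q.reshape(n_heads, d_head)  # (12, 64): the head index becomes the first dimension
print(q_heads.shape)

# For a full sequence of shape (seq_len, 768), the same reshape plus a transpose gives
# an array of shape (n_heads, seq_len, d_head), so every head can attend independently.
```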

2. Scoring

Now we can continue with scoring. Here we focus on only one attention head (all other heads perform a similar operation).
only looking at one attention head
Now this token's Query can be scored against the Key vectors of all the other tokens (these Key vectors were computed by the first attention head in previous iterations):
Now the token can get scored against all of keys of the other tokens

3. Sum

As we saw before, we now multiply each Value vector by the corresponding score, and then sum them up to get the Self-Attention result of the first attention head :
Sum

3.5 Merge attention heads

The way we deal with the different attention heads is to first concatenate their outputs into a single vector:
Merge attention heads
However, this vector is not yet ready to be sent to the next sublayer. We first need to turn this patchwork of hidden states into a homogeneous representation.

4. Mapping (projection)

We let the model learn how to map the concatenated self-attention results into a representation that the feed-forward neural network can work with. Here, a second large weight matrix projects the results of the attention heads into the output vector of the self-attention sublayer:
projecting
Through this projection, we obtain a vector that we can pass on to the next layer:
send along to the next layer
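Merging the heads (step 3.5) and the projection (step 4) amount to a reshape followed by one more matrix multiply (the projection weight is named c_proj in OpenAI's GPT-2 code). A sketch with random stand-in weights:

```python
import numpy as np

n_heads, d_head = 12, 64
d_model = n_heads * d_head                          # 768
rng = np.random.default_rng(0)

head_outputs = rng.normal(size=(n_heads, d_head))   # step-3 result of each attention head
merged = head_outputs.reshape(d_model)              # 3.5: concatenate back into one 768-long vector

W_proj = rng.normal(size=(d_model, d_model))        # random stand-in for the output projection
b_proj = np.zeros(d_model)
attn_output = merged @ W_proj + b_proj              # 4: project to the self-attention sublayer output
print(attn_output.shape)                            # (768,)
```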

GPT-2 fully connected neural network

The first layer: four times the size of the model

The fully connected neural network processes the output of the self-attention layer, whose representation already contains the appropriate context. It consists of two layers. The first layer is four times the size of the model (since GPT-2 small uses a hidden size of 768, this layer has 768 × 4 = 3072 units). Why four times? This simply mirrors the original Transformer (whose model dimension is 512 and whose first feed-forward layer has dimension 2048), and it appears to give the Transformer enough capacity for the tasks it handles.
GPT-2 Neural Network Layer 1

The second layer: map the vector to the dimension of the model

The second layer projects the result of the first layer back to the model dimension (768 for GPT-2 small). The result of this multiplication is the Transformer module's output for this token.
Projecting to model dimension
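A sketch of this two-layer feed-forward network with GPT-2 small's dimensions (768 → 3072 → 768). GPT-2 uses the GELU activation; the tanh approximation below is the one used in the original code. The weights here are random stand-ins:

```python
import numpy as np

d_model, d_ff = 768, 4 * 768          # 3072 = four times the model dimension

def gelu(x):
    # tanh approximation of GELU, the activation used by GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    h = gelu(x @ W1 + b1)             # first layer: project up to 3072 units
    return h @ W2 + b2                # second layer: project back down to 768

rng = np.random.default_rng(0)
x = rng.normal(size=(d_model,))
W1, b1 = 0.02 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (768,)
```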
That is the most detailed walk-through of the Transformer we will cover! You now know what happens inside a Transformer language model. To summarize, our input runs into the following weight matrices:
weight matrices
Each module has its own weights. The model, on the other hand, has only one token embedding matrix and one positional encoding matrix:
one token embedding matrix and one positional encoding matrix
All of the model's parameters are shown below:
All parameters of the model

GPT-3

GPT-3 continues GPT's unidirectional language-model training approach, but introduces the sparse attention module from the Sparse Transformer, and increases the model size to 175 billion parameters, drawing on roughly 45 TB of raw text data for training. At the same time, GPT-3 focuses on being a more general NLP model, achieving SOTA results on a series of benchmarks and domain-specific natural language processing tasks.

The difference between sparse attention and traditional self-attention (called dense attention) is:

  • dense attention: attention is computed between every pair of tokens; complexity O(n²).
  • sparse attention: each token computes attention only with a subset of the other tokens, which lowers the complexity to sub-quadratic (for example, O(n·√n) for the Sparse Transformer's factorized patterns). A sketch of one such pattern follows this list.
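As a rough illustration of how such a pattern might look, the sketch below builds a causal mask that combines a local window with a strided set of positions, loosely in the spirit of the Sparse Transformer's factorized attention; it is not the exact pattern GPT-3 uses.

```python
import numpy as np

def sparse_causal_mask(n, window=4, stride=4):
    """Each token may attend to its recent past (local window) plus every `stride`-th earlier token."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):                        # causal: only current and past positions
            if i - j < window or j % stride == 0:     # local window OR strided positions
                mask[i, j] = True
    return mask

print(sparse_causal_mask(8, window=2, stride=4).astype(int))
# Each row has far fewer 1s than a dense causal mask, which is where the savings come from.
```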

Compared with GPT-2, GPT-3's image-generation capability is more mature: it can complete full images from incomplete image samples without fine-tuning. GPT-3 thus achieves two shifts:
1) from language to image generation;
2) solving problems with less in-domain data, even without any fine-tuning step.

1. General process of pre-training model

The general workflow of a pre-trained model is shown in the figure below; fine-tuning is an important part of it.
General process of pre-training model

2. The difference between GPT-3 and BERT

GPT-3 and BERT are two different natural language processing (NLP) models that differ significantly in several ways:

  1. Model architecture :
    • GPT-3 (Generative Pre-trained Transformer 3) is an autoregressive language model based on Transformer's Decoder module, using Masked Self-Attention. It is designed to generate text and can be used to generate various natural language texts, such as articles, stories, dialogues, etc.
    • BERT (Bidirectional Encoder Representations from Transformers) is an autoencoding language model based on the Transformer's Encoder module, using (unmasked) Self-Attention. The goal of BERT is to extract representations of text through bidirectional context modeling, and it is usually pre-trained for use in various NLP tasks.
  2. Task type :
    • GPT-3 : GPT-3 is mainly used to generate text, it is a generative model that can generate natural language output related to the input text.
    • BERT : BERT is mainly used to extract the representation of text. It is a representation learning model that can be used for various NLP tasks, such as text classification, named entity recognition, question answering, etc.
  3. Training method :
    • GPT-3 : GPT-3 is pre-trained autoregressively, i.e. it learns the probability distribution of language by predicting the next token. Its training corpus contains roughly 499 billion tokens.
    • BERT : BERT is pre-trained as an autoencoder, i.e. it learns text representations by predicting masked words in the input text. BERT Large was pre-trained on a corpus of roughly 3.3 billion words (BooksCorpus plus English Wikipedia).
  4. Directionality :
    • GPT-3 : GPT-3 is an autoregressive model, often used to generate text. It generates a sequence of textual tokens, each of which depends on previously generated tokens.
    • BERT : BERT is a bidirectional model that considers the context on both sides of a word, and can thus better understand the meaning of the whole text.
  5. Task Suitability :
    • GPT-3 : GPT-3 usually does not need fine-tuning to fit a specific task; it adapts through zero-shot or few-shot prompting, with the task description and examples supplied directly in the input text.
    • BERT : BERT adapts to different NLP tasks by adding a simple task-specific layer (such as a classification layer) after pre-training, so it is easier to use for transfer learning .
  6. Scale :
    • GPT-3 : GPT-3 is one of the largest known language models, with hundreds of billions of parameters; its largest version has 175 billion parameters.
    • BERT : BERT is comparatively small, typically with tens to hundreds of millions of parameters; the BERT Large model has about 340 million parameters.

It should be noted that both GPT-3 and BERT are models that have achieved great success in the NLP field, and they each have different advantages and application areas. The choice of which model to use depends on the specific task and requirements.

3. The difference between GPT-3 and traditional fine-tuning

There are some important differences between GPT-3 and traditional fine-tuning (Fine-Tuning) methods in natural language processing (NLP), mainly in the following aspects:

  1. Scale and pretraining :
    • GPT-3 : GPT-3 is an extremely large pre-trained language model with hundreds of billions of parameters, self-supervised pre-trained on a large-scale text corpus. The goal of GPT-3 is to capture as much natural language knowledge as possible in the pre-training stage, rather than optimizing for a specific task.
    • Traditional fine-tuning : Traditional fine-tuning methods typically use smaller-scale pre-trained models (such as BERT or GPT-2), followed by supervised fine-tuning on task-specific labeled data. The goal of fine-tuning is to optimize the model according to the task-specific objective function in order to better adapt to a specific task.
  2. Task suitability :
    • GPT-3 : Due to its huge scale and extensive pre-training process, GPT-3 performs well on a variety of NLP tasks without requiring task-specific fine-tuning. This allows GPT-3 to be used directly for a variety of tasks without retraining the model.
    • Traditional fine-tuning : Traditional fine-tuning methods require individual fine-tuning for each task, often requiring large amounts of labeled data. Each task requires a specific objective function and fine-tuning procedure .
  3. Versatility :
    • GPT-3 : GPT-3 is designed as a general-purpose natural language processing tool that can be used for various tasks such as generating text, answering questions, and translating. It can be used in multi-domain applications without domain-specific customization.
    • Traditional fine-tuning : Traditional fine-tuning typically requires creating custom models and training pipelines for each domain and task, which can require significant engineering effort.
  4. Data requirements :
    • GPT-3 : Due to its large-scale pre-training and transfer learning capabilities, GPT-3 generally requires less task-specific labeled data to perform well on new tasks.
    • Traditional fine-tuning : Traditional fine-tuning methods usually require a large amount of labeled data to achieve good performance, especially in tasks that require precision.
  5. Flexibility :
    • GPT-3 : GPT-3's pre-trained models can be used for various natural language processing tasks without changing the model's architecture. This provides greater flexibility.
    • Traditional fine-tuning : Traditional fine-tuning may require modifying the model architecture to fit the needs of a specific task, which may require more engineering effort.

Overall, GPT-3 is a general-purpose, pre-trained natural language processing model with excellent generalization capabilities that can be directly used for a variety of tasks. Traditional fine-tuning methods focus more on task-specific optimization, and usually require more engineering and labeled data. Which method to choose depends on factors such as the nature of the task, data availability, and resources.

4. Example of GPT-3

GPT-3 (Generative Pre-trained Transformer 3) is a natural language processing (NLP) model with excellent natural language understanding and generation capabilities. Here are some examples showing the various tasks and application scenarios that GPT-3 can perform:

  1. Text generation :
    • Article writing : GPT-3 can generate articles, blog posts, press releases, and more.
    • Creative Writing : It can generate creative texts such as poems, novels, stories, etc.
  2. Automatic question and answer :
    • Question answering system : GPT-3 can answer questions on a wide range of topics, including science, history, technology, and more.
    • Legal Advice : It can provide answers to questions and advice in the legal field.
  3. Natural Language Understanding :
    • Sentiment Analysis : GPT-3 can analyze sentiment in text, such as positive, negative, or neutral.
    • Intent Recognition : It can recognize the user's intent and is used in chatbots and virtual assistants.
  4. Language translation :
    • Translation services : GPT-3 can translate text from one language to another.
  5. Code generation :
    • Code writing : It can generate program code, including Python, JavaScript, etc.
  6. Virtual Assistant :
    • Personalized assistants : GPT-3 can be used to build personalized virtual assistants that answer user questions and perform tasks.
  7. Creativity and Art :
    • Painting and Illustration : It can generate descriptions of paintings or instructions for creating works of fine art.
    • Music creation : GPT-3 can generate music and lyrics.
  8. Science and Research :
    • Data Analysis : It helps analyze data, generate graphs and reports.
    • Scientific Research : Used to generate hypotheses and explore questions in scientific fields.
  9. Education :
    • Online Education : GPT-3 can be used to generate educational materials, answers, and practice questions.
  10. Social Media and Chat :
    • Social Media Posts : It can generate posts, comments and replies for social media platforms.

These examples are just a small sample of the potential uses of GPT-3. This model can perform a variety of natural language processing tasks given the input text and tasks provided, making it a powerful natural language processing tool.

For more information about GPT-3, please see How GPT3 Works - Visualizations and Animations

Language model application case

The decoder-only Transformer keeps showing promise in applications beyond language modeling. It has been used successfully in many applications, which can be illustrated in a diagrammatic way similar to the one above.

1. Machine translation

An Encoder is not strictly necessary for machine translation; the same task can be solved with a decoder-only Transformer:
Machine Translation

2. Generate summary

This was the first task a decoder-only Transformer was trained on: it was trained to read a Wikipedia article (without the opening section before the table of contents) and generate a summary. The actual opening sections of the articles were used as labels in the training data:
Summarization
The paper trained the model on Wikipedia articles, so the trained model was able to summarize articles:
summarize articles

3. Transfer Learning

In Sample Efficient Text Summarization Using a Single Pre-Trained Transformer, a decoder-only Transformer is first pre-trained with a language modeling objective and then fine-tuned to generate summaries. The results show that, when the amount of data is limited, it achieves better results than a pre-trained Encoder-Decoder Transformer.

The GPT-2 paper also reports summarization results obtained after pre-training the model on language modeling alone.

4. Music Generation

The Music Transformer paper uses a decoder-only Transformer to generate music with expressive timing and dynamics. Music modeling is just like language modeling: let the model learn music in an unsupervised way, then have it sample outputs (what we earlier called "rambling").

You might be wondering how music is represented in this setting. Remember that language modeling represents characters, words, or parts of words (tokens) as vectors. In a musical performance (think of a piano), we need to represent not only the notes but also the velocity - a measure of how hard the piano key is pressed.
Music Modeling
A performance is then a sequence of these one-hot vectors, and a MIDI file can be converted into this format. The paper uses the following input sequence as an example:
A midi file can be converted into such a format
The one-hot vector representation of this input sequence is as follows:
Music Transformer
Some annotations are added on this basis:
annotation
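As a rough illustration, the sketch below turns a short event sequence into one-hot vectors. The vocabulary is a simplified, hypothetical stand-in for the note-on/note-off/velocity/time-shift events the paper actually uses:

```python
import numpy as np

# Simplified, illustrative event vocabulary (a real performance vocabulary has hundreds of
# note-on / note-off / velocity / time-shift events).
vocab = ["NOTE_ON_60", "NOTE_ON_64", "NOTE_OFF_60", "NOTE_OFF_64",
         "VELOCITY_80", "TIME_SHIFT_100MS"]
index = {event: i for i, event in enumerate(vocab)}

def one_hot(event):
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[event]] = 1
    return vec

performance = ["VELOCITY_80", "NOTE_ON_60", "TIME_SHIFT_100MS", "NOTE_OFF_60"]
sequence = np.stack([one_hot(e) for e in performance])   # one one-hot row per event
print(sequence)
```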
This piece has a recurring triangular contour. The Query is at one of the later peaks, and it attends to the high notes of all the previous peaks, all the way back to the beginning of the piece. The figure shows one Query vector (the source of all the attention lines) and the previously attended memories (the highlighted notes, which receive a larger softmax probability). The color of the attention lines corresponds to different attention heads, and their width corresponds to the softmax probability weight.

GPT-2 code

  • Open AI's GPT-2 code repository: https://github.com/openai/gpt-2
  • Check out Hugging Face's pytorch-transformers library (now called transformers). Besides GPT-2, it also implements BERT, Transformer-XL, XLNet, and other cutting-edge transformer models; a minimal usage sketch follows below.
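A minimal usage sketch with the current transformers package (the successor to pytorch-transformers); the prompt and sampling settings are arbitrary examples:

```python
# pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the smallest GPT-2 checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("A robot must obey the orders", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, do_sample=True, top_k=40)  # sampled continuation
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```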

References

  1. The Illustrated GPT-2
  2. How GPT3 Works - Visualizations and Animations
  3. GPT / GPT-2 / GPT-3 / InstructGPT Evolution Road
  4. Language Model: Mastering BERT and GPT Models
