[AI Theory Learning] Language Model: Master BERT and GPT Models


The ELMo model can update a word's feature representation according to its context, turning word vectors from static into dynamic ones. However, because ELMo relies on a bidirectional language-model architecture, it can only be trained on relatively small corpora and its computational efficiency is limited. To address these problems, the BERT and GPT models, both built on the Transformer framework, were proposed.

BERT model

The basic principle of BERT

The full name of BERT is Bidirectional Encoder Representations from Transformers; it is a pre-trained language representation model. Instead of the traditional unidirectional language model, or the shallow concatenation of two unidirectional language models, it is pre-trained with a new masked language model (MLM) objective, so that it can generate deep bidirectional language representations. When the BERT paper was published, it reported new state-of-the-art results on 11 NLP (Natural Language Processing) tasks, which was stunning.

The model has the following main advantages:
1) MLM is used to pre-train a bidirectional Transformer that generates deep bidirectional language representations.
2) After pre-training, only an additional output layer needs to be added for fine-tuning to achieve state-of-the-art performance on a variety of downstream tasks. No task-specific architectural modifications to BERT are required in this process.

The overall architecture of BERT

BERT uses MLM for pre-training and builds the entire model from deep bidirectional Transformer components. (A unidirectional Transformer is generally called a Transformer decoder: each of its tokens attends only to the tokens to its left. A bidirectional Transformer is called a Transformer encoder: each of its tokens attends to all tokens.) The result is a deep bidirectional language representation that fuses left and right context information. The overall architecture is shown below:
BERT
Figure 1. Differences in pre-trained model architectures. Trm refers to the Transformer encoder module. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses a concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Of the three, only the BERT representation is conditioned on both left and right context in all layers. Besides the architectural differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

TransformerBlock code is as follows:

class TransformerBlock(nn.Module):
	def __init__(self, k, heads):
		super().__init__()
		self.attention = SelfAttention(k, heads=heads)  # multi-head self-attention (SelfAttention is defined earlier)
		self.norm1 = nn.LayerNorm(k)
		self.norm2 = nn.LayerNorm(k)

		# position-wise feed-forward network: expand to 4*k, then project back to k
		self.mlp = nn.Sequential(
			nn.Linear(k, 4 * k),
			nn.ReLU(),
			nn.Linear(4 * k, k)
		)

	def forward(self, x):
		attended = self.attention(x)
		x = self.norm1(attended + x)        # residual connection + layer norm
		feedforward = self.mlp(x)
		return self.norm2(feedforward + x)  # residual connection + layer norm

BERT provides two model sizes, base and large, with the following hyperparameters (L = number of Transformer layers, H = hidden size, A = number of attention heads):

  • $\text{BERT}_{\text{BASE}}$: L=12, H=768, A=12, about 110M parameters in total;
  • $\text{BERT}_{\text{LARGE}}$: L=24, H=1024, A=16, about 340M parameters in total.

BERT performs self-supervised learning on massive corpora (self-supervised learning here means supervised-style learning that runs on data that has not been manually labeled). For downstream NLP tasks, the feature representations produced by BERT can be used directly as word-embedding features. BERT therefore provides a model for transfer learning to downstream tasks: depending on the task, it can either be fine-tuned or kept fixed and used as a feature extractor.

Input to BERT

The input representation of BERT is the element-wise sum of three embedding features, as shown in the figure below:
Input representation
Figure 2. BERT input representation. The input embeddings are the sum of the token embeddings, segmentation embeddings, and position embeddings. The pink blocks in the figure are the tokens and the yellow blocks are their corresponding representations. The word dictionary is constructed with the WordPiece algorithm. To support classification tasks, in addition to the word tokens, the authors insert a special classification token ([CLS]) at the beginning of every input sequence; the output of the last Transformer layer at the position of this token is used to aggregate the representation of the whole sequence.

Since BERT is a pre-trained model, it must adapt to a variety of natural language tasks, so the model's input sequence must be able to contain a single sentence (text sentiment classification, sequence labeling tasks) or two or more sentences (text summarization, natural language inference, question answering tasks). How can the model tell which range belongs to sentence A and which belongs to sentence B? BERT solves this in two ways: 1) a separator token ([SEP]) is inserted between the sentences in the token sequence to separate the tokens of different sentences; 2) a learnable segment embedding is added to each token representation to indicate whether it belongs to sentence A or sentence B.

Therefore, the final input token sequence of the model looks as shown in the figure below (if the input sequence contains only one sentence, the [SEP] and the tokens after it are absent).
The input sequence of the model
As mentioned above, the input to BERT is the representation of each token, and this representation is in fact the sum of three parts, namely the corresponding token, segment, and position embeddings (a minimal sketch of how the corresponding ids are constructed follows the list below):

  1. Token Embeddings
    English corpora generally use WordPiece embeddings: words are split into a limited set of common sub-word units, which strikes a balance between the efficiency of whole words and the flexibility of characters. For example, "playing" is split into "play" and "##ing". For Chinese corpora, character-level tokens are typically used.
  2. Position Embeddings
    Position embeddings encode the position information of words into feature vectors; they are a crucial way of introducing the positional relationships between words into the model. The position embedding here differs from the Transformer position encoding discussed in the previous article: it is not a trigonometric function but is learned.
  3. Segment Embeddings
    Segment embeddings are used to distinguish two sentences, e.g. whether B is the continuation of A (dialogue scenarios, question-answer scenarios, etc.). For sentence pairs, the feature value of the first sentence is 0 and that of the second sentence is 1.
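To make the input construction concrete, here is a minimal, illustrative sketch of how a sentence pair is turned into token, segment, and position ids before the three embeddings are looked up and summed (the sentence pair follows the example used in the BERT paper; the helper lists are hypothetical and not taken from any BERT codebase):

# Illustrative sketch: BERT-style inputs for sentence A = "my dog is cute", B = "he likes playing"
tokens_a = ["my", "dog", "is", "cute"]
tokens_b = ["he", "likes", "play", "##ing"]          # WordPiece splits "playing" into "play" + "##ing"

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)   # sentence A -> 0, sentence B -> 1
position_ids = list(range(len(tokens)))                               # indices into the learned position embeddings

print(tokens)        # ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'play', '##ing', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]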

BERT's output

A characteristic of the Transformer is that it produces as many outputs as it receives inputs, as shown in the figure below:
BERT's output
C is the output of the last Transformer layer corresponding to the classification token ([CLS]), and $T_i$ denotes the output of the last Transformer layer corresponding to the other tokens. For token-level tasks (such as sequence labeling and question answering), $T_i$ is fed into an additional output layer for prediction. For sentence-level tasks (such as natural language inference and sentiment classification), C is fed into an additional output layer, which explains why a special classification token is inserted at the front of every token sequence.

BERT pre-training

In fact, the concept of pre-training is already mature and widely used in CV (Computer Vision). The pre-training task used in CV is generally the ImageNet image classification task: completing this task requires extracting good image features, and the ImageNet dataset is both large and high-quality, so pre-training on it usually yields good results.

Although there is no high-quality human-labeled data like ImageNet in the NLP field, the self-supervised nature of large-scale text data can be used to build pre-training tasks . Therefore, BERT built two pre-training tasks, namely Masked Language Model and Next Sentence Prediction .

masked language model

The masked language model (Masked Language Model, MLM) is a truly bidirectional method, which is why BERT is not limited by the unidirectional language model. The ELMo model discussed in the previous article merely trains a left-to-right and a right-to-left model separately. The difference between the two models can be seen clearly from their objective functions:

  • ELMo uses $P(t_k \mid t_1,\dots,t_{k-1})$ and $P(t_k \mid t_{k+1},\dots,t_n)$ as objective functions, trains the two directions independently, and finally concatenates the results.
  • BERT uses $P(t_k \mid t_1,\dots,t_{k-1},t_{k+1},\dots,t_n)$ as the objective function, so the learned word vectors attend to information from the left and right context simultaneously.

Put simply, 15% of the tokens in each training sequence are randomly replaced with a mask token ([MASK]), and the model is trained to predict the original word at each [MASK] position. However, since [MASK] never appears in the fine-tuning stage of downstream tasks, there is a mismatch between the pre-training and fine-tuning stages (that is, the pre-training objective makes the generated language representation sensitive to [MASK] but not to other tokens). BERT therefore adopts a small trick: after the words to be masked are selected, 80% of them are replaced with [MASK], 10% are replaced with a random word, and 10% keep the original token. The specific strategy is as follows:

First, in each training sequence, token positions are randomly selected for prediction with a probability of 15%. If the i-th token is selected, it is replaced with one of the following three options:

  1. 80% of the time it is [MASK]. For example, my dog is hairy ——> my dog is [MASK]
  2. 10% of the time it is a random other token. For example, my dog is hairy ——> my dog is apple
  3. 10% of the time it is the original token (kept unchanged, serving as the counterpart to point 2). For example, my dog is hairy ——> my dog is hairy

Then, the output $T_i$ corresponding to that position is used to predict the original token (it is fed into a fully connected layer, softmax outputs a probability for every token in the vocabulary, and cross entropy is used to compute the loss).
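To make the 15% and 80/10/10 strategy concrete, here is a minimal sketch (not the original BERT implementation; mask_id and vocab_size are assumed to be provided by the tokenizer in use):

import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """BERT-style MLM masking: 15% of positions are chosen for prediction; of those,
    80% become [MASK], 10% become a random token, 10% keep the original token.
    Returns (inputs, labels); labels are -1 at positions that are not predicted."""
    inputs, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            labels.append(tid)                                # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs.append(mask_id)                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(vocab_size))   # 10%: replace with a random token
            else:
                inputs.append(tid)                            # 10%: keep the original token
        else:
            inputs.append(tid)
            labels.append(-1)                                 # ignored by the loss
    return inputs, labels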

The entire MLM training process is shown in the figure below:
BERT's MLM training process
Figure 3 MLM training process of BERT

predict next sentence

Some tasks, such as question answering and natural language inference, require understanding the relationship between two sentences, whereas the MLM task tends to extract token-level representations, so sentence-level representations cannot be obtained from it directly. To enable the model to understand the relationship between sentences, BERT uses the Next Sentence Prediction (NSP) task for pre-training, which simply means predicting whether two sentences appear next to each other.

The specific method is: each training example is formed from a sentence A and a sentence B selected from the corpus. 50% of the time sentence B is the actual next sentence after A (labeled IsNext), and the remaining 50% of the time sentence B is a random sentence from the corpus (labeled NotNext). The training samples are then fed into the BERT model, and the output C corresponding to [CLS] is used to make the binary prediction.
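A minimal sketch of how such IsNext/NotNext pairs can be sampled from a list of consecutive sentences (illustrative only; the function and data layout are assumptions, not the repository's data format):

import random

def make_nsp_example(sentences, idx):
    """Build one NSP pair: 50% of the time return the true next sentence (label 1, IsNext),
    otherwise return a random sentence from the corpus (label 0, NotNext)."""
    sent_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        sent_b, is_next = sentences[idx + 1], 1            # IsNext
    else:
        sent_b, is_next = random.choice(sentences), 0      # NotNext
    return sent_a, sent_b, is_next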

The specific training process is shown in the figure below:
BERT's NSP pre-training process
Figure 4 BERT's NSP pre-training process

The final training example looks like this:

Input1=[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label1=IsNext

Input2=[CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label2=NotNext

Each training sample is fed into BERT to obtain the loss for each of the two tasks, and the two losses are added together to form the overall pre-training loss (that is, the two tasks are trained at the same time).

It can be clearly seen that the data required for these two tasks can actually be constructed from unlabeled text data (self-supervised nature).

The BERT training process therefore includes both MLM and NSP. For the exact definition of the loss function, refer to the corresponding code on the Hugging Face website: https://huggingface.co/docs/transformers/main/en/model_doc/bert
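As a rough sketch of how the two losses are combined (the head names mlm_logits and nsp_logits are hypothetical; the actual Hugging Face BertForPreTraining code differs in detail):

import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """mlm_logits: (batch, seq_len, vocab_size); mlm_labels: (batch, seq_len), -1 at unmasked positions;
    nsp_logits: (batch, 2); nsp_labels: (batch,), 1 = IsNext, 0 = NotNext."""
    # MLM loss: cross entropy over the vocabulary, ignoring positions that were not masked
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-1)
    # NSP loss: binary classification from the [CLS] representation
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss   # the two tasks are trained jointly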

Fine-tuning of BERT

Fine-tuning is straightforward because the self-attention mechanism in the Transformer allows BERT to model many downstream tasks (whether they involve a single text or a pair of texts) simply by swapping in the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to encode the two texts independently and then apply bidirectional cross attention.
BERT instead uses the self-attention mechanism to unify these two stages: encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences.

To complete a downstream classification task with BERT, it is only necessary to add an output layer on top of BERT and fine-tune for the specific task. For classification problems, the final hidden state of the first token ([CLS]), $C \in \mathbb{R}^H$, is taken directly; a weight matrix $W$ is added on top, and softmax predicts the label probabilities: $P = \text{softmax}(CW^T)$.
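A minimal sketch of this classification head (the class name is hypothetical; in practice a dropout layer is usually added as well):

import torch
import torch.nn as nn

class BertClassificationHead(nn.Module):
    """Implements P = softmax(C W^T): C is the final hidden state of [CLS], shape (batch, H)."""
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)   # the weight matrix W (plus a bias)

    def forward(self, cls_hidden_state):
        logits = self.classifier(cls_hidden_state)              # C W^T + b
        return torch.softmax(logits, dim=-1)                    # label probabilities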

For other downstream tasks, some adjustments are required, as shown in the figure below:
Fine-tune the BERT pre-training model to complete the corresponding downstream tasks
Figure 5. Fine-tuning the BERT pre-trained model to complete the corresponding downstream tasks. Tok denotes the different tokens, E denotes the embedding vectors, and $T_i$ denotes the feature vector obtained for the i-th token after it is processed by BERT.

The following is a brief introduction to several downstream tasks and what needs to be fine-tuned:
1) Sentence-pair classification tasks. For example, MNLI: given a premise, infer the relationship between a hypothesis and that premise; MRPC: judge whether two sentences are semantically equivalent.
2) Single-sentence classification tasks. For example, SST-2: sentiment analysis of movie reviews; CoLA: judging whether a sentence is linguistically acceptable.
3) Question answering tasks. For example, SQuAD v1.1: given a question and a passage of descriptive text, output the answer to the question, similar to reading-comprehension short-answer questions.
4) Named entity recognition. For example, CoNLL-2003 NER: decide whether a word in a sentence is a person (Person), an organization (Organization), a location (Location), or another kind of entity.

Feature extraction for BERT

All BERT results presented so far use the fine-tuning approach, in which a simple classification layer is added to the pre-trained model and all parameters are jointly fine-tuned on the downstream task. However, the feature-based approach of extracting fixed features from the pre-trained model has certain advantages. First, not all tasks can easily be represented by a Transformer-encoder architecture, so some require adding a task-specific model architecture. Second, there are significant computational advantages in pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of that representation.
Feature extraction with BERT
So, we use a pre-trained BERT model to create contextual word embeddings and then feed those embeddings into our existing model. An example is shown below:
Schematic diagram of BERT using feature extraction method
From the figure above we can see that the contextual embedding of each word in the input sentence is generated by the $\text{BERT}_{\text{BASE}}$ model. Since $\text{BERT}_{\text{BASE}}$ consists of 12 Transformer encoder layers, each encoder layer generates a contextual embedding for each word and passes it up to the next encoder layer. We therefore have 12 choices of contextual word embeddings, one generated by each of the 12 Transformer encoders.

Which vector works best as a contextual embedding? Personally, I think it depends on the task. The paper examines six options (compared against the fine-tuned model, which achieved a score of 96.4):
The impact of the output of different layers of BERT on downstream tasks
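As a rough illustration of the feature-based approach, the hidden states of all encoder layers can be requested through the Hugging Face transformers library and combined as desired (a sketch assuming a recent library version; summing the last four layers is just one of the options examined above):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: a tuple with the embedding output plus the outputs of the 12 encoder layers,
# each of shape (batch, seq_len, 768)
hidden_states = outputs.hidden_states
# one option: sum the last four encoder layers as the contextual word embedding
contextual_embeddings = torch.stack(hidden_states[-4:]).sum(dim=0)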

Implementing BERT with PyTorch

The best way to try out BERT is through BERT fine-tuning with Cloud TPU notebooks hosted on Google Colab . It's also a good place to start if you've never used a Cloud TPU before, as the BERT code works on TPUs, CPUs, and GPUs as well.

The official BERT code repository is https://github.com/google-research/bert. You can also refer to the PyTorch implementation by Hugging Face: https://github.com/huggingface/transformers. The AllenNLP library uses this implementation to allow BERT embeddings to be used with any model.

The PyTorch version used here comes from https://github.com/codertimo/BERT-pytorch. Implementing BERT in PyTorch involves two core modules: the BERTEmbedding class, which generates BERT's input, and the TransformerBlock class. Combining these two modules gives the BERT model module bert.py. The relationship between these modules is as follows:
Module class diagram of the BERT model
First, let us walk through the main entry file of BERT-pytorch: __main__.py
The main parameters
Here train_dataset and test_dataset refer to the training data and test data of the task you selected, which we generally call the corpus. In this example we use the training set and test set of the MRPC task from the GLUE benchmark. vocab_path points to the vocabulary, which works like a large dictionary recording all possible words; later, when the words in the corpus are converted to ids, they are looked up in this dictionary.

The next thing to load is the code of vocab:
load vocab
This code converts the word list from txt format into a corresponding Python object for subsequent processing. The specific procedure is:
loca_vocab
Then, load the data object, here only look at the training data set:
loading train dataset
Then we jump into BERTDataset (this kind of data loader works by implementing the __getitem__ method).
This part of the data-processing code assumes that each line of the corpus contains two sentences separated by '\t'.
get_corpus_line
Its processing steps are as follows:
(1) Take the two sentences from a line. With 50% probability, return the two sentences as they are with label 1 (meaning the two sentences are adjacent); otherwise, pair the first sentence with a randomly chosen sentence and set the label to 0 to indicate that they are not adjacent. This step prepares the data for the NSP task.
random_sent
(2) Mask words in the two sentences: selected words are replaced with [MASK] (or with another word), and the id of the original word is recorded (the id is the index of the word in the big dictionary, since the computer works with numbers).
random word
(3) Add [CLS] at the head and [SEP] at the tail of the two sentences' tokens; the label corresponding to these two positions is 0.
(4) Generate the segment ids and pad the sequence with the padding id.

The above is the data preprocessing part, which covers the data processing for the NSP and MLM tasks as well as operations such as adding [CLS] and [SEP] and generating the corresponding segment ids. Next, we formally start building the overall framework of BERT.
build bert model
The core code for building BERT is as follows:

import torch.nn as nn

from .transformer import TransformerBlock
from .embedding import BERTEmbedding


class BERT(nn.Module):
    """
    BERT model : Bidirectional Encoder Representations from Transformers.
    """

    def __init__(self, vocab_size, hidden=768, n_layers=12, attn_heads=12, dropout=0.1):
        super().__init__()
        self.hidden = hidden  # hidden size of the final layer
        self.n_layers = n_layers  # number of Transformer blocks, i.e. the number of layers
        self.attn_heads = attn_heads  # number of attention heads = hidden size / dimension per head
        # in BERT, when hidden size = 768 there are 12 heads, each of dimension 64

        # paper noted they used 4*hidden_size for ff_network_hidden_size
        self.feed_forward_hidden = hidden * 4
        # each Transformer block is multi-head attention + residual, then feed-forward + residual; this is the feed-forward dimension
        
        # embedding for BERT, sum of positional, segment, token embeddings
        self.embedding = BERTEmbedding(vocab_size=vocab_size, embed_size=hidden)

        # multi-layers transformer blocks, deep network
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(hidden, attn_heads, hidden * 4, dropout) for _ in range(n_layers)])

    def forward(self, x, segment_info):
        # attention masking for padded token
        # torch.ByteTensor of shape [batch_size, 1, seq_len, seq_len]
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)

        # embedding the indexed sequence to sequence of vectors
        x = self.embedding(x, segment_info)

        # running over multiple transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer.forward(x, mask)

        return x

The first part is the initialization of the BERT model. We know that BERT is the encoder part of the Transformer, so the initialization mainly builds the Transformer structure (the Transformer blocks, the number of heads, the hidden size, and the number of layers, as noted in the comments above).

Then, looking at the forward function: first the mask is constructed by finding the positions in the data that are not padding, assigning them 1, and expanding the result to the required shape; then BERTEmbedding encodes the tokens; finally the result is passed through the Transformer blocks one by one. Below is the code for the BERTEmbedding class and the TransformerBlock class:

import torch.nn as nn
from .token import TokenEmbedding
from .position import PositionalEmbedding
from .segment import SegmentEmbedding


class BERTEmbedding(nn.Module):
    """
    BERT Embedding which is composed of the following features:
        1. TokenEmbedding : normal embedding matrix
        2. PositionalEmbedding : adding positional information using sin, cos
        3. SegmentEmbedding : adding sentence segment info, (sent_A:1, sent_B:2)

        the sum of all these features is the output of BERTEmbedding
    """

    def __init__(self, vocab_size, embed_size, dropout=0.1):
        """
        :param vocab_size: total vocab size
        :param embed_size: embedding size of token embedding
        :param dropout: dropout rate
        """
        super().__init__()
        self.token = TokenEmbedding(vocab_size=vocab_size, embed_size=embed_size)
        self.position = PositionalEmbedding(d_model=self.token.embedding_dim)
        self.segment = SegmentEmbedding(embed_size=self.token.embedding_dim)
        self.dropout = nn.Dropout(p=dropout)
        self.embed_size = embed_size

    def forward(self, sequence, segment_label):
        x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)
        return self.dropout(x)
import torch.nn as nn

from .attention import MultiHeadedAttention
from .utils import SublayerConnection, PositionwiseFeedForward


class TransformerBlock(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    """

    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        """
        :param hidden: hidden size of transformer
        :param attn_heads: head sizes of multi-head attention
        :param feed_forward_hidden: feed_forward_hidden, usually 4*hidden_size
        :param dropout: dropout rate
        """

        super().__init__()
        self.attention = MultiHeadedAttention(h=attn_heads, d_model=hidden)
        # attention returns the result of the Q, K, V computation; Q, K and V are all learnable Linear layers
        self.feed_forward = PositionwiseFeedForward(d_model=hidden, d_ff=feed_forward_hidden, dropout=dropout)
        # the position-wise feed-forward is simply two Linear layers with an activation in between
        self.input_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.output_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        # multi-head attention + residual block, followed by feed-forward + residual block, together make up one Transformer block
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, mask):
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
        x = self.output_sublayer(x, self.feed_forward)
        return self.dropout(x)

Finally comes the normal PyTorch training procedure: feed in the data and the corresponding labels, compute the loss, and back-propagate:
training process
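A simplified sketch of that training step (the batch keys bert_input / segment_label / is_next / bert_label and the log-probability outputs follow the repository's conventions, but this is a condensed approximation rather than the repository's exact trainer code):

import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.NLLLoss(ignore_index=0)   # the model's heads output log-probabilities

for batch in train_dataloader:
    # forward pass: BERT encodes the batch, then the NSP and MLM heads make predictions
    nsp_output, mlm_output = model(batch["bert_input"], batch["segment_label"])

    nsp_loss = criterion(nsp_output, batch["is_next"])
    mlm_loss = criterion(mlm_output.transpose(1, 2), batch["bert_label"])
    loss = nsp_loss + mlm_loss               # joint pre-training loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()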

GPT model

In June 2018, OpenAI published a paper introducing its own language model, GPT, short for "Generative Pre-Training". It is based on the Transformer architecture: a general model is first pre-trained, and is then fine-tuned on supervised datasets for each specific task. This task-agnostic model design achieves good performance on multiple tasks at once.

Unlike the BERT pre-trained model, which uses the Transformer's Encoder part, the GPT series uses the Transformer's Decoder part. Since the GPT model is trained as a traditional language model, i.e. it predicts each word from the words before it, GPT is better at natural language generation tasks (NLG), while BERT is better at natural language understanding tasks (NLU).

The overall architecture of the GPT model

The GPT framework uses a semi-supervised learning method to complete language understanding tasks. Training is divided into two stages: unsupervised pre-training and supervised fine-tuning. In the pre-training stage, a unidirectional Transformer learns a language model and produces unsupervised embeddings for sentences; in the fine-tuning stage, the Transformer parameters are adjusted for the specific task. The goal is to learn a general representation that can be adapted to different kinds of tasks with only minor modifications. The overall architecture is shown in the figure below:
gpt 1.0
Here Trm denotes a Decoder module, Trm blocks on the same horizontal line denote the same unit, $E_i$ denotes the word embedding, and the dense connections represent the dependencies between words. Clearly, the word GPT predicts depends only on the preceding context.

GPT-2 has roughly 4 versions according to its size, as shown in the figure:
4 versions of GPT-2

Model structure of GPT

GPT uses the Transformer's Decoder structure with some changes: the original Decoder contains two multi-head attention structures, while GPT keeps only the masked multi-head attention, as shown below:
Model structure of GPT

The difference between GPT-2's Multi-Head and BERT's Multi-Head

It is important to distinguish between self-attention (used by BERT) and masked self-attention (used by GPT-2). A normal self-attention module allows a position to attend to the tokens to its right; masked self-attention prevents this. In other words, BERT can attend to both the left and right context of a word at the same time, while GPT-2 can attend only to the words to the left of the current word.
The difference between BERT and GPT-2's Multi-Head
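The difference can be seen directly in the attention mask. A minimal sketch (illustrative, not taken from either model's code):

import torch

seq_len = 5

# Bidirectional (BERT-style) self-attention: every position may attend to every other position
bidirectional_mask = torch.ones(seq_len, seq_len)

# Masked / causal (GPT-2-style) self-attention: position i may only attend to positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])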

Decoder only module

After the original Transformer paper was published, "Generating Wikipedia by Summarizing Long Sequences" proposed another arrangement of Transformer modules capable of language modeling. This model discards the Transformer's Encoder, so we can call it the Transformer-Decoder. This early Transformer-based language model consists of six Decoder modules.
Transformer-Decoder
These Decoder modules are the same. The above figure has expanded the first Decoder, so you can see that its Self-Attention layer is masked. Note that this model can now handle up to 4000 tokens – a big upgrade from the 512 tokens in the original paper.

These modules are very similar to the original Decoder modules, except that they remove the second self-attention layer. A similar structure was used in "Character-Level Language Modeling with Deeper Self-Attention" to build a language model that generates one letter/character at a time.

OpenAI's GPT-2 uses these Decoder modules .

Understanding the GPT-2 Model

Let's first look at how a trained GPT-2 works.
GPT-2 can handle 1024 tokens
GPT-2 can handle 1024 tokens. Each token passes through all Decoder modules along its own path

The easiest way to run a trained GPT-2 model is to let it generate text on its own (technically this is called generating unconditional samples). Alternatively, it can be given a prompt so that it writes about a particular topic (i.e. generating interactive conditional samples). In the unconditional case, we simply feed it a start token and let it begin generating words (a trained model uses <|endoftext|> as this start token; we will call it <s> here).
generating interactive conditional samples
The model has only one input token, so there is only one active path. The token is processed through all the layers in turn, and a vector is produced along that path. This vector can be scored against the model's vocabulary (all the words the model knows, about 50,000 for GPT-2). In this example we chose the token with the highest probability, "the". But we can mix things up: if you keep tapping the suggested word in a keyboard app, it sometimes gets stuck in a repetitive loop where the only way out is to tap the second or third suggestion, and the same thing can happen here. GPT-2 has a top-k parameter that makes the model consider words other than the single best one (which is what top-k = 1 gives).

Next, add the output of the first step to the input sequence and let the model make the next prediction.
GPT-2's second prediction
Note that the second path is the only active path in this calculation. Each layer of GPT-2 keeps its own interpretation of the first token and uses it when processing the second token. GPT-2 will not recalculate the first token based on the second token .

Deep understanding of the details of GPT-2

Input to GPT-2

Like the other NLP models we've discussed before, GPT-2 looks up the embedding of the input word in its embedding matrix -- one of the components we get from a trained model.
Token Embeddings
Each row is a word embedding: a list of numbers that represents a word and captures some of its meaning. The size of this list differs between GPT-2 model sizes; the smallest model uses an embedding size of 768.

So, at the start, we look up the embedding of the first token <s> in the embedding matrix. Before passing this embedding to the first module of the model, we need to add the positional encoding, which indicates the order of the words in the sequence. Part of the trained model is a matrix that contains a positional encoding vector for each of the 1024 positions.
Positional Encodings
Above, we discussed how the input words are processed before being passed to the first Transformer module. We also know that a trained GPT-2 contains two weight matrices: the token embedding matrix (Token Embedding), which records all words or tokens and has shape model_vocabulary_size × embedding_size, and the positional encoding matrix (Positional Encoding), which represents the position of a word in the context and has shape context_size × embedding_size. The embedding size is determined by the GPT-2 model size: 768 for the Small model, 1024 for the Medium model, and so on.

Before entering the GPT-2 model, the token embedding must be added to the corresponding positional encoding, as shown in the figure below:
Input data for GPT-2
Feeding a word into the first Transformer module means looking up its embedding and adding the positional encoding vector for position #1. The positional encoding of each token is the same in every Decoder layer.
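A minimal sketch of this lookup-and-add step (the sizes match the smallest GPT-2 model; the token ids and the freshly initialized embedding tables are illustrative, not trained weights):

import torch
import torch.nn as nn

vocab_size, context_size, embed_size = 50257, 1024, 768   # GPT-2 Small

wte = nn.Embedding(vocab_size, embed_size)     # token embedding matrix
wpe = nn.Embedding(context_size, embed_size)   # positional encoding matrix

token_ids = torch.tensor([[50256, 464, 3290]])             # e.g. <|endoftext|> followed by two tokens (illustrative ids)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # positions 0, 1, 2, ...

x = wte(token_ids) + wpe(positions)   # input to the first Decoder block: token embedding + positional encoding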

flow up the layer

The first module can now process the token: first through the self-attention layer, then through the feed-forward neural network layer. Once the first Transformer module has processed the token, it produces a result vector, which is sent to the next module in the stack. Each module performs the same processing, but each has its own self-attention and neural network layers.
A journey up the Stack


Review Self-Attention

Language is heavily dependent on context. For example, look at the second law below:

Second Law of Robotics
A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

Three parts of the sentence are highlighted where a word refers to other words. These words cannot be understood or processed without the context they refer to. When a model processes this sentence, it must be able to know that:

  • "it" refers to the robot
  • "such orders" refers to the earlier part of the law, namely the orders given to the robot by human beings
  • "the First Law" refers to the First Law of Robotics

What self-attention does is fold the model's understanding of a word's relevant and associated words into the processing of that word, before passing it on to a neural network. It works by scoring how relevant every other word in the segment is to the word being processed, and then summing their representation vectors weighted by those scores. For example, the self-attention layer in the top module in the figure below attends to "a robot" when processing the word "it"; the vector it passes on to the neural network is the sum of the vectors of those three words, each multiplied by its score.
an example of self-attention layer


Self-Attention process

Self-Attention is processed along the path of each token in the sentence, and the main components include 3 vectors.

  • Query : The Query vector is a representation of the current word and is used to score all other words (using the key vectors of those words). We only focus on the query vector of the token currently being processed.
  • Key : The Key vector is like a label for all the words in the sentence. They are what we are looking for when searching for words.
  • Value : Value vectors are the actual word representations, once we have scored the relevance of each word, we need to perform a weighted sum of these vectors to represent the current word.

searching through a filing cabinet
A rough analogy is searching through a filing cabinet: the Query vector is a sticky note with the topic you are researching, and the Key vectors are like the labels of the folders in the cabinet. When the note matches a label, we take out the contents of that folder, and those contents are the Value vector. You are not looking for just one Value vector, though, but for a blend of Value vectors from a series of folders.

Multiplying the Query vector by each folder's Key vector yields a score for each folder (technically: a dot product followed by a softmax).
Multiplying the query vector by each key vector produces a score for each folder
We multiply each Value vector by the corresponding score and then sum to get the output of Self Attention.
Self-Attention output
This weighted sum of Value vectors yields a vector that puts, for example, 50% of its attention on the word "robot", 30% on the word "a", and 19% on the word "it". Later we will dig deeper into self-attention, but for now let us continue up the model until we reach its output.
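A minimal sketch of the scoring and weighted sum just described, written as scaled dot-product attention over the Query, Key, and Value matrices (illustrative; a real GPT-2 layer adds learned projections, multiple heads, and the causal mask):

import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    """q, k, v: tensors of shape (seq_len, d). Returns one output vector per position."""
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)   # dot product of each query with every key
    weights = F.softmax(scores, dim=-1)                      # relevance scores that sum to 1
    return weights @ v                                       # weighted sum of the value vectors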

Output of GPT-2

When the module at the top of the model produces an output vector (this vector is obtained through the Self-Attention layer and the neural network layer), the model will multiply this vector by the embedding matrix .
the model multiplies that vector by the embedding matrix
Recall that each row in the embedding matrix corresponds to a word in the model's vocabulary. The result of this multiplication is interpreted as a score for each word in the model's vocabulary .
output token probabilities
The token with the highest score can simply be selected (top_k = 1), but the model can achieve better results if it also considers other words. A better strategy is therefore to treat the scores as probabilities and sample a word from the whole list (so that words with higher scores have a higher chance of being selected). A compromise is to set top_k to 40 and have the model consider only the 40 highest-scoring words.
the score as the probability of selecting that word
Note that Decoder #12 and Position #1 in the figure represent the position of the 12th layer Decoder and the first identifier respectively.

In this way, the model completes one iteration and outputs one word. It keeps iterating until the entire context has been generated (1024 tokens) or until a token representing the end of the sequence is produced.
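A minimal sketch of this generation loop with top-k sampling (assuming a model that maps token ids to logits of shape (batch, seq_len, vocab_size) and a batch size of 1; not OpenAI's actual sampling code):

import torch

@torch.no_grad()
def generate(model, token_ids, max_new_tokens=20, top_k=40, eos_id=None):
    for _ in range(max_new_tokens):
        logits = model(token_ids)[:, -1, :]            # scores for the next token only
        topk_logits, topk_ids = torch.topk(logits, top_k, dim=-1)
        probs = torch.softmax(topk_logits, dim=-1)     # treat the top-k scores as probabilities
        next_id = topk_ids.gather(-1, torch.multinomial(probs, 1))
        token_ids = torch.cat([token_ids, next_id], dim=1)
        if eos_id is not None and next_id.item() == eos_id:
            break                                      # stop at the end-of-text token
    return token_ids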

Similarities and differences between GPT and GPT-2

GPT and GPT-2 have no major architectural changes; they differ mainly in scale and in the amount of data. Their similarities and differences are as follows:
1) The structure is basically the same: both adopt the language-model objective and use the Transformer's Decoder.
2) The differences are:
- GPT-2 is larger, with more layers.
- GPT-2 is trained on more data of more types, with better data filtering and quality control, which improves the generality of the model.
- GPT uses supervised learning for the different downstream tasks, modifying the input format and adding a fully connected layer; GPT-2 handles downstream tasks in an unsupervised way, without changing the parameters or the model for different downstream tasks (the so-called zero-shot setting). As shown in the figure below:
Differences in GPT for downstream tasks
Figure: (left) the Transformer architecture and training objectives; (right) input transformations for fine-tuning on different tasks.

So how does GPT adapt to downstream tasks? During fine-tuning, the main change for different downstream tasks is GPT's input format: the task data is first reformatted into a single sequence, fed through the Transformer model, and then a fully connected (Linear) layer is added on top of the model's output to match the format of the labels (see the sketch after this list). The details are as follows:
1) For classification problems, few changes are needed: just add a start and an end token;
2) For sentence-relationship problems such as entailment, just add a delimiter token between the two sentences;
3) For text-similarity problems, just build two inputs with the two sentences in both orders, which tells the model that the order of the sentences does not matter;
4) For multiple-choice questions, the context can be concatenated with each option to form one input per option.
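A sketch of these input transformations (the special-token strings are schematic placeholders; the GPT paper uses learned start, delimiter, and extract embeddings rather than literal text):

START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def classification_input(text):
    return f"{START} {text} {EXTRACT}"

def entailment_input(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(text1, text2):
    # both orders are fed to the model, and their representations are combined
    return (f"{START} {text1} {DELIM} {text2} {EXTRACT}",
            f"{START} {text2} {DELIM} {text1} {EXTRACT}")

def multiple_choice_inputs(context, options):
    # one input per option; each is scored separately by the added Linear layer
    return [f"{START} {context} {DELIM} {option} {EXTRACT}" for option in options]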

Similarities and differences between GPT and ELMo

  1. The model architectures differ: ELMo uses a shallow bidirectional LSTM, while GPT uses a multi-layer Transformer decoder.
  2. The downstream tasks are handled differently: ELMo adds word embeddings to specific tasks as an additional feature; GPT fine-tunes the same base model for all tasks.

Similarities and differences between GPT and BERT

  1. Pre-training: GPT is pre-trained in the same way as a traditional language model, predicting the next word from the preceding text; BERT is pre-trained with the masked LM, which predicts a word from both its left and right context. For example, given a sentence $u_1, u_2, \dots, u_n$, GPT predicts the word $u_i$ using only the information in $u_1, u_2, \dots, u_{i-1}$, whereas BERT uses the information in $u_1, \dots, u_{i-1}, u_{i+1}, \dots, u_n$.
  2. Model effect: because GPT uses a traditional language model, it is better suited to natural language generation tasks (NLG), which usually generate the next piece of information from the current information; BERT is better suited to natural language understanding tasks (NLU).
  3. Model structure: GPT uses the Transformer's Decoder, while BERT uses the Transformer's Encoder. GPT applies the masked multi-head attention in the Decoder: when predicting the word $u_i$ from $u_1, u_2, \dots, u_{i-1}$, the tokens from $u_i$ onward are masked out.

References

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  2. Hugging face BERT
  3. What is BERT?
  4. https://ekbanaml.github.io/nlp/BERT/
  5. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
  6. Understanding BERT architecture
  7. Graphical BERT Model: Building BERT from Scratch
  8. Detailed version of BERT code line by line
  9. The Illustrated GPT-2
