Self-Study Notes on Large Language Models: The Difference Between BERT and GPT

The Difference Between BERT and GPT

Origins


In 2018, Google released BERT (Bidirectional Encoder Representations from Transformers). The model is pre-trained on a large unlabeled text corpus with self-supervised objectives and can then be fine-tuned on labeled data for downstream tasks. The goal of BERT is to create a language model that understands the context and meaning of a word in a sentence, taking into account the words that appear both before and after it.

In 2018, OpenAI first launched GPT (Generative Pre-trained Transformer). Like BERT, GPT is also a large-scale pre-trained language model. However, GPT is a generative model that is capable of generating text on its own. The goal of GPT is to create a language model that can generate coherent and contextually appropriate text.


The Differences

BERT and GPT are two different pre-trained language models, and they have some notable differences in principle and application.

Target task:

  • BERT: BERT is a Transformer-based pre-training model whose goal is to learn context-dependent word representations through bidirectional language-model pre-training. During pre-training, BERT is trained on two tasks: masked language modeling (MLM) and next sentence prediction (NSP).

    • Masked language modeling (MLM): In the input sequence, BERT randomly masks some words and then asks the model to predict the masked words. Through this task, BERT learns to predict missing words from their context, which lets it capture the semantics of words and their surrounding information. Specifically, for each input sequence, BERT randomly selects about 15% of the tokens for prediction. Each selected token is handled in one of three ways (a minimal code sketch of this masking rule appears after this list):

      • 80% of the time, the selected word is replaced with the special mask token [MASK]. For example, "this movie is great" becomes "this movie is [MASK]".

      • 10% of the time, the selected word is replaced with a random word from the vocabulary. The model therefore has to do more than read the context: it must also detect implausible words and infer the intended meaning. For example, "this movie is great" might become "this movie is drink".

      • 10% of the time, the selected word is left unchanged. This keeps the model from relying only on the [MASK] token and keeps its representations anchored to real, observed words. For example, "this movie is great" stays "this movie is great".

      • Next, BERT feeds the processed input sequence into the model and encodes it with the Transformer encoder. During encoding, each masked position can attend to the context on both sides. Finally, the model outputs a prediction for each masked word.

    • Next sentence prediction (NSP): In some natural language processing tasks, it is important to understand the relationship between sentences. To make the model learn sentence-level relationships, BERT adds the NSP task: the model must judge whether two sentences are continuous, that is, whether one sentence is the next sentence of the other. Through this task, BERT learns sentence-level semantic relationships and explicitly models the logical relationship between a pair of texts. Specifically, it works as follows:

      • For each training sample, BERT selects two sentences A and B. In 50% of cases, sentence B really is the next sentence after sentence A; in the other 50%, sentence B is a random sentence drawn from elsewhere in the corpus.

      • To perform the NSP task, BERT uses a special input encoding. For each input sequence, BERT inserts the separator token [SEP] after sentence A and after sentence B, and adds the classification token [CLS] at the beginning of the input:

        [CLS] Sentence A [SEP] Sentence B [SEP]

      • Next, BERT feeds this encoded sequence into the model and encodes it using the Transformer's encoder structure. The encoder will learn the representation of sentence A and sentence B according to the context information.

      • During encoding, the model takes the entire sequence as input and makes its prediction from the special [CLS] token. The prediction is a binary classification: are sentence A and sentence B consecutive? Typically a fully connected layer maps the hidden state of [CLS] to the two classes (for example via a two-way softmax, or equivalently a sigmoid over a single logit).

  • GPT: GPT is a Transformer-based generative pre-training model whose goal is to learn to produce coherent text through autoregressive language-model pre-training. During pre-training, GPT consumes large-scale text data and generates the next word step by step: the model predicts each word from the text that precedes it, and the parameters are optimized by maximum likelihood estimation. This is what allows GPT to generate coherent, logical text. The GPT pipeline is roughly as follows:

    • Tokenization: GPT first splits the text into words or subwords. Two approaches are common:

      • Word-based Tokenization: This method divides the text into independent word units. For example, for the sentence "I love natural language processing", word-based tokenization divides it into ["I", "love", "natural", "language", "processing"].
      • Subword-based Tokenization: This method splits the text into smaller subword units. It can handle the internal structure and complexity of words and copes better with out-of-vocabulary and rare words; GPT itself uses byte-pair encoding (BPE), a subword method. For example, a subword tokenizer might split "I love natural language processing" into ["I", "love", "nat", "ural", "language", "pro", "cess", "ing"].
    • Whichever method is used, the goal is the same: segment the text into discrete tokens, each corresponding to a word or subword.

    • Word embedding: each token is encoded as a vector. That is, every word or subword is mapped to an embedding, a continuous real-valued vector that represents its position in semantic space. In GPT these embeddings are learned jointly with the rest of the model during pre-training; earlier NLP pipelines often used standalone embedding models such as Word2Vec, GloVe, or FastText to map words to fixed-dimensional vectors.

    • Transformer architecture: GPT uses the Transformer as its backbone. The Transformer's core mechanism is self-attention, which captures global dependencies in a sequence while allowing parallel computation.

    • Autoregressive language model: During pre-training, GPT uses an autoregressive language model for training. Specifically, the model generates coherent text by progressively generating the next word. When generating the i-th word, the model predicts the next word using the previous i-1 words already generated as a context.

    • Learning the pre-training parameters: In the autoregressive language model, GPT's objective is to maximize the likelihood of the real training samples; the parameters are optimized by maximum likelihood estimation. With large-scale pre-training data and an iterative optimization process, GPT learns the statistical regularities and structure of language, which is what lets it generate coherent, logical text.

    • Generating text: After pre-training is complete, GPT can generate text. Given an initial text or seed sentence, the model progressively generates the next word, adds it to the generated text, and then uses the generated text as a context to predict the next word. By repeating this process, the model can generate coherent, logical text.
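
As promised in the MLM bullet above, here is a minimal sketch of BERT's 80%/10%/10% corruption rule, written as a standalone Python function; the function name, the '<mask>' string, and the vocab argument are illustrative rather than taken from any particular library.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Sketch of BERT-style MLM corruption on a list of string tokens."""
    corrupted = list(tokens)
    pred_positions, labels = [], []
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:      # select ~15% of positions
            pred_positions.append(i)
            labels.append(token)             # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = '<mask>'      # 80%: replace with the mask token
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random word
            # else: 10% of the time the token is left unchanged
    return corrupted, pred_positions, labels

print(mask_tokens(['this', 'movie', 'is', 'great'],
                  vocab=['drink', 'apple', 'great', 'movie']))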

Training method:

BERT: BERT uses a bidirectional language-model training strategy. In the input sequence, BERT randomly masks some words and asks the model to predict them, which lets BERT learn the semantics of words from the context on both sides.
GPT: GPT is trained as an autoregressive language model. It learns to generate text by predicting the word at the current position from the words that precede it; during pre-training, the parameters are optimized to maximize the probability of each next word (a minimal sketch of this objective follows).
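
Concretely, the autoregressive objective is just cross-entropy on a shifted copy of the input. The sketch below assumes model maps a (batch, seq_len) tensor of token IDs to (batch, seq_len, vocab_size) logits, as the GPT-2 wrapper later in this post does; it is an illustration, not a full training loop.

import torch
import torch.nn.functional as F

def next_token_loss(model, input_ids):
    """Cross-entropy of predicting token t from tokens < t."""
    logits = model(input_ids)              # (batch, seq_len, vocab_size)
    shift_logits = logits[:, :-1, :]       # predictions for positions 1..T-1
    shift_labels = input_ids[:, 1:]        # the "next word" targets
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))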

Context understanding ability:
BERT and GPT are both pre-trained models based on the Transformer architecture, but they differ in context understanding ability and in their typical application domains.

  • BERT: Because BERT is a bidirectional model, trained to predict masked words and to judge sentence pairs, it can draw on context from both directions and therefore has strong context-understanding ability. This makes BERT excel at understanding-oriented tasks such as named entity recognition and question answering.
  • GPT: GPT is a unidirectional model: it can only rely on the previously generated text to predict the next word. During pre-training, GPT is trained as an autoregressive language model that learns to produce coherent text by generating one word at a time. Because of this unidirectional design, GPT performs best on generative tasks such as dialogue generation and text generation; it produces contextually coherent, logical text because each word is conditioned on everything generated before it.

Downstream task suitability:

  • BERT: Thanks to its strong context understanding and its bidirectional design, BERT performs well on a wide range of downstream tasks, such as text classification, named entity recognition, and semantic relation classification.
  • GPT: GPT is mainly used for generative tasks, such as dialogue generation, text generation, and machine translation. It produces natural, fluent text, but is weaker on tasks that require tight alignment between input and output. GPT follows a two-stage recipe:
    • In the first stage, a language-modeling objective on unlabeled data is used to learn the initial parameters of the network.
    • In the second stage, the pre-trained generative model is adapted to a specific downstream task by fine-tuning: task-specific layers or structures are added, and the labeled task data is used to further tune the model to that task's requirements (see the sketch after this list).
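
To illustrate that fine-tuning pattern, here is a minimal sketch using BERT and text classification as the example, built with the Hugging Face transformers library; the class name and the two-label setup are illustrative. The hidden state of [CLS] is fed to a newly added linear layer, and the whole model is then trained on labeled data.

import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    """Pre-trained encoder plus a task-specific classification head."""
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0, :]   # hidden state of [CLS]
        return self.classifier(cls_repr)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
batch = tokenizer(["this movie is great"], return_tensors='pt')
logits = BertClassifier()(batch['input_ids'], batch['attention_mask'])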

Overall, there are differences between BERT and GPT in target tasks, training methods, context understanding ability and applicability. BERT is suitable for various downstream tasks, while GPT is mainly used for generative tasks. Which model to choose depends on the specific task requirements and application scenarios.

Code

BERT implementation. The code below follows the walkthrough in Dive into Deep Learning (d2l).

import torch
from torch import nn
from d2l import torch as d2l

#@save
def get_tokens_and_segments(tokens_a, tokens_b=None):
    """获取输入序列的词元及其片段索引"""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # 0 and 1 mark segment A and segment B, respectively
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments

#@save
class BERTEncoder(nn.Module):
    """BERT编码器"""
    def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input,
                 ffn_num_hiddens, num_heads, num_layers, dropout,
                 max_len=1000, key_size=768, query_size=768, value_size=768,
                 **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module(f"{
      
      i}", d2l.EncoderBlock(
                key_size, query_size, value_size, num_hiddens, norm_shape,
                ffn_num_input, ffn_num_hiddens, num_heads, dropout, True))
        # 在BERT中,位置嵌入是可学习的,因此我们创建一个足够长的位置嵌入参数
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len,
                                                      num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # In the following code, the shape of X stays (batch size, max sequence length, num_hiddens)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding.data[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X


#@save
class MaskLM(nn.Module):
    """BERT的掩蔽语言模型任务"""
    def __init__(self, vocab_size, num_hiddens, num_inputs=768, **kwargs):
        super(MaskLM, self).__init__(**kwargs)
        self.mlp = nn.Sequential(nn.Linear(num_inputs, num_hiddens),
                                 nn.ReLU(),
                                 nn.LayerNorm(num_hiddens),
                                 nn.Linear(num_hiddens, vocab_size))

    def forward(self, X, pred_positions):
        num_pred_positions = pred_positions.shape[1]
        pred_positions = pred_positions.reshape(-1)
        batch_size = X.shape[0]
        batch_idx = torch.arange(0, batch_size)
        # Suppose batch_size = 2 and num_pred_positions = 3;
        # then batch_idx is torch.tensor([0, 0, 0, 1, 1, 1])
        batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions)
        masked_X = X[batch_idx, pred_positions]
        masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
        mlm_Y_hat = self.mlp(masked_X)
        return mlm_Y_hat

#@save
class NextSentencePred(nn.Module):
    """BERT的下一句预测任务"""
    def __init__(self, num_inputs, **kwargs):
        super(NextSentencePred, self).__init__(**kwargs)
        self.output = nn.Linear(num_inputs, 2)

    def forward(self, X):
        # Shape of X: (batch_size, num_hiddens)
        return self.output(X)


#@save
class BERTModel(nn.Module):
    """BERT模型"""
    def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input,
                 ffn_num_hiddens, num_heads, num_layers, dropout,
                 max_len=1000, key_size=768, query_size=768, value_size=768,
                 hid_in_features=768, mlm_in_features=768,
                 nsp_in_features=768):
        super(BERTModel, self).__init__()
        self.encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape,
                    ffn_num_input, ffn_num_hiddens, num_heads, num_layers,
                    dropout, max_len=max_len, key_size=key_size,
                    query_size=query_size, value_size=value_size)
        self.hidden = nn.Sequential(nn.Linear(hid_in_features, num_hiddens),
                                    nn.Tanh())
        self.mlm = MaskLM(vocab_size, num_hiddens, mlm_in_features)
        self.nsp = NextSentencePred(nsp_in_features)

    def forward(self, tokens, segments, valid_lens=None,
                pred_positions=None):
        encoded_X = self.encoder(tokens, segments, valid_lens)
        if pred_positions is not None:
            mlm_Y_hat = self.mlm(encoded_X, pred_positions)
        else:
            mlm_Y_hat = None
        # Hidden layer of the MLP classifier for next sentence prediction;
        # 0 is the index of the '<cls>' token
        nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :]))
        return encoded_X, mlm_Y_hat, nsp_Y_hat
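
A quick sanity check of the model defined above; this is a sketch that assumes the version of d2l whose EncoderBlock matches the call inside BERTEncoder, and the hyperparameters are small illustrative values.

vocab_size, num_hiddens = 10000, 768
net = BERTModel(vocab_size, num_hiddens, norm_shape=[768], ffn_num_input=768,
                ffn_num_hiddens=1024, num_heads=4, num_layers=2, dropout=0.2)
tokens = torch.randint(0, vocab_size, (2, 8))        # two sequences of 8 token IDs
segments = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1],
                         [0, 0, 0, 1, 1, 1, 1, 1]])  # segment A / segment B labels
mlm_positions = torch.tensor([[1, 5, 2], [6, 1, 5]]) # positions to predict (MLM)
encoded_X, mlm_Y_hat, nsp_Y_hat = net(tokens, segments, None, mlm_positions)
# Expected shapes: (2, 8, 768), (2, 3, 10000), (2, 2)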

GPT code

GPT block = masked multi-head attention + feed-forward network + layer normalization + residual connections

Because the decoder is the part of the Transformer that generates text, GPT, which focuses on generative tasks, adopts the Transformer decoder as its core architecture.
However, since there is no encoder, the encoder-decoder cross-attention is dropped: each block keeps only the masked multi-head attention layer and the feed-forward layer, wrapped with layer normalization and residual connections (GPT-2 places the layer norm before each sub-layer, i.e. pre-LN). A minimal sketch of such a block follows.
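
The sketch below is a hypothetical pre-LN GPT-style block in PyTorch, not the exact GPT-2 implementation; the class name and defaults are illustrative.

import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    """Pre-LN decoder block: masked self-attention + feed-forward, each with a residual."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model), nn.Dropout(dropout))

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i
        seq_len = x.shape[1]
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                  # residual around attention
        x = x + self.ffn(self.ln2(x))     # residual around feed-forward
        return x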

Here, GPT-2 is used as the example.

import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2Model, GPT2Tokenizer

class GPT2(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff, vocab_size, max_sequence_length):
        super(GPT2, self).__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        # Build the GPT-2 Transformer body from a config with the given
        # hyperparameters (randomly initialized; no pretrained weights loaded).
        # GPT2Model already contains its own token and position embeddings.
        config = GPT2Config(
            vocab_size=vocab_size,
            n_positions=max_sequence_length,
            n_embd=d_model,
            n_layer=num_layers,
            n_head=num_heads,
            n_inner=d_ff,
        )
        self.transformer = GPT2Model(config)
        # Language-modeling head: project hidden states back onto the vocabulary
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        outputs = self.transformer(input_ids=input_ids)
        logits = self.fc(outputs.last_hidden_state)
        return logits

# GPT-2 hyperparameters
num_layers = 12
d_model = 768
num_heads = 12
d_ff = 3072
vocab_size = 50257
max_sequence_length = 1024

# Create GPT-2 model instance
model = GPT2(num_layers, d_model, num_heads, d_ff, vocab_size, max_sequence_length)

# Example input
input_ids = torch.tensor([[31, 51, 99, 18, 42, 62]])  # Input token IDs

# Forward pass
logits = model(input_ids)

The above code is a simplified example built with PyTorch and Hugging Face's transformers library. GPT2Model is the main Transformer body of GPT-2; here it is constructed from a GPT2Config with GPT-2-sized hyperparameters and randomly initialized (to reuse OpenAI's released weights, one would instead call GPT2Model.from_pretrained('gpt2')). GPT2Tokenizer converts raw text into the token IDs the model expects.
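
A short, hypothetical usage sketch follows. Because the Transformer body above is built from a config rather than loaded from pretrained weights, the predicted token is meaningless; the example only shows how the tokenizer, the forward pass, and the next-token read-out fit together.

text = "The movie was"
input_ids = model.tokenizer(text, return_tensors='pt')['input_ids']
logits = model(input_ids)                      # shape: (1, seq_len, vocab_size)
next_token_id = logits[0, -1].argmax().item()  # greedy pick of the next token
print(model.tokenizer.decode([next_token_id]))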


To be continued!

Origin: blog.csdn.net/qq_38915354/article/details/131054219