BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Author: Zen and the Art of Computer Programming

1. Introduction

In recent years, large-scale pre-trained models have been applied widely in natural language processing, and BERT (Bidirectional Encoder Representations from Transformers), proposed by Google in 2018, is widely regarded as one of the best methods available. Its core idea is to represent the input text sequence with a stack of Transformer encoder layers, so that through pre-training the model captures rich contextual information. The article also covers two strategies for adapting the pre-trained model, task-specific distillation and unsupervised domain adaptation, to further improve performance. In this article, the author analyzes and describes the BERT model in detail and provides example code and extensions. I hope readers can gain useful knowledge from it and apply it in practical work.

2. Core concepts

Transformer

A sequence-to-sequence model can be divided into two parts: an encoder and a decoder. The encoder transforms the input sequence into fixed-dimensional vector representations, while the decoder generates or infers the output sequence from those encoded vectors. To achieve more powerful representations, the Google team proposed a new attention-based architecture, the Transformer. Its core idea is that, when computing the representation of each token, the model considers not only that token in isolation but also its relationship to every other position in the sentence or document.

Each Transformer encoder block contains two sub-layers: the first is a multi-head self-attention layer, and the second is a position-wise feed-forward network. This structure plays the same role as the encoder in a standard Seq2seq model, but unlike an RNN or CNN it lets the model attend directly to the entire input sequence at once, which gives the Transformer a significant advantage.
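
To make the two sub-layers concrete, here is a minimal PyTorch sketch of one encoder block (sizes chosen to match BERT-base; this is an illustration, not the actual Hugging Face implementation):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One Transformer encoder block: multi-head self-attention followed by a
    # position-wise feed-forward network, each with a residual connection and layer norm.
    def __init__(self, d_model=768, num_heads=12, ff_dim=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout)
        self.ff = nn.Sequential(nn.Linear(d_model, ff_dim), nn.GELU(), nn.Linear(ff_dim, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                      # x: (seq_len, batch, d_model)
        attn_out, _ = self.attn(x, x, x)       # every position attends to every other position
        x = self.norm1(x + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ff(x)))
        return x

hidden = torch.randn(8, 2, 768)                # 8 tokens, batch of 2
print(EncoderBlock()(hidden).shape)            # torch.Size([8, 2, 768])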

BERT

The BERT model is a pre-trained, Transformer-based language understanding model that can be used for a variety of natural language processing tasks, including text classification, sentiment analysis, and named entity recognition. Its main features are:

  1. Based on the Transformer: BERT stacks multiple Transformer encoder layers and is released in several sizes, most notably BERT-base (12 layers) and BERT-large (24 layers); the deeper, wider configurations achieve the best results (see the configuration sketch after this list).

  2. Pre-training with a masked language model: BERT is pre-trained on the masked language modeling (MLM) task, i.e., a small portion of the input tokens is randomly masked and the model must recover them from context. This forces the model to use bidirectional context and improves the robustness of its representations.

  3. Two adaptation strategies are discussed below: task-specific distillation, which distills the outputs of the fine-tuned model into a smaller student model, and unsupervised domain adaptation, which transfers the model to unlabeled data from a new domain.
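
As a quick illustration of point 1, the released checkpoints can be compared by their configurations (a small sketch using the Hugging Face transformers library that is installed later in this article):

from transformers import BertConfig

# Compare the released BERT sizes by depth, hidden width, and attention heads
for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = BertConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)
# bert-base-uncased:  12 layers, hidden size 768,  12 heads (~110M parameters)
# bert-large-uncased: 24 layers, hidden size 1024, 16 heads (~340M parameters)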

3. Core algorithm principles and specific operation steps

1. Pre-training

Data preparation

BERT uses the English Wikipedia corpus as its main training data, approximately 2.5 billion words (roughly 3.34 GB of raw text); the original pre-training additionally includes the BooksCorpus, about 800 million words.
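
For small-scale experiments, a Wikipedia-derived corpus can be pulled with the Hugging Face datasets library; this is only a convenient stand-in for the full dump used in the original pre-training, and datasets is not part of the environment pinned later in this article:

# pip install datasets
from datasets import load_dataset
from transformers import BertTokenizer

# WikiText-103 is a small corpus extracted from Wikipedia articles
corpus = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(corpus[5]["text"][:200])
print("WordPiece tokens in this line:", len(tokenizer.tokenize(corpus[5]["text"])))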

Masked Language Modeling (MLM)

The MLM task masks some of the words in the input text and asks the model to predict the original tokens at the masked positions. For example, if the input is "the cat in the hat" and the words "cat" and "hat" are masked, the model must recover those two words from the surrounding context. This prediction task helps the model absorb more contextual information and thereby improves its performance.

BERT's MLM strategy is to mask words at random. First, a small fraction (usually 15%) of the input tokens is selected; of these, 80% are replaced by the special [MASK] symbol, 10% by a random token, and 10% are left unchanged. Next, the model tries to predict the original tokens at the selected positions. Finally, the model parameters are updated from the prediction loss, strengthening the model's language understanding ability.
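
A minimal sketch of this masking rule (the real pre-training pipeline applies it during data preparation; the helper function below simply follows the probabilities described above):

import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def mask_tokens(token_ids, mask_prob=0.15):
    # Select ~15% of non-special tokens; of those, 80% become [MASK], 10% become a
    # random token, and 10% stay unchanged. Unselected positions get label -100 so
    # they are ignored by the MLM loss.
    input_ids = list(token_ids)
    labels = [-100] * len(input_ids)
    special = set(tokenizer.all_special_ids)
    for i, tok in enumerate(input_ids):
        if tok in special or random.random() > mask_prob:
            continue
        labels[i] = tok
        r = random.random()
        if r < 0.8:
            input_ids[i] = tokenizer.mask_token_id
        elif r < 0.9:
            input_ids[i] = random.randrange(tokenizer.vocab_size)
    return input_ids, labels

ids = tokenizer.encode("the cat in the hat sat on the mat")
masked_ids, labels = mask_tokens(ids)
print(tokenizer.convert_ids_to_tokens(masked_ids))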

Next Sentence Prediction (NSP)

The Next Sentence Prediction task asks the model to decide whether the second of two text segments actually follows the first in the original document. In a question answering system, for example, ambiguity can arise if the model cannot tell whether two consecutive text fragments are connected. Suppose we want to decide whether the fragments "The quick brown fox jumps over the lazy dog." and "A fast brown dog runs away." belong together; the model must judge whether there is a clear connection between them.

BERT formulates Next Sentence Prediction as a binary classification task during pre-training. First, sentence pairs are built from the corpus: half of the time the second segment really follows the first, and half of the time it is replaced by a randomly chosen sentence. Then the model predicts which case it is looking at. Finally, the model parameters are adjusted through backpropagation, improving the model's grasp of inter-sentence relationships.
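
The NSP head ships with the pre-trained checkpoint, so the sentence pair from the example above can be scored directly (a sketch; class index 0 means "the second segment follows the first", index 1 means "it is a random sentence"):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sent_a = "The quick brown fox jumps over the lazy dog."
sent_b = "A fast brown dog runs away."

# encode_plus builds [CLS] sent_a [SEP] sent_b [SEP] with matching segment ids
encoding = tokenizer.encode_plus(sent_a, sent_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding)[0]
print(torch.softmax(logits, dim=-1))  # column 0: "is next", column 1: "not next"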

Pre-training Procedure

  1. Tokenization: use the WordPiece algorithm to split the input text into subword tokens (see the sketch after this list).

  2. Masking: Replace some words in the input text with [MASK] tags.

  3. Token and Segment Embeddings: map each token to a token embedding and a segment embedding. The token embedding carries the meaning of the word piece, while the segment embedding indicates which of the two input sentences the token belongs to.

  4. Positional Encoding: add a learned position embedding to every token embedding so that the model is aware of word order.

  5. Training: optimize the MLM and NSP objectives jointly over the corpus; the resulting pre-trained weights are later fine-tuned (and optionally distilled) for specific downstream tasks.
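
A short sketch of steps 1-3 with the transformers tokenizer (WordPiece splitting plus the inputs that feed the token and segment embeddings; position embeddings are added inside the model):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits rare words into subword pieces marked with "##"
print(tokenizer.tokenize("unaffable"))  # ['un', '##aff', '##able']

# encode_plus returns the ids for the token embeddings, the segment (token type) ids,
# and an attention mask that marks real tokens versus padding
enc = tokenizer.encode_plus("The cat sat.", "It purred.",
                            padding="max_length", max_length=16, truncation=True)
print(enc["input_ids"])
print(enc["token_type_ids"])
print(enc["attention_mask"])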

2. Fine-tuning

The fine-tuning strategies discussed here are of two types: task-specific distillation and unsupervised domain adaptation.

Task-Specific Distillation

In task-specific distillation, the original BERT model is first pre-trained to obtain general language understanding ability and then fine-tuned on a specific task; the outputs of this fine-tuned teacher are then distilled into a smaller student model to improve efficiency while preserving accuracy. For NLP tasks, distillation can be carried out on a benchmark such as the MNLI data set; the details are not repeated here.
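
The article does not spell out the distillation objective, but a common recipe (in the style of knowledge distillation as used by DistilBERT) blends the hard-label loss with a temperature-softened teacher term; the hyperparameters below are illustrative:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Cross-entropy on the true labels plus KL divergence toward the teacher's
    # softened output distribution; the T^2 factor keeps gradient scales comparable.
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1 - alpha) * soft

# Toy example: batch of 4 examples, 3 classes (e.g. MNLI labels)
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))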

Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) aims to use unlabeled cross-domain data to adapt the model to the data distribution of the target domain. Specifically, a BERT model pre-trained on source-domain text is further trained on unlabeled data from the target domain, and is then evaluated on target-domain data to verify its generalization ability.
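
One common reading of this step is to continue masked-language-model training on unlabeled target-domain text before fine-tuning. A minimal sketch with the transformers APIs used elsewhere in this article; the file path is hypothetical and would hold one target-domain document per line:

from transformers import (BertForMaskedLM, BertTokenizer, DataCollatorForLanguageModeling,
                          TextDataset, Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Unlabeled target-domain text (hypothetical path)
dataset = TextDataset(tokenizer=tokenizer, file_path="target_domain.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-domain-adapted", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()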

4. Specific code implementation

Installation Environment

!pip install transformers==3.0.2 torch==1.6.0 torchvision==0.7.0 tensorboardX

Load pre-trained model

from transformers import BertTokenizer, BertModel, AdamW, get_linear_schedule_with_warmup

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Segment the input text and build input tensors

text = "Hello, my dog is cute."
marked_text = "[CLS] " + text + " [SEP]"
tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1]*len(tokenized_text)

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

Use the model to make predictions

outputs = model(tokens_tensor, token_type_ids=segments_tensors)
last_hidden_states = outputs[0]
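
For bert-base-uncased, last_hidden_states has shape (1, sequence_length, 768): one 768-dimensional contextual vector per input token. The vector at position 0, corresponding to the [CLS] token, is commonly taken as a sentence-level representation for downstream classification.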

Sample complete code

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load pre-trained tokenizer and the masked-language-model head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Example input sentence; mask the word we want the model to predict
text = "Hello, my dog is cute."
marked_text = "[CLS] " + text.replace("cute", "[MASK]") + " [SEP]"
tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0] * len(tokenized_text)
mask_index = tokenized_text.index('[MASK]')

# Pad token ids, segment ids, and attention mask to a fixed length
max_length = 16
pad_length = max_length - len(indexed_tokens)
input_ids = indexed_tokens + [0] * pad_length
segments_ids = segments_ids + [0] * pad_length
attn_masks = [1] * len(indexed_tokens) + [0] * pad_length
tokens_tensor = torch.tensor([input_ids])
segments_tensors = torch.tensor([segments_ids])
attn_mask_tensors = torch.tensor([attn_masks])

# Run inference: the first output holds one vocabulary-sized score vector per position
with torch.no_grad():
    scores = model(tokens_tensor, token_type_ids=segments_tensors,
                   attention_mask=attn_mask_tensors)[0].squeeze(0)
predicted_index = int(torch.argmax(scores[mask_index]))
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print("Predicted token:", predicted_token)

# Check that a plausible token ("cute") outscores an implausible one ("not")
assert scores[mask_index][tokenizer.vocab['cute']] > scores[mask_index][tokenizer.vocab['not']]

5. Future development and challenges

Existing research on BERT has made major breakthroughs, and BERT has achieved state-of-the-art results on many natural language processing tasks. Nevertheless, many problems remain to be solved, for example:

  1. Excessive memory usage caused by large model capacity. BERT's size grows with the number of layers and the hidden dimension, and although several methods exist to compress the model, they do not yet fully solve the problem (a simple compression sketch follows this list).

  2. Low training efficiency. BERT is built on a stack of Transformer encoder layers, which is expensive in both computation and parameter count, and pre-training often takes days or even weeks. Reducing BERT's training time is therefore an important research direction.

  3. Richer model architectures. Besides BERT-base and BERT-large, there are other BERT-style architectures: the encoder can be swapped for other designs, such as the Transformer-XL used in XLNet, allowing more flexible architecture design, and BERT-based models can be modified further, for example by adding attention mechanisms that capture global features.
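
As one concrete (and deliberately simple) illustration of the compression problem in point 1, post-training dynamic quantization in PyTorch shrinks the linear layers to int8 with a single call; this is a sketch, not a full compression study:

import os
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Dynamic quantization stores Linear weights as int8 and dequantizes on the fly
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

def size_mb(m):
    # Serialize the weights to measure on-disk size
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print("fp32: %.0f MB, int8: %.0f MB" % (size_mb(model), size_mb(quantized)))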
