Large Language Models, Part 1: BERT

1. Introduction

2017 was a historic year in machine learning: the Transformer model first hit the scene. It performed strongly on many benchmarks and proved applicable to a wide range of problems in data science. Thanks to its efficient architecture, many other Transformer-based models were later developed, each focused more narrowly on specific tasks.

One of these models is BERT. It is best known for its ability to build embeddings that represent textual information accurately and capture the semantic meaning of long text sequences. As a result, BERT embeddings are widely used in machine learning, and understanding how BERT builds text representations is crucial, since it opens the door to a large number of NLP tasks.

In this article, we will refer to the original BERT paper, look at the BERT architecture, and understand the core mechanisms behind it. First, we will provide a high-level overview of BERT. After that, we will gradually dive into its internal workflow and how information is passed through the model. Finally, we will learn how BERT can be fine-tuned to solve specific problems in NLP.

2. High-level overview

The Transformer architecture consists of two main parts: encoders and decoders. The goal of the stacked encoders is to construct meaningful embeddings of the input that preserve its main context. The output of the last encoder is passed as input to all decoders, which try to generate new information.

BERT is a successor of the Transformer and inherits its stacked bidirectional encoders. Most of the architectural principles in BERT are the same as in the original Transformer.

Transformer architecture

3. BERT versions

BERT comes in two main versions: BERT base and BERT large. Their architecture is exactly the same, except that they use different numbers of parameters: roughly 110M for BERT base versus 340M for BERT large. Overall, BERT large has about 3.09 times more parameters to tune than BERT base.

Comparison of BERT base and BERT large
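
As a quick sanity check, the parameter counts of the two versions can be reproduced with the Hugging Face transformers library (an assumption of this article, not something used in the paper); the configuration values below are the ones reported for BERT base and BERT large.

    # Sketch: counting parameters of randomly initialized BERT base and BERT large.
    from transformers import BertConfig, BertModel

    def num_params(model):
        return sum(p.numel() for p in model.parameters())

    base = BertModel(BertConfig())  # defaults: 12 layers, hidden size 768, 12 attention heads
    large = BertModel(BertConfig(num_hidden_layers=24, hidden_size=1024,
                                 num_attention_heads=16, intermediate_size=4096))

    print(f"BERT base:  ~{num_params(base) / 1e6:.0f}M parameters")   # close to the reported 110M
    print(f"BERT large: ~{num_params(large) / 1e6:.0f}M parameters")  # close to the reported 340M
    print(f"ratio: {num_params(large) / num_params(base):.2f}")       # roughly 3x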

4. Bidirectional representation

As the letter "B" in its name suggests, BERT is a bidirectional model: information is passed in both directions (left to right and right to left), so the model can better capture connections between words. Obviously, this requires more training resources than unidirectional models, but at the same time it leads to better prediction accuracy.

For a better understanding, we can compare the BERT architecture with other popular NLP models.

Comparison of the BERT, OpenAI GPT and ELMo architectures from the original paper. Adopted by the author.

5. Input Tokenization

Note: in the official paper, the authors use the term "sentence" to refer to the text passed to the input. To avoid confusion, we will use the term "sequence" in this series of articles, since "sentence" usually means a single phrase ending with a full stop, and since many other NLP research papers use "sequence" in similar situations.

Before delving into how to train BERT, it is necessary to understand the format in which it accepts data. For input, BERT takes a single sequence or a pair of sequences. Each sequence is split into tokens. Additionally, two special tokens will be passed to the input:

  • [CLS] — passed before the first sequence to indicate its start. At the same time, [CLS] is used for the classification objective during training (discussed in the following sections).
  • [SEP] — passed between sequences to indicate the end of the first sequence and the start of the second.

Passing two sequences allows BERT to handle a variety of tasks where the input contains a pair of sequences (such as question and answer, hypothesis and premise, etc.).
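
As an illustration, here is how a pair of sequences could be tokenized with the Hugging Face tokenizer for the bert-base-uncased checkpoint (the library and the example sentences are assumptions of this article, not part of the paper; note that the tokenizer also appends a closing [SEP] after the second sequence):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encoding = tokenizer("Where is the cat?", "The cat is on the mat.")

    print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
    # ['[CLS]', 'where', 'is', 'the', 'cat', '?', '[SEP]',
    #  'the', 'cat', 'is', 'on', 'the', 'mat', '.', '[SEP]']
    print(encoding["token_type_ids"])  # 0 for first-sequence tokens, 1 for second-sequence tokens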

6. Input embedding

After tokenization, an embedding is built for each token. To make the input embeddings more representative, BERT constructs three types of embeddings for each token:

  • Token embedding captures the semantic meaning of a token.
  • A segment embedding has one of two possible values and indicates which sequence a token belongs to.
  • Positional embeddings contain information about the relative positions of tokens in a sequence.

Input processing

These embeddings are summed, and the result is passed to the first encoder of the BERT model.
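
A small sketch of this step, assuming the internals of the Hugging Face BertModel implementation (the attribute names below belong to that library, not to the paper):

    import torch
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")
    emb = model.embeddings
    print(emb.word_embeddings)        # 30522 x 768 table - token embeddings
    print(emb.token_type_embeddings)  # 2 x 768 table     - segment embeddings
    print(emb.position_embeddings)    # 512 x 768 table   - positional embeddings

    token_ids = torch.tensor([[101, 7592, 102]])               # [CLS] hello [SEP]
    segment_ids = torch.zeros_like(token_ids)                  # all tokens belong to the first sequence
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # 0, 1, 2
    summed = (emb.word_embeddings(token_ids)
              + emb.token_type_embeddings(segment_ids)
              + emb.position_embeddings(positions))            # (1, 3, 768); LayerNorm and dropout follow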

7. Output

Each encoder takes n embeddings as input and outputs the same number of processed embeddings of the same dimensionality. Ultimately, the output of the whole BERT model also contains n embeddings, each corresponding to its initial token.
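
This can be verified directly with a pre-trained model (a small sketch assuming the Hugging Face BertModel API):

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("BERT outputs one embedding per token.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    print(inputs["input_ids"].shape)        # (1, n) - token ids, including [CLS] and [SEP]
    print(outputs.last_hidden_state.shape)  # (1, n, 768) - one 768-dimensional embedding per token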

8. Training

        BERT training consists of two stages:

  1. Pre-training. BERT is trained on unlabeled pairs of sequences on two prediction tasks: masked language modeling (MLM) and natural language inference (NLI, referred to as next sentence prediction in the paper). For each pair of sequences, the model makes predictions for both tasks and performs backpropagation on the resulting loss to update its weights.
  2. Fine-tuning. BERT is initialized with the pre-trained weights and then optimized for a specific problem on labeled data.

9. Pre-training

        Compared to fine-tuning, pre-training usually takes a significant proportion of the time because the model is trained on a large corpus of data. This is why many online repositories of pre-trained models exist, which can then be fine-tuned to solve specific tasks relatively quickly.

We will look in detail at two problems BERT solves during pre-training.

9.1 Masked language modeling

The authors propose training BERT by masking a certain proportion of the tokens in the initial text and predicting them. This teaches BERT to build resilient embeddings that use the surrounding context to guess a given word, which also results in appropriate embeddings for missing words. The process works as follows (a toy sketch of the masking rule is shown after the list):

  1. After tokenization, 15% of the tokens are randomly selected for masking. The selected tokens will then be predicted at the end of the iteration.
  2. The selected tokens are replaced in one of three ways:
    - 80% of the tokens are replaced by the [MASK] token.
    Example: I bought a book → I bought a [MASK]
    - 10% of the tokens are replaced by a random token.
    Example: He is eating a fruit → He is smoking a fruit
    - 10% of the tokens remain unchanged.
    Example: House near me → House near me
  3. All tokens are passed to the BERT model, which outputs embeddings for each token it receives as input.

  4. The output embeddings corresponding to the tokens selected in step 2 are independently used to predict the masked tokens. The result of each prediction is a probability distribution over the vocabulary.
  5. The cross-entropy loss is calculated by comparing each probability distribution with the true masked token.
  6. The model weights are updated through backpropagation.
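
The toy sketch below illustrates the selection and replacement rule from steps 1-2. It is a simplified illustration, not the authors' implementation; in practice the random replacement is drawn from the full WordPiece vocabulary.

    import random

    def corrupt_for_mlm(tokens, vocab, mask_token="[MASK]", seed=0):
        rng = random.Random(seed)
        corrupted = list(tokens)
        labels = [None] * len(tokens)        # prediction targets; None = not selected
        for i, token in enumerate(tokens):
            if rng.random() < 0.15:          # step 1: select ~15% of the tokens
                labels[i] = token            # the original token is what BERT must predict
                p = rng.random()
                if p < 0.8:                  # 80%: replace with [MASK]
                    corrupted[i] = mask_token
                elif p < 0.9:                # 10%: replace with a random token
                    corrupted[i] = rng.choice(vocab)
                # remaining 10%: keep the token unchanged
        return corrupted, labels

    print(corrupt_for_mlm(["i", "bought", "a", "book"], vocab=["house", "fruit", "eat", "mat"]))

The cross-entropy loss from steps 4-5 is then computed only at the positions whose label is not None.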

9.2 Natural language inference

For this classification task, BERT tries to predict whether the second sequence follows the first. The whole prediction is made using only the final hidden state embedding of the [CLS] token, which should contain aggregated information from both sequences.

Similar to MLM, the constructed probability distribution (binary in this case) is used to calculate the model's loss and update the model's weights through backpropagation.

For NLI, the authors recommend composing the training set so that in 50% of the sequence pairs the second sequence follows the first in the corpus (positive pairs), while in the other 50% the two sequences are taken randomly from the corpus (negative pairs).
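
A hedged illustration of this objective, using the next-sentence-prediction head shipped with Hugging Face transformers (the example sentences are invented; label index 0 corresponds to "the second sequence follows the first"):

    import torch
    from transformers import BertTokenizer, BertForNextSentencePrediction

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

    encoding = tokenizer("He went to the store.", "He bought some milk.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits      # shape (1, 2): [is_next, is_random]

    print(torch.softmax(logits, dim=-1))       # probability that the pair is positive vs. negative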

BERT pre-training

9.3 Training details

According to the paper, BERT was pre-trained on BooksCorpus (800M words) and English Wikipedia (2,500M words). To extract long contiguous text, the authors used only the text passages from Wikipedia, ignoring tables, headers and lists.

BERT was trained for 1,000,000 steps with a batch size of 256 sequences, which is approximately 40 epochs over the 3.3-billion-word corpus. Each sequence contains at most 128 tokens (for 90% of the steps) or 512 tokens (for the remaining 10% of the steps).

According to the original paper, the training parameters are as follows (a rough optimizer sketch in PyTorch is shown after the list):

  • Optimizer: Adam (learning rate l = 1e-4, L₂ weight decay of 0.01, β₁ = 0.9, β₂ = 0.999, ε = 1e-6).
  • Learning rate warmup is performed over the first 10,000 steps, after which the learning rate is linearly decayed.
  • Dropout (p = 0.1) is applied on all layers.
  • Activation function: GELU.
  • The training loss is the sum of the mean MLM likelihood and the mean next sentence prediction likelihood.
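
The optimizer setup can be approximated in PyTorch as follows (a sketch under the assumption that AdamW with the linear warmup scheduler from Hugging Face transformers is an acceptable stand-in for the paper's Adam with L₂ weight decay):

    import torch
    from transformers import BertConfig, BertModel, get_linear_schedule_with_warmup

    model = BertModel(BertConfig())
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999),
                                  eps=1e-6, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=10_000,
                                                num_training_steps=1_000_000)
    # in the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()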

9.4 Fine-tuning

Once pre-training is complete, BERT can truly understand the semantic meaning of words and build embeddings that represent it almost fully. The goal of fine-tuning is to gradually modify the BERT weights to solve a specific downstream task.

10. Data format

Thanks to the robustness of the self-attention mechanism, BERT can easily be fine-tuned for specific downstream tasks. Another advantage of BERT is its ability to build bidirectional text representations, which gives a higher chance of discovering the correct relationship between two sequences when processing pairs. Previous approaches consisted of independently encoding both sequences and then applying bidirectional cross-attention to them; BERT unifies these two stages.

Depending on the problem, BERT accepts several input formats. The framework for solving all downstream tasks with BERT is the same: taking a sequence of text as input, BERT outputs a set of token embeddings, which are then fed to a task-specific model. Most of the time, not all of the output embeddings are used.

Let's take a look at common problems and ways to solve them by fine-tuning BERT.

Sentence pair classification

The goal of sentence pair classification is to understand the relationship between a given sequence pair. The most common task types are:

  • Natural language inference: determining whether the second sequence follows the first.
  • Similarity analysis: finding the degree of similarity between sequences.

Sentence pair classification

For fine-tuning, both sequences are passed to BERT. As a rule of thumb, the output embedding of the [CLS] token is then used for the classification task. According to the researchers, the [CLS] token should contain the main information about the relationship between the sentences.

Of course, other output embeddings can be used as well, but they are usually omitted in practice.
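
A minimal fine-tuning sketch for this setup, assuming the Hugging Face BertForSequenceClassification head (a linear classifier placed on top of the pooled [CLS] output; the sentences and the label are invented for illustration):

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    encoding = tokenizer("A man is playing a guitar.", "A person plays an instrument.",
                         return_tensors="pt")
    labels = torch.tensor([1])                  # e.g. 1 = "the second sequence follows / is similar"
    outputs = model(**encoding, labels=labels)  # cross-entropy loss over the [CLS]-based logits

    outputs.loss.backward()                     # one fine-tuning step (optimizer omitted for brevity)
    print(outputs.logits.shape)                 # (1, num_labels)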

Question answering

The goal of question answering is to find the answer in a text passage that corresponds to a particular question. Most of the time, the answer is given in the form of two numbers: the start and end token positions of the answer within the passage.

Question answering

For input, BERT takes the question and the passage and outputs a set of embeddings for them. Since the answer is contained in the passage, we are only interested in the output embeddings corresponding to the passage tokens.

To find the position of the start token of the answer in the passage, the scalar product between each output embedding and a special trainable vector Tstart is calculated. When the model and the vector Tstart are trained accordingly, this scalar product should, in most cases, be proportional to the likelihood that the corresponding token is actually the start token of the answer. To normalize the scalar products, they are then passed through the softmax function and can be viewed as probabilities. The token corresponding to the highest probability is predicted as the start token of the answer. The loss value is calculated based on the true probability distribution, and backpropagation is performed. A similar process is carried out with another trainable vector Tend to predict the end token.
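
A sketch of this workflow using the Hugging Face BertForQuestionAnswering head, whose single linear layer plays the role of the trainable vectors Tstart and Tend (the head is randomly initialized here, so the predicted span is arbitrary until the model is fine-tuned; the question and passage are invented):

    import torch
    from transformers import BertTokenizer, BertForQuestionAnswering

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

    question, passage = "Where is the cat?", "The cat is sleeping on the mat."
    inputs = tokenizer(question, passage, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    start = outputs.start_logits.argmax().item()   # most likely start position of the answer
    end = outputs.end_logits.argmax().item()       # most likely end position of the answer
    print(tokenizer.decode(inputs["input_ids"][0, start:end + 1]))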

Single sentence classification

Compared to the previous downstream tasks, the difference here is that only a single sentence is passed to BERT. Typical problems solved by this configuration are as follows:

  • Sentiment analysis: understanding whether a sentence expresses a positive or negative attitude.
  • Topic classification: classifying a sentence into one of several categories based on its content.

Single sentence classification

The prediction workflow is the same as for sentence pair classification: the output embedding of the [CLS] token is used as input to the classification model.

Single sentence tagging

Named entity recognition (NER) is a machine learning problem that aims to map each token of a sequence to one of the corresponding entity classes.

Single sentence tagging

To do this, the embeddings of the tokens of the input sentence are computed as usual. Each embedding (except those of [CLS] and [SEP]) is then passed independently to a model that maps it to one of the NER classes (or to none, if the token does not represent a named entity).
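
A sketch of this setup with the Hugging Face BertForTokenClassification head (the number of NER classes is a hypothetical value chosen for illustration):

    import torch
    from transformers import BertTokenizer, BertForTokenClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=5)

    inputs = tokenizer("George Washington lived in Virginia.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits   # (1, n, num_labels): one prediction per token

    print(logits.argmax(dim=-1))          # predicted class id for every token; the [CLS] and
                                          # [SEP] positions are usually ignored in the loss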

11. Feature extraction

Taking the last hidden layer of BERT and using it as an embedding is not the only way to extract features from the input text. In fact, the researchers performed several experiments aggregating embeddings in different ways to solve the NER task on the CoNLL-2003 dataset. For their experiments, they used the extracted embeddings as input to a randomly initialized two-layer 768-dimensional BiLSTM before applying the classification layer.

The figure below compares several ways of building embeddings extracted from BERT base. As shown in the figure, the highest-performing method is concatenating the last four hidden layers of BERT.

Based on the conducted experiments, it is important to remember that aggregation of hidden layers is a potential way to improve embedding representations in order to obtain better results on various NLP tasks.
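
A sketch of the best-scoring strategy, concatenating the last four hidden layers of BERT base (assuming the Hugging Face API; the resulting per-token features could then be fed into a BiLSTM tagger as in the paper's experiment):

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

    inputs = tokenizer("Feature extraction with BERT.", return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states   # tuple of 13 tensors: embeddings + 12 layers

    features = torch.cat(hidden_states[-4:], dim=-1)    # (1, n, 4 * 768) per-token features
    print(features.shape)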

The image on the left shows the BERT structure with its hidden layers. The table on the right illustrates how the embeddings are constructed under each strategy and the corresponding scores obtained.

12. Combining BERT with other features

Sometimes we deal not only with text but also with numerical features, for example. In such cases, it is desirable to build embeddings that incorporate information from both the textual and the non-textual features. The following strategies are recommended (a small sketch of the second strategy follows the list):

  • Concatenation of text with non-textual features. For example, if we work with profile descriptions of people in text form and there are separate features such as their name or age, a new text description can be composed in the form: "My name is <name>. <profile description>. I am <age> years old." Finally, such text descriptions can be fed into the BERT model.
  • Concatenation of embeddings with features. As discussed above, BERT embeddings can be built and then concatenated with the other features. The only thing that changes in this configuration is that the classification model for the downstream task must now accept higher-dimensional input vectors.
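
A sketch of the second strategy (the numerical feature values and the classifier head are invented for illustration):

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Software engineer who enjoys hiking.", return_tensors="pt")
    with torch.no_grad():
        cls_embedding = model(**inputs).last_hidden_state[:, 0]      # (1, 768) [CLS] embedding

    numeric_features = torch.tensor([[34.0, 5.0]])                   # e.g. age, years of experience
    combined = torch.cat([cls_embedding, numeric_features], dim=-1)  # (1, 770)

    classifier = torch.nn.Linear(combined.size(-1), 2)               # downstream head takes 770-d input
    print(classifier(combined).shape)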

13. Conclusion

In this article, we have dived into the BERT training and fine-tuning processes. In fact, this knowledge is enough to solve most tasks in NLP, since BERT allows text data to be incorporated into embeddings almost completely.

In recent times, other BERT-like models have appeared (SBERT, RoBERTa, etc.). There is even a special research field called "BERTology" that analyzes BERT capabilities in depth to derive new high-performing models. These facts underline that BERT sparked a revolution in machine learning and made it possible to significantly advance NLP.
