Intensive reading with Li Mu: BERT ("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding")


Paper address: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Official code address: https://github.com/google-research/bert

Course recommendation: Li Hongyi's Machine Learning course, self-supervised learning: BERT

Reference: paragraph-by-paragraph intensive reading of the BERT paper (Programmer Sought)

Table of contents

1 Introduction

2. Summary

3 Introduction

4 Conclusion

5 Related work

5.1 Feature-Based Unsupervised Approaches

5.2 Unsupervised fine-tuning methods

5.3 Transfer Learning from Supervised Data  

6 BERT model structure and input and output

6.1 BERT framework: two steps, pre-training and fine-tuning

6.2 Model structure

6.3 Size of learnable parameters

6.4 Model input and output

Input sequence

WordPiece

Special tokens

Three embeddings

7 BERT pre-training (MLM+NSP)

7.1 Task 1:Masked LM

7.2 Task 2:Next Sentence Prediction(NSP)

7.3 Pre-training data and parameters

8 BERT fine-tuning

Fine-tuning parameters

9 Experiments

9.1 GLUE Dataset (Classification)

9.2 SQuAD v1.1 (Question and Answer)

9.3 SQuAD v2.0 (Question and Answer)

9.4 SWAG (Sentence pair task)

10 Ablation studies

10.1 Effects of Model Parts

10.2 Effect of Model Configuration

10.3 Using BERT as a Feature Extractor   

11 Summary


1 Introduction

In computer vision, it has long been possible to train a CNN on a large dataset such as ImageNet and then reuse that model across many vision tasks to improve their performance. In natural language processing, before BERT, there was no comparable deep pre-trained model: each task still required building and training its own network. BERT makes it possible to pre-train a relatively deep network on a large dataset and then apply it to many NLP tasks, which both simplifies training and improves performance. BERT and the line of work that followed it have produced a qualitative leap in NLP over the past three years.

BERT's main contribution is extending pre-training to a deep bidirectional architecture, implemented through the masked language model (MLM) task: a cloze-style self-supervised objective that needs no labeled corpus. The model predicts the masked words in a sentence from their context, and in doing so learns text features.

As a result, BERT greatly expanded the use of Transformers: it can be trained on much larger unlabeled datasets, and it outperforms models trained on smaller labeled datasets.

2. Summary

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers (a bidirectional Transformer encoder).

The difference between BERT and ELMo, GPT:

  • GPT uses the (then new) Transformer architecture and attends only to the left context to predict future tokens. The main drawback of such a unidirectional model is that it cannot produce sufficiently good word representations;
  • ELMo obtains word representations by concatenating the outputs of a left-to-right (LTR) and a right-to-left (RTL) model. The fusion of bidirectional information is shallow, and because of its RNN-based architecture, the downstream task architecture usually has to be adjusted when applying it;
  • BERT is based on the Transformer encoder and uses both left and right context together with unlabeled data. For downstream tasks, like GPT, it only needs a new output layer and fine-tuning.

In terms of novelty, BERT does not introduce much structural innovation: it combines ELMo's bidirectional idea with GPT's Transformer architecture. Its main appeal is a single unified model structure that works across different tasks, giving a pre-trained model with strong generalization ability.

3 Introduction

        Language model pre-training can improve many NLP tasks, including:

  • Sentence-level tasks, which model relationships between sentences, such as sentiment analysis of a sentence or predicting the relationship between two sentences;
  • Token-level tasks, such as named entity recognition (deciding whether each word is an entity name, e.g. a person or street name), which require fine-grained output at the token level.

        When using pre-trained models for feature representation, there are generally two types of strategies

  • The feature-based strategy, represented by ELMo: for each downstream task, a task-specific network is built, and the pre-trained representations (e.g. word embeddings) are fed in as additional features alongside the original input. Since these features are hopefully already good representations, training the task model becomes easier; this was the most common way of using pre-trained models in NLP.
  • The fine-tuning strategy, represented by GPT: the pre-trained model is reused almost unchanged for the downstream task, with only a simple output layer added, and then all pre-trained parameters are trained incrementally (fine-tuned) on the downstream data.

  

Both approaches use the same objective during pre-training: a unidirectional language model (given some words, predict the next one), which limits the power of the pre-trained representations. GPT, for instance, uses a left-to-right architecture, so when reading a sentence it can only look from left to right. This is a drawback: for a sentence-level task such as judging the sentiment of a sentence, it is perfectly legitimate to read both left-to-right and right-to-left; and even for some token-level tasks such as question answering, one reads the whole passage before choosing the answer rather than proceeding word by word. Incorporating information from both directions should therefore improve performance on these tasks.

To address this, the authors propose BERT, which uses a masked language model (MLM) pre-training objective that alleviates the unidirectionality constraint described above.

  • The masked language model randomly masks some tokens of the input, and the objective is to predict the masked words (their ids in the vocabulary) from the surrounding context, which amounts to digging holes in a sentence and filling in the blanks.
  • Unlike a standard left-to-right language model, the masked language model can see both left and right context, so it allows training a deep bidirectional Transformer.
  • In addition, a second task is trained: given two sentences, predict whether they are adjacent in the original text or randomly paired, so that the model also learns sentence-level information.

        The contributions of this paper are as follows:

  • We demonstrate the importance of bidirectional pre-training for language representations.
    • GPT pre-trains with a unidirectional language model, whereas BERT uses a masked language model to pre-train deep bidirectional representations.
    • ELMo simply concatenates a left-to-right and a right-to-left language model, similar to a bidirectional RNN; BERT exploits bidirectional information more effectively.
  • With a good pre-trained model, there is no longer a need to carefully design task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a range of sentence-level and token-level tasks, outperforming many task-specific architectures.
  • BERT advances the state of the art on 11 NLP tasks. Code and pre-trained models are available at https://github.com/google-research/bert

4 Conclusion

Recent experiments show that rich unsupervised pre-training lets deep neural networks benefit even low-resource (few-sample) tasks. Our main contribution is extending previous work to a deep bidirectional architecture, so that the same pre-trained model can successfully handle a wide range of NLP tasks.

5 Related work

5.1 Feature-Based Unsupervised Approaches

This mainly covers word embeddings, ELMo, and follow-up work; skipped here.

5.2 Unsupervised fine-tuning methods

The representative work is GPT

5.3 Transfer Learning from Supervised Data  

Computer vision research has demonstrated the value of transfer learning from large pre-trained models by fine-tuning models pre-trained on ImageNet. In NLP there are also relatively large labeled datasets (for natural language inference and machine translation, for example), and models trained on them transfer reasonably well to other tasks. Even so, supervised transfer learning has not been particularly successful in NLP, partly because these two tasks differ considerably from many downstream tasks, and partly because the amount of labeled data is still far from enough. BERT and its successors showed that models trained on large amounts of unlabeled text beat models trained on smaller labeled datasets. The same idea is now slowly being adopted in computer vision: models trained on huge collections of unlabeled images may outperform models trained on ImageNet.

6 BERT model structure and input and output

6.1 BERT framework: two steps, pre-training and fine-tuning

  • Pre-training: the model is trained on an unlabeled dataset.
  • Fine-tuning: the same BERT model is used, with its weights initialized from the pre-trained weights; all weights are updated during fine-tuning, this time on labeled data.
  • Each downstream task starts from a new copy of the pre-trained BERT and trains on its own labeled data.

6.2 Model structure

BERT's architecture is a multi-layer Transformer encoder: take the Transformer's encoder layers and stack several of them, and that is BERT. The Transformer is used exactly as in the original paper and code, with no changes; good guides such as The Annotated Transformer can be consulted.

Three parameters:

  • L: the number of transformer blocks
  • H: the size of the hidden layer
  • A: The number of heads of the multi-head self-attention mechanism

Two model sizes:

  • BERT base: parameter count comparable to GPT, for a fair comparison. BERT_BASE (L=12, H=768, A=12, total parameters = 110M)
  • BERT large: built to push the leaderboards. BERT_LARGE (L=24, H=1024, A=16, total parameters = 340M)

Model complexity grows linearly with the number of layers and quadratically with the width (hidden size).

6.3 Size of learnable parameters

BERT's learnable parameters come from the word embedding layer and the Transformer blocks:

  • Embedding layer: a matrix whose input dimension is the vocabulary size (assume 30k) and whose output dimension is the hidden size H.
  • Each Transformer block has two parts:
    • Self-attention. The attention operation itself has no learnable parameters, but multi-head attention projects the incoming K, V, and Q; each head's projection has dimension H/A = 64, and with A heads the total dimension is unchanged (A x 64 = H). Combined across heads, the projection matrices W_Q, W_K, W_V are each H x H, and the multi-head output is mapped back to H dimensions by another H x H matrix W_O, so the self-attention of one Transformer block has 4H² learnable parameters.
    • MLP. The MLP has two fully connected layers: the first maps H to 4H and the second maps 4H back to H, so each matrix has size H x 4H and the two matrices together contribute 8H² parameters.

Summing these two parts gives the parameters of one Transformer block, which is then multiplied by L (the number of blocks), so the total parameter count is roughly 30K·H + 12·L·H². Plugging in the base configuration gives about 110 million parameters; a quick sanity check of this calculation is sketched below.
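
The arithmetic above can be reproduced in a few lines of Python (a back-of-the-envelope sketch; the 30k vocabulary is the rounded figure used above, while the released uncased WordPiece vocabulary actually has 30,522 entries):

```python
# Rough parameter count for BERT base, following the derivation above.
V, H, L = 30_000, 768, 12            # vocabulary size, hidden size, number of blocks

embedding = V * H                    # token embedding matrix
attention = 4 * H * H                # W_Q, W_K, W_V, W_O, each H x H
mlp = 8 * H * H                      # H -> 4H and 4H -> H fully connected layers
total = embedding + L * (attention + mlp)

print(f"{total / 1e6:.1f}M")         # ~108.0M, close to the reported 110M
```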

6.4 Model input and output

Input sequence

The input to BERT pre-training is collectively called the input sequence. Some downstream tasks work on one sentence and others on two, so to let BERT handle all of them, the input sequence can be either a single sentence or a sentence pair (a "sentence" here means a contiguous span of text, not necessarily a real linguistic sentence). This differs from the Transformer discussed earlier: at training time the Transformer's encoder and decoder each receive their own sequence, whereas BERT has only an encoder and therefore a single input sequence, so to handle two sentences it packs them into one sequence.

WordPiece

The tokenization method used is WordPiece. Its core idea:

  • If BERT simply split on spaces so that every word is a token, the very large training corpus would produce a huge vocabulary, on the order of millions of entries, and by the parameter calculation above most of the model's learnable parameters would sit in the embedding layer.
  • WordPiece instead cuts a word that occurs rarely in the corpus into subword pieces; if a piece (often a root) occurs frequently, it is kept as a vocabulary entry. Long, rare words are thus split into frequent fragments, and the vocabulary stays relatively small (about 30,000 entries), as illustrated below.
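
The effect is easy to see with the Hugging Face transformers library, assuming the public bert-base-uncased checkpoint is available (the exact split can differ slightly from the paper's example):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The rare word "flightless" should be cut into frequent pieces such as
# "flight" and "##less", mirroring the example later in this post.
print(tokenizer.tokenize("penguins are flightless birds"))
print(len(tokenizer.vocab))   # vocabulary size, a little over 30,000
```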

Special tokens

The first token of every sequence is the special token [CLS] (for "classification"); its final output can represent the whole sequence. For single-sentence classification it represents the category of the input sentence; for sentence-pair classification it represents, for example, whether the two sentences are related or unrelated, or have similar or opposite meanings. Because every token in a self-attention layer attends to all tokens of the sequence, placing [CLS] first does not prevent it from seeing everything that follows, so it does not need to be put at the end of the sentence.

Sentence pairs are fed in together, but for sentence-level tasks the two sentences still need to be distinguished, which is done in two ways:

  • The special token [SEP] (for "separator") is appended at the end of each sentence to separate them.
  • A learnable segment embedding in the embedding layer indicates for every token whether it belongs to sentence A or sentence B.

In the figure below, the pink boxes are the input sequence. Each token enters BERT and gets an embedding; the output of the final Transformer block is that token's BERT representation, and an extra output layer on top produces the result wanted for the task.

Three embeddings

The abbreviations have the following meanings:

  • [CLS]: short for "class", a special token added at the front of every input sample.
  • Tok X: short for token; Tok 1 is the first word of the current sentence, and so on.
  • [SEP]: short for separator, used to split two sentences.
  • E: short for embedding, the vector of a token after the embedding layer.
  • C: the BERT output vector of the special token [CLS], of dimension H.
  • T: the BERT output vector of a token, of dimension H.
  • NSP: the Next Sentence Prediction task.
  • Mask LM: the masked language model (MLM) task.
  • Gray arrow ⇧: the arrow between Tok and E, indicating input (Tok is the input).
  • Red arrow: indicates a downstream task.
  • MNLI, NER, SQuAD : three specific tasks

        For a given token, its input representation is composed of the sum of token, segment, and position embeddings. The resulting sequence of vectors goes into the transformer block. The visualization of the embedding layer structure is shown in the figure.

  • Token embedding: maps each token to a fixed-dimensional vector.
  • Segment embedding: used for sentence-pair tasks; it takes only the values 0 and 1 to distinguish the two sentences (sentence A is coded 0, sentence B is coded 1; for single-sentence tasks everything is coded 0).
  • Position embedding: its input is the position of each token in the sequence (starting from zero, up to the maximum sequence length), producing the corresponding position vector.

BERT feeds the whole sentence into the model in parallel rather than word by word, which makes full use of the GPU and greatly improves throughput. Since parallel input loses the positional information of words in the text, an additional position input is needed to preserve it. In the original Transformer the positions come from fixed sinusoidal position encodings, whereas in BERT both the position information (position embedding) and the sentence information (segment embedding) are learned embeddings. All three embeddings have shape (batch_size, seq_length, hidden_size); they are summed element-wise to form the input of the BERT encoder, as in the sketch below.
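
A minimal PyTorch sketch of this embedding layer (names and defaults are illustrative; the real model also applies LayerNorm and dropout to the sum):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512, type_vocab_size=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)        # token embedding
        self.segment = nn.Embedding(type_vocab_size, hidden_size) # sentence A / sentence B
        self.position = nn.Embedding(max_len, hidden_size)        # learned position embedding

    def forward(self, input_ids, token_type_ids):
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand_as(input_ids)
        # element-wise sum of the three embeddings -> (batch_size, seq_length, hidden_size)
        return self.token(input_ids) + self.segment(token_type_ids) + self.position(positions)

emb = BertInputEmbeddings()
ids = torch.randint(0, 30522, (2, 16))   # a batch of 2 sequences of length 16
segments = torch.zeros_like(ids)         # single-sentence input: all segment ids are 0
print(emb(ids, segments).shape)          # torch.Size([2, 16, 768])
```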

7 BERT pre-training (MLM+NSP)

BERT uses two unsupervised tasks for parameter pre-training.

7.1 Task 1:Masked LM

Randomly mask 15% of the tokens in each sequence and replace them with [MASK] ([CLS] and [SEP] are never replaced), and let the model predict the original words at those positions. The output vectors at the masked positions are passed through a linear transform (a matrix multiplication) followed by a softmax, and the cross-entropy loss between this distribution and the one-hot vector of the original token is minimized. Essentially this is a classification problem: BERT has to predict what was masked out.

This creates a problem: the [MASK] token never appears in the fine-tuning data, so the data seen during pre-training and fine-tuning differ. Although masking lets us obtain a bidirectional pre-trained model, a downside is this mismatch between pre-training and fine-tuning. To mitigate it, masked words are not always replaced with the actual [MASK] token.

Solution: when a token is selected for masking, it is replaced with [MASK] with 80% probability, replaced with a random word (noise) with 10% probability, and left unchanged with the remaining 10% probability (it still has to be predicted). These probabilities were found empirically and work well; there are examples in the appendix.

  • 80% of the time, use [MASK]: my dog is hairy → my dog is [MASK]
  • 10% of the time, replace the masked word with a random word: my dog is hairy → my dog is apple
  • 10% of the time, keep it unchanged: my dog is hairy → my dog is hairy

So why use random words with some probability? Because the Transformer has to maintain a contextual representation of every input token; otherwise it could simply learn that [MASK] means "hairy". As for the harm of inserting random words, the paper notes that only 15% x 10% = 1.5% of all tokens are replaced randomly, so the impact is negligible. A sketch of this masking rule follows.
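
A simplified sketch of the 80/10/10 rule, operating on plain token-id lists (it ignores details of the official code, such as skipping [CLS]/[SEP] and capping the number of predictions per sequence):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Return (corrupted ids, labels); label -100 marks positions that are not predicted."""
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                     # the model must recover the original token
            r = random.random()
            if r < 0.8:
                token_ids[i] = mask_id          # 80%: replace with [MASK]
            elif r < 0.9:
                token_ids[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return token_ids, labels
```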

7.2 Task 2:Next Sentence Prediction(NSP)

For a chosen sentence pair A and B, B is the actual next sentence after A with 50% probability (labeled IsNext) and a randomly sampled sentence from the corpus with 50% probability (labeled NotNext), so half of the examples are positive and half are negative. The goal is to teach the model the relationship between two sentences, so that it transfers to downstream tasks such as question answering (QA) and natural language inference (NLI).

Only the output of [CLS] is used: it goes through the same kind of linear-plus-softmax head as the masked positions, and the task is to predict whether the second sentence follows the first. This is a binary classification problem with two possible outputs: yes or no.

Although NSP is very simple, we will see later that adding this objective considerably improves QA and natural language inference (examples are given in the appendix). A sketch of how the training pairs are constructed follows.
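
A simplified sketch of how such 50/50 training pairs could be built (the real implementation samples the negative sentence from a different document; `doc` and `corpus` here are assumed to be a list of sentences and a list of such documents):

```python
import random

def make_nsp_example(doc, corpus, i):
    """Build one NSP example from sentence i of `doc`; `corpus` is a list of documents."""
    sent_a = doc[i]
    if random.random() < 0.5 and i + 1 < len(doc):
        sent_b, label = doc[i + 1], "IsNext"            # 50%: the true next sentence
    else:
        other = random.choice(corpus)                   # 50%: a random sentence
        sent_b, label = random.choice(other), "NotNext"
    return sent_a, sent_b, label
```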

# example

Input = [CLS] the man went to [MASK] store [SEP]

he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]

penguin [MASK] are flight ##less birds [SEP]

Label = NotNext

In the original text, flightless is one word, but because it is infrequent, WordPiece splits it into the two more common pieces flight and less; the ## prefix marks a piece that continues the preceding word.

Li Hongyi argues that this task is not very effective for pre-training and does not teach the model much. One reason may be that Next Sentence Prediction is too easy: a randomly chosen sentence usually looks very different from the preceding one, so deciding whether two sentences are connected is not hard for BERT. A related trick, Sentence Order Prediction (SOP), asks which of the two sentences comes first; perhaps because the task is harder, it appears to work better.

7.3 Pre-training data and parameters

The pre-training procedure largely follows the existing literature on language-model pre-training; we use BooksCorpus (800M words) and English Wikipedia (2,500M words). It is important to use a document-level corpus, i.e. one that contains whole articles rather than randomly shuffled sentences, because the Transformer can handle fairly long sequences and benefits from long contiguous text.

Data and training hyperparameters:

  1. Dataset: BooksCorpus (800M words), English Wikipedia (2,500M words)
  2. Main parameters: batch_size=256, epochs=40, max_tokens=512, dropout=0.1
  3. Optimization parameters: Adam optimizer, lr=1e-4, β1=0.9, β2=0.999, L2 weight decay=0.01, lr_warmup=10,000 steps (see the optimizer sketch after this list)
  4. Activation function: gelu
  5. Training loss: mean MLM likelihood + mean NSP likelihood
  6. Machine configuration: BERT (base) uses 4 cloud TPUs, BERT (large) uses 16 cloud TPUs
  7. Training duration: 4 days
  8. Acceleration method: 90% of the steps are trained according to the text length of 128, and the remaining 10% of the steps are trained according to the text length of 512
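
A PyTorch approximation of this optimization setup (the original code is TensorFlow and uses its own Adam variant with weight decay; here AdamW plus a linear warmup/decay schedule over roughly 1,000,000 steps, i.e. about 40 epochs at batch size 256, stands in for it):

```python
import torch

model = torch.nn.Linear(768, 768)    # stand-in for the real BERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)

warmup_steps, total_steps = 10_000, 1_000_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                    # linear warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # linear decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```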

8 BERT fine-tuning

Because the Transformer uses self-attention, BERT can be applied to many downstream tasks with minimal changes. For sentence-pair tasks, the usual practice used to be to encode the two sentences separately and then apply cross-attention between them; BERT merges these two steps, using self-attention to encode the concatenated pair directly.

During fine-tuning, a modest amount of labeled data is still needed for the downstream task. For each task, the task-specific inputs and outputs are simply plugged into BERT and all parameters are fine-tuned end to end.

When the input is two sentences A and B, they can be a paraphrase pair, a hypothesis-premise pair, a question-passage pair in question answering, or a text pair for classification or sequence labeling. On the output side, the token-level output vectors are used for word-level tasks such as sequence labeling or question answering, while the [CLS] output vector is used for classification tasks such as entailment or sentiment analysis. In every case an output layer is added at the end, followed by a softmax to obtain the desired label.

(Figure: sentence classification)

(Figure: part-of-speech tagging)

Compared with pre-training, fine-tuning is relatively cheap: all of the reported results can be reproduced in about an hour on a cloud TPU, or a few hours on a GPU.

Fine-tuning parameters

In the fine-tuning stage, most hyperparameters are the same as in pre-training; only the batch size, learning rate, and number of epochs need to be tuned. The recommended grid is below (a sweep over this grid is sketched after the list):

  1. batch size = 16, 32
  2. lr = 5e-5, 3e-5, 2e-5
  3. epochs = 2, 3, 4
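
A sketch of sweeping this grid with the Hugging Face Trainer API (the tokenized `train_dataset`/`eval_dataset` are assumed to be prepared elsewhere; this is not the official fine-tuning script):

```python
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

def sweep(train_dataset, eval_dataset, num_labels=2):
    for bs in (16, 32):
        for lr in (5e-5, 3e-5, 2e-5):
            for epochs in (2, 3, 4):
                model = BertForSequenceClassification.from_pretrained(
                    "bert-base-uncased", num_labels=num_labels)
                args = TrainingArguments(
                    output_dir=f"out_bs{bs}_lr{lr}_ep{epochs}",
                    per_device_train_batch_size=bs,
                    learning_rate=lr,
                    num_train_epochs=epochs)
                trainer = Trainer(model=model, args=args,
                                  train_dataset=train_dataset,
                                  eval_dataset=eval_dataset)
                trainer.train()
                print(bs, lr, epochs, trainer.evaluate())  # keep the best run by eval metric
```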

9 Experiments

  In this section, we present the results of BERT on the 11 NLP tasks mentioned earlier.

9.1 GLUE Dataset (Classification)

The GLUE benchmark is a collection of natural language understanding tasks at the sentence level. BERT takes the output vector C of the special token [CLS], feeds it into a learned output layer W, and applies a softmax to obtain the label, turning each task into an ordinary multi-class classification problem: classification is done with the final [CLS] output vector, as sketched below.
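
A minimal sketch of that classification head, with random tensors standing in for the BERT outputs (shapes follow BERT base; in practice the head is trained jointly with BERT during fine-tuning):

```python
import torch

batch_size, H, num_labels = 8, 768, 3
C = torch.randn(batch_size, H)        # final [CLS] vectors produced by BERT
W = torch.nn.Linear(H, num_labels)    # the only new task-specific weights
probs = torch.softmax(W(C), dim=-1)   # one label distribution per example
print(probs.shape)                    # torch.Size([8, 3])
```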

In the results table, the Average column is the mean accuracy over all datasets (higher is better). Even though BERT base has roughly the same number of learnable parameters as GPT, it still achieves a sizable improvement.

9.2 SQuAD v1.1 (Question and Answer)

The Stanford Question Answering Dataset (SQuAD v1.1) contains about 100k crowd-sourced question-paragraph pairs. Given a paragraph and a question whose answer appears in the paragraph, the task is to predict where the answer lies, i.e. the start and end of the answer span, similar to reading comprehension. Each token is judged on whether it is the start or the end of the answer.

Concretely, a start vector S and an end vector E are learned, corresponding to the probability that a token is the start or the end of the answer. For every token Ti in the paragraph (the second "sentence"), the dot product Ti·S is computed and softmaxed over all paragraph tokens to obtain the probability that each token is the start of the answer; the end probability is computed the same way with E. A sketch of this head follows.
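
A minimal sketch of this span-prediction head, again with random tensors standing in for the BERT outputs (the paper additionally scores candidate spans as S·Ti + E·Tj with j >= i; that detail is omitted here):

```python
import torch

seq_len, H = 320, 768
T = torch.randn(seq_len, H)   # BERT output vectors for the paragraph tokens
S = torch.randn(H)            # learned start vector
E = torch.randn(H)            # learned end vector

start_probs = torch.softmax(T @ S, dim=0)   # P(token i is the start of the answer)
end_probs = torch.softmax(T @ E, dim=0)     # P(token i is the end of the answer)
print(start_probs.argmax().item(), end_probs.argmax().item())  # predicted span boundaries
```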

Fine-tuning used 3 epochs, a learning rate of 5e-5, and a batch size of 32. Li Mu notes that this is somewhat misleading: BERT fine-tuning is unstable and the variance across runs with identical hyperparameters is large, so more epochs are usually needed. In addition, the authors used an incomplete variant of Adam, which is fine for long training runs but problematic for short ones, so for fine-tuning it should be switched back to the standard version.

9.3 SQuAD v2.0 (Question and Answer)

Not covered in further detail here.

9.4 SWAG (Sentence pair task)

The SWAG dataset contains 113k sentence pairs. Given a sentence, the task is to choose the most plausible continuation among four candidates. When fine-tuning on SWAG, four input sequences are constructed, each consisting of the given sentence (sentence A) and one candidate continuation (sentence B). The only task-specific parameter is a vector whose dot product with the [CLS] output C gives a score for each choice; a softmax over the four scores gives the probabilities (similar to the setup above, and sketched below). The fine-tuning parameters were epochs=3, learning rate=2e-5, batch size=16. The detailed test results are omitted here.
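
A minimal sketch of the SWAG scoring step (random tensors stand in for the four [CLS] outputs):

```python
import torch

H = 768
C = torch.randn(4, H)                  # [CLS] outputs for the 4 (sentence A, candidate B) pairs
v = torch.randn(H)                     # the single task-specific scoring vector
probs = torch.softmax(C @ v, dim=0)    # probability of each candidate ending
print(probs, probs.argmax().item())    # index of the chosen continuation
```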

Across these different datasets, BERT essentially only needs the data expressed in the required sentence or sentence-pair format, after which a single output layer on top of the corresponding output produces the result. This is a large part of BERT's contribution to NLP: a great many tasks can be handled with one relatively simple architecture and very few changes.

10 Ablation studies

  In this section, we conduct ablation experiments on many aspects of BERT to better understand the final contribution of each piece of BERT to the results.

10.1 Effects of Model Parts

  • No NSP: the next-sentence-prediction objective is removed
  • LTR & No NSP: the MLM objective is replaced by a left-to-right (LTR) language model, and NSP is also removed
  • + BiLSTM: a bidirectional LSTM is added on top of LTR & No NSP

The results show that removing any of these pieces hurts performance:

  • Using MLM without NSP noticeably hurts QNLI, MNLI, and SQuAD 1.1;
  • MLM outperforms the LTR model on all tasks;
  • Concatenating separate LTR and RTL models still performs worse than a deep bidirectional model, and the training cost is high

10.2 Effect of Model Configuration

  Explore the effect of model size on accuracy for fine-tuning tasks. Several BERT models were trained with different numbers of layers, hidden units, and attention heads, while other hyperparameters and training procedures were as described previously.

BERT base has about 110 million learnable parameters and BERT large about 340 million, a large jump compared with earlier Transformers. On every dataset, the larger the model, the higher the accuracy. It was already known that scaling up models keeps improving machine translation and language modeling; this is, however, the first work to show convincingly that scaling a language model to a very large size also brings large gains on downstream tasks with very little labeled data, via fine-tuning, provided the model has been sufficiently pre-trained.

BERT thus set off an arms race over model size: GPT-3 has 175 billion parameters, and current models are on the way to trillions.

10.3 Using BERT as a Feature Extractor   

The feature-based approach extracts fixed features from the pre-trained model, which can still help downstream training to some extent. First, not every downstream task is easily expressed with a Transformer encoder alone; extra task-specific structure is often needed. Second, from a computational point of view it is cheaper to precompute an expensive representation of the training data once and then run many smaller model experiments on top of it.

To compare the feature-based and fine-tuning approaches, features are extracted from one or more layers of the model (the embeddings, the last hidden layer, a weighted sum of the last four hidden layers, and so on) and fed into a randomly initialized two-layer, 768-dimensional BiLSTM followed by a classification layer. The results show that concatenating the token representations from the top four hidden layers of BERT base is only 0.3 F1 behind fine-tuning the whole model, so both the feature-based and the fine-tuning approaches are effective for using BERT. A sketch of extracting such features follows.
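
A sketch of extracting such fixed features with the Hugging Face library (the downstream BiLSTM and classifier are not shown, and this is not the exact setup of the paper's ablation):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()   # BERT stays frozen in the feature-based setting

inputs = tokenizer("BERT as a feature extractor", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states    # embeddings + one tensor per layer

features = torch.cat(hidden_states[-4:], dim=-1)     # concatenate the last four layers
print(features.shape)                                # (1, seq_len, 4 * 768)
```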

Li Mu: using BERT purely as a feature extractor rather than fine-tuning it gives somewhat worse results, so BERT should be fine-tuned.

11 Summary

In this article's view, the biggest contribution of the paper is bidirectionality. BERT uses the Transformer's encoder rather than its decoder; the benefit is that a bidirectional language model can be trained, which outperforms GPT on language-understanding tasks. The drawback is that generation tasks such as machine translation and summarization become inconvenient. Since language-understanding tasks such as classification are more common in NLP, people prefer to use BERT.

What BERT provides is a complete recipe that matched everyone's expectations of deep learning: pre-train a model with roughly 300 million parameters on a corpus of hundreds of gigabytes, and then fine-tuning improves a large number of downstream NLP tasks, even those with small datasets.

GPT and BERT were proposed around the same time; both are fine-tuned pre-trained models and share many ideas. Even a better model is eventually surpassed by latecomers, so why did BERT become so much more prominent? Because BERT has been used roughly ten times as much as GPT, its influence is naturally more than ten times greater.


Origin blog.csdn.net/iwill323/article/details/128374758