Intensive Reading of BERT

Table of contents

1. Abstract

2. Introduction:

3. Conclusion:

4. BERT model algorithm:

5. Summary


1. Abstract

What is different from other papers: BERT is designed to pre-train deep bidirectional representations from unlabeled text, jointly conditioning on both left and right context. (what is improved)

How well does it work: it achieves state-of-the-art results on 11 NLP tasks. When reporting results, it is necessary to give both the absolute accuracy and the improvement relative to prior work. (what the results are)


2. Introduction:

A brief introduction to language model applications: 1. sentence-level tasks, which model the relationships between sentences; 2. token-level tasks, such as named entity recognition.

Expansion of the first paragraph of the abstract: when applying pre-trained language representations to downstream tasks, there are two strategies: feature-based and fine-tuning.

Main idea:

How to solve the problems encountered: BERT alleviates the unidirectionality constraint of the standard language models mentioned above by using a masked language model (MLM); a minimal sketch follows.
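A minimal Python sketch of the masked-language-model idea (the 80%/10%/10% corruption split follows the BERT paper; the toy vocabulary and function name here are purely illustrative):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "ran", "mat"]  # toy vocabulary, illustrative only

def mask_tokens(tokens, mask_prob=0.15):
    """Corrupt ~15% of positions the way BERT's MLM does:
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    The model is trained to predict the original token at the masked positions."""
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                         # only these positions enter the loss
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # else: 10% keep the original token
    return corrupted, targets

print(mask_tokens("the cat sat on the mat".split()))
```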


Contribution points: the importance of bidirectional information (reading a sentence both left-to-right and right-to-left), fine-tuning on top of BERT works well across tasks, and the code is open source.


3. Conclusion:

Unsupervised pre-training is important (as in computer vision, where models pre-trained on large unlabeled datasets can outperform models trained only on labeled data); the main contribution is generalizing these findings to deep bidirectional architectures, enabling the same pre-trained model to successfully handle a broad range of NLP tasks.

4. BERT model algorithm:

Two steps in BERT:

Pre-training: in pre-training, the BERT model is trained on unlabeled data.

Fine-tuning: the BERT model is then fine-tuned for a downstream task. Its weights are initialized with the weights obtained from pre-training, and all weights are updated during fine-tuning, now using labeled data; a sketch is shown below.
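A hedged sketch of the fine-tuning step, assuming the Hugging Face transformers library and a toy two-example labeled dataset (the model checkpoint, data, and learning rate here are illustrative, not taken from the paper):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The classification head is new; every other weight is initialized from pre-training.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a great movie", "a terrible movie"]   # toy labeled data, illustrative only
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**inputs, labels=labels)        # all weights receive gradients
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```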

 

Differences between pre-training and fine-tuning:

Two key things in pre-training: objective function and data for pre-training

Architecture of BERT:

It is a multi-layer bidirectional Transformer encoder.
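A small sketch of what "multi-layer Transformer encoder" means in numbers, assuming the Hugging Face transformers library (the default BertConfig corresponds to BERT-base: 12 layers, hidden size 768, 12 attention heads, roughly 110M parameters):

```python
from transformers import BertConfig, BertModel

config = BertConfig()        # defaults correspond to BERT-base
model = BertModel(config)    # a randomly initialized multi-layer Transformer encoder

print(config.num_hidden_layers,     # 12 encoder layers
      config.hidden_size,           # 768-dimensional hidden states
      config.num_attention_heads)   # 12 attention heads per layer
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```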

5. Summary


The biggest contribution highlighted in the conclusion is bidirectionality (when writing a paper, it is best to have one clear selling point rather than scattering it).
What are the disadvantages of choosing bidirectionality? Every choice gains something and gives something up.
The disadvantage: compared with GPT (Improving Language Understanding by Generative Pre-Training), BERT uses a Transformer encoder while GPT uses a decoder, so BERT is less suited to generative tasks such as machine translation and text summarization.
But classification problems are more common in NLP.
The idea that really caught on: train a very wide and deep model on a large dataset, then reuse it on many small problems and improve their performance through fine-tuning (an approach used in computer vision for many years); the larger the model, the better the results (simple and brute-force).


Origin blog.csdn.net/weixin_64443786/article/details/131986041