Table of contents
1. Abstract
How it differs from prior work: BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. (where the improvement lies)
How well it works: state-of-the-art results on 11 NLP tasks. A good abstract states both the absolute numbers and the improvement relative to the previous best results. (what the result is)
2. Introduction:
A brief introduction to language models and the tasks they serve: 1. sentence-level tasks that model relationships between sentences; 2. token-level tasks such as named entity recognition
Expansion of the first paragraph of the abstract: there are two existing strategies for applying pre-trained representations to downstream tasks: feature-based (e.g., ELMo) and fine-tuning (e.g., GPT)
Main idea:
How the problem is solved: BERT removes the unidirectionality constraint of the standard language models mentioned before by using a masked language model (MLM) objective (see the sketch after this list)
Contributions: demonstrating the importance of bidirectional context (a sentence is read both left-to-right and right-to-left), showing that fine-tuning BERT works well on many tasks, and open-sourcing the code and pre-trained models
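A minimal sketch of the masked-language-model idea, not the paper's exact implementation: the 15% masking rate is from the paper, but the toy sentence and word-level tokens below are illustrative (BERT actually uses WordPiece tokens).

```python
import random

# Toy sentence; a real implementation works on WordPiece tokens.
tokens = ["the", "man", "went", "to", "the", "store"]
MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly replace tokens with [MASK]; the model must predict
    the originals from BOTH the left and the right context."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets

random.seed(1)
masked, targets = mask_tokens(tokens)
print(masked)   # ['[MASK]', 'man', 'went', 'to', 'the', 'store']
print(targets)  # {0: 'the'}
```

Because the target token must be predicted from the words on both sides of the mask, the objective forces the representation to be bidirectional, unlike a left-to-right language model.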
3. Conclusion:
Unsupervised pre-training is important (as in computer vision, where pre-training on large unlabeled datasets can beat training from scratch on labeled data alone); the main contribution is generalizing these findings to deep bidirectional architectures, enabling the same pre-trained model to handle a broad range of NLP tasks.
4. The BERT model:
Two steps in BERT:
Pre-training: the BERT model is trained on unlabeled data.
Fine-tuning: the BERT model is initialized with the weights obtained from pre-training, and all weights are updated during fine-tuning on labeled data from the downstream task (a minimal sketch follows).
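A minimal fine-tuning sketch using the Hugging Face `transformers` library (my assumption for illustration; the paper predates this library, and the checkpoint name, two-label task, and label scheme here are all hypothetical):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Weights are initialized from the pre-trained checkpoint; only the
# small classification head on top is new and randomly initialized.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# One labeled example; during fine-tuning ALL weights are updated.
inputs = tokenizer("a delightful film", return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive (hypothetical label scheme)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()   # gradients flow through every BERT layer
optimizer.step()
```

The learning rate of 2e-5 is one of the small fine-tuning values the paper recommends trying; the point of the sketch is that nothing is frozen, so the whole pre-trained network adapts to the labeled task.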
Difference between pre-training and fine-tuning: the same architecture is used in both; pre-training learns from unlabeled data, while fine-tuning starts from the pre-trained weights and updates them on labeled data.
Two key choices in pre-training: the objective function and the pre-training data
Architecture of BERT:
It is a multi-layer bidirectional Transformer encoder (BERT-base: 12 layers, hidden size 768, 12 attention heads; see the sketch below)
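A sketch of the architecture using PyTorch's built-in encoder layers, with the BERT-base hyperparameters from the paper (L=12, H=768, A=12); the token, position, and segment embeddings and the output heads are omitted:

```python
import torch
import torch.nn as nn

# BERT-base: 12 layers, hidden size 768, 12 attention heads,
# feed-forward size 4 * 768 = 3072.
layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

# A batch of 2 sequences of 8 token embeddings each (random
# stand-ins for the embeddings BERT actually computes).
x = torch.randn(2, 8, 768)
out = encoder(x)   # every position attends to every other position
print(out.shape)   # torch.Size([2, 8, 768])
```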
5. Summary
The paper's biggest contribution, as its conclusion states, is bidirectionality (when writing a paper, it is best to have one clear selling point rather than scattering several).
What are the drawbacks of choosing bidirectionality? Every design choice gains something and gives something up.
The drawback: compared with GPT (Improving Language Understanding by Generative Pre-Training), which uses a Transformer decoder, BERT uses an encoder, so it is less suited to generation tasks such as machine translation and text summarization (see the mask comparison below).
But classification problems are more common in NLP.
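A small sketch of why the encoder/decoder choice matters, my illustration rather than something from the paper: the encoder's attention mask lets every token see the whole sentence (good for understanding), while a decoder's causal mask hides future tokens (needed for left-to-right generation):

```python
import torch

seq_len = 4

# Encoder (BERT): full bidirectional attention -- every position can
# attend to every other, which is what "deep bidirectional" means.
encoder_mask = torch.ones(seq_len, seq_len)

# Decoder (GPT): causal mask -- position i only sees positions <= i,
# the property that makes left-to-right text generation possible.
decoder_mask = torch.tril(torch.ones(seq_len, seq_len))

print(encoder_mask.int())  # all ones: the whole sentence is visible
print(decoder_mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```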
The overall recipe: train a very wide and deep model on a large dataset, then reuse it on many small problems, improving performance on small data through fine-tuning (a recipe used in computer vision for years); the larger the model, the better the results (simple and brute-force).