In-Depth Understanding of Deep Learning - BERT (Bidirectional Encoder Representations from Transformers): Basic Knowledge

Category: General Catalog of "In-depth Understanding of Deep Learning"
Related Articles:
BERT (Bidirectional Encoder Representations from Transformers): Basic Knowledge
BERT (Bidirectional Encoder Representations from Transformers): BERT Structure
BERT (Bidirectional Encoder Representations from Transformers): MLM (Masked Language Model)
BERT (Bidirectional Encoder Representations from Transformers): NSP (Next Sentence Prediction) Task
BERT (Bidirectional Encoder Representations from Transformers): Input Representation
BERT (Bidirectional Encoder Representations from Transformers): Fine-Tuning Training - [Sentence Pair Classification]
BERT (Bidirectional Encoder Representations from Transformers): Fine-Tuning Training - [Single Sentence Classification]
BERT (Bidirectional Encoder Representations from Transformers): Fine-Tuning Training - [Text Q&A]
BERT (Bidirectional Encoder Representations from Transformers): Fine-Tuning Training - [Single Sentence Annotation]
BERT (Bidirectional Encoder Representations from Transformers): Model Summary and Precautions


The full name of BERT is Bidirectional Encoder Representations from Transformers. It is an unsupervised pre-trained language model for natural language processing tasks, proposed by Google in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, and it is widely regarded as a milestone model in natural language processing in recent years. Its significance lies in showing that a deep model pre-trained on large amounts of unlabeled data can significantly improve the accuracy of a wide range of natural language processing tasks. BERT set new state-of-the-art results on 11 natural language processing tasks, including natural language inference, question answering, and named entity recognition, and even surpassed human performance on the SQuAD question answering benchmark.

BERT is widely seen as a synthesis of the best pre-trained language models of recent years: it draws on the bidirectional encoding idea of the ELMo model, borrows from GPT the use of the Transformer as a feature extractor, and adopts a masked training objective in the spirit of word2vec's CBOW. After BERT appeared, many more excellent pre-trained language models emerged, showing better performance in different fields and scenarios, yet their model structures and underlying ideas still do not depart completely from BERT, which shows how profound BERT's influence is.

As its name emphasizes, BERT is a bidirectional encoder, which distinguishes it from GPT, the unidirectional pre-trained language model that attracted widespread attention during the same period. GPT is a standard language model that uses the Transformer decoder (with masked multi-head attention) as its feature extractor and therefore has strong text generation capabilities. Its drawback, however, is equally clear: the semantics of the current word are determined only by the words that precede it, which limits its ability in language understanding. BERT's innovation is to use the Transformer encoder (with unmasked multi-head attention) as the feature extractor and to pair it with a masked training scheme. Although bidirectional encoding means BERT can no longer generate text, research results show that, because BERT uses the full context of each word when encoding the input text, it extracts semantic information more effectively than a unidirectional encoder that can only rely on the preceding context.
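As a rough illustration (not part of the original article), the sketch below contrasts the two attention patterns described above: a GPT-style causal mask, where each position may attend only to itself and earlier positions, and a BERT-style bidirectional mask, where every position may attend to the whole sequence. It uses only NumPy, and the token list is a made-up example.

```python
# Minimal sketch: unidirectional (decoder) vs. bidirectional (encoder) attention masks.
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """GPT-style mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """BERT-style mask: every position may attend to every position."""
    return np.ones((seq_len, seq_len), dtype=bool)

tokens = ["The", "weather", "is", "so", "[MASK]", "today"]  # hypothetical input
n = len(tokens)

# With a causal mask, "[MASK]" at position 4 sees only tokens 0..4;
# with a bidirectional mask, it also sees "today" and anything that follows.
print(causal_mask(n).astype(int))
print(bidirectional_mask(n).astype(int))
```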

The following example illustrates the difference between unidirectional and bidirectional encoding in semantic understanding:

The weather is so bad today that we have to cancel outdoor sports.

If one word is removed, the sentence becomes: "The weather is so ( ) today that we have to cancel outdoor sports." Consider what word should fill the "( )" from the perspective of unidirectional encoding (as in GPT) and bidirectional encoding (as in BERT). Unidirectional encoding can only use the preceding words, "The weather is so", to infer the missing word. Based on human experience, the most likely candidates would include "good", "nice", "bad", and "terrible", and these words fall into two opposite categories. Bidirectional encoding can also use the following context, "we have to cancel outdoor sports", to help the model judge, so the most likely candidates narrow to "bad" and "terrible". This example shows intuitively that, regardless of model complexity or the amount of training data, bidirectional encoding can exploit more contextual information than unidirectional encoding when judging the semantics of the current word. For semantic understanding, bidirectional encoding is the more principled approach, and the success of BERT is largely determined by it.
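The same fill-in-the-blank experiment can be run directly against a pretrained BERT. The sketch below is not from the original article; it assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint are available, and uses the fill-mask pipeline to predict the word hidden behind [MASK] from both sides of the context.

```python
# Minimal sketch of the fill-in-the-blank example, assuming the Hugging Face
# `transformers` library and the `bert-base-uncased` checkpoint are installed.
from transformers import pipeline

# BERT's masked-language-model head predicts the token behind [MASK]
# using context on BOTH sides of the blank.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The weather is so [MASK] today that we have to cancel outdoor sports."
for candidate in fill_mask(sentence, top_k=5):
    # Each candidate carries the predicted word and its probability score.
    print(f"{candidate['token_str']:>10}  {candidate['score']:.3f}")
```

Because the model also sees "we have to cancel outdoor sports", the top candidates should cluster around negative words such as "bad", in line with the reasoning above.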

