In-depth understanding of deep learning - BERT (Bidirectional Encoder Representations from Transformers): BERT's structure

Category: General Catalog of "In-depth Understanding of Deep Learning"
Related Articles:
BERT (Bidirectional Encoder Representations from Transformers): Basic Knowledge
BERT (Bidirectional Encoder Representations from Transformers): BERT Structure
BERT (Bidirectional Encoder Representations from Transformers): MLM (Masked Language Model)
BERT (Bidirectional Encoder Representations from Transformers): NSP (Next Sentence Prediction) task
BERT (Bidirectional Encoder Representations from Transformers): input representation
BERT (Bidirectional Encoder Representations from Transformers): Fine-tuning training - [Sentence-pair classification]
BERT (Bidirectional Encoder Representations from Transformers): Fine-tuning training - [Single-sentence classification]
BERT (Bidirectional Encoder Representations from Transformers): Fine-tuning training - [Text Q&A]
BERT (Bidirectional Encoder Representations from Transformers): Fine-tuning training - [Single-sentence annotation]
BERT (Bidirectional Encoder Representations from Transformers): Model summary and precautions


BERT's core network is a stack of Transformer Encoder layers, supplemented by word embeddings and position embeddings. Its overall network form is very similar to that of GPT. Simplified versions of the network structures of ELMo, GPT and BERT are shown in the figure below; "Trm" in the figure denotes a Transformer Block, i.e. a Transformer-based feature extractor.
Network structure comparison of simplified versions of ELMo, GPT and BERT

  • ELMo uses two LSTM networks, one encoding left to right and one encoding right to left, trained independently with the objective functions $P(w_i|w_1, w_2, \cdots, w_{i-1})$ and $P(w_i|w_{i+1}, w_{i+2}, \cdots, w_n)$ respectively; the feature vectors produced by the two networks are concatenated to achieve bidirectional encoding.
  • GPT uses the Transformer Decoder as its Transformer Block and is trained with the objective function $P(w_i|w_1, w_2, \cdots, w_{i-1})$. Replacing the LSTM with the Transformer Block as the feature extractor, it realizes unidirectional encoding and is a standard pre-trained language model.
  • BERT differs from ELMo in using the Transformer Block as the feature extractor, which strengthens its ability to extract semantic features; it differs from GPT in using the Transformer Encoder as its Transformer Block, turning GPT's unidirectional encoding into bidirectional encoding. BERT gives up the ability to generate text in exchange for stronger semantic understanding (a minimal sketch of this encoder stack follows the list).
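As a concrete illustration of the structure described above (word embeddings plus position embeddings feeding a stack of Transformer Encoder blocks), here is a minimal PyTorch sketch. The class name `SimpleBERTEncoder` and the argument defaults are illustrative assumptions, not the official BERT implementation, which additionally uses segment embeddings, LayerNorm and dropout around the embedding layer.

```python
import torch
import torch.nn as nn

class SimpleBERTEncoder(nn.Module):
    """Illustrative skeleton of BERT's core network: token + position embeddings
    feeding a stack of Transformer Encoder blocks ("Trm" in the figure)."""
    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)   # word (token) embedding
        self.pos_emb = nn.Embedding(max_len, hidden)      # learned position embedding
        block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads,
            dim_feedforward=4 * hidden,                   # Feed Forward inner dimension is 4H
            batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, token_ids):                         # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        return self.encoder(x)                            # (batch, seq_len, hidden)

# Example: encode a batch of two 8-token sequences of dummy ids.
model = SimpleBERTEncoder(layers=2)                       # small stack just for the demo
print(model(torch.randint(0, 30522, (2, 8))).shape)       # torch.Size([2, 8, 768])
```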

Replacing the Masked Multi-Head Attention layer in the GPT structure (see the article "In-depth Understanding of Deep Learning - Attention Mechanism: Masked Multi-Head Attention") with a Multi-Head Attention layer yields the model structure of BERT, as shown in the figure below.
Model structure of BERT
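To make this single difference concrete, the following small sketch (plain PyTorch, an illustration rather than either model's actual code) contrasts the two attention patterns: GPT's Masked Multi-Head Attention applies a causal mask so each position can only attend to its left context, while BERT's Multi-Head Attention applies no such mask, so every position attends over the whole sequence.

```python
import torch
import torch.nn.functional as F

def attention_weights(scores, causal):
    """Toy illustration: softmax over attention scores, optionally masked to the left context."""
    if causal:  # GPT-style Masked Multi-Head Attention: block attention to future positions
        seq_len = scores.size(-1)
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1)  # BERT-style Multi-Head Attention: no mask at all

scores = torch.randn(4, 4)                         # toy scores for a 4-token sequence
print(attention_weights(scores, causal=True))      # lower-triangular weights (unidirectional)
print(attention_weights(scores, causal=False))     # dense weights (bidirectional)
```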
In terms of model size, BERT comes in two configurations. $L$ denotes the number of Transformer Block layers, $H$ denotes the dimension of the feature vectors (the hidden dimension of the intermediate layer in the Feed Forward sub-layer defaults to $4H$), and $A$ denotes the number of Self-Attention heads. For the precise meaning of these parameters, refer to the series of articles "In-depth Understanding of Deep Learning - Transformer". A rough parameter-count check follows the list below.

  • $\text{BERT}_{\text{BASE}}$: $L=12$, $H=768$, $A=12$; about 110 million parameters in total
  • $\text{BERT}_{\text{LARGE}}$: $L=24$, $H=1024$, $A=16$; about 340 million parameters in total
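As a sanity check on these totals, here is a back-of-the-envelope count. It assumes the standard 30,522-token WordPiece vocabulary, 512 positions and 2 segment types, and ignores the pooler and any task-specific heads; note that $A$ does not change the total, since each attention head has dimension $H/A$.

```python
def bert_param_count(L, H, vocab=30522, max_len=512, segments=2):
    """Approximate parameter count (weights and biases) of a BERT-style encoder."""
    emb = (vocab + max_len + segments) * H + 2 * H   # token/position/segment embeddings + LayerNorm
    attn = 4 * (H * H + H)                           # Q, K, V and output projections
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H)      # two Feed Forward layers, inner dimension 4H
    norms = 2 * 2 * H                                # two LayerNorms per block
    return emb + L * (attn + ffn + norms)

print(f"BERT_BASE  ~ {bert_param_count(12, 768) / 1e6:.0f}M parameters")   # ~109M
print(f"BERT_LARGE ~ {bert_param_count(24, 1024) / 1e6:.0f}M parameters")  # ~334M
```

The result, roughly 109 million and 334 million parameters, matches the published sizes quoted above.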

$\text{BERT}_{\text{BASE}}$ was designed specifically for comparison with the first-generation GPT: its parameter count is essentially the same as GPT's. The purpose is to compare $\text{BERT}_{\text{BASE}}$ with the first-generation GPT across tasks and thereby show that bidirectional encoding holds an advantage over unidirectional encoding in semantic understanding, i.e. to quantify the impact of the core difference between BERT and GPT. The figure below shows BERT's results on the GLUE benchmark, compared against the best results of ELMo and GPT.
Test results of BERT on the GLUE benchmark
Compared with ELMo, GPT shows a clear improvement on every task, which is the result of using the Transformer Block instead of the LSTM as the feature extractor. More notably, compared with GPT, $\text{BERT}_{\text{BASE}}$ improves significantly on every task (accuracy rises by 4.5% to 7.0% on average), which demonstrates that bidirectional encoding has a substantial advantage over unidirectional encoding for semantic understanding. Moreover, $\text{BERT}_{\text{LARGE}}$ improves significantly over $\text{BERT}_{\text{BASE}}$ on every task, especially on tasks with limited training data. Regarding the relationship between model size and model capability, the authors of BERT tested BERT with different parameter settings on three tasks; the figure below shows the performance of BERT at different scales on different tasks. As the number of parameters increases, the model's performance improves markedly on all tasks.
Performance of BERT models of different sizes on different tasks
In recent years, some scholars have argued that, limited by the amount of labeled data available for supervised training, larger models cannot deliver higher returns. The emergence of BERT shows that the paradigm of unsupervised pre-training followed by fine-tuning on a specific dataset can break through this limitation: a larger-scale pre-trained language model can consistently achieve better performance after fine-tuning on domain data. This is also consistent with the explosive growth in the parameter counts of pre-trained language models in recent years. As mentioned in the article "In-depth Understanding of Deep Learning - GPT (Generative Pre-Trained Transformer): GPT-3 and Few-shot Learning", GPT-3, with 175 billion parameters, pushed this approach to its limit at the time and indeed achieved unexpected results. Whether BERT will one day break its own record with an even larger model remains to be seen.

