In-depth understanding of deep learning - BERT derived model: SpanBERT (Improving Pre-training by Representing and Predicting Spans)



Masked language modeling (MLM) is the core pre-training task that gives BERT its ability to understand natural language. However, BERT selects each token to mask independently, so the masking granularity is as small as possible: a masked position may be a single word or just one subword of a word. The core idea of SpanBERT is that masking larger, contiguous spans leads to a better model. Specifically, the main improvements of SpanBERT are as follows:

  • A span masking scheme that enlarges the masking granularity: instead of masking a single word (or subword), it masks multiple locally contiguous words at a time.
  • A Span Boundary Objective (SBO) training target that predicts each masked word from the words near the span boundaries, strengthening the use of local context and thereby improving BERT's semantic understanding.
  • The NSP task is abandoned, which yields better semantic understanding of long texts.

SpanBERT does not modify BERT's architecture or use additional corpora; it achieves better performance purely by designing more reasonable pre-training tasks and objectives, which is a distinctive idea.

Algorithm details

Choice of masked words

Whereas BERT simply selects 15% of tokens at random for masking, SpanBERT designs the multi-word masking operation more carefully. Two choices are key:

  • The number of consecutive masked words.
  • The starting point of the consecutive masked words.

SpanBERT uses a geometric distribution to determine the number of consecutive masked words:

$$\ell \sim \text{Geo}(p)$$

where $\text{Geo}$ denotes geometric-distribution sampling, $p$ is the distribution's hyperparameter (set to 0.2), and $\ell$ is the number of consecutive masked words, restricted to $[1, 10]$. The probability that $\ell$ takes the value $k$ is

$$P(\ell=k)=\frac{p(1-p)^{k-1}}{1-(1-p)^{10}}$$

The resulting probability distribution is shown in the figure below.
Probability Distribution for Geometric Distribution Sampling
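
As a quick check on the formula above, the short sketch below evaluates the truncated geometric distribution with $p=0.2$ and $\ell \in [1, 10]$: the probabilities sum to 1, and the mean span length comes out to roughly 3.8 words, which matches the mean span length reported in the SpanBERT paper.

```python
# Evaluate the truncated geometric span-length distribution defined above.
p, max_len = 0.2, 10
norm = 1 - (1 - p) ** max_len                       # normalizing constant 1 - (1-p)^10
pmf = {k: p * (1 - p) ** (k - 1) / norm for k in range(1, max_len + 1)}

print(round(sum(pmf.values()), 6))                  # 1.0 -> a proper distribution over [1, 10]
print(round(sum(k * q for k, q in pmf.items()), 2)) # ~3.8 -> average number of masked words
```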
Geometric sampling determines the number of consecutive masked words. It is worth noting that this count refers to complete words, not subwords (uncommon words may be split into several subwords by the tokenizer). The starting point of the span is chosen at random, with the only requirement that it falls on a standalone word or on the first subword of a word that has been split; in other words, the mask must cover complete words to preserve semantic coherence.
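
The selection procedure can be sketched as follows. This is a simplified illustration, not the official implementation; the `word_spans` input, which maps each whole word to its range of subword indices, is an assumed helper representation.

```python
import random

def sample_span_length(p=0.2, max_len=10):
    """Sample a span length (in whole words) from the truncated Geo(p) above."""
    lengths = list(range(1, max_len + 1))
    weights = [p * (1 - p) ** (k - 1) for k in lengths]  # random.choices renormalizes
    return random.choices(lengths, weights=weights, k=1)[0]

def sample_span(word_spans, p=0.2, max_len=10):
    """word_spans: non-empty list of (start, end) subword index ranges, one per whole word.

    Picks a random word-aligned starting point, so the span always begins at a
    standalone word or at the first subword of a split word, and always covers
    complete words.
    """
    span_len = min(sample_span_length(p, max_len), len(word_spans))
    start_word = random.randrange(0, len(word_spans) - span_len + 1)
    start = word_spans[start_word][0]
    end = word_spans[start_word + span_len - 1][1]
    return list(range(start, end))  # subword positions to mask
```

In practice this sampling is repeated until roughly 15% of the tokens in the sequence are covered, the same overall masking budget as in BERT.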

SBO training objectives

In addition to BERT's original cross-entropy loss for predicting masked words, introducing the span boundary words as an auxiliary training objective for contiguous masked spans gives the model better performance. Specifically, during training the two words at the left and right boundaries of the masked span are taken (both lie outside the mask); their feature vectors, together with a position embedding of the masked word, jointly predict the target word. This is the SBO training objective.
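
A minimal PyTorch sketch of such an SBO prediction head is shown below. The hidden size, vocabulary size, and exact layer arrangement are assumptions for illustration rather than the official configuration: it concatenates the two boundary features with a span-position embedding and passes them through a two-layer fully connected network before projecting to the vocabulary.

```python
import torch
import torch.nn as nn

class SBOHead(nn.Module):
    """Sketch of a Span Boundary Objective prediction head."""

    def __init__(self, hidden_size=768, vocab_size=30522, max_span_len=10):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden_size)  # relative position inside the span
        # two-layer fully connected network over [left boundary; right boundary; position]
        self.ffn = nn.Sequential(
            nn.Linear(3 * hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, left_boundary, right_boundary, positions):
        # left_boundary, right_boundary: (num_masked, hidden_size) features of the
        # words just outside the span; positions: (num_masked,) relative positions.
        h = torch.cat([left_boundary, right_boundary, self.pos_emb(positions)], dim=-1)
        return self.decoder(self.ffn(h))  # (num_masked, vocab_size) prediction scores
```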

As shown in the figure below, $x_5, x_6, x_7, x_8$ are the feature vectors corresponding to the masked words. Taking the prediction of the word corresponding to $x_7$ as an example, the loss for correctly predicting the word "football" is:

$$
\begin{aligned}
L(\text{football}) &= L_\text{MLM}(\text{football}) + L_\text{SBO}(\text{football}) \\
&= -\log P(\text{football} \mid x_7) - \log P(\text{football} \mid x_4, x_9, p_3)
\end{aligned}
$$

where $L_\text{MLM}$ is computed exactly as in BERT's MLM, $L_\text{SBO}$ is computed by a two-layer fully connected network, and $p_3$ is the position embedding of the word to be predicted, indicating its relative position within the masked span.
SBO training method
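
Assuming the prediction scores from the MLM head and the SBO head above have already been gathered for the masked positions, the combined loss in the equation can be sketched as follows (a hypothetical helper for illustration, not library code):

```python
import torch.nn.functional as F

def spanbert_loss(mlm_logits, sbo_logits, target_ids):
    """L = L_MLM + L_SBO, averaged over the masked positions.

    mlm_logits, sbo_logits: (num_masked, vocab_size) scores for each masked token,
    e.g. the scores for predicting "football" at position x_7.
    target_ids: (num_masked,) ground-truth token ids.
    """
    loss_mlm = F.cross_entropy(mlm_logits, target_ids)  # -log P(token | x_i)
    loss_sbo = F.cross_entropy(sbo_logits, target_ids)  # -log P(token | x_left, x_right, p_i)
    return loss_mlm + loss_sbo
```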
In summary, the main improvements of SpanBERT are the optimized masking strategy and the SBO training objective. SpanBERT outperforms BERT on most downstream tasks, and it performs particularly well on extractive question answering, showing that the improved masking strategy and training objective are especially well suited to such tasks.

