Natural language processing from entry to application - dynamic word vector pre-training: bidirectional language model

For a given piece of input text $w_1 w_2 \cdots w_n$, the bidirectional language model builds a language model in both the forward (left-to-right) and backward (right-to-left) directions simultaneously. The advantage is that for any word $w_t$ in the text, representations based on its left context and on its right context can be obtained at the same time. Specifically, the model first encodes each word individually; this step is context-independent and relies mainly on the character sequence inside the word. On top of the resulting word representation sequence, the model uses two multi-layer long short-term memory networks (LSTMs) running in opposite directions to compute the forward and backward hidden representations at each time step, i.e., the context-dependent word vector representations. Using these representations, the model predicts the target word at each time step: for the forward language model, the target word at time $t$ is $w_{t+1}$; for the backward language model, it is $w_{t-1}$.
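To make the two prediction directions concrete, here is a minimal sketch in plain Python (the toy token sequence and variable names are illustrative assumptions, not part of the original model) showing how the forward and backward training targets are obtained by shifting the input sequence:

```python
# Hypothetical toy token sequence with start/end markers.
tokens = ["<s>", "the", "cat", "sat", "on", "the", "mat", "</s>"]

# Forward LM: at time t the model sees w_1 ... w_t and predicts w_{t+1}.
forward_inputs, forward_targets = tokens[:-1], tokens[1:]

# Backward LM: at time t the model sees w_t ... w_n and predicts w_{t-1}.
backward_inputs, backward_targets = tokens[1:], tokens[:-1]

for x, y in zip(forward_inputs, forward_targets):
    print(f"forward : context ending in {x!r:>8} -> predict {y!r}")
for x, y in zip(backward_inputs, backward_targets):
    print(f"backward: context starting at {x!r:>8} -> predict {y!r}")
```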

input representation layer

The ELMo model uses a character-composition-based neural network to represent each word in the input text, with the aim of reducing the impact of out-of-vocabulary (OOV) words on the model. The figure below shows the basic structure of the input representation layer.

First, the character vector layer converts each character of a word (with additional start and end markers) into a vector representation. Suppose $w_t$ consists of the character sequence $c_1 c_2 \cdots c_l$. Each character $c_i$ in it can be represented as $v_{c_i} = E^{\text{char}} e_{c_i}$, where $E^{\text{char}} \in R^{d^{\text{char}} \times |V^{\text{char}}|}$ is the character embedding matrix, $V^{\text{char}}$ is the set of all characters, $d^{\text{char}}$ is the character embedding dimension, and $e_{c_i}$ is the one-hot encoding of character $c_i$. Denote the matrix formed by all character vectors of $w_t$ as $C_t \in R^{d^{\text{char}} \times l}$, i.e., $C_t = [v_{c_1}, v_{c_2}, \cdots, v_{c_l}]$.

Next, a convolutional neural network performs semantic composition over this character-level vector sequence. A one-dimensional CNN is used here: the character embedding dimension $d^{\text{char}}$ serves as the number of input channels, denoted $N^{\text{in}}$, and the dimension of the output vector serves as the number of output channels, denoted $N^{\text{out}}$. By using several convolution kernels of different sizes (widths), character-level context information of different granularities can be exploited, and the corresponding hidden vector representations are obtained; the dimension of each is determined by the number of output channels of the corresponding kernel. Concatenating these vectors yields the convolution output at each position. The output vectors at all positions are then pooled to obtain a fixed-length vector representation of the word $w_t$, denoted $f_t$. For example, if 7 one-dimensional convolution kernels with widths {1, 2, 3, 4, 5, 6, 7} and corresponding output channels {32, 32, 64, 128, 256, 512, 1024} are used, the output vector $f_t$ has dimension 2048.

[Figure: schematic diagram of the input representation layer based on a character convolutional neural network and a Highway network]

Then, the model uses a two-layer Highway network to further transform the output of the convolutional network and obtain the final word vector representation $x_t$. The Highway network establishes a direct "channel" between input and output, so that gradients can be passed from the output layer straight back to the input layer, which alleviates the gradient vanishing or explosion problems caused by stacking too many layers. A single Highway layer is computed as follows:
$$x_t = g \odot f_t + (1 - g) \odot \text{ReLU}(W f_t + b)$$

where $g$ is the gating vector, which takes $f_t$ as input and is computed by a linear transformation followed by the Sigmoid function:
$$g = \sigma(W^g f_t + b^g)$$

Here, $W^g$ and $b^g$ are the linear transformation matrix and bias vector of the gating network. It can be seen that the output of the Highway network is essentially a linear interpolation between the input and the hidden layer. Of course, the model structure is usually adjusted and settled by experiment, and other structures can also be tried; for example, the character sequence inside a word could be encoded with a character-level bidirectional LSTM. Next, on top of the context-independent word vectors obtained above, a bidirectional language model encodes the left and right context respectively, producing a dynamic word vector representation at each time step.
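The following PyTorch sketch puts the pieces above together. It is a minimal illustration rather than the original ELMo implementation: the class names, the character embedding size `char_dim=16`, the character vocabulary size in the example, and the assumption that every word's character sequence is padded to at least the widest kernel are all choices made here for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Highway(nn.Module):
    """One Highway layer: x = g * f + (1 - g) * ReLU(W f + b)."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # W, b
        self.gate = nn.Linear(dim, dim)       # W^g, b^g

    def forward(self, f):
        g = torch.sigmoid(self.gate(f))
        return g * f + (1.0 - g) * F.relu(self.transform(f))


class CharCNNInputLayer(nn.Module):
    """Character-CNN word encoder followed by a stack of Highway layers."""

    def __init__(self, n_chars, char_dim=16,
                 kernel_sizes=(1, 2, 3, 4, 5, 6, 7),
                 channels=(32, 32, 64, 128, 256, 512, 1024),
                 n_highway=2):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)  # E^char
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, c, kernel_size=k)
             for k, c in zip(kernel_sizes, channels)])
        out_dim = sum(channels)                          # 2048 with the widths above
        self.highways = nn.ModuleList([Highway(out_dim) for _ in range(n_highway)])

    def forward(self, char_ids):
        # char_ids: (batch, num_words, word_len) character indices; word_len is
        # assumed to be padded to at least max(kernel_sizes).
        b, n, l = char_ids.shape
        c = self.char_emb(char_ids.view(b * n, l))       # (b*n, l, d_char)
        c = c.transpose(1, 2)                            # C_t: (b*n, d_char, l)
        # One convolution per kernel width, then max-pool over character positions.
        pooled = [conv(c).max(dim=-1).values for conv in self.convs]
        f = torch.cat(pooled, dim=-1)                    # f_t
        for highway in self.highways:
            f = highway(f)                               # x_t
        return f.view(b, n, -1)                          # (batch, num_words, out_dim)


# Example: a batch of 2 sentences, 5 words each, 10 (padded) characters per word.
encoder = CharCNNInputLayer(n_chars=262)
x = encoder(torch.randint(0, 262, (2, 5, 10)))
print(x.shape)  # torch.Size([2, 5, 2048])
```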

forward language model

In the forward language model, the prediction of the target word at any time step depends only on the context (history) to the left of that time step. A multi-layer stacked LSTM language model is used here. Denote the parameters of the multi-layer stacked forward LSTM as $\overrightarrow{\theta}^{\text{LSTM}}$ and the parameters of the Softmax output layer as $\theta^{\text{out}}$. The model can then be expressed as:
$$p(w_1 w_2 \cdots w_n) = \prod_{t=1}^{n} P(w_t | x_{1:t-1}; \overrightarrow{\theta}^{\text{LSTM}}; \theta^{\text{out}})$$
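As a rough illustration of this formula, the sketch below (again a simplified, assumption-laden example rather than the reference implementation) stacks a unidirectional multi-layer LSTM on top of the context-independent word vectors $x_t$ and adds a Softmax output layer corresponding to $\theta^{\text{out}}$:

```python
import torch
import torch.nn as nn


class DirectionalLM(nn.Module):
    """Multi-layer LSTM language model over word vectors x_t (one direction)."""

    def __init__(self, input_dim, hidden_dim, vocab_size, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # Softmax output layer (theta^out)

    def forward(self, x):
        # x: (batch, seq_len, input_dim); the hidden state h_t depends only on
        # x_{1:t}, so the logits at step t are used to predict the next word w_{t+1}.
        h, _ = self.lstm(x)
        return self.out(h)
```

At training time the logits at step $t$ are compared with $w_{t+1}$ using a cross-entropy loss; feeding the same kind of module the reversed sequence gives the backward direction, as sketched after the next formula.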

backward language model

In contrast to the forward language model, the backward language model considers only the context to the right of each time step. It can be expressed as:
$$p(w_1 w_2 \cdots w_n) = \prod_{t=1}^{n} P(w_t | x_{t+1:n}; \overleftarrow{\theta}^{\text{LSTM}}; \theta^{\text{out}})$$
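Because the backward model mirrors the forward one, one convenient way to sketch joint training (purely illustrative: the helper below, its name, and the omission of padding handling are assumptions, and while the formulas above share $\theta^{\text{out}}$ between directions, this sketch keeps two separate `DirectionalLM` modules for brevity) is to run the second module on the reversed sequence and minimize the sum of the two losses:

```python
import torch
import torch.nn.functional as F


def bidirectional_lm_loss(fwd_lm, bwd_lm, x, word_ids):
    """Sum of forward and backward LM losses.

    x:        (batch, seq_len, dim) context-independent word vectors x_t
    word_ids: (batch, seq_len) word indices; sequences are assumed to have
              equal length (padding handling is omitted for brevity).
    """
    vocab = fwd_lm.out.out_features

    # Forward direction: logits at step t are scored against w_{t+1}.
    fwd_logits = fwd_lm(x[:, :-1])
    fwd_loss = F.cross_entropy(fwd_logits.reshape(-1, vocab),
                               word_ids[:, 1:].reshape(-1))

    # Backward direction: reverse the sequence so that "next word" in the
    # reversed order corresponds to w_{t-1} in the original order.
    rev_x = torch.flip(x, dims=[1])
    rev_ids = torch.flip(word_ids, dims=[1])
    bwd_logits = bwd_lm(rev_x[:, :-1])
    bwd_loss = F.cross_entropy(bwd_logits.reshape(-1, vocab),
                               rev_ids[:, 1:].reshape(-1))

    return fwd_loss + bwd_loss
```

For example, with `fwd_lm = DirectionalLM(2048, 512, vocab_size)` and a second instance for the backward direction, this loss can be minimized with any standard optimizer; the hidden states of the two LSTM stacks then serve as the dynamic, context-dependent word vectors.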

