[Paper Interpretation Series] NER Direction: LatticeLSTM (ACL2018)

Introduction

Lattice LSTM was proposed in the ACL 2018 paper "Chinese NER Using Lattice LSTM".

Paper address:
https://arxiv.org/abs/1805.02023

There are multiple versions of the code:
Official version: https://github.com/jiesutd/LatticeLSTM
Reproduced by others: https://github.com/LeeSureman/Batch_Parallel_LatticeLSTM

The LSTM-CRF model performs very well on English named entity recognition. For Chinese NER, character-based models are clearly better than word-based models, since they avoid the impact of word segmentation errors. Introducing lexical information into a character-based model to help determine entity boundaries can further improve Chinese NER significantly.

Lattice LSTM is the starting point of lexicon-enhancement methods for Chinese NER. The model uses both character information and the information of all word sequences matched in the sentence. Specifically, matching a sentence against a lexicon (dictionary) yields a lattice-like structure. This avoids entity recognition errors caused by word segmentation errors and brings a significant improvement on Chinese NER tasks.

[Figure 1: the word character lattice obtained by matching "南京市长江大桥" (Nanjing Yangtze River Bridge) against the lexicon]
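To make the lattice construction concrete, here is a minimal Python sketch (not taken from the official code) that matches a sentence against a lexicon to collect every word span; the toy lexicon is an illustrative assumption, and the indices are 0-based Python indices rather than the 1-based indices used later in the text.

```python
# Toy example: collect all lexicon words that match character subsequences.
sentence = "南京市长江大桥"
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}  # illustrative lexicon
max_len = max(len(w) for w in lexicon)

lattice_words = []  # (start b, end e, matched word w_{b,e}), 0-based indices
for b in range(len(sentence)):
    for e in range(b + 1, min(b + max_len, len(sentence))):
        word = sentence[b:e + 1]
        if word in lexicon:
            lattice_words.append((b, e, word))

print(lattice_words)
# [(0, 1, '南京'), (0, 2, '南京市'), (2, 3, '市长'),
#  (3, 4, '长江'), (3, 6, '长江大桥'), (5, 6, '大桥')]
```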

Model Structure

LSTM Structure

LSTM is a variant of RNN that effectively alleviates the vanishing and exploding gradient problems. It introduces three gates: the input gate $\mathbf{i}_t$, the forget gate $\mathbf{f}_t$, and the output gate $\mathbf{o}_t$. A new cell state $\mathbf{c}_t$ passes information on linearly, while information is output non-linearly to the hidden state $\mathbf{h}_t$.

Formally:

$$
\begin{aligned}
\left[\begin{array}{c} \mathbf{i}_{t} \\ \mathbf{o}_{t} \\ \mathbf{f}_{t} \\ \tilde{\mathbf{c}}_{t} \end{array}\right] &= \left[\begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \tanh \end{array}\right]\left(\mathbf{W}\left[\begin{array}{c} \mathbf{x}_{t} \\ \mathbf{h}_{t-1} \end{array}\right]+\mathbf{b}\right) \\
\mathbf{c}_{t} &= \mathbf{f}_{t} \odot \mathbf{c}_{t-1}+\mathbf{i}_{t} \odot \tilde{\mathbf{c}}_{t} \\
\mathbf{h}_{t} &= \mathbf{o}_{t} \odot \tanh\left(\mathbf{c}_{t}\right)
\end{aligned}
$$

It can be seen from the above formula:

  • The input gate $\mathbf{i}_t$ controls how much information of the current candidate state $\tilde{\mathbf{c}}_t$ is kept.

  • The forget gate $\mathbf{f}_t$ controls how much information of the previous state $\mathbf{c}_{t-1}$ is forgotten.

  • The output gate $\mathbf{o}_t$ controls how much information of the current cell state $\mathbf{c}_t$ is output to $\mathbf{h}_t$.
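As a concrete reference for the equations above, here is a minimal NumPy sketch of a single LSTM step; the dimensions and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.

    W has shape (4*d, d_x + d), b has shape (4*d,); the four row blocks
    of W produce the input gate, output gate, forget gate and candidate.
    """
    d = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i_t = sigmoid(z[0 * d:1 * d])            # input gate
    o_t = sigmoid(z[1 * d:2 * d])            # output gate
    f_t = sigmoid(z[2 * d:3 * d])            # forget gate
    c_tilde = np.tanh(z[3 * d:4 * d])        # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # linear information path
    h_t = o_t * np.tanh(c_t)                 # non-linear output
    return h_t, c_t

# usage with illustrative sizes: 50-dim input, 100-dim hidden state
rng = np.random.default_rng(0)
d_x, d = 50, 100
W, b = rng.normal(scale=0.1, size=(4 * d, d_x + d)), np.zeros(4 * d)
h_t, c_t = lstm_step(rng.normal(size=d_x), np.zeros(d), np.zeros(d), W, b)
```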

The paper describes three model schemes: the Character-Based Model, the Word-Based Model, and the Lattice Model; all of them use LSTM-CRF as the main network structure.

[Figure: the character-based, word-based, and lattice LSTM-CRF model structures]

Character-Based Model

For the Character-Based model, the input is the character sequence $c_1, c_2, \ldots, c_m$, which is fed directly into the LSTM-CRF. Each character $c_j$ is represented as $\mathbf{x}_j^c=\mathbf{e}^c\left(c_j\right)$, where $\mathbf{e}^c$ is the character embedding matrix, i.e., a lookup table of character representations. A bidirectional LSTM is usually used to process the input sequence $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$, producing left-to-right and right-to-left hidden state sequences $\overrightarrow{\mathbf{h}}_1^c, \overrightarrow{\mathbf{h}}_2^c, \ldots, \overrightarrow{\mathbf{h}}_m^c$ and $\overleftarrow{\mathbf{h}}_1^c, \overleftarrow{\mathbf{h}}_2^c, \ldots, \overleftarrow{\mathbf{h}}_m^c$. The representations from the two directions are then concatenated, so the hidden state of each character is $\mathbf{h}_j^c=\left[\overrightarrow{\mathbf{h}}_j^c ; \overleftarrow{\mathbf{h}}_j^c\right]$.
Finally, a standard CRF layer is applied on top of $\mathbf{h}_1^c, \mathbf{h}_2^c, \ldots, \mathbf{h}_m^c$ for sequence labeling. The superscript $c$ denotes the character level.
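For reference, a minimal PyTorch sketch of such a character-based BiLSTM encoder might look as follows; the CRF layer is omitted, and the vocabulary size and dimensions are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class CharBiLSTMEncoder(nn.Module):
    """Character-based BiLSTM encoder; its outputs h_j^c feed a CRF layer."""
    def __init__(self, num_chars, char_dim=50, hidden_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)   # lookup table e^c
        self.bilstm = nn.LSTM(char_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, char_ids):          # char_ids: (batch, seq_len)
        x = self.char_emb(char_ids)       # x_j^c = e^c(c_j)
        h, _ = self.bilstm(x)             # h_j^c = [forward h_j^c ; backward h_j^c]
        return h

# usage: encode a 7-character sentence such as 南京市长江大桥
encoder = CharBiLSTMEncoder(num_chars=5000)
h = encoder(torch.randint(0, 5000, (1, 7)))
print(h.shape)                            # torch.Size([1, 7, 100])
```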

In addition, the Character-Based model can also integrate n-gram information (char+bichar) and word segmentation information (char+softword).

  • Char + bichar: the embedding of the character bigram is concatenated with the character embedding as the character representation.
  • Char + softword: word segmentation information can be added to the Character-Based model as a soft feature, by concatenating the segmentation-label embedding with the character embedding: $\mathbf{x}_j^c=\left[\mathbf{e}^c\left(c_j\right) ; \mathbf{e}^s\left(\operatorname{seg}\left(c_j\right)\right)\right]$, where $\mathbf{e}^s$ is the embedding matrix of segmentation labels and $\operatorname{seg}\left(c_j\right)$ is the segmentation label of character $c_j$, one of the four BMES labels indicating the character's position within its segmented word (a short sketch of these input features follows below).
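Below is a minimal PyTorch sketch of how such input features can be assembled; it combines char, bichar and softword embeddings in one representation for illustration, and all vocabulary sizes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

char_emb = nn.Embedding(5000, 50)      # e^c: character unigrams
bichar_emb = nn.Embedding(200000, 50)  # e^b: character bigrams
seg_emb = nn.Embedding(4, 20)          # e^s: BMES segmentation labels

def char_input(char_id, bichar_id, seg_id):
    """x_j^c = [e^c(c_j); e^b(c_j c_{j+1}); e^s(seg(c_j))]"""
    return torch.cat([char_emb(char_id), bichar_emb(bichar_id), seg_emb(seg_id)], dim=-1)

x_jc = char_input(torch.tensor([42]), torch.tensor([1234]), torch.tensor([2]))
print(x_jc.shape)   # torch.Size([1, 120])
```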

Word-Based Model

The Word-Based model is similar to the Character-Based model. The input is a word sequence $w_1, w_2, \ldots, w_n$, and each word $w_i$ is represented as $\mathbf{x}_i^w=\mathbf{e}^w\left(w_i\right)$, where $\mathbf{e}^w$ is the word embedding matrix. Likewise, the sequence is fed into a bidirectional LSTM, the hidden states of the two directions are concatenated as the output, and sequence labeling is done by a CRF layer on top.

The Word-Based model can further fuse character information. Let $\mathbf{x}_i^c$ be the character representation of word $w_i$; the concatenation of the two is taken as the new word representation:
$\mathbf{x}_i^w=\left[\mathbf{e}^w\left(w_i\right) ; \mathbf{x}_i^c\right]$
There are three ways to obtain the character representation $\mathbf{x}_i^c$ of word $w_i$:

  • Word + char LSTM: each character is fed into a bidirectional LSTM, and the hidden states of the two directions are concatenated as the character representation of word $w_i$:
    $\mathbf{x}_i^c=\left[\overrightarrow{\mathbf{h}}_{t(i, \operatorname{len}(i))}^c ; \overleftarrow{\mathbf{h}}_{t(i, 1)}^c\right]$
    where $t(i, k)$ is the index of the $k$-th character of the $i$-th word, so $t(i, \operatorname{len}(i))$ is the last character of word $w_i$. In other words, this method only concatenates the forward hidden state at the last character with the backward hidden state at the first character.

  • Word + char LSTM′: unlike the method above, this variant uses a bidirectional LSTM to obtain hidden states for every character $c_j$, concatenates the two directions as each character's hidden state, and then integrates them into the word representation. It is similar to the structure of Liu et al. (2018) but does not use the highway layer.

  • Word + char CNN: a CNN is applied over the character sequence of each word to obtain its character representation $\mathbf{x}_i^c$, where each character $c_j$ is represented by its embedding $\mathbf{e}^c\left(c_j\right)$:

$$\mathbf{x}_i^c=\max _{t(i, 1) \leq j \leq t(i, \operatorname{len}(i))} \operatorname{ReLU}\left(\mathbf{W}_{\mathrm{CNN}}^{\top}\left[\begin{array}{c}\mathbf{e}^c\left(c_{j-\lfloor k e / 2\rfloor}\right) \\ \vdots \\ \mathbf{e}^c\left(c_{j+\lfloor k e / 2\rfloor}\right)\end{array}\right]+\mathbf{b}_{\mathrm{CNN}}\right)$$

where $\mathbf{W}_{\mathrm{CNN}}$ and $\mathbf{b}_{\mathrm{CNN}}$ are parameters to be learned, $ke$ is the kernel size (set to 3 in the paper), and $\max$ denotes max pooling over the character positions (a short sketch follows below).
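A minimal PyTorch sketch of this word + char CNN representation with kernel size 3 and max pooling; the use of ReLU and same-padding here is an assumption of the sketch, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """CNN over a word's characters with max pooling, giving x_i^c."""
    def __init__(self, num_chars, char_dim=50, out_dim=50, ke=3):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)   # e^c
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size=ke, padding=ke // 2)

    def forward(self, char_ids):                    # char_ids: (word_len,)
        e = self.char_emb(char_ids).T.unsqueeze(0)  # (1, char_dim, word_len)
        c = torch.relu(self.conv(e))                # W_CNN over each window + b_CNN
        return c.max(dim=2).values.squeeze(0)       # max pooling over positions

# usage: character representation of a 4-character word such as 长江大桥
cnn = CharCNN(num_chars=5000)
x_ic = cnn(torch.tensor([10, 11, 12, 13]))
print(x_ic.shape)                                   # torch.Size([50])
```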

Lattice Model

Lattice LSTM can be seen as an extension of the Character-Based model. It adds matched words as extra inputs (for example, the input in the figure below includes the word 南京市 (Nanjing City), whose features are extracted by the red part of the network) and additional gates (the green gating units connected by lines in the figure) to control the flow of information.

[Figure 2: the Lattice LSTM structure, with word cells (red) and the additional gates (green) that control the information flow]

The input of Lattice LSTM is the character sequence $c_1, c_2, \ldots, c_m$ together with all character subsequences that match words in a dictionary $\mathbb{D}$, denoted $w_{b, e}^d$, where $b$ is the index of the starting character and $e$ is the index of the ending character. For example, for the sentence "南京市长江大桥" (Nanjing Yangtze River Bridge) in Figure 1, $w_{1,2}^d$ is "南京" (Nanjing). The model involves four kinds of vectors: input vectors, hidden-layer output vectors, cell vectors, and gate vectors. The input vector is the character embedding, $\mathbf{x}_j^c=\mathbf{e}^c\left(c_j\right)$. The recurrent structure of the basic LSTM is kept: each character $c_j$ has a character cell vector $\mathbf{c}_j^c$ and a hidden vector $\mathbf{h}_j^c$, where $\mathbf{c}_j^c$ records the information flow from the beginning of the sentence up to character $c_j$, and $\mathbf{h}_j^c$ is finally fed into the CRF layer for sequence labeling.

Unlike the character-based model, the computation of $\mathbf{c}_j^c$ now also considers the lexicon-matched subsequences $w_{b, e}^d$ in the sentence. Each subsequence $w_{b, e}^d$ is represented as $\mathbf{x}_{b, e}^w=\mathbf{e}^w\left(w_{b, e}^d\right)$. In addition, a word cell state $\mathbf{c}_{b, e}^w$ represents the recurrent state of $\mathbf{x}_{b, e}^w$ from the beginning of the sentence, computed as follows:

$$
\begin{aligned}
\left[\begin{array}{c}\mathbf{i}_{b, e}^w \\ \mathbf{f}_{b, e}^w \\ \tilde{\mathbf{c}}_{b, e}^w\end{array}\right] &=\left[\begin{array}{c}\sigma \\ \sigma \\ \tanh\end{array}\right]\left(\mathbf{W}^{w \top}\left[\begin{array}{c}\mathbf{x}_{b, e}^w \\ \mathbf{h}_b^c\end{array}\right]+\mathbf{b}^w\right) \\
\mathbf{c}_{b, e}^w &=\mathbf{f}_{b, e}^w \odot \mathbf{c}_b^c+\mathbf{i}_{b, e}^w \odot \tilde{\mathbf{c}}_{b, e}^w
\end{aligned}
$$
Here $\mathbf{i}_{b, e}^w$ and $\mathbf{f}_{b, e}^w$ are the input gate and the forget gate. $\mathbf{h}_b^c$ and $\mathbf{c}_b^c$ are the hidden state and cell state output by the character LSTM unit at the first character of the word. For example, when computing the word cell state of 长江大桥, $\mathbf{h}_b^c$ and $\mathbf{c}_b^c$ come from the character 长; when computing the word cell state of 大桥, they come from the character 大. Also note that there is no output gate for word cells, because sequence-labeling decisions are made only at the character level.
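A minimal NumPy sketch of this word-cell computation, following the notation above; the parameter shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_cell_state(x_w, h_b, c_b, W_w, b_w):
    """Cell state c^w_{b,e} of a matched word w_{b,e}; note: no output gate.

    x_w      : embedding x^w_{b,e} of the matched word
    h_b, c_b : hidden/cell state of the character LSTM at the word's first
               character (e.g. at 长 when computing the cell of 长江大桥)
    W_w, b_w : parameters with shapes (3*d, d_w + d) and (3*d,)
    """
    d = c_b.shape[0]
    z = W_w @ np.concatenate([x_w, h_b]) + b_w
    i = sigmoid(z[0 * d:1 * d])         # input gate  i^w_{b,e}
    f = sigmoid(z[1 * d:2 * d])         # forget gate f^w_{b,e}
    c_tilde = np.tanh(z[2 * d:3 * d])   # candidate   c~^w_{b,e}
    return f * c_b + i * c_tilde        # c^w_{b,e}
```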

With the word cell states $\mathbf{c}_{b, e}^w$, more recurrent paths flow into each $\mathbf{c}_j^c$. For example, in Figure 2 the input sources of $\mathbf{c}_7^c$ include $\mathbf{x}_7^c$ (桥), $\mathbf{c}_{6,7}^w$ (大桥), and $\mathbf{c}_{4,7}^w$ (长江大桥). All $\mathbf{c}_{b, e}^w$ are linked to the cell $\mathbf{c}_e^c$ of their ending character, i.e., these parts of information are added with weights. For each word cell state $\mathbf{c}_{b, e}^w$, an additional gate $\mathbf{i}_{b, e}^c$ controls how much of it flows into $\mathbf{c}_e^c$ (the original paper writes $\mathbf{c}_{b, e}^c$, which appears to be a typo); this gate is defined below.


The cell state $\mathbf{c}_j^c$ of character $c_j$ is then computed as a weighted sum of the cell states of all words ending at $c_j$ and the candidate state $\widetilde{\mathbf{c}}_j^c$ of $c_j$ itself:
$$\mathbf{c}_j^c=\sum_{b \in\left\{b^{\prime} \mid w_{b^{\prime}, j}^d \in \mathbb{D}\right\}} \boldsymbol{\alpha}_{b, j}^c \odot \mathbf{c}_{b, j}^w+\boldsymbol{\alpha}_j^c \odot \widetilde{\mathbf{c}}_j^c$$

In the formula above, $\boldsymbol{\alpha}$ denotes the normalized gate values; the gate itself is computed from the word cell state $\mathbf{c}_{b, e}^w$ and the character embedding $\mathbf{x}_e^c$:
$$\mathbf{i}_{b, e}^c=\sigma\left(\mathbf{W}^{l \top}\left[\begin{array}{c}\mathbf{x}_e^c \\ \mathbf{c}_{b, e}^w\end{array}\right]+\mathbf{b}^l\right)$$

The gates $\mathbf{i}_{b, j}^c$ and $\mathbf{i}_j^c$ are normalized to $\boldsymbol{\alpha}_{b, j}^c$ and $\boldsymbol{\alpha}_j^c$ respectively, so that they sum to $\mathbf{1}$:

$$\begin{aligned} \boldsymbol{\alpha}_{b, j}^c &=\frac{\exp \left(\mathbf{i}_{b, j}^c\right)}{\exp \left(\mathbf{i}_j^c\right)+\sum_{b^{\prime} \in\left\{b^{\prime \prime} \mid w_{b^{\prime \prime}, j}^d \in \mathbb{D}\right\}} \exp \left(\mathbf{i}_{b^{\prime}, j}^c\right)} \\ \boldsymbol{\alpha}_j^c &=\frac{\exp \left(\mathbf{i}_j^c\right)}{\exp \left(\mathbf{i}_j^c\right)+\sum_{b^{\prime} \in\left\{b^{\prime \prime} \mid w_{b^{\prime \prime}, j}^d \in \mathbb{D}\right\}} \exp \left(\mathbf{i}_{b^{\prime}, j}^c\right)} \end{aligned}$$

Taking 南京市长江大桥 as an example, the cell state of 桥 is computed as the weighted sum of the word cell states of 长江大桥 and 大桥 and the candidate cell state of 桥 itself; the weighting coefficients are obtained by applying a Softmax over the gates of the character and the words.
It should be pointed out that the formula above is used only when some lexicon word ends at the current character; if no word ends there, the native LSTM update is used instead. In other words, when word information is present, Lattice LSTM does not use the memory vector $\mathbf{c}_{j-1}^c$ of the previous step, i.e., no continuous memory of lexical information is kept.
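A minimal NumPy sketch of this merge step at character $c_j$: it falls back to the native LSTM update when no lexicon word ends at $c_j$, and otherwise ignores $\mathbf{c}_{j-1}^c$ and combines the word cells with the candidate state through the normalized gates (all shapes are illustrative assumptions).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def char_cell_state(x_c, c_prev, i_c, f_c, c_tilde, word_cells, W_l, b_l):
    """Cell state c^c_j of character c_j.

    x_c        : character embedding x^c_j
    c_prev     : previous character cell c^c_{j-1}
    i_c, f_c   : input / forget gates of c_j (after the sigmoid)
    c_tilde    : candidate state c~^c_j
    word_cells : list of c^w_{b,j} for every lexicon word ending at j
    """
    if not word_cells:                       # no matched word: native LSTM update
        return f_c * c_prev + i_c * c_tilde
    # extra gate i^c_{b,j} for each word cell
    gates = [sigmoid(W_l @ np.concatenate([x_c, c_w]) + b_l) for c_w in word_cells]
    # element-wise normalization of the word gates together with the char input gate
    stacked = np.stack(gates + [i_c])        # (k+1, d)
    alphas = np.exp(stacked) / np.exp(stacked).sum(axis=0, keepdims=True)
    parts = word_cells + [c_tilde]
    return sum(a * p for a, p in zip(alphas, parts))   # weighted sum -> c^c_j
```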

Decoding and Training

In the decoding part, a CRF layer is applied on top of $\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_{\tau}$ to compute the output probability, and the label sequence with the highest probability, $\operatorname{argmax} P(y \mid s)$, is selected. For the Character-Based Model $\tau$ is $m$, and for the Word-Based Model $\tau$ is $n$. The probability of a label sequence $y$ is computed as follows:

$$P(y \mid s)=\frac{\exp \left(\sum_i\left(\mathbf{W}_{\mathrm{CRF}}^{l_i} \mathbf{h}_i+b_{\mathrm{CRF}}^{\left(l_{i-1}, l_i\right)}\right)\right)}{\sum_{y^{\prime}} \exp \left(\sum_i\left(\mathbf{W}_{\mathrm{CRF}}^{l_i^{\prime}} \mathbf{h}_i+b_{\mathrm{CRF}}^{\left(l_{i-1}^{\prime}, l_i^{\prime}\right)}\right)\right)}$$

The first-order Viterbi algorithm is used to find the highest-scoring label sequence, and the training objective is the sentence-level log-likelihood with L2 regularization:

$$L=\sum_{i=1}^N \log \left(P\left(y_i \mid s_i\right)\right)+\frac{\lambda}{2}\|\Theta\|^2$$
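For reference, a minimal NumPy sketch of the sentence log-probability $\log P(y \mid s)$ under this CRF, computed as the gold-path score minus the log-partition obtained with the forward algorithm; Viterbi decoding and the L2 term are omitted, and the shapes are illustrative assumptions.

```python
import numpy as np

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (np.log(np.exp(a - m).sum(axis=axis, keepdims=True)) + m).squeeze(axis)

def crf_log_prob(h, y, W_crf, b_crf):
    """log P(y | s) for one sentence.

    h     : (T, d) hidden states from the (lattice) LSTM
    y     : (T,)   gold label indices
    W_crf : (L, d) per-label emission weights W_CRF^{l}
    b_crf : (L, L) transition scores b_CRF^{(l_{i-1}, l_i)}
    """
    emissions = h @ W_crf.T                       # (T, L) emission scores
    gold = emissions[np.arange(len(y)), y].sum() + b_crf[y[:-1], y[1:]].sum()
    log_alpha = emissions[0]                      # forward algorithm over all paths
    for t in range(1, len(y)):
        log_alpha = emissions[t] + logsumexp(log_alpha[:, None] + b_crf, axis=0)
    return float(gold - logsumexp(log_alpha, axis=0))

# usage with illustrative sizes: 7 characters, 4 labels, 100-dim hidden states
rng = np.random.default_rng(0)
h, y = rng.normal(size=(7, 100)), np.array([0, 1, 2, 3, 0, 1, 2])
print(crf_log_prob(h, y, rng.normal(size=(4, 100)), rng.normal(size=(4, 4))))
```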

Shortcomings:

  • Low computational efficiency: batch parallelization is difficult, mainly because the number of word cells (nodes) inserted between characters varies from position to position.
  • Information loss: 1) each character can only use the lexical information of words that end with it, and no continuous memory of earlier lexical information is kept; for example, for the character 药 ("medicine"), the 'inside' information of 人和药店 ("Renhe Pharmacy") cannot be obtained. 2) Due to the nature of RNNs, the forward and backward word information cannot be shared when a BiLSTM is used.
  • Poor transferability: the method is tied to the LSTM architecture and cannot easily be transferred to other network structures.

Origin: blog.csdn.net/ljp1919/article/details/126657649