Paper notes: Chinese NER Using Lattice LSTM

 

Overview:

English NER: the current best models are LSTM-CRF architectures (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Liu et al., 2018).

Chinese NER: LSTM-CRF can also be applied, but Chinese NER is tied to word segmentation. An intuitive approach is to perform word segmentation first and then apply word-level sequence labeling; a purely character-based LSTM-CRF, by contrast, cannot use the word information in a sentence.

Disadvantage of character-based NER: explicit word and word-sequence information has potential value but goes unused. Even so, studies have shown that for Chinese NER, character-based methods outperform word-based methods (He and Wang, 2008; Liu et al., 2010; Li et al., 2014).

The segmentation → NER pipeline, however, suffers from error propagation: named entities are an important source of OOV words for segmenters, and incorrectly segmented entity boundaries lead to NER errors. Once the segmenter makes a mistake, it directly affects the prediction of entity boundaries and causes recognition errors, which is a serious problem in the open domain.

Motivation:

A lattice LSTM is used to represent the lexicon words matched in a sentence (e.g. 南京 "Nanjing", 南京市 "Nanjing City", 市长 "mayor", ...), integrating the latent word information into the character-based LSTM-CRF. Since there is an exponential number of word-character paths in the lattice, the authors use the lattice LSTM structure to automatically control the flow of information from the beginning of the sentence to the end. As shown in Figure 2, gating units dynamically route information from the different paths to each character, so the model is not affected by word segmentation errors.

                                                                            Figure 2: Lattice LSTM structure

 

Overall model:

The overall model is divided into 3 parts: (a) character-based model; (b) word-based model; (c) Lattice model

(a): Character-based model

The embedding layer can be built in the following ways (formulas are sketched after the list):

① Char Embedding

② Char + bichar Embedding

Concatenate the single-character embedding with the bigram embedding of the current character and the next character to form the overall embedding.

③ Char + softword Embedding

Concatenate the single-character embedding with an embedding of the segmentation label assigned to the current character by a word segmenter to form the overall embedding.
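A sketch of the three variants in the paper's notation, where $e^c$, $e^b$ and $e^s$ are the character, bigram and segmentation-label embedding tables and $\mathrm{seg}(c_j)$ is the segmentation label of character $c_j$:

$$x_j^c = e^c(c_j) \qquad \text{(char)}$$

$$x_j^c = [e^c(c_j); e^b(c_j, c_{j+1})] \qquad \text{(char + bichar)}$$

$$x_j^c = [e^c(c_j); e^s(\mathrm{seg}(c_j))] \qquad \text{(char + softword)}$$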

(b): Word-based model

Embedding of $w_i$: each word in the segmented sentence is represented as $x_i^w = e^w(w_i)$, where $e^w$ is the word embedding lookup table.

Similarly, in the embedding layer, the word embedding can be combined with a representation of the characters contained in the word, namely $x_i^w = [e^w(w_i); x_i^{ch}]$.

There are several ways to obtain the character representation $x_i^{ch}$ of the current word (sketched after the list):

① word+char LSTM

Run a bidirectional LSTM over the characters of the word and concatenate the resulting boundary hidden states to obtain $x_i^{ch}$.

② word+char LSTM’

The same idea as ①, but the LSTM composition of the character hidden states differs slightly.

③ word+char CNN

Apply a standard CNN over the character embeddings of the current word, followed by max pooling, to obtain $x_i^{ch}$.
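A sketch of ① in the paper's notation, concatenating the last forward and the first backward character hidden states of the bidirectional LSTM ($\mathrm{len}(i)$ is the number of characters in $w_i$):

$$x_i^{ch} = [\overrightarrow{h}^{ch}_{i,\mathrm{len}(i)}; \overleftarrow{h}^{ch}_{i,1}]$$

For ③, the CNN output is max-pooled over all character positions of the word to give a fixed-size $x_i^{ch}$.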

 

(c): Lattice model

On top of the character-based model, word cells for lexicon matches and additional gates that control the flow of information are added.

The input to the model is the character sequence together with all of its subsequences that match words in a lexicon D. The model involves four types of vectors: ① input vectors; ② output hidden vectors; ③ cell vectors; ④ gate vectors.
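In the paper's notation, $w_{b,e}$ denotes the subsequence running from character index $b$ to $e$ that matches a word in D, with input vector

$$x_{b,e}^w = e^w(w_{b,e}),$$

while each character $c_j$ has input vector $x_j^c = e^c(c_j)$, hidden vector $h_j^c$, and cell vector $c_j^c$.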

The basic LSTM unit corresponding to each character is the standard LSTM recurrence.
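From the paper, with $\sigma$ the sigmoid function and $\odot$ element-wise multiplication:

$$\begin{bmatrix} i_j^c \\ o_j^c \\ f_j^c \\ \tilde{c}_j^c \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W^{c\top} \begin{bmatrix} x_j^c \\ h_{j-1}^c \end{bmatrix} + b^c \right)$$

$$c_j^c = f_j^c \odot c_{j-1}^c + i_j^c \odot \tilde{c}_j^c$$

$$h_j^c = o_j^c \odot \tanh(c_j^c)$$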

The cell of the LSTM unit corresponding to a lexicon word $w_{b,e}$ is computed from the word input and the hidden state of the word's first character.
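From the paper:

$$\begin{bmatrix} i_{b,e}^w \\ f_{b,e}^w \\ \tilde{c}_{b,e}^w \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W^{w\top} \begin{bmatrix} x_{b,e}^w \\ h_b^c \end{bmatrix} + b^w \right)$$

$$c_{b,e}^w = f_{b,e}^w \odot c_b^c + i_{b,e}^w \odot \tilde{c}_{b,e}^w$$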

This word cell has no output gate, because the final labeling is performed over characters rather than words.

The ending character of a word may receive information along multiple paths: in 南京市长江大桥 (Nanjing Yangtze River Bridge), for example, the word information of both 大桥 (bridge) and 长江大桥 (Yangtze River Bridge) flows into the representation of the character 桥 (bridge). An additional gate is therefore used to control the weight of each word.
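From the paper, this extra gate combines the character input $x_e^c$ with the word cell $c_{b,e}^w$:

$$i_{b,e}^c = \sigma \left( W^{l\top} \begin{bmatrix} x_e^c \\ c_{b,e}^w \end{bmatrix} + b^l \right)$$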

The overall cell value of character $c_j$ is composed of the character's own candidate cell value and the cell values of all lexicon words ending at position $j$.
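From the paper, with the weights $\alpha$ defined below:

$$c_j^c = \sum_{b \in \{b' \mid w_{b',j} \in D\}} \alpha_{b,j}^c \odot c_{b,j}^w + \alpha_j^c \odot \tilde{c}_j^c$$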

The weights for the character and word cell values are obtained by normalizing the corresponding gates against each other.
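From the paper, the normalization is a softmax over all sources flowing into position $j$:

$$\alpha_{b,j}^c = \frac{\exp(i_{b,j}^c)}{\exp(i_j^c) + \sum_{b' \in \{b'' \mid w_{b'',j} \in D\}} \exp(i_{b',j}^c)}$$

$$\alpha_j^c = \frac{\exp(i_j^c)}{\exp(i_j^c) + \sum_{b' \in \{b'' \mid w_{b'',j} \in D\}} \exp(i_{b',j}^c)}$$

A minimal NumPy sketch of this combination step, assuming the gate and cell vectors have already been computed (function and variable names are my own, not from the paper's code):

```python
import numpy as np

def lattice_cell_combine(char_gate, char_candidate, word_gates, word_cells):
    """Mix the character candidate cell with word cells ending here.

    char_gate      -- char input gate i_j^c (sigmoid output), shape (d,)
    char_candidate -- candidate cell tilde-c_j^c, shape (d,)
    word_gates     -- list of extra gates i_{b,j}^c, each shape (d,)
    word_cells     -- list of word cells c_{b,j}^w, each shape (d,)
    """
    # Stack the char gate with all word gates and exp-normalize element-wise,
    # so the alpha weights sum to 1 across all incoming paths per dimension.
    gates = np.stack([char_gate] + list(word_gates))       # (1 + n_words, d)
    alphas = np.exp(gates) / np.exp(gates).sum(axis=0)     # softmax over paths

    cells = np.stack([char_candidate] + list(word_cells))  # (1 + n_words, d)
    # c_j^c = alpha_j * tilde-c_j^c + sum_b alpha_{b,j} * c_{b,j}^w
    return (alphas * cells).sum(axis=0)

# Toy usage: one lexicon word ends at the current character position.
d = 4
rng = np.random.default_rng(0)
c_j = lattice_cell_combine(rng.normal(size=d), rng.normal(size=d),
                           [rng.normal(size=d)], [rng.normal(size=d)])
print(c_j.shape)  # (4,)
```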

(d): CRF decoding

Standard CRF layer

Input: the hidden vectors $h$ produced by the lattice model

Output: the probability of a label sequence
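From the paper, the probability of a label sequence $y = l_1, \ldots, l_m$ for sentence $s$ is:

$$P(y \mid s) = \frac{\exp\left( \sum_i \left( W_{CRF}^{\,l_i} h_i + b_{CRF}^{(l_{i-1}, l_i)} \right) \right)}{\sum_{y'} \exp\left( \sum_i \left( W_{CRF}^{\,l'_i} h_i + b_{CRF}^{(l'_{i-1}, l'_i)} \right) \right)}$$

where $y'$ ranges over all possible label sequences, $W_{CRF}^{\,l_i}$ is a parameter vector specific to label $l_i$, and $b_{CRF}^{(l_{i-1}, l_i)}$ is a transition score.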

 

Loss function: sentence-level log-likelihood with L2 regularization
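From the paper, with $\lambda$ the L2 regularization weight and $\Theta$ the parameter set:

$$L = \sum_{i=1}^{N} \log \left( P(y_i \mid s_i) \right) + \frac{\lambda}{2} \lVert \Theta \rVert^2$$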

 


Source: blog.csdn.net/qq_22472047/article/details/109113252