A brief history of the development of named entity recognition (NER)

In recent years, deep learning methods based on neural networks have achieved great success in computer vision and speech recognition, and have also made considerable progress in natural language processing. Deep learning has likewise produced good results on Named Entity Recognition (NER), a key foundational NLP task. I have recently read a series of papers on deep-learning-based NER and applied them in practice to a basic NER module, so here I summarize what I learned and share it with you.

1. Introduction to NER

NER, also known as proper name recognition, is a fundamental task in natural language processing with a wide range of applications. Named entities are generally text spans with a specific meaning or strong referential value, usually including person names, place names, organization names, dates, and proper nouns. An NER system extracts these entities from unstructured input text, and it can be extended to recognize additional entity types according to business needs, such as product names, models, and prices. The notion of an entity can therefore be very broad: any special text fragment that a business needs can be treated as an entity.

Academically, the named entities handled by NER generally fall into 3 broad categories (entities, times, and numbers) and 7 subcategories (person names, place names, organization names, times, dates, currencies, and percentages).

In practical applications, an NER model usually only needs to recognize person names, place names, organization names, and dates; some systems also return proper nouns (such as abbreviations, conference names, and product names). Numeric entities such as currencies and percentages can be extracted with regular expressions. In addition, some application scenarios call for domain-specific entities such as book titles, song titles, and journal titles.
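For the regular-expression route to numeric entities mentioned above, here is a minimal sketch; the patterns are purely illustrative and not taken from the original article:

```python
import re

# Illustrative patterns for numeric entities; real systems need more robust rules.
CURRENCY = re.compile(r'[$¥€£]\s?\d+(?:,\d{3})*(?:\.\d+)?')   # e.g. "$1,200.50"
PERCENT  = re.compile(r'\d+(?:\.\d+)?\s?%')                    # e.g. "3.5%"

text = "The price rose 3.5% to $1,200.50 last quarter."
print(PERCENT.findall(text))    # ['3.5%']
print(CURRENCY.findall(text))   # ['$1,200.50']
```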

NER is a basic, key task in NLP. From the perspective of natural language processing, NER can be viewed as a form of unregistered (out-of-vocabulary) word recognition in lexical analysis; among unregistered words, named entities are the most numerous, the hardest to recognize, and have the largest impact on word segmentation quality. NER is also the foundation of many downstream NLP tasks such as relation extraction, event extraction, knowledge graph construction, machine translation, and question answering.

NER is not currently a hot research topic, since some researchers consider it a solved problem; others argue that it has not been solved well. Their main reasons are: good NER results have only been obtained on limited text types (mainly news corpora) and entity categories (mainly person, place, and organization names); compared with other information retrieval tasks, NER evaluation corpora are small, which makes overfitting easy; NER emphasizes high recall, whereas information retrieval places more weight on high precision; and systems that try to recognize many entity types tend to perform poorly.

2. Application of deep learning methods in NER

NER has always been a research hotspot in the field of NLP. From early dictionary-based and rule-based methods, to traditional machine learning methods, to deep learning-based methods in recent years, the general trend of NER research progress is roughly shown in the following figure.

Figure 1: NER development trend

In machine-learning-based approaches, NER is treated as a sequence labeling problem: an annotation model is learned from a large-scale corpus and used to tag each position in a sentence. Commonly used models for NER include the generative HMM and the discriminative CRF. The Conditional Random Field (CRF) has been the mainstream model for NER. Its objective function considers not only input state feature functions but also label transition feature functions, and the model parameters can be learned with SGD during training. Given a trained model, predicting the output sequence for an input sequence, i.e., finding the label sequence that maximizes the objective function, is a dynamic programming problem that can be solved with the Viterbi algorithm. The advantage of CRF is that it can exploit rich internal and contextual feature information when labeling each position.

Figure 2: A linear chain conditional random field
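As a minimal sketch of the Viterbi decoding step described above, assuming per-token emission scores and a tag-to-tag transition matrix (illustrative, not the exact formulation of any particular CRF library):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence.

    emissions:   (seq_len, num_tags) per-token tag scores
    transitions: (num_tags, num_tags) score of moving from tag i to tag j
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()                 # best score ending in each tag at step 0
    backpointers = []

    for t in range(1, seq_len):
        # total[i, j] = score[i] + transitions[i, j] + emissions[t, j]
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)

    # Follow back-pointers from the best final tag.
    best_tag = int(score.argmax())
    best_path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        best_path.append(best_tag)
    return list(reversed(best_path))
```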

In recent years, with advances in hardware and the introduction of word embeddings, neural networks have become effective for many NLP tasks. These methods handle sequence labeling tasks (such as CWS, POS tagging, and NER) in a similar way: tokens are mapped from discrete one-hot representations into a low-dimensional dense embedding space, the embedding sequence of a sentence is fed into an RNN, the network automatically extracts features, and a softmax layer predicts the label of each token.

This makes model training an end-to-end process rather than a traditional pipeline; it does not rely on feature engineering and is data-driven. However, there are many network variants, performance depends heavily on parameter settings, and the models are hard to interpret. Another drawback is that each token is labeled independently: previously predicted labels cannot be used directly (their information is only passed implicitly through hidden states), so the predicted label sequence may be invalid. For example, under the BIO scheme an I-PER tag cannot directly follow a B-LOC tag, but softmax alone does not use this constraint.

To address this, the research community proposed DL-CRF models for sequence labeling: a CRF layer (which exploits label transition probabilities) is attached to the output layer of the neural network to make sentence-level label predictions, so that the labeling process no longer classifies each token independently.
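One common way to encode such constraints is to give forbidden BIO transitions a very low transition score so the decoder never selects them. The sketch below uses made-up tags and is only an illustration, not necessarily what the cited work does:

```python
import numpy as np

TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
NEG_INF = -1e9

def allowed(prev_tag, next_tag):
    """BIO constraint: I-X may only follow B-X or I-X."""
    if next_tag.startswith("I-"):
        entity = next_tag[2:]
        return prev_tag in (f"B-{entity}", f"I-{entity}")
    return True

# Transition score matrix; forbidden transitions get a very low score so the
# CRF/Viterbi decoder will never choose them.
transitions = np.zeros((len(TAGS), len(TAGS)))
for i, prev in enumerate(TAGS):
    for j, nxt in enumerate(TAGS):
        if not allowed(prev, nxt):
            transitions[i, j] = NEG_INF

print(transitions[TAGS.index("B-LOC"), TAGS.index("I-PER")])  # -1000000000.0
```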

2.1 BiLSTM-CRF

The Long Short-Term Memory network, usually called LSTM, is a special type of RNN that can learn long-distance dependencies. LSTM was proposed by Hochreiter & Schmidhuber (1997) and later refined and popularized by Alex Graves. LSTM has achieved considerable success on many problems and is widely used; it addresses the long-distance dependency problem through a carefully designed cell.

All RNNs consist of a chain of repeating neural network units. In a standard RNN, this repeating unit has a very simple structure, such as a single tanh layer.

Figure 3: Traditional RNN structure

LSTM has the same chain structure, but the repeating unit is different: instead of a single layer as in an ordinary RNN, there are four layers that interact in a special way.

Figure 4: LSTM structure

LSTM uses three gate structures (an input gate, a forget gate, and an output gate) to selectively forget part of the historical information, add part of the current input, combine them into the current cell state, and produce an output state.

Figure 5: Each gating structure of LSTM
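For reference, the standard formulation of these gates (with \(\sigma\) the sigmoid function and \(\odot\) element-wise multiplication):

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{candidate cell state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{new cell state} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden/output state}
\end{aligned}
```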

The biLSTM-CRF model used for NER consists mainly of an embedding layer (character vectors, word vectors, and some additional features), a bidirectional LSTM layer, and a final CRF layer. Experimental results show that biLSTM-CRF matches or exceeds a CRF model built on rich hand-crafted features, and it has become the most mainstream deep-learning-based NER model. On the feature side, the model inherits the advantages of deep learning: no feature engineering is required, word and character vectors alone already give good results, and high-quality dictionary features can improve them further.

Figure 6: Schematic diagram of biLSTM-CRF
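A minimal PyTorch sketch of this architecture, assuming the third-party pytorch-crf package for the CRF layer; the layer sizes and names are illustrative, not taken from the article:

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party package: pip install pytorch-crf

class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                              bidirectional=True)
        self.emission = nn.Linear(hidden_dim, num_tags)   # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, token_ids):
        out, _ = self.bilstm(self.embedding(token_ids))
        return self.emission(out)

    def loss(self, token_ids, tags, mask):
        # mask is a bool tensor marking real (non-padding) tokens.
        # pytorch-crf returns the log-likelihood; negate it for a loss.
        return -self.crf(self._emissions(token_ids), tags, mask=mask)

    def predict(self, token_ids, mask):
        # Viterbi decoding inside the CRF layer.
        return self.crf.decode(self._emissions(token_ids), mask=mask)
```

During training the negative log-likelihood returned by loss() is minimized; at inference predict() runs Viterbi decoding inside the CRF layer and returns the best tag sequence for each sentence.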

2.2 IDCNN-CRF

For sequence labeling, an ordinary CNN has a drawback: after convolution, a neuron in the last layer may only see a small window of the original input. For NER, every word in the input sentence may affect the label of the current position, which is the long-distance dependency problem. Covering all of the input requires stacking more convolutional layers, which makes the network deeper and adds more parameters; to prevent overfitting, more regularization such as dropout is needed, bringing more hyperparameters, and the whole model becomes large and hard to train. Because of these drawbacks, for most sequence labeling problems people still choose an architecture such as biLSTM and rely on its memory to retain as much of the sentence as possible when labeling the current word.

But this brings another problem: biLSTM is inherently a sequential model and cannot exploit GPU parallelism as well as a CNN. How can we give the GPU the full parallelism of a CNN while remembering as much of the input as an LSTM does, with a similarly simple structure?

Fisher Yu and Vladlen Koltun proposed the dilated CNN in 2015. The idea is not complicated: a normal CNN filter is applied to a contiguous region of the input matrix and slides continuously during convolution, whereas a dilated CNN adds a dilation width to the filter, so that when it is applied to the input it skips the inputs that fall inside the dilation gaps. The filter itself keeps the same size, but it now sees a wider region of the input matrix, so it appears "dilated".

In practice, the dilation width grows exponentially with the layer number. In this way, as the number of layers increases, the number of parameters grows linearly while the receptive field grows exponentially, quickly covering the entire input.

Figure 7: Schematic diagram of IDCNN

As can be seen in Figure 7, the receptive field expands at an exponential rate, starting from the 1x1 region at the center:

(A) Diffusing from the original receptive field with a step (dilation width) of 1 adds the 8 surrounding 1x1 cells, forming a new 3x3 receptive field;

(B) Diffusing with a step of 2 expands the previous 3x3 receptive field to 7x7;

(C) Diffusing with a step of 4 expands the 7x7 receptive field to 15x15. The parameters of each layer are independent of one another; the receptive field grows exponentially while the number of parameters grows only linearly.
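A quick check of this arithmetic, assuming 3x3 filters and dilation widths of 1, 2, and 4 as in Figure 7:

```python
# Receptive field of stacked 3x3 dilated convolutions: each layer with
# dilation d extends the field by 2*d on each axis.
field = 1
for dilation in (1, 2, 4):
    field += 2 * dilation
    print(f"dilation {dilation}: receptive field {field}x{field}")
# dilation 1: receptive field 3x3
# dilation 2: receptive field 7x7
# dilation 4: receptive field 15x15
```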

Applied to text, the input is a one-dimensional sequence in which each element is a character embedding:

Figure 8: An IDCNN block with a maximum dilation width of 4

IDCNN produces logits for each word of the input sentence, exactly like the logits output by the biLSTM model; a CRF layer is then attached, and the Viterbi algorithm decodes the final label sequence.
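A minimal PyTorch sketch of such a block over character embeddings, using dilation widths (1, 1, 2) in the spirit of the iterated dilated CNN; the dimensions and single-block setup are illustrative assumptions:

```python
import torch
import torch.nn as nn

class IDCNNBlock(nn.Module):
    """One iterated-dilated-CNN block: stacked 1-D convolutions whose
    dilation widths grow, so the receptive field covers the whole sentence."""

    def __init__(self, embed_dim=100, filters=128, num_tags=9, dilations=(1, 1, 2)):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = embed_dim
        for d in dilations:
            # kernel_size=3 with padding=d keeps the sequence length unchanged.
            self.convs.append(nn.Conv1d(in_ch, filters, kernel_size=3,
                                        padding=d, dilation=d))
            in_ch = filters
        self.logits = nn.Linear(filters, num_tags)  # per-token tag scores

    def forward(self, char_embeddings):            # (batch, seq_len, embed_dim)
        x = char_embeddings.transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
        for conv in self.convs:
            x = torch.relu(conv(x))
        x = x.transpose(1, 2)                      # back to (batch, seq_len, filters)
        return self.logits(x)                      # (batch, seq_len, num_tags)
```

The per-token logits returned here play the same role as the biLSTM emissions above and can be fed into the same CRF layer for Viterbi decoding.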

Connecting a CRF layer to the end of a network such as biLSTM or IDCNN is a very common approach to sequence labeling: biLSTM or IDCNN computes a score for each possible tag of each token, the CRF layer adds the transition scores between tags, and the loss is computed from both and back-propagated through the network.
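Concretely, writing P_{t,y_t} for the network's score of tag y_t at position t and A_{y_{t-1},y_t} for the learned transition score, the sentence score and training loss are typically defined as (standard formulation, e.g. Lample et al., 2016):

```latex
s(X, y) = \sum_{t=1}^{n} \left( A_{y_{t-1}, y_t} + P_{t, y_t} \right), \qquad
\mathcal{L} = -\log p(y \mid X) = -s(X, y) + \log \sum_{y'} \exp\big(s(X, y')\big)
```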

3. Practical application

3.1 Corpus preparation

Embedding: We chose the Chinese Wikipedia corpus to train character vectors and word vectors.

Basic corpus: We selected the 1998 People's Daily annotated corpus as the basic training corpus.

Additional corpus: The 1998 corpus is an official corpus, so its authority and annotation quality are guaranteed. However, because it is drawn entirely from the People's Daily and is quite old, its coverage of entity types is relatively low; for example, it misses newer company names, foreign person names, and foreign place names. To improve recognition of newer entity types, we collected an additional batch of annotated news corpora, mainly covering finance, entertainment, and sports, which is exactly what the 1998 corpus lacks. Due to annotation quality concerns, not too much extra corpus can be added; we used about 1/4 the size of the 1998 corpus.

3.2 Data enhancement

Deep learning methods generally require a large amount of annotated data; otherwise they easily overfit and fail to reach the expected generalization ability. In our experiments we found that data augmentation can significantly improve model performance. Specifically, we split the original corpus, randomly formed bigram and trigram combinations of each sentence, and used these together with the original sentences as training corpus.

In addition, we used a collected named entity dictionary to randomly replace entities in the corpus with other entities of the same type, yielding an augmented corpus.
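A hedged sketch of this dictionary-based replacement; the toy dictionary, BIO tag format, and sampling probability below are illustrative assumptions, not the exact procedure used here:

```python
import random

# Toy entity dictionary: same-type candidates for replacement.
ENTITY_DICT = {
    "PER": ["张三", "李四", "王五"],
    "LOC": ["北京", "上海", "深圳"],
}

def augment(tokens, tags, dictionary=ENTITY_DICT, p=0.3):
    """Randomly replace each entity span with another entity of the same type.

    tokens: list of characters, tags: BIO tags aligned with tokens.
    Returns a new (tokens, tags) pair; each span is replaced with probability p.
    """
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            ent_type = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{ent_type}":
                j += 1
            if ent_type in dictionary and random.random() < p:
                replacement = random.choice(dictionary[ent_type])
                new_tokens.extend(list(replacement))
                new_tags.extend([f"B-{ent_type}"] + [f"I-{ent_type}"] * (len(replacement) - 1))
            else:
                new_tokens.extend(tokens[i:j])
                new_tags.extend(tags[i:j])
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tags[i])
            i += 1
    return new_tokens, new_tags
```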

The figures below show the training curves of the two models. The BiLSTM-CRF model converges quite slowly; in contrast, the IDCNN-CRF model converges much faster.

Figure 9: BiLSTM-CRF training curve

Figure 10: IDCNN-CRF training curve

3.3 Examples

The following is an example prediction result using the BiLSTM-CRF model.

Figure 11: BiLSTM-CRF prediction example

4. Summary

Finally, to summarize: CNN/RNN-CRF models that combine a neural network with a CRF have become the mainstream NER models. Between CNN and RNN, neither has an absolute advantage; each has its own strengths. Because RNNs have a natural sequential structure, RNN-CRF is more widely used. Neural-network-based NER inherits the advantages of deep learning methods: no large amount of hand-crafted features is required, word and character vectors alone reach mainstream performance, and adding high-quality dictionary features can further improve the results. For problems with small labeled training sets, transfer learning and semi-supervised learning should be a focus of future research.

