NLP sequence labeling summary (there is no good summary out there, so this is a first attempt)

Sequence labeling (also called tagging)

The four basic tasks of NLP: https://blog.csdn.net/savinger/article/details/89302956

 

Sequence labeling (also called tagging)

Part of Speech (POS)

Information extraction (IE)

(1) Named entity recognition (NER)

(2) Relation extraction

(3) Event extraction

(4) Information integration

Hidden Markov Model (HMM)

1. Markov model

2. Hidden Markov Model

(1) HMM evaluation problem

a. Forward algorithm

b. Backward algorithm

(2) HMM decoding problem

(3) HMM parameter learning

Application of Hidden Markov Model in Sequence Labeling

Conditional Random Field Model (CRF)

1. Maximum entropy model (in the maximum entropy model, the outputs are independent of each other)

2. Conditional Random Field (CRF)



Sequence tagging problems mainly include POS tagging, semantic role tagging, and information extraction.

 

  • Part of Speech (POS)

Given a sentence that has already been segmented into words, the purpose of part-of-speech tagging is to assign a category to each word. This category is called the part-of-speech tag, e.g. noun, verb, adjective. Part-of-speech tagging is a very typical sequence labeling problem.
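As a quick illustration (not from the original post), here is a minimal POS-tagging sketch assuming NLTK and its standard tagger models are installed; the example sentence is made up:

```python
# Minimal POS-tagging sketch using NLTK's off-the-shelf tagger (assumed installed).
import nltk

# One-time resource downloads (uncomment on first run):
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)   # split the sentence into words
tags = nltk.pos_tag(tokens)             # assign one POS tag per token
print(tags)  # e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ...]
```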


 

 

Information extraction (IE)

 

An information extraction system processes a variety of unstructured/semi-structured text inputs (such as news pages, product pages, microblogs, and forum pages), uses multiple techniques (such as rule-based methods, statistical methods, and knowledge-mining methods) to extract the specified structured information (such as entities, relations, product records, lists, and attributes), and integrates this information at different levels (knowledge deduplication, knowledge linking, knowledge-base construction, etc.).

According to the types of information extracted, the core research topics of information extraction can currently be divided into named entity recognition (NER), relation extraction, event extraction, and information integration.

 

(1) Named entity recognition (NER)

The purpose of named entity recognition is to identify entities of specified categories in the text, such as person names, place names, organization names, and proper nouns.

A named entity recognition system usually consists of two parts: entity boundary recognition and entity classification. Entity boundary recognition determines whether a string is an entity, and entity classification assigns the recognized entities to predefined categories.

The main difficulty of named entity recognition lies in open-domain named entity categories (such as movie and song names), which are expressed irregularly and lack training corpora.

As shown below: person name recognition

As shown below: Organization name recognition


 

 

BiLSTM-CRF model word-based Chinese named entity recognition https://blog.csdn.net/weixin_34004576/article/details/93472426?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522159955270019724839255548%2522%252C%2522scm% 2522%253A%252220140713.130102334..%2522%257D&request_id=159955270019724839255548&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduend~default-4-93472426.pc_ecpm_v3_pc_rank_term=nF%E5%E5% 88%97%E6%A0%87%E6%B3%A8&spm=1018.2118.3001.4187

Named Entity Recognition

Difficulties:

(1) Unlike English, Chinese text has no spaces to mark word boundaries, "word" is itself a rather vague concept in Chinese, and Chinese lacks morphological cues such as the letter case found in English.

(2) Chinese words are flexible and changeable. Without context, some words cannot be judged to be named entities or not, and even when they are named entities, they may belong to different entity types in different contexts.

(3) Named entities can be nested. For example, the organization name "Peking University Third Hospital" contains "Peking University", which is itself an organization name; this phenomenon is especially common in organization names.

(4) Abbreviated expressions are widespread in Chinese, for example abbreviations of "The Third Hospital of Beijing Medical University" or "National University of Science and Technology", and some named entities are themselves composed of abbreviations, such as "National Science Bridge".

-------------------------------

Named entity recognition: building an NER recognizer

Process


 

What does the training data look like


Column C: part of speech.
Column D: entity category; O means the token is not part of an entity, B marks the Beginning of an entity, and I marks the Inside (remaining part) of an entity. (This is the B/I/O scheme; other schemes exist, such as B/M/E/O.)
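For illustration, a few rows of such training data might look like the following (the tokens, part-of-speech tags, and entity tags here are all made up):

```python
# Hypothetical rows of BIO-tagged training data: (token, POS tag, entity tag).
# The column layout mirrors the description above (part of speech, then entity tag).
training_sentence = [
    ("John",   "NNP", "B-PER"),   # B = beginning of a person entity
    ("Smith",  "NNP", "I-PER"),   # I = inside (continuation) of the same entity
    ("works",  "VBZ", "O"),       # O = not part of any entity
    ("at",     "IN",  "O"),
    ("Google", "NNP", "B-ORG"),   # single-token organization entity
    (".",      ".",   "O"),
]
```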

Evaluate NER recognizer

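NER systems are usually evaluated with entity-level precision, recall, and F1. Below is a small sketch assuming the seqeval package is available; the gold and predicted tag sequences are invented:

```python
# Entity-level evaluation sketch using the seqeval package (assumed installed);
# precision/recall/F1 are computed over whole entities, not individual tokens.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-PER", "I-PER", "O", "O", "B-ORG", "O"]]   # gold tags, one list per sentence
y_pred = [["B-PER", "I-PER", "O", "O", "O",     "O"]]   # model output

print(f1_score(y_true, y_pred))               # overall entity-level F1
print(classification_report(y_true, y_pred))  # per-entity-type precision / recall / F1
```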

NER methods


 

-Reference: Detailed explanation of named entity recognition https://blog.csdn.net/weixin_46425692/article/details/108269912?biz_id=102&utm_term=%E5%91%BD%E5%90%8D%E5%AE%9E%E4 %BD%93%E8%AF%86%E5%88%AB&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-1-108269912&spm=1018.2118.3001.4187

--------------------

BiLSTM-CRF model

The schematic diagram is as follows:

  • First, each word in the sentence x is represented as a vector that combines the aforementioned word embedding and character embedding; the character embedding is initialized randomly, while the word embedding is usually initialized from a pre-trained model. All embeddings are fine-tuned during training.
  • Second, the input of the BiLSTM-CRF model is these embeddings, and the output is the predicted label of each word in the sentence x.

As can be seen from the figure above, the output of the BiLSTM layer is a score for each label. For the word w0, for example, the BiLSTM outputs are 1.5 (B-Person), 0.9 (I-Person), 0.1 (B-Organization), 0.08 (I-Organization), and 0.05 (O).

These scores are the input to the CRF layer.
The scores predicted by the BiLSTM layer are fed into the CRF layer, and the tag sequence with the highest overall score is taken as the model's best prediction.

(The technology involved in CRF is mentioned below)
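As a concrete picture of the pipeline described above, here is a minimal sketch of the BiLSTM part that produces the per-word label scores consumed by the CRF layer. It assumes PyTorch; the dimensions and the 5-label tag set (B/I-Person, B/I-Organization, O) are illustrative only:

```python
# Minimal BiLSTM emission-score sketch (assumed PyTorch); character embeddings
# and the CRF layer itself are omitted for brevity.
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_tags=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        self.to_tag_scores = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)                # (batch, seq_len, hidden_dim)
        return self.to_tag_scores(h)         # emission scores: (batch, seq_len, num_tags)

emitter = BiLSTMEmitter(vocab_size=10_000)
scores = emitter(torch.randint(0, 10_000, (1, 5)))   # one 5-word sentence
print(scores.shape)                                   # torch.Size([1, 5, 5])
```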

What if there is no CRF layer

Following on from the above, suppose there were no CRF layer, i.e. we trained the BiLSTM named entity recognition model as in the following figure:

Because the BiLSTM outputs a score for each label of each word, we could simply choose the label with the highest score for each word as the prediction.
For example, for w0, "B-Person" has the highest score (1.5), so we choose "B-Person" as its predicted label; similarly, the label of w1 is "I-Person", the label of w2 is "O", the label of w3 is "B-Organization", and the label of w4 is "O".

Although this method happens to give the correct labels for x here, in most cases the correct labels cannot be obtained this way, as in the example in the following figure:

Obviously, output tag sequences such as "I-Organization I-Person" and "B-Organization I-Person" are incorrect.

Why use CRF

CRF can learn constraints from training data

The CRF layer adds constraints to the final predicted labels to ensure that the label sequence is valid. These constraints are learned automatically by the CRF layer from the training data.
The constraints may include:

  • The label of the first word in a sentence should be "B-" or "O", not "I-";
  • In "B-label1 I-label2 I-label3 I-...", label1, label2, label3... should be the same named entity label. For example, "B-Person I-Person" is valid, but "B-Person I-Organization" is invalid;
  • "O I-label" is invalid. The first label of a named entity should start with "B-", but not with "I-". In other words, it should be in the mode of "O B-label";

With these constraints, invalid predicted label sequences will be drastically reduced.
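To make these constraints concrete, here is a small hand-written validity check for B/I/O tag sequences. It is only a sketch of what the learned transition scores discourage; in the actual CRF layer the constraints are soft scores learned from data, not hard rules:

```python
# Sketch of the BIO validity constraints described above, checked by hand.
def is_valid_bio(tags):
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # "I-x" must continue an entity of the same type ("B-x" or "I-x").
            if prev not in (f"B-{tag[2:]}", f"I-{tag[2:]}"):
                return False
        prev = tag
    return True

print(is_valid_bio(["B-Person", "I-Person", "O"]))        # True
print(is_valid_bio(["B-Organization", "I-Person", "O"]))  # False (entity type changes mid-entity)
print(is_valid_bio(["O", "I-Person"]))                    # False (an entity must start with "B-")
```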

CRF layer

The loss function of the CRF layer involves two types of scores, and these two scores are the key concepts of the CRF layer.

1. Emission score

The first score is the emission score, which comes from the BiLSTM layer. As shown in Figure 2.1, w0 is marked as B-Person with a score of 1.5.

For convenience in the following description, we give each label an index, as shown in the table below:

Label Index
B-Person 0
I-Person 1
B-Organization 2
I-Organization 3
O 4

We write the emission score as x_{i, y_j}, where i indexes the i-th word and y_j is the label index. For example, according to Figure 2.1, x_{1,2} = x_{w1, B-Organization} = 0.1, which means the score for marking w1 as B-Organization is 0.1.

2. Transition score

Specific reference: CRF layer principle and code understanding of LSTM+CRF model   https://www.cnblogs.com/luckyplj/p/13433397.html

3. CRF loss function
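The reference above works through the derivation; for orientation, the standard linear-chain CRF formulation is sketched below (notation assumed here: E_{i, y_i} is the emission score of label y_i for the i-th word, T_{y_i, y_{i+1}} is the transition score between consecutive labels, and Y_X is the set of all possible tag sequences):

```latex
% Total score of a tag sequence y for sentence X: emission scores plus transition scores.
\mathrm{score}(X, y) = \sum_{i=1}^{n} E_{i,\,y_i} + \sum_{i=1}^{n-1} T_{y_i,\,y_{i+1}}

% The CRF loss is the negative log-probability of the gold tag sequence,
% normalized over all possible tag sequences:
\mathcal{L} = -\log \frac{\exp\big(\mathrm{score}(X, y^{\mathrm{gold}})\big)}
                         {\sum_{\tilde{y} \in Y_X} \exp\big(\mathrm{score}(X, \tilde{y})\big)}
```

Minimizing this loss raises the score of the gold sequence relative to all other sequences; at prediction time, the highest-scoring sequence is found with Viterbi decoding, as noted above.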

(2) Relation extraction

Relation extraction refers to the task of detecting and recognizing the semantic relationship between entities in the text, and linking mentions that represent the same semantic relationship.

The output of relationship extraction is usually a triple (entity 1, relationship category, entity 2), which indicates that there is a specific type of semantic relationship between entity 1 and entity 2.

For example, the relationship expressed in the sentence "Beijing is China's capital, political center, and cultural center" can be expressed as (China, capital, Beijing), (China, political center, Beijing) and (China, cultural center, Beijing).

Relation extraction usually includes two core modules: relation detection and relation classification. Relation detection judges whether a semantic relation exists between two entities, and relation classification assigns entity pairs that have a semantic relation to pre-specified categories.

In some scenarios and tasks, the relation extraction system may also include a relation discovery module, whose main purpose is to discover the types of semantic relations that hold between entities.

For example, discovering that relation categories such as employee, CEO, CTO, founder, and chairman hold between persons and companies.

 

(3) Event extraction

Event extraction refers to the task of extracting event information from unstructured text and presenting it in a structured form.

For example, extract events {Type: Birth, Person: Mao Zedong, Time: 1893, Place of Birth: Xiangtan, Hunan} from the sentence "Mao Zedong was born in Xiangtan, Hunan in 1893".

The event extraction task usually includes two subtasks: event type recognition and event element filling.

Event type recognition determines whether a sentence expresses a specific type of event.

The event type determines the template that the event represents, and different types of events have different templates.

For example, the template of a birth event is {person, time, place of birth}, and the template of a terrorist attack event is {location, time, attacker, victim, number of injured,...}.

Event elements are the key elements that make up an event. Event element recognition is the task of extracting the corresponding elements according to the template of the event they belong to and labeling them with the correct element tags.
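For illustration, the structured output of the birth-event example above could be represented as a simple record (the field names here are hypothetical, following the template in the text):

```python
# The extracted birth event from the example sentence, written as a plain Python dict.
birth_event = {
    "type": "Birth",
    "person": "Mao Zedong",
    "time": "1893",
    "place_of_birth": "Xiangtan, Hunan",
}
print(birth_event["person"])  # access one extracted event element
```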

 

(4) Information integration

Entities, relationships, and events represent different granularities of information in a single text.

In many applications, it is necessary to integrate information from different data sources and different texts to make decisions, which requires research on information integration technology.

Information integration technology in information extraction research mainly includes coreference resolution technology and entity link technology.

Coreference resolution refers to the task of detecting different mentions of the same entity/relation/event and linking them together; for example, recognizing that "Jobs" and "he" in the sentence "Jobs is one of the founders of Apple; he has experienced Apple's ups and downs over the decades" refer to the same entity.

The purpose of entity linking is to determine which real-world entity an entity mention refers to; for example, recognizing that "Apple" and "Jobs" in the previous sentence refer to Apple Inc. and its CEO Steve Jobs in the real world, respectively.

As follows: Military terminology information extraction

 

At present, the main research methods for sequence labeling in natural language processing include probabilistic graphical models (the hidden Markov model (HMM) and the conditional random field (CRF)) and neural networks (the mainstream solution is generally BiLSTM+CRF; SVM+AdaBoost was also used in early natural language processing research).

 

Hidden Markov Model (HMM)

The hidden Markov model is a typical representative of probabilistic graphical models.

(Probabilistic graphical models can generally be divided into Bayesian networks, which describe causal dependencies between variables and are represented by directed acyclic graphs, and Markov networks, in which variables are correlated but causality is hard to establish, represented by undirected graphs.) The HMM is the dynamic Bayesian network with the simplest structure.

 

1. Markov model

Markov model is mainly used to describe the transition process between system states, that is, the system transitions from one state to another over time or space.

In a Markov process, it is assumed that the state of the system at time t is only related to the state at the previous time t-1, and has nothing to do with earlier states.

The model mainly includes three elements:

  • S: A limited set of states in the model;
  • Π: the probability distribution of the initial state space;
  • A: State transition probability matrix.

Figure: the Markov model triple (S, π, A)


 

There are some interesting conclusions about the Markov model. For example, after a long period of time, i.e. after many state transitions, the distribution over states converges to the same result regardless of the initial state (for ergodic chains, the final distribution depends only on the state transition matrix, not on the initial state). A small numerical check is shown below.
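A quick numerical sketch of this convergence claim; the transition matrix below is made up purely for illustration:

```python
# Repeatedly applying the transition matrix drives different initial
# distributions to the same stationary distribution (for ergodic chains).
import numpy as np

A = np.array([[0.7, 0.2, 0.1],     # state transition probability matrix (made up)
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

for pi0 in (np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])):
    pi = pi0
    for _ in range(50):             # many state transitions
        pi = pi @ A
    print(pi)                       # both starting points converge to ~[0.46, 0.28, 0.26]
```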

 

2. Hidden Markov Model

The changes in what we can observe reveal the essential laws hidden behind them: in a hidden Markov model the underlying state sequence cannot be observed directly, which is why the model is called a hidden Markov model (the state sequence is unknowable).

 

(1) HMM evaluation problem

The above section is bullshit

a. Forward algorithm

 

Animated diagram of the forward algorithm: https://pic3.zhimg.com/v2-aab75a9c0df890ef11db2c27e672baf4_b.webp
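For reference, a compact sketch of the forward algorithm, which solves the evaluation problem: given the model parameters (π, A, B) and an observation sequence O, compute P(O | model). The toy parameters below are made up purely for illustration:

```python
# Forward algorithm sketch for the HMM evaluation problem.
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state distribution
A = np.array([[0.7, 0.3],                 # hidden-state transition matrix
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # observation probabilities for each state
              [0.1, 0.3, 0.6]])

def forward(obs):
    alpha = pi * B[:, obs[0]]             # initialization: P(state, first observation)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction: sum over previous states
    return alpha.sum()                    # termination: total probability of the observations

print(forward([0, 1, 2]))                 # P(observation sequence | model)
```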

b. Backward algorithm

Backward algorithm diagram

The backward algorithm is similar to the forward algorithm, and its time complexity is also O(N^2T)

 

You can read the original text here (I have already summarized it, so I won't repeat it): https://zhuanlan.zhihu.com/p/50184092?from_voters_page=true

(2) HMM decoding problem

Viterbi algorithm (dynamic programming)
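A minimal sketch of Viterbi decoding: finding the most probable hidden-state sequence for an observation sequence. The toy parameters have the same meaning (and values) as in the forward-algorithm sketch above:

```python
# Viterbi (dynamic programming) sketch for the HMM decoding problem.
import numpy as np

def viterbi(obs, pi, A, B):
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))            # best path score ending in each state
    psi = np.zeros((T, n_states), dtype=int)   # back-pointers (best previous state)

    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A      # score of every (previous -> current) move
        psi[t] = trans.argmax(axis=0)          # best previous state for each current state
        delta[t] = trans.max(axis=0) * B[:, obs[t]]

    path = [int(delta[-1].argmax())]           # best final state
    for t in range(T - 1, 0, -1):              # backtrack through the pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))            # most likely state sequence, e.g. [0, 0, 1]
```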

(3) HMM parameter learning

The main parameters of the hidden Markov model are the two matrices A and B, where A is the hidden-state transition probability matrix and B is the probability distribution of observations given each state.

 

Application of Hidden Markov Model in Sequence Labeling

Word segmentation

Part-of-speech tagging

Phrase recognition, speech recognition

 

Conditional Random Field Model (CRF)

Random field. A random field can be regarded as a set of random variables (all drawn from the same sample space).

There may be interdependencies among these random variables. When the random variable at each position is randomly assigned a value from the corresponding space according to some distribution, the whole is called a random field.

Markov property. The Markov property means that when we expand a sequence of random variables in time order, the distribution of the variable at time N+1 depends only on the value of the variable at time N, and not on the values of the variables before time N.

A random field that satisfies the Markov property is called a Markov random field (MRF).

1. Maximum entropy model (in the maximum entropy model, the outputs are independent of each other)

2. Conditional Random Field (CRF)

In the application of CRF in named entity recognition, the model input is the word sequence, and the output is the word tag.

The neural network sequence annotation model architecture is as follows

A simple example to illustrate the concept of a random field: take a whole made up of several positions; when each position is randomly assigned a value according to some distribution, the whole is called a random field.

 

 

Taking place name recognition as an example, suppose the following rules are defined:

Label  Meaning
B  The current word is the beginning word of a place-name entity
M  The current word is a middle word of a place-name entity
E  The current word is the ending word of a place-name entity
S  The current word alone constitutes a place-name entity
O  The current word is not a place-name entity or part of one

Given a sentence of n characters, the label of each character is chosen from the known label set {"B", "M", "E", "S", "O"}. Once every character has been assigned a label, a random field is formed.

If constraints are added, for example that each character's label depends only on the labels of adjacent characters, the problem becomes a Markov random field.

Assume there are two kinds of variables in the Markov random field, X and Y; X is generally given, and Y is the output conditioned on X. In this example, X is the character sequence, Y is the label sequence, and P(Y|X) is a conditional random field.

This structure is generally called a linear chain conditional random field. It is defined as follows:

Let X = (X_1, X_2, X_3, ..., X_n) and Y = (Y_1, Y_2, Y_3, ..., Y_n) both be sequences of random variables represented by linear chains. If, given the random variable sequence X, the conditional probability distribution P(Y|X) constitutes a conditional random field and satisfies the Markov property

P(Y_i | X, Y_1, Y_2, ..., Y_n) = P(Y_i | X, Y_{i-1}, Y_{i+1}),

then P(Y|X) is called a linear-chain conditional random field.
In other words, the linear-chain model only considers the influence of the two nodes adjacent to each node, because only those nodes are its neighbors.

Above reference: Named entity recognition https://blog.csdn.net/qq_42851418/article/details/83269545?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522159955957819725264620784%2522%252C%2522scm%2522%253A%252220140713.130102334 ..%2522%257D&request_id=159955957819725264620784&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~top_click~default-4-83269545.pc_ecpm_v3_pc_rank_v3&utm_term=%E5%91%BD%E5%90%8 AE%9E%E4%BD%93%E8%AF%86%E5%88%AB&spm=1018.2118.3001.4187

 

 

 

For the introduction of network structures such as RNN and LSTM, please refer to: https://zhuanlan.zhihu.com/p/50915723

What are the advantages and disadvantages of CRF and LSTM models in sequence labeling? https://www.zhihu.com/question/46688107?sort=created

From RNN, LSTM to Encoder-Decoder framework, attention mechanism, Transformer https://zhuanlan.zhihu.com/p/50915723

 

 

 


Origin blog.csdn.net/weixin_45316122/article/details/108471496