Sequence Labeling (also called tagging)

Reference: The four basic tasks of NLP: https://blog.csdn.net/savinger/article/details/89302956

Contents:
- Sequence labeling (also called tagging)
- (1) Named entity recognition (NER)
- Application of the hidden Markov model in sequence labeling
- Conditional random field model (CRF)
- 1. Maximum entropy model (in the maximum entropy model, the outputs are independent of each other)
- 2. Conditional random field (CRF)
Sequence tagging problems mainly include POS tagging, semantic role tagging, and information extraction.
Part-of-Speech (POS) Tagging
Given a sentence that has already been segmented into words, the goal of part-of-speech tagging is to assign a category to each word. This category is called a part-of-speech tag, such as noun, verb, or adjective. Part-of-speech tagging is a very typical sequence labeling problem.
Information extraction (IE)
An information extraction system processes various unstructured or semi-structured text inputs (such as news pages, product pages, microblogs, and forum pages) and applies multiple techniques (such as rule-based methods, statistical methods, and knowledge mining methods) to extract specified structured information (such as entities, relations, product records, lists, and attributes), then integrates this information at different levels (knowledge deduplication, knowledge linking, knowledge base construction, etc.).
According to the type of information extracted, the core research topics of information extraction can be divided into Named Entity Recognition (NER), Relation Extraction, Event Extraction, and Information Integration.
(1) Named entity recognition (NER)
The purpose of named entity recognition is to identify entities of specified categories in text, such as person names, place names, organization names, and proper nouns.
A named entity recognition system usually consists of two parts: entity boundary recognition and entity classification. Entity boundary recognition determines whether a string is an entity, and entity classification assigns the identified entities to predefined categories.
The main difficulty of named entity recognition lies in open-domain named entity categories (such as movie and song names) whose expression is irregular and for which training corpora are scarce.
Example (figure): person name recognition
Example (figure): organization name recognition
Reference: Word-based Chinese named entity recognition with the BiLSTM-CRF model: https://blog.csdn.net/weixin_34004576/article/details/93472426
Difficulties of named entity recognition
(1) Unlike English, Chinese text has no spaces marking word boundaries; "word" is itself a vague concept in Chinese, and Chinese lacks morphological cues such as the letter case used in English.
(2) Chinese expressions are flexible and variable. Some words cannot be judged to be named entities without context, and even a confirmed named entity may belong to different entity types in different contexts.
(3) Named entities can be nested. For example, the organization name "Peking University Third Hospital" contains "Peking University", which can itself serve as an organization name; this phenomenon is especially common in organization names.
(4) Abbreviated expressions are widespread in Chinese, such as "The Third Hospital of Beijing Medical University" and "National University of Science and Technology"; some named entities are even composed entirely of abbreviations, such as "National Science Bridge".
---
Named entity recognition: building an NER recognizer
Process
What does the training data look like
Column C: part of speech.
Column D: entity category; O means the token is not part of an entity, B marks the beginning (Begin) of an entity, and I marks the remaining tokens inside an entity. (This is the B/I/O scheme; other schemes exist, such as B/M/E/O.)
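As an illustrative sketch (the tokens and tags below are hypothetical, not from the source's figure), one annotated training sentence can be stored as (token, part-of-speech, entity-tag) rows, and the B/I/O tags can be grouped back into entity spans:

```python
# One hypothetical BIO-annotated training sentence:
# column C = part of speech, column D = entity tag
training_sentence = [
    ("Steve",   "NNP", "B-Person"),
    ("Jobs",    "NNP", "I-Person"),
    ("founded", "VBD", "O"),
    ("Apple",   "NNP", "B-Organization"),
    (".",       ".",   "O"),
]

def extract_entities(tagged):
    """Group consecutive B-/I- tags into (entity_text, entity_type) spans."""
    entities, current, ctype = [], [], None
    for token, _pos, tag in tagged:
        if tag.startswith("B-"):                 # a new entity begins
            if current:
                entities.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)                # continue the current entity
        else:                                    # O tag (or inconsistent I-) closes the entity
            if current:
                entities.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        entities.append((" ".join(current), ctype))
    return entities

print(extract_entities(training_sentence))
# [('Steve Jobs', 'Person'), ('Apple', 'Organization')]
```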
Evaluate NER recognizer
NER methods
- Reference: Detailed explanation of named entity recognition: https://blog.csdn.net/weixin_46425692/article/details/108269912
---
BiLSTM-CRF model
The schematic diagram is as follows:
- First, each word in the sentence x is expressed as a vector, which contains the aforementioned word embedding and character embedding, in which the character embedding is initialized randomly, and the word embedding is usually initialized by a pre-training model. All embeddings will be fine-tuned during training.
- Second, the input of the BiLSTM-CRF model is the embeddings above, and the output is the predicted label for each word in sentence x.
As can be seen from the figure above, the output of the BiLSTM layer is a score for each label. For the word w0, for example, the BiLSTM outputs 1.5 (B-Person), 0.9 (I-Person), 0.1 (B-Organization), 0.08 (I-Organization), and 0.05 (O).
These scores are the input of the CRF layer.
The scores predicted by the BiLSTM layer are fed into the CRF layer, and the tag sequence with the highest total score is the model's best prediction.
(The technology involved in CRF is mentioned below)
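As a hedged sketch of this decoding step: with hypothetical emission scores (w0's row taken from the figure; the other rows and the transition matrix are made up here, since a real BiLSTM-CRF learns both), Viterbi search returns the highest-scoring tag sequence:

```python
import numpy as np

labels = ["B-Person", "I-Person", "B-Organization", "I-Organization", "O"]

# Hypothetical BiLSTM emission scores for a 3-word sentence
# (row t = word w_t, column j = label j; w0's row matches the figure)
emissions = np.array([
    [1.5, 0.9, 0.1, 0.08, 0.05],
    [0.2, 1.1, 0.3, 0.10, 0.40],
    [0.1, 0.2, 0.5, 0.30, 1.30],
])

# Hypothetical transition scores trans[i, j] for moving from label i to label j
trans = np.array([
    [-0.5,   1.0, -0.5, -10.0, 0.2],   # from B-Person
    [-0.5,   0.8, -0.5, -10.0, 0.2],   # from I-Person
    [-0.5, -10.0, -0.5,   1.0, 0.2],   # from B-Organization
    [-0.5, -10.0, -0.5,   0.8, 0.2],   # from I-Organization
    [ 0.3, -10.0,  0.3, -10.0, 0.2],   # from O
])

def viterbi(emissions, trans):
    """Return the label sequence with the highest total path score."""
    n, k = emissions.shape
    score = emissions[0].copy()           # best score ending in each label
    back = np.zeros((n, k), dtype=int)    # backpointers
    for t in range(1, n):
        total = score[:, None] + trans + emissions[t][None, :]
        back[t] = total.argmax(axis=0)    # best previous label for each current label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [labels[i] for i in reversed(path)]

print(viterbi(emissions, trans))  # ['B-Person', 'I-Person', 'O']
```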
What if there is no CRF layer
Based on the above, consider what happens if there is no CRF layer, that is, if we train the BiLSTM named entity recognition model shown in the following figure:
Because the BiLSTM outputs a score for each label of each word, we can simply choose the label with the highest score as each word's prediction.
For example, for w0, "B-Person" has the highest score (1.5), so we choose "B-Person" as its predicted label; similarly, w1 is labeled "I-Person", w2 is labeled "O", w3 is labeled "B-Organization", and w4 is labeled "O".
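This per-word argmax can be sketched as follows (the scores are hypothetical BiLSTM outputs; w0's row matches the figure):

```python
import numpy as np

labels = ["B-Person", "I-Person", "B-Organization", "I-Organization", "O"]

# Hypothetical BiLSTM label scores, one row per word
scores = np.array([
    [1.5, 0.9, 0.1, 0.08, 0.05],   # w0
    [0.4, 1.2, 0.2, 0.15, 0.30],   # w1
])

# Pick the highest-scoring label independently for each word
preds = [labels[j] for j in scores.argmax(axis=1)]
print(preds)  # ['B-Person', 'I-Person']
```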
Although this method happens to produce the correct labels for x here, in most cases it does not, as in the example in the following figure:
Obviously, the output tag sequences "I-Organization I-Person" and "B-Organization I-Person" are incorrect.
Why use CRF
CRF can learn constraints from training data
The CRF layer can add constraints on the final predicted labels to ensure that the predicted label sequence is valid. These constraints are learned automatically by the CRF layer from the training data.
The constraints may be:
- The label of the first word in a sentence should be "B-" or "O", not "I-";
- In "B-label1 I-label2 I-label3 I-...", label1, label2, label3... should be the same named entity label. For example, "B-Person I-Person" is valid, but "B-Person I-Organization" is invalid;
- "O I-label" is invalid: the first label of a named entity should start with "B-", not "I-"; in other words, the valid pattern is "O B-label";
- …
With these constraints, invalid predicted label sequences will be drastically reduced.
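One common way to realize such constraints (a sketch, not necessarily how any particular implementation does it) is to build the transition score matrix so that invalid transitions receive a huge penalty, which Viterbi decoding will then never select:

```python
import numpy as np

labels = ["B-Person", "I-Person", "B-Organization", "I-Organization", "O"]
NEG = -10000.0  # effectively forbids a transition

# trans[i, j] is the score of moving from label i to label j.
# "I-X" may only follow "B-X" or "I-X" of the same entity type X.
trans = np.zeros((len(labels), len(labels)))
for i, src in enumerate(labels):
    for j, dst in enumerate(labels):
        if dst.startswith("I-") and src[2:] != dst[2:]:
            trans[i, j] = NEG

# Start scores: a sentence may begin with "B-" or "O", never "I-"
start = np.array([NEG if lab.startswith("I-") else 0.0 for lab in labels])

print(trans[labels.index("O"), labels.index("I-Person")])        # forbidden
print(trans[labels.index("B-Person"), labels.index("I-Person")]) # allowed
```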
CRF layer
In the loss function of the CRF layer, there are two types of scores, and these two types of scores are the key concepts of the CRF layer.
1. Emission score
The first type of score is the emission score, which comes from the BiLSTM layer. As shown in Figure 2.1, the score of w0 being labeled B-Person is 1.5.
For the convenience of the follow-up description, we will give each label an index, as shown in the following table:
Label | Index |
---|---|
B-Person | 0 |
I-Person | 1 |
B-Organization | 2 |
I-Organization | 3 |
O | 4 |
We write x_{i, y_j} for an element of the emission matrix, where i indexes the i-th word and y_j is a label index. For example, according to Figure 2.1, x_{1,2} = 0.1,
which means the score of labeling w1 as B-Organization is 0.1.
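A minimal sketch of this indexing, with w0's scores taken from Figure 2.1 and hypothetical values for w1 (chosen so that the score of labeling w1 as B-Organization is 0.1):

```python
import numpy as np

# Label indices from the table above
label_index = {"B-Person": 0, "I-Person": 1, "B-Organization": 2,
               "I-Organization": 3, "O": 4}

# Emission matrix: row i = word w_i, column = label index
x = np.array([
    [1.5, 0.9, 0.1, 0.08, 0.05],   # w0 (from Figure 2.1)
    [0.2, 0.4, 0.1, 0.11, 0.05],   # w1 (hypothetical)
])

# Look up the emission score of labeling w1 as B-Organization
score = x[1, label_index["B-Organization"]]
print(score)  # 0.1
```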
2. Transition score
For details, see: CRF layer principle and code understanding of the LSTM+CRF model: https://www.cnblogs.com/luckyplj/p/13433397.html
3. CRF loss function
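The source gives no detail under this heading; as a sketch of the standard BiLSTM-CRF formulation (stated here as an assumption, not taken from the original text), a path score sums the emission scores E and transition scores T along the path, and the loss is the negative log-likelihood of the true path, where the log-sum-exp over all paths is computed efficiently with the forward algorithm:

```latex
s(X, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=1}^{n-1} T_{y_i, y_{i+1}}

\mathcal{L} = -\log \frac{e^{s(X, y^{*})}}{\sum_{\tilde{y}} e^{s(X, \tilde{y})}}
            = \log \sum_{\tilde{y}} e^{s(X, \tilde{y})} - s(X, y^{*})
```

Here y* is the true label sequence and the sum runs over all possible label sequences for the sentence X.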
(2) Relation extraction
Relation extraction is the task of detecting and recognizing semantic relations between entities in text and linking mentions that express the same semantic relation.
The output of relation extraction is usually a triple (entity 1, relation type, entity 2), indicating that a specific semantic relation holds between entity 1 and entity 2.
For example, the relationship expressed in the sentence "Beijing is China's capital, political center, and cultural center" can be expressed as (China, capital, Beijing), (China, political center, Beijing) and (China, cultural center, Beijing).
Relation extraction usually includes two core modules: relation detection and relation classification. The relation detection judges whether there is a semantic relationship between two entities, and the relation classification divides the entity pairs with semantic relations into pre-designated categories.
In some scenarios and tasks, a relation extraction system may also include a relation discovery module, whose main purpose is to discover the types of semantic relations that hold between entities.
For example, it may discover relation categories such as employee, CEO, CTO, founder, and chairman between a person and a company.
(3) Event extraction
Event extraction refers to the task of extracting event information from unstructured text and presenting it in a structured form.
For example, extract events {Type: Birth, Person: Mao Zedong, Time: 1893, Place of Birth: Xiangtan, Hunan} from the sentence "Mao Zedong was born in Xiangtan, Hunan in 1893".
Event extraction task usually includes two subtasks: event type identification and event element filling.
Event type recognition determines whether a sentence expresses a specific type of event.
The event type determines the template that the event represents, and different types of events have different templates.
For example, the template of a birth event is {person, time, place of birth}, and the template of a terrorist attack event is {location, time, attacker, victim, number of injured,...}.
Event elements are the key components that make up an event. Event element identification is the task of extracting the elements that fill the template of the identified event type and labeling them with the correct element tags.
(4) Information integration
Entities, relationships, and events represent different granularities of information in a single text.
In many applications, it is necessary to integrate information from different data sources and different texts to make decisions, which requires research on information integration technology.
Information integration technology in information extraction research mainly includes coreference resolution technology and entity link technology.
Coreference resolution is the task of detecting different mentions of the same entity/relation/event and linking them together. For example, in the sentence "Jobs is one of the founders of Apple; he experienced Apple's ups and downs over the decades", "Jobs" and "he" refer to the same entity.
The purpose of entity linking is to determine which real-world entity an entity mention refers to. For example, the "Apple" and "Jobs" in the previous sentence refer to Apple Inc. and its CEO Steve Jobs, respectively.
Example (figure): military terminology information extraction
At present, the main approaches to sequence labeling in natural language processing are probabilistic graphical models (the hidden Markov model (HMM) and the conditional random field (CRF)) and neural networks (the mainstream solution is generally BiLSTM+CRF; SVM+AdaBoost was also used in early NLP research).
Hidden Markov Model (HMM)
The hidden Markov model is a typical representative of probabilistic graphical models.
(Probabilistic graphical models can generally be divided into Bayesian networks, which model causal dependencies between variables and are represented by directed acyclic graphs, and Markov networks, in which variables are correlated but causal direction is hard to establish, represented by undirected graphs.) The HMM is the dynamic Bayesian network with the simplest structure.
1. Markov model
Markov model is mainly used to describe the transition process between system states, that is, the system transitions from one state to another over time or space.
In a Markov process, the state of the system at time t is assumed to depend only on the state at the previous time t-1, and not on any earlier states.
The model mainly includes three elements,
- S: A limited set of states in the model;
- π: the initial state probability distribution;
- A: State transition probability matrix.
Figure. Markov model triples
An interesting property of Markov models is that after a long period of time, that is, after many state transitions, the state distribution converges to the same result regardless of the initial state. (In other words, the final distribution depends only on the state transition matrix, not on the initial state.)
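This convergence can be checked numerically; the sketch below uses a hypothetical 2-state transition matrix and shows that two very different initial distributions evolve to the same stationary distribution:

```python
import numpy as np

# Hypothetical 2-state transition matrix (each row sums to 1)
A = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
])

def evolve(p0, A, steps=200):
    """Repeatedly apply the transition matrix to an initial distribution."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p @ A
    return p

p_a = evolve([1.0, 0.0], A)   # start entirely in state 0
p_b = evolve([0.0, 1.0], A)   # start entirely in state 1
print(p_a, p_b)  # both converge to the stationary distribution [5/6, 1/6]
```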
2. Hidden Markov Model
The changes in what is observable reflect the essential laws hidden behind them; the state sequence itself cannot be observed directly, which is why the model is called a hidden Markov model (the state sequence is unknowable).
(1) The HMM evaluation problem
a. Forward algorithm
Animated diagram of the forward algorithm: https://pic3.zhimg.com/v2-aab75a9c0df890ef11db2c27e672baf4_b.webp
b. Backward algorithm
Backward algorithm diagram
The backward algorithm is similar to the forward algorithm, and its time complexity is also O(N^2T)
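A minimal sketch of the forward algorithm on a hypothetical 2-state HMM (the values of pi, A, and B below are chosen arbitrarily for illustration):

```python
import numpy as np

# Hypothetical HMM with 2 hidden states and 2 observation symbols
pi = np.array([0.6, 0.4])            # initial state distribution
A  = np.array([[0.7, 0.3],           # state transition probabilities A[i, j]
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],           # emission probabilities B[state, symbol]
               [0.2, 0.8]])

def forward(obs, pi, A, B):
    """Evaluation problem: P(obs | model), in O(N^2 T) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]        # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:                # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()               # P(obs) = sum_i alpha_T(i)

print(forward([0, 1, 0], pi, A, B))  # ≈ 0.10893
```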
For the full derivation, see the original article: https://zhuanlan.zhihu.com/p/50184092
(2) HMM decoding problem
Viterbi algorithm (dynamic programming)
(3) HMM parameter learning
The main parameters of the hidden Markov model are two matrices: A, the hidden state transition probability matrix, and B, the probability distribution of observations given each state.
Application of Hidden Markov Model in Sequence Labeling
Word segmentation
Part-of-speech tagging
Phrase recognition, speech recognition
Conditional Random Field Model (CRF)
Random field. A random field can be regarded as a set of random variables (all defined over the same sample space).
There may be interdependencies among these random variables. When the random variable at each position is randomly assigned a value from the corresponding space according to some distribution, the whole is called a random field.
Markov property. The Markov property means that when a sequence of random variables is unrolled in time order, the distribution of the variable at time N+1 depends only on the value of the variable at time N, and not on the values of the variables before time N.
We call the random field that satisfies Markov property as Markov Random Field (MRF).
1. Maximum entropy model (in the maximum entropy model, the outputs are independent of each other)
2. Conditional Random Field (CRF)
In the application of CRF in named entity recognition, the model input is the word sequence, and the output is the word tag.
The architecture of the neural network sequence labeling model is as follows:
A simple example to illustrate the concept of random field: a whole composed of several existing positions, when a given position is randomly assigned a value according to a certain distribution, the whole is called a random field.
Taking place name recognition as an example, suppose the following rules are defined:
Label | meaning |
---|---|
B | The current word is the beginning word of the place name named entity |
M | The current word is a middle word of the place-name named entity |
E | The current word is the ending word of a place name named entity |
S | The current word alone constitutes a place name named entity |
O | The current word is not a named entity or part of a place-name entity |
Given a sentence of n characters, where each character's label is chosen from the known label set {"B", "M", "E", "S", "O"}: once every character has been assigned a label, the assignment forms a random field.
If constraints are added, for example that each character's label depends only on the labels of adjacent characters, the problem becomes a Markov random field problem.
Suppose the Markov random field contains two kinds of variables, X and Y, where X is given and Y is the output conditioned on X. In this example, X is the character sequence, Y is the label sequence, and P(Y|X) is a conditional random field.
This structure is generally called a linear chain conditional random field. It is defined as follows:
Let X = (X_1, X_2, X_3, ..., X_n) and Y = (Y_1, Y_2, Y_3, ..., Y_n) both be sequences of random variables represented as linear chains. If, given the random variable sequence X, the conditional probability distribution P(Y|X) of the random variable Y constitutes a conditional random field and satisfies the Markov property
P(Y_i | X, Y_1, Y_2, ..., Y_n) = P(Y_i | X, Y_{i-1}, Y_{i+1}),
then P(Y|X) is called a linear-chain conditional random field.
In other words, the linear model only considers the influence of nodes on both sides of it, because only the nodes on both sides are adjacent to it.
For the introduction of network structures such as RNN and LSTM, please refer to: https://zhuanlan.zhihu.com/p/50915723
What are the advantages and disadvantages of CRF and LSTM models in sequence labeling? https://www.zhihu.com/question/46688107?sort=created
From RNN, LSTM to Encoder-Decoder framework, attention mechanism, Transformer https://zhuanlan.zhihu.com/p/50915723