Sequence labeling (HMM / CRF)

Brief introduction

Sequence labeling (sequence tagging) is a relatively simple NLP task, but it is also one of the most fundamental. It covers a wide range of problems: any task that amounts to classifying each character or token can be cast as sequence labeling, including word segmentation, part-of-speech (POS) tagging, named entity recognition, and relation extraction.

Readers who have followed the earlier posts on this blog will already be familiar with word segmentation. There are many open-source Chinese word segmentation tools available, such as jieba, pkuseg, and pyhanlp; they are not all listed here and will not be discussed further. The rest of this post uses entity recognition as the running example. Other tasks work in much the same way; only the labeling scheme differs.

For the entity recognition task, we are given a sequence to be labeled, \(X=\{x_1, x_2, ..., x_n\}\), and we need to predict a tag for each \(x_i\) in the sequence. The following tags are commonly defined:

  • B - Begin, the first character of a text block
  • I - Intermediate, a character in the middle of a text block
  • E - End, the last character of a text block
  • S - Single, a text block consisting of a single character
  • O - Other, an irrelevant character that belongs to no text block

The common labeling schemes use either three tags or five tags:

  • IOB - the first character of a text block is tagged B, the remaining characters of the block are tagged I, and characters outside any text block are tagged O
  • IOBES - the first character of a text block is tagged B, the last character is tagged E, the characters in between are tagged I, a single-character text block is tagged S, and characters outside any text block are tagged O
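To make the two schemes concrete, here is a minimal sketch that turns text-block positions into tags. The function name `spans_to_tags` and the `(start, end)` span format (end exclusive) are assumptions made for this illustration, not part of any library.

```python
def spans_to_tags(n, spans, scheme="IOBES"):
    """Tag n characters given (start, end) text-block spans, end exclusive.

    Hypothetical helper for illustration; scheme is "IOB" or "IOBES".
    """
    tags = ["O"] * n  # everything outside a text block is O
    for start, end in spans:
        if scheme == "IOBES" and end - start == 1:
            tags[start] = "S"  # single-character block
            continue
        tags[start] = "B"  # first character of the block
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining characters
        if scheme == "IOBES":
            tags[end - 1] = "E"  # last character overrides I
    return tags


# Seven characters with two adjacent text blocks at positions 2-3 and 4-6.
print(spans_to_tags(7, [(2, 4), (4, 7)], scheme="IOB"))
print(spans_to_tags(7, [(2, 4), (4, 7)], scheme="IOBES"))
```

With the two adjacent blocks above, the IOB result is `O O B I B I I` and the IOBES result is `O O B E B I E`; the B tag is what separates the second block from the first.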

Of course, these tags are not fixed; depending on the task, the tag set can be adapted or extended. For word segmentation, we can use the same scheme to mark the beginning, middle, and end of each word, or a single-character word. For POS tagging, we can define the tags n, v, adj, and so on. For named entity recognition with finer-grained categories, we append a suffix to each tag, for example B-Person and B-Location. Choose whatever fits your actual task.

Common models for sequence labeling include the hidden Markov model (HMM), the conditional random field (CRF), and BiLSTM + CRF. For reasons of space, this post covers only the first two, which are traditional machine learning models: hidden Markov models and conditional random fields.

Hidden Markov Models (HMM)

The HMM is a classic machine learning algorithm and a directed graphical model, mainly used to model time-series data. With the rise of deep learning, HMMs are used less and less for sequence labeling, but the ideas behind them remain the foundation of sequence labeling. The principle is simple, and readers who already know it can skip this section. Only a brief introduction is given below; readers who want more detail can consult Zhou Zhihua's "watermelon book", which covers the topic thoroughly.

An HMM has two groups of variables, plus a state space that the state variables range over:

  • Observed variables: \(X=\{x_1, x_2, ..., x_n\}\), where \(x_i\) is the observed value at time \(i\).
  • State variables: \(Y=\{y_1, y_2, ..., y_n\}\), where \(y_i\) is the state at time \(i\). The states are generally hidden, so they are also called hidden variables.
  • State space: \(S=\{s_1, s_2, ..., s_N\}\), the set of values a state variable can take.

In a sequence labeling task, the sequence to be labeled corresponds to the observed variables, the labeling result corresponds to the state variables, and the tag categories we define correspond to the state space. The structure of a hidden Markov model is shown below:

The arrows in the figure represent dependencies between variables, and the state variables form a Markov chain. The whole model rests on the following assumptions:

  • At any time, the observed variable depends only on the state variable at that time: \(x_t\) is determined by \(y_t\) and is independent of the state variables and observed variables at all other times (this is similar to a unigram model), i.e.
    \[P(x_t|X, Y)=P(x_t|y_t)\]
  • At any time, the state depends only on the state at the previous time, not on any earlier state: \(y_t\) depends only on \(y_{t-1}\) and is independent of all other states. This assumption means that the hidden variables carry sequential information, which a simple classification model does not have. That is,
    \[P(y_t|X, Y)=P(y_t|y_{t-1})\]

Under these assumptions, the joint probability distribution over all variables can be written as:
\[P(x_1, y_1, ..., x_n, y_n)=P(y_1)P(x_1|y_1)\prod_{i=2}^{n}P(y_i|y_{i-1})P(x_i|y_i)\]

From this expression, if we learn the initial probability of each state \(P(y_1)\), the transition probabilities between states \(P(y_i|y_{i-1})\), and the observation probabilities \(P(x_i|y_i)\), we can compute the joint probability of any sequence and choose the state sequence with the highest probability as our prediction. An HMM therefore has the following three sets of parameters:

  • State transition probabilities: the probabilities of moving between states, usually written as the matrix \(A=[a_{ij}]_{N \times N}\), where
    \[a_{ij}=P(y_{t+1}=s_j|y_t=s_i), \quad \sum_{j=1}^{N}a_{ij}=1, \quad 1 \le i, j \le N\]
    • This is the probability that, at any time \(t\), if the state is \(s_i\), the state at the next time is \(s_j\). Each row of \(A\) sums to 1.
  • Emission probabilities (output observation probabilities): the probabilities of each observed value given the current state, usually written as the matrix \(B=[b_{ij}]_{N \times M}\), where \(M\) is the number of possible observed values and
    \[b_{ij}=P(x_t=o_j|y_t=s_i), \quad \sum_{j=1}^{M}b_{ij}=1, \quad 1 \le i \le N, 1 \le j \le M\]
    • This is the probability that, at any time \(t\), if the state is \(s_i\), the observed value \(o_j\) is emitted. Each row of \(B\) sums to 1.
  • Initial state probabilities (start probabilities): the probability of each state at the initial time, usually written as \(\pi=[\pi_1, \pi_2, ..., \pi_N]\), where
    \[\pi_i=P(y_1=s_i), \quad \sum_{i=1}^{N}\pi_i=1, \quad 1 \le i \le N\]

Writing the parameters as \(\lambda=[A, B, \pi]\), the joint probability distribution becomes:
\[P(X, Y|\lambda)=\pi_{y_1}b_{y_1, x_1}\prod_{i=2}^{n}a_{y_{i-1}, y_i}b_{y_i, x_i}\]
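The joint probability can be evaluated directly from the three parameter sets. The following is a minimal sketch on a toy two-state, two-symbol HMM; the numbers in `pi`, `A`, and `B` and the function name `joint_prob` are made up for illustration, and states and observations are represented by integer indices.

```python
# Toy parameters (assumed values, not learned from data).
pi = [0.6, 0.4]                  # pi[i]   = P(y_1 = s_i)
A = [[0.7, 0.3], [0.4, 0.6]]     # A[i][j] = P(y_{t+1} = s_j | y_t = s_i)
B = [[0.5, 0.5], [0.1, 0.9]]     # B[i][j] = P(x_t = o_j | y_t = s_i)


def joint_prob(pi, A, B, x, y):
    """P(X, Y | lambda) for observation indices x and state indices y."""
    p = pi[y[0]] * B[y[0]][x[0]]             # initial state and first emission
    for i in range(1, len(x)):
        p *= A[y[i - 1]][y[i]] * B[y[i]][x[i]]  # one transition, one emission
    return p


# 0.6 * 0.5 * 0.3 * 0.9, approximately 0.081
print(joint_prob(pi, A, B, x=[0, 1], y=[0, 1]))
```

Because every row of `A` and `B` sums to 1, the joint probabilities over all pairs of sequences of a fixed length also sum to 1.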

Probabilistic graphical models involve three basic problems, which are also the basic steps in solving a task with such a model:

  • Evaluation problem: given \(\lambda\) and the observed variables \(X\), compute the probability of the observations \(P(X|\lambda)\), which measures how well the model matches the actual problem (forward and backward algorithms)
  • Learning problem: given several observation sequences \(X\), learn the parameters \(\lambda=[A, B, \pi]\) that maximize \(P(X|\lambda)\) (maximum likelihood estimation in the supervised case)
  • Decoding problem: given \(\lambda\) and the observed variables \(X\), find the most likely hidden sequence \(Y\), i.e. the one that maximizes \(P(Y|X, \lambda)\) (Viterbi algorithm)
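As an illustration of the evaluation problem, the sketch below computes \(P(X|\lambda)\) with the forward recursion and checks it against a brute-force sum over every hidden sequence. The toy parameters and the function names `forward` and `brute_force` are assumptions made for this example.

```python
from itertools import product

# Toy parameters (assumed values, not learned from data).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]


def forward(pi, A, B, x):
    """P(X | lambda) via the forward recursion, O(n * N^2) time."""
    states = range(len(pi))
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = [pi[s] * B[s][x[0]] for s in states]
    for obs in x[1:]:
        alpha = [
            sum(alpha[r] * A[r][s] for r in states) * B[s][obs]
            for s in states
        ]
    return sum(alpha)


def brute_force(pi, A, B, x):
    """Same quantity by summing the joint probability over all N^n paths."""
    total = 0.0
    for y in product(range(len(pi)), repeat=len(x)):
        p = pi[y[0]] * B[y[0]][x[0]]
        for i in range(1, len(x)):
            p *= A[y[i - 1]][y[i]] * B[y[i]][x[i]]
        total += p
    return total


x = [0, 1, 1]
print(forward(pi, A, B, x), brute_force(pi, A, B, x))  # the two values agree
```

The brute-force version enumerates every path and is only feasible for tiny examples; the forward recursion reuses the partial sums stored in `alpha`.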

Conditional Random Fields (CRF)

The HMM is a generative model that directly models the joint distribution \(P(X, Y)\). The CRF resembles the HMM in some respects, but it is a discriminative undirected graphical model that models the conditional distribution. Specifically, for an observation sequence \(X\) and a tag sequence \(Y\), the goal of a CRF is to build the conditional probability model \(P(Y|X)\).

Markov random field

A Markov random field (MRF), also called a probabilistic undirected graphical model, represents a joint probability distribution. For an undirected graph \(G=[V, E]\), \(V\) is the set of all nodes in the graph and \(E\) is the set of all undirected edges. If the joint probability distribution over the nodes of the graph satisfies the Markov property, that joint distribution is called a Markov random field, or a probabilistic undirected graphical model.

Markov property: the conditional distribution of any node given all other nodes equals its conditional distribution given only its neighboring nodes:
\[P(y_v|Y_{V/\{v\}})=P(y_v|Y_{n(v)})\]

Here \(Y_{V/\{v\}}\) denotes all nodes in the graph other than \(y_v\), and \(Y_{n(v)}\) denotes all nodes adjacent to \(y_v\). In short, the Markov property is a statement of conditional independence between nodes: each node is determined only by its neighbors and is independent of non-adjacent nodes given those neighbors. Being an undirected graphical model, an MRF does not make assumptions as strict as those of an HMM.

Conditional Random Fields

A CRF is a special kind of MRF: given a set of inputs, the outputs obtained satisfy the conditions of an MRF.

In sequence labeling, "CRF" generally means the linear-chain CRF shown in the figure above: an undirected graphical model in the form of a linear chain, in which every node is adjacent to at most two other nodes. The conditional distribution of each node given all other nodes satisfies:
\[P(y_t|X, Y_{V/\{t\}})=P(y_t|X, y_{t-1}, y_{t+1})\]

When \(t\) is \(1\) or \(n\), only the single existing neighbor is considered.

Conditional random fields and feature functions

Modeling the conditional probability \(P(Y|X)\) with a CRF is more involved, but a careful read of the explanation below should make it clear. In parametric form, the conditional probability of a CRF is defined as:
\[P(Y|X)=\frac{1}{Z}\exp\left(\sum_j\sum_{i=2}^{n}\lambda_j t_j(y_{i-1}, y_i, x, i)+\sum_k\sum_{i=1}^{n}\mu_k s_k(y_i, x, i)\right)\]

where:

  • \(t_j(y_{i-1}, y_i, x, i)\) is a local feature function determined by the current node and the previous node. It is called a transition feature, and it describes the relationship between adjacent state variables and the influence of the observations on them;
  • \(s_k(y_i, x, i)\) is a node feature function that depends only on the current node. It is called a state feature;
  • \(\lambda_j\) and \(\mu_k\) are the weights of the corresponding feature functions, and \(Z\) is the normalization factor.

In sequence labeling we usually write the CRF in matrix form, which is easier both to understand and to compute:

For a linear-chain CRF, we first define two special nodes: \(y_0=<START>\) and \(y_{n+1}=<STOP>\).
For each position \(i=1, 2, ..., n+1\) of the observation sequence (written \(x\) below), define a matrix of order \(N\) (\(N\) is the number of hidden states). This matrix plays the role of the state transition matrix in an HMM:
\[M_i(x)=[M_i(y_{i-1}, y_i|x)]\]

\[M_i(y_{i-1}, y_i|x)=exp(W_i(y_{i-1}, y_i|x))\]

\[W_i(y_{i-1}, y_i|x)=\sum_{k=1}^{K} w_kf_k(y_{i-1}, y_i, x, i)\]

Here \(f_k(y_{i-1}, y_i, x, i)\) is a unified notation for the transition features \(t_j(y_{i-1}, y_i, x, i)\) and the state features \(s_k(y_i, x, i)\), and \(w_k\) is the unified notation for the corresponding weights (for the details of this unification, see p. 197 of Li Hang's "Statistical Learning Methods"). We thus obtain something analogous to the HMM state transition matrix, and the unnormalized probability of a tag sequence can be written as the product of \(n+1\) matrix elements, i.e. \(\prod_{i=1}^{n+1}M_i(y_{i-1}, y_i|x)\). Note, however, that the matrices \(M_i\) are not normalized (their entries do not sum to 1), so the final conditional probability has to be normalized:
\[P_w(Y|X)=\frac{1}{Z_w(x)}\prod_{i=1}^{n+1}M_i(y_{i-1}, y_i|x)\]

where \(Z_w(x)=(M_1(x)M_2(x) \cdots M_{n+1}(x))_{START, STOP}\) is the sum of the unnormalized probabilities of all paths from the start state to the stop state. It is the normalization factor that turns the unnormalized probabilities into normalized ones.

The transition probabilities of an HMM are constrained, whereas the entries of a CRF's transition matrices can be arbitrary weights, as long as the result is normalized globally at the end. This makes the CRF more flexible than the HMM.
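The matrix form can be checked numerically on a tiny example. The sketch below assumes two real labels (indices 0 and 1) and treats START and STOP as two extra indices, so each \(M_i\) is stored as a 4 x 4 matrix with zeros for impossible transitions; this layout, the weights, and the helper names are all choices made for this illustration. The weights stand in for \(W_i(y_{i-1}, y_i|x)\) on a sequence of length \(n=2\).

```python
import math
from itertools import product

LABELS = [0, 1]
START, STOP = 2, 3  # assumed extra indices next to the real labels
K = 4               # size of each stored matrix


def make_matrix(weights):
    """Build M_i with entries exp(W_i(prev, cur)); unlisted pairs get 0."""
    M = [[0.0] * K for _ in range(K)]
    for (prev, cur), w in weights.items():
        M[prev][cur] = math.exp(w)
    return M


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(K)) for j in range(K)]
            for i in range(K)]


# Hypothetical weights for positions i = 1, 2, 3 (n = 2, so n + 1 matrices).
Ms = [
    make_matrix({(START, 0): 0.5, (START, 1): -0.2}),
    make_matrix({(0, 0): 1.0, (0, 1): 0.3, (1, 0): -0.5, (1, 1): 0.8}),
    make_matrix({(0, STOP): 0.0, (1, STOP): 0.4}),
]


def path_score(y):
    """Unnormalized probability: product of M_i(y_{i-1}, y_i | x)."""
    path = [START] + list(y) + [STOP]
    score = 1.0
    for i, M in enumerate(Ms):
        score *= M[path[i]][path[i + 1]]
    return score


def partition():
    """Z_w(x): the (START, STOP) entry of M_1 M_2 ... M_{n+1}."""
    P = Ms[0]
    for M in Ms[1:]:
        P = matmul(P, M)
    return P[START][STOP]


Z = partition()
probs = {y: path_score(y) / Z for y in product(LABELS, repeat=2)}
print(probs)  # four conditional probabilities that sum to 1
```

None of the weights here are probabilities, and some are negative; dividing by `Z` is the only normalization applied.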

Comparison of CRF and HMM

The CRF is more powerful than the HMM, mainly for the following reasons:

  • Every HMM is equivalent to some particular CRF;
  • A hidden state in a CRF can depend on broader information (the neighboring hidden states on both sides and all observed variables), whereas a hidden state in an HMM can depend only on the previous hidden state and the observation at the current time;
  • The entries of a CRF's parameter matrices are unrestricted (unnormalized probabilities), whereas the parameters of an HMM must be constrained to be valid probabilities.

Viterbi algorithm

The Viterbi algorithm uses dynamic programming to solve a shortest-path problem, and it can be used to decode both HMMs and CRFs.

The Viterbi algorithm requires the following three elements:

  • The initial state probabilities \(\pi\)
  • The state transition probabilities
  • The emission (output observation) probabilities

As the earlier sections show, all three can be computed in both the HMM and the CRF.

The algorithm rests on one idea: every sub-path of an optimal path must itself be optimal.
An example makes this clear. Suppose we have the sentence "I love Beijing's Tiananmen Square." If we tag it with the BIO scheme, each character has three possible hidden states. We compute the probabilities layer by layer as follows:

  • For each node \(p_{1j}\) of the first character (one node per hidden state), compute its probability from the initial state probabilities and the emission probability of the first character, and store it at the node;
  • For each node \(p_{2j}\) of the second character, use the probabilities stored at the first-layer nodes together with the corresponding transition and emission probabilities to compute the probability of reaching \(p_{2j}\) from each node of the previous layer. Keep the maximum probability and the node it came from, and store both at \(p_{2j}\). We have now recorded, for every node in the second layer, the maximum probability of reaching it and the position of its predecessor.
  • For each node \(p_{3j}\) of the third character, use the maximum probabilities stored in the second layer and the transition probabilities to compute the probability of reaching \(p_{3j}\) from each node of the previous layer. Again keep the maximum probability and the node it came from, and store both at the node. We have now recorded the maximum probability and the predecessor position for every node in the third layer.
  • Likewise, for each later layer, use the maximum probabilities stored in the previous layer and the transition probabilities to compute the maximum probability of each state in the current layer, keeping the predecessor position each time, until the last layer.
  • At the last layer, take the node with the highest probability among the three states. Because each node stores its predecessor along with its maximum probability, we can trace backwards from this node to recover the hidden state of every character.
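The layer-by-layer procedure above can be sketched for an HMM as follows. This is a minimal illustration on a toy two-state model rather than the three-state BIO example; the parameter values and the function name `viterbi` are assumptions, and states and observations are integer indices.

```python
# Toy parameters (assumed values, not learned from data).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]


def viterbi(pi, A, B, x):
    """Return the most probable state path for x and its joint probability."""
    states = range(len(pi))
    # First layer: initial probability times emission of the first observation.
    delta = [pi[s] * B[s][x[0]] for s in states]
    backptrs = []  # backptrs[t][s]: best predecessor of state s at layer t + 1
    for obs in x[1:]:
        prev_layer, new_delta = [], []
        for s in states:
            # Best predecessor r maximizes (stored probability) * transition.
            best_r = max(states, key=lambda r: delta[r] * A[r][s])
            prev_layer.append(best_r)
            new_delta.append(delta[best_r] * A[best_r][s] * B[s][obs])
        backptrs.append(prev_layer)
        delta = new_delta
    # Last layer: pick the best final state, then follow the stored pointers.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for prev_layer in reversed(backptrs):
        path.append(prev_layer[path[-1]])
    path.reverse()
    return path, delta[last]


path, prob = viterbi(pi, A, B, [0, 1, 1])
print(path, prob)  # path is [0, 1, 1]; prob is approximately 0.04374
```

Each layer costs \(N^2\) operations, so decoding a sequence of length \(n\) takes \(O(nN^2)\) time instead of enumerating all \(N^n\) paths.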

Reference links
https://blog.csdn.net/shuibuzhaodeshiren/article/details/85093765
https://www.cnblogs.com/Determined22/p/6750327.html
https://zhuanlan.zhihu.com/p/35620631
https://zhuanlan.zhihu.com/p/56317740
https://www.cnblogs.com/Determined22/p/6915730.html
https://zhuanlan.zhihu.com/p/63087935

Origin www.cnblogs.com/sandwichnlp/p/11618530.html