CRF Conditional Random Field Summary

 

 

According to the book "Statistical Learning Methods", a conditional random field (CRF) is a conditional probability distribution model of a set of output random variables given a set of input random variables. It is characterized by the assumption that the output random variables constitute a Markov random field.

A conditional random field is a discriminative model.

 

1. Understanding Conditional Random Fields

 

1.1 Brief introduction of HMM

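HMM stands for Hidden Markov Model, a statistical model for sequence problems. It describes a process in which a hidden Markov chain randomly generates an unobservable sequence of states, and each state then generates an observation, producing a random sequence of observations.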

In this process, the unobservable sequence is called the state sequence, and the resulting sequence is called the observation sequence.

The process can be described by the following diagram:

In the above figure, $X_1, X_2, \ldots, X_T$ is the hidden state sequence, and $O_1, O_2, \ldots, O_T$ is the observation sequence.

Hidden Markov Models are determined by three probabilities:

  1. The initial probability distribution, that is, the probability distribution of the initial hidden state, denoted as $\pi$;
  2. The state transition probability distribution, that is, the transition probability distribution between hidden states, denoted as $A$;
  3. The observation probability distribution, that is, the probability distribution of the observation generated from a hidden state, denoted as $B$.

The above three probability distributions can be said to be the parameters of the hidden Markov model, and according to these three probabilities, a hidden Markov model $\lambda = (A, B, \pi)$ can be determined.

The three basic problems of hidden Markov models are:

  1. Probability calculation problem. Given the model $\lambda = (A, B, \pi)$ and the observation sequence $O$, calculate the probability $P(O|\lambda)$ of the observation sequence under the model $\lambda$, mainly solved with the forward-backward algorithm;
  2. Learning problem. Given the observation sequence $O$, estimate the model parameters $\lambda$ so that the probability of the observation sequence, $P(O|\lambda)$, is maximized. This is mainly solved with the Baum-Welch algorithm, an EM iteration (if the state sequence is also given, so that no hidden variable is involved, maximum likelihood estimation is used directly);
  3. Decoding problem. Given the model $\lambda = (A, B, \pi)$ and the observation sequence $O$, find the hidden sequence $X$ most likely to have produced this observation sequence, that is, the $X$ that maximizes $P(X|O, \lambda)$. This is mainly solved with the Viterbi algorithm (a dynamic programming approach).
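As a rough illustration of the probability calculation problem, here is a minimal sketch of the forward pass in plain Python. The toy parameters `pi`, `A`, `B` and the observation sequence `obs` are made-up values, not taken from the book; the brute-force function is only there to check the result on a tiny example.

```python
from itertools import product

# Hypothetical toy HMM with 2 hidden states and 2 observation symbols.
pi = [0.6, 0.4]                      # initial distribution
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(next state j | state i)
B = [[0.9, 0.1], [0.2, 0.8]]         # B[i][o] = P(observation o | state i)


def forward_probability(obs, pi, A, B):
    """Return P(O | lambda) with the forward recursion, O(T * N^2)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def brute_force_probability(obs, pi, A, B):
    """Sum the joint probability over every hidden path, O(N^T)."""
    n, total = len(pi), 0.0
    for path in product(range(n), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


obs = [0, 1, 1, 0]
p_forward = forward_probability(obs, pi, A, B)
p_brute = brute_force_probability(obs, pi, A, B)
print(p_forward, p_brute)
```

Both functions should agree up to floating-point error; the forward recursion simply reuses partial sums instead of enumerating every path.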

HMM was originally used for infectious disease models and public opinion dissemination problems. In these problems the current state can be simplified to depend only on the previous state, that is, it has the Markov property. Now consider a language labeling problem. The model needs to consider not only the label of the previous state but also the label of the following state (for example, "I love China" is noun + verb + noun, which requires more contextual information). Richer assumptions are therefore needed, which leads to: graphical model (the current state is related to the states connected to it) + conditional model (the states are modeled conditionally on the observations) = conditional random field.

 

1.2 Probabilistic Undirected Graph (Markov Random Field)

A probabilistic undirected graphical model, also known as a Markov random field, is a joint probability distribution that can be represented by an undirected graph. Directed graphical models, also known as Bayesian networks, follow an ordering such as that of a time series; the HMM is one of them. Because of this directionality, an HMM cannot take the next state of the sequence into account. An undirected graph, on the other hand, lets the current state depend on all the states connected to it, and so captures more complete contextual information.

A probabilistic graphical model is a probability distribution represented by a graph. Let $G=(V, E)$ denote a graph composed of a set of nodes $V$ and a set of edges $E$.

First of all, we need to clarify the pairwise Markov property, the local Markov property and the global Markov property. These three properties are theoretically proved to be equivalent.

The pairwise Markov property means that the two random variables corresponding to any two nodes in the graph $G$ that are not connected by an edge are conditionally independent given all the other random variables.

Given a joint probability distribution $P(Y)$, if the distribution satisfies the pairwise, local or global Markov property, the joint probability distribution is called a probabilistic undirected graph model or a Markov random field.

Figure: local Markov property (the black and white nodes are never adjacent, i.e. the pairwise Markov property)
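To make the pairwise Markov property concrete, the following sketch builds a tiny Markov random field on the chain $Y_1 - Y_2 - Y_3$ and checks numerically that $Y_1$ and $Y_3$, which are not connected by an edge, are conditionally independent given $Y_2$. The edge potentials `psi12` and `psi23` are arbitrary made-up numbers used only for illustration.

```python
from itertools import product

# Hypothetical edge potentials for the chain Y1 - Y2 - Y3 (binary variables).
psi12 = [[2.0, 1.0], [0.5, 3.0]]
psi23 = [[1.5, 0.2], [1.0, 4.0]]

states = (0, 1)
# The joint distribution is the normalized product of the edge potentials.
unnorm = {(a, b, c): psi12[a][b] * psi23[b][c]
          for a, b, c in product(states, repeat=3)}
Z = sum(unnorm.values())
joint = {y: v / Z for y, v in unnorm.items()}


def max_independence_gap(joint):
    """Largest |P(y1, y3 | y2) - P(y1 | y2) P(y3 | y2)| over all assignments."""
    gap = 0.0
    for b in states:
        p_b = sum(joint[a, b, c] for a in states for c in states)
        for a, c in product(states, repeat=2):
            p_ac = joint[a, b, c] / p_b
            p_a = sum(joint[a, b, k] for k in states) / p_b
            p_c = sum(joint[k, b, c] for k in states) / p_b
            gap = max(gap, abs(p_ac - p_a * p_c))
    return gap


gap = max_independence_gap(joint)
print(gap)
```

The gap should be zero up to floating-point error, because the joint factorizes over the two edges and $Y_2$ separates $Y_1$ from $Y_3$.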

 1.3 Conditional random fields

 A conditional random field (CRF) is a Markov random field of a random variable $Y$ given a random variable $X$. In practice, the most widely used form is the linear chain conditional random field for labeling tasks. In the conditional probability model $P(Y|X)$, $Y$ is the output variable, representing the label sequence (also called the state sequence), and $X$ is the input variable, representing the observation sequence that needs to be labeled.

During learning, use the training data set to obtain the conditional probability model $\hat P(Y|X)$ through maximum likelihood estimation or regularized maximum likelihood estimation;

When predicting, for a given input sequence $x$, find the output sequence $\hat y$ with the largest conditional probability $\hat P(y|x)$.

 

 A general conditional random field is defined as follows:

Let $X$ and $Y$ be random variables, and $P(Y|X)$ is the conditional probability distribution of $Y$ given $X$. If the random variable $Y$ constitutes a Markov random field represented by an undirected graph $G=(V, E)$, namely:

$$   P\left( {{Y_v}|X,{Y_w},w \ne v} \right) = P\left( {{Y_v}|X,{Y_w},w \sim v} \right).     $$

holds for every node $v$, then the conditional probability distribution $P(Y|X)$ is called a conditional random field. In the formula, $w \sim v$ denotes all nodes $w$ connected to node $v$ by an edge in the graph $G=(V, E)$, $w \ne v$ denotes all nodes other than $v$, and $Y_v$ and $Y_w$ are the random variables corresponding to nodes $v$ and $w$.

 

The linear chain conditional random field is a special case of the general conditional random field. It is defined as follows:

Let $X=(X_1, X_2, ..., X_n)$ and $Y=(Y_1, Y_2, ..., Y_n)$ be sequences of random variables represented by linear chains. If, given the random variable sequence $X$, the conditional probability distribution $P(Y|X)$ of the random variable sequence $Y$ constitutes a conditional random field, that is, it satisfies the Markov property:

$$     P\left( {{Y_i}|X,{Y_1}, \ldots ,{Y_{i - 1}},{Y_{i + 1}}, \ldots ,{Y_n}} \right) = P\left( {{Y_i}|X,{Y_{i - 1}},{Y_{i + 1}}} \right).  $$

$$ i = 1,2, \ldots ,n \quad \text{(only one side is considered when } i = 1 \text{ and } i = n\text{)}, $$

Then $P(Y|X)$ is called a linear chain conditional random field. In the labeling problem, $X$ represents the input observation sequence, and $Y$ represents the corresponding output label sequence or state sequence.

 

 

Linear Chain Conditional Random Field

 

 

 Linear chain conditional random fields with the same graph structure for $X$ and $Y$

2. Probability Calculation Problem of Conditional Random Fields

The probability calculation problem of a conditional random field is: given the conditional random field $P(Y|X)$, an input sequence $x$ and an output sequence $y$, calculate the conditional probabilities $P(Y_i=y_i|x)$ and $P(Y_{i-1}=y_{i-1}, Y_i=y_i|x)$, and the corresponding mathematical expectations.

There is no essential difference between the probability calculation of the conditional random field and that of the HMM. The procedure is the same; only the formulas change slightly.
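The formulas in this section use the symbols $f_k$, $M_i$, $\alpha$, $\beta$ and $Z(x)$ without defining them. As a reminder, and assuming the notation of "Statistical Learning Methods" (unified feature functions $f_k$ with weights $w_k$, and artificial start and stop labels $y_0$ and $y_{n+1}$), the matrix form of the linear chain conditional random field can be summarized as:

$$ M_i(y_{i-1}, y_i | x) = \exp\Big( \sum_{k=1}^{K} w_k f_k(y_{i-1}, y_i, x, i) \Big), \quad i = 1, 2, \ldots, n+1, $$

$$ P(y | x) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i | x), $$

$$ \alpha_i^T(x) = \alpha_{i-1}^T(x) M_i(x), \qquad \beta_i(x) = M_{i+1}(x) \beta_{i+1}(x). $$

Here $Z(x)$ is the normalization factor, obtained by summing the product of the $M_i$ over all label sequences.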

 

2.1 Forward-Backward Algorithm

To calculate the probability at each node, such as $P(Y_i = y_i | x)$ in the book, either the forward or the backward algorithm can be used. Both scan the edge weights of the whole chain and compute the normalization factor $Z(x)$, the sum of the unnormalized scores of all label sequences, but in opposite directions: one from front to back, the other from back to front. Hence the formula in the book:
$$ Z(x) = \alpha_n^T(x) \cdot 1 = 1^T\cdot \beta_1(x).$$
In the formula, $\alpha$ is the forward vector, $\beta$ is the backward vector, and $1$ is the all-ones column vector.

From the definition of the forward and backward vectors, it is easy to calculate the conditional probability that the label at position $i$ is $y_i$, and the conditional probability that the labels at positions $i-1$ and $i$ are $y_{i-1}$ and $y_i$:
$$ P(Y_i = y_i | x) =\frac {\alpha_i^T(y_i | x) \beta_i(y_i | x)}{Z(x)}. $$
$$ P(Y_{i-1} = y_{i-1},Y_i = y_i | x) = \frac{\alpha_{i-1}^T(y_{i-1} | x)M_i(y_{i -1},y_i|x)\beta_i(y_i|x)}{Z(x)}. $$
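The two formulas above can be sketched in plain Python as follows. This is a simplified illustration, not the book's exact setup: positions are 0-based, `start[y]` stands in for the scores out of the start label, the factor into the stop label is taken to be 1, and all numbers are made up.

```python
from itertools import product

# Hypothetical unnormalized scores: n = 3 positions, k = 2 labels.
# M[i][a][b] is the score of label a at position i-1 followed by
# label b at position i (0-based); M[0] is unused.
start = [1.0, 2.0]
M = [None,
     [[1.0, 0.5], [2.0, 1.5]],
     [[0.3, 1.2], [1.0, 0.7]]]
n, k = 3, 2


def forward_backward(start, M, n, k):
    """Return forward vectors, backward vectors and the normalizer Z."""
    alpha = [start[:]]
    for i in range(1, n):
        alpha.append([sum(alpha[i - 1][a] * M[i][a][b] for a in range(k))
                      for b in range(k)])
    beta = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [sum(M[i + 1][a][b] * beta[i + 1][b] for b in range(k))
                   for a in range(k)]
    Z = sum(alpha[n - 1])
    return alpha, beta, Z


alpha, beta, Z = forward_backward(start, M, n, k)

# P(Y_i = y | x) and P(Y_{i-1} = a, Y_i = b | x)
node_marginal = [[alpha[i][y] * beta[i][y] / Z for y in range(k)]
                 for i in range(n)]
pair_marginal = {(i, a, b): alpha[i - 1][a] * M[i][a][b] * beta[i][b] / Z
                 for i in range(1, n) for a in range(k) for b in range(k)}


def score(y):
    """Unnormalized score of one complete label sequence."""
    s = start[y[0]]
    for i in range(1, n):
        s *= M[i][y[i - 1]][y[i]]
    return s


# Brute-force normalizer, feasible only for tiny chains.
Z_brute = sum(score(y) for y in product(range(k), repeat=n))
print(Z, Z_brute)
```

On this toy chain the forward normalizer, the backward normalizer and the brute-force sum should all coincide, and each set of marginals should sum to 1.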

 

2.2 Calculating Expectations

Using the forward and backward vectors, the mathematical expectation of a feature function with respect to the joint distribution $P(X,Y)$ and the conditional distribution $P(Y|X)$ can be calculated.

 

The mathematical expectation of the feature function $f_k$ with respect to the conditional distribution $P(Y|X)$ is:

\begin{align*}
E_{P(Y|X)}[f_k] &= \sum_y P(y | x) f_k(y,x)\\
&=\sum_{i=1}^{n+1}\sum_{y_{i-1}y_i}f_k(y_{i-1},y_i,x,i)\frac{\alpha_{i-1}^T(y_{i-1} | x)M_i(y_{i-1},y_i|x)\beta_i(y_i|x)}{Z(x)},\\
&\qquad k = 1,2,\ldots, K,
\end{align*}

where $Z(x) = \alpha_n^T(x) \cdot 1$.
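As an illustration of the conditional expectation, the sketch below computes the expected count of one transition feature in two ways: by weighting the feature with the pairwise marginals from the forward and backward vectors, and by enumerating all label sequences. The scores and the feature itself are hypothetical, and the start/stop handling is simplified (0-based positions, no stop factor).

```python
from itertools import product

# Hypothetical unnormalized scores: 3 positions, 2 labels.
start = [1.0, 2.0]
trans = [[[1.0, 0.5], [2.0, 1.5]],   # step from position 0 to 1
         [[0.3, 1.2], [1.0, 0.7]]]   # step from position 1 to 2
n, k = 3, 2


def feature(prev_label, label):
    """Hypothetical transition feature: fires when label 0 is followed by label 1."""
    return 1.0 if (prev_label, label) == (0, 1) else 0.0


def expected_count_fb():
    """E[f] under P(Y | x) using forward and backward vectors."""
    alpha = [start[:]]
    for t in trans:
        alpha.append([sum(alpha[-1][a] * t[a][b] for a in range(k))
                      for b in range(k)])
    beta = [[1.0] * k]
    for t in reversed(trans):
        beta.insert(0, [sum(t[a][b] * beta[0][b] for b in range(k))
                        for a in range(k)])
    Z = sum(alpha[-1])
    return sum(feature(a, b) * alpha[i][a] * trans[i][a][b] * beta[i + 1][b] / Z
               for i in range(n - 1) for a in range(k) for b in range(k))


def expected_count_brute():
    """E[f] under P(Y | x) by enumerating every label sequence."""
    total, Z = 0.0, 0.0
    for y in product(range(k), repeat=n):
        s = start[y[0]]
        for i in range(n - 1):
            s *= trans[i][y[i]][y[i + 1]]
        Z += s
        total += s * sum(feature(y[i], y[i + 1]) for i in range(n - 1))
    return total / Z


e_fb = expected_count_fb()
e_brute = expected_count_brute()
print(e_fb, e_brute)
```

The two values should match; the forward-backward version needs only $O(n k^2)$ operations instead of $O(k^n)$.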

 

Assuming that the empirical distribution is $\tilde P(X)$, the mathematical expectation of the feature function $f_k$ with respect to the joint distribution $P(X,Y)$ is:

\begin{align*}
E_{P(X,Y)}[f_k] &= \sum_{x,y}P(x,y)\sum_{i=1}^{n+1}f_k(y_{i-1},y_i,x,i) \\
&=\sum_{x} \tilde P(x)\sum_{i=1}^{n+1}\sum_{y_{i-1}y_i}f_k(y_{i-1},y_i,x,i)\frac{\alpha_{i-1}^T(y_{i-1} | x)M_i(y_{i-1},y_i|x)\beta_i(y_i|x)}{Z(x)},\\
&\qquad k = 1,2,\ldots, K,
\end{align*}

where $Z(x) = \alpha_n^T(x) \cdot 1$.

 

2.3 Parameter learning algorithm

The conditional random field model is in fact a log-linear model defined on sequence data. Its learning methods include maximum likelihood estimation and regularized maximum likelihood estimation, and the concrete optimization algorithms include the improved iterative scaling method (IIS), gradient descent and quasi-Newton methods.

The theoretical derivation of the parameter learning algorithm is the same as for the maximum entropy model: it is still a matter of maximizing the log-likelihood function of the training data.

The log-likelihood function of the training data, where $\tilde P(x,y)$ is the empirical distribution, is:

$$   L(w) = L_{\tilde P}(P_w) = \log \prod_{x,y}P_w(y | x)^{\tilde P(x,y)}.  $$
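To see why the expectations of Section 2.2 matter for learning, it helps to expand the log-likelihood. Writing $\tilde P$ for the empirical distribution and assuming the log-linear form $P_w(y|x) = \exp\big(\sum_k w_k f_k(y,x)\big) / Z_w(x)$ (a notational assumption borrowed from "Statistical Learning Methods", not spelled out in this post), a short derivation gives:

\begin{align*}
L(w) &= \sum_{x,y}\tilde P(x,y)\sum_{k=1}^{K} w_k f_k(y,x) - \sum_{x}\tilde P(x)\log Z_w(x),\\
\frac{\partial L(w)}{\partial w_k} &= \sum_{x,y}\tilde P(x,y) f_k(y,x) - \sum_{x}\tilde P(x)\sum_{y}P_w(y | x) f_k(y,x).
\end{align*}

The gradient is the empirical expectation of $f_k$ minus its expectation under the model, so each gradient step requires exactly the model expectation computed with the forward and backward vectors.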

 

2.4 Prediction Algorithms

The Viterbi algorithm uses the classic dynamic programming idea. The algorithm is exactly the same as for the HMM, so there is no need to re-derive it; see the earlier blog post on the Viterbi algorithm.

So why is the Viterbi algorithm needed, instead of directly substituting the input vector $x$ as in the maximum entropy model? Simply put, the nodes in the graph depend on each other, so plugging $x$ into $P(Y | X)$ position by position does not work: there is no way of knowing which label should go together with which other label. The problem therefore has to be unrolled, that is, every possible label combination has to be scored. Done exhaustively, this takes $O(k^T)$ time, where $k$ is the number of labels and $T$ is the length of the sequence, which is very expensive. Dynamic programming trades space for time: the best score reaching each intermediate node is recorded and reused during the forward scan, so the running time drops to $O(T k^2)$.
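The following is a minimal sketch of Viterbi decoding on a toy linear chain, checked against exhaustive search. The scores in `start` and `trans` are invented, and for simplicity they are multiplied directly rather than added in log space, which would be preferable for long sequences.

```python
from itertools import product

# Hypothetical unnormalized scores: 4 positions, 3 labels.
start = [1.0, 2.0, 0.5]
trans = [[[1.0, 0.5, 2.0], [2.0, 1.5, 0.1], [0.4, 3.0, 1.0]],
         [[0.3, 1.2, 0.8], [1.0, 0.7, 2.5], [0.6, 0.9, 1.1]],
         [[2.0, 0.2, 0.5], [0.1, 1.8, 0.9], [1.3, 0.4, 0.6]]]
k = 3
n = len(trans) + 1


def path_score(y):
    """Unnormalized score of one complete label sequence."""
    s = start[y[0]]
    for i in range(n - 1):
        s *= trans[i][y[i]][y[i + 1]]
    return s


def viterbi(start, trans):
    """Return the best label sequence and its score in O(n * k^2)."""
    delta = start[:]          # best score of any path ending in each label
    backptr = []              # backptr[i][b] = best previous label for b
    for t in trans:
        new_delta, ptr = [], []
        for b in range(k):
            best_a = max(range(k), key=lambda a: delta[a] * t[a][b])
            ptr.append(best_a)
            new_delta.append(delta[best_a] * t[best_a][b])
        delta = new_delta
        backptr.append(ptr)
    last = max(range(k), key=lambda y: delta[y])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


best_path, best_score = viterbi(start, trans)
brute_score = max(path_score(y) for y in product(range(k), repeat=n))
print(best_path, best_score, brute_score)
```

Since $Z(x)$ is the same for every label sequence, maximizing the unnormalized score is equivalent to maximizing $P(y|x)$, so decoding never needs the normalizer.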

 

 

3. Conditional Random Fields and Other Models

 

3.1 Classic comparison chart

Classic comparison diagram, from the paper: Sutton, Charles, and Andrew McCallum. "An introduction to conditional random fields." Foundations and Trends in Machine Learning 4.4 (2011): 267-373.

 

The figure shows where the CRF sits among related models. Starting from naive Bayes, extending the model to sequences gives the HMM, and conditioning the HMM gives the CRF. Alternatively, conditioning naive Bayes gives the logistic regression model, and extending that to sequences gives the CRF. Both paths lead to the same place.

Let's first look at the model of Naive Bayes:
$$ P(Y | X ) = \frac{P(Y) P(X| Y)}{P(X)}. $$
where the feature vector is $X= (x_1,x_2,...,x_n)$. Since naive Bayes assumes that the features are conditionally independent given the class $Y$, we have:
$$ P(X|Y) = P(x_1 | Y) P(x_2 | Y)\cdots P(x_n|Y). $$
Rearranging gives:
$$ P(Y,X) = P(Y) \prod_{i=1}^nP(x_i | Y). $$

Let's look at the logistic regression model in general form:
\begin{align*}
P( Y | X) &= \frac {1}{Z(X)} \exp\Big(\theta_y + \sum_{i=1}^n \theta_{yi} f_i(X,Y)\Big)\\
&= \frac {1}{Z(X)} \exp\Big(\theta_y + \sum_{i=1}^n \theta_{yi} x_i\Big),
\end{align*}
where $Z(X)$ is the normalization factor.

Continue the derivation by the model of Naive Bayes:
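\begin{align*}
P(Y, X) &= P(Y = y_c) \cdot \prod_{i = 1}^n P(X= x_i | Y = y_c ) \\
&=\exp[\log P(y_c)]\exp\Big[\sum_{i=1}^n \log P(x_i | y_c)\Big]\\
&=\exp\Big\{\theta_y + \sum_{i=1 }^n\theta_{yi} [X = x_i \text{ and } Y = y_c]\Big\}.
\end{align*}
Here $\theta_y = \log P(y_c)$, $\theta_{yi} = \log P(x_i | y_c)$, and $[\cdot]$ is the indicator function, equal to 1 when the condition holds and 0 otherwise.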

This is what naive Bayes looks like when viewed from the logistic regression model. First, the logistic regression model is expressed as a conditional probability, not a joint probability, because it is a discriminative model. Second, the feature functions that follow the parameter $\theta_{yi}$ differ between the two formulas. Naive Bayes models the joint probability distribution, so it is a generative model; logistic regression does not model the joint distribution, but substitutes the actual value of each feature into the formula to compute the conditional probability directly. With this in mind, the classic diagram above should be easier to understand.
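The correspondence can be checked numerically. The sketch below, using made-up probabilities for a two-class naive Bayes model with three binary features, computes the posterior once with Bayes' rule and once in log-linear form with $\theta_y = \log P(y)$ and $\theta_{yi} = \log P(x_i | y)$. The variable names are illustrative only.

```python
import math
from itertools import product

# Hypothetical naive Bayes parameters: 2 classes, 3 binary features.
prior = [0.3, 0.7]                                   # P(Y = c)
cond = [[[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]],        # cond[c][i][v] = P(x_i = v | Y = c)
        [[0.3, 0.7], [0.5, 0.5], [0.4, 0.6]]]


def posterior_bayes(x):
    """P(Y | X = x) via the joint P(Y) * prod_i P(x_i | Y)."""
    joint = [prior[c] * math.prod(cond[c][i][v] for i, v in enumerate(x))
             for c in range(2)]
    total = sum(joint)
    return [j / total for j in joint]


# Log-linear parameters obtained by taking logs of the tables.
theta_y = [math.log(p) for p in prior]
theta_yi = [[[math.log(p) for p in feat] for feat in cls] for cls in cond]


def posterior_loglinear(x):
    """P(Y | X = x) as exp(theta_y + sum_i theta_yi) / Z(x)."""
    scores = [theta_y[c] + sum(theta_yi[c][i][v] for i, v in enumerate(x))
              for c in range(2)]
    Zx = sum(math.exp(s) for s in scores)
    return [math.exp(s) / Zx for s in scores]


max_diff = max(abs(p - q)
               for x in product((0, 1), repeat=3)
               for p, q in zip(posterior_bayes(x), posterior_loglinear(x)))
print(max_diff)
```

The difference between the two models is therefore not the functional form but how the parameters are fitted: naive Bayes estimates them from the joint distribution, while logistic regression optimizes the conditional likelihood directly.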

 

3.2 HMM vs. MEMM vs. CRF

  • HMM -> MEMM: The HMM makes two assumptions: the output observations are strictly independent of each other, and during state transitions the current state depends only on the previous state. In practice, however, sequence labeling depends not only on a single word but also on the length of the observed sequence, the context of the word, and so on. MEMM removes the HMM's output independence assumption. Whereas the HMM is limited to the dependence between observations and states, MEMM introduces custom feature functions, which can express dependencies between observations as well as complex dependencies between the current observation and several states before and after it.
  • MEMM -> CRF: CRF removes the HMM's output independence assumption and also solves MEMM's label bias problem. MEMM normalizes only locally, at each step, so it is prone to getting stuck in local optima. CRF computes a global probability: normalization takes the whole sequence into account rather than one step at a time, which removes the label bias of MEMM and lets decoding reach the globally optimal labeling.
  • HMM and MEMM are directed graphical models: they account for the interaction between $x$ and $y$ step by step, but do not consider $x$ as a whole. CRF is an undirected graphical model without this directional restriction, which overcomes the problem.
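A toy illustration of the difference between local and global normalization, under invented scores: label 0 at the first position has only one possible successor, and the observation scores that transition poorly. With per-step normalization (MEMM style) the transition still receives probability 1, so the low score is ignored; with global normalization (CRF style) the low score lowers the probability of the whole path.

```python
from itertools import product

# Hypothetical scores for a 2-position chain with labels 0 and 1.
start = [0.6, 0.4]          # score of each label at the first position
step = [[0.1, 0.0],         # from label 0: only 0 -> 0 is possible, weakly supported
        [2.0, 3.0]]         # from label 1: both successors are well supported
labels = (0, 1)
paths = list(product(labels, repeat=2))


def memm_prob(y0, y1):
    """Locally normalized: each step is normalized over its own successors."""
    return (start[y0] / sum(start)) * (step[y0][y1] / sum(step[y0]))


def crf_prob(y0, y1):
    """Globally normalized: one normalizer over all complete paths."""
    Z = sum(start[a] * step[a][b] for a, b in paths)
    return start[y0] * step[y0][y1] / Z


memm_best = max(paths, key=lambda y: memm_prob(*y))
crf_best = max(paths, key=lambda y: crf_prob(*y))
print(memm_best, crf_best)
```

In this example the locally normalized model prefers the path through label 0 even though its transition is barely supported, while the globally normalized model prefers the path through label 1.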

 

 

Reference content:

1. How to explain the conditional random field (CRF) model with simple and easy to understand examples? How is it different from HMM? https://www.zhihu.com/question/35866596

2. Conditional random field study notes: https://blog.csdn.net/u014688145/article/details/58055750

 3. "Statistical Learning Methods", Li Hang

 
