CRF series - a simple example

A CRF can be applied to the automatic labeling of a sequence, for example part-of-speech tagging of a sentence, i.e., automatically determining the part of speech of each word. In this problem, the tag assigned to each word depends not only on the word itself but also on the tags of the other words, and a CRF can take these dependencies into account.

This article uses part-of-speech tagging as an example to describe the problem a CRF solves and to walk through building the CRF model, learning its parameters, and making predictions.

Task description

Part-of-speech tagging (POS tagging) aims to assign a tag to every word in a sentence, labeling each word with its part of speech (ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE). For example, "the quick fox jumps" would be tagged ARTICLE, ADJECTIVE, NOUN, VERB. In this tagging task we assume that the tag of each word depends not only on the word itself but also on the tag of the word before it (a simplification: in reality the tag of each word depends on much more).

Next, I will build a simple linear-chain CRF for this problem, describe how the CRF represents the dependency above, and show how the CRF is used to solve it.

As with other statistical and machine learning models, we have three tasks to complete:

1) specify the model and its parameters (modeling)

2) estimate those parameters (learning)

3) use the learned parameters to make predictions (prediction)

The first task - building the model

Feature functions

To evaluate how plausible each tag is for each word, we need to define a set of feature functions over positions in the word sequence. A feature function of a linear-chain CRF has the following form:

$$f_k(x, l_{i-1}, l_i, i)$$

where $x$ is the word sequence, $i$ is the position of a word in the sequence, $l_i$ is the part-of-speech tag of the word at position $i$, and $l_{i-1}$ is the tag of the word before it.

Each feature function expresses the "possibility" that, at position $i$, the current word is tagged $l_i$ while the previous word is tagged $l_{i-1}$. This "possibility" is not a probability; its value is usually 0 or 1.

For example, feature functions may be defined as follows:

$$f_1(x, l_{i-1}, l_i, i) = \begin{cases} 1, & l_i = \text{ADVERB and the } i\text{-th word ends in "-ly"} \\ 0, & \text{otherwise} \end{cases}$$

$$f_2(x, l_{i-1}, l_i, i) = \begin{cases} 1, & i = 1,\ l_i = \text{VERB, and the sentence ends with a question mark} \\ 0, & \text{otherwise} \end{cases}$$

$$f_3(x, l_{i-1}, l_i, i) = \begin{cases} 1, & l_{i-1} = \text{ADJECTIVE and } l_i = \text{NOUN} \\ 0, & \text{otherwise} \end{cases}$$

$$f_4(x, l_{i-1}, l_i, i) = \begin{cases} 1, & l_{i-1} = \text{PREPOSITION and } l_i = \text{PREPOSITION} \\ 0, & \text{otherwise} \end{cases}$$

Note that the value of each feature function depends only on the current position and the previous one; this is exactly what gives the model the ability to represent the dependency we described earlier.
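
To make this concrete, here is a minimal Python sketch of the four example feature functions, written as plain indicator functions. The tag names, the 0-based position index, and the representation of $x$ as a list of word strings are illustrative assumptions, not part of any particular library.

```python
# A minimal sketch: each feature function looks at the sentence x (a list of
# word strings), the previous tag, the current tag, and the 0-based position i,
# and returns 0 or 1.

def f1(x, prev_tag, tag, i):
    # 1 if the word at position i is tagged ADVERB and ends in "-ly"
    return 1 if tag == "ADVERB" and x[i].endswith("ly") else 0

def f2(x, prev_tag, tag, i):
    # 1 if the first word is tagged VERB and the sentence ends with a question mark
    return 1 if i == 0 and tag == "VERB" and x[-1].endswith("?") else 0

def f3(x, prev_tag, tag, i):
    # 1 if an ADJECTIVE is immediately followed by a NOUN
    return 1 if prev_tag == "ADJECTIVE" and tag == "NOUN" else 0

def f4(x, prev_tag, tag, i):
    # 1 if a PREPOSITION is immediately followed by another PREPOSITION
    return 1 if prev_tag == "PREPOSITION" and tag == "PREPOSITION" else 0
```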

model

With feature functions in hand, we can begin to build the model.

Building the model means expressing the probability that, given a sentence whose word sequence is $x = (x_1 x_2 \dots x_N)$, the whole sentence is tagged $l = (l_1 l_2 \dots l_N)$, i.e., the conditional probability:

$$P(L = l \mid X = x)$$

In other words, it is the joint probability of $l_1, l_2, \dots, l_N$ conditioned on $x$.

Weights

Each feature function should contribute differently to the overall result, so each one gets a weight $w = (w_1, w_2, \dots, w_M)$: the larger the weight, the greater the influence of that feature function on the tagging result. How should these weights be chosen? The weights $w$ are exactly what we call the model parameters, and choosing their values is our second task, parameter learning, discussed in the next section.

Unnormalized probability

Each feature function expresses a "possibility" at each position, so the "possibility" that the whole sentence is tagged $l = (l_1 l_2 \dots l_N)$ is naturally the product of the possibilities at every position.

For convenience, however, we exponentiate the weighted features, $\exp(w_k f_k(x, l_{i-1}, l_i, i))$, so that the product over positions becomes a sum inside the exponential:

$$\exp\left(\sum_{i=1}^{N} \sum_{k=1}^{M} w_k f_k(x, l_{i-1}, l_i, i)\right)$$

The parameterized model

At this point, the shape of the model is fairly clear.

Now "possibility" is not represented by a probability, we should do the normalization process, so that its value is between 0-1, so the introduction of normalized items $ Z (x) $:

In this way we obtain the conditional probability we need:

$$P(L = l \mid X = x) = \frac{1}{Z(x)} \exp\left(\sum_{i=1}^{N} \sum_{k=1}^{M} w_k f_k(x, l_{i-1}, l_i, i)\right)$$

where $l$ and $x$ are both sequences (vectors): $l = l_1 l_2 \dots l_N$ and $x = x_1 x_2 \dots x_N$.
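
As a sanity check on the formula, here is a hedged Python sketch that computes $P(L = l \mid X = x)$ by brute force on a toy example: it scores a tag sequence with $\sum_i \sum_k w_k f_k$, enumerates every possible tag sequence to get $Z(x)$, and normalizes. The tiny tag set, the two toy feature functions, the made-up weights, and the `<START>` padding tag used at the first position are all illustrative assumptions.

```python
import math
from itertools import product

TAGS = ["NOUN", "VERB", "ADVERB"]   # toy tag set
START = "<START>"                   # stands in for l_0 (the first word has no previous tag)

def f1(x, prev_tag, tag, i):
    return 1 if tag == "ADVERB" and x[i].endswith("ly") else 0

def f2(x, prev_tag, tag, i):
    return 1 if prev_tag == "NOUN" and tag == "VERB" else 0

FEATURES = [f1, f2]
WEIGHTS = [1.5, 0.8]                # made-up values; normally these come from learning

def score(x, labels):
    """sum_i sum_k w_k * f_k(x, l_{i-1}, l_i, i) for one full tag sequence."""
    total, prev = 0.0, START
    for i, cur in enumerate(labels):
        total += sum(w * f(x, prev, cur, i) for w, f in zip(WEIGHTS, FEATURES))
        prev = cur
    return total

def conditional_probability(x, labels):
    """P(l | x) = exp(score(x, l)) / Z(x), with Z(x) by brute-force enumeration."""
    z = sum(math.exp(score(x, cand)) for cand in product(TAGS, repeat=len(x)))
    return math.exp(score(x, labels)) / z

print(conditional_probability(["dogs", "bark", "loudly"], ("NOUN", "VERB", "ADVERB")))
```

The brute-force loop over `product(TAGS, repeat=len(x))` is exactly the $m^N$ enumeration discussed in the learning section below, which is why this only works for toy inputs.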

summary

This is the parametric model of our linear-chain CRF. It can represent the dependency we want: the tagging result at each position depends not only on the word at that position but also on the tagging result at the previous position.

We have also identified the model parameters: the feature-function weights $w = (w_1, w_2, \dots, w_M)$.

The second task - learning

Goal

The goal of learning is to find a set of parameters $w = (w_1, w_2, \dots, w_M)$ that maximizes the conditional probability $P(L = l \mid X = x)$ on the training data, where $x = (x_1, x_2, \dots, x_N)$ is an observed sequence and $l = (l_1, l_2, \dots, l_N)$ its tagging; that is, to find $w^*$ such that:

$$w^* = \arg\max_{w} P(L = l \mid X = x)$$

There are many ways to do this, such as maximum likelihood estimation optimized with gradient descent; we will not go into detail here.
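
For completeness, here is a hedged sketch of what maximum likelihood estimation looks like for this model, shown for a single training pair $(x, l)$; in practice one sums over the training set and often adds regularization. The gradient takes the standard "observed minus expected feature counts" form:

$$\log P(L = l \mid X = x) = \sum_{i=1}^{N} \sum_{k=1}^{M} w_k f_k(x, l_{i-1}, l_i, i) - \log Z(x)$$

$$\frac{\partial \log P(L = l \mid X = x)}{\partial w_k} = \sum_{i=1}^{N} f_k(x, l_{i-1}, l_i, i) - \sum_{l'} P(L = l' \mid X = x) \sum_{i=1}^{N} f_k(x, l'_{i-1}, l'_i, i)$$

The second term is an expectation over all tag sequences $l'$ and again involves $Z(x)$, which is where the computational difficulty discussed next comes from.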

With that, the learning task seems complete. But wait: the problem is not quite that simple.

Calculation

Let us take another look at $Z(x)$:

$$Z(x) = \sum_{l'} \exp\left(\sum_{i=1}^{N} \sum_{k=1}^{M} w_k f_k(x, l'_{i-1}, l'_i, i)\right)$$

Look at the leftmost sum, $\sum_{l'}$. It is written compactly here, but $l'$ is a whole sequence, so its full form is:

$$\sum_{l'} = \sum_{l'_1} \sum_{l'_2} \cdots \sum_{l'_N}$$

That is, if there are $m$ possible tags at each position and the sentence has length $N$, there are $m^N$ possible tag sequences. Even with only two tags per word, a 15-word sentence already gives $2^{15} = 32768$ terms (with the six tags above and $N = 15$, it is $6^{15} \approx 4.7 \times 10^{11}$), so computing $Z(x)$ directly is hard.

The forward-backward algorithm

The sequences we study are usually long, and as noted above, computing $Z(x)$ directly for such high-dimensional data is infeasible. For a linear-chain CRF, $Z(x)$ is therefore computed with the forward-backward algorithm.

We will not explain the forward-backward algorithm here; a later article will cover it in detail.
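
To give a flavor of the idea ahead of that article, here is a minimal Python sketch of the forward recursion for $Z(x)$ in a linear-chain CRF, which reduces the cost from $O(m^N)$ to $O(N m^2)$. The tag set, the `<START>` padding tag, and the feature/weight interface are illustrative assumptions, and a real implementation would work in log-space to avoid overflow.

```python
import math

TAGS = ["NOUN", "VERB", "ADVERB", "ADJECTIVE", "PREPOSITION", "ARTICLE"]
START = "<START>"  # stands in for l_0

def local_score(x, prev_tag, tag, i, weights, features):
    """sum_k w_k * f_k(x, l_{i-1}, l_i, i) at a single position."""
    return sum(w * f(x, prev_tag, tag, i) for w, f in zip(weights, features))

def forward_z(x, weights, features, tags=TAGS):
    """Compute Z(x) with the forward recursion in O(N * |tags|^2) time."""
    # alpha[t] = total exponentiated score of all tag prefixes ending in tag t
    alpha = {t: math.exp(local_score(x, START, t, 0, weights, features)) for t in tags}
    for i in range(1, len(x)):
        alpha = {
            t: sum(alpha[s] * math.exp(local_score(x, s, t, i, weights, features))
                   for s in tags)
            for t in tags
        }
    return sum(alpha.values())
```

On a toy example this returns the same value as the brute-force $Z(x)$ sketched earlier, but it never enumerates the $m^N$ tag sequences.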

summary

With this, the second task is also settled.

Learning means finding the parameters that maximize the conditional probability of the training data;

however, because the data is high-dimensional and the computation is large, the forward-backward algorithm is used to compute the linear-chain CRF's normalization term $Z(x)$.

The third task - prediction

Now that we have the model and have optimized its parameters on known data, the next step is to make inferences about unknown data. For our POS-tagging problem this means: given a sentence $x$ whose tags are unknown, use the model to infer the most likely part of speech for every word, that is, to find the best tag sequence $l^*$ such that:

$$l^* = \arg\max_{l} P(L = l \mid X = x)$$

Calculation

This time let us be careful and check whether the computation poses any problems.

Expanding the formula:

$$l^* = \arg\max_{l} \frac{1}{Z(x)} \exp\left(\sum_{i=1}^{N} \sum_{k=1}^{M} w_k f_k(x, l_{i-1}, l_i, i)\right)$$

We only need the $l$ that maximizes $P(L = l \mid X = x)$, and $Z(x)$ is the same for every $l$, so the problem becomes:

$$l^* = \arg\max_{l} \sum_{i=1}^{N} \sum_{k=1}^{M} w_k f_k(x, l_{i-1}, l_i, i)$$

Great: $Z(x)$ does not need to be computed at all.

However, to find the best $l^*$ we still cannot simply try every possible $l$; that would again be a huge amount of computation.

The Viterbi algorithm solves this problem using dynamic programming. We will not go into the details of the algorithm for now; a later article will describe it in detail.
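
As a preview, here is a minimal Python sketch of Viterbi decoding for this model, under the same illustrative assumptions as the earlier sketches (a `<START>` padding tag and a `local_score` helper that sums $w_k f_k$ at one position). Since $\arg\max$ is unaffected by $\exp$, it works directly with the summed scores.

```python
TAGS = ["NOUN", "VERB", "ADVERB", "ADJECTIVE", "PREPOSITION", "ARTICLE"]
START = "<START>"

def local_score(x, prev_tag, tag, i, weights, features):
    """sum_k w_k * f_k(x, l_{i-1}, l_i, i) at a single position."""
    return sum(w * f(x, prev_tag, tag, i) for w, f in zip(weights, features))

def viterbi(x, weights, features, tags=TAGS):
    """Return the tag sequence l* maximizing sum_i sum_k w_k f_k, via dynamic programming."""
    # best[t] = best total score of any tag prefix ending in tag t
    best = {t: local_score(x, START, t, 0, weights, features) for t in tags}
    backpointers = []
    for i in range(1, len(x)):
        new_best, pointers = {}, {}
        for t in tags:
            scores = {s: best[s] + local_score(x, s, t, i, weights, features) for s in tags}
            prev = max(scores, key=scores.get)
            new_best[t], pointers[t] = scores[prev], prev
        best = new_best
        backpointers.append(pointers)
    # follow the backpointers from the best final tag
    tag = max(best, key=best.get)
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))
```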

summary

We now know how to use the trained model and its learned parameters to tag a sentence whose parts of speech are unknown. To obtain the best tagging, the Viterbi algorithm is required.

Conclusion

So far, using the example of part-of-speech tagging, we have seen that:

  1. A CRF is suitable when there are multiple positions to label and the labeling result at each position depends not only on that position itself but also on the labeling results at other positions.
  2. Like other statistical machine learning models, working with a CRF can be divided into three parts:
    1. Building the model
    2. Learning the parameters
    3. Prediction
  3. For the model, we use feature functions, which give the model the ability to represent the dependencies we need.
  4. For learning and prediction, the high-dimensional data leads to a large amount of computation, so some algorithms are needed to simplify it.

However, there are still some questions to explore:

  1. A CRF is a probabilistic graphical model. What is a probabilistic graphical model? How should we view a conditional random field from the probabilistic graphical model perspective? What kinds of problems can other probabilistic graphical models solve?
  2. Here we discussed the linear-chain conditional random field (linear-chain CRF). What do other forms of CRF look like, and what problems can they solve?
  3. For a linear-chain CRF, the forward-backward algorithm and the Viterbi algorithm are used to solve the computational problems in learning and prediction. What is the difference between them?
  4. We have now seen a CRF applied in NLP. In image processing, how does a CRF improve prediction results?

