A detailed explanation of the principles and applications of conditional random fields

Disclaimer: The article is reprinted from https://www.jianshu.com/p/55755fc649b1
 
The best way to understand conditional random fields is to illustrate them with a real-world example, but existing Chinese articles on conditional random fields rarely do this. Perhaps their authors are such experts that they don't bother with examples. So I translated this article, hoping it will help others.
The original text is here: http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/

If you would rather read the English original, just follow the link. I did not stick strictly to the original text when translating; I added my own understanding in many places, so this is a free translation. Okay, let's start!

Suppose you have many photos of your classmate Xiao Ming taken at different times of the day, from when he pulls his pants up to get out of bed to when he takes them off to go to sleep (Xiao Ming is obsessed with photos!). The task now is to classify these photos. For example, if some photos show eating, tag them "meal"; if some were taken while running, tag them "running"; if some were taken during a meeting, tag them "meeting". The question is: how would you do it?

A simple and intuitive approach is to ignore the chronological order of the photos and train a multiclass classifier. That is, use some labeled photos as training data, train a model, and classify each photo directly from its features. For example, if a photo was taken at 6:00 am and the picture is dark, tag it "sleeping"; if there is a car in the photo, tag it "driving".

Is this possible?

At first glance it works! But in reality our classifier will be flawed, because we have ignored an important piece of information: the temporal order of the photos. For example, suppose there is a photo of Xiao Ming with his mouth closed. How should we classify it? It is clearly hard to judge from the photo alone; we need to look at the photo that comes before it. If the previous photo shows Xiao Ming eating, then this closed-mouth photo is probably Xiao Ming chewing and about to swallow, and we can label it "eating". If the previous photo shows Xiao Ming singing, then this closed-mouth photo is probably a snapshot between notes, and we can label it "singing".

So, for our classifier to perform well, when classifying a photo we must take into account the labels of the photos adjacent to it. That is exactly where conditional random fields (CRFs) come into play!

Let's start with an example: the part-of-speech tagging problem

What is the part-of-speech tagging problem?

It is very simple: indicate the part of speech of each word in a sentence. For example, for the sentence "Bob drank coffee at Starbucks", labeling the part of speech of each word gives: "Bob (noun) drank (verb) coffee (noun) at (preposition) Starbucks (noun)".

Next, we use conditional random fields to solve this problem.

Take the sentence above as an example. It has 5 words, and we treat (noun, verb, noun, preposition, noun) as one label sequence, called l. There are many possible label sequences; for instance, l could also be (noun, verb, verb, preposition, noun). Among all these candidate label sequences, we have to pick the most plausible one as the annotation for this sentence.

How do we judge whether a label sequence is reliable or not?

Of the two label sequences shown above, the second is clearly less reliable than the first, because it labels both the second and third words as verbs, and a verb directly following a verb usually does not make sense in a sentence.

Suppose we give each label sequence a score: the higher the score, the more reliable the label sequence. Then we can at least say that any label sequence in which a verb is followed by another verb should receive a negative contribution to its score!

The "verb followed by a verb" pattern mentioned above is a feature function. We can define a whole set of feature functions, use this set to score a label sequence, and select the most reliable label sequence accordingly. That is, each feature function contributes a score for a label sequence, and the sum of the scores of all feature functions in the set for the same label sequence is the final score of that label sequence.

Defining feature functions in a CRF

Now let's formally define what a feature function in a CRF is. A feature function is a function that accepts four parameters:

  • the sentence s (the sentence whose parts of speech we want to tag)
  • i, the position of the current word in sentence s
  • l_i, the part of speech that the label sequence being scored assigns to the i-th word
  • l_{i-1}, the part of speech that the label sequence being scored assigns to the (i-1)-th word

Its output value is either 0 or 1: 0 means the label sequence being scored does not match this feature, and 1 means it does.

Note: here, our feature functions judge a label sequence only by the label of the current word and the label of the word before it. A CRF built this way is called a linear-chain CRF, which is the simplest case of a CRF. For simplicity, this article only considers linear-chain CRFs.
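To make this concrete, here is a minimal sketch in Python of what one such feature function might look like; the function name and the "ends in ly" rule are only illustrative assumptions (they anticipate the f1 example later in the article):

    # A linear-chain CRF feature function: it looks at the whole sentence s (a list
    # of words), the current position i, the current label l_i, and the previous
    # label l_prev, and returns 0 or 1.
    def ends_in_ly_is_adverb(s, i, l_i, l_prev):
        """Fires when the i-th word ends in "ly" and is labeled as an adverb."""
        return 1 if s[i].endswith("ly") and l_i == "ADVERB" else 0

    # Example usage: the feature fires for position 2 of "she sang loudly".
    print(ends_in_ly_is_adverb(["she", "sang", "loudly"], 2, "ADVERB", "VERB"))  # prints 1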

From feature functions to probabilities

After defining a set of feature functions, we need to assign a weight λ_j to each feature function f_j. Now, given a sentence s and a label sequence l, we can use the previously defined set of feature functions to score l.

    score(l|s) = Σ_j Σ_i λ_j f_j(s, i, l_i, l_{i-1})

There are two summations in the formula above. The outer summation adds up the contributions of each feature function f_j, and the inner summation adds up the feature values at each position i in the sentence.

By exponentiating and normalizing this score, we obtain the probability value p(l|s) of the label sequence l, as follows:

    p(l|s) = exp[score(l|s)] / Σ_{l'} exp[score(l'|s)]

where the sum in the denominator runs over every possible label sequence l' for the sentence s.
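As a rough sketch of how the two summations and the normalization fit together, the following Python code (with made-up helper names; none of it comes from the original article) computes score(l|s) and p(l|s) for an arbitrary list of feature functions and weights by brute-force enumeration of all label sequences:

    import math
    from itertools import product

    def score(s, labels, feature_functions, weights):
        """Outer sum over feature functions, inner sum over word positions."""
        total = 0.0
        for f, w in zip(feature_functions, weights):
            for i in range(len(s)):
                l_prev = labels[i - 1] if i > 0 else "START"  # assumed start label
                total += w * f(s, i, labels[i], l_prev)
        return total

    def probability(s, labels, tagset, feature_functions, weights):
        """p(l|s): exponentiate the score, normalize over every label sequence."""
        # Brute-force normalizer; real implementations use dynamic programming.
        z = sum(math.exp(score(s, seq, feature_functions, weights))
                for seq in product(tagset, repeat=len(s)))
        return math.exp(score(s, labels, feature_functions, weights)) / z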
Examples of several feature functions

We gave an example of a feature function earlier. Let's now look at a few concrete examples to build intuition.

    f1(s, i, l_i, l_{i-1}) = 1 if l_i = adverb and the i-th word ends in "ly"; otherwise f1 = 0

When l_i is an "adverb" and the i-th word ends with "ly", we let f1 = 1, otherwise f1 is 0. It is not difficult to imagine that the weight λ1 of the f1 feature function should be positive. And the larger λ1, the more likely we are to use those label sequences that label words ending in "ly" as "adverbs"

    f2(s, i, l_i, l_{i-1}) = 1 if i = 1, l_i = verb, and sentence s ends with "?"; otherwise f2 = 0

If i=1, l_i=verb, and the sentence s ends with "?", f2=1, otherwise f2=0. Likewise, λ2 should be positive, and the larger λ2, the more likely we are to use label sequences that label the first word of a question as "verb".

    f3(s, i, l_i, l_{i-1}) = 1 if l_{i-1} = preposition and l_i = noun; otherwise f3 = 0

When l_{i-1} is a preposition and l_i is a noun, f3 = 1; otherwise f3 = 0. λ3 should also be positive, and the larger λ3 is, the more we believe that a preposition should be followed by a noun.

    f4(s, i, l_i, l_{i-1}) = 1 if both l_{i-1} and l_i are prepositions; otherwise f4 = 0

If both l_i and l_{i-1} are prepositions, f4 = 1; otherwise f4 = 0. Here λ4 should be negative, and the larger the absolute value of λ4, the less plausible we consider label sequences in which a preposition is followed by another preposition.
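Written out in Python, the four feature functions above might look like the sketch below; the weight values are invented purely for illustration (note that Python indexes from 0, so the article's i = 1 corresponds to i == 0 here):

    def f1(s, i, l_i, l_prev):
        """Words ending in "ly" labeled as adverbs."""
        return 1 if l_i == "ADVERB" and s[i].endswith("ly") else 0

    def f2(s, i, l_i, l_prev):
        """The first word of a question labeled as a verb."""
        return 1 if i == 0 and l_i == "VERB" and s[-1].endswith("?") else 0

    def f3(s, i, l_i, l_prev):
        """A noun right after a preposition."""
        return 1 if l_prev == "PREPOSITION" and l_i == "NOUN" else 0

    def f4(s, i, l_i, l_prev):
        """A preposition right after a preposition."""
        return 1 if l_prev == "PREPOSITION" and l_i == "PREPOSITION" else 0

    # Illustrative weights: positive for patterns we like, negative for ones we don't.
    lambdas = [1.5, 0.8, 1.2, -2.0]   # λ1, λ2, λ3, λ4 (made-up values)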

Well, that is how a conditional random field is built. Let us summarize:
To build a conditional random field, we first define a set of feature functions. Each feature function takes as input the entire sentence s, the current position i, and the labels at positions i and i-1. We then assign a weight to each feature function. For each label sequence l, we take the weighted sum of all the feature functions and, if necessary, convert that sum into a probability value.
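Putting the summary into code, here is a toy end-to-end sketch for the example sentence; it restates two of the feature functions so that it runs on its own, and all weights are made up for illustration. It scores the two label sequences discussed earlier and normalizes over every candidate sequence:

    import math
    from itertools import product

    sentence = ["Bob", "drank", "coffee", "at", "Starbucks"]
    tagset = ["NOUN", "VERB", "PREPOSITION"]

    def prep_then_noun(s, i, l_i, l_prev):      # reward preposition -> noun
        return 1 if l_prev == "PREPOSITION" and l_i == "NOUN" else 0

    def verb_then_verb(s, i, l_i, l_prev):      # punish verb -> verb
        return 1 if l_prev == "VERB" and l_i == "VERB" else 0

    feature_functions = [prep_then_noun, verb_then_verb]
    weights = [1.0, -2.0]                        # made-up weights

    def score(labels):
        return sum(w * f(sentence, i, labels[i], labels[i - 1] if i > 0 else "START")
                   for f, w in zip(feature_functions, weights)
                   for i in range(len(sentence)))

    # Normalize over every candidate label sequence to turn scores into p(l|s).
    z = sum(math.exp(score(c)) for c in product(tagset, repeat=len(sentence)))

    good = ("NOUN", "VERB", "NOUN", "PREPOSITION", "NOUN")
    bad  = ("NOUN", "VERB", "VERB", "PREPOSITION", "NOUN")
    print(score(good), math.exp(score(good)) / z)   # higher score, higher p(l|s)
    print(score(bad),  math.exp(score(bad)) / z)    # penalized for verb after verb

With only two toy features, the absolute probabilities are not meaningful; the point is simply that the "verb followed by verb" sequence scores lower, exactly as the reasoning above says it should.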

Comparison of CRF and Logistic Regression
Observe the formula:
    p(l|s) = exp[ Σ_j Σ_i λ_j f_j(s, i, l_i, l_{i-1}) ] / Σ_{l'} exp[ Σ_j Σ_i λ_j f_j(s, i, l'_i, l'_{i-1}) ]

Doesn't it look a lot like logistic regression?
In fact, a conditional random field is the sequence version of logistic regression: logistic regression is a log-linear model for classification, while a conditional random field is a log-linear model for sequence labeling.
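To make the analogy concrete, one common way to write multinomial logistic regression in the same feature-function notation is:

    p(y|x) = exp( Σ_j λ_j f_j(x, y) ) / Σ_{y'} exp( Σ_j λ_j f_j(x, y') )

Replacing the single label y with a label sequence l, and summing each feature over every position in the sentence, gives exactly the CRF formula above.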

Comparison of CRF and HMM

The part-of-speech tagging problem can also be solved with an HMM. The idea of the HMM is generative: it models the joint probability of the sentence s and the label sequence l, as shown below:

    p(l, s) = p(l_1) Π_i p(l_i | l_{i-1}) p(w_i | l_i)

Here:
p(l_i|l_{i-1}) is the transition probability. For example, if l_{i-1} is a preposition and l_i is a noun, then p represents the probability that the word following a preposition is a noun.
p(w_i|l_i) is the emission probability. For example, if l_i is a noun and w_i is the word "ball", then p represents the probability that a noun emits the word "ball".
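As a small sketch of how these two kinds of probabilities combine, the Python snippet below computes the HMM joint probability for the example sentence; every probability value in the tables is made up purely for illustration:

    # Made-up transition and emission tables for a toy HMM (values are illustrative only).
    transition = {("START", "NOUN"): 0.4, ("NOUN", "VERB"): 0.3, ("VERB", "NOUN"): 0.3,
                  ("NOUN", "PREPOSITION"): 0.2, ("PREPOSITION", "NOUN"): 0.5}
    emission = {("NOUN", "Bob"): 0.01, ("VERB", "drank"): 0.02, ("NOUN", "coffee"): 0.02,
                ("PREPOSITION", "at"): 0.3, ("NOUN", "Starbucks"): 0.005}

    def hmm_joint_probability(words, labels):
        """p(l, s) = p(l_1) * prod_i p(l_i | l_{i-1}) * p(w_i | l_i)."""
        p = 1.0
        prev = "START"  # p(l_1) is treated here as a transition out of a START state
        for word, label in zip(words, labels):
            p *= transition.get((prev, label), 1e-6) * emission.get((label, word), 1e-6)
            prev = label
        return p

    words = ["Bob", "drank", "coffee", "at", "Starbucks"]
    labels = ["NOUN", "VERB", "NOUN", "PREPOSITION", "NOUN"]
    print(hmm_joint_probability(words, labels))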

So, how do HMM and CRF compare?
The answer is that the CRF is much more powerful than the HMM: it can solve every problem the HMM can solve, and many problems the HMM cannot. In fact, if we take the logarithm of the HMM model above, it becomes the following:

    log p(l, s) = log p(l_1) + Σ_i log p(l_i | l_{i-1}) + Σ_i log p(w_i | l_i)

We compare this formula with that of CRF:

    score(l|s) = Σ_j Σ_i λ_j f_j(s, i, l_i, l_{i-1})

It is not hard to see that if we regard the log probabilities in the first (HMM) formula as the weights of the feature functions in the second (CRF) formula, the two formulas have the same form.

In other words, we can construct a CRF that is identical to the logarithmic form of the HMM. How do we construct it?

For each transition probability p(l_i = y | l_{i-1} = x) in the HMM, we can define a feature function like this:

    f_{x,y}(s, i, l_i, l_{i-1}) = 1 if l_i = y and l_{i-1} = x; otherwise f_{x,y} = 0

This feature function equals 1 only when l_i = y and l_{i-1} = x. Its weight is as follows:

    w_{x,y} = log p(l_i = y | l_{i-1} = x)

Similarly, for each emission probability in the HMM, we can define a corresponding feature function and set its weight to the log of that emission probability.

The p(l|s) computed with feature functions and weights of this form is essentially the same as the one given by the log form of the HMM model!
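As a rough sketch of this construction, the code below turns a toy HMM's transition and emission probabilities into indicator feature functions whose weights are the corresponding log probabilities; the probability tables are invented for illustration:

    import math

    # Made-up HMM parameters (illustrative only).
    transition = {("PREPOSITION", "NOUN"): 0.5, ("NOUN", "VERB"): 0.3}
    emission = {("NOUN", "ball"): 0.001, ("VERB", "drank"): 0.02}

    feature_functions, weights = [], []

    # One indicator feature per transition probability, weighted by its log.
    for (x, y), p in transition.items():
        def f(s, i, l_i, l_prev, x=x, y=y):
            return 1 if l_prev == x and l_i == y else 0
        feature_functions.append(f)
        weights.append(math.log(p))

    # One indicator feature per emission probability, weighted by its log.
    for (y, w), p in emission.items():
        def g(s, i, l_i, l_prev, y=y, w=w):
            return 1 if l_i == y and s[i] == w else 0
        feature_functions.append(g)
        weights.append(math.log(p))

    # Scoring a label sequence with these features reproduces the log form of the
    # HMM (up to the initial term p(l_1)), which is why every HMM is a special
    # case of a CRF.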

In one sentence, the relationship between the HMM and the CRF is this:
every HMM model is equivalent to some CRF.

However, CRFs are more powerful than HMMs, for two main reasons:

  • CRFs can define a much larger and more varied set of feature functions. The HMM model is inherently local: in an HMM, the current word depends only on the current label, and the current label depends only on the previous label. This locality restricts the HMM to feature functions of the corresponding type, like the ones we constructed above. A CRF, however, can look at the entire sentence s to define more global feature functions, such as this one:
  f2(s, i, l_i, l_{i-1}) = 1 if i = 1, l_i = verb, and sentence s ends with "?"; otherwise f2 = 0
  If i=1, l_i=verb, and the sentence s ends with "?", f2=1, otherwise f2=0.

  • CRFs can use arbitrary weights. When the log form of the HMM is viewed as a CRF, the weight of each feature function is the log of a probability, so it is less than or equal to 0, and the probabilities themselves must satisfy the corresponding constraints, such as
    0 ≤ p(l_i = y | l_{i-1} = x) ≤ 1 and Σ_y p(l_i = y | l_{i-1} = x) = 1
    But in CRF, the weight of each feature function can be any value, without these restrictions.
   
 
 
 






