Part-of-Speech Tagging: Understanding the Viterbi Algorithm (2/2)

1. Description

        I have long wanted to write an article on Hidden Markov Models and the Viterbi algorithm, and I recently found an article that neatly summarizes the characteristics of Hidden Markov Models.

        Hidden Markov Model (HMM) has the following characteristics:

  1. HMM is a directed graphical model consisting of hidden states and observed states, where the hidden states are not directly observable and can only be inferred indirectly through the observations.

  2. HMM is sequential, that is, the transition probability of each state is related to the previous state, so it can be used for time series analysis and prediction.

  3. HMM is a generative model that can be used to solve problems such as sequence classification, sequence matching, and sequence generation.

  4. HMM parameters can be estimated with the EM algorithm, which gives a principled way to fit and train the model.

  5. HMM is a model with strong flexibility, which can be transformed according to needs, such as adding states, introducing state constraints, and so on.

2. The problem

        Given a state diagram and a sequence of N observations over time steps t0, t1, t2, ..., tN, we need to tell the state of the baby at the current point in time. That is, we want to find out whether Peter is awake or asleep at time tN+1, or rather, which of the two states is more likely at that time.

        If these sound like Greek to you, read the previous post for a review of Markov Chain Models, Hidden Markov Models, and Part-of-Speech Tagging.

The state map Peter's mom gave you before leaving.

        In the previous article, we briefly modeled the problem of part-of-speech tagging using Hidden Markov Models.

        The question of whether Peter is asleep is just an example question posed to better understand some of the core concepts covered in these two articles. At the core of these articles is the use of Hidden Markov Models to solve the problem of part-of-speech tagging.

        So, before moving on to the Viterbi algorithm, let's look at a more detailed explanation of how to model the tagging problem using HMMs.

3. Generative models and the noisy channel model

        Many problems in natural language processing are solved using supervised learning methods.

        Supervised problems in machine learning are defined as follows. We assume training examples (x(1), y(1)), ..., (x(m), y(m)), where each example consists of an input x(i) paired with a label y(i). We use X to refer to the set of possible inputs and Y to refer to the set of possible labels. Our task is to learn a function f: X → Y that maps any input x to a label f(x).

        In the tagging problem, each x(i) is a sequence of words x1 x2 x3 ... xn(i), and each y(i) is a sequence of tags y1 y2 y3 ... yn(i) (we use n(i) to refer to the length of the ith training example). X refers to the set of all word sequences x1 ... xn, and Y to the set of all tag sequences y1 ... yn. Our task is to learn a function f: X → Y that maps sentences to sequences of tags.

        An intuitive way to obtain estimates for this problem is to use the conditional probability p(y | x), the probability of output y given input x. The parameters of the model are estimated from the training samples. Finally, given an unknown input x, we wish to find

f(x) = arg max p(y | x)  over y ∊ Y

        This is a conditional model for solving this general problem given training data. Another approach, widely employed in machine learning and natural language processing, is the use of generative models.

        In generative models, instead of directly estimating the conditional distribution p(y | x), we model the joint probability p(x, y) over all (x, y) pairs.

We can further decompose the joint probability into simpler terms: p(x, y) = p(y) · p(x | y), where:

  • p(y) is the prior probability of any input belonging to label y.
  • p(x | y) is the conditional probability of the input x given the label y.

We can use this decomposition and Bayes' rule to determine conditional probabilities.

Remember, we want to estimate the function

f(x) = arg max p(y | x)  over y ∊ Y
     = arg max p(y) · p(x | y) / p(x)
     = arg max p(y) · p(x | y)

        The reason we can drop the denominator p(x) is that it stays the same no matter which output label is considered. From a computational point of view, it is therefore treated as a normalization constant and is usually ignored.

        Models that decompose the joint probability into the terms p(y) and p(x | y) are often called noisy channel models. Intuitively, when we see a test example x, we assume it was generated in two steps:

  1. First, a label y is chosen with probability p(y).
  2. Second, the example x is generated from the distribution p(x | y). The model p(x | y) can be interpreted as a "channel" that takes a label y as its input and corrupts it to produce x as its output.

4. The generative part of the part-of-speech tagging model

        Let us assume a finite set of words V and a finite set of tags K. Then the set S will be the set of all sequence/tag-sequence pairs <x1, x2, x3, ..., xn, y1, y2, y3, ..., yn> such that n > 0, every xi ∊ V, and every yi ∊ K.

        A generative tagging model is then a function p that assigns a probability p(<x1, x2, ..., xn, y1, y2, ..., yn>) ≥ 0 to every pair in S, with these probabilities summing to 1 over S.

        Given a generative tagging model, the function from inputs to outputs that we discussed earlier becomes

f(x1 ... xn) = arg max p(<x1 ... xn, y1 ... yn>)  over all tag sequences y1 ... yn

        Thus, for any given input sequence of words, the output is the tag sequence with the highest probability under the model. After defining the generative model, we need to figure out three different things:

  1. How exactly do we define the generative model probability p(<x1, x2, x3, ..., xn, y1, y2, y3, ..., yn>)?
  2. How do we estimate the parameters of the model?
  3. How do we efficiently calculate arg max p(<x1, ..., xn, y1, ..., yn>) over all tag sequences y1 ... yn?

        Let's see how to answer these three questions side-by-side, once for our example question and then for the actual problem at hand: part-of-speech tagging.

5. Define the generative model

        Let's first look at how to estimate the probability p(x1 ... xn, y1 ... yn) using HMMs.

       We can have any N-gram HMM that considers events in the previous window of size N.

       The formula provided below corresponds to a trigram Hidden Markov Model.

5.1 Trigram Hidden Markov Model

        A trigram Hidden Markov Model is defined using

  • A finite set of states (the tags).
  • A sequence of observations.
  • The transition probability q(s | u, v), defined as the probability of the state s occurring immediately after the states u and v in the state sequence.
  • The emission probability e(x | s), defined as the probability of observing x in state s.

Then, the generative model probability is estimated as
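        In the usual trigram-HMM notation, where y-1 = y0 = * are two start symbols and yn+1 = STOP marks the end of the tag sequence, this estimate is commonly written as

p(x1 ... xn, y1 ... yn+1) = Π q(yi | yi-2, yi-1) × Π e(xi | yi)

where the first product runs over i = 1 ... n+1 and the second over i = 1 ... n.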

        As for the baby sleep problem we're considering, we only have two possible states: the baby is either awake or asleep. The observer is likewise limited to two possible observations over time: the room either has noise coming in or it is absolutely quiet. The sequence of observations and states can be represented as follows:

        Observations and states of the baby sleep problem over time

        Turning to the part-of-speech tagging problem, the states will be the actual tags assigned to the words, and the words will be our observations. The reason we say the tags are our states is that in a Hidden Markov Model the states are always hidden, and all we have is the set of observations visible to us. Along similar lines, the state and observation sequence for the part-of-speech tagging problem would be

        Observations and states of the POS tagging problem over time

6. Estimate the parameters of the model

        Let's assume we have access to some training data. The training data consists of a set of examples, where each example is a sequence of observations, and each observation is associated with a state. Given this data, how do we estimate the parameters of the model?

        Estimating the parameters of the model is done by reading various counts from the training corpus we have, and then computing the maximum likelihood estimate:

        Transition probability and emission probability of the trigram HMM
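        In symbols, using the counts described just below, the maximum likelihood estimates are the usual count ratios:

q(s | u, v) = c(u, v, s) / c(u, v)
e(x | s) = c(s → x) / c(s)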

        We already know that the first term represents the transition probability and the second term represents the emission probability. Let's see what the four different counts in the above terms mean.

  1. c(u, v, s) is the trigram count for states u, v, and s, i.e. the number of times the three states u, v, and s occur together in that order in the training corpus.
  2. c(u, v) follows the same idea as the trigram count: it is the bigram count of states u and v in the training corpus.
  3. c(s → x) is the number of times the state s and the observation x are paired with each other in the training set. Finally,
  4. c(s) is the number of times the state s appears in the training corpus.

        Let's first look at the sample training set for the toy problem and use it to calculate the transition and emission probabilities.

        Blue marks the transition probability calculations and red marks the emission probability calculations.

        Note that the calculations below for the example problem use a bigram HMM instead of a trigram HMM.

        Peter's mother kept records of observations and states, so she can even give you a training corpus from which to derive the transition and emission probabilities.

6.1 Examples of transition probabilities:

        training corpus

        Calculation of the transition probability of "awake" occurring after "awake"

6.2 Examples of emission probabilities:

training corpus

Calculation of the emission probability of observing "quiet" when the state is "awake"

        This is easy because the training set is very small. Let's look at an example training set for the practical problem of part-of-speech tagging. Here we can consider a trigram HMM, and we will show the computations accordingly.

        We will use the following sentences as a corpus of training data (token word/TAG means words tagged with a specific part-of-speech tag).

        The training set we have is a corpus of tagged sentences. Each sentence consists of words tagged with their corresponding part-of-speech tags. For example, eat/VB means the word is "eat" and its part-of-speech tag in this context is "VB", a verb tag. Let's look at an example calculation of transition probabilities and emission probabilities, as we did for the baby sleep problem.

6.3 Transition Probability

        Suppose we want to calculate the transition probability q(IN | VB, NN). For this, we count how many times we see the trigram (VB, NN, IN) in the training corpus in that particular order, and then divide that by the total number of times we see the bigram (VB, NN) in the corpus.
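In count form, that is:

q(IN | VB, NN) = c(VB, NN, IN) / c(VB, NN)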

6.4 Emission probability

        Suppose we want to find out the emission probability e(an|DT). To do this, we see how many times the word "an" has been tagged as "DT" in the corpus, and divide that by the total number of times we've seen the tag "DT" in the corpus.
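In count form:

e(an | DT) = c(DT → an) / c(DT)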

        So if you look at the calculations, it shows that computing the parameters of the model is not computationally expensive. That is, we don't have to make multiple passes through the training data to compute these parameters. All we need is a bunch of distinct counts, and a single pass over the training corpus should give us that.
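        As a rough sketch of that single pass in Python (assuming, purely for illustration, that the corpus is given as a list of sentences, each a list of (word, tag) pairs):

from collections import defaultdict

def collect_counts(tagged_sentences):
    # One pass over a corpus of [(word, tag), ...] sentences, accumulating
    # every count needed for the trigram HMM parameters.
    trigram_counts = defaultdict(int)   # c(u, v, s)
    bigram_counts = defaultdict(int)    # c(u, v)
    emission_counts = defaultdict(int)  # c(s -> x)
    tag_counts = defaultdict(int)       # c(s)

    for sentence in tagged_sentences:
        words = [word for word, tag in sentence]
        # Pad the tag sequence with two start symbols and a stop symbol.
        tags = ["*", "*"] + [tag for word, tag in sentence] + ["STOP"]

        for word, tag in zip(words, tags[2:]):
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1

        for u, v, s in zip(tags, tags[1:], tags[2:]):
            trigram_counts[(u, v, s)] += 1
            bigram_counts[(u, v)] += 1

    return trigram_counts, bigram_counts, emission_counts, tag_counts

The maximum likelihood parameters are then just ratios of these dictionaries, e.g. q(s | u, v) = trigram_counts[(u, v, s)] / bigram_counts[(u, v)].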

        Let's move on to the last thing we need for a given generative model: efficiently calculating arg max p(x1 ... xn, y1 ... yn) over all tag sequences y1 ... yn.

        We will study the famous Viterbi algorithm for this calculation.

7. Finding the most likely sequence: the Viterbi algorithm

        Finally, we'll tackle the problem of finding the most likely sequence of tags given a set of observations x1 ... xn. That is, we will find the tag sequence y1 ... yn that maximizes p(x1 ... xn, y1 ... yn).

        The probabilities here are expressed using the transition and emission probabilities we learned to calculate in the previous section of this article. Just to remind you, the probability of a sequence of tags given a sequence of observations over n time steps is the product of those transition and emission terms, as in the trigram HMM formula above.

        Before looking at an optimization algorithm to solve this problem, let's look at a simple brute force approach to this problem. Basically, we need to find the most probable label sequence for a given set of observations from a finite set of possible label sequences. Let's look at the total number of possible sequences for the example question and a small example of the part-of-speech tagging question.

        Suppose we have the following set of observations for the example problem.

Noise     Quiet     Noise

We have two possible labels {asleep and awake}. Some possible label sequences for the above observations are:

Awake      Awake     Awake
Awake      Awake     Asleep
Awake      Asleep    Awake
Awake      Asleep    Asleep

        In total, we can have 2³ = 8 possible sequences. This may not seem like a lot, but if we increase the number of observations over time, the number of sequences will grow exponentially. This is the case when we only have two possible labels. What if we had more? The same is the case with part-of-speech tags.

For example, consider the sentence

the dog barks

        Assuming the set of possible labels is {D, N, V}, let's look at some possible sequences of labels:

D     D     D
D     D     N
D     D     V
D     N     D
D     N     N
D     N     V ... etc

        Here we will have 3³ = 27 possible tag sequences. And this is for a very short sentence with a small tag set. In practice, sentences are much longer than three words and the number of distinct tags is far higher, so enumerating every sequence to find the optimal one is not feasible.

        So the exponential growth in the number of sequences means that for sentences of any reasonable length, the brute force approach won't work because it takes too much time to execute.

        Instead of this brute force approach, we will see that we can efficiently find the highest possible sequence of labels using a dynamic programming algorithm called the Viterbi algorithm.

        Let's first define some terms that are useful for defining the algorithm itself. We already know that the probability of a tag sequence given a set of observations can be defined in terms of transition probabilities and emission probabilities. Mathematically, it is

        Let's look at a truncated version of this, r(y1 ... yk), which keeps only the first k transition and emission terms of that product.

        Let's call this the cost of a sequence of length k.

        So the definition of r just considers the first k terms of the probability definition, where k ∊ {1..n}, for any tag sequence y1 ... yk.

        Next we have the set S(k, u, v), which is the set of all tag sequences of length k ending in the bigram (u, v), i.e. with yk-1 = u and yk = v.

        Finally, we define the term π(k, u, v), which is the maximum cost (probability) over all sequences in S(k, u, v).

        The main idea behind the Viterbi algorithm is that we can efficiently compute the values π(k, u, v) in a recursive, memoized manner. To define the algorithm recursively, let's look at the base case of the recursion.

π(0, *, *) = 1
π(0, u, v) = 0 for all other (u, v)

        Since we are considering a trigram HMM, we will consider all trigrams as part of the execution of the Viterbi algorithm.

        Now, we could start the first trigram window at the first three words of the sentence, but then the model would miss trigrams in which the first word or the first two words occur on their own. For this reason, we add two special start symbols "*", so our sentence becomes

*    *    x1   x2   x3   ......         xn

The first trigram we consider is (*, *, x1) and the second is (*, x1, x2).

Now that we have all the terms, we can finally look at the recursive definition of the algorithm, which is basically the heart of the algorithm.
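        In the notation defined above, where xk is the k-th observation and K is the set of possible tags, this recurrence is commonly written as

π(k, u, v) = max over w ∊ K of ( π(k-1, w, u) × q(v | w, u) × e(xk | v) )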

        This definition is clearly recursive: to compute the π term at position k, the recurrence uses π terms at the lower position k − 1.

        Each sequence will end with a special STOP symbol. For trigram models, we also have the two special start symbols "*" at the beginning.

        Take a look at the pseudocode for the entire algorithm.

        The algorithm first fills in the π(k, u, v) values using the recursive definition. It then uses the identity described earlier to calculate the highest probability for any sequence.

        The running time of the algorithm is O(n|K|³), so it is linear in the length of the sequence and cubic in the number of labels.
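        A minimal Python sketch of this dynamic program, assuming the transition and emission probabilities are supplied as functions q(v, w, u) ≈ q(v | w, u) and e(x, v) ≈ e(x | v) (names chosen here only for illustration):

def viterbi_trigram(words, all_tags, q, e):
    # Return the probability of the best tag sequence for `words` under a
    # trigram HMM with transitions q(v, w, u) = q(v | w, u) and emissions
    # e(x, v) = e(x | v).
    n = len(words)

    # Allowed tags at each position; positions -1 and 0 hold the start symbol.
    def K(k):
        return ["*"] if k in (-1, 0) else all_tags

    # Base case: the empty prefix ending in the bigram (*, *).
    pi = {(0, "*", "*"): 1.0}

    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                # Best probability of any length-k prefix ending in (u, v).
                pi[(k, u, v)] = max(
                    pi[(k - 1, w, u)] * q(v, w, u) * e(words[k - 1], v)
                    for w in K(k - 2)
                )

    # Fold in the transition to STOP and take the best final bigram.
    return max(
        pi[(n, u, v)] * q("STOP", u, v)
        for u in K(n - 1)
        for v in K(n)
    )

A version that also stores backpointers, so the winning tag sequence itself can be recovered, is sketched later for the bigram case.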

        Note: We will only show calculations for the bigram HMM-based versions of the baby sleep problem and the part-of-speech tagging problem. The trigram computation is left to the reader. The code appended at the end of this article is, however, based on a trigram HMM. It's just that the Viterbi computation is easier to explain and picture with a bigram HMM than with a trigram HMM.

        So, before showing the computation of the Viterbi algorithm, let's look at the recursive formulation based on bigram HMMs.
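        In the bigram case, the recurrence keeps only the single previous tag:

π(k, v) = max over u ∊ K of ( π(k-1, u) × q(v | u) × e(xk | v) )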

        This is very similar to the trigram model we saw earlier, except that now we only condition on the current tag and the previous tag instead of the previous two tags. The complexity of the algorithm now becomes O(n|K|²).

7.1 Calculations for the baby sleep problem

        Now that we have the recursive formulation ready for the Viterbi algorithm, let's look first at an example of the same computation for the example problem we had (namely the baby sleeping problem), and then at the part-of-speech tagged version.

        Note that at this step, computing the Viterbi algorithm to find the most probable sequence of tags given a set of observations over a series of time steps, we assume that the transition and emission probabilities have already been calculated. Let's look at sample transition and emission probabilities for the baby sleep problem, which we will use in the Viterbi calculation.

        The baby starts out awake and we observe the room for three time steps, t1 to t3 (three iterations of the Markov chain). The observations are: quiet, quiet, noise. Take a look at the figure below, which shows the calculation up to two time steps. The complete graph with all the final values is then displayed.

        For simplicity, we have not shown the calculations for the "sleep" state for k = 2 and the calculations for k = 3 in the figure above.

        Now that we have all these calculations done, we want to find the most likely sequence of states the baby could be in at the different time steps. So, for k = 2 and the awake state, we want to know the most likely state at k = 1 that transitioned into awake at k = 2. (k = 2 represents a sequence of states of length 3 starting from time step 0, and t = 2 means the state at time step 2. We are given the state at t = 0, which is awake.)

        Obviously, if the state at time step 2 is awake, then the state at time step 1 will also be awake, as the calculation shows. Thus, the Viterbi algorithm not only helps us find the π(k, u, v) values, i.e. the maximum cost for all sequences, using dynamic programming, but also helps us recover the most probable sequence of states given a start state and a sequence of observations, by storing back pointers. The algorithm and pseudocode for storing the back pointer are given below.
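        As one way to picture that bookkeeping, here is a bigram-HMM sketch with backpointers applied to the sleep/awake example. The transition and emission numbers are illustrative stand-ins, not the values from the article's figures:

def viterbi_bigram(observations, states, start, q, e):
    # Bigram Viterbi with backpointers.
    # q[(u, v)] = p(v | u), e[(s, o)] = p(o | s), `start` is the known initial state.
    pi = [{start: 1.0}]
    backpointer = [{}]

    for k, obs in enumerate(observations, start=1):
        pi.append({})
        backpointer.append({})
        for v in states:
            # Best previous state u for ending in state v at step k.
            best_u, best_p = max(
                ((u, pi[k - 1][u] * q[(u, v)] * e[(v, obs)]) for u in pi[k - 1]),
                key=lambda pair: pair[1],
            )
            pi[k][v] = best_p
            backpointer[k][v] = best_u

    # Follow the backpointers from the best final state.
    last = max(pi[-1], key=pi[-1].get)
    path = [last]
    for k in range(len(observations), 1, -1):
        path.append(backpointer[k][path[-1]])
    path.reverse()
    return path, pi[-1][last]

# Illustrative probabilities only, not the article's figures.
states = ["awake", "asleep"]
q = {("awake", "awake"): 0.6, ("awake", "asleep"): 0.4,
     ("asleep", "awake"): 0.3, ("asleep", "asleep"): 0.7}
e = {("awake", "quiet"): 0.4, ("awake", "noise"): 0.6,
     ("asleep", "quiet"): 0.9, ("asleep", "noise"): 0.1}

print(viterbi_bigram(["quiet", "quiet", "noise"], states, "awake", q, e))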

7.2 Calculations for the part-of-speech tagging problem

        Let's look at a slightly larger corpus for part-of-speech tagging and the corresponding Viterbi graph, showing the calculations and backpointers of the Viterbi algorithm.

        Here is the corpus we will consider:

        Now look at the transition probabilities computed from this corpus.

        Here, q0 → VB denotes the probability of a sentence starting with the label VB, i.e. the first word of the sentence is labeled as VB. Likewise, q0 → NN denotes the probability of a sentence beginning with the label NN. Note that out of 10 sentences in the corpus, 8 start with NN and 2 start with VB, hence the corresponding transition probabilities.

        As for the emission probabilities, ideally we should look at all combinations of tags and words in the corpus. Since that is too much, we will only consider the emission probabilities for the sentence used in the Viterbi calculation.

Time flies like an arrow 

The emission probabilities for the above sentence are:

Finally, we are ready to look at the calculations for the given sentence, transition probabilities, emission probabilities, and the given corpus.

        So, is this what the Viterbi algorithm is all about? Take a look at the example below.

        The buckets below each word are filled with the possible tags seen next to that word in the training corpus. A given sentence can have multiple combinations of tags, depending on the path we take. But there is a problem! Can you guess what it is?

All combinations of sequence paths

Can you figure it out?

No?

Let me tell you what it is.

There may be some paths in the computational graph for which we don't have transition probabilities. Therefore, our algorithm can discard that path and take another path.

        In the above diagram, we discard the paths marked in red because we do not have q(VB|VB): the training corpus never has VB followed by VB. Therefore, we end up with q(VB|VB) = 0 in the Viterbi calculation. If you've been following the algorithm closely, you'll see that a single 0 in the calculation makes the entire probability, or maximum cost, of that tag sequence 0.

        However, this means that we ignore combinations not seen in the training corpus.

        Is this the correct way to handle real world examples?

        Consider a small adjustment in the sentence above.

Time flies like an arrow

        In this sentence, we don't have any alternative paths. Even though we can compute the Viterbi probabilities up to the word "like", we cannot move forward from there, since both q(VB|VB) = 0 and q(VB|IN) = 0. What should we do now?

        The corpus we consider here is very small. Take any reasonably sized corpus with a large number of words, and we have a major data sparsity problem. Take a look below.

Source:  http://www.cs.pomona.edu/~kim/CSC181S08/lectures/Lec6/Lec6.pdf

        This means we could potentially have 6.8 billion bigrams, but the number of words in the corpus is less than a billion. That leaves a huge number of zero transition probabilities to fill in. The data sparsity problem is even more severe if we consider trigrams.

        To address this data sparsity problem, we employ a solution called smoothing.

8. Smoothing

        The idea behind smoothing is this:

  1. Discounting — existing (seen) probability values are reduced somewhat
  2. Reallocation — the freed-up probability mass is given to events whose probability would otherwise be zero

        In this way, we assign small non-zero probabilities to otherwise unseen transition combinations. Let's consider a very simple smoothing technique called Laplace smoothing.

        Laplace smoothing is also known as add-one smoothing; later you'll see exactly why it bears that name. Let's modify how the parameters of the trigram HMM model are computed from the training corpus.

Values that can go wrong here are

  1. c(u, v, s) is 0
  2. c(u, v) is 0
  3. We get an unknown word in the test sentence, and we don't have any training labels associated with it.

All of these can be resolved with smoothing. Therefore, the Laplace-smoothed estimates become
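In the usual add-λ form (with λ and V as described just below), that is:

q(s | u, v) = ( c(u, v, s) + λ ) / ( c(u, v) + λ · V )

and the emission probabilities can be smoothed analogously.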

        Here V is the total number of tags in our corpus and λ is a real value between 0 and 1, acting like a discount factor. A value of λ = 1 would redistribute too much probability mass to unseen events. For example:

        For λ = 1, unseen trigrams are given too much weight, which is why a modified version of Laplace smoothing is used in practice. The value of the discount factor varies by application.

        Note that λ = 1 only creates problems when the vocabulary size is too large. For smaller corpora, λ = 1 will give us a good starting performance.

        One thing to note about Laplace smoothing is that it is a uniform redistribution: all previously unseen trigrams get equal probability. So, suppose we get some data and we observe that

  • The frequency of the trigram <give, the, thing> is zero
  • The frequency of the trigram <give, the, think> is also zero
  • A uniform distribution over unseen events means:
    P(thing | give, the) = P(think | give, the)

        Does this reflect what we know about English usage?
        Ideally P(thing | give, the) > P(think | give, the), but a uniform distribution from Laplace smoothing does not take this into account.

This means that the millions of unseen trigrams in a huge corpus all get equal probability when they enter our calculations. That is probably not the right thing to do, though it is still better than assigning them probability 0, which causes those trigrams, and eventually some paths in the Viterbi graph, to be ignored entirely. But it still needs work and improvement.

        However, there are many different types of smoothing techniques that improve upon the basic Laplacian smoothing technique and help overcome the problem of uniform distribution of probabilities. Some of these techniques are:

  • Good-Turing estimation
  • Jelinek-Mercer smoothing (interpolation)
  • Katz smoothing (backoff)
  • Witten-Bell smoothing
  • Absolute discounting
  • Kneser-Ney smoothing

        To read more about these different types of smoothing techniques in more detail, see this tutorial . Which smoothing technique to choose depends largely on the type of application at hand, the type of data being considered, and the size of the dataset.

        If you have been following this lengthy article, I must say

Source:  https://sebreg.deviantart.com/art/You-re-Kind-of-Awesome-289166787

        Let's go ahead and look at a slight optimization we can make to the Viterbi algorithm that reduces the number of calculations and also makes sense for many datasets out there.

        But, before doing that, look again at the algorithm's pseudocode.

If we look closely, we can see that for each trigram of words we are considering all possible sets of tags. That is, if the number of tags is |K|, then we are considering |K|³ combinations for every trigram of the test sentence.

        Ignore trigrams for now and just consider a single word. In the above algorithm we would consider all unique tags for that word. Say we have a corpus in which the word "kick" is associated with only two tags, {NN, VB}, while the total number of unique tags in the training corpus is about 500 (a huge tag set).

        Now the problem is obvious: we may end up assigning a tag that makes no sense for the word under consideration, simply because some trigram ending in that tag has a very high transition probability. It is also computationally wasteful to consider all 500 tags for the word "kick" when it only ever occurs with two unique tags in the entire corpus.

        So the optimization we make is that, for each word, we only consider the tags it occurs with in the corpus, instead of all the unique tags in the corpus.

        This works because, for a fairly large corpus, a given word will ideally have appeared with all (or at least most) of the tags it can take. It is then reasonable to restrict the Viterbi algorithm to just those tags.
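        A minimal sketch of this restriction, reusing the corpus format assumed in the earlier counting sketch:

from collections import defaultdict

def build_tag_dictionary(tagged_sentences):
    # Map each word to the set of tags it was actually seen with in training.
    tag_dict = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_dict[word].add(tag)
    return tag_dict

def candidate_tags(word, tag_dict, all_tags):
    # Tags the Viterbi search considers for `word`: only the tags seen with it,
    # falling back to the full tag set for unknown words.
    return tag_dict.get(word, all_tags)

Inside the Viterbi loops, K(k) then returns candidate_tags(words[k - 1], tag_dict, all_tags) instead of the full tag set.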

        As far as the Viterbi decoding algorithm is concerned, the worst-case complexity remains the same: in the worst case each word occurs with every unique tag in the corpus, so the complexity stays O(n|K|³) for the trigram model and O(n|K|²) for the bigram model.

        For a recursive implementation of the code, see

        The recursive implementation is done with Laplace smoothing. For an iterative implementation, see

        This implementation is done using a one-count smoothing technique, which gives better accuracy than Laplace smoothing.

        Many snapshots of the formulas and calculations in both articles are derived from here .


Source: blog.csdn.net/gongdiwudu/article/details/132206899