[Study Notes] [Machine Learning] 12. [Part 1] HMM: Hidden Markov Models (Markov chains, the three types of HMM problems, the forward and backward algorithms, the Viterbi algorithm, the Baum-Welch algorithm, API and examples)

learning target:

  • Understand what a Markov chain is
  • Know what an HMM model is
  • Know the forward-backward algorithm for evaluating observation sequence probabilities
  • Know the Viterbi algorithm for decoding hidden state sequences
  • Understand the Baum-Welch algorithm
  • Know how to use the HMM model API

1. Markov chain

learning target:

  • Know what a Markov chain is

In machine learning, the Markov chain is a very important concept. A Markov chain, also known as a discrete-time Markov chain, is named after the Russian mathematician Andrei Markov.

1.1 Introduction to Markov chain

A Markov chain is a random process of transition from one state to another in the state space.

[Figure: a two-state Markov chain with states A and B, each with a self-loop and a transition to the other state]

Look at the gray dots in the figure: from each state the chain can stay where it is (A → A / B → B) or move to the other state (A → B / B → A), and this has nothing to do with how the previous steps were taken.

This process requires the "memoryless" property: the probability distribution of the next state is determined only by the current state, and the earlier states in the sequence are irrelevant. This particular kind of memorylessness is called the Markov property.

As a statistical model of real processes, the Markov chain has a wide range of applications in machine learning and artificial intelligence, for example in reinforcement learning, natural language processing, finance, weather prediction, and speech recognition.

At each step of the Markov chain, the system can change from one state to another according to the probability distribution, and can also maintain the current state.

  • A change of state is called a transition
  • The probabilities associated with different state changes are called transition probabilities

The mathematical representation of a Markov chain is:

$$P(x_{t+1} \mid \cdots, x_{t-2}, x_{t-1}, x_t) = P(x_{t+1} \mid x_t)$$

The formula above says that the next state of the Markov chain depends only on the state at time $t$ and has nothing to do with the states before time $t$.

Since the state transition probability at any moment depends only on the previous state, the model of a Markov chain is fully determined once the transition probability between every pair of states is known.

1.2 Classic example of Markov chain

The Markov chain in the figure below is used to represent the stock market model, and there are three states:

  • Bull market : When stocks continue to rise, when the increase exceeds 20%, it is called a bull market
  • Bear market : When stocks continue to fall, when the decline exceeds 20%, it is called a bear market
  • Sideways (Stagnant market) : The stock price remains basically unchanged for a period of time, with no obvious upward or downward trend

Each state transitions to the next state with a certain probability. For example, a bull market transitions to a sideways state with a probability of 0.025.

[Figure: state transition diagram for the bull / bear / stagnant market model]

This state probability transition diagram can be expressed in the form of a matrix.

If we define entry $P(i, j)$ of a matrix $P$ as $P(j \mid i)$, i.e. the probability of moving from state $i$ to state $j$, and number the bull, bear, and stagnant states 0, 1, and 2 respectively, we obtain the state transition matrix of this Markov chain model:

$$P = \begin{pmatrix} 0.9 & 0.075 & 0.025\\ 0.15 & 0.8 & 0.05\\ 0.25 & 0.25 & 0.5 \end{pmatrix}$$

In this matrix:

  • the first row corresponds to the bull market, the second row to the bear market, and the third row to the stagnant market
  • the first column corresponds to the bull market, the second column to the bear market, and the third column to the stagnant market
  • the diagonal entries are the probabilities of staying in the same state

Once this state transition matrix $P$ is determined, the entire stock market model is determined.
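To see what the matrix does in practice, here is a minimal NumPy sketch (not part of the original post) that propagates a state distribution through this transition matrix; repeated multiplication also approximates the chain's long-run behaviour:

```python
import numpy as np

# Transition matrix from above: rows/columns are bull (0), bear (1), stagnant (2)
P = np.array([[0.90, 0.075, 0.025],
              [0.15, 0.80,  0.05 ],
              [0.25, 0.25,  0.50 ]])

# Start in a bull market with certainty and propagate the distribution forward
state = np.array([1.0, 0.0, 0.0])
for step in range(1, 4):
    state = state @ P                    # one Markov step: row vector times P
    print(f"after step {step}:", state.round(4))

# Iterating many more steps approaches the stationary distribution of the chain
for _ in range(1000):
    state = state @ P
print("approximate long-run distribution:", state.round(4))
```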


Summary :

  • A Markov chain is simply a random process of transitions from one state to another in a state space.
    • This process requires the "memoryless" property: the probability distribution of the next state is determined only by the current state, and the earlier states in the sequence are irrelevant.

2. Introduction to HMMs

learning target:

  • Understand what is an HMM model with an example

The Hidden Markov Model (HMM) is a statistical model used to describe a Markov process with hidden, unknown parameters. It is a probabilistic model over sequences: a hidden Markov chain randomly generates an unobservable sequence of states, and each state in turn generates an observation, producing an observable random sequence.

The difficulty is to determine the implicit parameters of the process from the observable parameters, and then use these parameters for further analysis .

Hidden Markov Model (HMM) is widely used in speech recognition, machine translation, Chinese word segmentation, named entity recognition, part-of-speech tagging, gene recognition and other fields. It has also been applied to gestures, font recognition, ground level estimation, intrusion detection in network security, prediction of DNA segment sequences in living things, and fault prediction and diagnosis.

2.1 Simple case of HMM

Let's use a simple example to illustrate HMM: Suppose we have three different dice in our hands.

  • The first dice is our usual dice (call this dice D6). It has 6 faces, and each face (1, 2, 3, 4, 5, 6) appears with probability $\frac{1}{6}$
  • The second dice is a tetrahedron (call this dice D4), and each face (1, 2, 3, 4) appears with probability $\frac{1}{4}$
  • The third dice has eight faces (call this dice D8), and each face (1, 2, 3, 4, 5, 6, 7, 8) appears with probability $\frac{1}{8}$

[Figure: the three dice D6, D4, and D8]

We start rolling: first we pick one of the three dice, each with probability $\frac{1}{3}$. Then we roll that dice and get a number (one of 1, 2, 3, 4, 5, 6, 7, 8). Repeating this process over and over, we get a string of numbers, each of which is one of 1, 2, 3, 4, 5, 6, 7, 8.

For example, we may get such a string of numbers (rolling the dice 10 times): 1 6 3 5 2 7 3 5 2 4. We call this string of numbers the visible state chain .


But in the hidden Markov model, we not only have such a series of visible state chains, but also a series of hidden state chains. In this example, the chain of implicit states is the sequence of dice you use .

For example, the implicit state chain may be: D6 D8 D8 D6 D4 D8 D6 D6 D4 D8.

Generally speaking, the Markov chain mentioned in HMM actually refers to the hidden state chain , because there is a transition probability (Transition Probability) between hidden states (dice).

In our example, the state following D6 is D4, D6, or D8, each with probability $\frac{1}{3}$. The state following D4 or D8 is likewise D4, D6, or D8, each with the same transition probability $\frac{1}{3}$.

This setting is chosen to make the example easy to explain at the start, but in fact we can set the transition probabilities however we like. For example, we can define that D4 cannot be followed by D6, that D6 is followed by D6 with probability 0.9, and by D8 with probability 0.1. That would be a new HMM.

Similarly, although there are no transition probabilities between the visible states themselves, there is a probability between each hidden state and each visible state, called the output probability (emission probability).

For our example, the output probability that the six-sided dice (D6) produces a 1 is $\frac{1}{6}$, and the probability of producing 2, 3, 4, 5, or 6 is also $\frac{1}{6}$ each. We can also define the output probabilities differently. For example, suppose a casino has tampered with a six-sided dice so that it produces a 1 with the higher probability $\frac{1}{2}$, and produces 2, 3, 4, 5, or 6 each with probability $\frac{1}{10}$.

[Figure: the hidden state chain (sequence of dice) with its transition probabilities, and the visible state chain (numbers rolled) with its emission probabilities]

The figures make clear which states are hidden and which are visible.


In fact, for HMM, if the transition probabilities between all hidden states and the output probabilities between all hidden states and all visible states are known in advance, it is quite easy to do simulations. But when applying the HMM model, some information is often missing.

  • Sometimes you know how many kinds of dice there are and what each type is, but you don't know the sequence of dice rolled
  • Sometimes you just see the result of rolling the dice many times and don't know the rest

Estimating this missing information with algorithms then becomes a very important problem. We will discuss these problems in detail below.

2.2 Advanced case

2.2.1 Problem statement

Algorithms related to the HMM model are mainly divided into three categories (respectively solving three kinds of problems):

2.2.1.1 Algorithms of the first type

We know how many kinds of dice there are (the number of hidden states) and what each kind of dice is (the transition probabilities), and from the results of the rolls (the visible state chain) we want to know which dice was used for each roll (the hidden state chain).

That is: visible state chain + transition probabilities → hidden state chain

(In other words: which dice produced these results?)

Note:

  • number of hidden states: how many kinds of dice there are (3 kinds in this example)
  • transition probabilities: what each kind of dice is (D4, D6, D8 in this example)
  • visible state chain: the sequence of roll results (1–8 in this example)

For this problem, it is called the decoding problem in the field of speech recognition.

There are actually two solutions to this question, which give two different answers. Each answer is correct, but the meaning of these answers is not the same.

  • The first solution: find the maximum likelihood state path. Simply put, we look for the sequence of dice (such as D6 D4 D8 D6 D6 D4 ...) that has the highest probability of producing the observed results.

  • The second solution: instead of finding a single sequence of dice, find, for each roll, the probability that it was made with each kind of dice. For example, after seeing the results, we might find that the probability that the first roll used D4 is 0.5, D6 is 0.3, and D8 is 0.2.

2.2.1.2 The second type of algorithm

As before, we know how many kinds of dice there are (the number of hidden states) and what each kind of dice is (the transition probabilities), and from the results of the rolls (the visible state chain) we want to know the probability of obtaining exactly this sequence of results.

It may seem that this question has little significance, because the result you actually observe usually corresponds to a reasonably large probability. The real purpose of asking it is to check whether the observed results are consistent with the known model.

If many results correspond to relatively small probabilities, it means that our known model is likely to be wrong, and someone secretly changed our dice.

2.2.1.3 The third type of algorithm

We know how many kinds of dice there are (the number of hidden states), but we do not know what each kind of dice is (the transition probabilities). Having observed the results of many rolls (the visible state chain), we want to deduce what each kind of dice is (the transition probabilities).

That is: number of hidden states + visible state chain → transition probabilities

Here we mainly want to know what the transition probabilities between the dice are, and whether they match our expectations.

This question is important because this is the most common situation .

In many cases, we only have visible results and do not know the parameters in the HMM model. We need to estimate these parameters from visible results, which is a necessary step in modeling.

2.2.2 Problem Solving

2.2.2.1 A simple question [corresponding to question 2]

In fact the practical value of this question is not high, but it helps with the harder problems below, so we discuss it first.

We know how many kinds of dice there are (the number of hidden states) and what each kind of dice is (the transition probabilities), and from the results of the rolls (the visible state chain) we want to know the probability of obtaining exactly this sequence of results.

[Figure: hidden sequence D6 → D8 → D8 producing the observations 1, 6, 3]

The solution is nothing more than multiplying the probabilities:

$$\begin{aligned} P & = P(D6) \times P(D6 \rightarrow 1) \times P(D6 \rightarrow D8) \times P(D8 \rightarrow 6) \times P(D8 \rightarrow D8) \times P(D8 \rightarrow 3)\\ & = \frac{1}{3} \times \frac{1}{6} \times \frac{1}{3} \times \frac{1}{8} \times \frac{1}{3} \times \frac{1}{8}\\ & \approx 0.0000965 \end{aligned}$$
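A quick check of this arithmetic (a throwaway snippet, not from the original post):

```python
# Product of the six factors above
p = (1 / 3) * (1 / 6) * (1 / 3) * (1 / 8) * (1 / 3) * (1 / 8)
print(p)   # 1/10368, roughly 9.65e-05
```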

2.2.2.2 See the invisible, crack the dice sequence [corresponding to question 1]

Problem 1: We know how many kinds of dice there are (the number of hidden states) and what each kind of dice is (the transition probabilities), and from the results of the rolls (the visible state chain) we want to know which dice was used for each roll (the hidden state chain).

That is: visible state chain + transition probabilities → hidden state chain

Here we use the first solution, the maximum likelihood path problem.

For example, I know that I have three dice: a six-sided D6, a four-sided D4, and an eight-sided D8. I also know the results of my ten rolls (1 6 3 5 2 7 3 5 2 4). But I do not know which dice was used for each roll, and I want to find the most likely sequence of dice (which dice produced each of these results).

In fact, the simplest and crudest way is to enumerate all possible dice sequences, compute the probability of each sequence as in the previous problem, and then pick the sequence with the highest probability.

This works if the Markov chain is not long. But if the Markov chain is long, the number of sequences to enumerate grows exponentially and this becomes infeasible.

Another well-known algorithm is called the Viterbi algorithm . To understand this algorithm, let's look at a few simple examples.

First, if we only roll the dice once:

[Figure: a single roll with observed result 1]

Seeing that the result is 1, the maximum-probability dice for this roll is D4, because the probability that D4 produces a 1 is $\frac{1}{4}$, higher than $\frac{1}{6}$ and $\frac{1}{8}$.

Extending this situation, we roll the dice twice:

[Figure: two rolls with observed results 1, 6]

The observed results are 1, 6. Now the problem becomes more complicated: we need to compute three values, namely the maximum probability that the second dice is D6, D4, or D8. Clearly, to maximize the probability, the first dice must be D4, and the maximum probability that the second dice is D6 is:

$$\begin{aligned} P_2(D6) &= \underset{\text{first roll}}{\underline{P(D4) \times P(D4 \rightarrow 1)}} \times \underset{\text{second roll}}{\underline{P(D4 \rightarrow D6) \times P(D6 \rightarrow 6)}}\\ & = \frac{1}{3} \times \frac{1}{4} \times \frac{1}{3} \times \frac{1}{6} \end{aligned}$$

Similarly, we can compute the maximum probability that the second dice is D4 or D8. We find that the second dice is most probably D6, and when this probability is maximized the first dice is D4. So the maximum-probability dice sequence is D4 D6. Continuing to extend, we roll the dice three times:

[Figure: three rolls with observed results 1, 6, 3]

Similarly, we compute the maximum probability that the third dice is D6, D4, or D8. We find again that to maximize the probability, the second dice must be D6, and the maximum probability that the third dice is D4 is:

$$\begin{aligned} P_3(D4) &= \underset{\text{first roll}}{\underline{P(D4) \times P(D4 \rightarrow 1)}} \times \underset{\text{second roll}}{\underline{P(D4 \rightarrow D6) \times P(D6 \rightarrow 6)}} \times \underset{\text{third roll}}{\underline{P(D6 \rightarrow D4) \times P(D4 \rightarrow 3)}}\\ & = \frac{1}{3} \times \frac{1}{4} \times \frac{1}{3} \times \frac{1}{6} \times \frac{1}{3} \times \frac{1}{4} \end{aligned}$$

As above, we can calculate the maximum probability that the third die is D6 or D8. We find that the third dice has the highest probability of getting D4. And when this probability is maximized, the second dice is D6, and the first dice is D4. So the highest probability dice sequence is D4 D6 D4.


At this point we should see the pattern: since we can handle one, two, and three rolls, we can handle any number of rolls in the same way.

We find that when we look for the maximum-probability dice sequence, we need to do the following (see the code sketch below):

  • First, for a sequence of length 1, compute the maximum probability of each dice at that position.
  • Then, extend the length step by step. Each time the length grows by one, recompute the maximum probability of each dice at the last position. Because the maximum probabilities for the previous length have already been computed, this recomputation is not difficult (similar to recursion).
  • When we reach the last position, we know which dice at the last position has the highest probability. Then we trace the sequence corresponding to this maximum probability from back to front (backtracking).

The Viterbi algorithm draws on the idea of dynamic programming.
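The following is a minimal NumPy sketch (not from the original post) of exactly this procedure for the three-dice example, using the uniform $\frac{1}{3}$ initial and transition probabilities described above:

```python
import numpy as np

# Three-dice HMM from the example: hidden states D6, D4, D8,
# uniform initial and transition probabilities (1/3 everywhere).
states = ["D6", "D4", "D8"]
start = np.full(3, 1 / 3)
trans = np.full((3, 3), 1 / 3)

def emission(state, value):
    """Probability that the given dice shows `value` (faces outside its range have probability 0)."""
    faces = {"D6": 6, "D4": 4, "D8": 8}[state]
    return 1.0 / faces if 1 <= value <= faces else 0.0

def viterbi(obs):
    T, N = len(obs), len(states)
    delta = np.zeros((T, N))            # delta[t, i]: max prob of any path ending in state i at time t
    psi = np.zeros((T, N), dtype=int)   # backpointers to the best previous state
    for i in range(N):
        delta[0, i] = start[i] * emission(states[i], obs[0])
    for t in range(1, T):
        for i in range(N):
            scores = delta[t - 1] * trans[:, i]
            psi[t, i] = np.argmax(scores)
            delta[t, i] = scores[psi[t, i]] * emission(states[i], obs[t])
    # Backtrack from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return [states[i] for i in reversed(path)], delta[-1].max()

print(viterbi([1, 6, 3]))   # most likely hidden sequence: D4 D6 D4, as derived above
```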

2.2.2.3 Who moved my dice? 【Corresponding to Question 3】

Problem 3: We know how many kinds of dice there are (the number of hidden states), but we do not know what each kind of dice is (the transition probabilities). Having observed the results of many rolls (the visible state chain), we want to deduce what each kind of dice is (the transition probabilities).

Here we mainly want to know what the transition probabilities between the dice are, and whether they match our expectations.

That is: number of hidden states + visible state chain → transition probabilities

This question is important because this is the most common situation .

In many cases, we only have visible results and do not know the parameters in the HMM model. We need to estimate these parameters from visible results, which is a necessary step in modeling.


For example, suppose you suspect that your six-sided dice D6 has been tampered with by the casino and may have been replaced by another six-sided dice $\hat{D}6$, which produces a 1 with probability $\frac{1}{2}$ and produces 2, 3, 4, 5, or 6 each with probability $\frac{1}{10}$. What should we do?

In fact the answer is simple: compute the probability that the three normal dice produce the observed sequence, then compute the probability that the abnormal six-sided dice together with the other two normal dice produce it. If the former is smaller than the latter, we should be careful.

For example, the result of rolling a dice is:

[Figure: an observed sequence of roll results]

The probability of producing this result with the three normal dice is the sum of the probabilities over all possible hidden dice sequences.

Again, the simple brute-force method is to enumerate all dice sequences and compute the probability of each. But this time we do not pick the maximum; instead we add up all the computed probabilities, and the total is the answer we want.

This method still cannot be applied to too long sequences of dice (Markov chains). We will apply a solution similar to the previous problem, except that the previous problem is concerned with the maximum probability, and this problem is concerned with the sum of probabilities . The algorithm to solve this problem is called the forward algorithm (Forward algorithm) .

First, if we only roll the dice once:

[Figure: a single roll with observed result 1]

The observed result is 1. The total probability of producing this result can be computed as follows; it is about 0.18:

| | $P_1$ | $P_2$ | $P_3$ |
| --- | --- | --- | --- |
| D6 | $\frac{1}{3} \times \frac{1}{6}$ | | |
| D4 | $\frac{1}{3} \times \frac{1}{4}$ | | |
| D8 | $\frac{1}{3} \times \frac{1}{8}$ | | |
| Total | $\approx 0.18$ (probability that the first roll is 1) | | |

Extending this situation, we roll the dice twice:

[Figure: two rolls with observed results 1, 6]

The observed results are 1, 6. The total probability of producing this result can be computed as follows; it is about 0.018:

| | $P_1$ | $P_2$ | $P_3$ |
| --- | --- | --- | --- |
| D6 | $\frac{1}{3} \times \frac{1}{6}$ | $P_1(D6) \times \frac{1}{3} \times \frac{1}{6} + P_1(D4) \times \frac{1}{3} \times \frac{1}{6} + P_1(D8) \times \frac{1}{3} \times \frac{1}{6}$ | |
| D4 | $\frac{1}{3} \times \frac{1}{4}$ | $P_1(D6) \times \frac{1}{3} \times 0 + P_1(D4) \times \frac{1}{3} \times 0 + P_1(D8) \times \frac{1}{3} \times 0$ | |
| D8 | $\frac{1}{3} \times \frac{1}{8}$ | $P_1(D6) \times \frac{1}{3} \times \frac{1}{8} + P_1(D4) \times \frac{1}{3} \times \frac{1}{8} + P_1(D8) \times \frac{1}{3} \times \frac{1}{8}$ | |
| Total | $\approx 0.18$ (first roll is 1) | $\approx 0.018$ (first roll is 1, second roll is 6) | |

In each $P_2$ cell, the three terms correspond to the first roll having used D6, D4, and D8 respectively; the factor $\frac{1}{3}$ is the transition probability into the current dice.

For D4, a 6 cannot be rolled, so the probability is 0.

Continuing to expand, we roll the dice three times:

[Figure: three rolls with observed results 1, 6, 3]

The observed results are 1, 6, 3. The total probability of producing this result can be computed as follows; it is about 0.003:

| | $P_1$ | $P_2$ | $P_3$ |
| --- | --- | --- | --- |
| D6 | $\frac{1}{3} \times \frac{1}{6}$ | $P_1(D6) \times \frac{1}{3} \times \frac{1}{6} + P_1(D4) \times \frac{1}{3} \times \frac{1}{6} + P_1(D8) \times \frac{1}{3} \times \frac{1}{6}$ | $P_2(D6) \times \frac{1}{3} \times \frac{1}{6} + P_2(D4) \times \frac{1}{3} \times \frac{1}{6} + P_2(D8) \times \frac{1}{3} \times \frac{1}{6}$ |
| D4 | $\frac{1}{3} \times \frac{1}{4}$ | $P_1(D6) \times \frac{1}{3} \times 0 + P_1(D4) \times \frac{1}{3} \times 0 + P_1(D8) \times \frac{1}{3} \times 0$ | $P_2(D6) \times \frac{1}{3} \times \frac{1}{4} + P_2(D4) \times \frac{1}{3} \times \frac{1}{4} + P_2(D8) \times \frac{1}{3} \times \frac{1}{4}$ |
| D8 | $\frac{1}{3} \times \frac{1}{8}$ | $P_1(D6) \times \frac{1}{3} \times \frac{1}{8} + P_1(D4) \times \frac{1}{3} \times \frac{1}{8} + P_1(D8) \times \frac{1}{3} \times \frac{1}{8}$ | $P_2(D6) \times \frac{1}{3} \times \frac{1}{8} + P_2(D4) \times \frac{1}{3} \times \frac{1}{8} + P_2(D8) \times \frac{1}{3} \times \frac{1}{8}$ |
| Total | $\approx 0.18$ (first roll is 1) | $\approx 0.018$ (first roll is 1, second is 6) | $\approx 0.003$ (first roll is 1, second is 6, third is 3) |

In the same way we compute step by step, for however long the sequence is; no matter how long the Markov chain is, we can always work it out.

With the same method we can also compute the probability that the abnormal six-sided dice together with the two normal dice produce this sequence. Comparing the two probabilities then tells us whether our dice has been swapped.
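The following is a minimal NumPy sketch of this comparison (not from the original post). It runs the forward recursion once with three normal dice and once with the tampered D6 whose emission probabilities ($\frac{1}{2}$ for a 1, $\frac{1}{10}$ for each of 2–6) were given above; the initial and transition probabilities are the uniform $\frac{1}{3}$ from the example, and the function names are mine:

```python
import numpy as np

# Dice HMM: uniform start/transition probabilities; rows of B are D6, D4, D8, columns are faces 1..8
start = np.full(3, 1 / 3)
trans = np.full((3, 3), 1 / 3)

def emission_matrix(loaded_d6=False):
    if loaded_d6:
        d6 = [1 / 2] + [1 / 10] * 5 + [0, 0]   # tampered D6: a 1 is much more likely
    else:
        d6 = [1 / 6] * 6 + [0, 0]
    d4 = [1 / 4] * 4 + [0] * 4
    d8 = [1 / 8] * 8
    return np.array([d6, d4, d8])

def sequence_probability(obs, B):
    alpha = start * B[:, obs[0] - 1]           # time 1
    for o in obs[1:]:
        alpha = (alpha @ trans) * B[:, o - 1]  # forward recursion
    return alpha.sum()

obs = [1, 6, 3, 5, 2, 7, 3, 5, 2, 4]
p_normal = sequence_probability(obs, emission_matrix(loaded_d6=False))
p_loaded = sequence_probability(obs, emission_matrix(loaded_d6=True))
print(p_normal, p_loaded)   # if p_normal is much smaller than p_loaded, be suspicious of the D6
```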


Summary

  • The Hidden Markov Model (HMM) is a statistical model used to describe a Markov process with hidden, unknown parameters.
  • Common terms:
    • visible state chain
    • hidden state chain
    • transition probability
    • emission (output) probability

3. HMM model basics

learning target:

  • Understand the main characteristics of the problems that the HMM model solves
  • Know the two important assumptions of the HMM model
  • Know how an HMM observation sequence is generated
  • Know the three basic problems of the HMM model

3.1 What kind of problem needs an HMM model

First, let us see what kind of problem can be solved with an HMM model. When we use an HMM model, the problem generally has these two characteristics:

  1. The problem is based on a sequence, such as a time series or a state sequence
  2. The problem involves two kinds of data:
    1. one kind of sequence data is observable, i.e. the observation sequence
    2. the other kind cannot be observed, i.e. the hidden state sequence, called the state sequence for short

If a problem has these two characteristics, it can generally be tackled with an HMM model.


There are many such problems in real life. For example: when I write a blog post, the sequence of characters typed on the keyboard is the observation sequence, while the sentence I actually want to write is the hidden state sequence. The job of an input method is to guess, from the typed characters, the sentence I intend to write as accurately as possible, and to put the most likely words first for me to choose; this can be seen as an HMM model.

Another example: when a teacher lectures, the continuous stream of sound the teacher produces is the observation sequence, while the sentence the teacher actually wants to express is the hidden state sequence. The students' brains have to work out, from this stream of sound, what the teacher most likely means.

From these examples we can see that HMM models are everywhere. But the description above is still not precise, so below we describe the HMM model with precise mathematical notation.

3.2 Definition of the HMM model

For the HMM model, we first assume that $Q$ is the set of all possible hidden states and $V$ is the set of all possible observation states, i.e.:

$$\begin{aligned} & Q = q_1, q_2, ..., q_N\\ & V = v_1, v_2, ..., v_M \end{aligned}$$

where:

  • $N$ is the number of possible hidden states
  • $M$ is the number of possible observation states

For a sequence of length $T$, $i$ is the corresponding state sequence and $O$ is the corresponding observation sequence, i.e.:

$$\begin{aligned} & i = i_1, i_2, ..., i_T \quad (\text{which dice is used at each step})\\ & O = o_1, o_2, ..., o_T \quad (\text{the result of each roll}) \end{aligned}$$

where:

  • any hidden state $i_t \in Q$
  • any observation state $o_t \in V$

The HMM model makes two very important assumptions:

  1. the homogeneous Markov chain assumption
  2. the observation independence assumption

3.2.1 [Assumption 1] The homogeneous Markov chain assumption

The hidden state at any time depends only on the previous hidden state (time $t$ depends only on time $t-1$).

Of course this assumption is somewhat extreme, because in many cases a hidden state depends not only on the previous hidden state but possibly also on the two or three before it.

The benefit of the assumption is that the model is simple and easy to solve.

If the hidden state at time $t$ is $i_t = q_i$ and the hidden state at time $t+1$ is $i_{t+1} = q_j$, then the HMM state transition probability $a_{ij}$ from time $t$ to time $t+1$ can be written as:

$$a_{ij} = P(i_{t+1} = q_j \mid i_t = q_i)$$

The $a_{ij}$ together form the state transition matrix $A$ of the Markov chain:

$$A = [a_{ij}]_{N \times N}$$

3.2.2 [Assumption 2] The observation independence assumption

The observation state at any time depends only on the hidden state at that time (and on nothing else); this is another assumption made to simplify the model (time $t$ depends only on time $t$).

If the hidden state at time $t$ is $i_t = q_j$ and the corresponding observation state is $o_t = v_k$, then the probability $b_j(k)$ of generating observation $v_k$ from hidden state $q_j$ at that time satisfies:

$$b_j(k) = P(o_t = v_k \mid i_t = q_j)$$

The $b_j(k)$ together form the observation probability matrix $B$:

$$B = [b_j(k)]_{N \times M}$$

In addition, we need the distribution of the hidden state at time $t = 1$, $\Pi$:

$$\Pi = [\Pi_i]_N$$

where $\Pi_i = P(i_1 = q_i)$.


因此我们可以知道,一个 HMM 模型,可以由隐藏状态初始概率分布 Π \Pi Π,状态转移概率矩阵 A A A 和观测状态概率矩阵 B B B 三部分决定

  • 初始状态概率分布 Π \Pi Π 和 状态序列 A A A 决定状态序列
  • 观测序列 B B B 决定观测序列

因此,HMM 模型可以由一个三元组 λ \lambda λ 表示:

λ = ( A , B , Π ) = ( 状态序列 , 观测序列 , 初始状态概率分布 ) \lambda = (A, B, \Pi) = (状态序列, 观测序列, 初始状态概率分布) λ=(A,B,Π)=(状态序列,观测序列,初始状态概率分布)

3.3 An example HMM model

Below we use a simple example to make the abstract HMM model above concrete. It is a box-and-ball model.

The example comes from Li Hang's book *Statistical Learning Methods*.

Suppose we have 3 boxes, each containing red and white balls. The numbers of balls in the three boxes are:

| Box | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Red balls | 5 | 4 | 7 |
| White balls | 5 | 6 | 3 |

Balls are drawn from the boxes as follows. At the beginning:

  • the probability of drawing from box 1 is 0.2
  • the probability of drawing from box 2 is 0.4
  • the probability of drawing from box 3 is 0.4

A ball is drawn with these probabilities and then put back.

Then we move from the current box to the next box and draw again. The rules are:

  • If the current box is box 1, we stay in box 1 with probability 0.5, move to box 2 with probability 0.2, and move to box 3 with probability 0.3.
  • If the current box is box 2, we stay in box 2 with probability 0.5, move to box 1 with probability 0.3, and move to box 3 with probability 0.2.
  • If the current box is box 3, we stay in box 3 with probability 0.5, move to box 1 with probability 0.2, and move to box 2 with probability 0.3.

This is repeated until 3 draws have been made, giving an observation sequence $O$ of ball colors:

$$O = \{ \text{red}, \text{white}, \text{red} \}$$

Note that during this process the observer can only see the sequence of ball colors and cannot see which box each ball is drawn from.

Then, according to our earlier definition of the HMM model, the set of observation states $V$ is:

$$\begin{aligned} & V = \{ \text{red}, \text{white} \}\\ & M = 2 \end{aligned}$$

The set of hidden states $Q$ is:

$$\begin{aligned} & Q = \{ \text{box 1}, \text{box 2}, \text{box 3} \}\\ & N = 3 \end{aligned}$$

The observation sequence $O$ (red, white, red) and the state sequence $i$ (which box is used at each draw) both have length $T = 3$.

The initial state probability distribution $\Pi$ is:

$$\Pi = (0.2, 0.4, 0.4)^T$$

i.e. the probability of drawing from box 1 is 0.2, from box 2 is 0.4, and from box 3 is 0.4.

The state transition probability matrix $A$ (hidden, not observable) is:

$$A = \begin{bmatrix} 0.5 & 0.2 & 0.3\\ 0.3 & 0.5 & 0.2\\ 0.2 & 0.3 & 0.5 \end{bmatrix}_{N \times N = 3 \times 3}$$

Each row corresponds to the current box and each column to the next box.

The observation probability matrix $B$ (observable) is:

$$B = \begin{bmatrix} 0.5 & 0.5\\ 0.4 & 0.6\\ 0.7 & 0.3 \end{bmatrix}_{N \times M = 3 \times 2}$$

Each row corresponds to a box; column 1 is the probability of drawing a red ball, column 2 the probability of drawing a white ball.

where:

  • $M$ is the number of possible observation states
  • $N$ is the number of possible hidden states
  • $V$ is the set of all possible observation states
  • $Q$ is the set of all possible hidden states
  • $i$ is the state sequence
  • $O$ is the observation sequence
  • $T$ is the length of the sequence
  • $A$ is the state transition probability matrix
  • $B$ is the observation probability matrix

3.4 Generation of an HMM observation sequence $O$

From the example above we can abstract the process by which an HMM observation sequence $O$ is generated.

  • Input
    • the HMM model $\lambda = (A, B, \Pi)$
    • the length $T$ of the observation sequence $O$
  • Output
    • the observation sequence $O = o_1, o_2, ..., o_T$

The generation process is as follows:

  1. Generate the hidden state $i_1$ according to the initial state distribution $\Pi$
  2. For $t$ from $1$ to $T$:
    • a. generate the observation state $o_t$ according to the emission distribution $b_{i_t}(k)$ of the hidden state $i_t$
    • b. generate the hidden state $i_{t+1}$ according to the state transition distribution $a_{i_t, i_{t+1}}$ of the hidden state $i_t$

All the $o_t$ together form the observation sequence $O = o_1, o_2, ..., o_T$.

In a hidden Markov model (HMM), $\lambda = (A, B, \Pi)$ denotes the model, where $A$ is the state transition probability matrix, $B$ is the observation probability matrix, and $\Pi$ is the initial state distribution. $T$ is the length of the observation sequence $O$. $O = o_1, o_2, ..., o_T$ is the observation sequence, where $o_t$ is the observation at time step $t$ and $i_t$ is the hidden state at time step $t$. $b_{i_t}(k)$ is the probability of observing symbol $k$ in hidden state $i_t$, and $a_{i_t, i_{t+1}}$ is the probability of transitioning from hidden state $i_t$ to the next hidden state. These symbols are the parameters and variables of the hidden Markov model.
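The procedure above can be turned into a short simulation. The following is a minimal NumPy sketch (not from the original post) that samples a state sequence and an observation sequence from the box-and-ball model of section 3.3, encoding red as 0 and white as 1; the function name is mine:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Box-and-ball model from section 3.3; observation symbols: 0 = red, 1 = white
Pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])

def generate(T):
    """Follow the generation procedure above: draw i_1 from Pi, then emit and transition."""
    states, observations = [], []
    state = rng.choice(3, p=Pi)                           # step 1: i_1 ~ Pi
    for _ in range(T):
        states.append(int(state))
        observations.append(int(rng.choice(2, p=B[state])))  # step 2a: o_t ~ b_{i_t}
        state = rng.choice(3, p=A[state])                 # step 2b: i_{t+1} ~ a_{i_t}
    return states, observations

boxes, colors = generate(3)
print("hidden boxes:", boxes, "| observed colors:", colors)  # only the colors would be visible
```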

3.5 The three basic problems of the HMM model

The HMM model has three classic problems to solve:

  1. Evaluating the probability of an observation sequence: the forward-backward probability calculation
  2. The prediction problem (also called the decoding problem): the Viterbi algorithm
  3. The model parameter learning problem: the Baum-Welch algorithm (states unknown)

3.5.1 [Problem 1] Evaluating the probability of an observation sequence: the forward-backward probability calculation

Given a model $\lambda = (A, B, \Pi)$ and an observation sequence $O = \{o_1, o_2, ..., o_T\}$, compute the probability $P(O \mid \lambda)$ that this observation sequence $O$ occurs under the model $\lambda$.

Solving this problem requires the forward-backward algorithm; it is the simplest of the three HMM problems.

3.5.2 [Problem 2] The prediction problem (also called the decoding problem): the Viterbi algorithm

Given a model $\lambda = (A, B, \Pi)$ and an observation sequence $O = \{o_1, o_2, ..., o_T\}$, find the state sequence $i$ that is most likely given the observation sequence $O$.

Solving this problem requires the Viterbi algorithm, which is based on dynamic programming; its complexity is intermediate among the three HMM problems.

3.5.3 [Problem 3] The model parameter learning problem: the Baum-Welch algorithm (states unknown)

Given an observation sequence $O = \{o_1, o_2, ..., o_T\}$, estimate the parameters of the model $\lambda = (A, B, \Pi)$ so that the conditional probability $P(O \mid \lambda)$ of the observation sequence under this model is maximized.

Solving this problem requires the Baum-Welch algorithm, which is based on the EM algorithm; it is the most complex of the three HMM problems.

In the next three sections we discuss these three problems.


Summary

  • What kind of problem can be solved with an HMM model?
    • problems based on sequences, such as time series
    • problems involving two kinds of data:
      • an observable observation sequence
      • an unobservable hidden state sequence
  • The two important assumptions of the HMM model:
    • the homogeneous Markov chain assumption
    • the observation independence assumption
  • The three basic problems of the HMM model:
    • [Problem 1] Evaluating the probability of an observation sequence: the forward-backward probability calculation
    • [Problem 2] The prediction problem (also called the decoding problem): the Viterbi algorithm
    • [Problem 3] The model parameter learning problem: the Baum-Welch algorithm (states unknown)

4. Evaluating observation sequence probabilities with the forward-backward algorithm

learning target:

  • Know how to compute the probability $P(O \mid \lambda)$ of an HMM observation sequence $O$ with the forward algorithm
  • Know how to compute the probability $P(O \mid \lambda)$ of an HMM observation sequence $O$ with the backward algorithm

In this section we focus on the solution of the first basic HMM problem: given the model $\lambda$ and an observation sequence $O$, compute the probability $P(O \mid \lambda)$ that the observation sequence occurs.

4.1 Review of HMM problem 1: computing the probability $P(O \mid \lambda)$ of an observation sequence $O$

First let us review the first problem of the HMM model. The problem is as follows:

We know the parameters of the HMM model, $\lambda = (A, B, \Pi)$, where $A$ is the hidden state transition probability matrix, $B$ is the observation probability matrix, and $\Pi$ is the initial hidden state distribution. We have also obtained an observation sequence $O = \{o_1, o_2, ..., o_T\}$, and we want the conditional probability $P(O \mid \lambda)$ that the observation sequence $O$ appears under this model.

At first glance the problem looks simple: since we know all the transition probabilities $A$ between hidden states and all the emission probabilities $B$ from hidden states to observation states, we can solve it by brute force.

We can enumerate all possible hidden sequences of length $T$, $i = \{i_1, i_2, ..., i_T\}$, compute the joint probability $P(O, i \mid \lambda)$ of each hidden sequence $i$ with the observation sequence $O = \{o_1, o_2, ..., o_T\}$, and then easily obtain the marginal probability $P(O \mid \lambda)$.


The brute-force approach works as follows:

  • First, the probability of any hidden sequence $i = \{i_1, i_2, ..., i_T\}$ is $P(i \mid \lambda) = \Pi_{i_1} a_{i_1 i_2} a_{i_2 i_3} \cdots a_{i_{T-1} i_T}$, where $\Pi$ is the initial state distribution and $a_{i_{t-1} i_t}$ is the hidden state transition probability.

  • For a fixed state sequence $i = \{i_1, i_2, ..., i_T\}$, the probability of the observation sequence $O = \{o_1, o_2, ..., o_T\}$ is $P(O \mid i, \lambda) = b_{i_1}(o_1) \times b_{i_2}(o_2) \times \cdots \times b_{i_T}(o_T)$, where $b_{i_t}(o_t)$ is the probability of observing $o_t$ in hidden state $i_t$.

  • The joint probability of $O$ and $i$ is $P(O, i \mid \lambda) = P(i \mid \lambda) P(O \mid i, \lambda) = \Pi_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) \cdots a_{i_{T-1} i_T} b_{i_T}(o_T)$.

  • Summing out the hidden sequence then gives the conditional probability of the observation sequence $O$ under model $\lambda$: $P(O \mid \lambda) = \sum_i P(O, i \mid \lambda) = \sum_{i_1, i_2, ..., i_T} \Pi_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) \cdots a_{i_{T-1} i_T} b_{i_T}(o_T)$.

Although this method works, it becomes troublesome when the number of hidden states $N$ is large: there are $N^T$ possible state sequences, so the time complexity of the algorithm is of order $O(TN^T)$.
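To make the brute-force method concrete, here is a minimal sketch (not from the original post) that enumerates all $N^T = 3^3$ hidden sequences for the box-and-ball example of section 3.3, encoding red as 0 and white as 1; the result matches the forward-algorithm value computed in section 4.3:

```python
import itertools
import numpy as np

# Box-and-ball model from section 3.3; observations: 0 = red, 1 = white
Pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])
obs = [0, 1, 0]  # red, white, red

total = 0.0
for seq in itertools.product(range(3), repeat=len(obs)):   # all N^T hidden sequences
    p = Pi[seq[0]] * B[seq[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[seq[t - 1], seq[t]] * B[seq[t], obs[t]]
    total += p
print(round(total, 5))   # 0.13022, the same value as the forward algorithm below
```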

Therefore:

  • for models with only a handful of hidden states $N$, we can use brute force to compute the probability of the observation sequence
  • but if the number of hidden states $N$ is large, the algorithm above is too slow, and we need a more efficient algorithm.

The forward-backward algorithm is what lets us solve this problem with a much lower time complexity.

4.2 Computing the probability $P(O \mid \lambda)$ of an HMM observation sequence $O$ with the forward algorithm

"Forward-backward algorithm" is the collective name for the forward algorithm and the backward algorithm; both can be used to compute the probability $P(O \mid \lambda)$ of an HMM observation sequence $O$. Let us first see how the forward algorithm solves this problem.

4.2.1 Outline of the procedure

The forward algorithm is essentially a dynamic programming algorithm: we need to find a recurrence for a local state, so that we can extend step by step from the solutions of subproblems to the solution of the whole problem.

The idea of dynamic programming here: while computing, we reuse the results already obtained for the previous step, so each step builds only on the step before it.

In the forward algorithm, this local state of the dynamic program is defined through the "forward probability".

So what is the forward probability? The definition is simple: the forward probability is the probability that at time $t$ the hidden state is $q_i$ and the observation sequence so far is $o_1, o_2, ..., o_t$. It is written as:

$$\alpha_t(i) = P(o_1, o_2, ..., o_t, i_t = q_i \mid \lambda)$$

where:

  • $\alpha_t(i)$ is the forward probability, i.e. the probability that at time $t$ the hidden state is $q_i$ and the observations so far are $o_1, o_2, ..., o_t$.
  • $o_1, o_2, ..., o_t$ is the observation sequence up to time $t$.
  • $i_t = q_i$ means that the hidden state at time $t$ is $q_i$.
  • $\lambda$ denotes the parameters of the hidden Markov model.

Since this is dynamic programming, we need a recurrence. Suppose we have already found the forward probabilities of all hidden states at time $t$; we now need to derive the forward probabilities of all hidden states at time $t+1$.

Starting from the forward probability $\alpha_t(j)$ of hidden state $q_j$ at time $t$ and multiplying by the corresponding transition probability $a_{ji}$, the product $\alpha_t(j) \times a_{ji}$ is the probability of observing $o_1, o_2, ..., o_t$ up to time $t$, being in hidden state $q_j$ at time $t$, and being in hidden state $q_i$ at time $t+1$.

Q: Why $a_{ji}$ and not $a_t$?
A: $a_{ji}$ is the probability of transitioning from hidden state $q_j$ to hidden state $q_i$; it is one of the parameters of the hidden Markov model. In an HMM, the state transition matrix $A = [a_{ij}]$ is fixed and does not change over time. Therefore, in the forward recursion we use the element $a_{ji}$ of the fixed matrix $A$, not a quantity that varies with time $t$.

If we sum the probabilities of all these transitions, $\sum_{j=1}^N \alpha_t(j) a_{ji}$ is the probability of observing $o_1, o_2, ..., o_t$ up to time $t$ and being in hidden state $q_i$ at time $t+1$.

Going one step further, since the observation $o_{t+1}$ depends only on the hidden state $q_i$ at time $t+1$, the quantity $\left[ \sum_{j=1}^N \alpha_t(j) a_{ji} \right] b_i(o_{t+1})$ is the probability of observing $o_1, o_2, ..., o_{t+1}$ up to time $t+1$ and being in hidden state $q_i$ at time $t+1$.

And this probability is exactly the forward probability of hidden state $q_i$ at time $t+1$, so we obtain the recurrence for the forward probabilities:

$$\alpha_{t+1}(i) = \left[ \sum_{j=1}^N \alpha_t(j) a_{ji} \right] b_i(o_{t+1})$$

The dynamic program starts at time 1 and ends at time $T$. Since $\alpha_T(i)$ is the probability of observing the sequence $o_1, o_2, ..., o_T$ and being in hidden state $q_i$ at time $T$, summing over all hidden states, $P(O \mid \lambda) = \sum_{i=1}^N \alpha_T(i)$, gives the probability of observing the sequence $o_1, o_2, ..., o_T$.

4.2.2 Algorithm summary

  • Input: the HMM model parameters $\lambda = (A, B, \Pi)$ and an observation sequence $O = \{o_1, o_2, ..., o_T\}$
  • Output: the probability $P(O \mid \lambda)$ of the observation sequence $O$
  1. Compute the forward probabilities of all hidden states at time 1: $\alpha_1(i) = \Pi_i b_i(o_1), \quad i = 1, 2, ..., N$
  2. Recursively compute the forward probabilities at times $2, 3, ..., T$: $\alpha_{t+1}(i) = \left[ \sum_{j=1}^N \alpha_t(j) a_{ji} \right] b_i(o_{t+1}), \quad i = 1, 2, ..., N$
  3. Compute the final result: $P(O \mid \lambda) = \sum_{i=1}^N \alpha_T(i)$

where:

  • $\lambda = (A, B, \Pi)$ are the parameters of the hidden Markov model
    • $A$ is the state transition probability matrix
    • $B$ is the observation probability matrix
    • $\Pi$ is the initial state probability vector
  • $O = \{o_1, o_2, ..., o_T\}$ is the observation sequence.
  • $P(O \mid \lambda)$ is the probability that the observation sequence $O$ occurs under model $\lambda$.
  • $\alpha_t(i)$ is the forward probability, i.e. the probability that at time $t$ the hidden state is $q_i$ and the observations so far are $o_1, o_2, ..., o_t$.
  • $\Pi_i$ is the $i$-th element of the initial state probability vector.
  • $b_i(o_t)$ is the probability of observing $o_t$ in hidden state $q_i$.
  • $a_{ji}$ is the probability of transitioning from hidden state $q_j$ to hidden state $q_i$.

The recurrence shows that the time complexity of the algorithm is $O(TN^2)$, several orders of magnitude better than the $O(TN^T)$ of the brute-force method.

4.3 A worked example of the HMM forward algorithm

Here we use the earlier box-and-ball example to show how the forward probabilities $\alpha$ are computed. The set of observation states is:

$$\begin{aligned} & V = \{ \text{red}, \text{white} \}\\ & M = 2 \end{aligned}$$

The set of hidden states is:

$$\begin{aligned} & Q = \{ \text{box 1}, \text{box 2}, \text{box 3} \}\\ & N = 3 \end{aligned}$$

The observation sequence $O$ and the state sequence $i$ both have length 3.

The initial state distribution is:

$$\Pi = (0.2, 0.4, 0.4)^T$$

The state transition probability matrix $A$ (hidden, not observable) is:

$$A = \begin{bmatrix} 0.5 & 0.2 & 0.3\\ 0.3 & 0.5 & 0.2\\ 0.2 & 0.3 & 0.5 \end{bmatrix}_{N \times N = 3 \times 3}$$

Each row corresponds to the current box and each column to the next box.

The observation probability matrix $B$ (observable) is:

$$B = \begin{bmatrix} 0.5 & 0.5\\ 0.4 & 0.6\\ 0.7 & 0.3 \end{bmatrix}_{N \times M = 3 \times 2}$$

Each row corresponds to a box; column 1 is the probability of drawing a red ball, column 2 the probability of drawing a white ball.

The observation sequence of ball colors is:

$$O = \{ \text{red}, \text{white}, \text{red} \}$$


Following the forward algorithm of the previous subsection, we first compute the forward probabilities $\alpha_1(i)$ of the three states at time 1.

At time 1 the ball is red:

  • The hidden state is box 1: $\alpha_1(1) = \Pi_1 b_1(o_1) = \underset{\text{pick box 1}}{0.2} \times \underset{\text{red from box 1}}{0.5} = 0.1$
  • The hidden state is box 2: $\alpha_1(2) = \Pi_2 b_2(o_1) = \underset{\text{pick box 2}}{0.4} \times \underset{\text{red from box 2}}{0.4} = 0.16$
  • The hidden state is box 3: $\alpha_1(3) = \Pi_3 b_3(o_1) = \underset{\text{pick box 3}}{0.4} \times \underset{\text{red from box 3}}{0.7} = 0.28$

Now we can start the recursion. First we compute the forward probabilities $\alpha_2(i)$ of the three states at time 2.

At time 2 the ball is white:

  • The hidden state is box 1: $\alpha_2(1) = \left[ \sum_{i=1}^3 \alpha_1(i) a_{i1} \right] b_1(o_2) = \left[ 0.1 \times 0.5 + 0.16 \times 0.3 + 0.28 \times 0.2 \right] \times 0.5 = 0.077$
  • The hidden state is box 2: $\alpha_2(2) = \left[ \sum_{i=1}^3 \alpha_1(i) a_{i2} \right] b_2(o_2) = \left[ 0.1 \times 0.2 + 0.16 \times 0.5 + 0.28 \times 0.3 \right] \times 0.6 = 0.1104$
  • The hidden state is box 3: $\alpha_2(3) = \left[ \sum_{i=1}^3 \alpha_1(i) a_{i3} \right] b_3(o_2) = \left[ 0.1 \times 0.3 + 0.16 \times 0.2 + 0.28 \times 0.5 \right] \times 0.3 = 0.0606$

In each bracket the three terms correspond to time 1 being box 1, box 2, and box 3, multiplied by the transition probability into the current box; the final factor is the probability of drawing a white ball from the current box.

When computing time 2, we only use the quantities of time 1.


Continuing the recursion, we now compute the forward probabilities $\alpha_3(i)$ of the three states at time 3.

At time 3 the ball is red:

  • The hidden state is box 1: $\alpha_3(1) = \left[ \sum_{i=1}^3 \alpha_2(i) a_{i1} \right] b_1(o_3) = \left[ 0.077 \times 0.5 + 0.1104 \times 0.3 + 0.0606 \times 0.2 \right] \times 0.5 = 0.04187$
  • The hidden state is box 2: $\alpha_3(2) = \left[ \sum_{i=1}^3 \alpha_2(i) a_{i2} \right] b_2(o_3) = \left[ 0.077 \times 0.2 + 0.1104 \times 0.5 + 0.0606 \times 0.3 \right] \times 0.4 = 0.03551$
  • The hidden state is box 3: $\alpha_3(3) = \left[ \sum_{i=1}^3 \alpha_2(i) a_{i3} \right] b_3(o_3) = \left[ 0.077 \times 0.3 + 0.1104 \times 0.2 + 0.0606 \times 0.5 \right] \times 0.7 = 0.05284$

In each bracket the three terms correspond to time 2 being box 1, box 2, and box 3; the final factor is the probability of drawing a red ball from the current box.

When computing time 3, we only use the quantities of time 2.


Finally, the probability of the observation sequence $O = \{ \text{red}, \text{white}, \text{red} \}$ is:

$$\begin{aligned} P(O \mid \lambda) & = \sum_{i=1}^3 \alpha_3(i) \\ & = 0.04187 + 0.03551 + 0.05284 \\ & = 0.13022 \end{aligned}$$

During the computation we only ever look one time step back; this is the forward algorithm.

Note that each time step is computed purely from the quantities of the previous time step (the idea of dynamic programming).
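The recursion above is easy to check in code. The following is a minimal NumPy sketch (not from the original post) of the forward algorithm for the box-and-ball model, with red encoded as 0 and white as 1; it reproduces the intermediate values and the total 0.13022:

```python
import numpy as np

# Box-and-ball model from above; observation symbols: 0 = red, 1 = white
Pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])

def forward(obs):
    """Return the forward probabilities alpha (T x N) and P(O | lambda)."""
    alpha = np.zeros((len(obs), 3))
    alpha[0] = Pi * B[:, obs[0]]                          # step 1: alpha_1(i) = Pi_i * b_i(o_1)
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]      # step 2: forward recursion
    return alpha, alpha[-1].sum()                         # step 3: sum over final states

alpha, prob = forward([0, 1, 0])   # red, white, red
print(alpha.round(5))   # rows: [0.1, 0.16, 0.28], [0.077, 0.1104, 0.0606], [0.04187, 0.03551, 0.05284]
print(round(prob, 5))   # 0.13022
```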

4.4 Computing the probability of an HMM observation sequence with the backward algorithm

4.4.1 Outline of the procedure

Now that we are familiar with computing the probability of an HMM observation sequence with the forward algorithm, let us see how the backward algorithm computes it.

The backward algorithm is very similar to the forward algorithm; both use dynamic programming. The only difference is the choice of local state: the backward algorithm uses the "backward probability".

Simply put, the forward algorithm works from time $1$ to time $T$, while the backward algorithm works from time $T$ back to time $1$.

4.4.2 Backward algorithm procedure

Here is the backward algorithm; note the similarities and differences with the forward algorithm:

  • Input: the HMM model $\lambda = (A, B, \Pi)$ and an observation sequence $O = (o_1, o_2, ..., o_T)$
  • Output: the probability $P(O \mid \lambda)$ of the observation sequence

Initialize the backward probabilities of all hidden states at time $T$:

$$\beta_T(i) = 1, \quad i = 1, 2, ..., N$$

The forward probabilities are denoted by $\alpha$ and the backward probabilities by $\beta$.

Recursively compute the backward probabilities at times $T-1, T-2, ..., 1$:

$$\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j), \quad i = 1, 2, ..., N$$

Compute the final result:

$$P(O \mid \lambda) = \sum_{i=1}^N \Pi_i b_i(o_1) \beta_1(i)$$

where:

  • $A$: the state transition matrix, where $a_{ij}$ is the probability of transitioning from hidden state $i$ to hidden state $j$.
  • $B$: the observation probability matrix, where $b_j(k)$ is the probability of observing symbol $k$ in hidden state $j$.
  • $\Pi$: the initial state probability vector, where $\Pi_i$ is the probability that the hidden state at the initial time is $i$.
  • $O$: the observation sequence, where $o_t$ is the observation at time $t$.
  • $\lambda$: the HMM model parameters, consisting of the state transition matrix $A$, the observation probability matrix $B$, and the initial state probability vector $\Pi$.
  • $\beta_t(i)$: the backward probability, i.e. the probability of observing $o_{t+1}, o_{t+2}, ..., o_T$ from time $t+1$ to time $T$, given that the hidden state at time $t$ is $i$.

The time complexity of this algorithm is again $O(TN^2)$.
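A matching sketch of the backward recursion (again not from the original post, with the same encoding red = 0, white = 1) gives the same probability 0.13022 for the box-and-ball example:

```python
import numpy as np

# Same box-and-ball model; observation symbols: 0 = red, 1 = white
Pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])

def backward(obs):
    """Return the backward probabilities beta (T x N) and P(O | lambda)."""
    T = len(obs)
    beta = np.zeros((T, 3))
    beta[-1] = 1.0                                        # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])    # backward recursion
    return beta, np.sum(Pi * B[:, obs[0]] * beta[0])      # final result

beta, prob = backward([0, 1, 0])
print(round(prob, 5))   # 0.13022, same as the forward algorithm
```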

4.5 Summary

4.5.1 Computing the probability of an HMM observation sequence $O$ with the forward algorithm

  • Input: the HMM model parameters $\lambda = (A, B, \Pi)$ and an observation sequence $O = \{o_1, o_2, ..., o_T\}$
  • Output: the probability $P(O \mid \lambda)$ of the observation sequence $O$
  1. Compute the forward probabilities of all hidden states at time 1: $\alpha_1(i) = \Pi_i b_i(o_1), \quad i = 1, 2, ..., N$
  2. Recursively compute the forward probabilities at times $2, 3, ..., T$: $\alpha_{t+1}(i) = \left[ \sum_{j=1}^N \alpha_t(j) a_{ji} \right] b_i(o_{t+1}), \quad i = 1, 2, ..., N$
  3. Compute the final result: $P(O \mid \lambda) = \sum_{i=1}^N \alpha_T(i)$

where:

  • $\lambda = (A, B, \Pi)$ are the parameters of the hidden Markov model
    • $A$ is the state transition probability matrix
    • $B$ is the observation probability matrix
    • $\Pi$ is the initial state probability vector
  • $O = \{o_1, o_2, ..., o_T\}$ is the observation sequence.
  • $P(O \mid \lambda)$ is the probability that the observation sequence $O$ occurs under model $\lambda$.
  • $\alpha_t(i)$ is the forward probability, i.e. the probability that at time $t$ the hidden state is $q_i$ and the observations so far are $o_1, o_2, ..., o_t$.
  • $\Pi_i$ is the $i$-th element of the initial state probability vector.
  • $b_i(o_t)$ is the probability of observing $o_t$ in hidden state $q_i$.
  • $a_{ji}$ is the probability of transitioning from hidden state $q_j$ to hidden state $q_i$.

4.5.2 Computing the probability of an HMM observation sequence $O$ with the backward algorithm

  • Input: the HMM model $\lambda = (A, B, \Pi)$ and an observation sequence $O = (o_1, o_2, ..., o_T)$
  • Output: the probability $P(O \mid \lambda)$ of the observation sequence

Initialize the backward probabilities of all hidden states at time $T$:

$$\beta_T(i) = 1, \quad i = 1, 2, ..., N$$

The forward probabilities are denoted by $\alpha$ and the backward probabilities by $\beta$.

Recursively compute the backward probabilities at times $T-1, T-2, ..., 1$:

$$\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j), \quad i = 1, 2, ..., N$$

Compute the final result:

$$P(O \mid \lambda) = \sum_{i=1}^N \Pi_i b_i(o_1) \beta_1(i)$$

where:

  • $A$: the state transition matrix, where $a_{ij}$ is the probability of transitioning from hidden state $i$ to hidden state $j$.
  • $B$: the observation probability matrix, where $b_j(k)$ is the probability of observing symbol $k$ in hidden state $j$.
  • $\Pi$: the initial state probability vector, where $\Pi_i$ is the probability that the hidden state at the initial time is $i$.
  • $O$: the observation sequence, where $o_t$ is the observation at time $t$.
  • $\lambda$: the HMM model parameters, consisting of the state transition matrix $A$, the observation probability matrix $B$, and the initial state probability vector $\Pi$.
  • $\beta_t(i)$: the backward probability, i.e. the probability of observing $o_{t+1}, o_{t+2}, ..., o_T$ from time $t+1$ to time $T$, given that the hidden state at time $t$ is $i$.

4.5.3 Comparison

  • The forward and backward algorithms have the same input and output.
  • The forward algorithm starts from time 1, while the backward algorithm starts from time $T$.
  • The forward probabilities are initialized as $\alpha_1(i) = \Pi_i b_i(o_1)$, while the backward probabilities are initialized as $\beta_T(i) = 1, \; i = 1, 2, ..., N$ (simply set to 1).
  • The forward recursion runs from 1 to $T$, while the backward recursion runs from $T-1$ down to 1.
  • The forward algorithm accumulates step by step going forward, while the backward algorithm works step by step going backward.
  • The final results are the same.
  • The time complexity is the same, $O(TN^2)$ for both.
  • Both use the idea of dynamic programming (each step is computed from the results of the adjacent step).

4.5.4 How should we choose between the forward algorithm and the backward algorithm?

Both the forward algorithm and the backward algorithm can be used to compute the observation sequence probability $P(O \mid \lambda)$. Both have time complexity $O(TN^2)$, so for computing the observation sequence probability they are equally efficient.

However, in some situations the forward and backward algorithms are combined to solve other problems. For example, to compute the probability that the hidden state at time $t$ is $q_i$, given the model $\lambda$ and the observation sequence $O$, we can use the forward probability $\alpha_t(i)$ and the backward probability $\beta_t(i)$:

$$P(i_t = q_i \mid O, \lambda) = \frac{\alpha_t(i)\beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\beta_t(j)}$$

Therefore, the choice between the forward and the backward algorithm depends on the specific problem:

  • if we only need the observation sequence probability, either one is fine
  • if we need to solve other problems, we may need to combine the forward and backward algorithms, as in the sketch below.
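As an illustration of that combination, the following minimal NumPy sketch (not from the original post) computes the posterior state probabilities $\gamma_t(i) = P(i_t = q_i \mid O, \lambda)$ for the box-and-ball example by running the forward and backward recursions together; the function name is mine:

```python
import numpy as np

# Posterior probability of each hidden state at each time, gamma_t(i),
# computed from the forward and backward probabilities of the box-and-ball model.
Pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])

def forward_backward(obs):
    T = len(obs)
    alpha = np.zeros((T, 3))
    beta = np.zeros((T, 3))
    alpha[0] = Pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # gamma_t(i) = alpha*beta / sum_j alpha*beta

print(forward_backward([0, 1, 0]).round(4))   # each row sums to 1: P(state at time t | O, lambda)
```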

Origin blog.csdn.net/weixin_44878336/article/details/131237760