Machine Learning Basics: The Hidden Markov Model (HMM)

Recommended reference: https://juejin.cn/post/6844903891834781703

1. Markov chain

In machine learning, the Markov chain is a very important concept. A Markov chain, also known as a discrete-time Markov chain, is named after the Russian mathematician Andrei Markov (Russian: Андрей Андреевич Марков).


1. Introduction

A Markov chain is a stochastic process that transitions from one state to another within a state space.


  • The process requires a "memoryless" property:

    • The probability distribution of the next state is determined only by the current state; the events that preceded it in the time series are irrelevant. This particular kind of memorylessness is called the Markov property.
  • Markov chains have many applications as statistical models of real-world processes.

  • At each step of a Markov chain, the system can either change from one state to another according to a probability distribution, or remain in its current state.

  • A change of state is called a transition, and the probabilities associated with the different state changes are called transition probabilities.

  • The mathematical representation of the Markov property is:

    • $P(X_{t+1} = x \mid X_1, X_2, ..., X_t) = P(X_{t+1} = x \mid X_t)$

  • Since the transition probability at any moment depends only on the previous state, specifying the transition probability between every pair of states in the system completely determines the Markov chain model.

2. Classic example

The Markov chain in the figure below models the stock market. There are three states: bull market, bear market, and stagnant (sideways) market.

Each state transitions to the next with a certain probability; for example, a bull market turns into a stagnant market with probability 0.025.

(Figure: state transition diagram of the bull market, bear market, and stagnant market states.)

  • This state transition diagram can be expressed as a matrix.
  • If we define the entry P(i, j) of the matrix P as P(j|i), the probability of moving from state i to state j,
  • and define the bull, bear, and stagnant states as 0, 1, and 2 respectively, we obtain the state transition matrix of this Markov chain model:

$P = \begin{pmatrix} 0.9 & 0.075 & 0.025 \\ 0.15 & 0.8 & 0.05 \\ 0.25 & 0.25 & 0.5 \end{pmatrix}$

When the state transition matrix P is determined, the entire stock market model has been determined!
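
As a quick sanity check, here is a minimal Python sketch (using numpy, and assuming the reconstructed matrix values above) that encodes P and propagates a starting distribution forward; repeated multiplication by P converges to the chain's stationary distribution:

import numpy as np

# State transition matrix P (states: 0 = bull, 1 = bear, 2 = stagnant)
P = np.array([
    [0.9,  0.075, 0.025],
    [0.15, 0.8,   0.05],
    [0.25, 0.25,  0.5],
])

# Start in a bull market; the distribution after k steps is dist @ P^k
dist = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    dist = dist @ P

print(dist)  # approaches the stationary distribution, about [0.625, 0.3125, 0.0625]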

3. Summary

  • A Markov chain is:
    • a stochastic process that transitions from one state to another within a state space.
    • The process requires the "memoryless" property:
      • the probability distribution of the next state is determined only by the current state; the events that preceded it in the time series are irrelevant.

2. Introduction to HMM

The Hidden Markov Model (HMM) is a statistical model that describes a Markov process with hidden, unknown parameters.

The difficulty is to determine the hidden parameters of the process from the observable parameters; these parameters are then used for further analysis, such as pattern recognition.

1. Simple case

Let's illustrate with a simple example:

  • Suppose I have three different dice in my hand.
    • The first is an ordinary six-sided die (call it D6): each face (1, 2, 3, 4, 5, 6) appears with probability 1/6.
    • The second is a tetrahedron (call it D4): each face (1, 2, 3, 4) appears with probability 1/4.
    • The third has eight faces (call it D8): each face (1, 2, 3, 4, 5, 6, 7, 8) appears with probability 1/8.


  • We start rolling: first we pick one of the three dice, each with probability 1/3.
  • Then we roll that die and get a number, one of 1, 2, 3, 4, 5, 6, 7, 8. Repeating this process over and over, we get a string of numbers, each of which is one of 1 through 8.
  • For example, rolling the dice 10 times we might get: 1 6 3 5 2 7 3 5 2 4
  • This string of numbers is called the visible state chain.

But in a hidden Markov model, we have not only this visible state chain but also a hidden state chain.

  • In this example, the hidden state chain is the sequence of dice used.
    • For example, the hidden state chain might be: D6 D8 D8 D6 D4 D8 D6 D6 D4 D8

Generally speaking, the Markov chain referred to in an HMM is the hidden state chain, because there are transition probabilities between the hidden states (the dice).

  • In our example, the state after D6 is D4, D6, or D8, each with probability 1/3; likewise, the state after D4 or D8 is D4, D6, or D8, each with transition probability 1/3.
  • This setup makes the example easy to explain at the start, but in fact the transition probabilities can be set arbitrarily.
    • For example, we could specify that D4 can never follow D6, that D6 follows D6 with probability 0.9, and that D8 follows D6 with probability 0.1.
    • That would be a new and different HMM.

Likewise, although there are no transition probabilities between visible states, there is a probability between each hidden state and each visible state, called the emission probability.

  • In our example, the six-sided die D6 emits a 1 with probability 1/6, and likewise emits 2, 3, 4, 5, or 6 with probability 1/6 each.
  • We can also define other emission probabilities. For example, a six-sided die tampered with by the casino might roll a 1 with probability 1/2 and each of 2, 3, 4, 5, 6 with probability 1/10.


In fact, for an HMM, if we know in advance all the transition probabilities between hidden states and all the emission probabilities from hidden states to visible states, simulation is quite easy. But when applying the HMM model, part of this information is often missing:

  • Sometimes you know what kinds of dice there are and what each die is, but not the sequence in which they were thrown;
  • sometimes you only see the results of many rolls and know nothing else.

Estimating this missing information with algorithms becomes a very important problem; these algorithms are described in detail later. As a warm-up, the sketch below simulates the dice HMM when all parameters are known.
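
A minimal simulation sketch in Python, assuming the uniform 1/3 transition probabilities and fair dice described above (the variable names are my own):

import numpy as np

rng = np.random.default_rng(seed=42)

# The three dice and their faces; each face of a die is equally likely (emission).
dice = {
    "D6": [1, 2, 3, 4, 5, 6],
    "D4": [1, 2, 3, 4],
    "D8": [1, 2, 3, 4, 5, 6, 7, 8],
}
names = list(dice)

hidden, visible = [], []
die = rng.choice(names)                # pick the first die uniformly (1/3 each)
for _ in range(10):
    hidden.append(str(die))
    visible.append(int(rng.choice(dice[die])))  # roll the current die (emission)
    die = rng.choice(names)            # transition: next die, 1/3 each

print("hidden state chain :", " ".join(hidden))
print("visible state chain:", " ".join(map(str, visible)))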

2. Three basic questions

  1. Given a model, how to efficiently compute the probability of producing a sequence of observations? In other words, how do you assess how well your model fits your observation series?
  2. Given a model and a sequence of observations, how to find the sequence of states that best matches this sequence of observations? In other words, how can the hidden model state be inferred from a sequence of observations?
  3. Given a sequence of observations, how to adjust the model parameters to maximize the probability of this sequence? In other words, how do you train a model to best describe the observed data?

The first two problems are pattern recognition problems: 1) Obtain the probability (evaluation) of an observable state sequence according to the hidden Markov model; 2) Find a hidden state sequence that maximizes the probability of this sequence producing an observable state sequence (decoding). The third problem is to generate a Hidden Markov Model (learning) from a set of observable state sequences.
Corresponding solutions to the three major problems:

  1. Forward Algorithm, Backward Algorithm
  2. Viterbi Algorithm
  3. Baum-Welch Algorithm (essentially an EM algorithm)

3. HMM model basis

1. What kind of problem requires the HMM model

First, let's look at what kinds of problems can be solved with the HMM model. Such problems generally have two characteristics:

  • 1) Our problem is based on sequences, such as time series, or state sequences.
  • 2) There are two types of data in our problem,
    • One type of sequence data is observable, that is, the observation sequence;
    • The other type of data cannot be observed, that is, the hidden state sequence, referred to as the state sequence.

If a problem has these two features, it can generally be attacked with the HMM model. There are many such problems in real life.

  • For example: I am writing courseware right now. The sequence of characters I type on the keyboard is the observation sequence, while the passage I actually want to write is the hidden state sequence. An input method's task of guessing the passage I mean to write, and putting the most likely words first for me to choose, can be seen as an HMM.

  • As another example, when I lecture in class, the stream of sounds I produce is the observation sequence, and the passage I actually want to express is the hidden state sequence. Your brain's task is to judge, from this stream of sounds, what I most likely mean to say.

From these examples we can see that HMM models are ubiquitous. But the description above is not yet precise; below we express the HMM model in precise mathematical notation.

2. Definition of HMM model

For the HMM model, first we assume that Q is the set of all possible hidden states, and V is the set of all possible observed states, namely:

  • $Q = \{q_1, q_2, ..., q_N\}$

  • $V = \{v_1, v_2, ..., v_M\}$

where N is the number of possible hidden states and M is the number of all possible observed states.

For a sequence of length T, i is the corresponding state sequence, O is the corresponding observation sequence, namely:

  • $i = \{i_1, i_2, ..., i_T\}$
  • $O = \{o_1, o_2, ..., o_T\}$

where any hidden state $i_t \in Q$ and any observed state $o_t \in V$.

The HMM model makes two very important assumptions as follows:

1) Homogeneous Markov chain assumption.

  • That is, the hidden state at any time depends only on the previous hidden state.

  • Of course, this assumption is a bit extreme, because in many cases a hidden state depends not only on the immediately preceding hidden state but possibly on the two or three before it.

  • But the advantage of this assumption is that the model is simple and easy to solve.

  • If the hidden state at time t is $i_t = q_i$ and the hidden state at time t+1 is $i_{t+1} = q_j$, then the HMM state transition probability $a_{ij}$ from time t to time t+1 can be expressed as:

    • $a_{ij} = P(i_{t+1} = q_j \mid i_t = q_i)$

  • The $a_{ij}$ then form the state transition matrix A of the Markov chain:

    • $A = [a_{ij}]_{N \times N}$

2) Observation independence assumption.

  • That is, the observed state at any moment depends only on the hidden state at the current moment; this is also an assumption made to simplify the model.

    • If the hidden state at time t is $i_t = q_j$ and the corresponding observed state is $o_t = v_k$, then the probability of generating observation $v_k$ from hidden state $q_j$ at that moment is $b_j(k)$, satisfying:

      • $b_j(k) = P(o_t = v_k \mid i_t = q_j)$

    • The $b_j(k)$ then form the observation (emission) probability matrix B:

      • $B = [b_j(k)]_{N \times M}$

    • In addition, we need the distribution of the hidden state at time t = 1, the initial state probability distribution $\Pi$:

      • $\Pi = [\Pi_i]_N$, where $\Pi_i = P(i_1 = q_i)$

An HMM model is fully determined by the initial hidden state probability distribution $\Pi$, the state transition probability matrix A, and the observation probability matrix B.

$\Pi$ and A determine the (hidden) state sequence; B determines the observation sequence.

Therefore, an HMM model can be expressed as a triplet $\lambda$:

  • $\lambda = (A, B, \Pi)$

3. An example of an HMM model

Below we use a simple example to describe the HMM model abstracted above. This is a box and ball model.

The example comes from Li Hang's "Statistical Learning Methods".

Suppose we have 3 boxes, each containing red and white balls. The numbers of balls in the three boxes are:

  • Box 1: 5 red, 5 white; Box 2: 4 red, 6 white; Box 3: 7 red, 3 white.

Balls are drawn from the boxes as follows. At the beginning:

  • The probability of drawing a ball from the first box is 0.2,
  • The probability of drawing a ball from the second box is 0.4,
  • The probability of drawing a ball from the third box is 0.4.

After a ball is drawn with these probabilities, its color is recorded and it is put back.

Then we move from the current box to the next box to draw again. The rules are:

  • If the current box is box 1, we stay in box 1 with probability 0.5, move to box 2 with probability 0.2, and move to box 3 with probability 0.3.
  • If the current box is box 2, we stay in box 2 with probability 0.5, move to box 1 with probability 0.3, and move to box 3 with probability 0.2.
  • If the current box is box 3, we stay in box 3 with probability 0.5, move to box 1 with probability 0.2, and move to box 2 with probability 0.3.

This is repeated three times, yielding an observation sequence of ball colors:

  • O={red, white, red}

Note that in this process the observer can only see the sequence of ball colors, not which box each ball was drawn from.

Then according to the definition of our previous HMM model, our observation state set is:

  • V={red, white}, M=2

Our collection of hidden states is:

  • Q={Box 1, Box 2, Box 3}, N=3

And the length of observation sequence and state sequence is 3.

The initial state distribution $\Pi$ is:

  • $\Pi = (0.2, 0.4, 0.4)^T$

The state transition probability matrix A is:

  • $A = \begin{pmatrix} 0.5 & 0.2 & 0.3 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$

The observation probability matrix B is:

  • $B = \begin{pmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \\ 0.7 & 0.3 \end{pmatrix}$

4. Generation of HMM observation sequence

From the above example, we can also abstract the process of HMM observation sequence generation.

  • Input: the HMM model $\lambda = (A, B, \Pi)$ and the observation sequence length T.

  • Output: the observation sequence $O = (o_1, o_2, ..., o_T)$.

The generation process is as follows:

  • 1) Generate the hidden state $i_1$ according to the initial state probability distribution $\Pi$.

  • 2) for t from 1 to T:

    • a. Generate the observed state $o_t$ according to the emission distribution $b_{i_t}(k)$ of the current hidden state $i_t$.
    • b. Generate the next hidden state $i_{t+1}$ according to the transition distribution $a_{i_t, i_{t+1}}$ of $i_t$.

All the $o_t$ together form the observation sequence $O = (o_1, o_2, ..., o_T)$.
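
This generation procedure translates directly into code. Below is a minimal sketch (the function name and structure are my own) using the box-and-ball parameters from the example above:

import numpy as np

def generate_observations(A, B, pi, T, rng=None):
    """Sample an observation sequence of length T from the HMM lambda = (A, B, pi)."""
    rng = rng or np.random.default_rng()
    obs = np.empty(T, dtype=int)
    state = rng.choice(A.shape[0], p=pi)             # 1) draw i_1 from Pi
    for t in range(T):
        obs[t] = rng.choice(B.shape[1], p=B[state])  # 2a) emit o_t from b_{i_t}
        state = rng.choice(A.shape[0], p=A[state])   # 2b) move to i_{t+1} via a_{i_t, i_{t+1}}
    return obs

A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
pi = np.array([0.2, 0.4, 0.4])

print(generate_observations(A, B, pi, T=3))  # e.g. [0 1 0], i.e. red, white, red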

5. Three basic problems of the HMM model

There are three classic problems in the HMM model that need to be solved:

1) Evaluate the observation sequence probability - forward and backward probability calculation

  • That is, given the model $\lambda = (A, B, \Pi)$ and the observation sequence $O = \{o_1, o_2, ..., o_T\}$, compute the probability $P(O|\lambda)$ of observing O under the model $\lambda$.
  • The solution to this problem requires the use of a forward-backward algorithm, which is the simplest of the three problems of the HMM model.

2) Prediction problem, also known as decoding problem - Viterbi algorithm

  • That is, given the model $\lambda = (A, B, \Pi)$ and the observation sequence $O = \{o_1, o_2, ..., o_T\}$, find the state sequence that is most likely to correspond to the given observation sequence.
  • The solution to this problem requires the use of the Viterbi algorithm based on dynamic programming, which is the algorithm with the middle complexity among the three problems of the HMM model.

3) Model parameter learning problem - the Baum-Welch algorithm (hidden states unknown); this is a learning problem

  • That is, given an observation sequence $O = \{o_1, o_2, ..., o_T\}$, estimate the parameters $\lambda = (A, B, \Pi)$ that maximize the conditional probability $P(O|\lambda)$ of the sequence under the model.
  • The solution to this problem requires the use of the Baum-Welch algorithm based on the EM algorithm, which is the most complex of the three problems of the HMM model.

In the next three sections, we will discuss these three questions.

4. Forward and backward algorithm to evaluate the probability of observation sequence

In this section, we will focus on the solution to the first basic problem of HMM, that is, to find the probability of occurrence of the observation sequence given the model and observation sequence.

1. Review HMM problem 1: Find the probability of the observation sequence

First, let's review Question 1 of the HMM model. The problem is this.

We know the parameters of the HMM model, $\lambda = (A, B, \Pi)$,

where A is the hidden state transition probability matrix,

B is the observation probability matrix, and

$\Pi$ is the initial hidden state probability distribution.

We have also obtained the observation sequence $O = \{o_1, o_2, ..., o_T\}$.

Now we want the conditional probability $P(O|\lambda)$ that the observation sequence O occurs under the model $\lambda$.

At first glance the problem seems simple: since we know all the transition probabilities between hidden states and all the emission probabilities from hidden states to observed states, we can solve it by brute force.

We can enumerate all possible hidden sequences of length T, $i = \{i_1, i_2, ..., i_T\}$, compute the joint probability $P(O, i|\lambda)$ of each hidden sequence with the observation sequence $O = \{o_1, o_2, ..., o_T\}$, and then sum to obtain the marginal distribution $P(O|\lambda)$.


The specific brute force solution method is as follows:

  • First, the probability of any single state sequence $i = \{i_1, i_2, ..., i_T\}$ occurring is:

    • $P(i|\lambda) = \Pi_{i_1} a_{i_1 i_2} a_{i_2 i_3} ... a_{i_{T-1} i_T}$
  • For a fixed state sequence $i = \{i_1, i_2, ..., i_T\}$, the probability of the required observation sequence $O = \{o_1, o_2, ..., o_T\}$ is:

    • $P(O|i, \lambda) = b_{i_1}(o_1) b_{i_2}(o_2) ... b_{i_T}(o_T)$
  • Then the joint probability of O and i is:

    • $P(O, i|\lambda) = P(i|\lambda) P(O|i, \lambda) = \Pi_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) ... a_{i_{T-1} i_T} b_{i_T}(o_T)$

  • Summing out i gives the marginal distribution, i.e. the conditional probability of the observation sequence O under the model $\lambda$:

    • $P(O|\lambda) = \sum_{i} P(O, i|\lambda) = \sum_{i_1, i_2, ..., i_T} \Pi_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) ... a_{i_{T-1} i_T} b_{i_T}(o_T)$

Although the above method works, it becomes intractable when the number of hidden states N is large: there are $N^T$ possible hidden state sequences, so the algorithm's time complexity is of order $O(TN^T)$.

Therefore, for models with very few hidden states, the brute-force solution can be used to obtain the probability of the observation sequence; but with many hidden states it is far too slow, and we need a simpler algorithm. A brute-force sketch, for illustration, follows below.
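
Here is a minimal brute-force sketch (my own helper, for illustration) that enumerates all $N^T$ hidden sequences; it reproduces the 0.13022 computed later by the forward algorithm, but its cost grows as $O(TN^T)$:

import numpy as np
from itertools import product

def brute_force_prob(A, B, pi, obs):
    """P(O|lambda) by summing P(O, i|lambda) over all N^T hidden sequences."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for seq in product(range(N), repeat=T):       # every hidden sequence i
        p = pi[seq[0]] * B[seq[0], obs[0]]        # Pi_{i1} * b_{i1}(o1)
        for t in range(1, T):
            p *= A[seq[t-1], seq[t]] * B[seq[t], obs[t]]  # a_{i_{t-1} i_t} * b_{i_t}(o_t)
        total += p
    return total

A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
pi = np.array([0.2, 0.4, 0.4])

print(brute_force_prob(A, B, pi, [0, 1, 0]))  # 0.130218 for O = red, white, red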

The forward-backward algorithm is here to help us solve this problem with a lower time complexity.

2. Use the forward algorithm to find the probability of the HMM observation sequence

The forward and backward algorithm is a general term for the forward algorithm and the backward algorithm. Both algorithms can be used to find the probability of the HMM observation sequence. Let's first look at how the forward algorithm solves this problem.

2.1 Process combing

The forward algorithm is essentially a dynamic programming algorithm: we need to find a recurrence for a local state, so that we can expand step by step from solutions of subproblems to the solution of the whole problem.

  • In the forward algorithm, this local state of the dynamic program is defined through a "forward probability".

  • What is the forward probability? The definition is simple: the forward probability is the probability that, at time t, the hidden state is $q_i$ and the observed sequence so far is $o_1, o_2, ..., o_t$. It is written:

    • $\alpha_t(i) = P(o_1, o_2, ..., o_t, i_t = q_i \mid \lambda)$

  • Since this is dynamic programming, we recurse. Suppose we have found the forward probability of every hidden state at time t; we now recurse the forward probabilities of every hidden state at time t+1.

  • Starting from the forward probability of each hidden state at time t, we multiply by the corresponding state transition probability: $\alpha_t(j) a_{ji}$ is the probability of observing $o_1, o_2, ..., o_t$ at time t, being in hidden state $q_j$ at time t, and being in hidden state $q_i$ at time t+1.

  • Summing over all such terms, $\sum_{j=1}^{N} \alpha_t(j) a_{ji}$ is the probability of observing $o_1, o_2, ..., o_t$ at time t and being in hidden state $q_i$ at time t+1.

  • Going one step further, since the observation $o_{t+1}$ depends only on the hidden state at time t+1, the quantity $\left[\sum_{j=1}^{N} \alpha_t(j) a_{ji}\right] b_i(o_{t+1})$ is the probability of observing $o_1, o_2, ..., o_t, o_{t+1}$ at time t+1 and being in hidden state $q_i$ at time t+1.

  • And this probability is exactly the forward probability of hidden state $q_i$ at time t+1, so we obtain the recurrence:

    $\alpha_{t+1}(i) = \left[\sum_{j=1}^{N} \alpha_t(j) a_{ji}\right] b_i(o_{t+1})$

Our dynamic program starts at time 1 and ends at time T. Since $\alpha_T(i)$ is the probability that the observation sequence at time T is $o_1, o_2, ..., o_T$ and the hidden state at time T is $q_i$, we only need to sum over all hidden states: $\sum_{i=1}^{N} \alpha_T(i)$ gives the probability of the observation sequence $o_1, o_2, ..., o_T$.

2.2 Algorithm Summary

  • Input: the HMM model $\lambda = (A, B, \Pi)$ and the observation sequence $O = (o_1, o_2, ..., o_T)$.

  • Output: the observation sequence probability $P(O|\lambda)$.

    • 1) Compute the forward probability of each hidden state at time 1: $\alpha_1(i) = \Pi_i b_i(o_1), \; i = 1, 2, ..., N$

    • 2) Recurse the forward probabilities for times 2, 3, ..., T: $\alpha_{t+1}(i) = \left[\sum_{j=1}^{N} \alpha_t(j) a_{ji}\right] b_i(o_{t+1}), \; i = 1, 2, ..., N$

    • 3) Compute the final result: $P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$

From the recurrence it can be seen that the time complexity of the algorithm is $O(TN^2)$, many orders of magnitude less than the $O(TN^T)$ of the brute-force solution.
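
A minimal implementation sketch of the forward algorithm (function name my own; the vectorized `alpha[t-1] @ A` computes the sum over j in the recurrence):

import numpy as np

def forward_prob(A, B, pi, obs):
    """Forward algorithm: returns all alpha_t(i) and P(O|lambda) in O(T N^2)."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # alpha_1(i) = Pi_i * b_i(o_1)
    for t in range(1, T):
        # alpha_t(i) = [sum_j alpha_{t-1}(j) * a_{ji}] * b_i(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                   # P(O|lambda) = sum_i alpha_T(i)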

3. HMM forward algorithm solution example

Here we use the previous box and ball example to show the computation of forward probabilities. Our observation set is:

  • V = {red, white}, M = 2

Our state collection is:

  • Q = {box 1, box 2, box 3}, N = 3

And the length of observation sequence and state sequence is 3.

The initial state distribution is:

$\Pi = (0.2, 0.4, 0.4)^T$

The state transition probability distribution matrix is:

$A = \begin{pmatrix} 0.5 & 0.2 & 0.3 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$

The observed state probability matrix is:

$B = \begin{pmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \\ 0.7 & 0.3 \end{pmatrix}$

Sequence of observations for the color of the ball:

O = {red, white, red}


Follow our forward algorithm from the previous section. First calculate the forward probabilities of the three states at time 1:

At time 1 the ball drawn is red.

  • The probability that the hidden state is box 1 is:
    $\alpha_1(1) = \Pi_1 b_1(o_1) = 0.2 \times 0.5 = 0.1$

  • The probability that the hidden state is box 2 is:
    $\alpha_1(2) = \Pi_2 b_2(o_1) = 0.4 \times 0.4 = 0.16$

  • The probability that the hidden state is box 3 is:
    $\alpha_1(3) = \Pi_3 b_3(o_1) = 0.4 \times 0.7 = 0.28$


Now we can start to recurse, first recurse the forward probabilities of the three states at time 2:

At time 2 the ball drawn is white.

  • The probability that the hidden state is box 1 is:
    $\alpha_2(1) = \left[\sum_{i=1}^{3} \alpha_1(i) a_{i1}\right] b_1(o_2) = (0.1 \times 0.5 + 0.16 \times 0.3 + 0.28 \times 0.2) \times 0.5 = 0.077$

  • The probability that the hidden state is box 2 is:
    $\alpha_2(2) = \left[\sum_{i=1}^{3} \alpha_1(i) a_{i2}\right] b_2(o_2) = (0.1 \times 0.2 + 0.16 \times 0.5 + 0.28 \times 0.3) \times 0.6 = 0.1104$

  • The probability that the hidden state is box 3 is:
    $\alpha_2(3) = \left[\sum_{i=1}^{3} \alpha_1(i) a_{i3}\right] b_3(o_2) = (0.1 \times 0.3 + 0.16 \times 0.2 + 0.28 \times 0.5) \times 0.3 = 0.0606$


Continuing to recurse, now we recurse the forward probabilities of the three states at time 3:

At time 3 the ball drawn is red.

  • The probability that the hidden state is box 1 is:
    $\alpha_3(1) = \left[\sum_{i=1}^{3} \alpha_2(i) a_{i1}\right] b_1(o_3) = (0.077 \times 0.5 + 0.1104 \times 0.3 + 0.0606 \times 0.2) \times 0.5 = 0.04187$

  • The probability that the hidden state is box 2 is:
    $\alpha_3(2) = \left[\sum_{i=1}^{3} \alpha_2(i) a_{i2}\right] b_2(o_3) = (0.077 \times 0.2 + 0.1104 \times 0.5 + 0.0606 \times 0.3) \times 0.4 = 0.035512$

  • The probability that the hidden state is box 3 is:
    $\alpha_3(3) = \left[\sum_{i=1}^{3} \alpha_2(i) a_{i3}\right] b_3(o_3) = (0.077 \times 0.3 + 0.1104 \times 0.2 + 0.0606 \times 0.5) \times 0.7 = 0.052836$

Finally, the probability of the observation sequence O = {red, white, red} is:

$P(O|\lambda) = \sum_{i=1}^{3} \alpha_3(i) = 0.04187 + 0.035512 + 0.052836 \approx 0.13022$
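
Running the forward_prob sketch from the previous subsection on these parameters reproduces the hand calculation:

A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
pi = np.array([0.2, 0.4, 0.4])

alpha, prob = forward_prob(A, B, pi, [0, 1, 0])  # 0 = red, 1 = white
print(alpha)  # rows match alpha_1, alpha_2, alpha_3 computed by hand above
print(prob)   # 0.130218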

The backward algorithm works on the same principle, recursing from time T back to time 1; a minimal sketch is given below for reference.
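
This backward sketch follows the same conventions as forward_prob above (and reuses the A, B, pi arrays just defined): it initializes $\beta_T(i) = 1$, recurses from time T-1 down to 1, and yields the same $P(O|\lambda)$:

def backward_prob(A, B, pi, obs):
    """Backward algorithm: beta_t(i) = P(o_{t+1}, ..., o_T | i_t = q_i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                          # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()   # P(O|lambda)

print(backward_prob(A, B, pi, [0, 1, 0])[1])  # 0.130218 again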

5. Viterbi Algorithm Decoding Hidden State Sequence

Learning objective: understand how the Viterbi algorithm decodes the hidden state sequence.

In this section we discuss decoding the hidden state sequence with the Viterbi algorithm: given the model and an observation sequence, find the most likely corresponding hidden state sequence under the given observation sequence.

The most commonly used algorithm for the decoding problem of the HMM model is the Viterbi algorithm. Of course, there are other algorithms that can solve this problem.

At the same time, the Viterbi algorithm is a general dynamic programming algorithm for finding the shortest path of a sequence, and it can also be used for many other problems.

1. Overview of solving the most probable hidden state sequence of HMM

The decoding problem of the HMM model is:

  • Given the model $\lambda = (A, B, \Pi)$ and the observation sequence $O = \{o_1, o_2, ..., o_T\}$, find the most likely corresponding state sequence $I^* = \{i_1^*, i_2^*, ..., i_T^*\}$, i.e. the one that maximizes $P(I^*|O)$.

A possible approximate solution is to pick, at each time t, the hidden state $i_t^*$ that is individually most likely given the observation sequence O, and assemble these into an approximate hidden state sequence $I^* = \{i_1^*, i_2^*, ..., i_T^*\}$. This approximation is easy to compute using the quantities defined for the forward-backward algorithm:

  • Given the model $\lambda$ and the observation sequence O, the probability of being in state $q_i$ at time t is $\gamma_t(i) = P(i_t = q_i \mid O, \lambda)$, which can be computed with the HMM forward and backward algorithms. We then take:

    $i_t^* = \arg\max_{1 \le i \le N} \gamma_t(i), \quad t = 1, 2, ..., T$

The approximation algorithm is very simple, but it cannot guarantee that the predicted state sequence as a whole is the most likely state sequence, because some adjacent hidden states in the predicted state sequence may have a transition probability of 0.

The Viterbi algorithm can consider the state sequence of the HMM as a whole to avoid the problem of the approximation algorithm. Let's take a look at the method of the Viterbi algorithm for HMM decoding.

2. Overview of Viterbi Algorithm

The Viterbi algorithm is a general decoding algorithm, which is a method for finding the shortest path of a sequence based on dynamic programming.

Since it is a dynamic programming algorithm, it is necessary to find a suitable local state and a recursive formula for the local state. In HMM, the Viterbi algorithm defines two local states for recursion.

1) The first local state is the maximum probability over all single-path state sequences $(i_1, i_2, ..., i_{t-1}, i)$ that end in hidden state i at time t.

  • It is written $\delta_t(i)$:

    $\delta_t(i) = \max_{i_1, i_2, ..., i_{t-1}} P(i_t = i, i_1, i_2, ..., i_{t-1}, o_t, o_{t-1}, ..., o_1 \mid \lambda), \quad i = 1, 2, ..., N$

From the definition of $\delta_t(i)$ we obtain the recurrence for $\delta$:

  $\delta_{t+1}(i) = \max_{1 \le j \le N} \left[\delta_t(j) a_{ji}\right] b_i(o_{t+1})$

2) The second local state is obtained recursively from the first.

  • We define $\psi_t(i)$ as the hidden state at position t-1 of the single-path transition sequence $(i_1, i_2, ..., i_{t-1}, i)$ of highest probability among all paths ending in hidden state i at time t.
  • Its recursive expression is:

    $\psi_t(i) = \arg\max_{1 \le j \le N} \left[\delta_{t-1}(j) a_{ji}\right]$

With these two local states we can recurse from time 1 to time T, and then backtrack using the most likely predecessor nodes recorded in $\psi_t(i)$ until the optimal hidden state sequence is found.

3. Summary of Viterbi algorithm process

Now let's summarize the process of the Viterbi algorithm:

  • Input: the HMM model $\lambda = (A, B, \Pi)$ and the observation sequence $O = (o_1, o_2, ..., o_T)$.

  • Output: the most likely hidden state sequence $I^* = \{i_1^*, i_2^*, ..., i_T^*\}$.

The process is as follows:

  • 1) Initialize the local states:

    $\delta_1(i) = \Pi_i b_i(o_1), \quad \psi_1(i) = 0, \quad i = 1, 2, ..., N$

  • 2) Recurse the local states forward for times $t = 2, 3, ..., T$:

    $\delta_t(i) = \max_{1 \le j \le N} \left[\delta_{t-1}(j) a_{ji}\right] b_i(o_t), \quad \psi_t(i) = \arg\max_{1 \le j \le N} \left[\delta_{t-1}(j) a_{ji}\right]$

  • 3) At time T, the largest $\delta_T(i)$ is the probability of the most likely hidden state sequence, and the state that attains it is the most likely hidden state at time T:

    $P^* = \max_{1 \le i \le N} \delta_T(i), \quad i_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$

  • 4) Backtrack using the local states $\psi_t(i)$: for $t = T-1, T-2, ..., 1$,

    $i_t^* = \psi_{t+1}(i_{t+1}^*)$

Finally, we obtain the most likely hidden state sequence $I^* = \{i_1^*, i_2^*, ..., i_T^*\}$.
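
A minimal Viterbi sketch following the process above (function name my own; states are 0-indexed internally):

import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely hidden state sequence and its probability P*."""
    T, N = len(obs), A.shape[0]
    delta = np.zeros((T, N))            # delta_t(i): best single-path probability
    psi = np.zeros((T, N), dtype=int)   # psi_t(i): best predecessor state
    delta[0] = pi * B[:, obs[0]]        # 1) delta_1(i) = Pi_i * b_i(o_1)
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A            # trans[j, i] = delta_{t-1}(j) * a_{ji}
        psi[t] = trans.argmax(axis=0)                # 2) psi_t(i)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]  # 2) delta_t(i)
    path = [int(delta[-1].argmax())]                 # 3) i_T* = argmax_i delta_T(i)
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))           # 4) backtrack: i_t* = psi_{t+1}(i_{t+1}*)
    return path[::-1], delta[-1].max()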

4. HMM Viterbi Algorithm Solution Example

Below we still use the example of boxes and balls to look at the solution of the HMM Viterbi algorithm. Our observation set is:

  • V = {red, white}, M = 2

Our state collection is:

  • Q = {box 1, box 2, box 3}, N = 3

And the length of observation sequence and state sequence is 3.

The initial state distribution is:

$\Pi = (0.2, 0.4, 0.4)^T$

The state transition probability distribution matrix is:

$A = \begin{pmatrix} 0.5 & 0.2 & 0.3 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$

The observed state probability matrix is:

$B = \begin{pmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \\ 0.7 & 0.3 \end{pmatrix}$

Sequence of observations for the color of the ball:

O = {red, white, red}

Following the Viterbi algorithm above, we first compute the two local states for the three hidden states at time 1, where the observation is red:

$\delta_1(1) = \Pi_1 b_1(o_1) = 0.2 \times 0.5 = 0.1, \quad \delta_1(2) = 0.4 \times 0.4 = 0.16, \quad \delta_1(3) = 0.4 \times 0.7 = 0.28, \quad \psi_1(i) = 0$

Now we recurse the two local states for the three hidden states at time 2, where the observation is white:

$\delta_2(1) = \max_j [\delta_1(j) a_{j1}] \, b_1(o_2) = \max(0.1 \times 0.5, 0.16 \times 0.3, 0.28 \times 0.2) \times 0.5 = 0.028, \quad \psi_2(1) = 3$
$\delta_2(2) = \max(0.1 \times 0.2, 0.16 \times 0.5, 0.28 \times 0.3) \times 0.6 = 0.0504, \quad \psi_2(2) = 3$
$\delta_2(3) = \max(0.1 \times 0.3, 0.16 \times 0.2, 0.28 \times 0.5) \times 0.3 = 0.042, \quad \psi_2(3) = 3$

Continuing the recursion to the three hidden states at time 3, where the observation is red:

$\delta_3(1) = \max(0.028 \times 0.5, 0.0504 \times 0.3, 0.042 \times 0.2) \times 0.5 = 0.00756, \quad \psi_3(1) = 2$
$\delta_3(2) = \max(0.028 \times 0.2, 0.0504 \times 0.5, 0.042 \times 0.3) \times 0.4 = 0.01008, \quad \psi_3(2) = 2$
$\delta_3(3) = \max(0.028 \times 0.3, 0.0504 \times 0.2, 0.042 \times 0.5) \times 0.7 = 0.0147, \quad \psi_3(3) = 3$
This is the last time step, so we prepare to backtrack. The largest probability at time 3 is $\delta_3(3) = 0.0147$, so $i_3^* = 3$.

Since $\psi_3(3) = 3$, we get $i_2^* = 3$; and since $\psi_2(3) = 3$, we get $i_1^* = 3$. Thus the final most likely hidden state sequence is (3, 3, 3).
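
The viterbi sketch above reproduces this result on the box-and-ball parameters (0-indexed states, so box i prints as i once we add 1):

A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
pi = np.array([0.2, 0.4, 0.4])

path, p_star = viterbi(A, B, pi, [0, 1, 0])   # O = red, white, red
print([i + 1 for i in path], p_star)          # [3, 3, 3] 0.0147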

6. Introduction to Baum-Welch Algorithm

1 Introduction

Model parameter learning problem: the Baum-Welch algorithm (hidden states unknown).

  • That is, given an observation sequence $O = \{o_1, o_2, ..., o_T\}$, estimate the parameters $\lambda = (A, B, \Pi)$ that maximize the conditional probability $P(O|\lambda)$ of the sequence under the model.
  • The most commonly used solution is the Baum-Welch algorithm, which is in fact based on the EM algorithm. But at the time the Baum-Welch algorithm appeared, the EM algorithm had not yet been abstracted as such, which is why it carries its own name.


2. Baum-Welch algorithm principle

Since the Baum-Welch algorithm is based on the principle of the EM algorithm:

  • in the E step we compute the expectation of $\log P(O, I|\lambda)$, the log of the joint distribution, under the conditional probability $P(I|O, \overline{\lambda})$, where $\overline{\lambda}$ is the current model parameter;
  • in the M step we maximize this expectation to obtain the updated model parameter $\lambda$.

EM iterations are then performed repeatedly until the values of the model parameters converge.


Let us look at the E step first. With current parameters $\overline{\lambda}$, the expectation of $\log P(O, I|\lambda)$ under the conditional probability $P(I|O, \overline{\lambda})$ is:

  • $L(\lambda, \overline{\lambda}) = \sum_{I} P(I|O, \overline{\lambda}) \log P(O, I|\lambda)$

In the M step, we maximize this expression to obtain the updated model parameters:

  • $\overline{\lambda} = \arg\max_{\lambda} \sum_{I} P(I|O, \overline{\lambda}) \log P(O, I|\lambda)$

E steps and M steps are iterated in this way until $\overline{\lambda}$ converges.
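
A minimal sketch of one Baum-Welch iteration for a single observation sequence (my own helper, under the conventions used earlier): the E step computes the posteriors $\gamma$ and $\xi$ from the forward-backward variables, and the M step re-estimates $\Pi$, A, B from the expected counts:

import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One E step + M step; iterate until the parameters converge."""
    obs = np.asarray(obs)
    T, N = len(obs), A.shape[0]
    # forward and backward variables under the current parameters (lambda-bar)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                       # P(O | lambda-bar)
    # E step: gamma_t(i) = P(i_t = q_i | O), xi_t(i, j) = P(i_t = q_i, i_{t+1} = q_j | O)
    gamma = alpha * beta / likelihood
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
                   for t in range(T - 1)]) / likelihood
    # M step: re-estimate the parameters from the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi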

7. HMM model API introduction

1. API installation:

Official website link: https://hmmlearn.readthedocs.io/en/latest/

pip3 install hmmlearn

2. Introduction to hmmlearn

hmmlearn implements three HMM model classes, which can be divided into two categories according to whether the observation state is continuous or discrete.

GaussianHMM and GMMHMM are HMM models of continuous observation state, while MultinomialHMM is a model of discrete observation state, which is also the model we use in the HMM principle series.

Here we mainly introduce the MultinomialHMM model about the discrete state that we have been talking about before.

For the MultinomialHMM model, it is relatively simple to use, and there are several commonly used parameters:

  • The "startprob_" parameter corresponds to our initial hidden state distribution $\Pi$,
  • "transmat_" corresponds to our state transition matrix A,
  • "emissionprob_" corresponds to our observation probability matrix B.

3. Multinomial HMM instance

Let's run through the MultinomialHMM with the example we talked about earlier about the ball.

import numpy as np
from hmmlearn import hmm

# Set of hidden states
states = ["box 1", "box 2", "box 3"]
n_states = len(states)

# Set of observation states
observations = ["red", "white"]
n_observations = len(observations)

# Initial state distribution
start_probability = np.array([0.2, 0.4, 0.4])

# State transition probability matrix
transition_probability = np.array([
  [0.5, 0.2, 0.3],
  [0.3, 0.5, 0.2],
  [0.2, 0.3, 0.5]
])

# Observation (emission) probability matrix
emission_probability = np.array([
  [0.5, 0.5],
  [0.4, 0.6],
  [0.7, 0.3]
])

# Build the model with the parameters above
model = hmm.MultinomialHMM(n_components=n_states)
model.startprob_ = start_probability        # initial state distribution
model.transmat_ = transition_probability    # state transition probability matrix
model.emissionprob_ = emission_probability  # observation probability matrix

Now let's run Viterbi decoding for the HMM decoding problem, using the same observation sequence as before. The code is as follows:

seen = np.array([[0, 1, 0]]).T  # the observation sequence: red, white, red
box = model.predict(seen)

# Note: flatten() turns seen from 2-D back into 1-D
print("The ball observation sequence:\n", ", ".join(map(lambda x: observations[x], seen.flatten())))
print("The most likely hidden state sequence:\n", ", ".join(map(lambda x: states[x], box)))

Next, let's look at HMM problem 1, computing the probability of the observation sequence. The code is as follows:

print(model.score(seen))
# output: -2.03854530992

Note that the score function returns the log probability, using the natural logarithm. Our manual calculation in HMM problem 1 gave the raw probability 0.13022, without the logarithm. Compare:

import math

math.exp(-2.038545309915233)
# ln(0.13022) ≈ -2.0385
# output: 0.13021800000000003
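
Finally, hmmlearn can also solve the learning problem for us: fit runs Baum-Welch internally. Below is a minimal sketch with made-up training sequences; note that recent hmmlearn releases renamed the discrete model CategoricalHMM, so substitute that class if your version's MultinomialHMM behaves differently:

import numpy as np
from hmmlearn import hmm

# Three made-up observation sequences (0 = red, 1 = white), concatenated;
# `lengths` tells fit() where each sequence ends.
X = np.array([[0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1]]).T
lengths = [4, 4, 4]

model = hmm.MultinomialHMM(n_components=3, n_iter=100, tol=1e-4)
model.fit(X, lengths)   # Baum-Welch (EM) iterations

print(model.startprob_)     # learned Pi
print(model.transmat_)      # learned A
print(model.emissionprob_)  # learned B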
