【Basic Theory】Hidden Markov Model and Its Algorithm

1. Description

        According to L. R. Rabiner et al. [1], a Hidden Markov Model is a doubly embedded stochastic process: the underlying stochastic process is not directly observable (it is hidden) and can only be observed through another set of stochastic processes that produce the sequence of observations.

        Basically, a Hidden Markov Model (HMM) is a model in which we observe a sequence of emissions but do not know the sequence of states the model passed through to generate those emissions. We analyze Hidden Markov Models to recover the state sequence from the observed data. Sounds confusing...

2. Markov process

2.1 Markov model and Markov chain

        To understand HMMs, we first need to understand what a Markov model is. A Markov model is a stochastic model used to describe a system whose state changes randomly over time, under the assumption that the future state depends only on the current state and not on the states that preceded it. The simplest Markov model is the Markov chain, in which the next state depends only on the current state.

Markov chain

2.2  What are HMMs and Markov chains...

        Rob participates in a game in which a prize is awarded whenever a dart hits the bull's-eye. The prizes are kept in 3 different boxes (red, green and blue) behind a screen. The person handing out the prizes chooses a box depending on whether he was stuck in traffic that morning. These traffic conditions are called states. Since Rob knows the general trend of traffic in the area, he can model it as a Markov chain. But he has no definite information about the actual traffic conditions, since he cannot observe them directly: they are hidden from him. He does know that the prize will be picked from the green, red or blue box. These are called observations. Since he can only see the observations and not the states that produced them, the system is that of an HMM.

Hidden Markov Model

2.2.1 Transition probability

        Transition probability is the probability of moving from one state to another. In this case, if we have traffic today, there is a 55% chance of traffic tomorrow and a 45% chance of no traffic tomorrow. Likewise, if we have no traffic today, there is a 35% chance of no traffic tomorrow, and a 65% chance of traffic tomorrow.
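As a quick sanity check, the two traffic states and the numbers above can be written as a 2×2 transition matrix; the state ordering below is my own choice for illustration.

```python
import numpy as np

# Transition matrix for the traffic example above.
# Row/column order: 0 = traffic, 1 = no traffic (ordering chosen here for illustration).
A = np.array([
    [0.55, 0.45],  # traffic today    -> P(traffic), P(no traffic) tomorrow
    [0.65, 0.35],  # no traffic today -> P(traffic), P(no traffic) tomorrow
])

# Each row is a probability distribution over tomorrow's state, so it sums to 1.
assert np.allclose(A.sum(axis=1), 1.0)
```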

2.2.2 Emission probability

        Emission probabilities are the probabilities of the outputs that can actually be seen by an observer. The probability that connects a hidden state to an observed value is the emission probability.

        Representation of an HMM in Python
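The original post showed a code snippet here; as a minimal sketch, the traffic/prize HMM could be represented with plain numpy arrays as below. The emission probabilities and the initial distribution are made-up illustrative values, since the corresponding figure is not reproduced here.

```python
import numpy as np

# Hidden states and observations for the traffic/prize example.
states = ["traffic", "no traffic"]        # hidden
observations = ["red", "green", "blue"]   # observed prize boxes

# Transition matrix A (rows: today's state, columns: tomorrow's state), from the numbers above.
A = np.array([
    [0.55, 0.45],
    [0.65, 0.35],
])

# Emission matrix B (rows: hidden state, columns: observation).
# These values are purely illustrative assumptions.
B = np.array([
    [0.5, 0.3, 0.2],   # P(red | traffic), P(green | traffic), P(blue | traffic)
    [0.1, 0.4, 0.5],   # P(red | no traffic), P(green | no traffic), P(blue | no traffic)
])

# Initial state distribution pi (also an illustrative assumption).
pi = np.array([0.6, 0.4])
```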

3. The basic structure of HMM 

        As we discussed earlier, a Hidden Markov Model has the following parameters:

  • a set of N hidden states
  • a transition probability matrix (A)
  • a sequence of T observations (O)
  • an emission probability matrix, also known as observation likelihoods (B)
  • an initial probability distribution over the states (π)
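In the usual (Rabiner-style) notation, these parameters are collected into a single model λ:

\lambda = (A, B, \pi), \qquad a_{ij} = P(q_{t+1} = j \mid q_t = i), \qquad b_j(k) = P(o_t = v_k \mid q_t = j), \qquad \pi_i = P(q_1 = i)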

3.1 The core problem of HMM

  •         Evaluation

        The first problem is to find the probability of an observed sequence given the model - the Forward algorithm.

  •         Decoding

        The second problem is to find the most probable sequence of hidden states - the Viterbi algorithm (the Forward-Backward algorithm gives per-state posterior decoding).

  •         Learning

        The third problem is to find the model parameters that maximize the likelihood of the observed data - the Baum-Welch algorithm.
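In the notation above, with O the observation sequence and Q a hidden-state sequence, the three problems can be summarized as:

\text{Evaluation: } P(O \mid \lambda), \qquad \text{Decoding: } \hat{Q} = \arg\max_{Q} P(Q \mid O, \lambda), \qquad \text{Learning: } \hat{\lambda} = \arg\max_{\lambda} P(O \mid \lambda)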

3.2 Assumptions about HMM

        Markov assumption

        Since HMMs are an extension of Markov models, the following assumption holds: the future state depends only on the current state.

It is represented as follows:

3.2.1 Markov assumption
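With q_t denoting the hidden state at time step t, the assumption is usually written as:

P(q_t \mid q_1, q_2, \dots, q_{t-1}) = P(q_t \mid q_{t-1})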

        Output independence

        The probability of emitting an observation o_t depends only on the state that produced that observation, not on any other state or any other observation. It is represented as follows:

3.2.2 Output independence
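With o_t the observation emitted at time step t and q_t the state that produced it:

P(o_t \mid q_1, \dots, q_T, o_1, \dots, o_T) = P(o_t \mid q_t)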

        Forward algorithm

        Consider a simple HMM: two hidden states represent the weather in a town, and a child carries one of two items, a hat or an umbrella, depending on the weather. The relationship between the child's items and the weather is shown by the orange and blue arrows, and the black arrows indicate the state transitions.

Image source: author

Suppose we know the following sequence of items

a sequence

We use the forward algorithm to find the probability of observing a given sequence when the parameters of the HMM are known, namely the transition matrix, the emission matrix and the initial state distribution (π).

Sequences with latent hidden states

        We list all the possible hidden-state sequences that could have produced this observation sequence and compute the cumulative probability of each one; with two hidden states there are 2 to the power of the sequence length such sequences, 8 in total here. Calculating the probability of every sequence separately would be tedious: computing the joint probability directly requires marginalizing over all possible state sequences, whose number grows exponentially with the sequence length. The forward algorithm instead exploits the conditional independence structure of the Hidden Markov Model to perform the computation recursively.

sequence probability

The probability of the observed sequence is the sum, over all possible hidden-state sequences, of the joint probability of that hidden sequence and the observations.
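In symbols, with Q ranging over all hidden-state sequences and λ denoting the model parameters:

P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda)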

This is expressed recursively as follows, where t indexes the time step (up to the sequence length T) and s denotes a hidden state.

Recursive Expressions for Probability
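With α_t(s) denoting the probability of having observed o_1, ..., o_t and being in state s at time t, the recursion is usually written as:

\alpha_t(s) = b_s(o_t) \sum_{s'} \alpha_{t-1}(s')\, a_{s' s}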

For example, when t = 1 the recursion reduces to the initial distribution.
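With π_s the initial probability of state s and b_s(o_1) the probability of emitting the first observation from s:

\alpha_1(s) = \pi_s\, b_s(o_1)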

Now, to get the final answer, we compute α_T for each hidden state and add the values up. It is represented as follows:

The final expression of the forward algorithm
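With T the length of the observation sequence:

P(O \mid \lambda) = \sum_{s} \alpha_T(s)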

In the forward algorithm, the probabilities already calculated at the current time step are reused to derive the probabilities at the next time step, which makes it computationally much more efficient.

The forward algorithm is mainly used in applications where we need to determine the probability of being in a particular state given a known sequence of observations. We first compute the state probabilities for the previous observation, use them for the current observation, and then extend the computation to the next step using the transition probability table. Because this method caches all intermediate state probabilities, each one is computed only once. The same cached quantities also help us compute the posterior probability of the states along the path; this process is known as posterior decoding.
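As a concrete illustration, here is a minimal sketch of the forward algorithm in the same numpy style as the earlier representation; the parameter values and the observation sequence are the illustrative ones assumed above, not values taken from the original figures.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return P(obs | model) using the forward algorithm.

    A:   (N, N) transition matrix, A[i, j] = P(next state j | current state i)
    B:   (N, M) emission matrix,   B[i, k] = P(observation k | state i)
    pi:  (N,)   initial state distribution
    obs: sequence of observation indices (each in 0..M-1)
    """
    # Initialization: alpha_1(s) = pi_s * b_s(o_1)
    alpha = pi * B[:, obs[0]]
    # Recursion: alpha_t(s) = b_s(o_t) * sum_{s'} alpha_{t-1}(s') * a_{s',s}
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    # Termination: P(O | model) = sum_s alpha_T(s)
    return alpha.sum()

# Illustrative parameters (assumed, not taken from the figures).
A = np.array([[0.55, 0.45], [0.65, 0.35]])
B = np.array([[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]])
pi = np.array([0.6, 0.4])
print(forward(A, B, pi, [1, 0, 2]))   # probability of observing green, red, blue
```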

3.3 Backward Algorithm

        The backward algorithm is the time-reversed counterpart of the forward algorithm. In the backward algorithm, we compute the probability that, given that the machine is in a particular hidden state at time step t, it generates the rest of the observation sequence. Mathematically, it is expressed as:
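In the notation used above, with s a hidden state and λ the model, the backward variable and its recursion are usually written as:

\beta_t(s) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = s, \lambda), \qquad \beta_T(s) = 1, \qquad \beta_t(s) = \sum_{s'} a_{s s'}\, b_{s'}(o_{t+1})\, \beta_{t+1}(s')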

3.4 Forward-Backward Algorithm

        The forward-backward algorithm is an inference algorithm that computes the posterior distributions of all the hidden state variables; this inference task is often called smoothing. The algorithm uses dynamic programming to efficiently compute, in two passes, the values needed to obtain the posterior marginal distributions: the first pass goes forward in time, while the second goes backward in time, hence the name forward-backward algorithm. It is a combination of the forward and backward algorithms explained above. The algorithm also yields the probability of an observed sequence given an HMM, and this probability can be used to classify sequences of observations in recognition applications.
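A minimal sketch of the smoothing computation, reusing the numpy-style parameters from the earlier examples; the function name and the parameter values are assumptions made here for illustration.

```python
import numpy as np

def smooth(A, B, pi, obs):
    """Return a (T, N) array of posteriors P(state at time t | whole observation sequence)."""
    N, T = A.shape[0], len(obs)

    # Forward pass: alpha[t, s] = P(obs[:t+1], state s at time t)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward pass: beta[t, s] = P(obs[t+1:] | state s at time t)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # Smoothing: normalize alpha * beta at each time step.
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

A = np.array([[0.55, 0.45], [0.65, 0.35]])
B = np.array([[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]])
pi = np.array([0.6, 0.4])
print(smooth(A, B, pi, [1, 0, 2]))
```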

4. Viterbi Algorithm

        For HMMs, which contain hidden variables, the task of determining which sequence of hidden variables most likely underlies a given sequence of observations is called the decoding task. The job of the decoder is to find the best hidden-state sequence given the trained model and some observed data. More formally, given as input an HMM specified by A and B and a sequence of observations O = o_1, o_2, ..., o_T, find the most probable sequence of states S = s_1, s_2, ..., s_T.

        We could, of course, enumerate all possible state sequences, score each one, and choose the sequence with the maximum likelihood, but this cannot be done in practice because the number of state sequences grows exponentially with the sequence length.

        Like the forward algorithm, the Viterbi algorithm computes a value v_t(j): the probability of having seen the first t observations and having passed through the most probable sequence of states s_1, s_2, ..., s_{t-1}, ending in state j. The value of each cell v_t(j) is computed recursively by taking the most probable path that could lead to that cell. Formally, each cell represents the following probability:
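In symbols:

v_t(j) = \max_{s_1, \dots, s_{t-1}} P(s_1, \dots, s_{t-1},\, o_1, o_2, \dots, o_t,\, q_t = j \mid \lambda)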

        We can understand this better with an example.

5. HMM model

        Suppose we have an HMM model as shown in the figure, with states sun (x), cloud (y) and snow (z); every state sequence starts at sun and ends at snow. As in the forward-algorithm problem, we have observed a sequence. What is the most likely state sequence that generated it?

        Options include:

Manual calculation

In this case, the path shown in bold is the Viterbi path. You can see this when you trace back from the later state: 0.3*0.4*0.136 > 0.7*0.4*0.024

Similarly, 0.4*0.5*0.4 > 0.7*0.4*0.2

Therefore, the most likely sequence is xxyz.

Note that there can be multiple possible optimal sequences.

Since each state depends only on the previous state, you can build the most probable path step by step. At each step, you calculate the probability of the best path ending in state x, state y and state z; after that, it does not matter how you got there.
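The same step-by-step idea can be sketched in numpy; this is a generic illustration using the assumed parameters from the earlier examples, not the exact numbers from the figure.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return (most probable state sequence, its probability) for the observation sequence."""
    N, T = A.shape[0], len(obs)
    v = np.zeros((T, N))          # v[t, j]: probability of the best path ending in state j at time t
    back = np.zeros((T, N), int)  # back[t, j]: best predecessor of state j at time t

    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A      # scores[i, j] = v[t-1, i] * a_ij
        back[t] = scores.argmax(axis=0)     # keep only the best predecessor for each state j
        v[t] = scores.max(axis=0) * B[:, obs[t]]

    # Trace back from the best final state.
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(v[-1].max())

A = np.array([[0.55, 0.45], [0.65, 0.35]])
B = np.array([[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [1, 0, 2]))
```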

6. Baum-Welch Algorithm

        The algorithm answers the question: "Under what parameterization is the observed sequence most likely?"

        The Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm. It uses the forward-backward algorithm to compute the statistics needed for the expectation step. Its purpose is to tune the parameters of the HMM, namely the state transition matrix, the emission matrix and the initial state distribution.

  • Start with random initial estimates of the transition and emission matrices.
  • Calculate the expected frequency with which each transition and emission is used, i.e. estimate the latent variables ξ and γ.
  • Re-estimate the probabilities in the transition and emission matrices from these expected counts.
  • Repeat until convergence.

Expression of the latent variable γ

γ_t(i) is the probability of being in state i at time t, given the observed sequence Y and the parameters θ.
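In terms of the forward and backward variables α and β defined earlier:

\gamma_t(i) = P(X_t = i \mid Y, \theta) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j} \alpha_t(j)\, \beta_t(j)}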

The expression for the latent variable ξ

ξ_t(i, j) is the probability of being in state i at time t and in state j at time t+1, given the observed sequence Y and the parameters θ.
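In terms of the forward and backward variables, with a_{ij} the transition probability and b_j(y_{t+1}) the probability of emitting the next observation:

\xi_t(i, j) = P(X_t = i, X_{t+1} = j \mid Y, \theta) = \frac{\alpha_t(i)\, a_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)}{\sum_{k} \sum_{l} \alpha_t(k)\, a_{kl}\, b_l(y_{t+1})\, \beta_{t+1}(l)}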

Parameters (A, B) are updated as follows

Update of A, the transition matrix
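In the standard form, the expected counts are summed over time steps:

a_{ij}^{*} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}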

Update of B, the emission matrix
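With 1[y_t = v_k] equal to 1 when the observation at time t is the symbol v_k and 0 otherwise:

b_i^{*}(v_k) = \frac{\sum_{t=1}^{T} \mathbf{1}[y_t = v_k]\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}

The initial state distribution is re-estimated as \pi_i^{*} = \gamma_1(i).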

7. Advantages of HMM

  • HMM taggers used in NLP are very simple to train (they only require compiling counts from the training corpus).
  • Performance is relatively good (over 90% accuracy on named entities).
  • They do not suffer from the label bias problem.
  • Since each HMM uses only positive data, they scale well: new words can be added without affecting the already learned HMMs.
  • We can initialize the model close to something that is considered correct.

8. Disadvantages of HMM

  • In order to define the joint probability of the observation and label sequences, an HMM needs to enumerate all possible observation sequences.
  • Representing multiple overlapping features and long-term dependencies is impractical.
  • The number of parameters to estimate is huge, so a large dataset is required for training.
  • Training an HMM maximizes the probability of the observations for examples of its own class, but it does not minimize the probability of observing instances of other classes, because it uses only positive data.
  • It relies on the Markov assumption, which does not map well to many real-world domains.

9. Where is Hidden Markov Model used?

  • Analysis of biological sequences, especially DNA [2]
  • Time series analysis
  • Handwriting recognition
  • Speech recognition [1]

If you want to read more on the topic, here are some links:

  1. https://youtube.com/playlist?list=PLM8wYQRetTxBkdvBtz-gw8b9lcVkdXQKV
  2. https://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/HMM.pdf

References and Citations

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989, doi: 10.1109/5.18626.
[2] S. R. Eddy, "What is a hidden Markov model?," Nature Biotechnology, vol. 22, pp. 1315–1316, 2004, doi: 10.1038/nbt1004-1315.


Origin blog.csdn.net/gongdiwudu/article/details/131905520