NLP from Beginner to Practice (6): Hidden Markov Model

Basic theory

A hidden Markov model is a doubly stochastic process: a hidden Markov chain with a certain number of states, together with a set of observable random functions. Since the 1980s, HMM has been applied to speech recognition with great success. In the 1990s, HMM was also introduced into computer character recognition and the mobile-communication core technology of "multi-user detection". HMM has also begun to be applied in bioinformatics and fault diagnosis.

Basic overview

The Hidden Markov Model (HMM) is a statistical model that is widely used in natural language processing applications such as speech recognition, automatic part-of-speech tagging, phonetic-to-character conversion, and probabilistic grammars. After long-term development, especially its successful application in speech recognition, it has become a general-purpose statistical tool.

Markov process

Let's look at an example first. Suppose a baby a few months old does three things every day: playing (excited state), eating (hungry state), and sleeping (sleepy state). These three states transition in the directions shown in the following figure:

[Figure: transition diagram among the three states playing, eating, and sleeping]

This is a simple Markov process. Note that it differs from a deterministic system: each transition is probabilistic. The baby's state changes frequently and can switch between states at will:

[Figure: the same three states, with transition probabilities marked on the arrows]

The arrows in the figure above indicate the probability of switching from one state to another; for example, the probability of sleeping after eating is 0.7.

As can be seen from the figure above, the transition to a state depends only on the previous n states. When n = 1, this is the Markov hypothesis. This leads to the definition of a Markov chain:

A Markov chain is a sequence of random variables S1, ..., ST (the states). The range of these variables, i.e., the set of all their possible values, is called the "state space", and the value of St is the state at time t. If the conditional probability distribution of St+1 given the past states is a function of St alone, then:

P(St+1 = x | S1, S2, ..., St) = P(St+1 = x | St)

Here x is some state in the process. The equation above is called the Markov hypothesis.

The formula above can be understood as follows: given the known "present", the "future" does not depend on the "past"; in other words, the "future" depends only on the known "present". That is, St+1 is related only to St and has nothing to do with St−n for 1 ≤ n < t.

A Markov chain with N states has N² possible state transitions. The probability of each transition is called a state transition probability, i.e. the probability of moving from one state to another. All N² probabilities can be collected in a state transition matrix:

[Figure: 3×3 state transition matrix over play, eat, sleep; the row for "eat" is (0.2, 0.1, 0.7)]

This matrix indicates that if the baby's state is eating at time t, the probabilities of playing, eating, and sleeping at time t+1 are (0.2, 0.1, 0.7).


The entries in each row of the matrix sum to 1.
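As a tiny sketch of this idea in Python (only the "eat" row below comes from the figure above; the "play" and "sleep" rows are made-up values purely for illustration), a transition matrix can be stored as a nested dict and each row checked to sum to 1:

```python
# Transition matrix for the 3-state baby example.
# Only the "eat" row (0.2, 0.1, 0.7) is taken from the text above;
# the "play" and "sleep" rows are illustrative assumptions.
A = {
    "play":  {"play": 0.3, "eat": 0.4, "sleep": 0.3},   # assumed
    "eat":   {"play": 0.2, "eat": 0.1, "sleep": 0.7},   # from the figure
    "sleep": {"play": 0.5, "eat": 0.3, "sleep": 0.2},   # assumed
}

# Each row is a probability distribution over the next state, so it sums to 1.
for state, row in A.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, state
```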

Hidden Markov Model

In many cases a plain Markov process is not enough to describe the problems we encounter. For example, we cannot directly know whether the baby is hungry or sleepy, but we can infer it from the baby's other behaviors: if the baby is crying, it may be hungry; if it is listless, it may be sleepy. This gives us two state sets: an observable state set O and a hidden state set S. One of our goals is to predict the hidden states from the observable states. To simplify the description, we remove the "play" state and let the baby do nothing but eat and sleep every day (the common wish of most parents). The model is as follows:

[Figure: HMM with hidden states eat and zzz, and observable behaviors cry, tired, and find]

Thus O = {Ocry, Otired, Ofind} and S = {Seat, Szzz}. In the "eating (hungry)" state, the probabilities of the three observable behaviors crying, listlessness, and looking for mom are (0.7, 0.1, 0.2).

In the example above, the observable state sequence and the hidden state sequence are probabilistically related. We can therefore model this kind of process as a hidden Markov process plus a set of observable states probabilistically related to it. This is the hidden Markov model.

A Hidden Markov Model (HMM) is a statistical model used to describe a Markov process with hidden, unknown parameters.

The transition matrix tells us how to represent P(St+1=m | St=n), but how do we represent P(Ot | St) (the observed state being an indirect reflection of the hidden, true state)? In an HMM we use another matrix:

|     | cry | tired | find |
|-----|-----|-------|------|
| eat | 0.7 | 0.1   | 0.2  |
| zzz | 0.3 | 0.5   | 0.2  |

This matrix is called a confusion matrix. The rows of the matrix represent hidden states, the columns represent observable states, and the probability values in each row sum to 1. In the first row and first column, P(Ot=cry | St=eat) = 0.7: when the baby is hungry, the probability of crying is 0.7.

The confusion matrix embodies another hypothesis of the hidden Markov model, the independence hypothesis: the observation at any time depends only on the state of the Markov chain at that time and has nothing to do with other states or observations.

P(Ot | S1, ..., ST, O1, ..., Ot−1, Ot+1, ..., OT) = P(Ot | St)

Formal definition of the HMM

An HMM can be represented by a 5-tuple {N, M, π, A, B}, where:

  • N is the number of hidden states; we either know the exact value or guess it;
  • M is the number of observable states, which can be obtained from the training set;
  • π = {πi} is the initial state probability vector; it gives the probability of each hidden state at the start;
  • A = {aij} is the transition matrix of the hidden states; the N×N matrix gives the probability of transitioning from one hidden state to another;
  • B = {bij} is the confusion matrix; the N×M matrix gives the probability of a given observation under a given hidden state.

Every probability in the state transition matrix and the confusion matrix is time-independent; that is, the matrices do not change over time as the system evolves. For an HMM with fixed N and M, the parameters are written λ = {π, A, B}.

Problem solving

Suppose there is a known HMM model:

[Figure: the example HMM with its initial, transition, and emission probabilities]

In this model, the initial state probabilities are π = {Seat=0.3, Szzz=0.7}; the number of hidden states is N=2; the number of observable states is M=3; the transition matrix and confusion matrix are:

Transition matrix A:

|     | eat | zzz |
|-----|-----|-----|
| eat | 0.1 | 0.9 |
| zzz | 0.8 | 0.2 |

Confusion matrix B:

|     | cry | tired | find |
|-----|-----|-------|------|
| eat | 0.7 | 0.1   | 0.2  |
| zzz | 0.3 | 0.5   | 0.2  |
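For reference in the calculations that follow, here is a minimal sketch of this example model in Python; the entries of A and B are the values used in the worked computations below (the original matrices appear only as images):

```python
# Hidden states, observable states, and parameters of the baby HMM.
states = ["eat", "zzz"]
observations = ["cry", "tired", "find"]

# Initial state probabilities (pi)
pi = {"eat": 0.3, "zzz": 0.7}

# Transition matrix A: P(next hidden state | current hidden state)
A = {
    "eat": {"eat": 0.1, "zzz": 0.9},
    "zzz": {"eat": 0.8, "zzz": 0.2},
}

# Confusion (emission) matrix B: P(observation | hidden state)
B = {
    "eat": {"cry": 0.7, "tired": 0.1, "find": 0.2},
    "zzz": {"cry": 0.3, "tired": 0.5, "find": 0.2},
}
```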

Now we have to solve 3 problems:

**1.** Model evaluation problem (probability calculation problem)

Given the entire model and the baby's behaviors crying -> listless -> looking for mom, compute the probability of this behavior sequence.

That is:

Knowing the model parameters, compute the probability of a given observable state sequence. That is, given an observation sequence O and the model λ = (A, B, π), compute the probability of the observation sequence, P(O|λ).

Corresponding algorithm: forward algorithm, backward algorithm

**2.** Decoding problem (prediction problem)

Given the entire model and the baby's behaviors crying -> listless -> looking for mom, find the hidden states the baby was most likely in while producing these three behaviors.

That is:

Given the model parameters and an observable state sequence, choose the hidden state sequence S = {S1, S2, ..., ST} that best explains the observation sequence O.

Corresponding algorithm: Viterbi algorithm

**3.** Parameter estimation problem (an unsupervised learning problem)

From the baby's behaviors (crying, listlessness, looking for mom), estimate the baby's state transition probabilities.

The data set contains only observation sequences. How do we adjust the model parameters λ = (π, A, B) so that P(O|λ) is maximized?

Corresponding algorithm: Baum-Welch algorithm

This article mainly solves problems 1 and 2; from them it can be seen how the Markov hypothesis and the independence hypothesis (the two formulas above) simplify the probability calculation (problem 3 will be covered in a later supplement).

Traversal

Solve problem 1.

The traversal method is a typical exhaustive method and is simple to implement: list all possible hidden state sequences and add up their probabilities. There are 3 observations, and each observation corresponds to one of 2 hidden states, so there are 2³ = 8 possible hidden state sequences. One of them:

P(Seat1, Seat2, Seat3, Ocry1, Otired2, Ofind3)

= P(Seat1)·P(Ocry1|Seat1)·P(Seat2|Seat1)·P(Otired2|Seat2)·P(Seat3|Seat2)·P(Ofind3|Seat3)

= (0.3×0.7)×(0.1×0.1)×(0.1×0.2)

= 0.000042

The numeric subscripts in the formula above denote time. The traversal method is most effective (because of its simplicity) when there are few observations and hidden states; once the number of nodes increases, the amount of computation grows sharply.
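A minimal brute-force sketch, assuming the states, pi, A and B from the model sketch above: it enumerates all 2³ hidden state sequences and sums their joint probabilities with the observations.

```python
from itertools import product

def brute_force_prob(obs, states, pi, A, B):
    """Sum P(hidden sequence, observations) over every possible hidden sequence."""
    total = 0.0
    for hidden in product(states, repeat=len(obs)):
        p = pi[hidden[0]] * B[hidden[0]][obs[0]]                     # initial state + first emission
        for t in range(1, len(obs)):
            p *= A[hidden[t - 1]][hidden[t]] * B[hidden[t]][obs[t]]  # transition + emission
        total += p
    return total

print(brute_force_prob(["cry", "tired", "find"], states, pi, A, B))  # ≈ 0.02688
```

For T observations and N hidden states this enumerates N^T sequences, which is why the forward algorithm below is preferred.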

Forward Algorithm

Solve problem 1.

The forward algorithm computes the probability step by step, starting from time t=1.

The probability chain rule behind it:

P(W1,W2) = P(W1)P(W2|W1)

P(W1,W2,W3) = P(W1,W2)P(W3|W1,W2)

P(W1,W2,…,Wn) = P(W1,W2,…,Wn-1)P(Wn|W1,W2,…,Wn-1)

1. Calculate the probability that Cry will happen when t=1:

P(Ocry,Seat) = P(Seat)P(Ocry|Seat) =0.3×0.7=0.21

P(Ocry,Szzz) = P(Szzz)P(Ocry|Szzz) =0.7×0.3=0.21

2. Calculate the probability that Tired will happen when t=2:

According to the Markov hypothesis, the state St=2 depends only on St=1, so the behavior probability for the next day is computed from the state of the previous day. If St=2 = Seat2:

P(Ocry1,Otired2,Seat2)

= P(Ocry1,Seat1)P(Seat2|Seat1)P(Otired2|Seat2)+ P(Ocry1,Szzz1)P(Seat2|Szzz1)P(Otired2|Seat2)

=[P(Ocry1,Seat1)P(Seat2|Seat1)+P(Ocry1,Szzz1)P(Seat2|Szzz1)]·P(Otired2|Seat2)

= [0.21×0.1+0.21×0.8]×0.1

= 0.0189

If St=2=Szzz2:

P(Ocry1,Otired2,Szzz2)

= P(Ocry1,Seat1)P(Szzz2|Seat1)P(Otired2|Szzz2)+P(Ocry1,Szzz1)P(Szzz2|Szzz1)P(Otired2|Szzz2)

= [P(Ocry1,Seat1)P(Szzz2|Seat1)+P(Ocry1,Szzz1)P(Szzz2|Szzz1)]·P(Otired2|Szzz2)

= [0.21×0.9+0.21×0.2]×0.5

= 0.1155

3. Calculate the probability of the behavior of Find when t=3:

If St=3=Seat3,

P(Ocry1,Otired2,Ofind3,Seat3)

= P(Ocry1,Otired2,Seat2)P(Seat3|Seat2)P(Ofind3|Seat3) + P(Ocry1,Otired2,Szzz2)P(Seat3|Szzz2)P(Ofind3|Seat3)

= [P(Ocry1,Otired2,Seat2)P(Seat3|Seat2) + P(Ocry1,Otired2,Szzz2)P(Seat3|Szzz2)]·P(Ofind3|Seat3)

= [0.0189×0.1+0.1155×0.8]×0.2

= 0.018858

If St=3=Szzz3,

P(Ocry1,Otired2,Ofind3,Szzz3)

= P(Ocry1,Otired2,Seat2)P(Szzz3|Seat2)P(Ofind3|Szzz3) + P(Ocry1,Otired2,Szzz2)P(Szzz3|Szzz2)P(Ofind3|Szzz3)

= [P(Ocry1,Otired2,Seat2)P(Szzz3|Seat2) + P(Ocry1,Otired2,Szzz2)P(Szzz3|Szzz2)]·P(Ofind3|Szzz3)

= [0.0189×0.9+0.1155×0.2]×0.2

= 0.008022

In summary,

P(Ocry1,Otired2,Ofind3)

= P(Ocry1,Otired2,Ofind3,Seat3)+ P(Ocry1,Otired2,Ofind3,Szzz3)

= 0.018858 + 0.008022

= 0.02688
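The same probability can be computed with a small forward-algorithm sketch, again assuming the states, pi, A and B from the model sketch above:

```python
def forward(obs, states, pi, A, B):
    """Forward algorithm: alpha[t][s] = P(o1..ot, state at time t = s)."""
    alpha = [{s: pi[s] * B[s][obs[0]] for s in states}]   # t = 1
    for t in range(1, len(obs)):
        alpha.append({
            s: sum(alpha[t - 1][prev] * A[prev][s] for prev in states) * B[s][obs[t]]
            for s in states
        })
    return sum(alpha[-1].values())   # sum over the final hidden state

print(forward(["cry", "tired", "find"], states, pi, A, B))  # 0.02688
```

Unlike the brute-force method, the amount of work grows linearly with the sequence length instead of exponentially.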

Viterbi Algorithm

Refer to Baidu Encyclopedia:

The basis of the Viterbi algorithm can be summarized into the following three points:

  1. If the most probable path p (or the shortest path) passes through some point, say X22, then the sub-path Q from the start point S to X22 along p must be the shortest path from S to X22. Otherwise, replacing Q with the shortest path R from S to X22 would give a path shorter than p, which is a contradiction. This shows the principle of optimality holds.
  2. The path from S to E must pass through some state at time i. If there are k states at time i, then as long as we record the shortest path from S to each of these k nodes, the final shortest path must pass through one of them. Thus, at any moment we only need to keep a very limited number of shortest paths.
  3. Combining the two points above: when moving from state i to state i+1, if the shortest paths from S to every node of state i have already been found and recorded on those nodes, then to compute the shortest path from S to some node Xi+1,j of state i+1 we only need to consider the shortest paths from S to the k nodes of state i, plus the distance from each of those nodes to Xi+1,j.

In this example, the Viterbi algorithm starts at t=1, computes forward step by step while recording the best previous state at each step, and finally backtracks to find the path with the greatest probability.

1. Calculate the probability of occurrence of Ocry at t=1:

δ11 = P(Ocry,Seat) = P(Seat)P(Ocry|Seat)=0.3×0.7=0.21

δ12 = P(Ocry,Szzz) = P(Szzz)P(Ocry|Szzz)=0.7×0.3=0.21

2. Calculate the probability of Otired occurrence at t=2:

δ21 =max(P(Ocry1,Seat1)P(Seat2|Seat1)P(Otired2|Seat2),P(Ocry1,Szzz1)P(Seat2|Szzz1)P(Otired2|Seat2))

= max(P(Ocry1,Seat1)P(Seat2|Seat1), P(Ocry1,Szzz1)P(Seat2|Szzz1))·P(Otired2|Seat2)

= max(δ11 P(Seat2|Seat1), δ12 P(Seat2|Szzz1)) ·P(Otired2|Seat2)

= max(0.21×0.1,0.21×0.8)×0.1

= 0.0168

Recorded best previous state for δ21 (eat at t=2): zzz

δ22 = max(P(Ocry1,Seat1)P(Szzz2|Seat1)P(Otired2|Szzz2), P(Ocry1,Szzz1)P(Szzz2|Szzz1)P(Otired2|Szzz2))

= max(δ11 P(Szzz2|Seat1), δ12 P(Szzz2|Szzz1)) ·P(Otired2|Szzz2)

= max(0.21×0.9,0.21×0.2)×0.5

= 0.0945

Recorded best previous state for δ22 (zzz at t=2): eat

3. Calculate the probability of Ofind occurring at t=3:

δ31 = max(δ21P(Seat3|Seat2), δ22P(Seat3|Szzz2)) ·P(Ofind3|Seat3)

= max(0.0168×0.1, 0.0945×0.8)×0.2

= 0.01512

Recorded best previous state for δ31 (eat at t=3): zzz

δ32 = max(δ21P(Szzz3|Seat2), δ22P(Szzz3|Szzz2)) ·P(Ofind3|Szzz3)

= max(0.0168×0.9, 0.0945×0.2)×0.2

= 0.00378

Recorded best previous state for δ32 (zzz at t=3): zzz

4. Backtracking: start from the state with the largest δ at the final time step, max(δ31, δ32) = δ31 = 0.01512, so S3 = eat. Then follow the recorded best previous states backwards: the best previous state for δ31 is zzz, so S2 = zzz; the best previous state for δ22 (zzz at t=2) is eat, so S1 = eat.

The most likely hidden state sequence is therefore: eat → zzz → eat.
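A matching Viterbi sketch under the same assumed states, pi, A and B as before; it keeps the best predecessor of each state at every step and backtracks at the end:

```python
def viterbi(obs, states, pi, A, B):
    """Return the most likely hidden state sequence and its probability."""
    delta = [{s: pi[s] * B[s][obs[0]] for s in states}]   # best path probability ending in s
    psi = [{}]                                            # best predecessor of s at time t
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for s in states:
            best_prev = max(states, key=lambda prev: delta[t - 1][prev] * A[prev][s])
            psi[t][s] = best_prev
            delta[t][s] = delta[t - 1][best_prev] * A[best_prev][s] * B[s][obs[t]]
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, psi[t][path[0]])
    return path, delta[-1][last]

print(viterbi(["cry", "tired", "find"], states, pi, A, B))
# (['eat', 'zzz', 'eat'], 0.01512)
```

The returned path matches the backtracking result above.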

Speech Recognition

The following content is compiled from Wu Jun's "The Beauty of Mathematics"

When we observe the speech signals o1, o2, o3, ..., we want to guess from this set of signals the sentence s1, s2, s3, ... that was uttered. Obviously, we should find the most likely one among all possible sentences. In mathematical language: given o1, o2, o3, ..., find the sentence s1, s2, s3, ... that maximizes the conditional probability P(s1, s2, s3, ... | o1, o2, o3, ...).

s1, s2, s3, ... = argmax P(s1, s2, s3, ... | o1, o2, o3, ...)

where, by Bayes' formula,

P(s1, s2, s3, ... | o1, o2, o3, ...) = P(o1, o2, o3, ... | s1, s2, s3, ...) · P(s1, s2, s3, ...) / P(o1, o2, o3, ...)

Since P(o1, o2, o3, ...) is the same for every candidate sentence, it suffices to maximize the numerator.

Independence hypothesis

P(o1, o2, o3, ... | s1, s2, s3, ...) ≈ P(o1|s1) · P(o2|s2) · P(o3|s3) · ...

Markov hypothesis

P(s1, s2, s3, ...) ≈ P(s1) · P(s2|s1) · P(s3|s2) · ...

It can be seen that speech recognition fits the HMM model exactly: the sentence is the hidden state sequence and the speech signal is the observation sequence.


References:
Baidu Encyclopedia: https://baike.baidu.com/item/%E9%9A%90%E9%A9%AC%E5%B0%94%E5%8F%AF%E5%A4%AB%E6%A8%A1%E5%9E%8B/7932524?fr=aladdin

https://www.cnblogs.com/bigmonkey/p/7230668.html


Source: https://blog.csdn.net/qq_46098574/article/details/108644097