Hidden Markov Models in Machine Learning

  This article is a set of study notes: on the one hand, writing things down in my own words makes the material easier to understand; on the other hand, it reinforces memory, helping the brain build its own 'neural network' for the hidden Markov model.

1. Model Scenario

Before introducing the hidden Markov model, consider an example.
Suppose there are four boxes, each containing red and white balls, with the following counts:

| Box | Red balls | White balls |
| --- | --------- | ----------- |
| 1   | 5         | 5           |
| 2   | 3         | 7           |
| 3   | 6         | 4           |
| 4   | 8         | 2           |

Balls are drawn in the following way to produce an observation sequence of ball colors:

  • (1) Start from one of the four boxes chosen uniformly at random, draw a ball from it at random, record the color, and put the ball back
  • (2) Then move at random from the current box to the next box, by the following rule: if the current box is box 1, the next box must be box 2; if the current box is box 2 or box 3, move to the box on the left with probability 0.4 or to the box on the right with probability 0.6; if the current box is box 4, stay in box 4 or move to box 3, each with probability 0.5
  • (3) After the move, draw a ball at random from the new box, record the color, and put it back
  • (4) Repeating this five times yields an observation sequence of ball colors: \[O = (\text{red}, \text{red}, \text{white}, \text{white}, \text{red})\]

Throughout this process, the observer sees only the sequence of ball colors; which box each ball was drawn from is not observed. That is, the sequence of boxes is hidden.

2. The Three Elements of a Hidden Markov Model

The above is a typical example of a hidden Markov model. There are two random sequences: a sequence of boxes (the state sequence) and a sequence of ball colors (the observation sequence). The former is hidden; only the latter is observable.

A hidden Markov model has three elements, denoted \[\lambda = (A, B, \pi)\] where A is the state transition probability matrix, B is the observation probability matrix, and \(\pi\) is the initial state probability vector.

Using the example above, we can compute the values of A, B, and \(\pi\).

State transition probability distribution matrix:
\[A = \left[\begin{matrix} 0 & 1 & 0 & 0 \\ 0.4 & 0 & 0.6 & 0 \\ 0 & 0.4 & 0 & 0.6 \\ 0 & 0 & 0.5 & 0.5 \end{matrix}\right]\]
\(a_{ij}\) is the probability of moving from state i to state j

Observation probability distribution matrix:

\[B = \left[\begin{matrix} 0.5 & 0.5 \\ 0.3 & 0.7 \\ 0.6 & 0.4 \\ 0.8 & 0.2 \end{matrix}\right]\]
In row i, the first column is the probability of drawing a red ball from box i and the second column is the probability of drawing a white ball from box i

Initial probability distribution:
\[\pi = (0.25, 0.25, 0.25, 0.25)\]
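
To make these elements concrete, here is a minimal NumPy sketch (an illustration, not from the original text; the function name and the 0/1 color encoding are my own choices) that stores \(\lambda = (A, B, \pi)\) as arrays and simulates the drawing procedure from section 1:

```python
import numpy as np

# The three elements of the box-and-ball model
A = np.array([[0.0, 1.0, 0.0, 0.0],    # box 1 always moves to box 2
              [0.4, 0.0, 0.6, 0.0],    # box 2: left with 0.4, right with 0.6
              [0.0, 0.4, 0.0, 0.6],    # box 3: left with 0.4, right with 0.6
              [0.0, 0.0, 0.5, 0.5]])   # box 4: to box 3 or stay, 0.5 each
B = np.array([[0.5, 0.5],              # rows: boxes; columns: P(red), P(white)
              [0.3, 0.7],
              [0.6, 0.4],
              [0.8, 0.2]])
pi = np.array([0.25, 0.25, 0.25, 0.25])

def simulate(T, seed=0):
    """Sample a hidden box sequence and an observed color sequence of length T."""
    rng = np.random.default_rng(seed)
    boxes, colors = [], []
    box = rng.choice(4, p=pi)                   # pick the first box uniformly
    for _ in range(T):
        boxes.append(box)
        colors.append(rng.choice(2, p=B[box]))  # 0 = red, 1 = white
        box = rng.choice(4, p=A[box])           # move to the next box
    return boxes, colors

print(simulate(5))
```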

3. Three Basic Problems of Hidden Markov Models

(1) Probability calculation problem

Given the model \(\lambda = (A, B, \pi)\) and an observation sequence \(O = (o_1, o_2, ..., o_T)\), compute the probability \(P(O|\lambda)\) that this observation sequence is produced under the model

(2) Learning problem

Given an observation sequence \(O = (o_1, o_2, ..., o_T)\), estimate the parameters of the model \(\lambda = (A, B, \pi)\) so that the probability \(P(O|\lambda)\) of the observation sequence under this model is maximized; the parameters are estimated by the method of maximum likelihood

(3) Prediction problem, also known as the decoding problem

Given the model \(\lambda = (A, B, \pi)\) and an observation sequence \(O = (o_1, o_2, ..., o_T)\), find the state sequence \(I = (i_1, i_2, ..., i_T)\) that maximizes the conditional probability \(P(I|O)\). That is, given the observation sequence, find the most likely corresponding state sequence

The following sections describe algorithms for solving each of these problems.

4. Probability Calculation Algorithms

4.1 Description of the problem

Given the model \(\lambda = (A, B, \pi)\) and an observation sequence \(O = (o_1, o_2, ..., o_T)\), compute the probability \(P(O|\lambda)\) that this observation sequence is produced under the model

4.2 Forward algorithm

(1) Compute the state values at time t = 1, when the first ball observed is red (note: matrix and sequence indices here start from 1)

The first red ball is drawn from box 1:
\[\alpha_1(1) = \pi_1 b_1(o_1) = 0.25 \times 0.5 = 0.125\]
The first red ball is drawn from box 2:
\[\alpha_1(2) = \pi_2 b_2(o_1) = 0.25 \times 0.3 = 0.075\]
The first red ball is drawn from box 3:
\[\alpha_1(3) = \pi_3 b_3(o_1) = 0.25 \times 0.6 = 0.15\]
The first red ball is drawn from box 4:
\[\alpha_1(4) = \pi_4 b_4(o_1) = 0.25 \times 0.8 = 0.2\]

(2) Compute the state values at time t = 2, when the second ball observed is also red

The second red ball is drawn from box 1:
\[\alpha_2(1) = [\alpha_1(1) a_{11} + \alpha_1(2) a_{21} + \alpha_1(3) a_{31} + \alpha_1(4) a_{41}] \, b_1(o_2)\]
The second red ball is drawn from box 2:
\[\alpha_2(2) = [\alpha_1(1) a_{12} + \alpha_1(2) a_{22} + \alpha_1(3) a_{32} + \alpha_1(4) a_{42}] \, b_2(o_2)\]
The second red ball is drawn from box 3:
\[\alpha_2(3) = [\alpha_1(1) a_{13} + \alpha_1(2) a_{23} + \alpha_1(3) a_{33} + \alpha_1(4) a_{43}] \, b_3(o_2)\]
The second red ball is drawn from box 4:
\[\alpha_2(4) = [\alpha_1(1) a_{14} + \alpha_1(2) a_{24} + \alpha_1(3) a_{34} + \alpha_1(4) a_{44}] \, b_4(o_2)\]

...

Generalizing the pattern above, we obtain the forward algorithm:

(1) Initialization
\[\alpha_1(i) = \pi_i b_i(o_1)\]

(2) Recursion
\[\alpha_{t+1}(i) = \left[\sum_{j=1}^N \alpha_t(j) a_{ji}\right] b_i(o_{t+1})\]

(3) Termination
\[P(O|\lambda) = \sum_{i=1}^N \alpha_T(i)\]
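
The forward recursion is easy to implement. Below is a minimal NumPy sketch (function and variable names are my own) that runs it on the box example with \(O = (\text{red}, \text{red}, \text{white}, \text{white}, \text{red})\), encoding red as 0 and white as 1; the first row of the alpha table should match the hand-computed values above:

```python
import numpy as np

A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.4, 0.0, 0.6, 0.0],
              [0.0, 0.4, 0.0, 0.6],
              [0.0, 0.0, 0.5, 0.5]])
B = np.array([[0.5, 0.5], [0.3, 0.7], [0.6, 0.4], [0.8, 0.2]])
pi = np.array([0.25, 0.25, 0.25, 0.25])

def forward(obs):
    """Return P(O|lambda) and the full alpha table."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):
        # alpha_t(i) = [sum_j alpha_{t-1}(j) a_{ji}] * b_i(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                 # termination: sum over alpha_T

O = [0, 0, 1, 1, 0]   # red, red, white, white, red
prob, alpha = forward(O)
print(prob)           # P(O|lambda)
print(alpha[0])       # [0.125, 0.075, 0.15, 0.2], as computed by hand above
```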

4.3 Backward algorithm

As the name suggests, the backward algorithm computes the probability of the observations after time t from the corresponding probability after time t+1, working backward from the end of the sequence

Let \(\beta_t(i)\) be the probability of observing the sequence from time t+1 to T, given that the state at time t is \(q_i\); then \[\beta_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T \mid i_t = q_i, \lambda)\]

Pay special attention to the definition of \(\beta_t(i)\); it is easiest to understand by reading the sequence backward in time

(1) Initialization: for every state \(q_i\) at the final time, stipulate \[\beta_T(i) = 1\]

(2) Recursion, for t = T-1, T-2, ..., 1: \[\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)\]

(3) Termination: \[P(O|\lambda) = \sum_{i=1}^N \pi_i b_i(o_1) \beta_1(i)\]
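
A matching sketch of the backward recursion, under the same 0/1 encoding and illustrative naming; its termination formula should return the same \(P(O|\lambda)\) as the forward algorithm:

```python
import numpy as np

A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.4, 0.0, 0.6, 0.0],
              [0.0, 0.4, 0.0, 0.6],
              [0.0, 0.0, 0.5, 0.5]])
B = np.array([[0.5, 0.5], [0.3, 0.7], [0.6, 0.4], [0.8, 0.2]])
pi = np.array([0.25, 0.25, 0.25, 0.25])

def backward(obs):
    """Return P(O|lambda) computed with the beta recursion, plus the beta table."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                        # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_{ij} b_j(o_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # termination: P(O|lambda) = sum_i pi_i b_i(o_1) beta_1(i)
    return (pi * B[:, obs[0]] * beta[0]).sum(), beta

O = [0, 0, 1, 1, 0]
print(backward(O)[0])   # same value as the forward algorithm
```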

5. Learning Algorithm

5.1 Description of the problem

Given an observation sequence \(O = (o_1, o_2, ..., o_T)\), estimate the parameters of the model \(\lambda = (A, B, \pi)\) so that the probability \(P(O|\lambda)\) of the observation sequence under this model is maximized; the parameters are estimated by the method of maximum likelihood

In hidden Markov model learning, the training data may consist of observation sequences together with their corresponding state sequences, or of observation sequences alone; accordingly, learning can be done by supervised or unsupervised methods

For supervised learning, because the data set contains both the observation sequences and the corresponding state sequences, the model parameters can be estimated directly from the data by counting, as in the sketch below
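
A minimal sketch of that count-based maximum likelihood estimation (the data layout and names here are my own assumptions, with states and observations encoded as integers):

```python
import numpy as np

def estimate(sequences, N, M):
    """MLE by counting; sequences is a list of (states, observations) pairs,
    with N states and M observation symbols, all encoded as 0..N-1 / 0..M-1."""
    pi = np.zeros(N)
    A = np.zeros((N, N))
    B = np.zeros((N, M))
    for states, obs in sequences:
        pi[states[0]] += 1                       # count initial states
        for s, o in zip(states, obs):
            B[s, o] += 1                         # count (state, observation) pairs
        for s, s_next in zip(states, states[1:]):
            A[s, s_next] += 1                    # count state transitions
    # normalize counts into probabilities
    # (a real implementation would smooth rows for states never seen in the data)
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return A, B, pi
```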

For unsupervised learning, the EM algorithm can be used to learn the parameters, treating the state sequence as the hidden variable (for HMMs this is known as the Baum-Welch algorithm). Refer to the appendix on the EM algorithm

6. Prediction Algorithm

6.1 Description of the problem

Given the model \(\lambda = (A, B, \pi)\) and an observation sequence \(O = (o_1, o_2, ..., o_T)\), find the state sequence \(I = (i_1, i_2, ..., i_T)\) that maximizes the conditional probability \(P(I|O)\). That is, given the observation sequence, find the most likely corresponding state sequence

6.2 Viterbi algorithm

The Viterbi algorithm solves the hidden Markov model prediction problem using dynamic programming, i.e., it finds the maximum probability path by dynamic programming

Define two variables:

\(\delta_t(i)\) is the maximum probability over all single paths that are in state i at time t:
\[\delta_t(i) = \max_{i_1, ..., i_{t-1}} P(i_t = i, i_{t-1}, ..., i_1, o_t, ..., o_1 \mid \lambda), \quad i = 1, 2, ..., N\]

\(\psi_t(i)\) is the (t-1)-th node of the highest-probability single path among all paths that are in state i at time t:
\[\psi_t(i) = \arg\max_{1 \le j \le N} [\delta_{t-1}(j) a_{ji}], \quad i = 1, 2, ..., N\]

(1) Initialization
\[\delta_1(i) = \pi_i b_i(o_1)\]
\[\psi_1(i) = 0\]

(2) Recursion, for t = 2, 3, ..., T
\[\delta_t(i) = \max_{1 \le j \le N} [\delta_{t-1}(j) a_{ji}] \, b_i(o_t)\]
\[\psi_t(i) = \arg\max_{1 \le j \le N} [\delta_{t-1}(j) a_{ji}]\]

(3) Termination
\[P^* = \max_{1 \le i \le N} \delta_T(i)\]
\[i_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]\]

(4) Optimal path backtracking, for t = T-1, T-2, ..., 1
\[i_t^* = \psi_{t+1}(i_{t+1}^*)\]
which yields the optimal state sequence \(I^* = (i_1^*, i_2^*, ..., i_T^*)\)
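
A minimal sketch of the full Viterbi procedure on the box example (names and the 0-based indexing are my own; add 1 to the returned states to match the box numbers in the text):

```python
import numpy as np

A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.4, 0.0, 0.6, 0.0],
              [0.0, 0.4, 0.0, 0.6],
              [0.0, 0.0, 0.5, 0.5]])
B = np.array([[0.5, 0.5], [0.3, 0.7], [0.6, 0.4], [0.8, 0.2]])
pi = np.array([0.25, 0.25, 0.25, 0.25])

def viterbi(obs):
    """Return the most probable state path and its probability P*."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A         # trans[j, i] = delta_{t-1}(j) a_{ji}
        psi[t] = trans.argmax(axis=0)             # best predecessor for each state i
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    # termination and optimal path backtracking
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return path[::-1], delta[-1].max()

O = [0, 0, 1, 1, 0]   # red, red, white, white, red
print(viterbi(O))
```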

7. Appendix: EM Algorithm

7.1 Definition of the EM algorithm

Input: observed variable data X, hidden variable data Z, and their joint distribution \(P(X, Z|\theta)\); X and Z together are also called the complete data, which makes the following easier to understand

Output: model parameter \(\theta\)

(1) Select an initial model parameter \(\theta^{(0)}\) and begin iterating

(2) E step: write \(\theta^{(i)}\) for the parameter estimate at the i-th iteration, and compute the expectation \[Q(\theta, \theta^{(i)}) = E_Z[\log P(X, Z|\theta) \mid X, \theta^{(i)}] = \int_Z \log P(X, Z|\theta) \, P(Z|X, \theta^{(i)}) \, dZ\]
(3) M step: find \(\theta^{(i+1)} = \arg\max_\theta Q(\theta, \theta^{(i)})\)

(4) Repeat steps (2) and (3) until convergence
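
To make these four steps concrete, here is a minimal sketch of the EM loop on a toy problem chosen for illustration (it is not from the text): a mixture of two biased coins, where the identity of the coin used in each trial is the hidden variable Z. The E step computes each coin's responsibility for each trial; the M step re-estimates the parameters by weighted maximum likelihood:

```python
import numpy as np

# Toy data: each entry is the number of heads in 10 flips of a hidden coin
flips_per_trial = 10
heads = np.array([9, 8, 1, 2, 9, 1, 8, 2, 9, 1])

theta = np.array([0.6, 0.5])   # initial guesses theta^(0) for each coin's P(heads)
mix = np.array([0.5, 0.5])     # initial mixing weights P(Z = coin k)

for _ in range(50):
    # E step: responsibility of coin k for trial i,
    # gamma[i, k] ∝ P(Z=k) * theta_k^h * (1 - theta_k)^(n - h)
    # (the binomial coefficient cancels in the normalization)
    like = (theta[None, :] ** heads[:, None]
            * (1 - theta[None, :]) ** (flips_per_trial - heads[:, None]))
    gamma = mix * like
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: weighted maximum likelihood updates of theta and the mixing weights
    mix = gamma.mean(axis=0)
    theta = (gamma * heads[:, None]).sum(axis=0) / (gamma.sum(axis=0) * flips_per_trial)

print(theta, mix)   # theta should separate toward the two clusters' head rates
```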

7.2 Some notes on the EM algorithm

(1) The initial parameter can be chosen arbitrarily, but note that the EM algorithm is sensitive to the initial value

(2) The E step computes \(Q(\theta, \theta^{(i)})\). In the Q function, Z is the hidden variable and X is the observed data; the first argument of \(Q(\theta, \theta^{(i)})\) is the parameter to be maximized, and the second is the current parameter estimate. Each iteration in fact maximizes Q

(3) The iteration stopping condition is usually given in terms of small positive numbers \(\xi_1, \xi_2\): stop if \(\|\theta^{(i+1)} - \theta^{(i)}\| < \xi_1\) or \(\|Q(\theta^{(i+1)}, \theta^{(i)}) - Q(\theta^{(i)}, \theta^{(i)})\| < \xi_2\)

7.3 Derivation of the EM algorithm

The goal is to maximize the log-likelihood of the observed data:
\[L(\theta) = \log P(X|\theta) = \log \int_Z P(X, Z|\theta) \, dZ\]
\[L(\theta) = \log \int_Z \frac{P(X, Z|\theta)}{P(Z|X, \theta^{(i)})} P(Z|X, \theta^{(i)}) \, dZ\]
Since the log function is concave, by Jensen's inequality
\[L(\theta) \geq \int_Z \log \frac{P(X, Z|\theta)}{P(Z|X, \theta^{(i)})} P(Z|X, \theta^{(i)}) \, dZ\]
\[L(\theta) \geq \int_Z \log P(X, Z|\theta) \, P(Z|X, \theta^{(i)}) \, dZ - \int_Z \log P(Z|X, \theta^{(i)}) \, P(Z|X, \theta^{(i)}) \, dZ\]
The subtracted term is independent of the model parameter \(\theta\), and \(P(Z|X, \theta^{(i)})\) is known, so only the first term matters. This gives
\[Q(\theta, \theta^{(i)}) = \int_Z \log P(X, Z|\theta) \, P(Z|X, \theta^{(i)}) \, dZ\]
which turns the original problem of maximizing L into maximizing a lower bound on L, consistent with the definition of Q and the algorithm steps above.
Therefore, any \(\theta\) that increases \(Q(\theta, \theta^{(i)})\) also increases this lower bound on \(L(\theta)\); to make \(L(\theta)\) grow as much as possible, choose the \(\theta\) that maximizes Q, i.e.
\[\theta^{(i+1)} = \arg\max_\theta Q(\theta, \theta^{(i)})\]

7.4 Convergence of the EM algorithm

Theorem 1: Let \(P(X|\theta)\) be the likelihood function of the observed data and \(\theta^{(i)}\) the sequence of parameter estimates produced by the EM algorithm, with \(P(X|\theta^{(i)})\) the corresponding sequence of likelihoods. Then \(P(X|\theta^{(i)})\) is monotonically increasing.
Theorem 2: Let \(L(\theta) = \log P(X|\theta)\) be the log-likelihood of the observed data, \(\theta^{(i)}\) the sequence of parameter estimates produced by the EM algorithm, and \(L(\theta^{(i)})\) the corresponding sequence of log-likelihoods.

(1) If \(P(X|\theta)\) is bounded above, then \(L(\theta^{(i)})\) converges to some value \(L^*\)
(2) If \(Q(\theta, \theta^{(i)})\) and \(L(\theta)\) satisfy certain conditions, then the parameter sequence \(\theta^{(i)}\) obtained by the EM algorithm converges to a stationary point \(\theta^*\) of \(L(\theta)\)

The above is the 'textbook' explanation of the EM algorithm; if it is hard to follow, refer to the blog post at https://www.jianshu.com/p/1121509ac1dc

Finally, two questions about hidden Markov models to think about:

  (1) How would you model and train a hidden Markov model for Chinese word segmentation?

  (2) Why do maximum entropy Markov models suffer from the label bias problem? How can it be solved?

Reference:
Li Hang, "Statistical Learning Methods"
