Chinese Word Segmentation with the Hidden Markov Model (HMM)

HMM stands for Hidden Markov Model.

In his 1988 doctoral dissertation, Kai-Fu Lee presented Sphinx, the first speech recognition system based on the Hidden Markov Model (HMM); Business Week named it the most important technological invention in the United States in 1988. This story is also told in Wu Jun's "The Beauty of Mathematics".

The HMM model can be applied in many fields, so descriptions of its parameters are usually abstract. The following paragraphs instead explain what each model parameter means concretely when the HMM is used for Chinese word segmentation:

A typical introduction describes the HMM as a five-tuple:

  • StatusSet: the set of state values
  • InitStatus: the initial state distribution
  • ObservedSet: the set of observation values
  • TransProbMatrix: the transition probability matrix
  • EmitProbMatrix: the emission probability matrix

The HMM model can be used to solve three kinds of problems:

  • Given the parameters (StatusSet, TransProbMatrix, EmitProbMatrix, InitStatus), solve for the observation sequence. (Forward-backward algorithm)
  • Given the parameters (ObservedSet, TransProbMatrix, EmitProbMatrix, InitStatus), solve for the state value sequence. (Viterbi algorithm)
  • Given only the observations (ObservedSet), solve for (TransProbMatrix, EmitProbMatrix, InitStatus). (Baum-Welch algorithm)

Among these, the third problem is the most opaque and the least commonly used, while the second is used everywhere: Chinese word segmentation, speech recognition, new word discovery, and part-of-speech tagging all rely on it. This article therefore focuses on the second problem, that is, using the Viterbi algorithm to solve for the state value sequence.

The specific meaning of the five-tuple parameters in Chinese word segmentation

Let's get concrete and give each of the five-tuple parameters its specific meaning in the Chinese word segmentation application:

StatusSet & ObservedSet
The state value set is (B, M, E, S): {B: begin, M: middle, E: end, S: single}. Each state describes the position of a character within a word: B means the character starts a word, M means it is in the middle of a word, E means it ends a word, and S means it is a single-character word on its own.

The set of observation values is the set of all Chinese characters (east, south, west, north, you, me, he, ...), and even punctuation marks.

The state values are also what we are solving for. In HMM-based Chinese word segmentation, the input is a sentence (that is, a sequence of observation values), and the output is the state value of each character in the sentence. For example:

小明硕士毕业于中国科学院计算所 (Xiao Ming graduated with a master's degree from the Institute of Computing Technology, Chinese Academy of Sciences)

The output state sequence is
BEBEBMEBEBMEBES

According to this state sequence, we can perform word segmentation:
BE/BE/BME/BE/BME/BE/S

So the word segmentation results are as follows:
小明 / 硕士 / 毕业于 / 中国 / 科学院 / 计算 / 所 (Xiao Ming / master's / graduated from / China / Academy of Sciences / Computing / Institute)
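
To make the mapping from a tag sequence back to words concrete, here is a minimal C++ sketch (the function name cutByTags is mine, not part of cppjieba): it walks a character sequence and a parallel B/M/E/S tag sequence and closes a word at every E or S.

#include <iostream>
#include <string>
#include <vector>

// Minimal sketch: cut a character sequence into words according to a
// parallel B/M/E/S tag sequence. Each element of `chars` is one character,
// kept as a string so multi-byte UTF-8 characters also work.
std::vector<std::string> cutByTags(const std::vector<std::string>& chars,
                                   const std::string& tags) {
    std::vector<std::string> words;
    std::string word;
    for (size_t i = 0; i < chars.size() && i < tags.size(); ++i) {
        word += chars[i];
        // A word ends at an E (end of a multi-character word) or an S
        // (single-character word).
        if (tags[i] == 'E' || tags[i] == 'S') {
            words.push_back(word);
            word.clear();
        }
    }
    if (!word.empty()) words.push_back(word);  // tolerate a dangling B/M
    return words;
}

int main() {
    std::vector<std::string> chars = {"A", "B", "C", "D", "E"};  // placeholder characters
    for (const auto& w : cutByTags(chars, "BEBME")) std::cout << w << "/";
    std::cout << std::endl;  // prints: AB/CDE/
}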

At the same time, we can notice that:
B can only be followed by M or E, never by B or S; and M can only be followed by M or E, never by B or S.

That's right, it's that simple. Now that the input and output are clear, let's look at what happens between them:

So far only two of the five elements, [StatusSet, ObservedSet], have been introduced; the remaining three, [InitStatus, TransProbMatrix, EmitProbMatrix], are introduced below.

The five elements are tied together by the Viterbi algorithm: a sequence of values from ObservedSet is Viterbi's input, and a sequence of values from StatusSet is its output. Between input and output, the Viterbi algorithm also needs the three model parameters InitStatus, TransProbMatrix, and EmitProbMatrix, explained one by one below.

InitStatus
The initial state probability distribution is the easiest to understand. An example:

#B
-0.26268660809250016
#E
-3.14e+100
#M
-3.14e+100
#S
-1.4652633398537678

The example values are logarithms of the probability values (so that multiplying probabilities becomes adding logarithms), and -3.14e+100 represents negative infinity, i.e., a probability of 0. The same applies below.

These are the probabilities that the first character of a sentence is in each of the four states {B, E, M, S}. As shown above, the probabilities of E and M are both 0, which matches reality: the first character of a sentence can only be the first character of a word (B) or a single-character word (S).
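
As a quick aside (not from the original article), the reason for working with logarithms can be shown in a few lines of C++: multiplying many small probabilities underflows, adding their logarithms does not, and a huge negative constant such as -3.14e+100 stands in for log(0).

#include <cmath>
#include <cstdio>

int main() {
    const double MIN_DOUBLE = -3.14e+100;  // stands in for log(0), as in the model file

    double p1 = 0.6, p2 = 0.3;
    // In probability space we would multiply; in log space we add.
    double logProduct = std::log(p1) + std::log(p2);
    std::printf("p1*p2 = %f   exp(log p1 + log p2) = %f\n", p1 * p2, std::exp(logProduct));

    // A path that passes through an impossible event (probability 0) keeps a
    // score near MIN_DOUBLE and can never win a max comparison.
    std::printf("score through an impossible transition: %g\n", MIN_DOUBLE + std::log(p2));
}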

TransProbMatrix
The transition probability is a key concept of Markov chains. Anyone who has studied probability theory knows that the defining feature of a Markov chain is that the current state Status(i) at time T=i depends only on the n states before time i, that is:

{Status(i-1), Status(i-2), Status(i-3), … Status(i - n)}

Furthermore, the HMM model rests on three basic assumptions (see the remarks at the end of the article), one of which is the [limited history assumption]: the Markov chain has n=1, i.e., Status(i) depends only on Status(i-1). This assumption greatly simplifies the problem.

Returning to the TransProbMatrix: it is just a 4x4 two-dimensional matrix (4 is the size of the state value set). An example follows.

The rows and columns of the matrix are both ordered B, E, M, S. (The values are logarithms of probabilities, don't forget.)

-3.14e+100          -0.510825623765990   -0.916290731874155   -3.14e+100
-0.5897149736854513 -3.14e+100           -3.14e+100           -0.8085250474669937
-3.14e+100          -0.33344856811948514 -1.2603623820268226  -3.14e+100
-0.7211965654669841 -3.14e+100           -3.14e+100           -0.6658631448798212

For example, TransProbMatrix[0][0] represents the probability of transitioning from state B to state B. Since TransProbMatrix[0][0] = -3.14e+100, this transition probability is 0, which matches common sense: from the meanings of the states, the next state after B can only be M or E, never B or S, so every impossible transition gets probability 0, i.e., a log value of negative infinity, recorded here as -3.14e+100.

From the TransProbMatrix above, the possible next states of each state, and the corresponding transition probabilities, are as follows:

#B
E:-0.510825623765990, M:-0.916290731874155
#E
B:-0.5897149736854513, S:-0.8085250474669937
#M
E:-0.33344856811948514, M:-1.2603623820268226
#S
B:-0.7211965654669841, S:-0.6658631448798212
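
In code, the TransProbMatrix is nothing more than a 4x4 array of log probabilities indexed by (previous state, next state) in B/E/M/S order. A minimal sketch using the values quoted above (the variable names are mine, not cppjieba's):

#include <cstdio>

enum Status { B = 0, E = 1, M = 2, S = 3 };

const double MIN_DOUBLE = -3.14e+100;  // log(0)

// Row = previous state, column = next state, order B E M S.
// Values are the log probabilities quoted in the article.
const double transProb[4][4] = {
    /* B */ {MIN_DOUBLE, -0.510825623765990, -0.916290731874155, MIN_DOUBLE},
    /* E */ {-0.5897149736854513, MIN_DOUBLE, MIN_DOUBLE, -0.8085250474669937},
    /* M */ {MIN_DOUBLE, -0.33344856811948514, -1.2603623820268226, MIN_DOUBLE},
    /* S */ {-0.7211965654669841, MIN_DOUBLE, MIN_DOUBLE, -0.6658631448798212},
};

int main() {
    // B -> E is allowed; B -> B is not (its log probability is effectively -infinity).
    std::printf("log P(E|B) = %f\n", transProb[B][E]);
    std::printf("log P(B|B) = %g\n", transProb[B][B]);
}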


EmitProbMatrix
The emission probability (EmitProb) here is actually just a conditional probability. According to the [observation independence assumption], one of the three basic assumptions of the HMM model (see the remarks at the end of the article), an observed value depends only on the current state value, that is:

P(Observed[i], Status[j]) = P(Status[j]) * P(Observed[i] | Status[j])
where P(Observed[i] | Status[j]) is the value obtained from the EmitProbMatrix.

An example of the EmitProbMatrix is as follows (each entry is a single Chinese character together with the logarithm of its emission probability):

#B
Yao: -10.460283, She: -8.766406, Talk: -8.039065, Yi: -7.682602, Cave: -8.668696, …
#E
Yao: -9.266706, She: -9.096474, Talk: -8.435707, Yi: -10.223786, Hole: -8.366213, …
#M
Yao: -8.47651, Shed: -10.560093, Talk: -8.345223, Yi: -8.021847, Hole: -9.547990, …
#S
Qi: -10.005820, She: -10.523076, Hui: -15.269250, 禑: -17.215160, Hole: -8.369527, …
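
One straightforward way to hold the EmitProbMatrix in code is one hash map per state, keyed by the character (as a UTF-8 string) and storing the log emission probability. A small sketch under that assumption (getEmitProb and the handling of unseen characters are mine, not cppjieba's):

#include <array>
#include <cstdio>
#include <string>
#include <unordered_map>

enum Status { B = 0, E = 1, M = 2, S = 3 };
const double MIN_DOUBLE = -3.14e+100;  // log(0): character never emitted from this state

// One map per state: character (as a UTF-8 string) -> log emission probability.
using EmitRow = std::unordered_map<std::string, double>;
std::array<EmitRow, 4> emitProb;

// Look up log P(observed | state), treating unseen characters as (near)
// impossible; how unseen characters are smoothed is a separate design choice.
double getEmitProb(Status s, const std::string& ch) {
    const EmitRow& row = emitProb[s];
    auto it = row.find(ch);
    return it == row.end() ? MIN_DOUBLE : it->second;
}

int main() {
    emitProb[B]["小"] = -5.79545;  // value quoted later for Status(B) -> Observed(小)
    std::printf("log P(小 | B) = %f\n", getEmitProb(B, "小"));
    std::printf("log P(unseen | B) = %g\n", getEmitProb(B, "龘"));
}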

At this point, all five elements of the HMM model have been introduced. Assuming the concrete probability values of these parameters are already at hand and loaded (that is, we have the model's dictionary; see hmm_model.utf8 in HMMDict for details), the only thing still missing is the Viterbi algorithm itself. So let's talk about the Viterbi algorithm.

Viterbi algorithm of HMM Chinese word segmentation

Input example:
小明硕士毕业于中国科学院计算所 (Xiao Ming graduated with a master's degree from the Institute of Computing Technology, Chinese Academy of Sciences)

The calculation process of the Viterbi algorithm is as follows:

Define the variables

Two-dimensional array weight[4][15]: 4 is the number of states (0:B, 1:E, 2:M, 3:S), and 15 is the number of characters in the input sentence. For example, weight[0][2] represents the best score (log probability) for the third character, 硕 ('Shuo'), being in state B.

Two-dimensional array path[4][15]: 4 is the number of states (0:B, 1:E, 2:M, 3:S), and 15 is the number of characters in the input sentence. path[0][2] records the state of the previous character when weight[0][2] reaches its maximum value; for example, path[0][2] = 1 means that when weight[0][2] is maximal, the previous character (明, 'Ming') is in state E. The previous character's state is recorded so that, after the Viterbi algorithm has computed the complete weight[4][15], the sentence can be backtracked from right to left to recover the corresponding state sequence.

Use InitStatus to initialize the weight two-dimensional array

The known InitStatus is as follows: (Step 1: Initialization)

#B
-0.26268660809250016
#E
-3.14e+100
#M
-3.14e+100
#S
-1.4652633398537678

And from the EmitProbMatrix we obtain:

Status(B) -> Observed(小) : -5.79545
Status(E) -> Observed(小) : -7.36797
Status(M) -> Observed(小) : -5.09518
Status(S) -> Observed(小) : -6.2475

So the values of weight[j][0] can be initialized as follows:

weight[0][0] = -0.26268660809250016 + -5.79545 = -6.05814
weight[1][0] = -3.14e+100 + -7.36797 = -3.14e+100
weight[2][0] = -3.14e+100 + -5.09518 = -3.14e+100
weight[3][0] = -1.4652633398537678 + -6.2475 = -7.71276

Note that the above formula is calculated by adding instead of multiplying, because the logarithm was taken before.
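
Written out as code, the initialization step is just one addition per state. A self-contained sketch using the values quoted above (the array names initStatus and emitXiao are mine):

#include <cstdio>

int main() {
    // The initialization step (step 1) in code; states are ordered 0:B, 1:E, 2:M, 3:S.
    const double initStatus[4] = {-0.26268660809250016, -3.14e+100,
                                  -3.14e+100, -1.4652633398537678};
    // log P(小 | state) for each state, as read from the EmitProbMatrix above.
    const double emitXiao[4] = {-5.79545, -7.36797, -5.09518, -6.2475};
    const char* names[4] = {"B", "E", "M", "S"};

    double weightFirst[4];
    for (int j = 0; j < 4; ++j) {
        // Log probabilities are added where probabilities would be multiplied.
        weightFirst[j] = initStatus[j] + emitXiao[j];
        std::printf("weight[%s][0] = %g\n", names[j], weightFirst[j]);
    }
}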

Traverse the sentence to calculate the entire weight two-dimensional array (step 2: induction calculation)

// Traverse the sentence; index i starts at 1 because column 0 was already
// initialized above.
for(size_t i = 1; i < 15; i++)
{
    // Traverse the possible states of the current character
    for(size_t j = 0; j < 4; j++) 
    {
        weight[j][i] = MIN_DOUBLE;  // MIN_DOUBLE: a very small value standing in for log(0)
        path[j][i] = -1;
        // Traverse the possible states of the previous character
        for(size_t k = 0; k < 4; k++)
        {
            double tmp = weight[k][i-1] + _transProb[k][j] + _emitProb[j][sentence[i]];
            if(tmp > weight[j][i]) // keep the largest weight[j][i]
            {
                weight[j][i] = tmp;
                path[j][i] = k;
            }
        }
    }
}

After this traversal, weight[4][15] and path[4][15] are calculated. (Step 3: Termination)

Determine the boundary conditions and backtrack the path (Step 4: Path backtracking)
The boundary conditions are as follows:

For any sentence, the state of the last character can only be E or S, not M or B.
So in this example, we only need to compare weight[1(E)][14] and weight[3(S)][14].

In this example:

weight[1][14] = -102.492;
weight[3][14] = -101.632;
Since weight[3][14] > weight[1][14], S beats E, so the starting point for path backtracking is path[3][14].

The backtracking path is:
SEBEMBEBEMBEBEB

In reverse order:
BE/BE/BME/BE/BME/BE/S

So the result of word segmentation is:
小明 / 硕士 / 毕业于 / 中国 / 科学院 / 计算 / 所 (Xiao Ming / master's / graduated from / China / Academy of Sciences / Computing / Institute)
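
For completeness, the backtracking step itself is only a few lines. A compilable sketch under the same assumptions as the loop above (weight and path fully filled in, states 0:B, 1:E, 2:M, 3:S); the function name backtrack and the vector-of-vectors types are mine, not cppjieba's:

#include <string>
#include <vector>

// A sketch of the backtracking step (step 4). It assumes weight[4][n] and
// path[4][n] have been filled in by the loop above. Returns the B/M/E/S
// sequence in left-to-right (sentence) order.
std::string backtrack(const std::vector<std::vector<double>>& weight,
                      const std::vector<std::vector<int>>& path) {
    static const char tags[4] = {'B', 'E', 'M', 'S'};
    const size_t last = weight[0].size() - 1;

    // Boundary condition: the last character can only be in state E or S.
    int state = (weight[1][last] > weight[3][last]) ? 1 : 3;

    std::string reversed;
    for (size_t i = last; ; --i) {
        reversed += tags[state];   // record the state of character i
        if (i == 0) break;
        state = path[state][i];    // move to the best predecessor state
    }
    // The states were collected from right to left, so reverse them.
    return std::string(reversed.rbegin(), reversed.rend());
}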

At this point, the process of an HMM model Chinese word segmentation algorithm is completed.

That is, once the model is loaded, we only need to run the Viterbi algorithm once to find the state of each character, and then segment the sentence according to those states.

Model training problem

Everything above assumes word segmentation is performed with an already trained HMM model, that is, the three key parameters InitStatus, TransProbMatrix, and EmitProbMatrix are known; nothing has been said about how they are obtained. In fact, these three parameters are estimated from a corpus that has already been segmented, by counting the relevant frequencies and conditional probabilities. I will write a separate article on my blog introducing how to train these three parameters.
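
Although the details are left to that later article, the counting idea can be sketched: tag every character of an already segmented corpus with B/M/E/S, count how often each state starts a sentence, how often each state follows each other state, and how often each character is emitted from each state, then normalize and take logarithms. A rough C++ sketch of that idea (all names are mine; no smoothing is applied):

#include <cmath>
#include <map>
#include <string>
#include <vector>

// One character is kept as a UTF-8 string; a word is a sequence of characters;
// a training sentence is a sequence of already-segmented words.
using Word = std::vector<std::string>;
using Sentence = std::vector<Word>;

// Tag the characters of one word with B/M/E/S (0:B, 1:E, 2:M, 3:S).
static std::vector<int> tagWord(const Word& w) {
    if (w.size() == 1) return {3};        // S: single-character word
    std::vector<int> tags(w.size(), 2);   // M by default
    tags.front() = 0;                     // B
    tags.back() = 1;                      // E
    return tags;
}

// Estimate InitStatus, TransProbMatrix and EmitProbMatrix by counting over a
// segmented corpus and taking logarithms. A zero count simply becomes
// -3.14e+100, as in the model file.
void train(const std::vector<Sentence>& corpus,
           double initStatus[4], double transProb[4][4],
           std::map<std::string, double> emitProb[4]) {
    const double LOG_ZERO = -3.14e+100;
    double initCount[4] = {0}, transCount[4][4] = {{0}}, stateCount[4] = {0};
    std::map<std::string, double> emitCount[4];
    double sentences = 0;

    for (const Sentence& sent : corpus) {
        std::vector<int> tags;
        std::vector<std::string> chars;
        for (const Word& w : sent) {
            std::vector<int> t = tagWord(w);
            tags.insert(tags.end(), t.begin(), t.end());
            chars.insert(chars.end(), w.begin(), w.end());
        }
        if (tags.empty()) continue;
        ++sentences;
        ++initCount[tags[0]];
        for (size_t i = 0; i < tags.size(); ++i) {
            ++stateCount[tags[i]];
            ++emitCount[tags[i]][chars[i]];
            if (i + 1 < tags.size()) ++transCount[tags[i]][tags[i + 1]];
        }
    }

    for (int j = 0; j < 4; ++j) {
        initStatus[j] = initCount[j] > 0 ? std::log(initCount[j] / sentences) : LOG_ZERO;
        for (int k = 0; k < 4; ++k)
            transProb[j][k] = transCount[j][k] > 0
                                  ? std::log(transCount[j][k] / stateCount[j])
                                  : LOG_ZERO;
        for (const auto& kv : emitCount[j])
            emitProb[j][kv.first] = std::log(kv.second / stateCount[j]);
    }
}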

References:
[1]. Detailed explanation of the HMM model for Chinese word segmentation
[2]. How to explain the hidden Markov model with a simple and easy-to-understand example?
[3]. cppjieba dictionary, GitHub
