HMM, Viterbi and Chinese word segmentation

Study notes compiled from practical work experience together with information gathered from the Internet and books, mainly covering the HMM (Hidden Markov Model), the Viterbi algorithm, and their application to Chinese word segmentation.

Foreword

        We used keyword extraction plus simhash to deduplicate a question bank, and keyword extraction in turn requires Chinese word segmentation. A basic approach is to segment against a dictionary, but no dictionary can ever be complete. To make sure that words not yet recorded in the dictionary (out-of-vocabulary, or "unregistered", words) are still handled correctly, some understanding of how segmenters deal with them is needed. This article is a summary of what the author learned about Chinese word segmentation while deduplicating the question bank, recorded here for later readers to consult and discuss.

This article introduces the HMM (Hidden Markov Model) and the Viterbi algorithm, mainly for readers who are not well versed in probability theory but want a better understanding of how Chinese word segmentation tools work; hopefully it also offers a way into related algorithms. Opinions and suggestions from professionals are very welcome.

At the same time, I hope readers who are not strong in mathematics will not be put off. The author is no expert either, so HMM, Viterbi, and their relationship to Chinese word segmentation are introduced as simply as possible, following the author's own learning process and concrete examples.

Main text

Before formally introducing the HMM, we first need to understand the Markov process and the Markov chain.

Markov process

A property: a stochastic process has the Markov property if, given the current state and all past states, the conditional probability distribution of its future states depends only on the current state; that is, the future is conditionally independent of the history (the path the process took to get here).

A definition: A process that satisfies the Markov property is called a Markov process

DTMC (Discrete-Time Markov Chain)

A definition: a Markov process in which both time and state are discrete.

An important property: given the current state of the system, its future evolution does not depend on its past evolution.

A general description: assuming the current time is T, the probability that the state at time T+1 takes a certain value depends only on the state at T, and has nothing to do with the states at times [0, T-1] (a first-order Markov chain).
$$P\{X_{n+1}=i_{n+1} \mid X_0=i_0, X_1=i_1, \ldots, X_n=i_n\} = P\{X_{n+1}=i_{n+1} \mid X_n=i_n\}$$
A matrix: this is in fact the transition probability matrix (and, iterated, the n-step transition probability matrix). A detailed explanation is a bit involved and can easily be looked up elsewhere; skipping it will not affect the rest of this article.
$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} & \cdots \\ P_{21} & P_{22} & \cdots & P_{2n} & \cdots \\ P_{31} & P_{32} & \cdots & P_{3n} & \cdots \end{bmatrix}$$
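As a small, purely illustrative sketch of a first-order Markov chain, the code below encodes a made-up two-state transition matrix and applies one step of the chain; the states ("sunny"/"rainy") and all numbers are hypothetical and only serve to show how rows of the matrix are used.

```python
import numpy as np

# Hypothetical two-state Markov chain: states "sunny" and "rainy".
# Row i holds P(next state = j | current state = i); each row sums to 1.
P = np.array([
    [0.8, 0.2],   # from "sunny"
    [0.4, 0.6],   # from "rainy"
])

# Current distribution over states: 100% "sunny".
v = np.array([1.0, 0.0])

# One step of the chain: tomorrow's distribution depends only on today's.
print(v @ P)        # [0.8 0.2]
print(v @ P @ P)    # two steps: the n-step transition matrix in action
```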
        With the Markov process and the DTMC in mind, it becomes much easier to understand what an HMM (Hidden Markov Model) is.

HMM (Hidden Markov Model)

        Before the formal introduction, let us start from an example: we take a piece of text and segment it with an HMM.

"我爱梅沙"

Next, we analyze according to the process of HMM.

  1. For the characters in the sentence, we define four states (the hidden state space S):

    • B (begin): the character is the first character of a word
    • M (middle): the character is in the middle of a word
    • E (end): the character is the last character of a word
    • S (single): the character forms a word by itself
  2. All the characters of the sentence together form the observation sequence (observation states), namely: O = {我, 爱, 梅, 沙}.

  3. When we see a character (an observation), there is some probability that its state is each of B, M, E, S. These probabilities form the initial probability vector (its values were obtained in advance from statistics over a large corpus). The initial probabilities do not depend on which character is observed, so this is a 1x4 probability vector (the π vector / initial probability matrix). Here it is as follows:

    {
      "B": -0.26268660809250016,
      "E": -3.14e+100,
      "M": -3.14e+100,
      "S": -1.4652633398537678
    }
    
  4. There is also a state transition probability matrix: the probability that the next character's state is B/M/E/S given that the current character's state is B/M/E/S (a 4x4 matrix, again obtained from statistics over a large corpus), namely:

    $$A = \begin{bmatrix} A_{bb} & A_{bm} & A_{be} & A_{bs} \\ A_{mb} & A_{mm} & A_{me} & A_{ms} \\ A_{eb} & A_{em} & A_{ee} & A_{es} \\ A_{sb} & A_{sm} & A_{se} & A_{ss} \end{bmatrix}$$

    In practice many of these transition probabilities are 0 (for example, B can only be followed by M or E), so the matrix actually stored is as follows:

    {
      "B": {
        "E": -0.510825623765990,
        "M": -0.916290731874155
      },
      "E": {
        "B": -0.5897149736854513,
        "S": -0.8085250474669937
      },
      "M": {
        "E": -0.33344856811948514,
        "M": -1.2603623820268226
      },
      "S": {
        "B": -0.7211965654669841,
        "S": -0.6658631448798212
      }
    }
    
  5. Finally, there is an emission probability matrix (also called the confusion matrix): for each hidden state, the probability of emitting each character (again obtained from statistics over a large corpus), namely { B_ij | i ∈ [0x4E00, 0x9FA5], j ∈ {B, M, E, S} }. For the characters of our example, the relevant part of the emission matrix is as follows:

    {
        "B": {
            "我": 1.4614045662995514,
            "爱": 1.968025153941063,
            "梅": 2.237194588185915,
            "沙": 1.983966134924789
        },
        "E": {
            "我": 2.8844153800921876e+101,
            "爱": 2.5467388205842573e+101,
            "梅": 2.5483227218706336e+101,
            "沙": 2.6413544826999272e+101
        },
        "M": {
            "我": 2.6778005616524882e+101,
            "爱": 2.2547330469174095e+101,
            "梅": 2.5528428570784386e+101,
            "沙": 2.321741847245321e+101
        },
        "S": {
            "我": 6.611019698336738,
            "爱": 11.146923368528606,
            "梅": 14.546547456418994,
            "沙": 13.526900849382743
        }
    }
    

        In this way we obtain the HMM five-tuple (S, O, Π, A, B): two state sets and three probability sets.

  1. Hidden state set S: the set of hidden states, whose evolution is described by a Markov process
  2. Observation state set O: the set of states that can be observed directly
  3. (Hidden) state transition matrix A: the matrix of transition probabilities between hidden states
  4. Initial probability vector Π (π_i): the probability of each hidden state at the initial position; a 1xN vector, where N is the size of the hidden state space
  5. Confusion matrix / emission probability matrix B: for each hidden state, the probability of emitting each observation (the emission probabilities)
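To make the five-tuple concrete, here is a minimal sketch that collects the pieces for the example sentence into plain Python dictionaries. All numeric values are placeholders written in the same log-probability style as the tables above, not the real corpus statistics.

```python
# A minimal sketch of the HMM five-tuple (S, O, Pi, A, B) for the example sentence.
# All numeric values below are placeholders, not the real corpus statistics.

S = ["B", "M", "E", "S"]           # hidden states
O = list("我爱梅沙")                 # observation sequence

Pi = {"B": -0.26, "E": -3.14e100, "M": -3.14e100, "S": -1.46}   # initial log-probs

A = {                              # hidden-state transition log-probs
    "B": {"E": -0.51, "M": -0.92},
    "E": {"B": -0.59, "S": -0.81},
    "M": {"E": -0.33, "M": -1.26},
    "S": {"B": -0.72, "S": -0.67},
}

B = {                              # emission log-probs: P(character | state)
    "B": {"我": -5.2, "爱": -6.1, "梅": -8.3, "沙": -7.9},
    "E": {"我": -6.0, "爱": -5.7, "梅": -7.5, "沙": -7.1},
    "M": {"我": -7.4, "爱": -6.9, "梅": -8.8, "沙": -8.2},
    "S": {"我": -3.5, "爱": -6.6, "梅": -9.1, "沙": -8.7},
}
```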

        After determining each part of the model, we can proceed step by step. The general idea is:

  1. Calculate the initial state
  2. Calculate the transition probabilities for the second, third and fourth characters in turn
  3. Find the most probable path (dynamic programming)

Why are the probability results obtained by statistics above all negative?

        This comes down to the underflow problem in floating-point arithmetic. For ordinary computations the error caused by underflow is negligible, but in Chinese word segmentation the character set is huge ([0x4E00, 0x9FA5]) and every probability lies in [0, 1]; when many such probabilities are multiplied together, even a little underflow can ruin the result of the whole model. It is therefore common practice to work with logarithms of probabilities in high-precision probability calculations (see the sketch after the list below). A quick look at the graph of the logarithmic function explains why the stored values are all negative. But why log and not some other function? Personally I think the main reasons are the following points, which may not be stated perfectly; corrections and additions are welcome:

  1. Floating-point multiplication and division can be converted into addition and subtraction of logarithms, which reduces overflow/underflow and the accumulation of rounding error:

    $$\log_{\alpha} MN = \log_{\alpha} M + \log_{\alpha} N$$

    $$\log_{\alpha} \frac{M}{N} = \log_{\alpha} M - \log_{\alpha} N$$

  2. As x approaches 0, log x approaches negative infinity, so even vanishingly small probabilities can still be represented

  3. For any x ∈ (0, 1], a change in x produces a clearly visible change in log x, so nearby probabilities remain distinguishable
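Here is a short sketch of the underflow problem and the log-space fix; the probabilities are arbitrary small numbers chosen purely for illustration.

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point.
probs = [1e-30] * 20
product = 1.0
for p in probs:
    product *= p
print(product)                      # 0.0 -- the true value (1e-600) underflowed

# Working with log-probabilities turns the product into a sum and stays finite.
log_product = sum(math.log(p) for p in probs)
print(log_product)                  # about -1381.55, i.e. log(1e-600)
```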

Calculate the initial state

        According to the HMM five-tuple above, combined with the formula

$$a_i(j) = \pi(j)\, b_{j k_i}, \quad i \in N,\ j \in \{B, E, M, S\},\ k_i \in \{我, 爱, 梅, 沙\}$$

the following values can be computed:

{
    "B": {
        "我": 1.1851974272161176,
        "爱": 1.9983762718297673,
        "梅": 2.388164278255412,
        "沙": 2.5132210235744683
    },
    "E": {
        "我": 1.4167147493671037e+101,
        "爱": 2.388740537292971e+101,
        "梅": 2.854670014651611e+101,
        "沙": 3.004155434998415e+101
    },
    "M": {
        "我": 1.4167147493671037e+101,
        "爱": 2.388740537292971e+101,
        "梅": 2.854670014651611e+101,
        "沙": 3.004155434998415e+101
    },
    "S": {
        "我": 6.611019698336738,
        "爱": 11.146923368528606,
        "梅": 13.321157069582242,
        "沙": 14.018722376196262
    }
}
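As a sketch of this initial step, the code below combines the π vector with the emission entries for the first character. It assumes, as jieba does, that all stored values are log-probabilities, so multiplication becomes addition; the numbers are placeholders, not the real statistics.

```python
MIN_LOG = -3.14e100          # placeholder for log(0), as in the tables above

# Placeholder log-probabilities (not the real statistics) for the first character "我".
Pi = {"B": -0.26, "E": MIN_LOG, "M": MIN_LOG, "S": -1.47}
B_emit = {"B": {"我": -5.2}, "E": {}, "M": {}, "S": {"我": -3.5}}

first_char = "我"
# Initial trellis column: log pi(j) + log b_j(first_char) for each hidden state j.
V0 = {state: Pi[state] + B_emit[state].get(first_char, MIN_LOG)
      for state in ("B", "M", "E", "S")}
print(V0)
```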

Calculate the state transition probability matrix

        With the initial state above, the probabilities of all arcs in the diagram below can be obtained iteratively from the HMM five-tuple. The iteration formula for step t is as follows (in fact it can be generalized to the n-step transition formula of a Markov chain).
$$P_{t+1}(s) = b_{s,o_{t+1}} \sum_{i=0}^{n} a_t(s_i)\, a_{s_i s}, \quad s \in S$$

        Taking the arc probabilities as weights and finding the path with the maximum total weight gives the result. The calculation is too long to list here; after testing with code, the segmentation result for our example is "我 / 爱梅沙". The candidate character/state pairs (the nodes of the trellis) are:

我/B
我/S
爱/B
爱/E
爱/M
爱/S
梅/B
梅/E
梅/M
梅/S
沙/S
沙/E
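For completeness, here is a small helper (hypothetical, not part of any library) that turns a character sequence plus its BMES tags into segmented words; applied to the most probable tag sequence S, B, M, E it reproduces "我 / 爱梅沙".

```python
def tags_to_words(chars, tags):
    """Turn BMES tags into segmented words: B starts a word, M continues it,
    E closes it, S is a single-character word."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = ch
        elif tag == "M":
            buf += ch
        else:                        # "E"
            words.append(buf + ch)
            buf = ""
    return words

print(tags_to_words("我爱梅沙", ["S", "B", "M", "E"]))   # ['我', '爱梅沙']
```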

        Not every prediction problem can be solved with an HMM. In fact, the HMM relies on three important assumptions (my personal understanding: problems that satisfy these three assumptions can be handled with an HMM).

  1. The Markov assumption, i.e. the first-order Markov chain mentioned above: the probability of being in a state depends only on the previous state and not on earlier states

    $$P\{X_{n+1}=i_{n+1} \mid X_0=i_0, X_1=i_1, \ldots, X_n=i_n\} = P\{X_{n+1}=i_{n+1} \mid X_n=i_n\}$$

  2. The stationarity assumption: the state transition probabilities are independent of time

    $$P(X_{i+1} \mid X_i) = P(X_{j+1} \mid X_j) \quad \text{for any } i, j$$

  3. The output independence assumption: the observation at each time depends only on the hidden state at that time, and is not affected by earlier inputs, outputs, or by time itself

    $$P(O_1, \ldots, O_T \mid X_1, \ldots, X_T) = \prod_{t=1}^{T} P(O_t \mid X_t)$$

Viterbi: an optimized way to solve HMM decoding

        An HMM has three probability matrices, and under UTF-8 the number of Chinese characters is huge. As the length of the sentence to be segmented grows, evaluating every path through the confusion matrix and the state transition matrix and then searching for the maximum-probability path would make both time and space costs blow up. The Viterbi algorithm solves this: when certain constraints are met, we do not need to compute the probability (weight) of every arc in the DAG. The constraints, as I would summarize them (I am not sure this is fully accurate; corrections and additions are welcome), are:

  1. We are looking for the maximum-weight (longest) path in the network
  2. The choice at each node depends only on the best node of the previous layer and the input at the current node

        Combining this with the three assumptions of the HMM, it is easy to see that Viterbi is well suited to finding the maximum-probability path of an HMM. The details are not repeated here; the general idea is that the initial probabilities are computed exactly as described above, and every later step keeps, for each state, only the best-scoring node of the previous layer. This is a recursive process, and the segmentation result is recovered by backtracking at the end. The main effect of the Viterbi algorithm is to avoid the computation spent on irrelevant paths (personally this feels a bit like a greedy strategy, but a greedy result is not necessarily optimal, whereas the Viterbi result is the optimal solution).
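Below is a minimal Viterbi sketch in the spirit of the description above. The model at the bottom uses made-up log-probabilities (not the real corpus statistics), and jieba's real implementation additionally restricts which states may legally follow which; the recursion, however, is the same idea.

```python
MIN_LOG = -3.14e100   # stand-in for log(0), matching the tables above

def viterbi(obs, states, pi, trans, emit):
    """Return (best log-probability, best hidden-state path) for the observation sequence."""
    # V[t][s] = best log-probability of any path that ends in state s at time t.
    V = [{s: pi.get(s, MIN_LOG) + emit[s].get(obs[0], MIN_LOG) for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Only the best predecessor is kept -- this is what saves the work
            # compared with scoring every possible path.
            prob, prev = max(
                (V[t - 1][p] + trans[p].get(s, MIN_LOG) + emit[s].get(obs[t], MIN_LOG), p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    last = max(V[-1], key=V[-1].get)
    return V[-1][last], path[last]

# Placeholder model (made-up log-probabilities, not the real statistics):
states = ("B", "M", "E", "S")
pi = {"B": -0.7, "S": -0.7}                       # M and E cannot start a word
trans = {"B": {"M": -1.2, "E": -0.4}, "M": {"M": -1.2, "E": -0.4},
         "E": {"B": -0.9, "S": -0.6}, "S": {"B": -0.9, "S": -0.6}}
emit = {s: {c: -3.0 for c in "我爱梅沙"} for s in states}

print(viterbi("我爱梅沙", states, pi, trans, emit))
```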

About the Viterbi Algorithm

         Many people explain the Viterbi algorithm entirely in terms of the Hidden Markov Model. In fact, Viterbi is just one way to solve the third HMM problem (finding the most likely state sequence for a given observation sequence). That problem can also be solved by other methods (for example, exhaustive search), and the Viterbi algorithm can be used to solve problems other than the third HMM problem. So do not equate the Viterbi algorithm with the Hidden Markov Model.

         The Viterbi algorithm is really the optimal-choice problem of a multi-step, multi-choice model: at every step, for each choice, it keeps the minimum total cost (or maximum value) accumulated over all previous steps together with the cost of the current step, and carries that forward to the next step's choices. After all steps have been computed in turn, the optimal path is found by backtracking. Anything that fits this model can be solved with the Viterbi algorithm; the third HMM problem happens to fit this model, which is why the Viterbi algorithm is used there.

——Quoting Zhihu Answer [4]

Why Viterbi and not topological sort?

        My personal understanding is: topological-sort-based path selection deals with a single source and multiple sinks, whereas the network in an HMM is multi-source (the initial state space: each initial state is a possible starting point), so topological sorting does not directly apply to the HMM computation. I do not find this explanation entirely satisfying either; corrections are welcome.

In addition, the keywords weighted and directed acyclic graph easily bring the AOE network to mind. Note that an AOE network describes a single-source, single-sink, weighted, directed acyclic graph; the critical-path selection of an AOE network does not apply here, and the two should not be confused.

Jieba word segmentation

        With the above, I believe everyone now has some understanding of how unregistered words are analysed; for more detail it is recommended to read reference [1] at the end of this article. Of course, this is only one common way of handling unregistered words. In practice we can obtain a very large dictionary (ICTCLAS offers a free lexicon containing some 30 million words), or build our own dictionary from existing data and the HMM. Jieba first builds a prefix dictionary from the dictionary file; when segmenting, it scans the sentence against the prefix dictionary to enumerate every possible word, uses each word's word-formation probability as a weight to build a DAG (directed acyclic graph), and finally uses dynamic programming, traversing from back to front, to find the maximum-weight path. The word-formation probability of unregistered words is computed with the HMM model using the Viterbi algorithm.
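The dynamic-programming step over the DAG can be sketched roughly as follows. This is a simplified illustration of the idea, not jieba's actual code; the DAG, the candidate words, and the frequencies are all made up, and word_logprob is an assumed helper that returns the log word-formation probability.

```python
import math

def best_route(sentence, dag, word_logprob):
    """Right-to-left dynamic programming over the DAG: for each start position keep
    the best (log-probability, word-end index) continuation -- a simplified
    illustration of the maximum-probability-path idea, not jieba's actual code."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (word_logprob(sentence[start:end + 1]) + route[end + 1][0], end)
            for end in dag[start]
        )
    return route

# Made-up example: each position maps to the possible end positions of candidate words.
sentence = "我爱梅沙"
dag = {0: [0], 1: [1, 3], 2: [2, 3], 3: [3]}          # e.g. "爱梅沙" is a candidate word
freq = {"我": 0.01, "爱": 0.005, "梅": 0.001, "沙": 0.001, "爱梅沙": 0.0001, "梅沙": 0.0008}
word_logprob = lambda w: math.log(freq.get(w, 1e-12))

route = best_route(sentence, dag, word_logprob)
# Read the segmentation back out by following the stored end indices from position 0.
i, words = 0, []
while i < len(sentence):
    end = route[i][1]
    words.append(sentence[i:end + 1])
    i = end + 1
print(words)    # ['我', '爱梅沙'] with these made-up frequencies
```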

        This basic process is actually described in jieba's README. But for readers who do not know HMM, the sentence "For unregistered words, an HMM model based on the ability of Chinese characters to form words is used, together with the Viterbi algorithm" can be confusing. With the background above, going back to the algorithm description in jieba's README and then to the corresponding code should feel much clearer.

        In addition, jieba represents the DAG in a cross-linked-list-like fashion (each word is regarded as an arc: the position of the word's first character is the tail of the arc and the position of its last character is the head). You can find the get_DAG method in jieba; printing its return value will help you understand this storage structure and the subsequent traversal.
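The sketch below shows one way to print that return value. It assumes the Python implementation of jieba, where the default tokenizer object jieba.dt carries get_DAG as an internal, undocumented method; the exact name and location may differ between versions or ports.

```python
import jieba

# Assumption: in the Python implementation the default tokenizer jieba.dt has an
# internal get_DAG method; this is not a documented public API and the name may
# differ in other versions or ports of jieba.
jieba.initialize()                 # make sure the prefix dictionary is loaded
dag = jieba.dt.get_DAG("我爱梅沙")

# Key = character position; value = list of positions where a word that starts
# here may end, i.e. the arcs of the DAG.
for start, ends in sorted(dag.items()):
    print(start, ends)
```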

Reference https://github.com/fxsjy/jieba/blob/master/README.md#algorithm

  • Build a prefix dictionary and perform efficient word-graph scanning to generate a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the sentence
  • Use dynamic programming to find the maximum-probability path, i.e. the best segmentation based on word frequency
  • For unregistered words, use an HMM model based on the ability of Chinese characters to form words, solved with the Viterbi algorithm

        Above we mentioned that the hidden states and the probability matrices were obtained from statistics. The same is true of many other parameters; for details, see jieba issue #7 on how the data was generated. Note that the answer in that issue dates from 2012; in 2013 (the latest update) the author also released a major dictionary update, so the current data sources are more comprehensive than those described in issue #7.

Vertical-domain optimization

        Having understood the basic process of HMM and of jieba, we can see that although jieba already ships with an HMM implementation for parsing unregistered words, achieving higher Precision and Recall in a production environment still requires us to obtain a vertical-domain lexicon, compute the corresponding word frequencies, and adjust the segmentation rules, so as to improve the accuracy of text processing in vertical domains (especially for text with domain-specific characteristics, such as mathematics, physics, or computer programming).
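In the Python version of jieba, this kind of vertical-domain tuning is typically done with a user dictionary and frequency adjustments. A hedged sketch follows; the dictionary path, the words, and the frequencies are placeholders chosen for illustration.

```python
import jieba

# Placeholder path: a domain dictionary with one "word [freq] [POS-tag]" entry per line.
jieba.load_userdict("math_terms.dict")

# Individual adjustments are also possible:
jieba.add_word("勾股定理", freq=20000)            # register a domain term directly
jieba.suggest_freq("等差数列", tune=True)          # encourage keeping this phrase whole

print("/".join(jieba.cut("用勾股定理证明这个等差数列问题")))
```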

Summary

        Regarding the application of HMM to word segmentation, my personal understanding is that it simplifies the Chinese word segmentation problem using ideas from probability and statistics. While learning, we should not think that HMM was chosen arbitrarily, or merely "to handle unregistered Chinese words"; that view easily leads to confusion later on. A more reasonable view is that the characteristics of Chinese text happen to fit the three assumptions of the HMM, and that is why HMM was chosen. Are there other ways to solve the problem? Yes, for example LSTM [1].

        A little history helps: early natural language processing was based on grammatical rules (which is why the earliest translation tools felt so clumsy, being little more than word-for-word mapping). As computer performance improved and related techniques developed, systems based on probability and statistics turned out to perform better. But we should not conclude that rule-based systems are hopeless just because "practice has shown that segmentation systems based on hand-written rules are inferior to those based on statistical learning in evaluations". In fact, much of the literature suggests that combining grammatical rules with probability-and-statistics methods is the inevitable direction for more accurate natural language processing in the future (I cannot point to a specific system here, but I recall work that uses grammatical rules to further correct the output of statistical analysis; interested readers can look into it). I would also recommend reading about the history of language processing and the development of Chinese word segmentation, which deepens one's grasp of the principles and trends of segmentation and other NLP systems; see [2-3] for details.

        From the above we can see that, for now, neither a word segmentation tool nor a system built on top of one can be expected to be 100% correct. In practical applications we therefore need to design test cases based on the actual business and set minimum thresholds for Precision, Recall, and F-score; only when the system passes all test cases and each metric reaches its threshold can the test be considered passed.

References

  1. Sequence tagging based Chinese word segmentation with LSTM networks [J]. Application Research of Computers, 2017(5).
  2. Zheng Jie. NLP Chinese Natural Language Processing: Principles and Practice [M]. Beijing: Publishing House of Electronics Industry, 2017.
  3. Huang Changning, Zhao Hai. Chinese Word Segmentation: A Decade Review [J]. Journal of Chinese Information Processing, 2007, 21(3): 8-19.
  4. https://www.zhihu.com/question/20136144
  5. https://github.com/fxsjy/jieba
  6. http://www.52nlp.cn/hmm
  7. https://www.cnblogs.com/zhbzz2007/p/6092313.html
