N-gram language model

        In speech recognition, the GMM-HMM model introduced earlier can be decoded with the Viterbi algorithm (or similar algorithms) to obtain the best state sequence. However, if the target speech covers the full vocabulary plus English words and numbers, it may involve hundreds of thousands of words, the decoding process becomes very complicated, and homophones allow many different permutations and combinations, so the recognition results will not be ideal. Therefore, a language model is introduced to constrain the recognition results, so that a fluent sentence such as "today's weather is very good" receives a higher probability than a homophone-error variant of the same sounds, and the recognizer favors hypotheses that both score well acoustically and form a sensible sentence.

        There are two important issues in model design:

        1) Free parameter problem: If every word in a sentence depends on every other word, the number of free parameters of the model increases exponentially with the length of the word string, which makes it almost impossible to estimate these parameters correctly.

        2) Zero probability problem (OOV problem): Chinese characters can form an enormous number of word combinations, but even a large training corpus contains only a small fraction of them. Under maximum likelihood estimation, the probability of any combination that never appears in training ends up being 0: training chooses parameters that maximize the probability of the combinations that do appear and, as a side effect, pushes the probability of unseen combinations toward 0, even though many of those unseen combinations are perfectly reasonable.

 N-Gram:

        N-Gram is a language model commonly used in large-vocabulary continuous speech recognition. The model is based on the assumption that the appearance of the N-th word is related only to the preceding N-1 words and to no other words, and the probability of an entire sentence is the product of the conditional probabilities of its words.

        N-Gram is an algorithm based on statistical language modeling. Its basic idea is to slide a window of size N over the content of the text (byte by byte or character by character), forming a sequence of fragments of length N.

        Frequency statistics are then collected for each fragment (gram), and the grams are filtered by a preset threshold to form a key-gram list. This list is the vector feature space of the text, and each gram in it is one dimension of the feature vector.

Free parameter problem:

        Following the idea of the previous GMM-HMM model, to judge whether a sentence consisting of N words \left \{ w_{1},w_{2},...,w_{N} \right \} is reasonable, we want the probability of its occurrence, which should be the largest among all candidate sentences for the given observation sequence. This probability is:

p\left ( w_{1},w_{2},...,w_{N} \right )=p\left ( w_{1}|\left \langle s \right \rangle \right )p\left ( w_{2}|w_{1} \right )p\left ( w_{3}|w_{1}w_{2} \right )...p\left ( w_{N}|w_{1}w_{2}...w_{N-1} \right )p\left ( \left \langle /s \right \rangle |w_{1}...w_{N} \right )

        Among them, the probability can be obtained through the statistics of the pre-marked text data:

        p\left ( w_{n}|w_{1}w_{2}...w_{n-1} \right )=\frac{count\left ( w_{1}w_{2}...w_{n} \right )}{count\left ( w_{1}w_{2}...w_{n-1} \right )}

        Generally, <s> and </s> are used to mark the beginning and the end of a sentence; they are not themselves words in the vocabulary.

        Obviously, if the probability of the n-th word depends on all of the previous words, then the number of parameters to estimate, and the work needed to find the most likely sentence, grows exponentially with the sentence length N.
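        As a rough illustration (the vocabulary size here is an assumed figure): with a vocabulary of \left | V \right |=10^{5} words, a model that conditions each word on its full history in a length-N sentence needs on the order of \left | V \right |^{N} parameters, whereas limiting the history keeps the parameter count independent of the sentence length:

        \left | V \right |^{2}=10^{10}\ \left ( bigram \right ),\qquad \left | V \right |^{3}=10^{15}\ \left ( trigram \right ),\qquad \left | V \right |^{N}=10^{5N}\ \left ( full\ history \right )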

        We might as well borrow the idea of the HMM and assume that the probability of the current word depends only on a limited number of preceding words, which greatly reduces the amount of computation. In reality, most of what we say is indeed related mainly to the most recent words; taken to the extreme, the correlation between what is said today and what was said on this day last year should be very small or even non-existent.

        When n=1, that is, the appearance of a word is independent of the words before it, the model is called a unigram.

        When n=2, that is, the appearance of a word is related only to the one word before it, the model is called a bigram.

        When n=3, that is, the appearance of a word is related only to the two words before it, the model is called a trigram.
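        As a concrete sketch of the counting formula above for the bigram case (the toy corpus and function names below are illustrative, not from the original text):

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, adding <s> / </s> sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    """Maximum-likelihood estimate: p(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

# Hypothetical toy corpus
corpus = ["today the weather is very good", "the weather is bad today"]
unigrams, bigrams = train_bigram(corpus)
print(bigram_prob("the", "weather", unigrams, bigrams))  # 1.0: "weather" always follows "the"
print(bigram_prob("the", "banana", unigrams, bigrams))   # 0.0: unseen bigram
```

        The second call returns 0.0, which is exactly the zero probability problem discussed next.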

Data smoothing algorithm:

        Due to the sparsity of the training corpus, some word sequences never appear in it, which leads to zero-probability problems, so the data needs to be smoothed.

        Example (1):

        Assume the training set is T = (watermelon, watermelon, watermelon | eat), i.e. the collected words observed to follow "eat".

        According to maximum likelihood estimation, the parameters are chosen so that p(watermelon, watermelon, watermelon | eat) is maximized, so the trained system ends up with:

        p(x) = (x == watermelon? 1:0)

        Then, when reasonable phrases such as "eat apples", "eat bananas", and "eat pears" appear in the test set but not in the training set, the system assigns them probability 0, which makes the model's output unsatisfactory. This is the zero probability problem.

1) Add-one Smoothing (Laplace Smoothing)

        Idea: Add one (or any suitable constant a) to the count of every word sequence, so that every sequence has a non-zero count and the zero probability problem is avoided. To keep the probabilities summing to 1, the denominator is increased accordingly (by a times the number of instance types), namely:

        p\left ( w_{m+1}|w_{1}w_{2}...w_{m} \right )=\frac{count\left ( w_{1}w_{2}...w_{m+1} \right )+a}{\sum_{w_{i}}\left ( count\left ( w_{1}w_{2}...w_{m}w_{i} \right )+a \right )}

        Advantages: simple algorithm, 0 probability problem solved

        Disadvantages: because of the sparsity of the corpus, most combinations never appear, so Add-one allocates too much probability mass to N-grams that did not occur in the training corpus, and treating all unseen N-grams as equally probable is also somewhat unreasonable.
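        A minimal sketch of add-one smoothing applied to the earlier "eat watermelon" example (the vocabulary and the constant a are illustrative assumptions):

```python
from collections import Counter

def add_one_prob(word, history_counts, vocab, a=1):
    """Add-a smoothing: p(word | history) = (count(word) + a) / (total + a * |vocab|)."""
    total = sum(history_counts.values())
    return (history_counts[word] + a) / (total + a * len(vocab))

# Words observed after "eat" in the toy training set T
after_eat = Counter({"watermelon": 3})
vocab = {"watermelon", "apple", "banana", "pear"}  # assumed vocabulary

for word in sorted(vocab):
    print(word, add_one_prob(word, after_eat, vocab))
# watermelon: (3+1)/(3+4) ≈ 0.571; each unseen word: (0+1)/(3+4) ≈ 0.143
# The zero probability disappears, but every unseen word gets the same share of the mass.
```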

        Two additional concepts involved in this algorithm:

        1) Adjusted count  c^{*}: it describes the effect of the smoothing on the numerator, i.e. how many counts the smoothed probability corresponds to on the original denominator after part of the probability mass has been given away:

        c^{*}=\left ( c_{i}+1 \right )\frac{N}{N+V}

        where c_{i} is the original count, N is the original denominator (the total count), and V is the number of distinct types added to the denominator.

        2) Relative discount rate  d_{c}: the ratio of the discounted (adjusted) count to the original count, which expresses the overall change:

        d_{c}=\frac{c^{*}}{c}
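        As a small worked example with made-up numbers: if a combination was originally counted c_{i}=3 times, the original denominator is N=10, and V=4 types are added by the smoothing, then:

        c^{*}=\left ( 3+1 \right )\frac{10}{10+4}\approx 2.86,\qquad d_{c}=\frac{2.86}{3}\approx 0.95

        that is, roughly 5% of this combination's original probability mass has been handed over to the unseen combinations.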

2) Good-Turing Smoothing

        Idea: use the things seen exactly once (Seen Once) to estimate the events never seen (Unseen Events), and so on. That is, the actual counts of events in the training sample are reduced so that the probabilities of the events that did occur sum to less than 1, and the remaining probability mass is allocated to the unseen events. The counts are smoothed using the class information of the frequencies, i.e. how many types share each frequency.

        Define N_{r} as the number of N-gram types whose frequency of occurrence is r (the "frequency of frequencies"):

        N_{r}=\sum_{t_{i}:count\left ( t_{i} \right )=r}1

        Then we have:

        N=\sum_{r=1}^{\infty }N_{r}r

        where N is the total number of samples.

        The algorithm has two core steps:

        1) Adjust the count of the N-grams whose frequency of occurrence is r to an adjusted count  r^{*}:

        For the zero probability problem, another way of thinking is to group the samples into classes. When a new sample belongs to the same class (by some classification criterion) as samples in the training set, it can be given the same probability. For example, "watermelon", "banana" and "pear" are all fruits, so we can let p(watermelon|eat) = p(banana|eat) = p(pear|eat). Because new samples now receive probability, the occurrence probabilities of the existing samples must be adjusted according to some rule so that the probabilities of all samples still sum to 1.

        Good-Turing smoothing classifies the samples precisely by their frequency of occurrence:

        W_{r}=\left \{ w_{1},w_{2},...w_{N_{r}} \right \}

        where:

        count\left ( w_{i} \right )=r

        That is, the set \left \{ w_{1},w_{2},...w_{N_{r}} \right \} has N_{r} elements in total, each  w_{i} occurs r times in the sample, and the total number of tokens covered by W_{r} is  N_{r}r, so:

        p\left ( W_{r} \right )=\frac{N_{r}r}{N}

        \sum_{r=1}^{\infty }p\left ( W_{r} \right )=1

        When samples that did not appear in the training set show up in the data, a new class appears:

        W_{0}=\left \{ w_{1},w_{2},...w_{N_{0}} \right \}

        Summing all sample probabilities gives:

p\left ( W_{0} \right ) + \sum_{r=1}^{\infty }p\left ( W_{r} \right )=1

        According to the training counts, p\left ( W_{0} \right ) =0, which is exactly the zero probability problem. The probability distribution therefore needs to be adjusted.

        Good-turing Smoothing uses the things you have seen once (Seen Once) to estimate the events you have not seen (Unseen Events), and so on.

        That is, W_{1} (Seen Once) is used to estimate W_{0} (Unseen Events), W_{2} is used to estimate W_{1}, and so on.

        Then we set the adjusted probability of W_{r} to  p^{*}\left ( W_{r} \right ) =p\left ( W_{r+1} \right ):

        p^{*}\left ( W_{r} \right ) =\frac{N_{r}r^{*}}{N}=\frac{N_{r+1}\left ( r+1 \right )}{N}=p\left ( W_{r+1} \right )

        After this backward shift, the new (discounted) frequency of occurrence of each sample in W_{r} is r^{*}:

        r^{*}=\left ( r+1 \right )\frac{N_{r+1}}{N_{r}}

        Then the adjusted probability of a sample with frequency r in the sample is:

        p=\frac{r^{*}}{N}

        2) In particular, the total probability mass of the things seen exactly once (Seen Once) is used to estimate the total probability of the things never seen (Unseen Events):

        p^{*}\left ( W_{0} \right )=p\left ( W_{1} \right )=\frac{N_{1}}{N}

        p^{*}\left ( W_{0} \right ) is also known as the missing mass.

        The probability of each individual sample in W_{0} is:

        p=\frac{p^{*}\left ( W_{0} \right )}{N_{0}}=\frac{0^{*}}{N}=\frac{N_{1}}{NN_{0}}

         The key to Good-Turing smoothing is to use the probability of the low-frequency events that did occur to estimate the probability of the events that did not occur, and then adjust the probabilities of all events accordingly. When  N_{r+1} does not exist (is zero), it should first be filled in by methods such as linear regression over the N_{r} values and then substituted into the calculation.
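        A minimal sketch of these two steps, on made-up counts (the data and the fallback for a missing N_{r+1} are illustrative assumptions; real implementations such as Simple Good-Turing smooth the N_{r} values by regression and renormalize):

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r and the missing mass N_1 / N.

    Bare-bones sketch: when N_{r+1} == 0 (always true for the largest r in a small
    sample), real implementations first smooth the N_r values, e.g. by log-linear
    regression (Simple Good-Turing), and then renormalize; here we simply keep r.
    """
    N = sum(counts.values())               # total number of observed tokens
    N_r = Counter(counts.values())         # frequency of frequencies: N_r
    r_star = {}
    for r in sorted(N_r):
        if N_r[r + 1] > 0:
            r_star[r] = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            r_star[r] = float(r)           # fallback, see docstring
    missing_mass = N_r[1] / N              # p*(W_0) = N_1 / N
    return r_star, missing_mass, N

# Hypothetical counts of words seen after "eat"
counts = Counter({"watermelon": 3, "apple": 2, "pear": 2,
                  "banana": 1, "grape": 1, "peach": 1, "mango": 1, "plum": 1})
r_star, missing_mass, N = good_turing(counts)
print(r_star)        # {1: 0.8, 2: 1.5, 3: 3.0}: r = 1 and r = 2 are discounted; r = 3 falls back (N_4 = 0)
print(missing_mass)  # N_1 / N = 5/12 ≈ 0.417 is reserved for unseen words, to be split among them
print({w: r_star[r] / N for w, r in counts.items()})   # adjusted p = r* / N for each seen word
```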

        Intuitively, Good-Turing smoothing shifts the sample probabilities in the training set backward and hands the probability mass of the frequency-1 samples to the frequency-0 samples. The reason this works is that in natural language corpora only a few words occur very frequently, while most words occur at low frequency. A word that does not appear in the training set is therefore very likely to be a low-frequency word, and its probability should be close to the proportion of low-frequency words that appear only once in the training set, so estimating unseen things from things seen once is reasonable. Since these frequency classes change fairly smoothly, the other probabilities can be adjusted by the same analogy.

 
