Detailed explanation of HMM (Hidden Markov Model) - Speech signal processing learning (3) (Elective 1)

references:

Speech Recognition (Option) - HMM - bilibili

Li Hongyi Human Language Processing (March 2020) Exclusive Notes: HMM - 6 - Zhihu (zhihu.com)

Hidden Markov (HMM) decoding problem + Viterbi algorithm - Zhihu (zhihu.com)

All cited papers are omitted this time

Table of contents

1. Introduction

2. Modeling unit States

Origin of State

Transition probability and emission probability

3. Alignment

4. HMM under deep learning

Method 1: Tandem

Method 2: DNN-HMM Hybrid

5. Training method of State Classifier in DNN


 

Note that this article continues the Speech Signal Processing Learning (3) series and belongs to the elective part on speech recognition tasks, which has three chapters in total.

 

Let's go back about 14 years and see how people did speech recognition before neural networks. You will find that current technology borrows many ideas from HMM.

1. Introduction

  • In the past, we used statistical models for speech recognition. Given the input speech sequence X, we only need to find the output text Y with the maximum probability; that is, we exhaust all possible Y and find the Y* that maximizes P(Y|X). We also call this process decoding, and the formula is as follows:


    Y^* = \arg \max_Y{P(Y|X)}
     

  • Exhaustive search is too complex to compute directly and requires very good algorithms. Fortunately, we can use Bayes' theorem to transform the problem; the transformed formula is as follows. Since P(X) is irrelevant to our decoding task, it does not change as Y changes, so we only need to keep the numerator.


    \begin{aligned} Y^* & = \arg \max_Y{P(Y|X)} \\ & = \arg \max_Y{\frac{P(X|Y)P(Y)}{P(X)}} \\ & = \arg \max_Y{P(X|Y)P(Y)} \end{aligned}
     

  • After the transformation, the first term P(X|Y) is called the Acoustic Model, and the second term P(Y) is called the Language Model. The former is where HMM is often used. We can see that if you use an HMM, you must pair it with an LM. A conventional E2E model directly solves the original, untransformed expression, so on the surface it seems no LM is needed; in practice, however, an E2E model paired with an LM often performs much better. For details, refer to the explanation of LM later.
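    To make decoding concrete, here is a minimal Python sketch with entirely made-up probabilities: it scores each candidate transcription Y by P(X|Y) · P(Y) and keeps the argmax. Note how the language model overrides the acoustically best candidate.

```python
# A toy sketch of decoding: pick the transcription Y that maximizes
# P(X|Y) * P(Y).  All probabilities are invented, purely for illustration.

acoustic_model = {
    # Y                   P(X|Y)  (hypothetical acoustic scores)
    "what do you think": 0.020,
    "what you think":    0.030,
    "watt dew yu think": 0.025,
}
language_model = {
    # Y                   P(Y)    (hypothetical language-model scores)
    "what do you think": 0.010,
    "what you think":    0.004,
    "watt dew yu think": 0.00001,
}

# Decoding: argmax over Y of P(X|Y) * P(Y); P(X) is constant and is dropped.
best_y = max(acoustic_model, key=lambda y: acoustic_model[y] * language_model[y])
print(best_y)  # -> "what do you think", even though its acoustic score is lower
```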

2. Modeling unit States

Origin of State
  • We said before that in a speech recognition model, the target Y is a Token sequence. In HMM, however, we convert the target token sequence into a sequence of States, represented by S. What is a State? It is a human-defined unit smaller than a phoneme.

  • We use the sentence "what do you think" as an example. If phonemes are used as the token unit, the sentence decomposes into a phoneme string in which uw appears twice (once in "do" and once in "you"). However, since each phoneme is affected by the phonemes before and after it, the same phoneme uw may actually be pronounced differently. So we subdivide further and use the Tri-phone as the token unit, that is, the current phoneme plus its preceding and following phonemes (see the sketch at the end of this section).

  • State is a unit smaller than the Tri-phone. We can stipulate that each Tri-phone consists of 3 or 5 states; how many depends on the computing resources you have. The disassembled states still retain the pronunciation order information.

  • Since we need to calculate the probability of the acoustic feature sequence X given the states, we need to figure out how states generate acoustic features. In fact, it is very simple. Suppose we have 3 states in total: we walk through them in order, each state emitting some acoustic feature vectors before jumping to the next, until the whole sequence X has been produced.
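    As a concrete illustration of the unit hierarchy above, here is a small Python sketch that expands a phoneme sequence into tri-phones and then into states. The phoneme transcription of "what do you think" is illustrative rather than from a real lexicon, and the naming scheme (left-phone+right, /s1..s3) is just one common convention.

```python
# A sketch of expanding phonemes -> tri-phones -> HMM states.
# The transcription below is an illustrative example, not a real lexicon entry.

phonemes = ["hh", "w", "aa", "t", "d", "uw", "y", "uw", "th", "ih", "ng", "k"]

def to_triphones(phones):
    """Each tri-phone is the current phoneme plus its left/right context."""
    tri = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"              # pad edges with silence
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        tri.append(f"{left}-{p}+{right}")
    return tri

def to_states(triphones, n_states=3):
    """Each tri-phone is split into a fixed number of ordered states."""
    return [f"{t}/s{k}" for t in triphones for k in range(1, n_states + 1)]

tris = to_triphones(phonemes)
print(tris[5], tris[7])      # d-uw+y  y-uw+th : the two uw's get different contexts
print(to_states(tris)[:3])   # ['sil-hh+w/s1', 'sil-hh+w/s2', 'sil-hh+w/s3']
```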

Transition probability and emission probability
  • In order to complete the process just described, we need to calculate two probabilities: the probability that the current state ends and jumps to the next state, and the probability that the current state generates the corresponding acoustic feature. We call them the Transition Probability and the Emission Probability respectively.

    • Transition Probability: the probability of jumping from state a to state b, i.e., that the current vector is generated by state a and the next vector is generated by state b.

    • Emission Probability: given a State, the probability of producing a certain acoustic feature. We assume the acoustic features emitted by each state follow a fixed probability distribution, which we represent with a GMM (Gaussian Mixture Model); see the sketch at the end of this section.

  • The calculation of emission probability also indirectly explains why we need a unit as small as the State for modeling: we must assume that the distribution each state emits is stable. If the unit is too large, the distribution of the acoustic features it emits is likely to vary. For example, if characters were used as units, this would happen: the pronunciation of the letter c is not fixed; it is often pronounced "k", but is pronounced "ch" after combining with h. Such a unit is not suitable as an HMM state.

  • However, emission probability also brings a problem: there will be a huge number of states. If there are 30 phonemes, then there are 30 × 30 × 30 = 27,000 Tri-phones; with 3 states per Tri-phone, that is 81,000 states in total. This can lead to a situation where a certain state appears only once or twice in the entire corpus, which makes its Gaussian mixture distribution hard to estimate.

  • In response to this situation, a key technology emerged in the past, namely the Tied-state: it assumes that some states have the same pronunciation, so they share the same Gaussian mixture distribution. This reduces the number of Gaussian mixture models used, and also lets rare states whose distributions are hard to estimate share distributions with other states. It's like two pointers with different names pointing to the same memory.

  • The final form of this line of development is the Subspace GMM, in which all states share the same collection of Gaussians. It is essentially a pool containing many Gaussian distributions; each state acts like a net, fishing a few Gaussians out of the pool to form the mixture it emits from. Therefore, any two states have some Gaussian components in common and some that differ.

    However, this technology was published in 2010 and is not used much now, though it caused quite a sensation when first published. Interestingly, Hinton also published a paper on deep learning for ASR at the same venue that year, but everyone's attention was on the former paper and Hinton's research did not receive much notice, because its performance was not yet as good as the state of the art at the time.
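    Here is a minimal sketch of the two ideas above, with made-up parameters: each state's emission probability comes from a GMM, and tying two states simply means they reference the same GMM object (the "two pointers to the same memory" analogy). Real systems use multivariate Gaussians over feature vectors; a 1-D version is used here for brevity.

```python
# A sketch of GMM emission probabilities and tied states, with invented numbers.

import math

class GMM:
    """A 1-D Gaussian mixture, for illustration (real features are vectors)."""
    def __init__(self, weights, means, variances):
        self.components = list(zip(weights, means, variances))

    def pdf(self, x):
        # Weighted sum of Gaussian densities.
        return sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in self.components
        )

gmm_shared = GMM([0.6, 0.4], [0.0, 2.0], [1.0, 0.5])

emission = {
    "t-uw+y/s1": gmm_shared,
    "d-uw+y/s1": gmm_shared,              # tied state: shares the GMM above
    "y-uw+th/s1": GMM([1.0], [5.0], [2.0]),
}

x = 1.2  # one (scalar) acoustic feature
print(emission["t-uw+y/s1"].pdf(x) == emission["d-uw+y/s1"].pdf(x))  # True
```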

3. Alignment

  • Suppose we already know the Transition Probability and Emission Probability; we still cannot calculate our target probability P(X|S), because we are still missing the Alignment. What does that mean? We still do not know which state each vector corresponds to. In other words, we need to know which acoustic feature is generated by which state before we can use the emission probability and transition probability to calculate P(X|S).

  • Suppose we have 3 states a, b, c and 6 vectors x1~x6. We need an alignment h of the states with the vectors (i.e., a state sequence), such as aabbcc, meaning that x1 and x2 are generated by state a, and so on. Knowing the alignment, we can calculate the target probability using the two probabilities. In reality, precisely because we don't know the alignment, this information is hidden; this is where the "Hidden" in HMM's name comes from. Different alignments give different calculated probabilities.

  • So how do we solve the problem of the hidden alignment? We choose to exhaust all possibilities: calculate the probability of every valid alignment and add them up, and the final result is our target probability P(X|S). This is what the HMM is doing. Of course, alignments such as abccbc and abbbb are not included: the former jumps backward, and the latter skips states (c never appears).

    But!!! Note that when I searched related material, I found that what the HMM really does during decoding may be to "exhaust" all possibilities and find the single alignment that is most consistent with the observed acoustic features X, i.e., has the highest probability of generating them. The "exhaustive" search here generally uses a dynamic programming algorithm (such as the Viterbi algorithm) to efficiently compute the most likely hidden state sequence given the acoustic feature sequence, and thereby obtain the maximum-probability decoding result.

    But on another look, it seems that for a given state sequence, its probability is computed as the sum over all alignments, while the dynamic programming algorithm is used when searching for the highest-probability state sequence, i.e., when decoding to produce the result. This statement still needs to be verified (see the sketch after the reference below).

    Subsequent conclusion: after learning RNN-T, I think HMM may work the same way as RNN-T. During training, the sum of the probabilities of all alignments is used as the probability of the current text (token/state) sequence; during decoding, the probability of the single most likely alignment is used as the probability of the current text.

    Reference: Hidden Markov (HMM) decoding problem + Viterbi algorithm - Zhihu (zhihu.com)
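    To make the distinction above concrete, the following sketch runs both computations on a toy left-to-right HMM with states a → b → c and six vectors: the forward recursion sums over all valid alignments (the training-style P(X|S)), while the Viterbi recursion keeps only the single best alignment (decoding-style). All numbers are made up.

```python
# Forward (sum over alignments) vs. Viterbi (best alignment) on a toy HMM.

import numpy as np

# trans[i][j]: probability of moving from state i to state j.  Left-to-right
# topology: each state may only stay put or advance one step.
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])

# emit[t][i]: emission probability of vector x_t under state i, as would come
# from each state's GMM (hypothetical values here).  6 vectors x1..x6, 3 states.
emit = np.array([[0.8, 0.1, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.2, 0.6, 0.2],
                 [0.1, 0.7, 0.2],
                 [0.1, 0.2, 0.7],
                 [0.1, 0.1, 0.8]])

def forward_and_viterbi(trans, emit):
    T, N = emit.shape
    alpha = np.zeros((T, N))   # forward: total probability over all alignments
    delta = np.zeros((T, N))   # Viterbi: probability of the best alignment
    alpha[0, 0] = delta[0, 0] = emit[0, 0]     # must start in the first state
    for t in range(1, T):
        for j in range(N):
            alpha[t, j] = emit[t, j] * np.sum(alpha[t - 1] * trans[:, j])
            delta[t, j] = emit[t, j] * np.max(delta[t - 1] * trans[:, j])
    return alpha[T - 1, N - 1], delta[T - 1, N - 1]  # must end in the last state

total, best = forward_and_viterbi(trans, emit)
print(f"sum over all alignments: {total:.6f}")
print(f"best single alignment:   {best:.6f}")    # always <= the sum
```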

4. HMM under deep learning

Method 1: Tandem
  • There is no deep learning in the original HMM. When deep learning rose, people began to think about how to make use of it, and the earliest ideas were variations built on top of HMM.

  • The first and most common method is Tandem, which was everywhere around 2009. It does not change the HMM model itself; its main purpose is to provide the HMM with higher-quality acoustic features. How? Previous acoustic features were MFCCs, while Tandem trains a State Classifier based on a deep neural network: it takes an MFCC vector as input and predicts which state it belongs to, outputting a probability distribution over states. We replace the previous acoustic features with this probability distribution as the new input to the HMM.

  • Of course, we do not have to use the State Classifier's output as the acoustic feature; we can also use the output of the last hidden layer or of a bottleneck layer (see the sketch below).
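    A minimal PyTorch sketch of such a State Classifier follows. The sizes (39-dim MFCC input, 1000 states, 40-dim bottleneck) are hypothetical; the point is that either the softmax posteriors or the bottleneck activations can serve as the HMM's new input features.

```python
# A Tandem-style state classifier sketch, with hypothetical layer sizes.

import torch
import torch.nn as nn

class StateClassifier(nn.Module):
    def __init__(self, n_mfcc=39, n_states=1000, bottleneck=40):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_mfcc, 512), nn.ReLU(),
            nn.Linear(512, bottleneck), nn.ReLU(),  # narrow "bottleneck" layer
        )
        self.out = nn.Linear(bottleneck, n_states)

    def forward(self, x):
        h = self.front(x)                       # bottleneck features
        post = torch.softmax(self.out(h), -1)   # per-state posterior P(a|x)
        return post, h

clf = StateClassifier()
mfcc = torch.randn(1, 39)                 # one acoustic feature vector
posterior, bottleneck_feat = clf(mfcc)
# Tandem: feed `posterior` (or `bottleneck_feat`) to the HMM instead of the MFCC.
```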

Method 2: DNN-HMM Hybrid

Discriminative Training and Generative Training are two different training paradigms in machine learning, commonly used for classification models and generative models respectively.

  1. Discriminative Training

    • Definition: This training method aims to learn the conditional distribution or decision boundary of the data in order to distinguish different categories. It mainly focuses on classifying input data: it directly learns the conditional probability of the label given the input, e.g., learning a mapping from inputs to labels in supervised learning.

    • Examples: Common examples include support vector machines (SVM), logistic regression, and neural networks.

  2. Generative Training

    • Definition: This training method focuses on modeling the generative distribution of data in an attempt to understand how the data was generated. It not only focuses on classification tasks, but also attempts to simulate the process of data generation. By learning the distribution model of the data, new data similar to the original data can be generated.

    • Example: Typical examples are Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and Hidden Markov Models (HMM), etc.

The two methods differ in goals and applications. Discriminative training focuses on classification: finding boundaries or conditional probabilities so that input data can be accurately classified. Generative training focuses on learning the data-generating process, so that new samples similar to the original data can be produced; it can also be applied to classification tasks (a toy comparison follows).
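A toy 1-D comparison of the two paradigms, as a sketch: the generative model fits P(x|y) and P(y) and classifies through Bayes' rule, while the discriminative model fits P(y|x) directly via logistic regression. Both arrive at similar posteriors on this synthetic data.

```python
# Generative vs. discriminative training on the same synthetic binary data.

import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, 200)            # class 0 samples
x1 = rng.normal(+1.5, 1.0, 200)            # class 1 samples
x = np.concatenate([x0, x1]); y = np.array([0] * 200 + [1] * 200)

# --- Generative: model how each class generates data, then apply Bayes ---
mu0, mu1 = x0.mean(), x1.mean()
var = 0.5 * (x0.var() + x1.var())          # pooled within-class variance
def gen_posterior(q):                       # P(y=1|q) via Bayes' rule
    p0 = np.exp(-(q - mu0) ** 2 / (2 * var)) * 0.5   # P(q|y=0) P(y=0)
    p1 = np.exp(-(q - mu1) ** 2 / (2 * var)) * 0.5
    return p1 / (p0 + p1)

# --- Discriminative: fit P(y=1|x) directly (logistic regression) ---
w, b = 0.0, 0.0
for _ in range(2000):                       # plain gradient descent
    p = 1 / (1 + np.exp(-(w * x + b)))
    w -= 0.1 * np.mean((p - y) * x)
    b -= 0.1 * np.mean(p - y)

q = 0.3
print(gen_posterior(q), 1 / (1 + np.exp(-(w * q + b))))  # similar posteriors
```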

  • The original HMM contains a Gaussian mixture model, and we want to replace it with a DNN. However, the GMM takes a state a and outputs the probability of each acoustic feature, i.e., P(x|a); the State Classifier just mentioned takes an acoustic feature vector x and outputs the probability of each state, i.e., P(a|x). The two seem to be opposites.

  • However, we can still convert one into the other with Bayes' theorem. The conversion formula is as follows: P(a) can be counted from the training data, and P(a|x) is the output of the DNN. We do not care about P(x), since it is the same for every state. The biggest advantage of this is that very little of the original formula changes, and modular management is achieved (see the sketch at the end of this section).


    P(x|a) = \frac{P(a|x)P(x)}{P(a)} \propto \frac{P(a|x)}{P(a)}

  • So why is using a DNN to compute P(x|a) better than the Gaussian mixture model? Some believe it is because the DNN's training is discriminative while the original HMM's is generative, and the former is better. In fact, although HMM is a generative model, it can also be trained discriminatively, and many people did such research before DNNs. Others think the advantage is that the DNN has more parameters, but this underestimates the representational power of GMMs as their parameter count grows; in the end, the number of parameters a DNN uses is actually similar to that of a GMM-based HMM.

  • In fact, the contribution of this paper is that all states share one model that computes, from a given observation, the probability of each possible state, instead of each state needing its own GMM with its own mean and variance as before. So it is a very powerful data-driven method of state annotation.

  • So how effective is the DNN? It turns out to be very powerful. Note that the DNN need not be a network composed of fully connected layers; it can be any type of neural network, such as a CNN or LSTM.
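  Here is a sketch of the Bayes conversion in practice, with made-up numbers: the DNN's posterior is divided by the state prior (counted from the training alignments) to give a "scaled likelihood" that drops into the HMM in place of the GMM emission score.

```python
# Turning DNN posteriors P(a|x) into scaled likelihoods P(x|a) ∝ P(a|x)/P(a).

import numpy as np

posterior = np.array([0.70, 0.20, 0.10])  # DNN output P(a|x) for states a1..a3
prior     = np.array([0.50, 0.30, 0.20])  # P(a): state frequencies in training data

# P(x) is constant for a given frame, so it is dropped; in practice this is
# computed in log space to avoid underflow.
scaled_loglik = np.log(posterior) - np.log(prior)
print(scaled_loglik)  # used in place of log GMM emission scores in the HMM
```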

5. Training method of State Classifier in DNN

  • So how do we train the State Classifier? Its input is an acoustic feature, and its output is the probability of each state. Before training it, we need to know the correspondence between each acoustic feature and a state. However, real annotated data is not aligned; we only have acoustic features and the corresponding text.

  • The past practice was to first train an HMM-GMM. With it, we can compute the highest-probability alignment, and with the alignment we can train the State Classifier.

  • However, some may worry: isn't the HMM-GMM's performance poor? Wouldn't using its results to train the DNN be bad? Then we can replace the HMM-GMM with the first-generation DNN we just trained, produce a new alignment, and use it to train the next DNN; the training can loop like this until you are satisfied (see the sketch at the end of this section).

  • So what is the result? Very strong! In 2016, Microsoft claimed that the model they trained using the DNN-HMM Hybrid was comparable to human capabilities; specifically, the machine's recognition error rate matched the human recognition error rate. To measure the human error rate, Microsoft specifically hired professional transcribers.

  • In 2017, IBM used the same method to reduce the recognition error rate once again; this time, however, the measured human error rate was also lower (the transcribers they found were better). In fact, the generally accepted error-rate floor for speech recognition is about 5%, which is already very strong: professional transcribers are at this level. Because the reference answers are also annotated by humans, they themselves contain roughly a 5% error rate, so a model reaching 5% is considered the limit; it is hard to go any lower.

  • In actual production, because of inference speed, there are not many end-to-end deep learning models, except for Google's mobile assistant; most systems adopt hybrid models.

  • So how do we improve the accuracy further? With the hybrid model, what everyone does is keep deepening the DNN. For example, according to Microsoft's public information, they trained a 49-layer residual network; its output is a vector over 9,000 state categories, normalized with Softmax.
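  Schematically, the iterative recipe described in this section looks like the following sketch. The helper functions are placeholders for real toolkit operations (e.g., GMM training and forced alignment in a toolkit such as Kaldi); only the loop structure is the point.

```python
# A schematic sketch of the iterative realignment recipe (placeholders only).

def train_gmm_hmm(features, transcripts):
    """Bootstrap: train a classic HMM-GMM on unaligned data."""
    return "gmm-hmm model"

def force_align(model, features, transcripts):
    """Use the current model to find the most likely state for each frame."""
    return [["state_17", "state_17", "state_42"]]  # dummy frame labels

def train_dnn(features, frame_labels):
    """Train the state classifier on the (feature, state) pairs."""
    return "dnn model"

features, transcripts = ["utt1 MFCCs"], ["what do you think"]

model = train_gmm_hmm(features, transcripts)      # step 1: HMM-GMM bootstrap
for generation in range(3):                       # iterate until satisfied
    alignment = force_align(model, features, transcripts)  # step 2: realign
    model = train_dnn(features, alignment)        # step 3: retrain the DNN
```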


Source: blog.csdn.net/m0_56942491/article/details/134692287