[NLP] Speech Recognition—GMM, HMM

 1. Description

        Before the deep learning (DL) era, HMMs and GMMs were two must-learn technologies for speech recognition. Now there are hybrid systems that combine HMMs with deep learning, as well as HMM-free systems, so we have more design options. However, HMMs remain important for many generative models. Regardless of where things stand, speech recognition helps us better understand how HMMs and GMMs are applied in the context of ML. So bear with us and let's spend some time on them.

2. Automatic Speech Recognition (ASR)

        Let's start with a high-level overview. The figure below is the high-level architecture of speech recognition, which connects HMM (Hidden Markov Model) with speech recognition.

        Starting from an audio clip, we slide windows of 25 ms width at 10 ms intervals to extract MFCC features. For each window frame, 39 MFCC parameters are extracted. The main goal of speech recognition is to build a statistical model that infers a text sequence W (e.g. "the cat sat on the mat") from a sequence of feature vectors X.
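As a concrete sketch of this front end (assuming the librosa library; the file name, sample rate, and exact window settings are illustrative), 13 cepstral coefficients plus their first- and second-order deltas give the 39 features per frame mentioned above:

```python
import librosa
import numpy as np

# Load audio (16 kHz is a common rate for ASR; the file name is hypothetical).
y, sr = librosa.load("clip.wav", sr=16000)

# 25 ms windows (400 samples at 16 kHz) with a 10 ms hop (160 samples).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, win_length=400, hop_length=160)

# Append first- and second-order deltas to get 39 features per frame.
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)
X = np.vstack([mfcc, d1, d2]).T   # shape: (num_frames, 39)
print(X.shape)
```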

        One method looks through all possible word sequences (with a finite maximum length) and finds the one that best matches the input acoustic features.

        The model relies on building a language model P(W), a pronunciation dictionary model, and an acoustic model P(X|W) (a generative model), as follows.

(Figure: high-level ASR architecture, modified from source.)

        Pronunciation models can use tables to convert words to phonemes, or the corpus may already be transcribed with phonemes. Acoustic models then model a sequence of feature vectors given a sequence of phones rather than words. But we will continue to use the notation p(X|W) for the acoustic model. Just be aware of this.

        A language model gives the likelihood of a sequence of words. For example, "I watch a movie" is more likely than "a movie I watch" or "I watch an apple". It predicts the next word based on the previous words. If we approximate it with a first-order Markov chain, the next word depends only on the current word. We can estimate these probabilities by counting the occurrences of word pairs in the corpus.
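As a tiny sketch, a first-order (bigram) language model can be estimated exactly this way; the toy corpus below is made up for illustration:

```python
from collections import Counter

corpus = ["i watch a movie", "i watch a game", "the cat sat on the mat"]  # toy corpus

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words[:-1])                 # count each context word
    bigrams.update(zip(words[:-1], words[1:]))  # count word pairs

def p_next(prev, word):
    """P(word | prev) by maximum-likelihood counting (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_next("watch", "a"))   # 1.0 in this toy corpus
```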

        By combining acoustic and language models, we search for text sequences with the greatest likelihood.
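Written out, the decoder solves the following (by Bayes' rule; the normalizer P(X) does not change the argmax):

$$
W^{*} = \arg\max_{W} P(W \mid X)
       = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
       = \arg\max_{W} P(X \mid W)\, P(W)
$$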

        This approach sounds indirect, and the search looks inefficient or even impossible. But p(X|W) is much easier to model in speech recognition. The feature distribution for each phone can be modeled with a Gaussian Mixture Model (GMM), which we learn from training data. The transitions between phones and the corresponding observables can be modeled with a Hidden Markov Model (HMM). So if we can figure out how to search phone sequences efficiently, this approach does not sound too bad.

        An HMM consists of hidden variables and observables. The top nodes below represent the phones, and the bottom nodes represent the corresponding observables (the audio features). The horizontal arrows show the phone transitions for the ground-truth label "She's just...".

        In speech recognition, the observables can be represented by the 39 MFCC features extracted from the corresponding audio frames. The good news is that with this HMM model, we don't need to search the phone sequences one by one; otherwise, the complexity grows exponentially with the number of phones. Using the Viterbi algorithm or other HMM methods, we can find the optimal sequence in polynomial time. We will come back to this later.

        The diagram below is a possible implementation of Automatic Speech Recognition (ASR). Combining information from the dictionary, the acoustic model, and the language model, we can use a Viterbi decoder to find the optimal phone sequence.

(Figure modified from source; O denotes the same observation sequence as X here.)

Let's quickly recap: we can model the acoustic model P(X|W) with an HMM. The arrows in the HMM model represent phone transitions or links to observables. To model the audio features we observe, we learn a GMM from the training data. So let's start by understanding HMMs and GMMs in a more general context.

3. Hidden Markov Model

        A Markov chain contains all possible states of the system and the probabilities of transitioning from one state to another.

        A first-order Markov chain assumes that the next state depends only on the current state. For simplicity, we usually refer to this as a Markov chain.

        This model is easier to handle. However, in many ML systems, not all states are observable; we call these hidden states or internal states. Some may see them as latent factors of the input. For example, it may not be easy to know whether I am happy or sad. My internal state will be {H or S}. But we can get some hints from observations. For example, when I'm happy there is a 0.2 chance that I watch a movie, but when I'm sad that chance goes up to 0.4. The probability of an observable given the internal state is called the emission probability. The probability of transitioning from one internal state to another is called the transition probability.
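To make the example concrete, here is a small numpy sketch of such a two-state HMM; the 0.2 and 0.4 emission values come from the text above, while the transition probabilities and the remaining numbers are made-up values for illustration:

```python
import numpy as np

states = ["H", "S"]                   # hidden states: happy, sad
observables = ["movie", "no_movie"]   # simplified observables

# Transition probabilities P(next state | current state); rows sum to 1 (assumed values).
A = np.array([[0.8, 0.2],    # H -> H, H -> S
              [0.3, 0.7]])   # S -> H, S -> S

# Emission probabilities P(observable | state); 0.2 and 0.4 are from the example above,
# the complements are filled in so each row sums to 1.
B = np.array([[0.2, 0.8],    # H: movie, no_movie
              [0.4, 0.6]])   # S: movie, no_movie

pi = np.array([0.6, 0.4])    # initial state distribution (assumed)
```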

        For speech recognition, the observable is what's in each audio frame. We can represent it using MFCC parameters. Let's see what we can do with HMMs.

Likelihood (forward algorithm)

HMMs are modeled by transition and emission probabilities.

        Given a learned HMM model, we can use the forward algorithm to compute the likelihood of an observation. Our goal is to sum the probability of the observations over all possible state sequences:
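In symbols, writing Q = (q₁, …, q_T) for a hidden state sequence, λ for the model, a_{ij} for transition probabilities, b_j(x_t) for emission probabilities, and π for the initial state distribution (notation assumed here, since the prose does not spell it out):

$$
P(X \mid \lambda) = \sum_{Q} P(X, Q \mid \lambda)
                  = \sum_{q_1,\dots,q_T} \pi_{q_1}\, b_{q_1}(x_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(x_t)
$$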

        But we have to be smart about it. We cannot enumerate every possible state sequence separately; that has exponential complexity.

        Our strategy will take a divide and conquer approach. If we can express computation recursively, we can decompose the problem into intermediate steps. In HMM, we use the results at time t-1 and/or t+1 to solve the problem at time t. The circle below represents the HMM hidden state j at time t. Therefore, even if the number of state sequences grows exponentially with time, if we can express the computation recursively over time, we can solve it linearly.

        This is the idea of dynamic programming, which breaks the exponential curse. At time t, the probability of the observations up to time t is:

        Let's name the red-underlined term αt(j) (the forward probability) and check whether we can express it recursively. Since the current observation depends only on the current state, α can be expressed as:
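With αt(j) = P(x₁, …, x_t, q_t = j | λ), the standard recursion (in the notation introduced above) is:

$$
\alpha_1(j) = \pi_j\, b_j(x_1), \qquad
\alpha_t(j) = \Big[\sum_{i=1}^{k} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(x_t), \qquad
P(X \mid \lambda) = \sum_{j=1}^{k} \alpha_T(j)
$$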

        So it does have a recurrence relation. Following are the steps to calculate the likelihood of an observation given a model λ using this recursion. Instead of summing over each state sequence individually, we compute α from time step 1 to the end (time T). If there are k internal states, the complexity is only O(k²T), not exponential.

        Below is an example where we start with the initial state distribution on the left. Then we propagate the value of α to the right. We compute α for each state and repeat this for each timestep.
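A minimal numpy sketch of this propagation follows; the two-state transition and emission numbers are the same illustrative assumptions as in the happy/sad example, and the observations are indices into the emission table:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(obs | model) in O(k^2 * T)."""
    k, T = A.shape[0], len(obs)
    alpha = np.zeros((T, k))
    alpha[0] = pi * B[:, obs[0]]                        # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # recursion
    return alpha[-1].sum()                              # termination

# Toy two-state example (assumed numbers); obs are column indices of B.
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.2, 0.8], [0.4, 0.6]])
pi = np.array([0.6, 0.4])
print(forward(A, B, pi, obs=[0, 1, 0]))
```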

        Next, given the HMM model, how do we find the internal states for a given sequence of observations? This process is called decoding. It is especially interesting for speech recognition: given an audio clip, the internal states represent the phones, and speech recognition can be viewed as finding these internal states from the audio.

Decode (find internal state - Viterbi algorithm)

        Likewise, we wish to express the computation recursively. Given state j at time t, vt(j) is the joint probability of the observations so far and the most probable state sequence ending in state j.

        So not only can it be done, the equation is similar to the forward algorithm except that the summation is replaced by a max function. Instead of summing all possible state sequences in a forward algorithm, the Viterbi algorithm takes the most probable path.
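In the same notation, with vt(j) as defined above, the Viterbi recursion is:

$$
v_1(j) = \pi_j\, b_j(x_1), \qquad
v_t(j) = \max_{i}\; v_{t-1}(i)\, a_{ij}\, b_j(x_t), \qquad
\hat{q}_T = \arg\max_{j} v_T(j)
$$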

(Figure modified from source.)

        Finding the internal state that maximizes the likelihood of an observation is similar to the likelihood method. We just replace the summation with the max function.

        In this algorithm, we also record the best previous state for each node at time t (the red arrows above), i.e. the backpointers we follow when backtracking to recover the best path. For example, we transition from the happy state H at t = 1 to the happy state H at t = 2.

(Figure from source.)
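A compact numpy sketch of Viterbi decoding with backpointers (same toy parameters as before, assumed for illustration):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence and its joint probability."""
    k, T = A.shape[0], len(obs)
    v = np.zeros((T, k))                  # best path probability ending in each state
    back = np.zeros((T, k), dtype=int)    # backpointers (the red arrows)
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A    # scores[i, j] = v_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], v[-1].max()

A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.2, 0.8], [0.4, 0.6]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, obs=[0, 1, 0]))   # ([1, 1, 1], ~0.0188) for these assumed numbers
```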

Learning (Baum-Welch Algorithm/Forward-Backward Algorithm)

        Now comes the hard part: how do we learn the HMM model? This can be done with the Baum-Welch algorithm (the forward-backward algorithm), which learns the transition and emission probabilities. The task sounds impossible because the two sets of probabilities are entangled in our calculations. But look at it from another angle: if we know the state occupancy probability (the state distribution at time t), we can derive the emission and transition probabilities; if we know those two probabilities, we can derive the state distribution at time t. This is the chicken-and-egg problem we discussed for the EM algorithm. The EM algorithm solves it in iterative steps: at each step we optimize one set of latent variables while fixing the others, improving the solution a little with each iteration. Even for continuous spaces, we work with finite precision, so there are effectively finitely many states to explore and improve. If we keep iterating, the solution converges.

        Therefore, it is not surprising that the Baum-Welch algorithm is a special case of the EM algorithm.

        Let's get acquainted with the following new symbols.

We are already familiar with α (the forward probability) from the forward algorithm. β (the backward probability) is its close cousin in the opposite direction: the probability of seeing all the upcoming observations given state i at time t. We can express it recursively, similar to α but in the opposite direction (the backward algorithm).
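In the same notation, β is defined and computed backwards as:

$$
\beta_t(i) = P(x_{t+1},\dots,x_T \mid q_t = i, \lambda), \qquad
\beta_T(i) = 1, \qquad
\beta_t(i) = \sum_{j=1}^{k} a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)
$$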

        To learn an HMM model, we need to know which state best explains the observations at each time. This is the state occupancy probability γ: the probability of being in state i at time t given all the observations.

        Given fixed HMM model parameters, we can apply the forward and backward algorithms to compute α and β for the observations. γ can then be calculated by simply multiplying α by β and renormalizing.
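Concretely, with the α and β above:

$$
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{k} \alpha_t(j)\,\beta_t(j)}
$$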

        ξ is the probability of transitioning from state i to state j after time t, given all the observations. It can be computed from α and β in a similar way.
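In symbols (the denominator is just P(X | λ), which α and β already provide):

$$
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)}
                  {\sum_{i'}\sum_{j'} \alpha_t(i')\, a_{i'j'}\, b_{j'}(x_{t+1})\, \beta_{t+1}(j')}
$$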

        Intuitively, with a fixed HMM model, we refine the state occupancy probabilities (γ) and transitions (ξ) given the observations.

        Here comes the chicken-and-egg part. Once the distributions of γ and ξ (θ₂) have been refined, we can make point estimates of the optimal transition and emission probabilities (θ₁: a, b).
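For an HMM with discrete observation symbols v_k (the GMM-emission case uses analogous mixture updates), the standard re-estimates are:

$$
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
\hat{b}_j(v_k) = \frac{\sum_{t\,:\,x_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
$$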

        We fix one set of parameters to improve others, and continue iterating until the solution converges.

        The EM algorithm is usually defined as:
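One common way to write it, with θ for the parameters and Z for the latent variables (here γ and ξ play the role of the posterior over Z):

$$
\text{E-step:}\quad Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{Z \sim p(Z \mid X,\, \theta^{\text{old}})}\!\left[\log p(X, Z \mid \theta)\right]
\qquad
\text{M-step:}\quad \theta^{\text{new}} = \arg\max_{\theta}\; Q(\theta, \theta^{\text{old}})
$$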

        Here, the E-step establishes p(γ, ξ | x, a, b). Then the M-step finds the a, b that (approximately) maximize the objective below.

        Here is a review of the algorithm:
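As a sketch, here is one Baum-Welch iteration for a discrete-observation HMM in numpy; real implementations rescale α and β (or work in log space) to avoid underflow, which this toy version omits:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One EM iteration for a discrete-observation HMM (toy version, no scaling)."""
    k, T = A.shape[0], len(obs)

    # E-step: forward and backward passes.
    alpha = np.zeros((T, k)); beta = np.ones((T, k))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    # State occupancy gamma and transition posterior xi.
    gamma = alpha * beta / likelihood                      # shape (T, k)
    xi = np.zeros((T - 1, k, k))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A *
                 (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / likelihood

    # M-step: point estimates of pi, A, B from gamma and xi.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):
        new_B[:, v] = gamma[np.array(obs) == v].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, likelihood

# Iterating this step until the likelihood stops improving learns the model.
```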

        Thus, the Baum-Welch algorithm can learn an HMM model given all observations in the training data. However, remember to keep an open mind. In speech recognition, the problem is much more complex, and many solutions sometimes do not scale well.

4. Acoustic model

(Figure modified from source.)

        In ASR, we can use a pronunciation table to generate the phone sequence for a text sequence Y. Next, we need to build an acoustic model for these phones.

        People have studied phonetics for decades. Experts can identify vowels and consonants by reading the spectrogram directly.

(Figure from source.)

        But again, we need a denser representation for the acoustic model so that we can determine the likelihood of an audio feature vector X given a phone, P(X|phone).

        Using MFCC, we extract 39 features from each audio frame. Let's simplify the picture and assume there is only one feature per frame. For the state "sh" (a phone), the value of this feature can be modeled with a normal distribution.

        To extend the concept to 39 features, we only need a multivariate normal distribution with 39 variables. The figure below visualizes the bivariate normal distribution of two variables.

bivariate normal distribution

        Following is the definition of the multivariate normal distribution.
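For a D-dimensional feature vector x with mean vector μ and covariance matrix Σ:

$$
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}}
\exp\!\left(-\tfrac{1}{2}\, (x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right)
$$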

where Σ is the covariance matrix that measures the correlation between variables. MFCC parameters have a nice property: they are relatively independent. Therefore, the off-diagonal elements of Σ can simply be set to zero.

        However, visualizing many dimensions is hard, so we will stick with the one-dimensional example for illustration. The likelihood p(x|q) of an observed feature x is calculated from how far it lies from the peak of the normal distribution for q:
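That is, with the Gaussian (μ_q, σ_q²) attached to phone q:

$$
p(x \mid q) = \frac{1}{\sqrt{2\pi\sigma_q^{2}}}
\exp\!\left(-\frac{(x - \mu_q)^{2}}{2\sigma_q^{2}}\right)
$$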

        Given different phones, we can compute the corresponding probability density values and classify the frame as the phone with the highest value. To learn this Gaussian distribution, we simply estimate it from the training data points xi.
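With N training frames x₁, …, x_N aligned to the phone, the estimates are simply the sample mean and variance:

$$
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\hat{\sigma}^{2} = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \hat{\mu}\right)^{2}
$$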

        These equations can be proven by maximizing the likelihood of the training data.

(Figure from source.)

        So this Gaussian model is easy to learn from the training data and gives us a nice P(x|μ, σ²). In the context of speech recognition, we can learn a Gaussian model (μ, σ²) for each phone. This is used as the likelihood probability, which also acts as the emission probability in the HMM.

        Unfortunately, even with a multivariate Gaussian distribution, this concept is naive. If it were true, learning to speak a foreign language would be much simpler. Real feature distributions are more complex than a single-peak bell curve. To address this, we switch to Gaussian Mixture Models (GMMs). This allows the distribution to be multimodal, i.e. we allow several likely values for a feature, which gives the flexibility to model variations in how a phone is pronounced.

        For example, the GMM on the right combines three Gaussian distributions with different weights to form a new probability density (a 3-component GMM). The model is still very compact: 6 Gaussian parameters plus 3 weights.

GMM acoustic model

        Intuitively, the feature values of a particular phone will be observed near one of the m modes, but some values may be more likely than others. Therefore, we introduce weights to indicate which components are more likely. When the internal HMM state is j, the likelihood of the observed feature vector is:
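In symbols, for an M-component GMM attached to state j, with mixture weights c_{jm} that sum to one:

$$
b_j(x_t) = p(x_t \mid q_t = j) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}\!\left(x_t \mid \mu_{jm}, \Sigma_{jm}\right)
$$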

        To learn a GMM, e.g. a 2-component GMM, we feed in the features extracted from the training data to fit the parameters of the two clusters. Conceptually, we start with an initial or random guess of these parameters, find which cluster each data sample should belong to, and then recompute the cluster parameters from the associated data points.

        Yes, we iterate the solution with the EM algorithm until it converges. In EM we use soft assignments instead of hard assignments. With hard assignment, we assign each data sample to one specific cluster (a point estimate). With soft assignment, the assignment is a probability distribution, so a sample can partially belong to each cluster. We then recompute the cluster parameters based on this soft assignment. Since we've covered this many times, we won't detail the training further.
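For reference, a minimal sketch of fitting such a GMM with scikit-learn's GaussianMixture, which runs exactly this EM procedure (the random data array and the component count are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X: (num_frames, 39) MFCC feature vectors for one phone (placeholder random data here).
X = np.random.randn(1000, 39)

# Diagonal covariances match the "relatively independent MFCC features" assumption above.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)

# Per-frame log-likelihood log p(x | phone): usable as the HMM emission score.
log_likelihood = gmm.score_samples(X)
print(log_likelihood.shape)   # (1000,)
```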

        To recap, given a phone, we can use a GMM to model the distribution of the observed feature vectors. This probability distribution lets us compute the likelihood of a speech frame given a phone, P(x|s), which is also the emission probability given the internal state of the HMM.

5. Vector quantization

        Throughout, we attempt to model denser representations of the acoustic signal. GMM is a popular method. Alternatively, after we extract a training set of feature vectors from the corpus, we group these features into k clusters, say using k-means clustering. This will create a codebook of size k to encode audio frames.

        (Figure: k-means with k = 3 on two-dimensional data.)
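Here is a quick sketch of building such a codebook with scikit-learn's KMeans (the codebook size and the random feature array are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

# Training feature vectors, e.g. 39-dimensional MFCCs (placeholder random data).
features = np.random.randn(5000, 39)

# Build a codebook of k centroids.
k = 256
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

# Each frame is then replaced by the index of its nearest codeword,
# giving the discrete observation symbols a discrete HMM expects.
symbols = codebook.predict(features)
print(symbols[:10])
```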

        With this codebook in hand, we can use the codeword indices to train the HMM. After training the model, we can also use it to decode audio clips. This method is called vector quantization and was used in early research, but it is less popular than GMMs now. So we just want you to be aware of it.

6. Reflection

        GMMs model the probability distribution of the observed feature vectors for a given phone. They provide a principled way to measure the "distance" between a phone and the audio frames we observe.

        HMMs, on the other hand, give a principled model of how states transition and emit observations. The probability of an observation sequence can be modeled with an HMM as:

(Equation from source.)
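In the usual notation, with h = (h₁, …, h_T) ranging over hidden phone sequences, the factorization reads:

$$
p(X) = \sum_{h} p(X \mid h)\, p(h)
     = \sum_{h_1,\dots,h_T} \prod_{t=1}^{T} p(x_t \mid h_t)\, p(h_t \mid h_{t-1})
$$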

        where h is the hidden state (phone). The likelihood of the features given a phone can be modeled with a GMM.


Origin: blog.csdn.net/gongdiwudu/article/details/131953395