Speech Recognition - Feature Extraction MFCC and PLP

1. Description

Speech recognition is a technology that uses computers and software to convert spoken language into machine-readable text or commands. It applies speech signal processing algorithms to recognize and understand human speech and turn it into a format a computer can process. Speech recognition is widely used in many fields, such as voice assistants, voice control, speech translation, voice search, and automated telephone answering.

2. The basic problem

Going back to speech recognition, our goal is to find the best sequence of words corresponding to the audio based on the acoustic and language models.

To create the acoustic model, our observation X is represented by a series of acoustic feature vectors (x₁, x₂, x₃, ...). In the previous article, we looked at how people produce and perceive speech. In this article, we discuss how to extract audio features based on what we have learned.

3. Speech Recognition Requirements

Let us first define some requirements for feature extraction in ASR (Automatic Speech Recognition). Given an audio clip, we extract audio features using a sliding window 25 ms wide.

This 25 ms width is long enough to capture sufficient information, yet short enough that the features within the frame remain relatively static. If we speak 4 words per second, each word has about 3 phones, and each phone is subdivided into 3 states, then there are 36 states per second, or about 28 ms per state. So the 25 ms window is about right.


Context is very important in speech. The pronunciation of a phone changes depending on the phones before and after it. Successive sliding windows are about 10 ms apart, so we can capture the dynamics between frames and thus the proper context.

Pitch varies from person to person, but it does little to help identify what was said. F0 is related to pitch; it has no value for speech recognition and should be removed. More important are the formants F1, F2, F3, ... For those who have trouble following these terms, we recommend reading the previous article first.

We also want the extracted features to be robust to who the speaker is and the noise in the environment. Also, like any ML problem, we want the extracted features to be independent of other features. It is easier to develop models and train those models using independent features.

A popular audio feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC), which produces 39 features per frame. The feature count is small enough to force us to keep only the salient information of the audio. Twelve of the parameters relate to the amplitude of the frequencies, giving us enough frequency channels to analyze the audio.

The following is the process of extracting MFCC features.

The main goals are:

  • Remove the vocal cord excitation (F0) - the pitch information.
  • Make the extracted features independent of one another.
  • Adapt to the way humans perceive loudness and frequency.
  • Capture the dynamics (context) of the phones.

4. Mel-frequency cepstral coefficients (MFCC)

        Let's go through it one step at a time.

Analog-to-digital conversion

A/D conversion samples an audio clip and digitizes the content, converting an analog signal into a discrete space. Typically a sampling frequency of 8 or 16 kHz is used.
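
As a rough illustration, the digitization step might look like the following in Python. This is a minimal sketch: the file name "speech.wav" is a hypothetical mono, 16-bit PCM recording, and NumPy/SciPy are assumed to be available.

```python
# A/D conversion sketch: read an already-digitized clip and scale it to [-1, 1).
# "speech.wav" is a hypothetical mono, 16-bit PCM file sampled at 16 kHz.
import numpy as np
from scipy.io import wavfile

sample_rate, signal = wavfile.read("speech.wav")   # e.g. sample_rate = 16000
signal = signal.astype(np.float32) / 32768.0       # int16 samples -> floats
```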


Pre-emphasis

Pre-emphasis boosts the energy in the high frequencies. For voiced segments such as vowels, the lower frequencies carry more energy than the higher ones. This is called spectral tilt and is related to the glottal source (how the vocal cords produce sound). Boosting the high-frequency energy makes the information in the higher formants more accessible to the acoustic model, which improves phone detection accuracy. In humans, hearing problems often show up first as an inability to hear these high-frequency sounds. Noise also has a strong high-frequency component, and in engineering pre-emphasis makes the system less susceptible to noise introduced later in the process. For some applications we simply undo the boost at the end.

Pre-emphasis uses a filter to boost the higher frequencies. Below is a before-and-after comparison showing how the high-frequency part of the signal is boosted.

Jurafsky and Martin, Fig. 9.9
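
Continuing the sketch above, a common first-order pre-emphasis filter is y[n] = x[n] - a·x[n-1], with a typically around 0.95-0.97 (the coefficient value is a common choice, not one prescribed by this article).

```python
# Pre-emphasis: boost high frequencies with a first-order filter.
pre_emphasis = 0.97
emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
```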

Windowing

Windowing involves cutting the audio waveform into sliding frames.

But we can't just chop it off at the edge of the frame. The sudden drop in amplitude creates a lot of noise, which appears in the high frequencies. To slice the audio, the amplitude should taper off near the edges of the frame.

Let w be the window applied to the original audio clip in the time domain. The windowed frame is then y[n] = w[n] · s[n], where s is the original signal.

Some common choices for w are the Hamming and Hanning windows. The figure below shows how these windows chop a sinusoidal waveform. As shown, with the Hamming and Hanning windows the amplitude tapers off near the edges of the frame. (The Hamming window does not taper all the way to zero at the edges, so the chopped signal keeps a slight step there, while the Hanning window goes to zero.)

The corresponding equations for w are:

  Rectangular: w[n] = 1
  Hamming:     w[n] = 0.54 - 0.46 cos(2πn / (N - 1))
  Hanning:     w[n] = 0.5 - 0.5 cos(2πn / (N - 1))

for 0 ≤ n ≤ N - 1, where N is the frame length.

The top-right plot is the sound wave in the time domain; it consists mainly of two frequencies. As shown, frames chopped with the Hamming and Hanning windows preserve the original frequency information better, with less noise, than the rectangular window.

Source top right: signal consisting of two frequencies
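
Continuing the sketch, framing and windowing with the 25 ms / 10 ms values from above might look like this. The frame and hop sizes follow the text; everything else (and the choice of the Hamming window) is illustrative.

```python
# Windowing: cut the signal into 25 ms frames every 10 ms and taper the
# edges of each frame with a Hamming window.
frame_len = int(0.025 * sample_rate)     # 400 samples at 16 kHz
hop_len = int(0.010 * sample_rate)       # 160 samples at 16 kHz
num_frames = 1 + (len(emphasized) - frame_len) // hop_len

frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                   for i in range(num_frames)])
frames = frames * np.hamming(frame_len)  # w[n] applied to every frame
```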

Discrete Fourier Transform (DFT)

Next, we apply DFT to extract information in the frequency domain.
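
A sketch of this step, continuing from the windowed frames above (a 512-point FFT and the 1/N scaling are typical, not mandatory, choices):

```python
# DFT: compute the power spectrum |X[k]|^2 of each windowed frame.
n_fft = 512
spectrum = np.fft.rfft(frames, n=n_fft)         # shape (num_frames, n_fft//2 + 1)
power_spec = (np.abs(spectrum) ** 2) / n_fft    # one common scaling convention
```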

Mel filterbank

As mentioned in the previous article, device measurements are not the same as our auditory perception. For humans, perceived loudness varies with frequency. Furthermore, the perceived frequency resolution decreases with increasing frequency. That is, humans are less sensitive to higher frequencies. The image on the left shows  how the Mel scale maps measured frequencies to frequencies we perceive in the context of frequency resolution.


All these mappings are non-linear. In feature extraction, we apply triangular bandpass filters to convert the frequency information so that it mimics human perception.


First, we square the output of the DFT. This reflects the speech power at each frequency (|x[k]|²), which we call the DFT power spectrum. We then apply the triangular mel-scale filters to convert it to a mel-scale power spectrum. The output for each mel bin represents the energy of the band of frequencies that its filter covers. This mapping is called Mel binning. The exact equation for bin m is:

  s(m) = Σₖ |X(k)|² · Hₘ(k),   m = 1, …, M

where Hₘ(k) is the weight of the m-th triangular filter at DFT bin k.

The triangular bandpass filters are wider at the higher frequencies to reflect the fact that human hearing is less sensitive there. Specifically, the filters are spaced linearly below 1000 Hz and logarithmically above it.

All of these efforts try to mimic how the basilar membrane in our ear senses sound vibrations. At birth, the basilar membrane contains approximately 15,000 hairs within the cochlea. The figure below shows the frequency responses of these hairs; the curve-shaped responses are only approximated by the triangles of the Mel filterbank.

We mimic how our ears perceive sound through these hairs. In short, it is modeled by a triangular filter using a Mel filter bank.
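
A sketch of mel binning, continuing from the power spectrum above. The filter count (26) and the conversion formula mel = 2595 · log10(1 + f/700) are common conventions, not values prescribed by the article.

```python
# Mel filterbank: triangular filters, spaced evenly on the mel scale.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

n_mels = 26
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

fbank = np.zeros((n_mels, n_fft // 2 + 1))
for m in range(1, n_mels + 1):                   # one triangle per mel bin
    left, center, right = bins[m - 1], bins[m], bins[m + 1]
    for k in range(left, center):
        fbank[m - 1, k] = (k - left) / max(center - left, 1)
    for k in range(center, right):
        fbank[m - 1, k] = (right - k) / max(right - center, 1)

mel_energies = power_spec @ fbank.T              # s(m) for every frame
```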


Log

The Mel filterbank outputs a power spectrum. Humans are less sensitive to small energy changes at high energy levels than at low ones; in fact, the perception is roughly logarithmic. So our next step is to take the log of the Mel filterbank output. This also reduces acoustic variations that are not significant for speech recognition. Next, we need to address two other requirements: remove the F0 (pitch) information and make the extracted features independent of one another.

Cepstrum — IDFT

Below is the model for speech production.


Our articulation controls the shape of the vocal tract. The source-filter model combines the vibration produced by the vocal cords (the source) with the filter formed by our articulation. The glottal source waveform is suppressed or amplified at different frequencies by the shape of the vocal tract.

"Cepstrum" is the word "spectrum" with its first four letters reversed. Our next step is to compute the cepstrum, which separates the glottal source from the filter. Figure (a) is the spectrum, with magnitude on the y-axis. Figure (b) takes the logarithm of the magnitude. Looking closely, the wave fluctuates about 8 times between 1000 and 2000, that is, roughly 8 fluctuations per 1000 Hz. This corresponds to about 125 Hz, the frequency of the vocal cord vibration.

Paul Taylor [2008]

As observed, the log spectrum (first graph below) is composed of information about the phone (second graph) and the pitch (third graph). The peaks in the second plot identify the formants that distinguish the phones. But how do we separate them?


Recall that periods in the time or frequency domain are inverted after transformation.

Recall that pitch information has a short period in the frequency domain. We can apply an inverse Fourier transform to separate the pitch information from the formants. The pitch information shows up in the middle and on the right, as in the image below; the peak in the middle actually corresponds to F0, while the phone-related information lies on the far left.

Here's another visualization. The solid line on the left plot is the signal in the frequency domain. It consists of phone information and pitch information drawn with dotted lines. After IDFT (Inverse Discrete Fourier Transform), pitch information with a period of 1/T is converted to a peak around T on the right.
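
A sketch of this separation on a single frame, continuing from the sketches above. The frame index (50) and the 60-400 Hz pitch search range are purely illustrative.

```python
# Cepstrum of one frame: IDFT of the log magnitude spectrum. Pitch shows up
# as a peak at quefrency T samples, i.e. an F0 estimate of sample_rate / T.
frame_log_spec = np.log(np.abs(np.fft.rfft(frames[50], n=n_fft)) + 1e-10)
cepstrum = np.fft.irfft(frame_log_spec)
lo, hi = int(sample_rate / 400), int(sample_rate / 60)   # search 60-400 Hz
f0_estimate = sample_rate / (lo + np.argmax(cepstrum[lo:hi]))
```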


So for speech recognition, we only need the coefficients on the far left and discard the others. In fact, MFCC keeps just the first 12 cepstral values. There is another important property associated with these 12 coefficients: the log power spectrum is real and symmetric, so its inverse DFT is equivalent to a discrete cosine transform (DCT).

The DCT is an orthogonal transform. Mathematically, such a transform produces uncorrelated features, so the MFCC features are largely uncorrelated. In ML, this makes our models easier to build and train: if we model these parameters with a multivariate Gaussian distribution, all the off-diagonal values of the covariance matrix will be zero. Mathematically, the output of this stage is:

  c(n) = Σₘ log(s(m)) · cos(π n (m - 0.5) / M),   n = 1, …, 12

where the sum runs over the M mel bins.
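
Continuing the sketch, the log and DCT steps follow. Keeping coefficients 1-12 (dropping c(0), whose role is covered by the separate energy term added later) is one common convention, not the only one.

```python
# Log compression followed by a DCT of the mel energies; keep 12 coefficients.
from scipy.fftpack import dct

log_mel = np.log(np.maximum(mel_energies, 1e-10))        # floor avoids log(0)
mfcc = dct(log_mel, type=2, axis=1, norm="ortho")[:, 1:13]
```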

Below is a visualization of the 12 cepstral coefficients of the cepstrum.


Dynamic features (deltas)

MFCC has 39 features. We have accounted for 12 so far; what are the rest? The 13th feature is the energy in each frame, which helps us identify phones.

In pronunciation, context and dynamics matter. Articulations such as stop closures and releases can be identified by formant transitions, so characterizing how the features change over time provides contextual information about the phone. The next 13 values are the delta values d(t) below, which measure the change in the features from the previous frame to the next frame. This is the first derivative of the features:

  d(t) = (c(t + 1) - c(t - 1)) / 2

The last 13 parameters measure the dynamics of d(t) from the previous frame to the next; they act as the second derivative of c(t).

Therefore, the 39 MFCC feature parameters are 12 cepstral coefficients plus the energy term. Then we also have 2 sets corresponding to delta and double delta values.
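
A sketch of assembling the 39-dimensional vector, continuing from the sketches above. A simple two-frame difference is used for the deltas; production toolkits often use a wider regression window.

```python
# Energy, deltas and double deltas -> 13 + 13 + 13 = 39 features per frame.
def deltas(c):
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0   # d(t) = (c(t+1) - c(t-1)) / 2

energy = np.log(np.maximum(power_spec.sum(axis=1), 1e-10))   # log frame energy
feats = np.hstack([mfcc, energy[:, None]])                   # 12 cepstra + energy
features_39 = np.hstack([feats, deltas(feats), deltas(deltas(feats))])
```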

Cepstral mean and variance normalization

Next, we can perform feature normalization: we subtract each feature's mean and divide by its standard deviation (so the variance becomes one). The mean and variance are computed for feature j across all frames of a single utterance. This lets us adjust the values to counter variation between recordings.

However, this may not be reliable if the audio clip is short. Instead, we can compute the mean and variance over each speaker, or even over the entire training dataset. (This type of normalization effectively cancels the pre-emphasis done earlier.) That is how MFCC features are extracted. A final note: MFCC is not very robust against noise.
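
A per-utterance normalization sketch, continuing from the 39-dimensional features above:

```python
# Cepstral mean and variance normalization over all frames of one utterance.
mean = features_39.mean(axis=0)
std = features_39.std(axis=0) + 1e-10     # guard against zero variance
features_cmvn = (features_39 - mean) / std
```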

5. Perceptual Linear Prediction (PLP)

PLP is very similar to MFCC. Motivated by auditory perception, it uses an equal-loudness pre-emphasis and cube-root compression instead of logarithmic compression.


It also uses linear prediction to derive the final cepstral coefficients. PLP offers slightly better accuracy and slightly better noise robustness, but many see MFCC as the safe bet. In this series, whenever we speak of extracting MFCC features, PLP features can be extracted instead.
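
For contrast, here is a minimal sketch of the compression difference only. Real PLP operates on a Bark-scale critical-band spectrum with equal-loudness weighting, followed by linear prediction and a cepstral recursion, all of which are omitted here; the mel energies from the earlier sketch are reused purely for illustration.

```python
# MFCC-style log compression vs. PLP-style cube-root (intensity-to-loudness)
# compression of the filterbank energies.
log_compressed = np.log(np.maximum(mel_energies, 1e-10))
cube_root_compressed = np.cbrt(mel_energies)
```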

6. Postscript

        ML builds models of a problem domain. For complex problems this is very hard, and the methods are often heuristic; sometimes it feels like we are hacking the system. The feature extraction methods in this article rely heavily on empirical results and observations. With the introduction of deep learning, we can train complex models with fewer of these hacks. Nevertheless, certain concepts remain valid and important for DL speech recognition.

        Next: To gain a deeper understanding of speech recognition, we need to examine two ML algorithms in detail.


Origin blog.csdn.net/gongdiwudu/article/details/131980228