Uncovering the Magic of MFCC: A Key Technology for Speech Recognition

Before reading this blog, you should know that MFCCs (Mel Frequency Cepstral Coefficients) are widely used for speech recognition in artificial intelligence. MFCC is used to extract features from a given audio signal. Let's first look at a flowchart of the steps involved in computing MFCCs:

image-20230724112216846

Analog-to-Digital Conversion: This step converts an analog signal into a digital signal, because most of the steps we perform in speech recognition operate on digital signals. The conversion involves steps such as sampling and quantization, usually followed by normalization and frame-based processing. Detailed instructions for these steps will be shared in the next blog.
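To make sampling and quantization a little more concrete, here is a minimal sketch that samples a sine wave and rounds each sample to a 16-bit integer level. The function name `sample_and_quantize`, the 16 kHz sample rate and the 16-bit depth are assumptions chosen for illustration, not values taken from this blog.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=16000, bits=16):
    """Sample a unit-amplitude sine wave and quantize it to signed integers."""
    max_int = 2 ** (bits - 1) - 1               # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                     # sampling: evaluate at discrete times
        x = math.sin(2 * math.pi * freq_hz * t)
        samples.append(round(x * max_int))      # quantization: snap to an integer level
    return samples

# 10 ms of a 440 Hz tone (both values are illustrative assumptions)
pcm = sample_and_quantize(freq_hz=440.0, duration_s=0.01)
```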

Pre-emphasis: The pre-emphasis step is usually implemented with a first-order high-pass filter. The filter emphasizes high-frequency content, which is important for distinguishing fine details in speech and audio signals: it boosts the amplitude of high-frequency components relative to low-frequency components. Increasing the energy of the sound at higher frequencies improves the accuracy of phone detection. (Here "phone" means a unit of speech sound, not a telephone.)
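A first-order high-pass filter of this kind is commonly written as y[n] = x[n] - alpha * x[n-1]. The sketch below assumes that form and a coefficient of alpha = 0.97, which is a typical choice but not a value given in this blog.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    out = [signal[0]]                 # the first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

# A constant (low-frequency) signal is almost cancelled,
# while a rapidly alternating (high-frequency) signal is amplified.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```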

Windowing: In simple terms, windowing divides the audio signal into short segments (frames). The standard choice is a frame length of 25 ms with a shift of 10 ms between consecutive frames. Cutting out rectangular segments chops the signal abruptly at the frame edges and introduces noise, so a Hamming window is used instead.*

*Reason for choosing 25 ms: a person speaks about 3 words per second on average. Each word contains about 4 phones, and each phone has 3 states. The total number of states in 1 second is therefore 3 * 4 * 3 = 36, so one state lasts about 1000 / 36 ≈ 28 ms, which is close to the chosen value of 25 ms.
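Framing with a Hamming window can be sketched as follows. This reads the 25 ms and 10 ms values as frame length and frame shift, and assumes a 16 kHz sample rate; the function names are illustrative.

```python
import math

def hamming(n_points):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames and taper each with a Hamming window."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    window = hamming(frame_len)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# One second of a constant signal, just to inspect the frame count and taper.
frames = frame_signal([1.0] * 16000)
```

Because the window tapers toward the frame edges, each frame fades in and out instead of being cut off abruptly.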

DFT (Discrete Fourier Transform): In the next step, we transform each windowed frame from the time domain to the frequency domain using the DFT, which is needed to calculate the MFCC coefficients. In simple terms, you can think of the DFT output as a sequence of complex numbers, one per frequency bin.
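To illustrate the idea of a frame turning into a sequence of complex numbers, here is a direct (slow) DFT written from the definition, plus the power in each frequency bin. Real pipelines would use an FFT routine instead; the helper names and the 16-sample test frame are assumptions for illustration.

```python
import cmath
import math

def dft(frame):
    """Direct DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N). O(N^2), for illustration only."""
    n_points = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

def power_spectrum(frame):
    """|X[k]|^2 for the non-redundant bins 0..N/2 of a real-valued frame."""
    spectrum = dft(frame)
    return [abs(x) ** 2 for x in spectrum[:len(frame) // 2 + 1]]

# A cosine that completes exactly 2 cycles in 16 samples puts its energy in bin 2.
frame = [math.cos(2 * math.pi * 2 * n / 16) for n in range(16)]
power = power_spectrum(frame)
peak_bin = max(range(len(power)), key=power.__getitem__)
```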

Mel Filter Bank: Before diving into the term, let's understand how humans hear sounds. The human ear is more sensitive to differences between low frequencies than to differences between high frequencies. For example, humans can easily tell the difference between 100 Hz and 200 Hz audio, but have a hard time telling the difference between 2000 Hz and 2100 Hz audio. To simulate this in a machine, we use the mel scale, which maps a physical frequency to the pitch that humans perceive:

image-20230724112300106

Mel frequency
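The formula in the figure is not reproduced in the text, so the sketch below assumes one widely used variant of the mel scale, mel = 2595 * log10(1 + f / 700). It compares the two frequency pairs from the example above and then places filter band edges at equal mel spacing; the function names and the choice of 26 filters over 0 to 8000 Hz are illustrative assumptions.

```python
import math

def hz_to_mel(freq_hz):
    """One common mel formula (assumed here): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_filters, low_hz, high_hz):
    """Edges of n_filters triangular bands, equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(n_filters + 2)]

# The same 100 Hz step is much larger in mels at low frequencies.
low_gap = hz_to_mel(200) - hz_to_mel(100)
high_gap = hz_to_mel(2100) - hz_to_mel(2000)

edges = mel_band_edges(n_filters=26, low_hz=0.0, high_hz=8000.0)
```

Under this formula the 100 to 200 Hz gap comes out several times larger in mels than the 2000 to 2100 Hz gap, and the band edges are packed closely at low frequencies and spread out at high frequencies.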

Log(): Let us review an important property of the logarithmic function: at low input values the gradient is relatively large, while at large input values the gradient is relatively small. In other words, as the input value increases, the gradient decreases. This is similar to our hearing mechanism: the human ear is more sensitive to changes in audio signals at lower energies than at higher energies. That is why we apply the log() function to mimic the human ear.

image-20230724112333734
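A small sketch of the log step: it takes the natural log of each filter bank energy, with a tiny floor so that a zero energy does not cause an error. The function name and the floor value are assumptions for illustration.

```python
import math

def log_energies(mel_energies, floor=1e-10):
    """Natural log of each filter bank energy, clamped to avoid log(0)."""
    return [math.log(max(e, floor)) for e in mel_energies]

# Equal ratios become equal steps: going from 1 to 10 changes the output
# by the same amount as going from 100 to 1000.
compressed = log_energies([1.0, 10.0, 100.0, 1000.0])
small_step = compressed[1] - compressed[0]
large_step = compressed[3] - compressed[2]
```

The absolute jump from 100 to 1000 is far larger than from 1 to 10, yet both map to the same step after the log, which is the compression described above.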

IDFT: IDFT stands for Inverse Discrete Fourier Transform. After taking the log of the mel filter bank outputs, we apply the IDFT to convert from the frequency domain back to a time-like domain; the result is called the cepstrum. MFCC takes the first 12 coefficients after applying the IDFT, plus the frame energy, as features, which gives 13 features per frame.
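In practice this inverse transform is usually implemented as a type-II discrete cosine transform (DCT) rather than a general IDFT, because the log mel spectrum is real-valued. The sketch below uses an unnormalized DCT-II written from its definition. Skipping coefficient 0 and using the log of the frame's summed squared samples as the energy term are assumptions made here for illustration.

```python
import math

def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II: c[k] = sum_i v[i] * cos(pi * k * (i + 0.5) / N)."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]

def static_features(log_mel, frame, n_ceps=12):
    """12 cepstral coefficients (c1..c12, an assumed convention) plus log frame energy."""
    cepstrum = dct_ii(log_mel, n_ceps + 1)
    energy = math.log(max(sum(s * s for s in frame), 1e-10))
    return cepstrum[1:] + [energy]

# Hypothetical inputs: 26 log filter bank values and one 400-sample frame.
log_mel = [0.1 * i for i in range(26)]
features = static_features(log_mel, frame=[0.5] * 400)
```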

Dynamic features: In addition to the 13 static features, MFCC also uses their first and second derivatives over time. This gives 26 more features, so MFCC produces 39 features for each frame. The delta coefficients (ΔMFCC), or first derivatives, represent the rate of change of the static MFCC coefficients over time and help capture dynamic changes. The delta-delta coefficients (ΔΔMFCC), or second derivatives, represent the acceleration, that is, the rate of change of the delta coefficients over time. Together they form the final feature vector for each frame.
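One common way to estimate these derivatives is a regression over neighbouring frames, d[t] = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2). The sketch below assumes that formula with a window of 2 frames on each side and pads the edges by repeating the first and last frame; other definitions, such as a simple frame-to-frame difference, are also in use.

```python
def delta(features, width=2):
    """Regression-based delta over +/- width frames; edges padded by repetition."""
    n_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = features[min(t + n, n_frames - 1)][d]
                behind = features[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def add_dynamic_features(static):
    """Concatenate static, delta and delta-delta features for every frame."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Hypothetical input: 10 frames of 13 static features that rise by 1 per frame.
static = [[float(t)] * 13 for t in range(10)]
full = add_dynamic_features(static)
```

Each row of `full` has 13 + 13 + 13 = 39 values, matching the feature count described above.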

Origin blog.csdn.net/shupan/article/details/131915640