In-depth understanding of MFCC (Mel Frequency Cepstral Coefficient)

Starting from the cepstrum

  • MFCC is the abbreviation of Mel Frequency Cepstral Coefficient. To understand the characteristics of MFCC, you first need to understand a new concept introduced here: the cepstrum.
  • The cepstrum is used to "extract" the timbre of speech, which was the most powerful feature for distinguishing speakers, especially in the pre-deep-learning era. Let us start directly with the formula for the cepstrum (a numpy sketch follows at the end of this section):
    $C[x(n)] = F^{-1}[\log(|F[x(n)]|^2)]$
  • where $x(n)$ is the discretized original signal, $F[\cdot]$ is the discrete Fourier transform, $\log(|\cdot|^2)$ means taking the magnitude of the DFT result, squaring it, and then taking the logarithm, and $F^{-1}[\cdot]$ is the inverse discrete Fourier transform.
    [Figures illustrating each step of the cepstrum computation]
  • Finally, the inverse transform yields the cepstrum: the horizontal axis is the quefrency (a coinage from "frequency", with units of seconds), and the vertical axis is the amplitude.
  • The so-called 1st rahmonic in the last figure is the first peak of the cepstrum seen reading from right to left. This 1st rahmonic in fact corresponds to the fundamental frequency of the original signal: its quefrency equals the fundamental period.
  • To follow this section, you need some knowledge of the discrete Fourier transform and the Mel spectrogram; you can refer to In-depth understanding of Fourier transform (3) and In-depth understanding of Mel scale, Mel filter bank and Mel spectrogram.
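  • As a concrete illustration, here is a minimal numpy sketch of the cepstrum formula above, applied to a synthetic harmonic signal. The 100 Hz fundamental, the sample rate, and the search range are all assumptions made for this demonstration:
import numpy as np

sr = 16000                                   # sample rate (assumed)
t = np.arange(sr) / sr                       # one second of samples
# Synthetic "voiced" signal: a 100 Hz fundamental plus its harmonics
x = sum(np.sin(2 * np.pi * 100 * k * t) for k in range(1, 30))

# C[x(n)] = F^{-1}[ log(|F[x(n)]|^2) ]
log_power = np.log(np.abs(np.fft.fft(x)) ** 2 + 1e-10)  # epsilon avoids log(0)
cepstrum = np.fft.ifft(log_power).real

# Quefrency is in seconds; the strongest peak away from zero sits near
# the fundamental period, 1/100 s
quefrency = np.arange(sr) / sr
lo, hi = sr // 400, sr // 50                 # search 2.5 ms .. 20 ms
peak = quefrency[lo + np.argmax(cepstrum[lo:hi])]
print(f"1st rahmonic at ~{peak * 1000:.1f} ms -> F0 ~ {1 / peak:.0f} Hz")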

Why cepstrum can extract timbre

  • The object whose vibration initially produces the sound is called the sound source; for speech, the sound source is the human vocal cords.
  • Air expelled from the lungs passes through the glottis, where the vibrating vocal cords chop the airflow into a train of glottal pulses. The rate of this vibration determines the pitch of the sound; because the sound source is the vocal cords, the frequency at which they vibrate is called the fundamental frequency of the speech.
  • The pulses must then pass through the vocal tract before leaving the mouth as audible speech. The vocal tract has resonant frequencies, which change as its shape and size change; these resonances show up in the speech signal as formants.
  • The fundamental frequency of the glottal pulses, the formants, and the sound intensity correspond to the three elements of sound: pitch, timbre, and loudness.
  • So why can the cepstrum extract timbre? Picture the air that first passes through the glottis as a signal called the glottal pulses, denoted $h(t)$; the vocal tract then acts as a complex filter, denoted $e(t)$; the output speech signal is the glottal pulses after filtering by the vocal tract, denoted $x(t)$. Note that at this point they are all continuous signals, so the following equation holds:
    $x(t) = h(t) * e(t)$
  • where $*$ denotes the convolution operation. After discretizing and taking the discrete Fourier transform, convolution in the time domain becomes a product in the frequency domain:
    $X(n) = H(n) \cdot E(n)$
  • Next, take the magnitude, then the square, and finally the logarithm:
    $\log[|X(n)|^2] = 2\log|H(n)| + 2\log|E(n)|$
  • Now the speech signal has been decomposed into the sum of two signals, as shown in the figure below (a code sketch of this separation follows at the end of this section):
    [Figure: the log power spectrum and its two components]
  • The lower left corner is $2\log|E(n)|$ and the lower right corner is $2\log|H(n)|$. The lower left is in fact the envelope of the log power spectrum, and the lower right is what remains after subtracting the envelope. The envelope has several prominent peaks (they would be sharp peaks, but taking the logarithm smooths them), which characterize the fundamental frequency and the formants; this is the signal we want to extract.
  • The last step is to use the inverse discrete Fourier transform to obtain the cepstrum.
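  • Here is a minimal sketch of this separation, assuming a single analysis frame `frame` (a 1-D numpy array) and a hand-picked quefrency cutoff: keep only the low-quefrency part of the cepstrum and transform back to get the envelope $2\log|E(n)|$; the remainder is $2\log|H(n)|$:
import numpy as np

def split_envelope(frame, cutoff=30):
    """Split a frame's log power spectrum into envelope + residual
    by keeping only the low-quefrency part of the cepstrum.
    `cutoff` (in cepstral bins) is a hand-tuned assumption."""
    log_power = np.log(np.abs(np.fft.fft(frame)) ** 2 + 1e-10)
    cepstrum = np.fft.ifft(log_power).real

    lifter = np.zeros_like(cepstrum)
    lifter[:cutoff] = 1.0
    lifter[-(cutoff - 1):] = 1.0        # keep the mirrored half as well

    envelope = np.fft.fft(cepstrum * lifter).real   # ~ 2 log|E(n)|
    residual = log_power - envelope                 # ~ 2 log|H(n)|
    return envelope, residual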

MFCC

  • For a piece of audio, the extraction process of MFCC is as follows:
    1. Pre-emphasize the audio signal to boost its high-frequency energy (equivalently, attenuate the low frequencies; a numpy sketch follows this list). This step can be implemented with the following formula:
       $x[n] = x[n] - \alpha x[n-1], \quad 0.9 \le \alpha \le 1.0$
    2. Short-time Fourier transform
    3. Apply a Mel filter bank to obtain the Mel spectrogram
    4. Take the logarithm to separate the superimposed signals
    5. Discrete cosine transform
    6. Keep the first few cepstral coefficients
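  • For step 1, a minimal numpy version of the pre-emphasis formula (α = 0.97 is a common but assumed choice):
import numpy as np

def pre_emphasize(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is kept unchanged
    return np.append(x[0], x[1:] - alpha * x[:-1])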
  • Compared with the plain cepstrum, MFCC changes the last step: the inverse discrete Fourier transform is replaced by the discrete cosine transform.
  • This works because the log power spectrum can be regarded as the superposition of two signals, and the fundamental frequency and formants we want to extract form the slowly varying, low-"frequency" part of that superposition.
  • So MFCC treats the log power spectrum as if it were a time-domain signal, performs Fourier analysis on it, and keeps the values corresponding to the first $n_{mfcc}$ "frequencies" as the final MFCC features.
  • In addition, using the discrete cosine transform has the following benefits (see the sketch after this list):
    • It can be seen as a simplified, real-valued version of the discrete Fourier transform
    • Its result is real-valued, which is exactly what MFCC needs
    • It decorrelates the overlapping Mel filter bank outputs, making the extracted features more independent of one another, which suits machine learning
    • Taking a log power spectrum as input and producing MFCC features as output, it also performs dimensionality reduction
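  • Steps 4 to 6 can be made concrete with scipy's DCT. This is a sketch of the idea (librosa wraps essentially the same operation): given a log Mel spectrogram `log_mel` of shape [n_mels, frames], a type-II DCT along the filter axis followed by truncation yields the MFCCs:
import numpy as np
from scipy.fft import dct

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    # DCT along the Mel-filter axis decorrelates the overlapping
    # filter outputs; keeping the first n_mfcc rows keeps the slowly
    # varying envelope part and reduces the dimension
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]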

MFCC output

  • Usually the first 12 coefficients are selected, and the energy of the current frame is appended, for a total of 13.

  • The lower-order coefficients carry more of the information about the fundamental frequency and formants.

  • After the 13 coefficients are obtained, first-order and second-order differences are computed over them along time; the second-order difference is simply the first-order difference applied to the first-order difference. The first-order difference comes in two flavors, forward and backward; averaging the two gives the central difference, which has the smallest error (a numpy sketch follows below):

    • Forward difference
      $\Delta x[n] = x[n+1] - x[n]$
    • Backward difference
      $\Delta x[n] = x[n] - x[n-1]$
    • Central difference
      $\Delta x[n] = \frac{x[n+1] - x[n-1]}{2}$

    where $x[n]$ denotes the 13 coefficients of the $n$-th frame. Splicing the first-order and second-order differences together with the original coefficients gives 39 coefficients per frame.
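  • A numpy sketch of the central difference applied along the time axis of an MFCC matrix of shape [13, frames] (edge frames are handled here by simple edge padding, an assumption of this sketch):
import numpy as np

def delta(features):
    # Central difference along time; features has shape [n_coeffs, frames]
    padded = np.pad(features, ((0, 0), (1, 1)), mode="edge")
    return (padded[:, 2:] - padded[:, :-2]) / 2

# Stacking the original 13 rows with the first- and second-order
# differences gives the 39 coefficients per frame:
# full = np.concatenate([mfcc, delta(mfcc), delta(delta(mfcc))], axis=0)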

  • The output of MFCC can be expressed as a two-dimensional array of shape $[n_{mfcc}, \text{frames}]$; since it is a two-dimensional array, it can be visualized as a heat map.

Advantages and disadvantages of MFCC

  • Advantages
    • Compared with the Mel spectrogram, it describes the spectral information with less data: a Mel spectrogram usually uses 80 filters per frame, while MFCC usually uses 39 features
    • Compared with the Mel spectrogram, the features are less correlated and more discriminative
    • It extracts the information characterizing the fundamental frequency and formants while filtering out other, irrelevant information
    • Works well with GMM-based acoustic models
  • Disadvantages
    • Compared with the Mel spectrogram, it costs more computation, because MFCC is computed on top of the Mel spectrogram
    • It is not robust to noise, especially additive noise
    • It is a heavily hand-designed feature, which leads to greater empirical risk
    • It is unsuitable for speech synthesis, because there is no exact inverse transform from MFCC features back to the audio signal

Demo

  • Note two things about librosa's MFCC extraction algorithm:
    • By default it does not use the energy of the current frame as the 13th coefficient; you can compute it yourself and splice it in (see the sketch after the code)
    • It also does not compute the first-order and second-order differences by default; you can compute those yourself and splice them in
  • The following code computes the first-order and second-order differences, then concatenates and visualizes them.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

if __name__ == "__main__":
    filepath = r"20- Extracting MFCCs with Python\female_audio.wav"
    signal, sr = librosa.load(path=filepath, sr=16000)
    N_FFT = 512
    N_MELS = 80
    N_MFCC = 13

    # Mel spectrogram with a 10 ms hop (sr // 100) and a 25 ms window (sr // 40)
    mel_spec = librosa.feature.melspectrogram(y=signal,
                                              sr=sr,
                                              n_fft=N_FFT,
                                              hop_length=sr // 100,
                                              win_length=sr // 40,
                                              n_mels=N_MELS)
    # power_to_db handles the log step; librosa.feature.mfcc then applies
    # the DCT and keeps the first N_MFCC coefficients
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_spec), n_mfcc=N_MFCC)

    # First- and second-order differences along time, stacked into 39 rows
    delta_mfcc = librosa.feature.delta(data=mfcc)
    delta2_mfcc = librosa.feature.delta(data=mfcc, order=2)
    mfccs = np.concatenate([mfcc, delta_mfcc, delta2_mfcc], axis=0)

    # Visualize the [39, frames] matrix as a heat map
    librosa.display.specshow(data=mfccs,
                             sr=sr,
                             n_fft=N_FFT,
                             hop_length=sr // 100,
                             win_length=sr // 40,
                             x_axis="s")
    plt.colorbar(format="%d")

    plt.show()
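  • Following the notes above, one possible way (an assumed sketch, not a librosa default) to splice the frame energy in as the 13th coefficient is to compute a log-energy row with librosa.feature.rms using the same framing, so the frame counts line up:
# Assumed continuation of the demo: append log frame energy to the
# first 12 MFCC rows, giving the 13 coefficients described earlier
rms = librosa.feature.rms(y=signal,
                          frame_length=sr // 40,
                          hop_length=sr // 100)       # shape: [1, frames]
log_energy = np.log(rms + 1e-10)
mfcc_13 = np.concatenate([mfcc[:12], log_energy], axis=0)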


  • Audio signal processing is a vast field; this series only covers the parts of it relevant to machine learning.

Origin: blog.csdn.net/m0_46324847/article/details/128274708