In-depth understanding of the Mel scale, Mel filter banks, and Mel spectrograms

Recap

Short-time Fourier transform formula:
$$S(m,k) = \sum_{n=0}^{N-1} x(n+mH)\,w(n)\,e^{-i2\pi\frac{k}{N}n}$$
Here m is the index of the current frame, which represents the current time period, and k is the index of the current frequency bin. For the complex exponential $e^{-i2\pi\frac{k}{N}n}$ at that frequency, the transform finds the best-matching amplitude and initial phase. w(n) is the window function, N is the frame length, and H is the hop size. For more information about the short-time Fourier transform, please refer to In-depth understanding of the Fourier transform (4).
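As a rough illustration of the formula, the sketch below evaluates S(m, k) directly with NumPy, with the sum running over n = 0, ..., N-1 and only the non-negative frequency bins kept. The function name `stft_direct` and the toy signal are assumptions made for this example; it does no padding or centering, so it is not meant to reproduce the output of any particular library.

```python
import numpy as np

def stft_direct(x, N, H, window):
    """Evaluate S(m, k) = sum_{n=0}^{N-1} x(n + mH) w(n) exp(-i 2 pi k n / N)."""
    n = np.arange(N)
    k = np.arange(N // 2 + 1)             # keep only the non-negative frequency bins
    # basis[k, n] = exp(-i 2 pi k n / N)
    basis = np.exp(-2j * np.pi * np.outer(k, n) / N)
    num_frames = 1 + (len(x) - N) // H    # frames that fit entirely inside x
    S = np.empty((len(k), num_frames), dtype=complex)
    for m in range(num_frames):
        frame = x[m * H : m * H + N] * window   # x(n + mH) w(n)
        S[:, m] = basis @ frame
    return S

# Toy signal: a cosine whose frequency falls exactly on bin 4 of a 32-sample frame.
N, H = 32, 8
x = np.cos(2 * np.pi * 4 * np.arange(128) / N)
S = stft_direct(x, N, H, np.hanning(N))
print(S.shape)                      # (17, 13): 1 + N/2 bins, 13 frames
print(np.abs(S).argmax(axis=0))     # strongest bin in each frame
```

Each column of `S` is the windowed discrete Fourier transform of one frame, so it can be cross-checked against `np.fft.rfft` applied to the same windowed frame.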

The Mel spectrogram explained in this article builds on the ordinary time-frequency spectrogram; for that background, you can also refer to In-depth understanding of the Fourier transform (4).

Mel scale

The human ear's perception of pitch is nonlinear: when the frequency of a sound increases linearly, we do not perceive the pitch as increasing linearly. To describe pitch on a scale that the ear perceives as linear, we need the Mel scale, which is essentially a function of frequency that maps Hertz (Hz) to Mel:
$$m = 2595\log_{10}\left(1+\frac{f}{700}\right) = 1127\ln\left(1+\frac{f}{700}\right)$$
As the formula shows, the logarithm can be taken to base 10 or to base e; different bases correspond to different coefficients. To determine the coefficient for a given base, substitute the point (1000 Hz, 1000 mel).
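The coefficient calculation can be checked numerically. The sketch below uses only the standard library; the function names `hz_to_mel` and `hz_to_mel_ln` are chosen for this example and are not taken from any library.

```python
import math

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the base-10 form of the formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_mel_ln(f_hz):
    """The same mapping written with the natural logarithm."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

# Coefficient that makes each curve pass through (1000 Hz, 1000 mel):
coeff_log10 = 1000.0 / math.log10(1.0 + 1000.0 / 700.0)
coeff_ln = 1000.0 / math.log(1.0 + 1000.0 / 700.0)

print(round(coeff_log10), round(coeff_ln))   # 2595 1127
print(hz_to_mel(1000.0), hz_to_mel_ln(1000.0))
```

Both forms give a value very close to 1000 mel at 1000 Hz; the small difference comes from rounding the coefficients to integers.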

A natural question is: since the human ear's perception of pitch is nonlinear, why does the Mel scale pass through (1000 Hz, 1000 mel)? The reason is that the ear's perception of the low-frequency range, roughly 0 Hz to 1000 Hz, is approximately linear, so the scale is anchored at (1000 Hz, 1000 mel) and also passes through the point (0 Hz, 0 mel). The plot of the Mel scale also shows that the low-frequency part is approximately linear:
[Figure: the Mel scale as a function of frequency in Hz]
The mapping from the Mel scale back to Hertz is as follows:
$$f = 700\left(10^{\frac{m}{2595}}-1\right) = 700\left(e^{\frac{m}{1127}}-1\right)$$
[Figure: frequency in Hz as a function of Mel]
Now, when the Mel value increases linearly, the frequency in Hertz increases exponentially, and the human ear perceives the resulting change in pitch as linear.
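A small numeric check of the inverse mapping, again as a sketch with example-only function names: equal steps on the Mel axis correspond to ever larger steps in Hz.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps of 500 mel ...
mels = [0.0, 500.0, 1000.0, 1500.0, 2000.0, 2500.0]
hz = [mel_to_hz(m) for m in mels]
# ... and the size of the corresponding steps in Hz.
steps = [b - a for a, b in zip(hz, hz[1:])]

for m, f in zip(mels, hz):
    print(f"{m:6.0f} mel -> {f:8.1f} Hz")
print([round(s, 1) for s in steps])
```

The printed step sizes grow from one interval to the next, which is the nonlinear relationship between Hz and perceived pitch that the Mel scale is meant to capture.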

Mel spectrogram

Drawing a Mel spectrogram takes three things into account at the same time:

  • The time, frequency, and magnitude information of a piece of audio should be presented simultaneously, which is what the time-frequency spectrogram is for.
  • The measure of loudness should be linearly related to the human ear's perception of loudness, which
    can be achieved by converting the squared amplitude into decibels.
  • The measure of pitch should be linearly related to the human ear's perception of pitch.
    This is achieved by using the Mel scale to build a Mel filter bank and filtering the original time-frequency spectrogram with it.
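For the loudness point, the conversion from squared amplitude to decibels is 10·log10(power / reference). The function below is a simplified stand-in written for this example (the `ref` and `floor` parameters are assumptions), not the implementation used by librosa.

```python
import math

def power_to_db(power, ref=1.0, floor=1e-10):
    """10 * log10(power / ref), with a small floor to avoid log(0)."""
    return 10.0 * math.log10(max(power, floor) / ref)

amplitude = 0.5
print(power_to_db(amplitude ** 2))         # about -6.02 dB
print(power_to_db((2 * amplitude) ** 2))   # 0.0 dB: doubling the amplitude adds about 6.02 dB
```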

Mel filter bank

There are three steps to using a Mel filter bank:

  1. Choose the number of filters $n_{mels}$
    The number depends on the problem being studied; commonly used values are 40, 60, 90, and 128. Here we take a small number, 6, so that the figures are easy to read.
  2. Build the filter bank
    This is done in 4 steps:
    1. Determine the frequency range to be filtered, that is, choose the minimum frequency $f_l$ and the maximum frequency $f_h$. Usually the minimum frequency is 0 and the maximum frequency is the Nyquist frequency $\frac{s_r}{2}$. Then convert these two frequencies to the Mel scale, giving $m_l$ and $m_h$.
    2. Connect $m_l$ and $m_h$ with a line segment and take $n_{mels}$ equally spaced points strictly between them, obtaining the sequence $\{m_1, m_2, \ldots, m_{n_{mels}}\}$. Because these points are equally spaced on the Mel scale, the human ear perceives the corresponding pitches as equally spaced.
    3. Convert these points back to Hertz to get the sequence $\{f_1, f_2, \ldots, f_{n_{mels}}\}$. Note that after converting to Hertz, each point must be rounded to the nearest of the $(1+\frac{n_{fft}}{2})$ frequencies $\{0, \frac{1}{n_{fft}}s_r, \frac{2}{n_{fft}}s_r, \ldots, \frac{n_{fft}/2}{n_{fft}}s_r\}$, so that each point corresponds to exactly one FFT bin. Expressed in mathematical language:
      $$\forall i \in \{1, 2, \ldots, n_{mels}\}: \quad f_i \in \left\{0, \frac{1}{n_{fft}}s_r, \frac{2}{n_{fft}}s_r, \ldots, \frac{n_{fft}/2}{n_{fft}}s_r\right\}$$
    4. From the sequence $\{f_1, f_2, \ldots, f_{n_{mels}}\}$, draw the following figure:
      [Figure: the triangular Mel filters plotted against frequency in Hz]
      The figure has $n_{mels}$ upper vertices. Their abscissas are $\{f_1, f_2, \ldots, f_{n_{mels}}\}$ and their ordinates are all 1. The first triangle has left vertex $f_l$, upper vertex $f_1$, and right vertex $f_2$; the second triangle has left vertex $f_1$, upper vertex $f_2$, and right vertex $f_3$; and so on, until the last triangle, which has left vertex $f_{n_{mels}-1}$, upper vertex $f_{n_{mels}}$, and right vertex $f_h$.
      The ordinate of a triangle is the weight assigned to the frequency on the abscissa. The Mel filter bank can therefore be expressed as a two-dimensional matrix with shape $[n_{mels}, 1+\frac{n_{fft}}{2}]$.
      Note that each Mel filter also has weights for the frequencies outside its triangle, but they are all 0.
  3. Filter the time-frequency spectrogram
    The time-frequency spectrogram can also be expressed as a two-dimensional matrix, with shape $[1+\frac{n_{fft}}{2}, frames]$. Denote the Mel filter bank by M and the spectrogram by Y; the result of filtering is then the matrix product MY. Recall from linear algebra that the product is defined only when the number of columns of the first matrix equals the number of rows of the second, and that the result has shape [rows of the first matrix, columns of the second matrix], which here is $[n_{mels}, frames]$.
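The three steps above can be sketched in NumPy as follows. This is a minimal illustration of the construction as described in this article (equally spaced Mel points, rounding to FFT bins, triangles with peak height 1); the function names and the random stand-in spectrogram are assumptions for the example, and the result is not meant to match librosa.filters.mel numerically.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels, f_low=0.0, f_high=None):
    """Triangular filters with peak height 1; shape [n_mels, 1 + n_fft // 2]."""
    f_high = sr / 2.0 if f_high is None else f_high
    # Steps 2.1-2.2: n_mels interior points plus the two endpoints, equally spaced in Mel.
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_mels + 2)
    # Step 2.3: back to Hz, then round to the nearest FFT bin index.
    bins = np.rint(mel_to_hz(mel_points) * n_fft / sr).astype(int)
    # Step 2.4: one triangle per (left, peak, right) triple of bins.
    fb = np.zeros((n_mels, 1 + n_fft // 2))
    for i in range(n_mels):
        left, peak, right = bins[i], bins[i + 1], bins[i + 2]
        if peak > left:
            fb[i, left:peak + 1] = np.linspace(0.0, 1.0, peak - left + 1)
        if right > peak:
            fb[i, peak:right + 1] = np.linspace(1.0, 0.0, right - peak + 1)
    return fb

M = mel_filter_bank(sr=16000, n_fft=512, n_mels=6)
# Stand-in for a power spectrogram with 100 frames (random values, for shapes only).
Y = np.random.default_rng(0).random((1 + 512 // 2, 100))
mel_spec = M @ Y                           # step 3: filtering is a matrix product
print(M.shape, Y.shape, mel_spec.shape)    # (6, 257) (257, 100) (6, 100)
```

With very small `n_fft` or many filters, neighbouring points can round to the same bin and produce a degenerate triangle; the sketch does not guard against that case.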

Demo

Read a piece of audio, use the short-time Fourier transform to get an ordinary time-frequency spectrogram, and then plot the Mel filter bank. Note that librosa's Mel filter bank function also normalizes the weights by default: each weight of a triangular filter is divided by the area of its triangle. If you do not want this normalization, set the parameter norm=None, i.e. melfb = librosa.filters.mel(sr=sr, n_fft=N_FFT, n_mels=N_MELS, norm=None).

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

if "__main__" == __name__:

    debussy_path = r"16 - Extracting Spectrograms from Audio with Python\audio\debussy.wav"
    signal, sr = librosa.load(path=debussy_path, sr=16000)
    N_FFT = 512
    N_MELS = 6

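    # Short-time Fourier transform: 10 ms hop and 25 ms window at 16 kHz
    stft = librosa.stft(y=signal,
                        n_fft=N_FFT,
                        hop_length=sr // 100,
                        win_length=sr // 40)
    # Frequency (Hz) of each of the 1 + N_FFT // 2 FFT bins
    freq = librosa.fft_frequencies(sr=sr, n_fft=N_FFT)
    # Squared magnitude (power spectrogram) and phase
    power, phase = librosa.magphase(stft, power=2)

    # Mel filter bank with shape [N_MELS, 1 + N_FFT // 2]; plot one curve per filter
    melfb = librosa.filters.mel(sr=sr, n_fft=N_FFT, n_mels=N_MELS)
    plt.plot(freq, np.transpose(melfb))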

    plt.show()

[Figure: the Mel filter bank with 6 triangular filters plotted against frequency]
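The area normalization mentioned above can be illustrated on two hand-built triangles. This is a plain-Python sketch; the helper names and the vertex frequencies are made up for the example and are not librosa's code or values.

```python
def triangle_weights(freqs, left, peak, right):
    """Weights of one triangular filter (peak height 1) at the given frequencies."""
    weights = []
    for f in freqs:
        if left < f <= peak:
            weights.append((f - left) / (peak - left))
        elif peak < f < right:
            weights.append((right - f) / (right - peak))
        else:
            weights.append(0.0)
    return weights

def area_normalize(weights, left, right):
    """Divide by the triangle's area: height 1, base (right - left) Hz."""
    area = 0.5 * (right - left)
    return [w / area for w in weights]

bin_width = 31.25                      # sr / n_fft for sr = 16000, n_fft = 512
freqs = [k * bin_width for k in range(257)]

# A narrow low-frequency filter and a wide high-frequency filter (example vertices).
narrow = area_normalize(triangle_weights(freqs, 0.0, 312.5, 750.0), 0.0, 750.0)
wide = area_normalize(triangle_weights(freqs, 3531.25, 5375.0, 8000.0), 3531.25, 8000.0)

# After normalization both filters integrate to roughly 1 over frequency.
print(sum(narrow) * bin_width, sum(wide) * bin_width)
# The wide filter ends up with a much lower peak than the narrow one.
print(max(narrow), max(wide))
```

Without normalization every filter peaks at 1, so a wide high-frequency filter sums over many more bins than a narrow low-frequency one; dividing by the area compensates for that difference in width.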
Take the matrix product directly, then convert the squared amplitude into decibels and draw the Mel spectrogram. Be sure to filter first and convert to decibels afterwards.

import librosa
import matplotlib.pyplot as plt
import librosa.display
import numpy as np

if "__main__" == __name__:
    # m = np.linspace(0, 2600, 2600 + 1)
    # f = 700 * (np.exp(m / 1127) - 1)
    # plt.plot(m, f)
    # plt.xlabel("Mel Frequency(mel)")
    # plt.ylabel("Frequency(Hz)")

    debussy_path = r"16 - Extracting Spectrograms from Audio with Python\audio\debussy.wav"
    signal, sr = librosa.load(path=debussy_path, sr=16000)
    N_FFT = 512
    N_MELS = 6

    stft = librosa.stft(y=signal,
                        n_fft=N_FFT,
                        hop_length=sr // 100,
                        win_length=sr // 40)
    freq = librosa.fft_frequencies(sr=sr, n_fft=N_FFT)
    power, phase = librosa.magphase(stft, power=2)

    melfb = librosa.filters.mel(sr=sr, n_fft=N_FFT, n_mels=N_MELS)
    # plt.plot(freq, np.transpose(melfb))
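    # Filter first: [N_MELS, 1 + N_FFT // 2] x [1 + N_FFT // 2, frames] -> [N_MELS, frames]
    melspec = np.matmul(melfb, power)

    # Then convert the filtered power to decibels
    melspec_db = librosa.power_to_db(melspec)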
    # plt.subplot(2, 1, 1)
    librosa.display.specshow(melspec_db,
                             sr=sr,
                             n_fft=N_FFT,
                             hop_length=sr // 100,
                             win_length=sr // 40,
                             x_axis="s",
                             y_axis="mel")
    plt.colorbar(format="%+2.f dB")

    plt.show()

[Figure: Mel spectrogram of the audio, in dB]
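The order stressed above (filter first, then convert to decibels) matters because the logarithm is not linear: a weighted sum of powers converted to dB is not the same as a weighted sum of dB values. A tiny numeric sketch with a made-up two-bin filter:

```python
import math

def to_db(power):
    return 10.0 * math.log10(power)

weights = [0.5, 0.5]      # two bins covered by one hypothetical filter
powers = [1.0, 100.0]     # power in those two bins

filter_then_db = to_db(sum(w * p for w, p in zip(weights, powers)))
db_then_filter = sum(w * to_db(p) for w, p in zip(weights, powers))

print(filter_then_db)   # about 17.03 dB
print(db_then_filter)   # 10.0 dB
```

The two orders differ by about 7 dB in this example, so swapping them would give a different spectrogram.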
The result computed by hand from the theory is consistent with the result computed directly by librosa:

import librosa
import matplotlib.pyplot as plt
import librosa.display
import numpy as np

if "__main__" == __name__:
    # m = np.linspace(0, 2600, 2600 + 1)
    # f = 700 * (np.exp(m / 1127) - 1)
    # plt.plot(m, f)
    # plt.xlabel("Mel Frequency(mel)")
    # plt.ylabel("Frequency(Hz)")

    debussy_path = r"16 - Extracting Spectrograms from Audio with Python\audio\debussy.wav"
    signal, sr = librosa.load(path=debussy_path, sr=16000)
    N_FFT = 512
    N_MELS = 6

    stft = librosa.stft(y=signal,
                        n_fft=N_FFT,
                        hop_length=sr // 100,
                        win_length=sr // 40)
    freq = librosa.fft_frequencies(sr=sr, n_fft=N_FFT)
    power, phase = librosa.magphase(stft, power=2)

    melfb = librosa.filters.mel(sr=sr, n_fft=N_FFT, n_mels=N_MELS)
    # plt.plot(freq, np.transpose(melfb))
    melspec = np.matmul(melfb, power)

    melspec_db = librosa.power_to_db(melspec)
    plt.subplot(2, 1, 1)
    librosa.display.specshow(melspec_db,
                             sr=sr,
                             n_fft=N_FFT,
                             hop_length=sr // 100,
                             win_length=sr // 40,
                             x_axis="s",
                             y_axis="mel")
    plt.colorbar(format="%+2.f dB")

    S = librosa.feature.melspectrogram(y=signal,
                                       sr=sr,
                                       n_fft=N_FFT,
                                       hop_length=sr // 100,
                                       win_length=sr // 40,
                                       n_mels=N_MELS)
    S_dB = librosa.power_to_db(S)
    plt.subplot(2, 1, 2)
    librosa.display.specshow(S_dB,
                             sr=sr,
                             n_fft=N_FFT,
                             hop_length=sr // 100,
                             win_length=sr // 40,
                             x_axis="s",
                             y_axis="mel")
    plt.colorbar(format="%+2.f dB")

    # Raises AssertionError if the two results differ beyond 6 decimal places
    np.testing.assert_array_almost_equal(S_dB, melspec_db)

    plt.show()

[Figure: the hand-computed Mel spectrogram (top) and the Mel spectrogram from librosa (bottom)]
The Mel spectrogram is a widely used audio feature.

The next section covers Mel Frequency Cepstral Coefficients (MFCC).


Origin blog.csdn.net/m0_46324847/article/details/128264697