In-depth understanding of Fourier transform (4)

Recap

  • Discrete Fourier transform formula
    $$\hat{X}\left(\frac{k}{N}\right) = \sum_{n=0}^{N-1} x(n)\, e^{-i 2\pi \frac{k}{N} n}, \quad k = 0, 1, 2, \ldots, N-1$$
  • Inverse discrete Fourier transform formula
    $$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} \hat{X}\left(\frac{k}{N}\right) e^{i 2\pi \frac{k}{N} n}$$
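
The two formulas can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `dft` and `idft` are chosen here for illustration, and the direct O(N²) evaluation is only meant to mirror the formulas, not to replace an FFT.

```python
import numpy as np


def dft(x):
    """Direct evaluation of X(k/N) = sum_n x(n) * exp(-i*2*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)              # one row per frequency index k
    return np.exp(-2j * np.pi * k * n / N) @ x


def idft(X):
    """Direct evaluation of x(n) = (1/N) * sum_k X(k/N) * exp(+i*2*pi*k*n/N)."""
    X = np.asarray(X, dtype=complex)
    N = len(X)
    k = np.arange(N)
    n = k.reshape(-1, 1)              # one row per time index n
    return np.exp(2j * np.pi * n * k / N) @ X / N


rng = np.random.default_rng(0)
x = rng.standard_normal(16)           # arbitrary real test signal
X = dft(x)
print(np.allclose(X, np.fft.fft(x)))  # same result as NumPy's FFT
print(np.allclose(idft(X), x))        # the inverse recovers the signal
```

Both checks should print True, since the FFT is just a fast way of evaluating the same sum.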

Why short-time Fourier analysis is needed

Discrete Fourier analysis decomposes a finite-length, non-periodic discrete signal (let the sampling rate be $s_r$ and the number of samples be $N$). Applying the discrete Fourier transform formula to the entire original signal gives the best amplitude and initial phase for each frequency. The frequencies are:
$$\left\{ 0, \frac{1}{N} s_r, \frac{2}{N} s_r, \ldots, \frac{N-1}{N} s_r \right\}$$

Here "best" (optimal) means that, at a given frequency, the sinusoid defined by that frequency, amplitude and initial phase is the one most similar to the original signal.
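
As an illustration, the sketch below builds a cosine whose frequency falls exactly on one of these frequencies and reads its amplitude and initial phase back from the DFT. The values of `sr`, `N`, `k0`, `amplitude` and `phase` are assumptions made for this example only. For a real cosine that sits exactly on bin k0, the energy is split between bins k0 and N - k0, hence the factor 2/N.

```python
import numpy as np

sr = 1000                          # sampling rate in Hz (illustrative value)
N = 500                            # number of samples (illustrative value)
freqs = np.arange(N) * sr / N      # {0, sr/N, 2*sr/N, ..., (N-1)*sr/N}

k0 = 25                            # bin index of the test tone: 25 * 1000 / 500 = 50 Hz
amplitude, phase = 0.8, np.pi / 3  # values we expect to recover
n = np.arange(N)
x = amplitude * np.cos(2 * np.pi * freqs[k0] * n / sr + phase)

X = np.fft.fft(x)
est_amplitude = 2 * np.abs(X[k0]) / N   # undo the N/2 scaling of a real cosine
est_phase = np.angle(X[k0])             # initial phase of the cosine

print(freqs[k0], est_amplitude, est_phase)
```

If the tone did not fall exactly on a bin, the recovered values would only be approximate, which is the leakage effect discussed later in this article.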

Obviously, discrete Fourier analysis is a global analysis of the original signal, but in the field of audio signal processing, it is necessary to analyze finer time intervals and finally obtain the spectrogram of the original signal.
The spectrogram (time-frequency spectrum) shows, for each short time interval (horizontal axis), the frequencies (vertical axis) present in the original signal and their power (color, usually the squared amplitude). Drawing a spectrogram requires the short-time Fourier transform (STFT).

Short Time Fourier Transform

Short-time Fourier transform includes: framing -> windowing -> discrete Fourier transform.

  • Framing: slide a window along the time axis of the original signal. There are two parameters: the window length (frame-size) and the step (hop-size). Both are usually given as an integer number of samples.
    Some implementations also have a separate window-size parameter; usually window-size equals frame-size.
    When hop-size is smaller than frame-size, consecutive frames overlap. This overlap is necessary, for reasons explained later.

  • Windowing: weight all the samples in a frame with a window function. Common window functions are the Gaussian window, the Hamming window and the Hann (Hanning) window. The latter two are the most widely used and can be written in a single form:
    $$w[n] = (1-\alpha) - \alpha \cos\left(\frac{2\pi n}{N-1}\right), \quad n = 0, 1, \ldots, N-1$$
    With $\alpha = 0.5$ this is the Hann window; with $\alpha = 0.46$ it is the Hamming window. All three window functions are bell-shaped. The Hann window is characterized by falling to zero at both ends.
    The length of the window function is set by window-size. In librosa, if window-size is not specified, window-size = frame-size. This article also takes the two to be equal, so N = frame-size = window-size.

  • Short-time Fourier transform: the formula is
    $$S(m,k) = \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-i 2\pi \frac{k}{N} n}$$
    Again, N = frame-size = window-size; m is the index of the current frame, starting from 0; H = hop-size; and w(n) is the window function. The left edge of frame m is at sample mH, so the original signal is read at n + mH and then Fourier transformed.
    The FFT is normally run on a number of points that is an integer power of 2, so a frame whose frame-size is not a power of 2 is zero-padded on both sides. For example, a frame with frame-size = 400 gets 56 zeros on each side, reaching 512 points. After windowing, the values at both ends of the frame are already close to 0, so the padding does not introduce a discontinuity.

  • Why a window function is used and why frames must overlap: cutting the signal into frames creates extra discontinuities at the frame boundaries that are not present in the original signal. The Fourier transform of a discontinuous signal exhibits spectral leakage. Spectral leakage has the following causes:

    • The duration of the processed signal is not an integer number of periods, which is usually the case
    • The endpoints of the signal are discontinuous, which is what framing produces

    Spectral leakage makes high-frequency components that do not exist in the original signal appear in the spectrogram; these components leak out of the discontinuities.
    If a window function is used, especially one that falls to zero at its ends (such as the Hann window), the discontinuity is gradually suppressed toward both sides of the frame. However, the more strongly the two sides are suppressed, the more of the information at the two sides is lost. With non-overlapping frames, a large amount of information is suppressed.
    Overlap between frames is therefore necessary: each frame then also carries information from the previous frame, so the information suppressed in one frame can be recovered from the overlap.
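
Both effects can be observed numerically. The sketch below is an illustration under assumed values (a 1010 Hz tone, a 400-sample frame at 16 kHz, and hypothetical helper names `relative_power_db` and `coverage`). It first compares the leakage far from the tone with and without a Hann window, then sums the window weights each sample receives with and without overlap.

```python
import numpy as np

sr = 16000                      # sampling rate in Hz
N = 400                         # frame-size in samples (25 ms)
n = np.arange(N)

# 1010 Hz gives 25.25 periods per frame: not an integer number of periods
frame = np.sin(2 * np.pi * 1010 * n / sr)

# Hann window from the unified window formula with alpha = 0.5
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))


def relative_power_db(x):
    """Power spectrum in dB relative to its own peak."""
    power = np.abs(np.fft.rfft(x)) ** 2
    return 10 * np.log10(power / power.max() + 1e-20)


freqs = np.fft.rfftfreq(N, d=1 / sr)
far = freqs > 4000              # bins far away from the 1010 Hz tone
leak_rect = relative_power_db(frame)[far].max()         # frame cut out with no window
leak_hann = relative_power_db(frame * hann)[far].max()  # same frame, Hann-weighted
print(f"strongest leakage above 4 kHz: no window {leak_rect:.1f} dB, Hann {leak_hann:.1f} dB")


def coverage(window, hop, n_frames):
    """Sum of the window weights that each sample receives across all frames."""
    total = np.zeros(hop * (n_frames - 1) + len(window))
    for m in range(n_frames):
        total[m * hop:m * hop + len(window)] += window
    return total


# Ignore the outermost region, which is only touched by the first or last frame
no_overlap = coverage(hann, hop=N, n_frames=4)[N:-N]
half_overlap = coverage(hann, hop=N // 2, n_frames=4)[N // 2:-N // 2]
print("minimum weight, hop = N:  ", no_overlap.min())
print("minimum weight, hop = N/2:", half_overlap.min())
```

With hop-size equal to frame-size, the samples at the frame boundaries receive a weight of zero, so their information is lost. With a hop of half a frame, every interior sample receives a total weight close to 1.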

What the STFT outputs

The short-time Fourier transform outputs a spectral matrix of shape [frequency-bins, frames]. When the signal itself is not padded, the two values are computed as:
$$\begin{aligned} \text{frequency bins} &= \frac{n_{fft}}{2} + 1 \\ \text{frames} &= \left\lfloor \frac{\text{samples} - \text{frame size}}{\text{hop size}} \right\rfloor + 1 \end{aligned}$$

  • $n_{fft}$ is divided by 2 because the spectrum of a real-valued signal is symmetric about the Nyquist frequency, so only the half up to the Nyquist frequency is kept. See In-depth understanding of Fourier transform (3) for the reason.

  • The number of frames is counted by following the right edge of the frame. The first frame occupies frame-size samples, and each step moves the frame hop-size samples to the right, so it can take $\left\lfloor \frac{\text{samples} - \text{frame size}}{\text{hop size}} \right\rfloor$ steps in total. Adding the first frame gives the total number of frames.

  • frame-size affects both the frequency resolution and the time resolution.
    When frame-size decreases, the frequency resolution decreases. The number of frequency bins depends only on n-fft, and every frame still covers frequencies up to the Nyquist frequency, but a smaller frame-size means each frame contains fewer samples and more padded zeros, and the zeros do not help recover the frequencies of the original signal. At the same time, the time resolution increases, because each spectrum describes a shorter time interval.
    To increase the number of frames without affecting the frequency resolution, reduce hop-size.
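
The pipeline framing -> windowing -> discrete Fourier transform and the shape formulas above can be sketched with NumPy alone. This is an illustrative sketch, not librosa's implementation: the function name `stft_sketch` is made up for this example, the signal is not padded, and each frame is zero-padded at its end by `np.fft.rfft` rather than on both sides, which changes the phase but not the magnitude. librosa's `stft` by default (`center=True`) additionally pads the signal itself, so it returns more frames than this formula.

```python
import numpy as np


def stft_sketch(x, frame_size, hop_size, n_fft):
    """Framing -> Hann windowing -> DFT, without padding the signal itself."""
    n = np.arange(frame_size)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / (frame_size - 1))
    n_frames = (len(x) - frame_size) // hop_size + 1
    spec = np.empty((n_fft // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        # frame m starts at sample m * hop_size
        frame = x[m * hop_size:m * hop_size + frame_size] * window
        # rfft appends zeros up to n_fft points and keeps bins 0 .. n_fft/2
        spec[:, m] = np.fft.rfft(frame, n=n_fft)
    return spec


sr = 16000
t = np.arange(sr) / sr                     # one second of audio
x = np.sin(2 * np.pi * 440 * t)            # 440 Hz test tone
S = stft_sketch(x, frame_size=400, hop_size=160, n_fft=512)
print(S.shape)                             # (257, 98)
```

Here 257 = 512 / 2 + 1 frequency bins and 98 = floor((16000 - 400) / 160) + 1 frames, in agreement with the formulas.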

Common parameters for speaker recognition: sampling rate sr = 16 kHz; frame-size of 25 ms, i.e. 400 samples, which can be written as sr // 40; hop-size of 10 ms, i.e. 160 samples, which can be written as sr // 100. Because an FFT is used, each frame is zero-padded from 400 to 512 points after framing.
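
The arithmetic behind these parameters can be written out as a small sketch. The variable names are chosen here for illustration, and rounding frame-size up to the next power of two is the assumption that leads to 512.

```python
sr = 16000                                   # sampling rate in Hz
frame_size = sr // 40                        # 25 ms -> 400 samples
hop_size = sr // 100                         # 10 ms -> 160 samples

# smallest power of two that is >= frame_size
n_fft = 1 << (frame_size - 1).bit_length()   # 512
pad_each_side = (n_fft - frame_size) // 2    # 56 zeros on each side of the frame
freq_bins = n_fft // 2 + 1                   # 257 rows in the spectral matrix
bin_width_hz = sr / n_fft                    # 31.25 Hz between neighboring bins

print(frame_size, hop_size, n_fft, pad_each_side, freq_bins, bin_width_hz)
```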

Time Spectrum

For a power spectrum, take the squared magnitude of the STFT:
$$Y(m,k) = |S(m,k)|^2$$
The code below reads an audio file, keeps the first 5 seconds, and plots the power spectrogram as a heatmap:

import librosa
import librosa.display
import matplotlib.pyplot as plt


def wav_to_spectrum(filepath, y_axis="linear"):
    # Load the audio resampled to 16 kHz and keep only the first 5 seconds
    signal, sr = librosa.load(path=filepath, sr=16000)
    signal = signal[:int(sr * 5)]

    # 25 ms window (400 samples), 10 ms hop (160 samples), 512-point FFT
    stft = librosa.stft(y=signal,
                        n_fft=512,
                        hop_length=sr // 100,
                        win_length=sr // 40)
    # power=2 returns the squared magnitude |S(m, k)|^2 together with the phase
    power, phase = librosa.magphase(stft, power=2)

    librosa.display.specshow(power,
                             sr=sr,
                             n_fft=512,
                             hop_length=sr // 100,
                             win_length=sr // 40,
                             x_axis="s",
                             y_axis=y_axis)

    plt.colorbar(format="%+2.f")
    plt.show()


if __name__ == "__main__":
    debussy_path = r"16 - Extracting Spectrograms from Audio with Python\audio\debussy.wav"
    wav_to_spectrum(debussy_path)

The color intensity (heat) represents the squared amplitude, which is a measure of sound intensity. The human ear's perception of sound intensity is nonlinear, and the figure shows almost no heat because the values have not been converted to decibels. The decibel is a logarithmic measure; in the formula below, $P_0$ is the reference power that corresponds to zero decibels.
$$L_{dB} = 10 \log_{10}\left(\frac{P}{P_0}\right)$$
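
The decibel formula can be sketched directly in NumPy. The function name `power_to_decibels` and its `ref` and `floor` parameters are assumptions made for this illustration (the floor only guards against taking the logarithm of zero).

```python
import numpy as np


def power_to_decibels(power, ref=1.0, floor=1e-10):
    """L_dB = 10 * log10(P / P0), with P0 = ref and a small floor to avoid log(0)."""
    power = np.maximum(np.asarray(power, dtype=float), floor)
    return 10.0 * np.log10(power / ref)


powers = np.array([1e-6, 1e-3, 1.0, 100.0])
print(power_to_decibels(powers))   # approximately [-60, -30, 0, 20]
```

Any power below the reference maps to a negative decibel value, and each factor of 10 in power corresponds to a step of 10 dB.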
Zero decibels is the minimum sound intensity that the human ear can hear. Most of the values in the figure are negative decibels. In addition, the frequency axis still uses the "linear" scale, whereas the human ear's perception of frequency is also nonlinear, roughly logarithmic.

import librosa
import librosa.display
import matplotlib.pyplot as plt


def wav_to_spectrum(filepath, y_axis="linear"):
    # Load the audio resampled to 16 kHz and keep only the first 5 seconds
    signal, sr = librosa.load(path=filepath, sr=16000)
    signal = signal[:int(sr * 5)]

    # 25 ms window (400 samples), 10 ms hop (160 samples), 512-point FFT
    stft = librosa.stft(y=signal,
                        n_fft=512,
                        hop_length=sr // 100,
                        win_length=sr // 40)
    power, phase = librosa.magphase(stft, power=2)
    # Convert the power spectrogram to decibels
    db = librosa.power_to_db(power)

    librosa.display.specshow(db,
                             sr=sr,
                             n_fft=512,
                             hop_length=sr // 100,
                             win_length=sr // 40,
                             x_axis="s",
                             y_axis=y_axis)

    plt.colorbar(format="%+2.f dB")
    plt.show()


if __name__ == "__main__":
    debussy_path = r"16 - Extracting Spectrograms from Audio with Python\audio\debussy.wav"
    # "log" draws the frequency axis on a logarithmic scale
    wav_to_spectrum(debussy_path, "log")

In practice, the log frequency axis is still not good enough: quite a few values are still negative decibels, and many of the high-decibel regions are drawn too densely.

This largely concludes the in-depth understanding of Fourier transform series; the posts that follow cover audio-related topics.

Origin blog.csdn.net/m0_46324847/article/details/128245898