Advanced artificial intelligence: audio signal preprocessing


This chapter introduces time-domain and frequency-domain processing of audio, the Fourier transform, and the time-frequency graph (spectrogram), with the focus on understanding the underlying concepts.

References in this article:

Speech signal processing (4) Mel frequency cepstrum coefficient (MFCC)

Audio frame processing: frame blocking and windowing

This is a bit like the frame-by-frame hand-drawn animation many of us played with as children. Audio signals are likewise processed in frames of a fixed size, and it is worth getting used to this idea early: the Fourier transform below and linear predictive coding both operate on audio frames. Frame size is the size of one frame. It is usually given as the number of samples in a window or as the time the window covers, and the two are related through the sampling frequency (number of samples = duration × sampling frequency). Adjacent frames are separated by m samples. This distance between the starts of two consecutive windows is called the hop size (hop_size), that is, the number of non-overlapping samples.
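The framing described above can be sketched in a few lines of Python. This is a minimal illustration, not the original author's code: the function names `frame_size_in_samples` and `split_into_frames` are made up for this example, and dropping the incomplete last frame is an assumption (zero-padding it is another common choice).

```python
def frame_size_in_samples(duration_seconds, sampling_rate):
    """Number of samples covered by a window of the given duration."""
    return int(round(duration_seconds * sampling_rate))


def split_into_frames(samples, frame_size, hop_size):
    """Cut a signal into fixed-size frames whose starts are hop_size apart.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_size <= len(samples):
        frames.append(samples[start:start + frame_size])
        start += hop_size
    return frames


# Ten dummy samples, frames of 4 samples, consecutive frames 2 samples apart.
signal = list(range(10))
frames = split_into_frames(signal, frame_size=4, hop_size=2)
for frame in frames:
    print(frame)

# A 25 ms window at a 16000 Hz sampling frequency, expressed in samples.
print(frame_size_in_samples(0.025, 16000))
```

With a hop size smaller than the frame size, neighbouring frames overlap; here each frame shares its last two samples with the start of the next one.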
Exercise 1: [Figure: exercise statement]

Constraining the frame with a Hamming window

The Fourier transform converts a time-domain signal to the frequency domain. However, the start and end of a raw audio frame generally do not join up continuously. This abrupt jump in signal level produces a large amount of spurious energy (spectral leakage), which shows up as noise in the frequency-domain plot produced by the Fourier transform.
[Figure: Hamming window coefficients W(k)]
The Hamming window is a set of coefficients, W(k) in the figure. The signal level at each point of the original time-domain frame is multiplied by the corresponding coefficient, which constrains it. This effectively smooths the abrupt changes at the start and end of the frame. The result is as follows:
[Figure: the audio frame after applying the Hamming window]
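As a rough sketch of the windowing step, the snippet below builds the coefficients and multiplies them into a frame. The article only shows W(k) in a figure, so the formula used here, w(k) = 0.54 - 0.46·cos(2πk/(N-1)), is the standard textbook definition of the Hamming window and is assumed to match it. The function names are illustrative.

```python
import math


def hamming_window(n):
    """Standard Hamming coefficients w(k) = 0.54 - 0.46*cos(2*pi*k/(n-1)).

    Assumption: this textbook form matches the W(k) shown in the figure.
    """
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def apply_window(frame, window):
    """Multiply each sample of the frame by its window coefficient."""
    return [sample * coefficient for sample, coefficient in zip(frame, window)]


# A constant frame makes the taper easy to see: after windowing, the
# samples simply take the values of the window coefficients.
frame = [1.0] * 5
window = hamming_window(len(frame))
windowed = apply_window(frame, window)
print([round(value, 2) for value in windowed])
```

The coefficients are close to 0.08 at both ends and reach 1.0 in the middle, so the edges of the frame are pulled toward zero while the centre is left almost unchanged.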

Fourier Transform and Spectrum

Intuitive understanding

A piece of speech is split into many frames, and each frame is converted into a spectrum by a Fourier transform (computed with the FFT). The spectrum shows the relationship between energy and frequency for that one frame of speech.
[Figure: a piece of speech divided into frames, each marked with a yellow rectangle]
The Fourier transform is computed from a formula that converts the original time-domain information to the frequency domain: the horizontal axis used to be time and is now frequency. The figure below shows the spectrum obtained by the Fourier transform for one frame, corresponding to one of the yellow rectangles above.
[Figure: spectrum of one frame, energy against frequency]
The peaks represent the main frequency components of the speech. These peaks are called formants, and they carry the identifying characteristics of a sound (much like a personal ID card), which makes them very important for telling different sounds apart. The smooth curve that traces the formants and the transitions between them is called the spectral envelope.

Formal calculation

The Fourier transform operates on one audio frame. If the window size of the frame is N, it outputs N complex numbers $X_m$, $m = 0, 1, \dots, N-1$ (some materials keep only N/2 of them, $m = 0, 1, \dots, N/2-1$). Each $X_m$ has a real part real_part and an imaginary part imaginary_part, and each $X_m$ is the result of a summation: $X_m=\sum_{k=0}^{N-1}s_k e^{-j\left(\frac{2\pi km}{N}\right)}$. Using the identity $e^{-j\theta}=\cos(\theta)-j\sin(\theta)$, every $e^{-j\theta}$ term can be expanded and regrouped, where the real part corresponds to $\cos(\theta)$ and the imaginary part corresponds to $-\sin(\theta)$. Computing all the complex numbers $X_m$ of a frame is therefore a double loop, with the outer loop over m and the inner loop over k. The pseudo-code is as follows:
[Figure: pseudo-code of the double loop]
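Since the pseudo-code itself was an image, here is a minimal Python sketch of the double loop described above. It is a direct, unoptimized evaluation of $X_m=\sum_{k=0}^{N-1}s_k e^{-j\left(\frac{2\pi km}{N}\right)}$ and is not the original author's code. The names `real_part` and `imaginary_part` follow the text, while `dft` and `frame` are chosen for this example.

```python
import math


def dft(frame):
    """Discrete Fourier transform of one frame, written as a double loop."""
    n = len(frame)
    spectrum = []
    for m in range(n):                      # outer loop: one X_m per pass
        real_part = 0.0
        imaginary_part = 0.0
        for k in range(n):                  # inner loop: sum over the samples s_k
            theta = 2 * math.pi * k * m / n
            real_part += frame[k] * math.cos(theta)        # cos(theta) term
            imaginary_part -= frame[k] * math.sin(theta)   # -sin(theta) term
        spectrum.append(complex(real_part, imaginary_part))
    return spectrum


# One cycle of a cosine sampled at 4 points: s_k = cos(2*pi*k/4).
spectrum = dft([1.0, 0.0, -1.0, 0.0])
for m, x_m in enumerate(spectrum):
    print(m, round(x_m.real, 6), round(x_m.imag, 6))
```

For this input only $X_1$ and its mirror image $X_3$ are non-zero, which is why some materials keep just the first N/2 values for real-valued signals. In practice an FFT routine computes the same values much faster than this O(N²) loop.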
The figure below works through an example with real data. It only lists the summation for each complex number. The real part real_part and the imaginary part imaginary_part are not summed and simplified separately:
[Figure: worked example of the summation with real data]

I looked further into how the values $X_m$ in the frequency domain after the transform relate to the actual frequencies of the signal before the transform. If the sampling frequency is 25600 Hz and we take 256 consecutive samples for the discrete Fourier transform, then the smallest frequency interval (the frequency resolution) visible in the frequency-domain plot is 25600/256 = 100 Hz. In the plot of $X_m$, one point is drawn every 100 Hz starting from 0, so $X_m$ corresponds to the frequency m × 100 Hz. If the sound we sampled has a frequency of 200 Hz, its energy appears at the 200 Hz point (m = 2) of the frequency-domain plot.
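The numbers in this example can be checked with a short script. It is a sketch under the same assumptions as the text (25600 Hz sampling frequency, 256 samples, a pure 200 Hz tone). The helper names are illustrative, and the transform is evaluated directly from the summation formula rather than with an FFT library.

```python
import cmath
import math


def frequency_resolution(sampling_rate, n_samples):
    """Spacing in Hz between neighbouring points of the frequency-domain plot."""
    return sampling_rate / n_samples


def dft_magnitudes(samples):
    """Magnitude |X_m| of every DFT output, computed from the summation formula."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * m / n) for k, s in enumerate(samples)))
        for m in range(n)
    ]


sampling_rate = 25600   # Hz, as in the text
n_samples = 256         # samples fed to the transform
resolution = frequency_resolution(sampling_rate, n_samples)

# A 200 Hz sine sampled at 25600 Hz.
tone = [math.sin(2 * math.pi * 200 * k / sampling_rate) for k in range(n_samples)]
magnitudes = dft_magnitudes(tone)

# Search only the first half; the second half mirrors it for real signals.
half = magnitudes[: n_samples // 2]
peak_bin = max(range(len(half)), key=lambda m: half[m])
peak_frequency = peak_bin * resolution

print(resolution, peak_bin, peak_frequency)
```

The peak lands at index 2, which is 2 × 100 Hz = 200 Hz. A tone whose frequency is not a multiple of 100 Hz would fall between two points and spread its energy over neighbouring ones.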

Time-frequency graph (spectrogram)

The time-frequency graph (spectrogram) displays three dimensions of information in a single image: time, frequency, and the energy at each frequency. As shown in the figure below, the horizontal axis is time, the vertical axis is frequency, and the color represents energy: the whiter the color, the greater the energy.

Viewed vertically, each column is the energy-versus-frequency spectrum produced by the Fourier transform of one audio frame. Adding the time dimension to the per-frame Fourier transforms means the spectrum of a whole audio segment can be displayed rather than that of a single frame, so both static and dynamic information can be seen at a glance.

[Figure: spectrogram, with time on the horizontal axis, frequency on the vertical axis, and color showing energy]
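To make the column-by-column construction concrete, here is a minimal sketch that builds a spectrogram as a list of columns, one per frame. It assumes the standard Hamming formula, uses a direct DFT instead of an FFT, and keeps only the first N/2 + 1 frequency points. The function names and the toy two-tone signal are invented for this illustration.

```python
import cmath
import math


def hamming_window(n):
    """Standard Hamming coefficients (assumed form; requires n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def spectrogram(samples, frame_size, hop_size):
    """Return one column of energies |X_m|^2 per frame, for m = 0..frame_size/2."""
    window = hamming_window(frame_size)
    columns = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        # Window the frame, then take its Fourier transform.
        frame = [samples[start + k] * window[k] for k in range(frame_size)]
        column = []
        for m in range(frame_size // 2 + 1):
            x_m = sum(
                frame[k] * cmath.exp(-2j * math.pi * k * m / frame_size)
                for k in range(frame_size)
            )
            column.append(abs(x_m) ** 2)
        columns.append(column)
    return columns


# Toy signal: a low tone followed by a higher tone.
frame_size = 16
low = [math.sin(2 * math.pi * 2 * k / frame_size) for k in range(32)]
high = [math.sin(2 * math.pi * 5 * k / frame_size) for k in range(32)]
columns = spectrogram(low + high, frame_size=frame_size, hop_size=16)

# The strongest frequency point of each column shows how the tone changes over time.
peak_bins = [max(range(len(c)), key=lambda m: c[m]) for c in columns]
print(peak_bins)
```

Reading the columns from left to right, the strongest frequency point moves from the low tone to the high tone. That movement is the dynamic information a single-frame spectrum cannot show.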
What the time-frequency graph shows depends on the window size. The larger the window, the better the frequency resolution, because the frequency interval is the sampling frequency divided by the window size. The smaller the window, the better the time resolution, because each column covers a shorter stretch of audio.
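The trade-off can be put into numbers. The sketch below reuses the 25600 Hz sampling frequency from the earlier example and compares a few window sizes. Using the window duration as the measure of time resolution is a simplifying assumption for this illustration, and the window sizes are arbitrary.

```python
def resolutions(sampling_rate, window_size):
    """Return (frequency interval in Hz, window duration in seconds)."""
    frequency_resolution_hz = sampling_rate / window_size
    window_duration_s = window_size / sampling_rate
    return frequency_resolution_hz, window_duration_s


sampling_rate = 25600  # Hz, same value as the earlier example
table = {n: resolutions(sampling_rate, n) for n in (128, 256, 1024)}

for window_size, (freq_hz, duration_s) in table.items():
    print(f"N={window_size}: {freq_hz} Hz per point, {duration_s * 1000} ms per window")
```

Going from 256 to 1024 samples narrows the frequency interval from 100 Hz to 25 Hz, but each column then summarizes 40 ms of audio instead of 10 ms.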


Origin blog.csdn.net/qq_44036439/article/details/126851272