Sound generation: Energy (airflow from the lungs) drives the vocal cords to vibrate, producing a fundamental sound. This sound passes through the vocal tract and interacts with it to generate a resonant sound, and the fundamental sound and the resonant sound are emitted together.
1. Introduction to audio signal
1. Sound waveform diagram
A sensor measures the sound at regular intervals, recording the strength (amplitude) and the direction (sign) of the vibration at each instant. The result is a series of points that changes over time.
2. Sampling frequency
The rate at which the sensor takes measurements is the sampling frequency. Its minimum value is chosen according to the sampling theorem.
Sampling Theorem (Nyquist-Shannon Theorem)
Definition: A band-limited signal can be reconstructed exactly from its samples if the sampling frequency is greater than twice the highest frequency contained in the signal.
Integer period (e.g. the time an object needs to return to its original position after one full rotation): when the sampling period is an integer multiple of this period, no phase change can be detected.
* Wheel rotation example: to see both the direction of rotation and the phase change, the sampling period must be less than 1/2 of the rotation period, i.e. the sampling frequency must be greater than twice the rotation frequency.
➡️➡️ For analog signals: to capture all the characteristics of the signal, the sampling frequency must be greater than 2 times the maximum frequency of the original analog signal, otherwise aliasing will occur.
Aliasing
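Aliasing occurs when a signal is sampled too coarsely, so that shifted copies of it overlap and can no longer be separated. When a signal is sampled in time at less than twice its highest frequency, the periodic copies of its spectrum overlap and high frequencies show up as lower ones. The dual case arises with the discrete Fourier transform (DFT): when the Z-transform of a sequence is sampled in the frequency domain with fewer points than the length of the time-domain sequence, the periodic extensions of the sequence overlap in the time domain.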
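The effect of sampling below twice the signal frequency can be checked numerically. The sketch below is only an illustration: the 7 Hz tone, the two sampling rates and the helper name `dominant_frequency` are assumptions chosen for the example, not values from the notes above.

```python
import numpy as np

def dominant_frequency(sig, fs):
    """Return the frequency (Hz) of the largest peak in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

f_true = 7.0      # frequency of the tone in Hz (illustrative value)
duration = 2.0    # seconds

# Sampled well above 2 * f_true: the tone is found at its true frequency.
fs_ok = 100.0
t_ok = np.arange(int(duration * fs_ok)) / fs_ok
f_seen_ok = dominant_frequency(np.sin(2 * np.pi * f_true * t_ok), fs_ok)

# Sampled below 2 * f_true: the tone folds back to fs - f_true = 3 Hz.
fs_low = 10.0
t_low = np.arange(int(duration * fs_low)) / fs_low
f_seen_low = dominant_frequency(np.sin(2 * np.pi * f_true * t_low), fs_low)

print(f_seen_ok, f_seen_low)
```

With a 10 Hz sampling rate the 7 Hz tone is indistinguishable from a 3 Hz tone, which is the aliasing described above.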
3. Spectrogram
Spectrograms are divided into narrowband and wideband spectrograms.
Narrowband: slow access speed, low transmission rate
Broadband: transmits analog signals by dividing the channel into multiple sub-channels that carry audio, video and digital signals separately; this is called broadband transmission.
Bandwidth: the width of the electromagnetic wave frequency band, that is, the difference between the highest frequency and the lowest frequency of the signal
Time width: the pulse width, that is, the end time of the signal minus the start time of the signal
Time window: a time interval
Narrowband spectrogram
- The bandwidth is small and the time window is long: the narrowband spectrogram is the spectrogram computed with a long analysis window.
- It appears as horizontal lines; "horizontal" reflects the high frequency resolution.
Wideband spectrogram
- The bandwidth is large and the time window is short: the wideband spectrogram is the spectrogram computed with a short analysis window.
- It appears as vertical lines, which separate the parts of the voice that repeat in time; "vertical" reflects the high time resolution.
4. Fundamental frequency (base tone frequency)
- The frequency at which the vocal cords open and close; one cycle of vocal cord vibration is the pitch period.
- On the narrowband spectrogram, it is the horizontal stripe with the lowest frequency. This stripe represents the pitch frequency component at each moment, and its value on the vertical axis is the pitch frequency.
- The other horizontal stripes are the harmonics.
- On a wideband spectrogram, the time between two adjacent vertical bars is the pitch period.
5. Formant
- At a given time, some regions of the harmonics are darker than the horizontal stripes near them; these darker regions are the formants.
2. Voice signal processing
Goal: Find the distribution of each frequency component
Methods: the Fourier transform (computed with the FFT), the wavelet transform, and the fully convolutional time-domain audio separation network Conv-TasNet
Speech Signal Processing Operations
1. Fourier series
Conjecture: any periodic function can be written as a sum of trigonometric functions (sines and cosines).
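As a small numerical illustration of this idea, the sketch below approximates a square wave by adding up sine waves. The square wave and its series, (4/π) Σ sin(kt)/k over odd k, are a standard textbook example chosen here as an assumption; the function name is hypothetical.

```python
import numpy as np

def square_wave_partial_sum(t, n_terms):
    """Sum the first n_terms odd harmonics of a unit square wave with period 2*pi."""
    total = np.zeros_like(t)
    for m in range(n_terms):
        k = 2 * m + 1                                  # odd harmonic number
        total += (4.0 / np.pi) * np.sin(k * t) / k
    return total

t = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
target = np.sign(np.sin(t))                            # ideal square wave

# Mean absolute error of the approximation with few and with many terms.
err_few = np.mean(np.abs(square_wave_partial_sum(t, 3) - target))
err_many = np.mean(np.abs(square_wave_partial_sum(t, 50) - target))
print(err_few, err_many)
```

The error shrinks as more sine waves are added, which is the behaviour the conjecture predicts.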
Euler formula
Definition: For $\theta \in \mathbb{R}$, we have $e^{i\theta} = \cos\theta + i\sin\theta$
Imaginary number $i$: $i \cdot i = -1$
Multiplying 1 by -1 (i.e. $1 \cdot i \cdot i$) rotates the line segment by 180° around the origin of the number axis.
Multiplying 1 by $i$ therefore rotates the line segment by 90° in the plane, which gives the imaginary axis (and with it the complex plane).
Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way
$e^{i\theta}$ is a unit vector on the complex plane with an included angle of $\theta$ to the real axis.
Letting $\theta = t$ and recording the imaginary part (ordinate) of the vector along the time axis t gives $\sin t$.
Recording the real part (abscissa) of the vector along the time axis t gives $\cos t$.
There are two viewing angles: from one you can observe the frequency of rotation, so it is called the frequency domain; from the other you can see the elapsed time, so it is called the time domain.
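Euler's formula and the rotation picture can be checked with a few lines of Python. This is only a numerical sanity check; the angle π/3 is an arbitrary value picked for the example.

```python
import cmath
import math

theta = math.pi / 3                                  # arbitrary angle for the check
lhs = cmath.exp(1j * theta)                          # e^{i*theta}
rhs = complex(math.cos(theta), math.sin(theta))      # cos(theta) + i*sin(theta)

# Multiplying by i rotates a point by 90 degrees; doing it twice gives 180 degrees.
quarter_turn = 1 * 1j
half_turn = 1 * 1j * 1j

print(abs(lhs - rhs) < 1e-12, quarter_turn, half_turn)
```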
The basis (the most basic unit) of is:
After dot product get:
spectrum time spectrum
- Any waveform can be formed by the superposition of countless sine waves, and these sine waves of different frequencies are called frequency components
Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way
- Among them, the first component, the one with the lowest frequency, is the basis (the most basic unit) for constructing the frequency domain, analogous to the basic unit "1" on the rational number axis. The "0" of the frequency domain is a sine wave with an infinite period, that is, a straight line.
- A sine wave is the projection of a circular motion onto a straight line.
Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way
- In the frequency domain, the 0 frequency is called the DC component. In the superposition of a Fourier series it only shifts the whole waveform up or down relative to the axis and does not change the shape of the wave.
- The graph seen along the time direction is called the time-domain image (the final waveform formed by the superposition of the sine waves).
- The graph seen along the frequency direction is called the frequency-domain image, also known as the spectrum or amplitude spectrum (a set of vertical lines, each giving the amplitude of one sine wave).
Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way
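The role of the DC component can be seen by comparing the amplitude spectrum of a wave with and without a constant offset. The two sine waves and the offset of 2 below are made-up values for illustration.

```python
import numpy as np

fs = 1000                                    # sampling rate in Hz (illustrative)
t = np.arange(fs) / fs                       # one second of samples
wave = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
shifted = wave + 2.0                         # same wave with a DC offset of 2

# Amplitude spectrum; bin 0 is the 0 Hz (DC) component.
amp = np.abs(np.fft.rfft(wave)) / len(wave)
amp_shifted = np.abs(np.fft.rfft(shifted)) / len(shifted)

dc_before, dc_after = amp[0], amp_shifted[0]
# Every bin except bin 0 is unchanged by the offset.
same_shape = np.allclose(amp[1:], amp_shifted[1:])
print(dc_before, dc_after, same_shape)
```

Only the 0 Hz bin changes, so the offset moves the waveform up without altering its shape.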
# Imports
import numpy as np
from scipy.io import wavfile
from scipy.fftpack import dct
import matplotlib.pyplot as plt

# Plot the time-domain waveform
def plot_time(sig, fs):
    time = np.arange(0, len(sig)) * (1.0 / fs)
    plt.figure(figsize=(20, 5))
    plt.plot(time, sig)
    plt.xlabel('Time(s)')
    plt.ylabel('Amplitude')
    plt.grid()

# Plot the frequency-domain spectrum
def plot_freq(sig, sample_rate, n_fft=512):
    freqs = np.linspace(0, sample_rate / 2, n_fft // 2 + 1)
    xf = np.fft.rfft(sig, n_fft) / n_fft
    xfp = 20 * np.log10(np.clip(np.abs(xf), 1e-20, 1e100))  # magnitude in dB
    plt.figure(figsize=(20, 5))
    plt.plot(freqs, xfp)
    plt.xlabel('Freq(hz)')
    plt.ylabel('dB')
    plt.grid()

# Plot a 2-D array as a heat map
def plot_spectrogram(spec, ylabel='ylabel'):
    fig = plt.figure(figsize=(20, 5))
    heatmap = plt.pcolor(spec)
    fig.colorbar(mappable=heatmap)
    plt.xlabel('Time(s)')
    plt.ylabel(ylabel)
    plt.tight_layout()
    plt.show()

wav_file = 'filename.wav'  # path of the audio file to read
# fs is the sampling rate of the wav file, sig is its sample data
fs, sig = wavfile.read(wav_file)
sig = sig[0: int(10 * fs)]  # keep the first 10 s of data
plot_time(sig, fs)  # time-domain plot
plot_freq(sig, fs)  # frequency-domain plot
Time-domain plot (source: Blog Garden, yifanhunter)
Frequency-domain plot (source: Blog Garden, yifanhunter)
pre-emphasis
Definition: Boost the high-frequency part of speech.
Purpose:
- Balance the spectrum: high frequencies usually have a smaller amplitude than low frequencies. Boosting them flattens the spectrum over the whole band from low to high frequency, so that the spectrum can be computed with the same signal-to-noise ratio (SNR) throughout.
- Highlight the high-frequency formants.
Pass the speech signal through a high-pass filter:
$y(t) = x(t) - \alpha x(t-1)$
(where the filter coefficient $\alpha$ is typically 0.95 or 0.97)
# Code form
pre_emphasis = 0.97
# emphasized_signal is the new (pre-emphasized) signal
emphasized_signal = np.append(sig[0], sig[1:] - pre_emphasis * sig[:-1])
Effect
Time-domain plot after pre-emphasis (source: Blog Garden, yifanhunter)
Frequency-domain plot after pre-emphasis (source: Blog Garden, yifanhunter)
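Why this difference filter boosts high frequencies can be read off its frequency response. The sketch below evaluates the gain of the filter that subtracts 0.97 times the previous sample, at a few normalized frequencies; the choice of five evaluation points is arbitrary.

```python
import numpy as np

alpha = 0.97                               # pre-emphasis coefficient from the text
omega = np.linspace(0.0, np.pi, 5)         # normalized frequency: 0 (DC) to pi (Nyquist)

# Frequency response of y[n] = x[n] - alpha * x[n-1]: H(w) = 1 - alpha * exp(-j*w)
gain = np.abs(1.0 - alpha * np.exp(-1j * omega))

gain_dc, gain_nyquist = gain[0], gain[-1]
print(gain_dc, gain_nyquist)
```

The gain rises from 1 - alpha at 0 Hz to 1 + alpha at half the sampling rate, so low frequencies are attenuated and high frequencies are amplified.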
filtering
Remove some specific frequency components from a signal.
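One crude way to do this is to transform the signal, set the unwanted frequency bins to zero, and transform back. The sketch below uses this "brick-wall" approach purely for illustration; the function name `remove_band` and the test tones are assumptions, and practical systems usually use designed FIR or IIR filters instead.

```python
import numpy as np

def remove_band(sig, fs, f_low, f_high):
    """Zero out all frequency components between f_low and f_high (Hz)."""
    spectrum = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    spectrum[(freqs >= f_low) & (freqs <= f_high)] = 0.0
    return np.fft.irfft(spectrum, n=len(sig))

fs = 1000
t = np.arange(fs) / fs
low_tone = np.sin(2 * np.pi * 10 * t)            # component to keep
high_tone = 0.5 * np.sin(2 * np.pi * 200 * t)    # component to remove

filtered = remove_band(low_tone + high_tone, fs, 150, 250)
max_residual = np.max(np.abs(filtered - low_tone))
print(max_residual)
```

After filtering, only the 10 Hz tone remains, up to floating-point rounding.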
2. Fourier Transformation
Basic idea: a non-periodic signal can be approximated by the superposition of periodic signals, using trigonometric functions of infinite length as basis functions.
Fourier transform: converts a non-periodic continuous signal in the time domain into a non-periodic continuous signal in the frequency domain (the separate points of the discrete spectrum merge into a continuous curve), which gives the spectrum.
Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way
Discrete spectral frequency domain:
Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way
Continuum frequency domain:
Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way
Framing
Explanation: cutting the voice signal into short segments is called framing, and each segment of the signal is called a "frame".
- That is, the whole time-domain process is split into many short processes of equal length, each of which is approximately stationary (over a short period the signal can be regarded as stationary and can be cut out for an FFT).
Source: Zhihu Wang Yun Maigo
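Frame shift (stride): the step between the starts of consecutive frames, usually between 0 and 1/2 of the frame length, so that adjacent frames overlap and the transition between frames is smooth.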
def framing(frame_len_s, frame_shift_s, fs, sig):
    """
    Split a signal into frames; the main work is computing the sample indices.
    param frame_len_s: frame length, in seconds
    param frame_shift_s: frame shift, in seconds
    param fs: sampling rate, in Hz
    param sig: signal
    return: 2-D array, one row per frame
    """
    sig_n = len(sig)
    frame_len_n, frame_shift_n = int(round(fs * frame_len_s)), int(round(fs * frame_shift_s))
    num_frame = int(np.ceil(float(sig_n - frame_len_n) / frame_shift_n) + 1)
    pad_num = frame_shift_n * (num_frame - 1) + frame_len_n - sig_n  # number of zeros to pad
    pad_zero = np.zeros(int(pad_num))  # zero padding
    pad_sig = np.append(sig, pad_zero)
    # Compute the indices
    # Indices inside one frame
    frame_inner_index = np.arange(0, frame_len_n)
    # Start index of each frame in the padded signal
    frame_index = np.arange(0, num_frame) * frame_shift_n
    # Repeat the inner indices once per frame, along the row direction
    frame_inner_index_extend = np.tile(frame_inner_index, (num_frame, 1))
    # Add a dimension to the start indices so they broadcast in the sum below
    frame_index_extend = np.expand_dims(frame_index, 1)
    # Indices of every sample of every frame: 2-D array, one row per frame
    each_frame_index = frame_inner_index_extend + frame_index_extend
    # np.int was removed from NumPy, so an explicit integer type is used here
    each_frame_index = each_frame_index.astype(np.int64, copy=False)
    frame_sig = pad_sig[each_frame_index]
    return frame_sig

frame_len_s = 0.025
frame_shift_s = 0.01
frame_sig = framing(frame_len_s, frame_shift_s, fs, sig)
Short-Time Fourier Transform (STFT)
After framing, a windowing operation is performed, that is, each frame is multiplied by a "window function".
- The purpose of windowing: let the signal amplitude of a frame taper gradually to 0 at both ends (as shown in Figure 3 below), which makes the peaks in the spectrum narrower and reduces spectral leakage.
- After windowing, the two ends of a frame are weakened, which is why frames are taken with overlap rather than back to back.
- The time difference between the start positions of two adjacent frames is called the frame shift (commonly half of the frame length, or a fixed 10 milliseconds).
Source: Zhihu Wang Yun Maigo
Determine the width of the window function:
- If the window is too narrow, the signal inside it is too short, so the frequency analysis is inaccurate: frequency resolution is poor, but time resolution is high.
- If the window is too wide, the time domain is not resolved finely enough: time resolution is low, but frequency resolution is high.
For time-varying non-stationary signals, small windows suit high frequencies and large windows suit low frequencies.
Source: Jishi Platform
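The reduction of spectral leakage by windowing can be measured on a synthetic frame. In the sketch below the Hamming window, the 1010 Hz tone and the 500 Hz distance threshold are all assumptions picked for the demonstration; the notes above do not name a specific window.

```python
import numpy as np

fs = 16000
frame_len = 400                                   # 25 ms at 16 kHz
n = np.arange(frame_len)
frame = np.sin(2 * np.pi * 1010 * n / fs)         # 1010 Hz does not fall exactly on an FFT bin

def leakage_ratio(x, n_fft=512):
    """Fraction of spectral energy lying more than 500 Hz away from the strongest bin."""
    power = np.abs(np.fft.rfft(x, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    peak = freqs[np.argmax(power)]
    far = np.abs(freqs - peak) > 500
    return power[far].sum() / power.sum()

leak_rect = leakage_ratio(frame)                              # no window (rectangular)
leak_hamming = leakage_ratio(frame * np.hamming(frame_len))   # tapered frame
print(leak_rect, leak_hamming)
```

The tapered frame leaves far less energy away from the true frequency, at the cost of a somewhat wider main peak.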
Perform FFT on the signal of each frame to get the spectrum
Source: Zhihu Wang Yun Maigo
- The horizontal axis is frequency and the vertical axis is amplitude.
- "Fine structure": the small peaks on the blue line. Their spacing along the horizontal axis is the fundamental frequency, which reflects the pitch of the voice.
- The sparser the peaks, the higher the fundamental frequency and the higher the pitch.
- "Envelope": the smooth curve (red line) connecting the tops of these small peaks, which indicates which sound is being made. The peaks of the envelope are called formants; the positions of the formants tell you which sound was made.
Algorithm
- For a signal of shape (1, T), i.e. 1 row and T columns, a set of linearly increasing frequencies is usually chosen, and the signal is assumed to be composed of trigonometric signals at these frequencies.
- The FFT evaluates the Fourier series in the complex domain, transforming the signal from the time domain into the frequency domain. The result is a complex number a+bj for each assumed trigonometric signal. Computing it with the librosa or torchaudio library gives a matrix of values a_i + b_i j, where (a_i, b_i) is the vector representation of the i-th component.
- Geometrically, each complex number is a vector in the complex plane with length $\sqrt{a_i^2 + b_i^2}$ (the magnitude) and angle $\operatorname{atan2}(b_i, a_i)$ (the phase).
- This yields two matrices: the magnitude spectrum (spectrogram) and the phase spectrum.
- The spectrum obtained by Fourier transform is called "linear spectrum".
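The split of each complex coefficient a + bj into magnitude and phase can be sketched with NumPy alone. The test frame below (a 1000 Hz cosine with amplitude 0.8 and phase π/4, sampled at 8000 Hz) is a made-up example chosen so that the tone falls exactly on one FFT bin.

```python
import numpy as np

fs = 8000
n_fft = 512
n = np.arange(n_fft)
# 1000 Hz is exactly bin 64 for these settings (1000 * 512 / 8000).
frame = 0.8 * np.cos(2 * np.pi * 1000 * n / fs + np.pi / 4)

coeffs = np.fft.rfft(frame)          # complex values a + bj, one per frequency bin
magnitude = np.abs(coeffs)           # sqrt(a**2 + b**2)
phase = np.angle(coeffs)             # atan2(b, a)

k = np.argmax(magnitude)
peak_freq = k * fs / n_fft
peak_amplitude = 2 * magnitude[k] / n_fft   # rescale the bin value to the cosine amplitude
peak_phase = phase[k]
print(peak_freq, peak_amplitude, peak_phase)
```

The strongest bin recovers the frequency, amplitude and phase that were used to build the frame.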
n_fft is the number of signal points used for each Fourier transform.
Formulas:
- An STFT of a single frame gives n_fft // 2 + 1 frequency bins (// denotes integer division).
- The number of frames produced by the STFT of a signal depends on the frame shift hop_length and the number of signal samples L; the window length win_length of the framing does not enter:
- Number of time frames N = L // hop_length + 1 (independent of the window length; this holds when the signal is padded so that frames are centered, which is the default behaviour of torch.stft).
e.g. take a signal with a sampling rate of 16000 and use one second of it, i.e. 16000 samples. An STFT with a window length of 512 points (512/16000*1000 = 32 milliseconds) and a frame shift of 256 points (16 milliseconds) gives
16000 // 256 + 1 = 63 frames.
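import torch
signal = torch.rand(16000)
# A Hann window is passed explicitly; without one torch.stft uses a rectangular window
window = torch.hann_window(512)
stft = torch.stft(signal, n_fft=512, hop_length=256, win_length=512, window=window, return_complex=True)
print(stft.shape)  # torch.Size([257, 63])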
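The two counting formulas can also be checked without any audio library. The helper name `stft_shape` below is hypothetical, and the frame formula assumes the centered, padded framing described above.

```python
def stft_shape(num_samples, n_fft, hop_length):
    """Return (frequency bins, time frames) using the two formulas from the text."""
    num_bins = n_fft // 2 + 1
    num_frames = num_samples // hop_length + 1
    return num_bins, num_frames

shape_example = stft_shape(num_samples=16000, n_fft=512, hop_length=256)
print(shape_example)   # (257, 63)
```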
3. Wavelet transform
Time-frequency analysis: determines when each component appears, how the signal frequency changes over time, and the instantaneous frequency and amplitude at each moment.
Limitation of the Fourier transform: it only tells which frequency components a segment of the signal contains, not the moment at which each component appears. ➡️➡️ "For non-stationary processes, the Fourier transform has limitations": two signals that differ greatly in the time domain can have nearly identical spectra.
Wavelet transform idea: replace the infinitely long trigonometric basis of the Fourier transform with a finite-length, decaying wavelet basis.
Source: Jishi Platform
Two variables:
- Scale: controls the dilation and contraction of the wavelet function, corresponding to frequency (vertical axis).
- Translation: controls the shift of the wavelet function, corresponding to time (horizontal axis).
The result is a time-frequency spectrum.
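A hand-rolled sketch of this idea follows. It uses a complex Morlet wavelet (a sine wave under a Gaussian envelope), which is one common choice and an assumption here since the notes do not name a wavelet. The function is parameterized by analysis frequency rather than scale (scale is proportional to 1/frequency), and `n_cycles` is a hypothetical width parameter.

```python
import numpy as np

def morlet_cwt(sig, fs, freqs, n_cycles=6.0):
    """Continuous wavelet transform with a complex Morlet wavelet (illustrative sketch).

    Returns an array of shape (len(freqs), len(sig)) holding the magnitude
    of the response at each analysis frequency and each time sample.
    """
    out = np.zeros((len(freqs), len(sig)))
    for i, f in enumerate(freqs):
        sigma_t = n_cycles / (2.0 * np.pi * f)           # scale: wider wavelet for lower f
        half = int(np.ceil(4.0 * sigma_t * fs))
        tau = np.arange(-half, half + 1) / fs
        wavelet = np.exp(2j * np.pi * f * tau) * np.exp(-tau**2 / (2.0 * sigma_t**2))
        wavelet /= np.sum(np.abs(wavelet))               # normalize so scales are comparable
        # Translation: convolution slides the wavelet along the time axis.
        out[i] = np.abs(np.convolve(sig, wavelet, mode="same"))
    return out

fs = 1000
t = np.arange(2 * fs) / fs
# Non-stationary test signal: 20 Hz in the first second, 80 Hz in the second.
sig = np.where(t < 1.0, np.sin(2 * np.pi * 20 * t), np.sin(2 * np.pi * 80 * t))

scalogram = morlet_cwt(sig, fs, freqs=[20.0, 80.0])
t_peak_20 = t[np.argmax(scalogram[0])]
t_peak_80 = t[np.argmax(scalogram[1])]
print(t_peak_20, t_peak_80)
```

Unlike a single Fourier transform of the whole signal, the result shows that the 20 Hz component lives in the first second and the 80 Hz component in the second.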
For signals with abrupt changes, the Fourier transform shows the Gibbs effect
Fourier transform:
Source: Jishi Platform
For wavelet transform:
Source: Jishi Platform
4. Spectrogram, Mel Spectrum
Spectrogram
For a long speech signal: split it into frames, apply a window, take the Fourier transform of each frame, and stack the per-frame results along another dimension; the resulting image is the spectrogram.
Source: CSDN lvziye00lvziye article
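The frame, window, transform and stack steps can be put together in a few lines of NumPy. This is a minimal sketch under stated assumptions: a Hamming window, 25 ms frames with a 10 ms shift, an incomplete last frame that is dropped rather than zero-padded, and a synthetic 440 Hz tone standing in for speech.

```python
import numpy as np

def spectrogram(sig, fs, frame_len_s=0.025, frame_shift_s=0.010, n_fft=512):
    """Frame, window, FFT, then stack: returns (num_frames, n_fft // 2 + 1) in dB."""
    frame_len = int(round(fs * frame_len_s))
    frame_shift = int(round(fs * frame_shift_s))
    num_frames = 1 + (len(sig) - frame_len) // frame_shift   # drop the incomplete tail
    window = np.hamming(frame_len)
    frames = np.stack([
        sig[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    return 10.0 * np.log10(np.maximum(power, 1e-20))

fs = 16000
t = np.arange(fs) / fs                       # one second of audio
sig = np.sin(2 * np.pi * 440 * t)            # synthetic tone standing in for speech
spec = spectrogram(sig, fs)
peak_bins = np.argmax(spec, axis=1)          # strongest frequency bin in every frame
print(spec.shape, peak_bins[0])
```

Each row is the spectrum of one frame; plotting the transposed array as a heat map gives the usual time versus frequency picture.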
Mel Spectrum
Pass the spectrogram through a bank of Mel-scale filters (Mel filters) to turn it into a Mel spectrum, which gives sound features of a suitable size.
- Frequency is measured in Hz. Converting Hz to the Mel scale makes the human ear's perception of frequency approximately linear.
- Formula (a commonly used form): $m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$
Source: CSDN lvziye00lvziye article
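The Hz to Mel conversion can be sketched as a pair of functions. The constants 2595 and 700 belong to one widely used variant of the Mel formula and are an assumption here; other variants exist.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel using m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10**(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

mel_1k = hz_to_mel(1000.0)
round_trip = mel_to_hz(hz_to_mel(4000.0))
print(mel_1k, round_trip)
```

With these constants 1000 Hz maps to roughly 1000 mel, and equal steps in mel correspond to ever larger steps in Hz at high frequencies.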
5. Fbank and MFCC
Fbank(FilterBank)
A front-end processing algorithm that processes audio in a manner similar to the human ear to improve speech recognition performance.
MFCC
MFCC features can be obtained by performing discrete cosine transform (DCT) on Fbank.
MFCC: Mel Frequency Cepstral Coefficients. In fact, it is to do cepstrum analysis on the Mel spectrum (take the logarithm and do DCT transformation)
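The chain from power spectrum to Fbank to MFCC can be sketched as follows. Everything specific here is an assumption for illustration: 40 triangular filters, 13 coefficients, the 2595/700 Mel formula, random numbers standing in for a real power spectrogram, and a DCT-II written out in NumPy so the example has no dependencies beyond NumPy.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel scale: (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                  # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):                 # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def dct2(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2.0 * n))
    return x @ basis.T

fs, n_fft, n_filters, n_mfcc = 16000, 512, 40, 13
rng = np.random.default_rng(0)
power_spec = rng.random((98, n_fft // 2 + 1))          # stand-in for a real power spectrogram

# Fbank: mel filtering followed by a logarithm. MFCC: DCT of the Fbank features.
fbank_feat = np.log(power_spec @ mel_filterbank(n_filters, n_fft, fs).T + 1e-10)
mfcc_feat = dct2(fbank_feat, n_mfcc)
print(fbank_feat.shape, mfcc_feat.shape)
```

The Fbank features keep one value per Mel filter, while the DCT compresses them into a small number of largely decorrelated coefficients.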
Reference article:
This article is not for any commercial use; it is only a collection of excerpts for self-study. If any part infringes on anyone's rights, please forgive me and contact me to have it removed. Thank you, everyone!
https://www.zhihu.com/question/24490634 -- Zhihu: sampling theorem
https://blog.csdn.net/lzrtutu/article/details/78882715 -- CSDN: spectrogram, fundamental frequency, formant
https://www.zhihu.com/question/19714540/answer/334686351 -- Zhihu, Student Ma: how to understand the FT formula
https://mp.weixin.qq.com/s/CRqhHIlYYRjYJ64PZZnUkQ -- Jishi Platform: Fourier transform, wavelet transform
https://www.cnblogs.com/h2zZhou/p/8405717.html -- Blog Garden, Han Hao: Explaining the Fourier Transform in a simple way
https://www.zhihu.com/question/52093104 -- Zhihu, Wang Yun Maigo: how to understand framing
https://blog.csdn.net/lvziye00lvziye/article/details/100132715 -- CSDN, lvziye00lvziye: spectrogram, Mel spectrogram
https://www.cnblogs.com/yifanrensheng/p/13510742.html -- Blog Garden, Yifan Life: Introduction to Fbank and MFCC