Speech signal processing - Notes (1) Audio signal processing

  Sound generation: Airflow from the lungs makes the vocal cords vibrate, producing a fundamental tone. This tone passes through the vocal tract, which resonates and shapes it, and the fundamental tone and its resonances are radiated together.


1. Introduction to audio signal

1. Sound waveform diagram

A sensor (for example a microphone) measures the sound at regular intervals, recording both the strength and the direction of the vibration. The result is a series of sample points that change over time.

2. Sampling frequency

The rate at which the sensor takes measurements is the sampling frequency. The minimum sampling frequency is chosen according to the sampling theorem.

Sampling Theorem (Nyquist-Shannon Theorem)

Definition: States the minimum sampling rate needed to capture a band-limited signal without loss. (The related Nyquist rate describes the highest transmission rate for a given bandwidth.)

Full period (e.g. the time a rotating object needs to return to its original position): when the sampling period is an integer multiple of the full period, no phase change can be detected.

For a rotating wheel: to see both the direction of rotation and the phase change, the sampling period must be less than 1/2 of the full period, i.e. the sampling frequency must be greater than twice the rotation frequency.

➡️➡️For analog signals: To see all the characteristics of the signal at the same time, the sampling frequency should be greater than 2 times the maximum frequency of the original analog signal, otherwise aliasing will occur.

Aliasing

In sampling, aliasing means that frequency components above half the sampling frequency fold back and become indistinguishable from lower frequencies. The same kind of overlap occurs with the discrete Fourier transform (DFT): when the Z-transform of a signal is sampled in the frequency domain with fewer points than the length of the time-domain sequence, the periodic extensions of the sequence overlap in the time domain.
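Aliasing can be checked numerically. The following is a minimal sketch, assuming NumPy; the 7 Hz tone, the two sampling rates, and the helper name `dominant_frequency` are illustrative choices that do not come from the original notes.

```python
import numpy as np

def dominant_frequency(f_signal, fs, duration=1.0):
    """Sample a sine of f_signal Hz at fs Hz and return the strongest frequency in its spectrum."""
    n = int(round(fs * duration))
    t = np.arange(n) / fs
    x = np.sin(2 * np.pi * f_signal * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

f_good = dominant_frequency(7.0, fs=100.0)    # fs > 2 * 7 Hz: the tone is seen at 7 Hz
f_aliased = dominant_frequency(7.0, fs=10.0)  # fs < 2 * 7 Hz: the tone folds back to 10 - 7 = 3 Hz
print(f_good, f_aliased)
```

Sampled at 10 Hz, the 7 Hz tone produces exactly the same sample values (up to sign) as a 3 Hz tone, so the two cannot be told apart after sampling.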

3. Spectrogram

Spectrograms are divided into narrowband spectrograms and wideband spectrograms. In a spectrogram, the two terms describe the bandwidth of the analysis filter, which is set by the window length. For comparison, the communications meanings of the terms are:

Narrowband (communications): slow access speed, low transmission rate.

Broadband (communications): analog signals are transmitted, and the channel is divided into multiple sub-channels that carry audio, video and digital signals separately. This is called broadband transmission.

Bandwidth: the width of a frequency band, that is, the difference between the highest and the lowest frequency of the signal.

Time width: the pulse width, that is, the end time of the signal minus its start time.

Time window: the time interval over which the signal is analyzed.

Narrowband spectrogram

  • The bandwidth is small and the time window is long. A narrowband spectrogram is a spectrogram computed with a long window.
  • It shows up as horizontal stripes; "horizontal" reflects the high frequency resolution.

Wideband spectrogram

  • The bandwidth is large, the time width is narrow, and the short-time window is short.
  • It shows up as vertical stripes, which separate the parts of the voice that repeat in time; "vertical" reflects the high time resolution.
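The trade-off between the two kinds of spectrogram follows from the window length. A small sketch of the arithmetic, in plain Python; the 40 ms and 5 ms window lengths and the 16 kHz sampling rate are assumed example values, and `window_resolution` is a hypothetical helper name.

```python
def window_resolution(window_len_s, fs):
    """Return (samples per window, spacing of FFT frequency bins in Hz) for one analysis window."""
    n = int(round(window_len_s * fs))
    return n, fs / n

fs = 16000
n_long, df_long = window_resolution(0.040, fs)    # long window  -> narrowband spectrogram
n_short, df_short = window_resolution(0.005, fs)  # short window -> wideband spectrogram
print(n_long, df_long)    # bins 25 Hz apart: harmonics 100 Hz apart show up as separate horizontal stripes
print(n_short, df_short)  # bins 200 Hz apart: harmonics blur, but 5 ms frames resolve single pitch pulses
```

The frequency spacing is fs divided by the number of samples in the window, so making the window eight times shorter makes the frequency bins eight times coarser.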

4. Fundamental frequency (base tone frequency)

  • The frequency at which the vocal cords open and close. One cycle of vocal cord vibration is the pitch period.
  • On a narrowband spectrogram, it is the horizontal stripe with the lowest frequency. Each point along this stripe gives the pitch frequency component at that moment, and the value on the vertical axis at the stripe is the pitch frequency.
  • The other horizontal stripes are the harmonics.
  • On a wideband spectrogram, the time between two vertical bars is the pitch period.
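The pitch can also be estimated numerically instead of being read off a spectrogram. The sketch below uses autocorrelation, which is a different technique from the spectrogram reading described above, and applies it to a synthetic signal with harmonics of 200 Hz. The search range of 50 to 500 Hz, the test signal, and the function name `estimate_f0_autocorr` are assumptions made for illustration.

```python
import numpy as np

def estimate_f0_autocorr(x, fs, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency as fs / (lag of the strongest autocorrelation peak)."""
    x = x - np.mean(x)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # autocorrelation at lags 0..N-1
    lag_min = int(fs / f_max)                           # shortest pitch period considered
    lag_max = int(fs / f_min)                           # longest pitch period considered
    best_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return fs / best_lag

fs = 16000
t = np.arange(int(0.05 * fs)) / fs
# Fundamental at 200 Hz plus its second and third harmonics
x = sum(np.sin(2 * np.pi * 200.0 * k * t) for k in (1, 2, 3))
f0 = estimate_f0_autocorr(x, fs)
print(f0)
```

The best lag is the pitch period in samples, which matches the statement that the pitch period is the reciprocal of the fundamental frequency.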

5. Formant

  • At a given time, some harmonics are darker than the horizontal stripes near them. These darker regions are the formants.

2. Voice signal processing

Goal: Find the distribution of each frequency component.

Methods: Fourier transform (FFT), wavelet transform, and the fully convolutional time-domain audio separation network Conv-TasNet.

Speech Signal Processing Operations 

1. Fourier series

Conjecture: any periodic function can be written as a sum of trigonometric functions.

Euler formula

Definition: For \theta \in R, we have e^{i\theta} = \cos\theta + i\sin\theta

Imaginary number i: i*i=-1

Multiplying 1 by -1 (i.e. 1*i*i) rotates the line segment by 180° around the origin of the number axis.

Multiplying 1 by i rotates the line segment by 90° in the plane, which gives the imaginary axis (and with it the complex plane).

Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way   

e^{i\theta} is a vector in the complex plane that makes an angle \theta with the real axis.

Along the time axis t, recording the imaginary part (ordinate) of the vector e^{it} gives sin(t).

Along the time axis t, recording the real part (abscissa) of the vector e^{it} gives cos(t).

e^{i\omega t} \Leftrightarrow \left\{\begin{matrix} \sin(\omega t)\\ \cos(\omega t)\end{matrix}\right.

The rotation can be viewed from two angles: from one you observe the frequency of rotation, hence the frequency domain; from the other you see the elapsed time, hence the time domain.

f(x)=C+\sum_{n=1}^{\infty }\left(a_{n}\cos(\frac{2\pi n}{T}x)+b_{n}\sin(\frac{2\pi n}{T}x)\right),C\in R

The basis (the most basic units) of f(x) is: \begin{Bmatrix} 1, \cos(\frac{2\pi n}{T}x), \sin(\frac{2\pi n}{T}x) \end{Bmatrix}

Taking the dot product (inner product) of f(x) with each basis function gives, with C=\frac{a_{0}}{2}:

f(x)=\frac{a_{0}}{2}+\sum_{n=1}^{\infty }\left(a_{n}\cos(\frac{2\pi n}{T}x)+b_{n}\sin(\frac{2\pi n}{T}x)\right)

a_{n}=\frac{2}{T}\int_{x_{0}}^{x_{0}+T}f(x)\cdot \cos(\frac{2\pi nx}{T})dx,n\in \begin{Bmatrix} 0 \end{Bmatrix}\cup N

b_{n}=\frac{2}{T}\int_{x_{0}}^{x_{0}+T}f(x)\cdot \sin(\frac{2\pi nx}{T})dx,n\in N
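The coefficient formulas for a_n and b_n can be tried out numerically. The following is a rough sketch, assuming NumPy, that approximates the integrals with a Riemann sum for a square wave of period T = 2; the square wave, x_0 = 0, the number of integration points, and the function names are illustrative assumptions.

```python
import numpy as np

def fourier_coefficients(f, T, n_max, num_points=20000):
    """Approximate a_0..a_{n_max} and b_1..b_{n_max} of a T-periodic function f with a Riemann sum."""
    dx = T / num_points
    x = np.arange(num_points) * dx          # one full period starting at x_0 = 0
    fx = f(x)
    a = [2.0 / T * np.sum(fx * np.cos(2 * np.pi * n * x / T)) * dx for n in range(n_max + 1)]
    b = [2.0 / T * np.sum(fx * np.sin(2 * np.pi * n * x / T)) * dx for n in range(1, n_max + 1)]
    return np.array(a), np.array(b)

def square_wave(x, T=2.0):
    """+1 on the first half of each period, -1 on the second half."""
    return np.where((x % T) < T / 2, 1.0, -1.0)

a, b = fourier_coefficients(square_wave, T=2.0, n_max=5)
print(np.round(a, 3))   # cosine coefficients, all close to 0 for this odd function
print(np.round(b, 3))   # sine coefficients b_1..b_5, close to 4/(n*pi) for odd n and 0 for even n
```

For this square wave only the odd sine terms survive, which is the usual textbook series 4/pi * (sin(pi x) + sin(3 pi x)/3 + ...).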

Spectrum and time spectrum

  • Any waveform can be formed by the superposition of countless sine waves, and these sine waves of different frequencies are called frequency components

Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way   

  • Among them, the component with the lowest nonzero frequency is the basis (the most basic unit) for constructing the frequency domain [analogous to the basic unit "1" of the number axis]. The "0" of the frequency domain is cos(0t), a sine wave with an infinite period, that is, a straight line [analogous to "0" on the number axis].
  • A sine wave is the projection of a circular motion onto a straight line.

Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way

  • In the frequency domain, the 0 frequency is called the DC component. In the superposition of a Fourier series it only shifts the whole waveform up or down relative to the axis and does not change the shape of the wave.
  • The view along the time direction is called the time-domain image [time spectrum] (the final waveform formed by the superposition of the sine waves).
  • The view along the frequency direction is called the frequency-domain image [spectrum/amplitude spectrum] (a set of vertical lines, one per sine wave, whose heights are the amplitudes).

Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way   

# Imports
import numpy as np
from scipy.io import wavfile
from scipy.fftpack import dct
import matplotlib.pyplot as plt

# Plot the time-domain waveform
def plot_time(sig, fs):
    time = np.arange(0, len(sig)) * (1.0 / fs)
    plt.figure(figsize=(20, 5))
    plt.plot(time, sig)
    plt.xlabel('Time(s)')
    plt.ylabel('Amplitude')
    plt.grid()

# Plot the frequency-domain spectrum
def plot_freq(sig, sample_rate, n_fft=512):
    freqs = np.linspace(0, sample_rate / 2, n_fft // 2 + 1)
    # rfft with n=n_fft uses only the first n_fft samples of sig
    xf = np.fft.rfft(sig, n_fft) / n_fft
    xfp = 20 * np.log10(np.clip(np.abs(xf), 1e-20, 1e100))  # magnitude in dB
    plt.figure(figsize=(20, 5))
    plt.plot(freqs, xfp)
    plt.xlabel('Freq(hz)')
    plt.ylabel('dB')  # magnitude
    plt.grid()

# Plot a 2-D array as a heat map
def plot_spectrogram(spec, ylabel='ylabel'):
    fig = plt.figure(figsize=(20, 5))
    heatmap = plt.pcolor(spec)
    fig.colorbar(mappable=heatmap)
    plt.xlabel('Time(s)')
    plt.ylabel(ylabel)
    plt.tight_layout()
    plt.show()

wav_file = 'filename.wav'
fs, sig = wavfile.read(wav_file)
# fs is the sampling rate of the wav file, sig is its sample data,
# and wav_file is the path of the audio file to read
sig = sig[0: int(10 * fs)]  # keep the first 10 s of data

plot_time(sig, fs)  # time-domain plot
plot_freq(sig, fs)  # frequency-domain plot
plt.show()

Time-domain plot. Source: Blog Garden, yifanhunter

Frequency-domain plot. Source: Blog Garden, yifanhunter

pre-emphasis

Definition: Emphasize the high frequency part of speech

Purpose:

  • Balance the spectrum: high frequencies usually have smaller amplitudes than low frequencies. Boosting the high-frequency part flattens the spectrum over the whole band from low to high frequencies, so that the spectrum can be computed with the same signal-to-noise ratio (SNR) across the band.
  • Highlight high-frequency formants

Pass the speech signal through a high-pass filter:

y(t)=x(t)-\alpha x(t-1)

(where the filter coefficient \alpha is typically 0.95 or 0.97)

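# In code form (np and sig as defined in the code above)
pre_emphasis = 0.97
# The first sample is kept as is; every later sample is x[t] - pre_emphasis * x[t-1]
emphasized_signal = np.append(sig[0], sig[1:] - pre_emphasis * sig[:-1])
# emphasized_signal is the new, pre-emphasized signal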

Effect

Time-domain plot. Source: Blog Garden, yifanhunter

Frequency-domain plot. Source: Blog Garden, yifanhunter

filtering

Remove specific frequency components from a signal.
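One simple way to illustrate filtering is to zero the unwanted bins in the frequency domain and transform back. This is a minimal sketch, assuming NumPy; the 5 Hz and 50 Hz tones, the band limits, and the helper name `remove_band` are made up for the example, and practical filters are usually designed differently (a hard cut-off like this is only for illustration).

```python
import numpy as np

def remove_band(x, fs, f_low, f_high):
    """Zero all FFT bins whose frequency lies in [f_low, f_high] and transform back."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs >= f_low) & (freqs <= f_high)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

fs = 1000
t = np.arange(fs) / fs                       # 1 second of samples
wanted = np.sin(2 * np.pi * 5 * t)           # component to keep
hum = 0.5 * np.sin(2 * np.pi * 50 * t)       # component to remove
filtered = remove_band(wanted + hum, fs, 45.0, 55.0)
print(np.max(np.abs(filtered - wanted)))     # residual error after filtering
```

Because both tones fall exactly on FFT bins here, the 50 Hz component is removed almost perfectly; with real signals the result is less clean.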

2. Fourier Transformation

Basic idea: A non-periodic signal can be approximated by the superposition of periodic signals. Trigonometric functions of infinite length are used as the basis functions.

Fourier transform: converts a non-periodic continuous signal in the time domain into a non-periodic continuous function in the frequency domain (the separate points in the frequency domain join into a continuous curve), which gives the spectrum and the time spectrum.

Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way   

Discrete spectrum in the frequency domain:

Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way   

Continuous spectrum in the frequency domain:

Source: Blog Garden - Han Hao - Explaining the Fourier Transform in a simple way   

Framing

Explanation: Cutting the voice signal into short segments is called framing, and each segment is called a "frame".

  • That is, the whole time-domain process is decomposed into many short processes of equal length, each of which is approximately stationary (the signal over a short period can be regarded as stationary, so it can be cut out and passed to the FFT).

 Source: Zhihu Wang Yun Maigo

 

 Source: Zhihu Wang Yun Maigo 

Frame shift (stride): between 0 and 1/2 of the frame length; the overlap between neighboring frames smooths the transition from one frame to the next.

def framing(frame_len_s, frame_shift_s, fs, sig):
    """
    Split a signal into frames; the main work is computing the sample indices.
    param frame_len_s: frame length, in seconds
    param frame_shift_s: frame shift, in seconds
    param fs: sampling rate, in Hz
    param sig: signal
    return: 2-D array in which each row is one frame
    """
    sig_n = len(sig)
    frame_len_n, frame_shift_n = int(round(fs * frame_len_s)), int(round(fs * frame_shift_s))
    num_frame = int(np.ceil(float(sig_n - frame_len_n) / frame_shift_n) + 1)
    pad_num = frame_shift_n * (num_frame - 1) + frame_len_n - sig_n  # number of zeros to append
    pad_zero = np.zeros(int(pad_num))  # zero padding
    pad_sig = np.append(sig, pad_zero)

    # Compute the indices
    # Indices inside one frame
    frame_inner_index = np.arange(0, frame_len_n)

    # Start index of each frame in the padded signal
    frame_index = np.arange(0, num_frame) * frame_shift_n

    # Repeat the in-frame indices once per frame, along the row direction
    frame_inner_index_extend = np.tile(frame_inner_index, (num_frame, 1))

    # Add a dimension to the start indices so that they broadcast in the sum below
    frame_index_extend = np.expand_dims(frame_index, 1)

    # Indices of every frame: a 2-D array in which each row holds the indices of one frame
    each_frame_index = frame_inner_index_extend + frame_index_extend
    # np.int was removed from recent NumPy versions; the built-in int is used instead
    each_frame_index = each_frame_index.astype(int, copy=False)

    frame_sig = pad_sig[each_frame_index]
    return frame_sig


frame_len_s = 0.025
frame_shift_s = 0.01
frame_sig = framing(frame_len_s, frame_shift_s, fs, sig)

 

Short-Time Fourier Transform (STFT)

After framing, each frame is windowed, that is, multiplied by a "window function".

  • The purpose of windowing: let the amplitude of each frame taper gradually to 0 at both ends (as shown in Figure 3 below), which makes the peaks in the spectrum narrower and reduces spectral leakage.
  • After windowing, the two ends of each frame are weakened, so adjacent frames are taken with overlap.
    • The time difference between the start positions of two adjacent frames is called the frame shift (common choices: half of the frame length, or a fixed 10 milliseconds).

Source: Zhihu Wang Yun Maigo  
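The effect of windowing on spectral leakage can be measured on a single frame. The sketch below assumes NumPy and uses a Hann window (`np.hanning`) as one possible window function; the original notes do not say which window is used, and the 1010 Hz test tone, the 2000 Hz threshold, and the helper name `leakage_db` are illustrative assumptions.

```python
import numpy as np

fs = 16000
n = 400                                    # one 25 ms frame at 16 kHz
t = np.arange(n) / fs
frame = np.sin(2 * np.pi * 1010.0 * t)     # tone that does not fall on an FFT bin

window = np.hanning(n)                     # tapers to 0 at both ends of the frame
windowed = frame * window

def leakage_db(x, fs, f_far=2000.0):
    """Largest spectral magnitude above f_far, in dB relative to the spectral peak."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return 20 * np.log10(mag[freqs > f_far].max() / mag.max())

leak_rect = leakage_db(frame, fs)          # no window (rectangular cut)
leak_hann = leakage_db(windowed, fs)       # Hann-windowed frame
print(leak_rect, leak_hann)
```

The windowed frame leaks far less energy into distant frequencies, at the cost of a slightly wider main peak.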

Choosing the width of the window function:

  • If the window is too narrow, the signal inside it is too short for accurate frequency analysis: frequency resolution is poor, but time resolution is high.
  • If the window is too wide, the time axis is not resolved finely enough: time resolution is low, but frequency resolution is high.

For time-varying, non-stationary signals, small windows suit high frequencies and large windows suit low frequencies.

 Source: Jishi Platform

Perform FFT on the signal of each frame to get the spectrum

 Source: Zhihu Wang Yun Maigo  

  • The horizontal axis is frequency and the vertical axis is amplitude.
  • "Fine structure": the small peaks on the blue line. Their spacing along the horizontal axis is the fundamental frequency, which reflects the pitch of the voice.
    • The sparser the peaks, the higher the fundamental frequency and the higher the pitch.
  • "Envelope": the smooth curve (red line) connecting the tops of these small peaks, which represents which sound is made. The peaks on it are called formants (the positions of the formants show which sound is made).

Algorithm

  • For a signal of shape (1, T), i.e. 1 row and T columns, a set of linearly increasing frequencies is usually chosen, and the signal is assumed to be composed of trigonometric signals at these frequencies.
  • The FFT evaluates the Fourier series in the complex domain. The result is a complex number a+bj for each assumed trigonometric signal. Computing it with the librosa or torchaudio library gives a matrix of values a_i + b_i j, where (a_i, b_i) is the vector representation of the i-th signal.
  • The geometric representation in the complex plane is:

  • From these values two matrices are obtained: the magnitude spectrum (spectrogram) and the phase spectrum.
  • The spectrum obtained by the Fourier transform is called the "linear spectrum".
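The step from the complex values a+bj to a magnitude spectrum and a phase spectrum can be shown on one frame. A minimal sketch, assuming NumPy rather than librosa or torchaudio; the 1000 Hz test tone, its amplitude 0.8, and its phase of pi/4 are made-up example values.

```python
import numpy as np

fs = 16000
n_fft = 512
t = np.arange(n_fft) / fs
# 1000 Hz falls exactly on bin 32, because 1000 / (16000 / 512) = 32
frame = 0.8 * np.cos(2 * np.pi * 1000.0 * t + np.pi / 4)

spectrum = np.fft.rfft(frame)      # complex values a + bj, one per frequency bin
magnitude = np.abs(spectrum)       # sqrt(a^2 + b^2): magnitude spectrum
phase = np.angle(spectrum)         # atan2(b, a): phase spectrum

peak_bin = int(np.argmax(magnitude))
peak_freq = peak_bin * fs / n_fft
print(spectrum.shape, peak_bin, peak_freq)
print(magnitude[peak_bin], phase[peak_bin])
```

At the peak bin the magnitude is the tone's amplitude times n_fft / 2, and the phase is the phase offset of the cosine.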

n_fft is the number of signal samples used for each Fourier transform.

Formulas:

  1. Applying the STFT to one frame gives n_fft // 2 + 1 frequency bins (// means integer division).
  2. Number of frames obtained from the STFT of a signal, given the window length winlength, the frame shift hoplength, and the number of signal samples L:
    • Number of time frames N = L // hoplength + 1 (independent of the window length; this assumes the signal is padded at both ends so that frames are centered, which is the default in torch.stft).

e.g. Suppose a signal is sampled at 16000 Hz. Take one second of it, i.e. 16000 samples, and apply an STFT with a window length of 512 samples (512/16000*1000 = 32 milliseconds) and a frame shift of 256 samples (16 milliseconds). This gives

16000 // 256 + 1 = 63 frames.

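import torch
signal = torch.rand(16000)
# An explicit Hann window is passed here; recent PyTorch versions warn when no window is given
stft = torch.stft(signal, n_fft=512, hop_length=256, win_length=512, window=torch.hann_window(512), return_complex=True)
print(stft.shape)  # torch.Size([257, 63])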

3. Wavelet transform 

Time-frequency analysis: finds when each component occurs, how the signal frequency changes over time, and the instantaneous frequency and amplitude at each moment.

Limitation of the Fourier transform: it only tells which frequency components a segment of signal contains, not when each component occurs. ➡️➡️ "For non-stationary processes, the Fourier transform has limitations." "Two signals that differ greatly in the time domain may agree closely in the frequency domain."

Idea of the wavelet transform: replace the infinitely long trigonometric basis of the Fourier transform with a decaying wavelet basis of finite length.

 Source: Jishi Platform 

Two variables:

  • Scale a: controls the dilation (stretching and shrinking) of the wavelet function; corresponds to frequency (vertical axis).
  • Translation \tau: controls the shift of the wavelet function; corresponds to time (horizontal axis).

The result is a time-frequency spectrum.
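How scale and translation give a time-frequency picture can be sketched with a hand-written continuous wavelet transform. This is an illustrative sketch, assuming NumPy and a complex Morlet wavelet with w0 = 6 (one common choice; the notes do not name a specific wavelet). The test signal, the two scales, and the function names are assumptions, and the normalization is simplified.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (normalization constant omitted)."""
    return np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2)

def cwt_row(x, fs, scale, w0=6.0):
    """Wavelet coefficients of x at one scale, for every translation (sample position)."""
    half = int(np.ceil(4 * scale * fs))                # support of about +-4 scale units
    t = np.arange(-half, half + 1) / fs
    kernel = np.conj(morlet(t / scale, w0)) / np.sqrt(scale)
    # Correlating x with the wavelet equals convolving with the time-reversed kernel
    return np.convolve(x, kernel[::-1], mode="same") / fs

fs = 1000
t = np.arange(fs) / fs
# Non-stationary test signal: 20 Hz in the first half, 80 Hz in the second half
x = np.where(t < 0.5, np.sin(2 * np.pi * 20 * t), np.sin(2 * np.pi * 80 * t))

w0 = 6.0
scales = {f: w0 / (2 * np.pi * f) for f in (20.0, 80.0)}   # scale whose center frequency is f
power = {f: np.abs(cwt_row(x, fs, a, w0)) ** 2 for f, a in scales.items()}

first, second = slice(0, 500), slice(500, 1000)
for f in (20.0, 80.0):
    print(f, power[f][first].sum(), power[f][second].sum())
```

The large scale (20 Hz) responds mainly in the first half and the small scale (80 Hz) mainly in the second half, so the transform shows both which frequencies occur and when.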

For signals with abrupt changes, the Fourier transform exhibits the Gibbs effect.

Fourier transform:

 Source: Jishi Platform 

For wavelet transform: 

 Source: Jishi Platform 

4. Spectrogram, Mel Spectrum

Spectrogram

For a long speech signal: split it into frames, apply a window to each frame, take the Fourier transform of each frame, and stack the per-frame results along a second dimension. The resulting image is the spectrogram.

The process of obtaining the spectrogram

Source: CSDN lvziye00lvziye article
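The frame, window, transform, and stack steps can be put together in a few lines. This is a minimal sketch, assuming NumPy, a Hann window, and reflect padding of n_fft // 2 samples at both ends so that the frame count matches N = L // hoplength + 1 from the STFT section; the random input stands in for a real speech signal, and the function name is hypothetical.

```python
import numpy as np

def spectrogram(sig, n_fft=512, hop_length=256):
    """Log-magnitude spectrogram in dB, shape (n_fft // 2 + 1, number of frames)."""
    # Assumption: reflect-pad so that frames are centered on multiples of hop_length
    padded = np.pad(sig, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop_length
    window = np.hanning(n_fft)
    columns = []
    for i in range(n_frames):
        frame = padded[i * hop_length: i * hop_length + n_fft]   # framing
        spectrum = np.fft.rfft(frame * window)                   # windowing, then FFT
        columns.append(np.abs(spectrum))
    mag = np.stack(columns, axis=1)                              # stack the frames along time
    return 20 * np.log10(np.maximum(mag, 1e-10))

rng = np.random.default_rng(0)
sig = rng.standard_normal(16000)   # 1 s of noise at 16 kHz as a stand-in for speech
spec = spectrogram(sig)
print(spec.shape)                  # (257, 63)
```

Each column of the result is the spectrum of one frame, so the horizontal axis is time and the vertical axis is frequency.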

Mel Spectrum

Pass the spectrogram through a Mel scale filter (Mel filter) and turn it into a Mel spectrum to obtain sound features of appropriate size

  • The unit of frequency is Hz. Converting Hz to the Mel frequency makes the human ear's perception of frequency approximately linear.
  • Formula:

mel(f)=2595*log_{10}(1+\frac{f}{700})


Source: CSDN lvziye00lvziye article 
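The Mel formula and its inverse are easy to write down directly. A small sketch, assuming NumPy; the inverse is obtained by solving the formula for f, and the function names `hz_to_mel` and `mel_to_hz` are chosen here for illustration.

```python
import numpy as np

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700)"""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: f = 700 * (10^(mel / 2595) - 1)"""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

mel_1k = hz_to_mel(1000.0)                     # very close to 1000 mel
# Points equally spaced in mel are spaced further and further apart in Hz
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10)
hz_points = mel_to_hz(mel_points)
print(mel_1k)
print(np.round(hz_points, 1))
```

The growing gaps between the Hz values reflect that the ear distinguishes low frequencies more finely than high ones.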

5. Fbank and MFCC

Fbank(FilterBank)

A front-end processing algorithm that processes audio in a manner similar to the human ear to improve speech recognition performance.

MFCC

MFCC features can be obtained by performing discrete cosine transform (DCT) on Fbank.

MFCC: Mel Frequency Cepstral Coefficients. In effect, it is cepstrum analysis applied to the Mel spectrum (take the logarithm, then apply the DCT).
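The chain from power spectrum to Fbank to MFCC can be sketched as follows. This is an illustrative sketch, assuming NumPy, triangular Mel filters, 26 filters, and 13 cepstral coefficients (common choices, not values from the notes); the random frames stand in for real windowed speech frames, the DCT is written out by hand, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1), with centers equally spaced in mel."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)       # rising slope
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)     # falling slope
    return fbank

def dct2(x, n_out):
    """Orthonormal DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    basis *= np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)
    return x @ basis.T

fs, n_fft, n_mels, n_mfcc = 16000, 512, 26, 13
rng = np.random.default_rng(0)
frames = rng.standard_normal((63, n_fft))                       # stand-in for windowed frames
power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
fbank_feat = np.log(power @ mel_filterbank(n_mels, n_fft, fs).T + 1e-10)   # Fbank features
mfcc = dct2(fbank_feat, n_mfcc)                                 # MFCC features
print(fbank_feat.shape, mfcc.shape)
```

Fbank stops after the logarithm of the Mel filter energies; MFCC adds the DCT, which decorrelates the filter outputs and keeps only the first few coefficients.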

Reference articles:

This article is not for any commercial use; it is only a set of excerpts for self-study. If any part infringes on anyone's rights, please forgive me and contact me to have it removed. Thank you, everyone!

https://www.zhihu.com/question/24490634 -- Sampling theorem

https://blog.csdn.net/lzrtutu/article/details/78882715 -- Spectrogram, fundamental frequency, formant

https://www.zhihu.com/question/19714540/answer/334686351 -- Student Ma (how to understand the FT formula)

https://mp.weixin.qq.com/s/CRqhHIlYYRjYJ64PZZnUkQ -- Jishi Platform: Fourier transform, wavelet transform

https://www.cnblogs.com/h2zZhou/p/8405717.html -- Blog Garden, Han Hao: Explaining the Fourier Transform in a simple way

https://www.zhihu.com/question/52093104 -- Zhihu, Wang Yun Maigo: how to understand framing

https://blog.csdn.net/lvziye00lvziye/article/details/100132715 -- Spectrogram, Mel spectrogram

https://www.cnblogs.com/yifanrensheng/p/13510742.html -- Blog Garden, Yifan Life: Introduction to Fbank and MFCC


Origin blog.csdn.net/sinat_56238820/article/details/125656189