Introduction
- Noise suppression technology eliminates background noise, improving the signal-to-noise ratio and intelligibility of speech signals so that both people and machines can hear more clearly
- Common types of noise: human voice noise, street noise, car noise
- Classification of noise suppression methods:
  - By the number of input channels: single-channel noise reduction, multi-channel noise reduction
  - By the statistical characteristics of the noise: stationary noise suppression, non-stationary noise suppression
  - By the noise reduction principle: passive noise reduction, active noise reduction
- The method described below is used for single-channel, passive, stationary noise suppression
Minima Controlled Recursive Averaging (MCRA)
-
Traditional signal-processing noise reduction methods can be divided into two major steps:
- One is the estimation of noise (Noise Estimation/Tracking)
- The second is the estimation of the gain factor
-
Common methods for noise estimation:
- Recursive Averaging: As long as the probability of speech in a certain frequency band is low, the noise spectrum can be estimated/updated using this frequency band
- Minima Controlled/Tracking: due to the sparsity of speech signals, even when speech is present, the minimum value of each frequency band within a short time window (0.5 s–1.5 s) approaches the noise power with high probability, so the noise estimate of each band can be obtained by tracking the minimum within a short window
- Histogram-based Method: Histogram statistics are performed in a short time window for each frequency band of the noisy speech signal, and the value with the highest frequency corresponds to the noise level of this frequency band
-
MCRA steps:
-
Consider a data model with additive noise: $y(n)=x(n)+d(n)$
-
The input signal goes through the STFT to obtain the time–frequency spectrogram:
$$Y(k,l)=\sum_{n=0}^{N-1}y(n+lM)h(n)e^{-j\frac{2\pi}{N}nk}$$
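The analysis above can be sketched in NumPy. The frame length N, shift M, and the Hann window are illustrative choices, not values prescribed by the text:

```python
import numpy as np

def stft(y, N=512, M=256):
    """Y[k, l] = sum_n y[n + l*M] * h[n] * exp(-j*2*pi*n*k/N).

    N: frame length (also the FFT size), M: frame shift,
    h: analysis window (Hann here, an illustrative choice).
    """
    h = np.hanning(N)
    num_frames = (len(y) - N) // M + 1
    frames = np.stack([y[l * M : l * M + N] * h for l in range(num_frames)])
    # rfft keeps the non-redundant bins k = 0 .. N/2 of a real signal
    return np.fft.rfft(frames, n=N, axis=1).T  # shape: (N//2 + 1, num_frames)

# Example: 1 s of synthetic noisy input at 16 kHz
rng = np.random.default_rng(0)
y = rng.standard_normal(16000)
Y = stft(y)
print(Y.shape)  # (257, 61)
```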
-
where,
- k is the index of the frequency bin
- l is the index of the time frame
- h(n) is the window function
- M is the frame shift
-
Given two hypotheses $H_0(k,l)$ and $H_1(k,l)$, which respectively represent "speech absent" and "speech present":
$$\begin{aligned} H_0(k,l):\ &Y(k,l)=D(k,l) \\ H_1(k,l):\ &Y(k,l)=X(k,l)+D(k,l) \end{aligned}$$
The noise estimate is defined as $\lambda_d(k,l)=E[|D(k,l)|^2]$, i.e., the power of the noise
-
Using temporal recursive smoothing, $\hat{\lambda}_d(k,l)$ is updated only when speech is absent:
$$\begin{aligned} H'_0(k,l):\ &\hat{\lambda}_d(k,l+1)=\alpha_d\hat{\lambda}_d(k,l)+(1-\alpha_d)|Y(k,l)|^2 \\ H'_1(k,l):\ &\hat{\lambda}_d(k,l+1)=\hat{\lambda}_d(k,l) \end{aligned}$$
where $\alpha_d$ ($0<\alpha_d<1$) is the smoothing factor
Speech presence probability: $p'(k,l)=P(H'_1(k,l)\mid Y(k,l))$
-
Weighting the two updates by the speech presence probability, the noise estimate can be written as:
$$\begin{aligned} \hat{\lambda}_d(k,l+1)&=\hat{\lambda}_d(k,l)p'(k,l)+\left[\alpha_d\hat{\lambda}_d(k,l)+(1-\alpha_d)|Y(k,l)|^2\right]\left(1-p'(k,l)\right) \\ &=\hat{\alpha}_d(k,l)\hat{\lambda}_d(k,l)+\left[1-\hat{\alpha}_d(k,l)\right]|Y(k,l)|^2 \end{aligned}$$
where $\hat{\alpha}_d(k,l)=\alpha_d+(1-\alpha_d)p'(k,l)$
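A single step of this probability-weighted recursion, as a minimal sketch (the function name and scalar inputs are illustrative):

```python
def update_noise(lambda_d, Y_mag2, p, alpha_d=0.95):
    """One probability-weighted noise-PSD update for a frequency bin.

    lambda_d: current noise estimate lambda_d(k, l)
    Y_mag2:   observed power |Y(k, l)|^2
    p:        speech presence probability p'(k, l)
    alpha_d:  smoothing factor (0.95 is an assumed, typical value)
    """
    alpha_tilde = alpha_d + (1 - alpha_d) * p  # time-varying smoothing factor
    return alpha_tilde * lambda_d + (1 - alpha_tilde) * Y_mag2

# p = 1 (speech surely present): the estimate is frozen
print(update_noise(2.0, 10.0, 1.0))  # 2.0
# p = 0 (speech surely absent): plain recursive averaging, 0.95*2 + 0.05*10
print(update_noise(2.0, 10.0, 0.0))  # 2.4
```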
The question now is: how to determine $p'(k,l)$
-
The idea for computing $p'(k,l)$: within a short time window, compute the ratio of the local energy $S(k,l)$ to the minimum energy $S_{min}(k,l)$
-
Calculation of local energy:
- Frequency-domain smoothing: $S_f(k,l)=\sum_{i=-w}^{w}b(i)|Y(k-i,l)|^2$
- Temporal smoothing: $S(k,l)=\alpha_s S(k,l-1)+(1-\alpha_s)S_f(k,l)$
- $b(i)$ is the frequency-domain window function; $\alpha_s$ ($0<\alpha_s<1$) is the time-domain smoothing factor of the local energy
-
Calculation of minimum energy:
- Conventional practice: use the local minimum search method, set a time window L (L is usually 1s), and search for the local minimum
- Simplified approach: maintain a running minimum $S_{min}(k,l)=\min\{S_{min}(k,l-1),S(k,l)\}$ together with a secondary minimum $S_{tmp}(k,l)=\min\{S_{tmp}(k,l-1),S(k,l)\}$; every $L$ frames, set $S_{min}(k,l)=\min\{S_{tmp}(k,l-1),S(k,l)\}$ and reset $S_{tmp}(k,l)=S(k,l)$, which bounds the tracking delay without storing the whole window
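The simplified minimum tracking can be sketched as one update per frame (function name is illustrative; L = 64 frames is an assumed window length):

```python
import numpy as np

def track_minimum(S, S_min, S_tmp, frame_idx, L=64):
    """One step of the simplified running-minimum tracker.

    S: smoothed local energy of the current frame (per bin)
    S_min, S_tmp: running and secondary minima
    Every L frames the window "switches": S_min restarts from S_tmp,
    so the tracked minimum lags a rising noise floor by at most ~2L frames.
    """
    S_min = np.minimum(S_min, S)
    S_tmp = np.minimum(S_tmp, S)
    if (frame_idx + 1) % L == 0:        # window switch
        S_min = np.minimum(S_tmp, S)
        S_tmp = S.copy()
    return S_min, S_tmp

# The minimum follows a drop immediately and, after the noise floor
# rises, recovers once two window switches have passed (L = 2 here)
S_min = S_tmp = np.array([5.0])
for l, s in enumerate([4.0, 1.0, 6.0, 6.0, 6.0, 6.0]):
    S_min, S_tmp = track_minimum(np.array([s]), S_min, S_tmp, l, L=2)
print(S_min)  # [6.]
```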
-
Calculate the ratio: $S_r(k,l)=\frac{S(k,l)}{S_{min}(k,l)}$
-
The discriminant for the presence of speech is:
$$I(k,l)=\begin{cases}1, & S_r(k,l)>\delta \\ 0, & \text{otherwise}\end{cases}$$
Iterative estimation of speech presence probability:
- $\hat{p}'(k,l)=\alpha_p\hat{p}'(k,l-1)+(1-\alpha_p)I(k,l)$
- $\alpha_p$ ($0<\alpha_p<1$) is the smoothing factor of the speech presence probability
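The ratio test and the probability smoothing combine into one per-bin step; delta = 5 and alpha_p = 0.2 are commonly quoted defaults, assumed here rather than fixed by the text:

```python
def update_speech_presence(S, S_min, p_prev, delta=5.0, alpha_p=0.2):
    """Ratio test followed by recursive smoothing of p'(k, l).

    S, S_min: local and minimum energy of one frequency bin
    delta:    decision threshold of the ratio test (assumed default)
    alpha_p:  smoothing factor of the presence probability (assumed default)
    """
    I = 1.0 if S / S_min > delta else 0.0          # indicator I(k, l)
    return alpha_p * p_prev + (1 - alpha_p) * I    # smoothed p'(k, l)

# Ratio 6 > 5: indicator fires, probability jumps toward 1
print(update_speech_presence(S=60.0, S_min=10.0, p_prev=0.0))  # 0.8
# Ratio 2 < 5: indicator is 0, probability decays toward 0
print(update_speech_presence(S=20.0, S_min=10.0, p_prev=0.8))
```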
-
MCRA noise estimation process:
-
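Putting the pieces together, the full noise-PSD tracking process can be sketched end to end in NumPy. The parameter values (alpha_d = 0.95, alpha_s = 0.8, alpha_p = 0.2, delta = 5, w = 1, L = 64) are commonly quoted MCRA settings, assumed here rather than taken from this text:

```python
import numpy as np

def mcra_noise_psd(Y_mag2, alpha_d=0.95, alpha_s=0.8, alpha_p=0.2,
                   delta=5.0, w=1, L=64):
    """Track the noise PSD of a power spectrogram with MCRA.

    Y_mag2: |Y(k, l)|^2, shape (num_bins, num_frames).
    Parameter defaults are commonly quoted settings; treat them as
    starting points, not a definitive configuration.
    """
    K, T = Y_mag2.shape
    b = np.hanning(2 * w + 3)[1:-1]          # frequency window b(i), 2w+1 taps
    b /= b.sum()
    S = np.convolve(Y_mag2[:, 0], b, mode="same")
    S_min = S.copy()
    S_tmp = S.copy()
    p = np.zeros(K)
    lam = Y_mag2[:, 0].copy()                # init noise from the first frame
    lam_track = np.empty_like(Y_mag2)
    for l in range(T):
        Sf = np.convolve(Y_mag2[:, l], b, mode="same")   # frequency smoothing
        S = alpha_s * S + (1 - alpha_s) * Sf             # temporal smoothing
        S_min = np.minimum(S_min, S)                     # running minimum
        S_tmp = np.minimum(S_tmp, S)
        if (l + 1) % L == 0:                             # window switch
            S_min = np.minimum(S_tmp, S)
            S_tmp = S.copy()
        I = (S / np.maximum(S_min, 1e-12) > delta).astype(float)
        p = alpha_p * p + (1 - alpha_p) * I              # presence probability
        alpha_tilde = alpha_d + (1 - alpha_d) * p        # gated smoothing
        lam = alpha_tilde * lam + (1 - alpha_tilde) * Y_mag2[:, l]
        lam_track[:, l] = lam
    return lam_track

# Stationary white noise: the estimate should approach the true power (1.0)
rng = np.random.default_rng(0)
noise = rng.standard_normal((257, 500)) ** 2
lam = mcra_noise_psd(noise)
print(lam[:, -1].mean())  # close to 1
```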
MCRA Reference Parameters
-
After estimating the noise, multiply the noisy speech spectrum by a gain factor to perform noise suppression:
$$\hat{X}(k,l)=G(k,l)Y(k,l) \quad \text{or} \quad |\hat{X}(k)|^2=G^2(k)|Y(k)|^2$$
To determine the gain factor:
- spectral subtraction
- Wiener filtering
- MMSE
-
Spectral subtraction: assuming the noise is stationary or slowly varying, subtract the noise spectrum from the noisy speech spectrum
- Specifically:
$$\begin{aligned} G(k)&=\sqrt{\frac{|\hat{X}(k)|^2}{|Y(k)|^2}} \\ &=\sqrt{\frac{|Y(k)|^2-|\hat{D}(k)|^2}{|Y(k)|^2}}=\sqrt{\frac{|Y(k)|^2-\lambda_d(k)}{|Y(k)|^2}} \\ &=\sqrt{1-\frac{\lambda_d(k)}{|Y(k)|^2}}=\sqrt{1-\frac{1}{\gamma(k)}} \end{aligned}$$
- where $\gamma(k)=\frac{|Y(k)|^2}{\lambda_d(k)}$ is known as the a posteriori SNR
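The spectral-subtraction gain is a one-liner; the spectral floor below is a common practical safeguard against negative values under the square root, not part of the formula itself:

```python
import numpy as np

def spectral_subtraction_gain(Y_mag2, lambda_d, floor=1e-3):
    """G(k) = sqrt(max(1 - lambda_d(k) / |Y(k)|^2, floor)).

    The floor guards against negative arguments when the noise
    estimate momentarily exceeds the observed power.
    """
    gamma = Y_mag2 / np.maximum(lambda_d, 1e-12)   # a posteriori SNR
    return np.sqrt(np.maximum(1.0 - 1.0 / gamma, floor))

# gamma = 4, so G = sqrt(1 - 1/4) = sqrt(0.75)
print(spectral_subtraction_gain(np.array([4.0]), np.array([1.0])))  # ~0.866
```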
Frequency-domain Wiener filtering: minimizes the mean square error between the estimated clean-speech spectrum and the true spectrum
- Frequency-domain estimation error: $E(k)=X(k)-\hat{X}(k)=X(k)-G(k)Y(k)$
- Objective function: $J=E[|E(k)|^2]$
- By minimizing the objective function, the expression of the gain factor can be obtained:
$$G(k)=\frac{\lambda_x(k)}{\lambda_x(k)+\lambda_d(k)}=\frac{\xi(k)}{\xi(k)+1}$$
- where $\lambda_x(k)=E[|X(k)|^2]$ and $\xi(k)=\frac{\lambda_x(k)}{\lambda_d(k)}$ is the a priori SNR
- A posteriori SNR: $\gamma(k)=\frac{|Y(k)|^2}{\lambda_d(k)}$
- A priori SNR: $\xi(k)=\frac{\lambda_x(k)}{\lambda_d(k)}$
- Estimating the a priori SNR from the a posteriori SNR: the Decision-Directed (DD) approach
$$\xi(k,l)=\alpha_{DD}\frac{|\hat{X}(k,l-1)|^2}{\lambda_d(k,l-1)}+(1-\alpha_{DD})\max\{\gamma(k,l)-1,\ 0\}$$
- where the empirical value of $\alpha_{DD}$ is 0.95~0.98, and $\lambda_d(k,l-1)$ may be replaced by $\lambda_d(k,l)$
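The Wiener gain with DD estimation of the a priori SNR can be sketched per frame; the function name is illustrative and alpha_dd = 0.98 lies in the empirical range above:

```python
import numpy as np

def wiener_gain_dd(Y_mag2, lambda_d, X_prev_mag2, alpha_dd=0.98):
    """Wiener gain with Decision-Directed a priori SNR estimation.

    Y_mag2:      |Y(k, l)|^2 of the current frame
    lambda_d:    noise PSD estimate
    X_prev_mag2: |X_hat(k, l-1)|^2 from the previous frame
    """
    gamma = Y_mag2 / np.maximum(lambda_d, 1e-12)           # a posteriori SNR
    xi = alpha_dd * X_prev_mag2 / np.maximum(lambda_d, 1e-12) \
         + (1 - alpha_dd) * np.maximum(gamma - 1.0, 0.0)   # DD estimate of xi
    return xi / (xi + 1.0)                                 # Wiener gain

# xi = 0.98*4 + 0.02*(10-1) = 4.1, so G = 4.1/5.1
G = wiener_gain_dd(np.array([10.0]), np.array([1.0]), np.array([4.0]))
print(G)  # ~0.804
```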
-
Gain factor for MMSE:
-
The derivation of MMSE can be found in: Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109–1121, 1984.