Speech Signal Processing - Noise Suppression

Introduction

  • Noise suppression technology is used to eliminate background noise, improving the signal-to-noise ratio and intelligibility of speech signals so that both people and machines can hear more clearly
  • Common types of noise: babble noise (background voices), street noise, car noise
  • Classification of noise suppression methods:
    • According to the number of input channels: single-channel noise reduction, multi-channel noise reduction
    • According to the statistical characteristics of noise: stationary noise suppression, non-stationary noise suppression
    • According to the noise reduction method: passive noise reduction, active noise reduction
  • The method described below is used for single-channel, passive, stationary noise suppression

Minima Controlled Recursive Averaging (MCRA)

  • Traditional signal-processing approaches to noise reduction can be divided into two major steps:

    • The first is noise estimation (noise estimation/tracking)
    • The second is estimation of the gain factor
  • Common methods for noise estimation:

    • Recursive Averaging: whenever the probability of speech in a frequency band is low, that band can be used to estimate/update the noise spectrum
    • Minima Controlled/Tracking: because speech signals are sparse, even when speech is present, the minimum value of each frequency band within a short time window (0.5 s to 1.5 s) approaches the noise power with high probability, so a noise estimate for each band can be obtained by tracking the minimum over a short window (a minimal illustration follows this list)
    • Histogram-based Method: histogram statistics are collected over a short time window for each frequency band of the noisy speech signal; the most frequent value corresponds to the noise level of that band
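A minimal numpy illustration of the minima-tracking idea (not the full MCRA recursion): the synthetic band energy, window length, and smoothing factor below are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import minimum_filter1d

rng = np.random.default_rng(0)
n_frames = 500                 # e.g. 500 frames at a 10 ms hop = 5 s
noise_power = 0.1

# Synthetic per-frame energy of one frequency band: an exponential noise
# floor plus sporadic bursts of "speech" energy.
energy = rng.exponential(noise_power, n_frames)
energy[100:160] += 2.0
energy[300:380] += 3.0

# Recursively smooth the energy (as MCRA does), then track the minimum
# over a ~1 s window (100 frames at a 10 ms hop).
alpha_s = 0.8
S = np.empty(n_frames)
S[0] = energy[0]
for l in range(1, n_frames):
    S[l] = alpha_s * S[l - 1] + (1 - alpha_s) * energy[l]

S_min = minimum_filter1d(S, size=100, mode='nearest')
# S_min stays near the noise floor even during the speech bursts.
print(S_min[120], S_min[340])
```

Note that the tracked minimum is biased slightly below the true noise floor; MCRA compensates for this with the threshold $\delta$ introduced later.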
  • MCRA steps:

  • Consider a data model with additive noise: $y(n)=x(n)+d(n)$

  • The input signal goes through the STFT (a numpy sketch follows the symbol list below): $Y(k,l)=\sum_{n=0}^{N-1} y(n+lM)\,h(n)\,e^{-j\frac{2\pi}{N}nk}$, which yields the time-frequency spectrogram

  • where:

    • k is the index of the frequency bin
    • l is the index of the time frame
    • h(n) is the window function
    • M is the frame shift
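A direct numpy transcription of the analysis equation above; the frame length N = 512, hop M = 256, and the Hann window are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

def stft(y, N=512, M=256):
    """Compute Y(k, l) exactly as in the analysis equation above."""
    h = np.hanning(N)                       # analysis window h(n)
    n_frames = (len(y) - N) // M + 1
    Y = np.empty((N, n_frames), dtype=complex)
    for l in range(n_frames):
        frame = y[l * M : l * M + N] * h    # y(n + lM) h(n)
        Y[:, l] = np.fft.fft(frame)         # sum_n (...) e^{-j 2 pi n k / N}
    return Y

y = np.random.randn(16000)                  # 1 s of noise at 16 kHz
Y = stft(y)
print(Y.shape)                               # (512, 61)
```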
  • Given two hypotheses $H_0(k,l)$ and $H_1(k,l)$, which represent "speech absent" and "speech present" respectively:
    $$\begin{aligned} H_0(k,l):&\quad Y(k,l)=D(k,l) \\ H_1(k,l):&\quad Y(k,l)=X(k,l)+D(k,l) \end{aligned}$$

  • The noise estimate is defined as $\lambda_d(k,l)=E[|D(k,l)|^2]$, i.e., the noise power

  • Using temporal recursive smoothing, $\hat{\lambda}_d(k,l)$ is updated only when speech is absent:
    $$\begin{aligned} H'_0(k,l):&\quad \hat{\lambda}_d(k,l+1)=\alpha_d \hat{\lambda}_d(k,l)+(1-\alpha_d)|Y(k,l)|^2 \\ H'_1(k,l):&\quad \hat{\lambda}_d(k,l+1)=\hat{\lambda}_d(k,l) \end{aligned}$$
    where $\alpha_d\ (0<\alpha_d<1)$ is the smoothing factor

  • Speech Presence Probability: $p'(k,l)=P(H'_1(k,l)\mid Y(k,l))$

  • Weighting the two update rules by the speech presence probability gives a single soft-decision update:
    $$\begin{aligned} \hat{\lambda}_d(k,l+1)&=\hat{\lambda}_d(k,l)\,p'(k,l)+\left[\alpha_d \hat{\lambda}_d(k,l)+(1-\alpha_d)|Y(k,l)|^2\right]\bigl(1-p'(k,l)\bigr) \\ &=\hat{\alpha}_d(k,l)\,\hat{\lambda}_d(k,l)+\bigl[1-\hat{\alpha}_d(k,l)\bigr]|Y(k,l)|^2 \end{aligned}$$
    where $\hat{\alpha}_d(k,l)=\alpha_d+(1-\alpha_d)\,p'(k,l)$
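As a quick worked example (with an illustrative $\alpha_d=0.95$): if $p'(k,l)=0.8$, then $\hat{\alpha}_d=0.95+0.05\times 0.8=0.99$, so the noise estimate barely moves when speech is likely present; if $p'(k,l)=0$, the update reduces to the plain $H'_0$ recursion.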

  • The question now is how to determine $p'(k,l)$

  • The idea for computing $p'(k,l)$: within a short time window, compute the ratio of the local energy $S(k,l)$ to the minimum energy $S_{min}(k,l)$

  • Calculation of local energy:

    • Frequency smoothing: $S_f(k,l)=\sum_{i=-w}^{w} b(i)\,|Y(k-i,l)|^2$
    • Temporal smoothing: $S(k,l)=\alpha_s S(k,l-1)+(1-\alpha_s)S_f(k,l)$
    • $b(i)$ is a frequency-domain window function, and $\alpha_s\ (0<\alpha_s<1)$ is the time-domain smoothing factor of the local energy
  • Calculation of minimum energy:

    • Conventional practice: local-minimum search over a time window of L frames (L typically corresponds to about 1 s)
    • Simplified approach (the running-minimum scheme from the original MCRA paper): keep a running minimum and a temporary minimum,
      $$S_{min}(k,l)=\min\{S_{min}(k,l-1),\,S(k,l)\},\qquad S_{tmp}(k,l)=\min\{S_{tmp}(k,l-1),\,S(k,l)\}$$
      and whenever L frames have elapsed, reset: $S_{min}(k,l)=\min\{S_{tmp}(k,l-1),\,S(k,l)\}$ and $S_{tmp}(k,l)=S(k,l)$
  • Compute the ratio: $S_r(k,l)=\frac{S(k,l)}{S_{min}(k,l)}$

  • The discriminant for the presence of speech is:
    $$I(k,l)=\begin{cases} 1, & S_r(k,l)>\delta \\ 0, & \text{otherwise} \end{cases}$$

  • Iterative estimation of speech presence probability:

    • $\hat{p}'(k,l)=\alpha_p\,\hat{p}'(k,l-1)+(1-\alpha_p)\,I(k,l)$
    • $\alpha_p\ (0<\alpha_p<1)$ is the smoothing factor of the speech presence probability
  • MCRA noise estimation process:
    (figure: MCRA noise-estimation flow; a Python sketch of the full recursion follows)
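Putting the pieces together, a compact Python sketch of the MCRA recursion, under stated assumptions: the parameter values ($\alpha_d$, $\alpha_s$, $\alpha_p$, $\delta$, reset interval L) are common literature choices rather than values fixed by this text, and `b` is a normalized 3-point frequency-smoothing window ($w=1$).

```python
import numpy as np

def mcra_noise_estimate(P, alpha_d=0.95, alpha_s=0.8, alpha_p=0.2,
                        delta=5.0, L=100, b=(0.25, 0.5, 0.25)):
    """MCRA noise estimation.

    P : (n_bins, n_frames) array of noisy power spectra |Y(k, l)|^2.
    Returns lambda_d with the same shape.
    Parameter defaults are common literature choices (assumptions).
    """
    n_bins, n_frames = P.shape
    lambda_d = np.empty_like(P)
    lambda_d[:, 0] = P[:, 0]                    # initialize with first frame
    S = np.convolve(P[:, 0], b, mode='same')    # smoothed local energy
    S_min = S.copy()
    S_tmp = S.copy()
    p = np.zeros(n_bins)                        # speech presence probability

    for l in range(1, n_frames):
        # Local energy: frequency smoothing, then temporal smoothing.
        S_f = np.convolve(P[:, l], b, mode='same')
        S = alpha_s * S + (1 - alpha_s) * S_f

        # Simplified minimum tracking, reset every L frames.
        if l % L == 0:
            S_min = np.minimum(S_tmp, S)
            S_tmp = S.copy()
        else:
            S_min = np.minimum(S_min, S)
            S_tmp = np.minimum(S_tmp, S)

        # Speech-presence indicator and smoothed probability.
        I = (S / np.maximum(S_min, 1e-12)) > delta
        p = alpha_p * p + (1 - alpha_p) * I

        # Probability-weighted (soft-decision) recursive averaging.
        alpha_hat = alpha_d + (1 - alpha_d) * p
        lambda_d[:, l] = alpha_hat * lambda_d[:, l - 1] \
                         + (1 - alpha_hat) * P[:, l]
    return lambda_d
```

With `Y` from the STFT sketch above, `lambda_d = mcra_noise_estimate(np.abs(Y)**2)` gives a per-bin, per-frame noise power estimate.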

  • MCRA reference parameters (table omitted; typical values reported in the literature are $\alpha_d=0.95$, $\alpha_s=0.8$, $\alpha_p=0.2$, $\delta=5$, with a minimum-search window of roughly 1 s)

  • After estimating the noise, multiply the noisy speech by the gain factor to perform noise suppression
    $$\hat{X}(k,l)=G(k,l)\,Y(k,l) \qquad \text{or} \qquad |\hat{X}(k)|^2=G^2(k)\,|Y(k)|^2$$

  • To determine the gain factor:

    • Spectral subtraction
    • Wiener filtering
    • MMSE
  • Spectral subtraction: assuming the noise is stationary or slowly varying, subtract the noise spectrum from the noisy speech spectrum (a short sketch follows this derivation)

    • Specifically:
      $$\begin{aligned} G(k)&=\sqrt{\frac{|\hat{X}(k)|^2}{|Y(k)|^2}} \\ &=\sqrt{\frac{|Y(k)|^2-|\hat{D}(k)|^2}{|Y(k)|^2}}=\sqrt{\frac{|Y(k)|^2-\lambda_d(k)}{|Y(k)|^2}} \\ &=\sqrt{1-\frac{\lambda_d(k)}{|Y(k)|^2}}=\sqrt{1-\frac{1}{\gamma(k)}} \end{aligned}$$
    • where $\gamma(k)=\frac{|Y(k)|^2}{\lambda_d(k)}$ is known as the posterior SNR
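A minimal sketch of this gain; the gain floor `g_min` is an illustrative assumption (flooring at zero instead tends to produce musical noise).

```python
import numpy as np

def spectral_subtraction_gain(P_y, lambda_d, g_min=0.1):
    """Power spectral subtraction gain G(k) = sqrt(1 - 1/gamma(k)).

    P_y      : noisy power spectrum |Y(k)|^2
    lambda_d : noise power estimate (e.g. from the MCRA sketch above)
    g_min    : gain floor (assumption) to keep the square root real
    """
    gamma = P_y / np.maximum(lambda_d, 1e-12)   # posterior SNR
    G = np.sqrt(np.maximum(1.0 - 1.0 / gamma, g_min**2))
    return G

# The enhanced spectrum is then X_hat = G * Y, applied frame by frame.
```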
  • Frequency-domain Wiener filtering: minimize the mean square error between the estimated clean-speech spectrum and the true spectrum

    • Frequency-domain estimation error: $E(k)=X(k)-\hat{X}(k)=X(k)-G(k)Y(k)$
    • Objective function: $J=E[|E(k)|^2]$
    • By minimizing the objective function, the expression of the gain factor can be obtained:
      $$G(k)=\frac{\lambda_x(k)}{\lambda_x(k)+\lambda_d(k)}=\frac{\xi(k)}{\xi(k)+1}$$
    • where $\lambda_x(k)=E[|X(k)|^2]$ and $\xi(k)=\frac{\lambda_x(k)}{\lambda_d(k)}$ is the prior SNR
    • Posterior SNR: $\gamma(k)=\frac{|Y(k)|^2}{\lambda_d(k)}$
    • Prior SNR: $\xi(k)=\frac{\lambda_x(k)}{\lambda_d(k)}$
    • Estimating the prior SNR from the posterior SNR: the Decision-Directed (DD) approach
      $$\xi(k,l)=\alpha_{DD}\,\frac{|\hat{X}(k,l-1)|^2}{\lambda_d(k,l-1)}+(1-\alpha_{DD})\max\{\gamma(k,l)-1,\,0\}$$
    • where the empirical value of $\alpha_{DD}$ is 0.95 to 0.98, and $\lambda_d(k,l-1)$ may be replaced by $\lambda_d(k,l)$; a sketch combining DD estimation with the Wiener gain follows this list
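A sketch combining the DD recursion with the Wiener gain; the loop layout and $\alpha_{DD}=0.98$ are illustrative choices, and $\lambda_d(k,l)$ is used in place of $\lambda_d(k,l-1)$ as the text permits.

```python
import numpy as np

def wiener_dd(Y, lambda_d, alpha_dd=0.98):
    """Wiener gain with decision-directed prior-SNR estimation.

    Y        : (n_bins, n_frames) noisy STFT
    lambda_d : (n_bins, n_frames) noise power estimate (e.g. MCRA above)
    """
    n_bins, n_frames = Y.shape
    X_hat = np.zeros_like(Y)
    prev_power = np.zeros(n_bins)           # |X_hat(k, l-1)|^2
    for l in range(n_frames):
        ld = np.maximum(lambda_d[:, l], 1e-12)
        gamma = np.abs(Y[:, l])**2 / ld     # posterior SNR
        xi = alpha_dd * prev_power / ld \
             + (1 - alpha_dd) * np.maximum(gamma - 1.0, 0.0)
        G = xi / (xi + 1.0)                 # Wiener gain
        X_hat[:, l] = G * Y[:, l]
        prev_power = np.abs(X_hat[:, l])**2
    return X_hat
```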
  • Gain factor for MMSE (the MMSE-STSA gain of Ephraim and Malah, cited below):
    $$G(k)=\frac{\sqrt{\pi}}{2}\,\frac{\sqrt{v(k)}}{\gamma(k)}\,\exp\!\left(-\frac{v(k)}{2}\right)\left[(1+v(k))\,I_0\!\left(\frac{v(k)}{2}\right)+v(k)\,I_1\!\left(\frac{v(k)}{2}\right)\right],\qquad v(k)=\frac{\xi(k)}{1+\xi(k)}\,\gamma(k)$$
    where $I_0$ and $I_1$ are modified Bessel functions of the first kind; a numerically stable sketch follows
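A sketch of this gain using scipy's exponentially scaled Bessel functions ($\mathtt{i0e}(x)=e^{-x}I_0(x)$), which absorb the $\exp(-v/2)$ factor and avoid overflow; the clipping of $v$ is an illustrative safeguard.

```python
import numpy as np
from scipy.special import i0e, i1e   # exp(-x)*I0(x), exp(-x)*I1(x)

def mmse_stsa_gain(xi, gamma):
    """Ephraim-Malah MMSE-STSA gain for arrays xi (prior SNR) and
    gamma (posterior SNR)."""
    v = xi / (1.0 + xi) * gamma
    v = np.clip(v, 1e-6, 500.0)       # illustrative numerical safeguard
    # exp(-v/2) * I_n(v/2) == i_ne(v/2), so the product stays finite.
    G = (np.sqrt(np.pi * v) / (2.0 * gamma)
         * ((1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)))
    return G
```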

  • The derivation of MMSE can be found in: Ephraim, Y., and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109-1121, 1984.

Source: blog.csdn.net/m0_46324847/article/details/130954924