Paper translation: DeepFilterNet2


Paper title : DeepFilterNet2: Towards Real-Time Speech Enhancement On Embedded Devices For Fullband Audio
Title translation : DeepFilterNet2: Real-time full-band speech enhancement for embedded devices
Paper address : https://arxiv.org/abs/2205.05474
Paper code : https://github.com/Rikorose/DeepFilterNet
Citation : Schröter H, Rosenkranz T, Maier A. DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices for Full-Band Audio[J]. arXiv preprint arXiv:2205.05474, 2022.

Abstract

  Deep learning-based speech enhancement techniques have made tremendous progress and have recently been extended to full-band audio (48 kHz). However, many methods have considerable computational complexity and require large temporal buffers for real-time use, e.g., due to temporal convolutions or attention. Both make these methods infeasible on embedded devices. This work further extends DeepFilterNet, which exploits the harmonic structure of speech for efficient speech enhancement. Several optimizations in the training process, data augmentation, and network structure bring SE performance to the state of the art while reducing the real-time factor to 0.04 on a notebook Core-i5 CPU. This allows the algorithm to run in real time on embedded devices. The DeepFilterNet framework is available under an open source license.
Keywords : DeepFilterNet, speech enhancement, full-band, two-stage modeling

1 Introduction

  Recently, deep learning based speech enhancement has been extended to the full frequency band (48 kHz) [1,2,3,4]. Most state-of-the-art (SOTA) methods perform SE in the frequency domain by applying a short-time Fourier transform (STFT) to the noisy audio signal and enhancing the signal with a U-Net-like deep neural network (DNN). However, many methods have relatively large computational demands in terms of multiply-accumulate operations (MACs) and memory bandwidth. That is, higher sampling rates generally require larger FFT windows, resulting in a large number of frequency bins, which directly translates to more MACs.
  PercepNet [1] addresses this problem by using a triangular ERB (equivalent rectangular bandwidth) filter bank. Here, the STFT frequency bins are logarithmically compressed into 32 ERB bands. However, this only allows real-valued processing, which is why PercepNet additionally applies a comb filter to better enhance the periodic components of speech. FRCRN [3], in contrast, splits the frequency bins into 3 channels to reduce the size of the frequency axis. This approach allows processing and predicting complex ratio masks (CRMs). Similarly, DMF-Net [4] uses a multi-band approach in which the frequency axis is divided into 3 bands that are processed separately by different networks. In general, multi-stage networks like DMF-Net have recently demonstrated their potential compared to single-stage methods. For example, GaGNet [5] uses two so-called glance and gaze stages after a feature extraction stage. The glance module works on a coarse magnitude domain, while the gaze module processes the spectrum in the complex domain, allowing the spectrum to be reconstructed at a finer resolution.
  In this work, we extend the work of [2], which also uses a two-stage approach. DeepFilterNet utilizes a speech model consisting of a periodic component and a stochastic component. The first stage works in the ERB domain and only enhances the speech envelope, while the second stage uses deep filtering [6,7] to enhance periodic components. In this paper, we describe several optimizations that achieve SOTA performance on the Voicebank+Demand [8] and Deep Noise Suppression (DNS) 4 blind test challenge [9] datasets. Additionally, these optimizations improve runtime performance, making it possible to run the model in real time on a Raspberry Pi 4.

2 Methods

2.1. Signal model and DeepFilterNet framework

  We assume that noise and speech are uncorrelated:
$$x(t) = s(t) * h(t) + n(t)$$
where $s(t)$ is the clean speech signal, $n(t)$ is additive noise, and $h(t)$ is a room impulse response modeling a reverberant environment, resulting in the noisy mixture $x(t)$. This translates directly to the frequency domain:
$$X(k,f) = S(k,f) \cdot H(k,f) + N(k,f)$$
where $X(k,f)$ is the STFT representation of the time-domain signal $x(t)$, and $k$, $f$ are the time and frequency indices.
  In this work, we adopt the two-stage denoising process of DeepFilterNet [2]. That is, the first stage operates on a magnitude-domain representation and predicts real-valued gains. The whole first stage works in a compressed ERB domain, which reduces computational complexity while modeling human auditory perception. The purpose of the first stage is therefore to enhance the speech envelope at a coarse frequency resolution. The second stage uses deep filtering [7,6] in the complex domain, attempting to reconstruct the periodicity of speech. [2] showed that deep filtering (DF) often outperforms conventional complex ratio masks (CRMs), especially in very noisy conditions.
  The combined SE process can be formulated as follows. An encoder $F_{enc}$ encodes the ERB and complex features into one embedding $\mathcal{E}(k)$:
$$\mathcal{E}(k) = F_{enc}\big(X_{erb}(k,b),\, X_{df}(k,f_{df})\big)$$
  Next, the first stage predicts real-valued gains $G$ and enhances the speech envelope, yielding the short-time spectrum $Y_G$:
$$\begin{aligned} G_{erb}(k,b) &= F_{erb\_dec}(\mathcal{E}(k)) \\ G(k,f) &= \mathrm{interp}\big(G_{erb}(k,b)\big) \\ Y_G(k,f) &= X(k,f) \cdot G(k,f) \end{aligned}$$
  Finally, in the second stage, $F_{df\_dec}$ predicts the DF coefficients $C_{df}^N$ of order $N$, which are then linearly applied to $Y_G$:
$$\begin{aligned} C_{df}^N(k,i,f_{df}) &= F_{df\_dec}(\mathcal{E}(k)) \\ Y(k,f') &= \sum_{i=0}^{N} C(k,i,f') \cdot Y_G(k-i+l,f') \end{aligned}$$
where $l$ is the DF look-ahead. As mentioned above, the second stage only operates on the lower part of the spectrum up to $f_{df} = 5\,\mathrm{kHz}$. The DeepFilterNet2 framework is shown in Figure 1.

Figure 1: Overview of the DeepFilterNet2 speech enhancement process
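
To make the two stages above concrete, here is a minimal NumPy sketch of how the predicted ERB gains and DF coefficients could be applied to a noisy STFT. The shapes, the per-band gain expansion, and the helper name `apply_two_stage` are illustrative assumptions and not the actual DeepFilterNet2 implementation.

```python
import numpy as np

def apply_two_stage(X, erb_gains, erb_widths, df_coefs, lookahead=2):
    """X: complex STFT, shape (T, F).
    erb_gains: real-valued gains from the ERB decoder, shape (T, B).
    erb_widths: STFT bins per ERB band, length B, summing to F.
    df_coefs: complex DF coefficients, shape (T, N, F_df).
    lookahead: DF look-ahead l in frames (assumed <= N - 1)."""
    T, F = X.shape
    # Stage 1: expand the coarse ERB gains to the linear frequency axis
    # (simple per-band repetition as a stand-in for interp) and apply them.
    G = np.repeat(erb_gains, erb_widths, axis=1)              # (T, B) -> (T, F)
    Y_G = X * G
    # Stage 2: deep filtering, applied only to the lower F_df bins.
    N, F_df = df_coefs.shape[1], df_coefs.shape[2]
    Y = Y_G.copy()
    # pad the time axis so that frame index k - i + l is always valid
    pad = np.pad(Y_G[:, :F_df], ((N - 1 - lookahead, lookahead), (0, 0)))
    Y[:, :F_df] = sum(df_coefs[:, i, :] * pad[N - 1 - i:N - 1 - i + T, :]
                      for i in range(N))
    return Y

# toy shapes: 100 frames, 481 bins (20 ms window at 48 kHz), 32 ERB bands
X = np.random.randn(100, 481) + 1j * np.random.randn(100, 481)
widths = np.full(32, 481 // 32); widths[-1] += 481 - widths.sum()
C = np.random.randn(100, 5, 96) + 1j * np.random.randn(100, 5, 96)
Y = apply_two_stage(X, np.random.rand(100, 32), widths, C)
```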

2.2. Training process

  In DeepFilterNet [2], we used an exponential learning rate schedule and a fixed weight decay. In this work, we additionally use a learning rate warmup of 3 epochs followed by cosine decay. Most importantly, we update the learning rate at every iteration instead of after every epoch. Similarly, we schedule the weight decay with an increasing cosine schedule, resulting in stronger regularization in later stages of training. Finally, to achieve faster convergence especially at the beginning of training, we use batch size scheduling [10], starting with a batch size of 8 and gradually increasing it to 96. The schedules are shown in Figure 2.

Figure 2: Learning rate, weight decay, and batch size schedules used for training
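
The following is a small Python sketch of these per-iteration schedules. The warmup length, start/end values, and the batch-size ramp are assumptions for illustration; only the general shapes (linear warmup plus cosine decay for the learning rate, an increasing cosine for weight decay, and a batch size growing from 8 to 96) follow the description above.

```python
import math

def lr_schedule(step, steps_per_epoch, warmup_epochs=3, total_epochs=100,
                lr_max=1e-3, lr_min=1e-6):
    # linear warmup for the first epochs, then cosine decay, per iteration
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

def weight_decay_schedule(step, total_steps, wd_min=1e-4, wd_max=1e-2):
    # increasing cosine: more regularization towards the end of training
    t = step / max(1, total_steps)
    return wd_max - 0.5 * (wd_max - wd_min) * (1 + math.cos(math.pi * t))

def batch_size_schedule(epoch, start=8, end=96, ramp_epochs=20):
    # grow the batch size stepwise from 8 to 96 over the first epochs
    t = min(1.0, epoch / ramp_epochs)
    return int(round(start + t * (end - start)))
```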

2.3. Multi-objective loss

  We adopt the spectral loss $L_{spec}$ from [2]. In addition, a multi-resolution (MR) spectral loss is used, where the enhanced spectrum $Y(k,f)$ is first transformed into the time domain before computing multiple STFTs with windows from 5 ms to 40 ms [11]. To propagate the gradient of this loss, we use the PyTorch STFT/ISTFT, which is numerically close enough to the original DeepFilterNet processing loop implemented in Rust.
$$L_{MR} = \sum_i \big\| |Y_i'|^c - |S_i'|^c \big\|^2 + \sum_i \big\| |Y_i'|^c e^{j\varphi_Y} - |S_i'|^c e^{j\varphi_S} \big\|^2$$
where $Y_i' = \mathrm{STFT}_i(y)$ is the $i$-th STFT of the predicted time-domain signal $y$ with window sizes in {5, 10, 20, 40} ms, and $c$ is a compression parameter [1]. Compared to DeepFilterNet [2], we drop the $\alpha$ loss term, since the underlying heuristic is only a poor approximation of the local speech periodicity. Moreover, DF could in principle also affect non-speech sections; its effect can be disabled by setting the real part of the coefficient at $t_0$ to 1 and the remaining coefficients to 0. The combined multi-objective loss is:
$$L = \lambda_{spec} L_{spec} + \lambda_{MR} L_{MR}$$
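
Below is a minimal PyTorch sketch of such a multi-resolution loss. The compression exponent `c = 0.6`, the Hann window, and the 50% hop are assumptions for illustration; the windows of {5, 10, 20, 40} ms follow the text.

```python
import torch

def mr_spectral_loss(y, s, sr=48000, win_ms=(5, 10, 20, 40), c=0.6, eps=1e-12):
    """y, s: enhanced and clean time-domain signals, shape (batch, samples)."""
    loss = 0.0
    for ms in win_ms:
        n_fft = int(sr * ms / 1000)
        win = torch.hann_window(n_fft, device=y.device)
        Y = torch.stft(y, n_fft, hop_length=n_fft // 2, window=win,
                       return_complex=True)
        S = torch.stft(s, n_fft, hop_length=n_fft // 2, window=win,
                       return_complex=True)
        # magnitude-compressed spectra |.|^c
        Yc = Y.abs().clamp_min(eps) ** c
        Sc = S.abs().clamp_min(eps) ** c
        # compressed-magnitude term
        loss = loss + torch.sum((Yc - Sc) ** 2)
        # phase-aware term: compressed magnitude with the original phase
        loss = loss + torch.sum(torch.abs(torch.polar(Yc, Y.angle())
                                          - torch.polar(Sc, S.angle())) ** 2)
    return loss
```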

2.4. Data Augmentation

  DeepFilterNet was trained on the Deep Noise Suppression (DNS) 3 challenge dataset [12], whereas we train DeepFilterNet2 on the English part of DNS4 [9], which contains more full-band noise and speech samples.
  In speech enhancement, usually only background noise and, in some cases, reverberation are reduced [1,11,2]. In this work, we further extend the SE concept to recover from other signal distortions. We therefore distinguish between augmentations and distortions in the on-the-fly data preprocessing pipeline. Augmentations are applied to speech and noise samples with the goal of further extending the data distribution the network observes during training. Distortions, in contrast, are only applied to the speech samples used to create the noisy mixture. The clean speech target is not affected by the distortion transforms; thus, the DNN learns to reconstruct the original, undistorted speech signal. Currently, the DeepFilterNet framework supports the following random augmentations:

  • Random second-order filtering [13]
  • Random gain changes
  • Equalizer via second-order filters
  • Speed and pitch changes via resampling [13]
  • Addition of colored noise (not applied to speech samples)

In addition to denoising, DeepFilterNet2 also learns to recover from the following distortions (a schematic sketch of the augmentation/distortion split is given after the list):

  • Reverberation: the target signal contains only an attenuated room transfer function, so the network learns to reduce reverberation.
  • Clipping artifacts with an SNR of [20, 0] dB.
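
The sketch below illustrates the augmentation/distortion split with clipping as the example distortion. The function names, parameter ranges, and the clipping-threshold search are illustrative assumptions, not the actual DeepFilterNet data pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_gain(x, low_db=-6.0, high_db=6.0):
    # augmentation: applied to speech and noise alike
    return x * 10 ** (rng.uniform(low_db, high_db) / 20)

def clip_distortion(x, snr_db):
    # distortion: applied only to the speech copy used in the noisy mixture;
    # lower the clipping threshold until the clipping error reaches the
    # requested signal-to-distortion level (in dB)
    thresh = np.max(np.abs(x))
    for q in np.linspace(0.99, 0.5, 50):
        t = np.quantile(np.abs(x), q)
        err = x - np.clip(x, -t, t)
        thresh = t
        if 10 * np.log10(np.sum(x**2) / (np.sum(err**2) + 1e-12)) <= snr_db:
            break
    return np.clip(x, -thresh, thresh)

def make_training_pair(speech, noise, mix_snr_db=5.0, clip_snr_db=15.0):
    speech = random_gain(speech)      # augmentation also affects the target
    target = speech.copy()            # clean target: augmented, undistorted
    distorted = clip_distortion(speech, clip_snr_db)
    noise = random_gain(noise)
    # scale the noise to the requested mixture SNR and build the noisy input
    scale = np.sqrt(np.sum(distorted**2)
                    / (np.sum(noise**2) + 1e-12) / 10 ** (mix_snr_db / 10))
    return distorted + scale * noise, target

# usage with random placeholder signals
noisy, target = make_training_pair(rng.standard_normal(48000),
                                   rng.standard_normal(48000))
```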

2.5. DNN

  We keep the general convolutional U-Net architecture of DeepFilterNet [2] but apply the following adjustments. The resulting architecture is shown in Figure 3.
1. Unified encoder. The convolutions for ERB and complex features are now both handled within the encoder, concatenated, and passed to a grouped linear (GLinear) layer and a single GRU.
2. Simplified grouping. Previously, grouping of linear and GRU layers was implemented via separate, smaller layers, which resulted in relatively high processing overhead. In DeepFilterNet2, only linear layers are grouped over the frequency axis, implemented by a single matrix multiplication (a sketch of such a grouped linear layer is given below). The GRU hidden dimension is reduced to 256. We also apply grouping in the output layer of the DF decoder, since neighboring frequencies carry sufficient information to predict the filter coefficients. This greatly reduces the runtime while only slightly increasing the number of FLOPs.
3. Reduced temporal kernels. Although temporal convolutions (TCN) or temporal attention have been successfully applied to SE, they require temporal buffers during real-time inference. This can be implemented efficiently with ring buffers, but the buffers need to be kept in memory, and the additional memory access may turn bandwidth into a limiting bottleneck, especially on embedded devices. Therefore, we reduce the kernel size of the convolutions and transposed convolutions from 2×3 to 1×3, i.e., 1D over the frequency axis. Only the input layer now incorporates temporal context via a causal 3×3 convolution. This greatly reduces the temporal buffers required during real-time inference.
4. Depthwise pathway convolutions. When using separable convolutions, most parameters and FLOPs reside in the 1×1 convolutions. Therefore, adding grouping to the pathway convolutions (PConv) greatly reduces the number of parameters without any notable loss in SE performance.

Figure 3: DeepFilterNet2 architecture
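
As a concrete example of point 2 above, here is a minimal PyTorch sketch of a grouped linear layer over the frequency axis, implemented as a single batched matrix multiplication (one einsum) instead of several small linear layers. The shapes and group count are illustrative assumptions, not the exact DeepFilterNet2 configuration.

```python
import torch
import torch.nn as nn

class GroupedLinear(nn.Module):
    def __init__(self, in_features, out_features, groups):
        super().__init__()
        assert in_features % groups == 0 and out_features % groups == 0
        self.groups = groups
        # one weight tensor holding all group matrices: (G, F_in/G, F_out/G)
        self.weight = nn.Parameter(
            torch.randn(groups, in_features // groups, out_features // groups)
            / (in_features // groups) ** 0.5)

    def forward(self, x):
        # x: (batch, time, in_features); split the feature axis into groups
        b, t, _ = x.shape
        x = x.view(b, t, self.groups, -1)             # (B, T, G, F_in/G)
        # a single einsum == one batched matmul over all groups
        y = torch.einsum("btgi,gio->btgo", x, self.weight)
        return y.reshape(b, t, -1)                    # (B, T, out_features)

# usage: 8 groups over a 512-dimensional frequency embedding
glinear = GroupedLinear(512, 512, groups=8)
out = glinear(torch.randn(1, 100, 512))               # -> (1, 100, 512)
```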

2.6. Post-processing

  We adopt the post-filter first proposed by Valin et al. [1], which aims at slightly over-attenuating noisy TF bins while adding some gain to less noisy bins. We apply it to the gains predicted in the first stage:
$$\begin{aligned} G'(k,b) &\leftarrow G(k,b) \cdot \sin\!\Big(\frac{\pi}{2} G(k,b)\Big) \\ G(k,b) &\leftarrow \frac{(1+\beta) \cdot G(k,b)}{1+\beta+G'(k,b)} \end{aligned}$$
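
The following is a direct NumPy transcription of the two update equations above; the value of β is an assumed tuning constant.

```python
import numpy as np

def post_filter(G, beta=0.02):
    """G: real-valued stage-1 gains in [0, 1], e.g. shape (T, B)."""
    G_sin = G * np.sin(0.5 * np.pi * G)             # G'(k, b)
    return (1.0 + beta) * G / (1.0 + beta + G_sin)  # updated G(k, b)
```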

3 Experiments

3.1. Implementation Details

  As described in Section 2.4, we train DeepFilterNet2 on the DNS4 dataset, using over 500 h of full-band clean speech, approximately 150 h of noise, as well as 150 real and 60,000 simulated HRTFs. We split the data into training, validation, and test sets (70%, 15%, 15%). The splits are speaker-exclusive and have no overlap with the Voicebank test set. We evaluate our method on the Voicebank+Demand test set [8] and the DNS4 blind test set [9]. We train the model with AdamW for 100 epochs and select the best model based on the validation loss.
  In this work, we use a window length of 20 ms, an overlap of 50%, and a look-ahead of two frames, resulting in an overall algorithmic delay of 40 ms. We use 32 ERB bands, $f_{DF} = 5\,\mathrm{kHz}$, a DF order of $N = 5$, and a DF look-ahead of $l = 2$ frames. The loss parameters $\lambda_{spec} = 1\mathrm{e}3$ and $\lambda_{MR} = 5\mathrm{e}2$ were chosen so that both losses are of the same order of magnitude. Source code and a pre-trained DeepFilterNet2 are available at https://github.com/Rikorose/DeepFilterNet.
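
A quick sanity check of the quoted 40 ms algorithmic delay (window length plus look-ahead), using the STFT settings from this section:

```python
win_ms = 20                      # STFT window length
hop_ms = win_ms * (1 - 0.5)      # 10 ms hop at 50% overlap
lookahead_frames = 2             # two frames of look-ahead
delay_ms = win_ms + lookahead_frames * hop_ms
print(delay_ms)                  # 40.0 -> overall algorithmic delay in ms
```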

3.2. Results

  We evaluate the speech enhancement performance of DeepFilterNet2 on the Voicebank+Demand test set [8]. We report WB-PESQ [19], STOI [20], and the composite metrics CSIG, CBAK, and COVL [21]. Table 1 compares DeepFilterNet2 with other state-of-the-art (SOTA) methods. DeepFilterNet2 achieves SOTA-level results while requiring the fewest multiply-accumulate operations per second (MACS). Compared to DeepFilterNet, the number of parameters increased slightly (Section 2.5), but the network runs more than twice as fast and achieves a 0.27 higher PESQ score. GaGNet [5] achieves a similar RTF with good SE performance; however, it only runs this fast when processing the whole audio at once and requires large temporal buffers due to its large temporal kernels. FRCRN [3] achieves the best results on most metrics, but its computational complexity is too high for embedded devices.

Table 1: Objective results on the Voicebank+Demand test set. The real-time factor (RTF) is measured as the average of 5 runs on a notebook Core i5-8250U CPU. Values not reported in the related work are denoted by "−".


  • a. Source code and weights used for the metrics and RTF measurements are available at https://github.com/xiph/rnnoise
  • b. Note that RNNoise runs single-threaded.
  • c. Source code used for the RTF measurement is available at https://github.com/huyanxin/DeepComplexCRN
  • d. Composite and STOI metrics were provided by the same authors in [16].
  • e. Source code and weights used for the metrics and RTF measurements are available at https://github.com/hit-thusz-RookieCJ/FullSubNet-plus
  • f. Source code used for the RTF measurement is available at https://github.com/Andong-Li-speech/GaGNet

Table 2 shows DNSMOS P.835 [22] results on the DNS4 blind test set. While DeepFilterNet [2] was not able to improve the speech quality mean opinion score (SIGMOS), DeepFilterNet2 obtains good results also for the background and overall MOS values. Moreover, DeepFilterNet2 comes relatively close to the minimum DNSMOS values that were used to select clean speech samples for training the DNS4 baseline NSNet2 (SIG = 4.2, BAK = 4.5, OVL = 4.0) [9], further emphasizing its good SE performance.

4 Conclusion

  In this work, we presented DeepFilterNet2, a low-complexity speech enhancement framework. Building on DeepFilterNet's perceptual approach, we applied several optimizations that result in SOTA SE performance. Due to its lightweight architecture, it runs with a real-time factor of 0.42 on a Raspberry Pi 4. In future work, we plan to extend the idea of speech enhancement to other enhancements, such as correcting the low-pass characteristic caused by the current room environment.

5 References

[1] Jean-Marc Valin, Umut Isik, Neerad Phansalkar, Ritwik Giri, Karim Helwani, and Arvindh Krishnaswamy, A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech, in INTERSPEECH 2020, 2020.

[2] Hendrik Schröter, Alberto N Escalante-B, Tobias Rosenkranz, and Andreas Maier, DeepFilterNet: A low complexity speech enhancement framework for fullband audio based on deep filtering, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.

[3] Shengkui Zhao, Bin Ma, Karn N Watcharasupat, and Woon-Seng Gan, FRCRN: Boosting feature representation using frequency recurrence for monaural speech enhancement, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.

[4] Guochen Yu, Yuansheng Guan, Weixin Meng, Chengshi Zheng, and Hui Wang, DMF-Net: A decoupling-style multi-band fusion model for real-time full-band speech enhancement, arXiv preprint arXiv:2203.00472, 2022.

[5] Andong Li, Chengshi Zheng, Lu Zhang, and Xiaodong Li, Glance and gaze: A collaborative learning framework for single-channel speech enhancement, Applied Acoustics, vol. 187, 2022.

[6] Hendrik Schröter, Tobias Rosenkranz, Alberto Escalante Banuelos, Marc Aubreville, and Andreas Maier, CLCNet: Deep learning-based noise reduction for hearing aids using complex linear coding, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

[7] Wolfgang Mack and Emanuël AP Habets, Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters, IEEE Signal Processing Letters, vol. 27, 2020.

[8] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech, in SSW, 2016.

[9] Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, et al., ICASSP 2022 deep noise suppression challenge, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.

[10] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le, Don't decay the learning rate, increase the batch size, arXiv preprint arXiv:1711.00489, 2017.

[11] Hyeong-Seok Choi, Sungjin Park, Jie Hwan Lee, Hoon Heo, Dongsuk Jeon, and Kyogu Lee, Real-time denoising and dereverberation with tiny recurrent U-Net, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[12] Chandan KA Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, and Sriram Srinivasan, Interspeech 2021 deep noise suppression challenge, in INTERSPEECH, 2021.

[13] Jean-Marc Valin, A hybrid dsp/deep learning approach to real-time full-band speech enhancement, in 2018 IEEE 20th international workshop on multimedia signal processing (MMSP). IEEE, 2018.

[14] Sebastian Braun, Hannes Gamper, Chandan KA Reddy, and Ivan Tashev, Towards efficient models for realtime deep noise suppression, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[15] Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie, DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement, in INTERSPEECH, 2020.

[16] Shubo Lv, Yihui Fu, Mengtao Xing, Jiayao Sun, Lei Xie, Jun Huang, Yannan Wang, and Tao Yu, S-DCCRN: Super wide band DCCRN with learnable complex feature for speech enhancement, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.

[17] Shubo Lv, Yanxin Hu, Shimin Zhang, and Lei Xie, DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement, in INTERSPEECH, 2021.

[18] Jun Chen, Zilin Wang, Deyi Tuo, Zhiyong Wu, Shiyin Kang, and Helen Meng, FullSubNet+: Channel attention fullsubnet with complex spectrograms for speech enhancement, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.

[19] ITU, Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs, ITU-T Recommendation P.862.2, 2007.

[20] Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen, An algorithm for intelligibility prediction of time frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Language Processing, 2011.

[21] Yi Hu and Philipos C Loizou, Evaluation of objective quality measures for speech enhancement, IEEE Transactions on audio, speech, and language processing, 2007.

[22] Chandan KA Reddy, Vishak Gopal, and Ross Cutler, DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
