[Interpretation of Speech Enhancement Papers 04] DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement

Authors: Shubo Lv, Yanxin Hu, Shimin Zhang, Lei Xie

The paper link and open-source code link are given at the end of this post.

1. Motivation

        DCCRN+ is a further update to DCCRN (readers unfamiliar with DCCRN can refer to my previous post in this series).

2. Method

        1. Extend the model to subband processing.

        2. Replace LSTM with TF-LSTM.

        3. The output of the encoder is aggregated using a convolutional block before being fed to the decoder.

        4. Add an a-priori SNR estimation module that works with the decoder to remove noise while maintaining good speech quality.

        5. Finally, the post-processing module is used to further suppress the unnatural residual noise.

3. Network Architecture

        The overall network architecture of DCCRN+ is shown in the figure below:

        The overall structure is similar to DCCRN, but with the following differences:

        1) Subband processing using a split module before the encoder and a merge module after the decoder.

        2) Complex TF-LSTM for frequency and time-scale temporal modeling.

        3) Add Convolution Pathway to better aggregate information from encoder output before feeding to decoder.

        4) Add SNR estimation module to alleviate speech distortion during noise suppression.

        5) Post-processing to further remove residual noise.

3.1 TF-LSTM module

        The TF-LSTM module design is as follows:
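The figure is not reproduced here, but the core idea is axis reordering: the F-LSTM scans each frame along the frequency axis, while the T-LSTM scans each frequency bin along time. A minimal numpy sketch of the two views (the tensor layout and toy sizes are my own assumptions, not taken from the paper):

```python
import numpy as np

# Toy encoder feature map: (batch, channels, time, freq).
B, C, T, F = 2, 8, 10, 16
x = np.random.randn(B, C, T, F)

# F-LSTM view: each time frame becomes an independent sequence over
# frequency, so the sequence axis is F and the feature size is C.
f_view = x.transpose(0, 2, 3, 1).reshape(B * T, F, C)

# T-LSTM view: each frequency bin becomes a sequence over time.
t_view = x.transpose(0, 3, 2, 1).reshape(B * F, T, C)

print(f_view.shape)  # (20, 16, 8)
print(t_view.shape)  # (32, 10, 8)
```

An LSTM (complex-valued in DCCRN+, applied separately to real and imaginary parts) would then run over the second axis of each view, and the output is reshaped back before the next stage.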

3.2 Loss function

        The SI-SNR loss is used for noise suppression, and an MSE loss guides the learning of the SNR estimator; the overall loss is a weighted combination of the two.
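As a sketch of this combined loss (the weight `lam` on the MSE term is my own placeholder, not the value from the paper):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between estimated and reference signals."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the target component.
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def total_loss(est, ref, snr_pred, snr_true, lam=1.0):
    """-SI-SNR for enhancement plus a weighted MSE for the SNR estimator."""
    mse = np.mean((snr_pred - snr_true) ** 2)
    return -si_snr(est, ref) + lam * mse
```

Because SI-SNR is scale-invariant, rescaling the estimate does not change the loss, which is why it is preferred over plain SNR for enhancement training.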

4. Experiment

4.1 Dataset

        The authors first conduct comprehensive ablation experiments on the DNS-2020 dataset. The model is then trained together with the post-processing module and evaluated on the Interspeech 2021 DNS challenge dataset to demonstrate its performance in more complex and realistic acoustic scenarios. Competing models such as PercepNet are also compared on the Voice Bank + DEMAND dataset.

4.2 Training Strategy

        The window length and frame shift are 20 ms and 10 ms respectively, and the FFT length is 512. The Adam optimizer is used with an initial learning rate of 1e-3; when the loss on the validation set increases, the learning rate is halved.
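These settings translate into concrete frame sizes once a sampling rate is fixed. Assuming 16 kHz audio (typical for the DNS challenge, though not stated above), the arithmetic works out as:

```python
SR = 16_000            # assumed sampling rate, not given in the text
WIN = int(0.020 * SR)  # 20 ms window  -> 320 samples
HOP = int(0.010 * SR)  # 10 ms shift   -> 160 samples
N_FFT = 512            # zero-padded FFT length

def stft_shape(num_samples):
    """(frames, freq_bins) for a simple no-padding framing scheme."""
    frames = 1 + (num_samples - WIN) // HOP
    return frames, N_FFT // 2 + 1

print(stft_shape(SR))  # one second of audio -> (99, 257)
```

Since the 320-sample window is shorter than the 512-point FFT, each frame is zero-padded before the transform, giving 257 frequency bins per frame.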

4.3 Baseline

        DCCRN: The channel numbers of DCCRN are {16, 32, 64, 128, 256, 256}, with convolution kernel and stride set to (5,2) and (2,1). A two-layer LSTM with 256 units is used, followed by a 1024×256 fully connected layer. Each encoder block processes the current frame and one previous frame. In the decoder, the last layer processes one additional future frame, and each earlier layer uses the current frame and one history frame.

        DCCRN+: The channel numbers of DCCRN+ are {32, 64, 128, 256}. The split-band module is a grouped Conv1D layer with 4 groups; correspondingly, the merge module is a linear layer. The complex TF-LSTM module consists of a complex LSTM (256 units each for the real and imaginary parts) and a complex BLSTM. The CLP block has 256 units for the real and imaginary parts. The convolution pathway module is a 1×1 complex Conv2D layer. The SNR estimator is a 64-unit LSTM layer followed by a Conv1D layer with kernel size 3. The rest of the configuration is the same as DCCRN.
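Per frame, a grouped Conv1D over frequency acts as a block-diagonal linear map: each group of bins gets its own learnable filter. A rough numpy sketch of the split/merge idea (random weights stand in for the trained neural filters; sizes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
F_FULL, GROUPS = 256, 4       # full-band bins, number of subbands
F_SUB = F_FULL // GROUPS      # 64 bins per subband

# One learnable filter per subband (the "neural filter"), plus a
# linear merge layer that maps the concatenated subbands back.
split_w = rng.standard_normal((GROUPS, F_SUB, F_SUB)) * 0.1
merge_w = rng.standard_normal((GROUPS * F_SUB, F_FULL)) * 0.1

def split_bands(spec):        # spec: (frames, F_FULL)
    sub = spec.reshape(-1, GROUPS, F_SUB)           # carve into 4 groups
    return np.einsum('tgf,gfo->tgo', sub, split_w)  # per-group filtering

def merge_bands(sub):         # sub: (frames, GROUPS, F_SUB)
    return sub.reshape(sub.shape[0], -1) @ merge_w  # linear merge layer

spec = rng.standard_normal((10, F_FULL))
sub = split_bands(spec)
out = merge_bands(sub)
print(sub.shape, out.shape)  # (10, 4, 64) (10, 256)
```

Because the filters are learned jointly with the network, the band split can adapt to the task, which is what lets the subband model recover the PESQ lost by a fixed FIR split (see Section 4.4).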

4.4 Experimental results

        The results in Table 1 show that subband operations significantly improve inference speed and reduce model size, but the PESQ of the subband DCCRN based on a fixed FIR filter degrades noticeably compared with the original DCCRN. With the proposed neural-network filter, the PESQ of the subband model recovers to the same level as DCCRN, while inference speed improves further to an RTF of 0.137. After replacing the LSTMs with complex TF-LSTMs in the subband DCCRN (NN filter), a significant PESQ improvement is obtained, at the cost of a larger and slower model. The convolution pathway and SNR estimator bring further PESQ gains, with the best PESQ reaching 3.32, a clear improvement over the original DCCRN.

        The PESQ performance of DCCRN+ is compared with other competing models on the Voice Bank + DEMAND test set. The results in Table 2 show that the proposed DCCRN+ significantly outperforms other models, and DCCRN+ greatly outperforms PercepNet with fewer parameters.

        The model trained on the DNS-2021 dataset is further tested, this time evaluated with DNSMOS, a new metric provided by the challenge organizers that is believed to correlate better with subjective listening scores. As Table 3 shows, DNSMOS increases as more of the proposed updates are applied, and the highest score of 3.46 is obtained with all updates, including the post-processor.

5 Conclusion

        DCCRN+ equips DCCRN with subband processing via learnable neural filters for band splitting and merging, enabling a compact model size and accelerated inference. The model also adds complex TF-LSTMs and a convolution pathway. Importantly, under a multi-task learning framework, an SNR estimator works together with the decoder to remove noise while maintaining good speech quality. Finally, a post-processor removes unnatural residual noise. Experiments demonstrate the effectiveness of these updates.

Article address: https://arxiv.org/pdf/2106.08672.pdf

Open source code address: None


Origin blog.csdn.net/qq_41893773/article/details/124109621