[Interpretation of Speech Enhancement Paper 07] Single-channel speech enhancement and dereverberation based on Uformer

Authors: Yihui Fu, Yun Liu, Jingdong Li, Dawei Luo, Shubo Lv, Yukai Jv, Lei Xie

1. Motivation

        In recent years, researchers have begun to use the complex spectrum to model the real and imaginary parts of the input speech spectrum simultaneously. This offers a higher theoretical upper bound and has gradually become a hot research direction, but previous work did not jointly optimize the features of the complex and magnitude domains to explore their potential internal connections. Speech dereverberation in single-channel scenarios is also challenging because spatial information is lost. On the other hand, the Conformer model, which evolved from the Transformer, has achieved excellent results in end-to-end speech recognition thanks to its powerful temporal modeling ability. However, a speech front-end model, unlike a recognition model, cannot focus only on temporal information and ignore frequency-band information, because different frequency bands carry different energy and content and therefore require a more refined modeling approach. The authors therefore apply a dual-path modification to the self-attention mechanism, learning attention along both the time and frequency dimensions.

2. Network Architecture

        Uformer consists of three main modules: the Encoder, the Decoder, and the dilated dual-path Conformer, as shown below.

The Encoder learns high-dimensional speech features through a stack of convolutional layers, and the Decoder maps the high-dimensional features back to the input dimension through a stack of deconvolutional (transposed convolution) layers; a minimal layer sketch follows this paragraph. Each Encoder and Decoder layer uses the Hybrid Encoder and Decoder structure to model the complex spectrum and the magnitude spectrum simultaneously and fuse their information, and each pair of corresponding Encoder and Decoder layers uses the Encoder Decoder Attention mechanism to learn the correlation between them. The Dilated dual-path conformer contains four main modules: two feed-forward (FF) layers, time attention (TA), frequency attention (FA), and dilated convolution (DC).
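As a rough illustration of the convolutional stacking, here is a minimal sketch of one encoder layer and its matching decoder layer; the kernel sizes, strides, and normalization choices are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: downsample along the frequency axis (illustrative settings)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), stride=(1, 2), padding=(0, 1))
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x):  # x: (B, C, T, F)
        return self.act(self.norm(self.conv(x)))

class DecoderLayer(nn.Module):
    """One decoder layer: restore the frequency resolution with a transposed convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=(1, 3), stride=(1, 2),
                                         padding=(0, 1), output_padding=(0, 1))
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x):  # x: (B, C, T, F')
        return self.act(self.norm(self.deconv(x)))
```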

Complex Self Attention: In complex self attention, the input complex features are processed so that the attention computation is carried out on the real part and the imaginary part respectively.
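A minimal sketch of complex-valued self attention, assuming the complex-multiplication convention used in DCCRN-style complex layers; the sign pattern and per-term attention calls below are assumptions, not a verbatim copy of the paper's equation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_attention(q, k, v):
    """Standard real-valued scaled dot-product attention."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.matmul(F.softmax(scores, dim=-1), v)

def complex_self_attention(q_r, q_i, k_r, k_i, v_r, v_i):
    """Complex attention assembled from real attentions on the real/imaginary parts.

    Combination follows the complex-multiplication analogy (assumption, not the
    paper's exact formula).
    """
    out_r = (scaled_dot_attention(q_r, k_r, v_r) - scaled_dot_attention(q_i, k_i, v_r)
             - scaled_dot_attention(q_r, k_i, v_i) - scaled_dot_attention(q_i, k_r, v_i))
    out_i = (scaled_dot_attention(q_r, k_r, v_i) - scaled_dot_attention(q_i, k_i, v_i)
             + scaled_dot_attention(q_r, k_i, v_r) + scaled_dot_attention(q_i, k_r, v_r))
    return out_r, out_i
```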

Dilated Dual-path Conformer: The module structure is shown in the figure below. The FF module is mainly used for feature-dimension compression and restoration, and uses a half-step residual connection to avoid vanishing gradients. The TA module models local temporal features by concatenating the current frame with historical/future frame information and performing real/complex self attention along the time axis. The FA module models frequency information by performing real/complex self attention across frequency bands. The DC module adapts the TCN [12] model with dilated convolutions for global temporal feature modeling; specifically, the dilation rates of the two dilated-convolution branches are arranged in opposite orders so that different receptive fields are modeled jointly. Hybrid Encoder and Decoder: Since Uformer needs to model complex-spectrum and magnitude-spectrum features simultaneously, this module is proposed to achieve information interaction between the two domains. A sketch of the dual-path time/frequency attention used by the TA and FA modules follows.
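A minimal sketch of the dual-path time/frequency attention idea, using standard real-valued multi-head attention for brevity; the reshaping pattern, residual placement, and head count are assumptions rather than the paper's exact implementation.

```python
import torch.nn as nn

class DualPathAttention(nn.Module):
    """Apply attention along time, then along frequency, over a (B, C, T, F) map."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, T, F)
        b, c, t, f = x.shape
        # Time attention: sequences of length T, one per (batch, frequency-bin) pair.
        xt = x.permute(0, 3, 2, 1).reshape(b * f, t, c)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, f, t, c).permute(0, 3, 2, 1) + x
        # Frequency attention: sequences of length F, one per (batch, frame) pair.
        xf = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        xf, _ = self.freq_attn(xf, xf, xf)
        x = xf.reshape(b, t, f, c).permute(0, 3, 1, 2) + x
        return x
```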

Encoder Decoder Attention: To prevent vanishing gradients, the traditional approach simply adds skip connections between corresponding Encoder and Decoder layers. However, this cannot learn the correlation between Encoder and Decoder features well, so an attention mechanism is used to strengthen the modeling. First, the outputs of the corresponding Encoder and Decoder layers are each passed through a real/complex two-dimensional convolution to learn high-dimensional features; the results are summed and passed through a Sigmoid function, and then through a third real/complex two-dimensional convolution to produce a Sigmoid mask, which is applied to the output of the original Decoder layer. The masked feature is concatenated with the output of the Encoder layer as the input of the next Decoder layer.
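A minimal real-valued sketch of this attention-gated skip connection (the paper also has a complex-valued counterpart); the 1x1 kernels and channel layout are assumptions.

```python
import torch
import torch.nn as nn

class EncoderDecoderAttention(nn.Module):
    """Gate the decoder feature with a mask learned from encoder + decoder features."""
    def __init__(self, channels):
        super().__init__()
        self.conv_enc = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_dec = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, enc, dec):  # enc, dec: (B, C, T, F)
        # Project both features, sum, squash, then learn a sigmoid mask.
        g = self.sigmoid(self.conv_enc(enc) + self.conv_dec(dec))
        mask = self.sigmoid(self.conv_out(g))
        gated_dec = dec * mask
        # Concatenate with the encoder output as input to the next decoder layer.
        return torch.cat([gated_dec, enc], dim=1)
```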

 

3. Loss function

After fusing the complex and magnitude features at the model output, the time-domain SI-SNR, time-domain L1 loss, magnitude-spectrum L2 loss, and complex-spectrum L2 loss are jointly optimized.
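A minimal sketch of how such a joint loss can be assembled, assuming a complex STFT representation and equal loss weights; the weights and exact normalization are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR between estimated and reference time-domain signals."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(dim=-1, keepdim=True) / (ref.pow(2).sum(dim=-1, keepdim=True) + eps) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def joint_loss(est_wav, ref_wav, est_spec, ref_spec, weights=(1.0, 1.0, 1.0, 1.0)):
    """SI-SNR + time-domain L1 + magnitude-spectrum L2 + complex-spectrum L2 (illustrative weights)."""
    w_sisnr, w_l1, w_mag, w_cplx = weights
    loss_sisnr = -si_snr(est_wav, ref_wav).mean()                 # maximize SI-SNR
    loss_l1 = F.l1_loss(est_wav, ref_wav)                         # time-domain L1
    loss_mag = F.mse_loss(est_spec.abs(), ref_spec.abs())         # magnitude-spectrum L2
    loss_cplx = F.mse_loss(torch.view_as_real(est_spec),          # complex-spectrum L2
                           torch.view_as_real(ref_spec))
    return w_sisnr * loss_sisnr + w_l1 * loss_l1 + w_mag * loss_mag + w_cplx * loss_cplx
```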

4. Experiment

        In the experiments, the clean speech data include LibriTTS, AISHELL-3, the DNS Challenge speech data, and the a cappella (vocal) part of MUSDB, totaling 1050 hours. The noise data include MUSAN, the DNS Challenge noise data, the music part of MUSDB, MS-SNSD, etc., totaling 260 hours. The RIRs are simulated with the image method, with RT60 ranging from 0.2 to 1.2 seconds and signal-to-noise ratios from -5 to 15 dB. All training data are generated by dynamic random mixing on the fly. The test data come from the same sources as the above datasets without overlap, simulated over three signal-to-noise-ratio intervals: [-5,0], [0,5], and [5,10] dB. In addition, the official blind test set of the Interspeech 2021 DNS Challenge is used as another test set. PESQ, eSTOI, DNSMOS, and MOS are used as evaluation metrics.

        The experimental results are shown in the table below. Uformer outperforms previous complex-spectrum models (DCCRN, DCCRN+) and time-domain models (TasNet [12], DPRNN [13]) in both objective metrics and subjective listening, demonstrating its strong enhancement and dereverberation ability. Uformer also achieves almost the same performance as SDD-Net, the first-place model of the Interspeech 2021 DNS Challenge. However, compared with SDD-Net's multi-stage training configuration, Uformer's training is fully end-to-end, and the process is simpler, more controllable, and easier to reproduce. In addition, the ablation experiments show that all the proposed sub-modules, including the dilated dual-path conformer, the hybrid Encoder and Decoder, and the Encoder Decoder attention, make clear contributions.

