01 Conv-TasNet paper sharing

Title: Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

The paper link and open-source code link are given at the end of this article.

1. Motivation:

Single-channel, speaker-independent speech separation methods still fall short in accuracy, latency, and computational cost. Time-frequency (spectrogram-based) approaches to separation suffer from several problems: the decoupling of signal phase and magnitude, a time-frequency representation that is suboptimal for speech separation performance, and the long latency required to compute the spectrogram.

2. Method:

The paper proposes Conv-TasNet, a fully convolutional time-domain audio separation network: a deep learning framework for end-to-end time-domain speech separation. A linear encoder generates a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output, and the masked encoder representations are then inverted back to waveforms by a linear decoder.
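The encoder-mask-decoder pipeline above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the dimensions are made up, random weights stand in for trained ones, and the masks are placeholders for the separation network's output.

```python
import numpy as np

# Hypothetical toy dimensions (not the paper's actual hyperparameters):
# N = number of encoder basis filters, L = filter length in samples,
# C = number of speakers, T = number of overlapping frames.
N, L, C, T = 16, 8, 2, 50
hop = L // 2                      # 50% overlap between frames
rng = np.random.default_rng(0)

mixture = rng.standard_normal(hop * (T - 1) + L)   # input waveform

# Encoder: a learned linear transform (random weights stand in for
# trained ones) applied to each overlapping frame of the waveform.
U = rng.standard_normal((N, L))                    # encoder basis
frames = np.stack([mixture[t * hop : t * hop + L] for t in range(T)])
w = frames @ U.T                                   # (T, N) representation

# Separation: one non-negative mask per speaker, applied elementwise.
# A trained network estimates these; here softmax-normalized noise
# stands in, so the masks sum to one across speakers.
logits = rng.standard_normal((C, T, N))
masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
masked = masks * w                                 # (C, T, N)

# Decoder: another linear transform back to frames, then overlap-add
# reconstructs each speaker's waveform.
V = rng.standard_normal((N, L))                    # decoder basis
estimates = np.zeros((C, mixture.shape[0]))
for c in range(C):
    for t in range(T):
        estimates[c, t * hop : t * hop + L] += masked[c, t] @ V
```

The point of the sketch is the data flow: everything between waveform in and waveforms out is linear or elementwise, which is what lets the whole system be trained end-to-end on a time-domain objective.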

3. Network architecture:

The overall network architecture is as follows

The detailed network architecture is as follows

The 1-D convolution block is designed as follows
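The core of each 1-D conv block is a depthwise separable convolution with a residual path and a skip path. A minimal numpy sketch of that structure follows; the sizes are illustrative (not the paper's B/H/P values), and the PReLU activations and normalization layers of the real block are omitted for brevity.

```python
import numpy as np

# Toy sizes (illustrative only): B = bottleneck channels, H = hidden
# channels inside the block, P = depthwise kernel length, T = time steps.
B, H, P, T = 4, 8, 3, 20
d = 2                              # dilation factor of this block
rng = np.random.default_rng(1)
x = rng.standard_normal((B, T))    # block input

# 1x1 conv: expand B -> H channels.
W_in = rng.standard_normal((H, B))
h = W_in @ x                       # (H, T)

# Depthwise dilated conv: each of the H channels is filtered
# independently with its own length-P kernel; symmetric zero-padding
# keeps the time length unchanged (the non-causal variant).
K = rng.standard_normal((H, P))
pad = d * (P - 1) // 2
h_pad = np.pad(h, ((0, 0), (pad, pad)))
dw = np.zeros_like(h)
for t in range(T):
    for p in range(P):
        dw[:, t] += K[:, p] * h_pad[:, t + p * d]

# Two 1x1 convs produce the residual path (back to B channels, added
# to the block input and fed to the next block) and the skip path
# (summed over all blocks to form the mask-estimation features).
W_res = rng.standard_normal((B, H))
W_skip = rng.standard_normal((B, H))
residual_out = x + W_res @ dw
skip_out = W_skip @ dw
```

The depthwise-separable factorization is why the block is cheap: the expensive length-P filtering touches each channel independently, and all channel mixing happens in 1x1 convolutions.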

The corresponding network parameters are as follows 

Compare different hyperparameter configurations:

Experiments show that the penultimate configuration, which uses non-causal convolutions, gives the best performance. The final configuration uses the same hyperparameters and differs only in using causal convolutions.
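The causal/non-causal distinction comes down to how the convolutions are padded. The toy example below (our own, with a made-up length-3 kernel) contrasts the two: the non-causal output at time t looks one sample into the future, while the causal output depends only on past and current samples, which is what enables low-latency operation at some cost in separation accuracy.

```python
import numpy as np

# A made-up smoothing kernel of length 3; dilation 1 for simplicity.
x = np.arange(6, dtype=float)
k = np.array([0.25, 0.5, 0.25])
P = len(k)

# Non-causal: pad symmetrically, so output[t] sees x[t-1 : t+2],
# i.e. one future sample.
x_nc = np.pad(x, (P // 2, P // 2))
y_nc = np.array([k @ x_nc[t : t + P] for t in range(len(x))])

# Causal: pad on the left only, so output[t] sees x[t-2 : t+1] —
# no future samples are needed.
x_c = np.pad(x, (P - 1, 0))
y_c = np.array([k @ x_c[t : t + P] for t in range(len(x))])
```

Both variants preserve the sequence length; only the receptive field's position relative to the current sample changes.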

4. Dataset:

The authors evaluate the proposed system on two-speaker and three-speaker speech separation using the WSJ0-2mix and WSJ0-3mix datasets. 30 hours of training data and 10 hours of validation data were generated from these datasets. Mixtures are created by randomly selecting utterances from different speakers in the Wall Street Journal (WSJ0) corpus and mixing them at random signal-to-noise ratios (SNRs) between -5 dB and 5 dB. A 5-hour evaluation set was generated in the same manner, and all waveforms were resampled to 8 kHz.
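The mixing procedure described above can be sketched as follows: scale one utterance relative to the other so their power ratio equals a random SNR drawn from [-5, 5] dB, then sum. Random noise stands in here for the WSJ0 utterances.

```python
import numpy as np

rng = np.random.default_rng(42)
s1 = rng.standard_normal(8000)         # 1 s at 8 kHz, stand-in for speaker 1
s2 = rng.standard_normal(8000)         # stand-in for speaker 2

snr_db = rng.uniform(-5.0, 5.0)        # target SNR of s1 relative to s2

# Scale s2 so that 10*log10(power(s1) / power(alpha*s2)) == snr_db.
p1 = np.mean(s1 ** 2)
p2 = np.mean(s2 ** 2)
alpha = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
mixture = s1 + alpha * s2

achieved = 10 * np.log10(p1 / np.mean((alpha * s2) ** 2))
```

Repeating this draw over random utterance pairs yields the 30 h / 10 h / 5 h splits described above.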

5. Experimental results:

Performance on the WSJ0-2mix dataset:

Performance on the WSJ0-3mix dataset:

6. Conclusion:

Conv-TasNet represents an important step toward practical speech separation and opens many directions for future research into further improving its accuracy, speed, and computational cost, eventually making automatic speech separation a common and necessary feature of every speech processing technology.

Paper address: https://arxiv.org/pdf/1809.07454.pdf

Source code: https://github.com/naplab/Conv-TasNet

                                                                                                                                          2022.3.28


 


Origin blog.csdn.net/qq_41893773/article/details/123808305