AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (Notes)


Paper: AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

 

Code: github address

AutoVC performs well on the traditional many-to-many voice conversion task with non-parallel data, and it can also perform zero-shot voice conversion, i.e., converting to a voice style never heard during training.

The whole conversion process consists of three steps, sketched below:
(1) Audio -> mel spectrogram
(2) Use the AutoVC model to convert the mel spectrogram
(3) Use WaveNet to convert the mel spectrogram back to audio
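A minimal sketch of this pipeline is shown below; the function and module names (wav_to_mel, autovc, wavenet_vocoder) are placeholders rather than the repository's actual API.

```python
import torch

# Placeholder pipeline for the three steps above; wav_to_mel, autovc and
# wavenet_vocoder are stand-ins for the corresponding components.
def convert(source_wav, target_wav, wav_to_mel, autovc, wavenet_vocoder):
    mel_src = wav_to_mel(source_wav)          # (1) audio -> mel spectrogram
    mel_tgt = wav_to_mel(target_wav)          #     any utterance of the target speaker
    with torch.no_grad():
        mel_conv = autovc(mel_src, mel_tgt)   # (2) AutoVC converts the mel spectrogram
        wav_conv = wavenet_vocoder(mel_conv)  # (3) WaveNet vocoder: mel -> waveform
    return wav_conv
```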

Network structure

There are currently two commonly used VC approaches, GANs and VAEs, but GANs are difficult to train, and VAEs do not guarantee distribution matching, so their conversion outputs are often over-smoothed. This paper aims to combine the advantages of both. AutoVC follows the autoencoder framework and is trained only with an autoencoder loss, but it introduces carefully tuned dimensionality reduction and time downsampling to constrain the information flow.

The entire network includes three modules:

Content encoder Ec: extracts the speech content embedding

Speaker encoder Es: produces the speaker style embedding

Decoder D: generates the converted mel spectrogram from the outputs of Ec and Es

At conversion time, the mel spectrogram of the source utterance is fed to Ec, and the mel spectrogram of any utterance by the target speaker is fed to Es; the decoder D then outputs the converted mel spectrogram.
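In the paper's notation, the conversion step therefore amounts to

$$\hat{X}_{1 \rightarrow 2} = D\big(E_c(X_1),\ E_s(X_2')\big)$$

where X1 is the source utterance and X2' is any utterance of the target speaker.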

During training, the source speech X1 is input to Ec, and another utterance of the same speaker is input to Es. The content encoder and decoder are trained to minimize the self-reconstruction error. Throughout the paper, Es is assumed to be pre-trained, so "training" refers to training Ec and D only. The input of the content encoder is X1, while the input of the style encoder is a different utterance of the same speaker U1, denoted X1'. The training loss is a weighted combination of the self-reconstruction error and the content-code reconstruction error:
$$\min_{E_c(\cdot),\, D(\cdot,\cdot)} L = L_{\text{recon}} + \lambda L_{\text{content}}$$

$$L_{\text{recon}} = \mathbb{E}\left[\left\| \hat{X}_{1 \rightarrow 1} - X_1 \right\|_2^2\right]$$

$$L_{\text{content}} = \mathbb{E}\left[\left\| E_c\left(\hat{X}_{1 \rightarrow 1}\right) - C_1 \right\|_1\right]$$
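As a rough illustration, the objective above could be computed as follows in PyTorch; Ec, Es and D stand for the content encoder, the frozen pre-trained speaker encoder, and the decoder, and the exact wiring of the speaker embedding into Ec is an assumption based on the description in this post, not the official implementation.

```python
import torch.nn.functional as F

# Rough sketch of the AutoVC training objective above.
def autovc_loss(Ec, Es, D, X1, X1_prime, lam=1.0):
    emb_src = Es(X1)            # speaker embedding of the source utterance
    emb_same = Es(X1_prime)     # embedding of another utterance of the same speaker
    C1 = Ec(X1, emb_src)                     # content code C1
    X1_hat = D(C1, emb_same)                 # self-reconstruction X_hat_{1->1}
    L_recon = F.mse_loss(X1_hat, X1)         # squared L2 reconstruction error
    L_content = F.l1_loss(Ec(X1_hat, emb_src), C1)  # L1 content-code error
    return L_recon + lam * L_content
```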

Speaker encoder

Es consists of two stacked LSTM layers with 768 units each. Only the output at the last time step is kept and projected to 256 dimensions by a fully connected layer, so the final speaker style embedding is a 256x1 vector. The speaker encoder is pre-trained with the GE2E loss, which maximizes the embedding similarity between different utterances of the same speaker and minimizes the similarity between different speakers.

In the experiments, Es is pre-trained on the VoxCeleb1 and LibriSpeech datasets.
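A minimal PyTorch sketch of the speaker encoder described above might look like this; the unit-norm output is an assumption that follows common GE2E practice, not something stated in this post.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the speaker encoder: two 768-unit LSTM layers, with the last
# time step projected to a 256-dim speaker embedding.
class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=768, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel):                # mel: (batch, time, n_mels)
        out, _ = self.lstm(mel)
        emb = self.proj(out[:, -1, :])     # keep only the last output
        return F.normalize(emb, dim=-1)    # unit-norm 256-dim embedding (assumed)
```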

Content encoder

The input X1 of Ec is an 80-dimensional mel spectrogram, with the speaker embedding Es(X1) concatenated to it at each time step. The concatenated features are fed into three 5x1 convolutional layers with 512 channels, each followed by batch normalization and ReLU activation. The output is then passed to a stack of two bidirectional LSTM layers. Both the forward and backward cells have dimension 32, so their combined dimension is 64. The forward and backward outputs are downsampled in time by a factor of 32 to form the content codes C1→ and C1←.
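A rough PyTorch sketch of this content encoder is given below; the exact downsampling indices and the assumption that the number of frames T is a multiple of 32 are implementation details not spelled out in this post.

```python
import torch
import torch.nn as nn

# Sketch of the content encoder: conv stack + bidirectional LSTM,
# with the outputs downsampled in time by a factor of 32.
class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, emb_dim=256, channels=512, lstm_dim=32):
        super().__init__()
        layers, in_ch = [], n_mels + emb_dim          # mel frames + speaker embedding
        for _ in range(3):                            # three 5x1 conv layers, 512 channels
            layers += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(channels), nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.lstm = nn.LSTM(channels, lstm_dim, num_layers=2,
                            batch_first=True, bidirectional=True)

    def forward(self, mel, spk_emb):                  # mel: (B, T, 80), spk_emb: (B, 256)
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        x = torch.cat([mel, spk], dim=-1).transpose(1, 2)   # (B, 80 + 256, T)
        x = self.convs(x).transpose(1, 2)                    # (B, T, 512)
        out, _ = self.lstm(x)                                # (B, T, 64), fwd + bwd halves
        half = out.size(-1) // 2
        C_fwd = out[:, 31::32, :half]                        # forward codes, downsampled by 32
        C_bwd = out[:, ::32, half:]                          # backward codes, downsampled by 32
        return C_fwd, C_bwd
```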

Decoder

The outputs of Es and Ec are upsampled to restore the original time resolution. Formally, the upsampled features U→ and U← are

$$U_{\rightarrow}(:, t) = C_{1 \rightarrow}\left(:, \lfloor t/32 \rfloor\right), \qquad U_{\leftarrow}(:, t) = C_{1 \leftarrow}\left(:, \lfloor t/32 \rfloor\right)$$
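The copy upsampling above can be sketched in a couple of lines, assuming the content codes are tensors of shape (batch, T/32, dim):

```python
# Copy upsampling, matching the formula above:
# U->(:, t) = C1->(:, floor(t/32)), and likewise for the backward codes.
def upsample_codes(C_fwd, C_bwd, factor=32):
    U_fwd = C_fwd.repeat_interleave(factor, dim=1)   # (B, T/32, dim) -> (B, T, dim)
    U_bwd = C_bwd.repeat_interleave(factor, dim=1)
    return U_fwd, U_bwd
```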
The upsampled features are then concatenated and fed into three 5x1 convolutional layers with 512 channels, each followed by batch normalization and ReLU, and then into three LSTM layers with 1024 units. Finally, the LSTM output is passed through a 1x1 convolutional layer that projects to dimension 80.
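A corresponding sketch of the decoder stack follows, using the layer sizes from this post; the remaining details are assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

# Sketch of the decoder: conv stack + 3-layer LSTM + 1x1 output convolution.
class Decoder(nn.Module):
    def __init__(self, code_dim=64, emb_dim=256, channels=512, n_mels=80):
        super().__init__()
        layers, in_ch = [], code_dim + emb_dim        # upsampled codes + speaker embedding
        for _ in range(3):                            # three 5x1 conv layers, 512 channels
            layers += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(channels), nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.lstm = nn.LSTM(channels, 1024, num_layers=3, batch_first=True)
        self.out = nn.Conv1d(1024, n_mels, kernel_size=1)   # 1x1 conv to 80 mel bins

    def forward(self, U, spk_emb):                    # U: (B, T, 64), spk_emb: (B, 256)
        spk = spk_emb.unsqueeze(1).expand(-1, U.size(1), -1)
        x = torch.cat([U, spk], dim=-1).transpose(1, 2)
        x = self.convs(x).transpose(1, 2)             # (B, T, 512)
        x, _ = self.lstm(x)                           # (B, T, 1024)
        return self.out(x.transpose(1, 2)).transpose(1, 2)  # (B, T, 80) mel spectrogram
```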

Vocoder

The vocoder converts the resulting mel spectrogram into speech.

A pre-trained WaveNet vocoder is used, which includes 4 deconvolution (transposed convolution) layers. In the experiments, the frame rate of the mel spectrogram is 62.5 Hz and the sampling rate of the speech waveform is 16 kHz, so the deconvolution layers upsample the mel spectrogram to match the sampling rate of the waveform. A standard 40-layer WaveNet, conditioned on the upsampled spectrogram, then generates the speech waveform.
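A quick sanity check on these numbers: each mel frame must be expanded to 16000 / 62.5 = 256 waveform samples, which 4 transposed-convolution layers can achieve with, for example, a stride of 4 each (the per-layer strides are an assumption, not stated in this post).

```python
# Each 62.5 Hz mel frame must be expanded to 16000 / 62.5 = 256 waveform samples;
# four transposed-convolution layers with stride 4 each (assumed) give 4**4 = 256.
sample_rate = 16_000   # Hz, waveform sampling rate
frame_rate = 62.5      # Hz, mel-spectrogram frame rate
assert sample_rate / frame_rate == 256 == 4 ** 4
```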

Experiments

Two variants of AutoVC are implemented in the paper: one uses the speaker style vector generated by Es, and the other, AutoVC-one-hot, uses a one-hot encoding of each speaker as the speaker style vector.

MOS results show that AUTOVC outperforms existing non-parallel conversion systems in terms of naturalness. In terms of similarity, AUTOVC is also better than the baseline. Note that the baseline shows a significant drop when moving from same-gender to cross-gender conversion, whereas AUTOVC does not show this drop. Finally, there is no significant difference between AUTOVC and AUTOVC-one-hot, indicating that the performance gain of AUTOVC is not due to the use of the speaker encoder.
