SLT2021: HOW FAR ARE WE FROM ROBUST VOICE CONVERSION: A SURVEY

0. Title

HOW FAR ARE WE FROM ROBUST VOICE CONVERSION: A SURVEY


1. Summary

In recent years, voice conversion technology has improved greatly with the help of deep learning, but its ability to produce natural-sounding speech under different conditions is still unclear. In this paper, we conduct an in-depth study of the robustness of well-known VC models. We also modify these models, for example by replacing the speaker embedding, to further improve their performance. We find that sampling rate and audio duration greatly affect voice conversion. None of the VC models performs well on unseen speakers, but AdaIN-VC is relatively more robust. Moreover, a speaker embedding jointly trained with the VC model is more suitable for voice conversion than one obtained from a network trained for speaker identification.

Keywords: voice conversion, speaker verification, speaker identification, speaker representation, speaker embedding, network robustness

2. Introduction

Voice conversion (VC) technology aims to convert the speaker characteristics of an utterance into those of a target speaker while preserving the linguistic content. Earlier work required a parallel multi-speaker corpus to realize VC. Recently, several models using non-parallel data have been proposed [1,2,3,4,5]. DGAN-VC [6] disentangles content and speaker information through adversarial training. StarGAN-VC [7] uses conditional input to achieve many-to-many voice conversion. However, both are limited to VC between speakers seen during training. More recently, zero-shot methods [8, 9, 10, 11, 12] have been proposed, in which the model can perform VC between utterances of arbitrary speakers without fine-tuning. AdaIN-VC [13] applies instance normalization to achieve timbre conversion for arbitrary speakers. The other, AUTOVC [14], uses a pre-trained d-vector [15] together with an information-bottleneck technique.

 

Although VC technology has become more and more powerful recently, most papers train and evaluate their VC models on the same corpus (usually the VCTK corpus [16]), so test sentences for both seen and unseen speakers have similar recording conditions. In real applications, however, the recording conditions of input utterances may be completely different from the training data. Are today's VC models robust enough for practical use? The answer may be no. [17] successfully mounted an adversarial attack on a VC model, which suggests that today's VC models may still not be robust enough. But we do not know how robust they actually are, or which mismatched attributes affect them.

 

This may be the first paper to study the robustness of VC models. We investigate the robustness of three popular VC models on five commonly used datasets from the following three aspects:

  • How the models handle (1) different sampling rates, (2) different total durations, and (3) even unseen languages
  • Ablation studies to find out which modules of these models are critical for voice conversion
  • Examination of speaker embeddings to determine which embedding is most suitable for VC

3. Other details (easy to understand)

Here, we first introduce the VC models and speaker embeddings involved in this paper. We investigate models that disentangle speaker timbre and content information: DGAN-VC [6], AdaIN-VC [13] and AUTOVC [14]. All of these models are publicly available, so they can be easily reproduced. For speaker embeddings, we investigate the i-vector [18], d-vector [15] and x-vector [19], all of which perform well in speaker verification, and further introduce a new embedding, the v-vector.

 

DGAN-VC [6] uses a two-stage training framework for many-to-many voice conversion. Its speaker encoder Es provides a one-hot vector as the speaker embedding to represent different speakers. In the first stage, an additional classifier takes the content representation as input and is trained adversarially, so that the linguistic information becomes independent of the speaker. In the second stage, a Generative Adversarial Network (GAN) [20] is used to make the output spectrum more realistic.
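To make the adversarial first stage concrete, below is a minimal, hypothetical PyTorch sketch of the general idea: a speaker classifier sits on top of the content code and its gradient is reversed into the encoder. Note that the actual DGAN-VC implementation alternates classifier and encoder updates rather than using a gradient reversal layer, and the dimensions here are invented for illustration.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


class AdversarialSpeakerClassifier(nn.Module):
    """Tries to predict the speaker from the content code; because of the reversed
    gradient, the content encoder is pushed to remove speaker information."""

    def __init__(self, content_dim=128, n_speakers=20, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(
            nn.Linear(content_dim, 256), nn.ReLU(), nn.Linear(256, n_speakers)
        )

    def forward(self, content_code):
        return self.net(GradReverse.apply(content_code, self.lamb))
```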

 

AdaIN-VC [13] uses a variational autoencoder [21] (VAE) for the content encoder, where the latent representation is constrained by a KL-divergence loss, and uses adaptive instance normalization [22] (AdaIN) to achieve zero-shot VC. The instance normalization applied in the content encoder Ec removes timbre information while retaining content information, and the speaker encoder Es then provides timbre information to the decoder through the AdaIN layers.
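A rough sketch of the two normalization operations is given below, assuming feature maps of shape (batch, channels, time) and target statistics of shape (batch, channels, 1) predicted by the speaker encoder; this is only an illustration of the idea, not the paper's exact layers.

```python
import torch


def instance_norm(x, eps=1e-5):
    # x: (batch, channels, time); remove the per-channel mean/std over time,
    # which carry global (speaker-dependent) characteristics
    mean = x.mean(dim=2, keepdim=True)
    std = x.std(dim=2, keepdim=True)
    return (x - mean) / (std + eps)


def adain(content, target_mean, target_std, eps=1e-5):
    # target_mean / target_std: (batch, channels, 1), predicted by the speaker encoder Es;
    # re-scale the normalized content to inject the target speaker's timbre
    return instance_norm(content, eps) * target_std + target_mean
```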

 

AUTOVC [14] is a carefully designed autoencoder whose information bottleneck is trained only with a self-reconstruction loss. The information bottleneck is the dimensionality of the latent vector between Ec and D. With this bottleneck, linguistic content and timbre information are disentangled without explicit constraints, while the latter (timbre information) is provided to the decoder by a pre-trained d-vector.
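For intuition, here is a minimal, hedged PyTorch sketch of the bottleneck-autoencoder idea; AUTOVC's real network also downsamples the content code in time and uses much larger recurrent and convolutional blocks, so treat this purely as a toy version with invented dimensions.

```python
import torch
from torch import nn


class BottleneckAutoencoderVC(nn.Module):
    """Toy illustration of the AUTOVC idea: a very narrow content code plus an
    externally supplied speaker embedding, trained with reconstruction loss only."""

    def __init__(self, n_mels=80, content_dim=8, spk_dim=256):
        super().__init__()
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)  # Ec
        self.decoder = nn.GRU(content_dim + spk_dim, 512, batch_first=True)   # D
        self.out = nn.Linear(512, n_mels)

    def forward(self, mel, spk_emb):
        # mel: (batch, frames, n_mels); spk_emb: (batch, spk_dim), e.g. a d-vector
        content, _ = self.content_encoder(mel)                  # narrow bottleneck
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)  # broadcast over time
        hidden, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return self.out(hidden)


# Training uses only self-reconstruction, e.g.
#   loss = torch.nn.functional.mse_loss(model(mel, source_spk_emb), mel)
# Conversion swaps in the target speaker's embedding at inference time.
```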

 

In i-vector [18], the speaker- and utterance-dependent GMM supervector M is modeled as

M = m + Tw

where m is the speaker-independent GMM supervector and T is the low-rank total variability matrix, which captures both speaker and utterance variability. The i-vector is defined as the maximum a posteriori (MAP) estimate of w.
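Under the simplifying assumption that the whole supervector is directly observed with diagonal noise covariance (real i-vector extraction works from per-component Baum-Welch statistics instead), the MAP estimate of w has a closed form; the sketch below is only meant to make the equation concrete.

```python
import numpy as np


def map_ivector(M, m, T, noise_var):
    """MAP estimate of w for the simplified model M = m + T @ w + noise,
    with prior w ~ N(0, I) and diagonal noise covariance `noise_var`.

    M, m:       (D,) supervectors
    T:          (D, R) low-rank total variability matrix
    noise_var:  (D,) diagonal noise variances
    """
    inv_var = 1.0 / noise_var
    precision = np.eye(T.shape[1]) + T.T @ (inv_var[:, None] * T)  # posterior precision
    return np.linalg.solve(precision, T.T @ (inv_var * (M - m)))   # i-vector, shape (R,)
```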

 

The d-vector [15] uses a DNN as the speaker feature extractor, trained on a speaker identification task or with the GE2E loss [23]. The output of the last hidden layer is taken as a compact (non-sparse, non-one-hot) representation of the speaker, called the d-vector.

 

The x-vector [19] uses a Time Delay Neural Network (TDNN) to learn temporal context information. It is trained on a speaker identification task, and the output of the last hidden layer is used as the speaker representation, called the x-vector.
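A TDNN layer is essentially a dilated 1-D convolution over frames. The sketch below shows the overall shape of an x-vector-style extractor (frame-level dilated convolutions, statistics pooling, then an embedding layer), with layer sizes chosen arbitrarily rather than copied from [19].

```python
import torch
from torch import nn


class TinyXVectorNet(nn.Module):
    """Minimal TDNN-style extractor: dilated 1-D convolutions over time,
    mean+std statistics pooling, then a linear layer whose output serves
    as the speaker embedding."""

    def __init__(self, feat_dim=30, emb_dim=512):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 512, emb_dim)  # mean + std doubles the width

    def forward(self, feats):
        # feats: (batch, feat_dim, frames), e.g. MFCCs or filterbanks
        h = self.frame_layers(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        return self.embedding(stats)
```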

 

In addition, the v-vector is a new concept introduced in this paper, defined as the representation produced by a speaker encoder trained jointly with the VC model; in practice, its architecture follows the speaker encoder of AdaIN-VC, pre-trained on VCTK.

 

 

We consider two objective metrics: character error rate (CER) and speaker verification acceptance rate (SVAR). SVAR is the ratio of the number of utterances accepted by the speaker verification system to the total number of utterances. A lower CER indicates better-preserved content, and a higher SVAR indicates more successful timbre conversion. Automatic speech recognition (ASR) measures how well the linguistic information of the source utterance is retained in the converted speech. We use the speech-to-text service of the Google Cloud Speech API to compute CER. For Chinese, CER is computed on pinyin.
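As a reference, CER here is simply the Levenshtein (edit) distance between the recognized and ground-truth character sequences, normalized by the reference length. The helper below is a generic sketch, not the paper's evaluation script (for Chinese, both strings would first be converted to pinyin, e.g. with a romanization library).

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate = edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # dynamic-programming table for the Levenshtein distance
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)


print(cer("voice conversion", "voice converzion"))  # one substitution -> 0.0625
```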

 

Speaker verification (SV) measures whether the converted speech belongs to the speaker who provides the timbre information in VC. We use a third-party pre-trained speaker encoder to extract a speaker embedding from each utterance. If the cosine similarity between the embeddings of the converted utterance and the target utterance (the one providing the timbre information in VC) is greater than a given threshold, the conversion is considered successful. We obtain the threshold by computing the Equal Error Rate (EER) on the test datasets. For each speaker, we randomly sample 128 positive and 128 negative pairs, for more than 100k trials in total. The EER is 5.65% and the threshold is 0.6597.
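The acceptance decision boils down to a cosine-similarity threshold on speaker embeddings. The snippet below is a hypothetical sketch of how SVAR could be computed, reusing the 0.6597 threshold reported above; the embeddings would come from the third-party pre-trained speaker encoder.

```python
import numpy as np


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def svar(converted_embs, target_embs, threshold=0.6597):
    """Speaker verification acceptance rate: fraction of converted utterances whose
    embedding is closer to the target speaker's embedding than the EER threshold."""
    accepted = sum(
        cosine_similarity(c, t) > threshold
        for c, t in zip(converted_embs, target_embs)
    )
    return accepted / len(converted_embs)
```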

 

The AUTOVC, AdaIN-VC and DGAN-VC models used here are obtained from their official implementations and are all trained on VCTK. In their original implementations, DGAN-VC and AdaIN-VC used the Griffin-Lim algorithm [28] to generate waveforms from spectrograms, while AUTOVC used WaveNet [29] as the vocoder. To eliminate the influence of different vocoders, we modified each model to output 80-dimensional mel spectrograms and used a pre-trained MelGAN [30] vocoder to convert these mel spectrograms into waveforms. Since DGAN-VC is limited to performing VC between the speakers seen during training, we replace its speaker embedding with the speaker encoder architecture used in AdaIN-VC so that it can perform zero-shot VC.
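For reference, extracting an 80-dimensional log-mel spectrogram for a MelGAN-style vocoder might look like the sketch below; the STFT parameters (sampling rate, n_fft, hop length) are assumptions and must match whatever the pre-trained MelGAN checkpoint actually expects.

```python
import librosa
import numpy as np


def log_mel_spectrogram(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load audio and compute an 80-bin log-mel spectrogram (shape: n_mels x frames)."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(np.clip(mel, 1e-5, None))
```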

 

For the speaker embeddings, the i-vector and x-vector were obtained from Kaldi [31] pre-trained systems trained on VoxCeleb1 [32] and VoxCeleb2 [33]. For the d-vector, we used the pre-trained d-vector model of AUTOVC, trained on VoxCeleb1 and LibriSpeech. For the v-vector, we use the pre-trained speaker encoder from AdaIN-VC, trained on VCTK.

 

In general, AdaIN-VC and AUTOVC perform better than DGAN-VC, and each has its own advantages. These two models are studied further:

  1. We trained AUTOVC with its speaker encoder replaced by the v-vector encoder trained in AdaIN-VC (AUTOVC-Vvec)
  2. We added the VAE architecture to AUTOVC (AUTOVC-VAE)
  3. We integrated the AdaIN layers into the AUTOVC decoder (AUTOVC-AdaIN)
  4. We replaced the VAE architecture in AdaIN-VC with a plain autoencoder (AdaIN-VC-AE)

Conclusions:

  • In terms of SVAR on all datasets, AUTOVC-Vvec performs better than AUTOVC. However, AUTOVC-Vvec shows no significant improvement in CER over AUTOVC, so it is still not robust enough
  • AdaIN-VC-AE and AUTOVC-AdaIN perform poorly on both CER and SVAR, which shows that the AdaIN layer alone is not enough to convert speaker timbre. Among the AUTOVC modifications, AUTOVC-VAE performs worst in terms of CER. Both the small size of the latent vector and the VAE architecture limit the information that the content encoder can provide to the decoder, so the poor results of AUTOVC-VAE may be due to insufficient content information caused by combining the narrow latent vector with the VAE architecture (details omitted)

 

How different speaker embeddings affect VC performance:

We trained the models with three different pre-trained speaker embeddings, the i-vector, d-vector and x-vector, denoted -I, -D and -X respectively. Models whose speaker encoder is jointly trained with the VC model are denoted -Joint. In addition, we trained AUTOVC with its speaker encoder replaced by the v-vector encoder trained in AdaIN-VC (AUTOVC-Vvec).

 

We can see that AdaIN-VC-Joint performs best among all AdaIN-VC modifications, and the same holds for AUTOVC-Vvec among all AUTOVC modifications. Models trained with the d-vector or x-vector cannot convert the speaker's timbre well. The model trained with the i-vector converts timbre well, but its CER is not satisfactory. Compared with the other pre-trained speaker embeddings, the v-vector is trained on far less data (VCTK versus VoxCeleb), yet it still performs better than they do. Finally, AUTOVC-Joint fails, with extremely high CER and low SVAR in every case.

 

Important conclusion:

The results show that, although the pre-trained speaker embeddings are trained on far larger corpora, they do not work as well for VC as they do on their original task (automatic speaker verification). A speaker embedding jointly trained with the VC model is more suitable for voice conversion. The results also show that VC and speaker verification require different speaker information. At the same time, unlike one-hot training, AUTOVC needs a pre-trained speaker embedding to help the content encoder extract only content information.

 

4. Other details (not easy to understand)

  • Specific model structure
  • Specific experimental results

 


Origin blog.csdn.net/u013625492/article/details/112983074