SLT2021: OPTIMIZING VOICE CONVERSION NETWORK WITH CYCLE CONSISTENCY LOSS OF SPEAKER IDENTITY

0. Title

OPTIMIZING VOICE CONVERSION NETWORK WITH CYCLE CONSISTENCY LOSS OF SPEAKER IDENTITY

1. Abstract

We propose a novel training scheme to optimize a voice conversion network with a cycle consistency loss of speaker identity. The training scheme not only minimizes the frame-level spectral loss, but also minimizes a speaker identity loss: we introduce a cycle consistency loss that constrains the converted speech to keep the same speaker identity as the reference speech at the utterance level. Although the proposed training scheme is applicable to any voice conversion network, this study is carried out under the average-model voice conversion framework described in the paper. Experiments conducted on the CMU-ARCTIC and CSTR-VCTK corpora confirm that the proposed method outperforms the baseline in terms of speaker similarity.
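For intuition, the overall objective can be written as a weighted sum of the two losses. This is a sketch under assumptions: the weight $\lambda$ and the cosine distance are illustrative choices, not values taken from the paper.

$$
\mathcal{L} = \mathcal{L}_{\text{spec}} + \lambda \, \mathcal{L}_{\text{cycle}},
\qquad
\mathcal{L}_{\text{cycle}} = 1 - \cos\big(E(\hat{X}),\ e\big)
$$

where $\hat{X}$ is the converted speech, $E(\cdot)$ is the pre-trained speaker embedding extractor, and $e$ is the speaker embedding of the reference speech.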

Keywords: voice conversion, cycle consistency loss, speaker embedding

2. Introduction

The purpose of voice conversion (VC) [1] is to modify the speech signal of a source speaker so that it sounds like the voice of a target speaker while retaining the linguistic information. The technology has a variety of applications, such as emotion conversion, voice transformation, personalized text-to-speech synthesis, movie dubbing, and other entertainment uses. A voice conversion pipeline usually consists of multiple components: feature extraction, feature conversion, and speech generation. In this work, we focus on feature conversion. Many studies are devoted to converting spectral features between a specific source-target speaker pair, for example with Gaussian mixture models (GMM) [2, 3, 4, 5], frequency warping [6, 7, 8, 9], exemplar-based methods [10, 11, 12], deep neural networks (DNN) [13, 14, 15], and long short-term memory (LSTM) networks [16].

To benefit from publicly available voice data and reduce the amount of target data required, average-model-based methods have been proposed. Instead of training a conversion model for the target speaker from scratch, a general model is trained on a multi-speaker database and then adapted to the target speaker with a small amount of target data [17, 18, 19]; this is called the average model approach. Alternatively, some other studies use speaker vectors, such as one-hot vectors, i-vectors, or speaker embeddings, as auxiliary inputs to control the speaker identity. Since a one-hot speaker vector is only applicable to the closed set of training speakers, for example in variational autoencoders (VAE) [20, 21], the i-vector is a better speaker representation for unseen speakers [22]. There are further studies on speaker embedding techniques for voice conversion [23, 24, 25, 26, 27].

Although progress has been made, the similarity between the converted speech and the target speaker still leaves room for improvement with the above techniques [28]. One reason is that these methods attempt to minimize the difference between converted features and target features in the acoustic feature space, which has no direct relationship with speaker identity. To further improve the speaker similarity between the converted speech and the target speech, recent studies have proposed using a perceptual loss as a feedback constraint for speech synthesis. In [29], a feedback constraint in the speaker embedding space is used for speech synthesis. [30] proposes a "verification-to-synthesis" framework, in which the VC model is trained through an automatic speaker verification (ASV) network (discussed with Liang Shuang).

In this paper, we introduce a cycle consistency loss in the speaker embedding space to enhance speaker identity conversion for the average-model VC method. In the proposed method, speaker-independent phonetic posteriorgrams (PPGs) [31] are used to represent the content information, and a speaker embedding extracted by a pre-trained speaker embedding extractor is used to control the identity of the generated speech. To ensure that the generated speech retains the target speaker's identity, the cycle consistency loss encourages the speaker embedding of the converted speech to match the input speaker embedding.
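A minimal PyTorch-style sketch of this loss follows. The function names, the cosine distance, and the weight `lam` are illustrative assumptions; the sketch also assumes the frozen speaker encoder can consume the converted features directly, whereas in practice a differentiable feature bridge may be needed between the conversion output and the encoder input.

```python
import torch.nn.functional as F

def speaker_cycle_loss(converted_feats, ref_embedding, speaker_encoder):
    """Utterance-level speaker identity loss (sketch).

    speaker_encoder is the pre-trained speaker embedding extractor;
    its parameters are frozen, but gradients still flow through it
    back into the conversion network.
    """
    conv_embedding = speaker_encoder(converted_feats)
    # Cosine distance between converted-speech and reference embeddings.
    return 1.0 - F.cosine_similarity(conv_embedding, ref_embedding, dim=-1).mean()

def training_step(conversion_net, speaker_encoder, ppg, ref_embedding,
                  target_feats, lam=1.0):
    # Frame-level conversion: PPG + speaker embedding -> spectral features.
    pred_feats = conversion_net(ppg, ref_embedding)
    spec_loss = F.l1_loss(pred_feats, target_feats)  # frame-level spectrum loss
    cyc_loss = speaker_cycle_loss(pred_feats, ref_embedding, speaker_encoder)
    return spec_loss + lam * cyc_loss                # lam is a hypothetical weight
```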

3. Other Notes (for easier understanding)

A popular average modeling approach (AMA) uses speaker-independent PPG features [17, 18, 32] as the content representation, which allows publicly available data from multiple speakers to be used for voice conversion modeling. The conversion model takes PPG features as input and generates mel-cepstral coefficients (MCCs). Figure 1(a) shows the training process of an average model without a speaker embedding as input; in the adaptation phase, a small amount of target data is used to fine-tune the conversion model. Figure 1(b) shows the training process of an average conversion model with a speaker embedding as input. Unlike Figure 1(a), Figure 1(b) uses the speaker embedding to control speaker identity. During adaptation, the speaker embedding is extracted from the target speech, and the model is likewise fine-tuned with the target data.
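As a concrete illustration of the Figure 1(b) setup, here is a minimal PyTorch-style sketch of a conversion network that maps PPG frames plus an utterance-level speaker embedding to MCC frames. The architecture, class name, and layer sizes are illustrative assumptions, not the paper's actual network.

```python
import torch
import torch.nn as nn

class AverageModelVC(nn.Module):
    """Sketch: PPG frames + speaker embedding -> MCC frames."""

    def __init__(self, ppg_dim=144, spk_dim=256, mcc_dim=40, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(ppg_dim + spk_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, mcc_dim)

    def forward(self, ppg, spk_emb):
        # ppg: (batch, frames, ppg_dim); spk_emb: (batch, spk_dim).
        # Broadcast the utterance-level embedding to every frame.
        spk = spk_emb.unsqueeze(1).expand(-1, ppg.size(1), -1)
        out, _ = self.rnn(torch.cat([ppg, spk], dim=-1))
        return self.proj(out)
```

Adaptation then amounts to fine-tuning this same network on the small amount of target data, with the speaker embedding extracted from the target speech, as described above.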
