Voice Conversion: A Detailed Explanation (Speech Signal Processing Learning Notes, Part 6)

References:

[1] Voice Conversion (1/2), Bilibili

[2] Li Hongyi's Human Language Processing (March 2020), lecture notes on Voice Conversion (Part 10), Zhihu (zhihu.com)

[3] Voice Conversion (2/2), Bilibili

[4] Li Hongyi's Human Language Processing (March 2020), lecture notes on StarGAN in VC (Part 11), Zhihu (zhihu.com)

[5] Prof. Li Hongyi, National Taiwan University, Flow-based Generative Model, Bilibili

The papers cited in these lectures are omitted here

Table of contents

1. Introduction to Voice Conversion

2. Starting point (problem to be solved)

3. Classification of corpora

4. Specific implementation techniques (Non-parallel Data)

Feature Disentangle

Feature Disentangle improvements

Direct Transformation

CycleGAN

StarGAN

Comparison of Feature Disentangle and Direct Transformation

Supplement


1. Introduction to Voice Conversion

  • What is the task of VC:

    • Input a piece of sound and output another piece of sound.

    • The output sound is the same in content as the input, but the timbre has changed.

    • It's like Conan's bow tie voice changer.

  • Why is it useful (applications):

    • Change Speaker:

      • The same content has a different effect when spoken by different people

      • It can be used to fool listeners

      • Can create Personalized TTS (Text-to-Speech), which is a personalized speech synthesis system

      • You can also convert singing voices

      • Can protect personal privacy (voice change)

    • Change Speaking Style:

      • Emotional changes in speech

      • Normal-to-Lombard conversion. Lombard speech is the speaking style people naturally adopt in noisy environments: compared with normal speech it is clearer, louder, and contains more high-frequency components, so it is easier to hear and understand in noise.

      • Convert whispers to normal speech (Whisper-to-Normal)

      • Conversion of a singer's vocal technique, such as adding lip trills and vibrato

    • Can enhance Intelligibility (comprehensibility)

      • Convert sounds that are difficult for ordinary people to understand into sounds that are easy to understand. This could help people with vocal organ damage

      • It can be used for accent correction or conversion, or in language learning, to convert a teacher's spoken examples into your own voice.

    • It can be used for Data Augmentation:

      • Convert male and female voices to each other to diversify the speech recognition training set.

      • Convert clean speech into noisy speech to make a speech recognition system more robust, or denoise noisy speech to improve its accuracy.

2. Starting point (problem to be solved)

In actual deployment, there are usually two important issues:

  • In a real implementation, the input and output sequences of VC can have different lengths. If the lengths differ, we need a Seq2Seq model; however, much of the literature assumes the input and output have the same length, which allows relatively simple models to be used.

  • Also, VC models usually operate on Acoustic Features, i.e. a sequence of vectors, but what we ultimately need is an audio waveform. Converting those vectors back into audio requires its own technique (a Vocoder). A minimal sketch of this feature-to-waveform step follows.
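
To make the feature-to-waveform step concrete, here is a minimal sketch that uses librosa's Griffin-Lim based mel-spectrogram inversion as a stand-in for a vocoder. The file name `input.wav` and the frame parameters are illustrative assumptions; real VC systems typically use a neural vocoder (e.g. WaveNet or HiFi-GAN) instead of Griffin-Lim.

```python
# Minimal sketch: acoustic features (mel spectrogram) -> waveform.
import librosa
import soundfile as sf

# Load a waveform and compute a mel spectrogram, the kind of acoustic feature
# sequence a VC model would actually operate on. "input.wav" is a placeholder file.
wav, sr = librosa.load("input.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# ... a VC model would transform `mel` here (content kept, timbre changed) ...

# Convert the (possibly converted) mel spectrogram back to audio.
# Griffin-Lim only estimates phase, so quality is limited; neural vocoders
# are what real systems use.
wav_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("reconstructed.wav", wav_rec, sr)
```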

3. Classification of corpora

  • Parallel Data: paired speech data in which multiple speakers utter the same sentences. Such data is scarce, but we can work around this with:

    • Model Pre-training: use a pre-trained Seq2Seq model and then fine-tune it with a small amount of paired speech data.

    • Synthesized data: we can also use synthesized speech; for example, once the text content of the corpus is known, it can be fed to a TTS system such as Google's, and the synthesized "Google voice" provides additional paired utterances.

  • Non-parallel Data: unpaired speech from multiple speakers; the sentences they read may be different, and even the languages can differ.

    • In fact, this task can be called "audio style transfer", by analogy with image style transfer.

    • We can borrow some techniques from image style transfer. There are roughly two directions: Feature Disentangle and Direct Transformation.

4. Specific implementation techniques (Non-parallel Data)

We first consider the non-parallel setting, which can be approached in two ways: feature disentanglement and direct transformation. The figure below illustrates the general idea of Feature Disentangle.

Feature Disentangle

Speech actually contains many kinds of information: textual content, speaker identity, environment, emotion, and so on. If you want to replace the Speaker, you only need to replace the speaker information. The same technique can also be used beyond switching speakers, for example for emotion switching.

  • We can train two Encoders, one to extract content information and one to extract speaker information, and then train a Decoder to fuse the two kinds of information.

  • In this way, once training is complete, we only need to feed the Speaker Encoder speech from a different speaker and let the Decoder fuse it with the content.

  • Here comes the question: the training objective is only to make the decoder's fused output as close as possible to the input, so how do we force the two encoders to behave, one extracting only content information and the other only speaker information?

  • A simple approach is to drop the Speaker Encoder and represent the speaker with a one-hot vector. The drawback is obvious: the voice of a new speaker cannot be synthesized, because the dimensionality of the one-hot vector is fixed before training to the number of speakers in the data, and adding a new speaker requires retraining the whole Decoder.

  • Therefore, we go back to using a Speaker Encoder to extract speaker information. There are already many pre-trained models that can produce Speaker Embeddings.

  • So what about the Content Encoder? A direct approach is to use speech recognition: as shown in the figure, in a hybrid DNN-HMM system the deep neural network outputs the HMM's state (emission) probabilities, so it can naturally serve as a Content Encoder.

  • Another idea imitates GAN (Generative Adversarial Network): take the Content Encoder's output and feed it to a discriminator (a Speaker Classifier) whose job is to judge which speaker produced the utterance. The Content Encoder, acting as the generator, tries to fool the discriminator so that it cannot recognize the speaker, which forces the Content Encoder to strip out all speaker information. During training, the two are trained alternately.

    GAN, which stands for Generative Adversarial Network, is a deep learning model consisting of a generator and a discriminator. The goal of GAN is to learn the distribution of data through the adversarial game of generator and discriminator and generate synthetic data similar to real data.

    The role of the generator is to convert the input random noise signal into synthetic data samples, while the discriminator is responsible for evaluating whether the input data sample is real data or synthetic data generated by the generator. The two continuously optimize each other through repeated adversarial training processes to achieve the purpose of generating realistic synthetic data.

    The training process of GAN can be briefly summarized as follows:

    1. The generator receives a random vector as input, maps and transforms it through a series of neural network layers, and finally generates a synthetic data sample.

    2. The discriminator receives both real data and synthetic data generated by the generator as input, evaluates it through a series of neural network layers, and outputs a probability value representing the likelihood that the input data is real data.

    3. The generator and discriminator are trained alternately. First the generator is fixed and the discriminator's parameters are updated to minimize its classification error on both real data and data produced by the generator. Then the discriminator is fixed and the generator's parameters are updated to maximize the discriminator's error on the generated data.

    4. Repeat step 3 until the generator and discriminator reach an equilibrium, where the generator produces realistic synthetic data and the discriminator can no longer reliably distinguish real data from generated data.

  • In addition, we can also design the network architecture so that the two Encoders do their jobs properly. The technique originates from image style transfer: by adding Instance Normalization (IN) to the Content Encoder, speaker information can be removed.

  • How does IN achieve this? It acts on the vector sequence produced by the Encoder's filters over the acoustic input: each dimension (channel) of the sequence is normalized to zero mean and unit variance. For a Conv1D layer, each filter captures a particular pattern, so each row indicates whether a certain feature appears in the signal; roughly speaking, some filters respond to high-frequency features and some to low-frequency ones. When a male voice comes in, the low-frequency filters output larger values and the high-frequency filters smaller ones; for a female voice it is the other way around. After normalization, every filter's output has mean 0 and variance 1, which effectively erases these voice characteristics: no filter can produce an unusually large output that reveals the speaker's identity.

  • In fact, there are many ways to add the Encoder's Embedding to the Decoder, and you can delve into it yourself.

  • The Content Encoder's architecture is settled, but what about the Speaker Encoder? We add AdaIN (Adaptive Instance Normalization) to the Decoder and inject the Speaker Encoder's output through it, so that this input can only affect the timbre, not the content, of the speech.

  • How does AdaIN work? Inside the decoder there is also an IN layer that normalizes each channel of the decoder's hidden representation, again removing speaker-specific information; the speaker information is then injected from the Speaker Encoder. The Speaker Encoder's output is passed through two transforms to obtain the vectors γ and β. If the hidden embeddings after IN are Z1–Z4, we multiply them by γ (a scaling operation) and add β (a shift operation) to obtain Z′. This step is called Add Global, since γ and β act globally on the decoder's output. IN plus Add Global is exactly what we call AdaIN. The whole model is then trained end to end. (A minimal code sketch of IN and AdaIN follows this list.)

  • Does this architectural design actually work? The answer is yes. If we take a trained speaker discriminator and ask it which speaker the Content Encoder's output came from, its accuracy is indeed lower with IN than without IN. And if we visualize and cluster the embeddings produced by the Speaker Encoder, we find that it does separate different speakers (different colors in the figure represent different speakers), with a clear boundary between male and female voices, showing that the approach works quite well.
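
To make the IN / AdaIN idea above concrete, here is a minimal PyTorch sketch of the three modules. The layer sizes, kernel sizes, and module names are illustrative assumptions, not the architecture of any specific paper; the point is only where Instance Normalization sits and how the speaker embedding enters through the scale (γ) and shift (β).

```python
# A minimal sketch of the IN / AdaIN disentangling idea described above.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Instance norm (no affine params): forces every channel to zero mean /
        # unit variance over time, wiping out global, speaker-like statistics.
        self.inorm = nn.InstanceNorm1d(hidden, affine=False)

    def forward(self, mel):                 # mel: (batch, n_mels, time)
        return self.inorm(self.conv(mel))   # (batch, hidden, time)

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128, spk_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, spk_dim)

    def forward(self, mel):
        h = self.conv(mel).mean(dim=2)      # average over time -> utterance-level vector
        return self.proj(h)                 # (batch, spk_dim)

class AdaINDecoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128, spk_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=5, padding=2)
        self.inorm = nn.InstanceNorm1d(hidden, affine=False)
        # The speaker embedding is mapped to a per-channel scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(spk_dim, hidden)
        self.to_beta = nn.Linear(spk_dim, hidden)
        self.out = nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2)

    def forward(self, content, spk_emb):
        h = self.inorm(self.conv(content))              # IN: remove global statistics again
        gamma = self.to_gamma(spk_emb).unsqueeze(-1)    # (batch, hidden, 1)
        beta = self.to_beta(spk_emb).unsqueeze(-1)
        h = gamma * h + beta                            # Add Global: inject speaker info
        return self.out(torch.relu(h))                  # reconstructed mel

# Reconstruction training uses the same utterance for both encoders; at conversion
# time we feed the target speaker's utterance to SpeakerEncoder instead.
mel = torch.randn(4, 80, 200)
content_enc, speaker_enc, decoder = ContentEncoder(), SpeakerEncoder(), AdaINDecoder()
recon = decoder(content_enc(mel), speaker_enc(mel))
loss = nn.functional.l1_loss(recon, mel)
```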

Feature Disentangle improvements
  • Coming back to Feature Disentangle: we can now see that it is essentially an Autoencoder, but the speech it outputs is sometimes unsatisfactory. One reason is that during training we always feed it speech from a single speaker, so the model never sees the actual Voice Conversion scenario in which the Speaker's timbre is swapped. This mismatch between training and inference leads to poor final results.

  • How do we solve this? The most natural idea is to also train with speaker embeddings taken from a different speaker, i.e. to carry out a 2nd-stage training. The problem is that for such converted output there is no ground-truth target. So we consider using GANs again:

    • Train a Discriminator to judge whether the output speech sounds like a real person speaking.

    • Going further, we can also train a Speaker Classifier so that the converted output is pushed as close as possible to the target speaker's voice (i.e. the classifier should give the correct result).

  • In practice, if we directly update the Decoder's parameters to fool these two discriminators, training easily collapses. So we instead train an additional Patcher that takes the same input as the Decoder; its output is added to the Decoder's output, and this sum is what is used to fool the discriminators, which makes training much more stable. A rough sketch of this 2nd-stage objective is given below.
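
The following is a rough sketch, under stated assumptions, of the generator-side objective for this 2nd-stage training. It reuses the modules from the previous sketch and additionally assumes three hypothetical networks: a realness discriminator, a speaker classifier, and a Patcher whose output is added to the Decoder's output. As described above, only the Patcher would typically be updated against this loss, while the discriminator and classifier are trained alternately with their own objectives (not shown).

```python
# Rough generator-side loss for the 2nd-stage training described above.
# `patcher`, `discriminator` and `speaker_cls` are hypothetical networks;
# `content_enc`, `speaker_enc`, `decoder` are as in the previous sketch.
import torch
import torch.nn.functional as F

def second_stage_loss(content_enc, speaker_enc, decoder, patcher,
                      discriminator, speaker_cls,
                      src_mel, tgt_mel, tgt_speaker_id):
    content = content_enc(src_mel)            # content of the source utterance
    spk = speaker_enc(tgt_mel)                # timbre of the target speaker
    converted = decoder(content, spk)
    # The Patcher sees the same inputs as the Decoder and adds a correction;
    # updating only the Patcher against the discriminators keeps training stable.
    converted = converted + patcher(content, spk)

    real_logit = discriminator(converted)     # "is this real human speech?"
    spk_logits = speaker_cls(converted)       # "which speaker is this?"
    adv_loss = F.binary_cross_entropy_with_logits(
        real_logit, torch.ones_like(real_logit))            # fool the realness discriminator
    cls_loss = F.cross_entropy(spk_logits, tgt_speaker_id)  # be classified as the target
    return adv_loss + cls_loss
```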

Direct Transformation
  • This approach is simple and direct: train a model that takes speaker A's voice as input and directly outputs speaker B's voice. The figure below introduces one such technique, CycleGAN, carried over from image style transfer.

CycleGAN
  • The principle is very simple: build a Generator that takes speaker X's voice as input and outputs speech in speaker Y's voice, together with a Discriminator that judges whether a piece of speech sounds like Y; the Generator's goal is to fool the Discriminator into believing its output was spoken by Y.

  • But there is a problem: the generator may completely ignore the input content and simply output whatever the discriminator most wants to hear, i.e. "mode collapse" occurs. So we add a second generator that converts Y's voice back into X's voice and require the reconstructed speech to be as close as possible to the original input, which forces the first generator to preserve the content.

  • In addition, we can add another training trick to stabilize things: when training the X→Y generator, we also feed it Y's own voice, and the training target is to output exactly the same speech.

  • Symmetrically, we can swap the roles of the two generators, feeding in Y's voice and requiring Y's voice to be recovered at the end; with both directions we obtain the complete CycleGAN. A sketch of these training signals follows.
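
Below is a minimal sketch of the three training signals just described (adversarial, cycle consistency, and identity) for the X→Y direction. `G_xy`, `G_yx`, and `D_y` are assumed networks mapping mel spectrograms to mel spectrograms or to realness logits, and the loss weights are illustrative; the symmetric Y→X terms are added in the same way.

```python
# Generator-side losses for one direction of a CycleGAN-style VC model.
# mel_x, mel_y are mel spectrograms of speakers X and Y: (batch, n_mels, time).
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(G_xy, G_yx, D_y, mel_x, mel_y):
    fake_y = G_xy(mel_x)
    logit = D_y(fake_y)
    # 1) Adversarial loss: D_y should believe the conversion sounds like speaker Y.
    adv = F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
    # 2) Cycle-consistency loss: converting back must recover the original input,
    #    so the generator cannot ignore its input content (the "mode collapse" guard).
    cycle = F.l1_loss(G_yx(fake_y), mel_x)
    # 3) Identity loss: feeding real Y speech through G_xy should leave it unchanged,
    #    which further stabilizes training.
    identity = F.l1_loss(G_xy(mel_y), mel_y)
    return adv + 10.0 * cycle + 5.0 * identity   # weights are illustrative
```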

StarGAN
  • CycleGAN has a drawback: if there are N speakers and we want conversion between all of them, we need to train on the order of N×(N-1) generators, one for every ordered pair of speakers. Hence the more advanced StarGAN.

  • StarGAN modifies both the generator and the discriminator. The generator gets an additional input: a speaker code is fed in along with the speech, so we can specify which speaker the generated speech should sound like. The speaker code can be a one-hot vector or come from a pre-trained speaker encoder. The discriminator is modified in the same way: it also takes a speaker code as input and judges whether the speech comes from that given speaker.

  • Concretely, it works like this: input speech from speaker sk, and give the generator the code of another speaker si so that it generates speech in si's voice. The code of si together with the generated speech is sent to the discriminator, which judges whether it really sounds like si. Then the code of sk and the generated si speech are fed back into the generator so that it converts the speech back into sk's voice, and the result is compared with the original sk speech; the closer the better (this reconstruction is the training target). A sketch of this conditioning and training step follows this list.

  • Note that this is a simplified version of StarGAN; the actual paper also adds a classifier. If you want the details, check the original paper.
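
Below is a rough sketch of the StarGAN-style conditioning described in this subsection: a single generator and a single discriminator, both taking a speaker code as an extra input (here a one-hot vector broadcast along time). All sizes, the one-hot coding, and the loss weights are illustrative assumptions, and the sketch omits the discriminator's own training step and the extra classifier used in the actual paper.

```python
# Sketch of a conditional generator / discriminator and one generator training step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondGenerator(nn.Module):
    def __init__(self, n_mels=80, n_speakers=10, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels + n_speakers, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_mels, 5, padding=2),
        )

    def forward(self, mel, spk_code):                     # mel: (B, n_mels, T)
        code = spk_code.unsqueeze(-1).expand(-1, -1, mel.size(-1))
        return self.net(torch.cat([mel, code], dim=1))    # speech in the target timbre

class CondDiscriminator(nn.Module):
    def __init__(self, n_mels=80, n_speakers=10, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels + n_speakers, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, 1, 5, padding=2),
        )

    def forward(self, mel, spk_code):                     # "is this really spk_code's voice?"
        code = spk_code.unsqueeze(-1).expand(-1, -1, mel.size(-1))
        return self.net(torch.cat([mel, code], dim=1)).mean(dim=(1, 2))

# One generator training step as described above: s_k -> s_i, then back to s_k.
G, D = CondGenerator(), CondDiscriminator()
mel_sk = torch.randn(2, 80, 200)
code_sk = F.one_hot(torch.tensor([0, 0]), 10).float()
code_si = F.one_hot(torch.tensor([3, 3]), 10).float()

fake_si = G(mel_sk, code_si)                              # pretend to be speaker s_i
adv = F.binary_cross_entropy_with_logits(D(fake_si, code_si),
                                          torch.ones(2))  # fool D about s_i
cycle = F.l1_loss(G(fake_si, code_sk), mel_sk)            # convert back and compare
generator_loss = adv + 10.0 * cycle                       # weight is illustrative
```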

Comparison of Feature Disentangle and Direct Transformation
  • Which one is better? In fact, they are hard to compare: in the literature the two approaches use different network architectures and are evaluated on different tasks, so a fair comparison is genuinely difficult.

  • Moreover, the two approaches do not conflict; for example, they can be combined by using the Autoencoder architecture from Feature Disentangle as the generator in a Direct Transformation method.

Supplement
