李宏毅DLHLP.09.Voice Conversion.1/2. Feature Disentangle

Introduction

This is a new 2020 course by Prof. Hung-yi Lee (李宏毅): Deep Learning for Human Language Processing (DLHLP).
Course website
Bilibili videos
For formula input, please refer to: Online LaTeX formula editor

What is VC

Voice conversion takes one piece of speech as input and outputs another piece of speech.
Usually:
• What is preserved? The content.
• What is changed? Many different aspects…

Applications

A common application is to change the person who is speaking (speaker conversion):
• The same sentence said by different people has a different effect.
• Deep fake: fooling humans or a speaker verification system.
• A simple way to achieve personalized TTS.
• Singing voice conversion.
Beyond the speaker, other aspects can be converted as well:
• Emotion: speaking-style conversion, e.g., turning a soft voice into an angry one.
• Normal-to-Lombard: Lombard speech is the voice and intonation people use when talking in a noisy environment; imagine raising your voice to talk to the people around you while wearing headphones.
• Whisper-to-Normal.
• Singing technique conversion: adding singing techniques such as lip trill or vibrato.

• Improving speech intelligibility, e.g., for surgical patients who have had parts of their articulators removed, or for people whose speech is naturally unclear; VC can be used to make their speech more intelligible.
• Accent conversion, e.g., for Indian-accented English: keep the voice quality of the non-native speaker while adopting the pronunciation patterns of a native speaker. This can be used in language learning.
• Data augmentation: VC can also convert clean speech into noisy speech (adding noise) or noisy speech into clean speech (denoising) to enlarge the training data.

Practice

In practice, for simplicity, it is usually assumed that the input and output sequences have the same length.
In addition, the output of voice conversion is a sequence of acoustic features, i.e., a string of vectors, which usually cannot be converted directly into a sound signal. We need another module, the vocoder. Common vocoder approaches are:
• Rule-based: Griffin-Lim algorithm
• Deep Learning: WaveNet
The vocoder itself is fairly involved and will be covered in a dedicated lecture later, so we will not go into depth here. For now we only need to know that such a module exists and that it is commonly used in VC, TTS, speech separation, and so on.
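To make the rule-based option concrete, below is a minimal sketch of Griffin-Lim reconstruction using librosa; the toy test signal and STFT parameters are arbitrary choices for illustration, not settings from the lecture.

```python
# Minimal sketch: reconstructing a waveform from a magnitude spectrogram
# with the rule-based Griffin-Lim algorithm (via librosa).
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)   # a toy 220 Hz tone

S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))    # magnitude spectrogram
y_rec = librosa.griffinlim(S, n_iter=32, n_fft=1024, hop_length=256)
# y_rec approximates y; in a real VC system a neural vocoder such as WaveNet
# would play this role, usually starting from mel-spectrogram features.
```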

Classification

According to the training data, we can divide voice conversion into two major categories:
1. Parallel data: speakers A and B both utter the same sentences. The model simply takes A's utterance as input and is trained to output B's corresponding utterance.
However, such data is difficult to collect. Possible remedies are:
• Model pre-training [Huang, et al., arXiv'19]: pre-train the model first, then fine-tune it with a very small amount of parallel data.
• Synthesized data [Biadsy, et al., INTERSPEECH'19]: if we have a lot of speaker A's speech together with the corresponding text, we can use speech synthesis (have a machine read the text) to generate the matching data for speaker B.

2. Unparallel data: this is the more common kind of data. For this type of data, we borrow transfer-learning ideas from the image domain:
• This is essentially "audio style transfer"
• Borrowing techniques from image style transfer
There are two families of methods.
The first is Feature Disentangle: a speech signal contains many kinds of information, content information, speaker information, background noise, and so on. The idea is to separate these factors and then replace only the speaker information, thereby completing the conversion. By the same logic, if the accent information can be extracted and replaced, the accent can be converted; if the emotional component can be extracted, the emotion can be converted.
This section mainly covers Feature Disentangle.
The other family is Direct Transformation, which is covered in the next part.
Below we take replacing the speaker information as the running example of Feature Disentangle.

Feature Disentangle

There are two encoders that extract, respectively, the content C and the speaker characteristics of speaker A from the speech signal. The two features are then combined, and a decoder synthesizes speaker A's speech with content C.
After the model is trained, when we want to convert the speaker, we feed an utterance of speaker B into the speaker encoder, combine its output with the content features C obtained earlier, and pass them through the decoder to get the final result.
The model works essentially like an autoencoder (AE), but the training tricks are more involved.
The question is how to make the two encoders each extract only the information we designate. The following approaches address this.
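As a rough picture of this architecture, here is a minimal PyTorch sketch of the two encoders and the decoder trained with a reconstruction loss; all layer types and sizes are illustrative assumptions, not the exact networks from the lecture.

```python
# Minimal sketch of the two-encoder autoencoder for feature disentanglement.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, feat_dim=80, content_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, content_dim, kernel_size=5, padding=2),
        )

    def forward(self, x):              # x: (batch, feat_dim, time)
        return self.net(x)             # (batch, content_dim, time)

class SpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=80, spk_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # average over time: one vector per utterance
        )
        self.proj = nn.Linear(256, spk_dim)

    def forward(self, x):
        return self.proj(self.net(x).squeeze(-1))   # (batch, spk_dim)

class Decoder(nn.Module):
    def __init__(self, content_dim=128, spk_dim=64, feat_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(content_dim + spk_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, feat_dim, kernel_size=5, padding=2),
        )

    def forward(self, content, speaker):
        # broadcast the speaker vector over time and concatenate with the content
        spk = speaker.unsqueeze(-1).expand(-1, -1, content.size(-1))
        return self.net(torch.cat([content, spk], dim=1))

# Reconstruction training: encode and decode the same utterance.
x = torch.randn(4, 80, 100)            # a batch of mel-spectrograms
c, s = ContentEncoder()(x), SpeakerEncoder()(x)
loss = nn.functional.l1_loss(Decoder()(c, s), x)
```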

Using Speaker Information

The first approach does not train two encoders. Assuming we know who speaks each sentence in the training data, we can represent each speaker with a one-hot encoding.
The one-hot vector replaces the original speaker encoder: it is fed into the decoder together with the content encoding, and the model is trained to reconstruct the original sound signal. Since the speaker identity is already supplied as extra information, the content encoder and the decoder do not have to carry any speaker information themselves.
The problem with this approach: it is difficult to handle new speakers. A speaker Z who does not appear in the training data has no one-hot vector, so we cannot convert into his voice.
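A minimal sketch of this one-hot conditioning is shown below; an nn.Embedding over speaker IDs is equivalent to a one-hot vector followed by a linear layer, and all sizes are illustrative assumptions.

```python
# Minimal sketch: conditioning the decoder on a one-hot speaker identity.
import torch
import torch.nn as nn

n_speakers, spk_dim, content_dim, feat_dim = 100, 64, 128, 80
speaker_table = nn.Embedding(n_speakers, spk_dim)    # replaces the speaker encoder

decoder = nn.Sequential(
    nn.Conv1d(content_dim + spk_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(256, feat_dim, kernel_size=5, padding=2),
)

content = torch.randn(4, content_dim, 100)           # output of the content encoder
spk_id = torch.randint(0, n_speakers, (4,))          # who spoke each utterance
spk = speaker_table(spk_id).unsqueeze(-1).expand(-1, -1, content.size(-1))
reconstruction = decoder(torch.cat([content, spk], dim=1))   # (4, feat_dim, 100)
# Limitation: a speaker outside the n_speakers training IDs has no entry
# in the table, so unseen speakers cannot be converted to.
```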

Pre-training Encoders

The second method is to pre-train the Speaker Encoder.
The speaker encoder takes a speech signal and outputs a vector representation of the speaker of that signal. This is also called a speaker embedding, and common choices are the i-vector, d-vector, x-vector, and so on. We will not go into depth here; these will be discussed later together with speaker verification.

Content Encoder

For the content encoder, a speech recognition system can be plugged in directly, since speech recognition generally ignores information other than the linguistic content. However, the output of a speech recognition system is text, not vectors, so it cannot be connected to the decoder. If a deep-learning-based recognizer is used, the per-frame probabilities it assigns to each state can be used as the content encoder's output instead.
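Below is a toy sketch of this idea, with a small frame classifier standing in for a real ASR acoustic model; the number of phone states and the layer sizes are assumptions made for illustration.

```python
# Minimal sketch: frame-level state posteriors ("posteriorgrams") as content.
import torch
import torch.nn as nn

n_states, feat_dim = 144, 80
acoustic_model = nn.Sequential(          # stand-in for a trained ASR acoustic model
    nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(256, n_states, kernel_size=1),
)

mel = torch.randn(1, feat_dim, 200)      # one utterance of mel-spectrogram frames
posteriorgram = torch.softmax(acoustic_model(mel), dim=1)   # (1, n_states, 200)
# Each column is a distribution over phone states for one frame; this vector
# sequence, rather than the decoded text, is what gets fed to the decoder.
```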

Adversarial Training

The third method is adversarial training, which borrows the idea of GANs: add a speaker classifier that tries to identify the speaker from the content encoder's output, while the content encoder learns to fool it. The speaker classifier and the encoder are trained iteratively; once the classifier can no longer tell who is speaking, the content encoding carries no speaker information.
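Here is a minimal sketch of one adversarial round with a simple alternating update; the shapes, layer sizes, and the exact update scheme are illustrative assumptions rather than the lecture's recipe.

```python
# Minimal sketch: a speaker classifier tries to identify the speaker from the
# content embedding, while the content encoder is updated to fool it.
import torch
import torch.nn as nn

content_dim, n_speakers = 128, 100
content_encoder = nn.Sequential(nn.Conv1d(80, content_dim, kernel_size=5, padding=2))
speaker_classifier = nn.Sequential(
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),       # pool the content code over time
    nn.Linear(content_dim, n_speakers),
)

opt_cls = torch.optim.Adam(speaker_classifier.parameters(), lr=1e-4)
opt_enc = torch.optim.Adam(content_encoder.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

x = torch.randn(4, 80, 100)                      # batch of mel-spectrograms
spk_id = torch.randint(0, n_speakers, (4,))      # true speaker labels

# 1) Train the classifier to recognize the speaker from the content code.
cls_loss = ce(speaker_classifier(content_encoder(x).detach()), spk_id)
opt_cls.zero_grad()
cls_loss.backward()
opt_cls.step()

# 2) Train the encoder so the classifier can no longer tell who is speaking.
adv_loss = -ce(speaker_classifier(content_encoder(x)), spk_id)
opt_enc.zero_grad()
adv_loss.backward()
opt_enc.step()
```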

Designing network architecture

Drawing on techniques used for images, we can also modify the network architecture itself so that each encoder learns only the designated information.
The specific method is to add instance normalization (IN) to the content encoder to remove speaker information. The principle is explained below.
First there is a sound signal.
It is passed through a 1D CNN filter, which produces a row of values, one per time step.
It then passes through another 1D filter to get another row of values, and so on; after several filters, each segment of the sound signal corresponds to a vector, with one dimension per filter.
Instance normalization acts on these vectors dimension by dimension: for each dimension, the mean over the utterance is subtracted and the standard deviation is divided out, so that every dimension ends up with mean 0 and variance 1.
So why does this remove speaker characteristics? Each dimension of these vectors corresponds to a feature extracted by one CNN filter. Suppose a dimension captures a gender-related cue: it would respond strongly to the low frequencies typical of male voices or to the high frequencies typical of female voices. Instance normalization evens out these per-dimension statistics, which is equivalent to erasing such global speaker traits.
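A minimal PyTorch sketch of this per-channel normalization (the tensor shapes are assumptions):

```python
# Minimal sketch of instance normalization over CNN channels: every channel
# of every utterance is normalized to zero mean and unit variance over time,
# with no learned scale or shift.
import torch
import torch.nn.functional as F

def instance_norm(z, eps=1e-5):
    # z: (batch, channels, time) -- output of the 1D CNN filters
    mean = z.mean(dim=-1, keepdim=True)                   # per-channel mean
    std = z.std(dim=-1, keepdim=True, unbiased=False)     # per-channel std
    return (z - mean) / (std + eps)

z = torch.randn(4, 128, 100)
z_in = instance_norm(z)
z_in2 = F.instance_norm(z)    # essentially the same operation, built in (affine-free)
```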
Next, adaptive instance normalization (AdaIN) is added to the decoder, so that the speaker information, and only the speaker information, is injected there. Its principle is as follows.
The decoder also applies instance normalization, which normalizes each dimension of the CNN output and thus removes speaker information. Having removed it, how can the decoder then output speech carrying the desired speaker's information?

This information comes from the speaker encoder. The output of the speaker encoder passes through two transforms to produce the two vectors $\gamma$ and $\beta$.
These two vectors modulate the decoder's instance-normalized activations $z_1, z_2, z_3, z_4, \dots$ according to
$$z_i' = \gamma \odot z_i + \beta$$
Note that the same $\gamma$ and $\beta$ act on all the vectors, so their effect is global across the whole utterance.
Instance normalization followed by this $\gamma, \beta$ scaling and shifting is what is called AdaIN.
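Putting the IN step and the $\gamma, \beta$ modulation together, a minimal AdaIN sketch might look like this (layer sizes and the two linear transforms are illustrative assumptions):

```python
# Minimal sketch of AdaIN in the decoder: instance-normalize the activations,
# then scale and shift every time step with gamma and beta derived from the
# speaker embedding.
import torch
import torch.nn as nn

channels, spk_dim = 128, 64
to_gamma = nn.Linear(spk_dim, channels)      # the two transforms of the speaker embedding
to_beta = nn.Linear(spk_dim, channels)

def adain(z, spk_emb, eps=1e-5):
    # z: (batch, channels, time), spk_emb: (batch, spk_dim)
    z = (z - z.mean(-1, keepdim=True)) / (z.std(-1, keepdim=True, unbiased=False) + eps)
    gamma = to_gamma(spk_emb).unsqueeze(-1)  # (batch, channels, 1), broadcast over time
    beta = to_beta(spk_emb).unsqueeze(-1)
    return gamma * z + beta                  # z_i' = gamma ⊙ z_i + beta

z = torch.randn(4, channels, 100)            # decoder activations
spk_emb = torch.randn(4, spk_dim)            # output of the speaker encoder
z_prime = adain(z, spk_emb)
```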
Finally, the entire model can be trained like an ordinary autoencoder, with a reconstruction objective.

Implementation results

For the content encoder:
The results show that after adding IN, the accuracy of a speaker classifier run on the content encoder's output drops, indicating that IN does its job: it filters out speaker information, which is exactly what lowers the classifier's accuracy.
For the speaker encoder:
Feeding in test utterances yields very clean speaker embeddings: different speakers separate into distinct clusters, and male and female speakers fall into clearly separate regions.

Origin blog.csdn.net/oldmao_2001/article/details/108856335