Speech Synthesis: A Detailed Explanation (Speech Signal Processing Learning, Part 8)

References:

[1] "Speech Synthesis (1/2) - Tacotron", Bilibili

[2] "March 2020 Li Hongyi Human Language Processing Course Notes: TTS - 14", Zhihu (zhihu.com)

[3] "Speech Synthesis (2/2) - More than Tacotron", Bilibili

[4] "March 2020 Li Hongyi Human Language Processing Course Notes: Beyond Tacotron - 15", Zhihu (zhihu.com)

The papers cited in these lectures are omitted here.

Table of contents

1. Brief background introduction

Course outline introduction

The earliest speech synthesis

Concatenative Approach

Parametric Approach

Deep Voice

2. Tacotron: End-to-end TTS

Model comparison

Tacotron model introduction

Tacotron Encoder

Tacotron Attention

Tacotron Decoder

Tacotron Post Processing Network

3. Performance of Tacotron

Differences in performance between first generation and second generation

The magic of Dropout

4. Beyond Tacotron

Problems with Tacotron

Improvements to Tacotron - Encoder

Improvements to Tacotron - Attention

FastSpeech: TTS without Seq2Seq

Dual Learning: ASR & TTS

5. Controllable TTS (Controllable Text-to-Speech)

Simple classification

TTS vs VC (Voice Conversion)

How to do it

GST-Tacotron

Two-Stage Training

6. Supplementary questions


1. Brief background introduction

Course outline introduction
  • TTS: Text-to-Speech, i.e., converting text into speech, is the topic of this lecture: speech synthesis. Current speech synthesis systems are trained end to end. The outline first covers how the industry did TTS before deep learning became popular, and then discusses how to control TTS so that it synthesizes the voice we want.

  • The picture below is the outline of this course:

The earliest speech synthesis
  • The earliest speech synthesis dates back to 1939: VODER, a speech synthesis system demonstrated at the New York World's Fair. For an example, see vintagecomputermusic.com/mp3/s2t9_Computer_Speech_Demonstration.mp3. Speech synthesis at that time worked like an electronic keyboard: the operator produced different sounds by manipulating keys.

  • In the 1960s, John Larry Kelly Jr. synthesized speech with an IBM computer at Bell Labs. Click the link to listen: it sounds like a robotic voice, and the singing sounds a bit ghostly. Ironically, today's synthesized speech sounds too real to reproduce that robotic feel. Sometimes we deliberately make a synthesized voice less realistic, so that listeners are not left unsure whether it is real or fake, which would trigger the uncanny valley effect.

Concatenative Approach
  • The speech synthesis technology used commercially in the past was the concatenative approach. The idea is very intuitive: build a huge database of recorded speech. For example, to synthesize "how are you?", the system selects recordings of the units "how", "are", and "you" from the database and splices them together.

  • Speech spliced directly in this way tends to sound unnatural, so the main research question for this method is how to select speech fragments that sound natural when joined. Many video creators on Bilibili cut out words spoken by a person and recombine them into new sentences; the much-mocked "movable type printing" meme videos use essentially the same technique.

  • However, this method has a major flaw: the synthesized voices are not flexible. To synthesize a particular person's voice, the corpus must contain recordings of that voice (or a very similar one); anything not covered by the recorded data cannot be synthesized. In addition, the entire corpus has to be stored locally, which takes a lot of storage space.

Parametric Approach
  • In the machine learning era, people hoped to do speech synthesis with machine learning, and systems based on HMMs, deep learning, and other techniques appeared; the most famous is undoubtedly HTS. However, if the HMM always outputs the most probable observation for each state, the resulting speech tends to sound flat and over-smoothed. These systems are also not end to end: they are parametric methods with very complex pipelines, which is not what we want to study here.

Deep Voice
  • This model, developed by Baidu, is the closest thing to an end-to-end model before truly end-to-end speech models appeared. It chains many modules together to perform speech synthesis. Briefly, the modules are:

    • Grapheme-to-phoneme: predicts the pronunciation (phonemes) of the given text and passes the result on to the next two modules.

    • Duration Prediction: given the phonemes, predicts how long each one should be pronounced.

    • Fundamental Frequency Prediction: given the phonemes, predicts the pitch of each one, i.e., the vocal-cord vibration frequency. Some phonemes are unvoiced; this module also decides which phonemes should be marked as having no F0.

    • Audio Synthesis: a speech synthesis network that takes the outputs of the three modules above and produces the corresponding speech.

  • In fact, all four modules are based on deep learning, and if they were trained jointly the whole chain would be end to end. However, the first generation of Deep Voice trained each module separately and then connected them. By its third generation, Deep Voice had become end to end.
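To make the module chain concrete, here is a minimal, hypothetical sketch of how Deep Voice's four stages could be wired together. All class and function names are illustrative placeholders, not Baidu's actual implementation.

```python
# Hypothetical sketch of the Deep Voice pipeline: four separately trained
# modules connected in series. The callables passed in are placeholders.

class DeepVoicePipeline:
    def __init__(self, g2p, duration_model, f0_model, synthesizer):
        self.g2p = g2p                        # text -> phoneme sequence
        self.duration_model = duration_model  # phonemes -> duration per phoneme
        self.f0_model = f0_model              # phonemes -> F0 contour (0 for unvoiced)
        self.synthesizer = synthesizer        # (phonemes, durations, f0) -> waveform

    def tts(self, text: str):
        phonemes = self.g2p(text)
        durations = self.duration_model(phonemes)
        f0 = self.f0_model(phonemes)          # fundamental frequency per phoneme
        return self.synthesizer(phonemes, durations, f0)
```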

2. Tacotron: End-to-end TTS

"Taco" refers to the Mexican food, and "tron" was added to give the name a technological feel. The model was originally going to be called Talktron; because the authors like tacos and "Tacotron" sounds similar to "Talktron", that name was chosen.

Spectrogram: an image that represents the frequency content of a speech signal over time. It is obtained by applying the Short-Time Fourier Transform (STFT) to the time-domain signal and plotting the resulting spectra. The horizontal axis is time, the vertical axis is frequency, and color or brightness represents the signal energy at each time-frequency point.
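As a minimal sketch of this definition, the snippet below computes a linear spectrogram with librosa (the library choice, file name, and STFT parameters are illustrative assumptions; any STFT implementation works the same way).

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=22050)      # waveform and sample rate (placeholder file)
S = librosa.stft(y, n_fft=1024, hop_length=256)   # complex STFT: (freq bins, time frames)
magnitude = np.abs(S)                             # linear spectrogram (phase is dropped)
log_spec = librosa.amplitude_to_db(magnitude)     # dB scale, convenient for visualization
```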

Rule-based vocoder: a vocoder that generates the speech signal from a set of predefined rules. Unlike modern neural vocoders, this approach does not rely on machine learning or statistical models; it reconstructs speech with explicitly defined signal-processing procedures.
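A classic example of a rule-based vocoder is the Griffin-Lim algorithm, which iteratively estimates the phase that the magnitude spectrogram is missing; Tacotron 1 uses it to go from its linear spectrogram back to a waveform. A minimal sketch with librosa, assuming the `magnitude` spectrogram from the previous snippet:

```python
import librosa

# `magnitude` is a linear-magnitude spectrogram (freq bins x frames), e.g. from the
# previous snippet. Griffin-Lim iteratively recovers a plausible phase and waveform.
y_hat = librosa.griffinlim(magnitude, n_iter=60, hop_length=256)
```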

Model comparison
  • In fact, there were some end-to-end TTS attempts before Tacotron. The input and output of Tacotron are as follows:

    • Input: character (letter)

    • Output: (linear) spectrogram

  • Its output, a linear spectrogram, differs from the waveform only by the missing phase information. Converting it back to a waveform therefore does not require a very powerful vocoder; a rule-based one is enough.

  • As for other models: before Tacotron, someone built an end-to-end TTS model whose input is phonemes and whose output is the acoustic feature vectors of the STRAIGHT vocoder, meaning those features have to be fed into that specific vocoder to produce sound. There is also Char2wav, whose input is characters and whose output is the acoustic feature vectors consumed by a SampleRNN vocoder.

  • Of all these models, Tacotron goes the furthest toward being fully end to end.

Tacotron model introduction
  • Tacotron uses a typical Seq2Seq + Attention model architecture. Its output will also have post-processing to produce a sound spectrum.

Tacotron Encoder
  • The Encoder takes letters and punctuation marks as input and outputs a sequence of vectors, roughly analogous to phoneme vectors, which tell the attention module and decoder how these letters should be pronounced.

  • The characters are first mapped to input embeddings, then pass through a pre-net (several fully connected layers with dropout), and then through a CBHG module. The CBHG architecture is relatively complex, as shown in the figure; its purpose is to learn a high-level representation of the input text.

    In fact, CBHG was improved in the second version and became 3 layers of convolution and a BiLSTM.
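Below is a simplified sketch of the second-version (Tacotron 2 style) encoder described above: character embedding, three convolution layers, then a BiLSTM. It is written in PyTorch; the hyperparameters (vocabulary size, embedding width, kernel size, dropout rate) are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class SimpleTacotron2Encoder(nn.Module):
    """Simplified sketch: character embedding -> 3 conv layers -> BiLSTM."""
    def __init__(self, n_chars=70, emb_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(n_chars, emb_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
                nn.Dropout(0.5),
            )
            for _ in range(3)
        ])
        self.bilstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, char_ids):            # (batch, seq_len) of character indices
        x = self.embedding(char_ids)         # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                # Conv1d expects (batch, channels, seq_len)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)
        outputs, _ = self.bilstm(x)          # (batch, seq_len, emb_dim)
        return outputs                       # consumed by the attention module / decoder
```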

Tacotron Attention
  • The attention module here is the ordinary attention we have already learned. The first thing to note is that the input text and the output audio have a monotonic alignment property.

  • The purpose of the attention module is to let the machine automatically learn how much of the sound signal each character embedding should generate in the decoder. Monotonic alignment means that when we visualize the attention weights (x-axis: the text embeddings, i.e., the encoder outputs; y-axis: the audio frames, i.e., the decoder outputs), the resulting pattern should lie roughly along the diagonal.

Tacotron Decoder

Mel-Spectrogram: a representation of audio signals commonly used in speech and audio processing. It is obtained by taking the short-time Fourier transform of the signal and then applying a mel filter bank to the magnitude spectrum.
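A minimal sketch of computing a mel-spectrogram with librosa (the file name and parameters are illustrative; 80 mel bands is a common choice in TTS):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=22050)                 # placeholder audio file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)                           # log-mel features, shape (80, frames)
```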

  • The Decoder used here is essentially the Seq2Seq decoder. A starting zero vector is passed through a pre-net into the RNN, the RNN attends over the encoder outputs, and the resulting context vector is passed to the next RNN layer to produce the output. That output then becomes the input for the next step.

  • What is special is that the decoder does not generate a single vector per step but several, and these outputs are mel-spectrogram frames. The number generated per step is called r frames; in the first version of Tacotron, r could be 3 or 5. Why generate several at once? Because the sound signal is very long and each frame covers only a short time span, letting the model emit several spectrogram frames per step makes generation faster.

    However, in the second version of Tacotron, r becomes 1 again.

  • When multiple frames are produced per step, we can either string them together as the input of the next RNN step or use only the last frame. The pre-net in the decoder also applies dropout, which turns out to be extremely important.

  • As usual, teacher forcing is used during training. Interestingly, in an ordinary Seq2Seq model the output token is obtained by sampling, whereas Tacotron has no sampling step, which creates a mismatch between training and testing: during training every step consumes the ground-truth output, but at inference the model must consume its own, possibly imperfect, predictions. Dropout makes the "correct answer" slightly less correct, simulating the errors the RNN will make at test time, and thus plays a role similar to sampling.

  • Similarly, a text Seq2Seq model stops when it emits an end-of-sequence token, but here the outputs are continuous vectors, so an additional module, usually a binary classifier, decides when to stop. Its inputs are the RNN's last hidden state and cell state, and its output is a value in [0, 1] representing the probability of stopping.
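A minimal sketch of such a stop-token classifier in PyTorch, assuming the decoder's hidden and cell states are available; the hidden size and the 0.5 threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StopTokenPredictor(nn.Module):
    """Sketch of the 'when to stop' classifier: decoder hidden + cell -> P(stop)."""
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.linear = nn.Linear(2 * hidden_dim, 1)

    def forward(self, hidden, cell):            # each: (batch, hidden_dim)
        x = torch.cat([hidden, cell], dim=-1)
        return torch.sigmoid(self.linear(x))    # (batch, 1), probability of stopping

# At inference, generation stops once this probability exceeds a threshold, e.g. 0.5.
```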

Tacotron Post Processing Network
  • After the decoder, Tacotron has a post-processing module: a CBHG in the first generation and a stack of convolution layers in the second. It takes all the vectors output by the decoder as input and outputs another sequence of vectors.

  • Why is this extra step needed? Because the decoder generates its outputs sequentially, each vector can only be conditioned on the vectors generated before it. The decoder may realize partway through that something earlier should be corrected, but it never gets the chance. The post-processing network, which sees the whole output sequence at once, is there to fix such problems.

  • Therefore, Tacotron is usually trained with two losses: the difference between the decoder output and the ground-truth mel-spectrogram, and the difference between the post-processed output and the ground truth; both should be as small as possible (a minimal loss sketch follows this list). At inference time, we take the post-processed output and feed those acoustic features to the vocoder.

  • It is worth mentioning that the first generation of Tacotron used a rule-based vocoder, while the second generation uses WaveNet.
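As referenced above, here is a minimal sketch of the two-term training objective. The choice of L1 versus MSE varies between Tacotron versions; MSE is used here purely for illustration.

```python
import torch.nn.functional as F

def tacotron_loss(decoder_mel, postnet_mel, target_mel):
    """Two-term loss sketch: both the decoder output and the post-processed
    output are pushed toward the ground-truth spectrogram."""
    before = F.mse_loss(decoder_mel, target_mel)   # loss before post-processing
    after = F.mse_loss(postnet_mel, target_mel)    # loss after post-processing
    return before + after
```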

3. Performance of Tacotron

Differences in performance between first generation and second generation
  • How well does Tacotron perform? Speech quality is usually evaluated with the mean opinion score (MOS): a group of listeners hear a sample and rate it from 1 to 5, and the scores are averaged. The first-generation Tacotron scored 3.82, while the second generation reached 4.53, which is very close to the ground truth.

  • Why is there such a large gap between the first and second generations? The main reason is the vocoder: the simple rule-based vocoder was replaced by the learned WaveNet vocoder. WaveNet itself needs training data, and the choice of data affects the final performance. Interestingly, training the vocoder directly on real recordings works less well than training it on spectrograms produced by the synthesis model: synthesized spectrograms differ subtly from real ones, so showing the vocoder in advance what synthesized spectrograms look like gives better results.

The magic of Dropout
  • A more curious issue is that dropout is also required during the inference phase. Most models use dropout only during training, to make the model more robust and avoid overfitting, and disable it at test time. Here, if dropout is turned off at inference, the generated speech falls apart. Why? There is still no fully satisfying explanation.

    As an explanation, the teacher gave an analogy with GPT-2 generating sentences, where a similar problem appears. If GPT-2 always picks the most probable next word, it falls into a strange loop and keeps producing repeated words. So when we want GPT-2 to generate varied text, we give it some randomness by sampling from the output distribution, so that lower-probability words also get a chance to appear. By analogy, the reason Tacotron needs dropout at test time may be the same: dropout injects some randomness while generating the spectrogram, so the model does not get stuck in a repeating loop the way greedy GPT-2 does.
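A minimal PyTorch sketch of keeping dropout active at inference, as described above: the rest of the model runs in eval mode while the Dropout layers stay stochastic. This is one common way to implement the trick; the model object itself is assumed.

```python
import torch.nn as nn

def enable_inference_dropout(model: nn.Module):
    """Keep only the Dropout layers stochastic while everything else is in eval mode,
    mimicking Tacotron's use of pre-net dropout at inference time."""
    model.eval()                        # puts the whole model into eval mode
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()              # re-enable stochastic dropout for these layers
```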

4. Beyond Tacotron

Problems with Tacotron
  • Although the sound quality of Tacotron 2 is already very good, it occasionally mispronounces words. Why? Speech synthesis does not require tens of thousands of hours of labeled data the way speech recognition does; around twenty-plus hours of a single speaker's voice is enough for very high-quality synthesized speech. However, twenty-odd hours of speech cannot guarantee vocabulary coverage: the VCTK dataset has a vocabulary of roughly 5,000 words, even the Nancy corpus is under 20,000, and the largest public dataset, LibriTTS, still has a vocabulary under 100,000, while an English dictionary typically has well over 100,000 entries.

  • The model can guess the pronunciation of English words, but it has not seen enough vocabulary to estimate accurately how every word should be pronounced, so unfamiliar or newly coined words get mispronounced. Users find this unacceptable. What can we do?

  • One solution is to stop using characters as input and instead use a high-quality lexicon to convert characters into phonemes, which seems to eliminate mispronunciations. But problems remain: a newly coined word will not be in the dictionary and cannot be converted. What then? We can feed the model a hybrid of characters and phonemes. And if we find the model pronouncing a word incorrectly, once we know the correct pronunciation we can simply add it to the lexicon, and the model is fixed immediately (see the sketch below).
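A minimal sketch of the hybrid character/phoneme input idea. The tiny lexicon and its phoneme transcriptions are illustrative placeholders; in practice the lexicon could come from a pronunciation dictionary such as CMUdict, and entries can be patched by hand when a word is mispronounced.

```python
# Hypothetical lexicon: word -> phoneme sequence (illustrative transcriptions).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "synthesis": ["S", "IH", "N", "TH", "AH", "S", "IH", "S"],
}

def to_hybrid_tokens(text: str):
    """Use phonemes for in-vocabulary words, fall back to raw characters otherwise."""
    tokens = []
    for word in text.lower().split():
        if word in LEXICON:
            tokens.extend(LEXICON[word])   # known word: phoneme tokens
        else:
            tokens.extend(list(word))      # out-of-vocabulary word: character tokens
        tokens.append(" ")                 # word-boundary marker
    return tokens

print(to_hybrid_tokens("speech rocks"))
```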

Improvements to Tacotron - Encoder

BERT (Bidirectional Encoder Representations from Transformers): a pre-trained language model based on the Transformer architecture, proposed by Google in 2018. Its key ideas are bidirectional training, which lets it model context from both directions, and the Transformer's self-attention mechanism.

  • For the encoder, we can also add syntactic information. A parser splits a sentence into constituents such as subject, object, and noun phrases, which affect the intonation, pauses, and so on of a sentence.

    Application scenario: for example, Xiaolongnü says to Yang Guo: "I also want to live the life that Guo'er has lived." With syntactic information, the model knows how the sentence should be segmented: "I also / want to live / the life / that Guo'er / has lived."

  • Some people also use BERT embeddings as the input to Tacotron. The intuition is that BERT's self-attention makes each word embedding incorporate contextual information, and therefore semantic and syntactic information as well, which is helpful for speech synthesis.

Improvements to Tacotron - Attention
  • As mentioned earlier, we want the attention between text and speech to form a diagonal. How can we enforce this? One way is Guided Attention. The original Tacotron loss simply pushes the model output toward the target speech; Guided Attention adds a regularization term on the attention weights: whenever significant weight appears in the off-diagonal (red) region, the loss increases as a penalty (see the sketch after this list).

  • There are also stronger constraints. Monotonic Attention forces the attention to move from left to right, and Location-aware attention, mentioned in the ASR lectures, requires the attention at each step to take the previously generated attention into account.

  • In fact, the attention mechanism has many tricks worth studying, and the attention mechanism also has a great impact on the quality of the final model generation.

  • A paper from ICASSP 2020 analyzes how important attention is for speech synthesis. It runs an interesting experiment: the model is trained only on utterances shorter than 10 seconds from the LJ Speech dataset, but at test time it is deliberately asked to read very long sentences from Harry Potter, longer than 10 seconds. It turns out that if the attention mechanism is poorly chosen, for example plain content-based attention, the output falls apart.

  • The metric works by feeding the synthesized audio to an existing speech recognition system, converting it back to text, and comparing that text with the target transcript to count the word errors; the higher the error, the worse the TTS system. The experiments show that GMMv2b and DCA attention give the best results.

  • Or we can be even more direct: since we want the decoder's attention weight matrix to be diagonal, why not simply zero out the off-diagonal region of the attention weights at inference time? This trick turns out to be quite useful, and it requires no change to training (see the masking function in the sketch after this list).

  • Another clever trick is to add positional encoding to both the query (from the decoder) and the keys (from the encoder) that go into the attention, which strengthens the attention computation. Moreover, the positional encoding is scaled and controlled by a speaker embedding, which contains information about the speaker's timbre, emotion, and speaking rate; intuitively, the speaking rate should affect the positional encoding.
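The sketch below, referenced in the two bullets above, shows both ideas: a guided-attention penalty that punishes weights far from the diagonal (in the style of the guided-attention literature; the exact weighting function and the constant g are illustrative assumptions), and the inference-time trick of zeroing attention weights outside a diagonal band.

```python
import torch

def guided_attention_loss(attn, g=0.2):
    """Penalize attention weights far from the diagonal.
    attn: (decoder_steps T, encoder_steps N) attention matrix."""
    T, N = attn.shape
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1) / T   # (T, 1)
    n = torch.arange(N, dtype=torch.float32).unsqueeze(0) / N   # (1, N)
    W = 1.0 - torch.exp(-((n - t) ** 2) / (2 * g ** 2))         # ~0 near the diagonal
    return (attn * W).mean()

def mask_off_diagonal(attn, width=0.2):
    """Inference-time trick: zero attention weights outside a diagonal band,
    then renormalize. `width` is the band half-width as a fraction of the sequence."""
    T, N = attn.shape
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1) / T
    n = torch.arange(N, dtype=torch.float32).unsqueeze(0) / N
    band = (torch.abs(n - t) <= width).float()
    masked = attn * band
    return masked / (masked.sum(dim=1, keepdim=True) + 1e-8)
```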

FastSpeech: TTS without Seq2Seq

Fast Speech and Duration Informed Attention are similar ideas proposed by different teams at the same time.

  • Tacotron uses a Seq2Seq model with attention because the input and output sequences have different lengths. A drawback of such models is that some segments may be skipped or repeated. FastSpeech is a model that does away with Seq2Seq altogether.

  • It still has an encoder that converts characters into embeddings. The difference is that between the encoder and the decoder there is a separately trained Duration module that predicts how long each character should last. Given a character embedding, the Duration module outputs its length; if the output is 2, for example, the current character embedding is copied twice (a length-regulator sketch follows this list).

  • How do we train this model? The Duration module's output is not a continuous number but an integer (a discrete count), which is not differentiable. The approach taken is to train the Duration module separately against ground-truth durations, and while training the rest of the model the ground-truth durations are also used to expand the embeddings.

  • Where do the ground-truth durations come from? In the FastSpeech paper, a Tacotron is trained first, and the correspondence between characters and acoustic-feature lengths is extracted from Tacotron's attention. In principle, the alignment could also be obtained from a speech recognition system.

  • How does the model perform? In the original paper's experiments, the advantage of FastSpeech's Duration module is that it avoids the pronunciation flaws of Tacotron or Transformer-based TTS, such as stuttering, skipped words, and mispronunciations. Intuitively, the explicit duration constraint prevents these errors.

  • In fairness, Tacotron's performance is not as bad as the paper suggests. The many errors arise because the test sentences are special cases, for example sentences with many repeated words, or reading out a URL. Since such material is rare in Tacotron's training data, it handles these cases less well.
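The length-regulator sketch referenced above: each character embedding is repeated according to its (predicted or ground-truth) duration so that the expanded sequence matches the spectrogram length. The shapes and durations here are illustrative.

```python
import torch

def length_regulator(char_embeddings, durations):
    """Repeat each character embedding `duration` times.
    char_embeddings: (seq_len, dim); durations: (seq_len,) integer frame counts."""
    return torch.repeat_interleave(char_embeddings, durations, dim=0)

emb = torch.randn(3, 4)               # 3 characters, 4-dim embeddings
dur = torch.tensor([2, 1, 3])         # durations in frames
expanded = length_regulator(emb, dur) # shape (6, 4): ready for the decoder
```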

Dual Learning: ASR & TTS

Dual Learning: a machine learning framework designed to improve performance through two mutually complementary learning processes. It is usually applied to sequence-to-sequence tasks such as machine translation and dialogue generation.

  • It is easy to see that ASR and TTS are inverse tasks: ASR converts speech into text, and TTS converts text into speech. Chained together, they form a speech chain. What is this chain good for? We can use it for dual learning, letting the two models improve each other.

  • How does the two-way learning work? If we have speech data without transcripts, we can feed the speech to the ASR to produce text, then feed that text to the TTS to produce speech again; the training objective is to make the final speech as close as possible to the original speech. The reverse works for text without corresponding audio (a minimal sketch follows this list).

  • How well does it work? In the experiments, the two models were first trained on paired data, and then unpaired data was added with dual learning; both models improved to some extent.
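A conceptual sketch of one dual-learning step with unpaired audio, as referenced above. The `asr`, `tts`, and `feature_distance` callables are assumptions standing in for real models and a suitable reconstruction loss.

```python
def speech_chain_step(audio, asr, tts, feature_distance):
    """One cycle of the speech chain on unlabeled audio."""
    text_hat = asr(audio)                        # unlabeled speech -> predicted text
    audio_hat = tts(text_hat)                    # predicted text -> re-synthesized speech
    return feature_distance(audio_hat, audio)    # push the reconstruction toward the input

# Symmetrically, for unpaired text: compare the text with asr(tts(text)).
```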

5. Controllable TTS (Controllable Text-to-Speech)

  • What is controllable TTS, and why do we need it? The models discussed so far only care about the content of the speech, i.e., what is being said. But speech carries much more than that: who is speaking (timbre) and how it is spoken (speed, emotion, cadence, and so on). We therefore want finer control over the generated speech to meet specific needs or achieve specific effects.

Simple classification

Therefore, research directions can be divided into two major categories:

  • Who is saying:

    • That is, synthesizing the voice of a specific person, a technology also known as voice cloning.

    • If many recordings of the target speaker are available, we can simply fine-tune the model on them. In most cases, however, it is hard to obtain enough high-quality data for the target speaker, so the question becomes: how do we do it with only a small amount of data?

  • How to say:

    • In the same sentence, the intonation, stress and rhythm can be controlled

    • These properties are collectively called prosody, though it is quite hard to define. A simple definition, shown in the figure, defines it negatively: prosody is whatever is not the content of the signal, not the speaker's identity, and not the reverberation of the environment.

TTS vs VC (Voice Conversion)
  • The original model takes text and outputs the corresponding speech. Here we add a reference audio and try to make the model imitate the speaking style of that reference.

  • This is somewhat like the voice conversion task, which also extracts timbre information from a piece of reference audio so that the model outputs speech in that timbre. The two are indeed very similar, especially in how they are trained. The difference is that in TTS the content of the output is determined by the given text, whereas in voice conversion the content is determined by another piece of input speech.

How to do it
  • How should we train? In theory, we use the ground-truth speech itself as the reference input of the controllable TTS model, extract the style information from it, add the text, and output the corresponding speech. This makes controllable TTS look a bit like an autoencoder, just with text information added.

  • But wait: if the answer is given directly as input during training, the model can see the answer. Couldn't it simply ignore the text and copy the reference audio straight to the output? Preventing this is a major research topic in controllable TTS: making the model extract only speaker information from the reference audio, and not content information.

  • So how is this done? One way is to use a speaker embedding model: a model that takes a sound signal and outputs a vector containing only speaker information. This model is trained in advance, and its parameters are frozen while training the TTS model, so they receive no weight updates. This forces the TTS model to read the text in order to produce the final sentence (a minimal freezing sketch follows).
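A minimal PyTorch sketch of freezing a pre-trained speaker encoder so that it only supplies speaker information and is not updated during TTS training. The loader function and variable names are illustrative placeholders.

```python
import torch.nn as nn

def freeze(module: nn.Module):
    """Put a pre-trained module in eval mode and exclude it from gradient updates."""
    module.eval()
    for p in module.parameters():
        p.requires_grad = False

# Usage sketch (names are hypothetical):
#   speaker_encoder = load_pretrained_speaker_encoder()
#   freeze(speaker_encoder)
#   spk_vec = speaker_encoder(reference_audio)   # timbre-only embedding fed to the TTS model
```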

GST-Tacotron

GST is the abbreviation of Global Style Tokens.

  • How does this model prevent copying? As the name suggests, it is an enhanced Tacotron. The text is first turned into embeddings by an encoder, while the reference speech is turned into a style embedding by a feature extractor. Notably, the feature extractor here is not pre-trained; it is trained jointly with the whole TTS model.

  • The feature extractor outputs a single vector. We copy this vector to the same length as the text embeddings, concatenate the two (addition or other combinations also work), and then apply attention; from there the process is the same as in the original Tacotron.

  • Then won't the feature extractor simply copy the content information as well? The subtlety is that GST-Tacotron's feature extractor has a special design.

  • Zooming into the feature extractor: its encoder encodes the reference speech into a vector, but this vector is not used directly as the output; it serves only as attention weights. The feature extractor also holds a pre-defined set of vectors, which can be understood as the extractor's learnable "parameters", themselves learned during training. The final output of the feature extractor is the weighted sum (linear combination) of this vector set (see the sketch after this list).

  • This way, the only thing the reference encoder can do is control the attention weights over these feature parameters; it can no longer smuggle out content information. And these feature parameters are exactly the Global Style Tokens.

  • In fact, they are called style tokens precisely because each vector in the learned set corresponds to a way of speaking. If you increase the attention weight on a particular token, the output changes style along the corresponding dimension, i.e., some prosodic element changes. For example, some style tokens correspond to the pitch of the voice, and others to the speaking rate.
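The sketch referenced above shows the core idea of the GST layer in simplified, single-head form (the real model uses multi-head attention; token count, dimensions, and the dot-product scoring here are illustrative assumptions): the reference encoder can only produce attention weights over a small learned token set, so the returned style embedding is a weighted sum of tokens and cannot carry the reference audio's content.

```python
import torch
import torch.nn as nn

class GlobalStyleTokens(nn.Module):
    """Simplified GST layer: reference embedding -> attention over learned style tokens."""
    def __init__(self, n_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim))  # the "style tokens"
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):              # (batch, ref_dim) from the reference encoder
        query = self.query_proj(ref_embedding)     # (batch, token_dim)
        scores = query @ self.tokens.t()           # (batch, n_tokens)
        weights = torch.softmax(scores, dim=-1)    # attention over the style tokens
        return weights @ self.tokens               # (batch, token_dim) style embedding
```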

Two-Stage Training

Two-Stage Training: a machine learning strategy often used for complex models or tasks. It improves performance and convergence speed by splitting training into stages, usually a pre-training stage followed by a fine-tuning stage.

  1. Pre-training Stage: in the first stage, the model is trained on a related but relatively simple task, or performs unsupervised learning on a large-scale dataset. The goal is for the model to learn general feature representations or patterns, improving its initial performance and generalization ability.

  2. Fine-tuning Stage: in the second stage, the model is fine-tuned on the specific task or dataset to adapt to the target task. Here the parameters are further adjusted using the target task's labeled data or a supervised objective, so that the model performs better on that task.

The advantage of two-stage training is that the pre-training stage can provide better initialization parameters for the model, making it easier to converge to a better local optimal solution, while reducing the dependence on large-scale annotated data.

  • Another way to block copying is two-stage training. In the first stage we train normally: the reference audio and the ground truth are the same utterance, i.e., the input text matches the reference. In the second stage we deliberately make the text different from the reference audio. The question is where the ground truth comes from: we do not know what "good bye" would sound like spoken in the style of "I love you". What then?

  • We can attach a well-trained ASR after the TTS to convert the output speech back into text, and then minimize the gap between that text and the text originally fed in. This way the TTS can no longer take its content from the reference speech.

  • Furthermore, if both the TTS and the ASR are Seq2Seq models with attention, we can also enforce attention consistency: in the TTS, each input letter attends to the sound it produces, so we expect the ASR, when hearing the same sound, to attend back to the same letters. In other words, we keep the two attention weight matrices consistent.

Supplementary notes:

Attention Consistency: refers to the property of ensuring that the attention weights of multiple copies or multiple views remain consistent or stable in a model structure. This concept is often associated with multimodal learning or multimodal attention mechanisms.

When a model processes multiple input modalities (such as images, text, audio, etc.) simultaneously, it may adopt a multi-modal attention mechanism to determine the correlation and importance between different modalities. In this case, attentional consistency becomes important.

Specifically, attention consistency usually includes the following aspects:

  1. Cross-Modal Consistency: in multi-modal tasks, the attention weights between different modalities should be consistent; that is, attention over different representations of the same input should be stable. For example, in image caption generation, the attention between the image's visual features and the corresponding text description should be consistent.

  2. Temporal Consistency: for time-series data, such as video or audio, the attention weights across time steps should remain stable or vary smoothly. This ensures the model keeps focusing on similar features or contextual information over time.

  3. Spatial Consistency: for spatial data, such as different regions or objects in an image, the model's attention should be consistent across parts, which helps ensure that the model attends reasonably to both global and local information.

Maintaining attention consistency can improve the model's ability to understand and process multi-modal data, helping to generate more accurate and consistent prediction results. In order to achieve this consistency, researchers usually design specific loss functions, constraints, or regularization terms to ensure that the model maintains the stability and consistency of attention during the learning process.

6. Supplementary questions

  • How do we know that what GST-Tacotron learns is not speaker identity, but prosody?

    • Because the GST dataset contains only one speaker, there are no inter-speaker differences, only differences in cadence.

    • To do better, we would need to disentangle speaker identity and prosody. In a speech dataset, we need to know which sentences were spoken by the same person; from those sentences we extract the common style vectors, and after removing these shared components, what remains is a vector representing the prosody information.

  • GST-Tacotron uses only a single vector to represent the speaking style. Is that enough to capture the cadence information?

    • A single vector has limited representational power, so this may not be a good solution. Whether speaking style can be captured by one vector needs to be studied experimentally; it may be too coarse, losing information that never gets expressed.

    • Therefore, some controllable TTS models use a sequence of vectors, one per position of the input sequence, so that each small segment of the sound signal has its own vector representation. Perhaps only then can we truly control the prosody of a sentence. This remains an open question.
