Speech Synthesis Model Cheat Sheet (1)

Foreword

Speech is an increasingly popular field. Given a piece of text, we want a machine to read it aloud, which calls for speech synthesis technology: Text-to-Speech, or TTS for short. Here are some interesting models I have come across.

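One-stage speech synthesis is usually called end-to-end: text goes in and a waveform comes out.
Two-stage speech synthesis splits the work into two steps:
stage 1: text -> mel spectrogram / linear spectrogram, predicted by an acoustic model. The target features are computed from recorded audio: waveform -(FFT)-> linear spectrogram -(mel filtering)-> mel spectrogram.
stage 2: generate a waveform (audio) from the mel/linear spectrogram, usually with a vocoder.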
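
To make the FFT and filtering steps concrete, here is a minimal NumPy sketch of how the linear and mel spectrograms of a waveform can be computed. It covers only the feature extraction side, not the text-to-spectrogram model or the vocoder. The function names, the 1024-point FFT, the hop of 256 samples, and the 80 mel bands are my own assumptions for illustration, not values taken from any specific paper or codebase.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, center, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def linear_spectrogram(wav, n_fft=1024, hop=256):
    """Magnitude STFT, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Toy input (assumption): one second of a 440 Hz sine at 22050 Hz.
sr, n_fft = 22050, 1024
wav = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)

lin = linear_spectrogram(wav, n_fft=n_fft)   # waveform -(FFT)-> linear spectrogram
mel = mel_filterbank(sr, n_fft, 80) @ lin    # -(mel filtering)-> mel spectrogram
peak_hz = lin.mean(axis=1).argmax() * sr / n_fft
print(lin.shape, mel.shape, peak_hz)
```

The strongest frequency bin should land close to 440 Hz, which is a quick sanity check that the framing and FFT are wired up correctly.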


Paper

VITS

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
ICML 2021
Paper: https://arxiv.org/abs/2106.06103
Code: https://github.com/jaywalnut310/vits

Conditional VAE + flow + GAN
For the flow part, the two papers VFlow and Flow++ are worth a look.

I found two sets of paper notes on Zhihu:
"Read the Classics in Detail: VITS, Conditional Variational Autoencoder with Adversarial Learning for Speech Synthesis"
"[Paper Notes] VITS" (OlaWod)

The monotonic alignment search (MAS) algorithm used in VITS was introduced in the Glow-TTS paper. Glow-TTS is a flow-based model; official code: https://github.com/jaywalnut310/glow-tts
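
As a rough illustration of monotonic alignment search, here is a small NumPy sketch of the dynamic program: given a matrix of log-likelihoods between text tokens and mel frames, it finds the most likely alignment in which every frame maps to one token, tokens are never skipped, and the token index never goes backwards. This is my own simplified version for reading purposes, not the official implementation, and the function name and toy matrix are assumptions.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """log_p[i, j]: log-likelihood of mel frame j under text token i.
    Returns path, where path[j] is the token aligned to frame j."""
    n_text, n_frames = log_p.shape
    assert n_frames >= n_text, "need at least one frame per token"

    # Q[i, j]: best total score of an alignment ending at (token i, frame j).
    Q = np.full((n_text, n_frames), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = Q[i, j - 1]                               # same token as previous frame
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # move to the next token
            Q[i, j] = max(stay, advance) + log_p[i, j]

    # Backtrack from the last token at the last frame.
    path = np.zeros(n_frames, dtype=int)
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    return path

# Toy example (assumption): 3 tokens, 6 frames, with an obvious best alignment.
log_p = np.full((3, 6), -5.0)
log_p[0, 0:2] = 0.0
log_p[1, 2:5] = 0.0
log_p[2, 5] = 0.0

path = monotonic_alignment_search(log_p)
durations = np.bincount(path, minlength=3)   # number of frames per token
print(path, durations)
```

The per-token frame counts obtained from the path are the kind of duration targets a duration predictor can be trained on.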

Paper walkthrough: "The best speech synthesis VITS model based on cVAE+Flow+GAN" (bilibili)
Code walkthrough: "The best speech synthesis VITS model code based on cVAE+Flow+GAN, explained line by line" (bilibili)

For an introduction to flows, see: "Neural Network (15) Normalizing Flow and INN"

Code implementations of some common flow models are collected here: https://github.com/janosh/awesome-normalizing-flows
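
To give a feel for what a flow layer does, here is a minimal NumPy sketch of a generic affine coupling layer: half of the input passes through unchanged and parameterizes a scale and shift for the other half, so the layer is exactly invertible and its log-determinant is cheap to compute. It is a toy with random, untrained weights, not the code from VITS or Glow-TTS, and the class name and layer sizes are assumptions.

```python
import numpy as np

class AffineCoupling:
    """Toy affine coupling layer with a tiny random (untrained) network."""

    def __init__(self, dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.half = dim // 2
        self.W1 = rng.normal(scale=0.1, size=(self.half, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, 2 * (dim - self.half)))

    def _params(self, xa):
        # Scale and shift depend only on the untouched half.
        out = np.tanh(xa @ self.W1) @ self.W2
        log_s, t = np.split(out, 2, axis=-1)
        return log_s, t

    def forward(self, x):
        xa, xb = x[..., :self.half], x[..., self.half:]
        log_s, t = self._params(xa)
        yb = xb * np.exp(log_s) + t
        log_det = log_s.sum(axis=-1)   # log |det Jacobian| of the transform
        return np.concatenate([xa, yb], axis=-1), log_det

    def inverse(self, y):
        ya, yb = y[..., :self.half], y[..., self.half:]
        log_s, t = self._params(ya)    # ya equals xa, so the parameters match
        xb = (yb - t) * np.exp(-log_s)
        return np.concatenate([ya, xb], axis=-1)

layer = AffineCoupling(dim=4)
x = np.random.default_rng(1).normal(size=(5, 4))
y, log_det = layer.forward(x)
x_rec = layer.inverse(y)
print(np.allclose(x, x_rec), log_det.shape)
```

Real flow models stack many such layers and swap which half is transformed between layers, so that every dimension eventually gets updated.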


Origin blog.csdn.net/weixin_43850253/article/details/126085711