1. Summary
VITS theoretical basis: https://github.com/jaywalnut310/vits
VITS project implementation: https://github.com/rhasspy/piper
VITS one-click voice cloning for Chinese, English, and Japanese: Plachtaa/VITS-fast-fine-tuning
VITS Chinese model, high quality, chunked streaming inference: PlayVoice/vits_chinese
VITS singing voice conversion, multi-speaker model: PlayVoice/so-vits-svc-5.0
2. Origin
2.1 VITS - Official Version v1.0
June 11, 2021: VITS paper and code released:
Paper: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Code: https://github.com/jaywalnut310/vits
Institution: Kakao Enterprise
Conference: ICML 2021
Authors' other work: HiFi-GAN, Glow-TTS
2.2 PITS - Official Version v2.0
February 27, 2023: end-to-end pitch-controllable TTS with F0-free inference
Paper: PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS
Institution: VITS team
Code: https://github.com/anonymous-pits/pits
Objective: building on VITS, PITS combines a Yingram encoder, a Yingram decoder, and adversarially trained pitch-shifted synthesis to achieve pitch controllability.
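The Yingram representation in PITS is built on the YIN pitch algorithm's cumulative mean normalized difference function. Below is a minimal numpy sketch of that core quantity only (not the paper's exact Yingram, which stacks this lag-domain evidence across midi-scale bins); the toy signal and sizes are illustrative:

```python
import numpy as np

def cmnd(x, tau_max):
    # YIN's difference function d(tau) and its cumulative mean normalized
    # form d'(tau): pitch shows up as a deep dip at the period lag, so a
    # model can reason about pitch without ever estimating F0 explicitly.
    d = np.array([0.0] + [np.sum((x[:-tau] - x[tau:]) ** 2)
                          for tau in range(1, tau_max)])
    out = np.ones(tau_max)
    out[1:] = d[1:] * np.arange(1, tau_max) / np.maximum(np.cumsum(d[1:]), 1e-12)
    return out

sr, f0 = 8000, 200.0                          # toy signal: a 200 Hz sine
x = np.sin(2 * np.pi * f0 * np.arange(2048) / sr)
lag = int(np.argmin(cmnd(x, 60)))             # deepest dip = period in samples
```

For this sine the dip lands at sr/f0 = 40 samples, i.e. the pitch period, which is the kind of F0-free pitch evidence the Yingram encoder consumes.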
3. Official evaluation
October 15, 2021: VITS evaluation paper published:
Paper: ESPnet2-TTS: Extending the Edge of TTS Research
Code: https://github.com/espnet/espnet/tree/master/espnet2/gan_tts/vits
Institutions: the open-source ESPnet organization, Carnegie Mellon University, the University of Tokyo, etc.
Purpose: to evaluate state-of-the-art speech synthesis systems, especially VITS; 48 of the 152 pre-trained models (ASR+TTS) provided by ESPnet are VITS speech synthesis models.
4. Excellent extension
4.1 YourTTS
December 4, 2021: VITS-related paper:
Paper: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
Demo: https://edresson.github.io/YourTTS/
Institution: the open-source organization coqui-ai/TTS
Purpose: cross-language speech synthesis and voice conversion based on VITS
4.2 Typical Application Scenarios of VoiceMe
March 30, 2022 VoiceMe: Personalized Voice Generation in TTS
Paper: VoiceMe: Personalized Voice Generation in TTS
Code: https://github.com/polvanrijn/VoiceMe
Institution: University of Cambridge etc
Objective: to adapt TTS models using speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet). Demonstrates that users can create voices that closely match human faces, artistic portraits, and cartoon photos; lip-syncing is done with Wav2Lip.
5. Model optimization
5.1 Model Acceleration
March 30, 2022: Nix-TTS, an accelerated VITS model
Paper: Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation
Code: https://github.com/choiHkk/nix-tts
Demo: https://github.com/rendchevi/nix-tts
October 31, 2022: accelerated VITS
Paper: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
Code: https://github.com/MasayaKawamura/MB-iSTFT-VITS
Institutions: the University of Tokyo and LINE Corp., Japan.
Purpose: 4.1x faster than VITS with no loss in sound quality: 1) replaces the most computationally expensive convolutions with a simple iSTFT (2x speedup); 2) uses PQMF-based multi-band generation to produce waveform sub-bands in parallel.
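The iSTFT trick can be illustrated with a minimal overlap-add inverse STFT in numpy. This is a sketch of the signal-processing step only, not the MB-iSTFT-VITS decoder: in the real model the network predicts the complex spectrogram frames, and this fixed, cheap transform replaces the final stack of upsampling convolutions.

```python
import numpy as np

def stft(x, n_fft, hop):
    win = np.hanning(n_fft)
    frames = 1 + (len(x) - n_fft) // hop
    return np.stack([np.fft.rfft(win * x[t*hop:t*hop+n_fft])
                     for t in range(frames)], axis=1)

def istft(spec, n_fft, hop):
    # Windowed overlap-add synthesis with window-power normalization:
    # cheap and parameter-free, so a decoder that outputs spectrogram
    # frames can skip its most expensive upsampling convolutions.
    win = np.hanning(n_fft)
    frames = spec.shape[1]
    out = np.zeros((frames - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for t in range(frames):
        out[t*hop:t*hop+n_fft] += win * np.fft.irfft(spec[:, t], n=n_fft)
        norm[t*hop:t*hop+n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

# Round trip: interior samples are reconstructed essentially exactly.
x = np.sin(2 * np.pi * 440 * np.arange(4096) / 22050)
y = istft(stft(x, n_fft=256, hop=64), n_fft=256, hop=64)
```

Only the first and last window's worth of samples deviate (incomplete overlap at the edges), which is why streaming implementations keep a small frame margin.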
5.2 Unlabeled training
October 6, 2022: unlabeled training
Paper: Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
Code: https://github.com/hcy71o/TransferTTS
Institution: Samsung, etc.
Purpose: uses a large-scale unlabeled corpus to train TTS, leveraging wav2vec 2.0.
5.3 C++ support
January 2023: VITS ONNX inference code
Code: https://github.com/rhasspy/piper
Institution: Rhasspy
Purpose: exports trained VITS models to ONNX; C++ inference code; provides installation packages and pre-trained models; supports desktop Linux and Raspberry Pi 4.
6. Voice Conversion
6.1 FreeVC
October 28, 2022: voice conversion based on the VITS architecture
Paper: FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
Code: https://github.com/olawod/freevc
Objective: adopts the end-to-end VITS framework for high-quality waveform reconstruction and proposes a strategy for extracting clean content information without text annotation: an information bottleneck imposed on WavLM features disentangles the content, and a data-augmentation method based on spectrogram resizing improves the purity of the extracted content information.
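The spectrogram-resizing augmentation can be sketched as a vertical stretch or squeeze of the mel spectrogram, then cropping or padding back to the original number of bins, which perturbs timbre while leaving timing (and thus content) intact. A minimal numpy sketch, not FreeVC's exact implementation; the pad-with-top-bin choice is one plausible variant:

```python
import numpy as np

def sr_augment(mel, ratio):
    # Stretch (ratio > 1) or squeeze (ratio < 1) the frequency axis via
    # linear interpolation, then crop/pad back to n_mels so the tensor
    # shape fed to the content encoder stays fixed.
    n_mels, frames = mel.shape
    target = max(1, int(round(n_mels * ratio)))
    grid = np.linspace(0, n_mels - 1, target)
    resized = np.stack([np.interp(grid, np.arange(n_mels), mel[:, t])
                        for t in range(frames)], axis=1)
    if target >= n_mels:
        return resized[:n_mels]                       # crop back down
    pad = np.tile(resized[-1:, :], (n_mels - target, 1))
    return np.concatenate([resized, pad], axis=0)     # pad with top bin

mel = np.random.default_rng(0).random((80, 120))      # fake 80-bin mel
aug = sr_augment(mel, ratio=1.2)                      # same shape, new "timbre"
```

Training the content extractor on such distorted copies forces it to discard speaker-dependent spectral detail, which is the stated goal of the augmentation.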
6.2 QuickVC
February 2023: QuickVC, voice conversion based on VITS
Paper: QuickVC: Many-to-any Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion
Code: https://github.com/quickvc/QuickVoice-Conversion
Purpose: SoftVC + VITS + iSTFT
7. Voice Cloning
7.1 HierSpeech
January 2023: voice cloning
Paper: HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis
Institution: Korea University
Code: https://github.com/CODEJIN/HierSpeech
Objective: uses self-supervised speech representations as additional linguistic representations to bridge the information gap between text and speech. HierSpeech achieves a +0.303 comparative mean opinion score (CMOS) improvement, and the phoneme error rate drops from 9.16% to 5.78%. The self-supervised representations can also be used to adapt to new speakers without transcriptions.
8. Zero-shot voice cloning
8.1 SNAC - unofficial implementation
December 1, 2022: zero-shot voice cloning
Paper: SNAC: Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech
Institution: Seoul National University & Samsung
Code: https://github.com/hcy71o/SNAC
Homepage: https://byoungjinchoi.github.io/snac/
Purpose: based on Microsoft's speaker-adapter approach; embeds the adapter in VITS's flow layers to achieve zero-shot voice cloning; improves previous speaker conditioning by introducing a speaker-normalized affine coupling (SNAC) layer, which allows speech for unseen speakers to be synthesized in a zero-shot fashion using normalization-based conditioning.
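The core idea can be sketched as a flow coupling step whose conditioning branch is normalized by per-speaker statistics, so the transform stays well-scaled for unseen speakers. This is a toy additive-coupling version with random stand-in weights and hypothetical sizes, not the paper's actual layer (which also predicts a scale):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1        # stand-in for the coupling network

def snac_coupling(x, spk_mean, spk_std, inverse=False):
    # Split channels: the first half, normalized by speaker statistics
    # (the "speaker-normalized" part of SNAC), conditions a shift applied
    # to the second half. The step is exactly invertible either way.
    xa, xb = x[:4], x[4:]
    shift = W @ ((xa - spk_mean) / spk_std)
    yb = xb - shift if inverse else xb + shift
    return np.concatenate([xa, yb])

x = rng.normal(size=8)
mean, std = rng.normal(size=4), np.abs(rng.normal(size=4)) + 0.5
y = snac_coupling(x, mean, std)                       # forward pass
x_rec = snac_coupling(y, mean, std, inverse=True)     # exact inverse
```

Because the conditioning input is whitened per speaker, swapping in an unseen speaker's statistics at inference keeps the coupling network's inputs in-distribution, which is what enables the zero-shot behavior.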
8.2 NaturalSpeech 2
April 1, 2023: zero-shot voice cloning
Paper: NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
Organization: Microsoft
Code: https://github.com/lucidrains/naturalspeech2-pytorch
Code: https://github.com/rishikksh20/NaturalSpeech2
Code: https://github.com/CODEJIN/NaturalSpeech2
Code: https://github.com/adelacvg/NS2VC
Purpose: captures the diversity of human speech, such as speaker identity, prosody, and styles including singing; uses a neural audio codec with a residual vector quantizer to obtain quantized latent vectors, which a diffusion model generates conditioned on the text input; designs a speech-prompting mechanism to facilitate in-context learning for the diffusion model and for duration and pitch prediction; enables novel zero-shot singing synthesis using only a speech prompt.
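The residual vector quantizer mentioned above can be sketched in a few lines: each stage quantizes whatever the previous stages left over, and the sum of the selected codes is the quantized latent. Toy random codebooks with hypothetical sizes, not the codec's trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_stages, n_codes, dim = 4, 32, 8        # illustrative sizes
codebooks = [rng.normal(size=(n_codes, dim)) for _ in range(n_stages)]

def rvq(x, codebooks):
    # Each stage picks the code nearest to the current residual and
    # subtracts it; the per-stage indices are what a codec transmits,
    # and their code-vector sum is the quantized latent.
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    quantized = x - residual             # = sum of the selected codes
    return codes, quantized

x = rng.normal(size=dim)
codes, q = rvq(x, codebooks)
```

Stacking stages this way lets a small codebook per stage approximate the latent at a fraction of the bitrate a single huge codebook would need, which is why NaturalSpeech 2 can treat the quantized latents as diffusion targets.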
8.3 Automatic loss balancing
May 2023: zero-shot VITS
Paper: Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis
Code: https://github.com/cnaigithub/Auto_Tuning_Zeroshot_TTS_and_VC
Purpose: designs a zero-shot VITS framework; VITS has many loss terms whose balance strongly affects quality, so an automatic loss-balancing scheme is proposed.
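One simple way to see why automatic balancing helps: VITS mixes losses of very different raw magnitudes (e.g. a mel reconstruction term vs. a KL term), so scaling each by the inverse of its running magnitude keeps any one term from dominating. This is a generic sketch of that idea, not the paper's actual algorithm:

```python
import numpy as np

class AutoBalance:
    # Tracks an exponential moving average of each loss term's magnitude
    # and divides by it, so every term contributes at roughly unit scale
    # without hand-tuned weights.
    def __init__(self, n_terms, beta=0.9):
        self.ema = np.ones(n_terms)
        self.beta = beta

    def __call__(self, losses):
        losses = np.asarray(losses, dtype=float)
        self.ema = self.beta * self.ema + (1 - self.beta) * np.abs(losses)
        return float(np.sum(losses / (self.ema + 1e-8)))

bal = AutoBalance(2)
for _ in range(300):                     # e.g. mel loss ~100, KL loss ~0.01
    total = bal(np.array([100.0, 0.01]))
```

After the averages settle, each weighted term sits near 1 regardless of its raw scale, so the total hovers near the number of terms; the paper's contribution is doing this tuning without a hyper-parameter search.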