Summary of VITS open source projects (updated to 2023-06-01)

1. Summary

VITS original implementation (paper code): https://github.com/jaywalnut310/vits

VITS production implementation: https://github.com/rhasspy/piper (a fast, local neural text-to-speech system)

VITS one-click voice cloning for Chinese, English, and Japanese: Plachtaa/VITS-fast-fine-tuning

VITS Chinese-language model, high quality, with chunked streaming inference: PlayVoice/vits_chinese

VITS singing voice conversion, multi-speaker model: PlayVoice/so-vits-svc-5.0

2. Origin

2.1 VITS - Official Version v1.0

June 11, 2021: the VITS paper and code are released.

Paper: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Code: https://github.com/jaywalnut310/vits

Institution: Kakao Enterprise and KAIST (Korea Advanced Institute of Science and Technology)

Conference: ICML 2021

The authors' other papers: HiFi-GAN, Glow-TTS

2.2 PITS - Official Version v2.0

February 27, 2023: end-to-end pitch-controllable TTS with inference that requires no fundamental frequency (F0)

Paper: PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS

Institution: the VITS team

Code: https://github.com/anonymous-pits/pits

Objective: building on VITS, PITS combines a Yingram encoder, a Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch controllability.
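The Yingram feature is derived from the YIN pitch tracker's cumulative mean-normalized difference function. As a rough illustration, the core per-lag computation looks like the following (a simplified sketch; the actual Yingram maps these values onto a midi-like scale rather than raw lags):

```python
import numpy as np

def cmnd(frame, tau_max):
    """Cumulative mean-normalized difference function from the YIN pitch
    tracker -- the quantity PITS's Yingram is built on. `frame` is one
    analysis window of audio; the output dips toward 0 at lags that
    match the pitch period."""
    n = len(frame)
    d = np.zeros(tau_max)
    for tau in range(1, tau_max):
        diff = frame[:n - tau] - frame[tau:]
        d[tau] = np.dot(diff, diff)          # squared difference at lag tau
    out = np.ones(tau_max)                   # YIN defines the value at lag 0 as 1
    cum = np.cumsum(d[1:])
    out[1:] = d[1:] * np.arange(1, tau_max) / np.maximum(cum, 1e-12)
    return out
```

A dip at lag tau indicates a period of roughly tau samples; PITS feeds a stack of such values to its encoder instead of a single explicit F0 estimate.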

3. Official evaluation

October 15, 2021: a VITS evaluation paper is published.

Paper: ESPnet2-TTS: Extending the Edge of TTS Research

Code: https://github.com/espnet/espnet/tree/master/espnet2/gan_tts/vits

Institutions: the open-source ESPnet organization, Carnegie Mellon University, University of Tokyo, and others

Purpose: to evaluate state-of-the-art speech synthesis systems, VITS in particular; 48 of the 152 pre-trained models (ASR + TTS) provided by ESPnet are VITS speech synthesis models.

4. Notable extensions

4.1 YourTTS

December 4, 2021: a VITS-derived paper.

Paper: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone

Demo: https://edresson.github.io/YourTTS/

Organization: the open-source project coqui-ai/TTS

Purpose: cross-lingual speech synthesis and voice conversion based on VITS

4.2 VoiceMe

March 30, 2022: VoiceMe, personalized voice generation in TTS

Paper: VoiceMe: Personalized voice generation in TTS

Code: https://github.com/polvanrijn/VoiceMe

Institution: University of Cambridge and others

Objective: to tune TTS voices using speaker embeddings from a state-of-the-art speaker-verification model (SpeakerNet). The authors demonstrate that users can create voices matching human faces, artistic portraits, and cartoon images, with lip-syncing via Wav2Lip.

5. Model optimization

5.1 Model Acceleration

March 30, 2022: Nix-TTS, accelerating VITS through distillation

Paper: Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

Code: https://github.com/choiHkk/nix-tts

Demo: https://github.com/rendchevi/nix-tts

October 31, 2022: MB-iSTFT-VITS, an accelerated VITS

Paper: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

Code: https://github.com/MasayaKawamura/MB-iSTFT-VITS

Institutions: University of Tokyo and LINE Corp., Japan

Purpose: 4.1× faster than VITS with no loss in sound quality; 1) replaces the most computationally expensive upsampling convolutions with a simple iSTFT (about a 2× speedup); 2) uses PQMF-based multi-band generation to produce sub-band waveforms in parallel.
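The idea of letting a cheap inverse STFT do the final upsampling can be sketched as follows. This is a minimal NumPy illustration of the overlap-add reconstruction step only, not the MB-iSTFT-VITS code, which additionally applies a PQMF synthesis filter bank; the frame/hop sizes are illustrative:

```python
import numpy as np

def istft_overlap_add(mag, phase, hop=256, win_len=1024):
    """Reconstruct a waveform from predicted magnitude/phase frames via
    inverse FFT and windowed overlap-add -- the inexpensive final stage
    that replaces most of the decoder's upsampling convolutions.
    mag, phase: arrays of shape (num_frames, win_len // 2 + 1)."""
    spec = mag * np.exp(1j * phase)            # complex spectrum per frame
    frames = np.fft.irfft(spec, n=win_len)     # (num_frames, win_len)
    window = np.hanning(win_len)
    out = np.zeros(hop * (len(frames) - 1) + win_len)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        s = i * hop
        out[s:s + win_len] += f * window       # overlap-add windowed frames
        norm[s:s + win_len] += window ** 2     # COLA normalization term
    return out / np.maximum(norm, 1e-8)
```

Because the network only has to predict per-frame spectra rather than raw samples, the sample-rate-level computation collapses into FFTs.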

5.2 Unlabeled training

October 6, 2022: training with unlabeled speech

Paper: Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Code: https://github.com/hcy71o/TransferTTS

Organization: Samsung and others

Purpose: use a large-scale unlabeled speech corpus to train TTS, building on wav2vec 2.0

5.3 C++ support

January 2023: VITS ONNX inference code

Code: https://github.com/rhasspy/piper

Organization: Rhasspy

Purpose: exports trained VITS models to ONNX; provides C++ inference code, installation packages, and pre-trained models; supports desktop Linux and Raspberry Pi 4

6. Voice conversion

6.1 FreeVC

October 28, 2022: voice conversion based on the VITS architecture

Paper: FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

Code: https://github.com/olawod/freevc

Objective: the paper adopts the end-to-end VITS framework for high-quality waveform reconstruction and proposes a strategy for extracting clean content information without text annotation: an information bottleneck applied to WavLM features disentangles the content, and a data-augmentation method based on spectrogram resizing improves the purity of the extracted content information.
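The spectrogram-resizing augmentation can be sketched roughly like this. It is a simplified NumPy version based on the paper's description (stretch or compress the frequency axis, then restore the original number of bins); FreeVC's actual implementation differs in details such as adding noise when padding the top bins:

```python
import numpy as np

def sr_augment(mel, ratio):
    """Spectrogram-resize augmentation (simplified sketch): rescale the
    mel spectrogram along the frequency axis by `ratio`, then crop or
    pad back to the original bin count. This perturbs apparent vocal
    timbre while leaving the content mostly intact."""
    n_mels, n_frames = mel.shape
    target = max(1, int(round(n_mels * ratio)))
    src = np.linspace(0, n_mels - 1, target)
    # Linearly interpolate each frame's spectrum onto `target` bins.
    resized = np.stack([np.interp(src, np.arange(n_mels), mel[:, t])
                        for t in range(n_frames)], axis=1)
    if target >= n_mels:                 # stretched: keep the lower bins
        return resized[:n_mels]
    out = np.tile(resized[-1:], (n_mels, 1))   # compressed: pad top bins
    out[:target] = resized
    return out
```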

6.2 QuickVC

February 2023: QuickVC, VITS-based voice conversion

Paper: QuickVC: Many-to-any Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion

Code: https://github.com/quickvc/QuickVoice-Conversion

Purpose: combines SoftVC + VITS + iSTFT

7. Voice Cloning

7.1 HierSpeech

January 2023: voice cloning

Paper: HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis

Institution: Korea University

Code: https://github.com/CODEJIN/HierSpeech

Objective: to use self-supervised speech representations as an additional linguistic representation that bridges the information gap between text and speech. HierSpeech gains +0.303 in comparative mean opinion score (CMOS), and the phoneme error rate drops from 9.16% to 5.78%. The self-supervised representations can also be exploited to adapt to new speakers without transcriptions.

8. Zero-shot voice cloning

8.1 SNAC - unofficial implementation

December 1, 2022: zero-shot voice cloning

Paper: SNAC: Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech

Institution: Seoul National University & Samsung

Code: https://github.com/hcy71o/SNAC

Homepage: https://byoungjinchoi.github.io/snac/

Purpose: based on Microsoft's speaker-adapter approach; embeds the adapter in the flow layers of VITS to achieve zero-shot voice cloning. The previous speaker conditioning is improved by introducing a speaker-normalized affine coupling (SNAC) layer, which allows speech of unseen speakers to be synthesized zero-shot using normalization-based conditioning.
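The affine coupling idea that SNAC modifies (and that VITS's flow is built from) fits in a few lines. The sketch below is illustrative only: the scalar speaker statistics and the tiny stand-in "network" are hypothetical, not the SNAC architecture, but the invertibility property is exactly what makes such layers usable in a normalizing flow:

```python
import numpy as np

def affine_coupling(x, spk_scale, spk_shift, forward=True):
    """Minimal affine coupling step with speaker-dependent normalization.
    Half the vector conditions an affine transform of the other half;
    speaker statistics normalize the conditioning half first (in a real
    model they would come from a speaker-embedding network)."""
    xa, xb = np.split(x, 2)
    za = (xa - spk_shift) / spk_scale        # speaker-normalized condition
    log_s, t = np.tanh(za), 0.5 * za         # stand-in for a small network
    if forward:
        yb = xb * np.exp(log_s) + t          # affine transform of xb
    else:
        yb = (xb - t) * np.exp(-log_s)       # exact inverse
    return np.concatenate([xa, yb])
```

Because `xa` passes through unchanged, the inverse pass can recompute the same scale and shift, so the layer is exactly invertible regardless of the conditioning network.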

8.2 NaturalSpeech 2

April 1, 2023: zero-shot voice cloning

Paper: NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Organization: Microsoft

Code: https://github.com/lucidrains/naturalspeech2-pytorch

Code: https://github.com/rishikksh20/NaturalSpeech2

Code: https://github.com/CODEJIN/NaturalSpeech2

Code: https://github.com/adelacvg/NS2VC

Purpose: captures the diversity of human speech, such as speaker identity, prosody, and styles including singing; uses a neural audio codec with residual vector quantization to obtain quantized latent vectors, and a diffusion model to generate those latents conditioned on the text input; designs a speech-prompting mechanism to aid the diffusion model's in-context learning and the duration and pitch prediction; enables novel zero-shot singing synthesis from only a speech prompt.
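Residual vector quantization, the codec component mentioned above, works by letting each stage quantize whatever the previous stages missed, so a few small codebooks can approximate a vector closely. A minimal sketch (the codebooks here are illustrative, not the paper's learned codec):

```python
import numpy as np

def residual_vq(x, codebooks):
    """Residual vector quantization: each codebook quantizes the residual
    left by the previous stage. Returns the chosen code indices and the
    cumulative approximation of x."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    approx = np.zeros_like(x)
    codes = []
    for cb in codebooks:                     # each cb: (codebook_size, dim)
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        approx += cb[idx]                    # add this stage's contribution
        residual -= cb[idx]                  # next stage sees what remains
    return codes, approx
```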

8.3 Automatic loss balancing

May 2023: zero-shot VITS

Paper: Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

Code: https://github.com/cnaigithub/Auto_Tuning_Zeroshot_TTS_and_VC

Purpose: designs a zero-shot VITS framework; VITS has many loss terms, and their balance strongly affects quality, so an automatic loss-balancing scheme is proposed.
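To see the problem being solved: one well-known generic way to weight several losses without a manual hyper-parameter search is homoscedastic-uncertainty weighting (Kendall et al.). This is only an illustration of the general technique, not the scheme the paper itself proposes:

```python
import numpy as np

def weighted_total(losses, log_vars):
    """Uncertainty-weighted multi-task loss: each term L_i is scaled by
    exp(-s_i), where s_i is a learnable log-variance, and the +s_i term
    keeps the weights from collapsing to zero. With all s_i = 0 this
    reduces to a plain sum of the losses."""
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))
```

In training, the `log_vars` would be optimized jointly with the model, letting the balance between VITS's reconstruction, KL, duration, and adversarial terms adjust itself.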

Source: blog.csdn.net/zhangziliang09/article/details/130984558