Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

  • ICASSP 2023
  • University of Tokyo & LINE Corp.
  • Masaya Kawamura et al.
  • github-code

abstract

  • Motivation: VITS already achieves very high synthesis quality; this work aims to reach comparable quality with a smaller model and faster inference.
  • Contributions:
    • The most time-consuming part of VITS is the waveform-generation decoder (HiFi-GAN, HFG); it is replaced with iSTFTNet, which performs the frequency-domain-to-time-domain conversion via an inverse STFT.
    • Multi-band generation: each iSTFT module generates one sub-band signal, and the sub-bands are summed to produce the full-band target waveform.
    • Multi-stream generation: a trainable synthesis filter combines the sub-band signals instead of a fixed filter bank (a minimal sketch follows this list).
  • Results:
    • Real-time factor of 0.066 on an Intel Core i7 CPU, i.e., 4.1× faster than VITS.
    • Compared with distillation-based models such as Nix-TTS (teacher-student training for a smaller model), it generates better quality at the same model size, because end-to-end training avoids the quality loss introduced by distillation.
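The following is a minimal PyTorch sketch of the multi-band / multi-stream combination step, assuming each iSTFT head has already produced one sub-band waveform. The module name MultiStreamSynthesis and the hyperparameters n_bands and taps are illustrative assumptions, not the paper's or repository's exact API.

```python
import torch
import torch.nn as nn

class MultiStreamSynthesis(nn.Module):
    """Combine n_bands sub-band waveforms into one full-band waveform.

    Hypothetical sketch: a trainable synthesis filter (the multi-stream
    variant); with a fixed pseudo-QMF filter bank instead, the same
    structure implements plain multi-band generation.
    """

    def __init__(self, n_bands: int = 4, taps: int = 63):
        super().__init__()
        self.n_bands = n_bands
        # Trainable synthesis filter: upsamples each sub-band stream by
        # n_bands and sums them into a single full-band signal.
        self.synthesis = nn.ConvTranspose1d(
            n_bands, 1, kernel_size=taps, stride=n_bands,
            padding=(taps - n_bands) // 2, bias=False)

    def forward(self, subbands: torch.Tensor) -> torch.Tensor:
        # subbands: (batch, n_bands, samples // n_bands)
        full = self.synthesis(subbands).squeeze(1)
        # Trim the convolution overhang to an exact n_bands upsampling.
        return full[..., : subbands.size(-1) * self.n_bands]

# Usage: four sub-band streams -> one waveform at 4x the sample rate.
streams = torch.randn(1, 4, 2000)
wav = MultiStreamSynthesis(n_bands=4)(streams)  # shape (1, 8000)
```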


method

  • The decoder of VITS uses the HiFi-GAN (HFG) structure, which upsamples the latent z to the waveform sample rate through repeated transposed convolutions; this dominates the computation.
  • Inspired by iSTFTNet, this upsampling is replaced by an inverse short-time Fourier transform: the network predicts magnitude and phase spectrograms, and an iSTFT converts them to the waveform, as sketched below.
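Below is a minimal PyTorch sketch of an iSTFT output head in the spirit of iSTFTNet. The name ISTFTHead and the hyperparameters (n_fft=16, hop_length=4) are assumptions for illustration, not the repository's exact module.

```python
import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    """Predict magnitude and phase spectrograms from latent features,
    then convert them to a waveform with an inverse STFT (hypothetical
    sketch of the iSTFTNet-style decoder head)."""

    def __init__(self, hidden_channels: int, n_fft: int = 16, hop_length: int = 4):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # Project to magnitude + phase, each with n_fft // 2 + 1 bins.
        self.proj = nn.Conv1d(hidden_channels, n_fft + 2, kernel_size=7, padding=3)
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden_channels, frames)
        mag, phase = self.proj(x).chunk(2, dim=1)
        mag = torch.exp(mag).clamp(max=1e2)   # positive, bounded magnitude
        spec = mag * torch.exp(1j * phase)    # complex spectrogram
        # (batch, n_fft // 2 + 1, frames) -> (batch, samples)
        return torch.istft(spec, self.n_fft, hop_length=self.hop_length,
                           win_length=self.n_fft, window=self.window)

# Usage: latent frames (e.g., from the VITS flow) -> waveform samples.
z = torch.randn(1, 192, 100)
wav = ISTFTHead(hidden_channels=192)(z)  # shape (1, 396)
```

Because the iSTFT does the sample-rate expansion analytically, the network only needs enough capacity to predict per-frame spectra, which is where the speedup over HiFi-GAN's transposed-convolution stack comes from.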
