Speech Synthesis Using PP-TTS

The dependencies that must be installed to call PP-TTS (the project's requirements.txt) are as follows (the version numbers below are those at the time of writing, not necessarily the latest):

paddlepaddle==2.4.2
paddlespeech==1.0.1
paddleaudio==1.0.1

While a TTS task runs, the acoustic model and vocoder it uses are downloaded to the local C:\Users\XXX\.paddlespeech\models directory.

Execute TTS tasks

You can call the inference model from code or from the command line to run a TTS task and output audio.

Code call

from paddlespeech.cli.tts import TTSExecutor

tts_executor = TTSExecutor()

wav_file = tts_executor(
    text='湖北十堰竹山县的桃花摇曳多姿,和蓝天白云一起,构成一幅美丽春景。',
    output='output.wav',  # path of the output audio
    am='fastspeech2_csmsc',  # acoustic model for the TTS task
    voc='hifigan_csmsc',  # vocoder for the TTS task
    lang='zh',  # language of the TTS task
    spk_id=174,  # speaker ID
)
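Once the call returns, it can be worth sanity-checking the result with the standard-library wave module. A minimal sketch: it first writes a short synthetic tone so the snippet runs on its own; with a real run you would skip that step and open the output.wav produced by TTSExecutor (the 24 kHz sample rate here is an assumption about the csmsc models, worth verifying on your install):

```python
import math
import struct
import wave

path = "output.wav"
rate = 24000  # assumed sample rate of the csmsc models

# Stand-in for a real synthesis run: 0.5 s of a 440 Hz tone, mono, 16-bit.
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(rate)
    w.writeframes(b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / rate)))
        for t in range(rate // 2)
    ))

# The actual check: channel count, sample rate, and duration of the file.
with wave.open(path, "rb") as w:
    duration = w.getnframes() / w.getframerate()
    print(w.getnchannels(), w.getframerate(), round(duration, 2))
```

A silent file or a zero-length duration here usually means the synthesis step failed before writing audio.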

Command call

paddlespeech tts --input "湖北十堰竹山县的桃花摇曳多姿,和蓝天白云一起,构成一幅美丽春景。" --output output.wav --am fastspeech2_csmsc --voc hifigan_csmsc --lang zh --spk_id 174

Possible exceptions

Problem 1

The following encoding exception may be encountered on Windows:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x8c in position 2088: illegal multibyte sequence

The solution is to modify the source code files:

  • In venv\Lib\site-packages\paddlespeech\t2s\frontend\zh_frontend.py, search for the following code:
    • with open(phone_vocab_path, 'rt'
    • with open(tone_vocab_path, 'rt'
  • In venv\Lib\site-packages\paddlespeech\cli\tts\infer.py, search for the following code:
    • with open(self.phones_dict, "r"
    • with open(self.tones_dict, "r"
    • with open(self.speaker_dict, 'rt'

At each of these locations, add , encoding='UTF-8' to the open() call to fix the encoding exception.
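The root cause can be reproduced without PaddleSpeech at all: on Windows, open() without an explicit encoding falls back to the locale codec (often 'gbk'), which cannot decode every UTF-8 byte sequence. A small sketch (the file name is hypothetical, standing in for the dict files the library reads):

```python
import os
import tempfile

# Write a UTF-8 file containing non-ASCII content, like the real vocab files.
path = os.path.join(tempfile.mkdtemp(), "phone_id_map.txt")  # hypothetical name
with open(path, "w", encoding="utf-8") as f:
    f.write("音 1\n")

# Simulate the Windows default: decoding the UTF-8 bytes as GBK fails.
try:
    with open(path, "r", encoding="gbk") as f:
        f.read()
    decoded_with_gbk = True
except UnicodeDecodeError:
    decoded_with_gbk = False

# The one-argument fix from above reads the file correctly.
with open(path, "r", encoding="UTF-8") as f:
    text = f.read()

print(decoded_with_gbk, text.strip())
```

The same fix would be unnecessary on Python installs where the default locale encoding is already UTF-8.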

Problem 2

When installing the future library, you may encounter the following No module named 'src' exception:

          import src.future
      ModuleNotFoundError: No module named 'src'
      [end of output]

The solution is to download the source package future-x.xx.x.tar.gz from future · PyPI, extract it, and add the line sys.path.append('') near the top of setup.py (anywhere before the failing import) to fix the path problem:

sys.path.append('')
import src.future

Then run python setup.py install to complete the installation.
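Why sys.path.append('') works can be shown in isolation: the empty string on sys.path stands for the current working directory, which is where the package's src tree lives during a source install. In this sketch, src/demo.py is a hypothetical stand-in for the package's src/future module:

```python
import importlib
import os
import sys
import tempfile

# Recreate the situation: a source tree whose setup.py does `import src.future`.
workdir = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, "src"))
open(os.path.join(workdir, "src", "__init__.py"), "w").close()
with open(os.path.join(workdir, "src", "demo.py"), "w") as f:
    f.write("VALUE = 42\n")  # hypothetical stand-in for src/future

os.chdir(workdir)
importlib.invalidate_caches()

# Without '' (the current directory) on sys.path, the import fails...
sys.path = [p for p in sys.path if p not in ("", os.getcwd())]
try:
    importlib.import_module("src.demo")
    found_before = True
except ModuleNotFoundError:
    found_before = False

# ...and the same one-line fix used in setup.py makes it resolve.
sys.path.append("")
mod = importlib.import_module("src.demo")
print(found_before, mod.VALUE)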

Problem 3

When installing the webrtcvad library, you may encounter the following exception (cl.exe failed with exit code 2):

      pywebrtcvad.c
      cbits\pywebrtcvad.c(1): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
      error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe' failed with exit code 2
      [end of output]

The solution is to download the source code of the corresponding version from Releases · PaddlePaddle/PaddleSpeech (github.com), unzip the archive, and in setup.py replace webrtcvad with webrtcvad-wheels in the dependency list:

base = [
    # ...
    "webrtcvad-wheels",
    # ...
]

Then execute the following two commands in the source code directory to complete the installation:

pip install pytest-runner
pip install .

Model selection

Acoustic model | Vocoder | Language | Dataset | Speaker ID | Call parameters
FastSpeech2 | HiFiGAN | Chinese | CSMSC | 174 | am='fastspeech2_csmsc', voc='hifigan_csmsc', lang='zh', spk_id=174
FastSpeech2 | HiFiGAN | English | LJSpeech | 175 | am='fastspeech2_ljspeech', voc='hifigan_ljspeech', lang='en', spk_id=175
FastSpeech2 | WaveRNN | Chinese | CSMSC | 174 | am='fastspeech2_csmsc', voc='wavernn_csmsc', lang='zh', spk_id=174
SpeedySpeech | HiFiGAN | Chinese | CSMSC | 174 | am='speedyspeech_csmsc', voc='hifigan_csmsc', lang='zh', spk_id=174
Tacotron2 | HiFiGAN | Chinese | CSMSC | 174 | am='tacotron2_csmsc', voc='hifigan_csmsc', lang='zh', spk_id=174
Tacotron2 | HiFiGAN | English | LJSpeech | 175 | am='tacotron2_ljspeech', voc='hifigan_ljspeech', lang='en', spk_id=175
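The rows above can be captured as a small lookup table so a configuration is selected by name rather than by retyping four parameters. A sketch (the preset labels and the tts_kwargs helper are our own inventions, not PaddleSpeech API; the dict values simply restate the table):

```python
# Hypothetical presets: keys are our own labels, values restate the table rows.
PRESETS = {
    "fastspeech2+hifigan/zh": dict(am="fastspeech2_csmsc", voc="hifigan_csmsc", lang="zh", spk_id=174),
    "fastspeech2+hifigan/en": dict(am="fastspeech2_ljspeech", voc="hifigan_ljspeech", lang="en", spk_id=175),
    "fastspeech2+wavernn/zh": dict(am="fastspeech2_csmsc", voc="wavernn_csmsc", lang="zh", spk_id=174),
    "speedyspeech+hifigan/zh": dict(am="speedyspeech_csmsc", voc="hifigan_csmsc", lang="zh", spk_id=174),
    "tacotron2+hifigan/zh": dict(am="tacotron2_csmsc", voc="hifigan_csmsc", lang="zh", spk_id=174),
    "tacotron2+hifigan/en": dict(am="tacotron2_ljspeech", voc="hifigan_ljspeech", lang="en", spk_id=175),
}

def tts_kwargs(preset: str) -> dict:
    """Return keyword arguments for a TTS call from a preset name."""
    return dict(PRESETS[preset])  # copy, so callers can tweak safely

print(tts_kwargs("fastspeech2+hifigan/zh"))
```

With the code-call example earlier, this would be used as tts_executor(text=..., output='output.wav', **tts_kwargs('fastspeech2+hifigan/zh')).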

Acoustic model

The am parameter selects the acoustic model, which converts linguistic features into acoustic features.

  • FastSpeech2 : a non-autoregressive model that simplifies Tacotron2's pre-training work, uses MFA to guide duration prediction, and adds pitch and energy prediction to assist acoustic modeling
  • SpeedySpeech : another mainstream non-autoregressive acoustic model
  • Tacotron2 : an autoregressive model built on RNNs; synthesis is slow

Vocoder

The voc parameter selects the vocoder (short for voice coder), also known as a speech analysis and synthesis system, used mainly for synthesizing human speech. Its main job is to convert acoustic features into a playable speech waveform, so the quality of the vocoder directly determines the quality of the audio.

  • HiFiGAN : proposes a residual structure that alternates dilated and ordinary convolutions to enlarge the receptive field, improving inference speed while preserving synthesis quality
  • WaveRNN : an autoregressive model (voc='wavernn_csmsc' above) that generates waveform samples one by one; the sound quality is excellent, but the model is larger and inference is slower

Data sets

  • CSMSC : Chinese, single speaker, female voice, about 12 hours, with high audio quality
  • LJSpeech : English, single speaker, female voice, about 24 hours, with high audio quality

Because PP-TTS provides, by default, a streaming speech synthesis system based on the FastSpeech2 acoustic model and the HiFiGAN vocoder, if the theory above is unclear you can simply use the FastSpeech2 + HiFiGAN call parameters from the first two rows of the table.

Reprinted from: blog.csdn.net/hekaiyou/article/details/129364206