The dependencies (requirements.txt) that need to be installed to call PP-TTS are as follows (the version numbers below are the versions at the time of writing, not necessarily the latest):
paddlepaddle==2.4.2
paddlespeech==1.0.1
paddleaudio==1.0.1
During the execution of a TTS task, the acoustic model and vocoder model used will be downloaded to the local directory C:\Users\XXX\.paddlespeech\models.
Execute TTS tasks
You can quickly invoke the inference model to complete a TTS task and output audio, either from code or from the command line.
Code call
from paddlespeech.cli.tts import TTSExecutor

tts_executor = TTSExecutor()
wav_file = tts_executor(
    text='湖北十堰竹山县的桃花摇曳多姿,和蓝天白云一起,构成一幅美丽春景。',
    output='output.wav',        # path of the output audio
    am='fastspeech2_csmsc',     # acoustic model for the TTS task
    voc='hifigan_csmsc',        # vocoder for the TTS task
    lang='zh',                  # language of the TTS task
    spk_id=174,                 # speaker ID
)
Command call
paddlespeech tts --input "湖北十堰竹山县的桃花摇曳多姿,和蓝天白云一起,构成一幅美丽春景。" --output output.wav --am fastspeech2_csmsc --voc hifigan_csmsc --lang zh --spk_id 174
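After either call, you can sanity-check the generated audio with Python's standard wave module. This is a minimal sketch; the wav_info helper is this article's own naming, and 'output.wav' refers to the path produced above:

```python
import wave

def wav_info(path):
    """Return (sample_rate, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        duration = f.getnframes() / rate
    return rate, duration

# After synthesis, for example:
# rate, duration = wav_info("output.wav")
```

If the duration is zero or the call raises an error, synthesis did not produce a valid WAV file.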
Possible exceptions
Question 1
The following encoding exceptions may be encountered under Windows:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8c in position 2088: illegal multibyte sequence
The solution is to modify the source files:
- In venv\Lib\site-packages\paddlespeech\t2s\frontend\zh_frontend.py, search for:
with open(phone_vocab_path, 'rt'
with open(tone_vocab_path, 'rt'
- In venv\Lib\site-packages\paddlespeech\cli\tts\infer.py, search for:
with open(self.phones_dict, "r"
with open(self.tones_dict, "r"
with open(self.speaker_dict, 'rt'
At each of the open() calls found, add , encoding='UTF-8' to fix the encoding exception.
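Why the fix works: the dictionary files read by these calls are UTF-8 encoded, but on Windows open() without an explicit encoding falls back to the system code page (GBK on Chinese-locale systems). A minimal reproduction under that assumption (the file name here is illustrative, not one of PaddleSpeech's):

```python
# Write a UTF-8 file containing Chinese characters, as the
# PaddleSpeech dictionary files do (illustrative file name).
path = "demo_dict.txt"
with open(path, "w", encoding="UTF-8") as f:
    f.write("音 0\n")

try:
    # Simulate the unpatched call on a GBK-default Windows system.
    with open(path, "r", encoding="gbk") as f:
        f.read()
except UnicodeDecodeError as e:
    print("unpatched read fails:", e)

# The patched call, with encoding='UTF-8' added as described above.
with open(path, "r", encoding="UTF-8") as f:
    content = f.read()
```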
Question 2
When installing the future library, you may encounter the following exception, No module named 'src':
import src.future
ModuleNotFoundError: No module named 'src'
[end of output]
The solution is to download the source archive future-x.xx.x.tar.gz from future · PyPI, decompress it, and add the line sys.path.append('') near the beginning of setup.py (anywhere before the problematic line) to fix the path problem:
sys.path.append('')
import src.future
Then execute python setup.py install to complete the installation.
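Why sys.path.append('') works: an empty string on sys.path is interpreted as the current working directory, so the src/ package sitting next to setup.py becomes importable. A self-contained illustration (the src/demo module here is a stand-in, not the real contents of the future package):

```python
import importlib
import os
import sys

# Create a stand-in src/ package in the current directory,
# mimicking the layout of the future source archive.
os.makedirs("src", exist_ok=True)
open(os.path.join("src", "__init__.py"), "w").close()
with open(os.path.join("src", "demo.py"), "w") as f:
    f.write("VALUE = 42\n")

sys.path.append("")  # the line added to setup.py: '' means the CWD
mod = importlib.import_module("src.demo")
print(mod.VALUE)     # prints 42
```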
Question 3
When installing the webrtcvad library, you may encounter the following exception, cl.exe failed with exit code 2:
pywebrtcvad.c
cbits\pywebrtcvad.c(1): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe' failed with exit code 2
[end of output]
The solution is to download the source code of the corresponding version from Releases · PaddlePaddle/PaddleSpeech (github.com), unzip the archive, and in setup.py replace webrtcvad with webrtcvad-wheels in the dependency list:
base = [
....
"webrtcvad-wheels",
....
]
Then execute the following two commands in the source code directory to complete the installation:
pip install pytest-runner
pip install .
Model selection
Acoustic model | Vocoder | Language | Dataset | Speaker ID | Call parameters
---|---|---|---|---|---
FastSpeech2 | HiFiGAN | Chinese | CSMSC | 174 | am='fastspeech2_csmsc', voc='hifigan_csmsc', lang='zh', spk_id=174
FastSpeech2 | HiFiGAN | English | LJSpeech | 175 | am='fastspeech2_ljspeech', voc='hifigan_ljspeech', lang='en', spk_id=175
FastSpeech2 | WaveRNN | Chinese | CSMSC | 174 | am='fastspeech2_csmsc', voc='wavernn_csmsc', lang='zh', spk_id=174
SpeedySpeech | HiFiGAN | Chinese | CSMSC | 174 | am='speedyspeech_csmsc', voc='hifigan_csmsc', lang='zh', spk_id=174
Tacotron2 | HiFiGAN | Chinese | CSMSC | 174 | am='tacotron2_csmsc', voc='hifigan_csmsc', lang='zh', spk_id=174
Tacotron2 | HiFiGAN | English | LJSpeech | 175 | am='tacotron2_ljspeech', voc='hifigan_ljspeech', lang='en', spk_id=175
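For convenience, the combinations in the table can be collected into a small lookup of keyword arguments for TTSExecutor. The PRESETS dict and its keys are this article's own naming, not part of PaddleSpeech:

```python
# Each entry maps a human-readable name to the call parameters
# from the model-selection table above.
PRESETS = {
    "fastspeech2+hifigan-zh": dict(am="fastspeech2_csmsc", voc="hifigan_csmsc", lang="zh", spk_id=174),
    "fastspeech2+hifigan-en": dict(am="fastspeech2_ljspeech", voc="hifigan_ljspeech", lang="en", spk_id=175),
    "fastspeech2+wavernn-zh": dict(am="fastspeech2_csmsc", voc="wavernn_csmsc", lang="zh", spk_id=174),
    "speedyspeech+hifigan-zh": dict(am="speedyspeech_csmsc", voc="hifigan_csmsc", lang="zh", spk_id=174),
    "tacotron2+hifigan-zh": dict(am="tacotron2_csmsc", voc="hifigan_csmsc", lang="zh", spk_id=174),
    "tacotron2+hifigan-en": dict(am="tacotron2_ljspeech", voc="hifigan_ljspeech", lang="en", spk_id=175),
}

# Usage (requires paddlespeech to be installed):
# tts_executor(text="...", output="output.wav", **PRESETS["fastspeech2+hifigan-zh"])
```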
Acoustic model
The am parameter specifies the acoustic model, which converts linguistic features into acoustic features.
- FastSpeech2: a non-autoregressive model that simplifies the pre-training required by Tacotron2, uses MFA to guide duration prediction, and introduces pitch and energy prediction to assist acoustic modeling
- SpeedySpeech: another mainstream non-autoregressive acoustic model
- Tacotron2: an autoregressive model with an RNN structure; synthesis is slow
Vocoder
The voc parameter specifies the vocoder (short for "voice coder"), also known as a speech signal analysis and synthesis system, mainly used for synthesizing human speech. The main function of a vocoder is to convert acoustic features into a playable speech waveform; the quality of the vocoder directly determines the quality of the audio.
- HiFiGAN: proposes a residual structure that alternates dilated convolutions with ordinary convolutions to enlarge the receptive field, improving inference speed while preserving synthesis quality
- WaveRNN: an autoregressive model that generates sample points one by one; the sound quality is excellent, but inference is slow
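The receptive-field claim can be checked with a short calculation. The formula below is the standard receptive-field computation for a stack of 1-D convolutions; the WaveNet-style dilation schedule (kernel size 2, dilations doubling each layer) is used as an illustrative example:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of 1-D convolutions,
    one layer per entry in `dilations`."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilated_stack = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
print(receptive_field(2, dilated_stack))      # 1024 samples from 10 layers
print(receptive_field(2, [1] * 10))           # only 11 samples without dilation
```

With the same ten layers, doubling dilations cover roughly a hundred times more context, which is why dilated convolutions are the standard way to widen the receptive field cheaply.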
Datasets
- CSMSC : Chinese, single speaker, female voice, about 12 hours, with high audio quality
- LJSpeech : English, single speaker, female voice, about 24 hours, with high audio quality
PP-TTS provides by default a streaming speech synthesis system based on the FastSpeech2 acoustic model and the HiFiGAN vocoder, so if you are unsure about the underlying theory, you can simply use the FastSpeech2 + HiFiGAN call parameters from the first two rows of the table above.