Summary of open source models for Chinese speech synthesis

Recently I have been experimenting with open-source voice cloning models; my findings are summarized below:

MockingBird: its cloned voices are the most similar to the source speaker, but the drawbacks are obvious: inference is slow, about 5 seconds per utterance (optimizable to roughly 0.4-1.2 seconds), and the MOS score is low;

VITS: its published MOS score is the closest to real speech, and inference is relatively fast, about 0.08-0.4 seconds;

ms_istft_vits: roughly 4x the inference performance of VITS, i.e. faster still at about 0.06-0.1 seconds, with a MOS score close to real speech.
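For reference, the figures above can be collected into a small summary. The numbers are the rough measurements quoted in this post, not independent benchmarks:

```python
# Latency and MOS summary of the three models, as reported in this post.
# MockingBird's latency is the post-optimization range; unoptimized it is ~5 s.
models = {
    "MockingBird":   {"latency_s": (0.4, 1.2),  "mos": "low"},
    "vits":          {"latency_s": (0.08, 0.4), "mos": "near real"},
    "ms_istft_vits": {"latency_s": (0.06, 0.1), "mos": "near real"},
}

# Sanity check: ms_istft_vits has the lowest worst-case latency.
fastest = min(models, key=lambda name: models[name]["latency_s"][1])
print(fastest)  # ms_istft_vits
```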

All of these codebases are more or less buggy and need fixing yourself. In addition, the multi-speaker training code for the VITS family requires manual modification. For the text front end you can use either pinyin or phonemes; phonemes plus explicit pause tokens give better results.
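As a sketch of the "phonemes plus pauses" idea, here is a minimal text front end. The tiny character-to-pinyin dictionary and the `sp`/`sil` pause symbols are illustrative placeholders; a real system would use a full lexicon (for example the pypinyin library) plus a polyphone dictionary:

```python
# Minimal sketch: convert Chinese text to pinyin tokens, inserting pause
# tokens at punctuation. The dictionaries below are toy placeholders.
TOY_PINYIN = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}
PAUSES = {"，": "sp", "。": "sil", "！": "sil", "？": "sil"}  # sp = short pause, sil = silence

def text_to_tokens(text: str) -> list[str]:
    tokens = []
    for ch in text:
        if ch in PAUSES:
            tokens.append(PAUSES[ch])
        elif ch in TOY_PINYIN:
            tokens.append(TOY_PINYIN[ch])
        # unknown characters are silently skipped in this toy version
    return tokens

print(text_to_tokens("你好，世界。"))  # ['ni3', 'hao3', 'sp', 'shi4', 'jie4', 'sil']
```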

For multi-speaker training of VITS I used the AISHELL-3 multi-speaker Chinese dataset (174 speakers, more than 80,000 utterances) at an 8 kHz sampling rate with batch_size=16; it needs about 500K steps to sound good, which takes roughly 10 days on a 16 GB T4 GPU. For a single AISHELL speaker with 10,000 female-voice utterances at a 44 kHz sampling rate, training took about 9 days; results were good at 240K steps, and the model could clone a reading of "Moonlight over the Lotus Pond".
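The 500K-step figure can be sanity-checked with simple arithmetic, using the utterance count and batch size quoted above:

```python
# Rough training-length arithmetic for the multi-speaker run described above.
utterances = 80_000      # AISHELL-3, 174 speakers (approximate count from this post)
batch_size = 16
target_steps = 500_000

steps_per_epoch = utterances // batch_size      # steps per pass over the data
epochs_needed = target_steps / steps_per_epoch  # full passes needed to reach 500K steps
print(steps_per_epoch, epochs_needed)  # 5000 100.0
```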

For polyphonic characters (Chinese characters with multiple context-dependent readings): you need to maintain your own polyphone dictionary.
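A minimal sketch of such a polyphone dictionary: word-level entries are matched first (longest match) and override the default per-character reading. All entries here are illustrative; a production dictionary would be far larger:

```python
# Toy polyphone handling: word-level entries override per-character defaults.
# Example: 长 reads "chang2" in 长城 (Great Wall) but "zhang3" in 长大 (grow up).
DEFAULT = {"长": "chang2", "城": "cheng2", "大": "da4"}
POLYPHONE_WORDS = {"长大": ["zhang3", "da4"], "长城": ["chang2", "cheng2"]}

def to_pinyin(text: str) -> list[str]:
    out, i = [], 0
    while i < len(text):
        # try the longest word-level match first (max word length here is 2)
        for n in (2, 1):
            chunk = text[i:i + n]
            if n > 1 and chunk in POLYPHONE_WORDS:
                out.extend(POLYPHONE_WORDS[chunk])
                i += n
                break
            if n == 1:
                out.append(DEFAULT.get(chunk, chunk))
                i += 1
    return out

print(to_pinyin("长大"))  # ['zhang3', 'da4']
print(to_pinyin("长城"))  # ['chang2', 'cheng2']
```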

In terms of acceleration: quantization and conversion to ONNX or a scripted TorchScript model both failed (the code does not support them). Conversion to a traced model succeeded, but performance was very poor: short sentences took 10 seconds, so I gave up on it.
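The per-sentence latencies quoted in this post can be measured with a simple wall-clock timer like the sketch below; `synthesize` is a stand-in placeholder for whichever model's inference call you are testing:

```python
import time

def measure_latency(fn, *args, runs: int = 5) -> float:
    """Return the average wall-clock seconds per call over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs

# Placeholder for a real TTS inference call; swap in the model's synthesis function.
def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in "audio"

avg = measure_latency(synthesize, "你好，世界。")
print(f"average latency: {avg:.6f} s")
```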

[Figure: MOS score comparison chart from the papers]

[Figure: comparison of MOS score and single-inference latency (unit: seconds)]


Origin blog.csdn.net/wxl781227/article/details/127996110