- 2023.3
- Meta AI
Method
-
This paper leans toward experience sharing based on multi-dimensional experiments, and is quite practical
-
HuBERT extracts the content embedding. The hubert-emb used here is not the pre-trained feature, but a feature taken from a HuBERT model finetuned on ASR data. Experiments show that timbre decoupling after this finetune is clearly better than with the pre-trained model, though some speaker information still remains;
-
f0 is passed through an f0-encoder to obtain a richer harmonic representation, and is shifted at the inference stage. Because the speaker embedding in effect models the speaker's fundamental-frequency distribution, using src_f0 directly leads to worse results. Suppose $f_A$ and $f_B$ are both Gaussian distributed.
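Under the Gaussian assumption, a common way to realize such a shift is mean/variance matching of log-F0; the note does not give the exact formula, so `shift_f0` below is a hypothetical helper sketching this idea:

```python
import numpy as np

def shift_f0(src_f0, tgt_mean, tgt_std, eps=1e-8):
    """Shift voiced source-F0 frames toward the target speaker's range by
    mean/variance matching in the log domain (illustrative sketch; the
    paper only states that f0 is shifted to the target range)."""
    f0 = np.asarray(src_f0, dtype=np.float64)
    voiced = f0 > 0  # unvoiced frames are conventionally stored as 0
    log_f0 = np.log(f0[voiced])
    src_mean, src_std = log_f0.mean(), log_f0.std()
    shifted = f0.copy()
    # Whiten with source stats, then rescale with target stats.
    shifted[voiced] = np.exp(
        (log_f0 - src_mean) / (src_std + eps) * tgt_std + tgt_mean
    )
    return shifted
```

`tgt_mean` and `tgt_std` would be log-F0 statistics gathered from the target speaker's data; unvoiced frames are left at 0.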
-
After the speaker-emb passes through a LUT (lookup table), the three features are concatenated and fed to HiFi-GAN.
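The assembly of the vocoder input can be sketched as follows; all dimensions and the random "embeddings" are placeholders, since the note gives no model sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy dimensions; the real model's sizes are not given in the note.
T, D_CONTENT, D_F0, D_SPK, N_SPK = 50, 256, 64, 128, 10

# LUT: one learned row per speaker ID (random here, for illustration only).
speaker_lut = rng.standard_normal((N_SPK, D_SPK))

content_emb = rng.standard_normal((T, D_CONTENT))  # stand-in for HuBERT features
f0_emb = rng.standard_normal((T, D_F0))            # stand-in for f0-encoder output

spk_id = 3
spk_emb = np.broadcast_to(speaker_lut[spk_id], (T, D_SPK))  # repeat over frames

# Frame-wise concatenation of the three features -> vocoder (HiFi-GAN) input.
vocoder_input = np.concatenate([content_emb, f0_emb, spk_emb], axis=-1)
```

The speaker row is simply tiled across all `T` frames before concatenation, so every frame carries the same speaker code.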
-
The fundamental-frequency processing is shown in the figure below
Experimental results
- Speech+singing data gives better quality than singing-only data; this paper uses 24 kHz data, 200 h of high-fidelity speech plus 10+ h of singing data (NUS48E + CSD + AmericanSong)
- Finetuning the self-supervised model on ASR data filters out some speaker information, improving both synthesized audio quality and similarity to the target speaker. Compared with wav2vec + finetune, HuBERT reaches similar performance without finetuning, which suggests HuBERT carries less speaker information than wav2vec.
- PBTC's f0 coding scheme works better
- At the inference stage, f0 is extracted from the source speech and shifted into the target speaker's range;