Self-Supervised Representations for Singing Voice Conversion

  • 2023.3
  • Meta AI

Method

[Figure: model architecture]

  • The paper reads more like experience sharing from experiments across many dimensions, so it is quite practical.

  • HuBERT extracts the content embedding. The HuBERT embedding used here is not the raw pre-trained feature but comes from a HuBERT model fine-tuned on ASR data. Experiments show that timbre disentanglement after fine-tuning is clearly better than with the pre-trained model, although some speaker information still remains (a feature-extraction sketch appears after this list);

  • f0 is passed through the f0-encoder to obtain a richer harmonic representation, and it is shifted at inference time. Because the speaker embedding in effect models the speaker's fundamental-frequency distribution, using src_f0 directly gives worse results. Suppose f_A and f_B are both Gaussian distributed (see the f0-shift sketch after this list).
    [Figure: shifting the source f0 distribution onto the target speaker's distribution]

  • The speaker embedding is obtained from a lookup table (LUT); the three features are then concatenated and fed into HiFi-GAN (see the conditioning sketch after this list).

  • The fundamental frequency processing method is as shown in the figure below
    [Figure: f0 processing pipeline]
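
A rough sketch of the content-feature step described above (not the authors' code): it extracts intermediate representations from an ASR-fine-tuned HuBERT checkpoint via torchaudio. The bundle name, layer choice, and file name are illustrative assumptions.

```python
# Minimal sketch: extract "content" features from an ASR-fine-tuned HuBERT.
# The bundle, chosen layer, and audio path are assumptions, not the exact
# checkpoint or layer used in the paper.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_ASR_LARGE      # HuBERT fine-tuned on ASR data
model = bundle.get_model().eval()

wav, sr = torchaudio.load("source_singing.wav")     # hypothetical input file
wav = wav.mean(dim=0, keepdim=True)                 # force mono: (1, time)
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one tensor per transformer layer,
    # each shaped (batch, frames, feature_dim)
    features, _ = model.extract_features(wav)

content_emb = features[-1]                          # a late layer as the content embedding
print(content_emb.shape)
```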
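
The f0 shift mentioned above reduces to a mean-variance transform once f_A and f_B are modeled as Gaussians. The sketch below is one plausible reading (statistics taken in the log-f0 domain, unvoiced frames left at zero); it is not necessarily the exact formula from the paper.

```python
# Sketch of the inference-time f0 shift under the Gaussian assumption:
# map the source speaker's f0 statistics (mu_A, sigma_A) onto the target
# speaker's (mu_B, sigma_B). Working in log-f0 and skipping unvoiced
# (f0 == 0) frames are assumptions, not details taken from the paper.
import numpy as np

def shift_f0(f0_src, mu_a, sigma_a, mu_b, sigma_b):
    """Map log-f0 drawn from N(mu_a, sigma_a^2) to N(mu_b, sigma_b^2)."""
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((log_f0 - mu_a) / sigma_a * sigma_b + mu_b)
    return f0_out

# Speaker statistics would be estimated from each speaker's training data,
# e.g. mu_b, sigma_b = log_f0_target.mean(), log_f0_target.std()
f0_src = np.array([0.0, 220.0, 233.0, 0.0, 247.0])   # toy source contour (Hz)
f0_conv = shift_f0(f0_src,
                   mu_a=np.log(230.0), sigma_a=0.10,
                   mu_b=np.log(150.0), sigma_b=0.15)
print(np.round(f0_conv, 1))
```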
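
For the conditioning step (speaker LUT, then concatenation of the three streams before HiFi-GAN), a minimal PyTorch sketch follows. All dimensions here are placeholders, and the HiFi-GAN generator itself is only referenced in a comment.

```python
# Sketch of assembling the vocoder conditioning: content features + encoded f0
# + a speaker embedding from a lookup table, concatenated along the channels.
# Every dimension here is a placeholder, not a value from the paper.
import torch
import torch.nn as nn

num_speakers, spk_dim = 10, 64
content_dim, f0_dim = 768, 128
frames = 200

spk_table = nn.Embedding(num_speakers, spk_dim)       # speaker LUT

content = torch.randn(1, frames, content_dim)         # from HuBERT
f0_emb = torch.randn(1, frames, f0_dim)               # from the f0 encoder
spk = spk_table(torch.tensor([3]))                    # (1, spk_dim)
spk = spk.unsqueeze(1).expand(-1, frames, -1)         # broadcast over time

cond = torch.cat([content, f0_emb, spk], dim=-1)      # (1, frames, C_total)

# A HiFi-GAN generator would consume `cond` (usually channel-first);
# `hifigan` is a stand-in for whatever generator implementation is used.
# audio = hifigan(cond.transpose(1, 2))
```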

Experimental results

  • Training on speech + singing data gives better quality than singing data alone; the paper uses 24 kHz data: 200 h of high-fidelity speech and 10+ h of singing data (NUS48E + CSD + AmericanSong)
  • Fine-tuning the self-supervised model on ASR data filters out some speaker information, improving both synthesis quality and similarity to the target speaker; HuBERT achieves performance similar to wav2vec + fine-tuning even without fine-tuning, which suggests HuBERT features carry less speaker information than wav2vec features.
  • The PBTC f0 coding method works better
  • At inference time, f0 is extracted from the source speech and shifted into the target speaker's range (as in the f0-shift sketch above);

[Figures: experimental results]

Source: blog.csdn.net/qq_40168949/article/details/130701542