Unsupervised Cross-Domain Singing Voice Conversion

Conference: Interspeech 2020
Affiliation: Facebook
Author: Adam Polyak

demo page

Abstract

  • Uses both speech and singing data; "cross-domain" means a source singing utterance can be converted into the voice of a target whose training data is either speech or singing.
  • Wav-to-wav conversion with a GAN.
  • Extracts linguistic features with an ASR model (wav2letter, fw2l), the fundamental frequency with a CNN pitch tracker (CREPE, fcrepe), and an additional loudness feature (floud).
  • Proposes a perceptual loss that enforces pitch consistency and content consistency between the reconstructed x and the original x; see the sketch below.
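A minimal PyTorch sketch of such a perceptual loss, assuming `f_crepe` and `f_w2l` are frozen pretrained CREPE and wav2letter feature extractors (the function names, the L1 distance, and the equal weighting are assumptions, not the paper's exact formulation):

```python
import torch.nn.functional as F

def perceptual_loss(x_real, x_gen, f_crepe, f_w2l):
    """Perceptual loss: pitch + content consistency (a sketch)."""
    # Pitch consistency: the F0 contour of the generated audio should
    # match the original's, as measured by a frozen CREPE pitch model.
    pitch_loss = F.l1_loss(f_crepe(x_gen), f_crepe(x_real))
    # Content consistency: frozen wav2letter ASR features should match.
    content_loss = F.l1_loss(f_w2l(x_gen), f_w2l(x_real))
    return pitch_loss + content_loss
```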

Figure 1: Proposed GAN architecture. (a) Generator architecture. Musical and speech features are extracted from a singing waveform (floud(x), fw2l(x), Γ(fcrepe(x))) and passed through context stacks (colored green). The features are then concatenated and temporally upsampled to match the audio frequency. The joint embedding is used to condition a non-causal WaveNet (colored blue), which receives random noise as input. (b) Discriminator architecture. Losses are drawn with dashed lines, input/output with solid lines. The discriminator (colored orange) differentiates between synthesized and real singing. Multi-scale spectral loss and perceptual losses are computed between matching real and generated samples.
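The caption mentions a multi-scale spectral loss between matching real and generated samples. A hedged sketch of such a loss, assuming L1 on STFT magnitudes over several FFT sizes (the specific sizes and the distance are assumptions; the paper's exact formulation may differ):

```python
import torch

def multiscale_spectral_loss(x, y, fft_sizes=(2048, 1024, 512, 256, 128, 64)):
    """L1 between STFT magnitudes of real x and generated y at several scales."""
    loss = x.new_zeros(())
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft, device=x.device)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        loss = loss + (X - Y).abs().mean()
    return loss
```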

  • With multiple speakers, back-translation is used: an utterance $x_j$ of one speaker is converted to another speaker $u$,
    $$x^j_u = G(z, E(x_j), u)$$
    where $z$ is the random noise input, $E(x_j)$ the extracted features, and $u$ the target identity; a sketch of one training step follows.
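One plausible way to use the back-translated pair (a sketch under assumed interfaces; `G`, `E`, the speaker-id arguments, and `recon_loss` are hypothetical names, and the paper's exact objective may differ):

```python
import torch

def back_translation_step(G, E, x_j, spk_j, spk_u, recon_loss):
    """One back-translation step (hypothetical interface, a sketch)."""
    # Convert speaker j's utterance to speaker u to get a synthetic sample.
    with torch.no_grad():
        x_uj = G(torch.rand_like(x_j), E(x_j), spk_u)  # x^j_u = G(z, E(x_j), u)
    # Translate it back to speaker j; the original x_j is the target.
    x_rec = G(torch.rand_like(x_uj), E(x_uj), spk_j)
    return recon_loss(x_rec, x_j)
```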

Architecture

input → conv block (8 non-causal layers) → generator (non-causal WaveNet), which maps noise drawn from U(0,1) to a sample-level waveform → discriminator; a sketch follows.
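A minimal sketch of this pipeline, assuming frame-aligned features, a channel width of 128, a doubling dilation schedule, and a `wavenet(noise, cond)` interface (all assumptions not specified by the source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextStack(nn.Module):
    """A sketch of the conv block: 8 non-causal dilated conv layers."""
    def __init__(self, in_ch, ch=128, n_layers=8):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_ch if i == 0 else ch, ch, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)  # symmetric padding -> non-causal
            for i in range(n_layers))

    def forward(self, h):
        for conv in self.convs:
            h = torch.relu(conv(h))
        return h

def generate(context_stacks, wavenet, feats, hop):
    """Condition a non-causal WaveNet on upsampled features; input is noise."""
    # feats: frame-level feature tensors (B, C_i, T_frames), assumed aligned.
    h = torch.cat([s(f) for s, f in zip(context_stacks, feats)], dim=1)
    cond = F.interpolate(h, scale_factor=hop)      # upsample to sample rate
    z = torch.rand(cond.size(0), 1, cond.size(2))  # noise z ~ U(0, 1)
    return wavenet(z, cond)                        # sample-level waveform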

Experiments

  • Single-speaker data: LJSpeech (speech) and LCSING (single-singer singing data)
  • Multi-speaker data: VCTK (speech) and NUS (singing)

Target-speaker models are trained on speech-only data, singing-only data, or speech+singing data, respectively; at test time the input is singing from the NUS dataset.

Reprinted from blog.csdn.net/qq_40168949/article/details/120145044