Professional Practice Record III: Papers on End-to-End Cross-lingual Voice Transfer Speech Synthesis

0. Notes

This records work between 2020-1-17 and 2021-2-17.

Over the past month, I summarized the preceding work into two papers.

1. Model Architecture Optimization

1.1. Title

Module TTS: Cross-lingual Improved Transfer Learning from Speaker Verification to Multispeaker TTS

1.2. Summary

Quoted from the paper's abstract:

State-of-the-art TTS systems can produce monolingual speech with high fidelity and naturalness. However, when such models are applied to cross-lingual and code-switched synthesis, performance degrades severely. Recently, many strong methods have used monolingual corpora to achieve cross-lingual TTS. Once a source-language corpus is used during training, content, speaker, and language information must be disentangled; if the disentanglement is incomplete, the acoustic model is influenced by the source language, which leads to accents. Transfer learning from Speaker Verification to multispeaker TTS (SV-TTS) avoids using a source-language corpus during training: it synthesizes standard English by modeling the speaker's voice. However, this is an unseen-speaker, cross-lingual setting, and the resulting voice is not sufficiently similar to the target speaker. We keep the core idea of SV-TTS, training the acoustic model with only the target-language corpus, while converting the unseen setting into a seen one and accounting for the characteristics of cross-lingual synthesis. We propose Module TTS: (1) building on SV-TTS, we use phonetic posteriorgrams (PPGs) as a universal input representation; (2) the target-language corpus trains all modules, while the source language trains only the SV modules and not the acoustic model, so TTS for the target speaker becomes a seen-speaker task; (3) we add DANN, LID, and Similar Loss modules tailored to the cross-lingual setting. Experimental results outperform the SV-TTS baseline.
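The DANN module mentioned in point (3) is conventionally built on a gradient reversal layer (GRL): identity in the forward pass, negated and scaled gradient in the backward pass, so the shared encoder is trained adversarially against the language (LID) classifier stacked on top of it. A minimal pure-Python sketch of that mechanism (the function names and the `lam` coefficient are illustrative, not taken from the paper):

```python
def grl_forward(features):
    # Forward pass: the gradient reversal layer is the identity.
    return features

def grl_backward(grad_output, lam=1.0):
    # Backward pass: negate and scale the incoming gradient, so the
    # encoder below the GRL is updated to *confuse* the classifier
    # above it instead of helping it.
    return [-lam * g for g in grad_output]

features = [0.5, -1.2, 3.0]
grads = [0.1, 0.2, -0.3]
print(grl_forward(features))         # features pass through unchanged
print(grl_backward(grads, lam=0.5))  # [-0.05, -0.1, 0.15]
```

In a real framework this is typically implemented as a custom autograd function; the toy version above only shows the sign flip that makes the adversarial training work.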

Keywords: cross-lingual speech synthesis, code-switch, phonetic posteriorgrams, acoustic model, information disentanglement
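The "Similar Loss" named in point (3) is not defined in the abstract; a common formulation for pulling the synthesized voice toward the target speaker is one minus the cosine similarity of their speaker-verification embeddings. The following is an assumption for illustration only, not the paper's definition:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_loss(emb_synth, emb_target):
    # 0 when the synthesized and target voices' embeddings align,
    # growing as they diverge.
    return 1.0 - cosine_similarity(emb_synth, emb_target)

print(similar_loss([1.0, 0.0], [1.0, 0.0]))  # 0.0 (identical direction)
print(similar_loss([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```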

2. Reference Audio Optimization

2.1. Title

One-reference TTS: A Way to Provide Reference Speech for Voice in Cross-lingual Voice Clone Task

2.2. Summary

Quoted from the paper's abstract:

Code-switch TTS is important in everyday applications, and cross-lingual voice cloning and voice conversion are among its key solutions. In terms of accent purity, voice-transfer-based methods outperform schemes that jointly train a one-hot speaker embedding with the acoustic model. However, voice-transfer methods require reference speech for the voice, which creates a mismatch between the training and synthesis distributions. Typically, the content of the reference speech matches the content input during training but differs from it at synthesis time. In the cross-lingual voice clone task the mismatch is more serious: not only does content information conflict, but language information also differs. These issues lead to unstable content quality in the synthesized speech, poor voice similarity, and bad accents.

On the other hand, traditional voice cloning and related methods offer no guidance on selecting reference speech for a given pair of content input and designated speaker, resulting in a "difficulty of choice" at synthesis time. This "difficulty of choice" is inconvenient, and having too many choices deviates from the original goal of synthesizing a consistent speaker voice (one pair of content and voice should map to a single best utterance). This paper proposes: (1) for each pair of content and speaker, first generate reference speech, which maps to one target utterance; (2) make the generated reference speech frame-level consistent with the content input, reducing content conflicts; we introduce PPG instead of text as the content input; (3) two concrete solutions: Voice-reference TTS, based on the traditional PPG voice model, and Cycle-reference TTS, based on a cycle-consistency idea. Both are voice-transfer methods implementing cross-lingual voice transfer, collectively called One-reference TTS. Experiments show that, compared with the Google Voice Clone baseline, One-reference TTS improves speech generation quality, demonstrating that providing reference speech for the voice is effective in the cross-lingual voice clone task.

Keywords: cross-lingual speech synthesis, code-switch, phonetic posteriorgrams, voice transfer, cycle
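Point (2) above, frame-level consistency between the generated reference speech and the content input, can be illustrated with a toy sketch: the reference generator conditions every PPG frame on one fixed speaker embedding, so its output necessarily has the same number of frames as the content input. The function and variable names here are hypothetical, not from the paper:

```python
def generate_reference(ppg_frames, speaker_emb):
    # Toy stand-in for the reference generator: concatenate the same
    # speaker embedding onto every PPG frame. A real model would then map
    # each conditioned frame to a mel-spectrogram frame, but either way
    # the output frame count matches the content input exactly, avoiding
    # the content conflicts of an arbitrarily chosen reference utterance.
    return [frame + speaker_emb for frame in ppg_frames]

ppg = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]  # 3 content frames (toy PPG)
spk = [0.3, -0.3]                           # hypothetical speaker embedding
ref = generate_reference(ppg, spk)
print(len(ref) == len(ppg))  # True: frame-level consistent with the content
```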

3. Next Steps

Complete the experiments for both papers and submit to Interspeech (submission deadline: March 26).

Discuss the papers' experiment code with the lab and the company, establish a set of conventions, and archive them.


Reposted from blog.csdn.net/u013625492/article/details/114171503