Final Summary of Professional Practice: End-to-End Cross-Lingual TTS

1. Purpose and Significance of the Practice

1.1. Background and Significance

Code-switching is a common phenomenon in multilingual societies around the world. The latest speech synthesis systems can generate monolingual speech with high intelligibility and naturalness. However, they cannot fully handle code-switched text, which can lead to missing or incorrect pronunciations in the synthesized output. Building a code-switch TTS from bilingual recordings of bilingual speakers is simple, but in reality it is expensive to obtain large amounts of such bilingual data. We therefore explore cross-lingual TTS: using a corpus of a source speaker speaking the target language, together with a corpus of the target speaker speaking the source language, to generate speech of the target speaker speaking the target language.

1.2. Existing Approaches and Their Shortcomings

Several papers have tried to solve cross-lingual TTS. They can generate expressive speech, but may produce a wrong accent because the speaker and language information is not completely disentangled. Different texts with different speakers also yield different quality, which is another big problem for commercial cross-lingual TTS. Apple studies the characteristics of the speaker embedding in cross-lingual TTS: by adjusting the small difference between the same speaker's embeddings in different languages, it achieves better timbre similarity and speech naturalness. In voice cloning and voice conversion, more attention is paid to modeling timbre. CUHK papers disentangle the linguistic content and the timbre in speech; for unseen speakers, these methods can also produce speech in their timbre by modeling a speaker embedding from reference speech, and they can implement cross-lingual TTS by referring to speech in different languages. However, these methods are not optimized for the cross-lingual TTS task: because the text languages differ and the SV module is not universal, the speaker similarity is low, and the pronunciation stability is not ideal either.

1.3. Technical Challenges

In cross-lingual TTS, we need to pay attention to three parts: speech content, speaker timbre, and language accent. Among the methods above, SV-TTS performs best on accent, which inspired us to take SV-TTS as the baseline and improve it: (1) follow the cross-lingual approach of voice conversion and model the target speaker's timbre, rather than using a source-language corpus to train the acoustic model; (2) use the phonetic posteriorgram (PPG) as a universal input representation, again keeping the source-language corpus out of acoustic-model training; (3) add DANN, LID, and Similar Loss modules designed for the cross-lingual setting.
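To make point (3) concrete, here is a minimal PyTorch sketch of a DANN-style speaker-adversarial branch built on a gradient reversal layer; the module names, dimensions, and loss weight are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class SpeakerAdversarialBranch(nn.Module):
    """Predicts the speaker from content encodings through gradient reversal,
    pushing the content encoder to drop speaker information."""

    def __init__(self, enc_dim=256, num_speakers=2, lambd=0.5):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(enc_dim, 128), nn.ReLU(), nn.Linear(128, num_speakers)
        )

    def forward(self, encoder_outputs):          # (batch, time, enc_dim)
        reversed_feats = GradReverse.apply(encoder_outputs, self.lambd)
        return self.classifier(reversed_feats)   # per-frame speaker logits
```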

 

2. Main Content of the Practice

2.1. Technical-Structure Perspective

Three main points:

  • Based on SV-TTS, use the SV module to model the target speaker's timbre instead of forcing the target speaker to record the target language, which achieves a completely correct target-language accent.
  • Use PPG as a universal input representation, turning the unseen-speaker mode into a seen-speaker mode; because the target speaker has been seen by the network during training, the synthesized timbre is closer to the target.
  • Add disentangling modules tailored to the cross-lingual task, which essentially provide high-quality structural support.

Based on the above three points, we propose an improved solution built on SV-TTS: Module TTS. Module TTS achieves a pure target-language accent, a more similar timbre, more thorough disentangling, and more stable synthesis of high-quality speech content. At the same time, it is easy to realize both conceptually and in code.
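As a sketch of the first point, the SV-TTS-style conditioning can be thought of as broadcasting a speaker-verification embedding (d-vector) over time and concatenating it with the content encodings before the decoder; the shapes and projection below are illustrative assumptions rather than the exact Module TTS layers.

```python
import torch
import torch.nn as nn


class SpeakerConditioning(nn.Module):
    """Broadcasts a fixed SV embedding (d-vector) over time and concatenates it
    with the content encodings before they enter the decoder."""

    def __init__(self, enc_dim=256, spk_dim=256, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(enc_dim + spk_dim, out_dim)

    def forward(self, content_enc, spk_emb):
        # content_enc: (batch, time, enc_dim); spk_emb: (batch, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, content_enc.size(1), -1)
        return self.proj(torch.cat([content_enc, spk], dim=-1))
```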

2.2. Engineering Perspective

[1] Two classes of problems

  • the low-resource (small-corpus) problem
  • the zero-corpus problem

[2] One turning point

  • timbre transfer

[3] A three-step strategy

  • (a): Pre-trained Ear TTS
  • (b): Self-trained Ear TTS
  • (c): PPG Ear TTS

2.3. Deliverables Perspective

  • A low-resource bilingual TTS for Ping An Technology
  • A commercializable zero-corpus bilingual TTS (the core)
  • One patent
  • Two papers

3. Main Results of the Practice

3.1. Results of Stage One

One patent, which mainly describes a cross-lingual timbre conversion method based on optimal mapping under PPG consistency.

Quoting the abstract as an explanation:

  • This invention concerns a cross-lingual timbre conversion method based on optimal mapping under PPG consistency. First, speech signal processing extracts the frame-level acoustic features of the speech to be converted, and an ASR model computes the frame-level PPG representation of the speech content corresponding to the waveform. Combined with a large pre-built corpus of the target speaker, an optimal search is performed over the target speaker's PPG set to obtain a mapping sequence that both accurately represents the content of the speech to be converted and matches the characteristics of the target speaker. Finally, a neural acoustic model and a vocoder convert it into a natural speech waveform. The invention models the relationship between the speech to be converted and the target speaker's corpus through the frame-level content representation PPG, without any language-specific constraints, and can therefore achieve cross-lingual timbre conversion. In addition, the proposed PPG-consistency evaluation criterion and the matching optimal-mapping algorithm ensure that the speech content stays close before and after conversion. The invention performs timbre conversion automatically and effectively without restricting the language of the speech to be converted. Applied in intelligent voice interaction systems, it helps the system convey information and intent better, offers a richer choice of speaker voices, and improves user satisfaction.

Related notes: https://blog.csdn.net/u013625492/article/details/114684629
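A minimal numpy sketch of the central idea in the abstract: search the target speaker's PPG set frame by frame for the closest match to the PPG of the speech being converted, and use the mapping to pick target-speaker acoustic frames. The cosine criterion and function names are assumptions for illustration, not the patented optimal-mapping algorithm itself.

```python
import numpy as np


def map_to_target_frames(src_ppg, tgt_ppg, tgt_feats):
    """For each frame of the speech to be converted, pick the target-corpus frame
    with the most similar PPG (cosine) and return its acoustic features.

    src_ppg:   (T_src, D)  PPGs of the speech to be converted
    tgt_ppg:   (T_tgt, D)  PPGs of the target speaker's corpus
    tgt_feats: (T_tgt, F)  acoustic features aligned with tgt_ppg
    """
    src = src_ppg / (np.linalg.norm(src_ppg, axis=1, keepdims=True) + 1e-8)
    tgt = tgt_ppg / (np.linalg.norm(tgt_ppg, axis=1, keepdims=True) + 1e-8)
    sim = src @ tgt.T                    # (T_src, T_tgt) cosine similarities
    mapping = sim.argmax(axis=1)         # best target frame for each source frame
    return tgt_feats[mapping], mapping
```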

3.2. Results of Stage Two

Building the PPG baseline system

3.2.1. PPG_Extractor wrapper interface

PPG is a frame-level description of the phonetic content of the audio, with the timbre disentangled and removed. With PPG features, cross-lingual timbre-transfer speech synthesis becomes straightforward.

English PPG extraction repo: https://github.com/ruclion/ppgs_extractor_10ms_sch_lh_librispeech

Chinese PPG extraction repo: https://github.com/ruclion/ppgs_extractor_10ms_sch_lh_aishell1
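For orientation, a minimal sketch of what such an extractor does conceptually: run mel frames through a frame-level ASR acoustic model and keep the per-frame posteriors over phonetic classes. The model class, layer sizes, and number of classes are placeholders, not the interface of the repositories above.

```python
import torch
import torch.nn as nn


class FrameLevelASRModel(nn.Module):
    """Placeholder frame-level ASR acoustic model: mel frames -> phonetic-class logits."""

    def __init__(self, n_mels=80, n_classes=218):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, mels):                     # (batch, time, n_mels)
        return self.net(mels)


@torch.no_grad()
def extract_ppg(asr_model, mels):
    """PPG = per-frame posterior over phonetic classes (softmax of the logits)."""
    asr_model.eval()
    return torch.softmax(asr_model(mels), dim=-1)  # (batch, time, n_classes)
```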

3.2.2. Multi-Speaker Transformation Model

To add timbre back and restore acoustic parameters from PPGs, a transformation (conversion) model is needed. Training a multi-speaker conversion model jointly allows the speech models of different languages to be shared to some extent.

Repo: https://github.com/ruclion/ppg_decode_spec_10ms_sch_Multi

For details such as the exact model structure, see: https://blog.csdn.net/u013625492/article/details/109225039
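A minimal sketch of a multi-speaker PPG-to-spectrogram conversion model, where a learned speaker embedding is concatenated to every PPG frame; the layer sizes and the GRU decoder are assumptions for illustration, not the structure described in the linked post.

```python
import torch
import torch.nn as nn


class MultiSpeakerPPG2Mel(nn.Module):
    """Maps frame-level PPGs plus a speaker ID to mel-spectrogram frames."""

    def __init__(self, ppg_dim=218, num_speakers=4, spk_dim=64, hidden=256, n_mels=80):
        super().__init__()
        self.spk_table = nn.Embedding(num_speakers, spk_dim)
        self.prenet = nn.Linear(ppg_dim + spk_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, ppg, spk_id):
        # ppg: (batch, time, ppg_dim); spk_id: (batch,)
        spk = self.spk_table(spk_id).unsqueeze(1).expand(-1, ppg.size(1), -1)
        x = torch.relu(self.prenet(torch.cat([ppg, spk], dim=-1)))
        x, _ = self.rnn(x)
        return self.out(x)                       # (batch, time, n_mels)
```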

3.2.3. Reproducing BPPG cross-lingual synthesis

Combining 3.2.1 and 3.2.2 above gives the reproduction of the BPPG version of cross-lingual synthesis. Depending on the structural details of each module, it comes in three variants:

  • CUHK-BPPG version
  • HCSI-BPPG version
  • AiLi-BPPG version

For the structural differences, see: https://blog.csdn.net/u013625492/article/details/109286333

3.2.4. Commercial mixed-lingual synthesis system

For convenience, in the context of this document, cross-lingual and mixed-lingual synthesis specifically mean:

  1. Cross-lingual: only corpora of a Chinese speaker speaking Chinese plus an English speaker speaking English are available, and the Chinese speaker is made to "cross over" and speak English; this leans toward research.
  2. Mixed-lingual: regardless of what corpora are available, the goal is to synthesize code-switched sentences in a single, consistent timbre, where the text contains code-switching; this leans more toward layered engineering.

That is, cross-lingual synthesis is a part, or a special case, of mixed-lingual synthesis. This part follows Alibaba's 2020 paper: Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion.

Taking the Ping An Technology corpus "Chunchun" (春春) as an example,

the mixed-lingual pipeline is as follows:

  1. Starting point: 6,050 Chinese recordings + 200 English recordings.
  2. Cross-lingually synthesize a virtual English corpus (fine-tune a multi-speaker English TTS with the 200 Chunchun English recordings, then run inference on the whole LJSpeech text set); this step was done by a colleague.
  3. Result: 6,050 Chinese recordings + 10,000 English (virtual) utterances.
  4. Train a mixed-lingual TTS on the bilingual corpus; pay attention to the input text representation, the choice of vocoder, and the pronunciation and fluency when inferring code-switched text (a sketch of one possible text representation follows this list).
  5. (Optional) Use a TTS that handles code-switched text well (e.g. Transformer TTS) to generate a virtual mixed-lingual corpus, combine it with the bilingual corpus, and retrain the TTS.
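For step 4, one possible input text representation is sketched below: split the code-switched text into Chinese and English runs and map each run to language-tagged symbols. The segmentation rule and the placeholder g2p functions are assumptions, not the representation actually used in the system.

```python
import re

# Hypothetical language-specific grapheme-to-phoneme front ends.
def zh_g2p(text):
    return ["zh:" + ch for ch in text]                 # placeholder: one symbol per character

def en_g2p(text):
    return ["en:" + w.lower() for w in text.split()]   # placeholder: one symbol per word


def code_switch_frontend(text):
    """Split code-switched text into Chinese/English runs and concatenate the
    language-tagged symbol sequences in order."""
    symbols = []
    for run in re.findall(r"[A-Za-z][A-Za-z\s']*|[^A-Za-z]+", text):
        if re.match(r"[A-Za-z]", run):
            symbols += en_g2p(run.strip())
        else:
            symbols += zh_g2p(run.strip())
    return symbols


print(code_switch_frontend("今天我们讨论 attention 机制"))
```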

The first version of the Tacotron TTS for the Chunchun voice was completed, borrowing the lab's Google-style codebase and introducing a VAE structure (whether it improves quality was not investigated); the code was uploaded to the same Git repository as the BPPG work.

3.2.5. Detail-correction work for commercial TTS synthesis

I took part in the detail-correction work for TTS synthesis; through the company's rather meticulous commercialization process, I learned many details and patterns:

  1. Selecting and correcting errors in continuous digit reading and license-plate pronunciation, and the approach behind it (a hypothetical sketch follows this list)
  2. The full workflow of cutting a specified pronunciation out of an audio file, used to reinforce that pronunciation
  3. Pinyin calibration and correction of the training set after ASR-based filtering
  4. ASR-based scoring of Chinese TTS (not used yet)
  5. Listening tests of batch text synthesis, and the feedback workflow
  6. Quickly extracting the texts to be synthesized from real workflow documents that contain use-case diagrams
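As a purely hypothetical illustration of item 1, a tiny sketch of expanding a license-plate string digit by digit before it reaches the TTS front end, so the digits are not read as one long number; the reading table and rule are assumptions, not the company's actual normalization code.

```python
# Hypothetical per-character readings for license plates: digits are read one by
# one, and "1" is read as "幺" as is common when reading out plate/phone numbers.
PLATE_READINGS = {
    "0": "零", "1": "幺", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
}


def normalize_plate(plate):
    """Expand a plate like '京A88888' so that each digit becomes its own syllable."""
    return " ".join(PLATE_READINGS.get(ch, ch) for ch in plate)


print(normalize_plate("京A88888"))   # -> 京 A 八 八 八 八 八
```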

Through this I learned the workflow details and concrete operating practices of commercial TTS applications.

Notes: https://blog.csdn.net/u013625492/article/details/111192083

3.3. Results of Stage Three

3.3.1. The cross-lingual timbre-conversion structure proposed by Alibaba

Tacotron-based mapping from PPG to mel spectrogram:

  • PPG downsampling (see the sketch after this list)
  • Experiments on which parts to freeze during fine-tuning
  • Experiments on how far to fine-tune
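A minimal PyTorch sketch of the two knobs listed above: average-pooling the PPG frames over time, and freezing part of the model when fine-tuning on the small target-speaker corpus. The pooling factor and the `encoder` attribute name are illustrative assumptions.

```python
import torch.nn as nn


def downsample_ppg(ppg, factor=4):
    """Average-pool PPG frames over time: (batch, T, D) -> (batch, T // factor, D)."""
    return nn.functional.avg_pool1d(ppg.transpose(1, 2), kernel_size=factor).transpose(1, 2)


def freeze_for_finetune(model, freeze_encoder=True):
    """Freeze the (hypothetical) encoder so that only the remaining parameters are
    updated when fine-tuning on the small target-speaker corpus."""
    if freeze_encoder:
        for p in model.encoder.parameters():
            p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]
```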

3.3.2. PPG-Tacotron code implementation

  • Comparison of Alibaba's structural modifications relative to Tacotron
  • Implementation based on r9y9's PyTorch code

3.3.3. Reproducing AutoVC

Reproduced the AutoVC paper and explored the conditions that affect the experimental results:

  • Similar Loss: a corollary drawn from AutoVC's content loss, and a discussion of its effect on the autoencoder structure
  • The effect of different acoustic-feature extraction hyperparameters on the results
  • The role of the bottleneck dimension and downsampling proposed by AutoVC (see the sketch after this list)
  • The difference between the one-hot and speaker-encoder schemes
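A minimal sketch of the AutoVC-style information bottleneck referred to above: project the content code to a narrow dimension and keep only every k-th frame, so that timbre has to come from the speaker embedding. The dimensions and factor are illustrative and follow the spirit of the paper rather than its exact configuration.

```python
import torch.nn as nn


class ContentBottleneck(nn.Module):
    """Project encoder outputs to a narrow code and keep only every k-th frame,
    so the decoder has to take timbre from the speaker embedding instead."""

    def __init__(self, enc_dim=512, code_dim=32, factor=16):
        super().__init__()
        self.factor = factor
        self.down = nn.Linear(enc_dim, code_dim)

    def forward(self, enc_out):                           # (batch, time, enc_dim)
        codes = self.down(enc_out)[:, :: self.factor, :]  # temporal downsampling
        # Upsample back by repetition so the decoder sees full-length frames.
        return codes.repeat_interleave(self.factor, dim=1)
```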

The experimental conclusions were applied in a colleague's paper.

Notes: https://blog.csdn.net/u013625492/article/details/113393773

3.4. Results of Stage Four

The previous work was summarized into two papers.

3.4.1. Model structure optimization

Module TTS: Cross-lingual Improved Transfer Learning from Speaker Verification to Multispeaker TTS

3.4.2. Reference audio optimization

One-reference TTS: A Way to Provide Reference Speech for Voice in Cross-lingual Voice Clone Task

Details: https://blog.csdn.net/u013625492/article/details/114171503

3.5. Results of Stage Five

Within the three-step plan for cross-lingual timbre transfer, the system setup and experimental results for the first two steps:

  1. Pre-trained Ear Speech
  2. Self-trained Ear Speech

Details: https://mp.csdn.net/editor/html/114684610
