Paper Reading_Speech Synthesis_VALL-E X

Paper information

name_en: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
paper_addr: http://arxiv.org/abs/2303.03926
date_read: 2023-04-25
date_publish: 2023-03-07
tags: ['Deep Learning','Speech Synthesis']
author: Ziqiang Zhang, Microsoft
code: https://github.com/microsoft/unilm

1 Feedback

An extension of VALL-E: given source-language speech and target-language text as prompts, it predicts the acoustic token sequence of the target-language speech, which can be used for speech-to-speech translation. It generates high-quality speech in the target language while preserving the voice, emotion, and acoustic environment of unseen speakers, and it effectively alleviates the foreign-accent problem, which can be controlled via a language ID.

Target speech is generated with prompts consisting of phoneme sequences derived from the source and target texts, plus source acoustic tokens derived from an audio codec model.

2 Introduction

Main contributions
• Proposed VALL-E X, a cross-lingual conditional codec language model that uses source-language speech and target-language text as prompts to predict the acoustic tokens of target-language speech.
• A multilingual in-context learning framework that preserves the voice, emotion, and acoustic environment of unseen speakers, relying only on a single sentence prompt in the source language.
• Significantly reduces the foreign-accent problem, a well-known challenge in cross-lingual speech synthesis.
• Applied VALL-E X to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation, beating strong baselines on speaker similarity, speech quality, translation quality, speech naturalness, and human evaluation.

3 Methods

In addition to the model itself, a G2P tool is used to convert text into phonemes, and the EnCodec codec is used to tokenize speech and to generate the output waveform.
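For the audio side, the paper uses EnCodec; below is a minimal sketch with the open-source encodec package showing both directions, speech to acoustic tokens and back (the file path and the bandwidth setting are assumptions, not values from the paper):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model and pick a bandwidth;
# at 6 kbps the 24 kHz model uses 8 codebooks of 1024 entries each.
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("source_speech.wav")  # placeholder path
wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)

# Speech -> discrete acoustic tokens (the paper's A sequences).
with torch.no_grad():
    frames = codec.encode(wav.unsqueeze(0))        # list of (codes, scale)
codes = torch.cat([c for c, _ in frames], dim=-1)  # [1, n_q, T]

# Acoustic tokens -> waveform, via the EnCodec decoder.
with torch.no_grad():
    reconstructed = codec.decode(frames)           # [1, channels, T_samples]
```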

3.1 Model Architecture

It consists of a multilingual autoregressive codec language model and a multilingual non-autoregressive codec language model.
Multilingual acoustic tokens (A) and phoneme sequences (S) are obtained from speech and its transcription using the EnCodec encoder and a G2P tool, respectively. During training, the two models are optimized on (S, A) pairs from different languages. In this paper, semantic tokens refer to phoneme sequences.
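The paper does not publish reference code for the two models, so the following is a hypothetical sketch of VALL-E-style two-stage decoding with invented `ar_model`/`nar_model` interfaces, assuming 8 EnCodec codebooks: the AR model emits first-codebook tokens one at a time, and the NAR model then fills in the remaining codebooks in parallel.

```python
import torch

def generate_acoustic_tokens(ar_model, nar_model, phonemes, prompt_codes,
                             n_codebooks=8, max_steps=1500, eos_id=1024):
    # Stage 1: autoregressive generation of codebook-1 tokens,
    # continuing from the acoustic prompt's first codebook.
    first = prompt_codes[0].tolist()
    for _ in range(max_steps):
        logits = ar_model(phonemes, torch.tensor(first))  # [vocab]
        nxt = int(torch.multinomial(torch.softmax(logits, -1), 1))
        if nxt == eos_id:
            break
        first.append(nxt)
    codes = [torch.tensor(first)]

    # Stage 2: non-autoregressive prediction of codebooks 2..n,
    # each stage conditioned on all previously predicted codebooks.
    for q in range(1, n_codebooks):
        logits = nar_model(phonemes, torch.stack(codes), stage=q)  # [T, vocab]
        codes.append(logits.argmax(-1))
    return torch.stack(codes)  # [n_codebooks, T]
```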

3.2 Multilingual Training

A bilingual speech-transcription (ASR) corpus with pairs (Ss, As) and (St, At) is used to train the multilingual model.
In addition, language IDs are used to direct language-specific speech generation in VALL-E X. Because the model is trained on multilingual data, it may fail to select acoustic tokens appropriate for a specific language if no ID is given; for example, Chinese is a tonal language while English is not. Concretely, the language ID is embedded into a dense vector and added to the embeddings of the acoustic tokens, which turns out to be surprisingly effective at guiding the correct speaking style and mitigating accent problems.
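A minimal sketch of this language-ID mechanism as described: the ID is looked up in an embedding table and added to the acoustic-token embeddings (the class name, vocabulary sizes, and dimensions are assumptions).

```python
import torch
import torch.nn as nn

class TokenEmbeddingWithLangID(nn.Module):
    """Embed acoustic tokens and add a language-ID embedding,
    per the paper's description (sizes are assumptions)."""
    def __init__(self, vocab_size=1024, n_langs=2, d_model=1024):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)  # e.g. 0: EN, 1: ZH

    def forward(self, tokens, lang_id):
        # tokens: [B, T] acoustic token ids; lang_id: [B]
        x = self.token_emb(tokens)                     # [B, T, d]
        return x + self.lang_emb(lang_id)[:, None, :]  # broadcast over T
```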

3.3 Multilingual Inference

The autoregressive and non-autoregressive models take different inputs; the speech-to-speech translation process works as follows.
Given source speech Xs, the speech recognition and translation model first generates the source phonemes Ss with its semantic encoder and the target phonemes St with its semantic decoder. In addition, Xs is compressed into the source acoustic tokens As by the EnCodec encoder. Then Ss, St, and As are concatenated as the input to VALL-E X, which generates the acoustic token sequence of the target speech. Finally, the EnCodec decoder converts the generated acoustic tokens into the target speech.
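A sketch of the glue code for this pipeline, with every component name hypothetical: `recognize_and_translate` stands in for the SpeechUT-based recognition/translation model, `vall_e_x` for the two codec language models, and `codec` for EnCodec.

```python
def speech_to_speech(source_wav, recognize_and_translate, codec, vall_e_x,
                     tgt_lang_id):
    # Semantic encoder/decoder: source phonemes S_s and target phonemes S_t.
    S_s, S_t = recognize_and_translate(source_wav)
    # EnCodec encoder: source speech -> source acoustic tokens A_s.
    A_s = codec.encode(source_wav)
    # Concatenate S_s, S_t, and A_s as the prompt; VALL-E X predicts the
    # target acoustic tokens A_t, steered by the target language ID.
    A_t = vall_e_x.generate(phonemes=S_s + S_t, acoustic_prompt=A_s,
                            lang_id=tgt_lang_id)
    # EnCodec decoder: target acoustic tokens -> target waveform.
    return codec.decode(A_t)
```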

4 Relevant Knowledge

  • SpeechUT: SpeechUT is a cross-modal pre-trained model that bridges speech and text. It uses hidden units as an interface to align the two modalities, connecting the representations of a speech encoder and a text decoder through a shared unit encoder.
  • G2P Tool is short for Grapheme-to-Phoneme tool: it converts the graphemes (letters) of words into phonemes and is commonly implemented with a recurrent neural network; see the example after this list.
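For illustration, here is what a G2P front end produces, using the open-source g2p_en package as a stand-in (the paper does not say which G2P tool it uses):

```python
from g2p_en import G2p  # an English G2P tool; a stand-in, not the paper's

g2p = G2p()
print(g2p("codec language model"))
# e.g. ['K', 'OW1', 'D', 'EH0', 'K', ' ', 'L', 'AE1', 'NG', ...]
# ARPAbet phones with stress digits; spaces separate words.
```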
