Clone your voice with just three seconds of audio: voice cloning has evolved again!

Recently, Microsoft unveiled a speech synthesis model that needs only a single sentence of source-language speech as a prompt to generate high-quality speech in a target language, while preserving the speaker's voice, emotion, and the acoustic environment of the source recording. The model also effectively alleviates the foreign-accent problem, which can be controlled by a language ID tag in the prompt.


The framework is named VALL-E X. Using the source speech tokens as a prompt, the model produces acoustic tokens in the target language, which are then decompressed into the target speech waveform. Thanks to its strong in-context learning ability, VALL-E X can perform a variety of zero-shot cross-lingual speech generation tasks, such as cross-lingual text-to-speech synthesis and speech-to-speech translation, without requiring cross-lingual data from the same speaker during training.


Training diagram of VALL-E X: multilingual acoustic tokens (A) and phoneme sequences (S) are obtained from speech and its transcript via an audio codec encoder and a grapheme-to-phoneme conversion tool, respectively. During training, paired S and A sequences in different languages are used to optimize the two models.
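To make that data pipeline concrete, here is a minimal sketch of how such (S, A) training pairs could be built from (audio, transcript) pairs. It uses the open-source EnCodec codec and the g2p_en grapheme-to-phoneme package as stand-ins; the paper does not name its exact tooling, so treat this as an illustration rather than the authors' pipeline.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from g2p_en import G2p

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)  # 6 kbps -> 8 residual quantizers per frame
g2p = G2p()

def make_training_pair(wav_path, transcript):
    wav, sr = torchaudio.load(wav_path)                    # (channels, T)
    wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)
    with torch.no_grad():
        frames = codec.encode(wav.unsqueeze(0))            # list of (codes, scale)
    A = frames[0][0].squeeze(0)   # acoustic tokens, shape (8, T')
    S = g2p(transcript)           # phoneme sequence, e.g. ['HH', 'AH0', ...]
    return S, A
```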

VALL-E adopts two-stage modeling: first, an autoregressive language model generates the codes of the first EnCodec quantizer, conditioned on the phoneme sequence; then a non-autoregressive model generates the codes of the remaining quantizers in parallel. After training on LibriLight, a large-scale English speech transcription dataset, VALL-E demonstrates strong in-context learning: it needs only a 3-second voice clip as a prompt to generate personalized speech. Building on VALL-E, VALL-E X extends this approach to cross-lingual speech synthesis and speech-to-speech translation tasks.
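The two-stage decoding can be sketched as follows. This is a minimal illustration with the model calls stubbed out by random predictions: the AR-then-NAR structure and the code shapes follow the description above, but the function names and dimensions are placeholders, not the actual implementation.

```python
import numpy as np

CODEBOOK_SIZE = 1024    # EnCodec codebooks have 1024 entries
NUM_QUANTIZERS = 8      # residual quantizers per frame at 6 kbps

def ar_next_code(phonemes, prompt_codes, generated_so_far):
    # Placeholder for the autoregressive LM over first-quantizer codes.
    return np.random.randint(CODEBOOK_SIZE)

def nar_layer(phonemes, prompt_codes, codes_so_far, layer):
    # Placeholder for the non-autoregressive model: a whole quantizer
    # layer is predicted in one parallel pass.
    num_frames = codes_so_far.shape[1]
    return np.random.randint(CODEBOOK_SIZE, size=num_frames)

def synthesize(phonemes, prompt_codes=None, max_frames=200):
    # Stage 1: generate the first quantizer's codes one frame at a time.
    first_layer = []
    for _ in range(max_frames):
        first_layer.append(ar_next_code(phonemes, prompt_codes, first_layer))
    codes = np.array(first_layer)[None, :]            # shape (1, T)
    # Stage 2: generate layers 2..8, each in a single parallel pass.
    for layer in range(1, NUM_QUANTIZERS):
        new_layer = nar_layer(phonemes, prompt_codes, codes, layer)
        codes = np.vstack([codes, new_layer[None, :]])
    return codes                # (8, T): the input to the EnCodec decoder

codes = synthesize(phonemes="ni2 hao3")
print(codes.shape)  # (8, 200)
```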

Although Microsoft has not released the model itself, someone has roughly reproduced it, so let's take a sneak peek. Open "VALL E X - a Hugging Face Space by Plachta" to reach the web demo.


First, we find a source voice; here we use the dubbed voice of Peppa Pig's grandma.

Then we fill the speech's content into the Transcript field and click Make prompt to generate the acoustic prompt (the acoustic tokens of the source voice).
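If you prefer scripting this step, the reproduction's repository documents a make_prompt helper that does the same thing. The snippet below follows the usage shown in its README at the time of writing; the file name and transcript are made up for illustration, and the exact API may have changed since.

```python
from utils.prompt_making import make_prompt  # from the VALL-E-X repo

# Build an acoustic prompt from a short recording plus its transcript.
# "grandma_pig.wav" and the transcript are placeholders for your own clip.
make_prompt(name="grandma_pig",
            audio_prompt_path="grandma_pig.wav",
            transcript="I'm Grandma Pig, welcome to our house.")
```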


After the prompt has been generated, we fill the text we want to synthesize into the Text field, then select the language and accent; here we choose Chinese for both. The generated result is as follows. It sounds remarkably natural, though a slight jitter is still audible.

Let's try another voice.

Converted result:

If you want to try it locally, you can do the following.

First, clone the repository: git clone https://github.com/Plachtaa/VALL-E-X.git

Then enter the project folder: cd VALL-E-X

Create and activate the virtual environment:

python -m venv venv  # create the virtual environment

venv\Scripts\activate  # activate it (Windows; on Linux/macOS use: source venv/bin/activate)

Install the required dependencies: pip install -r requirements.txt

The required models will be downloaded automatically the first time you run the program.
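From there, a minimal generation script looks roughly like the following. It follows the usage documented in the repository's README at the time of writing (preload_models, generate_audio, and SAMPLE_RATE are the repo's documented names), and the prompt name assumes the make_prompt call shown earlier.

```python
from scipy.io.wavfile import write as write_wav
from utils.generation import SAMPLE_RATE, generate_audio, preload_models

# Download/load all checkpoints (slow on first run).
preload_models()

# Synthesize speech in the voice captured by the "grandma_pig" prompt
# created earlier with make_prompt.
audio_array = generate_audio("你好,欢迎来到我们家!", prompt="grandma_pig")

write_wav("cloned_voice.wav", SAMPLE_RATE, audio_array)
```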

If you'd like to try it, go ahead and give it a shot.


Source: blog.csdn.net/wutao22/article/details/132705755