Microsoft recently unveiled a speech synthesis model that needs only a short speech sample in the source language as a prompt to generate high-quality speech in a target language, while preserving the speaker's voice, emotion, and acoustic environment from the source recording. The model also effectively alleviates the foreign-accent problem, which can be controlled by a language ID token in the prompt.
The framework is named VALL-E X. Using source-language speech tokens as prompts, the model produces acoustic tokens in the target language, which can then be decoded back into a target-language speech waveform. Thanks to its strong in-context learning ability, VALL-E X can perform various zero-shot cross-lingual speech generation tasks, such as cross-lingual speech synthesis and speech-to-speech translation, without requiring cross-lingual data from the same speaker during training.
Training setup of VALL-E X: multilingual acoustic tokens (A) and phoneme sequences (S) are derived from speech and its transcription via an audio codec encoder and a grapheme-to-phoneme (G2P) conversion tool, respectively. During training, paired S and A sequences in different languages are used to optimize the two models.
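To make the data preparation concrete, here is a toy sketch of how a (phoneme sequence, acoustic token) training pair might look. The G2P table and codec values below are invented for illustration; a real system uses a trained G2P tool and a neural audio codec.

```python
# Hypothetical G2P lookup: words -> phoneme symbols (illustrative only)
G2P = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

# Map each phoneme symbol to an integer ID
PHONEME_IDS = {p: i for i, p in enumerate(
    sorted({p for ps in G2P.values() for p in ps}))}

def text_to_phoneme_ids(text):
    """Convert transcript text into a phoneme-ID sequence S."""
    ids = []
    for word in text.lower().split():
        ids.extend(PHONEME_IDS[p] for p in G2P[word])
    return ids

def fake_codec_encode(num_frames, num_quantizers=8, codebook_size=1024):
    """Stand-in for an audio codec encoder: returns an acoustic-token
    matrix A of shape (num_quantizers, num_frames)."""
    return [[(q * 37 + t * 11) % codebook_size for t in range(num_frames)]
            for q in range(num_quantizers)]

S = text_to_phoneme_ids("hello world")  # phoneme sequence from the transcript
A = fake_codec_encode(num_frames=50)    # acoustic tokens from the waveform
print(len(S), len(A), len(A[0]))        # 8 phonemes; 8 codebooks x 50 frames
```

The pairing of S and A across languages is what the two models are trained on.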
It adopts two-stage modeling: an autoregressive language model first generates the codes of the first quantizer of the EnCodec codec, conditioned on the paired phoneme sequence, and a non-autoregressive model then generates the codes of the remaining quantizers in parallel. After training on LibriLight, a large-scale English speech corpus, VALL-E demonstrates strong in-context learning: it needs only a 3-second voice clip as a prompt to generate personalized speech. Building on VALL-E, VALL-E X extends this approach to cross-lingual synthesis and translation tasks.
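The two-stage decoding described above can be sketched as follows, with dummy "models" standing in for the trained networks (all names here are illustrative, not the actual implementation):

```python
import random

NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024

def ar_model_step(phonemes, prompt_tokens, generated):
    """Dummy autoregressive step: predicts the NEXT first-quantizer token,
    conditioned on the phonemes, the prompt, and tokens generated so far.
    A real model would be a Transformer decoder."""
    return random.randrange(CODEBOOK_SIZE)

def nar_model(phonemes, prompt_tokens, first_layer):
    """Dummy non-autoregressive model: predicts the remaining quantizer
    layers, each layer produced for all frames at once."""
    return [[random.randrange(CODEBOOK_SIZE) for _ in first_layer]
            for _ in range(NUM_QUANTIZERS - 1)]

def synthesize(phonemes, prompt_tokens, num_frames):
    # Stage 1: generate the first quantizer's codes token by token (AR).
    first_layer = []
    for _ in range(num_frames):
        first_layer.append(ar_model_step(phonemes, prompt_tokens, first_layer))
    # Stage 2: generate the other quantizers' codes in parallel (NAR).
    rest = nar_model(phonemes, prompt_tokens, first_layer)
    return [first_layer] + rest  # shape: (NUM_QUANTIZERS, num_frames)

codes = synthesize(phonemes=[3, 1, 4], prompt_tokens=[42, 7], num_frames=20)
print(len(codes), len(codes[0]))  # 8 layers of 20 tokens each
```

The full (8, num_frames) code matrix would then be passed to the codec decoder to recover the waveform.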
Although Microsoft has not released its model, someone has roughly reproduced it, so let's take a sneak peek. Open "VALL E X - a Hugging Face Space by Plachta" to see the web interface.
First, we pick a source voice; here we use a clip of Grandma Pig's voice from Peppa Pig.
Then we fill the spoken content into the Transcript field and click "Make prompt" to generate the acoustic tokens.
After the acoustic tokens are generated, we fill the text we want to synthesize into the Text field, then select the language and accent; here we choose Chinese for both. The result is below: it sounds very natural, though a slight jitter can still be heard.
Let's try another voice.
Converted result:
If you want to try it locally, you can do so as follows.
First, clone the repository: git clone https://github.com/Plachtaa/VALL-E-X.git
Then enter the project folder: cd VALL-E-X
Create and activate a virtual environment:
python -m venv venv  # create the virtual environment
venv\Scripts\activate  # activate it (Windows; on Linux/macOS use: source venv/bin/activate)
Install the required dependencies: pip install -r requirements.txt
During the installation process, the required models will be downloaded automatically
If you'd like to try it yourself, give it a go now!