AI Voice Cloning with MockingBird: Introduction and Practice (Generate the Voice Content You Want in Seconds)

Preface

With the continuous development of artificial intelligence, voice cloning has received increasing attention and research. Current AI voice cloning technology can already make a machine imitate a person's voice, and even mimic that person's speaking habits and expressions.

However, AI voice cloning still faces many difficulties and pain points. First of all, existing systems suffer from limited audio quality and imperfect reproduction of the original voice, so it is hard to achieve a result that is truly indistinguishable from the real speaker. In response, our team proposed a new AI voice cloning solution, MockingVoice, built on the open source project MockingBird. By adopting more advanced speech synthesis techniques and stricter privacy protection measures, it aims to deliver higher-quality cloning results and a safer, more reliable user experience. We believe this technology can bring people a more intelligent, convenient, and secure voice interaction experience, and open up more possibilities in everyday life and work.

1. Introduction to MockingBird

MockingBird is an advanced TTS (text-to-speech) project that uses deep learning models to generate high-quality synthetic speech. It was developed by a team of researchers and engineers passionate about natural language processing and speech technology.

Key features: Mandarin support. The model was trained and tested on multiple Chinese datasets, including aidatatang_200zh, magicdata, aishell3, biaobei, MozillaCommonVoice, and data_aishell, so the generated speech sounds natural and fluent for a variety of applications such as voice assistants, audiobooks, and language learning tools.

Deep learning framework: PyTorch. MockingBird uses PyTorch as its main deep learning framework and has been tested on PyTorch 1.9.0 (the latest release as of August 2021). It supports Tesla T4 and GTX 2060 GPUs for faster training and inference.

Extensibility: easy to use and customize. A pre-trained synthesizer is provided for immediate use, or you can train your own to generate speech for your specific needs. You can also reuse a pre-trained encoder and vocoder, or use a real-time HiFi-GAN vocoder to generate high-quality speech.

Web service support: remote calling. MockingBird can run as a web service, which means you can deploy it on a server and generate speech remotely. This is especially useful for applications that require real-time speech synthesis, or for users who do not have access to high-end hardware.

2. Deployment practice

1. Environment installation

The test environment for this article: a Mac with an Apple M1 chip running macOS Monterey.
Step 1 Download the code: clone the git repository from https://github.com/babysor/MockingBird
Step 2 Install Anaconda: you can download it from the mirror site https://repo.anaconda.com/archive/; find the installer that matches your machine, then download and install it.
After the installation is completed, as shown in the figure:

[Screenshot: Anaconda installation complete]

Step 3 Build a virtual python environment.
Note: The original project depends on specific versions of third-party libraries, and some of those libraries require a particular Python version, so it is recommended to install the versions given in this article. This setup has been tested on an M1 Mac.
Run the command: conda create -n mock_voice python=3.9
Activate the virtual environment mock_voice: conda activate mock_voice
Step 4 Install third-party dependent libraries.
Run the command directly: pip install -r requirements.txt
If you follow the repository referenced in this article, downloading and installing should work without problems. If a package fails to install along the way, search for the specific error message yourself.
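
A quick way to confirm that the core dependencies landed in the new environment is to import them and print their versions. This is just a sanity check of my own, not part of MockingBird; librosa and numpy are assumed to have been pulled in by requirements.txt:

```python
# Sanity check for the mock_voice environment (run inside the activated environment).
import numpy as np
import librosa
import torch

print("PyTorch version:", torch.__version__)
print("librosa version:", librosa.__version__)
print("NumPy version:", np.__version__)

# On an Apple M1 there is no CUDA; newer PyTorch builds expose the MPS backend instead.
print("CUDA available:", torch.cuda.is_available())
if hasattr(torch.backends, "mps"):
    print("MPS available:", torch.backends.mps.is_available())
```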

2. Download the pre-trained model

Here we can directly download a model trained by community developers and use it as-is. The download links are as follows:

- https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g (Baidu disk, extraction code: 4j5d): 75k steps, mixed training on 3 open source datasets
- https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw (Baidu disk, extraction code: om7f): 25k steps, mixed training on 3 open source datasets; switch to tag v0.0.1 to use
- https://drive.google.com/file/d/1H-YGOUHpmqKxJ9FRc6vAjPuqQki24UbC/view?usp=sharing (Google Drive; extraction code: 1024): 200k steps, Taiwanese accent; switch to tag v0.0.1 to use
- https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ (Baidu disk, extraction code: 2021): 150k steps; fixed according to the issue, switch to tag v0.0.1 to use

We download the first model and place the file at the path: data/ckpt/synthesizer/pretrained-11-7-21_75k.pt
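
Before launching the toolbox, you can check that the download is intact and sits at the expected path by loading it with PyTorch. This is only a sanity check under the assumption that the file is an ordinary PyTorch checkpoint; the toolbox does the real loading itself:

```python
# Verify that the downloaded synthesizer checkpoint can be deserialized.
import torch

ckpt_path = "data/ckpt/synthesizer/pretrained-11-7-21_75k.pt"
checkpoint = torch.load(ckpt_path, map_location="cpu")

# The file is expected to contain a dict of saved states; print its top-level keys.
if isinstance(checkpoint, dict):
    print("Checkpoint keys:", list(checkpoint.keys()))
else:
    print("Loaded object of type:", type(checkpoint))
```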

3. Run the toolbox

Step 1 Record with Audacity: if we record with the toolbox that comes with MockingBird, the resulting clone often sounds poor. It is better to record our own voice with a professional tool and denoise it. Download and install Audacity: https://www.audacityteam.org/

Open Audacity, record your voice, and apply noise reduction to the recording.

Finally, export the recorded audio as personal_test.wav to your local computer.
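
Before loading the file into the toolbox, it is worth confirming that the export is mono and only a few seconds long (see the duration discussion in the analysis section below). Here is a minimal check using soundfile, which comes along with librosa; the file name matches the export above:

```python
# Inspect the recording exported from Audacity before using it in the toolbox.
import soundfile as sf

info = sf.info("personal_test.wav")
print("Sample rate:", info.samplerate, "Hz")
print("Channels:", info.channels)            # 1 = mono is preferable
print("Duration:", round(info.duration, 2), "seconds")
```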

Step 2 Run the toolbox: run the command python demo_toolbox.py and the toolbox interface will load.

Step 3 Load the recording: in the toolbox, load the personal_test.wav file exported earlier.

Step 4 Synthesize only: enter the Chinese text you want to test and click the Synthesize only button.
Step 5 Vocode only: click the Vocode only button.

Finally, the generated sound is played.

3. Analysis and conclusion

1. Duration of the input recording

It is best to limit the audio to be cloned to between 3 and 8 seconds. This runs counter to the intuition that the longer the input speech, the more accurately the timbre is captured. Because of how the model works, its capacity to extract timbre features is limited: longer audio is still mapped into the same relatively small embedding, so accuracy does not improve. In addition, the model is typically fed short utterances (on the order of 1 to 10 seconds) during training, so unusual pauses in long audio can make inference diverge. The recommended length of the input audio is therefore 3 to 8 seconds; longer is not better.
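
If your recording runs longer than this, one simple option is to cut out a slice of at most 8 seconds before loading it into the toolbox. Below is a minimal sketch with librosa and soundfile; the file names are examples:

```python
# Take at most the first 8 seconds of a longer recording (illustrative example).
import librosa
import soundfile as sf

y, sr = librosa.load("personal_test.wav", sr=None)  # keep the original sample rate
max_seconds = 8
clip = y[: int(max_seconds * sr)]

print("Original duration:", round(len(y) / sr, 2), "s")
print("Clip duration:", round(len(clip) / sr, 2), "s")
sf.write("personal_test_clip.wav", clip, sr)
```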

2. Remove obvious background sound/noise from the input audio

Although some optimizations have been made, in particular after the introduction of GST, the latest code base can extract and separate part of the noise features in the speaker encoder to reduce the impact of noise, so clone synthesis still works even with some background noise. However, the original model is still prone to losing timbre information when there is a noise floor. For better cloning results, we recommend pre-processing the input audio with a professional audio tool, or an open source tool such as Audacity, to remove obvious noise. This can greatly improve the cloning result.
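
If you would rather do this step in code than in Audacity, the third-party noisereduce package (installed separately with pip install noisereduce; it is not part of MockingBird's requirements) implements a simple spectral-gating denoiser. A minimal sketch:

```python
# Optional: stationary noise reduction with the noisereduce package.
# This is an alternative to Audacity's noise reduction, not part of MockingBird.
import librosa
import soundfile as sf
import noisereduce as nr

y, sr = librosa.load("personal_test.wav", sr=None)
y_denoised = nr.reduce_noise(y=y, sr=sr)  # spectral gating with default settings
sf.write("personal_test_denoised.wav", y_denoised, sr)
```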

3. Make sure the input audio contains only one voice

In practice, when the input audio contains more than one person's voice, the cloned voice becomes unrecognizable (sometimes it even sounds like a ghostly voice), often cannot produce audio of acceptable quality, and tends to drop words.

4. Keep the speaking voice in the input audio flat

During cloning, it is often difficult to extract accurate timbre characteristics from singing or highly excited voices. For better results, the speech in the input audio should be delivered in a normal, even intonation.

5. Check the mel spectrogram while synthesizing

A certain amount of randomness is added during synthesis, so you can repeatedly perform only the Synthesize step, check the mel spectrogram produced by synthesis inference, and only run the vocoder once you get a satisfactory result. [Figure: example of a good mel spectrogram for reference]
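
To get a feel for what these plots look like, you can also plot the mel spectrogram of your own recording (or of any generated WAV) with librosa outside the toolbox. A minimal sketch; the file name is an example:

```python
# Plot the mel spectrogram of a WAV file for visual comparison.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("personal_test.wav", sr=None)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # 80 mel bands, a common TTS setting
S_db = librosa.power_to_db(S, ref=np.max)

plt.figure(figsize=(10, 4))
img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(img, format="%+2.0f dB")
plt.title("Mel spectrogram of personal_test.wav")
plt.tight_layout()
plt.show()
```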

Other

Reference: "[AI Voice Cloning] Clone your voice and generate arbitrary voice content within 5 seconds", Xiaohu AI Lab's Blog, CSDN.

Source: blog.csdn.net/zhanggqianglovec/article/details/131454553