Bert-vits2 new version V2.1: local English model training and mixed Chinese and English inference (mix)


Mixed Chinese and English output is a very common requirement in text-to-speech (TTS) projects, especially for technical articles and technical videos, where Chinese text inevitably contains a large number of English words. Naturally, we don't want the AI voice-over to be able to read only the Chinese parts. Older versions of Bert-vits2 (below 2.0) do not support English training and inference, but after the base model was updated, versions above V2.0 support mixed Chinese and English inference (mix mode).

Let’s take Swift as an example:

https://www.bilibili.com/video/BV1bB4y1R7Nu/

A 30-second audio clip of Swift speaking English:

Bert-vits2 English material processing

First clone the project:

git clone https://github.com/v3ucn/Bert-VITS2_V210.git

Install dependencies:

pip3 install -r requirements.txt

Put the audio material into the Data/meimei_en/raw directory, where the en suffix marks the English character.

Then split the material:

python3 audio_slicer.py
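
For intuition, here is a minimal sketch of what silence-based slicing amounts to, using pydub (an assumption for illustration only; the project's audio_slicer.py ships its own slicer with its own thresholds and file layout):

from pydub import AudioSegment
from pydub.silence import split_on_silence

# load the raw recording (hypothetical path)
audio = AudioSegment.from_wav("Data/meimei_en/raw/input.wav")

# cut wherever at least 500 ms of near-silence occurs
chunks = split_on_silence(
    audio,
    min_silence_len=500,   # pause length that triggers a cut, in ms
    silence_thresh=-40,    # anything below -40 dBFS counts as silence
    keep_silence=250,      # keep some padding so cuts don't sound abrupt
)

for i, chunk in enumerate(chunks):
    chunk.export(f"Data/meimei_en/raw/processed_{i}.wav", format="wav")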

The audio is then transcribed and resampled:

python3 short_audio_transcribe.py
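
The resampling half of this step is conceptually simple; a minimal torchaudio sketch (the 44.1 kHz target rate is an assumption here — use the rate from the project's config):

import torchaudio

wav, sr = torchaudio.load("Data/meimei_en/raw/meimei_en/processed_0.wav")
target_sr = 44100  # assumed target rate; take the real value from config.json
if sr != target_sr:
    wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(wav)
torchaudio.save("Data/meimei_en/raw/meimei_en/processed_0.wav", wav, target_sr)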

Speech recognition here again uses the whisper model, with the medium model selected by default. If video memory is insufficient, you can modify the short_audio_transcribe.py file:

import whisper
import os
import json
import torchaudio
import argparse
import torch
from config import config

lang2token = {
    'zh': "ZH|",
    'ja': "JP|",
    "en": "EN|",
}

def transcribe_one(audio_path):
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")
    lang = max(probs, key=probs.get)
    # decode the audio
    options = whisper.DecodingOptions(beam_size=5)
    result = whisper.decode(model, mel, options)

    # print the recognized text
    print(result.text)
    return lang, result.text

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--languages", default="CJ")
    # lower this default (e.g. "small" or "base") if video memory is tight
    parser.add_argument("--whisper_size", default="medium")
    args = parser.parse_args()
    if args.languages == "CJE":
        lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
            "en": "EN|",
        }
    elif args.languages == "CJ":
        lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
        }
    elif args.languages == "C":
        lang2token = {
            'zh': "ZH|",
        }
    # excerpt truncated: the full script goes on to load the model with
    # model = whisper.load_model(args.whisper_size) — that size argument
    # is what to shrink when memory runs out

The recognized audio files:

Data\meimei_en\raw/meimei_en/processed_0.wav|meimei_en|EN|But these were songs that didn't make it on the album.  
Data\meimei_en\raw/meimei_en/processed_1.wav|meimei_en|EN|because I wanted to save them for the next album. And then it turned out the next album was like a whole different thing. And so they get left behind.  
Data\meimei_en\raw/meimei_en/processed_2.wav|meimei_en|EN|and you always think back on these songs, and you're like.  
Data\meimei_en\raw/meimei_en/processed_3.wav|meimei_en|EN|What would have happened? I wish people could hear this.  
Data\meimei_en\raw/meimei_en/processed_4.wav|meimei_en|EN|but it belongs in that moment in time.  
Data\meimei_en\raw/meimei_en/processed_5.wav|meimei_en|EN|So, now that I get to go back and revisit my old work,  
Data\meimei_en\raw/meimei_en/processed_6.wav|meimei_en|EN|I've dug up those songs.  
Data\meimei_en\raw/meimei_en/processed_7.wav|meimei_en|EN|from the crypt they were in.  
Data\meimei_en\raw/meimei_en/processed_8.wav|meimei_en|EN|And I have like, I've reached out to artists that I love and said, do you want to?  
Data\meimei_en\raw/meimei_en/processed_9.wav|meimei_en|EN|do you want to sing this with me? You know, Phoebe Bridgers is one of my favorite artists.
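
Each line is pipe-separated in the form path|speaker|language|text. A minimal sketch of reading the list back (the filelists path follows the directory tree shown further below):

with open("Data/meimei_en/filelists/short_character_anno.list", encoding="utf-8") as f:
    for line in f:
        path, speaker, lang, text = line.strip().split("|")
        print(f"[{lang}] {speaker}: {text}")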

As you can see, each slice is annotated with the EN language tag for the English character.

Next come text preprocessing and the generation of the emotion, spectrogram, and BERT feature files:

python3 preprocess_text.py   # clean the text and build the train/val lists
python3 emo_gen.py           # generate emotion embeddings (.emo.npy)
python3 spec_gen.py          # generate spectrograms (.spec.pt)
python3 bert_gen.py          # generate BERT feature files (.bert.pt)

After running, view the English training set:

Data\meimei_en\raw/meimei_en/processed_3.wav|meimei_en|EN|What would have happened? I wish people could hear this.|_ w ah t w uh d hh ae V hh ae p ah n d ? ay w ih sh p iy p ah l k uh d hh ih r dh ih s . _|0 0 2 0 0 2 0 0 2 0 0 2 0 1 0 0 0 2 0 2 0 0 2 0 1 0 0 2 0 0 2 0 0 2 0 0 0|1 3 3 3 6 1 1 3 5 3 3 3 1 1  
Data\meimei_en\raw/meimei_en/processed_6.wav|meimei_en|EN|I've dug up those songs.|_ ay V d ah g ah p dh ow z s ao ng z . _|0 2 0 0 2 0 2 0 0 2 0 0 2 0 0 0 0|1 1 1 0 3 2 3 4 1 1  
Data\meimei_en\raw/meimei_en/processed_5.wav|meimei_en|EN|So, now that I get to go back and revisit my old work,|_ s ow , n aw dh ae t ay g eh t t uw g ow b ae k ae n d r iy V ih z ih t m ay ow l d w er k , _|0 0 2 0 0 2 0 2 0 2 0 2 0 0 2 0 2 0 2 0 2 0 0 0 1 0 2 0 1 0 0 2 2 0 0 0 2 0 0 0|1 2 1 2 3 1 3 2 2 3 3 7 2 3 3 1 1  
Data\meimei_en\raw/meimei_en/processed_1.wav|meimei_en|EN|because I wanted to save them for the next album. And then it turned out the next album was like a whole different thing. And so they get left behind.|_ b ih k ao z ay w aa n t ah d t uw s ey V dh eh m f ao r dh ah n eh k s t ae l b ah m . ae n d dh eh n ih t t er n d aw t dh ah n eh k s t ae l b ah m w aa z l ay k ah hh ow l d ih f er ah n t th ih ng . ae n d s ow dh ey g eh t l eh f t b ih hh ay n d . _|0 0 1 0 2 0 2 0 2 0 0 1 0 0 2 0 2 0 0 2 0 0 2 0 0 1 0 2 0 0 0 2 0 0 1 0 0 2 0 0 0 2 0 2 0 0 2 0 0 2 0 0 1 0 2 0 0 0 2 0 0 1 0 0 2 0 0 2 0 1 0 2 0 0 2 0 1 1 0 0 0 2 0 0 2 0 0 0 2 0 2 0 2 0 0 2 0 0 0 1 0 2 0 0 0 0|1 5 1 6 2 3 3 3 2 5 5 1 3 3 2 4 2 2 5 5 3 3 1 3 7 3 1 3 2 2 3 4 6 1 1  
Data\meimei_en\raw/meimei_en/processed_2.wav|meimei_en|EN|and you always think back on these songs, and you're like.|_ ae n d y uw ao l w ey z th ih ng k b ae k aa n dh iy z s ao ng z , ae n d y uh r l ay k . _|0 2 0 0 0 2 2 0 0 3 0 0 2 0 0 0 2 0 2 0 0 2 0 0 2 0 0 0 2 0 0 0 2 0 0 2 0 0 0|1 3 2 5 4 3 2 3 4 1 3 1 1 1 3 1 1
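
Each cleaned line extends the annotation format with three extra columns: the phoneme sequence, a tone/stress value per phoneme, and a word-to-phoneme count mapping. A minimal sketch of unpacking one line, with the consistency invariant that ties the columns together:

with open("Data/meimei_en/filelists/train.list", encoding="utf-8") as f:
    line = f.readline()

path, speaker, lang, text, phones, tones, word2ph = line.strip().split("|")
phones = phones.split(" ")                        # phoneme sequence
tones = [int(t) for t in tones.split(" ")]        # stress/tone per phoneme
word2ph = [int(n) for n in word2ph.split(" ")]    # phonemes contributed per word
assert len(phones) == len(tones) == sum(word2ph)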

At this point, the English data set has been processed.

Bert-vits2 English model training

Then run the training script:

python3 train_ms.py

This trains the English model locally.

Note that Chinese and English models usually need to be trained separately; in other words, English and Chinese training sets cannot be mixed in a single training run.

There are significant differences between Chinese and English in terms of language structure, vocabulary and grammar. Chinese uses Chinese characters as the basic unit, while English uses letters as the basic unit. Chinese sentence structure and word order are also different from English. Therefore, Chinese models and English models require different processing methods and model architectures when learning language features and patterns.

Chinese and English text data are encoded differently. Chinese usually uses Unicode encoding, while English uses ASCII or Unicode encoding. This results in differences in the representation of Chinese and English text data. In mixed training, the encoding and processing methods of Chinese and English text data need to be unified, otherwise it will lead to inconsistencies and errors in the model training process.
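
A quick illustration of the representation difference: every ASCII letter occupies one byte in UTF-8, while every Chinese character occupies three:

# "album" is five ASCII letters -> five bytes; "专辑" is two characters -> six bytes
print(len("album".encode("utf-8")))  # 5
print(len("专辑".encode("utf-8")))   # 6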

Therefore, the so-called mix mode of Bert-vits2 refers only to inference, not training. And although data sets cannot be mixed for training, it is still possible to run multiple processes and train the Chinese and English models concurrently, as sketched below.
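
A minimal sketch of such concurrent training, assuming two separate checkouts of the project (one configured per character) and enough GPU memory for both processes; the paths are hypothetical:

import subprocess

# launch the Chinese and English trainings as independent processes
procs = [
    subprocess.Popen(["python3", "train_ms.py"], cwd="/path/to/Bert-VITS2_cn"),
    subprocess.Popen(["python3", "train_ms.py"], cwd="/path/to/Bert-VITS2_en"),
]
for p in procs:
    p.wait()  # block until both runs finish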

Bert-vits2 mixed inference between Chinese and English models

After the English model has finished training (here "finished" often just means running 50 steps to check the effect), place the trained Chinese model in the Data directory as well. For training the Chinese model, see the author's earlier article "Local training, available immediately: 30 seconds of audio reproduce the sound of Swift speaking Chinese, based on Bert-VITS2 V2.0.2"; due to space limitations, the details are not repeated here.

The model structure is as follows:

E:\work\Bert-VITS2-v21_demo\Data>tree /f  
Folder PATH listing for volume myssd  
Volume serial number is 7CE3-15AE  
E:.  
├───meimei_cn  
│   │   config.json  
│   │   config.yml  
│   │  
│   ├───filelists  
│   │       cleaned.list  
│   │       short_character_anno.list  
│   │       train.list  
│   │       val.list  
│   │  
│   ├───models  
│   │       G_50.pth  
│   │  
│   └───raw  
│       └───meimei  
│               meimei_0.wav  
│               meimei_1.wav  
│               meimei_2.wav  
│               meimei_3.wav  
│               meimei_4.wav  
│               meimei_5.wav  
│               meimei_6.wav  
│               meimei_7.wav  
│               meimei_8.wav  
│               meimei_9.wav  
│               processed_0.bert.pt  
│               processed_0.emo.npy  
│               processed_0.spec.pt  
│               processed_0.wav  
│               processed_1.bert.pt  
│               processed_1.emo.npy  
│               processed_1.spec.pt  
│               processed_1.wav  
│               processed_2.bert.pt  
│               processed_2.emo.npy  
│               processed_2.spec.pt  
│               processed_2.wav  
│               processed_3.bert.pt  
│               processed_3.emo.npy  
│               processed_3.spec.pt  
│               processed_3.wav  
│               processed_4.bert.pt  
│               processed_4.emo.npy  
│               processed_4.spec.pt  
│               processed_4.wav  
│               processed_5.bert.pt  
│               processed_5.emo.npy  
│               processed_5.spec.pt  
│               processed_5.wav  
│               processed_6.bert.pt  
│               processed_6.emo.npy  
│               processed_6.spec.pt  
│               processed_6.wav  
│               processed_7.bert.pt  
│               processed_7.emo.npy  
│               processed_7.spec.pt  
│               processed_7.wav  
│               processed_8.bert.pt  
│               processed_8.emo.npy  
│               processed_8.spec.pt  
│               processed_8.wav  
│               processed_9.bert.pt  
│               processed_9.emo.npy  
│               processed_9.spec.pt  
│               processed_9.wav  
│  
└───meimei_en  
    │   config.json  
    │   config.yml  
    │  
    ├───filelists  
    │       cleaned.list  
    │       short_character_anno.list  
    │       train.list  
    │       val.list  
    │  
    ├───models  
    │   │   DUR_0.pth  
    │   │   DUR_50.pth  
    │   │   D_0.pth  
    │   │   D_50.pth  
    │   │   events.out.tfevents.1701484053.ly.16484.0  
    │   │   events.out.tfevents.1701620324.ly.10636.0  
    │   │   G_0.pth  
    │   │   G_50.pth  
    │   │   train.log  
    │   │  
    │   └───eval  
    │           events.out.tfevents.1701484053.ly.16484.1  
    │           events.out.tfevents.1701620324.ly.10636.1  
    │  
    └───raw  
        └───meimei_en  
                meimei_en_0.wav  
                meimei_en_1.wav  
                meimei_en_2.wav  
                meimei_en_3.wav  
                meimei_en_4.wav  
                meimei_en_5.wav  
                meimei_en_6.wav  
                meimei_en_7.wav  
                meimei_en_8.wav  
                meimei_en_9.wav  
                processed_0.bert.pt  
                processed_0.emo.npy  
                processed_0.wav  
                processed_1.bert.pt  
                processed_1.emo.npy  
                processed_1.spec.pt  
                processed_1.wav  
                processed_2.bert.pt  
                processed_2.emo.npy  
                processed_2.spec.pt  
                processed_2.wav  
                processed_3.bert.pt  
                processed_3.emo.npy  
                processed_3.spec.pt  
                processed_3.wav  
                processed_4.bert.pt  
                processed_4.emo.npy  
                processed_4.wav  
                processed_5.bert.pt  
                processed_5.emo.npy  
                processed_5.spec.pt  
                processed_5.wav  
                processed_6.bert.pt  
                processed_6.emo.npy  
                processed_6.spec.pt  
                processed_6.wav  
                processed_7.bert.pt  
                processed_7.emo.npy  
                processed_7.wav  
                processed_8.bert.pt  
                processed_8.emo.npy  
                processed_8.wav  
                processed_9.bert.pt  
                processed_9.emo.npy  
                processed_9.wav

Here meimei_cn represents the Chinese character model, and meimei_en represents the English character model. Each has only been trained for 50 steps.

Start the inference service:

python3 webui.py

Visit http://127.0.0.1:7860/ and enter the following in the text box:

[meimei_cn]<zh>但这些歌曲没进入专辑因为想留着他们下一张专辑用,然後下一張專輯完全不同所以他們被拋在了後面  
[meimei_en]<en>But these were songs that didn't make it on the album.  
because I wanted to save them for the next album. And then it turned out the next album was like a whole different thing. And so they get left behind.

Then set the language to mix.

Here, each piece of text is tagged with a [character] and a <language> marker, which lets the system pick the corresponding Chinese or English model for each segment.
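
For illustration, a minimal sketch of how this markup could be pulled apart with a regular expression (hypothetical — webui.py has its own parser):

import re

text = ("[meimei_cn]<zh>但这些歌曲没进入专辑"
        "[meimei_en]<en>But these were songs that didn't make it on the album.")

# capture (character, language, segment) triples from the markup
pattern = re.compile(r"\[(?P<speaker>[^\]]+)\]<(?P<lang>[^>]+)>(?P<seg>[^\[]*)")
for m in pattern.finditer(text):
    print(m.group("speaker"), m.group("lang"), m.group("seg"))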

If there is only one English model and one Chinese model locally, you can also choose auto mode to perform automatic mixed Chinese and English inference:

但这些歌曲没进入专辑因为想留着他们下一张专辑用,然後下一張專輯完全不同所以他們被拋在了後面  
But these were songs that didn't make it on the album.  
because I wanted to save them for the next album. And then it turned out the next album was like a whole different thing. And so they get left behind.

The system will automatically detect the language of the text and select the corresponding model for inference.
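
A minimal sketch of the kind of per-line detection involved, using a crude Unicode-range heuristic (an assumption — the project may well use a proper language-detection library):

def detect_lang(segment: str) -> str:
    # any CJK unified ideograph marks the segment as Chinese
    if any("\u4e00" <= ch <= "\u9fff" for ch in segment):
        return "zh"
    return "en"

for seg in ["但这些歌曲没进入专辑",
            "But these were songs that didn't make it on the album."]:
    print(detect_lang(seg), "->", seg)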

Conclusion

In tasks such as translating technical articles or videos and cross-language information retrieval, conversion and alignment between Chinese and English must be handled. Mixed Chinese and English inference with Bert-vits2 makes these tasks easier to handle and produces more accurate and coherent results. The Bert-vits2 mixed Chinese-English inference integration package is available at:

https://pan.baidu.com/s/1iaC7f1GPXevDrDMCRCs8uQ?pwd=v3uc
