Mixed Chinese and English output is a very common requirement in text-to-speech (TTS) projects, especially for technical articles and technical videos, where Chinese text is almost always interspersed with English words. Naturally, we do not want the AI narration to be able to read only Chinese. Older versions of Bert-vits2 (below 2.0) do not support English training and inference, but after the base model was updated, versions from V2.0 onward support a mixed Chinese-English inference (mix) mode.
Let’s take Swift as an example:
https://www.bilibili.com/video/BV1bB4y1R7Nu/
A 30-second audio clip of Swift speaking English:
Bert-vits2 English material processing
First clone the project:
git clone https://github.com/v3ucn/Bert-VITS2_V210.git
Install dependencies:
pip3 install -r requirements.txt
Put the audio material into the Data/meimei_en/raw directory; the en suffix marks this as the English speaker.
Then split the material:
python3 audio_slicer.py
Then transcribe and resample the audio:
python3 short_audio_transcribe.py
Speech recognition again uses the Whisper model, with the medium size selected by default. If you do not have enough video memory, pass a smaller size via --whisper_size or change the default in short_audio_transcribe.py:
import whisper
import os
import json
import torchaudio
import argparse
import torch
from config import config

lang2token = {
    'zh': "ZH|",
    'ja': "JP|",
    "en": "EN|",
}


def transcribe_one(audio_path):
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")
    lang = max(probs, key=probs.get)
    # decode the audio
    options = whisper.DecodingOptions(beam_size=5)
    result = whisper.decode(model, mel, options)
    # print the recognized text
    print(result.text)
    return lang, result.text


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--languages", default="CJ")
    parser.add_argument("--whisper_size", default="medium")
    args = parser.parse_args()
    if args.languages == "CJE":
        lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
            "en": "EN|",
        }
    elif args.languages == "CJ":
        lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
        }
    elif args.languages == "C":
        lang2token = {
            'zh': "ZH|",
        }
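If it is unclear which size to pass to --whisper_size, one option is to choose it from the amount of free video memory. The sketch below is not part of short_audio_transcribe.py: pick_whisper_size is a hypothetical helper, and the thresholds are the approximate VRAM figures quoted in the Whisper README, so treat them as rough guidance only.

```python
# Approximate VRAM needed per Whisper model size, in GB, largest first.
# Rough figures from the Whisper README; an assumption, not measured here.
APPROX_VRAM_GB = [
    ("medium", 5.0),
    ("small", 2.0),
    ("base", 1.0),
    ("tiny", 1.0),
]


def pick_whisper_size(free_vram_gb, headroom_gb=1.0):
    """Return the largest listed size that fits while leaving some headroom."""
    for size, needed_gb in APPROX_VRAM_GB:
        if free_vram_gb >= needed_gb + headroom_gb:
            return size
    # Nothing fits comfortably: fall back to the smallest model.
    return "tiny"
```

The returned name can then be passed on the command line as the value of --whisper_size.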
The resulting transcription file:
Data\meimei_en\raw/meimei_en/processed_0.wav|meimei_en|EN|But these were songs that didn't make it on the album.
Data\meimei_en\raw/meimei_en/processed_1.wav|meimei_en|EN|because I wanted to save them for the next album. And then it turned out the next album was like a whole different thing. And so they get left behind.
Data\meimei_en\raw/meimei_en/processed_2.wav|meimei_en|EN|and you always think back on these songs, and you're like.
Data\meimei_en\raw/meimei_en/processed_3.wav|meimei_en|EN|What would have happened? I wish people could hear this.
Data\meimei_en\raw/meimei_en/processed_4.wav|meimei_en|EN|but it belongs in that moment in time.
Data\meimei_en\raw/meimei_en/processed_5.wav|meimei_en|EN|So, now that I get to go back and revisit my old work,
Data\meimei_en\raw/meimei_en/processed_6.wav|meimei_en|EN|I've dug up those songs.
Data\meimei_en\raw/meimei_en/processed_7.wav|meimei_en|EN|from the crypt they were in.
Data\meimei_en\raw/meimei_en/processed_8.wav|meimei_en|EN|And I have like, I've reached out to artists that I love and said, do you want to?
Data\meimei_en\raw/meimei_en/processed_9.wav|meimei_en|EN|do you want to sing this with me? You know, Phoebe Bridgers is one of my favorite artists.
As you can see, each slice is labelled with the speaker name and the EN language tag.
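The line layout can be read off the listing above: audio path, speaker name, language tag and transcript, separated by "|". The code that writes these lines is not shown in the script excerpt, so the following is only a minimal sketch inferred from the lang2token table and the output; Utterance, format_line and parse_line are hypothetical names.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    wav_path: str
    speaker: str
    language: str
    text: str


def format_line(wav_path, speaker, lang_token, text):
    # lang_token already ends with "|" (for example "EN|"), as in lang2token.
    return f"{wav_path}|{speaker}|{lang_token}{text}"


def parse_line(line):
    # Split on the first three "|" only, so a "|" inside the text survives.
    wav_path, speaker, language, text = line.rstrip("\n").split("|", 3)
    return Utterance(wav_path, speaker, language, text)
```

Parsing the file this way makes it easy to spot slices where Whisper detected an unexpected language before moving on to preprocessing.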
Next, annotate the text and generate the emotion, spectrogram and BERT feature files:
python3 preprocess_text.py
python3 emo_gen.py
python3 spec_gen.py
python3 bert_gen.py
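Since the four scripts have to run in this order, it can be convenient to chain them and stop at the first failure. The following is a small sketch of that idea using the script names above; run_pipeline and its runner parameter are hypothetical and not part of the project.

```python
import subprocess
import sys

# Script names taken from the steps above, in the order they are run.
STEPS = ["preprocess_text.py", "emo_gen.py", "spec_gen.py", "bert_gen.py"]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each script in order; check=True raises on the first failure."""
    for script in steps:
        print(f"running {script}")
        runner([sys.executable, script], check=True)
```

Calling run_pipeline() from the project root runs the steps with the current Python interpreter. The runner parameter exists only so the function can be exercised without launching the real scripts.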
After running, view the English training set:
Data\meimei_en\raw/meimei_en/processed_3.wav|meimei_en|EN|What would have happened? I wish people could hear this.|_ w ah t w uh d hh ae V hh ae p ah n d ? ay w ih sh p iy p ah l k uh d hh ih r dh ih s . _|0 0 2 0 0 2 0 0 2 0 0 2 0 1 0 0 0 2 0 2 0 0 2 0 1 0 0 2 0 0 2 0 0 2 0 0 0|1 3 3 3 6 1 1 3 5 3 3 3 1 1
Data\meimei_en\raw/meimei_en/processed_6.wav|meimei_en|EN|I've dug up those songs.|_ ay V d ah g ah p dh ow z s ao ng z . _|0 2 0 0 2 0 2 0 0 2 0 0 2 0 0 0 0|1 1 1 0 3 2 3 4 1 1
Data\meimei_en\raw/meimei_en/processed_5.wav|meimei_en|EN|So, now that I get to go back and revisit my old work,|_ s ow , n aw dh ae t ay g eh t t uw g ow b ae k ae n d r iy V ih z ih t m ay ow l d w er k , _|0 0 2 0 0 2 0 2 0 2 0 2 0 0 2 0 2 0 2 0 2 0 0 0 1 0 2 0 1 0 0 2 2 0 0 0 2 0 0 0|1 2 1 2 3 1 3 2 2 3 3 7 2 3 3 1 1
Data\meimei_en\raw/meimei_en/processed_1.wav|meimei_en|EN|because I wanted to save them for the next album. And then it turned out the next album was like a whole different thing. And so they get left behind.|_ b ih k ao z ay w aa n t ah d t uw s ey V dh eh m f ao r dh ah n eh k s t ae l b ah m . ae n d dh eh n ih t t er n d aw t dh ah n eh k s t ae l b ah m w aa z l ay k ah hh ow l d ih f er ah n t th ih ng . ae n d s ow dh ey g eh t l eh f t b ih hh ay n d . _|0 0 1 0 2 0 2 0 2 0 0 1 0 0 2 0 2 0 0 2 0 0 2 0 0 1 0 2 0 0 0 2 0 0 1 0 0 2 0 0 0 2 0 2 0 0 2 0 0 2 0 0 1 0 2 0 0 0 2 0 0 1 0 0 2 0 0 2 0 1 0 2 0 0 2 0 1 1 0 0 0 2 0 0 2 0 0 0 2 0 2 0 2 0 0 2 0 0 0 1 0 2 0 0 0 0|1 5 1 6 2 3 3 3 2 5 5 1 3 3 2 4 2 2 5 5 3 3 1 3 7 3 1 3 2 2 3 4 6 1 1
Data\meimei_en\raw/meimei_en/processed_2.wav|meimei_en|EN|and you always think back on these songs, and you're like.|_ ae n d y uw ao l w ey z th ih ng k b ae k aa n dh iy z s ao ng z , ae n d y uh r l ay k . _|0 2 0 0 0 2 2 0 0 3 0 0 2 0 0 0 2 0 2 0 0 2 0 0 2 0 0 0 2 0 0 0 2 0 0 2 0 0 0|1 3 2 5 4 3 2 3 4 1 3 1 1 1 3 1 1
At this point, the English data set has been processed.
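Each line of the training set above has seven "|"-separated fields: audio path, speaker, language, text, phonemes, tones and a final list of counts. In the samples, the number of tones equals the number of phonemes, and the final counts sum to the same number. A quick sanity check along these lines might look as follows; the field names, in particular word2ph, are assumptions made for this sketch.

```python
def parse_cleaned_line(line):
    """Split one training-set line into its seven fields."""
    wav_path, speaker, language, rest = line.rstrip("\n").split("|", 3)
    # Take the last three fields from the right so a "|" in the text survives.
    text, phones, tones, word2ph = rest.rsplit("|", 3)
    return {
        "wav_path": wav_path,
        "speaker": speaker,
        "language": language,
        "text": text,
        "phones": phones.split(),
        "tones": [int(t) for t in tones.split()],
        "word2ph": [int(n) for n in word2ph.split()],
    }


def is_consistent(entry):
    """One tone per phoneme, and the per-token counts cover every phoneme."""
    n = len(entry["phones"])
    return len(entry["tones"]) == n and sum(entry["word2ph"]) == n
```

Running is_consistent over every line of the list file is a cheap way to catch a truncated or hand-edited entry before training starts.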
Bert-vits2 English model training
Then run the training script:
python3 train_ms.py
This trains the English model locally.
Note that Chinese and English models usually need to be trained separately. In other words, English and Chinese training sets cannot be mixed in a single training run.
Chinese and English differ significantly in language structure, vocabulary and grammar. Chinese is written with characters as its basic unit, while English is written with letters, and sentence structure and word order also differ. Chinese and English models therefore require different processing methods and model architectures when learning language features and patterns.
The text data is also represented differently. Chinese text requires a Unicode encoding, while English text can be stored as ASCII or Unicode. In mixed training, the encoding and processing of both languages would have to be unified, otherwise the training process becomes inconsistent and error-prone.
Therefore, the so-called mix mode of Bert-vits2 applies only to inference, not to training. Although the datasets cannot be mixed for training, you can still start multiple processes to train the Chinese and English models concurrently.
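Concurrent training can be as simple as starting one train_ms.py process per model. How each process selects its dataset is not covered here, so the sketch below assumes each model has its own copy of the project whose configuration points at the right dataset. launch_training and wait_all are hypothetical helpers, and the directory names in the usage note are placeholders.

```python
import subprocess
import sys


def launch_training(project_dirs, script="train_ms.py"):
    """Start one training process per project directory without blocking."""
    return [
        subprocess.Popen([sys.executable, script], cwd=str(project_dir))
        for project_dir in project_dirs
    ]


def wait_all(processes):
    """Block until every process exits and return the exit codes in order."""
    return [process.wait() for process in processes]
```

For example, wait_all(launch_training(["Bert-VITS2_cn", "Bert-VITS2_en"])) would train both models side by side. Keep in mind that the two processes share the same GPU memory.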
Bert-vits2 mixed inference between Chinese and English models
Once the English model has been trained (in practice, "trained" here often just means running 50 steps to check the result), place the Chinese model in the Data directory as well. For training the Chinese model, see the earlier article "Local training, ready out of the box: reproducing Swift speaking Chinese from 30 seconds of audio with Bert-VITS2 V2.0.2"; for reasons of space it is not repeated here.
The directory structure of the two models is as follows:
E:\work\Bert-VITS2-v21_demo\Data>tree /f
Folder PATH listing for volume myssd
Volume serial number is 7CE3-15AE
E:.
├───meimei_cn
│   │   config.json
│   │   config.yml
│   │
│   ├───filelists
│   │       cleaned.list
│   │       short_character_anno.list
│   │       train.list
│   │       val.list
│   │
│   ├───models
│   │       G_50.pth
│   │
│   └───raw
│       └───meimei
│               meimei_0.wav
│               meimei_1.wav
│               meimei_2.wav
│               meimei_3.wav
│               meimei_4.wav
│               meimei_5.wav
│               meimei_6.wav
│               meimei_7.wav
│               meimei_8.wav
│               meimei_9.wav
│               processed_0.bert.pt
│               processed_0.emo.npy
│               processed_0.spec.pt
│               processed_0.wav
│               processed_1.bert.pt
│               processed_1.emo.npy
│               processed_1.spec.pt
│               processed_1.wav
│               processed_2.bert.pt
│               processed_2.emo.npy
│               processed_2.spec.pt
│               processed_2.wav
│               processed_3.bert.pt
│               processed_3.emo.npy
│               processed_3.spec.pt
│               processed_3.wav
│               processed_4.bert.pt
│               processed_4.emo.npy
│               processed_4.spec.pt
│               processed_4.wav
│               processed_5.bert.pt
│               processed_5.emo.npy
│               processed_5.spec.pt
│               processed_5.wav
│               processed_6.bert.pt
│               processed_6.emo.npy
│               processed_6.spec.pt
│               processed_6.wav
│               processed_7.bert.pt
│               processed_7.emo.npy
│               processed_7.spec.pt
│               processed_7.wav
│               processed_8.bert.pt
│               processed_8.emo.npy
│               processed_8.spec.pt
│               processed_8.wav
│               processed_9.bert.pt
│               processed_9.emo.npy
│               processed_9.spec.pt
│               processed_9.wav
│
└───meimei_en
    │   config.json
    │   config.yml
    │
    ├───filelists
    │       cleaned.list
    │       short_character_anno.list
    │       train.list
    │       val.list
    │
    ├───models
    │   │   DUR_0.pth
    │   │   DUR_50.pth
    │   │   D_0.pth
    │   │   D_50.pth
    │   │   events.out.tfevents.1701484053.ly.16484.0
    │   │   events.out.tfevents.1701620324.ly.10636.0
    │   │   G_0.pth
    │   │   G_50.pth
    │   │   train.log
    │   │
    │   └───eval
    │           events.out.tfevents.1701484053.ly.16484.1
    │           events.out.tfevents.1701620324.ly.10636.1
    │
    └───raw
        └───meimei_en
                meimei_en_0.wav
                meimei_en_1.wav
                meimei_en_2.wav
                meimei_en_3.wav
                meimei_en_4.wav
                meimei_en_5.wav
                meimei_en_6.wav
                meimei_en_7.wav
                meimei_en_8.wav
                meimei_en_9.wav
                processed_0.bert.pt
                processed_0.emo.npy
                processed_0.wav
                processed_1.bert.pt
                processed_1.emo.npy
                processed_1.spec.pt
                processed_1.wav
                processed_2.bert.pt
                processed_2.emo.npy
                processed_2.spec.pt
                processed_2.wav
                processed_3.bert.pt
                processed_3.emo.npy
                processed_3.spec.pt
                processed_3.wav
                processed_4.bert.pt
                processed_4.emo.npy
                processed_4.wav
                processed_5.bert.pt
                processed_5.emo.npy
                processed_5.spec.pt
                processed_5.wav
                processed_6.bert.pt
                processed_6.emo.npy
                processed_6.spec.pt
                processed_6.wav
                processed_7.bert.pt
                processed_7.emo.npy
                processed_7.wav
                processed_8.bert.pt
                processed_8.emo.npy
                processed_8.wav
                processed_9.bert.pt
                processed_9.emo.npy
                processed_9.wav
Here meimei_cn is the Chinese speaker model and meimei_en is the English speaker model. Each has been trained for only 50 steps.
Start the inference service:
python3 webui.py
Visit http://127.0.0.1:7860/ and enter in the text box:
[meimei_cn]<zh>但这些歌曲没进入专辑因为想留着他们下一张专辑用,然後下一張專輯完全不同所以他們被拋在了後面
[meimei_en]<en>But these were songs that didn't make it on the album.
because I wanted to save them for the next album. And then it turned out the next album was like a whole different thing. And so they get left behind.
Then set the language to mix.
Here the text is tagged with [speaker] and <language>, so that the system can select the corresponding Chinese or English model for each segment.
If there is only one English model and one Chinese model locally, you can also choose the auto mode to run mixed Chinese-English inference automatically:
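To make the tagging format concrete, here is a minimal sketch that splits such input into (speaker, language, text) segments. It only illustrates the [speaker]<language> convention shown above and is not the parsing code used by webui.py; parse_mix is a hypothetical name.

```python
import re

# A tag looks like "[speaker]<lang>", for example "[meimei_en]<en>".
TAG = re.compile(r"\[([^\[\]]+)\]<([A-Za-z]+)>")


def parse_mix(text):
    """Return (speaker, language, text) for each tagged segment, in order."""
    segments = []
    matches = list(TAG.finditer(text))
    for i, match in enumerate(matches):
        # A segment runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        if body:
            segments.append((match.group(1), match.group(2).lower(), body))
    return segments
```

Untagged lines that follow a tag, such as the second English line in the example, stay with the preceding segment.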
但这些歌曲没进入专辑因为想留着他们下一张专辑用,然後下一張專輯完全不同所以他們被拋在了後面
But these were songs that didn't make it on the album.
because I wanted to save them for the next album. And then it turned out the next album was like a whole different thing. And so they get left behind.
The system will automatically detect the language of the text and select the corresponding model for inference.
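As a rough illustration of what automatic detection involves, the sketch below splits text into Chinese and English runs by checking for CJK ideographs and Latin letters. This is a simplified stand-in, not the detector Bert-vits2 actually uses, and split_by_language is a hypothetical name.

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")      # CJK unified ideographs
LATIN = re.compile(r"[A-Za-z]")
PIECE = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+")


def split_by_language(text):
    """Split text into (language, chunk) runs, language being "zh" or "en"."""
    runs = []      # list of [language, chunk]
    pending = ""   # punctuation or whitespace seen before the first run
    for piece in PIECE.findall(text):
        if HAN.search(piece):
            lang = "zh"
        elif LATIN.search(piece):
            lang = "en"
        else:
            lang = None  # punctuation, digits or whitespace only
        if lang is None:
            # Neutral text stays with the run before it, if there is one.
            if runs:
                runs[-1][1] += piece
            else:
                pending += piece
        elif runs and runs[-1][0] == lang:
            runs[-1][1] += piece
        else:
            runs.append([lang, pending + piece])
            pending = ""
    return [(lang, chunk.strip()) for lang, chunk in runs]
```

Each run could then be routed to the model trained for that language, which is the idea behind the auto mode.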
Conclusion
Tasks such as translating technical articles or videos and cross-language information retrieval require converting and aligning Chinese and English content. Mixed Chinese-English inference with Bert-vits2 handles such tasks more effectively and gives more accurate and coherent results. The Bert-vits2 mixed Chinese-English inference integration package is available at:
https://pan.baidu.com/s/1iaC7f1GPXevDrDMCRCs8uQ?pwd=v3uc