Conventional wisdom holds that the training stage of deep learning belongs in the cloud, since local hardware is limited. In practice, however, for speech recognition and natural language processing, even a relatively small amount of data can train a high-performing model. Students on a tight budget need not spend money on the cloud. This time we will demonstrate how to train the Bert-VITS2 V2.0.2 model locally.
Bert-VITS2 V2.0.2 training on existing datasets
Currently, Bert-VITS2 V2.0.2 can generally be trained in two ways. The first uses an existing dataset, namely the already-annotated voice data for each Genshin Impact character. This data is public but may not be used commercially. It can be downloaded here:
https://pan.ai-hobbyist.org/Genshin%20Datasets/%E4%B8%AD%E6%96%87%20-%20Chinese/%E5%88%86%E8%A7%92%E8%89%B2%20-%20Single/%E8%A7%92%E8%89%B2%E8%AF%AD%E9%9F%B3%20-%20Character
We only need to download the character we want:
The second applies when no dataset exists, for example when we want to clone the voice of an arbitrary person. In that case we need to collect voice recordings of that person and build the dataset ourselves.
This time we will only demonstrate the first method, training a Genshin Impact character from the existing dataset. The second method will not be covered for now.
Bert-VITS2 V2.0.2 model configuration
First clone the project:
git clone https://github.com/v3ucn/Bert-VITS2_V202_Train.git
Then download the new version of the bert model:
Link: https://pan.baidu.com/s/11vLNEVDeP_8YhYIJUjcUeg?pwd=v3uc
After the download completes, unzip it and place it in the project's bert directory. The directory structure is as follows:
E:\work\Bert-VITS2-v202\bert>tree /f
Folder PATH listing for volume myssd
Volume serial number is 7CE3-15AE
E:.
│ bert_models.json
│
├───bert-base-japanese-v3
│ config.json
│ README.md
│ tokenizer_config.json
│ vocab.txt
│
├───bert-large-japanese-v2
│ config.json
│ README.md
│ tokenizer_config.json
│ vocab.txt
│
├───chinese-roberta-wwm-ext-large
│ added_tokens.json
│ config.json
│ pytorch_model.bin
│ README.md
│ special_tokens_map.json
│ tokenizer.json
│ tokenizer_config.json
│ vocab.txt
│
├───deberta-v2-large-japanese
│ config.json
│ pytorch_model.bin
│ README.md
│ special_tokens_map.json
│ tokenizer.json
│ tokenizer_config.json
│
└───deberta-v3-large
config.json
generator_config.json
pytorch_model.bin
README.md
spm.model
tokenizer_config.json
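Before moving on, it can help to confirm that every sub-directory from the listing above is actually in place. The following is a minimal sketch, not part of the project: the helper name missing_bert_dirs is hypothetical, and it checks directory names only, not the individual files inside them.

```python
from pathlib import Path

# Sub-directory names copied from the tree listing above.
EXPECTED_BERT_DIRS = [
    "bert-base-japanese-v3",
    "bert-large-japanese-v2",
    "chinese-roberta-wwm-ext-large",
    "deberta-v2-large-japanese",
    "deberta-v3-large",
]


def missing_bert_dirs(bert_root):
    """Return the expected sub-directories that are absent under bert_root.

    Hypothetical helper for illustration; it is not part of Bert-VITS2.
    """
    root = Path(bert_root)
    return [name for name in EXPECTED_BERT_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "bert" is the project-relative directory; adjust it to your checkout.
    missing = missing_bert_dirs("bert")
    print("missing:", missing if missing else "none")
```

An empty result means all five model directories were found; any name printed still needs to be unzipped into the bert directory.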
Then download the pre-trained model:
https://openi.pcl.ac.cn/Stardust_minus/Bert-VITS2/modelmanage/model_readme_tmpl?name=Bert-VITS2%E4%B8%AD%E6%97%A5%E8%8B%B1%E5%BA%95%E6%A8%A1-fix
Place it in the project's pretrained_models directory as follows:
E:\work\Bert-VITS2-v202\pretrained_models>tree /f
Folder PATH listing for volume myssd
Volume serial number is 7CE3-15AE
E:.
DUR_0.pth
D_0.pth
G_0.pth
Then put the downloaded character dataset (Keqing is used in this walkthrough) into the raw directory under the project's Data directory:
E:\work\Bert-VITS2-v202\Data\keqing\raw\keqing>tree /f
Folder PATH listing for volume myssd
Volume serial number is 7CE3-15AE
E:.
vo_card_keqing_endOfGame_fail_01.lab
vo_card_keqing_endOfGame_fail_01.wav
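As the listing shows, each .wav clip is accompanied by a .lab file with the same name. A quick way to spot incomplete pairs is sketched below; the function name unpaired_clips is hypothetical, and the assumption that every clip needs a matching .lab file is inferred from the listing rather than taken from the project's documentation.

```python
from pathlib import Path


def unpaired_clips(raw_dir):
    """Return (wav stems without a .lab, lab stems without a .wav).

    Hypothetical helper; assumes one .lab transcript per .wav clip.
    """
    raw = Path(raw_dir)
    wavs = {p.stem for p in raw.glob("*.wav")}
    labs = {p.stem for p in raw.glob("*.lab")}
    return sorted(wavs - labs), sorted(labs - wavs)


if __name__ == "__main__":
    # Path taken from the listing above; adjust it for another character.
    raw_dir = Path("Data/keqing/raw/keqing")
    if raw_dir.is_dir():
        no_lab, no_wav = unpaired_clips(raw_dir)
        print("wav without lab:", no_lab)
        print("lab without wav:", no_wav)
    else:
        print("directory not found:", raw_dir)
```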
If you want to customize the directory structure, you can modify the config.yml file:
bert_gen:
  config_path: config.json
  device: cuda
  num_processes: 2
  use_multi_device: false
dataset_path: Data\keqing
mirror: ''
openi_token: ''
preprocess_text:
  clean: true
  cleaned_path: filelists/cleaned.list
  config_path: config.json
  max_val_total: 8
  train_path: filelists/train.list
  transcription_path: filelists/short_character_anno.list
  val_path: filelists/val.list
  val_per_spk: 5
resample:
  in_dir: raw
  out_dir: raw
  sampling_rate: 44100
At this point, the model and data set are configured.
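Judging from the directory listings in this article, the relative paths in config.yml (raw, filelists/train.list and so on) appear to be resolved against dataset_path. That is an inference from the listings, not something confirmed from the project's source. Under that assumption, the resolution can be sketched as follows, with resolve being a hypothetical helper:

```python
from pathlib import PureWindowsPath


def resolve(dataset_path, relative):
    """Join a config.yml relative path onto dataset_path.

    Hypothetical helper. PureWindowsPath is used so the result matches the
    Windows-style paths in this article regardless of the host system.
    """
    return PureWindowsPath(dataset_path) / relative


if __name__ == "__main__":
    for rel in ("raw", "filelists/train.list", "filelists/val.list"):
        print(resolve("Data\\keqing", rel))
```

If you change dataset_path to train another character, the same relative entries would then point into that character's directory.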
Bert-VITS2 V2.0.2 data preprocessing
The annotated raw dataset cannot be used for training directly and must be preprocessed first. Start by transcribing the raw data files into a standard annotation file:
python3 transcribe_genshin.py
Generated file:
Data\keqing\raw/keqing/vo_card_keqing_endOfGame_fail_01.wav|keqing|ZH|我会勤加练习,拿下下一次的胜利。
Data\keqing\raw/keqing/vo_card_keqing_endOfGame_win_01.wav|keqing|ZH|胜负本是常事,不必太过挂怀。
Data\keqing\raw/keqing/vo_card_keqing_freetalk_01.wav|keqing|ZH|这「七圣召唤」虽说是游戏,但对局之中也隐隐有策算谋略之理。
Here ZH stands for Chinese. Bert-VITS2 V2.0.2 also supports Japanese and English, whose codes are JP and EN respectively.
Next, preprocess the text and generate the BERT feature files:
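Each line of the generated file has four fields separated by a vertical bar: the audio path, the speaker name, the language code, and the text. The sketch below parses one such line; the Annotation type and parse_annotation function are hypothetical names introduced for illustration only.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    wav_path: str
    speaker: str
    language: str
    text: str


def parse_annotation(line):
    """Split one 'path|speaker|language|text' line into its four fields.

    Hypothetical helper. The split is capped at three separators so that
    a '|' appearing inside the text itself would be preserved.
    """
    wav_path, speaker, language, text = line.rstrip("\n").split("|", 3)
    return Annotation(wav_path, speaker, language, text)


if __name__ == "__main__":
    sample = (
        "Data\\keqing\\raw/keqing/vo_card_keqing_endOfGame_win_01.wav"
        "|keqing|ZH|胜负本是常事,不必太过挂怀。"
    )
    entry = parse_annotation(sample)
    print(entry.speaker, entry.language, entry.text)
```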
python3 preprocess_text.py
python3 bert_gen.py
After execution, training set and validation set files will be generated:
E:\work\Bert-VITS2-v202\Data\keqing\filelists>tree /f
Folder PATH listing for volume myssd
Volume serial number is 7CE3-15AE
E:.
cleaned.list
short_character_anno.list
train.list
val.list
Once you have confirmed that these files were generated correctly, data preprocessing is complete.
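One simple way to carry out this check is to confirm that all four list files exist and to count their entries. The sketch below assumes one entry per non-blank line; filelist_line_counts is a hypothetical helper, not a script shipped with the project.

```python
from pathlib import Path

# File names copied from the listing above.
EXPECTED_LISTS = [
    "cleaned.list",
    "short_character_anno.list",
    "train.list",
    "val.list",
]


def filelist_line_counts(filelists_dir):
    """Map each expected list file to its non-blank line count.

    A value of None means the file is missing. Hypothetical helper.
    """
    counts = {}
    for name in EXPECTED_LISTS:
        path = Path(filelists_dir) / name
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                counts[name] = sum(1 for line in handle if line.strip())
        else:
            counts[name] = None
    return counts


if __name__ == "__main__":
    for name, count in filelist_line_counts("Data/keqing/filelists").items():
        print(name, "missing" if count is None else count)
```

With val_per_spk set to 5 and a single speaker, val.list would be expected to hold five entries, which matches the validation set size reported in the training log later in this article.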
Bert-VITS2 V2.0.2 local training
Everything is in place and only training remains. Before starting, open the Data/keqing/config.json configuration file:
{
"train": {
"log_interval": 50,
"eval_interval": 50,
"seed": 42,
"epochs": 200,
"learning_rate": 0.0001,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 8,
"fp16_run": false,
"lr_decay": 0.99995,
"segment_size": 16384,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0,
"skip_optimizer": false
},
"data": {
"training_files": "Data/keqing/filelists/train.list",
"validation_files": "Data/keqing/filelists/val.list",
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 2048,
"hop_length": 512,
"win_length": 2048,
"n_mel_channels": 128,
"mel_fmin": 0.0,
"mel_fmax": null,
"add_blank": true,
"n_speakers": 1,
"cleaned_text": true,
"spk2id": {
"keqing": 0
}
},
"model": {
"use_spk_conditioned_encoder": true,
"use_noise_scaled_mas": true,
"use_mel_posterior_encoder": false,
"use_duration_discriminator": true,
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 6,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [
3,
7,
11
],
"resblock_dilation_sizes": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates": [
8,
8,
2,
2,
2
],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [
16,
16,
8,
2,
2
],
"n_layers_q": 3,
"use_spectral_norm": false,
"gin_channels": 256
},
"version": "2.0"
}
The parameter to adjust here is batch_size. If video memory is insufficient, lower it; otherwise training will run out of video memory. With 8 GB of video memory, it is best to keep batch_size at 8 or below.
It is also recommended to reduce log_interval and eval_interval for the first training run. These set how often training logs progress and saves checkpoints, so smaller values make it easier to run inference checks at any point during training.
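These values can of course be edited by hand. If you prefer to script the change, the sketch below shows one way to override keys in the "train" section without touching the rest of the configuration. The function with_train_overrides is a hypothetical helper, and the value 4 is only an example, not a recommendation from the project.

```python
import json


def with_train_overrides(config, **overrides):
    """Return a copy of config with the given "train" keys replaced.

    Hypothetical helper. Unknown keys are rejected so that a typo such as
    'batchsize' does not silently add a new setting.
    """
    unknown = set(overrides) - set(config["train"])
    if unknown:
        raise KeyError(f"not in train section: {sorted(unknown)}")
    updated = dict(config)
    updated["train"] = {**config["train"], **overrides}
    return updated


if __name__ == "__main__":
    # A trimmed-down stand-in for Data/keqing/config.json.
    sample = {"train": {"log_interval": 50, "eval_interval": 50, "batch_size": 8}}
    smaller = with_train_overrides(sample, batch_size=4)
    print(json.dumps(smaller["train"]))
```

To apply this to the real file, load it with json.load, pass the result through the helper, and write it back with json.dump.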
Then enter the command to start training:
python3 train_ms.py
The program returns:
11-22 13:20:28 INFO | data_utils.py:61 | Init dataset...
100%|█████████████████████████████████████████████████████████████████████████████| 581/581 [00:00<00:00, 48414.40it/s]
11-22 13:20:28 INFO | data_utils.py:76 | skipped: 31, total: 581
11-22 13:20:28 INFO | data_utils.py:61 | Init dataset...
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
11-22 13:20:28 INFO | data_utils.py:76 | skipped: 0, total: 5
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
INFO:models:Loaded checkpoint 'Data\keqing\models\DUR_0.pth' (iteration 7)
INFO:models:Loaded checkpoint 'Data\keqing\models\G_0.pth' (iteration 7)
INFO:models:Loaded checkpoint 'Data\keqing\models\D_0.pth' (iteration 7)
This output indicates that training has started.
During training, you can start TensorBoard with the command:
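The "skipped" and "total" figures in the log show how many clips are actually used. Since the progress bar also reads 581/581, total appears to count every entry including the skipped ones, so the training set here would contain 581 - 31 = 550 usable clips. The sketch below extracts that figure from a log line; usable_clips is a hypothetical helper, and the interpretation of total is an inference from the log, not from the project's documentation.

```python
import re

# Matches the summary emitted after "Init dataset..." in the log above.
SUMMARY = re.compile(r"skipped: (\d+), total: (\d+)")


def usable_clips(log_line):
    """Return total minus skipped for a summary line, or None if no match.

    Hypothetical helper; assumes 'total' includes the skipped entries.
    """
    match = SUMMARY.search(log_line)
    if match is None:
        return None
    skipped, total = int(match.group(1)), int(match.group(2))
    return total - skipped


if __name__ == "__main__":
    line = "11-22 13:20:28 INFO | data_utils.py:76 | skipped: 31, total: 581"
    print(usable_clips(line))  # prints 550
```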
python3 -m tensorboard.main --logdir=Data/keqing/models
To view the loss curves, visit:
http://localhost:6006/#scalars
As a rule of thumb, if the training loss rate falls below 50% and the loss has stabilized on both the training set and the validation set, the model can be considered converged and is ready to use. For how to use the trained model, see the article "Desirable and sultry: one-click inference package for the Raiden Shogun and Yae Miko voice models based on the new Bert-VITS2 V2.0.2"; space does not permit going into details here.
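"Tends to be stable" is usually judged by eye on the TensorBoard curves. If you want a rough numeric check, one possible criterion is that the last few recorded loss values stay within a small band around their mean. The sketch below implements that idea; the function has_plateaued, the window of 5 and the 2% tolerance are all arbitrary assumptions chosen for illustration, not values used by Bert-VITS2.

```python
def has_plateaued(losses, window=5, tolerance=0.02):
    """Return True if the last `window` losses vary little around their mean.

    Hypothetical criterion: (max - min) / |mean| over the window must not
    exceed `tolerance`. Fewer than `window` values never count as stable.
    """
    if len(losses) < window:
        return False
    recent = losses[-window:]
    mean = sum(recent) / window
    spread = max(recent) - min(recent)
    if mean == 0:
        return spread == 0
    return spread / abs(mean) <= tolerance


if __name__ == "__main__":
    still_falling = [5.0, 4.0, 3.0, 2.0, 1.0]
    flat_tail = [5.0, 4.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0]
    print(has_plateaued(still_falling), has_plateaued(flat_tail))
```

Such a check should be applied to both the training and the validation loss, and it is no substitute for listening to inference output from a saved checkpoint.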
The trained model is stored in the Data/keqing/models directory:
E:\work\Bert-VITS2-v202\Data\keqing\models>tree /f
Folder PATH listing for volume myssd
Volume serial number is 7CE3-15AE
E:.
│ DUR_0.pth
│ DUR_550.pth
│ DUR_600.pth
│ DUR_650.pth
│ D_0.pth
│ D_600.pth
│ D_650.pth
│ events.out.tfevents.1700625154.ly.24008.0
│ events.out.tfevents.1700630428.ly.20380.0
│ G_0.pth
│ G_450.pth
│ G_500.pth
│ G_550.pth
│ G_600.pth
│ G_650.pth
│ train.log
│
└───eval
events.out.tfevents.1700625154.ly.24008.1
events.out.tfevents.1700630428.ly.20380.1
Note that before the first training run, the pre-trained models (DUR_0.pth, D_0.pth and G_0.pth) must be copied into the Data/keqing/models directory.
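The number in each checkpoint name is the training step at which it was saved, so the most recent checkpoint is the one with the highest number. The sketch below picks out the latest step for each prefix from a list of file names; latest_steps is a hypothetical helper, and the naming pattern is inferred from the listing above.

```python
import re

# Checkpoint names in the listing look like G_650.pth, D_650.pth, DUR_650.pth.
CHECKPOINT = re.compile(r"^(G|D|DUR)_(\d+)\.pth$")


def latest_steps(filenames):
    """Map each checkpoint prefix (G, D, DUR) to its highest step number.

    Hypothetical helper; names that do not match the pattern are ignored.
    """
    latest = {}
    for name in filenames:
        match = CHECKPOINT.match(name)
        if match:
            prefix, step = match.group(1), int(match.group(2))
            latest[prefix] = max(step, latest.get(prefix, -1))
    return latest


if __name__ == "__main__":
    names = ["DUR_0.pth", "DUR_650.pth", "D_0.pth", "D_650.pth",
             "G_0.pth", "G_600.pth", "G_650.pth", "train.log"]
    print(latest_steps(names))
```

In practice the list of names could come from os.listdir on the Data/keqing/models directory.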
Conclusion
In addition to Chinese, Bert-VITS2 V2.0.2 also supports Japanese and English, and provides a Mix inference mode that combines Chinese, English and Japanese. These will be covered in the next article.