Local training, out of the box: training Bert-VITS2 V2.0.2 locally on an existing dataset (Genshin Impact)


Conventional wisdom says the training stage of a deep learning project belongs in the cloud, since local hardware is limited. In practice, though, for speech and natural language processing even a relatively small amount of data is enough to train a well-performing model, so students on a tight budget do not need to burn money on the "cloud". This time we will demonstrate how to train the Bert-VITS2 V2.0.2 model locally.

Bert-VITS2 V2.0.2 based on an existing dataset

Currently, Bert-VITS2 V2.0.2 has two general training approaches. The first uses an existing dataset: the voice lines of each Genshin Impact character have already been annotated. This material is publicly available but may not be used commercially. It can be downloaded here:

https://pan.ai-hobbyist.org/Genshin%20Datasets/%E4%B8%AD%E6%96%87%20-%20Chinese/%E5%88%86%E8%A7%92%E8%89%B2%20-%20Single/%E8%A7%92%E8%89%B2%E8%AF%AD%E9%9F%B3%20-%20Character

We only need to pick the character we like and download the corresponding archive.

The second approach is used when no dataset exists yet: say we want to clone the voice of an arbitrary person. In that case we have to collect recordings of that person ourselves and build the dataset from scratch.

This time we only demonstrate the first method, training a Genshin Impact character from the existing dataset; the second method is left for another article.

Configuring the Bert-VITS2 V2.0.2 models

First clone the project:

git clone https://github.com/v3ucn/Bert-VITS2_V202_Train.git

Then download the new version of the BERT models:

Link: https://pan.baidu.com/s/11vLNEVDeP_8YhYIJUjcUeg?pwd=v3uc

After the download completes, unzip the models into the project's bert directory. The directory structure is as follows:

E:\work\Bert-VITS2-v202\bert>tree /f  
Folder PATH listing for volume myssd  
Volume serial number is 7CE3-15AE  
E:.  
│   bert_models.json  
├───bert-base-japanese-v3  
│       config.json  
│       README.md  
│       tokenizer_config.json  
│       vocab.txt  
├───bert-large-japanese-v2  
│       config.json  
│       README.md  
│       tokenizer_config.json  
│       vocab.txt  
├───chinese-roberta-wwm-ext-large  
│       added_tokens.json  
│       config.json  
│       pytorch_model.bin  
│       README.md  
│       special_tokens_map.json  
│       tokenizer.json  
│       tokenizer_config.json  
│       vocab.txt  
├───deberta-v2-large-japanese  
│       config.json  
│       pytorch_model.bin  
│       README.md  
│       special_tokens_map.json  
│       tokenizer.json  
│       tokenizer_config.json  
└───deberta-v3-large  
        config.json  
        generator_config.json  
        pytorch_model.bin  
        README.md  
        spm.model  
        tokenizer_config.json

Then download the pre-trained model:

https://openi.pcl.ac.cn/Stardust_minus/Bert-VITS2/modelmanage/model_readme_tmpl?name=Bert-VITS2%E4%B8%AD%E6%97%A5%E8%8B%B1%E5%BA%95%E6%A8%A1-fix

Place it in the project's pretrained_models directory as follows:

E:\work\Bert-VITS2-v202\pretrained_models>tree /f  
Folder PATH listing for volume myssd  
Volume serial number is 7CE3-15AE  
E:.  
    DUR_0.pth  
    D_0.pth  
    G_0.pth

Then put the Keqing dataset downloaded above into the raw directory under the project's Data directory:

E:\work\Bert-VITS2-v202\Data\keqing\raw\keqing>tree /f  
Folder PATH listing for volume myssd  
Volume serial number is 7CE3-15AE  
E:.  
vo_card_keqing_endOfGame_fail_01.lab  
vo_card_keqing_endOfGame_fail_01.wav
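
Each .wav file is paired with a .lab file that holds its transcription. Before moving on, it is worth checking that every audio clip has a matching annotation. Below is a minimal sketch of such a check; this helper script is not part of the project, and the directory path simply follows the layout above:

from pathlib import Path

# path follows the example layout above; adjust for your own character
raw_dir = Path(r"Data/keqing/raw/keqing")

wav_stems = {p.stem for p in raw_dir.glob("*.wav")}
lab_stems = {p.stem for p in raw_dir.glob("*.lab")}

missing_labs = sorted(wav_stems - lab_stems)
missing_wavs = sorted(lab_stems - wav_stems)

if missing_labs:
    print("wav files without a .lab transcription:", missing_labs)
if missing_wavs:
    print(".lab files without a matching wav:", missing_wavs)
if not (missing_labs or missing_wavs):
    print(f"OK: {len(wav_stems)} paired wav/lab files")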

If you want to customize the directory structure, you can modify the config.yml file:

bert_gen:  
  config_path: config.json  
  device: cuda  
  num_processes: 2  
  use_multi_device: false  
dataset_path: Data\keqing  
mirror: ''  
openi_token: ''  
preprocess_text:  
  clean: true  
  cleaned_path: filelists/cleaned.list  
  config_path: config.json  
  max_val_total: 8  
  train_path: filelists/train.list  
  transcription_path: filelists/short_character_anno.list  
  val_path: filelists/val.list  
  val_per_spk: 5  
resample:  
  in_dir: raw  
  out_dir: raw  
  sampling_rate: 44100
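
The resample section means the WAV files under raw are read and written back to the same directory at 44.1 kHz; the project's own resampling step is driven by this part of config.yml. If you ever need to perform that step by hand, here is a minimal sketch using librosa and soundfile (these two libraries are my assumption for illustration, not necessarily what the project itself uses):

from pathlib import Path

import librosa
import soundfile as sf

# mirrors resample.in_dir / out_dir / sampling_rate in config.yml above
raw_dir = Path(r"Data/keqing/raw/keqing")
target_sr = 44100

for wav_path in raw_dir.glob("*.wav"):
    # load as mono and resample to the target rate
    audio, _ = librosa.load(wav_path, sr=target_sr, mono=True)
    # overwrite in place, matching in_dir == out_dir == raw
    sf.write(wav_path, audio, target_sr)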

At this point, the model and data set are configured.

Bert-VITS2 V2.0.2 data preprocessing

The annotated raw dataset cannot be trained on directly; it needs preprocessing first. Begin by converting the raw data files into a standard annotation file:

python3 transcribe_genshin.py

The generated file looks like this:

Data\keqing\raw/keqing/vo_card_keqing_endOfGame_fail_01.wav|keqing|ZH|我会勤加练习,拿下下一次的胜利。  
Data\keqing\raw/keqing/vo_card_keqing_endOfGame_win_01.wav|keqing|ZH|胜负本是常事,不必太过挂怀。  
Data\keqing\raw/keqing/vo_card_keqing_freetalk_01.wav|keqing|ZH|这「七圣召唤」虽说是游戏,但对局之中也隐隐有策算谋略之理。

Here ZH stands for Chinese. The new Bert-VITS2 V2.0.2 also supports Japanese and English, whose codes are JP and EN respectively.
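
For reference, each line of the annotation file is simply audio path, speaker name, language code and text, separated by | characters. The following minimal sketch shows how such lines could be assembled from the .lab files; it only illustrates the format and is not the project's transcribe_genshin.py (the output path is taken from transcription_path in config.yml above):

from pathlib import Path

raw_dir = Path(r"Data/keqing/raw/keqing")
speaker = "keqing"
language = "ZH"  # JP and EN are also valid language codes

lines = []
for lab_path in sorted(raw_dir.glob("*.lab")):
    text = lab_path.read_text(encoding="utf-8").strip()
    wav_path = lab_path.with_suffix(".wav")
    lines.append(f"{wav_path.as_posix()}|{speaker}|{language}|{text}")

out_file = Path(r"Data/keqing/filelists/short_character_anno.list")
out_file.parent.mkdir(parents=True, exist_ok=True)
out_file.write_text("\n".join(lines) + "\n", encoding="utf-8")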

Next, preprocess the text and generate the files that the BERT models read:

python3 preprocess_text.py  
  
python3 bert_gen.py

After execution, training set and validation set files will be generated:

E:\work\Bert-VITS2-v202\Data\keqing\filelists>tree /f  
Folder PATH listing for volume myssd  
Volume serial number is 7CE3-15AE  
E:.  
    cleaned.list  
    short_character_anno.list  
    train.list  
    val.list
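
A quick way to sanity-check the split is to count the entries in each list (a minimal sketch; the file names come from the config.yml shown earlier):

from pathlib import Path

filelists = Path(r"Data/keqing/filelists")
for name in ("train.list", "val.list"):
    entries = (filelists / name).read_text(encoding="utf-8").splitlines()
    print(f"{name}: {len(entries)} entries")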

Once you have checked that these files look correct, data preprocessing is complete.

Bert-VITS2 V2.0.2 local training

Everything is ready; all that remains is training. But before starting, open the Data/keqing/config.json configuration file:

{  
  "train": {  
    "log_interval": 50,  
    "eval_interval": 50,  
    "seed": 42,  
    "epochs": 200,  
    "learning_rate": 0.0001,  
    "betas": [  
      0.8,  
      0.99  
    ],  
    "eps": 1e-09,  
    "batch_size": 8,  
    "fp16_run": false,  
    "lr_decay": 0.99995,  
    "segment_size": 16384,  
    "init_lr_ratio": 1,  
    "warmup_epochs": 0,  
    "c_mel": 45,  
    "c_kl": 1.0,  
    "skip_optimizer": false  
  },  
  "data": {  
    "training_files": "Data/keqing/filelists/train.list",  
    "validation_files": "Data/keqing/filelists/val.list",  
    "max_wav_value": 32768.0,  
    "sampling_rate": 44100,  
    "filter_length": 2048,  
    "hop_length": 512,  
    "win_length": 2048,  
    "n_mel_channels": 128,  
    "mel_fmin": 0.0,  
    "mel_fmax": null,  
    "add_blank": true,  
    "n_speakers": 1,  
    "cleaned_text": true,  
    "spk2id": {  
      "keqing": 0  
    }  
  },  
  "model": {  
    "use_spk_conditioned_encoder": true,  
    "use_noise_scaled_mas": true,  
    "use_mel_posterior_encoder": false,  
    "use_duration_discriminator": true,  
    "inter_channels": 192,  
    "hidden_channels": 192,  
    "filter_channels": 768,  
    "n_heads": 2,  
    "n_layers": 6,  
    "kernel_size": 3,  
    "p_dropout": 0.1,  
    "resblock": "1",  
    "resblock_kernel_sizes": [  
      3,  
      7,  
      11  
    ],  
    "resblock_dilation_sizes": [  
      [  
        1,  
        3,  
        5  
      ],  
      [  
        1,  
        3,  
        5  
      ],  
      [  
        1,  
        3,  
        5  
      ]  
    ],  
    "upsample_rates": [  
      8,  
      8,  
      2,  
      2,  
      2  
    ],  
    "upsample_initial_channel": 512,  
    "upsample_kernel_sizes": [  
      16,  
      16,  
      8,  
      2,  
      2  
    ],  
    "n_layers_q": 3,  
    "use_spectral_norm": false,  
    "gin_channels": 256  
  },  
  "version": "2.0"  
}

The parameter that most likely needs adjusting is batch_size. If GPU memory is insufficient, lower it, otherwise you will run into out-of-memory errors on the GPU. With 8 GB of video memory, it is best not to go above 8.

For a first training run it is also recommended to set log_interval and eval_interval to smaller values; eval_interval controls how often checkpoints are saved, which makes it easier to run inference checks at any point during training.
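
If you prefer to adjust these values from a script rather than editing the JSON by hand, a minimal sketch could look like this (the values are only examples):

import json
from pathlib import Path

config_path = Path(r"Data/keqing/config.json")
config = json.loads(config_path.read_text(encoding="utf-8"))

# keep batch_size within what the GPU memory allows, e.g. 8 on an 8 GB card
config["train"]["batch_size"] = 8
# smaller intervals mean more frequent logging and checkpointing
config["train"]["log_interval"] = 50
config["train"]["eval_interval"] = 50

config_path.write_text(json.dumps(config, ensure_ascii=False, indent=2), encoding="utf-8")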

Then enter the command to start training:

python3 train_ms.py

The program returns:

11-22 13:20:28 INFO     | data_utils.py:61 | Init dataset...  
100%|█████████████████████████████████████████████████████████████████████████████| 581/581 [00:00<00:00, 48414.40it/s]  
11-22 13:20:28 INFO     | data_utils.py:76 | skipped: 31, total: 581  
11-22 13:20:28 INFO     | data_utils.py:61 | Init dataset...  
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]  
11-22 13:20:28 INFO     | data_utils.py:76 | skipped: 0, total: 5  
Using noise scaled MAS for VITS2  
Using duration discriminator for VITS2  
INFO:models:Loaded checkpoint 'Data\keqing\models\DUR_0.pth' (iteration 7)  
INFO:models:Loaded checkpoint 'Data\keqing\models\G_0.pth' (iteration 7)  
INFO:models:Loaded checkpoint 'Data\keqing\models\D_0.pth' (iteration 7)

This indicates that training has started.

During training, you can run the following command to launch TensorBoard:

python3 -m tensorboard.main --logdir=Data/keqing/models

and then visit the following address to view the loss curves:

http://localhost:6006/#scalars

Generally, once the training loss has dropped below 50% and the loss curves have stabilized on both the training and validation sets, the model can be considered converged and is ready to use. For how to run inference with the trained model, see the author's companion article on the Bert-VITS2 V2.0.2 one-click inference integration package for the Raiden Shogun and Yae Miko voice models; due to space limitations, it is not covered here.

The trained model is stored in the Data/keqing/models directory:

E:\work\Bert-VITS2-v202\Data\keqing\models>tree /f  
Folder PATH listing for volume myssd  
Volume serial number is 7CE3-15AE  
E:.  
│   DUR_0.pth  
│   DUR_550.pth  
│   DUR_600.pth  
│   DUR_650.pth  
│   D_0.pth  
│   D_600.pth  
│   D_650.pth  
│   events.out.tfevents.1700625154.ly.24008.0  
│   events.out.tfevents.1700630428.ly.20380.0  
│   G_0.pth  
│   G_450.pth  
│   G_500.pth  
│   G_550.pth  
│   G_600.pth  
│   G_650.pth  
│   train.log  
└───eval  
        events.out.tfevents.1700625154.ly.24008.1  
        events.out.tfevents.1700630428.ly.20380.1

Note that for the first training run you need to copy the pre-trained models (DUR_0.pth, D_0.pth and G_0.pth) into this models directory.
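
A minimal sketch of that copy step (directory names follow the layout shown earlier):

import shutil
from pathlib import Path

pretrained_dir = Path(r"pretrained_models")
models_dir = Path(r"Data/keqing/models")
models_dir.mkdir(parents=True, exist_ok=True)

for name in ("DUR_0.pth", "D_0.pth", "G_0.pth"):
    shutil.copy2(pretrained_dir / name, models_dir / name)
    print(f"copied {name} -> {models_dir}")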

Conclusion

In addition to Chinese, Bert-VITS2 V2.0.2 also supports Japanese and English, and provides a Mix inference mode that mixes Chinese, English and Japanese in a single utterance. More on that in the next installment.


Original article: blog.csdn.net/zcxey2911/article/details/134555041