Speech | Detailed explanation of lightweight speech synthesis paper and project implementation

2023_LIGHTWEIGHT AND HIGH-FIDELITY END-TO-END TEXT-TO-SPEECH WITH MULTI-BAND GENERATION AND INVERSE SHORT-TIME FOURIER TRANSFORM

Paper: https://arxiv.org/pdf/2210.15975.pdf

Code: GitHub — misakiudon/MB-iSTFT-VITS-multilingual (Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform, with multilingual cleaners)

Table of contents

1. Detailed explanation of the paper

1.1. Introduction

1.2. VITS algorithm

1.3. Proposed method

1.4. Experiments

1.5. Conclusion

2. Project implementation

2.1. Data preparation

2.2. Data preprocessing

2.3. Text processing

2.4. Training

2.5. Inference

【PS】

【PS1】ERROR: Could not build wheels for pyopenjtalk, which is required to install pyproject.toml-based projects

【PS2】AttributeError: 'HParams' object has no attribute 'seed'

【PS3】EOFError: Ran out of input

【PS4】The data does not generate the corresponding spec.pt file

【PS5】TypeError: __init__() takes 1 positional argument but 2 were given

【PS6】AssertionError when importing soundcard (PulseAudio context not ready)

Introduction to multilingual processing methods

Korean

Method 1: Korean to phoneme (G2P)

Method 2: Acoustic Model

VSCode shortcut keys

1. Detailed explanation of the paper

1.1. Introduction

The introduction reviews the conventional two-stage TTS pipeline (acoustic model + vocoder). Because VITS is a high-quality end-to-end model, the paper builds its lightweight end-to-end model on top of VITS. The modifications mainly target the decoder, i.e. the part that converts the latent acoustic features into a waveform: part of the decoder is replaced with a simple inverse short-time Fourier transform (iSTFT), which handles the conversion from the frequency domain to the time domain efficiently, and multi-band processing is used to further increase inference speed. In the proposed method, each iSTFT branch generates a sub-band signal; during inference the model is 4.1 times faster than the original VITS.

1.2. VITS algorithm

1.2.1. The principle of VITS is only briefly introduced; I will not go into detail here. For more information, see [VITS paper summary and project reproduction].

1.2.2. Inference speed of each model

The real-time factor (RTF) is used here as an objective measure of inference speed: the time taken to synthesize divided by the duration of the synthesized speech.
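As a quick illustration (not from the paper), RTF can be computed like this; the values in the example call are placeholders:

def real_time_factor(synthesis_seconds, audio_samples, sampling_rate=22050):
    # RTF = time taken to synthesize / duration of the synthesized audio (lower is faster)
    audio_seconds = audio_samples / sampling_rate
    return synthesis_seconds / audio_seconds

# Example: 0.5 s of compute for 2.0 s of audio -> RTF = 0.25
print(real_time_factor(0.5, 2 * 22050))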

1.3. Proposed method

1.3.1. In the output layer, the HiFi-GAN decoder of the original VITS is replaced with a simple inverse short-time Fourier transform (iSTFT).

1.3.2. Multi-band iSTFT VITS (MB-iSTFT-VITS)

The figure below shows the overall framework of the model.

1.3.3. Multi-stream iSTFT VITS (MS-iSTFT-VITS): unlike MB-iSTFT-VITS, its waveform synthesis filter is fully trainable, so it is no longer constrained by a fixed sub-band signal decomposition.
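To make the multi-band idea concrete, below is a minimal, self-contained sketch (not the authors' implementation; the shapes and random decoder outputs are stand-ins): each sub-band decoder predicts a magnitude/phase spectrogram, iSTFT turns it into a low-resolution sub-band waveform, and a synthesis filter (fixed pseudo-QMF in MB-iSTFT-VITS, trainable convolution in MS-iSTFT-VITS) merges the upsampled sub-bands into the full-band waveform.

# Minimal sketch of multi-band iSTFT decoding (illustrative only, random inputs)
import math
import torch

n_bands, n_fft, hop, frames = 4, 16, 4, 100
window = torch.hann_window(n_fft)

# Stand-in decoder outputs: per-band magnitude and phase, shape (band, freq, frames)
mag = torch.rand(n_bands, n_fft // 2 + 1, frames)
phase = torch.rand(n_bands, n_fft // 2 + 1, frames) * 2 * math.pi

# iSTFT per band: frequency domain -> low-resolution time-domain sub-band signal
spec = torch.polar(mag, phase)
subbands = torch.istft(spec, n_fft, hop_length=hop, win_length=n_fft, window=window)

# Upsample each sub-band by the number of bands (zero insertion), then merge with a
# synthesis filter: fixed pseudo-QMF in MB-iSTFT-VITS, trainable conv in MS-iSTFT-VITS
upsampled = torch.zeros(n_bands, subbands.size(1) * n_bands)
upsampled[:, ::n_bands] = subbands
synthesis_filter = torch.nn.Conv1d(n_bands, 1, kernel_size=63, padding=31, bias=False)
waveform = synthesis_filter(upsampled.unsqueeze(0))  # (1, 1, samples)
print(waveform.shape)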

1.4. Experiments

Dataset: LJ Speech dataset

Five VITS-based models were compared:

  1. VITS : The official VITS, with the same hyperparameters as the original.
  2. Nix-TTS : The pre-trained Nix-TTS model; the version used is the optimized ONNX release [27]. Note that the dataset used in the experiments is exactly the same as that used by Nix-TTS.
  3. iSTFT-VITS : A model that incorporates iSTFTNet into the VITS decoder. The iSTFTNet architecture is V1-C8C8I, the best-balanced model described in [14], with two upsampling ratios of [8, 8]. The FFT size, hop length and window length are set to 16, 4 and 16 respectively.
  4. MB-iSTFT-VITS : The proposed model introduced in Section 3.2. The number of sub-bands N is set to 4, and the upsampling ratios of the two residual blocks are [4, 4] to match the resolution of each sub-band signal decomposed by the pseudo-QMF analysis filter. The FFT size, hop length and window length of the iSTFT part are the same as in iSTFT-VITS.
  5. MS-iSTFT-VITS : The other proposed model, introduced in Section 3.3. Following [10], the kernel size of the convolution-based trainable synthesis filter is set to 63, with no bias. Other conditions are the same as for MB-iSTFT-VITS.

The figure below shows the comparison results.

1.5. Conclusion

This paper proposes an end-to-end TTS system that enables high-speed speech synthesis on device. The method builds on the successful end-to-end model VITS but adopts several techniques to speed up inference, such as reducing redundant decoder computation with iSTFT and adopting a multi-band parallel strategy. Because the proposed model is optimized in a fully end-to-end manner, it benefits from this optimization process more than conventional two-stage methods. Experimental results show that the generated speech is as natural as speech synthesized by VITS, while the waveform is generated much faster. Future work includes extending the method to multi-speaker models.

2. Project implementation

2.1. Data preparation

The prepared audio data must be 22050 Hz, single-channel (mono), 16-bit PCM (a conversion sketch follows the filelist examples below).

Single speaker

dataset/001.wav|您好
dataset/001.wav|안녕하세요
dataset/001.wav|こんにちは。

Multiple speakers

dataset/001.wav|0|こんにちは。

The middle number is the speaker ID, starting from 0.
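If the recordings are not already in the required format, they can be converted first; below is a minimal sketch assuming librosa and soundfile are installed (the paths are placeholders):

# Convert arbitrary audio to 22050 Hz, mono, 16-bit PCM WAV (sketch, not from the repo)
import librosa
import soundfile as sf

def convert_to_dataset_format(in_path, out_path, sr=22050):
    audio, _ = librosa.load(in_path, sr=sr, mono=True)   # resample and downmix to mono
    sf.write(out_path, audio, sr, subtype="PCM_16")      # write 16-bit PCM WAV

convert_to_dataset_format("raw_audio/001.wav", "dataset/001.wav")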

# Cython-version Monotonic Alignment Search
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace

The repository provides several training configurations; because MB-iSTFT-VITS performs best in the paper, that model is used in this article.

Change the paths in `training_files` and `validation_files`.

Create a new config.json for your own dataset.

Revise:

"training_files":"./filelists/history_train_filelist.txt.cleaned",

"validation_files":"./filelists/history_train_filelist.txt.cleaned",

"text_cleaners":["cjke_cleaners2"], #多语言自定义函数

 # "text_cleaners":["korean_cleaners"], #训练语言

2.2. Data preprocessing

# Single speaker
# python preprocess.py --text_index 1 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'japanese_cleaners'

# Example used in this article
# Method 1: acoustic model (not yet run)
 python preprocess.py --text_index 1 --filelists ./filelists/history_train_filelist.txt ./filelists/history_val_filelist.txt --text_cleaners 'korean_cleaners'
# Method 2: G2P model
 python preprocess.py --text_index 1 --filelists ./filelists/history_train_filelist.txt ./filelists/history_val_filelist.txt --text_cleaners 'cjke_cleaners2'

# Multiple speakers
python preprocess.py --text_index 2 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'japanese_cleaners'

Running this generates the cleaned filelists (*.cleaned).

2.3. Text processing

Chinese, English, Korean, and Japanese text are handled in slightly different ways.

This covers text conversion methods, text normalization, and symbol/punctuation handling.

2.4. Training

# Single speaker
#python train_latest.py -c <config> -m <folder>
python train_latest.py -c configs/myvoice.json  -m myvoice_model   # model folder name; choose freely

# Multiple speakers
#python train_latest_ms.py -c <config> -m <folder>

 

After training, checkpoints and logs are stored in the logs folder.
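Training progress can be monitored with TensorBoard, assuming it is installed and that the repo writes event files under the model's log folder as the original VITS does:

tensorboard --logdir logs/myvoice_model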

2.5. Inference

import warnings
warnings.filterwarnings(action='ignore')

import os
import time
import torch
import utils
import argparse
import commons
from models import SynthesizerTrn
from text.symbols import symbols
from text import cleaned_text_to_sequence
# Japanese: from g2p import pyopenjtalk_g2p_prosody
# Korean
from g2pk2 import G2p
import soundcard as sc
import soundfile as sf


def get_text(text, hps):
    text_norm = cleaned_text_to_sequence(text)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm

def inference(args):

    config_path = args.config
    G_model_path = args.model_path

    # check device
    if  torch.cuda.is_available() is True:
        print("Enter the device number to use.")
        key = input("GPU:0, CPU:1 ===> ")
        if key == "0":
            device="cuda:0"
        elif key=="1":
            device="cpu"
        print(f"Device : {device}")
    else:
        print(f"CUDA is not available. Device : cpu")
        device = "cpu"

    # load config.json
    hps = utils.get_hparams_from_file(config_path)
    
    # load checkpoint
    net_g = SynthesizerTrn(
        len(symbols),
        hps.data.filter_length // 2 + 1,
        hps.train.segment_size // hps.data.hop_length,
        **hps.model).to(device)
    _ = net_g.eval()
    _ = utils.load_checkpoint(G_model_path, net_g, None)

    # play audio by system default
    speaker = sc.get_speaker(sc.default_speaker().name)

    # parameter settings
    noise_scale     = torch.tensor(0.66)    # adjust z_p noise
    noise_scale_w   = torch.tensor(0.8)    # adjust SDP noise
    length_scale    = torch.tensor(1.0)     # adjust sound length scale (talk speed)

    if args.is_save is True:
        n_save = 0
        save_dir = os.path.join("./infer_logs/")
        os.makedirs(save_dir, exist_ok=True)

    # Korean G2P: instantiate once and call the instance (see [PS5])
    g2p = G2p()

    ### Dummy Input (model warm-up before timing) ###
    with torch.inference_mode():
        # Japanese: stn_phn = pyopenjtalk_g2p_prosody("速度計測のためのダミーインプットです。")
        stn_phn = g2p("소프트웨어 교육의 중요성이 날로 더해가는데 학생들은 소프트웨어 관련 교육을 쉽게 지루해해요")
        stn_tst = get_text(stn_phn, hps)
        # generate audio
        x_tst = stn_tst.to(device).unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
        audio = net_g.infer(x_tst, 
                            x_tst_lengths, 
                            noise_scale=noise_scale, 
                            noise_scale_w=noise_scale_w, 
                            length_scale=length_scale)[0][0,0].data.cpu().float().numpy()

    while True:

        # get text
        text = input("Enter text. ==> ")
        if text=="":
            print("Empty input is detected... Exit...")
            break
        
        # measure the execution time
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.time()

        # required_grad is False
        with torch.inference_mode():
            # Japanese: stn_phn = pyopenjtalk_g2p_prosody(text)
            # Korean
            stn_phn = g2p(text)
            stn_tst = get_text(stn_phn, hps)

            # generate audio
            x_tst = stn_tst.to(device).unsqueeze(0)
            x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
            audio = net_g.infer(x_tst, 
                                x_tst_lengths, 
                                noise_scale=noise_scale, 
                                noise_scale_w=noise_scale_w, 
                                length_scale=length_scale)[0][0,0].data.cpu().float().numpy()

        # measure the execution time
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed_time = time.time() - start
        print(f"Gen Time : {elapsed_time:.4f} s")
        
        # play audio
        speaker.play(audio, hps.data.sampling_rate)
        
        # save audio
        if args.is_save is True:
            n_save += 1
            data = audio
            try:
                save_path = os.path.join(save_dir, str(n_save).zfill(3)+f"_{text}.wav")
                sf.write(
                     file=save_path,
                     data=data,
                     samplerate=hps.data.sampling_rate,
                     format="WAV")
            except:
                save_path = os.path.join(save_dir, str(n_save).zfill(3)+f"_{text[:10]}〜.wav")
                sf.write(
                     file=save_path,
                     data=data,
                     samplerate=hps.data.sampling_rate,
                     format="WAV")

            print(f"Audio is saved at : {save_path}")


    return 0

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument('--config',
                        type=str,
                        required=True,
                        #default="./logs/ITA_CORPUS/config.json" ,    
                        help='Path to configuration file')
    parser.add_argument('--model_path',
                        type=str,
                        required=True,
                        #default="./logs/ITA_CORPUS/G_1200.pth",
                        help='Path to checkpoint')
    parser.add_argument('--is_save',
                        type=str,
                        default=True,
                        help='Whether to save output or not')
    args = parser.parse_args()
    
    inference(args)

If something goes wrong, please check [PS5] 
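Assuming the script above is saved as inference.py (the file name, config path and checkpoint name below are placeholders), it can be run like this:

python inference.py --config configs/myvoice.json --model_path logs/myvoice_model/G_100000.pth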

【PS】

【PS1】ERROR: Could not build wheels for pyopenjtalk, which is required to install pyproject.toml-based projects

This error occurred when running pip install -r requirements.txt.

Following a reference, I tried:

pip install pycocotools -i https://pypi.python.org/simple

but the error persisted.

【PS2】AttributeError: 'HParams' object has no attribute 'seed'

The configuration file under configs/ is missing the seed entry.

Add it to the "train" section of config.json (e.g. "seed": 1234) and re-run.

【PS3】EOFError: Ran out of input

The author's training defaults assume a larger setup (5 GPUs / data-loading workers).

Change num_workers to 0.

Another error then appeared (a lookup failure for symbols that are not in the symbol table).

Change line 35 in /workspace/tts/MB-iSTFT-VITS-multilingual/text/__init__.py to:

  sequence = [_symbol_to_id[symbol] for symbol in cleaned_text if symbol in _symbol_to_id.keys()]

【PS4】The data does not generate the corresponding spec.pt file

【PS5】TypeError: __init__() takes 1 positional argument but 2 were given

The text-processing libraries differ between languages. With g2pk2, this error occurs when the G2p class is called directly with text (G2p(text)); the class must be instantiated first and the instance called with the text.
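For reference, a minimal sketch of the corrected g2pk2 usage (instantiate once, then call the instance):

from g2pk2 import G2p

g2p = G2p()               # the constructor takes no text argument
text = g2p("안녕하세요")    # call the instance to get pronunciation-normalized hangul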

【PS6】Traceback (most recent call last):
  File "sc_test.py", line 2, in <module>
    import soundcard
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/__init__.py", line 4, in <module>
    from soundcard.pulseaudio import *
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/pulseaudio.py", line 265, in <module>
    _pulse = _PulseAudio()
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/pulseaudio.py", line 76, in __init__
    assert self._pa_context_get_state(self.context)==_pa.PA_CONTEXT_READY
AssertionError

Reference: bastibe/SoundCard issue #133 — "'import soundcard' throws an error when run through a systemd service" (github.com)


Introduction to multilingual processing methods

Korean

  • Korean to IPA: scarletcho/KoG2P — Korean grapheme-to-phone conversion in Python (github.com)

 Method 1: Korean to phoneme (G2P)

Fix the data format accordingly.

Modify the text processing in /workspace/tts/MB-iSTFT-VITS-multilingual/text/__init__.py and cleaners.py.

Modify the configuration file /workspace/tts/MB-iSTFT-VITS-multilingual/configs/<your dataset name>.json.

import re
from text.korean import latin_to_hangul, number_to_hangul, divide_hangul, korean_to_lazy_ipa, korean_to_ipa
from g2pk2 import G2p

def cjke_cleaners2(text):
    # Korean-only variant: convert segments wrapped in [KO]...[KO] tags to IPA
    # (the tag markers are reconstructed here; the multilingual original handles
    # [ZH]/[JA]/[EN] segments in the same way)
    text = re.sub(r'\[KO\](.*?)\[KO\]',
                  lambda x: korean_to_ipa(x.group(1)) + ' ', text)
    text = re.sub(r'\s+$', '', text)
    # ensure the sequence ends with a punctuation mark
    text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text)
    return text
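With this cleaner, each filelist transcript is assumed to wrap Korean text in [KO]...[KO] tags (this tag format is an assumption based on the markers used in the regex above), e.g.:

dataset/001.wav|[KO]안녕하세요[KO]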
 Method 2: Acoustic Model

For Korean text processing, in addition to the method provided by the author, first add korean.py, which contains the processing for Korean text and punctuation marks. The specific code follows.

import re
from jamo import h2j, j2hcj
import ko_pron


# This is a list of Korean classifiers preceded by pure Korean numerals.
_korean_classifiers = '군데 권 개 그루 닢 대 두 마리 모 모금 뭇 발 발짝 방 번 벌 보루 살 수 술 시 쌈 움큼 정 짝 채 척 첩 축 켤레 톨 통'

# List of (hangul, hangul divided) pairs:
_hangul_divided = [(re.compile('%s' % x[0]), x[1]) for x in [
    ('ㄳ', 'ㄱㅅ'),
    ('ㄵ', 'ㄴㅈ'),
    ('ㄶ', 'ㄴㅎ'),
    ('ㄺ', 'ㄹㄱ'),
    ('ㄻ', 'ㄹㅁ'),
    ('ㄼ', 'ㄹㅂ'),
    ('ㄽ', 'ㄹㅅ'),
    ('ㄾ', 'ㄹㅌ'),
    ('ㄿ', 'ㄹㅍ'),
    ('ㅀ', 'ㄹㅎ'),
    ('ㅄ', 'ㅂㅅ'),
    ('ㅘ', 'ㅗㅏ'),
    ('ㅙ', 'ㅗㅐ'),
    ('ㅚ', 'ㅗㅣ'),
    ('ㅝ', 'ㅜㅓ'),
    ('ㅞ', 'ㅜㅔ'),
    ('ㅟ', 'ㅜㅣ'),
    ('ㅢ', 'ㅡㅣ'),
    ('ㅑ', 'ㅣㅏ'),
    ('ㅒ', 'ㅣㅐ'),
    ('ㅕ', 'ㅣㅓ'),
    ('ㅖ', 'ㅣㅔ'),
    ('ㅛ', 'ㅣㅗ'),
    ('ㅠ', 'ㅣㅜ')
]]

# List of (Latin alphabet, hangul) pairs:
_latin_to_hangul = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [
    ('a', '에이'),
    ('b', '비'),
    ('c', '시'),
    ('d', '디'),
    ('e', '이'),
    ('f', '에프'),
    ('g', '지'),
    ('h', '에이치'),
    ('i', '아이'),
    ('j', '제이'),
    ('k', '케이'),
    ('l', '엘'),
    ('m', '엠'),
    ('n', '엔'),
    ('o', '오'),
    ('p', '피'),
    ('q', '큐'),
    ('r', '아르'),
    ('s', '에스'),
    ('t', '티'),
    ('u', '유'),
    ('v', '브이'),
    ('w', '더블유'),
    ('x', '엑스'),
    ('y', '와이'),
    ('z', '제트')
]]

# List of (ipa, lazy ipa) pairs:
_ipa_to_lazy_ipa = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [
    ('t͡ɕ','ʧ'),
    ('d͡ʑ','ʥ'),
    ('ɲ','n^'),
    ('ɕ','ʃ'),
    ('ʷ','w'),
    ('ɭ','l`'),
    ('ʎ','ɾ'),
    ('ɣ','ŋ'),
    ('ɰ','ɯ'),
    ('ʝ','j'),
    ('ʌ','ə'),
    ('ɡ','g'),
    ('\u031a','#'),
    ('\u0348','='),
    ('\u031e',''),
    ('\u0320',''),
    ('\u0339','')
]]


def latin_to_hangul(text):
    for regex, replacement in _latin_to_hangul:
        text = re.sub(regex, replacement, text)
    return text


def divide_hangul(text):
    text = j2hcj(h2j(text))
    for regex, replacement in _hangul_divided:
        text = re.sub(regex, replacement, text)
    return text


def hangul_number(num, sino=True):
    '''Reference https://github.com/Kyubyong/g2pK'''
    num = re.sub(',', '', num)

    if num == '0':
        return '영'
    if not sino and num == '20':
        return '스무'

    digits = '123456789'
    names = '일이삼사오육칠팔구'
    digit2name = {d: n for d, n in zip(digits, names)}

    modifiers = '한 두 세 네 다섯 여섯 일곱 여덟 아홉'
    decimals = '열 스물 서른 마흔 쉰 예순 일흔 여든 아흔'
    digit2mod = {d: mod for d, mod in zip(digits, modifiers.split())}
    digit2dec = {d: dec for d, dec in zip(digits, decimals.split())}

    spelledout = []
    for i, digit in enumerate(num):
        i = len(num) - i - 1
        if sino:
            if i == 0:
                name = digit2name.get(digit, '')
            elif i == 1:
                name = digit2name.get(digit, '') + '십'
                name = name.replace('일십', '십')
        else:
            if i == 0:
                name = digit2mod.get(digit, '')
            elif i == 1:
                name = digit2dec.get(digit, '')
        if digit == '0':
            if i % 4 == 0:
                last_three = spelledout[-min(3, len(spelledout)):]
                if ''.join(last_three) == '':
                    spelledout.append('')
                    continue
            else:
                spelledout.append('')
                continue
        if i == 2:
            name = digit2name.get(digit, '') + '백'
            name = name.replace('일백', '백')
        elif i == 3:
            name = digit2name.get(digit, '') + '천'
            name = name.replace('일천', '천')
        elif i == 4:
            name = digit2name.get(digit, '') + '만'
            name = name.replace('일만', '만')
        elif i == 5:
            name = digit2name.get(digit, '') + '십'
            name = name.replace('일십', '십')
        elif i == 6:
            name = digit2name.get(digit, '') + '백'
            name = name.replace('일백', '백')
        elif i == 7:
            name = digit2name.get(digit, '') + '천'
            name = name.replace('일천', '천')
        elif i == 8:
            name = digit2name.get(digit, '') + '억'
        elif i == 9:
            name = digit2name.get(digit, '') + '십'
        elif i == 10:
            name = digit2name.get(digit, '') + '백'
        elif i == 11:
            name = digit2name.get(digit, '') + '천'
        elif i == 12:
            name = digit2name.get(digit, '') + '조'
        elif i == 13:
            name = digit2name.get(digit, '') + '십'
        elif i == 14:
            name = digit2name.get(digit, '') + '백'
        elif i == 15:
            name = digit2name.get(digit, '') + '천'
        spelledout.append(name)
    return ''.join(elem for elem in spelledout)


def number_to_hangul(text):
    '''Reference https://github.com/Kyubyong/g2pK'''
    tokens = set(re.findall(r'(\d[\d,]*)([\uac00-\ud71f]+)', text))
    for token in tokens:
        num, classifier = token
        if classifier[:2] in _korean_classifiers or classifier[0] in _korean_classifiers:
            spelledout = hangul_number(num, sino=False)
        else:
            spelledout = hangul_number(num, sino=True)
        text = text.replace(f'{num}{classifier}', f'{spelledout}{classifier}')
    # digit by digit for remaining digits
    digits = '0123456789'
    names = '영일이삼사오육칠팔구'
    for d, n in zip(digits, names):
        text = text.replace(d, n)
    return text


def korean_to_lazy_ipa(text):
    text = latin_to_hangul(text)
    text = number_to_hangul(text)
    text=re.sub('[\uac00-\ud7af]+',lambda x:ko_pron.romanise(x.group(0),'ipa').split('] ~ [')[0],text)
    for regex, replacement in _ipa_to_lazy_ipa:
        text = re.sub(regex, replacement, text)
    return text


def korean_to_ipa(text):
    text = korean_to_lazy_ipa(text)
    return text.replace('ʧ','tʃ').replace('ʥ','dʑ')
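A quick, hypothetical sanity check of the helpers above (the function names come from the module just shown; ko_pron must be installed for the IPA conversion):

from text.korean import latin_to_hangul, number_to_hangul, korean_to_ipa

print(latin_to_hangul('tv'))       # Latin letters spelled out in hangul
print(number_to_hangul('3개'))     # number read in Korean with its classifier
print(korean_to_ipa('안녕하세요'))  # hangul romanised to (lazy) IPA via ko_pron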

 Then call it in /workspace/tts/MB-iSTFT-VITS-multilingual/text/cleaners.py

import re
from g2pk2 import G2p
from text.korean import latin_to_hangul, number_to_hangul, divide_hangul, korean_to_lazy_ipa, korean_to_ipa

def korean_cleaners(text):
    #Pipeline for Korean text
    text = latin_to_hangul(text)
    text = number_to_hangul(text)
    text = divide_hangul(text)
    text = re.sub(r'([\u3131-\u3163])$', r'\1.', text)
    return text

def korean_cleaners2(text):
    #Pipeline for Korean text
    text = latin_to_hangul(text)
    g2p = G2p()
    text = g2p(text)
    text = divide_hangul(text)
    text = re.sub(r'([\u3131-\u3163])$', r'\1.', text)
    return text

VSCode shortcut keys

Ctrl+F: find keywords

Click the arrow on the left of the search box to open the replace field

Ctrl+Alt+Enter: replace all

