2023_LIGHTWEIGHT AND HIGH-FIDELITY END-TO-END TEXT-TO-SPEECH WITH MULTI-BAND GENERATION AND INVERSE SHORT-TIME FOURIER TRANSFORM
Table of contents
1. Detailed explanation of the paper
【PS2】AttributeError: 'HParams' object has no attribute 'seed'
【PS3】EOFError: Ran out of input
【PS4】The data does not generate the corresponding spec.pt file
【PS5】 TypeError: __init__() takes 1 positional argument but 2 were given
Introduction to multilingual processing methods
Method 1: Korean to phoneme (G2p)
1. Detailed explanation of the paper
1.1. Introduction
The paper first reviews the conventional two-stage approach to speech synthesis (an acoustic model followed by a vocoder). Because VITS is a high-quality end-to-end model, the model proposed in the paper is a lightweight end-to-end model based on VITS. The main change is in the decoder, that is, the part that converts the latent acoustic features into a waveform: part of the decoder is replaced with a simple inverse short-time Fourier transform (iSTFT), which efficiently converts from the frequency domain to the time domain. To further speed up inference, multi-band generation is used: in the proposed method, each iSTFT part generates a sub-band signal. During inference, the model is 4.1 times faster than the original VITS.
1.2.VITS algorithm
1.2.1. The principles of VITS are only briefly mentioned here. For more information, please see [ VITS paper summary and project reproduction ]
1.2.2. Inference speed of each model
The real-time factor (RTF) is used here as an objective measure: the time taken to synthesize the speech divided by the duration of the synthesized speech.
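As a small illustration of this definition, RTF can be computed from the wall-clock synthesis time and the number of generated samples. The function name and the example numbers below are illustrative assumptions, not measurements from the paper.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sampling_rate: int) -> float:
    """RTF = time taken to synthesize / duration of the synthesized audio."""
    if num_samples <= 0 or sampling_rate <= 0:
        raise ValueError("num_samples and sampling_rate must be positive")
    audio_seconds = num_samples / sampling_rate
    return synthesis_seconds / audio_seconds


# Hypothetical example: 0.5 s of compute for 5 s of 22050 Hz audio.
rtf = real_time_factor(0.5, 5 * 22050, 22050)
print(f"RTF = {rtf:.3f}")  # values below 1.0 mean faster than real time
```

A lower RTF is better; an RTF of 1.0 means synthesis takes exactly as long as the audio it produces.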
1.3.Proposed method
1.3.1. In the output layer, part of the HiFi-GAN-based decoder of the original VITS is replaced with a simple inverse short-time Fourier transform (iSTFT)
1.3.2. Multi-band iSTFT VITS automatic encoding
The picture below shows the overall framework of the model.
1.3.3. Multi-stream iSTFT VITS
Unlike MB-iSTFT-VITS, the waveform decomposition in MS-iSTFT-VITS is completely trainable, so the waveform is no longer constrained by fixed sub-band signals.
1.4.Experiments
Dataset: LJ Speech dataset
Five VITS-based models were compared:
- VITS : The official VITS implementation; its hyperparameters are the same as in the original VITS.
- Nix-TTS : The pre-trained Nix-TTS model. The model used is the optimized ONNX version [27]. Note that the dataset used in the experiments is exactly the same as that used by Nix-TTS.
- iSTFT-VITS : A model that incorporates iSTFTNet into the VITS decoder. The iSTFTNet architecture is V1-C8C8I, the best-balanced model described in [14]. It contains two residual blocks with upsampling ratios [8, 8]. The FFT size, hop length and window length are set to 16, 4 and 16, respectively.
- MB-iSTFT-VITS : A proposed model, introduced in Section 3.2 of the paper. The number of sub-bands N is set to 4. The upsampling ratios of the two residual blocks are [4, 4], to match the resolution of each sub-band signal decomposed by the pseudo-QMF analysis filter. The FFT size, hop length and window length of the iSTFT part are the same as those of iSTFT-VITS.
- MS-iSTFT-VITS : Another proposed model, introduced in Section 3.3 of the paper. Following [10], the kernel size of the convolution-based trainable synthesis filter is set to 63, without bias. Other conditions are the same as for MB-iSTFT-VITS.
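The settings in the list above can be sanity-checked with a little arithmetic: every decoder variant has to expand one latent frame into the same number of waveform samples. The sketch below assumes a frame hop of 256 samples (a value commonly used for LJ Speech at 22050 Hz; it is not stated in this post) and assumes that MB-iSTFT-VITS uses the same iSTFT hop length of 4 as iSTFT-VITS. The function name is hypothetical.

```python
from math import prod


def samples_per_frame(upsample_ratios, istft_hop, num_subbands=1):
    """Waveform samples produced per latent frame.

    The upsampling layers expand the frame rate, the iSTFT emits
    `istft_hop` samples per upsampled frame, and the synthesis filter
    combines `num_subbands` sub-band signals into one full-band waveform.
    """
    return prod(upsample_ratios) * istft_hop * num_subbands


ASSUMED_FRAME_HOP = 256  # assumption, not taken from the post

istft_vits = samples_per_frame([8, 8], istft_hop=4)
mb_istft_vits = samples_per_frame([4, 4], istft_hop=4, num_subbands=4)
print(istft_vits, mb_istft_vits, ASSUMED_FRAME_HOP)
```

Under these assumptions both variants produce the same number of samples per frame, which is why the multi-band model can use smaller upsampling ratios: each sub-band runs at a quarter of the full sampling rate.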
The picture below shows the comparison results
1.5.Conclusion
The paper proposes an end-to-end TTS system that enables high-speed speech synthesis on the device. The proposed method builds on the successful end-to-end model VITS, but adopts several techniques to speed up inference, such as reducing the redundant computation in the decoder through iSTFT and adopting a multi-band parallel strategy. Because the proposed model is optimized in a completely end-to-end manner, it benefits fully from joint optimization, unlike the traditional two-stage method. Experimental results show that the speech generated by the method is as natural as the speech synthesized by VITS, while the waveform is generated faster. Future research includes extending the proposed method to multi-speaker models.
2. Project implementation
2.1.Data preparation
The prepared voice data must be 22050Hz, single channel (Mono), PCM-16bit
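These requirements can be checked before training. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and note that the `wave` module only opens PCM WAV files, so other encodings raise `wave.Error`.

```python
import wave


def check_wav_format(path, rate=22050, channels=1, sample_width_bytes=2):
    """Return a list of problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()}, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"channel count is {wav.getnchannels()}, expected {channels}")
        if wav.getsampwidth() != sample_width_bytes:
            bits = wav.getsampwidth() * 8
            problems.append(f"sample width is {bits} bit, expected {sample_width_bytes * 8} bit")
    return problems
```

For example, `check_wav_format("dataset/001.wav")` returns an empty list when the file is 22050 Hz, mono, 16-bit PCM, and a list of messages otherwise.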
Single speaker
dataset/001.wav|您好
dataset/001.wav|안녕하세요
dataset/001.wav|こんにちは。
Multiple speakers
dataset/001.wav|0|こんにちは。
The number in the middle is the speaker ID, starting from 0
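The two filelist layouts above can be parsed and validated with a few lines of Python. This is a sketch only; the function name is hypothetical, and it assumes the text itself never contains the `|` separator.

```python
def parse_filelist_line(line, multi_speaker=False):
    """Split one filelist line into (wav_path, speaker_id, text).

    speaker_id is None for single-speaker filelists.
    """
    fields = line.rstrip("\n").split("|")
    expected = 3 if multi_speaker else 2
    if len(fields) != expected:
        raise ValueError(
            f"expected {expected} '|'-separated fields, got {len(fields)}: {line!r}"
        )
    if multi_speaker:
        path, speaker, text = fields
        return path, int(speaker), text
    path, text = fields
    return path, None, text


print(parse_filelist_line("dataset/001.wav|안녕하세요"))
print(parse_filelist_line("dataset/001.wav|0|こんにちは。", multi_speaker=True))
```

Running such a check over the whole filelist before preprocessing catches lines with a missing speaker ID or a stray separator early.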
# Cython-version Monotonic Alignment Search
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
The paper provides several model structures. Because MB-iSTFT-VITS gives the best results in the paper, this post uses that model.
Create a new config.json file for your own dataset
Change the paths at training_files and validation_files
Revise:
"training_files":"./filelists/history_train_filelist.txt.cleaned",
"validation_files":"./filelists/history_train_filelist.txt.cleaned",
"text_cleaners":["cjke_cleaners2"], # custom multilingual cleaner function
# "text_cleaners":["korean_cleaners"], # cleaner for the training language
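The same change can be made programmatically. The sketch below assumes that the three keys live under a "data" section of the configuration, as in the standard VITS configs; the function name is hypothetical, and the validation path uses the history_val_filelist.txt name from the preprocessing step rather than the training list. Note that JSON itself does not allow `#` comments, so the annotations shown above must not be copied into the file.

```python
import copy
import json


def with_dataset_paths(config, train_list, val_list, cleaners):
    """Return a copy of `config` with the dataset fields replaced."""
    updated = copy.deepcopy(config)
    data = updated.setdefault("data", {})  # assumed section name
    data["training_files"] = train_list
    data["validation_files"] = val_list
    data["text_cleaners"] = list(cleaners)
    return updated


# Stand-in for a configuration loaded with json.load(...)
base = {"data": {"training_files": "", "validation_files": "", "text_cleaners": []}}
mine = with_dataset_paths(
    base,
    "./filelists/history_train_filelist.txt.cleaned",
    "./filelists/history_val_filelist.txt.cleaned",
    ["cjke_cleaners2"],
)
print(json.dumps(mine, indent=2))
```

The result can be written back with `json.dump(mine, f, indent=2)` to create the new configuration file.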
2.2.Data preprocessing
# Single speaker
# python preprocess.py --text_index 1 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'japanese_cleaners'
# Example used in this post
# Method 1: acoustic model (not run yet)
python preprocess.py --text_index 1 --filelists ./filelists/history_train_filelist.txt ./filelists/history_val_filelist.txt --text_cleaners 'korean_cleaners'
# Method 2: G2p model
python preprocess.py --text_index 1 --filelists ./filelists/history_train_filelist.txt ./filelists/history_val_filelist.txt --text_cleaners 'cjke_cleaners2'
# Multiple speakers
python preprocess.py --text_index 2 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'japanese_cleaners'
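Running the command generates the cleaned filelists (files with the .cleaned suffix, as referenced in the configuration above).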
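Conceptually, the preprocessing step rewrites one column of each filelist line with the chosen cleaner, which is why --text_index is 1 for single-speaker lists and 2 for multi-speaker lists. The following sketch illustrates the idea; it is not the repository's preprocess.py, and `demo_cleaner` is a hypothetical stand-in for a real cleaner such as korean_cleaners.

```python
def clean_filelist_lines(lines, text_index, cleaner):
    """Apply `cleaner` to the text column of every '|'-separated line."""
    cleaned = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        fields[text_index] = cleaner(fields[text_index])
        cleaned.append("|".join(fields))
    return cleaned


def demo_cleaner(text):
    # Hypothetical stand-in: trims whitespace and lowercases the text.
    return text.strip().lower()


single = clean_filelist_lines(["dataset/001.wav| Hello World "], 1, demo_cleaner)
multi = clean_filelist_lines(["dataset/001.wav|0| Hello World "], 2, demo_cleaner)
print(single)
print(multi)
```

Using the wrong index on a multi-speaker list would clean the speaker ID column instead of the text, so it is worth inspecting the first lines of the generated file.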
2.3. Text processing
Text is processed slightly differently for Chinese, English, Korean and Japanese.
The differences include the text conversion method, text normalization and symbol handling.
2.4.Training
# Single speaker
#python train_latest.py -c <config> -m <folder>
python train_latest.py -c configs/myvoice.json -m myvoice_model  # myvoice_model is the output folder name; choose any name
# Multiple speakers
#python train_latest_ms.py -c <config> -m <folder>
After training, the checkpoints are stored in the logs folder.
2.5. Inference
import warnings
warnings.filterwarnings(action='ignore')
import os
import time
import argparse
import torch
import commons
import utils
from models import SynthesizerTrn
from text.symbols import symbols
from text import cleaned_text_to_sequence
# Japanese: from g2p import pyopenjtalk_g2p_prosody
# Korean
from g2pk2 import G2p
import soundcard as sc
import soundfile as sf


def str2bool(value):
    # argparse passes command-line values as strings, so convert them explicitly
    if isinstance(value, bool):
        return value
    return value.lower() in ("true", "1", "yes", "y")


def get_text(text, hps):
    text_norm = cleaned_text_to_sequence(text)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


def inference(args):
    config_path = args.config
    G_model_path = args.model_path

    # check device
    if torch.cuda.is_available():
        print("Enter the device number to use.")
        key = input("GPU:0, CPU:1 ===> ")
        # fall back to the CPU unless the GPU is explicitly selected
        device = "cuda:0" if key == "0" else "cpu"
        print(f"Device : {device}")
    else:
        print("CUDA is not available. Device : cpu")
        device = "cpu"

    # load config.json
    hps = utils.get_hparams_from_file(config_path)

    # load checkpoint
    net_g = SynthesizerTrn(
        len(symbols),
        hps.data.filter_length // 2 + 1,
        hps.train.segment_size // hps.data.hop_length,
        **hps.model).to(device)
    _ = net_g.eval()
    _ = utils.load_checkpoint(G_model_path, net_g, None)

    # G2p is a class: create one instance and call that instance with the text.
    # Calling G2p(text) directly raises the TypeError listed under [PS5].
    g2p = G2p()

    # play audio by system default
    speaker = sc.get_speaker(sc.default_speaker().name)

    # parameter settings
    noise_scale = torch.tensor(0.66)    # adjust z_p noise
    noise_scale_w = torch.tensor(0.8)   # adjust SDP noise
    length_scale = torch.tensor(1.0)    # adjust sound length scale (talk speed)

    if args.is_save:
        n_save = 0
        save_dir = os.path.join("./infer_logs/")
        os.makedirs(save_dir, exist_ok=True)

    ### Dummy Input (warm-up run) ###
    with torch.inference_mode():
        # Japanese: stn_phn = pyopenjtalk_g2p_prosody("速度計測のためのダミーインプットです。")
        stn_phn = g2p("소프트웨어 교육의 중요성이 날로 더해가는데 학생들은 소프트웨어 관련 교육을 쉽게 지루해해요")
        stn_tst = get_text(stn_phn, hps)
        # generate audio
        x_tst = stn_tst.to(device).unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
        audio = net_g.infer(x_tst,
                            x_tst_lengths,
                            noise_scale=noise_scale,
                            noise_scale_w=noise_scale_w,
                            length_scale=length_scale)[0][0, 0].data.cpu().float().numpy()

    while True:
        # get text
        text = input("Enter text. ==> ")
        if text == "":
            print("Empty input is detected... Exit...")
            break

        # measure the execution time
        if device != "cpu":
            torch.cuda.synchronize()
        start = time.time()

        # requires_grad is False
        with torch.inference_mode():
            # Japanese: stn_phn = pyopenjtalk_g2p_prosody(text)
            # Korean
            stn_phn = g2p(text)
            stn_tst = get_text(stn_phn, hps)
            # generate audio
            x_tst = stn_tst.to(device).unsqueeze(0)
            x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
            audio = net_g.infer(x_tst,
                                x_tst_lengths,
                                noise_scale=noise_scale,
                                noise_scale_w=noise_scale_w,
                                length_scale=length_scale)[0][0, 0].data.cpu().float().numpy()

        # measure the execution time
        if device != "cpu":
            torch.cuda.synchronize()
        elapsed_time = time.time() - start
        print(f"Gen Time : {elapsed_time:.4f}")

        # play audio
        speaker.play(audio, hps.data.sampling_rate)

        # save audio
        if args.is_save:
            n_save += 1
            data = audio
            try:
                save_path = os.path.join(save_dir, str(n_save).zfill(3)+f"_{text}.wav")
                sf.write(
                    file=save_path,
                    data=data,
                    samplerate=hps.data.sampling_rate,
                    format="WAV")
            except Exception:
                # fall back to a shortened file name, e.g. when the text is too long
                save_path = os.path.join(save_dir, str(n_save).zfill(3)+f"_{text[:10]}〜.wav")
                sf.write(
                    file=save_path,
                    data=data,
                    samplerate=hps.data.sampling_rate,
                    format="WAV")
            print(f"Audio is saved at : {save_path}")
    return 0


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--config',
                        type=str,
                        required=True,
                        #default="./logs/ITA_CORPUS/config.json" ,
                        help='Path to configuration file')
    parser.add_argument('--model_path',
                        type=str,
                        required=True,
                        #default="./logs/ITA_CORPUS/G_1200.pth",
                        help='Path to checkpoint')
    parser.add_argument('--is_save',
                        type=str2bool,
                        default=True,
                        help='Whether to save output or not')
    args = parser.parse_args()
    inference(args)
If something goes wrong, please check [PS5]
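The get_text function in the script calls commons.intersperse(text_norm, 0) when add_blank is enabled. The following is a minimal sketch of what such a helper does, assuming it behaves like the one in the original VITS code base: it places a blank token (ID 0) before, between and after the symbol IDs.

```python
def intersperse(sequence, item):
    """Insert `item` before, between and after the elements of `sequence`."""
    result = [item] * (len(sequence) * 2 + 1)
    result[1::2] = sequence  # original elements go to the odd positions
    return result


print(intersperse([5, 9, 2], 0))  # [0, 5, 0, 9, 0, 2, 0]
```

A sequence of n symbol IDs therefore becomes 2n + 1 IDs, which is the length passed to the model as x_tst_lengths.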
【PS】
【PS1】ERROR: Could not build wheels for pyopenjtalk, which is required to install pyproject.toml-based projects
The following error occurred when running pip install -r requirements.txt
Reference 1
Tried:
pip install pycocotools -i https://pypi.python.org/simple
The error still occurred.
【PS2】AttributeError: 'HParams' object has no attribute 'seed'
The configuration file is missing the seed entry.
Add seed to the configuration file and run again.
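One way to patch a loaded configuration is sketched below. It assumes that the seed belongs in the "train" section (so that it is read as hps.train.seed) and uses 1234 as an arbitrary default; both are assumptions, and the function name is hypothetical.

```python
def ensure_train_seed(config, default_seed=1234):
    """Add train.seed in place if it is missing; keep an existing value."""
    train = config.setdefault("train", {})  # assumed section name
    train.setdefault("seed", default_seed)
    return config


patched = ensure_train_seed({"train": {"batch_size": 16}})
print(patched)
```

After patching, the dictionary can be written back to the JSON file with `json.dump`.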
【PS3】EOFError: Ran out of input
The author’s default number of GPUs for training is 5
Change num_workers to 0
The error then appeared again. To fix it, change line 35 in /workspace/tts/MB-iSTFT-VITS-multilingual/text/__init__.py to
sequence = [_symbol_to_id[symbol] for symbol in cleaned_text if symbol in _symbol_to_id.keys()]
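The effect of this change can be seen on a toy example. The symbol table below is hypothetical and much smaller than the real one; the point is only that symbols missing from the table are skipped instead of raising a KeyError.

```python
symbols = ["_", ",", ".", "a", "n", "k"]  # hypothetical tiny symbol table
_symbol_to_id = {s: i for i, s in enumerate(symbols)}


def cleaned_text_to_sequence(cleaned_text):
    # Same filter as the modified line: unknown symbols are dropped.
    return [_symbol_to_id[symbol] for symbol in cleaned_text if symbol in _symbol_to_id]


print(cleaned_text_to_sequence("an?ka."))  # [3, 4, 5, 3, 2]
```

Because unknown symbols disappear silently, it can be worth logging them during preprocessing so that gaps in the symbol table are noticed.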
【PS4】The data does not generate the corresponding spec.pt file
【PS5】 TypeError: __init__() takes 1 positional argument but 2 were given
The processing libraries for different languages are different.
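One possible cause of this message, assuming the script calls a converter class such as G2p directly with the text, is that the class constructor takes no arguments and the text must instead be passed to an instance. The sketch below reproduces the error with a hypothetical stand-in class, so it does not depend on g2pk2 being installed.

```python
class FakeG2p:
    """Hypothetical stand-in for a converter whose __init__ takes no text."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return text.upper()  # placeholder for the real conversion


try:
    FakeG2p("hello")  # wrong: the text is passed to the constructor
    message = ""
except TypeError as err:
    message = str(err)
print(message)

g2p = FakeG2p()        # right: create the instance first
result = g2p("hello")  # then call the instance with the text
print(result)
```

In the inference script this corresponds to creating the converter once before the loop and calling that instance for each input.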
【PS6】Traceback (most recent call last):
File "sc_test.py", line 2, in <module>
import soundcard
File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/__init__.py", line 4, in <module>
from soundcard.pulseaudio import *
File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/pulseaudio.py", line 265, in <module>
_pulse = _PulseAudio()
File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/soundcard/pulseaudio.py", line 76, in __init__
assert self._pa_context_get_state(self.context)==_pa.PA_CONTEXT_READY
AssertionError
Introduction to multilingual processing methods
Korean
- korean to ipa
scarletcho/KoG2P: Korean grapheme-to-phone conversion in Python (github.com)
Method 1: Korean to phoneme (G2p)
Fix the data format
Modify the processing of text in /workspace/tts/MB-iSTFT-VITS-multilingual/text/__init__.py and cleaners.py
Modify the configuration file /workspace/tts/MB-iSTFT-VITS-multilingual/configs/<your dataset name>.json
import re
from text.korean import latin_to_hangul, number_to_hangul, divide_hangul, korean_to_lazy_ipa, korean_to_ipa
from g2pk2 import G2p


def cjke_cleaners2(text):
    text = re.sub(r'(.*?)',
                  lambda x: korean_to_ipa(x.group(1))+' ', text)
    text = re.sub(r'\s+$', '', text)
    text = re.sub(r'([^\.,!\?\-…~])$', r'\1.', text)
    for texts in text:
        cleaned_text = korean_to_ipa(text[4:-4])
        if re.match(r'[^\.,!\?\-…~]', text[-1]):
            text += '.'
    return text
Method 2: Acoustic Model
For Korean text processing, in addition to the approach provided by the author, an acoustic-model-based approach can be used.
First, add korean.py, which contains the processing methods for Korean text and punctuation marks. The following is the specific code.
import re
from jamo import h2j, j2hcj
import ko_pron
# This is a list of Korean classifiers preceded by pure Korean numerals.
_korean_classifiers = '군데 권 개 그루 닢 대 두 마리 모 모금 뭇 발 발짝 방 번 벌 보루 살 수 술 시 쌈 움큼 정 짝 채 척 첩 축 켤레 톨 통'
# List of (hangul, hangul divided) pairs:
_hangul_divided = [(re.compile('%s' % x[0]), x[1]) for x in [
('ㄳ', 'ㄱㅅ'),
('ㄵ', 'ㄴㅈ'),
('ㄶ', 'ㄴㅎ'),
('ㄺ', 'ㄹㄱ'),
('ㄻ', 'ㄹㅁ'),
('ㄼ', 'ㄹㅂ'),
('ㄽ', 'ㄹㅅ'),
('ㄾ', 'ㄹㅌ'),
('ㄿ', 'ㄹㅍ'),
('ㅀ', 'ㄹㅎ'),
('ㅄ', 'ㅂㅅ'),
('ㅘ', 'ㅗㅏ'),
('ㅙ', 'ㅗㅐ'),
('ㅚ', 'ㅗㅣ'),
('ㅝ', 'ㅜㅓ'),
('ㅞ', 'ㅜㅔ'),
('ㅟ', 'ㅜㅣ'),
('ㅢ', 'ㅡㅣ'),
('ㅑ', 'ㅣㅏ'),
('ㅒ', 'ㅣㅐ'),
('ㅕ', 'ㅣㅓ'),
('ㅖ', 'ㅣㅔ'),
('ㅛ', 'ㅣㅗ'),
('ㅠ', 'ㅣㅜ')
]]
# List of (Latin alphabet, hangul) pairs:
_latin_to_hangul = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [
('a', '에이'),
('b', '비'),
('c', '시'),
('d', '디'),
('e', '이'),
('f', '에프'),
('g', '지'),
('h', '에이치'),
('i', '아이'),
('j', '제이'),
('k', '케이'),
('l', '엘'),
('m', '엠'),
('n', '엔'),
('o', '오'),
('p', '피'),
('q', '큐'),
('r', '아르'),
('s', '에스'),
('t', '티'),
('u', '유'),
('v', '브이'),
('w', '더블유'),
('x', '엑스'),
('y', '와이'),
('z', '제트')
]]
# List of (ipa, lazy ipa) pairs:
_ipa_to_lazy_ipa = [(re.compile('%s' % x[0], re.IGNORECASE), x[1]) for x in [
('t͡ɕ','ʧ'),
('d͡ʑ','ʥ'),
('ɲ','n^'),
('ɕ','ʃ'),
('ʷ','w'),
('ɭ','l`'),
('ʎ','ɾ'),
('ɣ','ŋ'),
('ɰ','ɯ'),
('ʝ','j'),
('ʌ','ə'),
('ɡ','g'),
('\u031a','#'),
('\u0348','='),
('\u031e',''),
('\u0320',''),
('\u0339','')
]]
def latin_to_hangul(text):
    for regex, replacement in _latin_to_hangul:
        text = re.sub(regex, replacement, text)
    return text


def divide_hangul(text):
    text = j2hcj(h2j(text))
    for regex, replacement in _hangul_divided:
        text = re.sub(regex, replacement, text)
    return text


def hangul_number(num, sino=True):
    '''Reference https://github.com/Kyubyong/g2pK'''
    num = re.sub(',', '', num)

    if num == '0':
        return '영'
    if not sino and num == '20':
        return '스무'

    digits = '123456789'
    names = '일이삼사오육칠팔구'
    digit2name = {d: n for d, n in zip(digits, names)}

    modifiers = '한 두 세 네 다섯 여섯 일곱 여덟 아홉'
    decimals = '열 스물 서른 마흔 쉰 예순 일흔 여든 아흔'
    digit2mod = {d: mod for d, mod in zip(digits, modifiers.split())}
    digit2dec = {d: dec for d, dec in zip(digits, decimals.split())}

    spelledout = []
    for i, digit in enumerate(num):
        i = len(num) - i - 1
        if sino:
            if i == 0:
                name = digit2name.get(digit, '')
            elif i == 1:
                name = digit2name.get(digit, '') + '십'
                name = name.replace('일십', '십')
        else:
            if i == 0:
                name = digit2mod.get(digit, '')
            elif i == 1:
                name = digit2dec.get(digit, '')
        if digit == '0':
            if i % 4 == 0:
                last_three = spelledout[-min(3, len(spelledout)):]
                if ''.join(last_three) == '':
                    spelledout.append('')
                    continue
            else:
                spelledout.append('')
                continue
        if i == 2:
            name = digit2name.get(digit, '') + '백'
            name = name.replace('일백', '백')
        elif i == 3:
            name = digit2name.get(digit, '') + '천'
            name = name.replace('일천', '천')
        elif i == 4:
            name = digit2name.get(digit, '') + '만'
            name = name.replace('일만', '만')
        elif i == 5:
            name = digit2name.get(digit, '') + '십'
            name = name.replace('일십', '십')
        elif i == 6:
            name = digit2name.get(digit, '') + '백'
            name = name.replace('일백', '백')
        elif i == 7:
            name = digit2name.get(digit, '') + '천'
            name = name.replace('일천', '천')
        elif i == 8:
            name = digit2name.get(digit, '') + '억'
        elif i == 9:
            name = digit2name.get(digit, '') + '십'
        elif i == 10:
            name = digit2name.get(digit, '') + '백'
        elif i == 11:
            name = digit2name.get(digit, '') + '천'
        elif i == 12:
            name = digit2name.get(digit, '') + '조'
        elif i == 13:
            name = digit2name.get(digit, '') + '십'
        elif i == 14:
            name = digit2name.get(digit, '') + '백'
        elif i == 15:
            name = digit2name.get(digit, '') + '천'
        spelledout.append(name)
    return ''.join(elem for elem in spelledout)


def number_to_hangul(text):
    '''Reference https://github.com/Kyubyong/g2pK'''
    tokens = set(re.findall(r'(\d[\d,]*)([\uac00-\ud71f]+)', text))
    for token in tokens:
        num, classifier = token
        if classifier[:2] in _korean_classifiers or classifier[0] in _korean_classifiers:
            spelledout = hangul_number(num, sino=False)
        else:
            spelledout = hangul_number(num, sino=True)
        text = text.replace(f'{num}{classifier}', f'{spelledout}{classifier}')
    # digit by digit for remaining digits
    digits = '0123456789'
    names = '영일이삼사오육칠팔구'
    for d, n in zip(digits, names):
        text = text.replace(d, n)
    return text


def korean_to_lazy_ipa(text):
    text = latin_to_hangul(text)
    text = number_to_hangul(text)
    text = re.sub('[\uac00-\ud7af]+', lambda x: ko_pron.romanise(x.group(0), 'ipa').split('] ~ [')[0], text)
    for regex, replacement in _ipa_to_lazy_ipa:
        text = re.sub(regex, replacement, text)
    return text


def korean_to_ipa(text):
    text = korean_to_lazy_ipa(text)
    return text.replace('ʧ', 'tʃ').replace('ʥ', 'dʑ')
Then call it in /workspace/tts/MB-iSTFT-VITS-multilingual/text/cleaners.py
import re
from g2pk2 import G2p
from text.korean import latin_to_hangul, number_to_hangul, divide_hangul, korean_to_lazy_ipa, korean_to_ipa


def korean_cleaners(text):
    # Pipeline for Korean text
    text = latin_to_hangul(text)
    text = number_to_hangul(text)
    text = divide_hangul(text)
    # append a period if the text ends with a jamo
    text = re.sub(r'([\u3131-\u3163])$', r'\1.', text)
    return text


def korean_cleaners2(text):
    # Pipeline for Korean text, using g2pk2 for grapheme-to-phoneme conversion
    text = latin_to_hangul(text)
    g2p = G2p()
    text = g2p(text)
    text = divide_hangul(text)
    # append a period if the text ends with a jamo
    text = re.sub(r'([\u3131-\u3163])$', r'\1.', text)
    return text
VScode shortcut keys
Ctrl+F: Find keywords
Click the arrow on the left to replace
Ctrl+alt+enter replace all