SLT2021: IMPROVED PARALLEL WAVEGAN VOCODER WITH PERCEPTUALLY WEIGHTED SPECTROGRAM LOSS

0. Title

IMPROVED PARALLEL WAVEGAN VOCODER WITH PERCEPTUALLY WEIGHTED SPECTROGRAM LOSS

1. Summary

This paper proposes a spectral-domain perceptual weighting technique for Parallel WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN vocoder successfully generates waveform sequences with a fast, non-autoregressive WaveNet model. By employing multi-resolution short-time Fourier transform (MR-STFT) criteria together with a generative adversarial network, a lightweight convolutional network can be trained effectively without any knowledge distillation process. To further improve vocoding accuracy, we propose applying frequency-dependent weighting to the MR-STFT loss function. The proposed method penalizes perceptually sensitive errors in the frequency domain and optimizes the model to reduce auditory noise in the synthesized speech. Subjective listening test results show that the proposed method achieves 4.21 and 4.26 TTS mean opinion scores for Korean male and female speakers, respectively.

Keywords: Text-to-speech, speech synthesis, neural vocoder, Parallel WaveGAN

2. Introduction

Generative models of raw speech waveforms have significantly improved the quality of neural text-to-speech (TTS) systems [1, 2]. Specifically, autoregressive generative models such as WaveNet have successfully replaced traditional parametric vocoders [2-5]. Non-autoregressive variants, including Parallel WaveNet, provide fast waveform generation based on a teacher-student framework [6, 7]. In this approach, the model is trained by probability density distillation, in which the knowledge of an autoregressive teacher WaveNet is transferred to an inverse autoregressive flow student model [8].

 

In our previous work, we introduced a generative adversarial network (GAN) training method into the parallel WaveNet framework [9], and proposed Parallel WaveGAN by combining adversarial training with the multi-resolution short-time Fourier transform (MR-STFT) criterion [10, 11]. Although a GAN-based non-autoregressive model can be trained with the adversarial loss alone, the MR-STFT loss function has been shown to improve training efficiency [10, 13, 14]. Moreover, because Parallel WaveGAN trains only the WaveNet model without any density distillation, the entire training process is much simpler than in conventional methods, and the model can generate natural speech waveforms with only a small number of parameters.
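To make the MR-STFT criterion concrete, here is a minimal PyTorch-style sketch of a spectral-convergence plus log STFT magnitude loss, averaged over several analysis resolutions, in the spirit of [9, 16]. The FFT, hop, and window sizes are illustrative assumptions, not necessarily the settings used in the paper.

# Minimal MR-STFT auxiliary loss sketch (assumed configuration).
import torch
import torch.nn.functional as F

def stft_magnitude(x, fft_size, hop_size, win_length):
    """Magnitude spectrogram |STFT(x)| for a batch of waveforms x of shape (B, T)."""
    window = torch.hann_window(win_length, device=x.device)
    spec = torch.stft(x, n_fft=fft_size, hop_length=hop_size,
                      win_length=win_length, window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)  # keep the log() below well defined

def single_resolution_loss(x, x_hat, fft_size, hop_size, win_length):
    """Spectral convergence + log STFT magnitude loss at one resolution."""
    mag = stft_magnitude(x, fft_size, hop_size, win_length)
    mag_hat = stft_magnitude(x_hat, fft_size, hop_size, win_length)
    sc = torch.norm(mag - mag_hat, p="fro") / torch.norm(mag, p="fro")
    log_mag = F.l1_loss(torch.log(mag_hat), torch.log(mag))
    return sc + log_mag

def mr_stft_loss(x, x_hat,
                 resolutions=((1024, 256, 1024), (2048, 512, 2048), (512, 128, 512))):
    """Average the single-resolution losses over multiple STFT configurations."""
    losses = [single_resolution_loss(x, x_hat, *r) for r in resolutions]
    return sum(losses) / len(losses)

In Parallel WaveGAN, this auxiliary loss is combined with the adversarial loss when updating the generator.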

 

To further improve the performance of Parallel WaveGAN, this paper proposes a spectral-domain perceptual weighting method for optimizing the MR-STFT criterion. A frequency-dependent masking filter is designed to penalize errors near the spectral valleys, to which human hearing is perceptually sensitive [15]. By applying this filter when computing the STFT loss during training, the network is guided to reduce the noise components in those regions. As a result, the proposed model produces more natural speech than the original Parallel WaveGAN.

 

Our contributions can be summarized as follows:

  • We propose a perceptually weighted MR-STFT loss function combined with conventional adversarial training. This method improves the quality of synthesized speech in a Parallel WaveGAN-based neural TTS system.
  • Because the proposed method does not change the network architecture, it retains the small parameter count of the original Parallel WaveGAN and its fast inference speed. In particular, with 1.83 M parameters, the system generates 24 kHz speech waveforms 50.57 times faster than real time on a single GPU.
  • Our method achieves mean opinion score (MOS) results of 4.21 and 4.26 for Korean male and female speakers in the neural TTS system.

3. Others (easy to understand)

The idea of using an STFT-based loss function is not new. In their work on spectrogram inversion, Sercan et al. [16] first proposed the spectral convergence and log-scale STFT magnitude losses, and our previous work proposed combining them in a multi-resolution form [9]. In addition, a perceptual noise shaping filter significantly improves the quality of synthesized speech in the autoregressive WaveNet framework [17]. Based on the characteristics of the human auditory system, an external noise shaping filter is designed to reduce perceptually sensitive noise in the spectral valley regions. This filter acts as a pre-processor in the training step, so WaveNet learns the distribution of the noise-shaped residual signal. In the synthesis step, the enhanced speech is reconstructed by applying the inverse filter to the output of WaveNet.
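Below is a small NumPy/SciPy sketch of this pre-filter / inverse-filter idea under one plausible reading of the description: the training-side pre-filter is the LP analysis (whitening) filter A(z) = 1 - sum_k a_k z^{-k} built from corpus-averaged LP coefficients, and the synthesis-side inverse filter 1/A(z) restores the spectral envelope, thereby shaping the model's noise. The coefficient values are toy placeholders, not the filter used in [17].

# Noise-shaping pre-filter and its inverse (sketch with placeholder coefficients).
import numpy as np
from scipy.signal import lfilter

def shape(speech, lp_coefs):
    """Pre-processor for training: convert speech into a noise-shaped residual."""
    a = np.concatenate(([1.0], -np.asarray(lp_coefs, dtype=float)))
    return lfilter(a, [1.0], speech)

def unshape(residual, lp_coefs):
    """Post-processor for synthesis: apply the inverse filter 1/A(z) to the model output."""
    a = np.concatenate(([1.0], -np.asarray(lp_coefs, dtype=float)))
    return lfilter([1.0], a, residual)

# Round-trip sanity check with a toy 2nd-order filter (inverse-filter poles at 0.5 and 0.4, stable).
lp_coefs = [0.9, -0.2]
x = np.random.randn(16000)
np.testing.assert_allclose(unshape(shape(x, lp_coefs), lp_coefs), x, atol=1e-6)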

 

However, the effectiveness of this filter does not carry over to non-autoregressive generative models, including WaveGlow [18] and Parallel WaveGAN. One possible reason is that, without information from previous time steps, it is difficult for a non-autoregressive model to capture the characteristics of the noise-shaped residual signal. To address this problem, the proposed system applies a frequency-dependent mask when computing the STFT loss function. Because this method does not change the distribution of the target speech, it can stably optimize the non-autoregressive WaveNet while significantly reducing the audible noise components.
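Relative to the spectral-convergence term sketched earlier, such a mask would enter as a fixed element-wise weight on the magnitude spectrograms. A minimal sketch, assuming a precomputed weighting matrix W (frequency x time, or frequency x 1 since it is time-invariant); how W is built from LP coefficients is described in the next section:

# Weighted spectral convergence (sketch); W is an assumed precomputed weighting matrix.
import torch

def weighted_spectral_convergence(mag, mag_hat, W):
    """mag, mag_hat: (B, F, T) magnitude spectrograms; W broadcasts over the batch."""
    return torch.norm(W * (mag - mag_hat), p="fro") / torch.norm(W * mag, p="fro")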

 

Fig. 1. Magnitude distances (MDs) obtained when computing spectral convergence: (a) the weighting matrix of the spectral mask, (b) MD before applying the mask (conventional method), and (c) MD after applying the mask (proposed method).

4. Others (not easy to understand)

Here, w_{t,f} denotes the weighting coefficient of the spectral mask. The weighting matrix W is constructed by repeating a time-invariant frequency masking filter along the time axis, whose transfer function is defined as follows:

where α̃_k denotes the k-th linear prediction (LP) coefficient of order p, obtained by averaging over all spectra extracted from the training data. As shown in Figure 1a, the weighting matrix of the spectral mask is designed to represent the overall characteristics of the spectral formant structure. This makes it possible to focus on losses in the spectral valley regions, which are more perceptible to the human ear. When the STFT loss is computed (Figure 1b), the filter is used to penalize the losses in those regions (Figure 1c). As a result, the training process guides the model to further reduce the perceptual noise in the synthesized speech.
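The two equations referenced above (the weighted loss and the masking filter's transfer function) did not survive extraction into this post. As a hedged reconstruction from the surrounding description only, and not copied from the paper, they could plausibly take the following form; the paper's exact definition may include additional normalization or an exponent on the filter response:

\[
\mathcal{L}_{\mathrm{sc}}(x,\hat{x}) =
\frac{\bigl\| W \odot \bigl( |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \bigr) \bigr\|_F}
     {\bigl\| W \odot |\mathrm{STFT}(x)| \bigr\|_F},
\qquad W = [\, w_{t,f} \,],
\]
\[
M(z) = 1 - \sum_{k=1}^{p} \tilde{\alpha}_k \, z^{-k},
\qquad
w_{t,f} = \bigl| M\!\bigl( e^{\, j 2\pi f / N} \bigr) \bigr|,
\]

where x and x̂ are the natural and generated waveforms, N is the FFT size, and ⊙ denotes element-wise multiplication (symbols introduced here for the sketch). Because |M(e^{jω})| approximates the inverse of the average spectral envelope, it is large in the valley regions and small near the formant peaks, which is consistent with the focusing behavior described above.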

Origin blog.csdn.net/u013625492/article/details/112971796