[Timing] WaveNet audio generation model paper notes

Paper title: WaveNet: A Generative Model for Raw Audio
Paper download: https://arxiv.org/abs/1609.03499
Paper author: Google DeepMind
Paper year: 2016
Paper cited: 3333 (2022/4/2)

ABSTRACT

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

1 INTRODUCTION

This work explores raw audio generation techniques inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al., 2016a;b) and text (Józefowicz et al., 2016). These models treat the joint probability of pixels or words as a product of conditional distributions modelled by neural architectures, yielding state-of-the-art generation.

Notably, these architectures are capable of modeling distributions of thousands of random variables (e.g. 64×64 pixels in PixelRNN (van den Oord et al., 2016a)). The question addressed in this paper is whether a similar approach can successfully generate wideband raw audio waveforms with very high temporal resolution, at least 16,000 samples per second (see Figure 1).
[Figure 1: a one-second sample of generated speech waveform]
This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b) architecture. The main contributions of this work are as follows:

  • We show that WaveNets can generate raw speech signals with subjective naturalness, as assessed by human raters, never before reported in the field of text-to-speech (TTS).
  • To handle the long-term temporal dependencies required for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.
  • We show that a single model can be used to generate distinct voices when conditioned on speaker identity.
  • The same architecture shows strong results when tested on small speech recognition datasets, and is promising when used to generate other audio modalities such as music.

We believe that WaveNets provide a general and flexible framework for many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).

2 WAVENET

In this paper, we introduce a new generative model that operates directly on raw audio waveforms. The waveform x = {x1, . . . , xT } is decomposed into a product of conditional probabilities as follows:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t−1})    (1)
Each audio sample xt is thus conditioned on samples from all previous time steps.

Similarly to PixelCNNs, the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value xt with a softmax layer, and is optimized to maximize the log-likelihood of the data with respect to the parameters. Because log-likelihoods are tractable, we tune hyper-parameters on a validation set and can easily measure whether the model is overfitting or underfitting.
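
As a rough sketch (my own, not code from the paper): maximizing the log-likelihood of a per-timestep categorical distribution is just a cross-entropy loss between the network's logits and the quantized next samples. The shapes and the 256-way quantization below are assumptions (the quantization itself is explained in Section 2.2):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: one 256-way categorical distribution per timestep.
batch, timesteps, num_classes = 4, 16000, 256
logits = torch.randn(batch, timesteps, num_classes)          # stand-in for the network output
targets = torch.randint(0, num_classes, (batch, timesteps))  # stand-in for quantized next samples

# Maximizing the log-likelihood is equivalent to minimizing cross-entropy over all timesteps.
loss = F.cross_entropy(logits.reshape(-1, num_classes), targets.reshape(-1))
print(loss.item())
```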

2.1 DILATED CAUSAL CONVOLUTIONS

[Figure 2: visualization of a stack of causal convolutional layers]
The main ingredient of WaveNet is causal convolutions. As shown in Figure 2, causal convolutions ensure that the model does not violate the ordering in which the data is modelled: the model's prediction p(xt+1 | x1, …, xt) at timestep t cannot depend on any of the future timesteps xt+1, xt+2, …, xT. For images, the equivalent of a causal convolution is a masked convolution, which can be implemented by constructing a mask tensor and doing an element-wise multiplication of this mask with the convolution kernel before applying it. For 1-D data such as audio, the same effect can be achieved more easily by shifting the output of a normal convolution by a few timesteps.
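
A minimal sketch of one common way to implement this in PyTorch (my own, not the paper's code): left-pad the input by (kernel size − 1) × dilation so each output position only sees current and past samples, which is equivalent to shifting the output of a normal convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t depends only on inputs <= t."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # pad only on the left (the past)
        return self.conv(x)

y = CausalConv1d(1, 16, kernel_size=2)(torch.randn(1, 1, 100))
print(y.shape)  # (1, 16, 100): same length as the input
```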

At training time, the conditional predictions for all timesteps can be made in parallel, because all timesteps of the ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.
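
The generation loop can be sketched as below (illustrative only; `model` is a hypothetical network mapping a sequence of quantized samples to per-timestep logits, and no caching optimizations are shown):

```python
import torch

@torch.no_grad()
def generate(model, num_samples, num_classes=256):
    """Naive sequential sampling: each new sample is fed back as input."""
    samples = torch.zeros(1, 1, dtype=torch.long)        # seed with silence
    for _ in range(num_samples):
        logits = model(samples)                          # (1, T, num_classes)
        probs = torch.softmax(logits[:, -1], dim=-1)     # distribution for the next sample
        next_sample = torch.multinomial(probs, 1)        # draw x_{t+1}
        samples = torch.cat([samples, next_sample], dim=1)
    return samples
```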

Because models with causal convolutions have no recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers, or large filters, to increase the receptive field. For example, in Figure 2 the receptive field is only 5 (= #layers + filter length − 1). In this paper we use dilated convolutions to increase the receptive field by orders of magnitude, without greatly increasing computational cost.

A dilated convolution (also called à trous, or convolution with holes) applies its filter over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input. As a special case, a dilated convolution with dilation 1 yields the standard convolution. Figure 3 depicts dilated causal convolutions for dilations 1, 2, 4, and 8. Dilated convolutions have previously been used in various contexts, e.g. signal processing (Holschneider et al., 1989; Dutilleux, 1989) and image segmentation (Chen et al., 2015; Yu & Koltun, 2016).
[Figure 3: visualization of a stack of dilated causal convolutional layers with dilations 1, 2, 4, and 8]
Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency. In this paper, the dilation is doubled for every layer up to a limit and then repeated: e.g. 1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512.
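
A sketch of how such a dilation schedule might be built, reusing the hypothetical `CausalConv1d` from above; the channel width of 32 is an arbitrary assumption:

```python
import torch.nn as nn

def dilation_schedule(num_blocks=3, max_exponent=10):
    """Return 1, 2, 4, ..., 512 repeated num_blocks times."""
    return [2 ** i for _ in range(num_blocks) for i in range(max_exponent)]

# A stack of dilated causal convolutions (CausalConv1d is the hypothetical module above).
layers = nn.ModuleList(
    CausalConv1d(32, 32, kernel_size=2, dilation=d) for d in dilation_schedule()
)
print(dilation_schedule(num_blocks=1))  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
```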

The intuition behind this configuration is two-fold. First, exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu & Koltun, 2016). For example, each 1, 2, 4, …, 512 block has a receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution. Second, stacking these blocks further increases the model capacity and the receptive field size.
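
To make the receptive-field arithmetic concrete (my own calculation, assuming filter length 2): each layer extends the receptive field by dilation × (filter length − 1) samples, so one 1, 2, …, 512 block covers 1 + 1023 = 1024 samples and each additional block adds another 1023:

```python
def receptive_field(dilations, filter_length=2):
    # Each layer extends the receptive field by dilation * (filter_length - 1) samples.
    return 1 + sum(d * (filter_length - 1) for d in dilations)

one_block = [2 ** i for i in range(10)]       # 1, 2, 4, ..., 512
print(receptive_field(one_block))             # 1024
print(receptive_field(one_block * 3))         # 3070 for three stacked blocks
```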

2.2 SOFTMAX DISTRIBUTIONS

One approach to modelling the conditional distributions p(xt | x1, …, xt−1) over the individual audio samples would be to use a mixture model such as a mixture density network (Bishop, 1994) or a mixture of conditional Gaussian scale mixtures (MCGSM) (Theis & Bethge, 2015). However, the PixelCNN authors showed that a softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.

Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values. To make this more tractable, we first apply a µ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values:

f(x_t) = sign(x_t) · ln(1 + µ|x_t|) / ln(1 + µ),

where −1 < x_t < 1 and µ = 255. This non-linear quantization produces a significantly better reconstruction than a simple linear quantization scheme. For speech in particular, we found that the reconstructed signal after quantization sounded very similar to the original.
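
A sketch of µ-law companding and 256-level quantization following the formula above (NumPy; the exact rounding/binning scheme is my own assumption):

```python
import numpy as np

MU = 255

def mu_law_encode(x, mu=MU):
    """Compand x in (-1, 1) and quantize to 256 integer levels."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # still in (-1, 1)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)           # 0..255

def mu_law_decode(codes, mu=MU):
    """Approximate inverse: integer levels back to waveform values."""
    companded = 2 * (codes.astype(np.float64) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

x = np.sin(np.linspace(0, 2 * np.pi, 8)) * 0.9
print(mu_law_encode(x))
```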

2.3 GATED ACTIVATION UNITS

We use the same gated activation unit as the gated PixelCNN:

z = tanh(W_{f,k} * x) ⊙ σ(W_{g,k} * x),    (2)

where * denotes a convolution operator, ⊙ denotes element-wise multiplication, σ(·) is the sigmoid function, k is the layer index, f and g denote filter and gate respectively, and W is a learnable convolution filter. In our initial experiments, we observed that this non-linearity modelled audio signals significantly better than the ReLU (Nair & Hinton, 2010).
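
A sketch of this gated activation unit in PyTorch, reusing the hypothetical `CausalConv1d` from earlier; channel sizes and dilation are assumptions:

```python
import torch

class GatedActivation(torch.nn.Module):
    """z = tanh(W_f * x) ⊙ sigmoid(W_g * x), with dilated causal convolutions."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.filter_conv = CausalConv1d(channels, channels, kernel_size, dilation)
        self.gate_conv = CausalConv1d(channels, channels, kernel_size, dilation)

    def forward(self, x):
        return torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))
```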

2.4 RESIDUAL AND SKIP CONNECTIONS

[Figure 4: overview of the residual block and the overall architecture]
Both residual (He et al., 2015) and parameterised skip connections are used throughout the network, to speed up convergence and enable the training of much deeper models. Figure 4 shows a residual block of the model, which is stacked many times in the network.
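
One possible reading of the residual block in Figure 4, sketched below: a gated dilated convolution feeds two 1×1 convolutions, one for the residual path and one for the skip path. Channel sizes are assumptions, and `GatedActivation` is the hypothetical module from the previous section:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Gated dilated convolution followed by 1x1 convs for residual and skip outputs."""
    def __init__(self, residual_channels, skip_channels, dilation):
        super().__init__()
        self.gated = GatedActivation(residual_channels, kernel_size=2, dilation=dilation)
        self.residual_1x1 = nn.Conv1d(residual_channels, residual_channels, 1)
        self.skip_1x1 = nn.Conv1d(residual_channels, skip_channels, 1)

    def forward(self, x):
        z = self.gated(x)
        residual = x + self.residual_1x1(z)   # passed on to the next block
        skip = self.skip_1x1(z)               # summed over all blocks for the output head
        return residual, skip
```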

2.5 CONDITIONAL WAVENETS

Given an additional input h, WaveNets can model the conditional distribution p(x | h) of the audio given this input. Equation (1) now becomes

p(x | h) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t−1}, h).

By conditioning the model on other input variables, we can guide WaveNet to generate audio with the desired characteristics. For example, in a multi-speaker setting we can choose the speaker by providing the speaker identity as an extra input to the model. Similarly, for TTS we need to provide information about the text as an extra input.

We condition the model on other inputs in two different ways: global conditioning and local conditioning. Global conditioning is characterised by a single latent representation h that influences the output distribution across all timesteps, e.g. a speaker embedding in a TTS model. The activation function of equation (2) now becomes:

z = tanh(W_{f,k} * x + V_{f,k}ᵀ h) ⊙ σ(W_{g,k} * x + V_{g,k}ᵀ h),

where V_{∗,k} is a learnable linear projection, and the vector V_{∗,k}ᵀ h is broadcast over the time dimension.
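
A sketch of global conditioning as a per-layer bias broadcast over time (the embedding dimension and channel width are assumptions; 109 is the number of VCTK speakers mentioned in Section 3.1):

```python
import torch
import torch.nn as nn

channels, embedding_dim = 32, 16
speaker_embedding = nn.Embedding(109, embedding_dim)       # one vector per speaker
filter_proj = nn.Linear(embedding_dim, channels)           # plays the role of V_{f,k}
gate_proj = nn.Linear(embedding_dim, channels)             # plays the role of V_{g,k}

x_filter = torch.randn(1, channels, 16000)                 # stand-in for W_{f,k} * x
x_gate = torch.randn(1, channels, 16000)                   # stand-in for W_{g,k} * x
h = speaker_embedding(torch.tensor([7]))                   # speaker id 7, shape (1, embedding_dim)

# Broadcast the projected embedding across all timesteps.
z = (torch.tanh(x_filter + filter_proj(h).unsqueeze(-1)) *
     torch.sigmoid(x_gate + gate_proj(h).unsqueeze(-1)))
```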

For local conditioning we have a second time series h_t, possibly with a lower sampling frequency than the audio signal, e.g. linguistic features in a TTS model. We first transform this time series using a transposed convolutional network (learned upsampling) that maps it to a new time series y = f(h) with the same resolution as the audio signal, which is then used in the activation function:

z = tanh(W_{f,k} * x + V_{f,k} * y) ⊙ σ(W_{g,k} * x + V_{g,k} * y),

where V_{f,k} * y is now a 1×1 convolution. As an alternative to the transposed convolutional network, it is also possible to use V_{f,k} * h and repeat these values over time. We found this to work slightly worse in our experiments.
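
A sketch of local conditioning with a learned upsampling of the conditioning features (the ConvTranspose1d hyperparameters are assumptions; the point is mapping a low-rate series h to the audio rate and adding it inside the gate via 1×1 convolutions):

```python
import torch
import torch.nn as nn

class LocallyConditionedGate(nn.Module):
    """Gated unit with a conditioning time series y added via 1x1 convolutions."""
    def __init__(self, channels, cond_channels, upsample_factor):
        super().__init__()
        # Learned upsampling of the conditioning features to the audio sample rate.
        self.upsample = nn.ConvTranspose1d(cond_channels, cond_channels,
                                           kernel_size=upsample_factor,
                                           stride=upsample_factor)
        self.filter_conv = CausalConv1d(channels, channels, 2)
        self.gate_conv = CausalConv1d(channels, channels, 2)
        self.filter_cond = nn.Conv1d(cond_channels, channels, 1)
        self.gate_cond = nn.Conv1d(cond_channels, channels, 1)

    def forward(self, x, h):                      # x: audio features, h: low-rate conditioning
        y = self.upsample(h)[..., :x.size(-1)]    # y = f(h), trimmed to the audio length
        return (torch.tanh(self.filter_conv(x) + self.filter_cond(y)) *
                torch.sigmoid(self.gate_conv(x) + self.gate_cond(y)))
```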

2.6 CONTEXT STACKS

We have mentioned several different ways to increase the receptive field size of a WaveNet: increasing the number of dilation stages, using more layers, larger filters, larger dilation factors, or combinations thereof. A complementary approach is to use a separate, smaller context stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal (cropped at the end). Multiple context stacks with different lengths and numbers of hidden units can be used. Stacks with larger receptive fields have fewer units per layer. Context stacks can also have pooling layers that run at a lower frequency. This keeps the computational requirements at a reasonable level and is consistent with the intuition that less capacity is required to model temporal correlations at longer timescales.

3 EXPERIMENTS

To measure WaveNet's audio modeling performance, we evaluate it on three different tasks: multi-speaker speech generation (not conditioned on text), TTS, and musical audio modeling. The following website provides samples drawn from WaveNet for these experiments: https://www.deepmind.com/blog/wavenet-generation-model-raw-audio/.

3.1 MULTI-SPEAKER SPEECH GENERATION

For the first experiment, we investigated free-form speech generation (not conditioned on text). We use the English multi-speaker corpus from the CSTR Voice Cloning Toolkit (VCTK) (Yamagishi, 2012) and condition WaveNet only on the speaker. Conditioning is done by feeding the speaker ID to the model in the form of a one-hot vector. The dataset contains 44 hours of data from 109 different speakers.

Since the model is not conditioned on text, it generates non-existent but human-language-like words in a smooth way, with realistic-sounding intonation. This is similar to generative models of language or images, where samples look realistic at first glance but lack long-range coherence on closer inspection. Part of the reason for this lack of long-range coherence is the model's limited receptive field size (about 300 ms), which means it can only remember the last 2-3 phonemes it has produced.

A single WaveNet is capable of modeling speech from any speaker by conditioning on the speaker's one-hot encoding. This confirms that it is powerful enough to capture the features of all 109 speakers from the dataset in a single model. We observe that adding speakers leads to better validation set performance compared to training on only a single speaker . This suggests that WaveNet's internal representation is shared among multiple speakers .

Finally, we observed that the model also picks up on characteristics in the audio other than the speech itself. For instance, it also mimics the acoustics and recording quality, as well as the speakers' breathing and mouth movements.

3.2 TEXT-TO-SPEECH

For the second experiment, we investigated TTS. We used the same single-speaker speech databases that Google's North American English and Mandarin Chinese TTS systems are built on. The North American English dataset contains 24.6 hours of speech data and the Mandarin dataset contains 34.8 hours; both were spoken by professional female speakers.

WaveNets for the TTS task were locally conditioned on linguistic features derived from the input text. In addition to the linguistic features, we also trained WaveNets conditioned on the logarithmic fundamental frequency (log F0) values. External models predicting log F0 values and phone durations from linguistic features were also trained for each language. The WaveNets had a receptive field size of 240 ms. As example-based and model-based speech synthesis baselines, hidden Markov model (HMM)-driven unit selection concatenative (Gonzalvo et al., 2016) and long short-term memory recurrent neural network (LSTM-RNN)-based statistical parametric (Zen et al., 2016) speech synthesizers were built. Since the baselines and WaveNets are trained on the same datasets and linguistic features, these speech synthesizers can be compared fairly.

To evaluate the performance of WaveNets on the TTS task, we conducted subjective paired comparison tests and mean opinion score (MOS) tests. In the paired comparison test, after listening to each pair of samples, subjects were asked to choose the one they preferred, though they could choose "neutral" if they had no preference. In the MOS test, after listening to each stimulus, subjects were asked to rate its naturalness on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent). Please refer to Appendix B for details.

Figure 5 shows a selection of the subjective paired comparison test results (see Appendix B for the complete results). As can be seen, WaveNet outperforms both the baseline statistical parametric and the concatenative speech synthesizers in both languages. We found that WaveNet conditioned on linguistic features alone could synthesize speech samples with natural segmental quality, but sometimes with unnatural prosody caused by stressing the wrong words in a sentence. This may be due to the long-term dependencies in F0 contours: WaveNet's receptive field of 240 ms is not long enough to capture them. WaveNet conditioned on both linguistic features and F0 values did not have this problem: the external F0 prediction model runs at a lower frequency (200 Hz), so it can learn the long-range dependencies that exist in F0 contours.
[Figure 5: subjective preference scores from the paired comparison tests]
Table 1 shows the MOS test results. As can be seen from the table, WaveNets achieved 5-scale MOSs in naturalness above 4.0, significantly outperforming the baseline systems. These are the highest MOS values reported with these training datasets and test sentences. The gap in MOS from the best synthetic speech to natural speech decreased from 0.69 to 0.34 (51%) for US English and from 0.42 to 0.13 (69%) for Mandarin Chinese.
[Table 1: subjective 5-scale MOS results in naturalness]

3.3 MUSIC

In the third set of experiments, we train WaveNets to model two music datasets:

  • The MagnaTagATune dataset (Law & Von Ahn, 2009), which contains about 200 hours of music audio. Each 29-second clip is annotated with tags from a set of 188, describing the music's genre, instrumentation, tempo, volume, and mood.
  • YouTube Piano Dataset, which contains approximately 60 hours of solo piano music obtained from YouTube videos. Because it's limited to a single instrument, it's much easier to model.

Although it is hard to evaluate these models quantitatively, it is possible to evaluate them subjectively by listening to the samples they produce. We found that enlarging the receptive field was crucial to obtaining samples that sounded musical. Even with a receptive field of several seconds, the models did not enforce long-range consistency, which resulted in second-to-second variations in genre, instrumentation, volume, and sound quality. Nevertheless, the samples were often harmonic and aesthetically pleasing, even when produced by unconditional models.

Of particular interest are conditional music models, which can generate music given a set of tags specifying, e.g., genre or instrument. Similarly to the conditional speech models, we insert biases that depend on a binary vector representation of the tags associated with each training clip. This makes it possible to control various aspects of the model's output at sampling time, by feeding in a binary vector that encodes the desired properties of the samples. We trained such a model on the MagnaTagATune dataset; although the tag data bundled with the dataset was relatively noisy and had many omissions, after cleaning it up by merging similar tags and removing those with too few associated clips we found this approach to work reasonably well.

3.4 SPEECH RECOGNITION

Although WaveNet was designed as a generative model, it can be directly applied to discriminative audio tasks such as speech recognition.

Traditionally, speech recognition research has largely focused on using log mel-filterbank energies or mel-frequency cepstral coefficients (MFCCs), but has recently been shifting towards raw audio (Palaz et al., 2013; Tüske et al., 2014; Hoshen et al., 2014). Recurrent neural networks such as LSTM-RNNs have become key components of these new speech classification pipelines because they allow long-range context to be modelled. With WaveNets we have shown that layers of dilated convolutions allow the receptive field to grow longer in a much more efficient manner than with LSTM units.

As a final experiment, we applied WaveNets to speech recognition on the TIMIT (Garofolo et al., 1993) dataset. For this task we added a mean-pooling layer after the dilated convolutions, which aggregates the activations into coarser frames spanning 10 ms (160× downsampling). The pooling layer is followed by a few non-causal convolutions. We trained WaveNet with two loss terms, one to predict the next sample and one to classify the frame; the model generalized better than with a single loss and achieved 18.8 PER on the test set, which is to our knowledge the best score obtained from a model trained directly on raw audio on TIMIT.
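
A rough sketch of such a discriminative head (my own reading of the description above; the layer sizes, the 61 TIMIT phone classes, and the exact non-causal convolutions are assumptions):

```python
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Mean-pool WaveNet activations to 10 ms frames, then classify each frame."""
    def __init__(self, channels, num_phone_classes=61, downsample=160):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=downsample)        # 16 kHz -> 100 frames/s
        self.head = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),  # non-causal
            nn.ReLU(),
            nn.Conv1d(channels, num_phone_classes, kernel_size=1),
        )

    def forward(self, activations):               # (batch, channels, samples)
        return self.head(self.pool(activations))  # (batch, classes, frames)
```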


4 CONCLUSION

This paper introduces WaveNet, a deep generative model of audio data that operates directly at the waveform level. WaveNets are autoregressive and combine causal filters with dilated convolutions so that their receptive fields grow exponentially with depth, which is important for modelling the long-range temporal dependencies in audio signals. We have shown how WaveNets can be conditioned on other inputs, either globally (e.g. speaker identity) or locally (e.g. linguistic features). When applied to TTS, WaveNets produce samples that outperform current state-of-the-art TTS systems in subjective naturalness. Finally, WaveNets showed very promising results when applied to music audio modelling and speech recognition.
