Vocoder: a detailed explanation - Speech Signal Processing Learning (10)

references:

[1] Vocoder (lecture by teaching assistant Xu Bojun), Bilibili

[2] Oord A, Dieleman S, Zen H, et al. Wavenet: A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.

[3] https://deepmind.com/blog/article/wavenet-generative-model-raw-audio

[4] Review: DilatedNet — Dilated Convolution (Semantic Segmentation) | by Sik-Ho Tsang | Towards Data Science

[5] Jin Z, Finkelstein A, Mysore G J, et al. FFTNet: A real-time speaker-dependent neural vocoder[C]//2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018: 2251-2255.

[6] Kalchbrenner N, Elsen E, Simonyan K, et al. Efficient neural audio synthesis[C]//International Conference on Machine Learning. PMLR, 2018: 2410-2419.

[7] Prenger R, Valle R, Catanzaro B. Waveglow: A flow-based generative network for speech synthesis[C]//ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019: 3617-3621.

[8] Flow-based Deep Generative Models | Lil'Log (lilianweng.github.io)

[9] NICE: basic concepts and implementation of flow models, Scientific Spaces (kexue.fm)

Table of contents

1. Introduction

Spectrogram and phase

Waveform Synthesis Methods

Why study the Vocoder separately?

2. Neural Vocoder——WaveNet

WaveNet ideas

WaveNet architecture

WaveNet Softmax Distribution

WaveNet Dilated Causal Convolution

WaveNet Residual and Skip Connections

Conditional WaveNet

WaveNet Summary

3. Neural Vocoder——FFTNet

FFTNet architecture

FFTNet Tips

FFTNet Summary

4. Neural Vocoder——WaveRNN

WaveRNN Dual Softmax Layer

WaveRNN architecture

WaveRNN acceleration tips

WaveRNN summary

5. Neural Vocoder——WaveGlow

Flow-Based Model

WaveGlow Architecture

WaveGlow Summary

6. Vocoder Conclusion


1. Introduction

Vocoder: In speech synthesis, the vocoder is the component responsible for converting acoustic features back into audible sound. "Vocoder" is short for "voice coder", an algorithm or device that analyzes and synthesizes speech signals. It receives acoustic features (for example, the spectrograms produced by upstream models such as TTS or voice conversion) and turns them into sound signals that the human ear can hear.

  • First, let's briefly describe what a vocoder does. As mentioned before, the models generally operate on spectrograms, and the vocoder converts the spectrogram into a sound signal that we can actually listen to.

Spectrogram and phase
  • To understand the vocoder, we first need to know where the spectrogram comes from. Suppose we have a sound signal x; then we apply the following process:


    STFT\{x\} = X(t, f) = A_{t,f}e^{i\theta_{t,f}}

    • STFT: "Short-Time Fourier Transform", a signal processing technique used to convert a signal from the time domain to the frequency domain. The STFT decomposes the signal into temporally localized spectral information by applying the Fourier transform over a sliding window in the time domain.

    • t: time

    • f: frequency

  • We first perform the STFT on the sound signal to obtain the function X(t, f), where A is the amplitude and θ is the phase. Each frequency at each time has its own amplitude and phase, and the amplitude A here is exactly the spectrogram.

  • In other words, compared with the sound signal, the spectrogram is missing the phase information. That seems strange: why did we discard the phase when generating spectrograms earlier? If we plot the phase, as shown in the figure, it looks almost like pure noise. In other words, generating this phase, or recovering it from the spectrogram, is very difficult.

  • Then the question arises: is phase really that important? Could we just assign random phases and still get a decent sound? In the video course, the sound becomes very strange after the phase is randomized, so phase does matter. However, generating phase is still too difficult.

Waveform Synthesis Methods
  • If we do not generate the phase, how can we synthesize a sound signal that humans can hear, i.e., a waveform? The approach is to take the acoustic features generated by the model and convert them into waveforms directly through the vocoder.

  • Traditional Methods:

    • Heuristic method: the Griffin-Lim algorithm

      The Griffin-Lim algorithm is a heuristic method for estimating the phase of an audio signal so that the time-domain waveform can be reconstructed from the amplitude spectrum alone. It is particularly suitable for sound synthesis and audio reconstruction scenarios. The basic idea is to start from the known amplitude spectrum with a randomly initialized phase, apply the inverse STFT to get a time-domain waveform, apply the STFT again, keep the new phase estimate while restoring the known amplitude, and repeat this process for many iterations, hoping to eventually obtain a reasonable phase estimate (see the sketch after this list). However, the speech it generates still sounds unnatural.

  • Neural Vocoder:

    • Generative neural networks

      This is the classic "when in doubt, use a neural network" situation (half-joking), but the quality of the speech it generates is indeed very good.

    • Directly generate waveform from acoustic features
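
  • As a concrete illustration of the Griffin-Lim idea above, here is a minimal Python sketch using librosa's ready-made implementation (the file name, sample rate, and STFT parameters are placeholders, not values from the course):

    import numpy as np
    import librosa

    # Load audio and keep only the magnitude of its STFT (the phase is discarded).
    y, sr = librosa.load("example.wav", sr=16000)          # hypothetical input file
    magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

    # Griffin-Lim: start from random phase, alternate iSTFT/STFT while keeping the
    # known magnitude, and iterate until the phase estimate becomes reasonable.
    y_rec = librosa.griffinlim(magnitude, n_iter=60, hop_length=256)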

Why study the Vocoder separately?
  • We may wonder why the vocoder should be studied separately instead of being trained end-to-end together with TTS, VC, and other models. Because the vocoder is simply a way to convert spectrograms into waveforms, it can be reused by any model that generates spectrograms, which makes it highly versatile. Meanwhile, the generation goal of those other models becomes producing spectrograms, which lowers the overall task difficulty and lets them focus on processing the sound representation itself.

2. Neural Vocoder——WaveNet

WaveNet ideas
  • WaveNet was the first proposed model that generates waveforms directly. It is essentially an autoregressive model. How does that come about?

  • If we zoom in on the waveform, we find that it is essentially a series of consecutive sample values.

  • So we naturally think of an autoregressive model: to predict the value of x_t, we take the values x_1 to x_(t-1) as input. And that is exactly what WaveNet does.

  • Each value is produced from the previous values through dilated convolutions, and the generated value is then fed back as input to generate the next one.

  • The main component of WaveNet is the causal convolution network. We will talk about it in detail soon; here is a brief description.

    Causal convolution: a convolution operation in a convolutional neural network with a "causality" constraint, i.e., every element of the output can only depend on elements at or before the corresponding position in the input sequence. This constraint is very useful when working with time-series data because it ensures that the model does not depend on "future" information beyond the current time step.

WaveNet architecture

Skip connections: a technique in neural networks that directly connects the output of a layer to the input of a non-adjacent, deeper layer. Its main purpose is to improve training, gradient propagation, and the model's ability to learn.

  • We can briefly walk through the main architecture of WaveNet. The input is x_1 to x_(t-1), and the output is x_t. The input first enters a causal convolution layer (Causal Conv), then a dilated convolution layer (Dilated Conv), passes through two different activation functions whose results are multiplied element-wise, then goes through a 1×1 CNN, and is finally added back to the original input (residual learning). All of this counts as one "layer". The output of this layer is passed to the next layer as input, and there are k such layers in total.

  • The 1×1 CNN output of each layer is also pulled out and summed separately (this is what the skip connections do), and after ReLU, 1×1 CNN, ReLU, 1×1 CNN, and Softmax, the final output is produced. A minimal sketch of one such layer is given below.
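
  • As a rough illustration, one such layer could be sketched in PyTorch as follows (a minimal sketch with assumed channel sizes and kernel size 2, not the paper's exact configuration):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        def __init__(self, channels, skip_channels, dilation, kernel_size=2):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation          # left padding keeps the conv causal
            self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
            self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
            self.res_conv = nn.Conv1d(channels, channels, 1)
            self.skip_conv = nn.Conv1d(channels, skip_channels, 1)

        def forward(self, x):                                # x: (batch, channels, time)
            xp = F.pad(x, (self.pad, 0))                     # pad only on the left (no future samples)
            z = torch.tanh(self.filter_conv(xp)) * torch.sigmoid(self.gate_conv(xp))
            skip = self.skip_conv(z)                         # goes to the skip-connection sum
            residual = self.res_conv(z) + x                  # goes to the next layer
            return residual, skip

    # k such blocks are stacked; the skip outputs are summed and passed through
    # ReLU -> 1x1 conv -> ReLU -> 1x1 conv -> softmax to produce the output distribution.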

WaveNet Softmax Distribution

μ-law encoding algorithm: mainly used for the compression and quantization of audio data. μ-law (mu-law) is a nonlinear coding method, usually used to compress audio signals and make them more suitable for digital signal processing. It applies a nonlinear transformation to the audio signal, reducing the dynamic range of the input so that smaller-amplitude signals are easier to represent and less sensitive to noise. This encoding reduces the amount of data through compression and quantization while retaining the main characteristics of the audio signal.

  • For WaveNet, the original sound signal is a sequence of numbers; each value is converted into a one-hot vector and then passed into the network as input. Its output is likewise a sequence of one-hot (categorical) vectors.

  • We can also express the problem of generating a sound signal as the following formula:


    p(x) = \prod_{t=1}^T {p(x_t | x_1, \dots x_{t-1})}

     

    That is: given x1 to xt-1, find the probability distribution of xt.

  • This also causes a problem. Sound signals are stored in computers as 16-bit integers, which means the range is [-32768, 32767]. If the model were trained in this format, the task would become a classification problem with 65,536 categories. That is far too many categories for the model to learn, so we want to compress these 65,536 categories into 256, i.e., represent the sound signal with 8-bit integers. So can we simply use a linear mapping for the conversion?

  • No, we don't. We first linearly compress the range [-32768, 32767] into [-1, 1], and then pass the values through the μ-law algorithm, whose formula is as follows:


    f(x_t) = \mathrm{sign}(x_t)\frac{\ln(1+\mu\left|x_t\right|)}{\ln(1+\mu)}, \quad \mu = 255 \\ \mathrm{sign}(x) = \begin{cases} -1, & x < 0 \\ 0, & x = 0 \\ 1, & x > 0 \end{cases}
     

  • After μ-law companding, the range is still [-1, 1]; the values are then mapped back to the range [0, 255]. The overall process is as follows:

    [-32768, 32767] (16-bit int) -> [-1, 1] -> [-1, 1] (μ-law) -> [0, 255] (8-bit int)
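
  • A minimal numpy sketch of this pipeline might look like the following (the exact rounding and clipping conventions are an assumption; real implementations may differ slightly):

    import numpy as np

    MU = 255

    def mu_law_encode(x_int16):
        x = x_int16.astype(np.float32) / 32768.0                    # [-32768, 32767] -> [-1, 1]
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)    # mu-law, still in [-1, 1]
        return ((y + 1) / 2 * MU + 0.5).astype(np.uint8)            # quantize to [0, 255]

    def mu_law_decode(q_uint8):
        y = q_uint8.astype(np.float32) / MU * 2 - 1                 # [0, 255] -> [-1, 1]
        x = np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU           # inverse mu-law
        return np.clip(x * 32768.0, -32768, 32767).astype(np.int16)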

WaveNet Dilated Causal Convolution
  • We said before that the main component of WaveNet is the causal convolution network (Causal Convolution), and that, more specifically, each value is computed from the previous values through dilated convolution (Dilated Convolution). So what do these terms actually mean?

  • A causal convolutional network is a convolutional network with a "causality" constraint: its output y_t can only see the input x_t and earlier inputs, never future information. If we draw it, its most distinctive feature is the "right triangle" pattern shown below.

  • Dilated convolution is different from ordinary convolution. As shown in the figure below, the taps of an ordinary convolution kernel are adjacent, while the taps of a dilated convolution are spaced apart, thereby "dilating" the receptive field of the convolution.

  • Combining the features of these two, we get WaveNet’s Dilated Causal Convolution.

  • The biggest benefit of dilated convolution is that the receptive field of each output value grows exponentially with depth, as shown in the figure. With ordinary causal convolution (kernel size 2), after 4 layers one output value can only see 5 input values. With dilated causal convolution at the same depth (dilations 1, 2, 4, 8), one output value can see a full 16 input values, which greatly enlarges the receptive field.
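
  • The receptive-field claim above is easy to verify with a few lines of Python (a sketch assuming kernel size 2):

    def receptive_field(dilations, kernel_size=2):
        # Each layer adds (kernel_size - 1) * dilation samples of history.
        return 1 + sum((kernel_size - 1) * d for d in dilations)

    print(receptive_field([1, 1, 1, 1]))   # ordinary causal conv, 4 layers -> 5
    print(receptive_field([1, 2, 4, 8]))   # dilated causal conv,  4 layers -> 16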

WaveNet Residual and Skip Connections
  • We have just finished the WaveNet softmax operation and the dilated causal convolution. One structure remains: within each layer, the convolution results pass through two different activation functions, and the two results are then multiplied element-wise. We call this operation the gated activation unit.

  • It can be expressed as the following formula:


    \mathbf{z} = \tanh(W_{f,k}*\mathbf{x})\odot\sigma(W_{g,k}*\mathbf{x})
     

    That is, after the convolution, the result passes through the two activation functions tanh and sigmoid, and the two outputs are multiplied element-wise to become the final result.

Conditional WaveNet
  • At this point, it seems we still have not talked about the spectrogram. After all, the input of the WaveNet just described is the output of the previous timesteps (x_1, ..., x_(t-1)), and the output is x_t. So when is the spectrogram used? In fact, it is placed inside the gated activation unit.

  • We call the inserted spectrogram, which changes over time, the local condition. The formula is as follows, using y to denote the spectrogram. Like x, it passes through its own CNN, is added to the transformed x, and then goes through the two activation functions; the final result is their element-wise product.


    \mathbf{z} = \tanh(W_{f,k}*\mathbf{x} + V_{f,k}*\mathbf{y}) \odot \sigma(W_{g,k}*\mathbf{x} + V_{g,k}*\mathbf{y})
     

  • Since there is a "local condition", there is naturally also a global condition. This refers to the additional conditions we add to the sound as a whole, such as whose voice it is, what emotion it carries, or what speaking style is used; these conditions apply to the entire utterance. The condition may be a scalar or a one-hot vector. Either way, it is added in the same way as above. The formula is:


    \mathbf{z} = \tanh(W_{f,k}*\mathbf{x} + V_{f,k}^{T}\mathbf{h}) \odot \sigma(W_{g,k}*\mathbf{x} + V_{g,k}^{T}\mathbf{h})
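
  • Putting the local and global conditions together, the gated activation unit could be sketched as below (a hypothetical PyTorch layout with assumed shapes; the spectrogram y is assumed to be already upsampled to the waveform's time resolution, and h is an utterance-level embedding):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConditionedGate(nn.Module):
        def __init__(self, channels, cond_channels, global_dim, dilation, kernel_size=2):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation
            self.filter_x = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
            self.gate_x = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
            self.filter_y = nn.Conv1d(cond_channels, channels, 1)   # V_{f,k} * y
            self.gate_y = nn.Conv1d(cond_channels, channels, 1)     # V_{g,k} * y
            self.filter_h = nn.Linear(global_dim, channels)         # V_{f,k}^T h
            self.gate_h = nn.Linear(global_dim, channels)           # V_{g,k}^T h

        def forward(self, x, y, h):
            # x: (B, C, T), y: (B, cond_channels, T), h: (B, global_dim)
            xp = F.pad(x, (self.pad, 0))                            # keep the convolution causal
            f = self.filter_x(xp) + self.filter_y(y) + self.filter_h(h).unsqueeze(-1)
            g = self.gate_x(xp) + self.gate_y(y) + self.gate_h(h).unsqueeze(-1)
            return torch.tanh(f) * torch.sigmoid(g)                 # z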
     

WaveNet Summary
  • The sound quality synthesized by WaveNet is very good, but because it is an autoregressive model and a sound signal contains 16,000 samples per second, generating one second of audio requires 16,000 sequential operations, so generation is very slow. The main purpose of the models discussed below is to solve this problem of slow generation.

3. Neural Vocoder——FFTNet

FFTNet, like WaveNet, is an autoregressive model. The difference is that it replaces the deep CNN with a simpler computation, so each step takes less time. In addition, FFTNet introduces some training and synthesis tricks; as long as an autoregressive model is used to generate speech, these tricks can be applied to improve the quality of the generated speech.

FFTNet architecture
  • The input and output of FFTNet are the same as WaveNet's: the input is x_1 to x_(t-1), and the output is x_t. The input is first cut into two segments, x_L and x_R, which pass through separate CNNs and are added together to obtain z; z then goes through a ReLU, a 1×1 CNN, and another ReLU to give a new x, on which the same operation is performed again. Because of the addition, the length of the data is halved every time it passes through a layer; when the length finally becomes 1, that is the final output. A minimal sketch of one such layer is given after this list.

  • Expressed as a formula, the above process is:


    z = W_L * x_L + W_R * x_R \\ x = \mathrm{ReLU}(\mathrm{conv1×1}(\mathrm{ReLU}(z)))
     

  • The spectrogram is added in a way similar to WaveNet: it is injected in the step that produces z. Using h to denote the spectrogram, the formula becomes:


    z = (W_L * x_L + W_R * x_R) + (V_L * h_L + V_R * h_R)
     

  • That is FFTNet. The architecture is very simple, yet its results are close to WaveNet's. Why can such a simple model work so well? Look at the schematic below: to generate the rightmost red point, we need to know 2 red points of the previous layer, and the layer before that needs 4, and so on. With many layers stacked, it is not hard to see that the receptive field of the final output becomes very large.
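
  • A minimal PyTorch sketch of one such layer (assumed shapes, not the authors' code) could look like this:

    import torch.nn as nn
    import torch.nn.functional as F

    class FFTNetLayer(nn.Module):
        def __init__(self, channels, cond_channels):
            super().__init__()
            self.w_l = nn.Conv1d(channels, channels, 1)       # W_L
            self.w_r = nn.Conv1d(channels, channels, 1)       # W_R
            self.v_l = nn.Conv1d(cond_channels, channels, 1)  # V_L
            self.v_r = nn.Conv1d(cond_channels, channels, 1)  # V_R
            self.out = nn.Conv1d(channels, channels, 1)

        def forward(self, x, h):
            # x: (B, C, N) inputs, h: (B, cond_channels, N) condition; both are halved here.
            half = x.size(-1) // 2
            x_l, x_r = x[..., :half], x[..., half:]
            h_l, h_r = h[..., :half], h[..., half:]
            z = self.w_l(x_l) + self.w_r(x_r) + self.v_l(h_l) + self.v_r(h_r)
            return F.relu(self.out(F.relu(z)))                # the new, half-length x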

FFTNet Tips
  • Zero padding:

    • Adding some zeros before the input sound signal will make the training more stable.

  • Conditional sampling:

    • We have said that the final output of WaveNet is essentially a classification problem, that is, the category with the highest probability is selected from the final distribution as the output. Depending on the situation, sometimes we do not want the most probable category; instead, we randomly sample from the distribution to obtain the final output (see the sketch after this list).

  • Injected noise:

    • Both WaveNet and FFTNet are autoregressive models, and both use teacher forcing during training: the real x_1 to x_(t-1) are fed in, and the output is trained to be as close as possible to the ground-truth x_t. In actual use, however, we feed the model's own previous output back as input, which means that if the output of one step is poor, the error is passed on to the following steps and can eventually make the whole generation collapse.

    • So we can add some Gaussian noise to the input x during training. A model trained this way is more stable when generating sounds at inference time.

  • Post-synthesis denoising

    • When a model trained with the noise-injection method above is finally put into use, the generated sound signal will be a bit noisy, so signal processing is applied to denoise the generated signal (post-synthesis denoising).
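
  • A small numpy sketch of two of these tips, conditional sampling and noise injection (the noise standard deviation here is a made-up value for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    def conditional_sample(probs):
        # probs: softmax output over the 256 mu-law classes; sample instead of taking argmax.
        return int(rng.choice(len(probs), p=probs))

    def inject_noise(x, std=0.01):
        # Add Gaussian noise to the ground-truth inputs used for teacher forcing.
        return x + rng.normal(0.0, std, size=x.shape)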

FFTNet Summary
  • FFTNet uses a simpler architecture and can generate sound signals of almost the same quality as WaveNet at a much faster speed. The authors even state in the paper that the model can run in real time on a CPU, which means generating a 1 s sound signal takes less than 1 second.

  • However, in actual use the speed is not quite as fast as the authors claim, though it is still much faster than WaveNet. Moreover, the small tricks used in FFTNet are very useful for autoregressive models in general.

  • Finally, let's look at how the models perform. The MOS used here is the mean opinion score mentioned before: a group of listeners rate each sound from 1 to 5, and the scores are averaged. The entries marked with "+" are models trained with the tricks above. We can see that FFTNet still does not score as high as WaveNet, but the scores of both models improve by a large margin once the tricks are used.

4. Neural Vocoder——WaveRNN

WaveRNN was also proposed by Google. To sum the model up in one sentence: previous models used CNNs to process the time series, and this one uses an RNN.

WaveRNN Dual Softmax Layer

Gated Recurrent Unit (GRU): a variant of the recurrent neural network (RNN) used to process and model time-series data such as speech signals. A GRU contains an update gate and a reset gate, which function similarly to the input gate and forget gate in an LSTM and control how information flows and is passed on.

  • Before talking about the WaveRNN architecture, let us first look at the softmax layer of WaveRNN. We said before that WaveNet's softmax layer compresses 16-bit data into 8-bit data for processing. In WaveRNN, the softmax instead splits the 16-bit sample into two 8-bit parts and predicts them in two steps, thereby achieving full 16-bit output.

  • The two parts are called Coarse, written c_t, and Fine, written f_t.
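
  • A small numpy sketch of this split (a 16-bit sample becomes a coarse high byte and a fine low byte, each handled by its own 256-way softmax):

    import numpy as np

    def split_coarse_fine(x_int16):
        u = x_int16.astype(np.int32) + 2**15      # shift [-32768, 32767] -> [0, 65535]
        return u // 256, u % 256                  # coarse c_t and fine f_t, each in [0, 255]

    def combine_coarse_fine(coarse, fine):
        return (coarse * 256 + fine - 2**15).astype(np.int16)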

WaveRNN architecture
  • The structure diagram of the Coarse 8-bit path is as follows. The four units in the red box can be understood as a GRU. The lower-left input is the 16-bit data of the previous time step, i.e., the result x_(t-1); the upper-left input is the hidden state of the previous time step; and the output is the hidden state of the current time step, which passes through two linear transforms and then a softmax to give the Coarse 8-bit output y_c, the final result of this path.

  • The Fine 8-bit path is similar to the Coarse path and is handled by another GRU. The only difference is that, in addition to the previous 16-bit sample, the Coarse 8 bits c_t generated at the current step are also used as input.

  • Unfortunately, the original paper only provides the figures above and does not explain the model layout in detail. There are many WaveRNN implementations online; you can look them up yourself, so we will not go into further detail here.

WaveRNN acceleration tips

The WaveRNN paper also mentions several methods to speed up the model. It states that, using the techniques below, WaveRNN can run in real time on a mobile phone's CPU.

  • Sparse WaveRNN

    • That is, weight pruning: weight values with relatively small magnitudes are set to 0 to reduce computation time (see the sketch after this list).

  • Subscale WaveRNN

    • Normally we would generate the samples in order 1, 2, 3, ...; in this variant, the generated sequence is folded so that one part produces samples 1, 9, 17, ..., another part produces 5, 13, 21, ..., and so on for the remaining parts. In this way 8 parts can run at the same time, and the overall generation can be 8 times faster.
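
  • A one-shot numpy sketch of the magnitude-based pruning idea (the paper prunes gradually during training, which is not reproduced here; the sparsity level is illustrative):

    import numpy as np

    def prune_weights(w, sparsity=0.9):
        # Zero out the smallest-magnitude weights so that roughly `sparsity` of them become 0.
        k = int(sparsity * w.size)
        threshold = np.sort(np.abs(w), axis=None)[k]
        return np.where(np.abs(w) < threshold, 0.0, w)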

WaveRNN summary
  • Very simple, yet very powerful.

5. Neural Vocoder——WaveGlow

In fact, the reason speech generation is slow is essentially that an autoregressive model is used. Some people wondered: what would happen if an autoregressive model were not used to generate the speech signal? Thus WaveGlow was born.

Flow-Based Model
  • Before looking at the WaveGlow model, let us first understand flow-based models. Compared with a GAN, which has two models (the generator must fool the discriminator), a flow-based model has only one model, the flow, which is f(x) in the figure. The model takes real data as input and maps it to a high-dimensional, noise-like z, which corresponds to a simple probability distribution. At the same time, the transform f(x) is invertible, which means we can also feed z in as input and sample the sound signal we need.

  • The explanation from a mathematical point of view is as follows:


    \mathbf{x} = f^{-1}(\mathbf{z}) \Leftrightarrow \mathbf{z} = f(\mathbf{x})
     

  • Here z follows a Gaussian distribution with mean 0 and standard deviation 1, so we can write down its probability density function (PDF) q, which is simply the PDF of the Gaussian. If we throw a particular z into it, it tells us how likely that z is to appear. The formula is:


    q(\mathbf{z}) = \frac{1}{(2\pi)^{D/2}}\mathrm{exp}(-\frac{1}{2}\left\|\mathbf{z}\right\|^2)
     

  • During training, x is mapped toward a Gaussian distribution: if the transformation f is well trained, it maps the original data x into a space where the data points look much more Gaussian. Putting this into mathematics: if we throw f(x) into the Gaussian PDF, the value obtained should be large. So how should q(x) be calculated? Combining the two formulas above, we get:


    q(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}}\mathrm{exp}(-\frac{1}{2}\left\|f(\mathbf{x})\right\|^2) \left| \det{ \left[ \frac{\partial{f}}{\partial{\mathbf{x}}} \right] } \right|
     

  • Here a change of variables from z to x is performed. We learned in probability theory (well, supposedly) that a change of variables cannot be done by simple substitution; the absolute value of the Jacobian determinant must be appended.

    Explanation from the bullet comments: one density lives in the space whose coordinates are the elements of z, and the other in the space whose coordinates are the elements of x, so the Jacobian must appear.

  • Taking the log of q(x) gives the objective function of our training: the larger the value, the better. The formula is below. In addition, the transformation f we design must satisfy two conditions: it is easy to invert, and its Jacobian determinant is easy to compute. (A small code sketch of these ideas is given at the end of this section.)


    \log q(\mathbf{x}) = -\frac{D}{2}\log{(2\pi)} - \frac{1}{2}\left\|f(x)\right\|^2 + \log \left| \det{ \left[ \frac{\partial{f}}{\partial{\mathbf{x}}} \right] } \right|
     

  • So how should we design the function f? We can split x into x_1 and x_2, define an intermediate result h, and likewise split it into two parts h_1 and h_2. Writing the two results of the first level as h_1^(1) and h_2^(1), we have:


    \mathbf{h}_1 = \mathbf{x}_1 \\ \mathbf{h}_2 = \mathbf{x}_2 + m(\mathbf{x}_1)
     

  • Here m denotes some operation (in practice, a neural network). Taking the first level as an example, once we have the intermediate results h_1^(1) and h_2^(1), the inverse can be computed as:


    \mathbf{x}_1 = \mathbf{h}_1^{(1)} \\ \mathbf{x}_2 = \mathbf{h}_2^{(1)} - m(\mathbf{h}_1^{(1)})
     

  • So after the original data x goes through all of the above transformations, the final generated data is z. The composition of all the intermediate transformations is our f.

  • It is easier to understand by drawing the above process into a diagram, which probably looks like the following:

  • Of course, you can also change the order slightly and become like this:

  • Whether it is the above or below, the overall operation is reversible.
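
  • The following numpy sketch illustrates the additive coupling and the log-likelihood objective above (a toy, NICE-style example with a stand-in m; WaveGlow itself uses affine coupling with a WaveNet-like m and invertible 1×1 convolutions):

    import numpy as np

    def m(x1):
        return np.tanh(x1)                       # stand-in for "some operation" m

    def forward(x):
        x1, x2 = np.split(x, 2)
        return np.concatenate([x1, x2 + m(x1)])  # z = f(x); additive coupling, log|det J| = 0

    def inverse(z):
        h1, h2 = np.split(z, 2)
        return np.concatenate([h1, h2 - m(h1)])  # x = f^{-1}(z)

    def log_q_x(x, log_det=0.0):
        # log q(x) = Gaussian log-density of z = f(x) plus log|det df/dx|.
        z = forward(x)
        return -0.5 * z.size * np.log(2 * np.pi) - 0.5 * np.sum(z ** 2) + log_det

    x = np.random.randn(8)
    assert np.allclose(inverse(forward(x)), x)   # the transform is exactly invertible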

WaveGlow Architecture
  • The figure below is the architecture diagram of WaveGlow. The module framed by the dotted line in the middle is the transformation function f. We can feed in x and obtain the output z, or provide z to generate the speech x.

  • Starting from the input, x first passes through a "squeeze to vectors" step, then enters a 1×1 convolution, and then an affine coupling layer, which is the bunch of operations we described before. These operations are performed a total of 12 times in WaveGlow, and each repetition is regarded as one layer.

  • Zooming in on the affine coupling layer (the picture on the right): the input is first cut into x_a and x_b; x_a is kept directly as output and is also fed into a module called WN, and the result of that computation is combined with x_b to become the output x_b'. It is worth mentioning that the module operating on x_a is a WaveNet-like network (WN), which is where the name WaveGlow comes from. Likewise, the spectrogram is naturally fed into the model through WN.

  • For the "squeeze to vectors" operation: assuming we input 1 s of data, i.e., 16,000 samples, we reshape it into a tensor of shape (8, 2000). The reason is that, without reshaping, splitting the input data would give two segments of 8,000 samples each, which is very inconvenient for computation and makes the two halves very unbalanced. After the reshape, we can split it into two (4, 2000) halves.
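
  • A tiny numpy sketch of the squeeze-and-split step described above:

    import numpy as np

    audio = np.random.randn(16000)             # 1 s of audio at 16 kHz
    squeezed = audio.reshape(2000, 8).T        # shape (8, 2000): groups of 8 consecutive samples
    x_a, x_b = squeezed[:4], squeezed[4:]      # two (4, 2000) halves for the coupling layer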

WaveGlow Summary
  • When using WaveGlow, the input is a whole segment of sound at once, so its generation speed is much faster than that of autoregressive models.

  • However, such a model also has a big disadvantage: it is very difficult to train. The original paper states that 8 Nvidia GV100 GPUs were used for training, each costing about 350,000 yuan at the time.

6. Vocoder Conclusion

  • Quality: WaveNet > others

  • In terms of training time: WaveRNN >= WaveNet >= FFTNet >> WaveGlow

  • In terms of generation speed:

    • Real-time threshold: 16 kHz, i.e., the vocoder must be able to generate 16,000 samples per second.

    • WaveGlow (520kHz) >> Real-time > WaveRNN >= FFTNet >> WaveNet (0.11kHz)

    • Traditional highly optimized Griffin-Lim algorithm: 507kHz

  • The current situation of neural vocoders is that they are either very slow at generation or difficult to train, and they require heavy optimization to achieve really good results.

  • Therefore, the current research goal is to create a Vocoder that is fast, high-quality, and easy to train.


Original article: blog.csdn.net/m0_56942491/article/details/134527373