Speech Bandwidth Extension With WaveNet


Authors: Archit Gupta, Brendan Shillingford, Yannis Assael, Thomas C. Walters

Blog address: https://www.cnblogs.com/LXP-Never/p/12090929.html

Blog author: 凌逆战 (LXP-Never)


Abstract

  Large mobile communication systems often contain legacy transmission channels with narrowband bottlenecks, producing telephone-quality audio. Although high-quality codecs exist, the size and heterogeneity of such networks make it difficult in practice to transmit high-sample-rate audio with modern high-quality codecs end to end. This paper proposes a method by which a communication node can extend the bandwidth of speech that has been passed through a low-rate codec. To this end, we propose a WaveNet-based model conditioned on the log-mel spectrum of a speech signal that is band-limited to 8 kHz and contains GSM Full Rate (FR) compression artifacts, and which reconstructs a high-resolution signal. In our MUSHRA evaluation, we show that a model trained on 8 kHz GSM-FR audio can reconstruct a 24 kHz speech signal whose quality is only slightly below that of 16 kHz bandwidth audio from the Adaptive Multi-Rate Wideband (AMR-WB) codec, closing roughly half of the perceived-quality gap between the encoded signal and the original 24 kHz speech samples. We further show that, when passed through the same model, uncompressed 8 kHz audio can be reconstructed with quality better than 16 kHz AMR-WB audio in the same MUSHRA evaluation.

Keywords: WaveNet, bandwidth extension, audio super-resolution, generative models

1. Introduction and related work

  Traditional transmission channels are still part of many large communication systems. These channels introduce bottlenecks that limit bandwidth and speech quality; the result is commonly referred to as telephone-quality audio. Upgrading every part of the infrastructure to be compatible with higher-quality audio codecs can be difficult. This paper therefore proposes a method that does not require upgrading all of the infrastructure: individual communication nodes can instead use it to extend the bandwidth of any incoming speech signal. To achieve this goal, we propose a model based on WaveNet [1], a deep generative model of audio waveforms.

  WaveNet has proven very effective for high-quality speech synthesis conditioned on linguistic features. Furthermore, the WaveNet architecture has been used for text-to-speech from log-mel spectrograms [2], and for speech coding from other low-dimensional latent representations [3,4]. Given the WaveNet architecture's ability to generate high-quality speech from constrained representations, we extend this technique to the problem of speech bandwidth extension (BWE) [5], also known as audio super-resolution [6].

  Although BWE can be understood as extending a band-limited signal to both lower and higher frequencies, in this work we are particularly interested in the telephony application, where audio typically passes through a low-rate speech codec such as GSM Full Rate (GSM-FR) [7], and the highest frequency component of the reconstructed signal is limited to 4 kHz or below, resulting in reduced audio quality and potentially reduced intelligibility. We therefore focus on reconstructing a signal at a 24 kHz sampling rate from an input signal sampled at 8 kHz. In the past, bandwidth extension used techniques from the vocoder speech literature such as Gaussian mixture models and hidden Markov models [5]; more recently, there has been growing interest in using neural networks to model the spectral envelope [8] or to predict the waveform directly at the sample level [6,9,10], improving quality over earlier methods.

  In our experimental evaluation, we assess the ability of the proposed model to perform bandwidth extension on narrowband signals. To illustrate the impact of this work, we show that a trained model, upsampling from 8 kHz to 24 kHz, can reconstruct a speech signal passed through the GSM-FR codec with audio quality similar to or better than that produced by the 16 kHz Adaptive Multi-Rate Wideband codec (AMR-WB) [11]. GSM-FR is the codec used in traditional GSM mobile telephony, and AMR-WB is the codec commonly used for HD-Voice calls. Although comparison with previous work is difficult due to the lack of reusable code and differing test setups, our method obtains a higher score in MUSHRA evaluation than previous work [6].

  It is worth mentioning that we believe the WaveNet at the core of our approach could be replaced with more computationally efficient architectures, such as Parallel WaveNet [12], WaveGlow [13] or WaveRNN [14]. These architectures have shown that, while maintaining similar modeling performance, a model can usually be reproduced with far less computation. In this work, we create a proof of concept of high-quality bandwidth extension based on WaveNet, because its superior expressive power and relatively straightforward training make it more likely that the results can be reproduced with other computational architectures.

2. Training procedure

2.1 Model architecture

  WaveNet is a generative model that factorizes the probability of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ into a product of conditional probabilities, each sample being conditioned on the samples at all previous timesteps. A conditional WaveNet model takes an additional input variable $\mathbf{h}$ and models the conditional distribution

$$p(\mathbf{x} | \mathbf{h})=\prod_{t=1}^{T} p\left(x_{t} | x_{1}, \ldots, x_{t-1}, \mathbf{h}\right)$$

This task uses a conditional WaveNet model. The conditioning input $\mathbf{h}$ passes through a "conditioning stack" of five dilated convolutional layers, followed by two transposed convolutions, which upsample the conditioning input by a factor of four. The autoregressive input is normalized to the range $[-1, 1]$ and passed through a convolutional layer with filter size 4 and 512 filters. Both are then fed into the core WaveNet model, which consists of three blocks of 10 dilated convolutional layers with skip connections, just as in the original WaveNet architecture [1]. We use a dilation factor of 2; the filter size and number of filters are 3 and 512, respectively. The skip-connection outputs pass through two convolutional layers, each with 256 filters. The distribution over output sample values is modeled with a quantized logistic mixture [15] with 10 components.
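To make the shapes concrete, below is a minimal PyTorch sketch of what such a conditioning stack might look like: five dilated 1-D convolutions over the 80-dimensional log-mel frames, followed by two transposed convolutions giving the 4x temporal upsampling mentioned above. The kernel sizes, activation functions, stride choices, and the class name ConditioningStack are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ConditioningStack(nn.Module):
    """Sketch of a conditioning stack: 5 dilated convs + 2 transposed convs (4x upsampling).

    Kernel sizes, channel widths and strides here are assumptions for illustration.
    """
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        # Five dilated convolutions over the log-mel frames (80 Hz frame rate).
        self.dilated = nn.ModuleList([
            nn.Conv1d(n_mels if i == 0 else channels, channels,
                      kernel_size=3, dilation=2 ** i, padding=2 ** i)
            for i in range(5)
        ])
        # Two stride-2 transposed convolutions -> 4x upsampling in time, as in the text.
        # Any further upsampling to the 24 kHz sample rate (e.g. by repetition) is not
        # specified by the source and is left out of this sketch.
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, mel):            # mel: (batch, 80, frames)
        h = mel
        for conv in self.dilated:
            h = torch.relu(conv(h))
        return self.upsample(h)        # (batch, channels, 4 * frames)
```

The core WaveNet (three blocks of 10 dilated convolutional layers with skip connections and a 10-component quantized logistic mixture output) would then consume this upsampled conditioning signal together with the autoregressive input.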

Figure 2: Overview of the process. The input audio, sampled at 8 kHz, is converted to a log-mel spectral representation, which serves as the conditioning input to the WaveNet stack. The model outputs a higher-sample-rate 24 kHz signal in which the missing higher-frequency content is predicted by the model.

2.2 Data Preparation

   Our model is trained and evaluated on the LibriTTS [16] dataset. LibriTTS is derived from the same source material as the well-known LibriSpeech corpus [17], but is sampled at 24 kHz (as opposed to 16 kHz for LibriSpeech), with a resolution of 16 bits per sample. The audio books (and associated text) in both datasets come from a collection of public-domain books read by English speakers with a variety of accents under non-studio conditions, which means the recordings often contain some background noise. The train-clean-100 and train-clean-360 subsets are used for training, with a small portion (1-2%) of each set held out for evaluation. Listening evaluation is carried out on the test-clean subset, which contains a set of speakers disjoint from the training set, ensuring that no speakers from the training set are used.
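As a concrete illustration of this split, the sketch below partitions LibriTTS file lists in the way just described. The directory layout, the exact held-out fraction, and the helper name split_libritts are assumptions made for this example, not the authors' tooling.

```python
import random
from pathlib import Path

def split_libritts(root, eval_fraction=0.02, seed=0):
    """Collect LibriTTS wavs; hold out a small fraction (1-2%) of the training
    subsets for evaluation. Layout and fraction are illustrative assumptions."""
    train_files = []
    for subset in ("train-clean-100", "train-clean-360"):
        train_files += sorted(Path(root, subset).rglob("*.wav"))
    random.Random(seed).shuffle(train_files)
    n_eval = int(len(train_files) * eval_fraction)
    eval_files, train_files = train_files[:n_eval], train_files[n_eval:]
    # Listening tests use test-clean, whose speakers do not appear in training.
    listening_files = sorted(Path(root, "test-clean").rglob("*.wav"))
    return train_files, eval_files, listening_files
```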

2.3 Training

  The model is trained with maximum likelihood (ML) to predict the 24 kHz waveform from log-mel spectra computed from the 8 kHz band-limited waveform. As in other WaveNet applications, there are two types of input to the model during training: the autoregressive input, consisting of samples from previous time steps, and the conditioning input. The autoregressive input is teacher-forced during training, so it consists of samples of the high-quality 24 kHz audio. The log-mel spectrum computed from the lower-bandwidth audio is used as the conditioning input.

  In other words, the WaveNet models the distribution:

$$p\left(\mathbf{x}_{\mathrm{hi}} | \mathbf{x}_{\mathrm{lo}}\right)=\prod_{t=1}^{T} p\left(x_{\mathrm{hi}, t} | x_{\mathrm{hi}, 1}, \ldots, x_{\mathrm{hi}, t-1}, \mathbf{x}_{\mathrm{lo}}\right)$$

where $\mathbf{x}_{\mathrm{hi}}$ is the 24 kHz waveform being modeled, and $\mathbf{x}_{\mathrm{lo}}$ is the 8 kHz narrowband data represented as a log-mel spectrogram. $\mathbf{x}_{\mathrm{lo}}$ is fed in as the conditioning input of the WaveNet stack.

  We use the Adam [18] optimizer with a learning rate of $10^{-4}$, momentum ($\beta_1$) set to 0.9, and epsilon set to $10^{-8}$. We use a total batch size of 64, with a per-core batch size of 8 across 8 tensor processing unit (TPU) [19] cores ($8 \times 8 = 64$).
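For concreteness, here is a minimal sketch of the optimizer configuration and the teacher-forced training step described above. The names model, mol_nll, and batches are placeholders; the discretized logistic mixture loss, the $\beta_2$ value (library default), and the TPU data-parallel sharding are assumptions not spelled out in the source.

```python
import torch

# Assumed placeholders: `model` combines the conditioning stack and core WaveNet,
# `mol_nll` is a negative log-likelihood under the 10-component quantized logistic
# mixture, and `batches` yields (wav_24k, logmel_8k) pairs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

PER_CORE_BATCH = 8   # 8 examples per core ...
NUM_CORES = 8        # ... on 8 TPU cores -> global batch size 64

for wav_24k, logmel_8k in batches:
    optimizer.zero_grad()
    # Teacher forcing: ground-truth 24 kHz samples are the autoregressive input;
    # the log-mel of the 8 kHz audio is the conditioning input.
    mixture_params = model(ar_input=wav_24k[:, :-1], conditioning=logmel_8k)
    loss = mol_nll(mixture_params, target=wav_24k[:, 1:])
    loss.backward()
    optimizer.step()
```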

3. Experimental evaluation

3.1 Setup

  In this evaluation, we are primarily interested in the setting of enhancing speech coded on a fixed, traditional audio path, such as a call on a standard GSM mobile network. In this case, the codec typically operates with a 4 kHz bandwidth, yielding an audio waveform at an 8 kHz sampling rate.

   To generate the training set, the LibriTTS train-clean-100 subset was preprocessed using the sox tool, encoding the original audio with GSM-FR to obtain a dataset containing both the original 24 kHz audio and the corresponding 8 kHz signal; for each utterance, passing through the codec causes a further reduction in quality. To generate training examples from the LibriTTS training utterances, a 350 ms region of audio is selected starting from a random point in the utterance. Log-mel spectrograms are computed from the 8 kHz audio input of the training region using a 50 ms Hann window (12.5 ms step size), and then mapped to 80 mel-frequency bins spanning 125 Hz to the Nyquist frequency of the input signal. This yields conditioning vectors $\mathbf{x}_{\mathrm{lo}}$ of length 80 at a rate of 80 Hz. A WaveNet is then trained to predict the ground-truth full-sample-rate audio for the same region from the spectra computed from the GSM audio. In earlier experiments, we found that this spectral conditioning performs better than conditioning directly on the raw waveform.
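A sketch of how these conditioning features could be computed with librosa is shown below. The 50 ms Hann window, 12.5 ms hop (80 Hz frame rate), 80 mel bins, and 125 Hz to Nyquist range follow the text; the sox invocation for the GSM-FR round trip and the log floor are illustrative assumptions.

```python
import subprocess
import librosa
import numpy as np

def gsm_fr_roundtrip(in_wav, out_wav="gsm_8k.wav"):
    """Encode to GSM-FR at 8 kHz and decode back to wav via sox (assumed invocation)."""
    subprocess.run(["sox", in_wav, "-r", "8000", "-c", "1", "tmp.gsm"], check=True)
    subprocess.run(["sox", "tmp.gsm", out_wav], check=True)
    return out_wav

def conditioning_logmel(wav_8k_path):
    """80-dim log-mel vectors at an 80 Hz frame rate from 8 kHz audio."""
    y, sr = librosa.load(wav_8k_path, sr=8000)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400,        # 50 ms window at 8 kHz
        win_length=400,
        hop_length=100,   # 12.5 ms step -> 80 frames per second
        window="hann",
        n_mels=80,
        fmin=125.0,
        fmax=sr / 2,      # Nyquist frequency of the 8 kHz input
    )
    return np.log(mel + 1e-6)  # log floor is an assumption
```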

3.2 Results

  We evaluate our model with a MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) [20] listening test. Each listener is presented with a labeled 24 kHz ground-truth reference and several unlabeled test items: the 24 kHz reference, AMR-WB-encoded audio, GSM-FR-encoded audio (the low-quality anchor), 8 kHz audio (downsampled with sox using default settings), WaveNet audio predicted at 24 kHz from the 8 kHz audio, and WaveNet audio predicted at 24 kHz from the GSM-FR audio.

  Raters were asked to score each test utterance between 0 and 100 using a slider, with equally sized regions of the slider labeled "bad", "poor", "fair", "good" and "excellent". Raters are expected to score the hidden reference at or near 100 points, and the anchor stimulus should receive the lowest score. Usually, MUSHRA evaluations are conducted by a small number of trained evaluators. However, the raters used in this evaluation were untrained, and therefore each utterance was scored by 100 different raters to ensure very narrow error bars.
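As an illustration of how such ratings might be aggregated, the sketch below computes a per-condition mean score and a normal-approximation 95% confidence interval across raters; this is a generic aggregation, not the authors' analysis script.

```python
import numpy as np

def mushra_summary(scores):
    """scores: dict mapping condition name -> iterable of 0-100 ratings (one per rater).

    Returns mean and half-width of a 95% confidence interval; with ~100 raters
    per utterance these intervals are narrow.
    """
    summary = {}
    for condition, s in scores.items():
        s = np.asarray(s, dtype=float)
        mean = s.mean()
        ci95 = 1.96 * s.std(ddof=1) / np.sqrt(len(s))
        summary[condition] = (mean, ci95)
    return summary

# Hypothetical usage with made-up ratings, just to show the data shape:
# mushra_summary({"hidden reference": ref_ratings, "WaveNet GSM-FR": wn_ratings})
```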

Figure 3: Our models (WaveNet 8kHz and WaveNet GSM-FR), trained on 8 kHz GSM-FR audio and evaluated on uncompressed 8 kHz audio and 8 kHz GSM-FR audio using the MUSHRA listening test methodology. The models are compared against the original 24 kHz and 8 kHz audio and against the AMR-WB 16 kHz and GSM-FR 8 kHz codecs.

  The MUSHRA tests show that the model predicting 24 kHz audio directly from 8 kHz audio performs slightly better than the AMR-WB codec, while the model predicting 24 kHz audio from 8 kHz GSM-FR-coded audio performs only slightly worse than AMR-WB.

  The samples for the listening test were selected from the LibriTTS test-clean corpus. For each speaker in the test set, a 3-4 second utterance was randomly selected, yielding 36 randomly chosen utterances divided across 8 MUSHRA listening tests.

  The MUSHRA listening test results are shown in Figure 3.

  Finally, to visually illustrate the quality of the reconstructed samples, Figure 1 shows spectrograms of an utterance from the LibriTTS corpus: the original audio, the reconstructed audio, and the GSM-FR audio.

 Figure 1: Spectrograms of an utterance from the LibriTTS corpus. Top: the original audio. Middle: audio reconstructed by the WaveNet model from the spectrogram of the GSM-FR audio. Bottom: the GSM-FR audio.

4. Summary

   We presented a new WaveNet-based model for speech bandwidth extension. The model can reconstruct 24 kHz audio from 8 kHz signals with quality similar to or better than that produced by the AMR-WB codec at 16 kHz. Our upsampling method produces HD-Voice-quality audio from standard telephone-quality and GSM-quality audio, showing that our audio super-resolution approach is a viable way to improve the audio quality of existing telephone systems. For future work, other architectures such as WaveRNN could be evaluated on the same task to improve computational efficiency.

5. References

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, WaveNet: A generative model for raw audio. in SSW, 2016, p. 125. 

[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779-4783.

[3] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, WaveNet based low rate speech coding, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 676-680.

[4] C. Garbacea, A. van den Oord, Y. Li, F. S. C. Lim, A. Luebs, O. Vinyals, and T. C. Walters, Low bit-rate speech coding with VQ-VAE and a WaveNet decoder, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[5] E. R. Larsen and R. M. Aarts, Audio Bandwidth Extension: Application of Psychoacoustics, Signal Processing and Loudspeaker Design. USA: John Wiley & Sons, Inc., 2004.

[6] V. Kuleshov, S. Z. Enam, and S. Ermon, Audio super resolution using neural networks, arXiv preprint arXiv:1708.00853, 2017.

[7] ETSI, GSM Full Rate Speech Transcoding, European Digital Cellular Telecommunications System, Tech. Rep. 06.10, Feb. 1992, version 3.2.0. [Online]. Available: https://www.etsi.org/deliver/etsi_gts/06/0610/03.02.00_60/gsmts_0610sv030200p.pdf

[8] J. Abel and T. Fingscheidt, Artificial speech bandwidth extension using deep neural networks for wideband spectral envelope estimation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP, pp. 1-1, Oct. 2017.

[9] Z.-H. Ling, Y. Ai, Y. Gu, and L.-R. Dai, Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 5, pp. 883-894, 2018.

[10] Y. Gu and Z.-H. Ling, Waveform modeling using stacked dilated convolutional neural networks for speech bandwidth extension, in INTERSPEECH, 2017, pp. 1123-1127.

[11] 3GPP, Mandatory speech CODEC speech processing functions; AMR speech CODEC; General description, 3rd Generation Partnership Project (3GPP), Technical Specification (TS) 26.071, June 2018, version 15.0.0. [Online]. Available: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1386

[12] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, Parallel WaveNet: Fast high-fidelity speech synthesis, in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 2018, pp. 3918-3926.

[13] R. Prenger, R. Valle, and B. Catanzaro, WaveGlow: A flow-based generative network for speech synthesis, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[14] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, Efficient neural audio synthesis, in International Conference on Machine Learning, 2018, pp. 2415-2424.

[15] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications, in International Conference on Learning Representations (ICLR), 2017.

[16] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, LibriTTS: A corpus derived from librispeech for text-to-speech, arXiv preprint arXiv:1904.02882, 2019.

[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206-5210.

[18] D. P. Kingma and J. Ba, ADAM: A method for stochastic optimization, in International Conference on Learning Representations (ICLR), 2015.

[19] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-datacenter performance analysis of a tensor processing unit, in International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1-12.

[20] International Telecommunication Union, Method for the subjective assessment of intermediate sound quality (MUSHRA), ITU-R Recommendation BS.1534-1, Tech. Rep., 2001.

 
