[IT] [2018.02] Robust phase-based speech signal processing

This article is a doctoral thesis from the University of Sheffield (author: Erfan Loweimi), 304 pages in total.

Fourier analysis plays a key role in speech signal processing. As a complex quantity, the Fourier transform can be expressed in polar form using the magnitude and phase spectra. The magnitude spectrum is used in almost every corner of speech processing. The phase spectrum, however, is not an obviously appealing starting point for processing the speech signal. Unlike the magnitude spectrum, whose fine and coarse structures have a clear relation to speech perception, the phase spectrum is difficult to interpret and manipulate. In fact, it shows no meaningful trend or extrema that might facilitate the modelling process. Nonetheless, the speech phase spectrum has recently gained renewed attention, and an expanding body of work shows that it can be usefully employed in a multitude of speech processing applications. Now that the potential of phase-based speech processing has been established, there is a need for a fundamental model to help understand how phase encodes speech information.

In this thesis, a novel phase-domain source-filter model is proposed that allows the vocal tract (filter) and excitation (source) components of speech to be deconvolved through phase processing. The model utilises the Hilbert transform, shows how the excitation and vocal tract elements mix in the phase domain, and provides a framework for efficiently segregating the source and filter components through phase manipulation. To investigate the efficacy of the suggested approach, a set of features is extracted from the filter part of the phase for automatic speech recognition (ASR), and the source part of the phase is used for fundamental frequency estimation. Accuracy and robustness in both cases are illustrated and discussed.
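
The Hilbert-transform relation behind such a model can be illustrated with the textbook minimum-phase construction: for a minimum-phase spectrum, the phase is fully determined by the log-magnitude, and folding the real cepstrum is a discrete form of that Hilbert transform. The NumPy sketch below is a generic construction, not the thesis's code; the low/high-quefrency split in the hypothetical `split_phase` helper and its `cutoff` parameter are assumptions used only to show that filter and source contributions add in the phase domain.

```python
import numpy as np


def folded_cepstrum(log_mag: np.ndarray) -> np.ndarray:
    """Causal (minimum-phase) cepstrum from a full-length 1-D log-magnitude spectrum.

    Folding the anticausal half of the real cepstrum onto the causal half is the
    cepstral form of the Hilbert-transform relation between log-magnitude and
    phase for a minimum-phase spectrum.
    """
    n_fft = log_mag.shape[-1]
    if n_fft % 2:
        raise ValueError("use an even FFT length")
    real_cep = np.fft.ifft(log_mag).real      # real cepstrum (even in quefrency)
    weights = np.zeros(n_fft)
    weights[0] = 1.0                          # keep c[0] once
    weights[1:n_fft // 2] = 2.0               # fold negative quefrencies onto positive
    weights[n_fft // 2] = 1.0                 # Nyquist quefrency is its own mirror
    return real_cep * weights


def minimum_phase(log_mag: np.ndarray) -> np.ndarray:
    """Minimum-phase phase spectrum (radians) implied by a log-magnitude spectrum."""
    return np.fft.fft(folded_cepstrum(log_mag)).imag


def split_phase(log_mag: np.ndarray, cutoff: int):
    """Hypothetical source-filter split of the minimum-phase phase.

    Quefrencies below `cutoff` are treated as the vocal-tract (filter) part and
    the rest as the excitation (source) part. Because the FFT is linear, the two
    phase components sum to the full minimum-phase phase.
    """
    cep = folded_cepstrum(log_mag)
    low = np.where(np.arange(cep.shape[-1]) < cutoff, cep, 0.0)
    high = cep - low
    return np.fft.fft(low).imag, np.fft.fft(high).imag


# Usage: a one-zero filter with its zero at z = 0.5 (inside the unit circle) is
# minimum phase, so its phase can be recovered from its magnitude alone.
n_fft = 512
spec = np.fft.fft(np.array([1.0, -0.5]), n_fft)
recovered = minimum_phase(np.log(np.abs(spec)))
print(np.allclose(recovered, np.angle(spec), atol=1e-8))  # True
```

Real speech frames are generally not minimum phase, so in practice this construction yields only the minimum-phase component of a frame; the choice of `cutoff` here is arbitrary.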

In addition, the proposed approach is improved by replacing the log with the generalised logarithmic function in the Hilbert transform and by computing the group delay via a regression filter.

Furthermore, the statistical distribution of the phase spectrum and of its representations along the feature extraction pipeline is studied. It is shown that the phase spectrum has a bell-shaped distribution. Statistical normalisation methods such as mean-variance normalisation, Laplacianisation, Gaussianisation and histogram equalisation are successfully applied to the phase-based features and lead to a significant improvement in robustness.
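
The processing steps named above can each be sketched in a few lines of NumPy. Three assumptions are made: the generalised logarithm is taken in the Box-Cox form (x^α − 1)/α, which tends to log x as α → 0; the regression filter is the familiar delta-coefficient formula applied across frequency bins of the unwrapped phase; and Gaussianisation is implemented as rank-based histogram equalisation. The function names and the `alpha` and `half_width` parameters are illustrative and not taken from the thesis.

```python
from statistics import NormalDist

import numpy as np


def generalised_log(x, alpha: float):
    """Box-Cox-style generalised log (x**alpha - 1) / alpha; equals log(x) at alpha = 0."""
    x = np.asarray(x, dtype=float)
    if alpha == 0.0:
        return np.log(x)
    return np.expm1(alpha * np.log(x)) / alpha   # expm1 keeps small alpha accurate


def regression_delta(x: np.ndarray, half_width: int = 2) -> np.ndarray:
    """Slope of x per index via the delta-regression formula, edges padded by replication."""
    offsets = np.arange(1, half_width + 1)
    padded = np.pad(x, half_width, mode="edge")
    length = x.shape[0]
    num = np.zeros(length)
    for i in offsets:
        ahead = padded[half_width + i: half_width + i + length]
        behind = padded[half_width - i: half_width - i + length]
        num += i * (ahead - behind)
    return num / (2.0 * np.sum(offsets ** 2))


def group_delay(frame: np.ndarray, n_fft: int, half_width: int = 2) -> np.ndarray:
    """Group delay in samples, -d(phase)/d(omega), using a regression filter over bins."""
    phase = np.unwrap(np.angle(np.fft.rfft(frame, n_fft)))
    bin_spacing = 2.0 * np.pi / n_fft            # radians per frequency bin
    return -regression_delta(phase, half_width) / bin_spacing


def mean_variance_normalise(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-dimension zero mean and unit variance over frames (rows)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)


def histogram_equalise(feats: np.ndarray, inv_cdf=NormalDist().inv_cdf) -> np.ndarray:
    """Map each dimension onto a target distribution through its empirical CDF.

    The default target is a standard normal (Gaussianisation); passing a Laplace
    inverse CDF instead would give a Laplacianisation-style mapping.
    """
    n_frames = feats.shape[0]
    ranks = np.argsort(np.argsort(feats, axis=0), axis=0)   # rank 0..T-1 per column
    probs = (ranks + 0.5) / n_frames                          # keep strictly inside (0, 1)
    return np.vectorize(inv_cdf)(probs)


# Usage: a pure 3-sample delay has a linear phase, so its group delay is 3 samples
# (exact away from the edges, where replicated padding bends the regression).
frame = np.zeros(64)
frame[3] = 1.0
print(np.allclose(group_delay(frame, 512)[2:-2], 3.0))  # True
```

Compared with a plain first difference, the regression filter averages the slope over 2 × `half_width` neighbouring bins, which is one way of smoothing the spiky group delay that zeros close to the unit circle produce.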

The robustness gain achieved through statistical normalisation and the generalised logarithmic function encouraged the use of more advanced model-based statistical techniques such as the vector Taylor series (VTS). VTS in its original formulation assumes that the log function is used for compression. To take advantage of VTS and the generalised logarithmic function simultaneously, a new formulation is first developed that merges both into a unified framework called generalised VTS (gVTS). To further leverage the gVTS framework, a novel channel noise estimation method is also developed. The extensions of the gVTS framework and the proposed channel estimation to the group delay domain are then explored: the problems they present are analysed and discussed, some solutions are proposed, and finally the corresponding formulae are derived. Moreover, the effects of additive noise and channel distortion in the phase and group delay domains are scrutinised, and the results are used in deriving the gVTS equations. Experimental results on the Aurora-4 ASR task, using an HMM/GMM setup along with a DNN-based bottleneck system in the clean and multi-style training modes, confirm the efficacy of the proposed approach in dealing with both additive and channel noise.
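
For orientation, the kind of mismatch function that VTS linearises can be sketched per frequency bin. The sketch assumes the usual power-domain model |Y|² ≈ |X|²|H|² + |N|² (the cross term involving the relative phase of speech and noise is dropped) and compresses every quantity with a Box-Cox-style generalised log, (u^α − 1)/α; with `alpha = 0` it reduces to the conventional log-spectral VTS mismatch y = x + h + log(1 + exp(n − x − h)). It is a hypothetical illustration of first-order VTS compensation of a diagonal Gaussian, not the gVTS equations derived in the thesis, which also cover the group delay domain; all function names are invented for this example.

```python
import numpy as np


def _power(z, alpha):
    """Undo the compression z = (p**alpha - 1) / alpha to get p (exp when alpha = 0).

    Requires 1 + alpha * z > 0 when alpha != 0.
    """
    if alpha == 0.0:
        return np.exp(z)
    return np.exp(np.log1p(alpha * z) / alpha)


def mismatch(x, h, n, alpha=0.0):
    """Compressed noisy feature y from compressed clean speech x, channel h, noise n.

    Returns y and the element-wise Jacobians dy/dx and dy/dn.
    """
    speech = _power(x, alpha) * _power(h, alpha)   # |X|^2 |H|^2
    noise = _power(n, alpha)                       # |N|^2
    total = speech + noise                         # |Y|^2 (phase cross term ignored)
    if alpha == 0.0:
        y = np.log(total)
        jac_x = speech / total
        jac_n = noise / total
    else:
        y = np.expm1(alpha * np.log(total)) / alpha
        scale = total ** (alpha - 1.0)             # dy / d(total)
        jac_x = scale * speech / (1.0 + alpha * x) # d(speech)/dx = speech / (1 + alpha x)
        jac_n = scale * noise / (1.0 + alpha * n)  # d(noise)/dn = noise / (1 + alpha n)
    return y, jac_x, jac_n


def vts_compensate(mu_x, var_x, mu_n, var_n, mu_h, alpha=0.0):
    """First-order VTS: expand around the means and propagate diagonal variances.

    The channel is treated as deterministic, so only speech and noise variances enter.
    """
    mu_y, g, f = mismatch(mu_x, mu_h, mu_n, alpha)
    var_y = g ** 2 * var_x + f ** 2 * var_n
    return mu_y, var_y


# Usage with made-up per-bin statistics of a clean-speech Gaussian and the noise.
mu_x = np.array([2.0, 0.5, -1.0])
var_x = np.full(3, 0.3)
mu_n = np.array([-3.0, 0.0, 1.0])
var_n = np.full(3, 0.1)
mu_h = np.zeros(3)
mu_y, var_y = vts_compensate(mu_x, var_x, mu_n, var_n, mu_h)
print(mu_y, var_y)
```

In the log case the two Jacobians sum to one, so each compensated variance is a weighted mix of the speech and noise variances, dominated by whichever source is stronger in that bin.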

  1. Introduction
  2. Background and related work
  3. Phase information
  4. Phase-domain source-filter separation
  5. Generalised VTS for robust ASR in the phase/group delay domain
  6. Conclusions and future prospects
    Appendix A: Hilbert transform
    Appendix B: Generalised vector Taylor series (gVTS) method for robust ASR
    Appendix C: Channel noise estimation based on generalised VTS
    Appendix D: Deep neural networks for ASR
    Appendix E: Description of the databases used
    Appendix F: Review of feature extraction techniques

Source: blog.csdn.net/weixin_42825609/article/details/104268598