Transformer-Based Acoustic Modeling for Hybrid Speech Recognition

1. Abstract

This paper studies transformer-based acoustic models for hybrid speech recognition: how different position encoding methods compare, how to configure deep models trained with an iterated loss, and how the model behaves in streaming applications where the right context is limited. Combined with 4-gram language-model rescoring, a relative improvement of 19%-26% is obtained.

2. Background introduction

  • Hybrid architecture
    The acoustic encoder maps the input sequence x1,...,xt into high-level vector representations z1,...,zt, and from these encodings the model produces, for each frame, posterior probabilities over the HMM states of the different phonemes; during decoding these posteriors are combined with the lexicon and pronunciation models. Compared with end-to-end models, the components are trained separately, but in the authors' experience this framework works better on real-world problems and can incorporate external knowledge (e.g., a personalized lexicon).


  • Self-Attention and Multi-Head Attention
    Attention weights are computed per head, and the head outputs are concatenated to form the multi-head attention output. To support streaming, each frame may attend to only a limited amount of right context; the attention logits for all other future positions are masked to negative infinity before the softmax. (A minimal sketch of this masking appears at the end of this section.)

  • Transformer framework
    Residual connections are used around both the MHA and FFN sublayers; layer normalization is applied three times per layer (including before the MHA and FFN sublayers), and GELU is used as the activation function.

  • Position encoding
    Sinusoidal positional embedding (absolute position encoding).
    Frame stacking (stacking several context vectors together): the current frame is stacked with the following 8 frames, and the result is subsampled with stride 2 before being fed to the model.
    Convolutional embedding (similar in spirit to frame stacking, and likewise an implicit, relative encoding of position): two VGG blocks are used, giving a 20 ms stride with 80 ms of left context and 80 ms of right context, consistent with frame stacking.

  • Training Deep Transformers
    An iterated loss is used: the outputs of some intermediate transformer layers also feed auxiliary cross-entropy losses, which are added to the final loss with an interpolation weight; the linear layers used to compute these auxiliary softmaxes are discarded after training. (A sketch of this setup follows below.)
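As a concrete illustration of how such an iterated loss can be wired up, here is a minimal PyTorch sketch (not the authors' implementation). The tapped layers 6/12/18 match the experiments below; the layer count, model width, output inventory size, and the 0.3 auxiliary weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IteratedLossEncoder(nn.Module):
    """Transformer stack that taps intermediate layers for auxiliary
    cross-entropy losses; the auxiliary projections are only needed
    during training and can be discarded afterwards."""
    def __init__(self, d_model=512, nhead=8, num_layers=24,
                 num_targets=8000, tap_layers=(6, 12, 18)):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048)
             for _ in range(num_layers)])
        self.tap_layers = set(tap_layers)
        # throwaway projections for the auxiliary softmaxes
        self.aux_proj = nn.ModuleDict(
            {str(i): nn.Linear(d_model, num_targets) for i in tap_layers})
        self.final_proj = nn.Linear(d_model, num_targets)

    def forward(self, x):                       # x: (time, batch, d_model)
        aux_logits = []
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i in self.tap_layers:
                aux_logits.append(self.aux_proj[str(i)](x))
        return self.final_proj(x), aux_logits

def iterated_loss(final_logits, aux_logits, targets, aux_weight=0.3):
    """Frame-level CE on the final output plus weighted CE on each tap."""
    def ce(logits):
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
    loss = ce(final_logits)
    for logits in aux_logits:
        loss = loss + aux_weight * ce(logits)
    return loss

# Example (shapes only): 100 frames, batch of 4, small model for speed
enc = IteratedLossEncoder(num_layers=6, tap_layers=(3,))
feats = torch.randn(100, 4, 512)
labels = torch.randint(0, 8000, (100, 4))
loss = iterated_loss(*enc(feats), labels)
```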

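The right-context masking mentioned in the Self-Attention bullet above can be sketched as follows. This is a minimal illustration (the function name and the 2-frame window are ours, not from the paper): each query frame may attend to the full left context but at most `right_context` future frames, and all other attention logits are set to negative infinity before the softmax.

```python
import torch

def make_streaming_mask(seq_len: int, right_context: int) -> torch.Tensor:
    """Additive attention mask: unlimited left context, at most
    `right_context` future frames; everything else gets -inf."""
    idx = torch.arange(seq_len)
    allowed = (idx[None, :] - idx[:, None]) <= right_context
    mask = torch.zeros(seq_len, seq_len)
    return mask.masked_fill(~allowed, float("-inf"))

# Apply the mask inside standard multi-head attention.
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4)
x = torch.randn(10, 1, 64)                       # (time, batch, features)
mask = make_streaming_mask(seq_len=10, right_context=2)
out, attn = mha(x, x, x, attn_mask=mask)
print(out.shape)                                 # torch.Size([10, 1, 64])
```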
3. Experimental results

Relative position encodings outperform both absolute position encoding and no position encoding, and the convolutional embedding works best.
With roughly matched parameter counts, and regardless of whether frame stacking or the convolutional embedding is used for position information, the transformer consistently outperforms the BLSTM baseline, by 2–4% relative on test-clean and 7–11% relative on test-other.
The iterated loss makes it possible to train deeper networks: using the outputs of layers 6/12/18 for auxiliary cross-entropy losses yields roughly 7% and 13% WER reductions on test-clean and test-other, respectively.
With the best configuration plus language-model rescoring (4-gram and NNLM), the system reaches state-of-the-art results, with WERs of 2.26/4.85 on test-clean/test-other.
When the number of right-context frames each attention layer may see is limited, accuracy improves as the right context grows. But even when the per-layer right context is small, the look-ahead accumulates across the later transformer layers, so the total amount of future context the model consumes remains large and streaming is still difficult.
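To see why, note that the per-layer look-ahead compounds across the stack: with R frames of right context per layer and L layers, the top layer's output depends on roughly L×R future frames. A quick illustrative calculation (the layer count, right-context size, and 20 ms frame shift here are assumed values, not the paper's exact configuration):

```python
def effective_lookahead_ms(num_layers: int, right_context_frames: int,
                           frame_shift_ms: float) -> float:
    """Total algorithmic latency: per-layer look-ahead compounds linearly,
    since each layer consumes outputs that themselves needed future frames."""
    return num_layers * right_context_frames * frame_shift_ms

# e.g. 24 layers, 4 future frames per layer, 20 ms frame shift
print(effective_lookahead_ms(24, 4, 20.0))  # 1920.0 ms of look-ahead
```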


Source: blog.csdn.net/pitaojun/article/details/108560681