Transformers with convolutional context for ASR

(1) Paper ideas

The original sinusoidal position encoding is replaced with an input representation learned by convolution. Compared with an absolute position representation, this relative-position-style encoding makes it easier for the later transformer layers to capture long-distance dependencies (the shallow transformer layers no longer have to spend capacity learning positional information). Concrete result: without an external LM, the model reaches a WER of 4.7% (test-clean) and 12.9% (test-other) on Librispeech.
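For reference, here is a minimal sketch (not the authors' code; the function name and sizes are illustrative) of the standard sinusoidal encoding that the paper removes, contrasted with the convolutional alternative:

```python
import math
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed absolute-position vectors added to each frame in the original transformer."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)        # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Original transformer input:   x + sinusoidal_encoding(x.size(0), d_model)
# Convolutional-context input:  conv_frontend(x)   # position information comes from
#                                                  # the convolution's receptive field
```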

(2) Model structure

[Figure: a single transformer layer (left) and the full model with convolutional context (right)]
The structure on the left is the composition of a single transformer layer; the structure on the right is the full model after adding the convolutional context:
Encoder side: K 2D-convolution blocks, each consisting of convolution + layer norm + ReLU followed by 2D max pooling, placed before the transformer layers.
Decoder side: each of the N transformer blocks uses multi-head attention over the encoder context, and the previous predictions are passed through 1D convolutions.
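A simplified sketch of the decoder-side context described above, assuming a standard PyTorch decoder stack (the class and argument names here are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvContextDecoder(nn.Module):
    """Previous predictions go through causal 1D convolutions (instead of position
    embeddings), then N transformer blocks that also attend to the encoder output."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Three 1D conv layers over the token axis, matching the configuration below.
        self.convs = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size=3) for _ in range(3)
        )
        block = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(block, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prev_tokens: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        # prev_tokens: (batch, tgt_len) token ids; enc_out: (batch, src_len, d_model)
        x = self.embed(prev_tokens).transpose(1, 2)          # (batch, d_model, tgt_len)
        for conv in self.convs:
            x = torch.relu(conv(F.pad(x, (2, 0))))           # left padding => causal conv
        x = x.transpose(1, 2)                                # (batch, tgt_len, d_model)
        tgt_len = x.size(1)
        causal = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        return self.out(self.blocks(x, enc_out, tgt_mask=causal))
```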

(3) Experimental results

The input is 80-dimensional log mel-filterbank coefficients computed every 10ms with a 25ms window, plus 3 fundamental-frequency (pitch) features.
There are 2 2D convolution blocks: each block contains two convolution layers with kernel size 3 and a max-pooling kernel of 2. The first block has 64 feature maps, the second 128. The decoder-side 1D convolution has three layers and no max pooling.
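A minimal sketch of the encoder front-end under the configuration just listed (2 blocks, 2 conv layers each, kernel size 3, max-pooling kernel 2, 64 then 128 feature maps). The layer norm after each convolution from section (2) is omitted for brevity; the shapes only illustrate the 4x downsampling of the time axis:

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions followed by 2x2 max pooling (layer norm omitted)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),      # halves both the time and frequency axes
    )

frontend = nn.Sequential(conv_block(1, 64), conv_block(64, 128))

feats = torch.randn(8, 1, 1000, 83)   # (batch, channel, frames, 80 mel + 3 pitch dims)
print(frontend(feats).shape)          # torch.Size([8, 128, 250, 20]): time reduced 4x
```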

[Table: ablation of the convolutional context and model configuration]
The first row is the convolutional-context configuration used in the paper. In the second row the convolutional context on the decoder side is replaced by absolute position encoding, and the result is noticeably worse; concatenating the two (third row) brings no improvement either. Increasing the number of encoder layers is the most important change for improving accuracy; enlarging the feed-forward (ReLU) layers of the encoder and decoder also helps, but increasing the number of attention heads in the encoder and decoder has a somewhat negative effect.
[Table: varying context size and convolution depth at a matched parameter count]
With the same number of parameters, a wider context size and a deeper convolutional stack give better results.
[Table: varying the number of encoder and decoder layers]
Increasing the number of encoder layers lets the model attend more to the speech content and ignore some noise and environmental sounds, and brings the largest improvement. The gain from adding decoder layers is smaller, but still beneficial.

[Table: comparison with other Librispeech systems]
Comparing these results with other models shows that the model in this paper improves on dev-other and test-other by 12% and 16% respectively relative to other models without an LM, indicating that the convolutional-transformer configuration is better at learning long-distance dependencies in the speech and at separating the speech content from environmental noise and other interference. On the clean sets, an LM built from external text is still needed to improve the results further.

Origin: blog.csdn.net/pitaojun/article/details/108185879