Knowing Things by Learning | A multi-level modeling method that improves Chinese speech recognition, recognized at the ISCSLP competition

Introduction: Speech is an important medium of human communication and an important bridge for human-computer interaction. Automatic speech recognition is the process by which a computer transcribes human speech signals into written text. This article presents the end-to-end Chinese speech recognition method with multi-level modeling units proposed by Netease Yidun, along with its concrete implementation.

1. Introduction to Speech Recognition

Speech is an important medium of human communication and an important bridge for human-computer interaction. Automatic Speech Recognition (ASR) is the process by which a computer transcribes human speech signals into written text.

Automatic speech recognition has been an important research topic in the machine learning community since the 1950s. Over the past few decades, speech recognition has evolved from traditional technology based on GMM-HMM to end-to-end technology. In the traditional framework, the system consists of multiple modules, including an acoustic model, a pronunciation dictionary, and a language model. An end-to-end system instead uses a single sequence-to-sequence model that directly maps the input acoustic feature sequence to a text sequence. Compared with the traditional GMM-HMM hybrid system, end-to-end methods offer a simpler training process, a simpler system composition, and good recognition accuracy, making them a hot topic in both academic research and industrial deployment.

2. Why use multi-level modeling?

Selecting the modeling unit is an important step in building an ASR system. Current end-to-end Chinese speech recognition systems usually choose Chinese characters as the model's modeling unit. However, the choice of modeling unit relates not only to the network's output but also to the characteristics of the language. Chinese has distinctive properties: it is often said that English is phonographic while Chinese is ideographic. Seeing an unfamiliar English word, we can usually still read it aloud; seeing an unfamiliar Chinese character, we may be able to guess its meaning but cannot infer its pronunciation. In other words, Chinese characters are written symbols that are largely decoupled from pronunciation. As a result, when Chinese characters are chosen directly as the modeling unit of an end-to-end Chinese speech recognition system, it is hard for the model to learn the mapping between acoustic features and pronunciation units, and the mapping from speech to characters becomes difficult to learn.

To address this problem, we propose a multi-level modeling method for end-to-end Chinese speech recognition: in addition to the Chinese character (character-level) modeling unit, we introduce the syllable (syllable-level) as a second modeling unit. Specifically, the method is built on the encoder-decoder architecture and trained with the multi-task hybrid CTC/Attention approach [1]. The CTC branch uses syllables as its modeling unit, so that the model learns the mapping from the speech feature sequence to the syllable sequence, while the Attention branch uses Chinese characters as its modeling unit and uses sequence context information together with acoustic features to convert the syllables into the final output characters. The multi-level modeling units allow the model to fuse and learn multi-level information during training, including phonological information and sequence context information, thereby improving Chinese speech recognition performance.

3. Multi-level modeling approach

Model architecture

Our model uses two levels of modeling units: Chinese character units and syllable units. In the training phase, the network is trained on data pairs (X, Y) consisting of a speech feature sequence and its text annotation. Each Chinese character can be represented by a toned syllable, so the text sequence Y can be converted into a syllable sequence S_Y with the open-source tool Python-pinyin [2]; for example, "北京天安门" ("Beijing Tiananmen") becomes the syllable sequence "bei3 jing1 tian1 an1 men2". The Chinese characters appearing in the training data are counted and numbered to obtain a Chinese character dictionary, and the syllables appearing in the training data are counted and numbered to obtain a syllable dictionary. A sketch of this preparation step is shown below.
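For concreteness, here is a minimal sketch of this step, assuming the pypinyin package from [2]; the dictionary format and the reserved blank id are our illustrative conventions, not necessarily the authors' exact setup.

```python
from pypinyin import lazy_pinyin, Style

def text_to_syllables(text):
    """Convert Chinese characters to toned syllables (tone appended as a digit)."""
    return lazy_pinyin(text, style=Style.TONE3)

def build_vocab(units):
    """Count and number the units seen in the training data.
    Id 0 is reserved for the CTC blank symbol (a common convention, assumed here)."""
    vocab = {"<blank>": 0}
    for u in units:
        vocab.setdefault(u, len(vocab))
    return vocab

text = "北京天安门"                       # "Beijing Tiananmen"
syllables = text_to_syllables(text)
print(syllables)                         # ['bei3', 'jing1', 'tian1', 'an1', 'men2']

char_vocab = build_vocab(list(text))     # Chinese character dictionary
syll_vocab = build_vocab(syllables)      # syllable dictionary
```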

The diagram below shows our multi-level modeling system architecture; a structural code sketch follows the figure. The model consists of a front-end convolution module, an Encoder module, and a Decoder module. The front-end convolution module extracts local features of the input sequence and downsamples it to reduce subsequent computational overhead. The Encoder can be a Conformer [3] network or a Transformer [4] network. A Conformer encoder layer is composed of four sublayers: a feed-forward network layer, a self-attention module, a convolution module, and a second feed-forward network layer. A Transformer encoder layer is composed of two sublayers: a self-attention module and a feed-forward network module. Each decoder layer stacks three sub-modules: a self-attention module, a multi-head attention module, and a feed-forward network module.

[Figure: Multi-level modeling system architecture]
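As a structural illustration only (not the authors' code), the Transformer variant described above maps closely onto PyTorch's stock layers; the layer counts and dimensions below are assumptions. For the Conformer variant, an off-the-shelf implementation such as torchaudio.models.Conformer could stand in for the encoder.

```python
import torch.nn as nn

d_model, n_heads, ff_dim = 256, 4, 2048   # illustrative sizes, not the paper's

# Transformer encoder layer: self-attention + feed-forward sublayers
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, ff_dim), num_layers=12)

# Decoder layer: self-attention + multi-head cross-attention + feed-forward
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, ff_dim), num_layers=6)
```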

The core points of the multi-level modeling method are:

Syllable modeling: The CTC branch uses syllables as the modeling unit. The feature sequence X passes through the convolution module and the Encoder module to obtain acoustic feature vectors; a fully connected layer maps each feature vector to a vector of syllable-dictionary size, and softmax normalization yields a per-frame probability distribution, which is combined with the target syllable sequence S_Y to compute the CTC loss.
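A hedged PyTorch sketch of this branch; the shapes, sizes, and random stand-in tensors are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, B, d_model, n_syll = 100, 8, 256, 1500   # frames, batch, model dim, syllable dict size
encoder_out = torch.randn(T, B, d_model)    # stand-in for the front-end + Encoder output

ctc_proj = nn.Linear(d_model, n_syll)                      # fully connected layer
log_probs = F.log_softmax(ctc_proj(encoder_out), dim=-1)   # per-frame distribution

S_Y = torch.randint(1, n_syll, (B, 20))                    # placeholder target syllable ids
input_lens = torch.full((B,), T, dtype=torch.long)
target_lens = torch.full((B,), 20, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)(log_probs, S_Y, input_lens, target_lens)
```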

Chinese character modeling: The Attention branch uses Chinese characters as the modeling unit. The Decoder module receives the syllable embedding vectors and the acoustic feature vectors output by the Encoder module as input. A fully connected layer maps the Decoder output to a vector of Chinese-character-dictionary size, and softmax normalization yields a probability distribution, which is combined with the target character sequence Y to compute the cross-entropy (CE) loss.
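A matching sketch of the Attention branch, under the same illustrative assumptions; during training the decoder input is the ground-truth syllable sequence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, L, B, d_model = 100, 20, 8, 256
n_syll, n_chars = 1500, 4000                       # dictionary sizes (illustrative)

encoder_out = torch.randn(T, B, d_model)           # acoustic feature vectors
syll_ids = torch.randint(1, n_syll, (L, B))        # syllable sequence S_Y
char_targets = torch.randint(0, n_chars, (L, B))   # character sequence Y

syll_embed = nn.Embedding(n_syll, d_model)         # syllable embedding vectors
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, 4, 2048), num_layers=6)
char_proj = nn.Linear(d_model, n_chars)            # map to character-dictionary size

dec_out = decoder(syll_embed(syll_ids), encoder_out)    # cross-attend to acoustics
logits = char_proj(dec_out)                             # (L, B, n_chars)
ce_loss = F.cross_entropy(logits.reshape(-1, n_chars), char_targets.reshape(-1))
```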

In the multi-level modeling method, the Decoder module receives the syllable embeddings and the acoustic feature vectors as input, outputs the Chinese character sequence, and thereby takes on the task of converting syllables to characters (syllable-to-character). Compared with predicting the next character from the characters of previous time steps and acoustic features (character-to-character), converting a syllable sequence to a character sequence is a more challenging task. We therefore introduce an auxiliary task at a middle layer of the Decoder module to facilitate the syllable-to-character conversion and improve system performance. We name this auxiliary task module InterCE.

The InterCE loss is computed from the output of a middle layer of the Decoder module: the output is passed through a linear layer and softmax to obtain a probability distribution, and the cross entropy against the annotated text is computed. The objective function of the whole network is the weighted sum of the syllable-based CTC loss, the character-based CE loss, and the InterCE loss; a sketch follows.
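A sketch of the InterCE term and the combined objective; which middle layer is tapped and the loss weights are illustrative assumptions, not the paper's reported settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

L, B, d_model, n_chars = 20, 8, 256, 4000
mid_out = torch.randn(L, B, d_model)         # stand-in for a middle Decoder layer output
char_targets = torch.randint(0, n_chars, (L, B))

inter_proj = nn.Linear(d_model, n_chars)     # linear layer; softmax is folded into CE
interce_loss = F.cross_entropy(
    inter_proj(mid_out).reshape(-1, n_chars), char_targets.reshape(-1))

ctc_loss, ce_loss = torch.rand(2)            # placeholders for the two terms sketched above
lam_ctc, lam_inter = 0.3, 0.3                # illustrative weights
loss = lam_ctc * ctc_loss + (1 - lam_ctc) * ce_loss + lam_inter * interce_loss
```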

Model inference

After training is complete, in the inference stage the Encoder module extracts acoustic features; the Encoder output passes through the linear layer and the softmax function to obtain each frame's probability distribution over the syllable dictionary, and a CTC prefix beam search produces the N-best syllable sequences. The Decoder module then takes the syllable embeddings and the acoustic feature vectors as input and outputs the final Chinese character sequence. The inference process is shown in the figure below, followed by a simplified decoding sketch.

[Figure: Model inference process]
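The sketch below illustrates the first decoding stage. For brevity it uses greedy CTC decoding as a simplified stand-in for the prefix beam search, which would return N-best syllable sequences rather than a single best path:

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """Collapse repeated frames and drop blanks along the per-frame argmax path."""
    out, prev = [], blank
    for p in log_probs.argmax(dim=-1).tolist():
        if p != blank and p != prev:
            out.append(p)
        prev = p
    return out

T, n_syll = 100, 1500
log_probs = torch.randn(T, n_syll).log_softmax(-1)  # stand-in for the CTC head output

syllable_ids = ctc_greedy_decode(log_probs)
# The Decoder would then embed `syllable_ids`, cross-attend to the Encoder's
# acoustic feature vectors, and emit the final Chinese character sequence.
```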

Experimental verification

We validate the multi-level modeling scheme on the Chinese open-source dataset AISHELL-1 [5], which contains 178 hours of Mandarin speech from 400 speakers. We evaluate the scheme with both the Conformer network and the Transformer network.

Without a language model, the Transformer-based multi-level modeling method achieves a 5.2% character error rate (CER) on AISHELL-1, outperforming baseline models based on character-only and syllable-only modeling. The Conformer-based multi-level modeling scheme achieves a 4.6% CER on AISHELL-1, better than recently published character-based benchmark models. These results show that the multi-level modeling method improves Chinese speech recognition performance.

[Tables: CER (%) on AISHELL-1 for the Transformer-based and Conformer-based systems]

We analyze the improvement brought by the InterCE loss through ablation experiments. First, we add the InterCE auxiliary task to the character-only model; the results show it brings only a slight improvement there. Second, we remove the InterCE loss from the multi-level modeling framework; the results show a 0.1% performance drop on the validation and test sets. The ablations indicate that using two-level syllable-and-character modeling units in an end-to-end model improves Chinese speech recognition, and that adding the InterCE auxiliary loss under the multi-level framework brings a further improvement.

[Table: Ablation results for the InterCE loss]

4. Summary

In this article we propose an end-to-end Chinese speech recognition method with multi-level modeling units. Through multi-level modeling, the model can fuse and learn multi-level information. In addition, we introduce the auxiliary InterCE loss to further improve the model's accuracy. In the inference stage, the input feature sequence is converted into a syllable sequence by the Encoder and the subsequent CTC branch, and the Decoder module then converts the syllable sequence into Chinese characters. The entire decoding process is completed by a single end-to-end model without introducing an additional conversion model, thereby avoiding the cumulative errors of multiple models. Our model achieves competitive performance on AISHELL-1, a widely used Chinese benchmark dataset, and outperforms recently published results.

For more details, please refer to the paper link: https://arxiv.org/abs/2205.11998.

References

[1] Watanabe S, Hori T, Kim S, et al. Hybrid CTC/attention architecture for end-to-end speech recognition[J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1240-1253.

[2] https://github.com/mozillazg/python-pinyin

[3] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020.

[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.

[5] Bu H, Du J, Na X, et al. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline[C]//2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017: 1-5.


Reprinted from: blog.csdn.net/yidunmarket/article/details/128660145