A review of end-to-end streaming speech recognition research (paper reading notes)

Summary:

  • Speech recognition is an important means of human-computer interaction and a basic component of natural language processing. With the development of artificial intelligence technology, a large number of application scenarios, such as human-computer interaction, require streaming speech recognition, in which results are output while the speech is still being input. Streaming recognition can greatly reduce the processing time of speech recognition during interaction. In academic research, end-to-end speech recognition has already produced fruitful results, but streaming speech recognition still faces challenges and difficulties both in research and in industrial applications. Over the past two years, end-to-end streaming speech recognition has therefore gradually become a research hotspot and focus in the speech field. This article surveys and analyzes recent research on end-to-end streaming recognition models and their performance optimization, covering the following content: (1) a detailed analysis and summary of end-to-end streaming speech recognition methods and models, including CTC and RNN-T models, which support streaming recognition directly, as well as methods such as monotonic attention mechanisms that modify the attention mechanism to enable streaming recognition; (2) an introduction to methods that improve the recognition accuracy and reduce the latency of end-to-end streaming models: for accuracy, mainly minimum word error rate training and knowledge distillation; for latency, mainly alignment and regularization; (3) an introduction to commonly used Chinese and English open-source datasets for streaming speech recognition and the performance evaluation criteria for streaming models; (4) a discussion of the future development and prospects of end-to-end streaming speech recognition models.

Introduction:

  • Speech recognition models have developed through three phases: from the initial models based on GMM-HMM [1], to deep neural network models based on DNN-HMM [2-4], to the current end-to-end speech recognition models [5-8]. Across these three stages, model structures have become simpler and recognition accuracy has almost reached saturation. However, most models target non-streaming speech recognition, and few evaluations of model performance consider recognition latency. In recent years, speech recognition has entered the end-to-end era, no longer relying on the modeling components used in traditional speech recognition systems for decades: a single network can directly convert the input speech sequence into the output label sequence, making the model smaller. As a result, a large number of researchers have shifted from deep neural network models to end-to-end speech recognition models. Moreover, many studies have shown that end-to-end models have surpassed DNN-HMM-based deep neural network models both in academic research [7] and in industrial production [9-10]. In the next few years, end-to-end models will be the focus of research in the field of speech recognition. Common end-to-end models include CTC [11], RNN-T [12], attention-based encoder-decoder [13-14], and LAS [8]. The first two can perform streaming recognition directly, while the latter two cannot, because their attention mechanisms need to obtain the complete acoustic sequence. Streaming speech recognition, also called real-time speech recognition, means that the model begins recognizing while the user is still speaking; in contrast, non-streaming recognition means that the model begins recognizing only after the user has finished a sentence or a paragraph.
  • With the continuous development of technology, wearable and portable smart devices, along with a large number of software applications, have become fully integrated into daily life. Applications such as input methods, online meetings, live streaming, and real-time translation all need streaming speech recognition. End-to-end streaming recognition models do not require an additional language model and are easier to deploy on devices. In addition, various human-computer interaction scenarios that require streaming recognition, such as intelligent customer service, continue to emerge, so end-to-end streaming speech recognition models will be a research hotspot in the coming years and have broad application prospects. Therefore, this article analyzes and summarizes the current state of research on end-to-end streaming speech recognition models in terms of model structure, performance optimization, commonly used Chinese and English open-source datasets, and model evaluation criteria, and then discusses future development and prospects.
  • Two related surveys in the field of speech recognition were published abroad in 2021. Literature [15] mainly summarizes the development of speech recognition model structures and performance over the past decade and predicts trends for the next decade from both research and application perspectives. Literature [16] provides a detailed overview of the development of end-to-end speech recognition models and their application in industrial production, and, from an industry perspective, focuses on the challenges and difficulties end-to-end models face in future deployment. Both articles review the development of end-to-end speech recognition from a broad, high-level perspective, whereas this article focuses on the field of end-to-end streaming speech recognition and analyzes and summarizes its current state.

1 End-to-end streaming speech recognition model

1.1 End-to-end models that directly support streaming recognition

  • Among end-to-end streaming speech recognition models, the ones that can perform streaming recognition directly mainly include connectionist temporal classification (CTC) [11], the recurrent neural network transducer (RNN-T) [12], and the recurrent neural aligner (RNA) [17]. Literature [11] proposed the connectionist temporal classification (CTC) loss function to score the transcriptions generated by the recurrent neural network in the model, allowing the model to automatically align audio frames with labels. In the development of end-to-end speech recognition, CTC was the first technique applied to end-to-end models [5-6, 18-23]. It can directly convert the input speech sequence into the output label sequence; its structure is shown in Figure 1 [16]. The input speech sequence x_t is encoded by the encoder into a feature representation h_t^enc, which then passes through a linear classifier to obtain the probability P(y_t|x_t) of each output category at each time step.
    [Figure 1: Structure of the CTC model [16]]
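To make the structure concrete, here is a minimal sketch, not taken from any of the cited papers, of a streaming CTC acoustic model: a unidirectional LSTM encoder followed by a linear classifier trained with the CTC loss. The feature dimension, layer count, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StreamingCTCModel(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=512, vocab_size=5000):
        super().__init__()
        # Unidirectional LSTM: each frame depends only on past context,
        # which is what allows frame-synchronous (streaming) output.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=4, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, x):                   # x: (batch, T, feat_dim)
        h_enc, _ = self.encoder(x)          # h_enc: (batch, T, hidden_dim)
        logits = self.classifier(h_enc)     # per-frame class scores
        return logits.log_softmax(dim=-1)   # log P(y_t | x_1:t) for each frame

# Training would apply nn.CTCLoss to these per-frame log-probabilities,
# which marginalizes over all alignments between frames and labels.
```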
  • The CTC model can achieve streaming speech recognition by using a unidirectional RNN in the encoder. Literature [12] proposed the recurrent neural network transducer (RNN-T), which lends itself naturally to streaming recognition because each output depends on the previously emitted label sequence and on the speech input up to the current step, i.e. P(y_u | x_1:t, y_1:u-1), thereby removing the conditional independence assumption of CTC. Thanks to these natural streaming properties, it has been widely used in this field [9, 24-31].
    [Figure 2: Structure of the RNN-T model [16]]
  • The structure of the RNN-T model is shown in Figure 2 [16]. It contains an encoder network, a prediction network, and a joint network. The encoder converts the input speech sequence x_t into a high-level feature representation h_t^enc; the prediction network generates a high-level representation h_u^pre from the previously output labels y_1:u-1; and the joint network is a feedforward network that takes h_t^enc and h_u^pre as inputs and outputs z_{t,u}.
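As a reading aid, the following is a minimal sketch, under assumed dimensions and a commonly used tanh combination rather than the exact formulation of [12] or [16], of how the joint network combines the encoder output h_t^enc and the prediction network output h_u^pre into z_{t,u}.

```python
import torch
import torch.nn as nn

class RNNTJointNetwork(nn.Module):
    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=512, vocab_size=5000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)   # +1 for the blank label

    def forward(self, h_enc, h_pre):
        # h_enc: (batch, T, enc_dim)  acoustic representations from the encoder
        # h_pre: (batch, U, pred_dim) label representations from the prediction network
        z = torch.tanh(self.enc_proj(h_enc).unsqueeze(2)        # (batch, T, 1, joint_dim)
                       + self.pred_proj(h_pre).unsqueeze(1))    # (batch, 1, U, joint_dim)
        return self.out(z)   # z_{t,u}: (batch, T, U, vocab_size + 1)

# The RNN-T loss then marginalizes over all paths through this (T, U) lattice.
```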
  • To address the conditional independence assumption of CTC, literature [17] proposed a new model, the recurrent neural aligner (RNA). Like CTC, this model defines a probability distribution over the target label sequence, including blank labels at each input time step, and the probability of a label sequence is computed by marginalizing over all possible blank label positions. However, the model does not make CTC's conditional independence assumption between label predictions. In addition, it predicts one output label at each input time step instead of predicting multiple labels per step as RNN-T does, which simplifies beam-search decoding and makes training more efficient. For streaming speech recognition tasks, it has been successfully applied to a variety of spoken language recognition tasks [32].

1.2 Improved end-to-end models that enable streaming recognition

  • Among end-to-end speech recognition models, attention-based models [33-36] cannot directly perform streaming recognition because of their inherent characteristics, even though they have proven effective in machine translation [37-38], speech recognition [34, 39], and other fields. In this structure, the encoder first encodes the entire input sequence and generates the corresponding hidden state sequence; the decoder then uses the generated state sequence to make predictions and finally produces an output sequence. At present, attention-based end-to-end models have made significant progress on speech recognition tasks [34, 40], achieving the best recognition accuracy among non-streaming speech recognition models [39]. However, attention-based models cannot be applied directly to streaming speech recognition. On the one hand, these models usually need the complete acoustic sequence as input, so encoding and decoding cannot proceed simultaneously; on the other hand, speech inputs have no fixed length, and the computational complexity of the model grows quadratically with the length of the input sequence. To apply the attention mechanism to streaming speech recognition, many researchers have studied these issues, replacing global attention with local attention to decide which part of the input sequence should be encoded at time t and which part of the encoded information should be decoded. The proposed approaches include methods based on monotonic attention mechanisms [41-45], chunk-wise (block-based) methods [46-51], methods based on the accumulation of information [52-55], and triggered attention [56-58].

1.2.1 Methods based on monotonic attention mechanism

  • Literature [42] proposed a local monotonic attention mechanism with two properties: locality, which helps the model's attention module focus on the part of the input sequence that the decoder currently wants to transcribe, and monotonicity, which generates alignments strictly from the start to the end of the input sequence. The mechanism forces the model to predict a center position at each decoding step and compute soft attention weights only around that center. However, it is difficult to accurately predict the next center position from such limited information, and compared with soft attention, hard monotonicity constraints limit the expressive power of the model. Literature [43] proposed monotonic chunk-wise attention (MoChA) to narrow the performance gap between soft and hard attention. It adaptively segments the encoded state sequence into small chunks based on predicted selection probabilities, as shown in Figure 3 [43], where chunk boundaries are indicated by dotted lines, and performs soft attention within each chunk. However, its training process is complex and difficult, which makes it hard to implement in practice.
    [Figure 3: Monotonic chunk-wise attention (MoChA); chunk boundaries are indicated by dotted lines [43]]
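The following is a simplified, inference-time sketch of the hard monotonic selection plus chunk-wise soft attention idea. The energy functions, the chunk size, and the 0.5 selection threshold are illustrative assumptions; MoChA's actual training computes expected attention in closed form and is considerably more involved.

```python
import torch

def mocha_infer_step(dec_state, enc_states, start, chunk_size,
                     select_energy, chunk_energy):
    """Scan encoder states left to right from `start`; stop at the first frame
    whose selection probability exceeds 0.5, then soft-attend over the last
    `chunk_size` frames ending at that position."""
    T = enc_states.size(0)
    for j in range(start, T):
        p_select = torch.sigmoid(select_energy(dec_state, enc_states[j]))
        if p_select > 0.5:                      # hard, monotonic selection decision
            lo = max(0, j - chunk_size + 1)
            chunk = enc_states[lo:j + 1]        # frames inside the chunk
            w = torch.softmax(chunk_energy(dec_state, chunk), dim=0)
            context = (w.unsqueeze(-1) * chunk).sum(dim=0)
            return context, j                   # j becomes the next step's start
    return None, T                              # wait for more encoder frames
```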
  • Literature [44] proposed monotonic multihead attention (MMA), which combines the advantages of multi-layer multihead attention and monotonic attention, along with two variants: Hard MMA (MMA-H), designed with streaming systems in mind where the attention span must be limited, and Infinite Lookback MMA (MMA-IL), which emphasizes recognition quality. Literature [45] modified several model variants that apply local monotonic attention and conducted a comprehensive comparison of these models, finally implementing a simple and effective heuristic that performs local attention with a fixed-size window.
1.2.2 Block-based (chunk-wise) methods
  • Literature [46] proposed the Neural Transducer, which computes the distribution of the next output based on a partially observed input sequence and a partially generated output sequence. An encoder processes the input, and the processed result is fed to the transducer; at each time step, the transducer decides, based on the input block processed by the encoder, whether to generate zero or more output labels, thereby enabling streaming decoding. However, because the model is bound by the time-dependent nature of recurrent neural networks, it only optimizes an approximately optimal alignment path for the corresponding chunk sequence. Literature [47] replaced the RNN modules in the RNN-T structure with self-attention modules and proposed the self-attention transducer (SAT), which uses self-attention blocks to model long-term dependencies within a sequence. A chunk-flow mechanism is also introduced, which limits the range of self-attention by applying sliding windows and stacks multiple self-attention blocks to model long-term dependencies. Overall, although the chunk-flow mechanism helps SAT achieve streaming decoding, it still causes a drop in recognition accuracy. Literature [49] therefore proposed the synchronous transformer (Sync-Transformer), which can encode and decode simultaneously; its structure and inference process are shown in Figure 4 [49]. The Sync-Transformer deeply combines the transformer with SAT. To eliminate the self-attention mechanism's dependence on future frames, it forces each node in the encoder to attend only to the left context and to ignore the right context entirely. As soon as the encoder produces a fixed-length block of state sequences, the decoder immediately starts predicting labels.
    [Figure 4: Structure and inference process of the Sync-Transformer [49]]
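A common way to realize this kind of chunk-restricted self-attention in practice is an attention mask that lets each frame attend to its own chunk and to all earlier chunks while blocking future chunks. The sketch below is an assumed, generic construction rather than the exact mask used by SAT or the Sync-Transformer (a strictly causal variant would simply use j <= i).

```python
import torch

def chunk_causal_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """mask[i, j] is True where query frame i may attend to key frame j."""
    frame_idx = torch.arange(num_frames)
    chunk_idx = frame_idx // chunk_size                 # chunk index of every frame
    # allow attention to any frame whose chunk is not later than the query's chunk
    return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)

# Example: with chunk_size=4, frame 5 (chunk 1) may attend to frames 0-7 but not 8+.
```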
1.2.3 Methods based on information accumulation
  • Literature [53] proposed the adaptive computation time (ACT) algorithm, which allows an RNN to learn how many computation steps to take between accepting an input and producing an output, laying the foundation for subsequent research on adaptive computation steps. Literature [54] proposed a novel adaptive computation steps (ACS) algorithm, which enables an end-to-end speech recognition model to dynamically decide how many frames should be processed before predicting a linguistic output. On the one hand, the model computes a halting probability at each encoder time step during the thinking interval and summarizes a context vector in the manner of a soft-attention-based model; on the other hand, it continuously checks the accumulation of halting probabilities and makes an immediate output decision once the sum reaches a threshold. Literature [55] proposed the decoder-end adaptive computation steps (DACS) algorithm to address the fact that the standard transformer cannot be used directly for streaming recognition. The algorithm transforms the decoding of transformer ASR by accumulating a confidence value derived from the encoder states and triggering an output once the confidence reaches a certain threshold. A maximum look-ahead step is introduced to limit the number of time steps each DACS layer can view at each output step, preventing it from reaching the end of the speech too quickly and keeping the asynchronous behavior of the multiple attention heads in the transformer decoder from destroying the stability of online decoding. Inspired by the integrate-and-fire model in spiking neural networks, literature [56] proposed a new soft and monotonic alignment mechanism for sequence transduction, continuous integrate-and-fire (CIF), which supports various online recognition tasks as well as acoustic boundary localization. At each encoder step, CIF receives the vector representation of the current step and a corresponding weight that scales the amount of information contained in that vector; the weights are accumulated forward and the vector information is integrated until the accumulated weight reaches a threshold, at which point an acoustic boundary is located. If the acoustic information of the current encoder step is shared by two adjacent labels, CIF divides it into two parts: one part completes the integration of the current label, and the other is used for the integration of the next label. When a firing is triggered at some point during encoding, the integrated acoustic information is passed to the decoder to predict the current label, as shown in Figure 5 [56], where each dotted line represents a firing, until the entire acoustic sequence has been encoded. Literature [57] proposed the memory-self-attention transducer (MSAT), whose structure is shown in Figure 6 [57]. The MSA module adds historical information to restricted self-attention units and effectively models long-term context by attending to memory states; the MSA modules are trained with the RNN-T loss, enabling this structure to be applied to streaming tasks.


    [Figure 5: The continuous integrate-and-fire (CIF) mechanism; each dotted line represents a firing [56]]
    [Figure 6: Structure of the memory-self-attention transducer (MSAT) [57]]
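The weight accumulation and firing procedure of CIF described above can be sketched as follows. The threshold of 1.0 and the assumption that per-frame weights have already been computed (for example by a sigmoid over the encoder outputs) are simplifications for illustration, not the exact implementation of [56].

```python
import torch

def cif(enc_states, alphas, threshold=1.0):
    """enc_states: (T, D) encoder outputs; alphas: (T,) per-frame information weights.
    Returns a list of integrated (fired) vectors, one per predicted label."""
    fired = []
    acc_w = 0.0                                    # accumulated weight
    acc_v = torch.zeros(enc_states.size(1))        # integrated acoustic vector
    for h_t, a_t in zip(enc_states, alphas):
        a_t = float(a_t)
        if acc_w + a_t < threshold:                # keep integrating
            acc_w += a_t
            acc_v = acc_v + a_t * h_t
        else:                                      # acoustic boundary located: fire
            part = threshold - acc_w               # share of a_t that completes this label
            fired.append(acc_v + part * h_t)
            acc_w = a_t - part                     # the remainder starts the next label
            acc_v = acc_w * h_t
    return fired
```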
1.2.4 Other methods
  • The methods described above can achieve streaming speech recognition, but each has its problems: methods based on the monotonic attention mechanism make training very difficult because they combine soft and hard attention; block-based methods often degrade performance because they ignore the relationships between blocks; and methods based on information accumulation break the training parallelism of the transformer and usually require longer training time [58]. Literature [59] proposed triggered attention (TA) [59-61], whose structure is shown in Figure 7 [59]. The TA decoder consists of a trigger model and an attention-based decoder network; the encoder network is shared by the trigger network and the attention mechanism, and the attention weights only see the encoder frames up to the trigger event plus a few look-ahead frames. During training, forced alignment of the CTC output sequence is used to derive the trigger times; during decoding, the uncertainty of the CTC-trained trigger model is taken into account to generate alternative trigger sequences and output sequences, and inference proceeds in a frame-synchronous decoding manner. In addition, some researchers have replaced the RNNs in the RNN-T structure with transformers, constructing the Transformer Transducer (TT) [62-69]; numerous studies [62-69] have shown that this structure also has strong streaming recognition ability.

2 Optimization methods and strategies for end-to-end streaming speech recognition models

  • End-to-end streaming speech recognition models are a current research hotspot and focus in the field of speech recognition. A non-streaming model needs to achieve high recognition accuracy while occupying as little memory as possible, whereas a streaming model must consider both recognition accuracy and recognition latency; these two aspects jointly determine the performance of a streaming speech recognition model. The following discusses the optimization of streaming speech recognition models from the two aspects of latency and accuracy.

2.1 How to reduce the latency of streaming speech recognition models

  • When recognizing an utterance, there are generally two kinds of latency [70]. The first is the first-token emission delay, obtained by comparing the time at which the user actually starts speaking with the time at which the recognition system actually emits the first token. The second is user-perceived latency, the period from the moment the user stops speaking until the model emits the last non-blank label.
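A small numerical illustration with hypothetical timestamps (in seconds) may help keep the two definitions apart:

```python
speech_start     = 0.00   # user starts speaking
speech_end       = 2.40   # user stops speaking
first_token_time = 0.35   # system emits its first token
last_token_time  = 2.65   # system emits its last non-blank label

first_token_emission_delay = first_token_time - speech_start   # 0.35 s
user_perceived_latency     = last_token_time - speech_end      # 0.25 s
```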
  • Recent research [70] shows that the main factors affecting the user-perceived latency of streaming speech recognition models are the model structure, the training criterion, the decoding hyperparameters, and the endpointer, while model size and computation speed do not always have a serious effect on user-perceived latency. At present, researchers mainly explore how to reduce model latency from the perspectives of training strategies, alignment, and regularization [71]. Literature [72] proposed an adaptive look-ahead method to trade off latency and word error rate, in which the size of the context window is not fixed but can be modified dynamically. Two neural components are introduced: a scout network (SN) and a recognition network (RN). The scout network detects the start and end boundaries of a word in the speech, and the recognition network performs frame-synchronous one-pass decoding by looking ahead to the predicted boundary. Although this method achieves a good trade-off between latency and accuracy, the SN does not reduce the heavy self-attention computation, which grows quadratically with the length of the left context.
  • Literature [73] proposed minimum-latency training strategies based on MoChA, using external hard alignments extracted from a hybrid model as supervision to force the model to learn accurate alignments, and proposed decoder-side latency-constrained training (DeCoT) and minimum-latency training (MinLT), which effectively reduce the model's latency. Literature [74] proposed a two-pass RNN-T+LAS model that addresses both model structure and endpointing: LAS rescores the hypotheses of RNN-T while also predicting the end-of-query (EOQ) symbol, and integrating the EOQ endpointer into the end-to-end model helps decide when to close the microphone. This method enabled an end-to-end model to surpass a conventional hybrid model in the quality-latency trade-off for the first time. Literature [75] proposed a new latency-constraint method, self-alignment, which does not require an external alignment model but instead uses Viterbi forced alignments from the model being trained to find a lower-latency alignment direction. From the perspective of latency regularization, literature [76] proposed FastEmit, a sequence-level emission regularization method for transducer-based streaming models that applies latency regularization directly to the sequence probability when training the transducer, requires no speech-word alignment information, and needs the least hyperparameter tuning compared with other regularization methods. Experiments on a large number of end-to-end models show that it achieves a good trade-off between word error rate and latency. From the above research it can be seen that many methods, such as constrained alignment and regularization, can mitigate the latency problem of streaming speech recognition models; however, although most of them reduce latency, they also degrade recognition quality, and this remains a research direction to be explored in the future.

2.2 How to improve the accuracy of streaming speech recognition models

  • Improving the accuracy of speech recognition models has always been a hot topic. Since the birth in 1988 of Sphinx [77], an early speech recognition system based on the hidden Markov model (HMM), speech recognition has moved into the end-to-end era, and researchers continue to push for further improvements in accuracy, from traditional hybrid models [78] to deep neural network models [2] to today's end-to-end models [40]. As model structures have changed, the accuracy of speech recognition models has also improved greatly. As with non-streaming models, ways to improve the accuracy of streaming models include changing the basic model structure, pre-training, expanding the data domain, minimum word error rate (MWER) training [79-83], and knowledge distillation [84-89]. Changes to the model structure were covered in Chapter 1. Literature [73] uses MoChA as the streaming speech recognition model and, on the encoder side, adopts multi-task learning and pre-training with frame-level cross-entropy targets, which improves recognition accuracy. Literature [83] proposed a novel and effective MWER training algorithm for the RNN-T model, which sums the scores of all possible alignments for each hypothesis in the N-best list and uses them to compute the expected edit distance between the reference and the hypothesis; when an end-of-sentence (EOS) token is added for the endpointer (EP), the proposed MWER training also significantly reduces high deletion errors. Literature [84] studied model compression based on knowledge distillation for training CTC acoustic models, evaluating frame-level and sequence-level knowledge distillation for the CTC model and improving recognition accuracy in experiments on the WSJ dataset. Literature [90] distills knowledge from a non-streaming bidirectional RNN-T model into a streaming unidirectional RNN-T model; the experimental results show that the unidirectional model trained with the proposed knowledge distillation achieves better accuracy than one trained with the standard method. Literature [85] studied knowledge distillation from a non-streaming to a streaming Transformer Transducer model and compared two approaches in experiments, minimizing the L2 distance between hidden vectors and minimizing the L2 distance between attention heads; the results show that distillation based on hidden-vector similarity outperforms distillation based on multi-head similarity.
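As an illustration of the hidden-vector approach, the sketch below adds an L2 term between the streaming student's per-frame hidden vectors and those of a frozen non-streaming teacher to the usual ASR loss. The loss weight and the assumption that the two models produce time-aligned hidden sequences of the same dimension are simplifications, not the exact recipes of [85] or [90].

```python
import torch
import torch.nn.functional as F

def hidden_distillation_loss(student_hidden, teacher_hidden, asr_loss, weight=1.0):
    """student_hidden, teacher_hidden: (batch, T, D) encoder hidden vectors.
    asr_loss: the usual transducer/CTC loss already computed for the student."""
    l2 = F.mse_loss(student_hidden, teacher_hidden.detach())  # teacher is not updated
    return asr_loss + weight * l2
```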

3 Datasets and Evaluation Criteria

3.1 Datasets

  • This section introduces some commonly used Mandarin Chinese and English datasets. Open-source Chinese speech recognition datasets are listed in Table 1. In 2015, the Center for Speech and Language Technology of the Research Institute of Information Technology at Tsinghua University released THCHS30 [91], the first open-source Chinese speech database, to help researchers build their first speech recognition systems; however, its total speech duration is only 35 hours, which is insufficient for model training. In 2017, Beijing Shell Shell Technology Co., Ltd. released the AISHELL-1 corpus [92], which became the largest open-source Chinese speech recognition corpus at the time, and Surfing Technology released the ST-CMDS speech dataset [93]. In 2018, Beijing Shell Shell Technology Co., Ltd. released the AISHELL-2 corpus [94], and Shanghai Primewords released the Primewords Set1 dataset. In 2019, Datatang (Beijing) Technology Co., Ltd. open-sourced the Mandarin Chinese speech dataset DTZH1505 [93], which records natural speech from 6,408 speakers across 33 provinces and China's eight major dialect regions, with a total duration of 1,505 hours; the corpus covers social chat, human-computer interaction, intelligent customer service, in-vehicle commands, and more [93], and is currently the largest and most comprehensive open-source Chinese speech dataset.
    [Table 1: Commonly used open-source Chinese speech recognition datasets]

3.2 Evaluation metrics

  • The performance of an end-to-end streaming recognition model is evaluated mainly in terms of recognition accuracy and recognition latency. For accuracy, the model is evaluated with the word error rate (WER) or character error rate (CER) of a sentence, with WER the more commonly used. Let T be the total number of words in a sentence, S the number of substituted words in the recognition result, D the number of words that appear in the correct utterance but are deleted from the recognition result [102], and I the number of inserted words that do not appear in the correct utterance but appear in the recognition result. The word error rate (WER) is then defined as:
    WER = (S + D + I) / T × 100%
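For reference, a minimal implementation of this formula via edit distance is sketched below; reference and hypothesis are lists of words (or of characters, for CER).

```python
def word_error_rate(reference, hypothesis):
    T, H = len(reference), len(hypothesis)
    # dp[i][j] = minimum number of substitutions, deletions and insertions
    # needed to turn reference[:i] into hypothesis[:j]
    dp = [[0] * (H + 1) for _ in range(T + 1)]
    for i in range(T + 1):
        dp[i][0] = i
    for j in range(H + 1):
        dp[0][j] = j
    for i in range(1, T + 1):
        for j in range(1, H + 1):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dele = dp[i - 1][j] + 1
            ins = dp[i][j - 1] + 1
            dp[i][j] = min(sub, dele, ins)
    return dp[T][H] / T                      # (S + D + I) / T

# Example: word_error_rate("the cat sat".split(), "the cat sad down".split()) == 2/3
```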
  • The lower the WER, the higher the recognition accuracy and the better the model's performance. For latency, the real-time factor (RTF) is the evaluation criterion during streaming speech recognition; when its value is less than 1, the model is said to recognize in real time. Sentence-level or word-level latency values can also be computed. Let M be the duration of a piece of audio and N the time taken to recognize it; the real-time factor (RTF) is then defined as:
    RTF = N / M
  • The smaller the value of RTF, the smaller the delay and the better the performance of the model.
