Detailed explanation of the CTC, RNA, RNN-T, Neural Transducer, and MoChA models for speech recognition (Speech Signal Processing Learning, Part 4)

References:

[1] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376.

[2] Graves A. Sequence transduction with recurrent neural networks[J]. arXiv preprint arXiv:1211.3711, 2012.

[3] Jaitly N, Le Q V, Vinyals O, et al. An online sequence-to-sequence model using partial conditioning[J]. Advances in Neural Information Processing Systems, 2016, 29.

[4] Chiu C C, Raffel C. Monotonic chunkwise attention[J]. arXiv preprint arXiv:1712.05382, 2017.

[5] Speech Recognition - CTC, RNN-T and more (Part 3). Bilibili.

[6] Notes on Li Hongyi's Human Language Processing course (new program, March 2020): CTC, RNN-T & More (4). Zhihu (zhihu.com).

[7] Study notes on the CTC, RNN-T, Neural Transducer, and MoChA speech recognition models. Zhihu (zhihu.com).

Table of contents

1. Connectionist Temporal Classification (CTC)

1.1 Mechanism

1.2 Training issues

1.3 Is CTC really feasible?

1.4 Problems with CTC

2. RNN Transducer (RNN-T)

2.1 Recurrent Neural Aligner (RNA)

2.2 From RNA to RNN-T

2.3 RNN-T's further improvement over RNA (adding a Language Model)

3. Neural Transducer

4. Monotonic Chunkwise Attention (MoChA)

5. Summary of the Major Models


 

The CTC model appeared in 2006, but for a long time no one used it because CTC had major problems.

1. Connectionist Temporal Classification (CTC)

1.1 Mechanism
  1. CTC can do on-line streaming speech recognition, i.e., so-called real-time recognition, because it adopts a unidirectional (one-way) RNN architecture.

  2. CTC has only an Encoder. After the Encoder converts each input acoustic feature x_i into a hidden vector h_i, a linear Classifier produces the distribution over all tokens (that is, a linear transform first, then a softmax). A minimal sketch of this pipeline appears after this list.

  3. Since each acoustic feature covers only a very small span of audio, the information it contains may not be enough to form a token, so CTC introduces a new mechanism: a special blank token ∅ is added to the token set:

    • The model can output ∅ when it does not know what to output.

    • Ignoring downsampling, inputting T acoustic features produces exactly T output tokens.

    • Finally, CTC merges consecutive duplicate tokens and deletes every ∅ token (see the collapse sketch below).
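As a rough illustration of items 1-2, here is a minimal PyTorch sketch; the unidirectional LSTM encoder, layer sizes, and vocabulary size are illustrative assumptions, not settings from any paper:

```python
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    """Encoder + per-frame linear classifier (all sizes are illustrative)."""
    def __init__(self, feat_dim=80, hidden_dim=256, vocab_size=30):
        super().__init__()
        # A one-way (unidirectional) LSTM encoder allows on-line streaming recognition.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Linear classifier: transform first, then softmax; +1 output for the blank ∅.
        self.classifier = nn.Linear(hidden_dim, vocab_size + 1)

    def forward(self, x):                    # x: (batch, T, feat_dim)
        h, _ = self.encoder(x)               # h: (batch, T, hidden_dim), one h_i per x_i
        logits = self.classifier(h)          # (batch, T, vocab_size + 1)
        return logits.log_softmax(dim=-1)    # per-frame distribution over tokens
```

And the merge-then-delete rule from item 3 in plain Python ("∅" stands for the blank token):

```python
def ctc_collapse(alignment, blank="∅"):
    """Merge consecutive duplicate tokens, then drop every blank."""
    output, prev = [], None
    for token in alignment:
        if token != prev and token != blank:
            output.append(token)
        prev = token
    return output

print(ctc_collapse(["c", "c", "∅", "a", "a", "t"]))  # ['c', 'a', 't']
```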

1.2 Training issues
  1. So how should CTC be trained? In theory, we would compute the cross-entropy between each output token distribution and the correct one-hot vector. In practice, because each acoustic feature is so short, there are far more output steps than tokens in the ground truth, so you do not know what the correct label for each step is.

  2. So, to a large extent, we need to create the training labels ourselves.

  3. This operation of creating labels ourselves is called Alignment.

  4. This means there are actually many possible Alignments. For example, in the figure below, for an output of 4 steps the ground truth has only two tokens ("great"), and many Alignments fit. So which should we choose as training labels? The final answer: use them all! Every possible Alignment is used for training (a small enumeration follows this list).
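To make "use them all" concrete, here is a tiny enumeration sketch; the two-token target ['g', 'r'] over 4 output steps is a made-up stand-in for the example in the figure:

```python
from itertools import product

def ctc_collapse(alignment, blank="∅"):
    output, prev = [], None
    for token in alignment:
        if token != prev and token != blank:
            output.append(token)
        prev = token
    return output

target, T = ["g", "r"], 4
# Every length-4 sequence over {∅, g, r} that collapses to the target is a
# valid alignment, and CTC trains on all of them.
alignments = [a for a in product(["∅"] + target, repeat=T)
              if ctc_collapse(a) == target]
for a in alignments:
    print(" ".join(a))
```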

1.3 Is CTC really feasible?
  • Here are some speech recognition results using CTC with different token sets:

  • The following is a comparative experiment on different Models using the same dataset.

    • Here, WER is the Word Error Rate; lower is better.

    • It is not hard to see that CTC must be followed by further processing (a Language Model) for its results to look good.

    • For this reason, many people even kick CTC out of the ranks of end-to-end models.

1.4 Problems with CTC
  • We can regard CTC's final linear classifiers as its Decoder. Each classifier eats just one vector h at a time and outputs one token, and each classifier decides independently.

  • Then you will encounter the following problem:

    • Suppose the first three frames all correspond to the token "c", together forming one "c" sound.

    • Since each Classifier exists independently, the first frame may be recognized as c, the second (being ambiguous) as ∅, and the third as c again.

    • The CTC result then resembles stuttering: the final output is "cc".

    • What we would hope is that the Encoder learns this: when the first acoustic feature is recognized as c and the second as ∅, the third should appropriately lower the probability of c.

In fact, there is another remedy: combine LAS and CTC, running CTC on the Encoder during LAS's encoding process, training the two models at the same time with two simultaneous losses. In effect, we can regard the CTC part as the Encoder of LAS. A sketch of the joint loss follows.
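Here is a sketch of this joint training idea, assuming a shared Encoder; the interpolation weight lam, the tensor shapes, and the padding handling are simplifications, not the exact recipe of any paper:

```python
import torch.nn as nn

ctc_loss_fn = nn.CTCLoss(blank=0)      # expects log-probs of shape (T, batch, vocab+1)
ce_loss_fn = nn.CrossEntropyLoss()     # loss for the attention (LAS) branch

def joint_loss(ctc_log_probs, las_logits, targets, in_lens, tgt_lens, lam=0.5):
    """Two losses over one shared Encoder: L = lam * L_CTC + (1 - lam) * L_LAS."""
    l_ctc = ctc_loss_fn(ctc_log_probs, targets, in_lens, tgt_lens)
    l_las = ce_loss_fn(las_logits.transpose(1, 2), targets)  # (batch, vocab, U) vs (batch, U)
    return lam * l_ctc + (1 - lam) * l_las
```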

2. RNN Transducer (RNN-T)

2.1 Recurrent Neural Aligner (RNA)

Although RNA was published later than RNN-T, from the perspective of model derivation RNA is a transitional step between CTC and RNN-T.

LSTM (Long Short-Term Memory): a variant of RNN designed specifically for sequence data such as text, speech, and time series. Compared with a traditional RNN, LSTM has stronger memory capacity and a gating mechanism that mitigates the vanishing-gradient problem, so it performs better on long sequences.

  • CTC's weakness is that each Classifier makes its recognition independently; RNA is designed to address this.

  • RNA simply replaces CTC's Linear Classifier with an LSTM, so each output step can depend on the previous ones (a sketch follows).
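A sketch of this idea; the sizes and the way the previous token is fed back are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RNADecoder(nn.Module):
    """LSTM in place of CTC's linear classifier: step t sees h_t AND the previous output."""
    def __init__(self, enc_dim=256, emb_dim=64, hid_dim=256, vocab_size=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, emb_dim)   # +1 for the blank ∅ (index 0)
        self.cell = nn.LSTMCell(enc_dim + emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size + 1)

    def forward(self, h):                     # h: (T, enc_dim) encoder outputs
        hid = self.cell.hidden_size
        state = (torch.zeros(1, hid), torch.zeros(1, hid))
        prev = torch.zeros(1, dtype=torch.long)              # start from ∅
        tokens = []
        for t in range(h.size(0)):
            inp = torch.cat([h[t:t + 1], self.embed(prev)], dim=-1)
            state = self.cell(inp, state)
            prev = self.out(state[0]).argmax(dim=-1)         # greedy token feeds step t+1
            tokens.append(prev.item())
        return tokens
```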

2.2 From RNA to RNN-T

Modifying RNA one step further turns it into RNN-T.

  • Consider a question: can one input vector be recognized as a string of several tokens? For example, when you hear /θ/ or /ð/, you would want to recognize it as the two characters "th".

  • RNN-T can do this. RNA takes in one vector and outputs exactly one token, but RNN-T can keep outputting tokens for the same vector until the model is satisfied (it outputs the symbol ∅ as the end mark).

  • For example, consider the following example:

    • Here, each ∅ marks the end of recognition for one acoustic feature and the beginning of the next one.

    • Therefore, the raw output actually contains exactly T of the symbol ∅, where T is the number of input acoustic features mentioned before.

  • However, this leads to the same training problem as CTC, namely the Alignment problem: the given data does not say where ∅ should go, which leads to the following situation:

    • Here, 4 acoustic features are input, corresponding to the audio of "Awesome". However, we do not know where each ∅ should be placed.

    • The response RNN-T chooses is the same as CTC: exhaustive enumeration, train on them all! (A quick count follows this list.)
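A quick count of how many alignments "train on them all" means here, assuming (as described above) that an alignment interleaves the U tokens in order with exactly T blanks and must end with a ∅:

```python
from math import comb

T, U = 4, 2                 # 4 acoustic features, a 2-token transcription
# The U tokens occupy positions among the first T + U - 1 slots; the final slot is ∅.
print(comb(T + U - 1, U))   # 10 possible alignments, all used for training
```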

2.3 RNN-T's further improvement over RNA (adding a Language Model)
  • In fact, RNN-T does not merely replace the original linear classifier with an RNN as RNA does; it also trains a separate RNN model as an LM to realize the dependency between classification steps.

    • That is the blue square at the top of the figure. It receives a previously generated token and produces an output that affects the generation of the next token.

    • However, this RNN deliberately ignores the ∅ symbol, for several reasons:

      • First, this component actually plays the role of an LM, and its input can only be real tokens. It is pre-trained on large amounts of text, and there is no ∅ symbol in text, so ∅ must be ignored.

      • Second, as we said before, the Alignment problem is handled by exhaustion, which requires a good training algorithm, and that algorithm in turn requires an LM that disregards ∅. This design therefore exists for the sake of model training; the algorithm used will be introduced below. (A sketch of the ∅-skipping LM follows.)
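A sketch of the ∅-skipping behaviour described above; the module sizes are assumptions and the decoding loop is abbreviated to comments:

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """RNN-T's extra RNN, acting as an LM: it consumes real tokens only, never ∅."""
    def __init__(self, vocab_size=30, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # no embedding slot for ∅ at all
        self.rnn = nn.LSTMCell(dim, dim)

    def step(self, token, state=None):              # token: (1,) long tensor, a real token id
        return self.rnn(self.embed(token), state)

# Decoding sketch: when the model emits ∅ we move on to the next acoustic feature
# WITHOUT updating the prediction network; only real tokens call step().
# Because ∅ never enters this RNN, it can be pre-trained on plain text as an LM.
```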

3. Neural Transducer

CTC, RNA, and RNN-T all read only one acoustic feature at a time, while Neural Transducer reads many acoustic features at a time.

  • Neural Transducer reads many acoustic features at a time within a window (whose size you set yourself), processes them through the attention mechanism, and finally outputs a string of tokens (see the sketch after this list).

  • Note that attention is used only within one window.

  • How big should the window be? Experimental results from the original paper:

    • As long as attention is used, the results stay good regardless of window size.

    • One more note: the LSTM-ATTENTION above is the previously introduced location-aware attention.
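A minimal sketch of window-restricted attention, assuming simple dot-product attention and a fixed window size (the original model's details differ):

```python
import torch

def windows(h, size):
    """Split encoder outputs h (T, dim) into consecutive windows of `size` frames."""
    for start in range(0, h.size(0), size):
        yield h[start:start + size]

def attend(query, window):
    """Attention restricted to one window: weights are computed over its frames only."""
    weights = torch.softmax(window @ query, dim=0)     # (w,)
    return weights @ window                            # (dim,) context vector

h = torch.randn(20, 8)                                 # 20 frames of 8-dim features (toy data)
q = torch.randn(8)
contexts = [attend(q, w) for w in windows(h, size=5)]  # one context per window
```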

4. Monotonic Chunkwise Attention (MoChA)

MoChA builds on Neural Transducer and makes the number of steps the window moves dynamic.

  • The key point of MoChA is an additional Module, which is similar to an attention mechanism:

    • Input: two vectors, the hidden state and an acoustic feature.

    • Output: yes/no, i.e., whether the window should be placed here.

  • Inference process:

    • First, feed in the frames one by one and check, frame by frame, whether to put the window down.

    • After the window is put down, run the attention mechanism over it and generate a token. Note that exactly one token is generated per window, and ∅ is never output.

    • Then repeat the previous steps: keep moving the window, decide where to put it down, and continue outputting text (see the sketch below).
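A sketch of that extra yes/no module and the scan-then-attend loop; the 0.5 threshold, the sizes, and the fixed chunk width are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StopHere(nn.Module):
    """MoChA's extra module: given the decoder's hidden state and one acoustic
    feature, answer yes/no: should the window be put down ending at this frame?"""
    def __init__(self, dim=8):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, state, frame):
        p = torch.sigmoid(self.score(torch.cat([state, frame], dim=-1)))
        return bool(p > 0.5)          # hard decision at inference time

# Inference sketch: scan frames left to right; where StopHere says yes, place a
# fixed-size window ending at that frame, run soft attention inside it, and emit
# exactly one token (MoChA never outputs ∅). Then continue scanning forward.
```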

5. Summary of the Major Models

  • LAS: the standard seq2seq model.

  • CTC: a seq2seq model whose decoder is a linear classifier.

  • RNA: a seq2seq model that reads one input and emits one output at a time.

  • RNN-T: a seq2seq model that can emit multiple outputs for one input.

  • Neural Transducer: an RNN-T that reads one window of inputs at a time.

  • MoChA: a Neural Transducer whose window placement is dynamic.

This concludes the explanation of the speech recognition task. In addition, there are 3 elective lessons on speech recognition; it is recommended to strike while the iron is hot and check out Elective One, Elective Two, and Elective Three in the column this article belongs to.

 
