Alignment in HMM, CTC and RNN-T: a detailed explanation of alignment - Speech signal processing learning (3) (Elective 2)

references:

Speech Recognition (Elective) - Alignment of HMM, CTC and RNN-T - Bilibili

Li Hongyi, Human Language Processing (Spring 2020) course notes: Alignment - 7 - Zhihu (zhihu.com)

Citations of individual papers are omitted here.

 

Table of contents

1. The difference between E2E model, CTC and RNN-T

The idea of the E2E model

Ideas of CTC and RNN-T models

2. Problems to be solved

3. Introduction to alignment

4. Exhaustive method

Exhaustive HMM

Exhaustive CTC

Exhaustive RNN-T

5. Summary


1. The difference between E2E model, CTC and RNN-T

The idea of the E2E model
  • For an end-to-end model such as LAS, decoding always searches for the token sequence Y that maximizes P(Y|X), where X is the sequence of acoustic feature vectors:


    \text{Decoding: } Y^* = \arg \max_Y{\log P(Y|X)}
     

  • Why can we say this? Consider the structure of LAS: at every decoding step the model outputs a probability distribution over tokens, from which we read off the probability of the token actually emitted. Multiplying these per-step probabilities together gives P(Y|X).

  • Of course, when solving the equation above we do not simply pick the most probable token from each distribution independently; instead we use beam search or similar strategies to find a (near-)optimal solution. The training objective plugs into the same formula: if \widehat{Y} is the ground-truth transcription, training searches for the model parameters that make P_\theta(\widehat{Y}|X) as large as possible (a small numeric sketch follows the formula).


    \text{Training: } \theta^* = \arg \max_\theta{\log P_\theta(\widehat{Y}|X)}
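
  • As a minimal sketch of this objective (hypothetical Python with made-up distributions; a real LAS model would produce step_distributions itself), log P(Y|X) is just the sum of the per-step log-probabilities:

    import numpy as np

    def log_prob_of_sequence(step_distributions, token_ids):
        """Sum the log-probabilities the model assigns to each token of Y.

        step_distributions: one probability distribution per decoding step
        token_ids:          the token sequence Y, as indices into each distribution
        """
        return sum(float(np.log(dist[t]))
                   for dist, t in zip(step_distributions, token_ids))

    # Toy example: a 3-step decode over a 4-token vocabulary.
    dists = [np.array([0.7, 0.1, 0.1, 0.1]),
             np.array([0.2, 0.6, 0.1, 0.1]),
             np.array([0.1, 0.1, 0.1, 0.7])]
    print(log_prob_of_sequence(dists, [0, 1, 3]))  # log(0.7 * 0.6 * 0.7)

  • Beam search would explore many candidate token sequences and keep those with the highest accumulated score; the sketch above only scores one given sequence.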
     

Ideas of CTC and RNN-T models
  • For CTC and RNN-T, the token sequence and the acoustic feature sequence have different lengths, so the probability of the token sequence given the acoustic features cannot be computed directly; these models additionally require an alignment step.

  • Take CTC as an example and assume the output token sequence is "ab" while there are 4 acoustic feature vectors. Since the two lengths differ, we must repeat a and b, or insert the ∅ (blank) symbol between them, until the expanded sequence is as long as the input acoustic feature sequence; only then can P(Y|X) be computed.

  • Strictly speaking, then, CTC and RNN-T can only compute the probability of one particular alignment, not the probability of a token sequence. So what should we do? The solution, borrowed from the HMM, is to sum the probabilities of all possible alignments and take that sum as the probability of the token sequence, as in the formula below. With this in place, both training and decoding proceed as in the end-to-end model above.


    P(Y|X) = \sum_{h\in align(Y)} P(h|X)
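
  • Here is a minimal numeric sketch of this sum for CTC, assuming (as CTC does) that an alignment's probability factorizes over frames; the distributions and the example Y = "ab" with T = 3 are toy values:

    import numpy as np

    # Per-frame distributions over the symbols {∅: 0, a: 1, b: 2}; toy numbers.
    frame_dists = np.array([[0.1, 0.8, 0.1],
                            [0.2, 0.3, 0.5],
                            [0.6, 0.1, 0.3]])

    def alignment_prob(h):
        """P(h|X) under CTC's frame-wise factorization."""
        return float(np.prod([frame_dists[t][tok] for t, tok in enumerate(h)]))

    # All five alignments of "ab" over 3 frames: a∅b, ∅ab, ab∅, aab, abb.
    alignments = [(1, 0, 2), (0, 1, 2), (1, 2, 0), (1, 1, 2), (1, 2, 2)]
    print(sum(alignment_prob(h) for h in alignments))  # P(Y|X)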
     

2. Problems to be solved

  1. First, how do we enumerate all possible alignments? In fact, CTC and RNN-T enumerate them in much the same way as the HMM.

  2. Second, how do we sum up the probabilities of all the alignments?

  3. Then, how do we train these models? The HMM is trained with the forward algorithm, while CTC and RNN-T are trained by gradient descent; so how do we compute the gradient of a probability that sums over many alignments?

  4. Finally, how do we perform inference (decoding) to solve our target expression?

3. Introduction to alignment

  • The alignments required by HMM, CTC and RNN-T are similar in spirit but differ in detail. Assume the input has 6 acoustic feature vectors (length T = 6) and the token unit is the character (although this unit is really too coarse for an HMM); the output is "c", "a", "t" (length N = 3).

  • The HMM repeats the three letters of cat until the repeated sequence is as long as the acoustic feature vector sequence.

  • CTC has two moves: it can repeat the 3 letters of cat, and it can insert the ∅ symbol between them, until the length equals that of the acoustic feature vector sequence. (This mirrors its inference procedure, which merges repeated letters between ∅ symbols into one letter and then removes the ∅ symbols.)

  • RNN-T inserts exactly as many ∅ symbols as there are acoustic feature vectors. One example of each style is sketched below.
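
  • To make the three styles concrete, here is one legal alignment per model for Y = "cat" and T = 6 (the particular alignments chosen are arbitrary examples, not unique answers):

    # HMM: repetition only; length T = 6.
    hmm_alignment = ["c", "c", "a", "a", "a", "t"]

    # CTC: repetition and ∅ allowed; length T = 6.
    # Collapses back to "cat": merge repeats, then drop ∅.
    ctc_alignment = ["∅", "c", "a", "a", "∅", "t"]

    # RNN-T: exactly T = 6 ∅ symbols interleaved with the tokens,
    # ending in ∅; total length T + N = 9.
    rnnt_alignment = ["c", "∅", "∅", "a", "∅", "t", "∅", "∅", "∅"]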

4. Exhaustive method

Exhaustive HMM
  • How do we enumerate all the alignments of the HMM? We can express the HMM alignment just described as a pseudo-function:

    • Here, the letter c is repeated t_1 times, a is repeated t_2 times, and so on.

    • Furthermore, since every letter must appear at least once, each t_i > 0; and for the alignment to have the right length, t_1 + t_2 + ... + t_N = T. A runnable sketch of this enumeration follows.
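
  • The following brute-force sketch enumerates exactly these alignments (a hypothetical helper for illustration; real systems use dynamic programming rather than exhaustive search):

    import itertools

    def hmm_alignments(tokens, T):
        """Yield every HMM alignment: token n repeated t_n >= 1 times,
        with t_1 + ... + t_N == T."""
        N = len(tokens)
        for counts in itertools.product(range(1, T - N + 2), repeat=N):
            if sum(counts) == T:
                yield [tok for tok, c in zip(tokens, counts) for _ in range(c)]

    for h in hmm_alignments(["c", "a", "t"], 6):
        print("".join(h))  # 10 alignments in total, e.g. catttt, ccaatt, ...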

  • From this we can draw a state diagram (trellis graph), as follows:

    • We must walk from the red point at the upper left to the blue point at the lower right.

    • At each step we may move diagonally down-right or horizontally.

    • Moving down-right means emitting the next letter; moving horizontally means repeating (copying) the current letter.

  • The trellis graph neatly rules out illegal alignments: an illegal path simply never reaches the end point.

Exhaustive CTC
  • CTC differs from the HMM in that it may also insert the ∅ symbol, including at the very beginning and at the very end. We express this process as a pseudo-function, with a runnable sketch after this list:

    • First, we may output a ∅ symbol at the start, or choose not to.

    • Then, in each round we output the current token one or more times, followed by some number (possibly zero) of ∅ symbols.

    • The number of tokens plus the number of ∅ symbols must together equal the length of the acoustic feature vector sequence.
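
  • A brute-force sketch of this enumeration (hypothetical helpers; the collapse function is CTC's standard inference rule, and real implementations use dynamic programming instead of trying every path):

    import itertools

    def collapse(path, blank="∅"):
        """CTC inference rule: merge consecutive repeats, then drop blanks."""
        merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
        return [s for s in merged if s != blank]

    def ctc_alignments(tokens, T, blank="∅"):
        """Yield every length-T path over the tokens plus blank that
        collapses back to the target token sequence."""
        alphabet = sorted(set(tokens)) + [blank]
        for path in itertools.product(alphabet, repeat=T):
            if collapse(path, blank) == list(tokens):
                yield path

    paths = list(ctc_alignments(["c", "a", "t"], 6))
    print(len(paths))         # number of legal CTC alignments for "cat", T = 6
    print("".join(paths[0]))  # one example path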

  • We draw the state diagram as follows:

    • We must move from the red point to one of the 2 blue end points.

    • At the start there are two options: enter the ∅ row or enter the first letter row.

    • In a letter row there are three options: copy the current letter (move horizontally), insert ∅ (move down into the ∅ row), or output the next letter directly (skip over the ∅ row).

  • However, if we chose to enter the ∅ row at the start, the available moves are different:

    • Compared with a letter row, a ∅ row offers only two choices.

    • We can move horizontally to copy ∅, or move down-right to enter the next token row, but we cannot skip a row.

  • So CTC allows different moves depending on the row, and there are two possible final destinations.

  • Here are a few examples of legal alignments drawn on the state diagram:

  • CTC also has a special case. Following the strategy CTC adopts at inference time, if two adjacent tokens in the token sequence are identical, only two moves are allowed in the row of the first of the identical tokens. Consider the example of outputting "see":

    • In the row of the first e, only two moves are possible.

    • We can copy the e or move into the ∅ row, but we cannot jump directly to the row of the next e.

    • Jumping directly to the next e row would mean emitting two e's back to back; at inference time CTC would merge them into a single e, so the path would decode to "se" rather than "see". This is illustrated in the sketch below.
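
  • Reusing the collapse() sketch from above (with hypothetical length-4 paths for brevity), we can check both cases:

    print(collapse(list("se∅e")))  # ['s', 'e', 'e'] -> decodes to "see": legal
    print(collapse(list("see∅")))  # ['s', 'e']      -> decodes to "se": illegal for "see"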

Exhaustive RNN-T
  • RNN-T inserts as many ∅ symbols as the acoustic feature vector sequence is long, i.e. T of them. The rule is: at each acoustic feature vector the model may keep emitting tokens for as long as it likes, and emitting ∅ means "move on to the next vector". Once we see this rule, we can write the pseudo-code:

    • Around the three letters of cat there are 4 positions where ∅ can be inserted, and since RNN-T needs to know when the output has finished, the last position, after cat, must contain ∅: seeing ∅ there tells RNN-T to move past the final acoustic feature vector.

    • In each round we output the nth token followed by some number c_n of ∅ symbols (a run of ∅ may also precede the first token, which is what gives the 4 insertion positions above).

    • The earlier runs may be empty, but the last one, c_N, must output at least one ∅.

    • The ∅ counts must sum to T, the length of the acoustic feature vector sequence. A runnable sketch of this enumeration follows.
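
  • A brute-force sketch of this enumeration (hypothetical helper; slot 0 holds the optional run of ∅ before the first token, matching the 4 insertion positions described above):

    import itertools

    def rnnt_alignments(tokens, T, blank="∅"):
        """Yield every RNN-T alignment: distribute exactly T blanks over the
        N + 1 slots around the tokens; the final slot must be non-empty."""
        N = len(tokens)
        for counts in itertools.product(range(T + 1), repeat=N + 1):
            if sum(counts) == T and counts[-1] >= 1:
                path = [blank] * counts[0]
                for tok, c in zip(tokens, counts[1:]):
                    path += [tok] + [blank] * c
                yield path

    paths = list(rnnt_alignments(["c", "a", "t"], 6))
    print(len(paths))         # number of legal RNN-T alignments
    print("".join(paths[0]))  # every path has length T + N = 9 and ends in ∅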

  • Converting this process into a state diagram: to ensure that every path ends with ∅, an extra cell is added to the right of the last row:

    • We walk from the blue dot at the upper left to the blue dot at the lower right.

    • There are two moves: horizontal, which inserts ∅, and downward, which outputs the next token.

    • As noted, the last row has an extra cell on the right, which guarantees that the final step is a horizontal ∅ insertion.

  • Several possible alignments are drawn below, along with illegal ones (paths that go outside the box):

5. Summary

  • We draw the state machine diagram of each model for comparison:

    • HMM starts from c; at each step it may copy the current token or jump to the next token.

    • CTC may start from ∅ or from c, and has two ways to end (at t or at ∅). It may copy the current token, move into ∅, or move on to the next token.

    • RNN-T may start from ∅ or from c, but must end at ∅. Moreover, every token row must be left immediately after it is entered: the current token can never be produced twice, i.e. tokens are never copied.


Origin: blog.csdn.net/m0_56942491/article/details/134692567