Speech Recognition Notes: CTC

1. Introduction

  The full name of CTC is Connectionist Temporal Classification, which can be understood as neural-network-based temporal classification. Training an acoustic model for speech recognition is supervised learning: effective training normally requires the label for every frame, so in the data preparation stage the speech must be force-aligned. For a single frame of speech data it is hard to assign a label, but with a few dozen frames it is easy to judge the corresponding pronunciation. CTC relaxes this one-to-one correspondence requirement: training only needs an input sequence and an output sequence. This brings two advantages: the data does not need to be aligned and labeled frame by frame, and CTC directly outputs the probability of the predicted sequence without external post-processing.

  There are the following problems in end-to-end speech recognition:

    1). The length of the input speech sequence and the length of the label (i.e., the text result) are inconsistent.
    2). The position of the labels within the input sequence is uncertain (the alignment problem).

  That is, there is a length problem and an alignment problem: multiple input frames may correspond to one output, or one input may correspond to multiple outputs.

2. Structure

  The system can be modeled with a bidirectional RNN, trained so that it outputs the probability distribution over the different phonemes at each time step.
  Input: the features of each frame, fed in time order.
  Output: at each time step the output is a softmax over K+1 categories, where K is the number of phonemes and the extra 1 is the blank. (A classification problem: each frame is either a phoneme or blank.)
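
  As a concrete illustration, here is a minimal sketch of such an acoustic model in PyTorch. The feature dimension, hidden size, and phoneme count below are made-up values, not from the original text; the only essential points are the bidirectional RNN and the per-frame softmax over K+1 classes (K phonemes plus blank).

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    """Bidirectional RNN acoustic model with a per-frame softmax over K phonemes + 1 blank."""
    def __init__(self, feat_dim=40, hidden=128, num_phonemes=48):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_phonemes + 1)   # +1 output class for the blank

    def forward(self, x):                    # x: (batch, T, feat_dim)
        h, _ = self.rnn(x)                   # h: (batch, T, 2*hidden)
        return self.fc(h).log_softmax(-1)    # per-frame log-probabilities, (batch, T, K+1)

# One utterance of 100 frames with 40-dimensional features -> output shape (1, 100, 49)
out = CTCAcousticModel()(torch.randn(1, 100, 40))
```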

  For a given input feature sequence of length T and any output label (path) sequence π = {π_1, π_2, π_3, …, π_T}, the probability of this output sequence is the product of the probabilities of the corresponding labels at each time step:

  p(\pi|x) = \prod_{t=1}^{T} Pr(\pi_t, t \mid x)

  Writing the Pr probability in the formula above as y (y denotes the softmax output probability), this becomes the original formula from the paper:

  p(\pi|x) = \prod_{t=1}^{T} y_{\pi_t}^{t}
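
  To make the product concrete, here is a tiny numerical sketch. The 4-frame, 3-class softmax output matrix is made up purely for illustration:

```python
import numpy as np

# Hypothetical softmax outputs y[t, k] for T = 4 frames and 3 classes
# (class 0 is the blank '-', classes 1 and 2 are two phonemes 'a' and 'b').
y = np.array([[0.1, 0.8, 0.1],
              [0.7, 0.2, 0.1],
              [0.2, 0.2, 0.6],
              [0.6, 0.2, 0.2]])

def path_probability(y, path):
    """p(pi|x) = product over t of y[t, pi_t]."""
    return float(np.prod([y[t, k] for t, k in enumerate(path)]))

print(path_probability(y, [1, 0, 2, 0]))   # path (a, -, b, -): 0.8 * 0.7 * 0.6 * 0.6 = 0.2016
```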

3. Loss function 

  Since the output sequence and the final training label are generally of unequal length, we use x to denote the input sequence, l to denote the label, and π to denote the path predicted above. A many-to-one mapping β (which removes blanks and consecutive repetitions) is applied so that the output paths above correspond to the given label sequence; for example, both (a, -, b, c, -, -) and (-, -, a, -, b, c) are mapped to the label l = (a, b, c), as in the sketch below.
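
  The mapping β is straightforward to write down. A small Python sketch (the integer label indices and the choice of 0 for the blank are illustrative assumptions):

```python
def beta(path, blank=0):
    """Many-to-one CTC mapping: collapse consecutive repetitions, then remove blanks."""
    collapsed, prev = [], None
    for k in path:
        if k != prev:                 # drop consecutive repetitions
            collapsed.append(k)
        prev = k
    return [k for k in collapsed if k != blank]   # drop blanks

# With 1='a', 2='b', 3='c' and 0 as the blank '-':
print(beta([1, 0, 2, 3, 0, 0]))   # (a,-,b,c,-,-) -> [1, 2, 3]
print(beta([0, 0, 1, 0, 2, 3]))   # (-,-,a,-,b,c) -> [1, 2, 3]
```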

  \beta^{-1} denotes the inverse of β, i.e., the one-to-many mapping from (a, b, c) to all paths with possible repetitions and blanks. The probability of the final label l given the input sequence x is therefore the sum, under the RNN (LSTM) model, of the probabilities of all the paths that map to it:

  p(l|x) = \sum_{\pi \in \beta^{-1}(l)} p(\pi|x)
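
  For very short sequences this sum can be checked by brute force, enumerating every possible path and keeping the ones that collapse to the label. This is exponential in T and is only for illustration; practical implementations compute the same quantity with the forward-backward (dynamic programming) algorithm. The probability matrix below is made up:

```python
import itertools
import numpy as np

def beta(path, blank=0):
    """Collapse consecutive repetitions, then remove blanks."""
    out, prev = [], None
    for k in path:
        if k != prev:
            out.append(k)
        prev = k
    return tuple(k for k in out if k != blank)

# Hypothetical softmax outputs y[t, k]: T = 4 frames, classes {0: '-', 1: 'a', 2: 'b'}
y = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.2, 0.1, 0.7],
              [0.7, 0.1, 0.2]])

def label_probability(y, label):
    """p(l|x) = sum over all paths pi with beta(pi) == l of prod_t y[t, pi_t]."""
    T, K = y.shape
    total = 0.0
    for pi in itertools.product(range(K), repeat=T):   # all K**T paths (brute force)
        if beta(pi) == tuple(label):
            total += float(np.prod([y[t, k] for t, k in enumerate(pi)]))
    return total

print(label_probability(y, [1, 2]))   # probability that the collapsed output is (a, b)
```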

  So, given an input sequence x and a label l*, the training objective is to maximize this probability, which is equivalent to minimizing its negative logarithm:

  \text{loss}(x, l^{*}) = -\ln p(l^{*}|x)
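
  In practice this negative log-probability is available directly as a built-in loss, for example torch.nn.CTCLoss in PyTorch. A minimal usage sketch with made-up shapes (the random tensor below stands in for the per-frame log-probabilities produced by the acoustic model):

```python
import torch
import torch.nn as nn

T, N, C = 100, 2, 49   # frames, batch size, K phonemes + 1 blank (class 0)
S = 20                 # label length per utterance (assumed)

# Random stand-in for the model's per-frame log-probabilities, shape (T, N, C).
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)      # label sequences contain no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)     # computes -ln p(l*|x), averaged over the batch
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                    # gradients flow back to the network outputs
```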
