Speech Recognition - Loss Function

  It has been almost a month since I started school, and speech recognition still feels a bit difficult. I seem to understand HMMs, GMMs, and DNNs, but it will take a lot of practice and time to really improve. I recently finally got my head around one small topic, so I am writing it down quickly; consider it a small achievement from the past month. Mistakes are inevitable, so criticism is warmly welcome, and I will correct errors promptly!

  ---------------------------------------------------

  The loss function comes up constantly when training speech recognition models. So what is it, and what is it used for?

  Knowing the context in which an idea arises is very helpful for understanding the idea itself, so let's start from the beginning.

  The general pipeline of speech recognition looks like this (not perfectly accurate; this is my personal understanding). Speech is recognized at the level of phonemes, so the signal is first split into frames and windowed, then converted to the frequency domain, then features are extracted, and finally the features are fed into a recognition model to produce the result.

  A speech signal is stable over short spans but time-varying over long ones. Roughly speaking, a stretch of about 20 ms - 30 ms can be treated as approximately unchanging. The unit of speech recognition is not the word but the phoneme, and a phoneme occupies a very short region of the signal, so we can process speech in a way analogous to integration: cut it into short segments, each of which changes little at the macro level and can roughly be labeled as one phoneme, even though microscopically it still varies a great deal. This cutting process is framing. (What we usually see when recording is the time-domain representation of the signal.) Framing is not simple segmentation, though: to keep the signal smooth and easy to process, adjacent frames overlap, and each frame is also windowed.
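  The framing-plus-windowing step above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the frame length, hop size, and choice of a Hamming window are my own illustrative assumptions (25 ms frames with a 10 ms hop at 16 kHz is just one common convention):

```python
import math

def frame_signal(signal, frame_len, hop_len):
    """Split a signal into overlapping frames of frame_len samples,
    advancing hop_len samples each time (frames overlap when hop_len < frame_len)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len])
    return frames

def hamming_window(frame):
    """Apply a Hamming window to one frame, tapering its edges to reduce
    spectral leakage in the Fourier transform that comes next."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

# At a 16 kHz sample rate, a 25 ms frame is 400 samples and a 10 ms hop is 160.
signal = [math.sin(0.01 * i) for i in range(1600)]       # 0.1 s of a toy signal
frames = frame_signal(signal, frame_len=400, hop_len=160)
windowed = [hamming_window(f) for f in frames]
print(len(frames))       # → 8
print(len(windowed[0]))  # → 400
```

Because the hop (160 samples) is smaller than the frame (400 samples), consecutive frames share 240 samples, which is exactly the overlap the text describes.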

  The time-domain form of the signal is not convenient for extracting the features we need, so we apply a short-time Fourier transform to each frame, converting the time-domain signal into a frequency-domain one; this is also a dimensionality-reduction step. From there, feature extraction becomes much easier. We do not recognize the raw signal directly; instead, we extract a fixed set of features from it, which both reduces the dimensionality and gives the recognition model a uniform input. The usual feature-extraction method is MFCC (Mel-frequency cepstral coefficients); there are excellent detailed write-ups online, so I will not go into it here. For now, just know it is a method for extracting features of a speech signal. At this point, the speech signal has gone from a series of irregular waveforms to an M*N matrix: M kinds of features extracted from a signal divided into N frames, with each frame forming one column of the matrix.
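  The core of the frequency-domain conversion is a Fourier transform of each windowed frame. As a toy sketch only: a real pipeline would use an FFT and then stack a mel filterbank, a log, and a DCT on top to get MFCCs, but a naive DFT is enough to show what "extracting frequency features" means. The frame length and the test sinusoid are my own illustrative choices:

```python
import math

def magnitude_spectrum(frame):
    """Naive discrete Fourier transform of one frame; returns the magnitude
    of each frequency bin up to the Nyquist bin. This is the first step of
    an MFCC pipeline (mel filterbank, log, and DCT would follow)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(-x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

# A sinusoid with exactly 4 cycles per frame puts nearly all its energy in bin 4.
frame = [math.sin(2 * math.pi * 4 * i / 64) for i in range(64)]
spec = magnitude_spectrum(frame)
print(max(range(len(spec)), key=lambda k: spec[k]))  # → 4
```

Stacking one such feature vector per frame as a column is what produces the M*N matrix described above.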

  At this point, most of the preparation for recognition is done, and the recognition model comes in. For now, the model can simply be regarded as a function: the M*N matrix above is fed in, the model processes it, and a result comes out. Of course, it is usually not the case that the input speech directly yields text. The output of a speech model is generally a probability: suppose the model's input has M nodes and its output has K nodes. Each output node represents a phoneme, and the value at each node is a probability. When a node's probability exceeds a certain threshold, the phoneme that node represents is considered selected.
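  The K-node probabilistic output described above is typically produced by a softmax over the model's raw scores. A minimal sketch, assuming K = 4 phoneme nodes and made-up scores (the numbers are purely illustrative):

```python
import math

def softmax(logits):
    """Turn raw model scores (one per phoneme node) into a probability
    distribution over the K output nodes that sums to 1."""
    m = max(logits)                         # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for K = 4 phoneme nodes.
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs.index(max(probs)))        # → 0  (node 0 wins)
print(abs(sum(probs) - 1.0) < 1e-9)  # → True (valid probability distribution)
```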

  So the question is: where does the model come from? Since the model can be regarded as a function, where do the parameters in that function come from? Both the model and its parameters are obtained by training on a large amount of training data.

  But before training, where do the model's parameters start, and how does the model gradually arrive, through training, at parameters suitable for real speech recognition? This is where the concept of the loss function comes in.

 

  There are many models for speech recognition: the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM), and the now very popular Deep Neural Network (DNN) and Convolutional Neural Network (CNN), among others. All of these models contain parameters. In fact, before training begins, we simply initialize those parameters randomly. For the training data, we know in advance what the output should be; but a model whose parameters were randomly initialized will rarely produce completely correct outputs. The gap (or distance) between the actual output and the expected output can therefore serve as a measure of whether the current parameters are suitable, and this distance is what the loss function expresses.
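  One concrete way to measure that "gap" between a predicted phoneme distribution and the known correct label is cross-entropy. This is a generic sketch, not the specific loss of any model named above; the distributions are made up for illustration:

```python
import math

def cross_entropy(predicted, target):
    """Cross-entropy between the model's predicted phoneme distribution and
    the known correct (one-hot) label: small when the prediction is close
    to the target, large when it is far away."""
    eps = 1e-12  # avoid log(0)
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))

target = [0.0, 1.0, 0.0]      # the correct phoneme is node 1
good   = [0.05, 0.90, 0.05]   # a model that is nearly right
bad    = [0.60, 0.20, 0.20]   # a randomly initialized model, far off
print(cross_entropy(good, target) < cross_entropy(bad, target))  # → True
```

The nearly-correct model gets a small loss (about 0.105) and the random one a large loss (about 1.609), which is exactly the "distance" the paragraph describes.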

  To summarize: the loss function is not mysterious. It is like when we first learned about equations: at first we did not know what an equation was, but gradually we saw that an equation is nothing mysterious, just an equality containing unknowns. The same is true of the loss function. It is a measure of how far the model under its current parameters is from our ideal model, so that the parameters can be adjusted appropriately. The next time training data is fed in, there is another output; the gap between that output and the expected result is again measured by the loss function, and the parameters are adjusted again accordingly. After many such cycles, the gap between the actual output and the expected result keeps shrinking and the model keeps improving, until it can be used in practice to recognize test data.
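  The cycle just described, initialize parameters, measure the loss, adjust, repeat, can be shown end to end on a deliberately tiny "model". This toy fits y = w * x by gradient descent on a squared-error loss; the model, data, and learning rate are all my own illustrative choices, not anything from a real speech system:

```python
def train(xs, ys, lr=0.01, epochs=200):
    """Train the toy model y = w * x by gradient descent on mean squared
    error. Each epoch: compute the loss, compute its gradient with respect
    to w, and nudge w downhill. Returns the final w and the loss history."""
    w = 0.0  # arbitrarily initialized parameter, standing in for random init
    history = []
    for _ in range(epochs):
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        history.append(loss)
    return w, history

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]       # the true relationship is y = 2x
w, history = train(xs, ys)
print(round(w, 2))               # → 2.0
print(history[-1] < history[0])  # → True: the loss shrank over training
```

The shrinking loss history is exactly the narrowing gap described above; a real DNN does the same thing with millions of parameters instead of one.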

  
