The whole process of speech recognition based on GMM-HMM

1. Basic knowledge

        Speech recognition is the technology that enables machines to convert speech signals into the corresponding text or commands through recognition and understanding.

        Difficulties in speech recognition: regional accents, acoustic scenes/environments, physiological differences between speakers, and the cocktail-party problem (multiple simultaneous speakers).

        Classification of speech recognition tasks: isolated word recognition and continuous speech recognition.

        Speech recognition task processing flow:

        1) Speech preprocessing

        2) Speech recognition algorithm: the traditional GMM-HMM algorithm, or deep-learning-based algorithms

2. Speech preprocessing

        1) Digitization: sample and quantize the analog speech signal collected from the sensor into a digital signal

        2) Pre-emphasis: the purpose is to emphasize the high-frequency part of the speech, remove the influence of lip radiation, and increase the high-frequency resolution of the signal.
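
        A minimal sketch of pre-emphasis as a first-order high-pass filter y[n] = x[n] - a·x[n-1]; the coefficient 0.97 and the 16 kHz test signal are illustrative assumptions, not values from the text:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Example: emphasize the high frequencies of a 1-second 16 kHz signal
x = np.random.randn(16000)
y = pre_emphasis(x)
```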

        3) Endpoint detection: also called voice activity detection (VAD). The purpose is to distinguish speech regions from non-speech regions, i.e., to remove silence and noise and keep only the really effective speech content.

        VAD algorithms can be roughly divided into three categories: threshold-based, classifier-based, and model-based
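
        A toy sketch of the threshold-based category, assuming short-time energy over fixed-length frames and a hand-picked threshold (real systems adapt the threshold to the noise level):

```python
import numpy as np

def energy_vad(signal: np.ndarray, frame_len: int = 400, threshold: float = 0.01) -> np.ndarray:
    """Mark each frame as speech (True) or non-speech (False) by short-time energy."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)   # short-time energy per frame
    return energy > threshold               # hand-picked threshold; real systems adapt it
```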

        4) Framing: because of the short-time stationarity of speech, "short-time analysis" is required, i.e., the signal is cut into segments, each called a frame (generally 10~30 ms). Although non-overlapping segments could be used, overlapping segments are generally used so that the transition between frames is smooth and continuity is maintained.
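
        A sketch of overlapping framing, assuming a 16 kHz sample rate with a 25 ms frame length and a 10 ms frame shift (typical values, not taken from the text):

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (25 ms frames, 10 ms shift at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return signal[idx]

frames = frame_signal(np.random.randn(16000))   # shape: (n_frames, 400)
```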

        5) Windowing: a finite-length sliding window is used to weight the signal, realizing the framing of the speech signal. The purpose of windowing is to reduce the truncation effect at the edges of each frame. Common windows are the rectangular window, the Hanning window, and the Hamming window.

        Truncating and framing the signal usually requires windowing, because truncation causes spectral leakage in the frequency domain; the window function reduces the impact of truncation and improves the resolution of the transform result.

        The cost of windowing is that the two ends of each frame are attenuated and receive less weight than the middle part, which is another reason to frame with overlapping segments.
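
        Continuing the framing idea, each frame is multiplied by a window before the FFT; a Hamming window and made-up frame data are assumed here:

```python
import numpy as np

frames = np.random.randn(100, 400)                  # e.g. 100 frames of 400 samples (25 ms at 16 kHz)
window = np.hamming(400)                            # tapers both ends of each frame
windowed = frames * window                          # apply the window frame by frame
spectrum = np.abs(np.fft.rfft(windowed, axis=1))    # leakage is reduced vs. a rectangular window
```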

3. Speech recognition algorithm (GMM-HMM algorithm)

        Knowledge Supplement:

        1) A triphone represents a phoneme as the triple [previous phoneme, current phoneme, next phoneme], because the same phoneme is pronounced differently depending on its neighboring phonemes.
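
        A small sketch of expanding a phoneme sequence into triphones; the silence padding at both ends is an assumption for illustration:

```python
def to_triphones(phones):
    """Expand a phoneme sequence into context-dependent triphones [prev, cur, next]."""
    padded = ["sil"] + list(phones) + ["sil"]   # assume silence context at both ends
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]

print(to_triphones(["a", "b", "c"]))  # [('sil','a','b'), ('a','b','c'), ('b','c','sil')]
```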

        2) Each phoneme is modeled by N states (typically three), which represent the whole process of start -> steady state -> end.

        3) A feature frame sequence usually has many frames, far more than the number of states. The approach is to let each state repeat in order (each state emits several consecutive frames) and then sum the probabilities over all such state sequences.

        For example, if the three phonemes are a, b, c and there are 10 feature frames, the possible state sequences are aaabbbcccc, aaaabbbccc, etc. We sum the probabilities of all such state sequences to obtain the probability that this feature frame sequence was produced by these three phonemes.
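
        A brute-force illustration of this summation with made-up per-frame emission probabilities: enumerate every ordered alignment of 3 states over 10 frames (each state emitted at least once) and sum the sequence probabilities. Transition probabilities are omitted for brevity, and real systems use the forward algorithm rather than explicit enumeration:

```python
import numpy as np
from itertools import combinations

n_frames, n_states = 10, 3                               # e.g. states for phonemes a, b, c
rng = np.random.default_rng(0)
emis = rng.dirichlet(np.ones(n_states), size=n_frames)   # made-up P(frame_t | state)

total = 0.0
# choose the 2 boundaries where the state changes: a...ab...bc...c
for b1, b2 in combinations(range(1, n_frames), 2):
    states = [0] * b1 + [1] * (b2 - b1) + [2] * (n_frames - b2)
    total += np.prod([emis[t, s] for t, s in enumerate(states)])

print(total)  # probability mass summed over all aaabbbcccc-style alignments
```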

        4) In speech recognition, the audio data (the feature frames) are the observations in the HMM (they can be observed directly), and the phonemes corresponding to the audio are the hidden states (the observations are generated from the hidden states).

        5) Bayes' formula:

P(Y|X)=\frac{P(Y)\,P(X|Y)}{P(X)}

        where:

                P(Y): the prior probability, representing initial knowledge about the probability of the random variable Y

                P(X|Y): the likelihood, also called the class-conditional probability density; it describes how the related variable X behaves given the condition Y

                P(Y|X): the posterior probability, i.e., the probability of Y after the condition X has been observed
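
        A tiny numeric check of the formula with made-up numbers (in speech recognition, Y plays the role of the word/phoneme sequence and X the acoustic features):

```python
# Made-up numbers: prior P(Y), likelihood P(X|Y), evidence P(X)
p_y = 0.3            # prior
p_x_given_y = 0.8    # likelihood
p_x = 0.5            # evidence
p_y_given_x = p_y * p_x_given_y / p_x   # posterior = 0.48
print(p_y_given_x)
```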

        Training process: each time, a feature frame sequence and its word sequence are fed into the model. Training corresponds to the learning problem of the HMM and is solved iteratively with the EM (Baum-Welch) algorithm; a minimal sketch follows the steps below.

        Training steps:

        1) Refine word sequences into triphone sequences

        2) Exhaustively enumerate all possible continuous state sequences of the current triphone sequence

        3) Initialize the initial state distribution (π), the hidden state transition matrix (A), and the observation (emission) probability matrix (B).

        4) Using the model parameters initialized in the previous step, \bar{\lambda}=(\pi, A, B), compute the probability of each state sequence, P(I|O,\bar{\lambda}), with the forward or backward algorithm.

        5) For each state sequence, compute the log-likelihood of the current triphone sequence, \log P(O, I|\lambda), where \lambda is the variable to be optimized.

        6) Take the expectation Q of this log-likelihood over the state sequences, weighted by P(I|O,\bar{\lambda}) (E-step):

        Q(\lambda, \bar{\lambda})=\sum_{I}\log P(O, I|\lambda)\,P(I|O,\bar{\lambda})

                Then maximize this expectation (M-step), obtaining the new model parameters \lambda=(\pi, A, B) (e.g., by the Lagrange multiplier method).

        7) Repeat steps 4) to 6) until convergence.
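
        A minimal sketch of the EM (Baum-Welch) iteration described in steps 3) to 7), using discrete emission symbols instead of GMM-modeled acoustic features to keep the code short; the sizes and the toy observation sequence are illustrative assumptions:

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1..o_t, q_t = i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(obs, N, M, n_iter=20, seed=0):
    """EM re-estimation of (pi, A, B) for a discrete-emission HMM."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(N))                  # initial state distribution
    A = rng.dirichlet(np.ones(N), size=N)           # transition matrix
    B = rng.dirichlet(np.ones(M), size=N)           # emission matrix
    for _ in range(n_iter):
        alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)   # P(q_t = i | O, lambda_bar)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :])   # P(q_t=i, q_{t+1}=j | O)
        xi /= xi.sum(axis=(1, 2), keepdims=True)
        pi = gamma[0]                                      # M-step re-estimation
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

obs = np.array([0, 1, 2, 1, 0, 0, 2, 1, 1, 0])   # toy observation symbols
pi, A, B = baum_welch(obs, N=3, M=3)
```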

        When the observation (emission) probability matrix is modeled by GMMs, there are additional hidden variables for the Gaussian mixture components, so the Q function must also sum over these components; the formula becomes:

        Q(\lambda, \bar{\lambda})=\sum_{Z}\sum_{I}\log P(O, I, Z|\lambda)\,P(I, Z|O,\bar{\lambda})

        where Z=\{z_{1}, z_{2}, ..., z_{n}\} denotes the hidden Gaussian mixture component variables.

        After training, the GMM-HMM system outputs the parameters \lambda=(\pi, A, B) together with the GMM parameters.
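
        When the emission matrix B is modeled by GMMs, each state scores a feature frame with a weighted sum of Gaussian densities. A minimal sketch with made-up diagonal-covariance parameters (in a full system these parameters are re-estimated inside the EM loop):

```python
import numpy as np

def gmm_loglik(o, weights, means, variances):
    """log p(o | state) for a diagonal-covariance GMM: log sum_k w_k N(o; mu_k, var_k)."""
    o = np.asarray(o)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances) + (o - means) ** 2 / variances, axis=1))
    return np.logaddexp.reduce(log_comp)

# Made-up 2-component GMM over 3-dimensional features
w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(gmm_loglik([0.5, 0.2, -0.1], w, mu, var))
```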

        Recognition process (decoding): given the GMM-HMM system parameters and the current input frame sequence, find the word sequence (path) with the maximum probability; a scoring sketch follows the steps below.

        Recognition steps:

        1) Exhaustively enumerate all possible state sequences corresponding to the current frame sequence

        2) Using \lambda=(\pi, A, B), compute the probability that the feature frame sequence is generated by each state sequence.

        3) The feature frame sequence contains far more frames than the number of states of a word sequence. After a word sequence is refined into states, the state sequence must be expanded to the length of the feature frame sequence, and there are many possible expansions. The sum of the probabilities of all possible expanded state sequences is taken as the likelihood P(O|I) that the feature frames are recognized as this word sequence (i.e., the probability of generating the feature frame sequence O given the state sequence I).

        4) Multiply the likelihood P(O|I) from the previous step by the prior probability P(I) of the word sequence given by the language model to obtain (up to a constant factor) the posterior probability P(I|O) of the word sequence (the feature frame sequence O is fixed, so P(O) is the same for every candidate and can be ignored).

        5) Find the word sequence with the largest posterior probability as the recognition result of the feature frame sequence.
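
        A sketch of the scoring in steps 2) to 5) above, with made-up parameters: each candidate word sequence (already expanded into a left-to-right state chain) is scored with the forward algorithm to get P(O|I), multiplied by a language-model prior P(I), and the candidate with the largest score is returned. In practice, Viterbi decoding over a composed search graph is used instead of scoring candidates one by one:

```python
import numpy as np

def forward_loglik(pi, A, B_cols):
    """log P(O | state chain) via the forward algorithm; B_cols[t] = P(o_t | each state)."""
    alpha = pi * B_cols[0]
    for t in range(1, len(B_cols)):
        alpha = (alpha @ A) * B_cols[t]
    return np.log(alpha.sum())

rng = np.random.default_rng(0)
T, N = 10, 3                                     # 10 feature frames, 3-state chain per candidate
candidates = {                                   # made-up word sequences with LM priors P(I)
    "hello": 0.7,
    "yellow": 0.3,
}
scores = {}
for word, lm_prior in candidates.items():
    pi = np.array([1.0, 0.0, 0.0])               # left-to-right chain starts in its first state
    A = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
    B_cols = rng.dirichlet(np.ones(N), size=T)   # stand-in for per-frame GMM emission scores
    scores[word] = forward_loglik(pi, A, B_cols) + np.log(lm_prior)

print(max(scores, key=scores.get))               # word sequence with the largest posterior
```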
