Kaldi Speech Recognition Technology (6) ----- DTW and HMM-GMM

Foreword

In the previous chapters we completed feature extraction, so this chapter focuses on the theoretical side; understanding the theory makes the practical work more efficient. We briefly review speech recognition as a whole, then introduce DTW (Dynamic Time Warping), one of the simplest and historically most widely used algorithms for speech recognition.

1. Overview of Speech Recognition

Speech recognition has made real breakthroughs. On August 20, 2017, Microsoft reduced its speech recognition system's error rate from 5.9% to 5.1%, reaching the level of a professional stenographer; iFLYTEK, the leader of China's speech recognition industry, reports a dictation accuracy of 95%. Large Chinese companies such as Alibaba, Baidu, and Tencent have also invested heavily in speech recognition, and the prospects are promising.

Moreover, speech recognition is not limited to the phone interaction and smart-speaker commands mentioned above; it also plays a significant role in toys, home appliances, automobiles, justice, healthcare, education, industry, and many other fields. In this early era of artificial intelligence, voice is the most efficient mode of human-computer interaction, at least until devices can read human intent more directly.

The huge language databases involved are difficult to fit on a mobile device, which is why almost all phone voice assistants need an Internet connection. Offline versions do exist, but it is easy to see that their accuracy is much lower than that of the online versions.

In addition, as noted above, many speech vendors claim accuracy above 90%, which is remarkable: at this level, every additional percentage point is a qualitative leap. Reaching it requires not only a fairly complete database but also efficient recognition and feature-extraction algorithms, plus a self-learning system.

However, such figures should be viewed critically. As the saying goes, the Chinese language is broad and profound, and the test sets behind the vendors' accuracy figures can hardly be comprehensive, so it is normal for some users to find the voice recognition feature disappointingly "dumb" in practice.

As for the recognition algorithm and self-learning system, here is a brief sketch of the workflow: first, the system preprocesses the captured speech. This stage is already quite complex, covering signal sampling, anti-aliasing band-pass filtering, and compensation for individual pronunciation differences and for noise introduced by equipment and environment. Feature extraction is then performed on the preprocessed speech.

2. Basic principles of speech recognition

Sound is fundamentally vibration and can be represented as a waveform. For recognition, the waveform is divided into frames; several frames make up a state, and (typically) three states make up a phoneme. The most commonly used English phoneme set is Carnegie Mellon University's set of 39 phonemes; Chinese generally uses all the initials and finals as its phoneme set, and Chinese recognition is further divided into tonal and atonal variants. Words or Chinese characters are then assembled from the phoneme sequences.
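As a minimal sketch of the framing step described above, here is a NumPy illustration; the 16 kHz sampling rate, 25 ms frame length, and 10 ms frame shift are conventional example values, not ones taken from this article:

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping fixed-length frames (illustrative sketch)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# 1 second of a fake 16 kHz signal; 25 ms frames (400 samples), 10 ms shift (160 samples)
sig = np.arange(16000, dtype=float)
frames = frame_signal(sig, frame_len=400, hop_len=160)
print(frames.shape)  # (98, 400)
```

Each row of `frames` would then be passed to feature extraction (e.g. MFCC) to produce one feature vector per frame.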

Of course, the subsequent matching and post-processing stages also need their own algorithms. The self-learning system mostly concerns the databases: a system that converts speech to text must have two of them, an acoustic model database matched against the extracted features, and a text language database matched against the word hypotheses. Both must be trained on large amounts of data in advance; this training, which distills useful statistical models into the databases, is what the "self-learning system" refers to.

In addition, during recognition the self-learning system can accumulate the user's habits and pronunciation patterns back into the database, so the recognition system becomes smarter for that particular user over time.

To further summarize the entire identification process:

  • Process the collected target speech to obtain the portions containing key information
  • Extract the key information as features
  • Recognize the smallest units (words) and analyze the grammatical arrangement
  • Analyze the semantics of the whole sentence, order the key content, and adjust the wording of the text
  • Correct minor deviations using the overall context

3. DTW (Dynamic Time Warping) Algorithm

[Figure: how features are converted to phonemes]

[Figure: alignment]

One of the simplest approaches to speech recognition is based on DTW. The principle of the DTW (Dynamic Time Warping) algorithm: using the idea of dynamic programming (DP), it solves the template-matching problem for pronunciations of different lengths. Compared with HMM-based methods, DTW requires almost no training computation, so it is still widely used for isolated-word recognition.

In both the training and recognition phases, an endpoint-detection algorithm first determines the start and end of the speech. The reference template is {R(1), R(2), ..., R(m), ..., R(M)}, where R(m) is the speech feature vector of the m-th frame; the test template is {T(1), T(2), ..., T(n), ..., T(N)}, where T(n) is the feature vector of the n-th frame of the test template. Reference and test templates generally use the same feature type, frame length, window function, and frame shift.

For test and reference templates T and R, define a distance D[T, R] measuring their dissimilarity: the smaller the distance, the higher the similarity. DTW usually uses the Euclidean distance between frames. When N and M differ, the alignment of T(n) with R(m) must be considered, and dynamic programming (DP) is generally used to realize the mapping from T to R.

Mark the frame numbers n = 1...N of the test template on the horizontal axis of a two-dimensional coordinate system and the frame numbers m = 1...M of the reference template on the vertical axis; drawing horizontal and vertical lines through these integer coordinates forms a grid, in which each grid point (n, m) pairs one frame of the test template with one frame of the reference template. The DP algorithm then reduces to finding a path through the grid points; the points on the path are exactly the frame pairs whose distances are accumulated when comparing the test and reference templates.
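The grid search described above can be sketched in plain NumPy. This is a minimal illustration of the DP recursion, not a production implementation (the function name `dtw_distance` is mine):

```python
import numpy as np

def dtw_distance(t, r):
    """Minimal DTW sketch: accumulate frame distances over the (n, m) grid.
    t, r: arrays of shape (N, d) and (M, d) holding per-frame feature vectors."""
    n, m = len(t), len(r)
    # acc[i, j] = minimal accumulated distance aligning t[:i] with r[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(t[i - 1] - r[j - 1])  # Euclidean frame distance
            # allowed moves: diagonal match, or stretch either template (monotone, no turning back)
            acc[i, j] = d + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return acc[n, m]

# a warped copy of a sequence still matches perfectly
print(dtw_distance(np.array([[0.], [1.], [2.]]),
                   np.array([[0.], [1.], [1.], [2.]])))  # 0.0
```

In isolated-word recognition, the test template is compared against every reference template this way, and the word with the smallest accumulated distance wins.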

DTW (Dynamic Time Warping)

Based on the nearest-distance principle, construct a correspondence between the elements of two sequences and use it to evaluate their similarity.

Requirements
(1) Monotonic correspondence: the alignment moves forward only, with no turning back
(2) Complete correspondence: every element of each sequence is matched to at least one element of the other, with none skipped
(3) Minimum total distance after correspondence (the smaller the distance, the better the match)

Example:

[Figure: DTW alignment between two sequences]

Algorithm implementation
[Figures: step-by-step construction of the DTW cost matrix and optimal path]

Python packages for DTW

pip install dtw
pip install dtw_c
pip install fastdtw

Disadvantages of using DTW

  1. DTW can only recognize whole words as units (isolated-word recognition); its performance on continuous speech is poor;

  2. The amount of computation is large: DTW must store all training templates, and both memory use and recognition time grow linearly with the amount of training data;

  3. It depends too heavily on the speaker's original pronunciation and cannot be trained incrementally on new samples;

  4. Recognition performance depends heavily on endpoint detection (and hence on the quality of feature extraction).

4. GMM-HMM

GMM-HMM is a mature system, but I do not yet know it in depth, so this part is only a brief introduction.

GMM-HMM Speech Recognition System:
[Figure: GMM-HMM speech recognition system]

Purpose: Express speech generation as a probabilistic model


  • GMM (Gaussian Mixture Model) — models the distribution of the acoustic feature vectors emitted by each state
  • HMM (Hidden Markov Model) — models the sequence of hidden states over time
  • GMM-HMM — the combined acoustic model: an HMM whose state emission probabilities are GMMs
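To make "expressing speech generation as a probabilistic model" concrete, here is a minimal sketch of evaluating one feature vector under a diagonal-covariance GMM, which is what a GMM-HMM system does for each state at each frame (the function name and toy parameters are mine):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.
    weights: (K,); means, variances: (K, d). Illustrative sketch only."""
    x = np.asarray(x, dtype=float)
    # per-component log density of a diagonal Gaussian
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_quad = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_quad
    # log-sum-exp over the K mixture components, for numerical stability
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# toy two-component mixture in 2-D
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
var = np.ones((2, 2))
ll = gmm_log_likelihood([0.0, 0.0], w, mu, var)
print(round(ll, 3))  # ≈ -2.531, dominated by the first component
```

During decoding, these per-state emission log-likelihoods are combined with the HMM transition probabilities to score candidate state sequences.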




Origin blog.csdn.net/yxn4065/article/details/129116613