How three speech recognition novices built an embedded online speech recognition system based on Kaldi's traditional method (GMM + HMM + NGRAM) in three months: an overview

In recent blog posts I mentioned that I had moved from traditional voice (telephony) into intelligent speech (speech recognition). I started by learning the basics of speech recognition, and after studying I gave a PPT presentation to a group of colleagues (see my earlier post, Overview of the conventional speech recognition method (GMM + HMM + NGRAM)). Some time later the boss assigned a concrete task: build an online speech recognition system based on Kaldi on our own ARM chip, with three people and about three months to complete it. Because all three of us were novices in speech recognition, the requirements were relaxed: implementing the traditional GMM-HMM pipeline would be enough. To be honest, when we received the task we had little confidence that we could finish on time; after all, we were unfamiliar with both speech recognition and Kaldi. But since the task had been assigned, we had to bite the bullet and do our best. My first instinct was to search Baidu/Google for others' experience so we could avoid detours. Unfortunately, I found nothing of value, so we could only grope forward based on our previous experience. In the end we spent a little less than three months building the online speech recognition system on the embedded platform. Although it is only a demo, it laid a good foundation for the commercial product to follow, and we accumulated a lot of experience. Today I will share how we did it, as a reference for friends who want to build similar products.

 

As with any project, we needed a plan that divided the work into stages. When I learned the basics of speech recognition I already had a rough idea of Kaldi (I had known the name before I did speech recognition; in recent years artificial intelligence (AI) has been extremely hot, intelligent voice is one of the main areas where AI has landed, and many implementations are based on Kaldi, so as someone working in audio I naturally followed this popular field). Based on that rough understanding I divided the project into three phases. The first was to learn Kaldi, gain a deeper understanding of it, and work out the implementation plan for the later stages; we planned about a month for this. The second was software architecture design, writing code, and training models, again about one month. The third was debugging and improving the recognition rate, also about a month. The schedule would be fine-tuned according to the actual situation.

 

1. The first stage

The first stage was learning Kaldi. Since three of us were doing this project, I split the learning into three parts: data preparation and MFCC, GMM-HMM model training, and decoding-network creation plus decoding. I took decoding-network creation and decoding; the other two students each picked one of the remaining modules. The learning process was to read articles and blogs online, read the Kaldi code and scripts, and, after figuring out what a module does, present it to the group in a PPT so everyone could improve together (see my earlier article, The WFST-based speech recognition decoder). Kaldi has two kinds of decoding: offline (used for model debugging, etc.) and online (used for online recognition, etc.). Online decoding itself comes in two flavors: one acquires voice data from the MIC through PortAudio and recognizes it online; the other does online recognition by reading WAV audio files. Since we were building online speech recognition, both were good references, especially the MIC-acquisition mode, for which we needed to understand the mechanism built on PortAudio. So I set up thchs30 online recognition on a PC for debugging, and by reading the code and related blogs I basically figured out how it works. Kaldi is configured for a 16 kHz sampling rate and 25 ms frames (with a 10 ms frame shift), and it gathers 27 frames into one batch for MFCC feature extraction and decoding. One batch therefore covers 285 ms of speech (25 + (27 - 1) * 10 = 285), i.e. 4560 samples (16 * 285 = 4560). After processing one batch it takes the next batch from the buffer, does MFCC and decoding again, checks whether any words have been recognized, and prints them if so.
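As a sanity check on that arithmetic, here is a minimal sketch of the batch math (the constant names are mine, not Kaldi's):

```cpp
#include <cassert>
#include <cstdio>

int main() {
    const int kSampleRateKhz  = 16;  // 16 kHz sampling rate
    const int kFrameLenMs     = 25;  // 25 ms analysis frame
    const int kFrameShiftMs   = 10;  // 10 ms frame shift
    const int kFramesPerBatch = 27;  // frames per decode batch

    // One batch spans the first frame plus 26 frame shifts.
    const int batch_ms      = kFrameLenMs + (kFramesPerBatch - 1) * kFrameShiftMs;
    const int batch_samples = kSampleRateKhz * batch_ms;

    assert(batch_ms == 285);        // 25 + 26 * 10
    assert(batch_samples == 4560);  // 16 * 285
    printf("batch = %d ms = %d samples\n", batch_ms, batch_samples);
    return 0;
}
```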

 

2. The second stage

The first stage was learning; in the second stage the real work began. We develop on Linux. We first set the goal of building a complete system: with the device connected to a PC by a data cable, it should recognize the English digits 0-9 online in real time (we chose these because recordings by British speakers are readily available online, which saved us the work of recording audio ourselves). In other words, when someone speaks an English digit 0-9 in front of the device, it is printed on the PC screen in real time, with a recognition rate preferably close to what a GMM-HMM model can deliver. Everyone's task followed on from the first stage. The student who had learned data preparation and MFCC handled data-preparation work such as labeling for the model-training student, and then ported the MFCC-related Kaldi code. The student who had learned model training started preparing the training pipeline and began training once the data was ready. I was responsible for the overall software architecture design, and also for porting the vast majority of Kaldi (everything except MFCC) into our system. Having studied Kaldi, I had a fairly clear idea of how to design the software architecture of this online speech recognition system. Speech recognition has two phases, training and recognition. The training phase produces the model the recognition phase uses; it is relatively independent. We trained the model with Kaldi and finally obtained final.mdl and other files for the recognition-phase software to use (these files are read when the decoding network is initialized). The recognition-phase software consists of two main parts: sound capture and recognition (feature extraction plus decoding). The system therefore has two threads: an audio capture thread, built on ALSA, responsible for capturing and pre-processing sound (e.g., noise reduction), and a Kaldi process thread, responsible for MFCC and decoding. The two threads exchange data through a ring buffer, taking care to protect the shared data. The software architecture block diagram of the system is as follows:

 

Once the software architecture was agreed, I started writing code to build the software framework. Creating threads and so on is routine work on Linux. The audio capture thread first does initialization, including initializing the ALSA configuration and the pre-processing module; it then periodically reads audio data through the ALSA library API, pre-processes it, writes the processed audio into the ring buffer, and wakes the Kaldi process thread so it can start working. The Kaldi thread also does some initialization and then sleeps waiting to be woken; once woken, it takes voice data out of the ring buffer, does MFCC and decoding, then sleeps again until the next wake-up. While the framework was being set up, the Kaldi code had not yet been ported in, so the Kaldi process thread simply took the PCM voice data from the ring buffer and wrote it to a file; listening to the file with CoolEdit and hearing normal voice showed that the software framework had basically taken shape. At first the audio capture thread had no pre-processing module; during debugging I wrote the PCM data obtained from ALSA to a file and heard noise in it, so I added a noise-suppression (ANS) module, taken from webRTC. I had used webRTC's three pre-processing modules (AEC/ANS/AGC) a few years ago, so picking it up again was simple, and the de-noising effect was quite good. ANS processes one 10 ms chunk per call, and as I said earlier, Kaldi's online decoding processes one batch of 27 frames, i.e. 285 ms, at a time, so I took the least common multiple of the two, 570 ms, as the loop period of the audio capture thread. After fetching 570 ms of speech data from ALSA, the thread runs noise suppression 57 times (570/10 = 57) and then writes the suppressed voice data into the ring buffer. Each time the Kaldi thread is woken it takes out and processes 285 ms of voice data, i.e. exactly two batches (570/285 = 2).
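A minimal sketch of this two-thread structure, under my own naming (the ALSA read, webRTC ANS call, and Kaldi decode are elided as comments; `RingBuffer` is a simplified stand-in for the real one):

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Timing: 570 ms is the LCM of the 10 ms ANS chunk and the 285 ms decode batch.
constexpr int kSampleRate   = 16000;
constexpr int kAnsChunk     = kSampleRate * 10  / 1000;  // 160 samples, one ANS call
constexpr int kBatchSamples = kSampleRate * 285 / 1000;  // 4560 samples, one decode batch
constexpr int kLoopSamples  = kSampleRate * 570 / 1000;  // 9120 samples per capture loop

// A simple protected ring buffer between the two threads.
class RingBuffer {
 public:
  void Write(const int16_t* data, size_t n) {
    std::lock_guard<std::mutex> lk(m_);
    buf_.insert(buf_.end(), data, data + n);
    cv_.notify_one();                       // wake the Kaldi thread
  }
  // Blocks until n samples are available, then pops them.
  std::vector<int16_t> Read(size_t n) {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return buf_.size() >= n; });
    std::vector<int16_t> out(buf_.begin(), buf_.begin() + n);
    buf_.erase(buf_.begin(), buf_.begin() + n);
    return out;
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<int16_t> buf_;
};

RingBuffer g_ring;

void AudioCaptureThread() {
  // ... ALSA and webRTC ANS initialization elided ...
  int16_t pcm[kLoopSamples];
  for (;;) {
    // alsa_read(pcm, kLoopSamples);                     // 570 ms from ALSA
    for (int i = 0; i < kLoopSamples / kAnsChunk; ++i)   // 57 ANS calls
      /* ans_process_10ms(pcm + i * kAnsChunk) */;
    g_ring.Write(pcm, kLoopSamples);
  }
}

void KaldiProcessThread() {
  // ... decoding-network initialization (final.mdl etc.) elided ...
  for (;;) {
    std::vector<int16_t> batch = g_ring.Read(kBatchSamples);  // 285 ms
    // kaldi_decode_batch(batch);                             // MFCC + decoding
  }
}

int main() {
  std::thread cap(AudioCaptureThread), dec(KaldiProcessThread);
  cap.join();
  dec.join();
}
```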

 

With the software framework in place, I began porting the Kaldi code. Kaldi has a large code base; we could not and did not need to migrate all of it, only the parts we actually used. How to extract just those parts? After some thought I used the following method: first port in the online-decoding-related code, then compile repeatedly, and whatever the compiler reports as missing, add that piece, until it compiles. This approach guarantees that the necessary files are brought into the system, although some functions in those files may be unused; that is, the granularity is the file level, not the function level. Pressed for time, we set that problem aside. The migration was mostly manual labor that required care and attention to detail; problems encountered along the way were searched online and eventually resolved. Kaldi mainly uses three open-source libraries: openFST, BLAS, and LAPACK. For BLAS and LAPACK I used the conventional method: download the compiled libraries from the official site, then put the libraries and header files under the system's /usr/lib and /usr/include so other code can use them. The BLAS implementations Kaldi supports include ATLAS, CLAPACK, openBLAS, MKL, and so on. Kaldi running on an Ubuntu x86 PC uses Intel's MKL, which cannot be used on ARM, so one of the others was needed. After evaluation I settled on openBLAS, mainly for three reasons: 1) it is BSD-licensed; 2) it supports multiple architectures (ARM/x86/MIPS/...) and is the best-performing open-source BLAS (it embeds a lot of assembly code for the various architectures), used by many well-known companies such as IBM, ARM, nvidia, and huawei; 3) it has multiple build options to choose from, such as single-threaded vs multi-threaded, and setting the number of threads. Early BLAS code was written in Fortran and only later wrapped in C, so Fortran support had to be added to the system. As for openFST, I found that the conventional method does not use much of it, so rather than building it as a library I ported the used code directly into the system. After I had everything compiling, the other students ported in the remaining pieces, MFCC and the ALSA interface (the ALSA interface replaces Kaldi's PortAudio interface). With that, the porting work was done. Comparing the Kaldi code ported into our system against Kaldi's src directory, we used only a small fraction. The figure below lists the Kaldi files ported into the system (associated header files not listed). The student responsible for model training had also produced a preliminary set of model files; after putting them into the system and running it, the words people spoke were printed on the PC screen, but they were wrong. That is normal, because nothing had been debugged yet!
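To confirm an openBLAS port works on the target, a smoke test along these lines can help (a sketch, assuming openBLAS's standard CBLAS header and its `openblas_set_num_threads` extension; compile with something like `g++ test.cc -lopenblas`):

```cpp
#include <cblas.h>   // CBLAS interface shipped with openBLAS
#include <cstdio>

extern "C" void openblas_set_num_threads(int);  // openBLAS extension

int main() {
    openblas_set_num_threads(1);  // single-threaded on a small ARM target

    // C = A * B for 2x2 row-major matrices.
    const float A[4] = {1, 2, 3, 4};
    const float B[4] = {5, 6, 7, 8};
    float C[4] = {0};
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0f, A, 2, B, 2, 0.0f, C, 2);

    // Expected result: 19 22 43 50
    printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```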

 

 

3. The third stage

The third stage was debugging. At the end of the second stage the system could speak words out, but wrong ones, so we had to locate the problem. An online speech recognition system can be divided into two broad areas: the model and the code. First we needed to determine whether the problem was in the model or in the code implementation, starting with the model. In the first stage, while figuring out the thchs30 online-decoding mechanism, we had used the tri1 model, whose recognition rate was poor. Now that the recognition rate mattered, we replaced it with tri2b and the recognition rate improved. This showed that Kaldi's online decoding code was fine and the problem lay in a poor model; moreover, with so many people worldwide using Kaldi, any bug in online decoding would long since have been fixed. So we decided to put our own generated model files into thchs30 to verify whether our model was the problem, and, to exclude noise interference from MIC input, we verified using the file-reading mode. With our model files in place, recognition was almost entirely wrong, indicating the model was at fault. The student responsible for the model investigated and found that the training sources were all sampled at 8 kHz while online decoding uses 16 kHz samples; this was a pit we had dug for ourselves. We filled it by converting all the 8 kHz audio to 16 kHz with a resampling program, but the recognition rate was still poor. We also realized that the training set was entirely British pronunciation while the test set was Chinese speakers with a certain accent, so it would be best to use our own Chinese speakers' pronunciation as the training set. We therefore recorded our own sources for training, and to enlarge the training data we asked many other people to record audio as well. After training, the new model was put into thchs30 for validation; the recognition rate reached roughly 60-70%, which showed the model's general direction was right, but to improve the recognition rate the model still needed further debugging.
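For illustration, here is a naive sketch of the kind of 8 kHz to 16 kHz conversion involved (simple linear interpolation on raw PCM; the actual program, like any production pipeline, should use a proper low-pass-filtering resampler such as sox or Kaldi's own):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Upsample 8 kHz mono 16-bit PCM to 16 kHz by a factor of 2,
// inserting a linear midpoint between neighboring samples.
std::vector<int16_t> Upsample8kTo16k(const std::vector<int16_t>& in) {
    std::vector<int16_t> out;
    out.reserve(in.size() * 2);
    for (size_t i = 0; i < in.size(); ++i) {
        out.push_back(in[i]);
        int16_t next = (i + 1 < in.size()) ? in[i + 1] : in[i];
        out.push_back(static_cast<int16_t>((in[i] + next) / 2));
    }
    return out;
}

int main() {
    std::vector<int16_t> pcm8k = {0, 100, 200, 300};
    std::vector<int16_t> pcm16k = Upsample8kTo16k(pcm8k);
    printf("%zu -> %zu samples\n", pcm8k.size(), pcm16k.size());  // 4 -> 8
    return 0;
}
```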

 

Next we checked whether the code had problems. We put the newly produced model into our own system, and used the audio-file input path instead of MIC capture (our system can both capture data from the MIC and read data from an audio file; the file path exists for debugging) in order to exclude interference factors such as noise while examining the code. Running it, we found the recognition rate was still very poor, which showed our code also had problems. In the second stage I had already debugged part of the code and made sure that the PCM audio data the Kaldi process thread takes from the ring buffer is correct. Two aspects remained to debug: first, whether the PCM data fed to MFCC was OK; second, whether our online decoding mechanism was kept exactly the same as Kaldi's. The first was quickly confirmed. For the second, we studied Kaldi's online decoding mechanism in depth until we thoroughly understood it and corrected the places where ours differed. After two or three days of debugging, the recognition rate was about the same as with thchs30, which showed that after debugging our code was also a solid base; next we could start tuning performance.
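The file-input debug path can be as simple as pushing raw PCM from disk through the same ring buffer the MIC path fills; a sketch, building on the `RingBuffer`/`g_ring` from the earlier skeleton and assuming a headerless 16 kHz/16-bit mono PCM file:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

extern RingBuffer g_ring;  // the ring buffer from the earlier sketch

// Feed a raw PCM file into the ring buffer in 570 ms chunks,
// mimicking what the ALSA capture thread delivers.
bool FeedPcmFile(const char* path) {
    FILE* f = fopen(path, "rb");
    if (!f) return false;
    std::vector<int16_t> chunk(9120);  // 570 ms at 16 kHz
    size_t n;
    while ((n = fread(chunk.data(), sizeof(int16_t), chunk.size(), f)) > 0)
        g_ring.Write(chunk.data(), n);
    fclose(f);
    return true;
}
```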

 

Up to this point the online recognition had been done by reading data from an audio file, which is relatively clean data. Now we did true online recognition by reading audio data from the MIC, and testing showed the recognition rate dropped noticeably, which meant our pre-processing was incomplete (earlier, while debugging, we had only added the ANS module). I dumped the audio data before and after processing and listened with CoolEdit; the sound quality was indeed sometimes poor. So I also added webRTC's AGC module, dumped the before-and-after audio again, and after listening many times the sound quality felt normal. Running online recognition again with AGC added to the MIC capture path, the recognition rate was indeed significantly improved. With all the pre-processing we could do now done, further gains in recognition rate were up to the model. The student doing the model kept asking more people to record training sources while trying various models, eventually settling on tri4b, which gave us a relatively good recognition rate. Since we were using GMM-HMM, which mainstream speech recognition no longer uses, the boss felt there was no need to keep tuning it; a mainstream model will certainly replace it later, but the embedded online speech recognition software as a whole, especially the audio capture and the software architecture, remains useful, and the real product will be built on this code.
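The dump-and-listen trick is just writing each stage's PCM to its own raw file so it can be auditioned in CoolEdit; a sketch (the file names and the helper are mine):

```cpp
#include <cstdint>
#include <cstdio>

// Append one chunk of 16-bit PCM to a raw dump file; import it into
// CoolEdit as headerless 16 kHz/16-bit mono to listen to it.
void DumpPcm(const char* path, const int16_t* data, size_t n) {
    FILE* f = fopen(path, "ab");
    if (!f) return;
    fwrite(data, sizeof(int16_t), n, f);
    fclose(f);
}

// Usage inside the capture loop:
//   DumpPcm("pre_proc.raw",  pcm, kLoopSamples);  // before ANS/AGC
//   ... run ANS and AGC on pcm ...
//   DumpPcm("post_proc.raw", pcm, kLoopSamples);  // after processing
```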

 

To veterans of the speech recognition field, this embedded online speech recognition system is very immature. But by building it we gained a better feel for the field, made a good start, and gave the boss the confidence to let us continue. This kind of work is mainly an engineering matter; we want to go deeper later and accumulate more experience in speech recognition. We built this system without any useful reference material, purely by groping forward based on our past experience. Other products may not be built the same way, but many of the problem-solving ideas are the same. If any friends are also building an online speech recognition system on an embedded device, you are welcome to discuss it with us, and may you build an even better online speech recognition system.
