WFST-based speech recognition decoder

I have been learning speech recognition for a while. Our boss asked us to build a speech recognition system based on Kaldi: someone speaks into the microphone on the device, and a PC attached to the device shows the spoken content on its console in more or less real time. Because we are beginners, the initial requirements are modest. A traditional GMM-HMM system that handles isolated word recognition is enough to meet the bar, and we will take on harder tasks as our skills in this area grow. After the task was assigned, I split the work into three modules based on my rough understanding of Kaldi at the time: data preparation and MFCC, GMM-HMM, and decoding network construction plus decoding. Three people each take one part, learn its basic principles, and figure out what needs to be done. The other two colleagues picked their modules first, which left me responsible for building the decoding network and for decoding.

 

After the three of us had studied Kaldi for two or three weeks, we felt that it is not very easy to get started with, for four main reasons. First, we are all newcomers, and some concepts and workflows of speech recognition were not yet clear to us. Second, Kaldi's documentation is on the thin side, which makes the code harder to understand. Third, Kaldi is a collection of algorithms and tools, and the full speech recognition pipeline is implemented by stringing these tools together with many shell and Perl scripts, which is complicated and hard to read. Fourth, Kaldi is implemented in C++, while we had mainly developed in C (software engineers at chip companies mostly write low-level software in C). We overcame these difficulties and sorted out what each module does. We also worked out that the software has two major parts, training and recognition. Some modules, such as MFCC, are used in both training and recognition, while others are used only in training. Training prepares for recognition: it produces the acoustic model, the dictionary, and the language model, and these three are combined, using WFSTs, into one large decoding network that is used for recognition. In a deployed speech recognition system, only the recognition part of the software actually runs.
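To make the "combine into one decoding network" step a little more concrete, here is a minimal Python sketch of WFST composition on toy data. It is an illustration only and not Kaldi code: Kaldi builds its graph with OpenFst (the full graph is usually written HCLG) and applies further steps such as determinization and minimization, all of which are omitted here. The dictionary layout, the toy lexicon `L`, the toy grammar `G`, and the function `compose` are assumptions made for this example, and the sketch assumes the second transducer has no epsilon input labels.

```python
from collections import deque

EPS = ""  # epsilon: the arc consumes or emits nothing on that side

# Assumed layout for this sketch:
#   {"start": state, "finals": {state: cost}, "arcs": {state: [(in, out, cost, next)]}}

# Hypothetical toy lexicon L: phone sequences in, words out.
L = {
    "start": 0,
    "finals": {0: 0.0},
    "arcs": {
        0: [("y", "yes", 0.0, 1), ("n", "no", 0.0, 3)],
        1: [("eh", EPS, 0.0, 2)],
        2: [("s", EPS, 0.0, 0)],
        3: [("ow", EPS, 0.0, 0)],
    },
}

# Hypothetical toy grammar G: words in, words out, costs are -log probabilities.
G = {
    "start": 0,
    "finals": {0: 0.0},
    "arcs": {0: [("yes", "yes", 0.4, 0), ("no", "no", 1.1, 0)]},
}


def compose(a, b):
    """Compose a with b, assuming b has no epsilon input labels."""
    start = (a["start"], b["start"])
    result = {"start": start, "finals": {}, "arcs": {}}
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        qa, qb = state
        out = result["arcs"].setdefault(state, [])
        # A pair state is final only if both component states are final.
        if qa in a["finals"] and qb in b["finals"]:
            result["finals"][state] = a["finals"][qa] + b["finals"][qb]
        for ilab, mid, cost_a, na in a["arcs"].get(qa, []):
            if mid == EPS:
                # a emits nothing, so b stays where it is at no extra cost.
                matches = [(EPS, 0.0, qb)]
            else:
                # a's output label must match an input label of b.
                matches = [(o, c, nb)
                           for i, o, c, nb in b["arcs"].get(qb, [])
                           if i == mid]
            for olab, cost_b, nb in matches:
                nxt = (na, nb)
                # Costs add along a path (tropical semiring "times").
                out.append((ilab, olab, cost_a + cost_b, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return result


LG = compose(L, G)
```

The result `LG` reads phones and emits words, and each word arc now carries the grammar cost, so a single graph encodes both the pronunciation and the language model constraints. The same idea, applied to more component transducers, is what lets recognition search one network instead of consulting three separate models.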

 

My specific responsibility is decoding, which has two major parts: generating the decoding network, and decoding based on that network. The key concept here is the WFST (weighted finite-state transducer). WFST theory belongs to semiring algebra. I come from a control engineering background. As an undergraduate I studied advanced mathematics, matrices, probability theory, and so on, and in graduate school I studied matrices in more depth (control requires a lot of matrix theory), but I never studied other branches of mathematics, so semiring algebra was entirely new to me. Without that foundation it is not easy to learn, and people online also say that those who studied mathematics or theoretical computer science pick it up more easily. At this stage we only need to build a working speech recognition system and are not required to study the algorithms in depth, and time does not allow it anyway (the boss set us a deadline). So I learned only the basic principles of WFSTs and then got familiar with the scripts and the code flow through concrete examples.

Kaldi has two types of decoders: offline and online. yesno is the simplest offline decoding example, and by running it I basically figured out what the relevant scripts and functions do. Because we want to build an online real-time system, the offline decoder is of limited reference value, so I did not spend more time on it and turned to Kaldi's online decoder instead. The online decoder comes in two versions: online (the old version) and online2 (the new version). The official website recommends online2, whose example is based on RM (Resource Management), and says that the old online version will gradually be deprecated. However, I could not download the RM corpus from the internet, so the RM example cannot be run and I had to go with the old online version. Fortunately, many users say the old online version is still easy to use, which reassured me.
The blog posts I read all do online decoding based on thchs30, described as the only Chinese recognition example in Kaldi. Following their guidance, I first downloaded the corpus and ran the various training stages to generate the decoding network. Then I downloaded PortAudio, which is used to capture voice data from the microphone on the PC. Finally I modified the run script, and with that the online decoding example was up and running, showing the spoken text on the PC console in real time. By adding some logging to the code, I traced the calls and figured out the mechanisms and flow of the software during online decoding.
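As a rough illustration of the WFST and semiring ideas mentioned above, here is a minimal Python sketch of best-path decoding over a tiny transducer in the tropical semiring, where alternative paths are combined with min and costs along a path are added. It is not Kaldi's decoder: real decoders consume acoustic scores frame by frame and use beam pruning, none of which appears here. The `Arc` class, the toy arcs in `ARCS`, and the function `best_path` are all assumptions made up for this example.

```python
import math
from dataclasses import dataclass

# Tropical semiring over costs (negative log probabilities).
ZERO = math.inf  # identity for "plus": an impossible path
ONE = 0.0        # identity for "times": a step that costs nothing


def t_plus(a, b):
    """Combine two alternative paths: keep the cheaper one."""
    return min(a, b)


def t_times(a, b):
    """Extend a path by one arc: costs add up."""
    return a + b


@dataclass(frozen=True)
class Arc:
    src: int
    ilabel: str
    olabel: str   # "" stands for epsilon (no output)
    weight: float
    dst: int


# Hypothetical toy transducer: phone-like input symbols, word outputs.
ARCS = [
    Arc(0, "y", "yes", 0.5, 1),
    Arc(1, "eh", "", 0.2, 2),
    Arc(2, "s", "", 0.1, 3),
    Arc(0, "n", "no", 0.7, 4),
    Arc(4, "ow", "", 0.3, 3),
]
START, FINAL = 0, 3


def best_path(arcs, start, final, inputs):
    """Token passing: keep the cheapest (cost, outputs) per state."""
    tokens = {start: (ONE, [])}
    for sym in inputs:
        new_tokens = {}
        for arc in arcs:
            if arc.ilabel != sym or arc.src not in tokens:
                continue
            cost, outs = tokens[arc.src]
            new_cost = t_times(cost, arc.weight)
            old_cost = new_tokens.get(arc.dst, (ZERO, None))[0]
            # Replace the stored token only if the new path is strictly cheaper.
            if t_plus(new_cost, old_cost) != old_cost:
                new_outs = outs + ([arc.olabel] if arc.olabel else [])
                new_tokens[arc.dst] = (new_cost, new_outs)
        tokens = new_tokens
    # (ZERO, None) means no path reaches the final state.
    return tokens.get(final, (ZERO, None))


cost, words = best_path(ARCS, START, FINAL, ["y", "eh", "s"])
```

With the toy arcs above, the input `["y", "eh", "s"]` reaches the final state and yields the word list `["yes"]`, while an input such as `["y", "ow"]` matches no path and returns an infinite cost. Swapping `t_plus` and `t_times` for another semiring's operations would change what the weights mean without changing the search loop, which is the main reason the semiring abstraction is useful.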

 

So apart from the decoding algorithms themselves, I have basically figured out everything else. By convention, each of us prepares slides and presents to the group so that everyone can improve together. I made slides on the speech recognition decoder based on my own understanding (I borrowed some figures from various documents and blogs, and I thank their authors here without listing them one by one). Below are the slides I made, shared for anyone who needs them. If there are any errors, please point them out. Thank you very much!

 

Origin www.cnblogs.com/talkaudiodev/p/11064283.html