Speech recognition technology (AI, VQ, HMM), voice data samples, open source tools (Kaldi, etc.), and dual-microphone arrays

- AI: computer vision, speech recognition, NLP (natural language processing)
  Artificial intelligence is transitioning from relatively basic computational intelligence to higher levels of intelligence. The higher levels comprise two stages: perceptual intelligence and cognitive intelligence.
  The first stage is perceptual intelligence: the machine must be able to hear and see. Hearing is what we usually call speech recognition, where the machine converts a person's spoken words from a sound signal into text.
  Seeing is computer vision: recognizing faces or objects, and even changes in emotion.
  After passing the perception stage, artificial intelligence enters the cognition stage, where the machine begins to understand. For example, in speech recognition the machine only transcribes the text; it cannot really know what information the speaker wants to convey. That requires natural language understanding.

- Speech recognition technology: vector quantization (VQ), hidden Markov models (HMM), and related methods
  Current mainstream speech recognition systems generally use acoustic models based on deep neural networks combined with hidden Markov models (DNN-HMM). The input to the acoustic model consists of spectral features extracted from the speech waveform after windowing and framing, such as PLP, MFCC, and FBANK. The output of the model uses acoustic modeling units of different granularities, such as mono-phones, mono-phone states, or tied tri-phone states. Various neural network structures can be used to map the input acoustic features to posterior probabilities over the output modeling units, which are then combined with the HMM for decoding to obtain the final recognition result. (A minimal feature-extraction sketch follows below.)
  Most state-of-the-art speech technology solutions are phone-based and include a pronunciation model, an acoustic model, and a language model. In most cases these are built on hidden Markov models (HMM) and N-gram models.
  The main neural networks used in speech recognition are LSTM and RNN networks. A remaining difficulty is speech separation, as in the cocktail party problem.
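
  To make the front end concrete, here is a minimal sketch of extracting MFCC features from a waveform. It assumes the librosa library and a hypothetical 16 kHz file named sample.wav; Kaldi and other toolkits compute comparable features with their own binaries.

```python
# Minimal MFCC front-end sketch (assumes `librosa` is installed and a
# hypothetical 16 kHz mono file "sample.wav" exists).
import librosa
import numpy as np

# Load audio at 16 kHz, a sample rate typical for speech recognition.
y, sr = librosa.load("sample.wav", sr=16000)

# 25 ms window (400 samples) with a 10 ms hop (160 samples), the usual
# framing used before computing spectral features such as MFCC/FBANK.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Delta and delta-delta features are often appended before feeding the
# frames to a GMM or neural-network acoustic model.
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2)])
print(feats.shape)  # (39, num_frames)
```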

- Speech recognition research and application

  iFLYTEK and Baidu both provide speech recognition systems.
  The Machine Intelligence Laboratory of Alibaba DAMO Academy open-sourced a new-generation speech recognition model, DFSMN, which raised the record for speech recognition accuracy to 96.04%. The test is based on LibriSpeech, the world's largest free speech recognition corpus. Compared with the LSTM models most widely used in industry, the DFSMN model trains faster and recognizes more accurately. On smart speakers and smart home devices, the new DFSMN model trains about 3 times faster and recognizes about 2 times faster than the previous-generation technology.
  The development team behind the Firefox browser runs Common Voice (https://voice.mozilla.org/zh-CN), a crowdsourced voice data collection project.
  Facebook AI Research recently open-sourced wav2letter, a simple and efficient end-to-end automatic speech recognition (ASR) system. Wav2letter implements the papers "Wav2Letter: an End-to-End ConvNet-based Speech Recognition System" (https://arxiv.org/abs/1609.03193) and "Letter-Based Speech Recognition with Gated ConvNets" (https://arxiv.org/abs/1712.09444).

  High-performance computing is a related research direction. Like Kaldi, OpenBLAS is one of the most popular projects on GitHub and one of the best open source matrix computation libraries available; companies such as IBM, ARM, and Nvidia use OpenBLAS in their products. OpenBLAS is also one of the two default underlying matrix libraries in the Kaldi community, and it is a cornerstone of the open source speech community. https://github.com/xianyi/OpenBLAS
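
  As a quick illustration (not part of Kaldi itself), the sketch below checks which BLAS backend a NumPy build is linked against and runs a matrix multiplication that is dispatched to that BLAS; on many Linux systems that backend is OpenBLAS.

```python
# Sketch: inspect the BLAS backend NumPy was built against (often OpenBLAS)
# and run a matrix multiplication executed by that BLAS library.
import numpy as np

np.show_config()          # prints the BLAS/LAPACK libraries NumPy links to

a = np.random.rand(1024, 1024)
b = np.random.rand(1024, 1024)
c = a @ b                 # GEMM call handled by the underlying BLAS
print(c.shape)
```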
- Dual-microphone array

  Dual-microphone array (for example, the XFM10213 noise reduction module). As the name implies, only two microphones are used in the recording system. Google Home uses a dual-microphone hardware configuration; other smart home devices include the Amazon Echo and Xiaodu speakers.
  Introduction to microphone arrays: https://blog.csdn.net/qq_23660243/article/details/78689295
  Is a dual-microphone array much worse? It depends on the algorithms. Although the technical routes of dual-mic and multi-mic arrays are similar, the algorithm systems are quite different.
  Since more microphones make it easier to achieve good noise reduction and speech enhancement, achieving the same or similar effect with only two microphones makes dual-microphone algorithms more challenging than multi-microphone ones.
  A microphone array is a sound collection system that uses multiple microphones to capture sound from different spatial directions. Once the microphones are arranged as required, the corresponding algorithms (arrangement + algorithm) can solve many room acoustics problems, such as sound source localization, de-reverberation, speech enhancement, and blind source separation.

  Microphone arrays were already used in speech signal processing research as early as the 1970s and 1980s, and since the 1990s, speech signal processing algorithms based on microphone arrays have gradually become a new research hotspot. In the "voice control era", the importance of this technology is especially prominent.
  Beamforming acts like a spatial filter: applied in the time or frequency domain, it can estimate the direction of the incoming speech source and how that direction changes, and the analysis can be displayed as a beam on a polar plot showing the strength and angle of the speech signal.
  Dual-microphone noise reduction is commonly used in mobile phones (such as the Apple iPhone and Samsung Galaxy series) and laptops (such as the Lenovo Y series). The phase difference between the sound waves received by the two microphones is used to filter the signal, suppressing environmental background noise as much as possible and keeping only the desired sound. In a noisy environment, a device with this configuration can sound clear and nearly noise-free to the listener (see the sketch below).
  Note that a microphone array is not the same thing as an antenna array.
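
  The following is a minimal sketch of two of these ideas for a two-microphone array: estimating the inter-microphone delay with GCC-PHAT (which relates to the direction of arrival) and then aligning and averaging the two channels, i.e. simple delay-and-sum beamforming. It is an illustration under stated assumptions (NumPy only, a two-channel array `x` of shape (2, N), 16 kHz sampling), not production noise-reduction code.

```python
# Sketch: GCC-PHAT delay estimation + delay-and-sum for a 2-mic array.
# Assumptions: NumPy only; `x` is a (2, N) float array holding the two
# microphone channels sampled at 16 kHz.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)   # phase transform weighting
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

fs = 16000
x = np.random.randn(2, fs)          # placeholder two-channel signal (1 s)

# Time difference of arrival between the two mics; with a known mic spacing
# this maps to an angle of arrival (sound source localization).
tau = gcc_phat(x[1], x[0], fs, max_tau=0.001)
delay = int(round(tau * fs))

# Delay-and-sum: align channel 1 to channel 0 and average (simple beamforming).
aligned = np.roll(x[1], -delay)
beamformed = 0.5 * (x[0] + aligned)
print("estimated delay (samples):", delay, "output shape:", beamformed.shape)
```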

  Practical problems that microphone arrays can address: noise suppression, echo cancellation, de-reverberation, single or multiple sound source localization, estimation of the number of sound sources, source separation, and the cocktail party problem.

> Voice data sample
  VoxForge is a very active crowdsourced speech recognition database and trained-model repository.

A speech recognition sample (dictation kit) based on Julius: https://github.com/julius-speech/dictation-kit
An open source project that crowdsources audio samples; it aims for roughly 10,000 hours of audio covering various accents: https://github.com/mozilla/voice-web
  AISHELL-2, the world's largest open source Mandarin Chinese speech database, provides up to 1000 hours of open data and comes with a better system-level recipe. AISHELL-2 also includes evaluation sets: the TEST and DEV data cover three recording devices (iOS, Android, and a high-fidelity microphone), which makes experimental testing more scientific and diverse. The AISHELL-2 database is open source.
  Kaldi recipe for the Chinese open source database AISHELL-2: https://github.com/kaldi-asr/kaldi/tree/master/egs/aishell2
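
  For orientation, Kaldi recipes such as egs/aishell2 expect each dataset to be prepared as a data directory containing a few plain-text index files (wav.scp, text, utt2spk, spk2utt). The sketch below simply parses two of those files with plain Python to pair audio paths with transcripts; the directory name data/train is an assumption for illustration.

```python
# Sketch: read a Kaldi-style data directory (as used by egs/aishell2).
# Assumption: a hypothetical directory "data/train" containing
#   wav.scp  ->  "<utt-id> <path-to-wav>" per line
#   text     ->  "<utt-id> <transcript>"  per line
from pathlib import Path

data_dir = Path("data/train")

def read_kaldi_map(path):
    """Parse 'key value...' lines into a dict of key -> rest of line."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            mapping[key] = value
    return mapping

wavs = read_kaldi_map(data_dir / "wav.scp")
texts = read_kaldi_map(data_dir / "text")

# Pair each utterance's audio path with its transcript.
for utt_id in sorted(wavs):
    print(utt_id, wavs[utt_id], "->", texts.get(utt_id, "<no transcript>"))
```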

Massive data or user data, organized by vertical category. Travel-related verticals involve queries about locations, hotels, restaurants, train tickets, and air tickets; the team collected a lot of data in these fields.

> Open source speech recognition technology: Kaldi, OpenFst, OpenGrm; open source speech recognition toolkits: CMU Sphinx, Kaldi, HTK, Julius, ISIP, etc.

  Five speech recognition toolkits based on HMM and N-gram models: CMU Sphinx, Kaldi, HTK, Julius, and ISIP. They are all leading projects in the open source world. Unlike commercial speech recognition products such as Dragon and Cortana, these free, open source tools give developers greater freedom and lower development costs, so they have always maintained strong vitality in the developer community.
  1. Kaldi: https://github.com/tramphero/kaldi
  Kaldi 3rd offline technology exchange conference: https://www.lieyunwang.com/archives/444470

  Kaldi is arguably the most popular open source speech recognition framework in the world. Kaldi can also work with the open source MRCP implementation unimrcp, so it can serve requests for various media resource processing (for example, acting as a speech recognition resource server).
  Kaldi is a free, open source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using OpenFst).
  Kaldi is written in C++. The core library supports acoustic modeling with arbitrary phonetic context sizes, subspace Gaussian mixture models (SGMM) and standard Gaussian mixture models, as well as all of the commonly used linear and affine transforms. The Kaldi source code is released under the Apache License 2.0.
  Kaldi is a very powerful speech recognition toolkit, mainly developed and maintained by its lead author Daniel Povey. It currently supports training and inference for speech recognition models such as GMM-HMM, SGMM-HMM, and DNN-HMM; the neural network in DNN-HMM can be customized through configuration files, and structures such as DNN, CNN, TDNN, LSTM, and bidirectional LSTM are all supported. It is one of the most active projects on GitHub, and many speech technology companies at home and abroad start their R&D and testing from Kaldi.
  Kaldi also includes iVector-based speaker and environment adaptation, which improves the robustness of the whole speech recognition system, and an RNN language model (RNNLM): rescoring with the RNN language model captures long-range word dependencies better. Kaldi's mainstream acoustic model is the chain model, which reduces WER by a relative 6% to 8% on public data sets; its training criterion moves from CE+sMBR to LF-MMI, it trains and decodes at one third of the usual frame rate (roughly a 3x speedup), and it supports TDNN/LSTM/RNN network structures.
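
  As a side note, the HMM decoding that GMM-HMM and DNN-HMM systems rely on can be illustrated with a toy Viterbi pass over per-frame acoustic log-likelihoods. This is a generic textbook sketch in NumPy, not Kaldi code; Kaldi itself decodes over composed WFST graphs.

```python
# Toy Viterbi decoding over per-frame acoustic scores (generic sketch, not
# Kaldi code). States could be phone HMM states; in a real system the scores
# come from a GMM or neural-network acoustic model.
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """log_obs: (T, S) frame-by-state log-likelihoods;
    log_trans: (S, S) log transition probabilities;
    log_init: (S,) log initial-state probabilities."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]              # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # (prev_state, state)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    # Backtrace the best state sequence.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Tiny example: 3 states, 5 frames of random log-likelihoods.
rng = np.random.default_rng(0)
log_obs = np.log(rng.random((5, 3)))
log_trans = np.log(np.full((3, 3), 1.0 / 3.0))
log_init = np.log(np.full(3, 1.0 / 3.0))
print(viterbi(log_obs, log_trans, log_init))
```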
  Voiceprint recognition practice based on Kaldi: the core technology of Kuaishangtong is voiceprint (speaker) recognition. Based on Kaldi, it has tried the mainstream speaker recognition methods, from i-vectors over acoustic features and DNN-UBM/i-vectors to end-to-end deep learning, and finally settled on an embedding-based technical route. It is now working to bring voiceprint recognition quickly into urban IoT, financial scenarios, public security and justice, transportation, medical care, education, and other fields.

  Researchers in automatic speech recognition (ASR) can choose from a variety of open source toolkits to build a recognition system. Well-known ones include HTK and Julius (both implemented in C), Sphinx-4 (implemented in Java), and the RWTH ASR toolkit (implemented in C++). The main purpose of Kaldi is acoustic model research, so its closest competitors are HTK and the RWTH ASR toolkit (RASR) for building DNN/HMM speech recognition systems.
  However, Kaldi's specific requirements, namely finite-state transducers (FST), extended linear algebra support, and a non-restrictive license, led to the development of Kaldi as a separate toolkit.

Important features of Kaldi include:
- Finite-state transducer integration (compiles against the OpenFst toolkit as a library)
- Extended linear algebra support
- Extensible design
- Open source license: Apache 2.0, one of the least restrictive open source licenses
- Complete recipes: Kaldi provides complete recipes for building speech recognition systems
- Thorough testing: essentially all code has corresponding test routines

Kaldi toolbox overview:
1. Kaldi code structure and design choices, including an introduction to the components of a speech recognition system
2. Introduction to feature extraction
3. Acoustic models
4. Phonetic decision trees
5. Language models
6. Decoders
7. Brief overview of benchmark results

  2. OpenFst: http://www.openfst.org/twiki/bin/view/FST/FstQuickTour
FST (finite-state transducer)
  OpenFst in speech recognition: the language model grammar (G), the pronunciation lexicon (L), context-dependent acoustic units (C), and the HMM topology (H) can all be represented as FSTs; composing them yields HCLG.fst, over which Viterbi search performs speech recognition decoding.
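
  As a small illustration of FST composition (the operation behind building HCLG), here is a sketch using the pywrapfst Python bindings distributed with OpenFst; the two tiny transducers and their integer labels are made up for the example.

```python
# Sketch: build two tiny transducers and compose them with OpenFst's
# pywrapfst bindings (labels here are arbitrary integers; 0 means epsilon).
import pywrapfst as fst

def linear_fst(pairs):
    """Build a linear transducer over a sequence of (ilabel, olabel) pairs."""
    f = fst.Fst()
    one = fst.Weight.one(f.weight_type())
    state = f.add_state()
    f.set_start(state)
    for ilabel, olabel in pairs:
        nxt = f.add_state()
        f.add_arc(state, fst.Arc(ilabel, olabel, one, nxt))
        state = nxt
    f.set_final(state, one)
    return f

a = linear_fst([(1, 10), (2, 20)])        # maps 1 2  -> 10 20
b = linear_fst([(10, 100), (20, 200)])    # maps 10 20 -> 100 200

# Composition requires matching label sort; then a o b maps 1 2 -> 100 200.
a.arcsort(sort_type="olabel")
b.arcsort(sort_type="ilabel")
c = fst.compose(a, b)
print(c.num_states(), "states in the composed FST")
```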

- Install OpenFst (and the OpenGrm/Thrax grammar tools)
sudo yum install openfst openfst-devel opengrm-ngram opengrm-ngram-devel
wget http://openfst.cs.nyu.edu/twiki/pub/GRM/ThraxDownload/thrax-1.1.0.tar.gz
tar xfv thrax-1.1.0.tar.gz
cd thrax-1.1.0
./configure
make
sudo make install


Source: https://blog.csdn.net/ShareUs/article/details/94132326