The 8 Automatic Speech Recognition Papers You Should Know in 2019!

Author | Derrick Mwiti    Translator | Nuka-Cola    Editor | Linda

AI Frontline Introduction: Technologies that recognize human speech and process it with computers are collectively referred to as speech recognition. Today, this technology is widely used in user authentication systems and for giving instructions to smart assistants such as Google Assistant, Siri, or Cortana.

Essentially, automatic speech recognition systems are built by storing human speech and training the system to find the words and patterns in that speech. In this article, we will look at several important papers that set out to solve this problem with machine learning and deep learning techniques.

For more high-quality content, follow the WeChat public account "AI Frontline" (ID: ai-front).

Deep Speech 1: Scaling Up End-to-End Speech Recognition

The authors are from Baidu's Silicon Valley AI Lab. Deep Speech does not require a phoneme dictionary; instead, it relies on an optimized RNN training system designed to exploit multiple GPUs for performance. The model achieves a 16% error rate on the Switchboard Hub5'00 (2000) dataset. GPUs are needed because the model is trained on thousands of hours of data. The model also copes well with noisy speech environments.

Deep Speech: Scaling up end-to-end speech recognition

https://arxiv.org/abs/1412.5567v2

The primary building block of Deep Speech is a recurrent neural network that, once trained, takes speech spectrograms and produces English text transcriptions. The goal of the RNN is to convert an input sequence into a sequence of character probabilities for the transcription. The RNN has five hidden layers, the first three of which are non-recurrent; at each time step these layers process the data independently. The fourth layer is a bidirectional recurrent layer with two sets of hidden units: one performs the forward recursion, the other the backward recursion. After prediction, the model computes the connectionist temporal classification (CTC) loss to measure the prediction error. Training uses Nesterov's accelerated gradient method.
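As a rough illustration of that structure, the sketch below stacks three non-recurrent layers, one bidirectional recurrent layer, and a character output layer in PyTorch. Layer sizes, the feature dimension, and the clipping threshold are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# A minimal Deep Speech-style sketch: three non-recurrent layers, one
# bidirectional recurrent layer, and a character output layer.
class DeepSpeechSketch(nn.Module):
    def __init__(self, n_features=160, n_hidden=1024, n_chars=29):
        super().__init__()
        # Non-recurrent layers, applied independently at every time step.
        self.ff = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.Hardtanh(0, 20),
            nn.Linear(n_hidden, n_hidden), nn.Hardtanh(0, 20),
            nn.Linear(n_hidden, n_hidden), nn.Hardtanh(0, 20),
        )
        # Bidirectional recurrent layer: forward and backward recursion.
        self.birnn = nn.RNN(n_hidden, n_hidden, bidirectional=True, batch_first=True)
        # Output layer over characters (including the CTC blank symbol).
        self.out = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, spectrogram):              # (batch, time, n_features)
        x = self.ff(spectrogram)
        x, _ = self.birnn(x)
        return self.out(x).log_softmax(dim=-1)   # per-frame character log-probabilities

model = DeepSpeechSketch()
log_probs = model(torch.randn(2, 50, 160))       # (2, 50, 29)
```

The Hardtanh(0, 20) activation stands in for a clipped rectifier, which keeps activations from growing unboundedly during training.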

To reduce variance during training, the authors add a 5% to 10% dropout rate to the feed-forward layers, but not to the recurrent hidden activations; the figure below shows an example of an RNN transcription. The authors also integrate an N-gram language model into the system, because N-gram models can easily be trained on large unlabeled text corpora, and they combine its score with the network's output during decoding.
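A minimal sketch of that combination, assuming the usual recipe of an acoustic score plus a weighted language model score and a word-count bonus (the weights below are placeholders), looks like this:

```python
import math

# Hypothetical decoding score for one candidate transcription. alpha and beta
# are tunable weights (placeholder values), log_p_ctc is the acoustic (CTC)
# log-probability, and log_p_lm comes from the N-gram language model.
def combined_score(log_p_ctc, log_p_lm, num_words, alpha=1.0, beta=1.5):
    return log_p_ctc + alpha * log_p_lm + beta * num_words

print(combined_score(log_p_ctc=math.log(0.02),
                     log_p_lm=math.log(0.001),
                     num_words=5))
```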

The figure below compares the performance of this model with other models:

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

In the second iteration of Deep Speech, the authors use end-to-end deep learning to recognize both Mandarin and English speech. The proposed model can handle different languages and accents while remaining robust to noisy environments. Using high-performance computing (HPC) techniques, it achieves a 7x speedup over the previous generation. Within their data centers, the authors use GPUs to implement Batch Dispatch.

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

https://arxiv.org/abs/1512.02595v1

The English speech system is trained on 11,940 hours of audio, and the Mandarin system on 9,400 hours. During training, the authors use data synthesis to further augment the data.

The architecture uses as many as 11 layers, combining bidirectional recurrent layers with convolutional layers. The model computes roughly 8x faster than Deep Speech 1. The authors use Batch Normalization for optimization.

For the activation function, the authors use the clipped rectified linear unit (clipped ReLU). The architecture is otherwise similar in spirit to Deep Speech 1: a recurrent neural network trained to take speech spectrograms and output text transcriptions, again trained with the CTC loss function.
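To make the ingredients concrete, the sketch below combines one batch-normalized convolutional block with a bidirectional GRU layer and the clipped ReLU activation. Channel counts, kernel sizes, and the hidden size are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

# One batch-normalized convolutional block followed by a bidirectional GRU,
# with a clipped ReLU activation (ReLU capped at 20).
clipped_relu = nn.Hardtanh(min_val=0.0, max_val=20.0)

conv = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(11, 11), stride=(2, 2), padding=(5, 5)),
    nn.BatchNorm2d(32),
    clipped_relu,
)
birnn = nn.GRU(input_size=32 * 81, hidden_size=800,
               bidirectional=True, batch_first=True)

spec = torch.randn(4, 1, 161, 300)            # (batch, channel, freq, time)
x = conv(spec)                                # (4, 32, 81, 150)
x = x.permute(0, 3, 1, 2).flatten(2)          # (batch, time, channels*freq)
out, _ = birnn(x)                             # features for a CTC output layer
```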

The figure below compares word error rates for different arrangements of the convolutional layers.

The figure below compares the word error rates of Deep Speech 1 and Deep Speech 2; Deep Speech 2's word error rate is significantly lower.

The authors benchmark the system against two test sets built from "Wall Street Journal" news articles. The model achieves a better word error rate than humans on three of the four test cases. The system is also evaluated on the LibriSpeech corpus.

First-Pass Large Vocabulary Continuous Speech Recognition Using Bi-Directional Recurrent DNNs

The authors of this paper are from Stanford University. They propose using a neural network model to perform first-pass large-vocabulary speech recognition.

First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs

https://arxiv.org/abs/1408.2873v2

The neural network is trained with the connectionist temporal classification (CTC) loss function. With CTC, the authors were able to train a neural network that predicts character sequences on the "Wall Street Journal" LVCSR corpus, achieving a character error rate (CER) below 10%.
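For reference, the sketch below shows how a CTC loss over per-frame character log-probabilities can be computed with PyTorch's built-in implementation; the shapes, blank index, and random tensors are placeholders (the paper itself predates this particular API).

```python
import torch
import torch.nn as nn

# Computing a CTC loss over per-frame character log-probabilities.
# log_probs has shape (time, batch, num_characters); index 0 is the blank.
ctc = nn.CTCLoss(blank=0)

log_probs = torch.randn(100, 2, 29, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, 29, (2, 12))        # padded character indices
input_lengths = torch.tensor([100, 90])        # frames per utterance
target_lengths = torch.tensor([12, 9])         # characters per transcript

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                # gradients flow into the network
```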

They combine the CTC-trained neural network with an N-gram language model. The neural network architecture is a recurrent deep neural network (RDNN). A modified version of the rectifier nonlinearity is used, which clips large activations to prevent divergence during network training. The figure below shows the character error rate results obtained with the RDNN.

English Conversational Telephone Speech Recognition by Humans and Machines

The authors, from IBM Research, set out to verify whether current speech recognition technology can already match human performance. The paper also proposes a set of acoustic and language modeling techniques.

On the acoustic side, three models are involved: an LSTM with multiple feature inputs; an LSTM trained with speaker-adversarial multi-task learning; and a residual network (ResNet) with 25 convolutional layers.

The language models include a character-based LSTM and a convolutional WaveNet-style language model. The English conversational telephone LVCSR system achieves word error rates of 5.5% / 10.3% on the Switchboard / CallHome subsets (SWB / CH).

English Conversational Telephone Speech Recognition by Humans and Machines

https://arxiv.org/abs/1703.02136v1

The architecture used here includes 4 to 6 bidirectional LSTM layers with 1,024 cells per layer and direction, a linear bottleneck layer with 256 units, and an output layer with 32,000 units. Training consists of 14 passes of cross-entropy, followed by sequence training with stochastic gradient descent (SGD) under the boosted MMI (maximum mutual information) criterion.
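A schematic of that stack in PyTorch might look like the following; the number of layers is fixed at six here and the input feature dimension is a placeholder.

```python
import torch
import torch.nn as nn

# Schematic of the bidirectional LSTM acoustic model described above:
# stacked bi-LSTMs with 1,024 cells per direction, a 256-unit linear
# bottleneck, and a 32,000-unit output layer.
class BiLSTMAcousticModel(nn.Module):
    def __init__(self, n_features=140, n_layers=6):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 1024, num_layers=n_layers,
                            bidirectional=True, batch_first=True)
        self.bottleneck = nn.Linear(2 * 1024, 256)
        self.output = nn.Linear(256, 32000)

    def forward(self, features):                # (batch, time, n_features)
        x, _ = self.lstm(features)
        return self.output(self.bottleneck(x))  # per-frame output scores

model = BiLSTMAcousticModel()
scores = model(torch.randn(2, 50, 140))         # (2, 50, 32000)
```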

The sequence training is smoothed by adding the cross-entropy loss to the objective. The LSTMs are implemented in Torch with the cuDNN 5.0 backend. Cross-entropy training of each model is completed on a single NVIDIA K80 GPU and takes about two weeks, at roughly 700M samples per epoch.

For convolutional acoustic modeling, the authors train a residual network. The table below shows several ResNet architectures and their performance on the test data.

The figure below shows how the residual network is adapted for acoustic modeling. The network contains 12 residual units, 30 weight layers, and 67.1 million parameters, and is trained with Nesterov accelerated gradient, a learning rate of 0.03, and a momentum of 0.99. The CNN is also implemented in Torch with the cuDNN 5.0 backend. Cross-entropy training takes 80 days over 1.5 billion samples, using an NVIDIA K80 GPU with a batch size of 64 per GPU.
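The basic building block of such a residual acoustic model is the residual unit; a toy version with two batch-normalized convolutions and an identity skip connection is sketched below (channel counts and input shape are placeholders).

```python
import torch
import torch.nn as nn

# One residual unit: two 3x3 convolutions with batch normalization plus an
# identity skip connection.
class ResidualUnit(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)    # output = F(x) + x

unit = ResidualUnit()
out = unit(torch.randn(1, 64, 40, 100))       # (batch, channels, freq, time)
```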

The figure below shows the error rates of the LSTMs and ResNets:

The authors also try four LSTM language models: Word-LSTM, Char-LSTM, Word-LSTM-MTL, and Char-LSTM-MTL. The figure below shows the architectures of these four models.

Word-LSTM has a word-embedding layer, two LSTM layers, a fully connected layer, and a softmax layer. Char-LSTM adds an LSTM layer that estimates the word embeddings from character sequences. Both Word-LSTM and Char-LSTM use the cross-entropy loss to predict the next word. As the names suggest, Word-LSTM-MTL and Char-LSTM-MTL add a multi-task learning (MTL) mechanism.
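A minimal Word-LSTM along these lines, with a word embedding, two LSTM layers, a fully connected layer, and a softmax over the next word, could be sketched as follows (vocabulary size and dimensions are placeholders).

```python
import torch
import torch.nn as nn

# Word-LSTM sketch: word embedding, two LSTM layers, fully connected layer,
# and a softmax (via cross-entropy) over the next word.
class WordLSTM(nn.Module):
    def __init__(self, vocab_size=25000, embed_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, vocab_size)

    def forward(self, words):                  # (batch, seq) word indices
        x, _ = self.lstm(self.embed(words))
        return self.fc(x)                      # next-word logits per position

lm = WordLSTM()
tokens = torch.randint(0, 25000, (2, 10))      # toy word-index sequences
logits = lm(tokens[:, :-1])                    # predict each following word
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 25000),
                             tokens[:, 1:].reshape(-1))
```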

WordDCC consists of a word-embedding layer, causal convolution layers, dilated convolution layers, fully connected layers, a softmax layer, and residual connections.

wav2letter++: The Fastest Open-Source Speech Recognition System

The authors, from Facebook AI Research, propose wav2letter++, an open-source deep-learning speech recognition framework. The framework is written in C++ and uses the ArrayFire tensor library.

wav2letter++: The Fastest Open-source Speech Recognition System

https://arxiv.org/abs/1812.07625v1

ArrayFire was chosen because it can execute on multiple backends, including a CUDA GPU backend and a CPU backend, which significantly improves execution speed. Compared with other C++ tensor libraries, building arrays in ArrayFire is also relatively easy. The figure (left) in the paper shows how to build and train a single-layer MLP (multilayer perceptron) with a binary cross-entropy loss.
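The paper's example is written against the ArrayFire-based C++ API; as a rough, framework-agnostic analogue, the same idea — a one-hidden-layer MLP trained with binary cross-entropy — looks like this in PyTorch (sizes and data are toy placeholders).

```python
import torch
import torch.nn as nn

# Rough PyTorch analogue of the paper's figure: a small MLP trained with
# binary cross-entropy; the actual example uses the ArrayFire-based C++ API.
mlp = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(mlp.parameters(), lr=0.1)

x = torch.randn(64, 10)                         # toy inputs
y = (x.sum(dim=1, keepdim=True) > 0).float()    # toy binary labels

for _ in range(100):                            # simple training loop
    opt.zero_grad()
    loss = loss_fn(mlp(x), y)
    loss.backward()
    opt.step()
```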

The model is benchmarked on the "Wall Street Journal" (WSJ) dataset. Training time is evaluated for two types of neural network architecture: a recurrent network with 30 million parameters and a purely convolutional network with 100 million parameters. The figure below shows the word error rates on LibriSpeech.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

The authors, from Google Brain, present a simple data augmentation method for speech recognition called SpecAugment. The method operates directly on the log mel spectrogram of the input audio.

On the LibriSpeech test-other set, the authors achieve a 6.85% WER (word error rate) without using a language model, improving to 5.8% WER with a language model. On Switchboard, they achieve 7.2% / 14.6% word error rates on the Switchboard / CallHome subsets.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

https://arxiv.org/abs/1904.08779v2

Using this method, the authors train an end-to-end ASR (automatic speech recognition) network called Listen, Attend and Spell (LAS). The data augmentation strategies used include time warping, frequency masking, and time masking.
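The masking operations are simple to write down. The sketch below applies frequency and time masking to a log mel spectrogram (time warping is omitted, and the mask-size parameters are placeholders rather than one of the paper's named policies).

```python
import numpy as np

def spec_augment(log_mel, F=27, T=100, num_freq_masks=2, num_time_masks=2):
    """Apply frequency and time masking to a (num_mel, num_frames) log mel
    spectrogram. F and T bound the mask widths (placeholder values)."""
    spec = log_mel.copy()
    num_mel, num_frames = spec.shape
    for _ in range(num_freq_masks):
        f = np.random.randint(0, F + 1)                  # mask width in mel bins
        f0 = np.random.randint(0, max(1, num_mel - f))   # mask start
        spec[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):
        t = np.random.randint(0, T + 1)                  # mask width in frames
        t0 = np.random.randint(0, max(1, num_frames - t))
        spec[:, t0:t0 + t] = 0.0
    return spec

augmented = spec_augment(np.random.randn(80, 300))
```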

Inside the LAS network, the input spectrogram is passed to a two-layer convolutional neural network (CNN) with a stride of 2. The CNN output is then passed through an encoder consisting of d stacked bidirectional LSTMs with cell size w, producing a series of attention vectors.

Each attention vector is fed to a two-layer RNN decoder of dimension w, which outputs the transcription tokens. The text is tokenized with a 16k word-piece model for the LibriSpeech corpus and a 1k word-piece model for Switchboard. The final transcriptions are obtained by beam search with a beam size of 8.
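As a reminder of what an attention vector is doing here, the sketch below computes one step of dot-product attention between a decoder state and the encoder outputs; the exact scoring function in LAS may differ, so treat this as a generic illustration.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_outputs):
    """One step of dot-product attention.
    decoder_state:   (batch, w)        current decoder hidden state
    encoder_outputs: (batch, time, w)  encoder vectors
    """
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)                                   # (batch, time)
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)  # (batch, w)
    return context, weights

enc = torch.randn(2, 120, 256)
dec = torch.randn(2, 256)
context, weights = attend(dec, enc)   # context: (2, 256), weights: (2, 120)
```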

The figure below shows the word error rate performance of LAS + SpecAugment.

wav2vec: Unsupervised Pre-Training for Speech Recognition

The authors, from Facebook AI Research, explore unsupervised pre-training for speech recognition by learning representations of raw audio. The result, wav2vec, is trained on large amounts of unlabeled audio data.

The resulting representations are then used to improve acoustic model training. wav2vec is pre-trained and optimized with a noise-contrastive binary classification task using a simple multi-layer convolutional neural network, and reaches a 2.43% WER on the nov92 test set.

wav2vec: Unsupervised Pre-training for Speech Recognition

https://arxiv.org/abs/1904.05862v3

The pre-training approach optimizes the model to predict future samples from a single context. The model takes a raw audio signal as input and then applies an encoder network and a context network.

The encoder network first embeds the audio signal in a latent space, and the context network combines multiple time steps of the encoder output to obtain contextualized representations. The objective function is then computed from the outputs of the two networks.

The layers of the encoder and context networks consist of causal convolutions with 512 channels, a group normalization layer, and a ReLU nonlinearity. During training, the representations produced by the context network are fed to the acoustic model. Acoustic model training and evaluation are done with the wav2letter++ toolkit. For decoding, the authors use a lexicon together with a separate language model trained on the WSJ language modeling dataset.
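A rough sketch of such an encoder plus context network is shown below; the layer counts, kernel sizes, and strides are illustrative rather than the paper's exact values, and causal padding is omitted for brevity.

```python
import torch
import torch.nn as nn

# wav2vec-style sketch: a convolutional encoder maps raw audio to latent
# vectors z, and a convolutional context network maps z to contextualized
# representations c that are later fed to the acoustic model.
def conv_block(in_ch, out_ch, kernel, stride):
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, stride=stride),
        nn.GroupNorm(1, out_ch),   # group normalization, as described above
        nn.ReLU(),
    )

encoder = nn.Sequential(           # raw waveform -> latent representation z
    conv_block(1, 512, kernel=10, stride=5),
    conv_block(512, 512, kernel=8, stride=4),
    conv_block(512, 512, kernel=4, stride=2),
)
context = nn.Sequential(           # z -> contextualized representation c
    conv_block(512, 512, kernel=3, stride=1),
    conv_block(512, 512, kernel=3, stride=1),
)

wav = torch.randn(1, 1, 16000)     # one second of 16 kHz audio
z = encoder(wav)                   # (1, 512, T')
c = context(z)                     # fed to the acoustic model after pre-training
```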

The figure below compares the word error rate of this model with other speech recognition models.

Scalable Multi-Corpora Neural Language Models for ASR

In this article, authors from Amazon Alexa present solutions to the challenges that arise when using a neural language model (NLM) in a large-scale ASR system.

Scalable Multi Corpora Neural Language Models for ASR

https://arxiv.org/abs/1907.01677

The challenges the authors attempt to solve include:

 

  • NLM training on multiple heterogeneous corpora

  • Creating a personalized neural language model (NLM) by passing contact names from the first-pass model's classes into the NLM

  • Incorporating the NLM into the ASR system while controlling the impact on latency

To learn from heterogeneous corpora, the authors use a variant of stochastic gradient descent to estimate the neural network parameters. For this approach to succeed, each mini-batch must be an independent and identically distributed (IID) sample from the overall data distribution. Mini-batches are therefore constructed by randomly sampling from each corpus according to its relevance. The relevance weights are obtained by building an N-gram model for each data source and optimizing the linear interpolation weights on a development set.
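A minimal sketch of that relevance-weighted sampling, with placeholder corpora and weights (in the paper the weights come from the N-gram interpolation optimization), could look like this:

```python
import random

# Relevance-weighted mini-batch sampling across corpora (toy placeholders).
corpora = {
    "corpus_a": ["sentence one", "sentence two"],
    "corpus_b": ["another sentence", "yet another one"],
}
relevance = {"corpus_a": 0.7, "corpus_b": 0.3}   # interpolation weights (placeholder)

def sample_minibatch(batch_size=4):
    names = list(corpora)
    weights = [relevance[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=weights, k=1)[0]  # pick a corpus
        batch.append(random.choice(corpora[source]))             # pick a sentence
    return batch

print(sample_minibatch())
```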

To build a first-pass LM informed by the NLM, the system samples a large synthetic text corpus from the NLM and estimates an N-gram model on it, yielding an N-gram approximation of the NLM. The authors also use a sub-word NLM to generate the synthetic data, so that the resulting corpus is not limited to the vocabulary of the current ASR system. The written-text corpora used for the model contain more than 50 billion words in total. The NLM architecture consists of two long short-term memory projection recurrent neural network (LSTMP) layers, each with 1,024 hidden units projected down to 512 dimensions, with residual connections between the layers.
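A rough sketch of that recurrent stack is given below. PyTorch's proj_size argument provides the 1,024-to-512 projection; the embedding and output layers and the exact wiring of the residual connection are assumptions, while the layer sizes follow the description above.

```python
import torch
import torch.nn as nn

# NLM sketch: two LSTMP layers (1,024 hidden units projected to 512) with a
# residual connection between them, plus embedding and output layers.
class LSTMPLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden=1024, proj=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTM(embed_dim, hidden, proj_size=proj, batch_first=True)
        self.lstm2 = nn.LSTM(proj, hidden, proj_size=proj, batch_first=True)
        self.out = nn.Linear(proj, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        h1, _ = self.lstm1(x)
        h2, _ = self.lstm2(h1)
        return self.out(h1 + h2)   # residual connection between the layers

model = LSTMPLanguageModel(vocab_size=50000)
logits = model(torch.randint(0, 50000, (2, 16)))   # (batch, seq, vocab)
```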

The figure below shows the results for this model, including a 1.6% relative WER improvement obtained from the synthetic data generated by the NLM.

Summary

In this article, we have reviewed the most recent and widely used automatic speech recognition techniques across a variety of settings.

The papers and abstracts mentioned above also contain links to their code implementations; we look forward to seeing the results of your own experiments.

Original link:

https://heartbeat.fritz.ai/a-2019-guide-for-automatic-speech-recognition-f1e1129a141c 
