Deep Neural Networks for Chinese Speech Recognition

1. Background

Speech is the most natural form of human interaction. Ever since the invention of the computer, making the machine "understand" human speech, grasp the meaning of the language, and respond correctly has been an important goal. This process relies on three major technologies: automatic speech recognition (ASR), natural language processing (NLP) and speech synthesis (SS). The purpose of speech recognition technology is to let machines understand human speech; it is a typical cross-disciplinary task.

2. Overview

A speech recognition system consists of two model components: an acoustic model and a language model. The acoustic model computes the probability of mapping speech to phonemes, and the language model computes the probability of mapping phonemes to characters.

A continuous speech recognition system basically consists of four parts: feature extraction, the acoustic model, the language model, and decoding. The overall procedure is to first extract acoustic features from the speech data, then train a statistical acoustic model on those features to serve as the recognition template, and finally combine it with the language model in a decoding step to obtain the recognition result.

The acoustic model is a key part of a speech recognition system; its role is to describe how the modelling units generate the acoustic feature sequence, i.e. to classify the speech signal. With it we can compute the probability that a segment of acoustic feature vectors belongs to each acoustic unit, and then, following a maximum-likelihood criterion, map the feature sequence to a state sequence.
This article works with the THCHS30 Chinese speech data set released by Tsinghua University.
A detailed code tutorial for this Chinese speech recognition project accompanies the article (see the project source at the end).

2.1 Feature Extraction

The raw audio cannot be fed to the neural network directly for training, so the first step is to extract features from the audio data. Common feature extraction methods are based on the human vocal mechanism and on auditory perception, understanding the nature of sound from the way it is produced to the way it is heard.
Some common acoustic features are as follows:

(1) Linear prediction coefficients (LPC). Linear prediction analysis models the principle of human speech production: the vocal tract is analysed as a cascade of short tubes, and the transfer function of the system is assumed to be well approximated by an all-pole digital filter; typically 12 to 16 poles are enough to describe the characteristics of a speech signal. For the speech signal at time n, we can therefore approximate the sample by a linear combination of the preceding samples. The LPC coefficients are obtained by minimising the mean squared error (MSE) between the actual samples and the linearly predicted samples.
(2) Perceptual linear prediction (PLP). PLP is a feature parameter based on an auditory model. The parameters are a form equivalent to LPC features, namely a set of coefficients of an all-pole model prediction polynomial. The difference is that PLP works on the spectrum of the input speech signal after it has been processed by a model of human hearing, rather than on the time-domain signal used by LPC, which makes the extracted features more robust to noise.

(3) Mel-frequency cepstral coefficients (MFCC). MFCCs are also based on the hearing characteristics of the human ear: the frequency bands of the Mel-frequency cepstrum are divided equally on the Mel scale, and the relation between Mel-scale frequency and actual frequency matches the response of the human ear more closely, so the speech signal gets a better representation. (A short librosa sketch after this list illustrates how MFCC and Fbank features can be computed in practice.)

(4) Filter-bank features (Fbank). Fbank feature extraction amounts to MFCC extraction with the final discrete cosine transform removed; compared with MFCC, Fbank features retain more of the original speech information.

(5) Spectrogram. The spectrogram is generally obtained by processing the received time-domain signal, so all that is required is a time-domain signal of sufficient length. From the spectrogram one can observe how the energy of the speech signal in different frequency bands changes over time.
This article uses the spectrogram as the feature input and trains a CNN on it the way an image is processed. A spectrogram can be understood as a stack of spectra over a period of time, so extracting it mainly involves three steps: framing, windowing, and the fast Fourier transform (FFT).
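As an aside, MFCC and Fbank features can be computed conveniently with librosa. librosa is not used in the rest of this article, so the snippet below is only an illustration, assuming a 16 kHz wav file such as the test.wav used later, a 25 ms window (400 samples) and a 10 ms hop (160 samples):

import librosa
import numpy as np

y, sr = librosa.load('test.wav', sr=16000)   # load and resample to 16 kHz
# 13 MFCCs per frame, 25 ms window (n_fft=400), 10 ms hop (hop_length=160)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
# Fbank: log-Mel spectrogram, i.e. MFCC without the final DCT step
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = np.log(mel + 1e-6)
print(mfcc.shape, fbank.shape)   # (13, num_frames) and (40, num_frames)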

2.1.1 Reading the audio

First, we use the scipy audio module to read the wav file: fs is the sampling rate and wavsignal is the speech data. The sampling rate of our data set is 16 kHz.

import scipy.io.wavfile as wav
filepath = 'test.wav'

fs, wavsignal = wav.read(filepath)
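A quick sanity check; the commented values are what we expect for THCHS30 recordings, not guaranteed for every file:

print(fs)                          # 16000
print(wavsignal.shape)             # (number of samples,)
print(wavsignal.shape[0] / fs)     # duration in seconds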

2.1.2 Framing and windowing

A speech signal is not stationary macroscopically but is stationary microscopically: it has short-time stationarity (within 10-30 ms the signal can be regarded as approximately unchanged, roughly the pronunciation of one phoneme), so a frame length of 25 ms is usually used.
To process the speech signal we window it, i.e. we only process the data inside one window at a time. A real speech signal is very long, and we cannot, and do not need to, process it all at once; the sensible approach is to take one segment for analysis, then move on to the next segment. Our windowing operation is a Hamming window: a frame of data is multiplied by a window function to obtain a new frame of data. The formula is given below.
Why can we take just one piece of data at a time? Because the windowed frame then goes through a fast Fourier transform (FFT), which assumes that the signal inside the window represents one period of a periodic signal (i.e. the left and right ends of the window should join up roughly continuously). A short piece of audio usually has no obvious periodicity, but after multiplication by the Hamming window the data becomes much closer to a periodic function.
Because the Hamming window mainly keeps the data in the middle and attenuates the data at both ends, the window is moved with overlap: with a 25 ms window, a 10 ms step is typically used (and the coefficient a in the formula below is usually 0.46).
The formula is:
W(n, a) = (1 - a) - a · cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1

code:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 400 - 1, 400, dtype=np.int64)  # evenly spaced integers over the interval
w = 0.54 - 0.46 * np.cos(2 * np.pi * (x) / (400 - 1))

time_window = 25
window_length = fs // 1000 * time_window

# framing: take one 25 ms frame
p_begin = 0
p_end = p_begin + window_length
frame = wavsignal[p_begin:p_end]

plt.figure(figsize=(15, 5))
ax4 = plt.subplot(121)
ax4.set_title('the original picture of one frame')
ax4.plot(frame)

# windowing

frame = frame * w
ax5 = plt.subplot(122)
ax5.set_title('after hamming')
ax5.plot(frame)
plt.show()

[Figure: one frame of the original signal (left) and the same frame after the Hamming window (right)]
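The code above windows a single frame. Below is a minimal sketch of slicing the whole utterance into overlapping frames (400-sample window, 160-sample hop at 16 kHz), reusing wavsignal and the window w from above; the variable names are illustrative:

window_length = 400   # 25 ms at 16 kHz
hop_length = 160      # 10 ms step
num_frames = (len(wavsignal) - window_length) // hop_length + 1
frames = []
for i in range(num_frames):
    start = i * hop_length
    frames.append(wavsignal[start:start + window_length] * w)  # apply the Hamming window
frames = np.array(frames)   # shape: (num_frames, 400)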

2.1.3 Fast Fourier Transform (FFT)

The characteristics of a speech signal are hard to see in the time domain, so it is usually converted into an energy distribution over frequency. We therefore apply a fast Fourier transform to every windowed frame to turn the time-domain signal into the spectrum of that frame, and the spectrogram is then obtained by stacking the spectra of successive windows.

Code:

from scipy.fftpack import fft

# fast Fourier transform; keep the first 200 of the 400 points
frame_fft = np.abs(fft(frame))[:200]
plt.plot(frame_fft)
plt.show()

# take the logarithm to get a dB-like scale
frame_log = np.log(frame_fft)
plt.plot(frame_log)
plt.show()
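Putting the reading, framing, windowing and FFT steps together, here is a minimal end-to-end sketch of a spectrogram extractor of the kind used as CNN input below, reusing the imports above (scipy.io.wavfile as wav, numpy as np, scipy.fftpack.fft); the function name and the small epsilon added before the logarithm are our own choices, not the project's exact code:

def compute_spectrogram(filepath, time_window=25, hop_ms=10):
    """Log spectrogram sketch: 25 ms Hamming-windowed frames, 10 ms hop, first 200 FFT bins."""
    fs, signal = wav.read(filepath)
    window_length = fs // 1000 * time_window      # 400 samples at 16 kHz
    hop = fs // 1000 * hop_ms                     # 160 samples
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(window_length) / (window_length - 1))
    spectra = []
    for start in range(0, len(signal) - window_length, hop):
        frame = signal[start:start + window_length] * w
        spectrum = np.abs(fft(frame))[:window_length // 2]   # keep the first 200 bins
        spectra.append(np.log(spectrum + 1e-10))              # epsilon avoids log(0)
    return np.array(spectra)                                   # shape: (num_frames, 200)

spec = compute_spectrogram('test.wav')
print(spec.shape)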

2.2 CTC(Connectionist Temporal Classification)

Turning to speech recognition: suppose we have a set of audio clips and their corresponding transcriptions, but we do not know how the characters of the transcription align with the phonemes in the audio. This greatly increases the difficulty of training a speech recognizer, because without aligned data some simple training methods cannot be used. A first option is to hand-craft a rule, such as "one character corresponds to ten input phonemes", but speaking rate varies from person to person, so such rules are fragile. A second option, to make the model reliable, is to manually align each character with its position in the audio; training then works better, because we know the ground truth at every time step, but the drawback is obvious: even for a data set of modest size this is extremely time-consuming. In fact, hand-crafted rules with poor accuracy and overly long manual alignment are problems not only in speech recognition; other tasks such as handwriting recognition or adding action labels to video face them too.
This is exactly the scenario where CTC comes in. CTC is a way of letting the network learn the alignment automatically, which makes it very suitable for speech recognition and handwriting recognition. To describe it a bit more concretely, we can write the input sequence (the audio) as X = [x1, x2, ..., xT] and the corresponding output sequence (the transcription) as Y = [y1, y2, ..., yU]. Aligning characters with phonemes is then equivalent to establishing an accurate mapping between X and Y; for the details see the classic CTC article.
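To make the idea concrete, CTC lets the network output one label (or a blank) per time step and then collapses the path by merging repeated labels and removing blanks. A toy illustration; the blank symbol '-' and the path below are made up, not taken from the project:

def ctc_collapse(path, blank='-'):
    """Collapse a CTC path: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# ['n', 'n', '-', 'i', 'i', '-', '-', 'h', 'a', 'o']  ->  ['n', 'i', 'h', 'a', 'o']
print(ctc_collapse(['n', 'n', '-', 'i', 'i', '-', '-', 'h', 'a', 'o']))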
The loss function part of the code (K is the Keras backend, keras.backend):

def ctc_lambda(args):
    labels, y_pred, input_length, label_length = args
    y_pred = y_pred[:, :, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

The decoding part of the code:

# num_result is the model's prediction; num2word is the corresponding pinyin list.
def decode_ctc(num_result, num2word):
	result = num_result[:, :, :]
	in_len = np.zeros((1), dtype = np.int32)
	in_len[0] = result.shape[1];
	r = K.ctc_decode(result, in_len, greedy = True, beam_width=10, top_paths=1)
	r1 = K.get_value(r[0][0])
	r1 = r1[0]
	text = []
	for i in r1:
		text.append(num2word[i])
	return r1, text
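An illustrative call of decode_ctc on one test utterance; am, x and pinyin_vocab are placeholder names (a trained Amodel instance, a spectrogram batch of shape (1, num_frames, 200, 1), and the pinyin list), not variables defined in this article:

result = am.model.predict(x)                 # (1, time_steps, vocab_size) softmax output
_, text = decode_ctc(result, pinyin_vocab)   # greedy CTC decoding to a pinyin sequence
print(' '.join(text))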

3. Acoustic Model

The model mainly uses a CNN to process the spectrogram as an image and extract features, with max pooling, and is trained with the CTC loss defined above. As long as the inputs and labels are available, the model structure can be designed freely; anything that improves accuracy is welcome. You can also add LSTM layers or similar to the network; there is plenty of material online about convolution and pooling, so we do not repeat it here. Interested readers can refer to the classic convolutional neural network AlexNet.
Code:

# Keras imports assumed by the code below; cnn_cell and dense are helper
# functions not defined in this article (a sketch follows the class).
from keras.layers import Input, Reshape, Lambda
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K

class Amodel():
    """docstring for Amodel."""
    def __init__(self, vocab_size):
        super(Amodel, self).__init__()
        self.vocab_size = vocab_size
        self._model_init()
        self._ctc_init()
        self.opt_init()

    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        self.h1 = cnn_cell(32, self.inputs)
        self.h2 = cnn_cell(64, self.h1)
        self.h3 = cnn_cell(128, self.h2)
        self.h4 = cnn_cell(128, self.h3, pool=False)
        # 200 / 8 * 128 = 3200
        self.h6 = Reshape((-1, 3200))(self.h4)
        self.h7 = dense(256)(self.h6)
        self.outputs = dense(self.vocab_size, activation='softmax')(self.h7)
        self.model = Model(inputs=self.inputs, outputs=self.outputs)

    def _ctc_init(self):
        self.labels = Input(name='the_labels', shape=[None], dtype='float32')
        self.input_length = Input(name='input_length', shape=[1], dtype='int64')
        self.label_length = Input(name='label_length', shape=[1], dtype='int64')
        self.loss_out = Lambda(ctc_lambda, output_shape=(1,), name='ctc')\
            ([self.labels, self.outputs, self.input_length, self.label_length])
        self.ctc_model = Model(inputs=[self.labels, self.inputs,
            self.input_length, self.label_length], outputs=self.loss_out)

    def opt_init(self):
        opt = Adam(lr = 0.0008, beta_1 = 0.9, beta_2 = 0.999, decay = 0.01, epsilon = 10e-8)
        self.ctc_model.compile(loss={'ctc': lambda y_true, output: output}, optimizer=opt)
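The cnn_cell and dense helpers used above are not shown in this article. Below is a minimal sketch of what they could look like, assuming two 3x3 convolution + batch-normalization layers per cell with an optional 2x2 max pooling (consistent with the "# 200 / 8 * 128 = 3200" comment: three poolings reduce the 200 frequency bins to 25); the real project code may differ:

from keras.layers import Conv2D, BatchNormalization, MaxPooling2D, Dense

def conv2d(filters):
    return Conv2D(filters, (3, 3), padding='same', activation='relu',
                  kernel_initializer='he_normal')

def cnn_cell(filters, x, pool=True):
    # two 3x3 convolutions, each followed by batch normalization,
    # then an optional 2x2 max pooling that halves time and frequency
    x = BatchNormalization()(conv2d(filters)(x))
    x = BatchNormalization()(conv2d(filters)(x))
    if pool:
        x = MaxPooling2D(pool_size=(2, 2))(x)
    return x

def dense(units, activation='relu'):
    return Dense(units, activation=activation, kernel_initializer='he_normal')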


4. Language Model

4.1 Introduction to statistical language models

The statistical language model is a foundation of natural language processing. It is a mathematical model with context-dependent properties and is probabilistic in nature, and it is widely used in machine translation, speech recognition, pinyin input, optical character recognition, spelling correction, typo detection and search engines. In many tasks the computer needs to know whether a sequence of words forms a sentence that people can understand, contains no mistakes and is meaningful, for example the following sentences:

许多人可能不太清楚到底机器学习是什么,而它事实上已经成为我们日常生活中不可或缺的重要组成部分。
不太清楚许多人可能机器学习是什么到底,而它成为已经日常我们生活中组成部分不可或缺的重要。
不清太多人机可楚器学许能习是么到什底,而已常我它成经日为们组生中成活部不重可的或缺分要。

The first sentence ("Many people may not know exactly what machine learning is, yet it has in fact become an indispensable part of our daily life") is grammatical and its meaning is clear; the second has its word order scrambled but the meaning can still be made out; the third scrambles even the characters and its meaning is blurred. That is the rule-based way of looking at it, and before the 1970s scientists thought about the problem this way. Later, Frederick Jelinek used a simple statistical model to solve it. From a statistical point of view, the probability of the first sentence is large, the second much smaller, and the third smallest of all. Under such a model the first sentence is about 10^20 times more probable than the second, let alone the third, so the first sentence is by far the most sensible.

4.2 Modeling

Assume the sentence S is made up of a sequence of words w1, w2, ..., wn. The probability that the sentence S occurs is then:
P(S) = P(w1, w2, ..., wn) = P(w1) · P(w2|w1) · P(w3|w1, w2) ... P(wn|w1, w2, ..., wn-1)
Due to the limits of computer memory and computing power, we clearly need a more reasonable way to compute this. In general, considering only the previous word already gives fairly good accuracy; in practice the previous two words are usually enough, and only rarely does one need the previous three. We can therefore use the following formula:
P(S) = P(w1, w2, ..., wn) = P(w1) · P(w2|w1) · P(w3|w2) ... P(wn|wn-1)
The probabilities P can then be estimated from word-frequency statistics gathered from a crawled corpus.
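A minimal sketch of estimating the bigram probabilities P(wi | wi-1) from counts; the toy corpus and the add-one smoothing are illustrative choices, not the project's actual language model:

from collections import Counter

def bigram_model(sentences):
    """Estimate P(w_i | w_{i-1}) by counting over tokenised sentences."""
    unigram = Counter()
    bigram = Counter()
    for words in sentences:
        unigram.update(words)
        bigram.update(zip(words[:-1], words[1:]))
    def prob(prev, word):
        # add-one smoothing so unseen pairs do not get zero probability
        return (bigram[(prev, word)] + 1) / (unigram[prev] + len(unigram))
    return prob

corpus = [['我', '爱', '机器', '学习'], ['我', '爱', '自然', '语言']]   # toy corpus
p = bigram_model(corpus)
print(p('我', '爱'))    # seen bigram: relatively high
print(p('爱', '到底'))   # unseen bigram: low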

4.3 From pinyin to text

Converting pinyin to Chinese characters is a dynamic programming problem, essentially the same as finding a shortest path. We can treat Chinese input as a communication problem: each pinyin syllable can correspond to several characters, and here each character is read with only one pronunciation, so connecting the candidate characters of each syllable from left to right yields a directed graph.
[Figure: directed graph of candidate characters for each pinyin syllable]
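A minimal sketch of the Viterbi-style dynamic programming search over this graph; the candidate dictionary, the toy bigram probabilities and all names are illustrative, not the project's actual decoder:

import math

def pinyin_to_text(pinyins, candidates, bigram_prob):
    """Pick the character sequence with the highest bigram probability (lowest cost)."""
    # best[c] = (accumulated negative log probability of the best path ending in c, path)
    best = {c: (-math.log(1.0 / len(candidates[pinyins[0]])), [c])
            for c in candidates[pinyins[0]]}
    for py in pinyins[1:]:
        new_best = {}
        for c in candidates[py]:
            # choose the predecessor that minimises the accumulated cost
            prev, (cost, path) = min(best.items(),
                                     key=lambda kv: kv[1][0] - math.log(bigram_prob(kv[0], c)))
            new_best[c] = (cost - math.log(bigram_prob(prev, c)), path + [c])
        best = new_best
    return min(best.values(), key=lambda v: v[0])[1]

candidates = {'ji1': ['机', '鸡'], 'qi4': ['器', '气'], 'xue2': ['学'], 'xi2': ['习']}

def toy_bigram_prob(prev, word):
    # illustrative probabilities; a real system estimates these from a corpus
    table = {('机', '器'): 0.6, ('器', '学'): 0.5, ('学', '习'): 0.7}
    return table.get((prev, word), 0.01)

print(''.join(pinyin_to_text(['ji1', 'qi4', 'xue2', 'xi2'], candidates, toy_bigram_prob)))  # 机器学习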

5. Model Test

Acoustic model test:
[Figure: acoustic model prediction for a sample utterance]
Language model test:
[Figure: language model output for the predicted pinyin sequence]
Because the model is simple and the data set is small, the results are not very good.
Project Source Address: https://momodel.cn/explore/5d5b589e1afd9472fe093a9e?type=app

6. References

Paper: Speech Recognition Technology Progress and Prospects
Blog: ASRT_SpeechRecognition
Blog: DeepSpeechRecognition

About us

Mo (website: momodel.cn) is an online AI modelling platform that supports Python and helps you quickly develop, train and deploy models.


The Mo AI Club is initiated by the site's R&D and product design team and is committed to lowering the threshold for developing and using artificial intelligence. The team has experience in big-data processing and analysis, visualization and data modelling, has undertaken intelligence projects across multiple domains, and has full-stack design and development capability from the back end to the front end. Its main research directions are big-data management and analysis and artificial intelligence technology, which it uses to promote data-driven scientific research.

The club currently holds an offline machine-learning-themed salon in Hangzhou every week, and shares articles and holds academic exchanges from time to time. We hope to bring together friends from all walks of life who are interested in artificial intelligence, keep communicating and growing together, and promote the democratization and wider application of artificial intelligence.
