[Speech recognition] GMM speech recognition based on MFCC [including Matlab source code, issue 535]

1. Introduction

MFCC (Mel-frequency cepstral coefficients): the Mel frequency scale is based on the characteristics of human hearing and has a non-linear relationship with frequency in Hz. MFCC exploits this relationship to compute features from the Hz spectrum, and is mainly used for feature extraction from speech data and for reducing the dimensionality of the computation. For example, from a frame of 512 samples, MFCC can extract the most important 40 dimensions (a typical choice), achieving dimensionality reduction at the same time.
Computing MFCC generally involves the following steps: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel filter bank, and discrete cosine transform (DCT). The most important are the FFT and Mel filtering steps, which carry out the main dimensionality-reduction operations.
1. Pre-emphasis
Pass the sampled digital speech signal s(n) through a high-pass filter

$$H(z) = 1 - a z^{-1}$$

where a generally takes a value of about 0.95. The signal after pre-emphasis is

$$s'(n) = s(n) - a\, s(n-1)$$

The purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter, and the spectrum can then be computed with the same signal-to-noise ratio over the whole band from low to high frequencies. It also removes the effects of the vocal cords and lips during speech production, compensating for the high-frequency part of the speech signal that is suppressed by the articulation system and highlighting the high-frequency formants.
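As a minimal sketch of this step (variable names are illustrative and not taken from the source code below), pre-emphasis is a one-line FIR filter in Matlab:

% Pre-emphasis sketch: s'(n) = s(n) - a*s(n-1), i.e. H(z) = 1 - a*z^(-1)
a = 0.95;                        % pre-emphasis coefficient
s = randn(16000, 1);             % stand-in for one second of speech at 16 kHz
y = filter([1 -a], 1, s);        % high-pass filtered (pre-emphasized) signal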

2. Framing
To make speech analysis tractable, the signal is divided into short segments called frames. First, N consecutive samples are grouped into one observation unit, called a frame. Typically N is 256 or 512, covering about 20-30 ms. To avoid overly large changes between two adjacent frames, neighbouring frames overlap by M samples, where M is usually about 1/2 or 1/3 of N. The sampling rate used for speech recognition is generally 8 kHz or 16 kHz. At 8 kHz, a frame of 256 samples corresponds to a duration of 256/8000 × 1000 = 32 ms.
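Continuing the sketch above (y is the pre-emphasized signal; names are illustrative), framing with a half-frame overlap can be written as:

% Framing sketch: N-sample frames, M-sample overlap, hop of N - M samples
N   = 256;                               % frame length (16 ms at 16 kHz)
M   = 128;                               % overlap, here 1/2 of N
hop = N - M;                             % frame shift
numFrames = floor((length(y) - N)/hop) + 1;
frames = zeros(N, numFrames);
for k = 1:numFrames
    frames(:, k) = y((k-1)*hop + (1:N)); % one frame per column
end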

3. Windowing
Speech changes continuously over long stretches and cannot be processed without a fixed characteristic, so each frame is passed through a window function, with values outside the window set to 0; the purpose is to remove the signal discontinuities that would otherwise appear at both ends of each frame. Commonly used windows include the rectangular window, the Hamming window and the Hanning window; because of its frequency-domain characteristics, the Hamming window is used most often.

Multiplying each frame by a Hamming window increases the continuity between the left and right ends of the frame. Assuming the framed signal is S(n), n = 0, 1, ..., N-1, with N the frame size, the Hamming window W(n) has the form

$$W(n) = (1 - a) - a \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

Different values of a produce different Hamming windows; generally a is taken as 0.46.
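A short continuation of the sketch (assuming the frames matrix from the framing sketch above): each column is multiplied by the same Hamming window.

% Windowing sketch: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1
n = (0:N-1)';
w = 0.54 - 0.46*cos(2*pi*n/(N-1));       % equivalent to Matlab's hamming(N)
framesWin = frames .* repmat(w, 1, numFrames);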

4. Fast Fourier transform
Since the characteristics of a signal are usually hard to see in the time domain, the signal is normally converted to an energy distribution in the frequency domain for observation: different energy distributions represent the characteristics of different sounds. Therefore, after multiplication by the Hamming window, each frame undergoes a fast Fourier transform to obtain its spectrum, and the squared modulus of the spectrum gives the power spectrum of the speech signal. The DFT of the speech signal is

$$X_a(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1$$

where x(n) is the input speech signal and N is the number of Fourier transform points.

Here we first need to introduce the Nyquist frequency. The Nyquist frequency is half the sampling frequency of a discrete signal system, named after Harry Nyquist and the Nyquist-Shannon sampling theorem. The sampling theorem states that aliasing can be avoided as long as the Nyquist frequency is higher than the highest frequency (or bandwidth) of the sampled signal. Speech systems usually use a 16 kHz sampling rate, and human speech lies roughly between 300 Hz and 3400 Hz; by definition the Nyquist frequency is then 8 kHz, which is above the highest frequency of human speech and thus satisfies the condition of the sampling theorem. Accordingly, the FFT output is truncated at half the sampling rate. Concretely, if a frame has 512 samples and the number of Fourier transform points is also 512, then 257 (N/2 + 1) points are kept after the FFT, representing the frequency components of the N/2 + 1 points from 0 Hz to fs/2 Hz. In other words, the FFT step not only moves the signal from the time domain to the frequency domain and discards components above the highest representable frequency, it also reduces the dimensionality.
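Continuing the sketch (framesWin from above; the FFT size NFFT is an illustrative choice), the one-sided power spectrum is obtained as follows:

% Power spectrum sketch: keep bins 1..NFFT/2+1 (0 Hz up to fs/2)
NFFT = 512;                              % FFT size (256-sample frames are zero-padded)
X = fft(framesWin, NFFT);                % column-wise FFT of every frame
X = X(1:NFFT/2 + 1, :);                  % 257 one-sided frequency bins
P = abs(X).^2;                           % power spectrum |X(k)|^2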

5. Mel filter bank
Because the human ear's sensitivity to different frequencies is non-linear, the spectrum is divided among multiple Mel filters according to that sensitivity. Within the Mel scale the centre frequencies of the filters are linearly spaced at equal intervals, but on the Hz axis the spacing is not equal. This follows from the conversion formula between frequency and Mel frequency:

$$\mathrm{mel}(f) = 2595 \,\lg\!\left(1 + \frac{f}{700}\right)$$

The log in the formula is base 10, i.e. lg.
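The conversion and its inverse can be kept as two anonymous functions (a small sketch; the function names are illustrative):

% Mel <-> Hz conversion sketch (lg = log10)
hz2mel = @(f) 2595 * log10(1 + f/700);
mel2hz = @(m) 700 * (10.^(m/2595) - 1);
hz2mel(8000)                             % approx. 2840 mel, as in the example below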

Pass the energy spectrum through a bank of M Mel-scale triangular filters (the number of filters is comparable to the number of critical bands), with centre frequencies f(m), m = 1, 2, ..., M, where M usually takes 22-26. The spacing between adjacent f(m) shrinks as m decreases and widens as m increases. The frequency response of the m-th triangular filter is

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

Here k is the index of a point after the FFT (0 to 256 in the previous example, i.e. 257 points), and f(m) likewise denotes a point index. The concrete procedure is as follows:

1. Determine the lowest frequency (usually 0 Hz) and the highest frequency (usually half the sampling rate) of the speech signal, and the number of Mel filters.

2. Compute the Mel frequencies corresponding to the lowest and highest frequencies.

3. Compute the spacing between the centre frequencies of adjacent Mel filters: (highest Mel frequency - lowest Mel frequency) / (number of filters + 1).

4. Convert each centre Mel frequency back to an actual frequency in Hz.

5. Map each centre frequency to the index of the corresponding FFT point.

For example, suppose the sampling rate is 16 kHz, the lowest frequency is 0 Hz, the number of filters is 26, and the frame size is 512, so the number of Fourier transform points is also 512. From the Mel/Hz conversion formula, the lowest Mel frequency is 0 and the highest is 2840.02. The centre-frequency spacing is (2840.02 - 0)/(26 + 1) = 105.19, which gives the Mel-scale centre frequencies [0, 105.19, 210.38, ..., 2840.02]. This group of centre frequencies is then converted back to actual frequencies (simply apply the formula; the values are not listed here), and finally each actual frequency is mapped to an FFT point index with the formula: each frequency in the actual frequency group / sampling rate × (number of Fourier transform points + 1), rounded down. This yields the FFT index group [0, 2, 4, 7, 10, 13, 16, ..., 256], i.e. f(0), f(1), ..., f(27).
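The following sketch reproduces this placement (reusing hz2mel/mel2hz from above; all names are illustrative and not the internals of the wave2mfcc routine used in the source code below):

% Mel filter-bank placement sketch for the worked example
fs = 16000;  NFFT = 512;  numFilt = 26;
mLo  = hz2mel(0);  mHi = hz2mel(fs/2);   % 0 and approx. 2840 mel
mels = linspace(mLo, mHi, numFilt + 2);  % 28 equally spaced mel points
hz   = mel2hz(mels);                     % back to Hz: spacing now non-uniform
f    = floor((NFFT + 1) * hz / fs);      % bin indices; Matlab f(1)..f(28)
                                         % correspond to f(0)..f(27) in the text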
With these in hand, the output of each filter is computed as the log filter energy

$$s(m) = \ln\!\left(\sum_{k=0}^{N-1} |X_a(k)|^2 \, H_m(k)\right), \quad 1 \le m \le M$$

where M is the number of filters and N is the number of FFT points kept (257 in the example above). After this calculation, each frame of data has a dimensionality equal to the number of filters, again reducing the dimensionality (26 dimensions in this example).
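A sketch of this step, building the 26 triangular filters from the bin indices f above and applying them to the power spectrum P (it assumes the bin indices are distinct, as they are in this example):

% Triangular filter output sketch: log energy of each filter per frame
H = zeros(numFilt, NFFT/2 + 1);          % 26 x 257 filter matrix
for m = 2:numFilt + 1                    % Matlab f(m-1),f(m),f(m+1) = text f(m-2),f(m-1),f(m)
    for k = f(m-1):f(m)                  % rising edge of triangle m-1
        H(m-1, k+1) = (k - f(m-1)) / (f(m) - f(m-1));
    end
    for k = f(m):f(m+1)                  % falling edge
        H(m-1, k+1) = (f(m+1) - k) / (f(m+1) - f(m));
    end
end
S = log(H * P + eps);                    % 26 log filter energies per frame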

6. Discrete cosine transform
The discrete cosine transform (DCT) is often used in signal and image processing for lossy data compression, because it has a strong "energy compaction" property: for most natural signals (including sound and images), the energy concentrates in the low-frequency coefficients after the DCT, which again reduces the dimensionality of each frame of data. Substituting the log filter energies above into the DCT yields the L-th order Mel-scale cepstral coefficients:

$$C(n) = \sum_{m=1}^{M} s(m) \cos\!\left(\frac{\pi n (m - 0.5)}{M}\right), \quad n = 1, 2, \ldots, L$$

where L is the order of the MFCC coefficients, usually 12-16, and M is the number of triangular filters.
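A sketch of the DCT step (S and numFilt carried over from above; Matlab's dct() computes the same transform up to normalisation, but the explicit formula is clearer here):

% DCT sketch: C(n) = sum_m S(m) * cos(pi*n*(m - 0.5)/M), n = 1..L
L = 13;                                  % MFCC order
m = 1:numFilt;                           % filter indices
mfcc = zeros(L, size(S, 2));
for n = 1:L
    mfcc(n, :) = cos(pi*n*(m - 0.5)/numFilt) * S;   % (1 x M) * (M x T)
end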

7. Extraction of dynamic difference parameters
The standard cepstral parameters (MFCC) only reflect the static characteristics of speech; the dynamic characteristics can be described by the difference spectrum of these static features. Experiments show that combining dynamic and static features effectively improves the recognition performance of a system. The difference parameters can be computed with the following formula:

$$d_t = \frac{\sum_{k=1}^{K} k \,(c_{t+k} - c_{t-k})}{2 \sum_{k=1}^{K} k^2}$$

where d_t is the t-th first-order difference, c_t is the t-th cepstral coefficient, Q is the order of the cepstral coefficients, and K is the time span of the first derivative, which can be 1 or 2. Feeding the result of this formula back into it again yields the second-order difference parameters.
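A sketch of the first-order differences with K = 2 (frame indices are clamped at the edges; names are illustrative):

% Delta sketch: d(t) = sum_{k=1..K} k*(c(t+k) - c(t-k)) / (2*sum(k^2))
K = 2;
T = size(mfcc, 2);
denom = 2 * sum((1:K).^2);               % = 10 for K = 2
delta = zeros(size(mfcc));
for t = 1:T
    for k = 1:K
        tp = min(t + k, T);  tm = max(t - k, 1);   % clamp at the ends
        delta(:, t) = delta(:, t) + k * (mfcc(:, tp) - mfcc(:, tm));
    end
    delta(:, t) = delta(:, t) / denom;
end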
The complete MFCC feature is therefore: N-dimensional MFCC parameters (N/3 MFCC coefficients + N/3 first-order difference parameters + N/3 second-order difference parameters) + frame energy (this item can be replaced or dropped as needed).
Frame energy here means the volume (i.e. energy) of a frame, which is also an important speech feature and very easy to compute. Usually the log energy of a frame (defined as the sum of squares of the signal within the frame, taken as a base-10 logarithm and multiplied by 10) is appended, so the basic speech feature of each frame gains one more dimension: one log energy plus the cepstral parameters. This also explains the 40 dimensions mentioned at the beginning: if the order of the discrete cosine transform is 13, then after first- and second-order differencing there are 39 dimensions, and the frame energy makes 40 in total. Of course, this can be adjusted according to actual needs.
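To close the sketch, the log frame energy follows the definition above, and the 40-dimensional vector is assembled per frame (delta2 is the second-order difference, obtained by re-running the delta recursion on delta; all names remain illustrative):

% Log frame energy and final 40-dimensional feature (13 + 13 + 13 + 1)
logE = 10 * log10(sum(frames.^2, 1) + eps);    % eps guards against log(0)
delta2 = zeros(size(delta));                   % second-order differences
for t = 1:T
    for k = 1:K
        tp = min(t + k, T);  tm = max(t - k, 1);
        delta2(:, t) = delta2(:, t) + k * (delta(:, tp) - delta(:, tm));
    end
    delta2(:, t) = delta2(:, t) / denom;
end
feat = [mfcc; delta; delta2; logE];            % 40 x numFrames feature matrix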

2. Source code

% ====== Load wave data and do feature extraction
clc,clear
waveDir='trainning\';
speakerData = dir(waveDir);
% dir returns a struct array describing all sub-folders and files under the
% given folder. Typical ways to call it:
%   dir('.')          list all sub-folders and files in the current folder
%   dir('G:\Matlab')  list all sub-folders and files in the given folder
%   dir('*.m')        list folders and files matching the pattern
% Each element of the resulting struct array has the fields:
%         name    -- filename
%         date    -- modification date
%         bytes   -- number of bytes allocated to the file
%         isdir   -- 1 if name is a directory and 0 if not
%         datenum -- modification date as a MATLAB serial date number
% The name field is extracted below for reading and saving files.
speakerData(1:2) = [];              % drop the '.' and '..' entries
speakerNum=length(speakerData);     % number of speakers


% ====== Feature extraction
fprintf('\nReading speech files and extracting features...       ');
% cd('D:\MATLAB7\toolbox\dcpr\');
for i=1:speakerNum
	fprintf('\nExtracting features for speaker %d (%s)\n', i, speakerData(i,1).name(1:end-4));
    [y, fs, nbits]=wavread(['trainning\' speakerData(i,1).name]);
    epInSampleIndex = epdByVol(y, fs);		% endpoint detection
    y=y(epInSampleIndex(1):epInSampleIndex(2));	% remove silence
    speakerData(i).mfcc=wave2mfcc(y, fs);
    fprintf('  Done!');
end

save speakerData speakerData;		% Feature extraction is slow; if the features have not changed, save the data for future use.
graph_MFCC;
fprintf('\n');
clear all;
fprintf('Feature extraction finished! \n\nPress any key to continue...');
 pause;


% ====== GMM training
fprintf('\nTraining a Gaussian mixture model for each speaker...\n\n');
load speakerData.mat
gaussianNum=12;					% number of Gaussians in each GMM
speakerNum=length(speakerData);


for i=1:speakerNum
	fprintf('\nTraining the GMM for speaker %d (%s)...\n', i, speakerData(i).name(1:end-4));
	[speakerGmm(i).mu, speakerGmm(i).sigm,speakerGmm(i).c] = gmm_estimate(speakerData(i).mfcc,gaussianNum);
    fprintf('  Done!');
end

fprintf('\n');
save speakerGmm speakerGmm;
pause(10);
clear all;
fprintf('GMM training finished! \n\nPress any key to continue...');
 pause;

% ====== recognition
fprintf('\nRecognizing...\n\n');
load speakerData;
load speakerGmm;

[filename, pathname] = uigetfile('*.wav','select a wave file to load');
if pathname == 0
    errordlg('ERROR! No file selected!');
    return;
end
wav_file = [pathname filename];
[testing_data, fs, nbits]=wavread(wav_file);
 pause(10);
match= MFCC_feature_compare(testing_data,speakerGmm);
 disp('Matching the test data against the trained models; please wait about 10 seconds...')
 pause(10);
[max_1, index]=max(match);
if length(filename)>7
   fprintf('\n\n\nThe speaker is %s.',speakerData(index).name(1:end-4));
else
   fprintf('\n\n\nThe speaker is %s.',filename(1:end-4));
end

3. Running results


4. Remarks

For the complete code or custom development, add QQ 1564658423.
Previous issues >>>>>>
[Feature extraction] Audio watermark embedding and extraction based on matlab wavelet transform [including Matlab source code, issue 053]
[Speech processing] Speech signal processing based on matlab GUI [including Matlab source code, issue 290]
[Speech acquisition] Speech signal acquisition based on matlab GUI [including Matlab source code, issue 291]
[Speech modulation] Speech amplitude modulation based on matlab GUI [including Matlab source code, issue 292]
[Speech synthesis] Speech synthesis based on matlab GUI [including Matlab source code, issue 293]
[Speech encryption] Speech signal encryption and decryption based on matlab GUI [including Matlab source code, issue 295]
[Speech enhancement] Speech enhancement based on matlab wavelet transform [including Matlab source code, issue 296]
[Speech recognition] Speech fundamental frequency recognition based on matlab GUI [including Matlab source code, issue 294]
[Speech enhancement] Speech enhancement based on matlab GUI Wiener filtering [including Matlab source code, issue 298]
[Speech processing] Speech signal processing based on matlab GUI [including Matlab source code, issue 299]
[Signal processing] Speech signal spectrum analyzer based on matlab [including Matlab source code, issue 325]
[Modulated signal] Digital modulation signal simulation based on matlab GUI [including Matlab source code, issue 336]
[Emotion recognition] Speech emotion recognition based on matlab BP neural network [including Matlab source code, issue 349]
[Speech steganography] Quantized audio digital watermarking based on matlab wavelet transform [including Matlab source code, issue 351]
[Feature extraction] Audio watermark embedding and extraction based on matlab [including Matlab source code, issue 350]
[Speech denoising] Low-pass and adaptive filter denoising based on matlab [including Matlab source code, issue 352]
[Emotion recognition] Speech emotion classification and recognition based on matlab GUI [including Matlab source code, issue 354]
[Basic processing] Speech signal preprocessing based on matlab [including Matlab source code, issue 364]
[Speech recognition] 0-9 digit speech recognition based on matlab Fourier transform [including Matlab source code, issue 384]
[Speech recognition] 0-9 digit speech recognition based on matlab GUI DTW [including Matlab source code, issue 385]
[Speech playback] MP3 player design based on matlab GUI [including Matlab source code, issue 425]
[Speech processing] Signal-to-noise ratio calculation for speech enhancement algorithms based on the human-ear masking effect [including Matlab source code, issue 428]
[Speech denoising] Spectral subtraction denoising based on matlab [including Matlab source code, issue 429]
[Speech recognition] BP neural network speech recognition with momentum term based on matlab [including Matlab source code, issue 430]
[Speech steganography] LSB speech hiding based on matlab [including Matlab source code, issue 431]
[Speech recognition] Male and female voice recognition based on matlab [including Matlab source code, issue 452]
[Speech processing] Speech noise addition and noise reduction based on matlab [including Matlab source code, issue 473]
[Speech denoising] Least mean square (LMS) adaptive filter based on matlab [including Matlab source code, issue 481]
[Speech enhancement] Speech enhancement based on matlab spectral subtraction, least mean square and Wiener filtering [including Matlab source code, issue 482]
[Communication] Digital band (ASK, PSK, QAM) modulation simulation based on matlab GUI [including Matlab source code, issue 483]
[Signal processing] ECG signal processing based on matlab [including Matlab source code, issue 484]
[Speech broadcast] Speech broadcast based on matlab [including Matlab source code, issue 507]
[Signal processing] EEG signal feature extraction based on matlab wavelet transform [including Matlab source code, issue 511]
[Speech processing] Dual-tone multi-frequency (DTMF) signal detection based on matlab GUI [including Matlab source code, issue 512]
[Speech steganography] Digital watermarking of speech signals based on matlab LSB [including Matlab source code, issue 513]
[Speech enhancement] Speech recognition based on matlab matched filter [including Matlab source code, issue 514]
[Speech processing] Frequency-domain spectrogram analysis of speech based on matlab GUI [including Matlab source code, issue 527]
[Speech denoising] Speech denoising based on matlab LMS and RLS algorithms [including Matlab source code, issue 528]
[Speech denoising] Speech denoising based on matlab LMS spectral subtraction [including Matlab source code, issue 529]
[Speech denoising] Speech denoising based on matlab soft threshold, hard threshold and compromise threshold [including Matlab source code, issue 530]
[Speech recognition] Speaker-specific voice recognition based on matlab [including Matlab source code, issue 534]
[Speech denoising] Speech noise reduction based on matlab wavelet soft threshold [including Matlab source code, issue 531]
[Speech denoising] Speech noise reduction based on matlab wavelet hard threshold [including Matlab source code, issue 532]
[Speech recognition] Speaker gender recognition based on matlab MFCC and SVM [including Matlab source code, issue 533]

Source: blog.csdn.net/TIQCmatlab/article/details/114877082