[Speech analysis] Cepstrum analysis and MFCC coefficient calculation based on MATLAB [includes MATLAB source code, issue 556]

1. Introduction

1 Mel-frequency cepstral coefficients (MFCC)

In any automatic speech recognition system, the first step is to extract features: we keep the components of the audio signal that are useful for identifying the content, and discard everything else, such as background noise, emotion, and so on.
Knowing how speech is produced is of great help in understanding it. People produce sound through the vocal tract, and the shape of the vocal tract determines what sound comes out. That shape involves the tongue, teeth, and so on; if we could determine it accurately, we could accurately describe the phoneme being produced. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCCs are a feature that accurately describes this envelope.

MFCCs (Mel-Frequency Cepstral Coefficients) are features widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in 1980. Ever since, MFCCs have stood out among hand-crafted features for speech recognition and were never really surpassed (feature learning with deep learning is a later story).

At this point we have mentioned a very important keyword: the shape of the vocal tract. We know it matters, and we know it shows up in the envelope of the short-time power spectrum of speech. But what is a power spectrum? What is an envelope? What are MFCCs, why are they effective, and how do we compute them? Let's go through these questions one by one.

2 Spectrogram

We are dealing with speech signals, so how we represent them matters: different representations reveal different information. So what kind of representation helps us observe and understand speech? Let's first look at something called the spectrogram.
Here, the speech is divided into many frames, and each frame corresponds to a spectrum (computed via the short-time FFT) that describes the relationship between frequency and energy. In practice, three kinds of spectra are used: the linear amplitude spectrum, the logarithmic amplitude spectrum, and the auto-power spectrum. In the logarithmic amplitude spectrum, the amplitude of each spectral line is taken logarithmically, so the vertical axis is in dB (decibels); this transformation boosts the low-amplitude components relative to the high-amplitude ones, making it possible to observe periodic signals hidden in low-amplitude noise.
Let us first plot the spectrum of one frame of speech in coordinates, as in the left figure above. Now rotate the spectrum on the left by 90 degrees to get the middle figure, then map the amplitudes to grayscale (which can also be understood as quantizing the continuous amplitudes into 256 levels), where 0 means black and 255 means white: the larger the amplitude, the darker the corresponding region. This yields the rightmost figure. Why do this? The purpose is to add a time dimension, so that a whole segment of speech can be displayed rather than the spectrum of a single frame, and both static and dynamic information can be seen at a glance. The advantages will become apparent later.

In this way we obtain a representation that changes over time: the spectrogram describing the speech signal.
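To make the construction concrete, here is a minimal sketch in Python/NumPy (not part of the article's MATLAB code; the frame length of 256 and hop of 128 are arbitrary illustration values):

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Log-magnitude spectrogram: frame the signal, window, FFT each frame."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))   # linear amplitude spectrum
    return 20.0 * np.log10(spec + 1e-10)         # log amplitude spectrum, in dB

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                  # a 440 Hz tone
S = spectrogram(x)                               # rows = time, columns = frequency
peak_bin = S.mean(axis=0).argmax()
print(peak_bin * fs / 256)                       # -> 437.5, the FFT bin nearest 440 Hz
```

Each row of `S` is one frame's spectrum in dB; stacking the rows over time is exactly the grayscale picture described above.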
The picture below is the spectrogram of a segment of speech; the very dark regions are the peaks (formants) of the spectrum.
So why do we represent speech with a spectrogram?

First of all, the properties of phones can be observed better here. In addition, sounds can be recognized better by observing the formants and their transitions. Hidden Markov models achieve good recognition performance by implicitly modeling the spectrogram. Another use is to evaluate the quality of a TTS (text-to-speech) system intuitively, by directly comparing how well the spectrogram of the synthesized speech matches that of natural speech.

3 Cepstrum Analysis

The following is the spectrogram of a segment of speech. The peaks represent its main frequency components; we call these peaks formants, and formants carry the identifying attributes of a sound (much like a personal ID card), so they are particularly important. We use them to recognize different sounds.
Since formants are so important, we want to extract them! What we want is not only their positions but also the way they change over time. So what we extract is the spectral envelope: a smooth curve connecting the formant peaks.
We can view the original spectrum as composed of two parts: the envelope and the spectral detail. The logarithmic spectrum is used here, so the unit is dB. Now we need to separate the two parts in order to obtain the envelope.

How do we separate them? That is, given log X[k], how do we obtain log H[k] and log E[k] such that log X[k] = log H[k] + log E[k]?

To achieve this, we need to play a mathematical trick. What is it? We take an FFT of the spectrum. Doing a Fourier transform on the spectrum amounts to an inverse FFT (IFFT). Note that we work in the log domain of the spectrum, which is also part of the trick: taking the IFFT of the log spectrum amounts to describing the signal on a pseudo-frequency axis.
From the above figure we can see that the envelope consists mainly of low-frequency components (a change of perspective is needed here: the horizontal axis should no longer be regarded as frequency but as time), and we can regard it as a sinusoid with 4 cycles per second, giving it a peak at 4 Hz on the pseudo-frequency axis. The spectral detail consists mainly of high frequencies; we regard it as a sinusoid with 100 cycles per second, giving it a peak at 100 Hz on the pseudo-frequency axis.
Superimposing the two recovers the original spectrum signal.
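This decomposition can be checked numerically. In the sketch below (Python/NumPy, written for this article rather than taken from it), a "log-spectrum" is built as the sum of a slow 4-cycle/s wave (the envelope) and a fast 100-cycle/s ripple (the detail); its Fourier transform shows exactly the two expected peaks on the pseudo-frequency axis:

```python
import numpy as np

fs = 1000                                        # samples per "second" along the axis
t = np.arange(fs) / fs
envelope = np.sin(2 * np.pi * 4 * t)             # slow component: 4 cycles per second
detail = 0.3 * np.sin(2 * np.pi * 100 * t)       # fast ripple: 100 cycles per second
log_spectrum = envelope + detail                 # the "observed" log spectrum

mag = np.abs(np.fft.rfft(log_spectrum))
peaks = np.argsort(mag)[-2:]                     # the two largest pseudo-frequency peaks
print(sorted(int(p) for p in peaks))             # -> [4, 100]
```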
In practice we already know log X[k], so we can obtain x[k]. From the figure we know that h[k] is the low-frequency part of x[k], so we can obtain h[k] by passing x[k] through a low-pass filter! That's right: this separates the two parts and gives us the h[k] we want, the envelope of the spectrum.

x[k] is in fact the cepstrum (a coined word: reversing the first four letters of "spectrum", spec, gives ceps). The h[k] we care about is the low-frequency part of the cepstrum; it describes the spectral envelope and is widely used as a feature in speech recognition.

To summarize, cepstrum analysis is the following process:

1) Fourier-transform the original speech signal to obtain the spectrum: X[k] = H[k] E[k];

considering only the magnitude: |X[k]| = |H[k]| |E[k]|;

2) take the logarithm of both sides: log|X[k]| = log|H[k]| + log|E[k]|;

3) take the inverse Fourier transform of both sides: x[k] = h[k] + e[k].

This procedure has a formal name: homomorphic signal processing. Its purpose is to turn a nonlinear problem into a linear one. In our case, the original speech signal is a convolutional signal (the vocal tract acts as a linear time-invariant system, and sound production can be understood as an excitation passing through this system). The first step turns the convolution into a product (convolution in the time domain corresponds to multiplication in the frequency domain). The second step turns the product into a sum by taking the logarithm. The third step applies an inverse transform, recovering a convolution-like signal. Although both ends of the chain are discrete-time sequences, they live in clearly different domains, so the final one is called the cepstral (quefrency) domain.
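These steps can be verified numerically. The sketch below (Python/NumPy, with random vectors standing in for the excitation and the vocal-tract impulse response) checks that convolution becomes multiplication under the FFT, and a sum under the logarithm:

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.standard_normal(64)        # "excitation"
h = rng.standard_normal(16)        # "vocal tract" impulse response
s = np.convolve(e, h)              # source-filter model: s = e * h (convolution)

n = len(s)
# Step 1: convolution in time corresponds to multiplication in frequency
ok_product = np.allclose(np.fft.fft(s), np.fft.fft(e, n) * np.fft.fft(h, n))
# Step 2: taking the logarithm turns the product into a sum
ok_sum = np.allclose(np.log(np.abs(np.fft.fft(s))),
                     np.log(np.abs(np.fft.fft(e, n)))
                     + np.log(np.abs(np.fft.fft(h, n))), atol=1e-6)
print(ok_product, ok_sum)          # -> True True
```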

In summary, the cepstrum is obtained by taking the Fourier transform of a signal, applying the logarithm, and then taking the inverse Fourier transform. The computation proceeds as follows:
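Putting the pipeline together, here is a minimal real-cepstrum and envelope-extraction sketch (Python/NumPy; the cutoff `n_low = 30` is an arbitrary illustrative choice, and a synthetic random signal stands in for a speech frame):

```python
import numpy as np

def real_cepstrum(x):
    """FFT -> log magnitude -> inverse FFT (steps 1-3 above)."""
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    return np.fft.ifft(log_mag).real

def spectral_envelope(x, n_low=30):
    """Keep only the low-quefrency part of the cepstrum (h[k]),
    then transform back to get the smooth log-spectrum envelope."""
    c = real_cepstrum(x)
    lifter = np.zeros_like(c)
    lifter[:n_low] = 1.0
    lifter[-n_low + 1:] = 1.0      # the cepstrum of a real signal is symmetric
    return np.fft.fft(c * lifter).real

x = np.random.default_rng(0).standard_normal(512)
env = spectral_envelope(x)
log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-10)
# The envelope is a smoothed version of the log spectrum, so it varies less:
print(np.var(env) < np.var(log_mag))   # -> True
```

Zeroing the high-quefrency coefficients is the "low-pass filter" of the previous section, applied in the cepstral domain.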
4 Mel-Frequency Analysis

OK, let's review what we have just done: given a segment of speech, we can obtain its spectral envelope (the smooth curve connecting the formant peaks). However, experiments on human auditory perception show that the ear focuses only on certain specific regions, not the entire spectral envelope.

Mel-frequency analysis is based on experiments in human auditory perception. These experiments found that the human ear behaves like a filter bank, paying attention only to certain frequency components (human hearing is selective in frequency). In other words, it lets signals in certain frequency bands pass, and simply ignores frequencies it does not want to perceive. Moreover, these filters are not uniformly distributed along the frequency axis: in the low-frequency region there are many filters, densely distributed, while in the high-frequency region the filters become few and sparse.
The human auditory system is a special nonlinear system whose sensitivity differs across frequencies. At extracting speech features, the human auditory system does an excellent job: it can extract not only semantic information but also the speaker's personal characteristics, which is beyond the reach of existing speech recognition systems. If the processing characteristics of human auditory perception could be simulated in a speech recognition system, recognition rates might improve.

Mel-frequency cepstral coefficients (MFCC) take the characteristics of human hearing into account: the linear spectrum is first mapped onto the nonlinear Mel scale based on auditory perception, and then converted to the cepstrum.

The standard formula for converting ordinary (linear) frequency f, in Hz, to Mel frequency is:

mel(f) = 2595 · log10(1 + f / 700)

As can be seen from the figure below, this mapping converts non-uniformly spaced frequencies into uniformly spaced ones, that is, into a uniform filter bank on the Mel axis.
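As a small sketch of the conversion and its inverse (Python; 2595 and 700 are the standard Mel-scale constants):

```python
import math

def hz_to_mel(f):
    """Standard Mel-scale mapping: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping back to linear frequency in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 1000 Hz maps to roughly 1000 mel; the scale is near-linear below ~1 kHz
# and logarithmic above, which is why the filters crowd the low frequencies.
print(round(hz_to_mel(1000)))              # -> 1000
print(round(mel_to_hz(hz_to_mel(4000))))   # round trip -> 4000
```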
5 Mel-Frequency Cepstral Coefficients

We pass the spectrum through the set of Mel filters to obtain the Mel spectrum. In formula form: log X[k] = log(Mel-Spectrum). We then perform cepstrum analysis on log X[k]:

1) Take the logarithm: log X[k] = log H[k] + log E[k].

2) Perform inverse transformation: x[k] = h[k] + e[k].

The cepstral coefficients h[k] obtained on the Mel spectrum are called Mel-frequency cepstral coefficients, or MFCC for short.
Now let's summarize the process of extracting MFCC features (the detailed mathematics is widely available online, so it is not repeated here):

1) First apply pre-emphasis, framing, and windowing to the speech signal;

2) For each short-term analysis window, obtain the corresponding frequency spectrum through FFT;

3) Pass the above spectrum through the Mel filter bank to obtain the Mel spectrum;

4) Perform cepstrum analysis on the Mel spectrum (take the logarithm, then the inverse transform; in practice the inverse transform is implemented with the DCT, the discrete cosine transform, and the 2nd through 13th DCT coefficients are kept as the MFCC). The resulting Mel-frequency cepstral coefficients are the features of this frame of speech.

At this point, the speech can be described by a series of cepstral vectors, each being the MFCC feature vector of one frame.
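The four steps can be sketched end to end as follows (Python/NumPy; the triangular filter bank and the choices of 26 filters and 12 retained coefficients are common defaults used for illustration, not taken from this article's MATLAB code):

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the Mel scale."""
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(0, hz2mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def mfcc(frame, fs, n_filters=26, n_ceps=12):
    """FFT -> Mel filter bank -> log -> DCT; keep coefficients 2..13."""
    spec = np.abs(np.fft.rfft(frame)) ** 2                   # power spectrum
    mel_spec = mel_filterbank(n_filters, len(frame), fs) @ spec
    log_mel = np.log(mel_spec + 1e-10)
    n = np.arange(n_filters)
    # DCT-II basis rows for k = 1..n_ceps (the first coefficient is skipped)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ log_mel

fs = 8000
frame = np.sin(2 * np.pi * 300 * np.arange(512) / fs)        # one synthetic frame
coeffs = mfcc(frame, fs)
print(coeffs.shape)    # -> (12,)
```

Applying `mfcc` to every windowed frame of an utterance yields the series of cepstral vectors described above.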
A speech classifier can then be trained, and recognition performed, using these cepstral vectors.

2. Source code

% Cepstrum computation and display
clear all; clc; close all;
[y,fs]=wavread('C3_4_y_1.wav');                    % read the speech file (use audioread in newer MATLAB releases)
y=y(1:1000);
N=1024;                                            % FFT length
len=length(y);
time=(0:len-1)/fs;                                 % time axis
figure(1), subplot 311; plot(time,y,'k');          % plot the signal waveform
title('(a) Signal waveform'); axis([0 max(time) -1 1]);
ylabel('Amplitude'); xlabel(['Time/s' 10]); grid;

nn=1:N/2; ff=(nn-1)*fs/N;                          % frequency axis
z=Nrceps(y);                                       % compute the cepstrum (Nrceps is a helper function shipped with this series' code)
figure(1), subplot 312; plot(time,z,'k');          % plot the cepstrum
title('(b) Signal cepstrum'); axis([0 time(512) -0.2 0.2]); grid;
ylabel('Amplitude'); xlabel(['Quefrency/s' 10]);
% DCT coefficient computation and signal reconstruction
clear all; clc; close all;

f=50;                                   % signal frequency
fs=1000;                                % sampling frequency
N=1000;                                 % total number of samples
n=0:N-1;
xn=cos(2*pi*f*n/fs);                    % construct a cosine sequence
y=dct(xn);                              % discrete cosine transform
num=find(abs(y)<5);                     % find coefficients with magnitude below 5
y(num)=0;                               % zero out those coefficients
zn=idct(y);                             % inverse discrete cosine transform
subplot 211; plot(n,xn,'k');            % plot xn
title('(a) Original signal'); xlabel(['Samples' 10 ]); ylabel('Amplitude');
subplot 212; plot(n,zn,'k');            % plot zn
title('(b) Reconstructed signal'); xlabel(['Samples' 10 ]); ylabel('Amplitude');

% Plot the frequency response curves of the Mel filter bank
clear all; clc; close all;

% Call melbankm (e.g. from the VOICEBOX toolbox) to design 24 Mel filters
% over the normalized frequency range 0-0.5, with triangular windows
bank=melbankm(24,256,8000,0,0.5,'t');
bank=full(bank);
bank=bank/max(bank(:));              % normalize the amplitudes

df=8000/256;                         % frequency resolution
ff=(0:128)*df;                       % frequency axis
for k=1 : 24                         % plot the 24 Mel filter responses
    plot(ff,bank(k,:),'k'); hold on;
end
hold off; grid;
xlabel('Frequency/Hz'); ylabel('Relative amplitude')
% MFCC computation program
clear all; clc; close all;

[x1,fs]=wavread('C3_4_y_4.wav');        % read the file C3_4_y_4.wav (use audioread in newer MATLAB releases)
wlen=200;                               % frame length
inc=80;                                 % frame shift
num=8;                                  % number of Mel filters
x1=x1/max(abs(x1));                     % amplitude normalization
time=(0:length(x1)-1)/fs;
subplot 211; plot(time,x1,'b')
title('(a) Speech signal');
ylabel('Amplitude'); xlabel(['Time/s' ]);
ccc1=Nmfcc(x1,fs,num,wlen,inc);         % compute MFCCs (Nmfcc is a helper function shipped with this series' code)
fn=size(ccc1,1)+4;                      % the first and last two frames are discarded
cn=size(ccc1,2);
z=zeros(1,cn);

3. Running results


4. Remarks

For the complete code or commissioned work, add QQ 1564658423. Past review:
>>>>>>
[Feature extraction] Audio watermark embedding and extraction based on MATLAB wavelet transform [includes MATLAB source code, issue 053]
[Speech processing] Voice signal processing based on MATLAB GUI [includes MATLAB source code, issue 290]
[Voice acquisition] Voice signal acquisition based on MATLAB GUI [includes MATLAB source code, issue 291]
[Voice modulation] Voice amplitude modulation based on MATLAB GUI [includes MATLAB source code, issue 292]
[Speech synthesis] Speech synthesis based on MATLAB GUI [includes MATLAB source code, issue 293]
[Voice encryption] Voice signal encryption and decryption based on MATLAB GUI [includes MATLAB source code, issue 295]
[Speech enhancement] Speech enhancement based on MATLAB wavelet transform [includes MATLAB source code, issue 296]
[Voice recognition] Pitch (fundamental frequency) recognition based on MATLAB GUI [includes MATLAB source code, issue 294]
[Speech enhancement] Speech enhancement based on MATLAB GUI Wiener filtering [includes MATLAB source code, issue 298]
[Speech processing] Voice signal processing based on MATLAB GUI [includes MATLAB source code, issue 299]
[Signal processing] Speech signal spectrum analyzer based on MATLAB [includes MATLAB source code, issue 325]
[Modulated signals] Digital modulation signal simulation based on MATLAB GUI [includes MATLAB source code, issue 336]
[Emotion recognition] Speech emotion recognition based on MATLAB BP neural network [includes MATLAB source code, issue 349]
[Voice steganography] Quantized audio digital watermarking based on MATLAB wavelet transform [includes MATLAB source code, issue 351]
[Feature extraction] Audio watermark embedding and extraction based on MATLAB [includes MATLAB source code, issue 350]
[Speech denoising] Low-pass and adaptive filter denoising based on MATLAB [includes MATLAB source code, issue 352]
[Emotion recognition] Speech emotion classification and recognition based on MATLAB GUI [includes MATLAB source code, issue 354]
[Basic processing] Speech signal preprocessing based on MATLAB [includes MATLAB source code, issue 364]
[Speech recognition] 0-9 digit speech recognition based on MATLAB Fourier transform [includes MATLAB source code, issue 384]
[Speech recognition] 0-9 digit speech recognition based on MATLAB GUI DTW [includes MATLAB source code, issue 385]
[Voice playback] MP3 player design based on MATLAB GUI [includes MATLAB source code, issue 425]
[Voice processing] SNR calculation for speech enhancement algorithms based on the human-ear masking effect [includes MATLAB source code, issue 428]
[Speech denoising] Spectral subtraction denoising based on MATLAB [includes MATLAB source code, issue 429]
[Speech recognition] BP neural network speech recognition with momentum term based on MATLAB [includes MATLAB source code, issue 430]
[Voice steganography] LSB voice hiding based on MATLAB [includes MATLAB source code, issue 431]
[Voice recognition] Male and female voice recognition based on MATLAB [includes MATLAB source code, issue 452]
[Voice processing] Voice noise addition and noise reduction based on MATLAB [includes MATLAB source code, issue 473]
[Speech denoising] Least mean square (LMS) adaptive filter based on MATLAB [includes MATLAB source code, issue 481]
[Speech enhancement] Speech enhancement via spectral subtraction, LMS, and Wiener filtering based on MATLAB [includes MATLAB source code, issue 482]
[Communication] Digital band (ASK, PSK, QAM) modulation simulation based on MATLAB GUI [includes MATLAB source code, issue 483]
[Signal processing] ECG signal processing based on MATLAB [includes MATLAB source code, issue 484]
[Voice broadcast] Voice broadcast based on MATLAB [includes MATLAB source code, issue 507]
[Signal processing] EEG signal feature extraction based on MATLAB wavelet transform [includes MATLAB source code, issue 511]
[Voice processing] Dual-tone multi-frequency (DTMF) signal detection based on MATLAB GUI [includes MATLAB source code, issue 512]
[Voice steganography] Digital watermarking of speech signals via LSB based on MATLAB [includes MATLAB source code, issue 513]
[Speech enhancement] Speech recognition based on MATLAB matched filter [includes MATLAB source code, issue 514]
[Speech processing] Frequency-domain spectrogram analysis of voice based on MATLAB GUI [includes MATLAB source code, issue 527]
[Speech denoising] Voice denoising based on MATLAB LMS and RLS algorithms [includes MATLAB source code, issue 528]
[Voice denoising] Voice denoising based on MATLAB LMS spectral subtraction [includes MATLAB source code, issue 529]
[Voice denoising] Voice denoising based on MATLAB soft-threshold, hard-threshold, and compromise-threshold methods [includes MATLAB source code, issue 530]
[Voice recognition] Specific-speaker voice recognition based on MATLAB [includes MATLAB source code, issue 534]
[Speech denoising] Speech noise reduction based on MATLAB wavelet soft thresholding [includes MATLAB source code, issue 531]
[Speech denoising] Speech noise reduction based on MATLAB wavelet hard thresholding [includes MATLAB source code, issue 532]
[Speech recognition] Speaker gender recognition based on MATLAB MFCC and SVM [includes MATLAB source code, issue 533]
[Voice recognition] GMM speech recognition based on MFCC [includes MATLAB source code, issue 535]
[Voice recognition] Specific-speaker isolated-word speech recognition based on MATLAB VQ [includes MATLAB source code, issue 536]
[Voice recognition] Voiceprint recognition based on MATLAB GUI [includes MATLAB source code, issue 537]
[Acquisition and reading] Voice acquisition and reading based on MATLAB [includes MATLAB source code, issue 538]
[Voice editing] Voice editing based on MATLAB [includes MATLAB source code, issue 539]
[Voice models] Mathematical models of voice signals based on MATLAB [includes MATLAB source code, issue 540]
[Voice intensity] Voice intensity and loudness based on MATLAB [includes MATLAB source code, issue 541]
[Emotion recognition] Speech emotion recognition based on MATLAB K-nearest-neighbor classification [includes MATLAB source code, issue 542]
[Emotion recognition] Speech emotion recognition based on MATLAB support vector machine (SVM) [includes MATLAB source code, issue 543]
[Emotion recognition] Speech emotion recognition based on neural networks [includes MATLAB source code, issue 544]
[Sound source localization] Comparison of spatial spectrum estimation algorithms for sound source localization based on MATLAB [includes MATLAB source code, issue 545]
[Sound source localization] Microphone received signals under different signal-to-noise ratios based on MATLAB [includes MATLAB source code, issue 546]
[Sound source localization] Room impulse response of a single source with dual microphones based on MATLAB [includes MATLAB source code, issue 547]
[Sound source localization] Generalized cross-correlation sound source localization based on MATLAB [includes MATLAB source code, issue 548]
[Sound source localization] Signal display based on MATLAB array manifold matrix [includes MATLAB source code, issue 549]
[Feature extraction] Formant estimation based on MATLAB [includes MATLAB source code, issue 550]
[Feature extraction] Pitch period estimation based on MATLAB [includes MATLAB source code, issue 551]
[Feature extraction] Voice endpoint detection based on MATLAB [includes MATLAB source code, issue 552]
[Voice coding] ADPCM encoding and decoding based on MATLAB [includes MATLAB source code, issue 553]
[Voice coding] LPC encoding and decoding based on MATLAB [includes MATLAB source code, issue 554]
[Voice coding] PCM encoding and decoding based on MATLAB [includes MATLAB source code, issue 555]


Origin blog.csdn.net/TIQCmatlab/article/details/114977769