Speech Recognition Simulation System Based on MATLAB

The speech recognition system implemented in this paper focuses on two problems: the extraction of speech feature parameters and the matching of recognition models. First, an overview of speech recognition is given together with the system framework. The recognition task itself is then divided into two parts. The first is the extraction of the Mel cepstral coefficients (MFCC), which mainly covers preprocessing of the speech signal and calculation of the feature parameters, with particular attention to double-threshold endpoint detection. The second is vector quantization (VQ) model matching, which consists of a training phase and a recognition phase: in the training phase a codebook is generated for each speaker with the classic LBG vector quantization algorithm, and in the recognition phase the distance between the test features and each training codebook is computed, so that the person to be recognized can be identified. Good simulation results were obtained on the MATLAB platform. The experiments show that the MFCC parameters are robust and can accurately identify the speaker.

1. Basic principles

Speech signal generation model:

 A speech signal can be considered stationary over a sufficiently short period of time. Under this premise, the classical speech signal model can be represented by a linear time-invariant system. To understand the characteristics of the speech signal, a model of speech generation is given, as shown in Figure 4-1. Establishing a discrete time-domain model of speech generation plays a very important role in further research and in various specific applications. The simplified model shown in the figure is sufficient for most research and applications.

The discrete time-domain model of speech has three parts: excitation source, vocal tract model and radiation model.

(1) Excitation source

The excitation source is selected by the voiced/unvoiced switch, which determines whether the generated speech is unvoiced or voiced. For voiced speech, the excitation is produced by a periodic impulse generator; the generated sequence is an impulse train whose frequency equals the fundamental (pitch) frequency. To give the voiced excitation the actual waveform of a glottal pulse, the impulse train is passed through a glottal pulse model filter G(z), and multiplying by a gain coefficient adjusts the amplitude of the voiced signal. For unvoiced speech, a random noise generator produces the excitation, and multiplying by a gain coefficient adjusts the amplitude of the unvoiced signal.

(2) Vocal tract model

The vocal tract model V(z) gives the vocal tract transfer function in the discrete time domain, where the actual vocal tract is assumed to be a sound tube with variable cross-sectional area. Using hydrodynamic methods it can be shown that, in most cases, the vocal tract model is an all-pole function. Therefore, V(z) can be expressed as:
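In its commonly used form, with gain $G$ and predictor coefficients $a_i$, the all-pole model is written as:

$$V(z) = \frac{G}{1 - \sum_{i=1}^{P} a_i z^{-i}}$$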

It can be seen that the acoustic tube with continuously varying cross-sectional area is approximated by a series of P short tube segments; P is called the order of this all-pole filter. Obviously, the larger P is, the closer the transfer function of the model is to the actual vocal tract transfer function. In practice, P is usually taken between 8 and 12.

(3) Radiation model

The radiation model R(z) is related to the shape of the mouth. Studies have shown that lip radiation is more significant at the high-frequency end and has less influence at the low-frequency end, so the radiation model R(z) is treated as a first-order high-pass filter. Its expression is:
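A typical first-order high-pass form, with the coefficient $r$ close to 1 (for example 0.98), is:

$$R(z) = 1 - r z^{-1}$$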

In the above model, G(z) and R(z) remain unchanged, while the pitch period, the position of the voiced/unvoiced switch, and the vocal tract model parameters all change with time.

 System Block Diagram

 Building this system is divided into two phases: a training phase and a recognition phase. In the training phase, each speaker of the system is recorded and a model codebook is built for each speaker in turn. In the recognition phase, the parameters extracted from the recording of the speaker to be identified are compared with the codebooks; if the distance between the two is less than a certain threshold, the speaker is identified, otherwise identification fails.

2. Speech signal preprocessing

Preprocessing includes silence (zero) removal, pre-emphasis, framing, and windowing. The reason for pre-emphasis is that the higher the frequency of the speech signal, the smaller its spectral value: roughly speaking, the power spectrum falls by about 6 dB for every doubling of frequency. The purpose of pre-emphasis is therefore to boost the high-frequency part so that the spectrum of the signal becomes flatter, which is convenient for spectrum analysis. Framing treats the speech signal as many short segments of quasi-stationary speech; this kind of analysis is called short-time analysis. Windowing amounts to multiplying each frame by a window function in the time domain, which corresponds to a convolution in the frequency domain. Why do we need a window? Because a fast Fourier transform (FFT) will later be applied to the data, and the FFT implicitly assumes that the signal inside the window is periodic; windowing gives the speech frame this periodic character and reduces leakage at the frame boundaries. The detailed implementation of the preprocessing steps is given in the following sections.

3. Feature extraction

The physiological structure of the vocal organs of different people is different, and the movements of the vocal organs are not the same even when people who grow up in different environments produce the same sound. This kind of information that can characterize the speaker is expressed through physically measurable parameter characteristics such as formant frequency and bandwidth, average fundamental frequency, and basic shape of the spectrum. The feature parameters should be different for different speakers, and this difference is called Interspeaker Variance. Inter-speaker differences are produced by different vocal tract characteristics of the speakers, and this difference can distinguish different speakers.

So how to extract feature parameters from the speech signal is the key to speaker recognition. Ideally, these features should have the following characteristics:

(1) It has a high ability to distinguish speakers and can fully reflect the differences between individual speakers;

(2) When the input voice is affected by the transmission channel and noise, it can have better robustness;

(3) It is easy to extract, easy to calculate, and has good independence between the parameters of each dimension of the feature;

(4) It is not easy to be imitated.

After the speech signal is preprocessed, only a few seconds of speech will generate a large amount of data. The process of extracting speaker features is actually the process of removing redundant information in the original speech and reducing the amount of data. The selection and extraction of feature parameters is very important for the speaker recognition effect. Generally, the feature parameters of the speech signal are divided into two categories: the first category is the time domain feature parameters, and the second category is the transform domain feature parameters. The former can usually be used for endpoint detection, and the latter can be used to extract features and generate recognition codebooks. The extraction process of these two types of parameters is described in detail in the implementation of extracting Mel cepstrum coefficients.

4. Implementation process

The implementation of speech recognition is mainly divided into two parts: the first is to extract the characteristic parameters; the second is the recognition model. Finding feature parameters and extraction algorithms with good performance is the key issue to improve the performance of the recognition system, and finding a suitable recognition model for pattern matching is also an important part of the system.

Speech recognition here refers to analyzing and processing the speaker's voice signal, extracting the corresponding features or building a corresponding model, and thereby confirming the speaker's identity. Currently, the most commonly used feature parameters are the vocal-tract-based LPCC (linear prediction cepstral coefficients) and the auditory-characteristic-based MFCC (Mel-frequency cepstral coefficients). MFCC parameters give higher recognition efficiency than LPCC, converge quickly, and are more robust. Therefore, in this paper the most commonly used feature parameters, the Mel cepstral coefficients (MFCC), together with vector quantization (VQ), are selected to realize speaker recognition. Next, we first extract the Mel cepstral coefficients and study the specific process of speaker recognition.

1. Extract the Mel cepstral coefficient (MFCC)

The Mel cepstral coefficient (MFCC) is the most commonly used parameter in speaker recognition algorithms; it reflects the nonlinear correspondence between the Mel frequency and the Hertz frequency, and its analysis conforms to the characteristics of human hearing. The human ear has some remarkable abilities: it can distinguish various sounds in noisy environments and in many abnormal situations, and the cochlea plays the key role. The cochlea is essentially equivalent to a filter bank whose filtering is carried out on a roughly logarithmic frequency scale: below about 1000 Hz the scale is approximately linear, and above 1000 Hz it is logarithmic, which makes the human ear much less sensitive to frequency differences at high frequencies. Based on this principle, a group of filters imitating the function of the human cochlea, the Mel-frequency filters, has been developed. MFCC is a speech feature extracted with Fourier analysis on this perceptual scale, whose mapping to the physical frequency is approximately logarithmic. The relationship between the Mel frequency and the actual frequency is:
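The standard Mel-scale mapping is commonly written as:

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$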

In the formula, Mel(f) is the perceived frequency (in the Mel domain) and f is the actual frequency in Hz. Transforming the spectrum of the speech signal into this perceptual domain better simulates the processing carried out by the auditory system.

The detailed extraction process of MFCC parameters is shown in the figure:

 1.1 Signal preprocessing

(1) Digitization of the speech signal. The speech signal is an analog signal and cannot be analyzed digitally as it is, so the analog signal must first be converted into a digital one; this process is called analog-to-digital conversion. After sampling and quantization, the input speech becomes a digital signal that is discrete in both time and amplitude. Consider sampling first. The Nyquist sampling theorem states that the sampling frequency must be at least twice the highest frequency contained in the original speech in order not to lose information. The frequency range of normal human speech is roughly 40 Hz to 3400 Hz, so the sampling frequency for speech processing should be greater than about 8000 Hz. To improve the recognition rate, a sampling frequency of 44.1 kHz proved appropriate in our tests, so 44.1 kHz is used in the following simulation experiments.

(2) Silence (zero) removal

Zero removal means removing the silent part of the recording and finding the genuinely useful content of the speech. Removing the silent part makes subsequent processing more convenient and faster, and prevents situations where an all-zero input ends up in a denominator.
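A crude single-threshold sketch of this step is shown below; the double-threshold endpoint detection mentioned in the abstract additionally uses the zero-crossing rate, and the threshold here is purely illustrative.

```matlab
% Energy-based silence trimming (illustrative threshold, not the project's exact method).
% x is a column vector of speech samples, fs the sampling frequency.
frameLen   = round(0.025 * fs);              % 25 ms analysis frames
frameShift = round(0.010 * fs);              % 10 ms hop
nFrames = floor((length(x) - frameLen) / frameShift) + 1;
keep = false(size(x));
for i = 1:nFrames
    idx = (i-1)*frameShift + (1:frameLen);
    if sum(x(idx).^2) > 1e-4 * frameLen * max(abs(x))^2   % crude energy threshold
        keep(idx) = true;
    end
end
x = x(keep);                                  % speech with the silent frames removed
```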

(3) Pre-emphasis  

Pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is degraded considerably during transmission; to obtain a good waveform at the receiving end, the damaged signal must be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the start of the transmission line, to compensate for the excessive attenuation of high frequencies during transmission. Pre-emphasis has no effect on the noise, so it effectively improves the output signal-to-noise ratio. The larger the signal-to-noise ratio, the less noise is mixed into the signal and the higher the quality of the reproduced sound, and vice versa.

We apply pre-emphasis here mainly to remove the effects of the vocal cords and lips during vocalization and to compensate the high-frequency part of the speech signal that is suppressed by the articulatory system. It also emphasizes the high-frequency formants and flattens the spectrum of the signal, which facilitates spectrum analysis. We choose a pre-emphasis digital filter with a +6 dB/octave frequency characteristic, whose z-domain transfer function is:
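With pre-emphasis coefficient $\mu$, this first-order filter is:

$$H(z) = 1 - \mu z^{-1}$$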

where the value of μ is close to 1, typically between 0.94 and 0.97.
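In MATLAB this filter is a single call to filter; μ = 0.97 here is an assumed typical value:

```matlab
% Pre-emphasis: y(n) = x(n) - mu * x(n-1).
mu = 0.97;                                    % typical value in the 0.94-0.97 range
y  = filter([1, -mu], 1, x);                  % x is the silence-trimmed speech vector
```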

(4) Framing

The Fourier transform requires the input signal to be stationary. The speech signal is not stationary on a macroscopic scale, but it is approximately stationary on a short time scale (it can be considered roughly unchanged within 10-30 ms), so the speech signal is divided into short segments for processing; each short segment is called a frame. Windowing, applied later, attenuates the waveform near the frame edges. To avoid losing this information, adjacent frames must overlap during framing; the step between frames is called the frame shift, and it is generally 1/3 to 1/2 of the frame length, so that the information weakened at the edge of one frame is recovered in the next.
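A minimal framing sketch, using a 25 ms frame and a 10 ms shift (a shift of 0.4 of the frame length, within the 1/3 to 1/2 range mentioned above); the signal y is the pre-emphasized speech from the previous sketch:

```matlab
% Split the pre-emphasized signal y into overlapping frames (one frame per column).
frameLen   = round(0.025 * fs);               % 25 ms frame length
frameShift = round(0.010 * fs);               % 10 ms frame shift
nFrames = floor((length(y) - frameLen) / frameShift) + 1;
frames  = zeros(frameLen, nFrames);
for i = 1:nFrames
    frames(:, i) = y((i-1)*frameShift + (1:frameLen));
end
```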

(5) Windowing

Windowing means multiplying each frame by a window function before the Fourier transform is applied. The purposes of windowing are: first, to make the signal more continuous overall and avoid the Gibbs effect; second, to let the originally non-periodic speech signal exhibit some of the characteristics of a periodic function; third, to improve the continuity at the left and right ends of the frame. The cost of windowing is that the two ends of a frame are attenuated, which is why overlapping frames are used. In speech signal processing a commonly used window function is the Hanning window, which effectively suppresses spectral leakage and is very widely used.

Hanning window expression:
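For a frame of N samples, the symmetric Hanning window is:

$$w(n) = 0.5\left[1 - \cos\!\left(\frac{2\pi n}{N-1}\right)\right], \qquad 0 \le n \le N-1$$

In MATLAB the window can be generated with hann (Signal Processing Toolbox) and applied to the frames from the previous sketch; the variable names are carried over from that sketch:

```matlab
% Window every frame (implicit expansion, R2016b or later).
win = hann(frameLen);            % column vector of length frameLen
framesWin = frames .* win;       % frameLen-by-nFrames matrix of windowed frames
```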

1.2. Signal Fast Fourier Transform

Since it is usually difficult to see the characteristics of the signal from its time-domain waveform, the signal is converted into an energy distribution in the frequency domain for observation; different energy distributions represent different speech characteristics. Therefore, after multiplication by the Hanning window, each frame undergoes a fast Fourier transform to obtain its spectrum, and the squared magnitude of the spectrum gives the power spectrum of the speech signal.
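A minimal sketch of this step, continuing with framesWin from the windowing sketch:

```matlab
% FFT of every windowed frame, then the power spectrum (squared magnitude).
nfft    = 2^nextpow2(frameLen);              % FFT length
spec    = fft(framesWin, nfft);              % column-wise FFT
powSpec = abs(spec(1:nfft/2 + 1, :)).^2;     % keep only the non-negative frequencies
```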

1.3. Frequency response weighting by Mel filter bank

The magnitude spectrum obtained from the Fourier transform is weighted by the frequency responses of a set of Mel-scale triangular filters. The center frequencies of the Mel filter bank are evenly spaced on the Mel-frequency axis, and the two base points of each triangular filter lie at the center frequencies of its neighboring filters. The center frequencies and bandwidths of these filters are roughly consistent with the critical-band filters of the auditory system. Here we design a filter bank with 100 triangular filters; the figure shows the triangular filters.
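A sketch of one way to build such a filter bank is shown below. It is a generic construction rather than the project's own code; numFilters = 100 simply follows the number quoted above, and fs, nfft, and powSpec come from the earlier sketches.

```matlab
% Triangular Mel filter bank: numFilters filters spanning 0 .. fs/2.
numFilters = 100;
mel  = @(f) 2595 * log10(1 + f/700);         % Hz -> Mel
imel = @(m) 700 * (10.^(m/2595) - 1);        % Mel -> Hz
melPts = linspace(mel(0), mel(fs/2), numFilters + 2);
hzPts  = imel(melPts);                       % filter edge/center frequencies in Hz
freqs  = (0:nfft/2) * fs / nfft;             % frequency of each FFT bin
H = zeros(numFilters, nfft/2 + 1);
for m = 1:numFilters
    fL = hzPts(m);  fC = hzPts(m+1);  fR = hzPts(m+2);
    rising  = (freqs - fL) / (fC - fL);      % rising edge of the triangle
    falling = (fR - freqs) / (fR - fC);      % falling edge of the triangle
    H(m, :) = max(0, min(rising, falling));
end
melEnergy = H * powSpec;                     % filter-bank energies, numFilters-by-nFrames
```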

1.4. Discrete Cosine Transform (DCT)

The energies output by the filter bank are difficult to process directly in a linear way, so a homomorphic (cepstral) transformation is applied: a logarithm is taken of each filter output, turning multiplied components into a sum and forming a log-energy vector. To remove the correlation between the dimensions of this vector, a discrete cosine transform is then applied, and the result is the feature parameter we want to extract, namely the Mel cepstral coefficients.
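A compact sketch of the log compression and DCT, continuing from melEnergy above; numCeps = 13 is an assumed, commonly used value, since the post does not state the feature dimension:

```matlab
% Log filter-bank energies followed by a DCT give the MFCC vector of each frame.
numCeps = 13;                                % number of cepstral coefficients kept (assumed)
logMel  = log(melEnergy + eps);              % eps avoids log(0)
cepstra = dct(logMel);                       % DCT along each column (Signal Processing Toolbox)
mfccs   = cepstra(1:numCeps, :);             % one numCeps-dimensional feature vector per frame
```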

2. VQ model matching

Vector quantization (VQ) matching

In the vector quantization (VQ) matching process, the voice of each speaker is regarded as a signal source: feature parameter vectors are extracted from each speaker's training speech and clustered with the classical LBG algorithm to generate a codebook. Matching the codebook obtained from the test speech against these training codebooks then identifies the speaker.

Our training set is a vector source containing M elements (M being the number of sampling points of the speech signal).

Where: M=recording time×sampling frequency=2×44100=88200

The training set is written as $\mathcal{T} = \{x_1, x_2, \ldots, x_M\}$.

Each source vector is N-dimensional: $x_m = (x_{m,1}, x_{m,2}, \ldots, x_{m,N})$, $m = 1, 2, \ldots, M$.

Assuming that the number of code vectors is K (set to an empirical value of 16), the codebook is expressed as $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$.

Each code vector is also an N-dimensional vector: $c_k = (c_{k,1}, c_{k,2}, \ldots, c_{k,N})$, $k = 1, 2, \ldots, K$.

The code vector associated with the encoding region $S_n$ is the centroid of that region:

$$c_n = \frac{1}{|S_n|} \sum_{x_m \in S_n} x_m$$

That is, the code vector $c_n$ is required to be the average of all training sample vectors in the encoding region $S_n$. In the implementation it is necessary to ensure that every encoding region contains at least one training sample vector, so that the denominator of the above formula is not 0.

3. VQ workflow

During the operation of the program there are two main parts: training and recognition. The training part extracts the voice features of each speaker and trains them with the LBG algorithm. The idea of the LBG algorithm is splitting: the number of code vectors is repeatedly doubled until the codebook reaches the preset size and its distortion falls below a preset threshold. The final result is equivalent to compressing the training data: the codebook has dimensions K×N. The recognition part extracts features from the test voice with the same feature extraction method and generates a codebook; this codebook is then matched against the previously trained codebooks, the Euclidean distances are computed with the dist function, and an N×P distance matrix is generated. This matrix is used by the rest of the program to produce the final recognition result.
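As a rough illustration of these two parts, the sketches below show a minimal LBG training routine and a matching loop. They are simplified stand-ins rather than the project's actual code: the function name lbgTrain, the variables speakerCodebooks and testMfcc, and the use of squared Euclidean distance in place of the dist function are all assumptions for this example.

```matlab
function codebook = lbgTrain(features, K, epsilon)
% Minimal LBG sketch (hypothetical helper, not the original project code).
% features : D-by-T matrix, one MFCC vector per column
% K        : desired codebook size, a power of 2 (e.g. 16)
% epsilon  : splitting perturbation, e.g. 0.01
if nargin < 3, epsilon = 0.01; end
codebook = mean(features, 2);                       % global centroid as the first code vector
while size(codebook, 2) < K
    % Splitting step: every code vector becomes two slightly perturbed copies.
    codebook = [codebook*(1+epsilon), codebook*(1-epsilon)];
    % A few Lloyd iterations refine the enlarged codebook.
    for iter = 1:10
        d = zeros(size(codebook, 2), size(features, 2));
        for k = 1:size(codebook, 2)                 % squared Euclidean distances
            d(k, :) = sum((features - codebook(:, k)).^2, 1);
        end
        [~, nearest] = min(d, [], 1);               % nearest code vector for every frame
        for k = 1:size(codebook, 2)
            members = features(:, nearest == k);
            if ~isempty(members)                    % keep the old centre if a cell is empty
                codebook(:, k) = mean(members, 2);
            end
        end
    end
end
end
```

Matching can then be a simple loop over the stored codebooks, picking the one with the smallest average distortion:

```matlab
% Matching sketch: speakerCodebooks is assumed to be a cell array filled during training,
% and testMfcc the MFCC matrix of the test recording.
bestDist = inf; bestId = 0;
for s = 1:numel(speakerCodebooks)
    cb = speakerCodebooks{s};
    d = zeros(size(cb, 2), size(testMfcc, 2));
    for k = 1:size(cb, 2)
        d(k, :) = sum((testMfcc - cb(:, k)).^2, 1);
    end
    avgDist = mean(min(d, [], 1));                  % average nearest-code-vector distortion
    if avgDist < bestDist
        bestDist = avgDist;  bestId = s;
    end
end
fprintf('The speaker to be recognized matches trainer No. %d\n', bestId);
```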

5. System simulation

1. Training

(1) Select the number: number the trainer to facilitate the storage and recognition of voice information;

(2) Start recording: call the audiorecorder function to record for 3 seconds, and save the recording to a file with audiowrite (see the recording sketch after this list);

(3) Training codebook: After the training recording is over, a codebook is generated from the recording of the trainee.
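A minimal recording sketch for step (2) might look like the following; the variable speakerId and the file-name pattern are illustrative, not taken from the project.

```matlab
% Record one trainer for 3 s at 44.1 kHz, 16-bit, mono, then save to disk.
fs  = 44100;
rec = audiorecorder(fs, 16, 1);
disp('Start speaking...');
recordblocking(rec, 3);                       % record for 3 seconds
disp('Recording finished.');
x = getaudiodata(rec);                        % column vector of samples
audiowrite(sprintf('train_%02d.wav', speakerId), x, fs);   % speakerId: number chosen in step (1)
```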

2. Identification module:

(1) Test recording: the person to be recognized speaks freely; the recording time is set to 3 seconds, and endpoint detection is carried out at the same time;

(2) Voice playback: play the voice to be recognized;

(3) Feature extraction: extract MFCC parameters from the voice of the recognizer, and generate a recognition codebook;

(4) Recognition result: match the codebook of the recognizer with the training codebook, and display the result "the speaker to be recognized matches the Nth trainer".

6. Simulation results

We choose one person to conduct the speaker recognition simulation experiment and label this participant No. 3. The simulation experiment consists of two steps: training and recognition.

  (1) Training phase: input three-second voice signal

 Finally, after the FFT, the Mel filter bank, and the DCT, the Mel cepstral coefficients are obtained. The resulting feature vectors are vector-quantized with the LBG algorithm to obtain the quantized codebook.

 The test recording is processed in the same way as the training recordings, and the codebook of the test speech segment is obtained through the same steps. It is compared with all the codebooks obtained from training, and the one with the shortest Euclidean distance indicates the speaker being sought. As shown in the figure below, the experiment verifies that the participant's voice is correctly identified.

Source code project file sharing:

Link: https://pan.baidu.com/s/1OaAieuwp2fUWj-shx-9u0g 
Extraction code: duoj
