[Experiment] Speech Recognition

A summary and review of a school digital signal processing (DSP) experiment.

Speech Recognition

The topic and related requirements are here.

Data Preprocessing

General steps:

  • Acquire the raw audio

  • Endpoint detection

  • Framing

  • Windowing

  • Feature extraction

Endpoint Detection

Endpoint detection parameters (relative values):

Parameter                                             Relative value
Initial short-term energy high threshold              50
Initial short-term energy low threshold               10
Initial short-term zero-crossing rate high threshold  10
Initial short-term zero-crossing rate low threshold   2
Maximum silence length                                8 ms
Minimum speech length                                 20 ms

Here we perform threshold-based VAD (voice activity detection): features are extracted in the time domain (short-term energy, short-term zero-crossing rate, etc.) or the frequency domain (MFCC, spectrum, etc.), and thresholds are set appropriately to distinguish speech from non-speech.

The initial feature thresholds for our endpoint detection are shown in the table above. With these settings, the detection filters out most of the noise in the audio.
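As an illustration, here is a minimal Python sketch of the double-threshold idea, assuming the audio has already been framed into a 2-D array (one row per frame). The threshold values mirror the table above; the maximum-silence and minimum-speech-length checks are omitted for brevity, and the experiment's actual implementation differs in detail.

```python
import numpy as np

def endpoint_detect(frames, e_high=50.0, e_low=10.0, z_low=2.0):
    """Double-threshold VAD sketch: frames is (n_frames, frame_len)."""
    frames = frames.astype(float)
    energy = np.sum(frames ** 2, axis=1)                                 # short-term energy
    zcr = np.sum(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero crossings

    core = np.where(energy > e_high)[0]    # frames that are definitely speech
    if core.size == 0:
        return None
    start, end = core[0], core[-1]

    # Grow the segment outward while the low energy or ZCR threshold still holds.
    while start > 0 and (energy[start - 1] > e_low or zcr[start - 1] > z_low):
        start -= 1
    while end < len(energy) - 1 and (energy[end + 1] > e_low or zcr[end + 1] > z_low):
        end += 1
    return start, end                      # first and last speech frame indices
```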

The endpoint detection results for some digit and name signals are as follows:

Framing

Since the speech signal is short-term stationary (approximately time-invariant over short intervals), it is analyzed frame by frame in the time domain.

A window length of 30 ms and a frame shift of 10 ms gave good results in the experiment.
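A minimal numpy sketch of this framing step, assuming a 1-D signal x and its sample rate fs (the names are illustrative):

```python
import numpy as np

def frame_signal(x, fs, win_ms=30, shift_ms=10):
    """Split a 1-D signal into overlapping frames (30 ms window, 10 ms shift)."""
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - win) // shift)
    return np.stack([x[i * shift : i * shift + win] for i in range(n_frames)])
```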

Windowing

The plots obtained with different window functions differ: the rectangular window changes relatively sharply, while the Hanning and Hamming windows change more smoothly.

  • Time-domain signal after rectangular window processing
  • Time-domain signal after Hamming window processing
  • Time-domain signal after Hanning window processing
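Applying the three windows compared above is then one line each; this sketch reuses frame_signal from the framing sketch (x and fs are assumed inputs):

```python
import numpy as np

frames = frame_signal(x, fs)                        # from the framing sketch above
rect_frames    = frames                             # rectangular window: unchanged
hamming_frames = frames * np.hamming(frames.shape[1])
hanning_frames = frames * np.hanning(frames.shape[1])
```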

Time Domain Speech Recognition

Extract speech features

The processed data is normalized and the feature values of each frame are computed; the number of frames determines the dimension of the feature vector used for classification. The feature vector consists of the zero-crossing rate, short-term energy, and short-term magnitude.
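A sketch of this feature construction, under the assumption that each frame contributes its zero-crossing rate, energy, and magnitude, and the concatenation is then normalized:

```python
import numpy as np

def time_domain_features(frames):
    """Per-frame ZCR, short-term energy and magnitude, concatenated
    into one feature vector (dimension = 3 x number of frames)."""
    frames = frames.astype(float)
    zcr = np.sum(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    energy = np.sum(frames ** 2, axis=1)
    magnitude = np.sum(np.abs(frames), axis=1)
    feats = np.concatenate([zcr, energy, magnitude])
    return (feats - feats.mean()) / (feats.std() + 1e-8)   # normalize
```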

Classifiers Used

  • KNN: if most of the K nearest neighbors of a sample in feature space belong to a certain class, the sample is assigned to that class and shares the characteristics of its samples.

  • SVM: the basic model is a maximum-margin linear classifier defined on the feature space; the hyperplane that separates the data set with the largest margin is found, giving the expression of the decision boundary.

  • Decision tree: a tree-structured classifier that splits the feature space step by step on feature values; a sample is classified by following the branch decisions from the root down to a leaf.

  • Ensemble learning based on KNN

    • The ensemble learning algorithm completes the learning task by constructing and combining multiple classifiers. It usually achieves higher accuracy; its drawback is a more complex model.

    • Implementation process

      • Randomly sample the features of the original data.

      • Send the sampled data to several KNN learners for decision making.

      • Combine all the learners' results by majority vote (see the sketch below).
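A possible realization of this bagged-KNN scheme with scikit-learn (1.2+ API); X_train, y_train, X_test, y_test are assumed to hold the feature vectors and labels from the step above:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Several KNN learners, each trained on a random sample of the data and
# a random subset of the features; predictions are combined by voting.
model = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=3),
    n_estimators=10,
    max_samples=0.8,     # random sampling of the training examples
    max_features=0.8,    # random sampling of the features
    bootstrap=True,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```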

Time Domain Speech Analysis

The following are the confusion matrices and ROC curves (AUC) obtained with the KNN, SVM, decision tree, and ensemble KNN classifiers.

The results are as follows:

Classifier     Accuracy  Precision  Type I Error Rate  Type II Error Rate  F-score
KNN            0.32      0.33       1.00               0.80                0.25
SVM            0.42      1.00       0.00               0.67                0.50
Decision tree  0.37      0.50       0.25               0.50                0.50
Ensemble KNN   0.44      0.60       0.67               0.25                0.67

Comparison of the effects of different windowing methods

Window type  Accuracy  Precision  Type I Error Rate  Type II Error Rate  F-score
Rectangular  0.44      0.60       0.67               0.25                0.67
Hamming      0.39      1.00       0.00               0.50                0.67
Hanning      0.45      0.67       1.00               0.60                0.50

F-score performance: rectangular window = Hamming window > Hanning window

Speech recognition performs better on signals processed with the rectangular window.


Frequency Domain Speech Recognition

Experiment Process

  • Record audio on the computer.

  • Preprocess the audio data; endpoint detection yields the clean segments to work with.

  • Extract MFCC features from the audio data.

  • Search for the best-matching template with the DTW algorithm.

  • Summarize the results.


MFCC Feature Extraction

  • Framing

  • FFT: Perform 256-point Fast Fourier Transform on each frame of speech.

  • Triangular bandpass filters (the Mel filter bank)

  • Discrete cosine transform: the DCT concentrates the energy in the first few coefficients, reducing the number of discriminant parameters and speeding up computation without losing accuracy.

  • Difference cepstrum parameters: the first-order difference of the MFCC parameters reflects the relationship between adjacent frames, so we merge the MFCC parameters with their first-order differences to form the final MFCC feature, and drop the first and last frames to improve its reliability and the predictive power of the experiment.

  • DTW: dynamic time warping aligns the test feature sequence with each template and selects the nearest one (a sketch follows this list).
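A compact Python sketch of this pipeline, using librosa for the MFCC and delta computation and a plain dynamic-programming DTW; the file names in the usage comment are hypothetical:

```python
import numpy as np
import librosa

def mfcc_features(path, sr=8000):
    y, sr = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=256)  # 256-point FFT
    d = librosa.feature.delta(m)               # first-order difference MFCC
    return np.vstack([m, d])[:, 1:-1].T        # merge, drop first/last frames

def dtw_distance(a, b):
    """DTW between two feature sequences (rows are frames)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Usage (hypothetical files): pick the template with the smallest distance.
# templates = {"zero": mfcc_features("zero.wav"), "one": mfcc_features("one.wav")}
# best = min(templates, key=lambda k: dtw_distance(mfcc_features("test.wav"), templates[k]))
```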


Frequency Domain Speech Analysis

We use cross-validation with a 9:1 ratio of training set to test set. After 1000 repeated experiments, the accuracy of digit recognition and name recognition under the different window functions is shown below; the x-axis is the trial number and the y-axis is the accuracy. A sketch of this protocol follows.
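A sketch of the repeated-split protocol, assuming feats is a list of MFCC sequences, labels holds the corresponding transcriptions, and dtw_distance is the function from the earlier sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
accs = []
for trial in range(1000):                       # 1000 repeated experiments
    idx = rng.permutation(len(feats))
    cut = int(0.9 * len(idx))                   # 9:1 train/test split
    train, test = idx[:cut], idx[cut:]
    correct = 0
    for t in test:
        # Nearest template under DTW decides the label.
        best = min(train, key=lambda k: dtw_distance(feats[t], feats[k]))
        correct += labels[best] == labels[t]
    accs.append(correct / len(test))
print(np.mean(accs))
```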

As the accuracy curves above show, the overall accuracy of digit recognition is lowest under the rectangular window, while the accuracies under the Hamming and Hanning windows differ little. The accuracy of name recognition fluctuates little across all three windows.


Related videos:

We can see that during name recognition, endpoint detection only catches the first character of the name. This is because the endpoint detection parameters had not yet been tuned.


After the second parameter adjustment:

Voiceprint Recognition Based on a GMM

GMM

The Gaussian mixture model (GMM) is one of the most commonly used models in voiceprint recognition. Summarizing voice features well and matching a test voice against training voices are complicated, hard-to-solve problems; the GMM turns them into problems of model fitting and probability computation, and solves them that way. A Gaussian mixture can approximate any continuous probability distribution, so it can be regarded as a universal approximator for continuous distributions. Training a GMM is a maximum-likelihood process: the known samples are used to infer the parameter values most likely to have produced them. Under this principle, the GMM is usually fitted with the expectation-maximization (EM) algorithm, iterating until convergence.
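As a small illustration of this EM fitting loop (not the experiment's own code), scikit-learn's GaussianMixture iterates EM until the log-likelihood bound stops improving; the random data here is just a stand-in for MFCC frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 13))   # stand-in for MFCC frames
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      max_iter=200, tol=1e-4)
gmm.fit(X)                                            # EM until convergence
print(gmm.converged_, gmm.lower_bound_)               # flag and final bound
```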

Our work

The experimental procedure is as follows:

First, the voice signal is preprocessed by framing, windowing, and so on. Once the effective speech segment is obtained, its MFCC features are extracted and used as input to train the GMM. Finally, the GMM is evaluated on the test set.

The extracted voice MFCC features are used to train one GMM per person via the fitgmdist function, yielding each person's voice model (a Python sketch follows below).
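The experiment used MATLAB's fitgmdist; here is an equivalent sketch with scikit-learn, where train_mfcc is an assumed dict mapping each speaker's name to an array of that speaker's MFCC frames:

```python
from sklearn.mixture import GaussianMixture

# One GMM per speaker, trained on that speaker's MFCC frames.
speakers = {name: GaussianMixture(n_components=16, covariance_type="diag").fit(mfcc)
            for name, mfcc in train_mfcc.items()}

def identify(test_mfcc):
    # Highest average log-likelihood across the speaker models wins.
    return max(speakers, key=lambda s: speakers[s].score(test_mfcc))
```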


Speaker Recognition Results

We fed each student's name recordings and each student's digit recordings into the GMM and the SVM for training and testing, and compared the strengths and weaknesses of the two models.

Denote the digit speech used for training by I1, the name speech used for training by I2, the digit speech used for testing by O1, and the name speech used for testing by O2. We train on I1 and I2, and in each case test the accuracy with O1 and O2 respectively. The results are shown in the figure: recognition is better when the speaker reads a name, and worse with digits. In addition, the GMM outperforms the SVM: the SVM reaches an accuracy of 1 on the training set but a low accuracy on the test set, so its generalization is poor, while the GMM generalizes better.


Closing Remarks

The corresponding code is at Copy2000/DSP_experiment: school experiment (github.com). It is quite messy, but that's how it is; the videos may load slowly because they are hosted on GitHub's servers.

There are also two small papers:


Source: blog.csdn.net/bruce__ray/article/details/131144092