VAD for speech recognition: silence detection

1. Introduction

         Silence detection matters a great deal for speech recognition. As the name implies, it determines whether the audio is currently silent or active, so that a complete sentence of voice data is sent to the speech recognition model and some noise interference is filtered out. As shown in the figure below, this raises two questions: how long must silence last before it counts as the end of speech, and how much energy, sustained for how long, counts as the start of speech?

 

2. Introduction to the algorithm

         2.1 Voice activation status detection

               In practice, the audio recorded by a microphone always contains some noise, such as impacts or knocking sounds. The noise that most needs to be rejected is short and abrupt, because it can otherwise be detected as a false activation. Noise also varies with time and place, so the detector needs a step that samples and estimates the current noise.

               Speech is approximately stationary over short intervals, typically 10 to 30 ms, so its characteristics barely change within that span. The input audio is therefore split into frames. Frames usually overlap as well (frame shifting): for example, each processing window covers 30 ms but advances by only 20 ms, so the next window starts with the last 10 ms of the previous one. This keeps consecutive results from jumping too much.

               |<---10ms--->|<---10ms--->|<---10ms--->|<---10ms--->|<---10ms--->|
               |<----- First processing (30 ms) ----->|
                                         |<---- Second processing (30 ms) ----->|
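The overlapping windows in the diagram can be expressed as simple index arithmetic. The following is a minimal sketch, not the original author's code: the helper names `MsToSamples`, `CountFrames`, and `FrameStart` are hypothetical, and the 16 kHz sample rate in the usage note is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert a duration in milliseconds to samples. */
size_t MsToSamples(size_t ms, size_t sample_rate_hz)
{
    return ms * sample_rate_hz / 1000;
}

/* Hypothetical helper: number of complete windows of frame_len samples
 * that fit in total samples when the window advances by hop samples. */
size_t CountFrames(size_t total, size_t frame_len, size_t hop)
{
    assert(frame_len > 0 && hop > 0);
    if (total < frame_len) {
        return 0; /* not enough audio for even one window */
    }
    return 1 + (total - frame_len) / hop;
}

/* Hypothetical helper: offset (in samples) where window number index begins. */
size_t FrameStart(size_t index, size_t hop)
{
    return index * hop;
}
```

At an assumed 16 kHz, a 30 ms window is 480 samples and a 20 ms step is 320 samples, so each window shares its first 160 samples (10 ms) with the end of the previous one.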

            The smallest unit of processing is one frame, as in the diagram above. For ordinary cases, a 10 ms frame with no overlap is enough. First estimate the energy of the background noise. The energy of a frame can be taken as the mean of its squared samples, and the average over the first n frames can serve as the background noise energy. Give the noise energy a minimum value so that a very quiet environment does not cause false triggers. Comparing the current frame energy against the noise energy is not sufficient by itself: you also need to track how the signal changes, that is, whether it is oscillating and how long that state lasts. Only when these conditions hold together is the state judged active. Common methods are zero-crossing detection and threshold control.

The routine is as follows:

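// Energy calculation: mean of the squared samples in one frame
#include <stddef.h>

float CalculateEnergy(const float *frames, size_t frames_count)
{
    if (frames_count == 0) {
        return 0.0f; // avoid dividing by zero on an empty frame
    }

    float energy = 0.0f;
    for (size_t i = 0; i < frames_count; i++) {
        energy += frames[i] * frames[i];
    }

    return energy / (float)frames_count;
}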

// Zero-crossing detection: count sign changes within one frame
short CalculateZeroCross(const short *frame, size_t frames_count)
{
    short zero_cross_cnt = 0;
    short cur_status = 0;  // sign of the current sample
    short last_status = 0; // sign of the previous sample (0 = none yet)

    for (size_t i = 0; i < frames_count; i++) { // check every sample in the frame
        cur_status = (frame[i] > 0) ? 1 : -1; // a zero sample counts as negative

        if ((last_status != 0) && (cur_status != last_status)) {
            zero_cross_cnt++;
        }
        last_status = cur_status;
    }

    return zero_cross_cnt; // number of zero crossings
}
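The background-noise estimate described earlier (average the energy of the first n frames, then apply a minimum value) might look like the sketch below. It is only an illustration: `FrameEnergy` and `EstimateNoiseEnergy` are hypothetical names, and the choice of n and of the minimum energy is left to the caller because the text does not fix them.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of the squared samples in one frame. */
float FrameEnergy(const float *samples, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++) {
        sum += samples[i] * samples[i];
    }
    return count > 0 ? sum / (float)count : 0.0f;
}

/* Hypothetical helper: average the energy of the first noise_frames frames
 * (non-overlapping, frame_len samples each) and clamp the result to
 * min_energy so a very quiet room does not yield a near-zero threshold. */
float EstimateNoiseEnergy(const float *samples, size_t frame_len,
                          size_t noise_frames, float min_energy)
{
    assert(frame_len > 0 && noise_frames > 0);
    float sum = 0.0f;
    for (size_t f = 0; f < noise_frames; f++) {
        sum += FrameEnergy(samples + f * frame_len, frame_len);
    }
    float noise = sum / (float)noise_frames;
    return noise < min_energy ? min_energy : noise;
}
```

The caller must supply at least `frame_len * noise_frames` samples, ideally captured before anyone starts speaking.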

After each frame that is judged active, record the current time or frame number; it is needed later to decide when speech has ended.
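One way to combine the three conditions (energy above the noise level, an oscillating signal, and a minimum duration) while recording the last active frame is sketched below. Everything here is an assumption made for illustration: the struct and function names are hypothetical, and the specific rule (energy above a multiple of the noise energy, at least a minimum number of zero crossings, held for several consecutive frames) is one plausible reading of the description, not the author's stated thresholds.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative tuning values; none of them come from the original text. */
typedef struct {
    float energy_ratio;    /* frame energy must exceed noise * ratio */
    short min_zero_cross;  /* minimum zero crossings per frame */
    size_t min_run_frames; /* consecutive candidate frames required */
} StartParams;

typedef struct {
    size_t run_frames;        /* current run of candidate frames */
    int active;               /* 1 once the start of speech is declared */
    size_t last_active_frame; /* most recent active frame index */
} StartState;

/* Feed the features of one frame. Returns 1 once speech has started.
 * This function never clears the active flag; ending is decided separately. */
int UpdateStart(StartState *st, const StartParams *p, size_t frame_index,
                float energy, float noise_energy, short zero_cross)
{
    assert(st != NULL && p != NULL);
    int candidate = energy > noise_energy * p->energy_ratio &&
                    zero_cross >= p->min_zero_cross;

    st->run_frames = candidate ? st->run_frames + 1 : 0;

    /* Short bursts never reach min_run_frames, so they are ignored. */
    if (candidate && (st->active || st->run_frames >= p->min_run_frames)) {
        st->active = 1;
        st->last_active_frame = frame_index; /* kept for end detection */
    }
    return st->active;
}
```

Requiring several consecutive candidate frames is what rejects a single knock or click, at the cost of declaring the start a few frames late.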

      2.2 Voice end state detection

            Detecting the end of speech is simpler: once the difference between the current time and the last active time exceeds a threshold, the sentence is considered finished.
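Counted in frames, the end-of-speech test reduces to a single comparison. The sketch below assumes frame indices are used as the clock; `IsSpeechEnded` and `max_silence_frames` are hypothetical names, and the right threshold depends on the application.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 when more than max_silence_frames frames
 * have passed since the last active frame, meaning the sentence has ended. */
int IsSpeechEnded(size_t current_frame, size_t last_active_frame,
                  size_t max_silence_frames)
{
    assert(current_frame >= last_active_frame);
    return (current_frame - last_active_frame) > max_silence_frames;
}
```

With 10 ms frames, for example, a threshold of 50 frames would correspond to half a second of silence; that figure is only an example, not a recommendation from the text.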

3. Summary

       The VAD methods above suit applications with modest requirements, and they generally handle noise reasonably well. For more demanding VAD, time-domain features alone are not enough; you also need to examine the individual bands in the frequency domain. For some of these features, see WebRTC's VAD.

             

 

 

 


Origin blog.csdn.net/tulongyongshi/article/details/105875829