Personal summary of the VAD process

-1. Preface

This was a bit of a rush job: I needed audio segmentation, went looking for a ready-made method, and ended up studying the WebRTC VAD. I have only been working with it for two days, so this is just a record of what I learned; corrections from more experienced readers are welcome.


0. The overall process and concept:

        A VAD system usually consists of two parts: feature extraction and a speech/non-speech decision (endpoint detection).

        Noise: the background sound is called noise; it can come from the external environment or from the equipment itself. In actual use, a long period of complete silence feels very unnatural to the user, so the receiving end often produces some packets during silent periods, generating background noise that makes the user feel more comfortable: the so-called comfort noise.

        Silence: the energy value stays at a low level for several consecutive frames. Ideally the energy during silence would be 0, but in practice it never is, because there is always some background sound with a baseline energy.

        Endpoint: the point where the signal changes between silence and effective speech. In practical applications, for example during a telephone call, no voice packets are sent while the user is not speaking, which further reduces the voice bit rate: when the energy of the user's voice signal falls below a certain threshold it is regarded as silence and no voice packets are sent; when active sound is detected again, voice packets are generated and transmitted. This technique can save more than 50% of the bandwidth. By the same token, in actual testing we also need to consider discontinuous speech (stuttering, hesitation, and similar disfluencies) and its effect on recognition accuracy, to avoid abnormal or unreasonable behavior in endpoint detection. A minimal sketch of this silence-suppression idea follows.
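
        The sketch below illustrates the idea in Python, assuming frames of 16-bit PCM samples held in NumPy arrays; the energy threshold and hangover length are made-up values, not taken from any real codec. Frames whose energy stays below the threshold are treated as silence and not transmitted, while a short hangover keeps transmitting for a few frames after speech ends so that hesitations are not cut off abruptly.

```python
import numpy as np

def frame_energy_db(frame):
    """Log energy of one frame of 16-bit PCM samples."""
    return 10 * np.log10(np.mean(frame.astype(np.float64) ** 2) + 1e-12)

def frames_to_send(frames, threshold_db=40.0, hangover=8):
    """Yield only the frames that would be transmitted (active speech + hangover)."""
    count = 0
    for frame in frames:
        if frame_energy_db(frame) > threshold_db:
            count = hangover          # speech detected: reset the hangover counter
        if count > 0:
            count -= 1
            yield frame               # silent frames after the hangover are dropped
```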

1. Feature extraction:

        Commonly used features fall into five categories: energy-based features (which can be implemented in hardware), frequency-domain features, cepstral features, harmonic features, and long-term information.

        Energy-based: the criterion is the strength of the signal, under the assumption that the energy of speech is greater than that of the background noise, so that when the energy exceeds a certain threshold the frame can be considered speech. However, when the noise is as loud as the speech, the energy feature cannot separate speech from pure noise, so it only works well at high signal-to-noise ratios. When the signal-to-noise ratio approaches 0 dB, features based on speech harmonics and long-term speech characteristics are more robust. Early energy-based methods split wide-band speech into sub-bands and computed the energy per sub-band, because speech concentrates a lot of its energy in the band below 2 kHz, while noise in the 2–4 kHz band, or above 4 kHz, tends to have higher energy than in the 0–2 kHz band. This is essentially the concept of spectral flatness, which is used in WebRTC. When the signal-to-noise ratio drops below 10 dB, the ability to distinguish speech from noise degrades rapidly.
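
        As a rough illustration of the sub-band idea, here is a minimal sketch assuming NumPy and 16 kHz audio; the single 2 kHz split and the ratio threshold are illustrative choices, not the actual WebRTC sub-band scheme or its thresholds.

```python
import numpy as np

def subband_energy_ratio(frame, fs=16000):
    """Ratio of spectral energy below 2 kHz to energy at or above 2 kHz."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    low = spec[freqs < 2000].sum()             # speech concentrates its energy here
    high = spec[freqs >= 2000].sum() + 1e-12   # noise tends to dominate up here
    return low / high

def is_speech_energy(frame, ratio_threshold=3.0):
    """Flag a frame as speech when the low band clearly dominates."""
    return subband_energy_ratio(frame) > ratio_threshold
```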

        Frequency-domain features: the time-domain signal is converted to the frequency domain by a short-time Fourier transform. Even when the signal-to-noise ratio is close to 0 dB, the long-term envelopes of some frequency bands can still distinguish speech from noise.
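
        A minimal NumPy sketch of this, assuming 16 kHz audio; the window length, hop size, and the use of a running maximum as the "long-term envelope" are illustrative assumptions.

```python
import numpy as np

def stft_mag(x, frame_len=400, hop=160):
    """Magnitude STFT with a Hann window (25 ms frames, 10 ms hop at 16 kHz)."""
    win = np.hanning(frame_len)
    n = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1))        # shape: (n_frames, n_bins)

def long_term_envelope(mag, n_frames=30):
    """Per-band running maximum over roughly the last 0.3 s of frames."""
    env = np.zeros_like(mag)
    for t in range(mag.shape[0]):
        env[t] = mag[max(0, t - n_frames + 1):t + 1].max(axis=0)
    return env
```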

        Cepstral features: for VAD, the peak of the energy cepstrum determines the pitch (fundamental frequency) of the speech signal; MFCCs are also used as features.
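
        A sketch of the cepstral-peak idea, assuming NumPy, 16 kHz audio and frames of at least about 30 ms (so the 50 Hz pitch period fits); the 50–400 Hz pitch range is an illustrative assumption.

```python
import numpy as np

def cepstral_pitch_peak(frame, fs=16000):
    """Return (peak height, pitch estimate in Hz) from the real cepstrum."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    cep = np.fft.irfft(np.log(spec + 1e-12))            # real cepstrum
    qmin, qmax = int(fs / 400), int(fs / 50)             # quefrencies for 50-400 Hz pitch
    peak = np.argmax(cep[qmin:qmax]) + qmin
    return cep[peak], fs / peak
```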

        Harmonic features: an obvious characteristic of speech is that it contains the fundamental frequency F0 and harmonics at integer multiples of it. Even in strong noise, this harmonic structure is still present. The fundamental frequency can be found using the autocorrelation method.
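
        A minimal autocorrelation-based F0 estimator, assuming NumPy, 16 kHz audio and frames longer than the longest expected pitch period; the 50–400 Hz search range is an illustrative assumption.

```python
import numpy as np

def autocorr_f0(frame, fs=16000, fmin=50, fmax=400):
    """Estimate F0 by picking the strongest autocorrelation peak in the pitch range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)              # candidate lag range
    lag = np.argmax(ac[lo:hi]) + lo
    return fs / lag, ac[lag] / (ac[0] + 1e-12)           # F0 and periodicity strength
```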

        Long-term features: speech is a non-stationary signal. At a normal speaking rate, roughly 10–15 phonemes are produced per second, and the spectral distributions of different phonemes differ, so the statistical characteristics of speech change over time. Most everyday noise, by contrast, is stationary (or changes relatively slowly), such as white noise or machine noise.
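
        One simple way to turn this observation into a feature is to measure how much the spectrum varies over a window of frames. The sketch below assumes a magnitude STFT as input (for example the stft_mag sketch above); the window length and the use of the log-magnitude are illustrative assumptions.

```python
import numpy as np

def long_term_spectral_variability(mag, n_frames=30):
    """Average per-band variance of the log-spectrum over a sliding window.

    `mag` is a magnitude STFT of shape (n_frames_total, n_bins). Speech, being
    non-stationary, yields larger values than steady noise such as white or
    machine noise.
    """
    log_mag = np.log(mag + 1e-12)
    out = np.zeros(mag.shape[0])
    for t in range(mag.shape[0]):
        seg = log_mag[max(0, t - n_frames + 1):t + 1]
        out[t] = seg.var(axis=0).mean()
    return out
```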

2. The concept of endpoint detection

        Prerequisite: a speech signal is generally assumed to be quasi-stationary over intervals of 10 ms to 30 ms, so analysis is carried out on frames of this length.

        Endpoint detection, also called voice activity detection (VAD), aims to distinguish speech regions from non-speech regions. In layman's terms, endpoint detection means accurately locating the start point and end point of the speech within a noisy signal, removing the silent parts and the noise parts, and keeping the truly meaningful speech content.

        Using a speech recognition system in a noisy environment, or changes in the speaker's emotional or psychological state, can distort pronunciation and change speaking rate and pitch, producing the Lombard/Loud effect. Studies have shown that even in a quiet environment, more than half of the recognition errors of a speech recognition system come from the endpoint detector.

3. Classification of endpoint detection

        VAD algorithms can be roughly divided into three categories: threshold-based VAD, VAD as a classifier, and model VAD.

        Threshold-based VAD: features are extracted in the time domain (short-term energy, short-term zero-crossing rate, etc.) or the frequency domain (MFCC, spectral entropy, etc.), and a suitably set threshold is used to separate speech from non-speech. This is the traditional VAD method.
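
        A minimal threshold-based VAD sketch combining short-term energy and short-term zero-crossing rate, assuming NumPy; the two thresholds and the way they are combined are illustrative assumptions and would normally be tuned or estimated from the noise floor.

```python
import numpy as np

def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return np.mean(frame.astype(np.float64) ** 2)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

def threshold_vad(frames, energy_thr=1e-3, zcr_thr=0.1):
    """One speech/non-speech flag per frame: loud frames or high-ZCR (unvoiced) frames."""
    return [
        short_term_energy(f) > energy_thr or zero_crossing_rate(f) > zcr_thr
        for f in frames
    ]
```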

        VAD as a classifier: speech detection can be treated as a two-class (speech/non-speech) classification problem, and a classifier can be trained with machine learning to detect speech. WebRTC takes this approach, using a Gaussian mixture model.
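
        The GMM-based classifier mentioned above ships with WebRTC, and if the py-webrtcvad Python binding is installed (pip install webrtcvad) it can be called directly. Frames must be 16-bit mono PCM of 10, 20 or 30 ms at 8, 16, 32 or 48 kHz; the frame of zeros below is just a placeholder input.

```python
import webrtcvad

vad = webrtcvad.Vad(2)                 # aggressiveness mode 0 (least) .. 3 (most)
sample_rate = 16000
frame_bytes = b"\x00\x00" * 480        # one 30 ms frame (480 samples) of silence
print(vad.is_speech(frame_bytes, sample_rate))
```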

        Model-based VAD: a complete acoustic model can be used (the granularity of the modeling unit can be very coarse) to distinguish speech segments from non-speech segments during decoding, using global information. I have seen a usable LSTM implementation; this is the deep-learning approach.
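
        A minimal sketch of a model-based (LSTM) VAD in PyTorch, in the spirit of the LSTM demo linked at the end of this post; the input features (e.g. 40 log-mel bands per frame), layer sizes, and training procedure are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LstmVad(nn.Module):
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, frames, n_features)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.fc(out))       # per-frame speech probability

if __name__ == "__main__":
    model = LstmVad()
    feats = torch.randn(1, 100, 40)              # 100 frames of dummy features
    print(model(feats).shape)                    # -> torch.Size([1, 100, 1])
```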

        As the front end of the whole pipeline, VAD needs to run locally and in real time. Because computing resources are very limited, VAD generally uses one of the threshold-based algorithms; an engineering-optimized classifier approach may also be used; model-based VAD is currently difficult to deploy and run locally.

 

Review articles

https://blog.csdn.net/alice_tl/article/details/97433737?spm=1000.2123.3001.4430

https://www.cnblogs.com/dylancao/p/7663755.html

The following articles are detailed explanations of WebRTC's VAD process, which is based on a Gaussian mixture model. They can be read alongside the code. I only skimmed them, since I don't plan to switch to speech recognition, haha.

The above is written with reference to the following articles.

This one makes the parameter settings and function call flow clearest: https://www.cnblogs.com/damizhou/p/11318668.html

A more detailed account of how the Gaussian mixture model weights are updated: https://blog.csdn.net/book_bbyuan/article/details/78944630?depth_1-

A demo built with an LSTM neural network:

Endpoint detection with a PyTorch neural network: https://github.com/Ryuk17/SpeechAlgorithms


Original post: blog.csdn.net/gbz3300255/article/details/108973453