Voice Activity Detection (VAD) for Speech Recognition (1)

Introduction

Voice Activity Detection (VAD) detects whether a speech signal is present. It is widely used in speech processing. In speech recognition, VAD can locate the gaps in a long recording so that the recording can be cut at those gaps into short utterances for recognition. In telephone communication, VAD can remove the signal in the gaps to reduce the space needed to store the data.

There are many VAD algorithms. A relatively simple one relies on energy characteristics, using short-time energy (STE) and the short-time zero-crossing rate (ZCC, zero cross counter; often abbreviated ZCR). Short-time energy is the energy of one frame of the speech signal, and the zero-crossing rate is the number of times the time-domain signal of one frame crosses zero. Some VAD algorithms also combine speech features from several dimensions, including energy features, frequency-domain features, cepstral features, harmonic features, and long-term information features.
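The two frame-level quantities defined above can be sketched in a few lines of plain Python. This is only an illustration on a synthetic signal: the 8 kHz sampling rate, the 25 ms frame length, and the helper names (`frames`, `short_time_energy`, `zero_crossings`) are assumptions chosen for the example, not part of any library.

```python
import math

def frames(samples, frame_len, hop_len):
    """Split a sample sequence into fixed-length frames (hop_len apart)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop_len)]

def short_time_energy(frame):
    """Energy of one frame: sum of squared amplitudes."""
    return sum(x * x for x in frame)

def zero_crossings(frame):
    """Number of sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

# Synthetic signal at an assumed 8 kHz: 0.1 s of silence, then 0.1 s of a 200 Hz tone.
sr = 8000
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
signal = silence + tone

# 25 ms frames (200 samples) without overlap.
fs = frames(signal, frame_len=200, hop_len=200)
energies = [short_time_energy(f) for f in fs]
crossings = [zero_crossings(f) for f in fs]

for i, (e, z) in enumerate(zip(energies, crossings)):
    print(f"frame {i}: energy={e:.2f}, zero crossings={z}")
```

The silent frames have zero energy and no zero crossings, while the tone frames have a clearly higher energy. An energy-based VAD exploits exactly this contrast.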

Below we implement an STE-based VAD, mainly using auditok.

Implementing VAD with auditok

  • Install
pip install auditok
  • Read the audio file

Use auditok to read the audio file and plot its waveform.

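import os
import auditok

wav_path = "example.wav"
# Load the audio file
audio = auditok.load(wav_path)
# Plot the waveform (requires matplotlib)
audio.plot()
# Reload the file, skipping the first 2 s, which contain no sound
audio = auditok.load(wav_path, skip=2)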


  • Detect speech and split the audio

auditok provides a split function that decides whether someone is speaking from the energy of the audio signal and segments the audio at the gaps in the speech. This matters for long recordings, because an ASR model usually cannot process very long audio in one pass.

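save_slice_path = "slice_wav/slice"
# Create the output directory if it does not exist yet
os.makedirs(save_slice_path, exist_ok=True)
# File name without extension, used as a prefix for the slices
post_id = os.path.splitext(os.path.basename(wav_path))[0]

# Detect the parts of the audio that contain sound and split on them
audio_slices = audio.split(
    min_dur=1,              # minimum duration (s) of a slice that contains sound
    max_dur=15,             # maximum duration (s) of a slice; longer ones are cut
    max_silence=0.3,        # maximum duration (s) of silence allowed inside a slice
    energy_threshold=55     # energy must exceed this threshold to count as sound
)
# Process each slice
for i, r in enumerate(audio_slices):
    # Print the start and end time of the slice
    print("slice wav {i}: {r.meta.start:.3f}s -- {r.meta.end:.3f}s".format(i=i, r=r))
    # Play the slice (playback requires PyAudio)
    r.play(progress_bar=True)
    # Save the slice as a wav file
    audio_name = "{}_{}.wav".format(post_id, i + 1)
    save_wav_path = os.path.join(save_slice_path, audio_name)
    filename = r.save(save_wav_path)
    print("save:{}".format(filename))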
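To make the roles of min_dur, max_dur, max_silence, and energy_threshold more concrete, here is a simplified sketch of threshold-based segmentation over per-frame energies. It is not auditok's actual implementation, and details such as the handling of trailing silence may differ. The function name `split_by_energy`, the 0.1 s frame duration, and the energy values are all made up for illustration.

```python
def split_by_energy(energies, frame_dur, threshold, min_dur, max_dur, max_silence):
    """Return (start, end) times in seconds of segments whose frames exceed threshold.

    Simplified illustration, not auditok's algorithm.
    """
    # Work in whole frames to avoid floating-point comparisons.
    min_frames = round(min_dur / frame_dur)
    max_frames = round(max_dur / frame_dur)
    max_silent = round(max_silence / frame_dur)

    segments = []
    start = None        # first frame of the open segment, or None
    last_active = None  # most recent frame above the threshold

    def close(end):
        # Keep the segment only if it is long enough; end is exclusive.
        if end - start >= min_frames:
            segments.append((start * frame_dur, end * frame_dur))

    for i, energy in enumerate(energies):
        active = energy > threshold
        if start is None:
            if active:
                start = last_active = i
            continue
        if active:
            last_active = i
        if i - last_active > max_silent:
            # Gap too long: end the segment just after the last active frame.
            close(last_active + 1)
            start = None
        elif i + 1 - start >= max_frames:
            # Segment reached max_dur: cut it at the current frame.
            close(i + 1)
            start = None
    if start is not None:
        close(last_active + 1)
    return segments

# Hypothetical per-frame energies, one value per 0.1 s frame.
LOW, HIGH = 40, 60
energies = ([LOW] * 5 + HIGH * 0 * [0] + [HIGH] * 15 + [LOW] * 2 + [HIGH] * 8
            + [LOW] * 10 + [HIGH] * 5 + [LOW] * 5)

segments = split_by_energy(energies, frame_dur=0.1, threshold=55,
                           min_dur=1, max_dur=15, max_silence=0.3)
print([(round(s, 2), round(e, 2)) for s, e in segments])
```

In this toy input the 0.2 s pause is shorter than max_silence, so the two loud stretches around it are merged into one segment, while the final 0.5 s burst is shorter than min_dur and is discarded.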

Limitation

Above we used auditok to split the audio at its gaps based on the energy of the speech signal. This approach has a problem, though: what if the audio is a mix of a human voice and background music (BGM), or several people are speaking at the same time? In that case the BGM keeps playing even while nobody is speaking, so the energy never drops and STE clearly cannot be used to split the audio.
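The effect can be reproduced with a small synthetic experiment. The sketch below uses sine tones as stand-ins for speech and BGM. The sampling rate, frame length, and the -30 dB threshold are assumptions for float samples in [-1, 1], so the threshold is not comparable to the energy_threshold value passed to auditok above.

```python
import math

SR = 8000           # sampling rate in Hz (assumed for this example)
FRAME_LEN = 200     # 25 ms frames
THRESHOLD_DB = -30  # hypothetical threshold for float samples in [-1, 1]

def tone(freq, amp, n_samples):
    """Sine tone used as a stand-in for speech or music."""
    return [amp * math.sin(2 * math.pi * freq * n / SR) for n in range(n_samples)]

def frame_energy_db(frame):
    """Log energy of one frame: 20 * log10 of its RMS amplitude."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-10))  # floor avoids log10(0)

def active_frames(signal):
    """Mark each non-overlapping frame as active if it exceeds the threshold."""
    return [frame_energy_db(signal[i:i + FRAME_LEN]) > THRESHOLD_DB
            for i in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN)]

# "Speech": two 0.25 s bursts separated by 0.25 s pauses (1 s in total).
burst, pause = tone(300, 0.5, 2000), [0.0] * 2000
speech = burst + pause + burst + pause
# "BGM": a quieter tone that never stops.
bgm = tone(440, 0.2, len(speech))
mixed = [s + m for s, m in zip(speech, bgm)]

speech_only = active_frames(speech)
with_bgm = active_frames(mixed)
print(sum(speech_only), "of", len(speech_only), "frames active without BGM")
print(sum(with_bgm), "of", len(with_bgm), "frames active with BGM")
```

Without BGM, only the frames inside the bursts are active, so the pauses are available as split points. With BGM, every frame exceeds the threshold and no gap remains for an energy-based splitter to cut on.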

In the next article, we will introduce how to use a model to segment speech.



Origin blog.csdn.net/sinat_29957455/article/details/128244421