VAD Interruption Scheme

What is an interruption

An interruption occurs when the user starts talking before the robot has finished speaking. The general practice is to stop the robot's playback once the user's voice has been detected for 100-200 ms. The drawback is that a noisy environment, or someone talking near the user, causes false interruptions. This article introduces several methods for avoiding this problem.

Dynamic minimum sound time

VAD has a parameter for this: min_speak_ms [number], optional, default 100 ms. It is the minimum sound duration, in milliseconds.

This parameter sets a minimum sound duration. Only a sound that lasts longer than this value is treated as valid; a valid sound triggers an interruption and is submitted to the ASR server for recognition.
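As a rough illustration of how a minimum sound duration filters out short noises, the sketch below scans per-frame VAD decisions and keeps only voiced runs that are long enough. It is not the product's implementation: the 20 ms frame length, the function name valid_segments, and the choice to accept a run exactly equal to the threshold are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple

FRAME_MS = 20  # assumed frame length; the article does not specify one


def valid_segments(frames: Iterable[bool], min_speak_ms: int = 100,
                   frame_ms: int = FRAME_MS) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each voiced run of at least min_speak_ms.

    `frames` holds one VAD decision per frame (True = voiced).
    Accepting a run exactly equal to the threshold is an assumption.
    """
    segments: List[Tuple[int, int]] = []
    run_start = None  # index of the first frame of the current voiced run
    count = 0         # total number of frames seen so far
    for i, voiced in enumerate(frames):
        count = i + 1
        if voiced and run_start is None:
            run_start = i
        elif not voiced and run_start is not None:
            if (i - run_start) * frame_ms >= min_speak_ms:
                segments.append((run_start * frame_ms, i * frame_ms))
            run_start = None
    # Close a voiced run that is still open when the input ends.
    if run_start is not None and (count - run_start) * frame_ms >= min_speak_ms:
        segments.append((run_start * frame_ms, count * frame_ms))
    return segments


# A 40 ms blip followed by 120 ms of speech: only the second run is valid.
frames = [False, True, True, False, True, True, True, True, True, True, False]
print(valid_segments(frames))
```

Lowering min_speak_ms makes the detector more responsive but lets shorter noises through, which is the trade-off the rest of this section deals with.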

The default is 100 ms because, once a call is connected, many people habitually open with a short greeting. A brief "Hello" generally lasts 100-200 ms, and a slightly longer greeting lasts 200-300 ms.

When developing business processes, you can set this value dynamically so that invalid sounds do not cause interruptions. For example, use 100 ms for the first sound after the call is connected and 200-300 ms for subsequent sounds. This can be very effective at avoiding false interruptions.
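A minimal sketch of choosing the threshold per turn might look like the following. The function name min_speak_ms_for_turn and the turn_index argument are hypothetical, and 250 ms is simply one value picked from the 200-300 ms range mentioned above; how the value is actually passed to the VAD module depends on the platform and is not shown here.

```python
def min_speak_ms_for_turn(turn_index: int,
                          first_turn_ms: int = 100,
                          later_turn_ms: int = 250) -> int:
    """Pick a min_speak_ms value for the given dialogue turn.

    turn_index 0 is the first sound after the call is connected, where a
    short greeting is expected; later turns use a stricter threshold.
    """
    if turn_index < 0:
        raise ValueError("turn_index must be non-negative")
    return first_turn_ms if turn_index == 0 else later_turn_ms


# Thresholds for the first three turns of a call.
for turn in range(3):
    print(turn, min_speak_ms_for_turn(turn))
```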

Keyword interruption

Implemented in version 2.1. After the user pauses, the speech is submitted to ASR for recognition and the recognition result is sent to the business program, which decides whether the robot needs to be interrupted.
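On the business-program side, the decision could be as simple as a keyword check on the recognized text. The sketch below is only an illustration: the keyword list, the function name should_interrupt, and the use of plain substring matching are assumptions, not part of the product's interface.

```python
from typing import Sequence

# Hypothetical keyword list; a real business program would define its own.
INTERRUPT_KEYWORDS = ("wait", "stop", "hold on", "not interested")


def should_interrupt(asr_text: str,
                     keywords: Sequence[str] = INTERRUPT_KEYWORDS) -> bool:
    """Return True if the recognized text contains any interrupt keyword."""
    text = asr_text.strip().lower()
    if not text:
        return False  # empty recognition result: nothing to act on
    return any(keyword in text for keyword in keywords)


print(should_interrupt("Hold on a second"))  # contains "hold on"
print(should_interrupt("uh"))                # no keyword present
```

Because the decision is made on recognized text rather than on raw sound, background noise that produces no matching words leaves the robot's playback untouched.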

Pause playback when speech is detected, and resume when no valid text is recognized

After the VAD module detects the user's voice, it pauses the robot's playback and starts submitting the voice stream to the ASR server for recognition. If the ASR server does not return a valid sentence, the robot's playback resumes. It does not simply resume from the paused position; instead, the VAD algorithm locates the start of the nearest preceding sound, and playback starts from there. The following is an example.

Robot: Hello, I'm the XX sales department. May I ask if you have recently... (the user's voice is detected at this point, and playback is paused)

The text recognized by ASR is not a valid answer; it could be ambient noise, for example. The robot resumes speaking.

Robot: Do you plan to buy a house recently? (VAD is used to find the start of the nearest sentence before the interruption point, and playback starts there. The robot neither repeats "Hello, I'm the XX sales department" nor simply continues from the interruption point.)
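One way to locate such a resume point is to run VAD over the robot's own prompt audio and walk backwards from the pause position until a long enough silence is found. The sketch below assumes the prompt is already available as per-frame voiced flags; the function name resume_position_ms, the 20 ms frame length, and the 200 ms gap used to separate sentences are illustrative assumptions, not the product's actual algorithm.

```python
from typing import Sequence


def resume_position_ms(prompt_voiced: Sequence[bool], paused_at_ms: int,
                       frame_ms: int = 20, min_gap_ms: int = 200) -> int:
    """Return the offset (ms) from which to resume the robot's prompt.

    Walks backwards from the pause point and returns the start of the
    nearest voiced stretch that is preceded by at least min_gap_ms of
    silence. Shorter silences are treated as pauses inside a sentence.
    """
    if not prompt_voiced:
        return 0
    gap_frames = max(1, min_gap_ms // frame_ms)
    i = min(paused_at_ms // frame_ms, len(prompt_voiced) - 1)
    start = None  # earliest voiced frame of the sentence found so far
    silent = 0    # consecutive silent frames seen while walking backwards
    while i >= 0:
        if prompt_voiced[i]:
            start = i
            silent = 0
        else:
            silent += 1
            if start is not None and silent >= gap_frames:
                break  # reached the silence that precedes the sentence
        i -= 1
    return (start if start is not None else 0) * frame_ms


# Two sentences separated by a 40 ms gap; the second contains a 20 ms pause.
T, F = True, False
prompt = [T, T, T, F, F, T, T, T, T, F, T, T]
print(resume_position_ms(prompt, paused_at_ms=220, frame_ms=20, min_gap_ms=40))
```

In the usage example, the pause falls inside the second sentence, so playback resumes at the start of that sentence rather than at the start of the prompt or at the exact pause point.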

This solution effectively handles interruptions caused by environmental noise, which would otherwise make the robot stop talking. Like a real person, the robot can listen to the user's voice and then continue speaking.

 

Reprinted from: http://www.ddrj.com/smartivr/break.html
