[Speech Algorithm] Generating video subtitles with endpoint detection and Baidu speech recognition


Foreword

A subtitle file consists of many entries, each holding the start time, end time, and text of one sentence, so generating one involves both endpoint detection and speech recognition:

  • Endpoint detection: pydub.silence.detect_nonsilent
  • Speech recognition: aip.AipSpeech (Baidu interface)

pip install moviepy
pip install pydub
pip install baidu-aip

1. Process

  • Extract the audio track from the video
  • Run endpoint detection on the audio to split it into sentence-level segments
  • Run speech recognition on each segment
  • Assemble the results into SRT subtitle format
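
For reference, each entry the pipeline ultimately writes to the .srt file has the standard three-part shape (index, time range, text), with a blank line between entries; the text below is just a placeholder:

```
1
00:00:01,000 --> 00:00:03,500
Text of the first subtitle line

2
00:00:04,200 --> 00:00:06,800
Text of the second subtitle line
```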

2. Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from moviepy.editor import VideoFileClip
from pydub import AudioSegment, silence
from aip import AipSpeech

video_file = r'C:\Users\Lenovo\Desktop\video_sep\test.mp4'
audio_file = r'C:\Users\Lenovo\Desktop\test.wav'
srt_file = r'C:\Users\Lenovo\Desktop\srt\test.srt'

## Extract the audio track as 16 kHz mono WAV (what Baidu ASR expects)
video = VideoFileClip(video_file)
video.audio.write_audiofile(audio_file, ffmpeg_params=['-ar','16000','-ac','1'])

## Endpoint detection: find the non-silent ranges (in milliseconds)
sound = AudioSegment.from_wav(audio_file)
## min_silence_len=700 ms; silence_thresh=sound.dBFS*1.3 (dBFS is negative,
## so multiplying by 1.3 gives a quieter threshold); seek_step=1 ms
timestamp_list = silence.detect_nonsilent(sound, min_silence_len=700,
                                          silence_thresh=sound.dBFS * 1.3,
                                          seek_step=1)
for start, end in timestamp_list:
    print("Section is:", [start, end], "duration is:", end - start)
print('dBFS: {0}, max_dBFS: {1}, duration: {2}, split: {3}'.format(
    round(sound.dBFS, 2), round(sound.max_dBFS, 2),
    sound.duration_seconds, len(timestamp_list)))

def format_time(ms):
    hours = ms // 3600000
    ms = ms % 3600000
    minutes = ms // 60000
    ms = ms % 60000
    seconds = ms // 1000
    mseconds = ms % 1000
    return '{:0>2d}:{:0>2d}:{:0>2d},{:0>3d}'.format(hours, minutes, seconds, mseconds)

## Credentials obtained from the Baidu AI Open Platform
## https://ai.baidu.com/tech/speech
APP_ID = ''
API_KEY = ''
SECRET_KEY = ''
client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)

idx = 1    # SRT entries are conventionally numbered from 1
text = []
for start, end in timestamp_list:
    data = sound[start:end].raw_data    # raw 16-bit PCM of one sentence
    ## ASR: 'pcm' format, 16000 Hz sample rate, Mandarin ('lan': 'zh')
    result = client.asr(data, 'pcm', 16000, {'lan': 'zh'})
    if result['err_no'] == 0:
        text.append('{0}\n{1} --> {2}\n'.format(idx, format_time(start), format_time(end)))
        text.append(result['result'][0])
        text.append('\n\n')    # a blank line separates SRT entries
        idx += 1
## utf-8 so the Chinese text is written correctly on any platform
with open(srt_file, 'w', encoding='utf-8') as f:
    f.writelines(text)
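
Because detect_nonsilent can split one sentence across two ranges when a speaker pauses briefly, a small post-processing pass over timestamp_list can help. The helper below (merge_segments is my own name, not a pydub function) merges adjacent ranges separated by a gap shorter than a chosen limit:

```python
def merge_segments(segments, max_gap_ms=300):
    """Merge adjacent (start_ms, end_ms) ranges whose gap is <= max_gap_ms."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap_ms:
            merged[-1][1] = end          # extend the previous range
        else:
            merged.append([start, end])  # begin a new range
    return [tuple(s) for s in merged]
```

Applying it right after endpoint detection (timestamp_list = merge_segments(timestamp_list)) trades a little time precision for fewer broken sentences; max_gap_ms is a knob to tune per video.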

3. Other ways to generate subtitles

3.1 Endpoint detection by double threshold method

The double-threshold method rests on two observations: the energy of voiced sounds is higher than that of unvoiced sounds, and the zero-crossing rate of unvoiced sounds is higher than that of silence. Its core idea is therefore to first use energy to locate the voiced part, then use the zero-crossing rate to extend the boundaries over the unvoiced part, completing the endpoint detection.

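As a rough illustration of that idea, here is a toy detector; the function name double_threshold_vad and all the threshold ratios are my own choices for this sketch, not from any library:

```python
import numpy as np

def double_threshold_vad(x, frame_len=256, hop=128,
                         high_ratio=0.5, low_ratio=0.1, zcr_mult=3.0):
    """Toy double-threshold endpoint detection.

    1. Frames whose short-time energy exceeds the high threshold are
       taken as definitely voiced.
    2. Each voiced run is extended while energy stays above the low
       threshold (catching weaker voiced tails).
    3. The run is extended further while the zero-crossing rate stays
       high (catching unvoiced consonants at the edges).
    Returns a list of (start_sample, end_sample) tuples.
    """
    x = np.asarray(x, dtype=np.float64)
    n_frames = max(0, 1 + (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    energy = np.sum(frames ** 2, axis=1)                       # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    e_high = high_ratio * energy.max()
    e_low = low_ratio * energy.max()
    z_thr = zcr_mult * np.median(zcr) + 1e-9

    segments = []
    i = 0
    while i < n_frames:
        if energy[i] > e_high:                                 # step 1: voiced core
            start = end = i
            while start > 0 and (energy[start - 1] > e_low or zcr[start - 1] > z_thr):
                start -= 1                                     # steps 2-3: extend left
            while end < n_frames - 1 and (energy[end + 1] > e_low or zcr[end + 1] > z_thr):
                end += 1                                       # steps 2-3: extend right
            segments.append((start * hop, end * hop + frame_len))
            i = end + 1
        else:
            i += 1
    return segments
```

A real implementation would also smooth the energy curve and enforce minimum segment and silence durations, but the two-threshold, three-step structure is the same.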

3.2 Speech recognition through SpeechRecognition

SpeechRecognition can be described as a speech recognition aggregator, wrapping seven recognizers including Google, Bing, and IBM:

  • recognize_bing(): Microsoft Bing Speech
  • recognize_google(): Google Web Speech API
  • recognize_google_cloud(): Google Cloud Speech - requires the google-cloud-speech package
  • recognize_houndify(): Houndify by SoundHound
  • recognize_ibm(): IBM Speech to Text
  • recognize_sphinx(): CMU Sphinx - requires installing PocketSphinx
  • recognize_wit(): Wit.ai

The basic usage is as follows:

import speech_recognition as sr

r = sr.Recognizer()
test = sr.AudioFile(r'C:\Users\Lenovo\Desktop\test.wav')
with test as source:
    audio = r.record(source)            # read the whole file into memory
result = r.recognize_google(audio, language='zh-CN', show_all=True)
print(result)

However, most of these services are hosted overseas, so using them from mainland China seems to require a proxy...

3.3 Generating subtitle files directly with the autosub package

autosub is a Python tool that generates subtitle files directly from a video or audio file. The basic usage is as follows:

autosub -S zh-CN -D zh-CN [your video/audio filename]

However, this method also needs a proxy; I tried switching proxies and it still didn't work...

4. Summary

Overall, the two building blocks of subtitle generation each have several possible implementations, and the combination I finally chose, pydub plus baidu-aip, is a relatively simple and effective one. In practice, though, the results fell short of my expectations: the endpoint detection was not accurate enough, so words sometimes ended up attached to the wrong sentence, which in turn biased the speech recognition. A better endpoint detector would consider energy and zero-crossing rate together, and ideally add custom constraints, for example keeping the lengths of detected sentences from varying too widely.


Origin blog.csdn.net/tobefans/article/details/125433832