One-click intelligent video speech-to-text: easily extract video speech and generate copy with PaddlePaddle speech recognition and Python

Preface

Nowadays more and more people are entering the self-media industry, and short video has gradually become mainstream. In many situations you need to convert the speech in a video into text: turning a recorded meeting video into meeting minutes, taking notes on an online tutorial video, extracting a video's copy for reuse, or adding subtitles to a video. All of these come down to converting the video's speech into text.
For people who are not video-editing professionals this is rather troublesome. There are many small tools available online; most of them advertise their own technologies and models, but they are actually online-only services, or they stop working after a trial period, and are in fact AI tools built on interfaces provided by large companies. The results are good, but the files to be processed must be uploaded to a third-party server, which raises data security concerns and can lead to data leakage.
The core algorithm of this project is based on PaddlePaddle speech recognition and is implemented in Python. The model can be trained on your own data, supports local deployment, runs inference on both GPU and CPU, and handles short-speech recognition, long-speech recognition, and recognition of input speech.

1. Video speech extraction

To recognize the speech in a video, you first have to extract it. There are many ways to do this: you can use video editing software (such as Adobe Premiere Pro or Final Cut Pro) to extract the audio track and export it as an audio file, or you can use tools such as FFmpeg or MoviePy to extract the audio from the command line.
MoviePy is used here to extract the speech from the video. MoviePy is a feature-rich Python module designed for video editing. With MoviePy you can easily perform basic video operations such as cutting, concatenation, and title insertion; it also supports video compositing and more advanced processing, and you can even add custom special effects. The module can read and write most common video formats, including GIF, and works the same way on Mac, Windows, and Linux.
MoviePy and FFmpeg environment installation:

pip install moviepy
pip install ffmpeg

Because the sampling rate of the audio track extracted with MoviePy is not 16000 Hz, it cannot be fed directly into the speech recognition model, so FFmpeg is also used to resample the extracted audio track to 16000 Hz:

import os
import subprocess

from moviepy.editor import VideoFileClip


def video_to_audio(video_path, audio_path):
    # Extract the audio track from the video into a temporary WAV file
    video = VideoFileClip(video_path)
    audio = video.audio
    audio_temp = "temp.wav"

    if os.path.exists(audio_temp):
        os.remove(audio_temp)

    audio.write_audiofile(audio_temp)
    audio.close()

    if os.path.exists(audio_path):
        os.remove(audio_path)
    # Resample to mono, 16000 Hz with FFmpeg so the model can consume it
    cmd = "ffmpeg -i " + audio_temp + " -ac 1 -ar 16000 " + audio_path
    subprocess.run(cmd, shell=True)
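
A minimal call sketch (the file names here are placeholders, not files from the project):

# Hypothetical paths; replace with your own video and target audio file
video_to_audio("input.mp4", "output.wav")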


2. Voice recognition

1. PaddleSpeech speech recognition

PaddleSpeech is an all-in-one speech algorithm toolbox developed by the PaddlePaddle (飞桨) team. It contains a variety of internationally leading speech algorithms and pre-trained models, provides various speech processing tools and pre-trained models for users to choose from, and supports speech recognition, speech synthesis, audio classification, voiceprint recognition, punctuation restoration, speech translation, and other functions. You can find quality projects and training tutorials based on PaddleSpeech here: https://aistudio.baidu.com/projectdetail/4692119?contributionType=1

Automatic Speech Recognition (ASR) is the task of extracting the text content from a piece of audio.
Currently, Transformer and Conformer are the mainstream models in the field of speech recognition. For tutorials on this topic, see the official PaddleSpeech course: PaddleSpeech Speech Technology Course.
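
As a quick sanity check of the toolbox itself (optional, and separate from the ppasr-based code used in the rest of this article), PaddleSpeech can be called through its Python API. The snippet below follows the example in the PaddleSpeech README, so treat the exact import path and the file zh.wav as assumptions:

from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
# zh.wav is assumed to be a 16 kHz, mono Mandarin recording
text = asr(audio_file="zh.wav")
print(text)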

2. Environment dependency installation

My current environment is Windows 10 with an NVIDIA RTX 3060 GPU, CUDA 11.8, and cuDNN 8.5. To make packaging easier later, I use conda to set up the environment. If you do not have a GPU, you can install the CPU version instead (see the note after the commands):

conda create -n video_to_txt python=3.8
python -m pip install paddlepaddle-gpu==2.5.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
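
For a CPU-only machine, the corresponding command (same version pin as above, following the PaddlePaddle install guide) should be roughly:

python -m pip install paddlepaddle==2.5.1 -i https://pypi.tuna.tsinghua.edu.cn/simple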

3. Model download

You can download the model that suits you from the official git: https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/README_cn.md
Convert the model:

import argparse
import functools

from ppasr.trainer import PPASRTrainer
from ppasr.utils.utils import add_arguments, print_arguments

parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
add_arg('configs',          str,   'models/csfw/configs/conformer.yml',    '配置文件')
add_arg("use_gpu",          bool,  True,                       '是否使用GPU评估模型')
add_arg("save_quant",       bool,  False,                      '是否保存量化模型')
add_arg('save_model',       str,   'models',                  '模型保存的路径')
add_arg('resume_model',     str,   'models/csfw/models', '准备导出的模型路径')
args = parser.parse_args()
print_arguments(args=args)


# 获取训练器
trainer = PPASRTrainer(configs=args.configs, use_gpu=args.use_gpu)

# 导出预测模型
trainer.export(save_model_path=args.save_model,
               resume_model=args.resume_model,
               save_quant=args.save_quant)

4. Voice recognition

Use the model for short speech recognition:

    def predict(self,
                audio_data,
                use_pun=False,
                is_itn=False,
                sample_rate=16000):
        # 加载音频文件,并进行预处理
        audio_segment = self._load_audio(audio_data=audio_data, sample_rate=sample_rate)
        audio_feature = self._audio_featurizer.featurize(audio_segment)
        input_data = np.array(audio_feature).astype(np.float32)[np.newaxis, :]
        audio_len = np.array([input_data.shape[1]]).astype(np.int64)

        # 运行predictor
        output_data = self.predictor.predict(input_data, audio_len)[0]

        # 解码
        score, text = self.decode(output_data=output_data, use_pun=use_pun, is_itn=is_itn)
        result = {'text': text, 'score': score}
        return result
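
A minimal usage sketch for this method (the constructor arguments mirror the ones used in the Gradio script later in this article; the WAV path is a placeholder):

from ppasr.predict import PPASRPredictor

predictor = PPASRPredictor(configs="models/csfw/configs/conformer.yml",
                           model_path="models/csfw/models",
                           use_gpu=True)
result = predictor.predict(audio_data="output.wav", use_pun=False)
print(result['text'], result['score'])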

Looking at the recognition result, the text is all run together, with no sentence breaks or punctuation:

5. Sentence segmentation and punctuation marks

A punctuation restoration model can be trained based on PaddlePaddle's ERNIE. The code to add punctuation:

import json
import os
import re

import numpy as np
import paddle.inference as paddle_infer
from paddlenlp.transformers import ErnieTokenizer
from ppasr.utils.logger import setup_logger

logger = setup_logger(__name__)

__all__ = ['PunctuationPredictor']


class PunctuationPredictor:
    def __init__(self, model_dir, use_gpu=True, gpu_mem=500, num_threads=4):
        # 创建 config
        model_path = os.path.join(model_dir, 'model.pdmodel')
        params_path = os.path.join(model_dir, 'model.pdiparams')
        if not os.path.exists(model_path) or not os.path.exists(params_path):
            raise Exception("标点符号模型文件不存在,请检查{}和{}是否存在!".format(model_path, params_path))
        self.config = paddle_infer.Config(model_path, params_path)
        # 获取预训练模型类型
        pretrained_token = 'ernie-1.0'
        if os.path.exists(os.path.join(model_dir, 'info.json')):
            with open(os.path.join(model_dir, 'info.json'), 'r', encoding='utf-8') as f:
                data = json.load(f)
                pretrained_token = data['pretrained_token']

        if use_gpu:
            self.config.enable_use_gpu(gpu_mem, 0)
        else:
            self.config.disable_gpu()
            self.config.set_cpu_math_library_num_threads(num_threads)
        # enable memory optim
        self.config.enable_memory_optim()
        self.config.disable_glog_info()

        # 根据 config 创建 predictor
        self.predictor = paddle_infer.create_predictor(self.config)

        # 获取输入层
        self.input_ids_handle = self.predictor.get_input_handle('input_ids')
        self.token_type_ids_handle = self.predictor.get_input_handle('token_type_ids')

        # 获取输出的名称
        self.output_names = self.predictor.get_output_names()

        self._punc_list = []
        if not os.path.exists(os.path.join(model_dir, 'vocab.txt')):
            raise Exception("字典文件不存在,请检查{}是否存在!".format(os.path.join(model_dir, 'vocab.txt')))
        with open(os.path.join(model_dir, 'vocab.txt'), 'r', encoding='utf-8') as f:
            for line in f:
                self._punc_list.append(line.strip())

        self.tokenizer = ErnieTokenizer.from_pretrained(pretrained_token)

        # 预热
        self('近几年不但我用书给女儿儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书')
        logger.info('标点符号模型加载成功。')

    def _clean_text(self, text):
        text = text.lower()
        text = re.sub('[^A-Za-z0-9\u4e00-\u9fa5]', '', text)
        text = re.sub(f'[{"".join([p for p in self._punc_list][1:])}]', '', text)
        return text

    # 预处理文本
    def preprocess(self, text: str):
        clean_text = self._clean_text(text)
        if len(clean_text) == 0: return None
        tokenized_input = self.tokenizer(list(clean_text), return_length=True, is_split_into_words=True)
        input_ids = tokenized_input['input_ids']
        seg_ids = tokenized_input['token_type_ids']
        seq_len = tokenized_input['seq_len']
        return input_ids, seg_ids, seq_len

    def infer(self, input_ids: list, seg_ids: list):
        # 设置输入
        self.input_ids_handle.reshape([1, len(input_ids)])
        self.token_type_ids_handle.reshape([1, len(seg_ids)])
        self.input_ids_handle.copy_from_cpu(np.array([input_ids]).astype('int64'))
        self.token_type_ids_handle.copy_from_cpu(np.array([seg_ids]).astype('int64'))

        # 运行predictor
        self.predictor.run()

        # 获取输出
        output_handle = self.predictor.get_output_handle(self.output_names[0])
        output_data = output_handle.copy_to_cpu()
        return output_data

    # 后处理识别结果
    def postprocess(self, input_ids, seq_len, preds):
        tokens = self.tokenizer.convert_ids_to_tokens(input_ids[1:seq_len - 1])
        labels = preds[1:seq_len - 1].tolist()
        assert len(tokens) == len(labels)

        text = ''
        for t, l in zip(tokens, labels):
            text += t
            if l != 0:
                text += self._punc_list[l]
        return text

    def __call__(self, text: str) -> str:
        # 数据batch处理
        try:
            input_ids, seg_ids, seq_len = self.preprocess(text)
            preds = self.infer(input_ids=input_ids, seg_ids=seg_ids)
            if len(preds.shape) == 2:
                preds = preds[0]
            text = self.postprocess(input_ids, seq_len, preds)
        except Exception as e:
            logger.error(e)
        return text
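
A minimal usage sketch (models/pun_models/ matches the punctuation model directory used in the Gradio script later; the input sentence is just an example):

pun_predictor = PunctuationPredictor(model_dir='models/pun_models/', use_gpu=True)
print(pun_predictor('今天天气真好我们一起去公园散步吧'))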

Inference result:

6. Long audio recognition

Long-audio recognition splits the audio with VAD, recognizes each short segment, and splices the results together to obtain the final long-speech recognition result. VAD (Voice Activity Detection) is a voice endpoint detection technique whose main task is to accurately locate the start and end points of speech in noisy audio.

    def get_speech_timestamps(self, audio, sampling_rate):
        self.reset_states()
        min_speech_samples = sampling_rate * self.min_speech_duration_ms / 1000
        min_silence_samples = sampling_rate * self.min_silence_duration_ms / 1000
        speech_pad_samples = sampling_rate * self.speech_pad_ms / 1000

        audio_length_samples = len(audio)

        speech_probs = []
        for current_start_sample in range(0, audio_length_samples, self.window_size_samples):
            chunk = audio[current_start_sample: current_start_sample + self.window_size_samples]
            if len(chunk) < self.window_size_samples:
                chunk = np.pad(chunk, (0, int(self.window_size_samples - len(chunk))))
            speech_prob = self(chunk, sampling_rate).item()
            speech_probs.append(speech_prob)

        triggered = False
        speeches: List[dict] = []
        current_speech = {}
        neg_threshold = self.threshold - 0.15
        temp_end = 0

        for i, speech_prob in enumerate(speech_probs):
            if (speech_prob >= self.threshold) and temp_end:
                temp_end = 0

            if (speech_prob >= self.threshold) and not triggered:
                triggered = True
                current_speech['start'] = self.window_size_samples * i
                continue

            if (speech_prob < neg_threshold) and triggered:
                if not temp_end:
                    temp_end = self.window_size_samples * i
                if (self.window_size_samples * i) - temp_end < min_silence_samples:
                    continue
                else:
                    current_speech['end'] = temp_end
                    if (current_speech['end'] - current_speech['start']) > min_speech_samples:
                        speeches.append(current_speech)
                    temp_end = 0
                    current_speech = {}
                    triggered = False
                    continue

        if current_speech and (audio_length_samples - current_speech['start']) > min_speech_samples:
            current_speech['end'] = audio_length_samples
            speeches.append(current_speech)

        for i, speech in enumerate(speeches):
            if i == 0:
                speech['start'] = int(max(0, speech['start'] - speech_pad_samples))
            if i != len(speeches) - 1:
                silence_duration = speeches[i + 1]['start'] - speech['end']
                if silence_duration < 2 * speech_pad_samples:
                    speech['end'] += int(silence_duration // 2)
                    speeches[i + 1]['start'] = int(max(0, speeches[i + 1]['start'] - silence_duration // 2))
                else:
                    speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
                    speeches[i + 1]['start'] = int(max(0, speeches[i + 1]['start'] - speech_pad_samples))
            else:
                speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))

        return speeches

For long speech recognition:

    def predict_long(self,
                     audio_data,
                     use_pun=False,
                     is_itn=False,
                     sample_rate=16000):
        self.init_vad()
        # 加载音频文件,并进行预处理
        audio_segment = self._load_audio(audio_data=audio_data, sample_rate=sample_rate)
        # 重采样,方便进行语音活动检测
        if audio_segment.sample_rate != self.configs.preprocess_conf.sample_rate:
            audio_segment.resample(self.configs.preprocess_conf.sample_rate)
        # 获取语音活动区域
        speech_timestamps = self.vad_predictor.get_speech_timestamps(audio_segment.samples, audio_segment.sample_rate)
        texts, scores = '', []
        for t in speech_timestamps:
            audio_ndarray = audio_segment.samples[t['start']: t['end']]
            # 执行识别
            result = self.predict(audio_data=audio_ndarray, use_pun=False, is_itn=is_itn)
            score, text = result['score'], result['text']
            if text != '':
                texts = texts + text if use_pun else texts + ',' + text
            scores.append(score)
            logger.info(f'长语音识别片段结果:{text}')
        if len(texts) > 0 and texts[0] == ',': texts = texts[1:]
        # 加标点符号
        if use_pun and len(texts) > 0:
            if self.pun_predictor is not None:
                texts = self.pun_predictor(texts)
            else:
                logger.warning('标点符号模型没有初始化!')
        result = {'text': texts, 'score': round(sum(scores) / len(scores), 2)}
        return result

Inference result:
Sentence segmentation result:

Some big babies, I really don’t know what you think? I called my date to a single man and woman. They are both the same age. The 28-year-old woman is a kindergarten teacher, and the man from the south is an engineer. They met for the first time early last month. They had a good impression of each other, so they arranged another meeting. We met three or four times and have known each other off and on for more than a month. Last night they met again, after dinner...

3. UI and saving

1. UI interface

For convenience, the Gradio library is used to build the application. Gradio is an open-source Python library for quickly building demo applications for machine learning and data science. It lets you create a simple, attractive user interface to present a machine learning model to clients, collaborators, users, or students, quickly deploy the model through automatically generated share links and collect feedback on its performance, and interactively debug the model during development with built-in manipulation and interpretation tools.

pip install gradio
#You can use Tsinghua mirror source to install faster
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple gradio

import os
from moviepy.editor import *
import subprocess
import gradio as gr
from ppasr.predict import PPASRPredictor
from ppasr.utils.utils import add_arguments, print_arguments

configs = "models/csfw/configs/conformer.yml"
pun_model_dir = "models/pun_models/"
model_path = "models/csfw/models"

predictor = PPASRPredictor(configs=configs,
                           model_path=model_path,
                           use_gpu=True,
                           use_pun=True,
                           pun_model_dir=pun_model_dir)

def video_to_audio(video_path):
    file_name, ext = os.path.splitext(os.path.basename(video_path))
    video = VideoFileClip(video_path)
    audio = video.audio
    audio_temp = "temp.wav"

    audio_name = file_name + ".wav"
    if os.path.exists(audio_temp):
        os.remove(audio_temp)

    audio.write_audiofile(audio_temp)
    audio.close()

    if os.path.exists(audio_name):
        os.remove(audio_name)

    cmd = "ffmpeg -i " + audio_temp + " -ac 1 -ar 16000 " + audio_name
    subprocess.run(cmd,shell=True)

    return audio_name

def predict_long_audio(wav_path):
    result = predictor.predict_long(wav_path, True, False)
    score, text = result['score'], result['text']
    return text


# 短语音识别
def predict_audio(wav_path):
    result = predictor.predict(wav_path, True, False)
    score, text = result['score'], result['text']
    return text

def video_to_text(video,operation):
    audio_name = video_to_audio(video)
    if operation == "短音频":
        text = predict_audio(audio_name)
    elif operation == "长音频":
        text = predict_long_audio(audio_name)
    else:
        text = ""

    print("视频语音提取识别完成!")
    return text

ch = gr.Radio(["短音频","长音频"],label="选择识别音频方式:")

demo = gr.Interface(fn=video_to_text,inputs=[gr.Video(), ch],outputs="text")

demo.launch()

Result:

Video speech extraction and text conversion

2. Save the recognition results
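
The saving code is not shown here; as a minimal sketch (assuming we simply write the recognized text to a .txt file named after the video), something like the following could be called at the end of video_to_text:

def save_result(video_path, text):
    # Write the recognized text next to the source video, using the same base name
    txt_path = os.path.splitext(video_path)[0] + ".txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    return txt_path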

4. Optimization and Upgrading

1. Optimization

The word error rate of this project's current recognition is 0.083327. Some homophone errors cannot be corrected from context, for example this sentence:

"This South is an engineer"

From the context, the correct sentence should be:

"This man is an engineer"

There are not many recognition errors of this kind, and there are also a few sentences that are not segmented well. To optimize further, an LLM (large language model) could be added to screen and correct such errors; the code that connects to an LLM is still in the training and testing stage (a sketch of the idea follows below).
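
Purely as an illustration of that idea (the author's LLM code is not released), a post-correction step might look like the sketch below, where call_llm is a hypothetical callable wrapping whatever large language model you have access to:

def correct_with_llm(asr_text, call_llm):
    # call_llm is a hypothetical function: it takes a prompt string and returns the model's reply
    prompt = ("以下是语音识别结果,可能存在同音字错误和断句不当,"
              "请结合上下文改正并重新断句,只输出改正后的文本:\n" + asr_text)
    return call_llm(prompt)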

2. Upgrade

The project can be upgraded in the following directions:

  1. The current project only supports Chinese speech; multi-language support will be added later.
  2. If a video has no subtitles, add a subtitle-generation module.
  3. If a video already has on-screen subtitles, read them with OCR and cross-check the OCR and speech recognition results against each other.
  4. Add a web version.
  5. Support extracting and recognizing speech from a user-selected segment of the video.
  6. For videos of multi-person conversations, add voiceprint (speaker) recognition and format the output by speaker.
  7. Export the generated text to Word and typeset it (see the sketch after this list).
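
For item 7, a simple starting point (assuming the python-docx package is installed; the sentence-splitting rule is only an example) could be:

from docx import Document

def export_to_word(text, docx_path):
    # Write each sentence of the recognized text as a separate paragraph
    doc = Document()
    for sentence in text.replace('。', '。\n').splitlines():
        if sentence.strip():
            doc.add_paragraph(sentence.strip())
    doc.save(docx_path)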

3. Project source code

Source code: https://download.csdn.net/download/matt45m/88386353
Model:

Source code configuration:

conda create -n video_to_txt python=3.8
python -m pip install paddlepaddle-gpu==2.5.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
cd VideoToTxt
pip install -r requirements.txt
python video_txt.py

Then open http://127.0.0.1:7860/ in a browser and you can start using it.

4. Remarks

If you are interested in this project or run into any errors during installation, you can join my QQ group: 487350510, and we can discuss it together.
