[whisper] Calling whisper from Python to extract subtitles or translate them into text

Recently I have been working on video processing, where one requirement is to extract subtitles from videos. Our implementation has two steps: first separate the audio, then use whisper for speech recognition or translation. This article introduces the basics of whisper and two ways to call it from Python.

1. Introduction to whisper

Whisper is an open-source speech recognition model from OpenAI that supports many languages, including Chinese. In this article, we will introduce how to install whisper and how to use it to recognize Chinese subtitles.

2. Install whisper

First, we need to install whisper. Whisper is distributed as a Python package (the official repository is https://github.com/openai/whisper) and depends on the ffmpeg command-line tool. Depending on your operating system, install ffmpeg as follows:

  • For Windows users, install Python from python.org, then install ffmpeg (for example via Chocolatey with choco install ffmpeg, or by downloading a build and adding it to PATH), and finally install whisper with pip as shown below.

  • For macOS users, you can use Homebrew (https://brew.sh/). Run the following command in the terminal: brew install ffmpeg, then install whisper with pip.

  • For Linux users, install ffmpeg using a package manager such as apt or yum. For example, Ubuntu users can run the following command in the terminal: sudo apt update && sudo apt install ffmpeg, then install whisper with pip.

Of course, the environment also needs to be configured. For non-developers, whisper can also be driven entirely from the console to translate subtitles, without writing any Python.

3. Use Whisper to extract video subtitles and generate files

3.1 Install Whisper library

First, we need to install the Whisper library. Note that the PyPI package is named openai-whisper (pip install whisper installs an unrelated project). It can be installed from the command line using:

pip install -U openai-whisper

3.2 Import required libraries and modules

import whisper
import arrow
import time
from datetime import timedelta
import subprocess
import re

These dependencies can be captured in a requirements.txt (for example with pip freeze). The pinned versions used in this article are:

arrow==1.3.0
asposestorage==1.0.2
numpy==1.25.0
openai_whisper==20230918

3.3 Extract subtitles and generate files

Below is a function that extracts subtitles from the target video and writes them to a specified file:

1. Calling the library directly in Python

def extract_subtitles(video_file, output_file, actual_start_time=None):
    # Load the whisper model (choose a model size that fits your needs)
    model = whisper.load_model("medium")
    subtitles = []
    # Transcribe the video
    result = model.transcribe(video_file)
    start_time = arrow.get(actual_start_time, 'HH:mm:ss.SSS') if actual_start_time is not None else arrow.get(0)

    for segment in result["segments"]:
        # Compute the start and end timestamps
        start = format_time(start_time.shift(seconds=segment["start"]))
        end = format_time(start_time.shift(seconds=segment["end"]))
        # Build the subtitle line
        subtitle_text = f"【{start} -> {end}】: {segment['text']}"
        print(subtitle_text)
        subtitles.append(subtitle_text)
    # Write the subtitle lines to the output file
    with open(output_file, "w", encoding="utf-8") as f:
        for subtitle in subtitles:
            f.write(subtitle + "\n")
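For reference, the timestamp arithmetic above can also be done with only the standard library, avoiding the arrow dependency. A minimal sketch, where the segment dict is a made-up stand-in for one entry of result["segments"]:

```python
from datetime import timedelta

def shift_timestamp(base, offset_seconds):
    """Shift a base timedelta by an offset in seconds and format as HH:MM:SS.mmm."""
    total = base + timedelta(seconds=offset_seconds)
    total_ms = int(total.total_seconds() * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"

# A fake segment shaped like one entry of result["segments"]
segment = {"start": 1.5, "end": 3.25, "text": "hello"}
base = timedelta(0)  # the actual_start_time offset, here zero
line = f"【{shift_timestamp(base, segment['start'])} -> {shift_timestamp(base, segment['end'])}】: {segment['text']}"
print(line)  # 【00:00:01.500 -> 00:00:03.250】: hello
```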

2. Calling console commands from Python

"""
从目标视频中提取字幕并生成到指定文件
参数:

video_file (str): 目标视频文件的路径
output_file (str): 输出文件的路径
actual_start_time (str): 音频的实际开始时间,格式为'时:分:秒.毫秒'(可选)
target_lang (str): 目标语言代码,例如'en'表示英语,'zh-CN'表示简体中文等(可选)
"""


def extract_subtitles_translate(video_file, output_file, actual_start_time=None, target_lang=None):
    # Path to the whisper executable
    whisper_path = r"D:\soft46\AncondaSelfInfo\envs\py39\Scripts\whisper"
    subtitles = []
    # Build the command-line arguments
    command = [whisper_path, video_file, "--task", "translate", "--language", target_lang, "--model", "large"]

    if actual_start_time is not None:
        # Note: --start-time is not a flag in the official openai-whisper CLI;
        # keep this only if your whisper build supports it
        command.extend(["--start-time", actual_start_time])

    print(command)

    try:
        # Run the command and capture its byte-stream output
        output = subprocess.check_output(command)
        output = output.decode('utf-8')  # decode to a string
        subtitle_lines = output.split('\n')  # split the subtitle text into lines

        start_time = time_to_milliseconds(actual_start_time) if actual_start_time is not None else 0
        for line in subtitle_lines:
            line = line.strip()
            if line:  # skip empty lines
                # Parse each subtitle line
                match = re.match(r'\[(\d{2}:\d{2}\.\d{3})\s+-->\s+(\d{2}:\d{2}\.\d{3})\]\s+(.+)', line)
                if match:
                    # Converting seconds to a timestamp string instead:
                    # start = seconds_to_time(start_time + time_to_seconds(match.group(1)))
                    # end = seconds_to_time(start_time + time_to_seconds(match.group(2)))
                    start = start_time + time_to_milliseconds(match.group(1))
                    end = start_time + time_to_milliseconds(match.group(2))
                    text = match.group(3)
                    # Build the subtitle line (customize the output format here)
                    subtitle_text = f"【{start} -> {end}】: {text}"
                    print(subtitle_text)
                    subtitles.append(subtitle_text)
        # Write the subtitle lines to the output file
        with open(output_file, "w", encoding="utf-8") as f:
            for subtitle in subtitles:
                f.write(subtitle + "\n")

    except subprocess.CalledProcessError as e:
        print(f"Command failed: {e}")

3.4 Auxiliary functions

The code above also uses several helper functions for converting and formatting times:

def time_to_milliseconds(time_str):
    # Accepts 'HH:MM:SS.mmm' or 'MM:SS.mmm' (the format the whisper CLI prints)
    parts = time_str.split(":")
    if len(parts) == 2:
        parts.insert(0, "0")
    h, m, s = parts
    seconds = int(h) * 3600 + int(m) * 60 + float(s)
    return int(seconds * 1000)

def format_time(time):
    return time.format('HH:mm:ss.SSS')

def format_time_dot(time):
    # Format a timedelta as SRT-style 'H:MM:SS,mmm' (assumes a fractional-second component)
    return str(timedelta(seconds=time.total_seconds())).replace(".", ",")[:-3]

# Measure and print a function's wall-clock running time
def time_it(func, *args, **kwargs):
    start_time = time.time()
    print("Start time:", time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time)))

    result = func(*args, **kwargs)

    end_time = time.time()
    total_time = end_time - start_time

    minutes = int(total_time // 60)
    seconds = total_time % 60

    print("End time:", time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time)))
    print("Total running time: {} min {:.2f} s".format(minutes, seconds))

    return result
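A quick sanity check of the time conversion helper; the function is repeated here so the snippet is self-contained:

```python
def time_to_milliseconds(time_str):
    # Accepts 'HH:MM:SS.mmm' or 'MM:SS.mmm'
    parts = time_str.split(":")
    if len(parts) == 2:
        parts.insert(0, "0")
    h, m, s = parts
    seconds = int(h) * 3600 + int(m) * 60 + float(s)
    return int(seconds * 1000)

print(time_to_milliseconds("00:00:01.500"))  # 1500
print(time_to_milliseconds("01:02.250"))     # 62250
```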

3.5 Call function to extract subtitles

The following code calls the functions above and extracts the subtitles to the specified output file:

if __name__ == '__main__':
    video_file = "C:/path/to/video.mp4"  # replace with the path to the target video file
    output_file = "C:/path/to/output.txt"  # replace with the path to the output txt file
    actual_start_time = '00:00:00.000'  # actual audio start time in 'HH:MM:SS.mmm' format; defaults to 00:00:00.000 if not provided
    # Calling the library version directly from main:
    # extract_subtitles(video_file, output_file, actual_start_time)
    time_it(extract_subtitles_translate, video_file, output_file, None, 'en')

Note: replace video_file and output_file with the actual video file path and output file path. If the audio has an actual start time, pass it as actual_start_time.

In the code above, we first import the whisper library and then define a function named extract_subtitles, which accepts a video file path as input and uses the whisper library for speech recognition. The recognition results are stored in the result dictionary, whose segments field contains the recognized subtitle segments and their text.

In the if __name__ == '__main__' block, we call extract_subtitles_translate (wrapped in time_it), passing in the video file path, and print the recognized subtitles.

3.6 Model selection, for reference:

_MODELS = {
    "tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
    "tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt",
    "base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt",
    "base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt",
    "small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt",
    "small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
    "medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt",
    "medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt",
    "large-v1": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt",
    "large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
    "large": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
}

# 1. tiny.en / tiny:
#    - Pros: smallest model, suitable for resource-constrained environments, fastest inference.
#    - Cons: being small, it performs worse than the larger models on complex
#      or long speech.                              ------- many errors in practice
#
# 2. base.en / base:
#    - Pros: more model capacity than tiny, handles more complex dialogue and text tasks.
#    - Cons: inference is somewhat slower than the smaller models.
#
# 3. small.en / small:
#    - Pros: moderate model size, with reasonable accuracy and inference speed.
#    - Cons: may not match the larger models on the most complex dialogue and text tasks.
#
# 4. medium.en / medium:
#    - Pros: large model capacity, handles complex dialogue and text tasks well.
#    - Cons: inference is slower than the smaller models.  ------- segments can run very long, e.g.
#      【00:00:52.000 -> 00:01:03.000】: 嗯,有一个那个小箱子,能看到吗?上面有那个白色的泡沫,那个白色塑料纸一样盖着,把那个白色那个塑料纸拿起来,下面就是。
#
# 5. large-v1 / large-v2 / large:
#    - Pros: the largest capacity and the strongest accuracy; handles complex dialogue and text tasks.
#    - Cons: slowest inference and highest memory usage of all the models.
#
# (The download link for each model is listed in the _MODELS dict above.)


# Example console command:
# whisper C:/Users/Lenovo/Desktop/whisper/luyin.aac --language Chinese --task translate
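If you want to encode the trade-offs above in code, a hypothetical helper could pick a model name based on available GPU memory. The thresholds below are rough guesses inspired by the model sizes, not measured benchmarks:

```python
# Hypothetical helper: pick a whisper model name from the trade-off notes above.
# The VRAM thresholds are illustrative guesses, not benchmarks.
def pick_model(vram_gb, need_accuracy=False):
    if need_accuracy and vram_gb >= 10:
        return "large"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "tiny"

print(pick_model(12, need_accuracy=True))  # large
print(pick_model(4))                       # small
```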

4. Conclusion

Through the above steps, whisper has been installed and Chinese subtitle recognition has been implemented. In real applications, you may need to adjust the code to your situation, for example how audio file paths and recognition results are handled. Hope this article helps you!


Origin blog.csdn.net/qq_48424581/article/details/134113540