OpenAI Whisper speech recognition model

1. Introduction to Whisper

  • Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.

  • Whisper comes in five model sizes, four of which also have English-only versions, offering trade-offs between speed and accuracy. Below are the names of the available models, their approximate memory requirements, and relative speeds.

[Image: table of available models, approximate memory requirements, and relative speeds]

  • github link: https://github.com/openai/whisper

2. Method

A Transformer sequence-to-sequence model is trained for various speech processing tasks, including multilingual speech recognition, speech translation, spoken language recognition, and voice activity detection. Together, these tasks are represented in the form of a sequence of symbols that need to be predicted by the decoder, enabling a single model to replace multiple stages in traditional speech processing pipelines. The multi-task training format uses a series of special symbols as task designators or classification targets.
[Image: overview of the Whisper multitask training format]
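
The task designators described above surface in Whisper's lower-level Python API, where the decoding task and language can be set explicitly through DecodingOptions. Below is a minimal sketch following the usage shown in the official README (it assumes the environment from Section 3 is already set up; "audio.mp3" is a placeholder file name):

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to a 30-second window
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Spoken language identification
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode; task="transcribe" keeps the source language, task="translate" outputs English
options = whisper.DecodingOptions(task="transcribe")
result = whisper.decode(model, mel, options)
print(result.text)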

3. Environment configuration

conda create -n whisper python=3.9
conda activate whisper
pip install -U openai-whisper
sudo apt update && sudo apt install ffmpeg
pip install setuptools-rust
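
As a quick sanity check that the package installed correctly (a minimal sketch, assuming the steps above succeeded), you can list the checkpoint names that whisper.load_model() accepts:

import whisper

# Prints the available checkpoint names, e.g. tiny, base, small, medium, large
print(whisper.available_models())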

4. Python test script

  • Taking the lightweight tiny model as an example, the test script is as follows:
import whisper

model = whisper.load_model("tiny")
result = model.transcribe("sample_1.wav")
print(result["text"])

The test results are as follows:

[Image: transcription output of the tiny model]
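
Besides the plain text, the dictionary returned by transcribe() also contains the detected language and timestamped segments. A small sketch, reusing the sample_1.wav file from the script above:

import whisper

model = whisper.load_model("tiny")
result = model.transcribe("sample_1.wav")

# Detected language code, e.g. "en" or "zh"
print(result["language"])

# Each segment carries start/end timestamps (in seconds) and the recognized text
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")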

  • If you want to test the large model, you need a graphics card with more than 16 GB of video memory.

Note: The test script above does not support multi-GPU out of the box, but it is possible to load the encoder on one GPU and the decoder on another.

If you want to run the test across multiple GPUs, try the following approach:

  • First, update the package to the latest commit:
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
  • Then, refer to the following script:
import whisper

# Load the model on the CPU first, then place the encoder and decoder on separate GPUs
model = whisper.load_model("large", device="cpu")

model.encoder.to("cuda:0")
model.decoder.to("cuda:1")

# Move the decoder's inputs to cuda:1 before each forward pass,
# and move its outputs back to cuda:0 afterwards
model.decoder.register_forward_pre_hook(lambda _, inputs: tuple([inputs[0].to("cuda:1"), inputs[1].to("cuda:1")] + list(inputs[2:])))
model.decoder.register_forward_hook(lambda _, inputs, outputs: outputs.to("cuda:0"))

model.transcribe("jfk.flac")

Multi-gpu script reference link: https://github.com/openai/whisper/discussions/360
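
Before running the split-device script above, it may help to confirm that at least two GPUs are actually visible to PyTorch, since the hooks assume both cuda:0 and cuda:1 exist. A minimal check:

import torch

# The encoder/decoder split above requires at least two visible CUDA devices
assert torch.cuda.device_count() >= 2, "this script needs at least two GPUs"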

  • Test the large model (requires at least 16 GB of video memory): input audio, output Simplified Chinese text. You need to set initial_prompt, otherwise the output may come out in Traditional Chinese.
import whisper

model = whisper.load_model("large")

file_path = "sample_1.wav"  # path to the input audio file
prompt = '以下是普通话的句子'  # "The following are sentences in Mandarin"; steers the output toward Simplified Chinese
result = model.transcribe(file_path, language='zh', verbose=True, initial_prompt=prompt)
print(result["text"])

Origin: https://blog.csdn.net/wjinjie/article/details/130762112