Introduction to OpenAI's Whisper speech recognition

OpenAI released ChatGPT, and the buzz around it has not faded for quite a while. But ChatGPT is not OpenAI's only project; its other projects are also well worth researching and studying.

Today, let's talk about the Whisper project:
https://github.com/openai/whisper

It is a speech recognition project. It proposes an approach to speech recognition through large-scale weak supervision. Weak supervision refers to training models with incomplete or imprecise labels or annotations. This approach avoids the time-consuming and labor-intensive manual labeling of data, and also allows more data to be used to improve the model's performance.

In this approach, a large amount of unlabeled speech data plus some labeled data is used to train a deep learning model. The model automatically learns how to extract features from speech signals and convert them into text.
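You can actually see this "features in, text out" pipeline in the library's lower-level API. Jumping ahead a little (installation is covered in the steps below), here is a sketch adapted from the example in the README; the file name is just a placeholder:

import whisper

model = whisper.load_model("base")

# load the audio file and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# compute the log-Mel spectrogram (the "features") and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the spectrogram into text
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)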

Let's take a look at the official description (whether or not you can follow it; I don't really understand it myself):
[Image: the approach overview diagram from the Whisper README]

The authors used Python 3.9.9 and PyTorch 1.10.1, but the codebase is expected to be compatible with Python 3.8-3.10 and recent PyTorch versions. (I tried 3.11 myself and it was not compatible, so I am sticking with 3.9.)
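If you want to double-check your environment before going further, a trivial snippet (assuming PyTorch is already installed; nothing here is Whisper-specific):

import sys
import torch

print(sys.version)        # Whisper is stated to work with Python 3.8-3.10
print(torch.__version__)  # the authors used PyTorch 1.10.1; newer releases should also work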

Usage is also very simple; it is perfect for those of us who prefer to just call a library rather than build things from scratch.

Step 1: Install the Python library

python3 -m pip install openai-whisper
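Once the install finishes, a quick way to confirm the package imports correctly and to see which model names it ships with (this also previews step 3 below):

import whisper

# prints names like tiny, base, small, medium, large plus their English-only ".en" variants
print(whisper.available_models())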

Step 2: Install FFmpeg

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

On Windows, I personally recommend using Scoop to install FFmpeg; Chocolatey is too much hassle.
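Whisper shells out to the ffmpeg executable to decode audio, so it is worth confirming the binary is actually on your PATH; a minimal check using only the standard library:

import shutil

# prints the path to ffmpeg if it is found, otherwise a reminder
print(shutil.which("ffmpeg") or "ffmpeg not found - check your PATH")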

Step 3: Choose the model to use

Officially, there are 5 model sizes, 4 of which also come in English-only versions. In actual testing, Chinese is supported as well (I only tested the base model with Chinese; I have not tested the others, but they should work too).
[Image: table of the available models and their sizes, from the Whisper README]

Although Chinese is supported, the results are not entirely satisfactory: the word error rate (WER) for Chinese is not low, sitting roughly in the middle among all supported languages.
[Image: WER breakdown by language, from the Whisper README]

Step 4: How to use it

There are several methods:
1. Command line mode

whisper audio.flac audio.mp3 audio.wav --model medium
  • For non-English languages, add the --language parameter, for example Japanese:
whisper japanese.wav --language Japanese

There are quite a lot of supported languages (you can also print this list yourself; see the snippet after the command-line examples):

LANGUAGES = {
    "en": "english",
    "zh": "chinese",
    "de": "german",
    "es": "spanish",
    "ru": "russian",
    "ko": "korean",
    "fr": "french",
    "ja": "japanese",
    "pt": "portuguese",
    "tr": "turkish",
    "pl": "polish",
    "ca": "catalan",
    "nl": "dutch",
    "ar": "arabic",
    "sv": "swedish",
    "it": "italian",
    "id": "indonesian",
    "hi": "hindi",
    "fi": "finnish",
    "vi": "vietnamese",
    "he": "hebrew",
    "uk": "ukrainian",
    "el": "greek",
    "ms": "malay",
    "cs": "czech",
    "ro": "romanian",
    "da": "danish",
    "hu": "hungarian",
    "ta": "tamil",
    "no": "norwegian",
    "th": "thai",
    "ur": "urdu",
    "hr": "croatian",
    "bg": "bulgarian",
    "lt": "lithuanian",
    "la": "latin",
    "mi": "maori",
    "ml": "malayalam",
    "cy": "welsh",
    "sk": "slovak",
    "te": "telugu",
    "fa": "persian",
    "lv": "latvian",
    "bn": "bengali",
    "sr": "serbian",
    "az": "azerbaijani",
    "sl": "slovenian",
    "kn": "kannada",
    "et": "estonian",
    "mk": "macedonian",
    "br": "breton",
    "eu": "basque",
    "is": "icelandic",
    "hy": "armenian",
    "ne": "nepali",
    "mn": "mongolian",
    "bs": "bosnian",
    "kk": "kazakh",
    "sq": "albanian",
    "sw": "swahili",
    "gl": "galician",
    "mr": "marathi",
    "pa": "punjabi",
    "si": "sinhala",
    "km": "khmer",
    "sn": "shona",
    "yo": "yoruba",
    "so": "somali",
    "af": "afrikaans",
    "oc": "occitan",
    "ka": "georgian",
    "be": "belarusian",
    "tg": "tajik",
    "sd": "sindhi",
    "gu": "gujarati",
    "am": "amharic",
    "yi": "yiddish",
    "lo": "lao",
    "uz": "uzbek",
    "fo": "faroese",
    "ht": "haitian creole",
    "ps": "pashto",
    "tk": "turkmen",
    "nn": "nynorsk",
    "mt": "maltese",
    "sa": "sanskrit",
    "lb": "luxembourgish",
    "my": "myanmar",
    "bo": "tibetan",
    "tl": "tagalog",
    "mg": "malagasy",
    "as": "assamese",
    "tt": "tatar",
    "haw": "hawaiian",
    "ln": "lingala",
    "ha": "hausa",
    "ba": "bashkir",
    "jw": "javanese",
    "su": "sundanese",
}

  • With the --task translate parameter, the speech content will be translated into English:
whisper japanese.wav --language Japanese --task translate
  • For other options, use the help command:
whisper --help
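By the way, the language table above is the LANGUAGES dict from the library's tokenizer module, so you can print it yourself instead of digging through the source:

from whisper.tokenizer import LANGUAGES

# maps language codes to language names, exactly as listed above
print(len(LANGUAGES))
for code, name in sorted(LANGUAGES.items()):
    print(code, name)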

2. Python code mode

import whisper

model = whisper.load_model("base")      # downloads the model on first use
result = model.transcribe("audio.mp3")  # language is auto-detected by default
print(result["text"])
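The CLI's --language and --task options map directly onto keyword arguments of transcribe(); a sketch (the file names are made up):

import whisper

model = whisper.load_model("base")

# transcribe Chinese audio, skipping automatic language detection
print(model.transcribe("chinese.wav", language="zh")["text"])

# translate Japanese speech into English, like --task translate on the command line
print(model.transcribe("japanese.wav", language="ja", task="translate")["text"])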

The first time you load a model, it is downloaded from the Internet (one of the five models introduced above; each size is a different download). Once the download is complete, no Internet connection is needed the next time you use it.
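If you would rather keep the downloaded weights in a known directory (for offline machines, say), load_model accepts a download_root parameter; by default the files go under ~/.cache/whisper. A small sketch:

import whisper

# store and look up model weights under ./models instead of the default cache directory
model = whisper.load_model("base", download_root="./models")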

tiny → base → small → medium → large: the model scale goes from small to large, accuracy gets higher and higher, but so does the resource usage. Choose according to your own needs; small is generally good enough.
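If you have a GPU, the larger models become much more practical; a sketch that just adds device selection (everything else stays the same):

import torch
import whisper

# use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)
print(model.transcribe("audio.mp3")["text"])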

That's the end of the main walkthrough.

Now let me talk about the usage scenarios and gripes as I see them.

Use cases:

1. Extract the audio from a video and convert it into text for the record;
2. Transcribe recorder audio to quickly review its content (sometimes the audio is too long, and reading text is faster than listening);
3. When producing your own video or audio, it can also be used to generate subtitles (see the sketch after this list).
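For the subtitle use case in item 3, here is a minimal sketch that writes an .srt file from the segments returned by transcribe() (the input file name is made up; the timestamp helper is my own):

import whisper

def srt_time(seconds: float) -> str:
    # format seconds as an SRT timestamp, e.g. 00:01:02,345
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("small")
result = model.transcribe("my_video_audio.mp3")

with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")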

Advantages:

Free of charge, works offline (once the environment is set up), and safe and worry-free: no concerns about data leaks.

Gripes:

No support for real-time speech recognition, and no support for speech synthesis.

What I actually want is local real-time speech-to-text: convert the speech to text, send the text to ChatGPT, and then synthesize ChatGPT's reply back into speech and play it. But Whisper currently cannot do real-time recognition or speech synthesis.

Source: blog.csdn.net/xkukeer/article/details/130227944