Speech to text Beta

Learn how to turn audio into text

Introduction

The speech to text API provides two endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model. They can be used to:

  • Transcribe audio into whatever language the audio is in.
  • Translate and transcribe the audio into English.

File uploads are currently limited to 25 MB and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.

Quickstart

Transcriptions

The transcriptions API takes as input the audio file you want to transcribe and the desired output file format for the transcription of the audio. We currently support multiple input and output file formats.

Transcribe audio

python
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai
audio_file = open("/path/to/file/audio.mp3", "rb")
transcript = openai.Audio.transcribe("whisper-1", audio_file)

By default, the response type will be json with the raw text included.

{ "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. … } { "text": "Imagine you've ever
had The craziest idea ever, and you wonder how it scales to something 100 times, 1000 times bigger. ...}
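
When calling from Python as above, the transcribed text can be read straight off the response; a minimal sketch, assuming the v0.27.0 response object supports dict-style access:

# Print just the transcribed text from the json response
print(transcript["text"])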

To set additional parameters in a request, you can add more --form lines with the relevant options. For example, if you want to set the output format as text, you would add the following line:

...
--form file=@openai.mp3 \
--form model=whisper-1 \
--form response_format=text
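
The Python library exposes the same option; a minimal sketch, assuming the v0.27.0 transcribe helper forwards keyword arguments such as response_format to the API:

import openai

audio_file = open("/path/to/file/openai.mp3", "rb")
# Request plain text instead of the default json response
transcript = openai.Audio.transcribe(
    "whisper-1",
    audio_file,
    response_format="text",
)
print(transcript)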

Translations

The translations API takes as input the audio file in any of the supported languages and transcribes, if necessary, the audio into English. This differs from our /transcriptions endpoint since the output is not in the original input language and is instead translated to English text.

Translate audio

python
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai
audio_file = open("/path/to/file/german.mp3", "rb")
transcript = openai.Audio.translate("whisper-1", audio_file)

In this case, the input audio was German and the output text looks like:

Hello, my name is Wolfgang and I come from Germany. Where are you heading today?

We only support translation into English at this time.

Supported languages

We currently support the following languages through both the transcriptions and translations endpoints:

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

While the underlying model was trained on 98 languages, we only list the languages that fall below a 50% word error rate (WER), an industry standard benchmark for speech to text model accuracy. The model will return results for languages not listed above, but the quality will be low.

Longer inputs

By default, the Whisper API only supports files that are less than 25 MB. If you have an audio file that is longer than that, you will need to break it up into chunks of 25 MB or less, or use a compressed audio format. To get the best performance, we suggest that you avoid breaking the audio up mid-sentence, as this may cause some context to be lost.

One way to handle this is to use the PyDub open source Python package to split the audio:

from pydub import AudioSegment

song = AudioSegment.from_mp3("good_morning.mp3")

# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000

# Slice out the first ten minutes of audio
first_10_minutes = song[:ten_minutes]

# Export the chunk as its own mp3 file
first_10_minutes.export("good_morning_10.mp3", format="mp3")

OpenAI makes no guarantees about the usability or security of 3rd party software like PyDub.
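
Building on that snippet, one way to handle a longer file end to end is to export each chunk and transcribe the chunks separately, concatenating the results. A minimal sketch, assuming PyDub and the v0.27.0 Python library used earlier (file names here are illustrative):

import openai
from pydub import AudioSegment

song = AudioSegment.from_mp3("good_morning.mp3")
ten_minutes = 10 * 60 * 1000  # PyDub handles time in milliseconds

full_transcript = []
for i, start in enumerate(range(0, len(song), ten_minutes)):
    # Note: fixed-size chunks can split the audio mid-sentence, which the
    # guidance above advises against; adjust boundaries if context is lost.
    chunk = song[start:start + ten_minutes]
    chunk_path = f"good_morning_part_{i}.mp3"
    chunk.export(chunk_path, format="mp3")
    with open(chunk_path, "rb") as audio_file:
        result = openai.Audio.transcribe("whisper-1", audio_file)
    full_transcript.append(result["text"])

print(" ".join(full_transcript))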

Prompting

You can use a prompt to improve the quality of the transcripts generated by the Whisper API. The model will try to match the style of the prompt, so it will be more likely to use capitalization and punctuation if the prompt does too. However, the current prompting system is much more limited than our other language models and only provides limited control over the generated audio. Here are some examples of how prompting can help in different scenarios, followed by a short code sketch after the list:

  1. Prompts can be very helpful for correcting specific words or acronyms that the model often misrecognizes in the audio. For example, the following prompt improves the transcription of the words DALL·E and GPT-3, which were previously written as “GDP 3” and “DALI”.


    The transcript is about OpenAI which makes technology like DALL·E, GPT-3, and ChatGPT with the hope of one day building an AGI system that benefits all of humanity

  2. To preserve the context of a file that was split into segments, you can prompt the model with the transcript of the preceding segment. This will make the transcript more accurate, as the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier.

  3. Sometimes the model might skip punctuation in the transcript. You can avoid this by using a simple prompt that includes punctuation:

    Hello, welcome to my lecture.

  4. The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, you can use a prompt that contains them:


    Umm, let me think like, hmm... Okay, here's what I'm, like, thinking.

  5. Some languages can be written in different ways, such as simplified or traditional Chinese. The model might not always use the writing style that you want for your transcript by default. You can improve this by using a prompt in your preferred writing style.
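
As a concrete example of the above, a prompt is passed to the transcriptions endpoint as a request parameter. A minimal sketch with the v0.27.0 Python library, reusing the acronym prompt from item 1:

import openai

audio_file = open("/path/to/file/audio.mp3", "rb")
# The prompt nudges the model toward the intended spellings and style
transcript = openai.Audio.transcribe(
    "whisper-1",
    audio_file,
    prompt="The transcript is about OpenAI which makes technology like DALL·E, GPT-3, and ChatGPT with the hope of one day building an AGI system that benefits all of humanity",
)
print(transcript["text"])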
