Parameter reference for the Whisper speech recognition model

1. Introduction to Whisper

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.

2. Whisper parameters

1、-h, --help

Show the help message listing all Whisper parameters

2、--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large} 

Select the model to use, default value: small

3、--model_dir MODEL_DIR

Directory where the model files are saved, default value: ~/.cache/whisper

4、--device DEVICE

The device used for PyTorch inference, default: cuda if a CUDA device is available, otherwise cpu

5、--output_dir OUTPUT_DIR, -o OUTPUT_DIR

The directory where the output results are saved, default value: current directory

6、--output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}

Output file format, default value: all

7、--verbose VERBOSE

Whether to print progress and debug information, default value: true

8、--task {transcribe,translate}

transcribe: speech to text in the spoken language

translate: speech to English text

Default: transcribe

9、--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}

The language spoken in the audio file; if set to None, language detection is performed

Default: None

10、--temperature TEMPERATURE

Temperature to use for sampling

Default: 0

11、--best_of BEST_OF 

Number of candidates when sampling at non-zero temperature

Default: 5

12、--beam_size BEAM_SIZE

Number of beams in beam search; only applicable when the temperature is 0

Default: 5
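
The interplay between --temperature, --best_of and --beam_size can be sketched as follows. This is a minimal illustration of the rule described above, not Whisper's own code; the helper name `decoding_strategy` is hypothetical.

```python
def decoding_strategy(temperature=0.0, best_of=5, beam_size=5):
    """Return the search settings that apply for a given temperature.

    Hypothetical helper: it only mirrors the rule stated above, namely that
    best_of is used when sampling (temperature > 0) and beam_size is used
    when the temperature is 0.
    """
    if temperature > 0:
        # Sampling: draw best_of candidates and keep the most likely one.
        return {"mode": "sampling", "temperature": temperature, "best_of": best_of}
    # Temperature 0: deterministic decoding with beam search.
    return {"mode": "beam_search", "temperature": 0.0, "beam_size": beam_size}


print(decoding_strategy())      # settings used with the default temperature
print(decoding_strategy(0.4))   # settings used once sampling is active
```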

13、--patience PATIENCE

Optional patience value to use in beam decoding, see https://arxiv.org/abs/2204.05424; the default (1.0) is equivalent to conventional beam search

Default: None
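
As a rough sketch of what patience changes: the cited paper stops beam search once patience × beam_size finished candidates have been collected, so a value of 1.0 reproduces the usual stopping rule. The function name `max_finished_candidates` is hypothetical, and treating None as 1.0 is an assumption based on the description above.

```python
def max_finished_candidates(beam_size=5, patience=None):
    """Number of finished hypotheses to collect before beam search stops."""
    # Assumption: an unset patience behaves like 1.0 (conventional beam search).
    effective_patience = 1.0 if patience is None else patience
    return round(beam_size * effective_patience)


print(max_finished_candidates(5))        # conventional beam search
print(max_finished_candidates(5, 2.0))   # a more patient search keeps going longer
```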

14、--length_penalty LENGTH_PENALTY

Optional token length penalty coefficient (alpha), see https://arxiv.org/abs/1609.08144; simple length normalization is used by default

Default: None
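
The two scoring modes can be compared with a short sketch. It assumes the length penalty from the cited paper, ((5 + length) / 6) ** alpha, and plain division by the token count when no alpha is given; the function name `length_normalized_score` is hypothetical.

```python
def length_normalized_score(sum_logprob, length, length_penalty=None):
    """Score a hypothesis by its summed log probability, adjusted for length."""
    if length_penalty is None:
        # Simple length normalization: average log probability per token.
        penalty = length
    else:
        # Length penalty from arXiv:1609.08144 with coefficient alpha.
        penalty = ((5 + length) / 6) ** length_penalty
    return sum_logprob / penalty


print(length_normalized_score(-6.0, 3))        # simple normalization
print(length_normalized_score(-6.0, 7, 1.0))   # penalty with alpha = 1.0
```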

15、--suppress_tokens SUPPRESS_TOKENS

Comma-separated list of token IDs to suppress during sampling; '-1' will suppress most special characters except common punctuation

Default: -1

16、--initial_prompt INITIAL_PROMPT

Optional text to provide as a prompt for the first window

Default: None

17、--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT

If True, the previous output of the model is provided as a prompt for the next window; disabling it may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop

Default: True

18、--fp16 FP16

Whether to perform inference in FP16 (half precision)

Default: True

19、--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK

Amount by which the temperature is increased when decoding fails to meet either of the thresholds below and has to be retried

Default value: 0.2
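
To make the fallback concrete, here is a minimal sketch of the sequence of temperatures that would be tried. It assumes the schedule starts at --temperature and stops at an upper limit of 1.0; that limit and the helper name `temperature_schedule` are assumptions, not part of the parameter list.

```python
def temperature_schedule(temperature=0.0, increment=0.2, max_temperature=1.0):
    """List the temperatures tried in order, from the first attempt to the last retry."""
    if increment is None or increment <= 0:
        # No fallback configured: only the initial temperature is used.
        return [temperature]
    schedule = []
    step = 0
    while True:
        # Round to avoid floating-point drift such as 0.6000000000000001.
        value = round(temperature + step * increment, 6)
        if value > max_temperature + 1e-6:
            break
        schedule.append(value)
        step += 1
    return schedule


print(temperature_schedule())
```

Each retry moves one step further along this list, so a decoding that keeps failing is attempted with progressively more random sampling.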

20、--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD

If the gzip compression ratio is higher than this value, the decoding is treated as failed

Default: 2.4
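
The idea behind this check is that highly repetitive text compresses very well, which usually signals a decoding loop. A minimal sketch, assuming the ratio is the UTF-8 byte length divided by the compressed length and using Python's zlib module (the same DEFLATE algorithm that gzip uses); the function names are hypothetical.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw size to compressed size; higher means more repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def is_too_repetitive(text, threshold=2.4):
    """Apply the compression ratio threshold to a decoded segment."""
    return compression_ratio(text) > threshold


print(is_too_repetitive("la la la " * 100))                              # looping output
print(is_too_repetitive("The quick brown fox jumps over the lazy dog."))  # ordinary sentence
```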

21、--logprob_threshold LOGPROB_THRESHOLD

If the average log probability is lower than this value, the decoding is treated as failed

Default: -1.0

22、--no_speech_threshold NO_SPEECH_THRESHOLD

If the probability of the <|nospeech|> token is higher than this value and the decoding has failed because of `logprob_threshold`, the segment is treated as silence

Default: 0.6
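
The two conditions in this rule combine as shown in the sketch below. It is an illustration of the description above rather than Whisper's code, and the function name `is_silent_segment` is hypothetical.

```python
def is_silent_segment(no_speech_prob, avg_logprob,
                      no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Decide whether a segment should be treated as containing no speech."""
    if no_speech_threshold is None:
        return False
    silent = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        # The decoding itself was confident enough, so keep the segment.
        silent = False
    return silent


print(is_silent_segment(0.9, -1.5))   # likely silence and a poor decoding
print(is_silent_segment(0.9, -0.3))   # likely silence, but the decoding looks fine
```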

23、--word_timestamps WORD_TIMESTAMPS

Extract word-level timestamps and refine the results based on them (experimental)

Default: False

24、--prepend_punctuations PREPEND_PUNCTUATIONS

If word_timestamps is set to True, merge these punctuation marks with the next word

Default: "'“¿([{-

25、--append_punctuations APPEND_PUNCTUATIONS

If word_timestamps is set to True, merge these punctuation marks with the previous word

Default: "'.。,，!！?？:：”)]}、
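
What "merging" means for --prepend_punctuations and --append_punctuations can be sketched on a plain list of tokens. This is a simplified illustration: real word timestamps also carry start and end times, the punctuation sets here are shortened examples rather than the defaults, and the function name `merge_punctuation_tokens` is hypothetical.

```python
def merge_punctuation_tokens(tokens, prepended="([{", appended=".,!?:)]}"):
    """Attach opening punctuation to the next word and closing punctuation to the previous one."""
    opening, closing = set(prepended), set(appended)
    merged = []
    prefix = ""  # opening punctuation waiting for the next word
    for token in tokens:
        if token in opening:
            prefix += token
        elif token in closing and merged:
            merged[-1] += token  # attach to the previous word
        else:
            merged.append(prefix + token)
            prefix = ""
    if prefix:
        merged.append(prefix)  # opening punctuation with no word after it
    return merged


print(merge_punctuation_tokens(["Hello", ",", "(", "world", ")", "!"]))
```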

26、--highlight_words HIGHLIGHT_WORDS

Underline each word as it is spoken in srt and vtt output (requires --word_timestamps True)

Default: False

27、--max_line_width MAX_LINE_WIDTH

The maximum number of characters in a line before a line break is inserted (requires --word_timestamps True)

Default: None

28、--max_line_count MAX_LINE_COUNT

The maximum number of lines in a segment (requires --word_timestamps True)

Default: None
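
How --max_line_width and --max_line_count shape subtitles can be shown with a simplified sketch. It works on words only and ignores timestamps, so it is an approximation of the behavior described above and not Whisper's subtitle writer; the function name `wrap_subtitle` is hypothetical.

```python
def wrap_subtitle(words, max_line_width, max_line_count=None):
    """Group words into subtitle segments; each segment is a list of lines."""
    segments, lines, current = [], [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if current and len(candidate) > max_line_width:
            # The line is full: close it and start a new one with this word.
            lines.append(current)
            current = word
            if max_line_count is not None and len(lines) == max_line_count:
                # The segment has reached its line limit: start a new segment.
                segments.append(lines)
                lines = []
        else:
            current = candidate
    if current:
        lines.append(current)
    if lines:
        segments.append(lines)
    return segments


words = "the quick brown fox jumps over the lazy dog".split()
for segment in wrap_subtitle(words, max_line_width=10, max_line_count=2):
    print(segment)
```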

29、--threads THREADS

Number of threads used by torch for CPU inference; overrides MKL_NUM_THREADS/OMP_NUM_THREADS

Default: 0
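
To show how several of these parameters fit together, the sketch below assembles a command line from Python without running it. It assumes the `whisper` command takes the audio file as a positional argument before the options; the file name `talk.mp3` and the helper name `build_whisper_command` are hypothetical.

```python
import shlex


def build_whisper_command(audio_path, model="small", language=None,
                          task="transcribe", output_format="all",
                          output_dir=".", threads=None):
    """Assemble a whisper CLI invocation as an argument list (nothing is executed)."""
    cmd = ["whisper", audio_path,
           "--model", model,
           "--task", task,
           "--output_format", output_format,
           "--output_dir", output_dir]
    if language is not None:
        # Omitting --language leaves language detection switched on.
        cmd += ["--language", language]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


command = build_whisper_command("talk.mp3", language="ja", task="translate")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` once Whisper is installed; keeping the arguments as a list avoids shell quoting problems with file names that contain spaces.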
