1. Introduction to Whisper
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.
2. Whisper parameters
1. -h, --help
Show the help message listing all whisper parameters
2. --model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}
The model to use, default value: small
3. --model_dir MODEL_DIR
The directory where model files are saved, default value: ~/.cache/whisper
4. --device DEVICE
The device used for PyTorch inference, default value: cuda if a CUDA GPU is available, otherwise cpu
5. --output_dir OUTPUT_DIR, -o OUTPUT_DIR
The directory where the output files are saved, default value: current directory
6. --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
Output file format, default value: all
7. --verbose VERBOSE
Whether to print progress and debug messages, default value: True
8. --task {transcribe,translate}
transcribe: speech to text in the spoken language
translate: speech to English text
Default: transcribe
9. --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
The language spoken in the audio file; if it is set to None, language detection is performed
Default: None
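As a rough illustration of how the options so far combine, the sketch below assembles a whisper command line as an argument list. The helper name build_whisper_command and the file name audio.mp3 are hypothetical; only flags described above are used.

```python
import shlex


def build_whisper_command(audio_path, model="small", task="transcribe",
                          language=None, output_format="all", output_dir="."):
    """Assemble an argument list for the whisper CLI (hypothetical helper)."""
    cmd = [
        "whisper", audio_path,
        "--model", model,
        "--task", task,
        "--output_format", output_format,
        "--output_dir", output_dir,
    ]
    # Omitting --language leaves it at None, so the language is auto-detected.
    if language is not None:
        cmd += ["--language", language]
    return cmd


# audio.mp3 is a placeholder file name.
cmd = build_whisper_command("audio.mp3", model="base", task="translate", language="ja")
print(shlex.join(cmd))
```

If whisper is installed, a list built this way could be passed to subprocess.run(cmd, check=True); passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.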
10. --temperature TEMPERATURE
The temperature to use for sampling
Default: 0
11. --best_of BEST_OF
Number of candidates to sample when the temperature is non-zero
Default: 5
12. --beam_size BEAM_SIZE
Number of beams in beam search; only applicable when the temperature is 0
Default: 5
13. --patience PATIENCE
Optional patience value to use in beam decoding, see https://arxiv.org/abs/2204.05424; when left unset, a value of 1.0 is used, which is equivalent to conventional beam search
Default: None
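The relationship between --temperature, --best_of, --beam_size and --patience can be sketched as follows. This follows the descriptions above rather than Whisper's internal code, and the function name is hypothetical: at temperature 0 the beam search options apply, and at any non-zero temperature sampling with best_of candidates applies instead.

```python
def options_for_temperature(temperature, best_of=5, beam_size=5, patience=None):
    """Return only the decoding options that apply at this temperature (sketch)."""
    if temperature > 0:
        # Sampling: best_of candidates are drawn; beam search options are ignored.
        return {"temperature": temperature, "best_of": best_of}
    # Temperature 0: beam search is used; best_of is ignored.
    return {"temperature": temperature, "beam_size": beam_size, "patience": patience}


print(options_for_temperature(0.0))
print(options_for_temperature(0.4))
```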
14. --length_penalty LENGTH_PENALTY
Optional token length penalty coefficient (alpha), see https://arxiv.org/abs/1609.08144; simple length normalization is used by default
Default: None
15. --suppress_tokens SUPPRESS_TOKENS
Comma-separated list of token IDs to suppress during sampling; '-1' suppresses most special characters except common punctuation
Default: -1
16. --initial_prompt INITIAL_PROMPT
Optional text to provide as a prompt for the first window
Default: None
17. --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
If True, the previous output of the model is used as a prompt for the next window; disabling this may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop
Default: True
18. --fp16 FP16
Whether to perform inference in FP16 (half precision)
Default: True
19. --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
The amount by which the temperature is increased each time decoding fails either of the thresholds below and falls back to a retry
Default: 0.2
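A minimal sketch of the resulting retry schedule, assuming the temperature starts at --temperature and rises by the increment until it reaches 1.0. The function name and the maximum parameter are illustrative, not part of the whisper CLI.

```python
def temperature_schedule(temperature=0.0, increment=0.2, maximum=1.0):
    """List the temperatures tried in order when decoding keeps failing (sketch)."""
    if increment is None or increment <= 0:
        # No fallback configured: only the starting temperature is tried.
        return [temperature]
    schedule = []
    step = 0
    while True:
        # Round to avoid floating-point noise such as 0.6000000000000001.
        value = round(temperature + step * increment, 6)
        if value > maximum + 1e-6:
            break
        schedule.append(value)
        step += 1
    return schedule


print(temperature_schedule())
```

With the defaults, decoding is first attempted at temperature 0 and retried at progressively higher temperatures only if the checks described below fail.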
20. --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
If the gzip compression ratio of the decoded text is higher than this value, the decoding is treated as failed (highly compressible text usually means the output is repeating itself)
Default: 2.4
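The compression-ratio check can be sketched with Python's zlib module, which uses the same DEFLATE algorithm as gzip. The function names here are illustrative; the idea is simply that repetitive text compresses far better than ordinary speech.

```python
import zlib


def compression_ratio(text):
    """Ratio of the raw UTF-8 size to the zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def is_too_repetitive(text, threshold=2.4):
    """True when the text compresses better than the threshold allows."""
    return compression_ratio(text) > threshold


looping_output = "la la la " * 100
normal_output = "The quick brown fox jumps over the lazy dog."
print(is_too_repetitive(looping_output), is_too_repetitive(normal_output))
```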
21. --logprob_threshold LOGPROB_THRESHOLD
If the average log probability of the decoded tokens is lower than this value, the decoding is treated as failed
Default: -1.0
22. --no_speech_threshold NO_SPEECH_THRESHOLD
If the probability of the <|nospeech|> token is higher than this value and the decoding has failed because of `logprob_threshold`, the segment is treated as silence
Default: 0.6
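How the three thresholds interact can be sketched as two small checks. This is a simplified reading of the descriptions above, not Whisper's actual implementation, and the function names are hypothetical.

```python
def needs_fallback(compression_ratio, avg_logprob,
                   compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """Decoding counts as failed if the text is too repetitive or too unlikely."""
    too_repetitive = compression_ratio > compression_ratio_threshold
    too_unlikely = avg_logprob < logprob_threshold
    return too_repetitive or too_unlikely


def is_silent(no_speech_prob, avg_logprob,
              no_speech_threshold=0.6, logprob_threshold=-1.0):
    """A segment is treated as silence only when both conditions hold."""
    return no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold


# Confident decoding with a high no-speech probability is still kept as speech.
print(is_silent(no_speech_prob=0.9, avg_logprob=-0.3))
```

Requiring both conditions for silence means a segment with a high no-speech probability is still kept when the model decoded it with high confidence.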
23. --word_timestamps WORD_TIMESTAMPS
Extract word-level timestamps and refine results based on them (experimental)
Default: False
24. --prepend_punctuations PREPEND_PUNCTUATIONS
If word_timestamps is set to True, merge these punctuation marks with the next word
Default: "'“¿([{-
25. --append_punctuations APPEND_PUNCTUATIONS
If word_timestamps is set to True, merge these punctuation marks with the previous word
Default: "'.。,，!！?？:：”)]}、
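The merging behaviour of these two options can be illustrated on a plain list of strings. This is a simplified sketch: Whisper itself operates on word timing objects, and the default sets below are only an ASCII subset chosen for brevity.

```python
def merge_punctuations(words, prepend="\"'([{-", append="\"'.,!?:)]}"):
    """Attach prepend marks to the next word and append marks to the previous one."""
    prepend_marks, append_marks = set(prepend), set(append)
    merged = []
    carry = ""  # prepend marks waiting for the next word
    for word in words:
        if word in prepend_marks:
            carry += word
        elif word in append_marks and merged:
            merged[-1] += word
        else:
            merged.append(carry + word)
            carry = ""
    if carry:
        # Trailing marks with no following word stay as their own entry.
        merged.append(carry)
    return merged


print(merge_punctuations(["(", "hello", ",", "world", ")", "!"]))
```

Marks that appear in both sets, such as quotation marks, are treated as prepend marks in this sketch because that check runs first.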
26. --highlight_words HIGHLIGHT_WORDS
Underline each word as it is spoken in srt and vtt output (requires --word_timestamps True)
Default: False
27. --max_line_width MAX_LINE_WIDTH
The maximum number of characters in a line before a line break (requires --word_timestamps True)
Default: None
28. --max_line_count MAX_LINE_COUNT
The maximum number of lines in a segment (requires --word_timestamps True)
Default: None
29. --threads THREADS
The number of threads used by torch for CPU inference; this overrides MKL_NUM_THREADS/OMP_NUM_THREADS
Default: 0