Extreme speed, lightweight transcription: real-time AI speech-to-text (subtitles/speech recognition) in C++, Whisper.cpp in practice

OpenAI's open-source Whisper model, a credit to the industry, leads the field of open-source speech-to-text. Whisper.cpp is a C/C++ port of it with no dependencies and low memory usage; most importantly, it adds Core ML support, which makes it a perfect fit for Apple's M-series chips.

Whisper.cpp's tensor operators are heavily optimized for the CPU in Apple M-series chips. Depending on the size of the computation, it uses either Arm Neon SIMD intrinsics or CBLAS routines from the Accelerate framework. The latter is particularly efficient at larger sizes, because Accelerate can use the dedicated AMX coprocessor available on Apple M-series chips.

Deploying Whisper.cpp

As usual, start by cloning the Whisper.cpp project with git:

git clone https://github.com/ggerganov/whisper.cpp.git

Then change into the project directory:

cd whisper.cpp

The project's default model, base.en, is English-only, so it cannot transcribe Chinese. The medium model is recommended here; download it with the bundled shell script:

bash ./models/download-ggml-model.sh medium

Once the download completes, the model file ggml-medium.bin is saved in the project's models directory. It is about 1.53 GB (listed as 1.4G below):

whisper.cpp git:(master) cd models   
➜  models git:(master) ll  
total 3006000  
-rw-r--r--  1 liuyue  staff   3.2K  4 21 07:21 README.md  
-rw-r--r--  1 liuyue  staff   7.2K  4 21 07:21 convert-h5-to-ggml.py  
-rw-r--r--  1 liuyue  staff   9.2K  4 21 07:21 convert-pt-to-ggml.py  
-rw-r--r--  1 liuyue  staff    13K  4 21 07:21 convert-whisper-to-coreml.py  
drwxr-xr-x  4 liuyue  staff   128B  4 22 00:33 coreml-encoder-medium.mlpackage  
-rwxr-xr-x  1 liuyue  staff   2.1K  4 21 07:21 download-coreml-model.sh  
-rw-r--r--  1 liuyue  staff   1.3K  4 21 07:21 download-ggml-model.cmd  
-rwxr-xr-x  1 liuyue  staff   2.0K  4 21 07:21 download-ggml-model.sh  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-base.bin  
-rw-r--r--  1 liuyue  staff   573K  4 21 07:21 for-tests-ggml-base.en.bin  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-large.bin  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-medium.bin  
-rw-r--r--  1 liuyue  staff   573K  4 21 07:21 for-tests-ggml-medium.en.bin  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-small.bin  
-rw-r--r--  1 liuyue  staff   573K  4 21 07:21 for-tests-ggml-small.en.bin  
-rw-r--r--  1 liuyue  staff   562K  4 21 07:21 for-tests-ggml-tiny.bin  
-rw-r--r--  1 liuyue  staff   573K  4 21 07:21 for-tests-ggml-tiny.en.bin  
-rwxr-xr-x  1 liuyue  staff   1.4K  4 21 07:21 generate-coreml-interface.sh  
-rwxr-xr-x@ 1 liuyue  staff   769B  4 21 07:21 generate-coreml-model.sh  
-rw-r--r--  1 liuyue  staff   1.4G  3 22 16:04 ggml-medium.bin
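
The size figures in this article (1.53 GB in the text, 1.4G in the listing, and 1462.12 MB in the model load log further down) describe the same file in different units. Here is a minimal sketch of the arithmetic, assuming the log's "MB" means MiB (2^20 bytes) and that the listing uses binary units. The helper name `describe_size` is made up for illustration.

```python
MIB = 1024 ** 2   # bytes per mebibyte (assumed meaning of "MB" in the load log)
GIB = 1024 ** 3   # bytes per gibibyte (assumed unit of the "1.4G" in the listing)
GB = 1000 ** 3    # bytes per decimal gigabyte


def describe_size(size_mib: float) -> dict:
    """Express a size given in MiB as decimal GB and binary GiB."""
    size_bytes = size_mib * MIB
    return {
        "decimal_gb": round(size_bytes / GB, 2),
        "binary_gib": round(size_bytes / GIB, 1),
    }


sizes = describe_size(1462.12)  # "model size" value taken from the load log
print(sizes)  # {'decimal_gb': 1.53, 'binary_gib': 1.4}
```

Under these assumptions the three numbers agree, so the download is complete rather than truncated.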

After the model has downloaded, build the executable from the project root:

make

The program prints:

whisper.cpp git:(master) make  
I whisper.cpp build info:   
I UNAME_S:  Darwin  
I UNAME_P:  arm  
I UNAME_M:  arm64  
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE  
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread  
I LDFLAGS:   -framework Accelerate  
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)  
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)  
  
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread examples/bench/bench.cpp ggml.o whisper.o -o bench  -framework Accelerate

At this point, Whisper.cpp is set up.

A quick test

Now let's transcribe an audio clip and see how it does:

./main -osrt -m ./models/ggml-medium.bin -f samples/jfk.wav

This command transcribes the bundled audio file samples/jfk.wav using the ggml-medium.bin model we just downloaded, and -osrt writes the result as an SRT subtitle file. The clip is from the famous speech by US President John F. Kennedy, who was later assassinated. The program prints:

➜  whisper.cpp git:(master) ./main -osrt -m ./models/ggml-medium.bin -f samples/jfk.wav  
whisper_init_from_file_no_state: loading model from './models/ggml-medium.bin'  
whisper_model_load: loading model  
whisper_model_load: n_vocab       = 51865  
whisper_model_load: n_audio_ctx   = 1500  
whisper_model_load: n_audio_state = 1024  
whisper_model_load: n_audio_head  = 16  
whisper_model_load: n_audio_layer = 24  
whisper_model_load: n_text_ctx    = 448  
whisper_model_load: n_text_state  = 1024  
whisper_model_load: n_text_head   = 16  
whisper_model_load: n_text_layer  = 24  
whisper_model_load: n_mels        = 80  
whisper_model_load: f16           = 1  
whisper_model_load: type          = 4  
whisper_model_load: mem required  = 1725.00 MB (+   43.00 MB per decoder)  
whisper_model_load: adding 1608 extra tokens  
whisper_model_load: model ctx     = 1462.35 MB  
whisper_model_load: model size    = 1462.12 MB  
whisper_init_state: kv self size  =   42.00 MB  
whisper_init_state: kv cross size =  140.62 MB  
  
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |   
  
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...  
  
  
[00:00:00.000 --> 00:00:11.000]   And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.  
  
output_srt: saving output to 'samples/jfk.wav.srt'

The clip is only 11 seconds long, and the subtitles are written to samples/jfk.wav.srt.

For this English clip, the transcription is 100% accurate.
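
For readers unfamiliar with the subtitle format, the sketch below turns a segment line from the console output above into a SubRip (SRT) cue. It follows the general SRT layout (cue number, a time range with comma-separated milliseconds, then the text) and is not a byte-for-byte reproduction of what Whisper.cpp writes. The function name `log_line_to_srt` is hypothetical.

```python
import re

# Matches console lines such as "[00:00:00.000 --> 00:00:11.000]   text"
SEGMENT = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})\]\s*(.*)"
)


def log_line_to_srt(index: int, line: str) -> str:
    """Convert one timestamped console line into an SRT cue."""
    match = SEGMENT.match(line.strip())
    if match is None:
        raise ValueError(f"not a segment line: {line!r}")
    start, start_ms, end, end_ms, text = match.groups()
    # SRT separates milliseconds with a comma instead of a dot.
    return f"{index}\n{start},{start_ms} --> {end},{end_ms}\n{text.strip()}\n"


line = (
    "[00:00:00.000 --> 00:00:11.000]   And so, my fellow Americans, "
    "ask not what your country can do for you, ask what you can do for your country."
)
cue = log_line_to_srt(1, line)
print(cue)
```

Any player that understands SRT can load such a file next to the audio or video it belongs to.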

Now let's switch to Chinese speech. You can record any short clip yourself. Note that Whisper.cpp only accepts audio in WAV format, so first convert the MP3 file to WAV (16 kHz, mono, 16-bit PCM) with ffmpeg:

ffmpeg -i ./test1.mp3 -ar 16000 -ac 1 -c:a pcm_s16le ./test1.wav

The program prints:

ffmpeg version 5.1.2 Copyright (c) 2000-2022 the FFmpeg developers  
  built with Apple clang version 14.0.0 (clang-1400.0.29.202)  
  configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/5.1.2_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-neon  
  libavutil      57. 28.100 / 57. 28.100  
  libavcodec     59. 37.100 / 59. 37.100  
  libavformat    59. 27.100 / 59. 27.100  
  libavdevice    59.  7.100 / 59.  7.100  
  libavfilter     8. 44.100 /  8. 44.100  
  libswscale      6.  7.100 /  6.  7.100  
  libswresample   4.  7.100 /  4.  7.100  
  libpostproc    56.  6.100 / 56.  6.100  
[mp3 @ 0x130e05580] Estimating duration from bitrate, this may be inaccurate  
Input #0, mp3, from './test1.mp3':  
  Duration: 00:05:41.33, start: 0.000000, bitrate: 48 kb/s  
  Stream #0:0: Audio: mp3, 24000 Hz, mono, fltp, 48 kb/s  
Stream mapping:  
  Stream #0:0 -> #0:0 (mp3 (mp3float) -> pcm_s16le (native))  
Press [q] to stop, [?] for help  
Output #0, wav, to './test1.wav':  
  Metadata:  
    ISFT            : Lavf59.27.100  
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s  
    Metadata:  
      encoder         : Lavc59.37.100 pcm_s16le  
[mp3float @ 0x132004260] overread, skip -6 enddists: -4 -4ed=N/A      
    Last message repeated 1 times  
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1  
[mp3float @ 0x132004260] overread, skip -7 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1  
[mp3float @ 0x132004260] overread, skip -9 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -5 enddists: -1 -1  
    Last message repeated 1 times  
[mp3float @ 0x132004260] overread, skip -7 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -8 enddists: -5 -5  
[mp3float @ 0x132004260] overread, skip -5 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -6 enddists: -1 -1  
[mp3float @ 0x132004260] overread, skip -7 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -6 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -6 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -7 enddists: -6 -6  
[mp3float @ 0x132004260] overread, skip -9 enddists: -6 -6  
[mp3float @ 0x132004260] overread, skip -5 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -5 enddists: -2 -2  
[mp3float @ 0x132004260] overread, skip -5 enddists: -3 -3  
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1  
size=   10667kB time=00:05:41.32 bitrate= 256.0kbits/s speed=2.08e+03x      
video:0kB audio:10666kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000714%

Here, a recording of five minutes and forty-one seconds of speech has been converted to a WAV file.
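
Before starting a long transcription, it can be worth confirming that the converted file really has the layout requested from ffmpeg (16 kHz, mono, 16-bit). The sketch below uses Python's standard `wave` module. The helper `check_wav` and the synthetic demo file are assumptions made for illustration; point it at your own test1.wav instead.

```python
import os
import tempfile
import wave


def check_wav(path: str) -> dict:
    """Read the WAV header and report whether it is 16 kHz, mono, 16-bit."""
    with wave.open(path, "rb") as wav:
        info = {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,
            "seconds": wav.getnframes() / wav.getframerate(),
        }
    info["ok"] = (info["sample_rate"], info["channels"], info["bits"]) == (16000, 1, 16)
    return info


# Demo on a synthetic one-second silent file (stand-in for a real recording).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)       # 2 bytes = 16 bits per sample
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

demo = check_wav(demo_path)
print(demo)

# Cross-check of the logged numbers: 5461248 samples at 16 kHz, 2 bytes each.
samples = 5461248
duration_s = samples / 16000          # about 341.3 seconds
payload_kib = samples * 2 / 1024      # about 10666 KiB of audio data
print(round(duration_s, 1), int(payload_kib))
```

The cross-check at the end matches the ffmpeg summary (audio:10666kB) and the sample count that Whisper.cpp reports when it processes the file.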

Next, with test1.wav placed in the samples directory (the path used below), run the transcription command:

./main -osrt -m ./models/ggml-medium.bin -f samples/test1.wav -l zh

Here you need the -l parameter to tell the program that the speech is Chinese (zh). The program prints:

➜  whisper.cpp git:(master) ./main -osrt -m ./models/ggml-medium.bin -f samples/test1.wav -l zh  
whisper_init_from_file_no_state: loading model from './models/ggml-medium.bin'  
whisper_model_load: loading model  
whisper_model_load: n_vocab       = 51865  
whisper_model_load: n_audio_ctx   = 1500  
whisper_model_load: n_audio_state = 1024  
whisper_model_load: n_audio_head  = 16  
whisper_model_load: n_audio_layer = 24  
whisper_model_load: n_text_ctx    = 448  
whisper_model_load: n_text_state  = 1024  
whisper_model_load: n_text_head   = 16  
whisper_model_load: n_text_layer  = 24  
whisper_model_load: n_mels        = 80  
whisper_model_load: f16           = 1  
whisper_model_load: type          = 4  
whisper_model_load: mem required  = 1725.00 MB (+   43.00 MB per decoder)  
whisper_model_load: adding 1608 extra tokens  
whisper_model_load: model ctx     = 1462.35 MB  
whisper_model_load: model size    = 1462.12 MB  
whisper_init_state: kv self size  =   42.00 MB  
whisper_init_state: kv cross size =  140.62 MB  
  
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |   
  
main: processing 'samples/test1.wav' (5461248 samples, 341.3 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...  
  
  
[00:00:00.000 --> 00:00:03.340]  Hello 大家好,这里是刘越的技术博客。  
[00:00:03.340 --> 00:00:05.720]  最近的事情大家都晓得了,  
[00:00:05.720 --> 00:00:07.880]  某公司技术经理魅上欺下,  
[00:00:07.880 --> 00:00:10.380]  打工人应对进队,不易快灾,  
[00:00:10.380 --> 00:00:12.020]  不易壮灾,  
[00:00:12.020 --> 00:00:14.280]  所谓魅上者必欺下,  
[00:00:14.280 --> 00:00:16.020]  古人诚不我窃。  
[00:00:16.020 --> 00:00:17.360]  技术经理者,  
[00:00:17.360 --> 00:00:20.160]  公然在聊天群里大玩职场PUA,  
[00:00:20.160 --> 00:00:22.400]  气焰嚣张,有恃无恐,  
[00:00:22.400 --> 00:00:23.700]  最终引发众目,  
[00:00:23.700 --> 00:00:26.500]  嘿嘿,技术经理,团队领导,  
[00:00:26.500 --> 00:00:29.300]  原来团队领导这四个字是这么用的,  
[00:00:29.300 --> 00:00:31.540]  奴媚显达,构陷下属,  
[00:00:31.540 --> 00:00:32.780]  人文巨损,  
[00:00:32.780 --> 00:00:33.840]  逢迎上意,  
[00:00:33.840 --> 00:00:34.980]  傲然下欺,  
[00:00:34.980 --> 00:00:36.080]  装腔作势,  
[00:00:36.080 --> 00:00:37.180]  极尽投机,  
[00:00:37.180 --> 00:00:38.320]  负他人之负,  
[00:00:38.320 --> 00:00:39.620]  康他人之愷,  
[00:00:39.620 --> 00:00:42.180]  如此者,可谓团队领导也。  
[00:00:42.180 --> 00:00:43.980]  中国的所谓传统文化,  
[00:00:43.980 --> 00:00:45.320]  除了仁义理智性,  
[00:00:45.320 --> 00:00:46.620]  除了金石子极,  
[00:00:46.620 --> 00:00:47.820]  除了争争风骨,  
[00:00:47.820 --> 00:00:49.560]  其实还有很多别的东西,  
[00:00:49.560 --> 00:00:52.020]  被大家或有意或无意的忽视了,  
[00:00:52.020 --> 00:00:53.300]  比如功利实用,  
[00:00:53.300 --> 00:00:54.300]  屈颜附示,  
[00:00:54.300 --> 00:00:55.360]  以兼至善,  
[00:00:55.360 --> 00:01:01.000]  官本位和钱规则的传统,在某种程度上,传统文化这没硬币的另一面,  
[00:01:01.000 --> 00:01:03.900]  才是更需要我们去面对和正视的,  
[00:01:03.900 --> 00:01:07.140]  我以为,这在目前盛行实惠价值观的时候,  
[00:01:07.140 --> 00:01:08.940]  提一提还是必要的,  
[00:01:08.940 --> 00:01:10.240]  有的人说了,  
[00:01:10.240 --> 00:01:13.740]  在开发群里对领导,非常痛快,非常爽,  
[00:01:13.740 --> 00:01:17.180]  但是,然后呢,有用吗?  
[00:01:17.180 --> 00:01:19.260]  倒霉的还不是自己,  
[00:01:19.260 --> 00:01:22.520]  没错,这就是功利且实用的传统,  
[00:01:22.520 --> 00:01:28.780]  各种精神,思辨,反抗,愤怒,都抵不过三个字,有用吗?  
[00:01:28.780 --> 00:01:31.820]  事实上,但凡叫做某种精神的,  
[00:01:31.820 --> 00:01:33.320]  那就是哲学思辨,  
[00:01:33.320 --> 00:01:36.220]  就是一种相对无用的思辨和学术,  
[00:01:36.220 --> 00:01:39.180]  而中国职场有很强的实用传统,  
[00:01:39.180 --> 00:01:42.140]  但这不是学术思辨,也没有理论构架,  
[00:01:42.140 --> 00:01:44.380]  仅仅是一种短视的经验论,  
[00:01:44.380 --> 00:01:47.220]  所以,功利主义,是密尔,  
[00:01:47.220 --> 00:01:48.980]  编庆的伦理价值学说,  
[00:01:48.980 --> 00:01:52.700]  强调的是,追求幸福,如何获得最大效用,  
[00:01:52.700 --> 00:01:55.580]  实用主义,是西方的一个学术流派,  
[00:01:55.580 --> 00:01:58.260]  比如杜威,胡适,就是代表,  
[00:01:58.260 --> 00:02:01.180]  实用主义的另一个名字,叫人本主义,  
[00:02:01.180 --> 00:02:04.780]  意思是,以人作为经验和万物的尺度,  
[00:02:04.780 --> 00:02:06.080]  换句话说,  
[00:02:06.080 --> 00:02:09.420]  功利主义,反对的正是那种短视的功利,  
[00:02:09.420 --> 00:02:13.220]  实用主义,反对的也正是那种凡是看对自己,  
[00:02:13.220 --> 00:02:15.220]  是不是有利的局限判断,  
[00:02:15.220 --> 00:02:17.260]  而在中国职场功利,  
[00:02:17.260 --> 00:02:21.060]  实用的传统中,恰恰是不会有这些理论构架的,  
[00:02:21.060 --> 00:02:23.700]  并且,不仅没有理论构架,  
[00:02:23.700 --> 00:02:26.140]  还要对那些无用的,思辨的,  
[00:02:26.140 --> 00:02:29.980]  纯粹的精神,视如避喜,吃之以鼻,  
[00:02:29.980 --> 00:02:32.260]  没错,在技术团队里,  
[00:02:32.260 --> 00:02:35.260]  我们重视技术,重视实用的科学,  
[00:02:35.260 --> 00:02:38.900]  但是主流职场并不鼓励去搞那些看似无用的东西,  
[00:02:38.900 --> 00:02:41.380]  比如普通劳动者的合法权益,  
[00:02:41.380 --> 00:02:43.580]  张义谋的满江红,  
[00:02:43.580 --> 00:02:45.220]  大家想必也都看了的,  
[00:02:45.220 --> 00:02:46.820]  人们总觉得很奇怪,  
[00:02:46.820 --> 00:02:48.300]  为什么那么坏的人,  
[00:02:48.300 --> 00:02:50.020]  皇帝为啥不罢免他?  
[00:02:50.020 --> 00:02:53.140]  为什么小人能当权来构陷好人呢?  
[00:02:53.140 --> 00:02:55.980]  当我们了解了传统文化中的法家思想,  
[00:02:55.980 --> 00:02:57.300]  就了然了,  
[00:02:57.300 --> 00:02:59.260]  在法家的思想规则下,  
[00:02:59.260 --> 00:03:01.660]  小人得是,忠良备辱,  
[00:03:01.660 --> 00:03:03.140]  事事所必然,  
[00:03:03.140 --> 00:03:04.900]  因为他一开始的设定,  
[00:03:04.900 --> 00:03:07.540]  就使得劣币驱逐良币的游戏规则,  
[00:03:07.540 --> 00:03:09.940]  所以,在这种观念下,  
[00:03:09.940 --> 00:03:12.460]  古代常见的一种职场智慧就是,  
[00:03:12.460 --> 00:03:14.820]  自污名节,以求自保,  
[00:03:14.820 --> 00:03:16.420]  在这种环境下,  
[00:03:16.420 --> 00:03:17.780]  要想生存,  
[00:03:17.780 --> 00:03:19.260]  就只有一条出路,  
[00:03:19.260 --> 00:03:20.900]  那就是依附权力,  
[00:03:20.900 --> 00:03:23.700]  并且,谁能拥有更大的权力,  
[00:03:23.700 --> 00:03:25.700]  谁就能生存得更好,  
[00:03:25.700 --> 00:03:27.500]  如何依附权力呢?  
[00:03:27.500 --> 00:03:29.180]  那就是现在正在发生的,  
[00:03:29.180 --> 00:03:31.900]  肆无忌惮的大腕职场PUA,  
[00:03:31.900 --> 00:03:33.060]  除此之外,  
[00:03:33.060 --> 00:03:34.340]  这种权力关系,  
[00:03:34.340 --> 00:03:36.900]  在古代会渗透到方方面面,  
[00:03:36.900 --> 00:03:40.300]  因为权力系统是一个复杂而高效的运行机器,  
[00:03:40.300 --> 00:03:42.940]  CPU,内存,硬盘,  
[00:03:42.940 --> 00:03:44.900]  甚至一颗C面底螺丝钉,  
[00:03:44.900 --> 00:03:47.140]  都是权力机器上的一个环节,  
[00:03:47.140 --> 00:03:48.060]  于是,  
[00:03:48.060 --> 00:03:50.420]  官僚体系之外的一切职场人,  
[00:03:50.420 --> 00:03:52.340]  都会面临一个尴尬的处境,  
[00:03:52.340 --> 00:03:54.340]  一方面遭遇权力的打压,  
[00:03:54.340 --> 00:03:55.340]  另一方面,  
[00:03:55.340 --> 00:03:57.900]  也都会多少尝到权力的甜头,  
[00:03:57.900 --> 00:03:58.900]  于是乎,  
[00:03:58.900 --> 00:04:01.420]  权力的细胞渗透到角角落落,  
[00:04:01.420 --> 00:04:02.980]  即便没有组织权力,  
[00:04:02.980 --> 00:04:04.620]  也要追求文化权力,  
[00:04:04.620 --> 00:04:05.500]  父权,  
[00:04:05.500 --> 00:04:06.380]  夫权,  
[00:04:06.380 --> 00:04:07.460]  家长权力,  
[00:04:07.460 --> 00:04:08.580]  宗族权力,  
[00:04:08.580 --> 00:04:09.660]  老师权力,  
[00:04:09.660 --> 00:04:10.780]  公司权力,  
[00:04:10.780 --> 00:04:12.140]  团队领导权力,  
[00:04:12.140 --> 00:04:13.100]  点点滴滴,  
[00:04:13.100 --> 00:04:15.580]  滴滴点点,追逐权力,  
[00:04:15.580 --> 00:04:18.140]  几乎成为人们生活的全部意义,  
[00:04:18.140 --> 00:04:18.980]  故而,  
[00:04:18.980 --> 00:04:19.980]  服从权力,  
[00:04:19.980 --> 00:04:21.180]  服从上级,  
[00:04:21.180 --> 00:04:22.420]  不得罪同事,  
[00:04:22.420 --> 00:04:23.660]  不得罪朋友,  
[00:04:23.660 --> 00:04:25.060]  不得罪陌生人,  
[00:04:25.060 --> 00:04:26.100]  因为你不知道,  
[00:04:26.100 --> 00:04:28.260]  他们背后有什么的权力关系,  
[00:04:28.260 --> 00:04:30.940]  他们又会不会用这个权力来对付你,  
[00:04:30.940 --> 00:04:31.940]  没错,  
[00:04:31.940 --> 00:04:34.380]  当我们解构群里那位领导的行为时,  
[00:04:34.380 --> 00:04:36.220]  我们也在解构我们自己,  
[00:04:36.220 --> 00:04:37.420]  毫无疑问,  
[00:04:37.420 --> 00:04:39.380]  对于这位敢于发声的职场人,  
[00:04:39.380 --> 00:04:41.180]  深安职场底层逻辑的,  
[00:04:41.180 --> 00:04:43.220]  我们一定能猜到他的结局,  
[00:04:43.220 --> 00:04:44.700]  他的结局是注定的,  
[00:04:44.700 --> 00:04:46.220]  同时也是悲哀的,  
[00:04:46.220 --> 00:04:47.340]  问题是,  
[00:04:47.340 --> 00:04:48.540]  这样做,  
[00:04:48.540 --> 00:04:49.660]  值得吗?  
[00:04:49.660 --> 00:04:52.580]  香港著名导演王家卫拍过一部电影,  
[00:04:52.580 --> 00:04:54.420]  叫做东邪西毒,  
[00:04:54.420 --> 00:04:56.340]  电影中有这样一个情节,  
[00:04:56.340 --> 00:04:59.620]  有个女人的弟弟被太尉府的一群刀客杀了,  
[00:04:59.620 --> 00:05:00.860]  他想报仇,  
[00:05:00.860 --> 00:05:02.300]  可自己没有武功,  
[00:05:02.300 --> 00:05:04.060]  只能请刀客出手,  
[00:05:04.060 --> 00:05:05.540]  但家里穷没钱,  
[00:05:05.540 --> 00:05:08.540]  最有价值的资产是一篮子鸡蛋,  
[00:05:08.540 --> 00:05:09.260]  于是,  
[00:05:09.260 --> 00:05:10.900]  他提着那一篮子鸡蛋,  
[00:05:10.900 --> 00:05:13.420]  天天站在刀客剑客们经过的路口,  
[00:05:13.420 --> 00:05:14.700]  请求他们出手,  
[00:05:14.700 --> 00:05:16.220]  报仇就是鸡蛋,  
[00:05:16.220 --> 00:05:17.860]  没有人愿意为了鸡蛋,  
[00:05:17.860 --> 00:05:20.020]  去单挑太尉府的刀客,  
[00:05:20.020 --> 00:05:21.460]  除了洪七,  
[00:05:21.460 --> 00:05:24.260]  洪七独自力战太尉府那帮刀客,  
[00:05:24.260 --> 00:05:26.780]  所得的报仇是一个鸡蛋,  
[00:05:26.780 --> 00:05:29.020]  但是洪七付出的代价太大,  
[00:05:29.020 --> 00:05:30.060]  混战中,  
[00:05:30.060 --> 00:05:32.700]  洪七被对手砍断了一根手指,  
[00:05:32.700 --> 00:05:33.820]  为了一个鸡蛋,  
[00:05:33.820 --> 00:05:35.500]  而失去一只手指,  
[00:05:35.500 --> 00:05:36.740]  值得吗?  
[00:05:36.740 --> 00:05:37.860]  不值得,  
[00:05:37.860 --> 00:05:39.300]  但是我觉得痛快,  
[00:05:39.300 --> 00:05:40.540]  因為這才是我自己  
  
output_srt: saving output to 'samples/test1.wav.srt'  
  
whisper_print_timings:     load time =   978.82 ms  
whisper_print_timings:     fallbacks =   0 p /   0 h  
whisper_print_timings:      mel time =   438.81 ms  
whisper_print_timings:   sample time =   980.66 ms /  2343 runs (    0.42 ms per run)  
whisper_print_timings:   encode time = 31476.10 ms /    13 runs ( 2421.24 ms per run)  
whisper_print_timings:   decode time = 47833.70 ms /  2343 runs (   20.42 ms per run)  
whisper_print_timings:    total time = 81797.88 ms

A speech of more than five minutes (341.3 seconds) is transcribed in about 82 seconds (total time = 81797.88 ms), roughly four times faster than real time. That earns full marks for efficiency.
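
The speed figure can be derived directly from the timing summary printed at the end of the run. The sketch below parses those lines and computes the ratio of audio duration to processing time. The timing values are copied from the log above; the function name `parse_timings` and the regular expression are illustrative assumptions, not part of Whisper.cpp.

```python
import re

TIMINGS = """\
whisper_print_timings:     load time =   978.82 ms
whisper_print_timings:      mel time =   438.81 ms
whisper_print_timings:   sample time =   980.66 ms /  2343 runs (    0.42 ms per run)
whisper_print_timings:   encode time = 31476.10 ms /    13 runs ( 2421.24 ms per run)
whisper_print_timings:   decode time = 47833.70 ms /  2343 runs (   20.42 ms per run)
whisper_print_timings:    total time = 81797.88 ms
"""


def parse_timings(log: str) -> dict:
    """Map each stage name (load, mel, ...) to its time in milliseconds."""
    pattern = re.compile(r"whisper_print_timings:\s+(\w+) time\s*=\s*([\d.]+) ms")
    return {name: float(ms) for name, ms in pattern.findall(log)}


timings = parse_timings(TIMINGS)
audio_seconds = 341.3  # duration reported by "main: processing ..."
speedup = audio_seconds / (timings["total"] / 1000)
encode_share = timings["encode"] / timings["total"]
decode_share = timings["decode"] / timings["total"]
print(f"{speedup:.2f}x real time, encode {encode_share:.0%}, decode {decode_share:.0%}")
```

In this run the encoder accounts for well over a third of the total time, which is the part the Core ML path described later targets.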

Of course, the accuracy still leaves room for improvement. For better accuracy you can choose the large model, but transcription time will increase accordingly.

Model conversion for Apple M-series chips

Good news for users on Apple's Mac platform: Whisper.cpp can run encoder inference on the Apple Neural Engine (ANE) through Core ML, which can be more than three times faster than using the CPU alone.

First install the conversion dependencies:

pip install ane_transformers  
pip install openai-whisper  
pip install coremltools

Then run the conversion script:

./models/generate-coreml-model.sh medium 

The argument is the model name.

The program prints:

➜  models git:(master) python3 convert-whisper-to-coreml.py --model medium --encoder-only True   
scikit-learn version 1.2.0 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.  
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=1024, n_audio_head=16, n_audio_layer=24, n_vocab=51865, n_text_ctx=448, n_text_state=1024, n_text_head=16, n_text_layer=24)  
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!  
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"  
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').  
  scale = (n_state // self.n_head) ** -0.25  
Converting PyTorch Frontend ==> MIL Ops: 100%|▉| 1971/1972 [00:00<00:00, 3247.25  
Running MIL frontend_pytorch pipeline: 100%|█| 5/5 [00:00<00:00, 54.69 passes/s]  
Running MIL default pipeline: 100%|████████| 57/57 [00:09<00:00,  6.29 passes/s]  
Running MIL backend_mlprogram pipeline: 100%|█| 10/10 [00:00<00:00, 444.13 passe  
  
  
  
  
  
  
done converting

After the conversion, rebuild with Core ML enabled:

make clean  
WHISPER_COREML=1 make -j

Then transcribe with the converted model. The command still passes ggml-medium.bin, and the Core ML build loads the converted encoder alongside it:

./main -m models/ggml-medium.bin -f samples/jfk.wav

At this point, Mac users are instantly promoted to first-class citizens.
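
The commands used in this walkthrough all share the same shape, so a small wrapper can assemble them for scripting or batch jobs. The sketch below only builds the argument list and uses nothing beyond the flags that appear in this article (-osrt, -m, -f, -l). The function `build_command` and its parameter names are hypothetical.

```python
import shlex


def build_command(wav_path, model="./models/ggml-medium.bin",
                  language=None, srt=False, binary="./main"):
    """Assemble a Whisper.cpp command line from the flags shown in this article."""
    cmd = [binary]
    if srt:
        cmd.append("-osrt")            # also write an .srt subtitle file
    cmd += ["-m", model, "-f", wav_path]
    if language:
        cmd += ["-l", language]        # e.g. "zh" for Chinese speech
    return cmd


english = build_command("samples/jfk.wav", srt=True)
chinese = build_command("samples/test1.wav", language="zh", srt=True)
print(shlex.join(chinese))
```

From the project root, the resulting list can be handed to `subprocess.run(cmd, check=True)` once the binary and model are in place.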

Epilogue

Whisper.cpp is Whisper's spiritual reproduction and physical rebirth. It inherits all of Whisper's capabilities while improving transcription speed, efficiency, and cross-platform portability. It is a big step forward for open-source technology.

Reposted from: juejin.im/post/7229126088228126776