Using OpenAI's open-source speech recognition model Whisper in .NET

Foreword

On September 21, 2022, OpenAI open-sourced the Whisper neural network, claiming that its English speech recognition has reached human-level accuracy, and that it also supports automatic speech recognition in 98 other languages. The automatic speech recognition (ASR) models provided by Whisper are trained on both speech recognition and translation tasks: they can transcribe speech in various languages into text, and can also translate that text into English.

Whisper's core capability, speech recognition, is useful to most people: it can help turn recordings of meetings, lectures, and classes into transcripts more quickly. For film and TV enthusiasts, it can automatically generate subtitles for videos that have none, with no more waiting on subtitle groups to release them. For foreign-language learners, running your pronunciation-practice recordings through Whisper is a good test of your spoken pronunciation. Of course, all the major cloud platforms offer speech recognition services, but they are essentially online services, which always carries privacy and security risks. Whisper is completely different: it runs entirely locally, with no network connection, which fully protects personal privacy, and its recognition accuracy is quite high.

The high-performance whisper.cpp port of Whisper is written in C++, and sandrohanea's Whisper.net wraps it for .NET.

This article records my process of using the open-source speech recognition model Whisper in a .NET web project, mainly so I can refer back to it later. If it helps you too, so much the better.

The .NET web project targets .NET 6.0.

Install the Whisper.net package

First, install the Whisper.net package in the Core project: search for and install the [Whisper.net] package in the NuGet package manager.

Note that the package we want is [Whisper.net], not [Whisper.net.Runtime], [WhisperNet], or [Whisper.Runtime].
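If you prefer the command line, the same package can be added with the dotnet CLI. A minimal sketch, assuming the Core project file is named Market.Core.csproj (the name is inferred from the Market.Core namespaces used later):

# Run from the solution directory.
dotnet add Market.Core/Market.Core.csproj package Whisper.net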


Download the model files

Go to Hugging Face to download Whisper's ggml model files. There are five main sizes: ggml-tiny.bin, ggml-base.bin, ggml-small.bin, ggml-medium.bin, and ggml-large.bin. The file sizes increase in that order, and recognition accuracy improves along with them. In addition, the [xxx.en.bin] files are English-only models, while the [xxx.bin] files support multiple languages.

We can put the model file anywhere in the project; here I put it under the Web project's wwwroot folder.

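As an alternative to downloading by hand, the optional Whisper.net.Ggml companion package can fetch a model programmatically. This is a minimal sketch, not part of the original setup, and assumes that package has also been installed:

using Whisper.net.Ggml;

var modelPath = "wwwroot/WhisperModel/ggml-base.bin";
if (!File.Exists(modelPath))
{
    // Stream ggml-base.bin from Hugging Face and save it under wwwroot/WhisperModel.
    await using var modelStream = await WhisperGgmlDownloader.GetGgmlModelAsync(GgmlType.Base);
    await using var fileWriter = File.OpenWrite(modelPath);
    await modelStream.CopyToAsync(fileWriter);
}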

Create a new Whisper helper class

WhisperHelper.cs


using Whisper.net;
using System.IO;
using System.Collections.Generic;
using Market.Core.Enum;

namespace Market.Core.Util
{
    public class WhisperHelper
    {
        // Cached as static members so the model is loaded only once per process.
        // Note: this shared mutable state is not thread-safe under concurrent requests.
        public static List<SegmentData> Segments { get; set; }
        public static WhisperProcessor Processor { get; set; }

        public WhisperHelper(ASRModelType modelType)
        {
            if(Segments == null || Processor == null)
            {
                Segments = new List<SegmentData>();

                var binName = "ggml-large.bin";
                switch (modelType)
                {
                    case ASRModelType.WhisperTiny:
                        binName = "ggml-tiny.bin";
                        break;
                    case ASRModelType.WhisperBase:
                        binName = "ggml-base.bin";
                        break;
                    case ASRModelType.WhisperSmall:
                        binName = "ggml-small.bin";
                        break;
                    case ASRModelType.WhisperMedium:
                        binName = "ggml-medium.bin";
                        break;
                    case ASRModelType.WhisperLarge:
                        binName = "ggml-large.bin";
                        break;
                    default:
                        break;
                }
                var modelFilePath = $"wwwroot/WhisperModel/{binName}";
                var factory = WhisperFactory.FromPath(modelFilePath);
                var builder = factory.CreateBuilder()
                                     .WithLanguage("zh") //中文
                                     .WithSegmentEventHandler(Segments.Add);
                var processor = builder.Build();
                Processor = processor;
            }
        }

        /// <summary>
        /// Complete speech recognition (singleton implementation).
        /// </summary>
        /// <returns>The recognized text.</returns>
        public string FullDetection(Stream speechStream)
        {
            Segments.Clear();
            var txtResult = string.Empty;

            // Start recognition; the segment event handler appends each result to Segments.
            Processor.Process(speechStream);

            // Assemble the recognized segments into the final text.
            foreach (var segment in Segments)
            {
                txtResult += segment.Text + "\n";
            }
            Segments.Clear();
            return txtResult;
        }
    }
}
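With the helper class in place, recognition takes only a few lines. A minimal usage sketch, where sample.wav is a hypothetical 16 kHz mono WAV file (the format whisper.cpp expects, and the format the front-end below records):

// Hypothetical test: recognize a local WAV file with the base model.
using var wavStream = File.OpenRead("sample.wav");
var whisperHelper = new WhisperHelper(ASRModelType.WhisperBase);
var text = whisperHelper.FullDetection(wavStream);
Console.WriteLine(text);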

ModelType.cs

Different models have different file names, so an enumeration is needed to distinguish between them:


using System.ComponentModel;

namespace Market.Core.Enum
{
    /// <summary>
    /// ASR model type
    /// </summary>
    [Description("ASR model type")]
    public enum ASRModelType
    {
        /// <summary>
        /// ASRT
        /// </summary>
        [Description("ASRT")]
        ASRT = 0,

        /// <summary>
        /// WhisperTiny
        /// </summary>
        [Description("WhisperTiny")]
        WhisperTiny = 100,

        /// <summary>
        /// WhisperBase
        /// </summary>
        [Description("WhisperBase")]
        WhisperBase = 110,

        /// <summary>
        /// WhisperSmall
        /// </summary>
        [Description("WhisperSmall")]
        WhisperSmall = 120,

        /// <summary>
        /// WhisperMedium
        /// </summary>
        [Description("WhisperMedium")]
        WhisperMedium = 130,

        /// <summary>
        /// WhisperLarge
        /// </summary>
        [Description("WhisperLarge")]
        WhisperLarge = 140,

        /// <summary>
        /// PaddleSpeech
        /// </summary>
        [Description("PaddleSpeech")]
        PaddleSpeech = 200,
    }
}

The back end accepts audio and runs recognition

The back-end interface accepts the audio as base64-encoded binary data and uses the Whisper helper class for speech recognition.


The key code is as follows:

public class ASRModel
{
    public string samples { get; set; }          // base64-encoded WAV data
    public ASRModelType ModelType { get; set; }  // which ASR model to use (see the enum above)
}

/// <summary>
/// Speech recognition
/// </summary>
[HttpPost]
[Route("/auth/speechRecogize")]
public async Task<IActionResult> SpeechRecogizeAsync([FromBody] ASRModel model)
{
    ResultDto result = new ResultDto();
    byte[] wavData = Convert.FromBase64String(model.samples);
    model.samples = null;   // release the base64 string for GC
    // Run speech recognition with the Whisper model
    var speechStream = new MemoryStream(wavData);
    var whisperManager = new WhisperHelper(model.ModelType);
    var textResult = whisperManager.FullDetection(speechStream);
    speechStream.Dispose(); // release the stream
    speechStream = null;
    wavData = null;         // release the audio buffer
    result.Data = textResult;
    return Json(result.OK());
}
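For a quick test without the front-end page, the endpoint can also be exercised from any HTTP client. A hedged sketch using HttpClient; the base address and sample.wav are assumptions, and modelType 110 corresponds to WhisperBase in the enum above:

using System.Net.Http.Json;

// Assumed host/port; adjust to your launch profile.
using var http = new HttpClient { BaseAddress = new Uri("https://localhost:5001") };
var wavBytes = await File.ReadAllBytesAsync("sample.wav"); // hypothetical 16 kHz mono WAV
var payload = new { samples = Convert.ToBase64String(wavBytes), modelType = 110 /* WhisperBase */ };
var response = await http.PostAsJsonAsync("/auth/speechRecogize", payload);
Console.WriteLine(await response.Content.ReadAsStringAsync());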

Upload audio from the front-end page

The front end mainly handles audio capture: it records audio, converts the recording to base64, and posts it to the back-end API.


The page code is as follows:

@{
    Layout = null;
}
@using Karambolo.AspNetCore.Bundling.ViewHelpers
@addTagHelper *, Karambolo.AspNetCore.Bundling
@addTagHelper *, Microsoft.AspNetCore.Mvc.TagHelpers
<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8" />
    <title>Voice Recording</title>
    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
    <environment names="Development">
        <link href="~/content/plugins/element-ui/index.css" rel="stylesheet" />
        <script src="~/content/plugins/jquery/jquery-3.4.1.min.js"></script>
        <script src="~/content/js/matomo.js"></script>
        <script src="~/content/js/slick.min.js"></script>
        <script src="~/content/js/masonry.js"></script>
        <script src="~/content/js/instafeed.min.js"></script>
        <script src="~/content/js/headroom.js"></script>
        <script src="~/content/js/readingTime.min.js"></script>
        <script src="~/content/js/script.js"></script>
        <script src="~/content/js/prism.js"></script>
        <script src="~/content/js/recorder-core.js"></script>
        <script src="~/content/js/wav.js"></script>
        <script src="~/content/js/waveview.js"></script>
        <script src="~/content/js/vue.js"></script>
        <script src="~/content/plugins/element-ui/index.js"></script>
        <script src="~/content/js/request.js"></script>
    </environment>
    <environment names="Stage,Production">
        @await Styles.RenderAsync("~/bundles/login.css")
        @await Scripts.RenderAsync("~/bundles/login.js")
    </environment>
    <style>
        html,
        body {
            margin: 0;
            height: 100%;
        }

        body {
            padding: 20px;
            box-sizing: border-box;
        }

        audio {
            display: block;
        }

        audio + audio {
            margin-top: 20px;
        }

        .el-textarea .el-textarea__inner {
            color: #000 !important;
            font-size: 18px;
            font-weight: 600;
        }

        #app {
            height: 100%;
        }

        .content {
            height: calc(100% - 130px);
            overflow: auto;
        }

        .content > div {
            margin: 10px 0 20px;
        }

        .press {
            height: 40px;
            line-height: 40px;
            border-radius: 5px;
            border: 1px solid #dcdfe6;
            cursor: pointer;
            width: 100%;
            text-align: center;
            background: #fff;
        }
    </style>
</head>

<body>
    <div id="app">
        <div style="display: flex; justify-content: space-between; align-items: center;">
            <center>{{isPC ? 'Desktop version' : 'Mobile version'}}</center>
            <center style="margin: 10px 0">
                <el-radio-group v-model="modelType">
                    <el-radio :label="0">ASRT</el-radio>
                    <el-radio :label="100">WhisperTiny</el-radio>
                    <el-radio :label="110">WhisperBase</el-radio>
                    <el-radio :label="120">WhisperSmall</el-radio>
                    <el-radio :label="130">WhisperMedium</el-radio>
                    <el-radio :label="140">WhisperLarge</el-radio>
                    <el-radio :label="200">PaddleSpeech</el-radio>
                </el-radio-group>
            </center>
            <el-button type="primary" size="small" onclick="window.location.href = '/'">Back</el-button>
        </div>
        <div class="content" id="wav_pannel">
            @*{{textarea}}*@
        </div>
        <div style="margin-top: 20px"></div>
        <center style="height: 40px;"><h4 id="msgbox" v-if="messageSatuts">{{message}}</h4></center>
        <button class="press" v-on:touchstart="start" v-on:touchend="end" v-if="!isPC">
            Hold to talk
        </button>
        <button class="press" v-on:mousedown="start" v-on:mouseup="end" v-else>
            Hold to talk
        </button>
    </div>
</body>

</html>
<script>
    var blob_wav_current;
    var rec;
    var recOpen = function (success) {
        rec = Recorder({
            type: "wav",
            sampleRate: 16000, // Whisper expects 16 kHz input
            bitRate: 16,
            onProcess: (buffers, powerLevel, bufferDuration, bufferSampleRate, newBufferIdx, asyncEnd) => {
            }
        });
        rec.open(() => {
            success && success();
        }, (msg, isUserNotAllow) => {
            app.textarea = (isUserNotAllow ? "UserNotAllow," : "") + "Cannot record: " + msg;
        });
    };
    var app = new Vue({
        el: '#app',
        data: {
            textarea: '',
            message: '',
            messageSatuts: false,
            modelType: 0,
        },
        computed: {
            isPC() {
                var userAgentInfo = navigator.userAgent;
                var Agents = ["Android", "iPhone", "SymbianOS", "Windows Phone", "iPod", "iPad"];
                var flag = true;
                for (var i = 0; i < Agents.length; i++) {
                    if (userAgentInfo.indexOf(Agents[i]) > 0) {
                        flag = false;
                        break;
                    }
                }
                return flag;
            }
        },
        methods: {
            start() {
                app.message = "Recording...";
                app.messageSatuts = true;
                recOpen(function () {
                    app.recStart();
                });
            },
            end() {
                if (rec) {
                    rec.stop(function (blob, duration) {
                        app.messageSatuts = false;
                        rec.close();
                        rec = null;
                        blob_wav_current = blob;
                        var audio = document.createElement("audio");
                        audio.controls = true;
                        var dom = document.getElementById("wav_pannel");
                        dom.appendChild(audio);
                        audio.src = (window.URL || webkitURL).createObjectURL(blob);
                        //audio.play();
                        app.messageSatuts = false;
                        app.upload();
                    }, function (msg) {
                        console.log("Recording failed: " + msg);
                        rec.close();
                        rec = null;
                    });
                    app.message = "Recording stopped";
                }
            },
            upload() {
                app.message = "Uploading for recognition...";
                app.messageSatuts = true;
                var blob = blob_wav_current;
                var reader = new FileReader();
                reader.onloadend = function () {
                    var data = {
                        samples: (/.+;\s*base64\s*,\s*(.+)$/i.exec(reader.result) || [])[1],
                        sample_rate: 16000,
                        channels: 1,
                        byte_width: 2,
                        modelType: app.modelType
                    }
                    $.post('/auth/speechRecogize', data, function (res) {
                        if (res.data && res.data.statusCode == 200000) {
                            app.messageSatuts = false;
                            app.textarea = res.data.text == '' ? 'Nothing recognized, please try again' : res.data.text;
                        } else {
                            app.textarea = "Recognition failed";
                        }
                        var dom = document.getElementById("wav_pannel");
                        var div = document.createElement("div");
                        div.innerHTML = app.textarea;
                        dom.appendChild(div);
                        $('#wav_pannel').animate({ scrollTop: $('#wav_pannel')[0].scrollHeight - $('#wav_pannel')[0].offsetHeight });
                    })
                };
                reader.readAsDataURL(blob);
            },
            recStart() {
                rec.start();
            },
        }
    })
</script>

References

Whisper official website

Test the basic usage of the offline audio-to-text model Whisper.net

whisper.cpp on GitHub

Whisper.net on GitHub

Whisper model download (Hugging Face)


Original post: blog.csdn.net/guigenyi/article/details/130955947