Using OpenAI's open-source speech recognition model Whisper in .NET
Foreword
On September 21, 2022, OpenAI open-sourced the Whisper neural network, claiming that its English speech recognition accuracy approaches human level; it also supports automatic speech recognition in 98 other languages. The automatic speech recognition (ASR) models provided by Whisper are trained on both speech recognition and translation tasks: they can transcribe speech in many languages into text, and they can also translate that speech into English.
Whisper's core capability, speech recognition, is useful to almost everyone. It can turn recordings of meetings, lectures, and classes into transcripts quickly; film and TV fans can auto-generate subtitles for videos that lack them instead of waiting for subtitle groups to release theirs; and language learners can run their pronunciation-practice recordings through Whisper to check how intelligible their spoken output is. Of course, all the major cloud platforms offer speech recognition services, but those are network-based, which always carries some risk to personal privacy. Whisper is completely different: it runs entirely locally, with no network connection, which fully protects personal privacy, and its recognition accuracy is quite high.
Whisper's inference engine has been reimplemented in C++ (the whisper.cpp project), and sandrohanea's Whisper.net wraps that C++ library for .NET.
This article documents my process of using the open-source speech recognition model Whisper in a .NET web project, mainly so I can refer back to it later. If it helps you too, even better.
The .NET web project targets .NET 6.0.
Install the Whisper.net package
First, install the Whisper.net package in the Core project. Search for and install the [Whisper.net] package in the NuGet package manager, as shown in the following figure:
Note that the package we want is [Whisper.net], not [Whisper.net.Runtime], [WhisperNet], or [Whisper.Runtime].
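If you prefer the command line, the same package can be added with the dotnet CLI. This is just the CLI equivalent of the NuGet UI step above; run it from the Core project's directory, and NuGet will pick whatever version is current:

```shell
# Add the Whisper.net package to the current project
dotnet add package Whisper.net
```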
Download the model files
Go to Hugging Face to download Whisper's model files. There are five main models: ggml-tiny.bin, ggml-base.bin, ggml-small.bin, ggml-medium.bin, and ggml-large.bin. The file sizes increase in that order, and so does recognition accuracy. Also note that the [xxx.en.bin] files are English-only models, while the [xxx.bin] files are multilingual.
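As a sketch, the ggml model files can also be fetched from the command line. This assumes they are hosted in the ggerganov/whisper.cpp repository on Hugging Face (the usual home of the ggml conversions); adjust the URL if the hosting location differs:

```shell
# Download the multilingual base model into the Web project's model folder
# (repository path is an assumption; verify it on Hugging Face first)
mkdir -p wwwroot/WhisperModel
curl -L -o wwwroot/WhisperModel/ggml-base.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
```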
We can put the model files anywhere in the project; here I put them under wwwroot of the Web project:
Create a new Whisper helper class
WhisperHelper.cs
using Whisper.net;
using System.IO;
using System.Collections.Generic;
using Market.Core.Enum;

namespace Market.Core.Util
{
    public class WhisperHelper
    {
        public static List<SegmentData> Segments { get; set; }
        public static WhisperProcessor Processor { get; set; }

        public WhisperHelper(ASRModelType modelType)
        {
            // Cache the processor in static members; the first model type requested wins
            if (Segments == null || Processor == null)
            {
                Segments = new List<SegmentData>();
                var binName = "ggml-large.bin";
                switch (modelType)
                {
                    case ASRModelType.WhisperTiny:
                        binName = "ggml-tiny.bin";
                        break;
                    case ASRModelType.WhisperBase:
                        binName = "ggml-base.bin";
                        break;
                    case ASRModelType.WhisperSmall:
                        binName = "ggml-small.bin";
                        break;
                    case ASRModelType.WhisperMedium:
                        binName = "ggml-medium.bin";
                        break;
                    case ASRModelType.WhisperLarge:
                        binName = "ggml-large.bin";
                        break;
                    default:
                        break;
                }
                var modelFilePath = $"wwwroot/WhisperModel/{binName}";
                var factory = WhisperFactory.FromPath(modelFilePath);
                var builder = factory.CreateBuilder()
                    .WithLanguage("zh") // Chinese
                    .WithSegmentEventHandler(Segments.Add);
                var processor = builder.Build();
                Processor = processor;
            }
        }

        /// <summary>
        /// Full speech recognition (singleton implementation)
        /// </summary>
        /// <returns></returns>
        public string FullDetection(Stream speechStream)
        {
            Segments.Clear();
            var txtResult = string.Empty;
            // Run recognition
            Processor.Process(speechStream);
            // Collect the recognized segments
            foreach (var segment in Segments)
            {
                txtResult += segment.Text + "\n";
            }
            Segments.Clear();
            return txtResult;
        }
    }
}
ModelType.cs
Each model has its own name, so an enumeration type is used to distinguish them:
using System.ComponentModel;

namespace Market.Core.Enum
{
    /// <summary>
    /// ASR model type
    /// </summary>
    [Description("ASR model type")]
    public enum ASRModelType
    {
        /// <summary>
        /// ASRT
        /// </summary>
        [Description("ASRT")]
        ASRT = 0,

        /// <summary>
        /// WhisperTiny
        /// </summary>
        [Description("WhisperTiny")]
        WhisperTiny = 100,

        /// <summary>
        /// WhisperBase
        /// </summary>
        [Description("WhisperBase")]
        WhisperBase = 110,

        /// <summary>
        /// WhisperSmall
        /// </summary>
        [Description("WhisperSmall")]
        WhisperSmall = 120,

        /// <summary>
        /// WhisperMedium
        /// </summary>
        [Description("WhisperMedium")]
        WhisperMedium = 130,

        /// <summary>
        /// WhisperLarge
        /// </summary>
        [Description("WhisperLarge")]
        WhisperLarge = 140,

        /// <summary>
        /// PaddleSpeech
        /// </summary>
        [Description("PaddleSpeech")]
        PaddleSpeech = 200,
    }
}
The back end accepts the audio and runs recognition
The back-end endpoint accepts the audio as base64-encoded binary and uses the Whisper helper class for speech recognition.
The key code is as follows:
public class ASRModel
{
    public string samples { get; set; }
    public ASRModelType ModelType { get; set; }
}

/// <summary>
/// Speech recognition
/// </summary>
[HttpPost]
[Route("/auth/speechRecogize")]
public async Task<IActionResult> SpeechRecogizeAsync([FromBody] ASRModel model)
{
    ResultDto result = new ResultDto();
    byte[] wavData = Convert.FromBase64String(model.samples);
    model.samples = null; // drop the reference so the base64 string can be collected
    // Run speech recognition with the Whisper model
    var speechStream = new MemoryStream(wavData);
    var whisperManager = new WhisperHelper(model.ModelType);
    var textResult = whisperManager.FullDetection(speechStream);
    speechStream.Dispose(); // release the stream
    speechStream = null;
    wavData = null; // drop the reference
    result.Data = textResult;
    return Json(result.OK());
}
Upload audio from the front-end page
The front end mainly handles audio capture, then converts the recorded audio into base64 and posts it to the back-end API.
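The data URL produced by FileReader.readAsDataURL has the form `data:audio/wav;base64,<payload>`, and the upload code strips the prefix with a regex before posting. A minimal sketch of that extraction, using a made-up sample payload rather than real audio:

```javascript
// A data URL shaped like what FileReader.readAsDataURL produces for a WAV blob
// (the payload here is a made-up sample, not real audio data)
var dataUrl = "data:audio/wav;base64,UklGRiQAAABXQVZF";

// Same pattern the upload() method uses: capture everything after "base64,"
var base64Payload = (/.+;\s*base64\s*,\s*(.+)$/i.exec(dataUrl) || [])[1];

console.log(base64Payload); // "UklGRiQAAABXQVZF"
```

The `|| []` fallback means a malformed data URL yields `undefined` instead of throwing, which is worth checking before posting to the server.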
The front-end page is as follows:
The page code is as follows:
@{
Layout = null;
}
@using Karambolo.AspNetCore.Bundling.ViewHelpers
@addTagHelper *, Karambolo.AspNetCore.Bundling
@addTagHelper *, Microsoft.AspNetCore.Mvc.TagHelpers
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Voice Recording</title>
<meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
<environment names="Development">
<link href="~/content/plugins/element-ui/index.css" rel="stylesheet" />
<script src="~/content/plugins/jquery/jquery-3.4.1.min.js"></script>
<script src="~/content/js/matomo.js"></script>
<script src="~/content/js/slick.min.js"></script>
<script src="~/content/js/masonry.js"></script>
<script src="~/content/js/instafeed.min.js"></script>
<script src="~/content/js/headroom.js"></script>
<script src="~/content/js/readingTime.min.js"></script>
<script src="~/content/js/script.js"></script>
<script src="~/content/js/prism.js"></script>
<script src="~/content/js/recorder-core.js"></script>
<script src="~/content/js/wav.js"></script>
<script src="~/content/js/waveview.js"></script>
<script src="~/content/js/vue.js"></script>
<script src="~/content/plugins/element-ui/index.js"></script>
<script src="~/content/js/request.js"></script>
</environment>
<environment names="Stage,Production">
@await Styles.RenderAsync("~/bundles/login.css")
@await Scripts.RenderAsync("~/bundles/login.js")
</environment>
<style>
html,
body {
margin: 0;
height: 100%;
}
body {
padding: 20px;
box-sizing: border-box;
}
audio {
display:block;
}
audio + audio {
margin-top: 20px;
}
.el-textarea .el-textarea__inner {
color: #000 !important;
font-size: 18px;
font-weight: 600;
}
#app {
height: 100%;
}
.content {
height: calc(100% - 130px);
overflow: auto;
}
.content > div {
margin: 10px 0 20px;
}
.press {
height: 40px;
line-height: 40px;
border-radius: 5px;
border: 1px solid #dcdfe6;
cursor: pointer;
width: 100%;
text-align: center;
background: #fff;
}
</style>
</head>
<body>
<div id="app">
<div style="display: flex; justify-content: space-between; align-items: center;">
<center>{{ isPC ? 'Desktop version' : 'Mobile version' }}</center>
<center style="margin: 10px 0">
<el-radio-group v-model="modelType">
<el-radio :label="0">ASRT</el-radio>
<el-radio :label="100">WhisperTiny</el-radio>
<el-radio :label="110">WhisperBase</el-radio>
<el-radio :label="120">WhisperSmall</el-radio>
<el-radio :label="130">WhisperMedium</el-radio>
<el-radio :label="140">WhisperLarge</el-radio>
<el-radio :label="200">PaddleSpeech</el-radio>
</el-radio-group>
</center>
<el-button type="primary" size="small" onclick="window.location.href = '/'">Back</el-button>
</div>
<div class="content" id="wav_pannel">
@*{{ textarea }}*@
</div>
<div style="margin-top: 20px"></div>
<center style="height: 40px;"><h4 id="msgbox" v-if="messageSatuts">{{ message }}</h4></center>
<button class="press" v-on:touchstart="start" v-on:touchend="end" v-if="!isPC">
Hold to talk
</button>
<button class="press" v-on:mousedown="start" v-on:mouseup="end" v-else>
Hold to talk
</button>
</div>
</body>
</html>
<script>
var blob_wav_current;
var rec;
var recOpen = function (success) {
rec = Recorder({
type: "wav",
sampleRate: 16000,
bitRate: 16,
onProcess: (buffers, powerLevel, bufferDuration, bufferSampleRate, newBufferIdx, asyncEnd) => {
}
});
rec.open(() => {
success && success();
}, (msg, isUserNotAllow) => {
app.textarea = (isUserNotAllow ? "UserNotAllow," : "") + "Cannot record: " + msg;
});
};
var app = new Vue({
el: '#app',
data: {
textarea: '',
message: '',
messageSatuts: false,
modelType: 0,
},
computed: {
isPC() {
var userAgentInfo = navigator.userAgent;
var Agents = ["Android", "iPhone", "SymbianOS", "Windows Phone", "iPod", "iPad"];
var flag = true;
for (var i = 0; i < Agents.length; i++) {
if (userAgentInfo.indexOf(Agents[i]) > 0) {
flag = false;
break;
}
}
return flag;
}
},
methods: {
start() {
app.message = "Recording...";
app.messageSatuts = true;
recOpen(function() {
app.recStart();
});
},
end() {
if (rec) {
rec.stop(function (blob, duration) {
app.messageSatuts = false;
rec.close();
rec = null;
blob_wav_current = blob;
var audio = document.createElement("audio");
audio.controls = true;
var dom = document.getElementById("wav_pannel");
dom.appendChild(audio);
audio.src = (window.URL || webkitURL).createObjectURL(blob);
//audio.play();
app.messageSatuts = false;
app.upload();
}, function (msg) {
console.log("Recording failed: " + msg);
rec.close();
rec = null;
});
app.message = "Recording stopped";
}
},
upload() {
app.message = "Uploading for recognition...";
app.messageSatuts = true;
var blob = blob_wav_current;
var reader = new FileReader();
reader.onloadend = function(){
var data = {
samples: (/.+;\s*base64\s*,\s*(.+)$/i.exec(reader.result) || [])[1],
sample_rate: 16000,
channels: 1,
byte_width: 2,
modelType: app.modelType
}
// [FromBody] binds from a JSON body, so send application/json instead of a form post
$.ajax({
    url: '/auth/speechRecogize',
    type: 'POST',
    contentType: 'application/json',
    data: JSON.stringify(data),
    success: function (res) {
        if (res.data && res.data.statusCode == 200000) {
            app.messageSatuts = false;
            app.textarea = res.data.text == '' ? 'Nothing recognized yet, please try again' : res.data.text;
        } else {
            app.textarea = "Recognition failed";
        }
        var dom = document.getElementById("wav_pannel");
        var div = document.createElement("div");
        div.innerHTML = app.textarea;
        dom.appendChild(div);
        $('#wav_pannel').animate({
            scrollTop: $('#wav_pannel')[0].scrollHeight - $('#wav_pannel')[0].offsetHeight });
    }
})
};
reader.readAsDataURL(blob);
},
recStart() {
rec.start();
},
}
})
</script>
Reference
Test the basic usage of the offline audio-to-text model Whisper.net