Speech Recognition with Azure AI Service

In the previous article, "Azure AI Service Text Translation", the author briefly introduced the text translation API in Azure Cognitive Services: machine translation can be performed easily through a few simple REST API calls. Wouldn't it be great if speech-to-text could be integrated into a program just as easily? In this article, we introduce how to use the Bing Speech API to convert speech into text:

The following applications can be easily developed using the Bing Speech API:

Click the "Start Recording" button, speak into the microphone, and what you say is recognized and output as text. The screenshot above is a demo provided by Azure. To demonstrate the usage of the speech recognition API, let's write a rather plain program that outputs detailed information:

The program recognizes two hardcoded pieces of audio data in different modes and then outputs the recognized results. The upper text box shows a large number of intermediate recognition results, while the lower text box shows the final recognition results.

Create an Azure service

To use Azure's speech recognition service, you need to create a corresponding service instance on Azure. Specifically, we first need to create a "Bing Speech API" service instance:

Note: for learning and practice, you can create a free Azure account and create free-tier instances of the service above. For details, please refer to the official Azure website.

Create a WPF program

The Bing Speech API service provides both a REST API and a client library. Because the REST API has some limitations, we use the client library in the demo program. The client library comes in two versions, x86 and x64; the author references the x64 package Microsoft.ProjectOxford.SpeechRecognition-x64:

Therefore, the platform target of the project needs to be set to x64 as well.

Note that all of the Cognitive Services APIs provided by Azure require authentication information. Concretely, the key of the service instance we created is sent along with each API call and is used for server-side authentication. You can obtain the keys on the details page of the created service instance; in the program we store one of them as a constant:

const string SUBSCRIPTIONKEY = "your bing speech API key";

Since the demo code is relatively long, and to keep the focus on the Azure AI related content, only the relevant code is shown in this article. The complete demo code is available here.

Recognition modes

Speech recognition distinguishes between different recognition modes for different usage scenarios, such as conversation mode, dictation mode, and interactive mode.

  • Conversation mode (conversation): the user takes part in a person-to-person dialogue.
  • Dictation mode (dictation): the user speaks a longer utterance and then waits for the speech recognition result.
  • Interactive mode (interactive): the user makes a short request and expects the application to perform a corresponding action.

Unfortunately, the mode types in the client library we use do not correspond one-to-one to the three modes above. The library provides an enumeration called SpeechRecognitionMode:

public enum SpeechRecognitionMode
{
    ShortPhrase = 0,
    LongDictation = 1
}

It defines two recognition modes, ShortPhrase and LongDictation. ShortPhrase mode supports up to 15 seconds of speech. The audio data is sent to the server in chunks, and the server returns partial recognition results as it goes, so the client receives multiple partial results and one final result containing multiple n-best options. LongDictation mode supports up to two minutes of speech. The audio data is likewise sent to the server in chunks, and the client receives multiple partial results and multiple final results, based on the sentence pauses identified by the server.

We use these values in code to tell the speech recognition API which type of recognition to perform. For example, to recognize speech shorter than 15 seconds, you can pass the ShortPhrase mode to the CreateDataClient factory method to build a DataRecognitionClient instance:

// Use the CreateDataClient method of the factory type to create an instance of the DataRecognitionClient type.
this.dataClient = SpeechRecognitionServiceFactory.CreateDataClient(
    SpeechRecognitionMode.ShortPhrase,  // Specify the speech recognition mode.
    "en-US",          // The language of the speech is hardcoded to English, because both demo files contain English speech.
    SUBSCRIPTIONKEY); // The key of the Bing Speech API service instance.

To recognize speech longer than 15 seconds, you need to use the SpeechRecognitionMode.LongDictation mode.
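
Whichever mode is chosen, the demo also needs to subscribe to the client's events in order to receive recognition results. Below is a minimal sketch of that wiring; the event names match those discussed later in this article, while the handler names follow the log output shown below and are illustrative rather than the demo's exact code:

// A sketch of the event wiring; the handler names mirror the log output later
// in this article and are illustrative, not the demo's exact code.
this.dataClient.OnPartialResponseReceived += this.OnPartialResponseReceivedHandler;  // Intermediate results.
this.dataClient.OnResponseReceived += this.OnDataShortPhraseResponseReceivedHandler; // Final n-best results.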

Sending audio in chunks

To achieve near real-time recognition, we must send the audio data to the server continuously in appropriately sized chunks; the code below uses a chunk size of 1024 bytes:

/// <summary>
/// Sends the audio data to the server.
/// </summary>
/// <param name="wavFileName">Name of the wav-format file.</param>
private void SendAudioHelper(string wavFileName)
{
    using (FileStream fileStream = new FileStream(wavFileName, FileMode.Open, FileAccess.Read))
    {
        // Note: for wave files we can send the data from the file straight to the server.
        // If you do not have a wave file but only raw audio data (for example audio coming
        // over Bluetooth), then before sending any audio data you must first send a
        // SpeechAudioFormat descriptor that describes the layout and format of the raw
        // audio, via DataRecognitionClient's SendAudioFormat() method.
        int bytesRead = 0;
        // Create a 1024-byte buffer.
        byte[] buffer = new byte[1024];

        try
        {
            do
            {
                // Read the file data into the buffer.
                bytesRead = fileStream.Read(buffer, 0, buffer.Length);

                // Send the audio data to the server through the DataRecognitionClient instance.
                this.dataClient.SendAudio(buffer, bytesRead);
            }
            while (bytesRead > 0);
        }
        finally
        {
            // Tell the server that all of the audio data has been sent.
            this.dataClient.EndAudio();
        }
    }
}

Note that after the data transfer is finished, you must explicitly tell the server that the audio is complete by calling the EndAudio() method.
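
Because SendAudioHelper blocks while it reads and uploads the file, it should not be called on the UI thread. A minimal usage sketch (the file name below is just a placeholder for one of the demo's hardcoded wav files):

// Kick off the upload on a background task so the WPF UI stays responsive.
// Requires: using System.Threading.Tasks;
// "sample.wav" stands in for one of the demo's hardcoded audio files.
Task.Run(() => this.SendAudioHelper("sample.wav"));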

Partial results and final results

Partial results
By sending the data to the speech recognition service in chunks, we get near real-time recognition. The server continuously delivers intermediate recognition results to the client through the OnPartialResponseReceived event. For example, for the ShortPhrase mode case demonstrated in the demo, we get the following intermediate results (in the upper output box):

--- Partial result received by OnPartialResponseReceivedHandler() ---
why

--- Partial result received by OnPartialResponseReceivedHandler() ---
what's

--- Partial result received by OnPartialResponseReceivedHandler() ---
what's the weather

--- Partial result received by OnPartialResponseReceivedHandler() ---
what's the weather like

During recognition, the OnPartialResponseReceived event was raised four times, and the recognized text became more and more complete. If an application keeps giving the user feedback based on these intermediate results, it feels real-time to the user.
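
A minimal sketch of such a partial-result handler, assuming the PartialSpeechResponseEventArgs type and its PartialResult property exposed by the client library (the demo's exact code may differ):

/// <summary>
/// A sketch of the partial-result handler; PartialSpeechResponseEventArgs and its
/// PartialResult property are assumptions about the client library.
/// </summary>
private void OnPartialResponseReceivedHandler(object sender, PartialSpeechResponseEventArgs e)
{
    // The event is raised on a background thread, so marshal back to the UI thread
    // before writing to the upper text box.
    Dispatcher.Invoke(() =>
    {
        this.WriteLine("--- Partial result received by OnPartialResponseReceivedHandler() ---");
        this.WriteLine("{0}", e.PartialResult);
    });
}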

Final results
When the user finishes speaking, the demo calls the EndAudio() method. After the speech recognition service completes recognition, it raises the OnResponseReceived event, and we output the result to the UI through the following method:

/// <summary>
/// Writes the speech recognition result returned by the server to the UI.
/// </summary>
/// <param name="e">An instance of <see cref="SpeechResponseEventArgs"/> containing the speech recognition results.</param>
private void WriteResponseResult(SpeechResponseEventArgs e)
{
    if (e.PhraseResponse.Results.Length == 0)
    {
        this.WriteLine("No phrase response is available.");
    }
    else
    {
        this.WriteLine("********* Final n-BEST Results *********");
        for (int i = 0; i < e.PhraseResponse.Results.Length; i++)
        {
            this.WriteLine(
                "[{0}] Confidence={1}, Text=\"{2}\"",
                i,
                e.PhraseResponse.Results[i].Confidence,
                e.PhraseResponse.Results[i].DisplayText);
        }
        this.WriteLine();
    }
}
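
WriteResponseResult itself is called from the OnResponseReceived handler. A minimal sketch of that handler for ShortPhrase mode, whose name matches the log output below; the body is illustrative rather than the demo's exact code:

/// <summary>
/// A sketch of the ShortPhrase final-result handler; the name matches the log
/// output below, but the body is illustrative.
/// </summary>
private void OnDataShortPhraseResponseReceivedHandler(object sender, SpeechResponseEventArgs e)
{
    // Marshal back to the UI thread before writing to the lower text box.
    Dispatcher.Invoke(() =>
    {
        this.WriteLine("--- OnDataShortPhraseResponseReceivedHandler ---");
        this.WriteResponseResult(e);
    });
}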

The output looks roughly like this:

--- OnDataShortPhraseResponseReceivedHandler ---
********* Final n-BEST Results *********
[0] Confidence=High, Text="What's the weather like?"

The above is a recognition result in ShortPhrase mode. Its characteristic is that there is only one final response, which can contain multiple recognition results, known as the n-best list. Each result in the n-best list contains properties such as Confidence, DisplayText, InverseTextNormalizationResult, LexicalForm, and MaskedInverseTextNormalizationResult; for example, we can use the Confidence property to judge whether a recognition result is reliable:

The figure above shows the actual response. Because the speech is very simple, the n-best list contains only one entry (the audio material from Azure is spoken with quite standard pronunciation).
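
As a small illustration, here is a sketch of filtering the n-best list by confidence inside the final-result handler. It assumes the Confidence property is an enumeration whose values (such as High) appear in the output above; treat the names as assumptions:

// A sketch: keep only the n-best entries the service marked as high confidence.
// Requires: using System.Linq; the Confidence enumeration is an assumption.
var reliableTexts = e.PhraseResponse.Results
    .Where(r => r.Confidence == Confidence.High)
    .Select(r => r.DisplayText)
    .ToList();

foreach (var text in reliableTexts)
{
    this.WriteLine("Reliable result: {0}", text);
}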

For recognition in LongDictation mode, the client's OnResponseReceived event is raised multiple times, returning recognition results in stages; the content of each result is similar to that of ShortPhrase mode. For more details, please look at the code directly; it is quite simple.
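
For reference, a sketch of what a LongDictation final-result handler might look like; the RecognitionStatus values used to detect the end of dictation are assumptions about the client library and may differ from the demo:

/// <summary>
/// A sketch of a LongDictation final-result handler. The RecognitionStatus checks
/// are assumptions about the client library, not the demo's exact code.
/// </summary>
private void OnDataLongDictationResponseReceivedHandler(object sender, SpeechResponseEventArgs e)
{
    // The service signals the end of the dictation session via the recognition status.
    bool isLastResult =
        e.PhraseResponse.RecognitionStatus == RecognitionStatus.EndOfDictation ||
        e.PhraseResponse.RecognitionStatus == RecognitionStatus.DictationEndSilenceTimeout;

    Dispatcher.Invoke(() =>
    {
        // Each intermediate final result carries its own n-best list.
        this.WriteResponseResult(e);

        if (isLastResult)
        {
            this.WriteLine("--- End of dictation ---");
        }
    });
}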

Supported languages

To keep things simple, the author directly used the English speech samples provided in the Azure documentation as the demo data. In fact, the Bing Speech API has fairly complete support for Chinese: all currently supported modes support Chinese. If you have other requirements, you can check the detailed list of supported languages here.
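
Switching the demo to Chinese, for example, would only require changing the language argument in the factory call. A sketch, assuming zh-CN is the locale code listed for Chinese in the language list referenced above:

// A sketch: the same factory call with a Chinese locale.
// "zh-CN" is an assumption; confirm the exact code against the supported-language list.
this.dataClient = SpeechRecognitionServiceFactory.CreateDataClient(
    SpeechRecognitionMode.LongDictation,
    "zh-CN",
    SUBSCRIPTIONKEY);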

Summary

The author first came into contact with speech recognition around the year 2000 and found it amazing at the time. The recognition quality was just not very good, and it required reading a reference document over and over again...
So many years have passed, and speech-related technology has actually not advanced all that quickly. The rise of AI gives us a glimmer of hope. Having introduced Azure AI's speech recognition service, let's continue exploring how AI can let programs understand the content of text.

Reference:
Bing Speech Recognition API in C# for .NET
