How to integrate AI speech recognition into Unity games

Introduction

Speech recognition is a technology that converts speech into text. Imagine what it could do in a game: players could issue voice commands to a control panel or to game characters, talk directly to NPCs, and interact in richer ways. This article shows how to use the Hugging Face Unity API to integrate state-of-the-art (SOTA) speech recognition into a Unity game.

You can try speech recognition for yourself by downloading a sample Unity game from the itch.io website.

Prerequisites

This tutorial assumes familiarity with some basic Unity concepts. You also need the Hugging Face Unity API installed; see the previous blog post for installation instructions.
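In short, installation typically amounts to adding the package through the Unity Package Manager and entering your Hugging Face API key in the package's setup wizard, but follow the installation post for the exact, up-to-date steps.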

Steps

1. Setting the scene

In this tutorial we will set up a very simple scene. Players can tap a button to start or stop recording speech, and the audio is recognized and converted to text.

First, create a new Unity project, then add a Canvas containing three UI components:

  1. Start button: press to start recording audio.

  2. Stop button: press to stop recording audio.

  3. Text component (TextMeshPro): displays the speech recognition result.

2. Create script

Create a script called SpeechRecognitionTest and attach it to an empty GameObject.

In the script, first define references to the UI components:

[SerializeField] private Button startButton;
[SerializeField] private Button stopButton;
[SerializeField] private TextMeshProUGUI text;

Assign the corresponding components in the Inspector window.

Then, in the Start() method, add listeners for the start and stop buttons:

private void Start() {
    startButton.onClick.AddListener(StartRecording);
    stopButton.onClick.AddListener(StopRecording);
}

At this point, the code in the script should look like this:

using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
    }

    private void StartRecording() {

    }

    private void StopRecording() {

    }
}

3. Record microphone voice input

Now let's record microphone input and encode it to WAV format. Start by defining a few member variables:

private AudioClip clip;
private byte[] bytes;
private bool recording;

Then, in StartRecording(), use the Microphone.Start() method to begin recording:

private void StartRecording() {
    clip = Microphone.Start(null, false, 10, 44100);
    recording = true;
}

This records up to 10 seconds of audio at 44100 Hz.
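
Passing null as the device name tells Unity to use the default microphone. If you want to be defensive, you can also verify that a recording device exists before starting. Below is a minimal sketch of that check (the warning message and early return are illustrative additions, not part of the original tutorial):

private void StartRecording() {
    // Microphone.devices lists the names of all connected recording devices.
    if (Microphone.devices.Length == 0) {
        text.text = "No microphone detected.";
        return;
    }
    clip = Microphone.Start(null, false, 10, 44100);
    recording = true;
}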

When the recording reaches the 10-second limit, we want it to stop automatically. To do this, add the following to the Update() method:

private void Update() {
    if (recording && Microphone.GetPosition(null) >= clip.samples) {
        StopRecording();
    }
}

Next, in StopRecording(), grab the recorded segment and encode it to WAV format:

private void StopRecording() {
    var position = Microphone.GetPosition(null);
    Microphone.End(null);
    var samples = new float[position * clip.channels];
    clip.GetData(samples, 0);
    bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
    recording = false;
}

Finally, we need to implement the EncodeAsWAV() method. Since we're using the Hugging Face API directly, all we need to do is prepare the audio data as WAV bytes:

private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
    // 44-byte WAV header plus 2 bytes per 16-bit sample.
    using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
        using (var writer = new BinaryWriter(memoryStream)) {
            // RIFF/WAVE header
            writer.Write("RIFF".ToCharArray());
            writer.Write(36 + samples.Length * 2);
            writer.Write("WAVE".ToCharArray());
            // "fmt " chunk: 16-bit PCM with the given channel count and sample rate
            writer.Write("fmt ".ToCharArray());
            writer.Write(16);
            writer.Write((ushort)1);
            writer.Write((ushort)channels);
            writer.Write(frequency);
            writer.Write(frequency * channels * 2);
            writer.Write((ushort)(channels * 2));
            writer.Write((ushort)16);
            // "data" chunk: samples converted from floats to 16-bit integers
            writer.Write("data".ToCharArray());
            writer.Write(samples.Length * 2);

            foreach (var sample in samples) {
                writer.Write((short)(sample * short.MaxValue));
            }
        }
        return memoryStream.ToArray();
    }
}

The full script looks like this:

using System.IO;
using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private AudioClip clip;
    private byte[] bytes;
    private bool recording;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
    }

    private void Update() {
        if (recording && Microphone.GetPosition(null) >= clip.samples) {
            StopRecording();
        }
    }

    private void StartRecording() {
        clip = Microphone.Start(null, false, 10, 44100);
        recording = true;
    }

    private void StopRecording() {
        var position = Microphone.GetPosition(null);
        Microphone.End(null);
        var samples = new float[position * clip.channels];
        clip.GetData(samples, 0);
        bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
        recording = false;
    }

    private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
        using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
            using (var writer = new BinaryWriter(memoryStream)) {
                writer.Write("RIFF".ToCharArray());
                writer.Write(36 + samples.Length * 2);
                writer.Write("WAVE".ToCharArray());
                writer.Write("fmt ".ToCharArray());
                writer.Write(16);
                writer.Write((ushort)1);
                writer.Write((ushort)channels);
                writer.Write(frequency);
                writer.Write(frequency * channels * 2);
                writer.Write((ushort)(channels * 2));
                writer.Write((ushort)16);
                writer.Write("data".ToCharArray());
                writer.Write(samples.Length * 2);

                foreach (var sample in samples) {
                    writer.Write((short)(sample * short.MaxValue));
                }
            }
            return memoryStream.ToArray();
        }
    }
}

To check that the script works correctly, you can add the following line to the end of the StopRecording() method:

File.WriteAllBytes(Application.dataPath + "/test.wav", bytes);

Now press the Start button, speak into the microphone, and then press the Stop button. Your recorded audio will be saved as test.wav in the project's Assets folder.
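
Unity imports the .wav file as an AudioClip once the Assets folder refreshes, so you can also play it back in the editor to confirm the capture worked.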

4. Speech recognition

Next, we'll run speech recognition on the encoded audio using the Hugging Face Unity API. To do this, create a SendRecording() method:

using HuggingFace.API;

private void SendRecording() {
    HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
        text.color = Color.white;
        text.text = response;
    }, error => {
        text.color = Color.red;
        text.text = error;
    });
}

This method sends the encoded audio to the speech recognition API, displaying the response text in white on success or the error message in red on failure. Note the using HuggingFace.API; directive, which goes at the top of the script.

Don't forget to call SendRecording() at the end of the StopRecording() method:

private void StopRecording() {
    /* other code */
    SendRecording();
}
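
At this point the transcription is only displayed on screen, but you can of course feed it into gameplay, as the introduction suggested. Below is a minimal, hypothetical sketch of keyword-based voice commands; the HandleTranscription method and its keywords are illustrative assumptions rather than part of the Hugging Face API, and you would call it from the success callback in SendRecording(), e.g. HandleTranscription(response);

private void HandleTranscription(string transcription) {
    // Naive keyword matching; replace the Debug.Log calls with your own game logic.
    string lower = transcription.ToLowerInvariant();
    if (lower.Contains("jump")) {
        Debug.Log("Voice command: jump");
    } else if (lower.Contains("stop")) {
        Debug.Log("Voice command: stop");
    }
}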

5. Finishing touches

Finally, let's improve the user experience with button interactability and status messages.

The start and stop buttons should only be interactable when appropriate: ready to record, recording, or stopping the recording.

While recording, or while waiting for the API to return a result, we can show a simple status message in the text component.

The full script looks like this:

using System.IO;
using HuggingFace.API;
using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private AudioClip clip;
    private byte[] bytes;
    private bool recording;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
        stopButton.interactable = false;
    }

    private void Update() {
        if (recording && Microphone.GetPosition(null) >= clip.samples) {
            StopRecording();
        }
    }

    private void StartRecording() {
        text.color = Color.white;
        text.text = "Recording...";
        startButton.interactable = false;
        stopButton.interactable = true;
        clip = Microphone.Start(null, false, 10, 44100);
        recording = true;
    }

    private void StopRecording() {
        var position = Microphone.GetPosition(null);
        Microphone.End(null);
        var samples = new float[position * clip.channels];
        clip.GetData(samples, 0);
        bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
        recording = false;
        SendRecording();
    }

    private void SendRecording() {
        text.color = Color.yellow;
        text.text = "Sending...";
        stopButton.interactable = false;
        HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
            text.color = Color.white;
            text.text = response;
            startButton.interactable = true;
        }, error => {
            text.color = Color.red;
            text.text = error;
            startButton.interactable = true;
        });
    }

    private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
        using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
            using (var writer = new BinaryWriter(memoryStream)) {
                writer.Write("RIFF".ToCharArray());
                writer.Write(36 + samples.Length * 2);
                writer.Write("WAVE".ToCharArray());
                writer.Write("fmt ".ToCharArray());
                writer.Write(16);
                writer.Write((ushort)1);
                writer.Write((ushort)channels);
                writer.Write(frequency);
                writer.Write(frequency * channels * 2);
                writer.Write((ushort)(channels * 2));
                writer.Write((ushort)16);
                writer.Write("data".ToCharArray());
                writer.Write(samples.Length * 2);

                foreach (var sample in samples) {
                    writer.Write((short)(sample * short.MaxValue));
                }
            }
            return memoryStream.ToArray();
        }
    }
}

Congratulations! Now you can integrate SOTA speech recognition into your Unity games!

If you have any questions, or want to get more involved with the Hugging Face for Games series, you can join the Hugging Face Discord channel!


Original English: https://hf.co/blog/unity-asr

Author: Dylan Ebert

Translated by: SuSung-boy

Proofreading/Typesetting: zhongdongy (阿东)
