Herramienta Unity Azure Microsoft SSML síntesis de voz TTS flujo para obtener datos de audio disposición simple

Tabla de contenido

1. Breve introducción

La clase de herramientas de Unity, algunos módulos que pueden usarse en el desarrollo de juegos organizados por mí , se pueden usar de forma independiente para facilitar el desarrollo de juegos.

Esta sección presenta los dos métodos de síntesis de voz usando Microsoft Azure . Aquí hay una breve explicación. Si tiene un método mejor, deje un mensaje para intercambiar.

El lenguaje de marcado de síntesis de voz (SSML) es un lenguaje de marcado basado en XML que se puede usar para ajustar las propiedades de salida de texto a voz, como el tono, la pronunciación, la velocidad, el volumen y más. Tiene mucho más control y flexibilidad que la entrada de texto sin formato.

Puede usar SSML para hacer lo siguiente:

Define la estructura del texto de entrada , que determina la estructura, el contenido y otras características de la salida de texto a voz. Por ejemplo, SSML se puede usar para definir párrafos, oraciones, pausas/pausas o silencios. El texto se puede envolver con marcadores de eventos, como marcadores o visemas, que la aplicación puede procesar más tarde.

Elige voz , idioma, nombre, estilo y rol. Se pueden usar varias voces en un solo documento SSML. Ajuste el acento, la velocidad del habla, el tono y el volumen. También puede usar SSML para insertar audio pregrabado, como efectos de sonido o notas musicales.

Controla la pronunciación del audio de salida . Por ejemplo, puede usar SSML con fonemas y diccionarios personalizados para mejorar la pronunciación. SSML también se puede utilizar para definir pronunciaciones específicas de palabras o expresiones matemáticas.
El siguiente es un subconjunto de la estructura básica y la sintaxis de un documento SSML:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="string">
    <mstts:backgroundaudio src="string" volume="string" fadein="string" fadeout="string"/>
    <voice name="string" effect="string">
        <audio src="string"></audio>
        <bookmark mark="string"/>
        <break strength="string" time="string" />
        <emphasis level="value"></emphasis>
        <lang xml:lang="string"></lang>
        <lexicon uri="string"/>
        <math xmlns="http://www.w3.org/1998/Math/MathML"></math>
        <mstts:audioduration value="string"/>
        <mstts:express-as style="string" styledegree="value" role="string"></mstts:express-as>
        <mstts:silence type="string" value="string"/>
        <mstts:viseme type="string"/>
        <p></p>
        <phoneme alphabet="string" ph="string"></phoneme>
        <prosody pitch="value" contour="value" range="value" rate="value" volume="value"></prosody>
        <s></s>
        <say-as interpret-as="string" format="string" detail="string"></say-as>
        <sub alias="string"></sub>
    </voice>
</speak>
SSML Speech and Sound
Speech and Sound for Speech Synthesis Markup Language (SSML) - Speech Services - Azure AI services | Microsoft Learn

Registro en el sitio web oficial:

Azure para estudiantes: créditos de cuenta gratuitos | Microsoft Azure

URL de la documentación técnica del sitio web oficial:

Documentación técnica | Microsoft Learn

TTS del sitio web oficial:

Inicio rápido de texto a voz - Servicios de voz - Azure Cognitive Services | Microsoft Learn

Sitio web oficial del paquete Azure Unity SDK:

Instale Speech SDK: Azure Cognitive Services | Microsoft Learn

Enlace específico del SDK:

https://aka.ms/csspeech/unitypackage

2. Principio de implementación

1. Solicite SPEECH_KEY y SPEECH_REGION correspondientes a la síntesis de voz en el sitio web oficial

2. Luego configure el idioma y la configuración de sonido requerida en consecuencia

3. Use SSML con transmisión para obtener los datos de audio, solo reprodúzcalos o guárdelos en la fuente de sonido, la muestra es la siguiente

public static async Task SynthesizeAudioAsync()
{
    var speechConfig = SpeechConfig.FromSubscription("YourSpeechKey", "YourSpeechRegion");
    using var speechSynthesizer = new SpeechSynthesizer(speechConfig, null);

    var ssml = File.ReadAllText("./ssml.xml");
    var result = await speechSynthesizer.SpeakSsmlAsync(ssml);

    using var stream = AudioDataStream.FromResult(result);
    await stream.SaveToWaveFileAsync("path/to/write/file.wav");
}

3. Pasos de implementación

Referencia básica de construcción del entorno: herramienta Unity Azure Microsoft Speech Synthesis método común y arreglo simple de transmisión de audio data_unity Speech síntesis

1. Implementación del script, monte el script correspondiente a la escena.

2. Ejecute la escena, use SSML para sintetizar TTS y reproduzca

4. Código clave

1, AzureTTSDataConSSMLHandler

using Microsoft.CognitiveServices.Speech;
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Xml;
using UnityEngine;

/// <summary>
/// 使用 SSML 方式语音合成
/// </summary>
public class AzureTTSDataWithSSMLHandler
{
    /// <summary>
    /// Azure TTS 合成 必要数据
    /// </summary>
    private const string SPEECH_KEY = "YOUR_SPEECH_KEY";
    private const string SPEECH_REGION = "YOUR_SPEECH_REGION";
    private const string SPEECH_RECOGNITION_LANGUAGE = "zh-CN";
    private string SPEECH_VOICE_NAME = "zh-CN-XiaoxiaoNeural";

    /// <summary>
    /// 创建 TTS 中的参数
    /// </summary>
    private CancellationTokenSource m_CancellationTokenSource;
    private AudioDataStream m_AudioDataStream;
    private Connection m_Connection;
    private SpeechConfig m_Config;
    private SpeechSynthesizer m_Synthesizer;

    /// <summary>
    /// 音频获取事件
    /// </summary>
    private Action<AudioDataStream> m_AudioStream;

    /// <summary>
    /// 开始播放TTS事件
    /// </summary>
    private Action m_StartTTSPlayAction;
    /// <summary>
    /// 停止播放TTS事件
    /// </summary>
    private Action m_StartTTSStopAction;

    /// <summary>
    /// 初始化
    /// </summary>
    public void Initialized()
    {
        m_Config = SpeechConfig.FromSubscription(SPEECH_KEY, SPEECH_REGION);
        m_Synthesizer = new SpeechSynthesizer(m_Config, null);
        m_Connection = Connection.FromSpeechSynthesizer(m_Synthesizer);
        m_Connection.Open(true);

    }

    /// <summary>
    /// 开始进行语音合成
    /// </summary>
    /// <param name="msg">合成的内容</param>
    /// <param name="stream">获取到的音频流数据</param>
    /// <param name="style"></param>
    public async void Start(string msg, Action<AudioDataStream> stream, string style = "chat")
    {
        this.m_AudioStream = stream;
        await SynthesizeAudioAsync(CreateSSML(msg, SPEECH_RECOGNITION_LANGUAGE, SPEECH_VOICE_NAME, style));
    }

    /// <summary>
    /// 停止语音合成
    /// </summary>
    public void Stop()
    {
        m_StartTTSStopAction?.Invoke();

        if (m_AudioDataStream != null)
        {
            m_AudioDataStream.Dispose();
            m_AudioDataStream = null;
        }

        if (m_CancellationTokenSource != null)
        {
            m_CancellationTokenSource.Cancel();
        }

        if (m_Synthesizer != null)
        {
            m_Synthesizer.Dispose();
            m_Synthesizer = null;
        }

        if (m_Connection != null)
        {
            m_Connection.Dispose();
            m_Connection = null;
        }
    }

    /// <summary>
    /// 设置语音合成开始播放事件
    /// </summary>
    /// <param name="onStartAction"></param>
    public void SetStartTTSPlayAction(Action onStartAction)
    {
        if (onStartAction != null)
        {
            m_StartTTSPlayAction = onStartAction;
        }
    }

    /// <summary>
    /// 设置停止语音合成事件
    /// </summary>
    /// <param name="onAudioStopAction"></param>
    public void SetStartTTSStopAction(Action onAudioStopAction)
    {
        if (onAudioStopAction != null)
        {
            m_StartTTSStopAction = onAudioStopAction;
        }
    }


    /// <summary>
    /// 开始异步请求合成 TTS 数据
    /// </summary>
    /// <param name="speakMsg"></param>
    /// <returns></returns>
    private async Task SynthesizeAudioAsync(string speakMsg)
    {

        Cancel();

        m_CancellationTokenSource = new CancellationTokenSource();

        var result = m_Synthesizer.StartSpeakingSsmlAsync(speakMsg);
        await result;

        m_StartTTSPlayAction?.Invoke();

        m_AudioDataStream = AudioDataStream.FromResult(result.Result);
        m_AudioStream?.Invoke(m_AudioDataStream);
    }

    private void Cancel()
    {
        if (m_AudioDataStream != null)
        {
            m_AudioDataStream.Dispose();
            m_AudioDataStream = null;
        }

        if (m_CancellationTokenSource != null)
        {
            m_CancellationTokenSource.Cancel();
        }
    }

    /// <summary>
    /// 生成 需要的 SSML XML 数据
    /// （格式不唯一，可以根据需要自行在增加删减）
    /// </summary>
    /// <param name="msg">合成的音频内容</param>
    /// <param name="language">合成语音</param>
    /// <param name="voiceName">采用谁的声音合成音频</param>
    /// <param name="style">合成时的语气类型</param>
    /// <returns>ssml XML</returns>
    private string CreateSSML(string msg, string language, string voiceName, string style = "chat")
    {
        // XmlDocument
        XmlDocument xmlDoc = new XmlDocument();

        // 设置 speak 基础元素
        XmlElement speakElem = xmlDoc.CreateElement("speak");
        speakElem.SetAttribute("version", "1.0");
        speakElem.SetAttribute("xmlns", "http://www.w3.org/2001/10/synthesis");
        speakElem.SetAttribute("xmlns:mstts", "http://www.w3.org/2001/mstts");
        speakElem.SetAttribute("xml:lang", language);

        // 设置 voice 元素
        XmlElement voiceElem = xmlDoc.CreateElement("voice");
        voiceElem.SetAttribute("name", voiceName);

        // 设置 mstts:viseme 元素
        XmlElement visemeElem = xmlDoc.CreateElement("mstts", "viseme", "http://www.w3.org/2001/mstts");
        visemeElem.SetAttribute("type", "FacialExpression");

        // 设置 语气 元素
        XmlElement styleElem = xmlDoc.CreateElement("mstts", "express-as", "http://www.w3.org/2001/mstts");
        styleElem.SetAttribute("style", style.ToString().Replace("_", "-"));

        // 创建文本节点，包含文本信息
        XmlNode textNode = xmlDoc.CreateTextNode(msg);

        // 设置好的元素添加到 xml 中
        voiceElem.AppendChild(visemeElem);
        styleElem.AppendChild(textNode);
        voiceElem.AppendChild(styleElem);
        speakElem.AppendChild(voiceElem);
        xmlDoc.AppendChild(speakElem);
        
        Debug.Log("[SSML  XML] Result : " + xmlDoc.OuterXml);
        return xmlDoc.OuterXml;
    }

}

2, Azure TTS Mono

using Microsoft.CognitiveServices.Speech;
using System;
using System.Collections.Concurrent;
using System.IO;
using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class AzureTTSMono : MonoBehaviour
{
    private AzureTTSDataWithSSMLHandler m_AzureTTSDataWithSSMLHandler;

    /// <summary>
    /// 音源和音频参数
    /// </summary>
    private AudioSource m_AudioSource;
    private AudioClip m_AudioClip;

    /// <summary>
    /// 音频流数据
    /// </summary>
    private ConcurrentQueue<float[]> m_AudioDataQueue = new ConcurrentQueue<float[]>();
    private AudioDataStream m_AudioDataStream;

    /// <summary>
    /// 音频播放完的事件
    /// </summary>
    private Action m_AudioEndAction;

    /// <summary>
    /// 音频播放结束的布尔变量
    /// </summary>
    private bool m_NeedPlay = false;
    private bool m_StreamReadEnd = false;

    private const int m_SampleRate = 16000;
    //最大支持60s音频 
    private const int m_BufferSize = m_SampleRate * 60;
    //采样容量
    private const int m_UpdateSize = m_SampleRate;

    //audioclip 设置过的数据个数
    private int m_TotalCount = 0;
    private int m_DataIndex = 0;

    #region Lifecycle function

    private void Awake()
    {
        m_AudioSource = GetComponent<AudioSource>();

        m_AzureTTSDataWithSSMLHandler = new AzureTTSDataWithSSMLHandler();
        m_AzureTTSDataWithSSMLHandler.SetStartTTSPlayAction(() => { Debug.Log(" Play TTS "); });
        m_AzureTTSDataWithSSMLHandler.SetStartTTSStopAction(() => { Debug.Log(" Stop TTS "); AudioPlayEndEvent(); });
        m_AudioEndAction = () => { Debug.Log(" End TTS "); };
        m_AzureTTSDataWithSSMLHandler.Initialized();
    }

    // Start is called before the first frame update
    void Start()
    {
        m_AzureTTSDataWithSSMLHandler.Start("今朝有酒，今朝醉，人生几年百花春", OnGetAudioStream);
    }

    // Update is called once per frame
    private void Update()
    {
        UpdateAudio();
    }

    #endregion

    #region Audio handler

    /// <summary>
    /// 设置播放TTS的结束的结束事件
    /// </summary>
    /// <param name="act"></param>
    public void SetAudioEndAction(Action act)
    {
        this.m_AudioEndAction = act;
    }

    /// <summary>
    /// 处理获取到的TTS流式数据
    /// </summary>
    /// <param name="stream">流数据</param>
    public async void OnGetAudioStream(AudioDataStream stream)
    {
        m_StreamReadEnd = false;
        m_NeedPlay = true;
        m_AudioDataStream = stream;
        Debug.Log("[AzureTTSMono] OnGetAudioStream");

        MemoryStream memStream = new MemoryStream();
        byte[] buffer = new byte[m_UpdateSize * 2];
        uint bytesRead;
        m_DataIndex = 0;
        m_TotalCount = 0;
        m_AudioDataQueue.Clear();

        // 回到主线程进行数据处理
        Loom.QueueOnMainThread(() =>
        {
            m_AudioSource.Stop();
            m_AudioSource.clip = null;
            m_AudioClip = AudioClip.Create("SynthesizedAudio", m_BufferSize, 1, m_SampleRate, false);
            m_AudioSource.clip = m_AudioClip;
        });

        do
        {
            bytesRead = await System.Threading.Tasks.Task.Run(() => m_AudioDataStream.ReadData(buffer));
            if (bytesRead <= 0)
            {
                break;
            }

            // 读取写入数据
            memStream.Write(buffer, 0, (int)bytesRead);
            {
                var tempData = memStream.ToArray();
                var audioData = new float[memStream.Length / 2];

                for (int i = 0; i < audioData.Length; ++i)
                {
                    audioData[i] = (short)(tempData[i * 2 + 1] << 8 | tempData[i * 2]) / 32768.0F;
                }
                try
                {
                    m_TotalCount += audioData.Length;
                    // 把数据添加到队列中
                    m_AudioDataQueue.Enqueue(audioData);

                    // new 获取新的地址，为后面写入数据
                    memStream = new MemoryStream();
                }
                catch (Exception e)
                {
                    Debug.LogError(e.ToString());
                }
            }

        } while (bytesRead > 0);

        m_StreamReadEnd = true;
    }

    /// <summary>
    /// Update 播放音频
    /// </summary>
    private void UpdateAudio() {
        if (!m_NeedPlay) return;

        //数据操作
        if (m_AudioDataQueue.TryDequeue(out float[] audioData))
        {

            m_AudioClip.SetData(audioData, m_DataIndex);
            m_DataIndex = (m_DataIndex + audioData.Length) % m_BufferSize;
        }

        //检测是否停止
        if (m_StreamReadEnd && m_AudioSource.timeSamples >= m_TotalCount)
        {
            AudioPlayEndEvent();
        }

        if (!m_NeedPlay) return;

        //由于网络，可能额有些数据还没有过来，所以根据需要判断是否暂停播放
        if (m_AudioSource.timeSamples >= m_DataIndex && m_AudioSource.isPlaying)
        {
            m_AudioSource.timeSamples = m_DataIndex;

            //暂停
            Debug.Log("[AzureTTSMono] Pause");
            m_AudioSource.Pause();


        }

        //由于网络，可能有些数据过来比较晚，所以这里根据需要判断是否继续播放
        if (m_AudioSource.timeSamples < m_DataIndex && !m_AudioSource.isPlaying)
        {
            //播放
            Debug.Log("[AzureTTSMono] Play");
            m_AudioSource.Play();
        }
    }

    /// <summary>
    /// TTS 播放结束的事件
    /// </summary>
    private void AudioPlayEndEvent()
    {
        Debug.Log("[AzureTTSMono] End");
        m_NeedPlay = false;
        m_AudioSource.timeSamples = 0;
        m_AudioSource.Stop();
        m_AudioEndAction?.Invoke();
    }

    #endregion
}