Unity C#'s Azure Microsoft SSML speech synthesis TTS stream to obtain audio data and simple arrangement of expression and mouth animation

Table of contents

1. Brief introduction

2. Implementation principle

3. Matters needing attention

4. Implementation steps

5. Key code


1. Brief introduction

A collection of Unity utility classes: modules I have organized that may be useful in game development, each usable independently to make development easier.

This section shows how to use SSML with Microsoft Azure to synthesize speech, obtain the facial-expression and mouth-shape (viseme) animation data, and save both locally. In some scenarios you can then load the audio and animation data from local storage and use them directly, avoiding the latency a network round trip might cause. This is only a brief walkthrough; if you have a better approach, please leave a comment.

Speech Synthesis Markup Language (SSML) is an XML-based markup language that can be used to fine-tune text-to-speech output properties such as pitch, pronunciation, rate, volume, and more. It gives you far more control and flexibility than plain text input.

You can use SSML to do the following:

  •     Define the input text structure, which determines the structure, content, and other characteristics of the text-to-speech output. For example, SSML can be used to define paragraphs, sentences, breaks/pauses, or silences. Text can be wrapped with event markers such as bookmarks or visemes, which the application can process later.
  •     Choose the voice, language, name, style, and role. Multiple voices can be used in a single SSML document, and you can adjust accent, speaking rate, pitch, and volume. SSML can also insert pre-recorded audio, such as sound effects or musical notes.
  •     Control the pronunciation of the output audio. For example, you can use SSML with phonemes and custom dictionaries to improve pronunciation, or to define specific pronunciations of words or mathematical expressions.

The following is a subset of the basic structure and syntax of an SSML document:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="string">
    <mstts:backgroundaudio src="string" volume="string" fadein="string" fadeout="string"/>
    <voice name="string" effect="string">
        <audio src="string"></audio>
        <bookmark mark="string"/>
        <break strength="string" time="string" />
        <emphasis level="value"></emphasis>
        <lang xml:lang="string"></lang>
        <lexicon uri="string"/>
        <math xmlns="http://www.w3.org/1998/Math/MathML"></math>
        <mstts:audioduration value="string"/>
        <mstts:express-as style="string" styledegree="value" role="string"></mstts:express-as>
        <mstts:silence type="string" value="string"/>
        <mstts:viseme type="string"/>
        <p></p>
        <phoneme alphabet="string" ph="string"></phoneme>
        <prosody pitch="value" contour="value" range="value" rate="value" volume="value"></prosody>
        <s></s>
        <say-as interpret-as="string" format="string" detail="string"></say-as>
        <sub alias="string"></sub>
    </voice>
</speak>
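To avoid hand-escaping this XML in C#, the skeleton above can also be assembled with `System.Xml.Linq`. The following is a minimal sketch using the two namespaces from the skeleton; the `SsmlBuilder` helper name and its parameters are illustrative, not part of the SDK:

```csharp
using System;
using System.Xml.Linq;

static class SsmlBuilder
{
    // Namespaces taken from the SSML skeleton above
    static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";
    static readonly XNamespace Mstts = "https://www.w3.org/2001/mstts";

    // Build a minimal SSML document for one voice, one style, and one text span.
    public static string Build(string voiceName, string style, string text)
    {
        var speak = new XElement(Ssml + "speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xmlns + "mstts", Mstts),
            new XAttribute(XNamespace.Xml + "lang", "zh-CN"),
            new XElement(Ssml + "voice",
                new XAttribute("name", voiceName),
                new XElement(Mstts + "viseme", new XAttribute("type", "FacialExpression")),
                new XElement(Mstts + "express-as",
                    new XAttribute("style", style), text)));
        return speak.ToString();
    }
}
```

The resulting string can be passed straight to `SpeakSsmlAsync`, and `XElement` takes care of namespace declarations and character escaping.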

SSML voice and sound reference:

Speech and Sound for Speech Synthesis Markup Language (SSML) - Speech Services - Azure AI services | Microsoft Learn

Official website registration:

Azure for Students - Free Account Credits | Microsoft Azure

Official website technical documentation URL:

Technical Documentation | Microsoft Learn

TTS of the official website:

Text-to-Speech Quickstart - Speech Services - Azure Cognitive Services | Microsoft Learn

Azure Unity SDK package official website:

Install the Speech SDK - Azure Cognitive Services | Microsoft Learn

SDK specific link:

https://aka.ms/csspeech/unitypackage

 

2. Implementation principle

1. On the Azure portal, apply for the SPEECH_KEY and SPEECH_REGION of a speech resource for speech synthesis

2. Configure the language and the desired voice accordingly

3. Use SSML with streaming to get the audio data, then play it through an audio source or save it. A sample follows:

public static async Task SynthesizeAudioAsync()
{
    var speechConfig = SpeechConfig.FromSubscription("YourSpeechKey", "YourSpeechRegion");
    // Passing null as the AudioConfig prevents playback on the default output device
    using var speechSynthesizer = new SpeechSynthesizer(speechConfig, null);
 
    var ssml = File.ReadAllText("./ssml.xml");
    var result = await speechSynthesizer.SpeakSsmlAsync(ssml);
 
    using var stream = AudioDataStream.FromResult(result);
    await stream.SaveToWaveFileAsync("path/to/write/file.wav");
}
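When syncing the saved audio with the viseme timeline later, it can help to read the duration back from the WAV file itself. Below is a hedged sketch that parses a canonical 44-byte RIFF/WAVE PCM header; it assumes the simple header layout with the "data" chunk at offset 36 and is not a general RIFF parser:

```csharp
using System;
using System.IO;

static class WavInfo
{
    // Parse a canonical 44-byte RIFF/WAVE PCM header and return the
    // sample rate and the duration (in milliseconds) of the data chunk.
    public static (int SampleRate, double DurationMs) Read(byte[] wav)
    {
        if (wav.Length < 44 || BitConverter.ToInt32(wav, 0) != 0x46464952) // "RIFF"
            throw new InvalidDataException("not a RIFF file");
        short channels = BitConverter.ToInt16(wav, 22);
        int sampleRate = BitConverter.ToInt32(wav, 24);
        short bitsPerSample = BitConverter.ToInt16(wav, 34);
        int dataBytes = BitConverter.ToInt32(wav, 40);   // assumes "data" chunk at offset 36
        double seconds = dataBytes / (double)(sampleRate * channels * (bitsPerSample / 8));
        return (sampleRate, seconds * 1000.0);
    }
}
```

The millisecond duration can then be compared against the last viseme `AudioOffset` to confirm the animation covers the whole clip.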

4. Save the audio and the facial-expression/mouth-shape animation data locally

    // Get the synthesized audio data and save it as .wav
    using var stream = AudioDataStream.FromResult(speechSynthesisResult);
    await stream.SaveToWaveFileAsync($"./{fileName}.wav");



    /// <summary>
    /// Save the mouth-shape animation data locally as JSON
    /// </summary>
    /// <param name="fileName">file name to save as</param>
    /// <param name="content">content to save</param>
    /// <returns></returns>
    static async Task CommitAsync(string fileName,string content)
    {
        var bits = Encoding.UTF8.GetBytes(content);
        using (var fs = new FileStream(
            path: @$"d:\temp\{fileName}.json",
            mode: FileMode.Create,
            access: FileAccess.Write,
            share: FileShare.None,
            bufferSize: 4096,
            useAsync: true))
        {
            await fs.WriteAsync(bits, 0, bits.Length);
        }
    }

3. Matters needing attention

1. Not every speechSynthesisVoiceName can generate the corresponding facial-expression and mouth-shape animation data; verify that your chosen voice emits VisemeReceived events with a non-empty Animation payload

4. Implementation steps

Here the code is tested directly in a .NET console project in Visual Studio.

1. Install Microsoft's Speech SDK package (Microsoft.CognitiveServices.Speech) from NuGet

2. Write code that uses SSML to synthesize speech and saves the audio file and the mouth-shape animation JSON data locally

3. Run the code. When it finishes, the audio file and the mouth-shape animation JSON data are saved locally.

 

4. Inspect the saved data locally
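When the saved JSON is loaded back (for example in Unity), each array element is one `Animation` payload. For `type='FacialExpression'`, per the Azure viseme documentation, a payload has the shape `{"FrameIndex": n, "BlendShapes": [[...floats...], ...]}` with frames at 60 FPS. A hedged sketch that counts the total frames and derives the animation length (the `VisemeJson` helper is illustrative):

```csharp
using System;
using System.Text.Json;

static class VisemeJson
{
    // Count blend-shape frames across all Animation chunks in the saved
    // JSON array, and derive the animation length assuming 60 FPS.
    public static (int Frames, double Seconds) Measure(string json)
    {
        using var doc = JsonDocument.Parse(json);
        int frames = 0;
        foreach (var chunk in doc.RootElement.EnumerateArray())
            frames += chunk.GetProperty("BlendShapes").GetArrayLength();
        return (frames, frames / 60.0);
    }
}
```

Comparing this length against the saved WAV duration is a quick sanity check that the animation and audio line up.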

 

5. Key code

using Microsoft.CognitiveServices.Speech;
using System.Text;

class Program
{
    // Replace with your own speech resource key and region
    static string speechKey = "YOUR_SPEECH_KEY";
    static string speechRegion = "YOUR_SPEECH_REGION";
    static string speechSynthesisVoiceName = "zh-CN-XiaoxiaoNeural";
    static string fileName = "Test" + "Hello";
    static string InputAudioContent = "黄河之水天上来,奔流到海不复回";  // Text to synthesize

    static int index = 0;         // Number of viseme animation chunks received
    static string content = "[";  // "[" opens the JSON array

    async static Task Main(string[] args)
    {
        var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);

        // Add more XML configuration as needed to make the synthesized voice more vivid and expressive
        var ssml = @$"<speak version='1.0' xml:lang='zh-CN' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts'>
            <voice name='{speechSynthesisVoiceName}'>
                <mstts:viseme type='FacialExpression'/>
                <mstts:express-as style='friendly'>{InputAudioContent}</mstts:express-as>
            </voice>
        </speak>";

        // Required for sentence-level WordBoundary events
        speechConfig.SetProperty(PropertyId.SpeechServiceResponse_RequestSentenceBoundary, "true");

        using (var speechSynthesizer = new SpeechSynthesizer(speechConfig))
        {
            // Subscribe to events
            // Receive facial-expression / mouth-shape (viseme) data
            speechSynthesizer.VisemeReceived += (s, e) =>
            {
                Console.WriteLine($"VisemeReceived event:" +
                    $"\r\n\tAudioOffset: {(e.AudioOffset + 5000) / 10000}ms" 
                   + $"\r\n\tVisemeId: {e.VisemeId}" 
                    // + $"\r\n\tAnimation: {e.Animation}"
                    );
                if (!string.IsNullOrEmpty(e.Animation))
                {
                    // The trailing "\r\n," separates the chunks so they later form a JSON array
                    content += e.Animation + "\r\n,";
                    index++;
                }
            };
            
            // Subscribe to the synthesis-completed event
            speechSynthesizer.SynthesisCompleted += async (s, e) =>
            {
                Console.WriteLine($"SynthesisCompleted event:" +
                    $"\r\n\tAudioData: {e.Result.AudioData.Length} bytes" +
                    $"\r\n\tindex: {index} " +
                    $"\r\n\tAudioDuration: {e.Result.AudioDuration}");
                content = content.Substring(0, content.Length - 1);  // trim the trailing comma
                content += "]";  // "]" closes the JSON array
                await CommitAsync(fileName, content);
            };

            // Synthesize the SSML
            Console.WriteLine($"SSML to synthesize: \r\n{ssml}");
            var speechSynthesisResult = await speechSynthesizer.SpeakSsmlAsync(ssml);

            // Get the synthesized audio data and save it as .wav
            using var stream = AudioDataStream.FromResult(speechSynthesisResult);
            await stream.SaveToWaveFileAsync(@$"d:\temp\{fileName}.wav");

            // Output the results
            switch (speechSynthesisResult.Reason)
            {
                case ResultReason.SynthesizingAudioCompleted:
                    Console.WriteLine("SynthesizingAudioCompleted result");
                    break;
                case ResultReason.Canceled:
                    var cancellation = SpeechSynthesisCancellationDetails.FromResult(speechSynthesisResult);
                    Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

                    if (cancellation.Reason == CancellationReason.Error)
                    {
                        Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
                        Console.WriteLine($"CANCELED: ErrorDetails=[{cancellation.ErrorDetails}]");
                        Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
                    }
                    break;
                default:
                    break;
            }
        }

        Console.WriteLine("Press any key to exit...");
        Console.ReadKey();
    }


    /// <summary>
    /// Save the mouth-shape animation data locally as JSON
    /// </summary>
    /// <param name="fileName">file name to save as</param>
    /// <param name="content">content to save</param>
    /// <returns></returns>
    static async Task CommitAsync(string fileName,string content)
    {
        var bits = Encoding.UTF8.GetBytes(content);
        using (var fs = new FileStream(
            path: @$"d:\temp\{fileName}.json",
            mode: FileMode.Create,
            access: FileAccess.Write,
            share: FileShare.None,
            bufferSize: 4096,
            useAsync: true))
        {
            await fs.WriteAsync(bits, 0, bits.Length);
        }
    }
}
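The manual `content = "[" + ... + ","` bookkeeping in `Main` is easy to get wrong: the trailing comma must be trimmed, and the event handler mutates shared static state. An alternative sketch that collects the chunks in a list and joins them once at the end (the `AnimationCollector` class is my own suggestion, not part of the SDK):

```csharp
using System.Collections.Generic;

class AnimationCollector
{
    readonly List<string> chunks = new List<string>();

    // Called from the VisemeReceived handler with e.Animation
    public void Add(string animationJson)
    {
        if (!string.IsNullOrEmpty(animationJson))
            chunks.Add(animationJson);
    }

    public int Count => chunks.Count;

    // Join the raw JSON chunks into a single valid JSON array;
    // no trailing-comma trimming needed.
    public string ToJsonArray() => "[" + string.Join(",", chunks) + "]";
}
```

In `Main` you would call `Add(e.Animation)` from the `VisemeReceived` handler and pass `ToJsonArray()` to `CommitAsync` in `SynthesisCompleted`.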

Origin blog.csdn.net/u014361280/article/details/132313878