Unity + iFlytek Speech + iFlytek Spark + Motionverse: Building an Intelligent Digital Human

No preamble. Here is the demo video first.

Unity + iFlytek speech recognition + iFlytek Spark large model + Motionver

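Next, how it works.

Getting this running mainly means integrating three modules:

Speech recognition. It turns the microphone audio into text that a GPT-style large model can read, then sends that text to the model.
GPT-style large model. It turns the question text produced in the first step into an answer text.
Digital human driver. It uses the answer text to drive the digital human's motion and to generate the speech that reads the answer out.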
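
To make the hand-off between the three modules concrete, here is a minimal sketch of the flow. The class name `DigitalHumanPipeline` and the three delegates are hypothetical stand-ins; in the actual project these roles are filled by the iFlytek and Motionverse wrappers described in the rest of this post.

```csharp
using System;

// Hypothetical glue class: each delegate stands in for one of the three modules.
public sealed class DigitalHumanPipeline
{
    private readonly Func<byte[], string> speechToText;  // module 1: audio -> question text
    private readonly Func<string, string> askModel;      // module 2: question -> answer text
    private readonly Action<string> driveAvatar;         // module 3: answer -> motion and speech

    public DigitalHumanPipeline(
        Func<byte[], string> speechToText,
        Func<string, string> askModel,
        Action<string> driveAvatar)
    {
        this.speechToText = speechToText ?? throw new ArgumentNullException(nameof(speechToText));
        this.askModel = askModel ?? throw new ArgumentNullException(nameof(askModel));
        this.driveAvatar = driveAvatar ?? throw new ArgumentNullException(nameof(driveAvatar));
    }

    // Runs one utterance through all three stages, stopping early if a stage yields nothing.
    public void HandleUtterance(byte[] pcm)
    {
        string question = speechToText(pcm);
        if (string.IsNullOrWhiteSpace(question))
            return;  // nothing was recognized

        string answer = askModel(question);
        if (string.IsNullOrWhiteSpace(answer))
            return;  // the model returned no answer

        driveAvatar(answer);
    }
}
```

In the real project the stages are asynchronous and connected through events rather than direct calls, so treat this only as a picture of the data flow.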

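Below are the concrete integration steps, the pitfalls I hit, and how I got around them.

1. Integrating iFlytek Speech

This was the smoothest of the three modules.
Signing up is simple: go to the iFlytek open platform, register an account, create an application, and claim the new-user package, which gives you a quota of more than 50,000 service calls. That is plenty for development and testing. None of this is technical, so I will skip the details.
Two problems come up here: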

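How to record automatically. The microphone is actually recording the whole time, and I monitor its average volume over a short window. When the volume reaches a threshold, I note that position (minus a little, so the start of the utterance is not clipped) as the start point. When the average volume then drops below a threshold and stays low for a while, I note that position as the end point. Cutting the recording between the start and end points gives the utterance. A finite-state machine implements this logic cleanly.
The code is as follows: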

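public event OnAudioToTextConvertedHandler OnAudioToTextConverted;  // raised when speech has been converted to text
public event OnRecordStartedHandler OnRecordStarted;                // raised when recording starts
public event OnRecordStopedHandler OnRecordStoped;                  // raised when recording stops

private enum State  // the three states of the state machine
{
    Listening,  // listening, not recording yet
    Recording,  // recording
    PreStop     // about to stop recording
}

// The state machine starts in the listening state
private State state = State.Listening;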

private void Update()
{
    volume = GetVolume();  // average volume of the most recent samples
    switch (state)
    {
        case State.Listening:  // while listening, start recording once the volume exceeds the threshold
            if (volume > BeginRecordThreshold)
            {
                state = State.Recording;
                // Back up 2000 samples so the start of the utterance is kept;
                // clamp so the start position never goes negative
                startPos = Mathf.Max(0, Microphone.GetPosition(null) - 2000);

                OnRecordStarted?.Invoke();
            }
            break;

        case State.Recording:  // while recording, prepare to stop once the volume drops below the threshold
            if (volume < StopRecordThreshold)
            {
                state = State.PreStop;
                waitTime = 0;
            }
            break;

        case State.PreStop:  // while preparing to stop, really stop once the quiet period has lasted long enough
            if (volume < StopRecordThreshold)
            {
                if (waitTime > StopWaitTime)
                {
                    var end = Microphone.GetPosition(null);
                    XunFeiMSC.AudioToText(clip.ToBytes(startPos, end));
                    state = State.Listening;

                    OnRecordStoped?.Invoke();
                }
                else
                    waitTime += Time.deltaTime;
            }
            else
                state = State.Recording;  // the speaker continued, so keep recording
            break;
    }
}

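private float GetVolume()
{
    if (Microphone.IsRecording(null))
    {
        // Read the latest TestDataLength samples just behind the current write position
        var offset = Microphone.GetPosition(null) - (TestDataLength + 1);
        if (offset < 0)
            return 0;

        clip.GetData(datas, offset);

        // Average the absolute amplitude: signed samples largely cancel out and would average to about zero.
        // The two thresholds have to be tuned against this measure.
        float av = datas.Sum(s => Mathf.Abs(s)) / datas.Length;
        return av;
    }

    return 0;
}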
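
The listing above calls `clip.ToBytes(startPos, end)`, an extension method this post does not show. As a rough idea of what such a helper has to do, here is a sketch that works on a plain `float[]` instead of an `AudioClip`. It assumes the microphone clip loops (so the range may wrap around the end of the buffer) and that the recognizer expects 16-bit little-endian PCM; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper: slices a looping sample buffer and converts it to 16-bit PCM.
public static class PcmSlice
{
    // Copies samples [start, end) out of a ring buffer and encodes them as little-endian 16-bit PCM.
    public static byte[] ToPcm16(float[] ring, int start, int end)
    {
        int n = ring.Length;
        if (n == 0)
            return Array.Empty<byte>();

        // Normalize both positions into [0, n); a negative start wraps to the tail of the buffer.
        start = ((start % n) + n) % n;
        end = ((end % n) + n) % n;

        // If end is behind start, the range wraps past the end of the buffer.
        int count = end >= start ? end - start : n - start + end;

        var bytes = new byte[count * 2];
        for (int i = 0; i < count; i++)
        {
            float sample = Math.Max(-1f, Math.Min(1f, ring[(start + i) % n]));
            short value = (short)Math.Round(sample * short.MaxValue);
            bytes[2 * i] = (byte)(value & 0xFF);             // low byte first
            bytes[2 * i + 1] = (byte)((value >> 8) & 0xFF);  // then high byte
        }
        return bytes;
    }
}
```

With an `AudioClip`, the samples would first be read into a `float[]` with `clip.GetData`. Note that equal start and end positions are treated as an empty range here, so an utterance that fills the whole buffer exactly would need separate handling.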

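How to call the iFlytek speech SDK from C#
This SDK is honestly not very friendly: it only supports C/C++, so using it in Unity takes real digging. First, you have to work out which APIs you actually need, look up their definitions in the header files, and import them from the DLL. Second, importing from a C/C++ DLL is not painless, because many parameters have to be converted to C# types, which raises the question of which type maps to which. There is no shortcut: ask GPT-4, search Google. That said, here are the declarations I have already sorted out: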

using System;
using System.Runtime.InteropServices;

namespace HexuXunFeiMSC
{
    public static class XunFeiMSC
    {
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int MSPLogin(string usr, string pwd, string parameters);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int MSPLogout();
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern IntPtr MSPUploadData(string dataName, IntPtr data, uint dataLen, string _params,
            ref int errorCode);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int MSPAppendData(IntPtr data, uint dataLen, uint dataStatus);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern IntPtr MSPDownloadData(string _params, ref uint dataLen, ref int errorCode);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int MSPSetParam(string paramName, string paramValue);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int MSPGetParam(string paramName, ref byte[] paramValue, ref uint valueLen);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern IntPtr MSPGetVersion(string verName, ref int errorCode);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern IntPtr QISRSessionBegin(string grammarList, string _params, ref int errorCode);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int QISRAudioWrite(IntPtr sessionID, byte[] waveData, uint waveLen,
            AudioStatus audioStatus, ref EpStatus epStatus, ref RecogStatus recogStatus);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern IntPtr QISRGetResult(IntPtr sessionID, ref RecogStatus rsltStatus, int waitTime,
            ref int errorCode);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int QISRSessionEnd(IntPtr sessionID, string hints);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int QISRGetParam(string sessionID, string paramName, ref byte[] paramValue,
            ref uint valueLen);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern IntPtr QTTSSessionBegin(string _params, ref int errorCode);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int QTTSTextPut(IntPtr sessionID, string textString, uint textLen, string _params);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern IntPtr QTTSAudioGet(IntPtr sessionID, ref uint audioLen, ref SynthStatus synthStatus,
            ref int errorCode);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern IntPtr QTTSAudioInfo(IntPtr sessionID);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int QTTSSessionEnd(IntPtr sessionID, string hints);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int QTTSSetParam(IntPtr sessionID, string paramName, byte[] paramValue);
        [DllImport("msc_x64", CallingConvention = CallingConvention.StdCall)]
        public static extern int QTTSGetParam(IntPtr sessionID, string paramName, ref byte[] paramValue,
            ref uint valueLen);
    }
}

Then send the recording and fetch the converted text:

// This is the multithreaded version; the function runs off the main thread
private static void RealQueryAudioToText(object obj)
{
    try
    {
        byte[] data = (byte[])obj;

        // Log in to the iFlytek platform first
        int res = MSPLogin(null, null, appId);
        if (res != 0)
            throw new Exception($"Can't Login {res}");

        // Obtain a session ID
        IntPtr sessionID = QISRSessionBegin(null, sessionBeginParams, ref res);
        if (res != 0)
            throw new Exception($"SessionBegin Error {res}");

        EpStatus epStatus = EpStatus.MSP_EP_LOOKING_FOR_SPEECH;
        RecogStatus recognizeStatus = RecogStatus.MSP_REC_STATUS_SUCCESS;

        // Send the audio and mark it as the last segment (the utterance is complete, no more audio follows)
        res = QISRAudioWrite(sessionID, data, (uint)data.Length, AudioStatus.MSP_AUDIO_SAMPLE_LAST, ref epStatus, ref recognizeStatus);
        if (res != 0)
            throw new Exception($"Write failed {res}");

        StringBuilder sb = new StringBuilder();

        // Keep fetching results until recognition is complete
        while (recognizeStatus != RecogStatus.MSP_REC_STATUS_COMPLETE)
        {
            IntPtr curtRslt = QISRGetResult(sessionID, ref recognizeStatus, 0, ref res);
            if (res != 0)
                throw new Exception($"get result failed. error code: {res}");

            sb.Append(Marshal.PtrToStringUTF8(curtRslt));
        }

        // End the session
        res = QISRSessionEnd(sessionID, "Finish");
        if (res != 0)
            throw new Exception($"end failed. error code: {res}");

        // Log out
        res = MSPLogout();
        if (res != 0)
            throw new Exception($"logout failed. error code {res}");

        // Deliver the result through the callback
        OnAudioToText?.Invoke(sb.ToString());
    }
    catch (Exception e)
    {
        OnError?.Invoke(e.Message);
    }
}
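
The listing above relies on `Marshal.PtrToStringUTF8`, which may not be exposed by every API compatibility level. If it is missing in your setup, a small replacement along these lines should do the same job. This is a sketch: the `NativeStrings` name is hypothetical, and it assumes the SDK returns a NUL-terminated UTF-8 string, which is what the original call also assumes.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper for reading strings returned by a native library.
public static class NativeStrings
{
    // Reads a NUL-terminated UTF-8 string from unmanaged memory.
    // Returns an empty string for a null pointer, so it is safe to append the result directly.
    public static string FromUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero)
            return string.Empty;

        // Find the terminating NUL byte to learn the length.
        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0)
            length++;

        // Copy the bytes into managed memory and decode them.
        var buffer = new byte[length];
        Marshal.Copy(ptr, buffer, 0, length);
        return Encoding.UTF8.GetString(buffer);
    }
}
```

Inside the result loop, the call would then read `sb.Append(NativeStrings.FromUtf8(curtRslt));`.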

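2. Integrating the iFlytek Spark Large Model

Access has to be requested in advance from the iFlytek console. Approval is easy to get but takes about a day. Once approved, you receive 500,000 tokens. A token is roughly a "word"; on average one sentence uses somewhere between a few dozen and a few hundred tokens, so 500,000 is enough for testing.
I hit a lot of pitfalls here.
I had planned to keep using the SDK it provides, but it is again C/C++ only. Remembering how painful it was to import the speech recognition DLL into C#, I wavered. It also offers Web access, so I went with that without hesitation, only to run into a pile of problems there too.
Following its Python sample and my own understanding, I rewrote it in C# and went to test it full of confidence. Every attempt failed. First I sent the request with UnityWebRequest, and it failed. Suspecting that UnityWebRequest does not support the WSS protocol, I switched to HttpWebRequest, and it still failed. Finally I rewrote the request logic with WebSocket, and it failed again. So I calmed down and reread the code that runs before the request is sent, and found that the authentication URL generated in C# was a few bytes shorter than the one from the Python sample. I never found the cause, which is baffling. Comparing the two closely, the missing bytes were always the same four characters, IA==. So I simply hardcoded them, and to my surprise that fixed it. I have not gone back to check whether this was also why UnityWebRequest and HttpWebRequest failed. Here is the code: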

// Build the authenticated URL
private static string BuildURL()
{
    Uri uri = new Uri(gptUrl);
    string host = uri.Host;
    string date = DateTime.UtcNow.ToString("R");
    var auth = $"host: {host}\ndate: {date}\nGET {uri.PathAndQuery} HTTP/1.1";
    var sha256 = HmacSha256(auth, apiSecret);
    var signature = Convert.ToBase64String(sha256);
    string authorizationOrigin =
        $"api_key=\"{apiKey}\", algorithm=\"hmac-sha256\", headers=\"host date request-line\", signature=\"{signature}\"";
    string authorization = Convert.ToBase64String(Encoding.UTF8.GetBytes(authorizationOrigin));
    // "IA==" is the fixed four-character suffix that was missing compared with the Python sample's URL
    return gptUrl + "?authorization=" + authorization + UnityWebRequest.EscapeURL("IA==") + "&date=" +
           UnityWebRequest.EscapeURL(date) + "&host=" + UnityWebRequest.EscapeURL(host);
}
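
`BuildURL` calls an `HmacSha256` helper that is not shown in this post. A minimal version, assuming both the message and the secret are UTF-8 text, could look like the sketch below. The `SparkSigning` class name and the `BuildQuery` method are hypothetical additions. `BuildQuery` escapes every query value, including the base64 `authorization` string, which can contain `+`, `/` and `=`. Whether doing so would remove the need for the hardcoded suffix is untested.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper class for the signing steps used by BuildURL.
public static class SparkSigning
{
    // Computes HMAC-SHA256 of the message, keyed with the secret (both encoded as UTF-8).
    public static byte[] HmacSha256(string message, string secret)
    {
        using (var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret)))
        {
            return hmac.ComputeHash(Encoding.UTF8.GetBytes(message));
        }
    }

    // Builds the query string with every value percent-encoded,
    // so base64 characters such as '+', '/' and '=' survive transport.
    public static string BuildQuery(string authorization, string date, string host)
    {
        return "authorization=" + Uri.EscapeDataString(authorization)
             + "&date=" + Uri.EscapeDataString(date)
             + "&host=" + Uri.EscapeDataString(host);
    }
}
```

A quick way to check the HMAC helper is to run it against a published test vector (for example from RFC 4231) before using it for real requests.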

// Send the question. This is also the multithreaded version; the function is called off the main thread
private static void RealRequestQuestion(object obj)
{
    try
    {
        // Before asking a new question, send the earlier ones too so the model can use the context.
        // The context is limited to 8K tokens, so only the previous 20 entries are kept
        if (contentTextList.Count > 20)
            contentTextList.RemoveAt(0);
        contentTextList.Add(new QueryText() { role = "user", content = (string)obj });

        // Build the authenticated URL and the request payload
        string url = BuildURL();
        string dataString = JsonConvert.SerializeObject(
            new
            {
                header = new { app_id = appID },
                parameter = new { chat = new { domain = "general" } },
                payload = new
                {
                    message = new
                    {
                        text = contentTextList.ToArray()
                    }
                }
            });

        // Connect, send the question, and read the result
        using ClientWebSocket webSocket = new ClientWebSocket();
        webSocket.Options.Proxy = null;
        webSocket.ConnectAsync(new Uri(url), CancellationToken.None).Wait();
        ArraySegment<byte> buffer = new ArraySegment<byte>(Encoding.UTF8.GetBytes(dataString));
        webSocket.SendAsync(buffer, WebSocketMessageType.Binary, true, CancellationToken.None).Wait();

        StringBuilder sbResult = new StringBuilder();
        byte[] receiveBuffer = new byte[2048];
        while (true)
        {
            WebSocketReceiveResult result = webSocket.ReceiveAsync(receiveBuffer, CancellationToken.None)
                .GetAwaiter().GetResult();
            string text = Encoding.UTF8.GetString(receiveBuffer, 0, result.Count);

            if (string.IsNullOrEmpty(text))
            {
                break;
            }

            JObject res = JsonConvert.DeserializeObject<JObject>(text);
            if (res == null)
            {
                break;
            }

            int code = (int)res["header"]?["code"];
            if (code != 0)
            {
                break;
            }

            JObject choices = (JObject)res["payload"]?["choices"];
            if (choices == null)
                break;

            int status = (int)choices["status"];
            string content = (string)choices["text"]?[0]?["content"];
            if (!string.IsNullOrEmpty(content))
                sbResult.Append(content);

            // status == 2 marks the final frame; any other value means the answer is not finished yet
            if (status == 2)
            {
                string answer = sbResult.ToString();
                contentTextList.Add(new QueryText()
                {
                    role = "assistant", content = answer
                });

                OnResponeQuestion?.Invoke(answer);
                break;
            }
        }
    }
    catch
    {
        // ignored
    }
}
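
The function above trims the history by entry count only, removing a single entry once the list grows past 20. Since the real limit is expressed in tokens, a slightly more defensive variant could also cap the total text length. The sketch below is one way to do that. `HistoryTrimmer` is a hypothetical name, the `QueryText` class here is a stand-in with the two fields the post uses, and the character budget is only a rough proxy for tokens, not the service's actual accounting.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the post's QueryText type; only the two fields used above are modeled.
public sealed class QueryText
{
    public string role { get; set; }
    public string content { get; set; }
}

// Hypothetical helper that keeps the conversation history within simple limits.
public static class HistoryTrimmer
{
    // Removes the oldest entries until both limits hold, always keeping the newest entry.
    public static void Trim(List<QueryText> history, int maxMessages, int maxChars)
    {
        int totalChars = 0;
        foreach (var message in history)
            totalChars += message.content?.Length ?? 0;

        while (history.Count > 1 && (history.Count > maxMessages || totalChars > maxChars))
        {
            totalChars -= history[0].content?.Length ?? 0;
            history.RemoveAt(0);
        }
    }
}
```

It would be called right after the new user question is added, for example `HistoryTrimmer.Trim(contentTextList, 20, 6000);`, where 6000 is an arbitrary illustrative budget.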

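3. Integrating Motionverse

This was arguably the biggest pitfall of the whole project. The Unity plugin it provides imports fine and runs fine in the Unity editor, but as soon as you try to build, it throws errors. My fix was: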

// Packages\cn.deepscience.motionverse\Runtime\Interface\EngineInterface.cs
        private const string DllName = 
#if UNITY_EDITOR
  "libMotionEngine";
#elif UNITY_IOS || UNITY_WEBGL
      "__Internal";
#elif UNITY_ANDROID
      "libMotionEngine";
#endif

// Change it to:
        private const string DllName = "libMotionEngine";

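Also, even with the change above, it cannot be built in Unity versions newer than 2020; the build only succeeds on 2020. And as soon as the Motionverse package is imported you get a pile of warnings, some of them for variables that are declared inside a method and never used. Seriously, even that kind of basic issue is in there. So the code quality of Motionverse is clearly not high.
It is, however, the easiest of the three to use. Once the character and skeleton are bound, the whole thing takes a single function call: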

TextDrive.GetDrive(text);

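Summary

Overall, this project still has a lot of room for improvement, for example:

iFlytek speech supports real-time transcription: you do not have to wait for the whole sentence, because the audio is converted while you are still speaking and the text is corrected on the fly, so the transcription is finished the moment you stop talking. That is more efficient, but I have not dug into real-time transcription and live correction; it takes more patience than I had.
Motionverse drives the avatar inefficiently. Every text has to be sent to its backend, which computes the speech and motion data and sends them back, so latency is high. The resulting motion is stiff, and there is no transition from the idle animation into the talking motion, which feels rough.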
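
As a starting point for the real-time idea in the first item, audio would have to be sent in small pieces tagged as first, middle, or last, instead of one buffer marked as the last segment. The sketch below only shows that chunking step on an already captured buffer. `ChunkKind` and `AudioChunker` are hypothetical names, and mapping the kinds onto the SDK's audio status values is left as an assumption to verify against the SDK headers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tag for each chunk; to be mapped onto the SDK's own audio status values.
public enum ChunkKind { First, Continue, Last }

public static class AudioChunker
{
    // Splits PCM bytes into fixed-size chunks and labels each one.
    // A buffer that fits into a single chunk is labeled Last.
    public static IEnumerable<(byte[] Data, ChunkKind Kind)> Split(byte[] pcm, int chunkBytes)
    {
        if (chunkBytes <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkBytes));

        for (int offset = 0; offset < pcm.Length; offset += chunkBytes)
        {
            int size = Math.Min(chunkBytes, pcm.Length - offset);
            var chunk = new byte[size];
            Buffer.BlockCopy(pcm, offset, chunk, 0, size);

            bool isLast = offset + size >= pcm.Length;
            ChunkKind kind = isLast
                ? ChunkKind.Last
                : (offset == 0 ? ChunkKind.First : ChunkKind.Continue);

            yield return (chunk, kind);
        }
    }
}
```

A live version would push chunks as the microphone produces them and read partial results between writes, rather than splitting a finished recording.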