Using Microsoft's text-to-speech service in uni-app

I am taking part in the cross-platform technology topic essay call. For details, see: juejin.cn/post/710123…

Foreword

I have tried various TTS solutions, and in my experience Microsoft is the king of this field: the voices produced by its Azure text-to-speech service sound the most natural. However, Azure is a paid service, and registering and setting up payment is quite a hassle. Yet its official website actually provides a full-featured demo that lets you try all the voices and speaking styles...

But the demo simply doesn't let you download the result as an mp3 file, so some people resort to recording the computer's audio output, which is far too cumbersome. In fact, every resource you can see or hear on a web page has already been decoded by the browser. In other words, as long as the sound plays in the page, there must be a way to extract the audio file.

This article records the whole process of exploration and implementation. Enjoy~

Most of this article was written at the beginning of this year and held back until now. I know that once this method is made public, Microsoft may soon block it, or even remove the web demo entrance and its related interfaces entirely.

Analyzing the demo on the Azure official website

Open the DevTools panel in Chrome. When we click Play on the Azure official website, a wss:// request shows up in the Network tab, i.e. a WebSocket request.

Two parameters

In the request URL, we can see two parameters: Authorization and X-ConnectionId.

Interestingly, the first parameter sits right in the source code of the web page, so you can extract it directly by fetching the Azure text-to-speech page with an axios GET request.

// npm install axios
const axios = require("axios");

// The token is embedded directly in the page's HTML
const res = await axios.get("https://azure.microsoft.com/en-gb/services/cognitive-services/text-to-speech/");

const reg = /token: \"(.*?)\"/;
if (reg.test(res.data)) {
    const token = RegExp.$1;
}

By viewing the JS call stack that initiated the request, we can add a breakpoint and click Play again.

It turns out that the second parameter X-ConnectionId comes from a createNoDashGuid function:

this.privConnectionId = void 0 !== t ? t : s.createNoDashGuid(),

This is just a string in UUID v4 format; "no dash" simply means the - separators are removed.
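For reference, a minimal Node.js equivalent of that function (my own sketch, not Microsoft's source) could look like this:

// npm install uuid
const { v4: uuidv4 } = require("uuid");

// Generate a UUID v4 and strip the dashes, mimicking createNoDashGuid
function createNoDashGuid() {
    return uuidv4().replace(/-/g, "").toUpperCase();
}

console.log(createNoDashGuid()); // e.g. "818A1E398D8D4303956D180A3761864B"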

The three outgoing messages

With the two URL parameters sorted out, let's keep analyzing this WebSocket request. From the Messages tab we can see:

Every time Play is clicked, three messages are sent to the server, and the purpose of each is easy to make out.

The first message: SDK version, system information, and UserAgent

Path: speech.config
X-RequestId: 818A1E398D8D4303956D180A3761864B
X-Timestamp: 2022-05-27T16:45:02.799Z
Content-Type: application/json

{"context":{"system":{"name":"SpeechSDK","version":"1.19.0","build":"JavaScript","lang":"JavaScript"},"os":{"platform":"Browser/MacIntel","name":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36","version":"5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36"}}}

The second message: the speech output configuration. From outputFormat we can see that the final audio format is audio-24khz-160kbitrate-mono-mp3. Isn't that exactly the mp3 file we want?!

Path: synthesis.context
X-RequestId: 091963E8C7F342D0A8E79125EA6BB707
X-Timestamp: 2022-05-27T16:48:43.340Z
Content-Type: application/json

{"synthesis":{"audio":{"metadataOptions":{"bookmarkEnabled":false,"sentenceBoundaryEnabled":false,"visemeEnabled":false,"wordBoundaryEnabled":false},"outputFormat":"audio-24khz-160kbitrate-mono-mp3"},"language":{"autoDetection":false}}}

The third message: the text to synthesize, plus the voice name, speaking rate, pitch, emotion style, and other settings

Path: ssml
X-RequestId: 091963E8C7F342D0A8E79125EA6BB707
X-Timestamp: 2022-05-27T16:48:49.594Z
Content-Type: application/ssml+xml

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="zh-CN-XiaoxiaoNeural"><prosody rate="0%" pitch="0%">我叫大帅,一个热爱编程的老程序猿</prosody></voice></speak>

The binary messages received

Since the three outgoing messages already tell us that the returned format is an mp3 file, can we just merge all the returned binary data to stitch together the complete mp3? The answer is yes!

After every click of Play, the last of all the messages received from the WebSocket carries an explicit end marker.

turn.end marks the end of the conversion!

Implementing it in Node.js

Now that everything has been worked out, all that remains is to re-implement the process in Node.js.

Two parameters

  1. Authorization: fetch the page content with an axios GET request, then extract the token with a regular expression
// npm install axios
const axios = require("axios");

const res = await axios.get("https://azure.microsoft.com/en-gb/services/cognitive-services/text-to-speech/");

const reg = /token: \"(.*?)\"/;
if (reg.test(res.data)) {
    const Authorization = RegExp.$1;
}
  2. X-ConnectionId: just use the uuid library
// npm install uuid
const { v4: uuidv4 } = require('uuid');

// Note: the browser's createNoDashGuid also strips the dashes;
// add .replace(/-/g, "") here if you want to match it exactly
const XConnectionId = uuidv4().toUpperCase();

Creating the WebSocket connection

// npm install nodejs-websocket
const ws = require("nodejs-websocket");

const url = `wss://eastus.tts.speech.microsoft.com/cognitiveservices/websocket/v1?Authorization=${Authorization}&X-ConnectionId=${XConnectionId}`;
const connect = ws.connect(url);
// Note: wait until the connection is open before sending,
// e.g. connect.on("connect", () => { /* send the three messages */ });

The three outgoing messages

The first message

function getXTime(){
    return new Date().toISOString();
}

const message_1 = `Path: speech.config\r\nX-RequestId: ${XConnectionId}\r\nX-Timestamp: ${getXTime()}\r\nContent-Type: application/json\r\n\r\n{"context":{"system":{"name":"SpeechSDK","version":"1.19.0","build":"JavaScript","lang":"JavaScript","os":{"platform":"Browser/Linux x86_64","name":"Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0","version":"5.0 (X11)"}}}}`;

connect.send(message_1);

The second message

const message_2 = `Path: synthesis.context\r\nX-RequestId: ${XConnectionId}\r\nX-Timestamp: ${getXTime()}\r\nContent-Type: application/json\r\n\r\n{"synthesis":{"audio":{"metadataOptions":{"sentenceBoundaryEnabled":false,"wordBoundaryEnabled":false},"outputFormat":"audio-16khz-32kbitrate-mono-mp3"}}}`;

connect.send(message_2);

The third message

const SSML = `
    <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
        <voice name="zh-CN-XiaoxiaoNeural">
            <mstts:express-as style="general">
                <prosody rate="0%" pitch="0%">
                我叫大帅,一个热爱编程的老程序猿
                </prosody>
            </mstts:express-as>
        </voice>
    </speak>
    `

const message_3 = `Path: ssml\r\nX-RequestId: ${XConnectionId}\r\nX-Timestamp: ${getXTime()}\r\nContent-Type: application/ssml+xml\r\n\r\n${SSML}`

connect.send(message_3);
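As the protocol analysis showed, the voice name, rate, pitch, and style are all plain attributes inside the SSML, so it is easy to make them configurable. A minimal helper sketch (my own convenience function, not part of any SDK; the parameter names are made up):

// Hypothetical helper: build the SSML payload from configurable parameters
function buildSSML(text, voiceName = "zh-CN-XiaoxiaoNeural", rate = "0%", pitch = "0%", style = "general") {
    return `<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
        <voice name="${voiceName}">
            <mstts:express-as style="${style}">
                <prosody rate="${rate}" pitch="${pitch}">${text}</prosody>
            </mstts:express-as>
        </voice>
    </speak>`;
}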

Receiving the binary messages and stitching the mp3

After the three messages have been sent, we listen via connect.on("binary") for the binary messages the WebSocket receives.

Create an empty Buffer named final_data, then append each chunk of received binary content to it. As soon as a plain-text message containing the Path:turn.end identifier arrives, write final_data out as an mp3 file.

const fs = require("fs");

let final_data = Buffer.alloc(0);

// Text messages signal state; Path:turn.end means the synthesis is done
connect.on("text", (data) => {
    if (data.indexOf("Path:turn.end") >= 0) {
        fs.writeFileSync("test.mp3", final_data);
        connect.close();
    }
});

// Each binary message is a text header followed by an audio chunk
connect.on("binary", function (response) {
    let data = Buffer.alloc(0);
    response.on("readable", function () {
        const newData = response.read();
        if (newData) data = Buffer.concat([data, newData], data.length + newData.length);
    });
    response.on("end", function () {
        // Skip past the "Path:audio\r\n" header to the raw mp3 bytes
        const index = data.toString().indexOf("Path:audio") + 12;
        final_data = Buffer.concat([final_data, data.slice(index)]);
    });
});

And just like that, we have saved the mp3 audio file without even opening the Azure official website!

Command-line tool

I have packaged the whole thing into a command-line tool, and it is very simple to use:

npm install -g mstts-js
mstts -i 文本转语音 -o ./test.mp3

It's all open source: github.com/ezshine/mst…

Using it in uni-app

Create a new cloud function

Create a new cloud function and name it mstts.

Since everything is already encapsulated in mstts-js, you only need to run npm install mstts-js inside the cloud function and then require it. The code is as follows:

'use strict';
const mstts = require('mstts-js')

exports.main = async (event, context) => {
    const res = await mstts.getTTSData('要转换的文本', 'CN-Yunxi');

    // res is a Buffer
};

Download and play mp3 files

There are two ways to play this mp3 file in uni-app.

Method 1: upload to cloud storage first, then access it via the cloud storage URL

exports.main = async (event, context) => {
    const res = await mstts.getTTSData('要转换的文本', 'CN-Yunxi');

    // res is a Buffer; upload it to cloud storage
    const uploadRes = await uniCloud.uploadFile({
        cloudPath: "xxxxx.mp3",
        fileContent: res
    });

    return uploadRes.fileID;
};

Front-end usage:

uniCloud.callFunction({
    name: "mstts",
    success: (res) => {
        const aud = uni.createInnerAudioContext();
        aud.autoplay = true;
        // the cloud function's return value is in res.result
        aud.src = res.result;
        aud.play();
    }
});
  • Advantage: the logic stays inside the cloud function, which keeps it secure
  • Disadvantage: without a cleanup mechanism, files uploaded to cloud storage will pile up and waste space (see the sketch below)
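One way to ease the wasted-space problem is to delete the file from cloud storage once it has been played. A rough sketch of a cleanup cloud function, assuming the front end passes back the fileID it received (the function name and the event.fileID parameter are made up for illustration):

'use strict';

// Hypothetical cleanup function "mstts-clean": deletes the temporary
// mp3 from cloud storage; event.fileID is an assumed parameter
exports.main = async (event, context) => {
    await uniCloud.deleteFile({
        fileList: [event.fileID]
    });
    return { deleted: event.fileID };
};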

Method 2: use cloud function URLization plus an integrated response

This method turns the response body of the cloud function directly into an mp3 file, which can then be played simply by assigning the URL to audio.src.

exports.main = async (event, context) => {
	const res = await mstts.getTTSData('要转换的文本','CN-Yunxi');
	
	return {
		mpserverlessComposedResponse: true,
		isBase64Encoded: true,
		statusCode: 200,
		headers: {
			'Content-Type': 'audio/mp3',
			'Content-Disposition':'attachment;filename=\"temp.mp3\"'
		},
		body: res.toString('base64')
	}
};

Front-end usage:

const aud = uni.createInnerAudioContext();
aud.autoplay = true;
aud.src = 'https://ezshine-274162.service.tcloudbase.com/mstts';
aud.play();
  • Advantage: simple to use, and no files need to be saved to cloud storage
  • Disadvantage: if the URLized cloud function has no security mechanism, anyone who captures the URL can call it at will (a possible mitigation is sketched below)
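A simple mitigation is to require a shared secret on every request. A minimal sketch, assuming the URLized cloud function receives the query string via event.queryStringParameters (the token parameter name and "MY_SECRET" value are placeholders for illustration, not a built-in mechanism of uniCloud or mstts-js):

'use strict';
const mstts = require('mstts-js');

exports.main = async (event, context) => {
    // Reject callers that don't present the expected token
    const query = event.queryStringParameters || {};
    if (query.token !== 'MY_SECRET') {
        return {
            mpserverlessComposedResponse: true,
            statusCode: 403,
            headers: { 'Content-Type': 'text/plain' },
            body: 'Forbidden'
        };
    }

    const res = await mstts.getTTSData('要转换的文本', 'CN-Yunxi');
    return {
        mpserverlessComposedResponse: true,
        isBase64Encoded: true,
        statusCode: 200,
        headers: { 'Content-Type': 'audio/mp3' },
        body: res.toString('base64')
    };
};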

Summary

If such a useful TTS library helps you, don't forget to support it with a star on GitHub.


I am 大帅, an old programmer who loves programming. Personal WeChat: dashuailaoyuan
