I am participating in the cross-end technology thematic essay call. For details, please check: juejin.cn/post/710123…
Foreword
I have tried all kinds of TTS solutions, and after some experience I found that Microsoft is the king of this field: the voices produced by its Azure Text-to-Speech service are the most natural. Azure is a paid service, though, and registering and paying is quite a hassle. Yet its official website actually provides a full-featured demo where you can try all of the voices and speaking styles...
The only catch is that the result cannot be downloaded as an mp3 file, so some people resort to recording the computer's audio output, which is far too cumbersome. In fact, everything you can see or hear on a web page has already been decoded by the browser; in other words, as long as the sound plays in the page, there must be a way to extract the audio file.
This article records the whole process of exploring and implementing this. Enjoy~
Most of this article was written at the beginning of the year and held back until now. I know that once this method is made public, Microsoft may quickly block it, or even remove the web demo and its related endpoints entirely.
Parsing the demo on the Azure official website
Open the debugging panel in the Chrome browser. When we click Play on the Azure official website, a wss:// request shows up in the Network tab: a websocket request.
Two parameters
In the request URL we can see two parameters, Authorization and X-ConnectionId.
Interestingly, the first one sits right in the page's source code: you can extract it by simply fetching the Azure text-to-speech page with an axios GET request.
const axios = require("axios");

const res = await axios.get("https://azure.microsoft.com/en-gb/services/cognitive-services/text-to-speech/");
const reg = /token: \"(.*?)\"/;
let token;
if (reg.test(res.data)) {
    token = RegExp.$1;
}
By inspecting the JS call stack that initiated the request, setting a breakpoint, and clicking Play again, we can see that the second parameter, X-ConnectionId, comes from a createNoDashGuid function:
this.privConnectionId = void 0 !== t ? t : s.createNoDashGuid(),
This is just a string in UUID v4 format; nodash means it has no - separators.
Three messages
With both URL parameters sorted out, let's keep analyzing this websocket request. In the Messages tab we can see
that every click on Play sends three messages to the server, and the role of each one is obvious.
The first message: SDK version, system information, UserAgent
Path: speech.config
X-RequestId: 818A1E398D8D4303956D180A3761864B
X-Timestamp: 2022-05-27T16:45:02.799Z
Content-Type: application/json
{"context":{"system":{"name":"SpeechSDK","version":"1.19.0","build":"JavaScript","lang":"JavaScript"},"os":{"platform":"Browser/MacIntel","name":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36","version":"5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36"}}}
The second message: the speech output configuration. The outputFormat field shows that the final audio format is audio-24khz-160kbitrate-mono-mp3. Isn't that exactly the mp3 file we want?!
Path: synthesis.context
X-RequestId: 091963E8C7F342D0A8E79125EA6BB707
X-Timestamp: 2022-05-27T16:48:43.340Z
Content-Type: application/json
{"synthesis":{"audio":{"metadataOptions":{"bookmarkEnabled":false,"sentenceBoundaryEnabled":false,"visemeEnabled":false,"wordBoundaryEnabled":false},"outputFormat":"audio-24khz-160kbitrate-mono-mp3"},"language":{"autoDetection":false}}}
The third message: the text to be converted, plus configuration such as the voice name, the rate, the pitch, and the emotion.
Path: ssml
X-RequestId: 091963E8C7F342D0A8E79125EA6BB707
X-Timestamp: 2022-05-27T16:48:49.594Z
Content-Type: application/ssml+xml
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="zh-CN-XiaoxiaoNeural"><prosody rate="0%" pitch="0%">我叫大帅,一个热爱编程的老程序猿</prosody></voice></speak>
Binary messages received
Since the first three messages already tell us that the returned format is an mp3 file, can we just merge all of the returned binary data into a complete mp3 file? The answer is yes!
After each click on Play, the last of all the messages received from the websocket carries an explicit end marker,
turn.end
which signals that the conversion has finished!
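All three outgoing frames share the same layout: `Key: Value` header lines separated by `\r\n`, then a blank line, then the body. A small helper (hypothetical, not part of the SDK) makes the framing explicit:

```javascript
// Hypothetical helper illustrating the frame layout used by all three
// messages: headers joined by \r\n, a blank line, then the body.
function buildMessage(path, requestId, contentType, body) {
  return (
    `Path: ${path}\r\n` +
    `X-RequestId: ${requestId}\r\n` +
    `X-Timestamp: ${new Date().toISOString()}\r\n` +
    `Content-Type: ${contentType}\r\n` +
    `\r\n` +
    body
  );
}

const frame = buildMessage(
  "speech.config",
  "818A1E398D8D4303956D180A3761864B",
  "application/json",
  JSON.stringify({ context: {} })
);
console.log(frame);
```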
Implementing it in Node.js
Now that the protocol is fully parsed, all that is left is to re-implement the process in Node.js.
Two parameters
- Authorization: fetch the page with an axios GET request, then extract the token with a regular expression
const axios = require("axios");

const res = await axios.get("https://azure.microsoft.com/en-gb/services/cognitive-services/text-to-speech/");
const reg = /token: \"(.*?)\"/;
let Authorization;
if (reg.test(res.data)) {
    Authorization = RegExp.$1;
}
- X-ConnectionId: just use the uuid library
// npm install uuid
const { v4: uuidv4 } = require('uuid');
// createNoDashGuid: a UUID v4 with the dashes removed
const XConnectionId = uuidv4().replace(/-/g, "").toUpperCase();
Creating the WebSocket connection
//npm install nodejs-websocket
const ws = require("nodejs-websocket");
const url = `wss://eastus.tts.speech.microsoft.com/cognitiveservices/websocket/v1?Authorization=${Authorization}&X-ConnectionId=${XConnectionId}`;
const connect = ws.connect(url);
Three messages
The first message
function getXTime() {
    return new Date().toISOString();
}
const message_1 = `Path: speech.config\r\nX-RequestId: ${XConnectionId}\r\nX-Timestamp: ${getXTime()}\r\nContent-Type: application/json\r\n\r\n{"context":{"system":{"name":"SpeechSDK","version":"1.19.0","build":"JavaScript","lang":"JavaScript","os":{"platform":"Browser/Linux x86_64","name":"Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0","version":"5.0 (X11)"}}}}`;
connect.send(message_1);
The second message
const message_2 = `Path: synthesis.context\r\nX-RequestId: ${XConnectionId}\r\nX-Timestamp: ${getXTime()}\r\nContent-Type: application/json\r\n\r\n{"synthesis":{"audio":{"metadataOptions":{"sentenceBoundaryEnabled":false,"wordBoundaryEnabled":false},"outputFormat":"audio-16khz-32kbitrate-mono-mp3"}}}`;
connect.send(message_2);
The third message
const SSML = `
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
<voice name="zh-CN-XiaoxiaoNeural">
<mstts:express-as style="general">
<prosody rate="0%" pitch="0%">
我叫大帅,一个热爱编程的老程序猿
</prosody>
</mstts:express-as>
</voice>
</speak>
`
const message_3 = `Path: ssml\r\nX-RequestId: ${XConnectionId}\r\nX-Timestamp: ${getXTime()}\r\nContent-Type: application/ssml+xml\r\n\r\n${SSML}`
connect.send(message_3);
Receiving binary messages and assembling the mp3
Once the three messages have been sent, we listen for the binary messages from the websocket via connect.on('binary').
We create an empty Buffer, final_data, and append each chunk of received binary content to it. As soon as a plain-text message containing the Path:turn.end marker arrives, we write final_data out to an mp3 file.
const fs = require("fs");

let final_data = Buffer.alloc(0);
connect.on("text", (data) => {
    // turn.end marks the end of the synthesis
    if (data.indexOf("Path:turn.end") >= 0) {
        fs.writeFileSync("test.mp3", final_data);
        connect.close();
    }
})
connect.on("binary", function (response) {
    let data = Buffer.alloc(0);
    response.on("readable", function () {
        const newData = response.read();
        if (newData) data = Buffer.concat([data, newData], data.length + newData.length);
    })
    response.on("end", function () {
        // skip the frame headers; the audio payload starts after "Path:audio\r\n"
        const index = data.toString().indexOf("Path:audio") + 12;
        final_data = Buffer.concat([final_data, data.slice(index)]);
    })
});
In this way, we successfully saved the mp3 audio file without even opening the Azure official website!
Command-line tool
I have packaged the whole thing into a command-line tool, and it is very simple to use:
npm install -g mstts-js
mstts -i 文本转语音 -o ./test.mp3
All open source: github.com/ezshine/mst…
Using it in uni-app
Create a new cloud function
Create a new cloud function and name it mstts.
Since mstts-js already encapsulates everything, all you need to do inside the cloud function is npm install mstts-js and then require it. The code is as follows:
'use strict';
const mstts = require('mstts-js');
exports.main = async (event, context) => {
    const res = await mstts.getTTSData('要转换的文本', 'CN-Yunxi');
    // res is a Buffer
};
Download and play mp3 files
To play this mp3 file in uni-app, there are two approaches.
Method 1: upload it to cloud storage first and access it through the cloud storage address
exports.main = async (event, context) => {
    const res = await mstts.getTTSData('要转换的文本', 'CN-Yunxi');
    // res is a Buffer
    const uploadRes = await uniCloud.uploadFile({
        cloudPath: "xxxxx.mp3",
        fileContent: res
    })
    return uploadRes.fileID;
};
Front-end usage:
uniCloud.callFunction({
    name: "mstts",
    success: (res) => {
        const aud = uni.createInnerAudioContext();
        aud.autoplay = true;
        aud.src = res.result; // the fileID returned by the cloud function
        aud.play();
    }
})
- Pros: the logic stays inside the cloud function, which is secure
- Cons: without a cleanup mechanism, files uploaded to cloud storage pile up and waste space
Method 2: URLize the cloud function and use an integrated response
This approach turns the cloud function's response body directly into an mp3 file, which can be assigned straight to audio.src.
exports.main = async (event, context) => {
const res = await mstts.getTTSData('要转换的文本','CN-Yunxi');
return {
mpserverlessComposedResponse: true,
isBase64Encoded: true,
statusCode: 200,
headers: {
'Content-Type': 'audio/mp3',
'Content-Disposition':'attachment;filename=\"temp.mp3\"'
},
body: res.toString('base64')
}
};
Front-end usage:
const aud = uni.createInnerAudioContext();
aud.autoplay = true;
aud.src = 'https://ezshine-274162.service.tcloudbase.com/mstts';
aud.play();
- Pros: simple to use, no need to save files to cloud storage
- Cons: if the URLized cloud function has no protection mechanism, anyone who captures the URL can call it at will.
Summary
If such a useful tts library helps you, don't forget to give it a star on github.
I am 大帅, an old programmer ape who loves coding. Personal WeChat: dashuailaoyuan