Best Practices | Use Tencent Cloud Intelligent Voice to Build Intelligent Conversation Robots

Driven by AI technology, intelligent conversational robots have become important productivity tools, and even companions, in work and daily life. For enterprises in particular, they are one of the most direct ways to put "cost reduction and efficiency improvement" into practice.

As a developer, have you ever thought about building an intelligent conversational robot based on voice technology?

This article will teach you the technical implementation details step by step.

First, let's analyze what an intelligent conversational robot needs:

1. Voice input: a voice conversation needs voice input and output, so the first step is capturing the user's speech as audio.

2. Speech recognition: convert the speech into text.

3. Intelligent question and answer service: send the recognized text to the service and get an answer back.

4. Speech synthesis: generate audio from the answer returned by the intelligent question and answer service.

5. Voice broadcast: play the synthesized answer back to the user.

Voice collection options:

1. Use a client SDK provided by Tencent Cloud Speech Recognition (Android, iOS, WeChat Mini Program).

2. Collect audio yourself with a hardware recording device.

3. Use the recording capability of the terminal device (iOS, Android, etc.) to collect audio.

Technical process:

1. Collect the audio.

2. Send the audio stream to Tencent Cloud Speech Recognition (ASR).

3. Pass the recognized text to the intelligent question and answer service.

4. Send the answer from the intelligent question and answer service to Tencent Cloud Speech Synthesis (TTS).

5. Return the synthesized audio to the client for playback.
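
The process above can be sketched as a chain of three functions. This is a minimal illustration, not the article's implementation: the type names (Recognizer, Answerer, Synthesizer, Pipeline) and the in-memory stubs are hypothetical and stand in for the real ASR, Q&A, and TTS calls.

```go
package main

import (
	"fmt"
	"strings"
)

// Stage signatures for the three service-facing steps. All names here are
// hypothetical; real implementations would call ASR, a Q&A service, and TTS.
type (
	Recognizer  func(audio []byte) (string, error)
	Answerer    func(question string) (string, error)
	Synthesizer func(text string) ([]byte, error)
)

// Pipeline chains the stages: audio in, recognized text, answer, audio out.
type Pipeline struct {
	Recognize  Recognizer
	Answer     Answerer
	Synthesize Synthesizer
}

// Run pushes one utterance through every stage and stops at the first error.
func (p Pipeline) Run(audio []byte) ([]byte, error) {
	text, err := p.Recognize(audio)
	if err != nil {
		return nil, fmt.Errorf("recognize: %w", err)
	}
	answer, err := p.Answer(text)
	if err != nil {
		return nil, fmt.Errorf("answer: %w", err)
	}
	reply, err := p.Synthesize(answer)
	if err != nil {
		return nil, fmt.Errorf("synthesize: %w", err)
	}
	return reply, nil
}

// newStubPipeline wires in-memory stand-ins so the flow can run offline.
func newStubPipeline() Pipeline {
	return Pipeline{
		Recognize:  func(audio []byte) (string, error) { return string(audio), nil },
		Answer:     func(q string) (string, error) { return "echo: " + strings.TrimSpace(q), nil },
		Synthesize: func(text string) ([]byte, error) { return []byte(text), nil },
	}
}

func main() {
	out, err := newStubPipeline().Run([]byte(" hello "))
	if err != nil {
		fmt.Println("pipeline failed:", err)
		return
	}
	fmt.Printf("%s\n", out) // prints: echo: hello
}
```

Keeping each stage behind a function type makes it possible to swap a stub for a real service call one step at a time.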

1. Preparation work

1.1 Activate speech recognition service

This article uses Tencent Cloud Speech Recognition. To activate the service, open the Tencent Cloud Speech Recognition console and click Activate Now.

New users can also claim a trial resource package from the Tencent Cloud Speech Recognition product page.

1.2 Obtain the API key for calling the service

Calling Tencent Cloud services requires an API key (a SecretId and SecretKey pair). You can create one on the API key management page of Tencent Cloud Access Management. Keep the key safe and never leak it; anyone who obtains it can call services under your account. The key is used in the code later.
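
One common way to keep the key out of source code is to read it from environment variables at startup. The sketch below is only an illustration: the variable names TENCENTCLOUD_SECRET_ID and TENCENTCLOUD_SECRET_KEY, the Credentials type, and the helper functions are assumptions of this example, not something the article prescribes.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Credentials holds the key pair created on the API key management page.
type Credentials struct {
	SecretID  string
	SecretKey string
}

// loadCredentials reads the key pair through getenv so it never has to be
// hard-coded. The variable names are an assumption of this sketch.
func loadCredentials(getenv func(string) string) (Credentials, error) {
	c := Credentials{
		SecretID:  getenv("TENCENTCLOUD_SECRET_ID"),
		SecretKey: getenv("TENCENTCLOUD_SECRET_KEY"),
	}
	if c.SecretID == "" || c.SecretKey == "" {
		return Credentials{}, errors.New("TENCENTCLOUD_SECRET_ID and TENCENTCLOUD_SECRET_KEY must both be set")
	}
	return c, nil
}

// mask hides all but the last four characters so a key ID can be logged safely.
func mask(s string) string {
	if len(s) <= 4 {
		return "****"
	}
	return "****" + s[len(s)-4:]
}

func main() {
	creds, err := loadCredentials(os.Getenv)
	if err != nil {
		fmt.Println("credentials not loaded:", err)
		return
	}
	fmt.Println("using SecretId", mask(creds.SecretID))
}
```

Passing the lookup function as a parameter keeps loadCredentials easy to exercise with a fake environment.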

1.3 Obtain the speech recognition and speech synthesis SDKs

Speech recognition SDK: see the real-time speech recognition (WebSocket) API documentation in the Tencent Cloud Document Center.

Speech synthesis SDK: see the basic speech synthesis API documentation in the Tencent Cloud Document Center.

Get the client SDK:

1. iOS: Tencent Cloud (login required)

2. Android: Tencent Cloud (login required)

3. WeChat Mini Program: Tencent Cloud Intelligent Voice | Mini Program Plug-in | WeChat Public Platform

1.4 Access an intelligent question and answer service

WeLM is one such service.

Other intelligent question and answer services, such as ChatGPT, work here as well.

2. Code development

The code does three things:

1. Request real-time speech recognition (ASR)

2. Request the intelligent question and answer service

3. Request speech synthesis (TTS) and obtain the audio

Build and run:

1. Run go mod init demo to create go.mod, then go mod tidy to download the SDK dependencies

2. Run go build to compile

3. Run ./demo -e 16k_zh -f <path to test audio> -format 1

Note: This code covers only the server side. You can integrate a client SDK yourself to stream audio to the server for recognition.

package main

import (
	"encoding/base64"
	"flag"
	"fmt"
	"os"
	"sync"
	"time"

	ttsCommon "github.com/tencentcloud/tencentcloud-sdk-go/tencentcloud/common"
	"github.com/tencentcloud/tencentcloud-sdk-go/tencentcloud/common/errors"
	"github.com/tencentcloud/tencentcloud-sdk-go/tencentcloud/common/profile"
	tts "github.com/tencentcloud/tencentcloud-sdk-go/tencentcloud/tts/v20190823"
	"github.com/tencentcloud/tencentcloud-speech-sdk-go/asr"
	"github.com/tencentcloud/tencentcloud-speech-sdk-go/common"
)

var (
	AppID           = "your AppID"
	SecretID        = "your SecretId"
	SecretKey       = "your SecretKey"
	EngineModelType = "16k_zh"
	SliceSize       = 16000
)

// MySpeechRecognitionListener implementation of SpeechRecognitionListener
type MySpeechRecognitionListener struct {
	ID int
}

// OnRecognitionStart implementation of SpeechRecognitionListener
func (listener *MySpeechRecognitionListener) OnRecognitionStart(response *asr.SpeechRecognitionResponse) {
}

// OnSentenceBegin implementation of SpeechRecognitionListener
func (listener *MySpeechRecognitionListener) OnSentenceBegin(response *asr.SpeechRecognitionResponse) {
}

// OnRecognitionResultChange implementation of SpeechRecognitionListener
func (listener *MySpeechRecognitionListener) OnRecognitionResultChange(response *asr.SpeechRecognitionResponse) {
}

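// OnSentenceEnd implementation of SpeechRecognitionListener.
// It is called when one sentence has been fully recognized; the final text is
// handed to the Q&A and speech synthesis steps.
func (listener *MySpeechRecognitionListener) OnSentenceEnd(response *asr.SpeechRecognitionResponse) {
	fmt.Printf("speech recognition result: %s\n", response.Result.VoiceTextStr)
	ConversationalRobot(response.Result.VoiceTextStr)
}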

// OnRecognitionComplete implementation of SpeechRecognitionListener
func (listener *MySpeechRecognitionListener) OnRecognitionComplete(response *asr.SpeechRecognitionResponse) {
}

// OnFail implementation of SpeechRecognitionListener
func (listener *MySpeechRecognitionListener) OnFail(response *asr.SpeechRecognitionResponse, err error) {
	fmt.Printf("%s|%s|OnFail: %v\n", time.Now().Format("2006-01-02 15:04:05"), response.VoiceID, err)
}

var proxyURL string
var VoiceFormat *int
var e *string

func main() {
	var f = flag.String("f", "test.pcm", "audio file")
	var p = flag.String("p", "", "proxy url")
	VoiceFormat = flag.Int("format", 0, "voice format")
	e = flag.String("e", "", "engine_type")
	// Parse before reading any flag value; until then the pointers only hold defaults.
	flag.Parse()
	fmt.Println("input-", *e, "-input")
	if *e == "" {
		panic("please input engine_type")
	}
	if *VoiceFormat == 0 {
		panic("please input voice format")
	}

	proxyURL = *p
	var wg sync.WaitGroup
	wg.Add(1)
	go processOnce(1, &wg, *f)
	fmt.Println("Main: Waiting for workers to finish")
	wg.Wait()
	fmt.Println("Main: Completed")

}

func processOnce(id int, wg *sync.WaitGroup, file string) {
	defer wg.Done()
	process(id, file)
}

func process(id int, file string) {
	audio, err := os.Open(file)
	if err != nil {
		fmt.Printf("open file error: %v\n", err)
		return
	}
	// Defer Close only after Open has succeeded.
	defer audio.Close()

	listener := &MySpeechRecognitionListener{
		ID: id,
	}
	credential := common.NewCredential(SecretID, SecretKey)
	EngineModelType = *e
	fmt.Println("engine_type:", EngineModelType)
	recognizer := asr.NewSpeechRecognizer(AppID, credential, EngineModelType, listener)
	recognizer.ProxyURL = proxyURL
	recognizer.VoiceFormat = *VoiceFormat
	err = recognizer.Start()
	if err != nil {
		fmt.Printf("%s|recognizer start failed, error: %v\n", time.Now().Format("2006-01-02 15:04:05"), err)
		return
	}

	data := make([]byte, SliceSize)
	// data could instead be filled from a real-time audio stream sent by the client.
	for n, err := audio.Read(data); n > 0; n, err = audio.Read(data) {
		if err != nil {
			if err.Error() == "EOF" {
				break
			}
			fmt.Printf("read file error: %v\n", err)
			break
		}
		// When a sentence has been recognized, the OnSentenceEnd callback above is invoked.
		err = recognizer.Write(data[0:n])
		if err != nil {
			fmt.Printf("write audio error: %v\n", err)
			break
		}
		time.Sleep(20 * time.Millisecond)
	}
	recognizer.Stop()
}

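func ConversationalRobot(text string) {
	// Call the intelligent Q&A service to get an answer.
	answer := SendToGPTService(text)
	// Convert the answer text into audio.
	audioData := TextToVoice(answer)
	// Return the audio to the client for playback.
	ResponseAudioData(audioData)
}

func ResponseAudioData(audioData []byte) {
	// Push audioData to the client for playback.
}

func SendToGPTService(text string) string {
	// Call the intelligent Q&A service API here and return its answer.
	// This placeholder returns a fixed string.
	result := "answer returned by the intelligent Q&A service"
	fmt.Println("calling the intelligent Q&A service API")
	return result
}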

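func TextToVoice(text string) []byte {
	fmt.Println("calling speech synthesis")
	// Create a credential object from the Tencent Cloud account SecretId and SecretKey, and keep the key pair confidential.
	// Leaked code can leak SecretId and SecretKey and threaten every resource under the account. This sample is for reference only; for safer ways to use keys, see: https://cloud.tencent.com/document/product/1278/85305
	// Keys can be obtained from the console: https://console.cloud.tencent.com/cam/capi
	credential := ttsCommon.NewCredential(
		SecretID,
		SecretKey,
	)
	// Create a client profile; this is optional and can be skipped without special requirements.
	cpf := profile.NewClientProfile()
	cpf.HttpProfile.Endpoint = "tts.tencentcloudapi.com"
	// Create the client for the TTS product; the client profile is optional.
	client, err := tts.NewClient(credential, "ap-beijing", cpf)
	if err != nil {
		fmt.Printf("create TTS client failed: %v\n", err)
		return nil
	}

	// Create a request object; every API has a matching request type.
	request := tts.NewTextToVoiceRequest()

	request.Text = ttsCommon.StringPtr(text)
	// The SessionId here is a fixed sample value; ideally use a unique value per request.
	request.SessionId = ttsCommon.StringPtr("f435g34d23a24y546g")

	// response is a TextToVoiceResponse instance matching the request.
	response, err := client.TextToVoice(request)
	if _, ok := err.(*errors.TencentCloudSDKError); ok {
		fmt.Printf("An API error has returned: %s\n", err)
		return nil
	}
	if err != nil {
		fmt.Printf("speech synthesis request failed: %v\n", err)
		return nil
	}
	// The Audio field is base64-encoded; decode it into raw audio bytes.
	audioData, err := base64.StdEncoding.DecodeString(*response.Response.Audio)
	if err != nil {
		fmt.Printf("decode audio failed: %v\n", err)
		return nil
	}
	fmt.Println("speech synthesis finished")
	return audioData
}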

That covers the technical details of building an intelligent voice conversation robot. Interested readers can try it out or extend it further.

Intelligent conversational robots are now being deployed at scale in areas such as customer contact, marketing operations, window services, and human-computer dialogue interaction. As AI technology continues to advance, they will evolve into more advanced products and smarter modes of interaction.

Tencent Cloud Intelligence also provides one-stop voice technology services for enterprise customers and developers. For more product information, visit Tencent Cloud's official website.

Tencent Cloud Intelligent Speech Recognition: real-time speech recognition, recording file recognition, and speech-to-text services

Tencent Cloud Intelligent Speech Synthesis: speech synthesis, voice customization, and text-to-speech services