Voice interaction evaluation indicators that AI product managers need to know

This article introduces the following five categories of practical evaluation indicators used in the industry:

1. Speech recognition
2. Natural language processing
3. Speech synthesis
4. Dialogue system
5. Overall user data indicators

1. Speech recognition (ASR)

Automatic Speech Recognition, generally referred to as ASR, is the process of converting sound into text; it is analogous to the human ear.

1. Recognition rate

Look at the recognition rate of the bare engine, the recognition rate under different signal-to-noise ratios (different SNRs simulate different vehicle speeds, window and air-conditioning states, etc.), and the difference between online and offline recognition.

In actual work, the direct indicator of recognition rate is WER (Word Error Rate).

Definition: to make the recognized word sequence identical to the reference (standard) word sequence, certain words must be substituted, deleted, or inserted. The total number of these substituted, deleted, and inserted words, divided by the total number of words in the reference sequence and expressed as a percentage, is the WER.

The formula is:

WER = (S + D + I) / N × 100%

S (Substitution): number of substituted words

D (Deletion): number of deleted words

I (Insertion): number of inserted words

N: total number of words in the reference sequence

Three points to note:

  1. WER can be broken down by male/female voice, speaking speed, accent, numbers/English/Chinese, etc., and examined separately.

  2. Because insertions are counted, WER can in theory exceed 100%. In practice, however, especially with a large sample size, this does not happen; a system that bad could not be commercialized.

  3. From a pure product-experience perspective, many people think the recognition rate should be "the number of sentences recognized correctly / the total number of sentences", i.e. "the recognition (correct) rate is 96%". In actual work, this corresponds to SER (Sentence Error Rate), which is "the number of sentences recognized incorrectly / the total number of sentences". However, in practice the sentence error rate is generally 2 to 3 times the word error rate, so it is not looked at very often.

2. Indicators related to voice wake-up

First, we need some background on Voice Trigger (VT).

A. Requirement background for voice wake-up: during near-field recognition, for example when using a voice input method, the user can press and hold the voice button (such as the Siri button on a phone), speak directly, and release it when finished; in near-field situations the signal-to-noise ratio (SNR) is relatively high, the signal is clean, and a simple algorithm can work effectively and reliably.

However, in far-field recognition, such as smart speaker scenarios, users cannot touch the device with their hands and need to wake it up by voice, which is equivalent to calling the AI (robot) by name to attract its attention, such as Apple's "Hey Siri", Google's "OK Google", Amazon Echo's "Alexa", etc.

B. The meaning of voice wake-up: simply put, it means "calling the name to attract the attention of the listener (the AI)". If the wake-up module judges that the correct wake-up (activation) word was spoken, the subsequent speech is recognized; otherwise, no recognition is performed.

C. Related indicators of voice wake-up

  1. Wake-up rate: the rate at which the AI is successfully woken up when it is called.

  2. False wake-up rate: the rate at which the AI pops up and starts speaking on its own when it was not called. Frequent false wake-ups are a real problem; if a smart speaker suddenly starts singing or telling stories in the middle of the night, it is particularly scary...

  3. Syllable length of the wake-up word. Generally speaking, the technical requirement is at least 3 syllables. For example, "OK Google" and "Alexa" have four syllables, and "Hey Siri" has three. For domestic smart speakers such as Xiaoya, the wake-up word is "Xiaoya Xiaoya" rather than just "Xiaoya"; if the wake-up word is too short, the false wake-up rate generally goes up.

  4. Wake-up response time. I once read an article by Fu Sheng saying that, among all the smart speakers in the world, only Echo and their own Xiaoya smart speaker can reach 1.5 seconds, while all the others are above 3 seconds.

  5. Power consumption (should be low). I have read reports that Siri first appeared on the iPhone 4s, but it was not until the iPhone 6s that you could shout "Hey Siri" directly for voice wake-up without being plugged into power; this is because the 6s has a low-power chip dedicated to voice activation. Of course, the algorithm and hardware must be designed together, and the algorithm itself must also be optimized.

The above 1, 2, and 3 are relatively more important.
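As a rough illustration of how the wake-up rate and false wake-up rate above might be computed from test logs, here is a minimal sketch (the log format is hypothetical; in practice, false wake-ups are often reported per N hours of background audio rather than as a simple ratio):

```python
# Each event: (user_actually_called_ai, device_woke_up)
events = [
    (True, True), (True, True), (True, False),     # 3 genuine wake attempts
    (False, False), (False, True), (False, False),  # 3 segments with no wake word
]

attempts = [e for e in events if e[0]]
non_attempts = [e for e in events if not e[0]]

wake_up_rate = sum(1 for _, woke in attempts if woke) / len(attempts)
false_wake_rate = sum(1 for _, woke in non_attempts if woke) / len(non_attempts)

print(f"wake-up rate: {wake_up_rate:.0%}")           # 67%
print(f"false wake-up rate: {false_wake_rate:.0%}")  # 33%
```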

D. Others

When AEC (Acoustic Echo Cancellation) is involved, the relative improvement in WER must also be examined.

2. Natural language processing (NLP)

Natural Language Processing, generally referred to as NLP, is commonly understood as "enabling computers to understand and generate human language."

1. Precision rate and recall rate

Attached is an explanation shared in the previous article "Introduction to Data Annotation Work that AI Product Managers Need to Know":

Precision: the number of samples identified correctly / the number of samples identified

Recall: the number of samples identified correctly / the number of correct samples among all samples

For example: there are 30 boys and 20 girls in a class, and a machine is asked to identify the boys. This time the machine identified 20 targets in total, of whom 18 were boys and 2 were girls. Then:

Precision = 18 / (18 + 2) = 0.9

Recall = 18 / 30 = 0.6


2. F1 value (harmonic mean of precision and recall)

When optimizing a model, the goal is to raise the F1 value; it is acceptable for precision or recall alone to drop within a small range as long as the overall F1 value improves. The expected gain also depends on which range you are in (an F1 below 60% is a very different situation from one above 60%; above 90%, you may only be chasing a 1% improvement).

P is precision, R is recall, and Fα is a weighted version of F1: Fα = (α² + 1)·P·R / (α²·P + R).
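Here is a minimal sketch that reproduces the boys/girls example above and the Fα formula (the numbers come from the example; the helper name f_alpha is just for illustration):

```python
tp = 18        # boys correctly identified as boys
fp = 2         # girls mistakenly identified as boys
fn = 30 - tp   # boys the machine missed

precision = tp / (tp + fp)   # 18 / 20 = 0.9
recall = tp / (tp + fn)      # 18 / 30 = 0.6
f1 = 2 * precision * recall / (precision + recall)

def f_alpha(p: float, r: float, a: float) -> float:
    """Weighted F-measure: Fα = (α² + 1)·P·R / (α²·P + R); α = 1 gives F1."""
    return (a ** 2 + 1) * p * r / (a ** 2 * p + r)

print(precision, recall, round(f1, 3), round(f_alpha(precision, recall, 1.0), 3))
# 0.9 0.6 0.72 0.72
```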

3. Speech synthesis (TTS)

Text-To-Speech, generally referred to as TTS, converts text into sound (reads it aloud) and is analogous to the human mouth. The voices you hear from voice assistants such as Siri are all generated by TTS; they are not real people speaking.

Subjective test (naturalness), mainly MOS:

MOS (Mean Opinion Score): expert-level evaluation (subjective); scored from 1 to 5, with 5 being the best.


ABX: ordinary-user evaluation (subjective). Users listen to two TTS systems side by side and compare which one sounds better.

Objective tests:

Evaluation of acoustic parameters, generally by computing Euclidean distance and similar measures (RMSE, LSD).


Engineering tests: real-time factor (synthesis time / audio duration); for streaming synthesis, the first packet and last packet are measured separately, while non-streaming synthesis does not examine the first packet; first-packet response time (from the moment the user issues the request to the arrival of the first packet the user perceives), memory usage, CPU usage, crash rate over 3 × 24 hours, etc.
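As an illustration of two of the engineering metrics above, here is a minimal sketch of measuring the real-time factor and first-packet response time around a streaming TTS call (synthesize_streaming is a hypothetical stand-in, not a real API):

```python
import time

def synthesize_streaming(text):
    """Stand-in for a streaming TTS call; yields audio chunk durations in seconds."""
    for chunk_duration in (0.5, 0.5, 0.5, 0.5):   # pretend 2 s of audio in total
        time.sleep(0.1)                            # pretend synthesis work
        yield chunk_duration

request_time = time.time()
first_packet_time = None
audio_seconds = 0.0

for duration in synthesize_streaming("今天天气怎么样"):
    if first_packet_time is None:
        first_packet_time = time.time()            # first packet arrives
    audio_seconds += duration

synthesis_seconds = time.time() - request_time     # total synthesis time
rtf = synthesis_seconds / audio_seconds            # real-time factor: < 1 is faster than real time
first_packet_latency = first_packet_time - request_time

print(f"real-time factor: {rtf:.2f}, first-packet latency: {first_packet_latency * 1000:.0f} ms")
```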

4. Dialogue system

A dialogue system can be simply understood as the chat and dialogue experience behind Siri or various chatbots.

1. User task completion rate (it reflects whether product functions are useful and how well they cover user needs)

For example, in intelligent customer service, if a session ends with a transfer to a human agent, that basically indicates the machine's answers were problematic; the same goes for repeatedly giving the user the same answer, and so on.


There are even more statistics broken down by feature or by intent, which we will not expand on here.

2. Dialogue interaction efficiency, such as the time it takes users to complete a task, how efficiently the reply wording conveys information and guides the next action, the efficiency of the user's voice input, etc. (this may relate to features such as barge-in interruption and one-shot commands); the specific definition is up to each product to decide.

3. Some indicators differ according to the type of dialogue system.

1. Chat type:

CPS (Conversations Per Session, the average number of dialogue turns per session). This is arguably the earliest indicator proposed by Microsoft XiaoIce, and it is the (single) most important indicator inside XiaoIce;


Relevance and novelty. Replies should be somewhat relevant to the original topic, yet not be nearly identical to it;


Topic killers. If, after the machine says a particular sentence, users usually do not continue the conversation, that sentence is given a negative score.

2. Task type:

Retention rate. Although this is a traditional indicator, it can reveal whether users have formed a usage habit. Retention can even be calculated per feature, and features can then be grouped to see which types of tasks users accept most readily; you can also analyze users' command habits from their queries to optimize parsing and the dialogue flow in a targeted way. Later, once enough features have accumulated and an evaluation mechanism is in place, reinforcement learning can be applied; for example, Baidu's earlier college-entrance-exam product, which guided candidates in filling out their application choices, was built this way;


Completion rate (i.e., the "user task completion rate" mentioned earlier). Since a task-oriented dialogue ultimately has to call an API or trigger something to complete the task, you can count how many users entered the dialogue unit and how many of them ended up calling the API;


Relatedly, there is also the average number of slot-filling turns (per task), or slot-filling completeness: on average, how many turns it takes to complete a task, and what percentage of the slots get filled. For an introduction to slots, see "Slot Filling and Multi-turn Dialogue | AI Technical Concepts AI Product Managers Need to Know".

3. Question and answer type:

The proportion of sessions that ultimately turn to a human agent (related to the "user task completion rate" mentioned earlier);


The proportion of users repeatedly asking the same question;


The proportion of "no answer" and similar responses.

Generally speaking, the industry tends to cite CPS the most in PR, since other indicators can seem relatively trivial or not high-level enough. In actual work, however, CPS is mainly suited to chat-type dialogue systems, while other scenarios are judged more by "effect". For example, if a child is crying and the robot can soothe them, there is no need for many rounds of dialogue; the fewer, the better. (A small calculation sketch for CPS and the task completion rate follows below.)
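Here is a minimal sketch of computing CPS and the task completion rate from session records (the session data and field names are made up for illustration):

```python
# Hypothetical session records: number of dialogue turns, and whether the
# task-completing API was eventually called in that session.
sessions = [
    {"turns": 8,  "api_called": True},
    {"turns": 3,  "api_called": False},
    {"turns": 12, "api_called": True},
    {"turns": 5,  "api_called": True},
]

cps = sum(s["turns"] for s in sessions) / len(sessions)
completion_rate = sum(s["api_called"] for s in sessions) / len(sessions)

print(f"CPS: {cps:.1f} turns/session, task completion rate: {completion_rate:.0%}")
# CPS: 7.0 turns/session, task completion rate: 75%
```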

4. Naturalness and human-likeness of the corpus

Currently, this type of problem is generally evaluated manually. The corpus here is usually not individual sentences, but single-round question-answer pairs or multi-round sessions. Generally speaking, the scoring range is 1 to 5 points:

1 or 2 points: the answer is completely off-topic, or contains unfriendly content or special content unsuitable for voice playback;


3 points: basically usable; the question-answer logic is correct;


4 points: solves the user's problem and is sufficiently concise;


5 points: on top of 4 points, conveys emotion and persona.

In addition, to eliminate subjective bias, the common practice at present is to have multiple people annotate each item and remove the extreme values.
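A minimal sketch of this practice follows (the exact trimming rule, dropping one highest and one lowest score, is an assumption; teams define their own):

```python
def trimmed_mean(scores):
    """Drop the single highest and lowest score, then average the rest."""
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)

annotator_scores = [4, 5, 4, 2, 4]     # five annotators score one QA pair; one outlier
print(trimmed_mean(annotator_scores))  # 4.0
```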

5. Overall user data indicators

Conventional Internet products all track overall user metrics, and AI products are generally considered from this perspective as well.

1. DAU (Daily Active User, number of daily active users, referred to as "daily active")

There are variations in special scenarios. For example, in in-vehicle scenarios, the "DAU proportion (share of in-vehicle DAU)" is also counted.

2. Richness of the intents used (the number of intents with a usage rate above X%).

3. You can also try to evaluate satisfaction through emotional cues in the user's voice and semantic sentiment classification.

In particular, conversation samples with detected anger can be selected and analyzed. For example, some companies count how many swear words appear in speech to get a rough picture of user emotion. Another example: in the Tonghuashun mobile client, scrolling to the bottom reveals a one-stop question-and-answer feature; when a user asks "Why can't I log in?" versus "Why can I never log in?", the results returned are different: for the latter, if the system detects negative emotion, it offers to transfer to a human agent.
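As a rough sketch of the keyword-based approach described above (the marker list is made up for illustration; a real system would use a trained sentiment or emotion classifier):

```python
# Hypothetical markers of frustration in user utterances.
NEGATIVE_MARKERS = ["总是", "怎么老是", "又", "垃圾"]

def should_escalate(utterance: str) -> bool:
    """Return True when the utterance looks negative enough to offer a human agent."""
    return any(marker in utterance for marker in NEGATIVE_MARKERS)

print(should_escalate("为什么登录不上去"))      # False -> return a normal answer
print(should_escalate("为什么总是登录不上去"))  # True  -> prompt transfer to a human
```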

Conclusion

This article has introduced the common evaluation indicators for voice interaction systems in the industry. On the one hand, it provides AI product managers with the most down-to-earth, practical information; on the other hand, it is hoped that everyone can build better product experiences based on these indicators.


Origin blog.csdn.net/weixin_43153548/article/details/82899530