AI Popular Science | Is Speech Recognition Accurate? Principles and Practice of ASR Evaluation

In daily work and life, speech recognition technology is increasingly present around us as a basic service, for example in smart speakers, meeting transcription, and subtitle generation.

Speech recognition is a mature AI technology: many vendors on the market offer speech recognition services, and the recognition accuracy they claim is very high.

For those of us on the business side, however, what we really care about is how it performs in our own specific business scenarios.

This article will walk you through speech recognition quality evaluation, from principles to practice.

Speech recognition, also known as speech-to-text transcription, is the technology of converting speech into text. Its English name is Automatic Speech Recognition, usually abbreviated as ASR (the abbreviation used in the rest of this article).

Obviously, the quality of an ASR service can be measured by how accurately it transcribes speech into text.

The industry usually quantifies this accuracy with an indicator called the word correct rate (Word Correct, W.Corr), also known as the recognition accuracy rate.

To understand the word correct rate, we must first understand another indicator: WER.

1. Principles of the indicators

1.1 The WER formula

WER (Word Error Rate) is an important indicator for evaluating ASR quality. It measures the error rate between the predicted (recognized) text and the annotated (reference) text.

Because the smallest unit of an English sentence is the word (Word), while the smallest unit of a Chinese sentence is the Chinese character (Character), Chinese speech recognition tasks use the Character Error Rate (CER) to measure recognition quality.

The two are calculated in exactly the same way, and in the Chinese-language field we usually still use WER to refer to this indicator.

The calculation formula for WER is as follows:

WER = (#Substitutions + #Deletions + #Insertions) / #ReferenceWords

where

  • #Deletions: the number of wrongly deleted characters (deletion errors)
  • #Insertions: the number of wrongly inserted characters (insertion errors)
  • #Substitutions: the number of wrongly substituted characters (substitution errors)
  • #ReferenceWords: the total number of characters in the reference (annotation) text

1.2 The three types of errors

Overall, the denominator of the formula is the total number of characters, and the numerator is the sum of the character counts of the three error types. Let's look at what these three types of errors mean.

For convenience of description, we adopt the following conventions:

REF: the correct text corresponding to the speech, also called the annotation text, i.e. the Reference

HYP: the text recognized from the speech by the ASR service, i.e. the Hypothesis

Deletion error

During transcription, the ASR fails to recognize text that is present in the original speech. Example:

The voice "Have you eaten?" is recognized as "Have you eaten", but the word "has" is not recognized.

Insertion error

During transcription, content that is not part of the original speech, such as background noise, is mistakenly recognized as text by the ASR. For example:

The voice "Have you eaten?" is recognized as "Have you eaten?", in which the word "Yah" is mistakenly recognized.

Substitution error

During transcription, text contained in the original speech is mistakenly recognized by the ASR as different text. For example:

 

The voice "Did you eat?" was recognized as "Did you eat?", in which the word "?" was misrecognized and turned into the word "灞".

Summary

Deletion error: under-recognition; words that are in the speech are missed.

Insertion error: over-recognition; words that are not in the speech appear in the output.

Substitution error: mis-recognition; words in the speech are recognized as other words.

After understanding these three types of errors, the terms in the formula above are easy to interpret.

 

In summary:

WER is the ratio of the total number of error characters of all kinds (deletions, insertions, substitutions) in the ASR recognition result to the total number of characters in the reference text.
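As a quick made-up illustration: if the annotation text is "你吃饭了吗" (5 characters) and the ASR output is "你呀吃饭了灞", there is one insertion ("呀"), one substitution ("吗" recognized as "灞"), and no deletion, so WER = (1 + 1 + 0) / 5 = 40%.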

Now that we understand the WER indicator, let's look at how to calculate it to get these values.

1.3 Edit distance

Given the recognition result text and the annotation text, the total character count #ReferenceWords is easy to obtain, while the counts of the three error types require introducing the concept of "edit distance".

The numerator of the WER formula, #Deletions + #Insertions + #Substitutions, is exactly the edit distance from the recognition result text to the annotation text.

In other words, we only need to compute the edit distance from the recognition result text to the annotation text and divide it by the number of characters in the annotation text to obtain WER.

Let's take a closer look at what edit distance is and how it is calculated.

Edit distance was proposed by the Russian scientist Vladimir Levenshtein in 1965, and is therefore also called the Levenshtein distance.

Edit distance measures the similarity between two strings and is widely used in DNA sequence comparison, spell checking, error rate calculation, and other fields.

It measures similarity by the minimum number of operations required to transform one string into the other. Each operation is called an edit operation, and there are three kinds:

  • Delete, delete a character
  • Insert, insert a character
  • Replace, replace a character

As you can see, the editing operations here correspond to the three types of errors discussed above.

The shorter the edit distance, the more similar the two texts are; the longer the edit distance, the more different the two texts are.

Edit distance can be computed with the following recurrence. Let lev(i, j) denote the edit distance between the first i characters of string a and the first j characters of string b:

lev(i, j) = max(i, j), if min(i, j) = 0
lev(i, j) = min( lev(i-1, j) + 1, lev(i, j-1) + 1, lev(i-1, j-1) + cost ), otherwise,

where cost = 0 if a_i = b_j and cost = 1 otherwise.

Applying this recurrence to the recognition result text and the annotation text yields the minimum number of edit operations needed to convert one into the other, i.e. their edit distance.

Readers familiar with algorithms will recognize that minimizing the total number of operations over all possible sequences of edits is a classic dynamic programming (DP) problem.

However, that is beyond the scope of this article; readers interested in the DP algorithm can refer to the following material to learn more:
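To make the calculation concrete, here is a minimal Python sketch of the character-level computation (this is not the sclite implementation, and the example strings at the bottom are made up for illustration). It builds the Levenshtein DP table, backtracks to split the distance into substitutions, deletions, and insertions, and then computes WER:

```python
def edit_ops(ref: str, hyp: str):
    """Split the edit distance between ref and hyp into
    (substitutions, deletions, insertions) using the Levenshtein DP table."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # delete remaining reference characters
    for j in range(m + 1):
        dp[0][j] = j                      # insert remaining hypothesis characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution

    # Backtrack one optimal path to count each error type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1           # correct character
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1           # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1                        # deletion
        else:
            ins += 1
            j -= 1                        # insertion
    return subs, dels, ins


def wer(ref: str, hyp: str) -> float:
    """WER = (S + D + I) / number of reference characters."""
    s, d, i = edit_ops(ref, hyp)
    return (s + d + i) / len(ref)


if __name__ == "__main__":
    # Made-up example: 1 insertion + 1 substitution over 5 reference characters -> 0.4
    print(wer("你吃饭了吗", "你呀吃饭了灞"))
```

Note that when several optimal alignments exist, the split between the three error types may differ from tool to tool, but their sum always equals the edit distance.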

1.4 WER calculation

To summarize: to calculate WER, compute the edit distance from the recognition result text to the annotation text and substitute it into the following formula:

WER = EditDistance(HYP, REF) / #ReferenceWords

The parameters are as follows:

  • EditDistance(HYP, REF): the edit distance from the recognition result text (HYP) to the annotation text (REF), i.e. #Deletions + #Insertions + #Substitutions
  • #ReferenceWords: the total number of characters in the annotation text

1.5 Word correct rate

Now let's go back to the word correct rate (Word Correct) mentioned at the beginning. What does this indicator mean, and how is it related to WER?

Compared with WER, the word correct rate ignores insertion errors in its calculation: wrongly inserted characters are not counted towards the errors.
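Written with the same notation as the WER formula (following the common convention, e.g. the correct-word percentage reported by sclite), the word correct rate is:

Word Correct = (#ReferenceWords - #Deletions - #Substitutions) / #ReferenceWords

For the made-up example above (1 insertion, 1 substitution, 5 reference characters), Word Correct = (5 - 0 - 1) / 5 = 80%, even though WER = 40%.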

In a real system, the recognition results from the upstream ASR are further processed by downstream analysis modules, which can handle wrongly inserted text. Therefore, we often only need to examine what proportion of the text contained in the speech is recognized correctly, which is exactly the word correct rate.

Therefore, vendors in the industry usually report the word correct rate alongside WER to measure ASR recognition quality.

1.6 Open source tools

So far, we have covered the WER indicator, the word correct rate, and the principles and algorithms behind them.

In the industry, to avoid inconsistent results caused by different implementations, and so that different vendors' numbers can be compared easily, open-source tools are usually used for the calculation.

Here we use sclite, an open-source scoring tool from the US National Institute of Standards and Technology (NIST), as the calculation tool.

Given the recognition result text and the annotation text as input, the tool calculates the corresponding WER, the counts of the three error types, and the related details.

Tool usage

Given a recognition result file and an annotation text file in the required format (trn), sclite calculates and generates a detailed evaluation report (dtl) containing WER, the word correct rate, and the three types of error information.

a. Example command

# Command format: sclite -r reffile [ fmt ] -h hypfile [ fmt [ title ] ] OPTIONS
./bin/sclite -r /corpus/audio_file/16k_60s_all_100.trn trn -h /data/output/16k_zh-PY-16k_60s_all_100.trn trn -i spu_id -o dtlb

Annotation file:/corpus/audio_file/16k_60s_all_100.trn

Recognition result:/data/output/16k_zh-PY-16k_60s_all_100.trn
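For reference, each line of a trn file is simply the transcript text followed by an utterance ID in parentheses. A minimal illustration with made-up IDs and text:

```
今天天气怎么样 (spk001-utt001)
你吃饭了吗 (spk001-utt002)
```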

b. Evaluation report example (dtl)

 

 

Attachment: interested readers can obtain the NIST tools via the official website link below.

 

2. Evaluation Practice

There is a popular saying on the Internet: "I know so many truths, yet I still can't live my life well." In other words, easier said than done.

Similarly, even after understanding the ASR performance indicators, the principles, and the open-source tools, we may still not know where to start.

To lower the testing barrier and make it easy for customers to quickly evaluate how Tencent Cloud's ASR service performs on their own business scenarios, the Tencent Cloud AI application team built the AI Studio one-click evaluation tool, which lets users complete an evaluation with no prior experience.

The tool is currently in closed beta. Let's see how to use it.

2.1 Interface preview

AI Studio official website link: AI Studio - Developer Tool Platform

Open the official website and see the following page.

Click [Login] in the upper right corner; this jumps to the Tencent Cloud official login page. Log in with your Tencent Cloud account.

The first column is the evaluation service option; here we select [Speech Recognition]. The drop-down box on the far right contains two speech recognition interfaces: recording file recognition and real-time speech recognition.

The models have been specifically optimized for these two business scenarios, so simply choose the interface you actually use.

The second column explains how to create a test set and what to pay attention to when labeling files.

The third column contains the fields to select when submitting a test task; just keep them consistent with the metadata of your test audio.

2.2 Operation Guide

Below, we use an example to walk through the evaluation process.

a. Prepare evaluation corpus

Click the template link on the page to view a sample of the test set format:

The test corpus contains two parts:

  • Audio files: audio data collected from the business scenario, with a sampling rate of 8k or 16k
  • Annotation file: the human speech contained in the audio, manually transcribed into a text file

Note that numbers in the annotation file must be written out as Chinese numerals rather than Arabic digits. For example, in the text "Xiao Ming scored 98 points in the exam", the "98" must be annotated in its Chinese-character form ("九十八") rather than as digits.

For other notes, please refer to the page:

 

b. Submit evaluation task

Create a new evaluation task below

Step 1: Select the corresponding parameters

According to the audio information, select the corresponding recognition language and audio sampling rate

Different engine types have been optimized for specific scenarios and perform better on matching scenarios; just select the most suitable engine type here, as shown below.

 

 

Step 2: Upload the annotated test set

Compress and package the prepared test set and upload it through the page

 

Step 3: Check the content of the annotated test set

Here the system parses the uploaded test set, matches each audio file with its annotation text, and displays them on the page for the user to check and confirm (since the evaluation results depend directly on the accuracy of the annotation text, it is important to make sure the annotation file is correct).

Click Confirm to submit to complete the creation of the evaluation task.

c. Get evaluation results

During task execution, you can check the task status through the evaluation task management list at the bottom of the evaluation page.

After the task status displays [Success], click [View Results] on the right to view the evaluation results:

Here you can see the evaluation metrics: the word correct rate (i.e. the word correct rate discussed above), WER, and the insertion/deletion/substitution error rates.

You can also use the download links below to obtain the evaluation report and the recognition result files for further analysis.


Origin blog.csdn.net/tencentAI/article/details/128547180