[blog] Speech Recognition Is Not Solved

链接:https://awni.github.io/speech-recognition/

Ever since Deep Learning hit the scene in speech recognition, word error rates have fallen dramatically. But despite articles you may have read, we still don’t have human-level speech recognition. Speech recognizers have many failure modes. Acknowledging these and taking steps towards solving them is critical to progress. It’s the only way to go from ASR which works for some people, most of the time, to ASR which works for all people, all of the time.

Improvements in word error rate over time on the Switchboard conversational speech recognition benchmark. The test set was collected in 2000. It consists of 40 phone conversations between two random native English speakers.

Saying we’ve achieved human-level performance in conversational speech recognition based just on Switchboard results is like saying an autonomous car drives as well as a human after testing it in one town on a sunny day without traffic. The recent improvements on conversational speech are astounding. But, the claims about human-level performance are too broad. Below are a few of the areas that still need improvement.

Accents and Noise

One of the most visible deficiencies in speech recognition is dealing with accents [1] and background noise. The straightforward reason is that most of the training data consists of American accented English with high signal-to-noise ratios. For example, the Switchboard conversational training and test sets only have native English speakers (mostly American) with little background noise.

But, more training data likely won’t solve this problem on its own. There are a lot of languages, many of which have a lot of dialects and accents. It’s not feasible to collect enough annotated data for all cases. Building a high quality speech recognizer just for American accented English needs upwards of 5 thousand hours of transcribed audio.

Comparison of human transcribers to Baidu’s Deep Speech 2 model on various types of speech. [2] Notice the humans are worse at transcribing the non-American accents. This is probably due to an American bias in the transcriber pool. I would expect transcribers native to a given region to have much lower error rates for that region’s accents.

With background noise, it’s not uncommon for the SNR in a moving car to be as low as -5dB. People don’t have much trouble understanding one another in these environments. Speech recognizers, on the other hand, degrade more rapidly with noise. In the figure above we see that the gap between the human and the model error rates is much larger on the low SNR audio than on the high SNR audio.
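
To make the dB figure concrete, here is a minimal sketch (assuming NumPy and two equal-length float arrays, speech and noise, that you supply) of how one might mix noise into clean speech at a target SNR such as -5dB:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture has the requested SNR, then add it to the speech.

    SNR (dB) = 10 * log10(P_speech / P_noise), where P is mean squared amplitude.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Power the noise must have for the ratio to hit the target SNR.
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# At -5dB the noise carries more power than the speech itself:
# noisy = mix_at_snr(speech, noise, snr_db=-5.0)
```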

Semantic Errors

Often the word error rate is not the actual objective in a speech recognition system. What we care about is the semantic error rate. That’s the fraction of utterances in which we misinterpret the meaning.

An example of a semantic error is if someone said “let’s meet up Tuesday” but the speech recognizer predicted “let’s meet up today”. We can also have word errors without semantic errors. If the speech recognizer dropped the “up” and predicted “let’s meet Tuesday” the semantics of the utterance are unchanged.
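
To make the distinction concrete, here is a minimal sketch of the standard edit-distance WER computation applied to the two predictions above; both score the same word error rate, even though only one of them changes the meaning:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

reference = "let's meet up tuesday"
print(wer(reference, "let's meet up today"))  # 0.25, and a semantic error
print(wer(reference, "let's meet tuesday"))   # 0.25, but the meaning is intact
```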

We have to be careful when using the word error rate as a proxy. Let me give a worst-case example to show why. A WER of 5% roughly corresponds to 1 missed word for every 20. If each sentence has 20 words (about average for English), the sentence error rate could be as high as 100%. Hopefully the mistaken words don’t change the semantic meaning of the sentences. Otherwise the recognizer could misinterpret every sentence even with a 5% WER.
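
A quick worked version of that worst case, using the numbers from the paragraph above (the 2,000-word test set size is chosen purely for illustration):

```python
import math

words_per_sentence, num_sentences = 20, 100
total_words = words_per_sentence * num_sentences    # 2000 words
errors = int(0.05 * total_words)                    # 5% WER -> 100 misrecognized words

# Worst case: the errors land one per sentence, so every sentence contains a mistake.
worst_case_ser = min(errors, num_sentences) / num_sentences               # 1.00
# Best case: the errors cluster into as few sentences as possible.
best_case_ser = math.ceil(errors / words_per_sentence) / num_sentences    # 0.05

print(worst_case_ser, best_case_ser)  # 1.0 0.05
```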

When comparing models to humans, it’s important to check the nature of the mistakes and not just look at the WER as a conclusive number. In my own experience, human transcribers tend to make fewer and less drastic semantic errors than speech recognizers.

Researchers at Microsoft recently compared mistakes made by humans and their human-level speech recognizer. [3] One discrepancy they found was that the model confuses “uh” with “uh huh” much more frequently than humans. The two terms have very different semantics: “uh” is just filler whereas “uh huh” is a backchannel acknowledgement. The model and humans also made a lot of the same types of mistakes.

Single-channel, Multi-speaker

The Switchboard conversational task is also easier because each speaker is recorded with a separate microphone. There’s no overlap of multiple speakers in the same audio stream. Humans on the other hand can understand multiple speakers sometimes talking at the same time.

A good conversational speech recognizer must be able to segment the audio based on who is speaking (diarisation). It should also be able to make sense of audio with overlapping speakers (source separation). This should be doable without needing a microphone close to the mouth of each speaker, so that conversational speech can work well in arbitrary locations.

Domain Variation

Accents and background noise are just two factors a speech recognizer needs to be robust to. Here are a few more:

  • Reverberation from varying the acoustic environment.
  • Artefacts from the hardware.
  • The codec used for the audio and compression artefacts.
  • The sample rate.
  • The age of the speaker.

Most people wouldn’t even notice the difference between an mp3 and a plain wav file. Before we claim human-level performance, speech recognizers need to be robust to these sources of variability as well.
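
One common way to expose a model to a couple of these sources of variability (sample rate and codec bandwidth in particular) is to augment the training audio. Here is a minimal sketch, assuming SciPy and 16 kHz input, and not tied to any particular recognizer's pipeline:

```python
import numpy as np
from scipy.signal import resample_poly

def simulate_narrowband(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Crudely imitate 8 kHz telephone-bandwidth audio by downsampling and upsampling.

    Everything above 4 kHz is discarded, similar to what a narrowband codec does.
    """
    narrow = resample_poly(audio, up=8000, down=sample_rate)
    return resample_poly(narrow, up=sample_rate, down=8000)

# augmented = simulate_narrowband(clean_audio)  # one extra training variant per utterance
```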

Context

You’ll notice the human-level error rate on benchmarks like Switchboard is actually quite high. If you were conversing with a friend and they misinterpreted 1 of every 20 words, you’d have a tough time communicating.

One reason for this is that the evaluation is done context-free. In real life we use many other cues to help us understand what someone is saying. Some examples of context that people use, but speech recognizers don’t, include:

  • The history of the conversation and the topic being discussed.
  • Visual cues of the person speaking including facial expressions and lip movement.
  • Prior knowledge about the person we are speaking with.

Currently, Android’s speech recognizer has knowledge of your contact list so it can recognize your friends’ names. [4] The voice search in maps products uses geolocation to narrow down the possible points-of-interest you might be asking to navigate to. [5]
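
The cited systems do this inside production decoders, but the basic idea of contextual biasing can be sketched as a toy rescoring pass over beam-search hypotheses. The hypothesis list, names, and bonus value below are invented purely for illustration:

```python
def rescore_with_contacts(hypotheses, contacts, bonus=2.0):
    """Boost beam-search hypotheses that mention a known contact name.

    hypotheses: list of (text, log_prob) pairs; the highest score wins.
    """
    contact_words = {name.lower() for name in contacts}
    rescored = []
    for text, log_prob in hypotheses:
        hits = sum(1 for word in text.lower().split() if word in contact_words)
        rescored.append((text, log_prob + bonus * hits))
    return max(rescored, key=lambda pair: pair[1])

beams = [("call ann easter", -4.1), ("call anne esther", -4.3)]
print(rescore_with_contacts(beams, contacts=["Anne", "Esther"]))  # picks "call anne esther"
```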

The accuracy of ASR systems definitely improves when incorporating this type of signal. But, we’ve just begun to scratch the surface on the type of context we can include and how it’s used.

Deployment

The recent improvements in conversational speech are not deployable. When thinking about what makes a new speech algorithm deployable, it’s helpful to think in terms of latency and compute. The two are related, as algorithms which increase compute tend to increase latency. But for simplicity I’ll discuss each separately.

Latency: By this I mean the time from when the user is done speaking to when the transcription is complete. Low latency is a common product constraint in ASR. It can significantly impact the user experience. Latency requirements in the tens of milliseconds aren’t uncommon for ASR systems. While this may sound extreme, remember that producing the transcript is usually the first step in a series of expensive computations. For example in voice search the actual web-scale search has to be done after the speech recognition.

Bidirectional recurrent layers are a good example of a latency-killing improvement. All the recent state-of-the-art results in conversational speech use them. The problem is we can’t compute anything after the first bidirectional layer until the user is done speaking. So the latency scales with the length of the utterance.

Left: With a forward-only recurrence we can start computing the transcription immediately. Right: With a bidirectional recurrence we have to wait until all the speech arrives before beginning to compute the transcription.
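
To see why this matters for latency, here is a minimal PyTorch sketch (layer sizes are arbitrary, and this is not the architecture from any of the papers above): the forward-only LSTM can consume frames as they arrive, while the bidirectional one needs the complete utterance before it can produce anything.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim = 80, 256
uni = nn.LSTM(feature_dim, hidden_dim, batch_first=True)                      # streamable
bi = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)   # not streamable

utterance = torch.randn(1, 200, feature_dim)  # (batch, frames, features), e.g. filterbanks

# Forward-only recurrence: process frames as they arrive by carrying the state,
# so transcription can begin while the user is still speaking.
state = None
outputs = []
for t in range(utterance.size(1)):
    out, state = uni(utterance[:, t : t + 1, :], state)
    outputs.append(out)

# Bidirectional recurrence: the backward direction starts from the last frame,
# so no output can be computed until the whole utterance has been received.
full_output, _ = bi(utterance)
```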

A good way to efficiently incorporate future information in speech recognition is still an open problem.

Compute: The amount of computational power needed to transcribe an utterance is an economic constraint. We have to consider the bang-for-buck of every accuracy improvement to a speech recognizer. If an improvement doesn’t meet an economical threshold, then it can’t be deployed.

A classic example of a consistent improvement that never gets deployed is an ensemble. The 1% or 2% error reduction is rarely worth the 2-8x increase in compute. Modern RNN language models are also usually in this category since they are very expensive to use in a beam search; though I expect this will change in the future.
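
A minimal sketch of why the cost grows with ensemble size, assuming each model is simply a callable that returns per-frame output probabilities (no particular toolkit is implied):

```python
import numpy as np

def ensemble_predict(models, features):
    """Average output probabilities across models.

    Each model requires its own full forward pass, so inference cost grows
    linearly with the number of models for a modest accuracy gain.
    """
    probs = [model(features) for model in models]
    return np.mean(probs, axis=0)
```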

As a caveat, I’m not suggesting research which improves accuracy at great computational cost isn’t useful. We’ve seen the pattern of “first slow but accurate, then fast” work well before. The point is just that until an improvement is sufficiently fast, it’s not usable.

The Next Five Years

There are still many open and challenging problems in speech recognition. These include:

  • Broadening the capabilities to new domains, accents and far-field, low SNR speech.
  • Incorporating more context into the recognition process.
  • Diarisation and source-separation.
  • Semantic error rates and innovative methods for evaluating recognizers.
  • Super low-latency and efficient inference.

I look forward to the next five years of progress on these and other fronts.

Acknowledgements

Thanks to @mrhannun for useful feedback and edits.

Edit

Hacker News discussion.

Footnotes

  1. Just ask anyone with a Scottish accent.

  2. These results are from Amodei et al., 2016. The accented speech comes from VoxForge. The noise-free and noisy speech comes from the third CHiME challenge.

  3. Stolcke and Droppo, 2017 

  4. See Aleksic et al., 2015 for an example of how to improve contact name recognition. 

  5. See Chelba et al., 2015 for an example of how to incorporate speaker location. 

Reposted from blog.csdn.net/zongza/article/details/88120915