BLEU (bilingual evaluation understudy): an evaluation metric for image captioning

Paper: BLEU: a Method for Automatic Evaluation of Machine Translation

Code (unofficial): https://github.com/tylin/coco-caption

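This metric was published by IBM at ACL 2002. As the name suggests, the paper proposes a bilingual evaluation understudy: "bilingual evaluation" indicates that the metric was originally designed to judge the quality of machine translation, and "understudy" indicates that it is meant to stand in for human judges so that translation output can be evaluated quickly.

Here, the quality of a translation means how closely it resembles translations produced by humans.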

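Let us look at how the metric achieves this. Two terms are needed first: the generated translation is called the candidate translation below, and a human-annotated (human-written) translation is called a reference translation.

To make this precise, define some notation. For an input i, the candidate translation is $c_i$ and the reference translations form a set $R_i=\{r_{i1},r_{i2},...,r_{ij},...,r_{im}\}$. An n-gram is an ordered sequence of n words, and $count_k$ is the number of times a k-gram occurs in a candidate translation or a reference translation.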

1. Modified n-gram precision

1.1 Introducing modified unigram precision

The cornerstone of the metric proposed in the paper is the precision measure.

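Precision is computed as follows:

1. A word in the candidate translation counts as correctly predicted if any reference translation contains it (there may be several reference translations, i.e. several human translations).
2. The number of correctly predicted words is the numerator and the total number of words in the candidate translation is the denominator; the resulting ratio is the precision of the candidate translation.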

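Building on precision, the paper introduces a refined version called modified unigram precision.

Modified unigram precision is computed as follows (the notation differs slightly from the paper but is easier to follow):

1. Count how many times each distinct word occurs in the candidate translation, written $Count_i$, where i is the id of a distinct word in the candidate translation.
2. For each reference translation separately, count how many times each word occurs, written $Ref\_Count_{ij}$ (for the word with id i, j denotes the j-th reference translation). Then take the maximum over the reference translations for each word: $Ref\_Count_{i}=max_j(Ref\_Count_{ij})$.
3. Clip $Count_i$ with the $Ref\_Count_{i}$ from step 2: $Count_{clip_i}=min(Count_i, Ref\_Count_{i})$.
4. $modified\ unigram\ precision=\frac{\sum_{i\in C}Count_{clip_i}}{len(C)}$, where C is the candidate translation, i ranges over the distinct words in the candidate translation, and len gives the total number of words.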

This description is a little convoluted, so here is an example (word counts in the examples below ignore case):

Example 1:

Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.

With plain precision, the candidate translation contains only one distinct word, "the". Since "the" appears in at least one of Reference 1 and Reference 2, every occurrence counts as correctly predicted: 7 correct words out of 7 words in the candidate translation, so the precision is 7/7.

With modified unigram precision:

1. $Count_i=7$. This example is special in that the candidate translation contains only one distinct word, so there is only one id.
2. "the" occurs twice in Reference 1, so $Ref\_Count_{i1}=2$, and once in Reference 2, so $Ref\_Count_{i2}=1$. Therefore $Ref\_Count_{i}=max(Ref\_Count_{i1}, Ref\_Count_{i2})=2$.
3. $Count_{clip_i}=min(Count_i, Ref\_Count_{i})=min(7, 2)=2$
4. $modified\ unigram\ precision=\frac{\sum_{i\in C}Count_{clip_i}}{len(C)}=2/7$
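
The contrast between the two measures on Example 1 can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names (`tokenize`, `unigram_precision`, `modified_unigram_precision`) are made up for illustration, and the tokenization (lowercasing, dropping periods, splitting on whitespace) is an assumption rather than anything the paper prescribes.

```python
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase, drop periods, split on whitespace.
    return sentence.lower().replace(".", "").split()


def unigram_precision(candidate, references):
    # A candidate word is "correct" if any reference contains it.
    cand = tokenize(candidate)
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(tokenize(ref))
    correct = sum(1 for word in cand if word in ref_vocab)
    return correct, len(cand)


def modified_unigram_precision(candidate, references):
    cand = tokenize(candidate)
    count = Counter(cand)  # Count_i
    ref_count = Counter()  # Ref_Count_i: max count over the references
    for ref in references:
        for word, c in Counter(tokenize(ref)).items():
            ref_count[word] = max(ref_count[word], c)
    # Count_clip_i = min(Count_i, Ref_Count_i), summed over distinct words
    clipped = sum(min(c, ref_count[word]) for word, c in count.items())
    return clipped, len(cand)


candidate = "the the the the the the the."
references = ["The cat is on the mat.", "There is a cat on the mat."]
print(unigram_precision(candidate, references))           # (7, 7)
print(modified_unigram_precision(candidate, references))  # (2, 7)
```

Both functions return the numerator and denominator separately so the fractions can be compared directly with the 7/7 and 2/7 worked out above.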

Here is another example:

Example 2:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.

Only the modified unigram precision is computed below.

For Candidate 1:

$Count_i$ is the count of each word in the candidate. As a dictionary: {it: 1, is: 1, a: 1, guide: 1, to: 1, action: 1, which: 1, ensures: 1, that: 1, the: 3, military: 1, always: 1, obeys: 1, commands: 1, of: 1, party: 1}, so the total is $\sum_{i\in C}Count_i=len(C)=18$.
Compute $Ref\_Count_{i}$:
1. In Reference 1, $Ref\_Count_{i1}$ as a dictionary: {it: 1, is: 1, a: 1, guide: 1, to: 1, action: 1, which: 0, ensures: 1, that: 2, the: 1, military: 1, always: 0, obeys: 0, commands: 1, of: 0, party: 1}.
2. In Reference 2, $Ref\_Count_{i2}$ as a dictionary: {it: 1, is: 1, a: 0, guide: 0, to: 0, action: 0, which: 1, ensures: 0, that: 0, the: 4, military: 1, always: 1, obeys: 0, commands: 0, of: 1, party: 1}.
3. In Reference 3, $Ref\_Count_{i3}$ as a dictionary: {it: 1, is: 1, a: 0, guide: 1, to: 1, action: 0, which: 0, ensures: 0, that: 0, the: 4, military: 0, always: 1, obeys: 0, commands: 0, of: 1, party: 1}.
4. Finally, taking the maximum per word, $Ref\_Count_{i}$ as a dictionary: {it: 1, is: 1, a: 1, guide: 1, to: 1, action: 1, which: 1, ensures: 1, that: 2, the: 4, military: 1, always: 1, obeys: 0, commands: 1, of: 1, party: 1}.
Compute $Count_{clip_i}=min(Count_i, Ref\_Count_{i})$. As a dictionary: {it: 1, is: 1, a: 1, guide: 1, to: 1, action: 1, which: 1, ensures: 1, that: 1, the: 3, military: 1, always: 1, obeys: 0, commands: 1, of: 1, party: 1}, so $\sum_{i\in C}Count_{clip_i}=17$.
$modified\ unigram\ precision=\frac{\sum_{i\in C}Count_{clip_i}}{len(C)}=17/18$

Likewise, for Candidate 2 we get $modified\ unigram\ precision=\frac{\sum_{i\in C}Count_{clip_i}}{len(C)}=8/14$
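
The per-word dictionaries of Example 2 are tedious to build by hand, so it can help to check them mechanically. The sketch below leans on Python's `collections.Counter`, whose `|` operator keeps the maximum count per key and whose `&` operator keeps the minimum, which is exactly the max-then-clip procedure. The helpers `tokenize` and `clipped_counts` are illustrative names, and the tokenization is an assumption.

```python
from collections import Counter
from functools import reduce
from operator import or_


def tokenize(sentence):
    # Assumption: lowercase, drop periods, split on whitespace.
    return sentence.lower().replace(".", "").split()


def clipped_counts(candidate, references):
    count = Counter(tokenize(candidate))  # Count_i
    # Counter | Counter keeps the maximum count per word: Ref_Count_i
    ref_count = reduce(or_, (Counter(tokenize(r)) for r in references), Counter())
    # Counter & Counter keeps the minimum count per word: Count_clip_i
    return count & ref_count, sum(count.values())


references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
candidate_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")
candidate_2 = ("It is to insure the troops forever hearing the activity "
               "guidebook that party direct.")

clip_1, len_1 = clipped_counts(candidate_1, references)
clip_2, len_2 = clipped_counts(candidate_2, references)
print(sum(clip_1.values()), len_1)  # 17 18
print(sum(clip_2.values()), len_2)  # 8 14
```

Note that `Counter` drops entries whose count is zero, so "obeys" simply does not appear in `clip_1` instead of being listed with a count of 0.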

1.2 Introducing modified n-gram precision

The above covers the unigram case. The more general form uses n-grams and is given by

$p_n=\frac{\sum_{C\in\{Candidates\}}\sum_{n-gram\in C}Count_{clip}(n-gram)}{\sum_{C'\in\{Candidates\}}\sum_{n-gram'\in C'}Count(n-gram')}$

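This formula simply extends the unigram case. A test of a translation system usually involves many input sentences and therefore many outputs, and the system is evaluated over the whole dataset, hence $\sum_{C\in\{Candidates\}}$. The inner sum $\sum_{n-gram\in C}$ extends $\sum_{i\in C}$ above: it runs over the n-grams in candidate C. That is the numerator. The denominator likewise extends $len(C)$: whereas the numerator needs $Count_{clip_i}=min(Count_i, Ref\_Count_{i})$ for every n-gram (i now denotes an n-gram), the denominator only needs the total number of n-grams in the candidates.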
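
The corpus-level formula for $p_n$ can be sketched as follows. The function names and the tiny two-sentence corpus are hypothetical and exist only to show how the clipped counts and the totals are accumulated over all candidates before dividing; inputs are assumed to be already tokenized.

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidates, references, n):
    """candidates: list of token lists.
    references: for each candidate, a list of reference token lists.
    Returns (numerator, denominator) of p_n over the whole corpus."""
    clipped_total, total = 0, 0
    for cand, refs in zip(candidates, references):
        count = Counter(ngrams(cand, n))
        max_ref = Counter()
        for ref in refs:
            max_ref |= Counter(ngrams(ref, n))  # max count over references
        clipped_total += sum((count & max_ref).values())  # clipped counts
        total += sum(count.values())
    return clipped_total, total


# Hypothetical toy corpus with two candidate sentences.
candidates = [["the", "cat", "sat"], ["a", "dog", "barked", "loudly"]]
references = [
    [["the", "cat", "sat", "down"]],
    [["the", "dog", "barked"], ["a", "dog", "ran"]],
]
print(modified_precision(candidates, references, 1))  # (6, 7)
print(modified_precision(candidates, references, 2))  # (4, 5)
```

The sums are taken over the whole corpus first and divided once at the end, which is not the same as averaging per-sentence precisions.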

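1.3 Combining modified n-gram precisions

The paper takes five translation systems, two human translators and three machine translation systems, and scores their output with modified n-gram precision, as shown in the figure below.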

[Figure: modified n-gram precision of the five translation systems for n = 1, 2, 3, 4]

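The figure above shows that modified n-gram precision decays roughly exponentially as n increases. So that the scores for different n (n = 1, 2, 3, 4) can be combined into a single average, the paper takes the logarithm of each score before averaging, which with uniform weights amounts to a geometric mean of the precisions.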
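
Averaging in log space can be illustrated with a short sketch. The precision values below are invented for illustration and are not taken from the paper's figure; returning 0 when any precision is 0 is an assumption made here to sidestep log(0).

```python
import math


def combine(precisions, weights=None):
    # Uniform weights by default.
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    if min(precisions) == 0:
        return 0.0  # assumption: log(0) is undefined, so score 0
    # Weighted average of the logs, mapped back with exp.
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))


# Hypothetical p_1..p_4 that shrink quickly as n grows.
precisions = [0.8, 0.4, 0.2, 0.1]
geometric = combine(precisions)
arithmetic = sum(precisions) / len(precisions)
print(round(geometric, 4), round(arithmetic, 4))
```

With values that decay this fast, the arithmetic mean is dominated by the unigram precision, whereas the log average gives the higher-order n-grams comparable influence.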

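2. Effect of output length on the metric

A generated translation should be neither too long nor too short. Modified n-gram precision already accounts for this to some extent: it penalizes spurious words (words that do not appear in any reference translation), and it penalizes a word that occurs more often than its maximum count in any reference translation. It does not, however, fully enforce an appropriate translation length, as the following example shows: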

Example 3:

Candidate: of the
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.

Because the candidate translation is so short, its modified unigram precision is 2/2 and its modified bigram (2-gram) precision is 1/1. Judged on precision alone, this translation would score very well, so recall also has to be taken into account.

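2.1 Why recall is hard to compute

Example 3 shows that recall matters. Unlike in other tasks, however, a translation has multiple reference translations, which makes recall awkward to compute, as the following example illustrates.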

Example 4:

Candidate 1: I always invariably perpetually do.
Candidate 2: I always do.
Reference 1: I always do.
Reference 2: I invariably do.
Reference 3: I perpetually do.

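In terms of translation quality, Candidate 1 is clearly worse than Candidate 2, yet because there are multiple reference translations, Candidate 1 recalls more reference words than Candidate 2. Recall is therefore not a suitable measure in this setting.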
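
To see the problem in numbers, here is a sketch of one naive way to compute recall with several references: pool the vocabulary of all references and count how many of those words the candidate covers. This pooled definition is an assumption chosen for illustration; the paper does not define recall this way, and that is the point.

```python
def tokenize(sentence):
    # Assumption: lowercase, drop periods, split on whitespace.
    return sentence.lower().replace(".", "").split()


def pooled_recall(candidate, references):
    # Naive recall: distinct reference words (pooled over all references)
    # that also occur in the candidate.
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(tokenize(ref))
    recalled = ref_vocab & set(tokenize(candidate))
    return len(recalled), len(ref_vocab)


references = ["I always do.", "I invariably do.", "I perpetually do."]
print(pooled_recall("I always invariably perpetually do.", references))  # (5, 5)
print(pooled_recall("I always do.", references))                         # (3, 5)
```

The unnatural Candidate 1 gets perfect pooled recall, while Candidate 2, which is identical to Reference 1, only reaches 3/5.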

2.2 Sentence brevity penalty

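In a case like Example 4, an overly long sentence is already penalized when the modified n-gram precisions are computed, so long sentences need no separate treatment. The paper therefore introduces a multiplicative factor called the brevity penalty, which penalizes overly short sentences such as the one in Example 3.

When the candidate's length equals the length of any reference, precision is not penalized, i.e. the brevity penalty is 1.0. For example, if the candidate has length 12 and the references have lengths 12, 15 and 17, the brevity penalty is 1.0. The reference length closest to the candidate length is called the **"best match length"**.

Since there are multiple references, the brevity penalty is computed over the whole test set: r is the sum of the effective reference lengths (the effective length being the "best match length" for each candidate), and c is the sum of the lengths of all candidate translations in the dataset. The brevity penalty is a decaying exponential in r/c.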
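
A minimal sketch of the brevity penalty, working directly on sentence lengths. The function names are illustrative. Two details are assumptions not settled by the text above: ties between equally close reference lengths are broken in favor of the shorter one, and an empty candidate set is given a penalty of 0. The penalty itself is taken to be 1 when c > r and $e^{(1-r/c)}$ otherwise.

```python
import math


def best_match_length(cand_len, ref_lens):
    # Reference length closest to the candidate length.
    # Assumption: ties go to the shorter reference.
    return min(ref_lens, key=lambda r: (abs(r - cand_len), r))


def brevity_penalty(cand_lens, ref_lens_per_cand):
    c = sum(cand_lens)  # total candidate length
    r = sum(best_match_length(cl, rls)  # total effective reference length
            for cl, rls in zip(cand_lens, ref_lens_per_cand))
    if c == 0:
        return 0.0  # assumption: nothing was produced
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


# Candidate of length 12 with references of length 12, 15, 17.
print(brevity_penalty([12], [[12, 15, 17]]))  # 1.0
# Example 3: "of the" (2 words) against references of 16, 18 and 16 words.
print(brevity_penalty([2], [[16, 18, 16]]))   # exp(-7), about 0.0009
```

For Example 3 the perfect precisions are multiplied by roughly 0.0009, so the two-word candidate no longer scores well.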

3. BLEU

BLEU is computed as follows:

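$BLEU=BP\cdot \exp\left(\sum^N_{n=1}w_n\log p_n\right)$

$BP=\left \{ \begin{aligned} 1, \quad if \quad c>r \\ e^{(1-r/c)}, \quad if \quad c \leq r \end{aligned} \right.$

where $w_n$ is $1 / N$ (the paper's baseline uses N = 4), $p_n$ is the modified n-gram precision from section 1.2, and c and r are the total candidate length and the total effective reference length from section 2.2.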
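
Putting the pieces together, the whole score can be sketched in one function. This is an illustrative implementation written from the formulas above, not the code in the coco-caption repository. The names `ngrams`, `bleu` and `tok` are hypothetical, inputs are assumed to be pre-tokenized, ties in the best match length go to the shorter reference, and the score is set to 0 when any $p_n$ is 0 (both of the latter are assumptions).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidates, references, max_n=4):
    """candidates: list of token lists.
    references: for each candidate, a list of reference token lists."""
    clipped = [0] * max_n
    total = [0] * max_n
    c = r = 0
    for cand, refs in zip(candidates, references):
        c += len(cand)
        # Effective reference length: closest to the candidate length.
        r += min((len(ref) for ref in refs),
                 key=lambda length: (abs(length - len(cand)), length))
        for n in range(1, max_n + 1):
            count = Counter(ngrams(cand, n))
            max_ref = Counter()
            for ref in refs:
                max_ref |= Counter(ngrams(ref, n))
            clipped[n - 1] += sum((count & max_ref).values())
            total[n - 1] += sum(count.values())
    if c == 0 or min(clipped) == 0:
        return 0.0  # assumption: avoid log(0) by scoring 0
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Uniform weights w_n = 1 / N
    log_avg = sum(math.log(clipped[i] / total[i]) for i in range(max_n)) / max_n
    return bp * math.exp(log_avg)


def tok(sentence):
    # Assumption: lowercase, drop periods, split on whitespace.
    return sentence.lower().replace(".", "").split()


refs = [tok(s) for s in (
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
)]
cand_1 = tok("It is a guide to action which ensures that the military "
             "always obeys the commands of the party.")
cand_2 = tok("It is to insure the troops forever hearing the activity "
             "guidebook that party direct.")
score_1 = bleu([cand_1], [refs])
score_2 = bleu([cand_2], [refs])
print(score_1 > score_2)  # True
```

On Example 2, Candidate 2 shares no trigram with any reference, so under the zero-handling assumption its single-sentence score is 0, while Candidate 1 receives a positive score. BLEU is designed as a corpus-level metric, so single-sentence scores like these mainly serve as a sanity check.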