BLEU: A Method for Automatic Evaluation of Machine Translation

Modified unigram precision

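To judge whether a translation is good, we look at how many words in the translated sentence also appear in the reference sentences: the more matches, the more accurate the translation. In other words, we look at precision (the fraction of the candidate's words that are correct). But this approach has a problem, illustrated by the following example.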

Candidate: the the the the the the the

Reference 1: The cat is on the mat.

Reference 2: There is a cat on the mat.

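The candidate above is obviously gibberish, yet every one of its words appears in a reference, so its precision is 7/7 = 100%. This is clearly unreasonable, which motivates modified unigram precision.

The idea is:

The word "the" appears at most 2 times in any single reference (in Reference 1), so even though the candidate consists entirely of "the", only the first two occurrences count as matches and the rest do not. Therefore modified unigram precision = 2/7.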
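The clipping rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the tokenization (lowercasing and keeping only alphabetic runs, so that "The" matches "the") are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction
import re


def tokenize(text):
    # Assumption: lowercase and keep alphabetic runs only, so "The" == "the".
    return re.findall(r"[a-z]+", text.lower())


def modified_unigram_precision(candidate, references):
    cand_counts = Counter(tokenize(candidate))
    ref_counts = [Counter(tokenize(r)) for r in references]
    clipped = 0
    for word, count in cand_counts.items():
        # A word is credited at most as often as it occurs in any single reference.
        max_ref = max(rc[word] for rc in ref_counts)
        clipped += min(count, max_ref)
    total = sum(cand_counts.values())
    return Fraction(clipped, total)


candidate = "the the the the the the the"
references = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_unigram_precision(candidate, references))  # 2/7
```

The sketch does not handle an empty candidate, which would divide by zero.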

Test example 1:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

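For Candidate 1: the reference with the most occurrences of "the" is Reference 2, with 4, so "the" in Candidate 1 can be credited at most 4 times. Candidate 1 contains "the" 3 times, so all 3 count. Summing the clipped counts gives: It(1)+is(1)+a(1)+guide(1)+to(1)+action(1)+which(1)+ensures(1)+that(1)+the(1)+military(1)+always(1)+the(1)+commands(1)+of(1)+the(1)+party(1)=17

Then modified unigram precision = 17 / (total number of words in Candidate 1, i.e. 18) = 17/18. The only unmatched word is "obeys".

For Candidate 2:

It(1)+is(1)+to(1)+the(1)+forever(1)+the(1)+that(1)+party(1)=8

Then modified unigram precision = 8 / (total number of words in Candidate 2, i.e. 14) = 8/14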
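The two counts can be checked mechanically. The sketch below is an illustration under the same assumed tokenization (lowercase, alphabetic runs only); `clipped_counts` is a hypothetical helper name. It also lists the candidate words that find no match in any reference.

```python
from collections import Counter
import re

REFERENCES = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
cand1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
cand2 = "It is to insure the troops forever hearing the activity guidebook that party direct."


def tokenize(text):
    # Assumption: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def clipped_counts(candidate, references):
    """Return (clipped match count per word, total number of candidate words)."""
    cand = Counter(tokenize(candidate))
    refs = [Counter(tokenize(r)) for r in references]
    clipped = Counter()
    for word, count in cand.items():
        clipped[word] = min(count, max(rc[word] for rc in refs))
    return clipped, sum(cand.values())


for cand in (cand1, cand2):
    clipped, total = clipped_counts(cand, REFERENCES)
    unmatched = sorted(w for w, c in clipped.items() if c == 0)
    print(sum(clipped.values()), "/", total, "unmatched:", unmatched)
```

For Candidate 1 this reports 17 / 18 with "obeys" unmatched, and for Candidate 2 it reports 8 / 14.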

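Bigrams and higher-order n-grams are matched as follows:

For a word sequence w1 w2 w3: w1 w2 counts as one bigram, and w2 w3 counts as another.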
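As a small illustration, contiguous n-grams can be extracted with a sliding window. The helper name `ngrams` is an assumption for this sketch.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["w1", "w2", "w3"], 2))  # [('w1', 'w2'), ('w2', 'w3')]
```

A sequence shorter than n simply yields no n-grams.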

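The above applies to a single sentence. For a collection of sentences, such as a whole document, the clipped counts and the word totals are each summed over all sentences before dividing.

Taking test example 1 as an illustration, and treating the two candidates as two sentences of one document:

Modified unigram precision = (17 + 8) / (18 + 14) = 25/32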
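In general form, the corpus-level modified n-gram precision defined in the original BLEU paper sums the clipped counts over every candidate sentence, and divides by the total number of candidate n-grams:

```latex
p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\text{clip}}(\text{n-gram})}
           {\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')}
```

Note that this is not the average of the per-sentence precisions: (17/18 + 8/14) / 2 is about 0.76, while 25/32 is about 0.78.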

The biggest weakness of modified precision is that a very short candidate can still score highly even when it is gibberish, as the following example shows:

Candidate: of the

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Here modified unigram precision = 2/2 and modified bigram precision = 1/1.
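These two values can be reproduced with a modified precision that works for any n. This is a sketch under the same assumed tokenization; `modified_precision` is a hypothetical name and returns the clipped count and the total separately so that 2/2 is not reduced to 1.

```python
from collections import Counter
import re

REFERENCES = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]


def tokenize(text):
    # Assumption: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Return (clipped n-gram matches, total candidate n-grams)."""
    cand = Counter(ngrams(tokenize(candidate), n))
    refs = [Counter(ngrams(tokenize(r), n)) for r in references]
    clipped = sum(min(c, max(rc[g] for rc in refs)) for g, c in cand.items())
    return clipped, sum(cand.values())


print(modified_precision("of the", REFERENCES, 1))  # (2, 2)
print(modified_precision("of the", REFERENCES, 2))  # (1, 1)
```

Both precisions are perfect even though the candidate covers almost none of the reference content.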

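To address this problem with short candidates, one might consider bringing in recall as well, as a counterbalance.

However, recall is ill-suited to the following example.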

Candidate 1: I always invariably perpetually do.

Candidate 2: I always do.

Reference 1: I always do.

Reference 2: I invariably do.

Reference 3: I perpetually do.

Candidate 1 contains more of the reference words than Candidate 2, so its recall is higher, yet Candidate 1 is clearly worse than Candidate 2.
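To make the point concrete, here is a sketch using one naive definition of recall over multiple references: the fraction of distinct reference words (pooled across all references) that the candidate contains. This definition and the name `naive_recall` are assumptions chosen for illustration; other definitions are possible.

```python
import re

REFERENCES = ["I always do.", "I invariably do.", "I perpetually do."]


def tokenize(text):
    # Assumption: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def naive_recall(candidate, references):
    # Pool the vocabulary of all references, then measure how much of it is covered.
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(tokenize(ref))
    cand_vocab = set(tokenize(candidate))
    return len(cand_vocab & ref_vocab) / len(ref_vocab)


print(naive_recall("I always invariably perpetually do.", REFERENCES))  # 1.0
print(naive_recall("I always do.", REFERENCES))                         # 0.6
```

Under this definition, the candidate that strings together every synonym scores higher than the one that exactly matches Reference 1.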

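BLEU in detail:

Each candidate sentence has several references. For each candidate, pick the reference whose length is closest to the candidate's. Summing these reference lengths over the corpus gives r, and summing the candidate lengths over the corpus gives c.

BLEU is built on precision only, so to deal with the problem caused by short candidates it adds a penalty term, the brevity penalty (BP), to arrive at the final BLEU score. The derivation is as follows: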
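For reference, the standard definitions from the original BLEU paper are reproduced here, where p_n is the modified n-gram precision, w_n are positive weights summing to 1, and the paper's baseline uses N = 4 with w_n = 1/N:

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)

\log \mathrm{BLEU} = \min\!\left( 1 - \frac{r}{c},\; 0 \right) + \sum_{n=1}^{N} w_n \log p_n
```

When the candidates are at least as long as the matched references, the penalty vanishes (BP = 1); the shorter the candidates are relative to r, the closer BP gets to 0.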

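The logarithm is introduced so that values which differ greatly in magnitude because of sparsity are brought onto a comparable scale, while their ordering (monotonicity) is preserved.

The geometric mean is used because it gives a balanced summary of quantities of different nature. Here those quantities are the modified n-gram precisions for different n, which are computed over all sentences together and combined into a single score for the translation quality of the whole document.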
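Putting the pieces together, the sketch below computes a corpus-level BLEU score with uniform weights. It is an illustration only: the tokenization, the tie-breaking rule for the closest reference length (the shorter reference wins), and returning 0.0 when any precision is zero (no smoothing) are all assumptions of this example, and `corpus_bleu` here is a local function, not a library API.

```python
import math
import re
from collections import Counter

REFERENCES = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]


def tokenize(text):
    # Assumption: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def corpus_bleu(candidates, references_list, max_n=4):
    """candidates: list of str; references_list: one list of reference strs per candidate."""
    clipped = [0] * max_n
    totals = [0] * max_n
    c = r = 0
    for cand, refs in zip(candidates, references_list):
        cand_tok = tokenize(cand)
        refs_tok = [tokenize(ref) for ref in refs]
        c += len(cand_tok)
        # Closest reference length; ties go to the shorter reference (assumption).
        r += min((len(t) for t in refs_tok),
                 key=lambda length: (abs(length - len(cand_tok)), length))
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand_tok, n))
            ref_counts = [Counter(ngrams(t, n)) for t in refs_tok]
            for gram, count in cand_counts.items():
                clipped[n - 1] += min(count, max(rc[gram] for rc in ref_counts))
            totals[n - 1] += sum(cand_counts.values())
    if c == 0 or min(clipped) == 0:
        return 0.0  # log(0) is undefined; this sketch applies no smoothing
    # Uniform weights w_n = 1 / max_n: mean of the log precisions.
    mean_log_p = sum(math.log(clipped[i] / totals[i]) for i in range(max_n)) / max_n
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(mean_log_p)


print(corpus_bleu(["of the"], [REFERENCES], max_n=2))
print(corpus_bleu([REFERENCES[0]], [REFERENCES]))
```

For "of the" with max_n=2, both precisions are perfect but c = 2 and r = 16, so the score collapses to exp(1 - 16/2) = exp(-7), which is the brevity penalty doing its job. A candidate identical to Reference 1 scores 1.0.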
