BLEU, short for Bilingual Evaluation Understudy, is a method proposed in 2002 for evaluating the quality of machine translation. It is simple, direct, fast, and easy to understand. Because its results are passable, it has been widely carried over into all kinds of evaluation tasks in natural language processing. You could say it reigns only for lack of competition: when there is no tiger on the mountain, the monkey crowns himself king.
Problem description
First, let's build an intuition for the BLEU algorithm.
There are two kinds of problems:
1. Given one sentence and one set of reference sentences, compute the BLEU score; this is called sentence_bleu.
2. Given many sentences and their corresponding sets of reference sentences, compute the BLEU score; this is called corpus_bleu.
The sentence produced by the machine translation system is called the candidate, and the set of reference sentences is called the references.
The computation measures the overlap between the candidate and the references: the larger the overlap, the better the translation.
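As a quick illustration before the full algorithm, here is a minimal sketch of the overlap ("clipped count") idea on toy word lists (made up for this example, not taken from the code below):

```python
from collections import Counter

# Toy 1-gram example: each candidate word's count is clipped to the
# maximum count it reaches in any single reference.
candidate = "the the the cat".split()
references = ["the cat sat".split(), "the the dog".split()]

can_counter = Counter(candidate)              # {'the': 3, 'cat': 1}
ref_counters = [Counter(r) for r in references]

clipped = sum(
    min(cnt, max(rc.get(word, 0) for rc in ref_counters))
    for word, cnt in can_counter.items()
)
precision = clipped / len(candidate)
print(clipped, precision)  # 'the' is clipped from 3 to 2, plus 'cat': 3/4
```

Clipping prevents a candidate from scoring well just by repeating a common word more often than any reference does.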
Computing the BLEU score for one sentence against a set of references
BLEU considers n-grams for n = 1, 2, 3, 4, and each n-gram order can be assigned a weight.
For each n-gram order:
- Split the candidate and each reference into n-grams.
- Count how often each n-gram occurs in the candidate and in the references.
- Clip each candidate n-gram's count so that it does not exceed its maximum count in any single reference.
- The sum of the clipped counts divided by the total number of candidate n-grams is the score for that order.
- Multiplying the combined score by the sentence-length (brevity) penalty gives the final BLEU score.
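The brevity penalty in the last step penalizes candidates that are shorter than the closest reference. A minimal sketch of the standard definition (BP = 1 when the candidate is at least as long as the reference, otherwise e^(1 - r/c)):

```python
import math

def brevity_penalty(ref_len, cand_len):
    # Standard BLEU brevity penalty: no penalty when the candidate is
    # at least as long as the closest reference; otherwise e^(1 - r/c).
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

print(brevity_penalty(10, 10))  # 1.0 (no penalty)
print(brevity_penalty(10, 5))   # e^(1 - 2) = e^-1, roughly 0.368
```

Without this factor, a one-word candidate that happens to appear in a reference would score a perfect 1.0.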
from collections import Counter
import numpy as np
from nltk.translate import bleu_score

def bp(references, candidate):
    # brevity penalty: pick the reference whose length is closest to the candidate
    ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
    if len(references[ind]) < len(candidate):
        return 1
    # standard BLEU: e^(1 - r/c) when the candidate is shorter than the reference
    scale = 1 - (len(references[ind]) / len(candidate))
    return np.e ** scale

def parse_ngram(sentence, gram):
    # split a sentence into n-grams
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]  # note the +1, otherwise the last gram is lost

def sentence_bleu(references, candidate, weight):
    bp_value = bp(references, candidate)
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        # split into n-grams
        ref = [parse_ngram(i, gram) for i in references]
        can = parse_ngram(candidate, gram)
        # count n-gram occurrences
        ref_counter = [Counter(i) for i in ref]
        can_counter = Counter(can)
        # clip each candidate n-gram's count to its maximum count in any single reference
        appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
        score = appear / len(can)
        # each order's score carries its own weight
        s *= score ** wei
    s *= bp_value  # multiply the final score by the brevity penalty
    return s
references = [
    "the dog jumps high",
    "the cat runs fast",
    "dog and cats are good friends",
]
candidate = "the d o g jump s hig"
weights = [0.25, 0.25, 0.25, 0.25]
print(sentence_bleu(references, candidate, weights))
print(bleu_score.sentence_bleu(references, candidate, weights))
A corpus is made up of multiple sentences, but corpus_bleu is not the mean of the per-sentence sentence_bleu scores. It is computed in a slightly more involved way, one that is hard to justify from first principles.
corpus_bleu
Suppose a document contains three sentences whose per-order scores are a1/b1, a2/b2, a3/b3.
Then the score over all sentences is (a1+a2+a3)/(b1+b2+b3).
The brevity penalty is pooled the same way: if the three candidate lengths are l1, l2, l3 and the closest-matching references have lengths k1, k2, k3, the penalty is equivalent to bp(k1+k2+k3, l1+l2+l3).
In other words, corpus_bleu is not simply the mean of sentence_bleu values; it is computed by a single, pooled formula.
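A toy arithmetic sketch (with made-up counts) of why pooling differs from averaging:

```python
from fractions import Fraction

# Hypothetical per-sentence 1-gram counts: (clipped matches, total n-grams)
counts = [(1, 2), (9, 10), (0, 4)]

# Mean of per-sentence precisions
mean_of_ratios = sum(Fraction(a, b) for a, b in counts) / len(counts)

# Corpus-style pooling: sum the numerators, sum the denominators
pooled = Fraction(sum(a for a, _ in counts), sum(b for _, b in counts))

print(float(mean_of_ratios))  # (1/2 + 9/10 + 0)/3 = 7/15, roughly 0.467
print(float(pooled))          # 10/16 = 0.625
```

Under pooling, a long sentence contributes more to the corpus score than a short one, whereas averaging weights every sentence equally.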
from collections import Counter
import numpy as np
from nltk.translate import bleu_score

def bp(references_len, candidate_len):
    # brevity penalty on the pooled lengths
    if references_len < candidate_len:
        return 1
    # standard BLEU: e^(1 - r/c) when the candidate is shorter
    scale = 1 - (references_len / candidate_len)
    return np.e ** scale

def parse_ngram(sentence, gram):
    return [sentence[i:i + gram] for i in range(len(sentence) - gram + 1)]

def corpus_bleu(references_list, candidate_list, weights):
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = 0
    for candidate, references in zip(candidate_list, references_list):
        ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
        reference_len += len(references[ind])
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator
        down = 0  # denominator
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            # split into n-grams
            ref = [parse_ngram(i, gram) for i in references]
            can = parse_ngram(candidate, gram)
            # count n-gram occurrences
            ref_counter = [Counter(i) for i in ref]
            can_counter = Counter(can)
            # clip each candidate n-gram's count to its maximum count in any single reference
            appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
            # NLTK's implementation reduces the numerator and denominator here.
            # I strongly disagree: they should not be reduced, and this may well be a bug in NLTK.
            # gcd = np.gcd(appear, len(can))
            gcd = 1
            if appear != 0:
                up += appear // gcd
                down += len(can) // gcd
            else:
                down += 1
        s *= (up / down) ** wei
    return bp(reference_len, candidate_len) * s
references = [
    [
        "the dog jumps high",
        "the cat runs fast",
        "dog and cats are good friends",
    ],
    [
        "ba ga ya",
        "lu ha a df",
    ],
]
candidate = ["the d o g jump s hig", 'it is too bad']
weights = [0.25, 0.25, 0.25, 0.25]
print(corpus_bleu(references, candidate, weights))
print(bleu_score.corpus_bleu(references, candidate, weights))
When summing per-sentence scores, the NLTK code uses Fraction, which automatically reduces the numerator and denominator. I don't think they should be reduced: once reduced, different sentences no longer contribute in proportion to their length.
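The concern can be demonstrated directly with made-up counts: Fraction reduces on construction, so summing numerators and denominators after reduction gives a different pooled value than summing the raw counts.

```python
from fractions import Fraction

# Two hypothetical sentences: (clipped matches, total n-grams)
raw = [(2, 4), (3, 5)]

# Pooling the raw counts (the gcd = 1 behaviour in corpus_bleu above):
raw_pool = sum(a for a, _ in raw) / sum(b for _, b in raw)

# Pooling after Fraction has reduced each ratio: 2/4 collapses to 1/2,
# so the first sentence now contributes less to the denominator.
reduced = [Fraction(a, b) for a, b in raw]
reduced_pool = sum(f.numerator for f in reduced) / sum(f.denominator for f in reduced)

print(raw_pool)      # 5/9, roughly 0.556
print(reduced_pool)  # 4/7, roughly 0.571
```

After reduction, the first sentence effectively counts as if it contained only 2 n-grams instead of 4, which changes its weight in the pooled score.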