Rouge | An evaluation metric for automatic summarization and machine translation


tag: evaluation metric, summarization, nlp

Rouge (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. A generated summary or translation is compared against a set of reference summaries (usually human-written), producing scores that measure the "similarity" between the generated output and the references.

rouge-N

Measures the overlap of N-grams between the candidate and the reference.
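As a minimal sketch of the idea, rouge-N can be computed from clipped n-gram counts. This is an illustrative hand-rolled version (the helper names ngrams and rouge_n are my own, and the real rouge package may differ internally):

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    # candidate / reference: whitespace-tokenized strings
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((cand & ref).values())          # clipped n-gram matches
    r = overlap / max(sum(ref.values()), 1)       # recall: matches / reference n-grams
    p = overlap / max(sum(cand.values()), 1)      # precision: matches / candidate n-grams
    f = 2 * p * r / (p + r) if p + r else 0.0     # harmonic mean
    return r, p, f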

When computing Rouge on Chinese text, each character must be separated by a space (playing the role that spaces between words play in an English sentence).

For example:

from rouge import Rouge

rouge = Rouge()
title = '今天是星期四'
pred_title = '今天是周四'
pred_title2 = '周四是今天'

# Insert a space between characters so each character counts as one token
print(rouge.get_scores(' '.join(list(pred_title)), ' '.join(list(title))))
  
# [{'rouge-1': {'r': 0.6666666666666666, 'p': 0.8, 'f': 0.7272727223140496}, 'rouge-2': {'r': 0.4, 'p': 0.5, 'f': 0.4444444395061729}, 'rouge-l': {'r': 0.6666666666666666, 'p': 0.8, 'f': 0.7272727223140496}}]

print(rouge.get_scores(' '.join(list(pred_title2)), ' '.join(list(title))))
# [{'rouge-1': {'r': 0.6666666666666666, 'p': 0.8, 'f': 0.7272727223140496}, 'rouge-2': {'r': 0.2, 'p': 0.25, 'f': 0.22222221728395072}, 'rouge-l': {'r': 0.3333333333333333, 'p': 0.4, 'f': 0.36363635867768596}}]

rouge-1

Measures the overlap of single tokens (unigrams); here, single Chinese characters.

  • Recall: r = number of overlapping unigrams / len(title), i.e. how many of the reference tokens were found. Here r = 4 (the overlapping characters are 今, 天, 是, 四) / 6 ≈ 0.67
  • Precision: p = number of overlapping unigrams / len(pred_title), i.e. how many of the found tokens are correct. Here p = 4/5 = 0.8
  • f is the harmonic mean of r and p: f = 2pr / (p + r) ≈ 0.727 (verified in the snippet below)
  • Both predictions get exactly the same rouge-1 score, because unigram matching ignores word order
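The rouge-1 numbers above can be reproduced in a few lines of plain Python. A hand-rolled check (using collections.Counter; not the library's own code):

from collections import Counter

title = '今天是星期四'
pred_title = '今天是周四'

overlap = sum((Counter(pred_title) & Counter(title)).values())  # 今 天 是 四 -> 4
r = overlap / len(title)          # 4 / 6 ≈ 0.67
p = overlap / len(pred_title)     # 4 / 5 = 0.8
print(r, p, 2 * p * r / (p + r))  # ≈ 0.667 0.8 0.727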

rouge-2

Measures the overlap of adjacent token pairs (bigrams).

  • Recall: r = number of overlapping bigrams / number of reference bigrams (len(title) - 1 = 5). For pred_title, r = 2 (今天, 天是) / 5 (今天, 天是, 是星, 星期, 期四) = 0.4; for pred_title2, r = 1 (今天) / 5 = 0.2
  • Precision: p = number of overlapping bigrams / number of candidate bigrams. For pred_title, p = 2/4 = 0.5; for pred_title2, p = 1 (今天) / 4 (周四, 四是, 是今, 今天) = 0.25
  • f is again the harmonic mean: f = 2pr / (p + r), as the snippet below confirms
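The same check for rouge-2, enumerating the bigrams explicitly (again a hand-rolled sketch; the bigrams helper is an illustrative name):

from collections import Counter

def bigrams(s):
    # Adjacent character pairs of a string
    return [s[i:i + 2] for i in range(len(s) - 1)]

title, pred1, pred2 = '今天是星期四', '今天是周四', '周四是今天'
for pred in (pred1, pred2):
    overlap = sum((Counter(bigrams(pred)) & Counter(bigrams(title))).values())
    r = overlap / len(bigrams(title))  # 2/5 = 0.4, then 1/5 = 0.2
    p = overlap / len(bigrams(pred))   # 2/4 = 0.5, then 1/4 = 0.25
    print(r, p, 2 * p * r / (p + r))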

rouge-L

The L comes from LCS (longest common subsequence): rouge-L scores the longest common subsequence between the candidate and the reference. Note that the LCS respects token order; this order sensitivity is exactly the point many write-ups gloss over, and it confused me for a long time.

  • Recall: r = len(LCS) / len(title). For pred_title the LCS is 今天是四, so r = 4/6 ≈ 0.67; for pred_title2 the LCS is 今天, so r = 2/6 ≈ 0.33
  • Precision: p = len(LCS) / len(pred_title). For pred_title, p = 4/5 = 0.8; for pred_title2, p = 2/5 = 0.4
  • f is the harmonic mean of r and p (see the sketch below)
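A sketch of the rouge-L scores using the textbook dynamic-programming LCS (the rouge package may compute this differently internally):

def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

title, pred1, pred2 = '今天是星期四', '今天是周四', '周四是今天'
for pred in (pred1, pred2):
    l = lcs_len(pred, title)          # 4 (今天是四), then 2 (今天)
    r = l / len(title)                # 4/6, then 2/6
    p = l / len(pred)                 # 4/5, then 2/5
    print(r, p, 2 * p * r / (p + r))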

Advantages: it does not require consecutive matches, only that the matched tokens appear in the same order, so unlike fixed n-grams it reflects sentence-level word order, and the longest common subsequence is found automatically without predefining an n-gram length.
Disadvantage: only the single longest subsequence is scored; the final value ignores alternative longest subsequences and all shorter ones.

Having worked through this example, it should be clear that when the predicted order matches the reference, as with pred_title, rouge-1 = rouge-L; but for pred_title2, because order matters, the longest common subsequence is no longer 今天是四 but only 今天. Between the two predictions rouge-1 is unchanged, while both rouge-2 and rouge-L drop.

Source: blog.csdn.net/qq_23590921/article/details/129181387