Machine translation and automatic summarization evaluation metrics: BLEU and ROUGE

In machine translation tasks, BLEU and ROUGE are two commonly used evaluation metrics. BLEU scores a translation based on precision, while ROUGE scores it based on recall.

1. Machine translation evaluation metrics

After a machine translation model produces a translation, its quality needs to be evaluated, which requires machine translation evaluation metrics; the most common ones are BLEU and ROUGE. Both metrics have a fairly long history: BLEU was proposed in 2002 and ROUGE in 2004. Although they have known shortcomings, they remain the mainstream metrics for evaluating machine translation.

Generally, C denotes the candidate translation produced by the machine translation system, and m reference translations S1, S2, ..., Sm are provided. An evaluation metric measures how well the candidate C matches the references S1, S2, ..., Sm.

2. BLEU

BLEU stands for Bilingual Evaluation Understudy. BLEU scores range from 0 to 1; the closer the score is to 1, the higher the quality of the translation. BLEU is mainly based on precision. The overall formula of BLEU is as follows:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} W_n \log P_n\right)$$

  • BLEU computes the precision of the translation's 1-grams, 2-grams, ..., N-grams. N is usually set to 4, and Pn in the formula is the n-gram precision.
  • Wn is the weight of the n-gram precision, usually uniform, i.e. Wn = 1/N for every n.
  • BP is a brevity penalty: it is less than 1 when the machine translation is shorter than the reference translation, penalizing translations that are too short.
  • The 1-gram precision indicates how faithful the translation is to the source text, while the higher-order n-gram precisions reflect the fluency of the translation.

2.1 Calculating n-gram precision

Assume the machine translation C and a reference translation S1 are as follows:

C: a cat is on the table
S1: there is a cat on the table 

Then the 1-gram, 2-gram, ... precisions can be calculated:

$$P_1 = \frac{6}{6}, \quad P_2 = \frac{3}{5}, \quad P_3 = \frac{1}{4}, \quad P_4 = \frac{0}{3}$$
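As a quick illustration (not from the original post), here is a minimal Python sketch of this plain, unclipped n-gram precision; the helper names ngrams and simple_precision are chosen here only for the example.

```python
def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_precision(candidate, reference, n):
    """Unclipped n-gram precision: fraction of candidate n-grams that occur in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = set(ngrams(reference.split(), n))
    return sum(1 for g in cand if g in ref) / len(cand) if cand else 0.0

C = "a cat is on the table"
S1 = "there is a cat on the table"
for n in range(1, 5):
    print(f"{n}-gram precision:", simple_precision(C, S1, n))
# prints 6/6, 3/5, 1/4, 0/3
```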

Calculating precision directly in this way can be problematic, for example:

C: there there there there there
S1: there is a cat on the table 

Here the machine translation is obviously wrong, yet its 1-gram precision is 1, so BLEU uses a modified (clipped) precision. Given the reference translations S1, S2, ..., Sm, the n-gram precision of C is calculated as follows:

$$P_n = \frac{\sum_{n\text{-gram} \in C} \min\left(\text{Count}_C(n\text{-gram}),\ \max_{j=1,\dots,m} \text{Count}_{S_j}(n\text{-gram})\right)}{\sum_{n\text{-gram} \in C} \text{Count}_C(n\text{-gram})}$$
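In other words, the count of each n-gram in C is clipped to the maximum number of times it appears in any single reference. A small Python sketch of this clipping (modified_precision is an illustrative name, not an official API):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most as many
    times as it occurs in any single reference translation."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
    total = sum(cand_counts.values())
    clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
    return clipped / total if total else 0.0

# "there" occurs 5 times in the candidate but only once in the reference,
# so the clipped 1-gram precision drops from 1.0 to 1/5.
print(modified_precision("there there there there there",
                         ["there is a cat on the table"], 1))   # 0.2
```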

2.2 Penalty factor

The calculation of BLEU's n-gram precision was introduced above, but a problem remains: when the machine translation is very short, its n-gram precisions can still be high even though the translation loses a lot of information, for example:

C: a cat
S1: there is a cat on the table 

Therefore, the BLEU score is multiplied by a penalty factor BP:

$$BP = \begin{cases} 1 & \text{if}\ l_c > l_s \\ e^{\,1 - l_s / l_c} & \text{if}\ l_c \le l_s \end{cases}$$

where $l_c$ is the length of the machine translation C and $l_s$ is the length of the reference translation.
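Putting the pieces together, a minimal BLEU sketch might look like the following. It assumes (as in the original BLEU paper) that the effective reference length is the one closest to the candidate length; the function names are illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    cand = Counter(ngrams(candidate.split(), n))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref.split(), n)).items():
            max_ref[g] = max(max_ref[g], c)
    total = sum(cand.values())
    return sum(min(c, max_ref[g]) for g, c in cand.items()) / total if total else 0.0

def brevity_penalty(candidate, references):
    c = len(candidate.split())
    # assumption: effective reference length = the reference length closest to c
    r = min((len(ref.split()) for ref in references), key=lambda l: (abs(l - c), l))
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, N=4):
    """BP times the geometric mean of the clipped 1..N-gram precisions (uniform weights)."""
    ps = [modified_precision(candidate, references, n) for n in range(1, N + 1)]
    if min(ps) == 0.0:                 # any zero precision zeroes the geometric mean
        return 0.0
    return brevity_penalty(candidate, references) * math.exp(sum(math.log(p) / N for p in ps))

# "a cat" matches the reference gram-for-gram, but BP = exp(1 - 7/2) makes the score small
print(bleu("a cat", ["there is a cat on the table"], N=2))
```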

3. ROUGE

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and is mainly based on recall. ROUGE is a commonly used evaluation metric for machine translation and text summarization. It was proposed by Chin-Yew Lin, whose paper describes four ROUGE variants:

  • ROUGE-N: computes recall over N-grams
  • ROUGE-L: based on the longest common subsequence between the machine translation and the reference translation
  • ROUGE-W: an improved ROUGE-L that uses a weighted longest common subsequence
  • ROUGE-S: computes recall over skip-bigrams, i.e. word pairs that need not be consecutive

3.1 ROUGE-N

ROUGE-N computes recall over N-grams. The ROUGE-N score is calculated as follows:

$$\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in S} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})}$$

The denominator counts the number of N-grams in the reference translation, and the numerator counts the number of N-grams shared by the reference translation and the machine translation.

C: a cat is on the table
S1: there is a cat on the table 

The ROUGE-1 and ROUGE-2 scores for the above example are as follows:

$$\text{ROUGE-1} = \frac{6}{7}, \qquad \text{ROUGE-2} = \frac{3}{6}$$
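A small Python sketch of single-reference ROUGE-N recall, reproducing the two scores above (the helper names are illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    """Recall over reference n-grams: matched n-grams / n-grams in the reference."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    matched = sum(min(cnt, cand_counts[g]) for g, cnt in ref_counts.items())
    total = sum(ref_counts.values())
    return matched / total if total else 0.0

C = "a cat is on the table"
S1 = "there is a cat on the table"
print(rouge_n(C, S1, 1))   # 6/7
print(rouge_n(C, S1, 2))   # 3/6
```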

If multiple reference translations are given, Chin-Yew Lin also provides a method: assuming there are M references S1, ..., SM, ROUGE-N computes the score between the machine translation and each reference separately and takes the maximum, as in the formula below. The same method can also be used for ROUGE-L, ROUGE-W and ROUGE-S.

$$\text{ROUGE-N}_{\text{multi}} = \max_{i=1,\dots,M} \text{ROUGE-N}(S_i, C)$$
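Reusing the rouge_n helper from the sketch above, the multi-reference version is simply the maximum over the references:

```python
def rouge_n_multi(candidate, references, n):
    """Score the candidate against each reference separately and keep the best score."""
    return max(rouge_n(candidate, ref, n) for ref in references)
```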

3.2 ROUGE-L

The L in ROUGE-L refers to the longest common subsequence (LCS). ROUGE-L is computed from the longest common subsequence of the machine translation C and the reference translation S, as follows:

$$R_{LCS} = \frac{LCS(C, S)}{\text{len}(S)}, \qquad P_{LCS} = \frac{LCS(C, S)}{\text{len}(C)}, \qquad F_{LCS} = \frac{(1 + \beta^2)\, R_{LCS}\, P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}}$$

In the formula, R_LCS is the recall, P_LCS is the precision, and F_LCS is the ROUGE-L score. β is generally set to a large number, so F_LCS is dominated by R_LCS (i.e. recall): as β grows, the formula above approaches R_LCS and the influence of P_LCS can be ignored.
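A minimal ROUGE-L sketch in Python using a standard dynamic-programming LCS; the beta value here is arbitrary and only illustrates that a large beta pushes F_LCS toward R_LCS.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference, beta=8.0):
    c, s = candidate.split(), reference.split()
    lcs = lcs_length(c, s)
    if lcs == 0:
        return 0.0
    r_lcs = lcs / len(s)                      # recall
    p_lcs = lcs / len(c)                      # precision
    # with a large beta, F_LCS is dominated by the recall term R_LCS
    return (1 + beta ** 2) * r_lcs * p_lcs / (r_lcs + beta ** 2 * p_lcs)

# LCS("a cat is on the table", "there is a cat on the table") = "a cat on the table" (5 words)
print(rouge_l("a cat is on the table", "there is a cat on the table"))   # close to R_LCS = 5/7
```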

3.3 ROUGE-W

ROUGE-W is an improved version of ROUGE-L. Consider the following example, where X is the reference translation and Y1, Y2 are two machine translations.

X:  A B C D E F G
Y1: A B C D H I K
Y2: A H B K C I D

In this example, Y1 is clearly the better translation because it contains more consecutive matches with the reference. However, ROUGE-L gives the two candidates the same score, that is, ROUGE-L(X, Y1) = ROUGE-L(X, Y2).

Therefore, the author proposes a weighted longest common subsequence (WLCS) method, which gives higher scores to consecutive matches. For details, see the original paper "ROUGE: A Package for Automatic Evaluation of Summaries".
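For reference, here is a sketch of the WLCS recursion as described in Lin's paper, with the weighting function f(k) = k^alpha (alpha = 2 is a typical choice); treat it as an illustration under these assumptions rather than the official implementation.

```python
def wlcs(x, y, f):
    """Weighted LCS: a run of k consecutive matches contributes f(k) instead of k."""
    c = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]   # weighted LCS score
    w = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]     # length of the run ending here
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            elif c[i - 1][j] > c[i][j - 1]:
                c[i][j] = c[i - 1][j]
            else:
                c[i][j] = c[i][j - 1]
    return c[len(x)][len(y)]

def rouge_w(reference, candidate, alpha=2.0, beta=1.2):
    f = lambda k: k ** alpha               # weighting function
    f_inv = lambda v: v ** (1.0 / alpha)   # its inverse
    x, y = reference.split(), candidate.split()
    score = wlcs(x, y, f)
    if score == 0:
        return 0.0
    r = f_inv(score / f(len(x)))
    p = f_inv(score / f(len(y)))
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

X  = "A B C D E F G"
Y1 = "A B C D H I K"   # four consecutive matches
Y2 = "A H B K C I D"   # same LCS length, but the matches are scattered
print(rouge_w(X, Y1), rouge_w(X, Y2))   # Y1 now scores higher than Y2
```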

3.4 ROUGE-S

ROUGE-S also counts N-grams, but its N-grams allow skips, that is, the words do not need to be consecutive. For example, the skip-bigrams of the sentence "I have a cat" include (I, have), (I, a), (I, cat), (have, a), (have, cat), (a, cat).
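Skip-bigrams are simply all ordered word pairs of a sentence, which is easy to enumerate in Python (a small illustrative sketch):

```python
from itertools import combinations

def skip_bigrams(sentence):
    """All ordered word pairs of the sentence, allowing any gap in between."""
    return set(combinations(sentence.split(), 2))

print(sorted(skip_bigrams("I have a cat")))
# [('I', 'a'), ('I', 'cat'), ('I', 'have'), ('a', 'cat'), ('have', 'a'), ('have', 'cat')]
```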

4. References

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of ACL 2002.
Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out (ACL 2004 Workshop).

Origin blog.csdn.net/sinat_37574187/article/details/131324986