LLM - BLEU, an evaluation metric for large models

Table of contents

1. Introduction

2. Introduction to BLEU

1. Unigram BLEU

2. Modified BLEU

3. Modified n-gram precision

4. Sentence brevity penalty

3. BLEU calculation

1. Calculating a sentence against a single reference

2. Calculating a sentence against multiple references

4. Summary


1. Introduction

Human evaluation of machine translation is thorough but expensive: it can take months to complete and involves manual labor that cannot be reused. BLEU (bilingual evaluation understudy) literally means "bilingual evaluation stand-in"; it was created to provide an automatic quality measure for bilingual translation. The method is fast, cheap, language-independent, and correlates highly with human judgment. It was first applied to the rapid evaluation of machine translation against human translation. LLMs are currently very popular, but many generated answers still require manual evaluation, leaving no quantitative indicator of their quality. This article briefly introduces the calculation and use of BLEU.

2. Introduction to BLEU

The core computation in BLEU is to compare the n-grams of the candidate translation with the n-grams of the reference translations and count the number of matches. These matches are position-independent: the more matches, the better the candidate translation. For simplicity, we first focus on counting unigram (single-word) matches.

1. Unigram BLEU

Take unigrams (single words) as an example (References 1 and 2 are the reference translations; MT output is the machine translation output):

Reference 1: The cat is on the mat

Reference 2: There is a cat on the mat

MT output: the the the the the the the

The MT output contains seven words, all of them 'the', so the denominator is 7.

'the' appears in the reference translations, so every candidate word counts as a match and the numerator is also 7.

This gives P = 7 / 7 = 1, which clearly contradicts our intuition.
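As a minimal sketch of this naive unigram precision (assuming lowercased, whitespace-tokenized sentences; the example data comes from the text above):

# Naive unigram precision: a candidate word counts as a match
# if it appears anywhere in any reference
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
candidate = "the the the the the the the".split()

ref_vocab = {word for ref in references for word in ref}
matches = sum(1 for word in candidate if word in ref_vocab)

print(matches, "/", len(candidate), "=", matches / len(candidate))  # 7 / 7 = 1.0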

2. Modified BLEU

The previous simple "contains" check, which only asks whether a candidate word appears somewhere in a reference sentence, is clearly not ideal, so modified BLEU was proposed. The modification mainly fixes the unreasonable calculation of the numerator above, using the following definition:

Count_{w_i,j}^{clip} = min(Count_{w_i}, RefCount_{w_i,j})

The clipped (modified) count of word w_i with respect to reference j is given by the formula above, where:

Count_{w_i}

◆  The number of times w_i occurs in the candidate; for w_i = 'the' above, its count in the MT output is 7.

RefCount_{w_i,j}

◆  The number of times w_i appears in reference j; 'the' appears 2 times in Reference 1 and 1 time in Reference 2.

Count_{w_i,j}^{clip}

◆  The clipped count of w_i for the j-th reference: min(7, 2) = 2 for Reference 1 and min(7, 1) = 1 for Reference 2.

Count_{w_i}^{clip} = max_j(Count_{w_i,j}^{clip}), j = 1, 2, 3, ...

◆  The overall clipped count of w_i across all reference translations; for 'the', this is max(2, 1) = 2.

With the modified method the denominator remains 7 and the numerator becomes 2, so the BLEU score = 2/7, which is far more reasonable.
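The clipping rule can be sketched in a few lines of Python (lowercased tokens, matching the case-insensitive counts worked out above):

from collections import Counter

references = [
    "the cat is on the mat".split(),      # 'the' appears 2 times
    "there is a cat on the mat".split(),  # 'the' appears 1 time
]
candidate = "the the the the the the the".split()

clipped = 0
for word, count in Counter(candidate).items():
    # Count_clip = min(Count, max_j RefCount_j) -> min(7, max(2, 1)) = 2
    max_ref_count = max(ref.count(word) for ref in references)
    clipped += min(count, max_ref_count)

print(clipped, "/", len(candidate))  # 2 / 7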

3. Modified n-gram precision

The above considers only single words. Although the adjusted numerator makes the score more reasonable, the actual translation still reads very poorly, because it is just one word repeated. To address this, the algorithm introduces n-grams and evaluates BLEU over phrases of different lengths; typically N = 4. Take the candidate 'The cat the cat on the mat' (reused in the calculation section below), with the same two references, as an example.

Computing the modified 1-gram through 4-gram precisions on this MT output gives p_1 = 5/7, p_2 = 4/6, p_3 = 2/5, and p_4 = 1/4 (case-sensitive matching, as in the NLTK example below).

For longer paragraphs, which we can view as a corpus of sentences, the clipped n-gram counts are accumulated over all candidate sentences and divided by the total number of candidate n-grams. In other words, a word-weighted average of the sentence-level modified precisions is used rather than a sentence-weighted average.
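A sketch that generalizes the clipped counts from unigrams to n-grams; it reuses the candidate and references from the calculation section below, with case-sensitive matching as in NLTK:

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Each candidate n-gram count is clipped by the maximum count
    # of that n-gram in any single reference
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    clipped = sum(min(count, max(rc[gram] for rc in ref_counts))
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

references = [['The', 'cat', 'is', 'on', 'the', 'mat'],
              ['There', 'is', 'a', 'cat', 'on', 'the', 'mat']]
candidate = ['The', 'cat', 'the', 'cat', 'on', 'the', 'mat']

for n in range(1, 5):
    print(f"p_{n} = {modified_precision(candidate, references, n):.4f}")
# p_1 = 0.7143, p_2 = 0.6667, p_3 = 0.4000, p_4 = 0.2500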

4. Sentence brevity penalty

When the candidate translation is too short, the BLEU score computed above is distorted, because a short candidate can still achieve high modified precision. For this reason, a sentence brevity penalty (BP) is introduced: when the candidate is shorter than the reference translation, BP reduces the score.

The BLEU formula before this modification:

BLEU = exp(Σ_{n=1}^{N} w_n log p_n)

Here a weighted sum of the log precisions of the different n-grams is used, i.e. a weighted geometric mean. The modified BLEU formula adds the brevity penalty:

BP = 1 if c > r, and BP = e^{(1 - r/c)} if c ≤ r

BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n)

where c is the length of the candidate translation and r is the length of the reference translation. For c ≤ r the score is penalized, with BP computed from r and c via the exponential function. In the paper, the baseline uses n-grams up to N = 4 and uniform weights w_n = 1/N.
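Putting the pieces together, a minimal sketch of the full formula, plugging in the modified precisions computed in the n-gram sketch above:

import math

def brevity_penalty(c, r):
    # BP = 1 if c > r, else e^(1 - r/c)
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu_score(precisions, c, r):
    # Uniformly weighted geometric mean of the modified precisions
    # (w_n = 1/N), multiplied by the brevity penalty
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(c, r) * math.exp(log_avg)

# The candidate has 7 words and the closest reference also has 7, so BP = 1
print(bleu_score([5/7, 4/6, 2/5, 1/4], c=7, r=7))  # ≈ 0.4671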

3. BLEU calculation

In Python, the BLEU score between an output and its references can be calculated with the nltk library; multiple references are passed as a list. The BLEU score ranges from 0 to 1, where 1 represents a perfect match and a higher score indicates a better match.

1. Calculating a sentence against a single reference

from nltk.translate.bleu_score import sentence_bleu

# List of reference sentences
reference = [['The', 'cat', 'is', 'on', 'the', 'mat']]
# Candidate sentence
candidate = ['the', 'the', 'the', 'the', 'the', 'the', 'the']

# Compute the BLEU score
bleu_score = sentence_bleu(reference, candidate)

print("BLEU score:", bleu_score)

Taking the seven 'the's above as the candidate, the BLEU score is 1.1200407237786664e-231, i.e. effectively zero.
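The near-zero value comes from the default 4-gram weights (0.25 each): this candidate has no matching 2-, 3-, or 4-grams, so the geometric mean collapses and NLTK emits a warning. Two common workarounds, sketched with the same example data, are unigram-only weights and NLTK's built-in smoothing:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['The', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'the', 'the', 'the', 'the', 'the', 'the']

# Unigram-only BLEU; matching is case-sensitive, so only the lowercase
# 'the' in the reference matches, giving about 1/7
print(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))

# Smoothing replaces zero n-gram precisions with a small epsilon so the
# geometric mean does not collapse to zero
smoothie = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smoothie))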

2. Calculating a sentence against multiple references

from nltk.translate.bleu_score import sentence_bleu

# List of reference sentences
reference = [['The', 'cat', 'is', 'on', 'the', 'mat'], ['There', 'is', 'a', 'cat', 'on', 'the', 'mat']]
# Candidate sentence
candidate = ['The', 'cat', 'the', 'cat', 'on', 'the', 'mat']

# Compute the BLEU score
bleu_score = sentence_bleu(reference, candidate)

print("BLEU score:", bleu_score)

Taking 'The cat the cat on the mat' as the candidate, the BLEU score is 0.4671379777282001, which agrees with the hand-computed precisions above (the geometric mean of 5/7, 4/6, 2/5 and 1/4 with BP = 1).
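For batch evaluation of LLM outputs, NLTK also provides corpus_bleu, which aggregates the clipped counts over the whole corpus before combining them (not the same as averaging per-sentence scores). A sketch with a second, purely hypothetical sentence pair added to the example above:

from nltk.translate.bleu_score import corpus_bleu

# One list of references per hypothesis
list_of_references = [
    [['The', 'cat', 'is', 'on', 'the', 'mat'],
     ['There', 'is', 'a', 'cat', 'on', 'the', 'mat']],
    [['He', 'was', 'interested', 'in', 'world', 'history']],  # hypothetical
]
hypotheses = [
    ['The', 'cat', 'the', 'cat', 'on', 'the', 'mat'],
    ['He', 'was', 'interested', 'in', 'world', 'history'],    # hypothetical
]

print(corpus_bleu(list_of_references, hypotheses))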

4. Summary

BLEU was first used to evaluate machine translation results. It mainly measures the degree of n-gram overlap and introduces the BP penalty coefficient. Its advantages are fast computation, a simple definition, and results with genuine reference value; its disadvantage is that it only considers surface combinations of words and ignores grammar and near-synonymous expressions.

Talking about n-grams naturally brings to mind word2vec, the originator of embeddings. In essence, BLEU computes co-occurrence frequencies, with weighting and length adjustments for long and short sentences. In the LLM field, therefore, we can quickly compute hard metrics for generation quality via the NLTK API on the one hand, and on the other hand adapt n-gram-based metrics to our own business characteristics.
