Evaluation Metrics (4)

BLEU

BLEU (BiLingual Evaluation Understudy) was first used in machine translation tasks to evaluate the quality of machine-translated sentences. Specifically, BLEU is calculated by measuring the overlap between the generated sequence and the reference sequences. Below we use machine translation as an example to discuss this metric.

Suppose there is currently a source text $s$, with corresponding reference translations $r_1, r_2, \ldots, r_n$. The machine translation model generates a sequence $x$ from the source text $s$, and $W$ is the set of N-grams constructed from the candidate sequence $x$. The precision of these N-grams is:

$$P_N(x)=\frac{\sum_{w \in W} \min\big(c_w(x),\ \max^n_{k=1} c_w(r_k)\big)}{\sum_{w \in W} c_w(x)}$$

where $c_w(x)$ is the number of times the N-gram $w$ occurs in the generated sequence $x$, and $c_w(r_k)$ is the number of times $w$ occurs in the reference sequence $r_k$. The N-gram precision $P_N(x)$ is thus the proportion of N-grams in the generated sequence that also appear in the reference sequences, with each count clipped by its maximum count in any single reference.
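The clipped N-gram precision above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function names (`ngrams`, `precision_n`) are chosen here for clarity and are not from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, with multiplicity."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precision_n(candidate, references, n):
    """Clipped n-gram precision P_N(x) of a candidate against references."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    clipped = 0
    for w, count in cand.items():
        # clip each n-gram count by its maximum count in any single reference
        max_ref = max(ngrams(r, n)[w] for r in references)
        clipped += min(count, max_ref)
    return clipped / sum(cand.values())

cand = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split()]
p1 = precision_n(cand, refs, 1)  # 5 of 6 unigrams match the reference
```

Here "sat" never appears in the reference, so only 5 of the candidate's 6 unigrams count, giving $P_1(x) = 5/6$.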

As can be seen from the above formula, the core idea of $P_N(x)$ is to measure whether the N-grams of the generated sequence $x$ appear in the reference sequences. The result favors short generated sequences: the shorter $x$ is, the higher the precision $P_N(x)$ tends to be. To counter this, a length penalty factor is introduced that penalizes a generated sequence $x$ shorter than the reference sequences:

$$b(x)=\begin{cases} 1 & \text{if } l_x > l_r \\ \exp(1-l_r/l_x) & \text{if } l_x \le l_r \end{cases}$$

where $l_x$ is the length of the generated sequence $x$, and $l_r$ is the minimum length among the reference sequences.
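The brevity penalty can be sketched directly from the piecewise definition above (again an illustrative helper, assuming $l_r$ is the shortest reference length as in the text):

```python
import math

def brevity_penalty(cand_len, ref_lens):
    """Length penalty b(x): 1 if the candidate is longer than the shortest
    reference; exp(1 - l_r / l_x) otherwise, which is < 1 for short candidates."""
    l_r = min(ref_lens)  # shortest reference length
    if cand_len > l_r:
        return 1.0
    return math.exp(1 - l_r / cand_len)

b_equal = brevity_penalty(6, [6])  # equal lengths: exp(0) = 1, no penalty
b_short = brevity_penalty(3, [6])  # half-length candidate: exp(-1), penalized
```

A candidate matching the reference length gets no penalty, while a candidate half as long is scaled down by $\exp(-1) \approx 0.37$.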

A concept has been mentioned repeatedly above: N-grams. From the generated sequence $x$ we can construct N-grams of different lengths, which gives precisions for each length, such as $P_1(x)$, $P_2(x)$, $P_3(x)$, and so on. The BLEU algorithm calculates the precision $P_N(x)$ for N-grams of different lengths, $N = 1, 2, 3, \ldots$, and combines them with a geometrically weighted average, as shown below:

$$\operatorname{BLEU-N}(x)=b(x) \times \exp\Big(\sum^{N'}_{N=1} \alpha_N \log P_N\Big)$$

where $N'$ is the maximum N-gram length, and $\alpha_N$ is the weight of each N-gram precision, generally set to $\frac{1}{N'}$. The value range of the BLEU score is $[0,1]$; the larger the value, the better the generation quality.

The BLEU algorithm can measure whether the words in the generated sequence $x$ appear in the reference sequences, but it does not consider whether the words in the reference sequences appear in the generated sequence. That is, BLEU only cares about the precision of the generated sequence, not its recall.

ROUGE

Since the BLEU algorithm only cares about whether the words in the generated sequence appear in the reference sequences, and not about whether the words in the reference sequences appear in the generated sequence, it cannot by itself fully assess the quality of generated sequences.

The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) algorithm addresses this. It measures whether the words in the reference sequences appear in the generated sequence, that is, the recall of the generated sequence. Again taking machine translation as an example, let's discuss the calculation of ROUGE.

Assume that there is currently a source text $s$, with corresponding reference translations $r_1, r_2, \ldots, r_n$. The machine translation model generates a sequence $x$ from the source text $s$, and $W$ is the set of N-grams constructed from the candidate sequence $x$. ROUGE-N is then calculated as:

$$\operatorname{ROUGE-N}(x)=\frac{\sum_{k=1}^n \sum_{w \in W} \min\big(c_w(x), c_w(r_k)\big)}{\sum^n_{k=1}\sum_{w \in W} c_w(r_k)}$$

where $c_w(x)$ is the number of times the N-gram $w$ occurs in the generated sequence $x$, and $c_w(r_k)$ is the number of times $w$ occurs in the reference sequence $r_k$.
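The ROUGE-N formula above normalizes matched counts by the total reference N-gram count, which makes it recall-oriented. A minimal sketch, following the formula as given (the `rouge_n` name is illustrative, not a library API):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n):
    """ROUGE-N: matched reference n-grams over total reference n-grams,
    summed across all references."""
    cand = ngrams(candidate, n)
    matched, total = 0, 0
    for r in references:
        ref = ngrams(r, n)
        # count each reference n-gram, clipped by its count in the candidate
        matched += sum(min(cand[w], c) for w, c in ref.items())
        total += sum(ref.values())
    return matched / total if total else 0.0

cand = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split()]
r1 = rouge_n(cand, refs, 1)  # 5 of the reference's 6 unigrams are recalled
```

Here the reference word "is" is missing from the candidate, so ROUGE-1 is $5/6$; the denominator counts reference N-grams rather than candidate N-grams, mirroring how BLEU's denominator counts candidate N-grams.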

As can be seen from the formula, the ROUGE algorithm measures whether the words in the reference sequences appear in the generated sequence, but it does not consider whether the words in the generated sequence appear in the reference sequences. That is, the ROUGE algorithm only cares about the recall of the generated sequence, not its precision.


Origin: blog.csdn.net/weixin_49346755/article/details/127344594