[Recommendation] nDCG, an evaluation metric for ranking models

Introduction

  nDCG (Normalized Discounted Cumulative Gain) is a metric for evaluating ranking models. It takes two aspects into account: whether the items are ranked in the correct order, and how relevant each item is.

  We build up to nDCG in this order: Gain, CG, DCG, IDCG, nDCG.

  Suppose we have the following labels and their relevance in two kinds of data set:

Label              A   B   C   D
Rating data set    3   2   1   0
Click data set     1   0   1   0

  In the rating data set, A has the highest score (3) and D the lowest (0), so the true order is A, B, C, D.
  In the click data set, A and C have a score of 1 (clicked) while B and D have 0 (not clicked), so the relevant labels are A and C, and the order between them does not matter.

Gain

  Gain is the relevance score of the item at the i-th position, written rel(i). What exactly this score is depends on what the data set stores for that item.

$$ Gain = rel(i) $$

  If the recommender uses explicit feedback, that is, a rating data set (for example, ratings of 1-5), then the rating itself is the score used in the calculation. If it uses implicit feedback, that is, a click data set, then the score is 0 or 1: 1 means the user clicked, and 0 means no click.

  In the example above, using the rating data set, the gain of label A is 3, the gain of B is 2, and so on.
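
  As a small sketch of the lookup described above, the snippet below stores the two example data sets as dictionaries and reads rel(i) for a given ranking. The names `rating_labels`, `click_labels` and `gain` are made up for illustration and are not from any library.

```python
# Hypothetical relevance tables for the four example labels.
rating_labels = {"A": 3, "B": 2, "C": 1, "D": 0}  # explicit feedback (ratings)
click_labels = {"A": 1, "B": 0, "C": 1, "D": 0}   # implicit feedback (clicks)


def gain(ranking, labels, i):
    """Return rel(i): the relevance of the item at 1-based position i."""
    return labels[ranking[i - 1]]


ranking = ["A", "B", "C", "D"]
print(gain(ranking, rating_labels, 1))  # 3
print(gain(ranking, click_labels, 2))   # 0
```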

Cumulative Gain CG

  Cumulative gain (CG) is the accumulated gain of the first k positions. CG can only be computed once the k of the top-k list is fixed. Otherwise the values are not comparable: if user A has 100 labels and user B has only 10, a CG summed over all of them is meaningless.

$$ CG@k = \sum_{i=1}^{k} rel(i) $$

  In the example above, using the rating data set, CG@2 = 5 whether the predicted order is [A,B,C,D] or [B,A,C,D]. The order within the top k therefore does not affect the CG score. To evaluate the effect of different orders we need another metric, DCG.
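
  A minimal sketch of CG@k, assuming the ranking has already been converted to a list of relevance scores in predicted order (the function name `cg_at_k` is illustrative):

```python
def cg_at_k(relevances, k):
    """Sum the relevance scores of the first k positions."""
    return sum(relevances[:k])


# Rating data set: [A, B, C, D] -> [3, 2, 1, 0], [B, A, C, D] -> [2, 3, 1, 0]
print(cg_at_k([3, 2, 1, 0], 2))  # 5
print(cg_at_k([2, 3, 1, 0], 2))  # 5, swapping A and B changes nothing
```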

Discounted Cumulative Gain DCG

  CG simply accumulates relevance without considering position. DCG takes the ranking order into account: items ranked higher contribute their full gain, while items ranked lower are discounted. In other words, CG is independent of order, whereas DCG measures the effect of order. The idea behind DCG is that the order of items in the list matters and different positions contribute differently. Generally, items near the top have a larger impact and items further down have a smaller one.

$$ DCG@k = \sum_{i=1}^{k} \frac{rel(i)}{\log_2(i+1)} $$

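  In the example above, using the rating data set and the base-2 logarithm from the formula: if the predicted order is [A,B,C,D], DCG@2 = 3/log2(2) + 2/log2(3) ≈ 4.262; if the predicted order is [B,A,C,D], DCG@2 = 2/log2(2) + 3/log2(3) ≈ 3.893. (With natural logarithms the same sums are 6.149 and 5.616; the base only rescales every value by a constant.) The closer the predicted order is to the true label order, the larger the discounted cumulative gain.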
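
  The DCG@k formula can be sketched as follows. The function name `dcg_at_k` and the optional `log_fn` parameter are assumptions made for this illustration; the default uses the base-2 logarithm from the formula, and passing `math.log` reproduces the natural-log figures 6.149 and 5.616.

```python
import math


def dcg_at_k(relevances, k, log_fn=math.log2):
    """Discounted cumulative gain over the first k positions (1-based)."""
    return sum(
        rel / log_fn(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


print(round(dcg_at_k([3, 2, 1, 0], 2), 3))            # 4.262  ([A,B,C,D])
print(round(dcg_at_k([2, 3, 1, 0], 2), 3))            # 3.893  ([B,A,C,D])
print(round(dcg_at_k([3, 2, 1, 0], 2, math.log), 3))  # 6.149  (natural log)
print(round(dcg_at_k([2, 3, 1, 0], 2, math.log), 3))  # 5.616  (natural log)
```

  Whichever base is used, the correctly ordered list scores higher than the swapped one.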

Ideal Discounted Cumulative Gain IDCG

  IDCG is the DCG of the ideal case, that is, the maximum value DCG can take. It is obtained when the items are ranked in the true order given by the data set (descending relevance). The formula is:

$$ IDCG@k = \sum_{i=1}^{|REL|} \frac{rel(i)}{\log_2(i+1)} $$

  IDCG is computed the same way as DCG. The only difference is that DCG uses the label order predicted by the model, while IDCG uses the true label order in the data set. Here |REL| denotes the length of the relevance-sorted list truncated to the first k items. As a result, DCG can never exceed IDCG.
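
  One way to sketch IDCG@k, assuming the ideal order is obtained by sorting the relevance scores in descending order (the names `dcg_at_k` and `idcg_at_k` are illustrative):

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (1-based)."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def idcg_at_k(relevances, k):
    """DCG@k of the same items re-sorted by descending relevance."""
    ideal = sorted(relevances, reverse=True)
    return dcg_at_k(ideal, k)


predicted = [2, 3, 0, 1]  # relevance of B, A, D, C in the rating data set
print(round(idcg_at_k(predicted, 3), 3))                    # 4.762
print(dcg_at_k(predicted, 3) <= idcg_at_k(predicted, 3))    # True
```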

Normalized Discounted Cumulative Gain nDCG

  nDCG divides DCG by IDCG, which normalizes the score to the range [0, 1].

$$ nDCG@k = \frac{DCG@k}{IDCG@k} $$
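
  The normalization can be sketched as below. The function names are illustrative, and returning 0.0 when IDCG is zero (no relevant item at all) is a convention assumed here, not something the formula itself specifies.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (1-based)."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideally sorted list."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if idcg == 0:
        return 0.0  # assumed convention: nothing relevant to rank
    return dcg_at_k(relevances, k) / idcg


print(ndcg_at_k([3, 2, 1, 0], 3))            # 1.0, perfect order
print(round(ndcg_at_k([0, 1, 0, 1], 2), 3))  # 0.387, click data ranked B, A, D, C
print(ndcg_at_k([0, 0, 0, 0], 3))            # 0.0
```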

Calculation example

  Suppose the model's predicted scores for the labels A-B-C-D are [0.111, 0.222, 0.001, 0.10].

  Sorting the scores in descending order gives [0.222, 0.111, 0.10, 0.001].

  The label order predicted by the model is therefore B-A-D-C.

  The relevance scores of the predicted order B-A-D-C in the rating data set are [2, 3, 0, 1].

  nDCG@3 = [2/log2(2) + 3/log2(3) + 0/log2(4)] / [3/log2(2) + 2/log2(3) + 1/log2(4)] ≈ 0.8175
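
  The whole calculation can be reproduced with a short script. This is only a sketch under the assumptions of the example: the variable and function names are made up, and the relevance values come from the rating data set.

```python
import math

true_rel = {"A": 3, "B": 2, "C": 1, "D": 0}              # rating data set
scores = {"A": 0.111, "B": 0.222, "C": 0.001, "D": 0.10}  # model predictions


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (1-based)."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


# Rank labels by predicted score, highest first.
ranked = sorted(scores, key=lambda label: scores[label], reverse=True)
predicted_rel = [true_rel[label] for label in ranked]
ideal_rel = sorted(true_rel.values(), reverse=True)

ndcg_at_3 = dcg_at_k(predicted_rel, 3) / dcg_at_k(ideal_rel, 3)

print(ranked)               # ['B', 'A', 'D', 'C']
print(predicted_rel)        # [2, 3, 0, 1]
print(round(ndcg_at_3, 4))  # 0.8175
```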

Origin blog.csdn.net/qq_43592352/article/details/131936768