[Recommendation system] Common evaluation metrics NDCG, HR, Recall, MRR explained

1. Preparations

This article discusses the metrics Accuracy, Recall, Precision, HR, F1 score, MAP, MRR, and NDCG.

  • A confusion matrix is required:

                          Predicted Positive      Predicted Negative
        Actual Positive   TP (True Positive)      FN (False Negative)
        Actual Negative   FP (False Positive)     TN (True Negative)

    • A simple way to remember it:
      • Positive / Negative describe the prediction: Positive means the model predicts positive, Negative means it predicts negative;
      • True / False describe correctness: True means the prediction matches the actual value, False means it does not.
  • We take @5 (a top-5 recommendation list) as the example when calculating these metrics.

  • Given the items the user will actually interact with (ground truth) and the predicted top-5 list, the calculations are performed below.

    • Ground truth: [A, B, C, D, E]
    • Predicted: [A, C, B, E, F]

2. Calculate these metrics (@5)

Assume the total sample set contains 6 items (A, B, C, D, E, F); then:

  • TP = 4 (A, B, C, E)
  • TN = 0
  • FP = 1 (F)
  • FN = 1 (D)
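
As a sanity check, these counts can be reproduced in a few lines of Python (a minimal sketch using the example lists above):

```python
# Item universe and the @5 example from above
universe = {"A", "B", "C", "D", "E", "F"}
true_items = {"A", "B", "C", "D", "E"}   # items the user actually interacts with
predicted = {"A", "C", "B", "E", "F"}    # top-5 recommendation list

tp = len(true_items & predicted)               # 4 (A, B, C, E)
fp = len(predicted - true_items)               # 1 (F)
fn = len(true_items - predicted)               # 1 (D)
tn = len(universe - (true_items | predicted))  # 0
print(tp, fp, fn, tn)  # 4 1 1 0
```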

2.1 Accuracy

  • Meaning
    • The proportion of all samples that are predicted correctly.
    • With imbalanced classes, accuracy is not a good measure of model quality.
  • Formula
    • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Calculation
    • Accuracy = (4 + 0) / 6 ≈ 0.67
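
Plugging the counts into the formula, as a quick check of the arithmetic above:

```python
tp, tn, fp, fn = 4, 0, 1, 1  # counts from the example
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 2))  # 0.67
```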

2.2 Recall

  • Meaning
    • The proportion of actual positive samples that are predicted as positive; here, the fraction of the user's true interactions that appear in the prediction list.
    • It focuses on the items the user is interested in (TP + FN is exactly the set of items the user actually interacts with).
  • Formula
    • Recall = TP / (TP + FN)
  • Calculation
    • Recall = 4 / (4 + 1) = 0.8
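
The same number computed directly from the two example lists (a minimal sketch of Recall@5):

```python
true_items = {"A", "B", "C", "D", "E"}
predicted = ["A", "C", "B", "E", "F"]  # top-5 list

hits = len(true_items & set(predicted))
recall_at_5 = hits / len(true_items)   # TP / (TP + FN)
print(recall_at_5)  # 0.8
```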

2.3 Precision

  • Meaning
    • Among all samples predicted as positive, the proportion that are actually positive.
    • It focuses on the recommended items (TP + FP is exactly the set of items being recommended).
  • Formula
    • Precision = TP / (TP + FP)
  • Calculation
    • Precision = 4 / (4 + 1) = 0.8
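
And Precision@5 from the same lists, dividing by the list length instead of the ground-truth size:

```python
true_items = {"A", "B", "C", "D", "E"}
predicted = ["A", "C", "B", "E", "F"]

hits = len(true_items & set(predicted))
precision_at_5 = hits / len(predicted)  # TP / (TP + FP)
print(precision_at_5)  # 0.8
```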

2.4 F1 score (harmonic mean of precision and recall)

Recall and precision are a pair of conflicting metrics: when recall is high, precision is generally low, and when precision is high, recall is generally low.

  • Hence the F1 score: the harmonic mean of precision and recall, which balances the two.

  • Formula

    • F1 = 2 · Precision · Recall / (Precision + Recall)

  • Calculation

    • F1 = 2 × (0.8 × 0.8) / (0.8 + 0.8) = 0.8
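
In code, the harmonic mean is a one-liner (with a guard for the degenerate all-zero case):

```python
def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 if both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.8))  # 0.8
```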

2.5 HR (Hit Ratio)

  • Meaning
    • The fraction of ground-truth items that appear in the recommendation list, i.e. whether the items the user wants were actually recommended; it emphasizes the "accuracy" of the prediction.
    • It is typically used in the setting where each user interacts with exactly one item in the next step.
  • Formula
    • HR = (1 / N) · Σ_{i=1}^{N} hits(i)
    • N is the total number of users.
    • hits(i) indicates whether the item the i-th user will access is in that user's recommendation list: 1 if it is, 0 otherwise.
  • Calculation
    • HR = 4 / 5 = 0.8 (5 items were actually visited, 4 of them appear in the prediction list)
    • For A alone, hits = 1
    • For D alone, hits = 0
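
A sketch of HR over several users; the user IDs, lists, and ground-truth items below are made up for illustration (each user has exactly one true next item):

```python
# Hypothetical data: top-3 recommendations per user, one true next item each
recs = {"u1": ["A", "C", "B"], "u2": ["D", "E", "F"], "u3": ["B", "A", "C"]}
truth = {"u1": "B", "u2": "G", "u3": "C"}

hits = sum(1 for u, items in recs.items() if truth[u] in items)
hr = hits / len(recs)
print(round(hr, 2))  # 0.67 -- u1 and u3 hit, u2 misses
```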

2.6 MRR (Mean Reciprocal Rank)

  • Meaning
    • The mean of the reciprocal rank of the ground-truth item across users' result lists; it measures whether the item to recommend is placed in a prominent position, emphasizing "order".
    • It likewise targets the case where the user interacts with exactly one item at the next step.
  • Formula
    • MRR = (1 / N) · Σ_{i=1}^{N} 1 / p_i
    • N is the total number of users.
    • p_i is the position of the i-th user's true item in that user's recommendation list; if the item is not in the list, p_i → ∞ (the term contributes 0).
  • Calculation (against the predicted list [A, C, B, E, F])
    • For A: reciprocal rank = 1 / 1 = 1.0
    • For B: reciprocal rank = 1 / 3 ≈ 0.33
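
MRR over the same hypothetical users as in the HR sketch; the only change is that each hit is weighted by the reciprocal of its position:

```python
recs = {"u1": ["A", "C", "B"], "u2": ["D", "E", "F"], "u3": ["B", "A", "C"]}
truth = {"u1": "B", "u2": "G", "u3": "C"}

def reciprocal_rank(items, target):
    # 1 / (1-indexed position of target); 0 if absent (p_i -> infinity)
    return 1.0 / (items.index(target) + 1) if target in items else 0.0

mrr = sum(reciprocal_rank(items, truth[u]) for u, items in recs.items()) / len(recs)
print(round(mrr, 2))  # (1/3 + 0 + 1/3) / 3 ≈ 0.22
```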

2.7 NDCG (Normalized Discounted Cumulative Gain)

A recommendation system usually returns a ranked item list for each user. If the list length is K, NDCG@K evaluates the gap between the ranked list and the user's true interaction list.

2.7.1 CG (Cumulative Gain)

Consider a list of length K, where rel_i denotes the relevance of the item at position i (in recommendation systems, rel_i is typically 0 or 1):

CG@K = Σ_{i=1}^{K} rel_i

  • The problem with this metric is that it ignores position: a list with all relevant items clustered at the tail scores the same as one with them at the head, which is not appropriate.
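
CG in code; note that shuffling the list leaves the score unchanged, which is exactly the problem:

```python
rels = [1, 0, 1, 1, 0]  # binary relevance at positions 1..5
cg = sum(rels)
print(cg)  # 3 -- the same for any permutation of the list
```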

2.7.2 DCG (Discounted cumulative gain)

DCG's idea: if a relevant result is ranked lower in the list, the list's score should be penalized, with the penalty tied to the rank of that result. A position-based discount factor is therefore added; two common forms are:

DCG@K = Σ_{i=1}^{K} rel_i / log2(i + 1)

DCG@K = Σ_{i=1}^{K} (2^{rel_i} - 1) / log2(i + 1)

  • The latter formula is widely used in industry. When relevance is binary, i.e. rel_i ∈ {0, 1}, the two are equivalent.
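
Both variants in Python (positions are 1-indexed, so position i contributes a discount of log2(i + 1)); with binary relevance they print the same value:

```python
import math

rels = [1, 0, 1, 1, 0]  # binary relevance at positions 1..5

dcg_linear = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
dcg_exp = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
print(round(dcg_linear, 3), round(dcg_exp, 3))  # 1.931 1.931
```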

2.7.3 NDCG (Normalized Discounted Cumulative Gain)

DCG does not account for the length of the recommendation list or the number of truly relevant results (the test item list) in each query, so NDCG, the normalized DCG, is finally introduced:

NDCG@K = DCG@K / IDCG@K

  • Here IDCG is the ideal DCG: the DCG obtained under a perfect ranking, i.e. with the relevant items sorted to the front:

IDCG@K = Σ_{i=1}^{|REL|} (2^{rel_i} - 1) / log2(i + 1)

where |REL| is the number of relevant items (up to K), taken in descending order of relevance.
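
NDCG then divides the list's DCG by the DCG of the ideally re-sorted list:

```python
import math

def dcg(rels):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

rels = [1, 0, 1, 1, 0]
idcg = dcg(sorted(rels, reverse=True))  # ideal ranking: relevant items first
ndcg = dcg(rels) / idcg
print(round(ndcg, 3))  # 1.931 / 2.131 ≈ 0.906
```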

2.7.4 When targeting a single item that the user will access next, NDCG is defined as follows

  • Meaning
    • The position-discounted gain of the ground-truth item, normalized and averaged over users; it rewards placing the true item near the top of the list.
  • Formula
    • NDCG = (1 / N) · Σ_{i=1}^{N} 1 / log2(p_i + 1)
    • N is the total number of users.
    • p_i is the position of the i-th user's true item in that user's recommendation list; if the item is not in the list, p_i → ∞ (the term contributes 0).
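
For this single-next-item setting the metric reduces to one discounted term per user (same hypothetical users as in the HR/MRR sketches):

```python
import math

recs = {"u1": ["A", "C", "B"], "u2": ["D", "E", "F"], "u3": ["B", "A", "C"]}
truth = {"u1": "B", "u2": "G", "u3": "C"}

def ndcg_single(items, target):
    # 1 / log2(p + 1) for 1-indexed position p; 0 if absent (p_i -> infinity)
    return 1.0 / math.log2(items.index(target) + 2) if target in items else 0.0

ndcg = sum(ndcg_single(items, truth[u]) for u, items in recs.items()) / len(recs)
print(round(ndcg, 3))  # (0.5 + 0 + 0.5) / 3 ≈ 0.333
```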

2.8 MAP (Mean Average Precision)

  • Average Precision (AP): the precision averaged over the positions of the relevant items; a good ranking keeps precision high as recall grows from 0 to 1, so a larger AP is better.
  • Meaning
    • MAP is the mean of AP over multiple users (or queries).
  • Formula
    • AP = (1 / R) · Σ_{k=1}^{K} P(k) · rel(k), where R is the number of relevant items, P(k) is the precision at cutoff k, and rel(k) is 1 if the item at position k is relevant and 0 otherwise.
    • MAP = (1 / N) · Σ_{i=1}^{N} AP_i
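
AP for the running @5 example (MAP would then average this value over all users):

```python
true_items = {"A", "B", "C", "D", "E"}
predicted = ["A", "C", "B", "E", "F"]

hits, precisions = 0, []
for k, item in enumerate(predicted, start=1):
    if item in true_items:
        hits += 1
        precisions.append(hits / k)  # precision@k at each relevant position

ap = sum(precisions) / len(true_items)  # divide by R, the number of relevant items
print(ap)  # (1/1 + 2/2 + 3/3 + 4/4) / 5 = 0.8
```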

