Spark - AUC, Accuracy, Precision, Recall, F1-Score theory and practice

I. Introduction

In recommendation scenarios, the metrics above are needed to evaluate the performance of offline and online models. Below is a brief description of each metric; all of them can be computed with a Spark program.

II. Metric definitions

1. TP, TN, FP, FN

The most common case in search, advertising, and recommendation is binary CTR classification. The true label real and the predicted label pre each take one of two values, 0 or 1, so the 2x2 combination yields four possibilities:

- TP (true positive): a positive sample predicted as positive, i.e., 1 predicted as 1; in the figure, both the observation and the prediction are Spring

- FP (false positive): a negative sample predicted as positive, i.e., 0 predicted as 1; in the figure, the observation is NoSpring but the prediction is Spring

- FN (false negative): a positive sample predicted as negative, i.e., 1 predicted as 0; in the figure, the observation is Spring but the prediction is NoSpring

- TN (true negative): a negative sample predicted as negative, i.e., 0 predicted as 0; in the figure, both the observation and the prediction are NoSpring

Organized together, these form the confusion matrix shown in the figure below:

2. Accuracy, Precision, Recall, F1-Score

Based on the TP, TN, FP, and FN counts above, the metrics are defined as follows:

- Accuracy

The proportion of all samples, whether 0 or 1, that are predicted correctly

Accuracy=\frac{TP + TN }{TP + FP + FN + TN}

- Precision

The proportion of samples predicted as 1 that are actually 1; this metric only looks at samples predicted as 1

Precision=\frac{TP}{TP + FP}

- Recall

The proportion of samples that are actually 1 that are predicted as 1; this metric only looks at samples that are actually 1

Recall=\frac{TP}{TP+FN}

- F1-Score

A balanced score, defined as the harmonic mean of Precision and Recall

F_1Score=2\cdot \frac{Precision\cdot Recall}{Precision+Recall}

- Fβ-Score

A balanced score; a weighted harmonic mean that is more flexible than F1

F_{\beta}Score=(1+\beta^2)\cdot \frac{Precision\cdot Recall}{(\beta^2\cdot Precision)+Recall}

By adjusting the β parameter, the metric can lean toward different aspects. When β > 1, e.g. F2-Score, Recall carries more weight than Precision; conversely, when β < 1, e.g. F0.5-Score, Precision carries more weight than Recall. These conclusions are easy to verify by expanding the numerator and denominator, as in the example below.
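As a quick illustration with made-up values, take Precision = 0.8 and Recall = 0.4:

F_1=2\cdot\frac{0.8\cdot0.4}{0.8+0.4}\approx0.533,\quad F_2=5\cdot\frac{0.8\cdot0.4}{4\cdot0.8+0.4}\approx0.444,\quad F_{0.5}=1.25\cdot\frac{0.8\cdot0.4}{0.25\cdot0.8+0.4}\approx0.667

F2 is pulled toward the lower Recall, while F0.5 is pulled toward the higher Precision.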

3. AUC

AUC (Area Under Curve) is defined as the area under the ROC curve. Equivalently, for a given batch of positive and negative samples, it is the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative sample.

3.1 Basic calculation

Assume this batch contains M positive samples and N negative samples, and the classifier is T_k. The AUC formula is as follows:

AUC=\frac{\sum I(P_{positive},P_{negative})}{M\cdot N}

where P is calculated according to the learner Tk:

P = T_k(x)

I is an indicator function:

I(P_{positive}, P_{negative})=\left\{\begin{matrix} 1, P_{positive} > P_{negative} \\ 0.5, P_{positive}=P_{negative}\\ 0, P_{positive}<P_{negative} \end{matrix}\right.
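As a small illustration of this definition (a sketch, not part of the original text), the pairwise computation can be written directly in Scala; samples is a hypothetical sequence of (label, predictedScore) pairs with label 1 for positive and 0 for negative:

    // O(M·N) AUC straight from the definition: average the indicator over all positive/negative pairs
    def bruteForceAuc(samples: Seq[(Int, Double)]): Double = {
      val positives = samples.filter(_._1 == 1).map(_._2) // M positive scores
      val negatives = samples.filter(_._1 == 0).map(_._2) // N negative scores
      val pairSum = (for (p <- positives; n <- negatives) yield {
        if (p > n) 1.0 else if (p == n) 0.5 else 0.0      // the indicator I(P_positive, P_negative)
      }).sum
      pairSum / (positives.size * negatives.size)
    }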

3.2 Fast calculation

The above is the basic AUC formula. Since it requires comparing the predicted scores of every positive-negative pair, and in real scenarios M and N are both very large, it runs slowly. The optimized formula is as follows:

AUC=\frac{\sum_{i\in Positive} rank_i - \frac{M(M+1)}{2}}{M \cdot N}

Or, rearranged into an equivalent form:

AUC=\frac{\sum_{i\in Positive} rank_i}{M \cdot N}-\frac{(M+1)}{2N}

This formula is also easy to understand. Since AUC cares about the order of positive and negative samples rather than their scores, we sort all predicted samples. Each positive sample's rank reflects how many samples score below it, so summing over all positive samples gives:

\sum_{i\in Positive} rank_i

However, the sorted list contains the other positive samples as well as the negatives: below the highest-ranked positive sample there are M-1 other positive samples, below the second highest there are M-2, and so on, which in total contributes:

0+1+...+(M-1) = \frac{(M-1)\cdot M}{2}

So we subtract (M-1)·M/2 from Σ rank. As for why some formulas use M·(M+1)/2 while others use (M-1)·M/2, it simply depends on whether the ranks start from 0 or from 1: if they start from 1, the arithmetic series sums to M·(M+1)/2, otherwise to (M-1)·M/2.
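To see that the two conventions agree, note that every rank starting from 1 is the corresponding rank starting from 0 plus one, so the two numerators differ by exactly M on both sides:

\sum_{i\in Positive}\left(rank_i^{(0)}+1\right)-\frac{M(M+1)}{2}=\sum_{i\in Positive}rank_i^{(0)}+M-\frac{M(M+1)}{2}=\sum_{i\in Positive}rank_i^{(0)}-\frac{(M-1)\cdot M}{2}

where rank^{(0)} denotes the rank starting from 0.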

3.3 Worked example

Consider an example with M = 3 positive samples whose predicted scores are 0.9, 0.7, and 0.55, and N = 3 negative samples, two scoring below 0.55 and one between 0.55 and 0.7. Using ranks starting from 0, i.e. the (M-1)·M/2 form:

AUC = \frac{5 + 4 + 2}{3 \cdot 3} - \frac{2}{2 \cdot 3}=\frac{11}{9}-\frac{1}{3}=\frac{8}{9}\approx 0.8889

Using ranks starting from 1, i.e. the M·(M+1)/2 form:

AUC = \frac{6+5+3}{3\cdot3} - \frac{4}{2 \cdot 3} = \frac{14}{9}-\frac{2}{3}=\frac{8}{9}\approx 0.8889

Finally, compute it with the most basic method, i.e. traversing all positive-negative pairs:

AUC = \frac{I_{0.9} \cdot 3 + I_{0.7}\cdot3+I_{0.55}\cdot2}{3 \cdot 3} = \frac{8}{9} \approx 0.8889

Here I is the indicator function defined above: I_{0.9}·3 means the positive sample scored 0.9 is compared against all three negative samples, and so on.

Tips:

One special case deserves mention: when several positive and negative samples share the same score, the rank assigned to all tied samples can be taken as the average of their sorted positions. With large data volumes and high-precision CTR scores, however, the effect of ties on the ranks can usually be ignored. A small sketch of the tie-averaged computation follows.
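A minimal local Scala sketch of the tie-averaged rank computation, using a made-up six-sample toy set (three of the scores are tied at 0.6):

    // Toy data: (label, predictedScore); label 1 = positive, 0 = negative; three scores tie at 0.6
    val samples = Seq((1, 0.9), (0, 0.6), (1, 0.6), (0, 0.3), (1, 0.6), (0, 0.1))
    // 1-based ranks after sorting by score ascending
    val ranked = samples.sortBy(_._2).zipWithIndex.map { case ((label, score), idx) => (label, score, idx + 1) }
    // for tied scores, replace each rank with the average rank of the tie group
    val avgRankByScore = ranked.groupBy(_._2).map { case (score, grp) => (score, grp.map(_._3).sum.toDouble / grp.size) }
    val sumPositiveRanks = ranked.filter(_._1 == 1).map(t => avgRankByScore(t._2)).sum
    val (m, n) = (samples.count(_._1 == 1), samples.count(_._1 == 0))
    val auc = sumPositiveRanks / (m * n) - (m + 1.0) / (2 * n) // 14/9 - 4/6 = 8/9, matching the pairwise definition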

III. Spark implementation

1. Data preprocessing

    import org.apache.spark.storage.StorageLevel

    // input records: (realLabel, preLabel, preScore)
    val rankResult = inputRdd.map { case (realLabel, preLabel, preScore) =>
      // preLabel can also be derived from the score and a threshold, e.g.:
      // val preLabel = if (preScore > 0.5) "1" else "-1"
      (realLabel, preLabel, preScore)
    }.filter(_._3 >= 0) // keep only records with a valid (non-negative) score

    rankResult.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

The original data is realLabel and preScore; preLabel can be derived from preScore and a threshold, which gives an RDD of three-element tuples. For the actual calculation, each tuple in your RDD really only needs two elements:

- realLabel: the true label

- preScore: the model's predicted score

2. Calculate TP, FP, FN, TN

    /*
      Compute the four counts:
        TP: true positives  - positive samples predicted as positive
        FP: false positives - negative samples predicted as positive
        FN: false negatives - positive samples predicted as negative
        TN: true negatives  - negative samples predicted as negative
     */
    val threshold = 0.5
    val dataTP = rankResult.filter(x => x._1 == 1 && x._3 >= threshold).cache()
    val dataFP = rankResult.filter(x => x._1 != 1 && x._3 >= threshold).cache()
    val dataFN = rankResult.filter(x => x._1 == 1 && x._3 < threshold).cache()
    val dataTN = rankResult.filter(x => x._1 != 1 && x._3 < threshold).cache()

    val TP = dataTP.count()
    val FP = dataFP.count()
    val FN = dataFN.count()
    val TN = dataTN.count()
    val total = TP + FN + FP + TN

The counts follow directly from the definitions:

- TP: positive samples predicted as positive

- FP: negative samples predicted as positive

- FN: positive samples predicted as negative

- TN: negative samples predicted as negative

Tips:

The threshold here is usually 0.5, and you can adjust it to the needs of your own scenario. Note that AUC depends only on the ordering of scores, not on a threshold, whereas TP, TN, and the other counts do depend on the threshold. Also, persist is important here; without it the RDD is recomputed for every count and the job runs much slower. A single-pass alternative is sketched below.
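As a possible optimization (a sketch, not the original code), the four counts can also be collected in one pass over rankResult with aggregate, instead of running four separate filter + count jobs:

    // one pass: accumulate (TP, FP, FN, TN) counts together
    val (tpCnt, fpCnt, fnCnt, tnCnt) = rankResult.aggregate((0L, 0L, 0L, 0L))(
      (acc, rec) => {
        val (tp, fp, fn, tn) = acc
        val (realLabel, _, preScore) = rec
        val predictedPositive = preScore >= threshold
        if (realLabel == 1 && predictedPositive) (tp + 1, fp, fn, tn)       // TP
        else if (realLabel != 1 && predictedPositive) (tp, fp + 1, fn, tn)  // FP
        else if (realLabel == 1 && !predictedPositive) (tp, fp, fn + 1, tn) // FN
        else (tp, fp, fn, tn + 1)                                           // TN
      },
      (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3, a._4 + b._4)
    )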

3. Calculate Precision, Recall, Accuracy, F1-Score

Calculate Precision, Recall, and Accuracy from TP, FP, FN, and TN, and then F1-Score. You can also plug in a custom β to obtain Fβ-Score, as sketched after the code below.

    val Precision = if ((TP + FP) > 0) (TP * 1.0) / (TP + FP) else 0.0
    val Recall = if ((TP + FN) > 0) (TP * 1.0) / (TP + FN) else 0.0
    val Accuracy = if (total > 0) (TP + TN) * 1.0 / total else 0.0
    val F1Score = if ((Precision + Recall) > 0.0) (2 * Precision * Recall) / (Precision + Recall) else 0.0
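
The β-weighted variant mentioned above can be built on top of the Precision and Recall values just computed; a minimal sketch (the helper name and β values are illustrative):

    // F-beta generalizes F1: beta > 1 leans toward Recall, beta < 1 leans toward Precision
    def fBetaScore(precision: Double, recall: Double, beta: Double): Double = {
      val b2 = beta * beta
      if (b2 * precision + recall > 0.0) (1 + b2) * precision * recall / (b2 * precision + recall) else 0.0
    }

    val F2Score = fBetaScore(Precision, Recall, 2.0)   // Recall-leaning
    val F05Score = fBetaScore(Precision, Recall, 0.5)  // Precision-leaning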

4. Calculate AUC

    // sort by predicted score (ascending); AUC only depends on this ordering
    val sorted = rankResult.sortBy(x => x._3)
    val numTotal = sorted.count() // M + N
    val numPositive = rankResult.filter(x => x._1 == 1).count() // M
    val numNegative = numTotal - numPositive // N
    // sum of 1-based ranks of the positive samples
    val sumRanks = sorted.zipWithIndex().filter(x => x._1._1 == 1).map(x => x._2 + 1).reduce(_ + _)
    val AUC = if (numNegative > 0 && numPositive > 0) {
      sumRanks * 1.0 / numPositive / numNegative - (numPositive + 1.0) / 2.0 / numNegative
    } else 0.0

We can sort directly by the predicted score, which echoes the earlier tip that AUC only depends on the ordering of positive and negative samples. After sorting, apply the formula:

AUC=\frac{\sum_{i\in Positive} rank_i}{M \cdot N}-\frac{(M+1)}{2N}

where M = numPositive, N = numNegative.
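As a cross-check (not part of the original pipeline), Spark MLlib's built-in BinaryClassificationMetrics can compute AUC from (score, label) pairs; this assumes realLabel can be converted to a 0.0/1.0 Double:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // (predicted score, true label) pairs, with the label as 0.0 or 1.0
    val scoreAndLabels = rankResult.map(x => (x._3, x._1.toDouble))
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val aucFromMLlib = metrics.areaUnderROC() // should closely agree with the hand-rolled AUC above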

IV. Summary

The advantages of AUC as an evaluation metric are:

- It handles class imbalance well and is not affected by the ratio of positive to negative samples

- Evaluation results are simple and easy to understand

- It is not affected by the classifier's threshold choice

The disadvantages of AUC are:

- It does not directly give a classification threshold for the classifier

- It is not directly applicable to multi-class problems

For different problems there are many metrics to consult besides AUC; you need to choose the evaluation metric best suited to your own scenario, or extend the existing ones.
