Classification Algorithm Performance Metrics

1. Introduction

  To understand a model's generalization ability, we need to measure it with some indicator; this is what performance measurement means. Commonly used evaluation metrics include: the confusion matrix, accuracy, precision, recall, F1-Score, the ROC curve (Receiver Operating Characteristic curve), AUC (Area Under the Curve), the PR curve, and so on.

2. Confusion Matrix

For a binary classification problem, instances are divided into two types, positive and negative.

  Crossing the predicted labels with the actual labels yields the following four situations, which together form the confusion matrix. The confusion matrix also generalizes to multi-class problems; here we take binary classification as an example, and a minimal counting sketch follows the list below.

  • True Positive (TP): predicted as a positive sample, and actually a positive sample
  • True Negative (TN): predicted as a negative sample, and actually a negative sample
  • False Positive (FP): predicted as a positive sample, but actually a negative sample
  • False Negative (FN): predicted as a negative sample, but actually a positive sample
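As a minimal sketch (with made-up labels, assuming 1 encodes the positive class and scikit-learn is available for the cross-check), the four counts can be computed directly from the predicted and true labels:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical example labels: 1 = positive, 0 = negative
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Count each of the four confusion-matrix cells
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, actually positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, actually negative
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, actually negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, actually positive

print("TP, TN, FP, FN =", tp, tn, fp, fn)
# scikit-learn arranges the same counts as [[TN, FP], [FN, TP]] for 0/1 labels
print(confusion_matrix(y_true, y_pred))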

3. Accuracy

Accuracy is the proportion of correctly predicted samples among all samples, and its formula is as follows:
$$Accuracy=\frac{TP+TN}{TP+FP+TN+FN}$$

  Accuracy has an obvious drawback: when the classes are imbalanced, especially with extremely skewed data, accuracy alone cannot objectively evaluate how good or bad an algorithm is.

  For example, suppose a test set contains 1000 samples: 999 negative examples and only 1 positive example. If the model predicts "negative" for every test sample, its accuracy is 99.9%, which looks excellent numerically, but such a model actually has no predictive ability.
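A quick illustration of this effect (made-up data): a classifier that always predicts the negative class still reaches very high accuracy on such an imbalanced test set.

import numpy as np

# Hypothetical imbalanced test set: 999 negatives, 1 positive
y_true = np.array([0] * 999 + [1])
y_pred = np.zeros_like(y_true)          # a trivial model that always predicts "negative"

accuracy = np.mean(y_pred == y_true)    # (TP + TN) / total
print("accuracy:", accuracy)            # 0.999, despite zero predictive power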

4. Precision and recall

4.1 Precision

  Precision is defined with respect to the prediction results: it is the proportion of actual positive samples among all samples that are predicted to be positive. In other words, of the samples predicted as positive, how many are predicted correctly. The formula is as follows:
$$Precision=\frac{TP}{TP+FP}$$
  Precision and accuracy are two completely different concepts. Precision measures how accurate the predictions are for the positive class only, while accuracy measures the overall prediction accuracy over both positive and negative samples.

4.2 Recall

  Recall is defined with respect to the original samples: it is the proportion of actual positive samples that are predicted to be positive. The formula is as follows:
$$Recall=\frac{TP}{TP+FN}$$
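A small sketch of both definitions, with the manual formulas shown next to scikit-learn's helpers for comparison (made-up labels; assumes scikit-learn is installed):

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = positive, 0 = negative
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

print("precision (manual): ", tp / (tp + fp))          # TP / (TP + FP)
print("recall (manual):    ", tp / (tp + fn))          # TP / (TP + FN)
print("precision (sklearn):", precision_score(y_true, y_pred))
print("recall (sklearn):   ", recall_score(y_true, y_pred))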

The diagram on the Wikipedia "Precision and recall" page is also helpful for illustrating the relationship between the two.

  Different application scenarios focus on different metrics. For example, when predicting which stocks will rise, we care more about Precision: among the stocks predicted to rise, how many actually rose, because those are the stocks we invest in. When screening patients for a disease, we care more about Recall: among the people who are actually sick, as few as possible should be missed.

  Precision and Recall are a trade-off. For example, in a recommendation system, if we want to push only content the user is truly interested in, we can push only high-confidence items, which will miss some content the user cares about, so Recall will be low. If we instead want to cover everything the user might be interested in, we have to push nearly all content ("better to flag a thousand by mistake than to let one slip"), and Precision will be very low.

4.3 PR curve

  Sort the test samples by the learner's prediction result (usually a real-valued score or probability), putting the samples most likely to be positive at the front and the least likely at the back. Then, following this order, predict the samples as positive one by one and compute the current P and R values at each step; plotting these pairs gives the PR curve.

PR curve evaluation:

  If the PR curve of learner A is completely enclosed by the PR curve of another learner B, then B performs better than A. If the curves of A and B intersect, the one with the larger area under the curve generally performs better. However, this area is often hard to estimate, so the "Break-Even Point" (BEP) is used instead: the value at which P = R. The higher the break-even point, the better the performance.
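A minimal sketch for plotting a PR curve from predicted scores, assuming scikit-learn and matplotlib are available (the labels and scores below are made up):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted positive-class scores
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

# precision_recall_curve sweeps the threshold and returns P and R at each step
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("PR curve")
plt.show()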

5. F1-Score

  Precision and Recall are usually used together to evaluate a binary classification model. However, the two often move in opposite directions: the higher the precision, the lower the recall. To take both into account, the most common approach is the F-Measure, also known as the F-Score. The F-Measure is the weighted harmonic mean of P and R, and its calculation formula is as follows:
$$\frac{1}{F_\beta}=\frac{1}{1+\beta^2}\cdot\left(\frac{1}{P}+\frac{\beta^2}{R}\right),\qquad F_\beta=\frac{(1+\beta^2)\times P\times R}{\beta^2\times P+R}$$
When $\beta=1$, this becomes the commonly used F1-Score, the harmonic mean of P and R. The F1-Score ranges from 0 to 1, where 1 represents the best possible output of the model and 0 the worst. The higher the F1, the better the model performs.
$$\frac{1}{F_1}=\frac{1}{2}\cdot\left(\frac{1}{P}+\frac{1}{R}\right),\qquad F_1=\frac{2\times P\times R}{P+R}$$
Among them, P stands for Precision and R stands for Recall.
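A short sketch of the formula, with scikit-learn's f1_score and fbeta_score as a cross-check (made-up labels; assumes scikit-learn is installed):

import numpy as np
from sklearn.metrics import f1_score, fbeta_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
p = tp / (tp + fp)
r = tp / (tp + fn)

print("F1 (manual): ", 2 * p * r / (p + r))                 # harmonic mean of P and R
print("F1 (sklearn):", f1_score(y_true, y_pred))
print("F2 (sklearn):", fbeta_score(y_true, y_pred, beta=2))  # beta > 1 weighs recall more heavily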

6. ROC curve

6.1 Introduction to ROC

The ROC curve, and the AUC discussed later, are very commonly used evaluation metrics in classification tasks and are elaborated in this article. Some readers may wonder: since there are already so many evaluation criteria, why use ROC and AUC?

  Because the ROC curve has a very useful property: when the distribution of positive and negative samples in the test set changes, the ROC curve stays essentially unchanged. Class imbalance often occurs in real data sets, i.e., there are many more negative samples than positive ones (or vice versa), and the class distribution of the test data may also drift over time. ROC and AUC largely eliminate the impact of class imbalance on the evaluation results.

  Another reason is that ROC, like the PR curve above, is an evaluation method that does not depend on a single threshold. For a model whose output is a probability, if only accuracy, precision, and recall are used to compare models, the comparison must be made at a fixed threshold; different thresholds give different metric values for each model, so it is hard to reach a confident conclusion.

Before formally introducing ROC, we introduce two more indicators whose choice is what allows ROC to ignore class imbalance: sensitivity and specificity, i.e., the true positive rate (TPR) and the true negative rate (TNR); the ROC curve itself is drawn using TPR and the false positive rate (FPR). The specific formulas are as follows.

  • True Positive Rate (TPR), also known as Sensitivity:
    $$TPR=\frac{\text{number of correctly predicted positive samples}}{\text{total number of positive samples}}=\frac{TP}{TP+FN}$$
    Note that sensitivity is exactly the same as recall.

  • False Negative Rate (FNR):
    $$FNR=\frac{\text{number of incorrectly predicted positive samples}}{\text{total number of positive samples}}=\frac{FN}{TP+FN}$$

  • False Positive Rate (FPR):
    $$FPR=\frac{\text{number of incorrectly predicted negative samples}}{\text{total number of negative samples}}=\frac{FP}{TN+FP}$$

  • True Negative Rate (TNR), also known as Specificity:
    $$TNR=\frac{\text{number of correctly predicted negative samples}}{\text{total number of negative samples}}=\frac{TN}{TN+FP}$$

Looking at these formulas closely: sensitivity (the true positive rate, TPR) is the recall of the positive class, specificity (the true negative rate, TNR) is the recall of the negative class, and $FNR=1-TPR$, $FPR=1-TNR$. All four quantities are defined with respect to a single class, so they are insensitive to whether the overall sample is balanced. For example, suppose 90% of all samples are positive and 10% are negative. Using accuracy alone would be unscientific here, but TPR and TNR still work: TPR only looks at how many of the 90% positive samples are predicted correctly and has nothing to do with the 10% negative samples, while FPR only looks at how many of the 10% negative samples are predicted incorrectly and has nothing to do with the 90% positive samples. This avoids the class-imbalance problem.
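A small sketch of the four rates computed from confusion-matrix counts (made-up data; 1 is the positive class):

import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))

tpr = tp / (tp + fn)   # sensitivity / recall of the positive class
tnr = tn / (tn + fp)   # specificity / recall of the negative class
fnr = fn / (tp + fn)   # = 1 - TPR
fpr = fp / (tn + fp)   # = 1 - TNR
print("TPR, TNR, FNR, FPR =", tpr, tnr, fnr, fpr)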

The ROC curve is called the Receiver Operating Characteristic curve. It is a curve on a plot (as shown in the figure below); the closer the curve is to the upper left corner, the better the model separates the two classes and the more ideal it is.


In the ROC curve, the horizontal axis is the false positive rate (FPR), defined as the proportion of all actual negative samples that are incorrectly judged as positive by the model:
$$FPR=\frac{FP}{FP+TN}$$
The vertical axis is the true positive rate (TPR), defined as the proportion of all actual positive samples that are correctly judged as positive by the model, which is exactly the recall:
$$TPR=\frac{TP}{TP+FN}$$
In the ROC curve, the point at the lower left corner (0, 0) corresponds to judging all samples as negative, and the point at the upper right corner (1, 1) corresponds to judging all samples as positive.

6.2 Advantages of ROC Curve

Threshold independence

The ROC curve is drawn by traversing all thresholds. As we sweep the threshold, the sets of samples predicted positive and negative keep changing, and the corresponding point slides along the ROC curve.


Insensitivity to class imbalance

When the distribution of positive and negative samples in the test set changes, the ROC curve can remain unchanged.


Q: How do we judge whether a model's ROC curve is good?

FPR measures how often the model misjudges negative samples, and TPR measures how well the model recalls positive samples. Naturally, we want as few misjudged negatives and as many recalled positives as possible. In short, the higher the TPR and the lower the FPR (i.e., the steeper the ROC curve toward the upper left), the better the model's performance.


When comparing models, similarly to the PR curve, if the ROC curve of model A is completely enclosed by the ROC curve of another model B, then B performs better than A. If the curves of A and B intersect, the one with the larger area under the curve performs better.

6.3 How to draw the ROC curve

  Assume we have obtained the probability of each sample being positive and have sorted the samples by this probability. The figure below gives an example with 20 test samples: the "Class" column is the true label of each sample (p for positive, n for negative), and the "Score" column is the predicted probability that the sample is positive.

  Take the "Score" values from high to low as the threshold in turn. When a test sample's probability of being positive is greater than or equal to the threshold, it is predicted as positive; otherwise it is predicted as negative. For example, for the fourth sample in the figure, the "Score" is 0.6, so samples 1, 2, 3, and 4 are predicted as positive (their scores are all greater than or equal to 0.6), while all other samples are predicted as negative. Each choice of threshold gives one pair of FPR and TPR values, i.e., one point on the ROC curve. In this way we obtain 20 pairs of FPR and TPR values, and plotting them gives the ROC curve below:

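A minimal sketch of this procedure, reusing the 20-sample example data from the table in section 7.1.3 below, and assuming scikit-learn and matplotlib are available (sklearn.metrics.roc_curve performs the same threshold sweep internally):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Labels and scores of the 20-sample example (1 = p, 0 = n)
y_true = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.5,
                    0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1])

# Sweep the score values as thresholds and collect (FPR, TPR) points
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")   # chance diagonal for reference
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC curve")
plt.show()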

7. AUC

AUC (Area Under Curve) is defined as the area under the ROC curve; obviously this area is not greater than 1. Since the ROC curve generally lies above the line y = x, the AUC value typically falls between 0.5 and 1.

AUC is an aggregate measure over all possible classification thresholds. The AUC value can be interpreted as a probability: randomly pick one positive sample and one negative sample, and AUC is the probability that the classifier scores the positive sample higher than the negative one. In short, the larger the AUC, the more likely the current classifier is to rank positive samples above negative samples, i.e., the better it separates the classes.

7.1 Calculation of AUC

7.1.1 Calculation method 1

Calculate the area under the ROC curve, which is the value of AUC.

  Since the test set is finite, the ROC curve we obtain is a step function, and the AUC is simply the sum of the areas under these steps. First sort the samples by score (assuming a larger score means a higher probability of belonging to the positive class), then scan through them to accumulate the area. One inconvenience is that when several test samples have equal scores, adjusting the threshold moves the curve diagonally rather than purely upward or to the right, so a trapezoid area has to be computed. In practice this method is somewhat tedious.
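A small sketch of this area computation (made-up data; assumes scikit-learn is available). The trapezoid areas are summed explicitly, and sklearn.metrics.auc, which applies the same trapezoidal rule, is used as a cross-check:

import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and scores
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

fpr, tpr, _ = roc_curve(y_true, y_score)

# Sum of trapezoid areas between consecutive (FPR, TPR) points
area = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

print("AUC (trapezoid sum):       ", area)
print("AUC (sklearn.metrics.auc): ", auc(fpr, tpr))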

7.1.2 Calculation method 2

  From the perspective of the Mann-Whitney U statistic:

AUC is the probability that, for a randomly chosen pair of one positive and one negative sample, the classifier ranks the positive sample ahead of the negative one according to the computed score; it reflects the ranking ability of the classifier.

Because the sample is finite, we cannot obtain this probability exactly, but we can approximate it with frequencies: enumerate all positive-negative sample pairs and count how many pairs have the positive sample scored higher than the negative sample (count 1 for each such pair, and 0.5 when the scores are equal). Denote a positive sample by $x^+$ and a negative sample by $x^-$; there are $N^+$ positive samples and $N^-$ negative samples in total, and the sets of positive and negative samples are $D^+$ and $D^-$. The formula is:
$$AUC=\frac{1}{N^+\times N^-}\sum\limits_{x^+\in D^+}\sum\limits_{x^-\in D^-}\left(\mathbb{I}\big(score(x^+)>score(x^-)\big)+\frac{1}{2}\,\mathbb{I}\big(score(x^+)=score(x^-)\big)\right)$$
Here is an example:

inst# class score
6 p 0.54
7 n 0.53
8 n 0.52
9 p 0.51

The "Class" column indicates the true label of each test sample (p indicates a positive sample, n indicates a negative sample), and "Score" indicates the probability that each test sample belongs to a positive sample.

Positive and negative sample pairs: (6,7), (6,8), (9,7), (9,8)

In (6,7), the positive sample's predicted probability of being positive is 0.54, which is greater than the negative sample's 0.53, so this pair counts 1. Similarly, (6,8) counts 1.

In (9,7), the positive sample's predicted probability of being positive is 0.51, which is less than the negative sample's 0.53, so this pair counts 0. Similarly, (9,8) counts 0. Therefore:
$$AUC=\frac{1}{2\times 2}(1+1+0+0)=0.5$$
The time complexity of this method is $O(N^2)$ (more precisely $O(N^+\times N^-)$), because every positive-negative pair requires a comparison.
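A direct sketch of this pairwise counting (the helper function is hypothetical, quadratic in the number of samples, and applied here to the 4-sample example above):

import numpy as np

def auc_pairwise(y_true, y_score):
    """Naive O(N+ * N-) AUC: compare every positive-negative score pair."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    total = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                total += 1.0      # positive ranked above negative
            elif sp == sn:
                total += 0.5      # ties count half
    return total / (len(pos) * len(neg))

# The example above: inst 6 (p, 0.54), 7 (n, 0.53), 8 (n, 0.52), 9 (p, 0.51)
y_true = np.array([1, 0, 0, 1])
y_score = np.array([0.54, 0.53, 0.52, 0.51])
print(auc_pairwise(y_true, y_score))   # 0.5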

7.1.3 Calculation method 3

This method computes the same quantity as method 2, but with lower complexity. The improved procedure is:

① Sort all samples by score from largest to smallest

② Assign a Rank value to each sample; the sample with the largest score gets Rank N

③ For samples with the same score, assign each of them the average of their Rank values (for example, if A and B both have score = 0.7 with Rank values 2 and 3, both get Rank 2.5)

The calculation formula is as follows:
$$AUC=\frac{1}{N^+\times N^-}\left(\sum\limits_{x^+\in D^+}Rank(x^+)-\frac{1}{2}N^+(N^++1)\right)$$
Understanding: after the samples have been ranked (assume no ties, for simplicity):

  The first (highest-ranked) positive sample $x^+_1$ forms $Rank(x^+_1)-1$ pairs with all samples ranked below it. Pairs with other positive samples should not be counted, so we subtract $N^+-1$ (every positive sample except itself), and this sample contributes $Rank(x^+_1)-N^+$.

  The second-highest positive sample $x^+_2$ forms $Rank(x^+_2)-1$ pairs with the samples ranked below it. We subtract the $N^+-2$ positive samples below it (it does not pair with itself, and the positive sample above it is not below it), so its contribution is $Rank(x^+_2)-(N^+-1)$.

And so on:

  The $N^+$-th (lowest-ranked) positive sample $x^+_{N^+}$ forms $Rank(x^+_{N^+})-1$ pairs with the samples ranked below it, none of which are positive, so nothing is subtracted and its contribution is $Rank(x^+_{N^+})-1$.
$$\begin{align} AUC&=\frac{1}{N^+\times N^-}\sum\limits_{i=1}^{N^+}\left(Rank(x^+_i)-(N^+-i+1)\right)\\[2ex] &=\frac{1}{N^+\times N^-}\left(\sum\limits_{x^+\in D^+}Rank(x^+)-\frac{1}{2}N^+(N^++1)\right) \end{align}$$
Here is an example:

inst# class score Rank inst# class score Rank
1 p 0.9 20 11 p 0.4 10
2 p 0.8 19 12 n 0.39 9
3 n 0.7 18 13 p 0.38 8
4 p 0.6 17 14 n 0.37 7
5 p 0.55 16 15 n 0.36 6
6 p 0.54 15 16 n 0.35 5
7 n 0.53 14 17 p 0.34 4
8 n 0.52 13 18 n 0.33 3
9 p 0.51 12 19 p 0.3 2
10 n 0.5 11 20 n 0.1 1

The "Class" column indicates the true label of each test sample (p indicates a positive sample, n indicates a negative sample), and "Score" indicates the probability that each test sample belongs to a positive sample.

AUC is calculated as follows:
$$AUC=\frac{1}{10\times 10}\left(20+19+17+16+15+12+10+8+4+2-\frac{1}{2}\big(10\times(10+1)\big)\right)=0.68$$

7.2 Code implementation

import numpy as np

# Random labels (0/1) and predicted scores for 10 samples
label_all = np.random.randint(0, 2, [10, 1]).tolist()
pred_all = np.random.random((10, 1)).tolist()

print(label_all)
print(pred_all)

# Number of positive samples
posNum = len(list(filter(lambda s: s[0] == 1, label_all)))

# The rank-based formula needs at least one positive and one negative sample
if 0 < posNum < len(label_all):
    negNum = len(label_all) - posNum
    # Sort sample indices by predicted score in ascending order
    sortedq = sorted(enumerate(pred_all), key=lambda x: x[1])

    # Sum the ranks (1-based positions in the sorted order) of the positive samples
    posRankSum = 0
    for j in range(len(pred_all)):
        if (label_all[j][0] == 1):
            posRankSum += list(map(lambda x: x[0], sortedq)).index(j) + 1
    # Rank-based AUC: (sum of positive ranks - N+(N+ + 1)/2) / (N+ * N-)
    auc = (posRankSum - posNum * (posNum + 1) / 2) / (posNum * negNum)
    print("auc:", auc)

8. Summary

  • A confusion matrix is a specific table used to visualize the performance of a classification algorithm.
  • Accuracy is the proportion of correct predictions among all samples; when the classes are imbalanced, accuracy becomes misleading.
  • Precision is the proportion of actual positive samples among all samples predicted to be positive.
  • Precision measures the prediction accuracy on the positive predictions, while accuracy measures the overall prediction accuracy over both positive and negative samples.
  • Recall is the proportion of actual positive samples that are predicted to be positive.
  • Precision and recall are a trade-off.
  • When the distribution of positive and negative samples in the test set changes, the ROC curve stays essentially unchanged; moreover, the ROC curve does not depend on a single classification threshold.
  • The physical meaning of AUC: randomly pick a positive sample and a negative sample; AUC is the probability that the classifier scores the positive sample higher than the negative one, which reflects how well the model ranks the samples.

This article is only used as a personal learning record, not for commercial purposes, thank you for your understanding.

Reference: https://zhuanlan.zhihu.com/p/37522326
