Evaluating classification models: the confusion matrix and evaluation metrics

Once a classification model has been designed, a large amount of data is needed to evaluate its performance, so understanding the evaluation metrics is very important.

The specific process for evaluating classification models:
[Figure: overall workflow for evaluating a classification model]

1. Binary classification confusion matrix (Confusion Matrix)

Strictly speaking, a binary classification problem has no named class labels, only positive examples and negative examples. Its confusion matrix looks like this:

[Figure: binary confusion matrix (TP, FN, FP, TN)]

Formulas for the evaluation metrics:

  • $Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$
  • $Precision = \frac{TP}{TP+FP}$
  • $Recall = \frac{TP}{TP+FN}$
  • $F1\text{-}Score = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = \frac{2 \times Precision \times Recall}{Precision + Recall}$
  • $Specificity = \frac{TN}{FP+TN}$

Let's use the binary cat/dog classification problem as an example to walk through the confusion matrix and its evaluation metrics:
![Cat/dog confusion matrix](https://img-blog.csdnimg.cn/8be2f9fec8d243e58689e969bbb75876.png#pic_center =500)

As shown in the figure, in the cat/dog classification task, dog is the positive class and not-dog (i.e., cat) is the negative class. The labels along the top are the predicted values and the labels along the left are the true values. The main diagonal (red) holds the correct predictions, and the anti-diagonal (green) holds the incorrect predictions.

Assume the classifier produces the following results on the cat/dog data set:

[Figure: example cat/dog confusion matrix with TP = 45, FN = 5, FP = 15, TN = 35]

where:

  • $TP + FN$ is the number of dogs in the data set
  • $FP + TN$ is the number of cats in the data set
  • $TP + TN$ is the number of samples the model classifies correctly

1. Accuracy

$$Accuracy = \frac{\text{number of correct classifications}}{\text{all samples}} = \frac{TP+TN}{TP+TN+FP+FN}$$

That is,
$$Accuracy = \frac{45 + 35}{45 + 35 + 15 + 5} = 0.8$$

2. Precision

Of the samples predicted to be dogs, how many are actually dogs?
$$Precision = \frac{TP}{\text{number predicted as dog}} = \frac{TP}{TP+FP}$$
That is,
$$Precision = \frac{45}{45 + 15} = 0.75$$

3. Recall (also called sensitivity)

Of the actual dogs in the data set, how many are detected?
$$Recall = \frac{TP}{\text{number of actual dogs}} = \frac{TP}{TP+FN}$$
That is,
$$Recall = \frac{45}{45 + 5} = 0.9$$

4. F1 Score

The F1 Score is the harmonic mean of Precision and Recall, so it reflects both at once: if only one of Precision and Recall is high, the F1 Score will still be low. (It behaves like two resistors in parallel: one large and one small resistance still give a small combined resistance.)
$$F1\text{-}Score = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
That is,
$$F1\text{-}Score = \frac{2 \times 0.75 \times 0.9}{0.75 + 0.9} = 0.82$$

5. Specificity

Of the actual cats (negative examples) in the data set, how many are correctly identified?
$$Specificity = \frac{TN}{\text{number of actual cats}} = \frac{TN}{FP+TN}$$
That is,
$$Specificity = \frac{35}{15 + 35} = 0.7$$
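
As a quick check, here is a minimal Python sketch that computes all five metrics directly from the counts used in the example above (TP = 45, FN = 5, FP = 15, TN = 35, with dog as the positive class):

```python
# Minimal sketch: binary-classification metrics computed from the
# confusion-matrix counts of the cat/dog example above.
TP, FN, FP, TN = 45, 5, 15, 35

accuracy    = (TP + TN) / (TP + TN + FP + FN)                 # 0.80
precision   = TP / (TP + FP)                                  # 0.75
recall      = TP / (TP + FN)                                  # 0.90 (sensitivity, TPR)
f1_score    = 2 * precision * recall / (precision + recall)   # ~0.82
specificity = TN / (FP + TN)                                  # 0.70

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, F1={f1_score:.2f}, specificity={specificity:.2f}")
```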

2. Multiclass confusion matrix (Multiclass Classifiers)

The multiclass confusion matrix is very similar to the binary one, except that precision, recall, and related metrics must be computed separately for each class.

For example:

[Figure: example 3×3 confusion matrix for bicycle / motorcycle / car]

  • $Accuracy = \frac{15+12+22}{15+2+3+6+12+4+22} = 0.7656$

  • Bicycle: $Precision = \frac{15}{15+6} = 0.71$, $Recall = \frac{15}{15+2+3} = 0.75$

  • Motorcycle: $Precision = \frac{12}{2+12+4} = 0.66$, $Recall = \frac{12}{12+6} = 0.66$

  • Car: $Precision = \frac{22}{22+3} = 0.88$, $Recall = \frac{22}{22+4} = 0.85$

  • Average: $Precision = \frac{0.71+0.66+0.88}{3} = 0.75$, $Recall = \frac{0.75+0.66+0.85}{3} = 0.75$

  • F1 Score: $F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall} = \frac{2 \times 0.75 \times 0.75}{0.75+0.75} = 0.75$

    The multiclass F1 score can also be computed as the average of the per-class F1 scores (macro F1); a short code sketch reproducing these per-class numbers follows below.
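
Here is a minimal sketch of the multiclass calculation, assuming the 3×3 confusion matrix implied by the numbers above (rows = true class, columns = predicted class; the two cells not mentioned in the example are assumed to be zero):

```python
import numpy as np

# Confusion matrix implied by the bicycle/motorcycle/car example above.
cm = np.array([
    [15,  2,  3],   # true bicycle
    [ 6, 12,  0],   # true motorcycle
    [ 0,  4, 22],   # true car
])
classes = ["bicycle", "motorcycle", "car"]

accuracy  = np.trace(cm) / cm.sum()          # (15+12+22)/64 ~= 0.7656
precision = np.diag(cm) / cm.sum(axis=0)     # per-class: TP / column sum
recall    = np.diag(cm) / cm.sum(axis=1)     # per-class: TP / row sum

for name, p, r in zip(classes, precision, recall):
    print(f"{name:10s} precision={p:.2f} recall={r:.2f}")

# Macro averages, and the F1 computed from them as in the example.
macro_p, macro_r = precision.mean(), recall.mean()
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
print(f"accuracy={accuracy:.4f}, macro P={macro_p:.2f}, "
      f"macro R={macro_r:.2f}, F1={macro_f1:.2f}")
```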

For multiclass problems, the confusion matrix is more often shown as a heat map, as in the figure:

[Figure: multiclass confusion matrix rendered as a heat map]

3. ROC curve (Receiver Operating Characteristic Curve)

FPR (false positive rate): $FPR = \frac{FP}{FP+TN}$, i.e. the proportion of negative samples that are classified as positive.

TPR (true positive rate): $TPR = \frac{TP}{TP+FN}$, i.e. the proportion of positive samples that are classified as positive.

1. Intuitive understanding of the ROC curve

The ROC curve originated with radar operators judging radar signals during World War II. An operator's task is to interpret the radar signal, but the signal also contains noise (for example, a large bird), so whenever a blip appears on the screen the operator must decide what it is. Some operators are more cautious (low threshold) and tend to call every signal an enemy aircraft; others are more relaxed (high threshold) and tend to call every signal a bird. Below are one operator's judgments over a single day:

[Figure: one radar operator's judgments for the day (confusion matrix)]

At this point:

  • $TPR = \frac{TP}{TP+FN} = 1$
  • $FPR = \frac{FP}{FP+TN} = 0.5$

For the system, we want TPR to be as high as possible, so that every enemy aircraft is detected, and FPR to be as low as possible, so that false alarms are rare; ideally $TPR = 1$ and $FPR = 0$. For a typical system, however, you cannot have both. If the operator's threshold is lowered, ideally every enemy aircraft is flagged, but some birds are inevitably flagged as aircraft too, so both TPR and FPR are high. Conversely, if the threshold is raised, ideally no bird is flagged as an aircraft, but some enemy aircraft are inevitably dismissed as birds (with serious consequences for our own side), so both FPR and TPR are low. As a result, the ROC curve is generally an increasing curve that lies above the line $y = x$.

2. ROC curve drawing principle


[Figure: four-panel interactive ROC demo]

This figure comes from http://www.navan.name/roc/ and can be interacted with dynamically in real time; readers can change the settings there while watching the ROC curve to deepen their understanding.

In the figure above, the blue curve represents the negative examples, the red curve represents the positive examples, and the thick black vertical line is the threshold.

The upper-left and upper-right panels take the perspective of the operator (the threshold). Here the performance of the radar (the classifier) is fixed, so the ROC curve itself is fixed; changing the threshold only moves the red point along the ROC curve.

As shown in the upper-left panel: if the threshold is set too low, all positive examples are judged positive ($TPR = 1$), but most negative examples are also judged positive ($FPR$ close to 1). The corresponding point on the ROC curve is in the upper-right corner.

As shown in the upper-right panel: if the threshold is set too high, all negative examples are judged negative ($FPR = 0$), but most positive examples are also judged negative ($TPR$ close to 0). The corresponding point on the ROC curve is in the lower-left corner.

If the threshold is placed between the bulk of the positive and negative examples, then $TPR$ is relatively high and $FPR$ is relatively low, which is the more desirable state.

The lower-left and lower-right panels take the perspective of the radar (the classifier).

As shown in the lower-left panel: if the classifier performs poorly, the positive and negative score distributions overlap heavily, and the ROC curve approaches the line $y = x$ (TPR rises only about as fast as FPR).

As shown in the lower-right panel: if the classifier performs very well, the positive and negative score distributions are widely separated and the ROC curve approaches a right angle. Ideally the two distributions are completely separated; with an appropriately chosen threshold, $TPR = 1$ and $FPR = 0$, which is the upper-left corner of the ROC plot.
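
To make the threshold-sweeping idea concrete, here is a minimal sketch that traces an ROC curve by hand; the scores are made-up illustrative values, not taken from the figures above:

```python
import numpy as np

# Made-up labels and classifier scores for illustration.
y_true  = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95])

# Sweep the threshold from high to low and record (FPR, TPR) at each step.
thresholds = np.unique(y_score)[::-1]
points = []
for t in thresholds:
    pred = (y_score >= t).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)

# Lowering the threshold moves the operating point toward the upper-right
# corner (1, 1), tracing out the ROC curve.
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```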

3. AUC (Area Under Curve)

AUC (Area Under Curve) is the area under the ROC curve. This area is clearly at most 1, and because the ROC curve generally lies above the line $y = x$, the AUC usually falls between 0.5 and 1. As a single number, AUC quantifies classifier performance more conveniently than the ROC curve itself.

The meaning of AUC: if one positive sample and one negative sample are drawn at random, AUC is the probability that the classifier assigns the positive sample a higher score than the negative sample.
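
This pairwise interpretation is easy to verify numerically. The sketch below, using made-up scores, computes the fraction of (positive, negative) pairs in which the positive sample scores higher (ties counted as 1/2) and compares it with scikit-learn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration.
y_true  = np.array([0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diffs = pos[:, None] - neg[None, :]            # score difference for every (pos, neg) pair
auc_pairwise = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size

print(auc_pairwise, roc_auc_score(y_true, y_score))   # the two values agree
```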

Criteria for judging the quality of a classifier (prediction model) from its AUC:

  • AUC = 1: a perfect classifier. There exists at least one threshold that yields perfect predictions. In practice, perfect classifiers essentially never occur.
  • 0.5 < AUC < 1: better than random guessing. With a properly chosen threshold, the model has predictive value.
  • AUC = 0.5: no better than random guessing (for example, flipping a coin); the model has no predictive value.
  • AUC < 0.5: worse than random guessing; however, if you always invert its predictions, it becomes better than random guessing (a quick check of this is sketched below).
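
A small illustration of that last point, using made-up scores in which the negatives tend to receive the higher values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up scores where the scorer ranks negatives above positives (AUC < 0.5).
y_true  = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.9, 0.8, 0.6, 0.7, 0.3, 0.2])

auc = roc_auc_score(y_true, y_score)
auc_flipped = roc_auc_score(y_true, 1.0 - y_score)   # invert the scores
print(auc, auc_flipped)                              # auc_flipped == 1 - auc
```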

4. Advantages of ROC curve

The ROC curve copes well with imbalance between positive and negative samples.

The ROC curve has a very good characteristic: when the distribution of positive and negative samples in the test set changes, the ROC curve can remain unchanged. In actual data sets, class imbalance often occurs, that is, there are many more negative samples than positive samples (or vice versa), and the distribution of positive and negative samples in the test data may also change over time.

This is because, in the ROC calculation, $TPR$ is computed only over the positive examples and $FPR$ only over the negative examples. Therefore, even if the positive/negative ratio is imbalanced or changes over time, the ROC curve does not change significantly.

By contrast, metrics such as $Accuracy$ and $Precision$ mix positive and negative examples in a single formula, so their values are strongly affected when the class ratio changes.
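
The sketch below illustrates this claim with synthetic scores: replicating the negative samples (i.e., changing the class ratio) leaves ROC-AUC unchanged, while precision at a fixed threshold drops sharply.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score

# Synthetic scores: positives tend to score higher than negatives.
rng = np.random.default_rng(0)
pos = rng.normal(0.7, 0.15, size=100).clip(0, 1)   # scores of positive samples
neg = rng.normal(0.4, 0.15, size=100).clip(0, 1)   # scores of negative samples

def evaluate(pos_scores, neg_scores, threshold=0.5):
    y_true = np.r_[np.ones(len(pos_scores), dtype=int),
                   np.zeros(len(neg_scores), dtype=int)]
    y_score = np.r_[pos_scores, neg_scores]
    y_pred = (y_score >= threshold).astype(int)
    return roc_auc_score(y_true, y_score), precision_score(y_true, y_pred)

print(evaluate(pos, neg))               # balanced classes: (AUC, precision)
print(evaluate(pos, np.tile(neg, 10)))  # 10x more negatives: AUC identical,
                                        # precision at threshold 0.5 much lower
```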

4. Classification metrics when positive and negative samples are imbalanced

1. A data set with balanced positive and negative samples

| S.NO. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True label | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Model 1 prediction | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.6 | 0.6 | 0.5 | 0.5 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 |
| Model 2 prediction | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.7 | 0.7 | 0.7 | 0.7 | 0.8 | 0.8 | 0.8 |

| | F1 (threshold = 0.5) | F1 (best threshold) | ROC-AUC | LogLoss |
|---|---|---|---|---|
| Model 1 | 0.88 | 0.88 | 0.94 | 0.28 |
| Model 2 | 0.67 | 1 | 1 | 0.6 |

In terms of cross-entropy (log) loss, Model 1 is much better than Model 2. Although Model 2 separates the two classes perfectly, its scores of 0.6 for the negative samples are still far from 0. This is also why softmax with cross-entropy is commonly used for classification problems instead of a regression-style loss.
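
For reference, here is a sketch of how these numbers can be computed with scikit-learn; predictions with a score of at least 0.5 are counted as positive for F1, and small differences from the table may come from rounding or threshold conventions:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, log_loss

# Data from the balanced-class table above.
y_true = np.array([0] * 8 + [1] * 7)
model1 = np.array([0.1] * 6 + [0.6] * 2 + [0.5] * 2 + [0.9] * 5)
model2 = np.array([0.6] * 8 + [0.7] * 4 + [0.8] * 3)

for name, scores in [("Model 1", model1), ("Model 2", model2)]:
    print(name,
          "F1@0.5 =", round(f1_score(y_true, (scores >= 0.5).astype(int)), 2),
          "ROC-AUC =", round(roc_auc_score(y_true, scores), 2),
          "LogLoss =", round(log_loss(y_true, scores), 2))
```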

2. There are far more negative samples than positive samples

| S.NO. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True label | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Model 1 prediction | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.9 |
| Model 2 prediction | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.9 | 0.9 | 0.9 |

| | F1 (threshold = 0.5) | ROC-AUC | LogLoss |
|---|---|---|---|
| Model 1 | 0.88 | 0.83 | 0.24 |
| Model 2 | 0.96 | 0.96 | 0.24 |

In this data set, Model 1 misses sample 14 (a false negative), while Model 2 flags sample 13 as positive (a false positive). When positive samples are rare, we usually prefer a model that detects all of them (Model 2) over one that simply "follows the crowd" (Model 1), so Model 2 is better than Model 1. Both F1-Score and ROC-AUC reflect this.

3. The number of positive samples is much greater than the number of negative samples.

| S.NO. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True label | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Model 1 prediction | 0.1 | 0.1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 |
| Model 2 prediction | 0.1 | 0.1 | 0.1 | 0.1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 | 0.9 |

| | F1 (threshold = 0.5) | ROC-AUC | LogLoss |
|---|---|---|---|
| Model 1 | 0.963 | 0.83 | 0.24 |
| Model 2 | 0.96 | 0.96 | 0.24 |

When positive samples greatly outnumber negative ones, we also want the negative samples to be detected. Here F1 barely distinguishes the two models (0.963 vs 0.96), while ROC-AUC separates them clearly (0.83 vs 0.96), so ROC-AUC is the more suitable metric.

4. Summary

  • Log loss is not a suitable evaluation metric when the classes are imbalanced.
  • ROC-AUC works well as an evaluation metric when positive and negative samples are imbalanced.
  • If you care about correctly predicting the minority class, ROC-AUC is a good choice of evaluation metric.

Source: blog.csdn.net/qq_44733706/article/details/130619062