Machine Learning Notes (6): Classification Evaluation Metrics, Confusion Matrix, Precision, Recall, PR Curve, ROC Curve

1. The trap of accuracy

Accuracy: The proportion of all predictions that are correct.
Accuracy = (number of correct predictions / total number of samples) × 100%.
Accuracy is the most basic and simplest evaluation metric for classification algorithms. Suppose an algorithm predicts a certain cancer with 99.9% accuracy. Is this a good algorithm?
An accuracy of 99.9% sounds high, but if the incidence of this cancer is only 0.1%, then even without training any model, simply predicting that everyone is healthy already reaches 99.9% accuracy. In a more extreme case, if the incidence is only 0.01%, the algorithm's prediction is actually worse than directly predicting that everyone is healthy. For extremely skewed data (where the number of cancer patients is far smaller than the number of healthy people), accuracy has clear limitations as a measure of a classifier's quality. Solution: the confusion matrix.
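As a quick illustration of this trap, here is a minimal sketch (synthetic labels, assuming the 0.1% incidence from the example above) showing that a "predict everyone is healthy" model already reaches 99.9% accuracy:

```python
# A minimal sketch of the accuracy trap on skewed data
# (synthetic labels, assuming a 0.1% incidence as in the text).
import numpy as np

n = 10_000
y_true = np.zeros(n, dtype=int)
y_true[:10] = 1                   # 0.1% of people actually have the cancer (label 1)

y_pred = np.zeros(n, dtype=int)   # a "model" that predicts everyone is healthy

accuracy = (y_pred == y_true).mean()
print(f"accuracy = {accuracy:.4f}")  # 0.9990, even though the model is useless
```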

2. Confusion Matrix

Take binary classification as an example: 0 means Negative and 1 means Positive. The positive class is usually the part we care about. The confusion matrix is shown in the figure below, where:
TN (True Negative): the number of samples whose label is 0 and whose predicted value is also 0, i.e. the prediction is negative and the actual value is also negative.
FP (False Positive): the number of samples whose label is 0 but whose predicted value is 1, i.e. the prediction is positive but the actual value is negative.
FN (False Negative): the number of samples whose label is 1 but whose predicted value is 0, i.e. the prediction is negative but the actual value is positive.
TP (True Positive): the number of samples whose label is 1 and whose predicted value is also 1, i.e. the prediction is positive and the actual value is also positive.

[Figure: 2×2 confusion matrix layout, rows = actual label (0/1), columns = predicted label (0/1)]
For example, in the confusion matrix of 10,000 people screened for this cancer:
[Figure: confusion matrix for the 10,000-person example]
position (0,0): 9978 people do not have cancer, and the model predicts that they do not have cancer;
position (0,1): 12 people do not have cancer, but the model predicts that they have cancer;
position (1,0): 2 people have cancer, but the model predicts that they do not have cancer;
position (1,1): 8 people have cancer, and the model predicts that they have cancer.
Benefits of using confusion matrix

If the incidence of a certain cancer is 0.1%, then a model that predicts that everyone is healthy is 99.9% accurate. But its precision is 0/(0+0), which is meaningless (the model never predicts a positive), and its recall is 0, which shows that this model is an invalid model.
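A minimal sketch (assuming scikit-learn is available, reusing the synthetic skewed-data idea from above) that builds this confusion matrix and shows why precision degenerates to 0/0 and recall to 0 for the "everyone is healthy" model:

```python
# A minimal sketch, assuming scikit-learn; synthetic data with 0.1% incidence.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1                       # 10 actual cancer patients
y_pred = np.zeros(10_000, dtype=int)  # "everyone is healthy" model

# Rows are actual labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# precision = TP / (TP + FP) = 0 / 0 -> undefined; reported as 0 here
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0
```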

3. Classification evaluation metrics

  • Accuracy

The proportion of all predictions that are correct:
$accuracy = \frac{TP + TN}{TP + FP + TN + FN}$

  • Precision

The proportion of samples classified as positive that are actually positive; class 1 is the part we care about.
$precision = \frac{TP}{TP + FP}$

  • Recall rate (recall)

Recall measures coverage: among all samples that are actually positive, the proportion that is correctly classified as positive.

$recall = \frac{TP}{TP + FN}$

  • F1-score

F1 Score is the harmonic mean of precision and recall
$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$
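As a quick check of these formulas, here is a minimal sketch that computes accuracy, precision, recall, and F1 from the confusion-matrix counts of the 10,000-person cancer example above (TN = 9978, FP = 12, FN = 2, TP = 8):

```python
# Metrics computed directly from the confusion-matrix counts of the
# cancer example above (TN=9978, FP=12, FN=2, TP=8).
TN, FP, FN, TP = 9978, 12, 2, 8

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")   # 0.9986
print(f"precision = {precision:.4f}")  # 0.4000
print(f"recall    = {recall:.4f}")     # 0.8000
print(f"f1        = {f1:.4f}")         # 0.5333
```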

4. PR curve

  • Precision and recall balance

For logistic regression, the decision boundary is $\theta^T \cdot x_b = 0$: samples on the side where $\theta^T \cdot x_b > 0$ are classified as 1, and samples on the side where $\theta^T \cdot x_b < 0$ are classified as 0. We can instead set the decision boundary to an arbitrary constant, $\theta^T \cdot x_b = threshold$: then samples with $\theta^T \cdot x_b > threshold$ are classified as 1 and samples with $\theta^T \cdot x_b < threshold$ are classified as 0. By choosing the threshold we translate the decision boundary, which changes the classification results. The figure below shows the effect of different threshold values on the classification results, precision, and recall.

[Figure: classification results, precision, and recall at different threshold values]
It can be seen from the figure above that precision and recall constrain each other: when precision increases, recall decreases, and when recall increases, precision decreases.
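A minimal sketch of this trade-off (synthetic decision scores, not from the original post): sweeping the threshold shows precision and recall pulling against each other.

```python
# A minimal sketch of the precision/recall trade-off as the decision
# threshold moves (synthetic decision scores, illustrative only).
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Synthetic decision-function scores: negatives centred at -1, positives at +1.
scores = np.concatenate([rng.normal(-1, 1, 900), rng.normal(1, 1, 100)])
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])

for threshold in (-1.0, 0.0, 1.0, 2.0):
    y_pred = (scores > threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:+.1f}  precision={p:.2f}  recall={r:.2f}")
# Raising the threshold tends to raise precision and lower recall.
```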

The PR curve plots precision on the horizontal axis and recall on the vertical axis. In the figure below we can see that as precision gradually increases, recall gradually decreases, and conversely, as recall increases, precision decreases. On this curve we usually see a point where the curve suddenly drops steeply (the red point): to the left of the red point recall drops slowly, and to the right it drops sharply. This shape tells us that the point where the sharp drop begins may be the best balance between precision and recall.
[Figure: PR curve with the sharp-drop point marked in red]
Overall, the PR curve has the shape shown in the figure below: as precision gradually increases, recall gradually decreases.
[Figure: typical shape of a PR curve]
Suppose we have two algorithms, or we train the same algorithm with two different sets of hyperparameters; each trained model corresponds to its own PR curve.

[Figure: PR curves for two sets of hyperparameters]
We can see that the curve obtained with the second set of hyperparameters lies generally outside the curve obtained with the first set. From this we can conclude that the model trained with the second set of hyperparameters is better than the model trained with the first set. The reason is simple: at every point, the precision and recall on the outer curve are both greater than the precision and recall on the inner curve. In general, the closer a model's PR curve is to the outside, the better the model, so the PR curve can also be used as a criterion for selecting models and hyperparameters. Since "inside" and "outside" can feel abstract, more often we use the area enclosed by the PR curve and the X and Y axes to evaluate the model: the larger this area, the better the corresponding model.
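A minimal sketch (scikit-learn, synthetic scores for two hypothetical models) of comparing models by the area under their PR curves:

```python
# A minimal sketch comparing two models by the area under their PR curves
# (synthetic scores for two hypothetical models, illustrative only).
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true   = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
scores_a = np.concatenate([rng.normal(-1.0, 1, 900), rng.normal(0.5, 1, 100)])
scores_b = np.concatenate([rng.normal(-1.0, 1, 900), rng.normal(1.5, 1, 100)])

for name, scores in (("model A", scores_a), ("model B", scores_b)):
    precision, recall, _ = precision_recall_curve(y_true, scores)
    print(f"{name}: PR-AUC = {auc(recall, precision):.3f}")
# The model whose PR curve lies farther outside has the larger area.
```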

5. ROC curve

ROC curve: short for Receiver Operating Characteristic curve, it describes the relationship between TPR and FPR.

  • TPR

Among all samples that are actually positive, the ratio that is correctly judged as positive.

$TPR = \frac{TP}{TP + FN}$

  • FPR

Among all samples that are actually negative, the proportion that is falsely judged as positive.
$FPR = \frac{FP}{FP + TN}$
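As a worked check, a minimal sketch computing TPR and FPR from the confusion-matrix counts of the 10,000-person cancer example above (TN = 9978, FP = 12, FN = 2, TP = 8); note that TPR is the same quantity as recall:

```python
# TPR and FPR from the cancer example counts (TN=9978, FP=12, FN=2, TP=8).
TN, FP, FN, TP = 9978, 12, 2, 8

tpr = TP / (TP + FN)   # 8 / 10   = 0.8    (identical to recall)
fpr = FP / (FP + TN)   # 12 / 9990 ≈ 0.0012
print(f"TPR = {tpr:.4f}, FPR = {fpr:.4f}")
```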

  • The relationship between TPR and FPR

The figure below shows that TPR and FPR move in the same direction: as the classification threshold is lowered, both TPR and FPR increase, and as it is raised, both decrease. In other words, the larger the TPR, the larger the FPR, and the smaller the TPR, the smaller the FPR.
[Figure: TPR and FPR at different threshold values]

  • ROC curve

We take FPR as the horizontal axis and TPR as the vertical axis to obtain the following ROC space:
[Figure: ROC curve in the FPR-TPR plane]
For the ROC curve, what we usually care about is the area under the curve: the larger the area under the curve, the better the classification performance of the trained model. This area can therefore be used as a classification metric, called AUC (Area Under Curve). Its maximum value is 1 and its minimum value is 0; since both the horizontal axis (FPR) and the vertical axis (TPR) lie in [0, 1], the AUC also lies in [0, 1].
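A minimal sketch (scikit-learn, reusing the synthetic-scores idea from the PR-curve example) that computes the points of the ROC curve and its AUC:

```python
# A minimal sketch computing the ROC curve and AUC (synthetic scores,
# illustrative only).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.normal(-1, 1, 900), rng.normal(1, 1, 100)])

fpr, tpr, thresholds = roc_curve(y_true, scores)    # points of the ROC curve
print(f"AUC = {roc_auc_score(y_true, scores):.3f}")  # area under the ROC curve
```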


Origin blog.csdn.net/qq_45723275/article/details/123841537