Evaluation metrics for machine learning models: confusion matrix, PR curve, and average precision (with code implementation)

Text reference: Mean Average Precision (mAP) Explained | Paperspace Blog

Table of contents

1. Confusion Matrix

2. Precision-Recall Curve and Average Precision

3. Example and code implementation

(1) From Prediction Score to Class Label

(2) Precision-Recall Curve

(3) Average Precision (AP)


Consider first the simplest binary classification problem:

1. Confusion Matrix

(Figure: a 2 x 2 confusion matrix of Predicted Class versus Actual Class; see the watermark for the image source.)

In fact, the confusion matrix itself is just a 2 x 2 table. What matters more here is understanding Type I error (false positive) and Type II error (false negative), and the metrics built on them: Accuracy, Precision, and Recall/Sensitivity.

(1) Precision: among all samples predicted Positive (i.e., predicted as 1), the proportion that are predicted correctly.

Precision=\frac{TP}{TP+FP}

(2) Recall/Sensitivity: among all samples that are actually positive (i.e., truly 1), the proportion predicted Positive (i.e., predicted as 1, correctly). When a sample is actually 1 but the model predicts 0, the Type II error in the figure above occurs.

*It is worth mentioning that when a model is used to detect disease, Recall/Sensitivity is a key reference metric, because we would rather make some wrong positive predictions than miss a patient who may actually have the disease.

Recall=\frac{TP}{TP+FN}

(3) Accuracy: unlike the previous two metrics, which are ratios within a single column or row of the table, Accuracy looks at the table as a whole: the proportion of samples that are predicted 1 and actually 1, plus those predicted 0 and actually 0, with the total number of samples as the denominator.

Accuracy=\frac{TP+TN}{TP+TN+FP+FN}

(4) F1-score: a metric that takes both (1) Precision and (2) Recall into account; it is their harmonic mean.

F1\text{-}score=\frac{2*Precision*Recall}{Precision+Recall}
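
As a quick illustration (my own sketch, not from the referenced article), the four metrics can be computed directly from the four cells of the confusion matrix; the counts below are hypothetical, chosen only for demonstration:

# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, fn, tn = 4, 1, 2, 3

precision = tp / (tp + fp)                    # 4/5  = 0.8
recall    = tp / (tp + fn)                    # 4/6  ≈ 0.667
accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 7/10 = 0.7
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.727

print(precision, recall, accuracy, f1)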

(In the figure below, the doctor's diagnosis can be read as the model's positive/negative prediction.)

2. Precision-Recall Curve and Average Precision

From the perspective of the predictions, Precision describes how many of the samples the binary classifier predicts as positive are truly positive, that is, how accurate its positive predictions are. From the perspective of the ground truth, Recall describes how many of the real positive examples in the test set are picked out by the binary classifier, that is, how many real positives are recalled.

Precision and Recall are usually a pair of conflicting performance metrics: generally speaking, the higher the Precision, the lower the Recall. The reason is that to increase Precision, i.e., to make the predicted positives as likely as possible to be real positives, we must raise the threshold at which the classifier predicts a positive. For example, if we previously marked a sample as positive when its predicted probability was at least 0.5, we might now require the probability to be at least 0.7, so that the selected positives are more likely to be real ones; this works directly against improving Recall. Conversely, to improve Recall, i.e., to have the classifier pick out as many of the real positives as possible, we must lower that threshold, say to 0.3, so that the classifier selects as many real positives as it can.
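
As a small illustration of this trade-off (my own sketch on invented labels and scores, not from the referenced article), precision rises and recall falls as the threshold is raised from 0.3 to 0.7:

# Toy labels (1 = positive, 0 = negative) and scores, invented for illustration
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 0]
scores = [0.9, 0.85, 0.8, 0.65, 0.6, 0.45, 0.4, 0.35, 0.25, 0.2]

for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Raising the threshold lifts precision but costs recall
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")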

So is there a metric that characterizes a binary classifier's combined performance in terms of both Precision and Recall? Yes. As mentioned above, different thresholds for predicting positives yield different pairs of Precision and Recall. In many cases we can sort the classifier's predictions: the first sample is the one the classifier considers most likely to be positive, and the last is the one it considers least likely. Lowering the threshold step by step in this order, we can compute the current Precision and Recall at each step. Plotting Recall on the horizontal axis and Precision on the vertical axis gives the Precision-Recall curve, or PR diagram for short.
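
As an aside, not part of the referenced article: scikit-learn also ships a ready-made sklearn.metrics.precision_recall_curve that performs exactly this sorted-score threshold sweep. A minimal sketch, reusing the toy labels and scores from the sketch above:

import numpy
import sklearn.metrics

# Made-up labels and scores, purely for illustration
y_true = numpy.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 0])
scores = numpy.array([0.9, 0.85, 0.8, 0.65, 0.6, 0.45, 0.4, 0.35, 0.25, 0.2])

# precision and recall have one more entry than thresholds
# (the curve always ends at the point recall=0, precision=1)
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")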

The PR diagram visually shows a binary classifier's Precision and Recall. When comparing classifiers, if the PR curve of one is completely enclosed by the PR curve of another, the latter can be said to outperform the former; for example, classifier A in the figure above performs better than C. If the two PR curves cross, as with classifiers A and B in the figure, it is hard to say which one is better.

However, in many cases we still want to compare classifiers A and B. A more reasonable indicator here is the area under the PR curve, which to some extent characterizes the classifier's combined performance in Precision and Recall. This is the Average Precision (AP), which, simply put, averages the Precision values along the PR curve. For a continuous PR curve, this is computed as an integral:

AP=\int_{0}^{1}p(r)dr

In practice, many classification problems are not limited to two classes. Generally speaking, AP is computed per class, and mAP (mean Average Precision) is the mean of the AP values over all classes.
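
As a hedged sketch (my addition, not the article's code, on invented one-vs-rest labels and scores), per-class AP can be computed with scikit-learn's average_precision_score, which implements the step-wise weighted sum of precisions described in Section 3, and mAP is then simply their mean:

import numpy
import sklearn.metrics

# Invented one-vs-rest labels and scores for three classes, for illustration only
per_class = {
    "cat":  ([1, 0, 1, 1, 0], [0.9, 0.4, 0.7, 0.6, 0.3]),
    "dog":  ([0, 1, 0, 1, 1], [0.2, 0.8, 0.5, 0.7, 0.6]),
    "bird": ([1, 0, 0, 0, 1], [0.6, 0.3, 0.2, 0.4, 0.9]),
}

aps = []
for name, (labels, scores) in per_class.items():
    ap = sklearn.metrics.average_precision_score(labels, scores)  # AP for this class
    aps.append(ap)
    print(f"AP({name}) = {ap:.3f}")

print(f"mAP = {numpy.mean(aps):.3f}")  # mAP: mean of the per-class AP values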

Moreover, in practice we usually do not compute AP on the raw PR curve directly, but on a smoothed PR curve: for each point on the curve, the Precision value is replaced by the largest Precision at any recall to the right of (greater than or equal to) that point:

Written as a formula: P_{smooth}(r)=\max_{r'\geq r}P(r'). AP is then computed on the smoothed curve. For example, Interpolated AP (the AP computation used in Pascal VOC 2008) takes, on the smoothed PR curve, the Precision values at the 11 equally spaced recall points 0, 0.1, ..., 1.0 on the horizontal axis (10 equal intervals, 11 endpoints) and averages them as the final AP value.

AP=\frac{1}{11}\sum_{r\in\{0,0.1,\ldots,1\}}P_{smooth}(r)

Of course, AP can also be obtained by integrating the smoothed curve directly.
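
Below is a minimal sketch of this smoothing plus 11-point interpolation (my own illustration, not the article's code), applied to made-up precision/recall arrays:

import numpy

def interpolated_ap_11point(recalls, precisions):
    """11-point interpolated AP in the spirit of Pascal VOC 2008:
    smooth with P_smooth(r) = max over r' >= r of P(r'),
    then average the smoothed precision at r = 0, 0.1, ..., 1.0."""
    recalls = numpy.asarray(recalls)
    precisions = numpy.asarray(precisions)
    ap = 0.0
    for r in numpy.linspace(0, 1, 11):
        mask = recalls >= r
        # If no PR point reaches this recall level, the interpolated precision is taken as 0
        p_smooth = precisions[mask].max() if mask.any() else 0.0
        ap += p_smooth / 11
    return ap

# Made-up PR points, for illustration only
recalls = [0.44, 0.56, 0.67, 0.78, 0.89, 1.0]
precisions = [1.0, 1.0, 1.0, 0.875, 0.57, 0.56]
print(interpolated_ap_11point(recalls, precisions))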

3. Example and code implementation

Considering a binary classification problem, the whole process has the following steps:

(1) From Prediction Score to Class Label

Assuming there are two categories, Positive and Negative, here are the actual labels of 10 samples, denoted as y_true:

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive"]

When these samples are fed into the model, it returns the following prediction scores pred_scores:

pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3]

Based on these scores, we classify the samples, i.e., assign each sample a class label. First, set a threshold on the prediction score: when a score is greater than or equal to the threshold, the sample is assigned to one class (usually the positive class, 1); otherwise it is assigned to the other class (usually negative, 0). Here we set the threshold to 0.5 and obtain the labels y_pred predicted by the model:

import numpy

pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3]
y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive"]

threshold = 0.5
y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores]
print(y_pred)

output:

['positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative']

Both the true and predicted labels are now available in the y_true and y_pred variables. Based on these labels, the confusion matrix, precision, and recall can be calculated from the previous definitions.

import sklearn.metrics

# confusion_matrix orders the labels alphabetically ([negative, positive]);
# flipping both axes makes the matrix read TP, FN / FP, TN
r = numpy.flip(sklearn.metrics.confusion_matrix(y_true, y_pred))
print(r)

precision = sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
print(precision)

recall = sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
print(recall)

result:

# Confusion Matrix (From Left to Right & Top to Bottom: True Positive, False Negative, False Positive, True Negative)
[[4 2]
 [1 3]]

# Precision = 4/(4+1)
0.8

# Recall = 4/(4+2)
0.6666666666666666

(2) Precision-Recall Curve (Precision-Recall Curve)

Because both precision and recall matter, the precision-recall curve shows the trade-off between the two values across different thresholds. This curve helps choose the threshold that best balances the two metrics.

Creating a precision-recall curve requires a few inputs:

1. The true labels of the samples.

2. The prediction scores of the samples.

3. Some thresholds for converting the prediction scores into class labels.

The following snippet creates the y_true list to hold the true labels, the pred_scores list for the predicted scores, and finally the thresholds list of thresholds (here from 0.2 up to, but not including, 0.7, in steps of 0.05):

import numpy

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive", "positive", "positive", "positive", "negative", "negative", "negative"]

pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3, 0.7, 0.5, 0.8, 0.2, 0.3, 0.35]

thresholds = numpy.arange(start=0.2, stop=0.7, step=0.05)

Since thresholds contains 10 values, 10 precision and 10 recall values will be created. The next function, precision_recall_curve(), receives the true labels, the prediction scores, and the thresholds. It returns two equal-length lists holding the precision and recall values:

import sklearn.metrics

def precision_recall_curve(y_true, pred_scores, thresholds):
    precisions = []
    recalls = []
    
    for threshold in thresholds:
        y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores]

        precision = sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
        recall = sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
        
        precisions.append(precision)
        recalls.append(recall)

    return precisions, recalls

Calling it on the data above gives:

precisions, recalls = precision_recall_curve(y_true=y_true, pred_scores=pred_scores, thresholds=thresholds)

output:

# Precision
[0.5625,
 0.5714285714285714,
 0.5714285714285714,
 0.6363636363636364,
 0.7,
 0.875,
 0.875,
 1.0,
 1.0,
 1.0]
# Recall
[1.0,
 0.8888888888888888,
 0.8888888888888888,
 0.7777777777777778,
 0.7777777777777778,
 0.7777777777777778,
 0.7777777777777778,
 0.6666666666666666,
 0.5555555555555556,
 0.4444444444444444]

Given two lists of equal length, their values can be plotted in a two-dimensional plot as follows:

import matplotlib.pyplot
matplotlib.pyplot.plot(recalls, precisions, linewidth=4, color="red")
matplotlib.pyplot.xlabel("Recall", fontsize=12, fontweight='bold')
matplotlib.pyplot.ylabel("Precision", fontsize=12, fontweight='bold')
matplotlib.pyplot.title("Precision-Recall Curve", fontsize=15, fontweight="bold")
matplotlib.pyplot.show()

It can be seen that as recall increases, precision decreases. The reason is that as the threshold is lowered, more samples are predicted as positive (raising recall), but a larger share of those positive predictions are wrong (lowering precision).

The precision-recall curve makes it easy to identify points where both precision and recall are high. According to the figure above, the best point is (recall, precision) = (0.778, 0.875). A better way is to use the F1-score (see the formula above):

f1 = 2 * ((numpy.array(precisions) * numpy.array(recalls)) / (numpy.array(precisions) + numpy.array(recalls)))

Based on the values in the F1 list, the highest score is 0.82352941. It first appears as the 6th element (i.e., index 5) of the list. The 6th elements of the recall and precision lists are 0.778 and 0.875, respectively, and the corresponding threshold is 0.45:

# F1-score
[0.72, 
 0.69565217, 
 0.69565217, 
 0.7,
 0.73684211,
 0.82352941, 
 0.82352941, 
 0.8, 
 0.71428571, 
 0.61538462]
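
The best index, and hence the best threshold, could also be located programmatically. A small sketch (my addition, assuming the f1, thresholds, precisions, and recalls variables computed above):

best = numpy.argmax(f1)  # index of the highest F1-score (first occurrence)
print("best index     :", best)              # 5
print("best threshold :", thresholds[best])  # ≈ 0.45
print("precision      :", precisions[best])  # 0.875
print("recall         :", recalls[best])     # ≈ 0.778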

The image below shows in blue the location of the point corresponding to the best balance between recall and precision. In summary, the optimal threshold for balancing precision and recall is 0.45, where precision is 0.875 and recall is 0.778.

matplotlib.pyplot.plot(recalls, precisions, linewidth=4, color="red", zorder=0)
matplotlib.pyplot.scatter(recalls[5], precisions[5], zorder=1, linewidth=6)

matplotlib.pyplot.xlabel("Recall", fontsize=12, fontweight='bold')
matplotlib.pyplot.ylabel("Precision", fontsize=12, fontweight='bold')
matplotlib.pyplot.title("Precision-Recall Curve", fontsize=15, fontweight="bold")
matplotlib.pyplot.show()

(3) Average Precision AP (Average Precision)

AP is calculated according to the following formula: loop over all precisions/recalls, compute the difference between the current recall and the next recall, and multiply it by the current precision. In other words, AP is a weighted sum of the precision at each threshold, where the weight is the change in recall. This is essentially the "partition and sum" process from calculus:

AP=\sum_{k=0}^{n-1}[Recalls(k)-Recalls(k+1)]*Precisions(k)

Here Recalls(n) = 0, Precisions(n) = 1, and n is the number of thresholds used. In other words, 0 is appended to the recalls list and 1 is appended to the precisions list.

AP = numpy.sum((recalls[:-1] - recalls[1:]) * precisions[:-1])

Here is the complete code for calculating AP:

import numpy
import sklearn.metrics

def precision_recall_curve(y_true, pred_scores, thresholds):
    precisions = []
    recalls = []
    
    for threshold in thresholds:
        y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores]

        precision = sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
        recall = sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
        
        precisions.append(precision)
        recalls.append(recall)

    return precisions, recalls

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive", "positive", "positive", "positive", "negative", "negative", "negative"]
pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3, 0.7, 0.5, 0.8, 0.2, 0.3, 0.35]
thresholds = numpy.arange(start=0.2, stop=0.7, step=0.05)

precisions, recalls = precision_recall_curve(y_true=y_true, 
                                             pred_scores=pred_scores, 
                                             thresholds=thresholds)

precisions.append(1)
recalls.append(0)

precisions = numpy.array(precisions)
recalls = numpy.array(recalls)

AP = numpy.sum((recalls[:-1] - recalls[1:]) * precisions[:-1])
print(AP)


Source: blog.csdn.net/qq_54708219/article/details/129444818