Common Evaluation Indicators for Computer Vision Models

Table of contents

1. Overview

2. Commonly used evaluation indicators

2.1 Classification task

2.2 Detection task

2.3 Segmentation tasks


1. Overview

        The basic tasks of computer vision fall into four categories: classification, localization, detection, and segmentation. The classification task judges the category of the object in an image; generally the image contains only one class of object, and describing image features is the main research focus of image classification. The localization task determines the specific position of an object in the image, usually represented as a bounding box. The detection task outputs the bounding box and label of every target in the image. Classification and localization are usually single-target, while object detection is multi-target. The specific differences are shown in the figure below.

The segmentation task refers to dividing an image into several semantically meaningful regions. It can be subdivided into three research directions: semantic segmentation, instance segmentation, and panoptic segmentation. The differences among the three are shown in the figure below.

Semantic segmentation, which is what is usually meant by image segmentation, is a pixel-by-pixel classification problem: every pixel is assigned a unique category, and both countable and uncountable objects must be classified. Instance segmentation (Instance Segmentation) must not only predict the semantic labels of countable objects but also distinguish individual instance IDs; the semantic label is the object's category, while the instance ID distinguishes different objects of the same category. Note that uncountable objects do not need to be predicted, so instance segmentation is roughly equivalent to object detection + semantic segmentation. Panoptic segmentation (Panoptic Segmentation) requires every pixel in the image to be assigned both a semantic label and an instance ID, so panoptic segmentation is equivalent to semantic segmentation + instance segmentation.

2. Commonly used evaluation indicators

2.1 Classification task

        Classification tasks often use indicators such as accuracy, precision, recall, F1 score, and the ROC curve to evaluate how good a model is. These basic indicators can also be used to evaluate segmentation or detection models; they are largely universal. The confusion matrix is a summary of the prediction results of a classification problem, and it is also the most basic, most intuitive, and simplest way to measure the accuracy of a classification model. The confusion matrix contains the four basic quantities of a classification problem, as shown in the table below.

Confusion matrix          Actual positive    Actual negative
Predicted positive        TP                 FP
Predicted negative        FN                 TN

TP (true positive): a positive sample predicted as positive, i.e. the true label is 1 and the prediction is also 1.

TN (true negative): a negative sample predicted as negative, i.e. the true label is 0 and the prediction is also 0.

FP (false positive): a negative sample predicted as positive, i.e. the true label is 0 but the prediction is 1.

FN (false negative): a positive sample predicted as negative, i.e. the true label is 1 but the prediction is 0.

        A classification model should be as accurate as possible; in terms of the confusion matrix, the larger TP and TN and the smaller FP and FN, the better. However, the raw counts in the confusion matrix alone are not enough to judge a model, so the following indicators are used to further evaluate model quality.

①Accuracy

Accuracy=\frac{TP+TN}{TP+TN+FP+FN}

②Precision: the probability that a sample predicted as 1 by the model is actually 1.

P=\frac{TP}{TP+FP}

③Recall: the probability that a sample that is actually 1 is predicted as 1. Also called the true positive rate or sensitivity, abbreviated TPR.

R=\frac{TP}{TP+FN}

Note: precision and recall are easy to confuse, and many people are unsure which one matters in which scenario. As examples (the positive class is usually the one we care about, e.g. earthquakes, tumors, spam): in tumor diagnosis or earthquake prediction, the model should have a high recall: it is better to raise a false alarm than to miss a real tumor or earthquake. In spam filtering, the model should have a high precision: no normal email may end up in the spam folder, even if that means some spam gets through.

④F1 score: takes both the precision and recall of the classification model into account and can be regarded as their harmonic mean. The maximum value is 1 and the minimum is 0, where 1 represents the best possible model and 0 the worst.

F1=\frac{2}{\frac{1}{P}+\frac{1}{R}}=\frac{2PR}{P+R}=\frac{2TP}{2TP+FN+FP}
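As a minimal sketch (assuming binary labels encoded as 0/1 NumPy arrays; the function name and example data are made up for illustration), the four confusion-matrix counts and indicators ①-④ can be computed like this:

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from 0/1 label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall    = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return accuracy, precision, recall, f1

# Toy example with 8 samples
print(binary_classification_metrics([1, 0, 1, 1, 0, 0, 1, 0],
                                    [1, 0, 0, 1, 0, 1, 1, 0]))
```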

⑤P-R curve

        The PR curve takes recall (Recall) as the horizontal axis and precision (Precision) as the vertical axis. The more the curve bulges toward the upper right, the better the model. In the figure below there are three curves (black, orange, and blue) representing the PR curves of three models. The black and orange curves are always above the blue curve, which means their models perform better than the model of the blue curve. The black and orange curves intersect, so it cannot be said outright which of those two models is better; it depends on the situation. Also note that the two indicators of the PR curve only focus on positive samples.

        Seeing this, you may wonder: a PR curve represents one model, and drawing it requires multiple (R, P) points, yet isn't the recall and precision of a model unique? For a fixed decision threshold, the recall and precision of a model are indeed unique. The PR curve has multiple (P, R) points because the classifier outputs a probability: 0.5 is usually used as the threshold, with scores above 0.5 assigned to one class and scores below 0.5 to the other, but depending on the scenario the predicted label can be changed by adjusting this probability threshold. Choosing different thresholds therefore yields different (P, R) points, from which the PR curve shown in the figure above can be drawn.
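A hedged sketch of this thresholding idea (plain NumPy; scikit-learn's precision_recall_curve does the same job): each distinct predicted score is tried as a threshold and yields one (R, P) point.

```python
import numpy as np

def pr_points(y_true, scores):
    """(recall, precision) pairs obtained by sweeping the decision threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    points = []
    for t in np.unique(scores):                 # each distinct score is a candidate threshold
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        if tp + fp == 0:                        # nothing predicted positive: precision undefined
            continue
        points.append((tp / (tp + fn), tp / (tp + fp)))
    return sorted(points)

# Hypothetical probability outputs of a classifier
print(pr_points([1, 0, 1, 1, 0, 1, 0, 0],
                [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1]))
```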

⑥ ROC curve

        The ROC curve takes FPR as the horizontal axis and TPR as the vertical axis. TPR, the true positive rate or sensitivity, is the same as recall: the probability that a sample that is actually 1 is predicted as 1. FPR, the false positive rate (equal to 1 minus the specificity), is the probability that a sample that is actually 0 is predicted as 1.

TPR=\frac{TP}{TP+FN}

FPR=\frac{FP}{FP+TN}

The trend of the ROC curve is shown in the figure below. The more the curve bulges toward the upper left, the better the model. Like the PR curve, the ROC curve is drawn by selecting different thresholds to obtain different points.
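The same threshold sweep yields the ROC points; a minimal NumPy sketch (scikit-learn's roc_curve is the usual library route), assuming both classes are present in y_true:

```python
import numpy as np

def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    points = [(0.0, 0.0)]                        # highest threshold: nothing predicted positive
    for t in sorted(np.unique(scores), reverse=True):
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points
```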

Comparison of the PR curve and the ROC curve: when the numbers of positive and negative samples are close to 1:1, both curves can be used to evaluate a model; the more the PR curve bulges toward the upper right, the better the model, and the more the ROC curve bulges toward the upper left, the better the model. The advantage of the ROC curve is that when the number of negative samples is large, as shown in the figure below, the ROC curve keeps roughly the same shape, while the PR curve changes dramatically and basically loses its ability to evaluate the model.

        The PR curve changes so much because, when negative samples far outnumber positive ones, the recall R in the PR curve (i.e. the TPR in the ROC curve) barely changes, while the large increase in FP causes the precision P to drop sharply. The ROC curve keeps its shape because an unbalanced sample size does not cause large changes in either TPR or FPR. This, however, is both the strength and the weakness of the ROC curve. Why? Because with many negative samples, a large increase in FP buys only a small change in FPR, so even though a large number of negative samples are misjudged as positive, this cannot be seen intuitively on the ROC curve. To make up for this drawback, the ROC curve is usually reported together with the AUC indicator. In practical applications the positive class (earthquakes, tumors, etc.) is usually the one we care about; such samples are hard to collect and few in number, so class imbalance, i.e. a large negative-to-positive ratio, occurs quite often.

⑦AUC indicator

        The AUC indicator is defined as the area under the ROC curve. AUC = 0.5 corresponds to the worst case for a classifier: the ROC curve degenerates into the straight line y = x and TPR always equals FPR, meaning that whether the true class of a sample is 1 or 0, the classifier predicts 1 with equal probability, like flipping a coin with no classification ability at all. AUC = 1 corresponds to the best case, meaning the model always predicts correctly.

        The advantage of AUC is that its calculation takes into account the classifier's ability to separate positive and negative samples across all thresholds at once, so even with unbalanced samples it still gives a reasonable evaluation of the classifier, making up for the shortcoming of the ROC curve described above. In short, AUC is one of the better evaluation indicators for classification models.
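A hedged sketch of computing AUC as the area under the ROC curve with the trapezoidal rule, reusing the roc_points helper sketched above (scikit-learn's roc_auc_score gives the same value directly from labels and scores):

```python
import numpy as np

def auc_from_roc(points):
    """Trapezoidal area under an ROC curve given as (FPR, TPR) points."""
    pts = sorted(points)                      # order by increasing FPR
    fpr = np.array([p[0] for p in pts])
    tpr = np.array([p[1] for p in pts])
    return float(np.trapz(tpr, fpr))          # area under the curve

# auc = auc_from_roc(roc_points(y_true, scores))
# AUC = 0.5 for a coin-flip classifier, 1.0 for a perfect one
```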

2.2 Detection task

        In addition to the evaluation indicators of the classification task, the object detection task has several indicators specific to detection models, such as mAP and FPS.

①mAP (mean Average Precision)

        Each picture in an object detection problem may contain objects of several different classes, and even if the detector recognizes the class of an object, the detection is useless if the object cannot be localized, so the model has to be evaluated on classification and localization at the same time. The standard Precision indicator of image classification cannot be applied directly to this, which is why the mAP indicator was introduced; mAP is the most commonly used evaluation indicator in object detection. Before explaining mAP, one must understand AP. AP stands for Average Precision and is defined as the area under the PR curve of a single class, i.e. every class has its own AP value. mAP stands for mean Average Precision and is defined as the average of the AP values over all classes. In the actual calculation below, AP is computed as the average of the Precision values at 10 Recall points ([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]), which approximates the area under the entire PR curve. The following example illustrates how to calculate mAP.

       Before calculating the mAP evaluation indicator, a few definitions are needed:

1) Intersection over Union (IoU)

        Intersection over Union (IoU) measures the degree of overlap between two boxes (in object detection, between a predicted box and a ground-truth box). The formula is as follows:

IoU=\frac{area(B_{pre}\cap B_{gt})}{area(B_{pre}\cup B_{gt})}

B_{gt} denotes the ground-truth box of the target (Ground Truth, GT) and B_{pre} denotes the predicted box. By calculating the IoU of the two, one can judge whether a predicted box counts as a valid detection. IoU is illustrated in the figure below:
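A minimal sketch of box IoU, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def box_iou(box_pre, box_gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_pre[0], box_gt[0])
    y1 = max(box_pre[1], box_gt[1])
    x2 = min(box_pre[2], box_gt[2])
    y2 = min(box_pre[3], box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_pre = (box_pre[2] - box_pre[0]) * (box_pre[3] - box_pre[1])
    area_gt  = (box_gt[2]  - box_gt[0])  * (box_gt[3]  - box_gt[1])
    union = area_pre + area_gt - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```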

2) TP, FP, FN, TN

TP: the number of predicted boxes with IoU > threshold (the threshold is chosen according to the task). The same Ground Truth is counted only once, i.e. if several predicted boxes satisfy the condition, only the one with the largest IoU is taken.

FP: the number of predicted boxes with IoU ≤ threshold, plus redundant boxes that detect a GT which has already been matched.

FN: Number of GTs not detected.

TN: essentially meaningless in object detection (every region that contains no object would count), so it is not used in the mAP calculation.
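A sketch of how predicted boxes are typically marked TP or FP: greedy matching in descending confidence order, with each GT matched at most once. It reuses the box_iou helper sketched above, and the 0.3 threshold mirrors the example that follows (function and argument names are illustrative).

```python
def label_tp_fp(pred_boxes, pred_scores, gt_boxes, iou_threshold=0.3):
    """Mark each predicted box as TP (True) or FP (False); a GT is matched at most once."""
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    matched_gt = set()
    labels = [False] * len(pred_boxes)
    for i in order:
        best_iou, best_gt = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            iou = box_iou(pred_boxes[i], gt)        # box_iou: IoU helper from section 1)
            if iou > best_iou:
                best_iou, best_gt = iou, j
        if best_iou >= iou_threshold:
            labels[i] = True                        # TP
            matched_gt.add(best_gt)
        # otherwise FP: IoU too low, or the GT was already taken by a higher-confidence box
    return labels
```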

3) Precision rate P and recall rate R

        The precision P is the probability that a sample predicted as 1 is actually 1, and the recall R is the probability that a sample that is actually 1 is predicted as 1. Their formulas are given below, where "all ground truths" denotes the total number of GTs, a value that is fixed once the dataset is given.

P=\frac{TP}{TP+FP}

R=\frac{TP}{TP+FN}=\frac{TP}{\text{all ground truths}}

        Let's start with this example. Suppose we have 7 pictures (Image1-Image7) containing 15 ground-truth targets (green boxes, the "all ground truths" number mentioned above) and 24 predicted bounding boxes (red boxes, each labeled with a letter and a confidence value).

        From the figure and description above we can build the following table, where Images is the picture number, Detections is the predicted-box ID, Confidences is the confidence of the predicted box, and TP or FP marks whether the box is counted as TP or FP. Here a predicted box is marked TP if its IoU with a GT is greater than or equal to 0.3; if a GT has several predicted boxes, only the one with the largest IoU (and IoU ≥ 0.3) is marked TP and the rest are marked FP, i.e. a GT can have at most one predicted box marked TP. The threshold 0.3 is an arbitrarily chosen value.

        From the table above we can plot the PR curve (since AP is the area under the PR curve), but first we need the coordinates of each point on the curve: sort all predicted boxes by confidence from large to small, then compute the Precision and Recall values, see the table below. (Remember the accumulation concept: the ACC TP and ACC FP columns in the table below.)

        With these 24 pairs of PR values, the PR curve can be drawn.

        Once the PR curve is obtained, the AP (the area under it) can be calculated. To make this area easier to compute, the sawtooth curve is usually smoothed first: pick n points on the Recall axis and, at each point, take the largest Precision found to its right, then use that Precision over the interval. For example, with the 10 points [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] on the Recall axis, the Precision at each Recall point is the maximum Precision of all points on the PR curve to its right, yielding a new curve (the red line in the figure below); the area under the red line is the AP of this class.

 

  AP=0.1\times 1+0.1\times 0.6666+0.3\times 0.4285+0.5\times 0=\frac{1}{10}\times (1+0.6666+0.4285+0.4285+0.4285+0+0+0+0+0)=0.295

        As the calculation above shows, AP can be viewed either as the area under the smoothed PR curve (the area under the red line) or as the average of the Precision values at the 10 selected Recall points [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], which is probably the real origin of the name AP (Average Precision). To obtain mAP, compute the AP of every class and then take the mean.
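A hedged sketch of the 10-point AP computation described above (accumulating TP/FP in confidence order, then averaging the interpolated Precision at Recall points 0 to 0.9); averaging the per-class APs would give mAP. Function and parameter names are illustrative.

```python
import numpy as np

def average_precision(tp_flags, scores, num_gt,
                      recall_points=np.arange(0.0, 1.0, 0.1)):
    """10-point interpolated AP from per-prediction TP/FP flags and confidences."""
    order = np.argsort(scores)[::-1]                 # sort by confidence, descending
    tp = np.asarray(tp_flags, dtype=float)[order]
    acc_tp = np.cumsum(tp)                           # ACC TP
    acc_fp = np.cumsum(1.0 - tp)                     # ACC FP
    precision = acc_tp / (acc_tp + acc_fp)
    recall = acc_tp / num_gt
    ap = 0.0
    for r in recall_points:
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0   # max precision to the right
    return ap / len(recall_points)

# mAP would then be np.mean([average_precision(...) for each class])
```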

②FPS

        FPS is another important evaluation indicator of an object detection model. It is mainly used to evaluate the detection speed of the model and denotes the number of images the model can process per second: the more images processed per second, the faster the model.
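FPS is usually estimated by timing the forward pass over a batch of images; a minimal sketch with a placeholder model callable (the model, the warm-up count, and the image list are assumptions for illustration):

```python
import time

def measure_fps(model, images, warmup=5):
    """Rough FPS estimate: images processed per second by model(image)."""
    for img in images[:warmup]:          # warm-up runs are not timed
        model(img)
    start = time.perf_counter()
    for img in images:
        model(img)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed
```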

2.3 Segmentation tasks

①MIoU (mean Intersection over Union)

        Mean Intersection over Union (MIoU) is the most commonly used evaluation indicator for segmentation models. The IoU of a class is the ratio of the intersection to the union of the model's prediction and the ground truth for that class; for object detection it is computed between a predicted box and a ground-truth box, while for image segmentation it is computed between the predicted mask and the ground-truth mask. After computing the IoU of every class, averaging them gives the MIoU.
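A sketch of per-class IoU and MIoU for segmentation, assuming the prediction and ground truth are integer label maps of the same shape (classes absent from both are skipped; this convention is an assumption):

```python
import numpy as np

def mean_iou(pred_mask, gt_mask, num_classes):
    """MIoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c = (pred_mask == c)
        gt_c   = (gt_mask == c)
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:                   # class absent everywhere: skip it
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```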

②MPA (mean pixel accuracy)

        Mean pixel accuracy is another evaluation indicator for segmentation models. The pixel accuracy PA of a class is the proportion of that class's pixels that are predicted correctly, out of all pixels belonging to that class. Compute the PA of every class, then average them to obtain the mean pixel accuracy MPA.
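A matching sketch for MPA, with each class's pixel accuracy taken as correctly predicted pixels over all ground-truth pixels of that class (same label-map assumptions as the MIoU sketch above):

```python
import numpy as np

def mean_pixel_accuracy(pred_mask, gt_mask, num_classes):
    """MPA: average over classes of (correct pixels of class c) / (GT pixels of class c)."""
    accs = []
    for c in range(num_classes):
        gt_c = (gt_mask == c)
        if gt_c.sum() == 0:              # class not present in the ground truth: skip it
            continue
        correct = np.logical_and(pred_mask == c, gt_c).sum()
        accs.append(correct / gt_c.sum())
    return float(np.mean(accs))
```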
