Python deep learning object detection evaluation metrics: mAP, Precision, Recall, AP, IoU, etc.

Object detection evaluation metrics:

Accuracy, Confusion Matrix, Precision, Recall, AP, Mean Average Precision (mAP), IoU, ROC + AUC, Non-maximum suppression (NMS).

Suppose the original samples contain two categories, where: 
  1: there are P samples of category 1, which we take as the positive class; 
  2: there are N samples of category 0, which we take as the negative class. 
After classification: 
  3: TP samples of category 1 are correctly judged by the system as category 1, and FN samples of category 1 are wrongly judged as category 0; clearly P = TP + FN; 
  4: FP samples of category 0 are wrongly judged by the system as category 1, and TN samples of category 0 are correctly judged as category 0; clearly N = FP + TN. 

In terms of ground truth versus prediction results:

TP (True Positives): positive samples correctly classified as positive.

TN (True Negatives): negative samples correctly classified as negative.

FP (False Positives): negative samples wrongly classified as positive.

FN (False Negatives): positive samples wrongly classified as negative.
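
The four counts can be obtained directly from binary labels; a minimal sketch in plain Python (y_true and y_pred are hypothetical example lists, with 1 = positive and 0 = negative):

```python
# Minimal sketch: count TP, FN, FP, TN from binary labels.
# y_true / y_pred are hypothetical example lists; 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(TP, FN, FP, TN)                         # 3 1 1 3
assert TP + FN == sum(y_true)                 # P = TP + FN
assert FP + TN == len(y_true) - sum(y_true)   # N = FP + TN
```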


 

1. Accuracy

  A = (TP + TN)/(P + N) = (TP + TN)/(TP + FN + FP + TN);
  it reflects the classifier's ability to judge the whole sample set: positives are judged positive and negatives are judged negative.

2. Precision

        P = TP/(TP+FP);

       (By contrast, accuracy is the number of correctly classified samples divided by the total number of samples, i.e. accuracy = correctly predicted positives and negatives / total. Accuracy is generally used to evaluate a model's global performance, but it does not carry enough information to fully characterize a model.)

       Precision reflects the proportion of real positive samples among the examples the classifier judges to be positive.

3. Recall

        R = TP/(TP+FN) = 1 - FN/P;
  it reflects the proportion of positive samples that are correctly identified among all positive samples.

4. F1 value 

       F1 = 2 * precision * recall / (precision + recall);

       This is the traditional F1 measure, the harmonic mean of precision and recall.

5. Miss rate (Missing Alarm)

       MA = FN/(TP + FN) = 1 - TP/P = 1 - R;
       it reflects how many positive samples are missed.

6. False Alarm

       FA = FP/(TP + FP) = 1 - P (i.e. 1 - precision);
       it reflects the proportion of samples judged positive that are actually negative.
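
  The scalar metrics of sections 1-6 translate directly into code. A minimal sketch in terms of the four counts (function names are illustrative, not from any particular library):

```python
# Illustrative helpers for sections 1-6, written in terms of TP, TN, FP, FN.
def accuracy(TP, TN, FP, FN):
    return (TP + TN) / (TP + TN + FP + FN)

def precision(TP, FP):
    return TP / (TP + FP) if (TP + FP) else 0.0

def recall(TP, FN):
    return TP / (TP + FN) if (TP + FN) else 0.0

def f1(TP, FP, FN):
    p, r = precision(TP, FP), recall(TP, FN)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def miss_rate(TP, FN):        # MA = FN / (TP + FN) = 1 - recall
    return 1.0 - recall(TP, FN)

def false_alarm(TP, FP):      # FA = FP / (TP + FP) = 1 - precision
    return 1.0 - precision(TP, FP)

# With the example counts from earlier (TP=3, TN=3, FP=1, FN=1):
# accuracy = 0.75, precision = recall = F1 = 0.75, miss rate = false alarm = 0.25.
```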

7. Confusion Matrix

  In the confusion matrix, the horizontal axis is the class predicted by the model and the vertical axis is the true label of the data.

  The diagonal holds the samples whose prediction agrees with the label, so the sum of the diagonal divided by the total number of test samples is the accuracy. The larger the numbers on the diagonal, the better; in a visualization, a darker diagonal cell means the model predicts that category more accurately. Reading a row, every off-diagonal entry in that row is a misprediction of that true class. In general, we want the diagonal entries to be as high as possible and the off-diagonal entries as low as possible.
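
  A small sketch of building such a confusion matrix with NumPy (the labels are hypothetical; rows are true classes and columns are predicted classes, matching the axes described above):

```python
import numpy as np

# Sketch: confusion matrix for a 3-class problem.
# Rows = true labels, columns = predicted labels; y_true / y_pred are examples.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 2])

num_classes = 3
cm = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
print(cm.trace() / cm.sum())   # accuracy = diagonal sum / total samples
```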

8. Precision and Recall

  


  Some related definitions. Suppose the test set contains only two kinds of pictures, wild geese and airplanes, and the ultimate goal of the classification system is to retrieve all the airplane pictures in the test set, not the goose pictures.

  • True positives: positive samples correctly identified as positive, i.e. airplane pictures correctly recognized as airplanes. 

  • True negatives: negative samples correctly identified as negative, i.e. goose pictures that are not reported, which the system correctly treats as geese. 

  • False positives: negative samples wrongly identified as positive, i.e. goose pictures wrongly recognized as airplanes. 

  • False negatives: positive samples wrongly identified as negative, i.e. airplane pictures that are not recognized, which the system wrongly treats as geese.

  Precision is the proportion of true positives among the pictures identified as positive: in this setting, the proportion of real airplanes among everything recognized as an airplane. It reflects the share of genuine positive samples among the examples the classifier judges positive.

  Recall is the proportion of all positive samples in the test set that are correctly identified as positive: here, the number of correctly recognized airplanes divided by the number of real airplanes in the test set.


  Precision-recall curve: vary the recognition threshold so that the system returns the top K pictures in turn; as the threshold changes, the Precision and Recall values change as well, which traces out the curve.

  If a classifier performs well, it should behave as follows: as Recall increases, Precision stays at a high level. A poorly performing classifier may have to sacrifice a lot of Precision to gain Recall. Papers usually use the precision-recall curve to show this trade-off between the precision and the recall of a classifier.
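
  A sketch of how such a curve can be traced by sweeping the score threshold; scores and labels are hypothetical example arrays, and each top-K cutoff yields one (recall, precision) point:

```python
import numpy as np

# Sketch: trace a precision-recall curve by sweeping the score threshold.
# scores = predicted confidences, labels = ground truth (1 = positive).
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30])
labels = np.array([1,    1,    0,    1,    0,    1,    0])

order = np.argsort(-scores)      # sort detections by score, descending
labels = labels[order]

tp = np.cumsum(labels == 1)      # TP when only the top-k detections are kept
fp = np.cumsum(labels == 0)      # FP when only the top-k detections are kept
precision = tp / (tp + fp)
recall = tp / labels.sum()

for k, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"top-{k}: precision={p:.2f}, recall={r:.2f}")
```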

9. Average Precision (Average-Precision, AP) and mean Average Precision (mAP)

  AP is the area under the precision-recall curve. In general, the better the classifier, the higher the AP value.

  mAP is the mean of the AP values over multiple categories: compute the AP of each class and average them to obtain mAP. Its value always lies in [0, 1], and larger is better. This is the most important metric for object detection algorithms.

  When positive samples are very scarce, the PR curve reflects a model's performance better.
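
  A sketch of turning the (recall, precision) points from the previous section into an AP value, and of averaging per-class APs into mAP. The integration below follows the common all-point convention (precision made monotone, then rectangle areas summed); PASCAL VOC and COCO each use their own interpolation variants, and the per-class AP numbers are made up for illustration:

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve (all-point interpolation sketch)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP = mean of the per-class AP values (hypothetical APs for illustration).
ap_per_class = {"car": 0.72, "person": 0.65, "dog": 0.58}
mAP = sum(ap_per_class.values()) / len(ap_per_class)   # = 0.65
```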


10. IoU

  The IoU value measures how much the box predicted by the system overlaps the box annotated in the original image. It is computed as the intersection of the Detection Result and the Ground Truth divided by their union, and it reflects the localization accuracy of the detection.

       It is worth noting that once the IoU exceeds 0.5, the detection already looks subjectively quite good.

  IoU is the metric used to express how well a predicted bounding box matches the ground truth box.

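
  A minimal IoU sketch for two axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption made for illustration:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping boxes.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```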

11. ROC (Receiver Operating Characteristic) curve and AUC (Area Under Curve)

     


  ROC curve:

  • Abscissa: false positive rate (FPR), FPR = FP / (FP + TN), the probability that a negative sample is wrongly predicted as positive, i.e. the false alarm rate;

  • Ordinate: true positive rate (TPR), TPR = TP / (TP + FN), the probability that a positive sample is correctly predicted, i.e. the hit rate.

  The diagonal corresponds to a random-guessing model, and the point (0, 1) corresponds to the ideal model that ranks every positive sample before every negative sample. The closer the curve is to the upper-left corner, the better the classifier.

  The ROC curve has a useful property: it stays unchanged when the distribution of positive and negative samples in the test set changes. Real data sets often exhibit class imbalance, i.e. many more negatives than positives (or vice versa), and the distribution of positives and negatives in the test data may also change over time.

  ROC curve drawing:

  (1) Sort the test samples by their predicted probability of being positive, from large to small;

  (2) going from high to low, take each score value in turn as the threshold: a test sample is predicted positive if its probability of being positive is greater than or equal to the threshold, otherwise negative;

  (3) each threshold therefore yields one pair of FPR and TPR values, i.e. one point on the ROC curve. 

   Thresholds of 1 and 0 give the two points (0, 0) and (1, 1) respectively. Connecting all the (FPR, TPR) pairs produces the ROC curve; the more threshold values are used, the smoother the curve.

   AUC (Area Under Curve) is the area under the ROC curve. The closer the AUC is to 1, the better the classifier performance.

   Physical meaning: the AUC is a probability. If you randomly pick one positive sample and one negative sample, the AUC is the probability that the classifier assigns the positive sample a higher score than the negative sample. The larger the AUC, the more likely the classifier is to rank positives ahead of negatives, i.e. the better it separates the two classes.

  Computation: approximate the area under the ROC curve by summing the areas of the small rectangles (or trapezoids) beneath it.
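
  A sketch of the drawing procedure above, plus a trapezoidal AUC; scores and labels are hypothetical example arrays, and each threshold contributes one (FPR, TPR) point:

```python
import numpy as np

# Sketch: ROC points by sweeping the score threshold, AUC by the trapezoid rule.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0])   # 1 = positive

order = np.argsort(-scores)                # sort by score, descending
labels = labels[order]

tps = np.concatenate(([0], np.cumsum(labels == 1)))
fps = np.concatenate(([0], np.cumsum(labels == 0)))
tpr = tps / labels.sum()                   # hit rate
fpr = fps / (len(labels) - labels.sum())   # false alarm rate

auc = np.trapz(tpr, fpr)                   # area under the ROC curve
print(list(zip(fpr.round(2), tpr.round(2))), auc)   # auc = 0.8125 here
```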

 12. Comparison of PR curve and ROC curve

  ROC curve characteristics:

  (1) Advantage: the ROC curve stays unchanged when the distribution of positive and negative samples in the test set changes. Because TPR looks only at the positives and FPR looks only at the negatives, ROC is a relatively balanced evaluation method.

      Real data sets often exhibit class imbalance, i.e. many more negatives than positives (or vice versa), and the distribution of positives and negatives in the test data may also change over time.

  (2) Disadvantage: the insensitivity to the class distribution mentioned above is, to some extent, also a weakness. If the number of negatives N grows a lot, the curve does not change even though a large number of FP may have appeared. In settings such as information retrieval, where the main concern is the precision of the positive predictions, this is unacceptable. Under class imbalance, the huge number of negatives makes the growth of FPR look negligible, so the ROC curve gives an overly optimistic estimate of performance. Because the horizontal axis of the ROC curve is FPR, when N far exceeds P a large increase in FP produces only a small change in FPR: many negatives are wrongly judged positive, yet this is hard to see on the ROC curve. (One can, of course, analyse only the small left-hand portion of the ROC curve.)

  PR curve:

  (1) The PR curve uses Precision, so both of its axes focus on the positive samples. For class-imbalance problems the PR curve is widely considered better than the ROC curve, precisely because it concentrates on the positives.

  Usage scenarios:

  1. The ROC curve considers positives and negatives at the same time, so it is suitable for evaluating the overall performance of a classifier, whereas the PR curve focuses entirely on the positives.

  2. If there are several data sets with different class distributions (for example, the proportion of positive and negative cases in a credit-card fraud problem may differ from month to month) and you only want to compare the classifiers themselves while factoring out the change in class distribution, the ROC curve is more suitable, because a shift in class distribution can turn a good-looking PR curve into a poor one, which makes model comparison difficult. Conversely, if you want to test how different class distributions affect a classifier's performance, the PR curve is more suitable.

  3. If you want to evaluate the prediction of positives under the same class distribution, choose the PR curve.

  4. For class-imbalance problems the ROC curve usually gives an optimistic estimate of performance, so in most such cases the PR curve is the better choice.

  5. Finally, you can pick the optimal point on the curve for the specific application, read off the corresponding precision, recall, F1 score and other metrics, and adjust the model's threshold to obtain a model that fits that application.

 13. Non-maximum suppression (NMS)

   Non-Maximum Suppression keeps the bounding box with the highest confidence, based on the score matrix and the coordinates of each region: among overlapping prediction boxes, only the highest-scoring one is retained.

  (1) NMS computes the area of each bounding box, sorts the boxes by score, and takes the highest-scoring box as the first box to compare against;

  (2) it computes the IoU between each remaining bounding box and the current highest-scoring box, removes every box whose IoU exceeds the set threshold, and keeps the prediction boxes with small IoU;

  (3) it then repeats the process on the remaining boxes until the candidate set is empty.

  In the end there are two thresholds in the detection pipeline: the IoU threshold used above, and a score threshold applied afterwards to remove candidate boxes whose confidence is too low. Note that Non-Maximum Suppression handles one category at a time; if there are N categories, it has to be run N times.
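
  A single-class NMS sketch following the steps above; boxes use the (x1, y1, x2, y2) format and the code reuses the iou() helper sketched in section 10, with an illustrative IoU threshold:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy single-class NMS sketch; returns the indices of the kept boxes.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Reuses the iou() helper from the IoU section above.
    """
    order = np.argsort(-scores)            # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # Drop every remaining box that overlaps the best box too much.
        remaining = [i for i in order[1:]
                     if iou(boxes[best], boxes[i]) <= iou_threshold]
        order = np.array(remaining, dtype=int)
    return keep

# For a multi-class detector, run nms() once per category, then drop boxes
# whose score falls below the confidence threshold, as described above.
```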

Origin: blog.csdn.net/qingfengxd1/article/details/111530789