Selection and introduction of evaluation indicators for deep learning object detection model testing

A prerequisite of autonomous driving is the safety of people, so pedestrian detection is essential. Given the requirements of driving scenes, the model should also recognize the various vehicles, traffic lights, traffic signs, and other objects that frequently appear on the road and affect decision-making, such as motorcycles and bicycles. Once the dataset and detection categories are fixed, evaluation metrics become vital for judging model performance, and there is already a large body of related research. This section summarizes the metrics chosen for the project and introduces the evaluation metrics that are currently mainstream.

Selection of test indicators

For target detection problems, the commonly used evaluation indicators are:

  • Accuracy evaluation indicators: mean Average Precision (mAP), accuracy (Accuracy), confusion matrix (Confusion Matrix), precision (Precision), recall (Recall), average precision (AP), Intersection over Union (IoU), ROC + AUC, and Non-Maximum Suppression (NMS).
  • Speed evaluation indicator: FPS, i.e. the number of images processed per second (equivalently, the time required to process each image); of course, this must be compared under the same hardware conditions.

mAP, the confusion matrix, the PR curve, FPPI, and the F1-score are selected as accuracy metrics. Among them, mAP and the F1-score are quantitative metrics; FPPI can also be quantified, using the log-average miss rate as its quantitative summary. The confusion matrix and the PR curve reflect the quality of the model from different angles.

FLOPs is selected as the speed metric: it represents the amount of computation required to process one frame and is more hardware-independent than FPS. At the same time, since all models are evaluated on the same host, FPS is also used as a reference.

The following is a detailed introduction to the commonly used evaluation indicators in the field of target detection.

1. Accuracy evaluation index

1. mAP (mean Average Precision)

1.1 Definition of mAP and related concepts

  • mAP: mean Average Precision, which is the average value of each category AP
  • AP: The area under the PR curve, which will be explained in detail later
  • PR curve: Precision-Recall curve
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • TP (True Positive, true positive): The detector gives a positive sample, which is actually a positive sample, that is, the target is correctly detected
  • TN (True Negative, true negative): The detector gives a negative sample, which is actually a negative sample, that is, it correctly detects the non-target
  • FP (False Positive, False Positive): The detector gives a positive sample, but in fact it is a negative sample, that is, a false detection
  • FN (False Negative, False Negative): The detector gives a negative sample, but in fact it is a positive sample, that is, a missed detection
  • TP: the number of detection boxes with IoU > 0.5 (each Ground Truth is counted at most once)
  • FP: the number of detection boxes with IoU <= 0.5, plus redundant detection boxes that match an already-detected GT
  • FN: the number of GTs that are not detected

Notice:

(1) Generally speaking, mAP is for the entire data set;

(2) AP is for a certain category in the data set;

(3) The precision and recall are for a certain category of a single picture.

1.2 Specific calculation of mAP

  • mAP calculation methods for different datasets

Since mAP is the average AP over all categories in the dataset, computing mAP first requires knowing how to compute the AP of a single category. The per-category AP is computed similarly across datasets, with three main variants:

(1) Before VOC2010: take the maximum Precision at the 11 Recall points 0, 0.1, 0.2, …, 1; AP is the average of these 11 Precision values, and mAP is the average AP over all categories.

(2) VOC2010 and later: for every distinct Recall value (including 0 and 1), take the maximum Precision among points whose Recall is greater than or equal to that value, then compute the area under the resulting PR curve as the AP; mAP is again the average AP over all categories (a code sketch of this and the 11-point scheme follows after item (3)).

(3) COCO: set multiple IoU thresholds (0.5 to 0.95 in steps of 0.05); each IoU threshold yields an AP for the category, and the final AP of the category is the average of these APs over the different IoU thresholds.
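As an illustrative sketch of the difference between the pre-VOC2010 11-point interpolation and the VOC2010+ area-under-curve computation, the function below (its name and arguments are ours, not taken from any particular toolkit) computes the AP of one category from precision/recall arrays obtained by sweeping the confidence threshold:

```python
import numpy as np

def voc_ap(recall, precision, use_11_point=False):
    """Per-category AP from recall/precision arrays (recall sorted ascending).

    use_11_point=True  -> pre-VOC2010 11-point interpolation
    use_11_point=False -> VOC2010+ area under the interpolated PR curve
    """
    recall, precision = np.asarray(recall), np.asarray(precision)
    if use_11_point:
        ap = 0.0
        for t in np.arange(0.0, 1.1, 0.1):
            # max precision among points whose recall >= t (0 if none exists)
            p = np.max(precision[recall >= t]) if np.any(recall >= t) else 0.0
            ap += p / 11.0
        return ap
    # pad so the curve starts at recall 0 and ends at recall 1
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically non-increasing (the interpolation step)
    for i in range(mpre.size - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # sum the areas of the rectangles where recall changes
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```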

  • Calculate the AP of a class

From the concepts above, computing the AP of a category requires drawing that category's PR curve, so we need the precision and recall of that category on each image in the dataset.

By the formula:

  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)

You only need to count the number of TP, FP, and FN.

  • How to judge TP, FP, FN (important)

Take a single image as an example:

  • First traverse the ground-truth objects in the image,
  • then extract the GT objects of the category whose AP we want to compute,
  • then read the detection boxes of this category output by the detector (ignoring other categories),
  • then filter out boxes whose confidence scores are below the confidence threshold (some setups do not set one),
  • sort the remaining detection boxes by confidence score from high to low, and first check whether the highest-scoring box's IoU with a GT box exceeds the IoU threshold,
  • if the IoU is greater than the threshold, the box is judged a TP and that gt_bbox is marked as detected (subsequent redundant detections of the same GT all count as FP; this is why boxes are sorted by confidence from high to low, so that the most confident detection of each GT is the one counted as TP while later detections of the same GT object are counted as FP),
  • if the IoU is below the threshold, the box is counted directly as an FP.

The definition of the confidence score may differ between papers; it usually refers to the classification confidence, i.e. the probability that the object in the predicted box belongs to the given category. A code sketch of the matching procedure above is given below.
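Below is a minimal sketch of this per-image, per-category matching with an IoU threshold of 0.5; the function and argument names are illustrative placeholders, and real benchmark code adds extra rules (e.g. handling of "difficult" or ignored boxes):

```python
def match_detections(det_boxes, det_scores, gt_boxes, iou_fn,
                     iou_thr=0.5, score_thr=0.0):
    """Greedy TP/FP assignment for one category on one image.

    det_boxes / det_scores: predicted boxes of this category and their confidences
    gt_boxes: ground-truth boxes of this category
    iou_fn(a, b): returns the IoU of two boxes
    Returns (TP, FP, FN).
    """
    # drop low-confidence boxes, then sort the rest by confidence (high -> low)
    order = sorted((i for i, s in enumerate(det_scores) if s >= score_thr),
                   key=lambda i: det_scores[i], reverse=True)
    matched = [False] * len(gt_boxes)
    tp = fp = 0
    for i in order:
        ious = [iou_fn(det_boxes[i], g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda j: ious[j], default=None)
        if best is not None and ious[best] > iou_thr and not matched[best]:
            tp += 1
            matched[best] = True   # each GT may be matched at most once
        else:
            fp += 1                # low IoU, or a redundant detection of the same GT
    fn = len(gt_boxes) - tp        # GTs that no detection matched
    return tp, fp, fn
```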

  • Calculate mAP after NMS

This point must be clear: **mAP is computed after NMS.** mAP is the final evaluation metric of the detection model; after all operations are complete, the final detection results are the standard on which mAP is computed. It is worth mentioning that NMS is generally only used at test time; NMS is not applied during training, because training needs a large number of positive and negative samples to learn from.

2. Accuracy

Divide the number of correctly predicted samples by the total number of samples, that is:

Accuracy (classification) rate = number of correctly predicted positive and negative samples / total number of samples.

Accuracy is generally used to evaluate the global correctness of a model; it does not carry enough information to fully assess a model's performance.

3. Confusion Matrix

In the confusion matrix, the horizontal axis indexes the categories predicted by the model and the vertical axis indexes the true labels of the data.

The diagonal holds the number of samples whose prediction agrees with the label, so the sum of the diagonal divided by the total number of test samples is the accuracy. The larger the numbers on the diagonal the better; in visualizations, the darker the diagonal color, the higher the model's prediction accuracy for that category. Reading by rows, the off-diagonal entries of each row are the wrongly predicted categories. In general we want the diagonal to be as high as possible and the off-diagonal entries as low as possible.
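A minimal sketch of accumulating such a confusion matrix from true and predicted class indices (rows are true labels, columns are predicted labels, matching the description above):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true label, columns index the predicted label."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix([0, 1, 2, 1], [0, 2, 2, 1], num_classes=3)
accuracy = cm.trace() / cm.sum()   # sum of the diagonal / total number of samples
```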

4. Precision and Recall


Some related definitions. Suppose there is a test set whose images contain only wild geese and airplanes, and that the ultimate goal of the classification system is to retrieve all the airplane images in the test set, and none of the goose images.

  • True positives: positive samples correctly identified as positive samples; airplane images correctly recognized as airplanes.

  • True negatives: negative samples correctly identified as negative samples; goose images that are not flagged, i.e. the system correctly treats them as geese.

  • False positives: negative samples mistakenly identified as positive samples; goose images mistakenly recognized as airplanes.

  • False negatives: positive samples mistakenly identified as negative samples; airplane images that are not flagged, i.e. the system mistakenly treats them as geese.

  • **Precision** is the proportion of true positives among the images identified as positive. In this example, it is the proportion of real airplanes among everything identified as an airplane.

Precision = TP / (TP + FP)

  • Recall is the proportion of all positive samples in the test set that are correctly identified as positive samples . That is, in this hypothesis, the ratio of the number of correctly identified aircraft to the number of all real aircraft in the test set.

Recall = TP / (TP + FN)

  • **Precision-Recall curve:** vary the recognition threshold so that the system returns the top K images in turn; as the threshold changes, the Precision and Recall values change as well, which traces out the curve.

If a classifier performs well, it should behave as follows:

While the Recall value increases, the Precision value remains at a very high level.

A classifier with poor performance may lose a lot of Precision values ​​in exchange for an increase in Recall values.

Usually, the Precision-recall curve is used in the article to show the trade-off between Precision and Recall of the classifier.
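As an illustration of how these (Precision, Recall) points arise from sweeping the threshold over detections sorted by confidence, assuming each detection has already been marked TP or FP (for example by the matching sketch earlier):

```python
import numpy as np

def pr_curve(scores, is_tp, num_gt):
    """scores: confidence of every detection (all images pooled),
    is_tp: 1 if that detection was judged a TP, else 0,
    num_gt: total number of ground-truth boxes of this category."""
    order = np.argsort(-np.asarray(scores))           # highest confidence first
    tp_cum = np.cumsum(np.asarray(is_tp)[order])      # TPs accepted so far
    fp_cum = np.cumsum(1 - np.asarray(is_tp)[order])  # FPs accepted so far
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1)
    recall = tp_cum / max(num_gt, 1)
    return precision, recall                          # one point per threshold
```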

5. Average Precision (Average-Precision, AP) and mean Average Precision (mAP)

AP is the area under the Precision-recall curve . Generally speaking, the better the classifier, the higher the AP value.

mAP is the average of multiple class APs. The meaning of this mean is to average the AP of each class to obtain the value of mAP. The size of mAP must be in the [0,1] interval, and the larger the better. This indicator is the most important one in the target detection algorithm.

When positive samples are very scarce, the PR curve reflects the model's performance more clearly.


6. IoU

The value of IoU can be understood as the degree of overlap between the frame predicted by the system and the frame marked in the original picture.


It is computed as the area of the intersection of the Detection Result and the Ground Truth divided by the area of their union; this ratio serves as the detection accuracy.

IOU is the indicator that expresses the difference between this bounding box and groundtruth:

IoU = area(Detection Result ∩ Ground Truth) / area(Detection Result ∪ Ground Truth)
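A minimal sketch of this computation for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)      # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```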

7. ROC (Receiver Operating Characteristic) curve and AUC (Area Under Curve)

The receiver operating characteristic curve (ROC curve) is also known as the sensitivity curve.

The ROC space defines the false positive rate (FPR) as the X axis and the true positive rate (TPR) as the Y axis.

ROC curve:

  • Abscissa: False positive rate (False positive rate, FPR), FPR = FP / [ FP + TN], representing the probability of wrongly predicting positive samples in all negative samples, false alarm rate;
  • Vertical axis: True positive rate (TPR), TPR = TP / [TP + FN], representing the probability of correct prediction in all positive samples, the hit rate.

PD = (number of true targets detected) / (number of actual targets)

FA = (number of false detections) / (number of tested frames)

In detection tasks the ROC curve is sometimes drawn with FA on the horizontal axis and PD on the vertical axis.

The diagonal corresponds to a random-guess model, while the point (0,1) corresponds to the ideal model in which all positive examples are ranked before all negative examples.

The closer the curve is to the upper left corner, the better the performance of the classifier.

The ROC curve has a good property: when the distribution of positive and negative samples in the test set changes, the ROC curve can remain unchanged .

Class imbalance often occurs in real data sets , that is, there are many more negative samples than positive samples (or vice versa), and the distribution of positive and negative samples in the test data may also change over time.

ROC curve drawing:

(1) According to the probability value of each test sample belonging to a positive sample, it is sorted from large to small;

(2) Going from high to low, each score value is used as the threshold in turn: when the probability of a test sample being positive is greater than or equal to this threshold, we consider it a positive sample, otherwise a negative sample;

(3) Each time we select a different threshold, we can get a set of FPR and TPR, that is, a point on the ROC curve.

When the threshold is set to 1 and to 0, we obtain the two points (0,0) and (1,1) on the ROC curve respectively. Connecting all the (FPR, TPR) pairs gives the ROC curve; the more threshold values are taken, the smoother the curve.

  • AUC(Area Under Curve)

That is the area under the ROC curve. The closer the AUC is to 1, the better the performance of the classifier.

**Physical meaning:** the AUC value is a probability. If you randomly pick one positive sample and one negative sample, the AUC is the probability that the current classifier ranks the positive sample ahead of the negative sample according to its computed score. Naturally, the larger the AUC, the more likely the classifier is to rank positive samples ahead of negative ones, i.e. the better the classification.

Calculation: sum the areas of the rectangles (or trapezoids) under the curve; a code sketch follows.
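A small sketch of tracing the ROC curve by sweeping the score threshold and integrating the area with the trapezoidal rule (illustrative code, not a library call):

```python
import numpy as np

def roc_auc(scores, labels):
    """labels: 1 for positive samples, 0 for negative;
    scores: predicted probability of being positive.
    Each sample's score acts as a threshold in turn."""
    order = np.argsort(-np.asarray(scores))   # decreasing score
    labels = np.asarray(labels)[order]
    tp_cum = np.cumsum(labels)                # positives predicted positive so far
    fp_cum = np.cumsum(1 - labels)            # negatives predicted positive so far
    tpr = tp_cum / max(labels.sum(), 1)
    fpr = fp_cum / max((1 - labels).sum(), 1)
    tpr = np.concatenate(([0.0], tpr))        # add the (0, 0) point (threshold = 1)
    fpr = np.concatenate(([0.0], fpr))
    return np.trapz(tpr, fpr)                 # area under the curve
```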


  • Comparison of PR curve and ROC curve

——ROC curve characteristics:

(1) Advantages: When the distribution of positive and negative samples in the test set changes, the ROC curve can remain unchanged . Because TPR focuses on positive examples, FPR focuses on negative examples, making it a more balanced evaluation method.

Class imbalance often occurs in real data sets , that is, there are many more negative samples than positive samples (or vice versa), and the distribution of positive and negative samples in the test data may also change over time.

(2) Disadvantages: the advantage just mentioned, that the ROC curve does not change as the class distribution changes, is to some extent also its weakness. If the number of negative examples N increases greatly but the curve stays the same, that corresponds to a large number of additional FPs. In information retrieval, where the main concern is the prediction accuracy on positive examples, this is not acceptable. Under class imbalance, the huge number of negative examples means FPR grows only insignificantly, so the ROC curve gives an overly optimistic estimate of performance. The horizontal axis of the ROC curve is FPR; when the number of negative examples N far exceeds the number of positive examples P, a large increase in FP buys only a small change in FPR, so even though many negative examples are misjudged as positive, this cannot be seen intuitively on the ROC curve. (Of course, one can also analyze only the small left-hand portion of the ROC curve.)

——PR curve:

(1) The PR curve uses Precision, so both indicators of the PR curve focus on positive examples . In the category imbalance problem, since the main concern is positive examples , the PR curve is widely considered to be better than the ROC curve in this case.

Usage scenarios:

  1. The ROC curve is suitable for evaluating the overall performance of the classifier because it takes into account both positive and negative examples. In contrast, the PR curve focuses entirely on positive examples .
  2. If there are multiple datasets with different class distributions, for example the proportion of positive and negative examples in a credit-card-fraud problem may differ from month to month, and you simply want to compare classifier performance while eliminating the influence of the changing class distribution, the ROC curve is more suitable, because changes in the class distribution could make the PR curve fluctuate and models would then be hard to compare; conversely, if you want to test the impact of different class distributions on classifier performance, the PR curve is more suitable.
  3. If you want to evaluate the prediction of positive examples under the same category distribution , you should choose PR curve .
  4. In the category imbalance problem , the ROC curve usually gives an optimistic effect estimate , so most of the time the PR curve is better .
  5. Finally, according to the specific application, you can find the optimal point on the curve, get the corresponding precision, recall, f1 score and other indicators, and adjust the threshold of the model, so as to obtain a model that meets the specific application.

8. fppi/fppw

  • fppi = false positive per image

Suppose you finally draw the PR curve, compute the mAP, and present it proudly at the project progress meeting. When you finish, the product manager across the table replies, unhurried:
"I don't care about mAP, let alone your curve. I just want to know, on average, how many false detections your algorithm makes per image."

fppi curve

The vertical axis of the FPPI curve is the miss rate, and the horizontal axis is false positives per image.

Obviously, fppi is closer to practical application than PR curve.

The drawing method is similar to the PR curve: adjust the confidence threshold (thresh_conf), compute the relevant values, plot the points, and connect them.
The per-window counterpart is FPPW (false positives per window).

Before introducing miss rate versus false positives per image (hereinafter FPPI), we have to talk about another metric called miss rate versus false positives per window (hereinafter FPPW).

At first, FPPW was the standard evaluation metric for pedestrian detection. It first appeared in the paper Histograms of Oriented Gradients for Human Detection, which also published the INRIA pedestrian dataset and used FPPW for its performance evaluation (it is worth mentioning that the classic HOG+SVM pedestrian detection method was also proposed in that paper).

The following briefly introduces the detection principle of FPPW:

The vertical axis of FPPW is the miss rate and the horizontal axis is false positives per window; both axes are logarithmic:

miss rate = false negatives / positives, i.e. 1 - recall, which measures how many of all existing pedestrians (positives) are missed.

false positives per window = false positives / number of windows.

The number of windows is used because of how HOG+SVM works; its detection process is roughly as follows:

  • Input an image to be detected.
  • Use the sliding-window method to select a region of the image (hereinafter, a window).
  • Extract the HOG features of this region.
  • Feed the HOG features into the SVM, which classifies the window and decides whether it contains a pedestrian.

As can be seen from this process, because the SVM is only a classifier, detecting pedestrians of different sizes requires sliding many windows of different sizes; every examined region corresponds to a window, so a large number of windows are produced, and each window corresponds to one SVM prediction.

For a picture, we are concerned about whether the SVM can accurately judge these windows, so with false positive / the number of window, we can evaluate the detection performance of the SVM on this picture.

So how to get multiple miss rate and fppw values?

This is similar to the routine of the ROC curve, that is, by adjusting the detection threshold, a series of miss rate and fppw are obtained.

For example, the higher the threshold, the fewer detection boxes are accepted as detector output, but the more accurate those boxes are and the lower the chance of a false alarm; at the same time the probability of missed detection rises (true positives with low confidence become false negatives), so the miss rate increases while FPPW decreases. And vice versa.

The above is the calculation method when there is only one picture. For multiple pictures, it is actually similar. First put the results of all the pictures together, sort them according to the confidence level from high to low, and then adjust the detection threshold according to the level of confidence level, thus obtaining a series of miss rate and fppw, and then divide by the number of window (the number of window at this time is the number of windows on each picture * the number of pictures).

(Examples of adjusting detection thresholds are below)

How does FPPW quantify the comparison?

Because there is no direct way to compare curves quantitatively, the author uses the miss rate at FPPW = 10^-4 as the reference point for comparing results (playing a role similar to the AUC value for a ROC curve).

The above is the general principle of FPPW.

In the original paper, the author notes that the FPPW metric is very sensitive to changes in the miss rate: a small change in miss rate produces a large change in FPPW on the horizontal axis; for example, each 1% reduction in miss rate is equivalent to reducing the original FPPW by a factor of 1.57.

miss rate versus false positives per image (FPPI)

FPPW was introduced above, but it has the following problems:

  • It cannot reflect how false positives behave across different sizes and positions; that is, it cannot tell us how the classifier performs near a target or against backgrounds that resemble the target.
  • From a per-window count we know neither where a window lies in the image nor how large it is, so the useful information carried per window is small, and the per-window view has no particular advantage.
  • The FPPW metric is hard to interpret, because the per-window concept sits too close to the underlying detection mechanism. Intuitively, what we really want to know is "how many false detections are there per image"; we think in terms of the more macroscopic, practical application scenario and do not care about the result for each individual window.

Therefore, in the paper Pedestrian detection: A benchmark, the author proposes FPPI as a more suitable metric for pedestrian detection.

The main benefits of FPPI are as follows:

The per-image concept is closer to real applications and easier to understand.
The following briefly introduces how FPPI is computed:

The vertical axis of FPPI is miss rate, the horizontal axis is false positives per image, and both axes are represented by logarithmic axes:

miss rate = false negatives / positives, i.e. 1 - recall

false positives per image = false positives / number of images

We can see that only the horizontal axis has changed; the vertical axis is the same.

Similarly, we can also adjust the threshold to get a series of miss rate and fppi.

How does FPPI quantify and compare?

Similarly, there is no way to quantify the comparison between curves, so at the beginning, use the miss rate when FPPI=1 as the reference point for result comparison.

However, in the follow-up paper Pedestrian Detection: An Evaluation of the State of the Art (both papers are by Piotr Dollar), the author switched to using the log-average miss rate as the reference point for comparing results. It is computed as follows:

In the logarithmic coordinate system, take 9 FPPI values evenly spaced between 10^-2 and 10^0; these 9 FPPI values correspond to 9 miss rate values, and averaging these 9 miss rate values gives the log-average miss rate.

(For some curves that end early before reaching a specific FPPI value, the miss rate value takes the minimum value that the curve can reach)
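A sketch of this computation, assuming the curve is given as an ascending array of fppi values and the corresponding miss rates; following the usual reading of "log-average", the 9 sampled miss rates are averaged in log space (a plain arithmetic mean would be the literal reading of the sentence above):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """fppi: ascending FPPI values of one detector's curve; miss_rate: matching miss rates."""
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(-2.0, 0.0, num=9)       # 9 points evenly spaced in log space
    sampled = []
    for r in refs:
        below = np.where(fppi <= r)[0]
        if below.size:
            # last curve point at or below this reference FPPI;
            # if the curve ends early, this reuses its final (lowest) miss rate
            sampled.append(miss_rate[below[-1]])
        else:
            sampled.append(miss_rate[0])       # curve starts above this reference
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))
```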

What is a curve that ends early?

We can look at the picture below

(figure: miss rate versus FPPI curves of several pedestrian detectors, plotted on log axes)

The second-to-last curve, the purple HogLbp, ends early before reaching 10^0, and you will also notice that the different curves have different lengths. Why is that? It is related to the outputs of the different detectors.

The curve in an FPPI plot is essentially a set of [fppi, mr] points connected into a line; a curve that ends early means the largest fppi among those points never reaches 10^0. The set of [fppi, mr] points is obtained by adjusting the detector's threshold, and the thresholds are chosen according to the number of detection boxes the detector outputs and their confidence scores.

For example, suppose detector A processes 3 images and outputs a total of 10 detection boxes on them. Each detection box has a corresponding confidence; sorting the boxes by confidence from high to low gives, say: 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45.

We first choose 0.9 as the threshold of the detector. If the detection frame is greater than or equal to 0.9, we think there are pedestrians, and if the detection frame is lower than 0.9, we think there are no pedestrians, so we get a [fppi, mr] point.

Next, we choose 0.85 as the threshold of the detector. If the detection frame is greater than or equal to 0.85, we think there are pedestrians, and if the detection frame is lower than 0.85, we think there are no pedestrians, so we get a [fppi,mr] point.

Continuing in this way down to 0.45, we obtain 10 [fppi, mr] points in total: the detector yields as many [fppi, mr] points as it outputs detection boxes. Suppose that at the threshold 0.45 the [fppi, mr] point of detector A is [0.8, 0.25]; then detector A's curve can only be drawn up to fppi = 0.8, so it never reaches 10^0. A different detector may output different detection boxes. For example, suppose detector B processes the same 3 images and also outputs 10 detection boxes; following the same steps it yields 10 [fppi, mr] points, and if at the threshold 0.45 its [fppi, mr] point is [1.5, 0.25], then detector B's curve extends beyond 10^0 when drawn.

(The number of detection frames output by different detectors may be different, which is also an influencing factor. In order to simplify the expression, it is not considered in the above example)

Put bluntly, different detectors simply perform differently. When the detector's threshold is set to the lowest confidence, essentially every detection box is treated as a pedestrian. At that point, a detector with many false detection boxes reaches a relatively high fppi, while a detector with few false detections reaches only a relatively low maximum fppi, so the upper bound of fppi differs between detectors.

The above is the general principle of FPPI.

Obtaining FPPI curve from ROC curve
In the actual code that draws the FPPI curve, the author uses names containing "Roc", such as compRoc and plotRoc, so let us look at how, in theory, the FPPI curve is obtained from the ROC curve.


The y-axis of the ROC curve is TPR (True positive rate), and the x-axis is FPR (False positive rate):

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

The formula for TPR is the same as for recall, so we can regard TPR = recall.

The y-axis of the FPPI curve is miss rate, and the x-axis is fppi (false positives per image):

miss rate = FN / (TP + FN)
fppi = FP / (number of images)

Conversion on the y-axis:

miss rate = (TP + FN - TP) / (TP + FN) = 1 - recall = 1 - TPR

So the y value of the FPPI curve is simply 1 minus the y value of the ROC curve.

Transformation about the x-axis

In the compRoc function, the quantity computed for the x-axis of the "ROC" curve is actually fppi. This is really just a matter of naming.

Usually, when we hear the ROC curve, we think of a curve with TPR on the y-axis and FPR on the x-axis.

But in the author's code, the ROC he refers to refers to the curve whose y-axis is TPR and the x-axis is fppi. So there is no conversion between FPR and fppi like the y-axis, because the author directly calculates fppi

We can also understand this naming from another angle: the FPR of the ROC curve and the fppi of the FPPI curve are essentially very similar. The numerators of both are FP, and both metrics focus on false detections.

The difference between the two is only that the denominator is different. The denominator of FPR is "all negative examples", and the denominator of fppi is "all pictures".

This is also well understood, because in the pedestrian detection task, the number of "all negative examples" is too much! There are only a few pedestrians in a picture (that is, "positive examples"), and places without pedestrians can be considered "negative examples". We cannot determine this number.

Therefore, in the pedestrian detection task, FPR cannot be calculated, so fppi is used to evaluate the false detection situation.

9. Non-Maximum Suppression (NMS)

Non-Maximum Suppression is to find a bounding box with a relatively high confidence based on the score matrix and the coordinate information of the region. For overlapping predicted boxes, only the one with the highest score is kept.

(1) NMS calculates the area of ​​each bounding box, and then sorts according to the score, and takes the bounding box with the largest score as the first object to be compared in the queue;

(2) Compute the IoU between each remaining bounding box and the current highest-scoring box; remove the boxes whose IoU exceeds the set threshold and keep those with small IoU as candidate prediction boxes;

(3) Then repeat the above process until the candidate bounding box is empty.

Finally, two thresholds are involved in this process: one is the IoU threshold, and the other is a score threshold used afterwards to remove candidate boxes whose score is too low. Note that Non-Maximum Suppression processes one category at a time: if there are N categories, NMS must be executed N times. A single-class sketch is given below.
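A minimal single-class NMS sketch with NumPy, using boxes in (x1, y1, x2, y2) format:

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Returns the indices of the boxes kept for one category."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU between the current best box and all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # keep only the boxes that overlap the current best box weakly
        order = order[1:][iou <= iou_thr]
    return keep
```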

10. F1-Score

The F1 score (F1-score) is a metric for classification problems. It treats recall and precision as equally important, and machine learning competitions with multi-class problems often use the F1-score as the final evaluation method. It is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0. It is computed as:

F1 = 2TP / (2TP + FP + FN)

There are also F2 and F0.5 scores: the F2 score treats recall as twice as important as precision, and the F0.5 score treats recall as half as important as precision.

More generally, we can define an F score whose weighting between precision and recall is adjustable:

Fβ = ((1 + β²) · precision · recall) / (β² · precision + recall)

Commonly used variants are F2 and F0.5.
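A small helper implementing the formula above (beta = 1 gives F1, beta = 2 gives F2, beta = 0.5 gives F0.5):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```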

2. Speed evaluation indicators

1 Overview

Many practical applications of object detection have high requirements on both accuracy and speed. If speed is ignored and only breakthroughs in accuracy are pursued, the cost is higher computational complexity and larger memory requirements, and scalability remains an open question for full industrial deployment. In general, the speed metrics used in object detection are:

(1) FPS, the number of pictures that the detector can process per second

(2) The time required for the detector to process each picture

However, speed must always be evaluated on the same hardware. On the same hardware, the peak FLOPS (floating-point operations per second, which measures hardware performance; note the distinction from FLOPs below) is fixed, while different networks require different FLOPs (floating-point operations) to process one image. Therefore, for the same hardware and the same image, the smaller the required FLOPs, the more images can be processed in the same amount of time and the faster the model runs. The FLOPs needed per image depend on many factors, such as the number of layers, the number of parameters, and the chosen activation functions. Considering only the parameter count: in general, the fewer parameters a network has, the smaller its FLOPs, the less memory is needed to store the model, and the lower the hardware memory requirements, which makes it friendlier to embedded devices.

  • Generally speaking, ResNeXt + Faster R-CNN can reach roughly 1 second per image on an NVIDIA GPU
  • while MobileNet + SSD can reach roughly 300 ms per image on an ARM chip

Everyone knows the computing power gap between ARM and GPU.

2. FLOPs calculation

  • FLOPs and FLOPS distinction

First explain FLOPs: floating point operations refers to the number of floating point operations, which is understood as the amount of calculation and can be used to measure the complexity of the algorithm/model.

FLOPS (all uppercase), by contrast, refers to floating-point operations per second, which is understood as computation speed and is a standard for rating hardware. Since what we want is a measure of model complexity, we use FLOPs.

  • FLOPs calculation (the following calculation of FLOPs does not consider the operation of the activation function)

(1) Convolution layer

FLOPs = (2 × Ci × K × K − 1) × H × W × Co (without bias)

FLOPs = 2 × Ci × K × K × H × W × Co (with bias)

Ci is the number of input feature map channels, K is the filter size, H, W, Co are the height, width and number of channels of the output feature map.


The convolution produces Co output feature maps, each with H × W pixels. The value of each pixel is obtained by convolving one filter with the input feature map: the filter has Ci × K × K weights, each multiplied with the corresponding input value (Ci × K × K floating-point multiplications), and the resulting products are then summed (adding n numbers takes n − 1 floating-point operations, here Ci × K × K − 1). This yields one value, i.e. one pixel of one output feature map. There are Co output feature maps, so Co filters take part in the convolution, giving the convolutional layer FLOPs = (2 × Ci × K × K − 1) × H × W × Co (where Ci is the number of input channels, K the filter size, and H, W, Co the height, width, and channel count of the output feature map).
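The formula above written as a small helper (names are illustrative):

```python
def conv_flops(c_in, k, h_out, w_out, c_out, with_bias=False):
    """FLOPs of one convolution layer: (2*Ci*K*K - 1)*H*W*Co without bias,
    one extra addition per output element with bias."""
    per_output = 2 * c_in * k * k - 1 + (1 if with_bias else 0)
    return per_output * h_out * w_out * c_out
```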

(2) Pooling layer

Pooling is divided into max pooling and average pooling. According to other blogs, a network generally contains few pooling layers and pooling accounts for very few FLOPs, so it has little impact on speed. Still, a rough accounting:

Max pooling has no parameters, but it does involve computation, similar to Dropout and the like.

Average pooling first sums the values and then divides by their count: to produce one pixel of the output feature map, the required floating-point operations are K × K − 1 + 1 (summing K × K numbers takes K × K − 1 operations, and the division takes 1). With Co output channels, the floating-point count here should be K × K × H × W × Co (I am not sure whether this is exactly right; corrections are welcome).

(3) Fully connected layer

First explain the fully connected layer

Fully connected layers in convolutional neural networks:
In a CNN, one or more fully connected layers follow the convolutional and pooling layers. As in an MLP, every neuron in a fully connected layer is connected to all neurons of the previous layer. Fully connected layers can integrate the class-discriminative local information from the convolutional or pooling layers. To improve performance, the activation function of each fully connected neuron usually adopts ReLU. The output of the last fully connected layer is passed to an output that can be classified with softmax logistic regression (softmax regression), also called the softmax layer. For a specific classification task, choosing an appropriate loss function is very important; CNNs have several commonly used loss functions, each with different characteristics. Usually the fully connected part of a CNN has the same structure as an MLP, and CNN training mostly uses the BP algorithm.
Each node of a fully connected layer is connected to all nodes of the previous layer and serves to integrate the features extracted before it. Because of this full connectivity, fully connected layers generally carry the most parameters. For example, in VGG16 the first fully connected layer FC1 has 4096 nodes and the preceding layer POOL2 has 7 × 7 × 512 = 25088 nodes, so the connection requires 4096 × 25088 weights, which consumes a lot of memory.


Here x1, x2, and x3 are the inputs of the fully connected layer, and a1, a2, and a3 are its outputs; each output is a weighted sum of all the inputs plus a bias.


In fact, fully connected layers are rarely used any more; CNNs are often fully convolutional (FCN), and a convolutional layer can realize the function of a fully connected layer. Let I be the number of input neurons and O the number of output neurons. Each output neuron multiplies every input neuron by a weight (I floating-point operations), sums the resulting products (I − 1 operations), and adds a bias (1 operation), so the FLOPs are:

FLOPs = (I + I − 1) × O = (2I − 1) × O (without bias)

FLOPs = (I + I − 1 + 1) × O = 2I × O (with bias)
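And the corresponding helper for a fully connected layer:

```python
def fc_flops(num_in, num_out, with_bias=False):
    """FLOPs of one fully connected layer: (2I - 1)*O without bias, 2I*O with bias."""
    per_output = 2 * num_in - 1 + (1 if with_bias else 0)
    return per_output * num_out
```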

  • FLOPs and parameter counting tool

Recently I found THOP (PyTorch-OpCounter), a small tool open-sourced on GitHub for computing the FLOPs and parameter counts of models in the PyTorch framework. It is very easy to use and easy to install. The author's open-source link: THOP: PyTorch-OpCounter.
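A typical usage sketch based on the tool's README (THOP reports multiply-accumulate operations, usually called MACs, together with the parameter count; the exact units are worth checking against the tool's documentation):

```python
import torch
from torchvision.models import resnet50
from thop import profile  # pip install thop

model = resnet50()
dummy_input = torch.randn(1, 3, 224, 224)
macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs: {macs:.3e}, parameters: {params:.3e}")
```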

Reference links

Target Detection Evaluation Index

Machine Learning Algorithm Evaluation Index - 2D Object Detection

[Pedestrian detection] miss rate versus false positives per image (FPPI) past and present (theory)
