Deep Learning (Object Detection): mAP, an Evaluation Method for Multi-label Image Classification

In a multi-label image classification task, each image has more than one label, so the task cannot be evaluated with the standard metric of ordinary single-label classification, i.e. mean accuracy. Instead it borrows a method from information retrieval: mAP (mean Average Precision). Although mAP looks similar to mean accuracy on the surface, its computation is considerably more involved. The calculation of mAP is as follows:

First, run the trained model on all test samples to obtain a confidence score for each, and save the confidence scores for each class (such as car) to a file (such as comp1_cls_test_car.txt). Suppose there are 20 test samples in total; their ids, confidence scores and ground-truth labels are as follows:

 

Next, sort the samples by confidence score in descending order to obtain:

This table is important: all of the precision and recall values below are computed from it.

Then compute precision and recall, which are defined as precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP and FN stand for true positives, false positives and false negatives respectively:

The picture above makes this more intuitive. The circle (true positives + false positives) contains the elements we selected, which correspond to the results returned in the classification task. For example, when classifying the test samples with the trained car model, if we take the top-5 results we get:

In this example, the true positives are the 4th and 2nd images, and the false positives are the 13th, 19th and 6th images. The elements inside the box but outside the circle (false negatives and true negatives) are everything that was not selected; in this case, the images whose confidence scores rank outside the top-5, namely:

 

Among them, the false negatives are the 9th, 16th, 7th and 20th images, and the true negatives are the 1st, 18th, 5th, 15th, 10th, 17th, 12th, 14th, 8th, 11th and 3rd images.

So in this example, Precision = 2/5 = 40%: for the class car we selected 5 samples, 2 of which are correct, giving a precision of 40%. Recall = 2/6 ≈ 33%: among all the test samples there are 6 cars in total, but we recalled only 2 of them, so the recall is about 33%.
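These counts can be checked in a few lines of Python. The top-5 ranking and the set of car images come from the example above; the confidence scores themselves do not matter here, only the ordering.

```python
# Top-5 images by confidence score, as in the example table.
top5 = [4, 2, 13, 19, 6]
# Images whose ground-truth label is "car" (6 positives in total).
car_images = {2, 4, 7, 9, 16, 20}

tp = sum(1 for i in top5 if i in car_images)   # true positives among top-5
precision = tp / len(top5)                     # 2 / 5 = 40%
recall = tp / len(car_images)                  # 2 / 6 ≈ 33%
print(precision, recall)
```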

In a real multi-class classification task, we are usually not content to judge a model by top-5 alone; we want the precision and recall at every cutoff from top-1 to top-N (where N is the total number of test samples, 20 in this article). Clearly, as more and more samples are selected, recall can only increase, while precision shows an overall downward trend. Taking recall as the x-axis and precision as the y-axis gives the familiar precision-recall curve. The precision-recall curve for this example is as follows:
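The sweep from top-1 to top-N can be sketched as below. Note that the exact ordering below rank 5 is not fully specified in the post, so the ranking here is an assumption chosen to be consistent with the example.

```python
# Image ids sorted by descending confidence score (ranks 6-20 assumed).
ranked_ids = [4, 2, 13, 19, 6, 9, 16, 7, 20, 1,
              18, 5, 15, 10, 17, 12, 14, 8, 11, 3]
car_images = {2, 4, 7, 9, 16, 20}   # 6 ground-truth positives

curve = []          # (recall, precision) points, one per cutoff k
tp = 0
for k, img_id in enumerate(ranked_ids, start=1):
    tp += img_id in car_images
    curve.append((tp / len(car_images), tp / k))

# Recall only ever grows with k; precision trends downward overall.
```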

Next, the computation of AP, following here the method of the PASCAL VOC Challenge. First fix a set of thresholds, [0, 0.1, 0.2, ..., 1]. For each threshold (for example recall > 0.3), take the maximum precision among all points whose recall exceeds it. This yields 11 precision values, and AP is their average. In English this method is called 11-point interpolated average precision.
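A minimal sketch of the 11-point interpolation (using recall ≥ t, as the VOC development kit does), applied to the running example; the ranking below rank 5 is an assumption made for illustration.

```python
def eleven_point_ap(curve):
    """curve: list of (recall, precision) points. Returns the 11-point AP."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:        # thresholds 0, 0.1, ..., 1
        ps = [p for r, p in curve if r >= t]     # points with recall >= t
        ap += max(ps) if ps else 0.0             # best precision among them
    return ap / 11

# Rebuild the example curve (ranks 6-20 assumed).
ranked_ids = [4, 2, 13, 19, 6, 9, 16, 7, 20, 1,
              18, 5, 15, 10, 17, 12, 14, 8, 11, 3]
car_images = {2, 4, 7, 9, 16, 20}
curve, tp = [], 0
for k, img_id in enumerate(ranked_ids, start=1):
    tp += img_id in car_images
    curve.append((tp / len(car_images), tp / k))

print(eleven_point_ap(curve))
```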

The PASCAL VOC Challenge switched to a different computation after 2010, however. The new method assumes that M of the N samples are positives, which gives M recall values (1/M, 2/M, ..., M/M). For each recall value r, compute the maximum precision among the points with recall r' ≥ r, then average these M precision values to obtain the final AP. The computation is as follows:
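A sketch of the post-2010 computation under the same assumed ranking: M = 6 positives give recall levels 1/6, ..., 6/6, and AP averages the best precision at or beyond each level.

```python
def voc_ap(curve, num_positives):
    """Post-2010 VOC AP: average, over recall levels m/M, of the maximum
    precision among points whose recall is at least that level."""
    total = 0.0
    for m in range(1, num_positives + 1):
        r = m / num_positives
        total += max(p for rec, p in curve if rec >= r)
    return total / num_positives

# Rebuild the example curve (ranks 6-20 assumed).
ranked_ids = [4, 2, 13, 19, 6, 9, 16, 7, 20, 1,
              18, 5, 15, 10, 17, 12, 14, 8, 11, 3]
car_images = {2, 4, 7, 9, 16, 20}
curve, tp = [], 0
for k, img_id in enumerate(ranked_ids, start=1):
    tp += img_id in car_images
    curve.append((tp / len(car_images), tp / k))

print(voc_ap(curve, len(car_images)))
```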

 

 

The corresponding precision-recall curve (this curve is monotonically decreasing) is as follows:

 

AP measures how good the learned model is on a single class; mAP measures how good it is across all classes. Once the AP for each class has been computed, mAP is simple: it is just the mean of all the AP values.
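Concretely, the final step is a plain arithmetic mean. The class names and AP values below are purely illustrative, not results from the post.

```python
# Hypothetical per-class AP values (illustrative only).
ap_per_class = {"car": 0.788, "person": 0.652, "dog": 0.540}

mAP = sum(ap_per_class.values()) / len(ap_per_class)
print(f"mAP = {mAP:.3f}")
```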

