Microsoft COCO: Common Objects in Context - Detection Evaluation

http://cocodataset.org/#detection-eval


1. Detection Evaluation
This page describes the detection evaluation metrics used by COCO. The evaluation code provided here can be used to obtain results on the publicly available COCO validation set. It computes multiple metrics described below. To obtain results on the COCO test set, for which ground-truth annotations are hidden, generated results must be uploaded to the evaluation server. The exact same evaluation code, described below, is used to evaluate results on the test set.
http://cocodataset.org/#upload

2. Metrics
The following 12 metrics are used for characterizing the performance of an object detector on COCO:

Average Precision (AP):
AP            % AP at IoU=.50:.05:.95 (primary challenge metric)
AP^IoU=.50    % AP at IoU=.50 (PASCAL VOC metric)
AP^IoU=.75    % AP at IoU=.75 (strict metric)

AP Across Scales:
AP^small      % AP for small objects: area < 32^2
AP^medium     % AP for medium objects: 32^2 < area < 96^2
AP^large      % AP for large objects: area > 96^2

Average Recall (AR):
AR^max=1      % AR given 1 detection per image
AR^max=10     % AR given 10 detections per image
AR^max=100    % AR given 100 detections per image

AR Across Scales:
AR^small      % AR for small objects: area < 32^2
AR^medium     % AR for medium objects: 32^2 < area < 96^2
AR^large      % AR for large objects: area > 96^2

1. Unless otherwise specified, AP and AR are averaged over multiple Intersection over Union (IoU) values. Specifically, we use 10 IoU thresholds of .50:.05:.95. This is a break from tradition, where AP is computed at a single IoU of .50 (which corresponds to our metric AP^IoU=.50). Averaging over IoUs rewards detectors with better localization (a minimal IoU sketch is given after these notes).
2. AP is averaged over all categories. Traditionally, this is called "mean average precision" (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.
3. AP (averaged across all 10 IoU thresholds and all 80 categories) will determine the challenge winner. This should be considered the single most important metric when considering performance on COCO.
4. In COCO, there are more small objects than large objects. Specifically: approximately 41% of objects are small (area < 32^2), 34% are medium (32^2 < area < 96^2), and 24% are large (area > 96^2). Area is measured as the number of pixels in the segmentation mask.
5. AR is the maximum recall given a fixed number of detections per image, averaged over categories and IoUs. AR is related to the metric of the same name used in proposal evaluation but is computed on a per-category basis.
6. All metrics are computed allowing for at most 100 top-scoring detections per image (across all categories).
7. The evaluation metrics for detection with bounding boxes and segmentation masks are identical in all respects except for the IoU computation (which is performed over boxes or masks, respectively).
https://arxiv.org/abs/1502.05082
What makes for effective detection proposals?
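
For readers sanity-checking their own matching logic, below is a minimal, self-contained sketch of box IoU and the constants behind the primary metric. It is only an illustration using corner-format boxes, not the pycocotools implementation (which computes IoU through its mask utilities and uses [x, y, width, height] boxes in the results format).

import numpy as np

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2] (corner format)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The 10 IoU thresholds of the primary metric: 0.50, 0.55, ..., 0.95.
iou_thresholds = np.linspace(0.5, 0.95, 10)

# Area ranges (pixels in the segmentation mask) for the small/medium/large
# breakdown, roughly matching the cocoeval.py defaults: all, <32^2, 32^2..96^2, >96^2.
area_ranges = [[0, 1e10], [0, 32**2], [32**2, 96**2], [96**2, 1e10]]

# Example: two 10x10 boxes overlapping by half horizontally -> IoU = 1/3.
print(round(box_iou([0, 0, 10, 10], [5, 0, 15, 10]), 3))  # 0.333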

3. Evaluation Code
Evaluation code is available on the COCO github. Specifically, see either CocoEval.m or cocoeval.py in the Matlab or Python code, respectively. Also see evalDemo in either the Matlab or Python code (demo). Before running the evaluation code, please prepare your results in the format described on the results format page.
https://github.com/cocodataset/cocoapi
https://github.com/cocodataset/cocoapi/blob/master/MatlabAPI/CocoEval.m
https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py
https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb
http://cocodataset.org/#format-results

The evaluation parameters are as follows (defaults in brackets, in general no need to change):

params{
    "imgIds"  : [all] N img ids to use for evaluation
    "catIds"  : [all] K cat ids to use for evaluation
    "iouThrs" : [.5:.05:.95] T=10 IoU thresholds for evaluation
    "recThrs" : [0:.01:1] R=101 recall thresholds for evaluation
    "areaRng" : [all,small,medium,large] A=4 area ranges for evaluation
    "maxDets" : [1 10 100] M=3 thresholds on max detections per image
    "useSegm" : [1] if true evaluate against ground-truth segments
    "useCats" : [1] if true use category labels for evaluation
}
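
As a concrete illustration of where these parameters live, here is a minimal Python sketch of running the evaluation with pycocotools. The annotation and result file names are placeholders; note also that in the current Python API the box/mask switch is the iouType constructor argument rather than the useSegm flag listed above.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names -- substitute your own ground truth and results.
coco_gt = COCO('instances_val2017.json')         # ground-truth annotations
coco_dt = coco_gt.loadRes('my_detections.json')  # detections in COCO results format

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')  # use 'segm' for mask evaluation

# The defaults generally need no change, but any parameter can be overridden, e.g.:
coco_eval.params.imgIds = sorted(coco_gt.getImgIds())
coco_eval.params.maxDets = [1, 10, 100]

coco_eval.evaluate()    # per-image, per-category matching
coco_eval.accumulate()  # build the precision/recall arrays
coco_eval.summarize()   # print the 12 metrics described above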

Running the evaluation code via calls to evaluate() and accumulate() produces two data structures that measure detection quality. The two structs are evalImgs and eval, which measure quality per-image and aggregated across the entire dataset, respectively. The evalImgs struct has KxA entries, one per evaluation setting, while the eval struct combines this information into precision and recall arrays. Details for the two structs are below (see also CocoEval.m or cocoeval.py):


evalImgs[{
"dtIds"
: [1xD] id for each of the D detections (dt)
"gtIds"
: [1xG] id for each of the G ground truths (gt)
"dtImgIds"
: [1xD] image id for each dt
"gtImgIds"
: [1xG] image id for each gt
"dtMatches"
: [TxD] matching gt id at each IoU or 0
"gtMatches"
: [TxG] matching dt id at each IoU or 0
"dtScores"
: [1xD] confidence of each dt
"dtIgnore"
: [TxD] ignore flag for each dt at each IoU
"gtIgnore"
: [1xG] ignore flag for each gt
}]
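
Continuing the sketch above, the Python API exposes evalImgs after evaluate() as a flat list with one entry per (category, area range, image) combination; combinations with no detections and no ground truth are None. The field names below follow cocoeval.py and mirror the struct above:

# After coco_eval.evaluate() has run (see the sketch above).
entry = next(e for e in coco_eval.evalImgs if e is not None)

print(entry['image_id'], entry['category_id'])
print(entry['dtMatches'].shape)  # T x D: matched gt id per IoU threshold, 0 if unmatched
print(entry['dtScores'][:5])     # confidence of the highest-scoring detections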

eval{
    "params"    : parameters used for evaluation
    "date"      : date evaluation was performed
    "counts"    : [T,R,K,A,M] parameter dimensions (see above)
    "precision" : [TxRxKxAxM] precision for every evaluation setting
    "recall"    : [TxKxAxM] max recall for every evaluation setting
}

Finally summarize() computes the 12 detection metrics defined earlier based on the eval struct.
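
Continuing the same sketch, after accumulate() and summarize() the 12 metrics are also available programmatically, and the accumulated precision array can be sliced directly using the [TxRxKxAxM] layout above. For example (assuming the coco_eval object from the earlier sketch):

import numpy as np

# summarize() stores the 12 printed numbers in coco_eval.stats
# (index 0 = AP at IoU=.50:.95, index 1 = AP at IoU=.50, ...; see summarize() in cocoeval.py).
print('AP         :', coco_eval.stats[0])
print('AP@IoU=.50 :', coco_eval.stats[1])

# precision has shape [T, R, K, A, M]; entries of -1 mark settings with no ground truth.
precision = coco_eval.eval['precision']

# Example: recover AP at IoU=.50 (T index 0) over all areas (A index 0) with
# maxDets=100 (M index 2), averaged over recall thresholds and categories.
# This should match the AP^IoU=.50 summary value.
p = precision[0, :, :, 0, 2]
ap50 = np.mean(p[p > -1])
print('AP@IoU=.50 (recomputed):', ap50)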

4. Analysis Code
In addition to the evaluation code, we also provide a function analyze() for performing a detailed breakdown of false positives. This was inspired by Diagnosing Error in Object Detectors by Derek Hoiem et al., but is quite different in implementation and details. The code generates plots like this:

Both plots show analysis of the ResNet (bbox) detector from Kaiming He et al., winner of the 2015 Detection Challenge. The first plot shows a breakdown of errors of ResNet for the person class; the second plot is an overall analysis of ResNet averaged over all categories.
Each plot is a series of precision recall curves where each PR curve is guaranteed to be strictly higher than the previous as the evaluation setting becomes more permissive. The curves are as follows:

C75: PR at IoU=.75 (AP at strict IoU), area under curve corresponds to AP^IoU=.75 metric.
C50: PR at IoU=.50 (AP at PASCAL IoU), area under curve corresponds to AP^IoU=.50 metric.
Loc: PR at IoU=.10 (localization errors ignored, but not duplicate detections). All remaining settings use IoU=.1.
Sim: PR after supercategory false positives (fps) are removed. Specifically, any matches to objects with a different class label but that belong to the same supercategory don't count as either a fp (or tp). Sim is computed by setting all objects in the same supercategory to have the same class label as the class in question and setting their ignore flag to 1. Note that person is a singleton supercategory so its Sim result is identical to Loc.
Oth: PR after all class confusions are removed. Similar to Sim, except now if a detection matches any other object it is no longer a fp (or tp). Oth is computed by setting all other objects to have the same class label as the class in question and setting their ignore flag to 1.
BG: PR after all background (and class confusion) fps are removed. For a single category, BG is a step function that is 1 until max recall is reached then drops to 0 (the curve is smoother after averaging across categories).
FN: PR after all remaining errors are removed (trivially AP=1).


The area under each curve is shown in brackets in the legend. In the case of the ResNet detector, overall AP at IoU=.75 is .399 and perfect localization would increase AP to .682. Interestingly, removing all class confusions (both within supercategory and across supercategories) would only raise AP slightly to .713. Removing background fps would bump performance to .870 AP and the rest of the errors are missing detections (although presumably if more detections were added this would also add lots of fps). In summary, ResNet's errors are dominated by imperfect localization and background confusions.

For a given detector, the code generates a total of 372 plots! There are 80 categories, 12 supercategories, and 1 overall result, for a total of 93 different settings, and the analysis is performed at 4 scales (all, small, medium, large, so 93 * 4 = 372 plots). The file naming is [supercategory]-[category]-[size].pdf for the 80 * 4 per-category results, overall-[supercategory]-[size].pdf for the 12 * 4 per supercategory results, and overall-all-[size].pdf for the 1 * 4 overall results. Of all the plots, typically the overall and supercategory results are of the most interest.

Note: analyze() can take significant time to run, please be patient. As such, we typically do not run this code on the evaluation server; you must run the code locally using the validation set. Finally, currently analyze() is only part of the Matlab API; Python code coming soon.

Glossary
TP: true positive
TN: true negative
FP: false positive
FN: false negative
ROC curve: receiver operating characteristic curve
IoU: Intersection over Union
AP: average precision
AR: average recall
mAP: mean average precision
PR: precision-recall

References
http://cocodataset.org/#home
http://cocodataset.org/#detection-eval


Reposted from blog.csdn.net/chengyq116/article/details/80445183