COCO detection evaluation metric


http://cocodataset.org/#detection-eval

1. Detection Evaluation

This page describes the detection evaluation metrics used by COCO. The evaluation code provided here can be used to obtain results on the publicly available COCO validation set. It computes multiple metrics described below. To obtain results on the COCO test set, for which ground-truth annotations are hidden, generated results must be uploaded to the evaluation server. The exact same evaluation code, described below, is used to evaluate results on the test set.

2. Metrics

The following 12 metrics are used for characterizing the performance of an object detector on COCO (a short parameter sketch in Python follows the list):

Average Precision (AP):

AP                 % AP at IoU=.50:.05:.95 (primary challenge metric)
AP^IoU=.50         % AP at IoU=.50 (PASCAL VOC metric)
AP^IoU=.75         % AP at IoU=.75 (strict metric)

AP Across Scales:

AP^small           % AP for small objects: area < 32^2
AP^medium          % AP for medium objects: 32^2 < area < 96^2
AP^large           % AP for large objects: area > 96^2

Average Recall (AR):

AR^max=1           % AR given 1 detection per image
AR^max=10          % AR given 10 detections per image
AR^max=100         % AR given 100 detections per image

AR Across Scales:

AR^small           % AR for small objects: area < 32^2
AR^medium          % AR for medium objects: 32^2 < area < 96^2
AR^large           % AR for large objects: area > 96^2
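
The IoU thresholds, object-area ranges, and per-image detection caps behind these 12 metrics correspond to fixed parameter settings in the evaluation code. Below is a minimal sketch of those defaults; the variable names are illustrative, and the authoritative values live in the Params class of cocoeval.py.

```python
import numpy as np

# IoU thresholds .50:.05:.95 (10 values); the primary AP averages over all of them
iou_thresholds = np.linspace(0.50, 0.95, 10)

# Object-area ranges in pixels (area = number of pixels in the segmentation mask)
area_ranges = {
    'all':    [0,       1e5 ** 2],
    'small':  [0,       32 ** 2],   # area < 32^2
    'medium': [32 ** 2, 96 ** 2],   # 32^2 < area < 96^2
    'large':  [96 ** 2, 1e5 ** 2],  # area > 96^2
}

# Per-image detection caps used for the AR^max=1/10/100 metrics
max_detections = [1, 10, 100]
```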

  1. Unless otherwise specified, AP and AR are averaged over multiple Intersection over Union (IoU) values. Specifically we use 10 IoU thresholds of .50:.05:.95. This is a break from tradition, where AP is computed at a single IoU of .50 (which corresponds to our metric AP^IoU=.50). Averaging over IoUs rewards detectors with better localization (see the sketch after these notes).
  2. AP is averaged over all categories. Traditionally, this is called "mean average precision" (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.
  3. AP (averaged across all 10 IoU thresholds and all 80 categories) will determine the challenge winner. This should be considered the single most important metric when considering performance on COCO.
  4. In COCO, there are more small objects than large objects. Specifically: approximately 41% of objects are small (area < 32^2), 34% are medium (32^2 < area < 96^2), and 24% are large (area > 96^2). Area is measured as the number of pixels in the segmentation mask.
  5. AR is the maximum recall given a fixed number of detections per image, averaged over categories and IoUs. AR is related to the metric of the same name used in proposal evaluation but is computed on a per-category basis.
  6. All metrics are computed allowing for at most 100 top-scoring detections per image (across all categories).
  7. The evaluation metrics for detection with bounding boxes and segmentation masks are identical in all respects except for the IoU computation (which is performed over boxes or masks, respectively).
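
To make note 1 concrete, the sketch below computes the IoU of a pair of boxes and shows how the headline COCO AP falls out of averaging a per-threshold, per-category AP table. The box_iou helper and the ap_table values are hypothetical stand-ins for what cocoeval.py actually derives from matched detections and precision-recall curves.

```python
import numpy as np

def box_iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(box_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.14: a poorly localized match

# Hypothetical AP table: 10 IoU thresholds x 80 categories, filled here with
# random numbers; in practice each entry comes from a precision-recall curve.
ap_table = np.random.rand(10, 80)

coco_ap = ap_table.mean()     # AP: primary metric, mean over all IoUs and categories
ap_50   = ap_table[0].mean()  # AP^IoU=.50 (threshold index 0 is IoU=.50)
ap_75   = ap_table[5].mean()  # AP^IoU=.75 (threshold index 5 is IoU=.75)
```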

3. Evaluation Code

Evaluation code is available on the COCO GitHub. Specifically, see either CocoEval.m or cocoeval.py in the Matlab or Python code, respectively. Also see evalDemo in either the Matlab or Python code (demo). Before running the evaluation code, please prepare your results in the format described on the results format page.
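
As a concrete starting point, the snippet below shows a typical run of the Python evaluation code via pycocotools. The annotation and results file paths are placeholders; the results file must follow the results format mentioned above.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: ground-truth annotations and your detections in COCO results format
coco_gt = COCO('annotations/instances_val2017.json')
coco_dt = coco_gt.loadRes('my_bbox_results.json')

# Use iouType='segm' to evaluate segmentation masks instead of bounding boxes
coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints the 12 metrics listed above
```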
