COCO detection evaluation metric

Posted by the editor on 2026-06-09

http://cocodataset.org/#detection-eval

1. Detection Evaluation

This page describes the detection evaluation metrics used by COCO. The evaluation code provided here can be used to obtain results on the publicly available COCO validation set. It computes multiple metrics described below. To obtain results on the COCO test set, for which ground-truth annotations are hidden, generated results must be uploaded to the evaluation server. The exact same evaluation code, described below, is used to evaluate results on the test set.

2. Metrics

The following 12 metrics are used for characterizing the performance of an object detector on COCO:

Average Precision (AP):

  AP              % AP at IoU=.50:.05:.95 (primary challenge metric)
  AP^IoU=.50      % AP at IoU=.50 (PASCAL VOC metric)
  AP^IoU=.75      % AP at IoU=.75 (strict metric)

AP Across Scales:

  AP^small        % AP for small objects: area < 32^2
  AP^medium       % AP for medium objects: 32^2 < area < 96^2
  AP^large        % AP for large objects: area > 96^2

Average Recall (AR):

  AR^max=1        % AR given 1 detection per image
  AR^max=10       % AR given 10 detections per image
  AR^max=100      % AR given 100 detections per image

AR Across Scales:

  AR^small        % AR for small objects: area < 32^2
  AR^medium       % AR for medium objects: 32^2 < area < 96^2
  AR^large        % AR for large objects: area > 96^2

 

  1. Unless otherwise specified, AP and AR are averaged over multiple Intersection over Union (IoU) values. Specifically we use 10 IoU thresholds of .50:.05:.95. This is a break from tradition, where AP is computed at a single IoU of .50 (which corresponds to our metric AP^IoU=.50). Averaging over IoUs rewards detectors with better localization.
  2. AP is averaged over all categories. Traditionally, this is called "mean average precision" (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.
  3. AP (averaged across all 10 IoU thresholds and all 80 categories) will determine the challenge winner. This should be considered the single most important metric when considering performance on COCO.
  4. In COCO, there are more small objects than large objects. Specifically: approximately 41% of objects are small (area < 32^2), 34% are medium (32^2 < area < 96^2), and 24% are large (area > 96^2). Area is measured as the number of pixels in the segmentation mask.
  5. AR is the maximum recall given a fixed number of detections per image, averaged over categories and IoUs. AR is related to the metric of the same name used in proposal evaluation but is computed on a per-category basis.
  6. All metrics are computed allowing for at most 100 top-scoring detections per image (across all categories).
  7. The evaluation metrics for detection with bounding boxes and segmentation masks are identical in all respects except for the IoU computation (which is performed over boxes or masks, respectively).
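
Notes 1 and 4 can be made concrete with a minimal Python sketch. It assumes boxes are given as [x, y, width, height]. The names box_iou, IOU_THRESHOLDS, thresholds_passed and size_bucket are illustrative and not part of the COCO API. How an area exactly equal to 32^2 or 96^2 is bucketed is also an assumption, since the text only gives strict inequalities.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (non-positive if disjoint).
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union


# The 10 thresholds .50:.05:.95, rounded to avoid floating-point drift.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Thresholds at which a match with this IoU would still count."""
    return [t for t in IOU_THRESHOLDS if iou >= t]


def size_bucket(area):
    """Scale bucket for an object area in pixels (boundary handling assumed)."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

For example, a detection that overlaps its ground truth with IoU 0.72 counts as a match at the five thresholds from .50 to .70 and as a miss at the remaining five, which is how averaging over IoUs rewards tighter localization.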

3. Evaluation Code

Evaluation code is available on the COCO GitHub. Specifically, see either CocoEval.m or cocoeval.py in the Matlab or Python code, respectively. Also see evalDemo in either the Matlab or Python code (demo). Before running the evaluation code, please prepare your results in the format described on the results format page.

The evaluation parameters are as follows (defaults in brackets, in general no need to change):

params{
  "imgIds"   : [all] N img ids to use for evaluation
  "catIds"   : [all] K cat ids to use for evaluation
  "iouThrs"  : [.5:.05:.95] T=10 IoU thresholds for evaluation
  "recThrs"  : [0:.01:1] R=101 recall thresholds for evaluation
  "areaRng"  : [all,small,medium,large] A=4 area ranges for evaluation
  "maxDets"  : [1 10 100] M=3 thresholds on max detections per image
  "useSegm"  : [1] if true evaluate against ground-truth segments
  "useCats"  : [1] if true use category labels for evaluation
}
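
As a rough illustration of these defaults, the sketch below builds an equivalent parameter dictionary in plain Python and checks the stated sizes T, R, A and M. It does not use the real evaluation code: the function names are hypothetical, the empty lists standing for "all" are a convention chosen here, and the numeric upper bound used for the "all" and "large" ranges is an assumption.

```python
def default_params():
    """Plain-Python stand-in for the default evaluation parameters."""
    big = 1e5 ** 2  # assumed "no upper limit" bound for object area
    return {
        "imgIds": [],   # empty list stands for "all" in this sketch
        "catIds": [],   # empty list stands for "all" in this sketch
        "iouThrs": [round(0.50 + 0.05 * i, 2) for i in range(10)],
        "recThrs": [i / 100 for i in range(101)],
        "areaRng": {
            "all": (0, big),
            "small": (0, 32 ** 2),
            "medium": (32 ** 2, 96 ** 2),
            "large": (96 ** 2, big),
        },
        "maxDets": [1, 10, 100],
        "useSegm": True,
        "useCats": True,
    }


def param_counts(params, num_categories):
    """Return [T, R, K, A, M] for a parameter dictionary."""
    return [
        len(params["iouThrs"]),
        len(params["recThrs"]),
        num_categories,
        len(params["areaRng"]),
        len(params["maxDets"]),
    ]
```

With the 80 COCO categories, param_counts(default_params(), 80) gives [10, 101, 80, 4, 3], matching the T, R, K, A and M values listed above.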

Running the evaluation code via calls to evaluate() and accumulate() produces two data structures that measure detection quality. The two structs are evalImgs and eval, which measure quality per-image and aggregated across the entire dataset, respectively. The evalImgs struct has KxA entries, one per evaluation setting, while the eval struct combines this information into precision and recall arrays. Details for the two structs are below (see also CocoEval.m or cocoeval.py):

evalImgs[{
  "dtIds"      : [1xD] id for each of the D detections (dt)
  "gtIds"      : [1xG] id for each of the G ground truths (gt)
  "dtImgIds"   : [1xD] image id for each dt
  "gtImgIds"   : [1xG] image id for each gt
  "dtMatches"  : [TxD] matching gt id at each IoU or 0
  "gtMatches"  : [TxG] matching dt id at each IoU or 0
  "dtScores"   : [1xD] confidence of each dt
  "dtIgnore"   : [TxD] ignore flag for each dt at each IoU
  "gtIgnore"   : [1xG] ignore flag for each gt
}]
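
The dtMatches and gtMatches fields record which detection was paired with which ground truth at each IoU threshold. The sketch below shows one plausible greedy matching for a single image, category and IoU threshold. It is a simplification under stated assumptions: ignore flags and crowd regions are left out, the function name match_detections and its inputs are hypothetical, and it stores list indices with -1 for "no match" rather than ids with 0. The authoritative logic is in CocoEval.m and cocoeval.py.

```python
def match_detections(ious, scores, num_gt, iou_thr):
    """Greedy matching of detections to ground truths at one IoU threshold.

    ious[d][g] is the IoU between detection d and ground truth g.
    scores[d] is the confidence of detection d.
    Returns (dt_match, gt_match): for each detection the index of its
    matched ground truth, and vice versa, with -1 meaning unmatched.
    """
    dt_match = [-1] * len(scores)
    gt_match = [-1] * num_gt
    # Visit detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda d: -scores[d])
    for d in order:
        best_iou, best_g = iou_thr, -1
        for g in range(num_gt):
            if gt_match[g] != -1:
                continue  # each ground truth is matched at most once
            if ious[d][g] >= best_iou:
                best_iou, best_g = ious[d][g], g
        if best_g != -1:
            dt_match[d] = best_g
            gt_match[best_g] = d
    return dt_match, gt_match
```

Running this once per threshold in iouThrs would yield the T rows of dtMatches and gtMatches. A detection left unmatched at a given threshold counts as a false positive there, and an unmatched ground truth as a missed object.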

 

eval{
  "params"     : parameters used for evaluation
  "date"       : date evaluation was performed
  "counts"     : [T,R,K,A,M] parameter dimensions (see above)
  "precision"  : [TxRxKxAxM] precision for every evaluation setting
  "recall"     : [TxKxAxM] max recall for every evaluation setting
}

Finally, summarize() computes the 12 detection metrics defined earlier based on the eval struct.
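
To give a feel for what one slice of the precision array holds, here is a minimal sketch for a single category at a single IoU threshold. It assumes detections have already been sorted by descending score and labelled as matched or unmatched, and that precision is made non-increasing and then sampled at the R=101 recall thresholds. This mirrors the usual interpolated-AP recipe and is not a copy of the COCO code; the function names are hypothetical.

```python
def interpolated_precision(tp_flags, num_gt, rec_thrs):
    """Precision sampled at each recall threshold.

    tp_flags: one boolean per detection, sorted by descending score,
              True if the detection matched a ground truth.
    num_gt:   number of ground-truth objects (must be > 0).
    """
    tp = fp = 0
    recall, precision = [], []
    for flag in tp_flags:
        if flag:
            tp += 1
        else:
            fp += 1
        recall.append(tp / num_gt)
        precision.append(tp / (tp + fp))
    # Make precision non-increasing from left to right (upper envelope).
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    sampled = []
    for r in rec_thrs:
        # First operating point whose recall reaches r, if any.
        idx = next((i for i, rc in enumerate(recall) if rc >= r), None)
        sampled.append(precision[idx] if idx is not None else 0.0)
    return sampled


def average_precision(sampled):
    """Mean of the sampled precision values."""
    return sum(sampled) / len(sampled)


REC_THRS = [i / 100 for i in range(101)]
```

The headline AP metric would then be the average of such values over the 10 IoU thresholds and all categories, as described in the notes of section 2.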

4. Analysis Code

In addition to the evaluation code, we also provide a function analyze() for performing a detailed breakdown of false positives. This was inspired by Diagnosing Error in Object Detectors by Derek Hoiem et al., but is quite different in implementation and details. The code generates plots like this:

[Figure: two example analysis plots; images not reproduced here]

Both plots show analysis of the ResNet (bbox) detector from Kaiming He et al., winner of the 2015 Detection Challenge. The first plot shows a breakdown of errors of ResNet for the person class; the second plot is an overall analysis of ResNet averaged over all categories.

Each plot is a series of precision-recall (PR) curves, where each curve is guaranteed to be at least as high as the previous one because the evaluation setting becomes more permissive at every step. The curves are as follows:

  1. C75: PR at IoU=.75 (AP at strict IoU), area under curve corresponds to AP^IoU=.75 metric.
  2. C50: PR at IoU=.50 (AP at PASCAL IoU), area under curve corresponds to AP^IoU=.50 metric.
  3. Loc: PR at IoU=.10 (localization errors ignored, but not duplicate detections). All remaining settings use IoU=.1.
  4. Sim: PR after supercategory false positives (fps) are removed. Specifically, any matches to objects with a different class label but that belong to the same supercategory don't count as either a fp (or tp). Sim is computed by setting all objects in the same supercategory to have the same class label as the class in question and setting their ignore flag to 1. Note that person is a singleton supercategory so its Sim result is identical to Loc.
  5. Oth: PR after all class confusions are removed. Similar to Sim, except now if a detection matches any other object it is no longer a fp (or tp). Oth is computed by setting all other objects to have the same class label as the class in question and setting their ignore flag to 1.
  6. BG: PR after all background (and class confusion) fps are removed. For a single category, BG is a step function that is 1 until max recall is reached then drops to 0 (the curve is smoother after averaging across categories).
  7. FN: PR after all remaining errors are removed (trivially AP=1).

The area under each curve is shown in brackets in the legend. In the case of the ResNet detector, overall AP at IoU=.75 is .399 and perfect localization would increase AP to .682. Interestingly, removing all class confusions (both within supercategory and across supercategories) would only raise AP slightly to .713. Removing background fps would bump performance to .870 AP, and the rest of the errors are missing detections (although presumably if more detections were added this would also add lots of fps). In summary, ResNet's errors are dominated by imperfect localization and background confusions.
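
The summary in the previous paragraph follows from simple differences between successive curve areas. The short sketch below works through that arithmetic using only the four AP values quoted above plus the trivial FN value of 1. The C50 and Sim areas are not given in the text, so they are left out, and the helper name ap_gains is hypothetical.

```python
# (curve name, area under curve) in order of increasing permissiveness.
CURVES = [
    ("C75", 0.399),  # AP at IoU=.75
    ("Loc", 0.682),  # localization errors ignored
    ("Oth", 0.713),  # all class confusions removed
    ("BG", 0.870),   # background false positives removed
    ("FN", 1.000),   # all remaining errors removed
]


def ap_gains(curves):
    """AP gained at each step relative to the previous curve."""
    return {
        curves[i][0]: round(curves[i][1] - curves[i - 1][1], 3)
        for i in range(1, len(curves))
    }


gains = ap_gains(CURVES)
```

Here gains maps Loc to 0.283, Oth to 0.031, BG to 0.157 and FN to 0.13. Among the false-positive categories, localization and background account for the two largest gains, while class confusion contributes only about three points of AP.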

For a given detector, the code generates a total of 372 plots! There are 80 categories, 12 supercategories, and 1 overall result, for a total of 93 different settings, and the analysis is performed at 4 scales (all, small, medium, large, so 93*4=372 plots). The file naming is [supercategory]-[category]-[size].pdf for the 80*4 per-category results, overall-[supercategory]-[size].pdf for the 12*4 per supercategory results, and overall-all-[size].pdf for the 1*4 overall results. Of all the plots, typically the overall and supercategory results are of the most interest.
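
The naming scheme and the plot count can be checked with a small sketch. It takes a category-to-supercategory mapping and lists the file names following the three patterns above. The function name plot_filenames and the three-category toy mapping are illustrative only; the real analysis code may order or generate the files differently.

```python
SIZES = ["all", "small", "medium", "large"]


def plot_filenames(cat_to_supercat):
    """List plot file names for every category, supercategory and overall."""
    names = []
    # Per-category results: [supercategory]-[category]-[size].pdf
    for cat, sup in sorted(cat_to_supercat.items()):
        names += [f"{sup}-{cat}-{size}.pdf" for size in SIZES]
    # Per-supercategory results: overall-[supercategory]-[size].pdf
    for sup in sorted(set(cat_to_supercat.values())):
        names += [f"overall-{sup}-{size}.pdf" for size in SIZES]
    # Overall results: overall-all-[size].pdf
    names += [f"overall-all-{size}.pdf" for size in SIZES]
    return names


# Toy mapping with 3 categories in 2 supercategories.
toy = {"person": "person", "dog": "animal", "cat": "animal"}
toy_names = plot_filenames(toy)
```

The toy mapping yields (3 + 2 + 1) * 4 = 24 names, and the same formula with 80 categories and 12 supercategories gives (80 + 12 + 1) * 4 = 372.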

Note: analyze() can take significant time to run, so please be patient. For this reason, we typically do not run this code on the evaluation server; you must run the code locally using the validation set. Finally, analyze() is currently only part of the Matlab API; Python code is coming soon.

 
