(3) COCO Python API - Per-class count distribution of the dataset

Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission. https://blog.csdn.net/gzj2013/article/details/82425954

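What does the distribution of per-class counts in the COCO dataset actually look like?

If the class counts in a dataset differ widely, does that affect the training of a deep learning model, and if so, why?

If it does have an effect, how should it be handled?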


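Preliminary analysis


The following code prints, for each of the 80 categories in COCO val2017, the number of images that contain the category and the number of annotations (bounding boxes) for it.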

from pycocotools.coco import COCO

dataDir = '/path/to/your/cocoDataset'
dataType = 'val2017'
annFile = '{}/annotations/instances_{}.json'.format(dataDir, dataType)

# Initialize the COCO API for instance annotations
coco = COCO(annFile)

# Display the COCO category names
cats = coco.loadCats(coco.getCatIds())
cat_nms = [cat['name'] for cat in cats]
print('COCO categories: \n{}\n'.format(' '.join(cat_nms)))

# Count the images and annotations (bounding boxes) of each category
for cat_name in cat_nms:
    # Pass the name as a one-element list. Under Python 3 a bare string can be
    # treated as a sequence by pycocotools and matched by substring, so that
    # 'hot dog' would also select 'dog'.
    catId = coco.getCatIds(catNms=[cat_name])
    imgId = coco.getImgIds(catIds=catId)
    annId = coco.getAnnIds(imgIds=imgId, catIds=catId, iscrowd=None)

    print("{:<15} {:<6d}     {:<10d}".format(cat_name, len(imgId), len(annId)))

The output is as follows:

category        images  annotations  category        images  annotations
person          2693    11004        bicycle         149     316
car             535     1932         motorcycle      159     371
airplane        97      143          bus             189     285
train           157     190          truck           250     415
boat            121     430          traffic light   191     637
fire hydrant    86      101          stop sign       69      75
parking meter   37      60           bench           235     413
bird            125     440          cat             184     202
dog             177     218          horse           128     273
sheep           65      361          cow             87      380
elephant        89      255          bear            49      71
zebra           85      268          giraffe         101     232
backpack        228     371          umbrella        174     413
handbag         292     540          tie             145     254
suitcase        105     303          frisbee         84      115
skis            120     241          snowboard       49      69
sports ball     169     263          kite            91      336
baseball bat    97      146          baseball glove  100     148
skateboard      127     179          surfboard       149     269
tennis racket   167     225          bottle          379     1025
wine glass      110     343          cup             390     899
fork            155     215          knife           181     326
spoon           153     253          bowl            314     626
banana          103     379          apple           76      239
sandwich        98      177          orange          85      287
broccoli        71      316          carrot          3       17
hot dog         0       345          pizza           153     285
donut           62      338          cake            124     316
chair           580     1791         couch           195     261
potted plant    172     343          bed             149     163
dining table    501     697          toilet          149     179
tv              207     288          laptop          183     231
mouse           88      106          remote          145     283
keyboard        106     153          cell phone      214     262
microwave       54      55           oven            115     143
toaster         8       9            sink            187     225
refrigerator    101     126          book            230     1161
clock           204     267          vase            137     277
scissors        28      36           teddy bear      0       262
hair drier      9       11           toothbrush      34      57

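As the table shows, both the number of images and the number of annotations vary widely from class to class. The 'person' class stands out in particular: its 11004 annotations are roughly 50 times what most other classes have.

Note that the rows for carrot, hot dog and teddy bear should not be trusted. A class cannot have 0 images and 345 annotations. These values appear to have been produced with a bare string passed as catNms, which pycocotools under Python 3 can match by substring, so that 'hot dog' also selects 'dog', 'teddy bear' also selects 'bear', and 'carrot' also selects 'car'.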
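To put a number on "differ widely", the spread can be summarized with two simple ratios. The following is a minimal sketch, not part of the original script: the helper name `imbalance_summary` and the dictionary `val2017_subset` are illustrative, and the dictionary holds only a few hand-picked annotation counts copied from the val2017 table, so the printed ratios describe that subset rather than all 80 classes.

```python
from statistics import median


def imbalance_summary(counts):
    """Return (max/min, max/median) for a {class name: count} mapping.

    Classes with a count of zero are ignored so that the ratios stay finite.
    """
    values = [v for v in counts.values() if v > 0]
    if not values:
        raise ValueError("counts must contain at least one positive value")
    largest = max(values)
    return largest / min(values), largest / median(values)


# Hand-picked annotation counts from the val2017 table (a subset, not all 80).
val2017_subset = {
    "person": 11004,
    "bottle": 1025,
    "truck": 415,
    "horse": 273,
    "cat": 202,
    "mouse": 106,
    "toaster": 9,
}

max_over_min, max_over_median = imbalance_summary(val2017_subset)
print("max/min    = {:.1f}".format(max_over_min))     # 1222.7
print("max/median = {:.1f}".format(max_over_median))  # 40.3
```

The max/median ratio is the more useful of the two here, because max/min is dominated by the single rarest class.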


train2017 class distribution


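Below are the number of images and the number of annotations for each of the 80 categories in COCO train2017. The values for carrot, hot dog and teddy bear are implausibly low next to comparable classes (for example 29 annotations for hot dog against 5821 for pizza); they are probably an artifact of how the counting script matched category names rather than real counts.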

category        images  annotations  category        images  annotations
person          64115   262465       bicycle         3252    7113
car             12251   43867        motorcycle      3502    8725
airplane        2986    5135         bus             3952    6069
train           3588    4571         truck           6127    9973
boat            3025    10759        traffic light   4139    12884
fire hydrant    1711    1865         stop sign       1734    1983
parking meter   705     1285         bench           5570    9838
bird            3237    10806        cat             4114    4768
dog             4385    5508         horse           2941    6587
sheep           1529    9509         cow             1968    8147
elephant        2143    5513         bear            960     1294
zebra           1916    5303         giraffe         2546    5131
backpack        5528    8720         umbrella        3968    11431
handbag         6841    12354        tie             3810    6496
suitcase        2402    6192         frisbee         2184    2682
skis            3082    6646         snowboard       1654    2685
sports ball     4262    6347         kite            2261    9076
baseball bat    2506    3276         baseball glove  2629    3747
skateboard      3476    5543         surfboard       3486    6126
tennis racket   3394    4812         bottle          8501    24342
wine glass      2533    7913         cup             9189    20650
fork            3555    5479         knife           4326    7770
spoon           3529    6165         bowl            7111    14358
banana          2243    9458         apple           1586    5851
sandwich        2365    4373         orange          1699    6399
broccoli        1939    7308         carrot          24      142
hot dog         11      29           pizza           3166    5821
donut           1523    7179         cake            2925    6353
chair           12774   38491        couch           4423    5779
potted plant    4452    8652         bed             3682    4192
dining table    11837   15714        toilet          3353    4157
tv              4561    5805         laptop          3524    4970
mouse           1876    2262         remote          3076    5703
keyboard        2115    2855         cell phone      4803    6434
microwave       1547    1673         oven            2877    3334
toaster         217     225          sink            4678    5610
refrigerator    2360    2637         book            5332    24715
clock           4659    6334         vase            3593    6613
scissors        947     1481         teddy bear      16      92
hair drier      189     198          toothbrush      1007    1954
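The counts in both tables can be cross-checked without any lookup by category name, by tallying the annotation records directly by `category_id`. The sketch below assumes the usual layout of a COCO instances file: a `categories` list whose entries have `id` and `name`, and an `annotations` list whose entries have `image_id` and `category_id`. The function name `count_per_category` and the `toy` dataset are made up for illustration; the toy data stands in for the real file so that the example runs without the dataset.

```python
from collections import Counter, defaultdict


def count_per_category(dataset):
    """Count images and annotations per category in a COCO-style dict.

    Returns {category name: (number of images, number of annotations)}.
    """
    ann_counts = Counter()
    images = defaultdict(set)
    for ann in dataset["annotations"]:
        cat_id = ann["category_id"]
        ann_counts[cat_id] += 1
        # A set keeps each image once, however many boxes it contains.
        images[cat_id].add(ann["image_id"])
    return {
        cat["name"]: (len(images[cat["id"]]), ann_counts[cat["id"]])
        for cat in dataset["categories"]
    }


# Made-up data: two boxes of one category in image 1, three boxes of the
# other category spread over images 2 and 3.
toy = {
    "categories": [{"id": 18, "name": "dog"}, {"id": 58, "name": "hot dog"}],
    "annotations": [
        {"image_id": 1, "category_id": 18},
        {"image_id": 1, "category_id": 18},
        {"image_id": 2, "category_id": 58},
        {"image_id": 3, "category_id": 58},
        {"image_id": 3, "category_id": 58},
    ],
}

print(count_per_category(toy))
# {'dog': (1, 2), 'hot dog': (2, 3)}
```

On the real data, the dict would come from loading the annotation file with `json.load`. Because the tally is keyed by id, similarly named categories such as 'dog' and 'hot dog' cannot be confused with each other.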

TODO…

