YOLOv3 in-depth understanding

YOLOv3 does not introduce much that is truly new; it mainly incorporates a number of proven techniques into YOLO. The results are nonetheless good: it keeps YOLO's speed advantage while improving prediction accuracy, especially the ability to detect small objects.

This article focuses on the improvements in v3. For the analysis of v1 and v2, see the companion articles "YOLO v1 in-depth understanding" and "YOLOv2 / YOLO9000 in-depth understanding".

YOLOv3's main improvements are: an adjusted network structure; object detection using multi-scale features; and replacing softmax with logistic outputs for object classification.

The new network architecture: Darknet-53

For basic image feature extraction, YOLOv3 uses a network structure called Darknet-53 (it contains 53 convolutional layers). It borrows the approach of residual networks, adding shortcut connections between some layers.

Figure 1: Darknet-53 [1]

In the figure above, Darknet-53 takes a 256*256*3 image as input. The numbers 1, 2, 8, etc. in the leftmost column indicate how many times each residual component is repeated. Each residual component consists of two convolutional layers plus a shortcut connection, as shown below:

Figure 2: A residual component [2]
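
As a concrete illustration, here is a minimal PyTorch sketch of such a residual component. It follows the common Darknet-53 pattern (a 1x1 convolution halving the channels, then a 3x3 convolution restoring them, added to the input), but the class name, channel counts, and activation settings here are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """One residual component: two conv layers plus a shortcut connection."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv halves the channel count.
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
        )
        # 3x3 conv restores it, keeping spatial size with padding=1.
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut connection: add the input to the conv output.
        return x + self.conv2(self.conv1(x))

x = torch.randn(1, 64, 128, 128)
print(DarknetResidual(64)(x).shape)  # torch.Size([1, 64, 128, 128])
```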

Object detection with multi-scale features

Figure 3: YOLOv3 network structure [3]

YOLOv2 already used a passthrough structure to detect fine-grained features; YOLOv3 goes further and performs object detection on feature maps at three different scales.

Referring to the figure, after the 79th convolutional layer, a few more (yellow) convolutional layers produce the detection result at the first scale. Relative to the input image, the feature map used for detection here is downsampled by a factor of 32. For example, with a 416*416 input, the feature map here is 13*13. Because of the heavy downsampling, this feature map has a relatively large receptive field, so it is suited to detecting relatively large objects in the image.

To achieve finer-grained detection, the layer-79 feature map is then upsampled (moving right from layer 79 in the figure, upsampling and convolution are applied) and fused with the layer-61 feature map (concatenation). This yields the finer-grained layer-91 feature map, which again passes through several convolutional layers to give a feature map downsampled 16x relative to the input image. It has a medium-scale receptive field and is suited to detecting medium-sized objects.

Finally, the layer-91 feature map is upsampled once more and fused with the layer-36 feature map (concatenation), ultimately producing a feature map downsampled 8x relative to the input image. It has the smallest receptive field and is suited to detecting small objects.
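
To make the wiring concrete, below is a minimal sketch of the upsample-and-concatenate fusion just described, using random tensors in place of the real backbone features. The channel counts are assumptions, and the convolutional layers that run between these steps in the real network are omitted for brevity.

```python
import torch
import torch.nn as nn

# Stand-in feature maps for a 416*416 input (channel counts are assumptions):
f79 = torch.randn(1, 512, 13, 13)  # deep features, 32x downsampled
f61 = torch.randn(1, 512, 26, 26)  # mid-level features, 16x downsampled
f36 = torch.randn(1, 256, 52, 52)  # shallow features, 8x downsampled

up = nn.Upsample(scale_factor=2, mode="nearest")

# Scale 1: large objects are detected directly on the 13*13 map.
scale1 = f79

# Scale 2: upsample 13*13 -> 26*26, concatenate with layer-61 features.
scale2 = torch.cat([up(f79), f61], dim=1)     # (1, 1024, 26, 26)

# Scale 3: upsample again to 52*52, concatenate with layer-36 features.
# (In the real network, several convolutions run between these steps.)
scale3 = torch.cat([up(scale2), f36], dim=1)  # (1, 1280, 52, 52)

for s in (scale1, scale2, scale3):
    print(s.shape)
```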

Prior boxes in 9 sizes

As the number and scale of the output feature maps change, the sizes of the prior boxes need to be adjusted accordingly. YOLOv2 already used k-means clustering to obtain prior box sizes; YOLOv3 keeps this method, setting 3 prior boxes for each downsampling scale, clustering 9 prior box sizes in total. On the COCO dataset these 9 prior boxes are: (10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198), (373x326).
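
The clustering itself is straightforward. Below is a minimal sketch of k-means over ground-truth box (width, height) pairs using the 1 - IoU distance that the YOLO authors describe; the toy data and loop details are assumptions, and the real anchors come from clustering the full training set.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, treating all boxes as sharing one corner."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    areas = boxes[:, 0] * boxes[:, 1]
    c_areas = centroids[:, 0] * centroids[:, 1]
    return inter / (areas[:, None] + c_areas[None, :] - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Distance metric from the paper: d(box, centroid) = 1 - IoU.
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        for i in range(k):
            members = boxes[assign == i]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[i] = members.mean(axis=0)
    return centroids

# Toy (w, h) data standing in for COCO ground-truth boxes.
boxes = np.abs(np.random.default_rng(1).normal(100.0, 60.0, (500, 2))) + 5.0
print(np.round(kmeans_anchors(boxes)))
```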

As for the assignment: the smallest, 13*13 feature map (with the largest receptive field) uses the larger prior boxes (116x90), (156x198), (373x326), suited to detecting larger objects. The medium, 26*26 feature map (medium receptive field) uses the medium prior boxes (30x61), (62x45), (59x119), suited to medium-sized objects. The larger, 52*52 feature map (smaller receptive field) uses the smaller prior boxes (10x13), (16x30), (33x23), suited to detecting smaller objects. The mapping is written out in the sketch after Figure 4.

Figure 4: Feature maps and prior boxes
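
Written out as a simple mapping (the dict structure is just an illustrative assumption; the numbers are the COCO prior boxes listed above), the assignment looks like this:

```python
# COCO prior boxes (w, h), grouped by the feature-map scale they serve.
anchors_by_scale = {
    13: [(116, 90), (156, 198), (373, 326)],  # 13*13 map: large objects
    26: [(30, 61), (62, 45), (59, 119)],      # 26*26 map: medium objects
    52: [(10, 13), (16, 30), (33, 23)],       # 52*52 map: small objects
}
```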

To get a feel for the 9 prior box sizes: in the figure below, the blue boxes are the prior boxes obtained by clustering, the yellow boxes are the ground truth, and the red box is the grid cell containing the object's center point.

Figure 5: The 9 prior box sizes

Object classification: softmax replaced with logistic

When predicting object classes, softmax is no longer used; instead the prediction is made with logistic (sigmoid) outputs. This supports multi-label objects (for example, one person can carry both the Woman and Person labels).
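
A quick numerical sketch of the difference (the logit values here are made up): softmax normalizes across classes so their probabilities compete, while independent sigmoids can let several classes be confident at once.

```python
import torch

logits = torch.tensor([2.3, 1.8, -0.5, -3.0])  # assumed raw class scores

# Softmax: probabilities sum to 1, so classes compete and
# "Woman" and "Person" cannot both score near 1.
print(torch.softmax(logits, dim=0))   # approx. [0.60, 0.36, 0.04, 0.003]

# Logistic (sigmoid): each class is scored independently,
# so multiple labels can clear a 0.5 threshold at the same time.
probs = torch.sigmoid(logits)
print(probs, probs > 0.5)             # the first two classes are both "on"
```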

Mapping the input to the output

Figure 6: Input -> output

Leaving aside the details of the network structure, in general YOLOv3 maps an input image to output tensors at 3 scales, representing the probability that various objects are present at each position in the image.

Let's count how many predictions YOLOv3 makes in total. For a 416*416 input image, with 3 prior boxes set at each grid cell of each scale's feature map, there are 13*13*3 + 26*26*3 + 52*52*3 = 10647 predictions in total. Each prediction is a (4+1+80) = 85-dimensional vector containing the bounding box coordinates (4 values), the box confidence (1 value), and the object class probabilities (80 classes for the COCO dataset).
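
The arithmetic is easy to verify; here is a minimal sketch, where the grid sizes follow from the 32x/16x/8x downsampling described above:

```python
# Prediction count for a 416*416 input image.
grid_sizes = [416 // stride for stride in (32, 16, 8)]  # [13, 26, 52]
anchors_per_cell = 3
num_classes = 80                                        # COCO

total = sum(g * g * anchors_per_cell for g in grid_sizes)
vector_len = 4 + 1 + num_classes  # box coords + confidence + class probs

print(total, vector_len)          # 10647 85
# One 85-dim vector per prior box per grid cell, at each of the 3 scales:
for g in grid_sizes:
    print((g, g, anchors_per_cell, vector_len))
```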

By comparison, YOLOv2 makes 13*13*5 = 845 predictions. YOLOv3 attempts to predict more than ten times as many boxes, and does so at different resolutions, so mAP and small-object detection improve to some degree.

Summary

YOLOv3 borrows the residual network structure to form a deeper network, and adds multi-scale detection, improving mAP and small-object detection. If COCO mAP50 is used as the evaluation metric (i.e., if you don't care too much about bounding box precision), YOLOv3's performance is quite striking: as the figure below shows, at comparable accuracy YOLOv3 is 3 to 4 times faster than other models.

Figure 7: Performance comparison of YOLOv3 with other models [1]

However, if more precise bounding boxes are required and COCO AP is used as the evaluation standard, YOLOv3's accuracy is somewhat weaker, as shown below.

Figure 8: Performance comparison of YOLOv3 with other models [1]

References

[1] YOLOv3: An Incremental Improvement
[2] Deep Residual Learning for Image Recognition
[3] What's new in YOLO v3?
[4] How to implement a YOLO (v3) object detector from scratch in PyTorch
[5] 一文看懂YOLO v3
