Deep learning - object detection (YOLOv2)

Reposted from: https://blog.csdn.net/Jesse_Mx/article/details/53925356

Paper: YOLO9000: Better, Faster, Stronger
Project home page: YOLO: Real-Time Object Detection

Overview

　　One year on, YOLO (You Only Look Once: Unified, Real-Time Object Detection) has evolved from v1 to v2. The source code appeared on the darknet home page first, and the paper we had been waiting for was finally released on December 25. This post is my translation and reading of the main parts of the paper. It may not be entirely right, so discussion is welcome if anything is in doubt. If I come to a new understanding, I will update the post.

　　The full title of the new paper is "YOLO9000: Better, Faster, Stronger". It makes two main contributions:

　　1) The authors apply a series of improvements to the original YOLO multi-object detection framework, raising accuracy while keeping its speed advantage. On the VOC 2007 test set, mAP reaches 76.8% at 67 FPS and 78.6% at 40 FPS, which is competitive with Faster R-CNN and SSD. This part is the main focus of this post.

　　2) The authors propose a method for jointly training object detection and classification. With it, YOLO9000 is trained on the COCO detection dataset and the ImageNet classification dataset at the same time, and the resulting model can detect more than 9000 object categories in real time. This post does not cover that part for now; I will add it when I have time.

Review of YOLOv1

　　YOLOv2 is a set of improvements on top of v1, which I covered in detail in an earlier post. If needed, see https://www.cnblogs.com/zxj9487/p/10933604.html

YOLOv2 accuracy improvements (Better)

　　First, an overview chart showing which techniques were used and how much each one contributed:

[Figure: overview of the techniques used and how much each contributes]

Batch Normalization

　　During training, the distribution of each layer's inputs keeps changing, which makes training harder; normalizing each layer's inputs addresses this. The new YOLO network adds batch normalization after every convolutional layer, which yields a gain of more than 2% mAP. Batch normalization also helps regularize the model, so dropout can be removed without overfitting.
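　　To make the normalization step concrete, here is a minimal NumPy sketch of batch normalization in training mode for a batch of convolutional feature maps. The function name, the (N, C, H, W) layout, and the `eps` value are my own assumptions for illustration; running statistics for inference are left out, and this is not the darknet implementation.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a (N, C, H, W) batch per channel, then scale and shift.

    Training-mode only: statistics come from the current batch.
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# Hypothetical activations with a shifted, widely spread distribution.
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=5.0, size=(8, 4, 13, 13))
out = batch_norm(features, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=(0, 2, 3)))  # close to 0 for every channel
print(out.std(axis=(0, 2, 3)))   # close to 1 for every channel
```

　　With `gamma = 1` and `beta = 0` the output of each channel has roughly zero mean and unit variance, whatever the distribution of the incoming activations was.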

High Resolution Classifier

　　Current detection methods almost all use a model (classifier) pre-trained on ImageNet to extract features. With a network such as AlexNet, input images are resized to less than 256 * 256, so the resolution is too low and detection becomes harder. The new YOLO network therefore raises the resolution directly to 448 * 448, which also means the existing network has to be adjusted to the new input resolution.

　　For YOLOv2, the authors first fine-tune the classification network (their custom darknet) at 448 * 448 resolution for 10 epochs on ImageNet, so that the network adapts to high-resolution input. Then the detection part of the network (the second half) is fine-tuned as well. Raising the input resolution in this way yields an improvement of almost 4% mAP.

Convolutional With Anchor Boxes

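　　The previous YOLO predicted bounding boxes from the output of fully connected layers, which discards much of the spatial information and leads to inaccurate localization. In this version the authors borrow the anchor idea from Faster R-CNN. As a reminder, anchors are a key step in the RPN (region proposal network): a window slides over the convolutional feature map, and each center position predicts 9 proposal boxes of different scales and aspect ratios. Seeing YOLOv2 adopt this idea, I can only say the SSD authors had foresight.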


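　　To introduce anchor boxes for predicting bounding boxes, the authors remove the fully connected layers from the network. The remaining steps are as follows. First, one of the later pooling layers is removed so that the output convolutional feature map has higher resolution. Next, the network is shrunk to take 416 * 416 input images, so that the resulting feature map has odd width and height and therefore a single center cell. The authors observe that large objects tend to occupy the center of the image, so one cell at the center can predict them; otherwise the four cells around the middle would have to. This trick slightly improves efficiency. Finally, the convolutional layers of YOLOv2 downsample by a factor of 32, so a 416 * 416 input yields a 13 * 13 feature map (416 / 32 = 13).

　　As expected, adding anchor boxes raises recall and lowers precision. A quick count: if each cell predicts 9 proposal boxes, the total is 13 * 13 * 9 = 1521 boxes, whereas the previous network predicted only 7 * 7 * 2 = 98. The numbers are: without anchor boxes, recall is 81% and mAP is 69.5%; with anchor boxes, recall is 88% and mAP is 69.2%. Accuracy drops only slightly while recall rises by 7%, which means there is room to recover accuracy with further work.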
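　　The arithmetic in the two paragraphs above can be checked with a short sketch. The helper names `grid_size` and `box_count` are hypothetical and exist only for this illustration.

```python
def grid_size(input_size, stride=32):
    """Side length of the output feature map for a square input."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride

def box_count(grid, boxes_per_cell):
    """Total number of boxes predicted over a grid x grid feature map."""
    return grid * grid * boxes_per_cell

for size in (448, 416):
    g = grid_size(size)
    parity = "odd, single center cell" if g % 2 == 1 else "even, four center cells"
    print(f"{size} x {size} input -> {g} x {g} grid ({parity})")

print("YOLOv1 boxes:", box_count(7, 2))                        # 7 * 7 * 2
print("Boxes with 9 anchors per cell:", box_count(grid_size(416), 9))
```

　　With a stride of 32, a 448 * 448 input would give an even 14 * 14 grid, which is why the input is reduced to 416 * 416 to get the odd 13 * 13 grid.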

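Dimension Clusters

　　The authors ran into two problems when using anchors. The first is that the width and height of the anchor boxes are usually hand-picked priors. The network can learn to adjust the boxes during training and still arrive at accurate bounding boxes, but if better, more representative priors are chosen to begin with, it is easier for the network to learn to predict accurate locations.

　　Instead of hand-picking the dimensions, the authors run K-means clustering on the bounding boxes of the training set to find good box widths and heights automatically. Standard K-means uses Euclidean distance, which means larger boxes generate more error than smaller ones and the clusters can be skewed. The authors therefore use the IOU score (the intersection of two boxes divided by their union) as the criterion, so that the error is independent of box size. The final distance function is: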

$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$

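　　The authors clustered the boxes in the training set with this modified K-means, using the average IOU as the evaluation criterion. The clustering results are shown in the figure below: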

[Figure: clustering results, average IOU for different numbers of clusters k]

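　　After trading off model complexity against IOU, the authors settle on k = 5, meaning five box dimensions are used for localization. These differ from hand-picked box dimensions: there are fewer short, wide boxes and more tall, thin ones (which matches the shape of pedestrians). A conclusion like this would be hard to reach without the clustering experiment.

　　The authors also ran an experiment comparing the two strategies, shown below. With clustering, just 5 priors reach an average IOU comparable to that of the 9 anchor boxes used by Faster R-CNN. This shows that K-means produces more representative boxes, which makes the subsequent detection task easier.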

[Figure: comparison of clustered priors with hand-picked anchor boxes]
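　　The clustering described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: boxes are reduced to (width, height) pairs aligned at a common corner, the distance is 1 - IOU, and centroids are updated with the cluster mean (an assumption on my part; other update rules are possible). The box data is synthetic and the names `iou_wh` and `kmeans_iou` are hypothetical.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (N, 2) boxes and (K, 2) centroids given as (width, height).

    Boxes are treated as sharing one corner, so only the shapes matter.
    """
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_iou(boxes, k, iters=100, seed=0):
    """K-means over box shapes using d = 1 - IOU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)
        assign = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    avg_iou = iou_wh(boxes, centroids).max(axis=1).mean()
    return centroids, avg_iou

# Synthetic (width, height) pairs standing in for training-set boxes.
rng = np.random.default_rng(1)
boxes = np.abs(rng.normal(loc=[0.2, 0.5], scale=0.1, size=(200, 2))) + 0.02
priors, avg_iou = kmeans_iou(boxes, k=5)
print(priors)
print("average IOU:", avg_iou)
```

　　Running the same function for several values of k and plotting the average IOU against k gives the kind of trade-off curve the authors used to pick k = 5.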

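Direct Location Prediction

　　The second problem the authors found with anchor boxes is model instability, especially during early iterations. Most of the instability comes from predicting the (x, y) location of the box. In a region proposal network, the network predicts $t_{x}$ and $t_{y}$, and the center coordinates (x, y) are computed as: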

$\begin{aligned} x = (t_{x} * w_{a}) - x_{a} \\ y = (t_{y} * h_{a}) - y_{a} \end{aligned}$

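　　When I later revised this post, I noticed that this formula is wrong: the authors seem to have written a minus sign where there should be a plus. The reason is that the anchor prediction formula comes from Faster R-CNN, so let us look at how it is written there: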

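$\begin{aligned} t_{x} &= (x - x_{a}) / w_{a}, & t_{y} &= (y - y_{a}) / h_{a} \\ t_{w} &= \ln(w / w_{a}), & t_{h} &= \ln(h / h_{a}) \\ t_{x}^{*} &= (x^{*} - x_{a}) / w_{a}, & t_{y}^{*} &= (y^{*} - y_{a}) / h_{a} \\ t_{w}^{*} &= \ln(w^{*} / w_{a}), & t_{h}^{*} &= \ln(h^{*} / h_{a}) \end{aligned}$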

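　　The symbols are as follows: $x$ is the predicted coordinate, $x_{a}$ is the anchor coordinate (a preset fixed value), and $x^{*}$ is the ground-truth coordinate (from the annotation). The same holds for y, w and h, and the $t$ variables are offsets. Rearranging the first two formulas gives the correct form: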

$\begin{aligned} x = (t_{x} * w_{a}) + x_{a} \\ y = (t_{y} * h_{a}) + y_{a} \end{aligned}$

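　　This formula reads as follows: predicting $t_{x} = 1$ shifts the box to the right by a fixed distance (the width of the anchor box), and predicting $t_{x} = -1$ shifts it to the left by the same distance.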

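　　This formulation is unconstrained, so any anchor box can end up at any point in the image, regardless of which location predicted it. (My understanding is that $t_{x}$ has no bound on its value, so an anchor may end up matching a target box far away, which is inefficient. Each anchor should really only be responsible for target boxes within about one unit around it.) With random initialization, the model takes a long time to stabilize and predict sensible offsets.

　　For this reason the authors do not predict offsets directly. Instead they predict location coordinates relative to the grid cell. This bounds the ground truth to the range 0 to 1, and a logistic activation constrains the network's predictions to the same range.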
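　　A small sketch of the unconstrained decoding, using the plus-sign form $x = t_{x} * w_{a} + x_{a}$, $y = t_{y} * h_{a} + y_{a}$. The anchor values are made up for illustration and the function name is hypothetical.

```python
def decode_rpn_center(t_x, t_y, anchor):
    """Decode a box center from RPN-style offsets.

    anchor = (x_a, y_a, w_a, h_a); returns the predicted center (x, y).
    """
    x_a, y_a, w_a, h_a = anchor
    return t_x * w_a + x_a, t_y * h_a + y_a

# Hypothetical anchor centered at (6.5, 6.5) with width 3 and height 2.
anchor = (6.5, 6.5, 3.0, 2.0)
print(decode_rpn_center(1.0, 0.0, anchor))   # shifted right by one anchor width
print(decode_rpn_center(-1.0, 0.0, anchor))  # shifted left by one anchor width
print(decode_rpn_center(40.0, 0.0, anchor))  # nothing bounds a large offset
```

　　The last call lands far outside a 13 * 13 grid, which is the kind of prediction a randomly initialized network can produce under this parameterization.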


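　　The network now predicts 5 bounding boxes (the number obtained from clustering) at each cell of the 13 * 13 feature map, and 5 values for each bounding box: $t_{x}, t_{y}, t_{w}, t_{h}, t_{o}$. The first four are coordinates and $t_{o}$ is the confidence. If the cell is offset from the top-left corner of the image by $(c_{x}, c_{y})$ and the corresponding bounding box prior has width and height $(p_{w}, p_{h})$, then the predictions are: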

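$\begin{aligned} b_{x} &= \sigma(t_{x}) + c_{x} \\ b_{y} &= \sigma(t_{y}) + c_{y} \\ b_{w} &= p_{w} e^{t_{w}} \\ b_{h} &= p_{h} e^{t_{h}} \\ Pr(\text{object}) * IOU(b, \text{object}) &= \sigma(t_{o}) \end{aligned}$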

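　　These formulas are easier to follow with the Faster R-CNN and YOLOv1 formulas above and the figure below in mind. $t_{x}$ and $t_{y}$ pass through the sigmoid function, which limits them to the range 0 to 1. In practice this makes each anchor responsible only for boxes around its own cell, which helps efficiency and convergence. The paper does not spell out $σ$, but it is the logistic (sigmoid) activation mentioned above. The exponential of $e$ is used for width and height because the offsets were defined with $\ln$. So $σ (t_{x})$ is the horizontal coordinate of the bounding box center relative to the top-left corner of the grid cell, $σ (t_{y})$ is the vertical coordinate, and $σ (t_{o})$ is the confidence score of the bounding box.

　　Once the location predictions are normalized, the parameters are easier to learn and the model is more stable. Using Dimension Clusters together with Direct Location Prediction to improve the anchor boxes raises mAP by almost 5%.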

[Figure: bounding box with dimension priors and location prediction]
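　　As a sketch of the direct location prediction, the function below decodes one box with the parameterization from the paper: $b_{x} = \sigma(t_{x}) + c_{x}$, $b_{y} = \sigma(t_{y}) + c_{y}$, $b_{w} = p_{w} e^{t_{w}}$, $b_{h} = p_{h} e^{t_{h}}$, with $\sigma(t_{o})$ as the confidence. The cell index and prior size are invented values, and all results are in grid-cell units.

```python
import math

def sigmoid(v):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(t, cell, prior):
    """Decode one YOLOv2-style prediction.

    t = (t_x, t_y, t_w, t_h, t_o); cell = (c_x, c_y); prior = (p_w, p_h).
    Returns (b_x, b_y, b_w, b_h, confidence) in grid-cell units.
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h, sigmoid(t_o)

# All-zero outputs put the center in the middle of the cell with the prior size.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.0, 2.0))
# Even large raw outputs cannot move the center out of cell (6, 6).
far = decode_box((8.0, -8.0, 0.5, -0.5, 2.0), cell=(6, 6), prior=(3.0, 2.0))
print(box)
print(far)
```

　　Compared with the unconstrained offsets, the center can never leave the cell that predicted it, which is what makes early training more stable.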

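Fine-Grained Features

　　With the changes above, YOLO ends up predicting on a 13 * 13 feature map. This is sufficient for large objects, but finer-grained features may help with detecting smaller objects. Faster R-CNN and SSD both generate proposals on feature maps at several levels (this is easy to see in SSD) to handle multiple scales. YOLOv2 takes a different approach and simply adds a passthrough layer, which connects an earlier feature map (26 * 26 resolution, four times as many cells as the final layer) to the deep feature map.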


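　　The passthrough layer concatenates the higher-resolution features with the low-resolution ones by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. It turns the 26 * 26 * 512 feature map into a 13 * 13 * 2048 feature map, which is then concatenated with the original features. The YOLO detector runs on top of this expanded feature map, so it has access to finer-grained features, and model performance improves by 1%. (I did not understand this part very well; it only becomes clear with the network structure diagram.)

　　Addendum: the passthrough layer is simply a feature rearrangement (no learned parameters). The 26 * 26 * 512 feature map is sampled at every other row and every other column, which gives 4 new feature maps of size 13 * 13 * 512 each. Concatenating them gives a 13 * 13 * 2048 feature map, which is appended to the later layer. This amounts to feature fusion and helps with detecting small objects.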
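　　The rearrangement in the addendum can be written in a few lines of NumPy. This is a sketch under assumptions: the (H, W, C) layout, the function name `passthrough`, and the 1024-channel deep feature map are mine, and the channel ordering of darknet's own layer may differ.

```python
import numpy as np

def passthrough(x, stride=2):
    """Rearrange an (H, W, C) map into (H/stride, W/stride, C*stride*stride).

    Takes every stride-th row and column at each offset, then stacks the
    resulting sub-maps along the channel axis. No parameters are learned.
    """
    h, w, c = x.shape
    if h % stride != 0 or w % stride != 0:
        raise ValueError("height and width must be divisible by the stride")
    parts = [x[i::stride, j::stride, :] for i in range(stride) for j in range(stride)]
    return np.concatenate(parts, axis=-1)

fine = np.arange(26 * 26 * 512, dtype=np.float32).reshape(26, 26, 512)
coarse = np.zeros((13, 13, 1024), dtype=np.float32)  # assumed deep feature map

stacked = passthrough(fine)                           # four 13 x 13 x 512 maps
fused = np.concatenate([stacked, coarse], axis=-1)    # feature fusion by channels
print(stacked.shape, fused.shape)
```

　　No values are lost or changed: the 26 * 26 * 512 map and the 13 * 13 * 2048 map hold the same numbers, only arranged so that they line up spatially with the 13 * 13 deep features.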

Multi-Scale Training

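　　The original YOLO network takes a fixed 448 * 448 input; with anchor boxes the input became 416 * 416. Since the current network uses only convolutional and pooling layers, the input size can be changed on the fly (meaning images of any size can be processed). The authors want YOLOv2 to be robust to images of different sizes, so they build this into training.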

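　　Instead of fixing the input image size, the authors change the network every few iterations. Every 10 batches, a new image size is chosen at random. Because the network downsamples by a factor of 32, the sizes are drawn from multiples of 32: {320, 352, ..., 608}. The smallest option is 320 * 320 and the largest is 608 * 608. The network is then resized to that dimension and training continues.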

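　　This scheme lets the network predict well across input sizes, so the same network can run detection at different resolutions. YOLOv2 runs faster on smaller inputs, which gives an easy trade-off between speed and accuracy.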
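　　A minimal sketch of how the input size could be scheduled during multi-scale training. The generator name, the starting size of 416, and the seed are assumptions for illustration; the re-sampling interval is kept as a parameter, and resizing the images and the network is not shown.

```python
import random

# Multiples of the downsampling factor 32: 320, 352, ..., 608.
SIZES = list(range(320, 608 + 1, 32))

def input_size_schedule(num_batches, change_every=10, seed=0):
    """Yield the square input size to use for each training batch."""
    rng = random.Random(seed)
    size = 416  # assumed starting size
    for batch in range(num_batches):
        if batch > 0 and batch % change_every == 0:
            size = rng.choice(SIZES)
        yield size

schedule = list(input_size_schedule(40))
print(SIZES)
print(schedule)
```

　　Every size in the list gives an integer grid after dividing by 32, from 10 * 10 at 320 up to 19 * 19 at 608.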

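　　At low resolution YOLOv2 does well: with 288 * 288 input it runs at more than 90 FPS with mAP almost as good as Fast R-CNN. This makes it well suited to low-end GPUs, high-frame-rate video, and multiple video streams.

　　At high resolution YOLOv2 is state of the art, with 78.6% mAP on VOC 2007 while still running above real-time speed. The figure below compares YOLOv2 with other networks: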

[Figure: comparison of YOLOv2 with other networks on VOC 2007]

Further Experiments

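　　The authors also trained YOLOv2 on VOC 2012; the figure below compares it with other methods. YOLOv2 reaches 73.4% mAP while running faster. YOLOv2 was also tested on COCO (IOU = 0.5) and compared with Faster R-CNN and SSD. Overall, it is not the best, but it is better than many.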

[Figure: comparison of YOLOv2 with other methods on VOC 2012 and COCO]

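YOLOv2 speed improvements (Faster)

　　YOLO has always valued speed as much as accuracy, and the authors did further work to improve detection speed.

　　Most detection networks rely on VGG-16 as the feature extractor. VGG-16 is a powerful and accurate classification network, but it is more complex than it needs to be: a single forward pass on a 224 * 224 image takes 30.69 billion floating point operations in its convolutional layers.

　　The original YOLO framework uses a custom network based on GoogLeNet, which is faster than VGG-16 and needs only 8.52 billion operations per forward pass. Its accuracy is slightly lower than that of VGG-16: single-crop top-5 accuracy at 224 * 224 is 88% versus 90% (a small gap that is acceptable).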

Darknet-19

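　　YOLOv2 uses a new classification network as its feature extractor, building on earlier work. Like VGG, it uses mostly 3 * 3 filters and doubles the number of channels after every pooling step. Following Network in Network, it uses global average pooling and places 1 * 1 filters between the 3 * 3 convolutions to compress the features. It also uses batch normalization (described above) to stabilize training.

　　The resulting base model is Darknet-19, which has 19 convolutional layers and 5 max-pooling layers; the figure below shows the detailed structure. Darknet-19 needs 5.58 billion operations and achieves 72.9% top-1 and 91.2% top-5 accuracy on ImageNet classification.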

[Figure: Darknet-19 network structure]
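　　For reference, the layer sequence can be written down as plain data and the counts checked. The filter counts and kernel sizes below are my reading of the layer table in the paper, so treat them as an assumption and check them against the darknet cfg file before relying on them.

```python
# Each conv entry is ("conv", filters, kernel_size); pooling entries have no sizes.
DARKNET19 = [
    ("conv", 32, 3), ("maxpool",),
    ("conv", 64, 3), ("maxpool",),
    ("conv", 128, 3), ("conv", 64, 1), ("conv", 128, 3), ("maxpool",),
    ("conv", 256, 3), ("conv", 128, 1), ("conv", 256, 3), ("maxpool",),
    ("conv", 512, 3), ("conv", 256, 1), ("conv", 512, 3),
    ("conv", 256, 1), ("conv", 512, 3), ("maxpool",),
    ("conv", 1024, 3), ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 1000, 1),       # 1000-way classification layer
    ("global_avgpool",),
]

n_conv = sum(1 for layer in DARKNET19 if layer[0] == "conv")
n_pool = sum(1 for layer in DARKNET19 if layer[0] == "maxpool")
downsample = 2 ** n_pool  # each max-pooling layer halves the resolution

print(n_conv, "convolutional layers,", n_pool, "max-pooling layers")
print("total downsampling factor:", downsample)
```

　　The listing shows the two patterns described above: the channel count doubles after each pooling step, and 1 * 1 layers with half the channels sit between the 3 * 3 layers.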

Training for classification

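　　The authors train Darknet-19 on the standard 1000-class ImageNet for 160 epochs using stochastic gradient descent, with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005, and momentum of 0.9. Training uses standard data augmentation, including random crops, rotations, and hue, saturation, and exposure shifts. (These training parameters are specific to the darknet framework and differ somewhat from caffe.)

　　After the initial training at 224 * 224, the authors raise the resolution to 448 * 448 and train for another 10 epochs, with the learning rate set to 0.001. At the higher resolution, the classification network reaches 76.5% top-1 and 93.3% top-5 accuracy.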
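　　The polynomial decay can be sketched with the commonly used form lr = base_lr * (1 - step / total_steps) ^ power. Whether darknet applies it per batch or per epoch is not covered here; the per-epoch use below and the function name are assumptions for illustration.

```python
def poly_lr(step, total_steps, base_lr=0.1, power=4):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps) ** power

total = 160
for epoch in (0, 40, 80, 120, 160):
    print(epoch, poly_lr(epoch, total))
```

　　With a power of 4 the rate falls quickly at first: halfway through training it is already down to one sixteenth of the starting value.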

Training for detection

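　　Once the classification network is trained, it is modified for detection. The authors remove the last convolutional layer and add three 3 * 3 convolutional layers with 1024 filters each (see the cfg file in darknet), followed by a final 1 * 1 convolutional layer whose number of outputs is what detection requires. For VOC, the network predicts 5 boxes, each with 5 coordinate values and 20 classes, so there are 5 * (5 + 20) = 125 output channels. A passthrough layer is also added from the final 3 * 3 * 512 layer to the second-to-last convolutional layer, so that the model can use fine-grained features.

　　The detection model is trained for 160 epochs with a starting learning rate of 0.001, divided by 10 at epochs 60 and 90. Weight decay is 0.0005 and momentum is 0.9, and the data augmentation is similar to that of Faster R-CNN and SSD.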
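　　The output size follows directly from the number of boxes and classes. The helper names below are hypothetical; the values for VOC are the ones given in the text.

```python
def detection_output_channels(num_boxes, num_classes, values_per_box=5):
    """Channels of the final 1 x 1 layer: boxes * (box values + classes)."""
    return num_boxes * (values_per_box + num_classes)

def output_shape(input_size, num_boxes, num_classes, stride=32):
    """Shape (grid, grid, channels) of the detection output for a square input."""
    grid = input_size // stride
    return grid, grid, detection_output_channels(num_boxes, num_classes)

print(detection_output_channels(5, 20))  # VOC: 5 boxes, 20 classes
print(output_shape(416, 5, 20))
```

　　For a dataset with a different number of classes, only the channel count of the final 1 * 1 layer changes; the rest of the network stays the same.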
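　　The step schedule for the detection model is easy to express as a function of the epoch. This is a sketch with a hypothetical function name; the starting rate and the drop points are the ones stated above.

```python
def step_lr(epoch, base_lr=1e-3, milestones=(60, 90), factor=0.1):
    """Learning rate that is multiplied by `factor` at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

for epoch in (0, 59, 60, 89, 90, 159):
    print(epoch, step_lr(epoch))
```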

YOLOv2 classification improvements (Stronger)

　　In this part, the authors use joint training together with WordTree and other methods to extend the number of categories YOLOv2 can detect into the thousands. The details are left for a later update.

Summary and Outlook

　　The authors say, roughly, that these improvements help detection tasks, and that future work may bring similar techniques to weakly supervised image segmentation. Since supervised learning demands heavily labeled data, they also plan to consider weak labels, which would greatly expand the usable datasets and increase the amount of training data.

Further reading: deepsystems.io:Illustration of YOLO