Paper Reading: YOLOv2 (YOLO9000: Better, Faster, Stronger)

1、Paper Overview

(figure omitted)

YOLOv2 is an improved version of YOLO, and it feels like a thorough overhaul. It adopts several modules that have worked well in other networks: batch normalization layers; anchor boxes in the spirit of the RPN from Faster R-CNN (after switching to anchors, YOLO's mAP drops slightly, but recall improves substantially); and multi-scale feature fusion as in SSD. It also introduces some fairly novel pieces of its own: a high-resolution classifier (the classifier is fine-tuned at high resolution before being transferred to detection); a new backbone (Darknet-19, which is both fast and accurate); anchor aspect ratios obtained by cluster analysis of the training boxes; location prediction that constrains the predicted center to an offset within its grid cell (between 0 and 1); and multi-scale training (a single model trained this way can then be tested at multiple input sizes).
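Of these changes, the anchor clustering is easy to make concrete: the paper runs k-means on the training boxes' (w, h) pairs with the distance d(box, centroid) = 1 − IoU(box, centroid). A minimal NumPy sketch (the function names are mine, not from the paper or Darknet):

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and k cluster (w, h) priors,
    assuming all boxes share the same top-left corner."""
    inter_w = np.minimum(box[0], clusters[:, 0])
    inter_h = np.minimum(box[1], clusters[:, 1])
    inter = inter_w * inter_h
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """k-means on (w, h) pairs with distance d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the nearest prior (highest IoU)
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        # recompute each centroid as the mean of its assigned boxes
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else clusters[i] for i in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters
```

With k = 5 the paper reports a good trade-off between recall and model complexity; the resulting priors are taller and thinner than hand-picked anchors.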

Equally important is the YOLO9000 model, built on top of YOLOv2. The architecture is essentially unchanged; the idea is to jointly train on a large classification dataset together with a smaller detection dataset (a WordTree built from WordNet resolves the problem that the class labels are not mutually exclusive). For a classification sample, only the classification loss is backpropagated; for a detection sample, both the classification and the localization losses are backpropagated.
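The joint-training rule in the last sentence (classification-only samples contribute no localization loss) can be sketched as a simple masked loss. This is an illustrative helper under my own conventions, not code from the paper:

```python
import numpy as np

def joint_loss(cls_loss, box_loss, det_mask):
    """Combine per-sample losses for a mixed batch.

    cls_loss: (N,) classification loss for every sample
    box_loss: (N,) localization loss (only meaningful where det_mask == 1)
    det_mask: (N,) 1.0 for detection samples, 0.0 for classification-only
    """
    n_det = max(det_mask.sum(), 1.0)  # avoid dividing by zero
    # classification loss flows for everyone; box loss only where labeled
    return cls_loss.mean() + (box_loss * det_mask).sum() / n_det
```

Masking in the loss is equivalent to zeroing the localization gradients for classification-only images, which is what "only the classification loss is backpropagated" amounts to.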

We propose a new method to harness the large amount of classification data we already have and use it to expand the scope of current detection systems. Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together. We also propose a joint training algorithm that allows us to train object detectors on both detection and classification data. Our method leverages labeled detection images to learn to precisely localize objects while it uses classification images to increase its vocabulary and robustness.

Using this method we train YOLO9000, a real-time object detector that can detect over 9000 different object categories. First we improve upon the base YOLO detection system to produce YOLOv2, a state-of-the-art, real-time detector. Then we use our dataset combination method and joint training algorithm to train a model on more than 9000 classes from ImageNet as well as detection data from COCO.

2、Location Prediction

The network predicts 5 bounding boxes at each cell in the output feature map, and 5 coordinates for each bounding box: tx, ty, tw, th, and to. If the cell is offset from the top-left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
Pr(object) · IOU(b, object) = σ(to)
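The decoding rule above can be sketched in NumPy. The function name and array layout are my own choices, not Darknet's:

```python
import numpy as np

def decode_boxes(t, priors, cx, cy):
    """Decode raw predictions (tx, ty, tw, th, to) for one grid cell.

    t:        (num_anchors, 5) raw network outputs
    priors:   (num_anchors, 2) anchor (pw, ph) pairs
    (cx, cy): cell offset from the image's top-left corner, in cells
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    bx = sigmoid(t[:, 0]) + cx           # sigmoid keeps the center inside the cell
    by = sigmoid(t[:, 1]) + cy
    bw = priors[:, 0] * np.exp(t[:, 2])  # width scales the anchor prior
    bh = priors[:, 1] * np.exp(t[:, 3])
    conf = sigmoid(t[:, 4])              # sigma(to) = Pr(object) * IOU(b, object)
    return np.stack([bx, by, bw, bh, conf], axis=1)
```

Because σ(·) bounds the center offsets to (0, 1), the box center cannot drift outside its cell, which is what stabilizes early training compared with the unconstrained RPN-style offsets.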

(figure omitted)

3、Reducing the 26×26 Feature Map to 13×13

(figure omitted)
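This 26×26 to 13×13 step is the passthrough layer: a space-to-depth reshape that turns a 26×26×512 map into 13×13×2048 so it can be concatenated with the final 13×13 features. A sketch of one common space-to-depth layout (Darknet's exact channel ordering differs, but the shape transformation is the same):

```python
import numpy as np

def reorg(x, stride=2):
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride**2).

    Each stride x stride spatial block is folded into the channel axis,
    so no information is lost, only rearranged.
    """
    h, w, c = x.shape
    assert h % stride == 0 and w % stride == 0
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the block offsets next to channels
    return x.reshape(h // stride, w // stride, c * stride * stride)
```

Applied to the 26×26×512 map this yields 13×13×2048, which stacks onto the 13×13×1024 backbone output to give the fine-grained features YOLOv2 uses for small objects.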

4、Performance Comparison

(figure omitted)

5、Darknet-19

(figure omitted)

We propose a new classification model to be used as the base of YOLOv2. Our model builds off of prior work on network design as well as common knowledge in the field. Similar to the VGG models we use mostly 3 × 3 filters and double the number of channels after every pooling step [17]. Following the work on Network in Network (NIN) we use global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions [9]. We use batch normalization to stabilize training, speed up convergence, and regularize the model [7].

Our final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. Darknet-19 only requires 5.58 billion operations to process an image yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet (fast and accurate).
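The layer counts quoted above can be checked against the paper's Table 6. Below is the filter configuration as I read it from that table (treat it as a reconstruction, not official code). The 3×3 convolutions are padded, so only the five maxpools shrink the 224×224 input, down to 7×7 before global average pooling:

```python
# (filters, kernel_size) for each conv layer, "M" for a 2x2 maxpool.
# Channels double after every pooling step, VGG-style; 1x1 convs
# compress the representation between the 3x3 convs, NIN-style.
DARKNET19 = [
    (32, 3), "M",
    (64, 3), "M",
    (128, 3), (64, 1), (128, 3), "M",
    (256, 3), (128, 1), (256, 3), "M",
    (512, 3), (256, 1), (512, 3), (256, 1), (512, 3), "M",
    (1024, 3), (512, 1), (1024, 3), (512, 1), (1024, 3),
    (1000, 1),  # classification head, followed by global average pooling
]

def summarize(cfg, size=224):
    """Count conv/pool layers and track the spatial size of the input."""
    convs = pools = 0
    for layer in cfg:
        if layer == "M":
            pools += 1
            size //= 2  # each maxpool halves the resolution
        else:
            convs += 1  # padded convs keep the spatial size
    return convs, pools, size
```

Running `summarize(DARKNET19)` confirms 19 conv layers, 5 maxpools, and a 7×7 map going into the global average pool.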

6、YOLO9000 Best and Worst Classes on ImageNet

(figure omitted)

7、YOLOv2's Loss Function

The YOLOv2 paper does not give the loss function explicitly, and material on it is relatively scarce online. A few blog posts discuss it, with the loss formulas reverse-engineered from the source code.

(figure omitted)

An explanation of this loss function: https://zhuanlan.zhihu.com/p/35325884

References

1、Object Detection | YOLOv2 Principles and Implementation (with YOLOv3)

2、YOLO v2 Algorithm Explained in Detail

3、Revisiting YOLO v2



Reproduced from blog.csdn.net/j879159541/article/details/102898431