Summary of YOLO V2 principle

YOLOv2 builds on YOLOv1, adding or replacing several components, which resolves some of YOLOv1's problems to a certain extent.

✨1 Summary

There are 8 changes made:

  1. Add a Batch Normalization layer
  2. High-resolution backbone network
  3. Anchor box mechanism
  4. Fully convolutional network structure
  5. New backbone network
  6. K-means clustering of prior boxes
  7. Use of higher-resolution features
  8. Multi-scale training

✨2 Add Batch Normalization layer

After adding Batch Normalization, YOLOv1 got a first performance improvement: on the VOC2007 test set, mAP rose from 63.4% to 65.8%.
Batch Normalization itself is summarized separately.
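For intuition, here is a minimal NumPy sketch of the Batch Normalization forward pass in training mode (NCHW layout; a real layer also learns gamma/beta and keeps running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each channel over the batch and spatial dimensions,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 13, 13)           # a batch of 16-channel feature maps
y = batch_norm(x, np.ones(16), np.zeros(16))
print(y.mean(), y.std())                     # per-channel mean ~0 and std ~1
```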

✨3 High resolution backbone network

In YOLOv1, the backbone is pre-trained on ImageNet with 224×224 images, but the detection task receives 448×448 images, so the network must first cope with this drastic change in resolution.
To solve this, the classification network trained on 224×224 low-resolution images is fine-tuned on 448×448 high-resolution images for 10 epochs. After fine-tuning, the final global average pooling layer and softmax layer are removed, and the remainder serves as the backbone.
This gives YOLOv1 a second performance boost: from 65.8% mAP to 69.5% mAP.

✨4 Anchor box mechanism

This is the anchor-based approach; the alternative is called anchor-free.
Anchor boxes first appeared in Faster R-CNN, which has been summarized separately and will not be repeated here.

✨5 Fully convolutional network structure

There are several obvious problems in yolo v1:

  1. Flatten destroys the spatial structure of the features
  2. The fully connected layers cause an explosion in the number of parameters
  3. Multiple objects falling in the same grid cell cannot all be predicted

To address flatten destroying the spatial structure and the parameter explosion:
The network's input size is changed from 448 to 416, and the last pooling layer and all fully connected layers of the YOLOv1 network are removed. The maximum downsampling factor of the modified network is 32, so the final output is a 13×13 grid rather than 7×7.
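A quick check of why 416 was chosen (the paper's stated reason is that an odd-sized grid has a single center cell, which helps for large objects centered in the image):

```python
# The network downsamples by a factor of 32.
# 416 / 32 = 13 gives an odd grid with one center cell;
# 448 / 32 = 14 would leave no single center cell.
stride = 32
for size in (448, 416):
    print(size, "->", size // stride)   # 448 -> 14, 416 -> 13
```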

For the problem that one grid cell cannot predict multiple objects:
The anchor box mechanism is used. k anchor boxes are preset at each grid cell, and the network only needs to learn the offsets that map each prior box to the ground-truth box, rather than the absolute size of the ground-truth box, which makes training easier. With prior boxes added, YOLOv2 lets each prior box predict its own class and confidence, so each grid cell outputs multiple bounding-box predictions. The output tensor size of YOLOv2 is therefore S×S×k×(1+4+C): each bounding-box prediction contains 1 confidence score, 4 box position parameters, and C class predictions. Multi-object detection thus becomes possible.
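To make the tensor size concrete, here is the arithmetic for an assumed VOC-style configuration (S = 13, k = 5, C = 20):

```python
# Output tensor size S x S x k x (1 + 4 + C) for a VOC-style setup.
S, k, C = 13, 5, 20
per_box = 1 + 4 + C        # 1 confidence + 4 box parameters + C class scores
channels = k * per_box     # channel count of the network's final output map
print((S, S, channels))    # (13, 13, 125)
```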


After the above improvements, accuracy did not increase but dropped slightly, from 69.5% mAP to 69.2% mAP, while recall rose from 81% to 88%. Higher recall means YOLO can find more objects, even though precision drops a little. Outputting multiple detections per grid cell clearly helps the network detect more objects, so the author kept this change despite the small loss of accuracy.

✨6 New backbone network

YOLOv2 designs a new backbone, Darknet-19, for feature extraction (the network-structure figure is omitted here).
The author first pre-trained it on ImageNet, obtaining 72.9% top-1 accuracy and 91.2% top-5 accuracy.
In accuracy, Darknet-19 reaches the level of VGG, but the model is smaller and its floating-point computation is only about 1/5 of VGG's, so it runs much faster.
After pre-training, the 24th, 25th, and 26th layers are removed and the remainder serves as the YOLOv2 backbone, improving the network from 69.2% mAP to 69.6% mAP.

✨7 K-means clustering of prior boxes

The anchor-based mechanism requires choosing several parameters, such as the anchors' width and height, aspect ratio, and number. Unlike Faster R-CNN, where these are designed by hand, YOLOv2 runs k-means clustering on the VOC dataset (using IoU as the distance metric) to obtain k prior boxes; through experiments, the author finally settled on k = 5. The point of clustering is to find anchor widths and heights that suit the dataset.
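A minimal sketch of this clustering, using synthetic (w, h) pairs rather than real VOC annotations. The similarity is a dimensions-only IoU (as if all boxes shared one corner), and 1 - IoU serves as the distance:

```python
import numpy as np

def iou_wh(boxes, centroids):
    # IoU between (w, h) pairs, assuming all boxes share a corner,
    # so only width and height matter.
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest centroid (max IoU = min 1 - IoU).
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
boxes = rng.uniform(10, 300, size=(500, 2))   # synthetic (w, h) samples
anchors = kmeans_anchors(boxes, k=5)
print(anchors.shape)   # (5, 2): five anchor (width, height) pairs
```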
To solve the problems caused by the unconstrained linear outputs in YOLOv1, the four bounding-box position parameters are changed as follows:

  1. The center-coordinate offset should lie between 0 and 1, but a linear output has no range constraint, so the network can easily predict very large values early in training, making the bbox branch unstable. In YOLOv2, the author therefore applies a sigmoid to map the offset into the range 0 to 1.
  2. The width and height should be non-negative. An exp-log scheme is used: during training, the regression targets for w and h are computed with a log,

    t_w = log(w_gt / p_w),  t_h = log(h_gt / p_h)

    where w_gt, h_gt are the ground-truth box's size and p_w, p_h are the prior box's size.

These values are the training targets computed from the labels; the network is trained to regress toward them.
At prediction time, we recover the center coordinates and the width and height from the bounding-box parameters as follows:

  1. Let the prior box's width and height be p_w and p_h.

  2. Let the predicted width and height offsets be t_w and t_h.

  3. Let (c_x, c_y) be the coordinates of the grid cell containing the prior box's center, and let the predicted center offsets be t_x and t_y.

Then the decoded box is:

    b_x = σ(t_x) + c_x
    b_y = σ(t_y) + c_y
    b_w = p_w · exp(t_w)
    b_h = p_h · exp(t_h)

These formulas give the center coordinates and the width and height.
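A small NumPy sketch of this decoding step (grid-relative units; the example cell index and prior size are arbitrary):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # The center offset is squashed into (0, 1) by a sigmoid and added to
    # the grid-cell corner (cx, cy); width and height scale the prior box
    # through an exponential, so they always stay positive.
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

print(decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=3.0, ph=2.0))
# (6.5, 6.5, 3.0, 2.0): zero offsets land at the cell center with the prior's size
```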

Using the k-means-clustered prior boxes together with this "location prediction" style of bounding-box decoding brings a significant improvement: from 69.6% mAP to 74.4% mAP.

✨8 Use of higher-resolution features

This borrows from SSD. YOLOv1 made predictions only from the last feature map. In YOLOv2, the author takes the 26×26×512 feature map produced by the 17th layer of the backbone, downsamples it into a 13×13×2048 feature map, and concatenates it with the final feature map along the channel dimension.
After this operation, the feature map carries more information, and predictions are made on the combined features.

Note that the downsampling operation here is reorg: it halves the feature map's width and height and expands the channels 4×. The advantage is that no detail is lost while the resolution is reduced; the amount of information stays the same.
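A minimal NumPy sketch of a reorg-style (space-to-depth) operation; Darknet's actual reorg arranges the channels in a slightly different order, but the shape change and losslessness are the same:

```python
import numpy as np

def reorg(x, stride=2):
    # Space-to-depth: move each stride x stride spatial block into channels,
    # halving H and W and multiplying C by stride**2. No values are lost.
    c, h, w = x.shape
    x = x.reshape(c, h // stride, stride, w // stride, stride)
    x = x.transpose(0, 2, 4, 1, 3)
    return x.reshape(c * stride * stride, h // stride, w // stride)

x = np.random.randn(512, 26, 26)
print(reorg(x).shape)   # (2048, 13, 13)
```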

After this operation, the mAP on the VOC 2007 test set increased from 74.4% to 75.4% again .

✨9 Multi-scale training

This introduces a multi-scale (image pyramid) idea into YOLO.
Specifically, during training, every 10 iterations a new image size is drawn from {320, 352, 384, 416, 448, 480, 512, 544, 576, 608} and used for the next 10 iterations. (These sizes are all integer multiples of 32, because the network's maximum downsampling factor is 32; an input size not divisible by 32 causes unnecessary trouble.)
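The schedule above can be sketched as follows (the resize-and-train step is elided; the loop only shows how the input size is redrawn):

```python
import random

# Candidate input sizes: every multiple of 32 from 320 to 608.
sizes = list(range(320, 609, 32))
random.seed(0)

for iteration in range(50):
    if iteration % 10 == 0:
        # Every 10 iterations, draw a new input size for the next 10.
        current = random.choice(sizes)
    # ... resize the batch to (current, current) and run one training step ...
```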

The benefit of multi-scale training is that it varies the relative size of objects in the dataset. An object that occupies many pixels in a 608 image occupies far fewer in a 320 image; in terms of pixel count, it is equivalent to turning a larger object into a smaller one. In general, multi-scale training is a common technique for improving model performance. If the objects do not vary much in size, the technique brings little benefit and is unnecessary.

With multi-scale training, the network improves once more: from 75.4% mAP to 76.8% mAP.

✨10 Prior-box adaptation problem

Prior boxes obtained by k-means are naturally tailored to the dataset they were clustered on:
Prior boxes clustered from dataset A are unlikely to suit a new dataset B, and the more A and B differ, the worse this problem becomes. When switching datasets, say from Pascal VOC to COCO, re-clustering is needed.
The clustered prior boxes also depend heavily on the dataset itself: if the dataset is too small and its samples not rich enough, the resulting prior boxes may not provide good enough size priors.

This, however, is a problem inherent to anchor-based methods in general, and it was not resolved until anchor-free methods emerged.


Origin blog.csdn.net/weixin_51691064/article/details/131187295