[Object Detection] YOLOv2 Detailed Explanation

Foreword

We explained YOLOv1 in an earlier post, so here I will continue from that discussion and write about the basic ideas and improvements of YOLOv2.

The full title of the YOLOv2 paper is YOLO9000: Better, Faster, Stronger, which won a Best Paper Honorable Mention at CVPR 2017. In the paper, the authors first propose YOLOv2, an improved version of YOLOv1, and then propose a joint training method for detection and classification. Using this joint training on the COCO detection dataset and the ImageNet classification dataset, they obtain the YOLO9000 model, which can detect more than 9000 categories of objects. The paper therefore actually contains two models, YOLOv2 and YOLO9000, but the latter is built on top of the former and the two share the same main structure. Compared with YOLOv1, YOLOv2 makes many improvements, which significantly raise its mAP while keeping it very fast, preserving its advantage as a one-stage method. A comparison of YOLOv2 with Faster R-CNN, SSD and other models is shown in the figure below. The following first introduces YOLOv2's improvement strategies over YOLOv1.

[Figure: comparison of YOLOv2 with Faster R-CNN, SSD and other models]

 1. Improvement steps

The improvements of YOLOv2 over YOLOv1 can be divided into the following three points:

1. Improvement of network structure

2. Anchor design

3. Improvement of training strategy

Next, I will elaborate on these three points.

1. Improvement of network structure

A BN layer is added between each convolutional layer and its activation function. Batch Normalization helps with the vanishing and exploding gradient problems during backpropagation, reduces sensitivity to some hyperparameters (such as the learning rate, the range of network parameters, and the choice of activation function), and speeds up model convergence. It also has a certain regularization effect and reduces overfitting. In YOLOv2, a Batch Normalization layer is added after every convolutional layer, and dropout is no longer used; this design is adopted by many models today. With Batch Normalization, the mAP of YOLOv2 increases by 2.4%.
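As a rough illustration (not the original Darknet code), the sketch below shows in PyTorch what a "Conv → BN → LeakyReLU, no dropout" block looks like; the function name and the 0.1 LeakyReLU slope follow common Darknet-style re-implementations and are assumptions here.

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size, stride=1):
    # bias=False: the BN layer's learnable shift makes a conv bias redundant
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),          # BN between the conv and its activation
        nn.LeakyReLU(0.1, inplace=True), # no dropout is used after this block
    )
```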

After adding the BN layers, in order to better detect small objects, YOLOv2 also introduces a passthrough layer that exploits fine-grained features. Unlike the multi-scale feature maps used by SSD, the passthrough layer is similar to a shortcut in ResNet: it takes an earlier, higher-resolution feature map as input and connects it to a later, lower-resolution feature map. Specifically, the 26×26×512 feature map just before the last downsampling is split into 2×2 local regions, which are stacked along the channel dimension. After the passthrough layer, the 26×26×512 feature map becomes a new 13×13×2048 feature map (the spatial size is reduced by a factor of 4 and the number of channels is increased by a factor of 4), so that it can be concatenated with the subsequent 13×13×1024 feature map to form a 13×13×3072 feature map, on which convolutions are then applied to make predictions. The concrete implementation is shown in the figure below. (Note that the connection here is a concatenation, i.e. splicing along the channel dimension.)

[Figure: implementation of the passthrough layer]
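The following is a minimal sketch of the passthrough (reorg / space-to-depth) idea in PyTorch: every 2×2 spatial block of the 26×26×512 map is moved onto the channel dimension to give 13×13×2048, which is then concatenated with the 13×13×1024 map. The function name and the exact channel ordering are my own; Darknet's reorg layer orders channels slightly differently, but the shape transformation is the same.

```python
import torch

def passthrough(x, stride=2):
    # (B, C, H, W) -> (B, C*stride*stride, H/stride, W/stride)
    b, c, h, w = x.shape                                   # e.g. (B, 512, 26, 26)
    x = x.view(b, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()           # move 2x2 blocks to channels
    return x.view(b, c * stride * stride, h // stride, w // stride)

fine = torch.randn(1, 512, 26, 26)      # higher-resolution, shallower feature map
coarse = torch.randn(1, 1024, 13, 13)   # lower-resolution, deeper feature map
fused = torch.cat([passthrough(fine), coarse], dim=1)      # (1, 3072, 13, 13)
```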

As for why the passthrough layer helps: the shallow feature map contains more positional and detailed information, which is better suited for bounding-box regression in detection, while the deeper feature map contains more semantic information, which classification needs; detection needs both the semantics and the finer spatial information about objects. Therefore, fusing shallow features improves detection.

2. Anchor design

The anchor-related part of YOLOv2 changes a lot. In YOLOv1, the input image is divided into 7×7 grid cells, and each cell predicts 2 bounding boxes. YOLOv1 contains fully connected layers that directly predict the bounding-box coordinates, where the width and height of a box are relative to the size of the whole image. Since each image contains objects of different scales and aspect ratios, it is hard for YOLOv1 to learn to adapt to all object shapes during training, which is one reason its precise localization is poor. The authors found that predicting offsets rather than absolute coordinates simplifies the problem and makes the network easier to train.

Drawing on Faster R-CNN, YOLOv2 removes the fully connected layers of YOLOv1 and uses convolutions with anchor boxes to predict bounding boxes. By presetting a set of boxes with different sizes and aspect ratios in each cell, it can cover different positions and multiple scales across the whole image. At the same time, in order to keep a higher-resolution feature map for detection, one pooling layer in the network is removed, and the network input size becomes 416×416 instead of the original 448×448. YOLOv2's convolutional layers downsample the image by a factor of 32, so with a 416×416 input the output is a 13×13 feature map. Using anchor boxes reduces accuracy slightly, but it lets YOLO predict 13×13×9 = 1521 boxes; as a result, the recall increases greatly from 81% to 88%.
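For intuition, the sketch below decodes such offsets using the formulas from the YOLOv2 paper (bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th); the function signature and tensor layout are my own assumptions, not the Darknet implementation.

```python
import torch

def decode(tx, ty, tw, th, cx, cy, pw, ph):
    """cx, cy: indices of the responsible cell on the 13x13 grid;
    pw, ph: width/height of the anchor (prior) box, in grid-cell units."""
    bx = torch.sigmoid(tx) + cx   # sigmoid keeps the box center inside its cell
    by = torch.sigmoid(ty) + cy
    bw = pw * torch.exp(tw)       # width/height scale the anchor prior
    bh = ph * torch.exp(th)
    return bx, by, bw, bh         # grid-cell units (multiply by 32 for pixels)
```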

In Faster R-CNN and SSD, the dimensions (width and height) of the prior boxes are set by hand, which involves some subjectivity. If the chosen prior-box dimensions are appropriate, the model is easier to train and makes better predictions. Therefore, YOLOv2 uses k-means clustering on the bounding boxes of the training set to find prior-box sizes that match the samples as well as possible.
The most important choice in the clustering algorithm is how to compute the "distance" between two bounding boxes. With the commonly used Euclidean distance, large bounding boxes produce larger errors. Moreover, the purpose of the prior boxes is to give good IOU between the predicted boxes and the ground truth, so the IOU between a box and the cluster-center box is used as the distance indicator during clustering:

d(box, centroid) = 1 − IOU(box, centroid)

Here centroid is the box chosen as a cluster center, box is any other box, and d is the "distance" between the two; the larger the IOU, the smaller the "distance". The cluster analysis results reported for YOLOv2 are shown in the figure below:

[Figure: k-means clustering results for prior-box dimensions]

The figure above shows the k centroid boxes obtained for different values of k, together with the average IOU between the labeled boxes in the samples and their closest centroid. Clearly, the larger k is, the higher the Avg IOU. YOLOv2 chooses k = 5 as a compromise between the number of prior boxes and IOU. Compared with hand-picked priors, 5 clustered boxes achieve 0.61 Avg IOU, roughly equivalent to the 0.609 Avg IOU of 9 manually set prior boxes.
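A minimal sketch of this clustering step, assuming boxes are given as (width, height) pairs and using d = 1 − IOU as the distance; all names are illustrative, not the authors' code.

```python
import numpy as np

def iou_wh(boxes, centroids):
    # IOU between (w, h) pairs, assuming all boxes share the same center
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)  # d = 1 - IOU
        centroids = np.array([
            boxes[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
            for i in range(k)
        ])
    return centroids   # k prior-box (w, h) sizes
```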


Putting the above together, the output of YOLOv2 is 13×13×(5×(5+s)), where s is the number of classes: each of the 5 anchor boxes in a cell predicts 4 coordinates, 1 confidence and s class scores (for VOC, s = 20, so the output is 13×13×125).

3. Improvement of training strategy

There are plenty of training samples for image classification, but relatively few samples annotated with bounding boxes for object detection, because labeling boxes is labor-intensive. Object detection models therefore usually first train their convolutional layers on image classification samples to learn to extract image features. This raises another problem: the resolution of the classification samples is not very high. YOLOv1 pre-trains its convolutional layers on ImageNet classification images at 224×224, and then switches to higher-resolution 448×448 images when training for detection; this abrupt switch hurts performance. Therefore, after pre-training the classification model on 224×224 images, YOLOv2 fine-tunes it on 448×448 high-resolution samples for 10 epochs so that the network gradually adapts to the higher resolution, and only then trains on detection samples. This alleviates the impact of the sudden resolution switch.

Since the YOLOv2 model contains only convolutional and pooling layers, its input is not restricted to 416×416 images. To make the model more robust, YOLOv2 adopts a multi-scale training strategy: during training, the input image size is changed every fixed number of iterations. Because the total downsampling factor of YOLOv2 is 32, the input sizes are chosen from multiples of 32: {320, 352, ..., 608}. The smallest input is 320×320 with a corresponding 10×10 feature map, and the largest is 608×608 with a 19×19 feature map. During training, a new input size is randomly selected every 10 iterations; since the network is fully convolutional, only the size of the final detection feature map changes. A model trained this way can predict objects at multiple scales. Moreover, larger inputs give higher accuracy while smaller inputs run faster, so the multi-scale-trained YOLOv2 model can adapt to the requirements of many different scenarios.
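A minimal sketch of the idea, assuming a dummy batch and PyTorch's interpolate for resizing; in real training the forward pass, loss and backward step would follow the resize.

```python
import random
import torch
import torch.nn.functional as F

scales = list(range(320, 609, 32))        # 320, 352, ..., 608 (multiples of 32)
images = torch.randn(2, 3, 416, 416)      # dummy batch standing in for a dataloader

for step in range(30):
    if step % 10 == 0:                    # pick a new input size every 10 iterations
        size = random.choice(scales)
    batch = F.interpolate(images, size=(size, size),
                          mode='bilinear', align_corners=False)
    grid = size // 32                     # detection feature map is grid x grid
    # ... forward pass / loss / backward would go here in real training ...
```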


With the Multi-Scale Training strategy, YOLOv2 can adapt to images of different sizes and still predict well. At test time, YOLOv2 can take inputs of different sizes; the results on the VOC 2007 dataset are shown in the figure below. With smaller resolutions, the mAP of YOLOv2 is slightly lower but the speed is higher; with high-resolution inputs the mAP is higher but the speed drops slightly. At 544×544, mAP reaches 78.6%. Note that only the input size at test time differs; the same model (trained with Multi-Scale Training) is used throughout.

[Figure: accuracy and speed of YOLOv2 on VOC 2007 with different test input sizes]

 2. Network Architecture

Most of this has already been covered above. What remains to be said is that, since YOLOv2 presets 5 anchor boxes in each cell and each box has 25 predicted values, the number of channels of the final output feature map is 125. The 25 predictions of one box consist of 20 class predictions, 4 location predictions and 1 confidence value. This differs substantially from v1: in v1, the bounding boxes within a cell share one set of class predictions, whereas here each box has its own independent class predictions (which resolves the limitation that each cell in YOLOv1 can only predict one object).
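As an illustration, the sketch below splits a 13×13×125 output into per-anchor offsets, confidence and class scores; the exact memory layout in Darknet may differ, so this layout is an assumption.

```python
import torch

num_anchors, num_classes = 5, 20
out = torch.randn(1, num_anchors * (5 + num_classes), 13, 13)   # (1, 125, 13, 13)

out = out.view(1, num_anchors, 5 + num_classes, 13, 13)
box_offsets  = out[:, :, 0:4]   # tx, ty, tw, th for each anchor
objectness   = out[:, :, 4]     # confidence for each anchor
class_scores = out[:, :, 5:]    # independent class scores per anchor (unlike v1)
```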


As for the positive and negative samples and the loss function, they follow YOLOv1 without changes.

3. Summary

In summary, although YOLOv2 makes many improvements, most of them draw on techniques from other papers, such as the anchor boxes of Faster R-CNN. YOLOv2 uses anchor boxes and convolutions for prediction, which makes it very similar to the SSD model (an SSD restricted to a single-scale feature map), and SSD in turn borrows from the RPN of Faster R-CNN. In a sense, the two one-stage models YOLOv2 and SSD are essentially RPN-like networks, except that the RPN does not predict categories but only distinguishes objects from background. In the two-stage approach, the role of the RPN is to produce region proposals, i.e. to perform a coarse detection, and an additional stage (the R-CNN head) then further improves detection accuracy and predicts categories. One-stage methods instead try to use such an "RPN" directly to make accurate predictions in one step, which requires many tricks in the network design. A major innovation of YOLOv2 is the Multi-Scale Training strategy, which lets the same model adapt to images of various sizes.
Compared with the previous version, YOLOv2 makes a qualitative leap, mainly by absorbing the strengths of other algorithms, adopting prior boxes, feature fusion and other methods, and combining a variety of training techniques, so that the model greatly improves detection accuracy while remaining very fast. YOLOv2 reaches a high level of detection, but its shortcomings can be summarized as follows:
● Single-layer feature map: although the passthrough layer fuses shallow features and improves multi-scale detection to some extent, predictions are still made on a single feature map, so the fine-grained information is still insufficient, small-object detection is limited, and the simple yet effective residual structure is not used.
● Limited by its overall network architecture, it still does not solve the detection of small objects well.

Reference: [Target Detection] Single-stage Algorithm--YOLOv2 Detailed Explanation
