In-depth understanding of object detection: YOLOv2

Paper: YOLO9000: Better, Faster, Stronger
Paper link: https://arxiv.org/abs/1612.08242

Detailed YOLOv1 write-up: https://blog.csdn.net/qq_40716944/article/details/104908692

The YOLOv2 paper was published at CVPR 2017. YOLOv2 is a leapfrog improvement over v1 that draws on the best ideas from other researchers, and its paper remains the most content-dense in the YOLO series so far.

This article introduces the principle and implementation of YOLOv2. The full title of the paper is YOLO9000: Better, Faster, Stronger. In it, the author first proposes YOLOv2 as an improvement over YOLOv1, and then proposes a joint training method for detection and classification. Using this joint training method, the YOLO9000 model is trained on the COCO detection dataset and the ImageNet classification dataset together, and can detect more than 9,000 object categories. The paper therefore actually covers two models, YOLOv2 and YOLO9000, but the latter is built on the former and the two share the same main structure. Compared with YOLOv1, YOLOv2 makes many improvements that significantly raise its mAP while keeping the high speed that is the hallmark of one-stage methods. A comparison of YOLOv2 with Faster R-CNN, SSD, and other models is shown in Figure 1.

 1. YOLOv2's improvement strategies over YOLOv1 (Better)

Although YOLOv1 is very fast, it is less accurate than the R-CNN family of detectors: its object localization is imprecise and its recall is low. YOLOv2 therefore proposes several improvement strategies to raise the localization accuracy and recall of the YOLO model, and thereby its mAP. Throughout these improvements, YOLOv2 follows one principle: maintain the detection speed that is the YOLO model's major advantage. Each improvement strategy is described in detail below.

1、Batch Normalization

The BN (Batch Normalization) layer normalizes the input of each layer of the network, so the network no longer needs to adapt to a shifting data distribution at every layer, and convergence is accelerated. The original YOLOv1 (which used a GoogLeNet-style network to extract features) had no BN layers, so in YOLOv2 the author adds a BN layer after every convolutional layer. In addition, because BN regularizes the model, dropout is no longer used. Experiments show that adding Batch Normalization raises YOLOv2's mAP by 2.4%. This design is retained in YOLOv3, and since YOLOv2 the BN layer has been standard in the YOLO family.
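As a concrete illustration, here is a minimal PyTorch sketch of the conv + BN pattern described above (an illustration only; the original implementation is Darknet, written in C):

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, kernel_size):
    """Convolution followed by BN, as used throughout YOLOv2."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),  # BN's shift makes the conv bias redundant
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),  # Darknet uses leaky ReLU with slope 0.1
    )
```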

2、High Resolution Classifier 

First, the value of fine-tuning is self-evident: nowadays almost no one trains a classification or detection model starting from randomly initialized parameters. Instead, a pre-trained network is fine-tuned for one's own task, and these pre-trained networks are almost always trained on the ImageNet dataset. On one hand the dataset is large, on the other the training time is long, and such pre-trained networks are readily available on GitHub.
       The original YOLOv1 network used 224*224 input during pre-training (because classification models are generally pre-trained on ImageNet at that resolution), and then switched to 448*448 input for detection. This forces the model to adapt to a change in image resolution at the same time that it switches from classification to detection. YOLOv2 splits the pre-training into two steps: first train the network from scratch with 224*224 input for about 160 epochs (meaning all training data is looped over 160 times), then change the input to 448*448 and train for another 10 epochs. Note that both steps operate on the ImageNet dataset. Only then is the network fine-tuned on the detection dataset, so that using 448*448 images as input during detection is a smooth transition. The author shows experimentally that this raises mAP by about 4%.
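The schedule can be summarized in a few lines of hypothetical Python; the helper functions here are placeholders for illustration, not real APIs:

```python
def train_classifier(model, input_size, epochs):
    ...  # placeholder: ordinary ImageNet classification training

def finetune_detection(model, input_size):
    ...  # placeholder: detection fine-tuning (covered in later sections)

def pretrain_and_finetune(model):
    train_classifier(model, input_size=224, epochs=160)  # step 1: from scratch, ImageNet
    train_classifier(model, input_size=448, epochs=10)   # step 2: same task, higher resolution
    finetune_detection(model, input_size=448)            # only then fine-tune for detection
```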

 3、Convolutional With Anchor Boxes

In YOLOv1, the input picture is divided into a 7*7 grid, and each grid cell predicts 2 bounding boxes. YOLOv1 uses fully connected layers to predict the bounding boxes directly, with widths and heights expressed relative to the whole image. Because objects of different scales and aspect ratios appear in every picture, it is difficult for YOLOv1 to learn to adapt to all of these shapes during training, which is one reason its precise localization is poor. YOLOv2 instead borrows the idea of anchors from Faster R-CNN. First, the fully connected layers and the last pooling layer of the original network are removed, so that the final convolutional layer produces a higher-resolution feature map. Then the network input is reduced from 448*448 to 416*416. Because YOLOv2's total downsampling factor is 32, a 416*416 image yields a final feature map of 13*13, an odd size. This is deliberate: an odd width and height ensure that when the feature map is divided into cells there is exactly one center cell (for example, a 7*7 or 9*9 grid has a single center cell, whereas an 8*8 or 10*10 grid has four center cells). Why insist on a single center cell? Because large objects usually occupy the center of the image, and it is preferable to predict them with one center cell rather than four.

Recall that the original YOLOv1 divides the input image into a 7*7 grid with two bounding boxes per cell, for only 98 boxes in total. By introducing anchor boxes, YOLOv2 predicts well over a thousand boxes (taking the 13*13 output feature map as an example: with 9 anchor boxes per grid cell there would be 13*13*9 = 1521 boxes; as we will see in the fourth improvement below, each grid cell ultimately uses 5 anchor boxes). Increasing the number of boxes is clearly aimed at improving localization. The author's experiments show that although adding anchors slightly reduces mAP (69.5 to 69.2), it improves recall (81% to 88%).
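The box counts quoted above follow from simple arithmetic:

```python
yolov1_boxes = 7 * 7 * 2        # 98: 7*7 grid, 2 boxes per cell
yolov2_boxes_9 = 13 * 13 * 9    # 1521: 13*13 grid with 9 anchors per cell
yolov2_boxes_5 = 13 * 13 * 5    # 845: with the 5 anchors actually chosen
print(yolov1_boxes, yolov2_boxes_9, yolov2_boxes_5)
```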

 4、Dimension Clusters

In Faster R-CNN and SSD, the dimensions (width and height) of the prior boxes are set by hand, which introduces a degree of subjectivity. If more appropriate prior box dimensions are chosen, the model has an easier time learning to make good predictions. YOLOv2 therefore runs k-means clustering over the bounding boxes in the training set. Since the whole point of the priors is to produce good IOU between predictions and ground truth, the distance measure used in the cluster analysis is based on IOU with the cluster-center box rather than Euclidean distance:

d(box, centroid) = 1 - IOU(box, centroid)
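A minimal NumPy sketch of this clustering, written for illustration rather than taken from the authors' code (boxes are compared by (w, h) as if their centers coincide):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, assuming the boxes share a common center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """boxes: (N, 2) array of ground-truth (w, h); returns k prior box sizes."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # minimizing d = 1 - IOU is the same as maximizing IOU
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        # simple mean update; empty clusters keep their old centroid for brevity
        centroids = np.stack([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                              else centroids[i] for i in range(k)])
    return centroids
```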

As shown in Figure 2, the left side plots the average IOU against the number of clusters k; the two curves correspond to two different datasets. After weighing model complexity against recall, the author chose k = 5, and the right side of Figure 2 shows the five selected box shapes, where purple and black again represent the two datasets (it can be seen that the basic box shapes for the two datasets are similar). Notably, the clustering results differ clearly from hand-picked anchor sizes: most of the clustered priors are tall, thin boxes, while short, wide boxes are few.

5、Direct Location Prediction

The second problem the author encountered when introducing anchor boxes is model instability, especially early in training. The author attributes this instability mainly to predicting the (x, y) position of the box. In region-proposal-based object detectors, (x, y) is obtained by predicting offsets tx and ty. The paper writes the formula with a minus sign, but I personally think it should be a plus sign to be consistent with the example given beneath it:

x = (tx * wa) + xa
y = (ty * ha) + ya

Here xa and ya are the coordinates of the anchor, wa and ha are the anchor's width and height, x and y are the predicted coordinates, and tx and ty are the offsets. The paper gives a concrete example: a prediction of tx = 1 shifts the box right by the width of the anchor box, and a prediction of tx = -1 shifts it left by the same amount.

For reference, the corresponding parameterization in Faster R-CNN is tx = (x - xa) / wa and ty = (y - ya) / ha, which is consistent with the formula above once the minus sign is changed to a plus sign.

Instead of directly predicting offsets in this way, YOLOv2 keeps YOLOv1's approach of predicting box locations relative to the grid cell.
As mentioned earlier, the network's last convolutional layer outputs a 13*13 feature map, each cell predicts 5 bounding boxes, and each bounding box predicts 5 values: tx, ty, tw, th, and to (where to plays a role similar to the confidence in YOLOv1). tx and ty are passed through a sigmoid function, constraining them to the range (0, 1); this normalization also makes training more stable. With cx and cy the horizontal and vertical offsets of the cell from the top-left corner of the image (in cell units), and pw and ph the width and height of the prior (anchor) box, the final box is:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw * e^(tw)
bh = ph * e^(th)

and σ(to) gives the box confidence.

If the formulas above are unclear, Figure 3 helps: cx and cy indicate the distance of the grid cell from the top-left corner of the image, the black dashed box is the prior (anchor) box, and the blue rectangular box is the predicted result.
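The decoding can be written out directly; the following NumPy sketch is for illustration (shapes and units are my assumptions, not the original Darknet code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride=32):
    """Decode one box: (cx, cy) is the cell's top-left offset in cell units,
    (pw, ph) is the matched prior's size in cell units, stride is 32."""
    bx = sigmoid(tx) + cx      # center stays inside the cell: cx < bx < cx + 1
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)       # exp keeps width and height positive
    bh = ph * np.exp(th)
    # multiply by the stride to convert grid units back to input-image pixels
    return bx * stride, by * stride, bw * stride, bh * stride
```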

6、Fine-Grained Features

The key addition here is a passthrough layer. Its function is to connect the previous layer's 26*26 feature map with this layer's 13*13 feature map, somewhat like the shortcut connections in ResNet: the 26*26*512 map is reorganized by stacking each 2*2 spatial block into channels, yielding 13*13*2048, which is then concatenated with the 13*13 map. The motivation is that while a 13*13 feature map is sufficient for predicting large objects, it is not necessarily adequate for small ones: the smaller an object is, the more likely its signal is to vanish after repeated convolution and pooling. Merging in the higher-resolution feature map from the earlier layer therefore helps detect smaller objects. This improvement raises performance by 1%.
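A sketch of this space-to-depth reorganization in PyTorch (an illustration with an assumed N, C, H, W layout; Darknet's reorg layer may order the channels differently, but the shape transformation is the point):

```python
import torch

def passthrough(x, stride=2):
    """Stack each stride*stride spatial block into channels."""
    n, c, h, w = x.shape                        # e.g. (1, 512, 26, 26)
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

fine = torch.randn(1, 512, 26, 26)              # earlier, higher-resolution map
coarse = torch.randn(1, 1024, 13, 13)           # final 13*13 map
fused = torch.cat([passthrough(fine), coarse], dim=1)  # (1, 2048 + 1024, 13, 13)
```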

7、Multi-Scale Training 

To make the YOLOv2 model more robust, the author introduces multi-scale training: simply put, the size of the input image is changed dynamically during training. Note that this happens during fine-tuning on the detection dataset; do not confuse it with the two-step pre-training of the classification model on ImageNet described earlier. Specifically, every 10 batches the network randomly switches to a new input size (the paper says 10 batches; I suspect this is a slip and should perhaps be 10 epochs). How are the candidate sizes chosen? We saw earlier that the network's default input is 416*416 and its output feature map is 13*13, i.e. the downsampling factor is 32, so the author uses multiples of 32 as input sizes, specifically {320, 352, ..., 608}. Training this way lets the same network detect images of different resolutions. Training is slower at large input sizes and faster at small ones, and since multi-scale training improves accuracy, it strikes a good balance between accuracy and speed.
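As an illustration, the scale schedule might look like this (a sketch of the schedule only, not a full training loop):

```python
import random

SCALES = list(range(320, 608 + 1, 32))   # {320, 352, ..., 608}, all multiples of 32

size = 416
for step in range(100):                  # stand-in for the training batch loop
    if step % 10 == 0:                   # the paper switches every 10 batches
        size = random.choice(SCALES)
    # resize this batch's images to (size, size) and train as usual;
    # the corresponding output grid is size // 32 cells on each side
```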

Table 3 compares YOLOv2 at different input sizes against other object detection algorithms. Thanks to multi-scale training, the detection model can accept test-time inputs over a wide range of sizes and trade off mAP against FPS accordingly. That said, the performance of the SSD algorithm in the same table is also eye-catching.

Finally, look at the contribution of these techniques to mAP: 

The improvement from the High Resolution Classifier is very noticeable (about 4%), and introducing anchors via the combination of dimension priors + direct location prediction brings roughly another 5% mAP.

2. Faster detection (Faster)

1、New Network: Darknet-19 

YOLOv2 uses a new backbone (feature extractor) called Darknet-19, comprising 19 convolutional layers and 5 maxpooling layers, as shown in Figure 3. Darknet-19 follows the same design principles as VGG16: it mainly uses 3*3 convolutions, and after every 2*2 maxpooling layer the feature map resolution is halved while its channels are doubled. Like NIN (Network in Network), Darknet-19 uses global average pooling for its predictions, and places 1*1 convolutions between the 3*3 convolutions to compress the feature map channels and reduce computation and parameters. Darknet-19 also uses a batch norm layer after every convolutional layer to speed up convergence and reduce overfitting. On the ImageNet classification dataset, Darknet-19 achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy with relatively few parameters. After switching to Darknet-19, YOLOv2's mAP does not improve significantly, but the amount of computation drops by about 33%.
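One stage of this pattern, sketched in PyTorch using the conv_bn helper from the Batch Normalization section above (an illustration, not the Darknet source):

```python
import torch.nn as nn

def darknet_stage(in_ch, out_ch):
    """3*3 convs with a 1*1 bottleneck in between, then 2*2 maxpool:
    resolution is halved while the stage doubles the channels."""
    return nn.Sequential(
        conv_bn(in_ch, out_ch, 3),        # 3*3: expand channels
        conv_bn(out_ch, out_ch // 2, 1),  # 1*1: compress channels cheaply
        conv_bn(out_ch // 2, out_ch, 3),  # 3*3: expand again
        nn.MaxPool2d(2, 2),               # halve H and W
    )
```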

2、Training for Classification

Training for classification is pre-training on ImageNet, in two steps. 1. Train Darknet-19 from scratch on ImageNet for 160 epochs with 224*224 input and an initial learning rate of 0.1; standard data augmentation such as random crops, rotations, and hue, saturation, and exposure shifts is used during training. 2. Fine-tune the network with 448*448 input; apart from the number of epochs and the learning rate, the other hyperparameters are unchanged. Here the learning rate is 0.001 and training runs for 10 epochs. The results show 76.5% top-1 accuracy and 93.3% top-5 accuracy after fine-tuning, versus 72.9% top-1 and 91.2% top-5 under the original training scheme. These two steps therefore improve the backbone's classification accuracy through both the network structure and the training method.

 

3、Training for Detection 

After step 2 above, the network is ported to detection and fine-tuned on the detection data. First the last convolutional layer is removed, then three 3*3 convolutional layers with 1024 filters each are added, each followed by a 1*1 convolutional layer whose filter count depends on the number of classes to detect. For VOC data, each grid cell predicts 5 boxes and each box has 5 coordinate values plus 20 class probabilities, so the final 1*1 convolution has 5*(5+20) = 125 filters. This differs from YOLOv1, which had 30 outputs per grid cell (recall the 7*7*30 tensor): in YOLOv1 the class probabilities were predicted per grid cell, so the two boxes of a cell shared the same class probabilities, whereas in YOLOv2 the class probabilities belong to the box. Each box carries its own class probabilities, so each box corresponds to 25 predicted values (5 coordinates plus 20 class values). In addition, the author connects the last 3*3*512 convolutional layer to the second-to-last convolutional layer (the passthrough described earlier). Finally, the pre-trained model is fine-tuned on the detection dataset for 160 epochs with a learning rate of 0.001, divided by 10 at the 60th and 90th epochs, and a weight decay of 0.0005.
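The filter count generalizes directly; a tiny sketch:

```python
def detection_filters(num_anchors=5, num_classes=20):
    """Each anchor predicts 5 box values (tx, ty, tw, th, to) plus one
    score per class, so the 1*1 head needs this many output filters."""
    return num_anchors * (5 + num_classes)

print(detection_filters())                 # 125 for VOC (20 classes)
print(detection_filters(num_classes=80))   # 425 for COCO (80 classes)
```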

 
