YOLOv2 Network Model

Table of Contents

Materials

Network Model Principles

Network Framework

Improvements over YOLOv1

Batch Norm

High Resolution Classifier

Convolutional With Anchor Boxes

Dimension Clusters

New Network: Darknet-19

Direct location prediction

PassThrough

Multi-Scale Training

Loss

YOLOv2 Training

YOLO9000

Discussion

References


Materials

Paper: https://arxiv.org/abs/1612.08242

PyTorch code: GitHub - longcw/yolo2-pytorch (YOLOv2 in PyTorch)

Network Model Principles

Network Framework

The figure above shows the classification-only model (Darknet-19). The model uses 1x1 convolutions for channel dimensionality reduction, 3x3 convolutions for feature extraction, and max pooling for downsampling, which halves the height and width of the feature map. Darknet-19 finally uses global average pooling to make predictions.

The figure above shows the network structure of YOLOv2. The first 18 convolutional layers are the same as the first 18 layers of the classification model. The feature map produced by the 13th convolutional layer is 26x26x512, and the passthrough structure is connected to this layer. The passthrough layer is similar to the shortcut of the ResNet network: it takes an earlier, higher-resolution feature map as input and connects it to a later, lower-resolution feature map. The spatial resolution of the earlier feature map is twice that of the later one. The passthrough layer extracts each 2x2 local area of the earlier feature map and converts it into the channel dimension. For the 26x26x512 feature map, a 1x1 convolution first reduces the dimension to 26x26x64 (this is not mentioned in the paper, but the actual code does it this way). After the passthrough processing, it becomes a new 13x13x256 feature map (the spatial size of the feature map is reduced by 4 times and the channels are increased by 4 times), so that it can be concatenated with the later 13x13x1024 feature map to form a 13x13x1280 feature map. Finally, the network uses a 1x1 convolution to reduce the 13x13x1280 features to 13x13x125. 13x13x(5*(5+20)) corresponds to the 13x13 grid cells: each grid cell holds 5 anchor predictions, and each anchor prediction contains the center coordinates x, y, the width and height w, h of the box, the confidence c, and the probabilities of the 20 categories.
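As a rough sketch of the tail of this network (assumed layer names and shapes, with the intermediate 3x3 convolutions omitted, and PixelUnshuffle standing in for the darknet reorg layer, whose channel ordering may differ), the passthrough concatenation and the final 1x1 convolution could look like this in PyTorch:

    import torch
    import torch.nn as nn

    # Hypothetical tensors standing in for the two feature maps described above.
    fine = torch.randn(1, 512, 26, 26)     # higher-resolution feature map (26x26x512)
    coarse = torch.randn(1, 1024, 13, 13)  # deeper feature map (13x13x1024)

    reduce_ch = nn.Conv2d(512, 64, kernel_size=1)               # 26x26x512 -> 26x26x64
    space_to_depth = nn.PixelUnshuffle(2)                       # 26x26x64  -> 13x13x256
    head = nn.Conv2d(1024 + 256, 5 * (5 + 20), kernel_size=1)   # 13x13x1280 -> 13x13x125

    x = space_to_depth(reduce_ch(fine))   # passthrough branch
    x = torch.cat([x, coarse], dim=1)     # 13x13x1280
    out = head(x)
    print(out.shape)                      # torch.Size([1, 125, 13, 13])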

Improvements over YOLOv1 

BN: YOLOv2 adds BN layers compared to YOLOv1;

Hi-res classifier: the pre-training model is finetuned at a larger resolution, going from 224x224 to 448x448;

Convolutional + anchor boxes: the fully connected layers are removed, and convolutions with anchor boxes are used to predict the bounding boxes;

New network: a new feature extractor, Darknet-19, is used;

Dimension priors: the k-means clustering method is used to cluster the bounding boxes in the training set;

Location prediction: the sigmoid function is applied to the predicted offsets so that they fall in the range (0, 1) and the box center is constrained to the current cell;

Passthrough: similar to the shortcut connections in ResNet, relatively low-level features are connected to high-level features, making the features richer;

Multi-scale: since YOLOv2 has no fully connected layers, multi-scale training can be performed to increase the robustness of the model;

Hi-res detector: high-resolution (544x544) images are used as input.

Batch Norm

Introducing the BN layer improves mAP from 63.4 to 65.8. The figure above shows the pseudo code of BatchNorm. The steps of the algorithm are:

1. Find the mean value of the batch data;

2. Find the variance of the batch data;

3. Normalize the input data;

4. Introduce scale and shift variables (\gamma and \beta) to get the final result.

The main goal of a deep neural network is to learn the distribution of the training data and generalize well to the test set. However, if each input batch has a different distribution, this obviously makes training harder. Moreover, the output of each layer has passed through intra-layer operations, so its distribution differs from the distribution of that layer's input, and the difference grows with network depth. This phenomenon is called Internal Covariate Shift. To reduce Internal Covariate Shift, the output of each layer can be normalized, say to zero mean and unit variance following a normal distribution. But then a problem arises: if every layer's output follows a standard normal distribution, the network can no longer express the features it has learned from the input, because the feature distribution learned with great effort is washed out by the normalization; directly normalizing each layer is therefore clearly unreasonable. BatchNorm introduces two parameters, a scaling variable \gamma and a shifting variable \beta. Consider the extreme case first: if the scaling variable equals the standard deviation and the shifting variable equals the mean, then the output y_{i} recovers the original input x_{i}. In this way BatchNorm guarantees that the learned features can be preserved after each normalization, while the normalization itself speeds up training.
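A minimal NumPy sketch of the four steps above (training-time batch statistics only; eps, gamma and beta are illustrative names, and the running averages used at inference are omitted):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x has shape (batch, features); gamma and beta have shape (features,)
        mu = x.mean(axis=0)                     # 1. mean of the batch
        var = x.var(axis=0)                     # 2. variance of the batch
        x_hat = (x - mu) / np.sqrt(var + eps)   # 3. normalize
        return gamma * x_hat + beta             # 4. scale and shift

    # If gamma is set to the batch standard deviation and beta to the batch mean,
    # the output (approximately) recovers the original input x, as discussed above.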

Advantages of adding BatchNorm:

1. Without BN, the learning rate and weight initialization need to be tuned carefully; with BN, a larger learning rate can be used safely and the parameters need much less careful tuning, which greatly speeds up learning;

2. BatchNorm is itself a regularization method and can replace other regularizers such as dropout;

3. BatchNorm reduces the absolute differences between data points and pays more attention to relative differences, giving it a de-correlation property, so it works better on classification tasks.

 High Resolution Classifier

ImageNet classification models basically use images of size 224x224 as input; this relatively low resolution is not favorable for the detection model. Therefore YOLOv1 increases the resolution to 448x448 after pre-training the classifier at 224x224, and finetunes at this high resolution on the detection dataset. But with such a direct switch of resolution, the detection model may struggle to adapt quickly to high-resolution input. So YOLOv2 adds an intermediate step: the classification network is finetuned for 10 epochs with 448x448 input on the ImageNet dataset, which lets the model adapt to high-resolution input before it is finetuned on the detection dataset. After using the high-resolution classifier, the mAP of YOLOv2 improves from 65.8 to 69.5.

Convolutional With Anchor Boxes

The original YOLO used fully connected layers to directly predict the coordinates of the bounding boxes, while YOLOv2 borrows the idea of Faster R-CNN and introduces anchors. First, the fully connected layers and the last pooling layer of the original network are removed so that the last convolutional layer can produce higher-resolution features. In the detection model, YOLOv2 does not use a 448x448 image as input but a 416x416 image. Because the total downsampling stride of the YOLOv2 model is 32, a 416x416 image yields a final feature map of size 13x13, which is odd, so the feature map has exactly one central position. The center points of large objects often fall at the center of the image, and it is easier to predict the bounding boxes of such objects with a single center cell of the feature map (rather than with 4 center cells), so the YOLOv2 design ensures that the final feature map has an odd number of positions.


The original YOLOv1 algorithm divides the input image into 7x7 grids, and each grid predicts two bounding boxes, so there are only 98 boxes in total. By introducing anchor boxes, YOLOv2 predicts more than a thousand boxes (taking an output feature map of 13x13 as an example: if each grid cell had 9 anchor boxes, the total would be 13x13x9 = 1521; as described below, each grid cell finally uses 5 anchor boxes). For comparison, Faster R-CNN produces about 6000 boxes for a 1000x600 input, and SSD300 produces 8732 boxes. Obviously, increasing the number of boxes is meant to improve the localization accuracy of objects. The author's experiments show that although adding anchors makes the mAP value drop slightly (69.5 to 69.2), the recall improves (81% to 88%) while precision is reduced.

Dimension Clusters

In Faster R-CNN and SSD, the dimensions (width and height) of the prior boxes are set manually, with a certain degree of subjectivity. If the dimensions of the prior boxes are chosen well, the model is easier to learn and makes better predictions. Therefore YOLOv2 uses the k-means clustering method to cluster the bounding boxes in the training set. Because the main purpose of the prior boxes is to obtain a high IOU between the predicted boxes and the ground truth, the clustering should make a smaller distance correspond to a larger IOU, so 1 - IOU is used, and the IOU between a box and the cluster-centroid box serves as the distance metric during clustering:

 d(box, centroid) = 1 - IOU(box, centroid)

The figure below shows the clustering results on the VOC and COCO datasets. As the number of cluster centroids increases, the average IOU (the mean IOU between each bounding box and its cluster centroid) increases, but after weighing model complexity against recall the author finally selects 5 cluster centroids as prior boxes; their sizes relative to the image are shown in the figure on the right.

Bounding box clustering steps:
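The clustering steps can be sketched roughly as follows (an illustrative sketch, not the authors' script): each box is represented only by its (w, h), the distance is d = 1 - IOU with both boxes aligned at the origin, and each centroid is updated as the mean of its cluster.

    import numpy as np

    def iou_wh(boxes, centroids):
        # IOU between (N, 2) boxes and (K, 2) centroids, given as (w, h) and aligned at the origin
        inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
                np.minimum(boxes[:, None, 1], centroids[None, :, 1])
        union = boxes[:, 0:1] * boxes[:, 1:2] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        return inter / union

    def kmeans_anchors(boxes, k=5, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = boxes[rng.choice(len(boxes), k, replace=False)]
        for _ in range(iters):
            assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)   # d = 1 - IOU
            new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
                            for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return centroids   # k prior (w, h) pairs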

New Network: Darknet-19 

The model structure was introduced in the Network Framework section. On the ImageNet classification dataset, Darknet-19 achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy, while having relatively few parameters. After switching to Darknet-19, the mAP value of YOLOv2 is not significantly improved, but the amount of computation is reduced by about 33%.

Direct location prediction

YOLOv2 draws on the RPN network and uses anchor boxes to predict the offsets of the bounding box relative to the prior box. The actual center position (x, y) of the bounding box is computed from the predicted coordinate offsets ( d_{x}^{'}, d_{y}^{'}), the size of the prior box ( w_{a}, h_{a}) and its center coordinates ( x_{a}, y_{a}):

x = x_{a} + (d_{x}^{'} \times w_{a})

y = y_{a} + (d_{y}^{'} \times h_{a})

But the above formulas are unconstrained, so the predicted bounding box can shift in any direction; the predicted box of any position can fall anywhere in the image, which makes the model unstable and requires a long training time to predict the correct offsets. Therefore YOLOv2 abandons this prediction method and instead follows YOLOv1: it predicts the offset of the box center relative to the upper-left corner of the corresponding cell. To constrain the center within the current cell, the sigmoid function is applied to the offsets so that they fall in the range (0, 1) (the scale of each cell is taken as 1). In summary, from the four offsets t_{x}, t_{y}, t_{w}, t_{h} predicted for a bounding box, its actual position and size can be computed with the following formulas:

b_{x} = \sigma (t_{x}) + c_{x}

b_{y} = \sigma (t_{y}) + c_{y}

b_{w} = p_{w} e^{t_{w}}

b_{h} = p_{h} e^{t_{h}}

Among them, ( c_{x}, c_{y}) are the coordinates of the upper-left corner of the cell. As shown in the figure below, the scale of each cell is taken as 1 during the calculation, so the coordinates of the green point in the figure are (1, 1). Because of the sigmoid function, the center of the bounding box is constrained to lie inside the current cell, which prevents excessive offsets. The terms e^{t_{w}} and e^{t_{h}} are not strongly constrained, because the size of an object is not limited, so the size of the box is not constrained either. p_{w} and p_{h} are the width and height of the prior box; as mentioned earlier, their values are relative to the size of the feature map, and the width and height of each cell in the feature map are 1. With the feature map size denoted (W, H) (in the paper, (13, 13)), the position and size of the bounding box relative to the whole image can then be calculated (all 4 values lie between 0 and 1):
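The formulas implied here (reconstructed from the conventions above, with (W, H) the feature map size) would be:

b_{x}' = (\sigma (t_{x}) + c_{x}) / W

b_{y}' = (\sigma (t_{y}) + c_{y}) / H

b_{w}' = p_{w} e^{t_{w}} / W

b_{h}' = p_{h} e^{t_{h}} / H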

Multiplying these 4 values by the width and height (in pixels) of the image gives the final position and size of the bounding box. This is the entire decoding process of the YOLOv2 bounding boxes. Constraining the position predictions makes the model easier to train stably. Combining the prior boxes obtained from cluster analysis with this prediction method raises the mAP of YOLOv2 from 69.6 to 74.4.
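A minimal PyTorch sketch of this decoding step (tensor names are illustrative; priors is assumed to be a float tensor holding the clustered (p_w, p_h) pairs expressed in feature-map units):

    import torch

    def decode(t, priors, img_w, img_h, grid=13):
        # t: (grid, grid, num_anchors, 4) raw offsets (tx, ty, tw, th); priors: (num_anchors, 2)
        cy, cx = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
        bx = (torch.sigmoid(t[..., 0]) + cx[..., None]) / grid   # relative to the image, in (0, 1)
        by = (torch.sigmoid(t[..., 1]) + cy[..., None]) / grid
        bw = priors[:, 0] * torch.exp(t[..., 2]) / grid
        bh = priors[:, 1] * torch.exp(t[..., 3]) / grid
        # multiply by the image size to get pixel coordinates
        return bx * img_w, by * img_h, bw * img_w, bh * img_h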

PassThrough  

Part of the passthrough content was already described in the Network Framework section. The figure below shows the rearrangement operation performed by the passthrough layer.

Here is how the 26x26 feature map is transformed into 13x13. As shown in the figure below, suppose the original feature map is a 4x4x3 three-dimensional matrix. Taking every other element, the values at the same position within each 2x2 block are gathered into a 2x2 matrix; in the end there are 4 such matrices per channel, i.e. the length and width of the feature map are halved and the depth is increased by 4 times. In this way it matches the size of the deeper feature map and can be fused with it to enrich the features.
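A small sketch of this rearrangement as a stride-2 space-to-depth operation (whether the channel ordering exactly matches the darknet reorg layer is an implementation detail):

    import torch

    def passthrough(x, stride=2):
        # x: (N, C, H, W) -> (N, C * stride * stride, H // stride, W // stride)
        n, c, h, w = x.shape
        x = x.view(n, c, h // stride, stride, w // stride, stride)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
        return x.view(n, c * stride * stride, h // stride, w // stride)

    x = torch.arange(4 * 4 * 3, dtype=torch.float32).view(1, 3, 4, 4)
    print(passthrough(x).shape)   # torch.Size([1, 12, 2, 2])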

Multi-Scale Training 

Since the YOLOv2 model contains only convolutional and pooling layers, its input is not limited to 416x416 images. To enhance the robustness of the model, YOLOv2 adopts a multi-scale input training strategy: during training, the input image size is changed after a fixed interval of iterations. Since the total downsampling stride of YOLOv2 is 32, the input sizes are chosen from a series of multiples of 32: {320, 352, ..., 608}. The smallest input is 320x320, with a corresponding feature map of 10x10 (not an odd number, which is a bit awkward), and the largest input is 608x608, with a corresponding feature map of 19x19. During training, a new input size is randomly selected every 10 iterations, and only the processing of the last detection layer needs to be adjusted to continue training.
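A hedged sketch of this schedule (the stand-in batches and the bilinear resize are illustrative; in practice the images come from the detection dataloader):

    import random
    import torch
    import torch.nn.functional as F

    scales = list(range(320, 608 + 1, 32))   # {320, 352, ..., 608}
    size = 416

    # Stand-in batches; in practice these come from the detection dataloader.
    batches = [torch.randn(8, 3, 416, 416) for _ in range(30)]

    for batch_idx, images in enumerate(batches):
        if batch_idx % 10 == 0:               # pick a new input size every 10 iterations
            size = random.choice(scales)
        images = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
        # forward pass / loss as usual; the output grid becomes (size // 32) x (size // 32)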

With the multi-scale training strategy, YOLOv2 can adapt to images of different sizes and still give good predictions. At test time, YOLOv2 can take images of different sizes as input; the results on the VOC 2007 dataset are shown in the figure below. With a smaller resolution, the mAP of YOLOv2 is slightly lower but the speed is faster; with high-resolution input, the mAP is higher but the speed drops slightly. For 544x544 input, the mAP reaches 78.6%. Note that only the size of the input image differs at test time; the same model is used (trained with multi-scale training).

Loss 

In the above formula, b_{ijk}^{o} is the predicted confidence, b_{ijk}^{r} is the predicted position information with r \in (x, y, w, h), and b_{ijk}^{c} is the predicted class of the box. prior_{k}^{r} is the position (shape) of the k-th anchor; since the five anchors are identical in every cell, the subscript only needs to indicate which of the five anchors it refers to. truth^{r} is the position information of the ground-truth box, truth^{c} is its class information, and IOU_{truth}^{k} is the IOU between the anchor and the ground-truth box.

The loss consists of three parts, in which each yellow part is either 0 or 1.

In the first part, the yellow indicator is 1 if the IOU between the predicted box and the ground-truth boxes is less than the threshold 0.6, and 0 otherwise; the term \lambda_{noobj} * (-b_{ijk}^{o})^{2} means that such a predicted box is not responsible for predicting any object, so the smaller its confidence, the better;

In the second part, the yellow indicator is 1 only during the first 12800 iterations; the term \lambda_{prior} * \sum_{r\in (x,y,w,h)}(prior_{k}^{r} - b_{ijk}^{r})^{2} is the error between the anchor and the predicted box position, which lets the model learn to predict the anchor shapes faster and makes t^{x}, t^{y}, t^{w}, t^{h} more stable;

In the third part, the yellow indicator marks whether the predicted box is responsible for an object: the predicted box whose anchor has the largest IOU with the ground-truth box is responsible (boxes with IOU > 0.6 that do not have the largest IOU are ignored). \lambda_{coord} * \sum_{r\in (x,y,w,h)}(truth^{r} - b_{ijk}^{r})^{2} is the localization error between the predicted box and the ground-truth box, \lambda_{obj} * (IOU_{truth}^{k} - b_{ijk}^{o})^{2} is the error between the predicted confidence and the IOU of the ground-truth box with the anchor, and \lambda_{class} * \sum_{c=1}^{C}(truth^{c} - b_{ijk}^{c})^{2} is the error between the predicted classification and the ground-truth class information. Each \lambda is the weight of the corresponding term, a predefined hyperparameter.
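Putting the three parts together, the loss described above can be written as follows (a reconstruction from the terms listed here, not a formula copied from the paper; 1\{...\} stands for the yellow 0/1 indicators):

loss_{t} = \sum_{i,j,k} [ 1\{MaxIOU < 0.6\} \cdot \lambda_{noobj} (-b_{ijk}^{o})^{2}

+ 1\{t < 12800\} \cdot \lambda_{prior} \sum_{r\in (x,y,w,h)} (prior_{k}^{r} - b_{ijk}^{r})^{2}

+ 1\{k \ responsible\} \cdot ( \lambda_{coord} \sum_{r\in (x,y,w,h)} (truth^{r} - b_{ijk}^{r})^{2} + \lambda_{obj} (IOU_{truth}^{k} - b_{ijk}^{o})^{2} + \lambda_{class} \sum_{c=1}^{C} (truth^{c} - b_{ijk}^{c})^{2} ) ]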

YOLOv2 Training 

YOLOv2 training is divided into three stages:

The first stage: pre-train Darknet-19 on the ImageNet classification dataset. The model input is 224x224, and a total of 160 epochs are trained.

The second stage: adjust the input of the network to 448x448 and continue to finetune the classification model on the ImageNet dataset for 10 epochs. At this point the top-1 accuracy of the classification model is 76.5% and the top-5 accuracy is 93.3%.

The third stage: modify the Darknet-19 classification model into a detection model and continue to finetune the network on the detection dataset. The modifications include removing the last convolutional layer, the global average pooling layer and the softmax layer, adding three 3x3x1024 convolutional layers and a passthrough layer, and finally using a 1x1 convolutional layer to output the predictions. The number of output channels is num_anchors x (5 + num_classes), which depends on the dataset used for training.

Since the number of anchors is 5, the number of output channels is 125 for the VOC dataset and 425 for the COCO dataset. Taking the VOC dataset as an example, the final prediction tensor T has shape (batch_size, 13, 13, 125), which can be reshaped to (batch_size, 13, 13, 5, 25), where T[:, :, :, :, 0:4] is the position and size of the predicted box, T[:, :, :, :, 4] is the confidence of the predicted box, and T[:, :, :, :, 5:] are the class predictions.
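A small sketch of this reshape and slicing (the tensor here is random and only illustrates the shapes):

    import torch

    batch_size, num_anchors, num_classes = 4, 5, 20
    T = torch.randn(batch_size, 13, 13, num_anchors * (5 + num_classes))   # (4, 13, 13, 125)

    T = T.view(batch_size, 13, 13, num_anchors, 5 + num_classes)           # (4, 13, 13, 5, 25)
    box_offsets = T[..., 0:4]    # tx, ty, tw, th
    confidence = T[..., 4]       # objectness score
    class_scores = T[..., 5:]    # 20 class scores for VOC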

YOLO9000 

YOLO9000 is a model built on YOLOv2 that can detect more than 9000 categories; its main contribution is a joint training strategy for classification and detection. As is well known, labeling detection datasets is much more laborious than labeling classification datasets, so the ImageNet classification dataset is several orders of magnitude larger than detection datasets such as VOC. In YOLO, the prediction of the bounding box does not actually depend on the object's label, so YOLO can be trained jointly on classification and detection datasets. Detection data are used to learn to predict the bounding box, the confidence and the classification of the object, while classification data are used only to learn the classification; this greatly expands the number of object categories the model can detect.

The author chooses to train jointly on the COCO and ImageNet datasets, but the first problem encountered is that the categories of the two are not mutually exclusive; for example, "Norfolk terrier" clearly belongs to "dog". The author therefore proposes a hierarchical classification method: the main idea is to build a tree structure, WordTree, according to the relationship between categories (based on WordNet). The WordTree built by combining COCO and ImageNet is shown in the following figure:

The root node of WordTree is "physical object", and the child nodes of each node belong to the same subclass, so they can be processed with a softmax. To obtain the predicted probability of a particular category, find its node, traverse the path from the root, and compute the product of the conditional probabilities of each node on the path.
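For example, following the paper, the probability of "Norfolk terrier" is obtained by multiplying the conditional probabilities along its path:

Pr(Norfolk terrier) = Pr(Norfolk terrier | terrier) \times Pr(terrier | hunting dog) \times ... \times Pr(mammal | animal) \times Pr(animal | physical object)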

During training, if a sample comes from the detection data, the error is computed with the full YOLOv2 loss; for a classification sample, only the classification error is computed. At prediction time, the confidence given by YOLOv2 is Pr(physical object), and the bounding box position together with a tree of probabilities is produced. The path with the highest probability is followed down the tree, stopping when a threshold is reached, and the current node is taken as the predicted category. With this joint training strategy, YOLO9000 can detect more than 9000 categories of objects, with an overall mAP of 19.7%.

Discussion 

Everyone is welcome to join the group discussion.

References 

Object detection | YOLOv2 principle and implementation (with YOLOv3) - bzdww

Why is Batch Normalization effective in deep learning? - Zhihu

YOLOv2/v3: computing anchor boxes with k-means clustering - fu18946764506's CSDN blog

[AI paper close reading] YOLO V2 object detection algorithm - Bilibili
