A Detailed Interpretation of the YOLO v3 Model

Table of contents

Foreword

1. Network Architecture

1.1 Backbone network

2. Bounding Box Prediction

2.1 Multi-scale prediction

2.2 Anchor grid offset prediction

2.3 Yolo Head

3. Positive and negative sample matching rules 

4. Loss function

5. Training strategy


Foreword

YOLO v3 ("YOLOv3: An Incremental Improvement") is a single-stage object detection paper published by Joseph Redmon in 2018, and it is also the author's last paper in the YOLO series. YOLO v3 is a relatively mature model in the YOLO series and is widely used in industry, so it is well worth studying.

Compared with the earlier v1 and v2, v3 did not change much apart from the network structure; the main work was integrating several good detection ideas into YOLO. While keeping its speed advantage, it further improves detection accuracy, especially for small objects. Specifically, YOLOv3 mainly improves three parts: the network structure, the multi-scale features used for prediction, and the subsequent prediction computations.

Improvements:

  • Multi-scale prediction (introducing FPN)
  • Better backbone (darknet-53, similar to ResNet introducing residual structure)
  • The classifier no longer uses softmax (which darknet-19 used); class predictions use independent logistic classifiers trained with binary cross-entropy loss

1. Network Architecture

YOLOv3 continues to draw on ideas from the best detection frameworks of the time, such as residual networks and feature fusion, and proposes the network structure shown in the figure below, whose backbone is called DarkNet-53. The author's experiments on ImageNet show that darknet-53 is comparable in classification accuracy to ResNet-152 and ResNet-101, while being much faster to compute and having fewer layers than both.

The structure of the YOLO v3 model is shown in the figure below:

● DBL: the combination of convolution, BN and Leaky ReLU layers. In YOLOv3, convolution always appears in this combination, which constitutes the basic unit of DarkNet. The number after DBL indicates how many DBL modules are stacked.
● res: the residual module; the number after res indicates how many residual modules are connected in series.
● Upsampling: upsampling is done by copying and expanding elements (nearest-neighbor interpolation), which enlarges the feature map and has no learnable parameters.
● Concat: after upsampling, the deep and shallow feature maps are concatenated, i.e. spliced along the channel dimension. This is similar to FPN, but FPN uses element-wise addition instead.
● Residual idea: DarkNet-53 borrows the residual idea of ResNet and uses a large number of residual connections in the backbone, so the network can be made very deep while the vanishing-gradient problem during training is alleviated, making the model easier to converge.
● Multi-layer feature maps: through upsampling and concat operations, deep and shallow features are fused, and feature maps of three sizes are finally output for detection. Multi-layer feature maps benefit multi-scale object and small object detection.
● No pooling layer: earlier YOLO networks used 5 max-pooling layers to reduce the feature map size, for an overall downsampling rate of 32. DarkNet-53 uses no pooling; instead, stride-2 convolutions achieve the size reduction. Downsampling still happens 5 times, and the overall downsampling rate is also 32.

 

It should be noted how the concat operation differs from the sum operation: the sum operation comes from the ResNet idea, where the input feature map is added element-wise to the corresponding positions of the output feature map, i.e. y = f(x) + x. The concat operation is derived from the design of DenseNet, where feature maps are spliced directly along the channel dimension; for example, concatenating an 8*8*16 feature map with another 8*8*16 feature map produces an 8*8*32 feature map. Upsampling layer (upsample): it generates a larger image from a small feature map by interpolation; for example, nearest-neighbor interpolation turns an 8*8 image into a 16*16 image. The upsampling layer does not change the number of channels of the feature map.
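As a minimal sketch of these operations (using plain PyTorch tensors, not the actual YOLOv3 code; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

a = torch.randn(1, 16, 8, 8)   # feature map: batch=1, 16 channels, 8x8
b = torch.randn(1, 16, 8, 8)

summed = a + b                      # ResNet-style element-wise sum: shape stays (1, 16, 8, 8)
concat = torch.cat([a, b], dim=1)   # DenseNet/YOLOv3-style concat: (1, 32, 8, 8)

up = F.interpolate(a, scale_factor=2, mode="nearest")  # nearest-neighbor upsample: (1, 16, 16, 16)

print(summed.shape, concat.shape, up.shape)
```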

1.1 Backbone network

YOLOv3 introduces the residual module on top of the Darknet-19 used in YOLOv2 and further deepens the network. The improved network has 53 convolutional layers and is named Darknet-53; its structure is as follows:

The author uses Darknet-53 as the feature extraction network and removes the Avgpool, fully connected and softmax layers from the classification network.

In the entire v3 structure there is no pooling layer or fully connected layer. During forward propagation, the size of the tensor is changed by adjusting the stride of the convolution, e.g. stride=(2, 2), which halves the side length of the feature map (i.e. the area becomes 1/4 of the original). In yolo_v3, after 5 such reductions the feature map is reduced to 1/32 of the original input size: with a 416x416 input, the deepest output is 13x13 (416/32=13).
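Below is a minimal sketch of the DBL unit and the stride-2 downsampling it relies on (PyTorch; the module name DBL and the channel counts are illustrative, not the official Darknet implementation):

```python
import torch
import torch.nn as nn

class DBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU, the basic unit of Darknet-53."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 416, 416)
down = DBL(3, 32, kernel_size=3, stride=2)   # stride-2 conv instead of pooling
print(down(x).shape)  # torch.Size([1, 32, 208, 208]) -- side length halved
# five such stride-2 stages in total: 416 -> 208 -> 104 -> 52 -> 26 -> 13
```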

2. Bounding Box Prediction

2.1 Multi-scale prediction

From the model structure, it can be found that YOLOv3 outputs 3 feature maps of different sizes, corresponding to the deep, middle and shallow features from top to bottom. The feature size of the deep layer is small and the receptive field is large, which is conducive to the detection of large-scale targets. The medium-scale feature map is used to detect medium-sized targets, and the shallow feature map is used to detect small-scale targets. This is similar to the FPN structure.

A set of prior boxes with different sizes and aspect ratios is preset in each grid cell to cover different positions and multiple scales of the whole image. Each scale predicts 3 boxes, and the anchors are still designed by k-means clustering: 9 cluster centers are obtained, divided into 3 groups by size, and the feature layers of the three scales use these groups to predict boxes.
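For reference, the 9 k-means anchors from the original YOLOv3 COCO configuration are usually listed as follows, grouped by prediction scale (widths and heights in pixels on the 416x416 input):

```python
# The 9 k-means anchors from the YOLOv3 COCO config, grouped by prediction scale.
anchors = {
    "52x52 (shallow, small objects)": [(10, 13), (16, 30), (33, 23)],
    "26x26 (middle, medium objects)": [(30, 61), (62, 45), (59, 119)],
    "13x13 (deep, large objects)":    [(116, 90), (156, 198), (373, 326)],
}

for scale, boxes in anchors.items():
    print(scale, boxes)
```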

2.2 Anchor grid offset prediction

From the explanation in the previous part, each Yolo Head output is a three-dimensional tensor: the w, h dimensions correspond to the grid into which the original image is divided, and the channel dimension holds the values predicted for each anchor. So how do we compute the concrete position predicted from an anchor?
The idea of YOLO is to divide the image into W*W grid cells, and each cell is responsible for the objects whose center falls inside it. Each cell can be regarded as a region of interest, so the values stored along the channel dimension must be turned into concrete bbox coordinates. For every anchor, the channel dimension holds

channel = t_x + t_y + t_w + t_h + obj + num_classes

i.e. 4 box offsets, 1 objectness score and num_classes class scores. The predicted values are not the coordinates of the anchor itself: t_x, t_y are offsets relative to the grid cell, and t_w, t_h are scaling factors applied to the prior box. The prior box sizes come from k-means clustering of the box coordinates saved in the label (xml) files; each cell has three prior boxes of different sizes, there are three prediction layers, and so the total number of prior boxes is three times the total number of grid cells.

The above figure shows the bbox regression process. c_x, c_y are the coordinates of the top-left corner of the grid cell, and the anchor is offset towards the lower right. To prevent the offset from moving outside the grid cell and hurting localization accuracy, yolov3 applies the sigmoid function to the predicted t_x, t_y, limiting them to [0,1] (which also speeds up network convergence), giving the final predicted center b_x, b_y. The width and height of the bbox are determined by the predicted scaling factors t_w and t_h: p_w and p_h are the width and height of the anchor template mapped onto the feature map, and the exponential function is applied to t_w and t_h before multiplying by p_w and p_h to obtain the final predicted w and h:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

PS: the x, y coordinates here are values mapped onto the feature map, not coordinates on the original image; they must be converted back to real image coordinates before the bbox is drawn.
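A minimal sketch of this decoding step (NumPy, one anchor in one cell; the function and values are illustrative, not the original Darknet code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Decode raw predictions (tx, ty, tw, th) for one anchor in the grid cell
    whose top-left corner is (cx, cy), with anchor template size (pw, ph)
    given in feature-map units."""
    bx = sigmoid(tx) + cx          # centre x on the feature map
    by = sigmoid(ty) + cy          # centre y on the feature map
    bw = pw * np.exp(tw)           # width on the feature map
    bh = ph * np.exp(th)           # height on the feature map
    # map back to original-image coordinates
    return bx * stride, by * stride, bw * stride, bh * stride

# example: cell (6, 6) on the 13x13 map of a 416x416 input (stride 32)
print(decode_box(0.2, -0.1, 0.3, 0.1, cx=6, cy=6, pw=116/32, ph=90/32, stride=32))
```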

2.3 Yolo Head

The method used by YOLOv3 is different from that of SSD. Although SSD also uses information from multiple feature maps, its features are predicted separately from shallow to deep, without fusing deep and shallow layers; the structure of YOLOv3 is more like a combination of SSD and FPN. YOLOv3 uses the COCO dataset by default, with 80 object categories, so each anchor needs 80 class prediction values, 4 position predictions and 1 confidence prediction. Each cell has three anchors, so a total of 3×(80+5)=255 channels is required, which is the number of prediction channels of each feature map.

When the input is 416*416, the model outputs 10647 prediction boxes (three feature map sizes, three prediction boxes per cell, for a total of (13×13+26×26+52×52)×3=10647). Each output box is labelled according to the ground truth in the training set (positive example: the box with the largest IOU with the ground truth; negative example: IOU < the 0.5 threshold; ignored: the box overlaps an object but its IOU is not the largest, and such boxes would be removed in NMS anyway). The loss function is then used to optimize and update the network parameters.

As shown in the figure above: during training, for each input image yolov3 predicts three 3D tensors of different sizes, corresponding to three different scales; the three scales are designed to detect objects of different sizes. Take the 13*13 tensor as an example. For this scale, the original input image is divided into 13×13 grid cells, and each grid cell corresponds to a 1x1x255 voxel inside the 3D tensor, where 255 comes from 3*(4+1+80). In the formula N×N×[3×(4+1+80)], N×N is the scale size, e.g. the 13×13 mentioned above; 3 means each grid cell predicts 3 boxes; 4 is the number of coordinate values (tx,ty,th,tw); 1 is the confidence; and 80 is the number of COCO categories.
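A quick arithmetic check of these numbers (a plain Python sketch, assuming the standard 416x416 input and 80 COCO classes):

```python
num_classes = 80
anchors_per_cell = 3

channels = anchors_per_cell * (4 + 1 + num_classes)
print(channels)  # 255 prediction channels per feature map

grid_sizes = [13, 26, 52]  # 416 / 32, 416 / 16, 416 / 8
total_boxes = sum(n * n * anchors_per_cell for n in grid_sizes)
print(total_boxes)  # 10647 prediction boxes in total
```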

  • If the center of the bounding box of a ground truth in the training set falls in a certain grid cell of the input image, that grid cell is responsible for predicting the bounding box of this object, so the confidence corresponding to that grid cell is 1 and the confidence of the other grid cells is 0. Each grid cell is given 3 prior boxes of different sizes, and during training the grid cell learns which prior box to use: the object is assigned to the prior box with the highest IOU with the ground truth.
  • How are the three preset prior box sizes obtained? Before training, all bboxes in the COCO dataset are clustered into 9 groups with k-means, and every 3 groups correspond to one scale, giving 3 scales in total. This prior information about box sizes helps the network predict the offset and coordinates of each box accurately; intuitively, an appropriate box size makes learning easier. A rough sketch of this clustering step is shown below.
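A rough sketch of this clustering step, using the common 1 - IOU distance on box widths and heights (simplified and with made-up example data, not the authors' exact script):

```python
import numpy as np

def iou_wh(boxes, centers):
    """IOU between (w, h) pairs, assuming the boxes share the same top-left corner."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centers), axis=1)  # distance = 1 - IOU
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sorted by area

# usage: boxes would be an (N, 2) array of ground-truth (w, h) pairs scaled to the input size
boxes = np.abs(np.random.default_rng(1).normal(100, 60, size=(500, 2))) + 5
print(kmeans_anchors(boxes, k=9))
```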

In the prediction process

The image is fed into the trained prediction network, which first outputs the prediction box information (obj,tx,ty,th,tw,cls). After the class-specific confidence score of each prediction box is computed (conf_score = obj*cls), a threshold is set to filter out prediction boxes with low scores, and NMS is performed on the remaining prediction boxes to obtain the final detection result.

  • Threshold processing: remove most of the background boxes that do not contain objects
  • NMS processing: remove redundant bounding boxes to prevent repeated prediction of the same object

In summary, the prediction process is:

Traverse the three scales

→ traverse the prediction boxes of each scale

→ take the class with the largest classification score as the predicted category of the box

→ multiply the confidence of the prediction box with its 80-dimensional classification scores

→ set nms_thresh and iou_thresh, and use NMS and IOU filtering to remove background boxes and duplicate boxes (see the sketch after this list)

→ traverse every remaining prediction box and visualize it
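A minimal post-processing sketch along these lines (NumPy; class-agnostic NMS for brevity, and the threshold values and function names are illustrative assumptions):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IOU of the highest-scoring box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou < iou_thresh]
    return keep

def postprocess(boxes, obj, cls_scores, score_thresh=0.5):
    """boxes: (N, 4); obj: (N,); cls_scores: (N, 80). Returns kept boxes, labels, scores."""
    conf = obj[:, None] * cls_scores            # class-specific confidence
    labels = conf.argmax(axis=1)
    scores = conf.max(axis=1)
    mask = scores > score_thresh                # drop low-score background boxes
    boxes, labels, scores = boxes[mask], labels[mask], scores[mask]
    keep = nms(boxes, scores)
    return boxes[keep], labels[keep], scores[keep]
```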

3. Positive and negative sample matching rules 

The matching rule for positive and negative samples mentioned in the yolov3 paper is: each ground-truth box is assigned one positive sample, namely the prediction box that overlaps it most among all bboxes, i.e. the box with the largest IOU with the gt_box. If only this rule is used to find positive samples, however, the number of positive samples is very small, which makes the network difficult to train. If a sample is not a positive sample, it has neither localization loss nor classification loss, only confidence loss. In the yolov3 paper the author also tried focal loss to alleviate the imbalance between positive and negative samples, but it helped little, because negative samples only take part in the confidence loss and their impact on the total loss is small. To obtain more positive samples, the top-left corners of the gt_box and the anchor templates are first aligned, the IOU between the gt_box and each anchor template is computed, and a threshold is set: any anchor template whose IOU is greater than the threshold is classified as a positive sample.
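A small sketch of this relaxed matching step, computing the width/height IOU with the top-left corners aligned (the 0.3 threshold is an assumption for illustration):

```python
import numpy as np

def wh_iou(gt_wh, anchor_wh):
    """IOU between a gt box and anchor templates when their top-left corners are aligned."""
    inter = np.minimum(gt_wh[0], anchor_wh[:, 0]) * np.minimum(gt_wh[1], anchor_wh[:, 1])
    union = gt_wh[0] * gt_wh[1] + anchor_wh[:, 0] * anchor_wh[:, 1] - inter
    return inter / union

anchors = np.array([(10, 13), (16, 30), (33, 23),
                    (30, 61), (62, 45), (59, 119),
                    (116, 90), (156, 198), (373, 326)], dtype=float)

gt_wh = np.array([70.0, 110.0])          # a ground-truth box (w, h) on the 416x416 input
iou = wh_iou(gt_wh, anchors)
positives = np.where(iou > 0.3)[0]       # assumed threshold of 0.3 for illustration
print(iou.round(2), positives)
```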

4. Loss function

The loss includes a confidence loss, a localization loss and a class prediction loss. The confidence loss considers all samples, while the localization loss and the classification loss only consider positive samples.
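A simplified sketch of how these three terms are typically combined (PyTorch, with assumed tensor shapes and illustrative weights; the ignore mask and per-term weighting used in real implementations are omitted):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def yolo_loss(pred_box, pred_obj, pred_cls, tgt_box, tgt_cls, obj_mask,
              lambda_box=1.0, lambda_obj=1.0, lambda_cls=1.0):
    """pred_box: (N, 4) box outputs; pred_obj: (N,) objectness logits;
    pred_cls: (N, 80) class logits; tgt_cls: (N, 80) one-hot float targets;
    obj_mask: (N,) bool, True for positive samples."""
    # localization loss: positive samples only
    loss_box = mse(pred_box[obj_mask], tgt_box[obj_mask])
    # confidence loss: all samples, target 1 for positives and 0 for negatives
    loss_obj = bce(pred_obj, obj_mask.float())
    # classification loss: positive samples only, binary cross-entropy per class
    loss_cls = bce(pred_cls[obj_mask], tgt_cls[obj_mask])
    return lambda_box * loss_box + lambda_obj * loss_obj + lambda_cls * loss_cls
```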

 

5. Training strategy

1. Multi-scale training is used.

2. Data augmentation is used.

3. Standardized normalization is used.



Origin blog.csdn.net/m0_53675977/article/details/130402444