[Object Detection] YOLOv3 Explained in Detail

Foreword

With V1 and V2 covered in the previous posts, this article explains YOLOv3. Apart from the network structure, v3 does not change much; it mainly integrates several good detection ideas from other work into YOLO. While keeping the speed advantage, it further improves detection accuracy, especially the ability to detect small objects. Specifically, YOLOv3's improvements fall into three parts: the network structure, the network features (multi-scale prediction), and the subsequent computation (training strategy and loss).

1. Network Architecture

YOLOv3 continues to absorb ideas from the best current detection frameworks, such as residual networks and feature fusion, and proposes the network structure shown in the figure below, called DarkNet-53. The authors' experiments on ImageNet found that DarkNet-53 is comparable in classification accuracy to ResNet-101 and ResNet-152, yet computes much faster than both and has fewer layers.

(Figure: the DarkNet-53 / YOLOv3 network structure)

 ● DBL: the three-layer combination of convolution, BN and Leaky ReLU. In YOLOv3, every convolution appears in this combination, which forms the basic unit of DarkNet. A number after DBL indicates how many DBL modules are stacked (see the sketch after this list).
● res: the residual module; the number after res indicates how many residual modules are connected in series.
● Upsampling: implemented by unpooling, i.e. duplicating elements to enlarge the feature map; it has no learnable parameters.
● Concat: after upsampling, the deep and shallow feature maps are concatenated along the channel dimension. This is similar in spirit to FPN, except that FPN fuses features by element-wise addition.
● Residual idea: DarkNet-53 borrows the residual idea of ResNet and uses residual connections extensively in the backbone, so the network can be designed very deep, the vanishing-gradient problem in training is alleviated, and the model converges more easily.
● Multi-layer feature maps: through upsampling and concat operations, deep and shallow features are fused, and feature maps of three sizes are finally output for detection. Multi-layer feature maps benefit multi-scale and small-object detection.
● No pooling layers: earlier YOLO networks used 5 max-pooling layers to shrink the feature map, giving an overall downsampling rate of 32. DarkNet-53 uses no pooling; instead, convolutions with stride 2 reduce the size. There are likewise 5 such downsampling steps, so the overall rate is still 32.
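To make these building blocks concrete, here is a minimal PyTorch sketch of the DBL and res units (the class names DBL and ResBlock are chosen for this illustration, not taken from the official implementation):

```python
import torch
import torch.nn as nn

class DBL(nn.Module):
    """Conv + BatchNorm + Leaky ReLU, the basic unit of DarkNet-53."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResBlock(nn.Module):
    """Residual unit: 1x1 bottleneck then 3x3, added back to the input (y = f(x) + x)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(DBL(ch, ch // 2, 1), DBL(ch // 2, ch, 3))

    def forward(self, x):
        return x + self.body(x)

# Stride-2 DBL replaces pooling for downsampling, as noted above.
x = torch.randn(1, 64, 128, 128)
down = DBL(64, 128, 3, stride=2)(x)   # -> (1, 128, 64, 64)
out = ResBlock(128)(down)             # same shape as its input
```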

Note the difference between the concat and sum operations. The sum operation comes from the idea of ResNet: the input feature map is added element-wise to the output feature map, i.e. y = f(x) + x. The concat operation comes from the design of the DenseNet network: feature maps are spliced directly along the channel dimension. For example, concatenating an 8*8*16 feature map with another 8*8*16 feature map produces an 8*8*32 feature map.

Upsampling layer (upsample): its role is to produce a large feature map from a small one by interpolation. For example, nearest-neighbor interpolation turns an 8*8 map into a 16*16 map. The upsampling layer does not change the number of channels of the feature map.
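A small shape demonstration of the three operations just described, using PyTorch's NCHW layout (so 8*8*16 becomes a tensor of shape (1, 16, 8, 8)):

```python
import torch
import torch.nn.functional as F

a = torch.randn(1, 16, 8, 8)   # feature map of 8*8*16
b = torch.randn(1, 16, 8, 8)   # another feature map of 8*8*16

added = a + b                        # ResNet-style sum: shape stays (1, 16, 8, 8)
concat = torch.cat([a, b], dim=1)    # DenseNet-style concat: (1, 32, 8, 8)

# Nearest-neighbor upsampling: 8*8 -> 16*16, channel count unchanged.
up = F.interpolate(a, scale_factor=2, mode="nearest")   # (1, 16, 16, 16)
print(added.shape, concat.shape, up.shape)
```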

2. Multi-scale prediction

As the network structure shows, YOLOv3 outputs 3 feature maps of different sizes, corresponding (from top to bottom) to deep, middle and shallow features. The deep feature map is small with a large receptive field, which suits detecting large-scale objects; the shallow feature map is the opposite and is better for detecting small-scale objects. This is similar to the FPN structure.
YOLOv3 still uses pre-selected anchor boxes. Since there is now more than one feature map, the matching scheme must change accordingly. Concretely: a clustering algorithm is still used to obtain 9 prior boxes of different widths and heights, which are then allocated to the feature maps as shown in the figure below, so each feature map only needs 3 prior boxes instead of the 5 used in YOLOv2.

(Figure: allocation of the 9 clustered prior boxes across the three feature-map scales)
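The clustering referred to above is the k-means procedure inherited from YOLOv2, with 1 − IOU as the distance so that large boxes do not dominate. A simplified sketch (the function names here are illustrative, not from the official code):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, treating all boxes as if they shared one corner."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    areas = boxes[:, 0] * boxes[:, 1]
    c_areas = centroids[:, 0] * centroids[:, 1]
    return inter / (areas[:, None] + c_areas[None, :] - inter)

def kmeans_anchors(boxes, k=9, iters=100):
    """Cluster ground-truth (w, h) pairs into k anchors, distance = 1 - IOU."""
    boxes = np.asarray(boxes, dtype=np.float64)
    rng = np.random.default_rng(0)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # max IOU = min distance
        for i in range(k):
            members = boxes[assign == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    return centroids[np.argsort(centroids.prod(axis=1))]  # sorted by area

# Usage (hypothetical): anchors = kmeans_anchors(all_gt_wh), all_gt_wh an (N, 2) array.
```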

YOLOv3's approach differs from SSD's. Although both use information from multiple feature maps, SSD predicts separately on each feature map from shallow to deep without fusing them, whereas YOLOv3's backbone looks more like a combination of SSD and FPN. YOLOv3 uses the COCO dataset by default, with 80 object categories, so each anchor needs 80 class predictions, 4 position predictions and 1 confidence prediction. Each cell has three anchors, so 3×(4+1+80)=255 values are needed in total, which is the number of prediction channels of each feature map.
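These numbers can be checked directly; the sketch below also shows where the 4032 boxes counted in the training section come from, assuming the standard downsampling rates of 32, 16 and 8 (so a 256*256 input yields 8*8, 16*16 and 32*32 feature maps):

```python
num_classes = 80
anchors_per_cell = 3
channels = anchors_per_cell * (4 + 1 + num_classes)

def total_boxes(input_size, strides=(32, 16, 8)):
    """Total predicted boxes over the three feature maps."""
    return sum(anchors_per_cell * (input_size // s) ** 2 for s in strides)

print(channels)          # 255 prediction channels per feature map
print(total_boxes(256))  # 4032 -> the count used in the training section below
print(total_boxes(416))  # 10647 for the more common 416*416 input
```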

(Figure: the 85-dimensional prediction vector of one anchor — 4 box coordinates, 1 confidence, 80 class scores)

The COCO dataset has 80 categories, so the class scores occupy 80 of the 85 output dimensions, each dimension independently representing the confidence of one category. Using a sigmoid activation instead of the softmax in YOLOv2 removes the mutual exclusion between categories and makes the network more flexible. Experiments showed that softmax can be replaced by multiple independent logistic classifiers without a drop in accuracy. This design enables multi-label classification of objects: for example, an object that is a Woman also belongs to Person.
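A tiny illustration of the difference (the three class names are just examples):

```python
import torch

# Raw class scores for one anchor, e.g. (Person, Woman, Car).
logits = torch.tensor([2.0, 1.5, -3.0])

sig = torch.sigmoid(logits)            # independent scores: ~[0.88, 0.82, 0.05]
soft = torch.softmax(logits, dim=0)    # competing scores that sum to 1

print(sig)   # Person and Woman can both be "on" -> multi-label
print(soft)  # softmax forces the categories to be mutually exclusive
```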

3. Training strategy and loss function

The previous V1 and V2 only divided samples into positive and negative examples; V3 also has ignored samples.

Positive examples: take each ground truth in turn and compute its IOU with all 4032 predicted boxes; the predicted box with the largest IOU is a positive example. A predicted box can be assigned to only one ground truth: if the first ground truth has already matched a positive box, the next ground truth looks for the largest-IOU box among the remaining 4031. The order in which ground truths are processed can be ignored. Positive examples generate confidence loss, box loss and class loss. The box target is the corresponding ground-truth box (it must be encoded back into the network's output form, computed from the real x, y, w, h); the class label is 1 for the true category and 0 for the rest; the confidence label is 1.

Ignored samples: apart from the positive examples, any predicted box whose IOU with any ground truth exceeds the threshold (0.5 in the paper) is an ignored sample. Ignored samples generate no loss at all.

Negative examples: a predicted box that is not a positive example (a box with the largest IOU for some ground truth is still positive even if that IOU is below the threshold) and whose IOU with every ground truth is below the threshold (0.5) is a negative example. Negative examples only produce confidence loss, with a confidence label of 0.

Note that V3 does not assign ground truths the way V1 does. V1 allocated the corresponding predictor according to which cell contains the object's center point; V3 instead finds, among the predicted values, the predicted box with the largest IOU as the positive example. All 4032 output boxes compute IOU with the ground truth directly, and the box with the highest IOU is assigned to that ground truth. The reason is that YOLOv3 produces 3 feature maps whose cell centers overlap: during training the third box of feature map 1 may fit best, while during inference the first box of feature map 2 may have the highest confidence.
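Putting the three rules together, the matching can be sketched as follows (a simplified illustration of the assignment described above, not the official implementation; it assumes an IOU matrix has already been computed):

```python
import numpy as np

POSITIVE, IGNORE, NEGATIVE = 1, -1, 0

def assign_samples(ious, ignore_thresh=0.5):
    """ious: (num_pred, num_gt) IOU matrix between all predicted boxes
    (e.g. the 4032 of a 256*256 input) and the ground truths.
    Returns one label per predicted box following the three rules above."""
    num_pred, num_gt = ious.shape
    labels = np.full(num_pred, NEGATIVE)
    taken = np.zeros(num_pred, dtype=bool)
    for g in range(num_gt):                       # each ground truth claims one box
        col = np.where(taken, -1.0, ious[:, g])   # a box may match only one gt
        best = int(np.argmax(col))
        labels[best] = POSITIVE
        taken[best] = True
    # Non-positive boxes overlapping any gt above the threshold are ignored.
    overlaps = ious.max(axis=1) > ignore_thresh
    labels[(labels != POSITIVE) & overlaps] = IGNORE
    return labels
```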

In YOLOv1 the confidence label is the IOU between the predicted box and the real box, while in YOLOv3 it is 1. Confidence means "this predicted box is / is not a real object", a binary classification, so labels of 1 and 0 are more reasonable. Under V1's training, a predicted box whose IOU with the real box hovers around 0.7 gets 0.7 as its confidence label; the learned confidence carries some deviation and ends up around 0.5 or 0.6, so with an inference threshold of 0.7 that box is filtered out, even though a predicted box with IOU 0.7 is actually a well-learned example. This is especially harmful for small objects in COCO, where a few pixels can change the IOU dramatically; under the first labeling scheme the confidence label stays small and cannot be learned effectively, lowering the detection recall.

Ignored samples are the finishing touch of YOLOv3. Because YOLOv3 uses multi-scale feature maps, detections from feature maps of different scales can overlap. For example, suppose a real object is assigned, during training, to the third box of feature map 1 with an IOU of 0.98, while the first box of feature map 2 also covers the same ground truth with an IOU of 0.95. If that second box's confidence were forcibly labeled 0, the network would learn poorly; hence such boxes are ignored instead.
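In the loss, the ignore rule simply masks those boxes out of the confidence (objectness) term. A minimal sketch, assuming labels of 1 / 0 / -1 as produced above:

```python
import torch
import torch.nn.functional as F

def confidence_loss(conf_logits, labels):
    """Objectness loss: target 1 for positives, 0 for negatives;
    ignored boxes (label -1) are masked out and contribute no gradient."""
    labels = torch.as_tensor(labels)
    keep = labels != -1                           # drop ignored samples
    target = (labels == 1).float()
    return F.binary_cross_entropy_with_logits(conf_logits[keep], target[keep])
```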

Origin: blog.csdn.net/qq_38375203/article/details/125505508