YOLOv3 study notes

YOLOv1: YOLO v1 in-depth understanding
YOLOv2: YOLOv2 / YOLO9000 in-depth understanding

Reprinted from: YOLOv3 in-depth understanding

YOLOv3 doesn't introduce many brand-new ideas; it mainly integrates a number of proven techniques into YOLO. The result is nevertheless strong: while keeping its speed advantage, it improves prediction accuracy, and in particular strengthens the ability to detect small objects.

The main improvements of YOLOv3 include: adjusting the network structure; using multi-scale features for object detection; and replacing softmax with logistic outputs for object classification.

New network structure Darknet-53

For basic image feature extraction, YOLOv3 uses a network structure called Darknet-53 (it contains 53 convolutional layers). It borrows from residual networks, adding shortcut connections between some layers.
[Figure: Darknet-53 network structure]
The Darknet-53 network in the figure above takes a 256×256×3 input. The numbers 1, 2, 8, etc. in the leftmost column indicate how many times each residual component is repeated. Each residual component consists of two convolutional layers and a shortcut connection; a schematic of one residual component is shown below:
[Figure: schematic of a residual component]
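To make the component concrete, here is a minimal PyTorch-style sketch of one such residual block. The 1×1-then-3×3 pattern with batch norm and leaky ReLU follows the usual Darknet convention; the channel count and input size in the test are just illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One Darknet-53 residual component: a 1x1 conv halves the channels,
    a 3x3 conv restores them, and a shortcut adds the input back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        # Shortcut connection: output has the same shape as the input
        return x + self.conv2(self.conv1(x))

x = torch.randn(1, 64, 128, 128)          # illustrative input
print(ResidualBlock(64)(x).shape)          # torch.Size([1, 64, 128, 128])
```

Because the shortcut is a plain addition, the block preserves spatial size and channel count, which is what lets Darknet stack 1, 2, or 8 of them back to back.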

Detailed network structure

[Figure: detailed YOLOv3 network structure]
The figure above is referenced from: yolo v3 of the yolo series [in-depth analysis]

Object detection using multi-scale features

[Figure: YOLOv3 multi-scale detection structure]

YOLOv2 used a passthrough structure to pick up fine-grained features; YOLOv3 goes further and performs object detection on 3 feature maps of different scales.

Referring to the figure above: after layer 79, the convolutional network continues through several more convolutional layers (yellow in the figure) to produce the first scale of detection results. Relative to the input image, the feature map used for detection here is downsampled 32×; for example, with a 416×416 input, the feature map here is 13×13. Because the downsampling factor is high, this feature map has a relatively large receptive field, so it is suitable for detecting large objects in the image.

To achieve finer-grained detection, the layer 79 feature map is also convolved and upsampled (the upsampling and convolution branch going right from layer 79), then concatenated with the layer 61 feature map. After several more convolutional layers, this yields the finer-grained layer 91 feature map, which is downsampled 16× relative to the input image. It has a medium-scale receptive field and is suitable for detecting medium-sized objects.

Finally, the layer 91 feature map is upsampled once more and concatenated with the layer 36 feature map, yielding a feature map downsampled 8× relative to the input image. It has the smallest receptive field and is suitable for detecting small objects.
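A minimal sketch of this upsample-and-concatenate pattern, assuming a 416×416 input. The intermediate convolutional layers between the merge points are omitted, and the channel counts are illustrative assumptions, so this only shows how the 13×13, 26×26, and 52×52 maps relate to each other:

```python
import torch
import torch.nn.functional as F

# Illustrative backbone outputs for a 416x416 input (channel counts assumed):
feat_36 = torch.randn(1, 256, 52, 52)    # layer 36, 8x downsampled
feat_61 = torch.randn(1, 512, 26, 26)    # layer 61, 16x downsampled
feat_79 = torch.randn(1, 1024, 13, 13)   # layer 79, 32x downsampled

scale1 = feat_79                                    # 13x13: large objects

up = F.interpolate(feat_79, scale_factor=2, mode="nearest")
scale2 = torch.cat([up, feat_61], dim=1)            # 26x26: medium objects

up = F.interpolate(scale2, scale_factor=2, mode="nearest")
scale3 = torch.cat([up, feat_36], dim=1)            # 52x52: small objects

print(scale1.shape, scale2.shape, scale3.shape)
```

The concatenation lets the deeper, semantically strong features be combined with the shallower, spatially finer ones before each detection head.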

A priori boxes of 9 sizes

As the number and scale of the output feature maps change, the sizes of the a priori boxes also need to be adjusted accordingly. YOLOv2 already used K-means clustering to obtain a priori box sizes; YOLOv3 continues this method, assigning 3 a priori boxes to each downsampling scale and clustering a priori boxes of 9 sizes in total. On the COCO dataset the 9 a priori boxes are: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
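A minimal sketch of such a clustering step, using the 1 − IoU distance that YOLOv2 introduced for this purpose. The ground-truth widths/heights below are randomly generated stand-ins, not real annotations:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, assuming all boxes share a common center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IoU distance
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    # Sort the 9 anchors by area, smallest first
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]

# Made-up ground-truth (width, height) pairs in pixels, for illustration only
gt_wh = np.abs(np.random.default_rng(1).normal(80, 60, size=(500, 2))) + 5
print(kmeans_anchors(gt_wh))
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which is why the resulting 9 anchors span such a wide range of sizes.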

In terms of allocation, the larger a priori boxes (116×90), (156×198), (373×326) are applied on the smallest 13×13 feature map (largest receptive field), suitable for detecting larger objects. The medium a priori boxes (30×61), (62×45), (59×119) are applied on the medium 26×26 feature map (medium receptive field), suitable for detecting medium-sized objects. The smaller a priori boxes (10×13), (16×30), (33×23) are applied on the larger 52×52 feature map (smaller receptive field), suitable for detecting smaller objects.
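Expressed in code, this allocation is simply a fixed mapping from feature-map size to anchor sizes (grid sizes here assume a 416×416 input):

```python
# The 9 COCO anchors, grouped by the feature map they are used on
anchors_per_scale = {
    13: [(116, 90), (156, 198), (373, 326)],  # 32x downsampled: large objects
    26: [(30, 61), (62, 45), (59, 119)],      # 16x downsampled: medium objects
    52: [(10, 13), (16, 30), (33, 23)],       # 8x downsampled: small objects
}
```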

[Figure: sizes of the 9 a priori boxes]
To get an intuitive feel for the sizes of the 9 a priori boxes: in the figure below, the blue boxes are the a priori boxes obtained by clustering, the yellow box is the ground truth, and the red box is the grid cell containing the object's center point.
[Figure: a priori boxes vs. ground truth and center grid cell]

Object classification: softmax replaced with logistic

When predicting object categories, softmax is not used; instead, logistic (sigmoid) outputs are used for prediction. This supports multi-label objects (for example, a person can carry both the labels Woman and Person).
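A small numerical sketch of the difference (the logits are made up): softmax forces the class scores to compete and sum to 1, while independent logistic (sigmoid) outputs can assign high probability to several labels at once.

```python
import torch

logits = torch.tensor([3.0, 2.8, -1.0])  # made-up scores: Person, Woman, Car

softmax_p = torch.softmax(logits, dim=0)
print(softmax_p)   # ~[0.54, 0.45, 0.01] -- forced to sum to 1, classes compete

sigmoid_p = torch.sigmoid(logits)
print(sigmoid_p)   # ~[0.95, 0.94, 0.27] -- Person and Woman can both be high
```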

Input to output

[Figure: mapping from input image to 3-scale output tensors]
Setting aside the details of the network structure: overall, for an input image, YOLOv3 maps it to output tensors at 3 scales, which represent the probabilities of various objects being present at each position in the image.

Let's count how many predictions YOLOv3 makes. For a 416×416 input image, 3 a priori boxes are set at each grid cell of each scale's feature map, giving a total of 13×13×3 + 26×26×3 + 52×52×3 = 10647 predictions. Each prediction is a (4+1+80) = 85-dimensional vector, containing the box coordinates (4 values), the box confidence (1 value), and the object class probabilities (80 classes for the COCO dataset).
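Spelling out the arithmetic as a quick sanity check (assuming a 416×416 input and the 80 COCO classes):

```python
grids = [416 // 32, 416 // 16, 416 // 8]    # [13, 26, 52]
num_preds = sum(g * g * 3 for g in grids)   # 3 a priori boxes per grid cell
print(num_preds)                            # 10647

dim = 4 + 1 + 80                            # box coords + confidence + classes
print(dim)                                  # 85
# Output tensor shapes at the three scales:
# (N, 3*85, 13, 13), (N, 3*85, 26, 26), (N, 3*85, 52, 52)
```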

For comparison, YOLOv2 makes 13×13×5 = 845 predictions. YOLOv3 predicts more than 10 times as many boxes, and does so at multiple resolutions, so both mAP and small-object detection improve to some degree.

Summary

YOLOv3 draws on the residual network structure to build a deeper network and adds multi-scale detection, improving mAP and the detection of small objects. If COCO mAP-50 is used as the evaluation metric (that is, if you don't care too much about how precisely the boxes are localized), YOLOv3's performance is quite impressive: as shown in the figure below, at the same accuracy YOLOv3 is 3 to 4 times faster than other models.

[Figure: speed vs. COCO mAP-50 comparison]
However, if more precise boxes are required and COCO AP is used as the evaluation standard, YOLOv3's accuracy is weaker, as shown below.
[Figure: speed vs. COCO AP comparison]

Origin: blog.csdn.net/W1995S/article/details/112907370