Introduction to object detection: Detailed explanation of YOLOv3 (1)

First, let's introduce the origin of the name YOLO: "you only look once", meaning you only need to look at the image once. From the name alone, it is obvious that YOLOv3 is a single-stage object detection algorithm.

So, three questions:

1. Why is YOLOv3 a "single-stage" object detection algorithm rather than a "two-stage" one?

2. Why has YOLOv3 attracted so much attention, and why am I starting the explanation from it?

3. Compared with the traditional R-CNN family of networks, what do you think is the most ingenious thing about YOLOv3?

With these questions in mind, let's begin. After you finish studying, leave your answers in the comment area!

Note that I jump straight into the explanation of YOLOv3 here without covering the principles and current state of object detection. I will talk about the R-CNN series later, where I will start from the beginning; beginners are advised to start with R-CNN.

YOLO only really started to attract attention after YOLOv3 came out. It is worth noting that the overall architecture of the later YOLO versions (v4, v5) does not differ much from YOLOv3, so understanding YOLOv3 is an indispensable step in learning the YOLO series of algorithms.

The basic idea of the YOLOv3 algorithm can be divided into two parts: first, extract features from the input image; second, predict bounding boxes and class probabilities on those feature maps.

The YOLOv3 framework is shown in the figure above. The entire framework can be divided into three parts: the Darknet-53 structure (feature extraction; in fact only 52 of its layers are used), the feature-fusion structure (the concat part in the figure above), and the three detection heads (Y1, Y2, Y3). What are the functions of these three parts, and how do they work?

Let's start from the network input. An image x (size: 416×416) is fed into the Darknet-53 structure, and after a series of convolutional and residual blocks we obtain feature maps at 1/8 (size: 52×52), 1/16 (size: 26×26), and 1/32 (size: 13×13) of the original image. This is the so-called feature extraction process.
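To make those shapes concrete, here is a minimal sketch in PyTorch. This is not the real Darknet-53 (which stacks dozens of convolutional layers with residual connections); `TinyBackbone` is a made-up stand-in that only reproduces the stride-8/16/32 output shapes:

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for Darknet-53: three downsampling stages that emit
    feature maps at strides 8, 16, and 32. Channel counts are illustrative."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(            # stride 8: 416 -> 52
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
        )
        self.stage2 = nn.Sequential(            # stride 16: 52 -> 26
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.1))
        self.stage3 = nn.Sequential(            # stride 32: 26 -> 13
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.LeakyReLU(0.1))

    def forward(self, x):
        c3 = self.stage1(x)   # 52x52 for a 416x416 input
        c4 = self.stage2(c3)  # 26x26
        c5 = self.stage3(c4)  # 13x13
        return c3, c4, c5

c3, c4, c5 = TinyBackbone()(torch.randn(1, 3, 416, 416))
print(c3.shape, c4.shape, c5.shape)
# torch.Size([1, 128, 52, 52]) torch.Size([1, 256, 26, 26]) torch.Size([1, 512, 13, 13])
```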

You may want to ask: what is the use of these three feature maps of different sizes? Why three?

The traditional Faster R-CNN works with roughly 2000 candidate boxes, while YOLOv3 gets by with only 9 anchor boxes. These boxes are divided into three scales, large, medium, and small, and each scale has three boxes (this is one of the most subtle things about YOLOv3):
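For reference, these are the nine anchor sizes (width × height in pixels) from the official YOLOv3 configuration, obtained by k-means clustering on COCO, grouped by the detection scale they serve:

```python
# The nine canonical YOLOv3 anchor boxes (width, height in pixels).
# Small anchors go to the high-resolution 52x52 map (small objects),
# large anchors to the coarse 13x13 map (large objects).
ANCHORS = {
    "small  (52x52 map, stride  8)": [(10, 13), (16, 30), (33, 23)],
    "medium (26x26 map, stride 16)": [(30, 61), (62, 45), (59, 119)],
    "large  (13x13 map, stride 32)": [(116, 90), (156, 198), (373, 326)],
}

for scale, boxes in ANCHORS.items():
    print(scale, "->", boxes)
```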

With three feature maps of different sizes in hand, it seems they could be used directly for classification and detection. However, such feature layers may not be expressive enough on their own; the extracted features may not fully reflect the target information in the original image. The three feature maps are therefore fused to obtain stronger feature expressiveness and better results. Since their sizes differ, upsampling (by a factor of 2) is applied in between to bring the feature maps to the same size; they are then concatenated, fused, and passed through further convolutions to obtain the final three feature layers: 13×13×255 (Y1), 26×26×255 (Y2), and 52×52×255 (Y3) in the figure above. These are exactly 1/32, 1/16, and 1/8 the size of the original image x. As for the 255 channels: each grid cell predicts 3 anchor boxes, and each box carries 4 coordinates, 1 objectness score, and 80 COCO class scores, so 3 × (4 + 1 + 80) = 255.
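Here is a minimal sketch of that fusion step, again with made-up channel counts (the real network inserts extra convolution sets before and after each concat). It shows the upsample-and-concat pattern and heads that produce 255 output channels:

```python
import torch
import torch.nn as nn

class TinyNeck(nn.Module):
    """Simplified YOLOv3-style fusion: the deepest map is upsampled by 2
    and concatenated with the shallower map, then a 1x1 conv produces a
    head output with 255 = 3 anchors x (4 box + 1 obj + 80 classes)
    channels. Channel counts match TinyBackbone, not Darknet-53."""
    def __init__(self, c3=128, c4=256, c5=512, out=255):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.head_y1 = nn.Conv2d(c5, out, 1)            # -> 13x13x255
        self.head_y2 = nn.Conv2d(c5 + c4, out, 1)       # -> 26x26x255
        self.head_y3 = nn.Conv2d(c5 + c4 + c3, out, 1)  # -> 52x52x255

    def forward(self, c3, c4, c5):
        y1 = self.head_y1(c5)                    # deepest map: Y1
        p4 = torch.cat([self.up(c5), c4], dim=1) # fuse 13x13 -> 26x26
        y2 = self.head_y2(p4)                    # fused map: Y2
        p3 = torch.cat([self.up(p4), c3], dim=1) # fuse 26x26 -> 52x52
        y3 = self.head_y3(p3)                    # fused map: Y3
        return y1, y2, y3
```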

At this point, the entire YOLOv3 framework has basically been covered.

To sum up, setting aside other details, the entire framework takes an input image x of size 416×416×3 and, through a large number of convolution operations, transforms it into three much smaller feature maps. The principle behind this will be explained in detail later.
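As a quick sanity check of these shapes, here is a minimal usage example; it assumes the hypothetical `TinyBackbone` and `TinyNeck` sketches from earlier in this post are in scope:

```python
import torch

# Reuses the TinyBackbone / TinyNeck sketches defined above.
x = torch.randn(1, 3, 416, 416)              # dummy input image
y1, y2, y3 = TinyNeck()(*TinyBackbone()(x))
print(y1.shape)  # torch.Size([1, 255, 13, 13])  -> 1/32 of 416
print(y2.shape)  # torch.Size([1, 255, 26, 26])  -> 1/16 of 416
print(y3.shape)  # torch.Size([1, 255, 52, 52])  -> 1/8  of 416
```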
