Object Detection YOLO Algorithm Series Study Notes, Chapters 2, 3, and 4 - YOLOv1, v2, v3 Algorithm Ideas and Architecture

For any neural-network detection algorithm, focus on three things: the network structure, the meaning of the output values, and the loss function.

Publication date, author, and paper of each version

YOLO(You Only Look Once),2015.6.8,Joseph Redmon
YOLOv2(YOLO9000),2016.12.25,Joseph Redmon
YOLOv3,2018.4.8,Joseph Redmon
YOLOv4,2020.4.23,Alexey Bochkovskiy
YOLOv5,2020.6.10,Ultralytics
YOLOX,2021.7.20,Megvii (旷视)
YOLOv6,2022.6.23,Meituan (美团)
YOLOv7,2022.7.7,Alexey Bochkovskiy
YOLOv8,2023.1,Ultralytics
Reference: https://blog.csdn.net/qq_34451909/article/details/128779602

YOLOv1:2015,paper:https://arxiv.org/pdf/1506.02640.pdf
YOLOv2:2017,paper:https://arxiv.org/pdf/1612.08242.pdf
YOLOv3:2018,paper:https://arxiv.org/pdf/1804.02767.pdf
YOLOv4:2020,paper:https://arxiv.org/pdf/2004.10934.pdf
YOLOv5:2020, no official paper (Ultralytics release; the sometimes-cited https://arxiv.org/pdf/2108.11539.pdf is the TPH-YOLOv5 paper, not an official YOLOv5 paper)
YOLOv6:2022,paper:https://arxiv.org/pdf/2209.02976.pdf
YOLOv7:2022,paper:https://arxiv.org/pdf/2207.02696.pdf

yolo-v1 (proposed in 2015)

Overall idea

The core idea is to treat object detection as a single regression task. The image is first divided into an S×S grid; the grid cell containing the center of an object's ground-truth box is responsible for detecting that object.
As the classic one-stage method, it converts the detection problem into a regression problem: a single CNN processes the image and directly outputs x, y, w, h.

Divide the input image into an S×S grid (7×7 in v1).
Decide which grid cell an object belongs to by looking at its center point.
Each cell starts from two candidate boxes whose heights and widths are set empirically. Compute each candidate's IoU with the ground-truth box and let the one with the higher IoU be the predicted box, then fine-tune it (a regression task). Each cell thus produces a predicted box, and the outputs are x, y, w, h (the bounding box) plus a confidence score. The many generated boxes are then filtered to keep the ones with higher confidence, which are fine-tuned further.

Interpretation of network architecture

Because a fully connected layer's weight and bias matrices have fixed sizes, the size of its input features must also be fixed, which in turn fixes the input image size; YOLOv1 therefore specifies 448×448×3.
Everything before the fully connected layers can be viewed as a convolutional feature-extraction stage. Since the GoogLeNet-style backbone used here is rarely used nowadays, it is not covered in detail; the network used in v3 is explained in depth later.

The key point is the meaning of the 7×7×30 output:
7×7 means the image is divided into a 7×7 grid.
30 = 10 + 20: the position, size, and confidence of two candidate boxes (bounding boxes), 5 values each, plus 20 class probabilities.
The dimension of the final prediction result is S×S×(B×5+C).
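As a quick sanity check, this dimension can be computed directly (a minimal sketch; the variable names are mine, not from the paper):

```python
# YOLOv1 output tensor: S x S x (B*5 + C)
S, B, C = 7, 2, 20       # grid size, boxes per cell, VOC class count
depth = B * 5 + C        # each box contributes x, y, w, h, confidence
print((S, S, depth))     # -> (7, 7, 30)
```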

Loss function

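The loss figure did not survive extraction; for reference, the multi-part sum-squared loss as given in the YOLOv1 paper (with $\lambda_{\text{coord}} = 5$, $\lambda_{\text{noobj}} = 0.5$) is:

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
 &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i-\hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i-\hat{C}_i)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c\,\in\,\text{classes}} (p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$

The parts discussed below are the position error (first two rows), the confidence error (third row), and the classification error (last row).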

Position error

Here i indexes the grid cells (S² in total) and j indexes the candidate boxes (B per cell); the box with the largest IoU with the ground truth is selected, and the squared difference between the true and predicted values is computed.
Taking the square root of w and h deals with objects of different sizes: for a small object, a one-unit change in box size can already mean the box no longer covers the object.
The function y = √x is more sensitive for small values and less sensitive for large ones, so a small offset on a small object, which would otherwise barely register in the loss, now produces a noticeable penalty.
The coefficients in front act as weighting terms.
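A quick worked example of the square-root effect: the same 0.05 change in width costs a small box much more than a large one,

$$\sqrt{0.10}-\sqrt{0.05} \approx 0.093 \qquad \text{vs.} \qquad \sqrt{0.85}-\sqrt{0.80} \approx 0.028,$$

roughly a 3× larger penalty for the same absolute offset on the small box.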

Confidence error

The loss distinguishes foreground boxes (containing a target) from background boxes (no target).

When a box's IoU with the ground-truth box is greater than a threshold such as 0.5, it counts as foreground, and the closer its predicted confidence is to that IoU the better. If several boxes overlap the same ground truth, the one with the largest IoU is the one used.

In real images the background far outweighs the foreground, so the background term in the loss carries a smaller weight, reducing the background's influence on the loss.

Classification error

In the original YOLOv1 paper this is a sum-squared error over the class probabilities (later YOLO versions use cross-entropy instead).

NMS (Non-Maximum Suppression)

At prediction time, among boxes with high mutual IoU (i.e. predicting the same object), compare their confidences, keep the maximum, and remove the rest.
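A minimal NumPy sketch of greedy NMS (an illustration of the procedure, not the paper's exact implementation):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,)."""
    order = scores.argsort()[::-1]   # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter)
        # drop boxes that overlap the kept one too much (same object)
        order = order[1:][iou < iou_thresh]
    return keep
```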

Features of yolo-v1

Pros
Fast and simple.

Cons

  1. Overlapping objects are hard to detect.
  2. Small objects are detected poorly, and with only two boxes per cell the aspect ratios that can be expressed are limited.
  3. Multiple labels on one object are hard to handle, e.g. an object labeled both "dog" and "husky".

yolo-v2 (2017)

Improvement details

v2 does not have a fully connected layer.

Batch Normalization is now used everywhere; a conv + BN pairing is essentially standard.
In v1, with the weaker hardware of the time, training directly on 448×448×3 images would have been slow, so the classifier was pre-trained at a lower resolution and the training and testing resolutions differed.
In v2, that cost is accepted: the classification network gets an additional fine-tuning at 448×448 for 10 epochs before detection training.

Interpretation of network architecture

It borrows from the VGG (2014) and ResNet network structures.

  • Its job is to extract features, so there is no fully connected layer.
    Fully connected layers overfit easily and, having many parameters, train relatively slowly.
    (The output dimensions shown in the original figure were just an example, not the actual values.)

  • There are 5 downsampling steps, each halving the resolution, so one point on the output feature map corresponds to 1/32 of the original image.
    The actual input is 416×416 and the output is 13×13 (finer than v1's 7×7). The backbone is Darknet-19, with 19 layers; the depth can be changed to suit your needs: more layers may raise mAP but will lower FPS. The output size is odd so that the feature map has a single center cell.

  • The network uses two kinds of convolution kernels: 3×3 and 1×1. The 3×3 choice follows the VGG paper: small kernels keep the parameter count low while stacked layers still give a large receptive field. The 1×1 convolutions only change the number of feature maps (channels), acting as a feature-concentration step; a 1×1 kernel has far fewer parameters than a 3×3 one and computes faster (see the comparison below).
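A quick parameter count for the 1×1 vs 3×3 claim, assuming 256 input and 256 output channels:

$$3 \times 3 \times 256 \times 256 \approx 590\text{K weights}, \qquad 1 \times 1 \times 256 \times 256 \approx 65\text{K weights},$$

so the 1×1 convolution uses about 9× fewer parameters for the same channel transformation.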

Choosing prior box sizes by clustering


  • YOLOv1 uses 2 candidate boxes per cell.
  • Faster R-CNN uses 9 anchors: 3 scales, each with 3 aspect ratios. The prior ratios chosen in the Faster R-CNN line are conventional, but not necessarily well suited to a given dataset.
  • YOLOv2 selects prior box sizes by K-means clustering over the ground-truth boxes of the training set. Clustering into k = 5 groups, the center of each cluster provides a representative (w, h) used as a prior box, so each grid cell of the feature map has 5 priors. Plain Euclidean distance would give large boxes large errors and small boxes small errors, so the paper defines the distance as d = 1 − IoU (a code sketch follows this list).
    The paper's experiment settles on the compromise k = 5: average IoU is already fairly high there, and the curve's slope shows that increasing k further gains little. The five cluster centers become the anchors.
  • As a result, mAP decreased slightly while recall increased significantly.
    With more candidate boxes, not every box can be learned well, so mAP dips slightly (finding two boxes and learning them well can beat finding five and learning them poorly), but recall improves markedly: more of the targets actually present in the image get detected.
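A minimal sketch of the clustering idea, assuming `wh` is a float array of normalized ground-truth box sizes (this uses the cluster mean as the center; implementations differ in such details):

```python
import numpy as np

def iou_wh(wh, centers):
    """IoU between (w, h) pairs, treating all boxes as sharing one center."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0] * wh[:, 1]
    union = union[:, None] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100):
    """K-means over box sizes with distance d = 1 - IoU."""
    centers = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(wh, centers), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers
```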

Offset calculation method, coordinate mapping and restoration

  • YOLOv1 predicts box coordinates directly, without constraints.
  • v2 instead predicts a position relative to the grid cell, preventing the box center from moving out of the cell after the offset is applied.

Coordinate Mapping and Restoration

  • tx, ty are the predicted offsets. Each is passed through the sigmoid function, giving a number between 0 and 1, and the coordinates of the grid cell's top-left corner (cx, cy) are added to obtain bx, by.
  • tw, th are also predicted values; they live in log space, so exponentiating them recovers bw, bh. The final predicted values bx, by, bw, bh are positions on the feature map.
  • pw, ph are the prior box sizes obtained from clustering, already at feature-map scale (the original-image sizes divided by 32).
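Written out, the decoding equations from the YOLOv2 paper are:

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}$$

where $(c_x, c_y)$ is the top-left corner of the grid cell; multiplying the results by 32 maps them back to the original image.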

The role of the receptive field

The larger the receptive field, the more of the original image a feature effectively sees, which helps in recognizing larger objects.
The receptive field can be computed layer by layer:

$$RF_l = RF_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i$$

where $k_l$ is the kernel size of layer $l$ and $s_i$ are the strides of the preceding layers. For a dilated ("hollow") convolution with dilation $d$, use the effective kernel size $k' = d(k-1)+1$ in the same formula.

Example with three stacked stride-1 3×3 convolutions: the receptive field of the first layer is 3, of the second layer 5, and of the third layer 5 + (3 − 1) × 1 = 7.
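The layer-by-layer formula is easy to turn into code; a small sketch reproducing the 3, 5, 7 example above:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) tuples, first layer first."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1    # effective kernel size under dilation
        rf += (k_eff - 1) * jump   # grow by the kernel's extent at this depth
        jump *= s                  # accumulate stride for deeper layers
    return rf

print(receptive_field([(3, 1, 1)] * 1))  # 3
print(receptive_field([(3, 1, 1)] * 2))  # 5
print(receptive_field([(3, 1, 1)] * 3))  # 7
```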

Feature Fusion Improvements

A larger receptive field is suited to capturing large objects, but smaller features may get lost along the way.
Both large and small receptive fields are needed, so the two layers are superimposed: an earlier, higher-resolution feature map is stacked together with the final one.

Multiscale

In v2 the input image size must be divisible by 32.
Every so many iterations the input size is changed (the paper randomly picks a new multiple of 32 between 320 and 608 every 10 batches), so the network adapts to both small and large images and becomes more robust.
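A tiny sketch of that schedule (the 320-608 range and step of 32 are from the paper; the rest is illustrative):

```python
import random

sizes = list(range(320, 608 + 1, 32))  # [320, 352, ..., 608], all divisible by 32

def pick_input_size():
    """Called every 10 batches or so to switch the training resolution."""
    return random.choice(sizes)
```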

yolo-v3 (2018)

Improvement overview

v3 is very practical and is deployed widely, including in military and privacy-sensitive settings. In 2020 the original YOLO author announced in a tweet that he was quitting computer-vision research, citing concerns about exactly such uses.

Versions after v3 were not developed by the original author.
Feature map grid size:
v1: 7×7
v2: 13×13
v3: 13×13, 26×26, 52×52.

Multi-scale method improvement and feature fusion


  • 3 scales: 52×52 is suited to predicting small targets, 26×26 to medium targets, and 13×13 to large targets. But the scales are not simply left to do what each is good at in isolation, nor are earlier and later features merged naively.
  • Three candidate boxes (priors) are generated per grid cell on each feature map.
  • Rather than letting 52×52 predict small targets, 26×26 medium targets, and 13×13 large targets independently, feature fusion is performed: the 13×13 map already carries global information, so the 26×26 map draws on the 13×13 map's "experience" when predicting medium-sized objects, and the 52×52 map in turn borrows from the maps for larger targets when predicting small ones.

Comparative analysis of classical scale-handling methods

Image pyramid: resize the input image to several scales, feed each into the network separately, and obtain feature maps at e.g. 52×52, 26×26, and 13×13; accurate, but far too slow.
Single-scale prediction: essentially YOLOv1, a single prediction result from one CNN pass per image.

The naive multi-scale design has each scale doing its own thing in isolation.
v3 instead fuses the scales using upsampling (nearest-neighbor or bilinear interpolation):
the 13×13 feature map is upsampled to 26×26 and fused with the 26×26 map to predict medium targets; the fused result is upsampled again to 52×52 and combined with the 52×52 map to predict small targets.
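A minimal PyTorch sketch of one upsample-and-concatenate step (channel counts are illustrative, not the real Darknet-53 values):

```python
import torch
import torch.nn.functional as F

p13 = torch.randn(1, 512, 13, 13)  # deep, coarse feature map
p26 = torch.randn(1, 256, 26, 26)  # shallower, finer feature map

up = F.interpolate(p13, scale_factor=2, mode="nearest")  # 13x13 -> 26x26
fused = torch.cat([p26, up], dim=1)  # (1, 768, 26, 26), used for medium targets
```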

Interpretation of residual connection method

YOLOv3 incorporates ideas from the ResNet (2016) paper.
In 2014, VGG stopped at 19 layers: beyond that, adding layers could actually hurt prediction quality, partly because of vanishing gradients.
ResNet addresses the vanishing-gradient problem: thanks to the identity mapping x in each shortcut, the deeper network can perform at least as well as the shallower one.
A residual block simply adds its input x to the output of its convolutional layers.

Overall network model architecture analysis

1x, 2x, 8x refer to how many times each residual block is stacked.
YOLOv3 calls this backbone Darknet-53; it can essentially be viewed as a ResNet variant.
There are no pooling layers: pooling compresses features, which can hurt; downsampling is instead done with strided convolutions, which is fast and effective.

Prior box design improvements

Output value: 13×13×3×85, i.e. a 13×13 grid × 3 prior boxes per cell × (4 box coordinates + 1 confidence + 80 classes).
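The same arithmetic in code form (COCO numbers):

```python
num_priors, num_coords, num_classes = 3, 4, 80
depth = num_priors * (num_coords + 1 + num_classes)  # 3 * 85 = 255
print(depth)  # 255, so the coarsest head outputs a 13 x 13 x 255 tensor
```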

Clustering now yields 9 prior box sizes, but a given grid cell does not get all 9: they are divided up by scale, with different priors assigned to different feature maps (the three largest to 13×13, the three middle ones to 26×26, the three smallest to 52×52).

Softmax layer improvements

Instead of a softmax over all classes, each label gets its own binary (logistic) classifier, so a single object can carry multiple labels, e.g. both "dog" and "husky".
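A small NumPy illustration of why independent logistic classifiers allow multiple labels while a softmax forces them to compete (the logits here are made up):

```python
import numpy as np

logits = np.array([2.0, 1.5, -3.0])  # e.g. scores for dog, husky, cat

softmax = np.exp(logits) / np.exp(logits).sum()  # probabilities sum to 1
sigmoid = 1 / (1 + np.exp(-logits))              # independent yes/no per label

print(softmax.round(2))  # [0.62 0.38 0.  ] - the labels compete
print(sigmoid.round(2))  # [0.88 0.82 0.05] - dog AND husky can both be high
```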
There are still many places I don't fully understand.

Source: blog.csdn.net/ThreeS_tones/article/details/129770230