Object Detection with Deep Learning (3): YOLOv3



Foreword

This post records my notes from studying YOLOv3. If there are any mistakes, please let me know!


Note: this article draws on the following article and video:
Article: yolov3 theory
Video: YOLO series theory

1. YOLOv1 (one-stage)

You Only Look Once: Unified, Real-Time Object Detection

1.1 Core ideas of the paper

1) Divide the image into an S × S grid (grid cells); if the center of an object falls inside a grid cell, that cell is responsible for predicting the object.

[Figure: the S × S grid over an input image]
2) Each grid cell predicts B bounding boxes. Besides its position, each bounding box also predicts a confidence value; each grid cell additionally predicts scores for C classes.
Here x, y, w, and h are all relative to the grid cell.
Confidence is the intersection over union between the predicted box and its ground-truth box:
Pr(Object) × IoU(truth, pred)
where Pr(Object) = 0 or 1 (whether an object falls in the grid cell).

At test time, the conditional class probability is multiplied by the box confidence:
Pr(Class_i | Object) × Pr(Object) × IoU(truth, pred) = Pr(Class_i) × IoU(truth, pred),
so the final score combines the probability of the class appearing in the box and how well the predicted box fits the object.
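As a tiny sketch of this score computation (the numbers are made up for illustration):

```python
import numpy as np

# One grid cell of a YOLOv1-style output: B = 2 boxes, C = 3 classes.
box_conf = np.array([0.8, 0.1])          # Pr(Object) * IoU for each predicted box
class_probs = np.array([0.7, 0.2, 0.1])  # Pr(Class_i | Object), shared by both boxes

# Class-specific confidence for every (box, class) pair:
scores = box_conf[:, None] * class_probs[None, :]
print(scores[0, 0])  # 0.8 * 0.7 = 0.56 for box 0, class 0
```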

The overall network structure:
[Figure: YOLOv1 network architecture]
Loss function:
[Figure: YOLOv1 loss function]
Bounding-box loss: the square root is applied to width and height because, if w and h were used directly, the same absolute error would be penalized equally for boxes of different sizes, whereas in practice it matters far more for small boxes.
For example, the figure below:
[Figure: the same w/h error on a large box vs. a small box]
Confidence loss: consists of two terms, one for positive samples and one for negative samples.
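A quick numeric check of why the square root helps (hypothetical widths, in pixels):

```python
import math

# The same absolute width error of 10 pixels on a large box and a small box.
for w_gt, w_pred in [(200, 210), (20, 30)]:
    direct = (w_gt - w_pred) ** 2
    rooted = (math.sqrt(w_gt) - math.sqrt(w_pred)) ** 2
    print(f"w_gt={w_gt}: direct={direct}, with sqrt={rooted:.3f}")

# Without the sqrt both errors cost 100; with it the small box costs
# ~1.010 while the large box costs only ~0.122, so the same absolute
# error is punished far more on small boxes, as intended.
```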

1.2 Existing problems

1) Detection of groups of small objects is poor, because YOLOv1 predicts only two bounding boxes per grid cell, and both boxes share a single class prediction.
2) It also performs poorly when objects appear at new sizes or in unusual configurations.
3) The main source of error in YOLOv1 is inaccurate localization.

2. YOLOv2

2017 CVPR
YOLO9000: Better, Faster, Stronger

2.1 Improvements tried in YOLOv2 (the "Better" section)

1) Batch Normalization
provides regularization and improves training convergence; it adds about 2% mAP over the previous setup and allows the Dropout layer to be removed.

2) High Resolution Classifier
fine-tunes the classification network at a higher input resolution before detection training, which increases mAP by about 4%.

3) Convolutional With Anchor Boxes
predicts coordinate offsets relative to anchors, which simplifies the problem and makes the network easier to learn.
Experiments show that using anchors slightly decreases mAP but considerably improves recall; higher recall means the model has more room to improve.

4) Dimension Clusters
uses k-means clustering to obtain the bounding-box priors (anchors), by clustering all ground-truth bounding boxes in the training set.
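A minimal sketch of this clustering with a 1 − IoU distance (my own illustration, not the authors' code; boxes are (w, h) pairs):

```python
import numpy as np

def wh_iou(box, clusters):
    # IoU between one (w, h) box and each cluster, with all boxes
    # imagined to share the same top-left corner.
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    # boxes: (N, 2) array of ground-truth (w, h); returns k anchor priors.
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each box to the cluster with the smallest 1 - IoU distance,
        # then move each cluster to the median of its assigned boxes.
        assign = np.array([np.argmin(1 - wh_iou(b, clusters)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                clusters[j] = np.median(boxes[assign == j], axis=0)
    return clusters
```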

5) Direct Location Prediction
Training directly with the anchor-based parameterization is unstable, especially early in training; most of the instability comes from predicting the center coordinates x, y of the prior.
The paper therefore constrains the predicted center so that it falls inside its grid cell.
Together with (4), this raises mAP by about 5%.
A sigmoid is likewise used to constrain the confidence value.

6) Fine-Grained Features
fuses a relatively low-level feature map with the high-level feature map through a passthrough layer, improving the detection of small targets.
[Figure: the passthrough layer]
This improves performance by at least 1%.
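The passthrough layer is essentially a space-to-depth rearrangement; a minimal sketch of the idea (my own illustration):

```python
import torch

def passthrough(x, stride=2):
    # (N, C, H, W) -> (N, C*stride^2, H/stride, W/stride): a 26x26x512 map
    # becomes 13x13x2048 and can be concatenated with the 13x13 features.
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

print(passthrough(torch.randn(1, 512, 26, 26)).shape)  # (1, 2048, 13, 13)
```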

7) Multi-Scale Training
Every ten iterations, the size of the input image is re-selected at random.
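A sketch of the size-switching logic (the {320, …, 608} range in multiples of 32 follows the YOLOv2 paper; the loop itself is schematic):

```python
import random

sizes = list(range(320, 609, 32))  # {320, 352, ..., 608}, all multiples of 32

num_iters = 100  # placeholder for the real training length
input_size = 416
for iteration in range(1, num_iters + 1):
    if iteration % 10 == 0:
        input_size = random.choice(sizes)  # resize the next batches to this
```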

2.2 Backbone: Darknet-19

Network structure:
[Figure: Darknet-19 architecture]

2.3 Overall network structure:

[Figure: YOLOv2 overall network architecture]
For VOC, 5 bounding-box priors are used.

3. YOLOv3

2018 (arXiv technical report)
YOLOv3: An Incremental Improvement

3.1 Backbone: Darknet-53

The backbone architecture is shown in the figure below:
[Figure: Darknet-53 architecture]

3.2 YOLOv3 model structure

In the paper, the author mentions that three feature layers are used to predict bounding boxes. According to the blog by Sunflower's Little Mung Bean, for a 416 × 416 input the three feature layers have sizes 52 × 52, 26 × 26, and 13 × 13:

[Figure: the three prediction feature layers of YOLOv3]

3.3 Object Bounding Box Prediction

On each of the three feature layers, YOLOv3 performs convolutional prediction with (4 + 1 + c) × k convolution kernels of size 1 × 1, where k is the number of preset bounding-box priors per scale (k defaults to 3) and c is the number of target classes. For COCO (c = 80), for example, this gives (4 + 1 + 80) × 3 = 255 output channels per scale.
The specific bounding box prediction process is as follows:
[Figure: bounding-box prediction from a prior]

The dashed rectangle in the figure is the preset bounding box (prior), and the solid rectangle is the predicted box computed from the offsets output by the network. Here (cx, cy) is the offset of the grid cell from the top-left corner of the feature map, (pw, ph) are the width and height of the prior on the feature map, (tx, ty, tw, th) are the center and width/height offsets predicted by the network, and (bx, by, bw, bh) is the final predicted bounding box. The conversion from the prior to the final box is given by the formulas on the right of the figure; σ is the sigmoid function, whose purpose is to squash the predicted center offsets into (0, 1) so that the box center stays inside its grid cell (the author notes this speeds up convergence).
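A sketch of this decode step (the formulas follow the YOLOv3 paper; the tensor layout is my own assumption):

```python
import torch

def decode_boxes(t, priors, grid_xy):
    # t:       (..., 4) raw network outputs (tx, ty, tw, th)
    # priors:  (..., 2) prior sizes (pw, ph) on the feature map
    # grid_xy: (..., 2) grid-cell offsets (cx, cy) on the feature map
    bxy = torch.sigmoid(t[..., :2]) + grid_xy  # bx = sigma(tx) + cx, by = sigma(ty) + cy
    bwh = priors * torch.exp(t[..., 2:])       # bw = pw * e^tw,     bh = ph * e^th
    return torch.cat([bxy, bwh], dim=-1)       # (bx, by, bw, bh)
```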

The prior sizes used on each feature map were obtained by the author by clustering on the COCO dataset.
[Figure: prior sizes per feature map, from COCO clustering]

3.4 Matching of positive and negative samples

For each image, each Ground Truth is assigned exactly one prior as a positive sample; other priors whose IoU with the Ground Truth exceeds the threshold (but which are not the best match) are simply ignored.
If a bounding-box prior is not assigned to any Ground Truth, it incurs no coordinate loss and no class loss, only confidence loss.
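A sketch of this matching rule (the IoU-matrix layout is my own assumption; the 0.5 threshold follows the YOLOv3 paper):

```python
import numpy as np

def match_priors(iou, thresh=0.5):
    # iou: (num_priors, num_gt) IoU matrix between priors and ground truths.
    # Returns one label per prior: 1 = positive, -1 = ignored, 0 = negative.
    labels = np.zeros(iou.shape[0], dtype=int)
    labels[np.any(iou > thresh, axis=1)] = -1  # overlaps a GT above threshold: ignore
    labels[iou.argmax(axis=0)] = 1             # best prior for each GT: positive
    return labels
```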

3.5 Loss calculation

[Figure: YOLOv3 loss function]

3.5.1 Confidence Loss

[Figure: confidence loss]
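Since YOLOv3 trains objectness with logistic regression, the confidence loss is a binary cross-entropy over positive and negative samples; in generic form (weighting omitted):

$$ L_{conf} = -\sum_i \left[ o_i \log \hat{c}_i + (1 - o_i) \log (1 - \hat{c}_i) \right] $$

where $o_i$ is 1 for positive samples and 0 for negatives, and $\hat{c}_i$ is the predicted (sigmoid) objectness.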

3.5.2 Class loss

[Figure: class loss]
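Similarly, YOLOv3 uses independent logistic classifiers per class, so the class loss is a binary cross-entropy over the positive samples (generic form):

$$ L_{cls} = -\sum_{i \in pos} \sum_{j} \left[ y_{ij} \log \hat{p}_{ij} + (1 - y_{ij}) \log (1 - \hat{p}_{ij}) \right] $$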

3.5.3 Localization loss

[Figure: localization loss]

4. YOLOv3-SPP (ultralytics)

4.1 Mosaic Image Augmentation

How it is done: several images are stitched together into one training image (see the sketch after the list below).

Advantages:
1) Increases the diversity of the data
2) Increases the number of targets per image
3) Lets BN see the statistics of several images in a single pass
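A minimal sketch of four-image mosaic stitching (the fixed center split and gray padding are my own simplifications; a real implementation also remaps the box labels):

```python
import numpy as np

def mosaic4(imgs, out_size=416):
    # imgs: four HxWx3 uint8 images, one pasted into each quadrant.
    s = out_size // 2
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray background
    for img, (y, x) in zip(imgs, [(0, 0), (0, s), (s, 0), (s, s)]):
        h, w = min(img.shape[0], s), min(img.shape[1], s)
        canvas[y:y + h, x:x + w] = img[:h, :w]  # crop to fit the quadrant
    return canvas
```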

4.2 SPP module

It is different from the SPP structure in SPPNet.
It fuses features pooled at several scales. (An SPP module could in principle be placed before each prediction feature layer, but this is not required.)
[Figure: the SPP module in YOLOv3-SPP]
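A sketch of the module as commonly implemented in YOLOv3-SPP (max-pooling with kernel sizes 5, 9, 13 at stride 1, concatenated with the input):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    # Max-pools the input at several kernel sizes (stride 1, padded so the
    # spatial size is unchanged) and concatenates the results with the input.
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels
        )

    def forward(self, x):
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

print(SPP()(torch.randn(1, 512, 13, 13)).shape)  # (1, 2048, 13, 13)
```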

4.3 CIoU Loss

1. IoU Loss
[Figure: cases where the L2 loss is the same but the IoU differs]
The figure above shows that the L2 loss cannot properly express how good a prediction is, while IoU can.
[Figure: IoU loss]
Advantages:
1) Better reflects the degree of overlap
2) Scale-invariant
Disadvantages:
1) When the two boxes do not intersect, IoU is 0 and the loss cannot reflect how far apart they are (it provides no gradient)
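A minimal IoU / IoU-loss sketch for axis-aligned (x1, y1, x2, y2) boxes (using the common 1 − IoU form; the original IoU-loss paper uses −ln(IoU)):

```python
def iou(a, b, eps=1e-9):
    # a, b: boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + eps)

def iou_loss(a, b):
    return 1.0 - iou(a, b)

print(iou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7 -> loss ~0.857
```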

2. GIoU Loss (Generalized IoU)
[Figure: GIoU illustration]
GIoU Loss:
[Figure: GIoU loss formula]

Now, even when the predicted bounding box does not overlap the Ground Truth, there is a corresponding loss value that can be backpropagated.

Disadvantages:
In the situation shown below, GIoU degenerates into the IoU loss:
[Figure: a case where GIoU degenerates into IoU]
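For reference, the definition from the Generalized-IoU paper, where C is the smallest box enclosing both A and B:

$$ GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}, \qquad L_{GIoU} = 1 - GIoU $$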

3. DIoU Loss
Disadvantages of IoU Loss and GIoU Loss:
1. Slow convergence
2. Inaccurate regression

[Figure: DIoU loss formula]
where ρ²(b, b^gt) is the squared Euclidean distance between the center points of the two bounding boxes, and
c is the diagonal length of their smallest enclosing box.
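Putting these pieces together, the DIoU loss is:

$$ L_{DIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} $$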
[Figure: the geometric quantities used in DIoU]

4. CIoU Loss
A good regression loss should take three geometric factors into account:
overlap area, center-point distance, and aspect ratio.
[Figure: CIoU loss formula]
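For reference, the formula from the paper: CIoU adds an aspect-ratio consistency term v with trade-off weight α on top of DIoU:

$$ L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v} $$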


4.4 Focal Loss

It mainly targets one-stage object detection models.
In an image, the number of candidate boxes that match a target is generally only a dozen or a few dozen, while the number of unmatched candidate boxes is on the order of 10^4 to 10^5.
Most of these unmatched candidates are easy negatives; individually they contribute little to training, but because they are so numerous they overwhelm the small number of samples that are actually informative.

In general, binary cross-entropy loss is used.
A common way to deal with class imbalance is to add a weight α ∈ (0, 1).
[Figure: α-balanced cross-entropy]
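Written out, the α-balanced cross-entropy from the Focal Loss paper is:

$$ CE(p_t) = -\alpha_t \log(p_t) $$

where $p_t = p$ for positive samples and $p_t = 1 - p$ otherwise, and $\alpha_t$ is defined analogously.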
However, while α balances the importance of positive and negative samples, it does not distinguish easy examples from hard ones.
The authors therefore modify the loss to down-weight easy examples, so that training focuses on the hard ones.
[Figure: focal loss formula]
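Written out, the focal loss in its α-balanced form is:

$$ FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) $$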
Here the modulating factor (1 − p_t)^γ reduces the loss contribution of easy examples.
[Figure: focal loss curves for different values of γ]


Source: blog.csdn.net/weixin_43869415/article/details/121732624