YOLOv3 paper reading (study notes)

A CV beginner's summary of YOLOv3 study notes

Paper: YOLOv3: An Incremental Improvement
YOLO website: https://pjreddie.com/darknet/yolo/

YOLOv3 is the third version of the YOLO series; YOLOv1 and YOLOv2 were published at CVPR in 2016 and 2017, respectively. The YOLO family are single-stage, general-purpose object detection algorithms based on deep convolutional neural networks: they cast object detection as a regression problem, so there is no separate step of extracting region proposals (candidate boxes).

1. Abstract (original text, paraphrased):

The YOLOv3 network is larger and deeper than YOLOv2, but more accurate. With a 320×320 input, YOLOv3 reaches 28.2 mAP and takes 22 ms per image: as accurate as SSD but three times faster. YOLOv3 looks especially good when mAP is measured at an IOU threshold of 0.5. On a Titan X, YOLOv3 reaches 57.9 AP50 in 51 ms, while RetinaNet reaches 57.5 AP50 in 198 ms, making YOLOv3 3.8 times faster.

Analysis of the summary part:

mAP: Mean Average Precision, the standard measure of detection accuracy in object detection.

As shown in Figure 3 of the paper, the author plots the YOLOv3 curve far into the upper left of the chart (jokingly, in the "second quadrant") to show that, on the same GPU and with an IOU threshold of 0.5, YOLOv3 beats RetinaNet.
The horizontal axis is inference time: the further left, the faster.
The vertical axis is mAP at an IOU threshold of 0.5: the higher the curve, the more accurate.

Figure 3 below shows the trade-off between speed and accuracy: speed is inference time, and accuracy is mAP at an IOU threshold of 0.5.
[Paper Figure 3]
The blue line is RetinaNet with a ResNet-50 backbone, and the orange line is RetinaNet with a ResNet-101 backbone. The labels YOLOv3-320, YOLOv3-416, and YOLOv3-608 at the bottom right of the chart refer to the input image size. YOLOv3 (like YOLOv2) is a fully convolutional network, so images of any size can be fed in, as long as the side length is a multiple of 32 (320, 416, and 608 all are). With the same weights, inputs of different sizes produce different results.

Now look at Figure 1 in the paper. This plot uses the COCO-style metric: AP is computed at IOU thresholds of 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95, and then averaged.
[Paper Figure 1]
We can see that, on the same GPU, compared with the IOU-0.5 plot in Figure 3 above, YOLOv3's accuracy in Figure 1 drops, but its curve still sits to the upper left of RetinaNet, i.e. it still offers a better speed/accuracy trade-off. This matches what the author writes in the Abstract: "When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good."
In other words, YOLOv3's performance at high IOU thresholds is not as good as at low IOU thresholds.
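To make the metric concrete, here is a tiny numpy sketch of the averaging; the per-threshold AP values are made up purely for illustration, they are not numbers from the paper:

```python
import numpy as np

iou_thresholds = np.arange(0.50, 1.00, 0.05)   # 0.50, 0.55, ..., 0.95 (10 thresholds)
ap_at_threshold = np.array([0.55, 0.50, 0.45, 0.40, 0.35,
                            0.30, 0.24, 0.18, 0.11, 0.04])  # hypothetical per-threshold APs

print("AP50 (old .5 IOU metric):", ap_at_threshold[0])
print("COCO AP (mean over 10 thresholds):", round(float(ap_at_threshold.mean()), 3))
```

A detector whose boxes are roughly right but not tightly aligned (like YOLOv3) scores well at 0.5 but loses a lot at the stricter thresholds, which is exactly the pattern described above.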

2.1 Bounding Box Prediction

1. The first paragraph of the original text:
In YOLO9000 (i.e. YOLOv2), we ran k-means clustering on the bounding boxes of the dataset and used the cluster centers as anchor boxes. Each predicted box outputs four coordinate-related values (tx, ty, tw, th); from these, the center coordinates, width, and height of the predicted box can be computed. The grid cell containing the box is offset from the top-left corner of the image by (cx, cy), and the anchor (prior) has width pw and height ph.

Analysis of the first paragraph (box prediction):

anchor: a predefined box with a fixed aspect ratio; for example, a tall thin anchor suits a telephone pole, while a short wide anchor suits a dog. Through the formulas below, the four predicted values (tx, ty, tw, th) are converted into the center position, width, and height of the predicted box.
σ (sigmoid) function: constrains the center of the predicted box to stay inside its grid cell, as shown in Figure 2 of the paper.
The four formulas:
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw × e^tw
bh = ph × e^th

bx, by: center coordinates of the predicted box
bw, bh: width and height of the predicted box
cx, cy: offsets of the grid cell containing the box from the top-left corner of the whole image
pw, ph: width and height of the anchor

YOLO divides the whole image into a grid of cells, and the width and height of each grid cell are normalized to 1; in other words, each grid cell is a unit square,
as shown in Figure 2 of the paper:
the grid cell outlined in red has cx = 1 and cy = 1.
[Paper Figure 2]
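A minimal numpy sketch (not the author's code) of this decoding step, assuming raw network outputs tx, ty, tw, th for the cell at offset (cx, cy) and an anchor of size (pw, ph), all in grid-cell units:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw outputs into box center / width / height (grid-cell units)."""
    bx = sigmoid(tx) + cx      # center x, constrained to stay inside the cell
    by = sigmoid(ty) + cy      # center y
    bw = pw * np.exp(tw)       # width  = anchor width  scaled by exp(tw)
    bh = ph * np.exp(th)       # height = anchor height scaled by exp(th)
    return bx, by, bw, bh

# Example: the red cell in Figure 2 has cx = 1, cy = 1 (anchor size is made up)
print(decode_box(0.2, -0.1, 0.0, 0.3, cx=1, cy=1, pw=1.5, ph=2.0))
```

Because of the sigmoid, bx and by can only fall between cx and cx+1 (resp. cy and cy+1), which is exactly the constraint described above.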

2.2 Class Prediction (multi-label classification)

Original text:
In YOLOv3, each predicted box outputs 85 values: 5 of them are the center coordinates, width, height, and objectness confidence, and 80 are conditional class probabilities. For each class, an independent binary classifier outputs a probability between 0 and 1, which means several classes can have high probability (label 1, prediction close to 1) at the same time. We do not use a softmax because we found it is unnecessary for good performance; instead we use independent logistic classifiers for each class. During training, each class is trained with a binary cross-entropy loss.
This approach helps in more complex domains such as Google's Open Images dataset, where one box can carry multiple labels at the same time (for example, both "woman" and "person"). A softmax assumes the class labels are mutually exclusive, whereas independent multi-label outputs model such data better.

Note: For each predicted box, every class gets its own logistic (sigmoid) binary probability, so multiple classes can all output high probabilities for the same box.
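A small numpy sketch of the difference: a softmax forces the class scores to compete, while independent logistic (sigmoid) outputs let several classes be close to 1 at the same time. The scores below are hypothetical, not from the paper:

```python
import numpy as np

logits = np.array([4.0, 3.8, -2.0])   # e.g. raw scores for "person", "woman", "dog"

# Softmax: probabilities are forced to sum to 1, so "person" and "woman" suppress each other
softmax = np.exp(logits) / np.exp(logits).sum()

# Independent sigmoids: each class gets its own binary probability,
# so both "person" and "woman" can be close to 1 for the same box
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax)   # ~[0.55, 0.45, 0.00]
print(sigmoid)   # ~[0.98, 0.98, 0.12]
```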

2.3 Multi-scale detection (Predictions Across Scales) and network topology

Original:
YOLOv3 predicts boxes at 3 different scales. Our system is inspired by feature pyramid networks (FPN): three branches extend from the backbone, merging features of different scales into convolutional layers at different depths, and each scale produces a 3-D feature map with 255 channels. In our YOLOv3 experiments we use the COCO dataset, which has 80 classes. At each scale, every grid cell generates 3 anchors, so the output at each scale is N×N×[3×(4+1+80)]; the raw outputs are then decoded (e.g. with the sigmoid function) into the 4 box coordinate offsets.

Note:
Three branches: a three-channel (RGB) image of any valid size goes in, and YOLOv3 outputs feature maps at three different scales.

255 channels: each grid cell has 3 anchors (predicted boxes), each anchor carries 80 + 5 = 85 values, and 3 × 85 = 255, so each output feature map has 255 channels.

N×N×[3×(4+1+80)]:
N×N: the number of grid cells at this scale
3: each grid cell generates 3 anchors (predicted boxes)
4: center coordinates, width, and height
1: objectness confidence
80: conditional class probabilities for the 80 classes

Comparison:

YOLOv1: the input is a 448×448 three-channel image divided into 7×7 = 49 grid cells; each grid cell generates 2 bounding boxes (no anchors), so 98 boxes in total; the output feature map has shape 7×7×(5×2+20).

YOLOv2: the input is a 416×416 three-channel image divided into 13×13 = 169 grid cells; each grid cell generates 5 anchors, so 845 boxes in total; the output feature map has shape 13×13×[5×(4+1+20)].
5 = 4+1: center coordinates, width, height, confidence
20: conditional class probabilities for the 20 classes
Starting from YOLOv2, inputs of different sizes are supported, and the output size is determined by the input size: a small image yields a small feature map and a large image yields a large one. If the input is exactly 416×416, the output is a 13×13 feature map.

YOLOv3: if the input is a 416×416 three-channel image, YOLOv3 produces grids at three scales: 13×13, 26×26, and 52×52 (these are the grid cell counts). Each grid cell produces a 3×(5+80) tensor.
In total there are 13×13 + 26×26 + 52×52 grid cells, i.e. (13×13 + 26×26 + 52×52) × 3 = 10647 predicted boxes.
5 = 4+1: center coordinates, width, height, confidence
80: conditional class probabilities for the 80 classes

If the input is a 256×256 three-channel image, YOLOv3 produces grids of 8×8, 16×16, and 32×32; each grid cell generates 3 anchors, for a total of (8×8 + 16×16 + 32×32) × 3 = 4032 predicted boxes.
The 13×13 grid predicts large objects, the 26×26 grid predicts medium objects, and the 52×52 grid predicts small objects.
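A quick sketch that reproduces these counts for any square input whose side is a multiple of 32; the strides 32, 16, and 8 are the ones behind the three YOLOv3 scales, while the function name is just for illustration:

```python
def yolov3_boxes(input_size, num_classes=80, anchors_per_cell=3):
    """Grid sizes, output channels, and total prediction boxes for a square input."""
    assert input_size % 32 == 0, "input size must be a multiple of 32"
    strides = (32, 16, 8)                       # large / medium / small objects
    grids = [input_size // s for s in strides]  # e.g. 416 -> 13, 26, 52
    channels = anchors_per_cell * (4 + 1 + num_classes)  # 3 * 85 = 255
    total = sum(g * g * anchors_per_cell for g in grids)
    return grids, channels, total

print(yolov3_boxes(416))  # ([13, 26, 52], 255, 10647)
print(yolov3_boxes(256))  # ([8, 16, 32], 255, 4032)
```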

[Figure: YOLOv3 network architecture diagram]

The figure quoted here is by the author Levio. As shown, a 416×416 three-channel image is fed in, features are extracted by the backbone (Backbone), and features of different scales are merged in the neck:
the 13×13 feature map is upsampled and concatenated with the 26×26 feature map to produce the 26×26 output;
the 26×26 map is then upsampled and concatenated with the 52×52 feature map to produce the 52×52 output.

The meaning of 255 in the output layer:
255=85×3=(80+5)×3
5=4+1: center point coordinates, width, height, confidence
80: conditional category probability of 80 categories
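As a rough illustration of this neck, here is a small PyTorch sketch of the upsample-and-concatenate step; the channel counts and tensor names are placeholders, not the exact YOLOv3 configuration (the real network inserts extra convolutions between these steps):

```python
import torch
import torch.nn as nn

# Hypothetical backbone feature maps for a 416x416 input
f13 = torch.randn(1, 512, 13, 13)   # deepest features
f26 = torch.randn(1, 256, 26, 26)
f52 = torch.randn(1, 128, 52, 52)

up = nn.Upsample(scale_factor=2, mode="nearest")

# The 13x13 path is upsampled and concatenated with the 26x26 features
p26 = torch.cat([up(f13), f26], dim=1)   # -> (1, 768, 26, 26)
# The merged 26x26 path is upsampled again and concatenated with the 52x52 features
p52 = torch.cat([up(p26), f52], dim=1)   # -> (1, 896, 52, 52)

print(p26.shape, p52.shape)
```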

2.4 Feature Extractor

The backbone network's job is to extract good features from the image so that the detection heads can do their work.
Original:
We use a new network for feature extraction. The new network mixes the Darknet-19 from YOLOv2 with residual (ResNet-style) connections. Our network uses consecutive 3×3 and 1×1 convolutions plus shortcut (cross-layer) connections. It has 53 layers with weights (52 convolutional layers plus 1 fully connected layer), so we call it Darknet-53.
[Figure: Darknet-53 architecture table]
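A hedged PyTorch sketch of the basic Darknet-53 building block (a 1×1 convolution followed by a 3×3 convolution with a shortcut connection); the channel convention and the BatchNorm + LeakyReLU details follow the usual Darknet style but are written from memory, not copied from the official config:

```python
import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """Conv + BatchNorm + LeakyReLU, the standard Darknet convolution unit."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return self.block(x)

class DarknetResidual(nn.Module):
    """1x1 conv halves the channels, 3x3 conv restores them, then a shortcut add."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = ConvBNLeaky(channels, channels // 2, 1)
        self.conv2 = ConvBNLeaky(channels // 2, channels, 3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))

x = torch.randn(1, 64, 104, 104)
print(DarknetResidual(64)(x).shape)   # torch.Size([1, 64, 104, 104])
```

The shortcut add is the "cross-layer connection" mentioned above: it lets gradients and shallow features flow past the two convolutions unchanged.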

This new network is much stronger than Darknet-19 and much more efficient than ResNet-101 and ResNet-152 (shown below).
[Table: ImageNet backbone comparison of Darknet-19, ResNet-101, ResNet-152, and Darknet-53]

Top-5: top-5 classification accuracy
Bn Ops: billions of operations (computation per forward pass)
BFLOP/s: billions of floating-point operations per second (achieved throughput)
FPS: frames per second, i.e. how many images are processed per second
(The backbone of YOLOv2 is Darknet-19, the backbone of YOLOv3 is Darknet-53, and the backbone of the newer YOLOv5 is CSPDarknet-53, i.e. Darknet-53 with the CSP module integrated.)

2.5 Training

Original text:
We still train end to end and do not use the hard negative mining used in R-CNN-style detectors. We use multi-scale training, plenty of data augmentation, batch normalization layers, all the standard stuff. We use the Darknet framework for training and testing.
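A minimal sketch of what multi-scale training means in practice: every few batches a new input resolution (a multiple of 32) is drawn at random and the batch is resized to it. The range 320–608 and the "every 10 batches" frequency come from the YOLOv2 description and are used here as assumptions, not as this paper's exact settings:

```python
import random

def multiscale_sizes(num_batches, min_size=320, max_size=608, every=10):
    """Yield a training resolution for each batch; pick a new multiple of 32 every `every` batches."""
    choices = list(range(min_size, max_size + 1, 32))   # 320, 352, ..., 608
    size = min_size
    for step in range(num_batches):
        if step % every == 0:
            size = random.choice(choices)
        yield size

for step, size in enumerate(multiscale_sizes(30)):
    if step % 10 == 0:
        print(step, size)
```

Because the network is fully convolutional, the same weights work at every one of these resolutions; only the output grid sizes change.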

3. How We Do

Part of the original text:
On COCO's odd mAP@[.5:.95] metric, YOLOv3 is on par with the SSD variants while being three times faster. However, YOLOv3 still lags behind other detectors such as RetinaNet on this metric.
At IOU = 0.5 (the AP50 column in the table), YOLOv3 is very strong: almost on par with RetinaNet and far above the SSD variants. This shows that YOLOv3 performs very well at the low IOU threshold, but as the IOU threshold rises its performance drops significantly, which indicates YOLOv3 is not satisfactory under high IOU thresholds.
In the past, YOLO struggled with small objects, but the situation has now reversed: with the new multi-scale predictions, YOLOv3 shows relatively high APS (small-object AP) performance.
[Table: COCO results (AP, AP50, AP75, APS, APM, APL) for YOLOv3 and other detectors]

Note:
From the table we can see that YOLOv3 does well on the AP50 column (red box) but not on the AP and AP75 columns (green box).

S: small object, box area < 32×32
M: medium object, 32×32 < area < 96×96
L: large object, area > 96×96
APS: AP on small objects (APM and APL are defined analogously for medium and large objects)
From the lower-right corner of the table we can see that, relative to the other detectors, YOLOv3's performance on small objects (18.3) is quite good, while its performance on medium and large objects (35.4 and 41.9 respectively) is comparatively weaker.
Earlier YOLO versions performed poorly on small and densely packed objects. Take YOLOv1 as an example: it divides each image into 7×7 grid cells and each grid cell can only predict one object, so a whole image can predict at most 49 objects; once the number of objects exceeds 49, YOLOv1 struggles, and if two objects are very close together, such dense targets cannot both be predicted.

Improvements of YOLOv3 for small / dense targets:
1. More grid cells: YOLOv1 (7×7), YOLOv2 (13×13), YOLOv3 (13×13 + 26×26 + 52×52).
2. YOLOv2 and YOLOv3 accept input images of any size (a multiple of 32); the larger the input, the more grid cells and the more predicted boxes are generated.
3. Small anchors with fixed aspect ratios are preset for small targets: directly regressing a small box from scratch is hard, but refining a prediction from an already small anchor is much easier.
4. Multi-scale prediction (borrowed from FPN), which combines the specialized semantic features of the deep layers with the fine-grained pixel-level structure information of the shallow layers.
5. For small objects the edge contour is very important, i.e. the edge information carried by the shallow layers; there is also a box penalty term in the loss function.
6. Network structure: the backbone adds cross-layer and residual (shortcut) connections, which fuse features from different layers and strengthen the backbone's own feature extraction ability.

4. Things We Tried That Didn't Work

Part of the original text:
We tried lots of things while developing YOLOv3. A lot of it didn't work. Here is what we remember.
Anchor box x, y offset prediction. We tried the normal anchor-box prediction mechanism, predicting the x, y offsets linearly as multiples of the box width or height. We found this reduced the stability of the model and did not work well. (The offsets are predicted relative to the anchor's width and height, so the predicted box is unconstrained.)
Linear x, y prediction instead of logistic. We tried using a linear activation to predict the x, y offsets directly instead of the logistic activation. This dropped mAP by a couple of points.
Focal loss. We tried using focal loss. It dropped our mAP by about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness and conditional class predictions. For most cases, we are not entirely sure of the reason.
Dual IOU thresholds and ground-truth assignment. Faster R-CNN uses two IOU thresholds during training: a box whose IOU with a ground-truth object is greater than 0.7 is a positive example, one below 0.3 is a negative example, and boxes in between are ignored. We tried a similar strategy in YOLOv3 and could not get good results.
We quite like our current formulation; it seems to be at least at a local optimum. Some of these techniques may eventually produce good results; perhaps they just need some tuning to stabilize training.

Note:
Focal loss addresses the positive/negative sample imbalance in single-stage detectors (SSD, YOLO, RetinaNet, etc.), where truly useful negative samples are rare; it amounts, to some degree, to hard negative mining.
In YOLOv3 the IOU threshold used to separate negatives is set rather high (0.5), so positive-like boxes end up mixed into the negatives; this label noise is exactly what focal loss up-weights, so the effect is not good.

The RetinaNet paper points out that single-stage detectors do not lack positive samples but lack high-quality negative samples. A dual IOU threshold only increases the number of positives, while negatives are still filtered simply by being below some IOU threshold.
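For reference, a small sketch of the dual-threshold assignment idea described above, with 0.7 / 0.3 taken from the Faster R-CNN description; the IOU helper and function names are generic illustrations, not code from the paper:

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_label(pred_box, gt_box, pos_thresh=0.7, neg_thresh=0.3):
    """Dual-threshold assignment: positive, negative, or ignored."""
    overlap = iou(pred_box, gt_box)
    if overlap > pos_thresh:
        return "positive"
    if overlap < neg_thresh:
        return "negative"
    return "ignored"

print(assign_label((0, 0, 10, 10), (0, 0, 10, 11)))   # high overlap -> positive
print(assign_label((0, 0, 10, 10), (8, 8, 20, 20)))   # low overlap  -> negative
```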

5. What This All Means

The fifth part is the author's outlook for the future

End of notes.


Origin: blog.csdn.net/thy0000/article/details/123610055