YOLOv3 paper reading (study notes 2)

A CV beginner's summary of YOLOv3 study notes

Paper: YOLOv3: An Incremental Improvement
YOLO project page: https://pjreddie.com/darknet/yolo/

2.2 Multi-label classification (Class Prediction)

Original text (paraphrased):
In YOLOv3, each prediction box outputs 85 values: 5 of them are the center-point coordinates, width, height, and confidence, and 80 are conditional class probabilities. For each class, an independent logistic classifier outputs a probability between 0 and 1, which means several classes can fire at once for the same box (several labels of 1, several predictions near 1). We do not use a softmax, because we found it is not needed for good performance; instead we use separate logistic regression for each class. During training, each class is trained with a binary cross-entropy loss.
This formulation helps in more complex domains such as Google's Open Images dataset, where one prediction box can carry several labels at the same time (for example, both "woman" and "person"). A softmax assumes the class labels are mutually exclusive; allowing multiple labels models such data better.

Note: for every prediction box, each class gets its own logistic-regression (binary) probability, so multiple classes can output a high probability for the same box.
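
A minimal sketch in PyTorch (not the paper's code) of what this looks like: each of the 80 classes gets its own sigmoid output trained with binary cross-entropy, so a multi-hot target is perfectly legal. The tensors and labels below are illustrative.

```python
import torch
import torch.nn as nn

num_classes = 80                               # COCO
logits = torch.randn(1, num_classes)           # raw class scores for one prediction box

# YOLOv3 style: one independent logistic classifier per class.
# The probabilities do NOT sum to 1, so several classes can be "on" at once.
probs = torch.sigmoid(logits)

# Multi-hot target: more than one label may be 1 for the same box
# (e.g. "person" and "woman" in Open Images).
target = torch.zeros(1, num_classes)
target[0, 0] = 1.0
target[0, 1] = 1.0

# Each class is trained independently with binary cross-entropy.
loss = nn.BCEWithLogitsLoss()(logits, target)

# A softmax would instead force the 80 probabilities to sum to 1,
# i.e. it assumes mutually exclusive labels.
softmax_probs = torch.softmax(logits, dim=1)
```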

2.3 Multi-scale target detection (Predictions Across Scales, network topology)

Original:
YOLOv3 makes predictions at 3 scales. The design is inspired by feature pyramid networks (FPN): three branches extend from the backbone, i.e., features taken from convolutional layers at different depths are combined, and each scale yields a three-dimensional feature map with 255 channels. Our YOLOv3 experiments use the COCO dataset, which has 80 classes. Each grid cell at each scale generates 3 anchors, so the output at each scale is N×N×[3×(4+1+80)]; this output is then decoded (with sigmoid and related functions) into 4 coordinate offsets plus objectness and class scores.
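
The 4 offsets are decoded with the formulas the paper gives for bounding-box prediction: bx = σ(tx)+cx, by = σ(ty)+cy, bw = pw·e^(tw), bh = ph·e^(th). A small Python sketch of this decoding, with illustrative argument names:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Decode raw offsets (tx, ty, tw, th) into a box, per the YOLOv2/v3 formulas.

    (cx, cy): top-left corner of the responsible grid cell, in grid units.
    (pw, ph): width/height of the anchor (prior), in grid units.
    stride:   input pixels covered by one grid cell (32, 16 or 8).
    """
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    bx = (sigmoid(tx) + cx) * stride   # box center x, in pixels
    by = (sigmoid(ty) + cy) * stride   # box center y, in pixels
    bw = pw * math.exp(tw) * stride    # box width, in pixels
    bh = ph * math.exp(th) * stride    # box height, in pixels
    return bx, by, bw, bh
```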

Note:
Three branches: a three-channel (RGB) image of essentially any scale (in practice a multiple of 32) is fed in, and YOLOv3 outputs feature maps at three different scales.

255 channels: each grid cell has three anchors (prediction boxes), each anchor carries 80+5=85 values, and 3×85=255, so each output feature map has 255 channels

N×N×[3×(4+1+80)] (see the shape check after this list):
N×N: the number of grid cells at this scale
3: each grid cell generates 3 anchors (prediction boxes)
4: center-point coordinates, width, height
1: confidence
80: conditional class probabilities for the 80 classes
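
A quick check of this arithmetic in Python, using the COCO numbers above:

```python
num_anchors, num_coords, num_conf, num_classes = 3, 4, 1, 80
channels = num_anchors * (num_coords + num_conf + num_classes)
print(channels)              # 255

for n in (13, 26, 52):       # the three scales for a 416x416 input
    print((n, n, channels))  # (13, 13, 255), (26, 26, 255), (52, 52, 255)
```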

Comparison with earlier versions:

YOLOv1: the input is a 448×448 three-channel image; each image is divided into 7×7=49 grid cells, and each grid cell generates 2 bounding boxes (no anchors), i.e., 98 boxes in total. The output feature map structure is 7×7×(5×2+20).

YOLOv2: the input is a 416×416 three-channel image; each image is divided into 13×13=169 grid cells, and each grid cell generates 5 anchors, i.e., 845 boxes in total. The output feature map structure is 13×13×[5×(4+1+20)].
5=4+1: center-point coordinates, width, height, confidence
20: conditional class probabilities for the 20 classes
Starting from YOLOv2, different input sizes are supported (as long as they are multiples of 32), and the output size is determined by the input size: a small image yields a small feature map, a large image a large one. If the input is exactly 416×416, the output is a 13×13 feature map.

YOLOv3: if the input is a 416×416 three-channel image, YOLOv3 produces three scales: 13×13, 26×26, and 52×52, which are also the grid-cell counts. Each grid cell outputs a 3×(4+1+80) tensor.
That is, 13×13+26×26+52×52 grid cells in total, and (13×13+26×26+52×52)×3=10647 prediction boxes.
5=4+1: center-point coordinates, width, height, confidence
80: conditional class probabilities for the 80 classes

If the input is a 256×256 three-channel image, YOLOv3 produces three scales: 8×8, 16×16, and 32×32; each grid cell generates 3 anchors, for a total of (8×8+16×16+32×32)×3=4032 prediction boxes (see the count check below).
The 13×13 scale predicts large objects, 26×26 medium objects, and 52×52 small objects.
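
A small helper that reproduces these counts, assuming (as above) strides of 32, 16, and 8 and 3 anchors per grid cell; the function name is just for illustration:

```python
def total_boxes(input_size, strides=(32, 16, 8), anchors_per_cell=3):
    """Total prediction boxes YOLOv3 emits for a square input.

    input_size must be a multiple of 32 (the largest stride)."""
    return sum((input_size // s) ** 2 * anchors_per_cell for s in strides)

print(total_boxes(416))  # (13*13 + 26*26 + 52*52) * 3 = 10647
print(total_boxes(256))  # ( 8*8  + 16*16 + 32*32) * 3 = 4032
```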

[Figure: YOLOv3 network topology — Backbone, Neck, and the three output scales]

The figure (by the author Levio) shows a 416×416 three-channel image fed in; features are extracted by the backbone (Backbone), and features of different scales are merged in the Neck.
The 13×13 feature map is upsampled to 26×26 and concatenated with the backbone's 26×26 features to produce the 26×26 output;
that 26×26 map is in turn upsampled to 52×52 and concatenated with the backbone's 52×52 features to produce the 52×52 output.
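
A minimal PyTorch sketch of this upsample-and-concatenate step; the channel counts here are illustrative, not the exact ones in YOLOv3's neck:

```python
import torch
import torch.nn.functional as F

# Hypothetical backbone features for a 416x416 input: (batch, channels, H, W).
feat_13 = torch.randn(1, 512, 13, 13)   # deepest scale
feat_26 = torch.randn(1, 256, 26, 26)   # middle scale, from an earlier layer

# Upsample 13x13 -> 26x26 (nearest neighbor), then concatenate along channels.
up = F.interpolate(feat_13, scale_factor=2, mode="nearest")  # (1, 512, 26, 26)
merged_26 = torch.cat([up, feat_26], dim=1)                  # (1, 768, 26, 26)
# merged_26 then goes through more convolutions to produce the 26x26x255 output.
```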

The meaning of 255 in the output layer:
255=85×3=(80+5)×3
5=4+1: center point coordinates, width, height, confidence
80: conditional category probability of 80 categories

2.4 Feature Extractor

The backbone network extracts useful features from the image so the detection heads can do their job.
Original:
We use a new network for feature extraction. It is a hybrid of the Darknet-19 used in YOLOv2 and the residual-network idea from ResNet: successive 3×3 and 1×1 convolutions plus cross-layer (shortcut) connections. It has 53 layers with weights (52 convolutional layers + 1 fully connected layer), hence the name Darknet-53.
[Figure: Darknet-53 architecture table]
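
A sketch of one Darknet-53 residual block in PyTorch, following the description above (a 1×1 convolution that halves the channels, a 3×3 convolution that restores them, and a shortcut connection); this is based on the paper's figure, not the official Darknet source:

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """1x1 conv (halve channels) -> 3x3 conv (restore channels) -> shortcut add."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels // 2)
        self.conv2 = nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.act(self.bn2(self.conv2(out)))
        return x + out                            # cross-layer connection

# Shape is preserved, so blocks can be stacked freely:
y = DarknetResidual(256)(torch.randn(1, 256, 52, 52))  # -> (1, 256, 52, 52)
```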
This new network is much more powerful than Darknet-19 and far more efficient than ResNet-101 and ResNet-152 (see the table below).
[Table: backbone comparison — Top-1/Top-5 accuracy, Bn Ops, BFLOP/s, FPS]
Top-5: top-5 classification accuracy
Bn Ops: billions of floating-point operations required for one forward pass (computation cost)
BFLOP/s: billions of floating-point operations executed per second (how well the hardware is utilized)
FPS: frames per second, i.e., how many images are processed per second
(The YOLOv2 backbone is Darknet-19; the YOLOv3 backbone is Darknet-53; the newer YOLOv5 backbone is CSPDarknet-53, a Darknet-53 with CSP modules integrated.)

2.5 Training

Original text:
We still train end to end, with no hard negative mining of the kind used in R-CNN-style detectors. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet framework for training and testing.
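
A sketch of how multi-scale training is typically implemented for this family of detectors; the 320-608 range and the 10-batch interval are the ones given in the YOLOv2 paper and are assumed here to carry over:

```python
import random

def pick_input_size():
    # Sizes must be multiples of 32 so the three grids stay integral.
    return random.choice(range(320, 608 + 1, 32))

size = pick_input_size()
for batch_idx in range(100):
    if batch_idx % 10 == 0:        # draw a new resolution every 10 batches
        size = pick_input_size()
    # ...resize this batch's images to (size, size) and run the forward pass...
```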
