- yolov1: Redmon, Joseph, et al. "You Only Look Once: Unified, Real-Time Object Detection." arXiv preprint arXiv:1506.02640 (2015).
- yolov2: Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
- yolov3: Redmon, Joseph, and Ali Farhadi. "YOLOv3: An Incremental Improvement." arXiv preprint arXiv:1804.02767 (2018).
1 FPN (Feature Pyramid Network) (not to be confused with FCN, the Fully Convolutional Network)
The network fuses features from different layers, which improves multi-scale detection.
(Deep features have low resolution and a large receptive field, which suits large targets but not small ones. Idea 1: feed the input image at multiple scales to obtain multi-scale feature maps. Idea 2: treat feature maps at different depths as a multi-scale pyramid after fusing their semantic features across scales.)
- 1 Bottom-up pathway: a backbone such as ResNet
- 2 Top-down pathway: nearest-neighbour upsampling (copying neighbouring elements)
- 3 Lateral connections: the upsampled deeper map and the shallow map (projected to 256 channels by a 1x1 convolution) are added element-wise to obtain P2, P3, P4
- 4 Fusion convolution: a 3x3 convolution extracts features and eliminates the aliasing effect of upsampling
https://blog.csdn.net/WZZ18191171661/article/details/79494534
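The top-down merge step described above can be sketched in a few lines (a minimal NumPy sketch; `upsample_nearest` and `fpn_merge` are illustrative names, and both inputs are assumed to already have 256 channels):

```python
import numpy as np

def upsample_nearest(x, scale=2):
    """Nearest-neighbour upsampling: each element is copied scale x scale times."""
    return x.repeat(scale, axis=-2).repeat(scale, axis=-1)

def fpn_merge(deeper, lateral):
    """One FPN merge step: upsample the deeper map, then add the shallow map
    (assumed already projected to 256 channels by a 1x1 convolution)."""
    return upsample_nearest(deeper) + lateral

# toy maps in (channels, H, W) layout
c4 = np.ones((256, 4, 4))   # deeper level: low resolution
c3 = np.ones((256, 8, 8))   # shallower level: higher resolution
p3 = fpn_merge(c4, c3)
print(p3.shape)  # (256, 8, 8)
```

In a full FPN this merged map would then pass through the 3x3 fusion convolution of step 4 to remove upsampling aliasing.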
2 SSD: Single Shot MultiBox Detector
- Draws on the ideas of Faster R-CNN and yolo
- Fixed prior boxes per region (improves speed) plus multi-feature fusion (improves accuracy)
Characteristics
- 1 Data augmentation (photometric, geometric)
- 2 VGGNet + 4 extra convolution modules, yielding feature maps of different sizes and receptive fields
- 3 PriorBox and multi-layer feature maps: fixed prior boxes are pre-placed on feature maps at 6 scales (small boxes on shallow maps, large boxes on deep maps)
- 4 Positive/negative samples and loss: compute the IoU between predicted boxes and GT to decide positive vs negative, then compute the classification and regression losses
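The IoU-based positive/negative assignment in step 4 can be sketched as follows (a minimal sketch; `match_priors` and the 0.5 threshold are illustrative, and boxes are in (x1, y1, x2, y2) format):

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_priors(priors, gt_boxes, pos_thresh=0.5):
    """Label each prior box positive (1) if its best IoU with any GT box
    reaches the threshold, else negative (0)."""
    labels = []
    for p in priors:
        best = max(iou(p, g) for g in gt_boxes)
        labels.append(1 if best >= pos_thresh else 0)
    return labels

priors = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt = [(1, 1, 9, 9)]
print(match_priors(priors, gt))  # [1, 0]
```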
3 yolov1 (no anchor prior) (45 fps)
- Produce a 7x7 feature map with 30 channels: 24 convolution layers + 2 fully connected layers ==> 7x7x30
- Assign predicted boxes to the feature map: the 49 grid cells correspond to receptive fields, and each cell predicts 2 boxes = 98 boxes
- Compute the IoU between predicted and ground-truth boxes; the box is regressed directly (because the target is already localized within the receptive field). A cell effectively predicts only one category, so during training the box with the larger IoU is selected to carry the cls classification; at test time the box with the higher confidence predicts cls
- Output per cell: 20 cls + 2 x (4 bbox + 1 conf) = 30 channels
- The 20-dim cls vector is the category of the target contained in the grid cell: the class prediction follows the box with the larger IoU of the two bboxes
- Confidence indicates the probability that the box contains an object (foreground vs background), similar to Faster R-CNN counting background as a category
- The confidence says whether each of the two predicted bboxes contains an object; the box with the larger IoU is judged positive
- Category and confidence are predicted separately (and there is no background class)
- Loss function
Positive/negative assignment: if a cell contains an object, the box with the larger IoU is the positive sample (target = 1) and the others are negative; negative samples contribute only confidence loss.
- Term 1: positive-sample centre loss (weight set to 5 to increase the share of the localization loss)
- Term 2: width/height loss (because target scale varies, the square root is taken first to reduce the loss's sensitivity to scale)
- Terms 3 and 4: confidence loss of positive and negative samples (the target c-hat is 1 for the box with the larger IoU, 0 otherwise)
- Term 5: classification loss
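The loss terms above can be sketched for a single cell (a minimal sketch; `yolo_v1_cell_loss` is an illustrative name, the classification term is omitted, but the paper's weights of 5 and 0.5 and the square-root width/height trick are kept):

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5   # weights from the yolov1 paper

def yolo_v1_cell_loss(pred, gt, responsible):
    """pred/gt: dicts with x, y, w, h, conf. `responsible` marks the predicted
    box with the larger IoU; non-responsible boxes get only confidence loss."""
    if responsible:
        xy = (pred['x'] - gt['x'])**2 + (pred['y'] - gt['y'])**2
        # square roots reduce the loss's sensitivity to object scale
        wh = (math.sqrt(pred['w']) - math.sqrt(gt['w']))**2 \
           + (math.sqrt(pred['h']) - math.sqrt(gt['h']))**2
        conf = (pred['conf'] - 1.0)**2           # c-hat = 1 for the positive box
        return LAMBDA_COORD * (xy + wh) + conf
    return LAMBDA_NOOBJ * pred['conf']**2        # negative: confidence-only loss

pred = {'x': 0.5, 'y': 0.5, 'w': 0.25, 'h': 0.25, 'conf': 0.8}
gt   = {'x': 0.5, 'y': 0.5, 'w': 0.25, 'h': 0.25, 'conf': 1.0}
print(yolo_v1_cell_loss(pred, gt, responsible=True))  # only the (0.8 - 1)^2 term remains
```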
Disadvantages
- 1 Each cell predicts 2 boxes but only one category, so the detection rate is limited and small objects are hard to detect
- 2 There is no anchor prior, and the large downsampling factor (mapping from the original image to the feature map) leads to low localization accuracy
- 3 Large and small objects carry the same loss weight, but the loss values of small objects are inherently small, so backpropagation affects them little, leading to inaccurate localization of small objects
4 yolov2 (with anchor)
Improvements
- 1 Add anchors and predict offset values (centre and wh scale), which reduces prediction difficulty and improves localization accuracy
- The prior boxes are obtained by running k-means on the ground-truth boxes with IoU distance (d = 1 - IoU) in advance, keeping the 5 cluster centres
- In the offset formulas, the dashed box is the prior and the solid box the prediction; sigmoid(tx) and sigmoid(ty) predict the centre (the cell offsets cx and cy are known)
- Note: the sigmoid constrains the centre offset to (0, 1), which aids convergence
- tw and th predict the width/height offsets (pw and ph are the prior's width and height from clustering; the scale factors are obtained by exponentiating, e^tw and e^th)
- The confidence from yolov1 is passed through a sigmoid here
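The offset decoding above can be sketched directly (a minimal sketch; `decode_yolov2` is an illustrative name, with coordinates in grid-cell units):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_yolov2(tx, ty, tw, th, cx, cy, pw, ph):
    """yolov2 offset decoding: sigmoid keeps the centre inside its cell,
    the exponential scales the prior's width/height."""
    bx = sigmoid(tx) + cx      # centre x, in grid units
    by = sigmoid(ty) + cy      # centre y
    bw = pw * math.exp(tw)     # width from prior width pw
    bh = ph * math.exp(th)     # height from prior height ph
    return bx, by, bw, bh

# zero offsets: the centre lands at the middle of its cell, the size equals the prior
print(decode_yolov2(0, 0, 0, 0, cx=3, cy=2, pw=1.5, ph=2.0))  # (3.5, 2.5, 1.5, 2.0)
```

With zero offsets the box reproduces its prior, which is why this parameterization gives regression an easy starting point.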
- 2 Darknet-19 (3x3 and 1x1 convolutions): 19 convolution layers and 5 pooling layers; a passthrough layer fuses deep and shallow feature maps, improving the small-object detection rate
- 3 Each cell predicts 5 boxes; output vector per cell = 5 x [(4 + 1) + 20] = 125
- 4 Note: on top of the confidence formula above, each box now predicts its own set of class probabilities
- 5 Loss function
- Positive samples: IoU greater than 0.6
- Term 1: confidence loss of negative samples
- Term 2: loss between predicted boxes and prior boxes (only during the first 12800 iterations, to pull predictions toward the clustered priors early on)
- Term 3: localization loss between predicted and ground-truth boxes
- Term 4: confidence loss of positive samples
- Term 5: classification loss
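The k-means prior-box clustering with d = 1 - IoU can be sketched as follows (a minimal sketch with deterministic initialization from the first k boxes; `iou_wh` compares boxes by width/height only, centres aligned):

```python
def iou_wh(a, b):
    """IoU of two (w, h) boxes with their centres aligned, as used for clustering."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=20):
    """k-means over (w, h) pairs using the distance d = 1 - IoU."""
    centers = list(boxes[:k])                     # deterministic init for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:                           # assign each box to the nearest centre
            i = min(range(k), key=lambda j: 1 - iou_wh(b, centers[j]))
            clusters[i].append(b)
        centers = [                               # recompute centres as cluster means
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

boxes = [(10, 10), (12, 11), (100, 90), (95, 100)]
print(sorted(kmeans_anchors(boxes, k=2)))  # [(11.0, 10.5), (97.5, 95.0)]
```

yolov2 runs this with k = 5 over all ground-truth boxes of the training set.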
5 yolov3 (anchor + multi-scale feature fusion)
Improvements
- DBL block: Darknet convolution + BN + Leaky ReLU, the basic unit of the Darknet-53 backbone
- Upsampling: nearest-neighbour (element copying); no pooling (all downsampling is replaced by stride-2 convolutions)
- concat: deep and shallow feature maps are concatenated along the channel dimension (unlike FPN's element-wise addition)
- Three output feature-map scales give three predictions (detecting large, medium, and small targets respectively)
- yolov3 uses COCO data with 80 categories; with prior boxes, each scale outputs (80 class dims + 4 position dims + 1 confidence) x 3 anchors = 255 channels per feature-map location
- Logistic regression replaces softmax (one box can predict multiple categories, e.g. person and woman), trained with a binary cross-entropy loss
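The per-class sigmoid idea can be sketched as follows (a minimal sketch with made-up logits; independent sigmoids let several classes fire at once, which a softmax over classes would forbid):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical per-class logits for one predicted box. With independent
# sigmoids, overlapping labels like "person" and "woman" can both exceed
# the detection threshold, unlike mutually exclusive softmax classes.
logits = {'person': 2.0, 'woman': 1.5, 'car': -3.0}
probs = {c: sigmoid(z) for c, z in logits.items()}
detected = [c for c, p in probs.items() if p > 0.5]
print(detected)  # ['person', 'woman']
```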
Experiment (yolov3)
Trained on the COCO train set (about 170,000 images); testing the epoch-17 weights on 500 val images reaches about 11% mAP, verified against the author's test results.