Study notes: YOLO (with FPN and SSD as preliminaries)

  1. YOLOv1: Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You Only Look Once: Unified, Real-Time Object Detection." arXiv preprint arXiv:1506.02640 (2015).
  2. YOLOv2: Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  3. YOLOv3: Redmon, Joseph, and Ali Farhadi. "YOLOv3: An Incremental Improvement." arXiv preprint arXiv:1804.02767 (2018).

1 FPN (Feature Pyramid Network) (not to be confused with FCN, the Fully Convolutional Network)

The network fuses features from different layers, which improves multi-scale detection.
(Deep features have low resolution and large receptive fields, which suits large objects but hurts small-object detection. Idea 1: feed the image at several input scales to obtain multi-scale feature maps. Idea 2: treat feature maps at different depths as the multi-scale pyramid and fuse their semantic features across scales; FPN follows idea 2.)

  • 1 Bottom-up pathway: the backbone (e.g. ResNet)
  • 2 Top-down pathway: nearest-neighbor upsampling (copying neighboring elements)
  • 3 Lateral connection: the upsampled deeper map and the shallower map (passed through a 1x1 conv to 256 channels) are added element-wise to give P2, P3, P4
  • 4 Convolution fusion: a 3x3 convolution extracts features and removes the aliasing effect of upsampling (see the sketch below)
    https://blog.csdn.net/WZZ18191171661/article/details/79494534
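
A minimal PyTorch sketch of steps 1–4 (the channel counts, the use of ResNet stages C2–C5, and the class/variable names are illustrative assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN top-down path: 1x1 lateral convs, nearest-neighbor
    upsampling, element-wise addition, then 3x3 smoothing convs."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):                     # feats = [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):  # top-down: deep -> shallow
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # [P2, P3, P4, P5]

# Usage with dummy ResNet-like features (strides 4, 8, 16, 32 on a 224x224 input)
feats = [torch.randn(1, c, s, s) for c, s in
         [(256, 56), (512, 28), (1024, 14), (2048, 7)]]
pyramid = SimpleFPN()(feats)
print([p.shape for p in pyramid])
```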

2 SSD: Single Shot MultiBox Detector

  • Draws on ideas from Faster R-CNN and YOLO
  • Fixed prior boxes over regions (for speed) plus multi-scale feature fusion (for accuracy)
    Characteristics
  • 1 Data augmentation (photometric, geometric)
  • 2 VGGNet backbone + 4 extra convolution blocks, giving feature maps of different sizes and receptive fields
  • 3 PriorBox and multi-layer feature maps: prior boxes are laid out on 6 feature-map scales (small boxes on shallow maps, large boxes on deep maps)
  • 4 Positive/negative samples and losses: boxes are matched to ground truth by IoU to decide positive vs. negative, and then the classification and regression losses are computed (see the matching sketch after this list)
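
A minimal sketch of the IoU-based positive/negative matching from item 4 (the 0.5 threshold, the (x1, y1, x2, y2) box format, and the function names are illustrative assumptions):

```python
import numpy as np

def iou(boxes_a, boxes_b):
    """Pairwise IoU between two box sets in (x1, y1, x2, y2) format.
    Returns a (len(boxes_a), len(boxes_b)) matrix."""
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])  # intersection top-left
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # intersection bottom-right
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(priors, gt_boxes, pos_thresh=0.5):
    """Label each prior box with the index of its matched GT, or -1 (negative)."""
    overlaps = iou(priors, gt_boxes)          # (num_priors, num_gt)
    best_gt = overlaps.argmax(axis=1)
    best_iou = overlaps.max(axis=1)
    return np.where(best_iou >= pos_thresh, best_gt, -1)

priors = np.array([[0, 0, 50, 50], [40, 40, 100, 100]], dtype=np.float32)
gt = np.array([[45, 45, 95, 95]], dtype=np.float32)
print(match(priors, gt))   # -> [-1  0]
```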

3 YOLOv1 (no anchor boxes) (45 fps)

  1. Produce a 7x7 feature map with 30 channels: 24 convolutional layers + 2 fully connected layers => 7x7x30
  2. Assign prediction boxes to the feature map: the 49 grid cells correspond to receptive fields, and each cell predicts 2 boxes, giving 98 boxes in total
  3. Compute the IoU between predicted boxes and ground-truth boxes; the box is regressed directly, without anchors (the location is already constrained to the cell's receptive field). Each cell effectively predicts only one class, so during training the box with the larger IoU is made responsible for the classification; at test time the box with the higher confidence is used
  4. Output per cell: 20 classes + 2 x (4 bbox coordinates + 1 confidence) = 30 channels
    • The 20 class scores give the category of the object contained in the grid cell: a single class prediction, trained with whichever of the two boxes has the larger IoU with the ground truth
    • Confidence is the probability that the box contains an object (foreground vs. background); it plays the role that the background class plays in Faster R-CNN, which instead counts background among its categories
    • Each of the two predicted boxes has its own confidence; the box with the larger IoU against the ground truth is treated as the positive one
  5. Category and confidence are predicted separately (there is no background class)
  6. Loss function
    Positive/negative sample assignment: if an object falls in a cell, the box with the larger IoU is the positive sample (target = 1) and the others are negatives; negative samples contribute only to the confidence loss.
  • Term 1: center-coordinate loss for positive samples (weighted by 5 to increase the share of the localization loss)
  • Term 2: width and height loss (square roots are taken first so that, since object scales vary, the loss is less sensitive to scale)
  • Terms 3 and 4: confidence losses for positive and negative samples (the target C-hat is 1 for the box with the largest IoU, 0 for the others)
  • Term 5: classification loss
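
Written out (a reconstruction of the loss figure from the YOLOv1 paper, with $\mathbf{1}_{ij}^{obj}$ indicating that box $j$ of cell $i$ is responsible for an object):

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbf{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2}\mathbf{1}_{i}^{obj}\sum_{c\,\in\,\text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

with $S=7$, $B=2$, $\lambda_{coord}=5$, $\lambda_{noobj}=0.5$.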

Disadvantages

  • 1 Each cell predicts 2 boxes but only one class, so the number of detectable objects is limited and small objects are hard to detect
  • 2 There is no anchor prior, and the large downsampling factor (mapping from the original image to the feature map) leads to low localization accuracy
  • 3 Large and small objects get the same weight in the loss, but the loss of a small object is small to begin with, so its signal during backpropagation is weak and small-object localization ends up inaccurate

4 YOLOv2 (with anchors)

Improvements

  • 1 Add anchors and predict offsets (center and width/height), which lowers the difficulty of the prediction and improves localization accuracy
    • The prior boxes are obtained by k-means clustering of the ground-truth boxes with an IoU distance (d = 1 - IoU), keeping 5 clusters (see the clustering sketch after this list)
    • In the offset formulas the dashed box is the prior and the solid box is the prediction; sigmoid(tx) and sigmoid(ty) give the center prediction (the cell offsets cx and cy are known)
      • Note: squashing the center offsets into (0, 1) with a sigmoid helps convergence
    • tw and th are the width/height offsets (pw and ph are the prior width and height from clustering; the prediction scales them by e^tw and e^th)
    • The YOLOv1-style confidence is likewise passed through a sigmoid (see the decoding sketch after this list)
  • 2 Darknet-19 backbone (3x3 and 1x1 convolutions): 19 convolutional layers and 5 pooling layers; a passthrough layer fuses deep and shallow feature maps, improving small-object detection
  • 3 Each cell predicts 5 boxes, so each cell's output vector is 5 x [(4 + 1) + 20] = 125
  • 4 Note: confidence uses the prediction formula above; the class probabilities are predicted separately for each box
  • 5 Loss function
    • Positive samples are those with IoU greater than 0.6
    • Term 1: confidence loss for negative samples
    • Term 2: loss between the predicted boxes and the prior boxes (applied only during the first 12800 iterations, pulling predictions back toward the clustered priors)
    • Term 3: localization loss between the prediction and the ground truth
    • Term 4: confidence loss for positive samples
    • Term 5: classification loss
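
A minimal sketch of the offset decoding from item 1 (the grid size, prior values, and tensor layout are illustrative assumptions): the network predicts $t_x, t_y, t_w, t_h, t_o$ per anchor and recovers $b_x=\sigma(t_x)+c_x$, $b_y=\sigma(t_y)+c_y$, $b_w=p_w e^{t_w}$, $b_h=p_h e^{t_h}$, confidence $=\sigma(t_o)$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolov2(t, priors, grid_size):
    """Decode raw offsets into boxes on the feature-map grid.

    t      : (grid, grid, num_anchors, 5) raw (tx, ty, tw, th, to)
    priors : (num_anchors, 2) clustered prior widths/heights, in cell units
    Returns boxes (cx, cy, w, h) in cell units plus the confidence.
    """
    gy, gx = np.mgrid[0:grid_size, 0:grid_size]    # cell indices cy, cx
    bx = sigmoid(t[..., 0]) + gx[..., None]        # center stays inside its cell
    by = sigmoid(t[..., 1]) + gy[..., None]
    bw = priors[:, 0] * np.exp(t[..., 2])          # scale the prior width
    bh = priors[:, 1] * np.exp(t[..., 3])
    conf = sigmoid(t[..., 4])                      # objectness confidence
    return np.stack([bx, by, bw, bh], axis=-1), conf

# 13x13 grid, 5 priors (the width/height values here are made up for the demo)
priors = np.array([[1.3, 1.7], [3.2, 4.0], [5.1, 8.1], [9.5, 4.8], [11.2, 10.0]])
t = np.random.randn(13, 13, 5, 5)
boxes, conf = decode_yolov2(t, priors, grid_size=13)
print(boxes.shape, conf.shape)   # (13, 13, 5, 4) (13, 13, 5)
```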

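And a sketch of the prior-box clustering from item 1: k-means over ground-truth box sizes with distance $d = 1 - \text{IoU}$ (comparing boxes by width/height only, centered at the origin, is an assumption about the setup):

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between boxes described only by (w, h), all centered at the origin."""
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_priors(wh, k=5, iters=100, seed=0):
    """k-means with d = 1 - IoU, as used to pick the 5 YOLOv2 prior boxes."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - wh_iou(wh, centroids), axis=1)  # nearest centroid
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Demo with random ground-truth widths/heights (in grid-cell units)
wh = np.abs(np.random.randn(1000, 2)) * 4 + 1
print(kmeans_priors(wh, k=5))
```
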
5 YOLOv3 (anchors + multi-scale feature fusion)

Improvements

  1. DBL block (convolution + BN + Leaky ReLU), the basic unit of the Darknet-53 backbone
  2. Upsampling by element copying (nearest neighbor); there is no pooling, every downsampling step is a stride-2 convolution
  3. concat: deep and shallow feature maps are concatenated along the channel dimension (unlike FPN's element-wise addition)
  4. Three output feature-map scales give three sets of predictions (detecting large, medium, and small targets respectively)
  5. YOLOv3 uses COCO data with 80 classes; with 3 priors per scale, the output is (80 class scores + 4 box coordinates + 1 confidence) x 3 = 255 channels, so each feature map predicts 255 channels per location
  6. Logistic regression (independent sigmoids) replaces the softmax, so one box can predict multiple categories (e.g. person and woman), trained with a binary cross-entropy loss (see the sketch below)
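
A minimal PyTorch sketch of items 5 and 6 (tensor shapes and the 13x13 scale are illustrative assumptions): the detection head outputs 3 x (4 + 1 + 80) = 255 channels, and classification uses independent per-class sigmoids with binary cross-entropy rather than a softmax:

```python
import torch
import torch.nn as nn

num_anchors, num_classes = 3, 80
out_channels = num_anchors * (4 + 1 + num_classes)      # 3 * 85 = 255

head = nn.Conv2d(256, out_channels, kernel_size=1)      # 1x1 detection head
feat = torch.randn(1, 256, 13, 13)                      # one of the three scales
pred = head(feat).view(1, num_anchors, 4 + 1 + num_classes, 13, 13)

cls_logits = pred[:, :, 5:, :, :]                       # 80 class logits per anchor
targets = torch.zeros_like(cls_logits)
targets[0, 0, 0, 6, 6] = 1.0                            # e.g. "person" at one cell
targets[0, 0, 14, 6, 6] = 1.0                           # a second label for the same box

# Independent sigmoid per class + binary cross-entropy (multi-label),
# instead of a softmax that would force a single class per box.
cls_loss = nn.functional.binary_cross_entropy_with_logits(cls_logits, targets)
print(pred.shape, cls_loss.item())
```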

YOLOv3 experiment

Trained on the COCO data set (about 170,000 training images); testing the epoch-17 weights on val (500) reaches roughly 11% mAP, verified against the author's test.

Source: https://blog.csdn.net/weixin_44523062/article/details/105196640