Object detection overview, part two: SSD

SSD (Single Shot MultiBox Detector)

SSD paper: https://arxiv.org/abs/1512.02325
SSD uses VGG16 as the backbone to extract feature maps and detects objects starting from the Conv4_3 layer. It is a classic one-stage network. Network structure:

[Figure: SSD network structure]

Algorithm steps

  • Input an image and pass it through a convolutional neural network (CNN) to extract features and generate feature maps
  • Take the feature maps of six layers (multi-scale) and generate default boxes at every point of each feature map
  • Collect the predictions for all the default boxes, pass them through NMS (non-maximum suppression), and output the boxes that survive the filtering
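
The filtering in the last step can be sketched as a greedy NMS in plain Python. This is a minimal illustration, not the paper's implementation: the function names `iou` and `nms`, the corner box format (xmin, ymin, xmax, ymax), and the default threshold of 0.45 are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.82, so it is suppressed and only the first and third boxes remain. In practice NMS is usually run per class after low-confidence predictions have been discarded.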

SSD combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN (called default boxes in the paper) and regresses over multi-scale regions at every position of the whole image. This keeps the speed of YOLO while making the box predictions about as accurate as those of Faster R-CNN. The core of SSD is to apply convolution kernels to feature maps of different scales to predict the categories and coordinate offsets of a set of default bounding boxes. The design of SSD comes down to the following three points:

1. Multi-scale

The SSD algorithm uses feature maps of different sizes taken from conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2. The purpose is to detect objects of different scales accurately: the receptive field is relatively small in the low-level feature maps and relatively large in the high-level ones, so predicting from several feature maps covers multiple scales.
[Figure: multi-scale feature maps]
Smaller objects are matched on the larger, shallower feature maps (a), and larger objects on the deeper feature maps (b).

2. Use the convolutional layer instead of the fully connected layer for prediction

SSD does not use fully connected layers for prediction. It computes location and class scores with small convolutional filters: after extracting the feature maps, SSD applies 3×3 convolutional filters at each location. (These filters compute just like regular CNN filters.) For each default box the prediction has 25 channels: 21 class scores (20 object classes plus background, as on PASCAL VOC) plus 4 bounding-box offsets.
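
The 25 channels are counted per default box, so a layer with several default boxes per location needs proportionally more output channels. A small sketch of that arithmetic follows; the helper name `head_channels` is made up for this example.

```python
def head_channels(num_classes, boxes_per_location):
    """Output channels of the classification and localization convolutions."""
    cls_channels = boxes_per_location * num_classes  # one score per class per box
    loc_channels = boxes_per_location * 4            # four offsets per box
    return cls_channels, loc_channels


# A layer with 4 default boxes per location and 21 classes (background included).
cls_ch, loc_ch = head_channels(num_classes=21, boxes_per_location=4)
total_ch = cls_ch + loc_ch  # 4 boxes x 25 channels each
```

With 4 boxes per location the two convolutions output 84 and 16 channels, 100 in total; with 6 boxes they output 126 and 24.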

3. Set the prior box

The default boxes (prior boxes) are similar to the anchor boxes that the RPN generates with a sliding window. In SSD, several boxes are likewise generated for each cell of the feature map.

name       out_size   prior_box_num   total_num
conv4_3    38x38      4               5776
conv7      19x19      6               2166
conv8_2    10x10      6               600
conv9_2    5x5        6               150
conv10_2   3x3        4               36
conv11_2   1x1        4               4
total                                 8732
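
The totals in the table can be checked with a few lines of Python. Each entry below is a (feature map side length, default boxes per cell) pair copied from the table; the variable names are chosen for this example.

```python
# (feature map side length, default boxes per cell), in table order
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Boxes per layer = cells in the feature map * boxes per cell
per_layer = [size * size * boxes for size, boxes in feature_maps]
total = sum(per_layer)
```

Here `per_layer` evaluates to 5776, 2166, 600, 150, 36, and 4, and `total` to 8732, which matches the table.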

The prior boxes play the same role as the anchors in Faster R-CNN: a set of boxes is preset, and the network predicts the category and position of the detected objects by classification and regression relative to these boxes. Each box is classified and then regressed to a more accurate position and size.
In the paper, 4 default boxes are used for conv4_3, conv10_2, and conv11_2, and 6 default boxes for the other three layers. The number of default boxes and their sizes are calculated according to the following table:
[Figure: table of default box counts and sizes]
Image source: SSD algorithm theory
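
As a rough sketch of how the boxes for one feature map could be generated, the code below follows the formulas in the SSD paper: box centers at the cell centers, width s·√ar and height s/√ar for each aspect ratio ar, plus one extra square box at the geometric mean of the current and next scale. The function name and the scale values 0.62 and 0.76 (from the paper's linear scale rule with s_min = 0.2 and s_max = 0.9) are assumptions for illustration; released implementations use somewhat different values.

```python
import math


def default_boxes(fmap_size, scale, next_scale, aspect_ratios):
    """Return (cx, cy, w, h) boxes, normalized to [0, 1], for a square feature map."""
    boxes = []
    for i in range(fmap_size):      # row index
        for j in range(fmap_size):  # column index
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            # Two square boxes: the layer's own scale and the in-between scale.
            boxes.append((cx, cy, scale, scale))
            extra = math.sqrt(scale * next_scale)
            boxes.append((cx, cy, extra, extra))
            # For each aspect ratio, one wide box and one tall box.
            for ar in aspect_ratios:
                r = math.sqrt(ar)
                boxes.append((cx, cy, scale * r, scale / r))
                boxes.append((cx, cy, scale / r, scale * r))
    return boxes


# 5x5 feature map with aspect ratios 2 and 3 gives 6 boxes per cell.
boxes = default_boxes(fmap_size=5, scale=0.62, next_scale=0.76, aspect_ratios=[2, 3])
```

Passing only the aspect ratio 2 yields 4 boxes per cell, which corresponds to the layers that use 4 default boxes.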

Training and prediction

Input -> output -> loss calculation between the predictions and the ground-truth labels -> backpropagation and weight update.
First, the prior boxes are matched against the ground-truth boxes to label positive and negative samples. Training does not use all 8732 default boxes each time: the boxes are first screened by confidence, and only the selected positive and negative samples are trained on, according to the following rules:
Positive samples:
For each GT (ground truth) box, the default box with the largest IoU is a positive sample. Any default box whose IoU with some ground-truth box is greater than 0.5 is also a positive sample.
[Figure taken from the original paper]
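
The two matching rules can be sketched as follows. This is a simplified illustration under stated assumptions: boxes are in corner format (xmin, ymin, xmax, ymax), there is at least one ground-truth box, and the names `iou` and `match` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the index of its matched GT box, or -1."""
    matches = [-1] * len(default_boxes)
    # Rule 2: a default box is positive if its best IoU exceeds the threshold.
    for d, box in enumerate(default_boxes):
        best_iou, best_gt = max((iou(box, g), gi) for gi, g in enumerate(gt_boxes))
        if best_iou > threshold:
            matches[d] = best_gt
    # Rule 1: every GT box claims its best default box (applied last so it wins).
    for gi, g in enumerate(gt_boxes):
        best_d = max(range(len(default_boxes)), key=lambda d: iou(default_boxes[d], g))
        matches[best_d] = gi
    return matches


defaults = [(0, 0, 10, 10), (0, 0, 12, 12), (20, 20, 30, 30), (50, 50, 60, 60)]
gts = [(0, 0, 10, 10), (22, 22, 34, 34)]
matches = match(defaults, gts)
```

In this example the third default box has an IoU of only about 0.36 with the second GT box, yet it is still positive because it is that GT box's best match. The fourth default box overlaps nothing and stays unmatched.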

Negative samples:
The default boxes that were not matched are sorted by confidence loss from largest to smallest, and the top ones are selected as negative samples. During training, the number of default boxes is controlled so that positive : negative = 1 : 3.
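
A minimal sketch of this selection (hard negative mining), assuming the per-box confidence losses have already been computed. The function name and the sample loss values are made up for illustration.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return the indices of the negatives kept for training, in ascending order."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: largest confidence loss at the front.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return sorted(negatives[: neg_pos_ratio * num_pos])


conf_losses = [0.2, 2.5, 0.1, 1.7, 0.05, 0.9, 3.0, 0.4]
is_positive = [True, False, False, False, False, False, False, False]
selected = hard_negative_mining(conf_losses, is_positive)
```

With one positive box, the three negatives with the largest losses (indices 1, 3, and 6) are kept and the remaining negatives are ignored for this training step.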

Loss calculation
$$L(x, c, l, g)=\frac{1}{N}\left(L_{conf}(x, c)+\alpha L_{loc}(x, l, g)\right)$$

$$L_{conf}(x, c)=-\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right)-\sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right) \quad \text{where} \quad \hat{c}_{i}^{p}=\frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)}$$

  • $N$ is the number of prior boxes matched to GT (ground truth) boxes
  • $\hat{c}_{i}^{p}$ is the predicted probability that the $i$-th default box belongs to category $p$, the category of its matched GT box
  • $x_{ij}^{p} \in \{0,1\}$ indicates whether the $i$-th default box is matched to the $j$-th GT box, whose category is $p$
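
The confidence loss and the overall combination can be sketched in plain Python. The function names are hypothetical, class index 0 is assumed to be the background, and the localization loss is passed in as a ready-made number because this article does not define it. The default α = 1 and the zero loss when N = 0 follow the paper.

```python
import math


def softmax(logits):
    """Softmax over one box's class scores, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_loss(logits, labels, negatives):
    """logits[i]: class scores of box i; labels[i]: matched class p > 0 for
    positives and 0 otherwise; negatives: indices of the selected negatives."""
    loss = 0.0
    for i, label in enumerate(labels):
        if label > 0:  # positive box: penalize low probability of its class
            loss -= math.log(softmax(logits[i])[label])
    for i in negatives:  # negative box: penalize low background probability
        loss -= math.log(softmax(logits[i])[0])
    return loss


def total_loss(conf, loc, num_matched, alpha=1.0):
    """L = (L_conf + alpha * L_loc) / N, taken as 0 when nothing is matched."""
    if num_matched == 0:
        return 0.0
    return (conf + alpha * loc) / num_matched


# One positive box (class 1) and one selected negative, both with uniform scores.
logits = [[0.0, 0.0], [0.0, 0.0]]
conf = confidence_loss(logits, labels=[1, 0], negatives=[1])
loss = total_loss(conf, loc=0.5, num_matched=1)
```

With uniform scores over two classes each box contributes log 2, so `conf` equals 2·log 2 in this toy case.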

References:
SSD: Single Shot MultiBox Detector
SSD object detection: Single Shot MultiBox Detector for real-time processing
Object detection | SSD principle and implementation
Deep learning: detailed explanation of the SSD algorithm process

Origin blog.csdn.net/Peyzhang/article/details/126304415