SSD (Single Shot MultiBox Detector)
SSD paper address: https://arxiv.org/abs/1512.02325
SSD uses VGG16 as the backbone to extract feature maps and starts detecting objects from the Conv4_3 layer. It is a classic one-stage network. Network structure:
Algorithm steps
- Input an image and pass it through a convolutional neural network (CNN) to extract features and generate feature maps
- Take the feature maps of six layers (multi-scale) and generate default boxes at every location of each feature map
- Collect all the predicted boxes, pass them through NMS (non-maximum suppression), and output the filtered boxes as the detections
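The NMS step above can be sketched in plain Python. This is a minimal illustration, not the implementation used by SSD: the function names `iou` and `nms`, the `(xmin, ymin, xmax, ymax)` box layout, and the default threshold of 0.45 are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return the indices of the boxes to keep, best score first.

    iou_threshold is an assumed value chosen for this sketch.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first heavily and has a lower score, so it is suppressed, while the third box is far away and survives. In practice NMS is usually run separately for each class.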
SSD combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN (called default boxes in the paper). It regresses over multi-scale regions at every position of the whole image, which keeps the speed of YOLO while making the box predictions about as accurate as those of Faster R-CNN. The core of SSD is to apply convolution kernels to feature maps of different scales to predict the class scores and coordinate offsets of a set of default bounding boxes. The core design of SSD has the following three points:
1. Multi-scale
The SSD algorithm uses feature maps of different sizes taken from conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2. The purpose is to detect objects of different scales accurately: the receptive field is relatively small in the low-level feature maps and relatively large in the high-level ones, so predicting on several feature maps achieves multi-scale detection.
Smaller objects are matched on the larger, shallower feature maps (a), and larger objects are matched on the smaller, deeper feature maps (b).
2. Use the convolutional layer instead of the fully connected layer for prediction
SSD does not use fully connected layers for prediction. It computes location and class scores using small convolutional filters. After extracting the feature maps, SSD applies 3×3 convolutional filters at each cell to make predictions. (These filters compute just like regular CNN filters.) For each default box, the filters output 25 channels: 21 class scores (one per class) plus the 4 values of a bounding box.
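As a small worked example of the channel count above: if a feature-map cell carries several default boxes, the prediction layer needs 25 channels for each of them. The helper below is a hypothetical name introduced only for this sketch, and the defaults (21 classes, 4 box values) are taken from the figures quoted in the text.

```python
def predictor_channels(boxes_per_location, num_classes=21, box_values=4):
    """Output channels of the 3x3 prediction conv for one feature map.

    Each default box needs num_classes scores plus box_values offsets.
    """
    return boxes_per_location * (num_classes + box_values)


channels_one_box = predictor_channels(1)     # a single default box per cell
channels_four_boxes = predictor_channels(4)  # a layer with 4 boxes per cell
channels_six_boxes = predictor_channels(6)   # a layer with 6 boxes per cell
```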
3. Set the prior box
The default boxes (prior boxes) are similar to the candidate boxes that RPN generates with a sliding window. SSD likewise generates several boxes at each location of the feature map.
name | Out_size | prior_box_name (boxes per location) | Total_num |
---|---|---|---|
conv4_3 | 38x38 | 4 | 5776 |
conv7 | 19x19 | 6 | 2166 |
conv8_2 | 10x10 | 6 | 600 |
conv9_2 | 5x5 | 6 | 150 |
conv10_2 | 3x3 | 4 | 36 |
conv11_2 | 1x1 | 4 | 4 |
Total | | | 8732 |
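The totals in the table can be checked with a few lines of Python. This is only an arithmetic sketch: the list of (feature map size, boxes per location) pairs is copied from the table, and the function name is hypothetical.

```python
# (feature map side length, default boxes per location), in table order
LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_default_boxes(layers):
    """Number of default boxes contributed by each feature map."""
    return [side * side * boxes for side, boxes in layers]


per_layer = count_default_boxes(LAYERS)
total = sum(per_layer)
print(per_layer, total)
```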
The prior boxes are equivalent to the anchors in Faster R-CNN: a set of boxes is preset, and the network predicts the category and position of the detected objects by classification and regression relative to these boxes. Each box is classified and regressed to a more accurate position and size.
In the paper, 4 default boxes are used for conv4_3, conv10_2, and conv11_2, and 6 default boxes are used for the other three layers. The number of default boxes (prior_box_name) and their sizes are calculated according to the following table:
Image source: SSD algorithm theory
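To make the size settings more concrete, here is a minimal sketch that assumes the scale rule described in the SSD paper: scales are spaced linearly between s_min = 0.2 and s_max = 0.9 over the m = 6 feature maps, each aspect ratio a gives a box of width s·√a and height s/√a, and one extra square box uses the geometric mean of the current and next scale. The function name and parameter names are hypothetical, and released implementations may use different values.

```python
import math


def default_box_shapes(k, m=6, s_min=0.2, s_max=0.9, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of the default boxes on the k-th feature map (k = 1..m).

    Sizes are relative to the input image. Values are assumptions for this sketch.
    """
    def scale(i):
        return s_min + (s_max - s_min) * (i - 1) / (m - 1)

    s_k = scale(k)
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Extra square box whose scale lies between this layer and the next one.
    s_extra = math.sqrt(s_k * scale(k + 1))
    shapes.append((s_extra, s_extra))
    return shapes


four_boxes = default_box_shapes(1)
six_boxes = default_box_shapes(2, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0))
```

With three aspect ratios the sketch yields 4 boxes per location, and with five aspect ratios it yields 6, which matches the counts quoted above.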
Training and prediction
Input -> Output -> Loss calculation between the predictions and the ground-truth labeled samples -> Backpropagation, updating weights
First, the prior boxes are matched with the ground truth boxes to mark positive and negative samples. Not all 8732 default boxes are trained in each step: the boxes are first screened by confidence, and only the selected positive and negative samples are trained, according to the following rules:
Positive samples:
For each GT (ground truth) box, the default box with the maximum IoU is a positive sample. In addition, any default box whose IoU with any ground truth box is greater than 0.5 is also set as a positive sample.
The figure is taken from the original paper.
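The two positive-sample rules can be sketched as follows. This is a plain-Python illustration under assumptions: boxes are `(xmin, ymin, xmax, ymax)` tuples, the function names are hypothetical, and an unmatched default box is marked with -1.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """Return matched[i] = index of the GT box assigned to default box i, or -1."""
    matched = [-1] * len(default_boxes)
    # Rule 2: a default box is positive if its best IoU with a GT box exceeds the threshold.
    for i, d in enumerate(default_boxes):
        best_j, best_iou = -1, threshold
        for j, g in enumerate(gt_boxes):
            overlap = iou(d, g)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        matched[i] = best_j
    # Rule 1: every GT box claims its best default box, even below the threshold.
    for j, g in enumerate(gt_boxes):
        best_i = max(range(len(default_boxes)), key=lambda i: iou(default_boxes[i], g))
        matched[best_i] = j
    return matched


defaults = [(0, 0, 10, 10), (0, 0, 4, 4), (50, 50, 60, 60)]
gts = [(0, 0, 9, 9), (0, 0, 2, 2)]
matched = match_default_boxes(defaults, gts)
```

Here the first default box passes the 0.5 threshold, the second is kept only because it is the best match for the small GT box, and the third stays unmatched.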
Negative samples:
The default boxes that are not matched are sorted by confidence loss, and those with the largest loss are selected as negative samples. During training, the number of default boxes used is controlled so that the ratio of positive to negative samples is positive:negative = 1:3.
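The negative-sample selection above (often called hard negative mining) can be sketched like this. The function name, the per-box loss list, and the boolean positive mask are assumptions introduced for the example.

```python
def select_hard_negatives(conf_losses, is_positive, neg_pos_ratio=3):
    """Pick the unmatched boxes with the highest confidence loss.

    Returns at most neg_pos_ratio negatives per positive sample.
    """
    num_pos = sum(1 for pos in is_positive if pos)
    candidates = [i for i, pos in enumerate(is_positive) if not pos]
    # Highest confidence loss first: these are the hardest negatives.
    candidates.sort(key=lambda i: conf_losses[i], reverse=True)
    num_neg = min(neg_pos_ratio * num_pos, len(candidates))
    return candidates[:num_neg]


conf_losses = [0.1, 2.0, 0.5, 1.5, 0.3, 0.9]
is_positive = [True, False, False, False, False, False]
negatives = select_hard_negatives(conf_losses, is_positive)
```

With one positive sample, the sketch keeps the three unmatched boxes with the largest loss and ignores the rest for this training step.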
Loss calculation
$$
L(x, c, l, g)=\frac{1}{N}\left(L_{conf}(x, c)+\alpha L_{loc}(x, l, g)\right)
$$

$$
L_{conf}(x, c)=-\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right)-\sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right) \quad \text{where} \quad \hat{c}_{i}^{p}=\frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)}
$$
- $N$ is the number of prior boxes matched to GT (ground truth) boxes
- $\hat{c}_{i}^{p}$ is the predicted probability that the $i$-th default box belongs to category $p$, the category of its matched GT box
- $x_{ij}^{p} \in \{0,1\}$ indicates whether the $i$-th default box is matched to the $j$-th GT box (whose category is $p$)
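The confidence loss and the weighted sum can be sketched in plain Python. This is an illustration under assumptions, not a training-ready implementation: the function names are hypothetical, class index 0 is taken to be the background class, the localization loss is passed in as a precomputed number because its form is not given here, and alpha = 1 and the zero loss for N = 0 are choices made for the sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one box's class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_loss(logits, labels, pos_idx, neg_idx):
    """L_conf: cross-entropy on positives (matched class) and negatives (class 0)."""
    loss = 0.0
    for i in pos_idx:
        loss -= math.log(softmax(logits[i])[labels[i]])
    for i in neg_idx:
        loss -= math.log(softmax(logits[i])[0])
    return loss


def total_loss(l_conf, l_loc, num_matched, alpha=1.0):
    """L = (L_conf + alpha * L_loc) / N; returns 0 when no box is matched (assumed)."""
    if num_matched == 0:
        return 0.0
    return (l_conf + alpha * l_loc) / num_matched


logits = [[0.0, 0.0], [0.0, 0.0]]  # two boxes, two classes (background, object)
labels = [1, 0]                    # box 0 matched to class 1, box 1 is background
l_conf = confidence_loss(logits, labels, pos_idx=[0], neg_idx=[1])
loss = total_loss(l_conf, l_loc=0.0, num_matched=1)
```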
References:
SSD: Single Shot MultiBox Detector
SSD object detection: Single Shot MultiBox Detector for real-time processing
Object detection | SSD principle and implementation
Deep learning: a detailed explanation of the SSD algorithm process