SSD (Single Shot MultiBox Detector): algorithm and Caffe code in detail

The article is reproduced from: original blog

This blog introduces SSD, one of the standout object detection algorithms of the past year. Its main feature is fusing features from multiple layers for detection.

Paper: SSD: Single Shot MultiBox Detector
Paper link: https://arxiv.org/abs/1512.02325

Algorithm overview:

The SSD algorithm proposed in this paper directly predicts bounding box coordinates and categories, with no proposal-generation step. To detect objects of different sizes, the traditional approach is to resize the image to several scales, process each scale separately, and combine the results; SSD achieves the same effect by combining the feature maps of different convolutional layers. The backbone is VGG16, with the two fully connected layers converted into convolutional layers and several extra convolutional layers appended. The outputs of six different convolutional layers are each convolved with two groups of 3*3 convolution kernels: one group outputs the classification confidence, where each default box produces 21 confidences (for the VOC dataset, with its 20 object categories plus background); the other outputs the localization regression, where each default box produces 4 coordinate values (x, y, w, h). In addition, these convolutional layers feed a priorBox layer that generates the default box coordinates. The number of default boxes for each of these layers is given below. Finally, these three outputs are merged and passed to the loss layer.
Results of the algorithm: with 300*300 input, SSD reaches 74.3% mAP on the VOC2007 test set at 59 FPS (on an Nvidia Titan X), and with 512*512 input it reaches 76.9% mAP. By comparison, Faster RCNN achieves 73.2% mAP at 7 FPS, and YOLO 63.4% mAP at 45 FPS. High accuracy is achieved even at low input resolutions, so the author does not trade accuracy for speed as traditional methods do. The author attributes the large speed gain to removing the bounding box proposal step and the subsequent pixel or feature resampling stages.

code address: https://github.com/weiliu89/caffe/tree/ssd

Algorithm details:

During training, SSD needs only an input image and the ground truth boxes for each object.
The base network is VGG16, pre-trained on ImageNet; fc6 and fc7 are replaced with two new convolutional layers, pool5 is slightly modified, and four extra convolutional layers are appended to form the network used in this paper. The structure of VGG is shown in the following figure:

[Figure: VGG16 network structure]

A core idea of the paper is that the author uses both lower and higher feature maps for detection. Fig. 1 below shows feature maps of size 8*8 and 4*4; each grid square is a feature map cell. Another key concept is the default box: a set of fixed-size boxes attached to every feature map cell. There are 4 of them in the figure below (the dotted boxes; look carefully and note the small box inside each cell as well). Suppose each feature map cell has k default boxes, and each default box must predict c category scores and 4 offsets. Then for a feature map of size m*n, i.e. m*n feature map cells, this feature map produces a total of (c+4)*k*m*n outputs. (In the actual code, the feature map of this layer is convolved with two groups of 3*3 kernels: c*k kernels produce the confidence output, i.e. the per-class confidence of each default box, and 4*k kernels produce the localization output, i.e. the coordinates of each default box.) The author's experiments also show that using more default box shapes gives better results.
The default box used here is very similar to the anchor in Faster RCNN, but whereas Faster RCNN applies anchors only at the last convolutional layer, this paper applies default boxes to feature maps of several different layers.
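To make the (c+4)*k*m*n bookkeeping concrete, here is a minimal sketch (my own illustration, not the official Caffe code) of how many values the two convolution heads of one feature map produce:

```python
# Sketch: per-feature-map output counts for SSD-style detection heads.
# For an m*n feature map with k default boxes per cell and c classes
# (20 VOC categories + background = 21), the heads produce (c+4)*k*m*n values.

def head_output_counts(m, n, k, c):
    conf = c * k * m * n  # from c*k 3*3 kernels: per-class confidences
    loc = 4 * k * m * n   # from 4*k 3*3 kernels: box offsets (x, y, w, h)
    return conf, loc

# Example: conv4_3 in SSD300 is a 38*38 map with 4 default boxes per cell
conf, loc = head_output_counts(38, 38, 4, 21)
print(conf, loc)  # 121296 23104
```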

Another important point in the figure below: during training, the algorithm first matches the default boxes with the ground truth boxes; for example, two blue dashed boxes match the cat's ground truth box, and one red dashed box matches the dog's. One ground truth may therefore correspond to several default boxes. At prediction time, the offset of each default box and its score for each category are predicted directly, and the final result is obtained through NMS.
Fig. 1(c) illustrates that, for each default box, the coordinate offsets and the confidences of all classes are predicted simultaneously.

[Figure 1: SSD framework; default boxes on the 8*8 and 4*4 feature maps are matched against the ground truth boxes]
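Since the final result is obtained through NMS, here is a minimal greedy NMS sketch for reference (an illustration of the idea, not the Caffe implementation; the box layout and the 0.45 threshold are my assumptions):

```python
import numpy as np

# Greedy NMS: repeatedly keep the highest-scoring box and drop boxes
# that overlap it too much.
# boxes: (N, 4) array of (xmin, ymin, xmax, ymax); scores: (N,)
def nms(boxes, scores, iou_threshold=0.45):
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop heavy overlaps
    return keep
```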

For the matching rules of ground truth and default box, please refer to the following figure:

[Figure: matching rules between ground truth boxes and default boxes]
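A small sketch of the matching strategy just described, under my reading of the paper (each ground truth is matched to its best-IoU default box, and any default box whose IoU with some ground truth exceeds the threshold is also matched; this is not a port of the Caffe source):

```python
import numpy as np

def match(defaults, gts, overlap_threshold=0.5):
    # defaults: (D, 4), gts: (G, 4), corner format (xmin, ymin, xmax, ymax)
    D, G = len(defaults), len(gts)
    iou = np.zeros((D, G))
    for g in range(G):
        xx1 = np.maximum(defaults[:, 0], gts[g, 0])
        yy1 = np.maximum(defaults[:, 1], gts[g, 1])
        xx2 = np.minimum(defaults[:, 2], gts[g, 2])
        yy2 = np.minimum(defaults[:, 3], gts[g, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_d = (defaults[:, 2] - defaults[:, 0]) * (defaults[:, 3] - defaults[:, 1])
        area_g = (gts[g, 2] - gts[g, 0]) * (gts[g, 3] - gts[g, 1])
        iou[:, g] = inter / (area_d + area_g - inter)

    matched_gt = np.full(D, -1)  # -1 = negative sample (background)
    # Rule 2: any default box with IoU above the threshold is a positive
    best_gt = iou.argmax(axis=1)
    over = iou.max(axis=1) > overlap_threshold
    matched_gt[over] = best_gt[over]
    # Rule 1: each ground truth keeps its single best default box
    matched_gt[iou.argmax(axis=0)] = np.arange(G)
    return matched_gt  # per default box: matched ground truth index or -1
```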

So how are the scale and aspect ratio of the default boxes determined? Suppose we use m feature maps for prediction; then for the k-th feature map, the scale of its default boxes is computed as

$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \quad k \in [1, m]$

where $s_{\min} = 0.2$ means the lowest layer has scale 0.2, and $s_{\max} = 0.9$ means the highest layer has scale 0.9.
As for the aspect ratio, five values $a_r$ are used:

$a_r \in \{1, 2, 3, 1/2, 1/3\}$

So the width of each default box is computed as

$w_k^a = s_k \sqrt{a_r}$

and the height as (easy to remember, since the product of width and height is the square of the scale)

$h_k^a = s_k / \sqrt{a_r}$

In addition, when the aspect ratio is 1, the author adds another default box with scale

$s_k' = \sqrt{s_k s_{k+1}}$

Therefore, each feature map cell has a total of 6 default boxes.
Note that the default boxes have different scales on different feature layers and different aspect ratios within the same layer, so together they can cover objects of essentially all shapes and sizes in the input image!
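The formulas above can be turned into a short sketch that computes all six (width, height) pairs per layer (my illustration; the released Caffe prototxt actually hard-codes per-layer min/max sizes rather than computing them this way):

```python
import math

def default_box_sizes(m, s_min=0.2, s_max=0.9, ratios=(1, 2, 3, 1/2, 1/3)):
    sizes = []
    for k in range(1, m + 1):
        # s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1)
        s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
        s_k1 = s_min + (s_max - s_min) * k / (m - 1)  # next layer's scale
        # w = s_k * sqrt(a_r), h = s_k / sqrt(a_r)
        boxes = [(s_k * math.sqrt(r), s_k / math.sqrt(r)) for r in ratios]
        # extra box for aspect ratio 1 with scale sqrt(s_k * s_{k+1})
        boxes.append((math.sqrt(s_k * s_k1),) * 2)
        sizes.append(boxes)
    return sizes  # per layer: 6 (w, h) pairs, relative to the input size

for k, boxes in enumerate(default_box_sizes(m=6), start=1):
    print(k, [f"{w:.2f}x{h:.2f}" for w, h in boxes])
```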

Obviously, when a default box matches a ground truth it is a positive sample, and when it does not match it is a negative sample. This naturally produces far more negatives than positives, so the author sorts the negative samples by confidence loss and keeps only the top-ranked ones for training, so that the final negative-to-positive ratio is about 3:1.
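A minimal sketch of this hard negative mining step (conf_loss and is_positive are assumed inputs coming from the matching and the classification loss):

```python
import numpy as np

# Keep only the highest-loss negatives so that negatives:positives <= 3:1.
def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    # conf_loss: (D,) per-default-box confidence loss
    # is_positive: (D,) boolean mask from the matching step
    num_pos = int(is_positive.sum())
    neg_loss = np.where(is_positive, -np.inf, conf_loss)  # exclude positives
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    selected = np.argsort(neg_loss)[::-1][:num_neg]       # top-loss negatives
    return selected  # indices of negatives kept for the classification loss
```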

The figure below compares the structures of the SSD and YOLO algorithms. YOLO takes a 448*448*3 input and produces a 7*7*30 output; its 7*7 grid cells predict 98 bounding boxes in total. SSD appends several convolutional layers to the original VGG16 and predicts offsets and confidences with convolutions (by contrast, YOLO uses fully connected layers). SSD takes a 300*300*3 input and uses the outputs of conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2 to predict locations and confidences.

[Figure: network structures of SSD300 and YOLO]

For the detailed structure of SSD, refer to the Caffe code. The layers are conv1_1, conv1_2, conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2, conv5_3 (512), fc6: a 3*3*1024 convolution (the original fc6 in VGG16 is a fully connected layer; here it becomes a convolutional layer, and likewise for fc7 below), fc7: a 1*1*1024 convolution, conv6_1, conv6_2 (corresponding to conv8_2 in the figure above), ..., conv9_1, conv9_2, and the loss layer.

Each of conv4_3 (4), fc7 (6), conv6_2 (6), conv7_2 (6), conv8_2 (4) and conv9_2 (4) is then convolved with two parallel groups of 3*3 kernels (the numbers in parentheses are the default boxes per cell; see the Caffe code). The number 8732 in the penultimate column of the SSD structure in the figure above is the total number of default boxes: 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4 = 8732. Of the two groups of 3*3 kernels, one is for localization (regression: with 6 default boxes there are 6*4 = 24 such kernels; the map size is unchanged after convolution because pad = 1, likewise below), and the other is for confidence (classification: with 6 default boxes and 20 VOC object categories there are 6*(20+1) = 126 such kernels).

The figure below shows the 3*3 localization convolution on conv6_2, with 24 kernels (6*4 = 24; since pad = 1 the output map size is unchanged, likewise below). The Permute layer swaps dimensions: for example, if the output after convolution is 32*24*19*19, it becomes 32*19*19*24 after the Permute layer, i.e. the order of dimensions changes. The Flatten layer then turns 32*19*19*24 into 32*8664, where 32 is the batch size.

[Figure: the 3*3 localization convolution on conv6_2, followed by Permute and Flatten]
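A quick sanity check of the 8732 total and the Permute/Flatten shapes described above (shapes only; the layer names and per-layer box counts follow the text, the rest is my sketch):

```python
# Per prediction layer: (feature map side, default boxes per cell)
layers = {
    "conv4_3": (38, 4), "fc7": (19, 6), "conv6_2": (10, 6),
    "conv7_2": (5, 6), "conv8_2": (3, 4), "conv9_2": (1, 4),
}
total = sum(s * s * k for s, k in layers.values())
print(total)  # 8732 default boxes in total

# Permute/Flatten example from the text, batch size 32:
# conv output 32*24*19*19 -> Permute -> 32*19*19*24 -> Flatten -> 32*8664
assert 19 * 19 * 24 == 8664
```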

The 3*3 confidence convolution works the same way; note that the number of kernels is 126 (6*21 = 126):

[Figure: the 3*3 confidence convolution on conv6_2]

Then there is the operation that generates the default boxes, based on the minimum size, maximum size and aspect ratios. The step parameter means that one pixel of this layer corresponds to a fixed number of pixels of the original input image (e.g. 1/32 of it for a layer with stride 32); roughly speaking, the receptive field. In the source code it is obtained by dividing the original input image size by the feature map size of this layer. The variance, intuitively, is a scale transform: the four coordinates here are the center coordinates plus width and height, and when computing the loss one may want to weigh the loss of the center coordinates against the loss of the width and height, hence the variance. If the four corner coordinates of the box were used instead, the variances would all default to 0.1, i.e. no weight difference between them. After these three operations, the processing of this layer's features is done.
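Here is a rough sketch of what a PriorBox-style layer computes for one feature map, including step and variance (my simplified illustration in center-size coordinates; the real Caffe layer has more options, such as clipping and corner-format output):

```python
import numpy as np

def prior_boxes(map_size, img_size, box_sizes, variance=(0.1, 0.1, 0.2, 0.2)):
    # step: how many input pixels one feature map pixel covers,
    # e.g. 300 / 19 ~ 16 for a 19*19 map on a 300*300 input
    step = img_size / map_size
    priors = []
    for i in range(map_size):
        for j in range(map_size):
            cx = (j + 0.5) * step / img_size  # box centers sit at cell
            cy = (i + 0.5) * step / img_size  # centers, normalized to [0, 1]
            for w, h in box_sizes:            # the 4 or 6 (w, h) pairs
                priors.append([cx, cy, w, h])
    priors = np.array(priors)
    variances = np.tile(variance, (len(priors), 1))
    return priors, variances  # the two "channels" of the priorbox output
```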

Take a look at the output dimensions of the next few layers, paying attention to the priorbox dimension. Taking conv8_2_mbox_priorbox as an example, it is (1, 2, 144). The 144 holds all the coordinates of the generated default boxes, matching the regression output above: 3*3*4*4. The 2 is related to the variance: one channel holds the coordinates, the other holds the variances.

[Figure: output dimensions of the priorbox layers]

After performing the above operations on the outputs of the six convolutional layers listed earlier, the results are combined with Concat. Similar to the Inception modules in GoogLeNet, this concatenates along the channel dimension rather than adding values.

[Figure: Concat of the per-layer loc, conf and priorbox outputs]

Here are the dimensions after several channels are merged:

[Figure: dimensions of mbox_loc, mbox_conf and mbox_priorbox after Concat]
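A shape-level sketch of the Concat step (my illustration with numpy; the per-layer map sizes and box counts are the ones listed above):

```python
import numpy as np

# The flattened per-layer localization outputs are concatenated along
# the channel axis (axis 1), not summed.
batch = 32
loc_parts = [np.zeros((batch, s * s * k * 4)) for s, k in
             [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]]
mbox_loc = np.concatenate(loc_parts, axis=1)
print(mbox_loc.shape)  # (32, 34928) = (32, 8732 * 4)
```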

Last comes the author's custom loss function layer. Here overlap_threshold means that a default box whose overlap with a ground truth exceeds this threshold counts as a positive sample. Also, as far as I can tell, deciding which default boxes are positives and which are negatives happens inside the loss layer, but this detail has little bearing on the algorithm:

[Figure: parameters of the MultiBoxLoss layer]

As for the loss function: it is basically the same as Faster RCNN's, composed of a classification part and a regression part; refer to Faster RCNN for details, which are not repeated here. In short, the regression loss pushes the offset between the predicted box and the default box to be as close as possible to the offset between the ground truth and the default box, so that the predicted box ends up as close as possible to the ground truth.

$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$

where $N$ is the number of matched default boxes, $L_{conf}$ is the softmax classification loss, $L_{loc}$ is the smooth L1 regression loss, and $\alpha$ balances the two terms.
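A simplified sketch of the two loss terms (my own numpy illustration; the Caffe MultiBoxLoss layer also performs matching and hard negative mining internally):

```python
import numpy as np

def smooth_l1(x):
    # Smooth L1: quadratic near zero, linear elsewhere
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multibox_loss(loc_pred, loc_target, conf_pred, labels, num_pos, alpha=1.0):
    # loc_pred/loc_target: (P, 4) encoded offsets for the positive boxes
    loc_loss = smooth_l1(loc_pred - loc_target).sum()
    # conf_pred: (M, C) logits for positives + mined negatives; labels: (M,)
    logits = conf_pred - conf_pred.max(axis=1, keepdims=True)  # stable softmax
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    conf_loss = -log_softmax[np.arange(len(labels)), labels].sum()
    # Both terms are normalized by the number of matched (positive) boxes
    return (conf_loss + alpha * loc_loss) / max(num_pos, 1)
```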

Here is a brief list of the number of default boxes used by several object detection algorithms, and why so many boxes are needed:

[Table: number of default boxes used by several object detection algorithms]

Experimental results:

Data augmentation gives a very noticeable boost to mAP!

[Table: effect of data augmentation on mAP]

Comparing the various design choices shows that data augmentation brings the largest mAP gain.

[Table: ablation of design choices]

In Fast RCNN and Faster RCNN, data augmentation mainly means using the original images plus their horizontal flips as the training set.

This figure shows the effectiveness of using features from multiple layers. Fusing features from different layers is an important technique; here it mainly addresses detecting objects of different sizes.

[Table: effect of using multiple output layers]

Experimental comparison with YOLO and Faster RCNN shows that SSD is both faster and more accurate.

[Table: comparison of SSD with YOLO and Faster RCNN]

Summary:

This algorithm detects objects of different aspect ratios well because it uses default boxes of multiple aspect ratios at every feature map cell; this is the core of the paper. The default box design is also very similar to the anchors in Faster RCNN. Finally, the paper stresses the value of data augmentation, including random cropping, rotation, contrast adjustment, and so on.
The authors note that the algorithm detects small objects worse than large ones. They attribute this to small objects carrying too little information in the top layers of the network, so increasing the input image size helps small-object detection. Data augmentation also helps with small objects: a randomly cropped image effectively "zooms in" on the original, so cropping both increases the number of images and magnifies them.

