Deep Learning Series - Object Detection History (5): SSD Object Detection Explained


Return to the Object Detection History directory

Previous: Deep Learning Series - Object Detection History (4): From Fast R-CNN to Faster R-CNN Explained

Next: Deep Learning Series - Object Detection History (6): YOLO-V3 Object Detection Explained

 

Paper: "SSD: Single Shot MultiBox Detector"

 

This section explains SSD object detection in detail; the next section covers YOLO-V3.

 

Six. SSD (Single Shot MultiBox Detector) Object Detection (2016)

1. The SSD paper, released in late November 2016, set new records for both speed and accuracy on object detection tasks: it scores over 74% mAP (mean Average Precision) on representative datasets such as Pascal VOC and COCO, and can run at 59 frames per second. To better understand SSD, start with the name of the architecture:

   (1). Single Shot

       Single Shot means that the localization and classification tasks are both completed in a single forward pass of the network.

   (2). MultiBox

      This is the name of the bounding box regression technique developed by Szegedy et al. MultiBox is a fast, class-agnostic way of proposing bounding box coordinates, and the original MultiBox work uses an Inception-style convolutional network.

      The MultiBox loss function comprises two key parts:

      ①. confidence_loss:

          This part measures how confident the network is that a computed box contains an object, i.e., the probability of foreground versus background or of the c classes (c_1, c_2, ..., c_p). The loss is computed with categorical cross-entropy, similar to the RPN confidence loss in Faster R-CNN.

      ②. location_loss:

           This measures how far the bounding boxes the network predicts are from the actual ground truth boxes in the training data, using the smooth_{L_1} norm to compute the location loss. Although not as precise as the L2 norm, it is still very effective and gives SSD more room to maneuver, because it does not try to be "pixel perfect" in its bounding box predictions.

      The overall loss function is as follows:

        L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)

        L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_{i}^{p}) - \sum_{i \in Neg} \log(\hat{c}_{i}^{0}), \quad \text{where } \hat{c}_{i}^{p} = \frac{\exp(c_{i}^{p})}{\sum_{p} \exp(c_{i}^{p})}

        L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, smooth_{L_{1}}(l_{i}^{m} - \hat{g}_{j}^{m})

           \hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \quad \hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \quad \hat{g}_{j}^{w} = \log\left(\frac{g_{j}^{w}}{d_{i}^{w}}\right), \quad \hat{g}_{j}^{h} = \log\left(\frac{g_{j}^{h}}{d_{i}^{h}}\right)

           x_{ij}^{p} is the indicator that the i-th default box is matched to the j-th ground truth box of class p; x_{ij}^{p} \in \{1, 0\}, i.e., 1 for a matched (foreground) box and 0 for background.

           g denotes the ground truth box.

           l denotes the predicted box.

           d denotes the default bounding box, (cx, cy) is the center point of the default box, and w, h are its width and height respectively.

           c denotes the class scores over the c categories, i.e., the multi-class classification output.

           Pos denotes the positive samples, Neg the negative samples.
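To make the notation concrete, here is a minimal NumPy sketch of the offset encoding and the combined loss, written for default boxes that have already been matched to ground truth. It only illustrates the formulas above; the (cx, cy, w, h) box layout, the function names, and the assumption that class 0 is background are choices made for this example, not the paper's reference code.

```python
import numpy as np

def encode_offsets(gt, default):
    """Encode one ground truth box against its matched default box.
    Both boxes are (cx, cy, w, h); returns (g_cx, g_cy, g_w, g_h) as defined above."""
    g_cx = (gt[0] - default[0]) / default[2]
    g_cy = (gt[1] - default[1]) / default[3]
    g_w = np.log(gt[2] / default[2])
    g_h = np.log(gt[3] / default[3])
    return np.array([g_cx, g_cy, g_w, g_h])

def smooth_l1(x):
    """smooth_L1(x) = 0.5 * x^2 if |x| < 1, else |x| - 0.5 (elementwise)."""
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def multibox_loss(cls_logits, loc_preds, labels, loc_targets, pos_mask, alpha=1.0):
    """cls_logits:  (num_boxes, num_classes) raw class scores, class 0 = background
    loc_preds:   (num_boxes, 4) predicted offsets
    labels:      (num_boxes,) matched class index per default box (0 for negatives)
    loc_targets: (num_boxes, 4) encoded ground truth offsets
    pos_mask:    (num_boxes,) boolean, True for matched (positive) default boxes"""
    # confidence loss: softmax followed by cross-entropy over all boxes
    probs = np.exp(cls_logits - cls_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    conf = -np.log(probs[np.arange(len(labels)), labels] + 1e-9)
    conf_loss = conf[pos_mask].sum() + conf[~pos_mask].sum()

    # location loss: smooth L1 on positive boxes only
    loc_loss = smooth_l1(loc_preds[pos_mask] - loc_targets[pos_mask]).sum()

    n = max(int(pos_mask.sum()), 1)   # N = number of matched default boxes
    return (conf_loss + alpha * loc_loss) / n
```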

 

2. SSD architecture diagram (SSD has two losses: loc_loss and conf_loss)

 

3. The SSD architecture abandons the fully connected layers and adds a set of auxiliary convolutional layers in their place, which extract features at multiple scales and progressively reduce the spatial size of the input to each subsequent layer.
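As a rough sketch of this idea (a PyTorch-style module with made-up channel sizes, not the exact layer configuration of the paper), the auxiliary layers are simply strided convolutions stacked after the backbone; each block halves the spatial size, and every intermediate output is kept so the detection heads can run on feature maps of several scales:

```python
import torch
import torch.nn as nn

class AuxiliaryLayers(nn.Module):
    """Extra convolutional feature layers appended after the backbone.
    Each block halves the spatial resolution; every output is kept so the
    detection heads can run on feature maps of several scales."""
    def __init__(self, in_channels=1024):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_channels, 256, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
        ])

    def forward(self, x):
        features = []
        for block in self.blocks:
            x = block(x)
            features.append(x)   # one multi-scale feature map per block
        return features
```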

 

4. Classification

    The classification part is SSD's object classifier. For each of the k default boxes at a given feature map location, c class scores and 4 offsets relative to the original default box shape (i.e., the coordinates) are computed. Therefore, each location of a feature map applies a total of (c + 4) * k filters; for an m * n feature map, this yields (c + 4) * k * m * n outputs.
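A minimal sketch of such a head, assuming PyTorch and k default boxes per cell (the layer and variable names are this example's, not the paper's): a single 3 x 3 convolution with (c + 4) * k output channels yields, for each of the m * n cells, c class scores and 4 offsets per default box.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """One SSD head for a single feature map: (c + 4) * k filters per location."""
    def __init__(self, in_channels, num_classes, k):
        super().__init__()
        self.num_classes = num_classes
        self.k = k
        self.conv = nn.Conv2d(in_channels, (num_classes + 4) * k,
                              kernel_size=3, padding=1)

    def forward(self, feature_map):
        # feature_map: (batch, in_channels, m, n)
        out = self.conv(feature_map)                    # (batch, (c+4)*k, m, n)
        b, _, m, n = out.shape
        out = out.permute(0, 2, 3, 1).reshape(b, m * n * self.k, self.num_classes + 4)
        cls_scores = out[..., :self.num_classes]        # c scores per default box
        offsets = out[..., self.num_classes:]           # 4 offsets per default box
        return cls_scores, offsets

# Example: a 38x38 feature map with 512 channels, 21 classes, 6 default boxes per cell
head = PredictionHead(in_channels=512, num_classes=21, k=6)
scores, offsets = head(torch.randn(1, 512, 38, 38))    # scores: (1, 38*38*6, 21)
```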

 

5. SSD Improvements

   (1). Default bounding boxes

        SSD proposes configuring a set of default bounding boxes with different sizes and aspect ratios, to ensure that most targets are captured. In the SSD paper, each feature map cell has about six default bounding boxes (a rough sketch of how these can be generated appears after this list).

   (2). Fixed priors (i.e., anchors / default bounding boxes)

        In SSD, each feature map cell is associated with a set of default bounding boxes of different scales and aspect ratios. These priors are selected manually (but carefully); in MultiBox they were chosen because their IoU with the ground truth boxes exceeds 0.5.

   (3). Priors and IoU

        In MultiBox, the priors (like anchors) are pre-computed bounding boxes of fixed size, created by the researchers so that they closely match the distribution of the original ground truth boxes; in fact, these priors are selected to have an IoU of at least 0.5 with the ground truth. In reality, an IoU of 0.5 is still not good enough, but it provides a strong starting point for bounding box regression, which is much better than predicting coordinates at random. MultiBox therefore starts from the priors as predictions and attempts to regress them closer to the ground truth boxes.

   (4). Hard Negative Mining

        During training, many bounding boxes with low IoU are treated as negative samples, so the training set can contain far more negatives than positives. It is therefore recommended to keep the ratio of negative to positive samples at about 3:1 instead of predicting on all negative samples. Negatives must be kept because the network needs to learn, and be told explicitly, what constitutes an incorrect detection.
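The two most mechanical points in the list above, generating the default boxes for one cell and the 3:1 hard negative mining, can be sketched as follows in plain NumPy. The scale/aspect-ratio scheme follows the spirit of the paper, but the exact values and function names are assumptions for illustration only.

```python
import numpy as np

def default_boxes_for_cell(cx, cy, scale, next_scale,
                           aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """Return ~6 default boxes (cx, cy, w, h) for one feature map cell:
    one box per aspect ratio plus an extra box with scale sqrt(s_k * s_{k+1})."""
    boxes = [(cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)) for ar in aspect_ratios]
    extra = np.sqrt(scale * next_scale)
    boxes.append((cx, cy, extra, extra))
    return np.array(boxes)

def hard_negative_mining(conf_loss, pos_mask, neg_pos_ratio=3):
    """Keep only the hardest negatives so that #neg <= 3 * #pos.
    conf_loss: (num_boxes,) per-box confidence loss; pos_mask: boolean positives."""
    num_pos = int(pos_mask.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~pos_mask).sum()))
    neg_loss = np.where(pos_mask, -np.inf, conf_loss)   # ignore positives
    hardest = np.argsort(-neg_loss)[:num_neg]           # largest losses first
    neg_mask = np.zeros_like(pos_mask)
    neg_mask[hardest] = True
    return neg_mask                                     # train on pos_mask | neg_mask
```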

 

6. SSD Summary

   (1). More default boxes yield more precise detection, although this reduces speed (for comparison, Faster R-CNN uses nine default anchor boxes per location).

   (2). Since MultiBox runs on feature maps of several resolutions, applying MultiBox to multiple layers yields better detection.

   (3). In the SSD paper, 80% of the time is spent in the VGG16 base network, which means that with a faster network of the same accuracy (such as ResNet), SSD's performance would be even better.

   (4). SSD confuses objects of similar categories, possibly because multiple classes share locations (e.g., if the boxes of classes c1 and c2 sit at very similar positions or largely overlap, it is easy to confuse c1 and c2 in the prediction).

   (5). SSD500 (the variant with the highest input resolution, 512 x 512) achieves the best accuracy on Pascal VOC2007, 76.8% mAP, at the cost of speed, which drops to 22 frames per second. SSD300, at 74.3% mAP and 59 frames per second, is a better trade-off.

   (6). SSD does not perform well on smaller objects, because small objects may not appear on all feature maps (due to downsampling). Increasing the input image resolution alleviates this problem but does not completely solve it.

 


 

Back to the main directory

Return to the Object Detection History directory

Previous: Deep Learning Series - Object Detection History (4): From Fast R-CNN to Faster R-CNN Explained

Next: Deep Learning Series - Object Detection History (6): YOLO-V3 Object Detection Explained



Source: blog.csdn.net/qq_38299170/article/details/104470653