Object detection -> SSD algorithm

Object detection algorithms are generally divided into two families: region-based algorithms and regression-based algorithms.

1) Region-based algorithms: RCNN, Fast RCNN, Faster RCNN, Mask RCNN, etc. Detection is split into two stages. In the first stage, the detector finds candidate regions (regions of interest, ROIs); in the second stage, it performs classification and position regression (bounding box regression) on these candidate regions.

2) Regression-based algorithms: YOLO series, etc. Detection is an end-to-end process that directly regresses the category and location of objects.

SSD (Single Shot MultiBox Detector) is a one-stage detection algorithm. It can be seen as a combination of Faster RCNN and YOLO: like YOLO, it uses a regression-based model that directly regresses the category and position of objects in a single network, so detection is very fast. At the same time, it borrows the region-based idea from Faster RCNN: during detection, many candidate regions are used as ROIs.

Review of Faster RCNN:

Problems with Faster RCNN:

1) Detection of small objects is very poor. Prediction is performed on only one feature layer, and that layer has already passed through many convolutional layers. The more layers the features pass through, the higher their level of abstraction and the less image detail they retain, so small objects are detected worse. Small objects therefore need to be predicted on relatively low-level features.

2) The model is large and detection is slow, because prediction is performed twice (a common problem of two-stage methods).

SSD network

SSD predicts objects of different scales on feature maps of different scales.

Backbone network:

The backbone of SSD is based on a traditional image classification network: part of VGG16 is used as the base network. As shown in the figure, after 10 convolutional layers (conv layer) and 3 pooling layers (max pooling), we get a feature map (the Conv4_3 feature map) of size 38×38×512. In the next step, we perform regression on this feature map to get the location and category of objects.
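The 38×38 size can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions that the text above does not spell out: a 300×300 input (the SSD300 setting), 3×3 convolutions with padding 1 that keep the spatial size, 2×2 max pooling with stride 2, and rounding up in the third pooling layer. The helper names are hypothetical.

```python
import math

def conv3x3_same(size):
    """3x3 convolution, stride 1, padding 1 (assumed): spatial size unchanged."""
    return size

def max_pool_2x2(size, ceil_mode=False):
    """2x2 max pooling, stride 2; ceil_mode rounds odd sizes up."""
    return math.ceil(size / 2) if ceil_mode else size // 2

# (block name, number of conv layers, ceil_mode of the pooling layer that
# follows the block, or None when no pooling follows). Assumed layout of the
# VGG16 layers up to Conv4_3.
BLOCKS = [
    ("conv1", 2, False),
    ("conv2", 2, False),
    ("conv3", 3, True),   # assumption: this pooling layer rounds 75 up to 38
    ("conv4", 3, None),   # stop at Conv4_3, before the next pooling layer
]

def trace_to_conv4_3(input_size=300):
    """Return (feature map side, conv layer count, pooling layer count)."""
    size, conv_layers, pool_layers = input_size, 0, 0
    for _name, n_convs, pool_ceil in BLOCKS:
        for _ in range(n_convs):
            size = conv3x3_same(size)
            conv_layers += 1
        if pool_ceil is not None:
            size = max_pool_2x2(size, ceil_mode=pool_ceil)
            pool_layers += 1
    return size, conv_layers, pool_layers

side, n_conv, n_pool = trace_to_conv4_3(300)
print(side, n_conv, n_pool)  # 38 10 3
```

The sizes go 300 -> 150 -> 75 -> 38, which matches the 10 convolutional layers, 3 pooling layers, and 38×38 feature map described above.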

Regression:

Similar to the regression operation of YOLO, first we consider the case where there is one and only one candidate box (default box) at each position of the feature map.

1) Position regression: The detector predicts the box center offset (cx, cy) and the width and height (w, h) relative to the image size, so 4 parameters are regressed in total. (Fast RCNN, by contrast, regresses a box for each category, for a total of (N+1)*4 parameters.)

2) Classification: For each bounding box, the detector outputs scores for 20 categories + 1 background category.

For each position, we need a 25-dimensional vector (4 box parameters + 21 class scores) to store the position and category information of the detected object. For our 38×38 feature map, we need a 38×38×25 tensor to store this information. The detector therefore needs to learn the mapping from the feature map (38×38×512) to the detection result (38×38×25). This mapping is done with a convolution: 25 3×3 convolution kernels are applied to the feature map. At this point, we have regressed one box at each position.
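The shapes in the single-box case can be sketched in a few lines. This is an illustration, not SSD's actual implementation. It assumes padding 1 so that the 3×3 convolution keeps the 38×38 size, a channel order of box parameters first and class scores second, and a softmax to turn scores into probabilities. All function names are hypothetical.

```python
import math

NUM_CLASSES = 20 + 1                          # 20 categories + 1 background
BOX_PARAMS = 4                                # cx, cy, w, h
CHANNELS_PER_BOX = BOX_PARAMS + NUM_CLASSES   # 25

def conv_output_shape(height, width, out_channels, kernel=3, stride=1, padding=1):
    """Output shape (H, W, C) of a convolution; padding=1 is an assumption."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w, out_channels

def conv_parameter_count(in_channels, out_channels, kernel=3):
    """Number of weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels

def split_prediction(vector):
    """Split one 25-dim vector into box parameters and class probabilities.

    The layout (box first, scores second) is assumed for illustration.
    """
    box = vector[:BOX_PARAMS]
    scores = vector[BOX_PARAMS:]
    peak = max(scores)                        # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return box, [e / total for e in exps]

print(conv_output_shape(38, 38, CHANNELS_PER_BOX))   # (38, 38, 25)
print(conv_parameter_count(512, CHANNELS_PER_BOX))   # 115225

box, probs = split_prediction([0.5, 0.5, 0.2, 0.3] + [0.0] * NUM_CLASSES)
print(box, len(probs))
```

The predictor is small: 3×3×512×25 weights plus 25 biases map every 38×38 position to its 25-dimensional result.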

3) Multiple candidate boxes: SSD regresses k boxes of different sizes at each position. Each position therefore needs a 25×k-dimensional vector to store the regression and classification information of these boxes, so the convolution now uses 25×k 3×3 convolution kernels and produces a 38×38×25k detection result map (score map).

4) Multiple feature maps: In a neural network, shallow feature maps contain more detail and are better suited to detecting small objects, while deeper feature maps contain more global information and are better suited to detecting large objects. By regressing candidate boxes of different sizes on different feature maps, we get better detection results for objects of different sizes.
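To make points 3) and 4) concrete, the sketch below counts the boxes produced when k boxes are regressed per position on several feature maps. The text above only names the 38×38 map; the other map sizes and the per-map values of k are assumptions taken from the SSD300 configuration in the original SSD paper, and the function names are hypothetical.

```python
CHANNELS_PER_BOX = 25  # 4 box parameters + 21 class scores

# (feature map side length, boxes per position k).
# Assumed SSD300 values; only the 38x38 map appears in the text above.
FEATURE_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def head_output_shape(side, k):
    """Shape of the detection result map for one feature map."""
    return side, side, CHANNELS_PER_BOX * k

def boxes_on_map(side, k):
    """Number of candidate boxes regressed on one feature map."""
    return side * side * k

for side, k in FEATURE_MAPS:
    print(head_output_shape(side, k), boxes_on_map(side, k))

total_boxes = sum(boxes_on_map(side, k) for side, k in FEATURE_MAPS)
print(total_boxes)  # 8732
```

Under these assumptions the shallow 38×38 map alone contributes 5776 of the 8732 boxes, which fits the idea that fine-grained maps handle the many possible small objects while the coarse maps cover a few large ones.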

SSD's detection accuracy and speed are both excellent: at 76.8 mAP and 22 FPS, it surpasses Faster RCNN and YOLO.

Origin blog.csdn.net/wanchengkai/article/details/124377589