Deep Learning (Object Detection): From R-CNN to SSD, the Most Complete Inventory of Object Detection Algorithms

Object detection is the foundation of many computer vision tasks; whether we need to relate images to text or identify fine-grained categories, it provides reliable object-level information. This article gives an overall review of object detection. The first part starts with R-CNN and introduces detectors based on candidate regions, including Fast R-CNN, Faster R-CNN, and FPN. The second part focuses on single-shot detectors, including YOLO, SSD, and RetinaNet, all of which are state-of-the-art methods.

Machine Heart has covered many object detection algorithms before; readers interested in computer vision can also refer to the earlier articles below to strengthen their understanding.

  • A Comprehensive Review of Deep Learning Object Detection Models: Faster R-CNN, R-FCN and SSD

  • PyTorch project from scratch: YOLO v3 object detection implementation

  • Taking Faster R-CNN apart like Lego bricks: a detailed walkthrough of an object detection implementation

  • Advances in Object Detection and Instance Segmentation in the Post-RCNN Era

  • Overview of Object Detection Algorithms: From Traditional Detection Methods to Deep Neural Network Frameworks

Object detectors based on candidate regions

Sliding window detector

Classification with CNNs has been mainstream since AlexNet won the ILSVRC 2012 challenge. A brute-force approach to object detection is to slide a window across the image from left to right and top to bottom, classifying the content of each window to identify objects. To detect different object types at different viewing distances, we use windows of different sizes and aspect ratios.

Sliding window (left to right, top to bottom)

We crop image patches from the image according to the sliding window. Since many classifiers only accept fixed-size images, the patches are warped to that size. This does not hurt classification accuracy, because the classifier is trained to handle warped images.
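
As a minimal sketch of this cropping-and-warping step (the (x, y, w, h) window format and the 224×224 input size are assumptions for illustration), the get_patch helper used in the pseudocode below could look like this:

import cv2

def get_patch(image, window, size=(224, 224)):
    # Crop a window (x, y, w, h) from the image and warp it to a fixed size
    # so that a fixed-input classifier can consume it.
    x, y, w, h = window
    patch = image[y:y + h, x:x + w]
    return cv2.resize(patch, size)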

Transform an image into a fixed-size image

The warped image patch is fed into a CNN classifier, which extracts 4096 features. We then apply an SVM classifier to identify the class and a separate linear regressor to refine the bounding box.

System flow chart of the sliding window detector.

Below is the pseudocode. We create many windows to detect different objects at different positions. An obvious way to improve performance is to reduce the number of windows.

for window in windows
patch = get_patch(image, window)
results = detector(patch)

Selective search

Instead of the brute-force method, we use a region proposal method to create regions of interest (ROIs) for object detection. In selective search (SS), we start by treating each pixel as its own group. We then compute a texture for each group and merge the two groups that are closest. To keep a single region from swallowing all the others, smaller groups are merged first. We keep merging regions until everything is combined into one. The first row of the figure below shows how the regions grow, and the blue rectangles in the second row represent all the possible ROIs produced during the merging process.

Source: van de Sande et al. ICCV'11
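
A minimal sketch of the greedy merging loop described above (the segmentation and similarity helpers are hypothetical placeholders, not a real library API):

def selective_search(image):
    groups = initial_segmentation(image)            # hypothetical: one group per small region
    proposals = [bounding_box(g) for g in groups]
    while len(groups) > 1:
        a, b = most_similar_pair(groups)            # hypothetical similarity (texture, size, color)
        merged = merge(a, b)                        # combine the two closest groups
        groups.remove(a)
        groups.remove(b)
        groups.append(merged)
        proposals.append(bounding_box(merged))      # every intermediate region is a candidate ROI
    return proposals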

R-CNN

R-CNN creates about 2,000 ROIs using a region proposal method. The regions are warped into fixed-size images and fed into a convolutional neural network one by one. The network is followed by several fully connected layers that perform object classification and refine the bounding box.

Using region proposals, a CNN, and affine (fully connected) layers to localize objects.

The following is a flow chart of the entire system of R-CNN:

By using fewer and higher quality ROIs, R-CNN is faster and more accurate than sliding window methods.

ROIs = region_proposal(image)
for ROI in ROIs
patch = get_patch(image, ROI)
results = detector(patch)

Bounding Box Regressor

Region proposal methods have very high computational complexity. To speed up the process, we usually construct the ROIs with a computationally cheaper region proposal method and then further refine the bounding boxes with a linear regressor (implemented as a fully connected layer).

Refine the blue original bounding box to red using a regression method.
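
For reference, the usual R-CNN-style box parameterization that such a regressor is trained to predict (assumed here; boxes are given as center, width, and height) looks like this:

import math

def encode_box(proposal, ground_truth):
    # Regression targets for refining a proposal box toward the ground-truth box.
    px, py, pw, ph = proposal          # (center_x, center_y, width, height)
    gx, gy, gw, gh = ground_truth
    tx = (gx - px) / pw                # center shift, normalized by the proposal size
    ty = (gy - py) / ph
    tw = math.log(gw / pw)             # log-scale change of width and height
    th = math.log(gh / ph)
    return tx, ty, tw, th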

Fast R-CNN

R-CNN needs a large number of region proposals to be accurate, and many of those regions overlap one another, so R-CNN is very slow to train and to run. If we have 2,000 proposals and each one is fed into the CNN independently, we repeat feature extraction 2,000 times for different ROIs.

In addition, the feature maps in CNN represent spatial features in a dense way, so can we directly use the feature maps instead of the original images to detect objects?

Calculate the ROI directly using the feature map.

Fast R-CNN uses a feature extractor (a CNN) to extract features from the whole image first, instead of extracting features from each image patch from scratch. The region proposal method is then applied directly to the extracted feature maps. For example, Fast R-CNN selects the conv5 layer of VGG16 to generate ROIs; these regions of interest are combined with the corresponding feature maps to crop out feature patches, which are used for object detection. We use ROI pooling to convert the feature patches to a fixed size and feed them to fully connected layers for classification and localization. Because Fast R-CNN does not repeat feature extraction, it significantly reduces processing time.

The candidate regions are directly applied to the feature maps and converted into fixed-size feature maps using ROI pooling.

The following is the flow chart of Fast R-CNN:

In the pseudocode below, the computationally expensive feature extraction process is moved out of the For loop, resulting in a significant speedup. Fast R-CNN is 10 times faster to train and 150 times faster to infer than R-CNN.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs
patch = roi_pooling(feature_maps, ROI)
results = detector2(patch)

The most important point about Fast R-CNN is that the whole network, including the feature extractor, the classifier, and the bounding box regressor, can be trained end-to-end with a multi-task loss that combines the classification loss and the localization loss. This greatly improves the accuracy of the model.
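
A minimal sketch of such a multi-task loss (the smooth-L1 form for the localization term and the weighting factor lam are common choices assumed here, not taken from the text):

import torch
import torch.nn.functional as F

def multi_task_loss(class_logits, labels, box_preds, box_targets, lam=1.0):
    # Classification loss over all ROIs.
    cls_loss = F.cross_entropy(class_logits, labels)
    # Localization loss only for foreground (non-background) ROIs.
    fg = labels > 0
    if fg.any():
        loc_loss = F.smooth_l1_loss(box_preds[fg], box_targets[fg])
    else:
        loc_loss = box_preds.sum() * 0
    return cls_loss + lam * loc_loss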

ROI pooling

Because Fast R-CNN uses fully connected layers, we apply ROI pooling to convert ROIs of different sizes to a fixed size.

For brevity, we first convert the 8×8 feature map to a predefined 2×2 size.

  • Top left: the feature map.

  • Top right: Overlap the ROI (blue area) with the feature map.

  • Bottom left: Split ROI into target dimensions. For example, for a 2×2 target, we segment the ROI into 4 parts of similar or equal size.

  • Bottom right corner: Find the maximum value of each part to get the transformed feature map.

Input feature map (top left), output feature map (bottom right), ROI (top right, blue box).

Follow the steps above to get a 2×2 feature map that can be fed into the classifier and bounding box regressor.
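
A minimal NumPy sketch of this max-pooling step (the even split into bins is a simplification; real implementations also handle coordinate quantization and uneven splits):

import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    # Max-pool one ROI of a 2-D feature map down to a fixed output size.
    x0, y0, x1, y1 = roi                              # ROI in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            part = region[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
            pooled[i, j] = part.max()                 # maximum value of each part
    return pooled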

Faster R-CNN

Fast R-CNN relies on external candidate region methods such as selective search. But these algorithms run on CPU and are slow. In testing, Fast R-CNN takes 2.3 seconds to make predictions, of which 2 seconds are used to generate 2000 ROIs.

feature_maps = process(image)
ROIs = region_proposal(feature_maps) # Expensive!
for ROI in ROIs
patch = roi_pooling(feature_maps, ROI)
results = detector2(patch)

Faster R-CNN adopts the same design as Fast R-CNN, except that it replaces the external region proposal method with an internal deep network. The new region proposal network (RPN) generates ROIs far more efficiently, running at about 10 ms per image.

The flowchart of Faster R-CNN is the same as Fast R-CNN.

The external region proposal method is replaced by an internal deep network.

Region proposal network (RPN)

The region proposal network (RPN) takes the output feature map of the first convolutional network as input. It slides a 3×3 convolution kernel over the feature map to build class-agnostic region proposals, using a convolutional network (the ZF network shown below). Other deep networks such as VGG or ResNet can be used for richer feature extraction, at the cost of speed. The ZF network outputs 256 values, which are fed into two separate fully connected layers to predict a bounding box and two objectness scores that measure whether the box contains an object. We could just use a regressor to compute a single objectness score, but for simplicity Faster R-CNN uses a classifier with two classes: object and no object.

For each position in the feature map, the RPN makes k predictions, so it outputs 4×k coordinates and 2×k scores per location. The figure below shows an 8×8 feature map processed by a 3×3 convolution kernel, which finally outputs 8×8×3 ROIs (with k=3). The figure on the right shows the 3 candidate regions for a single location.

Here we have 3 guesses that we will refine later. Since only one of them needs to be correct, the initial guesses should preferably cover different shapes and sizes. For this reason, Faster R-CNN does not create random bounding boxes. Instead, it predicts offsets (such as Δx, Δy) relative to reference boxes called "anchors" (the box in the upper left of the figure). The offsets are constrained in magnitude, so the guesses still stay close to the anchors.
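
A sketch of how such an offset could be applied to an anchor to produce the predicted box (this mirrors the usual parameterization; exact details vary between implementations):

import math

def decode_box(anchor, offsets):
    # Apply predicted offsets (tx, ty, tw, th) to an anchor (center_x, center_y, w, h).
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx = ax + tx * aw              # shift the anchor center
    cy = ay + ty * ah
    w = aw * math.exp(tw)          # rescale width and height
    h = ah * math.exp(th)
    return cx, cy, w, h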

To make k predictions for each location, we need k anchors centered at each location. Each prediction is associated with a specific anchor, but different locations share anchors of the same shape.

These anchors are carefully chosen so that they are diverse and cover realistic targets with different scales and aspect ratios. This allows us to guide initial training with better guesses and allows each prediction to be specialized for a specific shape. This strategy makes early training more stable and easier.

Faster R-CNN uses more anchors: it deploys 9 anchor boxes, combining 3 different sizes with 3 different aspect ratios. Each location uses these 9 anchors and produces 2×9 objectness scores and 4×9 coordinates.

Source: https://arxiv.org/pdf/1506.01497.pdf
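
A sketch of generating these 9 anchors per location (the paper uses 3 scales of 128, 256, and 512 pixels and aspect ratios of 1:1, 1:2, and 2:1; the stride value below is an assumption for illustration):

import itertools

def make_anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # One (width, height) pair per scale/aspect-ratio combination: 9 shapes in total.
    shapes = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        shapes.append((w, h))
    return shapes

def anchors_for_feature_map(width, height, stride=16):
    # Center the same 9 anchor shapes at every feature-map location,
    # mapped back to image coordinates by the stride.
    shapes = make_anchor_shapes()
    return [(x * stride, y * stride, w, h)
            for y in range(height) for x in range(width) for (w, h) in shapes]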

Performance of the R-CNN methods

As shown in the figure below, Faster R-CNN is much faster.

Region-Based Fully Convolutional Neural Network (R-FCN)

Suppose we have only one feature map, and it detects the right eye. Can we use it to locate a face? Yes: since the right eye should appear in the upper-left region of a face image, we can use that to locate the whole face.

If we have other feature maps that we use to detect the left eye, nose or mouth, then we can combine the detections to better localize the face.

Now let's review the problem. In Faster R-CNN, the detector applies multiple fully connected layers to make predictions for each ROI. With 2,000 ROIs, this is very expensive.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs
patch = roi_pooling(feature_maps, ROI)
class_scores, box = detector(patch) # Expensive!
class_probabilities = softmax(class_scores)

R-FCN speeds this up by reducing the amount of work required per ROI. The region-based score maps are independent of any particular ROI and can be computed once, outside the loop over ROIs. The remaining per-ROI work is much simpler, so R-FCN is faster than Faster R-CNN.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
score_maps = compute_score_map(feature_maps)
for ROI in ROIs
V = region_roi_pool(score_maps, ROI)
class_scores, box = average(V) # Much simpler!
class_probabilities = softmax(class_scores)

Now let's look at a 5×5 feature map M that contains a blue square. We divide the square evenly into 3×3 regions. From M we create a new feature map that detects only the top-left (TL) corner of the square; this new feature map is shown in the figure below (right). Only the yellow grid cell [2, 2] is activated.

From the feature map on the left, create a new feature map (right) for detecting the top-left corner of the object.

We divide the square into 9 parts, thus creating 9 feature maps, each used to detect the corresponding object region. These feature maps are called position-sensitive score maps, because each map detects a sub-region of the object (computes its score).

Generate 9 score maps

The red dashed rectangle in the image below is the proposed ROI. We divide it into 3×3 regions and ask how likely each region is to contain the corresponding part of the object, for example the probability that the top-left ROI region contains the left eye. We store the results in a 3×3 vote array, shown in the image below (right); for example, vote_array[0][0] holds the score for whether the top-left region contains the corresponding part of the object.

Apply the ROI to the feature map, outputting a 3 x 3 array.

The process of mapping score maps and ROIs to vote arrays is called position-sensitive ROI-pooling. This process is very close to the ROI pooling discussed earlier.

A portion of the ROI is superimposed on the corresponding score map to calculate V[i][j].

After all the values of position-sensitive ROI pooling have been computed, the class score is the average of all its elements.

ROI pooling

Suppose we have C classes to detect. We extend this to C+1 classes by adding a class for the background (no object). Each class has its own 3×3 set of score maps, so there are (C+1)×3×3 score maps in total. The score maps of each class are used to predict a class score for that class, and we then apply a softmax over these scores to get the probability of each class.
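
A minimal sketch of position-sensitive ROI pooling and voting (shapes and bin splitting are simplified; here score_maps is assumed to have shape (C+1, k, k, H, W) with k=3):

import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    # score_maps: (num_classes, k, k, H, W); roi: (x0, y0, x1, y1) in map coordinates.
    num_classes = score_maps.shape[0]
    x0, y0, x1, y1 = roi
    xs = np.linspace(x0, x1, k + 1).astype(int)       # bin edges along the width
    ys = np.linspace(y0, y1, k + 1).astype(int)       # bin edges along the height
    class_scores = np.zeros(num_classes)
    for c in range(num_classes):
        votes = np.zeros((k, k))
        for i in range(k):
            for j in range(k):
                # Bin (i, j) of the ROI is pooled only from score map (i, j) of class c.
                part = score_maps[c, i, j, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                votes[i, j] = part.mean()
        class_scores[c] = votes.mean()                # average the vote array
    return class_scores                               # feed these into a softmax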

Below is the data flow graph, in our case, k=3.

Summary

We started with the basic sliding window algorithm:

for window in windows
patch = get_patch(image, window)
results = detector(patch)

Then we reduced the number of windows to minimize the amount of work inside the for loop.

ROIs = region_proposal(image)
for ROI in ROIs
patch = get_patch(image, ROI)
results = detector(patch)

Single-shot object detectors

In the second part, we will review single-shot object detectors (including SSD, YOLO, YOLOv2, and YOLOv3). We will analyze FPN to understand how multi-scale feature maps improve accuracy, especially for small objects, which single-shot detectors often handle poorly. We will then analyze focal loss and RetinaNet to see how they address class imbalance during training.

Single-shot detector

In Faster R-CNN, there is a dedicated region proposal network followed by a classifier.

Faster R-CNN Workflow

Region-based detectors are very accurate, but this comes at a cost: Faster R-CNN processes images at 7 frames per second (7 FPS) on the PASCAL VOC 2007 test set. Like R-FCN, researchers streamline the pipeline by reducing the amount of work done per ROI.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs
patch = roi_align(feature_maps, ROI)
results = detector2(patch) # Reduce the amount of work here!

But do we actually need a separate region proposal step? Can we obtain the bounding boxes and classes directly in a single step?

feature_maps = process(image)
results = detector3(feature_maps)
# No more separate step for ROIs

Let's take another look at the sliding window detector. We can detect objects by sliding a window over the feature map, using different window types for different object types. The fatal flaw of the earlier sliding window approach is that it uses the window as the final bounding box, which requires a huge number of window shapes to cover most objects. A more efficient approach is to treat the window as an initial guess, so that we get a detector that predicts both the class and the bounding box from the current sliding window.

Prediction based on sliding windows

This concept is very similar to anchors in Faster R-CNN. However, a one-shot detector predicts both bounding boxes and classes. For example, we have an 8 × 8 feature map and make k predictions at each location, i.e. 8 × 8 × k predictions in total.

64 locations

At each location, we have k anchors (anchors are fixed initial bounding box guesses), and each prediction is associated with a specific anchor. The anchors are chosen carefully, and every location uses the same set of anchor shapes.

Use 4 anchors to make 4 predictions at each location.

Below are 4 anchors (green) and 4 corresponding predictions (blue), each for a specific anchor.

4 predictions, one for each anchor.

In Faster R-CNN, we use a convolution kernel to predict 5 parameters: 4 parameters describe the bounding box predicted relative to an anchor, and 1 parameter is the objectness confidence score. So a 3×3×D×5 convolution kernel transforms the feature map from 8×8×D to 8×8×5.

Predictions are computed using a 3x3 convolution kernel.

In a one-shot detector, the convolution kernel also predicts C class probabilities to perform classification (one for each class). Therefore we apply a 3 × 3 × D × 25 convolution kernel to transform the feature map from 8 × 8 × D to 8 × 8 × 25 (C=20).

Each location makes k predictions, each with 25 parameters.
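
A sketch of such a prediction head in PyTorch (the shapes follow the example above: D input channels, k anchors per location, and 25 = 4 box offsets + 1 objectness score + C = 20 class scores per anchor; this is illustrative rather than any particular detector's exact layer):

import torch.nn as nn

class DetectionHead(nn.Module):
    # Predicts k x 25 values at every feature-map location.
    def __init__(self, in_channels, k=4, num_classes=20):
        super().__init__()
        self.k = k
        self.values_per_anchor = 4 + 1 + num_classes   # box + objectness + classes
        self.conv = nn.Conv2d(in_channels, k * self.values_per_anchor,
                              kernel_size=3, padding=1)

    def forward(self, feature_map):                    # (N, D, 8, 8)
        out = self.conv(feature_map)                   # (N, k*25, 8, 8)
        n, _, h, w = out.shape
        return out.view(n, self.k, self.values_per_anchor, h, w)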

Single-shot detectors usually trade accuracy for real-time processing speed, and they tend to have trouble detecting objects that are too close together or too small. In the image below, there are 9 Santas in the lower-left corner, but a single-shot detector only found 5 of them.

SSD

SSD is a single-shot detector that uses a VGG16 network as its feature extractor (the same kind of CNN used in Faster R-CNN). We add custom convolutional layers (blue) after this network and make predictions with convolution kernels (green).

Perform a single prediction for both category and location at the same time.

However, the convolutional layers reduce the spatial dimension and resolution, so the model above can only detect larger objects. To address this problem, we perform independent object detections from multiple feature maps.

Use multi-scale feature maps for detection.
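
A rough sketch of that multi-scale idea (the stand-in backbone stages and channel sizes are made up for illustration; the real SSD attaches its heads to specific VGG16 layers and extra convolutional layers):

import torch.nn as nn

class MultiScaleDetectorSketch(nn.Module):
    # Run a detection head on feature maps of decreasing resolution, so the
    # earlier (higher-resolution) maps can catch smaller objects.
    def __init__(self, k=4, num_classes=20):
        super().__init__()
        out = k * (4 + 1 + num_classes)
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
        self.heads = nn.ModuleList([
            nn.Conv2d(64, out, 3, padding=1),
            nn.Conv2d(128, out, 3, padding=1),
            nn.Conv2d(256, out, 3, padding=1),
        ])

    def forward(self, image):
        f1 = self.stage1(image)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # Independent detections from each scale; they are merged and post-processed later.
        return [head(f) for head, f in zip(self.heads, (f1, f2, f3))]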
