Deep learning - object detection (R-CNN, Fast R-CNN, Faster R-CNN)

Table of contents

1. R-CNN

2. Fast R-CNN

3. Faster R-CNN

4. FPN (Feature Pyramid Networks)


1. R-CNN

Selective search produces about 2000 candidate boxes; each box is passed through the CNN to obtain a 4096-dimensional feature vector, and these features are fed into per-class SVMs to obtain 20 classification scores.
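At a shape level, this scoring step is just a matrix product. A toy numpy sketch with random stand-in features and SVM weights (both hypothetical, only the shapes come from the text):

```python
import numpy as np

# Shapes from the text: ~2000 proposals, 4096-d CNN features,
# 20 classes, one linear SVM per class.
features = np.random.randn(2000, 4096)   # CNN features, one row per proposal
svm_weights = np.random.randn(4096, 20)  # stacked per-class SVM weight vectors
svm_bias = np.random.randn(20)

scores = features @ svm_weights + svm_bias  # per-class scores for every proposal
print(scores.shape)  # (2000, 20)
```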

Suppose that after SVM classification, box A has probability 0.98 and box B has probability 0.86 for the same class. If the IoU of the two bounding boxes is greater than the chosen threshold, they are treated as the same object and the lower-probability box is deleted (non-maximum suppression).
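A minimal single-class NMS sketch of that rule (pure Python; boxes as (x1, y1, x2, y2), and the 0.98/0.86 example above maps directly onto the scores list):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box; drop any box overlapping it above thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

# Boxes A (score 0.98) and B (score 0.86) overlap heavily -> B is suppressed.
print(nms([[0, 0, 10, 10], [1, 1, 11, 11]], [0.98, 0.86]))  # [0]
```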

In the bounding-box regression, the targets are four-dimensional: center offsets in x and y plus width and height scale ratios (see the v parameters in section 2).

2. Fast R-CNN

R-CNN feeds each candidate region through the network separately, whereas Fast R-CNN feeds the entire image through the network once and extracts per-region features from the shared feature map.

Object detection architectures can usually be divided into two stages:
(1) Region proposal: given an input image, find all locations where objects may exist. The output of this stage is a set of bounding boxes over possible object locations, usually called region proposals or regions of interest (RoIs). Typical methods for this stage are sliding windows and selective search.
(2) Final classification: determine whether each region proposal from the previous stage belongs to a target category or to the background.
 

The specific operations of RoI pooling are as follows (a code sketch follows the list):

  1. Map the RoI onto the corresponding position of the feature map, according to the network's downsampling ratio;
  2. Divide the mapped region into sections of equal size (the number of sections equals the output dimensions);
  3. Perform a max pooling operation on each section.
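These three steps are what torchvision.ops.roi_pool implements. A minimal sketch, assuming a stride-16 backbone (hence spatial_scale = 1/16) and a 7x7 output grid:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)  # (N, C, H, W) backbone output
# RoIs in *image* coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0., 64., 64., 320., 320.],
                     [0., 0., 0., 128., 256.]])

# spatial_scale performs step 1 (image -> feature-map mapping);
# output_size performs steps 2-3 (equal sections + max pool per section).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```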

During training, 64 candidate regions are sampled from the ~2000 candidate boxes for each image; some are positive samples and some are negative.

A proposal whose IoU with a ground-truth box is at least 0.5 is a positive sample; proposals with IoU between 0.1 and 0.5 are taken as negative samples (a sampling sketch follows).
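A minimal sketch of this sampling rule, assuming a hypothetical helper that is given each proposal's maximum IoU over the ground-truth boxes (the 25% positive fraction is the Fast R-CNN paper's setting):

```python
import random

def sample_rois(max_ious, num_samples=64, pos_fraction=0.25):
    """Partition RoIs by the IoU thresholds and draw a fixed-size minibatch."""
    pos = [i for i, iou in enumerate(max_ious) if iou >= 0.5]   # positives
    neg = [i for i, iou in enumerate(max_ious) if 0.1 <= iou < 0.5]  # negatives
    num_pos = min(len(pos), int(num_samples * pos_fraction))
    pos = random.sample(pos, num_pos)
    neg = random.sample(neg, min(len(neg), num_samples - num_pos))
    return pos, neg

pos, neg = sample_rois([0.8, 0.6, 0.3, 0.2, 0.05])
print(len(pos), len(neg))  # 1 positive (capped at 25% of 64 is 16), 2 negatives
```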

Finally, two fully connected layers are attached in parallel: one predicts the classification probabilities, the other predicts the bounding-box regression parameters.

Each category has its own 4 bounding-box regression parameters.
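A minimal PyTorch sketch of the two parallel heads (the 4096-d input matches the feature size mentioned in section 1; giving the bbox branch 4 parameters per class including background is an implementation choice in this sketch):

```python
import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Two sibling fully connected layers on top of the pooled RoI features."""
    def __init__(self, in_features=4096, num_classes=21):  # 20 classes + background
        super().__init__()
        self.cls_score = nn.Linear(in_features, num_classes)      # class probabilities
        self.bbox_pred = nn.Linear(in_features, num_classes * 4)  # 4 params per class
    def forward(self, x):
        return self.cls_score(x), self.bbox_pred(x)

head = FastRCNNHead()
scores, deltas = head(torch.randn(64, 4096))  # 64 sampled RoIs
print(scores.shape, deltas.shape)  # torch.Size([64, 21]) torch.Size([64, 84])
```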

The classification loss here is actually cross-entropy loss.

[u ≥ 1] means: this term equals 1 when the class label u satisfies u ≥ 1 (a foreground class); when u = 0 the RoI is background and the term equals 0, so the regression loss is only applied to foreground RoIs.
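Written out, the multi-task loss these terms come from (Fast R-CNN paper) is:

       L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)

where L_cls(p, u) = −log p_u is the cross-entropy term above and L_loc is a smooth-L1 loss summed over the four box parameters x, y, w, h.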

True box parameters v are computed from the ground-truth box G and the proposal P. The full set is v_x = (G_x − P_x)/P_w, v_y = (G_y − P_y)/P_h, v_w = log(G_w/P_w), v_h = log(G_h/P_h).
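A sketch of this encoding and its inverse (boxes as center/size tuples (x, y, w, h); the function names are illustrative):

```python
import math

def encode(P, G):
    """Regression targets v from proposal P and ground-truth box G."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode(P, v):
    """Apply predicted deltas v to proposal P to recover a box."""
    px, py, pw, ph = P
    vx, vy, vw, vh = v
    return (px + vx * pw, py + vy * ph,
            pw * math.exp(vw), ph * math.exp(vh))

v = encode((50, 50, 100, 200), (60, 70, 110, 180))
print(decode((50, 50, 100, 200), v))  # recovers (60, 70, 110, 180)
```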

3. Faster R-CNN

Projection method: find the position of the candidate box (RoI) on the original image, then scale it by the network's downsampling ratio to the corresponding position in the feature map.


For each position on the feature map, the RPN predicts 2k scores (foreground vs. background for each of the k anchors at that position) and 4k bounding-box regression parameters.
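A minimal sketch of the RPN head (a 3x3 conv followed by two sibling 1x1 convs, as in the Faster R-CNN paper; the 256 channels are an assumption):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 conv followed by two sibling 1x1 convs, one per output type."""
    def __init__(self, in_channels=256, k=9):  # k = 3 scales x 3 aspect ratios
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, 1)  # foreground/background per anchor
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)  # box deltas per anchor
    def forward(self, x):
        x = torch.relu(self.conv(x))
        return self.cls(x), self.reg(x)

head = RPNHead()
cls, reg = head(torch.randn(1, 256, 50, 50))
print(cls.shape, reg.shape)  # (1, 18, 50, 50) (1, 36, 50, 50)
```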

The backbone is the network structure used for feature extraction (e.g. VGG or ResNet).

The RPN only predicts whether each anchor is foreground or background; it does not classify the object into a category.

A position with a small receptive field can still predict an anchor larger than that receptive field: seeing part of an object is often enough to infer its full extent.

For each image, 256 anchors are sampled from the tens of thousands of anchors, composed of positive and negative samples at roughly a 1:1 ratio. If there are fewer than 128 positive samples, negative samples fill the rest.

Positive sample: (1) an anchor whose IoU with a ground-truth box is greater than 0.7;

               (2) the anchor with the largest IoU with a ground-truth box (used when no anchor reaches 0.7).

Negative sample: an anchor whose IoU with every ground-truth box is less than 0.3.

All other anchors are discarded (a labeling sketch follows).
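A sketch of this assignment, assuming a precomputed iou_matrix where entry [i, j] is the IoU of anchor i with ground-truth box j (labels: 1 positive, 0 negative, -1 discarded):

```python
import numpy as np

def assign_anchor_labels(iou_matrix, pos_thresh=0.7, neg_thresh=0.3):
    labels = np.full(iou_matrix.shape[0], -1)  # -1: discarded/ignored
    max_iou = iou_matrix.max(axis=1)           # best IoU per anchor
    labels[max_iou < neg_thresh] = 0           # negative rule
    labels[max_iou > pos_thresh] = 1           # positive rule (1)
    # Positive rule (2): for each ground-truth box, the anchor with the
    # highest IoU is positive even if no anchor exceeds 0.7.
    labels[iou_matrix.argmax(axis=0)] = 1
    return labels

ious = np.array([[0.8, 0.1], [0.2, 0.5], [0.05, 0.1]])  # 3 anchors, 2 gt boxes
print(assign_anchor_labels(ious))  # [1, 1, 0]
```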

The regression weight λ·(1/N_reg) is sometimes replaced directly by 1/N_cls: with λ = 10 and N_reg ≈ 2400 (the number of feature-map locations), λ/N_reg ≈ 1/240, which is close to 1/N_cls = 1/256, so in practice there is little difference.
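For reference, the full RPN loss this normalization belongs to (Faster R-CNN paper) is:

       L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where p_i* is the ground-truth label of anchor i (1 for positive, 0 for negative), so the regression term only counts positive anchors.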

A note on the classification term: it can also be implemented as k binary cross-entropy (sigmoid) losses rather than a softmax over 2k scores. What is predicted is only whether each anchor is foreground or background, and p_i is then the single predicted probability that anchor i is an object; a value like 0.1 simply means the anchor is probably background. (With a two-way softmax, foreground and background probabilities would instead be output as a complementary pair.)

4. FPN (Feature Pyramid Networks)

FPN fuses features from multiple levels and makes predictions on each fused feature map.

When a small (higher-level) feature map is fused with a larger one, it is first upsampled to match the larger size.
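A sketch of one top-down fusion step under the FPN paper's design (1x1 lateral conv, 2x nearest-neighbor upsampling, elementwise add, then a 3x3 smoothing conv; the channel counts are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFuse(nn.Module):
    """Fuse a higher-level (smaller) map into the next lower-level (larger) map."""
    def __init__(self, c_low, c_out=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_low, c_out, 1)             # match channel count
        self.smooth = nn.Conv2d(c_out, c_out, 3, padding=1)   # reduce aliasing
    def forward(self, low, top):
        top_up = F.interpolate(top, scale_factor=2, mode="nearest")  # upsample small map
        return self.smooth(self.lateral(low) + top_up)

fuse = FPNFuse(c_low=512)
p4 = fuse(torch.randn(1, 512, 50, 50), torch.randn(1, 256, 25, 25))
print(p4.shape)  # torch.Size([1, 256, 50, 50])
```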

Proposals are generated by the RPN on the P2 to P6 feature maps.

The generated proposals are then used for Fast R-CNN prediction on the four feature maps P2 to P5.

Targets of different scales can be predicted on different feature levels. For example, P2 is a relatively low-level feature map; it retains more fine-grained detail, so it is better suited to predicting small targets. Anchors of area 32² with aspect ratios {1:2, 1:1, 2:1} are therefore generated on P2 (with areas 64², 128², 256², and 512² on P3 to P6, respectively).

Using a separate RPN and Fast R-CNN head on each feature level turns out to make little difference to the final result compared with using the same head, so the parameters can be shared across levels.

A formula maps each proposal to a feature level: k = ⌊k0 + log2(√(wh)/224)⌋ with k0 = 4, where w and h are the proposal's width and height, so a 224×224 (ImageNet-sized) RoI maps to P4 and smaller RoIs map to finer levels.
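A sketch of that mapping, clamped to the P2 to P5 range used by the Fast R-CNN stage:

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Map an RoI of size w x h to a pyramid level P2..P5 (FPN Eq. 1)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k, k_max))

print(fpn_level(224, 224))  # 4 -> P4 (canonical ImageNet-sized RoI)
print(fpn_level(64, 64))    # 2 -> P2 (small RoI handled by a finer level)
```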


Origin: blog.csdn.net/qq_47941078/article/details/132539191