Object Detection with Deep Learning (1): Faster R-CNN


1. Introduction to Object Detection

1.1 Classification of Object Detection Networks

  1. One-Stage: SSD, YOLO
    1) Directly classify and regress bounding boxes based on anchors
    Advantage: fast detection speed

  2. Two-Stage: Faster R-CNN
    1) A dedicated module (the RPN) generates candidate boxes: it finds the foreground and adjusts bounding boxes based on anchors (the foreground is the target to be detected and recognized; everything else is called background)
    2) The candidate boxes (proposals) generated in the previous step are then further classified and their bounding boxes refined
    Advantage: higher detection accuracy

2. R-CNN

Original paper: Rich feature hierarchies for accurate object detection and semantic segmentation

R-CNN can be regarded as the pioneering work in applying deep learning to object detection.

2.1 R-CNN algorithm flow

The R-CNN algorithm process can be divided into 4 steps:
1. Generate 1k–2k candidate regions per image (using the Selective Search method)
2. For each candidate region, use a deep network to extract features
3. Feed the features into one SVM classifier per category to determine whether the region belongs to that category
4. Use regressors to fine-tune the positions of the candidate boxes

2.1.1 Generation of candidate regions

Use the Selective Search algorithm to obtain some initial regions through image segmentation, then merge these regions with a set of merging strategies to obtain a hierarchical region structure; these structures contain the possible objects.
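As an illustration, OpenCV's contrib module ships an implementation of Selective Search; below is a minimal sketch (it assumes the opencv-contrib-python package and a hypothetical input image example.jpg):

```python
import cv2

# Selective Search from OpenCV's contrib module (requires opencv-contrib-python)
img = cv2.imread("example.jpg")  # hypothetical input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # faster preset; a slower "quality" preset also exists
rects = ss.process()              # proposals as (x, y, w, h), ranked by the merging strategy
proposals = rects[:2000]          # R-CNN keeps on the order of 2k proposals
print(f"{len(rects)} regions proposed, keeping {len(proposals)}")
```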

2.1.2 Feature extraction by deep network

Warp the 2000 candidate regions to 227×227 pixels, then input each into the pre-trained AlexNet CNN to extract a 4096-dimensional feature vector, obtaining a 2000×4096 feature matrix.
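A minimal PyTorch sketch of this step, assuming the 2000 regions have already been cropped and warped to 227×227 (the random `crops` tensor stands in for them):

```python
import torch
import torchvision

# Pre-trained AlexNet (torchvision >= 0.13 weights API); we keep everything
# up to the penultimate FC layer, which outputs 4096-d features
alexnet = torchvision.models.alexnet(weights="IMAGENET1K_V1")
alexnet.eval()
feature_extractor = torch.nn.Sequential(
    alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],  # drop the final 1000-way classifier
)

crops = torch.randn(2000, 3, 227, 227)  # stand-in for 2000 warped candidate regions
with torch.no_grad():                   # in practice, process in mini-batches
    feats = feature_extractor(crops)
print(feats.shape)                      # torch.Size([2000, 4096])
```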

2.1.3 SVM classifiers determine the category

Multiply the 2000×4096 feature matrix by the 4096×20 weight matrix formed by the 20 SVMs to obtain a 2000×20 score matrix, where each entry is the score of a proposal for a particular category. Then apply non-maximum suppression to each column of this 2000×20 matrix (i.e., to each category) to remove overlapping proposals, keeping only the highest-scoring proposals in that category.
(The Pascal VOC dataset is taken as an example here; it has 20 categories.)
Non-maximum suppression (NMS):
IoU (Intersection over Union) measures the overlap between two boxes: IoU = area of intersection / area of union.
NMS removes overlapping proposals as follows:
1) Find the target with the highest score
2) Compute the IoU between all other targets and this target
3) Delete all targets whose IoU value is greater than a given threshold, then repeat with the remaining targets
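A minimal NumPy sketch of these three steps (boxes are assumed to be in (x1, y1, x2, y2) format):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. boxes: (N, 4) as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]                        # 1) box with the highest remaining score
        keep.append(i)
        # 2) IoU between this box and all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # 3) drop boxes whose IoU with the kept box exceeds the threshold
        order = order[1:][iou <= iou_threshold]
    return keep
```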

2.1.4 Regressors fine-tune the candidate box positions

The proposals remaining after NMS are further screened, and then 20 per-category regressors perform regression on the remaining proposals of the 20 categories, finally yielding the highest-scoring corrected bounding box for each category.


2.2 R-CNN framework

(figure: the R-CNN framework — Selective Search region proposals → CNN feature extraction → SVM classification → bounding-box regression)

2.3 Problems in R-CNN

1. Slow test speed:
Testing one image takes about 53 s on a CPU, of which extracting candidate boxes with the Selective Search algorithm takes about 2 s. The candidate boxes in an image overlap heavily, so the feature extraction operations are highly redundant.
2. Slow training speed:
The process is extremely cumbersome.
3. Large storage requirements for training:
For SVM and bbox-regressor training, features must be extracted from every target candidate box in every image and written to disk. For very deep networks such as VGG16, the features extracted from the 5k images of the VOC07 training set require hundreds of GB of storage.

3. Fast R-CNN

Original paper: Fast R-CNN
Fast R-CNN is another masterpiece by Ross Girshick after R-CNN. Also using VGG16 as the network backbone, compared with R-CNN its training time is 9× faster, its test-time inference is 213× faster, and accuracy rises from 62% to 66% mAP (on the Pascal VOC dataset).

3.1 Fast R-CNN steps

1) Generate 1k–2k candidate regions per image (using the Selective Search method)
2) Input the whole image into the network to obtain the corresponding feature map, and project the candidate boxes generated by the SS algorithm onto the feature map to obtain the corresponding feature matrices
3) Scale each feature matrix to a 7×7 feature map through the ROI pooling layer, then flatten it and pass it through a series of fully connected layers to obtain the prediction results
(ROI — Region of Interest)

3.2 Detailed process

3.2.1 Compute whole-image features once

Fast R-CNN feeds the entire image into the network once and then extracts the features of the candidate regions directly from the resulting feature map, so the features of these candidate regions do not need to be recomputed.

3.2.2 ROI Pooling Layer

The feature-map region of each candidate area is divided into a 7×7 grid, and max pooling is applied within each grid cell to obtain a 7×7 feature matrix.
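torchvision provides this op directly; a minimal sketch, where the feature-map size, boxes, and 1/16 scale are illustrative assumptions for a VGG16-like backbone:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 38, 63)   # hypothetical VGG16 conv output for a ~600x1000 image
# Proposals in original-image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0, 10.0, 20.0, 300.0, 400.0],
                     [0, 50.0, 60.0, 250.0, 350.0]])
# spatial_scale = 1/16 projects image coordinates onto the feature map (VGG16 downsamples 16x)
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7])
```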

3.2.3 Classifiers

Outputs the probabilities of N+1 categories (N is the number of object categories, 1 is the background), for a total of N+1 output nodes.

3.2.4 Bounding box regressor

Outputs the bounding-box regression parameters (dx, dy, dw, dh) for each of the N+1 categories, for a total of (N+1)×4 output nodes.
The final bounding box is obtained from a proposal P and its regression parameters as follows:

Ĝ_x = P_w · d_x + P_x
Ĝ_y = P_h · d_y + P_y
Ĝ_w = P_w · exp(d_w)
Ĝ_h = P_h · exp(d_h)

where (P_x, P_y) is the center and (P_w, P_h) are the width and height of the proposal; the regression parameters need to be learned during training.
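A sketch of the two sibling output layers and of applying the formulas above to decode a box (all names and sizes here are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Sketch of the two sibling output layers on top of the 4096-d FC features."""
    def __init__(self, in_features=4096, num_classes=20):
        super().__init__()
        self.cls_score = nn.Linear(in_features, num_classes + 1)        # N+1 class scores
        self.bbox_pred = nn.Linear(in_features, (num_classes + 1) * 4)  # (N+1)*4 regression params

    def forward(self, x):
        return self.cls_score(x), self.bbox_pred(x)

def decode_boxes(proposals, deltas):
    """Apply (dx, dy, dw, dh) to proposals given as (x1, y1, x2, y2)."""
    px = (proposals[:, 0] + proposals[:, 2]) / 2   # proposal centers and sizes
    py = (proposals[:, 1] + proposals[:, 3]) / 2
    pw = proposals[:, 2] - proposals[:, 0]
    ph = proposals[:, 3] - proposals[:, 1]
    dx, dy, dw, dh = deltas.unbind(dim=1)
    gx = pw * dx + px                              # G_x = P_w * d_x + P_x
    gy = ph * dy + py                              # G_y = P_h * d_y + P_y
    gw = pw * torch.exp(dw)                        # G_w = P_w * exp(d_w)
    gh = ph * torch.exp(dh)                        # G_h = P_h * exp(d_h)
    return torch.stack([gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2], dim=1)
```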

3.3 Fast R-CNN loss function

Multi-task loss:

L(p, u, t^u, v) = L_cls(p, u) + λ · [u ≥ 1] · L_loc(t^u, v)

where p is the predicted softmax distribution, u the ground-truth class, t^u the predicted regression parameters for class u, and v the ground-truth box regression targets.

3.3.1 Classification loss

L_cls(p, u) = −log p_u, i.e., the negative log of the probability predicted for the true class u.

3.3.2 Bounding box regression loss

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i^u − v_i)

Iverson bracket [u ≥ 1]: equals 1 when u ≥ 1 and 0 otherwise; that is, for negative (background) samples there is no bounding-box loss.

Supplement: the smooth L1 function is defined as smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise.
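Putting the pieces together, a minimal sketch of the multi-task loss (it assumes `bbox_preds` has already been gathered down to the 4 regression parameters of each sample's ground-truth class):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_preds, labels, regression_targets, lam=1.0):
    """Multi-task loss sketch: L = L_cls + lambda * [u >= 1] * L_loc.

    cls_scores: (M, N+1), bbox_preds: (M, 4) for the ground-truth class,
    labels: (M,) long with 0 = background, regression_targets: (M, 4).
    """
    cls_loss = F.cross_entropy(cls_scores, labels)        # L_cls = -log p_u
    fg = labels > 0                                       # Iverson bracket [u >= 1]
    if fg.any():
        # Only foreground samples contribute a box-regression loss
        loc_loss = F.smooth_l1_loss(bbox_preds[fg], regression_targets[fg])
    else:
        loc_loss = bbox_preds.sum() * 0                   # keep the graph connected
    return cls_loss + lam * loc_loss
```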

3.4 Fast R-CNN framework

(figure: the Fast R-CNN framework — whole-image CNN → ROI pooling → FC layers → softmax classifier + bbox regressors)

4. Faster R-CNN

Faster R-CNN is another masterpiece by Ross Girshick after Fast R-CNN. Also using VGG16 as the network backbone, its inference speed reaches 5 fps on a GPU (including the generation of candidate regions), and the accuracy is further improved. It won first place in several tracks of the 2015 ILSVRC and COCO competitions.

4.1 Faster R-CNN steps

1) Input the image into the network to obtain the corresponding feature map
2) Use the RPN structure to generate candidate boxes, and project the candidate boxes generated by the RPN onto the feature map to obtain the corresponding feature matrices.
3) Scale each feature matrix to a 7×7 feature map through the ROI pooling layer, then flatten it and pass it through a series of fully connected layers to obtain the prediction results.

(Faster R-CNN = RPN + Fast R-CNN)
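For reference, torchvision ships a ready-made Faster R-CNN; a minimal usage sketch (note its default backbone is ResNet50+FPN rather than the paper's VGG16):

```python
import torch
import torchvision

# Pre-trained Faster R-CNN from torchvision (>= 0.13 weights API)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 600, 800)          # hypothetical image tensor with values in [0, 1]
with torch.no_grad():
    predictions = model([image])         # list with one dict per image
print(predictions[0].keys())             # dict_keys(['boxes', 'labels', 'scores'])
```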

4.2 RPN

For each 3×3 sliding window on the feature map, compute the point on the original image corresponding to the center of the sliding window, and generate k anchor boxes around it
(the k anchor boxes are boxes with given fixed sizes).
Each window produces 2k scores (foreground vs. background for each anchor),
along with 4k bounding-box regression parameters.
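A sketch of this structure, assuming a 512-channel VGG16 feature map and k = 9 anchors (layer names are illustrative):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a 3x3 conv sliding window, then two 1x1 conv siblings."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, kernel_size=1)   # 2k foreground/background scores
        self.reg = nn.Conv2d(in_channels, 4 * k, kernel_size=1)   # 4k box regression parameters

    def forward(self, feature_map):
        x = self.conv(feature_map).relu()
        return self.cls(x), self.reg(x)
```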

Anchor areas and aspect ratios used:
areas: {128², 256², 512²}
ratios: {1:1, 1:2, 2:1}

So there are 3 × 3 = 9 anchors in total at each position.

From the above, for a 1000×600×3 image there are about 60×40×9 ≈ 20k anchors; after ignoring the anchors that cross the image boundary, about 6k anchors remain. The candidate boxes generated by the RPN overlap heavily, so non-maximum suppression with an IoU threshold of 0.7 is applied based on the cls scores of the candidate boxes, leaving about 2k candidate boxes per image.
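A minimal sketch of generating the 9 base anchors from these areas and ratios (shifting them to every feature-map position is omitted):

```python
import numpy as np

def generate_base_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """9 base anchors (x1, y1, x2, y2) centered at the origin."""
    anchors = []
    for s in scales:          # anchor areas s^2: 128^2, 256^2, 512^2
        for r in ratios:      # aspect ratios h/w: 1:1, 1:2, 2:1
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)  # w * h == s^2 for every ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

base = generate_base_anchors()
print(base.shape)  # (9, 4) -- shifted to every position this yields ~60*40*9 anchors
```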

4.2.1 Sampling

During training, 256 samples are selected from all anchors: ideally 128 positive and 128 negative samples. If there are not enough positive samples, the batch is padded with negative samples.
Definition of a positive sample: an anchor whose IoU with a ground-truth box is greater than 0.7, or the anchor with the largest IoU with a given ground-truth box.
Negative sample: an anchor whose IoU with all ground-truth boxes is less than 0.3.
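A minimal sketch of this sampling rule, assuming anchor labels have already been assigned by the IoU criteria above (1 = positive, 0 = negative, -1 = ignored):

```python
import numpy as np

def sample_anchors(labels, num_samples=256, pos_fraction=0.5, rng=np.random):
    """Return indices of sampled positive and negative anchors."""
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    num_pos = min(len(pos), int(num_samples * pos_fraction))   # at most 128 positives
    num_neg = num_samples - num_pos                            # pad with negatives if short
    pos = rng.choice(pos, num_pos, replace=False)
    neg = rng.choice(neg, num_neg, replace=False)              # negatives are plentiful in practice
    return pos, neg
```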

4.2.2 RPN loss

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)

Classification loss (log loss over foreground/background):
L_cls(p_i, p_i*) = −[p_i* · log p_i + (1 − p_i*) · log(1 − p_i)]

Bounding box regression loss (applied only to positive anchors, p_i* = 1):
L_reg(t_i, t_i*) = Σ_j smooth_L1(t_{i,j} − t*_{i,j})

where p_i is the predicted probability that anchor i is foreground and p_i* is its ground-truth label (1 for positive, 0 for negative).

4.3 Faster R-CNN Training

Nowadays the joint training method of RPN loss + Fast R-CNN loss is adopted directly.

The method used in the original paper was to train the RPN and Fast R-CNN separately:
1) Initialize the backbone convolutional layers with an ImageNet-pretrained classification model, and train the RPN parameters alone
2) Fix the RPN-specific convolutional and fully connected layer parameters, re-initialize the backbone convolutional layers with the ImageNet-pretrained parameters, and train the Fast R-CNN parameters using the proposals generated by the RPN
3) Fix the backbone convolutional layers trained by Fast R-CNN and fine-tune the RPN-specific convolutional and fully connected layers; the RPN and Fast R-CNN now share the backbone convolutional layers, forming a single unified network

4.4 Faster R-CNN framework

(figure: the overall Faster R-CNN framework — backbone CNN → RPN → ROI pooling → FC layers → classification and bbox regression)

5. FPN

Original paper: Feature Pyramid Networks for Object Detection

Origin blog.csdn.net/weixin_43869415/article/details/121583737