Notes 2 | Rich feature hierarchies for accurate object detection and semantic segmentation

The system is divided into three parts.

The first part is region proposal.

Method: selective search

One

Two Object detection with R-CNN

Algorithm flow: extract candidate regions, then extract a CNN feature vector from each region, and finally classify each region with class-specific SVMs.

1 Region proposal

Use the selective search method to obtain candidate regions (region proposals), about 2000 per image.

2 Feature extraction

2.1 Extraction process

        Each region proposal is first warped to a 227×227 image, and then AlexNet (five convolutional layers and two fully connected layers) is used to extract a 4096-dimensional feature vector.

2.2 Network framework

        The network is implemented in the Caffe framework; see caffe.berkeleyvision

2.3 Network description

2.3.1 The size of the input region needs to be adjusted to be consistent with the network input

Before feature extraction, each region proposal needs to be resized to 227×227.

[Adjustment method] The original paper discusses several adjustment methods; see "1 Transformation of object proposal boxes" in the supplements.

2.3.2 Time and Space Complexity Analysis

【Number of parameters】

        Because features are first extracted with the CNN and then classified with SVMs, the feature vectors fed to the SVMs are far lower-dimensional than those of competing methods (4K dimensions, versus 360K).

[Time] The running time of region proposal and feature extraction is amortized over all object categories. The only per-category cost is the dot product between the feature vectors and that category's SVM weights, followed by non-maximum suppression.
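The non-maximum suppression step mentioned above can be sketched as a greedy loop. This is a minimal illustration, not the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the overlap threshold of 0.3 is a placeholder chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections and one separate detection of the same class.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of surviving boxes: [0, 2]
```

NMS is run independently for each class, which is why it counts as a per-category cost.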

【Storage】

Method in this paper: R-CNN extracts a 2000×4096 feature matrix (number of regions × feature dimensions), and the SVM weights form a 4096×N matrix (feature dimensions × number of categories). Classification is then a single product of these two matrices.

Traditional methods: when high-dimensional features are used for SVM classification, a large amount of memory is required.

In comparison, the low-dimensional features extracted in this paper need very little memory.
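The matrix product described above can be illustrated with toy sizes. The sketch below uses plain Python lists in place of the 2000×4096 and 4096×N matrices; the function name and the numbers are made up for the example, and the storage figure assumes 4-byte floats.

```python
def score_regions(features, svm_weights):
    """Return scores[r][c]: dot product of region r's feature vector with
    column c of the weight matrix (rows = feature dims, columns = classes)."""
    num_classes = len(svm_weights[0])
    return [
        [sum(f * w_row[c] for f, w_row in zip(feat, svm_weights))
         for c in range(num_classes)]
        for feat in features
    ]


# Toy stand-in: 3 regions x 4 feature dimensions, 4 dimensions x 2 classes.
features = [[1.0, 0.0, 2.0, 0.0],
            [0.0, 1.0, 0.0, 1.0],
            [1.0, 1.0, 1.0, 1.0]]
svm_weights = [[0.5, -1.0],
               [0.0, 2.0],
               [1.0, 0.0],
               [-0.5, 1.0]]
scores = score_regions(features, svm_weights)
# scores == [[2.5, -1.0], [-0.5, 3.0], [1.0, 2.0]]

# Size of the real 2000 x 4096 feature matrix, assuming 4-byte floats.
feature_matrix_bytes = 2000 * 4096 * 4  # 32,768,000 bytes, about 33 MB
```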

3 Training

3.1 CNN pre-training

Supervised pre-training: the CNN is pre-trained on ILSVRC2012 (the images have class labels but no bounding-box labels). The last layer maps the 4096-dimensional features to the 1000 categories of the ILSVRC classification dataset. This pre-training gives AlexNet strong feature extraction capabilities learned from the image classification task.

3.2 Domain-specific fine-tuning

        To adapt the CNN to the new task (detection) and the new domain (warped VOC windows), the network parameters are trained further with SGD. The inputs of the network are now the warped region proposals from VOC.

【Network adjustment】

Replacement layer: the 1000-way classification layer used during pre-training is replaced with a randomly initialized 21-way classification layer (the 20 VOC classes plus background).

【Classification criteria】

Positive example: a region proposal whose IoU with a ground-truth box satisfies IoU \geq 0.5

Negative example: a region proposal whose IoU with every ground-truth box satisfies IoU < 0.5
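The criteria above can be sketched as a small labeling function. This is an illustrative sketch only: the `(x1, y1, x2, y2)` box format, the function names, and the convention that label 0 means background are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def finetune_label(proposal, gt_boxes, gt_classes, threshold=0.5):
    """Class of the best-overlapping ground-truth box if IoU >= threshold,
    otherwise 0 (background)."""
    best_iou, best_class = 0.0, 0
    for gt, cls in zip(gt_boxes, gt_classes):
        overlap = iou(proposal, gt)
        if overlap > best_iou:
            best_iou, best_class = overlap, cls
    return best_class if best_iou >= threshold else 0


gt_boxes, gt_classes = [(0, 0, 100, 100)], [7]
label_good = finetune_label((0, 0, 100, 80), gt_boxes, gt_classes)     # IoU 0.8 -> 7
label_edge = finetune_label((0, 0, 100, 50), gt_boxes, gt_classes)     # IoU 0.5 -> 7
label_bad = finetune_label((50, 50, 150, 150), gt_boxes, gt_classes)   # IoU ~0.14 -> 0
```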

【Training process】

Training uses mini-batch SGD: each mini-batch uniformly samples 32 positive samples (over all classes) and 96 negative (background) samples, 128 samples in total.

The parameters are updated from the average loss over the mini-batch, with a learning rate of 0.001 (0.1 times the pre-training rate).
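The sampling scheme above can be sketched as follows. The function name and the integer IDs standing in for warped windows are hypothetical; only the 32/96 split comes from the notes.

```python
import random


def sample_minibatch(positive_ids, negative_ids, num_pos=32, num_neg=96, rng=random):
    """Draw a 128-window mini-batch: 32 positives and 96 background windows,
    sampled uniformly without replacement, then shuffled."""
    batch = rng.sample(positive_ids, num_pos) + rng.sample(negative_ids, num_neg)
    rng.shuffle(batch)
    return batch


# Hypothetical pool: IDs below 500 are positive windows, the rest are background.
positives = list(range(0, 500))
negatives = list(range(500, 20000))
batch = sample_minibatch(positives, negatives, rng=random.Random(0))
num_positive_in_batch = sum(1 for i in batch if i < 500)
```

Fixing a quarter of each batch as positives biases sampling toward the rare positive windows. As written, `rng.sample` raises `ValueError` if fewer than 32 positives are available.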

3.3 SVM training

        The CNN outputs a 4096-dimensional feature vector for each region, which is then sent to the SVMs for classification; each SVM outputs the score of the region for its category. An SVM is a binary classifier, so one SVM is trained per category.

【Training method】

The positive samples are the feature vectors extracted by the CNN from the annotated ground-truth bounding boxes.

The negative samples are the feature vectors extracted by the CNN from all region proposals whose IoU with the ground-truth bounding boxes of the category is below 0.3 (IoU < 0.3).

Because the number of negative samples is very large, standard hard negative mining is used to select representative negatives: every negative that the current model classifies wrongly is sent back as a hard negative for further training, until the performance of the model no longer improves.
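The mining loop described above can be sketched generically. Everything here is an illustrative assumption: `train` is a hypothetical callback that returns a scoring function, a negative counts as hard when it scores above 0, the loop stops when no new hard negatives appear (a simpler criterion than tracking model performance), and the 1-D threshold "classifier" is a toy stand-in for an SVM.

```python
def mine_hard_negatives(train, positives, negatives, initial=2, max_rounds=10):
    """Start from a few negatives, then repeatedly add the negatives that the
    current model wrongly scores as positive, retraining after each round."""
    cached = set(range(min(initial, len(negatives))))
    cache = [negatives[i] for i in sorted(cached)]
    score = train(positives, cache)
    for _ in range(max_rounds):
        hard = [i for i, x in enumerate(negatives)
                if i not in cached and score(x) > 0]
        if not hard:
            break  # no false positives left among the negatives
        cached.update(hard)
        cache.extend(negatives[i] for i in hard)
        score = train(positives, cache)
    return score, cache


def train_threshold(positives, negatives):
    """Toy 1-D 'classifier': threshold halfway between the two classes."""
    threshold = (min(positives) + max(negatives)) / 2.0
    return lambda x: x - threshold


positives = [5.0, 6.0, 7.0]
negatives = [0.0, 1.0, 0.5, 4.0, 3.0, 2.0]
score, cache = mine_hard_negatives(train_threshold, positives, negatives)
# Only 4.0 is ever misclassified, so the final cache is [0.0, 1.0, 4.0].
```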

【Discussion】

The IoU threshold for negative samples in the SVM stage is 0.3. The reasoning is that SVMs work well with small training sets, so the IoU restriction can be strict; the CNN overfits easily when the number of samples is small and needs a large amount of training data, so its IoU restriction is relaxed.

The IoU requirement for positive samples in fine-tuning is loose, so precise localization is not emphasized.

Stage | Positive samples | Negative samples
SVM | Annotated ground-truth bounding boxes | Selected by hard negative mining
CNN | Proposals with IoU \geq 0.5 with a ground-truth bounding box | Random sampling

Using SVM instead of softmax classification, mAP increased from 50.9% to 54.2%.

3.4 Training results

Three Visualization, ablation experiments, and error mode analysis

1 Visualizing learned features

The paper proposes a simple, non-parametric method to directly visualize what the network has learned.

1.1 Idea

        The idea is to single out one unit (feature) in the network and use it as if it were an object detector. Its activations are computed on a large set of region proposals (about 10 million), the proposals are sorted from the highest to the lowest activation, non-maximum suppression is applied, and the top-scoring regions are displayed.

1.2 Experimental approach

        The paper visualizes units from AlexNet's pool5 layer, whose feature map is 6×6×256. Each pool5 unit has a receptive field of 195×195 pixels on the 227×227 input.
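The 195×195 figure can be checked with the usual receptive-field recurrence. The kernel sizes and strides below are the standard AlexNet values, which are not listed in these notes and are therefore an assumption of this sketch; padding is ignored because it does not change the receptive-field size.

```python
# (name, kernel size, stride) for the layers between the input and pool5.
ALEXNET_LAYERS = [
    ("conv1", 11, 4), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1),
    ("pool5", 3, 2),
]


def receptive_field(layers):
    """Return (receptive field size, stride between neighbouring units),
    both measured in input pixels."""
    size, jump = 1, 1
    for _name, kernel, stride in layers:
        size += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride               # strides multiply as layers stack
    return size, jump


rf_size, rf_stride = receptive_field(ALEXNET_LAYERS)  # (195, 32)
```

A 195-pixel field on a 227-pixel input means that a central pool5 unit sees most of the warped proposal.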

        The network is fine-tuned on VOC 2007; each row in the figure below shows the 16 highest-scoring activations of one pool5 unit:

        The six selected units have learned different things: for example, the unit in the first row responds to human faces and the one in the second row to dog faces and dot arrays. The network appears to learn, at the pool5 layer, a representation that combines a small number of class-tuned features with a distributed representation of shape, texture, and similar properties.

2 Ablation experiment

2.1 Experimental content

        The detection performance of the last three layers of the CNN is analyzed layer by layer: the output of each layer is taken as the feature vector and classified with SVMs.

【Introduction to each layer】

pool5: outputs a 9216-dimensional feature vector (6×6×256)

Fully connected layer 6 (fc6): multiplies the pool5 feature vector on the left by a 4096×9216 weight matrix, adds a bias vector, and applies half-wave rectification (x ← max(0, x))

Fully connected layer 7 (fc7): multiplies the fc6 features by a 4096×4096 weight matrix, adds a bias vector, and applies half-wave rectification
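The two fully connected layers above follow the same pattern, which can be sketched with toy sizes. The function name and the small matrices are made up for the example; the real layers map 9216 to 4096 and 4096 to 4096 dimensions.

```python
def fully_connected_relu(weights, bias, x):
    """Compute max(0, W x + b), with W given as a list of rows."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


# Toy stand-ins: a 3-dimensional "pool5" vector and two 2-unit layers.
pool5 = [1.0, -2.0, 0.5]
w6, b6 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]], [0.0, 0.5]
w7, b7 = [[1.0, 1.0], [-1.0, 0.0]], [0.0, 0.0]

fc6 = fully_connected_relu(w6, b6, pool5)  # [2.0, 0.0]
fc7 = fully_connected_relu(w7, b7, fc6)    # [2.0, 0.0]
```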

2.2 Experiment introduction and result analysis

2.2.1 Experiment 1: no fine-tuning, the CNN parameters are only pre-trained on ILSVRC 2012

         Features from fc7 turn out to generalize worse than features from fc6, which means that fc7 can be removed (saving 29% of the parameters, about 16.8 million) without lowering mAP.
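The 16.8 million figure follows directly from the fc7 shape given earlier. The short calculation below is only a sanity check; the total implied by the 29% figure is an inference from the two quoted numbers, not a value stated in the notes.

```python
FC6_IN, FC6_OUT, FC7_OUT = 9216, 4096, 4096

fc7_weights = FC6_OUT * FC7_OUT          # 16,777,216 weights
fc7_params = fc7_weights + FC7_OUT       # 16,781,312 including biases, about 16.8 million
fc6_params = FC6_IN * FC6_OUT + FC6_OUT  # 37,752,832 including biases

# If fc7 is 29% of all parameters, the whole network has roughly 58 million.
implied_total = fc7_params / 0.29
```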

The paper also notes that most of the CNN's representational power comes from its convolutional layers rather than from the fully connected layers that follow them (which hold most of the parameters).

2.2.2 Experiment 2: fine-tuning on the VOC 2007 trainval dataset

         The mAP of fc7 increases by 8 percentage points, and the improvement for fc6 and fc7 is much larger than for pool5.

This suggests that the pool5 features learned on ImageNet are general, and that most of the improvement comes from learning domain-specific non-linear classifiers on top of them in fc6 and fc7.

2.2.3 Experiment 3: Comparison with other feature learning methods in the same period

DPM ST (from "A learned mid-level representation for contour and object detection")

DPM HSC (histograms of sparse codes)

All R-CNN variants outperform the three DPM-based baselines.

3 Error analysis

        The error modes are analyzed with the detection analysis tool from Hoiem et al., "Diagnosing error in object detectors"; the discussion of the results is in the captions of Figure 4 and Figure 5.

Four Bounding Box Regression

The error analysis shows that localization errors are a major error mode. Suppose there is an annotated ground-truth bounding box and a proposal: even if the classifier recognizes the proposal as an airplane, the proposal is poorly localized, so its IoU with the ground-truth box is small, which amounts to not detecting the airplane correctly.

Bounding-box regression can reduce these localization errors. The idea is to fine-tune the poorly localized proposal so that the adjusted box is closer to the ground-truth box, which improves localization accuracy.

A set of class-specific linear regression models is trained (one for each of the 20 PASCAL VOC categories). After the SVMs score a proposal, the regressor predicts a new bounding box from the CNN features of that proposal. The experiments show that bounding-box regression repairs a large number of mislocalized detections and raises mAP by 3 to 4 percentage points.

In summary, to fix the localization problem above, a linear regression model is trained to predict a new detection window from the features of a selective search region proposal, which improves mAP.

Five Application to semantic segmentation

Semantic segmentation

Region classification is a standard technique for semantic segmentation, so R-CNN can be applied directly to the PASCAL VOC segmentation challenge. The paper compares against O2P, the leading semantic segmentation system at the time, and works within its open-source framework to make the comparison direct. O2P uses CPMC to generate 150 region proposals per image and then predicts the quality of each region for each class with support vector regression (SVR). Its performance rests on the quality of the CPMC regions and on second-order pooling of multiple feature types.

CNN features for segmentation

Three strategies for computing features on CPMC regions are evaluated. All of them start by warping the rectangular window around the region to 227×227.

The first strategy (full) ignores the shape of the region and computes CNN features directly on the warped window, exactly as in detection.

Its drawback is that the non-rectangular shape of the region is ignored: two regions may have very similar bounding boxes while overlapping very little.

The second strategy (fg) computes CNN features only on the foreground mask of the region.

The third strategy (full+fg) concatenates the full and fg features.

Results on the VOC 2011 validation set

fc6 features perform better than fc7 features

fg (the masked region) performs better than full

Six Supplements

1 Transformation of object proposal boxes

        The input of the CNN has a fixed size of 227×227, but a proposal is a rectangle of arbitrary size, so each proposal must be transformed to a size that the CNN accepts.

1.1 Size transformation

There are two transformation methods:

Tightest square with context: the proposal is enclosed in the tightest square, and the image content inside that square is scaled to the CNN input size. A variant (tightest square without context) discards the image content outside the proposal.

Warp: the proposal is scaled directly to the CNN input size, which changes its aspect ratio.

1.2 Padding

        In addition, the proposal can be padded with surrounding image context before it is transformed.
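The geometry of the two transformations can be sketched without touching any pixels. This is a simplified illustration: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the function names are made up, and `padding` is measured in pixels of the original image, which may differ from how the paper defines its padding.

```python
def tightest_square_with_context(box, padding=0):
    """Smallest square centred on the proposal that contains it, grown by
    `padding` pixels on every side. It may extend beyond the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = max(x2 - x1, y2 - y1) / 2.0 + padding
    return (cx - half, cy - half, cx + half, cy + half)


def warp_scale_factors(box, target=227):
    """Independent x and y scale factors used by warping; they differ
    whenever the proposal is not square, so the aspect ratio changes."""
    x1, y1, x2, y2 = box
    return target / (x2 - x1), target / (y2 - y1)


proposal = (100, 50, 300, 150)                   # 200 wide, 100 high
square = tightest_square_with_context(proposal)  # (100.0, 0.0, 300.0, 200.0)
padded = tightest_square_with_context(proposal, padding=16)
sx, sy = warp_scale_factors(proposal)            # sy is twice sx for this box
```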

2 Positive and negative sample selection and softmax

2.1 Selection of positive and negative samples

        The paper defines positive and negative samples differently in the fine-tuning stage and in the SVM stage. In fine-tuning, positives are proposals with IoU \geq 0.5 with a ground-truth box, and all other proposals are background (negatives). When training the SVMs, positives are the ground-truth boxes, negatives are proposals with IoU < 0.3 with the ground-truth boxes of the class, and the remaining proposals are discarded.
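The two definitions can be put side by side in a short sketch. The function names, the use of 0 for background, and the use of `None` for discarded proposals are conventions invented for this example; the IoU values are assumed to be computed elsewhere.

```python
def finetune_label(max_iou, gt_class):
    """Fine-tuning: the class of the best-matching ground-truth box if
    IoU >= 0.5, otherwise background (0)."""
    return gt_class if max_iou >= 0.5 else 0


def svm_label(max_iou_with_class, is_ground_truth):
    """Per-class SVM: +1 for ground-truth boxes, -1 for proposals below
    0.3 IoU with that class, None (discarded) for everything in between."""
    if is_ground_truth:
        return 1
    if max_iou_with_class < 0.3:
        return -1
    return None


# A proposal with IoU 0.6 is a positive for fine-tuning but unused by the SVM.
same_proposal = (finetune_label(0.6, 3), svm_label(0.6, False))  # (3, None)
```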

 [Explanation]

        The SVMs were first trained on features from the CNN pre-trained on ImageNet, before fine-tuning was considered. In that setting, experiments showed that the current SVM definition of positives and negatives worked best among the options tried. When the authors started fine-tuning, they first used the same definition as for the SVMs and found the results to be worse than with the current fine-tuning definition.

        The fine-tuning stage uses many proposals with IoU between 0.5 and 1 (called "jittered" examples), which expands the number of positive samples by nearly 30 times. The authors conjecture that this large training set is needed to fine-tune the whole network without overfitting. However, using jittered examples is likely suboptimal, because the network is not fine-tuned for precise localization.

 2.2 softmax classification and SVM

        Section 2.1 shows that fine-tuning does not train for precise localization, which raises another question: why train SVMs after fine-tuning at all, instead of classifying directly with the last (softmax) layer of the fine-tuned network?

        The authors tried this and found that on VOC 2007, using the softmax output directly lowers mAP from 54.2% to 50.9%. The likely reasons are:

  • The positive examples defined in the fine-tuning stage do not emphasize precise localization;
  • The softmax classifier is trained on randomly sampled negative examples, whereas the SVMs are trained on a subset of hard negatives.

        Since the gap is not large, the authors conjecture that some additional tweaks to fine-tuning could close it, so that softmax could replace the SVMs and simplify and speed up R-CNN training without losing performance.

3 Bounding-box regression

        The paper adds a bounding-box regression stage to improve localization. After a proposal has been scored by the SVM of a class, a class-specific bounding-box regressor predicts a new bounding box for it.

3.1 Training algorithm

 [Algorithm input] The input is N training pairs {(P^i,G^i)}_{i=1,\cdots,N}, where P^i = (P_x^i,P_y^i,P_w^i,P_h^i) gives the pixel coordinates of the center of proposal P^i together with its width and height. Each ground-truth bounding box G is represented in the same way: G = (G_x,G_y,G_w,G_h).

The training goal is to learn a transformation that maps a proposal P to a ground-truth bounding box G.

[Algorithm introduction and training process] 

        The transformation is parameterized by four functions d_x(P), d_y(P), d_w(P), d_h(P), one per dimension:

\hat{G}_x = P_w d_x(P) + P_x
\hat{G}_y = P_h d_y(P) + P_y
\hat{G}_w = P_w \exp(d_w(P))
\hat{G}_h = P_h \exp(d_h(P))

        The first two functions specify a translation of the center, scaled by the proposal size:

        \Delta x = P_w d_x(P), \Delta y = P_h d_y(P)

        The last two functions specify a log-space scaling of the width and height:

        S_w = e^{d_w(P)}, S_h = e^{d_h(P)}
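Applying the four functions to a proposal can be sketched in a few lines. The function name and the example numbers are made up; boxes are (center x, center y, width, height) tuples as defined above, and the d values are passed in directly instead of being predicted from CNN features.

```python
import math


def apply_regression(proposal, d):
    """Map a proposal (P_x, P_y, P_w, P_h) to a predicted box
    using the outputs d = (d_x, d_y, d_w, d_h)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = d
    return (
        pw * dx + px,        # shift the center by a fraction of the width
        ph * dy + py,        # shift the center by a fraction of the height
        pw * math.exp(dw),   # rescale the width in log-space
        ph * math.exp(dh),   # rescale the height in log-space
    )


# Move right by 10% of the width, up by 25% of the height, double the width.
predicted = apply_regression((100.0, 80.0, 50.0, 40.0),
                             (0.1, -0.25, math.log(2.0), 0.0))
# predicted is approximately (105.0, 70.0, 100.0, 40.0)
```

Because the shifts are multiplied by the proposal size and the scales are exponentiated, the same d values describe the same relative correction for small and large proposals.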

        Each function d_{\star}(P) (\star = x,y,w,h) is modeled as a linear function of the pool5 features of the proposal, denoted \phi_5(P). This is important: the features come from the pool5 layer of the CNN.

        With this definition, the relationship between each transformation function and its input is: d_{\star}(P) = w_{\star}^T\phi_5(P).

        w_{\star} is a vector of learnable parameters, learned by optimizing the following regularized least squares objective (ridge regression):

w_{\star} = \mathop{\arg\min}_{\hat{w_{\star}}} \sum\limits_{i}^N(t_{\star}^i-\hat{w_{\star}}^T\phi_5(P^i))^2 + \lambda\left \| \hat{w_{\star}} \right \|^2

         The regression targets t_{\star} are defined as follows:

t_x = (G_x - P_x)/P_w
t_y = (G_y - P_y)/P_h
t_w = \log(G_w/P_w)
t_h = \log(G_h/P_h)

They are the translation (t_x,t_y) and the size scaling (t_w,t_h) that are actually needed to map P to G.
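The target definitions translate directly into code. In this sketch the function name and the example boxes are made up; boxes are (center x, center y, width, height) tuples.

```python
import math


def regression_targets(proposal, ground_truth):
    """Compute (t_x, t_y, t_w, t_h) for one (P, G) training pair."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (
        (gx - px) / pw,      # horizontal shift as a fraction of the width
        (gy - py) / ph,      # vertical shift as a fraction of the height
        math.log(gw / pw),   # log of the width ratio
        math.log(gh / ph),   # log of the height ratio
    )


# The ground truth is 5 px right, 10 px up, and twice as wide as the proposal.
targets = regression_targets((100.0, 80.0, 50.0, 40.0),
                             (105.0, 70.0, 100.0, 40.0))
# targets is approximately (0.1, -0.25, log 2, 0.0)
```

A regressor that outputs exactly these targets as d_{\star}(P) would map the proposal onto the ground-truth box.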

        The training loss function is

Loss = \sum\limits_i^N(t_{\star}^i - \hat{w_{\star}}^T\phi_5(P^i))^2
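As a hedged aside, a ridge regression objective of this form has a standard closed-form minimizer. The matrix \Phi and the stacked vector t_{\star} below are notation introduced only for this sketch: \Phi stacks the N feature vectors \phi_5(P^i)^T as rows, and t_{\star} stacks the N targets t_{\star}^i.

```latex
w_{\star} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T t_{\star}
```

Setting the gradient of the regularized objective to zero gives (\Phi^T\Phi + \lambda I) w_{\star} = \Phi^T t_{\star}, and for \lambda > 0 the matrix on the left is always invertible. Whether the authors solved the problem this way is not stated in these notes.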

        

        After training, given the pool5 features \phi_5(P) of a proposal, the learned parameters w_{\star} predict d_{\star}(P), which yields the required translation \Delta x,\Delta y and size scaling S_w,S_h and therefore a more accurate localization.

[Two implementation issues in bounding-box regression]

The first is regularization: \lambda = 1000, chosen on the validation set.

The second is the selection of the training pairs (P,G):

        If a proposal is far away from all ground-truth boxes, regressing it to a ground-truth box makes no sense.

        Therefore only proposals that are "near" at least one ground-truth box are used for training. "Near" is defined quantitatively: the maximum IoU of the proposal with the ground-truth boxes must be greater than 0.6. All other proposals are discarded.
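The selection rule can be sketched as a filter over precomputed overlaps. The function name and the IoU matrix layout are assumptions made for the example, and pairing each proposal with its best-overlapping ground-truth box is the natural reading of the rule, not something these notes spell out.

```python
def select_training_pairs(iou_matrix, threshold=0.6):
    """Return (proposal index, ground-truth index) pairs for regression training.

    iou_matrix[i][j] is the IoU between proposal i and ground-truth box j.
    A proposal is paired with its best-overlapping ground-truth box, and only
    if that overlap is strictly greater than the threshold.
    """
    pairs = []
    for i, row in enumerate(iou_matrix):
        if not row:
            continue  # image without ground-truth boxes
        j = max(range(len(row)), key=lambda k: row[k])
        if row[j] > threshold:
            pairs.append((i, j))
    return pairs


iou_matrix = [
    [0.70, 0.20],  # near ground-truth box 0: kept
    [0.50, 0.55],  # near nothing: discarded
    [0.10, 0.90],  # near ground-truth box 1: kept
    [0.60, 0.00],  # exactly 0.6 is not greater than 0.6: discarded
]
pairs = select_training_pairs(iou_matrix)  # [(0, 0), (2, 1)]
```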

  

4 Feature Visualization

        The figure below shows 69 features visualized from pool5. For each feature, the 24 proposals that activate it most strongly on VOC2007 are shown. Each feature is labeled with its position (y, x, channel) in the pool5 feature map (dimension 6×6×256); (y, x) locates the receptive field.

5 Per-category segmentation results


Origin blog.csdn.net/weixin_45581089/article/details/120081245