[Intensive reading of papers] Fast R-CNN

"Fast R-CNN" is an improvement based on the same author's previous R-CNN work.
Fast R-CNN is also an algorithm based on deep convolutional neural networks for computer vision tasks (mainly for target detection).
He has made great innovations on the basis of R-CNN, such as unifying the steps of target classification and positioning, and realizing the end-to-end training and prediction process.
When Fast R-CNN uses the backbone network of VGG-16, the training speed is 9 times faster than R-CNN, the test speed is 213 times faster, and the detection accuracy is improved.
The training speed is 3 times faster than SPPnet, the testing speed is 10 times faster, and the accuracy is improved.

This article is an intensive-reading summary of the paper. I will go through every point thoroughly, not only so that everyone understands Fast R-CNN, but also to record my own experience from reading each part of the paper. For my intensive reading of the earlier R-CNN paper, please refer to my previous article.

Background

Why is Fast R-CNN proposed?
The authors point out that the previous object detection algorithms R-CNN and SPPnet have various shortcomings.

Disadvantages of R-CNN

Disadvantages of R-CNN:

  1. Training is multi-stage. First, a CNN has to be trained for image feature extraction.
    Next, a linear SVM classifier has to be trained to classify the extracted features.
    Finally, a regression model has to be trained to refine the localization of the candidate boxes.
  2. Training is very expensive. Because the pipeline is multi-stage, the features extracted from every candidate box of every image must be saved before the later stages can be trained.
    These features therefore have to be written to disk.
    Feature extraction is time-consuming, and storing the features consumes a lot of space as well.
  3. Detection at test time is slow. R-CNN with a VGG-16 backbone takes 47 seconds (on a GPU) to process one image.

The author points out that the root cause of R-CNN's slowness is that every candidate box is pushed through the same CNN independently, so a large amount of computation is repeated and nothing is shared between boxes.
To address this, Kaiming He et al. proposed SPPnet.

What SPPnet solves, and its remaining shortcomings

SPPnet overview:
SPPnet runs the entire image through one forward pass of the CNN, and replaces the ordinary max pooling layer before the fully connected layers with an SPP layer (spatial pyramid pooling layer), as shown below.
(figure: SPPnet)

In the SPP layer, the convolutional feature map produced by the last convolutional layer (the conv5 feature maps in the figure, which can be of any size) is pooled into fixed-size outputs. No matter how the input image size changes, the SPP layer always produces a 4×4, a 2×2, and a 1×1 pooled map (by dynamically choosing the pooling kernel and stride), stacked like a pyramid. These fixed-size maps are then flattened and passed to the fully connected layers as a fixed-length 21×256 vector.
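To make the fixed-output idea concrete, here is a minimal sketch (my own, not from the paper) that imitates the 4×4 + 2×2 + 1×1 pyramid with PyTorch's `adaptive_max_pool2d`; SPPnet itself derives the kernel and stride from the input size, so this only illustrates the interface behavior.

```python
import torch
import torch.nn.functional as F

def spp_layer(feature_map, levels=(4, 2, 1)):
    """Illustrative spatial pyramid pooling: pool a (C, H, W) feature map into
    fixed 4x4, 2x2 and 1x1 grids and concatenate them into one fixed-length output.
    adaptive_max_pool2d is used for brevity; SPPnet computes kernel/stride itself."""
    c = feature_map.shape[0]
    pooled = []
    for size in levels:
        out = F.adaptive_max_pool2d(feature_map.unsqueeze(0), output_size=size)
        pooled.append(out.view(1, c, -1))        # (1, C, size*size)
    return torch.cat(pooled, dim=2)              # (1, C, 16 + 4 + 1) = (1, C, 21)

# Any input spatial size gives the same fixed-length output (the "21 x 256" above):
print(spp_layer(torch.randn(256, 13, 13)).shape)   # torch.Size([1, 256, 21])
print(spp_layer(torch.randn(256, 30, 45)).shape)   # torch.Size([1, 256, 21])
```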
The advantages of doing this:
① First, it removes the requirement of a fixed input size: images no longer have to be warped to a fixed size before entering the CNN.
② Second, for each image the ~2000 candidate boxes can be mapped onto the convolutional feature map before the SPP layer: the region a candidate box occupies in the image is projected to the corresponding region of the feature map, and the cropped small feature map (the "window" in the figure below) is SPP-pooled into fixed-length features. (As shown below.)
(figure omitted)

In this way, a single forward pass of the CNN is enough to extract the features of all candidate boxes (no need to run it 2000 times).
This addresses R-CNN's lack of shared computation and speeds up detection.
However, SPPnet still uses a linear SVM classifier and a separate regressor on the vectors produced by the spatial pyramid pooling layer, so like R-CNN it remains multi-stage. It therefore does not solve R-CNN's slow multi-stage training and high resource consumption (points 1 and 2 above).
The author also points out that when SPPnet is fine-tuned, the network parameters before the SPP layer are hard to update, which hurts detection accuracy (this is discussed in detail later).

Advantages of Fast R-CNN

The Fast R-CNN proposed by the author solves the shortcomings of the above R-CNN and SPPnet, while improving the training and testing speed, and improving the accuracy.
Features of Fast R-CNN:

  • Higher detection accuracy than R-CNN and SPPnet.
  • Training is single-stage, using a multi-task loss function
  • The parameters of all network layers can be updated during training
  • Extracted proposal features no longer need to be cached to disk

Fast R-CNN model structure

Let's look at the model structure of Fast R-CNN (as shown below):
the input of the network is an image plus a set of candidate boxes (again generated by selective search).

① The image passes through the CNN, and the last convolutional layer outputs a set of conv feature maps.
The candidate-box information is projected onto the corresponding positions of the conv feature map via RoI projection, i.e. a projection of the object's position, and this small feature map is cropped out.

② The cropped small feature map is passed into the RoI pooling layer.
After the RoI pooling layer, we obtain a fixed-length feature vector.

③ This feature vector is fed into two sibling modules at the same time:
a fully connected layer + softmax, which predicts the probability of each of the K+1 classes;
a bounding-box regressor, which predicts the box location for each class.

(figure: Fast R-CNN model structure)
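To make the forward path concrete, here is a minimal sketch (module and variable names are my own; this is not the author's Caffe implementation) using torchvision's `roi_pool`; the 7×7 pooled size and 1/16 feature stride follow the common VGG-16 setup.

```python
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNSketch(nn.Module):
    """Sketch of the Fast R-CNN forward path: one backbone pass per image,
    RoI pooling per proposal, then two sibling heads (softmax classification
    and per-class bounding-box regression)."""
    def __init__(self, backbone, num_classes, feat_channels=512, pool=7, scale=1/16):
        super().__init__()
        self.backbone = backbone              # e.g. VGG-16 conv layers (last max pool removed)
        self.pool, self.scale = pool, scale   # RoI pool output size and feature-map stride
        self.fc = nn.Sequential(
            nn.Linear(feat_channels * pool * pool, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU())
        self.cls_score = nn.Linear(4096, num_classes + 1)        # K + 1 classes (incl. background)
        self.bbox_pred = nn.Linear(4096, 4 * (num_classes + 1))  # 4 coefficients per class

    def forward(self, image, rois):
        # image: (1, 3, H, W); rois: (R, 5) rows of (batch_idx, x1, y1, x2, y2) in image coordinates
        feat = self.backbone(image)                              # single conv forward pass
        pooled = roi_pool(feat, rois, output_size=self.pool,
                          spatial_scale=self.scale)              # (R, C, 7, 7), one crop per RoI
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)              # class scores, box deltas
```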


Compared to R-CNN (below), the model differs in three ways:

  • The input is no longer the cropped candidate boxes but the whole image plus the candidate-box coordinates, and the CNN outputs a feature map for the whole image. The box coordinates are then projected onto the feature map, the corresponding region is cropped out, and it is passed to the next layer. This removes the repeated CNN computation: instead of 2000 forward passes for 2000 candidate boxes, one pass per image is enough.
  • The model does not restrict the input image size, so the conv feature map, and the cropped region feature maps, vary in size. The paper uses the RoI pooling layer to pool these differently sized feature maps into fixed-size feature vectors (replacing the last max pooling layer of an ordinary CNN). (The RoI pooling layer is explained in detail below.)
  • The feature vectors are classified directly with a softmax rather than linear SVMs, so there is no longer any need to train a separate SVM for each class. This choice also fits the single-stage training design, described in detail later.

(figure: R-CNN structure, taken from Towards Data Science)

RoI pooling layer

The RoI pooling layer is a special kind of adaptive max pooling layer: a max pooling whose kernel size and stride are chosen dynamically from the size of the current feature map. It essentially inherits the design of the SPP layer (spatial pyramid pooling layer) in SPPnet.

Let me give two examples to explain in detail what the RoI pooling layer does:

① Suppose that, using the mapping relationship, we crop a small C×13×13 feature map out of the conv feature map and feed it to the RoI pooling layer, and we want an output of size C×4×4.
For this max pooling:
the kernel size is the feature-map size divided by the output size, rounded up, i.e. size = ceil(13/4) = 4;
the stride is the feature-map size divided by the output size, rounded down, i.e. stride = floor(13/4) = 3.
The resulting pooled output is exactly 4×4.
See the figure below. Finally we flatten the output to get a C×16 feature vector.

(figure omitted)

② This example covers a non-square feature map (the figure is taken from benrishi.ai).
Suppose we perform RoI pooling on the 6×4 feature map in the figure, with an output size of 2×2. The max pooling kernel and stride
then need separate values in the vertical and horizontal directions. The paper does not spell out how pooling is done in the non-square case; based on my reading of the paper's description of RoI pooling, the parameters for a rectangular feature map should be:

kernel_size=( ceil(6/2), ceil(4/2) )=(3,2)
stride=( floor(6/2), floor(4/2) )=(3,2)

That is, the pooling kernel is (3,2), with a stride of 3 in the vertical direction and 2 in the horizontal direction.
(PS: this also matches the parameter convention of Pytorch's nn.MaxPool2d layer.)
The pooled feature map is again 2×2. You can also verify this yourself on the 5×8 small feature map in the figure.
(figure omitted)
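A small check of this reading of the pooling parameters (my own sketch, following the ceil/floor rule described above), verified against PyTorch's nn.MaxPool2d:

```python
import math
import torch
import torch.nn as nn

def roi_pool_params(in_h, in_w, out_h, out_w):
    """Kernel and stride per the interpretation above: kernel = ceil(in/out),
    stride = floor(in/out), computed independently for each axis."""
    kernel = (math.ceil(in_h / out_h), math.ceil(in_w / out_w))
    stride = (math.floor(in_h / out_h), math.floor(in_w / out_w))
    return kernel, stride

print(roi_pool_params(13, 13, 4, 4))   # ((4, 4), (3, 3))  -> example 1
print(roi_pool_params(6, 4, 2, 2))     # ((3, 2), (3, 2))  -> example 2

# Verify the output sizes with an ordinary max pooling layer:
k, s = roi_pool_params(13, 13, 4, 4)
print(nn.MaxPool2d(kernel_size=k, stride=s)(torch.randn(1, 1, 13, 13)).shape)  # (1, 1, 4, 4)
k, s = roi_pool_params(6, 4, 2, 2)
print(nn.MaxPool2d(kernel_size=k, stride=s)(torch.randn(1, 1, 6, 4)).shape)    # (1, 1, 2, 2)
```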

That is the RoI pooling layer in detail: an adaptive max pooling layer that fixes the output size. The author himself notes that RoI pooling is borrowed from SPPnet's SPP layer and is a special case of it (a single pyramid level).

Model training details

Overall, the model is trained by fine-tuning a pre-trained image classification model after modifying it.
The fine-tuning is single-stage: one loss function optimizes the softmax classifier and the bbox regressors at the same time, so training is completed in one step (unlike R-CNN and SPPnet, which need to train 3 modules).

Building the model (modifying a pre-trained CNN)

The author takes a CNN pre-trained on the ImageNet image classification dataset (e.g. AlexNet, VGG-16) and modifies it as follows.

  • Replace the network's last max pooling layer with the RoI pooling layer.
  • Replace the network's last fully connected + softmax layer with two sibling layers, as described above:
    ① a fully connected layer + softmax, used to predict the probability of each of the K+1 classes;
    ② a bounding-box regressor, used to predict the box location for each class.
  • Change the network's input to the image plus the coordinates of the candidate boxes in that image.
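A rough sketch of these three modifications on torchvision's pre-trained VGG-16 (the paper used Caffe models; layer indices and the `weights=` argument depend on your torchvision version, so treat this purely as an illustration):

```python
import torch.nn as nn
from torchvision.models import vgg16

base = vgg16(weights="IMAGENET1K_V1")          # ImageNet-pretrained VGG-16

# 1) Drop the last max pooling layer; RoI pooling takes its place
#    (applied per proposal in the detection head, as in the forward sketch above).
conv_body = nn.Sequential(*list(base.features.children())[:-1])

# 2) Keep fc6/fc7, then split the head into two sibling output layers.
K = 20                                                                # e.g. 20 VOC classes
fc_head   = nn.Sequential(*list(base.classifier.children())[:-1])    # fc6, fc7 (+ ReLU/Dropout)
cls_score = nn.Linear(4096, K + 1)          # replaces the 1000-way ImageNet classifier
bbox_pred = nn.Linear(4096, 4 * (K + 1))    # class-specific box regression coefficients

# 3) The network input becomes the image plus its proposal boxes,
#    wired together as in the forward-path sketch earlier.
```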

Why is it difficult to update the weights of the convolutional network during SPPnet training? (About the sampling method during training)

The author states that Fast R-CNN can train the weights of the entire network through backpropagation.
The paper explains in detail why, when the SPPnet described above is fine-tuned, the network parameters before the SPP layer are hard to update.
The author writes that "SPPnet is unable to update weights"; I think "difficult to update" is the more accurate phrasing.

I searched the whole Internet and could not find a more detailed answer, so I drew a picture myself to explain why the parameters are hard to update.
First, be clear that this issue is closely tied to Fast R-CNN's sampling scheme; it directly motivates how a mini-batch is sampled during Fast R-CNN training.
I drew a picture comparing the training of Fast R-CNN and SPPnet:
(figure omitted)

The top half shows the training of SPPnet/R-CNN. Suppose batch_size is set to 4; with uniform sampling, the 4 sampled candidate boxes generally come from 4 different images, so 4 separate forward passes are needed before the loss can be computed and backpropagated.
The bottom half shows the training of Fast R-CNN. We still set batch_size to 4, but sampling is hierarchical: since every image has ~2000 candidate boxes, we first sample a few images and then sample several candidate boxes within each image. For example, we sample 2 images (N=2) and 2 candidate boxes from each. Each image then needs only one forward pass, and each candidate box is cropped out of the feature map by its projected position. In this example only 2 forward passes are needed.

If batch_size is enlarged to 128 and each mini-batch still uses only 2 images, with 64 candidate boxes drawn per image, we still need only 2 forward passes to cover the batch.
SPPnet/R-CNN, in contrast, would need 128 forward passes, a sharp increase in computation. For fine-tuning, that is very inefficient.

This is why it is difficult to update the weights of the convolutional network during SPPnet training. It is not impossible to update, but very inefficient.


Therefore, the mini-batch sampling strategy adopted in the paper is:
first randomly sample 2 images, then sample 64 candidate boxes per image, giving a batch_size of 128.
Of these 128 samples, 25% are positive and the rest are negative (the same ratio as R-CNN).
A candidate box with IoU of at least 0.5 against a ground-truth box is a positive sample; one with IoU between 0.1 and 0.5 is a negative sample.
Boxes with IoU below 0.1 are not sampled; the paper notes that this lower threshold acts as a heuristic for hard example mining. My understanding is that because there are very many samples with IoU below 0.1, they are worth mining: predict them during training, and if a prediction is wrong, add that hard example to the pool for the next round of training.
In addition, the author uses horizontal flipping with probability 0.5 as data augmentation.
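A sketch of this hierarchical sampling (my own Python; the dataset layout, field names, and the handling of corner cases are assumptions, not the paper's code):

```python
import random

def sample_minibatch(dataset, images_per_batch=2, rois_per_image=64, fg_fraction=0.25):
    """Hierarchical sampling: pick N=2 images, then 64 proposals per image,
    ~25% foreground (max IoU >= 0.5 with a ground-truth box) and the rest
    background (max IoU in [0.1, 0.5))."""
    batch = []
    for img in random.sample(dataset, images_per_batch):
        fg = [r for r in img["rois"] if r["max_iou"] >= 0.5]
        bg = [r for r in img["rois"] if 0.1 <= r["max_iou"] < 0.5]
        n_fg = min(int(rois_per_image * fg_fraction), len(fg))
        rois = random.sample(fg, n_fg) + random.sample(bg, min(rois_per_image - n_fg, len(bg)))
        # With probability 0.5, the image and its boxes would also be flipped
        # horizontally here (data augmentation; flipping code omitted).
        batch.append((img, rois))
    return batch      # 2 images x 64 RoIs = 128 samples per mini-batch
```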

Multi-task loss function

The Fast R-CNN network has two outputs at the end, the softmax classification and the bbox regressor, so training is a multi-objective optimization problem.
The author's idea is to design a loss for each task and sum them, reducing the two-objective problem to a single-objective one.

In Fast R-CNN, each sample (candidate box) produces:
a discrete probability distribution $p = (p_0, \ldots, p_K)$, containing the predicted probabilities of the K+1 classes;
a set of regression coefficients $t^{k} = (t^{k}_{\mathrm{x}}, t^{k}_{\mathrm{y}}, t^{k}_{\mathrm{w}}, t^{k}_{\mathrm{h}})$ for each class $k$.

(Note: for regression coefficients and bounding-box regression, see my earlier article for details.)

In each round of training, each candidate box has a ground-truth class label (denoted by the class index $u$, where 0 is the background class), and a set of regression coefficients $v$ computed from the ground-truth box.
Because the class label is known, among the regression coefficients $t^{k}$ we only care about class $u$, i.e. we only need $t^{u}$.
As the figure below shows, the softmax loss and the regressor loss can be computed separately for each candidate box.
(figure omitted)

Our goal is to design a loss for the softmax and a loss for the regressor, then add them up so that the multi-task loss can be optimized as a single objective.

The resulting multi-task loss is:
$$L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda\,[u \geq 1]\, L_{\mathrm{loc}}(t^{u}, v)$$

The first term of the loss function is the classification loss $L_{\mathrm{cls}}$, a cross-entropy loss. Since only class $u$ has a true probability of 1, the cross-entropy term reduces to
$$L_{\mathrm{cls}}(p, u) = -\log p_{u}$$
where $p_{u}$ is the predicted probability of the true class $u$, i.e. a log loss on the true class.


The second term of the loss function is the localization (regressor) loss $L_{\mathrm{loc}}$ over the regression coefficients; the author uses the $\operatorname{smooth}_{L_{1}}$ loss, written as:
$$L_{\mathrm{loc}}(t^{u}, v) = \sum_{i \in \{\mathrm{x}, \mathrm{y}, \mathrm{w}, \mathrm{h}\}} \operatorname{smooth}_{L_{1}}(t_{i}^{u} - v_{i})$$
Setting $\operatorname{smooth}_{L_{1}}$ aside for the moment: for each coordinate $i$ of a candidate box, there is a true offset coefficient $v_{i}$ relative to the ground-truth box of class $u$, and a predicted offset coefficient $t_{i}^{u}$ from the regressor; the loss is built from their difference.

In R-CNN's bbox regression problem (see my earlier article for details), the regressor uses a sum-of-squares ($L_2$) loss, for which the parameters can be solved analytically.

In Fast R-CNN, however, the two tasks (classification + localization) are optimized together for end-to-end training, by summing the two losses into a single objective. With an $L_2$ loss, once $t_{i}^{u}$ and $v_{i}$ differ greatly, the square makes the loss value very large, possibly several orders of magnitude larger than the cross-entropy term. That can make learning difficult, so it is best to keep the two task losses on a comparable scale.
A plain $L_1$ loss, on the other hand, is not differentiable at zero. The author therefore uses the $\operatorname{smooth}_{L_{1}}$ loss, which near zero (on $[-1, 1]$) turns into a smooth quadratic curve that is easy to differentiate. The three functions are plotted in the figure below:

(figure omitted)

Written as a function, $\operatorname{smooth}_{L_{1}}$ is
$$\operatorname{smooth}_{L_{1}}(x)= \begin{cases} 0.5 x^{2} & \text{if } |x|<1 \\ |x|-0.5 & \text{otherwise} \end{cases}$$

The author also notes that the smooth L1 loss is less sensitive to outliers than L2. If the regression targets are unbounded, training with an L2 loss requires careful tuning of the learning rate to prevent exploding gradients.

Finally, the indicator $[u \geq 1]$ in front of $L_{\mathrm{loc}}$ equals 1 when the class index is at least 1. When $u = 0$, i.e. for the background class, it equals 0, so no regression loss is computed.
The coefficient $\lambda$ is a hyperparameter that weights the two losses; the author finds that setting it to 1 balances the two tasks.
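Putting the two terms together, here is a sketch of the multi-task loss in PyTorch (variable names and the per-batch normalization are my own simplifications; `F.smooth_l1_loss` with its default beta matches the definition above):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_pred, labels, bbox_targets, lam=1.0):
    """Sketch of L(p, u, t^u, v) = L_cls(p, u) + lambda * [u >= 1] * L_loc(t^u, v).
    cls_scores: (R, K+1) raw class scores; bbox_pred: (R, 4*(K+1));
    labels u: (R,) long; bbox_targets v: (R, 4) regression coefficients."""
    # Classification term: cross entropy, i.e. -log p_u of the true class.
    loss_cls = F.cross_entropy(cls_scores, labels)

    # Localization term: smooth-L1 over the 4 coefficients of the true class u,
    # computed only for foreground RoIs (u >= 1), as the [u >= 1] indicator requires.
    fg = torch.nonzero(labels >= 1).squeeze(1)
    if fg.numel() == 0:
        return loss_cls
    R = bbox_pred.shape[0]
    t_u = bbox_pred.view(R, -1, 4)[fg, labels[fg]]        # pick each RoI's row for class u
    loss_loc = F.smooth_l1_loss(t_u, bbox_targets[fg], reduction="sum") / R
    return loss_cls + lam * loss_loc
```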


In summary, the design of the loss function turns the training of Fast R-CNN into a single-stage training, and the training efficiency is greatly improved.
This is a major innovation compared to R-CNN and SPPnet.

Gradient backpropagation through the RoI pooling layer

This part of the paper covers a training detail: how does the RoI pooling layer backpropagate gradients?


Because RoI pooling is a type of max pooling, we first need to know how max pooling backpropagates gradients:
insert image description here

In the figure above, a 4×4 feature map is max-pooled with a 2×2 kernel and stride 2, giving a 2×2 output. Each of the 4 output values $y_1, y_2, y_3, y_4$ carries the position of its source in the original feature map, namely $x_6, x_8, x_9, x_{16}$.
During backpropagation, the gradient at each output position is routed back to the corresponding source position in the original feature map, and the gradient at every other position is 0.
That is how max pooling passes gradients backward.

In practice we often encounter pooling windows that overlap, i.e. stride < kernel size.
If the maximum found by one pooling window and the maximum found by the next window are at the same position inside the overlap region, how should the gradient be computed?
As shown below,
the gradient at $x_4$ should be the sum of the two partial derivatives from $y_1$ and $y_2$, because both outputs trace back to the same position of the original feature map.

(figure omitted)


In RoI pooling, gradient backpropagation works the same way as in the max pooling described above.
The new twist is that a Fast R-CNN training mini-batch consists of 64 candidate boxes sampled from each of 2 images.
Take the 64 candidate boxes from one image: we crop 64 small feature maps of various sizes out of the conv feature map produced by that image and pass them into the RoI pooling layer, so these small feature maps are bound to overlap. What if the maximum picked by max pooling happens to lie in an overlap region?

See below:
(figure omitted)

Two candidate boxes are mapped onto the conv feature map of one image, and each must produce a 2×2 output.
Suppose they overlap as in the figure, and the same input cell (denoted $x_{169}$ for box 1 and $x_1$ for box 2) is selected as the maximum by both poolings (box 1 outputs it at $y_4$, box 2 at $y_1$).
Since both outputs come from the same input position $x$, the partial derivative at that position is again the sum of the partial derivatives at the corresponding outputs $y$.

Therefore, in Fast R-CNN's RoI pooling layer, the partial derivative at an input position $x$ can be accumulated more than once: whenever several outputs share the same source maximum, their gradients must be summed at that source position during backpropagation.

This is exactly what the intimidating formula in the paper expresses; it is enough to understand what it is doing:
$$\frac{\partial L}{\partial x_{i}} = \sum_{r} \sum_{j} \left[i = i^{*}(r, j)\right] \frac{\partial L}{\partial y_{rj}}$$
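A tiny check of this behavior (my own experiment, not from the paper) using torchvision's `roi_pool` and autograd, with two overlapping RoIs whose bin maximum is the same input cell:

```python
import torch
from torchvision.ops import roi_pool

feat = torch.zeros(1, 1, 8, 8)
feat[0, 0, 4, 4] = 10.0            # a clear maximum lying inside the overlap of the two RoIs
feat.requires_grad_(True)

# Two overlapping RoIs, rows of (batch_idx, x1, y1, x2, y2) on the feature map itself
rois = torch.tensor([[0., 0., 0., 5., 5.],
                     [0., 2., 2., 7., 7.]])
out = roi_pool(feat, rois, output_size=2, spatial_scale=1.0)   # (2, 1, 2, 2)
out.sum().backward()

# Both RoIs select position (4, 4) as a bin maximum, so the two upstream
# gradients are summed there during backpropagation.
print(feat.grad[0, 0, 4, 4])       # tensor(2.)
```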

On Scale Invariance

Scale invariance is a desirable property of an object detection model.
If the model can still recognize a horse with a person riding on it whether the picture is scaled up or scaled down, its scale invariance is good.

One approach is to fix all images to a single size during training and use the same size at test time, letting the model learn scale invariance on its own. This is single-scale object detection.
Alternatively, during training each image can be randomly scaled to one of several preset sizes, which also serves as data augmentation. At test time, the image is scaled to all preset sizes to form an "image pyramid", and the candidate boxes generated at each size are scale-normalized. This is multi-scale object detection.
(figure omitted)
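A simplified sketch of the two strategies (my own code; the 600-pixel single-scale setting and the five preset scales appear later in this article, while the long-side cap and resizing details here are only my reading of the paper):

```python
import random
import torch
import torch.nn.functional as F

SCALES = (480, 576, 688, 864, 1200)     # the five preset scales (shorter image side, in pixels)

def rescale(image, short_side, max_long_side=2000):
    """Resize so the shorter side reaches short_side, capping the longer side."""
    _, _, h, w = image.shape
    factor = min(short_side / min(h, w), max_long_side / max(h, w))
    return F.interpolate(image, scale_factor=factor, mode="bilinear", align_corners=False)

image = torch.randn(1, 3, 375, 500)                    # dummy input image
single = rescale(image, 600)                           # single-scale: every image at s = 600
train_image = rescale(image, random.choice(SCALES))    # multi-scale training: one random scale
pyramid = [rescale(image, s) for s in SCALES]          # multi-scale testing: the image pyramid
```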

Later in the paper (Part 2 of Experimental Results) the author will do some experiments to explore scale invariance.

Model Test Details

Once fine-tuning is finished, the model can be tested.
Before testing an image, about 2000 candidate boxes (their position information) are first generated via selective search.
The image and boxes then go through a forward pass, and non-maximum suppression (NMS) is applied to remove redundant candidate boxes (same as R-CNN).
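For the NMS step, a minimal per-class post-processing sketch with torchvision's `nms` (the paper applies NMS per class, as R-CNN does; the threshold values here are placeholders, not the paper's settings):

```python
import torch
from torchvision.ops import nms

def postprocess_one_class(boxes, scores, score_thresh=0.05, iou_thresh=0.3):
    """Drop low-scoring boxes for one class, then apply non-maximum suppression.
    boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Thresholds are placeholders."""
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_threshold=iou_thresh)   # indices of boxes to keep
    return boxes[kept], scores[kept]
```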

Accelerate inference with truncated SVD

We know that the fully connected layers at the end of the Fast R-CNN model account for a large amount of computation, as shown in the figure below.
(figure omitted)

Recall that when Fast R-CNN processes an image, the CNN needs only one forward pass; the candidate regions are then projected onto the feature map (RoI projection) and fed into RoI pooling. After that, however, each of the ~2000 output feature vectors goes through its own forward pass of the fully connected layers before softmax and bbox regression.

The fully connected layers therefore do far more work at test time than during training.
The author points out that a large share of the test-time computation is in fact spent in the fully connected layers.
The fully connected layer is a matrix multiplication. For a weight matrix $W$, we can apply singular value decomposition (SVD), factoring the bloated matrix into the product of three matrices; by keeping only the parts corresponding to the largest singular values we can "slim down" those matrices into a new factorization while losing almost none of the information in $W$:
$$W \approx U \Sigma_{t} V^{T}$$
The speed of matrix operation after slimming will be greatly improved. See the experimental results section below for the specific improvement.
(For the specific knowledge details of this part of SVD, please refer to linear algebra)
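A sketch of the factorization applied to one fully connected layer (following the paper's idea of replacing one layer with two smaller ones; the code and the bias handling are my own):

```python
import torch
import torch.nn as nn

def truncate_fc(fc: nn.Linear, t: int):
    """Compress a fully connected layer with truncated SVD: W ~ U_t * Sigma_t * V_t^T,
    replacing one (out x in) layer with two layers of inner dimension t."""
    U, S, Vh = torch.linalg.svd(fc.weight.data, full_matrices=False)   # W: (out, in)
    first = nn.Linear(fc.in_features, t, bias=False)
    first.weight.data = (torch.diag(S[:t]) @ Vh[:t]).contiguous()      # Sigma_t V_t^T : (t, in)
    second = nn.Linear(t, fc.out_features)
    second.weight.data = U[:, :t].contiguous()                         # U_t : (out, t)
    second.bias.data = fc.bias.data.clone()                            # keep the original bias
    return nn.Sequential(first, second)

# e.g. compress VGG-16's fc6 (25088 -> 4096) keeping the top 1024 singular values:
# fc6_small = truncate_fc(model.classifier[0], t=1024)
```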

Experimental results 1

The following three main experimental results of the model support the contribution of Fast R-CNN:

  • Excellent accuracy on VOC2007, 2010, 2012 datasets
  • Faster training and testing speed than R-CNN and SPPnet
  • When using the backbone network as the VGG-16 model, fine-tuning the weight of the convolutional layer improves the detection accuracy

In the experiments, the author mainly uses three CNN models as the Fast R-CNN backbone.
The model with AlexNet as the backbone is called the S model (small);
the model with VGG_CNN_M_1024 as the backbone is called the M model (medium), which has the same depth as S but is wider;
the model with VGG-16 as the backbone is called the L model (large), the largest of the three.

Results on the VOC dataset

The following three tables show the results on VOC2007, 2010, and 2012 in turn, broken down by per-class accuracy. All models use VGG-16 (the L model) as the backbone.
Fast R-CNN (FRCN [ours] in the tables) has the highest overall accuracy; roughly speaking, the mAP values toward the lower right are higher than those above.
(tables omitted)

Improvements in training and testing speed

The author made the following table, showing the comparison of training and testing time of Fast R-CNN, R-CNN, and SPPnet. Three classes of models with different sizes are compared.
It can be seen that in the S model, the training speed of Fast R-CNN is 18.3 times faster than that of R-CNN.
In the L model, the test speed is 146 times faster; if the aforementioned SVD technique is used for testing, the speed is increased by 213 times. At the same time the accuracy is the highest.
(table omitted)
The author notes that using truncated SVD at test time cuts detection time by more than 30% while accuracy drops by only 0.3 percentage points.
SVD compression is a standalone trick: no additional fine-tuning is needed afterwards.

Fine-tuning the convolutional layers improves detection accuracy

As mentioned earlier, it is hard for SPPnet to update the weights before the SPP layer, so SPPnet simply froze all weights below the fully connected layers to keep training efficient, and its accuracy was still decent.
The author, however, argues that the parameters of the earlier convolutional layers also matter, and that fine-tuning them can improve accuracy.
So the author ran experiments with the L model, freezing different sets of layers, which confirmed this conclusion.

The results below show that fine-tuning only from the fully connected layer fc6 upward is worse than also fine-tuning the convolutional layers from conv3_1 or conv2_1 upward; starting the fine-tuning from an earlier layer gives better results.

(figure omitted)
But do all convolutional layers need to be fine-tuned? The author says not necessarily.
For the smaller models (S and M), fine-tuning the first convolutional layer conv1 made no difference to accuracy but reduced training efficiency (e.g. it needs more GPU memory). Starting the fine-tuning from the second convolutional layer is a reasonable choice.

Experimental results 2: verifying the effectiveness of the Fast R-CNN design

The author then runs a set of experiments to demonstrate that the design choices of Fast R-CNN itself are effective.
I think this is a research habit worth learning: beyond comparing the new method's results against old methods, one should also justify the new design itself, which better demonstrates the scientific value of the method.

Does multi-task training really work?

We know that R-CNN and SPPnet train in multiple stages: a CNN for feature extraction, a linear SVM classifier for classifying the proposal features, and a bbox regressor for adjusting the box positions.
Fast R-CNN instead designs a multi-task loss so that everything is optimized jointly and training finishes in one stage, which cuts training cost and improves efficiency.

But does multi-task training actually improve accuracy? The author designed an experiment on Fast R-CNN in which the tasks are separated and recombined in different ways for training and testing.
The left-hand columns of the table below indicate multi-task training, stage-wise training, and whether bbox regression is used at test time; ticks mark the combination used.
In the first column I marked with a black box (i.e. no ticks) the results obtained by training with the image-classification (cross-entropy) loss only.
Results are reported as mAP on the VOC07 dataset.
(table omitted)
Multi-task training plus test-time bbox regression gives the highest accuracy, which shows that multi-task training works.

On Scale Invariance

Two object detection strategies with respect to scale invariance were introduced earlier.
The author ran experiments comparing the two strategies in terms of speed and accuracy.

In the "scales" column of the table below, 1 means the single-scale strategy (all images scaled to 600 pixels) and 5 means the multi-scale strategy (five preset scales: 480, 576, 688, 864, 1200).
(table omitted)

The multi-scale strategy is slightly more accurate than single-scale, but single-scale detection is much faster.
So single-scale looks more cost-effective, though in practice it is a trade-off.

Does more training data help?

The author observes that traditional object detection algorithms (such as DPM, which the author also worked on) show accuracy saturation (mAP saturates):
after the training data reaches a certain amount, the model's accuracy stops improving.
Does Fast R-CNN show the same phenomenon?

The author ran an experiment augmenting the VOC07 training data with VOC2012 and found that accuracy improved (66.9% → 70.0%).
So when the training data grows, Fast R-CNN's accuracy grows with it and does not saturate, which is a mark of a good object detection model.

SVM classifier vs softmax

We know that R-CNN and SPPnet use SVM classifiers to classify the candidate-box images.
Fast R-CNN instead uses a softmax classifier, a choice tied to the overall design: it enables multi-task training and fine-tuning of the whole model.

In terms of classification effect, how does SVM compare with softmax? The author did the following experiments.

The results show that within Fast R-CNN, softmax works slightly better than SVM.
Although R-CNN with SVM is more accurate on the small model, softmax pulls ahead as the model grows.
Taking into account the efficiency of multi-task training as well, softmax is the better choice.
(table omitted)

Are more candidate boxes always better?

Broadly there are two kinds of proposal generators: ones like selective search that produce a relatively sparse set of boxes, and DPM-style ones that produce dense candidate boxes.

A quick look at the plot below shows that with selective search (solid blue line), accuracy fluctuates as the number of candidate boxes increases, so more candidate boxes are not necessarily better.
When selective search proposals are combined with randomly generated dense boxes, accuracy drops noticeably as the number of boxes grows.
(figure omitted)

(End)

Origin blog.csdn.net/takedachia/article/details/126385692