A Detailed Explanation of the Fast-RCNN Network

1. Introduction

I previously covered the Selective Search (SS) algorithm and the R-CNN network; this article continues with the Fast-RCNN network.

This article has two purposes: first, it serves as a personal note, making it easier to consolidate and review what I have learned; second, I hope it helps friends who are studying on their own, and that these modest insights are useful to everyone.

2. Principle steps of Fast-RCNN

Fast-RCNN can be divided into three steps:

1) Use the SS algorithm to generate about 1k-2k candidate boxes (region proposals).
2) Following SPPnet, feed the entire image into the network once to obtain a feature map, then map each candidate box from step 1 onto that feature map to extract its feature matrix.
3) Scale each candidate region's features to a fixed 7*7 size with an ROI pooling layer, flatten them, pass them through a series of fully connected layers, and output the classification result and the bounding-box regression parameters.
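The three steps above can be sketched in code. This is a minimal, framework-free outline in which `backbone`, `head`, and `stride` are placeholder names rather than identifiers from the paper's implementation; the key point is that the backbone runs once per image, not once per proposal.

```python
import numpy as np

def fast_rcnn_forward(image, proposals, backbone, head, stride=16):
    """Outline of a Fast-RCNN forward pass (hypothetical helper names).

    proposals : ~1k-2k boxes [x1, y1, x2, y2] from Selective Search,
                given in image coordinates.
    backbone  : CNN that maps the whole image to one feature map.
    head      : RoI pooling + FC layers -> class scores and box deltas.
    """
    feature_map = backbone(image)  # a single forward pass per image
    # project the proposals from image coordinates onto the feature map,
    # assuming the backbone downsamples by a factor of `stride`
    feat_boxes = np.floor(np.asarray(proposals) / stride).astype(int)
    class_scores, box_deltas = head(feature_map, feat_boxes)
    return class_scores, box_deltas
```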

2.1 Generation of candidate regions

R-CNN and Fast-RCNN obtain features for their candidate regions in different ways:

R-CNN
Every candidate box generated by R-CNN must be fed into the deep network individually to extract features, so overlapping regions are forward-propagated again and again. This redundant computation greatly increases the workload and reduces efficiency.
Fast-RCNN
Fast-RCNN feeds the entire image into the network once to obtain a feature map, and the candidate boxes produced by the SS algorithm are mapped onto that feature map to obtain their features, so overlapping candidate regions need not be recomputed. The amount of computation is greatly reduced.

Sampling of training data (positive and negative samples)

Unlike R-CNN, Fast-RCNN does not use all the candidate boxes generated by the SS algorithm: of the roughly 2,000 proposals, only a small subset is actually used for training, and these are divided into positive and negative samples. The paper explains that this split balances the data: if positive samples clearly dominated, the network would take it for granted that every candidate box contains an object of some target class, when in fact a candidate box may well be background. To balance object-containing regions against background regions, the sampled boxes are divided into positives and negatives.

Definition of positive and negative samples: a candidate box whose IoU with a ground-truth box is greater than 0.5 is a positive sample; one whose IoU lies in [0.1, 0.5) is recorded as a negative sample.
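As a sketch, this sampling rule can be written as follows (a single ground-truth box for simplicity; the thresholds are the ones quoted above, and the function names are my own):

```python
import numpy as np

def iou(box, gt):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_b - inter)

def split_samples(proposals, gt_box):
    """Label proposals as positive / negative by the IoU thresholds."""
    positives, negatives = [], []
    for p in proposals:
        overlap = iou(p, gt_box)
        if overlap > 0.5:
            positives.append(p)       # IoU > 0.5 -> positive sample
        elif 0.1 <= overlap < 0.5:
            negatives.append(p)       # IoU in [0.1, 0.5) -> negative sample
        # proposals below 0.1 IoU are discarded entirely
    return positives, negatives
```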

2.2. ROI Pooling layer


Implementation of ROI Pooling layer

Each mapped candidate region on the feature map is divided into a 7*7 grid, and a max-pooling operation is performed within each grid cell. Because the grid is always 7*7, candidate regions of any size are pooled to the same fixed output size.
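A naive single-channel version of this pooling can be written directly in NumPy. Real implementations are vectorized and handle batches and channels; this sketch only makes the grid-then-max idea concrete:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Naive RoI max pooling over one channel (illustrative only).

    feature_map : 2-D array (H, W).
    roi         : (x1, y1, x2, y2) in feature-map coordinates.
    The RoI is split into an out_size x out_size grid and each cell is
    max-pooled, so any RoI produces the same fixed-size output.
    """
    x1, y1, x2, y2 = roi
    out = np.zeros((out_size, out_size))
    # cell boundaries, rounded so the cells tile the whole RoI
    xs = np.linspace(x1, x2, out_size + 1).round().astype(int)
    ys = np.linspace(y1, y2, out_size + 1).round().astype(int)
    for i in range(out_size):
        for j in range(out_size):
            cell = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                               xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()  # max pooling within the cell
    return out
```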

2.3. Classifier

As described above, the features extracted for each candidate region are scaled to the same size by the ROI pooling layer, then flattened and passed through two fully connected layers; two parallel fully connected branches then perform category-probability prediction and bounding-box prediction respectively.

The classifier finally outputs probabilities for N+1 categories, where N is the number of target classes to detect and the extra 1 is the background class. The probability distribution is obtained with softmax.
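A toy version of this two-branch head can be put together in NumPy with random weights. The 7*7*512 input and 4096-unit FC layers assume a VGG16 backbone, and N = 20 matches Pascal VOC; everything here is a sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class DetectionHead:
    """Two shared FC layers, then parallel classification / box branches."""

    def __init__(self, num_classes=20, in_dim=7 * 7 * 512, fc_dim=4096):
        self.fc1 = rng.standard_normal((in_dim, fc_dim)) * 0.01
        self.fc2 = rng.standard_normal((fc_dim, fc_dim)) * 0.01
        self.cls = rng.standard_normal((fc_dim, num_classes + 1)) * 0.01
        self.bbox = rng.standard_normal((fc_dim, (num_classes + 1) * 4)) * 0.01

    def forward(self, x):
        h = np.maximum(x @ self.fc1, 0)  # FC + ReLU
        h = np.maximum(h @ self.fc2, 0)
        # N+1 softmax probabilities and (N+1)*4 regression parameters
        return softmax(h @ self.cls), h @ self.bbox
```

Note that instantiating the head with the full paper dimensions allocates very large weight matrices; for a quick experiment, pass much smaller `in_dim` and `fc_dim` values.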

2.4. Bounding Box Prediction

For each category, a set of bounding-box regression parameters (d_x, d_y, d_w, d_h) is output, so the last layer has (N+1)*4 output nodes.

So how are these parameters used to obtain the final predicted box?

As shown in the figure: P denotes the proposal (candidate) box, G the ground-truth box, and G^ the final predicted box. The formulas can be understood as follows: the output regression parameters shift the center coordinates x and y of the proposal, and scale its width and height.

These formulas come from the R-CNN paper; exp denotes the exponential function.
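In code, applying the regression parameters to one proposal looks like this (boxes given as center x, center y, width, height; the formulas are the R-CNN ones described above):

```python
import numpy as np

def decode_box(proposal, deltas):
    """Turn regression parameters (dx, dy, dw, dh) into a predicted box.

    proposal : (px, py, pw, ph) center coordinates plus width/height.
    The center is shifted by an amount relative to the proposal's size,
    and the width/height are scaled by exp(dw), exp(dh).
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px        # shifted center x
    gy = ph * dy + py        # shifted center y
    gw = pw * np.exp(dw)     # scaled width
    gh = ph * np.exp(dh)     # scaled height
    return gx, gy, gw, gh
```

Because dx and dy are multiplied by the proposal's width and height, the same parameter values shift a large box further than a small one, which makes the regression scale-invariant.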

2.5. Loss Calculation

Unlike R-CNN, Fast-RCNN trains classification and bounding-box regression jointly, so the total loss is the sum of the two corresponding losses.

2.5.1. Classification loss

$L_{cls}(p, u)$ is the classification loss, where $p = (p_0, p_1, \dots, p_K)$ is the predicted probability distribution over the categories and $u$ is the ground-truth category label. It is computed as:

$$L_{cls}(p, u) = -\log p_u$$

$p_u$ is the predicted probability that the current box belongs to category $u$. This is simply the cross-entropy loss: the labels of all other categories are 0, so the sum collapses to this single term.
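Numerically this amounts to indexing the predicted distribution at the true class:

```python
import numpy as np

def cls_loss(p, u):
    """Cross-entropy with a one-hot label reduces to -log p_u."""
    return -np.log(p[u])
```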

2.5.2. Bounding box regression loss

$\lambda$ is a balance coefficient that keeps the two loss terms on a comparable scale. $[u \ge 1]$ is the Iverson bracket: it equals 1 when the ground-truth label $u$ is a foreground category and 0 otherwise, so the regression loss is counted only when the box is not background. $t^u$ is the predicted bounding-box regression parameter for class $u$, namely $(t_x^u, t_y^u, t_w^u, t_h^u)$.
$v$ is the ground-truth bounding-box regression target; it is derived from the ground-truth box's center, width, and height by inverting the prediction formulas:

$$v_x = \frac{G_x - P_x}{P_w}, \qquad v_y = \frac{G_y - P_y}{P_h}, \qquad v_w = \log\frac{G_w}{P_w}, \qquad v_h = \log\frac{G_h}{P_h}$$

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t_i^u - v_i\right)$$

The $\mathrm{smooth}_{L_1}$ function is defined as:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
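The regression loss is then straightforward to implement (NumPy sketch):

```python
import numpy as np

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1 (less sensitive to outliers)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def loc_loss(t_u, v):
    """Sum of smooth-L1 over the four regression parameters (x, y, w, h)."""
    return float(smooth_l1(np.asarray(t_u) - np.asarray(v)).sum())
```

Compared with a plain squared-error loss, the linear tail keeps a single badly-regressed box from dominating the gradient.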

3. Summary

Compared with R-CNN, the procedure is simplified to two steps: generating candidate regions with the SS algorithm, and a single CNN pass that completes feature extraction, classification, and bounding-box regression.

Reference blog, video, and code

Video tutorial by a Bilibili uploader

[Code address](https://github.com/rbgirshick/fast-rcnn)

[Paper address](https://arxiv.org/abs/1504.08083): Fast R-CNN (arxiv.org)
Previous blog posts:
The SS algorithm
Detailed explanation of the R-CNN network

Origin: blog.csdn.net/SL1029_/article/details/130779462