An Analysis of Faster R-CNN for Object Detection

Introduction

Faster R-CNN is a two-stage object detection algorithm proposed by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in 2015. Faster R-CNN integrates feature extraction, a region proposal network for extracting regions of interest, and the final bounding box regression and classification into a single network, which greatly improves the overall performance of the detector.
This article analyzes Faster R-CNN from the following four aspects:

  1. Conv layers. Faster R-CNN first uses a set of conv+relu+pooling layers to extract feature maps from the input image. These feature maps are shared by the subsequent RPN and fully connected layers;
  2. Region Proposal Networks. The RPN generates effective region proposals. It uses softmax to judge whether each anchor is positive (foreground) or negative (background), then applies bounding box regression to correct the anchors and obtain accurate proposals;
  3. RoI Pooling. This layer collects the input feature maps and proposals and, after combining this information, extracts the proposal feature maps and sends them to the subsequent fully connected layers;
  4. Classification and Regression. The proposal feature maps are used to compute the category of each proposal, and bounding box regression is performed again to obtain the final precise position.
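Before diving into each component, here is a minimal sketch of the overall data flow implied by these four steps (backbone, rpn, roi_pool, and head are placeholder callables for illustration, not the API of any particular implementation):

def faster_rcnn_forward(image, backbone, rpn, roi_pool, head):
    # 1. Conv layers: extract the shared feature map
    feature_map = backbone(image)              # e.g. 1x512x(H/16)x(W/16) for VGG16
    # 2. RPN: classify anchors and regress offsets to produce proposals
    proposals = rpn(feature_map)               # boxes in network-input coordinates
    # 3. RoI Pooling: a fixed-size feature per proposal
    pooled = roi_pool(feature_map, proposals)  # e.g. Nx512x7x7
    # 4. Classification and second bounding box regression
    cls_prob, bbox_pred = head(pooled)
    return proposals, cls_prob, bbox_pred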

Backbone

Taking VGG16 as an example, this section introduces the backbone of Faster R-CNN. The backbone contains three types of layers: conv, relu, and pooling. As shown in the figure below, there are 13 conv layers, 13 relu layers, and 4 pooling layers.
[Figure: the VGG16-based backbone (13 conv, 13 relu, 4 pooling layers)]
All convolutional layers share the same parameter settings: kernel_size=3, pad=1, stride=1, so the output feature map has the same spatial size as the input. The pooling layers use kernel_size=2, pad=0, stride=2, so their output is 1/2 the size of their input.
Since the entire backbone contains 4 pooling layers, if the input image has size $M \times N$, then after backbone inference the output feature map has size $\frac{M}{16} \times \frac{N}{16}$.
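As a quick numeric check of this downsampling (a sketch, using the 576×1024 example introduced in the next section):

import math

M, N = 576, 1024           # network input height and width
num_pooling = 4            # the truncated VGG16 backbone keeps 4 pooling layers
stride = 2 ** num_pooling  # total downsampling factor: 16
print(math.ceil(M / stride), math.ceil(N / stride))  # 36 64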

Region Proposal Networks (RPN)

After the backbone has extracted the image features, the feature map enters the RPN network, whose structure is shown in the figure below:
[Figure: structure of the RPN network]
The RPN network is divided into two branches: classification and regression. The classification branch uses softmax to classify each default candidate box (anchor) as positive or negative; the regression branch is responsible for predicting the offset of the actual target box relative to the default candidate box. Finally, the Proposal layer combines the positive anchors with the corresponding regression offsets to obtain the proposals.
The following discussion takes an input image of size $C=3, H=576, W=1024$ as an example; after feature extraction by the backbone, the feature map has size 1x512x36x64 (NCHW layout).

Anchors

In Faster R-CNN, 9 anchors with different sizes and different aspect ratios are provided. Each anchor is described by four values $(x_1, y_1, x_2, y_2)$, the coordinates of the upper-left and lower-right corners of the rectangle, as shown in the figure below:
[Figure: the 9 anchor shapes]
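A minimal sketch of how the 9 base anchors can be generated from 3 scales and 3 aspect ratios (the specific values here, scales 8/16/32 on a stride-16 feature map and ratios 0.5/1/2, are the common defaults and an assumption, not taken from the original post):

import numpy as np

def make_base_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    anchors = []
    for r in ratios:
        for s in scales:
            # each anchor keeps the area (base_size*s)^2 while h/w = r
            w = base_size * s / np.sqrt(r)
            h = base_size * s * np.sqrt(r)
            # (x1, y1, x2, y2) centered at the origin
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(make_base_anchors().shape)  # (9, 4)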

One thing to note is that the size of the anchors in Faster R-CNN is relative to the network input size. Taking an input image of size $C=3, H=576, W=1024$ as an example, Faster R-CNN contains a total of
$\lceil 1024/16 \rceil \times \lceil 576/16 \rceil \times 9 = 64 \times 36 \times 9 = 20736$
anchors. These anchors are laid out over the network input, as shown in the figure below. For each anchor, the classification branch must determine whether it contains an object: positive if it does, negative otherwise. At the same time, since there is an offset between the real position of the target and the anchor, the regression branch is needed to predict this offset.
[Figure: anchors tiled over the network input]

Softmax branch

After the backbone, the feature map has size 1x512x36x64. It first passes through a 1×1 convolution, giving a feature map of size 1x18x36x64, where 18 = 2 × 9: there are 9 anchors, and each anchor may be positive or negative.
[Figure: the 1×1 convolution of the softmax branch]
A reshape operation then changes the feature map to 1x2x(9×36)x64, mainly for the convenience of the two-class softmax; afterwards, another reshape restores the original layout. The implementation code snippet of the classification branch is as follows:

# define bg/fg classification score layer
self.nc_score_out = len(self.anchor_scales) * len(self.anchor_ratios) * 2  # 2 (bg/fg) * 9 (anchors)
self.RPN_cls_score = nn.Conv2d(512, self.nc_score_out, 1, 1, 0)  # 1x1 convolution

def forward(self, rpn_conv1):  # forward inference; rpn_conv1 is the 3x3 RPN conv output
    # get rpn classification score
    rpn_cls_score = self.RPN_cls_score(rpn_conv1)
    # reshape + softmax + reshape
    rpn_cls_score_reshape = self.reshape(rpn_cls_score, 2)
    rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape, 1)
    rpn_cls_prob = self.reshape(rpn_cls_prob_reshape, self.nc_score_out)
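The self.reshape helper is not shown above; in the widely used PyTorch port (jwyang/faster-rcnn.pytorch) it is roughly the following, folding the channel dimension down to d and pushing the rest into the height dimension (e.g. 1x18x36x64 → 1x2x(9×36)x64):

@staticmethod
def reshape(x, d):
    input_shape = x.size()
    # 1 x C x H x W  ->  1 x d x (C*H/d) x W
    x = x.view(
        input_shape[0],
        int(d),
        int(float(input_shape[1] * input_shape[2]) / float(d)),
        input_shape[3]
    )
    return x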

The softmax classification obtains the positive anchors, which amounts to a first rough extraction of candidate detection regions.

Regression branch

The regression branch predicts the offset between the anchor and the ground truth, so in the training phase the supervision signal is the gap between the anchor and the ground-truth box, $(t_x, t_y, t_w, t_h)$, where $t_x, t_y$ is the translation of the center point and $t_w, t_h$ is the scaling of the width and height. $(t_x, t_y, t_w, t_h)$ is calculated as follows:
$t_x = (x - x_a)/w_a; \quad t_y = (y - y_a)/h_a$
$t_w = \log(w/w_a); \quad t_h = \log(h/h_a)$
where $x_a, y_a, w_a, h_a$ are the center coordinates and width/height of the anchor. The code snippet of the regression branch is as follows:

# define anchor box offset prediction layer
self.nc_bbox_out = len(self.anchor_scales) * len(self.anchor_ratios) * 4  # 4 (coords) * 9 (anchors)
self.RPN_bbox_pred = nn.Conv2d(512, self.nc_bbox_out, 1, 1, 0)

def forward(self, rpn_conv1):
    # get rpn offsets to the anchor boxes
    rpn_bbox_pred = self.RPN_bbox_pred(rpn_conv1)
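An illustrative helper (not part of the original snippet) that computes the supervision targets exactly as the formulas above define them, with boxes given as center coordinates plus width/height:

import torch

def encode_targets(anchor, gt):
    # anchor and gt given as tensors [x_center, y_center, w, h]
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    tx = (x - xa) / wa      # horizontal translation, normalized by anchor width
    ty = (y - ya) / ha      # vertical translation, normalized by anchor height
    tw = torch.log(w / wa)  # log-scale width change
    th = torch.log(h / ha)  # log-scale height change
    return torch.stack([tx, ty, tw, th])

print(encode_targets(torch.tensor([50., 50., 100., 100.]),
                     torch.tensor([55., 48., 120., 90.])))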

Proposal layer

The Proposal layer computes accurate proposals from the positive anchors and their corresponding regression offsets, and sends them to the subsequent RoI Pooling layer. The definition of the Proposal layer is as follows:

# define proposal layer
self.RPN_proposal = _ProposalLayer(self.feat_stride, self.anchor_scales, self.anchor_ratios)

def forward(self, rpn_cls_prob, rpn_bbox_pred, im_info, cfg_key):
    rois = self.RPN_proposal((rpn_cls_prob.data, rpn_bbox_pred.data,
                              im_info, cfg_key))

The Proposal layer has three inputs: the outputs of the softmax branch and the regression branch, plus im_info, where im_info = [M, N, scale_factor]; M and N are the network input dimensions, and scale_factor records the scaling used to resize the original PxQ image to MxN.
The Proposal layer's execution process is as follows:

  1. Get the preliminary target boxes by regressing all anchors according to the predicted offsets;
anchors = self._anchors.view(1, A, 4) + shifts.view(K, 1, 4)
anchors = anchors.view(1, K * A, 4).expand(batch_size, K * A, 4)

# Transpose and reshape predicted bbox transformations to get them
# into the same order as the anchors:

bbox_deltas = bbox_deltas.permute(0, 2, 3, 1).contiguous()
bbox_deltas = bbox_deltas.view(batch_size, -1, 4)
# Convert anchors into proposals via bbox transformations
proposals = bbox_transform_inv(anchors, bbox_deltas, batch_size)
  2. Clip the proposal boxes to the image boundaries;
# clip predicted boxes to image
proposals = clip_boxes(proposals, im_info, batch_size)
  3. Sort the proposals by their positive softmax scores and keep the top RPN_PRE_NMS_TOP_N;
# sort all (proposal, score) pairs by score from highest to lowest
_, order = torch.sort(scores_keep, 1, True)
order_single = order[i]  # per-image slice inside the batch loop
if pre_nms_topN > 0 and pre_nms_topN < scores_keep.numel():
    order_single = order_single[:pre_nms_topN]
proposals_single = proposals_single[order_single, :]
scores_single = scores_single[order_single].view(-1, 1)
  4. Remove proposals with very small sizes;
  5. Apply NMS to the remaining proposals and keep the top RPN_POST_NMS_TOP_N as the output RoIs;
# apply nms (e.g. threshold = 0.7), take after_nms_topN (e.g. 300),
# and return the top proposals (-> RoIs top)

keep_idx_i = nms(torch.cat((proposals_single, scores_single), 1), nms_thresh, force_cpu=not cfg.USE_GPU_NMS)
keep_idx_i = keep_idx_i.long().view(-1)

if post_nms_topN > 0:
    keep_idx_i = keep_idx_i[:post_nms_topN]
proposals_single = proposals_single[keep_idx_i, :]
scores_single = scores_single[keep_idx_i, :]
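For intuition, the bbox_transform_inv call in step 1 inverts the encoding described in the regression-branch section; a simplified single-box sketch (illustrative, not the repository's batched implementation):

import math

def decode_box(anchor, deltas):
    # anchor: (x1, y1, x2, y2); deltas: predicted (tx, ty, tw, th)
    x1, y1, x2, y2 = anchor
    tx, ty, tw, th = deltas
    wa, ha = x2 - x1, y2 - y1
    xa, ya = x1 + 0.5 * wa, y1 + 0.5 * ha
    # invert the training-time encoding
    x, y = tx * wa + xa, ty * ha + ya
    w, h = wa * math.exp(tw), ha * math.exp(th)
    return (x - 0.5 * w, y - 0.5 * h, x + 0.5 * w, y + 0.5 * h)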

The proposal boxes output by the Proposal layer are expressed in the coordinates of the M×N network input.

RoI Pooling

The RoI Pooling layer is responsible for collecting the proposals, computing the feature map region corresponding to each proposal, and feeding the result into the subsequent network. It has two inputs:

  • Shared feature maps;
  • Proposal boxes output by RPN (relative to network input M×N)
During training, a separate anchor target layer matches the anchors against the ground-truth boxes to generate the classification and regression targets that supervise the RPN:

# define anchor target layer (training only)
self.RPN_anchor_target = _AnchorTargetLayer(self.feat_stride, self.anchor_scales, self.anchor_ratios)

def forward(self, rpn_cls_score, gt_boxes, im_info, num_boxes):
    rpn_data = self.RPN_anchor_target((rpn_cls_score.data, gt_boxes, im_info, num_boxes))

The reason for using RoI Pooling:
Since the proposal boxes output by the RPN vary in size and shape, while the fully connected layers that follow require inputs of a fixed size, Faster R-CNN uses RoI Pooling to solve this problem.
The workflow of RoI Pooling:
Assume the input feature map is as shown below:
[Figure: input feature map]
The proposal box projected onto the feature map covers the position shown:
[Figure: proposal projected onto the feature map]
Assuming the output is a 2×2 feature map, the projected region is divided into 2×2 sections:
[Figure: projected region divided into 2×2 sections]
Then max pooling is applied within each section:
[Figure: max-pooled 2×2 output]
Therefore, RoI Pooling pools feature map regions of different sizes into feature maps of the same size, which can then be fed to the next layer of the network.
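Modern PyTorch exposes this operation directly as torchvision.ops.roi_pool; a usage sketch with the 1x512x36x64 feature map from the running example (the proposal coordinates are made up):

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 36, 64)              # backbone output for a 576x1024 input
rois = torch.tensor([[0, 100.0, 80.0, 420.0, 300.0]])  # (batch_index, x1, y1, x2, y2) in input coords
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 512, 7, 7])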
Since the position of a proposal is usually obtained by model regression, its coordinates are generally floating-point numbers, while the pooled feature map requires a fixed size. This process therefore involves two quantizations:

  • Quantize the proposal boundaries to integer coordinates on the feature map;
  • Divide the quantized region evenly into k×k sections and quantize the boundaries of each section;

After these two quantizations, the candidate box deviates from the position originally obtained by regression, and this deviation degrades detection accuracy.
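A small worked example of the two quantizations (the proposal coordinates are made up for illustration):

import math

stride, k = 16, 7
x1, x2 = 210.7, 450.2          # proposal x-boundaries on the input image (floats)
fx1 = math.floor(x1 / stride)  # first quantization: 13 (exact value was 13.17)
fx2 = math.floor(x2 / stride)  # 28 (exact value was 28.14)
bin_w = (fx2 - fx1) / k        # 15 / 7 ≈ 2.14, not an integer
bin_w_q = math.floor(bin_w)    # second quantization: 2
print(fx1, fx2, bin_w_q)       # 13 28 2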

Classification and Regression

The 7×7 feature map output by RoI Pooling is sent to a Classifier module, which computes the specific category of each proposal through Linear layers and softmax and outputs the class confidence; regression is then used once more to obtain each proposal's position offsets, which regress a more precise target box. As shown in the figure below:
[Figure: classification and regression heads after RoI Pooling]
The code of the Classifier module is shown below; it is part of the VGG model:

self.classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096),
    nn.ReLU(True),
    nn.Dropout(p=dropout),
    nn.Linear(4096, 4096),
    nn.ReLU(True),
    nn.Dropout(p=dropout),
)

The fully connected layer and the softmax layer determine the category of the proposal and output the corresponding confidence:

self.RCNN_cls_score = nn.Linear(4096, self.n_classes)
cls_prob = F.softmax(cls_score, 1)

Another fully connected layer predicts the offsets used to further refine the target position:

self.RCNN_bbox_pred = nn.Linear(4096, 4 * self.n_classes) 
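Putting the head together, a sketch of how the pooled features flow through these layers (head_forward and its wiring are illustrative, not code from the original post):

import torch.nn.functional as F

def head_forward(self, pooled_feat):
    # pooled_feat: N x 512 x 7 x 7 from RoI Pooling
    x = self.classifier(pooled_feat.view(pooled_feat.size(0), -1))  # N x 4096
    cls_score = self.RCNN_cls_score(x)  # N x n_classes
    cls_prob = F.softmax(cls_score, 1)  # per-class confidence
    bbox_pred = self.RCNN_bbox_pred(x)  # N x (4 * n_classes) refinement offsets
    return cls_prob, bbox_pred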
