(5) Object detection - detectors based on region proposals

1. Key points of R-CNN

(1) Use Selective Search to produce region proposals (RPs). The RPs have different sizes, so each one is warped to a uniform size of 227*227 (a minimal warp sketch follows this list)

(2) Feed each 227*227 RP into a CNN for feature extraction

(3) Classify the features of each RP with independent SVMs

(4) Correct the original RPs with bounding-box (Bb) regression to produce the coordinates of the prediction window
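As a rough illustration of step (1), the sketch below crops each RP from the image and warps it to 227*227 with OpenCV; the dummy image, the proposal coordinates and the name warp_proposals are placeholders of mine, not part of the original R-CNN code.

import cv2
import numpy as np

def warp_proposals(image, proposals, size=227):
    """Crop each region proposal from the image and warp it to size x size."""
    crops = []
    for (x1, y1, x2, y2) in proposals:           # RP corners from selective search
        patch = image[y1:y2, x1:x2]              # crop the RP (its size varies)
        patch = cv2.resize(patch, (size, size))  # warp to a uniform 227*227
        crops.append(patch)
    return np.stack(crops)                       # N x 227 x 227 x 3

img = np.zeros((600, 1000, 3), dtype=np.uint8)                        # dummy image
batch = warp_proposals(img, [(10, 20, 200, 180), (300, 50, 450, 400)])
print(batch.shape)                                                    # (2, 227, 227, 3)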

2. Fast R-CNN

Training:

(1) Generate about 2k RPs with Selective Search

(2) Feed the entire image into the CNN to extract a feature map

(3) Map each RP onto the feature map of the CNN's last convolutional layer

(4) Turn each RP into a fixed-size feature map with the RoI pooling layer

(5) Jointly train classification and Bb regression with a softmax loss and a smooth L1 loss

When testing:

(1) Steps (1)~(4) are the same as in training

(2) Use the softmax output to obtain the classification probabilities

(3) Use the bounding-box regression branch to obtain the regression offsets

(4) Correct the original RPs with the Bb regression values to produce the final coordinates of the prediction windows (a sketch of this correction follows)
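A minimal numpy sketch of the correction in step (4), applying predicted offsets (△x, △y, △w, △h) to boxes given as (x1, y1, x2, y2); the name apply_deltas and the "+1" width convention are my own simplification of py-faster-rcnn-style code, not the source code itself.

import numpy as np

def apply_deltas(boxes, deltas):
    """Correct boxes (x1, y1, x2, y2) with predicted offsets (dx, dy, dw, dh)."""
    widths  = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x   = boxes[:, 0] + 0.5 * widths
    ctr_y   = boxes[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas[:, 0], deltas[:, 1], deltas[:, 2], deltas[:, 3]
    pred_ctr_x = dx * widths + ctr_x       # shift the center
    pred_ctr_y = dy * heights + ctr_y
    pred_w = np.exp(dw) * widths           # rescale width and height
    pred_h = np.exp(dh) * heights

    # back to corner coordinates of the prediction window
    return np.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                     pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], axis=1)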

3. Faster R-CNN

Training:

(1) Send the entire image through the CNN to extract a feature map

(2) Use the RPN to produce RPs, about 300 per image

(3) Map the RPs onto the feature map of the last convolutional layer

(4) Generate a fixed-size feature map for each RoI with the RoI pooling layer

(5) Jointly train the classification probabilities and Bb regression with a softmax loss and a smooth L1 loss

When testing:

(1) Steps (1)~(4) are the same as in training

(2) Use the softmax output to obtain the classification probabilities

(3) Use the bounding-box regression branch to obtain the regression offsets

(4) Correct the original RPs with the Bb regression values to produce the final coordinates of the prediction windows

4. Detailed pipeline of Faster R-CNN

(1) Feature extraction

Faster R-CNN accepts input images of any size. Before an image enters the network, it is rescaled to a normalized size: for example, the short side can be limited to 600 and the long side to 1000. Assume M*N = 1000*600 (if the image is smaller than this size, its edges are padded with 0).

① 13 conv layers: kernel_size=3, pad=1, stride=1

Convolution output-size formula: output = (input - kernel_size + 2*pad)/stride + 1, so with these settings a conv layer does not change the spatial size.

② 13 relu layers: activation functions, do not change the image size

③ 4 pooling layers: kernel_size=2, stride=2

Each pooling layer halves the size of its input. After feature extraction the image size has therefore become (M/16)*(N/16), i.e. about 60*40 (1000/16 ≈ 60, 600/16 ≈ 40); a small size-calculation sketch follows.
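A short sketch of this size arithmetic under the layer settings above; out_size is an illustrative helper, and the exact result for 1000*600 is 62*37, which the rest of this article rounds to about 60*40.

def out_size(n, kernel, pad, stride):
    """Output size of a conv/pool layer: floor((n + 2*pad - kernel) / stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

w, h = 1000, 600
print(out_size(w, 3, 1, 1), out_size(h, 3, 1, 1))  # 1000 600: a 3x3/pad 1/stride 1 conv keeps the size
for _ in range(4):                                  # four 2x2, stride-2 poolings, each halving the size
    w, h = out_size(w, 2, 0, 2), out_size(h, 2, 0, 2)
print(w, h)                                         # 62 37, roughly (M/16)*(N/16), i.e. ~60*40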

(2) RPN

After the feature map enters the RPN, it first passes through a 3*3 convolution; the feature map size remains 60*40*512.

rpn_cls_score: a 1*1*512*18 convolution maps 60*40*512 ==> 60*40*9*2, i.e. a binary (object / not object) classification for each of the 9 anchor boxes at every position

rpn_bbox_pred: a 1*1*512*36 convolution maps 60*40*512 ==> 60*40*9*4, i.e. four coordinate offsets for each of the 9 anchor boxes at every position

A sketch of this two-branch head follows.
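A minimal sketch of the two branches, written in PyTorch purely for illustration (the original implementation is Caffe); the class name RPNHead and the tensor sizes in the example are mine.

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3*3 conv followed by two sibling 1*1 convs: 2 scores and 4 offsets per anchor."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls_score = nn.Conv2d(512, num_anchors * 2, kernel_size=1)  # rpn_cls_score: 18 channels
        self.bbox_pred = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # rpn_bbox_pred: 36 channels

    def forward(self, feat):
        x = self.relu(self.conv(feat))
        return self.cls_score(x), self.bbox_pred(x)

feat = torch.randn(1, 512, 40, 60)        # the 60*40*512 feature map in (N, C, H, W) layout
scores, deltas = RPNHead()(feat)
print(scores.shape, deltas.shape)         # (1, 18, 40, 60) and (1, 36, 40, 60)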

(2.1) Anchor generation rules

After feature extraction the image size is 1/16 of the original, so let feat_stride = 16. When generating anchors, first define a base_anchor, a 16*16 box (because one point on the 60*40 feature map corresponds to a 16*16 region of the original 1000*600 image). In the source code this box is written as the array [0, 0, 15, 15]. The parameters are ratios = [0.5, 1, 2] and scales = [8, 16, 32].

Left: starting from [0, 0, 15, 15], keep the area unchanged and vary the aspect ratio over [0.5, 1, 2]; these are the first generated anchor boxes.

Middle: change the scales instead, i.e. side lengths of 16*8=128, 16*16=256 and 16*32=512, giving the corresponding anchor boxes.

Right: applying both transformations finally generates the 9 anchor boxes.

The feature map size is 60*40, so a total of 60*40*9 = 21600 anchor boxes are generated. In the source code, a shift array is built from the offsets width: (0~60)*16 and height: (0~40)*16; these shifts are added to the base_anchor reference coordinates to obtain the anchor coordinates for every pixel of the feature map, an array of shape [21600, 4].

Calculation: with c the zoom ratio (scale) and r the aspect ratio, each anchor has width w = (16/sqrt(r))*c and height h = (16*sqrt(r))*c. A condensed sketch of these generation rules follows.
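A condensed numpy sketch of the rules above (16*16 base box, ratios [0.5, 1, 2], scales [8, 16, 32], shifts of feat_stride = 16); the rounding done in the original source code is omitted, and the function names are mine.

import numpy as np

def generate_anchors(base=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """9 base anchors on the 16*16 box: w = (base/sqrt(r))*c, h = (base*sqrt(r))*c."""
    ctr = (base - 1) / 2.0                          # center of [0, 0, 15, 15]
    anchors = []
    for r in ratios:
        w0 = base / np.sqrt(r)                      # keep the 16*16 area, change the aspect ratio
        h0 = base * np.sqrt(r)
        for c in scales:                            # c is the zoom ratio
            w, h = w0 * c, h0 * c
            anchors.append([ctr - 0.5 * (w - 1), ctr - 0.5 * (h - 1),
                            ctr + 0.5 * (w - 1), ctr + 0.5 * (h - 1)])
    return np.array(anchors)                        # (9, 4) boxes in (x1, y1, x2, y2)

def shift_anchors(base_anchors, width=60, height=40, feat_stride=16):
    """Slide the 9 base anchors over every feature-map position."""
    shift_x, shift_y = np.meshgrid(np.arange(width) * feat_stride,
                                   np.arange(height) * feat_stride)
    shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                       shift_x.ravel(), shift_y.ravel()], axis=1)
    return (base_anchors[None, :, :] + shifts[:, None, :]).reshape(-1, 4)

all_anchors = shift_anchors(generate_anchors())
print(all_anchors.shape)                            # (21600, 4)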

(2.2) NMS (non-maximum suppression) rules

The algorithm takes all candidate boxes generated for an image together with each box's score (these can be represented as an [N, 5] array dets, where the first 4 columns are the corner coordinates and the 5th is the score) and a threshold thresh. Initially no box is suppressed, and all boxes are sorted in descending order of score. Traversal starts from box 0 (the highest score); for each box that is not yet suppressed, every box whose IoU with it is greater than thresh is marked as suppressed. The unsuppressed boxes are returned (see the sketch below).
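The rule above as a numpy sketch, with dets the [N, 5] array described (x1, y1, x2, y2, score); this mirrors standard greedy NMS but is a simplified version, not the exact source code.

import numpy as np

def nms(dets, thresh):
    """Greedy NMS: keep boxes in descending score order, suppressing any box whose IoU with a kept box exceeds thresh."""
    x1, y1, x2, y2, scores = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3], dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]                # sort boxes by descending score

    keep = []
    while order.size > 0:
        i = order[0]                              # highest-scoring box not yet suppressed
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])    # intersection with the remaining boxes
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        iou = w * h / (areas[i] + areas[order[1:]] - w * h)
        order = order[1:][iou <= thresh]          # suppress boxes whose IoU > thresh
    return keep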

(2.3) How the RPN works

 rpn-data:

(1) Generate the anchors and their shifts (each feature-map position multiplied by _feat_stride) to form the candidate rpn_proposals, and eliminate those that cross the image boundary

(2) Assign rpn_labels according to the following rules (a labelling sketch follows the list):

① The anchor box with the largest IoU against a ground-truth box is marked as a positive sample, rpn_label = 1

② An anchor box whose IoU with a ground-truth box is > 0.7 is marked as a positive sample, rpn_label = 1

③ An anchor box whose IoU with every ground-truth box is < 0.3 is marked as a negative sample, rpn_label = 0; the remaining anchors are neither positive nor negative, are not used in the final training, and get rpn_label = -1

④ If there are too many labels of 1 or 0, randomly subsample them so that the ratio of rpn_label = 1 to rpn_label = 0 is 1:1
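A simplified numpy sketch of rules ①~④, assuming the IoU matrix between anchors and ground-truth boxes has already been computed; the sample budget of 256 anchors (128 per class) is illustrative.

import numpy as np

def label_anchors(ious, num_samples=256):
    """ious: (num_anchors, num_gt) IoU matrix. Returns rpn_labels in {1, 0, -1}."""
    labels = -np.ones(ious.shape[0], dtype=np.int32)   # default: ignored, rpn_label = -1

    max_iou = ious.max(axis=1)
    labels[max_iou < 0.3] = 0                          # rule 3: IoU < 0.3 -> negative
    labels[max_iou > 0.7] = 1                          # rule 2: IoU > 0.7 -> positive
    labels[ious.argmax(axis=0)] = 1                    # rule 1: best anchor for each gt -> positive

    for value in (1, 0):                               # rule 4: subsample to roughly 1:1
        idx = np.where(labels == value)[0]
        if len(idx) > num_samples // 2:
            disable = np.random.choice(idx, len(idx) - num_samples // 2, replace=False)
            labels[disable] = -1
    return labels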

(3) Definition of rpn_bbox_targets

gt: the ground-truth (calibrated) box, with center coordinates x*, y* and width/height w*, h*

rpn_proposal: center coordinates x_a, y_a and width/height w_a, h_a

Offsets: △x = (x* - x_a)/w_a, △y = (y* - y_a)/h_a, △w = log(w*/w_a), △h = log(h*/h_a)

rpn_bbox_targets: (△x, △y, △w, △h) (see the sketch below)
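The offsets above as a numpy sketch; anchors and their matched gt boxes are assumed to be given in (x1, y1, x2, y2) format, one gt row per anchor.

import numpy as np

def bbox_targets(anchors, gt):
    """rpn_bbox_targets = (dx, dy, dw, dh) between each anchor and its matched gt box."""
    w_a = anchors[:, 2] - anchors[:, 0] + 1.0
    h_a = anchors[:, 3] - anchors[:, 1] + 1.0
    x_a = anchors[:, 0] + 0.5 * w_a
    y_a = anchors[:, 1] + 0.5 * h_a

    w_g = gt[:, 2] - gt[:, 0] + 1.0
    h_g = gt[:, 3] - gt[:, 1] + 1.0
    x_g = gt[:, 0] + 0.5 * w_g
    y_g = gt[:, 1] + 0.5 * h_g

    dx = (x_g - x_a) / w_a                 # △x
    dy = (y_g - y_a) / h_a                 # △y
    dw = np.log(w_g / w_a)                 # △w
    dh = np.log(h_g / h_a)                 # △h
    return np.stack([dx, dy, dw, dh], axis=1)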

(4) bbox_inside_weights: at the indices where rpn_label = 1 the elements are (1.0, 1.0, 1.0, 1.0); everywhere else they are 0.

bbox_outside_weights: at the indices where rpn_label = 1 or rpn_label = 0 the elements are np.ones((1, 4)) * 1.0 / np.sum(labels >= 0); everywhere else they are 0.

bbox_inside_weights thus restrict the regression loss to the positive anchors, while bbox_outside_weights normalize the loss over the sampled anchors (the out-of-bounds boxes in total_anchors have already been removed at this stage).

(5) rpn_loss_cls: decides whether an anchor contains an object; it is the loss between rpn_labels and rpn_cls_score_reshape. rpn_label can take the values -1, 0 and 1, and entries with rpn_label = -1 are ignored in the calculation.

(6) rpn_loss_bbox: measures the accuracy of the predicted offsets; the goal is for rpn_bbox_pred to learn to predict the offsets rpn_bbox_targets between the anchors and the ground truth. The calculation formula is (sketched below):

rpn_bbox_outside_weight*SmoothL1(rpn_bbox_inside_weight*(rpn_bbox_pred-rpn_bbox_targets))
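The formula above as a numpy sketch; it uses the basic smooth L1 definition (0.5x² below 1, |x| - 0.5 above) and omits the sigma parameter and normalization details of the actual layer.

import numpy as np

def smooth_l1(x):
    """SmoothL1(x) = 0.5*x^2 if |x| < 1 else |x| - 0.5, applied elementwise."""
    absx = np.abs(x)
    return np.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def rpn_loss_bbox(bbox_pred, bbox_targets, inside_w, outside_w):
    """rpn_bbox_outside_weight * SmoothL1(rpn_bbox_inside_weight * (pred - targets)), summed."""
    return np.sum(outside_w * smooth_l1(inside_w * (bbox_pred - bbox_targets)))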

proposal:

Within the spatial extent of rpn_cls_prob_reshape, the 60*40*9 anchor boxes are regenerated, and the trained offsets △x, △y, △w, △h (from rpn_bbox_pred) are added to them, giving prediction boxes (rpn_proposals) that are more accurate than before. Out-of-bounds boxes are culled, and overlapping boxes are removed with NMS: for example, with an IoU threshold of 0.7, only the local maximum-score boxes whose overlap does not exceed 0.7 are kept (a coarse sieve). About 2000 anchors remain, and the first N boxes (e.g. 300) are then taken, so only about 300 region proposals enter the RoI pooling layer. The output is rpn_rois (at test time these boxes are used for RoI pooling). A sketch of this proposal step follows.
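How the pieces fit together, as a short sketch; it reuses the apply_deltas and nms functions sketched earlier in this article (so it is not standalone), simplifies out-of-bounds culling to clipping, and all names and default values are illustrative.

import numpy as np

def proposal_layer(scores, deltas, anchors, im_size,
                   pre_nms_top=2000, post_nms_top=300, nms_thresh=0.7):
    """scores: (N,) objectness, deltas: (N, 4), anchors: (N, 4). Returns rpn_rois."""
    boxes = apply_deltas(anchors, deltas)                         # add the trained offsets
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, im_size[1] - 1)   # clip x to the image
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, im_size[0] - 1)   # clip y to the image

    order = scores.argsort()[::-1][:pre_nms_top]                  # keep ~2000 highest-scoring boxes
    dets = np.hstack([boxes[order], scores[order].reshape(-1, 1)])
    keep = nms(dets, nms_thresh)[:post_nms_top]                   # NMS at IoU 0.7, then take the first 300
    return dets[keep, :4]                                         # rpn_rois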

roi_data: (exists only during training)

The inputs are rpn_rois (from proposal) and gt_boxes. For each rpn_roi, the gt box with the largest overlap is kept; rois whose overlap is greater than 0.5 are kept as foreground, while rois with overlap in (0.1, 0.5) are used as background with the label set to 0. The offsets between the rpn_rois and the gt boxes are computed, and the ratio of foreground to background RoIs is kept at about 1:3. The output rois are the rpn_rois retained after this processing, and the output bbox_targets are the offsets between the rpn_rois and the gt boxes.

ROI Pooling

The inputs are the region proposals produced by the RPN (rpn_rois from roi_data during training, rois from proposal during testing) and the feature map produced by the last layer of VGG16 (60*40*512). Each region proposal is traversed and its coordinates are divided by 16, so that a proposal defined on the original 1000*600 image is mapped onto the 60*40 feature map; this determines a region (call it RB*) on the feature map. According to the parameters pooled_w = 7 and pooled_h = 7, RB* is divided into 7*7 = 49 small regions of roughly equal size, and for each small region max pooling selects the largest value as the output, forming a 7*7 feature map. After the 300 region proposals have been traversed, the output array of shape [300, 512, 7, 7] is used as the input to the fully connected layers of the next stage. A minimal sketch follows.
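A minimal numpy sketch of the RoI pooling described above for one region proposal; the bin rounding is a simplification of the original Caffe layer, and the RoI coordinates in the example are made up.

import numpy as np

def roi_pool(feature_map, roi, pooled_h=7, pooled_w=7, spatial_scale=1.0 / 16):
    """feature_map: (C, H, W); roi: (x1, y1, x2, y2) in original-image coordinates."""
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]  # map the RoI onto the 60*40 map
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)                      # guard against empty regions

    out = np.zeros((feature_map.shape[0], pooled_h, pooled_w), dtype=feature_map.dtype)
    bin_h = (y2 - y1) / pooled_h
    bin_w = (x2 - x1) / pooled_w
    for i in range(pooled_h):                    # split RB* into 7*7 bins and max-pool each bin
        for j in range(pooled_w):
            ys, ye = y1 + int(np.floor(i * bin_h)), y1 + int(np.ceil((i + 1) * bin_h))
            xs, xe = x1 + int(np.floor(j * bin_w)), x1 + int(np.ceil((j + 1) * bin_w))
            out[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out

fm = np.random.rand(512, 40, 60)                 # the 60*40*512 conv feature map in (C, H, W) layout
pooled = roi_pool(fm, (160, 96, 480, 320))       # one region proposal on the 1000*600 image
print(pooled.shape)                              # (512, 7, 7)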

rcnn network

After the RoI pooling layer, batch_size = 300 and each proposal feature map is 7*7 and 512-dimensional. The feature maps then pass through fully connected layers, and finally a softmax loss and a smooth L1 loss complete the classification and localization.

The fc layers and softmax compute which category each region proposal belongs to (e.g. person, horse, car) and output the probability vector cls_prob; at the same time, bounding-box regression produces the position offsets bbox_pred of each region proposal, which are used to regress more accurate detection boxes.

loss_cls: computed with SoftmaxWithLoss between the labels from roi_data and the fc-layer classification scores cls_score.

loss_bbox: computed with SmoothL1Loss between the bbox_targets from roi_data and the fc-layer regression output bbox_pred (offsets). A sketch of this detection head follows.
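A minimal sketch of this head, again in PyTorch for illustration only; the 21-class output (20 object classes plus background, as in PASCAL VOC) and the dummy labels are example choices, not taken from the text above.

import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """fc layers on the 7*7*512 pooled features, with sibling cls_score and bbox_pred outputs."""
    def __init__(self, num_classes=21):                       # e.g. 20 object classes + background
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(4096, num_classes)          # -> softmax -> cls_prob
        self.bbox_pred = nn.Linear(4096, num_classes * 4)      # per-class offsets

    def forward(self, pooled):                                 # pooled: (300, 512, 7, 7)
        x = self.fc(pooled)
        return self.cls_score(x), self.bbox_pred(x)

pooled = torch.randn(300, 512, 7, 7)                           # output of RoI pooling
cls_score, bbox_pred = FastRCNNHead()(pooled)
# loss_cls: cross-entropy between cls_score and the roi_data labels (dummy labels here)
loss_cls = nn.CrossEntropyLoss()(cls_score, torch.zeros(300, dtype=torch.long))
print(cls_score.shape, bbox_pred.shape, loss_cls.item())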
