[Notes] Paper notes on Guided Anchoring: Region Proposal by Guided Anchoring

& Paper overview

Topic: Region Proposal by Guided Anchoring

Authors & affiliations: Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, Dahua Lin || CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, Amazon Rekognition, Nanyang Technological University

Paper link: https://arxiv.org/abs/1901.03278

 

& Summary and personal views

The paper proposes the Guided Anchoring (GA) mechanism, which uses semantic features to guide anchor generation. By predicting the location and the shape of anchors separately and then combining the two, it generates sparse, non-uniform anchors of arbitrary shape. Compared with an RPN that uses the sliding window mechanism, the method improves recall by 9.1% while using 90% fewer anchors. It can also be applied to anchor-based detectors, where it improves mAP by up to 2.7%.

There are many places in this paper that I did not understand, but the overall idea is good, especially splitting the anchor computation into location and shape and then considering the correlation between the two. This greatly reduces the number of generated anchors while improving their quality.

 

& Contributions

1) A new anchoring scheme is proposed that predicts non-uniform anchors of arbitrary shape, instead of using a dense set of predefined anchors;

2) The joint anchor distribution is factorized into two conditional distributions, and a module is designed to model each of them;

3) The importance of aligning features with their corresponding anchors is studied, and a feature adaption module is designed to refine the features based on the underlying anchor shape.

 

& Problems to be solved

Question: The uniform anchoring strategy with predefined anchors that is currently used is not optimal.

Analysis:

A reasonable anchor design follows two principles, alignment and consistency:

1) Alignment: to use convolutional features as the anchor representation, the anchor center needs to be aligned with the corresponding position on the feature map. That is, the center of each anchor in the original image must not shift as it passes through sampling, convolution and pooling, i.e. the anchor center position is an integer multiple of the stride. From this point of view, taking each feature map pixel directly as an anchor center and then choosing anchors by scale and aspect ratio is the mainstream approach of current anchor-based methods;

2) Consistency: the receptive field and semantic scope should be consistent with the scale and shape of the anchors at the different positions of the feature map. That said, does S3FD, which is based on the effective receptive field, not contradict this principle, since the scale and shape of its anchors are inconsistent with the range of the effective receptive field it uses?

 

Based on these two principles, the uniform anchoring strategy in current use is: each position on the feature map has k anchors with predefined scales and aspect ratios. This is not optimal and still has two flaws:

  • A fixed set of anchor aspect ratios has to be predefined for each problem, and a wrong design may hurt the speed and accuracy of the detector;
  • To maintain a sufficiently high recall, a large number of anchors is needed, but most of them are negative anchors, which causes significant computational overhead, especially when a heavy classifier is applied to the proposals.
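
To make the overhead of the uniform scheme concrete, here is a minimal Python sketch that counts sliding-window anchors. The image size, the list of strides and k = 9 are illustrative assumptions of mine, not values taken from these notes.

```python
import math

def dense_anchor_count(image_h, image_w, strides, k):
    """Count sliding-window anchors: k anchors at every feature map position."""
    total = 0
    for s in strides:
        feat_h = math.ceil(image_h / s)  # feature map height at this stride
        feat_w = math.ceil(image_w / s)  # feature map width at this stride
        total += feat_h * feat_w * k
    return total

# Assumed example: an 800 x 1280 image, FPN-like strides, k = 9 anchors per position.
n_anchors = dense_anchor_count(800, 1280, strides=[8, 16, 32, 64, 128], k=9)
print(n_anchors)
```

Most of these anchors fall on background, which is the overhead the second bullet refers to.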

 

According to observation and analysis, objects are not evenly distributed over an image, and the scale of an object is closely related to the image content, its location and the geometry of the scene. Thus, to reduce the reliance on hand-picked priors, the method first determines the sub-regions that may contain objects, and then determines the anchor shapes at the different positions.

Since the scale and aspect ratio are variable in this method, different feature map pixels need to learn adaptive representations that fit their respective anchors. This breaks the consistency principle of anchors, so the paper proposes an effective module that adapts the features based on the anchor geometry.

 

Previous methods vs. GA:

  • Sliding window. Previous methods: dense, uniform anchors are selected with a sliding window. GA: the sliding window is removed and a guidance mechanism generates sparse anchors.
  • Cascade detectors. Previous methods: bboxes are refined progressively over several stages, which introduces more model parameters and slows down inference; RoI Pooling / RoI Align is used to extract the features corresponding to each bbox, which is too expensive for proposal generation and for single-stage detectors.
  • Anchor-free methods. Previous methods: a simple pipeline produces the final detections in a single stage, but without anchors and anchor-based refinement they cannot handle complex scenes and cases. GA: the focus is a sparse, non-uniform anchoring scheme that improves the quality of the proposals and thereby the detection performance, which requires solving the misalignment and inconsistency problems.
  • Anchor refinement. Previous methods: some single-shot detectors refine anchors through multiple rounds of regression and classification. GA: anchors are not refined progressively; the anchor distribution is predicted directly and decomposed into location prediction and shape prediction.
  • Alignment. Previous methods: the alignment between anchors and features is not considered, so regressing the anchors multiple times breaks alignment and consistency. GA: only the anchor shape is predicted while the anchor center stays fixed, and the features are then adapted based on the predicted shape.

 

 

 

& Framework and main methods

1, Main model

  

2, Decomposition of the joint distribution

p(x, y, w, h | I) = p(x, y | I)p(w, h | x, y, I)

Two important pieces of information follow from this factorization: 1) objects are likely to exist only in certain regions of the image; 2) the shape of an object, i.e. its scale and aspect ratio, is closely related to its location. Accordingly, anchor prediction for an image is decomposed into predicting the anchor centers and then predicting the shape at each predicted center.

 

3, Anchor location prediction

This branch predicts a probability map p(· | F_I) that gives, for every location on the feature map F_I, the probability that it is the center of an object. The entry p(x, y | F_I) corresponds to the location centered at ((x + 1/2)s, (y + 1/2)s) in the image I, and to the corresponding receptive field in the image, where s is the stride of the feature map.
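
The mapping from a feature map location to its center in the image can be written as a small Python helper. This is only an illustration of the formula ((x + 1/2)s, (y + 1/2)s); the function name and the example values are my own.

```python
def feature_to_image_center(x, y, stride):
    """Map feature map location (x, y) to the image point ((x + 1/2)s, (y + 1/2)s)."""
    return ((x + 0.5) * stride, (y + 0.5) * stride)

# Assumed example: location (3, 5) on a feature map with stride 16.
center = feature_to_image_center(3, 5, stride=16)
print(center)
```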

In this sub-network, a 1 × 1 convolution is applied to F_I to obtain an objectness score map, which is then converted to probabilities by an element-wise sigmoid function. A deeper sub-network could give more accurate predictions, but a single convolution followed by a sigmoid transform achieves a good balance between efficiency and accuracy.

Thresholding this probability map can filter out 90% of the regions while preserving the same recall. The figure above shows the probability map. Since the excluded regions no longer need to be considered, the subsequent convolution layers are replaced by masked convolution for more efficient inference.
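
A minimal pure-Python sketch of the location branch as described here: a 1 × 1 convolution (a per-pixel weighted sum over channels), an element-wise sigmoid, and a threshold that keeps only the active locations. The toy feature values, the weights and the threshold of 0.5 are assumptions for illustration, not trained or published values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def location_probability(feature, weight, bias):
    """1x1 convolution with one output channel, followed by element-wise sigmoid.

    feature: nested list indexed [channel][row][col]
    weight:  one weight per input channel
    """
    channels = len(feature)
    rows, cols = len(feature[0]), len(feature[0][0])
    prob = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            score = bias + sum(weight[ch] * feature[ch][r][c] for ch in range(channels))
            prob[r][c] = sigmoid(score)
    return prob

def active_locations(prob, threshold):
    """Keep the (row, col) positions whose probability reaches the threshold."""
    return [(r, c) for r, row in enumerate(prob) for c, p in enumerate(row) if p >= threshold]

# Toy 2-channel, 2 x 2 feature map (assumed values).
feature = [
    [[0.0, 2.0], [4.0, -1.0]],
    [[1.0, 0.0], [2.0, -3.0]],
]
prob = location_probability(feature, weight=[1.0, -0.5], bias=-1.0)
kept = active_locations(prob, threshold=0.5)
print(kept)
```

Only the kept positions would go on to shape prediction, which is where the saving from masked convolution comes from.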

 

4, Anchor shape prediction

The shape branch predicts, for each location, the w and h that give the largest overlap with the nearest ground truth box. However, because the range of (w, h) is very large, predicting them directly is inaccurate and unstable, so the following transform is used:

w = σ · s · e^(dw), h = σ · s · e^(dh)

The branch therefore predicts (dw, dh), which are then mapped to (w, h), where s is the stride and σ is an empirical scale factor (σ = 8 here). This transform maps the output range from roughly [0, 1000] to [-1, 1], which makes the prediction easier to learn and more stable.
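
The transform can be checked numerically with a short Python sketch. The function name and the example strides are my own; σ = 8 is the value quoted above.

```python
import math

def decode_anchor_shape(dw, dh, stride, sigma=8.0):
    """Apply w = sigma * s * e^dw and h = sigma * s * e^dh."""
    w = sigma * stride * math.exp(dw)
    h = sigma * stride * math.exp(dh)
    return w, h

# dw = dh = 0 gives a square anchor of side sigma * stride.
w0, h0 = decode_anchor_shape(0.0, 0.0, stride=16)

# Outputs in [-1, 1] already span a wide range of sizes across assumed strides 8 and 64.
smallest, _ = decode_anchor_shape(-1.0, -1.0, stride=8)
largest, _ = decode_anchor_shape(1.0, 1.0, stride=64)
print(w0, h0, smallest, largest)
```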

Each location is associated with only one dynamically predicted anchor shape. Because arbitrary aspect ratios are allowed, the method can better capture objects that are extremely tall or extremely wide. In the figure above, the left panel shows the predicted anchor aspect ratios and the right panel shows the anchors generated from the predicted shapes and locations.

 

5, Anchor-guided feature adaption

A conventional RPN or single-stage detector uses predefined anchors, and every anchor location shares the same scales and aspect ratios, so the feature map can learn a consistent representation. Here, however, the anchor shape differs at each location, so processing the feature map in the conventional way is no longer appropriate.

According to the analysis, the feature for a large anchor should encode content over a large region, while the feature for a small anchor should encode a correspondingly small region. The paper therefore proposes anchor-guided feature adaption, which transforms the feature at each individual location based on the underlying anchor shape:

f_i' = N_T(f_i, w_i, h_i)

where f_i is the feature at the i-th location (x_i, y_i), (w_i, h_i) is the anchor shape at that location, and N_T is a 3 × 3 deformable convolution layer.

First, an offset field is predicted from the output of the shape prediction branch; then a deformable convolution with these offsets is applied to the original feature map to obtain f_i'. The adapted features are then used for bbox classification and regression.
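
The first step, predicting an offset field from the shape prediction, can be sketched in pure Python. This is a rough illustration under my own assumptions: the offsets are produced by a hypothetical 1 × 1 convolution from the two shape channels (dw, dh), and each location gets 2 × 3 × 3 = 18 offset values, two coordinates for each of the 9 sampling points of a 3 × 3 kernel. The deformable convolution itself is not implemented here; in practice it would come from a library operator such as torchvision.ops.deform_conv2d.

```python
KERNEL_SIZE = 3
# Two offset coordinates for each of the 3 x 3 sampling points.
OFFSET_CHANNELS = 2 * KERNEL_SIZE * KERNEL_SIZE

def predict_offset_field(shape_pred, weight):
    """Hypothetical 1x1 convolution from the shape map (dw, dh) to sampling offsets.

    shape_pred: nested list indexed [2][row][col] holding dw and dh
    weight:     nested list indexed [OFFSET_CHANNELS][2]
    returns:    nested list indexed [OFFSET_CHANNELS][row][col]
    """
    rows, cols = len(shape_pred[0]), len(shape_pred[0][0])
    return [
        [
            [
                weight[k][0] * shape_pred[0][r][c] + weight[k][1] * shape_pred[1][r][c]
                for c in range(cols)
            ]
            for r in range(rows)
        ]
        for k in range(OFFSET_CHANNELS)
    ]

# Toy 2 x 3 shape prediction and arbitrary (untrained) weights, both assumed.
shape_pred = [
    [[0.0, 0.5, 1.0], [0.0, 0.0, 0.0]],    # dw
    [[0.0, -0.5, 1.0], [0.0, 0.0, 0.0]],   # dh
]
weight = [[0.1 * k, -0.1 * k] for k in range(OFFSET_CHANNELS)]
offsets = predict_offset_field(shape_pred, weight)
print(len(offsets), len(offsets[0]), len(offsets[0][0]))
```

Where the offsets are all zero, the deformable convolution samples the regular 3 × 3 grid; non-zero offsets move the sampling points so that the feature covers a region related to the anchor shape.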

 

6, Loss

L = λ1 L_loc + λ2 L_shape + L_cls + L_reg

The above is the total loss function: a location loss and a shape loss are added on top of the classification and regression losses.

1) Location loss

Each image requires a binary label map in which 1 marks a valid location for placing an anchor. The ground truth boxes are used to generate this label map: more valid locations are desired near the center of an object, and fewer farther away. Let (x_g, y_g, w_g, h_g) denote a ground truth box and (x_g', y_g', w_g', h_g') the same box mapped to the scale of the corresponding feature map. R(x, y, w, h) denotes the rectangle centered at (x, y) with width w and height h. Anchors should be placed close to the center of the ground truth box to obtain a larger initial IoU, so three types of regions are defined for each box:

  • CR = R(x_g', y_g', σ1 w_g', σ1 h_g') is the center region of the box; locations in this region are treated as positive samples;
  • IR = R(x_g', y_g', σ2 w_g', σ2 h_g') \ CR, with σ2 > σ1, is a larger region excluding CR; locations in this region are ignored, similar to the concept of a gray zone;
  • OR is the outside region, i.e. everything except CR and IR; locations in this region are negative samples.
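
The three regions can be turned into a label map with a short Python sketch for a single ground truth box on a single feature level. The values σ1 = 0.2 and σ2 = 0.5, the use of cell centers (col + 0.5, row + 0.5) and the label encoding 1 / -1 / 0 are assumptions made for this illustration.

```python
def in_rect(px, py, cx, cy, w, h):
    """True if point (px, py) lies inside the rectangle R(cx, cy, w, h)."""
    return abs(px - cx) <= w / 2 and abs(py - cy) <= h / 2

def label_location_map(rows, cols, gt_box, sigma1=0.2, sigma2=0.5):
    """Label every feature map cell as 1 (CR), -1 (IR, ignored) or 0 (OR).

    gt_box is (x_g', y_g', w_g', h_g'), already mapped to feature map scale.
    """
    cx, cy, w, h = gt_box
    labels = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            px, py = c + 0.5, r + 0.5  # assumed convention: use the cell center
            if in_rect(px, py, cx, cy, sigma1 * w, sigma1 * h):
                labels[r][c] = 1       # center region: positive
            elif in_rect(px, py, cx, cy, sigma2 * w, sigma2 * h):
                labels[r][c] = -1      # ignore region: CR takes priority over IR
    return labels

# Assumed example: an 8 x 8 feature map with one box covering the whole map.
labels = label_location_map(8, 8, gt_box=(4.0, 4.0, 8.0, 8.0))
for row in labels:
    print(row)
```

Handling several overlapping boxes and several FPN levels would need the priority rules described in the notes (CR over IR over OR), which this single-box sketch does not cover.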

 

When multiple feature levels are used, as in FPN, the interaction between adjacent levels is also taken into account. Each level only targets objects within a specific scale range, so CR is assigned on a feature map only when that feature map matches the scale range of the object; the same region on the adjacent levels is set to IR, as shown in the figure. When multiple objects overlap, CR suppresses IR, and IR suppresses OR. Because CR covers only a small part of the feature map, Focal Loss is used to train the location branch.
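
A minimal sketch of how a per-location binary focal loss could treat the three kinds of labels, with IR locations contributing nothing. The label encoding (1 for CR, 0 for OR, -1 for IR) and the values α = 0.25 and γ = 2 are assumptions for illustration; the notes do not state which settings are used.

```python
import math

def focal_loss(prob, label, alpha=0.25, gamma=2.0):
    """Binary focal loss for one location.

    label: 1 for CR (positive), 0 for OR (negative), -1 for IR (ignored).
    """
    if label == -1:
        return 0.0  # ignored locations do not contribute to the loss
    if label == 1:
        return -alpha * (1.0 - prob) ** gamma * math.log(prob)
    return -(1.0 - alpha) * prob ** gamma * math.log(1.0 - prob)

# A confidently correct negative is down-weighted far more than a missed positive.
easy_negative = focal_loss(0.05, 0)
hard_positive = focal_loss(0.05, 1)
print(easy_negative, hard_positive)
```

The down-weighting of easy negatives is what makes the loss suitable when positives (CR) are rare compared with OR locations.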

 

2) Shape loss

First, each anchor is matched to a ground truth box; then the w and h that maximize the IoU with the matched ground truth box are predicted.

a_wh = {(x0, y0, w, h) | w > 0, h > 0}, gt = (x_g, y_g, w_g, h_g)

vIoU(a_wh, gt) = max_{w>0, h>0} IoU_normal(a_wh, gt)

For an arbitrary anchor location and ground truth box, computing vIoU exactly is quite complicated, and it is hard to implement efficiently in an end-to-end network. An approximation is therefore used: for a given (x0, y0), some common values of w and h are sampled to simulate the enumeration of all w and h, and the IoU between each sampled anchor and gt is computed, with the maximum taken as the approximation of vIoU. In the experiments, 9 pairs of (w, h) are sampled, using the scales and aspect ratios of RetinaNet. The final shape loss is then computed between the predicted shape and the shape of the matched ground truth box.
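
The sampling approximation of vIoU described above can be sketched in Python as follows. The 9 sampled shapes use three octave scales and three aspect ratios with a base size of 4 × stride, which is my assumption about what "the scales and aspect ratios of RetinaNet" means here; boxes are given as (center x, center y, w, h).

```python
import math

def iou_center(box_a, box_b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union

def retinanet_like_shapes(stride, base_scale=4.0):
    """Return 9 (w, h) pairs: 3 octave scales x 3 aspect ratios (assumed settings)."""
    shapes = []
    for octave in (0.0, 1.0 / 3.0, 2.0 / 3.0):
        size = base_scale * stride * 2.0 ** octave
        for ratio in (0.5, 1.0, 2.0):  # ratio = h / w
            shapes.append((size / math.sqrt(ratio), size * math.sqrt(ratio)))
    return shapes

def approx_viou(x0, y0, gt, shapes):
    """Approximate vIoU as the best IoU over the sampled shapes at a fixed center."""
    return max(iou_center((x0, y0, w, h), gt) for w, h in shapes)

shapes = retinanet_like_shapes(stride=16)
viou_square = approx_viou(100.0, 100.0, (100.0, 100.0, 64.0, 64.0), shapes)
viou_tall = approx_viou(100.0, 100.0, (100.0, 100.0, 40.0, 120.0), shapes)
print(viou_square, viou_tall)
```

A box that matches one of the sampled shapes gets the maximum value, while other boxes get a lower bound on the true vIoU, since only 9 shapes are tried.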

 

7, Results

The effect of the two anchor principles, alignment and consistency, on the overall results is verified experimentally. As shown below, satisfying both principles greatly improves recall. AR is reported for 100, 300 and 1000 proposals.

 

An ablation study of the three components proposed in the paper (location, shape and feature adaption) shows that splitting anchor selection into location and shape improves the results considerably, and that adding feature adaption brings a further gain of almost 5 points.

  

The following table compares different detectors using GA anchors with their original versions. GA improves the results by about two points on average.

The figure compares the distribution of anchor scales and aspect ratios for the ground truth, GA and the sliding window. Because the sliding window uses predefined scales and aspect ratios, its anchors show little diversity, while the anchors generated by GA match the distribution of the ground truth boxes more closely.

 

The following table compares Faster R-CNN using the RPN module with Faster R-CNN using the GA-RPN module, i.e. the results of using higher-quality anchors. The GA-RPN strategy is clearly superior. The figure below also contrasts the anchors generated by RPN and by GA-RPN: compared with RPN, the anchors generated by GA-RPN are of higher quality and fewer in number.

  

& Problems encountered

1, In the location branch, the objectness probability is computed for each position. How exactly is it computed, and can it happen that an object is missed entirely? If the threshold is set unreasonably, the results will deviate from the ground truth, so how should the threshold be chosen?

2, The anchors used here are obtained from the location and shape predictions. Normally the shape branch can predict only a single (w, h) per location; how could this be changed?

3, If the same anchor settings as RetinaNet are used, does that mean a similar way of setting anchors could achieve the same result? Conversely, if RetinaNet directly takes the one of its nine anchors with the maximum IoU, why is the result not the same as the shape predicted by GA? Does the performance gain of GA over RPN come mainly from the location branch and the offsets?

4, When computing vIoU and looking for the w and h with the maximum coverage, what should be done if the same region is covered by two or more ground truth boxes?

 

& Reflection and inspiration

The highlight of this paper is decomposing anchor prediction into location prediction and shape prediction and then combining the two. Predicting the shape also yields a wide range of scales and aspect ratios, which greatly improves the detection accuracy on hard samples. Splitting one task into several sub-tasks can improve performance on the original task.

There are still some points I have not understood; I need to look at the corresponding parts of the code and then revisit them.

 


Origin www.cnblogs.com/fanzhongjie/p/11615432.html