[Object Detection Series] ATSS: Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection (CVPR 2020)

Paper link: https://arxiv.org/pdf/1912.02424v4.pdf
Code link: https://github.com/sfzhang15/ATSS
Now part of mmdetection: https://github.com/open-mmlab/mmdetection/tree/master/configs/atss

The paper looks for the main factor behind the performance gap between anchor-based and anchor-free detectors, and argues that this factor is the definition of positive and negative samples: if the same definition is used, the regression mode (from a box or from a point) has little impact on the final result.

1. The fundamental difference between RetinaNet and FCOS

[Figure 1. RetinaNet]

[Figure 2. FCOS]

     The typical single-stage anchor-based detector is RetinaNet, and the typical single-stage anchor-free detector is FCOS. With ResNet50-FPN as the backbone, the COCO mAP of the former is 35.7 and of the latter 38.6, a gap of about 3 points. Their designs differ in three main ways:

  • The number of anchors per position differs: RetinaNet places multiple anchors at each position, while FCOS has only one anchor point per position.
  • The definition of positive and negative samples differs: RetinaNet uses IoU to define positives and negatives, while FCOS uses spatial and scale constraints to select samples.
  • The starting point of regression differs: RetinaNet regresses from the pre-set anchor boxes, while FCOS regresses from the anchor point positions.

1.1. Experimental analysis

     The following experiments analyze the differences between anchor-based and anchor-free detectors.

     First, the definition of positive/negative samples and the starting point of regression are analyzed by comparing RetinaNet and FCOS. To change only one variable at a time, RetinaNet is restricted to a single anchor per feature point (originally 9: 3 aspect ratios x 3 scales).
     The backbone is ResNet50-FPN. Each RetinaNet feature point corresponds to one square anchor of size 8S, where S is the stride of the feature level (the FPN strides are 8, 16, 32, 64, 128; p3, p4, p5 are built from ResNet's c3, c4, c5 through the FPN convs, and p6, p7 are obtained by applying two stride-2 3x3 convolutions on top of p5).

Note:
(1) FPN was not originally used this way. The original FPN uses c2, c3, c4, c5 of ResNet to obtain p2, p3, p4, p5 (4 levels), and then adds one maxpool layer on top of p5 to get p6; this is the mode used by FasterRCNN+FPN in mmdetection.
(2) RetinaNet uses c3, c4, c5 of ResNet to obtain p3, p4, p5. c2 is not used because, according to the original paper, it costs too much computation, and moving the pyramid up helps the detection of large objects (in my view, c2 can still be used in practice depending on the application, which improves small-object detection). RetinaNet does not generate p6, p7 from p5; instead, p6 is generated from the original c5 by a stride-2 3x3 conv, and p7 is generated from p6 by another stride-2 3x3 conv.
(3) FCOS and ATSS also use c3, c4, c5 to obtain p3, p4, p5, and then obtain p6 and p7 by applying two stride-2 3x3 convs on p5. According to the FCOS paper, the two variants perform about the same.
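To make the one-anchor-per-point setting above concrete, here is a minimal sketch (not taken from either repository) of how one square anchor of size 8S could be generated for every location of one feature level:

```python
import torch

def square_anchors_per_level(feat_h, feat_w, stride, scale=8):
    """One square anchor of side scale*stride centered at each feature point."""
    size = scale * stride                         # e.g. stride 8 -> anchor side 64
    xs = (torch.arange(feat_w) + 0.5) * stride    # cell centers in image coordinates
    ys = (torch.arange(feat_h) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    cx, cy = cx.reshape(-1), cy.reshape(-1)
    half = size / 2
    # (x1, y1, x2, y2) for every location on this level
    return torch.stack([cx - half, cy - half, cx + half, cy + half], dim=1)

# e.g. a stride-8 P3 feature map of roughly 100x167 for a 1333x800 input
p3_anchors = square_anchors_per_level(100, 167, stride=8)   # each anchor is 64x64
```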

  • During training: the input image is resized to 1333x800; the optimizer is SGD with momentum=0.9 and weight decay=0.0001; batch_size=16 (8 GPUs x 2 images/GPU); training runs for 90k iterations (~12 epochs); the initial learning rate is 0.01 and is divided by 10 at 60k (~8 epochs) and 80k (~11 epochs) iterations. These are fairly routine settings.

  • During inference: the image is also resized to 1333x800; predicted bboxes and class scores are obtained; predictions with a score below 0.05 are discarded and the top 1000 results are kept per feature level; non-maximum suppression (NMS) is then applied with an IoU threshold of 0.6 within each category, and finally the top 100 predictions per image are output.
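The following is an illustrative simplification of that post-processing (not the repository's code), using torchvision's batched_nms for the class-wise NMS step:

```python
import torch
from torchvision.ops import batched_nms

def postprocess(level_boxes, level_scores, level_labels,
                score_thr=0.05, pre_nms_topk=1000, nms_iou=0.6, max_dets=100):
    """level_*: lists with one tensor per FPN level (boxes Nx4, scores N, labels N)."""
    boxes, scores, labels = [], [], []
    for b, s, l in zip(level_boxes, level_scores, level_labels):
        keep = s > score_thr                    # drop low-confidence predictions
        b, s, l = b[keep], s[keep], l[keep]
        if s.numel() > pre_nms_topk:            # keep the top 1000 per level
            s, idx = s.topk(pre_nms_topk)
            b, l = b[idx], l[idx]
        boxes.append(b); scores.append(s); labels.append(l)
    boxes, scores, labels = torch.cat(boxes), torch.cat(scores), torch.cat(labels)
    keep = batched_nms(boxes, scores, labels, iou_threshold=nms_iou)  # class-wise NMS
    keep = keep[:max_dets]                      # top 100 detections per image
    return boxes[keep], scores[keep], labels[keep]
```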

[Figure 3]

     The results of the comparison experiment are shown in the figure above, where RetinaNet (#A=1) denotes RetinaNet with only one anchor per feature point. The FCOS paper points out that FCOS beats RetinaNet (#A=1): 37.1 vs. 32.5. In addition, FCOS has some almost cost-free improvements, such as moving the centerness branch to the regression branch, using the GIoU loss function, and normalizing the regression targets by the stride of the corresponding feature level; these raise its performance from 37.1 to 37.8. Part of the gap between RetinaNet (#A=1) and FCOS therefore comes from general improvements that are not tied to being anchor-free: adding GroupNorm to the detection head, using the GIoU loss for regression, restricting positive samples to the interior of the ground truth, adding centerness, and adding a trainable scale for each feature level. ATSS points out that all of these can also be added to an anchor-based detector, so they are not the fundamental difference between anchor-based and anchor-free. Adding these improvements to RetinaNet (#A=1) in order to eliminate their effect gives:

  • Add GroupNorm to the detection head (+0.9)
  • Use the GIoU loss function for regression (+0.5)
  • Restrict positive samples to lie inside the ground truth (+0.4)
  • Add centerness (+1.5; a sketch of the centerness target follows this list)
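For reference, the centerness target used by FCOS (the quantity the centerness branch is trained to predict) is computed from the left/top/right/bottom regression distances; a minimal sketch:

```python
import torch

def centerness_target(ltrb):
    """ltrb: Nx4 distances (left, top, right, bottom) from a positive location to the gt box sides."""
    l, t, r, b = ltrb.unbind(dim=1)
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(lr * tb)   # 1 at the box center, approaches 0 near the borders
```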

The statement here differs from the FCOS paper, which confuses some readers who compare the two papers:
[Figure: comparison table from the FCOS paper]
(1) In the FCOS paper, RetinaNet uses P6 and P7 obtained from C5 and nms thr = 0.5, giving AP = 35.9, while FCOS with the same settings reaches AP = 36.3, i.e. 0.4 AP higher.
(2) In the ATSS paper, in order to unify the settings of RetinaNet and FCOS, the original 9 anchors per pixel are reduced to 1, and nms thr is changed to 0.6 (I think this is unreasonable: the IoU threshold of each model should be set to its own optimal value, so the fair comparison is RetinaNet at 0.5 vs. FCOS at 0.6; if every setting must be identical, there is no point in comparing at all). This is why RetinaNet's 35.9 AP drops to 32.5 in the ATSS paper.
(3) ctr. on reg: corresponds to centerness in Figure 3 of ATSS, but note that both refer to adding centerness to the regression branch, which adds 0.7 AP. The FCOS paper analyzes separately that adding centerness only to the regression branch, i.e. without a separate branch, has very little effect (see the figure below, second row), whereas centerness as an independent branch adds 3.6 AP, reaching 37.1, which is when FCOS first surpasses the anchor-based method.
That is the key point: before adding the independent centerness branch, FCOS only reaches 33.5 AP, worse than anchor-based RetinaNet's 35.9 AP. This is precisely the advantage of anchor-based methods, which can place multiple anchors at one pixel. The ATSS article felt this comparison was unfair and aligned all of RetinaNet's settings with FCOS, under which RetinaNet is indeed not as good as FCOS. What is unexpected is that after RetinaNet adopted all of FCOS's tricks, its AP rose from a very low 32.5 to 37.0 (+4.5 AP), while FCOS with the same tricks gained only 1.5 AP, reaching 38.6; counted as in the ATSS paper (starting from the version that already has the independent centerness branch), the tricks bring FCOS to 37.8, still 0.8 AP above the improved RetinaNet.
[Figure: centerness ablation from the FCOS paper]
(4) ctr. sampling: corresponds to "In GT Box" in this paper, i.e. positive anchor samples are only taken from inside the ground truth, which adds 0.3 AP.

     Finally, adding a trainable scale parameter for each feature level (+0.2; this gain is small because in an anchor-based detector the anchor size already differs per feature level, which plays the role of the trainable scalar in the anchor-free detector) brings RetinaNet (#A=1) to a final 37.0, still 0.8 below FCOS's 37.8. ATSS argues that this remaining gap reflects the essential difference between anchor-based and anchor-free. A minimal sketch of such a per-level scale module follows.
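A per-level trainable scale is nothing more than a learnable scalar multiplied onto the regression output of each feature level; a minimal sketch (mirroring, but not copied from, the FCOS-style Scale module):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """A single learnable scalar, one instance per FPN level."""
    def __init__(self, init_value=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return x * self.scale

# one scale per feature level P3..P7
scales = nn.ModuleList([Scale(1.0) for _ in range(5)])
```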

     So far, there are only two differences between the two:

  • The classification branches differ: the ways of defining positive and negative samples are different.
  • The regression branches differ: one regresses from an anchor box and the other from an anchor point.

     For the classification branch: RetinaNet uses IoU to pick positive and negative samples across feature levels. Each ground truth is first matched to its most suitable anchor; anchors with $IoU > \theta_p$ are defined as positive samples, anchors with $IoU < \theta_n$ are defined as negative samples, and the remaining anchors are ignored. FCOS uses spatial and scale constraints: anchor points inside a ground truth box are treated as candidate positive samples, the final positives are then selected from them according to the pre-defined scale range of each feature level, and all unselected anchor points are negative samples. A rough sketch of the two rules is given below.
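The sketch below is a simplification, not either repository's implementation: the 0.5/0.4 thresholds are RetinaNet's defaults, and the forced best-anchor match per ground truth is omitted.

```python
import torch

def retinanet_assign(ious, pos_thr=0.5, neg_thr=0.4):
    """ious: A x G IoU matrix between anchors and gt boxes.
    Returns per-anchor labels: gt index, -1 for negative, -2 for ignored."""
    max_iou, argmax = ious.max(dim=1)
    labels = torch.full((ious.shape[0],), -2, dtype=torch.long)  # ignored by default
    labels[max_iou < neg_thr] = -1                               # negatives
    pos = max_iou >= pos_thr
    labels[pos] = argmax[pos]                                    # positives
    return labels

def fcos_assign(points, gt_boxes, scale_ranges):
    """points: Px2 anchor-point coords; gt_boxes: Gx4; scale_ranges: per-point (lo, hi), Px2.
    A point is positive for a gt if it lies inside the box (spatial constraint) and the
    largest regression distance falls into the scale range of its level (scale constraint)."""
    l = points[:, None, 0] - gt_boxes[None, :, 0]
    t = points[:, None, 1] - gt_boxes[None, :, 1]
    r = gt_boxes[None, :, 2] - points[:, None, 0]
    b = gt_boxes[None, :, 3] - points[:, None, 1]
    ltrb = torch.stack([l, t, r, b], dim=-1)          # P x G x 4
    inside = ltrb.min(dim=-1).values > 0              # spatial constraint
    max_dist = ltrb.max(dim=-1).values                # scale constraint
    in_range = (max_dist >= scale_ranges[:, None, 0]) & (max_dist <= scale_ranges[:, None, 1])
    return inside & in_range                          # P x G positive mask
```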

[Figure 4]

[Figure 5]

     It can be seen from the table that changing the selection method from IoU to spatial and scale constraints raises the AP from 37.0 to 37.8.

     For the regression branch: after the positive and negative samples are determined, RetinaNet regresses from the anchor box, and the regression target is the four offsets between the anchor box and the ground truth; FCOS regresses from the anchor point, and the regression target is the distances from the point to the four sides of the box. The table shows that, under the same definition of positive and negative samples, there is no significant difference between the two regression modes. A sketch of the two target encodings is given below.
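Concretely, the two target encodings can be sketched as follows (the normalization constants and stride scaling that the real detectors apply are omitted):

```python
import torch

def retinanet_targets(anchors, gts):
    """Offsets (dx, dy, dw, dh) from anchor boxes (x1,y1,x2,y2) to their matched gt boxes."""
    aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    ax, ay = anchors[:, 0] + 0.5 * aw, anchors[:, 1] + 0.5 * ah
    gw, gh = gts[:, 2] - gts[:, 0], gts[:, 3] - gts[:, 1]
    gx, gy = gts[:, 0] + 0.5 * gw, gts[:, 1] + 0.5 * gh
    return torch.stack([(gx - ax) / aw, (gy - ay) / ah,
                        torch.log(gw / aw), torch.log(gh / ah)], dim=1)

def fcos_targets(points, gts):
    """Distances (l, t, r, b) from anchor points (x, y) to the matched gt box sides."""
    return torch.stack([points[:, 0] - gts[:, 0], points[:, 1] - gts[:, 1],
                        gts[:, 2] - points[:, 0], gts[:, 3] - points[:, 1]], dim=1)
```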

[Figure 6]

Conclusion: the difference between single-stage anchor-based and center-based (anchor-free) detectors lies essentially in the definition of positive and negative samples.

Based on this, Adaptive Training Sample Selection (ATSS) is proposed in the article.

2. ATSS (Adaptive Training Sample Selection)

     Previous sample selection strategies always involve some sensitive hyperparameters, such as the IoU threshold in anchor-based detectors and the scale range in anchor-free detectors. Once these hyperparameters are fixed, every ground truth must select its positive samples according to the same fixed rule, which suits most objects but not a minority of them.

     ATSS automatically selects positive and negative samples according to the statistical characteristics of the object, with almost no hyperparameters:

  • For each ground truth, on every feature level select the k anchor boxes whose centers are closest to the center of the ground truth (k=9 is the only hyperparameter of ATSS, and experiments show that its exact value matters little).
  • With L (L=5) feature levels, each ground truth therefore has kL candidate positive samples. Compute the IoU between each candidate and the ground truth.
  • Compute the mean m and standard deviation v of these candidate IoUs (for example, the per-level IoUs 0.06, 0.26, 0.88, 0.32 and 0.07 shown in Figure 7(a) give a mean of 0.312 and a standard deviation of 0.3), and set t = m + v = 0.612 as the IoU threshold of this ground truth; candidates with IoU ≥ t become the final positive samples.
  • In addition, the center of a positive sample must lie inside the ground truth box; if an anchor box is assigned to multiple ground truths, only the one with the highest IoU is kept for regression. All remaining anchors are negative samples.

      Since a different threshold is obtained for each object, the method is called Adaptive Training Sample Selection. A simplified sketch of the whole assignment procedure is given below.
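Putting the steps above together, here is a simplified single-image sketch of the assignment (the reference implementation is mmdetection's ATSSAssigner; the helper below is only illustrative):

```python
import torch
from torchvision.ops import box_iou

def atss_assign(anchors_per_level, gt_boxes, topk=9):
    """anchors_per_level: list of (N_l x 4) anchor boxes per FPN level; gt_boxes: G x 4.
    Returns, for every anchor, the assigned gt index or -1 for negatives."""
    anchors = torch.cat(anchors_per_level)                       # A x 4
    num_gt, num_anchors = gt_boxes.shape[0], anchors.shape[0]
    ious = box_iou(anchors, gt_boxes)                            # A x G
    anchor_cxcy = (anchors[:, :2] + anchors[:, 2:]) / 2
    gt_cxcy = (gt_boxes[:, :2] + gt_boxes[:, 2:]) / 2
    dists = torch.cdist(anchor_cxcy, gt_cxcy)                    # A x G center distances

    # 1) pick the topk closest anchors per gt on every level -> k*L candidates per gt
    candidate_idxs, start = [], 0
    for level_anchors in anchors_per_level:
        end = start + level_anchors.shape[0]
        k = min(topk, end - start)
        _, idx = dists[start:end].topk(k, dim=0, largest=False)
        candidate_idxs.append(idx + start)
        start = end
    candidate_idxs = torch.cat(candidate_idxs)                   # (k*L) x G

    # 2) adaptive IoU threshold = mean + std of the candidates' IoUs, per gt
    cand_ious = ious.gather(0, candidate_idxs)                   # (k*L) x G
    thr = cand_ious.mean(dim=0) + cand_ious.std(dim=0)           # G

    # 3) keep candidates above the threshold whose center lies inside the gt box
    assigned = torch.full((num_anchors,), -1, dtype=torch.long)
    best_iou = torch.zeros(num_anchors)
    for g in range(num_gt):
        idx = candidate_idxs[:, g]
        cx, cy = anchor_cxcy[idx, 0], anchor_cxcy[idx, 1]
        inside = (cx > gt_boxes[g, 0]) & (cx < gt_boxes[g, 2]) & \
                 (cy > gt_boxes[g, 1]) & (cy < gt_boxes[g, 3])
        pos = idx[(cand_ious[:, g] >= thr[g]) & inside]
        # 4) if an anchor is positive for several gts, keep the gt with the highest IoU
        better = ious[pos, g] > best_iou[pos]
        assigned[pos[better]] = g
        best_iou[pos[better]] = ious[pos[better], g]
    return assigned
```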
[Figure 7]

Some explanation of the motivation:

  • Select candidates by center distance: in RetinaNet, the closer the anchor-box center is to the gt center, the larger the IoU tends to be; in FCOS, the closer the anchor point is to the gt center, the higher the detection quality. So candidates whose centers are closer to the gt center are better.

  • Use the mean plus the standard deviation as the IoU threshold: the IoU mean measures how well the pre-defined anchors fit the object. A relatively high mean (Fig. 7(a)) means the candidates are of high quality, so the IoU threshold should be high; a relatively low mean (Fig. 7(b)) means the candidates are of low quality, so the threshold should be low. A high standard deviation means one feature level fits this gt much better than the others, and adding the standard deviation to the mean keeps positive samples only from that level; a low standard deviation means several feature levels fit well, and mean plus standard deviation then selects positives from multiple levels.

  • Restrict positive-sample centers to be inside the gt: an anchor whose center lies outside the gt is represented by features from outside the object, so such samples are poor and should be removed.

  • Fairness between different objects: statistically, about 16% of samples fall in the interval [m+v, 1]. Although the IoU distribution is not exactly a standard normal distribution, measurements show that each object gets roughly 0.2·kL positive samples regardless of its scale, aspect ratio, or position (a quick check of the 16% figure is sketched after this list). In the original strategies of RetinaNet and FCOS, larger objects get more positive samples, which makes the sampling unfair (although in my view this is not quite true: the larger anchors of RetinaNet sit on higher feature levels, which are also sparser, so on average this phenomenon may not appear).

  • Almost no hyperparameters: there is only one hyperparameter k, and experiments show that its value has little effect on the results.
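The "~16%" figure is simply the tail mass of a Gaussian beyond one standard deviation above the mean; a quick check (assuming scipy is available):

```python
from scipy.stats import norm

# P(X >= mu + sigma) for a Gaussian: the fraction of candidates expected above m + v
print(norm.sf(1))   # ~0.1587, i.e. roughly 16% of the k*L candidates
```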

3. Experimental results

[figure]
Verification part:

  • After adding the FCOS tricks, RetinaNet (#A=1) reaches 37.0; adding ATSS on top of this reaches 39.3 (+2.3). Since ATSS only redefines the positive and negative samples, the extra overhead is negligible.
  • ATSS can also be added to FCOS. The lite version only adds center sampling, which raises AP from 37.8 to 38.6 (+0.8) while each feature level still keeps its scale limit; the full version uses ATSS, which raises AP from 37.8 to 39.2, i.e. +0.6 on top of center sampling and +1.4 overall.

[figure]

  • Hyperparameter k: from the experimental results, the overall impact is small. (Is 1 point really that small?)

[figure]

  • Anchor size: in the earlier experiments, the anchor size is 8S, where S is the stride of each feature level (strides 8, 16, 32, 64, 128, so 8S is 64, 128, 256, 512, 1024). Changing the anchor scale has little impact on the results.
    [figure]

  • Aspect ratio: also has little effect.
    [figure]

  • Comparison experiments: comparisons are made on the MS COCO test-dev set, using multi-scale training (the short side after resize is randomly chosen from 640-800), extending the schedule from 90k to 180k iterations, with the learning-rate drops moved to 120k and 160k accordingly.

[figure]

  • Consider placing more anchors at each position: previously one anchor per position was used, RetinaNet (#A=1); now consider RetinaNet (#A=9), whose AP is 36.3 (vs. 32.5 for RetinaNet (#A=1)). Adding the tricks mentioned earlier (+Imprs in the figure) raises it to 38.4, and adding ATSS raises it to 39.2, which differs only marginally from the result of RetinaNet (#A=1)+ATSS.
  • In other words, as long as the strategy for selecting positive and negative samples is appropriate, how many anchors are placed at each position and what scales and aspect ratios they use have little impact on the final result.

4. How to use

ATSS is a way of defining positive and negative samples, so in mmdetection it lives in the positive/negative sample assignment module (BBox Assigner). To use it, change the assigner of your own model to the ATSS mode by adding it in the training settings, which calls the ATSSAssigner class:
[figure]
[figure]
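For reference, the relevant fragment of an mmdetection-style config would look roughly like this (field names follow mmdetection 2.x; check the linked configs/atss directory for the exact version you are using):

```python
# train_cfg of the detector: swap the assigner for ATSSAssigner
train_cfg = dict(
    assigner=dict(type='ATSSAssigner', topk=9),
    allowed_border=-1,
    pos_weight=-1,
    debug=False)
```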

Origin blog.csdn.net/qq_35759272/article/details/123829514