Paper Reading | Object Detection: ATSS, which pinpoints the essential difference between anchor-free and anchor-based detectors

Paper information

1. Title: Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection

2. Published: December 2019

3. Paper: https://arxiv.org/abs/1912.02424

4. Source code: https://github.com/sfzhang15/ATSS

Summary

With the introduction of FPN and Focal Loss, anchor-free detectors have become popular. This paper first points out that the essential difference between anchor-based and anchor-free detectors lies in how positive and negative training samples are defined, and that this is the real source of the performance gap between them. If the two use the same definition of positive and negative samples during training, there is no significant difference in final performance, regardless of whether they regress from a box or from a point. This shows that how positive and negative samples are selected during training is crucial for current object detectors. The authors then propose Adaptive Training Sample Selection (ATSS), which automatically selects positive and negative samples according to the statistical characteristics of each object. It significantly improves the performance of both anchor-free and anchor-based detectors and bridges the gap between them. The paper also discusses whether it is necessary to tile multiple anchors at each location of the image. With ATSS, the state-of-the-art detector reaches 50.7% AP without introducing any overhead.

1. Introduction

Current object detectors are still mostly anchor-based and can be divided into one-stage and two-stage methods. They first tile a large number of preset anchors over the image, then predict categories and refine the anchor positions one or more times, and finally output the refined anchors as detection results. With the emergence of FPN and Focal Loss, researchers have turned their attention to anchor-free detectors, which do not rely on preset anchors. They fall into two groups: one is keypoint-based, which first locates several preset or learned keypoints and then generates bounding boxes from them, such as CornerNet; the other is center-based, such as CenterNet and FCOS, which treats the center point or center region of an object as positive samples and then regresses from these positives the distances to the four sides of the object box.

Of the two anchor-free approaches, keypoint-based detectors follow the standard keypoint-estimation pipeline, which differs from the anchor-based pipeline. Center-based detectors, on the other hand, are similar to anchor-based detectors, except that points rather than boxes are used as preset samples. Taking the one-stage anchor-based detector RetinaNet and the one-stage anchor-free (center-based) detector FCOS as examples, they differ in three main ways:

  1. The number of anchors tiled at each location is different. RetinaNet places several anchor boxes at each location, while FCOS places only one anchor point per location. (A point in FCOS is equivalent to the center of an anchor box in RetinaNet, hence the term "anchor point".)
  2. The definition of positive and negative samples is different. RetinaNet uses the IoU between anchors and ground-truth boxes to determine positive and negative samples, while FCOS uses spatial and scale constraints to select them.
  3. The regression starting state is different. RetinaNet regresses the object box from a preset anchor box, while FCOS regresses it from the anchor point.

As reported in the FCOS paper, FCOS performs better than RetinaNet. This article discusses which of the above three differences causes the performance gap.

The experimental results show that the performance difference comes from the different ways of defining positive and negative training samples. How to judge the positive and negative of training samples is therefore worth further study. For this reason, the paper proposes a new adaptive training sample selection mechanism (ATSS), which automatically selects positive and negative training samples based on the characteristics of each object and closes the performance gap between anchor-based and anchor-free detectors. In addition, a series of experiments answers the remaining question: it is unnecessary to tile multiple anchors at each location of the image.

Main contributions of the paper:

  • Point out that the essential difference between anchor-based and anchor-free detectors is how positive and negative training samples are defined.
  • Propose an Adaptive Training Sample Selection (ATSS) mechanism that automatically selects positive and negative training samples according to the statistical characteristics of each object.
  • Show that tiling multiple anchors at each location of the image is unnecessary.
  • Achieve state-of-the-art performance on MS COCO without introducing any additional overhead.

2. Related work

Current CNN-based detectors can be divided into anchor-based and anchor-free. Anchor-based detectors are further divided into two-stage and one-stage, and anchor-free detectors into keypoint-based and center-based. The detectors reviewed in this section are all well known.

3. Analysis of the differences between anchor-based and anchor-free detectors

To make the conclusions general, the representative anchor-based detector RetinaNet and anchor-free detector FCOS are used to analyze the differences between the two types of detectors. This section focuses on the last two differences: how positive and negative samples are defined and the starting state of regression. The remaining difference, how many anchors are tiled at each location, is discussed in the next section; for now RetinaNet is given a single anchor per location so that it closely resembles FCOS. The rest of this section introduces the experimental settings, removes the inconsistencies between the two detectors, and finally points out the essential difference between anchor-based and anchor-free.

3.1. Experimental setup

Dataset. All experiments are conducted on MS COCO. Following common practice, the 115k images of the trainval35k split are used for training, and the 5k images of the minival split are used for validation and ablation analysis. Final performance is obtained by submitting the model to the evaluation server for test-dev.

Training details. An ImageNet pre-trained ResNet-50 with a 5-level FPN is used as the backbone, and the newly added layers are initialized in the same way as in RetinaNet. Each location on each FPN level has one square anchor of size 8S, where S is the stride of that level. Images are resized so that the shorter side is 800 and the longer side is at most 1333. The optimizer is SGD with 90k iterations, momentum 0.9, weight decay 0.0001, and batch size 16. The initial learning rate is 0.01 and is multiplied by 0.1 at 60k and 80k iterations.
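For readers who want to reproduce the schedule, the hyperparameters above map onto a standard PyTorch optimizer/scheduler pair. This is only a sketch of the settings listed here, not the authors' configuration file; the placeholder `model` stands in for the real detector, and the batch size of 16 would be handled by the data loader:

```python
import torch

# Placeholder network; only the optimizer and schedule below mirror the settings above.
model = torch.nn.Conv2d(3, 4, kernel_size=3)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,           # initial learning rate
    momentum=0.9,
    weight_decay=1e-4,
)
# 90k total iterations; decay the learning rate by 10x at 60k and 80k.
# scheduler.step() would be called once per training iteration.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60_000, 80_000], gamma=0.1
)
```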

Inference details. At inference time images are resized in the same way as during training and passed through the whole network to obtain predicted boxes and class scores. A score threshold of 0.05 is then used to filter out background boxes, and the top 1000 detections of each pyramid level are kept. Finally, NMS with an IoU threshold of 0.6 is applied per class, and the top 100 detections per image are output.
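This post-processing pipeline boils down to score filtering, per-level top-k, and class-wise NMS. Below is a minimal sketch under assumed inputs (lists of per-level `boxes`/`scores`/`labels` tensors for one image); it is an illustration of the thresholds mentioned above, not the authors' code:

```python
import torch
from torchvision.ops import batched_nms

def postprocess(level_boxes, level_scores, level_labels,
                score_thr=0.05, pre_nms_topk=1000, nms_iou=0.6, max_dets=100):
    """level_* are lists with one (N_i, 4) / (N_i,) / (N_i,) tensor per pyramid level."""
    all_boxes, all_scores, all_labels = [], [], []
    for boxes, scores, labels in zip(level_boxes, level_scores, level_labels):
        keep = scores > score_thr                        # filter out background boxes
        boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
        if scores.numel() > pre_nms_topk:                # keep the top 1000 per level
            scores, idx = scores.topk(pre_nms_topk)
            boxes, labels = boxes[idx], labels[idx]
        all_boxes.append(boxes)
        all_scores.append(scores)
        all_labels.append(labels)
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    labels = torch.cat(all_labels)
    keep = batched_nms(boxes, scores, labels, nms_iou)   # class-wise NMS at IoU 0.6
    keep = keep[:max_dets]                               # top 100 detections per image
    return boxes[keep], scores[keep], labels[keep]
```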

3.2. Removal of inconsistencies

After restricting RetinaNet to one anchor per location, denoted RetinaNet (#A=1), FCOS is still noticeably more accurate: 37.1% vs. 32.5%. With some further improvements in FCOS, this gap becomes even larger (37.1% rising to 37.8%). However, the gap is mainly caused by several universal improvements used in FCOS, such as adding GroupNorm in the heads, using GIoU as the regression loss, limiting positive samples to the ground-truth box, introducing the center-ness branch, and adding a trainable scalar for each pyramid level. These improvements can equally be applied to anchor-based detectors, so they are not the core difference between anchor-based and anchor-free. After applying them to RetinaNet (#A=1) to rule out these inconsistent settings, as shown in Table 1, RetinaNet rises to 37.0%, but a 0.8% gap with FCOS still remains. With all irrelevant differences excluded, the core differences can now be explored.

Table 1. Analysis of the implementation inconsistencies between RetinaNet and FCOS. #A denotes the number of anchors tiled at each location (#A=1 means one anchor per location).

3.3. Core Differences

After applying the universal improvements, RetinaNet (#A=1) and FCOS differ in only two respects: one concerns the classification sub-task, i.e., how positive and negative training samples are defined; the other concerns the regression sub-task, i.e., whether regression starts from an anchor box or from an anchor point.

Figure 1. Definition of positive and negative samples; 1 denotes a positive sample and 0 a negative sample. The blue box, red box, and red point denote the ground-truth box, anchor box, and anchor point respectively. (a) RetinaNet uses IoU to select samples in the spatial and scale dimensions simultaneously. (b) FCOS first finds candidate positive samples in the spatial dimension, and then selects the final positive samples in the scale dimension.

Classification. RetinaNet determines positive and negative training samples by IoU. As shown in Figure 1(a), it uses IoU to directly select the final positive samples in the spatial and scale dimensions at the same time: for each object, the anchor with the highest IoU is a positive sample, and so is any anchor box whose IoU is greater than θ_p; anchor boxes whose IoU is less than θ_n are labeled negative, and the rest are ignored. As shown in Figure 1(b), FCOS uses spatial and scale constraints to divide the anchor points across pyramid levels: it first takes the anchor points inside the ground-truth box as candidate positive samples, and then selects the final positive samples according to the scale range defined for each pyramid level (FCOS has several preset hyperparameters that define the scale ranges of the 5 pyramid levels; see the earlier FCOS notes); the unselected anchor points are negative samples. The two selection strategies produce different sets of positive and negative samples.
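To make the two definitions concrete, here is a simplified sketch of both assignment rules. It assumes `anchors`/`gt` are (N, 4)/(M, 4) boxes in (x1, y1, x2, y2) format, `points` is an (N, 2) tensor of anchor-point coordinates, and `scale_ranges` is an (N, 2) tensor giving the allowed regression range of each point's pyramid level; θ_p = 0.5 and θ_n = 0.4 follow RetinaNet's usual defaults. This is an illustration, not code from either repository:

```python
import torch
from torchvision.ops import box_iou

def retinanet_assign(anchors, gt, theta_p=0.5, theta_n=0.4):
    """IoU-based assignment: 1 = positive, 0 = negative, -1 = ignored."""
    iou = box_iou(anchors, gt)              # (N, M)
    max_iou, _ = iou.max(dim=1)             # best IoU of each anchor over all objects
    labels = torch.full((anchors.size(0),), -1, dtype=torch.long)
    labels[max_iou < theta_n] = 0           # low-IoU anchors are negatives
    labels[max_iou >= theta_p] = 1          # high-IoU anchors are positives
    labels[iou.argmax(dim=0)] = 1           # the best anchor of each object is always positive
    return labels

def fcos_assign(points, gt, scale_ranges):
    """Spatial + scale assignment: a point is positive if it lies inside a ground-truth
    box AND its maximum regression distance falls inside its level's scale range."""
    x, y = points[:, 0, None], points[:, 1, None]                 # (N, 1)
    l, t = x - gt[:, 0], y - gt[:, 1]                             # distances to the
    r, b = gt[:, 2] - x, gt[:, 3] - y                             # four box sides, (N, M)
    dists = torch.stack([l, t, r, b], dim=-1)                     # (N, M, 4)
    inside = dists.min(dim=-1).values > 0                         # spatial constraint
    max_dist = dists.max(dim=-1).values
    lo, hi = scale_ranges[:, 0, None], scale_ranges[:, 1, None]   # per-point range, (N, 1)
    in_scale = (max_dist >= lo) & (max_dist <= hi)                # scale constraint
    return (inside & in_scale).any(dim=1).long()                  # 1 = positive, 0 = negative
```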

Table 2. Analysis of different settings of RetinaNet and FCOS on the MS COCO minival data set.

As shown in the Box column of Table 2, when RetinaNet replaces IoU with the spatial and scale constraints, its accuracy increases from 37.0% to 37.8%. Conversely, as shown in the Point column, when FCOS selects positive samples with the IoU strategy, its AP drops from 37.8% to 36.9%. This shows that the definition of positive and negative samples is the essential difference between anchor-based and anchor-free detectors.

Regression. After positive and negative samples are determined, regression is performed on the positive samples, as shown in Figure 2(a). When RetinaNet and FCOS use the same sample selection strategy, they obtain the same positive and negative samples, and in that case the final performance shows no significant difference regardless of whether regression starts from a point or from a box: 37.0% vs. 36.9% and 37.8% vs. 37.8% in Table 2. This indicates that the starting state of regression is an irrelevant difference rather than a core one.

Figure 2. (a) The blue point and box are the center and bounding box of the object; the red point and box are the center and box of the anchor. (b) RetinaNet regresses from the anchor box with four offsets. (c) FCOS regresses from the anchor point with the distances to the four sides of the object box.
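For completeness, the two regression starting states correspond to two different target parameterizations. The sketch below (assuming (x1, y1, x2, y2) boxes) is only an illustration: the RetinaNet-style deltas follow the usual R-CNN box parameterization, while the FCOS-style targets are the four point-to-side distances:

```python
import torch

def retinanet_targets(anchor, gt):
    """Box-to-box offsets (dx, dy, dw, dh), the standard R-CNN parameterization."""
    ax, ay = (anchor[..., 0] + anchor[..., 2]) / 2, (anchor[..., 1] + anchor[..., 3]) / 2
    aw, ah = anchor[..., 2] - anchor[..., 0], anchor[..., 3] - anchor[..., 1]
    gx, gy = (gt[..., 0] + gt[..., 2]) / 2, (gt[..., 1] + gt[..., 3]) / 2
    gw, gh = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    return torch.stack([(gx - ax) / aw, (gy - ay) / ah,
                        torch.log(gw / aw), torch.log(gh / ah)], dim=-1)

def fcos_targets(point, gt):
    """Point-to-box distances (l, t, r, b) from the anchor point to the four sides."""
    x, y = point[..., 0], point[..., 1]
    return torch.stack([x - gt[..., 0], y - gt[..., 1],
                        gt[..., 2] - x, gt[..., 3] - y], dim=-1)
```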

Conclusion. The core difference between one-stage anchor-based detectors and center-based anchor-free detectors is how positive and negative training samples are defined. This is very important for today's object detection and deserves further study.

4. Adaptive training sample selection method

Training a detector requires first defining positive and negative samples for classification, and then regressing on the positive samples. According to the previous analysis, the former is the key: FCOS is essentially a step forward in this respect, since it introduces a new way of defining positive and negative samples that outperforms the traditional IoU-based one. Inspired by this, the authors revisit a basic problem of object detection, how to define positive and negative training samples, and propose Adaptive Training Sample Selection (ATSS). Compared with the traditional methods, ATSS has almost no hyperparameters and is robust to different settings.

4.1. Description

Previous sample selection strategies have sensitive hyperparameters, such as the IoU thresholds of anchor-based detectors and the scale ranges of anchor-free detectors. Once these hyperparameters are fixed, every ground-truth box selects its positive samples according to fixed rules. Such rules suit most objects, but outlier objects whose size or shape falls outside the assumed rules are neglected, so different hyperparameter settings lead to very different results.

The ATSS method automatically divides positive and negative samples according to the statistical characteristics of each object, with essentially no hyperparameters. Algorithm 1 below describes how ATSS works on an input image. For each ground-truth box g, first find its candidate positive samples: on each pyramid level, select the k anchor boxes whose centers are closest (in L2 distance) to the center of g. With L pyramid levels, each object therefore has k×L candidate positive samples. Then compute the IoU between each candidate and g, and take the mean m_g and standard deviation v_g of these IoUs as statistics of the object; their sum t_g = m_g + v_g is used as the IoU threshold. Candidates whose IoU is greater than t_g and whose centers lie inside the object are selected as positive samples, and everything else is negative. (If an anchor is assigned as a positive sample of multiple ground-truth boxes at the same time, the box with the highest IoU is chosen.)

Algorithm 1. Adaptive Training Sample Selection (ATSS).
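A minimal Python sketch of this assignment procedure is given below. It assumes (x1, y1, x2, y2) boxes, a single image, and `anchors_per_level` given as a list of per-level anchor tensors; it is an illustration of Algorithm 1, not the authors' reference implementation:

```python
import torch
from torchvision.ops import box_iou

def atss_assign(anchors_per_level, gt, k=9):
    """Return, for every anchor, the index of its assigned object (-1 for negatives)."""
    anchors = torch.cat(anchors_per_level)                    # (N, 4)
    centers = (anchors[:, :2] + anchors[:, 2:]) / 2           # (N, 2)
    gt_centers = (gt[:, :2] + gt[:, 2:]) / 2                  # (M, 2)
    iou = box_iou(anchors, gt)                                # (N, M)
    dist = (centers[:, None, :] - gt_centers[None, :, :]).pow(2).sum(-1).sqrt()  # (N, M)

    # Step 1: on each level, pick the k anchors closest to each object's center.
    candidate_idx, start = [], 0
    for level_anchors in anchors_per_level:
        n = level_anchors.size(0)
        topk = min(k, n)
        _, idx = dist[start:start + n].topk(topk, dim=0, largest=False)  # (topk, M)
        candidate_idx.append(idx + start)
        start += n
    candidate_idx = torch.cat(candidate_idx, dim=0)           # (L*k, M)

    # Step 2: adaptive IoU threshold = mean + std of the candidate IoUs, per object.
    cand_iou = iou.gather(0, candidate_idx)                   # (L*k, M)
    thr = cand_iou.mean(dim=0) + cand_iou.std(dim=0)          # (M,)

    # Step 3: keep candidates above the threshold whose center lies inside the object.
    assigned = torch.full((anchors.size(0),), -1, dtype=torch.long)
    best_iou = torch.zeros(anchors.size(0))
    for g in range(gt.size(0)):
        idx = candidate_idx[:, g]
        inside = ((centers[idx, 0] > gt[g, 0]) & (centers[idx, 0] < gt[g, 2]) &
                  (centers[idx, 1] > gt[g, 1]) & (centers[idx, 1] < gt[g, 3]))
        pos = idx[(cand_iou[:, g] >= thr[g]) & inside]
        # Anchors claimed by several objects keep the one with the highest IoU.
        better = iou[pos, g] > best_iou[pos]
        assigned[pos[better]] = g
        best_iou[pos[better]] = iou[pos[better], g]
    return assigned
```

Note that k is the only free parameter; the threshold t_g adapts to each object through the mean and standard deviation of its own candidate IoUs.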

The following is an explanation of the algorithm.

Candidates are selected by the distance between anchor centers and the object center. In both RetinaNet and FCOS, anchors whose centers (anchor-box centers or anchor points) are closer to the object center tend to have larger IoU or higher detection quality, so anchors near the object center are preferred as candidates.

The sum of the mean and standard deviation is used as the IoU threshold. The IoU mean m_g measures how well the preset anchors fit the object: a high mean indicates high-quality candidates, as in Figure 3(a), while a low mean indicates low quality, as in Figure 3(b). The IoU standard deviation v_g measures how many pyramid levels are suitable for detecting the object: a high standard deviation means one specific level fits the object well, and adding it to the mean selects only the anchors on that level, as in Figure 3(a); a low standard deviation means several levels are suitable, and adding it to the mean keeps anchors from all of those levels, as in Figure 3(b). Therefore, using the sum of the mean and standard deviation as the threshold adaptively selects enough positive samples according to the characteristics of each object.

Positive sample centers are limited to the inside of the object. An anchor whose center lies outside the object is a poor candidate, because it would be predicted from features outside the object, which harms training; anchors whose centers are not inside the object are therefore excluded.

Figure 3. Illustration of ATSS, showing the candidate IoUs on each FPN level. (a) A ground-truth object with a high IoU mean m_g and high standard deviation v_g. (b) A ground-truth object with a low IoU mean m_g and low standard deviation v_g.

Fairness between different objects is maintained. By statistical theory, about 16% of samples drawn from a normal distribution fall in the interval [m_g + v_g, 1]. Although the candidate IoUs are not truly normally distributed, empirically each object ends up with about 0.2·kL positive samples. This number is independent of the object's scale, aspect ratio, and position, which keeps different objects relatively fair. In contrast, the strategies of RetinaNet and FCOS tend to give large objects more positive samples, leading to unfairness between objects of different sizes.
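As a back-of-the-envelope check of the 16% figure (purely illustrative, assuming a candidate IoU X were exactly normally distributed as N(m_g, v_g²), with Φ denoting the standard normal CDF):

```latex
P\left(X \ge m_g + v_g\right) \;=\; 1 - \Phi(1) \;\approx\; 0.16
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\#\text{positives}\bigr] \;\approx\; 0.16 \cdot kL \;\approx\; 0.2\,kL
\quad \text{(roughly 7--9 for } k = 9,\ L = 5\text{).}
```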

Almost hyperparameter-free. The method has only one hyperparameter, k, and since it is insensitive to the value of k, ATSS can be regarded as almost hyperparameter-free.

4.2. Verification

Anchor-based RetinaNet. ATSS replaces the traditional assignment method in RetinaNet (#A=1) to verify its effectiveness for anchor-based detectors. As shown in Table 3, ATSS improves RetinaNet across the board: AP by 2.3%, AP50 by 2.4%, AP75 by 2.9%, and so on. These gains come mainly from adaptively selecting positive samples according to the statistical characteristics of each object. Since only the sample selection is changed, the improvement comes at no additional cost.

Anchor-free FCOS. ATSS can be applied to FCOS in two versions, a lite version and a full version. The lite version applies part of the ATSS idea to FCOS by changing how candidate positive samples are chosen: FCOS originally treats every anchor point inside a ground-truth box as a candidate, which produces many low-quality positives, whereas the lite version selects, on each pyramid level, only the k=9 anchor points closest to the object center as candidates, leaving the rest of FCOS unchanged. This lite version, also called center sampling, raises FCOS from 37.8% to 38.6% AP, but it still keeps the scale-range hyperparameters.

The full version, i.e., ATSS itself, temporarily treats each anchor point of FCOS as a square anchor box of size 8S (S being the stride of its level) in order to define positive and negative samples, and then regresses the object from the anchor point exactly as FCOS does. As shown in Table 3, the accuracy improves considerably across all metrics, and the full version performs best, showing that ATSS selects samples better than the FCOS strategy does.

Table 3. Verification of the effectiveness of ATSS; ATSS and center sampling denote the full and lite versions respectively.

4.3. Analysis

Hyperparameter k. Analyzing the influence of different values of k on detection accuracy shows that both too large and too small values hurt performance, while intermediate values work best. Overall, k has little influence on the result, so ATSS is nearly hyperparameter-free.

Table 4. Results of different values of the hyperparameter k on MS COCO minival.

Anchor size. The experiments above use square anchors of size 8S, where S is the stride of the pyramid level. Varying the anchor size (Table 5) shows that ATSS is robust to this setting, which has little effect on performance.

Table 5. Different anchor scale settings, all with a 1:1 aspect ratio (square anchors).

4.4. Comparison

ATSS is compared with other state-of-the-art detectors on the MS COCO test-dev set. The experiments use the same multi-scale training strategy as RetinaNet and FCOS: the shorter side of each image is randomly resized to a scale between 640 and 800. In addition, the number of iterations is doubled to 180k, with the learning-rate decay points moved to 120k and 160k accordingly; the other settings are identical to those of the two detectors above.

As shown in Table 8, the method achieves 43.6% AP with ResNet-101, better than all other detectors using the same backbone. Switching to larger backbones such as ResNeXt-32x8d-101 and ResNeXt-64x4d-101 further improves accuracy, and the experiments also combine ATSS with deformable convolutions (DCN).

Table 8. Comparison with state-of-the-art detectors on MS COCO test-dev.

4.5. Discussion

Is it necessary to tile multiple anchors per location? Under the traditional IoU-based assignment it is: the original RetinaNet with 9 anchors per location outperforms RetinaNet (#A=1). Under ATSS, however, it is not: as long as positive samples are selected appropriately, the results are the same regardless of how many anchors are tiled at each location.

5. Conclusion

The paper points out that how positive and negative samples are selected during training is critical for object detection, and that this selection is the core difference between one-stage anchor-based detectors and center-based anchor-free detectors. ATSS is proposed to address it; it bridges the gap between anchor-based and anchor-free detectors, achieves state-of-the-art results, and shows that with ATSS it is unnecessary to tile multiple anchors at each location.

6. Reference

https://arxiv.org/abs/1912.02424
