Introduction to ATSS

I read the YOLOv7 paper a while ago and noticed that the hot topic in object detection in recent years seems to be not the network architecture but the strategy for assigning positive and negative samples, so I revisited a popular CVPR 2020 paper: ATSS.

Original title: Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection

ATSS is short for Adaptive Training Sample Selection.

Paper address: https://arxiv.org/pdf/1912.02424.pdf


Table of contents

1. Introduction to the background of the paper:

2. Experimental details

3. The ATSS method proposed by the author


1. Introduction to the background of the paper:

        In the past, mainstream object detection algorithms were anchor-based (whether one-stage or two-stage). In recent years, anchor-free algorithms have become increasingly popular and are claimed to be more accurate than their anchor-based counterparts. So where exactly does the gap between them lie, and why does it exist?

        This paper studies exactly this question and finds that the essential difference between anchor-free and anchor-based algorithms (the cause of the accuracy gap) is how positive and negative training samples are defined.

        The authors propose an adaptive sample selection method (Adaptive Training Sample Selection, ATSS), which significantly improves the performance of both anchor-based and anchor-free detectors. The conclusion is therefore that it is the sample selection method that drives the improvement in accuracy.

        In addition, the paper leaves an open question: it finds that tiling multiple anchors at each position on the image is unnecessary (one is enough), but does not explain why; this remains to be studied.

2. Experimental details

The authors use the representative anchor-based algorithm RetinaNet and the anchor-free algorithm FCOS to conduct comparative experiments exploring their differences.

First of all, there are three main differences between RetinaNet and FCOS:

  • The number of anchors per position differs. RetinaNet places 9 anchors at each position; FCOS uses only 1 point.
  • The positive/negative sample assignment strategies differ. RetinaNet assigns by the MaxIoU between anchors and ground truth; FCOS assigns by whether a point falls inside the gt box (spatial constraint) and by the regression range of each FPN feature level (scale constraint).
  • The regression strategies differ. RetinaNet regresses the deltas between the predicted bbox and the anchor; FCOS directly regresses the distances from the point to the four sides of the predicted bbox.

In addition to these three main differences, FCOS also uses five tricks that RetinaNet does not (see the FCOS paper for details):

  • GroupNorm. FCOS uses GN instead of BN in the head's convolutional layers.
  • GIoU Loss. FCOS adopts GIoU Loss as the regression (localization) loss function.
  • In GT Box. FCOS restricts positive samples to lie within the gt box.
  • Centerness. FCOS predicts an extra centerness branch.
  • Scalar. FCOS adds a learnable scale parameter when regressing location predictions.

For fairness, the authors give RetinaNet only one square anchor box per position and apply to it the same five tricks used in FCOS; after this alignment, the accuracy gap between the fine-tuned RetinaNet and FCOS is only about 0.8%.

The authors then find that the remaining difference between the two actually lies in two aspects: how positive and negative samples are defined, and the regression method.

Through comparative experiments, the authors find that the difference in the definition of positive and negative samples is the main cause of the accuracy gap.

FCOS first uses the spatial constraint to find candidate positive samples in the spatial dimension, and then uses the scale constraint to select the final positive samples in the scale dimension.

RetinaNet uses IoU alone to select the final positive samples across both the spatial and scale dimensions simultaneously; FCOS's two-step approach performs better.
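The two positive-sample definitions above can be sketched in a few lines. This is a simplified illustration with my own function names and thresholds (e.g. `pos_thr=0.5`), not the papers' actual code:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def retinanet_positive(anchor, gt, pos_thr=0.5):
    """RetinaNet-style: an anchor is positive if its IoU with gt
    exceeds a fixed threshold (MaxIoU assignment, simplified)."""
    return iou(anchor, gt) >= pos_thr

def fcos_positive(point, gt, scale_range):
    """FCOS-style: a point is positive if it falls inside gt (spatial
    constraint) and its max regression distance lies in this FPN
    level's range (scale constraint)."""
    x, y = point
    l, t = x - gt[0], y - gt[1]
    r, b = gt[2] - x, gt[3] - y
    if min(l, t, r, b) <= 0:          # point falls outside gt
        return False
    return scale_range[0] <= max(l, t, r, b) < scale_range[1]
```

Note how RetinaNet's rule depends only on IoU, while FCOS's rule splits the decision into a spatial test and a scale test, which is exactly the difference the experiments isolate.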

The regression method does not affect accuracy. FCOS regresses from the center point, predicting the four distances from the point to the four sides of the gt box.

RetinaNet regresses from the anchor box, predicting four offsets (two for the offset between the anchor center and the gt center, and two for the scaling of the anchor's width and height).
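The two regression parameterizations can be written down concretely. A minimal sketch (simplified: real implementations typically also normalize the deltas and the FCOS distances by stride or variance factors):

```python
import math

def retinanet_targets(anchor, gt):
    """Encode gt relative to an anchor box as (dx, dy, dw, dh), the
    standard delta parameterization used by RetinaNet: center offsets
    normalized by anchor size, plus log width/height scaling."""
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))

def fcos_targets(point, gt):
    """Encode gt as the distances (l, t, r, b) from a point to the
    four sides of the gt box, as FCOS does."""
    x, y = point
    return (x - gt[0], y - gt[1], gt[2] - x, gt[3] - y)
```

Since both encodings can represent the same box exactly, it is plausible that swapping them changes little, matching the experimental finding above.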

Finally, consider the impact of the number of anchors; the experimental conclusion is as follows. For the MaxIoU-based positive/negative sample assignment strategy, the number of anchors has a significant impact on accuracy: increasing it from 1 to 9 per position brings a 1.8% accuracy improvement. This is easy to understand: with too few anchors, some gt boxes may not be assigned any positive sample, reducing accuracy.

However, if FCOS's spatial/scale-constraint assignment strategy or the ATSS strategy proposed below is adopted, the number of anchors has a negligible effect on accuracy.

3. The ATSS method proposed by the author

The authors propose the ATSS assignment strategy. Specifically, it consists of the following steps:

1. For each ground-truth box g on the image, in each level of the feature pyramid, select the k candidate positive samples (anchor boxes) whose centers are closest to the center of g by L2 distance; with an L-level feature pyramid, a total of L*k candidate boxes are selected.

2. Compute the IoU between each candidate box and g, then compute the mean and standard deviation of this set of IoU values.

3. Use the sum of the mean and the standard deviation as the IoU threshold, and select the candidate boxes whose IoU exceeds this threshold as the final positive samples.

4. If an anchor box is assigned to multiple ground-truth boxes, keep the assignment with the highest IoU. All remaining anchors are negative samples. In addition, the center of a positive sample must lie inside its gt box; this is a precondition for being a positive sample, otherwise the anchor is excluded.
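The steps above (for a single gt box, so step 4's multi-gt tie-breaking is omitted) can be sketched in NumPy. Function and variable names are my own; this is an illustrative sketch, not the official implementation:

```python
import numpy as np

def box_iou(anchors, gt):
    """IoU of N anchors (N, 4) with one gt box (4,), all in (x1, y1, x2, y2)."""
    x1 = np.maximum(anchors[:, 0], gt[0]); y1 = np.maximum(anchors[:, 1], gt[1])
    x2 = np.minimum(anchors[:, 2], gt[2]); y2 = np.minimum(anchors[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def atss_assign(anchors_per_level, gt, k=9):
    """Return indices (into the concatenated anchor list) of the
    positive anchors for one gt box, following the four ATSS steps."""
    all_anchors = np.concatenate(anchors_per_level, axis=0)
    gt_center = np.array([(gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2])

    # Step 1: per FPN level, pick the k anchors whose centers are closest
    # (L2 distance) to the gt center -> L*k candidates in total.
    candidates, offset = [], 0
    for level in anchors_per_level:
        centers = np.stack([(level[:, 0] + level[:, 2]) / 2,
                            (level[:, 1] + level[:, 3]) / 2], axis=1)
        dist = np.linalg.norm(centers - gt_center, axis=1)
        candidates.append(np.argsort(dist)[:k] + offset)
        offset += len(level)
    candidates = np.concatenate(candidates)

    # Step 2: IoU of every candidate with gt, plus mean and std of that set.
    ious = box_iou(all_anchors[candidates], gt)
    thr = ious.mean() + ious.std()      # Step 3: adaptive IoU threshold

    # Step 4 (center constraint): a positive anchor's center must lie inside gt.
    cx = (all_anchors[candidates, 0] + all_anchors[candidates, 2]) / 2
    cy = (all_anchors[candidates, 1] + all_anchors[candidates, 3]) / 2
    inside = (cx > gt[0]) & (cx < gt[2]) & (cy > gt[1]) & (cy < gt[3])

    return candidates[(ious >= thr) & inside]
```

In a full detector this would loop over all gt boxes and resolve anchors claimed by several of them by keeping the highest-IoU assignment, as step 4 describes.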

The authors also analyze the motivation behind each of these steps:

1. Why select candidates by the L2 distance to the gt center?

In RetinaNet, anchors with larger IoU tend to be closer to the gt center; similarly, in FCOS, points closer to the gt center produce higher-quality predicted boxes.

2. Why use the sum of the mean and standard deviation as the IoU threshold?

The IoU mean and standard deviation over a gt's candidate anchors determine that gt's adaptive IoU threshold. This is the key idea in ATSS and the origin of the word Adaptive: in one sentence, the threshold is an IoU threshold derived from statistics. Intuitively, a high mean means the candidates are of high quality overall, so the bar should be high; a high standard deviation means some feature level suits this gt much better than the others, and adding it to the mean pushes the threshold up to select candidates from that level.
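A tiny worked example of the adaptive threshold, using made-up IoU values for one gt's candidates:

```python
import statistics

# Hypothetical IoUs of one gt's L*k candidate anchors: two good, two poor.
ious = [0.9, 0.8, 0.2, 0.1]

mean = statistics.fmean(ious)    # 0.5
std = statistics.pstdev(ious)    # population std, ~0.354
thr = mean + std                 # adaptive threshold, ~0.854

# Only candidates at or above the adaptive threshold become positives.
positives = [i for i in ious if i >= thr]   # [0.9]
```

With a tighter cluster of IoUs the std shrinks and the threshold drops toward the mean, so more candidates survive; this is how the threshold adapts per gt rather than being fixed at, say, 0.5.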

3. Why must the center of a positive sample lie inside the gt box?

This borrows the spatial constraint from FCOS's assignment strategy: the point is required to fall inside the gt box, because a point outside the gt extracts image features from the wrong region.

Aside: k is the only hyperparameter in ATSS, and follow-up experiments show that the setting of k has little effect on accuracy!


Origin: blog.csdn.net/jiangqixing0728/article/details/126411467