ATSS paper reading report

Object Detection | Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection

Interpretation: Undergraduate Huang Xun

Contact email: [email protected]

Paper address: https://arxiv.org/pdf/1912.02424.pdf

Open source code: https://github.com/sfzhang15/ATSS

Overview of the author and research team:

Team members include Zhang Shifeng, Lei Zhen, Yao Yongqiang, Li Ziqing, and Cheng Chi. The team's research interests include machine learning and pattern recognition, with an emphasis on general object detection, face detection, pedestrian detection, and video detection.


Content directory:

  1. Introduction

  2. Specific research content

  3. Experiment

  4. Conclusion


1. Introduction

1. The main content of the paper

The paper covers two major aspects:

1. A detailed comparative experiment examines the performance gap between anchor-based and anchor-free object detection algorithms.

2. The ATSS method is proposed to define positive and negative samples.

2. Main contribution

  1. Shows that the essential difference between anchor-based and anchor-free detectors is how positive and negative training samples are defined.

  2. Proposes ATSS, which automatically selects positive and negative training samples according to the statistical characteristics of the object.

  3. Shows that placing multiple anchors at each location on the image to detect objects is a useless operation (when ATSS is used).

2. Specific research content

#### Anchor-based and anchor-free

The anchor-free (FCOS) and anchor-based (RetinaNet) methods are shown in the figure above. The structures of the two methods are very similar, and there are three main differences between them:

  1. The number of anchors at each location. RetinaNet tiles several anchor boxes at each location, while FCOS tiles one anchor point at each location;
  2. The definition of positive and negative samples. RetinaNet uses IoU thresholds, while FCOS uses spatial and scale constraints;
  3. The regression starting state. RetinaNet regresses the object bounding box from a preset anchor box, while FCOS locates the object from an anchor point.

#### Why does anchor-free (FCOS) perform significantly better than anchor-based (RetinaNet)?
The paper first identifies several improvements that FCOS uses but RetinaNet does not, so that these factors can be ruled out as the cause of the gap. The main ones are:

  • adding GroupNorm in heads
  • using the GIoU regression loss
  • limiting positive samples in the ground-truth box
  • introducing the centerness branch
  • adding a trainable scalar for each feature pyramid level
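
Two of the items in this list, GIoU and centerness, have compact closed forms. The following is a minimal sketch of their standard definitions, not code from the ATSS repository; the function names and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
import math

def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both a and b.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def centerness(l, t, r, b):
    """FCOS-style centerness from the distances of a point to the four
    sides of its ground-truth box (all distances assumed positive)."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

GIoU equals 1 for identical boxes and becomes negative for boxes that do not overlap, which is why it still gives a useful regression signal when plain IoU is zero. Centerness is 1 at the box center and decays toward the edges.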

The paper adds these factors to RetinaNet one by one and compares the results. Once the influence of these factors is excluded, FCOS is only 0.8 AP points higher than RetinaNet.
#### Contribution of the three main differences between anchor-free and anchor-based
##### 1. The number of anchors at each location
RetinaNet tiles one or more anchors at each location, while FCOS tiles one anchor point at each location. This difference is set aside at first: the number of anchors is fixed to 1 (#A=1) for both FCOS and RetinaNet, and whether the number of anchors has an impact is discussed at the end.
##### 2. The definition of positive and negative samples

RetinaNet defines positive and negative samples with IoU thresholds, while FCOS uses spatial and scale constraints.

Experimental results of the two networks under the two ways of defining positive and negative samples:

Findings:

When RetinaNet uses spatial and scale constraints (Box + Spatial and Scale Constraint) instead of its original method (Box + Intersection over Union), its AP is 0.8 points higher;
when FCOS uses IoU (Point + Intersection over Union) instead of its original method (Point + Spatial and Scale Constraint), its AP is 0.9 points lower.

Conclusion:

How positive and negative samples are defined is the essential difference between anchor-free and anchor-based.
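
The two definitions compared above can be sketched side by side. This is an illustrative simplification for a single sample and a single ground-truth box. The thresholds 0.5 and 0.4 and the idea of a per-level scale range come from the original RetinaNet and FCOS papers, not from this report, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def label_by_iou(anchor, gt, pos_thr=0.5, neg_thr=0.4):
    """RetinaNet-style rule: label an anchor box by its IoU with the ground truth."""
    v = iou(anchor, gt)
    if v >= pos_thr:
        return "positive"
    if v < neg_thr:
        return "negative"
    return "ignored"  # samples between the two thresholds are not trained on

def label_by_spatial_scale(point, gt, scale_range):
    """FCOS-style rule: the point must lie inside the ground-truth box
    (spatial constraint) and its largest regression distance must fall in
    the range assigned to this pyramid level (scale constraint)."""
    x, y = point
    l, t, r, b = x - gt[0], y - gt[1], gt[2] - x, gt[3] - y
    if min(l, t, r, b) <= 0:
        return "negative"
    lo, hi = scale_range
    return "positive" if lo <= max(l, t, r, b) <= hi else "negative"
```

For example, `label_by_spatial_scale((50, 50), (0, 0, 100, 100), (0, 64))` treats the point as positive, while the same point on a level with range `(64, 128)` is negative. Both rules depend on hand-set constants, which is the dependence ATSS aims to remove.
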
##### 3. The regression starting state
RetinaNet regresses the object bounding box from a preset anchor box, while FCOS locates the object from an anchor point.

Experimental results of the two networks under the two regression starting states:

Findings:
When RetinaNet regresses from an anchor point instead of an anchor box (Point + Intersection over Union), its AP is 0.1 points lower than with its original method (Box + Intersection over Union);
when FCOS regresses from a preset anchor box (Box + Spatial and Scale Constraint), its AP is unchanged from its original method (Point + Spatial and Scale Constraint).

Conclusion:
The regression starting state is an irrelevant difference.
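
To make the two starting states concrete, the sketch below decodes a predicted box in both ways. It assumes the common parameterizations: center and log-size offsets relative to an anchor box, and four side distances relative to a point. The function names are hypothetical, and real implementations usually add normalization constants or stride scaling that are omitted here.

```python
import math

def decode_from_box(anchor, deltas):
    """Box start state: apply (dx, dy, dw, dh) offsets to an anchor box."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    dx, dy, dw, dh = deltas
    cx, cy = acx + dx * aw, acy + dy * ah        # shift the center
    w, h = aw * math.exp(dw), ah * math.exp(dh)  # rescale width and height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def decode_from_point(point, distances):
    """Point start state: apply (l, t, r, b) distances to an anchor point."""
    x, y = point
    l, t, r, b = distances
    return (x - l, y - t, x + r, y + b)
```

Both calls `decode_from_box((0, 0, 10, 10), (0, 0, 0, 0))` and `decode_from_point((5, 5), (5, 5, 5, 5))` describe the same box, which gives some intuition for why the starting state has little effect on AP.
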
#### ATSS (Adaptive Training Sample Selection)
Previous sample selection strategies rely on hyperparameters, such as the IoU thresholds in anchor-based detectors. Once these hyperparameters are set, every ground-truth box must select its positive samples according to the same fixed rules. These rules suit most objects, but some objects will be neglected.
For this reason, the ATSS method is proposed, which has almost no hyperparameters.

Original ATSS algorithm:

The general process of the algorithm:

  1. For each ground-truth box and each output detection level, compute the L2 distance between the center of each anchor and the center of the ground-truth box, and select the k anchors whose centers are closest as candidate positive samples
  2. Compute the IoU between each candidate positive sample and the ground-truth box, then compute the mean and standard deviation of this group of IoUs
  3. Set the threshold for selecting positive samples from these statistics: t = m + v, where m is the mean and v is the standard deviation
  4. Using the threshold t of each ground-truth box, select from the candidates the positive samples that are actually used in training (IoU >= t)
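
The steps above can be sketched for a single ground-truth box as follows. This is a simplified pure-Python illustration, not the code from the ATSS repository. The names `atss_positives` and `levels` are hypothetical. The sketch uses the population standard deviation, which is an assumption because the report does not say which variant is used. It also includes the check that a positive's center lies inside the ground-truth box, which appears in the paper's algorithm listing but not in the summary above.

```python
import math
from statistics import mean, pstdev

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def center(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def atss_positives(levels, gt, k=9):
    """levels: one list of anchor boxes per pyramid level.
    Returns (positive anchors, IoU threshold) for the ground-truth box gt."""
    gcx, gcy = center(gt)

    def dist(anchor):
        cx, cy = center(anchor)
        return math.hypot(cx - gcx, cy - gcy)

    # Step 1: per level, keep the k anchors whose centers are closest to gt.
    candidates = []
    for anchors in levels:
        candidates.extend(sorted(anchors, key=dist)[:k])

    # Steps 2 and 3: threshold = mean + standard deviation of candidate IoUs.
    ious = [iou(c, gt) for c in candidates]
    thr = mean(ious) + pstdev(ious)

    # Step 4: keep candidates above the threshold whose center is inside gt.
    positives = []
    for c, v in zip(candidates, ious):
        cx, cy = center(c)
        inside = gt[0] < cx < gt[2] and gt[1] < cy < gt[3]
        if v >= thr and inside:
            positives.append(c)
    return positives, thr
```

With `levels = [[(0, 0, 10, 10), (5, 0, 15, 10), (20, 20, 30, 30)]]` and `gt = (0, 0, 10, 10)`, only the first anchor is selected. A full implementation also has to resolve anchors that qualify for several ground-truth boxes; the paper assigns such an anchor to the box with the highest IoU.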

Algorithm motivation:

  1. The L2 distance is used to find the k candidates closest to the center because, for RetinaNet, the closer the anchor center is to the object center, the larger the IoU; for FCOS, the closer the anchor point is to the object center, the higher the detection quality. Anchors closer to the object center are therefore better candidate positive samples.
  2. The mean and standard deviation of the candidates' IoUs are used because taking the sum of the mean mg and the standard deviation vg as the IoU threshold tg makes it possible to adaptively select enough positive samples for each object from the appropriate feature pyramid levels, according to the statistical characteristics of the object.
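
A small numeric illustration of the second point, using made-up IoU values rather than data from the paper. The helper name `atss_threshold` is hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def atss_threshold(ious):
    """Threshold tg = mg + vg over the candidate IoUs of one object."""
    return mean(ious) + pstdev(ious)

# Object A: candidates from one pyramid level fit well and the rest fit
# poorly, so the IoUs are widely spread.
ious_a = [0.1, 0.1, 0.1, 0.1, 0.6, 0.7]
# Object B: candidates from several levels fit about equally well, so the
# IoUs are tightly clustered.
ious_b = [0.3, 0.35, 0.4, 0.4, 0.45, 0.5]

thr_a = atss_threshold(ious_a)
thr_b = atss_threshold(ious_b)
selected_a = [v for v in ious_a if v >= thr_a]
selected_b = [v for v in ious_b if v >= thr_b]
```

Object A has the lower mean, yet its large standard deviation lifts the threshold above the poorly matching cluster, so only the two candidates from the well-matching level survive. For object B the threshold stays close to the mean. No fixed IoU threshold is set by hand in either case.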

3. Experiment

#### 1. The proposed method is verified on the MS COCO minival set. ATSS and Center sampling (which uses ATSS only to select candidate positive samples) are the full and lite versions of the proposed method:

#### 2. Experiments with different numbers of anchors per location on the MS COCO minival set:
(#sc and #ar are the numbers of anchor scales and aspect ratios; Imprs denotes the five improvements mentioned earlier)

Findings:
Compared with the earlier (#A=1) results, RetinaNet (#A=9) with the improvements performs better than RetinaNet (#A=1) when ATSS is not used.
With ATSS, however, the conclusion is reversed: placing multiple anchors at each location becomes a useless operation.

#### 3. Experimental results for different values of the hyperparameter k on the MS COCO minival set:

The experimental results show that k=9 gives a slightly higher AP than the other values of k tested, so k defaults to 9 in the algorithm.
The results also show that the method is insensitive to k, so it can be regarded as nearly hyperparameter-free.

#### 4. The final experimental results on the MS COCO test-dev set are shown in the table:

The various AP metrics of every network improve when ATSS is used.

4. Conclusion

  1. The paper points out that the essential difference between anchor-free and anchor-based detectors is the definition of positive and negative training samples. This shows that how positive and negative samples are chosen during training is very important.
  2. Inspired by this, the authors studied this basic problem in depth and proposed an adaptive training sample selection method, which automatically divides positive and negative training samples according to the statistical characteristics of the object (the mean and standard deviation of IoU), reducing the influence of hyperparameters in both anchor-free and anchor-based detectors.
  3. The authors discussed the necessity of tiling multiple anchors at each location and showed that it is a useless operation in the current setting.
  4. Extensive experiments on the challenging MS COCO benchmark show that the proposed method achieves state-of-the-art performance without introducing any additional overhead.


Origin blog.csdn.net/ylwhxht/article/details/106358933