[FCOS] Anchor-Free: FCOS paper summary and some details


Paper: https://arxiv.org/abs/1904.01355
GitHub: source code

1. Overview

1.1 Model classification

Generally speaking, object detection models can be divided into two-stage and one-stage models, or into anchor-based and anchor-free models, plus the recently popular Transformer-based models such as DETR.
Two-stage models: RCNN, Fast RCNN, Faster RCNN, Mask RCNN, etc.;
One-stage models: the YOLO series, SSD, RetinaNet, etc.;
Anchor-based models: Faster RCNN, SSD, YOLO V2 and later versions, etc.;
Anchor-free models: YOLO V1, FCOS, CornerNet, CenterNet, etc.;
Transformer-based models: DETR, Deformable DETR, DINO, etc.

1.2 Disadvantages of anchor-based models

Although these anchor-based models achieved strong, even SOTA, results at the time, the author points out several shortcomings:
1. Anchor design matters a great deal, both the sizes (side lengths) and the ratios (aspect ratios), because it directly affects model performance; it effectively introduces a set of very important hyperparameters.
2. Even carefully designed anchors are not universal: different datasets need different anchor designs.
3. The model generates tens of thousands of anchors (see the back-of-the-envelope count after this list), most of which are negative samples, causing a severe imbalance between positive and negative samples (custom sampling can mitigate this, but it adds complexity).
4. Positive and negative samples are distinguished by the IoU between anchors and ground-truth (GT) boxes, and all of this IoU computation is cumbersome.
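
To get a feel for the scale in point 3, here is a back-of-the-envelope sketch (my own numbers, assuming a RetinaNet-style configuration with 9 anchors per location on FPN levels P3-P7; not taken from the paper):

# Back-of-the-envelope anchor count (a sketch, not from the paper):
# RetinaNet-style detector, 9 anchors per location on FPN levels P3-P7.
img_h, img_w = 800, 1024
strides = [8, 16, 32, 64, 128]        # strides of P3..P7
anchors_per_loc = 9                   # 3 scales x 3 aspect ratios

total = 0
for s in strides:
    fh, fw = img_h // s, img_w // s   # feature-map size at this level
    total += fh * fw * anchors_per_loc

print(total)  # 153360 anchors, the vast majority of which are negatives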

1.3 Motivation

Motivation: in response to the shortcomings above, and inspired by the success of FCNs (fully convolutional networks) on dense per-pixel prediction tasks such as semantic segmentation, the author asks whether object detection can also be solved in a per-pixel fashion, in the style of an FCN for semantic segmentation. These per-pixel points are the locations this article keeps referring to.


1.4 Some thoughts: why is anchor design important?

To be honest, I never really understood what an anchor actually means; I just knew it was a rectangular box and that it worked. So what exactly is an anchor?
Take Faster RCNN as an example. In the paper, anchors come in three sizes (side lengths), {128, 256, 512}, and each size has three aspect ratios, {0.5, 1, 2}; the rectangular boxes in the figure below are anchors. Why design these anchors? Because we don't know in advance where objects will appear in the image, we generate tens of thousands of anchors over the image, so that whichever region contains an object, some anchor has a chance of predicting it correctly. Each generated anchor says, in effect, "there may be an object within this rectangle", and the model is asked to judge whether there is.

In other words, an image is divided into thousands of patches of different sizes and proportions, these are fed to the model, and the model judges whether each patch contains an object; if it does, the regression branch adjusts the patch (i.e. the size and position of that anchor).

So anchors live on the original image; they are not generated on the feature map. Anchors play the same role as the candidate regions (about 2000 of them) produced by the selective search algorithm in RCNN, except that anchors are set by hand rather than generated by an algorithm. In effect, we assume that these thousands of anchors may contain objects, and the job of the RPN is to select the anchors that actually do, i.e. the proposals. This is why anchor design matters so much. For example, the GT box of the car in the figure below is very large; predicting it with the three anchor sizes shown is clearly unreasonable. Not only are the sizes wrong, but when selecting positive and negative samples, an anchor generally needs an IoU with the GT greater than 0.7 to count as positive; boxes of these sizes cannot reach that threshold, so not even one positive sample is found, let alone a correct prediction.
(Figure: anchors of the three sizes and three aspect ratios drawn on an example image containing a car whose GT box is much larger than any anchor.)

So anchor design is indeed important, and different datasets really do need different anchor designs. To make this concrete, a minimal sketch of generating these base anchors follows.
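
The sketch below generates the 9 Faster RCNN-style base anchors from the sizes and ratios quoted above (my own simplified code, not the official implementation):

import numpy as np

# Sketch: the 9 base anchors for sizes {128, 256, 512} and aspect
# ratios {0.5, 1, 2}, keeping the area size*size fixed per scale.
sizes = [128, 256, 512]
ratios = [0.5, 1.0, 2.0]              # aspect ratio = h / w

anchors = []
for size in sizes:
    area = float(size * size)
    for ratio in ratios:
        w = np.sqrt(area / ratio)
        h = w * ratio
        # (x0, y0, x1, y1) centered at the origin; a real implementation
        # shifts these to every position on the image grid
        anchors.append([-w / 2, -h / 2, w / 2, h / 2])

print(np.round(np.array(anchors), 1))  # 9 boxes: 3 sizes x 3 aspect ratios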


2. Network structure

2.1 Detection idea

(Figure: left, a location (orange point) inside a GT box with its four distances l, t, r, b; right, a location falling inside two overlapping GT boxes at once.)

After the backbone (e.g. ResNet) produces a feature map, every pixel of the feature map is treated as a point location (x, y), and each location is mapped back onto the original image by a formula (to roughly the center of its receptive field; see Section 3.2). As long as a location falls inside some GT box, it is treated as a positive sample, and four coordinate parameters (l, t, r, b) are predicted for it, representing the distances from the location to the left, top, right, and bottom borders of the box. Combining these four parameters with the location recovers the target box. In the figure above (left), the orange point falls inside the GT box, so it is a positive sample and predicts the four coordinate parameters.

The situation on the right then naturally arises: a location falls inside two GT boxes (yellow and blue) at the same time. Such a location is assigned to the GT box with the smaller area (the blue rectangle), and is called an ambiguous sample. A sketch of this assignment rule is given below.
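
Here is a minimal sketch of that assignment rule (my own illustration, not the official implementation): for each location, keep only the GT boxes containing it, and resolve ambiguity by taking the smallest area.

import numpy as np

def assign_targets(locations, gt_boxes):
    """locations: (N, 2) image-plane points; gt_boxes: (M, 4) as (x0, y0, x1, y1).
    Returns, per location, the index of its GT box (smallest area wins), or -1."""
    xs, ys = locations[:, 0:1], locations[:, 1:2]              # (N, 1)
    # l, t, r, b distances from each location to each GT box edge: (N, M)
    l = xs - gt_boxes[:, 0]
    t = ys - gt_boxes[:, 1]
    r = gt_boxes[:, 2] - xs
    b = gt_boxes[:, 3] - ys
    inside = np.stack([l, t, r, b], -1).min(-1) > 0            # (N, M)

    areas = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    areas = np.where(inside, areas, np.inf)    # mask out boxes not containing the point
    gt_idx = areas.argmin(-1)                  # ambiguous points -> smaller area
    gt_idx[~inside.any(-1)] = -1               # background locations
    return gt_idx

locs = np.array([[50, 50], [10, 10]])
gts = np.array([[0, 0, 100, 100], [30, 30, 70, 70]])   # the second box is smaller
print(assign_targets(locs, gts))                       # [1 0]: ambiguity -> smaller area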

2.2 Network structure

(Figure: the FCOS architecture: backbone, FPN levels P3-P7, and a head shared across levels with classification, center-ness, and regression branches.)
After the backbone outputs its feature maps, the FPN structure produces multi-level feature maps; on top of FPN's P5, P6 is generated by a stride-2 convolutional layer, and P7 by another stride-2 convolution on P6. In total, 5 levels of feature maps are used for prediction, and all of them share a single prediction head. The head has three parts: a classification branch for classification, a center-ness branch that constrains how far a location may drift from the object center, and a regression branch that predicts the 4 regression parameters. A simplified sketch of this head follows.
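
A minimal PyTorch sketch of such a shared head (my own simplification: one conv layer per tower instead of the paper's four; names like FCOSHead are mine):

import torch
import torch.nn as nn

class FCOSHead(nn.Module):
    """Simplified sketch of the shared FCOS head."""
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        self.cls_tower = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU())
        self.reg_tower = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU())
        self.cls_logits = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # classification
        self.centerness = nn.Conv2d(in_ch, 1, 3, padding=1)            # center-ness
        self.bbox_reg = nn.Conv2d(in_ch, 4, 3, padding=1)              # (l, t, r, b)

    def forward(self, feats):  # feats: list of FPN maps P3..P7
        outs = []
        for x in feats:
            cls_x, reg_x = self.cls_tower(x), self.reg_tower(x)
            outs.append((self.cls_logits(cls_x),
                         self.centerness(cls_x),
                         # exp() keeps the regressed distances positive (see 3.2)
                         torch.exp(self.bbox_reg(reg_x))))
        return outs

head = FCOSHead()
p3 = torch.randn(1, 256, 100, 128)      # one FPN level, for a quick shape check
cls, ctr, reg = head([p3])[0]
print(cls.shape, ctr.shape, reg.shape)  # (1, 80, ...), (1, 1, ...), (1, 4, ...)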

3. Details

3.1 Center-ness

Center-ness measures how central a location is. The role of the center-ness branch is to suppress boxes produced by locations far from the object center: since any location inside a GT box counts as a positive sample, a location near the edge of the GT box tends to regress a poor box. The author gives a formula for this centrality:
$$\text{centerness}^{*} = \sqrt{\frac{\min(l^{*},\, r^{*})}{\max(l^{*},\, r^{*})} \times \frac{\min(t^{*},\, b^{*})}{\max(t^{*},\, b^{*})}}$$

where (l*, t*, r*, b*) are the four regression targets of the location.


Visualizing this formula shows that points closer to the center are brighter, i.e. have larger center-ness values. The center-ness branch is trained with a BCE loss added to the loss function, and at test time the predicted center-ness is multiplied onto the classification score, so boxes from off-center locations are down-weighted and the chosen locations are pushed toward the center of the GT box as much as possible.


(Figure: center-ness heatmaps; values approach 1 near the object center and fall toward 0 at the box edges.)

As the heatmaps show, the closer a point is to the center, the larger its center-ness, which constrains the selected locations to lie near the center of the GT box.


Visualization code (for the center-ness heatmap above):


import cv2
import numpy as np
import torch


def calc_center_ness(points, boxes):
    # points: (N, 2) location coordinates; boxes: (N, 4) as (left, top, right, bottom)
    l = points[:, 0] - boxes[:, 0]
    t = boxes[:, 1] - points[:, 1]
    r = boxes[:, 2] - points[:, 0]
    b = points[:, 1] - boxes[:, 3]
    # center-ness = sqrt(min(l, r)/max(l, r) * min(t, b)/max(t, b))
    center_ness = np.sqrt((np.minimum(l, r) / np.maximum(l, r))
                          * (np.minimum(t, b) / np.maximum(t, b)))
    return center_ness


# a 100x100 grid of locations inside a single box with left=0, top=100, right=100, bottom=0
shifts_x = torch.arange(0, 100)
shifts_y = torch.arange(0, 100)
shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x, indexing='ij')
points1 = np.array(torch.stack((shift_y, shift_x), -1)).reshape(-1, 2)
boxes1 = np.array([[0, 100, 100, 0]]).repeat(100 * 100, axis=0)

# the same grid with the axes swapped; summing the two maps symmetrizes the result
points2 = np.array(torch.stack((shift_x, shift_y), -1)).reshape(-1, 2)
boxes2 = np.array([[0, 100, 100, 0]]).repeat(100 * 100, axis=0)

center_ness1 = calc_center_ness(points1, boxes1)
center_ness2 = calc_center_ness(points2, boxes2)

# min-max normalize each map, sum them, then normalize again to [0, 1]
center_ness_norm1 = (center_ness1 - center_ness1.min()) / \
    (center_ness1.max() - center_ness1.min())
center_ness_norm2 = (center_ness2 - center_ness2.min()) / \
    (center_ness2.max() - center_ness2.min())
center_ness = center_ness_norm1 + center_ness_norm2
center_ness_norm = (center_ness - center_ness.min()) / \
    (center_ness.max() - center_ness.min())

# render as a grayscale image: bright = close to the center
center_ness_gray = (center_ness_norm * 255).astype(np.uint8).reshape(100, 100)
cv2.imshow('Center-ness', center_ness_gray)
cv2.imwrite('saved_image.jpg', center_ness_gray)
cv2.waitKey(0)
cv2.destroyAllWindows()

3.2 Loss function

First, how a point location (x, y) on the feature map is mapped back onto the original image: for a feature map with stride s, location (x, y) maps to

$$\Big(\Big\lfloor \frac{s}{2} \Big\rfloor + xs,\ \ \Big\lfloor \frac{s}{2} \Big\rfloor + ys\Big)$$

which places the location close to the center of the receptive field of (x, y).
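
A quick worked example of the mapping (a sketch assuming stride s = 8 and a tiny 4x4 feature map):

import torch

# image_coord = floor(s / 2) + feature_coord * s, near the receptive-field center
s, fh, fw = 8, 4, 4
ys, xs = torch.meshgrid(torch.arange(fh), torch.arange(fw), indexing='ij')
img_x = s // 2 + xs * s
img_y = s // 2 + ys * s
print(img_x[0])  # tensor([ 4, 12, 20, 28]): centers of the stride-8 cells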
Then the target coordinates for the regression branch:

$$l^{*} = x - x_0^{(i)}, \quad t^{*} = y - y_0^{(i)}, \quad r^{*} = x_1^{(i)} - x, \quad b^{*} = y_1^{(i)} - y$$

where x, y are the coordinates of the location and (x_0^{(i)}, y_0^{(i)}), (x_1^{(i)}, y_1^{(i)}) are the top-left and bottom-right corners of the i-th GT box, measured with the top-left corner of the image as the origin.

The loss function is the usual classification loss plus a regression loss. Only positive samples, i.e. points that fall inside a GT box, incur the regression loss:

$$L(\{p_{x,y}\},\{t_{x,y}\}) = \frac{1}{N_{\text{pos}}}\sum_{x,y} L_{\text{cls}}\big(p_{x,y},\, c^{*}_{x,y}\big) + \frac{\lambda}{N_{\text{pos}}}\sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}}\, L_{\text{reg}}\big(t_{x,y},\, t^{*}_{x,y}\big)$$

where t_{x,y} is the predicted four-dimensional vector of coordinate parameters (l, t, r, b), c*_{x,y} is the class label (0 for background), L_cls is the focal loss, and L_reg is the IoU loss. The BCE loss of the center-ness branch is not shown in this formula and is added on top of it.

- Note: since the regressed distances must be positive, the author maps the raw regression outputs through an exponential (exp(s_i x) with a trainable scalar s_i for each feature level).
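
For completeness, a minimal sketch of how these pieces combine (my own simplified composition: plain BCE stands in for the focal loss, a UnitBox-style IoU loss for regression, BCE for center-ness; function names are mine, and the official implementation differs in details):

import torch
import torch.nn.functional as F

def iou_loss(pred, target):
    """pred/target: (N, 4) positive (l, t, r, b) distances at positive locations."""
    pred_area = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    tgt_area = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    # two boxes sharing the same location: intersection from per-side minima
    iw = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 2], target[:, 2])
    ih = torch.min(pred[:, 1], target[:, 1]) + torch.min(pred[:, 3], target[:, 3])
    inter = iw * ih
    iou = inter / (pred_area + tgt_area - inter + 1e-6)
    return -torch.log(iou + 1e-6).mean()

def fcos_loss(cls_logits, reg_pred, ctr_logits, cls_tgt, reg_tgt, ctr_tgt, pos):
    """All tensors flattened over locations; pos is a boolean mask of positives."""
    n_pos = pos.sum().clamp(min=1).float()
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_tgt,
                                                  reduction='sum') / n_pos
    reg_loss = iou_loss(reg_pred[pos], reg_tgt[pos])     # positives only
    ctr_loss = F.binary_cross_entropy_with_logits(ctr_logits[pos], ctr_tgt[pos])
    return cls_loss + reg_loss + ctr_loss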

3.3 The role of FPN

As mentioned above, when a location falls inside two GT boxes at once, it is assigned to the GT box with the smaller area. This ambiguity is largely resolved by multi-level prediction with FPN. Thanks to FPN, feature maps at several levels are used for prediction, and different levels have receptive fields of different sizes: deep feature maps have large receptive fields while shallow ones have small receptive fields, so deep maps suit large objects and shallow maps suit small objects.

The author then cleverly limits the range of coordinate parameters that each feature level may regress. If the regression targets of a location at level i satisfy max(l*, t*, r*, b*) > m_i or max(l*, t*, r*, b*) < m_{i-1}, that location is treated as a negative sample at this level, even though it falls inside the GT box. In the paper, m_2, m_3, m_4, m_5, m_6, m_7 are set to 0, 64, 128, 256, 512 and infinity, bounding P3 through P7.

For example, suppose feature level P5 regresses the four coordinate parameters (l, t, r, b) of a location inside a GT box as (300, 200, 150, 100). The maximum is 300, which exceeds m_5 = 256, the upper bound for the P5 level, so this location is treated as a negative sample, i.e. background, even though it falls inside the GT box.

Similarly, if the four coordinate parameters (l, t, r, b) are (120, 100, 96, 30), the maximum is 120, which is below m_4 = 128, the lower bound for P5 (and the upper bound of the P4 level), so the location is likewise treated as a negative sample at P5.
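
This range check is straightforward to write down (a sketch using the thresholds m_2..m_7 = 0, 64, 128, 256, 512, inf from the paper; the helper name is mine):

# A location is positive at level i only if m_{i-1} <= max(l, t, r, b) <= m_i.
M = [0, 64, 128, 256, 512, float('inf')]   # m_2..m_7, bounding P3..P7

def level_is_positive(ltrb, level):        # level: 0..4 for P3..P7
    m_lo, m_hi = M[level], M[level + 1]
    return m_lo <= max(ltrb) <= m_hi

print(level_is_positive((300, 200, 150, 100), level=2))  # P5: 300 > 256  -> False
print(level_is_positive((120, 100, 96, 30), level=2))    # P5: 120 < 128  -> False
print(level_is_positive((120, 100, 96, 30), level=1))    # P4: in (64, 128] -> True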

This effectively assigns a task to each feature level, so that different levels predict objects of different sizes. The author observed that overlapping objects usually differ substantially in size, so even when a location falls inside two GT boxes at once, the two objects can be handled by different feature levels and both be detected. Only in the remaining extreme case is the location assigned to the GT box with the smaller area (the blue rectangle).
(Figure: overlapping GT boxes of different sizes being handled by different FPN levels.)

4. Experiment

(Figure: experimental results reported in the paper.)

5. Advantages

1. Borrowing the success of FCNs in per-pixel prediction tasks such as semantic segmentation, detection is done at the pixel (point) level, so the network structure can be reused in other tasks such as semantic segmentation.

2. Anchor-free: the drawbacks of anchor-based models disappear, and there is no need to design all kinds of anchors.

3. No complicated anchor-related computation such as IoU matching is needed during training, which saves computation and memory.

4. "we encourage the community to rethink the necessity of anchor boxes in object detection, which are currently considered as the de facto standard for detection." (Are anchors really necessary?)

5. "We believe that this new method can be the new baseline for many instance-wise prediction problems." (A new baseline!)


To sum up:
Anchor-free detection really is concise and efficient, turning the old anchors into pixel-level locations and predicting the coordinate parameters from a single point. The whole paper is clear and detailed, and the architecture diagram can be understood at a glance. Well worth reading!

Origin blog.csdn.net/m0_46412065/article/details/129242060