Paper Reading: FCOS: Fully Convolutional One-Stage Object Detection

1. Paper Overview


Starting from the dense per-pixel prediction idea of FCN for semantic segmentation, this paper proposes a one-stage, anchor-free, fully convolutional object detector. Overall it is similar to several other anchor-free methods, but it also has its own innovations:
1. Positive sample assignment: every point inside a ground-truth (GT) box is used to predict that GT (the dense-prediction idea of FCN), rather than only a subset of points (e.g., the 0.2 effective region and 0.5 ignore region in FSAF).
2. The center-ness map: points closer to an object's center tend to predict higher-quality boxes, so predictions from points far from the center are suppressed and later filtered out by NMS. Without the center-ness map FCOS cannot match RetinaNet; with it, FCOS surpasses anchor-based one-stage detectors.
3. What really allowed anchor-free methods to take off again, though, are FPN and focal loss. Without FPN, FCOS performs poorly; adding FPN roughly doubles its mAP on COCO. FPN lets overlapping objects of different sizes be assigned to different pyramid levels for training. Overlapping objects of similar size are harder to handle; the paper simply lets the ambiguous location predict the smaller object. The paper also shows that overlapping same-size objects of different classes are rare in COCO, so we are not yet at the stage where this case matters: detecting all non-overlapping objects well is already enough to raise mAP.
4. Overall, anchor-free methods may be friendlier to small objects. This is what the paper claims; I have not run experiments to verify it yet and plan to check these methods' actual behaviour later.
5. A point the paper stresses repeatedly is simplicity: the network structure is simple, training needs no complicated box-to-anchor IoU computation or elaborate positive/negative assignment (something I appreciated after reading the Faster R-CNN source code), memory is saved, and there are no hyper-parameters for anchor scales and aspect ratios. It also pushes researchers to rethink whether anchors are really necessary in object detection; maybe they are not!
6. FCOS can serve as the Region Proposal Network (RPN) in a two-stage detector and clearly outperforms its anchor-based RPN counterpart.
7. With minimal modification, FCOS can be extended to other vision tasks, including instance segmentation and keypoint detection.

Without bells and whistles, we achieve state-of-the-art results among one-stage detectors. We also show that the proposed FCOS can be used as a Region Proposal Network (RPN) in two-stage detectors and can achieve significantly better performance than its anchor-based RPN counterparts. Given the even better performance of the much simpler anchor-free detector, we encourage the community to rethink the necessity of anchor boxes in object detection, which are currently considered as the de facto standard for detection.

In this section, we first reformulate object detection in a per-pixel prediction fashion. Next, we show how we make use of multi-level prediction to improve the recall and resolve the ambiguity resulting from overlapped bounding boxes in training. Finally, we present our proposed “center-ness” branch, which helps suppress the low-quality detected bounding boxes and improve the overall performance by a large margin.

2. Assignment of Positive and Negative Samples

Anchor-free detectors differ a lot from anchor-based ones in how positive and negative samples are defined. With anchors, a sample is labeled positive, ignored, or negative according to the IoU between the anchor and the GT boxes; without anchors, how should positives and negatives be defined?

FCOS simply treats every point on the i-th feature map Fi that falls inside a GT box as a positive sample. My impression is that as detectors evolve, positive sample assignment keeps getting denser: YOLOv1 uses only the cell containing the object center to predict the box, which yields too few positives and a low recall; CenterNet then treats a central portion of the GT box as valid samples; FCOS goes one step further and treats all points inside the GT box as positives.

After adding FPN, the author does not follow the FPN paper's formula that assigns GT boxes to pyramid levels by box size. Instead, each feature level is directly given a range of regression offsets it is allowed to predict: for {P3, P4, P5, P6, P7}, the bounds m2, m3, m4, m5, m6 and m7 are set to 0, 64, 128, 256, 512 and ∞. The "offset" here means the distance from a predicting point to each of the four sides of the GT box, which is always positive. More concretely, before training we compute, for every point on each level, the regression offsets it would have to predict; if the maximum of the four offsets at a point exceeds the mi of this level (or is no larger than the bound m(i−1) of the previous level), the GT cannot be predicted at this level, the point becomes a negative sample there (negatives also include every point on the level that falls inside no GT box), and the GT is instead predicted on the feature level whose range it falls into.
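A minimal NumPy sketch of this per-level assignment rule. The function and variable names are my own, not from the official FCOS implementation; it assumes the locations are already mapped back to input-image coordinates (see the mapping described just below) and resolves ambiguity between several GT boxes by keeping the GT with minimal area, as the paper describes.

```python
import numpy as np

# Per-level regression ranges (m_{i-1}, m_i] from the paper:
# P3: (0, 64], P4: (64, 128], P5: (128, 256], P6: (256, 512], P7: (512, inf)
REGRESS_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, float("inf"))]

def assign_targets(locations, gt_boxes, regress_range):
    """Assign GT boxes to the locations of one FPN level.

    locations:     (N, 2) array of (x, y) points in input-image coordinates.
    gt_boxes:      (M, 4) array of (x0, y0, x1, y1) GT boxes in image coordinates.
    regress_range: (lo, hi) allowed range for the maximum regression distance.
    Returns (per-location regression targets, matched GT index, positive mask).
    """
    xs, ys = locations[:, 0:1], locations[:, 1:2]                 # (N, 1)
    l = xs - gt_boxes[None, :, 0]                                 # (N, M) distances from
    t = ys - gt_boxes[None, :, 1]                                 # each point to the four
    r = gt_boxes[None, :, 2] - xs                                 # sides of every GT box
    b = gt_boxes[None, :, 3] - ys
    reg_targets = np.stack([l, t, r, b], axis=-1)                 # (N, M, 4)

    inside_box = reg_targets.min(axis=-1) > 0                     # all four distances > 0
    max_dist = reg_targets.max(axis=-1)
    in_range = (max_dist > regress_range[0]) & (max_dist <= regress_range[1])
    positive = inside_box & in_range                              # (N, M)

    # If a point is positive for several GT boxes, keep the GT with minimal area.
    areas = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    areas = np.where(positive, areas[None, :], np.inf)
    matched_gt = areas.argmin(axis=1)                             # (N,)
    pos_mask = positive.any(axis=1)
    return reg_targets[np.arange(len(locations)), matched_gt], matched_gt, pos_mask
```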


Every point (x, y) on a feature map can be mapped back to the input image. The paper gives the mapping
(⌊s/2⌋ + x·s, ⌊s/2⌋ + y·s), where s is the stride of that level.
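A small sketch of this mapping (the function name is my own):

```python
import numpy as np

def compute_locations(h, w, stride):
    """Map every cell (x, y) of an h-by-w feature map back to the input image:
    (stride // 2 + x * stride, stride // 2 + y * stride)."""
    xs = np.arange(w) * stride + stride // 2
    ys = np.arange(h) * stride + stride // 2
    grid_x, grid_y = np.meshgrid(xs, ys)
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)    # (h * w, 2)

# e.g. locations of P3 (stride 8) for an 800x1024 input: compute_locations(100, 128, 8)
```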

Note: this is where the difference from anchor-based detectors shows up.
In an anchor-based detector, the point mapped back to the input image serves as the center of anchors (say, 9 anchors placed at that location), and the network regresses offsets relative to those anchors. In the anchor-free setting, the feature-map point itself is used directly as a training sample: its own position stands for the predicted x, y, and no offsets relative to an anchor are regressed. The feature-map points are the training samples; there is no anchor encoding step in between.
The paper states this point as follows:

Different from anchor-based detectors, which consider the location on the input image as the center of anchor boxes and regress the target bounding box for these anchor boxes, we directly regress the target bounding box for each location. In other words, our detector directly views locations as training samples instead of anchor boxes in anchor-based detectors, which is the same as in FCNs for semantic segmentation [16] (similar to FCN).

Note: Bi denotes a ground-truth bounding box. I first read it as the box's position on the i-th feature map, but that is wrong; the subscript i indexes the GT boxes of the image, and the coordinates are still in the input image rather than on a feature map, which is why the regression offsets can be large enough for the level bounds to be set to 0, 64, 128, and so on.

Specifically, location (x, y) is considered as a positive sample if it falls into any ground-truth bounding box and the class label c∗ of the location is the class label of Bi. Otherwise it is a negative sample and c∗ = 0 (background class).

Computing the regression targets:
l∗ = x − x0(i),  t∗ = y − y0(i),  r∗ = x1(i) − x,  b∗ = y1(i) − y
where (x0(i), y0(i)) and (x1(i), y1(i)) are the top-left and bottom-right corners of the GT box Bi.

The loss function:

L({p(x,y)}, {t(x,y)}) = (1/Npos) Σ(x,y) Lcls(p(x,y), c∗(x,y)) + (λ/Npos) Σ(x,y) 1{c∗(x,y) > 0} · Lreg(t(x,y), t∗(x,y))

Classification uses focal loss (Lcls), regression uses the IoU loss (Lreg, computed only over positive locations), and t∗ = (l∗, t∗, r∗, b∗) are the regression targets; Npos is the number of positives.
Note: this is only part of the final loss; the binary cross-entropy loss of the center-ness branch described later is added on top.
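As a small illustration of the regression term, here is a sketch of a UnitBox-style −log(IoU) loss for boxes encoded as (l, t, r, b) distances from the same location; the function name and epsilon are my own choices, and focal loss plus the center-ness BCE would be added on top of this term.

```python
import torch

def iou_loss(pred, target, eps=1e-7):
    """-log(IoU) between two boxes that are both encoded as distances
    (l, t, r, b) from the same location; pred and target are (N, 4) tensors."""
    pred_area = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    target_area = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    inter_w = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 2], target[:, 2])
    inter_h = torch.min(pred[:, 1], target[:, 1]) + torch.min(pred[:, 3], target[:, 3])
    inter = inter_w.clamp(min=0) * inter_h.clamp(min=0)
    union = pred_area + target_area - inter
    return -torch.log((inter / union.clamp(min=eps)).clamp(min=eps))
```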

Note: because different feature levels regress different size ranges (e.g., [0, 64] for P3 and [64, 128] for P4), the author argues that using identical output activations on every level is unreasonable. So instead of the standard exp(x), the regression head uses exp(si · x), where si is a trainable scalar for level i that automatically adjusts the base of the exponential for that level; empirically this improves detection performance.
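A sketch of such a per-level trainable scale, in the spirit of common PyTorch FCOS implementations (the class name is my own):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Trainable scalar s_i for one FPN level; the regression output is exp(s_i * x)."""
    def __init__(self, init_value=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return x * self.scale

# One Scale per level, applied to the raw regression map before the exponential:
# scales = nn.ModuleList([Scale(1.0) for _ in range(5)])
# ltrb = torch.exp(scales[level](reg_head_output))
```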

3. One Possible Reason Why FCOS Beats Anchor-Based Methods

It is worth noting that FCOS can leverage as many foreground samples as possible to train the regressor. It is different from anchor-based detectors, which only consider the anchor boxes with a highly enough IOU with ground-truth boxes as positive samples. We argue that it may be one of the reasons that FCOS outperforms its anchor-based counterparts.
In other words, although each location predicts only one box, the number of samples treated as positive during training (every point inside a GT box) is much larger than in anchor-based methods, where an anchor counts as positive only if its IoU with a GT box exceeds a threshold; as a result most anchors are negatives and never take part in the regression training.

4. Center-ness for FCOS

centerness∗ = sqrt( (min(l∗, r∗) / max(l∗, r∗)) × (min(t∗, b∗) / max(t∗, b∗)) )    (Eq. 3 in the paper)
Equation 3 above is the label of the center-ness branch, i.e., what this branch has to learn; it is trained with a binary cross-entropy (BCE) loss. Adding this branch improves FCOS's performance by a large margin.

At test time, the center-ness score is multiplied with the corresponding classification score, which suppresses the low-quality boxes predicted by points far away from an object's center.
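A sketch of the center-ness target and how it is used at inference, assuming the regression targets are stacked in a (..., 4) tensor (the helper name is my own):

```python
import torch

def centerness_target(reg_targets):
    """Center-ness label from the regression targets (l*, t*, r*, b*), Eq. 3:
    sqrt( min(l, r) / max(l, r) * min(t, b) / max(t, b) )."""
    l, t, r, b = reg_targets.unbind(dim=-1)
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(lr * tb)

# Training: BCE between the predicted center-ness and this target (positives only).
# Inference: rank boxes by cls_score * centerness_score before NMS, so that
# low-quality boxes from off-center points are suppressed.
```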

Note: the predicted center-ness is still not accurate enough; the author verifies this with an oracle experiment:

To further demonstrate the usefulness of center-ness, we carry out one more experiment. We assume that we had an oracle which provides the ground-truth center-ness score during inference. With keeping all the other settings exactly the same, ground-truth center-ness for inference significantly improves the AP to 42.1, meaning that there is much room for further improving our current accuracy of 36.6 AP as shown in Table 5, as long as we improve the prediction accuracy of the center-ness.

The author also suggests a possible future direction for improving the center-ness map:

In theory we may even train a separate deep network, which does not share any weight with the main detector, with its only purpose being to predict the center-ness score. This is only possible due to the fact that the center-ness score is solely used in inference. Therefore we are able to decouple the training of the center-ness predictor from the training of the detector. This decouple allows us to design the best possible center-ness predictor with the price of extra computation complexity. We also hypothesize that all other detectors, if NMS is needed for post-processing, may be able to benefit from such an accurate center-ness score predictor. We leave this topic for future work.

In short, the idea is to decouple the training of the center-ness map from the main network so that it can be learned better.

5. Common Practice for Using the COCO Dataset

Our experiments are conducted on the large-scale detection benchmark COCO [13]. Following the common practice [12, 11, 20], we use the COCO trainval35k split (115K images) for training and minival split (5K images) as validation for our ablation study. We report our main results on the test-dev split (20K images) by uploading our detection results to the evaluation server.

6. Best Possible Recall (BPR)

Definition of BPR:

Here BPR is defined as the ratio of the number of ground-truth boxes a detector can recall at the most divided by all ground-truth boxes. A ground-truth box is considered being recalled if the box is assigned to at least one sample (i.e., a location in FCOS or an anchor box in anchor-based detectors) during training.

In short: BPR = (the number of GT boxes that can possibly be recalled, i.e., that are assigned to at least one training sample) / (the total number of GT boxes).
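A sketch of how BPR could be measured for FCOS, reusing the hypothetical assign_targets and REGRESS_RANGES helpers from the assignment sketch in Section 2:

```python
import numpy as np

def best_possible_recall(gt_boxes, locations_per_level, regress_ranges):
    """Fraction of GT boxes that get assigned to at least one positive location
    on some FPN level (the definition of BPR quoted above)."""
    recalled = np.zeros(len(gt_boxes), dtype=bool)
    for locations, rng in zip(locations_per_level, regress_ranges):
        _, matched_gt, pos_mask = assign_targets(locations, gt_boxes, rng)
        recalled[np.unique(matched_gt[pos_mask])] = True
    return recalled.mean()
```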

7. Two Minor Differences from RetinaNet

The aforementioned FCOS has two minor differences from the standard RetinaNet. 1) We use Group Normalization (GN) [23] in the newly added convolutional layers except for the last prediction layers, which makes our training more stable. 2) We use P5 to produce the P6 and P7 instead of C5 in the standard RetinaNet. We observe that using P5 can improve the performance slightly.
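A sketch of what these two choices look like in PyTorch; the 256-channel width, 4-conv tower, and 32 GN groups are common defaults and my assumptions here, not quoted from the paper:

```python
import torch.nn as nn

# P6 is produced from P5 (not C5) by a stride-2 3x3 conv; P7 from P6 after a ReLU.
p6_from_p5 = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
p7_from_p6 = nn.Sequential(nn.ReLU(), nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1))

# Head tower: the newly added convs use GroupNorm, but the last prediction layer does not.
def head_tower(num_convs=4, channels=256, groups=32):
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(channels, channels, 3, padding=1),
                   nn.GroupNorm(groups, channels),
                   nn.ReLU()]
    return nn.Sequential(*layers)

cls_tower = head_tower()
cls_logits = nn.Conv2d(256, 80, 3, padding=1)   # final prediction layer, no GN
```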

8. Comparison with CornerNet

Compared to the recent state-of-the-art one-stage detector CornerNet [10], our FCOS also has 0.5% gain in AP. Maybe the gain is relatively small, but our detector enjoys the following advantages over CornerNet. 1) We achieve the performance with a faster and simpler backbone ResNet-101 instead of Hourglass-104 in CornerNet. 2) Except for the standard post-processing NMS in the detection task, our detector does not need any other post-processing. In contrast, CornerNet requires grouping pairs of corners with embedding vectors, which needs special design for the detector. 3) Compared to CornerNet, we argue that our FCOS is more likely to serve as a strong and simple alternative for current mainstream anchor-based detectors.

9. Performance Comparison with Other Detectors

[Table: performance comparison with other detectors on COCO, from the paper]

