Introduction

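RetinaNet, SSD, YOLOv3, and Faster R-CNN all rely on predefined anchor boxes.
FCOS, by contrast, is anchor-free:
- fully convolutional
- per-pixel regression prediction
- multi-scale strategy based on FPN
- center-ness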

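Anchor-based detectors have the following drawbacks: [5]

- Anchors introduce many hyperparameters that need tuning, such as the number, sizes, and aspect ratios of the anchors.
- High recall requires a large number of anchors, most of which are negatives, so positive and negative samples are imbalanced.
- During training, the IoU between every anchor box and the ground-truth boxes has to be computed, which is computationally expensive.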

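FCOS contributions: [1]

- Solves object detection per pixel, borrowing the idea from semantic segmentation.
- Drops the anchor boxes and object proposals common in object detection, so none of the hyperparameters tied to them need tuning.
- Avoids computing the IoU between GT boxes and anchor boxes during training, which lowers the training memory footprint.
- FCOS can replace the RPN in two-stage detectors, with better performance.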

Network

[Backbone] + [Feature Pyramid] + [Classification + Center-ness + Regression]

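Location regression
- anchor-based
  - each point $(x, y)$ on the RPN feature map serves as the center of 9 generated anchors
  - the point is mapped back to the input image at $(\lfloor s / 2 \rfloor + x s, \lfloor s / 2 \rfloor + y s)$, where $s$ is the stride of the feature map; the bounding box is regressed from the anchors centered there
- FCOS
  - takes each point $(x, y)$ on the feature map directly
  - maps it back to the input image at $(\lfloor s / 2 \rfloor + x s, \lfloor s / 2 \rfloor + y s)$
  - uses the point itself as a training sample and regresses the bounding box from it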
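
The mapping from a feature-map location to image coordinates can be sketched as follows. This is a minimal illustration; the function name `feature_to_image_xy` is hypothetical and not taken from any FCOS code base.

```python
def feature_to_image_xy(x, y, stride):
    """Map feature-map location (x, y) to a point near the center of its
    receptive field in the input image: (floor(s/2) + x*s, floor(s/2) + y*s)."""
    return (stride // 2 + x * stride, stride // 2 + y * stride)


# Location (3, 5) on a stride-8 feature map (P3 in the notes above)
print(feature_to_image_xy(3, 5, 8))
```

With stride 8, neighbouring feature-map locations land 8 pixels apart in the image, each offset by 4 pixels from the top-left corner of its cell.
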
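Sample labelling during training
- anchor-based
  - an anchor is a positive sample if its IoU with a ground-truth (GT) box exceeds a threshold
  - the position takes the class of the GT box with the largest IoU
- FCOS
  - positive sample: location $(x, y)$ falls inside any GT box
  - class label: the class of that GT box
    noticeably more positive samples become available
- Problem
  - if annotated GT boxes overlap, a location $(x, y)$ mapped back to the image may fall inside several GT boxes; such a location is treated as an ambiguous sample
    To resolve the ambiguity and low recall caused by overlapping GT boxes, FCOS uses multi-level prediction as in FPN, detecting objects of different sizes on different feature levels
- Benefit
  Per-pixel regression not only yields more boxes; more importantly, it trains the regressor with as many foreground samples as possible, whereas traditional anchor-based detectors only treat anchor boxes with a sufficiently high IoU as positive samples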
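
A rough sketch of this labelling rule for a single image location is given below. The function name, the box format `(x0, y0, x1, y1, class_id)`, and the use of class 0 for background are assumptions made for illustration. The tie-break for ambiguous locations (pick the GT box with the smallest area) follows the FCOS paper rather than anything stated in these notes.

```python
def assign_label(px, py, gt_boxes):
    """Label one image location (px, py).

    gt_boxes: list of (x0, y0, x1, y1, class_id) with class_id >= 1.
    Returns (class_id, (l, t, r, b)), or (0, None) for a background location.
    """
    best = None
    for x0, y0, x1, y1, cls in gt_boxes:
        # Distances from the location to the four sides of the box
        l, t, r, b = px - x0, py - y0, x1 - px, y1 - py
        if min(l, t, r, b) > 0:  # the location lies strictly inside this box
            area = (x1 - x0) * (y1 - y0)
            # Ambiguous location: keep the box with the smallest area
            if best is None or area < best[0]:
                best = (area, cls, (l, t, r, b))
    if best is None:
        return 0, None
    return best[1], best[2]


boxes = [(0, 0, 100, 100, 1), (40, 40, 60, 60, 2)]
print(assign_label(50, 50, boxes))    # inside both boxes, the smaller one wins
print(assign_label(10, 10, boxes))    # inside the large box only
print(assign_label(200, 200, boxes))  # background
```
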
Multi-level Prediction with FPN
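- anchor-based
  - different levels regress anchor boxes of different scales
- FCOS
  - each level is given a range of regression distances: $m_2, m_3, m_4, m_5, m_6, m_7$ are set to $0, 64, 128, 256, 512, \infty$, where $m_i$ is the maximum distance that pyramid level $i$ has to regress (P3 covers [0, 64], P4 covers [64, 128], and so on); a target outside a level's range is not regressed on that level
  - Benefit
    Objects of different sizes are assigned to different feature levels, and most overlaps occur between objects of very different sizes, so multi-level prediction largely avoids the performance drop caused by overlapping boxes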
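
The size-range rule can be illustrated with a small sketch. The names `SIZE_RANGES` and `level_handles` are hypothetical; the thresholds are the $m_i$ values listed above, and both ends of each range are treated as inclusive, which is an assumption about boundary handling.

```python
import math

# (m_{i-1}, m_i) for each pyramid level, using the thresholds from the notes
SIZE_RANGES = {
    "P3": (0, 64),
    "P4": (64, 128),
    "P5": (128, 256),
    "P6": (256, 512),
    "P7": (512, math.inf),
}


def level_handles(level, l, t, r, b):
    """True if a location with regression targets (l, t, r, b) is a positive
    sample on `level`, i.e. m_{i-1} <= max(l, t, r, b) <= m_i."""
    lo, hi = SIZE_RANGES[level]
    return lo <= max(l, t, r, b) <= hi


# One location inside both a small and a large overlapping box:
small = (12, 8, 10, 9)         # max distance 12
large = (200, 150, 180, 120)   # max distance 200
print([lv for lv in SIZE_RANGES if level_handles(lv, *small)])
print([lv for lv in SIZE_RANGES if level_handles(lv, *large)])
```

Because the two boxes end up on different levels, the location is no longer ambiguous on either of them.
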
Center-ness
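- FCOS
  - Problem
    Per-pixel regression raises recall, but it also produces many low-quality predicted boxes from locations far from the object center
  - Solution
    Suppress these low-quality detections without introducing any hyperparameters
  - A branch parallel to the classification branch is added at every prediction level, which amounts to adding one more loss to the network and favors boxes predicted from locations close to the center
  - Benefit
    Low-quality boxes predicted from locations near the object's edge receive low scores, so non-maximum suppression (NMS) can filter them out, which improves detection performance.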
Regression targets used in training:
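
For reference, the FCOS paper defines the targets as follows, for a location $(x, y)$ assigned to a GT box with top-left corner $(x_0, y_0)$ and bottom-right corner $(x_1, y_1)$. The corner notation is taken from the paper, not from these notes.

$$
l^* = x - x_0, \quad t^* = y - y_0, \quad r^* = x_1 - x, \quad b^* = y_1 - y
$$

All four values are positive exactly when the location lies inside the box.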

Loss function:
focal loss (classification) + IoU loss (box regression) + binary cross-entropy (CE) loss (center-ness)
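
To make the three terms concrete, here is a scalar sketch for a single positive location and a single class. It is an illustration under assumptions, not the reference implementation: the `alpha` and `gamma` defaults are the commonly used focal-loss values, the IoU loss is taken in the $-\ln(\text{IoU})$ form, and the three terms are summed without normalisation or weighting.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one class logit; target is 0 or 1."""
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def iou_loss(pred, target):
    """-ln(IoU) between two boxes given as (l, t, r, b) distances
    measured from the same location."""
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    inter = (min(pl, tl) + min(pr, tr)) * (min(pt, tt) + min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return -math.log(inter / union)


def bce_loss(logit, target):
    """Binary cross-entropy for the center-ness logit; target in [0, 1]."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Total loss at one positive location (unnormalised sum of the three terms)
total = (focal_loss(2.0, 1)
         + iou_loss((9, 11, 10, 10), (10, 10, 10, 10))
         + bce_loss(1.5, 0.8))
print(total)
```
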

Multi-level Prediction with FPN

Five feature maps $P_3, P_4, P_5, P_6, P_7$ with strides 8, 16, 32, 64, and 128 are used ($P_6$ and $P_7$ are obtained by downsampling).
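
As a quick check of what these strides mean in practice, the sketch below computes the spatial size of each level for a given input. The helper name `level_shapes` is hypothetical, and rounding up is an assumption that matches stride-2 convolutions with kernel size 3 and padding 1.

```python
import math

STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}


def level_shapes(height, width):
    """Spatial size (rows, cols) of each pyramid level for an input image."""
    return {name: (math.ceil(height / s), math.ceil(width / s))
            for name, s in STRIDES.items()}


shapes = level_shapes(800, 1024)
num_locations = sum(h * w for h, w in shapes.values())
print(shapes)
print(num_locations)  # one prediction per location, with no anchors to multiply by
```
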

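anchor-based
- different levels regress anchor boxes of different scales
FCOS
- each level is given a range of regression distances: $m_2, m_3, m_4, m_5, m_6, m_7$ are set to $0, 64, 128, 256, 512, \infty$
  a target outside a level's range is not regressed on that level, which effectively reduces the ambiguity caused by overlapping objects [2]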

Center-ness

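Center-ness suppresses low-quality detection boxes, filters negatives quickly, lightens the load on NMS, and improves recall and detection performance.
It measures the distance between the current location and the object center; in other words, FCOS also takes the position of a point within the object into account, and the closer the point is to the center, the larger its weight.
During training
- the center-ness target approaches 0 for locations near the object's edge, so the low-quality boxes predicted there receive a low center-ness.
- non-maximum suppression (NMS) can then easily filter out these low-quality boxes, improving detection performance.

A branch parallel to classification is added at every prediction level, which amounts to adding one more loss to the network; this loss favors boxes predicted from locations close to the object center.
The formula for this loss is as follows, where $l$, $r$, $t$, and $b$ are the predicted values shown in the left panel of the figure below.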
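
For reference, the FCOS paper defines the center-ness target from the regression targets $l^*, t^*, r^*, b^*$ (the distances from the location to the four sides of its GT box) as:

$$
\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}
$$

The value is 1 at the exact center of the box and decays toward 0 at its edges; since it lies in $[0, 1]$, it can be trained with a binary cross-entropy loss.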

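Why this strategy works:

The confidence used for ranking is $P' = P \times \text{centerness}$. The closer a location is to the center, the larger its centerness; the farther away it is, the smaller its centerness and therefore the smaller $P'$, so boxes predicted far from the center are suppressed during NMS.
During training, the value in the formula above is constrained so that the confidence of boxes far from the center approaches 0.
When the network is finally used for inference, non-maximum suppression (NMS) can then filter out these low-quality boxes, improving detection performance.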
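
A small numeric sketch of this re-weighting is shown below. It assumes the center-ness definition from the FCOS paper, $\sqrt{\frac{\min(l, r)}{\max(l, r)} \cdot \frac{\min(t, b)}{\max(t, b)}}$, and the function names are hypothetical.

```python
import math


def centerness(l, t, r, b):
    """Center-ness of a location from its distances to the four box sides."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def nms_score(cls_prob, ctr):
    """Ranking score used for NMS: P' = P * centerness."""
    return cls_prob * ctr


# Two locations inside the same 20x20 box with the same class probability
center_loc = centerness(10, 10, 10, 10)  # exactly at the box center
edge_loc = centerness(2, 10, 18, 10)     # close to the left edge
print(nms_score(0.9, center_loc), nms_score(0.9, edge_loc))
```

Although both locations have the same classification score, the one near the edge is ranked far lower, so NMS keeps the box predicted from the center.
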

PyTorch

    import torch
    import torch.nn as nn
    import torchvision
    
    def Conv3x3ReLU(in_channels,out_channels):
        return nn.Sequential(
            nn.Conv2d(in_channels=in_channels,out_channels=out_channels,kernel_size=3,stride=1,padding=1),
            nn.ReLU6(inplace=True)
        )
    
    def locLayer(in_channels,out_channels):
        return nn.Sequential(
                Conv3x3ReLU(in_channels=in_channels, out_channels=in_channels),
                Conv3x3ReLU(in_channels=in_channels, out_channels=in_channels),
                Conv3x3ReLU(in_channels=in_channels, out_channels=in_channels),
                Conv3x3ReLU(in_channels=in_channels, out_channels=in_channels),
                nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, stride=1, padding=1),
            )
    
    def conf_centernessLayer(in_channels,out_channels):
        return nn.Sequential(
            Conv3x3ReLU(in_channels=in_channels, out_channels=in_channels),
            Conv3x3ReLU(in_channels=in_channels, out_channels=in_channels),
            Conv3x3ReLU(in_channels=in_channels, out_channels=in_channels),
            Conv3x3ReLU(in_channels=in_channels, out_channels=in_channels),
            nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, stride=1, padding=1),
        )
    
    class FCOS(nn.Module):
        def __init__(self, num_classes=21):
            super(FCOS, self).__init__()
            self.num_classes = num_classes
            resnet = torchvision.models.resnet50()
            layers = list(resnet.children())
    
            self.layer1 = nn.Sequential(*layers[:5])
            self.layer2 = nn.Sequential(*layers[5])
            self.layer3 = nn.Sequential(*layers[6])
            self.layer4 = nn.Sequential(*layers[7])
    
            self.lateral5 = nn.Conv2d(in_channels=2048, out_channels=256, kernel_size=1)
            self.lateral4 = nn.Conv2d(in_channels=1024, out_channels=256, kernel_size=1)
            self.lateral3 = nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1)
    
            self.upsample4 = nn.ConvTranspose2d(in_channels=256, out_channels=256, kernel_size=4, stride=2, padding=1)
            self.upsample3 = nn.ConvTranspose2d(in_channels=256, out_channels=256, kernel_size=4, stride=2, padding=1)
    
            self.downsample6 = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, stride=2, padding=1)
            self.downsample5 = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, stride=2, padding=1)
    
            self.loc_layer3 = locLayer(in_channels=256,out_channels=4)
            self.conf_centerness_layer3 = conf_centernessLayer(in_channels=256,out_channels=self.num_classes+1)
    
            self.loc_layer4 = locLayer(in_channels=256, out_channels=4)
            self.conf_centerness_layer4 = conf_centernessLayer(in_channels=256, out_channels=self.num_classes + 1)
    
            self.loc_layer5 = locLayer(in_channels=256, out_channels=4)
            self.conf_centerness_layer5 = conf_centernessLayer(in_channels=256, out_channels=self.num_classes + 1)
    
            self.loc_layer6 = locLayer(in_channels=256, out_channels=4)
            self.conf_centerness_layer6 = conf_centernessLayer(in_channels=256, out_channels=self.num_classes + 1)
    
            self.loc_layer7 = locLayer(in_channels=256, out_channels=4)
            self.conf_centerness_layer7 = conf_centernessLayer(in_channels=256, out_channels=self.num_classes + 1)
    
            self.init_params()
    
        def init_params(self):
            for m in self.modules():
                if isinstance(m, nn.Conv2d):
                    nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                elif isinstance(m, nn.BatchNorm2d):
                    nn.init.constant_(m.weight, 1)
                    nn.init.constant_(m.bias, 0)
    
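        def forward(self, x):
            # Backbone: C3, C4, C5 have strides 8, 16, 32 relative to the input image
            x = self.layer1(x)
            c3 = x = self.layer2(x)
            c4 = x = self.layer3(x)
            c5 = x = self.layer4(x)
    
            # Top-down FPN pathway: upsample the coarser level, then add the lateral connection
            p5 = self.lateral5(c5)
            p4 = self.upsample4(p5) + self.lateral4(c4)
            p3 = self.upsample3(p4) + self.lateral3(c3)
    
            # Extra levels P6 and P7 come from stride-2 convolutions on P5 and P6
            p6 = self.downsample5(p5)
            p7 = self.downsample6(p6)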
    
            loc3 = self.loc_layer3(p3)
            conf_centerness3 = self.conf_centerness_layer3(p3)
            conf3, centerness3 = conf_centerness3.split([self.num_classes, 1], dim=1)
    
            loc4 = self.loc_layer4(p4)
            conf_centerness4 = self.conf_centerness_layer4(p4)
            conf4, centerness4 = conf_centerness4.split([self.num_classes, 1], dim=1)
    
            loc5 = self.loc_layer5(p5)
            conf_centerness5 = self.conf_centerness_layer5(p5)
            conf5, centerness5 = conf_centerness5.split([self.num_classes, 1], dim=1)
    
            loc6 = self.loc_layer6(p6)
            conf_centerness6 = self.conf_centerness_layer6(p6)
            conf6, centerness6 = conf_centerness6.split([self.num_classes, 1], dim=1)
    
            loc7 = self.loc_layer7(p7)
            conf_centerness7 = self.conf_centerness_layer7(p7)
            conf7, centerness7 = conf_centerness7.split([self.num_classes, 1], dim=1)
    
            locs = torch.cat([loc3.permute(0, 2, 3, 1).contiguous().view(loc3.size(0), -1),
                        loc4.permute(0, 2, 3, 1).contiguous().view(loc4.size(0), -1),
                        loc5.permute(0, 2, 3, 1).contiguous().view(loc5.size(0), -1),
                        loc6.permute(0, 2, 3, 1).contiguous().view(loc6.size(0), -1),
                        loc7.permute(0, 2, 3, 1).contiguous().view(loc7.size(0), -1)],dim=1)
    
            confs = torch.cat([conf3.permute(0, 2, 3, 1).contiguous().view(conf3.size(0), -1),
                               conf4.permute(0, 2, 3, 1).contiguous().view(conf4.size(0), -1),
                               conf5.permute(0, 2, 3, 1).contiguous().view(conf5.size(0), -1),
                               conf6.permute(0, 2, 3, 1).contiguous().view(conf6.size(0), -1),
                               conf7.permute(0, 2, 3, 1).contiguous().view(conf7.size(0), -1),], dim=1)
    
            centernesses = torch.cat([centerness3.permute(0, 2, 3, 1).contiguous().view(centerness3.size(0), -1),
                               centerness4.permute(0, 2, 3, 1).contiguous().view(centerness4.size(0), -1),
                               centerness5.permute(0, 2, 3, 1).contiguous().view(centerness5.size(0), -1),
                               centerness6.permute(0, 2, 3, 1).contiguous().view(centerness6.size(0), -1),
                               centerness7.permute(0, 2, 3, 1).contiguous().view(centerness7.size(0), -1), ], dim=1)
    
            out = (locs, confs, centernesses)
            return out
    
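    if __name__ == '__main__':
        model = FCOS()
        print(model)
    
        images = torch.randn(1, 3, 800, 1024)
        locs, confs, centernesses = model(images)
        print(locs.shape)          # flattened (l, t, r, b) values for every location on P3 to P7
        print(confs.shape)         # flattened per-class scores
        print(centernesses.shape)  # flattened center-ness logits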

References

[1] https://blog.csdn.net/shanglianlm/article/details/89007219
[2] https://blog.csdn.net/shanglianlm/article/details/89007219
[3] https://blog.csdn.net/qiu931110/article/details/89073244
[4] https://zhuanlan.zhihu.com/p/63868458
[5] https://blog.csdn.net/WZZ18191171661/article/details/89258086

[Object Detection Series: Part 9] Anchor Free | FCOS | Fully Convolutional One-Stage Object Detection