[Sparse R-CNN] Notes on Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Paper: https://arxiv.org/pdf/2011.12450.pdf
Code: https://github.com/PeizeSun/SparseR-CNN

1. Abstract

The authors propose Sparse R-CNN, a sparse method for image object detection: a fixed set of N learnable object proposals is used for object recognition, i.e. classification and localization. By reducing the H x W x K hand-designed object candidates down to N learnable proposals, Sparse R-CNN avoids both proposal design and the familiar many-to-one label assignment. Moreover, the network's output needs no NMS post-processing.

Figure 1 compares different object detectors. (a) Dense detectors, represented by RetinaNet, produce H x W x K anchors in total; (b) dense-to-sparse detectors, represented by Faster R-CNN, further select a sparse set of N candidates from the H x W x K anchors; (c) sparse detectors, i.e. the Sparse R-CNN proposed in this paper, directly start from a set of N learnable object proposals.
[Figure 1: Comparison of different detectors]

2. Motivation

As shown above, the authors reflect on the shortcomings of dense and dense-to-sparse detectors, whose limitations are: (1) these pipelines produce redundant and near-duplicate results and therefore require NMS post-processing; (2) they suffer from the many-to-one label assignment problem; (3) the final performance depends heavily on the sizes, aspect ratios, and numbers of the anchor boxes, dense reference points, or other proposal generation algorithms. The paper raises the following question:

Despite the dense convention is widely recognized among object detectors, a natural question to ask is: Is it possible to design a sparse detector?

The authors argue that DETR also has a shortcoming: in DETR, every object query interacts with the global feature map via attention, which is in essence dense. They therefore propose that the sparse property should show up in two aspects: sparse boxes and sparse features. The former means that only a small number of boxes is needed to predict all objects; the latter means that the feature of each box does not need to interact with all other features over the full image.

Sparse boxes mean that a small number of starting boxes (e.g. 100) is enough to predict all objects in an image. While sparse features indicate the feature of each box does not need to interactively interact with all other features over the full image.

The network first takes N learnable proposal boxes, each represented by 4-d coordinates (an N x 4 set of parameters). These parameters are trained and optimized together with all other parameters in the network to predict regression and classification. However, such boxes are only a rough guess of possible object locations in the image, a coarse representation of the objects that lacks many informative details such as pose and shape. The authors therefore also introduce proposal features, high-dimensional latent vectors (N x d) that encode rich instance information. Each proposal feature interacts one-to-one with the RoI feature extracted from its proposal box, in a module called the Dynamic Instance Interactive Head.
Both the proposal boxes and the proposal features are randomly initialized at the start and optimized together with the other network parameters.
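A minimal sketch of these learnable proposals (assuming the defaults N = 100 and d = 256 quoted later in Section 5.1):

import torch
from torch import nn

N, d = 100, 256  # number of proposals and feature dimension (assumed defaults)

# boxes in normalized (cx, cy, w, h) format; features are latent instance vectors
init_proposal_boxes = nn.Embedding(N, 4)
init_proposal_features = nn.Embedding(N, d)

# initialize every box to cover the whole image, as the repo does
nn.init.constant_(init_proposal_boxes.weight[:, :2], 0.5)  # centers at (0.5, 0.5)
nn.init.constant_(init_proposal_boxes.weight[:, 2:], 1.0)  # full width and height

Both embeddings are updated by back-propagation like any other layer; no anchors and no RPN are involved.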

3. Sparse R-CNN

3.1 Pipeline

As shown in the figure below, the network takes three inputs: the image, the proposal boxes, and the proposal features.

[Figure: Sparse R-CNN pipeline]

3.2 Module

Backbone: ResNet with FPN, using levels P2 to P5.
Learnable proposal boxes: an N x 4 set of parameters that replaces the RPN; the boxes are normalized coordinates in [0, 1].
Learnable proposal features: N x d; the number of features equals the number of boxes.
Dynamic instance interactive head: given the N proposal boxes, Sparse R-CNN first uses RoIAlign to extract a feature for each box. These features go through the prediction head to produce the final predictions (the structure in the figure below). Each RoI feature $f_i$ ($S \times S \times C$) interacts with its corresponding proposal feature ($C$) to filter out ineffective bins and output the final object feature.
The regression prediction is computed by a 3-layer perceptron with ReLU activations and hidden dimension C.
The classification prediction is computed by a linear projection layer.
A proposal feature can be viewed as an implementation of attention over the S x S RoI: the proposal feature generates the parameters of convolution kernels, and the RoI feature is then convolved with these generated kernels to obtain the final feature (see the minimal sketch after this list).
The authors also adopt an iterative structure to further improve performance: the output features and output boxes of one head become the proposal features and proposal boxes of the next head. The proposal features go through self-attention before interacting with the RoI features.
Proposal features need no positional encoding.
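A minimal sketch of this dynamic interaction, with dimensions matching the defaults quoted in Section 5 (the full implementation is DynamicConv in Section 5.3.3; LayerNorm and ReLU between the two matrix products are omitted here for brevity):

import torch
from torch import nn

B, S, C, D = 200, 7, 256, 64  # boxes in the batch, RoI size, channels, dynamic dim

roi_feat = torch.randn(B, S * S, C)  # RoI features, one row per spatial bin
pro_feat = torch.randn(B, 1, C)      # one proposal feature per box

# the proposal feature is projected into two sets of "kernel" parameters
dynamic_layer = nn.Linear(C, 2 * C * D)
params = dynamic_layer(pro_feat)             # [B, 1, 2 x C x D]
param1 = params[:, :, :C * D].view(B, C, D)  # [B, 256, 64]
param2 = params[:, :, C * D:].view(B, D, C)  # [B, 64, 256]

# the RoI feature is "convolved" (1x1 conv = batched matmul) with the generated kernels
x = torch.bmm(roi_feat, param1)  # [B, 49, 64]
x = torch.bmm(x, param2)         # [B, 49, 256]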

[Figure: Dynamic Module]

3.3 Match and Loss of set prediction

Match
Similar to DETR, matching uses bipartite matching, computing the classification cost, the L1 cost between boxes, and the GIoU cost between boxes.
Unlike DETR, however, when computing the classification cost Sparse R-CNN uses the binary focal-loss formulation, whereas DETR directly uses 1 - prob[target class] (implemented as -out_prob[:, tgt_ids]). The code is as follows:

Loss
Classification loss + bbox regression loss:
$L_{total} = L_{cls} + L_{bbox} = L_{Focal} + L_{L1} + L_{GIoU}$

'''
Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates

            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates
'''
        if self.use_focal:
            # Compute the classification cost.

            alpha = self.focal_loss_alpha # 0.25
            gamma = self.focal_loss_gamma # 2.0
            # Detail: the binary focal-loss formula is applied to out_prob to compute
            # neg_cost_class and pos_cost_class, because the class is unknown at this
            # point; we are only computing a matching cost.
            neg_cost_class = (1 - alpha) * (out_prob ** gamma) * (-(1 - out_prob + 1e-8).log())  # [batch_size * num_queries, num_classes]
            pos_cost_class = alpha * ((1 - out_prob) ** gamma) * (-(out_prob + 1e-8).log())  # [batch_size * num_queries, num_classes]
            cost_class = pos_cost_class[:, tgt_ids] - neg_cost_class[:, tgt_ids]  # [batch_size * num_queries, num_gt] e.g. [200, 8]
        else:  # as in DETR
            cost_class = -out_prob[:, tgt_ids]
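On the loss side, a minimal sketch of how the weighted total is assembled (the term weights 2.0 / 5.0 / 2.0 follow the config values quoted in Section 5.1; the per-term losses are placeholder values here):

import torch

loss_dict = {"loss_ce": torch.tensor(0.8), "loss_bbox": torch.tensor(0.3), "loss_giou": torch.tensor(0.5)}
weight_dict = {"loss_ce": 2.0, "loss_bbox": 5.0, "loss_giou": 2.0}
total_loss = sum(loss_dict[k] * weight_dict[k] for k in loss_dict)  # mirrors the weighting loop in detector.py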

3.4 Rethinking after reading the code

# eg.
# bs  = batch_size = 2
# num_proposals = num_queries = 100

After reading the code, I found the following:
I. pro_bbox:
The RoIPooler first applies RoIAlign to the proposals (bboxes) to generate roi_features of shape [bs x 100, 256, 7, 7], which are then reshaped to [49, 200, 256].

II. pro_features goes through the following steps:
1. Self-attention first:
	nn.MultiheadAttention(d_model, nhead, dropout=dropout) # MultiheadAttention [256, 256], producing [100, 2, 256]; then a residual connection element-wise sums the original pro_features with the pro_features after self-attention.
2. dynamic_layer produces the parameters:
	pro_features go through dynamic_layer (linear [256, 2 x 256 x 64]) --> parameters [200, 1, 2 x 256 x 64]. The parameters are split into param1 [200, 256, 64] and param2 [200, 64, 256], again with residual connections, dropout, and similar layers to obtain the features.
3. inst_interact (forward of DynamicConv): the pro_bbox features from step I are multiplied first with param1 and then with param2 via two torch.bmm matrix products.
	The dimensions change as follows:
	features1 = [200, 49, 256] x [200, 256, 64] = [200, 49, 64]
	features1 = features1 x [200, 64, 256] = [200, 49, 256]
	flatten: [200, 256 x 7 x 7]
	A linear layer [256 x 7 x 7, 256] then yields pro_features [200, 256], again with residual connections (omitted here).
4. Output:
	obj_features: the pro_features saved together with pred_boxes for the next iteration of the loop, with dimension [1, 200, 256]
	cls_logits: [2 x 100, 80]
	pred_bboxes: [2 x 100, 4]
	return class_logits, pred_bboxes, obj_features
	The head then loops 6 times, and the final cls_logits and pred_bboxes are used to compute the matching and the loss.
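To sanity-check the flatten and out_layer shapes above, a tiny standalone sketch (random tensors, assuming bs = 2 and 100 proposals):

import torch
from torch import nn

bs, n, S, C = 2, 100, 7, 256            # assumed defaults
x = torch.randn(bs * n, S * S, C)       # [200, 49, 256], output of the two bmm steps
x = x.flatten(1)                        # [200, 49 x 256] = [200, 12544]
out_layer = nn.Linear(S * S * C, C)     # the out_layer inside DynamicConv
pro_features = out_layer(x)             # [200, 256]
print(pro_features.shape)               # torch.Size([200, 256])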

4. Experiments

4.1 Ablation studies on each component in Sparse R-CNN
(figure omitted)
4.2 The effect of cascade and feature reuse
(figure omitted)
4.3 The effect of instance interaction
(figure omitted)

4.4 Effect of initialization of proposal boxes
(figure omitted)
4.5 Effect of number of proposals
(figure omitted)
4.6 Effect of number of stages
(figure omitted)

4.7 Visualization of predicted boxes of each stage
(figure omitted)

4.8 Comparison with different object detectors on COCO 2017
(figure omitted)

5. Simple Code

Preliminary: understanding the RoIPooler code in combination with FPN

Formula:
$k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$
where k_0 defaults to 4, and w and h are the width and height of the manually set boxes (e.g., 100 boxes give 100 box areas, computed through box.area() of the Boxes structure designed in detectron2).

# path: SparseR-CNN/detectron2/modeling/poolers.py
# Uses the FPN mapping formula to decide which FPN level each box on the
# original image should use for feature extraction.
# Returns the corresponding FPN level indices.
def assign_boxes_to_levels(
    box_lists: List[Boxes],  # one Boxes per image in the batch; in SparseRCNN, box_lists[i] has shape [100, 4] holding the top-left and bottom-right corners of the proposal_boxes
    min_level: int, # 2
    max_level: int, # 5
    canonical_box_size: int, #224
    canonical_level: int,# 4
):
    """
    Map each box in `box_lists` to a feature map level index and return the assignment
    vector.

    Args:
        box_lists (list[Boxes] | list[RotatedBoxes]): A list of N Boxes or N RotatedBoxes,
            where N is the number of images in the batch.
        min_level (int): Smallest feature map level index. The input is considered index 0,
            the output of stage 1 is index 1, and so.
        max_level (int): Largest feature map level index.
        canonical_box_size (int): A canonical box size in pixels (sqrt(box area)).
        canonical_level (int): The feature map level index on which a canonically-sized box
            should be placed.

    Returns:
        A tensor of length M, where M is the total number of boxes aggregated over all
            N batch images. The memory layout corresponds to the concatenation of boxes
            from all images. Each element is the feature map index, as an offset from
            `self.min_level`, for the corresponding box (so value i means the box is at
            `self.min_level + i`).
    """
    box_sizes = torch.sqrt(cat([boxes.area() for boxes in box_lists]))
    # Eqn.(1) in FPN paper
    level_assignments = torch.floor(
        canonical_level + torch.log2(box_sizes / canonical_box_size + 1e-8)
    )
    # clamp level to (min, max), in case the box size is too large or too small
    # for the available feature maps
    level_assignments = torch.clamp(level_assignments, min=min_level, max=max_level)
    return level_assignments.to(torch.int64) - min_level  # 2,3,4,5 --> 0,1,2,3; min_level is subtracted because self.level_poolers below is indexed from 0 (0~3), so level 5 maps to pooler 3
    '''
      self.level_poolers
      ModuleList(
          (0): ROIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=2, aligned=True)
          (1): ROIAlign(output_size=(7, 7), spatial_scale=0.125, sampling_ratio=2, aligned=True)
          (2): ROIAlign(output_size=(7, 7), spatial_scale=0.0625, sampling_ratio=2, aligned=True)
          (3): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=2, aligned=True)
      )

      '''
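A quick numeric check of the formula (hypothetical box sizes; standalone reimplementation for illustration only):

import math

def fpn_level(box_area, k0=4, canonical=224, min_level=2, max_level=5):
    # same mapping as assign_boxes_to_levels, for a single box
    k = math.floor(k0 + math.log2(math.sqrt(box_area) / canonical))
    return max(min_level, min(max_level, k))

print(fpn_level(112 * 112))  # 3 -> P3 (pooler index 3 - 2 = 1)
print(fpn_level(448 * 448))  # 5 -> P5
print(fpn_level(10 * 10))    # clamped to 2 -> P2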

5.1 SparseR-CNN/projects/SparseRCNN/sparsercnn/detector.py

#
# Modified by Peize Sun, Rufeng Zhang
# Contact: {sunpeize, cxrfzhang}@foxmail.com
#
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import logging
import math
from typing import List

import numpy as np
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch import nn

from detectron2.layers import ShapeSpec
from detectron2.modeling import META_ARCH_REGISTRY, build_backbone, detector_postprocess
from detectron2.modeling.roi_heads import build_roi_heads

from detectron2.structures import Boxes, ImageList, Instances
from detectron2.utils.logger import log_first_n
from fvcore.nn import giou_loss, smooth_l1_loss

from .loss import SetCriterion, HungarianMatcher
from .head import DynamicHead
from .util.box_ops import box_cxcywh_to_xyxy, box_xyxy_to_cxcywh
from .util.misc import (NestedTensor, nested_tensor_from_tensor_list,
                       accuracy, get_world_size, interpolate,
                       is_dist_avail_and_initialized)
import pdb
__all__ = ["SparseRCNN"]

# SparseRCNN calls DynamicHead
# DynamicHead calls RCNNHead
# RCNNHead calls DynamicConv
# DynamicConv

@META_ARCH_REGISTRY.register()
class SparseRCNN(nn.Module):
    """
    Implement SparseRCNN
    """

    def __init__(self, cfg):
        super().__init__()

        self.device = torch.device(cfg.MODEL.DEVICE)

        self.in_features = cfg.MODEL.ROI_HEADS.IN_FEATURES # ['p2', 'p3', 'p4', 'p5']
        self.num_classes = cfg.MODEL.SparseRCNN.NUM_CLASSES # 80
        self.num_proposals = cfg.MODEL.SparseRCNN.NUM_PROPOSALS # 100
        self.hidden_dim = cfg.MODEL.SparseRCNN.HIDDEN_DIM # 256
        self.num_heads = cfg.MODEL.SparseRCNN.NUM_HEADS  # 6, the number of stages (6 iterative RCNNHeads)
        #pdb.set_trace()
        # Build Backbone.
        self.backbone = build_backbone(cfg)
        self.size_divisibility = self.backbone.size_divisibility # 32
        #pdb.set_trace()

        # Build Proposals.
        self.init_proposal_features = nn.Embedding(self.num_proposals, self.hidden_dim) # embedding(100, 256)
        self.init_proposal_boxes = nn.Embedding(self.num_proposals, 4) # embedding(100, 4)
        nn.init.constant_(self.init_proposal_boxes.weight[:, :2], 0.5) # [100, 2], all 0.5 (box centers)
        nn.init.constant_(self.init_proposal_boxes.weight[:, 2:], 1.0) # [100, 2], all 1.0 (box widths/heights)

        # Build Dynamic Head.
        self.head = DynamicHead(cfg=cfg, roi_input_shape=self.backbone.output_shape()) # init DynamicHead
        
        # Loss parameters
        class_weight = cfg.MODEL.SparseRCNN.CLASS_WEIGHT # 2.0
        giou_weight = cfg.MODEL.SparseRCNN.GIOU_WEIGHT # 2.0
        l1_weight = cfg.MODEL.SparseRCNN.L1_WEIGHT  # 5.0
        no_object_weight = cfg.MODEL.SparseRCNN.NO_OBJECT_WEIGHT # 0.1
        self.deep_supervision = cfg.MODEL.SparseRCNN.DEEP_SUPERVISION # True
        self.use_focal = cfg.MODEL.SparseRCNN.USE_FOCAL # True

        #pdb.set_trace()

        # build HungarianMatcher
        matcher = HungarianMatcher(cfg=cfg,
                                   cost_class=class_weight, 
                                   cost_bbox=l1_weight, 
                                   cost_giou=giou_weight,
                                   use_focal=self.use_focal)
        weight_dict = {"loss_ce": class_weight, "loss_bbox": l1_weight, "loss_giou": giou_weight}
        # weight_dict: {'loss_ce': 2.0, 'loss_bbox': 5.0, 'loss_giou': 2.0}
        if self.deep_supervision: # True
            aux_weight_dict = {}
            for i in range(self.num_heads - 1):
                aux_weight_dict.update({k + f"_{i}": v for k, v in weight_dict.items()})
            weight_dict.update(aux_weight_dict)  # replicates weight_dict for the 5 auxiliary stages

        losses = ["labels", "boxes"]
        
        
        # Build Criterion
        self.criterion = SetCriterion(cfg=cfg,
                                      num_classes=self.num_classes,
                                      matcher=matcher,
                                      weight_dict=weight_dict,
                                      eos_coef=no_object_weight,
                                      losses=losses,
                                      use_focal=self.use_focal)
        

        pixel_mean = torch.Tensor(cfg.MODEL.PIXEL_MEAN).to(self.device).view(3, 1, 1) # [3, 1, 1]
        pixel_std = torch.Tensor(cfg.MODEL.PIXEL_STD).to(self.device).view(3, 1, 1) # [3,1,1]
        self.normalizer = lambda x: (x - pixel_mean) / pixel_std
        self.to(self.device)
    '''
    (Pdb) pixel_mean
    tensor([[[123.6750]],

            [[116.2800]],

            [[103.5300]]], device='cuda:0')

    (Pdb) pixel_std
    tensor([[[58.3950]],

            [[57.1200]],

            [[57.3750]]], device='cuda:0')

    '''

    def forward(self, batched_inputs):
        """
        Args:
            batched_inputs: a list, batched outputs of :class:`DatasetMapper` .
                Each item in the list contains the inputs for one image.
                For now, each item in the list is a dict that contains:

                * image: Tensor, image in (C, H, W) format.
                * instances: Instances

                Other information that's included in the original dicts, such as:

                * "height", "width" (int): the output resolution of the model, used in inference.
                  See :meth:`postprocess` for details.
        """
        images, images_whwh = self.preprocess_image(batched_inputs) # len(images) = batch_size images_whwh = [batch_size, 4]
        if isinstance(images, (list, torch.Tensor)):
            images = nested_tensor_from_tensor_list(images)

        # Feature Extraction.
        src = self.backbone(images.tensor) # forward backbone; see the (Pdb) dump in section 5.3.1 for src shapes
        features = list()        
        for f in self.in_features:
            feature = src[f] # keep only p2, p3, p4, p5
            features.append(feature)

        # Prepare Proposals.
        proposal_boxes = self.init_proposal_boxes.weight.clone() # [100, 4]
        proposal_boxes = box_cxcywh_to_xyxy(proposal_boxes) 
        proposal_boxes = proposal_boxes[None] * images_whwh[:, None, :] # [1, 100, 4] x [2, 1, 4] = [2, 100, 4]
        pdb.set_trace()

        # Prediction.

        #  forward DynamicHead 
        outputs_class, outputs_coord = self.head(features, proposal_boxes, self.init_proposal_features.weight)
        # keep only the final stage's pred_logits and pred_boxes
        output = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
        pdb.set_trace()
        if self.training:  # training
            gt_instances = [x["instances"].to(self.device) for x in batched_inputs]
            targets = self.prepare_targets(gt_instances)
            if self.deep_supervision:
                output['aux_outputs'] = [{'pred_logits': a, 'pred_boxes': b}
                                         for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]

            loss_dict = self.criterion(output, targets) # the criterion forwards the matcher internally when computing the loss
            weight_dict = self.criterion.weight_dict
            for k in loss_dict.keys():
                if k in weight_dict:
                    loss_dict[k] *= weight_dict[k]
            pdb.set_trace()
            return loss_dict
        # shape notes:
        #   backbone features: (bs, 256, h, w)
        #   proposal boxes:    (bs, 100, 4)
        #   proposal features: (bs, 100, 256)
        #   after the pooler:  (bs x 100, 256, 7, 7)
        else: # inference
            box_cls = output["pred_logits"]
            box_pred = output["pred_boxes"]
            # forward inference
            results = self.inference(box_cls, box_pred, images.image_sizes) 
 
            processed_results = []
            for results_per_image, input_per_image, image_size in zip(results, batched_inputs, images.image_sizes):
                height = input_per_image.get("height", image_size[0])
                width = input_per_image.get("width", image_size[1])
                r = detector_postprocess(results_per_image, height, width)
                processed_results.append({"instances": r})
            return processed_results

    def prepare_targets(self, targets):
        new_targets = []
        for targets_per_image in targets:
            target = {}
            h, w = targets_per_image.image_size
            image_size_xyxy = torch.as_tensor([w, h, w, h], dtype=torch.float, device=self.device)
            gt_classes = targets_per_image.gt_classes
            gt_boxes = targets_per_image.gt_boxes.tensor / image_size_xyxy
            gt_boxes = box_xyxy_to_cxcywh(gt_boxes)
            target["labels"] = gt_classes.to(self.device)
            target["boxes"] = gt_boxes.to(self.device)
            target["boxes_xyxy"] = targets_per_image.gt_boxes.tensor.to(self.device)
            target["image_size_xyxy"] = image_size_xyxy.to(self.device)
            image_size_xyxy_tgt = image_size_xyxy.unsqueeze(0).repeat(len(gt_boxes), 1)
            target["image_size_xyxy_tgt"] = image_size_xyxy_tgt.to(self.device)
            target["area"] = targets_per_image.gt_boxes.area().to(self.device)
            new_targets.append(target)
        pdb.set_trace()
        return new_targets

    def inference(self, box_cls, box_pred, image_sizes):
        """
        Arguments:
            box_cls (Tensor): tensor of shape (batch_size, num_proposals, K).
                The tensor predicts the classification probability for each proposal.
            box_pred (Tensor): tensors of shape (batch_size, num_proposals, 4).
                The tensor predicts 4-vector (x,y,w,h) box
                regression values for every proposal
            image_sizes (List[torch.Size]): the input image sizes

        Returns:
            results (List[Instances]): a list of #images elements.
        """
        assert len(box_cls) == len(image_sizes)
        results = []

        if self.use_focal:
            scores = torch.sigmoid(box_cls)
            labels = torch.arange(self.num_classes, device=self.device).\
                     unsqueeze(0).repeat(self.num_proposals, 1).flatten(0, 1)

            for i, (scores_per_image, box_pred_per_image, image_size) in enumerate(zip(
                    scores, box_pred, image_sizes
            )):
                result = Instances(image_size)
                scores_per_image, topk_indices = scores_per_image.flatten(0, 1).topk(self.num_proposals, sorted=False)
                labels_per_image = labels[topk_indices]
                box_pred_per_image = box_pred_per_image.view(-1, 1, 4).repeat(1, self.num_classes, 1).view(-1, 4)
                box_pred_per_image = box_pred_per_image[topk_indices]

                result.pred_boxes = Boxes(box_pred_per_image)
                result.scores = scores_per_image
                result.pred_classes = labels_per_image
                results.append(result)

        else:
            # For each box we assign the best class or the second best if the best one is `no_object`.
            scores, labels = F.softmax(box_cls, dim=-1)[:, :, :-1].max(-1)

            for i, (scores_per_image, labels_per_image, box_pred_per_image, image_size) in enumerate(zip(
                scores, labels, box_pred, image_sizes
            )):
                result = Instances(image_size)
                result.pred_boxes = Boxes(box_pred_per_image)
                result.scores = scores_per_image
                result.pred_classes = labels_per_image
                results.append(result)
        pdb.set_trace()
        return results

    def preprocess_image(self, batched_inputs):
        """
        Normalize, pad and batch the input images.
        """
        images = [self.normalizer(x["image"].to(self.device)) for x in batched_inputs]
        images = ImageList.from_tensors(images, self.size_divisibility)

        images_whwh = list()
        for bi in batched_inputs:
            h, w = bi["image"].shape[-2:]
            images_whwh.append(torch.tensor([w, h, w, h], dtype=torch.float32, device=self.device))
        images_whwh = torch.stack(images_whwh)
        pdb.set_trace()
        return images, images_whwh

5.2 SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py


class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network

    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cfg, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1, use_focal: bool = False):
        """Creates the matcher

        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class # 2
        self.cost_bbox = cost_bbox # 5
        self.cost_giou = cost_giou # 2
        self.use_focal = use_focal # True
        if self.use_focal:
            self.focal_loss_alpha = cfg.MODEL.SparseRCNN.ALPHA
            self.focal_loss_gamma = cfg.MODEL.SparseRCNN.GAMMA
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs cant be 0"
        
    @torch.no_grad()
    def forward(self, outputs, targets):
        """ Performs the matching

        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates

            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        if self.use_focal:
            out_prob = outputs["pred_logits"].flatten(0, 1).sigmoid()  # [batch_size * num_queries, num_classes]
            out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]
        else: # without focal loss, apply softmax to pred_logits
            out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
            out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        tgt_ids = torch.cat([v["labels"] for v in targets])  # eg. tensor([29,  0,  0, 41, 52, 52, 48, 60], device='cuda:0')

        tgt_bbox = torch.cat([v["boxes_xyxy"] for v in targets]) # [8 ,4]

        pdb.set_trace()
        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be ommitted.
        if self.use_focal:
            # Compute the classification cost.

            alpha = self.focal_loss_alpha # 0.25
            gamma = self.focal_loss_gamma # 2.0
            # Detail: the binary focal-loss formula is applied to out_prob to compute
            # neg_cost_class and pos_cost_class, because the class is unknown at this
            # point; we are only computing a matching cost.
            neg_cost_class = (1 - alpha) * (out_prob ** gamma) * (-(1 - out_prob + 1e-8).log()) #  [batch_size * num_queries, num_classes]
            pos_cost_class = alpha * ((1 - out_prob) ** gamma) * (-(out_prob + 1e-8).log()) #  [batch_size * num_queries, num_classes]
            cost_class = pos_cost_class[:, tgt_ids] - neg_cost_class[:, tgt_ids] # [batch_size * num_queries, num_gt] eg.[200, 8]
            pdb.set_trace()
        else: # in DETR
            cost_class = -out_prob[:, tgt_ids]

        # Compute the L1 cost between boxes
        image_size_out = torch.cat([v["image_size_xyxy"].unsqueeze(0) for v in targets]) # [2, 4]  image size for each image in the batch
        image_size_out = image_size_out.unsqueeze(1).repeat(1, num_queries, 1).flatten(0, 1) # [2 x 100, 4]
        image_size_tgt = torch.cat([v["image_size_xyxy_tgt"] for v in targets]) # [8, 4]  image size repeated once per ground-truth box

        # normalization
        out_bbox_ = out_bbox / image_size_out   # [batch_size * num_queries, 4]
        tgt_bbox_ = tgt_bbox / image_size_tgt  # [num_gt , 4]
        cost_bbox = torch.cdist(out_bbox_, tgt_bbox_, p=1) # [batch_size * num_queries, num_gt]

        # Compute the giou cost betwen boxes
         # cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        cost_giou = -generalized_box_iou(out_bbox, tgt_bbox) # [batch_size * num_queries, num_gt]

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()

        sizes = [len(v["boxes"]) for v in targets]
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        pdb.set_trace()
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]
'''
(Pdb) tgt_bbox
tensor([[ 544.9488,   55.6028,  592.7479,   87.0280],
        [ 830.8463,  283.8805,  855.0020,  434.4423],
        [ 391.0041,  232.9907,  528.6505,  543.2131],
        [ 504.2867,  169.0072,  604.9553,  314.2662],
        [  14.2239,  260.9941,  251.3566,  526.6308],
        [ 350.4360,  223.3023,  613.7350,  469.1882],
        [ 604.6720,  177.3639,  789.3936,  386.5810],
        [  10.4948,   90.9954, 1000.7849,  661.5030]], device='cuda:0')

'''
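As a toy illustration of the Hungarian step (a hypothetical 3-proposal, 2-ground-truth cost matrix, not taken from the repo):

import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = proposals, cols = ground-truth boxes; lower cost = better match
C = np.array([[0.9, 0.2],
              [0.1, 0.8],
              [0.5, 0.4]])
row_ind, col_ind = linear_sum_assignment(C)
print(row_ind, col_ind)  # [0 1] [1 0]: proposal 0 -> gt 1, proposal 1 -> gt 0; proposal 2 stays unmatched (background)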

5.3 SparseR-CNN/projects/SparseRCNN/sparsercnn/head.py

head.py mainly contains three modules: DynamicHead, RCNNHead, and DynamicConv.
Their main relationship is:

# DynamicHead calls RCNNHead
# RCNNHead calls DynamicConv
# DynamicConv

5.3.1 DynamicHead

#
# Modified by Peize Sun, Rufeng Zhang
# Contact: {sunpeize, cxrfzhang}@foxmail.com
#
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
SparseRCNN Transformer class.

Copy-paste from torch.nn.Transformer with modifications:
    * positional encodings are passed in MHattention
    * extra LN at the end of encoder is removed
    * decoder returns a stack of activations from all decoding layers
"""
import copy
import math
from typing import Optional, List

import torch
from torch import nn, Tensor
import torch.nn.functional as F

from detectron2.modeling.poolers import ROIPooler, cat
from detectron2.structures import Boxes
import pdb

_DEFAULT_SCALE_CLAMP = math.log(100000.0 / 16)

# DynamicHead calls RCNNHead
# RCNNHead calls DynamicConv
# DynamicConv

class DynamicHead(nn.Module):

    def __init__(self, cfg, roi_input_shape):
        super().__init__()

        # Build RoI.
        box_pooler = self._init_box_pooler(cfg, roi_input_shape) # build the box pooler
        self.box_pooler = box_pooler
        
        # Build heads.
        num_classes = cfg.MODEL.SparseRCNN.NUM_CLASSES # 80
        d_model = cfg.MODEL.SparseRCNN.HIDDEN_DIM # 256
        dim_feedforward = cfg.MODEL.SparseRCNN.DIM_FEEDFORWARD # 2048
        nhead = cfg.MODEL.SparseRCNN.NHEADS # 8
        dropout = cfg.MODEL.SparseRCNN.DROPOUT # 0.0
        activation = cfg.MODEL.SparseRCNN.ACTIVATION # 'relu'
        num_heads = cfg.MODEL.SparseRCNN.NUM_HEADS # 6

        # init RCNNHead  
        rcnn_head = RCNNHead(cfg, d_model, num_classes, dim_feedforward, nhead, dropout, activation)  # the module structure is shown in the note below
        #pdb.set_trace()
        self.head_series = _get_clones(rcnn_head, num_heads) # i.e., 6 copies of rcnn_head
        self.return_intermediate = cfg.MODEL.SparseRCNN.DEEP_SUPERVISION
        
        # Init parameters.
        self.use_focal = cfg.MODEL.SparseRCNN.USE_FOCAL
        self.num_classes = num_classes
        if self.use_focal:
            prior_prob = cfg.MODEL.SparseRCNN.PRIOR_PROB # 0.01 
            self.bias_value = -math.log((1 - prior_prob) / prior_prob) # -4.5
        #pdb.set_trace()
        self._reset_parameters()
        
    def _reset_parameters(self):
        # init all parameters.
        for p in self.parameters(): # len(list(self.parameters())) = 240
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

            # initialize the bias for focal loss.
            if self.use_focal:
                if p.shape[-1] == self.num_classes:
                    nn.init.constant_(p, self.bias_value)
        #pdb.set_trace()

    @staticmethod
    def _init_box_pooler(cfg, input_shape):

        in_features = cfg.MODEL.ROI_HEADS.IN_FEATURES #  ['p2', 'p3', 'p4', 'p5']
        pooler_resolution = cfg.MODEL.ROI_BOX_HEAD.POOLER_RESOLUTION # 7
        pooler_scales = tuple(1.0 / input_shape[k].stride for k in in_features) #  (0.25, 0.125, 0.0625, 0.03125)
        sampling_ratio = cfg.MODEL.ROI_BOX_HEAD.POOLER_SAMPLING_RATIO # 2
        pooler_type = cfg.MODEL.ROI_BOX_HEAD.POOLER_TYPE # ROIAlignV2

        # If StandardROIHeads is applied on multiple feature maps (as in FPN),
        # then we share the same predictors and therefore the channel counts must be the same
        in_channels = [input_shape[f].channels for f in in_features] # [256,256,256,256]
        # Check all channel counts are equal
        assert len(set(in_channels)) == 1, in_channels
        
        # init RoIAlign
        # Key point!
        # Role of `scales`: the proposal_boxes are N boxes defined on the original image,
        # so they must be mapped back onto each feature map using the corresponding stride.
        # e.g. 0.25 is 1/4, i.e. the feature map is 4x smaller than the original image.
        # Note: ''The box coordinates are defined on the original image and
        # will be scaled by the `scales` argument of :class:`ROIPooler`.''       
        box_pooler = ROIPooler(
            output_size=pooler_resolution, # 7
            scales=pooler_scales,  # (0.25, 0.125, 0.0625, 0.03125)
            sampling_ratio=sampling_ratio, # 2
            pooler_type=pooler_type, # ROIAlignV2
        )
        return box_pooler
    
    def forward(self, features, init_bboxes, init_features): # [100, 256]
        pdb.set_trace()
        inter_class_logits = []
        inter_pred_bboxes = []

        bs = len(features[0]) # batchsize 2
        bboxes = init_bboxes # [2, 100, 4]
        
        init_features = init_features[None].repeat(1, bs, 1) # [1, 200, 256]
        proposal_features = init_features.clone() #  [1, 200, 256]
        pdb.set_trace()
        for rcnn_head in self.head_series: # loop 6 times; each iteration forwards an RCNNHead, and the returned proposal_features feed the next iteration.
            class_logits, pred_bboxes, proposal_features = rcnn_head(features, bboxes, proposal_features, self.box_pooler)

            if self.return_intermediate:
                inter_class_logits.append(class_logits)
                inter_pred_bboxes.append(pred_bboxes)
            bboxes = pred_bboxes.detach()

        if self.return_intermediate:
            return torch.stack(inter_class_logits), torch.stack(inter_pred_bboxes)
        pdb.set_trace()
        return class_logits[None], pred_bboxes[None]






def _get_clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])


def _get_activation_fn(activation):
    """Return an activation function given a string"""
    if activation == "relu":
        return F.relu
    if activation == "gelu":
        return F.gelu
    if activation == "glu":
        return F.glu
    raise RuntimeError(F"activation should be relu/gelu, not {activation}.")


'''
(Pdb) input_shape
{
    'p2': ShapeSpec(channels=256, height=None, width=None, stride=4), 
    'p3': ShapeSpec(channels=256, height=None, width=None, stride=8), 
    'p4': ShapeSpec(channels=256, height=None, width=None, stride=16), 
    'p5': ShapeSpec(channels=256, height=None, width=None, stride=32), 
    'p6': ShapeSpec(channels=256, height=None, width=None, stride=64)
}



ROIPooler(
  (level_poolers): ModuleList(
    (0): ROIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=2, aligned=True)
    (1): ROIAlign(output_size=(7, 7), spatial_scale=0.125, sampling_ratio=2, aligned=True)
    (2): ROIAlign(output_size=(7, 7), spatial_scale=0.0625, sampling_ratio=2, aligned=True)
    (3): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=2, aligned=True)
  )
)


self.cls_module
ModuleList(
  (0): Linear(in_features=256, out_features=256, bias=False)
  (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (2): ReLU(inplace=True)
)



self.reg_module

    ModuleList(
    (0): Linear(in_features=256, out_features=256, bias=False)
    (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (2): ReLU(inplace=True)
    (3): Linear(in_features=256, out_features=256, bias=False)
    (4): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=256, out_features=256, bias=False)
    (7): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (8): ReLU(inplace=True)
    )







(Pdb) src.keys()
dict_keys(['p2', 'p3', 'p4', 'p5', 'p6'])
(Pdb) src['p2'].size()
torch.Size([2, 256, 168, 256])
(Pdb) src['p3'].size()
torch.Size([2, 256, 84, 128])
(Pdb) src['p4'].size()
torch.Size([2, 256, 42, 64])
(Pdb) src['p5'].size()
torch.Size([2, 256, 21, 32])
(Pdb) src['p6'].size()
torch.Size([2, 256, 11, 16])

'''

5.3.2 RCNNHead

class RCNNHead(nn.Module):

    def __init__(self, cfg, d_model, num_classes, dim_feedforward=2048, nhead=8, dropout=0.1, activation="relu",
                 scale_clamp: float = _DEFAULT_SCALE_CLAMP, bbox_weights=(2.0, 2.0, 1.0, 1.0)):
        super().__init__()

        self.d_model = d_model

        # dynamic.
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout) # MultiheadAttention [256, 256]
        # init DynamicConv
        self.inst_interact = DynamicConv(cfg) 
        #pdb.set_trace()
        self.linear1 = nn.Linear(d_model, dim_feedforward) # [256, 2048]
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model) # [2048, 256]

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation) # relu

        # cls.
        num_cls = cfg.MODEL.SparseRCNN.NUM_CLS # 1
        cls_module = list()
        for _ in range(num_cls):
            cls_module.append(nn.Linear(d_model, d_model, False))
            cls_module.append(nn.LayerNorm(d_model))
            cls_module.append(nn.ReLU(inplace=True))
        self.cls_module = nn.ModuleList(cls_module) # Linear LayerNorm ReLU

        # reg.
        num_reg = cfg.MODEL.SparseRCNN.NUM_REG # 3
        reg_module = list()
        for _ in range(num_reg):
            reg_module.append(nn.Linear(d_model, d_model, False))
            reg_module.append(nn.LayerNorm(d_model))
            reg_module.append(nn.ReLU(inplace=True))
        self.reg_module = nn.ModuleList(reg_module)
        
        # pred.
        self.use_focal = cfg.MODEL.SparseRCNN.USE_FOCAL #  True
        if self.use_focal:
            self.class_logits = nn.Linear(d_model, num_classes) # [256, 80]
        else:
            self.class_logits = nn.Linear(d_model, num_classes + 1)
        self.bboxes_delta = nn.Linear(d_model, 4) # [256, 4]
        self.scale_clamp = scale_clamp # 8.740
        self.bbox_weights = bbox_weights # (2.0, 2.0, 1.0, 1.0)
        #pdb.set_trace()

    def forward(self, features, bboxes, pro_features, pooler):
        """
        :param bboxes: (N, nr_boxes, 4)
        :param pro_features: (N, nr_boxes, d_model)
        """

        N, nr_boxes = bboxes.shape[:2] # batchsize, 100
        
        # roi_feature.
        proposal_boxes = list()
        for b in range(N): 
            proposal_boxes.append(Boxes(bboxes[b]))
        # len(features) = 4;  bboxes.size() = [2, 100, 4]
        # len(proposal_boxes) = 2
        roi_features = pooler(features, proposal_boxes) # pooler [2 * 100, 256, 7, 7]           
        roi_features = roi_features.view(N * nr_boxes, self.d_model, -1).permute(2, 0, 1) # [49, 200, 256]    
        pdb.set_trace()

        # self_att.
        pro_features = pro_features.view(N, nr_boxes, self.d_model).permute(1, 0, 2) # [100, 2, 256]
        pro_features2 = self.self_attn(pro_features, pro_features, value=pro_features)[0] # [100, 2, 256]

        # element-wise sum  pro_features = drop(att(feat))  + feat
        pro_features = pro_features + self.dropout1(pro_features2)
        pro_features = self.norm1(pro_features)
        
        # inst_interact.
        pro_features = pro_features.view(nr_boxes, N, self.d_model).permute(1, 0, 2).reshape(1, N * nr_boxes, self.d_model) # [1, 200, 256]
        
        pdb.set_trace()
        # forward dynamicConv
        pro_features2 = self.inst_interact(pro_features, roi_features) # [200, 256]
        pro_features = pro_features + self.dropout2(pro_features2)
        
        
        # obj_feature: obj_features are carried over to the next loop iteration
        obj_features = self.norm2(pro_features)
        obj_features2 = self.linear2(self.dropout(self.activation(self.linear1(obj_features)))) # [1,200,256]
        obj_features = obj_features + self.dropout3(obj_features2)
        obj_features = self.norm3(obj_features) 
        
        fc_feature = obj_features.transpose(0, 1).reshape(N * nr_boxes, -1) # [200, 256]
        cls_feature = fc_feature.clone()
        reg_feature = fc_feature.clone()
        pdb.set_trace()
        for cls_layer in self.cls_module:
            cls_feature = cls_layer(cls_feature)
        for reg_layer in self.reg_module:
            reg_feature = reg_layer(reg_feature)
        class_logits = self.class_logits(cls_feature) # [N x 100, 80]
        bboxes_deltas = self.bboxes_delta(reg_feature) # [N x 100, 4]
        pred_bboxes = self.apply_deltas(bboxes_deltas, bboxes.view(-1, 4)) # calls apply_deltas; [200, 4]
        pdb.set_trace()
        return class_logits.view(N, nr_boxes, -1), pred_bboxes.view(N, nr_boxes, -1), obj_features # [2, 100, 80]  [2, 100, 4] [1,Nx100, 256]
    

    def apply_deltas(self, deltas, boxes):
        """
        Apply transformation `deltas` (dx, dy, dw, dh) to `boxes`.

        Args:
            deltas (Tensor): transformation deltas of shape (N, k*4), where k >= 1.
                deltas[i] represents k potentially different class-specific
                box transformations for the single box boxes[i].
            boxes (Tensor): boxes to transform, of shape (N, 4)
        """
        boxes = boxes.to(deltas.dtype) # (N, k*4)

        widths = boxes[:, 2] - boxes[:, 0]
        heights = boxes[:, 3] - boxes[:, 1]
        ctr_x = boxes[:, 0] + 0.5 * widths
        ctr_y = boxes[:, 1] + 0.5 * heights

        wx, wy, ww, wh = self.bbox_weights
        dx = deltas[:, 0::4] / wx
        dy = deltas[:, 1::4] / wy
        dw = deltas[:, 2::4] / ww
        dh = deltas[:, 3::4] / wh

        # Prevent sending too large values into torch.exp()
        dw = torch.clamp(dw, max=self.scale_clamp)
        dh = torch.clamp(dh, max=self.scale_clamp)

        pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
        pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
        pred_w = torch.exp(dw) * widths[:, None]
        pred_h = torch.exp(dh) * heights[:, None]

        pred_boxes = torch.zeros_like(deltas)
        pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w  # x1
        pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h  # y1
        pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w  # x2
        pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h  # y2
        pdb.set_trace()
        return pred_boxes # [200, 4]
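A worked single-box check of this transform (hypothetical numbers; a standalone reimplementation of the same arithmetic):

import math

# proposal box (0, 0, 100, 100): center (50, 50), width 100, height 100
x1, y1, x2, y2 = 0.0, 0.0, 100.0, 100.0
dx, dy, dw, dh = 2.0, 0.0, math.log(2), 0.0   # deltas
wx, wy, ww, wh = 2.0, 2.0, 1.0, 1.0           # bbox_weights as in RCNNHead

w, h = x2 - x1, y2 - y1                        # 100, 100
cx, cy = x1 + 0.5 * w, y1 + 0.5 * h            # 50, 50
pcx = dx / wx * w + cx                         # 1.0 * 100 + 50 = 150
pcy = dy / wy * h + cy                         # 50
pw, ph = math.exp(dw / ww) * w, math.exp(dh / wh) * h  # 200, 100
print(pcx - 0.5 * pw, pcy - 0.5 * ph, pcx + 0.5 * pw, pcy + 0.5 * ph)
# -> 50.0 0.0 250.0 100.0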

5.3.3 DynamicConv

class DynamicConv(nn.Module):

    def __init__(self, cfg):
        super().__init__()

        self.hidden_dim = cfg.MODEL.SparseRCNN.HIDDEN_DIM # 256
        self.dim_dynamic = cfg.MODEL.SparseRCNN.DIM_DYNAMIC # 64
        self.num_dynamic = cfg.MODEL.SparseRCNN.NUM_DYNAMIC # 2
        self.num_params = self.hidden_dim * self.dim_dynamic # 256 x 64
        self.dynamic_layer = nn.Linear(self.hidden_dim, self.num_dynamic * self.num_params) # [256, 2 x 256 x 64]

        self.norm1 = nn.LayerNorm(self.dim_dynamic) # 64
        self.norm2 = nn.LayerNorm(self.hidden_dim) # 256

        self.activation = nn.ReLU(inplace=True)

        pooler_resolution = cfg.MODEL.ROI_BOX_HEAD.POOLER_RESOLUTION # 7
        num_output = self.hidden_dim * pooler_resolution ** 2 # 256 x 7 x 7
        self.out_layer = nn.Linear(num_output, self.hidden_dim) # [256 x 7x7, 256]
        self.norm3 = nn.LayerNorm(self.hidden_dim)
        #pdb.set_trace()

    def forward(self, pro_features, roi_features):
        '''
        pro_features: (1,  N * nr_boxes, self.d_model)
        roi_features: (49, N * nr_boxes, self.d_model)
        '''

        pdb.set_trace()
        features = roi_features.permute(1, 0, 2) # [N * nr_boxes, 49, self.d_model]
        # dynamic_layer: linear[256, 2 x 256 x 64]
        parameters = self.dynamic_layer(pro_features).permute(1, 0, 2) # [N * nr_boxes, 1, 2 x 256 x 64]
        # self.num_params = 256 x 64; param1 and param2 are each half of parameters
        param1 = parameters[:, :, :self.num_params].view(-1, self.hidden_dim, self.dim_dynamic) # [200, 256, 64]
        param2 = parameters[:, :, self.num_params:].view(-1, self.dim_dynamic, self.hidden_dim) # [200, 64, 256]

        features = torch.bmm(features, param1) # [200, 49, 64]
        features = self.norm1(features)
        features = self.activation(features)
 
        features = torch.bmm(features, param2) # [200, 49, 256]
        features = self.norm2(features)
        features = self.activation(features)

        features = features.flatten(1) # [200, 49 x 256]
        features = self.out_layer(features) # [200, 256]
        features = self.norm3(features)
        features = self.activation(features)
        pdb.set_trace()
        return features

5.4 Sparse R-CNN head simple structure

The head loops six times in total; the structure of only one iteration is listed here.

(Pdb) rcnn_head
RCNNHead(
  (self_attn): MultiheadAttention(
    (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
  )
  (inst_interact): DynamicConv(
    (dynamic_layer): Linear(in_features=256, out_features=32768, bias=True)
    (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (activation): ReLU(inplace=True)
    (out_layer): Linear(in_features=12544, out_features=256, bias=True)
    (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  )
  (linear1): Linear(in_features=256, out_features=2048, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (linear2): Linear(in_features=2048, out_features=256, bias=True)
  (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.0, inplace=False)
  (dropout2): Dropout(p=0.0, inplace=False)
  (dropout3): Dropout(p=0.0, inplace=False)
  (cls_module): ModuleList(
    (0): Linear(in_features=256, out_features=256, bias=False)
    (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (2): ReLU(inplace=True)
  )
  (reg_module): ModuleList(
    (0): Linear(in_features=256, out_features=256, bias=False)
    (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (2): ReLU(inplace=True)
    (3): Linear(in_features=256, out_features=256, bias=False)
    (4): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=256, out_features=256, bias=False)
    (7): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (8): ReLU(inplace=True)
  )
  (class_logits): Linear(in_features=256, out_features=80, bias=True)
  (bboxes_delta): Linear(in_features=256, out_features=4, bias=True)
)

6. References

https://zhuanlan.zhihu.com/p/310058362


Reposted from blog.csdn.net/weixin_43823854/article/details/111561246