[DETR] Object Detection with Transformers: Code Notes

End-to-End Object Detection with Transformers

If this helps you, please leave a like~

论文:https://arxiv.org/pdf/2005.12872.pdf
代码:https://github.com/facebookresearch/detr

I. Transformer

The transformer is a classic NLP paper. It abandons the recurrent networks (RNNs) used before it and builds a network composed almost entirely of attention layers.
论文:https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Two great write-ups that give non-NLP readers a quick introduction to the transformer:
link1:https://zhuanlan.zhihu.com/p/150635505
link2:http://jalammar.github.io/illustrated-transformer/

1. Network structure

[Figure: the transformer encoder-decoder architecture]

2. Scaled dot-product attention and multi-head attention

[Figures: scaled dot-product attention and multi-head attention diagrams, from the paper]

3. The Attention(Q, K, V) formula

Q, K, V go through the scaled dot-product attention computation below; note that K and V have matching sizes (one key per value):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
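
A minimal PyTorch sketch of this formula (for illustration only; the DETR code itself goes through nn.MultiheadAttention):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v]; K and V share the same length n_k
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # [n_q, n_k]
    weights = F.softmax(scores, dim=-1)                       # each row sums to 1
    return weights @ V                                        # [n_q, d_v]

Q, K, V = torch.rand(100, 64), torch.rand(768, 64), torch.rand(768, 64)
out = scaled_dot_product_attention(Q, K, V)  # [100, 64]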

4. The MultiHead(Q, K, V) formula

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

5. Position Embedding

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
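
A short sketch of this sinusoidal encoding (the 1D version from the transformer paper; DETR itself uses a 2D variant, PositionEmbeddingSine, built from the same sin/cos pattern):

import math
import torch

def sinusoidal_position_embedding(num_pos, d_model):
    pos = torch.arange(num_pos, dtype=torch.float).unsqueeze(1)   # [num_pos, 1]
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))             # 1 / 10000^(2i/d_model)
    pe = torch.zeros(num_pos, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

pe = sinusoidal_position_embedding(768, 256)  # one 256-d vector per position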

II. DETR

1. Motivation

This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection.
Previous attempts either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks.
This paper aims to bridge this gap.

2. Network structure

For the concrete implementation, see the corresponding code in Part III, which carries fairly detailed comments.
Note: the 24 and 32 in the figure are the input image's [H, W, 3] height and width downscaled by a factor of 32, i.e. H/32 and W/32.

[Figures: the DETR pipeline: CNN backbone, transformer encoder-decoder, and FFN prediction heads, with tensor shapes annotated]
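
As a sanity check, here is the shape flow for the batch used in the pdb traces of Part III (2 images padded to 768x1024):

# input images (NestedTensor)  [2, 3, 768, 1024]
# backbone res5 feature (src)  [2, 2048, 24, 32]   # H/32 = 24, W/32 = 32
# padding mask                 [2, 24, 32]
# input_proj 1x1 conv          [2, 256, 24, 32]
# flattened encoder sequence   [768, 2, 256]       # 24 * 32 = 768 tokens
# decoder output hs            [6, 2, 100, 256]    # 6 decoder layers, 100 queries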

A mask tensor [N, 24, 32] (N = batch size) is also introduced to constrain the pos embedding precisely; it contains 1 on pixels that were added due to padding when batching images of different sizes, and 0 otherwise.
The decoder stacks 6 such multi-head attention layers. Every layer contributes to the aux loss (auxiliary losses), but only the last layer's output [100, N, 256] is used to compute the bbox and label predictions.
tgt and tgt2 are both derived from the object queries; tgt2 is simply the layer-normalized tgt.

In the decoder's second (cross-) attention:
q = tgt2 + query_pos
k = memory + pos
v = memory (the encoder output, without positional encoding)
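
From memory, this corresponds roughly to the cross-attention call in TransformerDecoderLayer (models/transformer.py, pre-norm variant):

tgt2 = self.norm2(tgt)  # tgt2 is just the layer-normalized tgt
tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),  # q = tgt2 + query_pos
                           key=self.with_pos_embed(memory, pos),        # k = memory + pos
                           value=memory,                                # v = encoder memory, no pos added
                           attn_mask=memory_mask,
                           key_padding_mask=memory_key_padding_mask)[0]
tgt = tgt + self.dropout2(tgt2)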

3. HungarianMatcher

[Figure: bipartite matching between the 100 predictions and the ground-truth boxes]

① compute match cost

[Figure: the matching cost formula from the paper]
First, the match cost is used to obtain indices, i.e. the optimal one-to-one assignment between predictions and targets.
The cost has three terms: an L1 box cost, -GIoU (not 1 - GIoU), and -prob[target class] (not 1 - prob).
This part is treated as a constant: it does not participate in gradient computation and is only used to build the cost matrix and derive the indices, as the sketch below shows.
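
The core of HungarianMatcher.forward (models/matcher.py), lightly abridged and annotated:

out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [bs * 100, 92]
out_bbox = outputs["pred_boxes"].flatten(0, 1)               # [bs * 100, 4]
tgt_ids = torch.cat([v["labels"] for v in targets])          # e.g. 9 labels
tgt_bbox = torch.cat([v["boxes"] for v in targets])          # e.g. [9, 4]

cost_class = -out_prob[:, tgt_ids]                # -prob[target class], not 1 - prob
cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)  # pairwise L1 distance
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox),
                                 box_cxcywh_to_xyxy(tgt_bbox))  # -GIoU, not 1 - GIoU

C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
C = C.view(bs, num_queries, -1).cpu()             # cost matrix, detached from the graph
sizes = [len(v["boxes"]) for v in targets]
indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]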

② compute Hungarian loss

[Figure: the Hungarian loss formula from the paper]

In the second step, three losses are again computed: a cross-entropy on the classes (which differs from the matching cost; the paper mentions this as well), plus the L1 loss and the GIoU loss on the boxes. Both box losses are normalized by the total number of target boxes (num_boxes); see loss_boxes below.
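
The individual terms are then combined with the weights in weight_dict (loss_ce 1, loss_bbox 5, loss_giou 2, and the same triple again for each aux decoder layer); in engine.py this is roughly:

loss_dict = criterion(outputs, targets)
weight_dict = criterion.weight_dict
losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)
# class_error and cardinality_error have no entry in weight_dict: they are logged but not optimized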

③ Core code of the match and loss parts

[Screenshots: matcher code and loss code; the full annotated source follows in Part III]

III. Core code

1. detr/models/detr.py

Code:

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR model and criterion classes.
"""
import torch
import torch.nn.functional as F
from torch import nn

from util import box_ops
from util.misc import (NestedTensor, nested_tensor_from_tensor_list,
                       accuracy, get_world_size, interpolate,
                       is_dist_avail_and_initialized)

from .backbone import build_backbone
from .matcher import build_matcher
from .segmentation import (DETRsegm, PostProcessPanoptic, PostProcessSegm,
                           dice_loss, sigmoid_focal_loss)
from .transformer import build_transformer
import pdb

class DETR(nn.Module):
    """ This is the DETR module that performs object detection """
    def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
        """ Initializes the model.
        Parameters:
            backbone: torch module of the backbone to be used. See backbone.py
            transformer: torch module of the transformer architecture. See transformer.py
            num_classes: number of object classes
            num_queries: number of object queries, ie detection slot. This is the maximal number of objects
                         DETR can detect in a single image. For COCO, we recommend 100 queries.
            aux_loss: True if auxiliary decoding losses (loss at each decoder layer) are to be used.
        """

        super().__init__()
        self.num_queries = num_queries # 100
        self.transformer = transformer
        hidden_dim = transformer.d_model # 256
        
        # class and Bounding Box
        self.class_embed = nn.Linear(hidden_dim, num_classes + 1) # [256, 92] 
        self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3) # 3 layers  the last layer : [256, 4] 
        '''
        (Pdb)  self.bbox_embed
        MLP(
          (layers): ModuleList(
            (0): Linear(in_features=256, out_features=256, bias=True)
            (1): Linear(in_features=256, out_features=256, bias=True)
            (2): Linear(in_features=256, out_features=4, bias=True)
          )
        )
        '''
        # embedding
        self.query_embed = nn.Embedding(num_queries, hidden_dim) # [100, 256]

        # in paper: a 1x1 convolution reduces the channel dimension of the high-level activation map f from C to a smaller dimension d
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1) # 2048 --> 256
        self.backbone = backbone
        self.aux_loss = aux_loss # True
        pdb.set_trace()
    def forward(self, samples: NestedTensor):
        """ The forward expects a NestedTensor, which consists of:
               - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
               - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels

            It returns a dict with the following elements:
               - "pred_logits": the classification logits (including no-object) for all queries.
                                Shape= [batch_size x num_queries x (num_classes + 1)] # [N, 100, 91]
               - "pred_boxes": The normalized boxes coordinates for all queries, represented as
                               (center_x, center_y, height, width). These values are normalized in [0, 1],
                               relative to the size of each individual image (disregarding possible padding). # [x, y, h, w]
                               See PostProcess for information on how to retrieve the unnormalized bounding box.
               - "aux_outputs": Optional, only returned when auxilary losses are activated. It is a list of
                                dictionnaries containing the two above keys for each decoder layer.
        """
        
        # execution enters here first (pdb checkpoint)
        pdb.set_trace() 
        # type(samples) NestedTensor [2, 3, 768, 1024]
        if isinstance(samples, (list, torch.Tensor)):   # False
          # github issue https://github.com/facebookresearch/detr/issues/133
          # This is not a segmentation mask, this is a padding mask 
          # which contains 1 on pixels that were added due to padding when batching images of different sizes, and 0 otherwise.
            samples = nested_tensor_from_tensor_list(samples)
        
        # 1.forward backbone
        features, pos = self.backbone(samples)

        src, mask = features[-1].decompose() #  src [2, 2048, 24, 32] mask [2, 24, 32]
        pdb.set_trace()
        assert mask is not None

        # 2.forward transformer
        # self.input_proj(src) is  1x1 conv  channel 2048 to 256
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0] # [6,2,100,256]

        # 3.forward class and bbox
        outputs_class = self.class_embed(hs) # [6, 2, 100, 92]
        outputs_coord = self.bbox_embed(hs).sigmoid() # torch.Size([6, 2, 100, 4])
        out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]} # [2, 100, 92], [2, 100, 4]

        # aux_loss
        if self.aux_loss: # True
            out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
        pdb.set_trace()
        return out # dict

    @torch.jit.unused
    def _set_aux_loss(self, outputs_class, outputs_coord):
        # this is a workaround to make torchscript happy, as torchscript
        # doesn't support dictionary with non-homogeneous values, such
        # as a dict having both a Tensor and a list.
        return [{'pred_logits': a, 'pred_boxes': b}
                for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]


class SetCriterion(nn.Module):
    """ This class computes the loss for DETR.
    The process happens in two steps:
        1) we compute hungarian assignment between ground truth boxes and the outputs of the model
        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
    """
    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
        """ Create the criterion.
        Parameters:
            num_classes: number of object categories, omitting the special no-object category 
            matcher: module able to compute a matching between targets and proposals
            weight_dict: dict containing as key the names of the losses and as values their relative weight.
            eos_coef: relative classification weight applied to the no-object category
            losses: list of all the losses to be applied. See get_loss for list of available losses.
        """
        super().__init__()
        self.num_classes = num_classes # 91
        self.matcher = matcher
        self.weight_dict = weight_dict
        self.eos_coef = eos_coef # 0.1
        self.losses = losses # ['labels', 'boxes', 'cardinality']
        empty_weight = torch.ones(self.num_classes + 1)# [92]
        empty_weight[-1] = self.eos_coef # empty_weight[91] = 0.1 , others = 1
        self.register_buffer('empty_weight', empty_weight)
        #pdb.set_trace()

    def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
        """Classification loss (NLL)
        targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
        """
        # indices [(tensor([ 2,  4, 31, 42, 83, 84, 89, 91]), tensor([7, 1, 6, 5, 3, 4, 2, 0])), (tensor([54]), tensor([0]))]
        # indices[i][0] is the pred idx, indices[i][1] the gt idx; they are used to index the gt labels below
        # targets[0] 'labels': tensor([18,  1,  1, 15, 27, 44, 84, 27]), targets[1] 'labels': tensor([6])
        assert 'pred_logits' in outputs
        src_logits = outputs['pred_logits'] # [2,100,92]

        # get batch_idx and src_idx: idx[0] says which image in the batch each matched gt box belongs to, idx[1] is its index among the 100 queries
        idx = self._get_src_permutation_idx(indices) 
        # idx (tensor([0, 0, 0, 0, 0, 0, 0, 0, 1]), tensor([ 2,  4, 31, 42, 83, 84, 89, 91, 54]))

        target_classes_o = torch.cat([t["labels"][J] for t, (_, J) in zip(targets, indices)]) 
        # target_classes_o : tensor([27,  1, 84, 44, 15, 27,  1, 18,  6], device='cuda:0')
        '''
        equivalent loop:
        temp = []
        for t, (_, J) in zip(targets, indices):  # t is the target dict of one img (batch_size = 2); (_, J) is the tuple indices[i]
            temp.append(t["labels"][J])  # reorder the gt labels in targets['labels'] according to indices
        target_classes_o = torch.cat(temp)
        '''

        target_classes = torch.full(src_logits.shape[:2], self.num_classes,
                                    dtype=torch.int64, device=src_logits.device) # [2, 100], every entry preset to self.num_classes, e.g. 91 (no-object)

        # target_classes[idx]: the positions [0, 2], [0, 4], ..., [1, 54] are assigned the values of target_classes_o
        target_classes[idx] = target_classes_o
        # This criterion combines log_softmax and nll_loss in a single function.
        loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight) 
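        # note: empty_weight gives the no-object class (index 91) weight eos_coef = 0.1
        # and all real classes weight 1, so the many unmatched queries (all labelled
        # no-object) do not dominate the classification loss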
        losses = {'loss_ce': loss_ce}

        # class_error below is a logging-only metric
        if log:
            # TODO this should probably be a separate loss, not hacked in this one here
            losses['class_error'] = 100 - accuracy(src_logits[idx], target_classes_o)[0]
        pdb.set_trace()
        return losses

    @torch.no_grad()
    def loss_cardinality(self, outputs, targets, indices, num_boxes):
        """ Compute the cardinality error, ie the absolute error in the number of predicted non-empty boxes
        This is not really a loss, it is intended for logging purposes only. It doesn't propagate gradients
        """
        pred_logits = outputs['pred_logits'] # [2, 100, 92]
        device = pred_logits.device
        tgt_lengths = torch.as_tensor([len(v["labels"]) for v in targets], device=device) # torch.Size([2]), value tensor([8, 1]) ([8, 1] is the value, not the shape; the shape is [2]; same below)

        # Count the number of predictions that are NOT "no-object" (which is the last class)
        card_pred = (pred_logits.argmax(-1) != pred_logits.shape[-1] - 1).sum(1) # torch.Size([2]); per-image count of queries whose argmax is not no-object, e.g. value [100, 100]
        # cardinality_error: |per-image count of pred boxes classified as a real class (anything but the last class) - per-image number of gt boxes|
        card_err = F.l1_loss(card_pred.float(), tgt_lengths.float())
        losses = {'cardinality_error': card_err}
        pdb.set_trace()
        return losses

    def loss_boxes(self, outputs, targets, indices, num_boxes):
        """Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
           targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
           The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
        """
        assert 'pred_boxes' in outputs
        idx = self._get_src_permutation_idx(indices)
        src_boxes = outputs['pred_boxes'][idx] # [9, 4]
        target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0) # [9, 4]

        # l1 loss
        loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction='none')

        losses = {}
        losses['loss_bbox'] = loss_bbox.sum() / num_boxes

        # giou loss = 1 - GIOU
        loss_giou = 1 - torch.diag(box_ops.generalized_box_iou(
            box_ops.box_cxcywh_to_xyxy(src_boxes),
            box_ops.box_cxcywh_to_xyxy(target_boxes)))
        losses['loss_giou'] = loss_giou.sum() / num_boxes
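        # generalized_box_iou returns the full pairwise matrix (e.g. [9, 9]); torch.diag
        # keeps only the matched (src_i, tgt_i) pairs on the diagonal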
        pdb.set_trace()

        return losses

    def loss_masks(self, outputs, targets, indices, num_boxes):
        """Compute the losses related to the masks: the focal loss and the dice loss.
           targets dicts must contain the key "masks" containing a tensor of dim [nb_target_boxes, h, w]
        """
        assert "pred_masks" in outputs

        src_idx = self._get_src_permutation_idx(indices)
        tgt_idx = self._get_tgt_permutation_idx(indices)
        src_masks = outputs["pred_masks"]
        src_masks = src_masks[src_idx]
        masks = [t["masks"] for t in targets]
        # TODO use valid to mask invalid areas due to padding in loss
        target_masks, valid = nested_tensor_from_tensor_list(masks).decompose()
        target_masks = target_masks.to(src_masks)
        target_masks = target_masks[tgt_idx]

        # upsample predictions to the target size
        src_masks = interpolate(src_masks[:, None], size=target_masks.shape[-2:],
                                mode="bilinear", align_corners=False)
        src_masks = src_masks[:, 0].flatten(1)

        target_masks = target_masks.flatten(1)
        target_masks = target_masks.view(src_masks.shape)
        losses = {
            "loss_mask": sigmoid_focal_loss(src_masks, target_masks, num_boxes),
            "loss_dice": dice_loss(src_masks, target_masks, num_boxes),
        }
        pdb.set_trace()

        return losses

    def _get_src_permutation_idx(self, indices):
        '''
        indices:
        eg.
        [(tensor([ 2,  4, 31, 42, 83, 84, 89, 91]), tensor([7, 1, 6, 5, 3, 4, 2, 0])), (tensor([54]), tensor([0]))]
        '''
        # permute predictions following indices
        batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)]) # tensor([0, 0, 0, 0, 0, 0, 0, 0, 1]): the first img has 8 matched gt boxes, the second has 1

        src_idx = torch.cat([src for (src, _) in indices]) #  tensor([ 2,  4, 31, 42, 83, 84, 89, 91, 54])

        pdb.set_trace()

        return batch_idx, src_idx

    def _get_tgt_permutation_idx(self, indices):
        # permute targets following indices
        batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])
        tgt_idx = torch.cat([tgt for (_, tgt) in indices])
        pdb.set_trace()

        return batch_idx, tgt_idx

    def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
        loss_map = {
            'labels': self.loss_labels,
            'cardinality': self.loss_cardinality,
            'boxes': self.loss_boxes,
            'masks': self.loss_masks
        }
        assert loss in loss_map, f'do you really want to compute {loss} loss?'

        pdb.set_trace()
        return loss_map[loss](outputs, targets, indices, num_boxes, **kwargs)

    def forward(self, outputs, targets):
        """ This performs the loss computation.
        Parameters:
             outputs: dict of tensors, see the output specification of the model for the format.
                      Here: dict_keys(['pred_logits', 'pred_boxes', 'aux_outputs'])
                      'pred_logits': torch.Size([2, 100, 92])
                      'pred_boxes': torch.Size([2, 100, 4])
                      'aux_outputs': len = 5; each element has dict_keys(['pred_logits', 'pred_boxes']),
                                     the outputs of the 5 decoder layers before the last one.
             
             targets: list of dicts, such that len(targets) == batch_size.
                      The expected keys in each dict depends on the losses applied, see each loss' doc
        """ 
        outputs_without_aux = {k: v for k, v in outputs.items() if k != 'aux_outputs'}  # len = 2

        # Retrieve the matching between the outputs of the last layer and the targets
        # forward matcher
        indices = self.matcher(outputs_without_aux, targets)
        pdb.set_trace()

        # Compute the average number of target boxes accross all nodes, for normalization purposes
        num_boxes = sum(len(t["labels"]) for t in targets) # 9
        num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)  # tensor[9.]
        if is_dist_avail_and_initialized(): # True
            torch.distributed.all_reduce(num_boxes) # None
        num_boxes = torch.clamp(num_boxes / get_world_size(), min=1).item() # clamped to >= 1, e.g. 9.0
        
        pdb.set_trace()

        # Compute all the requested losses
        losses = {}
        for loss in self.losses: # ['labels', 'boxes', 'cardinality']
            losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))
        pdb.set_trace()

        # In case of auxiliary losses, we repeat this process with the output of each intermediate layer.
        if 'aux_outputs' in outputs:
            for i, aux_outputs in enumerate(outputs['aux_outputs']):
                indices = self.matcher(aux_outputs, targets)
                for loss in self.losses:
                    if loss == 'masks':
                        # Intermediate masks losses are too costly to compute, we ignore them.
                        continue
                    kwargs = {}
                    if loss == 'labels':
                        # Logging is enabled only for the last layer
                        kwargs = {'log': False}
                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_boxes, **kwargs)
                    l_dict = {k + f'_{i}': v for k, v in l_dict.items()}
                    losses.update(l_dict)
                    pdb.set_trace()
        pdb.set_trace()
        return losses


class PostProcess(nn.Module):
    """ This module converts the model's output into the format expected by the coco api"""
    @torch.no_grad()
    def forward(self, outputs, target_sizes):
        """ Perform the computation
        Parameters:
            outputs: raw outputs of the model
            target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
                          For evaluation, this must be the original image size (before any data augmentation)
                          For visualization, this should be the image size after data augment, but before padding
        """
        out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']

        assert len(out_logits) == len(target_sizes)
        assert target_sizes.shape[1] == 2

        prob = F.softmax(out_logits, -1)
        scores, labels = prob[..., :-1].max(-1)
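        # prob[..., :-1] drops the last (no-object) column before the max, so every
        # query still reports its best real class; low scores can be thresholded downstream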

        # convert to [x0, y0, x1, y1] format
        boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
        # and from relative [0, 1] to absolute [0, height] coordinates
        img_h, img_w = target_sizes.unbind(1)
        scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
        boxes = boxes * scale_fct[:, None, :]

        results = [{'scores': s, 'labels': l, 'boxes': b} for s, l, b in zip(scores, labels, boxes)]
        pdb.set_trace()
        return results


class MLP(nn.Module): # Feed Forward Network
    """ Very simple multi-layer perceptron (also called FFN)"""

    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.num_layers = num_layers # 3
        h = [hidden_dim] * (num_layers - 1) # [256, 256]
        # [input_dim] + h : [256, 256, 256] 
        # h + [output_dim] : [256, 256, 4]
        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))
        '''
        
        self.layers:
            ModuleList(
              (0): Linear(in_features=256, out_features=256, bias=True)
              (1): Linear(in_features=256, out_features=256, bias=True)
              (2): Linear(in_features=256, out_features=4, bias=True)
            )
        '''

    def forward(self, x): # x: [6, 2, 100, 256] -> [6, 2, 100, 4]
        for i, layer in enumerate(self.layers):
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        pdb.set_trace()
        return x


def build(args):
    # the `num_classes` naming here is somewhat misleading.
    # it indeed corresponds to `max_obj_id + 1`, where max_obj_id
    # is the maximum id for a class in your dataset. For example,
    # COCO has a max_obj_id of 90, so we pass `num_classes` to be 91.
    # As another example, for a dataset that has a single class with id 1,
    # you should pass `num_classes` to be 2 (max_obj_id + 1).
    # For more details on this, check the following discussion
    # https://github.com/facebookresearch/detr/issues/108#issuecomment-650269223
    num_classes = 20 if args.dataset_file != 'coco' else 91 # 91, not 81 (max COCO class id is 90)
    if args.dataset_file == "coco_panoptic":
        # for panoptic, we just add a num_classes that is large enough to hold
        # max_obj_id + 1, but the exact value doesn't really matter
        num_classes = 250
    device = torch.device(args.device)
    
    # 1. build backbone entry
    backbone = build_backbone(args) 

    # 2. build transformer entry
    transformer = build_transformer(args)
    # 3. build DETR entry   integrate backbone and transformer     
    model = DETR( 
        backbone,
        transformer,
        num_classes=num_classes,
        num_queries=args.num_queries,
        aux_loss=args.aux_loss,
    )
    if args.masks:
        model = DETRsegm(model, freeze_detr=(args.frozen_weights is not None))

    # 4. build matcher 
    matcher = build_matcher(args) 

    weight_dict = {'loss_ce': 1, 'loss_bbox': args.bbox_loss_coef} # {'loss_ce': 1, 'loss_bbox': 5}
    weight_dict['loss_giou'] = args.giou_loss_coef # 2
    if args.masks:
        weight_dict["loss_mask"] = args.mask_loss_coef
        weight_dict["loss_dice"] = args.dice_loss_coef
    # TODO this is a hack
    if args.aux_loss: # True
        aux_weight_dict = {}
        for i in range(args.dec_layers - 1): # 5
            aux_weight_dict.update({k + f'_{i}': v for k, v in weight_dict.items()})
        weight_dict.update(aux_weight_dict)

    losses = ['labels', 'boxes', 'cardinality']
    if args.masks: # False
        losses += ["masks"]

    # 5. build criterion
    criterion = SetCriterion(num_classes, matcher=matcher, weight_dict=weight_dict,
                             eos_coef=args.eos_coef, losses=losses)
    criterion.to(device)

    postprocessors = {'bbox': PostProcess()}
    if args.masks:
        postprocessors['segm'] = PostProcessSegm()
        if args.dataset_file == "coco_panoptic":
            is_thing_map = {i: i <= 90 for i in range(201)}
            postprocessors["panoptic"] = PostProcessPanoptic(is_thing_map, threshold=0.85)
    #pdb.set_trace()
    return model, criterion, postprocessors


'''
{
  'loss_ce': 1, 'loss_bbox': 5, 'loss_giou': 2, 
  'loss_ce_0': 1, 'loss_bbox_0': 5, 'loss_giou_0': 2, 
  'loss_ce_1': 1, 'loss_bbox_1': 5, 'loss_giou_1': 2, 
  'loss_ce_2': 1, 'loss_bbox_2': 5, 'loss_giou_2': 2, 
  'loss_ce_3': 1, 'loss_bbox_3': 5, 'loss_giou_3': 2, 
  'loss_ce_4': 1, 'loss_bbox_4': 5, 'loss_giou_4': 2
}


Detr(
  (detr): DETR(
    (transformer): Transformer(
      (encoder): TransformerEncoder(
        (layers): ModuleList(
          (0): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (1): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (2): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (3): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (4): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (5): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (decoder): TransformerDecoder(
        (layers): ModuleList(
          (0): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (1): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (2): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (3): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (4): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (5): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
        )
        (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (class_embed): Linear(in_features=256, out_features=81, bias=True)
    (bbox_embed): MLP(
      (layers): ModuleList(
        (0): Linear(in_features=256, out_features=256, bias=True)
        (1): Linear(in_features=256, out_features=256, bias=True)
        (2): Linear(in_features=256, out_features=4, bias=True)
      )
    )
    (query_embed): Embedding(100, 256)
    (input_proj): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
    (backbone): Joiner(
      (0): MaskedBackbone(
        (backbone): ResNet(
          (stem): BasicStem(
            (conv1): Conv2d(
              3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
              (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
            )
          )
          (res2): Sequential(
            (0): BottleneckBlock(
              (shortcut): Conv2d(
                64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv1): Conv2d(
                64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv2): Conv2d(
                64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv3): Conv2d(
                64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
            )
            (1): BottleneckBlock(
              (conv1): Conv2d(
                256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv2): Conv2d(
                64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv3): Conv2d(
                64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
            )
            (2): BottleneckBlock(
              (conv1): Conv2d(
                256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv2): Conv2d(
                64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv3): Conv2d(
                64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
            )
          )
          (res3): Sequential(
            (0): BottleneckBlock(
              (shortcut): Conv2d(
                256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv1): Conv2d(
                256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv2): Conv2d(
                128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv3): Conv2d(
                128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
            )
            (1): BottleneckBlock(
              (conv1): Conv2d(
                512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv2): Conv2d(
                128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv3): Conv2d(
                128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
            )
            (2): BottleneckBlock(
              (conv1): Conv2d(
                512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv2): Conv2d(
                128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv3): Conv2d(
                128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
            )
            (3): BottleneckBlock(
              (conv1): Conv2d(
                512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv2): Conv2d(
                128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv3): Conv2d(
                128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
            )
          )
          (res4): Sequential(
            (0): BottleneckBlock(
              (shortcut): Conv2d(
                512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
              (conv1): Conv2d(
                512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (1): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (2): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (3): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (4): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (5): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
          )
          (res5): Sequential(
            (0): BottleneckBlock(
              (shortcut): Conv2d(
                1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False
                (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
              )
              (conv1): Conv2d(
                1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv2): Conv2d(
                512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv3): Conv2d(
                512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
              )
            )
            (1): BottleneckBlock(
              (conv1): Conv2d(
                2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv2): Conv2d(
                512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv3): Conv2d(
                512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
              )
            )
            (2): BottleneckBlock(
              (conv1): Conv2d(
                2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv2): Conv2d(
                512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv3): Conv2d(
                512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
              )
            )
          )
        )
      )
      (1): PositionEmbeddingSine()
    )
  )
  (criterion): SetCriterion(
    (matcher): HungarianMatcher()
  )
)

'''

'''
loss_labels (pdb session)

(Pdb) t
{'boxes': tensor([[0.3722, 0.6666, 0.2496, 0.1161],
        [0.5426, 0.4170, 0.3599, 0.7220],
        [0.2823, 0.4324, 0.4917, 0.2580],
        [0.4459, 0.6099, 0.6642, 0.2646],
        [0.6796, 0.6822, 0.1735, 0.2128],
        [0.3850, 0.6643, 0.0392, 0.1714],
        [0.5527, 0.3286, 0.0334, 0.0536],
        [0.4618, 0.4616, 0.0716, 0.0671]], device='cuda:0'), 'labels': tensor([18,  1,  1, 15, 27, 44, 84, 27], device='cuda:0'), 'image_id': tensor([151988], device='cuda:0'), 'area': tensor([16129.6201, 81465.6953, 46954.9570, 48422.0742, 22253.7461,  4561.6089,
          818.3591,  2781.5159], device='cuda:0'), 'iscrowd': tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'), 'orig_size': tensor([480, 640], device='cuda:0'), 'size': tensor([ 768, 1024], device='cuda:0')}
(Pdb) p (_,J)
(tensor([ 2,  4, 31, 42, 83, 84, 89, 91]), tensor([7, 1, 6, 5, 3, 4, 2, 0]))
(Pdb) indices
[(tensor([ 2,  4, 31, 42, 83, 84, 89, 91]), tensor([7, 1, 6, 5, 3, 4, 2, 0])), (tensor([54]), tensor([0]))]
(Pdb) t["labels"][J]
tensor([27,  1, 84, 44, 15, 27,  1, 18], device='cuda:0')
(Pdb) p J
tensor([7, 1, 6, 5, 3, 4, 2, 0])
(Pdb) t['labels']
tensor([18,  1,  1, 15, 27, 44, 84, 27], device='cuda:0')

'''

'''
The total loss dict after all the losses.update(...) calls:
{'loss_ce': tensor(4.3233, device='cuda:0', grad_fn=<NllLoss2DBackward>), 
'class_error': tensor(100., device='cuda:0'), 
'loss_bbox': tensor(0.7628, device='cuda:0', grad_fn=<DivBackward0>), 
'loss_giou': tensor(0.8565, device='cuda:0', grad_fn=<DivBackward0>), 
'cardinality_error': tensor(95.5000, device='cuda:0'), 

'loss_ce_0': tensor(4.3083, device='cuda:0', grad_fn=<NllLoss2DBackward>), 
'loss_bbox_0': tensor(0.7633, device='cuda:0', grad_fn=<DivBackward0>), 
'loss_giou_0': tensor(0.8700, device='cuda:0', grad_fn=<DivBackward0>), 
'cardinality_error_0': tensor(95.5000, device='cuda:0'), 

'loss_ce_1': tensor(4.3598, device='cuda:0', grad_fn=<NllLoss2DBackward>), 
'loss_bbox_1': tensor(0.7615, device='cuda:0', grad_fn=<DivBackward0>), 
'loss_giou_1': tensor(0.8565, device='cuda:0', grad_fn=<DivBackward0>), 
'cardinality_error_1': tensor(94.5000, device='cuda:0'), 

'loss_ce_2': tensor(4.7285, device='cuda:0', grad_fn=<NllLoss2DBackward>), 
'loss_bbox_2': tensor(0.7799, device='cuda:0', grad_fn=<DivBackward0>), 
'loss_giou_2': tensor(0.8595, device='cuda:0', grad_fn=<DivBackward0>), 
'cardinality_error_2': tensor(95.5000, device='cuda:0'), 

'loss_ce_3': tensor(4.8179, device='cuda:0', grad_fn=<NllLoss2DBackward>), 
'loss_bbox_3': tensor(0.7714, device='cuda:0', grad_fn=<DivBackward0>), 
'loss_giou_3': tensor(0.8534, device='cuda:0', grad_fn=<DivBackward0>), 
'cardinality_error_3': tensor(95.5000, device='cuda:0'), 

'loss_ce_4': tensor(4.8792, device='cuda:0', grad_fn=<NllLoss2DBackward>), 
'loss_bbox_4': tensor(0.7794, device='cuda:0', grad_fn=<DivBackward0>),
 'loss_giou_4': tensor(0.8499, device='cuda:0', grad_fn=<DivBackward0>), 
 'cardinality_error_4': tensor(95.5000, device='cuda:0')}

'''


2. detr/d2/detr/detr.py

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import logging
import math
from typing import List

import numpy as np
import torch
import torch.distributed as dist
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torch import nn

from detectron2.layers import ShapeSpec
from detectron2.modeling import META_ARCH_REGISTRY, build_backbone, detector_postprocess
from detectron2.structures import Boxes, ImageList, Instances, BitMasks, PolygonMasks
from detectron2.utils.logger import log_first_n
from fvcore.nn import giou_loss, smooth_l1_loss
from models.backbone import Joiner
from models.detr import DETR, SetCriterion
from models.matcher import HungarianMatcher
from models.position_encoding import PositionEmbeddingSine
from models.transformer import Transformer
from models.segmentation import DETRsegm, PostProcessPanoptic, PostProcessSegm
from util.box_ops import box_cxcywh_to_xyxy, box_xyxy_to_cxcywh
from util.misc import NestedTensor
from datasets.coco import convert_coco_poly_to_mask
import pdb
__all__ = ["Detr"]


class MaskedBackbone(nn.Module):
    """ This is a thin wrapper around D2's backbone to provide padding masking"""

    def __init__(self, cfg):
        super().__init__()
        self.backbone = build_backbone(cfg) # build_resnet_backone
        backbone_shape = self.backbone.output_shape()
        self.feature_strides = [backbone_shape[f].stride for f in backbone_shape.keys()] # [4, 8, 16, 32]
        self.num_channels = backbone_shape[list(backbone_shape.keys())[-1]].channels  # 2048
        pdb.set_trace()
    '''
        backbone_shape
        {
            'res2': ShapeSpec(channels=256, height=None, width=None, stride=4), 
            'res3': ShapeSpec(channels=512, height=None, width=None, stride=8), 
            'res4': ShapeSpec(channels=1024, height=None, width=None, stride=16), 
            'res5': ShapeSpec(channels=2048, height=None, width=None, stride=32)
        }
    '''
    def forward(self, images):
        features = self.backbone(images.tensor)
        masks = self.mask_out_padding(
            [features_per_level.shape for features_per_level in features.values()],
            images.image_sizes,
            images.tensor.device,
        )
        assert len(features) == len(masks)
        for i, k in enumerate(features.keys()):
            features[k] = NestedTensor(features[k], masks[i])
        return features

    def mask_out_padding(self, feature_shapes, image_sizes, device):
        masks = []
        assert len(feature_shapes) == len(self.feature_strides)
        for idx, shape in enumerate(feature_shapes):
            N, _, H, W = shape
            masks_per_feature_level = torch.ones((N, H, W), dtype=torch.bool, device=device)
            for img_idx, (h, w) in enumerate(image_sizes):
                masks_per_feature_level[
                    img_idx,
                    : int(np.ceil(float(h) / self.feature_strides[idx])),
                    : int(np.ceil(float(w) / self.feature_strides[idx])),
                ] = 0
            masks.append(masks_per_feature_level)
        return masks
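
    # example with hypothetical sizes: if image 0 is 768x1024 and image 1 is 704x928,
    # batched and padded to 768x1024, the stride-32 mask has shape [2, 24, 32]; for
    # image 1 only rows 0..ceil(704/32)-1 = 0..21 and cols 0..ceil(928/32)-1 = 0..28
    # are set to False (valid), the remaining entries stay True (padding)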


@META_ARCH_REGISTRY.register()
class Detr(nn.Module):
    """
    Implement Detr
    """

    def __init__(self, cfg):
    
        super().__init__()

        self.device = torch.device(cfg.MODEL.DEVICE) # cuda
        self.num_classes = cfg.MODEL.DETR.NUM_CLASSES # 80
        self.mask_on = cfg.MODEL.MASK_ON # False 
        hidden_dim = cfg.MODEL.DETR.HIDDEN_DIM # 256
        num_queries = cfg.MODEL.DETR.NUM_OBJECT_QUERIES # 100

        # Transformer parameters:
        nheads = cfg.MODEL.DETR.NHEADS # 8
        dropout = cfg.MODEL.DETR.DROPOUT # 0.1
        dim_feedforward = cfg.MODEL.DETR.DIM_FEEDFORWARD # 2048
        enc_layers = cfg.MODEL.DETR.ENC_LAYERS # 6
        dec_layers = cfg.MODEL.DETR.DEC_LAYERS # 6
        pre_norm = cfg.MODEL.DETR.PRE_NORM # False

        # Loss parameters:
        giou_weight = cfg.MODEL.DETR.GIOU_WEIGHT # 2.0
        l1_weight = cfg.MODEL.DETR.L1_WEIGHT # 5.0
        deep_supervision = cfg.MODEL.DETR.DEEP_SUPERVISION # True; deep supervision = apply the loss to every decoder layer's output (the aux losses)
        no_object_weight = cfg.MODEL.DETR.NO_OBJECT_WEIGHT # 0.1

        N_steps = hidden_dim // 2 # 128
        d2_backbone = MaskedBackbone(cfg)
        pdb.set_trace()
        backbone = Joiner(d2_backbone, PositionEmbeddingSine(N_steps, normalize=True)) # Joiner PositionEmbeddingSine
        backbone.num_channels = d2_backbone.num_channels # 2048
        pdb.set_trace()
        
        # init transformer
        transformer = Transformer(
            d_model=hidden_dim,
            dropout=dropout,
            nhead=nheads,
            dim_feedforward=dim_feedforward,
            num_encoder_layers=enc_layers,
            num_decoder_layers=dec_layers,
            normalize_before=pre_norm,
            return_intermediate_dec=deep_supervision,
        )
        pdb.set_trace()

        # init detr/model/detr.py
        # parameters backone transformer
        self.detr = DETR( 
            backbone, transformer, num_classes=self.num_classes, num_queries=num_queries, aux_loss=deep_supervision
        )

        if self.mask_on: # False; only used for segmentation
            frozen_weights = cfg.MODEL.DETR.FROZEN_WEIGHTS
            if frozen_weights != '':
                print("LOAD pre-trained weights")
                weight = torch.load(frozen_weights, map_location=lambda storage, loc: storage)['model']
                new_weight = {}
                for k, v in weight.items():
                    if 'detr.' in k:
                        new_weight[k.replace('detr.', '')] = v
                    else:
                        print(f"Skipping loading weight {k} from frozen model")
                del weight
                self.detr.load_state_dict(new_weight)
                del new_weight
            self.detr = DETRsegm(self.detr, freeze_detr=(frozen_weights != ''))
            self.seg_postprocess = PostProcessSegm

        self.detr.to(self.device)

        pdb.set_trace()
        # building criterion
        # init HungarianMatcher with the cost weights (cost_class=1, cost_bbox=l1_weight, cost_giou=giou_weight);
        # the matcher only produces the one-to-one assignment, the actual losses are computed by SetCriterion
        matcher = HungarianMatcher(cost_class=1, cost_bbox=l1_weight, cost_giou=giou_weight)
        weight_dict = {"loss_ce": 1, "loss_bbox": l1_weight}
        weight_dict["loss_giou"] = giou_weight
        if deep_supervision:
            aux_weight_dict = {}
            for i in range(dec_layers - 1): # dec_layers = 6, so aux losses for decoder layers 0..4
                aux_weight_dict.update({k + f"_{i}": v for k, v in weight_dict.items()})
            weight_dict.update(aux_weight_dict)
        '''
        weight_dict (after merging aux_weight_dict)
        {
            'loss_ce': 1, 'loss_bbox': 5.0, 'loss_giou': 2.0,
            'loss_ce_0': 1, 'loss_bbox_0': 5.0, 'loss_giou_0': 2.0,
            'loss_ce_1': 1, 'loss_bbox_1': 5.0, 'loss_giou_1': 2.0,
            'loss_ce_2': 1, 'loss_bbox_2': 5.0, 'loss_giou_2': 2.0,
            'loss_ce_3': 1, 'loss_bbox_3': 5.0, 'loss_giou_3': 2.0,
            'loss_ce_4': 1, 'loss_bbox_4': 5.0, 'loss_giou_4': 2.0
        }
        '''
        losses = ["labels", "boxes", "cardinality"]
        if self.mask_on:
            losses += ["masks"]
        self.criterion = SetCriterion(
            self.num_classes, matcher=matcher, weight_dict=weight_dict, eos_coef=no_object_weight, losses=losses,
        )
        self.criterion.to(self.device)

        pixel_mean = torch.Tensor(cfg.MODEL.PIXEL_MEAN).to(self.device).view(3, 1, 1)
        pixel_std = torch.Tensor(cfg.MODEL.PIXEL_STD).to(self.device).view(3, 1, 1)
        self.normalizer = lambda x: (x - pixel_mean) / pixel_std
        self.to(self.device)

    def forward(self, batched_inputs):
        """
        Args:
            batched_inputs: a list, batched outputs of :class:`DatasetMapper` .
                Each item in the list contains the inputs for one image.
                For now, each item in the list is a dict that contains:

                * image: Tensor, image in (C, H, W) format.
                * instances: Instances

                Other information that's included in the original dicts, such as:

                * "height", "width" (int): the output resolution of the model, used in inference.
                  See :meth:`postprocess` for details.
        Returns:
            dict[str: Tensor]:
                mapping from a named loss to a tensor storing the loss. Used during training only.
        """
        images = self.preprocess_image(batched_inputs)
        output = self.detr(images)

        if self.training:
            gt_instances = [x["instances"].to(self.device) for x in batched_inputs]

            targets = self.prepare_targets(gt_instances)
            loss_dict = self.criterion(output, targets)
            weight_dict = self.criterion.weight_dict
            for k in loss_dict.keys():
                if k in weight_dict:
                    loss_dict[k] *= weight_dict[k]
            return loss_dict
        else:
            box_cls = output["pred_logits"]
            box_pred = output["pred_boxes"]
            mask_pred = output["pred_masks"] if self.mask_on else None
            results = self.inference(box_cls, box_pred, mask_pred, images.image_sizes)
            processed_results = []
            for results_per_image, input_per_image, image_size in zip(results, batched_inputs, images.image_sizes):
                height = input_per_image.get("height", image_size[0])
                width = input_per_image.get("width", image_size[1])
                r = detector_postprocess(results_per_image, height, width)
                processed_results.append({"instances": r})
            return processed_results
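
    # Note (illustration, not from the original file): during training,
    # detectron2's trainer simply sums the weighted dict returned above:
    #   losses = sum(loss_dict.values())
    #   losses.backward()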

    def prepare_targets(self, targets):
        new_targets = []
        for targets_per_image in targets:
            h, w = targets_per_image.image_size
            image_size_xyxy = torch.as_tensor([w, h, w, h], dtype=torch.float, device=self.device)
            gt_classes = targets_per_image.gt_classes
            gt_boxes = targets_per_image.gt_boxes.tensor / image_size_xyxy
            gt_boxes = box_xyxy_to_cxcywh(gt_boxes)
            new_targets.append({"labels": gt_classes, "boxes": gt_boxes})
            if self.mask_on and hasattr(targets_per_image, 'gt_masks'):
                gt_masks = targets_per_image.gt_masks
                gt_masks = convert_coco_poly_to_mask(gt_masks.polygons, h, w)
                new_targets[-1].update({'masks': gt_masks})
        return new_targets

    def inference(self, box_cls, box_pred, mask_pred, image_sizes):
        """
        Arguments:
            box_cls (Tensor): tensor of shape (batch_size, num_queries, K).
                The tensor predicts the classification probability for each query.
            box_pred (Tensor): tensor of shape (batch_size, num_queries, 4).
                The tensor predicts 4-vector (x, y, w, h) box
                regression values for every query.
            image_sizes (List[torch.Size]): the input image sizes

        Returns:
            results (List[Instances]): a list of #images elements.
        """
        assert len(box_cls) == len(image_sizes)
        results = []

        # For each box we assign the best class, or the second best if the best one is `no_object`.
        scores, labels = F.softmax(box_cls, dim=-1)[:, :, :-1].max(-1)
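        # Worked example (illustration only): with K = 3 classes where the last
        # one is `no_object`, a query with softmax [0.2, 0.3, 0.5] would pick
        # `no_object`; slicing [:, :, :-1] keeps [0.2, 0.3], so the query is
        # assigned class 1 with score 0.3 instead.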

        for i, (scores_per_image, labels_per_image, box_pred_per_image, image_size) in enumerate(zip(
            scores, labels, box_pred, image_sizes
        )):
            result = Instances(image_size)
            result.pred_boxes = Boxes(box_cxcywh_to_xyxy(box_pred_per_image))

            result.pred_boxes.scale(scale_x=image_size[1], scale_y=image_size[0])
            if self.mask_on:
                mask = F.interpolate(mask_pred[i].unsqueeze(0), size=image_size, mode='bilinear', align_corners=False)
                mask = mask[0].sigmoid() > 0.5
                B, N, H, W = mask_pred.shape
                mask = BitMasks(mask.cpu()).crop_and_resize(result.pred_boxes.tensor.cpu(), 32)
                result.pred_masks = mask.unsqueeze(1).to(mask_pred[0].device)

            result.scores = scores_per_image
            result.pred_classes = labels_per_image
            results.append(result)
        return results

    def preprocess_image(self, batched_inputs):
        """
        Normalize, pad and batch the input images.
        """
        images = [self.normalizer(x["image"].to(self.device)) for x in batched_inputs]
        images = ImageList.from_tensors(images)
        return images


'''
Detr(
  (detr): DETR(
    (transformer): Transformer(
      (encoder): TransformerEncoder(
        (layers): ModuleList(
          (0): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (1): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (2): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (3): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (4): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
          (5): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (decoder): TransformerDecoder(
        (layers): ModuleList(
          (0): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (1): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (2): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (3): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (4): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
          (5): TransformerDecoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (multihead_attn): MultiheadAttention(
              (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
            )
            (linear1): Linear(in_features=256, out_features=2048, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=2048, out_features=256, bias=True)
            (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
          )
        )
        (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (class_embed): Linear(in_features=256, out_features=81, bias=True)
    (bbox_embed): MLP(
      (layers): ModuleList(
        (0): Linear(in_features=256, out_features=256, bias=True)
        (1): Linear(in_features=256, out_features=256, bias=True)
        (2): Linear(in_features=256, out_features=4, bias=True)
      )
    )
    (query_embed): Embedding(100, 256)
    (input_proj): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
    (backbone): Joiner(
      (0): MaskedBackbone(
        (backbone): ResNet(
          (stem): BasicStem(
            (conv1): Conv2d(
              3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
              (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
            )
          )
          (res2): Sequential(
            (0): BottleneckBlock(
              (shortcut): Conv2d(
                64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv1): Conv2d(
                64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv2): Conv2d(
                64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv3): Conv2d(
                64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
            )
            (1): BottleneckBlock(
              (conv1): Conv2d(
                256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv2): Conv2d(
                64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv3): Conv2d(
                64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
            )
            (2): BottleneckBlock(
              (conv1): Conv2d(
                256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv2): Conv2d(
                64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
              )
              (conv3): Conv2d(
                64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
            )
          )
          (res3): Sequential(
            (0): BottleneckBlock(
              (shortcut): Conv2d(
                256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv1): Conv2d(
                256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv2): Conv2d(
                128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv3): Conv2d(
                128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
            )
            (1): BottleneckBlock(
              (conv1): Conv2d(
                512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv2): Conv2d(
                128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv3): Conv2d(
                128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
            )
            (2): BottleneckBlock(
              (conv1): Conv2d(
                512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv2): Conv2d(
                128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv3): Conv2d(
                128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
            )
            (3): BottleneckBlock(
              (conv1): Conv2d(
                512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv2): Conv2d(
                128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
              )
              (conv3): Conv2d(
                128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
            )
          )
          (res4): Sequential(
            (0): BottleneckBlock(
              (shortcut): Conv2d(
                512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
              (conv1): Conv2d(
                512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (1): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (2): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (3): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (4): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
            (5): BottleneckBlock(
              (conv1): Conv2d(
                1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv2): Conv2d(
                256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
              )
              (conv3): Conv2d(
                256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
              )
            )
          )
          (res5): Sequential(
            (0): BottleneckBlock(
              (shortcut): Conv2d(
                1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False
                (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
              )
              (conv1): Conv2d(
                1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv2): Conv2d(
                512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv3): Conv2d(
                512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
              )
            )
            (1): BottleneckBlock(
              (conv1): Conv2d(
                2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv2): Conv2d(
                512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv3): Conv2d(
                512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
              )
            )
            (2): BottleneckBlock(
              (conv1): Conv2d(
                2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv2): Conv2d(
                512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
              )
              (conv3): Conv2d(
                512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
                (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
              )
            )
          )
        )
      )
      (1): PositionEmbeddingSine()
    )
  )
  (criterion): SetCriterion(
    (matcher): HungarianMatcher()
  )
)

'''
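Before moving on: the Joiner above pairs the backbone with PositionEmbeddingSine(N_steps=128, normalize=True). Below is a condensed, self-contained sketch of what that module computes (simplified from detr/models/position_encoding.py; the function name and layout are mine):

import torch

def sine_pos_embed(mask, num_pos_feats=128, temperature=10000):
    """mask: [N, H, W] bool, True = padded pixel. Returns [N, 2*num_pos_feats, H, W]."""
    not_mask = ~mask
    y_embed = not_mask.cumsum(1, dtype=torch.float32)  # running row index of valid pixels
    x_embed = not_mask.cumsum(2, dtype=torch.float32)  # running col index of valid pixels
    eps, scale = 1e-6, 2 * 3.141592653589793
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale  # normalize to [0, 2*pi]
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale
    dim_t = torch.arange(num_pos_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_pos_feats)
    pos_x = x_embed[:, :, :, None] / dim_t
    pos_y = y_embed[:, :, :, None] / dim_t
    # sin on even channels, cos on odd channels, then interleave
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=4).flatten(3)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=4).flatten(3)
    return torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)

pos = sine_pos_embed(torch.zeros(2, 24, 32, dtype=torch.bool))
print(pos.shape)  # torch.Size([2, 256, 24, 32]), matching the pos embed fed to the transformer
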

3. detr/models/backbone.py

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Backbone modules.
"""
from collections import OrderedDict

import torch
import torch.nn.functional as F
import torchvision
from torch import nn
from torchvision.models._utils import IntermediateLayerGetter
from typing import Dict, List

from util.misc import NestedTensor, is_main_process

from .position_encoding import build_position_encoding
import pdb

class FrozenBatchNorm2d(torch.nn.Module):
    """
    BatchNorm2d where the batch statistics and the affine parameters are fixed.

    Copy-paste from torchvision.misc.ops with added eps before rsqrt,
    without which any other models than torchvision.models.resnet[18,34,50,101]
    produce nans.
    """

    def __init__(self, n):
        super(FrozenBatchNorm2d, self).__init__()
        self.register_buffer("weight", torch.ones(n))
        self.register_buffer("bias", torch.zeros(n))
        self.register_buffer("running_mean", torch.zeros(n))
        self.register_buffer("running_var", torch.ones(n))
        #pdb.set_trace()

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        num_batches_tracked_key = prefix + 'num_batches_tracked' # 'bn1.num_batches_tracked'
        if num_batches_tracked_key in state_dict:
            del state_dict[num_batches_tracked_key]

        super(FrozenBatchNorm2d, self)._load_from_state_dict(
            state_dict, prefix, local_metadata, strict,
            missing_keys, unexpected_keys, error_msgs)
        #pdb.set_trace()

    def forward(self, x):
        # move reshapes to the beginning
        # to make it fuser-friendly
        w = self.weight.reshape(1, -1, 1, 1) # [1, 64, 1, 1]
        b = self.bias.reshape(1, -1, 1, 1) # [1, 64, 1, 1]
        rv = self.running_var.reshape(1, -1, 1, 1) #  [1, 64, 1, 1]
        rm = self.running_mean.reshape(1, -1, 1, 1) #  [1, 64, 1, 1]
        eps = 1e-5
        scale = w * (rv + eps).rsqrt()
        bias = b - rm * scale
        #pdb.set_trace()
        return x * scale + bias


class BackboneBase(nn.Module):

    def __init__(self, backbone: nn.Module, train_backbone: bool, num_channels: int, return_interm_layers: bool):
        super().__init__()
        for name, parameter in backbone.named_parameters():
            print('name:', name)
            if not train_backbone or 'layer2' not in name and 'layer3' not in name and 'layer4' not in name:
                parameter.requires_grad_(False)
        if return_interm_layers: # False
            return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
        else:
            return_layers = {'layer4': "0"}
        self.body = IntermediateLayerGetter(backbone, return_layers=return_layers) # keeps everything up to layer4; the newly added avgpool/fc head is never returned
        self.num_channels = num_channels  #  2048
        #pdb.set_trace()

    def forward(self, tensor_list: NestedTensor):
        
        # tensor_list.tensors : torch.Size([2, 3, 768, 1024]) batchsize = 2  
        xs = self.body(tensor_list.tensors)  # runs the backbone (IntermediateLayerGetter); this is where FrozenBatchNorm2d.forward is hit
        # xs['0'].size() torch.Size([2, 2048, 24, 32])

        out: Dict[str, NestedTensor] = {}
        for name, x in xs.items():
            m = tensor_list.mask
            assert m is not None
            mask = F.interpolate(m[None].float(), size=x.shape[-2:]).to(torch.bool)[0] # [2, 24, 32]
            out[name] = NestedTensor(x, mask)
        #pdb.set_trace()
        return out # out['0'].tensors: torch.Size([2, 2048, 24, 32]); the paper's H/32, W/32


class Backbone(BackboneBase):
    """ResNet backbone with frozen BatchNorm."""
    def __init__(self, name: str,
                 train_backbone: bool,
                 return_interm_layers: bool,
                 dilation: bool):
        # name : resnet50
        # train_backbone : True
        # return_interm_layers : False
        # dilation : False
        print('name:', name)
        # build the torchvision resnet (avgpool/fc included) with BN frozen
        backbone = getattr(torchvision.models, name)( 
            replace_stride_with_dilation=[False, False, dilation],
            pretrained=is_main_process(), norm_layer=FrozenBatchNorm2d)
        # change:
        # bn_{1~3} :FrozenBatchNorm2d()
        # (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
        # (fc): Linear(in_features=2048, out_features=1000, bias=True)
        num_channels = 512 if name in ('resnet18', 'resnet34') else 2048
        super().__init__(backbone, train_backbone, num_channels, return_interm_layers)
        #pdb.set_trace()

# Joiner: calling model(input) enters this forward first
class Joiner(nn.Sequential):
    def __init__(self, backbone, position_embedding): # PositionEmbeddingSine()
        super().__init__(backbone, position_embedding)

    def forward(self, tensor_list: NestedTensor):
        # self[0]: IntermediateLayerGetter
        # self[1]: PositionEmbeddingSine
        xs = self[0](tensor_list) # runs BackboneBase.forward
        out: List[NestedTensor] = []
        pos = []
        for name, x in xs.items():
            out.append(x)
            # position encoding
            pos.append(self[1](x).to(x.tensors.dtype)) # self[1](x) = PositionEmbeddingSine.forward(x)
        #pdb.set_trace()
        return out, pos


def build_backbone(args):
    position_embedding = build_position_encoding(args) # PositionEmbeddingSine entry 

    train_backbone = args.lr_backbone > 0 # 1e-5 > 0 True

    return_interm_layers = args.masks # False

    # args.backbone resnet50
    backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation) # Backbone entry 

    # after Joiner
    # x = model(input) 就会先进入joiner的forward函数
    model = Joiner(backbone, position_embedding) # appends (1): PositionEmbeddingSine() after the backbone

    model.num_channels = backbone.num_channels # 2048
    #pdb.set_trace()
    return model
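
# Illustration (not in the original file): the argparse fields build_backbone
# reads, with DETR's default values, via a hypothetical SimpleNamespace stand-in:
#   from types import SimpleNamespace
#   args = SimpleNamespace(lr_backbone=1e-5, masks=False, backbone='resnet50',
#                          dilation=False, hidden_dim=256, position_embedding='sine')
#   model = build_backbone(args)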


'''
IntermediateLayerGetter(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): FrozenBatchNorm2d()
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): FrozenBatchNorm2d()
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
  )
  (layer2): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): FrozenBatchNorm2d()
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
    (3): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
  )
  (layer3): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): FrozenBatchNorm2d()
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
    (3): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
    (4): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
    (5): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
  )
  (layer4): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): FrozenBatchNorm2d()
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): FrozenBatchNorm2d()
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): FrozenBatchNorm2d()
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
    )
  )
)

'''
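A quick sanity check of the FrozenBatchNorm2d algebra above: scale = w / sqrt(rv + eps), bias = b - rm * scale is exactly eval-mode batch norm with fixed statistics. A minimal verification sketch (my own test code, not from the repo):

import torch

fbn_weight = torch.rand(8)
fbn_bias = torch.rand(8)
running_mean = torch.randn(8)
running_var = torch.rand(8) + 0.5
x = torch.randn(2, 8, 4, 4)
eps = 1e-5

# FrozenBatchNorm2d.forward, inlined
w = fbn_weight.reshape(1, -1, 1, 1)
b = fbn_bias.reshape(1, -1, 1, 1)
rv = running_var.reshape(1, -1, 1, 1)
rm = running_mean.reshape(1, -1, 1, 1)
scale = w * (rv + eps).rsqrt()
frozen_out = x * scale + (b - rm * scale)

# reference: eval-mode batch norm with the same statistics
ref = torch.nn.functional.batch_norm(
    x, running_mean, running_var, weight=fbn_weight, bias=fbn_bias,
    training=False, eps=eps)
print(torch.allclose(frozen_out, ref, atol=1e-5))  # True
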

4. detr/models/transformer.py

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR Transformer class.

Copy-paste from torch.nn.Transformer with modifications:
    * positional encodings are passed in MHattention
    * extra LN at the end of encoder is removed
    * decoder returns a stack of activations from all decoding layers
"""
import copy
from typing import Optional, List

import torch
import torch.nn.functional as F
from torch import nn, Tensor
import pdb

class Transformer(nn.Module):

    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6, # d_model 256
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False): # return_intermediate_dec true
        super().__init__()

        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None

        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                          return_intermediate=return_intermediate_dec)

        self._reset_parameters()

        self.d_model = d_model # 256
        self.nhead = nhead # 8
        #pdb.set_trace()

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, mask, query_embed, pos_embed):# [2, 256, 24, 32], [2, 24, 32], [100, 256], [2, 256, 24, 32]
        # flatten NxCxHxW to HWxNxC
        bs, c, h, w = src.shape # (2, 256, 24, 32)
        src = src.flatten(2).permute(2, 0, 1) # [24 * 32, 2, 256]
        pos_embed = pos_embed.flatten(2).permute(2, 0, 1) # [24 * 32, 2, 256]
        query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)# [100, 2, 256]
        mask = mask.flatten(1) # [2, 768]

        tgt = torch.zeros_like(query_embed) # [100, 2, 256]
        #pdb.set_trace()

        #forward TransformerEncoder
        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed) # [768, 2, 256]

        #pdb.set_trace()

        #forward TransformerDecoder
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask, # [6, 100, 2, 256]
                          pos=pos_embed, query_pos=query_embed)
        #pdb.set_trace()
        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w) 
        # hs [6, 2, 100, 256]  memory [2, 256, 24, 32] 

class TransformerEncoder(nn.Module):

    def __init__(self, encoder_layer, num_layers, norm=None):
        super().__init__()
        self.layers = _get_clones(encoder_layer, num_layers) # deep-copy 6 identical layers
        self.num_layers = num_layers # 6
        self.norm = norm # None

    def forward(self, src,
                mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        output = src

        for layer in self.layers:
            # forward
            output = layer(output, src_mask=mask,
                           src_key_padding_mask=src_key_padding_mask, pos=pos)
            #pdb.set_trace()
        if self.norm is not None:
            output = self.norm(output)
        #pdb.set_trace()
        return output # [768, 2, 256]


class TransformerDecoder(nn.Module):

    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate
        #pdb.set_trace()
    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        output = tgt

        intermediate = []

        for layer in self.layers:  # 6
            output = layer(output, memory, tgt_mask=tgt_mask, # forward  TransformerDecoderLayer
                           memory_mask=memory_mask,
                           tgt_key_padding_mask=tgt_key_padding_mask,
                           memory_key_padding_mask=memory_key_padding_mask,
                           pos=pos, query_pos=query_pos)
            if self.return_intermediate:
                intermediate.append(self.norm(output))
        #pdb.set_trace()
        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)

        if self.return_intermediate:
            return torch.stack(intermediate)
        #pdb.set_trace()
        return output.unsqueeze(0)
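
    # Illustration (not in the original file): with return_intermediate=True and
    # num_layers=6, the stacked output is [6, 100, N, 256]. Each of the 6 slices
    # feeds one auxiliary loss; only slice [-1] produces the final boxes/labels.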


class TransformerEncoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout) #  256  8 0.1

        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward) # [256, 2048]
        self.dropout = nn.Dropout(dropout) # 0.1
        self.linear2 = nn.Linear(dim_feedforward, d_model) # [2048, 256]

        self.norm1 = nn.LayerNorm(d_model)  # LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation) # F.relu
        self.normalize_before = normalize_before # False

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):

        return tensor if pos is None else tensor + pos # [768, 2, 256]; pos is simply added element-wise

    def forward_post(self,
                     src,
                     src_mask: Optional[Tensor] = None,
                     src_key_padding_mask: Optional[Tensor] = None,
                     pos: Optional[Tensor] = None):

        q = k = self.with_pos_embed(src, pos) # query/key/value must have matching shapes

        # Scaled Dot-Product Attention
        # 1. Attention(q,k,v) = softmax(qk^T / sqrt(d_k))V
        # 2. the nhead = 8 heads are concatenated
        # 3. q, k, v share the same shape, but q and k carry the pos embedding while value does not
        src2 = self.self_attn(q, k, value=src, attn_mask=src_mask, # src_mask = None; src2.size(): [768, 2, 256] = [L_length, N_batch, Embed_d]
                              key_padding_mask=src_key_padding_mask)[0] # src_key_padding_mask [2, 768]: positions that are True are ignored by attention
        
        # Add & Nrom 
        src = src + self.dropout1(src2) #  dropout and short-cut
        src = self.norm1(src) # nrom

        # Feed-forward
        # two linear layers with ReLU in between
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        #pdb.set_trace()
        # src: self-attn -> add & norm1 -> FFN -> add & norm2
        return src

    def forward_pre(self, src,
                    src_mask: Optional[Tensor] = None,
                    src_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None):
        src2 = self.norm1(src)
        q = k = self.with_pos_embed(src2, pos)
        src2 = self.self_attn(q, k, value=src2, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src2 = self.norm2(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))
        src = src + self.dropout2(src2)
        #pdb.set_trace()

        return src

    def forward(self, src,
                src_mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        if self.normalize_before: # False
            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
        #pdb.set_trace()

        return self.forward_post(src, src_mask, src_key_padding_mask, pos)


class TransformerDecoderLayer(nn.Module):

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        # self_attn plus a cross-attention multihead_attn;
        # the encoder layer only has self_attn
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout) 
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)

        # Implementation of Feedforward model
        # the decoder adds norm3 and dropout3
        self.linear1 = nn.Linear(d_model, dim_feedforward) # [256, 2048]
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model) # [2048, 256]

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation) # F.relu
        self.normalize_before = normalize_before # False
        #pdb.set_trace()


    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        #pdb.set_trace()

        return tensor if pos is None else tensor + pos

    def forward_post(self, tgt, memory,
                     tgt_mask: Optional[Tensor] = None,
                     memory_mask: Optional[Tensor] = None, # None
                     tgt_key_padding_mask: Optional[Tensor] = None,
                     memory_key_padding_mask: Optional[Tensor] = None,# [2, 768]
                     pos: Optional[Tensor] = None,
                     query_pos: Optional[Tensor] = None):
        q = k = self.with_pos_embed(tgt, query_pos)

        # attn_output: (L, N, E), where L is the query length
        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0] # [100, 2, 256]  
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt) # [100, 2, 256]

        #pdb.set_trace()

        return tgt

    def forward_pre(self, tgt, memory,
                    tgt_mask: Optional[Tensor] = None,
                    memory_mask: Optional[Tensor] = None,
                    tgt_key_padding_mask: Optional[Tensor] = None,
                    memory_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None,
                    query_pos: Optional[Tensor] = None):
        tgt2 = self.norm1(tgt)
        q = k = self.with_pos_embed(tgt2, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt2 = self.norm2(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt2 = self.norm3(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
        tgt = tgt + self.dropout3(tgt2)
        #pdb.set_trace()

        return tgt

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        #pdb.set_trace()

        if self.normalize_before:
            return self.forward_pre(tgt, memory, tgt_mask, memory_mask,
                                    tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
        return self.forward_post(tgt, memory, tgt_mask, memory_mask,
                                 tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)


def _get_clones(module, N):
    x = nn.ModuleList([copy.deepcopy(module) for i in range(N)])
    return x


def build_transformer(args):

    return Transformer(
        d_model=args.hidden_dim, # 256
        dropout=args.dropout, # 0.1
        nhead=args.nheads, #  8
        dim_feedforward=args.dim_feedforward, # 2048
        num_encoder_layers=args.enc_layers, # 6
        num_decoder_layers=args.dec_layers, # 6
        normalize_before=args.pre_norm, # False
        return_intermediate_dec=True,
    )


def _get_activation_fn(activation):
    """Return an activation function given a string"""
    if activation == "relu":
        return F.relu
    if activation == "gelu":
        return F.gelu
    if activation == "glu":
        return F.glu
    raise RuntimeError(f"activation should be relu/gelu/glu, not {activation}.")
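
# Smoke test (illustration only, not in the original file): wire up the shapes
# from the forward() comments above end to end.
if __name__ == "__main__":
    model = Transformer(d_model=256, nhead=8, num_encoder_layers=6,
                        num_decoder_layers=6, return_intermediate_dec=True)
    src = torch.rand(2, 256, 24, 32)                 # projected backbone feature
    mask = torch.zeros(2, 24, 32, dtype=torch.bool)  # no padded pixels
    query_embed = torch.rand(100, 256)               # 100 object queries
    pos_embed = torch.rand(2, 256, 24, 32)           # sine position encoding
    hs, memory = model(src, mask, query_embed, pos_embed)
    print(hs.shape)      # torch.Size([6, 2, 100, 256])
    print(memory.shape)  # torch.Size([2, 256, 24, 32])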

'''
MultiheadAttention(
  (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
encoder_layer
TransformerEncoderLayer(
  (self_attn): MultiheadAttention(
    (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
  )
  (linear1): Linear(in_features=256, out_features=2048, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (linear2): Linear(in_features=2048, out_features=256, bias=True)
  (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.1, inplace=False)
  (dropout2): Dropout(p=0.1, inplace=False)
)
decoder_layer
TransformerDecoderLayer(
  (self_attn): MultiheadAttention(
    (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
  )
  (multihead_attn): MultiheadAttention(
    (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
  )
  (linear1): Linear(in_features=256, out_features=2048, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (linear2): Linear(in_features=2048, out_features=256, bias=True)
  (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.1, inplace=False)
  (dropout2): Dropout(p=0.1, inplace=False)
  (dropout3): Dropout(p=0.1, inplace=False)
)



TransformerEncoderLayer
ModuleList(
  (0): TransformerEncoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
    )
    (linear1): Linear(in_features=256, out_features=2048, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=2048, out_features=256, bias=True)
    (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
  )
  (1)-(5): five more TransformerEncoderLayer blocks, identical to (0)
)


DecoderLayer
ModuleList(
  (0): TransformerDecoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
    )
    (multihead_attn): MultiheadAttention(
      (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
    )
    (linear1): Linear(in_features=256, out_features=2048, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=2048, out_features=256, bias=True)
    (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
    (dropout3): Dropout(p=0.1, inplace=False)
  )
  (1)-(5): five more TransformerDecoderLayer blocks, identical to (0)
)

'''

5. detr/models/matcher.py

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Modules to compute the matching cost and solve the corresponding LSAP.
"""
import torch
from scipy.optimize import linear_sum_assignment
from torch import nn

from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou
import pdb


class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network

    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher

        Params: class bbox giou 
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class # 1
        self.cost_bbox = cost_bbox  # 5
        self.cost_giou = cost_giou # 2
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs can't be 0"
        #pdb.set_trace()
    @torch.no_grad()
    def forward(self, outputs, targets):
        """ Performs the matching

        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates

            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        bs, num_queries = outputs["pred_logits"].shape[:2] # 2, 100

        # We flatten to compute the cost matrices in a batch
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [200, 92] = [batch_size * num_queries, num_classes]; softmax over each of the 200 class vectors
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [200, 4] = [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        tgt_ids = torch.cat([v["labels"] for v in targets]) # tensor([18,  1,  1, 15, 27, 44, 84, 27,  6], device='cuda:0')
        tgt_bbox = torch.cat([v["boxes"] for v in targets]) # see the dump at the bottom of this file

        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be ommitted.

        cost_class = -out_prob[:, tgt_ids] # [200, num_gt], e.g. num_gt = 9

        # Compute the L1 cost between boxes l1 loss
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)  #  [200, 9] 

        # Compute the giou cost between boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox)) # [200, 9]: GIoU of each of the 200 pred boxes against every gt box

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou # [200, 9]
        C = C.view(bs, num_queries, -1).cpu() # [2, 100, 9]

        sizes = [len(v["boxes"]) for v in targets] # [box_num_1, box_num_2] eg. [8, 1] 表示第一个img有8个gtbox 第二个img有1个gtbox
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        '''
        indices = []
        for i, c in enumerate(C.split(sizes, -1)): # C[0] : [2, 100, 8]   C[1] : [2, 100, 1]
            # i : 0,  c[0] : [100, 8]
            # i : 1,  c[1] : [100, 1]
            pdb.set_trace()
            indices.append(linear_sum_assignment(c[i])) # min cost
        '''
        pdb.set_trace()
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]
        '''
        (Pdb) C.split(sizes, -1)[0].size()
        torch.Size([2, 100, 8])
        (Pdb) C.split(sizes, -1)[1].size()
        torch.Size([2, 100, 1])
        indices [(array([ 2,  4, 31, 42, 83, 84, 89, 91]), array([7, 1, 6, 5, 3, 4, 2, 0])), (array([54]), array([0]))]
        '''

def build_matcher(args):
    return HungarianMatcher(cost_class=args.set_cost_class, cost_bbox=args.set_cost_bbox, cost_giou=args.set_cost_giou)


'''
(Pdb) torch.cat([v["boxes"] for v in targets]) 
tensor([[0.3722, 0.6666, 0.2496, 0.1161],
        [0.5426, 0.4170, 0.3599, 0.7220],
        [0.2823, 0.4324, 0.4917, 0.2580],
        [0.4459, 0.6099, 0.6642, 0.2646],
        [0.6796, 0.6822, 0.1735, 0.2128],
        [0.3850, 0.6643, 0.0392, 0.1714],
        [0.5527, 0.3286, 0.0334, 0.0536],
        [0.4618, 0.4616, 0.0716, 0.0671],
        [0.1798, 0.5849, 0.3595, 0.4541]], device='cuda:0')

'''
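To see what `linear_sum_assignment` does on its own, here is a toy sketch (my own example; the numbers are made up): given a 4-query x 2-gt cost matrix, it returns the row/column indices of the cheapest one-to-one assignment, exactly the `(index_i, index_j)` pairs the matcher returns.

import numpy as np
from scipy.optimize import linear_sum_assignment

C = np.array([[0.9, 0.4],
              [0.1, 0.8],
              [0.7, 0.2],
              [0.5, 0.6]])  # cost of matching query i to gt j
row_ind, col_ind = linear_sum_assignment(C)
print(row_ind, col_ind)  # [1 2] [0 1]: query 1 -> gt 0, query 2 -> gt 1, total cost 0.3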

6. detr/models/position_encoding.py

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Various positional encodings for the transformer.
"""
import math
import torch
from torch import nn

from util.misc import NestedTensor
import pdb

class PositionEmbeddingSine(nn.Module):
    """
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats # 128
        self.temperature = temperature # 10000, the temperature from Attention Is All You Need: PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
        self.normalize = normalize # True
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            scale = 2 * math.pi  # scale = 2π
        self.scale = scale

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors # torch.Size([2, 2048, 24, 32])
        # https://github.com/facebookresearch/detr/issues/96 we set the mask to False for all the elements that are not part of the padding.

        mask = tensor_list.mask # torch.Size([2, 24, 32])
        assert mask is not None
        not_mask = ~mask # mask is a bool tensor; ~mask flips True and False
 
        y_embed = not_mask.cumsum(1, dtype=torch.float32) # running sum along dim 1 (rows); the float32 cast turns True/False into 1/0
        x_embed = not_mask.cumsum(2, dtype=torch.float32) # [2, 24, 32]
        #pdb.set_trace()
        if self.normalize: # True
            eps = 1e-6
            # divide by the last element: because of cumsum the last entry is the total,
            # so this normalizes to (0, 1] before scaling by 2π
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale # [2, 24, 32]; y_embed[:, -1:, :] --> [2, 1, 32]
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale # [2, 24, 32]
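            # e.g. a row of width 4: cumsum = [1, 2, 3, 4]; dividing by the last
            # entry and scaling by 2π gives [π/2, π, 3π/2, 2π] (ignoring eps)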

        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device) # [128]
        
        # why (dim_t // 2)? the paper indexes frequencies by 2i, and both the sin (2i)
        # and cos (2i+1) channels share the same 2i
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats) # per-channel frequencies; each i serves a (sin, cos) channel pair

        pos_x = x_embed[:, :, :, None] / dim_t # broadcasting: [2, 24, 32, 1] / [128] = [2, 24, 32, 128]; each of the 24*32 positions gets a value on all 128 channels
        pos_y = y_embed[:, :, :, None] / dim_t 

        # per the formulas in the paper
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3) # torch.Size([2, 24, 32, 64, 2]); flatten(3) --> torch.Size([2, 24, 32, 128])

        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3) # torch.Size([2, 24, 32, 128])
        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2) # torch.Size([2, 256, 24, 32])
        #pdb.set_trace()
        return pos


class PositionEmbeddingLearned(nn.Module):
    """
    Absolute pos embedding, learned.
    """
    def __init__(self, num_pos_feats=256):
        super().__init__()
        self.row_embed = nn.Embedding(50, num_pos_feats)
        self.col_embed = nn.Embedding(50, num_pos_feats)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.uniform_(self.row_embed.weight)
        nn.init.uniform_(self.col_embed.weight)

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        h, w = x.shape[-2:]
        i = torch.arange(w, device=x.device)
        j = torch.arange(h, device=x.device)
        x_emb = self.col_embed(i)
        y_emb = self.row_embed(j)
        pos = torch.cat([
            x_emb.unsqueeze(0).repeat(h, 1, 1),
            y_emb.unsqueeze(1).repeat(1, w, 1),
        ], dim=-1).permute(2, 0, 1).unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
        return pos


def build_position_encoding(args):
    pdb.set_trace()
    N_steps = args.hidden_dim // 2 # 128
    if args.position_embedding in ('v2', 'sine'):  # sine
        # TODO find a better way of exposing other arguments
        position_embedding = PositionEmbeddingSine(N_steps, normalize=True)
    elif args.position_embedding in ('v3', 'learned'):
        position_embedding = PositionEmbeddingLearned(N_steps)
    else:
        raise ValueError(f"not supported {args.position_embedding}")
    #pdb.set_trace()
    return position_embedding
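A quick shape check of the sine embedding (my own sketch; NestedTensor comes from util.misc, and an all-False mask means no padding):

import torch
from util.misc import NestedTensor

feat = torch.randn(2, 2048, 24, 32)              # backbone output
mask = torch.zeros(2, 24, 32, dtype=torch.bool)  # False = real pixel, True = padding
pe = PositionEmbeddingSine(num_pos_feats=128, normalize=True)
pos = pe(NestedTensor(feat, mask))
print(pos.shape)  # torch.Size([2, 256, 24, 32]): 128 channels for y + 128 for x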
    

7. detr/main.py

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import argparse
import datetime
import json
import random
import time
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import DataLoader, DistributedSampler

import datasets
import util.misc as utils
from datasets import build_dataset, get_coco_api_from_dataset
from engine import evaluate, train_one_epoch
from models import build_model


def get_args_parser():
    parser = argparse.ArgumentParser('Set transformer detector', add_help=False)
    
    # optimization hyperparameters
    parser.add_argument('--lr', default=1e-4, type=float)
    parser.add_argument('--lr_backbone', default=1e-5, type=float)
    # batch_size
    parser.add_argument('--batch_size', default=2, type=int)
    
    parser.add_argument('--weight_decay', default=1e-4, type=float)
    
    parser.add_argument('--epochs', default=300, type=int)
    parser.add_argument('--lr_drop', default=200, type=int)
    parser.add_argument('--clip_max_norm', default=0.1, type=float,
                        help='gradient clipping max norm')

    # Model parameters
    parser.add_argument('--frozen_weights', type=str, default=None,
                        help="Path to the pretrained model. If set, only the mask head will be trained")
    # * Backbone
    parser.add_argument('--backbone', default='resnet50', type=str,
                        help="Name of the convolutional backbone to use")
    parser.add_argument('--dilation', action='store_true',
                        help="If true, we replace stride with dilation in the last convolutional block (DC5)")
    parser.add_argument('--position_embedding', default='sine', type=str, choices=('sine', 'learned'),
                        help="Type of positional embedding to use on top of the image features")

    # * Transformer
    parser.add_argument('--enc_layers', default=6, type=int,
                        help="Number of encoding layers in the transformer")
    parser.add_argument('--dec_layers', default=6, type=int,
                        help="Number of decoding layers in the transformer")
    parser.add_argument('--dim_feedforward', default=2048, type=int,
                        help="Intermediate size of the feedforward layers in the transformer blocks")
    parser.add_argument('--hidden_dim', default=256, type=int,
                        help="Size of the embeddings (dimension of the transformer)")
    parser.add_argument('--dropout', default=0.1, type=float,
                        help="Dropout applied in the transformer")
    parser.add_argument('--nheads', default=8, type=int,
                        help="Number of attention heads inside the transformer's attentions")
    parser.add_argument('--num_queries', default=100, type=int,
                        help="Number of query slots")
    parser.add_argument('--pre_norm', action='store_true')

    # * Segmentation
    parser.add_argument('--masks', action='store_true',
                        help="Train segmentation head if the flag is provided")

    # Loss
    parser.add_argument('--no_aux_loss', dest='aux_loss', action='store_false',
                        help="Disables auxiliary decoding losses (loss at each layer)")

    # * Matcher
    parser.add_argument('--set_cost_class', default=1, type=float,
                        help="Class coefficient in the matching cost")
    parser.add_argument('--set_cost_bbox', default=5, type=float,
                        help="L1 box coefficient in the matching cost")
    parser.add_argument('--set_cost_giou', default=2, type=float,
                        help="giou box coefficient in the matching cost")
    # * Loss coefficients
    parser.add_argument('--mask_loss_coef', default=1, type=float)
    parser.add_argument('--dice_loss_coef', default=1, type=float)
    parser.add_argument('--bbox_loss_coef', default=5, type=float)
    parser.add_argument('--giou_loss_coef', default=2, type=float)
    parser.add_argument('--eos_coef', default=0.1, type=float,
                        help="Relative classification weight of the no-object class")

    # dataset parameters
    parser.add_argument('--dataset_file', default='coco')
    parser.add_argument('--coco_path', type=str)
    parser.add_argument('--coco_panoptic_path', type=str)
    parser.add_argument('--remove_difficult', action='store_true')

    parser.add_argument('--output_dir', default='',
                        help='path where to save, empty for no saving')
    parser.add_argument('--device', default='cuda',
                        help='device to use for training / testing')
    parser.add_argument('--seed', default=42, type=int)
    parser.add_argument('--resume', default='', help='resume from checkpoint')
    parser.add_argument('--start_epoch', default=0, type=int, metavar='N',
                        help='start epoch')
    parser.add_argument('--eval', action='store_true')
    parser.add_argument('--num_workers', default=2, type=int)

    # distributed training parameters
    parser.add_argument('--world_size', default=1, type=int,
                        help='number of distributed processes')
    parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
    return parser


def main(args):
    utils.init_distributed_mode(args)
    print("git:\n  {}\n".format(utils.get_sha()))

    if args.frozen_weights is not None:
        assert args.masks, "Frozen training is meant for segmentation only"
    print(args)

    device = torch.device(args.device)

    # fix the seed for reproducibility
    seed = args.seed + utils.get_rank()
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    # build_model entry
    model, criterion, postprocessors = build_model(args) 
    model.to(device)

    model_without_ddp = model
    if args.distributed:
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        model_without_ddp = model.module
    n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print('number of params:', n_parameters)

    param_dicts = [
        {"params": [p for n, p in model_without_ddp.named_parameters() if "backbone" not in n and p.requires_grad]},
        {
            "params": [p for n, p in model_without_ddp.named_parameters() if "backbone" in n and p.requires_grad],
            "lr": args.lr_backbone,
        },
    ]

    # build optim
    optimizer = torch.optim.AdamW(param_dicts, lr=args.lr,
                                  weight_decay=args.weight_decay)
    # build lr
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, args.lr_drop)

    # prepare datasets 
    dataset_train = build_dataset(image_set='train', args=args)
    dataset_val = build_dataset(image_set='val', args=args)

    # distributed or not
    if args.distributed: #  True
        sampler_train = DistributedSampler(dataset_train)
        sampler_val = DistributedSampler(dataset_val, shuffle=False)
    else:
        sampler_train = torch.utils.data.RandomSampler(dataset_train)
        sampler_val = torch.utils.data.SequentialSampler(dataset_val)

    batch_sampler_train = torch.utils.data.BatchSampler(
        sampler_train, args.batch_size, drop_last=True)

    # build DataLoader
    data_loader_train = DataLoader(dataset_train, batch_sampler=batch_sampler_train,
                                   collate_fn=utils.collate_fn, num_workers=args.num_workers)
    data_loader_val = DataLoader(dataset_val, args.batch_size, sampler=sampler_val,
                                 drop_last=False, collate_fn=utils.collate_fn, num_workers=args.num_workers)

    if args.dataset_file == "coco_panoptic":
        # We also evaluate AP during panoptic training, on original coco DS
        coco_val = datasets.coco.build("val", args)
        base_ds = get_coco_api_from_dataset(coco_val)
    else:
        base_ds = get_coco_api_from_dataset(dataset_val)

    if args.frozen_weights is not None:
        checkpoint = torch.load(args.frozen_weights, map_location='cpu')
        model_without_ddp.detr.load_state_dict(checkpoint['model'])

    output_dir = Path(args.output_dir)
    if args.resume:
        if args.resume.startswith('https'):
            checkpoint = torch.hub.load_state_dict_from_url(
                args.resume, map_location='cpu', check_hash=True)
        else:
            checkpoint = torch.load(args.resume, map_location='cpu')
        model_without_ddp.load_state_dict(checkpoint['model'])
        if not args.eval and 'optimizer' in checkpoint and 'lr_scheduler' in checkpoint and 'epoch' in checkpoint:
            optimizer.load_state_dict(checkpoint['optimizer'])
            lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
            args.start_epoch = checkpoint['epoch'] + 1
    
    # testing
    if args.eval:
        test_stats, coco_evaluator = evaluate(model, criterion, postprocessors,
                                              data_loader_val, base_ds, device, args.output_dir)
        if args.output_dir:
            utils.save_on_master(coco_evaluator.coco_eval["bbox"].eval, output_dir / "eval.pth")
        return

    print("Start training")
    start_time = time.time()
    # training
    for epoch in range(args.start_epoch, args.epochs):
        if args.distributed:
            sampler_train.set_epoch(epoch)
        train_stats = train_one_epoch( # one epoch train
            model, criterion, data_loader_train, optimizer, device, epoch,
            args.clip_max_norm)
        lr_scheduler.step()
        if args.output_dir:
            checkpoint_paths = [output_dir / 'checkpoint.pth']
            # extra checkpoint before LR drop and every 100 epochs
            if (epoch + 1) % args.lr_drop == 0 or (epoch + 1) % 100 == 0:
                checkpoint_paths.append(output_dir / f'checkpoint{epoch:04}.pth')
            for checkpoint_path in checkpoint_paths:
                utils.save_on_master({
                    'model': model_without_ddp.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'lr_scheduler': lr_scheduler.state_dict(),
                    'epoch': epoch,
                    'args': args,
                }, checkpoint_path)

        test_stats, coco_evaluator = evaluate(
            model, criterion, postprocessors, data_loader_val, base_ds, device, args.output_dir
        )

        log_stats = {**{f'train_{k}': v for k, v in train_stats.items()},
                     **{f'test_{k}': v for k, v in test_stats.items()},
                     'epoch': epoch,
                     'n_parameters': n_parameters}

        if args.output_dir and utils.is_main_process():
            with (output_dir / "log.txt").open("a") as f:
                f.write(json.dumps(log_stats) + "\n")

            # for evaluation logs
            if coco_evaluator is not None:
                (output_dir / 'eval').mkdir(exist_ok=True)
                if "bbox" in coco_evaluator.coco_eval:
                    filenames = ['latest.pth']
                    if epoch % 50 == 0:
                        filenames.append(f'{epoch:03}.pth')
                    for name in filenames:
                        torch.save(coco_evaluator.coco_eval["bbox"].eval,
                                   output_dir / "eval" / name)

    total_time = time.time() - start_time
    total_time_str = str(datetime.timedelta(seconds=int(total_time)))
    print('Training time {}'.format(total_time_str))


if __name__ == '__main__':
    parser = argparse.ArgumentParser('DETR training and evaluation script', parents=[get_args_parser()]) # parents takes a list of ArgumentParser objects
    args = parser.parse_args()
    if args.output_dir:
        Path(args.output_dir).mkdir(parents=True, exist_ok=True)
    main(args)

'''
args:
Namespace(
    aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, 
    coco_panoptic_path=None, coco_path='/hdd2/wh/datasets/coco/', dataset_file='coco', 
    dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, 
    dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, 
    eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0,
    hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, 
    nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', 
    pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, 
    set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1
    )

'''
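As a sanity check on these defaults (my own sketch; the coco path is a placeholder), the parser above can be exercised directly:

parser = get_args_parser()
args = parser.parse_args(['--coco_path', '/path/to/coco'])
assert args.lr == 1e-4 and args.num_queries == 100
assert (args.set_cost_class, args.set_cost_bbox, args.set_cost_giou) == (1, 5, 2)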

8. detr/engine.py


import math
import os
import sys
from typing import Iterable

import torch

import util.misc as utils
from datasets.coco_eval import CocoEvaluator
from datasets.panoptic_eval import PanopticEvaluator
import pdb

def train_one_epoch(model: torch.nn.Module, criterion: torch.nn.Module,
                    data_loader: Iterable, optimizer: torch.optim.Optimizer,
                    device: torch.device, epoch: int, max_norm: float = 0):
    model.train()
    criterion.train()
    metric_logger = utils.MetricLogger(delimiter="  ")
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
    metric_logger.add_meter('class_error', utils.SmoothedValue(window_size=1, fmt='{value:.2f}'))
    header = 'Epoch: [{}]'.format(epoch)
    print_freq = 10

    for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
        samples = samples.to(device)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # forward model
        outputs = model(samples)
        #pdb.set_trace()
        
        # forward loss
        loss_dict = criterion(outputs, targets)
        pdb.set_trace()
        weight_dict = criterion.weight_dict
        losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)

        # reduce losses over all GPUs for logging purposes
        loss_dict_reduced = utils.reduce_dict(loss_dict)
        loss_dict_reduced_unscaled = {f'{k}_unscaled': v
                                      for k, v in loss_dict_reduced.items()}
        loss_dict_reduced_scaled = {k: v * weight_dict[k] # multiply by the corresponding weight (1 / 5 / 2)
                                    for k, v in loss_dict_reduced.items() if k in weight_dict} # drops keys absent from weight_dict, e.g. cardinality / class error
        losses_reduced_scaled = sum(loss_dict_reduced_scaled.values()) # equals the training loss above

        loss_value = losses_reduced_scaled.item()

        if not math.isfinite(loss_value):
            print("Loss is {}, stopping training".format(loss_value))
            print(loss_dict_reduced)
            sys.exit(1)

        optimizer.zero_grad()
        losses.backward()
        if max_norm > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()

        metric_logger.update(loss=loss_value, **loss_dict_reduced_scaled, **loss_dict_reduced_unscaled)
        metric_logger.update(class_error=loss_dict_reduced['class_error'])
        metric_logger.update(lr=optimizer.param_groups[0]["lr"])
    # gather the stats from all processes
    metric_logger.synchronize_between_processes()
    print("Averaged stats:", metric_logger)
    return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
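# Toy illustration (hypothetical numbers) of the weighting above: keys missing from
# weight_dict (e.g. class_error, cardinality_error) are logged but never optimized.
#
#   loss_dict   = {'loss_ce': 0.8, 'loss_bbox': 0.3, 'loss_giou': 0.5, 'class_error': 25.0}
#   weight_dict = {'loss_ce': 1.0, 'loss_bbox': 5.0, 'loss_giou': 2.0}
#   sum(loss_dict[k] * weight_dict[k] for k in loss_dict if k in weight_dict)  # 3.3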


@torch.no_grad()
def evaluate(model, criterion, postprocessors, data_loader, base_ds, device, output_dir):
    pdb.set_trace()
    model.eval()
    criterion.eval()

    metric_logger = utils.MetricLogger(delimiter="  ") # create the MetricLogger
    metric_logger.add_meter('class_error', utils.SmoothedValue(window_size=1, fmt='{value:.2f}'))
    header = 'Test:'

    iou_types = tuple(k for k in ('segm', 'bbox') if k in postprocessors.keys())
    coco_evaluator = CocoEvaluator(base_ds, iou_types) # init CocoEvaluator
    # coco_evaluator.coco_eval[iou_types[0]].params.iouThrs = [0, 0.1, 0.5, 0.75]

    panoptic_evaluator = None
    if 'panoptic' in postprocessors.keys(): # False   dict_keys(['bbox'])
        panoptic_evaluator = PanopticEvaluator(
            data_loader.dataset.ann_file,
            data_loader.dataset.ann_folder,
            output_dir=os.path.join(output_dir, "panoptic_eval"),
        )

    for samples, targets in metric_logger.log_every(data_loader, 10, header):
        samples = samples.to(device)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
    
        outputs = model(samples)
        loss_dict = criterion(outputs, targets)
        weight_dict = criterion.weight_dict

        # reduce losses over all GPUs for logging purposes
        loss_dict_reduced = utils.reduce_dict(loss_dict)
        loss_dict_reduced_scaled = {k: v * weight_dict[k]
                                    for k, v in loss_dict_reduced.items() if k in weight_dict}
        loss_dict_reduced_unscaled = {f'{k}_unscaled': v
                                      for k, v in loss_dict_reduced.items()}
        metric_logger.update(loss=sum(loss_dict_reduced_scaled.values()),
                             **loss_dict_reduced_scaled,
                             **loss_dict_reduced_unscaled)
        metric_logger.update(class_error=loss_dict_reduced['class_error'])
        orig_target_sizes = torch.stack([t["orig_size"] for t in targets], dim=0)
        pdb.set_trace()

        results = postprocessors['bbox'](outputs, orig_target_sizes) # postprocessors
        pdb.set_trace()
        if 'segm' in postprocessors.keys():
            target_sizes = torch.stack([t["size"] for t in targets], dim=0)
            results = postprocessors['segm'](results, outputs, orig_target_sizes, target_sizes)
        res = {target['image_id'].item(): output for target, output in zip(targets, results)} # index the outputs by image_id
        pdb.set_trace()
        if coco_evaluator is not None:
            coco_evaluator.update(res) # step into CocoEvaluator.update (the tricky part)

        if panoptic_evaluator is not None: # false
            res_pano = postprocessors["panoptic"](outputs, target_sizes, orig_target_sizes)
            for i, target in enumerate(targets):
                image_id = target["image_id"].item()
                file_name = f"{image_id:012d}.png"
                res_pano[i]["image_id"] = image_id
                res_pano[i]["file_name"] = file_name

            panoptic_evaluator.update(res_pano)
        pdb.set_trace()
    # gather the stats from all processes
    metric_logger.synchronize_between_processes()
    pdb.set_trace()
    print("Averaged stats:", metric_logger)
    if coco_evaluator is not None:
        coco_evaluator.synchronize_between_processes()
    if panoptic_evaluator is not None:
        panoptic_evaluator.synchronize_between_processes()

    # accumulate predictions from all images
    if coco_evaluator is not None:
        coco_evaluator.accumulate()
        coco_evaluator.summarize()
    panoptic_res = None
    if panoptic_evaluator is not None:
        panoptic_res = panoptic_evaluator.summarize()
    stats = {k: meter.global_avg for k, meter in metric_logger.meters.items()}
    pdb.set_trace()
    if coco_evaluator is not None:
        if 'bbox' in postprocessors.keys():
            stats['coco_eval_bbox'] = coco_evaluator.coco_eval['bbox'].stats.tolist()
            pdb.set_trace()
        if 'segm' in postprocessors.keys():
            stats['coco_eval_masks'] = coco_evaluator.coco_eval['segm'].stats.tolist()
    if panoptic_res is not None:
        stats['PQ_all'] = panoptic_res["All"]
        stats['PQ_th'] = panoptic_res["Things"]
        stats['PQ_st'] = panoptic_res["Stuff"]
    pdb.set_trace()
    return stats, coco_evaluator

'''
targets:
len(targets) = 2
[{'boxes': tensor([[0.3722, 0.6666, 0.2496, 0.1161],
        [0.5426, 0.4170, 0.3599, 0.7220],
        [0.2823, 0.4324, 0.4917, 0.2580],
        [0.4459, 0.6099, 0.6642, 0.2646],
        [0.6796, 0.6822, 0.1735, 0.2128],
        [0.3850, 0.6643, 0.0392, 0.1714],
        [0.5527, 0.3286, 0.0334, 0.0536],
        [0.4618, 0.4616, 0.0716, 0.0671]], device='cuda:0'), 
 'labels': tensor([18,  1,  1, 15, 27, 44, 84, 27], device='cuda:0'), 
 'image_id': tensor([151988], device='cuda:0'), 
 'area': tensor([16129.6201, 81465.6953, 46954.9570, 48422.0742, 22253.7461,  4561.6089,
          818.3591,  2781.5159], device='cuda:0'), 
 'iscrowd': tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'), 
 'orig_size': tensor([480, 640], device='cuda:0'), 
 'size': tensor([ 768, 1024], device='cuda:0')},
 
 {'boxes': tensor([[0.1798, 0.5849, 0.3595, 0.4541]], device='cuda:0'), 
  'labels': tensor([6], device='cuda:0'), 'image_id': tensor([116848], device='cuda:0'), 
  'area': tensor([89652.4375], device='cuda:0'), 
  'iscrowd': tensor([0], device='cuda:0'), 
  'orig_size': tensor([421, 640], device='cuda:0'), 
  'size': tensor([704, 780], device='cuda:0')
  }
  ]

'''
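For reference, each element of `results` from the 'bbox' postprocessor is a dict with 'scores', 'labels' and 'boxes' (absolute xyxy coordinates at the original image size). A minimal consumer sketch (the 0.7 threshold is my own choice):

for target, output in zip(targets, results):
    keep = output['scores'] > 0.7        # hypothetical confidence cutoff
    boxes = output['boxes'][keep]        # [num_kept, 4], xyxy in original pixels
    labels = output['labels'][keep]
    print(target['image_id'].item(), labels.tolist())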

References

http://jalammar.github.io/illustrated-transformer/

https://zhuanlan.zhihu.com/p/48508221

https://zhuanlan.zhihu.com/p/150635505
