[DETR source code analysis] 2. Backbone module

Foreword

Recently I have been reading the DETR source code. After going through it on and off for about a week, I have sorted out the main model code. I thought for a while about what form this DETR source code analysis should take: one option is to go file by file, line by line, like the YOLOv5 series I wrote before; the other is to walk through the source code by functional module. After much deliberation I settled on the second approach: it saves time, and it also makes it easier for me to understand the project as a whole.

In my view, reading code means breaking the whole model down into functional modules and then stringing all the modules back together, which gets twice the result with half the effort.

Another point I think is very important: when you pick up an open-source project, set up the environment right away so you can run and debug it, then locate the code related to the main model by analyzing train.py, and focus your analysis on the model. Code for logging, computing mAP, plotting and so on can be ignored entirely, which saves a lot of time. So when I explain source code from now on, I will strip out the irrelevant parts and concentrate on the model, the improvements, the loss, and so on.

This section covers the Backbone part of DETR, which consists of two modules: the CNN feature extractor and the position encoding. It mainly involves two files: models/backbone.py and models/position_encoding.py.

Annotated source code on GitHub: HuKai97/detr-annotations

1. The overall structure of Backbone

The Backbone consists of two parts: CNN feature extraction and position encoding. The code is relatively simple, so let's start parsing it.

The entry point is the build_backbone function in models/backbone.py, which creates the Backbone:

def build_backbone(args):
    # Build the backbone
    # Position encoding: PositionEmbeddingSine()
    position_embedding = build_position_encoding(args)
    train_backbone = args.lr_backbone > 0   # whether the backbone should be trained -> True
    return_interm_layers = args.masks       # whether to return intermediate layers: detection False, segmentation True
    # Build the CNN backbone (resnet50)
    backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation)
    # Combine the backbone output with the position encoding   0: backbone   1: PositionEmbeddingSine()
    model = Joiner(backbone, position_embedding)
    model.num_channels = backbone.num_channels   # 2048 for resnet50
    return model

Here, build_position_encoding is called first to create the sine-cosine position encoding position_embedding: [bs, 256, H/32, W/32], where the first 128 of the 256 channels are the y-direction encoding and the last 128 are the x-direction encoding. Then the Backbone class builds a ResNet50 that extracts features from the input and produces a feature map [bs, 2048, H/32, W/32]. Finally, Joiner packs the two together for later use.
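For reference, Joiner itself is very thin: it runs the CNN backbone and then applies the position encoding to every feature map the backbone returns. The sketch below is a simplified paraphrase of the class in models/backbone.py, not a line-for-line copy, so check the official repository for the exact code:

from typing import List

from torch import nn
from util.misc import NestedTensor   # helper used throughout models/backbone.py


class Joiner(nn.Sequential):
    """Run the backbone, then compute a position encoding for each of its outputs."""
    def __init__(self, backbone, position_embedding):
        super().__init__(backbone, position_embedding)

    def forward(self, tensor_list: NestedTensor):
        xs = self[0](tensor_list)                # dict of NestedTensor from the backbone
        out: List[NestedTensor] = []
        pos = []
        for name, x in xs.items():
            out.append(x)                                 # features + mask
            pos.append(self[1](x).to(x.tensors.dtype))    # position encoding for this feature map
        return out, pos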

1. CNN-Backbone

To create the ResNet50, the Backbone class is called first:

class Backbone(BackboneBase):
    """ResNet backbone with frozen BatchNorm."""
    def __init__(self, name: str,
                 train_backbone: bool,
                 return_interm_layers: bool,
                 dilation: bool):
        # Directly reuse the backbone from torchvision.models
        backbone = getattr(torchvision.models, name)(
            replace_stride_with_dilation=[False, False, dilation],
            pretrained=is_main_process(), norm_layer=FrozenBatchNorm2d)
        # resnet50 -> 2048 channels
        num_channels = 512 if name in ('resnet18', 'resnet34') else 2048
        super().__init__(backbone, train_backbone, num_channels, return_interm_layers)

This class inherits from BackboneBase, and the CNN itself comes straight from torchvision.models, so let's look at the BackboneBase class:

class BackboneBase(nn.Module):

    def __init__(self, backbone: nn.Module, train_backbone: bool, num_channels: int, return_interm_layers: bool):
        super().__init__()
        for name, parameter in backbone.named_parameters():
            # Freeze layer0/layer1: the early layers only extract very generic low-level
            # features, so there is little to gain from training them
            if not train_backbone or 'layer2' not in name and 'layer3' not in name and 'layer4' not in name:
                parameter.requires_grad_(False)
        # False for detection: intermediate layers are not needed
        if return_interm_layers:
            return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
        else:
            return_layers = {'layer4': "0"}
        # For detection only layer4 is returned; torchvision.models._utils.IntermediateLayerGetter
        # extracts the outputs of the requested layers directly
        self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
        self.num_channels = num_channels

    def forward(self, tensor_list: NestedTensor):
        """
        tensor_list: padded, preprocessed image batch
        tensor_list.tensors: [bs, 3, 608, 810] preprocessed images; smaller images are padded with zeros
        tensor_list.mask: [bs, 608, 810] records which positions are padding (False for the original image, True for padding)
        """
        # Feed the preprocessed images [bs, 3, 608, 810] through the model;
        # the output is the layer4 result: dict '0' = [bs, 2048, 19, 26]
        xs = self.body(tensor_list.tensors)
        # collect the outputs
        out: Dict[str, NestedTensor] = {}
        for name, x in xs.items():
            m = tensor_list.mask  # image mask [bs, 608, 810]: which positions are valid and which are padding
            assert m is not None
            # Interpolate the mask down to the feature-map resolution so we know which feature
            # positions are valid and which come from padding; the convolutions run over the
            # whole padded image, so many positions of the new feature map are invalid
            mask = F.interpolate(m[None].float(), size=x.shape[-2:]).to(torch.bool)[0]
            # out['0'] = NestedTensor: tensors [bs, 2048, 19, 26] + mask [bs, 19, 26]
            out[name] = NestedTensor(x, mask)
        # out['0'] = NestedTensor: tensors [bs, 2048, 19, 26] + mask [bs, 19, 26]
        return out

This class again just wraps the torchvision.models backbone. The preprocessed image data [bs, 3, 608, 810] and the mask [bs, 608, 810] are fed into the model (the image data has been padded, and the mask records which pixel positions are padding: True for padded positions, False for real, valid pixels). After the forward pass, IntermediateLayerGetter extracts the feature map of the requested layer, giving a feature map [bs, 2048, 19, 26] downsampled 32x from the original image, together with the corresponding mask [bs, 19, 26].
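To make the mask interpolation concrete, here is a tiny standalone sketch with made-up shapes (an 8x12 mask shrunk to a 2x3 "feature map"): the padded region of the input mask is still marked as padding after resizing.

import torch
import torch.nn.functional as F

# fake mask for one 8x12 image whose rightmost 4 columns are padding (True)
m = torch.zeros(1, 8, 12, dtype=torch.bool)
m[:, :, 8:] = True

# downsample the mask to a fake feature-map size of 2x3 (nearest-neighbour interpolation)
mask = F.interpolate(m[None].float(), size=(2, 3)).to(torch.bool)[0]
print(mask.shape)   # torch.Size([1, 2, 3])
print(mask[0])      # the last column is still True (padding), the rest False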

2. Positional Encoding

The position encoding is created by calling the build_position_encoding function in models/position_encoding.py:

def build_position_encoding(args):
    """
    Build the position encoding
    args.hidden_dim: hidden dimension of the transformer
    args.position_embedding: type of position encoding, sine-cosine 'sine' or learnable 'learned'
    """
    # N_steps = 128 = 256 // 2; the transformer works on 256-dimensional features, e.g. [bs, 256, 25, 34]
    # A conventional position encoding would also be 256-dimensional, but DETR concatenates an
    # x-direction encoding with a y-direction encoding (unlike ViT):
    # a 2D position encoding, where the first 128 dimensions encode the y direction and the last 128 the x direction
    N_steps = args.hidden_dim // 2
    if args.position_embedding in ('v2', 'sine'):
        # TODO find a better way of exposing other arguments
        # [bs, 256, 19, 26]; along dim=1 the first 128 channels are the y encoding, the last 128 the x encoding
        position_embedding = PositionEmbeddingSine(N_steps, normalize=True)
    elif args.position_embedding in ('v3', 'learned'):
        position_embedding = PositionEmbeddingLearned(N_steps)
    else:
        raise ValueError(f"not supported {args.position_embedding}")

    return position_embedding
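As a quick sanity check, build_position_encoding only reads two fields from args, so it can be driven with a minimal namespace. The values below are just for illustration and assume the script is run from the repository root:

from argparse import Namespace
from models.position_encoding import build_position_encoding

# hidden_dim=256 gives N_steps=128 per direction; 'sine' selects PositionEmbeddingSine
args = Namespace(hidden_dim=256, position_embedding='sine')
position_embedding = build_position_encoding(args)
print(type(position_embedding).__name__)   # PositionEmbeddingSine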

As you can see, the source code implements two kinds of position encoding: a sine-cosine absolute position encoding, which needs no extra learnable parameters, and a learnable absolute position encoding. The paper uses the sine-cosine encoding and the code uses it by default, so we mainly look at the PositionEmbeddingSine class:

class PositionEmbeddingSine(nn.Module):
    """
    Absolute pos embedding, Sine. No learnable parameters: once defined, it is fixed.
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats    # 128 per direction (x/y) = d_model/2
        self.temperature = temperature        # the constant 10000 in the sine-cosine formula
        self.normalize = normalize            # whether to max-normalize the coordinates -> True
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            # normalize to 2*pi, since the periods of the encoding functions range over [2pi, 20000pi]
            scale = 2 * math.pi  # normalization constant 2*pi
        self.scale = scale

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors   # [bs, 2048, 19, 26] backbone features after 32x downsampling; padded regions were zero-filled
        mask = tensor_list.mask   # [bs, 19, 26] records which positions are padding (False for the original image, True for padding)
        assert mask is not None
        not_mask = ~mask   # True marks the genuinely valid positions

        # The image is 2D, so a 2D sine-cosine position encoding is used
        # Each row/column maps to a different value; valid positions get proper values, padded positions
        # may repeat, but the attention weights will ignore them later
        # The last valid value along each axis equals the number of valid positions, which is handy for max normalization
        # y coordinate of each position  [bs, 19, 26]
        y_embed = not_mask.cumsum(1, dtype=torch.float32)
        # x coordinate of each position  [bs, 19, 26]
        x_embed = not_mask.cumsum(2, dtype=torch.float32)

        # Max normalization: divide by the maximum and multiply by 2*pi, so the coordinates fall in [0, 2*pi]
        if self.normalize:
            eps = 1e-6
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale

        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)   # 0 1 2 .. 127
        # 2i/2i+1: 2 * (dim_t // 2), self.temperature = 10000, self.num_pos_feats = d/2
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)   # the denominator of the formula

        pos_x = x_embed[:, :, :, None] / dim_t   # the argument of sin/cos
        pos_y = y_embed[:, :, :, None] / dim_t   # the argument of sin/cos
        # x-direction encoding: [bs,19,26,64][bs,19,26,64] -> [bs,19,26,64,2] -> [bs,19,26,128]
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        # y-direction encoding: [bs,19,26,64][bs,19,26,64] -> [bs,19,26,64,2] -> [bs,19,26,128]
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        # concat: [bs,19,26,128][bs,19,26,128] -> [bs,19,26,256] -> [bs,256,19,26]
        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)

        # [bs,256,19,26]; along dim=1 the first 128 channels are the y encoding, the last 128 the x encoding
        return pos

Compare this with the sine-cosine formula from "Attention Is All You Need", applied here separately to the normalized x and y coordinates with d = num_pos_feats = 128:

PE(pos, 2i)   = sin(pos / 10000^(2i / d))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d))
A few key points as I understand them:

  1. The position encoding is built from the mask. The mask records whether each position of the feature map is padding; only positions where the mask is False are valid and need a position encoding.
  2. About max normalization: the idea of the sine-cosine encoding is to map each coordinate through the formula into the range 0 to 2π (it could equally be 4π, 6π, 8π..., since the functions are periodic, but 2π is the usual choice), so x_embed and y_embed are normalized before being plugged into the formula.
  3. About the encoding scheme: x and y are encoded separately (a 2D position encoding) rather than with a single 1D encoding as in the original transformer. The transformer was designed for language, which is naturally one-dimensional, so a 1D encoding suits it; DETR is a detection framework for images, where a 2D encoding is likely a better fit.
  4. As a result, for each position (x, y), the encoding of its y coordinate (row) occupies the first 128 channels and the encoding of its x coordinate (column) the last 128 channels, so every position of the feature map gets its own encoding vector (a small sketch verifying this follows the list).
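A minimal sketch to verify the shapes and the channel layout. It uses types.SimpleNamespace as a stand-in for NestedTensor (only .tensors and .mask are accessed), so it is purely illustrative and assumes it is run from the repository root:

import torch
from types import SimpleNamespace

from models.position_encoding import PositionEmbeddingSine

# dummy "backbone output": features [bs, 2048, 19, 26] and a mask whose last columns are padding
feat = torch.randn(2, 2048, 19, 26)
mask = torch.zeros(2, 19, 26, dtype=torch.bool)
mask[:, :, 20:] = True

pe = PositionEmbeddingSine(num_pos_feats=128, normalize=True)
pos = pe(SimpleNamespace(tensors=feat, mask=mask))
print(pos.shape)   # torch.Size([2, 256, 19, 26])

# the first 128 channels encode y only: two fully valid columns share the same y encoding
print(torch.allclose(pos[0, :128, :, 0], pos[0, :128, :, 1]))   # True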

For completeness, you can also look at the second absolute position encoding, the learnable one:

class PositionEmbeddingLearned(nn.Module):
    """
    Absolute pos embedding, learned.
    The whole class simply initializes position encoding parameters of the right shape,
    which are then learned during training.
    """
    def __init__(self, num_pos_feats=256):
        super().__init__()
        # nn.Embedding acts like nn.Parameter here: it is just an initialization
        self.row_embed = nn.Embedding(50, num_pos_feats)
        self.col_embed = nn.Embedding(50, num_pos_feats)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.uniform_(self.row_embed.weight)
        nn.init.uniform_(self.col_embed.weight)

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        h, w = x.shape[-2:]   # feature map height and width
        i = torch.arange(w, device=x.device)
        j = torch.arange(h, device=x.device)
        x_emb = self.col_embed(i)   # x-direction position encoding
        y_emb = self.row_embed(j)   # y-direction position encoding
        # concatenate the x and y position encodings
        pos = torch.cat([
            x_emb.unsqueeze(0).repeat(h, 1, 1),
            y_emb.unsqueeze(1).repeat(1, w, 1),
        ], dim=-1).permute(2, 0, 1).unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
        return pos

As you can see, the class simply initializes position encoding parameters of the appropriate shape and then lets the model learn them during training. The code is straightforward.
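And a quick shape check for the learned variant, again with the SimpleNamespace stand-in from the previous sketch (the learned encoding does not use the mask):

import torch
from types import SimpleNamespace

from models.position_encoding import PositionEmbeddingLearned

feat = torch.randn(2, 2048, 19, 26)
pe_learned = PositionEmbeddingLearned(num_pos_feats=128)
pos = pe_learned(SimpleNamespace(tensors=feat, mask=None))
print(pos.shape)   # torch.Size([2, 256, 19, 26])
# note: in this variant the x (column) embedding occupies the first 128 channels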

Reference

Official source code: https://github.com/facebookresearch/detr

Bilibili source code walkthrough: [iron-clad assembly line worker]

Zhihu [Brother Buffalo]: Interpretation of the DETR source code

CSDN [squirrel working hard]: DETR source code notes (1)

CSDN [squirrel working hard]: DETR source code notes (2)

CSDN: Position encoding in Transformer (position encoding, part 1)

Zhihu [CV will not be wiped out]: Source code analysis of the object detection cross-domain star DETR (1): overview and model inference

Zhihu [CV will not be wiped out]: Source code analysis of the object detection cross-domain star DETR (2): model training process and data processing

Zhihu [CV will not be wiped out]: Source code analysis of the object detection cross-domain star DETR (3): Backbone and position encoding

Zhihu [CV will not be wiped out]: Source code analysis of the object detection cross-domain star DETR (4): Detection with Transformer

Zhihu [CV will not be wiped out]: Source code analysis of the object detection cross-domain star DETR (5): loss function and Hungarian matching algorithm

Zhihu [CV will not be wiped out]: Source code analysis of the object detection cross-domain star DETR (6): model output and prediction generation
