DAB-DETR Code Learning Notes: Model Analysis

DAB-DETR builds on ideas from Deformable-DETR, Conditional-DETR, Anchor-DETR and related work. Its main contribution is to formulate each query explicitly as a 4D anchor box in the form x, y, w, h.
This post analyzes the work done by DAB-DETR from the perspective of the code.
Since DAB-DETR mainly improves the Decoder, the analysis focuses on the Decoder module.

Positional encoding temperature adjustment

The first change is in the position_encoding.py file, which defines a new class, PositionEmbeddingSineHW. Its purpose is to give the width and height of the sinusoidal positional encoding separate temperature values. The file otherwise keeps both the sine/cosine positional encoding and the learnable positional encoding.

class PositionEmbeddingSineHW(nn.Module):
    """
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperatureH=10000, temperatureW=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats
        self.temperatureH = temperatureH
        self.temperatureW = temperatureW
        self.normalize = normalize
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            scale = 2 * math.pi
        self.scale = scale
    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        mask = tensor_list.mask
        assert mask is not None
        not_mask = ~mask
        y_embed = not_mask.cumsum(1, dtype=torch.float32)
        x_embed = not_mask.cumsum(2, dtype=torch.float32)
        # import ipdb; ipdb.set_trace()
        if self.normalize:
            eps = 1e-6
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale
        dim_tx = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)
        dim_tx = self.temperatureW ** (2 * (dim_tx // 2) / self.num_pos_feats)
        pos_x = x_embed[:, :, :, None] / dim_tx
        dim_ty = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)
        dim_ty = self.temperatureH ** (2 * (dim_ty // 2) / self.num_pos_feats)
        pos_y = y_embed[:, :, :, None] / dim_ty
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
        # import ipdb; ipdb.set_trace()
        return pos
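
As a quick usage sketch (the import paths and temperature values here are assumptions for illustration, not necessarily the repo's defaults), the module can be driven like this:

import torch
from models.position_encoding import PositionEmbeddingSineHW  # module path assumed to match the repo layout
from util.misc import NestedTensor                            # DETR-style wrapper of (tensors, mask)

features = torch.randn(2, 256, 19, 24)           # the backbone output shape used throughout this post
mask = torch.zeros(2, 19, 24, dtype=torch.bool)  # False = valid pixel, True = padding
pos_embedding = PositionEmbeddingSineHW(num_pos_feats=128,
                                        temperatureH=20, temperatureW=20,
                                        normalize=True)
pos = pos_embedding(NestedTensor(features, mask))
print(pos.shape)  # torch.Size([2, 256, 19, 24]) -- 128 channels for y plus 128 for x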

Transformer overall architecture

Let's first look at the overall architecture of the Transformer, starting with the arguments passed into forward:
src: the feature map extracted by the backbone; its shape starts as torch.Size([2, 256, 19, 24]) and becomes torch.Size([456, 2, 256]) after flattening.
mask: the padding mask of the image; its shape starts as torch.Size([2, 19, 24]) and is flattened to torch.Size([2, 456]).

refpoint_embed: the reference-point (anchor) coordinates, i.e. the object queries, of shape torch.Size([300, 4]). They are used in the Decoder and are initialized in the DAB-DETR module as self.refpoint_embed = nn.Embedding(num_queries, query_dim). The initial torch.Size([300, 4]) becomes torch.Size([300, 2, 4]) after refpoint_embed = refpoint_embed.unsqueeze(1).repeat(1, bs, 1).

pos_embed: the positional encoding; its shape starts as torch.Size([2, 256, 19, 24]) and becomes torch.Size([456, 2, 256]).
The code that performs these reshapes is as follows:

    # flatten NxCxHxW to HWxNxC
    bs, c, h, w = src.shape                              # initially 2, 256, 19, 24
    src = src.flatten(2).permute(2, 0, 1)                # flatten to (HW, N, C)
    pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
    refpoint_embed = refpoint_embed.unsqueeze(1).repeat(1, bs, 1)
    mask = mask.flatten(1)

Then the data is sent to the Encoder module; the output memory has shape torch.Size([456, 2, 256]):

 memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)

Then tgt is initialized. Its form depends on self.num_patterns, which defaults to 0 here, so tgt is initialized to all zeros with shape torch.Size([300, 2, 256]); as in DETR, it serves as the initial decoder input.

num_queries = refpoint_embed.shape[0]
if self.num_patterns == 0:
    tgt = torch.zeros(num_queries, bs, self.d_model, device=refpoint_embed.device)
else:
    tgt = self.patterns.weight[:, None, None, :].repeat(1, self.num_queries, bs, 1).flatten(0, 1)  # n_q*n_pat, bs, d_model
    refpoint_embed = refpoint_embed.repeat(self.num_patterns, 1, 1)  # n_q*n_pat, bs, d_model

Then send it to the Decoder module:

hs, references = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                              pos=pos_embed, refpoints_unsigmoid=refpoint_embed)
return hs, references

Encoder module construction

The Encoder module of DAB-DETR is not much different from DETR.

EncoderLayer

src_mask = None
src_key_padding_mask: the padding mask of the image, with shape [2, 456]
src: the features extracted by ResNet, flattened from 2D into a sequence, with shape torch.Size([456, 2, 256])
pos: positional encoding information. Originally there are two kinds, sine/cosine encoding and learnable encoding; DAB-DETR additionally introduces the encoding with separate width and height temperature values described above. The shape is torch.Size([456, 2, 256])
src2 is obtained through self-attention with shape torch.Size([456, 2, 256]); it then passes through the dropout and norm layers. The final output src has shape torch.Size([456, 2, 256]) and is sent to the Decoder.

q = k = self.with_pos_embed(src, pos)
src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
src = src + self.dropout1(src2)
src = self.norm1(src)
src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
src = src + self.dropout2(src2)
src = self.norm2(src)
return src

As in DETR, with_pos_embed simply adds the positional encoding to the features.

def with_pos_embed(self, tensor, pos: Optional[Tensor]):
    return tensor if pos is None else tensor + pos

Encoder module

The Encoder consists of 6 EncoderLayers.

class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers, norm=None, d_model=256):
        super().__init__()
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers
        self.query_scale = MLP(d_model, d_model, d_model, 2)
        self.norm = norm
    def forward(self, src,
                mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        output = src
        for layer_id, layer in enumerate(self.layers):
            # rescale the content and pos sim
            pos_scales = self.query_scale(output)
            output = layer(output, src_mask=mask,
                           src_key_padding_mask=src_key_padding_mask, pos=pos*pos_scales)
        if self.norm is not None:
            output = self.norm(output)
        return output
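
The query_scale above (and the ref_point_head / ref_anchor_head used later) are instances of the small MLP helper that DETR-style code bases ship; for reference, a sketch of that helper, mirroring DETR's implementation, is:

import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Simple multi-layer perceptron (a stack of Linear + ReLU layers), as used in DETR-style repos."""
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.num_layers = num_layers
        h = [hidden_dim] * (num_layers - 1)
        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        return x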

Decoder brief overview

In the Decoder, each query is an anchor box (Anchor Box) in the form x, y, w, h. The anchors, of shape torch.Size([300, 2, 4]) (300 queries, batch size 2), go through an anchor sine encoding: x, y, w and h are each mapped to 128 dimensions, giving 4 × 128 = 512 dimensions in total, which an MLP then projects down to 256.
The positional encoding therefore covers all four coordinates, each encoded into 128 dimensions.

The next piece is one of the main innovations: a width- and height-modulated attention mechanism, added so that the positional attention becomes more sensitive to object width and height.

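In formula form (a paraphrase of the modulation that the decoder code below applies; here $w_{q,ref}$ and $h_{q,ref}$ are predicted from the decoder embedding by a small MLP followed by a sigmoid, while $w_q$ and $h_q$ are the width and height of the current anchor), the positional part of the cross-attention becomes roughly:

$$
\mathrm{ModulateAttn}\big((x, y), (x_{ref}, y_{ref})\big) =
\Big( \mathrm{PE}(x) \cdot \mathrm{PE}(x_{ref}) \, \tfrac{w_{q,ref}}{w_q}
    + \mathrm{PE}(y) \cdot \mathrm{PE}(y_{ref}) \, \tfrac{h_{q,ref}}{h_q} \Big) / \sqrt{D}
$$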

Decoder module code implementation

First, output is assigned the value of tgt; output is the running decoder state, with shape torch.Size([300, 2, 256])

output = tgt

Then refpoints_unsigmoid is normalized with a sigmoid to obtain reference_points; the shape is still torch.Size([300, 2, 4])

reference_points = refpoints_unsigmoid.sigmoid()

Inside the Decoder loop, reference_points are first passed through the sinusoidal position encoding: all four coordinates are encoded, changing the shape from torch.Size([300, 2, 4]) to torch.Size([300, 2, 512]), i.e. 128 dimensions per coordinate.
Then self.ref_point_head (an MLP) maps this to torch.Size([300, 2, 256]):

obj_center = reference_points[..., :self.query_dim]  #torch.Size([300, 2, 4])  
query_sine_embed = gen_sineembed_for_position(obj_center) #torch.Size([300,2,512])
query_pos = self.ref_point_head(query_sine_embed) #torch.Size([300, 2, 256])

The gen_sineembed_for_position function is as follows:

def gen_sineembed_for_position(pos_tensor):
    # n_query, bs, _ = pos_tensor.size()
    # sineembed_tensor = torch.zeros(n_query, bs, 256)
    scale = 2 * math.pi
    dim_t = torch.arange(128, dtype=torch.float32, device=pos_tensor.device)
    dim_t = 10000 ** (2 * (dim_t // 2) / 128)
    x_embed = pos_tensor[:, :, 0] * scale
    y_embed = pos_tensor[:, :, 1] * scale
    pos_x = x_embed[:, :, None] / dim_t
    pos_y = y_embed[:, :, None] / dim_t
    pos_x = torch.stack((pos_x[:, :, 0::2].sin(), pos_x[:, :, 1::2].cos()), dim=3).flatten(2)
    pos_y = torch.stack((pos_y[:, :, 0::2].sin(), pos_y[:, :, 1::2].cos()), dim=3).flatten(2)
    if pos_tensor.size(-1) == 2:
        pos = torch.cat((pos_y, pos_x), dim=2)
    elif pos_tensor.size(-1) == 4:
        w_embed = pos_tensor[:, :, 2] * scale
        pos_w = w_embed[:, :, None] / dim_t
        pos_w = torch.stack((pos_w[:, :, 0::2].sin(), pos_w[:, :, 1::2].cos()), dim=3).flatten(2)
        h_embed = pos_tensor[:, :, 3] * scale
        pos_h = h_embed[:, :, None] / dim_t
        pos_h = torch.stack((pos_h[:, :, 0::2].sin(), pos_h[:, :, 1::2].cos()), dim=3).flatten(2)
        pos = torch.cat((pos_y, pos_x, pos_w, pos_h), dim=2)
    else:
        raise ValueError("Unknown pos_tensor shape(-1):{}".format(pos_tensor.size(-1)))
    return pos
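
A quick standalone sanity check of the shapes (illustrative only, with the function above in scope):

import torch

anchors = torch.rand(300, 2, 4)            # (num_queries, batch, x/y/w/h), normalized to [0, 1]
emb = gen_sineembed_for_position(anchors)
print(emb.shape)                           # torch.Size([300, 2, 512]) -- 128 dims each for y, x, w, h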

Next comes some initialization. self.query_scale is an MLP, and output can be regarded as the output of the previous Decoder layer.
The first 256 dimensions of query_sine_embed (the x, y part) are taken and multiplied by pos_transformation (which is 1 in the first layer).
self.ref_anchor_head = MLP(d_model, d_model, 2, 2) is an MLP with input dimension 256, hidden width 256, output dimension 2 and 2 layers.
refHW_cond has shape torch.Size([300, 2, 2]).
query_sine_embed is initially torch.Size([300, 2, 512]); after query_sine_embed = query_sine_embed[..., :self.d_model] * pos_transformation it becomes torch.Size([300, 2, 256]). That line keeps only the first 256 dimensions.

if self.query_scale_type != 'fix_elewise':  # this branch is taken
    if layer_id == 0:                       # first layer
        pos_transformation = 1
    else:
        pos_transformation = self.query_scale(output)  # query_scale is an MLP
else:
    pos_transformation = self.query_scale.weight[layer_id]
# keep the first 256 dims of query_sine_embed (the x, y part) and multiply by pos_transformation
query_sine_embed = query_sine_embed[..., :self.d_model] * pos_transformation
if self.modulate_hw_attn:
    refHW_cond = self.ref_anchor_head(output).sigmoid()  # MLP + sigmoid, torch.Size([300, 2, 2])
    query_sine_embed[..., self.d_model // 2:] *= (refHW_cond[..., 0] / obj_center[..., 2]).unsqueeze(-1)
    query_sine_embed[..., :self.d_model // 2] *= (refHW_cond[..., 1] / obj_center[..., 3]).unsqueeze(-1)

The code above applies the scaling of the positional query described in the paper. Note that it is not that PE(x_ref) and PE(y_ref) go unscaled here; in the first layer the scale is simply set to 1 (pos_transformation = 1), which becomes apparent in the second DecoderLayer.

Then the data is sent into the DecoderLayer; note that this is the first DecoderLayer.

output = layer(output, memory, tgt_mask=tgt_mask,
               memory_mask=memory_mask,
               tgt_key_padding_mask=tgt_key_padding_mask,
               memory_key_padding_mask=memory_key_padding_mask,
               pos=pos, query_pos=query_pos, query_sine_embed=query_sine_embed,
               is_first=(layer_id == 0))

The first layer DecoderLayer module

Self_Attention

First, the self-attention inside the DecoderLayer is computed.
Let's trace how the data changes:
tgt, i.e. the output of the previous DecoderLayer, is all zeros at this point, with shape torch.Size([300, 2, 256]).
It first passes through a linear layer (sa_qcontent_proj = nn.Linear(d_model, d_model)) to get q_content, with shape torch.Size([300, 2, 256]).
Note that although tgt is all zeros, the q, k, v produced from it by these linear layers are not: the projection's bias term makes them non-zero.
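
A tiny standalone check of that last point (illustrative, not repo code):

import torch
import torch.nn as nn

lin = nn.Linear(256, 256)        # stands in for sa_qcontent_proj
tgt = torch.zeros(300, 2, 256)
q_content = lin(tgt)
print(torch.allclose(q_content[0, 0], lin.bias))  # True: a zero input simply returns the bias vector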

Then q_pos (the x, y, w, h information obtained by passing the anchor through the sinusoidal encoding and the MLP) also goes through a linear layer, sa_qpos_proj, which keeps the dimension unchanged: the shape is torch.Size([300, 2, 256]).
k and v are initialized in the same way. As in DETR, v carries no positional information.

To sum up, in self-attention the query_pos derived from the Anchor Box provides the positional information, while the content information comes from the all-zero initialization (or, in later layers, from the previous DecoderLayer's output). Positional and content information are merged by addition, e.g. q = q_content + q_pos.
The rest is exactly the same as in DETR: q, k, v are simply fed into the attention computation.

if not self.rm_self_attn_decoder:
    # Apply projections here
    # shape: num_queries x batch_size x 256
    q_content = self.sa_qcontent_proj(tgt)      # target is the input of the first decoder layer. zero by default.
    q_pos = self.sa_qpos_proj(query_pos)
    k_content = self.sa_kcontent_proj(tgt)
    k_pos = self.sa_kpos_proj(query_pos)
    v = self.sa_v_proj(tgt)
    num_queries, bs, n_model = q_content.shape
    hw, _, _ = k_content.shape
    q = q_content + q_pos
    k = k_content + k_pos
    tgt2 = self.self_attn(q, k, value=v, attn_mask=tgt_mask,
                          key_padding_mask=tgt_key_padding_mask)[0]
    # tgt2 is the attention output, torch.Size([300, 2, 256])
    # ========== End of Self-Attention =============
    tgt = tgt + self.dropout1(tgt2)
    tgt = self.norm1(tgt)

Finally, the self-attention output tgt is obtained, with shape torch.Size([300, 2, 256]). The code above corresponds to the self-attention branch of the DecoderLayer in the paper's architecture diagram.

The result is then ready to be fed into the cross-attention computation.

Cross_Attention

First comes the initialization of q, k, v. As can be seen, q comes from the output of self-attention (after a linear layer), while k and v come from the Encoder output memory, whose shape is torch.Size([456, 2, 256]).

q_content = self.ca_qcontent_proj(tgt)     # torch.Size([300, 2, 256])
k_content = self.ca_kcontent_proj(memory)  # torch.Size([456, 2, 256])
v = self.ca_v_proj(memory)                 # torch.Size([456, 2, 256])

k_pos = self.ca_kpos_proj(pos)  # positional encoding for K; pos comes from the Encoder. torch.Size([456, 2, 256])

Since this is the first layer, the following branch is taken: query_pos (torch.Size([300, 2, 256])) is passed through a fully connected layer, keeping the dimension unchanged — this is how q_pos is produced:

if is_first or self.keep_query_pos:       # self.keep_query_pos defaults to False
    q_pos = self.ca_qpos_proj(query_pos)  # query_pos: torch.Size([300, 2, 256])
    q = q_content + q_pos
    k = k_content + k_pos
else:
    q = q_content
    k = k_content

Next is the preparation of the Q, K, V that are sent into cross-attention. Note that the head-splitting and the concatenation of the content and positional parts are done explicitly here, outside the attention module, whereas originally this was handled inside the attention.

q = q.view(num_queries, bs, self.nhead, n_model // self.nhead)  # split q into heads: torch.Size([300, 2, 8, 32])
query_sine_embed = self.ca_qpos_sine_proj(query_sine_embed)     # project the positional sine embedding of the anchor
query_sine_embed = query_sine_embed.view(num_queries, bs, self.nhead, n_model // self.nhead)
q = torch.cat([q, query_sine_embed], dim=3).view(num_queries, bs, n_model * 2)
# after concatenation q becomes torch.Size([300, 2, 512])
k = k.view(hw, bs, self.nhead, n_model // self.nhead)           # torch.Size([456, 2, 8, 32])
k_pos = k_pos.view(hw, bs, self.nhead, n_model // self.nhead)   # torch.Size([456, 2, 8, 32])
k = torch.cat([k, k_pos], dim=3).view(hw, bs, n_model * 2)      # torch.Size([456, 2, 512])
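
To make the head-wise concatenation concrete, here is a minimal standalone shape check (the numbers are just the ones used throughout this post):

import torch

nq, bs, d_model, nhead = 300, 2, 256, 8
head_dim = d_model // nhead                  # 32

q_content = torch.randn(nq, bs, d_model).view(nq, bs, nhead, head_dim)
q_sine = torch.randn(nq, bs, d_model).view(nq, bs, nhead, head_dim)
q = torch.cat([q_content, q_sine], dim=3).view(nq, bs, d_model * 2)
print(q.shape)  # torch.Size([300, 2, 512]) -- every head sees 32 content dims plus 32 positional dims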

Then Q, K, V are sent to cross-attention. To summarize the shapes: q is torch.Size([300, 2, 512]), k is torch.Size([456, 2, 512]), v is torch.Size([456, 2, 256]).

tgt2 = self.cross_attn(query=q, key=k, value=v, attn_mask=memory_mask, key_padding_mask=memory_key_padding_mask)[0]    

Concretely, this executes the following, an attention variant in which Q/K and V are allowed to have different dimensions (note out_dim=self.vdim):

return multi_head_attention_forward(
                query, key, value, self.embed_dim, self.num_heads,
                self.in_proj_weight, self.in_proj_bias,
                self.bias_k, self.bias_v, self.add_zero_attn,
                self.dropout, self.out_proj.weight, self.out_proj.bias,
                training=self.training,
                key_padding_mask=key_padding_mask, need_weights=need_weights,
                attn_mask=attn_mask, out_dim=self.vdim)

After the cross-attention computation, tgt2 has shape torch.Size([300, 2, 256]); the dimension change follows the standard attention computation (the value dimension determines the output dimension).

Then, after the residual connection, layer normalization and the feed-forward sublayer, the output still has shape torch.Size([300, 2, 256]).

Anchor update strategy

This module is another innovation of DAB-DETR: the anchor update strategy (Anchor Update).

After the cross-attention of a DecoderLayer, its output is passed to the next DecoderLayer and is also used to update the anchors: an MLP predicts offsets for x, y, w, h with shape torch.Size([300, 2, 4]), which are added to the current reference point coordinates reference_points (i.e. the Anchor Boxes, shape torch.Size([300, 2, 4])). This is the anchor update strategy; in the original DETR the initialized anchors (queries) stay fixed across layers.

if self.bbox_embed is not None:
    if self.bbox_embed_diff_each_layer:  # use a separate bbox head per layer; False by default (shared head)
        tmp = self.bbox_embed[layer_id](output)
    else:
        tmp = self.bbox_embed(output)    # MLP predicts the x, y, w, h offsets, torch.Size([300, 2, 4])
    # import ipdb; ipdb.set_trace()
    tmp[..., :self.query_dim] += inverse_sigmoid(reference_points)
    new_reference_points = tmp[..., :self.query_dim].sigmoid()
    if layer_id != self.num_layers - 1:
        ref_points.append(new_reference_points)
    reference_points = new_reference_points.detach()
if self.return_intermediate:
    intermediate.append(self.norm(output))
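
The inverse_sigmoid above maps the normalized reference points back into logit space so the predicted offsets can be added before re-applying the sigmoid; the Deformable-DETR-style helper usually looks roughly like this (sketch):

import torch

def inverse_sigmoid(x, eps=1e-5):
    # clamp to avoid log(0) and division by zero at the boundaries
    x = x.clamp(min=0, max=1)
    x1 = x.clamp(min=eps)
    x2 = (1 - x).clamp(min=eps)
    return torch.log(x1 / x2)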

As the code above shows, reference_points are updated continuously, layer by layer — this is the Anchor update strategy.
A note on detach(): to support automatic differentiation, PyTorch records every tensor operation that may require gradients (i.e. requires_grad is True) in a directed graph. detach() constructs a new view of a tensor that is declared not to require gradients, so the updated reference points are treated as constants by the next layer.

The second layer DecoderLayer module

The second DecoderLayer has the same structure as the first. The differences are that the decoder embedding tgt, which was all zeros in the first layer, is now the output of the first layer, and, because of the anchor update strategy, the Anchor Boxes of the second layer are the Anchor Boxes of the first layer plus the predicted x, y, w, h offsets.

First, the change in the Anchor Boxes: reference_points (i.e. the Anchor Boxes) have been updated by the previous DecoderLayer; they are again passed through the sinusoidal encoding and then the MLP, giving a tensor of shape torch.Size([300, 2, 256]):

obj_center = reference_points[..., :self.query_dim]  
query_sine_embed = gen_sineembed_for_position(obj_center)  
query_pos = self.ref_point_head(query_sine_embed) 

The difference comes next. query_scale_type is cond_elewise here, and since this is the second layer, output (i.e. the previous layer's result) is passed through

self.query_scale = MLP(d_model, d_model, d_model, 2)

to obtain pos_transformation, whose shape is torch.Size([300, 2, 256]).

Next, query_sine_embed[..., :self.d_model] * pos_transformation is computed. query_sine_embed here is torch.Size([300, 2, 512]); its first 256 dimensions correspond to the x, y part, i.e. PE(x_ref), PE(y_ref), and they are multiplied element-wise by pos_transformation, the transformation conditioned on the previous layer's output. This realizes the scaled positional query of the paper.

So it is not that PE(x_ref) and PE(y_ref) go unscaled in the first layer; the scale there is simply 1.
The subsequent process is exactly the same as in the first DecoderLayer.

Decoder module

After the DecoderLayer loop ends, the following is executed. intermediate stores the result of each layer: it is a list with 6 entries, each of shape torch.Size([300, 2, 256]). The sixth entry is then discarded (pop) and replaced by the final, normalized output (append).

if self.norm is not None:
    output = self.norm(output)
    if self.return_intermediate:
        intermediate.pop()
        intermediate.append(output)

Then the code checks whether bbox_embed (the MLP box-prediction head) is None; torch.stack stacks the per-layer results.

if self.return_intermediate:
    if self.bbox_embed is not None:
        return [
            torch.stack(intermediate).transpose(1, 2),
            torch.stack(ref_points).transpose(1, 2),
        ]
    else:
        return [
            torch.stack(intermediate).transpose(1, 2),
            reference_points.unsqueeze(0).transpose(1, 2)
        ]

Transformer's decoder module finally returns the result:

hs, references = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, refpoints_unsigmoid=refpoint_embed)

Here references has shape torch.Size([6, 2, 300, 4]) and hs has shape torch.Size([6, 2, 300, 256]); these are the Decoder's return values. references holds the box after each update, while hs is the semantic (content) feature information.

DAB-DETR overall module

After the Transformer come the classification head and the regression head.
First, the reference (Anchor Boxes) from the Decoder is mapped back to logit space with inverse_sigmoid; then the regression head is applied to hs (the Decoder output, equivalent to DETR's output) to obtain tmp, whose shape is torch.Size([6, 2, 300, 4]). tmp is added to the processed reference (self.query_dim is 4, so all four coordinates are added), and finally tmp is passed through a sigmoid.

if not self.bbox_embed_diff_each_layer:  # shared box head across layers
    reference_before_sigmoid = inverse_sigmoid(reference)  # map back to logit space
    tmp = self.bbox_embed(hs)                              # torch.Size([6, 2, 300, 4])
    tmp[..., :self.query_dim] += reference_before_sigmoid
    outputs_coord = tmp.sigmoid()
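
The classification head and the assembly of the final output dictionary follow the familiar DETR pattern; a hedged sketch (not copied from the repo, and assuming self.class_embed is an nn.Linear(hidden_dim, num_classes) as in DETR-style models) looks like this:

outputs_class = self.class_embed(hs)      # torch.Size([6, 2, 300, 91])
out = {'pred_logits': outputs_class[-1],  # last decoder layer: torch.Size([2, 300, 91])
       'pred_boxes': outputs_coord[-1]}   # torch.Size([2, 300, 4])
if self.aux_loss:
    # auxiliary outputs: predictions of the first 5 decoder layers
    out['aux_outputs'] = [{'pred_logits': a, 'pred_boxes': b}
                          for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]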

outputs_coord gives the predicted boxes in x, y, w, h form.
The final returned results are:
pred_logits: the class predictions (91 classes here), torch.Size([2, 300, 91])
pred_boxes: the box predictions, torch.Size([2, 300, 4])
aux_outputs: the outputs of the first 5 DecoderLayers, a list with 5 entries.

Original post: blog.csdn.net/pengxiang1998/article/details/130208479