【Paper Notes】Contextual Transformer Networks for Visual Recognition

Paper

Paper title: Contextual Transformer Networks for Visual Recognition

Accepted by: CVPR 2021

Paper address: [2107.12292] Contextual Transformer Networks for Visual Recognition (arxiv.org)

Project address: GitHub - JDAI-CV/CoTNet: This is an official implementation for "Contextual Transformer Networks for Visual Recognition".

Reference blogs: ResNet super variant: Jingdong AI's new open source computer vision module! (with source code) (qq.com)

CoTNet study notes - Zhihu (zhihu.com)

CoTNet | Performance surpasses BoTNet and Swin! Transformer+CNN= laying a new pattern of CV model (with code interpretation) - Zhihu (zhihu.com)

CNN is finally back! JD AI's strongest open-source ResNet variant CoTNet: a plug-and-play visual recognition module - Tencent Cloud Developer Community - Tencent Cloud (tencent.com)

Guide: This article is an exploration of the self-attention mechanism by Tao Mei's team at JD AI Research. Unlike existing attention mechanisms, which obtain context information in either a purely local or purely global way, they creatively integrate the dynamic context aggregation of Transformer self-attention with the static context aggregation of convolution, and propose a novel Transformer-style "plug-and-play" CoT module that can directly replace the 3×3 convolution in the Bottleneck of existing ResNet architectures and bring significant performance gains. On both ImageNet classification and COCO detection and segmentation, the proposed CoTNet architectures achieve significant improvements while keeping the number of parameters and FLOPs at the same level. For example, compared with EfficientNet-B6's 84.3%, the proposed SE-CoTNetD-152 achieves 84.6% top-1 accuracy with 2.75 times faster inference.

Foreword

Transformers with self-attention have sparked a revolution in natural language processing and have recently inspired Transformer-style architecture designs that achieve competitive results on a wide range of computer vision tasks. Nevertheless, most existing designs apply self-attention directly on 2D feature maps to obtain the attention matrix from isolated query-key pairs at each spatial location, without fully exploiting the rich context between neighboring keys. In the work shared today, the researchers design a novel Transformer-style module, the Contextual Transformer (CoT) block, for visual recognition. The design fully exploits the contextual information among the input keys to guide the learning of a dynamic attention matrix, thereby strengthening the capacity of the visual representation. Technically, the CoT block first encodes the input keys contextually via a 3×3 convolution, yielding a static contextual representation of the input.

The encoded keys are then concatenated with the input queries, and a dynamic multi-head attention matrix is learned through two consecutive 1×1 convolutions. The learned attention matrix is multiplied by the input values to obtain a dynamic contextual representation of the input, and the fusion of the static and dynamic contextual representations forms the final output. The CoT block is attractive because it can readily replace every 3×3 convolution in the ResNet architecture, yielding a Transformer-style backbone called Contextual Transformer Networks (CoTNet). The superiority of CoTNet as a stronger backbone is verified through extensive experiments on a wide range of applications, such as image recognition, object detection, and instance segmentation.

Background

The attention mechanism and the self-attention mechanism

The main reasons for using the self-attention mechanism in the CV field are as follows:

  • In a CNN, the convolution layer obtains output features through convolution kernels, but the receptive field of a kernel is very small
  • Stacking convolutional layers to enlarge the receptive field is inefficient
  • The performance of many computer vision tasks suffers from insufficient semantic information
  • The self-attention mechanism can capture global information and thus obtain a larger receptive field

  • Why the attention mechanism?

Before Attention was born, there were already CNNs, RNNs, and their variants, so why introduce the attention mechanism? There are mainly two reasons:

(1) The limitation of computing power: when a lot of information needs to be remembered, the model becomes more complex, yet computing power is still a bottleneck that limits the development of neural networks.

(2) The limitation of optimization algorithms: LSTM can only alleviate the long-range dependency problem of RNNs to a certain extent, and its ability to "memorize" information is not strong.

  • What is the attention mechanism?

Before introducing the attention mechanism, think about what happens when you look at a picture full of information: what do you notice first? When an overload of information comes into view, our brain focuses on the key information. This is the brain's attention mechanism.

Similarly, when we read a sentence, the brain first remembers the important words. This is why the attention mechanism can be applied to natural language processing tasks: people proposed the Attention mechanism by imitating the way the human brain handles information overload.

Self-attention is one kind of attention mechanism and an important component of the Transformer. As a variant of the attention mechanism, self-attention reduces the dependence on external information and is better at capturing the internal correlations of data or features. Applied to text, self-attention mainly addresses the long-range dependency problem by computing the mutual influence between words.
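As a concrete reference, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch; the tensor names and dimensions are illustrative and not tied to any particular paper:

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (N, d) sequence of N tokens; w_q / w_k / w_v: (d, d) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (N, N) pairwise similarities
    attn = F.softmax(scores, dim=-1)                       # every query attends over all keys
    return attn @ v                                        # weighted sum of the values

x = torch.randn(16, 64)                                    # 16 tokens, 64 dimensions each
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                     # (16, 64), global receptive field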

Framework

1. Multi-head Self-attention in Vision Backbones

Here, a general formulation of scalable local multi-head self-attention in vision backbones is given first, as shown in part (a) of the paper's framework figure.

Formally, given an input 2D feature map X (of size H × W × C), it is transformed into queries Q = XWq, keys K = XWk, and values V = XWv by embedding matrices (Wq, Wk, Wv), respectively. Notably, each embedding matrix is implemented as a 1×1 convolution in space.

The local relation matrix R between the keys and the query within each k × k grid is measured as R = K ⊛ Q (where ⊛ denotes the local matrix multiplication), and R is then further enriched with the position information of each k × k grid: R̂ = R + P ⊛ Q, where P denotes the 2D relative position embeddings of the grid.

Next, the attention matrix A is obtained by normalizing the enhanced, spatially-aware local relation matrix R̂ with a Softmax operation along the channel dimension of each head: A = Softmax(R̂). The feature vector at each spatial position of A is reshaped into Ch local attention matrices (each of size k × k), and the final output feature map is computed as the aggregation of all values within each k × k grid with the learned local attention matrices: Y = V ⊛ A.

It is worth noting that the local attention matrix of each head is computed from the V feature maps evenly partitioned along the channel dimension, and the final output Y is the concatenation of the aggregated feature maps of all heads.
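As a rough, single-head sketch of this local (k × k grid) self-attention, the following PyTorch snippet uses torch.nn.functional.unfold; the relative position term P ⊛ Q is omitted and all names and shapes are illustrative rather than the paper's implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

def local_self_attention(x, w_q, w_k, w_v, k=3):
    # x: (B, C, H, W); w_q / w_k / w_v are 1×1 convolutions producing queries, keys and values
    B, C, H, W = x.shape
    q, key, v = w_q(x), w_k(x), w_v(x)
    # gather the k*k neighbourhood of every spatial position for the keys and values
    key = F.unfold(key, k, padding=k // 2).view(B, C, k * k, H, W)
    v   = F.unfold(v,   k, padding=k // 2).view(B, C, k * k, H, W)
    # local relation matrix R = K ⊛ Q: similarity of each query with its k*k surrounding keys
    R = (key * q.unsqueeze(2)).sum(dim=1)          # (B, k*k, H, W)
    A = F.softmax(R, dim=1)                        # attention over each k*k grid
    # Y = V ⊛ A: aggregate the k*k values with the learned local attention
    return (v * A.unsqueeze(1)).sum(dim=2)         # (B, C, H, W)

w_q, w_k, w_v = nn.Conv2d(32, 32, 1), nn.Conv2d(32, 32, 1), nn.Conv2d(32, 32, 1)
y = local_self_attention(torch.randn(2, 32, 16, 16), w_q, w_k, w_v, k=3)  # (2, 32, 16, 16)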

2. Contextual Transformer Block

Traditional self-attention nicely triggers feature interactions across different spatial locations, depending on the input itself. However, in traditional self-attention mechanisms, all pairwise query-key relations are learned independently from isolated query-key pairs, without exploring the rich context in between. This severely limits the capacity of self-attention learning over 2D feature maps for visual representation learning.

To alleviate this problem, the researchers constructed a new Transformer-style building block, the Contextual Transformer (CoT) block shown in part (b) of the framework figure, which integrates contextual information mining and self-attention learning into a unified architecture.

3. Contextual Transformer Networks

ResNet-50 (left) and CoTNet-50 (right)

ResNeXt-50 with a 32×4d template (left) and CoTNeXt-50 with a 2×48d template (right).

The above table compares the network structures of ResNet-50, CoTNet-50, ResNeXt-50, and CoTNeXt-50 in detail:

  • CoTNet-50 is constructed by replacing all 3×3 convolutions in ResNet-50 with CoT modules. Since the CoT module has a structure similar to a typical convolution, CoTNet-50 has a similar (or even slightly smaller) number of parameters and FLOPs than ResNet-50.
  • CoTNeXt-50 is constructed in the same way. Compared with a typical convolution, when the number of groups (i.e., C in the above table) increases, the kernel depth inside the group convolution decreases significantly, so the computational cost of the group convolution is reduced by a factor of C (see the sketch after this list). In order to keep the number of parameters and FLOPs similar to ResNeXt-50, the input feature dimension is additionally adjusted.
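A quick sanity check of the "reduced by a factor of C" claim, counting the weights of a 3×3 convolution with PyTorch (the channel numbers are only an example, not the exact CoTNeXt-50 configuration):

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

dense   = nn.Conv2d(256, 256, 3, padding=1, bias=False)             # 3*3*256*256      = 589,824
grouped = nn.Conv2d(256, 256, 3, padding=1, groups=32, bias=False)  # 3*3*256*(256/32) = 18,432
print(n_params(dense) // n_params(grouped))                         # 32, i.e. C times fewer weights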

The paper also discusses some related work:

  • Blueprint Separable Convolution [18] approximates a traditional convolution with a 1×1 pointwise convolution followed by a k×k depthwise convolution, which effectively reduces network redundancy (a minimal sketch of this factorization is given after this list). Commonality: the Transformer-style block also uses 1×1 pointwise convolutions to convert the input features into values, and performs the subsequent aggregation in a similar manner. Furthermore, for each head, the aggregation in the Transformer-style block adopts a channel-sharing strategy to improve network efficiency without a significant accuracy drop.
  • Dynamic Region-Aware Convolution [6] introduces a filter generator module (consisting of two consecutive 1×1 convolutions) to learn region features at different spatial locations. This is similar to the attention matrix generator in the CoT block, but the CoT attention matrix generator more fully exploits the feature interactions between contextualized keys and queries for self-attention learning.
  • Bottleneck Transformer [44] augments ConvNets with self-attention by replacing 3×3 convolutions with Transformer-like modules. Specifically, it employs a global multi-head self-attention layer, which is computationally more expensive than CoT's local self-attention. Furthermore, the CoT block goes beyond typical local self-attention by exploiting the rich context among the input keys to enhance self-attention learning.
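For reference, a minimal sketch of the pointwise-then-depthwise factorization mentioned in the first item above (layer names and channel sizes are illustrative, not taken from [18]):

import torch
import torch.nn as nn

def blueprint_separable_conv(c_in, c_out, k=3):
    # approximate a dense k×k convolution with a 1×1 pointwise convolution
    # followed by a k×k depthwise convolution (one filter per output channel)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
        nn.Conv2d(c_out, c_out, kernel_size=k, padding=k // 2, groups=c_out, bias=False),
    )

y = blueprint_separable_conv(64, 128)(torch.randn(1, 64, 32, 32))  # (1, 128, 32, 32)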

Code interpretation

import torch
import torch.nn as nn
import torch.nn.functional as F
# LocalConvolution (the CUDA/CuPy local aggregation operator) and get_act_layer are
# provided by the official CoTNet repository and timm, respectively.

class CotLayer(nn.Module):
    def __init__(self, dim, kernel_size):
        super(CotLayer, self).__init__()

        self.dim = dim
        self.kernel_size = kernel_size

        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, self.kernel_size, stride=1, padding=self.kernel_size//2, groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True)
        )

        share_planes = 8
        factor = 2
        self.embed = nn.Sequential(
            nn.Conv2d(2*dim, dim//factor, 1, bias=False),
            nn.BatchNorm2d(dim//factor),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim//factor, pow(kernel_size, 2) * dim // share_planes, kernel_size=1),
            nn.GroupNorm(num_groups=dim // share_planes, num_channels=pow(kernel_size, 2) * dim // share_planes)
        )

        self.conv1x1 = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0, dilation=1, bias=False),
            nn.BatchNorm2d(dim)
        )

        self.local_conv = LocalConvolution(dim, dim, kernel_size=self.kernel_size, stride=1, padding=(self.kernel_size - 1) // 2, dilation=1)
        self.bn = nn.BatchNorm2d(dim)
        act = get_act_layer('swish')
        self.act = act(inplace=True)

        reduction_factor = 4
        self.radix = 2
        attn_chs = max(dim * self.radix // reduction_factor, 32)
        self.se = nn.Sequential(
            nn.Conv2d(dim, attn_chs, 1),
            nn.BatchNorm2d(attn_chs),
            nn.ReLU(inplace=True),
            nn.Conv2d(attn_chs, self.radix*dim, 1)
        )

    def forward(self, x):
        # Static context: encode the keys k with a 3×3 (grouped) convolution + BN + ReLU
        k = self.key_embed(x)
        # Concatenate the query features (the input x itself) with the encoded keys k
        qk = torch.cat([x, k], dim=1)
        # Get B, C, H, W of the concatenated qk tensor
        b, c, qk_hh, qk_ww = qk.size()

        # Two consecutive 1×1 convolutions: Conv-BN-ReLU followed by Conv-GroupNorm
        # 1. The second convolution has no ReLU, so that too much useful information is not suppressed
        # 2. The second convolution uses GroupNorm, which splits the channels into groups and
        #    normalizes each group separately, avoiding interference between features of different batches
        w = self.embed(qk)
        # Reshape into multiple heads: one k×k attention map per head and spatial position
        w = w.view(b, 1, -1, self.kernel_size*self.kernel_size, qk_hh, qk_ww)

        # Get the values v
        # Conv 1×1 + BN
        x = self.conv1x1(x)
        # Dynamic context: matrix-multiply the learned weights w with the values, weighting every spatial position
        # local 3×3 aggregation + BN + swish
        x = self.local_conv(x,w)
        x = self.bn(x)
        x = self.act(x)

        # Add an extra dimension for feature fusion
        B, C, H, W = x.shape
        x = x.view(B, C, 1, H, W)
        k = k.view(B, C, 1, H, W)
        # Concatenate the dynamic contexts and the static contexts
        x = torch.cat([x, k], dim=2)

        # Fuse the dynamic and static contexts, then apply attention-based weighting on top
        x_gap = x.sum(dim=2)
        # Average over the spatial dimensions of the feature map (global average pooling)
        x_gap = x_gap.mean((2, 3), keepdim=True)
        # 1×1 Conv-BN-ReLU-Conv feature encoding (SE-style)
        x_attn = self.se(x_gap)
        # Reshape the features to (B, C, radix)
        x_attn = x_attn.view(B, C, self.radix)
        # Compute the importance of each context branch
        x_attn = F.softmax(x_attn, dim=2)
        # Re-weight x with the attention map and sum the two branches
        out = (x * x_attn.reshape((B, C, self.radix, 1, 1))).sum(dim=2)
        # Make the tensor contiguous in memory
        return out.contiguous()
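A usage sketch of the layer above, assuming the repository's LocalConvolution operator (which needs its CUDA/CuPy extension) and timm's get_act_layer are available and a GPU is present; the input sizes below are arbitrary:

# Drop-in replacement for a 3×3 convolution inside a ResNet Bottleneck:
# the channel count is preserved, only the spatial context aggregation changes.
cot = CotLayer(dim=256, kernel_size=3).cuda()
x = torch.randn(2, 256, 56, 56).cuda()
y = cot(x)   # (2, 256, 56, 56), the same shape a 3×3 convolution with padding 1 would produce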

Experimental results

Performance comparison of different ways of exploring context information, i.e., using only the static context, using only the dynamic context, linearly fusing the static and dynamic contexts (Linear Fusion), and the full CoT block. The backbone is CoTNet-50, trained on ImageNet with the default settings.

The above table compares the performance of different ways of acquiring context information, from which we can see:

  • The static context branch alone (which can be regarded as a ConvNet without self-attention) achieves only 77.1% top-1 accuracy;
  • The dynamic context branch alone achieves 78.5% top-1 accuracy;
  • Simply fusing the dynamic and static contexts by addition (linear fusion) achieves 78.7% top-1 accuracy;
  • Fusing the dynamic and static contexts with attention-based fusion further improves performance to 79.2%.

 

The above table shows the effect of using different replacement settings on the four stages (res2→res3→res4→res5) in ResNet-50:

  • Replacing the 3×3 convolutions with CoT modules improves performance while slightly reducing the number of parameters and FLOPs;
  • Replacing the CoT modules in the last two stages (res4 and res5) contributes most of the performance gain, while additionally replacing them in the first two stages (res2 and res3) brings only a marginal further improvement (0.2% top-1 accuracy in total) at 1.34 times the inference time (a sketch of this kind of per-stage replacement is given below).
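To make the per-stage replacement concrete, here is a hedged sketch that swaps the 3×3 convolutions in the last two stages (res4 and res5) of a torchvision ResNet-50 for the CotLayer defined above; it only touches stride-1 blocks (the official CoTNet handles strided blocks differently) and assumes the repository's dependencies are importable, so it is an illustration rather than a faithful CoTNet-50:

from torchvision.models import resnet50

model = resnet50()
for stage in (model.layer3, model.layer4):      # res4 and res5
    for block in stage:                         # torchvision Bottleneck blocks
        conv2 = block.conv2                     # the 3×3 convolution of the bottleneck
        if conv2.stride == (1, 1):              # skip the strided (downsampling) blocks
            block.conv2 = CotLayer(conv2.in_channels, kernel_size=3)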

Inference Time vs. Accuracy Curve on the ImageNet dataset

The table above summarizes the object detection performance of Faster R-CNN and Cascade R-CNN with different pre-trained backbones on the COCO dataset, grouping the visual backbones by network depth (50 layers / 101 layers). At each network depth, the pre-trained CoTNet models (CoTNet-50/101 and CoTNeXt-50/101) outperform the ConvNet backbones (ResNet-50/101 and ResNeSt-50/101) across all IoU thresholds and object sizes. The results essentially demonstrate the advantage of integrating self-attention learning with contextual information mining in CoTNet, even when transferred to the downstream task of object detection.

Summary

  1. Most existing Transformer-based architecture designs operate directly on 2D feature maps, using self-attention to obtain the attention matrix from independent query-key pairs, which does not sufficiently exploit the rich context between adjacent keys. Therefore, this paper proposes a new Transformer-style module, the Contextual Transformer (CoT) block, which exploits the contextual information among the input keys to guide self-attention learning.
  2. The CoT module first captures the static context among neighboring keys, which is further exploited to trigger self-attention for mining the dynamic context. This approach elegantly unifies context mining and self-attention learning in a single architecture, thereby enhancing the capacity of the visual representation. The CoT block can then easily replace the regular convolutions in existing ResNet architectures while reducing the network parameters. Finally, the object detection and instance segmentation experiments on the COCO dataset also demonstrate that CoTNet has strong generalization ability.

Highlights:

  1. Technically, the CoT module first encodes the contextual information of the input keys with a 3×3 convolution to obtain a static context representation of the input; it then concatenates the encoded keys with the input queries and learns a dynamic multi-head attention matrix through two consecutive 1×1 convolutions; the resulting attention matrix is multiplied by the input values to obtain a dynamic context representation of the input.
  2. CoTNet-50 directly uses CoT to replace the 3×3 convolution in the Bottleneck; similarly, CoTNeXt-50 uses the CoT module to replace the corresponding group convolution. To keep the computation comparable, the number of channels and the number of groups are adjusted: the parameter count of CoTNeXt-50 is 1.2 times that of ResNeXt-50, and its FLOPs are 1.01 times.
