CVPR'22 | CMT: A New Paradigm Combining CNN and Transformer (Open Source)

Author | Wang Yunhe Editor | Autobots

Original link: https://zhuanlan.zhihu.com/p/534567826


Hello everyone, I haven't posted on Zhihu for a long time. Today I'd like to share our group's latest CVPR'22 work and explore how to combine CNN and Transformer efficiently.

Abstract

In recent years, Transformers have attracted increasing attention in the vision community, and a question naturally arises: which is better, CNN or Transformer? The best answer, of course, is to combine them. Researchers at Huawei's Noah's Ark Lab propose a new vision backbone, CMT, which simply combines traditional convolutions with Transformers and outperforms Google's EfficientNet and ViT as well as MSRA's Swin Transformer. Built on a multi-stage Transformer, the paper inserts traditional convolutions between the layers of the network, aiming to extract local and global image features hierarchically through convolution plus global attention. This simple and effective combination suggests that, in today's vision field, using traditional convolutions is one of the fastest ways to improve model performance. On the ImageNet classification task, CMT-Small reaches 83.5% top-1 accuracy under a comparable computation budget, higher than Swin-T's 81.3% and EfficientNet's 82.9%.

Link to the paper: CMT: Convolutional Neural Networks Meet Vision Transformers

https://link.zhihu.com/?target=https%3A//openaccess.thecvf.com/content/CVPR2022/papers/Guo_CMT_Convolutional_Neural_Networks_Meet_Vision_Transformers_CVPR_2022_paper.pdf

PyTorch code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/cmt_pytorch

MindSpore code: https://gitee.com/mindspore/models/tree/master/research/cv/CMT

Introduction

[Figure 1: ResNet-50 / ViT / CMT architecture comparison]

The birth of the Transformer has driven rapid progress in natural language processing. Inspired by this, Transformers have also emerged in computer vision in recent years. The Vision Transformer (ViT) proposed by Google is a classic pure-Transformer solution for vision tasks: it divides the input image into a number of patches, represents each patch as a vector (token), and processes the patch sequence with a Transformer; the output features can be used directly for tasks such as image recognition and detection. For example, DETR uses a Transformer-based encoder-decoder for object detection, and IPT uses a single Transformer model to handle multiple low-level vision tasks simultaneously. Compared with traditional CNN models such as ResNet, the Transformer relies on its global attention mechanism to capture long-range dependencies between patches, and performs well on vision tasks such as detection and segmentation that require a large receptive field.

However, compared with NLP, the inputs of vision tasks have a unique 2D structure, and the local spatial information between patches is also important. The shortcomings of existing vision Transformers are therefore obvious: splitting the input image into patches destroys the internal structure of each patch, and the long-range attention mechanism easily ignores local image characteristics, so existing Transformers still lag behind state-of-the-art traditional convolutional networks.

The goal of this work is to bring the advantages of CNNs into the Transformer to solve the above problems. We propose a new architecture, CMT, which is built on a stage-wise Transformer, introduces convolution operations for fine-grained feature extraction, and designs a dedicated module to extract local and global features hierarchically. It not only improves accuracy but also reduces computational overhead. Experiments on the ImageNet benchmark and downstream tasks demonstrate the superiority of the proposed method in terms of both accuracy and computational complexity. For example, CMT-Small achieves 83.5% ImageNet top-1 accuracy with only 4.0B FLOPs, 2.2% higher than the more computationally intensive Swin Transformer.

Method

Image preprocessing

Most Transformer-based models use a large-kernel convolution (e.g., the 16x16 convolution in ViT) to directly slice the input image into non-overlapping patches. This directly discards the 2D spatial structure within each patch as well as much detail and edge information. CMT therefore adopts a traditional convolutional stem: a stack of 3x3 convolutions that downsamples the input while extracting fine-grained features. To produce multi-scale features (in line with mainstream detectors), the main body of CMT is a stage-wise Transformer, and each stage begins with a 2x2 convolution with stride 2 that downsamples the feature map and increases the number of channels.
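For intuition, here is a minimal PyTorch sketch of such a stem and per-stage downsampling, based only on the description above; the layer counts, channel widths, and normalization choices are assumptions rather than the paper's exact configuration:

```python
import torch.nn as nn

class ConvStem(nn.Module):
    """Stack of 3x3 convolutions: downsamples the input once and
    extracts fine-grained local features (CMT conv stem, sketched)."""
    def __init__(self, in_chans=3, out_chans=32):
        super().__init__()
        self.stem = nn.Sequential(
            # first 3x3 conv halves the spatial resolution
            nn.Conv2d(in_chans, out_chans, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_chans), nn.GELU(),
            # further 3x3 convs refine features at the same scale
            nn.Conv2d(out_chans, out_chans, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_chans), nn.GELU(),
            nn.Conv2d(out_chans, out_chans, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_chans), nn.GELU(),
        )

    def forward(self, x):                        # (B, 3, H, W) -> (B, C, H/2, W/2)
        return self.stem(x)

class StageDownsample(nn.Module):
    """2x2 stride-2 convolution at the start of each stage:
    halves the resolution and increases the channel count."""
    def __init__(self, in_chans, out_chans):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, out_chans, kernel_size=2, stride=2)

    def forward(self, x):                        # (B, C, H, W) -> (B, C', H/2, W/2)
        return self.proj(x)
```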

CMT module combining CNN and Transformer

  • LPU (Local Perception Unit):

Rotation and translation are commonly used augmentations for CNNs. ViT, however, typically uses absolute positional encodings, where each patch has a unique position code, so it cannot provide the network with translation invariance. The local perception unit applies a 3x3 depthwise convolution to inject the translation invariance of convolution into the Transformer block, and uses a residual connection to stabilize training:

LPU(X) = DWConv_{3×3}(X) + X

  • LMHSA (Lightweight Multi-Head Self-Attention):

Given an input of size R^{n×d}, the original multi-head self-attention first generates the corresponding Query, Key, and Value (with the same size as the input), and then produces an R^{n×n} weight matrix from the dot product of Query and Key:

Attention(Q, K, V) = Softmax(QK^T / √d_k) V

Because the input feature maps are large, this step consumes considerable compute and memory, which makes training and deployment difficult. We instead use two k×k depthwise convolutions (stride k) to downsample the generation of Key and Value, obtaining two smaller features K' and V':

K' = DWConv_k(K) ∈ R^{(n/k²)×d}
V' = DWConv_k(V) ∈ R^{(n/k²)×d}
LMHSA(Q, K', V') = Softmax(Q K'^T / √d_k) V'

Introducing a depthwise separable convolution in the Self-Attention module to downsample the feature map is an efficient method to save computation and memory.

  • IRFFN (Inverted Residual Feed-Forward Network):

Compared with the FFN in the standard Transformer, this feed-forward network inserts a depthwise convolutional layer between the two fully connected layers, a design similar to the inverted residual block in MobileNetV2 (see the PyTorch sketch after this list):

IRFFN(X) = Conv_{1×1}(F(Conv_{1×1}(X))),  where F(X) = DWConv_{3×3}(X) + X
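Putting the three modules together: a CMT block applies the LPU, then the lightweight attention, then the IRFFN, each with a residual connection. The sketch below is my simplified reading of the description above (head counts, normalization placement, and the token/feature-map reshaping are assumptions), not the authors' reference implementation, which is available in the repositories linked at the top:

```python
import torch.nn as nn

class LPU(nn.Module):
    """Local Perception Unit: 3x3 depthwise conv with a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x):                       # x: (B, C, H, W)
        return x + self.dwconv(x)

class LMHSA(nn.Module):
    """Lightweight MHSA: Key and Value are downsampled by a k x k depthwise conv."""
    def __init__(self, dim, num_heads=4, k=2):
        super().__init__()
        self.num_heads, self.scale = num_heads, (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # k x k stride-k depthwise conv shrinks the key/value token grid
        self.sr = nn.Conv2d(dim, dim, kernel_size=k, stride=k, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                 # x: (B, N, C), N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)
        # reshape tokens to a 2D map, downsample, flatten back to tokens
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)          # (B, N/k^2, C)
        kv = self.kv(self.norm(x_)).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)        # each (B, heads, N/k^2, C/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class IRFFN(nn.Module):
    """Inverted-residual FFN: FC -> depthwise 3x3 conv (+ residual) -> FC."""
    def __init__(self, dim, ratio=4):
        super().__init__()
        hidden = dim * ratio
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x, H, W):                 # x: (B, N, C)
        B, N, _ = x.shape
        x = self.act(self.fc1(x))
        h = x.transpose(1, 2).reshape(B, -1, H, W)
        h = h + self.act(self.dwconv(h))        # residual around the depthwise conv
        x = h.reshape(B, -1, N).transpose(1, 2)
        return self.fc2(x)

class CMTBlock(nn.Module):
    """LPU -> LMHSA -> IRFFN, the latter two wrapped with residual connections."""
    def __init__(self, dim, num_heads=4, k=2, ratio=4):
        super().__init__()
        self.lpu = LPU(dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = LMHSA(dim, num_heads, k)
        self.ffn = IRFFN(dim, ratio)

    def forward(self, x):                       # x: (B, C, H, W) feature map
        x = self.lpu(x)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)        # tokens (B, N, C)
        t = t + self.attn(self.norm1(t), H, W)
        t = t + self.ffn(self.norm2(t), H, W)
        return t.transpose(1, 2).reshape(B, C, H, W)
```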

Overall structure of CMT

[Figure: overall structure of CMT]

The CMT network consists of a convolutional stem (CMT Stem), four downsampling layers, four stages, a pooling layer, and a fully connected classifier, where each stage is a stack of several CMT blocks. The detailed configuration is given in Table 1, where Hi and ki are the number of heads and the downsampling rate of the lightweight multi-head attention module in stage i, and Ri is the expansion ratio of the hidden channels in the inverted residual feed-forward network.
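Reusing the ConvStem, StageDownsample, and CMTBlock sketches above, the overall network described here could be assembled roughly as follows; the stage depths, channel widths, head counts, and attention downsampling rates below are placeholders, not the Table 1 configuration:

```python
class CMT(nn.Module):
    """Conv stem -> 4 x (2x2 stride-2 downsampling + stacked CMT blocks)
    -> global average pooling -> linear classifier (sketch)."""
    def __init__(self, num_classes=1000,
                 dims=(46, 92, 184, 368),    # per-stage channels (placeholders)
                 depths=(2, 2, 10, 2),       # CMT blocks per stage (placeholders)
                 heads=(1, 2, 4, 8),         # attention heads H_i (placeholders)
                 ks=(8, 4, 2, 1)):           # LMHSA downsampling rates k_i (placeholders)
        super().__init__()
        self.stem = ConvStem(3, 16)
        self.downsamples, self.stages = nn.ModuleList(), nn.ModuleList()
        in_c = 16
        for dim, depth, h, k in zip(dims, depths, heads, ks):
            self.downsamples.append(StageDownsample(in_c, dim))
            self.stages.append(nn.Sequential(*[CMTBlock(dim, h, k) for _ in range(depth)]))
            in_c = dim
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x):
        x = self.stem(x)
        for down, stage in zip(self.downsamples, self.stages):
            x = stage(down(x))
        x = x.mean(dim=(2, 3))               # global average pooling over H, W
        return self.head(x)

# quick shape check (toy):
# model = CMT(); print(model(torch.randn(1, 3, 224, 224)).shape)  # -> (1, 1000)
```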

The CMT model family

Similar to EfficientNet's compound model scaling rule, we search for the best scaling factors for CMT via a grid search over roughly 100 million candidate points, obtaining α=1.2, β=1.3, and γ=1.15; the scaling formula is as follows:

[Equation: compound scaling rule that jointly scales the network depth, channel dimension, and input resolution using α, β, and γ, respectively]

Starting from CMT-Small and applying the scaling formula above, we obtain the corresponding CMT-Ti, CMT-XS, and CMT-B. These models use ImageNet input resolutions of 160x160, 192x192, 224x224 (CMT-S), and 256x256, respectively.
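As a toy illustration only (the exact form of CMT's scaling rule is given in the paper; the EfficientNet-style rule below is an assumption for illustration), applying a compound coefficient φ to a base configuration could look like:

```python
def scale_config(base_depths, base_dim, base_resolution, phi,
                 alpha=1.2, beta=1.3, gamma=1.15):
    """Hypothetical EfficientNet-style compound scaling: depth grows as
    alpha**phi, channel width as beta**phi, input resolution as gamma**phi."""
    depths = [max(1, round(d * alpha ** phi)) for d in base_depths]
    dim = max(1, round(base_dim * beta ** phi))
    resolution = max(1, round(base_resolution * gamma ** phi))
    return depths, dim, resolution

# Toy example: phi = 1 scales a base config up, phi = -1 scales it down.
print(scale_config([3, 3, 16, 3], 64, 224, phi=1))
print(scale_config([3, 3, 16, 3], 64, 224, phi=-1))
```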

Experiments

ImageNet experiment

We train and evaluate CMT models on the ImageNet-2012 dataset. As shown in Table 2, CMT has a clear performance advantage over both emerging Transformer models and traditional CNN models. With only 4.0B FLOPs, CMT-S achieves 83.5% top-1 accuracy, 2.2% higher than the baseline Swin-T, which indicates that introducing traditional convolutions into the Transformer helps to better extract and preserve local structural information.

[Table 2: comparison with SOTA models on the ImageNet dataset]

Transfer Learning Experiments

[Table 3: transfer learning performance of CMT on downstream tasks]

To demonstrate CMT's generalization ability, we transfer the CMT-S and CMT-B models trained on ImageNet to other datasets. Specifically, CMT is evaluated on five image classification datasets: CIFAR-10, CIFAR-100, Stanford Cars, Oxford 102 Flowers, and Oxford-IIIT Pets. All models are fine-tuned at a resolution of 224x224. Table 3 compares the transfer learning results of CMT with EfficientNet, DeiT, TNT, and other networks.

Object Detection and Instance Segmentation Experiments

Table 4 and Table 5 report the results of CMT-S on different detection frameworks (such as RetinaNet and Mask R-CNN). Under similar computational budgets, CMT-S delivers a significant performance improvement over other backbones.

[Table 4: performance of CMT on COCO object detection]
[Table 5: performance of CMT on COCO instance segmentation]

Conclusion

This paper proposes CMT, a general vision backbone that combines CNNs and Transformers. In an era when CNN-, Transformer-, and MLP-based vision architectures are being proposed in quick succession, every new architecture or module forces researchers to test whether it actually brings improvements. This work shows succinctly and effectively that, in the vision domain, combining traditional convolutions with Transformers yields a 1+1>2 effect. On top of the classic ViT structure, we introduce a Conv Stem composed of 3x3 convolutions and a CMT block composed of depthwise convolutions and a self-attention mechanism, which greatly improves the accuracy of existing vision networks while adding almost no FLOPs. Extensive experiments on ImageNet and downstream tasks demonstrate the superiority of the proposed CMT architecture.
