NVIDIA's latest open source | FasterViT: an efficient, hardware-optimized neural network architecture

Title: FasterViT: Fast Vision Transformers with Hierarchical Attention
Paper: https://arxiv.org/pdf/2306.06189.pdf
Code: https://github.com/NVlabs/FasterViT

Introduction

Today, I will bring you FasterViT, the latest open-source efficient neural network architecture from NVIDIA's research team, which aims to improve image processing speed in the field of computer vision.

Like other hybrid neural network architectures, FasterViT combines the strength of CNNs in local feature learning with the strength of Transformers in global modeling. The highlight of the paper is the introduction of a method called Hierarchical Attention (HAT), which decomposes global self-attention, with its quadratic complexity, into multi-level attention to reduce computational cost. The method relies on an efficient window-based self-attention mechanism, where each window has dedicated "carrier tokens" that participate in both local and global feature learning. On the high-level output features, global self-attention then enables efficient communication between windows at low cost.

FasterViT achieves an optimal trade-off between accuracy and image processing speed, and has been widely validated on computer vision tasks such as image classification, object detection, and semantic segmentation. The researchers also demonstrate that HAT can be used as a plug-in module for existing networks to enhance their performance. On high-resolution images, FasterViT is both faster and more accurate than competing methods.
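For readers who want to try it first, here is a minimal usage sketch. It assumes the pip package and `create_model` interface shown in the official repo's README; the variant name `faster_vit_0_224` is taken from the README's model list, so check the repo for the current model zoo before relying on it.

```python
# Minimal usage sketch, assuming the interface described in the official
# repo README (https://github.com/NVlabs/FasterViT): pip install fastervit
import torch
from fastervit import create_model

# "faster_vit_0_224" is one of the variants listed in the README; treat the
# exact name and arguments as assumptions and verify against the repo.
model = create_model('faster_vit_0_224', pretrained=True).eval()

x = torch.randn(1, 3, 224, 224)   # dummy 224x224 RGB image
with torch.no_grad():
    logits = model(x)             # ImageNet-1K class logits
print(logits.shape)               # torch.Size([1, 1000])
```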

Motivation

The motivation for FasterViT is to address the efficiency problem that computer vision faces when processing high-resolution images. Although ViT achieves excellent performance on various tasks, the computational complexity of its self-attention mechanism is quadratic in the number of tokens, so high-resolution images lead to slow processing. In addition, the original ViT model lacks multi-scale feature representations, and for downstream tasks such as object detection and semantic segmentation this isotropic structure is not well suited.

The architectural design of FasterViT focuses on achieving the highest throughput on CV tasks and is optimized for mainstream general-purpose hardware that excels at parallel computing. Computation on this hardware is carried out by a set of streaming multiprocessors (SMs) with CUDA and Tensor cores as the compute units. Computation requires frequent data transfers, and data-movement bandwidth can limit compute utilization. Operations dominated by arithmetic are therefore math-bound, while operations dominated by data transfer are memory-bound, and a trade-off between the two is needed to maximize throughput. Let's analyze this in detail below.

Anyone who has studied a few networks knows that in a typical hierarchical vision model, the spatial dimensions of the intermediate representations shrink as depth increases. Early network layers typically have large spatial dimensions and few channels (e.g., 112x112x64), making them memory-bound. This makes them better suited to compute-intensive operations such as dense convolutions, rather than depthwise separable or sparse convolutions, which impose additional data-transfer overhead. In addition, operations that cannot be expressed as matrix multiplications, such as nonlinear activations, pooling, and batch normalization, are also memory-bound and should be used as sparingly as possible here. In contrast, later layers are usually math-bound and call for compute-intensive operations: for example, a hierarchical CNN reaches 14x14 feature maps with high-dimensional convolution kernels at these stages. This leaves room for more expressive operations, such as layer normalization and attention mechanisms, with relatively little impact on throughput.
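To make the memory-bound vs. math-bound distinction concrete, here is a back-of-the-envelope estimate of arithmetic intensity (FLOPs per byte moved) for a dense versus a depthwise 3x3 convolution at an early, high-resolution stage. This is my own illustration of the argument, not code from the paper:

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte moved) for a
# dense 3x3 conv vs. a depthwise 3x3 conv, ignoring caching effects.

def dense_conv3x3(h, w, c, bytes_per_elem=2):          # fp16 tensors
    flops = 2 * h * w * c * c * 9                      # multiply-adds
    bytes_moved = bytes_per_elem * (h * w * c          # read input
                                    + 9 * c * c        # read weights
                                    + h * w * c)       # write output
    return flops / bytes_moved

def depthwise_conv3x3(h, w, c, bytes_per_elem=2):
    flops = 2 * h * w * c * 9                          # one filter per channel
    bytes_moved = bytes_per_elem * (h * w * c + 9 * c + h * w * c)
    return flops / bytes_moved

# At 112x112x64 the dense conv does ~60x more work per byte moved, so it
# makes far better use of a bandwidth-limited early stage.
print(f"dense     : {dense_conv3x3(112, 112, 64):6.1f} FLOPs/byte")   # ~281.5
print(f"depthwise : {depthwise_conv3x3(112, 112, 64):6.1f} FLOPs/byte")  # ~4.5
```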

Method

Framework

Based on the motivations analyzed above, the paper proposes a novel architecture designed to benefit from accelerated computing hardware. The overall framework is shown in the figure:

[Figure: overall FasterViT architecture]

Quite clean, isn't it? The method uses convolutional layers in the earlier stages to process higher-resolution inputs, while the second half of the model relies on a novel hierarchical attention layer for spatial reasoning over the entire feature map. The architecture is optimized for computation and throughput: the first half of the network and the downsampling blocks use dense convolution kernels, and in the higher-resolution stages (i.e., stages 1 and 2), squeeze-and-excitation operations are avoided and layer normalization is minimized. The second half of the architecture (i.e., stages 3 and 4) is usually math-bound, since the GPU spends disproportionately more time on computation than on memory transfers, so applying multi-head attention there does not become a bottleneck.
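A schematic PyTorch skeleton of this stage layout may help: dense conv blocks in the high-resolution stages, attention blocks in the low-resolution stages. The block classes below are simplified stand-ins of my own, not the repo's actual modules:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Dense conv residual block for the memory-bound early stages (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim),
        )
    def forward(self, x):
        return x + self.body(x)

class AttnBlock(nn.Module):
    """Stand-in for a hierarchical-attention block in the late stages (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token view
        n = self.norm(t)
        t = t + self.attn(n, n, n)[0]
        return t.transpose(1, 2).reshape(b, c, h, w)

def downsample(cin, cout):
    return nn.Conv2d(cin, cout, 3, stride=2, padding=1)

class FasterViTSkeleton(nn.Module):
    """Conv stages at high resolution, attention stages at low resolution."""
    def __init__(self, dim=64):
        super().__init__()
        self.stem = nn.Sequential(downsample(3, dim), downsample(dim, dim))
        self.stage1 = nn.Sequential(ConvBlock(dim), downsample(dim, dim * 2))
        self.stage2 = nn.Sequential(ConvBlock(dim * 2), downsample(dim * 2, dim * 4))
        self.stage3 = nn.Sequential(AttnBlock(dim * 4), downsample(dim * 4, dim * 8))
        self.stage4 = AttnBlock(dim * 8)
    def forward(self, x):
        for stage in (self.stem, self.stage1, self.stage2, self.stage3, self.stage4):
            x = stage(x)
        return x

out = FasterViTSkeleton()(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 512, 7, 7])
```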

The overall network design is solid, but the core contribution is HAT, so let's focus on its design.

HAT

HAT is a novel windowed attention mechanism. The module aims to facilitate the exchange of local and global information at low computational cost; it introduces the concept of carrier tokens (CTs) and performs hierarchical self-attention operations.

As shown in the figure above, the HAT module first divides the input feature map into local windows, similar to Swin. Each local window is represented by a set of tokens. The key idea is to introduce CTs that summarize the information within each local window. The CTs are obtained by pooling and convolution operations and provide a summary of their respective local windows; each local window has its own unique CTs.
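Here is a small sketch of what the windowing and carrier-token initialization could look like in PyTorch. The paper describes pooling plus convolution, so the plain average pooling below (one CT per window) is a simplification of mine:

```python
import torch
import torch.nn.functional as F

def window_partition(x, window=7):
    """Split a (B, C, H, W) feature map into non-overlapping local windows,
    returning (B * n_windows, window*window, C) token sequences (sketch)."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // window, window, w // window, window)
    x = x.permute(0, 2, 4, 3, 5, 1)            # (B, nH, nW, win, win, C)
    return x.reshape(-1, window * window, c)

def init_carrier_tokens(x, window=7):
    """Summarize each local window into one carrier token by average pooling.
    The paper uses pooling + conv; this is a simplified stand-in."""
    ct = F.avg_pool2d(x, kernel_size=window)   # one summary vector per window
    return ct.flatten(2).transpose(1, 2)       # (B, n_windows, C)

x = torch.randn(2, 256, 14, 14)                # stage-3-like feature map
wins = window_partition(x)                     # torch.Size([8, 49, 256])
cts = init_carrier_tokens(x)                   # torch.Size([2, 4, 256])
print(wins.shape, cts.shape)
```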

In the HAT block, the CTs undergo multi-head self-attention (MHSA), followed by layer normalization and a multi-layer perceptron (MLP). This attention step allows the CTs to exchange information and summarize global features. Next, the local window tokens and CTs are concatenated, and another set of attention operations is applied to model their interactions, enabling communication of both short-range and long-range spatial information. The tokens are then split back into their respective local windows and CTs, and these operations are applied iteratively across the layers of this stage. To facilitate long-range interactions, global information propagation is performed at the end of the stage: the outputs are computed by upsampling the CTs and merging them with the local window tokens.
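Putting these steps together, here is a sketch of one HAT block's forward pass, assuming one carrier token per window. It follows the sequence described above (global CT attention, joint window+CT attention, split), not the official implementation:

```python
import torch
import torch.nn as nn

class HATBlockSketch(nn.Module):
    """Sketch of the HAT step sequence described above (not the repo code)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.ct_norm = nn.LayerNorm(dim)
        self.ct_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ct_mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                    nn.GELU(), nn.Linear(4 * dim, dim))
        self.win_norm = nn.LayerNorm(dim)
        self.win_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, windows, cts):
        # windows: (B*n_win, win_sz, C); cts: (B, n_win, C), one CT per window.
        b, n_win, c = cts.shape
        # 1) Global step: CTs attend to each other, then pass through an MLP.
        g = self.ct_norm(cts)
        cts = cts + self.ct_attn(g, g, g)[0]
        cts = cts + self.ct_mlp(cts)
        # 2) Local step: concatenate each window's tokens with its CT and run
        #    attention over the joint set (short- + long-range mixing).
        ct_flat = cts.reshape(b * n_win, 1, c)
        joint = torch.cat([ct_flat, windows], dim=1)
        j = self.win_norm(joint)
        joint = joint + self.win_attn(j, j, j)[0]
        # 3) Split the joint set back into CTs and window tokens.
        cts = joint[:, :1].reshape(b, n_win, c)
        return joint[:, 1:], cts

# Toy shapes matching the partition sketch above.
wins, cts = torch.randn(8, 49, 256), torch.randn(2, 4, 256)
wins, cts = HATBlockSketch(256)(wins, cts)
print(wins.shape, cts.shape)   # torch.Size([8, 49, 256]) torch.Size([2, 4, 256])
```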

To incorporate position information, the paper uses a two-layer MLP to add absolute position biases to the CTs and local window tokens. In addition, the log-space relative position bias proposed in SwinV2 is adopted so that attention respects image locality. Overall, the HAT module enables information exchange between local windows and global features, effectively supporting spatial reasoning across the whole feature-map hierarchy.
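For the log-space relative bias, here is a sketch of the SwinV2-style continuous position bias: a small two-layer MLP maps log-scaled relative offsets to a per-head bias table that would be added to the attention logits. Details such as SwinV2's normalization constants are omitted:

```python
import torch
import torch.nn as nn

def log_cpb(window=7, heads=8):
    """Sketch of SwinV2-style log-spaced continuous position bias: a 2-layer
    MLP maps log-scaled relative (dy, dx) offsets to per-head biases."""
    mlp = nn.Sequential(nn.Linear(2, 512), nn.ReLU(), nn.Linear(512, heads))
    coords = torch.arange(window, dtype=torch.float32)
    dy, dx = torch.meshgrid(coords, coords, indexing="ij")
    pos = torch.stack([dy.flatten(), dx.flatten()], dim=-1)   # (win^2, 2)
    rel = pos[:, None, :] - pos[None, :, :]                   # (win^2, win^2, 2)
    rel = torch.sign(rel) * torch.log1p(rel.abs())            # log-spaced offsets
    return mlp(rel).permute(2, 0, 1)   # (heads, win^2, win^2), added to the
                                       # attention logits before the softmax

print(log_cpb().shape)                 # torch.Size([8, 49, 49])
```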

The figure above compares the attention maps of efficient global-local self-attention mechanisms. The proposed hierarchical attention decomposes self-attention into local and sub-global parts, both of which can be expressed as two dense attention operations.

Experiments

Image Classification


On the ImageNet-1K dataset, the FasterViT models achieve higher accuracy at the same throughput than a variety of hybrid, convolutional, and Transformer-based networks; for example, accuracy improves by 2.2% over ConvNeXt-T. Considering the accuracy-throughput trade-off, FasterViT has a significant speed advantage over Transformer-based models such as the Swin Transformer family. Furthermore, FasterViT achieves higher average throughput and better ImageNet top-1 accuracy than recent hybrid models such as EfficientFormer and MaxViT. With model optimization such as TensorRT, the latency-accuracy Pareto frontier of the FasterViT models still holds.

Object Detection


In the evaluation of object detection and instance segmentation with the Cascade Mask R-CNN network on the MS COCO dataset, the FasterViT models achieve a better balance of accuracy and throughput. For example, FasterViT-4 outperforms ConvNeXt-B and Swin-B by 0.2 and 1.0 box AP and by 0.3 and 1.0 mask AP, while its throughput is 15% and 30% higher, respectively. Similar trends are observed for the other model variants. In addition, using an ImageNet-21K pre-trained FasterViT-4 as the backbone, an additional object detection experiment with the state-of-the-art DINO model achieves 58.7 box AP, validating FasterViT's effectiveness as a backbone for more complex, state-of-the-art models.


Semantic Segmentation


On the ADE20K dataset, semantic segmentation experiments with the UPerNet network show that FasterViT also achieves a good performance-throughput trade-off. For example, FasterViT-4 outperforms Swin-B by 1.0 and 0.7 mIoU for single-scale and multi-scale inference, respectively, with 16.94% higher throughput. Compared with ConvNeXt-B under multi-scale inference, FasterViT-4 delivers 7.01% higher throughput and 0.4 higher mIoU.


Summary

FasterViT is a hybrid architecture that combines the strengths of CNNs and ViTs to achieve high image-processing speed. To handle high-resolution images, the paper introduces the new HAT module, which captures short-range and long-range spatial dependencies and effectively models the interactions between windows. With these improvements, the model achieves an optimal balance between image-processing speed and accuracy.
