CVPR 2023 | EfficientViT: Making ViT more efficient to deploy for real-time inference (with source code)

Paper address: https://arxiv.org/pdf/2305.07027.pdf

Project code: https://github.com/microsoft/Cream/tree/main/EfficientViT

Computer Vision Research Institute column

Vision transformers have achieved great success due to their high modeling capabilities. However, their remarkable performance comes with a heavy computational cost, which makes them unsuitable for real-time applications.

01

Overview

In today's share, the researchers propose a family of high-speed Vision transformers called EfficientViT. They find that the speed of existing transformer models is often limited by memory-inefficient operations, especially tensor reshaping and element-wise functions in MHSA. They therefore design a new building block with a sandwich layout, i.e., a single memory-bound MHSA placed between efficient FFN layers, which improves memory efficiency while enhancing channel communication.
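
The sandwich layout can be sketched in a few lines of PyTorch. This is only a minimal illustration of the idea (several FFN layers wrapped around a single attention layer), with illustrative module sizes; it is not the official EfficientViT implementation.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Residual feed-forward layer."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)

class SandwichBlock(nn.Module):
    """num_ffn FFN layers, a single MHSA layer, then num_ffn FFN layers again."""
    def __init__(self, dim, num_heads=4, ffn_ratio=2, num_ffn=2):
        super().__init__()
        self.pre_ffn = nn.ModuleList(FFN(dim, dim * ffn_ratio) for _ in range(num_ffn))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.post_ffn = nn.ModuleList(FFN(dim, dim * ffn_ratio) for _ in range(num_ffn))

    def forward(self, x):                      # x: (batch, tokens, dim)
        for ffn in self.pre_ffn:
            x = ffn(x)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + attn_out                       # the single memory-bound MHSA in the middle
        for ffn in self.post_ffn:
            x = ffn(x)
        return x

x = torch.randn(1, 196, 192)                   # 14x14 tokens, 192 channels (illustrative sizes)
print(SandwichBlock(192)(x).shape)             # torch.Size([1, 196, 192])
```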

Furthermore, they found that attention maps have high similarity across heads, leading to computational redundancy. To address this problem, a cascaded group attention module is proposed that feeds the attention heads with different splits of the full feature, which not only saves computational cost but also improves attention diversity.

Comprehensive experiments show that EfficientViT outperforms existing efficient models with a good balance between speed and accuracy. For example, EfficientViT-M5 outperforms MobileNetV3 Large by 1.9% in accuracy, while achieving 40.4% and 45.2% higher throughput on an Nvidia V100 GPU and an Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT XXS, EfficientViT-M2 achieves 1.8% higher accuracy while running 5.8x/3.7x faster on GPU/CPU, and 7.4x faster when converted to ONNX format.

02

Background

Several recent works have designed lightweight and efficient Vision transformer models. Unfortunately, most of these methods aim at reducing model parameters or FLOPs, which are indirect proxies for speed and do not reflect the actual inference throughput of the model. For example, on an Nvidia V100 GPU, MobileViT XS with 700M FLOPs runs much slower than DeiT-T with 1,220M FLOPs. Although these methods achieve good performance with fewer FLOPs or parameters, many of them do not show a significant wall-clock speedup over standard isomorphic or hierarchical transformers (such as DeiT and Swin) and have not been widely adopted.

To solve this problem, the researchers explore how to make Vision transformers faster, trying to find principles for designing efficient transformer architectures. Based on the mainstream Vision transformers DeiT and Swin, three main factors affecting model inference speed are systematically analyzed: memory access, computational redundancy, and parameter usage. In particular, they find that the speed of transformer models is often memory-bound.

Based on these analyses and findings, the researchers propose a new family of memory-efficient transformer models, named EfficientViT. Specifically, a new block with a sandwich layout is designed to build the model. The sandwich layout block applies a single memory-bound MHSA layer between FFN layers. It reduces the time cost caused by memory-bound operations in MHSA and applies more FFN layers to allow communication between different channels, which is more memory-efficient. Then, a novel Cascaded Group Attention (CGA) module is proposed to improve computational efficiency. Its core idea is to enhance the diversity of the features fed to the attention heads. Unlike previous self-attention, which uses the same features for all heads, CGA provides a different split of the input features for each head and concatenates the output features across heads. This module not only reduces computational redundancy in multi-head attention, but also improves model capacity by increasing network depth.
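
As a rough sketch of the cascaded group attention idea, the toy module below gives each head its own channel split of the input and adds the previous head's output to the next head's input before attention. It paraphrases the description above under simplifying assumptions (plain linear projections, no positional bias); the released code contains additional details.

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Toy cascaded group attention: one channel split per head, cascaded outputs."""
    def __init__(self, dim, num_heads=4, key_dim=16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.split_dim = dim // num_heads
        self.key_dim = key_dim
        self.scale = key_dim ** -0.5
        # each head has its own small q/k/v projection acting only on its channel split
        self.qkvs = nn.ModuleList(
            nn.Linear(self.split_dim, 2 * key_dim + self.split_dim) for _ in range(num_heads)
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, N, dim)
        splits = x.chunk(self.num_heads, dim=-1)        # one channel split per head
        outs = []
        feat = splits[0]
        for i, qkv in enumerate(self.qkvs):
            if i > 0:
                feat = splits[i] + outs[-1]             # cascade: add previous head's output
            q, k, v = qkv(feat).split([self.key_dim, self.key_dim, self.split_dim], dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            outs.append(attn.softmax(dim=-1) @ v)       # (B, N, split_dim)
        return self.proj(torch.cat(outs, dim=-1))       # concatenate heads and project

x = torch.randn(2, 49, 128)
print(CascadedGroupAttention(128)(x).shape)             # torch.Size([2, 49, 128])
```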

Figure: Throughput and accuracy comparison of lightweight CNN and ViT models.

Last but not least, parameters are redistributed by enlarging the channel width of key network components (e.g., the value projection) while shrinking less important ones (e.g., the hidden dimensions in FFNs). This reallocation ultimately improves the parameter efficiency of the model.

03

Motivation

Speeding up Vision transformers

  • Memory Efficiency

Figure: memory-inefficient operations in Transformer models.

Memory access overhead is a key factor affecting model speed. Many operators in the transformer, such as frequent reshaping, element-wise addition, and normalization, are memory-inefficient and require time-consuming access across different memory units, as shown above. Although some methods address this problem by simplifying the computation of standard softmax self-attention, such as sparse attention and low-rank approximation, they often come at the cost of reduced accuracy and limited acceleration.

In this work, the researchers turn to saving memory access cost by reducing the number of memory-inefficient layers. Recent studies have shown that memory-inefficient operations are mainly located in the MHSA layers rather than the FFN layers. However, most existing ViTs use equal numbers of these two layers, which may not achieve optimal efficiency. Therefore, they explore the optimal allocation of MHSA and FFN layers in small models with fast inference. Specifically, Swin-T and DeiT-T are scaled down to several small sub-networks with 1.25x and 1.5x higher inference throughput, respectively, and the performance of sub-networks with different proportions of MHSA layers is compared.
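
For context, throughput comparisons of this kind are usually obtained with a simple timing loop such as the sketch below. The batch size, resolution, warm-up count, and toy model here are illustrative assumptions, not the exact protocol used in the paper.

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def throughput(model, batch_size=64, resolution=224, warmup=10, runs=30, device="cpu"):
    """Return images/second for a classification model on random input."""
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    for _ in range(warmup):                      # warm-up runs are not timed
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs * batch_size / (time.time() - start)

# usage sketch with a tiny stand-in model
toy = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
print(f"{throughput(toy):.1f} images/s")
```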

  • Computation Efficiency

MHSA embeds input sequences into multiple subspaces (heads) and computes attention maps separately, which has been shown to be effective in improving performance. However, attention maps are computationally expensive, and research shows that some of them are not critical.

To save computational cost, the researchers explore how to reduce redundant attention in small ViT models. Width-reduced Swin-T and DeiT-T models are trained with a 1.25× inference speedup, and the maximum cosine similarity between each head and the remaining heads within each block is measured. From the figure below, it is observed that there is high similarity between the attention heads, especially in the last blocks.

Figure: maximum cosine similarity between attention heads in each block.

This phenomenon suggests that many heads learn similar projections of the same full feature and create computational redundancy. To explicitly encourage the heads to learn different patterns, an intuitive solution is to feed each head with only a part of the full feature, which is similar to the idea of group convolution. A variant of the reduced models is trained with this modified MHSA, and the head similarity is computed in the same way as above. The results show that using different channel splits of the feature in different heads, instead of using the same full feature for all heads as in MHSA, effectively reduces attention computation redundancy.
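
The head-similarity measurement described above can be sketched as follows. How the per-block attention maps are extracted from a real model (e.g., with forward hooks) is omitted, and the random tensor at the end is only a stand-in for actual attention maps.

```python
import torch
import torch.nn.functional as F

def max_head_similarity(attn):                        # attn: (num_heads, N, N) for one block
    """Max cosine similarity of each head's attention map vs. all other heads."""
    h = attn.shape[0]
    flat = F.normalize(attn.reshape(h, -1), dim=-1)   # unit-norm flattened map per head
    sim = flat @ flat.t()                             # pairwise cosine similarities, (h, h)
    sim.fill_diagonal_(-1.0)                          # ignore self-similarity
    return sim.max(dim=-1).values                     # one score per head

attn = torch.softmax(torch.randn(4, 49, 49), dim=-1)  # stand-in attention maps for 4 heads
print(max_head_similarity(attn))
```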

  • Parameter Efficiency

A typical ViT mainly inherits the design strategies of NLP transformers, for example, using equal widths for the Q, K, and V projections, increasing the number of heads over stages, and setting the expansion ratio to 4 in the FFN. For lightweight models, the configuration of these components needs to be carefully redesigned. Inspired by [Rethinking the value of network pruning], Taylor structured pruning is employed to automatically find the important components in Swin-T and DeiT-T and to explore the underlying principles of parameter allocation. The pruning method removes unimportant channels under certain resource constraints and keeps the most critical ones to best preserve accuracy. It uses the product of the gradient and the weight as the channel importance, which approximates the loss fluctuation when a channel is removed.
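
Below is a rough sketch of this gradient-times-weight (Taylor) importance score for the output channels of a single linear layer; the full structured-pruning pipeline used in the analysis involves more machinery and is not reproduced here.

```python
import torch
import torch.nn as nn

def taylor_channel_importance(layer: nn.Linear):
    """One importance score per output channel: sum of |grad * weight| over the row."""
    # assumes loss.backward() has already been called so layer.weight.grad exists
    score = (layer.weight * layer.weight.grad).abs()
    return score.sum(dim=1)

# usage sketch
layer = nn.Linear(8, 4)
loss = layer(torch.randn(16, 8)).pow(2).mean()
loss.backward()
print(taylor_channel_importance(layer))   # larger score = more important channel
```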

04

New framework

Figure: overview of the EfficientViT architecture.

Experiment visualization

Figures: experimental results and visualizations from the paper.

Origin: blog.csdn.net/gzq0723/article/details/131266418