EfficientFormer: Efficient and low-latency Vision Transformers

The Transformer architecture is generally less efficient than CNNs, which leads to high latency when running inference on edge devices. The EfficientFormer paper introduced here claims to reach the inference speed of MobileNet without sacrificing accuracy.

Can Transformers run as fast as MobileNet while achieving high performance? To answer this question, the authors first revisit the network architectures and operators used in ViT-based models and point out several inefficient designs. They then introduce a dimension-consistent pure Transformer (without MobileNet blocks) as a design paradigm, and finally optimize the design with latency as the target, obtaining a family of models called EfficientFormer. A follow-up version, EfficientFormerV2, was designed afterwards.

Latency analysis

The authors make the following observations in the paper:

1. Patch embedding with large kernels and large strides is a speed bottleneck on mobile devices.

2. Consistent feature dimensions are important for the selection of token mixers. MHSA is not necessarily a speed bottleneck.

3. Conv-BN is more latency-friendly than LN (GN)-Linear, and the small accuracy loss it incurs is an acceptable trade-off for the latency reduction (a minimal timing sketch follows this list).

4. The latency of nonlinearities (activation functions) depends on the hardware and the compiler.
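As an illustration of observation 3, below is a minimal micro-benchmark sketch. It is only a host-side PyTorch approximation with assumed shapes and iteration counts, not the paper's on-device measurement setup.

```python
# Minimal latency sketch for observation 3: time a Conv-BN block against an
# LN-Linear block on equivalent features. Shapes, iteration counts and the
# host-side timing method are illustrative assumptions.
import time
import torch
import torch.nn as nn

B, C, H, W = 1, 96, 56, 56                      # assumed feature-map size
x4d = torch.randn(B, C, H, W)                   # BCHW features for Conv-BN
x3d = x4d.flatten(2).transpose(1, 2)            # (B, N, C) tokens for LN-Linear

conv_bn = nn.Sequential(nn.Conv2d(C, C, 1), nn.BatchNorm2d(C)).eval()
ln_linear = nn.Sequential(nn.LayerNorm(C), nn.Linear(C, C)).eval()

def bench(module, x, iters=100):
    with torch.no_grad():
        for _ in range(10):                     # warm-up
            module(x)
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
    return (time.perf_counter() - start) / iters * 1e3  # ms per forward pass

print(f"Conv-BN   : {bench(conv_bn, x4d):.3f} ms")
print(f"LN-Linear : {bench(ln_linear, x3d):.3f} ms")
```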

EfficientFormer overall architecture

The network consists of a patch embedding (PatchEmbed) and a stack of meta-Transformer blocks, denoted as MB:
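In the paper's notation, this is roughly

Y = Π_{i=1}^{m} MB_i(PatchEmbed(X_0^{B,3,H,W}))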

X0 is the input image with batch size B and spatial size [H, W], Y is the desired output, and m is the total number of blocks (depth). MB consists of an unspecified token mixer (TokenMixer), followed by an MLP block:
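That is, roughly,

X_{i+1} = MB_i(X_i) = MLP(TokenMixer(X_i))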

Xi|i>0 is the intermediate feature of the i-th MB. A Stage (or S) is defined as a stack of several MetaBlocks. The network consists of 4 stages, and each stage has an embedding operation (denoted as Embedding) that projects the embedding dimension and downsamples the token length, as illustrated in the paper's architecture figure.

In other words, EfficientFormer is a model built entirely on Transformer blocks and does not integrate any MobileNet structure.

Dimension-Consistent design

The network starts with a 4D partition (convolution-style BCHW features) and switches to a 3D partition (Transformer-style token sequences) in the later stages. First, the input image is processed by the stem layer, two 3 × 3 convolutions with stride 2, which serve as the patch embedding:
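In the paper's notation, roughly,

X_1^{B, C_1, H/4, W/4} = PatchEmbed(X_0^{B, 3, H, W})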

where C_j is the channel number (width) of the j-th stage. The network then starts with MB4D blocks, which use a simple Pool mixer to extract low-level features:
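Approximately, with X_i of shape (B, C_j, H/2^{j+1}, W/2^{j+1}),

I_i = Pool(X_i) + X_i
X_{i+1} = Conv_B(Conv_{B,G}(I_i)) + I_i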

In the formula, Conv_B and Conv_{B,G} indicate whether the convolution is followed by BN, and by BN plus GeLU, respectively. After all MB4D blocks have been processed, a one-time reshape converts the feature shape and the network enters the 3D partition. MB3D follows the conventional ViT structure:
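Approximately, with X_i of shape (B, HW/4^{j+1}, C_j),

I_i = Linear(MHSA(Linear(LN(X_i)))) + X_i
X_{i+1} = Linear_G(Linear(LN(I_i))) + I_i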

In the formula, Linear_G denotes a Linear layer followed by GeLU, and MHSA is:
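In the paper's notation, roughly,

MHSA(Q, K, V) = Softmax(Q · K^T / √C_j + b) · V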

Here, Q, K, and V denote the query, key, and value, respectively, and b is a learned (parameterized) attention bias that serves as the position encoding.
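To make the two block types concrete, here is a minimal PyTorch sketch. It is an illustrative assumption rather than the official implementation: the pool size, MLP ratio, and head count are guesses, nn.MultiheadAttention stands in for the paper's MHSA with learned attention bias b, and its internal projections play the role of the Linear layers around MHSA.

```python
# Minimal sketch of the MB4D and MB3D blocks described above (assumed hyper-parameters).
import torch
import torch.nn as nn

class MB4D(nn.Module):
    """Pool token mixer + Conv-BN MLP, operating on BCHW features."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU(),  # Conv_{B,G}
            nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim),                # Conv_B
        )

    def forward(self, x):
        x = self.pool(x) + x        # I_i = Pool(X_i) + X_i
        return self.mlp(x) + x      # X_{i+1} = Conv_B(Conv_{B,G}(I_i)) + I_i

class MB3D(nn.Module):
    """LN + MHSA and LN + MLP (GeLU), operating on (B, N, C) tokens."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        y = self.norm1(x)
        x = self.attn(y, y, y, need_weights=False)[0] + x   # I_i
        return self.mlp(self.norm2(x)) + x                  # X_{i+1}

# Quick shape check: a 4D block, then the reshape into the 3D partition.
feat = torch.randn(2, 96, 14, 14)
tokens = MB4D(96)(feat).flatten(2).transpose(1, 2)   # (B, C, H, W) -> (B, N, C)
print(MB3D(96)(tokens).shape)                        # torch.Size([2, 196, 96])
```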

After defining the overall architecture, the next step for the authors is to search for an efficient architecture.

Latency-targeted architecture optimization

To search for efficient models, a supernet is defined using MetaPath (MP), the set of possible blocks at each position:
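In the paper's notation, roughly,

MP_{i, j=1,2} ∈ { MB4D_i, I_i }
MP_{i, j=3,4} ∈ { MB4D_i, MB3D_i, I_i }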

where I denotes the identity path.

In S1 and S2 of the network, each block can choose MB4D or I, and in S3 and S4, each block can choose MB3D, MB4D or I.

There are two reasons for enabling MB3D only in the last two stages: 1. Since the computation of MHSA grows quadratically with the token length, integrating it in the early stages would greatly increase the computational cost. 2. The early stages of the network capture low-level features, while the later stages learn long-range dependencies.

The search space includes Cj (width of each Stage), Nj (number of blocks per Stage, i.e. depth) and the last N blocks to which MB3D is applied.

The search algorithm trains a supernet using Gumbel Softmax sampling to obtain an importance score for blocks within each MP:
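Roughly, the sampled output of each MP is

X_{i+1} = Σ_n [ exp((α_i^n + ε_i^n)/τ) / Σ_m exp((α_i^m + ε_i^m)/τ) ] · MP_{i,n}(X_i)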

where α evaluates the importance of each block in the MP, since it represents the probability of selecting that block, ε ~ U(0,1) ensures exploration, and τ is the temperature. For S1 and S2, n ∈ {4D, I}; for S3 and S4, n ∈ {4D, 3D, I}.

Finally, a latency lookup table is built by collecting the on-device latencies of MB4D and MB3D at different widths (multiples of 16).

In other words, the architecture of EfficientFormer is not designed manually but found through NAS (Neural Architecture Search). The authors estimate the latency of each action with the lookup table and evaluate the accuracy drop caused by each action, then select actions based on the accuracy drop per unit of latency saved (-%/ms). This process is repeated iteratively until the target latency is reached. (See the appendix of the paper for details.)
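The iterative selection loop can be sketched as follows. This is a simplified illustration under assumed numbers, not the paper's actual search code; Action, acc_drop, and latency_saving are hypothetical names.

```python
# Greedy, latency-driven selection sketch: repeatedly apply the candidate action
# with the smallest accuracy drop per millisecond saved until the latency budget
# is met. All candidate values below are made up for illustration.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    acc_drop: float        # estimated top-1 drop in %, from supernet importance scores
    latency_saving: float  # ms saved, from the on-device latency lookup table

def search(actions, current_latency, target_latency):
    chosen = []
    while current_latency > target_latency and actions:
        best = min(actions, key=lambda a: a.acc_drop / a.latency_saving)
        actions.remove(best)
        chosen.append(best.name)
        current_latency -= best.latency_saving
    return chosen, current_latency

candidates = [
    Action("drop MB3D in stage 4", acc_drop=0.3, latency_saving=0.8),
    Action("shrink stage 3 width by 16", acc_drop=0.1, latency_saving=0.5),
    Action("drop MB4D in stage 2", acc_drop=0.4, latency_saving=0.3),
]
print(search(candidates, current_latency=3.0, target_latency=1.8))
```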

Results

Compared with widely used CNN-based models on ImageNet, EfficientFormer achieves a better trade-off between accuracy and latency.

Traditional ViTs still fall short in terms of latency. EfficientFormer-L3's top-1 accuracy is 1% higher than PoolFormer-S36, while being 3x faster on an Nvidia A100 GPU, 2.2x faster on the iPhone NPU, and 6.8x faster on the iPhone CPU.

EfficientFormer-L1's top-1 accuracy is 4.4% higher than MobileViT-XS, and it runs faster across different hardware and compilers.

On the MS COCO dataset, EfficientFormer consistently outperforms the CNN (ResNet) and Transformer (PoolFormer) backbones.

On ADE20K, EfficientFormer consistently performs better than CNN- and Transformer-based backbones under similar computational budgets.

Paper address:

EfficientFormer: Vision Transformers at MobileNet Speed

https://avoid.overfit.cn/post/eb0e56c5753942cf8ee70d78e2cd7db7


Origin blog.csdn.net/m0_46510245/article/details/133296998