[Paper Sharing] VOLO: Vision Outlooker for Visual Recognition

VOLO: Vision Outlooker for Visual Recognition

Overview

Vision transformers (ViTs) have been extensively explored for visual recognition. However, because they are less efficient at encoding fine-level features, ViTs still perform worse than state-of-the-art CNNs when trained from scratch on a midsize dataset such as ImageNet.

Through experimental analysis, the authors found that:
1) simple tokenization of input images fails to model important local structures such as edges and lines, resulting in low training-sample efficiency;
2) the redundant attention backbone design of ViTs leads to limited feature richness under a fixed computation budget and limited training data.

To overcome these limitations, the authors propose a new, simple, and general architecture called Vision Outlooker (VOLO). It introduces a novel outlook attention operation that dynamically performs local feature aggregation on the input image in a sliding-window manner. Unlike self-attention, which focuses on modeling global dependencies among local features at a coarse level, the proposed outlook attention aims to encode finer-level features, which are crucial for recognition but ignored by self-attention. Outlook attention also sidesteps the bottleneck of self-attention, whose computational cost scales quadratically with the spatial size of the input, and is therefore much more memory efficient.

In other words, the authors propose a new Vision Outlooker module that generates attention weights using only spatially neighboring information in pixel space.

Method

An Outlooker block uses an outlook attention layer to encode spatial information, followed by an MLP that exchanges information across channels.
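For concreteness, here is a minimal PyTorch-style sketch of such a block. It assumes an `OutlookAttention` module (sketched in full after Eq. (3) below); the pre-norm layout and the MLP expansion ratio are common choices used for illustration, not necessarily the authors' exact configuration.

```python
import torch.nn as nn

class Outlooker(nn.Module):
    """Sketch of one Outlooker block: outlook attention for spatial
    information encoding, then an MLP for channel mixing, with residuals."""
    def __init__(self, dim, kernel_size=3, mlp_ratio=3.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # OutlookAttention is sketched in full after Eq. (3) below.
        self.attn = OutlookAttention(dim, kernel_size)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                        # x: (B, H, W, C)
        x = x + self.attn(self.norm1(x))         # spatial information encoding
        x = x + self.mlp(self.norm2(x))          # channel information exchange
        return x
```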

Outlook attention is built on two insights:
1) the feature at each spatial location is representative enough to generate attention weights for locally aggregating its neighboring features;
2) dense local spatial aggregation can encode fine-level information efficiently.

Outlook attention
Let $\mathbf{V}_{\Delta_{i,j}} \in \mathbb{R}^{C \times K^{2}}$ denote all the values within the local $K \times K$ window centered at $(i, j)$, i.e.,

$$\mathbf{V}_{\Delta_{i,j}} = \lbrace \mathbf{V}_{i+p-\lfloor \frac{K}{2} \rfloor,\, j+q-\lfloor \frac{K}{2} \rfloor} \rbrace, \quad 0 \leq p, q < K. \tag{1}$$
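In code, the windows $\mathbf{V}_{\Delta_{i,j}}$ of Eq. (1) can be gathered with `torch.nn.Unfold`; the tensor sizes below are made-up example values.

```python
import torch
import torch.nn as nn

B, C, H, W, K = 2, 192, 28, 28, 3
v = torch.randn(B, C, H, W)                      # value feature map V

# Unfold extracts the K x K neighborhood around every location (i, j),
# i.e. V_{Delta_{i,j}} in Eq. (1); padding keeps the spatial size unchanged.
unfold = nn.Unfold(kernel_size=K, padding=K // 2, stride=1)
v_windows = unfold(v)                            # (B, C*K*K, H*W)
v_windows = v_windows.reshape(B, C, K * K, H * W)
print(v_windows.shape)                           # torch.Size([2, 192, 9, 784])
```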

The outlook weight at location $(i,j)$ is used directly as the attention weight for value aggregation, by reshaping it to $\hat{\mathbf{A}}_{i,j} \in \mathbb{R}^{K^{2} \times K^{2}}$ and applying a Softmax function. The value aggregation step can then be written as

$$\mathbf{Y}_{\Delta_{i,j}} = \operatorname{MatMul}(\operatorname{Softmax}(\hat{\mathbf{A}}_{i,j}), \mathbf{V}_{\Delta_{i,j}}). \tag{2}$$

Outlook attention aggregates the projected value representations densely: summing the differently weighted values that different local windows contribute to the same location yields the output

$$\tilde{\mathbf{Y}}_{i,j} = \sum_{0 \leq m, n < K} \mathbf{Y}_{\Delta_{i+m-\lfloor \frac{K}{2} \rfloor,\, j+n-\lfloor \frac{K}{2} \rfloor}}^{i,j}. \tag{3}$$
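Putting Eqs. (1)-(3) together, the following is a simplified, single-head PyTorch sketch of outlook attention. The official implementation is multi-head and supports a stride option; the layer names here are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Simplified single-head outlook attention implementing Eqs. (1)-(3)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.K = kernel_size
        self.scale = dim ** -0.5
        self.v = nn.Linear(dim, dim)                        # value projection
        # Outlook weights come from a plain linear layer on each token,
        # reshaped to K^2 x K^2 -- no query-key matrix multiplication.
        self.attn = nn.Linear(dim, kernel_size ** 4)
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        K = self.K
        v = self.v(x).permute(0, 3, 1, 2)                   # (B, C, H, W)
        # Eq. (1): gather the K x K value window around every location.
        v = self.unfold(v).reshape(B, C, K * K, H * W)      # (B, C, K^2, HW)
        v = v.permute(0, 3, 2, 1)                           # (B, HW, K^2, C)
        # Outlook weights, reshaped to K^2 x K^2 per location.
        a = self.attn(x).reshape(B, H * W, K * K, K * K)
        # Eq. (2): Softmax over the window, then weight the values.
        y = (a * self.scale).softmax(dim=-1) @ v            # (B, HW, K^2, C)
        # Eq. (3): fold sums the overlapping windows back onto (H, W).
        y = y.permute(0, 3, 2, 1).reshape(B, C * K * K, H * W)
        y = F.fold(y, output_size=(H, W), kernel_size=K, padding=K // 2)
        return self.proj(y.permute(0, 2, 3, 1))             # (B, H, W, C)
```

The key difference from self-attention is visible in `self.attn`: the $K^{2} \times K^{2}$ weights for each window come from a single linear projection of the center token followed by a reshape, not from a query-key matrix multiplication.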

Outlook attention inherits the merits of both convolution and self-attention, and offers the following advantages.

  • First, outlook attention encodes spatial information by measuring the similarity between pairs of token representations, which is more parameter-efficient for feature learning than convolution, as studied in previous work [57], [66].
  • Second, outlook attention adopts a sliding-window mechanism to encode token representations locally at a fine level, and to some extent preserves the positional information that is crucial for visual tasks [42], [71].
  • Third, the generation of attention weights is simple and efficient. Unlike self-attention, which relies on a query-key matrix multiplication, the outlook weights can be generated directly by a simple reshape operation, saving computation. To see this, compare the multiply-adds (M-Adds) of self-attention (SA), local self-attention (LSA), and outlook attention (OA):

$$\begin{aligned} \text{M-Adds}(\mathrm{SA}) &\approx 4HWC^{2} + 2(HW)^{2}C & (4)\\ \text{M-Adds}(\mathrm{LSA}) &\approx 4HWC^{2} + 2HWK^{2}C & (5)\\ \text{M-Adds}(\mathrm{OA}) &\approx HWC(2C + NK^{4}) + HWK^{2}C. & (6) \end{aligned}$$
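As a rough sanity check of Eqs. (4)-(6), the snippet below plugs in example dimensions; the values of H, W, C, K, and N are illustrative assumptions, not taken from the paper's tables.

```python
# Rough M-Adds comparison from Eqs. (4)-(6); the dimensions are example
# values chosen for illustration, not the paper's exact settings.
H = W = 28      # spatial resolution of the token map
C = 192         # embedding dimension
K = 3           # local window size
N = 6           # number of heads

sa  = 4 * H * W * C**2 + 2 * (H * W)**2 * C               # self-attention
lsa = 4 * H * W * C**2 + 2 * H * W * K**2 * C             # local self-attention
oa  = H * W * C * (2 * C + N * K**4) + H * W * K**2 * C   # outlook attention

for name, madds in [("SA", sa), ("LSA", lsa), ("OA", oa)]:
    print(f"{name:>3}: {madds / 1e6:.1f} M mult-adds")
```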

Network Architecture

VOLO architecture
The overall network architecture of VOLO: the input image is first fed into a convolutional stem for patch embedding. The main body of VOLO consists of two stages, built from the proposed Outlooker blocks in Stage 1 and Transformer blocks in Stage 2; the Outlookers are responsible for fine-level feature encoding. More detailed architectural information can be found in Table 2.
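A high-level sketch of this two-stage forward pass is given below, reusing the `Outlooker` sketch from earlier. The stem design, block counts, channel widths, and the mean-pooling head are placeholders rather than the exact VOLO-D1 configuration (the paper uses a class token with class attention and a token-labeling objective); see Table 2 for the real settings.

```python
import torch.nn as nn

class VOLOSketch(nn.Module):
    """Simplified two-stage VOLO: Outlookers on fine tokens, then
    Transformers on coarser tokens. Layer counts/widths are placeholders."""
    def __init__(self, num_classes=1000, dim1=192, dim2=384):
        super().__init__()
        # Convolutional stem: patch embedding to a fine token map (stride 8).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, dim1, 4, stride=4))
        # Stage 1: Outlooker blocks for fine-level feature encoding.
        self.stage1 = nn.Sequential(*[Outlooker(dim1) for _ in range(4)])
        # Downsample tokens (stride 2) and widen channels before Stage 2.
        self.downsample = nn.Conv2d(dim1, dim2, 2, stride=2)
        # Stage 2: standard Transformer blocks for global aggregation;
        # nn.TransformerEncoderLayer stands in for the paper's blocks.
        self.stage2 = nn.Sequential(*[
            nn.TransformerEncoderLayer(dim2, nhead=8, dim_feedforward=3 * dim2,
                                       batch_first=True, norm_first=True)
            for _ in range(8)])
        self.head = nn.Linear(dim2, num_classes)

    def forward(self, x):                            # x: (B, 3, 224, 224)
        x = self.stem(x)                             # (B, dim1, 28, 28)
        x = x.permute(0, 2, 3, 1)                    # (B, 28, 28, dim1)
        x = self.stage1(x)                           # Outlooker stage
        x = self.downsample(x.permute(0, 3, 1, 2))   # (B, dim2, 14, 14)
        x = x.flatten(2).transpose(1, 2)             # (B, 196, dim2)
        x = self.stage2(x)                           # Transformer stage
        # Mean pooling stands in for the paper's class token / class attention.
        return self.head(x.mean(dim=1))
```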

Architectural information of different variants of VOLO

Experiments

Training settings


Ablation experiments


Method comparison


Model performance evaluation


Reference

@article{Yuan2022Sep,
  author    = {Yuan, Li and Hou, Qibin and Jiang, Zihang and Feng, Jiashi and Yan, Shuicheng},
  title     = {{VOLO: Vision Outlooker for Visual Recognition}},
  journal   = {IEEE Trans. Pattern Anal. Mach. Intell.},
  volume    = {45},
  number    = {5},
  pages     = {6575--6586},
  year      = {2022},
  month     = sep,
  urldate   = {2023-08-24},
  issn      = {1939-3539},
  publisher = {IEEE},
  language  = {english},
  doi       = {10.1109/TPAMI.2022.3206108}
}

Origin: blog.csdn.net/orDream/article/details/132479261