Hydra Attention study notes

Hydra Attention: Efficient Attention with Many Heads

Abstract

While transformers have begun to dominate many tasks in vision, applying them to large images remains computationally difficult. A big reason is that self-attention scales quadratically with the number of tokens, which in turn scales quadratically with the image size. On larger images (e.g., 1080p), more than 60% of the computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this problem by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features, with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.

Keywords: Vision Transformers, Attention, Token Efficiency

1 Introduction

Due to their versatility and ability to learn from large amounts of data, transformers [32] have been a dominant force in natural language processing (NLP) over the past few years [17, 25, 6]. Now the same takeover is happening in the visual domain with the introduction of Vision Transformers (ViTs) [10].

However, unlike in NLP, pure transformer instantiations such as BERT [17] in NLP or ViT [10] in vision have not become the dominant force in computer vision tasks. Instead, more specialized vision-specific attention architectures such as Swin [21] or MViT [11, 20], or attention-convolution hybrids such as LeViT [13], have emerged.

**The main reason behind this difference is efficiency:** specialized vision transformers can achieve better performance with less computation, whether by adding convolution layers, by using vision-specific local window attention, or by cheaply injecting a visual inductive bias in some other way. While pure ViTs can perform well at scale (90.45% top-1 on ImageNet [38]), the core mechanism of the pure transformer, multi-head self-attention [32], can become an extreme bottleneck when applying these models to the large images required by many downstream tasks.

In fact, when applying an off-the-shelf ViT to 1080p images, as typically used in benchmark tasks such as segmentation (e.g., CityScapes [7]), 60% of the total computation in the network (see Table 4) is spent simply creating and applying the self-attention matrix, compared to only 4% on 224 × 224 ImageNet [9] images. In pure transformers, the cost of these attention matrices scales quadratically with the number of tokens, which can already be very expensive (e.g., long sentences in NLP). But in ViTs the problem is compounded by the fact that the number of tokens itself scales quadratically with the image size, which means that doubling the image size increases the cost of attention by a factor of 16.
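As a sanity check on those numbers, the scaling is easy to reproduce with a back-of-the-envelope count (a rough sketch assuming 16 × 16 patches as in ViT-B/16 and D = 768 features, counting only the work of creating and applying the T × T attention matrix):

```python
# Rough sketch: how self-attention cost grows with image size for a ViT.
# Assumes 16x16 patches (ViT-B/16) and D = 768 features; counts only the
# ~2*T^2*D multiply-adds for creating and applying the T x T attention matrix.

def attention_cost(height, width, patch=16, dim=768):
    tokens = (height // patch) * (width // patch)
    flops = 2 * tokens ** 2 * dim
    return tokens, flops

for h, w in [(224, 224), (448, 448), (1920, 1080)]:
    tokens, flops = attention_cost(h, w)
    print(f"{h}x{w}: {tokens:5d} tokens, ~{flops / 1e9:6.1f} GFLOPs of attention per layer")
```

Doubling the image side from 224 to 448 quadruples the token count (196 → 784) and multiplies the attention cost by roughly 16, matching the factor quoted above.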


In NLP, a large number of techniques have been developed to address this problem. Some works introduce attention that is "linear" in the number of tokens, either by rearranging the order of computation with a "kernel trick" [5, 28, 16, 24] or by projecting onto a low-rank, token-independent space [34, 5, 24], and some employ both at once. However, most of these "linear" attention methods trade computation across tokens for computation across features, which still makes them quite expensive. In fact, Flash Attention [8] recently showed that an efficient IO-aware implementation of multi-head self-attention can outperform most of these "linear" attention methods, even when the number of tokens is in the thousands.

There are also works that attempt efficient attention in the vision space, but none explore it on its own within a traditional ViT shell. PolyNL [2] treats attention as an efficient third-order polynomial, but this has not been explored in ViT architectures. Attention Free Transformer [37] has an AFT-Simple variant that is similarly efficient, but it performs poorly in a pure ViT and requires additional support from convs and positional encodings. We test both methods in a standard DeiT [31] shell (see Table 1) and find that, while efficient, both cause a significant drop in accuracy. Thus there is room in the literature for a truly efficient, accurate, and general alternative to multi-head self-attention.


To this end, we introduce Hydra Attention (see Figure 1). Our approach stems from a somewhat counterintuitive behavior of linear attention: **with standard multi-head self-attention, adding more heads keeps the computational cost constant, but after changing the order of operations in linear attention, adding more heads actually reduces the computational cost of the layer.** We take this observation to its extreme by setting the number of heads equal to the number of features, creating an attention module that is computationally linear in both tokens and features.


Hydra Attention is not only a more general formulation of previous efficient attention work (see Section 3.5); with the right kernel, it can also be significantly more accurate (see Table 1). In fact, when mixed with standard multi-head attention, **Hydra Attention can actually improve the accuracy of the baseline DeiT-B model while being faster** (see Figure 4). And since it is derived from multi-head attention, our method retains several desirable properties of attention, such as interpretability (see Figure 3) and generalizability to different tasks.


However, while Hydra Attention is general and effective on large images, in this paper we focus only on ImageNet [9] classification with DeiT-B [31], which traditionally uses smaller 224 × 224 and 384 × 384 images. Although the efficiency gain is not as large there (10-27%, depending on image size), other efficient attention methods (such as [37, 2]) suffer huge accuracy drops in this setting (see Table 1), while Hydra Attention does not. We hope that Hydra Attention can be a stepping stone toward pure transformers that routinely operate on large numbers of tokens in the future.

Our contributions are as follows. We conduct a study of how many heads a transformer can have (Figure 2) and find that 12 is about the limit for softmax attention, but with the right kernel any number is feasible. We exploit this observation to introduce Hydra Attention for pure transformers by maximizing the number of heads in multi-head self-attention (Section 3). We then mathematically analyze how Hydra Attention behaves (Section 3.4) and introduce a method to visualize its focus (Figure 3). Finally, we find that by replacing specific attention layers with Hydra Attention (Figure 4), we can either improve accuracy by 1% or match the accuracy of the baseline while producing a strictly faster model.

2 Related Work

In this paper, we aim to speed up transformer inference time by removing the token-quadratic computational bottleneck in multi-head self-attention.

Efficient Attention

Multi-head self-attention [32] is famously slow, and a great deal of work has tried to address its computational shortcomings in different domains.

In natural language processing, some works approximate attention using decomposable kernel functions [5, 28, 16, 24]. This "kernel trick" allows them to reorder the matrix multiplications so the cost scales with features rather than tokens. Some of these methods go a step further and reduce the dimensionality of the matrix multiplications by projecting into a low-rank space [34, 5, 24]. However, these "linear" attention methods trade computation across tokens for computation across features, which can make them expensive. In fact, in the domain of this paper (ImageNet classification), there are not enough tokens to justify these methods, and most of them actually produce a slower model. And even with thousands of tokens, Flash Attention [8] has shown that an IO-aware implementation of multi-head self-attention can outperform the fastest of these methods.

**But reordering operations is not the only way to speed up attention.** In fact, the most common way to "linearize" attention in vision is local window attention (as in [21, 3, 19]). This is indeed linear in the number of tokens, but local window attention can be complicated to implement (especially in the case of Swin [21]) and is only possible in dense, spatially ordered modalities such as images and video.

Instead, we aim to produce a linear attention method that is efficient, fast in practice, and applicable across several different modalities.

Efficient Transformers

Replacing the attention module is not the only way to speed up transformer inference. In fact, depending on the task and the number of tokens, other efficient transformer methods may be preferable. For example, on ImageNet [9] classification, attention accounts for only 4% of total network computation, which means that 4% is the maximum speedup obtainable by modifying attention alone.

There are several efficient vision transformers that blend convolution and attention to create a more efficient final product, such as LeViT [13], MobileViT [22], Mobile-Former [4], and LVT [35]. All of these are valid strategies for images, and we consider them adjacent techniques. Other efficient attention papers for vision, such as [37, 2], use convolution in addition to efficient attention, making it difficult to tell whether the improvement comes from the attention method or from the added convolution.

In this paper, we make no modifications to the underlying ViT architecture other than replacing multi-head self-attention with Hydra Attention, in order to clearly isolate its impact on performance.

Multihead Attention

Hydra Attention relies on increasing the number of heads used in multi-head attention. Interestingly, the number of heads has not been studied in depth since its introduction in [32]. There have been some studies on pruning attention heads [33, 23], but all in the direction of reducing the head count. In fact, even ViT-G, the largest ViT model explored in [38], uses only 16 attention heads. We therefore conduct our own study in Figure 2.

3 Hydra Attention

Standard multi-head self-attention [32] scales quadratically with the number of tokens in the image. More specifically, if T is the number of tokens and D is the number of feature dimensions, then creating and applying the attention matrix are both $O(T^2 D)$ operations. This is a problem when T is large (as is the case for large images), since the operation quickly becomes computationally infeasible.

3.1 The Kernel Trick

As discussed in Section 2, many works [5, 28, 16, 24] have tried to solve this problem by introducing "linear" attention. Given queries Q, keys K, and values V, each in $\mathbb{R}^{T \times D}$, standard softmax self-attention is computed as

$$A(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^T}{\sqrt{D}}\right) V \tag{1}$$

Computing $QK^T$ takes $O(T^2 D)$ operations and creates a $T \times T$ matrix, which scales poorly with T. Following [16], we can generalize this operation by treating softmax(·) as a pairwise similarity between Q and K. That is, for some similarity function sim(·), we can write

$$A(Q, K, V) = \operatorname{sim}(Q, K)\, V \tag{2}$$

If the similarity function decomposes as $\operatorname{sim}(x, y) = \phi(x)\,\phi(y)^T$ for some kernel feature map $\phi(\cdot)$, the matrix multiplications can be reordered so that the $T \times T$ matrix is never formed:

$$A(Q, K, V) = \left(\phi(Q)\,\phi(K)^T\right) V \tag{3}$$

$$A(Q, K, V) = \phi(Q)\left(\phi(K)^T V\right) \tag{4}$$

Eq. 4 computes the same result as Eq. 3 but costs $O(TD^2)$ instead of $O(T^2 D)$.
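As a concrete illustration (a minimal PyTorch sketch, not the authors' code), here is the difference between Eq. 1 and the reordered Eq. 4, using L2 normalization as the kernel feature map φ so that sim(x, y) is cosine similarity, one of the kernels considered later:

```python
import torch
import torch.nn.functional as F

def softmax_attention(Q, K, V):
    # Eq. 1: materializes a T x T matrix -> O(T^2 * D) time, O(T^2) memory.
    A = torch.softmax(Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5, dim=-1)
    return A @ V

def linear_attention(Q, K, V, phi=lambda x: F.normalize(x, dim=-1)):
    # Eq. 4: phi(Q) (phi(K)^T V). The D x D matrix phi(K)^T V does not depend
    # on the number of tokens, so the cost drops to O(T * D^2).
    return phi(Q) @ (phi(K).transpose(-2, -1) @ V)

T, D = 196, 768                      # e.g. ViT-B/16 on a 224x224 image
Q, K, V = (torch.randn(T, D) for _ in range(3))
out = linear_attention(Q, K, V)      # shape (196, 768)
```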

3.2 Multi-Head Attention

Although linear in T, the result in Eq. 4 is still undesirable: D is usually large (≥ 768), so creating a D × D matrix and performing $O(TD^2)$ operations is still quite expensive. However, Eq. 1 through Eq. 4 assume we create a single attention matrix, i.e., that attention has one "head".

In practice, most vision transformers use H heads (typically between 6 and 16), with each head creating and applying its own attention matrix. Following [32], each head operates on its own subset of D/H features from Q, K, and V. Equation 1 therefore becomes, for each head $h \in \{1, \dots, H\}$ with $Q_h, K_h, V_h \in \mathbb{R}^{T \times D/H}$,

$$A(Q_h, K_h, V_h) = \operatorname{softmax}\!\left(\frac{Q_h K_h^T}{\sqrt{D/H}}\right) V_h \tag{5}$$

Each head still builds a $T \times T$ matrix, so the total cost over all H heads remains

$$O(T^2 D) \tag{6}$$

Applying the kernel reordering of Eq. 4 within each head instead gives multi-head linear attention:

$$A(Q_h, K_h, V_h) = \phi(Q_h)\left(\phi(K_h)^T V_h\right) \tag{7}$$

where each head now builds only a $(D/H) \times (D/H)$ matrix, for a total cost of

$$O\!\left(\frac{T D^2}{H}\right) \tag{8}$$
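In code, the change from Eq. 4 to Eq. 7 is just a reshape into H independent heads (a sketch continuing the one above and reusing `linear_attention` and the Q, K, V tensors; names are illustrative): each head builds a (D/H) × (D/H) matrix instead of D × D, so the total cost falls to O(TD²/H).

```python
def multihead_linear_attention(Q, K, V, heads):
    # Split the D features into H heads of D/H features each (Eq. 7), run
    # linear attention per head, then concatenate the head outputs to (T, D).
    T, D = Q.shape
    q = Q.view(T, heads, D // heads).transpose(0, 1)   # (H, T, D/H)
    k = K.view(T, heads, D // heads).transpose(0, 1)
    v = V.view(T, heads, D // heads).transpose(0, 1)
    out = linear_attention(q, k, v)                    # per-head (D/H x D/H) matrices
    return out.transpose(0, 1).reshape(T, D)           # total cost ~ O(T * D^2 / H)

out = multihead_linear_attention(Q, K, V, heads=12)    # typical ViT-B head count
```

With H = 12 and D = 768, each head works with 64 features; how far H can be pushed is the subject of the next subsection.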

3.3 Adding Heads

Given Eq. 8, the more heads we add to the network, the faster multi-head linear attention becomes. This begs the question: how many heads can we reasonably add? Most transformers use 6 to 16 heads [32, 17, 10, 38], depending on the number of features D, but what happens if we increase the head count beyond that?

To find out, we train DeiT-B [31] on ImageNet-1k [9] with either standard multi-head softmax self-attention (Eq. 5, MSA) or multi-head linear attention with cosine similarity (Eq. 7, MLA), varying the number of heads H; the results are plotted in Figure 2. In terms of memory usage, MSA runs out of memory when H > 96, and MLA runs out of memory when H < 3.

In terms of performance, MSA accuracy tanks for H > 12, while the accuracy of MLA with cosine similarity remains fairly consistent all the way up to H = 768. Remarkably, with that many heads, H equals D, which means each head has only a single scalar feature to work with!


3.4 The Hydra Trick

As shown in Figure 2, scaling H up is feasible as long as the similarity function sim(x, y) is not softmax. To exploit this, we introduce the "hydra trick": setting H = D:

With H = D, each $Q_h, K_h, V_h$ in Eq. 7 is a single column of Q, K, V (Eq. 9), and the whole operation collapses into element-wise products:

$$\operatorname{Hydra}(Q, K, V) = \phi(Q) \odot \sum_{t=1}^{T} \phi(K)_t \odot V_t \tag{10}$$

where $\odot$ denotes element-wise multiplication (broadcast across tokens) and $\phi(\cdot)$ is the kernel feature map (cosine similarity, i.e., L2 normalization, in our experiments).

Furthermore, although the derivation of Eq. 10 comes from multi-head attention, it ends up doing something quite different: it first creates a global feature vector $\sum_{t=1}^{T} \phi(K)_t \odot V_t$ that aggregates information across all tokens in the image. Each $\phi(Q)_t$ then gates the importance of this global feature for its output token. Hydra Attention therefore mixes information through a global bottleneck, rather than performing explicit token-to-token mixing like standard self-attention.
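Put together, Eq. 10 with the cosine-similarity kernel comes down to a few lines (a minimal sketch consistent with the description above, with φ taken as L2 normalization along the feature dimension; function names are illustrative and the reference implementation may differ):

```python
import torch
import torch.nn.functional as F

def hydra_attention(Q, K, V):
    """Hydra attention (Eq. 10) with a cosine-similarity kernel.
    Q, K, V: (batch, tokens, features). Time and space are O(T * D)."""
    phi_q = F.normalize(Q, dim=-1)                 # phi(Q)
    phi_k = F.normalize(K, dim=-1)                 # phi(K)
    kv = (phi_k * V).sum(dim=-2, keepdim=True)     # global feature vector, (B, 1, D)
    return phi_q * kv                              # each token gates the global feature

x = torch.randn(2, 196, 768)                       # e.g. ViT-B/16 tokens on 224x224 images
out = hydra_attention(x, x, x)                     # in a real block, Q, K, V come from
                                                   # separate linear projections of x
```

Note that no T × T or D × D matrix is ever formed; the only intermediate is the (B, 1, D) global feature vector.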

This results in a computational complexity of

$$O(TD)$$
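To make this explicit, here is a rough per-term operation count for Eq. 10 (my own tally, ignoring the linear projections that produce Q, K, and V, which cost $O(TD^2)$ in a ViT block regardless of the attention used):

$$\underbrace{O(TD)}_{\text{normalize } Q,\,K} \;+\; \underbrace{O(TD)}_{\phi(K)\,\odot\, V} \;+\; \underbrace{O(TD)}_{\text{sum over } T} \;+\; \underbrace{O(TD)}_{\phi(Q)\,\odot\,(\cdot)} \;=\; O(TD)$$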

This leaves us with an efficient token-mixing module that is linear in both the number of tokens and the number of features in the model, without the additional constants present in other linear attention methods (e.g., [5, 34, 16]). Note that the space complexity of this technique is also O(TD), which matters for practical speed since many operations are IO-bound (see [8]).

3.5 Relation to Other Works

There are a few other O(TD) attention candidates in the literature: Attention Free Transformer [37] (specifically AFT-Simple) and PolyNL [2]. In this section, we explore how the Hydra Attention of Eq. 10 relates to each.

AFT-Simple [37] is described as

$$\operatorname{AFT\text{-}Simple}(Q, K, V)_t = \sigma(Q_t) \odot \sum_{t'=1}^{T} \operatorname{softmax}(K)_{t'} \odot V_{t'}$$

where $\sigma(\cdot)$ is the sigmoid and the softmax is taken over the token dimension.
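For comparison, here is a sketch of AFT-Simple as I read it from [37] (not the authors' code), written to mirror `hydra_attention` above: it has the same gate-a-global-feature structure, with σ(·) in place of φ(Q) and a softmax over the token dimension in place of φ(K).

```python
import torch

def aft_simple(Q, K, V):
    # AFT-Simple [37]: y_t = sigmoid(Q_t) * sum_t'( softmax(K)_t' * V_t' ),
    # where the softmax is taken across the token dimension.
    w = torch.softmax(K, dim=-2)                   # (B, T, D), normalized over tokens
    kv = (w * V).sum(dim=-2, keepdim=True)         # global feature vector, (B, 1, D)
    return torch.sigmoid(Q) * kv                   # same structure as hydra_attention
```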


4 Experiments


5 Conclusion and Future Directions

This paper introduces Hydra Attention, an efficient attention module that uses as many heads as features. We show that Hydra Attention outperforms other O(TD) attention methods in Table 1 and can even work together with traditional multi-head self-attention to improve the accuracy of the baseline DeiT-B model in Figure 4. However, although Hydra Attention performs well on ImageNet classification (Table 2, Table 3), its real acceleration potential lies in larger images (Table 4).

We have taken a first step in showing that Hydra Attention can work, and we hope future work will explore its use in other, more token-intensive domains such as detection, segmentation, or video. Furthermore, Hydra Attention is a general technique that makes no assumptions about the relationships between tokens, so it could also be applied to further speed up token-sparse applications such as masked pretraining [14, 12, 30] or token pruning [26, 18, 36]. We hope Hydra Attention can serve as a step toward more powerful, more efficient, and more general transformers in the future.


Origin blog.csdn.net/charles_zhang_/article/details/127488442