[AI theory learning] Performer language model: a general attention framework based on the Transformer architecture


Performer is a neural network architecture for computing the self-attention mechanism efficiently. Self-attention has achieved excellent results in many natural language processing and computer vision tasks, but it struggles with long sequences because its computational complexity grows quadratically with the sequence length. To address this, Google AI introduced Performer, a Transformer architecture whose attention mechanism scales linearly. The framework is implemented through the FAVOR+ algorithm (Fast Attention Via Positive Orthogonal Random Features), which provides scalable, low-variance, unbiased estimates of attention mechanisms that can be expressed via random feature map decompositions (in particular, conventional softmax attention). This mapping maintains linear space and time complexity.

Softmax orthogonal random feature decomposition
The core idea of Performer is to replace the traditional full self-attention matrix with a low-rank approximation, thereby reducing computational complexity. Specifically, Performer uses the following key techniques:

  1. Fast Attention: In the traditional self-attention mechanism, attention weights must be computed between all pairs of positions, which gives a computational complexity of $O(n^2)$, where $n$ is the sequence length. Performer reduces this to $O(n)$ by introducing a fixed random projection matrix, so the time complexity of the self-attention computation grows linearly with the sequence length rather than quadratically.
  2. Orthogonal Random Features: Performer uses orthogonal random features, which reduce the variance of the approximation for the same number of random features, improving accuracy while maintaining model performance; orthogonality makes the random projection matrix more effective (a minimal sketch of drawing such orthogonal projections appears after this overview).
  3. Memory Efficient: Performer also has a memory-efficient variant that can handle long sequences without hitting memory limits; this is achieved by computing the self-attention in blocks.
  4. Favorable Asymptotics: Compared to standard self-attention, Performer has better asymptotic computational complexity as the sequence length increases, which gives it a clear advantage when processing long sequences.

Overall, the Performer algorithm significantly improves the computational efficiency of self-attention models by introducing random features and low-rank approximation , as well as some other techniques, allowing them to be applied to longer sequences without sacrificing model performance . This gives it broad potential for applications in natural language processing and other fields.
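As a rough illustration of the orthogonal random features mentioned in point 2 above, the sketch below draws approximately orthogonal Gaussian projection vectors by orthogonalizing blocks of a Gaussian matrix with a QR decomposition and rescaling the row norms. This is only a minimal sketch of the idea under those assumptions (the function name is ours), not the exact routine of any particular Performer implementation.

import torch

def orthogonal_gaussian_features(m: int, d: int) -> torch.Tensor:
    """Draw m (approximately) orthogonal random projection vectors in R^d.

    Rows are orthogonalized block-by-block via QR, then rescaled so each row's
    norm is distributed like the norm of a d-dimensional Gaussian vector.
    A minimal sketch of the orthogonal random features idea.
    """
    blocks = []
    remaining = m
    while remaining > 0:
        g = torch.randn(d, d)
        q, _ = torch.linalg.qr(g)              # q has orthonormal columns
        blocks.append(q.T[:min(remaining, d)]) # take orthonormal rows
        remaining -= d
    w = torch.cat(blocks, dim=0)               # (m, d), orthonormal within each block
    # rescale rows so their norms match those of unstructured Gaussian vectors
    norms = torch.randn(m, d).norm(dim=1, keepdim=True)
    return w * norms

W = orthogonal_gaussian_features(m=256, d=64)  # random projection matrix
print(W.shape)  # torch.Size([256, 64])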

Performer paper interpretation

Rethinking Attention with Performers
Paper abstract: We introduce Performer, a Transformer architecture that can estimate regular (softmax) full-rank attention accurately, but using only linear (rather than quadratic) space and time complexity and without relying on any priors such as sparsity or low-rankness. To approximate the softmax attention kernel, Performers use a novel Fast Attention Via positive Orthogonal Random features method (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial for accurately comparing softmax with other kernels on large-scale tasks for the first time, beyond the reach of conventional Transformers, and for studying optimal attention kernels. Performers are linear architectures fully compatible with regular Transformers, with strong theoretical guarantees: unbiased or nearly unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. We tested Performers on a rich set of tasks, from pixel prediction to text modeling to protein sequence modeling, achieving competitive results that outperform other proven efficient sparse and dense attention methods, demonstrating the effectiveness of the novel attention learning paradigm used by Performers.

In plain words: Performer is a Transformer architecture whose attention mechanism scales linearly. On one hand this lets the model train faster, and on the other hand it lets the model process longer input sequences. This is certainly great for certain image datasets (such as ImageNet64) and text datasets (such as PG-19). Performer uses an efficient (linear) general attention framework in which various attention mechanisms can be implemented using different similarity measures (i.e., various kernel methods). The framework is implemented by the FAVOR+ (Fast Attention Via Positive Orthogonal Random Features) algorithm, which provides a scalable, low-variance, unbiased estimate of attention mechanisms that can be expressed by random feature map decompositions. This method ensures linear space and time complexity on the one hand, and accuracy on the other. In addition, the method can be used on its own for softmax operations, and can also be combined with other techniques such as reversible layers.


For applications that require long-range attention, researchers have proposed a number of fast and space-efficient methods, the most common of which is sparse attention.
Figure: Standard sparsification techniques

However, sparse attention methods also have some limitations. First, they require efficient sparse matrix multiplication operations, which not all accelerators support; second, they usually cannot provide strict theoretical guarantees for their representational power; third, they are mainly optimized for Transformer models and generative pre-training; finally, they usually stack more attention layers to compensate for the sparse representation, which makes them difficult to use with other pre-trained models, requires retraining, and consumes a lot of energy.

Furthermore, sparse attention mechanisms are often insufficient to solve all the problems faced when applying conventional attention methods, such as pointer networks. There are also some operations that cannot be sparsified, such as the commonly used softmax operation.


Regular Attention Mechanism

In the conventional attention mechanism, the query and key matrices are multiplied (rows against columns), and the attention score matrix is then computed through a softmax. The formula is as follows:
$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$
where $Q, K, V$ (each of dimension $L \times d$) are the query, key and value matrices respectively, $L$ is the sequence length, and $d$ is the (arbitrary) dimension of the query, key and value vectors. The problem with the Transformer comes from the softmax function; let's see why.
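As a baseline, here is a minimal PyTorch sketch of this standard (quadratic) attention computation; the function name is ours, and batching and multiple heads are omitted:

import torch
import torch.nn.functional as F

def standard_attention(Q, K, V):
    """Regular softmax attention: O(L^2 * d) time, O(L^2) memory for the scores."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (L, L) attention logits
    A = F.softmax(scores, dim=-1)                 # row-normalized attention matrix
    return A @ V                                  # (L, d) output

L, d = 1024, 64
Q, K, V = (torch.randn(L, d) for _ in range(3))
out = standard_attention(Q, K, V)   # shape (L, d)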

The softmax operation cannot simply be pulled apart so that the query-key product is recovered after the nonlinearity; however, the attention matrix can be decomposed into a product of random nonlinear functions of the original queries and keys, the so-called random features, so that the similarity information can be encoded more efficiently. The standard attention matrix contains the similarity coefficient of every pair of entries, computed via a softmax over the queries and keys, denoted q and k.
Figure: Attention matrix factorization

Regular softmax attention can be viewed as a special case, with nonlinear functions defined by exponential functions and Gaussian projections. We can also reason in reverse: first implement more generalized nonlinear functions that implicitly define other types of similarity measures or kernels on the query-key products. Drawing on earlier kernel methods, the researchers call this generalized attention. Although closed-form solutions do not exist for most kernel functions, this mechanism can still be applied because it does not rely on them.

The paper shows for the first time that, in downstream Transformer applications, any attention matrix can be effectively approximated with random features. The new mechanism that achieves this uses positive random features, i.e., positive-valued nonlinear functions of the original queries and keys, which avoids instability during training and yields a more accurate approximation of regular softmax attention.

FAVOR+: Fast attention via matrix associativity

With the above decomposition, an implicit attention matrix with linear (rather than quadratic) space complexity can be obtained. A linear-time attention mechanism follows as well. Originally, the attention matrix is multiplied with the value input to get the final result; after decomposing the attention matrix, however, the matrix multiplications can be rearranged (using associativity) to approximate the result of the regular attention mechanism without ever explicitly constructing the quadratic-size attention matrix. This leads to the new algorithm, FAVOR+.
Figure: Approximating the regular attention mechanism $AV$ via (random) feature maps (before the $D^{-1}$ normalization). Dashed blocks indicate the order of computation, with the corresponding time complexity. Left: the standard attention module, where the final result is computed by multiplying the attention matrix $A$ with the value tensor $V$. Right: by decoupling the matrices $Q'$ and $K'$ used in the low-rank decomposition of $A$ and performing the matrix multiplications in the order indicated by the dashed box, a linear attention mechanism is obtained without ever explicitly constructing $A$ or its approximation.
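Assuming some feature map has already turned the queries and keys into $Q'$ and $K'$ of shape $L \times m$, a minimal sketch of the reordered computation (including the $D^{-1}$ normalization discussed later) looks like this; names are ours and batching is omitted:

import torch

def linear_attention(Q_prime, K_prime, V):
    """Bidirectional linear attention given feature-mapped queries/keys.

    Q_prime: (L, m), K_prime: (L, m), V: (L, d).
    Computes D^{-1} (Q' ((K')^T V)) without building the L x L matrix.
    """
    KV = K_prime.T @ V                        # (m, d): O(L * m * d)
    numerator = Q_prime @ KV                  # (L, d): O(L * m * d)
    normalizer = Q_prime @ K_prime.sum(0)     # (L,): row sums of Q' K'^T
    return numerator / normalizer.unsqueeze(-1)

L, m, d = 1024, 256, 64
Qp, Kp = torch.rand(L, m), torch.rand(L, m)   # stand-ins for feature-mapped Q, K
V = torch.randn(L, d)
out = linear_attention(Qp, Kp, V)             # shape (L, d)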

The above analysis applies to bidirectional (non-causal) attention, which does not distinguish between past and future. How, then, do we get unidirectional (causal) attention, where each token attends only to the part of the sequence before it? Use a prefix-sum computation: only running totals of the matrix computations are stored during the process, instead of the complete lower-triangular regular attention matrix.
Figure: Visual representation of the prefix-sum algorithm for unidirectional attention. For clarity, attention normalization is omitted in this visualization. The algorithm keeps a prefix sum: a matrix obtained by summing the outer products of the random features corresponding to the keys with the value vectors. At each iteration, the random feature vector corresponding to the current query is multiplied by the most recent prefix sum (the sum of the outer products for all previous tokens), producing a new row of the matrix $AV$ output by the attention mechanism. In other words, the prefix sum is built up dynamically from outer products of the key feature maps and the value vectors, and left-multiplying it by the query feature vector yields a new row of the final matrix. The matrix $A$ on the left indicates that standard unidirectional attention requires masking the attention matrix to obtain its lower-triangular part.
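A minimal sketch of this prefix-sum idea, again assuming feature-mapped queries and keys and ignoring batching (the explicit Python loop is kept for clarity; real implementations vectorize or chunk it):

import torch

def causal_linear_attention(Q_prime, K_prime, V):
    """Unidirectional (causal) linear attention via prefix sums.

    Q_prime, K_prime: (L, m) feature-mapped queries/keys; V: (L, d) values.
    Keeps a running sum of outer products k'_j v_j^T instead of the L x L matrix.
    """
    L, m = Q_prime.shape
    d = V.shape[-1]
    S = torch.zeros(m, d)      # running sum of outer products (the prefix sum)
    z = torch.zeros(m)         # running sum of key features (for normalization)
    out = torch.zeros(L, d)
    for i in range(L):
        S = S + torch.outer(K_prime[i], V[i])
        z = z + K_prime[i]
        out[i] = (Q_prime[i] @ S) / (Q_prime[i] @ z)
    return out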

Time complexity of Attention

To review, multiplying two matrices of dimensions $n \times m$ and $m \times p$ has time complexity $O(nmp)$. Looking at the attention equation, we are multiplying three matrices: $Q$ (dimension $L \times d$), $K^T$ (dimension $d \times L$) and $V$ (dimension $L \times d$). Depending on the order in which we multiply them, we get different complexities. Ignoring the softmax and the denominator $\sqrt{d}$ (which is just a scalar), we see that computing $QK^T$ first gives $O(L^2 d)$ complexity, whereas computing $K^T V$ first gives $O(d^2 L)$ complexity.
Figure: Time complexity comparison of the two multiplication orders
Obviously, we should choose $O(d^2 L)$, because $d$ is a parameter we can choose and we can have $d < L$. However, we cannot actually do the multiplications in that order, because $QK^T$ is "stuck" inside the softmax and there is no easy way to get it out. This means we are forced into $O(L^2 d)$, which is quadratic in the sequence length (so processing longer sequences becomes increasingly expensive). The softmax is therefore the bottleneck of Transformers, and we would like to find a way around it.
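To make the gap concrete, here is a rough multiply-accumulate count for some illustrative sizes (constants and the softmax itself are ignored):

# Rough multiply-accumulate counts for the two multiplication orders.
L, d = 4096, 64                 # example sequence length and head dimension
quadratic = L * L * d           # (Q K^T) first: O(L^2 d)
linear_in_L = 2 * L * d * d     # K^T V first, then Q (K^T V): O(d^2 L)
print(f"(QK^T)V  ~ {quadratic:,} ops")               # ~1,073,741,824
print(f"Q(K^T V) ~ {linear_in_L:,} ops")             # ~33,554,432
print(f"ratio    ~ {quadratic / linear_in_L:.0f}x")  # ~32x for these sizes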

Bypass the softmax bottleneck

At a high level, the approach proposed in the paper is very simple: can we find a way to approximate the softmax that lets us choose the order in which the matrices are multiplied? Essentially, we want to find matrices $Q'$ and $K'$ satisfying $Q'K' \approx \text{softmax}(QK^T/\sqrt{d})$. The goal is simple, but the details of how to achieve it are a little more complicated.

First, recall that the softmax normalizes all elements $z_i$ of a vector $\mathbf{z}$ of length $n$: $\sigma(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{j=1}^{n}e^{z_j}}$. Given this, note that we can rewrite the softmax in the attention equation as $\text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)=D^{-1}A$, where $A=\exp(QK^T/\sqrt{d})$, $D=\text{diag}(A\mathbf{1}_L)$, $\text{diag}(\cdot)$ turns an input vector into a diagonal matrix, and $\mathbf{1}_L$ is the all-ones vector of length $L$. This is equation (1) in the paper.
Here the exponential in $A$ is applied element-wise, and $D$ is the diagonal matrix whose diagonal is $A\mathbf{1}_L$; $\text{diag}(A\mathbf{1}_L)$ collects the sums, and $D^{-1}$ turns those sums into reciprocals, so $D^{-1}A$ is equivalent to $\text{softmax}(QK^T/\sqrt{d})$. In fact, $A\mathbf{1}_L$ is just a length-$L$ vector containing the row sums of $A$ (summing across the columns of each row).
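A quick numerical check of this identity (the sizes are arbitrary):

import torch
import torch.nn.functional as F

L, d = 8, 4
Q, K = torch.randn(L, d), torch.randn(L, d)

A = torch.exp(Q @ K.T / d ** 0.5)               # unnormalized attention matrix
D_inv = torch.diag(1.0 / (A @ torch.ones(L)))   # D^{-1} built from row sums of A

lhs = D_inv @ A
rhs = F.softmax(Q @ K.T / d ** 0.5, dim=-1)
print(torch.allclose(lhs, rhs, atol=1e-6))      # True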

Note: the element-wise exponential in $A$ is the real problem here, so our goal is to factor it out somehow. We can ignore the scalar denominator $\sqrt{d}$ because it is only used for scaling; we can equivalently normalize the queries and keys. This means that our goal is to find some $Q'$ and $K'$ satisfying $Q'K' \approx \exp(QK^T)$.

Finding the softmax kernel through the Gaussian kernel

In the formula above, $A$ is obtained by multiplying two matrices, dividing by a constant, and applying an element-wise exponential, which yields the attention matrix. Kernel methods are introduced here to approximate this attention matrix $A$; this is where they come into play.

The specific method is as follows:
We know that a kernel is equivalent to the dot product of some feature map $\varphi$. For any vectors $q_i$ and $k_j$ taken from $Q$ and $K$, their raw similarity (ignoring normalization) is $\exp(q_i k_j^T)$. With kernel methods, the computation takes the form $K(x,y)=\mathbb{E}[\phi(x)^T\phi(y)]$. Usually, given some high-dimensional feature map $\varphi$, we are interested in finding an equivalent kernel $K$ that lets us avoid computing in the high-dimensional space of $\varphi$. In our case, however, we actually go the opposite way: if we assume $A$ is a kernel matrix with entries $A(i,j)=K(q_i,k_j)=\exp(q_i k_j^T)$ (where $q_i$ and $k_j$ are row vectors of $Q$ and $K$), can we find a feature map $\varphi$ that helps us decompose $A$? That is,
$$\mathbf{A}(i,j)=K(q_i,k_j)=\exp(q_i k_j^T)=\phi(q_i)^T\phi(k_j)$$

Now, most kernels can be approximated by a feature map $\varphi$ of the following form:
$$\phi(\mathbf{x})=\frac{h(\mathbf{x})}{\sqrt{m}}\left(f_1(w_1^T\mathbf{x}),...,f_1(w_m^T\mathbf{x}),...,f_l(w_1^T\mathbf{x}),...,f_l(w_m^T\mathbf{x})\right)$$
where $h$ and $f_1, ..., f_l$ are deterministic functions, and $w_1, ..., w_m$ are sampled i.i.d. from a distribution $\mathcal{D}$. Therefore $\varphi(x)$ is a vector with $l \times m$ elements.

  • When $h(x)=1$, $l=1$ and $\mathcal{D}=\mathcal{N}(0,\mathbf{I}_d)$, the kernel is a so-called PNG-kernel.
  • When $h(x)=1$, $l=2$, $f_1=\sin$ and $f_2=\cos$, the kernel is shift-invariant; if in addition $\mathcal{D}=\mathcal{N}(0,\mathbf{I}_d)$, it is the Gaussian kernel.

That is, if we draw $w$ from a normal distribution with mean 0 and unit variance, we can obtain the Gaussian kernel by using the feature map:
$$\phi(\mathbf{x})_{\text{gauss}}=\frac{1}{\sqrt{m}}\left(\sin(w_1^T\mathbf{x}),...,\sin(w_m^T\mathbf{x}),\cos(w_1^T\mathbf{x}),...,\cos(w_m^T\mathbf{x})\right)$$
Note that the Gaussian kernel with unit variance is given by:
$$\mathbf{K}_{\text{gauss}}(\mathbf{x},\mathbf{y})=\exp\left(-\frac{||\mathbf{x}-\mathbf{y}||^2}{2}\right)$$
Now remember that we want to find the softmax kernel:
$$\mathbf{K}_{SM}(\mathbf{x},\mathbf{y})=\exp(\mathbf{x}^T\mathbf{y})$$
We can see that the structure of the softmax kernel is not too far from the Gaussian kernel, and it turns out we can exploit this similarity to find the softmax kernel. In fact, note that
$$\exp(\mathbf{x}^T\mathbf{y})=\exp\left(\frac{||\mathbf{x}||^2}{2}\right)\exp\left(-\frac{||\mathbf{x}-\mathbf{y}||^2}{2}\right)\exp\left(\frac{||\mathbf{y}||^2}{2}\right)$$
This means that we can actually rewrite the softmax kernel as:
$$\mathbf{K}_{SM}(\mathbf{x},\mathbf{y})=\exp\left(\frac{||\mathbf{x}||^2}{2}\right)\mathbf{K}_{\text{gauss}}(\mathbf{x},\mathbf{y})\exp\left(\frac{||\mathbf{y}||^2}{2}\right)$$
And we can do this by changing the $h$ function from $h(x)=1$ to the following form, reusing the feature map that leads to the Gaussian kernel:
$$h(\mathbf{x})=\exp\left(\frac{||\mathbf{x}||^2}{2}\right)$$
This is a good approximation, but it has some problems. The softmax function always outputs positive values, so all elements of $\mathbf{A}$ should be positive. However, using this kernel to approximate the softmax may produce negative values: since we draw $w$ from a zero-mean normal distribution, some of the sine/cosine features take negative values, which in turn means some estimated entries of $\mathbf{A}$ can be negative. This can cause problems and unexpected behavior.
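For concreteness, here is a small sketch of this trigonometric (sin/cos) random feature map and how closely it matches the exact Gaussian kernel; the names are ours, and small input norms are used to keep the estimator's variance low. This is the estimator that can go negative when reused for the softmax kernel.

import torch

def phi_gauss(x, W):
    """Trigonometric random features for the unit-variance Gaussian kernel.
    x: (d,), W: (m, d) with rows drawn from N(0, I_d)."""
    m = W.shape[0]
    wx = W @ x
    return torch.cat([torch.sin(wx), torch.cos(wx)]) / m ** 0.5

d, m = 16, 10000
W = torch.randn(m, d)
x, y = torch.randn(d) * 0.3, torch.randn(d) * 0.3

approx = phi_gauss(x, W) @ phi_gauss(y, W)
exact = torch.exp(-0.5 * (x - y).norm() ** 2)
print(float(approx), float(exact))  # the two values should be close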

Looking for a more stable Softmax kernel

The researchers found that the softmax kernel can also be rewritten as:
$$\mathbf{K}_{SM}(\mathbf{x},\mathbf{y})=\mathbb{E}_{w\sim\mathcal{N}(0,\mathbf{I}_d)}\left[\exp\left(w^T\mathbf{x}-\frac{||\mathbf{x}||^2}{2}\right)\exp\left(w^T\mathbf{y}-\frac{||\mathbf{y}||^2}{2}\right)\right]$$
(A proof that this is indeed the softmax kernel can be found in the appendix of the paper.) Therefore, we can simply take the previous feature map form and set $h(\mathbf{x})=\exp\left(-\frac{||\mathbf{x}||^2}{2}\right)$, $l=1$, $f_1=\exp$, $\mathcal{D}=\mathcal{N}(0,\mathbf{I}_d)$ to obtain
$$\phi(\mathbf{x})_{SM}=\frac{1}{\sqrt{m}}\exp\left(-\frac{||\mathbf{x}||^2}{2}\right)\left(\exp(w_1^T\mathbf{x}),...,\exp(w_m^T\mathbf{x})\right)$$
By doing this we can see that all the values are positive, since we are using $\exp$, which solves the previous problem. The authors also propose an alternative feature map that leads to the same kernel; you can read the original paper if interested.
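Here is a minimal sketch of this positive feature map and a check that it approximates $\exp(\mathbf{x}^T\mathbf{y})$; the names are ours, and a practical implementation would also subtract a running maximum inside the exponential for numerical stability:

import torch

def phi_sm(x, W):
    """Positive random features for the softmax kernel exp(x^T y).
    x: (d,), W: (m, d) with rows drawn from N(0, I_d)."""
    m = W.shape[0]
    return torch.exp(W @ x - 0.5 * x.norm() ** 2) / m ** 0.5

d, m = 16, 10000
W = torch.randn(m, d)
x, y = torch.randn(d) * 0.3, torch.randn(d) * 0.3

approx = phi_sm(x, W) @ phi_sm(y, W)
exact = torch.exp(x @ y)
print(float(approx), float(exact))  # the two values should be close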

Description of the above content in the paper:
Figure: Positive random features for the softmax kernel (excerpt from the paper)

Find Q' and K' using the softmax kernel

Let's review. We started with the attention equation
$$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)\mathbf{V}$$
and found that we can rewrite it as $D^{-1}A\mathbf{V}$ with $A=\exp(QK^T/\sqrt{d})$. Then we found a feature map for the softmax kernel that can be used to approximate the matrix $\mathbf{A}$:
$$\phi(\mathbf{x})_{SM}=\frac{1}{\sqrt{m}}\exp\left(-\frac{||\mathbf{x}||^2}{2}\right)\left(\exp(w_1^T\mathbf{x}),...,\exp(w_m^T\mathbf{x})\right)$$
Therefore, we can now replace the elements of $\mathbf{A}$ with this feature map:
$$\mathbf{A}(i,j)=\mathbf{K}(\mathbf{q}_i,\mathbf{k}_j)=\exp(\mathbf{q}_i\mathbf{k}_j^T)=\phi_{SM}(\mathbf{q}_i)^T\phi_{SM}(\mathbf{k}_j)$$
Note that we go from vectors $\mathbf{q}_i$ and $\mathbf{k}_j$ of length $d$ to vectors $\phi_{SM}(\mathbf{q}_i)$ and $\phi_{SM}(\mathbf{k}_j)$ of length $m$.

We can now decompose $\mathbf{A}$ into $Q'$ and $K'$, whose rows are $\phi_{SM}(\mathbf{q}_i)$ and $\phi_{SM}(\mathbf{k}_j)$ respectively.

Finally, we are free to change the order of the matrix multiplications, reducing the time complexity from $O(L^2 d)$ to $O(Lmd)$ and thereby achieving linear rather than quadratic complexity in the sequence length.
Figure: Time complexity reduced from $O(L^2 d)$ to $O(Lmd)$ by reordering the matrix multiplications
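Putting the pieces together, here is a compact end-to-end sketch that compares the approximation against exact softmax attention. It uses plain i.i.d. Gaussian projections rather than the orthogonal, periodically redrawn ones used in the actual Performer, so it is only an illustration; all names are ours.

import torch
import torch.nn.functional as F

def performer_style_attention(Q, K, V, m=256):
    """Approximate softmax attention in O(L*m*d) using positive random features.
    A sketch: i.i.d. Gaussian projections, no orthogonalization or redrawing."""
    L, d = Q.shape
    W = torch.randn(m, d)
    # scale queries/keys by d^{-1/4} so that q^T k / sqrt(d) = (q')^T (k')
    scale = d ** -0.25
    def phi(X):
        Xs = X * scale
        return torch.exp(Xs @ W.T - 0.5 * Xs.norm(dim=-1, keepdim=True) ** 2) / m ** 0.5
    Qp, Kp = phi(Q), phi(K)            # (L, m) feature-mapped queries and keys
    num = Qp @ (Kp.T @ V)              # (L, d), never forms the L x L matrix
    den = Qp @ Kp.sum(0)               # (L,), the D^{-1} normalization
    return num / den.unsqueeze(-1)

L, d = 512, 64
Q, K, V = (torch.randn(L, d) * 0.1 for _ in range(3))
exact = F.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V
approx = performer_style_attention(Q, K, V, m=1024)
print((exact - approx).abs().max())    # the approximation error should be small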

Summary

Essentially, in this paper the authors found a way to approximate the softmax function with a dot product of feature maps, i.e., the exponential is factored out and approximated. Because of this, the time complexity of computing attention in Transformers can be reduced from quadratic to linear in the sequence length, which significantly speeds up Transformers on long sequences. To keep the approximation accurate and well-behaved, techniques such as kernel methods, positive random features, and orthogonal random features are used.

In addition, it should be noted that:

  • Although this method was developed with transformers in mind, it can be applied to virtually any model that requires softmax .
  • The authors note that this approach is not only faster but also more memory-efficient, which can be seen by looking at the dimensions of the matrices that need to be stored.

Code availability

A PyTorch implementation of Performer is available at https://libraries.io/pypi/performer-pytorch and can be installed as follows:

pip install performer-pytorch==1.1.4

Usage example:

import torch
from performer_pytorch import PerformerLM

model = PerformerLM(
    num_tokens = 20000,
    max_seq_len = 2048,             # max sequence length
    dim = 512,                      # dimension
    depth = 12,                     # layers
    heads = 8,                      # heads
    causal = False,                 # auto-regressive or not
    nb_features = 256,              # number of random features, if not set, will default to (d * log(d)), where d is the dimension of each head
    feature_redraw_interval = 1000, # how frequently to redraw the projection matrix, the more frequent, the slower the training
    generalized_attention = False,  # defaults to softmax approximation, but can be set to True for generalized attention
    kernel_fn = torch.nn.ReLU(),    # the kernel function to be used, if generalized attention is turned on, defaults to Relu
    reversible = True,              # reversible layers, from Reformer paper
    ff_chunks = 10,                 # chunk feedforward layer, from Reformer paper
    use_scalenorm = False,          # use scale norm, from 'Transformers without Tears' paper
    use_rezero = False,             # use rezero, from 'Rezero is all you need' paper
    ff_glu = True,                  # use GLU variant for feedforward
    emb_dropout = 0.1,              # embedding dropout
    ff_dropout = 0.1,               # feedforward dropout
    attn_dropout = 0.1,             # post-attn dropout
    local_attn_heads = 4,           # 4 heads are local attention, 4 others are global performers
    local_window_size = 256,        # window size of local attention
    rotary_position_emb = True,     # use rotary positional embedding, which endows linear attention with relative positional encoding with no learned parameters. should always be turned on unless if you want to go back to old absolute positional encoding
    shift_tokens = True             # shift tokens by 1 along sequence dimension before each block, for better convergence
)

x = torch.randint(0, 20000, (1, 2048))
mask = torch.ones_like(x).bool()

model(x, mask = mask) # (1, 2048, 20000)

Reference link

  1. Google AI Introduces Performer: A Generalized Attention Framework based on the Transformer architecture
  2. Performer - Pytorch
  3. From Transformers to Performers: Approximating Attention
  4. Rethinking Attention with Performers
  5. Random Features for Large-Scale Kernel Machines
  6. Some notes about Performer

Origin blog.csdn.net/ARPOSPF/article/details/132710212