ICLR2022《Compositional Attention: Disentangling Search and Retrieval》

Paper link: https://arxiv.53yu.com/pdf/2110.09419.pdf
Code link: https://github.com/sarthmit/Compositional-Attention
(Figure from the paper "Attention Is All You Need".)

  • Standard multi-head self-attention
    1) Key-Value Attention
    Given a set of queries and key-value pairs, key-value attention computes a scaled dot-product similarity between each query and every key. These similarity scores determine the weight (contribution) of each value in the output for the corresponding query.
    Given an input $X \in R^{N \times d}$, the queries, keys, and values are obtained through separate linear transformations:
    $Q = XW_q, \quad K = XW_k, \quad V = XW_v$
    For each query, the scaled dot product with every key gives a similarity score; a softmax over these scores yields the soft attention weights used to combine the values:
    $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
    2) Multi-head attention
    Multi-head attention runs several (say $h$) independent key-value attention mechanisms in parallel, giving the model the ability to jointly attend to information at different positions and improving its representational capacity. The outputs of the heads are concatenated and then linearly projected back to the input dimension with a learnable matrix:
    $\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^o$
    where each $\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$, with $Q_i$, $K_i$, $V_i$ obtained from head-specific projection matrices.
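
To make the formulas above concrete, here is a minimal PyTorch sketch of standard multi-head self-attention (an illustration, not the paper's code; names such as `MultiHeadSelfAttention`, `d_model`, and `n_heads` are chosen here for clarity). Note that head `i`'s attention weights are only ever applied to head `i`'s value projection, which is the rigid search-retrieval pairing discussed in the sections below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Standard multi-head self-attention: head i always pairs
    its own (W_q_i, W_k_i) search with its own W_v_i retrieval."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One big projection per role; reshaping splits it into per-head matrices.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model)
        B, N, _ = x.shape
        def split(t):  # (B, N, d_model) -> (B, n_heads, N, d_head)
            return t.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Search: scaled dot-product compatibility between positions.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        # Retrieval: head i's attention only ever reads head i's values.
        out = attn @ v                               # (B, n_heads, N, d_head)
        out = out.transpose(1, 2).reshape(B, N, -1)  # concatenate the heads
        return self.w_o(out)                         # project back to d_model

x = torch.randn(2, 5, 64)                       # batch of 2, 5 tokens, d_model = 64
print(MultiHeadSelfAttention(64, 8)(x).shape)   # torch.Size([2, 5, 64])
```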

1. Motivation

It can be seen from the above that, in standard multi-head self-attention, each head learns a rigid mapping between its query-key projection (search) and its value projection (retrieval). This causes two problems:
1) it leads to redundant parameters being learned on certain tasks;
2) it hinders generalization.

2. Method

To address these problems, the paper proposes Compositional Attention as a replacement for the standard head structure, in which search and retrieval operations can be flexibly composed: a query-key search mechanism is no longer bound to a fixed value (retrieval) matrix, but instead dynamically selects a value matrix from a shared pool accessible to all searches.

  • Search and Retrieval components
    1) Search
    A search is parameterized by a query matrix and a key matrix, $W_q$ and $W_k$ respectively. These parameters define a compatibility measure between pairs of elements $x_j, x_k \in X$:
    $\mathrm{Search}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \qquad (4)$
    where $Q = XW_q$ and $K = XW_k$. This computation gives, for each element $x_j$, its compatibility with every other element $x_k$ under the compatibility measure defined by the search parameters.
    2) Retrieval
    A retrieval is parameterized by a value matrix $W_v$, which describes the kinds of features of the input elements of $X$ that are relevant to the downstream task and need to be accessed:
    $\mathrm{Retrieval}(\mathrm{Search}(X)) = \mathrm{Search}(X)\, V \qquad (5)$
    where $V = XW_v$. Note that each retrieval defines the type of attribute accessed from an input element $x_k$, and it can take any search result as its input.
    3) Multi-head attention as a rigid pairing of search and retrieval.
    With the above definitions, standard multi-head attention can be seen as a rigid pairing of search and retrieval, so that fixed search-retrieval pairs are learned end to end during optimization. Concretely, $h$ heads consist of $h$ distinct search-retrieval pairs, and the $i$-th retrieval is only ever applied to the $i$-th search. Multi-head attention is therefore a special case of Equations 4 and 5:
    $\mathrm{head}_i = \mathrm{Retrieval}_i(\mathrm{Search}_i(X)) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$
  • Disadvantages of Rigid Correlation
    The claim is that such a rigid mapping is not always ideal: it can waste capacity on redundant parameters and forgo the opportunity for better systematic generalization. Note that the search associated with each head defines a feature (determined by the query-key matrices $W_q$ and $W_k$) on which compatibility between objects is computed. The retrieval of each head then lets the model access specific features of the searched objects (determined by the value matrix $W_v$). From this, two types of redundancy can arise:
    (a) search redundancy, which leads to the learning of redundant query-key matrices;
    (b) retrieval redundancy, which leads to the learning of redundant value matrices.
    (Figure 1 from the paper: three objects with shape, color, and location attributes, queried in different ways.)
    These two redundancies are jointly illustrated by the simple example in Figure 1 above, where three objects with attributes (shape, color, and location) are the subject of different questions. In (a), the model must learn to search by color and retrieve shape information; in (b), it must learn to search by shape and retrieve location information. On this task, standard multi-head attention (middle row) would learn two heads, one for (a) and one for (b). To answer the question in (c), the model must search by color and retrieve location. Head 1 (from (a)) already knows how to search by color and head 2 (from (b)) already knows how to retrieve location, but there is no way to combine them; a further head is needed that pairs head 1's search with head 2's retrieval. This yields redundant parameters and misses the opportunity to decompose knowledge more efficiently, since the required knowledge already exists separately in heads 1 and 2.
    The scenario in Figure 1 is somewhat idealized, since multi-head attention need not restrict each search/retrieval to a single feature and can form finer-grained soft combinations. While that may hold for simple examples, it highlights the danger of rigidly learned associations: whatever the model learns, they limit the recombination of pieces of learned knowledge, lead to redundant parameters, and may limit out-of-distribution (OoD) generalization. In what follows, this fundamental limitation is alleviated by allowing $S \times R$ pairings, where $S$ is the number of search types and $R$ is the number of retrieval types (a minimal code sketch of the rigid pairing and the combination it cannot express is given right after this list item).
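
To make the rigid-pairing argument concrete, here is a small illustrative sketch (not the paper's implementation; the function names `search` and `retrieve` and the shapes are chosen only for illustration) that rewrites attention as explicit search and retrieval steps. Standard multi-head attention only ever pairs search $i$ with retrieval $i$, so the "search by color" of one head cannot be reused with the "retrieve location" of another without learning an extra head.

```python
import torch

def search(x, w_q, w_k):
    """Compatibility (attention) matrix for one search, as in Eq. (4)."""
    q, k = x @ w_q, x @ w_k
    return torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)

def retrieve(attn, x, w_v):
    """Read out one value projection under a given search, as in Eq. (5)."""
    return attn @ (x @ w_v)

N, d, d_k, d_v, h = 5, 64, 16, 16, 4
x = torch.randn(N, d)
W_q = [torch.randn(d, d_k) for _ in range(h)]
W_k = [torch.randn(d, d_k) for _ in range(h)]
W_v = [torch.randn(d, d_v) for _ in range(h)]

# Standard multi-head attention: a fixed one-to-one pairing of search i with retrieval i.
heads = [retrieve(search(x, W_q[i], W_k[i]), x, W_v[i]) for i in range(h)]

# What the rigid pairing cannot express without adding a new head:
# reuse head 0's search together with head 1's retrieval.
mixed = retrieve(search(x, W_q[0], W_k[0]), x, W_v[1])
```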
  • Compositional attention
    (Figure from the paper illustrating the Compositional Attention mechanism.)
    The paper proposes a new attention mechanism that relaxes static search-retrieval pairings in favor of a more flexible, dynamic mapping. To do this, the notion of a head is abandoned and replaced by independent, recombinable searches and retrievals, as defined above. The core innovation is the way these two components are combined: retrievals are selected by a query-key attention of their own.
    As with heads, first define $S$ parallel search mechanisms, i.e., $S$ different query-key parameterizations $W_{q_i}$, $W_{k_i}$. The output of each search is defined as in Equation 4; for each search $i$ we obtain
    $\mathrm{Search}_i(X) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right), \quad Q_i = XW_{q_i},\ K_i = XW_{k_i}$
    Next, $R$ different retrieval mechanisms are defined, corresponding to $R$ different matrices $W_{v_j}$. These matrices are used to extract different attributes from the input. Formally,
    $V_j = XW_{v_j}$
    where each $V_j$ gives access to a different attribute. Then, for every search, all candidate retrievals are computed. Analogously to Equation 5,
    $\mathrm{Retrieval}_j(\mathrm{Search}_i(X)) = \mathrm{Search}_i(X)\, V_j$
    This step gives all candidate retrievals for each search. One retrieval then has to be instantiated for each search. This is done by a secondary attention mechanism over retrieval queries $\overline{Q}_i$ and retrieval keys $\overline{K}_{ij}$, which are obtained as follows:
    $\overline{Q}_i = X\overline{W}_{q_i}, \qquad \overline{K}_{ij} = \big(\mathrm{Search}_i(X)\, V_j\big)\, \overline{W}_k$
    where the parameter $\overline{W}_{q_i} \in R^{d \times d_r}$ is a different matrix for each search $i$ and, together with $\overline{W}_k$, drives the pairing between searches and retrievals. The matrix $\overline{Q}_i \in R^{N \times d_r}$ is broadcast to $R^{N \times 1 \times d_r}$, and $\overline{K}_i \in R^{N \times R \times d_r}$ is defined by stacking the retrieval keys over the $R$ retrievals:
    $\overline{K}_i = \big[\overline{K}_{i1};\ \overline{K}_{i2};\ \dots;\ \overline{K}_{iR}\big]$
    With these retrieval queries and keys, the instantiation required for each search is computed as
    $O_i = \mathrm{softmax}\!\left(\frac{\overline{Q}_i\, \overline{K}_i^{\top}}{\sqrt{d_r}}\right)\big[\mathrm{Search}_i(X)V_1;\ \dots;\ \mathrm{Search}_i(X)V_R\big]$
    where the transpose is over the last two axes and the candidate retrievals are stacked along the same retrieval axis as $\overline{K}_i$. Thus, for each search $i$, the softmax assigns attention weights to all candidate retrievals, and the winning retrieval is instantiated through this soft attention. Finally, as in multi-head attention, the outputs of the parallel searches are combined by concatenating them and passing the result through a linear layer:
    $\mathrm{CompositionalAttention}(X) = \mathrm{Concat}\big(O_1, \dots, O_S\big)\, W^o$
    where $W^o \in R^{Sd_v \times d}$. Note that in this mechanism the retrieval chosen for each search is not fixed, but is adjusted dynamically by $\overline{Q}_i$ and $\overline{K}_i$. Figure 2 gives a visual depiction of the computational graph.
    Compositional Attention gives the model:
    (a) different numbers of searches and retrievals, $S$ and $R$ respectively;
    (b) dynamic selection of a retrieval from the shared pool for each search;
    (c) the ability to represent $S \times R$ (search, retrieval) pairs. Compositional Attention can therefore disentangle search from retrieval and remove the redundancy of multi-head attention. A minimal code sketch of the full mechanism is given below.
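
Putting the equations above together, the following is a minimal, simplified PyTorch sketch of the computation as described in this section. It is not the official implementation from the repository linked above, and names such as `n_searches`, `n_retrievals`, `d_k`, `d_v`, and `d_r` are illustrative hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalAttention(nn.Module):
    """Sketch of Compositional Attention: S searches share a pool of R retrievals,
    and a secondary attention picks a retrieval for each search."""
    def __init__(self, d: int, n_searches: int, n_retrievals: int,
                 d_k: int, d_v: int, d_r: int):
        super().__init__()
        S, R = n_searches, n_retrievals
        self.w_q = nn.Linear(d, S * d_k, bias=False)       # S query projections
        self.w_k = nn.Linear(d, S * d_k, bias=False)       # S key projections
        self.w_v = nn.Linear(d, R * d_v, bias=False)       # R shared value projections
        self.w_q_bar = nn.Linear(d, S * d_r, bias=False)   # retrieval queries (one per search)
        self.w_k_bar = nn.Linear(d_v, d_r, bias=False)     # retrieval keys (shared)
        self.w_o = nn.Linear(S * d_v, d, bias=False)
        self.S, self.R, self.d_k, self.d_v, self.d_r = S, R, d_k, d_v, d_r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        S, R = self.S, self.R
        q = self.w_q(x).view(B, N, S, self.d_k).transpose(1, 2)   # (B, S, N, d_k)
        k = self.w_k(x).view(B, N, S, self.d_k).transpose(1, 2)   # (B, S, N, d_k)
        v = self.w_v(x).view(B, N, R, self.d_v).transpose(1, 2)   # (B, R, N, d_v)

        # One compatibility matrix per search (Eq. 4 style).
        searches = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # (B, S, N, N)

        # Candidate retrievals: every search applied to every value matrix.
        # (B, S, 1, N, N) @ (B, 1, R, N, d_v) -> (B, S, R, N, d_v)
        cand = searches.unsqueeze(2) @ v.unsqueeze(1)

        # Secondary attention: retrieval queries (from x) vs. retrieval keys (from candidates).
        q_bar = self.w_q_bar(x).view(B, N, S, self.d_r).permute(0, 2, 1, 3)  # (B, S, N, d_r)
        k_bar = self.w_k_bar(cand)                                            # (B, S, R, N, d_r)
        scores = torch.einsum('bsnd,bsrnd->bsnr', q_bar, k_bar) / self.d_r ** 0.5
        w = F.softmax(scores, dim=-1)                                         # (B, S, N, R)

        # Soft-select one retrieval per search and position, then combine searches.
        out = torch.einsum('bsnr,bsrnd->bsnd', w, cand)                       # (B, S, N, d_v)
        out = out.permute(0, 2, 1, 3).reshape(B, N, S * self.d_v)
        return self.w_o(out)

x = torch.randn(2, 5, 64)
layer = CompositionalAttention(d=64, n_searches=4, n_retrievals=2, d_k=16, d_v=16, d_r=8)
print(layer(x).shape)  # torch.Size([2, 5, 64])
```

As a rough sanity check on the design: forcing each search to always pick its "own" retrieval (and setting `n_retrievals = n_searches`) would collapse this back to the rigid pairing of standard multi-head attention; the secondary attention is what lets any search use any retrieval from the shared pool.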

3. Some experimental results

  • Retrieval tasks
  • Relational reasoning
  • Equilateral triangle detection
  • Multi-task image classification
  • Logical reasoning on ESBN tasks
  • SCAN dataset
  • Language modeling
  (The results figures and tables for each task can be found in the paper.)

4. Conclusion

1) This work revisits multi-head attention, decomposes it into two steps, search and retrieval, and highlights the shortcomings caused by the rigid association between the search and retrieval mechanisms.
2) To alleviate the problem that this rigid coupling hinders parameter reuse and reduces the model's expressive power, a new mechanism is proposed that flexibly composes searches with value retrievals via a secondary attention.


Source: blog.csdn.net/weixin_43994864/article/details/123291539