How to understand Query, Key and Value in Transformer


The Transformer originated in the 2017 paper "Attention Is All You Need" from Google Brain, and has since sparked a new wave of research in both NLP and CV.

A very important contribution of the Transformer is self-attention: it builds an attention model from the relationships among the elements of the input sequence itself.

Self-attention introduces three very important elements: Query, Key and Value.

Suppose \mathbf{X} \in \mathbb{R}^{n \times d} is the feature matrix of an input sample sequence, where n is the number of input samples (the sequence length) and d is the dimension of a single sample.

Query, Key & Value are defined as follows:

Query: \mathbf{Q} = \mathbf{X} \cdot \mathbf{W}^Q, where \mathbf{W}^Q \in \mathbb{R}^{d \times d_q}; this matrix can be regarded as a learned space-transformation matrix, and likewise below

Key: \mathbf{K} = \mathbf{X} \cdot \mathbf{W}^K, where \mathbf{W}^K \in \mathbb{R}^{d \times d_k}

Value: \mathbf{V} = \mathbf{X} \cdot \mathbf{W}^V, where \mathbf{W}^V \in \mathbb{R}^{d \times d_v}
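As a minimal NumPy sketch of these three definitions (the dimensions n, d, d_q, d_k, d_v below are arbitrary illustrative values, and the random matrices stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 4, 8          # sequence length, input feature dimension
d_q = d_k = 6        # query/key dimension (they must match for the dot product later)
d_v = 5              # value dimension

X = rng.standard_normal((n, d))        # each ROW is one input sample

W_Q = rng.standard_normal((d, d_q))    # space-transformation (projection) matrices
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_v))

Q = X @ W_Q   # (n, d_q): one query per row
K = X @ W_K   # (n, d_k): one key per row
V = X @ W_V   # (n, d_v): one value per row

print(Q.shape, K.shape, V.shape)
```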

For many people, these three concepts are confusing at first sight. What is their relationship to self-attention, and why are they given these names?

[Note: each row of \mathbf{X}, \mathbf{Q}, \mathbf{K} and \mathbf{V} represents one input sample. This differs from the common convention in which each column of a sample matrix is a sample, and it is important for understanding the rest of this post.]

This post briefly explains the reasoning behind these three names.

To understand what these three concepts mean, we must first understand what self-attention is ultimately trying to compute.

The answer is: given the current input sample \mathbf{x}_i \in \mathbb{R}^{1 \times d} (for easier understanding, we consider the input one sample at a time), produce an output that is a weighted sum of all samples in the sequence. The idea is that this output can see the information of all input samples, and then chooses what to attend to through the weights.

If you agree with this answer, then the following is a good explanation.

The concepts of query, key and value actually come from recommendation systems. The basic principle is: given a query, compute the correlation between the query and each key, and then retrieve the most suitable value according to that correlation. For example, in movie recommendation, the query is a person's movie preferences (points of interest, age, gender, etc.), the key is the movie's attributes (genre, era, etc.), and the value is the movie to be recommended. In this example, although the attributes of query, key and value live in different spaces, they have latent relationships; that is, through suitable transformations, all three can be mapped into a similar space.

In self-attention, the current input sample \mathbf{x}_i is transformed into a query through a space transformation: \mathbf{q}_i = \mathbf{x}_i \cdot \mathbf{W}^Q, with \mathbf{q}_i \in \mathbb{R}^{1 \times d_q}. As with retrieval in a recommendation system, we need to retrieve the required value according to the correlation between the query and the keys. So why is \mathbf{K} = \mathbf{X} \cdot \mathbf{W}^K the key?

Because, following the recommendation-system process, we need the correlation between the query and each key, and the simplest way to obtain it is a dot product. Self-attention therefore computes \mathbf{r}_i = \mathbf{q}_i \cdot \mathbf{K}^T, so that \mathbf{r}_i \in \mathbb{R}^{1 \times n}; each element of \mathbf{r}_i can be regarded as the relationship between the current sample \mathbf{x}_i and one of the samples in the sequence.
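This dot product can be sketched for a single query row (Q and K here are random stand-ins for the projected matrices defined earlier, with illustrative dimensions n = 4 and d_q = d_k = 6):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 6
Q = rng.standard_normal((n, d_k))   # queries, one per row
K = rng.standard_normal((n, d_k))   # keys, one per row

i = 0
q_i = Q[i:i+1]      # (1, d_k): the query of the current sample, kept 2-D
r_i = q_i @ K.T     # (1, n): similarity of sample i to every sample in the sequence

print(r_i.shape)
```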

Once the relationships between the samples are obtained, the rest follows naturally: normalize \mathbf{r}_i and multiply it by the \mathbf{V} matrix to get the final weighted output of self-attention: \mathbf{O}_i = \mathrm{softmax}(\mathbf{r}_i) \cdot \mathbf{V}. (The original paper additionally scales \mathbf{r}_i by 1/\sqrt{d_k} before the softmax.)

Each row of \mathbf{V} is one sample of the sequence, and \mathbf{O}_i \in \mathbb{R}^{1 \times d_v}: each dimension of the output is a weighted sum of the corresponding dimension over all input samples, with the weights given by the relationship vector \mathrm{softmax}(\mathbf{r}_i). (Drawing this matrix multiplication out by hand makes it clear.)
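Putting all the steps together, the whole computation can be sketched as one function (again with arbitrary illustrative dimensions and random stand-ins for learned weights; the 1/\sqrt{d_k} scaling follows the original paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    R = Q @ K.T / np.sqrt(d_k)   # (n, n): scaled dot-product scores, one row per query
    A = softmax(R, axis=-1)      # each row is a weight vector summing to 1
    return A @ V                 # (n, d_v): each output row is a weighted sum of V's rows

rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 6, 5
X = rng.standard_normal((n, d))
O = self_attention(X,
                   rng.standard_normal((d, d_k)),   # W_Q
                   rng.standard_normal((d, d_k)),   # W_K
                   rng.standard_normal((d, d_v)))   # W_V
print(O.shape)
```

Here the full matrix \mathbf{R} stacks every \mathbf{r}_i as a row, so all queries are handled in one matrix multiplication instead of one sample at a time.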

From this, the following conclusions can be drawn:

1. Self-attention borrows the three concepts of query, key and value from recommendation systems in order to follow a similar process. But self-attention does not look up a single value for the query; instead, it produces a weighted sum of the values according to the current query. This is the task of self-attention: find a better weighted output for the current input, one that contains all visible input-sequence information, with attention controlled by the weights.

2. In self-attention, the key and the value are both transformations of the input sequence itself. Perhaps this is another meaning of "self" in self-attention: the input acts as key and value at the same time. This is actually quite reasonable: in a recommendation system, although the original feature spaces of the key and value attributes differ, they are strongly correlated, so they can be unified into one feature space through a suitable space transformation. This is one reason why self-attention multiplies by the \mathbf{W} matrices.

--

The above content will continue to be revised and improved; comments and discussion are welcome.

---

References:

Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf

Transformers in Vision: A Survey: https://arxiv.org/abs/2101.01169 [Note: in this article, the dimension definitions of W^Q, W^K and W^V are wrong; don't be misled]

A Survey on Visual Transformer: https://arxiv.org/abs/2012.12556

Recommendation System and Attention Mechanism: Detailed Attention Mechanism (caizd2009's blog, CSDN)

What exactly are keys, queries, and values in attention mechanisms? (Cross Validated)


Origin blog.csdn.net/yangyehuisw/article/details/116207892