Introduction to Deep Learning (67) Recurrent Neural Network - Attention Mechanism

Foreword

The core content comes from blog link 1 and blog link 2; please support the original authors.
This article is a personal record, kept to prevent forgetting.

Recurrent Neural Networks - Attention Mechanism

Courseware

Psychology

  • Animals need to effectively focus on noteworthy points in complex environments
  • A psychological framework: Humans choose attention points based on voluntary and involuntary cues

Attention mechanism

Convolutional, fully connected, and pooling layers only consider involuntary cues (no clear goals)

The pooling operation usually extracts the maximum value in the range of the receptive field (maximum pooling).
The convolution operation usually operates on all inputs through the convolution kernel, and then extracts some more obvious features.
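As a rough contrast (my own sketch, not part of the course materials), a convolution or pooling layer maps its input to an output without any notion of a query:

import torch
from torch import nn

x = torch.randn(1, 1, 8, 8)                       # a single-channel 8x8 input
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # convolution over all inputs
pool = nn.MaxPool2d(kernel_size=2)                # max over each receptive field

# Both layers depend only on the input itself (an involuntary cue);
# there is no separate "query" telling them what to look for.
print(conv(x).shape)   # torch.Size([1, 1, 8, 8])
print(pool(x).shape)   # torch.Size([1, 1, 4, 4])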

The attention mechanism explicitly takes voluntary cues into account.

Voluntary cues are called queries (query) - what you want to find.
Each input is a pair of a value (value) and an involuntary cue (key) - the keys can be understood as the environment; that is, the inputs are key-value pairs, and the key and the value may be identical or different.
The attention pooling layer selects certain inputs with a bias: unlike the earlier pooling layers, the query is made explicit here, and the selection over the inputs is biased according to what the query asks for, as in the sketch below.
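Here is a minimal sketch of that idea (my own illustration, not part of the course code): attention pooling as a soft, query-biased lookup over key-value pairs, using a softmax over negative squared distances as the attention weights.

import torch

keys = torch.tensor([1.0, 2.0, 3.0, 4.0])        # involuntary cues (the environment)
values = torch.tensor([10.0, 20.0, 30.0, 40.0])  # each value is paired with a key
query = torch.tensor(2.2)                        # voluntary cue: what we want to look up

# Keys closer to the query receive larger attention weights.
weights = torch.softmax(-(query - keys) ** 2, dim=0)
output = (weights * values).sum()  # biased selection: weighted average of the values
print(weights)  # the largest weight sits on the key 2.0
print(output)   # roughly 22: mostly the value 20, pulled toward 30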

Non-parametric attention pooling layer

Given key-value pairs $(x_i, y_i)$, $i = 1, \ldots, n$, and a query $x$:

Average pooling: $f(x) = \frac{1}{n}\sum_{i=1}^n y_i$

Nadaraya-Watson kernel regression: $f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i$

Non-parametric: no parameters need to be learned
x – key
y – value
f(x) – the output for the thing being queried
(x, y) – key-value pair (candidate)
Average pooling is the simplest solution precisely because it does not care what you are looking for (the x in f(x)); it just sums and averages the y values without thinking.

Nadaraya-Watson kernel regression:

Kernel: the function K, which can be viewed as measuring the distance between x and x_i. A newly given query is compared with similar data points, and the values corresponding to those data points are weighted and summed to obtain the final output, so no parameters need to be learned.

Choice of K: Gaussian Kernel
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$$

Substituting it into the Nadaraya-Watson formula gives
$$f(x) = \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i$$

u: the distance between x and x_i
exp: maps the result to a positive number
softmax: turns the results into weights between 0 and 1 that sum to 1
Add a learnable w on the basis of the above formula:
$$f(x) = \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}\big((x - x_i)w\big)^2\right) y_i$$
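A quick sketch of the effect of w (my own example, not from the slides): scaling the distance by a larger w before the softmax concentrates the weights more sharply on the keys nearest to the query.

import torch

keys = torch.arange(0.0, 5.0, 0.5)   # ten candidate keys
query = torch.tensor(2.0)

def attention_weights(w):
    # softmax(-((x - x_i) * w)^2 / 2), as in the parametric formula above
    return torch.softmax(-((query - keys) * w) ** 2 / 2, dim=0)

print(attention_weights(1.0))   # weights spread over several nearby keys
print(attention_weights(5.0))   # almost all weight on the key equal to the query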

Summary

1. Psychology holds that people choose attention points through voluntary cues and involuntary cues

2. In the attention mechanism, the inputs are selected with a bias through the query (voluntary cue) and the keys (involuntary cues), which can generally be written as

$$f(x) = \sum_{i=1}^n \alpha(x, x_i) y_i$$

A distance is computed between the query x in f(x) and the keys of all the involuntary cues (α(x, x_i), usually called the attention weight), and these serve as the weights over all the values. This is not a new concept; non-parametric attention mechanisms have existed since the 1960s.

Textbook (Attention Cues)

Since economics studies the allocation of scarce resources, we live in the era of the "attention economy", in which human attention is treated as an exchangeable, limited, valuable and scarce commodity. Many business models have been developed to exploit this: on music or video streaming services, people either spend attention on advertisements or pay money to hide them; to get ahead in the world of online games, people either spend attention on battles, which helps attract new players, or pay money to become instantly powerful. In short, attention is not free.

Attention is scarce, while the environment is full of information that competes for it. For example, the human visual system receives about $10^8$ bits of information per second, far more than the brain can fully process. Fortunately, our ancestors learned from experience (also known as data) that "not all sensory inputs are created equal". Throughout human history, the ability to direct attention to only a small fraction of the information of interest has allowed the brain to allocate its resources more wisely in order to survive, grow and socialize, for example to spot predators, food and mates.

1 Attention cues in biology

How is attention applied to the visual world? It starts with the two-component framework, which is very popular today; its emergence can be traced back to William James in the 1890s, who is considered the "father of American psychology". In this framework, subjects selectively direct the focus of attention based on involuntary cues and voluntary cues.

Involuntary cues are based on the salience and conspicuousness of objects in the environment. Imagine that we have five items in front of us: a newspaper, a research paper, a cup of coffee, a notebook and a book, as in the figure below. All the paper products are printed in black and white, while the coffee cup is red. In other words, the coffee cup is prominent and conspicuous in this visual environment and involuntarily attracts attention, so we direct our sharpest vision to the coffee, as shown in the figure.
(Figure: because of its salience, the red coffee cup involuntarily attracts attention.)
After drinking the coffee, we become energized and want to read a book, so we turn our head, refocus our eyes and look at the book, as depicted in the figure below. Unlike the salience-driven selection in the previous figure, choosing the book is now controlled by cognition and volition, so selection guided by voluntary cues is more deliberate and, being driven by the subject's own will, also more powerful.
(Figure: guided by a voluntary, task-dependent cue, attention is directed to the book.)

2 Queries, keys and values

Voluntary and involuntary attention cues explain how humans direct their attention. Let's see how these two kinds of cues can be used to design the framework of an attention mechanism with neural networks.

First, consider a relatively simple situation in which only involuntary cues are used. To bias selection toward sensory inputs, one can simply use a parametric fully connected layer, or even a non-parametric max pooling or average pooling layer.

Therefore, "whether autonomy hints are included" distinguishes attention mechanisms from fully-connected or pooling layers. In the context of attention mechanisms, autonomous cues are called 查询(query). Given any query, the attention mechanism 注意力汇聚(attention pooling)guides the selection to 感官输入(sensory inputs, such as intermediate feature representations) by . In the attention mechanism, these sensory inputs are called 值(value). Interpreted more colloquially, each value is 键(key)paired with a , which can be imagined as an involuntary cue for sensory input. As shown, attention pooling can be designed in such a way that a given query (autonomous cues) is matched to a key (involuntary cues), which leads to the best matching value (sensory input).
(Figure: attention pooling matches a query (voluntary cue) against keys (involuntary cues) to bias selection over the values (sensory inputs).)
Given the dominance of the framework shown above, the models under this framework will be the center of this chapter. However, there are many alternative designs for attention mechanisms. For example, one can design a non-differentiable attention model that is trained with reinforcement learning methods (Mnih et al., 2014).

3 Visualization of Attention

The average pooling layer can be viewed as a weighted average of the inputs in which every input gets the same weight. In fact, attention pooling is also a weighted average, but the weights are computed between the given query and the different keys.

import torch
from d2l import torch as d2l

In order to visualize the attention weights, we need to define a show_heatmaps function. The shape of its input matrices is (number of rows to display, number of columns to display, number of queries, number of keys).

#@save
def show_heatmaps(matrices, xlabel, ylabel, titles=None, figsize=(2.5, 2.5),
                  cmap='Reds'):
    """显示矩阵热图"""
    d2l.use_svg_display()
    num_rows, num_cols = matrices.shape[0], matrices.shape[1]
    fig, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize,
                                 sharex=True, sharey=True, squeeze=False)
    for i, (row_axes, row_matrices) in enumerate(zip(axes, matrices)):
        for j, (ax, matrix) in enumerate(zip(row_axes, row_matrices)):
            pcm = ax.imshow(matrix.detach().numpy(), cmap=cmap)
            if i == num_rows - 1:
                ax.set_xlabel(xlabel)
            if j == 0:
                ax.set_ylabel(ylabel)
            if titles:
                ax.set_title(titles[j])
    fig.colorbar(pcm, ax=axes, shrink=0.6);

Let's use a simple example to demonstrate. In this example, the attention weight is 1 only if the query and key are the same, and 0 otherwise.

attention_weights = torch.eye(10).reshape((1, 1, 10, 10))
show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries')

Output: (heatmap with weight 1 along the diagonal, where the query index equals the key index, and 0 elsewhere)

Later chapters will often call the show_heatmaps function to display attention weights.
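For instance, here is a hypothetical usage example (not from the book) that reuses the show_heatmaps function defined above: weights computed from a softmax over negative squared query-key distances produce a soft band around the diagonal instead of a hard diagonal.

# Hypothetical example: softmax over negative squared query-key distances.
queries, keys = torch.arange(10.0), torch.arange(10.0)
weights = torch.softmax(-(queries.reshape(-1, 1) - keys) ** 2, dim=1)
show_heatmaps(weights.reshape((1, 1, 10, 10)), xlabel='Keys', ylabel='Queries')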

4 Summary

  • Human attention is a finite, valuable and scarce resource.

  • Subjects selectively direct attention using involuntary and voluntary cues. The former is based on salience, the latter on awareness.

  • What distinguishes the attention mechanism from fully connected layers or pooling layers is the inclusion of voluntary cues (queries).

  • Through attention pooling, the attention mechanism biases selection toward the values (sensory inputs); the pooling is driven by the queries (voluntary cues) and the keys (involuntary cues), and keys and values come in pairs.

  • It is feasible to visualize the attention weights between queries and keys.

Textbook (Attention Pooling: Nadaraya-Watson Kernel Regression)

The previous section introduced the main components of the attention mechanism framework: the interaction between queries (voluntary cues) and keys (involuntary cues) forms attention pooling, and attention pooling selectively aggregates the values (sensory inputs) to produce the final output. This section introduces attention pooling in more detail, to give a high-level idea of how attention mechanisms work in practice. Specifically, the Nadaraya-Watson kernel regression model, proposed in 1964, is a simple yet complete example for demonstrating machine learning with attention mechanisms.

import torch
from torch import nn
from d2l import torch as d2l

1 Generate dataset

n_train = 50  # number of training samples
x_train, _ = torch.sort(torch.rand(n_train) * 5)   # sorted training inputs

def f(x):
    return 2 * torch.sin(x) + x**0.8

y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))  # training outputs (with noise)
x_test = torch.arange(0, 5, 0.1)  # test inputs
y_truth = f(x_test)  # ground-truth outputs for the test inputs
n_test = len(x_test)  # number of test samples
n_test

output

50

The function below plots all the training samples (represented by circles), the real data-generating function f without the noise term (labeled "Truth"), and the learned prediction function (labeled "Pred").

def plot_kernel_reg(y_hat):
    d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'],
             xlim=[0, 5], ylim=[-1, 5])
    d2l.plt.plot(x_train, y_train, 'o', alpha=0.5);

2 Average pooling

First use the simplest estimator to solve this regression problem: based on average pooling, compute the average of the outputs of all the training samples:
$$f(x) = \frac{1}{n}\sum_{i=1}^n y_i \qquad (10.2.2)$$
As shown in the figure below, this estimator is indeed not smart enough: the true function f ("Truth") and the prediction function ("Pred") are very different.

y_hat = torch.repeat_interleave(y_train.mean(), n_test)
plot_kernel_reg(y_hat)

Output: (plot: the prediction is a horizontal line at the mean of the training outputs, far from the "Truth" curve)

3 Non-parametric attention pooling

Obviously, average pooling ignores the inputs $x_i$. So Nadaraya and Watson proposed a better idea: weight the outputs $y_i$ according to the position of the input:

$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i \qquad (10.2.3)$$

where $K$ is a kernel. The estimator in (10.2.3) is called Nadaraya-Watson kernel regression. We will not go into the details of kernel functions here, but inspired by it we can rewrite (10.2.3) from the perspective of the attention-mechanism framework as a more general formula for attention pooling:
$$f(x) = \sum_{i=1}^n \alpha(x, x_i) y_i \qquad (10.2.4)$$
where $x$ is the query and $(x_i, y_i)$ are the key-value pairs. Comparing (10.2.4) and (10.2.2), attention pooling is a weighted average of the $y_i$. The relationship between the query $x$ and the key $x_i$ is modeled as the attention weight $\alpha(x, x_i)$, which, as shown in (10.2.4), is assigned to the corresponding value $y_i$. For any query, the attention weights over all the key-value pairs form a valid probability distribution: they are non-negative and sum to 1.
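As a quick sanity check (my own snippet, assuming the imports above), the weights defined this way are indeed non-negative and sum to 1 for every query:

# Sanity check: Gaussian-kernel (softmax) attention weights form a valid
# probability distribution for each query.
x_query = torch.tensor([1.5, 3.0]).reshape(-1, 1)  # two example queries
x_keys = torch.arange(0, 5, 0.5)                   # ten example keys
alpha = torch.softmax(-(x_query - x_keys) ** 2 / 2, dim=1)
print(alpha.min() >= 0)       # all weights are non-negative
print(alpha.sum(dim=1))       # each row sums to 1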

To better understand attention pooling, consider the Gaussian kernel, defined as:

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) \qquad (10.2.5)$$

Substituting the Gaussian kernel into (10.2.4) and (10.2.3) gives:

$$\begin{aligned} f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\ &= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}(x - x_j)^2\right)} y_i \\ &= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i. \end{aligned} \qquad (10.2.6)$$

In (10.2.6), the closer a key $x_i$ is to the given query $x$, the larger the attention weight assigned to that key's value $y_i$, i.e., the more attention it "obtains".

It is worth noting that Nadaraya-Watson kernel regression is a non-parametric model; therefore, (10.2.6) is an example of nonparametric attention pooling. Next, we plot the predictions of this non-parametric attention pooling model. You will find that the new model's prediction line is smooth and closer to the truth than the average-pooling prediction.

# Shape of X_repeat: (n_test, n_train),
# where each row contains the same test input (i.e., the same query)
X_repeat = x_test.repeat_interleave(n_train).reshape((-1, n_train))
# x_train contains the keys. Shape of attention_weights: (n_test, n_train),
# where each row holds the attention weights to distribute over the values (y_train) for a given query
attention_weights = nn.functional.softmax(-(X_repeat - x_train)**2 / 2, dim=1)
# Each element of y_hat is a weighted average of the values, with the attention weights as weights
y_hat = torch.matmul(attention_weights, y_train)
plot_kernel_reg(y_hat)

Output: (plot: the non-parametric prediction is a smooth curve that follows "Truth" much more closely than average pooling)
Now look at the attention weights. Here the test inputs act as queries and the training inputs act as keys. Since both are sorted, it can be observed that the closer a query-key pair is, the higher its attention weight.

d2l.show_heatmaps(attention_weights.unsqueeze(0).unsqueeze(0),
                  xlabel='Sorted training inputs',
                  ylabel='Sorted testing inputs')

Output: (heatmap: attention weights are largest near the diagonal, where sorted test inputs are close to sorted training inputs)

4 Attention pooling with parameters

Non-parametric Nadaraya-Watson kernel regression has the advantage of consistency: given enough data, the model converges to the optimal result. Nevertheless, we can easily integrate learnable parameters into attention pooling.

For example, slightly different from (10.2.6), in the following the distance between the query $x$ and the key $x_i$ is multiplied by a learnable parameter $w$:

$$\begin{aligned} f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\ &= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_j)w)^2\right)} y_i \\ &= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i. \end{aligned} \qquad (10.2.7)$$
In the remainder of this section, we learn the parameter of attention pooling by training the model in (10.2.7).

4.1 Batch Matrix Multiplication

To compute attention more efficiently for mini-batches of data, we can take advantage of the batch matrix multiplication provided by deep learning frameworks.

Suppose the first mini-batch contains $n$ matrices $\mathbf{X}_1, \ldots, \mathbf{X}_n$ of shape $a \times b$, and the second mini-batch contains $n$ matrices $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ of shape $b \times c$. Their batch matrix multiplication produces $n$ matrices $\mathbf{X}_1\mathbf{Y}_1, \ldots, \mathbf{X}_n\mathbf{Y}_n$ of shape $a \times c$. Therefore, given two tensors of shape $(n, a, b)$ and $(n, b, c)$, the output of their batch matrix multiplication has shape $(n, a, c)$.

X = torch.ones((2, 1, 4))
Y = torch.ones((2, 4, 6))
torch.bmm(X, Y).shape

output

torch.Size([2, 1, 6])

In the context of attention mechanisms, we can use mini-batch matrix multiplication to compute weighted averages of values over a mini-batch. In the example below every weight equals 0.1, so each output is simply the mean of the corresponding ten values.

weights = torch.ones((2, 10)) * 0.1              # two rows of uniform weights
values = torch.arange(20.0).reshape((2, 10))     # two rows of ten values each
torch.bmm(weights.unsqueeze(1), values.unsqueeze(-1))

output

tensor([[[ 4.5000]],

        [[14.5000]]])

4.2 Define the model

Based on parametric attention pooling in (10.2.7), using mini-batch matrix multiplication, the parametric version of the Nadaraya-Watson kernel regression is defined as:

class NWKernelRegression(nn.Module):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.w = nn.Parameter(torch.rand((1,), requires_grad=True))

    def forward(self, queries, keys, values):
        # Shape of queries and attention_weights: (no. of queries, no. of key-value pairs)
        queries = queries.repeat_interleave(keys.shape[1]).reshape((-1, keys.shape[1]))
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w)**2 / 2, dim=1)
        # Shape of values: (no. of queries, no. of key-value pairs)
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)
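Before training, a quick shape check with dummy data (my own snippet, not part of the book's code) confirms that the model maps a batch of queries with matching keys and values to a one-dimensional output:

# Shape check with hypothetical dummy data: 4 queries, 7 key-value pairs each.
net_check = NWKernelRegression()
q, k, v = torch.rand(4), torch.rand(4, 7), torch.rand(4, 7)
print(net_check(q, k, v).shape)   # torch.Size([4])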

4.3 Training

Next, transform the training dataset into keys and values for training the attention model. In the parametric attention pooling model, each training input is matched against the key-value pairs of all the training samples except itself in order to obtain its predicted output.

# Shape of X_tile: (n_train, n_train), where each row contains the same training inputs
X_tile = x_train.repeat((n_train, 1))
# Shape of Y_tile: (n_train, n_train), where each row contains the same training outputs
Y_tile = y_train.repeat((n_train, 1))
# Shape of keys: (n_train, n_train - 1): for each query, all training inputs except itself
keys = X_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))
# Shape of values: (n_train, n_train - 1)
values = Y_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))

When training the parametric attention pooling model, we use the squared loss and stochastic gradient descent.

net = NWKernelRegression()
loss = nn.MSELoss(reduction='none')
trainer = torch.optim.SGD(net.parameters(), lr=0.5)
animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, 5])

for epoch in range(5):
    trainer.zero_grad()
    l = loss(net(x_train, keys, values), y_train)
    l.sum().backward()
    trainer.step()
    print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}')
    animator.add(epoch + 1, float(l.sum()))

Output: (plot: the training loss decreases over the five epochs)
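After training, the learned scalar w can also be inspected directly (a small follow-up of my own; the exact value depends on the random initialization and the sampled data). A larger w corresponds to a narrower kernel, which is one way to see why the fitted curve below is less smooth than the non-parametric one.

# Inspect the learned kernel-width parameter; the exact value varies from run to run.
print(f'learned w: {float(net.w):.3f}')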
As shown below, after training the parametric attention pooling model, we find that in trying to fit the noisy training data, the predicted line is less smooth than that of the previous non-parametric model.

# Shape of keys: (n_test, n_train), where each row contains the same training inputs (i.e., the same keys)
keys = x_train.repeat((n_test, 1))
# Shape of values: (n_test, n_train)
values = y_train.repeat((n_test, 1))
y_hat = net(x_test, keys, values).unsqueeze(1).detach()
plot_kernel_reg(y_hat)

Output: (plot: the parametric prediction curve is less smooth and follows the noisy training points more closely)
Why is the new model less smooth? Let's look at a plot of the attention weights: compared with the non-parametric attention pooling model, adding the learnable parameter makes the curve less smooth in the regions where the attention weights are large.

d2l.show_heatmaps(net.attention_weights.unsqueeze(0).unsqueeze(0),
                  xlabel='Sorted training inputs',
                  ylabel='Sorted testing inputs')

Output: (heatmap: compared with the non-parametric model, the large attention weights are concentrated in a narrower band along the diagonal)

5 Summary

  • Nadaraya-Watson kernel regression is a machine learning paradigm with an attention mechanism.

  • Attention pooling for Nadaraya-Watson kernel regression is a weighted average of the outputs in the training data. From an attention perspective, the attention weight assigned to each value depends on a function that takes as input the value's corresponding key and query.

  • Attention pooling can be divided into non-parametric and parametric.

Origin blog.csdn.net/qq_52358603/article/details/128485768