"Hands-On Deep Learning" - 65 Attention Scores

Study notes for Mushen's "Hands-On Deep Learning" video course, recording the learning process; please buy the book for the detailed content.
Bilibili video link
Open-source tutorial link

attention score


Section 64 used a Gaussian kernel to model the relationship between queries and keys. The exponential part of the Gaussian kernel can be regarded as the attention scoring function (scoring function for short). The output of this function is fed into a softmax, which yields a probability distribution over the values associated with the keys, i.e. the attention weights. The final output of attention pooling is the weighted sum of the values under these attention weights.

[Figure: the output of attention pooling computed as a weighted sum of the values]

The figure above illustrates how the output of attention pooling is computed as a weighted sum of the values, where α denotes the attention scoring function. Since the attention weights form a probability distribution, the weighted sum is essentially a weighted average.

[Figure: attention pooling via the scoring function α]

Queries, keys, and values can have different lengths. The attention weight between a query q_i and a key k_i is obtained by the attention scoring function α, which maps the two vectors to a scalar, followed by a softmax over all keys. Choosing a different α leads to a different attention pooling operation.
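
Written out explicitly (a reconstruction of the formula in the lost figure, following the d2l text and the notation above, with α denoting the scoring function), the attention pooling for a query q over key-value pairs (k_1, v_1), ..., (k_m, v_m) is

$$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^{m} \operatorname{softmax}\bigl(\alpha(\mathbf{q}, \mathbf{k}_i)\bigr)\,\mathbf{v}_i, \qquad \operatorname{softmax}\bigl(\alpha(\mathbf{q}, \mathbf{k}_i)\bigr) = \frac{\exp\bigl(\alpha(\mathbf{q}, \mathbf{k}_i)\bigr)}{\sum_{j=1}^{m} \exp\bigl(\alpha(\mathbf{q}, \mathbf{k}_j)\bigr)},$$

where the softmax outputs are the attention weights.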

additive attention

When queries and keys are of different lengths, additive attention can be used as a scoring function.

[Figure: additive attention scoring function]
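
For reference, the additive attention score shown in the figure (reconstructed from the d2l text; it matches the implementation below, with h = num_hiddens) is

$$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^{\top} \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}) \in \mathbb{R},$$

where $\mathbf{W}_q \in \mathbb{R}^{h \times q}$, $\mathbf{W}_k \in \mathbb{R}^{h \times k}$, and $\mathbf{w}_v \in \mathbb{R}^{h}$ are learnable parameters. In words: the query and key are projected to the same hidden size, added, passed through tanh, and reduced to a scalar.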

Scaled dot product attention

If the query and key have the same length d, the dot product gives a scoring function that is more computationally efficient. We divide it by √d so that the result is less affected by the length: assuming the elements of the query and the key are independent random variables with zero mean and unit variance, the dot product of the two vectors has mean 0 and variance d. Dividing the dot product by √d keeps the variance at 1 regardless of the vector length, which gives scaled dot-product attention.

[Figure: scaled dot-product attention formula]
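
For reference, the scaled dot-product score shown in the figure (reconstructed from the d2l text) is

$$a(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^{\top} \mathbf{k}}{\sqrt{d}}.$$

In the batched form used by the implementation below, with $n$ queries $\mathbf{Q} \in \mathbb{R}^{n \times d}$, $m$ keys $\mathbf{K} \in \mathbb{R}^{m \times d}$, and values $\mathbf{V} \in \mathbb{R}^{m \times v}$, the attention pooling output is $\operatorname{softmax}\!\bigl(\mathbf{Q}\mathbf{K}^{\top} / \sqrt{d}\bigr)\,\mathbf{V}$.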

Summary

The attention score measures the similarity between a query and a key; the attention weights are the softmax of the scores.

hands-on learning

attention scoring function

import math
import torch
from torch import nn
from d2l import torch as d2l

# Masked softmax operation
def masked_softmax(X, valid_lens):
    """Perform a softmax operation by masking elements on the last axis"""
    # X: 3D tensor, valid_lens: 1D or 2D tensor
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # Replace the masked elements on the last axis with a very large
        # negative value, so that their softmax output is 0
        X = d2l.sequence_mask(X.reshape(-1, shape[-1]), valid_lens,
                              value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)

masked_softmax(torch.rand(2, 2, 4), torch.tensor([2, 3])) # two examples in the batch
tensor([[[0.4074, 0.5926, 0.0000, 0.0000],
         [0.5424, 0.4576, 0.0000, 0.0000]],

        [[0.3432, 0.3032, 0.3536, 0.0000],
         [0.3251, 0.4291, 0.2458, 0.0000]]])
masked_softmax(torch.rand(2, 2, 4), torch.tensor([[1, 3], [2, 4]]))
tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
         [0.3577, 0.2927, 0.3496, 0.0000]],

        [[0.5817, 0.4183, 0.0000, 0.0000],
         [0.2940, 0.2301, 0.2602, 0.2157]]])
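
For readers who have not seen d2l.sequence_mask before: it keeps the first valid_len entries along axis 1 of each row and overwrites the rest with value (0 by default, -1e6 above). A small illustrative sketch (not from the original notes):

X = torch.tensor([[1, 2, 3], [4, 5, 6]])
d2l.sequence_mask(X, torch.tensor([1, 2]))
# expected: tensor([[1, 0, 0],
#                   [4, 5, 0]])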

additive attention

#@save
class AdditiveAttention(nn.Module):
    """Additive attention"""
    def __init__(self, key_size, query_size, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens): # valid_lens: how many key-value pairs should be considered
        queries, keys = self.W_q(queries), self.W_k(keys)
        # After dimension expansion,
        # shape of queries: (batch_size, no. of queries, 1, num_hiddens)
        # shape of keys: (batch_size, 1, no. of key-value pairs, num_hiddens)
        # Sum them up with broadcasting
        features = queries.unsqueeze(2) + keys.unsqueeze(1) # promoted to 4D
        features = torch.tanh(features)
        # self.w_v has only one output, so remove the last dimension from the shape.
        # shape of scores: (batch_size, no. of queries, no. of key-value pairs)
        scores = self.w_v(features).squeeze(-1)
        self.attention_weights = masked_softmax(scores, valid_lens) # (batch_size, no. of queries, no. of key-value pairs)
        # shape of values: (batch_size, no. of key-value pairs, value dimension)
        return torch.bmm(self.dropout(self.attention_weights), values) # output: (batch_size, no. of queries, value dimension)

queries, keys = torch.normal(0, 1, (2, 1, 20)), torch.ones((2, 10, 2)) # 1 query of dimension 20; 10 keys of dimension 2
# Mini-batch of values; the two value matrices are identical
values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat(2, 1, 1) # 10 values of dimension 4
valid_lens = torch.tensor([2, 6])

attention = AdditiveAttention(key_size=2, query_size=20, num_hiddens=8,
                              dropout=0.1)
attention.eval()
attention(queries, keys, values, valid_lens) # output shape: (2, 1, 4)
tensor([[[ 2.0000,  3.0000,  4.0000,  5.0000]],

        [[10.0000, 11.0000, 12.0000, 13.0000]]], grad_fn=<BmmBackward0>)
# attention.attention_weights has shape (2, 1, 10): one weight per key-value pair.
# Since all keys are identical (ones), the weights are uniform over the valid keys,
# so the outputs above are the averages of the first 2 and the first 6 value rows.
print(attention.attention_weights.shape)
d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
                  xlabel='Keys', ylabel='Queries')

[Figure: heatmap of the additive attention weights (Queries × Keys)]

Scaled dot product attention

#@save
class DotProductAttention(nn.Module):
    """Scaled dot-product attention"""
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # shape of queries: (batch_size, no. of queries, d)
    # shape of keys: (batch_size, no. of key-value pairs, d)
    # shape of values: (batch_size, no. of key-value pairs, value dimension)
    # shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Swap the last two dimensions of keys with transpose(1, 2)
        scores = torch.bmm(queries, keys.transpose(1,2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)

queries = torch.normal(0, 1, (2, 1, 2))
attention = DotProductAttention(dropout=0.5)
attention.eval()
attention(queries, keys, values, valid_lens)
tensor([[[ 2.0000,  3.0000,  4.0000,  5.0000]],

        [[10.0000, 11.0000, 12.0000, 13.0000]]])

d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
                  xlabel='Keys', ylabel='Queries')

[Figure: heatmap of the scaled dot-product attention weights (Queries × Keys)]
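
As a quick numerical sanity check of the scaling argument above (my own sketch, not from the notes): for vectors with i.i.d. zero-mean, unit-variance entries, the raw dot product has variance close to d, and dividing by sqrt(d) brings it back to about 1.

# Empirically check the variance of (scaled) dot products of random vectors
d = 64
q = torch.normal(0, 1, (10000, d))   # 10000 random queries
k = torch.normal(0, 1, (10000, d))   # 10000 random keys
dots = (q * k).sum(dim=1)            # one dot product per row
print(dots.var())                    # roughly d (about 64)
print((dots / math.sqrt(d)).var())   # roughly 1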

Source: blog.csdn.net/cjw838982809/article/details/132093762