Self-attention is an attention mechanism used mainly in natural language processing tasks, and it is the core building block of the Transformer model. It computes the relationship between each element of the input sequence and every other element, and uses these relationships to build a better representation of the sequence.
In self-attention, each element is a vector. In language processing, for example, the embedding vector of each word serves as one element of the input sequence. To compute the relationship between each element and the others, self-attention introduces three matrices: the query matrix, the key matrix, and the value matrix. Each is obtained by applying a learned linear transformation to the elements of the input sequence.
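The computation described above can be sketched with plain tensors. This is a minimal illustration, not a reference implementation: the sizes are arbitrary, the matrices `W_q`, `W_k`, and `W_v` are random stand-ins for learned weights, and the division by `sqrt(d_k)` is the scaling used by the Transformer's scaled dot-product attention, which the text above does not mention.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model, d_k = 4, 8, 6   # illustrative sizes, not from the text

x = torch.randn(seq_len, d_model)          # one sequence of 4 element vectors

# Random stand-ins for the learned query/key/value projection matrices
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

q, k, v = x @ W_q, x @ W_k, x @ W_v        # each: (seq_len, d_k)

# Similarity between every query and every key: (seq_len, seq_len)
scores = q @ k.T / math.sqrt(d_k)

# Softmax turns each row into a distribution over sequence positions
weights = torch.softmax(scores, dim=-1)

# Every output vector is a weighted average of the value vectors
output = weights @ v                       # (seq_len, d_k)

print(weights.sum(dim=-1))   # each row sums to 1, up to floating-point error
print(output.shape)          # torch.Size([4, 6])
```

Row `i` of `weights` says how much element `i` attends to each element of the sequence, including itself.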
Implementing self-attention with PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttention(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(SelfAttention, self).__init__()
        self.query = nn.Linear(input_size, hidden_size)
        self.key = nn.Linear(input_size, hidden_size)
        self.value = nn.Linear(input_size, hidden_size)

    def forward(self, x):
        # Compute Q, K, V; x has shape (batch, seq_len, input_size)
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        # Compute the self-attention matrix: (batch, seq_len, seq_len)
        attn_weights = torch.bmm(q, k.transpose(1, 2))
        attn_weights = F.softmax(attn_weights, dim=-1)
        # Use the self-attention matrix to take a weighted average of V
        attn_output = torch.bmm(attn_weights, v)
        return attn_output
In the above code, we define a SelfAttention class that inherits from nn.Module. In __init__(), we define three linear layers, query, key, and value, for computing the query, key, and value vectors, respectively. In forward(), we first compute q, k, and v, then use torch.bmm() to compute the self-attention matrix and F.softmax() to normalize it so that each row sums to 1. Finally, we use torch.bmm() again to multiply the self-attention matrix by v and return the weighted-average output.
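The class above uses unscaled dot products and attends to every position. Below is a hedged sketch of a variant that adds the `1/sqrt(d_k)` scaling from the Transformer and an optional padding mask. The class name `MaskedSelfAttention` and the `pad_mask` argument are hypothetical names introduced for illustration, not part of the original code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedSelfAttention(nn.Module):
    """Hypothetical variant of SelfAttention with scaling and a padding mask."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.query = nn.Linear(input_size, hidden_size)
        self.key = nn.Linear(input_size, hidden_size)
        self.value = nn.Linear(input_size, hidden_size)
        self.scale = 1.0 / math.sqrt(hidden_size)

    def forward(self, x, pad_mask=None):
        # x: (batch, seq_len, input_size)
        # pad_mask: optional bool tensor (batch, seq_len), True marks padding
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = torch.bmm(q, k.transpose(1, 2)) * self.scale
        if pad_mask is not None:
            # Padded keys get -inf so softmax assigns them zero weight.
            # Assumes every sequence keeps at least one unpadded position.
            scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
        attn_weights = F.softmax(scores, dim=-1)
        return torch.bmm(attn_weights, v), attn_weights


attn = MaskedSelfAttention(input_size=16, hidden_size=8)
x = torch.randn(2, 5, 16)
pad_mask = torch.tensor([[False, False, False, True, True],
                         [False, False, False, False, False]])
out, w = attn(x, pad_mask)
print(out.shape)   # torch.Size([2, 5, 8])
print(w.shape)     # torch.Size([2, 5, 5])
```

Returning the weights alongside the output makes it easy to check that the padded positions of the first sequence receive no attention.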
A SelfAttention instance can be created and tested using the following code:
input_size = 128
hidden_size = 64
batch_size = 32
seq_len = 10
sa = SelfAttention(input_size, hidden_size)
x = torch.randn(batch_size, seq_len, input_size)
output = sa(x)
print(output.size())  # Output: torch.Size([32, 10, 64])
In the code above, we create a SelfAttention instance sa and use torch.randn() to generate a random input tensor x of size (32, 10, 128), where 32 is the batch size, 10 is the sequence length, and 128 is the feature dimension. Finally, we pass x to sa and store the result in the tensor output. We print the size of output to make sure it is as expected.
Using self-attention inside a network:
import torch
import torch.nn as nn
class MyModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MyModel, self).__init__()
        # Define the self-attention module. embed_dim must equal the feature
        # size of the tensor passed to it (input_size) and be divisible by num_heads.
        self.self_attn = nn.MultiheadAttention(input_size, num_heads=8)
        # Define the feed-forward network
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Extract features with the self-attention module
        x, _ = self.self_attn(x, x, x)
        # Classify with the feed-forward network
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
In the code above, we define a neural network named MyModel, which consists of a self-attention module and a feed-forward neural network. In __init__(), we first create an nn.MultiheadAttention instance and store it in self.self_attn. Next, we define a feed-forward network that consists of an input layer fc1, a ReLU activation function, and an output layer fc2. In forward(), we pass the input tensor x to the self-attention module, using x as the query, key, and value, to extract features from the input. We then classify the features with the feed-forward network and return the output tensor x.
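As written, MyModel returns one vector of class scores per sequence position. If a single prediction per sequence is wanted, one common option is to average over the sequence dimension before the feed-forward layers. The sketch below is an assumption about how that could look, not part of the original model: the class name `PooledAttentionClassifier` and the choice of mean pooling are introduced here for illustration.

```python
import torch
import torch.nn as nn


class PooledAttentionClassifier(nn.Module):
    """Hypothetical variant of MyModel that returns one prediction per sequence."""

    def __init__(self, input_size, hidden_size, num_classes, num_heads=8):
        super().__init__()
        # embed_dim must equal the feature size of the tensors passed in
        # and be divisible by num_heads
        self.self_attn = nn.MultiheadAttention(input_size, num_heads)
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (seq_len, batch, input_size), the module's default layout
        x, _ = self.self_attn(x, x, x)
        x = x.mean(dim=0)                        # average over positions
        return self.fc2(self.relu(self.fc1(x)))  # (batch, num_classes)


model = PooledAttentionClassifier(input_size=128, hidden_size=64, num_classes=10)
logits = model(torch.randn(10, 32, 128))
print(logits.shape)   # torch.Size([32, 10])
```

Mean pooling treats all positions equally. With padded batches, the padded positions would need to be excluded from the average.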
A MyModel instance can be created and tested using the following code:
input_size = 128
hidden_size = 64
num_classes = 10
batch_size = 32
seq_len = 10
model = MyModel(input_size, hidden_size, num_classes)
x = torch.randn(seq_len, batch_size, input_size)
output = model(x)
print(output.size())  # Output: torch.Size([10, 32, 10])
In the code above, we create a MyModel instance and use torch.randn() to generate a random input tensor x of size (10, 32, 128), where 10 is the sequence length, 32 is the batch size, and 128 is the feature dimension. Note that the sequence dimension comes first here, because nn.MultiheadAttention expects (seq_len, batch, feature) input by default. Finally, we pass x to the model instance and store the result in the tensor output. We print the size of output to make sure it is as expected.
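If a batch-first layout is preferred, nn.MultiheadAttention accepts `batch_first=True`. The short sketch below uses the module on its own, with the same sizes as the test above, and also inspects the attention weights that the model's forward pass discards. It assumes a PyTorch version that supports `batch_first` (1.9 or later).

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 128, 8
batch_size, seq_len = 32, 10

# batch_first=True switches the expected layout to (batch, seq_len, embed_dim)
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(batch_size, seq_len, embed_dim)
attn_output, attn_weights = mha(x, x, x)

print(attn_output.shape)    # torch.Size([32, 10, 128])
print(attn_weights.shape)   # torch.Size([32, 10, 10]), averaged over heads
```

The second return value holds, for each sequence in the batch, one row of attention weights per query position.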