Tweet link: [GNN] Graph Attention Network
GAT
GAT (Graph Attention Network) is a neural network architecture that operates on graph-structured data and uses masked self-attention layers to address the shortcomings of earlier methods based on graph convolutions or their approximations. By stacking layers in which each node attends over the features of its neighborhood, the model can (implicitly) assign different weights to different nodes in a neighborhood, without requiring costly matrix operations (such as inversion) or prior knowledge of the graph structure. In this way, GAT addresses several key challenges of spectral-based graph neural networks at once and is readily applicable to both inductive and transductive problems.
From graph convolutional networks (GCNs), we learned that combining local graph structures with node-level features can achieve good performance in node classification tasks.
$$
\begin{aligned}
h_i^{(l+1)} &= \sigma\left(\sum_{j\in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{(l)} h^{(l)}_j\right) \\
c_{ij} &= \sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}
\end{aligned}
$$
However, the way GCN aggregates messages is structure-dependent: the weight $1/c_{ij}$ is fixed by the node degrees alone, which may limit how well the model generalizes.
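To make the structure dependence concrete, here is a minimal sketch of the GCN update above for a dense adjacency matrix. The function name `gcn_layer`, the toy graph, and the choice of ReLU for $\sigma$ are illustrative assumptions, not part of the original post. The sketch also assumes every node has at least one neighbor (here via self-loops), so that no $c_{ij}$ is zero.

```python
import torch

def gcn_layer(adj, h, weight):
    """One GCN update: h_i' = relu(sum_j (1 / c_ij) * W h_j).

    adj:    (N, N) 0/1 adjacency matrix; N(i) is read off row i
    h:      (N, F) node features, one row per node
    weight: (F, F_out) matrix, applied on the right because features are rows
    """
    deg = adj.sum(dim=1)                                   # |N(i)|
    c = torch.sqrt(deg).unsqueeze(1) * torch.sqrt(deg).unsqueeze(0)  # c_ij
    norm_adj = adj / c                                     # 1 / c_ij on edges, 0 elsewhere
    return torch.relu(norm_adj @ (h @ weight))

# Hypothetical 3-node path graph 0 - 1 - 2 with self-loops.
adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])
h = torch.randn(3, 4)
weight = torch.randn(4, 2)
out = gcn_layer(adj, h, weight)                            # shape (3, 2)
```

Nothing in `norm_adj` depends on the node features: two neighbors with very different features receive the same weight if their degrees match.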
GAT introduces an attention mechanism to replace the static normalized convolution operation. The diagram below clearly illustrates the key difference.
Attention layer
The input is a set of node features
$$\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \dots, \vec{h}_N\}, \quad \vec{h}_i \in \mathbb{R}^{F}$$
The layer produces a new set of node features (possibly of a different dimension $F'$) as its output
$$\mathbf{h}' = \{\vec{h'}_1, \vec{h'}_2, \dots, \vec{h'}_N\}, \quad \vec{h'}_i \in \mathbb{R}^{F'}$$
The attention layer can be divided into 4 parts:
- Simple linear transformation
To obtain enough expressive power to transform the input features into higher-level features, at least one learnable linear transformation is required. To this end, as an initial step, a shared linear transformation parameterized by a weight matrix $W \in \mathbb{R}^{F' \times F}$ is applied to every node.
$$z_i^{(l)} = W^{(l)} h_i^{(l)}$$
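As a small illustration, assuming PyTorch: `nn.Linear(F_in, F_out, bias=False)` stores its weight with shape `(F_out, F_in)`, which matches the $F' \times F$ shape of $W$, and applies it to every row of the feature matrix, so the same $W$ is shared by all nodes. The sizes below are arbitrary.

```python
import torch
import torch.nn as nn

N, F_in, F_out = 5, 8, 3            # arbitrary sizes for illustration
h = torch.randn(N, F_in)            # one row of features per node

linear = nn.Linear(F_in, F_out, bias=False)   # weight shape: (F_out, F_in)
z = linear(h)                       # z_i = W h_i for every node i, shape (N, F_out)

# The same result written out per node with the column-vector convention.
z_manual = torch.stack([linear.weight @ h[i] for i in range(N)])
```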
- Attention Coefficients
We then compute pairwise unnormalized attention scores between two neighbors.
$$e_{ij}^{(l)} = \text{LeakyReLU}\left({\vec{a}^{(l)}}^{T} \left(z_i^{(l)} \,\|\, z_j^{(l)}\right)\right)$$
$\|$ denotes concatenation. This form of attention is usually called additive attention, in contrast to the dot-product attention used in the Transformer model. In its most general form, this step lets every node attend to every other node, discarding all structural information. GAT injects the graph structure through masked attention: $e_{ij}$ is computed only for nodes $j \in \mathcal{N}(i)$.
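A minimal sketch of this scoring step follows, evaluating $e_{ij}$ by explicit concatenation and only for pairs with $j \in \mathcal{N}(i)$. The toy adjacency matrix, the tensor sizes, and the names `z`, `a`, and `scores` are assumptions made for illustration; the slope 0.2 matches the value noted in the implementation section.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, F_out = 4, 3
z = torch.randn(N, F_out)               # transformed features z_i
a = torch.randn(2 * F_out)              # attention vector a
adj = torch.tensor([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]])      # toy graph with self-loops

scores = {}
for i in range(N):
    for j in range(N):
        if adj[i, j] > 0:               # only score j in the neighborhood of i
            pair = torch.cat([z[i], z[j]])              # z_i || z_j
            scores[(i, j)] = F.leaky_relu(a @ pair, negative_slope=0.2)
```

Because $z_i$ and $z_j$ meet different halves of $\vec{a}$, the score is generally asymmetric: `scores[(0, 1)]` need not equal `scores[(1, 0)]`.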
- Softmax
To make the coefficients easily comparable across different nodes, we normalize them across all choices of $j$ using the softmax function:
$$\alpha_{ij}^{(l)} = \frac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k\in \mathcal{N}(i)} \exp\left(e_{ik}^{(l)}\right)}$$
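In matrix form, the normalization over $\mathcal{N}(i)$ can be sketched by setting the scores of non-neighbors to $-\infty$ before a row-wise softmax. The score matrix `e` and adjacency `adj` below are made-up values. The sketch assumes every node has at least one neighbor (for example a self-loop); otherwise a row would be entirely $-\infty$ and the softmax would return NaN.

```python
import torch

e = torch.tensor([[0.5, 1.0, 2.0],
                  [1.5, 0.0, 0.3],
                  [0.2, 0.7, 0.1]])       # made-up unnormalized scores e_ij
adj = torch.tensor([[1, 1, 0],
                    [1, 1, 1],
                    [0, 1, 1]])           # row i marks the neighbors of node i

masked = e.masked_fill(adj == 0, float("-inf"))   # exclude non-neighbors
alpha = torch.softmax(masked, dim=1)              # each row sums to 1 over N(i)
```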
- Aggregation
This step is similar to GCN. Embeddings from neighbors are aggregated, scaled by the attention score.
$$h_i^{(l+1)} = \sigma\left(\sum_{j\in \mathcal{N}(i)} \alpha^{(l)}_{ij} z^{(l)}_j\right)$$
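One way to see the relation to GCN: if all scores within a neighborhood are equal, the softmax gives $\alpha_{ij} = 1/|\mathcal{N}(i)|$ and the aggregation becomes a plain neighborhood mean. The sketch below checks this on a made-up graph. The nonlinearity $\sigma$ is omitted to keep the comparison direct, and all names are illustrative.

```python
import torch

adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])        # toy graph with self-loops
z = torch.tensor([[1., 0.],
                  [0., 2.],
                  [4., 4.]])              # transformed features z_j

e = torch.zeros(3, 3)                     # identical scores for every pair
alpha = torch.softmax(e.masked_fill(adj == 0, float("-inf")), dim=1)

h_next = alpha @ z                        # sum_j alpha_ij z_j (sigma omitted)
neighbor_mean = (adj @ z) / adj.sum(dim=1, keepdim=True)
```

With learned, feature-dependent scores, the rows of `alpha` move away from uniform, which is what lets GAT weight neighbors differently.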
Multi-head Attention
The figure above shows multi-head attention (with $K = 3$ heads) by node 1 on its neighborhood. Different arrow styles and colors denote independent attention computations. The aggregated features from each head are concatenated or averaged to obtain $\vec{h'}_1$.
Similar to multiple channels in convolutional networks, GAT uses multi-head attention to enrich the model's capacity and stabilize the learning process. Specifically, $K$ independent attention mechanisms each perform the transformation of the aggregation step above, and their outputs can be combined in two ways depending on where the layer is used:
- Concatenation
$$\text{Concatenation}: \quad h^{(l+1)}_{i} = \Big\|_{k=1}^{K} \sigma\left(\sum_{j\in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h^{(l)}_{j}\right)$$
In this setting, the returned output $h'$ consists of $KF'$ features (rather than $F'$) for each node.
- Averaging
$$\text{Average}: \quad h_{i}^{(l+1)} = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h^{(l)}_{j}\right)$$
If we perform multi-head attention on the final (prediction) layer of the network, concatenation is no longer sensible. Instead, the heads are averaged, and the final nonlinearity (usually a softmax or logistic sigmoid for classification problems) is applied afterwards.
So intermediate layers use concatenation and the final layer uses averaging.
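The two combination rules differ mainly in the output width and in where the nonlinearity is applied. Below is a minimal shape-level sketch, assuming the $K$ per-head aggregates $\sum_j \alpha^k_{ij} W^k h_j$ have already been computed (random tensors stand in for them here) and using ELU and softmax as example nonlinearities.

```python
import torch
import torch.nn.functional as F

K, N, F_out = 3, 5, 4
# Stand-ins for the pre-activation output of each head, shape (N, F_out) each.
head_outputs = [torch.randn(N, F_out) for _ in range(K)]

# Hidden layer: apply sigma per head, then concatenate along the feature axis.
hidden = torch.cat([F.elu(out) for out in head_outputs], dim=1)   # (N, K * F_out)

# Output layer: average the heads first, then apply the final nonlinearity.
averaged = torch.stack(head_outputs, dim=0).mean(dim=0)           # (N, F_out)
probs = F.softmax(averaged, dim=1)                                # class probabilities
```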
Implementation
The following is a PyTorch implementation; see GitHub for the TensorFlow version.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GATLayer(nn.Module):
    """
    Simple PyTorch implementation of a single-head graph attention layer.
    """

    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super().__init__()
        self.dropout = dropout            # dropout probability, e.g. 0.6
        self.in_features = in_features    # F: input features per node
        self.out_features = out_features  # F': output features per node
        self.alpha = alpha                # negative slope of LeakyReLU, e.g. 0.2
        self.concat = concat              # True for all layers except the output layer

        # Xavier initialization of weights
        # Alternatively use weights_init to apply weights of choice
        # W has shape (F, F') because node features are stored as rows: z = h W
        self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.W.data, gain=1.414)
        self.a = nn.Parameter(torch.zeros(size=(2 * out_features, 1)))
        nn.init.xavier_uniform_(self.a.data, gain=1.414)

        # LeakyReLU
        self.leakyrelu = nn.LeakyReLU(self.alpha)

    def forward(self, input, adj):
        # Linear transformation: (N, F) -> (N, F')
        h = torch.mm(input, self.W)
        N = h.size()[0]

        # Attention mechanism: build every pair (z_i || z_j), then score it
        a_input = torch.cat(
            [h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1
        ).view(N, -1, 2 * self.out_features)
        e = self.leakyrelu(torch.matmul(a_input, self.a).squeeze(2))

        # Masked attention: a large negative score gives non-neighbors
        # a weight of approximately zero after the softmax
        zero_vec = -9e15 * torch.ones_like(e)
        attention = torch.where(adj > 0, e, zero_vec)
        attention = F.softmax(attention, dim=1)
        attention = F.dropout(attention, self.dropout, training=self.training)

        # Aggregation: attention-weighted sum of neighbor features
        h_prime = torch.matmul(attention, h)

        # Hidden layers apply ELU here; the output layer returns raw values
        if self.concat:
            return F.elu(h_prime)
        else:
            return h_prime
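
Note that `a_input` holds all $N^2$ concatenated pairs, so its memory grows as $N^2 \cdot 2F'$. Since $\vec{a}^T (z_i \,\|\, z_j) = \vec{a}_1^T z_i + \vec{a}_2^T z_j$, where $\vec{a}_1$ and $\vec{a}_2$ are the two halves of $\vec{a}$, the same scores can be obtained by broadcasting. The following sketch is not part of the original implementation; it compares the two computations on random tensors with assumed sizes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, out_features = 6, 4
h = torch.randn(N, out_features)          # already transformed features
a = torch.randn(2 * out_features, 1)      # attention vector
alpha = 0.2                               # LeakyReLU negative slope

# Pairwise construction as in the layer above: (N, N, 2 * out_features).
a_input = torch.cat([h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1)
a_input = a_input.view(N, -1, 2 * out_features)
e_pairwise = F.leaky_relu(torch.matmul(a_input, a).squeeze(2), alpha)

# Broadcast form: score_i + score_j, without materializing the pairs.
src = h @ a[:out_features]                # (N, 1), contribution of z_i
dst = h @ a[out_features:]                # (N, 1), contribution of z_j
e_broadcast = F.leaky_relu(src + dst.T, alpha)   # (N, N)
```

Both tensors hold the same $N \times N$ score matrix up to floating-point error, so the broadcast form could replace the `a_input` construction when memory is a concern.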