[GNN] Graph Attention Network


GAT

GAT (Graph Attention Network) is a neural network architecture that operates on graph-structured data, leveraging masked self-attention layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, the method (implicitly) assigns different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation or relying on knowing the graph structure up front. In this way, GAT simultaneously addresses several key challenges of spectral-based graph neural networks and makes the model readily applicable to both inductive and transductive problems.

From graph convolutional networks (GCNs), we learned that combining local graph structures with node-level features can achieve good performance in node classification tasks.

$$
\begin{aligned}
h_i^{(l+1)} &= \sigma\left(\sum_{j\in \mathcal{N}(i)} \frac{1}{c_{ij}}\, W^{(l)} h_j^{(l)}\right) \\
c_{ij} &= \sqrt{|\mathcal{N}(i)|}\,\sqrt{|\mathcal{N}(j)|}
\end{aligned}
$$
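As a point of reference, a minimal dense-matrix sketch of this symmetrically normalized aggregation might look as follows (the names gcn_layer, h, adj, and W are illustrative rather than from any particular library, self-loops are assumed so that every degree is non-zero, and ReLU stands in for σ):

import torch

def gcn_layer(h, adj, W):
    # h: (N, F) node features, adj: (N, N) adjacency with self-loops, W: (F, F_out) shared weights
    deg = adj.sum(dim=1)                        # |N(i)| for every node
    inv_sqrt_deg = deg.rsqrt().unsqueeze(1)     # 1 / sqrt(|N(i)|), shape (N, 1)
    # scale edge (i, j) by 1 / (sqrt(|N(i)|) * sqrt(|N(j)|)), i.e. divide by c_ij
    adj_norm = inv_sqrt_deg * adj * inv_sqrt_deg.t()
    return torch.relu(adj_norm @ h @ W)         # aggregate neighbors, transform, apply the non-linearity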

However, the way GCN aggregates messages is structure-dependent, which may compromise its generality.

GAT introduces an attention mechanism to replace the statically normalized convolution operation. The diagram below illustrates the key difference.

Attention layer

The input is a set of node features

$$\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}, \quad \vec{h}_i \in \mathbb{R}^{F}$$

and produces a new set of node features

$$\mathbf{h}' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}, \quad \vec{h}'_i \in \mathbb{R}^{F'}$$

The attention layer can be divided into 4 parts (a code sketch walking through all four steps follows the list):

  • Simple linear transformation

In order to obtain sufficient expressive power to transform the input features into higher-level features, at least one learnable linear transformation is required. To this end, as an initial step, a shared linear transformation parameterized by a weight matrix $W \in \mathbb{R}^{F' \times F}$ is applied to every node.

$$z_i^{(l)} = W^{(l)} h_i^{(l)}$$

  • Attention Coefficients

We then compute pairwise unnormalized attention scores between two neighbors.

$$e_{ij}^{(l)} = \text{LeakyReLU}\left(\vec{a}^{(l)T}\left(z_i^{(l)} \,\Vert\, z_j^{(l)}\right)\right)$$

$\Vert$ denotes concatenation. This form of attention is often referred to as additive attention, in contrast to the dot-product attention used in the Transformer model. This step allows every node to attend over every other node, disregarding all structural information; the graph structure is injected in the next step by restricting attention to the neighborhood $\mathcal{N}(i)$.

  • Softmax

To make the coefficients easily comparable across different nodes, we normalize them across all choices of $j$ using the softmax function:

$$\alpha_{ij}^{(l)} = \frac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k\in \mathcal{N}(i)} \exp\left(e_{ik}^{(l)}\right)}$$

  • Aggregation

This step is similar to GCN. Embeddings from neighbors are aggregated, scaled by the attention score.

$$h_i^{(l+1)} = \sigma\left(\sum_{j\in \mathcal{N}(i)} \alpha_{ij}^{(l)}\, z_j^{(l)}\right)$$
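To make the four steps concrete, here is a minimal dense sketch of a single attention head on a toy graph (the sizes, the example adjacency matrix, and the split of the attention vector a into a "source" half and a "destination" half are illustrative; the split is simply an equivalent way of evaluating $\vec{a}^T(z_i \Vert z_j)$ for all pairs at once):

import torch
import torch.nn.functional as F

N, F_in, F_out = 4, 8, 16                       # toy sizes
h = torch.randn(N, F_in)                        # input node features
adj = torch.eye(N)                              # toy adjacency matrix with self-loops
adj[0, 1] = adj[1, 0] = adj[1, 2] = adj[2, 1] = 1.0

W = torch.randn(F_in, F_out)                    # shared weight matrix
a = torch.randn(2 * F_out, 1)                   # attention vector a in R^{2F'}

# 1. Linear transformation: z_i = W h_i
z = h @ W                                       # (N, F_out)

# 2. Attention coefficients: e_ij = LeakyReLU(a^T [z_i || z_j]),
#    computed for all pairs at once by splitting a into a source and a destination half
a_src, a_dst = a[:F_out], a[F_out:]
e = F.leaky_relu(z @ a_src + (z @ a_dst).t(), negative_slope=0.2)   # (N, N)

# 3. Softmax over each neighborhood: mask out non-edges, then normalize row-wise
e = e.masked_fill(adj == 0, float('-inf'))
alpha = F.softmax(e, dim=1)                     # alpha[i, j] = alpha_ij

# 4. Aggregation: h_i' = sigma(sum_j alpha_ij z_j), with ELU standing in for sigma
h_new = F.elu(alpha @ z)                        # (N, F_out)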

Multi-head Attention

Illustration of multi-head attention (K = 3) by node 1 on its neighborhood. Different arrow styles and colors denote independent attention computations. The aggregated features from each head are concatenated or averaged to obtain $\vec{h}'_1$.

Similar to multi-channel convolution in convolutional networks, GAT uses multi-head attention to enrich the model's capacity and stabilize the learning process. Specifically, $K$ independent attention mechanisms each perform the aggregation above, and their outputs can be combined in one of two ways depending on where the layer is used (a sketch of both options follows below):

  • Concatenation

$$\text{Concatenation}: \quad h_i^{(l+1)} = \Big\Vert_{k=1}^{K}\, \sigma\left(\sum_{j\in \mathcal{N}(i)} \alpha_{ij}^{k}\, W^{k} h_j^{(l)}\right)$$

As can be seen from this setup, the final output $\mathbf{h}'$ will consist of $KF'$ features per node, rather than $F'$.

  • Averaging

$$\text{Average}: \quad h_i^{(l+1)} = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}(i)} \alpha_{ij}^{k}\, W^{k} h_j^{(l)}\right)$$

If multi-head attention is performed on the final (prediction) layer of the network, concatenation is no longer sensible; instead, averaging is used, and the final non-linearity (usually a softmax or logistic sigmoid for classification problems) is applied afterwards.

In short, intermediate layers use concatenation and the final layer uses averaging.
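To illustrate the two combination rules, here is a small sketch (the per-head outputs are random tensors purely for illustration; in a real model they would come from K independent attention computations as above):

import torch
import torch.nn.functional as F

K, N, F_out = 3, 4, 16
# Each entry stands for sum_j alpha_ij^k W^k h_j from head k; random tensors are used
# here purely to demonstrate the two combination rules.
head_outputs = [torch.randn(N, F_out) for _ in range(K)]

# Intermediate layer: apply sigma per head, then concatenate -> (N, K * F_out)
h_concat = torch.cat([F.elu(out) for out in head_outputs], dim=1)

# Final (prediction) layer: average the heads first, then apply the output non-linearity
h_avg = F.softmax(torch.stack(head_outputs).mean(dim=0), dim=1)      # (N, F_out)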

Implementation

The following is a PyTorch implementation; see GitHub for the TensorFlow version.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """
    Simple PyTorch Implementation of the Graph Attention layer.
    """

    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super(GATLayer, self).__init__()
        self.dropout       = dropout        # drop prob = 0.6
        self.in_features   = in_features    # number of input features per node (F)
        self.out_features  = out_features   # number of output features per node (F')
        self.alpha         = alpha          # negative slope of the LeakyReLU, alpha = 0.2
        self.concat        = concat         # concat = True for all layers except the output layer.

        # Xavier Initialization of Weights
        # Alternatively use weights_init to apply weights of choice 
        self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.W.data, gain=1.414)
        self.a = nn.Parameter(torch.zeros(size=(2*out_features, 1)))
        nn.init.xavier_uniform_(self.a.data, gain=1.414)
        
        # LeakyReLU
        self.leakyrelu = nn.LeakyReLU(self.alpha)

    def forward(self, input, adj):
        # Linear Transformation
        h = torch.mm(input, self.W)
        N = h.size()[0]

        # Attention Mechanism: build every concatenated pair [h_i || h_j] and score it with a
        a_input = torch.cat([h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1).view(N, -1, 2 * self.out_features)
        e       = self.leakyrelu(torch.matmul(a_input, self.a).squeeze(2))

        # Masked Attention: non-edges get a very large negative score so softmax assigns them ~0 weight
        zero_vec  = -9e15*torch.ones_like(e)
        attention = torch.where(adj > 0, e, zero_vec)
        
        attention = F.softmax(attention, dim=1)
        attention = F.dropout(attention, self.dropout, training=self.training)
        h_prime   = torch.matmul(attention, h)

        if self.concat:
            return F.elu(h_prime)
        else:
            return h_prime
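
A minimal usage sketch of the layer defined above on a toy graph (the sizes and the dense adjacency matrix are made up for illustration; a full model would stack such layers with multiple heads as described earlier):

# Toy usage: 4 nodes, 8 input features, one attention head with 16 output features
torch.manual_seed(0)
layer = GATLayer(in_features=8, out_features=16, dropout=0.6, alpha=0.2, concat=True)

x   = torch.randn(4, 8)                         # node feature matrix
adj = torch.eye(4)                              # dense adjacency matrix with self-loops
adj[0, 1] = adj[1, 0] = adj[2, 3] = adj[3, 2] = 1.0

out = layer(x, adj)                             # forward pass
print(out.shape)                                # torch.Size([4, 16])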


Original article: blog.csdn.net/qq_38904659/article/details/113464225