GAT algorithm principle introduction and source code analysis

0. Foreword (unrelated to the main text; feel free to skip)

A brief summary of the articles I have analyzed before:

A quick plug:

You can search for "Jenny's Algorithm Road" or "world4458" on WeChat and follow my official account; you can also read my Zhihu column PoorMemory-Machine Learning, where future articles will be published as well. The reading experience is better on CSDN, at: https://blog.csdn.net/eric_1993/category_9900024.html

1. Article information

Paper: Graph Attention Networks, Petar Veličković et al., ICLR 2018. arXiv: https://arxiv.org/abs/1710.10903

2. Core ideas

GAT (Graph Attention Networks) uses an attention mechanism to learn weights for a node's neighbors, and obtains the node's own representation as a weighted sum over those neighbors.

3. Interpretation of core viewpoints

The mechanism of GAT is illustrated in the figure below (Figure 1 of the GAT paper: the attention mechanism on the left, multi-head aggregation on the right):

Note that in the right-hand panel GAT uses multi-head attention: the three colored curves represent three different heads. Each head learns a different embedding for node $\vec{h}_{1}$, and these embeddings are then concatenated (or averaged) to produce $\vec{h}_{1}^{\prime}$.
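
Concretely, concatenating or averaging the per-head outputs looks like this (a minimal NumPy sketch with made-up shapes; in the repo this is done with tf.concat and tf.add_n in gat.py, shown in section 4.2):

import numpy as np

K, B, N, H = 3, 2, 5, 8                                    # heads, batch, nodes, per-head feature size (illustrative)
head_outs = [np.random.rand(B, N, H) for _ in range(K)]    # one embedding of every node per head
h_concat = np.concatenate(head_outs, axis=-1)              # shape (B, N, K*H), used for hidden layers
h_avg = np.mean(head_outs, axis=0)                         # shape (B, N, H), used for the final prediction layer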

Let's dive straight into the code analysis.

4. Source code analysis

The source code of GAT is located at: https://github.com/PetarV-/GAT

The GAT network is built by stacking multiple Graph Attention Layers, so we start with the implementation of a single Graph Attention Layer.

4.1 Graph Attention Layer

Definition of Graph Attention Layer:

Let the features of the $N$ input nodes be $\mathbf{h}=\left\{\vec{h}_{1}, \vec{h}_{2}, \ldots, \vec{h}_{N}\right\}$, $\vec{h}_{i} \in \mathbb{R}^{F}$. The attention mechanism is used to produce new node features $\mathbf{h}^{\prime}=\left\{\vec{h}_{1}^{\prime}, \vec{h}_{2}^{\prime}, \ldots, \vec{h}_{N}^{\prime}\right\}$, $\vec{h}_{i}^{\prime} \in \mathbb{R}^{F^{\prime}}$, as output.

Attention coefficients are generated as follows:

$$\alpha_{i j}=\frac{\exp \left(\operatorname{LeakyReLU}\left(\overrightarrow{\mathbf{a}}^{T}\left[\mathbf{W} \vec{h}_{i} \| \mathbf{W} \vec{h}_{j}\right]\right)\right)}{\sum_{k \in \mathcal{N}_{i}} \exp \left(\operatorname{LeakyReLU}\left(\overrightarrow{\mathbf{a}}^{T}\left[\mathbf{W} \vec{h}_{i} \| \mathbf{W} \vec{h}_{k}\right]\right)\right)}$$

Here $\overrightarrow{\mathbf{a}} \in \mathbb{R}^{2 F^{\prime}}$, and $\|$ denotes the concatenation operation.
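
For completeness, the aggregation step that uses these coefficients (taken from the GAT paper; $\sigma$ is a nonlinearity such as ELU) is:

$$\vec{h}_{i}^{\prime}=\sigma\Big(\sum_{j \in \mathcal{N}_{i}} \alpha_{i j} \mathbf{W} \vec{h}_{j}\Big)$$

and, with $K$ attention heads whose outputs are concatenated:

$$\vec{h}_{i}^{\prime}=\Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_{i}} \alpha_{i j}^{k} \mathbf{W}^{k} \vec{h}_{j}\Big)$$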

The code implementation is located at: https://github.com/PetarV-/GAT/blob/master/utils/layers.py . Note that the author's implementation is quite concise and clever: it does not translate the formula above literally, but applies a small transformation first.

Since $\overrightarrow{\mathbf{a}} \in \mathbb{R}^{2 F^{\prime}}$, write $\overrightarrow{\mathbf{a}} = [\overrightarrow{\mathbf{a}}_{1}, \overrightarrow{\mathbf{a}}_{2}]$ with $\overrightarrow{\mathbf{a}}_{1} \in \mathbb{R}^{F^{\prime}}$ and $\overrightarrow{\mathbf{a}}_{2} \in \mathbb{R}^{F^{\prime}}$. Then $\overrightarrow{\mathbf{a}}^{T}\left[\mathbf{W} \vec{h}_{i} \| \mathbf{W} \vec{h}_{j}\right]$ is equivalent to $\overrightarrow{\mathbf{a}}_{1}^{T} \mathbf{W} \vec{h}_{i}+\overrightarrow{\mathbf{a}}_{2}^{T} \mathbf{W} \vec{h}_{j}$, and it is this equivalent form that the code below implements.
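
This equivalence is easy to check numerically; here is a tiny NumPy sanity check (my own illustration, not repo code):

import numpy as np

F_prime = 4
a = np.random.rand(2 * F_prime)            # the attention vector a
a1, a2 = a[:F_prime], a[F_prime:]          # split into a1 and a2
Wh_i = np.random.rand(F_prime)             # W h_i, the transformed feature of node i
Wh_j = np.random.rand(F_prime)             # W h_j
lhs = a @ np.concatenate([Wh_i, Wh_j])     # a^T [W h_i || W h_j]
rhs = a1 @ Wh_i + a2 @ Wh_j                # a1^T W h_i + a2^T W h_j
assert np.isclose(lhs, rhs)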

import tensorflow as tf

def attn_head(seq, out_sz, bias_mat, activation, in_drop=0.0, coef_drop=0.0, residual=False):
    """
    Arguments:
    + seq: input node features of shape [B, N, E], where N is the number of nodes
           and E is the input feature size
    + out_sz: output feature size per node, denoted H below
    + bias_mat: the mask used during attention, so that attention is computed only over
                a node's neighbors rather than over every node in the graph.
                Its shape is [B, N, N]; how it is generated is explained below.
    + the remaining arguments are omitted here.

    attn_head maps an input of shape [B, N, E] to an output of shape [B, N, out_sz].
    """
    with tf.name_scope('my_attn'):
        if in_drop != 0.0:
            seq = tf.nn.dropout(seq, 1.0 - in_drop)

        ## The positional arguments of conv1d are: inputs, filters, kernel_size.
        ## seq has shape [B, N, E]; after conv1d we get seq_fts of shape [B, N, H]
        ## (H = out_sz). This is the linear transform W * h in the formula,
        ## so seq_fts holds the transformed node features.
        seq_fts = tf.layers.conv1d(seq, out_sz, 1, use_bias=False)

        ## f_1 and f_2 implement the trick described above: the vector a is split into a1 and a2,
        ## each of which is dotted with the transformed features.
        ## f_1 + tf.transpose(f_2, [0, 2, 1]) then gives each node's score against every other node;
        ## logits has shape [B, N, N].
        f_1 = tf.layers.conv1d(seq_fts, 1, 1)  ## [B, N, 1]
        f_2 = tf.layers.conv1d(seq_fts, 1, 1)  ## [B, N, 1]
        logits = f_1 + tf.transpose(f_2, [0, 2, 1])  ## [B, N, N]

        ## Before the softmax, bias_mat (shape [B, N, N]) is added to logits; it acts as a mask.
        ## For each node, the entries of its neighbors in bias_mat are 0, while the entries of
        ## non-neighbors are a large negative number (-1e9 in the code), so after the softmax
        ## the weights of non-neighbors are approximately 0.
        coefs = tf.nn.softmax(tf.nn.leaky_relu(logits) + bias_mat)

        if coef_drop != 0.0:
            coefs = tf.nn.dropout(coefs, 1.0 - coef_drop)
        if in_drop != 0.0:
            seq_fts = tf.nn.dropout(seq_fts, 1.0 - in_drop)

        ## coefs has shape [B, N, N]: each node's attention coefficients over its neighbors.
        ## seq_fts has shape [B, N, H]: each node's transformed features.
        ## Their product ret has shape [B, N, H].
        vals = tf.matmul(coefs, seq_fts)
        ret = tf.contrib.layers.bias_add(vals)

        # residual connection
        if residual:
            if seq.shape[-1] != ret.shape[-1]:
                ret = ret + tf.layers.conv1d(seq, ret.shape[-1], 1)
            else:
                ret = ret + seq

        return activation(ret)  # activation
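
Before moving on, it may help to see how f_1 + tf.transpose(f_2, [0, 2, 1]) yields all pairwise scores at once via broadcasting; here is a small NumPy sketch of the same operation (names and shapes are my own illustration):

import numpy as np

B, N = 1, 4
f_1 = np.random.rand(B, N, 1)                   # plays the role of a1^T W h_i for every node i
f_2 = np.random.rand(B, N, 1)                   # plays the role of a2^T W h_j for every node j
logits = f_1 + np.transpose(f_2, (0, 2, 1))     # shape (B, N, N); entry (i, j) = f_1[i] + f_2[j]
assert np.isclose(logits[0, 1, 2], f_1[0, 1, 0] + f_2[0, 2, 0])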

Two more points deserve a closer look: how conv1d is used and how bias_mat is generated. First, conv1d:
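
A conv1d with kernel_size = 1 applies the same linear transform to every node independently, which is exactly the W h_i mapping in the formula. A minimal NumPy sketch of this equivalence (the shapes and weights here are my own illustration, not repo code):

import numpy as np

B, N, E, H = 2, 5, 16, 8
seq = np.random.rand(B, N, E)     # input node features
W = np.random.rand(E, H)          # the kernel of a conv1d with kernel_size=1 (hypothetical weights)
out = seq @ W                     # shape (B, N, H): every node is transformed independently,
                                  # which is what tf.layers.conv1d(seq, H, 1, use_bias=False) computes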

Next, let's look at how bias_mat is generated. The code is located at: https://github.com/PetarV-/GAT/blob/master/utils/process.py and is implemented as follows:

import numpy as np

def adj_to_bias(adj, sizes, nhood=1):
    """
    Arguments:
    + adj: adjacency matrices of shape [B, N, N]
    + sizes: the number of nodes in each graph, [N]
    + nhood: how many hops to include; nhood=1 keeps only direct neighbors,
             nhood=2 also brings in two-hop neighbors
    """
    nb_graphs = adj.shape[0]
    mt = np.empty(adj.shape)
    for g in range(nb_graphs):
        mt[g] = np.eye(adj.shape[1])
        ## Account for multi-hop neighbors: with nhood=2, two-hop neighbors are also included,
        ## so later, in the Graph Attention Layer, they will take part in the
        ## attention-coefficient computation as well.
        for _ in range(nhood):
            mt[g] = np.matmul(mt[g], (adj[g] + np.eye(adj.shape[1])))

        ## If two nodes are connected, set the corresponding entry to 1.
        for i in range(sizes[g]):
            for j in range(sizes[g]):
                if mt[g][i][j] > 0.0:
                    mt[g][i][j] = 1.0

    ## Finally return the bias_mat matrix used in attn_head above:
    ## positions without a link get a large negative value (-1e9), positions with a link get 0.
    ## This acts as the mask used when computing the attention coefficients.
    return -1e9 * (1.0 - mt)
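
A quick usage sketch of adj_to_bias on a toy graph (my own example; it calls the function defined above):

import numpy as np

# one graph with 3 nodes: nodes 0 and 1 are connected, node 2 is isolated
adj = np.array([[[0., 1., 0.],
                 [1., 0., 0.],
                 [0., 0., 0.]]])                 # shape [1, 3, 3]
bias_mat = adj_to_bias(adj, sizes=[3], nhood=1)
# bias_mat[0] is 0 where attention is allowed (self + 1-hop neighbors) and -1e9 elsewhere:
# [[ 0.e+00,  0.e+00, -1.e+09],
#  [ 0.e+00,  0.e+00, -1.e+09],
#  [-1.e+09, -1.e+09,  0.e+00]]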

4.2 GAT network

The GAT model is defined at: https://github.com/PetarV-/GAT/blob/master/models/gat.py and is implemented as follows:

import tensorflow as tf

from utils import layers
from models.base_gattn import BaseGAttN

class GAT(BaseGAttN):
    def inference(inputs, nb_classes, nb_nodes, training, attn_drop, ffd_drop,
            bias_mat, hid_units, n_heads, activation=tf.nn.elu, residual=False):
        """
        inputs: node features of shape [B, N, E]
        n_heads: e.g. [8, 1]; n_heads[0]=8 means 8 heads are used for the hidden layer(s),
                 and n_heads[-1]=1 means 1 head is used for the output layer
        """
        attns = []
        for _ in range(n_heads[0]):
            ## each attn_head output has shape [B, N, H]
            attns.append(layers.attn_head(inputs, bias_mat=bias_mat,
                out_sz=hid_units[0], activation=activation,
                in_drop=ffd_drop, coef_drop=attn_drop, residual=False))
        h_1 = tf.concat(attns, axis=-1)

        ## repeat the same procedure for the remaining hidden layers
        for i in range(1, len(hid_units)):
            h_old = h_1
            attns = []
            for _ in range(n_heads[i]):
                attns.append(layers.attn_head(h_1, bias_mat=bias_mat,
                    out_sz=hid_units[i], activation=activation,
                    in_drop=ffd_drop, coef_drop=attn_drop, residual=residual))
            h_1 = tf.concat(attns, axis=-1)

        ## output layer: average the n_heads[-1] head outputs
        out = []
        for i in range(n_heads[-1]):
            out.append(layers.attn_head(h_1, bias_mat=bias_mat,
                out_sz=nb_classes, activation=lambda x: x,
                in_drop=ffd_drop, coef_drop=attn_drop, residual=False))
        logits = tf.add_n(out) / n_heads[-1]

        return logits

The main idea here is simply stacking Graph Attention Layers, so I won't go into further detail.
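
For reference, a rough sketch of how this inference function might be wired up for a Cora-style graph (the placeholder shapes and dropout rates are illustrative; hid_units=[8] and n_heads=[8, 1] mirror the settings used in the repo's Cora script):

import tensorflow as tf

B, N, E, nb_classes = 1, 2708, 1433, 7                 # Cora-like sizes, for illustration
ftr_in = tf.placeholder(tf.float32, [B, N, E])         # node features
bias_in = tf.placeholder(tf.float32, [B, N, N])        # bias_mat produced by adj_to_bias
is_train = tf.placeholder(tf.bool)

logits = GAT.inference(ftr_in, nb_classes, N, is_train,
                       attn_drop=0.6, ffd_drop=0.6,
                       bias_mat=bias_in,
                       hid_units=[8], n_heads=[8, 1],
                       activation=tf.nn.elu, residual=False)   # logits: [B, N, nb_classes]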

5. Summary

No summary this time, just some tangled thoughts left in my mind.

Original post: https://blog.csdn.net/Eric_1993/article/details/120899535