Natural Language Processing NLP - Graph Neural Network and Graph Attention Model (GNN, GCN, GAT)

Table of contents

Series Article Directory

1. Graph neural network

1. Graph and graph embedding

2. GNN motivation

2.1 Defects of CNN and unstructured data

2.2 Drawbacks of graph embedding

3. Detailed explanation of GNN

3.1 Introduction to GNN

3.2 GNN model

3.3 GNN framework

3.4 GNN limitations and optimization

2. Graph convolutional neural network

1. Convolution

2. Detailed explanation of GCN

2.1 GCN motivation

2.2 Introduction to GCNs

2.3 GCN Thought and Model

2.4 GCN core formula analysis

2.5 GCN Advantages and Limitations

3. Graph attention network

1. Attention mechanism

2. Introduction to GAT

3. Core formula analysis

3.1 Input and output of graph attention layer

3.2 Feature extraction and attention mechanism

3.3 Output features 

3.4 multi-head attention 

4. Paper results and analysis

4.1 Datasets and Comparison Methods

4.2 Experimental model setup

4.3 Experimental results and analysis

4.4 Supplement to the conclusion of the paper

4. GAT reproduction

1. Tensorflow GAT (code analysis)

1.1 Code structure and parameter setting

1.2 Data loading and feature preprocessing

1.3 GAT model

1.4 GAT network

1.5 Training and loss function

2. Pytorch GAT (code analysis)

3. Code practice and result analysis

5. GAT Supplement

1. Promotion of GAT

2. GAT optimization

2.1 Introduction to the paper

2.2 Motivation

2.3 Methodology 

2.4 Optimization effect

6. Experimental conclusion and reference

1. Experimental conclusion

2. References


Series Article Directory

This series of blogs focuses on the concept, principle and code practice of natural language processing NLP (if you have any questions, please discuss and point out in the comment area, or contact me directly by private message).

Chapter 1 Natural Language Processing NLP - GSDMM for short text clustering (@李忆如's blog, CSDN)

Chapter 2 Natural Language Processing NLP - Graph Neural Network and Graph Attention Model (GNN, GCN, GAT)


Synopsis

    This blog introduces the motivation and model of the graph neural network (GNN), then the detailed model and formula derivation of the graph convolutional network (GCN). It focuses on the graph attention network (GAT): the derivation of its objective function, the analysis of its model, and a reproduction of the GAT paper's experiments with different frameworks, comparing the reproduced conclusions with the paper's results. It closes with some supplements on the generalization and optimization of GAT (Python code and datasets included).


1. Graph neural network

    Since GAT is essentially a graph neural network, this part introduces the graph neural network first.

    Graph Neural Network (GNN) is a general term for algorithms that use neural networks to learn from graph-structured data, extract and explore the features and patterns in it, and meet the needs of graph learning tasks such as clustering, classification, prediction, segmentation, and generation. This chapter introduces the related concepts, motivations, models, and applications of graph neural networks.

1. Graph and graph embedding

    A graph is a data structure. A common graph consists of nodes and edges. In machine learning (deep learning), some graphs also carry attributes, weights, etc., and many kinds of real data can be represented by graphs (stored as matrices). An example is shown in Figure 1. GNN is the branch of deep learning that works on graph structures.

Figure 1 Sample graphical representation of data

    Graph embedding is the process of mapping graph data (usually a high-dimensional dense matrix) into low-dimensional dense vectors, which addresses the difficulty of feeding graph data efficiently into machine learning algorithms. A sample is shown in Figure 2. The more information the embedding preserves, the better downstream tasks will perform. One principle is commonly agreed on during embedding: nodes that are connected in the graph should remain close to each other in the vector space.

Figure 2 Graph embedding example

    Graph is an easy-to-understand representation, and the advantages of graph embedding are summarized in Table 1 :

Table 1 Advantages of graph embedding

1. Direct machine learning on graph has certain limitations

2. Graph embedding can compress data

3. Vector calculation is simpler and faster than directly operating on the graph.

2. GNN motivation

    This section introduces why GNN is needed and the shortcomings of related methods.

2.1 Defects of CNN and unstructured data

    The core features of CNN are local connection, weight sharing, and multi-layer stacking. These features also apply well to graph problems: the graph structure is the most typical locally connected structure, shared weights reduce the amount of computation, and a multi-layer structure is the key to dealing with hierarchical patterns.

    However, CNN can only process Euclidean data, such as two-dimensional pictures and one-dimensional text data, and these data are only special cases of the graph structure, as shown in Figure 3. CNNs are hard to use (do not work well) for general graph structures .

Figure 3 Samples of different spatial maps

    In detail, not everything in the real world can be represented as a sequence or a grid. Social networks, knowledge graphs, and complex file systems are examples. In other words, many things are unstructured, as shown in Figure 4:

Figure 4 Unstructured data sample

    Analysis : As shown in Figure 4, compared with simple text and images, this network type of unstructured data is very complex, and the difficulties in processing it are summarized in Table 2:

Table 2 Difficulties in unstructured data processing

1. The size of the graph is arbitrary, the topology is complex, and there is no spatial locality like an image

2. There is no fixed node order in the graph, or there is no reference node

3. Graphs are often dynamic and contain multimodal features

    So how do we model this type of data? Can deep learning be extended to model this type of data? These problems have prompted the emergence and development of graph neural networks.

2.2 Pitfalls of graph embedding

    Graph embedding methods can be roughly divided into three categories: matrix factorization, random walks, and deep learning methods. Common models include DeepWalk, Node2Vec, etc. However, these methods have two serious shortcomings. First, the weights used to encode nodes are not shared, so the number of weights grows linearly with the number of nodes. Second, direct embedding methods lack generalization capability, which means that they can neither handle dynamic graphs nor generalize to new graphs. The comparison between GNN and graph embedding is shown in Figure 5:

Figure 5 GNN and graph embedding comparison

3. Detailed explanation of GNN

3.1 Introduction to GNN

    A graph neural network is a special graph representation method. It uses a neural network to encode graph nodes and embeds the graph structure into vectors and matrices that a computer can process. Compared with traditional neural networks, it improves on the handling of nodes, edges, and reasoning, summarized as follows:

(1) Nodes

① Both CNN and RNN require the features of nodes to be arranged in a certain order.

② A graph, however, has no natural node order. GNN therefore propagates on each node separately and ignores the input order of the nodes. In other words, the output of a GNN is invariant to the order in which the nodes are fed in.

(2) Edges (the edges of the graph structure represent the dependencies between nodes)

① A traditional neural network does not express this dependency explicitly. It expresses the relationship between nodes only indirectly through the node features, so the dependency information is treated as just another node feature.

② GNN can propagate along the graph structure instead of using it as part of the node features, and updates the hidden state of each node through a weighted sum over its neighbor nodes.

(3) Reasoning

① Reasoning is a very important research topic for advanced artificial intelligence. The reasoning process in the human brain is almost entirely based on graphs extracted from daily experience. Standard neural networks have shown the ability to generate synthetic images and documents by learning data distributions, but they still cannot learn reasoning graphs from large experimental data. GNN, however, explores generating graphs from unstructured data such as scene pictures and story documents, which makes it a potentially powerful neural model for more advanced AI.
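    Points (1) and (2) can be illustrated with a minimal NumPy sketch. It assumes a plain sum over neighbors as the propagation step (an illustrative choice, not the update rule of any specific model in this article), and the names `propagate`, `A`, `X`, and `perm` are hypothetical. Relabeling the nodes only reorders the rows of the output, so every node receives the same representation whatever the input order.

```python
import numpy as np

def propagate(A, X):
    """One propagation step: each node sums the features of its neighbors.

    A: (N, N) adjacency matrix, X: (N, d) node features.
    The plain sum is an assumption made for illustration.
    """
    return A @ X

# Toy undirected graph with 4 nodes (a path 0-1-2-3) and 2 features per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)

# Relabel the nodes with an arbitrary permutation.
perm = np.array([2, 0, 3, 1])
A_perm = A[perm][:, perm]
X_perm = X[perm]

H = propagate(A, X)
H_perm = propagate(A_perm, X_perm)

# Each node keeps its representation; only the row order follows the relabeling.
same_up_to_order = np.allclose(H[perm], H_perm)
print(same_up_to_order)
```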

3.2 GNN model

    Graph neural network models are composed differently depending on the method, so they are only briefly described here. In the most basic neural network structure, the fully connected layer (MLP), the feature matrix is multiplied by the weight matrix. The graph neural network adds one more factor, the adjacency matrix. The calculation form is very simple: three matrices are multiplied and a nonlinear transformation is applied, as shown in Figure 6:

Figure 6 GNN calculation form
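    The calculation form described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: ReLU is chosen as the nonlinear transformation, the adjacency matrix is left unnormalized, and the names `gnn_layer`, `A`, `X`, and `W` are hypothetical. Setting the adjacency matrix to the identity recovers the plain fully connected layer, which shows that the adjacency matrix is the only difference.

```python
import numpy as np

def gnn_layer(A, X, W):
    """Multiply adjacency, feature and weight matrices, then apply ReLU."""
    return np.maximum(A @ X @ W, 0.0)

rng = np.random.default_rng(0)
N, d, d_out = 5, 3, 2

# Random undirected graph with self-loops (illustrative data only).
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)

X = rng.normal(size=(N, d))      # node feature matrix
W = rng.normal(size=(d, d_out))  # weight matrix

H_mlp = np.maximum(X @ W, 0.0)   # fully connected layer: features x weights
H_gnn = gnn_layer(A, X, W)       # graph layer: adjacency x features x weights
print(H_mlp.shape, H_gnn.shape)
```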

    A relatively common application pattern of the graph neural network is therefore the one shown in Figure 7 below: a graph is given as input, passes through operations such as multiple layers of graph convolution and activation functions, and finally yields a representation of each node, which can then be used for node classification, link prediction, the generation of graphs and subgraphs, etc.

Figure 7 Common graph neural network application patterns

3.3 GNN framework

Figure 8 GNN flow chart

    For the learning of the parameters of f and g in GNN, the target information is often used for supervised learning , and the loss function can be defined as formula 6:

Formula 6 GNN loss function

    Here, p is the number of supervised nodes, and t_i and o_i are the true value and the predicted value of node i respectively. The loss function is optimized with a gradient descent strategy, and the steps are summarized in Table 3:

Table 3  Loss function learning process

1. State iterative update: the state h_v^t is updated for T rounds according to formula 1 until it is close to the fixed-point solution of formula 3

2. Calculate the gradient of weight W: the gradient of weight W is calculated from loss

3. Update weight: update the weight W according to the gradient in 2
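    Step 1 of Table 3 can be sketched as follows. Everything concrete here is an assumption made for illustration: the transition `H <- tanh(A_hat @ H @ W + X @ U)` is only a stand-in for the transition function f, and the names `iterate_states`, `A_hat`, `W`, and `U` are hypothetical. The weights are rescaled so that the update is a contraction, which is what makes the iteration approach a fixed point. Steps 2 and 3 (gradient computation and weight update) are not shown.

```python
import numpy as np

def iterate_states(A_hat, X, W, U, T=50):
    """Step 1 of Table 3: update the node states for T rounds.

    The transition below is an assumed stand-in for f. It is a contraction
    here because the spectral norms of A_hat and W are 1 and 0.5.
    """
    H = np.zeros((X.shape[0], W.shape[0]))
    for _ in range(T):
        H = np.tanh(A_hat @ H @ W + X @ U)
    return H

rng = np.random.default_rng(0)
N, d, s = 6, 3, 4                    # nodes, feature size, state size

# Ring graph with self-loops: every node has degree 3, so A / 3 is
# symmetric with spectral norm 1.
A = np.eye(N)
for i in range(N):
    A[i, (i + 1) % N] = 1.0
    A[i, (i - 1) % N] = 1.0
A_hat = A / 3.0

W = rng.normal(size=(s, s))
W = 0.5 * W / np.linalg.norm(W, 2)   # rescale to spectral norm 0.5
U = rng.normal(size=(d, s))
X = rng.normal(size=(N, d))

H_star = iterate_states(A_hat, X, W, U)

# Applying the transition once more barely changes the states.
residual = np.abs(np.tanh(A_hat @ H_star @ W + X @ U) - H_star).max()
print(residual)
```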

3.4 GNN limitations and optimization

    Combined with GNN principles, models, and framework analysis, the limitations of classic GNNs are summarized as shown in Table 4:

Table 4 Classical GNN limitations

1. Updating the hidden states of nodes by iterating toward a fixed point is not efficient.

2. The original GNN uses the same parameters in every iteration, which makes it hard for the model to learn deeper feature representations. Also, the update of node hidden states is a sequential process.

3. Some informative features on the edges cannot be effectively taken into account. In addition, how to learn the hidden states of edges is also an important issue.

4. If we need to learn vector representations of nodes rather than of the whole graph, using the fixed point is not suitable, because the distribution of representations at the fixed point is very smooth in value and carries little information for distinguishing individual nodes.

    According to the description in Table 4, the classic GNN still has limitations, so there have been many algorithms based on graph neural networks, which mainly optimize graph neural networks from three aspects: graph type, propagation type and training method. A summary of GNN variants is shown in Figure 9:

Figure 9 Summary of three types of optimization algorithms for GNN

2. Graph convolutional neural network

    Graph Convolutional Neural Network (GCN) is a "pioneering work" among graph neural networks: for the first time, the convolution operation from image processing was applied in a simple way to graph-structured data. Since GAT makes several major optimizations on top of GCN and their implementation processes are similar, this part introduces the graph convolutional neural network first.

    Tips: The GCN derivation involves a lot of mathematical theory; this part only introduces the core of it.

1. Convolution

    In functional analysis, convolution is a mathematical operation that produces a third function from two functions f and g. In essence it is a special integral transform: it gives the integral, over the overlap, of the product of f and a flipped and shifted copy of g. For image (data) processing, convolution essentially filters the signal. The idea comes from images and was later introduced to graphs. However, while images have a fixed structure, graphs are much more complex. Taking CNN as an example, the transfer of the convolution idea from images to graphs is shown in Figure 10:

Figure 10 Image-to-Graph Convolution Thought Example

2. Detailed explanation of GCN

2.1 GCN motivation

    The motivation of GCN is similar to that of GNN. Neural network algorithms of the CNN and RNN families achieve good results on Euclidean data in tasks such as image recognition and natural language processing. In real life, however, there are a great many irregular data structures, typically graph structures (also called topological structures), such as social networks, chemical molecular structures, and knowledge graphs, as shown in Figure 4.

    The structure of a graph is generally very irregular and can be regarded as data with infinite dimensions, so it has no translation invariance. The surrounding structure of each node may be unique, and data of this kind renders traditional CNN and RNN ineffective. Many methods have emerged to process this type of data, and GCN is one of the classic ones.

2.2 Introduction to GCNs

    Similar to CNN, GCN is a convolutional neural network that works directly on graphs and exploits their structural information. GCN provides a cleverly designed way to extract features from graph data, and these features can be used for node classification, graph classification, and edge prediction; an embedded representation of the graph is obtained along the way. In a typical setting only a small part of the nodes have labels (semi-supervised learning). A sample of node classification on a graph is shown in Figure 11:

Figure 11 Example of graph node classification

2.3 GCN Thought and Model

    The classic GCN is a semi-supervised graph convolutional neural network . The basic idea is that for each node, we obtain its feature information from all its neighbor nodes (including itself). Suppose we use the average() function. We will do the same for all nodes. Finally, we feed these calculated averages into the neural network.

    As shown in Figure 12, we have a simple instance of a citation network. Each node represents a research paper, while the edges represent citations. There is a preprocessing step here: instead of using the original papers as features, we convert the papers into vectors (by using NLP embeddings such as TF-IDF).

Figure 12 Simple example of GCN used in citation network

    Main idea of GCN: take the green node as an example. First, we take the average of the features of all its neighbors, including itself. Then we pass the average through the neural network. Note that in this GCN example we only use one fully connected layer. Here we get a 2D vector as output (the fully connected layer has 2 nodes), as shown in Figure 13:

Figure 13 GCN design example

    In practice, we can use more complex aggregation functions than the average function. We can also stack more layers together to obtain a deeper GCN. The output of each layer is considered as the input of the next layer.
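    The averaging idea described above can be sketched as follows. This is a minimal illustration assuming a plain mean over each node's neighbors and itself, followed by one fully connected layer with ReLU; the function names and the toy data are hypothetical.

```python
import numpy as np

def mean_aggregate(A, X):
    """Average the features of every node's neighbors, including itself."""
    A_self = A + np.eye(A.shape[0])          # add self-loops
    deg = A_self.sum(axis=1, keepdims=True)  # neighbor count incl. self
    return (A_self @ X) / deg

def gcn_step(A, X, W):
    """Feed the averaged features through one fully connected layer."""
    return np.maximum(mean_aggregate(A, X) @ W, 0.0)

# Toy citation graph: paper 0 is linked to papers 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

M = mean_aggregate(A, X)
print(M)  # row 0 is the mean of all three feature vectors

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))  # 2 output units, as in the example above
out = gcn_step(A, X, W)
print(out.shape)
```

    Stacking more such steps, with the output of one used as the input of the next, gives the deeper GCN mentioned above.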

    Therefore, according to the introduction and ideas of GCN, the model of GCN is shown in Figure 14:

Figure 14 GCN model

    Analysis : As shown in Figure 14, although GCN has performed complex mathematical derivations and calculations in the hidden layer, the input and output are simple and regular . For GCN, a graph with C input channels is used as input, and the output of F output channels is obtained through the hidden layers in the middle.

2.4 GCN core formula analysis

(1) Layer-to-layer propagation

    Suppose we have a batch of graph data with N nodes, each of which has its own features. The features of these nodes form an N×d matrix X, and the relationships between the nodes form an N×N matrix A, also known as the adjacency matrix. X and A are the inputs to our model.

(2) Classification

    After solving the above core formula, the final output of GCN is Z = H^(l), the matrix of node feature vectors obtained after l layers of feature enhancement. That is to say, after several GCN layers the node features change from X (N×d) to Z (N×C), where C is the number of categories to be classified.
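    The path from X to Z can be sketched as a two-layer forward pass. The sketch assumes the widely used propagation rule of Kipf & Welling, H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l)) with Ã = A + I, ReLU in the hidden layer, and a row-wise softmax at the output. The function names, layer sizes, and random data are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (assumed GCN rule)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(Z):
    """Row-wise softmax, shifted for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: X (N x d) -> hidden (N x h) -> Z (N x C)."""
    A_hat = normalize_adjacency(A)
    H1 = np.maximum(A_hat @ X @ W0, 0.0)
    return softmax(A_hat @ H1 @ W1)

rng = np.random.default_rng(0)
N, d, h, C = 6, 4, 8, 3

# Random undirected graph without self-loops (illustrative data only).
A = np.triu((rng.random((N, N)) < 0.5).astype(float), k=1)
A = A + A.T

X = rng.normal(size=(N, d))
W0 = rng.normal(size=(d, h))
W1 = rng.normal(size=(h, C))

Z = gcn_forward(A, X, W0, W1)
pred = Z.argmax(axis=1)   # predicted class of every node
print(Z.shape, pred)
```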

(3 ) Training and parameter update 

        # ---------------------------------------------------------------------
        # TensorFlow 1.x API.
        # tf.train.AdamOptimizer uses first- and second-moment estimates of the
        # gradient to adapt the learning rate of each parameter. After bias
        # correction, the step size of every iteration stays within a bounded
        # range, which keeps the parameter updates relatively stable.
        # self.lr is the learning rate configured beforehand.
        self.optimizer = tf.train.AdamOptimizer(self.lr)

        # According to the TensorFlow source, optimizer.minimize() wraps two
        # steps: optimizer.compute_gradients() computes the gradients, and
        # optimizer.apply_gradients() uses them to update the variables.
        #
        # compute_gradients(loss, var_list):
        #   - loss: the Tensor to be minimized; here it is self.loss + self.l2.
        #   Returns a list of tuples, i.e. [(gradient, variable), ...].
        #   Example: x = 50, w = 10, y = x * w gives [(50, 10), (10, 50)].
        #   The first tuple is (dy/dw, w); the second tuple is (dy/dx, x).
        #
        # self.loss is the loss Tensor computed by
        # tf.losses.softmax_cross_entropy; self.l2 is a regularization term.
        gradients = self.optimizer.compute_gradients(self.loss + self.l2)

        # tf.clip_by_value clamps every element of grad to the range [-5, 5]:
        # values below -5 become -5 and values above 5 become 5.
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var)
                            for grad, var in gradients if grad is not None]

        # apply_gradients() takes the (gradient, variable) pairs returned by
        # compute_gradients() and updates the variables accordingly.
        self.train_op = self.optimizer.apply_gradients(capped_gradients)

        # Why is minimize() split into two steps? In some cases the gradients
        # need to be adjusted first, for example to guard against vanishing or
        # exploding gradients so that the program does not end up with NaN
        # values, or to multiply the gradients by a weight, or for other
        # miscellaneous reasons.
        #
        # Running self.train_op repeatedly trains the model.
        # ---------------------------------------------------------------------
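    The compute, clip, and apply sequence can be traced with the numbers from the comments above (x = 50, w = 10, y = x * w). The sketch below is a NumPy stand-in, not TensorFlow code: it assumes a plain gradient descent update instead of Adam, and the dictionary `params` and the learning rate `lr = 0.1` are hypothetical.

```python
import numpy as np

# Variables of the toy problem y = x * w (values from the comments above).
params = {"x": 50.0, "w": 10.0}
lr = 0.1  # hypothetical learning rate

# Step 1, "compute_gradients": (gradient, variable name) pairs.
# dy/dw = x and dy/dx = w.
gradients = [(params["x"], "w"), (params["w"], "x")]

# Clamp every gradient to [-5, 5], mirroring tf.clip_by_value(grad, -5., 5.).
capped = [(float(np.clip(g, -5.0, 5.0)), name) for g, name in gradients]

# Step 2, "apply_gradients": plain gradient descent (an assumption; the
# original code uses Adam).
for g, name in capped:
    params[name] -= lr * g

print(capped)   # both gradients exceed 5, so both are clipped to 5.0
print(params)
```

    Without clipping, the same step would move w by 5.0 instead of 0.5, which shows why the two stages are kept separate.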

2.5 GCN Advantages and Limitations

    Combining the analysis of the GCN principle, model, and framework, the advantages and limitations of the classic GCN are summarized in Table 5:

Table 5 Advantages and limitations of GCN

    Many later algorithms were proposed to address the shortcomings of GCN, and GAT is one of them.

3. Graph attention network

1. Attention mechanism

    GAT is a graph attention network, and the key mechanism is the Attention mechanism . Therefore, we will introduce and analyze the Attention mechanism first.

    As its name suggests, the main function of the Attention mechanism (注意力机制) is to let the neural network put its "attention" on a part of the input, that is, to distinguish how much different parts of the input influence the output. Here, we look at the Attention mechanism from the perspective of enhancing the semantic representation of characters/words.

    We know that the meaning expressed by a character/word in a text usually depends on its context. For example, looking only at the character "鹄", we may find it unfamiliar (we may not even remember its pronunciation), but after seeing its context "鸿鹄之志" we recognize it immediately. Therefore, the contextual information of a character/word helps to enhance its semantic representation. At the same time, different characters/words in the context often contribute differently to this enhancement. In the example above, the character "鸿" (hong) helps the most in understanding the character "鹄" (hu), while the character "之" (zhi) helps relatively little. The Attention mechanism can be used to weigh the contextual information differently when enhancing the semantic representation of the target character.

    The Attention mechanism mainly involves three concepts: Query, Key, and Value. In the scenario above, the target word and its context words each have their own original Value. The Attention mechanism uses the target word as the Query and each word in its context as a Key, takes the similarity between the Query and each Key as a weight, and merges the Values of the context words into the original Value of the target word.

    As shown in Figure 16, the Attention mechanism takes the semantic vector representations of the target word and of each context word as input. It first obtains the Query vector of the target word, the Key vector of each context word, and the original Value representations of the target word and of each context word through linear transformations. It then computes the similarity between the Query vector and each Key vector as a weight, and fuses the Value vector of the target word with the Value vectors of the context words by a weighted sum. The result is the output of Attention, that is, the enhanced semantic vector representation of the target word.

Figure 16 Attention mechanism architecture sample
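    The Query, Key, and Value roles can be sketched in NumPy. The text only speaks of "similarity", so the dot product as the similarity measure and the softmax normalization of the weights are assumptions here; the linear transformations that produce Query, Key, and Value are omitted, and all names and numbers are hypothetical.

```python
import numpy as np

def attention(query, keys, values):
    """Weight the Values by the similarity between the Query and each Key.

    query: (d,), keys: (n, d), values: (n, d_v).
    Dot-product similarity and softmax weights are illustrative assumptions.
    """
    scores = keys @ query               # similarity of the query to each key
    scores = scores - scores.max()      # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ values, weights

# One target word (query) and four context words (keys/values).
query = np.array([2.0, -1.0, 0.0])
keys = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])
rng = np.random.default_rng(0)
values = rng.normal(size=(4, 3))

enhanced, weights = attention(query, keys, values)
print(weights)    # the first context word is the most similar to the query
print(enhanced)   # enhanced representation of the target word
```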

    In addition to the basic Attention mechanism, there are two variants, Self-Attention and Multi-head Attention, introduced as follows:

    Self-Attention: For an input text, we need to enhance the semantic vector representation of every word in it. We therefore use each word in turn as the Query and take a weighted combination of the semantic information of all words in the text to obtain the enhanced semantic vector of that word. In this case, the vector representations of Query, Key, and Value all come from the same input text, so this form of Attention is called Self-Attention.

    Multi-head Self-Attention: To increase the diversity of Attention, several different Self-Attention modules are used to obtain enhanced semantic vectors of each word in different semantic spaces. The multiple enhanced vectors of each word are then linearly combined into a final enhanced semantic vector with the same length as the original word vector.

    Two improved Attention architectures are shown in Figure 17:

 Figure 17 Architecture example of Self-Attention and Multi-head Attention
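    Both variants can be sketched together. The sketch assumes dot-product similarity scaled by the square root of the key dimension, softmax weights, and one projection triple (`Wq`, `Wk`, `Wv`) per head; the output matrix `Wo` performs the final linear combination back to the original vector length. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax, shifted for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Query, Key and Value all come from the same input X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # (n, n)
    return weights @ V

def multi_head_self_attention(X, heads, Wo):
    """Run several heads, concatenate them, then combine them linearly."""
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=1) @ Wo

rng = np.random.default_rng(0)
n, d, n_heads, d_head = 5, 8, 2, 4   # words, vector length, heads, head size

X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d))

Y = multi_head_self_attention(X, heads, Wo)
print(Y.shape)   # same shape as X: one enhanced vector per word
```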

2. Introduction to GAT

    To address the two major limitations of GCN summarized in Table 5, GAT introduces the attention mechanism shown in Figure 17, which handles both shortcomings well.

    Graph Attention Network (GAT) proposes to use the attention mechanism to compute a weighted sum of the features of adjacent nodes. The weights of the neighboring node features depend entirely on the node features and are independent of the graph structure. The core difference between GAT and GCN is how the feature representations of neighbor nodes at distance 1 are collected and accumulated. GAT replaces the fixed normalization operation in GCN with an attention mechanism. Specifically, GAT draws on the idea of the Transformer and introduces a masked self-attention mechanism: when computing the representation of each node in the graph, it assigns different weights to the neighbors according to their features. Essentially, GAT just replaces the normalization function of the original GCN with a neighbor feature aggregation function that uses attention weights.

    Therefore, the advantages of GAT are summarized in Table 6:

Table 6 Advantages of classic GAT

1. Training GAT does not require knowledge of the entire graph structure; it only needs to know the neighbor nodes of each node

2. The calculation speed is fast, and parallel calculation can be performed on different nodes

3. It can be used for both Transductive Learning and Inductive Learning, and can process unseen graph structures

3. Core formula analysis

    GAT is essentially a graph neural network that introduces an attention mechanism, so the core is the graph attention layer . In this section, the relevant core formulas are analyzed.

3.1  Input and output of graph attention layer

3.2 Feature extraction and attention mechanism

Figure 18 GAT feature extraction and attention mechanism process
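    A single graph attention head can be sketched as follows, assuming the formulation of the GAT paper: a shared projection W, scores e_ij = LeakyReLU(a^T [W h_i || W h_j]), a softmax over each node's neighbors (the mask), and a weighted sum of the projected neighbor features. The output nonlinearity is left out, and the function names and toy data are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_attention(H, A, W, a):
    """One attention head. Returns (alpha, H_out).

    H: (N, F) input features, A: (N, N) adjacency with self-loops,
    W: (F, F_out) shared projection, a: (2 * F_out,) attention vector.
    """
    Wh = H @ W                                    # projected features
    F_out = Wh.shape[1]
    # a^T [Wh_i || Wh_j] splits into a part for node i and a part for node j.
    src = Wh @ a[:F_out]
    dst = Wh @ a[F_out:]
    e = leaky_relu(src[:, None] + dst[None, :])   # (N, N) raw scores
    e = np.where(A > 0, e, -np.inf)               # mask: neighbors only
    e = e - e.max(axis=1, keepdims=True)
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha, alpha @ Wh                      # weighted neighbor sum

# Path graph 0-1-2-3 with self-loops.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))
W = rng.normal(size=(5, 3))
a = rng.normal(size=(6,))

alpha, H_out = gat_attention(H, A, W, a)
print(alpha.round(3))   # zero wherever two nodes are not connected
```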

3.3 Output Features 

3.4 multi-head attention 

Figure 19 Multi-head attention example in GAT

    Analysis: Figure 19 illustrates the multi-head attention (with K = 3 heads) of a node on its neighborhood. Different arrow styles and colors indicate independent attention computations. The aggregated features from each head are concatenated or averaged to obtain the node's output feature.
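    The two ways of combining heads can be sketched as follows. The compact `gat_head` function assumes the standard GAT scoring (LeakyReLU on a^T [W h_i || W h_j], masked softmax over neighbors); the names and sizes are hypothetical. With K = 3 heads of 4 features each, concatenation yields 12 features per node and averaging yields 4.

```python
import numpy as np

def gat_head(H, A, W, a):
    """One attention head (assumed standard GAT scoring), no nonlinearity."""
    Wh = H @ W
    F_out = Wh.shape[1]
    e = (Wh @ a[:F_out])[:, None] + (Wh @ a[F_out:])[None, :]
    e = np.where(e > 0, e, 0.2 * e)            # LeakyReLU
    e = np.where(A > 0, e, -np.inf)            # attend to neighbors only
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ Wh

def multi_head(H, A, heads, mode="concat"):
    """Combine K independent heads by concatenation or by averaging."""
    outs = [gat_head(H, A, W, a) for W, a in heads]
    if mode == "concat":
        return np.concatenate(outs, axis=1)    # hidden layers
    return np.mean(outs, axis=0)               # final (prediction) layer

rng = np.random.default_rng(0)
N, F, F_out, K = 4, 5, 4, 3
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)      # path graph with self-loops
H = rng.normal(size=(N, F))
heads = [(rng.normal(size=(F, F_out)), rng.normal(size=(2 * F_out,)))
         for _ in range(K)]

H_cat = multi_head(H, A, heads, mode="concat")
H_avg = multi_head(H, A, heads, mode="average")
print(H_cat.shape, H_avg.shape)
```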

    This completes the analysis of the core formulas of the GAT graph attention layer. For the GAT classification task, the process is very similar to the GCN classification process: both use the softmax function, the cross-entropy loss function, and gradient descent. The specific process has been described in detail above.

4. Paper results and analysis

    The authors evaluate the GAT model against various strong baselines and previous methods on four established graph-based benchmark tasks (transductive learning and inductive learning), achieving or matching state-of-the-art performance on all of them. This section summarizes the experimental setup, the results, and a brief qualitative analysis of the feature representations extracted by the GAT model.

4.1 Datasets and Comparison Methods

    For transductive learning, the paper uses three standard citation network benchmark datasets, Cora, Citeseer, and Pubmed, and closely follows the transductive experimental setup of Yang et al. The comparison methods are the strong baselines and state-of-the-art methods specified in Kipf & Welling.

    For inductive learning, the paper uses the protein-protein interaction (PPI) dataset and compares against four different supervised GraphSAGE inductive methods proposed by Hamilton et al.

    Furthermore, for both tasks, the paper reports the performance of a per-node shared Multi-Layer Perceptron (MLP) classifier, which does not use the graph structure at all.

    The data sets and information used in the paper are summarized in Figure 20:

Figure 20 Datasets used by GAT and their information summary

4.2 Experimental model setup

    For the transductive learning task, the experimental settings are shown in Table 7:

Table 7 Transductive learning experiment settings

  • Two-layer GAT model
  • The hyperparameters of the network structure are optimized on the Cora dataset and then reused for the Citeseer dataset
  • The first layer uses K=8 attention heads with F'=8 features each, with ELU as the nonlinear function
  • The second layer is the classification layer: a single attention head that computes C features (C is the number of classes), followed by a softmax function. To cope with the small training set, L2 regularization is applied
  • Both layers use dropout with p = 0.6, which amounts to randomly selecting part of the neighboring nodes to take part in the convolution at each node position

    For the inductive learning task, the experimental settings are shown in Table 8:

Table 8 inductive learning experiment settings

  • Three-layer GAT model
  • The first two layers use K=4 attention heads with F'=256 features each, with ELU as the nonlinear function
  • The last layer is used for classification: K=6 attention heads with F'=121 features each, and the activation function is sigmoid
  • In this task, the training set is large enough that regularization and dropout are not needed

    Both models were initialized with Glorot initialization and trained to minimize cross-entropy on the training nodes using the Adam SGD optimizer, with an initial learning rate of 0.01 for Pubmed and 0.005 for all other datasets. In both cases, an early-stopping strategy is applied to the cross-entropy loss and the accuracy (transductive) or micro F1-score (inductive) on the validation nodes, with a patience of 100 epochs.

4.3 Experimental results and analysis

    For the transductive task, the paper reports the average classification accuracy (with standard deviation) on the test nodes over 100 runs, and reuses the metrics already reported in Kipf & Welling and Monti et al. for the state-of-the-art techniques. The classification accuracy results on Cora, Citeseer, and Pubmed are summarized in Figure 21:

    For inductive tasks, the paper reports micro-averaged F1-scores on two unseen test graph nodes , averaged over 10 runs, and uses metrics already reported in Hamilton et al. for other techniques. The summary of the micro-average F1-score results of the PPI data set is shown in Figure 22:

Figure 21 Summary of classification accuracy results of Cora, Citeseer and Pubmed

Figure 22 Summary of micro-average F1-score results of PPI dataset
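    The micro-averaged F1-score used for the inductive task pools true positives, false positives, and false negatives over all nodes and all labels before computing F1. A minimal sketch, with a hypothetical function name and toy multi-label data:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary multi-label arrays of shape (nodes, labels)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))   # pooled true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # pooled false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # pooled false negatives
    return 2 * tp / (2 * tp + fp + fn)

# Two nodes, three labels each (toy data).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])

score = micro_f1(y_true, y_pred)
print(score)   # tp = 2, fp = 1, fn = 1, so F1 = 4 / 6
```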

    Analysis: As shown in Figure 21 and Figure 22, and in line with the discussion of related work, the paper's results show that GAT achieves or matches state-of-the-art performance on all four datasets, as expected.

4.4 Supplement to the conclusion of the paper

    For GAT, some visualization can be done to qualitatively study the effectiveness of the learned feature representations. To this end, the paper provides a visualization of the t-SNE-transformed feature representations extracted by the first layer of a GAT model pre-trained on the Cora dataset, as shown in Figure 23. The representation exhibits clear clustering in the projected 2D space. Note that these clusters correspond to the seven labels of the dataset, which validates the discriminative ability of the model across Cora's seven subject classes. In addition, the relative strength of the normalized attention coefficients (averaged across all eight attention heads) is also visualized.

Figure 23 t-SNE diagram of the computational feature representation of the first hidden layer of the pre-trained GAT model on the Cora dataset

4. GAT reproduction

    After introducing the principles, model implementations, advantages, and disadvantages of the graph neural network (GNN), graph convolutional neural network (GCN), and graph attention network (GAT), this part reproduces the GAT model and puts it into practice on the Cora and Citeseer datasets to complete practical tasks such as citation classification.

    There are many ways to reproduce GAT, based on Tensorflow, Pytorch, Keras, etc. The official code addresses are summarized in Table 9. This part uses the Tensorflow and Pytorch frameworks to reproduce and analyze GAT.

Table 9 GAT official implementation summary

Framework     Address
Tensorflow    GitHub - PetarV-/GAT: Graph Attention Networks
Pytorch       GitHub - gordicaleksa/pytorch-GAT
Keras         GitHub - danielegrattarola/keras-gat

1. Tensorflow GAT (code analysis)

    This part attempts to use Tensorflow GAT to reproduce and analyze the paper's experiments, including the analysis of the core code and the comparative analysis of the reproduced results.

1.1 Code structure and parameter setting

    Import the official Tensorflow GAT project into PyCharm and check the code structure and the parameter settings defined in GAT/execute_cora.py, as shown in Figure 24:

Figure 24 Tensorflow GAT code structure and parameter settings

1.2 Data loading and feature preprocessing

    Tensorflow GAT defines data loading and preprocessing in GAT/utils/process.py. By default the Cora dataset is used for citation classification, and the preprocessing is consistent with GCN. Of the loaded data, adj is an adjacency matrix that describes the citation relationships among the 2708 papers, and features indicates, for each of the 2708 papers, whether each of 1433 words appears in it.

1.3 GAT model

    The GAT network is built by stacking multiple Graph Attention Layers, so the core of the model is the implementation of the Graph Attention Layer, which is defined in layers.py.

    The definition of the Graph Attention Layer is explained in detail in the core formula analysis of Part 3. Its core step is to generate the attention coefficients. In the code, the core function is attn_head(). Its definition and key analysis are shown in Figure 25 (rendered with Carbon):

Figure 25 attn_head() definition and key analysis

    Figure 25 gives a brief overview of the core code and the processing flow of the Graph Attention Layer. The key lines in Figure 25 are explained and supplemented below.

Figure 26 layer convolution

    As shown in Figure 26, the author first applies a 1D convolution with kernel size 1 to the original node features seq. This simulates the projection transformation and yields seq_fts, whose feature dimension after projection is out_sz. Note that the projection matrix W is shared by all nodes, which is exactly what a kernel-size-1 convolution provides: the same convolution kernels are applied to every node.
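
    A kernel-size-1 convolution over the node axis amounts to multiplying every node's feature vector by one shared matrix W. The sketch below shows this with plain Python lists rather than Tensorflow. The function name project and all numbers are made up for illustration.

```python
def project(seq, W):
    """Apply the same weight matrix W (in_dim x out_sz) to every node's feature row."""
    out_sz = len(W[0])
    projected = []
    for h in seq:  # one node at a time; W never changes between nodes
        projected.append([sum(h[k] * W[k][c] for k in range(len(h))) for c in range(out_sz)])
    return projected


# 3 nodes with 3 input features each (illustrative values).
seq = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0],
       [1.0, 1.0, 0.0]]
# Shared projection: 3 input features -> out_sz = 2.
W = [[0.5, -1.0],
     [1.0, 0.0],
     [0.0, 2.0]]

seq_fts = project(seq, W)
print(seq_fts)  # [[0.5, 1.0], [1.0, 0.0], [1.5, -1.0]]
```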

    The implementation of conv1d (1D convolution) in Tensorflow is illustrated in Figure 27:

Figure 27 Convolution implementation process visualization

Figure 28 Projective transformation convolution

    As shown in Figure 28, seq_fts is passed through two further 1D convolutions with kernel size 1 to obtain f_1, the projection of the node itself, and f_2, the projection of its neighbors. Note that the two projections do not share parameters: there are two sets of projection parameters, one for each of the two conv1d calls.
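
    The two separate projections can be read as the two halves of the attention vector a from the GAT formula: a^T [W h_i || W h_j] equals a_1^T W h_i + a_2^T W h_j when a is split into a_1 and a_2. The following sketch checks that identity numerically. The values and the helper dot are illustrative only, and any bias term of the convolutions is ignored.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


# Already-projected features W h for two nodes (illustrative values).
Wh_i = [0.5, 1.0]
Wh_j = [1.5, -1.0]

# Attention vector a of length 2 * out_sz, split into its two halves.
a = [0.2, -0.4, 0.3, 0.1]
a_1, a_2 = a[:2], a[2:]

score_concat = dot(a, Wh_i + Wh_j)             # a^T [W h_i || W h_j]; list + list concatenates
score_split = dot(a_1, Wh_i) + dot(a_2, Wh_j)  # f_1 term of node i plus f_2 term of node j
print(score_concat, score_split)               # the two values agree up to rounding
```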

Figure 29 logits matrix

    As shown in Figure 29, f_2 is transposed and added to f_1. Through Tensorflow's broadcast mechanism this produces logits, a matrix holding the unnormalized attention score between every pair of nodes. The related formula is shown in Equation 17:

Equation 17 Attention matrix definition
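
    The broadcast addition can be spelled out without Tensorflow: a column of per-node values plus a row of per-node values gives one entry for every (i, j) pair. This is a minimal sketch with invented numbers, not the repository code.

```python
# One scalar per node from each projection (illustrative values).
f_1 = [0.1, -0.3, 0.4]  # "self" term, conceptually a column vector
f_2 = [0.2, 0.0, -0.5]  # "neighbor" term, conceptually a row vector after transposing

# Broadcasting column + row: logits[i][j] = f_1[i] + f_2[j].
logits = [[f_1[i] + f_2[j] for j in range(len(f_2))] for i in range(len(f_1))]

for row in logits:
    print(row)
```

    Every row i shares the same f_2 values and differs only by the offset f_1[i].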

Figure 30 Attention coefficient solution

    As shown in Figure 30, coefs in the code are the attention coefficients, obtained by applying softmax normalization to the logits.

    However, the code also adds a bias_mat when computing the attention coefficients. The logits store the attention value between any two nodes, but normalization should only run over the neighbors of each node (k ∈ N_i). bias_mat is introduced to restrict the softmax normalization to the neighbors of each node, as in the red part of Equation 18.

Equation 18 Constrained attention coefficient

    Next, a brief look at the implementation of bias_mat, which is defined in GAT/utils/process.py. The definition and core analysis are shown in Figure 31 (rendered with Carbon):

Figure 31 bias_mat implementation and core analysis

    Analysis: As shown in Figure 31, GAT implements bias_mat with a large negative number. adj_to_bias converts the original adjacency matrix into a bias matrix, bias_mat is added to the attention matrix, and the softmax then gives non-neighbor nodes a weight that is numerically zero, which masks them out.
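
    The masking trick can be reproduced in a few lines of plain Python. This is a simplified sketch of the idea (1-hop neighborhood, each node counted as its own neighbor), not a copy of the function in GAT/utils/process.py. The constant -1e9, the toy graph and the logits are assumptions for illustration.

```python
import math


def adj_to_bias(adj, neg=-1e9):
    """0.0 where j is i itself or a neighbor of i, a large negative number elsewhere."""
    n = len(adj)
    return [[0.0 if (i == j or adj[i][j] > 0) else neg for j in range(n)] for i in range(n)]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Path graph 0 - 1 - 2: nodes 0 and 2 are not neighbors.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
logits = [[0.3, 0.1, 0.9],
          [0.2, 0.5, 0.4],
          [0.7, 0.6, 0.0]]

bias_mat = adj_to_bias(adj)
coefs = [softmax([l + b for l, b in zip(l_row, b_row)])
         for l_row, b_row in zip(logits, bias_mat)]

for row in coefs:
    print([round(c, 3) for c in row])  # entries for non-neighbors come out as 0.0
```

    Adding the bias before the softmax, instead of zeroing entries afterwards, keeps each row summing to 1 over the neighbors only.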

Figure 32 Node update

    As shown in Figure 32, following the output formula of the GAT layer, the masked attention matrix coefs is multiplied by the transformed feature matrix seq_fts to obtain the updated node representation vals.
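
    The node update is an ordinary matrix product: each node's new representation is the attention-weighted sum of the projected features of its neighbors. A minimal sketch with hand-picked coefficients (assumed values, rows sum to 1):

```python
def matmul(A, B):
    """Plain-Python matrix product of A (n x m) and B (m x p)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


# Masked, normalized attention coefficients for a 3-node path graph (illustrative).
coefs = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.5, 0.5]]
# Projected node features, out_sz = 2 (illustrative).
seq_fts = [[2.0, 0.0],
           [0.0, 4.0],
           [4.0, 2.0]]

vals = matmul(coefs, seq_fts)
print(vals)  # [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
```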

1.4 GAT network

    With the Graph Attention Layer defined, the next step is to build the GAT network. The code is defined in GAT/models/gat.py, as shown in Figure 33, and is essentially a stack of Graph Attention Layers.

Figure 33 Tensorflow GAT network definition
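
    When layers are stacked, the outputs of the attention heads have to be merged. Following the multi-head description in Part 3, the sketch below assumes that hidden layers concatenate the heads and the output layer averages them. The helper names and numbers are hypothetical and the code is not taken from GAT/models/gat.py.

```python
def concat_heads(heads):
    """Hidden layers: concatenate the per-node outputs of all heads along the feature axis."""
    return [sum((head[i] for head in heads), []) for i in range(len(heads[0]))]


def average_heads(heads):
    """Output layer: average the per-node class scores of all heads."""
    n_heads = len(heads)
    n_nodes, n_classes = len(heads[0]), len(heads[0][0])
    return [[sum(head[i][c] for head in heads) / n_heads for c in range(n_classes)]
            for i in range(n_nodes)]


# Two heads, two nodes, two features per head (illustrative values).
head_a = [[1.0, 2.0], [3.0, 4.0]]
head_b = [[5.0, 6.0], [7.0, 8.0]]

hidden = concat_heads([head_a, head_b])
output = average_heads([head_a, head_b])
print(hidden)  # [[1.0, 2.0, 5.0, 6.0], [3.0, 4.0, 7.0, 8.0]]
print(output)  # [[3.0, 4.0], [5.0, 6.0]]
```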

1.5 Training and loss function

    Tensorflow GAT defines the training and loss functions in GAT/models/base_gattn.py. The core of training is to minimize the sum of the loss function and an L2 loss on the weights. The function definitions are shown in Figure 34:

Figure 34 Tensorflow GAT training and loss function definition
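
    The training objective can be sketched as a cross-entropy averaged over the labeled (masked-in) nodes plus a weighted L2 penalty. This is a simplified plain-Python illustration rather than the code in base_gattn.py. The function names, the mask handling and the coefficient 0.0005 are assumptions. The penalty is the halved sum of squares, the same quantity that tf.nn.l2_loss computes.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy of one node, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def masked_loss(all_logits, labels, mask):
    """Average cross-entropy over the nodes whose mask entry is True."""
    losses = [cross_entropy(l, y) for l, y, keep in zip(all_logits, labels, mask) if keep]
    return sum(losses) / len(losses)


def l2_penalty(weights, l2_coef):
    """l2_coef * sum(w^2) / 2 over a flat list of weights."""
    return l2_coef * sum(w * w for w in weights) / 2.0


# Two nodes, two classes; only node 0 is a training node (illustrative values).
node_logits = [[0.0, 0.0], [5.0, 0.0]]
labels = [0, 1]
train_mask = [True, False]
weights = [1.0, 2.0]

total = masked_loss(node_logits, labels, train_mask) + l2_penalty(weights, l2_coef=0.0005)
print(total)  # log(2) from node 0 plus 0.0005 * 5 / 2
```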

    So far, the analysis of the key code and algorithm flow of Tensorflow GAT has been completed. 

2. Pytorch GAT (code analysis)

    Pytorch GAT and Tensorflow GAT both follow the same GAT pipeline. They differ in the implementation and choice of functions, which depends on the framework. This part therefore analyzes only the core code of Pytorch GAT, the attention layer, whose implementation is shown in Figure 35:

Figure 35 Pytorch GAT attention layer code implementation

3. Code practice and result analysis

    This part uses Tensorflow GAT and Pytorch GAT to classify the citations in the Cora and Citeseer datasets, then analyzes and compares the results. The citation classification process of GAT is shown in Figure 36, and part of the training process is shown in Figure 37:

Figure 36 GAT citation classification process

Figure 37 GAT part of the training process

    After training, Tensorflow GAT and Pytorch GAT are each used to classify the citations of Cora and Citeseer. Taking Pytorch GAT on Cora as an example, the results are shown in Figure 38:

 Figure 38 Example of GAT classification results

    The results of Tensorflow GAT and Pytorch GAT on citation classification for Cora and Citeseer, together with the results reported in the paper, are listed in Table 10, and the comparison is shown in Figure 39:

Table 10 Summary of GAT classification reproduction and paper results

| Method | Cora | Citeseer |
| --- | --- | --- |
| Paper | 83.0 ± 0.7% | 72.5 ± 0.7% |
| Tensorflow | 82.45 | 71.96 |
| Pytorch | 84.39 | 72.97 |

Figure 39 Comparison of GAT classification reproduction and paper results

    Analysis: According to Table 10 and Figure 39, the results of the reproduced GAT on the Cora and Citeseer datasets are basically consistent with those of the paper (on classification metrics such as F1), and the remaining differences are related to the environment and parameter settings.
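
    The comparison can be made explicit by checking each reproduced value against the ± range reported in the paper. The numbers below are copied from Table 10. The helper within_reported_range is a hypothetical name, and treating the ± value as a simple tolerance is an assumption.

```python
# (mean, ± range) as reported in the paper, in percent.
paper = {"Cora": (83.0, 0.7), "Citeseer": (72.5, 0.7)}
# Reproduced values from Table 10.
reproduced = {
    "Tensorflow": {"Cora": 82.45, "Citeseer": 71.96},
    "Pytorch": {"Cora": 84.39, "Citeseer": 72.97},
}


def within_reported_range(value, mean, spread):
    """True if value lies inside [mean - spread, mean + spread]."""
    return abs(value - mean) <= spread


for framework, scores in reproduced.items():
    for dataset, value in scores.items():
        mean, spread = paper[dataset]
        inside = within_reported_range(value, mean, spread)
        print(f"{framework:10s} {dataset:8s} {value:6.2f} diff={value - mean:+.2f} inside={inside}")
```

    With these numbers, three of the four reproduced values fall inside the paper's range. The Pytorch result on Cora is 1.39 points above the paper's mean.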

    In addition, visualizing the training and test metrics during training gives a feel for how the GAT model is built up, and shows how the classification accuracy and the loss function change with the epoch. The training process of Tensorflow GAT is visualized as an example in Figure 40:

Figure 40 Visualization of the training process of Tensorflow GAT (the right picture is the Cora dataset)

    Analysis: As shown in Figure 40, the training process of GAT is more intuitive once visualized. As the epoch increases, the classification accuracy keeps rising until it stabilizes.

5. GAT Supplement

1. Promotion of GAT

    GAT as described so far is applied only to a single-layer graph network. Can it be generalized to a multi-layer network structure?

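    Assume a structure with N network layers. Every layer defines the same nodes, but the relationships between the nodes differ from layer to layer. This gives a multi-layer network whose layers share the same nodes but each have different edges. If each layer (view) is processed separately and the resulting node representations are simply added together, the cooperative relationship between the views may be lost, which lowers classification (prediction) accuracy.

    Based on this observation, a new method is proposed. First, GAT is applied within each single-layer view to learn and compute that view's node representations. Then an attention mechanism is introduced across the views so that the network learns the weight of each view by itself. Finally, the views are combined in a weighted sum using the learned weights to obtain a global node representation, which is used for downstream tasks such as node representation and link prediction.

    At the same time, because the views share the same nodes, the node embeddings obtained from each layer should be correlated to some degree, even though each view expresses a different node relationship. Based on this reasoning, a regularization term is introduced between the GAT network parameters of the layers to constrain them, so that they are pushed to learn in directions close to one another. The overall network flow is shown in Figure 41:

Figure 41 Flow chart of multi-layer network GAT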
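
    The text gives no formulas for the view-level attention or the regularization term, so the following is only one possible reading. It assumes a softmax over one learnable score per view for the weights, and a sum of squared parameter differences between views for the penalty. All names and numbers are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def combine_views(view_embeddings, view_scores):
    """Weighted sum of per-view node embeddings; weights = softmax(view_scores)."""
    weights = softmax(view_scores)
    n_nodes, dim = len(view_embeddings[0]), len(view_embeddings[0][0])
    return [[sum(w * view[i][d] for w, view in zip(weights, view_embeddings))
             for d in range(dim)] for i in range(n_nodes)]


def parameter_similarity_penalty(params_per_view):
    """Sum of squared differences between the parameters of every pair of views."""
    total = 0.0
    for a in range(len(params_per_view)):
        for b in range(a + 1, len(params_per_view)):
            total += sum((x - y) ** 2 for x, y in zip(params_per_view[a], params_per_view[b]))
    return total


# Two views over the same single node, embedding size 2 (illustrative values).
views = [[[1.0, 0.0]], [[3.0, 2.0]]]
global_repr = combine_views(views, view_scores=[0.0, 0.0])  # equal scores -> equal weights
penalty = parameter_similarity_penalty([[1.0, 2.0], [1.0, 4.0]])
print(global_repr, penalty)  # [[2.0, 1.0]] 4.0
```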

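2. GAT optimization

    Reference paper: HOW ATTENTIVE ARE GRAPH ATTENTION NETWORKS? (ICLR 2022)

    Analyzing the principle, model and implementation of the traditional GAT, its limitations are summarized in Table 11:

Table 11 Limitations of traditional GAT

1. GAT is weak at aggregating multi-order neighbors

2. GAT works best with a self loop added

(training node embeddings using only neighbor message aggregation tends to introduce a lot of noise)

3. GAT training is sensitive to parameter initialization

4. LeakyReLU is indispensable in the attention computation

    GAT therefore still has room for optimization. This part mainly draws on the ideas of an ICLR 2022 paper to propose several optimization directions for the classic GAT.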

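2.1 Introduction to the paper

    The paper argues that GAT computes static attention: it only achieves a static ranking of node importance and does not realize the idea of giving different keys for different queries. It therefore proposes GATv2, which adjusts the order in which LeakyReLU and the linear unit are computed to achieve dynamic attention, that is, different queries can yield different keys.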

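2.2 Motivation

    GAT has become a landmark architecture in the development of graph neural networks, but the paper observes that, for the same set of keys, what GAT's attention actually implements is a ranking.

    Consider a Dictionary Lookup problem. The problem and the attention scores obtained with GAT are shown in Figure 42:

Figure 42 Dictionary Lookup problem and attention scores obtained by GAT

    Analysis: As Figure 42 shows, the ordering of the key scores is actually the same for different queries (static). This limits the expressive power of GAT.

    The paper argues that the original purpose of attention should be: given different queries, different keys can be found (that is, different queries should produce different rankings, which is dynamic).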
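
    The static behavior can be checked numerically. In GAT the score of a pair has the form LeakyReLU(s_i + t_j), with one scalar s_i per query and one scalar t_j per key. Because LeakyReLU is monotonically increasing, the ordering of the keys is decided by t_j alone. The scalars below are invented for illustration, and the slope 0.2 is an assumed LeakyReLU parameter.

```python
def leaky_relu(x, alpha=0.2):
    return x if x > 0 else alpha * x


def ranking(scores):
    """Key indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


s = [-1.0, 0.3, 2.0]        # query-side scalars, one per query node (illustrative)
t = [0.5, -0.2, 1.1, 0.0]   # key-side scalars, one per key node (illustrative)

gat_rankings = [ranking([leaky_relu(s_i + t_j) for t_j in t]) for s_i in s]
for s_i, order in zip(s, gat_rankings):
    print(s_i, order)  # every query prints the same order: [2, 0, 3, 1]
```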

2.3 Methodology
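
    The reordering of LeakyReLU and the linear unit described in 2.1 can be written out as follows. The notation is an assumption chosen to match the earlier GAT formulas: a is the attention vector, W the shared weight matrix, and || denotes concatenation. Note that in the GATv2 form W acts on the concatenated pair, so its input dimension is doubled.

```latex
% GAT (static attention): LeakyReLU is applied after the attention vector a
e(h_i, h_j) = \mathrm{LeakyReLU}\left(a^{\top}\left[W h_i \,\Vert\, W h_j\right]\right)

% GATv2 (dynamic attention): the attention vector a is applied after LeakyReLU
e(h_i, h_j) = a^{\top}\,\mathrm{LeakyReLU}\left(W\left[h_i \,\Vert\, h_j\right]\right)
```

    In the first form, a and W are applied one right after the other and collapse into a single linear map, so the key ordering does not depend on the query. Placing the nonlinearity between them removes this collapse.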

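2.4 Optimization effect

    For Dictionary Lookup, GATv2 is used to solve the problem in Figure 42. The comparison before and after optimization is shown in Figure 43:

Figure 43 Attention scores obtained by GAT before and after optimization

    Analysis: As shown in Figure 43, for the bipartite graph problem above, the improved GAT can effectively achieve dynamic attention.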
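
    A tiny hand-built example shows what dynamic attention looks like. With scalar node features and the GATv2 ordering (linear map, then LeakyReLU, then attention vector), weights can be chosen so that the score equals -0.8 * |h_i - h_j| and every query prefers the key equal to itself. The weights below are hand-picked for illustration, not learned, and the function name gatv2_score is hypothetical.

```python
def leaky_relu(x, alpha=0.2):
    return x if x > 0 else alpha * x


def gatv2_score(h_i, h_j, W, a):
    """a^T LeakyReLU(W [h_i || h_j]) for scalar node features h_i and h_j."""
    hidden = [leaky_relu(w[0] * h_i + w[1] * h_j) for w in W]
    return sum(a_k * u for a_k, u in zip(a, hidden))


# Hand-picked parameters: the two hidden units compute h_i - h_j and h_j - h_i.
W = [[1.0, -1.0],
     [-1.0, 1.0]]
a = [-1.0, -1.0]

queries = [0.0, 1.0, 2.0]
keys = [0.0, 1.0, 2.0]

best_key = [max(range(len(keys)), key=lambda j: gatv2_score(q, keys[j], W, a))
            for q in queries]
print(best_key)  # [0, 1, 2]: each query selects a different key
```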

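    For Robustness to Noise, the comparison between GATv2 and the classic GAT is shown in Figure 44:

Figure 44 Comparison of noise resistance of GAT before and after optimization

    Analysis: As shown in Figure 44, dynamic attention (GATv2) resists noise better.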

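6. Experimental conclusion and reference

1. Experimental conclusion

(1) Many irregular data structures exist in real life. The typical one is the graph structure, such as social networks, molecular structures and knowledge graphs. Traditional NN methods perform poorly on such unstructured data, so graph neural networks are needed.

(2) GAT is essentially a graph convolutional neural network with an attention mechanism added. The defects of some classic graph neural networks that preceded it are listed in Table 12:

Table 12 Defects of classic graph neural networks

GNN defects

1. The hidden states of nodes are updated iteratively toward a fixed point, which is inefficient.

2. The original GNN uses the same parameters in every iteration, which makes it hard for the model to learn deeper feature representations. In addition, the update of the node hidden layers is a sequential process.

3. Some informative features on the edges may not be taken into account effectively. How to learn the hidden states of edges is also an important problem.

4. If the goal is to learn vector representations of nodes rather than of the graph, fixed points are unsuitable, because the distribution of representations at the fixed point is very smooth in value and carries little information for distinguishing individual nodes.

GCN defects

GCN assigns exactly the same weight to different neighbors in a neighborhood of the same order, which limits the model's ability to capture the correlation of spatial information.

The way GCN combines the features of neighboring nodes is closely tied to the structure of the graph, which limits how well the trained model generalizes to other graph structures.

(3) GAT can be used under multiple frameworks such as Tensorflow, Pytorch and Keras to complete a variety of practical tasks, such as citation classification, and GAT can also be generalized to multi-layer network structures.

(4) GAT is an efficient algorithm, but it still has problems: it is not well suited to aggregating multi-order neighbors, it needs a self loop added, it is sensitive to parameter initialization, and LeakyReLU is indispensable in the attention computation. There is still room for optimization.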


Origin blog.csdn.net/weixin_51426083/article/details/128340275