A Detailed Introduction to Graph Convolutional Networks (GCN)

This article is reprinted from "A Detailed Introduction to Graph Convolutional Networks (GCN)".

Reprint source: A Detailed Introduction to Graph Convolutional Networks (GCN) | Leifeng.com

In this article, we will take a closer look at a well-known graph neural network called GCN. First, let’s get an intuitive understanding of how it works, and then dive into the math behind it.

Why use graphs?

Many problems are graphs in nature. In our world, a great deal of data comes in the form of graphs, such as molecules, social networks, and paper citation networks.


Examples of graphs. (Picture from [1])

Tasks on graphs

  • Node classification: predict the type of a specific node.

  • Link prediction: predict whether two nodes are connected.

  • Community detection: identify densely connected communities of nodes.

  • Network similarity: measure how similar two (sub)networks are.

Machine learning life cycle

In a graph, we have node features (the data that describes each node) and the structure of the graph (how the nodes are connected).

For nodes, we can easily obtain the data of each node. But when it comes to the structure of the graph, extracting useful information from it is not an easy task. For example, if two nodes are close to each other, should we treat them differently from other pairs of nodes? How should we handle nodes with high versus low degree? In fact, for any specific task, just the feature engineering, that is, converting the graph structure into usable features, consumes a great deal of time and effort.


Feature engineering on graphs. (Picture from [1])

It would be better if we could somehow take both the node features and the structural information of the graph as input, and let the machine decide which information is useful.

This is why we need graph representation learning.


We want the graph to learn "feature engineering" on its own. (Picture from [1]) 

Graph Convolutional Networks (GCNs)

Paper: Semi-Supervised Classification with Graph Convolutional Networks (2017) [3]

GCN is a convolutional neural network that works directly on graphs and exploits the structural information of the graph.

It solves the problem of classifying nodes (such as documents) in a graph (such as a citation network) where only a small fraction of the nodes have labels (semi-supervised learning).


Example of semi-supervised learning on a graph. Some nodes have no labels (unknown nodes).

The main idea

As the name "convolution" suggests, the idea came from images and was later adapted to graphs. However, while images have a fixed grid structure, graphs are much more complex.


Convolution ideas: from images to graphs. (Picture from [1])

The basic idea of GCN: for each node, we obtain its feature information from all its neighbor nodes, including the node itself. Suppose we use the average() function. We do the same for all nodes. Finally, we feed the computed averages into a neural network.

In the image below, we have a simple example of a citation network. Each node represents a research paper, and the edges represent citations. There is a preprocessing step here: instead of using the raw papers as features, we convert each paper into a vector (using an NLP embedding, such as TF-IDF).

Let's consider the green node. First, we take the feature values of all its neighbors, including the node itself, and average them. Finally, this average is passed through a neural network, which returns a result vector that is used as the final result.


The main idea of GCN. Take the green node as an example. First, we take the average of all its neighbor nodes, including the node itself. Then, the average is passed through a neural network. Note that in GCN, we use only a single fully connected layer. In this example, we get a 2-dimensional vector as output (the fully connected layer has 2 nodes).

In practice, we can use aggregation functions that are more complex than the average. We can also stack more layers together to obtain a deeper GCN. The output of each layer is treated as the input of the next layer.
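To make this concrete, here is a minimal NumPy sketch of a single "average and transform" layer on a made-up 3-node graph (the graph, feature values, and weights are invented for illustration, not taken from the article's figures):

```python
import numpy as np

# Tiny undirected graph: node 0 is connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.array([[1.0], [2.0], [3.0]])                 # one feature per node

A_self = A + np.eye(3)                              # include each node's own features
avg = (A_self @ X) / A_self.sum(1, keepdims=True)   # average over neighbors + self

W = np.random.randn(1, 2)                           # one fully connected layer, 2 outputs
H = np.maximum(avg @ W, 0)                          # ReLU; one 2-dim vector per node
print(H)
```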


Example of 2-layer GCN: The output of the first layer is the input of the second layer. Also, note that the neural network in GCN is just a fully connected layer (picture from [2]).

Now let's take a close look at the math to see how this works.

The intuition and the math behind it

First, we need some notation.

We consider graph G, as shown below.


From the graph G, we have an adjacency matrix A and a degree matrix D. We also have a feature matrix X.


So how do we get the feature values of each node from its neighbor nodes? The solution lies in multiplying A by X.

Looking at the first row of the adjacency matrix, we see that node A is connected to node E. Therefore, the first row of the product is the feature vector of node E, the only node connected to A (as shown below). Similarly, the second row of the product is the sum of the feature vectors of nodes D and E. In this way, we obtain the sum of the feature vectors of each node's neighbors.


Calculate the first row of the "sum vector matrix" AX.
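To see this concretely, here is a small NumPy sketch. The adjacency matrix below is reconstructed from the connections and degrees mentioned in the text (edges A-E, B-D, B-E, C-E, D-E); the feature values are made up so the sums are easy to follow:

```python
import numpy as np

# Node order: A, B, C, D, E. Edges (from the text): A-E, B-D, B-E, C-E, D-E.
A = np.array([[0, 0, 0, 0, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [1, 1, 1, 1, 0]], dtype=float)

# One made-up feature per node: A=1, B=10, C=20, D=30, E=40.
X = np.array([[1.0], [10.0], [20.0], [30.0], [40.0]])

AX = A @ X
print(AX[0])  # [40.] -> features of E, node A's only neighbor
print(AX[1])  # [70.] -> features of D + E (30 + 40)
```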

There is still some room for improvement here:

  1. We ignore the features of the node itself. For example, the first row of the computed matrix should also contain the features of node A.

  2. Instead of using the sum() function, we should take an average, or even better, a weighted average of the neighbors' feature vectors. Why not just use sum()? Because with sum(), high-degree nodes tend to produce large aggregated vectors while low-degree nodes get small ones, which can later cause exploding or vanishing gradients (for example, when using sigmoid). Besides, neural networks appear to be sensitive to the scale of the input data. Therefore, we need to normalize these vectors to get rid of these potential problems.

For problem (1), we can fix it by adding an identity matrix I to A, which gives a new adjacency matrix Ã.


Taking λ = 1 (making a node's own features as important as those of its neighbors), we have Ã = A + I. Note that we could treat λ as a trainable parameter, but for now we simply set it to 1; the paper also just uses λ = 1.


By adding a self-loop to each node, we get the new adjacency matrix Ã.
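Continuing the NumPy snippet above, adding the identity matrix gives Ã, and node A's own feature now shows up in its aggregated vector:

```python
# Continuing the snippet above: add self-loops so each node keeps its own features.
A_tilde = A + np.eye(5)
print((A_tilde @ X)[0])  # [41.] -> node A's own feature (1) is now included
```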

For problem (2): to scale a matrix, we usually multiply it by a diagonal matrix. In the current case, we want to take the average of the aggregated features, or mathematically speaking, to scale the aggregated matrix ÃX according to node degrees. Intuition tells us that the diagonal matrix used for scaling should be related to the degree matrix D̃ (why D̃ and not D? Because we now consider the degree matrix of the new adjacency matrix Ã, not of A).

The question now becomes: how do we scale/normalize the sum vectors? In other words:

How do we pass neighbor information to a specific node? We start with our old friend, the average. This is where the inverse of D̃ (that is, D̃^{-1}) comes into play: each diagonal element of D̃^{-1} is the reciprocal of the corresponding entry of D̃.


For example, node A has degree 2, so we multiply the aggregated vector of node A by 1/2; node E has degree 5, so we multiply the aggregated vector of E by 1/5; and so on.

Therefore, by multiplying D̃^{-1} with ÃX, we can take the average of the feature vectors of all neighbor nodes (including the node itself).
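Continuing the sketch, D̃^{-1} Ã X computes exactly this row-wise average (the numbers refer to the made-up features above):

```python
# Degree matrix of A_tilde and its inverse (reciprocals on the diagonal).
D_tilde = np.diag(A_tilde.sum(axis=1))
D_inv = np.linalg.inv(D_tilde)

avg = D_inv @ A_tilde @ X
print(avg[0])  # [20.5] -> (1 + 40) / 2, since node A has degree 2 in A_tilde
```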


So far, so good. But you may ask: what about a weighted average? Intuitively, it should be better if we treat high- and low-degree nodes differently.


But D̃^{-1}ÃX only scales the rows and ignores the corresponding columns (dashed box).


Add a new scaling factor for the columns as well, which gives D̃^{-1} Ã D̃^{-1} X.

The new scaling gives us a "weighted" average. What we do here is give more weight to low-degree nodes, which reduces the influence of high-degree nodes. The idea behind this weighting is that low-degree nodes should have a larger impact on their neighbors, while high-degree nodes have a smaller impact per neighbor, because their influence is spread over too many neighbor nodes.


When aggregating the features of adjacent nodes for node B, we assign the largest weight to node B itself (degree 3) and the smallest weight to node E (degree 5).

Because we normalize twice (once along the rows and once along the columns), the exponent "-1" becomes "-1/2":

D̃^{-1/2} Ã D̃^{-1/2} X
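Continuing the sketch, the symmetrically normalized matrix can be computed directly; note how the weights for node B match the caption above:

```python
# Symmetric normalization: entry (i, j) becomes 1 / sqrt(deg(i) * deg(j)).
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Weights node B uses when aggregating: itself and D get 1/3,
# while the high-degree node E only gets 1/sqrt(3 * 5) ~= 0.258.
print(A_hat[1])
```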

Putting this together, the layer-wise propagation rule of GCN from [3] is

H^{(l+1)} = σ( D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} W^{(l)} )

where H^{(0)} = X, W^{(l)} is the trainable weight matrix of layer l, σ is an activation function such as ReLU, and the last layer outputs F values per node.

For example, if we have a multi-class problem with 10 classes, F is set to 10. The 10-dimensional vectors obtained in layer 2 are then passed through a softmax function for prediction.

The loss function is very simple to compute: it is the cross-entropy error over all labeled examples, where Y_L is the set of labeled nodes:

L = − Σ_{l ∈ Y_L} Σ_{f=1}^{F} Y_{lf} ln Z_{lf}
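Continuing the sketch, here is a minimal end-to-end forward pass and loss for a 2-layer GCN. The weights are random and the labels are made up; this only illustrates the shapes and the masking over labeled nodes, not the paper's actual training code:

```python
def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

F_classes, hidden = 10, 4                    # F = 10 output classes
W0 = np.random.randn(1, hidden) * 0.1        # input features are 1-dimensional here
W1 = np.random.randn(hidden, F_classes) * 0.1

H = np.maximum(A_hat @ X @ W0, 0)            # layer 1: propagate, transform, ReLU
Z = softmax(A_hat @ H @ W1)                  # layer 2: propagate, transform, softmax

labeled = [0, 2]                             # the few labeled nodes (the set Y_L)
Y = np.eye(F_classes)[[3, 7]]                # their one-hot labels (made up)
loss = -np.sum(Y * np.log(Z[labeled]))       # cross-entropy over labeled nodes only
print(loss)
```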

Number of layers

What the number of layers means

The number of layers is the farthest distance that node features can travel. For example, in a 1-layer GCN, each node can only obtain information from its direct neighbors. Information is gathered independently for each node, and simultaneously for all nodes.

When we add another layer on top of the first, we repeat the information-gathering process, but this time the neighbor nodes already carry information about their own neighbors (from the previous step). This makes the number of layers equal to the maximum number of hops a node's information can travel. So, depending on how far we think a node should receive information from the network, we can set an appropriate number of layers. But again, in a graph, we usually do not want to go too far: with 6-7 hops we would gather information from almost the entire graph, which makes the aggregation less meaningful.
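Continuing the sketch, we can check the "hops" claim directly: in our made-up graph, node C reaches only E in one step, but the whole graph in two.

```python
# Powers of A_tilde show which nodes can be reached within k hops.
reach_1 = np.linalg.matrix_power(A_tilde, 1)[2]  # node C, 1 layer
reach_2 = np.linalg.matrix_power(A_tilde, 2)[2]  # node C, 2 layers
print(reach_1 > 0)  # [False False  True False  True] -> only C itself and E
print(reach_2 > 0)  # [ True  True  True  True  True] -> the entire graph
```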


Example: the process of gathering information over two layers for target node i.

How many layers should we stack in a GCN?

In the paper, the authors also ran experiments with shallow and deep GCNs. In the image below, we can see that models with 2 or 3 layers give the best results. Furthermore, deep GCNs (more than 7 layers) tend to perform poorly (dashed blue line). One solution is to resort to residual connections between hidden layers (purple line).


Performance for different numbers of layers. (Picture from [3])

Key takeaways

  • GCNs are used for semi-supervised learning on graphs.

  • GCNs are trained using both node features and graph structure.

  • The main idea of GCN is to take a weighted average of the features of all neighbor nodes (including the node itself), where lower-degree nodes get larger weights. We then pass the resulting feature vectors through a neural network for training.

  • We can stack more layers to make GCN deeper. Consider residual connections for deep GCNs. Usually, we will choose 2-layer or 3-layer GCN.

  • Math Notes: When you see a diagonal matrix, think of matrix scaling.

  • Here is a GCN demo using the StellarGraph library [5]. The repository also provides many other GNN algorithms.

Note from the author of the paper

The framework is currently limited to undirected graphs (weighted or unweighted). However, directed edges and edge features can be handled by representing the original directed graph as an undirected bipartite graph, with additional nodes representing the edges of the original graph.

What's next?

For GCN, we seem to be able to exploit both node features and the structure of the graph. However, what if the edges in the graph are of different types? Should we treat each relationship differently? How to aggregate neighbor nodes in this case? What are the recent advanced methods?

In the next article on the graph topic, we will look at some more sophisticated methods.


How do we deal with different types of relationships on the edges (siblings, friends, ...)?

References

[1] Excellent slides on Graph Representation Learning by Jure Leskovec (Stanford): https://drive.google.com/file/d/1By3udbOt10moIcSEgUQ0TR9twQX9Aq0G/view?usp=sharing

[2] Video Graph Convolutional Networks (GCNs) made simple: https://www.youtube.com/watch?v=2KRAOZIULzw

[3] Paper Semi-supervised Classification with Graph Convolutional Networks (2017): https://arxiv.org/pdf/1609.02907.pdf

[4] GCN source code: https://github.com/tkipf/gcn

[5] Demo with StellarGraph library: https://stellargraph.readthedocs.io/en/stable/demos/node-classification/gcn-node-classification.html
