An introduction to GNN: A Gentle Introduction to Graph Neural Networks

Original link

A Gentle Introduction to Graph Neural Networks (distill.pub): https://distill.pub/2021/gnn-intro/

Introduction: These are my reading notes on "A Gentle Introduction to Graph Neural Networks". Since this is my first contact with GNNs, I did not understand many of the more advanced concepts, so I have not read the article in full; I may add to these notes later.

What is a graph? A graph consists of three kinds of elements: vertices (nodes), edges, and attributes, which are features attached to the nodes, the edges, or the graph as a whole.

What are graph problems used for? They fall into three categories: graph-level tasks, such as finding the graphs whose topology contains a ring; node-level tasks, such as clustering vertices according to the relationships given by the edges; and edge-level tasks, such as predicting the properties of edges from the features of the vertices.

How is information passed and aggregated within a graph? Message passing in a GNN works through pooling: information is exchanged between nodes and then aggregated, for example with an average, a maximum, or a sum.

Guidelines for designing GNNs: when designing a GNN, deeper is not automatically better. Adding layers improves the worst-case and average performance, but does not improve the best achievable performance. To get better performance, one direction is to design a better message-propagation mechanism, and another is to add more attributes to the graph.

Sampling when training a GNN: because each node has a different set of neighbors and incident edges, we cannot simply take fixed-size batches as in traditional neural networks. Different sampling methods exist for building training batches, such as random node sampling and random walks.

How can GNNs be extended? GNNs can be adapted to more complex graph structures; for example, we can consider multi-edge graphs (multigraphs), in which a pair of nodes can share several edges of different types.

Questions I have not yet understood: what are the message-passing mechanisms within a graph, and how exactly is information delivered?

How is a GNN trained, and what techniques are used for sampling? These need a deeper understanding.

1. What kind of data can be converted into a graph?

A graph is made up of vertices, edges, and their attributes (features can live on the nodes, the edges, or the graph as a whole).
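As a concrete illustration, here is a minimal sketch (my own, not from the article) of how such a graph could be stored, with separate arrays for node, edge, and global attributes:

```python
# A hypothetical 4-node graph with node, edge, and global (whole-graph)
# attributes stored as plain NumPy arrays.
import numpy as np

graph = {
    # one feature vector per node (4 nodes, 3 features each)
    "nodes": np.random.randn(4, 3),
    # edges as (source, target) index pairs, plus one feature vector per edge
    "edges": [(0, 1), (1, 2), (2, 3), (3, 0)],
    "edge_feats": np.random.randn(4, 2),
    # a single feature vector describing the whole graph
    "global": np.random.randn(5),
}
```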

 2. What kind of problems can be transformed into graph structure problems?

There are three categories:

In a graph-level task, we predict a single property for a whole graph. For a node-level task, we predict some property for each node in a graph. For an edge-level task, we want to predict the property or presence of edges in a graph.

Graph-level task

For example, picking out, from a collection of graphs, the ones whose topology contains two rings.

 Node-level task

 Similar to clustering nodes

 Edge-level task

Finding relationships (the presence or properties of edges) between nodes.

Representing a graph with an adjacency matrix consumes a lot of memory, especially when the matrix is sparse. How can we represent the graph in a more memory-efficient way? One memory-friendly alternative is an adjacency list, which stores only the edges that actually exist.
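A rough sketch of the difference on a small hypothetical 4-node graph: the dense adjacency matrix costs O(n²) memory regardless of how many edges exist, while the adjacency list costs only O(n + m).

```python
# The same 4-node, 4-edge graph stored two ways.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Dense adjacency matrix: n*n entries, mostly zeros for a sparse graph.
adj_matrix = np.zeros((4, 4), dtype=int)
for s, t in edges:
    adj_matrix[s, t] = adj_matrix[t, s] = 1

# Adjacency list: only the edges that exist are stored.
adj_list = {i: [] for i in range(4)}
for s, t in edges:
    adj_list[s].append(t)
    adj_list[t].append(s)
```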

 GNN Predictions by Pooling Information

After constructing a GNN, how do we complete tasks or make predictions?

Pooling is done in two steps:

1. For each item to be pooled, collect each of their embeddings, and concatenate them into a matrix.

2. The gathered embeddings are then aggregated, usually with a sum operation (sketched below).
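A minimal sketch of these two steps on made-up node embeddings (the indices and dimensions are purely illustrative):

```python
# Pooling in two steps: gather the embeddings of the items to be pooled into
# one matrix, then reduce them (here with a sum) to a single vector.
import numpy as np

node_embeddings = np.random.randn(4, 8)   # 4 nodes, 8-dim embeddings
items_to_pool = [0, 2, 3]                 # the nodes whose information we need

gathered = node_embeddings[items_to_pool]  # step 1: collect into a matrix
pooled = gathered.sum(axis=0)              # step 2: aggregate (sum)
```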

Predict node based on edge information

 Predict edge based on node information

 Predict global information based on node and edge information
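For example, to predict a node property when features only live on the edges, the incident edge embeddings can be pooled into each node and then classified. The following is only a sketch with placeholder weights, not the article's implementation:

```python
# Route each edge embedding to both of its endpoints (sum pooling per node),
# then apply a hypothetical linear classifier W to the pooled vectors.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
edge_embeddings = np.random.randn(len(edges), 8)
W = np.random.randn(8, 2)                      # placeholder 2-class classifier

pooled = np.zeros((4, 8))
for e_idx, (s, t) in enumerate(edges):
    pooled[s] += edge_embeddings[e_idx]        # edge info flows to both endpoints
    pooled[t] += edge_embeddings[e_idx]

node_logits = pooled @ W                       # per-node class scores
```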

 Passing messages between parts of the graph

We can make more complex predictions by using pooling inside GNN layers, so that our learned embeddings are aware of graph connectivity. We do this through message passing, where adjacent nodes or edges exchange information and influence each other's updated embeddings.

Message passing is divided into three steps:

1. For each node in the graph, gather the embeddings (or messages) of all neighboring nodes.

2. Aggregate all the messages with an aggregation function, such as sum.

3. All the pooled information will be passed through an update function (usually a learned neural network).
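Putting the three steps together, a single message-passing step might look roughly like the following sketch; the update function here is just a random linear layer with ReLU standing in for a learned network:

```python
# One message-passing step over node embeddings, for an undirected graph
# given as an adjacency list. W is a placeholder weight matrix.
import numpy as np

def message_passing_step(node_emb, adj_list, W):
    new_emb = np.zeros_like(node_emb)
    for v, neighbors in adj_list.items():
        msgs = node_emb[neighbors]              # 1. gather neighbor embeddings
        agg = node_emb[v] + msgs.sum(axis=0)    # 2. aggregate (sum), incl. self
        new_emb[v] = np.maximum(agg @ W, 0)     # 3. update (linear + ReLU)
    return new_emb

node_emb = np.random.randn(4, 8)
adj_list = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
W = np.random.randn(8, 8)
node_emb = message_passing_step(node_emb, adj_list, W)
```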

Essentially, both message passing and convolution are operations that summarize and process information about an element's neighborhood to update an element's value. In a graph, an element is a node, while in an image, an element is a pixel. However, the number of neighboring nodes in a graph can be variable, unlike images where each pixel has a fixed number of neighboring elements.

By stacking message-passing GNN layers together, a node can eventually integrate information from the entire graph: after going through three layers, a node can obtain information about nodes three steps away from it.

The general modeling template uses sequential GNN layers followed by a linear model with sigmoid activation for classification. The design space of GNNs has many levers that can be used to customize the model:

1. The number of GNN layers, also known as depth.

2. The dimensionality of each attribute when it is updated. The update function is a 1-layer MLP with a ReLU activation and layer normalization (a sketch follows after this list).

3. The aggregation function used for pooling: maximum, average, or sum.

4. The graph attributes that get updated, i.e. the message-passing style: node, edge, and global representations. We control these with boolean toggles (on or off). The baseline model is a graph-independent GNN (all message passing off), which at the end aggregates all data into a single global attribute. Turning on all message-passing functions yields a GraphNets-style architecture.
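As an illustration of the update function mentioned in item 2, here is a rough NumPy sketch of a 1-layer MLP with ReLU followed by layer normalization; the weights are random placeholders, not trained parameters:

```python
import numpy as np

def update_fn(x, W, b, eps=1e-5):
    h = np.maximum(x @ W + b, 0)                  # linear layer + ReLU
    mu = h.mean(axis=-1, keepdims=True)           # layer normalization over features
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

x = np.random.randn(4, 8)                         # 4 attributes, 8 features each
out = update_fn(x, np.random.randn(8, 16), np.zeros(16))
```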


Since these are high-dimensional vectors, we reduce them to 2D by principal component analysis (PCA). A perfect model should be able to cleanly separate the labeled data, but since we are reducing dimensionality and also have imperfect models, this boundary can be harder to see.
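For reference, reducing the embeddings to 2D with PCA takes only a couple of lines (scikit-learn assumed available; the embedding array here is a placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(100, 64)              # e.g. 100 graphs, 64-dim embeddings
coords_2d = PCA(n_components=2).fit_transform(embeddings)  # points to plot
```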

Some empirical GNN design lessons

1. Deeper networks do not necessarily perform better.

2. Performance mainly depends on the type of message passing, the embedding dimensionality, the number of layers, and the type of aggregation operation.

The average performance and the worst performance increase as the number of GNN layers increases, but the best performance does not increase with the increase in the number of layers. This effect may be due to the fact that a GNN with more layers propagates information farther away, and its node representations may be "diluted" by multiple successive iterations.

The more graph attributes that participate in message passing, the better the average model performance. Our task centers on the global representation, so explicitly learning this attribute also tends to improve performance. Node representations also seem to be more useful than edge representations, which makes sense since more information is loaded into the node attributes.

For better performance, there are many directions to choose from. We wish to highlight two general directions, one related to more complex graph algorithms and the other related to graphs themselves.

So far, GNNs operate based on neighborhood pooling. Some graph concepts are more difficult to express this way, such as linear graph paths (connected chains of nodes). Designing new mechanisms to extract, execute, and propagate graph information in GNNs is a current area of research.

A frontier of GNN research is not building new models and architectures, but "how to build graphs", more precisely, adding additional structures or relationships to graphs that can be exploited. As we have seen, the more properties a graph has, the better models we can build. In this particular case, we can consider enriching the features of molecular graphs by adding spatial relationships between nodes, adding non-bonded edges, or specifying learnable relationships between subgraphs.

Other graph types related to GNNs

Although we only describe graphs with vectorized information for each attribute, the graph structure is more flexible and can accommodate other types of information. Fortunately, the message passing framework is flexible enough that in general, tuning a GNN to a more complex graph structure simply requires defining how information is passed and updated through new graph properties.

For example, we can consider multi-edge graphs, or multigraphs, where a pair of nodes can share more than one type of edge. For example, in a social network, we can specify edge types based on the type of relationship (acquaintances, friends, family). GNNs can be adapted by defining a different message-passing step for each edge type. We can also consider nested graphs, where a node itself represents a graph, also known as a supernode graph. Nested graphs are useful for representing hierarchical information. For example, we can consider molecular networks where a node represents a molecule and two molecules share an edge if we have a method (a reaction) to transform one into the other. In this case, we can learn on the nested graph by having one GNN learn representations at the molecular level and another at the reaction-network level, and alternating between them during training.

Another kind of graph is a hypergraph, where an edge can connect multiple nodes, not just two. For a given graph, we can construct a hypergraph by identifying node communities and assigning a hyperedge connecting all nodes in the community.

Sampling Graphs and Batching in GNNs 

Usually for neural network training we take fixed-size mini-batches, but this is not straightforward for GNNs, because the number of edges connected to each node is not fixed, so there is no constant batch size. The idea for batching graphs is to build subgraphs, which in turn requires graph sampling.

How to sample graphs is an unresolved research problem.

If we wish to preserve neighborhood-level structure, one approach is to randomly sample a certain number of nodes, our node set, and then add the nodes within distance k of this node set, including their edges. Each neighborhood can be viewed as a separate graph, and a GNN can be trained on batches of these subgraphs. Since these neighborhoods are incomplete, the loss can be masked so that only the original node set is considered.

A more efficient strategy might be to first randomly sample a node, expand its neighborhood to distance k, and then pick other nodes within the expanded set. These operations can be terminated once a certain number of nodes, edges, or subgraphs has been constructed. If circumstances allow, we can build constant-size neighborhoods by picking an initial set of nodes and then subsampling a fixed number of nodes (e.g., randomly, or via random walks or the Metropolis algorithm).
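A rough sketch of this neighborhood-sampling idea on an adjacency-list graph (the function name and parameters are my own, purely illustrative):

```python
# Pick a random node set, expand to its k-hop neighborhood, and keep only the
# edges inside that subgraph. The loss would then be masked so that only the
# original seed nodes contribute.
import random

def sample_subgraph(adj_list, num_seeds=2, k=2):
    seeds = random.sample(list(adj_list), num_seeds)
    keep = set(seeds)
    frontier = set(seeds)
    for _ in range(k):                       # expand k hops outward
        frontier = {n for v in frontier for n in adj_list[v]} - keep
        keep |= frontier
    sub_edges = [(v, n) for v in keep for n in adj_list[v] if n in keep and v < n]
    return keep, sub_edges, set(seeds)       # seeds = nodes to compute the loss on

adj_list = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
nodes, edges, loss_mask = sample_subgraph(adj_list)
```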

Comparing aggregation operations

Aggregating information from neighboring nodes and edges is a key step in any GNN. Because the number of neighbors of each node varies, we need a differentiable way of collecting this information; the aggregation operation should therefore be smooth and invariant both to node ordering and to the number of nodes.

The selection and design of optimal aggregation operations is an open research topic. A desirable property of an aggregation operation is that similar inputs give similar aggregated outputs, and vice versa. Some very simple permutation-invariant candidates are sum, average (mean), and maximum. Summary statistics such as variance also work. All of these operations take a variable number of inputs and give the same output regardless of the ordering of the inputs.

There is no one-size-fits-all best choice of operation. The mean can be useful when the number of neighbors of a node varies greatly, or when the features of the local neighborhood need to be normalized. The max can be useful when you want to highlight a single salient feature in a local neighborhood. Sum strikes a balance between the two: it provides a snapshot of the local feature distribution, but, because it is not normalized, it also highlights outliers. In practice, sum is the most common choice.
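A tiny numerical sketch of the three operations on a hypothetical neighborhood of three feature vectors, showing how each summarizes the same inputs differently:

```python
# Each aggregation is permutation-invariant and accepts any number of inputs.
import numpy as np

neighbor_feats = np.array([[1.0, 0.0],
                           [3.0, 2.0],
                           [0.0, 5.0]])

agg_sum  = neighbor_feats.sum(axis=0)    # [4.  7. ]  not normalized, highlights outliers
agg_mean = neighbor_feats.mean(axis=0)   # [1.33 2.33]  normalized by neighbor count
agg_max  = neighbor_feats.max(axis=0)    # [3.  5. ]  picks out single salient features
```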


Origin blog.csdn.net/qq_38480311/article/details/131953224