[Graph convolutional network] 03-Introduction to spatial convolution (1)

Note: This article consists of notes on videos 3.1-3.2 on spatial domain convolution, for personal study use only


1. Spectral domain graph convolution

1.1 Review

The previous blog, [Graph Convolutional Neural Network] 02-Introduction to Spectral Domain Graph Convolution, covered three classic spectral domain graph convolutions:

  • SCNN: replaces the spectral domain convolution kernel with a learnable diagonal matrix:
    $$y = \sigma\left(U \,\mathrm{diag}(\theta_1, \dots, \theta_n)\, U^\top x\right)$$

  • ChebNet: uses Chebyshev polynomials in place of the spectral domain convolution kernel:
    $$y = \sigma\left(\sum_{k=0}^{K-1} \theta_k T_k(\hat{L})\, x\right), \qquad \hat{L} = \frac{2L}{\lambda_{\max}} - I$$

  • GCN: can be regarded as a further simplification of ChebNet. Only 1st-order Chebyshev polynomials are considered, and each convolution kernel has only one parameter:
    $$y = \sigma\left(\theta\, \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} x\right), \qquad \tilde{A} = A + I,\ \ \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$$

    The common characteristic of all three: they are based on the convolution theorem and the graph Fourier transform.
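As a quick refresher on how simple the resulting propagation rule is, here is a minimal NumPy sketch of one GCN layer in its renormalized form; the graph, features, and weights below are hypothetical, chosen only for illustration:

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """One GCN layer: sigma(D~^{-1/2} A~ D~^{-1/2} X Theta)."""
    A_tilde = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt      # symmetric normalization
    return np.maximum(A_hat @ X @ Theta, 0.0)      # sigma = ReLU here

# Hypothetical 4-node cycle graph with 2-dim features and 1 output channel.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 2))
Theta = np.ones((2, 1))                            # a single shared parameter block
print(gcn_layer(A, X, Theta))
```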


1.2 Defects of spectral domain graph convolution

  1. Spectral domain graph convolution does not work on directed graphs.
    • The graph Fourier transform applies only to undirected graphs.
    • The first step of spectral domain graph convolution is to transform the spatial domain signal to the spectral domain. When the graph Fourier transform cannot be used, spectral domain graph convolution cannot proceed.
    • In a large number of practical scenarios, $W_{ij} \neq W_{ji}$.
  2. Spectral domain graph convolution assumes a fixed graph structure.
    • During model training, the graph structure cannot change (weights between nodes cannot change, and nodes cannot be added or deleted).
    • In some scenarios, the graph structure may change (e.g., social network data, traffic data).
  3. Model complexity issues.
    • SCNN needs to perform a spectral decomposition of the Laplacian matrix, which is time-consuming: the complexity is $O(n^3)$.
    • ChebNet and GCN do not require spectral decomposition, but their learnable parameters are oversimplified; while this reduces model complexity, it also limits model performance.

Can we bypass spectral graph theory and redefine convolution directly on the graph? This article introduces four spatial graph convolution models, each of which can be regarded as a different answer to this question:

  1. GNN
  2. GraphSAGE
  3. GAT
  4. PGC

2. Four spatial convolution models


2.1 GNN

2.1.1 Question: What is convolution?

Paper: Hechtlinger Y, Chakravarti P, Qin J. A Generalization of Convolutional Neural Networks to Graph-Structured Data. arXiv, 2017

Answer 1: Convolution means that a fixed number of neighborhood nodes, after being sorted, are multiplied by the same number of convolution kernel parameters and summed. Traditional convolution has a fixed neighborhood size (for example, a 3×3 kernel corresponds to the eight-neighborhood) and a fixed order (usually from the upper left corner to the lower right corner).

2.1.2 Core idea

The convolution operation can be divided into two steps:

  1. Build the neighborhood.
    • Find a fixed number of neighbor nodes.
    • Sort the found neighbor nodes.
  2. Take the inner product of the nodes in the neighborhood and the convolution kernel parameters.

For graph-structured data, there are some difficulties in building neighborhoods:

  1. There is no fixed eight-neighborhood structure. The neighborhood size of each node is variable.
  2. There is no inherent order among nodes in the same neighborhood.

2.1.3 Solutions

  1. Use the random walk method to select a fixed number of neighbor nodes according to their expected probability of being visited.
  2. Then order the neighborhood according to each node's expected probability of being selected.

Notation

  1. $P$ is the random walk transition matrix on the graph, where $P_{ij}$ represents the transition probability from node i to node j.

  2. $S$ is the similarity matrix. In this paper it can be understood as the adjacency matrix $W$.

  3. $D$ is the degree matrix, $D_{ii} = \sum_j S_{ij}$.

2.1.4 Specific steps

  • GNN assumes the existence of a graph transition matrix. If the graph structure is known, then $S$ and $D$ are known, and the random walk transition matrix is defined as $P = D^{-1} S$.
  • In other words, the row-normalized adjacency matrix is used as the transition matrix.
  • The multi-step visit expectation is defined as $Q^{(k)} = \sum_{n=0}^{k} P^n$, where $Q^{(k)}_{ij}$ is the expected number of visits to node j within k steps of a random walk starting from node i.
  • Select the neighborhood according to these expectations: $\pi_i(c)$ denotes the index of the node with the c-th largest expected number of visits from node i (within k steps).
  • Perform a 1D convolution (inner product) over the ranked neighborhood:
    $$y_i = \sum_{c=1}^{p} w_c\, x_{\pi_i(c)}$$
    A runnable sketch of this pipeline follows.
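Below is a minimal NumPy sketch of the whole pipeline, assuming a toy 5-node graph, a hypothetical node signal x, and hypothetical kernel weights w (one weight per rank in the sorted neighborhood):

```python
import numpy as np

# Toy 5-node undirected graph; S is the similarity (adjacency) matrix.
# The graph, the signal x, and the kernel w are hypothetical, for illustration only.
S = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 1, 0, 1, 0]], dtype=float)

P = np.diag(1.0 / S.sum(axis=1)) @ S              # P = D^{-1} S, rows sum to 1

k, p = 3, 3                                       # walk length, neighborhood size
Q = sum(np.linalg.matrix_power(P, n) for n in range(k + 1))   # expected visits

x = np.array([0.5, 1.0, -0.2, 0.3, 2.0])          # signal on the 5 nodes
w = np.array([0.7, 0.2, 0.1])                     # one kernel weight per rank

y = np.empty(len(x))
for i in range(len(x)):
    pi_i = np.argsort(-Q[i])[:p]                  # nodes ranked by expected visits
    y[i] = w @ x[pi_i]                            # 1D convolution in ranked order
print(y)
```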

2.1.5 GNN convolution operation example

The element $P^n_{ij}$ of the matrix $P^n$ is the probability of reaching node j from node i in n steps. In effect, the similarity matrix $S$ (the adjacency matrix) is normalized so that each row sums to 1 and can therefore represent probabilities. Compute the expectation $Q$, assuming at most 3 steps, i.e. k = 3. According to $Q$, select p nodes as the neighborhood; suppose p = 3. Then perform the 1D convolution. For node 5, the convolution is
$$y_5 = w_1 x^c_5 + w_2 x^c_2 + w_3 x^c_1$$
where $x^c_5$ is the signal on node 5 in layer c. Note that the order of $x_5$, $x_2$, $x_1$ cannot be changed: they are arranged by expected visit count.


2.1.6 Experimental results

Paper: Hechtlinger Y, Chakravarti P, Qin J. A Generalization of Convolutional Neural Networks to Graph-Structured Data. arXiv, 2017. Experiments on molecular activity detection and the MNIST dataset.

This method outperforms a traditional fully connected network (Fully Connected NN) and random forest (Random Forest).


Rethinking GNN: essentially, GNN's approach is to force graph-structured data into a form resembling regular data, so that it can be processed by 1D convolution.


2.2 GraphSAGE

2.2.1 Question: What is convolution?

Paper: Inductive Representation Learning on Large Graphs. In Proc. of NIPS, 2017

Answer 2: Convolution = sampling + information aggregation!

2.2.2 Core idea

  • Decompose convolution into two steps: sampling and aggregation.
  • SAGE is short for SAmple and aggreGatE.
  • Aggregation functions must be independent of input order; that is, the authors believe the nodes in a neighborhood do not need to be sorted.

2.2.3 Implementation process

  1. Obtain neighbor nodes by sampling.
  2. Use an aggregation function to aggregate the information of the neighbor nodes and obtain the embedding of the target node.
  3. Use the information aggregated at the node to predict the label of the node/graph.
    Convolution = sampling + information aggregation!
  • Sampling
    • GraphSAGE uses uniform sampling (Uniform Sampling) to sample a fixed-size neighborhood. That is, for each node, uniform sampling is performed over its first-order connected nodes to construct a neighborhood with a fixed number of nodes. The nodes sampled for the same node may differ across batch iterations.
  • Aggregation
    • Mean aggregator:
      $$h^k_{\mathcal{N}(v)} = \mathrm{MEAN}\left(\{h^{k-1}_u : u \in \mathcal{N}(v)\}\right), \qquad h^k_v = \sigma\left(W^k \cdot \mathrm{concat}\left(h^{k-1}_v,\, h^k_{\mathcal{N}(v)}\right)\right)$$

    • LSTM aggregator (LSTM: long short-term memory, a special kind of recurrent neural network). An LSTM is used to encode the features of the neighbors. The order among neighbors is ignored here: they are randomly shuffled and fed into the LSTM.

    • Pooling aggregator:
      $$\mathrm{AGGREGATE}^{\mathrm{pool}}_k = \max\left(\left\{\sigma\left(W_{\mathrm{pool}}\, h^k_u + b\right) : u \in \mathcal{N}(v)\right\}\right)$$
  • Forward computation: see Algorithm 1 in the paper. A simplified sketch of one layer with the mean aggregator follows.
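Below is a simplified NumPy sketch of one GraphSAGE layer in the two-step form used in the example later (aggregate sampled neighbors, then concatenate with the central node); the graph, features, and weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sage_mean_layer(h, neighbors, W, num_samples=2):
    """One GraphSAGE layer with a mean aggregator (simplified sketch).

    h:         (num_nodes, d_in) node features from layer k-1
    neighbors: dict node id -> list of first-order neighbor ids
    W:         (d_out, 2 * d_in) weight matrix shared by all nodes
    """
    out = []
    for v in range(h.shape[0]):
        # Step 0: uniformly sample a fixed-size neighborhood, with replacement.
        sampled = rng.choice(neighbors[v], size=num_samples, replace=True)
        # Step 1: aggregate the sampled neighbors' information.
        h_nbr = h[sampled].mean(axis=0)
        # Step 2: combine with the central node and transform.
        h_cat = np.concatenate([h[v], h_nbr])
        out.append(np.maximum(W @ h_cat, 0.0))    # sigma = ReLU here
    return np.stack(out)

# Hypothetical 5-node graph and features, for illustration only.
nbrs = {0: [1, 4], 1: [0, 2, 4], 2: [1, 3], 3: [2, 4], 4: [0, 1, 3]}
h0 = rng.normal(size=(5, 4))
W1 = rng.normal(size=(8, 8))                      # d_out = 8, 2 * d_in = 8
print(sage_mean_layer(h0, nbrs, W1).shape)        # (5, 8)
```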

Differences from GNNs

  • In GNN (as in traditional CNNs), the order of neighborhood nodes must be determined. The authors of GraphSAGE argue that graph convolution neighborhood nodes do not need to be sorted.
  • In GNN, each node in the neighborhood has its own convolution kernel parameters; in GraphSAGE, all nodes in the neighborhood share the same convolution kernel parameters.
  • For neighborhood selection, GNN builds the neighborhood from random walk probabilities, while GraphSAGE builds it by uniform sampling.

2.2.4 GraphSAGE convolution operation example

For node 5, there are three first-order adjacent nodes: 1, 2, and 4. These three nodes are the candidates, and the neighborhood of node 5 is constructed by sampling from them.

  • Assume the mean aggregation function is used, and that we are now at layer k.
  • Assume 2 nodes need to be sampled (sampling with replacement). One sampling of the neighborhood of node 5 yields node 1 and node 4.
  • In the first step, the result of aggregating the neighborhood nodes is:
    $$h^k_{\mathcal{N}(5)} = \mathrm{mean}\left(h^{k-1}_1, h^{k-1}_4\right)$$
  • In the second step, the result of combining the central node's information is:
    $$h^k_5 = \sigma\left(W^k \cdot \mathrm{concat}\left(h^{k-1}_5,\, h^k_{\mathcal{N}(5)}\right)\right)$$
    $h^k_5$ is the output of layer k.

In another iteration, a sampling of the neighborhood of node 5 yields node 1 and node 1 (this is possible because sampling is done with replacement).

  • In the first step, the result of aggregating the neighborhood nodes is:
    $$h^k_{\mathcal{N}(5)} = \mathrm{mean}\left(h^{k-1}_1, h^{k-1}_1\right) = h^{k-1}_1$$
  • The second step is the same as before:
    $$h^k_5 = \sigma\left(W^k \cdot \mathrm{concat}\left(h^{k-1}_5,\, h^k_{\mathcal{N}(5)}\right)\right)$$
    $h^k_5$ is the output of layer k.

2.2.5 Experimental results

Paper: Inductive Representation Learning on Large Graphs. In Proc. of NIPS, 2017

  • In terms of effectiveness, GraphSAGE outperforms traditional methods such as DeepWalk.
  • In terms of computation time, the LSTM aggregator is the slowest of the GraphSAGE variants to train, but compared with DeepWalk, GraphSAGE reduces prediction time by a factor of 100-500.
  • In addition, for GraphSAGE (figure B), F1 increases as the number of sampled neighbors increases, but so does the computation time.
  • "unsup F1" and "sup F1" refer to results under unsupervised and supervised learning, respectively.

2.3 GAT

2.3.1 Question: What is convolution?

Paper: GRAPH ATTENTION NETWORKS ICLR 2018

Answer 3: Convolution can be defined as a discriminative aggregation of neighborhood nodes using attention.

What is attention?

The attention mechanism is a technique that allows a model to focus on important information and fully learn from it. It mimics the way humans look at objects; the core logic is to move "from attending to everything to focusing on the key points".

2.3.2 Core idea

  • GAT stands for Graph Attention Networks; its core idea is to introduce attention into the graph convolution model.
  • The authors argue that having all nodes in a neighborhood share the same convolution kernel parameters limits the model's capacity. Because each node in the neighborhood is associated with the central node to a different degree, the convolution should treat different neighborhood nodes differently when aggregating their information.
  • The attention mechanism is used to model the degree of association between neighborhood nodes and the central node.

2.3.3 Specific steps

  1. Use the attention mechanism to compute the degree of association between nodes.
    • Compute the relevance:
      $$e_{ij} = \mathrm{LeakyReLU}\left(a^\top \left[W h_i \,\Vert\, W h_j\right]\right)$$

    • Softmax normalization. To make the attention coefficients of different nodes comparable, the authors normalize each node's attention coefficients with a softmax function:
      $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$

  2. Use the attention coefficients to perform discriminative information aggregation over the neighboring nodes, completing the graph convolution operation (a sketch follows this list):
    $$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right)$$
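Here is a minimal single-head NumPy sketch of these steps; the graph, features, and parameters are hypothetical, and multi-head attention, dropout, and other details of the paper are omitted:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_layer(h, neighbors, W, a):
    """One single-head GAT layer (simplified sketch).

    h:         (num_nodes, d_in) input node features
    neighbors: dict node id -> list of neighborhood ids (node itself included)
    W:         (d_out, d_in) linear transform shared by all nodes
    a:         (2 * d_out,) attention vector
    """
    Wh = h @ W.T                                   # W h_j for every node
    out = np.zeros((h.shape[0], W.shape[0]))
    for i, nbrs in neighbors.items():
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every j in the neighborhood
        e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]])) for j in nbrs])
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                       # softmax over the neighborhood
        for al, j in zip(alpha, nbrs):             # discriminative aggregation
            out[i] += al * Wh[j]
    return np.maximum(out, 0.0)                    # sigma = ReLU here

# Hypothetical graph matching the example below ("node 5" is index 4 here,
# with neighbors 0, 1, 3, i.e. nodes 1, 2, 4).
nbrs = {0: [0, 1, 4], 1: [1, 0, 2, 4], 2: [2, 1, 3], 3: [3, 2, 4], 4: [4, 0, 1, 3]}
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))
W = rng.normal(size=(4, 3))
a = rng.normal(size=(8,))
print(gat_layer(h, nbrs, W, a)[4])                 # output for "node 5"
```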

2.3.4 GAT convolution operation example

  1. For node 5, there are three first-order adjacent nodes: 1, 2, and 4. These nodes, together with node 5 itself, form the GAT neighborhood.
  2. Use the attention mechanism to compute the degree of association between nodes. For example, the association between node 5 and node 1 is:
    $$e_{51} = \mathrm{LeakyReLU}\left(a^\top \left[W h_5 \,\Vert\, W h_1\right]\right)$$
  3. Softmax normalization:
    $$\alpha_{51} = \frac{\exp(e_{51})}{\exp(e_{55}) + \exp(e_{51}) + \exp(e_{52}) + \exp(e_{54})}$$
  4. Use the attention coefficients to perform discriminative information aggregation over the neighboring nodes, completing the graph convolution:
    $$h'_5 = \sigma\left(\alpha_{55} W h_5 + \alpha_{51} W h_1 + \alpha_{52} W h_2 + \alpha_{54} W h_4\right)$$
    Note that $W$ is shared by all four neighborhood nodes.

2.3.5 Experimental results

Paper: GRAPH ATTENTION NETWORKS ICLR 2018.

2.3.6 Comparison with other graph convolutions

  1. For neighborhood construction, unlike GNN (random walk) and GraphSAGE (sampling), GAT directly selects the first-order adjacent nodes as the neighborhood (similar to GCN).
  2. For node ordering, the nodes in a GAT neighborhood do not need to be sorted and share the convolution kernel parameters (similar to GraphSAGE).
  3. Since GAT introduces the attention mechanism, it can model the relationships between adjacent nodes and aggregate them discriminatively. If $\alpha_{ij}$ and $W$ are viewed together as a single coefficient, GAT in effect implicitly assigns different convolution kernel parameters to each node in the neighborhood.

2.3.7 Rethinking of GAT

  1. GAT can be viewed as learning the local graph structure. Existing graph convolution methods often focus on node features and ignore the graph structure.
  2. The attention mechanism can be regarded as a learnable function with learnable parameters. Using it, GAT constructs a learnable function that captures the relationships between adjacent nodes, i.e., the local graph structure.

2.4 PGC

2.4.1 Question: What is convolution?

Paper: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. AAAI, 2018

Answer 4: Convolution can be regarded as multiplying by a specific sampling function (sample function) and a specific weight function (weight function), then summing.

Starting from classic convolution, a K × K convolution kernel can be written as the following formula:
$$f_{out}(x) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}\left(\mathbf{p}(x, h, w)\right) \cdot \mathbf{w}(h, w)$$
K is the size of the convolution kernel (commonly 3). $\mathbf{p}(\cdot)$ is the sampling function: it takes nodes out of the neighborhood in turn to participate in the convolution. $\mathbf{w}(\cdot)$ is the weight function, which assigns convolution kernel parameters to the sampled nodes. The whole formula is simply the inner product of node features and convolution kernel parameters, as the sketch below makes concrete.
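To make the decomposition concrete, here is a tiny NumPy sketch that writes an ordinary 3×3 image convolution explicitly in terms of a sampling function p and a weight function w (indices run from 0 rather than 1; the input image and kernel are hypothetical):

```python
import numpy as np

def p(f_in, x, h, w, K=3):
    """Sampling function: the (h, w)-th neighbor of pixel x in a K x K window."""
    r, c = x
    return f_in[r + h - K // 2, c + w - K // 2]

def conv_at(f_in, x, kernel, K=3):
    """f_out(x) = sum_{h, w} f_in(p(x, h, w)) * w(h, w)."""
    return sum(p(f_in, x, h, w, K) * kernel[h, w]
               for h in range(K) for w in range(K))

f_in = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5 x 5 input
kernel = np.full((3, 3), 1.0 / 9)                 # mean filter as the weight function
print(conv_at(f_in, (2, 2), kernel))              # 12.0 at the center pixel
```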

2.4.2 Core idea

When extending convolution from regular data to graph-structured data, the task is to choose an appropriate sampling function and weight function.

Sampling function

  • The sampling function takes nodes out of the neighborhood in turn. The key question is how to construct the neighborhood of a node, i.e., where the sampling function samples.
  • On graph-structured data, PGC defines the sampling function on the set of D-order neighbors $B(v_i) = \{v_j \mid d(v_j, v_i) \le D\}$, i.e. $p(v_i, v_j) = v_j$,
    where $d(v_j, v_i)$ is the shortest-path distance from node i to node j.
  • The experiments take D = 1, sampling node by node within the first-order neighborhood, but other neighborhood sizes can also be used.

Weight function

  • First, the nodes in the neighborhood are partitioned into K different classes:
    $$l_i : B(v_i) \to \{0, \dots, K-1\}$$
    Because of this partitioning operation, the method is called Partition Graph Convolution (PGC).
  • All nodes within a class share one set of convolution kernel parameters, and different classes have different parameters:
    $$w(v_i, v_j) = w'\left(l_i(v_j)\right)$$

Classification Strategies for Weight Functions

  • Uni-labeling (figure b). All nodes in the neighborhood are treated equally; there is only one class, so all nodes get the same color.
  • Distance partitioning (figure c). Classes are determined by distance: the node itself (distance 0) is one class, and its first-order neighbors (distance 1, shown in blue) form another class.
  • Spatial configuration partitioning (figure d). Nodes are classified by their distance to the center of gravity of the human skeleton, giving three classes: the root node itself; neighbors closer to the skeleton center than the root (blue); and neighbors farther from the skeleton center than the root (yellow).

The convolution formula defined by PGC on the graph is finally:
$$f_{out}(v_i) = \sum_{v_j \in B(v_i)} \frac{1}{Z_i(v_j)} f_{in}(v_j) \cdot w'\left(l_i(v_j)\right)$$
Here $Z_i(v_j)$ is the number of nodes in node j's class (within the neighborhood of node i). This normalization coefficient balances the contribution of each class of nodes in the neighborhood. A sketch of this layer follows.
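Below is a simplified NumPy sketch of this formula for D = 1 neighborhoods, using distance partitioning (K = 2: the node itself versus its first-order neighbors); the graph, labels, and weights are hypothetical:

```python
import numpy as np

def pgc_layer(h, neighbors, labels, Wp):
    """One partition graph convolution layer (simplified sketch, D = 1).

    h:         (num_nodes, d_in) input node features
    neighbors: dict node id -> list of neighborhood ids (node itself included)
    labels:    dict (i, j) -> class index l_i(v_j) in {0, ..., K-1}
    Wp:        (K, d_out, d_in) one weight matrix per class
    """
    out = np.zeros((h.shape[0], Wp.shape[1]))
    for i, nbrs in neighbors.items():
        # Z_i(v_j): how many neighborhood nodes fall into v_j's class.
        counts = {}
        for j in nbrs:
            counts[labels[i, j]] = counts.get(labels[i, j], 0) + 1
        # f_out(v_i) = sum_j (1 / Z_i(v_j)) * w'(l_i(v_j)) f_in(v_j)
        for j in nbrs:
            l = labels[i, j]
            out[i] += (Wp[l] @ h[j]) / counts[l]
    return out

# Hypothetical graph with distance partitioning: class 0 is the node itself,
# class 1 holds its first-order neighbors (K = 2).
nbrs = {0: [0, 1, 4], 1: [1, 0, 2, 4], 2: [2, 1, 3], 3: [3, 2, 4], 4: [4, 0, 1, 3]}
labels = {(i, j): 0 if i == j else 1 for i, js in nbrs.items() for j in js}
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))
Wp = rng.normal(size=(2, 4, 3))                   # K = 2 classes, d_out = 4
print(pgc_layer(h, nbrs, labels, Wp).shape)       # (5, 4)
```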

2.4.3 Example of PGC convolution operation

  • For node 5, there are three first-order adjacent nodes: 1, 2, and 4. Together with node 5 itself, these form the PGC neighborhood. The sampling function takes out nodes 1, 2, 4, and 5 in turn. Their order does not matter, because the weight function assigns each node its corresponding parameters.

  • Assume the weight function puts node 5 in the first class, nodes 1 and 2 in the second class, and node 4 in the third class. Then:
    $$f_{out}(v_5) = w'_1 f_{in}(v_5) + \frac{1}{2} w'_2 f_{in}(v_1) + \frac{1}{2} w'_2 f_{in}(v_2) + w'_3 f_{in}(v_4)$$

  • Assume the weight function puts all four nodes into one class (uni-labeling). Then each node has $Z = 4$:
    $$f_{out}(v_5) = \frac{1}{4} w'\left(f_{in}(v_5) + f_{in}(v_1) + f_{in}(v_2) + f_{in}(v_4)\right)$$

  • Assume the weight function puts the four nodes into four different classes, one node per class:
    $$f_{out}(v_5) = w'_1 f_{in}(v_5) + w'_2 f_{in}(v_1) + w'_3 f_{in}(v_2) + w'_4 f_{in}(v_4)$$

2.4.4 Experimental results


2.4.5 Relationship with other graph convolutions

  • Compared with GraphSAGE, which uses uniform sampling to determine neighborhoods, PGC abstracts neighborhood construction into a sampling function, which generalizes better.
  • GNN requires the neighborhood to be ordered, while GraphSAGE's neighborhood nodes need no ordering. PGC takes a more general approach by defining a weight function: GNN's and GraphSAGE's treatments of neighborhood nodes can be seen as the two extremes of PGC, fully distinct versus fully equal treatment.

3. Summary

3.1 The essence of spatial convolution

Different spatial graph convolution methods essentially correspond to different understandings of convolution.

3.2 Features of Spatial Graph Convolution

  • It bypasses spectral graph theory; there is no need to transform signals between the spatial and spectral domains.
  • Defining the convolution operation directly in the spatial domain is more intuitive.
  • Freed from the constraints of spectral graph theory, the definitions are more flexible and the methods more diverse.
  • Compared with spectral domain graph convolution, it lacks mathematical theoretical support.

3.3 Comparison of the four spatial graph convolutions

| Convolution method | Definition of convolution | Neighborhood node selection | Do neighbor nodes need sorting? | Are kernel parameters shared within a neighborhood? |
| --- | --- | --- | --- | --- |
| GNN | A fixed number of neighbor nodes are sorted, then multiplied by the same number of convolution kernel parameters and summed | Random walk | Yes | No |
| GraphSAGE | Sampling + information aggregation | Uniform sampling | No | Yes |
| GAT | Discriminative aggregation of neighborhood nodes using the attention mechanism | First-order neighbor nodes used directly | No | Shared, but after attention correction each node is effectively assigned different kernel parameters |
| PGC | A specific sampling function multiplied by a specific weight function, then summed | Determined by the sampling function | Determined by the weight function | Determined by the weight function |

3.4 Requirements for graph structure

  • GNN, GraphSAGE, and GAT do not require a fixed graph structure; that is, the training and test sets may have different graph structures.
  • The PGC paper does not discuss this issue, and its experiments operate on graph data with a fixed structure.
  • However, as a highly general framework of which the above three convolutions can be regarded as special cases, PGC can likewise be considered not to require a fixed graph structure.
  • In essence, because spatial domain graph convolution does not use the graph Fourier transform, it need not worry about the basis functions (the eigenvectors U of the Laplacian matrix) changing when the Laplacian matrix L changes.
