Notes on the paper Semi-Supervised Classification with Graph Convolutional Networks: a beginner's learning and understanding

Reference notes:

  • Paper notes: Semi-Supervised Classification with Graph Convolutional Networks (hongbin_xu's blog, CSDN)
  • Paper notes: SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS (Yinbingl's blog, CSDN)
  • Laplacian matrix and Laplacian regularization (solicucu's blog, CSDN)
  • Understanding and detailed derivation of the Graph Convolutional Network (spectral-domain GCN) (CSDN)
  • "If you still don't understand the Fourier transform after reading this article, come over and strangle me" (Zhihu)

Preface

Convolutional neural networks are loosely inspired by the human brain: when recognizing an object, the brain first picks up edges, then shapes, and finally decides what kind of object it is. A convolutional neural network exploits this by building a multi-layer network in which the lower layers detect low-level features of objects, several low-level features combine into higher-level ones, and classification is finally performed on the combination of features across layers.

A typical convolutional neural network (CNN) consists of convolutional layers, pooling layers, and fully connected layers: the convolutional layers extract features, the pooling layers reduce dimensionality and overfitting, and the fully connected layers produce the final output. The usual objects of study have a regular spatial structure, such as ordered sentences or images (e.g. classifying cats and dogs), and these can be represented as matrices. Images are translation invariant: sliding a small window to any position does not change the internal structure, so a CNN can extract features with a shared kernel, while RNNs are typically used for sequential data such as natural language. However, a great deal of real-world data has an irregular spatial structure, for example molecular structures; such data can be regarded as effectively infinite-dimensional and has no translation invariance. Every node is unique, so fixed convolution kernels cannot represent its features, and CNNs and RNNs break down. GCN provides a method for extracting features from graph data; it is essentially a feature extractor. The paper (Semi-Supervised Classification with Graph Convolutional Networks) builds on spectral graph theory, using the eigenvalues and eigenvectors of the graph Laplacian to study the properties of the graph.

Semi-supervised learning means that only part of the sample set carries labels, and the labels of the unlabeled data are inferred from the labeled data. A given data set can be mapped to a graph, with each sample corresponding to a node. Since a graph corresponds to a matrix, semi-supervised learning algorithms can then be analyzed in terms of matrices. There are two difficulties with this approach. First, with n samples, constructing the graph has complexity on the order of O(n^{2}), which makes large-scale data hard to handle. Second, the graph is constructed only from the given sample set, so adding new samples requires rebuilding the original graph and relabeling it.

The goal of this paper is to address this semi-supervised learning problem. It uses a neural network f(X, A) to encode the graph structure directly and trains it on a supervised target defined only for the labeled nodes, thereby avoiding explicit graph-based regularization in the loss function.

1 Introduction

We consider the problem of classifying nodes in a graph, where labels are available for only a small subset of nodes. The problem is framed as graph-based semi-supervised learning, where label information is smoothed over the graph via some form of explicit graph-based regularization, for example by adding a graph Laplacian regularization term to the loss function, as in formula (1):

L=L_{0}+\lambda L_{reg},\quad L_{reg}=\sum_{i,j}A_{ij}\left \| f(X_{i})-f(X_{j}) \right \|^{2}=f(X)^{T}\Delta f(X)

where (a small numerical sketch of L_{reg} follows the list below):

  • L_{0} represents the supervised loss on the labeled part of the graph
  • f(.) represents a differentiable function such as a neural network
  • λ is a weighting factor
  • X is the matrix of node feature vectors X_{i}
  • A represents the adjacency matrix of the graph
  • D_{ii}=\sum_{j}A_{ij} is the degree matrix of A, a diagonal matrix
  • Δ = D − A represents the unnormalized graph Laplacian of the undirected graph
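
To make the regularization term concrete, here is a minimal NumPy sketch (my own toy example, not code from the paper) that evaluates L_{reg} both as the quadratic form f(X)^{T}\Delta f(X) and as a sum over node pairs; the graph and the "predictions" f_X are arbitrary placeholders.

```python
import numpy as np

# Toy undirected graph with 4 nodes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))      # degree matrix D_ii = sum_j A_ij
delta = D - A                   # unnormalized graph Laplacian

# Placeholder predictions f(X): one scalar per node.
f_X = np.array([0.9, 0.8, 0.2, 0.1])

# Quadratic form f(X)^T (D - A) f(X) ...
l_reg_quadratic = f_X @ delta @ f_X
# ... equals 1/2 * sum_{i,j} A_ij * (f_i - f_j)^2 over all ordered pairs (i, j).
l_reg_pairwise = 0.5 * np.sum(A * (f_X[:, None] - f_X[None, :]) ** 2)
print(l_reg_quadratic, l_reg_pairwise)   # both give the same value
```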

The paper's contributions are twofold. First, it introduces a simple and well-behaved forward propagation rule for graph neural networks and shows how it can be motivated from a first-order approximation of spectral graph convolutions. Second, it demonstrates how this form of graph-based neural network can be used for semi-supervised classification of nodes in a graph. A drawback of the Laplacian-regularization formulation above is that it relies on the assumption that connected nodes are likely to share the same label; this assumption can limit modeling capacity, because graph edges need not encode node similarity and may instead carry additional information.

2 Fast approximate convolution on graphs

This section considers a multi-layer graph convolutional network (GCN) with the layer-wise propagation rule of formula (2):

H^{(l+1)}=\sigma (\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)})

where (a minimal sketch of one such layer is given after the list below):

  • A is the adjacency matrix of the graph
  • \tilde{A}=A+I_{N} is A with added self-loops
  • \tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij} is the degree matrix of \tilde{A}, a diagonal matrix
  • W^{(l)} is the trainable weight matrix of layer l
  • σ(.) is an activation function, such as ReLU(.) = max(0, .)
  • H^{(l)}\in R^{N\times D} is the activation matrix of layer l, i.e. the output of layer l−1, with H^{(0)}=X

It will be shown below that this form of forward propagation can be motivated by a first-order approximation of localized spectral filters on graphs.
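
The following is a minimal NumPy sketch of a single layer of this propagation rule (formula 2). The graph, feature dimensions, and weights are toy placeholders of my own choosing, and the renormalized adjacency is recomputed inside the layer purely for readability; in practice it would be precomputed once.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])            # A~ = A + I_N (add self-loops)
    d_tilde = A_tilde.sum(axis=1)               # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(d_tilde ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # D~^{-1/2} A~ D~^{-1/2}
    return np.maximum(0, A_hat @ H @ W)         # sigma(.) = ReLU(.)

# Toy graph with N = 4 nodes and D = 3 input features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.randn(4, 3)          # H^{(0)} = X
W0 = np.random.randn(3, 5)         # trainable weight matrix (random here)
H1 = gcn_layer(A, X, W0)           # H^{(1)}, shape (4, 5)
```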

2.1 Spectral convolution

We consider spectral convolutions on graphs, defined as the multiplication of a signal x\in R^{N} (one scalar per node) with a filter g_{\theta }=diag(\theta ) parameterized by \theta \in R^{N} in the Fourier domain, as in formula (3):

g_{\theta }\star x=Ug_{\theta }U^{T}x

where (a small sketch follows the list below):

  • U is the matrix of eigenvectors of the normalized graph Laplacian
  • Laplacian matrix: L=I_{N}-D^{-1/2}AD^{-1/2}=U\Lambda U^{T}
  • Λ is the diagonal matrix of eigenvalues; U^{T}x is the graph Fourier transform of x
  • g_{\theta } can be understood as a function of the eigenvalues of L, i.e. g_{\theta }(\Lambda )
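
As a concrete illustration of formula (3), here is a minimal NumPy sketch of spectral filtering via an explicit eigendecomposition; this is exactly the expensive path that the Chebyshev approximation below avoids. The graph, signal, and filter values are arbitrary assumptions of mine.

```python
import numpy as np

# Toy undirected graph and its normalized Laplacian L = I_N - D^{-1/2} A D^{-1/2}.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

lam, U = np.linalg.eigh(L)      # L = U diag(lam) U^T (costly for large graphs)

x = np.random.randn(4)          # graph signal: one value per node
theta = np.random.randn(4)      # free filter parameters in the Fourier domain

x_hat = U.T @ x                 # graph Fourier transform of x
y = U @ (theta * x_hat)         # g_theta * x = U g_theta(Lambda) U^T x
```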

Evaluating formula (3) is very expensive: multiplication with U is O(N^{2}), and computing the eigendecomposition of L in the first place may be prohibitive for large graphs. To circumvent this problem, g_{\theta }(\Lambda ) can be well approximated by a truncated expansion in Chebyshev polynomials T_{k}(x) up to order K, as in formula (4):

g_{\theta'}(\Lambda )\approx \sum_{k=0}^{K}\theta'_{k}T_{k}(\tilde{\Lambda })

where:

  • \tilde{\Lambda }=\frac{2}{\lambda _{max}}\Lambda -I_{N}, where \lambda _{max} is the largest eigenvalue of the Laplacian matrix L
  • {\theta }'\in R^{K} is a vector of Chebyshev coefficients
  • The Chebyshev polynomials are defined by the recurrence:
  1. T_{0}(x)=1
  2. T_{1}(x)=x
  3. T_{k}(x)=2xT_{k-1}(x)-T_{k-2}(x)

Going back to the convolution of a signal x with a filter g_{\theta'}, we can now write formula (5):

g_{\theta'}\star x\approx \sum_{k=0}^{K}\theta'_{k}T_{k}(\tilde{L})x

where (a sketch of this filter follows the list below):

  • \tilde{L}=\frac{2}{\lambda _{max}}L-I_{N}
  • This works because (U\Lambda U^{T})^{k}=U\Lambda ^{k}U^{T}, so a polynomial in Λ becomes the same polynomial in L
  • The expression is a K-th order polynomial in the Laplacian, so it depends only on nodes that are at most K steps away from the central node (K-th order neighborhood)
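
Below is a minimal NumPy sketch of this Chebyshev filtering (formulas 4 and 5): the terms T_{k}(\tilde{L})x are built with the recurrence T_{k}(x)=2xT_{k-1}(x)-T_{k-2}(x) applied directly to vectors, so no eigendecomposition is required. The coefficients theta and the toy graph are placeholders, and \lambda _{max} is simply taken as 2.

```python
import numpy as np

def cheb_filter(L, x, theta, lam_max=2.0):
    """Approximate g_theta' * x = sum_{k=0}^{K} theta'_k T_k(L~) x (formula 5)."""
    N = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(N)   # L~ = (2 / lambda_max) L - I_N
    T_prev, T_curr = x, L_tilde @ x             # T_0(L~)x = x,  T_1(L~)x = L~ x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        # Recurrence: T_k(L~)x = 2 L~ (T_{k-1}(L~)x) - T_{k-2}(L~)x
        T_next = 2 * (L_tilde @ T_curr) - T_prev
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out

# Usage on a toy 3-node path graph with a K = 3 filter (4 coefficients).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt
y = cheb_filter(L, np.random.randn(3), theta=np.random.randn(4))
```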

2.2 Layer-wise linear model

A neural network based on graph convolution can be constructed by stacking multiple layers of the form of formula (5). Now take K = 1, so that the filter is a linear function of the Laplacian L. The paper argues that this form of GCN can alleviate overfitting to local neighborhood structures on graphs with very wide node degree distributions, while richer features can still be extracted by stacking several such layers.

In this linear formulation of the GCN (K = 1), we further approximate \lambda _{max}\approx 2, expecting that the neural network parameters will adapt to this change of scale during training. Under these approximations, formula (5) simplifies to:

g_{\theta'}\star x\approx \theta'_{0}x+\theta'_{1}(L-I_{N})x=\theta'_{0}x-\theta'_{1}D^{-1/2}AD^{-1/2}x

Derivation:

The unnormalized Laplacian is L = D − A, where D is the degree matrix and A the adjacency matrix.

The symmetrically normalized Laplacian is L^{sym}=D^{-1/2}LD^{-1/2}=D^{-1/2}(D-A)D^{-1/2}=D^{-1/2}DD^{-1/2}-D^{-1/2}AD^{-1/2}=I_{N}-D^{-1/2}AD^{-1/2}, so with L taken to be the normalized Laplacian we have L-I_{N}=-D^{-1/2}AD^{-1/2}.


  • There are two free filter parameters, \theta'_{0} and \theta'_{1}, which can be shared over the whole graph
  • k is then the number of successive filtering operations or convolutional layers
  • Further limiting the number of parameters helps to reduce overfitting, so the paper sets a single parameter:
  • \theta =\theta'_{0}=-\theta'_{1}

With this single parameter, the filter becomes g_{\theta }\star x\approx \theta (I_{N}+D^{-1/2}AD^{-1/2})x (formula 7). The matrix I_{N}+D^{-1/2}AD^{-1/2} has eigenvalues in the range [0, 2], so repeatedly applying this operator in a deep neural network can cause numerical instabilities and exploding/vanishing gradients. To alleviate this problem, the following renormalization trick is introduced (a small eigenvalue check follows the list below):

  • I_{N}+D^{-1/2}AD^{-1/2}\rightarrow \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} (i.e. self-loops are added to the graph)
  • \tilde{A}=A+I_{N}
  • \tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}
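
The effect of the renormalization trick can be checked numerically. The small sketch below (my own check on an arbitrary toy graph, not from the paper) compares the largest eigenvalue of I_{N}+D^{-1/2}AD^{-1/2}, which reaches 2, with that of \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}, which stays at 1, so repeated application is better behaved.

```python
import numpy as np

# Arbitrary connected toy graph with 5 nodes.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
N = A.shape[0]

# Before the trick: I_N + D^{-1/2} A D^{-1/2}.
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
before = np.eye(N) + D_inv_sqrt @ A @ D_inv_sqrt

# After the trick: D~^{-1/2} A~ D~^{-1/2} with A~ = A + I_N.
A_tilde = A + np.eye(N)
Dt_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
after = Dt_inv_sqrt @ A_tilde @ Dt_inv_sqrt

print(np.linalg.eigvalsh(before).max())   # ~2.0: repeated powers can blow up
print(np.linalg.eigvalsh(after).max())    # ~1.0: better-behaved spectrum
```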

After adding the activation function σ(.), we obtain H^{(l+1)}=f(H^{(l)},A)=\sigma (\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)}), which is exactly formula (2).

This definition can be generalized to a signal X\in R^{N\times C} with C input channels (i.e. a C-dimensional feature vector for every node) and F filters or feature maps, as in formula (8):

Z=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}X\Theta

where:

  • \Theta \in R^{C\times F} is the matrix of filter parameters
  • Z\in R^{N\times F} is the convolved output signal matrix

3 Semi-supervised node classification

We have introduced a model f(X, A) that efficiently propagates information on graphs, and we now return to the problem of semi-supervised node classification. Conditioning the model f(.) on both the data X and the adjacency matrix A is expected to be especially helpful when the adjacency matrix contains information not present in X, such as citation links between documents in a citation network or relations in a knowledge graph. The overall model, a multi-layer GCN for semi-supervised learning, is shown in the figure described below:

3.1 Example

The following example considers a two-layer GCN for semi-supervised node classification on a graph with a symmetric adjacency matrix A. In a pre-processing step we first compute \hat{A}=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}; the forward model then takes the simple form of formula (9):

Z=f(X,A)=softmax(\hat{A}\,ReLU(\hat{A}XW^{(0)})W^{(1)})

The left panel of the figure is a schematic of a multi-layer GCN for semi-supervised learning, with C input channels, several hidden layers, and F feature maps in the output layer. The right panel shows a visualization of the hidden-layer activations of a two-layer GCN trained on the Cora dataset, with colors indicating document class.

Here, W^{(0)} is the weight matrix from the input layer to the hidden layer, and W^{(1)} is the weight matrix from the hidden layer to the output layer. The softmax activation function, defined as softmax(x_{i})=e^{x_{i}}/\sum_{j}e^{x_{j}}, is applied row-wise. The cross-entropy error is evaluated over all labeled examples, as in formula (10):

L=-\sum_{l\in Y_{L}}\sum_{f=1}^{F}Y_{lf}\,ln\,Z_{lf}

where Y_{L} is the set of labeled node indices. The weight matrices W^{(0)} and W^{(1)} are trained with gradient descent, and dropout is used to introduce randomness into the training process. A minimal forward-pass sketch follows below.
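
To tie the pieces together, here is a minimal NumPy sketch of the forward pass of formula (9) and the masked cross-entropy of formula (10). Everything here (graph, features, label assignments, weight values) is a made-up toy example; dropout and the gradient-descent update are deliberately omitted, so this only shows the forward computation and the loss.

```python
import numpy as np

def normalize_adj(A):
    """A_hat = D~^{-1/2} (A + I_N) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # row-wise, numerically stable
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

# Toy data: N = 5 nodes, C = 4 input features, 8 hidden units, F = 3 classes.
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                # symmetric adjacency, no self-loops
X = rng.standard_normal((5, 4))
W0, W1 = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))

# Formula (9): Z = softmax(A_hat ReLU(A_hat X W0) W1)
A_hat = normalize_adj(A)
Z = softmax(A_hat @ np.maximum(0, A_hat @ X @ W0) @ W1)

# Formula (10): cross-entropy only over the labeled set Y_L (toy labels here).
labeled, labels = [0, 2], [1, 0]
loss = -np.sum(np.log(Z[labeled, labels] + 1e-12))
print(loss)
```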

4 Summary

A Chebyshev expansion is used to build a K-th order approximate filter, which depends only on nodes at most K steps from the central node. Taking K = 1 restricts the filter to the immediate neighborhood; the further approximations \lambda _{max}\approx 2 and \theta =\theta'_{0}=-\theta'_{1} reduce the number of parameters, and the renormalization trick \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} is applied to prevent exploding/vanishing gradients, finally yielding the layer definition H^{(l+1)}=\sigma (\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)}).

Origin blog.csdn.net/DW_css/article/details/132521666