1. Message passing
Because a graph has "permutation invariance" (that is, reordering its nodes does not change the properties of the graph), it cannot be fed directly into a convolutional neural network. It is generally processed with message passing instead.
The message passing mechanism is implemented by constructing a computation graph over each node's local neighborhood; that is, the attributes of a node are determined by its neighbor nodes. The work of aggregating the information of these neighbors is done by a neural network, without manual feature engineering. Its form is as follows:
Each node can build its own computation graph, which can represent its structure, function, and role. During computation, each computation graph is treated as a separate training sample.
Note that the number of layers of a graph neural network is not the depth of the aggregation networks themselves, but the depth of the computation graph. Number of GNN layers = depth of the computation graph = the neighbor order (hop count) of the target node in the graph. All nodes within one layer share the same set of computation weights.
The number of layers of a graph neural network plays the role of the receptive field in a convolutional neural network. If it is too large, it may lead to over-smoothing (all nodes in the graph end up outputting nearly the same embedding).
2. Graph convolutional neural network (GCN)
1. Computing unit
The graph convolutional network is based on message passing. The usual computation is to average the attribute features of the neighbor nodes element-wise (this is order-independent; an element-wise maximum or sum also works), and then feed the resulting vector into the neuron.
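As a sketch of the order-independence mentioned above, the common aggregators (mean, max, sum) all give the same result no matter how the neighbors are ordered. The toy feature values below are made up for illustration:

```python
import numpy as np

# Attribute vectors of three neighbor nodes (hypothetical toy values).
neighbors = np.array([[1.0, 2.0],
                      [3.0, 0.0],
                      [5.0, 4.0]])

mean_agg = neighbors.mean(axis=0)   # element-wise average
max_agg  = neighbors.max(axis=0)    # element-wise maximum
sum_agg  = neighbors.sum(axis=0)    # element-wise sum

# Shuffling the neighbor order leaves every aggregate unchanged,
# which is exactly the order-independence the text describes.
shuffled = neighbors[[2, 0, 1]]
assert np.allclose(shuffled.mean(axis=0), mean_agg)
assert np.allclose(shuffled.max(axis=0), max_agg)
assert np.allclose(shuffled.sum(axis=0), sum_agg)
```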
2. Mathematical representation
The layer-$(k+1)$ embedding of a node is computed from the layer-$k$ embeddings of its neighborhood (the embeddings of the neighboring nodes are summed, divided by the node's degree); the formula can be written as:

$h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} \right)$

where $\sigma$ is the activation function and $W_k$ is the layer-$k$ weight matrix.

The 0th-layer attribute feature of a node is the node's own feature vector: $h_v^{(0)} = x_v$.

The embedding vector output by the network is $z_v = h_v^{(K)}$, where $K$ is the number of layers.
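The per-node update above can be sketched in NumPy (a minimal illustration, not a trained model; the ReLU activation is an assumed choice for $\sigma$, and every node is assumed to have at least one neighbor):

```python
import numpy as np

def gcn_node_update(h, adj, W):
    """One mean-aggregation GCN layer, computed node by node.
    h: (N, d_in) layer-k embeddings, adj: (N, N) adjacency, W: (d_in, d_out).
    Assumes every node has at least one neighbor."""
    h_next = np.zeros((h.shape[0], W.shape[1]))
    for v in range(h.shape[0]):
        nbrs = np.flatnonzero(adj[v])           # N(v)
        msg = h[nbrs].sum(axis=0) / len(nbrs)   # sum over neighbors / |N(v)|
        h_next[v] = np.maximum(msg @ W, 0.0)    # sigma = ReLU
    return h_next
```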
3. Matrix representation
① Record the embeddings of all nodes in layer $k$ as the matrix $H^{(k)} = \left[ h_1^{(k)} \cdots h_{|V|}^{(k)} \right]^T$; each node's embedding is one row of this matrix.
② Left-multiply this matrix by the adjacency matrix $A$: this selects the neighbor nodes of each node (corresponding to the summation in the formula above), i.e. $A H^{(k)}$ sums each node's neighbor embeddings.
③ Define the degree matrix $D$, a diagonal matrix of node degrees: $D_{vv} = \deg(v)$.
Its inverse is the diagonal matrix of reciprocal degrees: $D^{-1}_{vv} = \frac{1}{\deg(v)}$.
After the above steps, the update formula can be expressed as $H^{(k+1)} = \sigma\left( D^{-1} A H^{(k)} W_k \right)$.
However, computed this way, each node normalizes only by its own degree and ignores the degree of the other endpoint (regardless of the quality of a connection, information from all edges is forcibly averaged with equal weight). The formula can be improved: $D^{-1}A \rightarrow D^{-1}AD^{-1}$, so that the result is a symmetric matrix that takes into account both the node's own degree and the other endpoint's degree.
This improved matrix shrinks the magnitude of the embedding vectors: its eigenvalues lie in $(-1, 1)$. To counter this, the formula can be improved once more: $D^{-1}AD^{-1} \rightarrow D^{-1/2}AD^{-1/2}$, so that the largest eigenvalue of the resulting matrix is exactly 1.
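The eigenvalue claim can be checked numerically. The sketch below builds $D^{-1/2} A D^{-1/2}$ for a small made-up path graph and confirms that its largest eigenvalue is 1:

```python
import numpy as np

# Adjacency matrix of a small connected toy graph (a 4-node path).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)                      # node degrees
D_inv_sqrt = np.diag(d ** -0.5)
A_sym = D_inv_sqrt @ A @ D_inv_sqrt    # D^{-1/2} A D^{-1/2}

eigs = np.linalg.eigvalsh(A_sym)       # symmetric matrix => real eigenvalues
max_eig = eigs.max()                   # equals 1.0 up to floating-point error
```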
Finally, this matrix is recorded as $\tilde{A} = D^{-1/2} A D^{-1/2}$. If two nodes $u$ and $v$ are connected, the corresponding entry is $\tilde{A}_{uv} = \frac{1}{\sqrt{d_u d_v}}$, which can be read as the weight of that connection (where $d_u$ and $d_v$ are the degrees of nodes $u$ and $v$).

The same matrix also appears in the normalized graph Laplacian: $L = I - D^{-1/2} A D^{-1/2}$.

The update formula can then be written as $H^{(k+1)} = \sigma\left( D^{-1/2} A D^{-1/2} H^{(k)} W_k \right)$, which represents one layer of a GCN; the learnable parameters are the weight matrices $W_k$.
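One GCN layer in this matrix form can be sketched as follows (assuming ReLU as the activation $\sigma$ and a graph with no isolated nodes):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer in matrix form: sigma(D^{-1/2} A D^{-1/2} H W).
    H: (N, d_in) embeddings, A: (N, N) adjacency, W: (d_in, d_out)."""
    d = A.sum(axis=1)                   # node degrees (assumed nonzero)
    D_inv_sqrt = np.diag(d ** -0.5)
    return np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ H @ W, 0.0)  # sigma = ReLU
```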
4. Improving the computation graph
In the method above, a node is described only through its neighbors, which cannot reflect the node's own features. The improvement is to add a self-loop to every node (a connection from each node to itself).

After this improvement, the adjacency matrix becomes $\tilde{A} = A + I$ (the original matrix plus the identity matrix, so the diagonal entries are all 1).
The neural network update can then be written as:

$h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v) \cup \{v\}} \frac{h_u^{(k)}}{|N(v)|} \right)$

(one formula that contains both the original adjacency matrix and the identity matrix). It can also be split into two terms:

$h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + W_k h_v^{(k)} \right)$

(the first term is the transformation coming from the original adjacency matrix, the second is the transformation coming from the identity matrix).
For a further improvement, two sets of weights can be used (one set $W_k$ to aggregate neighbor information and one set $B_k$ for the self-loop), written as:

$h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k h_v^{(k)} \right)$

When $B_k = I$, the second term becomes an identity mapping, which is exactly a residual connection.
!!! The final simplified matrix form is: $H^{(k+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(k)} W_k \right)$; where $\tilde{A} = A + I$ and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$.
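The final renormalized form can be sketched the same way (again assuming ReLU for $\sigma$ and dense NumPy arrays; a real implementation would use sparse matrices):

```python
import numpy as np

def gcn_layer_renorm(H, A, W):
    """One renormalized GCN layer:
    H' = sigma(D~^{-1/2} (A + I) D~^{-1/2} H W), with D~ built from A + I."""
    A_tilde = A + np.eye(A.shape[0])    # add a self-loop to every node
    d = A_tilde.sum(axis=1)             # degrees of A~ (always >= 1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return np.maximum(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W, 0.0)
```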
3. GCN training
1. Supervised learning
Loss function: $\min_{\theta} \mathcal{L}\big(y, f(z_v)\big)$, where $f$ is the classification/regression prediction head and $y$ is the node's label.

Cross-entropy loss function (written here for binary classification, with $\sigma$ the sigmoid function): $\mathcal{L} = -\sum_{v} \left[ y_v \log \sigma\big(f(z_v)\big) + (1 - y_v) \log \big(1 - \sigma(f(z_v))\big) \right]$
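A minimal sketch of this binary cross-entropy over labeled nodes (the probabilities are assumed to already come from a sigmoid prediction head):

```python
import numpy as np

def node_cross_entropy(probs, labels):
    """Mean binary cross-entropy over labeled nodes.
    probs:  predicted probability of the positive class per node,
    labels: 0/1 ground-truth node labels."""
    eps = 1e-12  # numerical guard against log(0)
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))
```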
The input of a GCN is a graph, and its output is also a graph, but the nodes of the output graph carry embeddings with semantic information; these output embeddings are low-dimensional, continuous, and dense.
The embedding vectors can be fed into the classification head for classification. When they are projected onto a two-dimensional space, nodes of different classes can be seen to separate further and further as training iterates.
2. Unsupervised / self-supervised learning
Similar to DeepWalk/Node2vec, this uses the connectivity structure of the graph itself; the goal of training is to make the embedding vectors of connected nodes as close as possible.

Loss function: $\mathcal{L} = \sum_{z_u, z_v} \mathrm{CE}\big(y_{u,v}, \mathrm{DEC}(z_u, z_v)\big)$, where $y_{u,v} = 1$ means that nodes $u$ and $v$ are similar (for example, connected by an edge).
Generally, an "encoder-decoder" architecture is adopted: the encoder embeds each node of the graph into a vector, and the decoder computes the similarity between two vectors.
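A common decoder choice, sketched below, is the inner product of the two embeddings passed through a sigmoid, read as the probability that an edge exists (the embedding values are made up for illustration):

```python
import numpy as np

def inner_product_decoder(z_u, z_v):
    """Decoder: similarity of two node embeddings as sigmoid(z_u . z_v),
    interpreted as the probability that edge (u, v) exists."""
    return 1.0 / (1.0 + np.exp(-np.dot(z_u, z_v)))

# Embeddings produced by some encoder (hypothetical values):
z_u = np.array([1.0, 0.5])
z_v = np.array([0.8, 0.6])    # similar to z_u  -> high edge probability
z_w = np.array([-1.0, -0.5])  # dissimilar      -> low edge probability
```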
4. Advantages of GCN
Compared with traditional machine-learning methods based on random walks:
① All computation graphs in a GCN share the same set of weights, so the parameter count is smaller.
② GCN is inductive and generalizes well (it can generalize to new nodes and even to new graphs, i.e. transfer learning).
③ It makes use of node attribute features, structural and functional roles, and label information.
④ Its fitting ability is strong, and the embedding vectors it produces are of high quality.
5. Comparing CNN and Transformer
1. Compared with CNN
A CNN can be regarded as a special GCN: a 3×3 convolution aggregates the information of the 8 neighboring pixels together with the target pixel (9 pixels in total). Its mathematical formula can be written as $h_v^{(k+1)} = \sigma\left( \sum_{u \in N(v)} W_u^{(k)} h_u^{(k)} + B_k h_v^{(k)} \right)$, where each neighbor position $u$ has its own weight. A CNN can thus be regarded as a GCN with a fixed neighborhood size and a fixed neighbor ordering.
But there are the following differences between the two:
① A CNN does not have permutation invariance: shuffling the order of the pixels changes the network's output.
② A GCN's "convolution kernel" is predefined by the normalized adjacency matrix and does not need to be learned (only the feature-transform weights are learned), while the kernel weights of a CNN must be learned.
2. Compared with Transformer
The Transformer is built on the self-attention mechanism; the purpose of training is to let the elements of a sequence influence one another.

A Transformer can therefore be regarded as a GCN operating on a fully connected graph: every token attends to every other token.