[Graph Neural Network] Graph Convolutional Network (GCN)

1. Message passing

        Since a graph has "transformation invariance" (that is, rearranging the spatial layout of the graph does not change the graph's properties), it cannot be fed directly into a convolutional neural network. Instead, message passing is generally used.

        The message-passing mechanism is implemented by constructing a computation graph over local neighborhoods; that is, a node's attributes are determined by its neighbor nodes. Gathering the information of these neighbors is done by a neural network, without hand-crafted rules.

        Each node builds its own computation graph, which reflects its structure, function, and role. During computation, each computation graph is treated as a separate training sample.

        Note that the number of layers of a graph neural network is not the number of layers of an ordinary neural network, but the depth of the computation graph: number of GNN layers = depth of the computation graph = the order of the neighborhood of the target node. Nodes in the same layer share one set of weights.

                        

         The number of layers k of a graph neural network can be regarded as the receptive field of a convolutional neural network. If k is too large, it may lead to over-smoothing (all nodes end up with nearly the same embedding).

2. Graph convolutional neural network

        1. Computing unit

                The graph convolutional network is based on message passing. The usual computation is to average the attribute features of the neighbor nodes element-wise (order-independent; sum or max can also be used), and then feed this vector into a neuron.
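
                As a small illustration, the sketch below (plain NumPy on a hypothetical 4-node toy graph, not taken from the text) averages the neighbors' feature vectors element-wise; swapping `np.mean` for `np.sum` or `np.max` gives the other order-independent variants mentioned above.

```python
import numpy as np

# Hypothetical toy graph: adjacency list of a 4-node graph (illustrative assumption)
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
X = np.random.randn(4, 8)           # one 8-dim attribute vector per node

def aggregate(v, reducer=np.mean):
    """Element-wise, order-independent aggregation of node v's neighbor features."""
    msgs = X[neighbors[v]]           # stack the neighbors' feature vectors
    return reducer(msgs, axis=0)     # mean / sum / max over the neighbor axis

agg = aggregate(2)                   # averaged message that would be fed into the neuron
print(agg.shape)                     # (8,)
```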

        2. Mathematical representation

                The embedding of node v at layer k+1 is computed from its neighborhood at layer k (the embeddings of the neighbors u are summed and divided by the degree of v); the formula can be written as:

                        h^{(k+1)}_v=\sigma\left(W_k\sum_{u\in N(v)} \frac{h^{(k)}_u}{|N(v)|}\right)       where \sigma is the activation function and W_k is the weight matrix

                The layer-0 feature of node v is its own attribute vector: h_v^{(0)}=x_v

                The embedding vector output by the network is z_v = h_v^{(K)}, where K is the number of layers of the network
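
                Putting the pieces above together, here is a minimal NumPy sketch of the per-node update h^{(k+1)}_v=\sigma(W_k\sum_{u\in N(v)} h^{(k)}_u/|N(v)|), with h_v^{(0)}=x_v and z_v=h_v^{(K)}; the toy graph and random weights are assumptions for illustration only.

```python
import numpy as np

neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}   # hypothetical toy graph
X = np.random.randn(4, 8)                                   # h_v^{(0)} = x_v
relu = lambda x: np.maximum(x, 0)                           # sigma

def gcn_layer_nodewise(H, W):
    """One message-passing layer: average neighbor embeddings, then transform."""
    out = np.zeros((H.shape[0], W.shape[1]))
    for v, nbrs in neighbors.items():
        msg = H[nbrs].sum(axis=0) / len(nbrs)   # sum_u h_u / |N(v)|
        out[v] = relu(msg @ W)                  # sigma(W_k * message)
    return out

K = 2
H = X
for k in range(K):
    H = gcn_layer_nodewise(H, np.random.randn(H.shape[1], 8) * 0.1)
Z = H                                            # z_v = h_v^{(K)}
```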

        3. Matrix representation

                ① Stack the embeddings of all nodes at layer k into a matrix H^{(k)}=[h_1^{(k)} \dots h_{|V|}^{(k)}]^T, in which each node's embedding is one row

                ② Left-multiplying this matrix by the adjacency matrix A sums the neighbors of each node v: \sum_{u \in N(v)}h_u^{(k)}=(AH^{(k)})_v (this corresponds to the summation in the formula above)

                ③ Define the degree matrix D, a diagonal matrix whose entries are the node degrees: D_{vv}=\deg(v)=|N(v)|

                         Its inverse is also diagonal, with the reciprocal degrees on the diagonal: D^{-1}_{vv}=\frac{1}{|N(v)|}

                After the above steps, the formula \sum_{u\in N(v)} \frac{h^{(k)}_u}{|N(v)|} can be written in matrix form as D^{-1}AH^{(k)}
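
                A quick check of this equivalence (NumPy, on a hypothetical 4-node graph): D^{-1}AH^{(k)} reproduces the per-node neighbor sums divided by the degrees.

```python
import numpy as np

# Hypothetical undirected 4-node graph (illustrative assumption)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.randn(4, 8)                      # H^{(k)}: one embedding per row

deg = A.sum(axis=1)                            # |N(v)| for every node
D_inv = np.diag(1.0 / deg)                     # D^{-1}

matrix_form = D_inv @ A @ H                    # D^{-1} A H^{(k)}
nodewise = np.stack([H[A[v] > 0].sum(0) / deg[v] for v in range(4)])
print(np.allclose(matrix_form, nodewise))      # True: both give sum_u h_u / |N(v)|
```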

                However, computed this way, D^{-1} only accounts for node v's own degree and ignores the neighbor's degree (information arriving over every edge is averaged with equal weight, regardless of the quality of the connection). The formula can therefore be improved from D^{-1}A to D^{-1}AD^{-1}, which yields a symmetric matrix that takes into account both the node's own degree and the neighbor's degree.

                This improvement shrinks the magnitude of the resulting vectors (the eigenvalues of D^{-1}AD^{-1} lie in (-1,1)). To compensate, the formula can be further improved from D^{-1}AD^{-1} to D^{-\frac{1}{2}}AD^{-\frac{1}{2}}, whose largest eigenvalue equals 1.
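
                A small numeric check of the two normalizations (NumPy, same hypothetical connected toy graph as above): the eigenvalues of D^{-1}AD^{-1} stay well below 1 in magnitude, while the largest eigenvalue of D^{-\frac{1}{2}}AD^{-\frac{1}{2}} is 1.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)      # hypothetical connected toy graph
deg = A.sum(axis=1)
D_inv      = np.diag(1.0 / deg)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

both_sides = D_inv @ A @ D_inv                 # D^{-1} A D^{-1}: shrinks embeddings
symmetric  = D_inv_sqrt @ A @ D_inv_sqrt       # D^{-1/2} A D^{-1/2}

print(np.max(np.abs(np.linalg.eigvals(both_sides))))   # noticeably below 1
print(np.max(np.linalg.eigvalsh(symmetric)))           # 1.0 for a connected graph
```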

                Finally, this matrix is denoted \tilde{A}=D^{-\frac{1}{2}}AD^{-\frac{1}{2}}. If there is an edge between nodes i and j, the corresponding entry is \tilde{A}_{ij}=\frac{1}{\sqrt{d_i}\sqrt{d_j}}, which can be read as the connection weight (where d_i and d_j are the degrees of nodes i and j)

                         The matrix \tilde{A} can also be used to compute the normalized Laplacian matrix L = I - \tilde{A}

                One GCN layer can then be written as: H^{(k+1)}=\sigma(D^{-\frac{1}{2}}AD^{-\frac{1}{2}}H^{(k)}W^{(k)})=\sigma(\tilde{A}H^{(k)}W^{(k)}); the learnable parameters are the weights W^{(k)}
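
                As a minimal sketch of one such layer (NumPy, random weights, hypothetical toy graph; not a full GCN implementation):

```python
import numpy as np

def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """H^{(k+1)} = sigma(A_norm H^{(k)} W^{(k)}), with sigma = ReLU."""
    return np.maximum(A_norm @ H @ W, 0)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)      # hypothetical toy graph
H = np.random.randn(4, 8)
W = np.random.randn(8, 16) * 0.1
H_next = gcn_layer(normalize(A), H, W)          # shape (4, 16)
```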

        4. Improving the computation graph

                In the method above, a node is described only through its neighbors, so the node's own features are not reflected. The improvement is to add a self-loop to every node (an edge from each node back to itself).

                 After this improvement, the adjacency matrix becomes \tilde{A}=A+I (the original adjacency matrix plus the identity matrix, so all diagonal entries are 1)

                The final layer expression H^{(k+1)}=\sigma(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(k)}W^{(k)}) (where \tilde{D} is the degree matrix of \tilde{A}) can be written entry-wise as:

                        H^{(k)}_i=\sigma\left(\sum_j \frac{\tilde{A}_{ij}}{\sqrt{\tilde{D}_{ii}}\sqrt{\tilde{D}_{jj}}}H_j^{(k-1)}W^{(k)}\right) (a single formula containing both the original adjacency matrix and the identity matrix)

                It can also be split into two terms:

                        H^{(k)}_i=\sigma\left(\sum_j \frac{A_{ij}}{\sqrt{\tilde{D}_{ii}}\sqrt{\tilde{D}_{jj}}}H^{(k-1)}_jW^{(k)}+\frac{1}{\tilde{D}_{ii}}H^{(k-1)}_iW^{(k)}\right) (the first term comes from the original adjacency matrix, the second from the identity matrix)

                As a further improvement, two sets of weights can be used (one to aggregate neighbor information, one for the node's own self-loop information), written as:

                        h^{(k+1)}_v=\sigma\left(W_k\sum_{u\in N(v)}\frac{h_u^{(k)}}{|N(v)|}+B_kh_v^{(k)}\right)

                        When B_k=I, the second term becomes an identity mapping, i.e., a residual connection.

        !!! The final simplified matrix form is: H^{(k+1)}=\sigma(\tilde{A}H^{(k)}W_k^T+H^{(k)}B_k^T), where \tilde{A}=D^{-\frac{1}{2}}AD^{-\frac{1}{2}}
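
        A minimal NumPy sketch of this final form (random W_k and B_k, hypothetical toy graph; degrees taken from A itself, as in the formula above):

```python
import numpy as np

def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (degrees taken from A itself)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer_selfloop(A_norm, H, W_k, B_k):
    """H^{(k+1)} = sigma(A_norm H W_k^T + H B_k^T): neighbor term + self term."""
    return np.maximum(A_norm @ H @ W_k.T + H @ B_k.T, 0)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)      # hypothetical toy graph
H   = np.random.randn(4, 8)
W_k = np.random.randn(16, 8) * 0.1             # aggregates neighbor information
B_k = np.random.randn(16, 8) * 0.1             # transforms the node's own embedding
H_next = gcn_layer_selfloop(normalize(A), H, W_k, B_k)   # shape (4, 16)
```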

3. GCN training

        1. Supervised Learning

                Loss function: \min\, l(y,f(z_v)), where f is the classification/regression prediction head and y is the node label information

                Cross-entropy loss function: l=-\sum_v \left[y_v\log(\sigma(z_v^T\theta))+(1-y_v)\log(1-\sigma(z_v^T\theta))\right]
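
                As a sketch of this loss (NumPy, with hypothetical node labels and a single logistic-regression head \theta as an assumption):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

Z = np.random.randn(4, 16)                      # node embeddings z_v from the GCN
theta = np.random.randn(16)                     # prediction-head parameters (assumed)
y = np.array([0, 1, 1, 0])                      # hypothetical binary node labels

p = sigmoid(Z @ theta)                          # sigma(z_v^T theta)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # cross-entropy
```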

                The input of a GCN is a graph, and the output is also a graph, but the nodes of the output graph carry embeddings with semantic information; the output embeddings are low-dimensional, continuous, and dense.

                The embedding vectors can then be fed into the classification head; when projected into two-dimensional space, nodes of different categories become more and more clearly separated as training iterates.


        2. Unsupervised/self-supervised learning

                Similar to DeepWalk/Node2vec, the connection structure of the graph itself is used; the goal of training is to make the embedding vectors of connected nodes as close as possible.

                Loss function: l=\sum CE(y_{u,v},DEC(z_u,z_v)), where y_{u,v}=1 means that nodes u and v are similar.

                Generally an "encoder-decoder" architecture is adopted: the encoder embeds the graph nodes into vectors, and the decoder computes the similarity between two vectors.
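
                A minimal sketch of this encoder-decoder idea (NumPy; the encoder is abstracted away as precomputed embeddings Z, the decoder is an inner product followed by a sigmoid, and y_{u,v} is taken from the adjacency matrix; all of this is illustrative, not the blog's exact setup):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)      # y_{u,v}=1 when u and v are connected
Z = np.random.randn(4, 16)                      # encoder output: one embedding per node

scores = sigmoid(Z @ Z.T)                       # decoder: similarity of every node pair
eps = 1e-9                                      # numerical safety for log
loss = -np.mean(A * np.log(scores + eps) + (1 - A) * np.log(1 - scores + eps))
```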

4. Advantages of GCN

        Compared with traditional machine-learning methods based on random walks:

                ① All computation graphs in a GCN share the same weights, so the number of parameters is smaller

                ② GCN is inductive and generalizes well (it can generalize to new nodes and even new graphs, i.e., transfer learning)

                ③ It makes use of node attribute features, structural/functional roles, and label information

                ④ Its fitting ability is strong, and the resulting embedding vectors are of high quality

5. Comparing CNN and Transformer

        1. Compared with CNN

                A CNN can be regarded as a special GCN: a convolution aggregates the information of the target pixel and its neighboring pixels (9 pixels in a 3×3 window). Its formula can also be written as h_v^{(l+1)}=\sigma\left(\sum_{u\in N(v)} W_l^u h_u^{(l)}+B_lh_v^{(l)}\right), i.e., a CNN is a GCN with a fixed neighborhood size and a fixed ordering.

                 But there are the following differences between the two:

                        ① A CNN does not have transformation invariance: disrupting the order of the pixels changes the output of the network.

                        ② The "convolution kernel" of a GCN, \tilde{A}, is predefined by the graph structure and does not need to be learned, whereas the convolution-kernel weights of a CNN must be learned.

        2. Compared with Transformer

                The Transformer is built on the self-attention mechanism; its training lets the elements of a sequence influence one another.

                 A Transformer can be regarded as a GCN operating on a fully connected graph.
