From convolutional neural network (CNN) to graph convolutional neural network (GCN) in detail

Table of contents

1. The relationship between CNN and GCN

2. Preliminary knowledge of the "graph"

3. Graph Convolutional Network (GCN)

4. Network optimization for hyperspectral image classification

5. Graph Convolutional Neural Networks in the Frequency Domain


        Recently I came across a highly cited article that uses a graph convolutional network for hyperspectral image classification. It was released in July last year and already has 300+ citations, which is remarkably fast growth for the hyperspectral classification field. When I first studied graph convolutional networks, I found that most of the available material tries to explain the core formulas purely through mathematical derivation, which creates certain obstacles to understanding, grasping, and using the network as a whole. After reading the relevant material, this article tries to understand the network from a more macroscopic perspective first and then gradually adds the mathematical content, so that the network can be understood and learned smoothly.

1. The relationship between CNN and GCN

       Traditional data sets all live in Euclidean space: pictures, video, audio and so on can be turned into matrices with regular shapes. Such data structures can be processed by convolution, because the convolution kernel is itself a very regular matrix.

Figure 1.1 Convolution process of CNN

           

But for data structures in non-Euclidean space, such as social networks, traffic network graphs, or chemical molecular structures, this kind of convolution cannot be applied.

Figure 1.2 Network of interpersonal relationships

Figure 1.3 Chemical Molecular Structure

        Each data node has N edges connecting it to other nodes, and N differs from node to node. Clearly such data structures cannot be handled by traditional convolution, because we cannot even define a convolution kernel whose size changes irregularly on demand. So what should we do if we want to learn features from this kind of data through convolution? This is where GCN, the graph convolutional neural network and the protagonist of this article, comes in. Such a network must satisfy two requirements: 1. the data must be put into a form a computer can process (a regular structure such as a matrix); 2. besides retaining the information of each node, the representation must also contain the information of the adjacent nodes.

2. Preliminary knowledge of the "graph"

       A graph neural network (GNN) processes "graphs", which are data structures like the interpersonal network and the chemical molecular structure above. They can be summarized as the graph below. A graph whose edges carry no arrows is called an undirected graph, meaning information flows both ways between two nodes; GNNs almost exclusively use this kind of undirected graph. The other kind, whose edges carry arrows and allow only one direction, is called a directed graph.

Figure 2.1 Undirected graph

        There are 4 nodes in the graph of Figure 2.1. In a "graph" we call such a node a vertex, and a connection between two vertices an "edge". The degree is the number of edges connected to a vertex; in Figure 2.1 the graph contains four vertices V1, V2, V3, V4, the degree of V3 is 2, the degree of V1 is 3, the degree of V2 is 1, and so on. Vertices that are directly connected are called first-order neighbor vertices; for example, V2, V3 and V4 are first-order neighbors of V1. Vertices separated by at least one intermediate vertex are called second-order neighbor vertices; for example, V4 is a second-order neighbor of V2, and higher-order neighbors are defined analogously.

        With these basic definitions of graphs in hand, we consider the fact that no matter how strange the structure of a graph is, it will eventually be sent to a computer for processing. So we hope to find a regular structure, such as a matrix, to represent the graph so that a computer can handle it. This matrix needs at least two ingredients. First, it must represent the relationship between the vertices, i.e. whether there is a connection, and ideally also the strength of that connection. Second, it must contain the information of each node itself, i.e. characterize what each node is.

        Following these two guidelines, we proceed step by step. To satisfy the first condition, representing the relationship between the vertices, we introduce the concept of the "adjacency matrix". Taking Figure 2.1 as an example, we want to express the relationship between V1 and (V1, V2, V3, V4), between V2 and (V1, V2, V3, V4), and so on up to V4 and everyone else. It is not hard to see that we need two dimensions.

Table 2.1 Adjacency matrix

      V1  V2  V3  V4
V1     0   1   1   1
V2     1   0   0   0
V3     1   0   0   1
V4     1   0   1   0

        The table above shows the relationship from Vi to Vj. We record directly connected vertex pairs as 1 and pairs that are not directly connected as 0, and the result is the adjacency matrix A, which expresses the relationship between the vertices very well. Furthermore, if we annotate each edge with its strength, we can refine this matrix instead of using only (0, 1), as in Figure 2.2 and the table below.

Figure 2.2 Graph with strong and weak relationships

Table 2.2 Weighted adjacency matrix

      V1  V2  V3  V4
V1     0   2   5   6
V2     2   0   0   0
V3     5   0   0   3
V4     6   0   3   0

       So far the first problem is solved: through the adjacency matrix A we can characterize the relationship between the vertices of the graph very well. Next we need to solve the second problem, characterizing the features of each vertex.

       For example, suppose V1 is a person, and some personal information is stored in this vertex, represented as numbers (an embedding) by some encoding scheme: for V1 the name code is 239, the gender code is 2, the height is 175 and the weight is 120; for V2 the name code is 542, the gender code is 1, the height is 168 and the weight is 100. Encoding every Vi in the same way gives i vectors, e.g. the vector of V1 is H1 = (239, 2, 175, 120) and the vector of V2 is H2 = (542, 1, 168, 100). We call such a vector the signal of the corresponding vertex. The i vectors are then stacked into a matrix H = (H1, H2, H3, H4)⊺, a block matrix with one row per vertex.

H1   239   2   175   120
H2   542   1   168   100
H3   937   2   188   150
H4   365   1   163    90

       Now take the first row A1 of the weighted adjacency matrix A and multiply it by H, that is:

A_{1}\cdot H = 0\times H_{1} + 2\times H_{2} + 5\times H_{3} + 6\times H_{4}

       This result is the weighted sum of the signals of V1's neighbor nodes, where the weights are the relationship-strength values provided by A. But these weights have not been normalized: the more neighbors a node has and the stronger its relationship values, the larger the result. To avoid this, we perform a normalization, dividing each weight by the sum of the relationship values of all the node's edges, so that the relationship values become true "weights" summing to 1. Mathematically, we divide each row of A by the sum of that row. This is where a new matrix D comes in: a diagonal matrix whose diagonal element in each row is the sum of the elements of the corresponding row of A, i.e. the (weighted) degree of the vertex.

D matrix

13   0   0   0
 0   2   0   0
 0   0   8   0
 0   0   0   9

        Performing the operation D^{-1}A, that is, normalizing each relationship value of A into a weight, gives:

   0   2/13   5/13   6/13
 2/2      0      0      0
 5/8      0      0    3/8
 6/9      0    3/9      0

        Now compute (D^{-1}A)_{i}\cdot H, i.e. multiply the i-th row of D^{-1}A by H (taking i = 1, the first row), which gives:

(D^{-1}A)_{1}\cdot H = \frac{1}{13}\times (0\times H_{1} + 2\times H_{2} + 5\times H_{3} + 6\times H_{4})

        Performing the same operation on every row, i.e. D^{-1}AH, every row has the same dimension and no single row's result can blow up. Now consider what this result means: it is equivalent to summing (aggregating) the signals of all the vertices around a given vertex according to their relationship weights, so the result represents the influence of the surrounding nodes on that vertex. And because everything is done with matrix operations, the data structure becomes very regular and a computer can carry out the computation.
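Continuing the NumPy sketch above (the variable names are my own), the row-normalized aggregation D^{-1}AH can be reproduced in a few lines:

```python
# Signal matrix H: one row per vertex (name code, gender code, height, weight)
H = np.array([
    [239, 2, 175, 120],
    [542, 1, 168, 100],
    [937, 2, 188, 150],
    [365, 1, 163,  90],
], dtype=float)

# Degree matrix D: diagonal of the row sums of the weighted A, i.e. diag(13, 2, 8, 9)
D = np.diag(A.sum(axis=1))

# Row-normalized propagation D^{-1} A H
H_agg = np.linalg.inv(D) @ A @ H

# The first row equals (1/13) * (2*H2 + 5*H3 + 6*H4), matching the formula above
print(H_agg[0])
```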

3. Graph Convolutional Network (GCN)

        At this point, we need to think about two questions:

        First, this influence needs an object to bear it:

        For example, suppose I have five friends: some I am close to, some less so, and their learning abilities also differ. I want to know what my grades will look like after I make these five friends. In this example, I am the object: I bear the influence of the five friends, and the final result, the consequence of making these five friends, is what I want to compute. In the calculation just now, the participation of "me" is obviously missing, for example:

(D^{-1}A)_{1}\cdot H = \frac{1}{13}\times (0\times H_{1} + 2\times H_{2} + 5\times H_{3} + 6\times H_{4})

        We can see that the signals H2, H3 and H4 of V2, V3 and V4 are weighted and summed, but the signal H1 of V1 (i.e. "me") appears nowhere. Now suppose I am a top student and these friends have no effect on me at all, or suppose my studies are poor and making friends with them improves me greatly. If we want to represent the "consequence" of making these friends, the information of "me" is essential, so we improve the adjacency matrix A a little and add the information of "me", creating a new matrix

\widetilde{A} = A + I

        At the same time, D also changes accordingly, becoming                                         

\widetilde{D} = D + I

        Here I is the identity matrix. The new calculation becomes \widetilde{D}^{-1}\widetilde{A}H, so the information of "me" is also taken into account when weighting. As for how large a share my own information should get, in principle I can set it myself, i.e.

\widetilde{A} = A + aI

        Here a is an intensity parameter, and of course D has to change accordingly. In practice, however, we do not need to worry about this, because the dimension of the "consequence" does not have to coincide with the dimension of the "score": the consequence does not have to be measured in scores and can be regarded as a new evaluation index. Under this index the weight of "me" accounts for only one part in the (self-included) degree value (for V4, for example, 1/10), but as long as the information of "me" is taken into account, the result has a physical meaning, namely the "consequence" of their influence on me, in other words what I become after being influenced by them. At this point we suddenly realize: isn't this result, to some extent, a "new me"? After such a calculation every vertex becomes its new self, so we obtain another graph. In the new graph the relationships between the vertices remain unchanged (both the edges and the relationship values), and we can run the calculation again.

        But now the new "me" already contains the information of the original vertices directly connected to me. In the example, the new V1 already contains the information of the old (V1, V2, V3, V4), and the new V4 contains the information of (V1, V3, V4); note that in the first round of calculation V4 does not contain the information of V2, because V4 and V2 are not directly connected. In the second round, since V4 and V1 share an edge and the new V1 already contains the information of the original V2, the new V4 can obtain part of the original V2's information through the new V1. So after just one round of calculation (one propagation), only the information of first-order neighbor vertices interacts; after the second propagation, the information of second-order neighbors also interacts; and by analogy, after N propagations the information of N-th order neighbors interacts. This makes it possible to compute matrices that represent the consequences of the interactions between vertices.
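To see both effects numerically, here is a continuation of the toy example (NumPy; fixing the self-loop weight at 1, i.e. \widetilde{A} = A + I, is my own choice for the illustration):

```python
# Add self-loops so that each vertex also keeps its own signal ("I" participates)
A_tilde = A + np.eye(4)
D_tilde = np.diag(A_tilde.sum(axis=1))      # equals D + I for unit self-loops
P = np.linalg.inv(D_tilde) @ A_tilde        # one round of row-normalized propagation

H_new  = P @ H       # after one propagation: self + first-order neighbor signals
H_new2 = P @ H_new   # after two propagations: second-order information leaks in

# V4 (index 3) and V2 (index 1) are not directly connected, so one round of
# propagation carries nothing from V2 to V4; after two rounds it does, via V1.
print(P[3, 1])                              # 0.0
print(np.linalg.matrix_power(P, 2)[3, 1])   # > 0
```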

       Second, the relationship value is generally obtained from the Euclidean distance between the signals of the two vertices, so it depends only on those two vertices and has nothing to do with the signals of their second-order neighbors. But clearly the neighbors of my neighbors should also have some influence on me. How can we take the influence of second-order neighbors into account?

        For example, suppose my name is V2 and I am socially withdrawn. There are four people in the class, and each vertex's signal is its social-ability value. I only know V1, not because I can chat with him, but because V1 is a social butterfly whom everyone knows. In this example, the new V2 formed after one propagation is (\widetilde{D}^{-1}\widetilde{A})_{2}\cdot H = \frac{1}{3}\times (2\times H_{1} + 1\times H_{2} + 0\times H_{3} + 0\times H_{4}) = \frac{2}{3}H_{1} + \frac{1}{3}H_{2}, so most of the new V2's signal comes from the original V1. That clearly cannot be right: how can one propagation turn a socially anxious person into a social butterfly? Obviously the propagation still has a problem. We want to take both first-order and second-order neighbors into account, but when a first-order neighbor differs greatly from me (for instance has a much larger degree), we want to attenuate its effect and keep my own signal as separate as possible from everyone else's. Two ideas come to mind: one is to keep as much of my own weight as possible during propagation and reduce the weight of others, i.e. adjust \widetilde{A}; the other is to let the propagated signal shrink when the degrees involved differ greatly, so that I remain distinguishable from others and keep my own independent characteristics. GCN follows the second idea, and the new propagation process is expressed mathematically as:

\widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}}H

        Since \widetilde{D}, and hence \widetilde{D}^{-\frac{1}{2}}, is a diagonal matrix, it can be regarded as an identity matrix that has undergone a series of elementary (scaling) transformations. Multiplying \widetilde{A} by \widetilde{D}^{-\frac{1}{2}} on the left performs row transformations on \widetilde{A}, and multiplying on the right performs column transformations. So \widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}} can be regarded as performing the following operation on each element \widetilde{A}_{ij}:

(\widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}})_{ij}=\frac{\widetilde{A}_{ij}}{\sqrt{\widetilde{D}_{ii}}\sqrt{\widetilde{D}_{jj}}}

        In this way, when a first-order neighbor vertex j has a very large degree (number of edges, i.e. number of its own first-order neighbors), the contribution of its signal is strongly attenuated after one propagation. For example, for V2:

(\widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}}H)_{2} = \frac{2}{\sqrt{3}\sqrt{14}}H_{1} + \frac{1}{3}H_{2}

        Because \frac{2}{3} > \frac{2}{\sqrt{3}\sqrt{14}}, the contribution of V1's signal to the new V2 has been attenuated. The attenuation factor is \frac{1}{\sqrt{\widetilde{D}_{ii}}\sqrt{\widetilde{D}_{jj}}}: when the two degrees are equal (in particular for the self term i = j) it reduces to the plain 1/\widetilde{D}_{ii} normalization, and the larger the gap between the degrees of i and j, the harsher the attenuation of the higher-degree neighbor's contribution. With that, the second problem is also solved.
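A quick numerical check of this attenuation, using the same toy matrices as above (again just an illustrative sketch):

```python
# Symmetric normalization D~^{-1/2} A~ D~^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_tilde)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

H_sym = A_hat @ H    # one round of symmetrically normalized propagation

# For V2 (index 1): the coefficient on H1 drops from 2/3 under row normalization
# to 2/(sqrt(3)*sqrt(14)) ~ 0.31, because V1 has a large degree.
print(P[1, 0], A_hat[1, 0])
```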

        At this point we have solved the matrix representation of the graph: the matrix contains not only each vertex's own signal but also the relationships between the vertices, and each layer of propagation generates new signals. These new signals play the same role as X in Y = XW + b in each layer of a CNN. Comparing with CNN, the value obtained by each propagation is the feature of that layer. If we want to train a convolutional network on the graph, we need to add trainable parameter matrices W and b that can capture information, and then pass the result through an activation function, in perfect correspondence with the convolutional neural network CNN. We thus arrive at the final forward propagation formula:

H^{l+1} = \sigma (\widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}}H^{l}W^{l} + b^{l})

        The above is the detailed explanation of the graph convolutional network's propagation formula. In it, \widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}} is fixed and does not change across layers. The rest is exactly the same as in a CNN: build fully connected layers, a loss function, backpropagate, and update the parameters.
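Putting the pieces together, one GCN layer as given by the propagation formula can be sketched in a few lines of NumPy (the hidden size, the random initialization and the choice of ReLU as \sigma are my own assumptions for illustration):

```python
def gcn_layer(A, H, W, b):
    """One propagation step: sigma(D~^{-1/2} A~ D~^{-1/2} H W + b)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # fixed, identical for every layer
    return np.maximum(A_hat @ H @ W + b, 0.0)    # ReLU as the activation sigma

rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(4, 8))   # 4 input features -> 8 hidden features
b1 = np.zeros(8)
H_next = gcn_layer(A, H, W1, b1)
print(H_next.shape)                  # (4, 8)
```

Stacking several such layers and ending with a classifier then mirrors the CNN pipeline described next.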

4. Network optimization for hyperspectral image classification

        The propagation of the entire network has now been introduced, so what functions does it actually realize? We consider two aspects:

        1. Unsupervised/semi-supervised: when we feed in samples, the unlabeled samples (vertices and their signals) are input into the network together with the labeled ones. At each propagation, the signals of the unlabeled vertices receive information from nearby signals, form new features, and become the "new selves" of the next layer. This is a "one takes on the color of one's company" style of propagation: after several rounds there is a clustering effect, i.e. the signals of certain vertices become more and more similar while retaining part of their original information, thus achieving clustering.

        2. Supervised: this is the hyperspectral image classification mentioned at the beginning of this article. The idea used in that paper is as follows. A traditional GCN works in the semi-supervised or unsupervised mode: the unlabeled test set is fed into the network together with the training data, and the test vertices are slowly "infected" by the information of the labeled vertices, gradually becoming similar and forming clusters. What we want instead is to use only the training set during training, and to predict labels for new samples at test time without retraining the network. So, imitating the ideas of CNN, three changes are made:

        First, add the softmax classification function.

        Second, reduce the amount of computation. The amount of computation depends on the number of samples fed in at a time. Take hyperspectral data as an example: the UP (University of Pavia) data set has more than 40,000 samples, each with 103 bands (signals). If GCN is applied directly, all samples must be input at once, so D and A are 40000×40000 matrices, H is a 40000×103 matrix, W is a 103×103 matrix, and the computational complexity is on the order of 40000×103×103, whereas the matrices updated in a CNN are mostly stacks of 3×3 and 5×5 kernels. With such a huge sample size this is extremely memory-intensive compared with a CNN. The cited paper therefore offers a solution following the batching idea of CNN: the training set of the hyperspectral image is randomly divided into small regions, 10 pixels are randomly selected within each small region to generate one graph, points are selected multiple times to ensure that all valid pixels in the region are covered, and all the graphs from one region form a batch (a rough sketch of this batching idea appears after the third change below). In this way each H is 10×103 and the complexity is on the order of 10×103×103, which greatly reduces the amount of computation.

       Third, allow samples that never took part in training to be predicted directly by the trained network. Through the second change we divide the training set into batches and train the network; at test time we only need to divide the test data into regions in the same way as the training set, generate graphs, and feed the generated graphs into the trained network. In this way samples that did not participate in training can also be predicted by the trained network.
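To make the batching idea of the second change concrete, here is a rough sketch (NumPy). The paper's exact way of connecting the 10 sampled pixels into a graph is not described above, so the Gaussian-kernel adjacency built from spectral distances below is purely my assumption for illustration:

```python
import numpy as np

def make_batch_graph(region_pixels, k=10, sigma=1.0, rng=None):
    """Sample k pixels from one spatial region and build one small weighted graph.

    region_pixels: array of shape (n_pixels, n_bands), e.g. n_bands = 103 for UP.
    Returns (A_small, H_small) with shapes (k, k) and (k, n_bands).
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(region_pixels), size=k, replace=False)
    H_small = region_pixels[idx]                      # 10 x 103 signal matrix
    # Assumed edge weights: Gaussian kernel on the spectral (Euclidean) distance
    diff = H_small[:, None, :] - H_small[None, :, :]
    A_small = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))
    np.fill_diagonal(A_small, 0.0)                    # self-loops are added later as A + I
    return A_small, H_small

# Example: region = np.random.default_rng(0).random((200, 103))
#          A_s, H_s = make_batch_graph(region)   # each H is 10 x 103, as described above
```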

      So what advantages does GCN offer over CNN? If we regard the signal of each vertex as one input sample, GCN realizes the following: when the spatial information is distributed in a non-Euclidean space, i.e. when there is no regular neighborhood relationship, the relationships between the input samples and their strengths can be taken into account by the network, whereas the correlation between the samples fed into a CNN is usually not given to the network at all. In hyperspectral imaging, although 3D-CNN and 2D-CNN do add spatial information, the spatial information of the raw hyperspectral data is extracted only within one patch. A patch sometimes contains several kinds of ground objects, yet its label is determined only by the label of the center pixel, so the following situation arises: when several pixels near the center have labels different from the center, two patches can be very similar while their labels differ, which makes training harder. In the improved GCN the input is no longer a patch but a graph containing more distant information, and each signal point in the graph is randomly selected within a certain area, which avoids the problem of graphs (patches) with different labels being too similar. As a result, the boundaries in the generated prediction map are clearer.

5. Graph Convolutional Neural Networks in the Frequency Domain

     That could have been the end of the story, but some researchers proposed explaining the graph convolutional network with frequency-domain ideas, because a product in the frequency domain corresponds to a convolution in the spatial domain, and a product is obviously easier to compute than a convolution. So the signal to be convolved and the convolution kernel are Fourier-transformed (or Laplace-transformed) into the frequency domain, multiplied, and then transformed back to the spatial domain; the result is exactly the spatial-domain convolution, i.e.

g\star f = F^{-1}[F[g]\cdot F[f]]

       where F[·] denotes the Fourier transform (or Laplace transform).

       The idea is good, but g and f in the formula are ordinary functions, which have Fourier series, so transforming them is natural. What, however, is the Fourier transform of our graph structure?

        Next, we need to explore the essence of the Fourier transform. First, let's look at the traditional Fourier series and Fourier transform:

f(x)=\sum_{n}C_{n}e^{in\omega_{0}x},\qquad C_{n}=\frac{1}{T}\int_{T}f(x)\,\overline{e^{in\omega_{0}x}}\,dx,\qquad F(\omega)=\int_{-\infty}^{+\infty}f(x)\,e^{-i\omega x}\,dx
        Here we find that F(\omega) is nothing more than the coefficient C_{n} of each basis function, and the way to get it is to multiply f(x) by the conjugate of the basis and then integrate. Moving from continuous to discrete, f(x) and the basis become arrays, and the operation becomes multiplying corresponding elements and adding them up, which is just the dot product of two vectors. Through this dot product we obtain the value of F(\omega) at one particular \omega, and repeating the dot product for different \omega gives the whole F(\omega). To summarize: the essence of the traditional Fourier transform is computing the coefficients C_{n} as inner products with an orthogonal basis.
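This "inner product with an orthogonal basis" view is easy to verify numerically in the discrete case (a small NumPy sketch; the toy signal is arbitrary):

```python
import numpy as np

N = 8
n = np.arange(N)
f = np.sin(2 * np.pi * n / N) + 0.5 * np.cos(4 * np.pi * n / N)   # a toy signal

# For each frequency k, the coefficient is the dot product of f with the
# conjugate of the basis vector e^{i 2 pi k n / N}
F = np.array([np.dot(f, np.conj(np.exp(1j * 2 * np.pi * k * n / N)))
              for k in range(N)])

print(np.allclose(F, np.fft.fft(f)))   # True: this is exactly the discrete Fourier transform
```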

        Now let's draw the analogy. In the graph, the signals of each layer are what take part in the convolution, so the "f(x)" is already there; all that remains is to find a set of orthogonal bases. The traditional Fourier transform is applied to a function, and after discretization the function becomes a one-dimensional array, so the basis is a one-dimensional function of \omega. We, however, need to perform the Fourier transform on the vertices, and the mutual influence between vertices must also be considered, so the set of orthogonal bases we are looking for must contain at least the information of the adjacency matrix. So what are the requirements for the orthogonal basis in the traditional Fourier transform? A mathematical fact enters here: the reason e^{-i\omega t} can serve as a basis function is that it satisfies the eigenvalue equation of the Laplacian operator, with \omega related to the eigenvalue. Recall the definition of the eigenvalue equation:

AV=\lambda V

         where A is a transformation, V is an eigenvector, and λ is the eigenvalue. e^{-i\omega t} satisfies:

\Delta e^{-i\omega t} = \frac{\partial ^{2}}{\partial t^{2}}e^{-i\omega t} = - \omega^{2}e^{-i\omega t}

         Here \Delta is the Laplacian operator, i.e. the second-derivative operator. In this formula, \Delta plays the role of the transformation A, e^{-i\omega t} plays the role of the eigenvector V (even though it is a single function rather than an array), and -\omega^{2} is the eigenvalue λ. So e^{-i\omega t} satisfies the Laplacian eigenvalue equation, which is why it is chosen as the basis.

        To compute the Fourier transform of each graph signal in the same way, can we also find a "Laplacian matrix" that contains the information of the adjacency matrix A? If so, it would suffice to find its eigenvectors; but this set of eigenvectors needs to be orthogonal, so it is best if the Laplacian matrix we find is a real symmetric matrix, whose eigenvectors are orthogonal and, after normalization, form an orthonormal basis.

         So the question is: does such a matrix exist for a graph structure? The answer is yes. It is called the Laplacian matrix L, computed simply as L = D − A, where D is the degree matrix and A is the adjacency matrix. As for how such a concise L realizes the function of the Laplacian operator (taking a second derivative on a graph structure), a Tsinghua author has written an excellent, accessible explanation of the meaning and intuition of the Laplacian matrix, with a very apt introductory example, well suited for GCN beginners; see the link in the original post.

         With this Laplacian matrix we can obtain its eigenvectors u_{i}, and the u_{i} together form the orthogonal basis U. The Fourier transform of a signal f is then the inner-product form with U, i.e. F[f] = U^{T}f, where F[·] is the Fourier transform, and the inverse transform turns U^{T}f back into f. Multiplying both sides of F[f] = U^{T}f by U on the left, and using the fact that U is an orthogonal matrix so that UU^{T} = E, gives the inverse transform formula f = U\,F[f].

          Now operate on the signal f and the convolution kernel h: h\star f = U((U^{T}h)\odot (U^{T}f)), where \odot denotes element-wise multiplication. The term (U^{T}h)\odot (U^{T}f) can be rewritten as

\begin{pmatrix} \theta _{1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \theta _{n} \end{pmatrix} \begin{pmatrix} \widetilde{f}_{1}\\ \vdots \\ \widetilde{f}_{n} \end{pmatrix}

where the \theta_{i} are the Fourier-transformed elements of h, n is the dimension of the signal, and the \widetilde{f}_{i} are the Fourier-transformed elements of f. Writing diag(\theta) = g_{\theta}, we have h\star f = U\,g_{\theta}\,U^{T}f.
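The whole chain — Laplacian matrix, graph Fourier transform, inverse transform and spectral convolution — can be checked on the toy weighted graph A and signal matrix H from the earlier sections (NumPy; the filter h is just a random vector for illustration):

```python
# Graph Laplacian of the weighted toy graph: L = D - A (real and symmetric)
L = np.diag(A.sum(axis=1)) - A
eigvals, U = np.linalg.eigh(L)         # columns of U form an orthonormal basis

f = H[:, 2]                            # one graph signal, e.g. the "height" channel
rng = np.random.default_rng(1)
h = rng.normal(size=4)                 # an arbitrary convolution kernel on the graph

F_f = U.T @ f                          # graph Fourier transform of f
assert np.allclose(U @ F_f, f)         # inverse transform f = U F[f] recovers f

# Spectral convolution: h * f = U((U^T h) ⊙ (U^T f)) = U diag(U^T h) U^T f = U g_theta U^T f
conv1 = U @ ((U.T @ h) * (U.T @ f))
conv2 = U @ np.diag(U.T @ h) @ U.T @ f
assert np.allclose(conv1, conv2)
```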

        At this point we notice that the number of basis functions can be very large: in the formula \Delta e^{-i\omega t} = \frac{\partial ^{2}}{\partial t^{2}}e^{-i\omega t} = - \omega^{2}e^{-i\omega t}, any rescaling of \omega still satisfies the relation, so there are infinitely many eigenfunctions, and computing so many u_{i} is far too expensive. We therefore use only one or two low-frequency components to replace or approximate the entire U. Based on this, some approximations are made, and in the case of keeping only the fundamental frequency the approximation yields a propagation formula identical to the spatial-domain one derived earlier:

H^{l+1} = \sigma (\widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}}H^{l}W^{l} + b^{l})

          During the approximation the definition of the convolution kernel h changes, and the intermediate mathematics is omitted here; if you are interested, the post linked above gives a more detailed explanation summarized by the same Tsinghua author.

        

Origin blog.csdn.net/weixin_42037511/article/details/126271782