[Graph neural network] ViG (Vision GNN), a visual graph neural network: paper reading

As usual, links first:

Paper address: https://arxiv.org/pdf/2206.00272.pdf

Git address: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch

        Compared with previous image-processing algorithms that combine a GNN with a CNN, ViG innovatively uses a GNN directly for feature extraction: it is no longer necessary to build the graph structure from CNN-extracted features. In this respect it resembles ViT; for background on ViT, see:

[Self-attention neural network] Transformer architecture: https://blog.csdn.net/weixin_37878740/article/details/129343613?spm=1001.2014.3001.5501

        Coincidentally, ViG and ViT (see Section 3 of that post) share remarkably similar core ideas.

1. Overview

        ViG consists of two modules:

                ① Grapher module: uses graph convolution to aggregate and update graph information;

                ② FFN module: uses two fully connected layers to transform node features.

        For image tasks, a CNN can only arrange pixels/patches by spatial position, and a Transformer converts the grid structure into a sequence structure; neither is flexible enough. In a GNN, nodes can be connected arbitrarily and are not constrained by the local spatial structure.

        Representing an image with a graph structure has the following advantages:

                ① A graph is a generalized data structure: both the grid structure and the sequence structure can be regarded as special cases of a graph, so the graph generalizes better.

                ② Objects in images are not necessarily regular rectangles, so modeling them with a graph gives better expressive ability.

                ③ An object can be regarded as a composition of parts, and the graph structure expresses these connections better.

        However, building a graph from an image inevitably raises problems, the most notable being the huge amount of data: treating every pixel as a node would flood the graph with nodes and edges. The paper therefore divides the image into several patches and uses these patches to construct the graph structure.

2. Network structure

        1. The graph structure of the image

                ① Divide an H\times W\times 3 image into N patches;

                ② Convert each patch into a feature vector x_i\in R^D, and stack them into a feature matrix X=[x_1,x_2,...,x_N];

These feature vectors can be viewed as a set of unordered nodes, denoted V=\{v_1,v_2,...,v_N\};

                ③ For each node v_i, find its K nearest neighbors; denote this neighbor set N(v_i). For every neighbor v_j\in N(v_i), add an edge e_{ji} directed from v_j to v_i;

                ④ Finally, the graph G=(V,\varepsilon ) is obtained from the node set V and the edge set \varepsilon.
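                To make step ③ concrete, here is a minimal PyTorch sketch of KNN graph construction over patch features. This is an illustrative sketch, not the official vig_pytorch code; the function name knn_neighbors and the 196/192/K=9 sizes are assumptions.

```python
import torch

def knn_neighbors(x: torch.Tensor, k: int = 9) -> torch.Tensor:
    """x: (N, D) node features. Returns (N, K) indices: row i holds N(v_i);
    each entry j in row i defines a directed edge e_ji from neighbor v_j to v_i."""
    dist = torch.cdist(x, x)               # (N, N) pairwise Euclidean distances
    dist.fill_diagonal_(float("inf"))      # exclude each node from its own neighbor set
    return dist.topk(k, largest=False).indices

# Example: 196 patches (a 14x14 grid) with 192-dim features, K = 9 neighbors
X = torch.randn(196, 192)
print(knn_neighbors(X, k=9).shape)         # torch.Size([196, 9])
```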

        2. Graph convolution

                Graph convolution aggregates features from neighboring nodes to exchange information between nodes; it operates on the graph G=(V,\varepsilon ) constructed above.

                ① Information aggregation

                        {G}'=F(G,\omega )=Update(Aggregate(G,W_{agg}),W_{update}), where W_{agg} and W_{update} are learnable weights.

                Refining this operation to the node level, it can be expressed as:

                        {x_i}'=h(x_i,g(x_i,N(x_i),W_{agg}),W_{update}), where N(x_i) is the set of neighbor nodes of node x_i.

                        The function g(\cdot ) is the max-relative graph convolution: {x_i}''=g(x_i)=[x_i,max(\{x_j-x_i|j\in N(x_i)\})]

                        The function h(\cdot ) is the update step: {x_i}'={x_i}''W_{update}

                Since the bias is omitted throughout, the whole operation can be written compactly as {X}'=GraphConv(X).
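                The aggregation g(\cdot ) and update h(\cdot ) above can be sketched as follows. This is a hedged sketch assuming (N, D) node features and (N, K) neighbor indices; the class name MaxRelativeGraphConv is mine, not the paper's.

```python
import torch
import torch.nn as nn

class MaxRelativeGraphConv(nn.Module):
    """Sketch of g(.) and h(.): x_i'' = [x_i, max_j(x_j - x_i)], x_i' = x_i'' W_update."""

    def __init__(self, dim: int):
        super().__init__()
        # W_update maps the concatenated feature [x_i, max(...)] from 2D back to D
        self.w_update = nn.Linear(2 * dim, dim, bias=False)  # bias omitted, as in the text

    def forward(self, x: torch.Tensor, knn_idx: torch.Tensor) -> torch.Tensor:
        """x: (N, D) node features; knn_idx: (N, K) neighbor indices N(x_i)."""
        neighbors = x[knn_idx]                  # (N, K, D) features of each node's neighbors
        relative = neighbors - x.unsqueeze(1)   # (N, K, D) differences x_j - x_i
        agg = relative.max(dim=1).values        # g(.): elementwise max over neighbors -> (N, D)
        x2 = torch.cat([x, agg], dim=-1)        # x_i'' = [x_i, max(...)] -> (N, 2D)
        return self.w_update(x2)                # h(.): x_i' = x_i'' W_update
```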

                ② Multi-head update mechanism

                        Divide the aggregated feature {x_i}'' into h heads ( head^1,head^2...head^h) and update each head with its own weights; the updated heads are then concatenated. Multi-head updating allows the model to update information in multiple representation subspaces, which benefits feature diversity.

                        {x_i}'=[head^1W^1_{update},head^2W^2_{update},...,head^hW^h_{update}]
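                        A sketch of this multi-head update, assuming the 2D-dimensional {x_i}'' is split evenly across h heads, each with its own linear weight (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadUpdate(nn.Module):
    """Sketch: split x_i'' into h heads, update each with its own W_update^m, concatenate."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert (2 * dim) % heads == 0 and dim % heads == 0
        self.heads = heads
        # One independent weight per head (bias omitted)
        self.w = nn.ModuleList(
            nn.Linear(2 * dim // heads, dim // heads, bias=False) for _ in range(heads)
        )

    def forward(self, x2: torch.Tensor) -> torch.Tensor:
        """x2: (N, 2D) aggregated features x_i''. Returns updated features (N, D)."""
        chunks = x2.chunk(self.heads, dim=-1)  # head^1, head^2, ..., head^h
        return torch.cat([w(c) for w, c in zip(self.w, chunks)], dim=-1)
```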

        3. ViG module

                A GCN that stacks many graph convolutional layers suffers from over-smoothing, which degrades visual performance (caused by the loss of feature diversity). To alleviate this problem, ViG introduces more feature transformations and nonlinear activations.

                The paper calls this GCN with nonlinear activation the Grapher module. For an input X\in R^{N \times D}, the Grapher module can be expressed as Y=\sigma(GraphConv(XW_{in}))W_{out}+X, where the activation function \sigma is typically ReLU or GELU, and the bias is omitted.
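                A minimal sketch of this formula, reusing knn_neighbors and MaxRelativeGraphConv from the sketches above; choosing GELU for \sigma and rebuilding the KNN graph inside forward are assumptions, not necessarily how the official code is organized.

```python
import torch
import torch.nn as nn

class Grapher(nn.Module):
    """Sketch of Y = sigma(GraphConv(X W_in)) W_out + X."""

    def __init__(self, dim: int, k: int = 9):
        super().__init__()
        self.w_in = nn.Linear(dim, dim, bias=False)   # W_in (bias omitted)
        self.w_out = nn.Linear(dim, dim, bias=False)  # W_out
        self.conv = MaxRelativeGraphConv(dim)         # from the sketch above
        self.act = nn.GELU()                          # sigma: ReLU or GELU per the paper
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.w_in(x)                              # X W_in
        with torch.no_grad():                         # neighbor indices need no gradient
            idx = knn_neighbors(h, self.k)            # from the earlier sketch
        y = self.act(self.conv(h, idx))               # sigma(GraphConv(.))
        return self.w_out(y) + x                      # project back, add residual X
```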

        4. FFN network (feedforward network)

                The FFN is a multilayer perceptron consisting of two fully connected layers, and can be written as:

                        Z=\sigma(YW_1)W_2+Y, where W_1 and W_2 are the weights of the two fully connected layers, Z\in R^{N\times D}, and the bias term is usually omitted. In the ViG network, every fully connected layer and graph convolutional layer is followed by batch normalization.
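                A minimal sketch of the FFN under these formulas; the expansion ratio parameter (the E of the tables below) and placing a single batch normalization after the first fully connected layer are illustrative simplifications.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Sketch of Z = sigma(Y W_1) W_2 + Y with an FFN expansion ratio E."""

    def __init__(self, dim: int, ratio: int = 4):
        super().__init__()
        hidden = dim * ratio
        self.fc1 = nn.Linear(dim, hidden, bias=False)  # W_1 (bias usually omitted)
        self.bn = nn.BatchNorm1d(hidden)               # FC layers are followed by batch norm
        self.fc2 = nn.Linear(hidden, dim, bias=False)  # W_2
        self.act = nn.GELU()

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        """y: (N, D) output of the Grapher module. Returns Z of the same shape."""
        return self.fc2(self.act(self.bn(self.fc1(y)))) + y
```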

3. Network parameter configuration

        ViG comes in two architectures: an isotropic architecture (similar to ViT) and a pyramid architecture (similar to ResNet).

        1. Isotropic structure

                In the isotropic architecture, the feature size and shape stay the same through the entire network. The paper constructs three models of different sizes: ViG-Ti, ViG-S and ViG-B. The number of nodes is N=196, the number of neighbor nodes k grows from 9 to 18 across layers (to expand the receptive field), and the number of heads is h=4; see the paper for the performance and size figures.

         2. Pyramid structure

                The pyramid structure obtains multi-scale features as layers are stacked. The paper designs four pyramid-structured ViG variants; see the table below for details.

                 In the table, D is the feature dimension, E is the hidden-dimension expansion ratio of the FFN, K is the number of neighbors (receptive field) of the GCN, and H\times W is the input image size.

                Positional encoding: to give each node positional information, an encoding vector is added to the node features: x_i\leftarrow x_i+e_i. For relative positional encoding, the term e_i^{T}e_j between node i and node j is added to the feature distance used for graph construction (cf. ViT).
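                A sketch of how the two encodings could be wired together, assuming a learnable per-node embedding pos_embed; the sign convention for folding e_i^{T}e_j into the distance follows the sentence above and may differ from the official implementation.

```python
import torch
import torch.nn as nn

num_nodes, dim = 196, 192                              # illustrative sizes
pos_embed = nn.Parameter(torch.zeros(num_nodes, dim))  # learnable e_1, ..., e_N

# Absolute positional encoding: x_i <- x_i + e_i
x = torch.randn(num_nodes, dim)
x = x + pos_embed

# Relative positional encoding: fold e_i^T e_j into the distance used for KNN
feat_dist = torch.cdist(x, x)                          # feature distance between nodes
pos_term = pos_embed @ pos_embed.t()                   # e_i^T e_j for all node pairs
dist = feat_dist + pos_term                            # distance used to pick N(v_i)
```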

4. Visualization

        It can be clearly seen in the figure that in shallow layers, neighbors tend to be selected based on low-level, local features such as color and texture, while in deep layers the neighbors of a center node are more semantic and tend to belong to the same category.

 


Source: blog.csdn.net/weixin_37878740/article/details/130124772