As usual, the links first:
Paper: https://arxiv.org/pdf/2206.00272.pdf Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
Compared with previous image-processing algorithms that combine a GNN with a CNN, ViG innovatively uses a GNN directly for feature extraction: it no longer needs CNN-extracted features to construct the graph structure, which makes it similar in spirit to ViT. For background on ViT, see:
[Self-attention neural networks] The Transformer architecture https://blog.csdn.net/weixin_37878740/article/details/129343613?spm=1001.2014.3001.5501 (ViT is covered in Section 3). Coincidentally, the two share very similar ideas.
1. Overview
ViG consists of two modules:
① Grapher module: uses graph convolution to aggregate and update graph information
② FFN module: uses two fully connected layers to transform node features
For image tasks, a CNN can only arrange pixels/patches by spatial position, and a Transformer converts the grid structure into a sequence, which is obviously not flexible enough; in a GNN, by contrast, nodes can be connected arbitrarily, unconstrained by the local spatial structure.
Representing an image with a graph structure has the following advantages:
① A graph is a generalized data structure: both the grid and the sequence can be regarded as special cases of a graph, so the graph generalizes better.
② Objects in an image are not necessarily regular rectangles, so modeling with a graph gives better expressive power.
③ An object can be regarded as a combination of parts, and the graph structure expresses these connections naturally.
However, representing an image with a graph inevitably raises some problems, the most notable being the huge amount of data: if every pixel is treated as a node, the graph ends up with an enormous number of nodes and edges. In the paper, the image is therefore divided into several patches, and these patches are used for the subsequent graph construction.
2. Network structure
1. The graph structure of the image
① Divide the image into N patches;
② Convert each patch into a feature vector x_i ∈ R^D, and stack them into a feature matrix X = [x_1, x_2, ..., x_N];
These feature vectors can be viewed as a set of unordered nodes, denoted V = {v_1, v_2, ..., v_N};
③ For each node v_i, find its K nearest neighbours, denote this neighbour set N(v_i), and for every neighbour v_j ∈ N(v_i) add a directed edge e_ij from v_j to v_i;
④ Finally, the graph G = (V, E) is obtained from the node set V and the edge set E.
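The steps above can be sketched in PyTorch. This is a minimal illustration, not the repository's implementation: the Euclidean distance metric, the self-exclusion detail, and all shapes are assumptions.

```python
import torch

def build_knn_graph(x, k):
    """Build a K-nearest-neighbour graph over patch features.

    x: (N, D) tensor of patch feature vectors (the unordered nodes).
    Returns an (N, k) tensor of neighbour indices: row i lists the
    neighbours v_j that send an edge e_ij to node v_i.
    """
    # Pairwise Euclidean distances between all node features.
    dist = torch.cdist(x, x)                       # (N, N)
    # Exclude each node from its own neighbour set (assumption).
    dist.fill_diagonal_(float("inf"))
    # Each node's k nearest neighbours.
    knn_idx = dist.topk(k, largest=False).indices  # (N, k)
    return knn_idx

# Example: N = 196 patches (a 14x14 grid) with hypothetical 64-dim features, K = 9.
features = torch.randn(196, 64)
neighbours = build_knn_graph(features, k=9)
```

The node set is implicit in the row order of `features`; the edge set is encoded by the index matrix, which is all the graph convolution below needs.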
2. Graph convolution
Graph convolution aggregates the features of neighbouring nodes to exchange information between nodes; it operates on the graph G built above.
① Information aggregation
G' = F(G, W) = Update(Aggregate(G, W_agg), W_update), where W_agg and W_update are learnable weights.
Refined to the node level, this can be expressed as:
x'_i = h(x_i, g(x_i, N(x_i), W_agg), W_update), where N(x_i) is the set of neighbour nodes of x_i.
The aggregation function g(·) is the max-relative graph convolution:
g(·): x''_i = [x_i, max({x_j − x_i | j ∈ N(x_i)})]
The update function h(·) is expressed as:
h(·): x'_i = x''_i W_update
And since the bias is omitted throughout, the whole process can be written compactly as X' = GraphConv(X).
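The max-relative aggregation and the linear update can be sketched as follows, assuming PyTorch tensors, a precomputed neighbour-index matrix, and illustrative shapes (the weight here is a random stand-in for a learned W_update):

```python
import torch

def max_relative_graph_conv(x, knn_idx, w_update):
    """Max-relative graph convolution: g(.) then h(.), bias omitted.

    x: (N, D) node features; knn_idx: (N, K) neighbour indices;
    w_update: (2D, D') update weight matrix.
    """
    neigh = x[knn_idx]                                    # (N, K, D) neighbour features
    # g(.): element-wise max over the relative features x_j - x_i,
    # concatenated with the node's own feature x_i.
    rel_max = (neigh - x.unsqueeze(1)).max(dim=1).values  # (N, D)
    agg = torch.cat([x, rel_max], dim=-1)                 # (N, 2D)
    # h(.): linear update x'_i = x''_i W_update.
    return agg @ w_update

x = torch.randn(196, 64)
idx = torch.randint(0, 196, (196, 9))   # placeholder neighbour indices
w = torch.randn(128, 64)                # stand-in for learned W_update
out = max_relative_graph_conv(x, idx, w)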
② Multi-head update mechanism
Split the aggregated feature x''_i into h heads (head^1, head^2, ..., head^h) and update each head with its own weight; once all heads are updated, concatenate the results: x'_i = [head^1 W^1_update, ..., head^h W^h_update]. Multi-head updating lets the model process information in multiple representation subspaces, which benefits feature diversity.
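The multi-head update can be sketched like this (a minimal illustration; the per-head weight shapes are assumptions chosen so the concatenated output has a convenient size):

```python
import torch

def multi_head_update(agg, weights):
    """Split aggregated features into h heads, update each head with its
    own weight matrix, and concatenate the results.

    agg: (N, 2D) aggregated features; weights: list of h (2D/h, D'/h) matrices.
    """
    h = len(weights)
    heads = agg.chunk(h, dim=-1)  # h tensors of shape (N, 2D/h)
    return torch.cat([hd @ w for hd, w in zip(heads, weights)], dim=-1)

agg = torch.randn(196, 128)                   # aggregated features x''_i
ws = [torch.randn(32, 16) for _ in range(4)]  # h = 4 stand-in head weights
out = multi_head_update(agg, ws)
```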
3. ViG module
A GCN with many stacked graph convolutional layers suffers from over-smoothing, which reduces the diversity of node features and hence degrades visual performance. To alleviate this problem, ViG introduces more feature transformations and nonlinear activations.
This paper calls the GCN with nonlinear activation the Grapher module. For an input X, the Grapher module can be expressed as: Y = σ(GraphConv(X W_in)) W_out + X, where the activation function σ is usually ReLU or GeLU, and the bias is omitted.
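The Grapher formula can be sketched as a function; for brevity the graph convolution is passed in as a callable (an identity stand-in below), GeLU is picked as σ, and all weights are random placeholders:

```python
import torch
import torch.nn.functional as F

def grapher(x, w_in, w_out, graph_conv):
    """Grapher module: Y = sigma(GraphConv(X W_in)) W_out + X, bias omitted.

    graph_conv: any (N, D_hidden) -> (N, D_hidden) function.
    """
    y = F.gelu(graph_conv(x @ w_in)) @ w_out
    return y + x  # residual connection back to the input X

x = torch.randn(196, 64)
w_in = torch.randn(64, 64)
w_out = torch.randn(64, 64)
out = grapher(x, w_in, w_out, graph_conv=lambda z: z)  # identity stand-in
```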
4. FFN network (feedforward network)
The FFN is a multilayer perceptron consisting of two fully connected layers and can be written as:
Z = σ(Y W_1) W_2 + Y, where W_1 and W_2 are the weights of the two fully connected layers and the bias term is omitted. In the ViG network, every fully connected layer and graph convolutional layer is followed by batch normalization.
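The FFN formula maps directly to code (a sketch with random stand-in weights; the expansion ratio of 4 is an assumption, and the batch normalization mentioned above is omitted here for brevity):

```python
import torch
import torch.nn.functional as F

def ffn(y, w1, w2):
    """FFN block: Z = sigma(Y W_1) W_2 + Y, bias omitted."""
    return F.gelu(y @ w1) @ w2 + y  # expand, activate, project, residual

y = torch.randn(196, 64)
w1 = torch.randn(64, 256)   # hidden dimension expanded (ratio E = 4, assumed)
w2 = torch.randn(256, 64)   # project back to the node dimension
z = ffn(y, w1, w2)
```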
3. Network parameter configuration
ViG comes in two architectures: an isotropic architecture (similar to ViT) and a pyramid architecture (similar to ResNet).
1. Isotropic structure
The feature size stays the same throughout the network. The paper builds three model sizes: ViG-Ti, ViG-S and ViG-B. The number of nodes is N = 196; the number of neighbours K grows from 9 to 18 across layers (to enlarge the receptive field); the number of heads is h = 4. Performance and model sizes are as follows:
2. Pyramid structure
The pyramid structure captures more multi-scale features as layers are stacked. The paper designs four pyramid-structured ViG variants; see the table below for details.
In the table, D is the feature dimension, E is the hidden-dimension ratio in the FFN, K is the receptive field of the GCN, and H×W is the image size.
Position encoding: to give each node positional information, an encoding vector e_i is added to its node feature: x_i ← x_i + e_i. In the pyramid ViG, the relative positional distance e_i^T e_j between nodes i and j is also added to the feature distance used to construct the graph (cf. ViT).
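The additive encoding is a one-liner in practice (shapes illustrative; whether the encodings are zero-initialized is an assumption):

```python
import torch

# x_i <- x_i + e_i: one learnable encoding vector per node.
num_nodes, dim = 196, 64
x = torch.randn(num_nodes, dim)                      # node features
e = torch.nn.Parameter(torch.zeros(num_nodes, dim))  # positional encodings
x = x + e                                            # inject position information
```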
4. Visualization
The figure clearly shows that in shallow layers, neighbour nodes tend to be selected by low-level, local features such as colour and texture, while in deep layers the neighbours of the central node are more semantic and belong to the same category.