IJCAI 2022 || The latest review paper on graph structure learning: A Survey on Graph Structure Learning: Progress and Opportunities

Abstract

Graph data are widely used to describe real-world entities and their connections. Graph neural networks (GNNs) are highly sensitive to the given graph structure, so noisy or incomplete graphs lead to unsatisfactory representations and prevent the model from fully understanding the underlying mechanism. Graph structure learning (GSL) aims to jointly learn an optimal graph structure and the corresponding graph representations. This work extensively reviews recent progress in GSL.

1、Introduction

The success of graph neural networks is attributed to their simultaneous exploitation of the rich information in graph structure and node attributes. However, a given graph inevitably contains noise and incompleteness, which hinders the application of GNNs to real-world problems. From the perspective of representation learning, a GNN learns node representations by aggregating neighbor information. This iterative scheme has a cascading effect: once a small amount of noise is passed to a neighboring node, the representation quality of many other nodes degrades as well. Several works have shown that even a slight attack on the graph structure can cause a GNN to make wrong predictions. A high-quality graph structure is therefore essential for GNNs. The organization of the paper is shown in the figure below.
[Figure: overview of the survey's structure]

2、Preliminaries

G = (A, X) denotes a graph, where A is the adjacency matrix and X is the node feature matrix. Given a (possibly noisy or incomplete) graph, the goal of graph structure learning is to simultaneously learn an optimal adjacency matrix A* and the corresponding graph representation Z for a downstream task.

Graph generation, by contrast, aims to generate multiple graphs with diverse structures.
Graph learning (in graph signal processing) aims to reconstruct the Laplacian matrix of a homogeneous graph from the given node attributes.


2.1 GSL pipeline

A classic GSL model consists of two components: a GNN encoder and a structure learner.
1. The GNN encoder takes a graph as input and computes node embeddings for the downstream task.
2. The structure learner models the connectivity of the edges in the graph.

Existing GSL models follow a three-stage pipeline (a minimal sketch follows the list):
1. graph construction
2. graph structure modeling
3. message propagation
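A minimal sketch of how the three stages fit together in a single forward pass; the callables structure_learner and gnn_encoder are illustrative placeholders, not components specified in the survey:

```python
def gsl_forward(X, A_init, structure_learner, gnn_encoder):
    """One forward pass of a generic GSL model (a sketch).

    X:      (n, d) node feature matrix
    A_init: (n, n) initial adjacency matrix, possibly constructed from X
    """
    # 1. Graph construction: start from the given (or constructed) initial graph.
    A = A_init

    # 2. Graph structure modeling: the structure learner refines edge weights.
    A_star = structure_learner(X, A)      # (n, n) refined adjacency matrix

    # 3. Message propagation: aggregate node features on the refined structure.
    Z_star = gnn_encoder(X, A_star)       # (n, k) node representations
    return A_star, Z_star
```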

2.1.1 Graph construction

If the dataset has no given graph structure, or the given structure is incomplete, an initial graph is constructed first. There are two main construction methods (both sketched below):
1. kNN graphs
2. ε-threshold graphs
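A minimal sketch of both constructions from a node feature matrix, using cosine similarity; the parameter values k and eps are placeholders:

```python
import torch

def knn_graph(X, k=10):
    """kNN graph: connect each node to its k most similar neighbors (cosine similarity)."""
    Xn = torch.nn.functional.normalize(X, dim=1)
    S = Xn @ Xn.t()                              # (n, n) pairwise cosine similarity
    S.fill_diagonal_(-float('inf'))              # ignore self-similarity
    idx = S.topk(k, dim=1).indices               # indices of the k nearest neighbors
    A = torch.zeros_like(S)
    A.scatter_(1, idx, 1.0)
    return ((A + A.t()) > 0).float()             # symmetrize

def epsilon_graph(X, eps=0.9):
    """ε-threshold graph: keep an edge wherever the similarity exceeds eps."""
    Xn = torch.nn.functional.normalize(X, dim=1)
    S = Xn @ Xn.t()
    A = (S > eps).float()
    A.fill_diagonal_(0.0)                        # no self-loops
    return A
```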

2.1.2 Graph structure modeling

At the heart of GSL is the structure learner, which refines the original graph by modeling the connectivity of its edges. This paper divides existing structure learning methods into three categories:
1. Metric-based approaches: a metric function computes the weight of the edge between a node pair from the embeddings of the two nodes.

2. Neural approaches: a neural network infers the edge weights conditioned on the node representations.

3. Direct approaches: the adjacency matrix is treated as a learnable parameter and optimized directly while training the GNN.

Unlike direct approaches, metric-based and neural approaches learn edge connectivity through a parameterized network that takes node representations as input and produces an optimized adjacency matrix A*. The graph produced by the structure learner can then undergo additional post-processing operations, such as discrete sampling, to obtain the final graph structure.

2.1.3 Message propagation

After obtaining the optimized adjacency matrix A*, a graph encoder aggregates node features on this structure to compute the node representations Z*.
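As a sketch, here is a single GCN-style propagation step on the refined structure A*; the survey does not prescribe a specific encoder, GCN is used only as a common example:

```python
import torch

def propagate(X, A_star, W):
    """One GCN-style message-passing step on the refined adjacency A* (a sketch)."""
    n = A_star.size(0)
    A_hat = A_star + torch.eye(n)                                       # add self-loops
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)  # D^-1/2 A D^-1/2
    return torch.relu(A_norm @ X @ W)                                   # Z* = ReLU(A_norm X W)
```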

3、Graph Regularization

We often want the learned graph to satisfy certain properties, so before introducing the GSL methods, let us review three common graph regularization techniques:

3.1 Sparsity

Sparsity: since real-world graphs are generally sparse, the learned adjacency matrix is also required to be relatively sparse. Intuitively, we could penalize the L0 norm (the number of non-zero entries), but L0 minimization is non-convex (and NP-hard), so in practice we either use its convex approximation, the L1 norm, or solve it through continuous relaxation.
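A small sketch of the two penalties; in practice only the L1 term is added to the training loss, because the L0 count is not differentiable:

```python
import torch

def l0_penalty(A):
    """Exact sparsity count: number of non-zero edges (non-differentiable)."""
    return (A != 0).sum()

def l1_penalty(A):
    """Convex L1 relaxation: sum of absolute edge weights, usable with gradient descent."""
    return A.abs().sum()
```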

3.2 Smoothness

Smoothness: we usually assume that the signals of adjacent nodes vary smoothly. To encourage this, the following Dirichlet energy is typically minimized (jointly with the task objective):

Ω(A, X) = (1/2) Σ_{i,j} A_ij ||x_i − x_j||² = tr(Xᵀ L X)

where L = D − A is the graph Laplacian and D is the degree matrix.
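A minimal sketch of this smoothness term as a differentiable penalty:

```python
import torch

def smoothness_penalty(A, X):
    """tr(X^T L X) with L = D - A: small when connected nodes have similar features."""
    L = torch.diag(A.sum(dim=1)) - A        # combinatorial graph Laplacian
    return torch.trace(X.t() @ L @ X)
```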

3.3 Community Preservation

Community Preservation: in real-world graphs, nodes in different topological clusters often have different labels, so edges that span multiple communities can be considered noise. From graph theory we know that the rank of an adjacency matrix is related to the number of connected components of the graph, and a low-rank adjacency matrix corresponds to a few densely connected components. To remove potentially noisy edges and best preserve the community structure, low-rank regularization is introduced.
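A sketch of the usual convex surrogate for the rank, the nuclear norm (sum of singular values), used as a low-rank penalty:

```python
import torch

def low_rank_penalty(A):
    """Nuclear norm of A, a convex surrogate for rank(A) used as low-rank regularization."""
    return torch.linalg.svdvals(A).sum()
```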

4、Graph Structure Modeling

4.1 Model Taxonomy

4.1.1 Metric-based Approaches

This type of method uses a kernel function to compute the similarity of node pairs and takes it as the edge weight. Commonly used kernel functions include (two of these are sketched after the list):
1. Gaussian kernel function
2. Inner product
3. Cosine similarity
4. Diffusion kernel function
5. Combining multiple kernel functions.
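A sketch of two of these metric functions applied to node embeddings; the clipping and bandwidth choices are illustrative, not a particular paper's learner:

```python
import torch

def cosine_similarity_graph(Z):
    """Cosine-similarity learner: edge weight = cos(z_i, z_j), negative weights clipped to 0."""
    Zn = torch.nn.functional.normalize(Z, dim=1)
    return torch.relu(Zn @ Zn.t())

def gaussian_kernel_graph(Z, sigma=1.0):
    """Gaussian (RBF) kernel learner: edge weight = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    dist2 = torch.cdist(Z, Z).pow(2)
    return torch.exp(-dist2 / (2 * sigma ** 2))
```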

4.1.2 Neural Approaches

Unlike metric-based methods, neural approaches use neural networks to predict edge weights. Besides simple feed-forward networks, many works use attention mechanisms to model the edge weights. Recently, some works have adopted transformers, which differ from earlier GNN architectures: previous methods only consider first-order neighborhoods and reach multiple hops by stacking layers, whereas a graph transformer treats all nodes as neighbors and uses attention to assign weights to edges while encoding structural information. Because of the peculiarities of graph data, positional and structural encodings are critical.
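A minimal sketch of a neural structure learner that scores each node pair with a small MLP; the architecture and sizes are illustrative only:

```python
import torch
import torch.nn as nn

class MLPEdgeScorer(nn.Module):
    """A small neural structure learner: an MLP scores each node pair (a sketch)."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, Z):
        n = Z.size(0)
        zi = Z.unsqueeze(1).expand(n, n, -1)                 # (n, n, d) source node
        zj = Z.unsqueeze(0).expand(n, n, -1)                 # (n, n, d) target node
        logits = self.mlp(torch.cat([zi, zj], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)                         # (n, n) edge probabilities
```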

4.1.3 Direct Approaches

Direct approaches treat the adjacency matrix of the target graph as a free variable (for example, a learnable parameter or a random variable) and do not rely on node representations. Many of them use graph regularization to optimize the adjacency matrix, which explicitly specifies the desired properties of the optimal graph. Because jointly optimizing the adjacency matrix and the model parameters often introduces non-differentiable operations, standard gradient-based optimization cannot always be applied directly. Some works integrate the initial graph structure and the regularizers into a hybrid objective function, while others inject low-rank priors or alternately optimize the adjacency matrix and the model parameters. Besides common regularizers, GNNExplainer introduces a mutual-information-based objective to identify the subgraph structure most relevant to the final task. Other works model the adjacency matrix from a probabilistic perspective, assuming the graph structure is sampled from a certain distribution.
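A minimal sketch of the "adjacency as a parameter" idea, with a symmetric, non-negative matrix; the initialization and constraints are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DirectAdjacency(nn.Module):
    """Direct approach: the adjacency matrix itself is a free parameter (a sketch)."""

    def __init__(self, n, A_init=None):
        super().__init__()
        init = A_init if A_init is not None else torch.rand(n, n)
        self.theta = nn.Parameter(init)

    def forward(self):
        A = torch.relu(self.theta)            # keep edge weights non-negative
        return (A + A.t()) / 2                # enforce symmetry
```

The returned matrix can then be penalized with the regularizers of Section 3 and fed to the GNN encoder, with the two sets of parameters optimized jointly or alternately.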

4.2 Postprocessing Graph Structures

4.2.1 Discrete Sampling

Some GSL models include a sampling step, assuming that the refined graph is generated from a discrete distribution through an additional sampling process. Instead of directly using the learned matrix as continuous edge weights, the extra sampling step restores the discrete nature of the graph and gives the structure learner more flexibility to control properties of the final graph, such as sparsity.
Note that sampling from a discrete distribution is not differentiable. Besides the special optimization schemes mentioned above for direct approaches, standard gradient descent can still be used through reparameterization tricks that let gradients propagate across the sampling operation. A common choice is the Gumbel-Softmax trick, which generates different graphs by injecting Gumbel noise into the sampling.
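A minimal sketch of Gumbel-Softmax edge sampling using PyTorch's built-in gumbel_softmax; the two-logit ("no edge" / "edge") parameterization is one common choice, not the only one:

```python
import torch
import torch.nn.functional as F

def sample_edges_gumbel(logits, tau=0.5, hard=True):
    """Differentiable edge sampling with the Gumbel-Softmax trick.

    logits: (n, n, 2) unnormalized scores for the two states of every
            node pair ("no edge", "edge").
    Returns an (n, n) near-binary adjacency through which gradients flow.
    """
    y = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)  # (n, n, 2)
    return y[..., 1]                                          # mass on the "edge" state
```

With hard=True the forward pass is discrete while gradients use the soft relaxation (a straight-through estimator).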

4.2.2 Residual Connections

The initial graph structure, if available, usually carries prior information about the topology, so it is natural to assume that the optimal graph structure is a simple transformation of the original graph. A common instantiation is a residual combination such as A* = λ·A + (1 − λ)·Ã, where A is the original graph structure and Ã is the learned graph structure.

5、Applications

5.1 Natural language processing

In natural language processing, GSL techniques are widely used to obtain fine-grained language representations by building graphs that take words as nodes and connect them according to semantic and syntactic patterns. For information retrieval, Yu et al. [2021] learn hierarchical query-document matching patterns by discarding unimportant words in document graphs. In relation extraction, Tian et al. [2021] construct a graph from the linguistic dependency tree and refine its structure by learning different weights for different dependencies. For sentiment analysis, Li et al. [2021a] create semantic graphs by computing self-attention between word representations and syntactic graphs based on dependency trees, then optimize the two structures with a differential regularizer so that they capture different information. For question answering, Yasunaga et al. [2021] use a language-model-based encoder to learn a score for each node, highlighting the paths in the knowledge graph most relevant to the question. For fake news detection, Xu et al. [2022] propose a semantic structure refinement scheme that distinguishes beneficial segments from redundant ones, yielding fine-grained evidence for the authenticity of news.

5.2 Computer vision and medical imaging

In computer vision, wDAE-GNN [Gidaris and Komodakis, 2019] uses the cosine similarity of class features to create a graph that captures the interdependencies between classes for few-shot learning, and DGCNN [Wang et al., 2019b] recovers the topology of point-cloud data with GSL, enriching the representations used for point-cloud classification and segmentation. Another prominent example of applying GSL to image data is scene graph generation, where the goal is to learn the relationships between objects. Qi et al. [2018] propose to use convolutions or convolutional LSTMs as structure learners for spatio-temporal data. Later approaches introduce energy-based objectives to incorporate the structure of scene graphs [Suhail et al., 2021] or more general constraints during inference [Liu et al., 2022a]. For medical image analysis, GPNL [Cosmo et al., 2020] leverages metric-based graph structure modeling to learn a graph of pairwise patient similarity for disease analysis, and FBNetGen [Kan et al., 2022] learns a functional brain-network structure optimized for downstream prediction tasks.

5.3 Scientific discovery

In scientific discovery domains such as biology and chemistry, graph structures are often constructed artificially to represent structured interactions within a system. For example, the graph structure of a protein is constructed by thresholding the pairwise distance matrix of all amino acids [Guo et al., 2021; Jumper et al., 2021]. In such cases, long-range contacts are usually ignored when constructing the graph [Jain et al., 2021].

To optimize the properties of a molecular graph, one needs to learn to weight the molecular scaffold describing the basic structure of a compound with the desired properties [Fu et al., 2022]. Similarly, non-bonded interactions are rarely considered when modeling small molecules [Leach, 2001; Luo et al., 2021b; Satorras et al., 2021], even though they may be important for understanding the key mechanisms of the system.

Graph structure learning makes it possible to learn more comprehensive representations of such data with minimal information loss, and improves the interpretability of scientific findings.

6、Challenges and Future Directions

1. Beyond homogeneous graphs: existing methods mainly focus on structure learning for homogeneous graphs, while structure learning for heterogeneous graphs is still under exploration.
2. Beyond homophily: most existing methods rely on the homophily assumption, yet many real-world graphs are heterophilous, such as molecular graphs. There is still a lot of room for designing learning strategies suited to such graph structures.
3. Learning graph structures without rich attributes: this problem is common in recommender systems, where a node may only have an id and no semantic information. How can we learn a good graph structure on such data?
4. Scalability: most current work involves a large number of pairwise similarity computations, which limits its use on large graphs.
5. Task-agnostic structure learning: most existing work requires task-specific supervision signals for training. In practice, collecting high-quality labels is time-consuming, and limited supervision degrades the quality of the learned structure. Recent self-supervised work addresses this problem, learning a good graph structure even without labels.

7、Terminology

1. Sparsity
① Sparsity refers to the phenomenon that most of the elements in a data set or matrix are 0 or very close to 0. A data set or matrix is sparse if its non-zero elements make up only a small fraction of all elements; conversely, if the non-zero elements make up the majority, the data or matrix is dense.

② Sparsity is important in many fields. In machine learning, many algorithms exploit the sparsity of the data to reduce computation and storage. In natural language processing, for example, text data are often sparse because a document contains only a few of the possible words or phrases, so sparse matrices can be used to represent text data and reduce computational complexity.

③ In the field of computer science and mathematics, sparsity is also a very important research direction. For example, sparsity is widely used in matrix factorization and graph processing. By utilizing sparsity, the complexity of calculation and storage can be greatly reduced, and the efficiency and scalability of the algorithm can be improved.

2. Smoothness
① Smoothness refers to data or functions whose rate of change in space or time is relatively small and slow. In computer science and mathematics, smoothness is an important concept that is widely used in signal processing, image processing, optimization, machine learning, and other fields.

② In machine learning, smoothness usually refers to the property that the parameters or function values of a model change relatively little. It is closely tied to regularization methods such as L1 and L2 regularization: the L1 norm drives some elements of the parameter vector to 0, making the model sparser, while the L2 norm compresses the elements of the parameter vector into a small range, making the model smoother.

③ In image processing, smoothness usually refers to relatively small variations of the pixel values in an image. It underlies the design of smoothing filters such as Gaussian filters and median filters, which exploit the smoothness of pixel values to suppress noise and fine detail, making the image cleaner and easier to analyze.

④ In optimization, smoothness usually refers to the property that the rate of change of the objective function is relatively small. It is important for optimization algorithms such as gradient descent, which require the objective to be differentiable and smooth; if the objective is not smooth, gradient descent may get stuck in a local optimum or fail to converge.

3. Community preservation
① Community preservation refers to protecting and maintaining the structures and relationships between different communities or groups in a social network or complex network. In such networks, each node belongs to a community or group; connections between different communities or groups are usually relatively weak, while connections among nodes within the same community or group are usually relatively close.

② Community preservation is an important research direction in network analysis and graph theory. Its goal is to find the different communities or groups in a network and to preserve the structures and connections between them. It has a wide range of applications in real life, such as social network analysis, recommender systems, advertising, and public opinion analysis.

③ In social network analysis, community preservation can be used to identify different communities or groups, giving a better understanding of interpersonal relationships and community structure. In recommender systems, it can be used to identify users' interests and preferences so that relevant products or services can be recommended. In advertising, it can be used to target advertisements at different communities or groups and better reach the intended audience. In public opinion analysis, it can be used to analyze the views and sentiments of different communities or groups, giving a better picture of public reactions and attitudes.

4. Metric-based approaches
Metric-based approaches refer to a class of machine learning algorithms whose learning process relies on metric learning. Metric learning constructs a new feature space by learning a distance metric function, thereby improving classification or clustering.
In metric learning, a metric function measures the distance or similarity between two data points, typically a common distance function such as Euclidean distance or cosine similarity. The goal is to learn a metric under which similar data points are close and dissimilar data points are far apart. Learning such distance functions can effectively improve machine learning tasks such as image classification, face recognition, and object tracking.

① Metric-based methods usually include the following steps:

1. Choose or learn a distance measurement function, such as Euclidean distance or cosine similarity.
2. Calculate the distance or similarity between data points through the metric function, and map the data to a new feature space.
3. In the new feature space, use a classifier or clustering algorithm to classify or cluster the data.

② Metric-based methods usually have the following advantages:

1. It can reduce the feature dimension and improve the generalization ability of the model.
2. It can handle nonlinear relationships and high-dimensional data, and improve the robustness of the model.
3. It can adapt to different tasks and data types, and has a wide range of application scenarios.

Common metric-based methods include K-nearest neighbor algorithm (KNN), support vector machine (SVM), manifold learning, etc.

5. Neural approaches
Neural approaches refer to a class of machine learning algorithms whose learning process relies on a neural network model. A neural network is a computational model inspired by the structure of biological neural networks in the brain; with large amounts of training data and the backpropagation algorithm, it automatically learns the relationships in the data and carries out machine learning tasks such as classification, regression, and clustering.
A neural network usually consists of an input layer, hidden layers, and an output layer. The input layer receives the raw data, the hidden layers map the data to a new feature space through a series of nonlinear transformations, and the output layer performs classification or regression on the mapped data.

① Neural network methods usually have the following advantages:
1. It can handle a large amount of complex nonlinear data and has good generalization performance.
2. The high-level features of the data can be automatically extracted through hierarchical feature learning, thereby avoiding the complexity of manual feature extraction.
3. It can be trained through the backpropagation algorithm, so as to realize end-to-end automatic learning.
4. It can adapt to different tasks and data types, and has a wide range of application scenarios.

Common neural network methods include convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), generative adversarial networks (GAN), etc. These methods have achieved good results in areas such as image recognition, speech recognition, natural language processing, and recommender systems.

6. Direct approaches
Direct approaches refer to a class of graph structure learning methods that learn node representations by directly modeling and optimizing the graph structure. Different from traditional feature engineering-based methods, the direct method does not need to manually extract the features of nodes, but automatically extracts the features of nodes by learning the relationship between nodes, so as to achieve more accurate and efficient learning.

①The direct method usually includes the following steps:

1. Establish a graph structure: Establish a graph structure by constructing nodes and edges, which can be represented by an adjacency matrix or an adjacency list.
2. Learning node representation: Through neural network or other learning algorithms, learn the low-dimensional vector representation of nodes, which is usually called embedding.
3. Application: Use node embedding for various graph structure learning tasks, such as node classification, link prediction, community detection, etc.

②The direct method usually has the following advantages:

1. There is no need to manually extract the features of nodes, which reduces the complexity of feature engineering.
2. It can handle large-scale complex graph structures and has good generalization performance.
3. It can adaptively learn the representation of nodes, which has stronger adaptability and flexibility.

Common direct methods include graph convolutional networks (GCN), graph attention networks (GAT), graph autoencoders (GAE), etc. These methods have achieved promising results in areas such as node classification, link prediction, community detection, and graph representation learning.

7. Discrete sampling
In probability and statistics, discrete sampling refers to the process of randomly drawing a sample from a discrete probability distribution, that is, a distribution whose random variable can only take discrete values. In discrete sampling, a random number is generated according to the distribution function and compared with the probability of each discrete value to determine the sampled outcome.
In machine learning, discrete sampling is often used as a sampling process in generative models, such as Hidden Markov Models (HMMs), Latent Dirichlet Allocation (LDA), etc. In these models, discrete sampling is used to generate latent states or themes, enabling modeling and analysis of the data.
Discrete sampling usually involves some commonly used probability distributions, such as Bernoulli distribution, multinomial distribution, Poisson distribution, etc. Common discrete sampling algorithms include inverse transform sampling, rejection sampling, importance sampling, etc.
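A small sketch of inverse transform sampling from a discrete distribution (NumPy is used here only for brevity):

```python
import numpy as np

def inverse_transform_sample(probs, size=1, rng=None):
    """Inverse transform sampling: draw indices from a discrete distribution `probs`.

    probs: 1-D array of probabilities that sums to 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    cdf = np.cumsum(probs)              # cumulative distribution function
    u = rng.random(size)                # uniform random numbers in [0, 1)
    return np.searchsorted(cdf, u)      # first index whose CDF value exceeds u
```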

8. Residual connections
Residual connections are a technique widely used in deep neural networks. They establish direct connections between different layers of the network so that information can flow through the network more easily, which effectively alleviates problems such as vanishing and exploding gradients.

The core idea of residual connections is to connect the input of a block directly to its output so that the block only has to learn the residual. In a deep network, as the number of layers increases, the signal goes through many nonlinear transformations, which can weaken or wash it out; a residual connection lets the signal pass through the network directly, avoiding this loss.

Concretely, residual connections are realized by adding shortcut connections to the network. In each residual block, the input is first passed through a sub-network that transforms it, the result is added to the input, and the sum is passed on to the next layer. The output of a residual block can therefore be expressed as:
y=x+F(x)
where x is the input signal, F is the transformation performed inside the residual block, and y is the output signal. This direct connection speeds up the flow of information and makes it easier for the network to learn the residual. In practice, residual connections are widely used in the design of deep neural networks such as ResNet and DenseNet.
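A minimal sketch of a residual block implementing y = x + F(x); the two-layer fully connected F is a generic illustration, not ResNet's exact convolutional design:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): a minimal residual block (a generic sketch)."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return x + self.f(x)            # shortcut connection plus learned transformation
```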
