Curriculum Learning and Graph Neural Networks (or Graph Structure Learning)
Student : Wenxuan Zeng
School : University of Electronic Science and Technology of China
Date : 2022.4.25 - 2022.4.29
1 Curriculum Learning
This part covers the basics of curriculum learning, starting with the classic pioneering paper "Curriculum Learning" and its core ideas.
Link: Curriculum Learning
The core idea: Let the model imitate the human learning strategy, learn simple samples first, and then gradually learn difficult samples.
Advantages: ① Faster training: knowledge learned from simple examples transfers more easily to difficult ones, so for the same target performance the model converges earlier. ② Better generalization: the model can be trained to a better local optimum.
Method: Assign each training sample a weight according to its difficulty. At the beginning, simple samples receive the highest weights and are therefore sampled with higher probability. The weights of harder samples are then gradually increased until all weights are uniform, at which point the model is trained directly on the target training set. Note that the definition of sample difficulty in curriculum learning is open: different difficulty criteria can be set for different practical problems.
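The reweighting idea above can be sketched in a few lines. This is only a minimal illustration, not the schedule from the paper: the exponential form, the `sharpness` knob, and the linear dependence on `progress` are all my own assumptions.

```python
import numpy as np

def curriculum_weights(difficulty, progress, sharpness=5.0):
    """Sampling probabilities that favor easy samples early in training.

    difficulty: values in [0, 1], higher means harder (the measure is task-specific).
    progress:   training progress in [0, 1]; at 1 the weights become uniform.
    sharpness:  hypothetical knob for how strongly hard samples are down-weighted.
    """
    difficulty = np.asarray(difficulty, dtype=float)
    progress = float(np.clip(progress, 0.0, 1.0))
    # The penalty on hard samples shrinks linearly to zero as training progresses.
    logits = -sharpness * (1.0 - progress) * difficulty
    weights = np.exp(logits)
    return weights / weights.sum()

difficulty = np.array([0.1, 0.5, 0.9])
early = curriculum_weights(difficulty, progress=0.0)  # easy samples dominate
late = curriculum_weights(difficulty, progress=1.0)   # uniform weights

# Draw a mini-batch of sample indices according to the early-stage weights.
rng = np.random.default_rng(0)
batch = rng.choice(len(difficulty), size=8, p=early)
```

Once `progress` reaches 1, sampling is uniform, which corresponds to training directly on the target training set.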
Experiment: The authors used four toy experiments to demonstrate the effect of curriculum learning: ① SVM classification; ② Perceptron; ③ Neural network shape recognition; ④ Language model.
-
SVM
Two classes of data are generated from two two-dimensional Gaussian distributions, and the decision surface of the Bayes classifier is computed. Samples that fall on the correct side of the decision surface are regarded as simple samples; the others are noisy, that is, complex samples. An SVM trained with only simple samples reaches a classification error rate of 16.3%, while an SVM trained with all samples reaches 17.1%, which indicates that training on cleaner, simpler samples can improve generalization.
-
Perceptron
A rule for sample difficulty is defined: some components of the input vector x are irrelevant (interference) inputs. The less interference there is, that is, the more of these irrelevant components are set to 0, the simpler the sample.
-
Neural Network Shape Recognition
Two datasets of different difficulty are defined: a. Simple samples (equilateral triangles, squares, circles); b. Complex samples (triangles, rectangles, ellipses). Training on simple samples first and then on complex samples works better than training on complex samples directly.
-
Language model
The language task here is to fill in the blank (predict the last word). The curriculum strategy is: first select the 5,000 most common words from the vocabulary (treated as simple samples) and train the model only on the training samples that contain these words. Then expand the vocabulary to 10,000, 15,000, and 20,000 words, so that learning goes from easy to difficult.
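The vocabulary-based curriculum can be illustrated with a small filter over training windows. This is a rough sketch under my own assumptions: a window is kept only if every token lies in the current vocabulary, and the function name `vocabulary_curriculum` and the toy data are made up for illustration.

```python
from collections import Counter

def vocabulary_curriculum(windows, vocab_sizes):
    """Yield (vocab_size, subset) pairs, growing the allowed vocabulary stage by stage.

    windows:     list of token lists (each window is one training sample).
    vocab_sizes: increasing vocabulary sizes, e.g. [5000, 10000, 15000, 20000].
    """
    counts = Counter(tok for window in windows for tok in window)
    ranked = [tok for tok, _ in counts.most_common()]  # most frequent first
    for size in vocab_sizes:
        allowed = set(ranked[:size])
        # Assumption: keep a window only if all of its tokens are in the vocabulary.
        subset = [w for w in windows if all(tok in allowed for tok in w)]
        yield size, subset

# Toy corpus with tiny vocabulary sizes standing in for 5,000 / 10,000 / ...
windows = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["a", "zebra", "sat"],
]
stages = list(vocabulary_curriculum(windows, vocab_sizes=[3, 5, 7]))
stage_sizes = [len(subset) for _, subset in stages]
```

Each later stage contains all samples of the earlier stages plus those that need rarer words, so the training set grows from easy to difficult.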
Why curriculum learning is effective: ① The model does not need to spend much time on difficult samples in the early stage of training and only needs to learn simple examples; ② It guides training towards a better local optimum and thereby achieves better generalization.
2 Curriculum Learning for Graph Neural Networks
From my initial survey, curriculum learning is widely used in the GNN field: for simple and general graph classification or regression tasks, for graph representation learning and contrastive learning, for multi-domain adaptation with GNNs, in the pre-training stage of GNNs, and so on.
2.1 CurGraph: Curriculum Learning for Graph Classification
Features: curriculum learning, graph classification
Motivation: In graph classification, the difficulty of classifying different graphs varies greatly, which makes curriculum learning a natural fit. However, evaluating graph difficulty is challenging because graph data is irregular.
Method: This paper proposes CurGraph, which analyzes the difficulty of graphs in a high-level semantic feature space. It uses the infomax method to obtain graph-level embeddings and a neural density estimator to model the distribution of the embeddings. The difficulty score of each graph is then computed from the intra-class and inter-class distributions of the graph embeddings. With the difficulty scores in hand, the model can learn from easy to difficult. For a smoother transition, a smooth-step method is proposed, which uses a time-varying smoothing function to filter out difficult graphs.
-
Infomax Curriculum Learning
A SOTA unsupervised GNN model (InfoGraph) is used to obtain graph-level embedding. InfoGraph maximizes the mutual information between graph-level embeddings and node-level embeddings. This unsupervised approach can learn better graph representations.
Use distances to define the neighbors of a graph:
Finally calculate the difficulty of a graph:
The p above denotes the density estimate, for which this paper uses BNAF.
-
Smooth-Step Curriculum Learning
Divide the graph difficulty into S levels and add an auxiliary time-varying threshold that defines the admissible difficulty at time t. Graphs whose difficulty is below the threshold are used to train the GNN at time t. This makes the transition in graph difficulty during training smoother.
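To make the thresholding concrete, here is a simplified sketch. It is not the smoothing function from the paper: the quantile bucketing, the linear threshold schedule, and all function names are my own assumptions.

```python
import numpy as np

def difficulty_levels(scores, num_levels):
    """Bucket difficulty scores into levels 1..num_levels with equal-sized buckets."""
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort()  # rank 0 is the easiest graph
    return ranks * num_levels // len(scores) + 1

def threshold_at(epoch, total_epochs, num_levels):
    """Hypothetical linear schedule: the threshold grows from 1 to num_levels."""
    frac = min(1.0, epoch / max(1, total_epochs - 1))
    return 1.0 + frac * (num_levels - 1)

def select_graphs(levels, epoch, total_epochs, num_levels):
    """Indices of graphs admitted at this epoch.

    Uses <= so that level-1 graphs are available from the very first epoch.
    """
    return np.flatnonzero(levels <= threshold_at(epoch, total_epochs, num_levels))

scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]        # made-up difficulty scores
levels = difficulty_levels(scores, num_levels=3)
first_epoch = select_graphs(levels, epoch=0, total_epochs=5, num_levels=3)
last_epoch = select_graphs(levels, epoch=4, total_epochs=5, num_levels=3)
```

With this schedule only the easiest bucket is used at the start, and every graph is admitted by the final epoch.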
Experiments:
-
Graph Classification
Experiments on different datasets demonstrate the superiority of CurGraph.
-
Comparison with Heuristic Curriculum Design
Two heuristics for curriculum learning are considered: ① #Nodes; ② #Edges. The hard graphs identified by CurGraph tend to contain more nodes and edges.
The following compares the effects of different difficulty evaluation methods:
-
Ablation experiment
The smooth-step method works better than traditional curriculum learning:
-
Hyperparameter sensitivity
On the D&D and IMDB-M datasets, S=4 works best; on the NCI1 dataset, S=6 works best.
2.2 CuCo: Graph Representation with Curriculum Contrastive Learning
Features: curriculum learning, graph contrastive learning, graph representation learning
Motivation: Because labeled data is expensive, graph-level representation learning based on contrastive learning has attracted extensive attention. However, these methods mainly focus on graph augmentation for positive samples, while the impact of negative samples is less studied.
Contributions: ① The first attempt to study the impact of negative samples on learning graph-level representations, which has been largely ignored by previous work but is practical and important for good self-supervised graph-level representation learning; ② A graph representation learning model based on curriculum contrastive learning, which effectively combines curriculum learning and contrastive learning and can automatically select and train on negative samples in the way humans learn.
Methods: The paper investigates the impact of negative samples on learning graph-level representations and proposes a novel curriculum contrastive learning framework for self-supervised graph-level representations. Four graph augmentation methods are used to obtain positive and negative samples, and a GNN then learns the graph representations. Next, a scoring function ranks negative samples from easy to difficult, and a pacing function automatically selects negative samples during each training pass.
-
Graph augmentation
This paper introduces four graph augmentation methods: ① Randomly drop some nodes together with their connections; ② Perturb the connectivity by randomly adding or removing a certain proportion of edges; ③ Feature masking, which prompts the model to use context information to recover the masked node features; ④ Use a random walk to sample a subgraph from G.
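The four augmentations can be sketched on a dense adjacency matrix. This is only an illustration under my own assumptions (undirected graph, NumPy arrays, made-up function names and ratios). In particular, the edge perturbation here flips a random fraction of node pairs, which is a simplification of adding or removing a fixed proportion of edges.

```python
import numpy as np

def drop_nodes(adj, feats, ratio, rng):
    """Randomly discard a fraction of nodes together with their edges."""
    n = adj.shape[0]
    num_keep = max(1, int(round(n * (1 - ratio))))
    keep = np.sort(rng.choice(n, size=num_keep, replace=False))
    return adj[np.ix_(keep, keep)], feats[keep]

def perturb_edges(adj, ratio, rng):
    """Flip a random fraction of node pairs: edges are removed, non-edges added."""
    iu, ju = np.triu_indices(adj.shape[0], k=1)
    flip = rng.random(iu.size) < ratio
    out = adj.copy()
    out[iu[flip], ju[flip]] = 1 - out[iu[flip], ju[flip]]
    out[ju[flip], iu[flip]] = out[iu[flip], ju[flip]]  # keep the graph undirected
    return out

def mask_features(feats, ratio, rng):
    """Zero out the feature vectors of a random fraction of nodes."""
    out = feats.copy()
    out[rng.random(feats.shape[0]) < ratio] = 0.0
    return out

def random_walk_subgraph(adj, feats, start, walk_len, rng):
    """Induced subgraph on the nodes visited by a random walk from `start`."""
    visited, cur = {start}, start
    for _ in range(walk_len):
        nbrs = np.flatnonzero(adj[cur])
        if nbrs.size == 0:
            break
        cur = int(rng.choice(nbrs))
        visited.add(cur)
    keep = np.array(sorted(visited))
    return adj[np.ix_(keep, keep)], feats[keep]

# Toy graph: a ring of 6 nodes with one-hot node features.
rng = np.random.default_rng(0)
adj = np.zeros((6, 6), dtype=int)
for i in range(6):
    adj[i, (i + 1) % 6] = adj[(i + 1) % 6, i] = 1
feats = np.eye(6)

dropped_adj, dropped_feats = drop_nodes(adj, feats, 0.5, rng)
perturbed = perturb_edges(adj, 0.3, rng)
masked = mask_features(feats, 0.5, rng)
sub_adj, sub_feats = random_walk_subgraph(adj, feats, start=0, walk_len=4, rng=rng)
```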
-
Graph encoder
After graph augmentation, representations of the augmented graphs need to be learned. This paper uses a GNN as the graph encoder:
-
Contrastive learning with memory
Noise-contrastive estimation loss:
Minimizing this loss forces positive pairs to score higher than the negative pairs in memory. To simplify the calculation, this paper uses the dot product as the similarity function.
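As an illustration of what such a loss looks like, here is a sketch that assumes the standard InfoNCE form with dot-product similarity. The exact formula from the paper is not reproduced in these notes, so the `temperature` parameter and the function name are my own assumptions.

```python
import numpy as np

def nce_loss(query, positive, negatives, temperature=1.0):
    """InfoNCE-style loss for one query, using dot-product similarity.

    query, positive: (d,) embeddings of two views of the same graph.
    negatives:       (k, d) embeddings of other graphs (e.g. taken from memory).
    temperature:     hypothetical scaling factor, not taken from the notes.
    """
    pos = query @ positive / temperature
    neg = negatives @ query / temperature
    logits = np.concatenate(([pos], neg))
    logits = logits - logits.max()  # numerical stability
    # Negative log-probability of picking the positive among all candidates.
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

query = np.array([1.0, 0.0])
negatives = np.array([[0.0, 1.0], [-1.0, 0.0]])
loss_aligned = nce_loss(query, np.array([1.0, 0.0]), negatives)
loss_opposed = nce_loss(query, np.array([-1.0, 0.0]), negatives)
```

The loss is small when the positive pair scores higher than every negative pair (`loss_aligned`) and grows when it does not (`loss_opposed`).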
-
Curriculum design for negative sampling
The main idea is to sort the negative samples by difficulty during training. The curriculum consists of three elements:
① Scoring function
Two similarity measures are used: cosine similarity and dot-product similarity.
② Pacing function
The pacing function schedules how negative samples are introduced into training.
③ Order
To isolate the effect of ordering samples by ascending difficulty, three orderings are specified: curriculum (sorted from lowest score to highest score), anti-curriculum (sorted from highest score to lowest score), and random.
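The three elements above fit together roughly as follows. This is a sketch under my own assumptions: difficulty is taken to be the similarity between a negative and the query, the pacing function is linear, and the names `linear_pacing`, `select_negatives`, and `start_frac` are made up.

```python
import numpy as np

def score_negatives(query, negatives, metric="cosine"):
    """Similarity of each negative to the query; higher is assumed to mean harder."""
    sims = negatives @ query
    if metric == "cosine":
        norms = np.linalg.norm(negatives, axis=1) * np.linalg.norm(query)
        sims = sims / (norms + 1e-12)
    return sims

def linear_pacing(step, total_steps, num_negatives, start_frac=0.2):
    """Hypothetical linear pacing: number of negatives available at `step`."""
    frac = start_frac + (1.0 - start_frac) * min(1.0, step / max(1, total_steps))
    return max(1, int(frac * num_negatives))

def select_negatives(query, negatives, step, total_steps, order="curriculum", rng=None):
    """Indices of the negatives used at `step` for a given ordering."""
    scores = score_negatives(query, negatives)
    if order == "curriculum":
        idx = np.argsort(scores)    # lowest score (easiest) first
    elif order == "anti":
        idx = np.argsort(-scores)   # highest score (hardest) first
    else:
        rng = np.random.default_rng() if rng is None else rng
        idx = rng.permutation(len(negatives))
    return idx[: linear_pacing(step, total_steps, len(negatives))]

query = np.array([1.0, 0.0])
negatives = np.array([[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5], [0.9, 0.0]])
start = select_negatives(query, negatives, step=0, total_steps=10)
end = select_negatives(query, negatives, step=10, total_steps=10)
```

Early in training only the easiest negatives are used; by the last step the whole pool is available, ordered from easy to hard.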
-
Early-stop mechanism
In the later stage of training, as the difficulty of negative samples increases, the proportion of false negative samples with the same label will increase, and false negative samples will affect the generalization performance of the model. To alleviate this problem, an early-stop mechanism is designed. A patience hyperparameter p is defined. When the loss is no longer decreasing, the value of p starts to decrease. Once p becomes 0, the training stops.
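The patience rule can be written as a tiny helper. This sketch follows the description literally: p is decreased whenever the loss fails to improve and is never reset. That last point is my assumption, since the notes do not say whether patience is restored after an improvement. The class name is hypothetical.

```python
class EarlyStopper:
    """Stop training once the loss has failed to improve `patience` times."""

    def __init__(self, patience):
        self.remaining = patience
        self.best = float("inf")

    def should_stop(self, loss):
        if loss < self.best:
            self.best = loss      # still improving: patience is left untouched
        else:
            self.remaining -= 1   # no improvement: spend one unit of patience
        return self.remaining <= 0

stopper = EarlyStopper(patience=2)
losses = [1.0, 0.8, 0.9, 0.7, 0.75, 0.6]  # made-up loss curve
stop_epoch = next(
    (epoch for epoch, loss in enumerate(losses) if stopper.should_stop(loss)),
    None,
)
```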
Experiments:
-
Performance comparison
-
Pacing function selection
-
Number of negative samples
-
Training order
-
Transfer learning
2.3 Curriculum Graph Co-Teaching for Multi-Target Domain Adaptation (CVPR '21)
-
Features: single source domain to multiple target domains, curriculum learning, pseudo-label generation
-
Motivation: Solve the multi-target domain adaptation (MTDA) problem.
-
Contributions: ① Curriculum Graph Co-Teaching (CGCT) for MTDA, which uses a co-teaching strategy and curriculum learning with dual classifier heads to learn more robust representations across multiple target domains; ② A domain-aware curriculum learning (DCL) strategy that makes better use of domain labels to smooth the feature alignment process.
-
Methods: The multi-domain transfer problem is alleviated from two perspectives: feature aggregation and curriculum learning. Curriculum graph co-teaching uses dual classifier heads, one of which is a GCN that aggregates features from similar samples across domains. To prevent the classifiers from overfitting to their own noisy pseudo-labels, a co-teaching strategy with the dual classifier heads is developed and supplemented by curriculum learning to obtain more reliable pseudo-labels. Additionally, when domain labels are available, domain-aware curriculum learning (DCL) is proposed: a sequential adaptation strategy that first adapts to easier target domains and then to harder ones.
-
(a) Curriculum Graph Co-Teaching
STEP 1: Domain Adaptation
Use $f_{edge}$ to generate an adjacency matrix. The supervision comes from the MLP, which labels the edges between nodes (if the labels of two nodes are the same, the similarity between them is 1, otherwise it is 0):
Then generate the loss of the adjacency matrix:
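As a rough illustration of the edge supervision described in this step (not the paper's exact formulas, which are missing from these notes), the target adjacency matrix can be built from labels and compared with the predicted similarities. Using binary cross-entropy as the edge loss is my assumption, and the function names are hypothetical.

```python
import numpy as np

def edge_targets(labels):
    """Target adjacency matrix: 1 if two samples share a label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def edge_bce_loss(pred_adj, target_adj, eps=1e-7):
    """Binary cross-entropy between predicted edge similarities and the targets."""
    p = np.clip(pred_adj, eps, 1 - eps)
    bce = -(target_adj * np.log(p) + (1 - target_adj) * np.log(1 - p))
    return float(bce.mean())

targets = edge_targets([0, 1, 0])           # labels as provided by the MLP head
good_loss = edge_bce_loss(targets, targets)      # predictions match the targets
bad_loss = edge_bce_loss(1 - targets, targets)   # predictions are inverted
```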
The loss of GCN and MLP on the source domain:
The final optimization goal is:
STEP 2: Pseudo-label annotation
Use the GCN to pseudo-label the unlabeled data; samples whose prediction confidence is below a certain threshold do not participate in training. Why choose the output of the GCN for labeling? The authors argue that, because the GCN aggregates features, it is more robust than the MLP. The data then becomes:
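The thresholding step can be sketched as follows, assuming the quantity compared with the threshold is the highest predicted class probability. The threshold value 0.7 and the function name are placeholders of mine.

```python
import numpy as np

def pseudo_label(probs, threshold=0.7):
    """Return (indices, labels) of samples whose top class probability reaches the threshold."""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, probs[keep].argmax(axis=1)

# Made-up GCN output probabilities for three unlabeled target samples.
gcn_probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
kept_idx, kept_labels = pseudo_label(gcn_probs)
```

Only the confidently predicted samples receive pseudo-labels and are added to the training data; the uncertain second sample is left out.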
-
(b) Domain-aware Curriculum Learning
The authors consider the case where the domain labels of the target data are available. Different target domains are shifted from the source domain to different degrees, so the difficulty of adaptation differs. The Easy-to-Hard Domain Selection (EHDS) strategy is adopted: adapt to the easy domains first and then to the hard ones, because it is easier for the model to adapt to a domain with a smaller shift than to one with a larger shift.
How can we measure which domain is easier? The authors use information entropy as the indicator:
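One way to turn entropy into a domain ranking is sketched below. I assume the score is the mean prediction entropy of the current model on each target domain and that the lowest-entropy domain counts as the easiest. The exact formula is not reproduced in these notes, and the domain names and probabilities are invented.

```python
import numpy as np

def mean_entropy(probs, eps=1e-12):
    """Average Shannon entropy of predicted class distributions, shape (n, classes)."""
    probs = np.asarray(probs, dtype=float)
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())

def easiest_domain(domain_probs):
    """Name of the target domain on which the model is least uncertain."""
    return min(domain_probs, key=lambda name: mean_entropy(domain_probs[name]))

domain_probs = {
    "sketch": [[0.5, 0.5], [0.6, 0.4]],    # uncertain predictions: harder domain
    "photo": [[0.95, 0.05], [0.9, 0.1]],   # confident predictions: easier domain
}
first_domain = easiest_domain(domain_probs)
```

Repeating the selection over the remaining domains after each adaptation stage gives an easy-to-hard ordering.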
-
In addition to the research above, curriculum learning has also been applied in recommender systems to heterogeneous graph pre-training (although the model there uses a Transformer).
Link: Curriculum Pre-Training Heterogeneous Subgraph Transformer for Top-N Recommendation
3 Graph Structure Learning
The following is a list of graph structure learning methods from the past two years that I think are worth reading. I have marked the main techniques used in each paper (including semantic graph generation, self-supervision, topological similarity, attention mechanisms, model parameterization/probabilistic modeling, denoising, VAE generative models, etc.). In my previous KDD 2022 paper, I used three strategies (self-supervision, a generative model, and an attention mechanism) to construct high-quality graphs for GCN training.
4 Curriculum Learning for Graph Structure Learning
From my recent survey, I found very few applications of curriculum learning to graph structure learning. I would like to discuss a classic paper that I like very much, "Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks". This paper introduces GNNs into multivariate time series forecasting for the first time and realizes end-to-end graph structure learning through the attention mechanism, while using a curriculum learning strategy to reduce the difficulty of training.
The framework of this paper is as follows:
Attention mechanism for learning graph structure:
For multi-step forecasting, this paper proposes a curriculum learning strategy that starts with simple (short-term) forecasts and gradually extends to complex (long-term) forecasts. Throughout this process the graph is constructed continuously and dynamically during training, and the experiments show that this mechanism is simple and effective.
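The horizon curriculum can be sketched as a loss that only looks at the first few forecast steps and widens over time. This is a minimal illustration under my own assumptions: the horizon grows by one step every `step_size` iterations, the error is measured with MAE, and the function names are hypothetical.

```python
import numpy as np

def active_horizon(iteration, step_size, max_horizon):
    """Number of forecast steps included in the loss at a given training iteration."""
    return min(max_horizon, 1 + iteration // step_size)

def curriculum_mae(pred, target, iteration, step_size):
    """Mean absolute error over the currently active forecast steps only.

    pred, target: arrays of shape (batch, horizon, ...).
    """
    h = active_horizon(iteration, step_size, pred.shape[1])
    return float(np.abs(pred[:, :h] - target[:, :h]).mean())

pred = np.zeros((2, 4))
target = np.tile(np.arange(4.0), (2, 1))  # the error grows with the forecast step
early_loss = curriculum_mae(pred, target, iteration=0, step_size=10)   # step 1 only
late_loss = curriculum_mae(pred, target, iteration=30, step_size=10)   # all 4 steps
```

Early iterations are scored only on the short-term steps, so the harder long-term steps enter the objective gradually.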
That said, it is a bit far-fetched to understand this process as "curriculum learning applied to graph structure learning", because the graph structure is simply learned steadily throughout the training iterations.
So there is still a lot of room to think about applying curriculum learning to graph structure learning, and I have some preliminary thoughts. For example, as mentioned before, graph construction can be parameterized. From this perspective, we can design an index that evaluates how difficult it is to construct the graph around each node, and then construct the graph gradually over the training iterations. Instead of building it in one step, we iteratively update a graph structure that may be poor in the early stage (if the initial graph is not sparse, the initial computation will be huge).