Curriculum Learning and Graph Neural Networks (or Graph Structure Learning)

Student : Wenxuan Zeng

School : University of Electronic Science and Technology of China

Date : 2022.4.25 - 2022.4.29


1 Curriculum Learning

This part covers the basics of curriculum learning, starting with the classic pioneering work "Curriculum Learning" and its core ideas.

Link: Curriculum Learning


The core idea: Let the model imitate the human learning strategy, learn simple samples first, and then gradually learn difficult samples.

Advantages: it accelerates training (knowledge learned from simple examples transfers more easily to difficult ones), so for the same level of performance the model converges earlier; and it improves generalization, guiding the model toward a better local optimum.

Method: assign different weights to training samples according to their difficulty. At the beginning, simple samples receive the highest weight (i.e., the highest probability of being sampled); as training proceeds, the weights of harder samples are gradually raised until all weights become uniform and training proceeds directly on the target training set. Note that the definition of sample difficulty is left open: different difficulty measures can be designed for different practical problems.
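To make the weighting idea concrete, here is a minimal sketch (my own illustration, not from the paper) of an easy-first sampling scheme: each sample gets a difficulty score in [0, 1], and early in training only samples below the current "competence" are likely to be drawn, with the distribution flattening toward uniform as training progresses.

```python
import numpy as np

def curriculum_weights(difficulty, progress, eps=1e-3):
    """Sampling weights for one training stage.

    difficulty: array in [0, 1]; 0 = easiest, 1 = hardest (task-specific measure).
    progress:   training progress in [0, 1]; at 1.0 sampling is (almost) uniform.
    """
    # samples harder than the current competence get a tiny, non-zero weight
    w = np.where(difficulty <= progress, 1.0, eps)
    return w / w.sum()

# toy usage: 10 samples with random difficulty scores
rng = np.random.default_rng(0)
difficulty = rng.random(10)
for progress in (0.2, 0.6, 1.0):
    p = curriculum_weights(difficulty, progress)
    batch = rng.choice(10, size=4, p=p)   # easy-biased mini-batch indices
    print(progress, batch)
```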

Experiments: the authors use four toy experiments to demonstrate the effect of curriculum learning: ① SVM classification; ② perceptron; ③ neural-network shape recognition; ④ language modeling.

  • SVM

    Two classes of data are generated from two 2-D Gaussian distributions, and the decision surface of the Bayes classifier is computed. Samples that lie strictly on the correct side of the decision surface are treated as simple samples; the rest are noisy, i.e., complex samples. An SVM trained only on the simple samples reaches a 16.3% classification error rate, while an SVM trained on all samples reaches 17.3%, showing that the simple samples are the useful ones for model training.

  • Perceptron

    A difficulty rule is defined: some elements of the input vector x are irrelevant (noise) dimensions, and the fewer non-zero noise dimensions a sample has, the simpler it is.

  • Neural Network Shape Recognition

    Data of two difficulty levels are defined: a. simple samples (equilateral triangles, squares, circles); b. complex samples (general triangles, rectangles, ellipses). Training on the simple shapes first and then recognizing the complex ones works better than training directly on the complex shapes.

  • Language Model

    The language task is next-word prediction (fill in the final word). The curriculum strategy: first take the 5,000 most frequent words from the vocabulary and train only on examples whose words all come from this set (considered simple samples); then expand the vocabulary to 10,000, 15,000, and 20,000 words, so the learning process goes from easy to hard.
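A minimal sketch of this vocabulary schedule (my own illustration; the corpus windows and the training routine are placeholders, not the paper's code):

```python
from collections import Counter

def vocab_curriculum(windows, stages=(5_000, 10_000, 15_000, 20_000)):
    """Yield training windows stage by stage, keeping only windows whose words
    all fall inside the current most-frequent-word vocabulary (easy -> hard)."""
    freq = Counter(word for window in windows for word in window)
    ranked = [word for word, _ in freq.most_common()]
    for size in stages:
        allowed = set(ranked[:size])
        yield size, [w for w in windows if all(tok in allowed for tok in w)]

# usage sketch (`windows` would be short word contexts, `train_stage` a hypothetical routine):
# for vocab_size, subset in vocab_curriculum(windows):
#     train_stage(model, subset)
```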

Why curriculum learning works: ① in the early stage of training the model does not waste time on difficult samples and only needs to learn the simple ones; ② it guides training toward a better local optimum and thus better generalization.

2 Curriculum Learning for Graph Neural Networks

From my initial survey, curriculum learning is widely used in the GNN field: for general graph classification and regression tasks, for graph representation learning and contrastive learning, for multi-domain adaptation with GNNs, in the pre-training stage of GNNs, and so on.

2.1 CurGraph: Curriculum Learning for Graph Classification

Features: curriculum learning, graph classification

Motivation: in graph classification, different graphs vary greatly in how hard they are to classify, which motivates curriculum learning; however, evaluating graph difficulty is challenging because of the irregularity of graph data.

Method: the paper proposes CurGraph, which analyzes graph difficulty in a high-level semantic feature space. It obtains graph-level embeddings with an infomax method and models their distribution with a neural density estimator. Then, based on the intra-class and inter-class distributions of the embeddings, a difficulty score is computed for each graph, and training proceeds from easy to hard according to these scores. For a smoother transition, a smooth-step method with a time-varying smoothing function filters out graphs that are still too difficult.

  • Infomax Curriculum Learning

    A SOTA unsupervised GNN model (InfoGraph) is used to obtain graph-level embedding. InfoGraph maximizes the mutual information between graph-level embeddings and node-level embeddings. This unsupervised approach can learn better graph representations.

    Use distances to define the neighbors of a graph:

    [formula from the paper: distance-based neighbor definition]

    Finally calculate the difficulty of a graph:

    [formula from the paper: graph difficulty score]

    Here p denotes a density estimate; the paper uses BNAF as the neural density estimator.

  • Smooth-Step Curriculum Learning

    Graph difficulty is divided into S levels, and an auxiliary time-varying threshold defines which difficulty values are admitted at time t: only graphs whose difficulty is below the threshold are used to train the GNN at step t. This makes the easy-to-hard transition during training smoother (a small sketch of the idea follows below).

    [formula from the paper: time-varying difficulty threshold]
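A minimal sketch of this smooth-step idea (my own paraphrase, not CurGraph's exact schedule): difficulty scores are bucketed into S levels, and a threshold that rises over training decides which graphs are visible to the GNN at step t.

```python
import numpy as np

def visible_graphs(scores, step, total_steps, S=4):
    """Indices of graphs whose difficulty level is admitted at this training step.

    scores: per-graph difficulty in [0, 1] (e.g. the density-based score above).
    """
    levels = np.minimum((scores * S).astype(int), S - 1)  # bucket into S levels
    allowed = int(S * step / total_steps)                  # time-varying threshold
    return np.where(levels <= allowed)[0]

# usage: a quarter of the way through training with S=4,
# only the two easiest levels are used
scores = np.array([0.05, 0.40, 0.65, 0.90])
print(visible_graphs(scores, step=25, total_steps=100))    # -> [0 1]
```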

Experiments:

  • Graph Classification

    Experiments were conducted on different datasets to prove the superiority of CurGraph.

    [results table from the paper]

  • Comparison with heuristic curriculum designs

    Two heuristic curricula are compared: ordering graphs by ① #Nodes or ② #Edges. The hard graphs identified by CurGraph tend to contain more nodes and edges.

    [figure from the paper: node and edge statistics of easy vs. hard graphs]

    The following is a comparison of the effects of different difficulty evaluation methods:

    [results table from the paper]

  • Ablation experiment

    The smooth-step strategy performs better than standard curriculum learning:

    [results table from the paper]

  • Hyperparameter sensitivity

    On the D&D and IMDB-M datasets, S=4 works best; on the NCI1 dataset, S=6 works best.

    [figure from the paper: sensitivity to the number of difficulty levels S]

2.2 CuCo: Graph Representation with Curriculum Contrastive Learning

Features: curriculum learning, graph contrastive learning, graph representation learning

Motivation: because labeled data are expensive, graph-level representation learning based on contrastive learning has attracted wide attention, but existing methods mainly focus on graph augmentations for positive samples, while the role of negative samples is much less studied.

Contributions: ① the first attempt to study the impact of negative samples on learning graph-level representations, which has been largely ignored by previous work but is practical and important for good self-supervised graph-level representations; ② a graph representation learning model based on curriculum contrastive learning, which combines curriculum learning and contrastive learning and can automatically select and train on negative samples in the way humans learn.

Methods: the paper investigates the impact of negative samples on graph-level representation learning and proposes a novel curriculum contrastive learning framework for self-supervised graph-level representations. Four graph augmentations generate positive and negative samples, a GNN encoder learns the graph representations, a scoring function ranks the negative samples from easy to hard, and a pacing function automatically selects which negative samples enter training at each step.

  • Graph augmentation

    The paper uses four graph augmentations: ① node dropping: randomly discard some nodes and their edges; ② edge perturbation: randomly add or remove a certain proportion of edges; ③ attribute masking: mask node features so the model must recover them from context; ④ subgraph sampling: use a random walk to sample a subgraph from G.

  • Graph encoder

    The augmented graphs then need to be encoded into representations; a GNN is chosen as the graph encoder:

    [formula from the paper: GNN graph encoder]

  • Memory bank and contrastive loss

    The noise-contrastive estimation loss over the sampled negatives:

    [formula from the paper: contrastive (NCE) loss]

    Minimizing this loss forces positive pairs to score higher than the negative pairs kept in memory. To simplify computation, the dot product is used as the similarity function.

  • Curriculum design for negative sampling

    The main idea is to sort the negative samples by difficulty during training. The curriculum has three elements (a sketch of how they fit together follows this list):

    ① Scoring function

    Two similarity measures are used: cosine similarity and dot-product similarity.

    [formulas from the paper: scoring functions]

    ② Pacing function

    The pacing function schedules how negative samples are introduced into training.

    [formulas from the paper: pacing functions]

    ③ Ordering

    To isolate the effect of the easy-to-hard ordering, three orderings of the scored negatives are compared: curriculum (lowest score to highest), anti-curriculum (highest score to lowest), and random ordering.

  • Early-stop mechanism

    In the later stage of training, as the negatives become harder, the proportion of false negatives (samples that actually share the anchor's label) increases, which hurts generalization. To alleviate this, an early-stop mechanism is designed: a patience hyperparameter p is defined; whenever the loss stops decreasing, p is decremented, and once p reaches 0 training stops.
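Here is a minimal sketch of how the scoring function, pacing function, and ordering could fit together (my own illustration, not CuCo's code; the graph encoder and the contrastive loss are assumed to exist elsewhere):

```python
import torch
import torch.nn.functional as F

def score_negatives(anchor, negatives):
    """Scoring function: cosine similarity to the anchor (higher = harder negative)."""
    return F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1)

def pacing(step, total_steps, n_total, n_min=32):
    """Pacing function: linearly grow the number of (sorted) negatives in use."""
    frac = min(1.0, step / total_steps)
    return min(n_total, max(n_min, int(frac * n_total)))

def select_negatives(anchor_emb, negative_embs, step, total_steps, order="curriculum"):
    scores = score_negatives(anchor_emb, negative_embs)
    idx = torch.argsort(scores)                  # easy (dissimilar) -> hard (similar)
    if order == "anti":                          # anti-curriculum: hardest first
        idx = idx.flip(0)
    elif order == "random":
        idx = idx[torch.randperm(len(idx))]
    k = pacing(step, total_steps, len(idx))
    return negative_embs[idx[:k]]
```

The early-stop mechanism would simply wrap the training loop around this: a patience counter p is decremented whenever the contrastive loss stops improving, and training halts once p reaches 0.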

Experiments:

  • Performance comparison

    [results table from the paper]

  • Pacing function selection

    [results from the paper]

  • Number of negative samples

    [results from the paper]

  • Training order

    [results from the paper]

  • Transfer learning

    [results from the paper]

2.3 Curriculum Graph Co-Teaching for Multi-Target Domain Adaptation (CVPR '21)

  • Features: single source domain to multiple target domains, curriculum learning, pseudo-label generation

  • Motivation: solve the multi-target domain adaptation problem.

  • Contributions: ① Curriculum Graph Co-Teaching (CGCT) for multi-target domain adaptation (MTDA), which combines a co-teaching strategy using dual classifier heads with curriculum learning to learn more robust representations across multiple target domains; ② to better exploit domain labels, a Domain-aware Curriculum Learning (DCL) strategy that smooths the feature alignment process.

  • Methods: the multi-target adaptation problem is tackled from two angles, feature aggregation and curriculum learning. Curriculum graph co-teaching uses dual classifier heads, one of which is a GCN head that aggregates features from similar samples across domains. To prevent each classifier from overfitting to its own noisy pseudo-labels, a co-teaching strategy between the two heads is developed and supplemented with curriculum learning to obtain more reliable pseudo-labels. Additionally, when domain labels are available, domain-aware curriculum learning (DCL) is proposed: a sequential adaptation strategy that first adapts to easier target domains and then to harder ones.

    [figure from the paper: CGCT framework]

    • (a) Curriculum Graph Co-Teaching

      STEP 1: Domain Adaptation

      The edge network f_edge generates the adjacency matrix. Its supervision comes from the MLP head, which labels the edges between nodes: if two nodes have the same label, their target similarity is 1, otherwise 0:

      [formula from the paper: edge supervision]

      The loss for the generated adjacency matrix:

      [formula from the paper: edge loss]

      The loss of GCN and MLP on the source domain:

      [formula from the paper: source-domain classification losses]

      The final optimization goal is:

      [formula from the paper: overall objective]

      STEP 2: Pseudo-label annotation

      The GCN head is used to pseudo-label the unlabeled target data; samples whose prediction confidence falls below a threshold do not participate in training. Why use the GCN output for labeling? Because the GCN aggregates features across samples, its predictions are more robust than the MLP's. The training set then becomes (a small sketch of this step appears at the end of this subsection):

      [formula from the paper: pseudo-labeled training set]

    • (b) Domain-aware Curriculum Learning

      The authors consider the case where domain labels are available, i.e., it is known which target domain each sample comes from. Different target domains are shifted from the source domain by different amounts, so their adaptation difficulty differs. An Easy-to-Hard Domain Selection (EHDS) strategy is adopted: adapt to the easy domains first and the hard ones later, since adapting to a domain with a smaller shift is clearly easier than adapting to one with a larger shift.

      How is it decided which domain is easier? The authors use information entropy as the measure:

      [formula from the paper: entropy-based domain difficulty]
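As a concrete illustration of the pseudo-labeling step and the entropy-based domain ordering (my own sketch with a hypothetical confidence threshold `tau`, not the authors' code):

```python
import torch

@torch.no_grad()
def pseudo_label(gcn_head, features, tau=0.8):
    """Keep only target samples whose GCN prediction is confident enough.

    The GCN head is used (rather than the MLP) because its aggregated
    features make the predictions more robust.
    """
    probs = torch.softmax(gcn_head(features), dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= tau                      # low-confidence samples stay unlabeled
    return features[keep], labels[keep]

@torch.no_grad()
def domain_order(gcn_head, target_domains):
    """EHDS sketch: rank target domains from easy to hard by mean prediction entropy."""
    entropies = []
    for feats in target_domains:            # one feature tensor per target domain
        p = torch.softmax(gcn_head(feats), dim=-1)
        entropies.append(-(p * p.clamp_min(1e-12).log()).sum(-1).mean().item())
    return sorted(range(len(entropies)), key=lambda i: entropies[i])
```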

Beyond the work above, curriculum learning has also been applied in recommender systems to heterogeneous subgraph pre-training (although the model there is a Transformer).

Link: Curriculum Pre-Training Heterogeneous Subgraph Transformer for Top-N Recommendation

3 Graph Structure Learning

The following is a list of graph structure learning methods from the past two years that I think are worth reading. For each paper I have marked the main techniques used (semantic graph generation, self-supervision, topological similarity, attention mechanisms, model parameterization/probabilistic modeling, denoising, VAE-style generative models, etc.). In my earlier KDD 2022 paper, I used three strategies, self-supervision, a generative model, and an attention mechanism, to construct high-quality graphs for GCN training.

[table: graph structure learning papers and the main techniques they use]

4 Curriculum Learning for Graph Structure Learning

From my recent survey, very few works apply curriculum learning to graph structure learning. I would like to discuss a classic paper that I like very much, "Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks". This paper is the first to introduce GNNs into multivariate time series forecasting; it learns the graph structure end-to-end through an attention-style mechanism and uses a curriculum learning strategy to reduce the difficulty of training.

The framework of this paper is as follows:

[figure from the paper: framework overview]

Attention mechanism for learning graph structure:

[formulas from the paper: graph learning layer]
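A rough sketch of this graph learning layer (my own paraphrase of the idea; dimensions, activations, and the top-k sparsification are simplified and may differ from the paper's exact formulation):

```python
import torch
import torch.nn as nn

class GraphLearner(nn.Module):
    """Learn a sparse, directed adjacency matrix from two sets of node embeddings."""
    def __init__(self, num_nodes, dim, k=4, alpha=3.0):
        super().__init__()
        self.e1 = nn.Embedding(num_nodes, dim)
        self.e2 = nn.Embedding(num_nodes, dim)
        self.lin1 = nn.Linear(dim, dim)
        self.lin2 = nn.Linear(dim, dim)
        self.k, self.alpha = k, alpha

    def forward(self):
        idx = torch.arange(self.e1.num_embeddings)
        m1 = torch.tanh(self.alpha * self.lin1(self.e1(idx)))
        m2 = torch.tanh(self.alpha * self.lin2(self.e2(idx)))
        # asymmetric similarity -> uni-directional (directed) edges
        a = torch.relu(torch.tanh(self.alpha * (m1 @ m2.T - m2 @ m1.T)))
        # keep only the k largest entries per row (sparsification)
        mask = torch.zeros_like(a)
        mask.scatter_(1, a.topk(self.k, dim=-1).indices, 1.0)
        return a * mask

adj = GraphLearner(num_nodes=8, dim=16)()   # an 8x8 learned adjacency for the GNN
```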

For multi-step forecasting, the paper proposes a curriculum learning strategy: start with easy (short-horizon) predictions and gradually expand to harder (long-horizon) predictions. Throughout this process the graph is constructed dynamically as training proceeds, and the experiments show that the mechanism is simple and effective.

[figure from the paper]
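The curriculum part is straightforward to sketch (my own illustration; `model`, `loss_fn`, `loader`, and `optimizer` are placeholders): the loss is computed only on the first h prediction steps, and h grows every few iterations until it covers the full horizon.

```python
def horizon(step, step_size=100, max_horizon=12):
    """Curriculum schedule: how many forecast steps are included in the loss."""
    return min(max_horizon, 1 + step // step_size)

# inside the training loop (sketch):
# for step, (x, y) in enumerate(loader):         # y: [batch, max_horizon, nodes]
#     h = horizon(step)
#     pred = model(x)                             # [batch, max_horizon, nodes]
#     loss = loss_fn(pred[:, :h], y[:, :h])       # only the first h steps count
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```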

That said, it is a bit of a stretch to describe this as "curriculum learning applied to graph structure learning", because here the graph structure learning component is simply trained stably throughout the iterations; the curriculum acts on the forecasting horizon rather than on the graph.

So there is still a lot of room to think about applying curriculum learning to graph structure learning. I have some preliminary ideas on this. For example, as mentioned above, graph construction can be parameterized by the model; from this perspective we could design a measure of how difficult each node is to connect, and then build the graph gradually over the training iterations rather than in one step, iteratively refining a graph structure that may be poor in the early stage (and, if it is not kept sparse, very expensive to compute initially).
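To make this idea slightly more concrete, here is a rough sketch (purely my own speculation, not an existing method): edges whose learned scores exceed a confidence quantile are kept, and the quantile relaxes over training so that harder, less confident edges are admitted gradually while the early graph stays sparse.

```python
import torch

def progressive_adjacency(edge_scores, step, total_steps, start_q=0.95, end_q=0.50):
    """Keep edges above a quantile threshold that relaxes as training proceeds.

    edge_scores: learned [N, N] scores (e.g. from an attention-based composer).
    Early in training only the most confident ("easiest") edges survive, so the
    graph stays sparse and cheap; harder edges enter the graph step by step.
    """
    q = start_q + (end_q - start_q) * min(1.0, step / total_steps)
    thresh = torch.quantile(edge_scores, q)
    return edge_scores * (edge_scores >= thresh)
```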

"arning applied to graph structure learning" is inevitably a bit far-fetched, because graph structure learning is a part of stable learning during training iterations.

So there is still a lot of room for us to think about applying curriculum learning to graph structure learning. For this aspect, I also have some preliminary thoughts. For example, I said before that the model is parameterized when composing. In fact, we can design an index to evaluate the difficulty of node composition from this perspective. In the process of training iterations, gradually compose the graph. Instead of doing it in one step, iteratively update a graph structure that may not be good in the early stage (if it is not sparse, the initial calculation will be huge).

Guess you like

Origin blog.csdn.net/qq_16763983/article/details/125156711