[Study Notes] Intensive Reading of DeepWalk Graph Neural Network Paper

Reference materials: DeepWalk [Intensive reading of graph neural network papers]

word2vec

Related papers:
Efficient Estimation of Word Representations in Vector Space
Distributed Representations of Words and Phrases and their Compositionality

Random Walk: a brief introduction

A sequence can be sampled through a random walk.

A sequence is like a sentence, and a node is like a word.

The assumption behind random walks is similar to that of word2vec: nodes that appear near each other in a walk should be similar, just as adjacent words in a sentence are. A skip-gram task can therefore be constructed in which the central node is the input and the surrounding neighboring nodes are predicted, so word2vec can be applied directly.
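As a hedged illustration of that analogy, the snippet below builds skip-gram (center, context) pairs from one walk treated as a sentence; the walk, window size, and helper name are illustrative choices rather than the paper's code.

```python
def skipgram_pairs(walk, window=2):
    """Turn one random-walk "sentence" into (center node, context node) pairs."""
    pairs = []
    for i, center in enumerate(walk):
        left, right = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(left, right):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# e.g. for a short walk over node ids, node 71's contexts are 46, 45, 24 and 5
print(skipgram_pairs([46, 45, 71, 24, 5], window=2))
```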

DeepWalk official PPT introduction

https://dl.acm.org/doi/10.1145/2623330.2623732

Core idea: random walk = sentence

Advantages of DeepWalk

  • Scalable: it is an online learning algorithm, so there is no need to retrain from scratch when new data arrives.
  • Random walks can be treated as sentences, so NLP language models can be applied directly.
  • It performs well on sparsely labeled graph classification tasks.

Language model

In the field of NLP there is a "word frequency" phenomenon: some words appear very frequently while others appear rarely.
Graphs, especially scale-free networks, show a similar "vertex frequency" phenomenon: some vertices (e.g., some websites) are visited very frequently while others are not.

Process


  1. Input graph
  2. Sample random walk sequences
  3. Train word2vec on the random walk sequences
  4. Use hierarchical softmax
  5. Obtain the vector representation of each node
2. Random walk
  • Each node is the starting point of $\gamma$ random walk sequences.
  • The maximum length of each random walk is $t$.
  • The next node is selected from the current node's neighbors with equal probability.
  • For example: $v_{46} \rightarrow v_{45} \rightarrow v_{71} \rightarrow v_{24} \rightarrow v_5 \rightarrow v_1 \rightarrow v_{17}$
3. Use the random walk sequences to construct a skip-gram task and train word2vec


4. Use hierarchical softmax


Parameters to learn: the node representations and the classifier weights, both optimized with stochastic gradient descent.
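As a minimal end-to-end sketch of the five steps above, the code below samples uniform random walks from a toy graph and trains skip-gram word2vec with hierarchical softmax on them. The toy graph, walk parameters, and embedding size are illustrative assumptions, and the use of the gensim library (with `sg=1`, `hs=1`) is just one convenient way to realize the pipeline, not the paper's original implementation.

```python
import random
from gensim.models import Word2Vec  # assumes gensim >= 4.0 is installed

# Toy undirected graph as an adjacency list (illustrative only).
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}

def random_walk(graph, start, walk_length):
    """Sample one uniform random walk with at most `walk_length` nodes."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(v) for v in walk]  # Word2Vec expects string tokens

gamma, t = 10, 8  # gamma walks per node, each of length at most t
walks = [random_walk(graph, v, t) for _ in range(gamma) for v in graph]

# sg=1 -> skip-gram, hs=1 -> hierarchical softmax, matching the setup above
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, hs=1, workers=1)
node_0_vector = model.wv["0"]  # learned representation of node 0
```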

Evaluation

Attribute prediction (node classification problem)
This is a sparse-labeling problem. DeepWalk is compared with spectral clustering, edge cluster, modularity, and wvRN.

  • BlogCatalog
  • Flickr

DeepWalk performs very well, especially when very few labels are available.

Can be parallelized

Parallelism does not affect representation quality.

Outlook

  • Streaming: No information about the entire graph is required. Dynamic updates.
  • "Non-random" walk: can carry a certain tendency.
  • Pictures and language complement each other, and breakthroughs in the two fields can learn from each other.

Intensive reading of the paper

Dataset: Karate Club

Problem definition

$G=(V,E)$
$G_L=(V,E,X,Y)$
$X \in \mathbb{R}^{|V| \times S}$: each node has an $S$-dimensional feature vector
$Y \in \mathbb{R}^{|V| \times |\mathcal{Y}|}$: each node has labels from the label set $\mathcal{Y}$

Task: relational classification (the independent-and-identically-distributed assumption does not hold)
Goal: learn $X_E \in \mathbb{R}^{|V| \times d}$, where $d$ is the embedding dimension

Embeddings that capture connectivity information + features describing the node itself => input to a machine learning classifier (e.g., fraud detection)

Desired properties of the learned representation

  • Adaptability: an online learning algorithm that adapts as the graph changes
  • Community awareness: nodes that are similar in the original graph remain close after embedding
  • Low dimensionality: helps prevent overfitting
  • Continuous: continuous representations make it easier to fit smooth decision boundaries

3.1 Random walk

Starting point: $v_i$
Random walk: $\mathcal{W}_{v_i}$, with steps $\mathcal{W}_{v_i}^1, \mathcal{W}_{v_i}^2, \ldots, \mathcal{W}_{v_i}^k$; the superscript denotes the $k$-th step of the walk.
Random walks have been used for content recommendation, community detection, and as a similarity measure.
Random walks are also the cornerstone of a class of output-sensitive algorithms.
Advantages of random walks:
1. Walk sequences can be sampled in parallel.
2. Online learning: when new nodes or edges appear in the network, there is no need to recompute over the whole graph; it is enough to sample walks from the new nodes and edges and run incremental training (see the sketch below).
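Continuing the earlier gensim-based sketch, one plausible way to realize this online-learning property is to sample new walks only around the added nodes and edges and train incrementally; the `build_vocab(update=True)`/`train` usage is an assumption about that library, not something prescribed by the paper.

```python
# Reuses graph, random_walk, gamma, t and model from the earlier sketch.
new_node, new_neighbors = 4, [1, 3]       # illustrative new node and edges
graph[new_node] = list(new_neighbors)
for n in new_neighbors:
    graph[n].append(new_node)

# Sample walks starting only from the affected nodes, not the whole graph.
new_walks = [random_walk(graph, v, t)
             for _ in range(gamma)
             for v in [new_node] + new_neighbors]

model.build_vocab(new_walks, update=True)  # register the new node's token
model.train(new_walks, total_examples=len(new_walks), epochs=model.epochs)
```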

3.2 Power law distribution (Power laws)

  • Random network
    In a random network, node degrees are generally small and no node has a degree far larger than the others; the degree distribution roughly follows a bell-shaped curve.

  • Scale-free network

  • Zipf's law:
    A word's frequency is inversely proportional to a constant power of its frequency rank; that is, only a very small number of words (nodes) are used frequently.

3.3 Language model

The goal of a language model is to estimate the likelihood of occurrence of a specific sequence of words. More formally:
given a word sequence $W_1^n=\left(w_0, w_1, \cdots, w_n\right)$, the goal is to maximize the probability:
$\Pr\left(w_n \mid w_0, w_1, \cdots, w_{n-1}\right)$

That is, given the previous $n$ words, predict the probability of the $(n+1)$-th word.

The paper analogously uses the first $i-1$ nodes to predict the $i$-th node.
A mapping $\Phi: v \in V \mapsto \mathbb{R}^{|V| \times d}$ is introduced, which maps each node to a vector via a lookup table.
The problem is thus transformed into (predicting the $i$-th node from the embeddings of the first $i-1$ nodes):
$\Pr\left(v_i \mid \left(\Phi\left(v_1\right), \Phi\left(v_2\right), \cdots, \Phi\left(v_{i-1}\right)\right)\right)$
However, this product of conditional probabilities shrinks as the walk gets longer, so it becomes vanishingly small (and impractical to model) for long walks.

Borrowing the word2vec approach

word2vec is a self-supervised model, and the order of surrounding words is irrelevant.

The skip-gram loss function: $\underset{\Phi}{\operatorname{minimize}}\;-\log \Pr\left(\left\{v_{i-w}, \cdots, v_{i+w}\right\} \backslash v_i \mid \Phi\left(v_i\right)\right)$

The order of nodes produced by a random walk carries no special meaning, so ignoring order (as skip-gram does) is a good fit.
The model is also smaller: it takes one node as input at a time and predicts the surrounding nodes.
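To make "one node in, surrounding nodes out" concrete, here is a minimal numpy sketch of a single skip-gram SGD step that uses a plain softmax over all nodes; the sizes, learning rate, and variable names are illustrative. Note that the partition function here costs $O(|V|)$ per update, which is exactly what the hierarchical softmax of Section 4.2.2 is designed to avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 50, 16                             # illustrative sizes
Phi = rng.normal(scale=0.1, size=(num_nodes, dim))  # node embeddings Φ
W = rng.normal(scale=0.1, size=(num_nodes, dim))    # output (classifier) weights

def sgd_step(Phi, W, center, context, lr=0.025):
    """One SGD step on -log p(context | center) with a full softmax."""
    h = Phi[center].copy()                    # Φ(v_center)
    scores = W @ h                            # one score per candidate node
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # partition function over all |V| nodes
    grad = probs.copy()
    grad[context] -= 1.0                      # ∂(-log p)/∂scores
    Phi[center] -= lr * (W.T @ grad)          # update the input embedding
    W -= lr * np.outer(grad, h)               # update the output weights (in place)
    return -np.log(probs[context])

loss = sgd_step(Phi, W, center=3, context=7)
```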

4. Method

4.2 Algorithm: DeepWalk

The algorithm is divided into two parts:
1. Random walk generator
2. Update step

  • DeepWalk pseudocode


  • SkipGram pseudocode

A restart probability can be added so that the walk occasionally returns to its starting node, but preliminary experiments showed no obvious impact on the results.
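A hedged sketch of what such a restart could look like, where `graph` is an adjacency list (node -> list of neighbors) and the restart probability is an arbitrary illustrative value:

```python
import random

def random_walk_with_restart(graph, start, walk_length, restart_prob=0.15):
    """Uniform random walk that jumps back to the start node with
    probability `restart_prob` at each step."""
    walk = [start]
    while len(walk) < walk_length:
        if random.random() < restart_prob:
            walk.append(start)
            continue
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

print(random_walk_with_restart({0: [1, 2], 1: [0, 2], 2: [0, 1]}, start=0, walk_length=8))
```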

4.2.1 SkipGram


4.2.2 Hierarchical softmax

It turns out that computing the softmax directly is too expensive, because the partition function sums over all $|V|$ nodes.

Therefore, DeepWalk trains two sets of weights (see the factorization below):
1. The node embedding matrix
2. $N-1$ logistic regression weight vectors (a binary tree with $N$ leaf nodes has $N-1$ internal nodes, each with its own logistic regression)
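Concretely, if $u_k$ is the context node to predict and $b_0, b_1, \ldots, b_{\lceil \log |V| \rceil}$ is the path from the tree root $b_0$ to the leaf $b_{\lceil \log |V| \rceil} = u_k$, the probability factorizes along that path, with each factor a binary logistic classifier parameterized by the internal node's weight vector $\Psi(b_l)$:
$\Pr\left(u_k \mid \Phi\left(v_j\right)\right)=\prod_{l=1}^{\lceil \log |V| \rceil} \Pr\left(b_l \mid \Phi\left(v_j\right)\right)$, where $\Pr\left(b_l \mid \Phi\left(v_j\right)\right)=1 /\left(1+e^{-\Phi\left(v_j\right) \cdot \Psi\left(b_l\right)}\right)$.
This reduces the cost of evaluating one prediction from $O(|V|)$ to $O(\log |V|)$.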

  • Overall flow chart

4.2.3 Optimization

The parameters are the two sets of weights mentioned above; their total size is $O(d|V|)$.

4.3 Multi-threaded asynchronous parallelism

4.4 Algorithm variants

4.4.1 streaming

Ensure that the learning rate remains unchanged and small.

4.4.2 Non-random walks

User interactions are usually biased, so walks can take into account not only whether a connection exists but also its weight. Natural-language sentences can themselves be viewed as biased (non-uniform) sampled sequences.
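A minimal sketch of such a biased walk, assuming edge weights are stored as (neighbor, weight) pairs (an illustrative data layout, not the paper's):

```python
import random

def biased_walk(weighted_graph, start, walk_length):
    """Random walk in which the next node is drawn in proportion to edge weight."""
    walk = [start]
    while len(walk) < walk_length:
        edges = weighted_graph[walk[-1]]
        if not edges:
            break
        neighbors, weights = zip(*edges)
        walk.append(random.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Node 0 is twice as likely to step to node 1 as to node 2.
wg = {0: [(1, 2.0), (2, 1.0)], 1: [(0, 1.0)], 2: [(0, 1.0)]}
print(biased_walk(wg, start=0, walk_length=6))
```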

5.Experimental design

Datasets: BlogCatalog, Flickr, YouTube

5.2 Comparison algorithms


6. Experiment

6.1 Multi-label node classification

$T_R$: the proportion of labeled nodes
Evaluation metrics: Macro-F1 (the F1 averaged over categories) and Micro-F1 (the F1 computed from the overall TP, FP, FN, TN counts)
Here, one-vs-rest logistic regression is used to implement the classifier.
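A minimal sketch of this evaluation protocol, assuming the node embeddings and multi-label targets are already available as arrays; the random data, train/test split, and use of scikit-learn are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(200, 64)                     # node embeddings (illustrative)
Y = (np.random.rand(200, 5) > 0.8).astype(int)  # multi-label targets (illustrative)

T_R = 0.5                                       # proportion of labeled nodes
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=T_R, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_tr, Y_tr)
Y_hat = clf.predict(X_te)

print("Macro-F1:", f1_score(Y_te, Y_hat, average="macro", zero_division=0))
print("Micro-F1:", f1_score(Y_te, Y_hat, average="micro", zero_division=0))
```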

(Reference materials on these metrics can be found online.)

The results are shown below. The figures show that:
1. $T_R$ affects the optimal dimension $d$: the larger $T_R$ is, the larger the optimal $d$.
2. The larger $\gamma$ is, the better the performance, but with diminishing returns.
3. The larger $T_R$ is, the better the performance.
4. For graphs of different sizes, the relative effect of different $\gamma$ values is consistent.

7.Related work

1. Embedding is obtained through self-supervised (unsupervised) learning.
2. Only the connectivity information of the graph is used; afterwards, the embeddings together with labels can be used to train a supervised classification model.
3. Online learning.

Origin blog.csdn.net/zhangyifeng_1995/article/details/132717699