Reference materials: DeepWalk [Intensive reading of graph neural network papers]
word2vec
Related papers:
Efficient Estimation of Word Representations in Vector Space
Distributed Representations of Words and Phrases and their Compositionality
Random Walk: a brief introduction
A sequence can be sampled through a random walk.
A sequence is like a sentence, and a node is like a word.
The assumption behind random walks is similar to that of word2vec: adjacent items should be similar. We can therefore construct a skip-gram task in which the central node is the input and the surrounding neighbor nodes are predicted, so word2vec applies directly.
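As an illustration (not the paper's code), a minimal Python sketch of how a sampled walk, treated as a sentence, is turned into skip-gram (center, context) training pairs; the function name and window convention are my assumptions:

```python
# Minimal sketch: turn one random-walk "sentence" into
# skip-gram (center, context) training pairs.
def skipgram_pairs(walk, window):
    """For each center node, pair it with neighbors within `window` positions."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

walk = ["v46", "v45", "v71", "v24", "v5"]
print(skipgram_pairs(walk, window=1))
```

Each pair then serves as one (input, target) training example for the word2vec model.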
Deepwalk official ppt introduction
https://dl.acm.org/doi/10.1145/2623330.2623732
Core idea: random walk = sentence
Advantages of Deepwalk
- Scalable: it is an online learning algorithm, so there is no need to retrain from scratch when new data arrives.
- Random walks can be treated as sentences, so language models from NLP apply directly.
- It performs well on sparsely labeled graph classification tasks.
language model
In NLP there is a phenomenon of "word frequency": some words appear very frequently, while others appear rarely.
In graphs, especially scale-free networks, there is an analogous "vertex frequency": some websites are visited very frequently, while others are not.
process
- input graph
- Sample random walk sequences
- Train word2vec on the random walk sequences
- Use hierarchical softmax
- Obtain a vector representation for each node
2. Random walk
- Each node generates $\gamma$ random walk sequences.
- Each random walk has maximum length $t$.
- The next node is selected uniformly at random from the current node's neighbors.
- Example: $v_{46} \rightarrow v_{45} \rightarrow v_{71} \rightarrow v_{24} \rightarrow v_5 \rightarrow v_1 \rightarrow v_{17}$
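The sampling procedure above can be sketched in plain Python (a simplified illustration, not the paper's implementation; the adjacency-list format and seeding are my assumptions):

```python
import random

# Sketch of DeepWalk's walk generator: gamma truncated walks starting
# from every node, each of length at most t, choosing the next node
# uniformly at random among the current node's neighbors.
def random_walk(adj, start, t, rng):
    walk = [start]
    while len(walk) < t:
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end: truncate the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(adj, gamma, t, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(gamma):
        nodes = list(adj)
        rng.shuffle(nodes)         # visit start nodes in random order each pass
        for v in nodes:
            walks.append(random_walk(adj, v, t, rng))
    return walks

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
walks = generate_walks(adj, gamma=2, t=4)
```

Each element of `walks` is then fed to the skip-gram model as one "sentence".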
3. Use random walk sequence to construct skip-gram task and train word2vec
4. Use hierarchical softmax
Learned parameters: the node representations and the classifier weights,
optimized jointly with stochastic gradient descent.
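A hedged sketch of one such SGD update, simplified to a single logistic unit of the kind that sits at each internal node of the hierarchical softmax tree; the variable names and the fixed learning rate are illustrative assumptions:

```python
import math

# Sketch: one SGD step for a single (center-embedding, logistic-unit) pair.
# `phi` is the center node's embedding, `psi` the unit's weight vector,
# `label` is 1 or 0 depending on which branch the target path takes.
def sgd_step(phi, psi, label, lr):
    score = sum(p * q for p, q in zip(phi, psi))
    pred = 1.0 / (1.0 + math.exp(-score))        # sigmoid prediction
    grad = pred - label                          # d(log-loss)/d(score)
    new_phi = [p - lr * grad * q for p, q in zip(phi, psi)]
    new_psi = [q - lr * grad * p for p, q in zip(phi, psi)]
    return new_phi, new_psi
```

With `label=1`, one step increases the dot product between `phi` and `psi`, pulling the representation toward the correct branch.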
Evaluate
Attribute prediction (a node classification problem).
This is a sparse-labeling setting. DeepWalk is compared with spectral clustering, edge clustering, modularity, and wvRN.
- BlogCatalog
- Flickr
DeepWalk performs very well, especially when very few labels are available.
Can be parallelized
Parallelism does not degrade representation quality.
Outlook
- Streaming: No information about the entire graph is required. Dynamic updates.
- "Non-random" walks: the walk can carry a certain bias.
- Graphs and language are complementary, and breakthroughs in the two fields can inform each other.
Intensive reading of papers
Dataset: Karate Club
problem definition
$G=(V,E)$
$G_L=(V,E,X,Y)$
$X \in \mathbb{R}^{|V| \times S}$: each node has an $S$-dimensional feature vector
$Y \in \mathbb{R}^{|V| \times |\mathcal{Y}|}$: each node has labels from the label set $\mathcal{Y}$
Task: relational classification (does not satisfy the assumption of independent and identical distribution)
Goal: learn $X_E \in \mathbb{R}^{|V| \times d}$, where $d$ is the embedding dimension
Embeddings capturing connectivity information + features describing the node itself => machine learning classification (e.g., fraud detection)
Desired properties of the learned representations
- Adaptability: an online learning algorithm
- Community awareness: nodes that are close in the original graph remain close after embedding
- Low dimensionality: prevents overfitting
- Continuous: convenient for fitting smooth decision boundaries
3.1 Random walk
Starting point: $v_i$
Random walk: $\mathcal{W}_{v_i}$: $\mathcal{W}_{v_i}^1, \mathcal{W}_{v_i}^2, \ldots, \mathcal{W}_{v_i}^k$, where the superscript denotes the step index $k$.
Random walk has been used for content recommendation, community detection, and as a method of similarity measurement.
Random walks are also the foundation of a class of output-sensitive algorithms, which can run in time sublinear in the size of the input graph (there is no need to traverse the entire graph).
Advantages of random walks:
1. Random walk sequences can be sampled in parallel.
2. Online learning: when new nodes and edges appear in the network, there is no need to recompute over the entire graph; it suffices to sample walks involving the new nodes and edges and continue training incrementally.
3.2 Power law distribution (Power laws)
- Random network:
In a random network, node degrees are generally small and concentrated; no node has a degree far larger than the others, and the degree distribution is roughly normal-shaped.
- Scale-free network
- Zipf's law:
A word's frequency is inversely proportional to a constant power of its frequency rank. That is, only a very small number of words (nodes) are used frequently.
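A small sketch of how one would compute the empirical rank-frequency table that Zipf's law describes, using toy data (the helper name is mine); the same counting applies to vertex visit frequencies in walk sequences:

```python
from collections import Counter

# Sketch: empirical rank-frequency table. Zipf's law predicts the
# frequency at rank r decays roughly like C / r^s for natural text,
# and analogously for vertex visits in walks on scale-free graphs.
def rank_frequency(tokens):
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)  # freq at rank 1, 2, ...

tokens = "the cat sat on the mat the cat".split()
print(rank_frequency(tokens))  # [3, 2, 1, 1, 1]
```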
3.3 Language model
The goal of a language model is to estimate the likelihood of occurrence of a specific sequence of words. More formally:
Given a word sequence $w_1^n = (w_0, w_1, \cdots, w_n)$, we want to maximize the probability
$\Pr\left(w_n \mid w_0, w_1, \cdots, w_{n-1}\right)$
That is, given the first $n$ words, predict the $(n+1)$-th word.
The paper analogously uses the first $i-1$ nodes to predict the $i$-th node.
Introduce a mapping $\Phi: v \in V \mapsto \mathbb{R}^{|V| \times d}$ that maps each node to a vector via table lookup.
The problem then becomes (predict the $i$-th node from the embeddings of the first $i-1$ nodes):
$\Pr\left(v_i \mid \Phi(v_1), \Phi(v_2), \cdots, \Phi(v_{i-1})\right)$
But this conditional probability is a product of more and more factors as the walk grows, so it becomes vanishingly small for long walks.
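A quick numeric illustration of that shrinkage (toy numbers, not from the paper): multiplying many small conditional probabilities underflows in floating point, which is why objectives are formulated in log-space:

```python
import math

# Sketch: a product of many small conditional probabilities underflows,
# while the equivalent sum of log-probabilities stays well-behaved.
probs = [0.01] * 200          # 200 steps, each with probability 0.01
product = 1.0
for p in probs:
    product *= p              # 1e-400 underflows to exactly 0.0 in float64
log_prob = sum(math.log(p) for p in probs)
print(product, log_prob)      # 0.0 vs. roughly -921
```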
Reference word2vec
word2vec is a self-supervised model, and the order of surrounding words is irrelevant.
The skip-gram loss function: $\underset{\Phi}{\operatorname{minimize}}\; -\log \Pr\left(\left\{v_{i-w}, \cdots, v_{i+w}\right\} \setminus v_i \mid \Phi\left(v_i\right)\right)$
The order of the graph generated by a random walk is meaningless.
The model is smaller: it takes one node as input at a time and predicts the surrounding nodes.
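For illustration, a naive full-softmax version of this skip-gram objective; this is precisely what DeepWalk avoids via hierarchical softmax, since the partition function costs $O(|V|)$ per example. The function and argument names here are assumptions:

```python
import math

# Sketch (illustration only): skip-gram negative log-likelihood with a
# full softmax over the vocabulary. `phi` holds input embeddings and
# `psi` output embeddings, both indexed by integer node id.
def neg_log_likelihood(phi, psi, center, context_ids):
    """-log Pr(context | center), summed over the window's context nodes."""
    scores = [sum(a * b for a, b in zip(phi[center], psi[u]))
              for u in range(len(psi))]
    log_z = math.log(sum(math.exp(s) for s in scores))  # partition function
    return sum(log_z - scores[u] for u in context_ids)
```

With all-zero embeddings the distribution is uniform, so the loss per context node is exactly $\log |V|$.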
4.Method
4.2 Algorithm: deepwalk
The algorithm is divided into two parts:
1. Random walk generator
2. Update step
- DeepWalk pseudocode
- SkipGram pseudocode
A probability can be set for the random walk to return to its starting node after traveling some distance, but preliminary experiments showed this has no obvious impact on the results.
4.2.1 skipgram
4.2.2 Hierarchical softmax
Computing the softmax directly turns out to be too expensive because of the partition function.
Therefore, deepwalk needs to train two sets of weights:
1. Word embedding matrix
2. $N-1$ logistic regression weight vectors (a binary tree with $N$ leaves has $N-1$ internal nodes, each with its own logistic regression)
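A minimal sketch of the hierarchical-softmax factorization: the probability of predicting a node is a product of sigmoids, one per internal node on the root-to-leaf path. The path encoding used here (a list of (weights, branch) pairs with branch = ±1) is an illustrative assumption:

```python
import math

# Sketch of hierarchical softmax: Pr(u | v) factorizes into one binary
# decision per internal tree node on the root-to-leaf path for u,
# so evaluation costs O(log N) instead of O(N).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(phi_v, path):
    """phi_v: center-node embedding; path: list of (unit_weights, branch),
    where branch is +1 or -1 for the direction taken at that unit."""
    prob = 1.0
    for weights, branch in path:
        score = sum(a * b for a, b in zip(phi_v, weights))
        prob *= sigmoid(branch * score)  # sigmoid(-x) = 1 - sigmoid(x)
    return prob
```

Because the two branches at each unit have probabilities summing to 1, the leaf probabilities form a valid distribution without any partition function.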
- Overall flow chart
4.2.3 Optimization
The parameters are the two sets of weights above; their total size is $O(d|V|)$.
4.3 Multi-threaded asynchronous parallelism
4.4 Algorithm variants
4.4.1 streaming
Keep the learning rate fixed and small.
4.4.2 Non-random walks
User interactions tend to be biased, so a non-random walk can account not only for the existence of an edge but also for its weight. Sentences themselves can in fact be viewed as biased sampling sequences.
5.Experimental design
data set:
5.2 Comparison algorithm
6. Experiment
6.1 Multi-category node classification
$T_R$: the proportion of labeled nodes
Evaluation metrics: Macro-F1 (the average of per-class F1 scores) and Micro-F1 (one F1 computed from the overall TP, FN, FP, TN counts)
Here, one-vs-rest logistic regression is used to implement the classifier.
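A small sketch of the two metrics, assuming per-class confusion counts are already available (the helper names are mine):

```python
# Sketch: Macro-F1 averages per-class F1 scores; Micro-F1 first pools
# TP/FP/FN over all classes, then computes a single F1.
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_micro_f1(per_class):
    """per_class: list of (tp, fp, fn) tuples, one per label."""
    macro = sum(f1(*c) for c in per_class) / len(per_class)
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    micro = f1(tp, fp, fn)
    return macro, micro
```

Macro-F1 weights every class equally, so it is sensitive to performance on rare classes; Micro-F1 is dominated by the frequent classes.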
You can find reference materials
The effect is shown below.
The results shown in the figure:
1. $T_R$ affects the optimal dimension $d$: the larger $T_R$, the larger the optimal $d$.
2. The larger $\gamma$, the better the performance, but with diminishing returns.
3. The larger $T_R$, the better the performance.
4. For graphs of different sizes, the relative effect of different $\gamma$ values is consistent.
7.Related work
1. Embeddings are obtained through self-supervised (unsupervised) learning.
2. Only the connectivity information in the graph is used. The embeddings together with labels can then be used to train a supervised classification model.
3. Online learning.