1. Manual feature engineering and connection features
- Nodes, connections, subgraphs, and entire graphs each have their own attribute features, and these attribute features are generally multimodal.
- In addition to attribute features, there are also connection features. This lecture focuses on extracting connection features with manual (hand-crafted) feature extraction methods.
2. Feature extraction of connection features at the node level
- The degree of a node: only looks at the number of connections, not the quality of connections.
- Node centrality:
- Eigenvector centrality: The principle is that if the nodes around a node are important, then it is also important.
- Betweenness centrality: The principle is that if a node is in a traffic chokepoint, then it is important.
- Closeness centrality: The principle is that if a node is close to every other node, then it is important.
- The clustering coefficient of a node: measures how tightly the node's neighbors are connected to each other, which amounts to counting the triangles that have the node as a vertex.
- Graphlets: The triangle counted by the clustering coefficient is one particular subgraph, and it can be replaced with other small subgraphs. These subgraphs are called graphlets. Counting how many times a node appears in each type of graphlet gives a vector, the Graphlet Degree Vector (GDV), which describes the topology of the node's neighborhood.
- Other measures include PageRank, Katz centrality, etc. NetworkX provides ready-made implementations of many of these algorithms.
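The node-level measures above can be collected into one feature vector per node. The following is a minimal sketch using NetworkX; the example graph, the `features` dictionary, and the column order are assumptions made for illustration, not part of the lecture.

```python
import networkx as nx

# Illustrative graph (an assumption): a triangle 0-1-2 with a tail 2-3-4.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

# Each entry maps node -> value for one node-level connection feature.
features = {
    "degree": dict(G.degree()),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "clustering": nx.clustering(G),
}

# Stack the features into one vector per node, using a fixed column order.
columns = list(features)
node_vectors = {v: [features[name][v] for name in columns] for v in G.nodes}

for v, vec in node_vectors.items():
    print(v, [round(x, 3) for x in vec])
```

Node 2 joins the triangle to the tail, so it should score highest on degree, betweenness, and closeness in this graph, while nodes 0 and 1 have the highest clustering coefficient.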
3. Feature extraction of connection features at the connection level
- That is, extract the characteristics of the connection and turn the connection into a d-dimensional vector.
- Based on the distance between two nodes:
- The length of the shortest path between two nodes: only looks at the length, ignoring how many such paths exist and the quality of the nodes along them.
- Based on the local connection information of two nodes:
- The number of common adjacent nodes between two nodes (intersection)
- The ratio of the intersection to the union of the two nodes' adjacent-node sets (the Jaccard coefficient)
- Adamic-Adar index:
- $S_{a}=\sum_{u\in N(V_{1})\cap N(V_{2})}\frac{1}{\log(k_{u})}$, where $N(V)$ is the set of adjacent nodes of $V$ and $k_{u}$ is the degree of the common adjacent node $u$.
- Intuitively, if two people are connected only through public figures (who have very many connections), they are probably not very close. If they are connected through an ordinary person, their relationship is probably quite close.
There is a problem: if two nodes have no common adjacent node, all three of the measures above are 0 and carry no information. This requires looking at the information of the entire graph.
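The three local measures, and the way they all collapse to 0 for a pair with no common adjacent node, can be sketched as follows. The helper name `local_overlap` and the example graph are hypothetical, chosen only for illustration.

```python
import math
import networkx as nx

# Illustrative graph (an assumption): a and b share neighbors c and d;
# f is far away from a and shares no neighbor with it.
G = nx.Graph([("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"),
              ("d", "e"), ("e", "f")])

def local_overlap(G, u, v):
    """Return (common neighbors, Jaccard coefficient, Adamic-Adar index)
    for two distinct nodes u and v."""
    common = set(G[u]) & set(G[v])
    union = set(G[u]) | set(G[v])
    jaccard = len(common) / len(union) if union else 0.0
    # A common neighbor of two distinct nodes has degree >= 2, so log > 0.
    adamic_adar = sum(1.0 / math.log(G.degree(w)) for w in common)
    return len(common), jaccard, adamic_adar

print(local_overlap(G, "a", "b"))  # pair with two common neighbors
print(local_overlap(G, "a", "f"))  # pair with none: every measure is 0
```

NetworkX also ships `nx.jaccard_coefficient` and `nx.adamic_adar_index`, which can be used instead of the hand-written helper.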
- Based on the connection information of two nodes in the whole graph - Katz index:
- Count the paths of every length between two nodes and sum them, with longer paths discounted.
- The path counts can be obtained by raising the adjacency matrix to a power.
- Suppose the adjacency matrix of the graph is $A$. Then the number of paths of length $k$ between nodes $u$ and $v$ is the entry in row $u$, column $v$ of the matrix $A^{k}$.
- $S_{u,v} = \sum_{l=1}^{\infty} \beta^{l} (A^{l})_{u,v} = \left[(I-\beta A)^{-1}-I\right]_{u,v}$, where $\beta$ is the scaling factor and $S$ is the Katz coefficient matrix. The closed form holds when the series converges, that is, when $\beta$ is smaller than the reciprocal of the largest eigenvalue of $A$.
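The matrix-power path count and the closed form of the Katz index can be checked numerically. This is a minimal NumPy sketch; the path graph and the choice of `beta` (half the reciprocal of the largest eigenvalue, so that the series converges) are assumptions made for illustration.

```python
import numpy as np
import networkx as nx

# Illustrative graph (an assumption): the path 0-1-2-3.
G = nx.path_graph(4)
A = nx.to_numpy_array(G, nodelist=sorted(G.nodes))

# Entry (u, v) of A^k is the number of length-k paths (walks) from u to v.
A3 = np.linalg.matrix_power(A, 3)

# Pick beta below 1 / lambda_max so that the infinite series converges.
lam_max = np.max(np.abs(np.linalg.eigvalsh(A)))
beta = 0.5 / lam_max

# Closed form: S = (I - beta * A)^(-1) - I
I = np.eye(A.shape[0])
S = np.linalg.inv(I - beta * A) - I

# Sanity check against the truncated series sum_{l=1}^{59} beta^l A^l.
S_series = sum(beta**l * np.linalg.matrix_power(A, l) for l in range(1, 60))

print(np.round(S, 4))
print(np.allclose(S, S_series))
```

In this graph, nodes 0 and 3 have no common adjacent node, yet their Katz index is still positive, which is the point of using whole-graph information.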
4. Feature extraction of connection features at the whole-graph level
- The obtained features should reflect the structural characteristics of the entire graph.
- Bag-of-node-degrees: only looks at the degrees of the nodes, not the connection structure. It simply counts how many nodes have each degree.
- Graphlet Kernel:
- Counting the occurrences of each graphlet gives a Bag-of-Graphlets, which can be seen as a generalization of Bag-of-*.
- Unlike at the node level, graphlets at the whole-graph level may contain isolated nodes.
- The counts of the various graphlets also form a d-dimensional vector.
- Normalizing the Bag-of-Graphlets vectors of two graphs and then taking their dot product gives the Graphlet Kernel of the two graphs.
- However, the computational complexity of the Graphlet Kernel is too high, which severely limits its practical use. This motivates the Weisfeiler-Lehman Kernel.
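As a rough illustration of the Graphlet Kernel, the sketch below restricts itself to graphlets of size 3. On three nodes of a simple graph there are four graphlet types, distinguished by their number of edges (0, 1, 2, or 3), so the edge count of each induced subgraph identifies its type. The function names are hypothetical, and the brute-force enumeration of all node triples is exactly the kind of cost that makes this kernel expensive on large graphs.

```python
from itertools import combinations

import networkx as nx
import numpy as np

def graphlet3_counts(G):
    """Count induced 3-node subgraphs by edge count (0, 1, 2, or 3 edges).
    Assumes a simple undirected graph without self-loops."""
    counts = np.zeros(4)
    for triple in combinations(G.nodes, 3):
        counts[G.subgraph(triple).number_of_edges()] += 1
    return counts

def graphlet_kernel(G1, G2):
    """Dot product of the normalized size-3 graphlet count vectors.
    Assumes both graphs have at least 3 nodes."""
    f1, f2 = graphlet3_counts(G1), graphlet3_counts(G2)
    h1, h2 = f1 / f1.sum(), f2 / f2.sum()
    return float(h1 @ h2)

triangle = nx.complete_graph(3)
path3 = nx.path_graph(3)
cycle4 = nx.cycle_graph(4)

print(graphlet3_counts(cycle4))
print(graphlet_kernel(triangle, path3))
print(graphlet_kernel(cycle4, path3))
```

Dividing each count vector by its sum keeps a large graph from dominating the kernel value simply because it contains more triples.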
- Weisfeiler-Lehman Kernel:
- Its characteristic is to continuously enrich the node vocabulary through iteration.
- It uses the color refinement method.
- Over multiple iterations, the node colors are refined and the node vocabulary is enriched. Finally, the number of nodes of each color is counted to obtain a vector, which completes the feature extraction.
- Taking the dot product of the vectors of two graphs gives the Weisfeiler-Lehman Kernel.
- More iterations generally give more discriminative features, since after K iterations a node's color summarizes its K-hop neighborhood.
- Note 1: When calculating the Weisfeiler-Lehman Kernel of two graphs, the iterative calculations must be performed simultaneously, that is, the node color vocabulary must be contributed by both graphs at the same time.
- Note 2: The implementation of weisfeiler_lehman_graph_hash in NetworkX is different from what is described above, but gklearn.kernels.Weisfeilerlehmankernel is the same.
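The color refinement procedure and the shared vocabulary from Note 1 can be sketched in plain Python. This is a minimal illustration, not the NetworkX or gklearn implementation. The function names are hypothetical, and it assumes unlabeled nodes, so every node starts with the same color 0.

```python
from collections import Counter

import networkx as nx

def wl_color_counts(graphs, num_iterations=2):
    """Refine colors on all graphs together and return one color Counter
    per graph, accumulated over the initial coloring and every iteration."""
    vocab = {}  # shared vocabulary: signature -> color id (0 is reserved)
    colors = [{v: 0 for v in G.nodes} for G in graphs]
    counts = [Counter(c.values()) for c in colors]
    for _ in range(num_iterations):
        new_colors = []
        for G, col in zip(graphs, colors):
            updated = {}
            for v in G.nodes:
                # Signature: own color plus the sorted multiset of
                # neighbor colors.
                signature = (col[v], tuple(sorted(col[u] for u in G[v])))
                if signature not in vocab:
                    vocab[signature] = len(vocab) + 1
                updated[v] = vocab[signature]
            new_colors.append(updated)
        colors = new_colors
        for cnt, col in zip(counts, colors):
            cnt.update(col.values())
    return counts

def wl_kernel(G1, G2, num_iterations=2):
    """Dot product of the two color-count vectors."""
    c1, c2 = wl_color_counts([G1, G2], num_iterations)
    return sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())

triangle = nx.complete_graph(3)
path3 = nx.path_graph(3)
print(wl_kernel(triangle, path3, num_iterations=1))
print(wl_kernel(triangle, triangle, num_iterations=1))
```

Because `vocab` is shared, the same neighborhood signature receives the same color id in both graphs, which is what makes the two count vectors comparable.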