003 Traditional graph machine learning: graph feature engineering


1. Hand-crafted feature engineering and connection features

  • Nodes, connections, subgraphs, and entire graphs each have their own attribute features, and attribute features are generally multimodal.
  • Besides attribute features there are also connection (structural) features. The focus of this lecture is extracting connection features with hand-crafted feature-engineering methods.

2. Extracting connection features at the node level

  • The degree of a node: only looks at the number of connections, not the quality of connections.
  • Node centrality:
  1. Eigenvector centrality: The principle is that if the nodes around a node are important, then it is also important.
  2. Betweenness centrality: The principle is that if a node is in a traffic chokepoint, then it is important.
  3. Closeness centrality: the principle is that if a node is close to all other nodes (its average shortest-path distance is small), then it is important.
  • The clustering coefficient of a node: measures how tightly the node's neighborhood is interconnected; in practice it counts the triangles that have the node as a vertex.
  • Graphlets: the triangle used in the clustering coefficient can be viewed as one particular subgraph, and other subgraphs can take its place; this is the idea of graphlets. Counting the number of each kind of subgraph around a node forms a vector, the Graphlet Degree Vector (GDV), which describes the node's neighborhood topology.
  • There are other measures as well, such as PageRank and Katz centrality. NetworkX implements a wide range of these graph-mining algorithms.
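As the last bullet notes, NetworkX ships implementations of these node-level measures. A minimal sketch, using the built-in karate-club graph purely as example data:

```python
import networkx as nx

# Example graph: Zachary's karate club, bundled with NetworkX.
G = nx.karate_club_graph()

# Node degree: counts connections only, ignoring their quality.
degree = dict(G.degree())

# Eigenvector centrality: a node is important if its neighbors are important.
eig = nx.eigenvector_centrality(G)

# Betweenness centrality: how often a node lies on shortest paths (chokepoints).
btw = nx.betweenness_centrality(G)

# Closeness centrality: a node is important if it is close to everything.
clo = nx.closeness_centrality(G)

# Clustering coefficient: fraction of possible triangles around each node.
clu = nx.clustering(G)

# PageRank, one of the additional measures mentioned above.
pr = nx.pagerank(G)

print(max(btw, key=btw.get))  # the most "chokepoint-like" node
```

Each call returns a dict mapping node to score, so the values can be concatenated into a per-node feature vector.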

3. Extracting connection features at the link level

  • That is, extract features of each connection (node pair) and turn the connection into a d-dimensional vector.
  • Based on the distance between two nodes:
  • The length of the shortest path between two nodes: only the length is considered; the number of such paths and their quality are ignored.
  • Based on the local connection information of two nodes:
  1. The number of common neighbors of the two nodes (the size of the intersection of their neighbor sets)
  2. Jaccard coefficient: the ratio of the intersection to the union of the two nodes' neighbor sets
  3. Adamic-Adar index:
  • $S_{AA}=\sum_{u\in N(v_{1})\cap N(v_{2})}\frac{1}{\log(k_{u})}$, where $k_{u}$ is the degree of the common neighbor $u$
  • Intuitively: if two people are connected only through celebrities (high-degree hubs), they are probably not close; if they are connected through an ordinary person (a low-degree common neighbor), the relationship is probably strong.
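NetworkX exposes all three local indicators through its link-prediction helpers. A quick sketch; the node pair (0, 33) is an arbitrary choice for illustration:

```python
import math
import networkx as nx

G = nx.karate_club_graph()
u, v = 0, 33  # an arbitrary node pair to score

# 1. Number of common neighbors (size of the intersection).
common = len(list(nx.common_neighbors(G, u, v)))

# 2. Jaccard coefficient: |intersection| / |union| of the neighbor sets.
_, _, jaccard = next(nx.jaccard_coefficient(G, [(u, v)]))

# 3. Adamic-Adar index: sum of 1 / log(degree) over common neighbors,
#    so low-degree ("ordinary") common neighbors contribute more.
_, _, aa = next(nx.adamic_adar_index(G, [(u, v)]))

print(common, jaccard, aa)
```

The helpers take an iterable of node pairs, so all candidate links can be scored in one pass.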

There is a problem: if two nodes have no common neighbor, all three indicators above are 0, which tells us nothing. In that case information from the whole graph is needed.

  • Based on the connection information of two nodes in the whole graph - Katz index:
  • Record the number of paths of length k between two nodes.
  • It can be computed via powers of the adjacency matrix.
  • Suppose the adjacency matrix of the graph is $A$; then the number of paths of length $k$ between nodes $u$ and $v$ is the entry in row $u$, column $v$ of $A^{k}$.
  • $S_{u,v}=\sum_{l=1}^{\infty }\beta ^{l}A^{l}_{u,v}=\left((I-\beta A)^{-1}-I\right)_{u,v}$, where $\beta$ is a discount factor with $0<\beta <1/\lambda _{\max }$ so that the series converges; the result $S$ is the Katz coefficient matrix.
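The closed form above can be checked numerically. A NumPy sketch; the choice beta = 0.5 / lambda_max is an arbitrary safe value, anything below 1 / lambda_max works:

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
n = A.shape[0]

# The geometric series converges only for beta < 1 / lambda_max(A);
# 0.5 / lambda_max is an arbitrary choice inside that range.
lam_max = max(abs(np.linalg.eigvals(A)))
beta = 0.5 / lam_max

# Closed form: S = (I - beta*A)^{-1} - I  (the Katz coefficient matrix)
S = np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)
```

Entry `S[u, v]` scores the link (u, v) using paths of every length, weighted down exponentially by beta.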

4. Extracting connection features at the graph level

  • The obtained features should reflect the structural characteristics of the entire graph.
  • Bag-of-node-degrees: looks only at node degrees, not at the connection structure. In effect, count how many nodes have each degree value.
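In code this is just a degree histogram. A sketch (again on the karate-club example graph):

```python
from collections import Counter
import networkx as nx

G = nx.karate_club_graph()

# Count how many nodes have each degree value.
bag = Counter(d for _, d in G.degree())

# Turn the histogram into a fixed-length feature vector:
# entry d holds the number of nodes with degree d.
vec = [bag.get(d, 0) for d in range(max(bag) + 1)]
```

Two graphs can then be compared through their degree vectors, though the connection structure itself is lost.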
  • Graphlet Kernel:
  • Counting the graphlets in a graph yields a Bag-of-Graphlets representation, an instance of the general Bag-of-* idea.
  • Unlike at the node level, graph-level graphlets may contain isolated nodes (they need not be connected or rooted at a particular node).
  • Counting the number of various Graphlets can also form a d-dimensional vector.
  • After normalizing the Bag-of-Graphlets vectors of the two graphs, their inner product gives the Graphlet Kernel.
  • However, counting graphlets is computationally very expensive, which limits the kernel's practical use and motivates the Weisfeiler-Lehman Kernel.
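An illustrative sketch restricted to the four 3-node graphlets (distinguished here simply by having 0, 1, 2, or 3 edges; isolated nodes allowed). Real graphlet kernels use larger graphlets, and enumerating every node triple already shows why the cost explodes:

```python
from itertools import combinations
import networkx as nx
import numpy as np

def graphlet3_vector(G):
    """Bag-of-Graphlets over induced 3-node subgraphs, indexed by edge count."""
    counts = np.zeros(4)
    for trio in combinations(G.nodes(), 3):  # already O(n^3) trios
        counts[G.subgraph(trio).number_of_edges()] += 1
    return counts

def graphlet_kernel(G, H):
    fg, fh = graphlet3_vector(G), graphlet3_vector(H)
    # Normalize to remove graph-size bias, then take the inner product.
    return float((fg / fg.sum()) @ (fh / fh.sum()))
```

For size-k graphlets the enumeration grows as O(n^k), which is exactly the complexity problem the bullet above points out.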
  • Weisfeiler-Lehman Kernel:
  • Its characteristic is to continuously enrich the node vocabulary through iteration.
  • It uses the color refinement method.
  • Over multiple iterations the node colors are refined and the node vocabulary grows richer; finally, the number of occurrences of each color is counted to obtain the feature vector.
  • The inner product of the two graphs' vectors is the Weisfeiler-Lehman Kernel.
  • Generally, the more iterations, the better the effect.
  • Note 1: when computing the Weisfeiler-Lehman Kernel of two graphs, the iterations must be run jointly, i.e. the color vocabulary must be built from both graphs at the same time.
  • Note 2: the implementation of weisfeiler_lehman_graph_hash in NetworkX differs from the procedure above, but gklearn.kernels.Weisfeilerlehmankernel matches it.
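A minimal sketch of this procedure for unlabeled graphs, with the color vocabulary (`table`) shared by both graphs as Note 1 requires; every node starts from the same initial color, and color counts from all iterations are pooled into the feature vector:

```python
from collections import Counter

def wl_kernel(G, H, iterations=3):
    """Weisfeiler-Lehman kernel sketch: joint color refinement, then the
    inner product of the two color-count vectors (all iterations pooled)."""
    graphs = [G, H]
    colors = [{v: 0 for v in g} for g in graphs]   # same initial color everywhere
    hists = [Counter(c.values()) for c in colors]  # count the initial colors too
    table = {}  # shared vocabulary: signature -> color id (Note 1)
    for _ in range(iterations):
        new_colors = []
        for g, col in zip(graphs, colors):
            nc = {}
            for v in g:
                # Signature = own color plus sorted multiset of neighbor colors.
                sig = (col[v], tuple(sorted(col[u] for u in g[v])))
                nc[v] = table.setdefault(sig, len(table) + 1)
            new_colors.append(nc)
        colors = new_colors
        for h, c in zip(hists, colors):
            h.update(c.values())
    hg, hh = hists
    return sum(hg[c] * hh[c] for c in hg.keys() & hh.keys())
```

For node-labeled graphs the initial colors would come from the labels instead of a single shared color.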


Source: blog.csdn.net/qq_44928822/article/details/132662798