A Swiss Army knife for small-scale graph mining research: the Karate Club graph learning Python library


 

Author | Benedek Rozemberczki

Translator | Him Zebian | Carol

Produced | AI technology base camp (ID: rgznai100)

       

Karate Club is an unsupervised machine learning extension library for the NetworkX Python package. For details, see the documentation here:

https://github.com/benedekrozemberczki/karateclub

Karate Club consists of state-of-the-art methods for unsupervised learning on graph-structured data. In short, it is a Swiss Army knife for small-scale graph mining research.

First, it provides network embedding techniques at the node level and at the graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. The implemented methods cover a wide range of conferences, workshops, and journals in network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI), and machine learning (NeurIPS, ICML, ICLR).

A simple example

 

Karate Club makes modern community detection techniques easy to use (see the accompanying tutorial here: https://karateclub.readthedocs.io/en/latest/notes/introduction.html ). For instance, an overlapping community detection algorithm can be applied to a synthetic graph in just a few lines of code.

       

Design Principles

  

When we created Karate Club, we took an API-oriented view of machine learning system design in order to build an end-user-friendly machine learning tool. This API-oriented design principle entails a few simple ideas. In this section, we discuss these ideas and their advantages in detail, with illustrative examples.

1) Encapsulated model hyperparameters and inspection

An unsupervised Karate Club model instance is created by calling the constructor of the appropriate Python object. The constructor has default hyperparameter settings that allow sensible out-of-the-box use of the model. Simply put, end users do not need to understand the inner mechanics of a model in great detail to use the methods implemented in our framework.

We set these default hyperparameters to provide reasonable learning and runtime performance. If necessary, the hyperparameters can be modified when the model is created by passing the appropriate arguments to the constructor. The hyperparameters are stored as public attributes, so the model settings can be inspected.

              

Hyperparameter encapsulation can be demonstrated with a short example. First, we want to create an embedding for an Erdos-Renyi graph generated with NetworkX, using the standard hyperparameter settings.

When the model is constructed, we do not change the default hyperparameters, and we can print the standard setting of the dimensions hyperparameter. Second, we decide to set a different number of dimensions, so we create a new model, and the dimensions hyperparameter is still publicly accessible.
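The pattern can be sketched in plain Python. The class below is a hypothetical stand-in, not a Karate Club model, and its hyperparameter names and default values are assumptions chosen for illustration. It only shows the idea of defaults set in the constructor and exposed as public attributes.

```python
class ToyEmbedder:
    """Hypothetical estimator used for illustration; not a Karate Club class."""

    def __init__(self, dimensions=128, walk_length=80, seed=42):
        # Hyperparameters are stored as public attributes so that
        # the model settings can be inspected after construction.
        self.dimensions = dimensions
        self.walk_length = walk_length
        self.seed = seed


# Out-of-the-box use: no hyperparameter is passed explicitly.
default_model = ToyEmbedder()
print(default_model.dimensions)  # 128

# Overriding one hyperparameter leaves the other defaults untouched.
small_model = ToyEmbedder(dimensions=64)
print(small_model.dimensions)  # 64
```

Because only the constructor call changes, code that later reads `model.dimensions` works the same way for both instances.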

2) Consistency and non-proliferation of classes

Each unsupervised machine learning model in Karate Club is implemented as a separate class that inherits from the Estimator class. Because we do not assume that end users are particularly interested in the algorithmic details of a specific technique, the algorithms implemented in our framework expose only a few public methods.

All models are fitted with the fit() method, which takes the inputs (graph, node features) and calls the appropriate private methods to learn an embedding or a clustering. Node and graph embeddings are returned by the public get_embedding() method, while cluster memberships are retrieved by calling get_memberships().

              

As an example, suppose we create a random graph and a DeepWalk model with the default hyperparameters, fit the model with the public fit() method, and retrieve the embedding by calling the public get_embedding() method.

This example can be turned into a Walklets embedding with minimal effort by changing only the model import and the constructor call.

Comparing the two versions, the advantage of the API-driven design is evident, because only a few modifications are needed. First, we change the import of the embedding model. Second, we change the model construction, where the default hyperparameters are already set.

Third, the public methods of the DeepWalk and Walklets classes behave the same way. An embedding is learned with fit() and returned by get_embedding(). When an upstream unsupervised model used for feature extraction performs poorly, it can therefore be replaced quickly and with minimal changes to the code.
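The following is a minimal sketch of why a shared interface makes such a swap cheap. The base class and the two toy models are hypothetical and do not reproduce Karate Club's implementation. As an assumption, the graph is a plain adjacency dictionary with nodes numbered from 0, standing in for a NetworkX graph.

```python
class Estimator:
    """Hypothetical base class with the small public interface described above."""

    def fit(self, graph):
        raise NotImplementedError

    def get_embedding(self):
        raise NotImplementedError


class DegreeEmbedder(Estimator):
    """Toy model: each node is embedded as [degree]."""

    def fit(self, graph):
        self._embedding = [[float(len(graph[node]))] for node in range(len(graph))]

    def get_embedding(self):
        return self._embedding


class TwoHopEmbedder(Estimator):
    """Toy model: each node is embedded as [degree, nodes reachable within two hops]."""

    def fit(self, graph):
        self._embedding = []
        for node in range(len(graph)):
            reach = set(graph[node])
            for neighbor in graph[node]:
                reach.update(graph[neighbor])
            reach.discard(node)  # a node does not count as its own neighbor
            self._embedding.append([float(len(graph[node])), float(len(reach))])

    def get_embedding(self):
        return self._embedding


def embed(model, graph):
    # Downstream code relies only on fit() and get_embedding().
    model.fit(graph)
    return model.get_embedding()


# Path graph 0 - 1 - 2 - 3 as an adjacency dictionary.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
first = embed(DegreeEmbedder(), path)
second = embed(TwoHopEmbedder(), path)
```

Swapping the model changes a single constructor call, while `embed` and everything after it stay untouched.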

3) Standardized dataset ingestion

We designed Karate Club to use standardized dataset ingestion when a model is fitted. In practice, this means that algorithms with the same purpose use the same data types for model training. In detail:

  • Neighborhood-based and structural node embedding techniques take a single NetworkX graph as input when fitting.

  • Attributed node embedding procedures take a NetworkX graph as input, with the features represented as a NumPy array or a SciPy sparse matrix. In these matrices, rows correspond to nodes and columns correspond to features.

  • Graph-level embedding methods and statistical graph fingerprints take a list of NetworkX graphs as input.

  • Community detection methods take a NetworkX graph as input.

4) High-performance model mechanics

The underlying mechanics of the graph mining algorithms are implemented with widely available Python libraries that do not depend on the operating system and do not require other external libraries (such as TensorFlow or PyTorch). The internal graph representations in Karate Club use NetworkX.

Dense linear algebra operations are done with NumPy, and their sparse counterparts use SciPy. Implicit matrix factorization techniques use the GenSim package, and methods that rely on graph signal processing use PyGSP.

5) Standardized output generation and interfacing

The standardized output generation of Karate Club ensures that unsupervised learning algorithms serving the same purpose always return the same type of output, with a consistent ordering of data points.

This design principle has a very important consequence. When an algorithm is replaced by another algorithm of the same type, the downstream code that consumes the output of the upstream unsupervised model does not have to change. Specifically, the outputs generated by our framework use the following data structures:

  • Node embedding algorithms (neighborhood-preserving, attributed, and structural) always return a NumPy float array when the get_embedding() method is called. The number of rows in the array is the number of vertices, and the row index always corresponds to the vertex index. The number of columns is the number of embedding dimensions.

  • Whole-graph embedding methods (spectral fingerprints, implicit matrix factorization techniques) return a NumPy float array when the get_embedding() method is called. The row index corresponds to the position of a single graph in the input list of graphs. Likewise, the columns represent the embedding dimensions.

  • Community detection procedures return a dictionary when the get_memberships() method is called. Node indices are the keys, and the value for each key is the community membership of that vertex. Certain graph clustering techniques create a node embedding in order to find vertex clusters. These return a NumPy float array when the get_embedding() method is called, and the array is structured like the ones returned by node embedding algorithms.

Standardized output generation and interfacing can be illustrated as follows. We cluster a random graph and retrieve a dictionary containing the cluster memberships. Using the external community library, we can then compute the modularity of the clustering.

This shows that standardized output generation makes it easy to interface with external graph mining and machine learning libraries.
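To make the membership dictionary concrete, here is a small pure-Python sketch that computes modularity directly from such a dictionary. It is a stand-in for the external community library mentioned above, not that library's code. The edge list and the memberships are written by hand as assumptions rather than produced by a Karate Club model.

```python
from collections import defaultdict


def modularity(edges, memberships):
    """Newman modularity of a non-overlapping partition (undirected, unweighted)."""
    m = len(edges)
    internal_edges = defaultdict(int)  # edges with both endpoints in the community
    degree_sum = defaultdict(int)      # total degree of the community's nodes
    for u, v in edges:
        degree_sum[memberships[u]] += 1
        degree_sum[memberships[v]] += 1
        if memberships[u] == memberships[v]:
            internal_edges[memberships[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Same shape as the dictionary described above: node index -> community.
memberships = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

score = modularity(edges, memberships)
print(round(score, 4))  # 0.3571
```

Any community detection method that returns a dictionary of this form can be scored by the same function without changes.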

 6) Limitations 

The current design of Karate Club has some limitations, and we make assumptions about the input. We assume that the NetworkX graph is undirected and consists of a single connected component. All algorithms assume that nodes are indexed with consecutive integers and that the starting node index is 0. Furthermore, we assume that the graph is not multipartite, that nodes are homogeneous, and that edges are unweighted (each edge has unit weight).
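These assumptions can be checked before fitting. The helpers below are a hypothetical sketch in plain Python that works on an undirected edge list. They are not part of Karate Club, and with NetworkX the same checks could be done with that library's own utilities.

```python
from collections import deque


def check_graph_input(edges):
    """Return the list of violated input assumptions for an undirected edge list."""
    problems = []
    nodes = {node for edge in edges for node in edge}
    if nodes != set(range(len(nodes))):
        problems.append("node indices are not contiguous integers starting at 0")

    # Breadth-first search from an arbitrary node to test connectivity.
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    if nodes:
        start = next(iter(nodes))
        seen = {start}
        queue = deque([start])
        while queue:
            for neighbor in adjacency[queue.popleft()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        if seen != nodes:
            problems.append("graph has more than one connected component")
    return problems


def relabel(edges):
    """Map sortable node labels to 0..n-1, preserving their sorted order."""
    labels = sorted({node for edge in edges for node in edge})
    mapping = {label: index for index, label in enumerate(labels)}
    return [(mapping[u], mapping[v]) for u, v in edges]


raw_edges = [(10, 20), (20, 30)]
print(check_graph_input(raw_edges))           # reports the indexing problem
print(check_graph_input(relabel(raw_edges)))  # []
```

Relabeling fixes only the indexing assumption. A graph with several components still has to be reduced to a single component before it meets the requirements above.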

For whole-graph embedding algorithms, every graph in the input set must meet the input requirements listed above. The Weisfeiler-Lehman feature-based embedding techniques allow nodes to have a single string feature, which can be accessed with the feature key. Without this key, these algorithms default to using degree centrality as the node feature.

If you have any questions or other feedback, please leave a comment.

Original article:

https://hackernoon.com/karate-club-a-python-library-for-graph-representation-learning-05383yh9


Origin blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/104787345