Graph Data Science with Python/NetworkX

Albanese is a developer and data scientist who previously worked at Facebook on predictions for machine learning models. He is a Python expert and a university lecturer, and his doctoral research concerns machine learning on graphs.

We are inundated with data. Ever-expanding databases and spreadsheets are full of hidden business insights. How do we analyze it all and draw conclusions when there is so much of it? Graphs (networks, not bar charts) provide an elegant approach.

We often use tables to represent information generically. Graphs, however, use a specialized data structure: instead of a table row, a node represents an element, and an edge connects two nodes to represent their relationship.

This graph data structure allows us to observe data from unique perspectives, which is why graph data science is used in fields ranging from molecular biology to social science.

Left image source: TITZ, Björn, et al. "The Binary Protein Interactome of Treponema Pallidum ..." PLoS One, 3, no.5 (2008).

Right image source: ALBANESE, Federico, et al. "Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter." (August 24, 2020): arXiv:2008.10749 [cs.SI]

So, how can developers leverage graph data science? Let's turn to the most commonly used data science programming language: Python.

Getting Started with "Graph Theory" Graphs in Python

Python developers have several graph libraries available, such as NetworkX, igraph, SNAP, and graph-tool. Advantages and disadvantages aside, they all have very similar interfaces for working with graph data structures in Python.

We will use the popular NetworkX library. It's simple to install and use, and it supports the community detection algorithm we'll be using.

Creating a new graph with NetworkX is simple:

import networkx as nx
G = nx.Graph()

However, G isn't much of a graph yet, since it has no nodes or edges.

How to add nodes to a graph

We can add a node to the network by calling .add_node() on the object returned by Graph() (or .add_nodes_from(), for multiple nodes in a list). We can also attach arbitrary features or attributes to nodes by passing a dictionary as an argument, as we show with node 4 and node 5:

G.add_node("node 1")
G.add_nodes_from(["node 2", "node 3"])
G.add_nodes_from([("node 4", {"abc": 123}), ("node 5", {"abc": 0})])
print(G.nodes)
print(G.nodes["node 4"]["abc"]) # accessed like a dictionary

This will output:

['node 1', 'node 2', 'node 3', 'node 4', 'node 5']
123
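To inspect every node together with its attributes in one pass, one option is to call G.nodes with a data argument. The following is a minimal sketch that rebuilds the same five nodes; the variable names nodes_with_data and abc_values are only illustrative.

```python
import networkx as nx

G = nx.Graph()
G.add_node("node 1")
G.add_nodes_from(["node 2", "node 3"])
G.add_nodes_from([("node 4", {"abc": 123}), ("node 5", {"abc": 0})])

# data=True yields (name, attribute_dict) pairs instead of bare names.
nodes_with_data = list(G.nodes(data=True))
for name, attrs in nodes_with_data:
    print(name, attrs)

# data="abc" yields (name, value) pairs; nodes without the attribute
# fall back to the given default.
abc_values = dict(G.nodes(data="abc", default=None))
print(abc_values)
```

Nodes that were added without a dictionary simply carry an empty attribute dictionary, so the default matters whenever attributes are only set on some nodes.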

But without edges between them, the nodes are isolated, and the dataset is no better than a simple table.

How to add edges to a graph

Similar to the technique for nodes, we can use .add_edge(), taking the names of the two nodes as arguments (or .add_edges_from(), for multiple edges in a list), optionally including a dictionary of attributes:

G.add_edge("node 1", "node 2")
G.add_edge("node 1", "node 6")
G.add_edges_from([("node 1", "node 3"), 
                  ("node 3", "node 4")])
G.add_edges_from([("node 1", "node 5", {"weight" : 3}), 
                  ("node 2", "node 4", {"weight" : 5})])

The NetworkX library supports graphs like this one, where each edge can have a weight. For example, in a social network graph where nodes are users and edges are interactions, a weight can indicate how many interactions occurred between a given pair of users, which is a highly relevant metric.
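As a rough illustration of the social network case, the sketch below counts repeated interactions into an edge weight. The interaction log and user names are hypothetical, made up only for this example.

```python
import networkx as nx

# Hypothetical interaction log: each pair is one interaction between two users.
interactions = [("ana", "ben"), ("ben", "ana"), ("ana", "cy"), ("ana", "ben")]

social = nx.Graph()
for user_a, user_b in interactions:
    if social.has_edge(user_a, user_b):
        # The graph is undirected, so (ben, ana) and (ana, ben) are the same edge.
        social[user_a][user_b]["weight"] += 1
    else:
        social.add_edge(user_a, user_b, weight=1)

print(social["ana"]["ben"]["weight"])
print(social["ana"]["cy"]["weight"])
```

Storing the count under the key "weight" is a convention rather than a requirement, but many NetworkX algorithms look for that key by default.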

In NetworkX, G.edges lists all edges, but it does not include their attributes. If we want edge attributes, we can use G[node_name] to get everything connected to a node, or G[node_name][connected_node_name] to get the attributes of a specific edge:

print(G.nodes)
print(G.edges)
print(G["node 1"])
print(G["node 1"]["node 5"])

This will output:

['node 1', 'node 2', 'node 3', 'node 4', 'node 5', 'node 6']
[('node 1', 'node 2'), ('node 1', 'node 6'), ('node 1', 'node 3'), ('node 1', 'node 5'), ('node 2', 'node 4'), ('node 3', 'node 4')]
{'node 2': {}, 'node 6': {}, 'node 3': {}, 'node 5': {'weight': 3}}
{'weight': 3}
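Another way to see edges together with their attributes is G.edges(data=True). This is a small sketch on a three-edge subset of the graph above; weighted_only is just an illustrative name.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("node 1", "node 2")
G.add_edges_from([("node 1", "node 5", {"weight": 3}),
                  ("node 2", "node 4", {"weight": 5})])

# data=True yields (u, v, attribute_dict) triples.
edges_with_data = list(G.edges(data=True))
for u, v, attrs in edges_with_data:
    print(u, v, attrs)

# Keep only the edges that actually carry a weight.
weighted_only = [(u, v) for u, v, attrs in edges_with_data if "weight" in attrs]
print(weighted_only)
```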

However, reading our first graph this way is impractical. Thankfully, there is a better representation.

How to generate images from graphs (and weighted graphs)

Visualizing graphs is essential. It allows us to quickly and clearly see the relationships between nodes and the structure of the network.

A quick call to nx.draw(G) is all it takes.

Let's make the heavier edges correspondingly thicker in our nx.draw() call:

weights = [1 if G[u][v] == {} else G[u][v]['weight'] for u,v in G.edges()]
nx.draw(G, width=weights)

We provide a default thickness for unweighted edges, as shown in the results.
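An alternative way to build the same list of widths, sketched here on a three-edge subset, is to ask G.edges for the weight attribute directly and supply a default. Nothing is drawn in this sketch, so it runs without a plotting backend.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("node 1", "node 2")               # no weight attribute
G.add_edge("node 1", "node 5", weight=3)
G.add_edge("node 2", "node 4", weight=5)

# Each triple is (u, v, weight); edges without a weight get the default of 1.
widths = [w for _, _, w in G.edges(data="weight", default=1)]
print(widths)

# The list follows the order of G.edges, so it could be passed as
# nx.draw(G, width=widths) when matplotlib is installed.
```

Because the default is supplied in one place, this variant also copes with edges that have other attributes but no weight.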

Our methods and graph algorithms are about to become more sophisticated, so the next step is to use a better-known dataset.

Graph data science using data from the movie Star Wars: Episode IV

To make our results easier to interpret and understand, we will use this dataset. Nodes represent important characters, and edges (which are not weighted here) indicate that two characters appear together in a scene.

Note: This dataset is from Gabasova, E. (2016). Star Wars Social Network. DOI: doi.org/10.5281/zen…
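The article does not show how G_starWars is built. As a sketch only, suppose the data arrives as JSON with a "nodes" list (each entry having a "name") and a "links" list whose "source" and "target" are integer positions in that list. That layout is an assumption, and the three-character sample below is a hypothetical stand-in, not the real file.

```python
import json
import networkx as nx

# Hypothetical excerpt in the assumed layout; not the actual dataset.
sample = """
{"nodes": [{"name": "R2-D2"}, {"name": "C-3PO"}, {"name": "LUKE"}],
 "links": [{"source": 0, "target": 1}, {"source": 1, "target": 2}]}
"""

data = json.loads(sample)
names = [node["name"] for node in data["nodes"]]

G_sample = nx.Graph()
G_sample.add_nodes_from(names)
# Translate the integer positions into character names.
G_sample.add_edges_from(
    (names[link["source"]], names[link["target"]]) for link in data["links"]
)
print(G_sample.edges)
```

With a real file, json.load on an open file handle would replace json.loads, and the field names would need to be checked against the actual data.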

First, we'll visualize the data with nx.draw(G_starWars, with_labels = True).

Characters that usually appear together, such as R2-D2 and C-3PO, appear closely connected. In contrast, we can see that Darth Vader doesn't share scenes with Owen.

Visualization Layouts in Python NetworkX

Why is each node located where it is in the previous graph?

This is the result of the default spring_layout algorithm. It simulates the force of a spring, attracting connected nodes and repelling disconnected nodes. This helps to highlight well-connected nodes, which end up in a central location.
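To look at the positions themselves, spring_layout can be called directly; it returns a dictionary mapping each node to its coordinates. This sketch uses a small stand-in graph, and the seed value is arbitrary, chosen only so that repeated runs give the same layout.

```python
import networkx as nx

# Small stand-in graph: node 0 is a hub connected to leaves 1 through 4.
star = nx.star_graph(4)

# seed fixes the random starting positions so the layout is reproducible.
pos = nx.spring_layout(star, seed=42)
for node, (x, y) in pos.items():
    print(node, round(float(x), 2), round(float(y), 2))

# The same dictionary can be handed to nx.draw(star, pos=pos, with_labels=True).
```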

NetworkX also has other layouts that use different criteria to position nodes, such as circular_layout:

pos = nx.circular_layout(G_starWars)
nx.draw(G_starWars, pos=pos, with_labels = True)

This layout is neutral in the sense that the position of a node does not depend on its importance; all nodes are represented equally. (A circular layout can also help visualize separate connected components, meaning subgraphs that have a path between any two of their nodes, but here the entire graph is one large connected component.)

Both layouts we've seen have a certain degree of visual clutter, because edges are free to cross other edges. Kamada-Kawai, another force-directed algorithm like spring_layout, positions nodes so as to minimize the energy of the system.
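Whether a graph is a single connected component can be checked directly. The sketch below uses a small made-up graph with two separate pieces; the node names are arbitrary.

```python
import networkx as nx

parts = nx.Graph()
parts.add_edges_from([("a", "b"), ("b", "c")])   # first component
parts.add_edge("x", "y")                          # second component

# Each component is returned as a set of nodes.
components = list(nx.connected_components(parts))
print(nx.is_connected(parts))
print(components)
```

For a graph like the one described in the text, is_connected would return True and connected_components would yield a single set containing every node.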

This reduces edge crossings, but at a cost. It is slower than other layouts, so it is not strongly recommended for graphs with many nodes.

This one has a dedicated drawing function:

nx.draw_kamada_kawai(G_starWars, with_labels = True)

This yields a different shape.

Without any special intervention, the algorithm places the main characters (such as Luke, Leia, and C-3PO) at the center and the less prominent ones (such as Camie and General Dodonna) on the periphery.

Visualizing graphs with specific layouts can lead us to some interesting qualitative results. Still, quantitative results are an important part of any data science analysis, so we need to define some metrics.

Node Analysis: Degree and PageRank

Now that we can clearly see our network, we may be interested in the characteristics of the nodes. There are various metrics that characterize nodes and, in our case, characters.

A basic measure of a node is its degree: how many edges it has. The degree of a Star Wars character's node measures how many other characters they share a scene with.

The degree() function can calculate the degree of a single character or of the entire network:

print(G_starWars.degree["LUKE"])
print(G_starWars.degree)

The output of these two commands:

15
[('R2-D2', 9), ('CHEWBACCA', 6), ('C-3PO', 10), ('LUKE', 15), ('DARTH VADER', 4), ('CAMIE', 2), ('BIGGS', 8), ('LEIA', 12), ('BERU', 5), ('OWEN', 4), ('OBI-WAN', 7), ('MOTTI', 3), ('TARKIN', 3), ('HAN', 6), ('DODONNA', 3), ('GOLD LEADER', 5), ('WEDGE', 5), ('RED LEADER', 7), ('RED TEN', 2)]

Sorting the nodes from highest to lowest degree takes a single line of code:

print(sorted(G_starWars.degree, key=lambda x: x[1], reverse=True))

The output:

[('LUKE', 15), ('LEIA', 12), ('C-3PO', 10), ('R2-D2', 9), ('BIGGS', 8), ('OBI-WAN', 7), ('RED LEADER', 7), ('CHEWBACCA', 6), ('HAN', 6), ('BERU', 5), ('GOLD LEADER', 5), ('WEDGE', 5), ('DARTH VADER', 4), ('OWEN', 4), ('MOTTI', 3), ('TARKIN', 3), ('DODONNA', 3), ('CAMIE', 2), ('RED TEN', 2)]
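Since degree is just a count of neighbors, it can be reproduced by hand. This sketch uses a small made-up graph (not the Star Wars data) to show that counting each node's neighbors matches what the degree view reports.

```python
import networkx as nx

# Hypothetical toy graph; the node names are arbitrary.
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])

# toy[node] is the mapping of neighbors, so its length is the degree.
manual_degree = {node: len(toy[node]) for node in toy}

# Same sorting idea as above: order (name, degree) pairs by the second item.
ranking = sorted(manual_degree.items(), key=lambda pair: pair[1], reverse=True)
print(ranking)
```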

Since it is only a total number, the degree does not take into account the details of individual edges. Is a given edge connected to an otherwise isolated node or to a node connected to the entire network? Google's PageRank algorithm aggregates this information to measure a node's "importance" in the network.

The PageRank metric can be interpreted as an agent moving randomly from one node to another. Well-connected nodes have more paths through them, so agents will tend to visit them more often.

Such nodes will have a higher PageRank, which we can calculate using the NetworkX library:

pageranks = nx.pagerank(G_starWars) # A dictionary mapping each character to its score
print(pageranks["LUKE"])
# Iterating over a dictionary yields its keys, so the sort key must look up each score
print(sorted(pageranks, key=pageranks.get, reverse=True))

This prints Luke's PageRank, followed by the character names ordered from highest to lowest PageRank. Luke's score is:

0.12100659993223405

PageRank scores sum to 1, so with 19 characters a uniform score would be about 0.053; Luke's is more than twice that. Note that a key of lambda x: x[1], which works for the (name, degree) pairs above, would not work here: applied to the dictionary's keys, it would order the names by their second character instead of by score.

Degree and PageRank need not agree. A character such as Owen, who does not share scenes with many other characters, can still rank well because the characters he does share scenes with include important ones, such as Luke himself, R2-D2, and C-3PO. Conversely, a character whose many connections are mostly with unimportant characters can rank lower than their degree alone would suggest.

The takeaway: using multiple metrics can give deeper insight into different characteristics of a graph's nodes.
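To make the random-walk interpretation of PageRank more concrete, here is a simplified power-iteration sketch. It is not NetworkX's implementation: the function name pagerank_sketch, the damping value of 0.85, and the fixed iteration count are assumptions for illustration, and it only handles undirected graphs in which every node has at least one edge.

```python
import networkx as nx

def pagerank_sketch(graph, damping=0.85, iterations=100):
    """Simplified power iteration; assumes every node has at least one edge."""
    n = graph.number_of_nodes()
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            # Each neighbor passes on its rank, split evenly among its edges.
            incoming = sum(rank[nbr] / graph.degree[nbr] for nbr in graph[node])
            # With probability (1 - damping) the agent jumps to a random node.
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical toy graph: "hub" touches three nodes, "d" hangs off "c".
toy = nx.Graph()
toy.add_edges_from([("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")])

scores = pagerank_sketch(toy)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy graph the best-connected node ends up with the highest score, which matches the intuition that a randomly moving agent visits it most often.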

Community Detection Algorithm

When analyzing a network, it may be important to separate communities: groups of nodes that are highly connected to each other but minimally connected to nodes outside their community.

There are various algorithms for this. Most of them fall under unsupervised machine learning, because they assign labels to nodes without requiring the nodes to have been labeled beforehand.

One of the most famous is label propagation. In this algorithm, each node starts with a unique label, in a community of one. The labels of the nodes are then updated iteratively according to the majority of the labels of the neighboring nodes.

Labels are diffused in the network until all nodes share a label with a majority of their neighbors. Groups of nodes that are closely connected to each other end up with the same label.
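The update rule described above can be sketched in a few lines. This is a simplified illustration, not the NetworkX implementation: the function name propagate_labels is made up, and ties are broken by keeping the current label when possible and otherwise taking the largest candidate, an arbitrary choice made here only to keep the result deterministic.

```python
from collections import Counter
from itertools import combinations
import networkx as nx

def propagate_labels(graph, max_rounds=100):
    """Simplified label propagation with a deterministic tie-break."""
    labels = {node: node for node in graph}   # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in graph:
            counts = Counter(labels[nbr] for nbr in graph[node])
            if not counts:
                continue                       # isolated node keeps its label
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[node] in candidates:
                continue                       # already agrees with a majority
            labels[node] = max(candidates)     # arbitrary deterministic tie-break
            changed = True
        if not changed:
            break
    return labels

# Hypothetical test graph: two 4-node cliques joined by a single edge.
G = nx.Graph()
G.add_edges_from(combinations(range(4), 2))
G.add_edges_from(combinations(range(4, 8), 2))
G.add_edge(3, 4)

labels = propagate_labels(G)
groups = {}
for node, label in labels.items():
    groups.setdefault(label, set()).add(node)
print(list(groups.values()))
```

On this graph the two cliques end up as two separate groups. Real implementations usually break ties differently, so results on less clear-cut graphs can vary.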

Using the NetworkX library, only three lines of Python are required to run this algorithm:

from networkx.algorithms.community.label_propagation import label_propagation_communities

communities = label_propagation_communities(G_starWars)
print([community for community in communities])

The output:

[{'R2-D2', 'CAMIE', 'RED TEN', 'RED LEADER', 'OBI-WAN', 'DODONNA', 'LEIA', 'WEDGE', 'HAN', 'OWEN', 'CHEWBACCA', 'GOLD LEADER', 'LUKE', 'BIGGS', 'C-3PO', 'BERU'}, {'DARTH VADER', 'TARKIN', 'MOTTI'}]

In this list of sets, each set represents a community. Readers familiar with the film will notice that the algorithm managed to separate the "good guys" from the "bad guys," meaningfully distinguishing the characters without using any true (community) labels or metadata.
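For later steps such as coloring nodes, it is often handier to have a mapping from each node to the index of its community. This is a sketch on a made-up graph of two cliques joined by one edge; the name membership is illustrative.

```python
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

# Hypothetical test graph: two 4-node cliques joined by a single edge.
G = nx.Graph()
G.add_edges_from(combinations(range(4), 2))
G.add_edges_from(combinations(range(4, 8), 2))
G.add_edge(3, 4)

communities = label_propagation_communities(G)

# Turn the collection of sets into a node -> community index lookup.
membership = {
    node: index
    for index, members in enumerate(communities)
    for node in members
}
print(membership)

# A list such as [membership[n] for n in G] could then be passed as
# node_color to nx.draw when matplotlib is installed.
```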

Intelligent Insights Using Graph Data Science in Python

We've seen that getting started with graph data science tools is more straightforward than it might sound. Once we represent data as a graph using the NetworkX library in Python, a few short lines of code can be illuminating. We can visualize our dataset, measure and compare the characteristics of nodes, and cluster nodes sensibly via community detection algorithms.

Having the skills to extract conclusions and insights from a network using Python enables developers to integrate with the tools and methodologies commonly found in data science service pipelines. These methods apply easily to a wide range of contexts, from search engines to flight scheduling to electrical engineering.

Recommended Reading for Graph Data Science

Community Detection Algorithms
Yang, Zhao, René Algesheimer, and Claudio Tessone. "Comparative Analysis of Community Detection Algorithms on Artificial Networks." Scientific Reports, 6, no. 30750 (2016).

Graph Deep Learning
Kipf, Thomas. "Graph Convolutional Networks." September 30, 2016.

Applications of Graph Data Science
Albanese, Federico, Leandro Lombardi, Esteban Feuerstein, and Pablo Balenzuela. "Predicting Shifting Individuals Using Text Mining and Graph Machine Learning on Twitter." (August 24, 2020): arXiv:2008.10749 [cs.SI].

Cohen, Elior. "PyData Tel Aviv Meetup. Node2vec." YouTube. November 22, 2018. Video, 21:09. www.youtube.com/watch?v=828…

Understanding the basics

Can Python be used for data visualization?

Yes, it can. Python has multiple libraries for visualizing data, including graph data; NetworkX is one example.

How to graph data in Python?

Python graph data visualization libraries like NetworkX, igraph, SNAP, and graph-tool already have this capability built in. The NetworkX library is very useful for the visualization of nodes and edges of a network.

Is Graph a data type in Python?

The Python NetworkX library provides several graph types. Depending on the properties of the graph, the possible types are Graph, DiGraph, MultiGraph, and MultiDiGraph.
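As a brief illustration of how these types differ, the sketch below adds the same pair of nodes to three of them. The node names are arbitrary.

```python
import networkx as nx

# DiGraph: edges have a direction.
directed = nx.DiGraph()
directed.add_edge("a", "b")
print(directed.has_edge("a", "b"), directed.has_edge("b", "a"))

# MultiGraph: parallel edges between the same pair of nodes are kept.
multi = nx.MultiGraph()
multi.add_edge("a", "b")
multi.add_edge("a", "b")
print(multi.number_of_edges("a", "b"))

# Graph: adding the same edge twice stores it only once.
simple = nx.Graph()
simple.add_edge("a", "b")
simple.add_edge("a", "b")
print(simple.number_of_edges())
```

MultiDiGraph combines the first two behaviors: directed edges, with parallel edges allowed.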

Is graph theory used in data science?

Yes, the NetworkX library enables Python data scientists to easily utilize different graph theory based algorithms such as PageRank and Label Propagation.

What is the use of NetworkX in Python?

NetworkX is a library for representing graphs in Python. Developers can use it to create, manipulate, and visualize graphs, as well as for non-visual graph data science analysis.

When should I use NetworkX?

The easy-to-use NetworkX library should be used for graph analysis; for example, when community detection algorithms or other special features are required. But its functionality is otherwise comparable to other libraries such as igraph, SNAP, and graph-tool.

Is NetworkX fast?

For many applications, NetworkX is fast enough, but for large-scale graph datasets other Python libraries may be faster, depending on the algorithm. The benefits of using NetworkX are its ease of use and extensive developer community.

What is a Community Detection Algorithm?

Community detection algorithms aim to cluster network nodes based on their connectivity. Label propagation is a widely used method and has an implementation in the Python NetworkX library.
