Getting to know NetworkX - a preliminary exploration of the Panama Papers

This article uses NetworkX and complex network analysis (CNA) to make a preliminary exploration of the information hidden in the Panama Papers. The process is roughly as follows: 1. build a network model of the Panama Papers with NetworkX; 2. assess the main network and its subnetworks with common network measures; 3. visualize the network information.

1. The origin of the story


  J recently came across a book called Complex Network Analysis in Python. Originally J just wanted to learn how to draw network diagrams, but ended up hooked, as if discovering a new world. The book lays out a general approach to complex network analysis: it presents the tools, teaches the methods, and is written in a fun, readable style. J had never touched complex networks before and does not enjoy reading English books, but this is probably the most suitable resource a beginner like J could hope to find.

  When J reached the seventh chapter and saw the Panama Papers mentioned, J's eyes lit up, and J also roughly understood why the book has not been translated. Without getting into affairs of state (J is law-abiding and does not climb over the wall), after spending an evening going through the data, J stumbled upon a good article, Exploring the Panama Papers Network, written in June 2016. J decided to follow the ideas in that article and attempt a preliminary analysis of the Panama Papers data. (Hopefully this article does not get the water meter checked.)

2. Project results


  Frankly, choosing the Panama Papers as the first opponent for complex network analysis, I must have been crazy... With J's weak knowledge of finance and accounting, just understanding the basic structure of the Panama Papers data was already a huge challenge. The data disclosed by the ICIJ contains four sets of node data (entities, addresses, officers and intermediaries) plus one data set of edges describing the relationships between them. The four node data sets are interpreted as follows:

  • "Entity (offshore)": a company, trust or fund created in a low-tax, offshore jurisdiction by an agent;
  • "Officer": a person or company who plays a role in an offshore entity;
  • "Intermediary": a go-between for someone seeking an offshore corporation and an offshore service provider - usually a law firm or a middleman that asks an offshore service provider to create an offshore firm for a client;
  • "Address": the contact postal address as it appears in the original databases obtained by the ICIJ.

2.1 Creating a network graph model

import pandas as pd
import matplotlib.colors as colors
import matplotlib.cm as cmx
import matplotlib.patches as mpatches
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout

  Read the data and build the network graph model:

def normalise(s, strip_punctuation=False):
    if pd.isnull(s):
        return ""
    s = s.strip().lower()
    PUNCTUATION = """.,"'()[]{}:;/!£$%^&*-="""
    if strip_punctuation:
        for c in PUNCTUATION:
            s = s.replace(c, "")
    return s
adds = pd.read_csv("Addresses.csv", low_memory=False)
ents = pd.read_csv("Entities.csv", low_memory=False)
ents["name"] = ents["name"].apply(normalise)
inter = pd.read_csv("Intermediaries.csv", low_memory=False)
inter["name"] = inter.name.apply(normalise)
offi = pd.read_csv("Officers.csv", low_memory=False)
offi["name"] = offi.name.apply(normalise)
edges = pd.read_csv("all_edges.csv", low_memory=False)
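
  Before building the graph it is worth a quick look at what has just been loaded. A small sketch (not in the original post; the column names rel_type, node_1 and node_2 are the ones used by the graph-building code below):

# Hedged sketch: sizes of the five data sets and the most common relationship types
print(len(adds), len(ents), len(inter), len(offi), len(edges))
print(edges["rel_type"].value_counts().head(10))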

  We use pandas to read the data sets and NetworkX to build a directed graph; then we remove the redundant nodes connected by "same/similar name or address" edges to obtain the graph model we need.

G = nx.DiGraph()
for n,row in adds.iterrows():
    G.add_node(row.node_id, node_type="address", details=row.to_dict())
    
for n,row in ents.iterrows():
    G.add_node(row.node_id, node_type="entities", details=row.to_dict())
    
for n,row in inter.iterrows():
    G.add_node(row.node_id, node_type="intermediates", details=row.to_dict())
    
for n,row in offi.iterrows():
    G.add_node(row.node_id, node_type="officers", details=row.to_dict())
for n,row in edges.iterrows():
    G.add_edge(row.node_1, row.node_2, rel_type=row.rel_type, details={})

  When J added all five data sets to the graph model G, the computer's memory usage jumped from 20% to 92%, and J started to suspect the computer wanted to retire. 4 GB of memory really is just decoration; J bought a budget Shenzhou laptop because of being poor, after all. After adding the nodes and edges to the model, we merge and remove the redundant, near-duplicate nodes.

SAME_NAME_REL_TYPES = [
    'similar name and address as',
    'same name and registration date as',
    'same address as',
]

def merge_similar_names(g):
    # Walk over the edges and merge node pairs linked by a
    # "same/similar name or address" relationship.
    edges = list(g.edges(data=True))
    removed = set()

    while edges:
        current_edge = edges.pop()

        if current_edge[2]["rel_type"] not in SAME_NAME_REL_TYPES:
            continue

        if current_edge[0] in removed or current_edge[1] in removed:
            continue

        new_edges = merge_edge(g, current_edge)

        edges += new_edges
        removed.add(current_edge[0])

def merge_edge(g, target_edge):
    # Remove the first node of the edge and re-attach its other edges
    # to the node it is being merged into.
    n_remove, n_replace = target_edge[0:2]

    edges_to_replace = list(g.edges(nbunch=n_remove, data=True))

    new_edges = []

    for e in edges_to_replace:
        if (e[0], e[1]) == (n_remove, n_replace):
            continue
        if e[0] == n_remove:
            new_edges.append((n_replace, e[1], e[2]))
        else:
            new_edges.append((e[0], n_replace, e[2]))

    g.remove_node(n_remove)

    for e in new_edges:
        g.add_edge(e[0], e[1], **e[2])

    return new_edges
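
  The two functions above only define the merge; the post does not show the call itself, but presumably it is applied to the full graph along these lines (a hedged sketch):

# Hedged sketch: run the merge over the whole graph and see how many nodes it removes
n_before = G.number_of_nodes()
merge_similar_names(G)
print("merged away", n_before - G.number_of_nodes(), "nodes")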

  Since the whole network is too large, we split the graph G into its connected components and store them in a list of subgraphs.

subgraphs = [g for g in nx.connected_component_subgraphs(G.to_undirected())]
subgraphs = sorted(subgraphs, key=lambda x: x.number_of_nodes(), reverse=True)
print([s.number_of_nodes() for s in subgraphs[:10]])

  Sorting the subgraphs by number of nodes in descending order, we find that about 90% of the nodes belong to one giant network, while the remaining subgraphs are much smaller and simpler. (I must be getting full of myself.)

  Let's plot a randomly chosen subgraph and do a simple first analysis of the Panama Papers information.

def get_node_label(n):
    if n["node_type"] == "address":
        if pd.isnull(n["details"]["address"]):
            return ""
        return n["details"]["address"].replace(";", "\n")
    return n["details"]["name"]
def build_patches(n2i, sm):
    patches = []

    for k,v in n2i.items():
        patches.append(mpatches.Patch(color=sm.to_rgba(v), label=k))

    return patches
# the four node types assigned when building G above
node_types = ["address", "entities", "intermediates", "officers"]

def plot_graph(g, label_nodes=True, label_edges=False, figsize=(15,15)):
    """Draw the subgraph g, colouring nodes by their node_type."""
    node_to_int = {k: node_types.index(k) for k in node_types}
    node_colours = [node_to_int[n[1]["node_type"]] for n in g.nodes(data=True)]
    node_labels = {k:get_node_label(v) for k,v in g.nodes(data=True)}

    cmap = plt.cm.rainbow
    cNorm  = colors.Normalize(vmin=0, vmax=len(node_to_int)+1)
    scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=cmap)

    plt.figure(figsize=figsize)
    plt.legend(handles=build_patches(node_to_int, scalarMap))

    pos = nx.spring_layout(g, iterations=100)

    # nodes
    nx.draw_networkx_nodes(g, pos, node_color=node_colours,
                           cmap=cmap, vmin=0, vmax=len(node_to_int)+1)

    # edges
    nx.draw_networkx_edges(g, pos, edgelist=g.edges(), arrows=True)

    # labels
    if label_nodes:
        nx.draw_networkx_labels(g, pos, labels=node_labels,
                            font_size=12, font_family='sans-serif')
    if label_edges:
        edge_labels = {(e[0], e[1]): e[2]["rel_type"] for e in g.edges(data=True)}
        nx.draw_networkx_edge_labels(g, pos, edge_labels=edge_labels)
plot_graph(subgraphs[176])

   The intermediary in this subgraph is "controller limited" (this is academic analysis only, not aimed at any person or organization). The entities in the network form the more tangled part, while the officers behind them are clearer: mostly "wan pak kuen" from Hong Kong and "gerhart" from Germany. However, most of the officers in the network are simply "the bearer"; following Iain's guess in the original article, "the bearer" here likely refers to the infamous bearer shares. "wan pak kuen" appears 10 times, sometimes with a different title and sometimes with different capitalization; without knowing the ICIJ's data-processing methods and the local context, it is hard to draw firm conclusions.
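
  To double-check how often such a name appears in the officers data set, one can count matches on the normalised name column. A hedged sketch (the search string is simply the name read off the plot):

# Hedged sketch: count officers whose normalised name contains the name seen in the plot
mask = offi["name"].str.contains("wan pak kuen", na=False)
print(mask.sum())
print(offi.loc[mask, "name"].value_counts())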

2.2 Analyzing the main network graph

  Next we analyze the main graph, and we immediately face a problem: the main network is far too big. Visualizing subgraph No. 176 was already a stretch; trying to draw a network of roughly 900,000 nodes nearly made J smash the computer. So we need to introduce some measures to assess the network instead of drawing it.

g = subgraphs[0]
nodes = g.nodes()
g_degree = g.degree()
types = [g.node[n]["node_type"] for n in nodes]
degrees = [g_degree[n] for n in nodes]
names = [get_node_label(g.node[n]) for n in nodes]
node_degree = pd.DataFrame(data={"node_type":types, "degree":degrees, "name": names}, index=nodes)

  From the subgraph above we might guess that the intermediaries sit at the centre of the network. We will verify this with a few measures over the four node types, starting with the concept of "degree": the degree of a node is the number of edges connected to it.

node_degree.groupby("node_type").agg(["count", "mean", "median"])

  For all four node types the median degree is small: for half the nodes it is only 1 to 3. The mean degree of intermediaries, however, is very large, which shows that their degree distribution is highly heterogeneous and long-tailed: a small number of nodes have a very large number of connections. Let's look at the top 15 nodes by degree.

node_degree.sort_values("degree", ascending=False)[:15]

  Slightly different from what we imagined, the node with the most edges is an address, with the intermediary "portcullis trustnet" coming next (there is a Guardian article about them with more background). This address appears to be the registered address of more than 37,000 business entities.
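
  As a cross-check, we can look at what kinds of relationships attach to that highest-degree address node. A hedged sketch (not in the original post; it assumes the relevant edges point into the address node):

# Hedged sketch: relationship types of the edges arriving at the top-degree address node
top_addr = node_degree[node_degree.node_type == "address"]["degree"].idxmax()
rels = pd.Series([d["rel_type"] for _, _, d in G.in_edges(top_addr, data=True)])
print(rels.value_counts().head())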

  Given that an intermediary appears to be the middleman who helps create entities, it is easy to imagine each intermediary being linked to many corporate entities, but it is less obvious how intermediaries are linked to each other. Let's look at the shortest path between "portcullis trustnet (bvi) limited" and "unitrust corporate services ltd.".

def plot_path(g, path):
    plot_graph(g.subgraph(path), label_edges=True)

path = nx.shortest_path(g, source=54662, target=298333)
plot_path(G, path)

  It seems that these two intermediaries are linked through officers they share: companies acting as "first director" for entities created by both.

plot_graph(G.subgraph(nx.ego_graph(g, 24663, radius=1).nodes()), label_edges=True)

3. Analyzing the network with common measures


3.1 Degree distribution

  We explained the concept of degree above; now we can analyze the degree distribution of the whole main network.

max_bin = max(degrees)
n_bins = 20
log_bins = [10 ** ((i/n_bins) * np.log10(max_bin)) for i in range(0,n_bins)]
fig, ax = plt.subplots()
node_degree.degree.hist(bins=log_bins, log=True)
ax.set_xscale('log')
ax.set_xlim(1, max(log_bins))
plt.xlabel("Degree")
plt.ylabel("Number of Nodes")
plt.title("Distribution of Degree");

  The distribution is roughly in line with the discussion of degree above.

3.2 Node importance

  A common way to measure the importance of a node is PageRank. PageRank is one of the measures Google uses to determine the importance of a web page. In essence, if we traverse the graph at random, occasionally jumping to a random node, then the time spent at each node is proportional to its PageRank.
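
  To make the intuition concrete, here is a tiny illustrative sketch (not from the original post): on a small directed graph, the node that random walks pass through most often receives the highest PageRank.

# Toy illustration: the middle node of a short two-way chain collects the most
# random-walk traffic and therefore the highest PageRank.
toy = nx.DiGraph([(1, 2), (2, 1), (2, 3), (3, 2)])
print(nx.pagerank(toy))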

pr = nx.pagerank_scipy(g)
node_degree["page_rank"] = node_degree.index.map(lambda x: pr[x])
#node_degree.sort_values("page_rank", ascending=False)[0:15]
node_degree[node_degree.node_type == "entities"].sort_values("page_rank", ascending=False)[0:15]

t = nx.ego_graph(g, 10165699, radius=1)
plot_graph(t, label_edges=True)

  Looking at the nodes with relatively high PageRank, we mainly end up with entities that have a large number of shareholders and that use one of the top intermediaries.

3.3 Clustering coefficient

  Another way to measure the "shape" of the graph is the clustering coefficient. It can be viewed as a measure of local structure: what fraction of a node's neighbours are also neighbours of each other.
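
  A tiny illustrative sketch of the definition (not from the original post): in a triangle every node has clustering coefficient 1, and attaching a pendant node lowers the coefficient of the node it hangs off.

# Toy illustration: a triangle (1,2,3) plus a pendant node 4 attached to node 3
toy = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
print(nx.clustering(toy))
# nodes 1 and 2 -> 1.0; node 3 -> 1/3 (only one of its three neighbour pairs is linked); node 4 -> 0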

cl = nx.clustering(g)
fig, ax = plt.subplots()
node_degree["clustering_coefficient"] = node_degree.index.map(lambda x: cl[x])
node_degree.clustering_coefficient.hist()
ax.set_xlim(0,1)
plt.xlabel("Clustering coefficient")
plt.ylabel("Number of Nodes")
plt.title("Distribution of clustering coefficient");

  It turns out that the network does not have much significant local structure: most nodes have a clustering coefficient of zero. A few nodes have non-zero values, but they tend to have a relatively small degree. This suggests the Panama Papers network is not a small-world network.

To see what the nodes with non-zero clustering coefficient look like, we can plot the following sample subgraph:

t = nx.ego_graph(g, 122762, radius=1)
plot_graph(G.subgraph(t), label_edges=True)

  We find that the nodes with non-zero clustering coefficient tend to belong to structures that are not large but are highly interrelated, for example nodes that share the same address.

3.4 The subgraph with the largest median node degree

  To explore some more interesting structure, we can look at the subgraph whose nodes have the largest median degree.

avg_deg = pd.Series(data=[np.median([d[1] for d in sg.degree()]) for sg in subgraphs],
                    index=range(0,len(subgraphs)))
tt = subgraphs[372].nodes()
plot_graph(G.subgraph(tt))
#avg_deg.sort_values(ascending=False)[0:10]

  This kind of network looks more like a mid-sized cluster: a large number of companies and enterprises that share the same owners.

4. Reflections on this project

  • Well, what did these two days of my time accomplish? It was my youth, after all, and I still had fun. When doing research the choice of subject is very important: the Panama Papers require a lot of financial and accounting knowledge, so as a first network-analysis project they are not a very appropriate choice.
  • NetworkX's advantage is its detailed documentation and the fact that it pairs naturally with matplotlib, but its core is not written in Java or C, so pay attention to memory usage.

J's words to himself

This month, one after another, four friends reached out to me. For personal reasons J did not make it to Dali, so, quietly, here is an apology. Over National Day I had a good three or four days of rest at my parents' place, eating whatever was best every day, and promptly put on five pounds. (⊙v⊙) Apart from nearly being dragged into another blind date, it was perfect.

I was tired and restless a while back. Coming from a family that has been small traders for a generation, J still has a lot to learn about head-down hard work and careful planning. (⊙v⊙) The new environment is a lot better than the previous one, and thanks to the uncle I heard complaining on the phone, a lot of rent was saved; it seems relatives are worth keeping in touch with after all. Here's hoping Mr. Chen keeps up the effort, so that J may yet get to be a rich second-generation kid in this lifetime.

Mom said I should buy something nourishing. As if. J is a natural beauty; who needs that sort of thing? Better to just lose some weight.

Last month I bought the Pokémon Adventures manga and have not read a single volume. Even though it now goes by a different official name, and even though I already knew from other sources that Yellow defeated the dragon master, it still brought back memories of sixteen years ago: twenty of us playing hide and seek, ten of us piling leaves on the grill, the girl who cleaned up the ink I spilled. That was a summer full of every kind of possibility.

The kite string broke, so it was used to tie up her hair.
