Data Analysis: Academic Frontier Trend Analysis, Part Five

Introduction

This post models the relationships between the authors of papers: it builds an author relationship graph, counts the most frequently occurring author relationships, and mines the graph for connections between authors.

Data processing steps

Process the author list of each paper and compute statistics on it. The specific steps are:

  • Build a graph that links the first author of each paper to every other author of that paper;
  • Use graph algorithms to compute the relationship between any two authors in the graph.
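
The counting of the most frequent author relationships can be sketched as below. This is a minimal illustration and not the code used later in this post: the `papers` list is hypothetical toy data shaped like the `authors_parsed` field, and `pair_counts` is a name introduced only for this example.

```python
from collections import Counter

# Hypothetical toy data shaped like the `authors_parsed` field:
# each paper is a list of name entries whose last element is dropped when joining.
papers = [
    [["Smith", "A.", ""], ["Lee", "B.", ""], ["Kim", "C.", ""]],
    [["Smith", "A.", ""], ["Lee", "B.", ""]],
    [["Garcia", "D.", ""], ["Kim", "C.", ""]],
]

pair_counts = Counter()
for authors_parsed in papers:
    # Same name-joining rule as the graph-building code in this post
    names = [" ".join(entry[:-1]) for entry in authors_parsed]
    first_author, co_authors = names[0], names[1:]
    for co_author in co_authors:
        pair_counts[(first_author, co_author)] += 1

# The (first author, co-author) pair that appears in the most papers
most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```

The same counts could also be stored as edge weights on the graph, but a plain `Counter` is enough to rank the pairs.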

Social network analysis

The graph is an important concept in complex network research. A graph is a mathematical model that uses points (nodes) and lines (edges) to describe how pairs of items in a discrete set are connected. Graphs appear everywhere in the real world, for example in transportation maps, travel maps, and flowcharts, and they can describe many everyday things. For instance, if nodes represent intersections and edges represent the roads between them, a transportation network is easy to depict.

Graph types

  • Undirected graph: the direction of the edge between two nodes is ignored.
  • Directed graph: the direction of each edge is taken into account.
  • Undirected multigraph: two nodes may be joined by more than one edge, and a vertex may be connected to itself by a self-loop.
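
As a small illustration of the three graph types, the sketch below uses the networkx classes `Graph`, `DiGraph`, and `MultiGraph`. The node names are made up for the example.

```python
import networkx as nx

# Undirected: the edge (a, b) is the same edge as (b, a).
undirected = nx.Graph()
undirected.add_edge("a", "b")
undirected.add_edge("b", "a")  # already present, so nothing new is added

# Directed: (a, b) and (b, a) are two distinct edges.
directed = nx.DiGraph()
directed.add_edge("a", "b")
directed.add_edge("b", "a")

# Multigraph: parallel edges and self-loops are all kept.
multi = nx.MultiGraph()
multi.add_edge("a", "b")
multi.add_edge("a", "b")  # second parallel edge between a and b
multi.add_edge("a", "a")  # self-loop on a

print(undirected.number_of_edges())  # 1
print(directed.number_of_edges())    # 2
print(multi.number_of_edges())       # 3
```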

Graph statistical indicators

  • Degree: the number of edges incident to a node. In a directed graph, the in-degree of a node is the number of edges entering it, and the out-degree is the number of edges leaving it;
  • Shortest path: the path with the smallest total length (or weight) from a source node to another node. Dijkstra's algorithm finds it when edge weights are non-negative;
  • Connected graph: in an undirected graph G, vertices i and j are said to be connected if there is a path from i to j. If G is a directed graph, every edge on the path must point in the direction of travel. An undirected graph in which every pair of vertices is connected is called a connected graph; a directed graph in which every pair of vertices is connected in both directions is called a strongly connected graph.
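
The indicators above can be tried out on a toy graph. The sketch below uses made-up node names and weights; it only illustrates the definitions and is not part of the arXiv analysis.

```python
import networkx as nx

# Toy weighted undirected graph with two separate components
G_toy = nx.Graph()
G_toy.add_weighted_edges_from([
    ("A", "B", 1),
    ("B", "C", 2),
    ("A", "C", 5),
    ("D", "E", 1),
])

# Degree: number of edges incident to a node
print(G_toy.degree("A"))  # 2

# Shortest path by total weight: A-B-C (weight 3) beats the direct edge A-C (weight 5)
print(nx.dijkstra_path(G_toy, "A", "C"))         # ['A', 'B', 'C']
print(nx.dijkstra_path_length(G_toy, "A", "C"))  # 3

# Connectivity: D and E cannot be reached from A, B, or C
print(nx.is_connected(G_toy))                 # False
print(nx.number_connected_components(G_toy))  # 2

# Directed case: in-degree, out-degree, and strong connectivity
D_toy = nx.DiGraph([("x", "y"), ("y", "z")])
print(D_toy.in_degree("y"), D_toy.out_degree("y"))  # 1 1
strongly_before = nx.is_strongly_connected(D_toy)   # False: no way back from z to x
D_toy.add_edge("z", "x")                            # closing the cycle
strongly_after = nx.is_strongly_connected(D_toy)    # True
print(strongly_before, strongly_after)
```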

Other graph algorithms are available in the networkx and igraph libraries.

Specific code and explanation

First, import the packages:

import pandas as pd  # data processing and analysis
import json  # the data is stored in JSON format

Next, read the data:

data = []  # one record per paper
# A with statement closes the file handle automatically, even if reading raises an exception
with open("arxiv-metadata-oai-snapshot.json", 'r') as f:
    for line in f:
        d = json.loads(line)  # each line is one JSON record
        d = {'authors_parsed': d['authors_parsed']}  # keep only the parsed author list
        data.append(d)

data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas

Create an undirected graph with author links:

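import networkx as nx
# Create an undirected graph
G = nx.Graph()

# Build the graph from the first five papers only
for row in data.iloc[:5].itertuples():
    authors = row[1]  # the authors_parsed column
    authors = [' '.join(x[:-1]) for x in authors]  # join the name parts, dropping the last field

    # Link the first author to every other author
    for author in authors[1:]:
        G.add_edge(authors[0], author)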
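
To make the name handling and the resulting graph shape concrete, here is a minimal sketch on hypothetical records. The names are invented, and the three-field layout of each entry is an assumption about `authors_parsed`; the loop itself mirrors the code above.

```python
import networkx as nx

# Hypothetical records; each author entry is assumed to be a list of name
# fields, and the last field (here a suffix) is dropped when joining.
sample_papers = [
    [["Doe", "J.", "Jr."], ["Roe", "R.", ""], ["Poe", "E. A.", ""]],
    [["Solo", "S.", ""]],  # single-author paper
]

G_sample = nx.Graph()
for authors in sample_papers:
    names = [" ".join(x[:-1]) for x in authors]
    for name in names[1:]:
        G_sample.add_edge(names[0], name)

print(sorted(G_sample.nodes()))
print(G_sample.has_edge("Roe R.", "Poe E. A."))
```

Two consequences of this construction are worth noting. Each paper becomes a star centred on its first author, so two co-authors who are not first authors are not linked directly. A single-author paper adds no edge, so its author never enters the graph.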

Draw the author relationship diagram:

nx.draw(G, with_labels=True)


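Find the shortest path between two authors:

try:
    print(nx.dijkstra_path(G, 'Balázs C.', 'Ziambaras Eleni'))
except (nx.NetworkXNoPath, nx.NodeNotFound):
    # Raised when the two authors are not connected or one of them is not in the graph
    print('No path')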
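
If only the distance (the number of hops) is needed rather than the path itself, `nx.shortest_path_length` can be used. The sketch below is an illustration on a made-up graph; `author_distance` is a hypothetical helper, not something defined elsewhere in this post.

```python
import networkx as nx

def author_distance(graph, source, target):
    """Return the number of edges on a shortest path, or None if there is none."""
    try:
        return nx.shortest_path_length(graph, source, target)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return None

# Toy graph: first authors A and D share the co-author C; E and F are separate
G_demo = nx.Graph()
G_demo.add_edges_from([("A", "B"), ("A", "C"), ("D", "C"), ("E", "F")])

print(author_distance(G_demo, "B", "D"))  # 3, via B-A-C-D
print(author_distance(G_demo, "B", "F"))  # None, different components
print(author_distance(G_demo, "Z", "A"))  # None, Z is not in the graph
```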

If we build the graph from 500 papers instead of five, we get a more complete picture of the author relationships. The code below draws the largest connected subgraph in an inset, and the line plot shows the node degrees sorted by rank.

# Count the connected components in the author graph
print(nx.number_connected_components(G))

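import matplotlib.pyplot as plt

# Node degrees sorted from highest to lowest
# (not defined in the original snippet; assumed here to be taken from G)
degree_sequence = sorted((d for n, d in G.degree()), reverse=True)

plt.loglog(degree_sequence, "b-", marker="o")
plt.title("Degree rank plot")
plt.ylabel("degree")
plt.xlabel("rank")

# Draw the largest connected component in an inset
plt.axes([0.45, 0.45, 0.45, 0.45])
Gcc = G.subgraph(sorted(nx.connected_components(G), key=len, reverse=True)[0])

pos = nx.spring_layout(Gcc)
plt.axis("off")
nx.draw_networkx_nodes(Gcc, pos, node_size=20)
nx.draw_networkx_edges(Gcc, pos, alpha=0.4)
plt.show()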
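
The two quantities behind this figure, the degree sequence and the largest connected component, can be checked without plotting. The sketch below runs on a made-up graph; `max(..., key=len)` is used here as an alternative to sorting all components, which is a choice made for this example.

```python
import networkx as nx

# Toy graph: a star around A plus a separate pair E-F
G_small = nx.Graph()
G_small.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("E", "F")])

# Degrees sorted from highest to lowest, as used for a degree rank plot
degree_sequence_small = sorted((d for _, d in G_small.degree()), reverse=True)
print(degree_sequence_small)  # [3, 1, 1, 1, 1, 1]

# Largest connected component as a subgraph
largest_nodes = max(nx.connected_components(G_small), key=len)
Gcc_small = G_small.subgraph(largest_nodes)
print(sorted(Gcc_small.nodes()))     # ['A', 'B', 'C', 'D']
print(Gcc_small.number_of_edges())   # 3
```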


Origin blog.csdn.net/weixin_45696161/article/details/113107209