Data analysis task 3: Paper code statistics

The GitHub code statistics task involves the following knowledge points:

1. Data processing steps

In the original arXiv dataset, authors often give their code links in the comments or abstract field of a paper, so we need to search those fields for the links.

  1. Determine where the data appears;
  2. Use regular expressions to complete the matching;
  3. Compute the relevant statistics.
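
A minimal sketch of these three steps on a single sample comment string (the string and repository URL are made up for illustration):

import re

# Step 1: the link appears in the 'comments' (or 'abstract') field
sample_comment = "27 pages, 5 figures, code at https://github.com/example/repo"

# Step 2: a regular expression finds the GitHub link
links = re.findall(r'[a-zA-Z]+://github[^\s]*', sample_comment)

# Step 3: count the matches
print(links)       # ['https://github.com/example/repo']
print(len(links))  # 1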

2. Regular expressions

Ordinary characters: uppercase and lowercase letters, all digits, most punctuation marks, and some other symbols
Special characters: characters with special meaning in a pattern, such as quantifiers and character classes
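
A short demonstration of both kinds of characters in one pattern (the same pattern is used below to extract page counts):

import re

# In '[1-9][0-9]* pages':
#   [1-9] and [0-9] are character classes (special characters)
#   *  is a quantifier: zero or more of the preceding class (special)
#   ' pages' is a run of ordinary characters matched literally
print(re.findall('[1-9][0-9]* pages', 'This paper has 27 pages'))
# ['27 pages']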

3. Code implementation

import json  # read the data
import re    # regular expressions
import pandas as pd

Read data

data = []
with open("E:/datawhale数据分析/arxiv-metadata-oai-2019.json", 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        # keep only the fields we need
        d = {'abstract': d['abstract'],
             'categories': d['categories'],
             'comments': d['comments']}
        data.append(d)

data = pd.DataFrame(data)

Extract the number of pages

data['pages'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* pages', str(x)))

data = data[data['pages'].apply(len) > 0]  # keep only papers whose comments give a page count

data['pages'][1][0]
# '27 pages'

# strip the 'pages' suffix so only the number remains
data['pages'] = data['pages'].apply(lambda x: float(x[0].replace('pages', '')))
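
An equivalent sketch uses a capture group, so re.findall returns just the digits and the replace step becomes unnecessary:

# '([1-9][0-9]*) pages' captures only the digits before ' pages'
data['pages'] = data['comments'].apply(
    lambda x: re.findall(r'([1-9][0-9]*) pages', str(x)))
data = data[data['pages'].apply(len) > 0]
data['pages'] = data['pages'].apply(lambda x: float(x[0]))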

Statistics on pages

data['pages'].describe().astype(int)

Count paper pages by category. Since a paper may list several categories, take the first one as its main category:

import matplotlib.pyplot as plt

# select the main category (the first one listed)
data['categories'] = data['categories'].apply(lambda x: x.split(' ')[0])

# average number of pages per category
plt.figure(figsize=(12, 6))
data.groupby(['categories'])['pages'].mean().plot(kind='bar')

(Figure: bar chart of the average page count per category)

Extract the number of figures in each paper and store it as data['figures']:

data = data.copy()
# without .copy(), assigning a new column to the filtered frame above
# triggers pandas' SettingWithCopyWarning
data['figures'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* figures', str(x)))
data = data[data['figures'].apply(len) > 0]
data['figures'] = data['figures'].apply(lambda x: float(x[0].replace('figures', '')))

Extract the papers' code links. To simplify the task, only GitHub links are extracted:

data = data.copy()
# keep only papers that mention github in the comments or abstract
data_with_code = data[
    (data.comments.str.contains('github') == True) |
    (data.abstract.str.contains('github') == True)
].copy()
data_with_code['text'] = data_with_code['abstract'].fillna('') + data_with_code['comments'].fillna('')

# match github links with a regular expression
pattern = r'[a-zA-Z]+://github[^\s]*'
# one or more letters (the URL scheme), then '://github',
# then any run of non-whitespace characters
data_with_code['code_flag'] = data_with_code['text'].str.findall(pattern).apply(len)
# findall's return type depends on the number of groups () in the pattern,
# see https://www.cnblogs.com/springionic/p/11327187.html
data_with_code['code_flag']
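
A quick illustration of how the number of groups changes what re.findall returns (the sample string is made up):

import re

text = 'code at https://github.com/example/repo'

# no group: findall returns the whole match
print(re.findall(r'[a-zA-Z]+://github[^\s]*', text))
# ['https://github.com/example/repo']

# one group: findall returns only the captured part
print(re.findall(r'([a-zA-Z]+)://github[^\s]*', text))
# ['https']

# two or more groups: findall returns tuples of the captured parts
print(re.findall(r'([a-zA-Z]+)://(github[^\s]*)', text))
# [('https', 'github.com/example/repo')]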

Plot the number of papers with GitHub links by category:

data_with_code = data_with_code[data_with_code['code_flag'] == 1]  # papers with exactly one link
plt.figure(figsize=(12, 6))
data_with_code.groupby(['categories'])['code_flag'].count().plot(kind='bar')

(Figure: bar chart of the number of papers with GitHub links per category)
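
Note: in a plain Python script (as opposed to a notebook), the charts only appear after calling:

plt.show()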
