Data analysis task 3: Paper code statistics
The GitHub data set involves the following knowledge points:
- Regular expressions
- Combining apply and a lambda on a DataFrame column, e.g. data['categories'] = data['categories'].apply(lambda x: x.split(' ')[0])
- The return type of the findall function varies with the number of groups () in its regular expression
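That last point about findall can be checked directly; a quick sketch using a made-up comment string:

```python
import re

s = "12 pages, 5 figures"

# No groups: findall returns the full matches.
print(re.findall(r'[0-9]+ pages', s))              # ['12 pages']

# One group: findall returns only the group's contents.
print(re.findall(r'([0-9]+) pages', s))            # ['12']

# Two or more groups: findall returns tuples of the groups.
print(re.findall(r'([0-9]+) (pages|figures)', s))  # [('12', 'pages'), ('5', 'figures')]
```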
1. Data processing steps
In the original arXiv data set, authors often give code links in the comments or abstract fields of a paper, so we need to find the code links in those fields.
- Determine which fields the links appear in;
- Use regular expressions to match them;
- Compute the relevant statistics.
2. Regular expressions
Ordinary characters: uppercase and lowercase letters, digits, and punctuation with no special meaning; they match themselves literally
Special characters: characters with special meaning, such as quantifiers (*, +, ?), character classes ([]), and the escape character \
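A minimal sketch of the distinction (the example strings are made up):

```python
import re

# Ordinary characters match themselves literally.
assert re.findall('cat', 'cat catalog') == ['cat', 'cat']

# Special characters act as qualifiers: [1-9] is a character class and
# * means "zero or more repetitions" of the preceding class.
assert re.findall('[1-9][0-9]*', 'page 27 of 100') == ['27', '100']

# To match a special character literally, escape it with a backslash.
assert re.findall(r'\.', 'a.b.c') == ['.', '.']
```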
3. Code implementation
import json  # read the data
import re  # regular expressions
import pandas as pd
Read the data
data = []
with open("E:/datawhale数据分析/arxiv-metadata-oai-2019.json", 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {'abstract': d['abstract'],
             'categories': d['categories'],
             'comments': d['comments']}
        data.append(d)
data = pd.DataFrame(data)
Extract the number of pages
data['pages'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* pages', str(x)))
data = data[data['pages'].apply(len) > 0]  # keep only papers whose page count is present
data['pages'][1][0]
# '27 pages'
# strip the word 'pages', keeping only the number
data['pages'] = data['pages'].apply(lambda x: float(x[0].replace('pages', '')))
Statistics on pages
data['pages'].describe().astype(int)
Count paper pages by category, taking the first listed category as the paper's main category:
import matplotlib.pyplot as plt
# select the main (first) category
data['categories'] = data['categories'].apply(lambda x: x.split(' ')[0])
# average number of pages per category
plt.figure(figsize=(12, 6))
data.groupby(['categories'])['pages'].mean().plot(kind='bar')
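The groupby/mean aggregation that feeds the bar plot can be verified on a toy frame (hypothetical values, not from the arXiv data):

```python
import pandas as pd

toy = pd.DataFrame({
    'categories': ['cs.CV', 'cs.CV', 'math.CO'],
    'pages': [10.0, 20.0, 30.0],
})

# Average pages per category -- the same aggregation plotted above.
means = toy.groupby('categories')['pages'].mean()
print(means['cs.CV'])    # 15.0
print(means['math.CO'])  # 30.0
```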
Extract the number of figures in each paper and store it in data['figures']:
data = data.copy()
# without .copy(), the column assignments below trigger a SettingWithCopyWarning,
# because data is a filtered slice of the original frame
data['figures'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* figures', str(x)))
data = data[data['figures'].apply(len) > 0]
data['figures'] = data['figures'].apply(lambda x: float(x[0].replace('figures', '')))
Extract each paper's code link; to simplify the task, we only extract GitHub links
data = data.copy()
# keep papers that mention github
data_with_code = data[
    (data.comments.str.contains('github') == True) |
    (data.abstract.str.contains('github') == True)
]
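The explicit == True comparison matters because comments can be missing: str.contains then returns NaN, which is neither True nor False. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(['see https://github.com/x', None, 'no link'])

mask = s.str.contains('github')
# The missing entry yields NaN rather than False:
print(mask.isna().tolist())     # [False, True, False]

# Comparing with True coerces NaN to False, so such rows are simply dropped.
print((mask == True).tolist())  # [True, False, False]
```

The same effect can be written more idiomatically as s.str.contains('github', na=False).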
data_with_code['text'] = data_with_code['abstract'].fillna('') + data_with_code['comments'].fillna('')
# match code links with a regular expression
pattern = r'[a-zA-Z]+://github[^\s]*'
# one or more letters, then '://github', then zero or more non-whitespace characters
data_with_code['code_flag'] = data_with_code['text'].str.findall(pattern).apply(len)
# the return type of findall depends on the number of groups () in the pattern
# https://www.cnblogs.com/springionic/p/11327187.html
data_with_code['code_flag']
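A quick check of the pattern on a made-up comment string (note the character class should be [a-zA-Z] with a capital Z; [a-zA-z] would also match the ASCII characters between Z and a, such as [ and _):

```python
import re

pattern = r'[a-zA-Z]+://github[^\s]*'
text = "17 pages; code at https://github.com/example/repo (hypothetical)."
print(re.findall(pattern, text))  # ['https://github.com/example/repo']
```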
Plot the number of papers with GitHub links, by category
data_with_code = data_with_code[data_with_code['code_flag'] == 1]
plt.figure(figsize=(12, 6))
data_with_code.groupby(['categories'])['code_flag'].count().plot(kind='bar')