Team study notes Task1: Paper statistics

Data analysis team study notes, Part 1 - Lizzy

@Datawhale

Task1: Paper statistics

  • Study topic: paper count statistics (a data statistics task): count the number of computer science papers published over the whole of 2019;
  • Learning content: understanding the competition question, reading data with Pandas, basic data statistics;
  • Learning outcomes: Pandas fundamentals;

1.1 Read the paper json file

Since the paper data is stored in JSON format, the json module needs to be imported to read it and parse each record into Python's built-in data structures.

Each line that is read in is parsed with json.loads() and appended to a list object.

Since the full dataset is too large, I only read the first 100,000 papers here and run the statistics on them. The specific code is as follows:

# First import the required packages
import json          # for parsing the JSON lines
import pandas as pd  # for data processing

# Read the first 100,000 papers with basic file operations, then use the
# DataFrame head() method to display the first five rows
data = []
num = 0
with open("D:\\arxiv-metadata-oai-snapshot.json", 'r') as f:
    for line in f:
        data.append(json.loads(line))
        num += 1
        if num > 99999:
            break
data = pd.DataFrame(data)
data.shape
data.head()


It can be seen that there are 14 columns in total, corresponding to 14 features for each paper.

1.2 Data preprocessing

To analyze the characteristics of the categories, the describe() method is used here; it is a convenience method that generates several summary statistics at once.

It produces different output for numeric and non-numeric data. The categories column here is not numeric, so we analyze it according to the statistics describe() reports for the category codes.

data['categories'].describe()


count is the number of values, unique is the number of distinct values, top is the most frequent category, and freq is how often that category appears. The analysis shows that the first 100,000 records contain 6,068 distinct categories, and the most frequent one is astro-ph (astrophysics). Since we only need the data for 2019, the data has to be preprocessed first. The code is as follows:

# First convert the update_date column to datetime, pandas' built-in date type,
# then extract the year into a new 'year' column
data['year'] = pd.to_datetime(data['update_date']).dt.year
del data["update_date"]
data = data[data['year'] >= 2019]  # drop the data from before 2019
data.reset_index(drop=True, inplace=True)
data

Here I had previously only used reindex, not reset_index, so I looked up the difference between them. reindex rearranges rows or columns according to a new set of labels, while reset_index rebuilds the index itself. After a delete operation the index may no longer be contiguous; when reset_index is applied, the original index becomes a new column. If you do not want to keep it, pass drop=True (the default is False). Another parameter is inplace, which decides whether to modify the original DataFrame or create a new object; it is False by default, meaning a new object is created.
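A minimal sketch of the difference, using a made-up toy DataFrame:

import pandas as pd

# toy DataFrame, for illustration only
df = pd.DataFrame({'val': [10, 20, 30, 40]})
df = df[df['val'] >= 25]       # only the rows with index 2 and 3 remain, so the index is no longer contiguous

df.reset_index()               # the old index (2, 3) is kept as a new 'index' column
df.reset_index(drop=True)      # the old index is discarded and a fresh 0, 1 index is built
df.reindex([3, 2, 5])          # rearranges rows by label; label 5 does not exist, so that row is filled with NaN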

Through this filtering, the data from 2019 onward is obtained.


Then, in order to later filter out the papers in the computer field, only keep the three columns id, categories, and year:

data = data[['id', 'categories', 'year']]


1.3 Crawling the paper categories

So far I have only used BeautifulSoup for crawling, and I am not yet very familiar with regular expressions (there is a short sketch of the re.sub pattern after the code block below).

import requests                   # for fetching the web page
from bs4 import BeautifulSoup     # for parsing the HTML
import re                         # for the regular expressions used below

website_url = requests.get('https://arxiv.org/category_taxonomy').text  # get the text of the web page
soup = BeautifulSoup(website_url, 'lxml')  # parse the page; the lxml parser is used for speed
root = soup.find('div', {'id': 'category_taxonomy_list'})  # find the tag that contains the taxonomy
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # read the tags

# initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# iterate over the tags and fill the lists
for t in tags:
    if t.name == "h2":
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)", r"\2", raw)  # regex: pattern (.*)\((.*)\); replacement "\2"; input: raw
        level_2_name = re.sub(r"(.*)\((.*)\)", r"\1", raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)", r"\1", raw)
        level_3_name = re.sub(r"(.*) \((.*)\)", r"\2", raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# build a DataFrame from the collected information
df_taxonomy = pd.DataFrame({
    'group_name': level_1_names,
    'archive_name': level_2_names,
    'archive_id': level_2_codes,
    'category_name': level_3_names,
    'categories': level_3_codes,
    'category_description': level_3_notes
})

# group by "group_name", sorting by "archive_name" within each group, to get a hierarchical view
df_taxonomy.groupby(["group_name", "archive_name"])
df_taxonomy
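
As a side note on the regular expressions used above, here is a minimal sketch of how the re.sub pattern splits an h3-style string; the string is a made-up example:

import re

raw = "Astrophysics(astro-ph)"              # toy string, for illustration only
code = re.sub(r"(.*)\((.*)\)", r"\2", raw)  # keep group 2: the text inside the parentheses -> 'astro-ph'
name = re.sub(r"(.*)\((.*)\)", r"\1", raw)  # keep group 1: the text before the parentheses -> 'Astrophysics'
print(code, "|", name)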

groupby function: groups the rows by one or more keys; it accepts a variety of parameters and is roughly comparable to the filtering and subtotal operations in Excel.
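
A minimal toy sketch of groupby (the data is made up for illustration):

import pandas as pd

# toy data, for illustration only
df = pd.DataFrame({
    "group_name": ["Physics", "Physics", "Computer Science"],
    "id": ["p1", "p2", "c1"],
})

# group the rows by group_name and count the ids in each group
df.groupby("group_name")["id"].count()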

Then, merge the category DataFrame and the paper-data DataFrame on their common column "categories", use "group_name" as the category for the statistics, put the counts into the "id" column and sort them.

Here is a key supplement on the usage of the merge method, because I find it very powerful:

merge function: used for merging and joining operations, for example pd.merge(df1, df2); it is best to explicitly specify the join key, e.g. on='key'.

In addition, there is a parameter how, which takes four values: 'inner', 'outer', 'left' and 'right', representing different join bases: intersection, union, the left table and the right table.
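
A minimal sketch of the how parameter on two made-up toy tables:

import pandas as pd

# toy tables sharing the key column 'key', for illustration only
df1 = pd.DataFrame({"key": ["a", "b", "c"], "val1": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "val2": [20, 30, 40]})

pd.merge(df1, df2, on="key", how="inner")  # intersection: only keys b and c
pd.merge(df1, df2, on="key", how="outer")  # union: keys a, b, c, d (missing values become NaN)
pd.merge(df1, df2, on="key", how="left")   # all keys from the left table: a, b, c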

# merge the paper data with the taxonomy on "categories", drop duplicate (id, group_name) pairs,
# count the ids per group_name, and sort the counts in descending order
_df = (data.merge(df_taxonomy, on="categories", how="left")
           .drop_duplicates(["id", "group_name"])
           .groupby("group_name")
           .agg({"id": "count"})
           .sort_values(by="id", ascending=False)
           .reset_index())
# drop_duplicates -> remove duplicate rows
# agg -> aggregate the column with the given function
# sort_values -> sort the result


Compared with the many categories shown in the sample code, I think fewer appear here because papers of the same category are largely clustered together in the file, so only a limited number of categories show up in the first 100,000 records.

1.4 Data analysis and visualization

The matplotlib module is mainly used here for basic visualization operations. The code is roughly as follows:

import matplotlib.pyplot as plt  # import the plotting module
fig = plt.figure(figsize=(15, 12))  # create a figure window of size (15, 12)
plt.pie(_df['id'], labels=_df['group_name'], autopct='%1.1f', explode=(0, 0, 0.2))  # create a pie chart
plt.tight_layout()  # automatically adjust the layout so the chart fills the figure
plt.show()


Here we need to create a pie chart. The parameters of the pie() function are explained in detail as follows:

x             : the size of each wedge; if sum(x) > 1 the values are normalized by sum(x);
labels        : the text shown outside each wedge;
explode       : how far each wedge is offset from the center;
startangle    : the starting angle; by default drawing starts counter-clockwise from the positive x-axis, and startangle=90 starts from the positive y-axis;
shadow        : draw a shadow beneath the pie; default False, i.e. no shadow;
labeldistance : radial position of the labels as a fraction of the radius; default 1.1; values < 1 draw the labels inside the pie;
autopct       : controls the percentage text inside the wedges; accepts a format string or a format function;
                '%1.1f' specifies the number of digits before and after the decimal point (without space padding);
pctdistance   : like labeldistance, but for the position of the autopct text; default 0.6;
radius        : the radius of the pie; default 1;
counterclock  : the direction of rotation; boolean, optional, default True (counter-clockwise); set it to False for clockwise;
wedgeprops    : dict, optional, default None; passed to the wedge objects, e.g. wedgeprops={'linewidth':3} sets the wedge line width to 3;
textprops     : formatting of the labels and percentage text; dict, optional, default None; passed to the text objects;
center        : list of floats, optional, default (0,0); the position of the chart center;
frame         : boolean, optional, default False; if True, draw an axes frame around the chart;
rotatelabels  : boolean, optional, default False; if True, rotate each label to the angle of its wedge.
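
A minimal sketch showing a few of these parameters together, on made-up toy numbers:

import matplotlib.pyplot as plt

# toy values, for illustration only
sizes = [40, 35, 25]
names = ["A", "B", "C"]

plt.pie(sizes, labels=names, autopct='%1.1f', startangle=90,
        explode=(0, 0, 0.1), shadow=True, labeldistance=1.1,
        wedgeprops={'linewidth': 1})
plt.show()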

OK, on to the last step. I changed my goal to counting the number of physics papers from 2019 onward, again using the merge function. However, since I do not know much about the field of physics, I did not take the data analysis any further.

The final code is as follows:

group_name = "Physics"
# keep only the papers whose group_name is Physics
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
# count the papers per (year, category_name), then pivot so each category is a row and each year a column
cats.groupby(["year", "category_name"]).count().reset_index().pivot(index="category_name", columns="year", values="id")
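
pivot was not covered earlier; a minimal sketch of what it does, on made-up toy data:

import pandas as pd

# toy long-format table: one row per (year, category) count, for illustration only
long_df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "category_name": ["High Energy Physics", "Optics", "High Energy Physics", "Optics"],
    "id": [120, 30, 150, 45],
})

# pivot reshapes the long table into a wide one:
# category_name becomes the row index, year becomes the columns, and id fills the cells
long_df.pivot(index="category_name", columns="year", values="id")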


Most of them seem to be in high energy physics, which I don't know much about, haha.


Source: blog.csdn.net/weixin_45717055/article/details/112592874