Paper author statistics (pandas string manipulation)

The knowledge points involved are as follows:
1. Understanding the result of data['categories'].apply(lambda x: 'cs.CV' in x)
2. Merging the elements of a nested list with the sum function
3. Using the value_counts function on a DataFrame and a Series

The code is as follows:

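# Import the required packages
import json  # read the data, which is in JSON format
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting
# Select the papers in the cs.CV category
# (data is assumed to be the DataFrame of paper records loaded in an earlier step)
data2 = data[data['categories'].apply(lambda x: 'cs.CV' in x)]

# Concatenate all authors into one list
all_authors = sum(data2['authors_parsed'], [])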

Knowledge point 1: understanding the result of data['categories'].apply(lambda x: 'cs.CV' in x)
lambda defines an anonymous function whose body is 'cs.CV' in x. apply evaluates it on each element of data['categories'] and returns True or False for each row.
Indexing data with that result keeps only the rows marked True, which is how the category is selected.
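
The filtering step can be tried on a small table. The sketch below uses a made-up three-row DataFrame (the ids and category lists are hypothetical, not taken from the real dataset) and assumes each entry of categories is a list. If the entries were plain strings, 'cs.CV' in x would be a substring test instead of a list membership test.

```python
import pandas as pd

# Toy stand-in for the real dataset; these rows are made up for illustration.
data = pd.DataFrame({
    "id": ["a", "b", "c"],
    "categories": [["cs.CV", "cs.LG"], ["math.CO"], ["cs.CV"]],
})

# apply calls the lambda once per row and returns a boolean Series.
mask = data["categories"].apply(lambda x: "cs.CV" in x)
print(mask.tolist())  # [True, False, True]

# Indexing with the boolean Series keeps only the rows marked True.
data2 = data[mask]
print(data2["id"].tolist())  # ['a', 'c']
```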

Knowledge point 2: merging the elements of a nested list with the sum function
sum(data2['authors_parsed'], [])
Here data2['authors_parsed'] is a nested structure: each outermost element is a list whose elements are the authors of one paper.
The sum function takes two parameters, sum(iterable, start): iterable is an iterable object and start is the initial value. sum adds each element of iterable to start in turn, so the result is start plus all the elements of iterable. For lists, + means concatenation, so passing an empty list as start merges the inner lists into one.

Example:
sum([[1,2],[3,4]], [5])
out:[5]+[1,2]+[3,4]
[5,1,2,3,4]

sum([[1,2],[3,4]], [[5]])
out:[[5]]+[1,2]+[3,4]
[[5],1,2,3,4]
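
As a quick check of this behaviour, here is a minimal sketch on a made-up nested list shaped like authors_parsed (the names are hypothetical). It also shows itertools.chain.from_iterable as an alternative, which is not what the original code uses. sum builds a new list at every addition, so chain tends to be faster when there are many papers.

```python
from itertools import chain

# Made-up nested list in the same shape as authors_parsed:
# one inner list per paper, one [last, first, suffix] entry per author.
nested = [
    [["Smith", "John", ""], ["Lee", "Ann", ""]],
    [["Smith", "John", ""]],
]

# sum starts from [] and concatenates each paper's author list onto it.
flat_sum = sum(nested, [])

# chain.from_iterable walks the inner lists without building intermediate lists.
flat_chain = list(chain.from_iterable(nested))

print(flat_sum == flat_chain)  # True
print(len(flat_chain))         # 3
```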

First, count how often each full name appears.

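# Join each author's name parts into a single string
authors_names = [' '.join(x) for x in all_authors]
authors_names = pd.DataFrame(authors_names)

# Plot a horizontal bar chart of the 10 most frequent authors
plt.figure(figsize=(10, 6))
authors_names[0].value_counts().head(10).plot(kind='barh')

# Adjust the plot settings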
names = authors_names[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')

Knowledge point 3: using the value_counts function on a DataFrame and a Series
value_counts counts the number of occurrences of each distinct value (here, each string) in a DataFrame or Series. By default the counts are sorted in descending order.

The full signature and parameters are
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
normalize returns relative frequencies instead of counts, sort sorts the result by count (descending by default), ascending switches to ascending order, bins groups numeric data into intervals before counting, and dropna excludes NA values from the counts.
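
To see what these parameters do, here is a small sketch on a made-up Series of names (hypothetical values, including one missing entry). It only illustrates the parameters described above and is not part of the original analysis.

```python
import pandas as pd

# Made-up names for illustration; None stands for a missing value.
names = pd.Series(["Li", "Wang", "Li", None, "Zhang", "Li", "Wang"])

# Default: raw counts, sorted in descending order, missing values skipped.
counts = names.value_counts()
print(counts.to_dict())  # {'Li': 3, 'Wang': 2, 'Zhang': 1}

# normalize=True: share of the non-missing values instead of counts.
shares = names.value_counts(normalize=True)
print(shares["Li"])  # 0.5

# ascending=True: least frequent value first.
rarest_first = names.value_counts(ascending=True)
print(rarest_first.index[0])  # Zhang

# dropna=False: the missing value gets its own row in the result.
with_na = names.value_counts(dropna=False)
print(int(with_na.sum()))  # 7
```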

Next, count the last names.

authors_lastnames = [x[0] for x in all_authors]
authors_lastnames = pd.DataFrame(authors_lastnames)

plt.figure(figsize=(10, 6))
authors_lastnames[0].value_counts().head(10).plot(kind='barh')

names = authors_lastnames[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')


Origin blog.csdn.net/qq_43720646/article/details/112727574