Team study notes Task3: Paper code statistics

@Datawhale

“Stay hungry Stay young”

3.1 Task description

  • Task theme: paper code statistics, i.e. statistics on the code-related information in all papers;
  • Task content: use regular expressions to count code links, page counts, and figure/table data;
  • Task result: learn how to gather statistics with regular expressions.

Regular expression learning

Ordinary characters: uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols

Character  Description
[ABC]      Matches any single character listed in [...]. For example, [aeiou] matches every e, o, u, and a in the string "google runoob taobao".
[^ABC]     Matches any single character not listed in [...]. For example, [^aeiou] matches every character other than e, o, u, and a in the string "google runoob taobao".
[A-Z]      [A-Z] denotes a range and matches any uppercase letter; [a-z] matches any lowercase letter.
.          Matches any single character except a line break (\n, \r); equivalent to [^\n\r]. In Python's re module, only \n is excluded by default.
[\s\S]     Matches any character. \s matches any whitespace character, including line breaks; \S matches any non-whitespace character, which excludes line breaks.
\w         Matches letters, digits, and the underscore. Equivalent to [A-Za-z0-9_].
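To make the table concrete, here is a small sketch that runs these character classes through Python's re module on the same "google runoob taobao" string. The other sample strings are made up for illustration. Note that in Python 3, \w on a str pattern also matches non-ASCII letters unless the re.ASCII flag is passed.

```python
import re

text = "google runoob taobao"

# [aeiou] matches each vowel individually.
vowels = re.findall(r"[aeiou]", text)

# [^aeiou] matches every character that is not a vowel (spaces included).
non_vowels = re.findall(r"[^aeiou]", text)

# \w matches letters, digits and the underscore; \s matches whitespace.
word_chars = re.findall(r"\w", "a_1 b")
spaces = re.findall(r"\s", "a_1 b")

# "." stops at a newline, while [\s\S] matches the newline as well.
dot_match = re.findall(r".+", "ab\ncd")
any_match = re.findall(r"[\s\S]+", "ab\ncd")

print(vowels)       # ['o', 'o', 'e', 'u', 'o', 'o', 'a', 'o', 'a', 'o']
print(word_chars)   # ['a', '_', '1', 'b']
print(dot_match)    # ['ab', 'cd']
print(any_match)    # ['ab\ncd']
```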

Special characters: characters with special meaning

Special character  Description
( )  Mark the start and end of a subexpression. The subexpression is captured for later use. To match these characters literally, use \( and \).
*    Matches the preceding subexpression zero or more times. To match a literal *, use \*.
+    Matches the preceding subexpression one or more times. To match a literal +, use \+.
.    Matches any single character except the newline character \n. To match a literal ., use \.
[    Marks the start of a bracket expression. To match a literal [, use \[.
?    Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match a literal ?, use \?.
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", '\n' matches a newline, '\\' matches "\", and '\(' matches "(".
^    Matches the start of the input string, unless it is used inside a bracket expression, where it negates the character set. To match a literal ^, use \^.
{    Marks the start of a quantifier expression. To match a literal {, use \{.
|    Specifies a choice between two alternatives. To match a literal |, use \|.
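The escaping rules are easy to get wrong, so here is a minimal sketch of escaped versus unescaped metacharacters in Python. The sample string is invented for the example. re.escape is the standard-library helper that escapes a literal string for you.

```python
import re

text = "cost (USD): 3.50 + tax? [approx]"

# An unescaped "." matches any character, so r"3.50" also matches "3x50".
loose = re.search(r"3.50", "3x50") is not None
# An escaped "\." matches only a literal dot.
strict = re.search(r"3\.50", "3x50") is not None

# Metacharacters must be escaped to be matched literally.
parens = re.findall(r"\(\w+\)", text)
brackets = re.findall(r"\[\w+\]", text)
plus_or_question = re.findall(r"\+|\?", text)

# re.escape() builds an escaped pattern from an arbitrary literal string.
escaped = re.escape("a+b")
literal_hit = re.fullmatch(escaped, "a+b") is not None

print(loose, strict)       # True False
print(parens, brackets)    # ['(USD)'] ['[approx]']
print(plus_or_question)    # ['+', '?']
print(literal_hit)         # True
```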

Quantifiers

Quantifier  Description
*      Matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". * is equivalent to {0,}.
+      Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?      Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" in "do", "does" in "does", and "do" in "doxy". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but it matches the two o's in "food".
{n,}   n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but it matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  Both m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no spaces between the comma and the two numbers.
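The examples in the quantifier table can be checked directly. The sketch below reuses the strings from the table and adds one non-greedy case (the trailing ?) to show how it changes the result.

```python
import re

# * : "z" followed by zero or more "o"
star = [bool(re.fullmatch(r"zo*", s)) for s in ("z", "zo", "zoo")]
# + : "z" followed by one or more "o"
plus = [bool(re.fullmatch(r"zo+", s)) for s in ("z", "zo", "zoo")]
# ? : the group "es" is optional
optional = [re.match(r"do(es)?", s).group(0) for s in ("do", "does", "doxy")]

# {n} : exactly n repetitions
no_double_o = re.search(r"o{2}", "Bob") is None
double_o = re.search(r"o{2}", "food").group(0)
# {n,} : at least n repetitions (greedy, so it takes all of them)
at_least = re.search(r"o{2,}", "foooood").group(0)
# {n,m} : between n and m repetitions (greedy, so it takes m if available)
bounded = re.search(r"o{1,3}", "fooooood").group(0)
# Adding "?" after a quantifier makes it non-greedy: take as few as possible.
lazy = re.search(r"o{1,3}?", "fooooood").group(0)

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # ['do', 'does', 'do']
print(no_double_o, double_o, at_least, bounded, lazy)  # True oo ooooo ooo o
```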

Code implementation of specific tasks

Here we need to read the comments field and extract the page and figure counts from it. Reading the field only requires passing different parameters to the function used in the previous task.

data = readFile(path, ['abstract', 'categories', 'comments'], 10000)  # read only 10,000 records; my machine can't handle more
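readFile comes from an earlier note and is not shown here. For readers without that note, the following is only a guess at its shape: it assumes the dataset is a file with one JSON object per line, and it returns a plain list of dicts rather than a DataFrame. The name read_file_sketch and the demo records are hypothetical.

```python
import json
import os
import tempfile


def read_file_sketch(path, columns, count=None):
    """Hypothetical stand-in for readFile: keep `columns` from the first `count` JSON lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f):
            if count is not None and idx >= count:
                break  # stop early so a huge file is never fully loaded
            item = json.loads(line)
            records.append({col: item.get(col) for col in columns})
    return records


# Tiny demo file with made-up records (assumed layout, not the real dataset).
demo_lines = [
    {"id": "1", "abstract": "a1", "categories": "cs.CV", "comments": "10 pages"},
    {"id": "2", "abstract": "a2", "categories": "math.CO", "comments": None},
    {"id": "3", "abstract": "a3", "categories": "hep-th", "comments": "5 figures"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for rec in demo_lines:
        tmp.write(json.dumps(rec) + "\n")

demo = read_file_sketch(tmp.name, ["abstract", "categories", "comments"], 2)
os.remove(tmp.name)
print(demo)
```

Wrapping the returned list in pandas.DataFrame(...) would give a table that the later snippets can work on.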

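Next, use a regular expression to extract the page count.

The findall function of the re module in detail:
Function definition: findall(pattern, string[, flags])
Function description: finds all occurrences of the regular expression pattern in the string and returns them as a list of matches.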
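A short sketch of how findall behaves may help here. The comment strings are made up. The point worth noticing is that when the pattern contains one capturing group, findall returns the group contents instead of the whole match.

```python
import re

comment = "19 pages, 5 figures; extended version has 23 pages"

# No capturing group: each list item is the whole matched text.
whole = re.findall(r"[1-9][0-9]* pages", comment)

# One capturing group: each list item is just the captured part.
numbers = re.findall(r"([1-9][0-9]*) pages", comment)

# The optional flags argument changes matching behavior, e.g. ignoring case.
case_insensitive = re.findall(r"[1-9][0-9]* pages", "12 Pages", flags=re.IGNORECASE)

# No match gives an empty list, never None.
nothing = re.findall(r"[1-9][0-9]* pages", "no page count given")

print(whole)             # ['19 pages', '23 pages']
print(numbers)           # ['19', '23']
print(case_insensitive)  # ['12 Pages']
print(nothing)           # []
```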

data['pages'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* pages', str(x)))
data = data[data['pages'].apply(len) > 0]

# Each match is a list such as ['19 pages'], so it needs to be converted to a number
data['pages'] = data['pages'].apply(lambda x: float(x[0].replace(' pages', '')))

# Use describe() to summarize the numeric data
data['pages'].describe().astype(int)
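The extract, filter, convert, and summarize steps can also be traced without pandas. The sketch below mirrors them on a handful of invented comment strings with the standard library, reporting only a few of the statistics that describe() would show.

```python
import re
import statistics

# Made-up sample of the comments field; None stands for a missing comment.
comments = ["19 pages, 5 figures", None, "Accepted at a workshop", "8 pages", "32 pages, 10 figures"]

PAGES = re.compile(r"[1-9][0-9]* pages")

pages = []
for c in comments:
    found = PAGES.findall(str(c))  # str() turns None into "None", which never matches
    if found:                      # same effect as keeping rows where len(...) > 0
        pages.append(float(found[0].replace(" pages", "")))

summary = {
    "count": len(pages),
    "mean": statistics.mean(pages),
    "min": min(pages),
    "median": statistics.median(pages),
    "max": max(pages),
}
print(pages)    # [19.0, 8.0, 32.0]
print(summary)
```

Note that .astype(int) in the pandas version truncates the mean and standard deviation to whole numbers, which is fine for a quick look but hides the fractional part.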

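Next, analyze the page counts by category, using the main category of each paper's first listed category.

# Select the main category
data['categories'] = data['categories'].apply(lambda x: x.split(' ')[0])
data['categories'] = data['categories'].apply(lambda x: x.split('.')[0])

# Average page count per category
plt.figure(figsize=(12, 6))
# Each bar shows the mean of its group
data.groupby(['categories'])['pages'].mean().plot(kind='bar')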
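To see what the groupby(...).mean() step computes, here is a plain-Python sketch of the same idea: sum and count the page values per category, then divide. The rows are invented for illustration.

```python
from collections import defaultdict

# (main category, page count) pairs; made-up values.
rows = [("cs", 10.0), ("math", 30.0), ("cs", 14.0), ("hep-th", 25.0)]

totals = defaultdict(lambda: [0.0, 0])  # category -> [sum of pages, number of papers]
for category, pages in rows:
    totals[category][0] += pages
    totals[category][1] += 1

# Sort by category name, as pandas groupby does by default.
mean_pages = {cat: total / n for cat, (total, n) in sorted(totals.items())}
print(mean_pages)  # {'cs': 12.0, 'hep-th': 25.0, 'math': 30.0}
```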

This was also a chance to learn how the split function works.
I'm short on time, so here is just the link:

https://www.runoob.com/python/att-string-split.html
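As a quick illustration of the two split calls used for the categories field, the sketch below runs them on an example category string (the string itself is just a sample in the "archive.subject" style).

```python
categories = "cs.CV cs.LG stat.ML"

# Split on spaces and keep the first listed category.
first = categories.split(" ")[0]

# Split on the dot and keep the part before it: the main category.
main = first.split(".")[0]

# A category without a dot is returned unchanged.
no_dot = "hep-th".split(".")[0]

# The optional second argument limits the number of splits.
limited = categories.split(" ", 1)

print(first)    # cs.CV
print(main)     # cs
print(no_dot)   # hep-th
print(limited)  # ['cs.CV', 'cs.LG stat.ML']
```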

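The resulting plot is as follows:
[Figure: bar chart of average page count per category]
Next, extract the number of figures for each paper: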

data['figures'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* figures', str(x)))
data = data[data['figures'].apply(len) > 0]
data['figures'] = data['figures'].apply(lambda x: float(x[0].replace(' figures', '')))
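The page and figure patterns differ only in the unit word, so one possible variation is a small helper that takes the unit as a parameter. This is a sketch, not the code used above: the name extract_count is made up, and unlike the original pattern it also accepts the singular form ("1 figure") and ignores case.

```python
import re


def extract_count(comment, unit):
    """Return the first '<number> <unit>' count found in comment as an int, or None."""
    pattern = r"([1-9][0-9]*) " + re.escape(unit) + r"s?\b"
    match = re.search(pattern, str(comment), flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


samples = ["12 pages, 1 figure", "30 pages, 5 Figures", "no counts here", None]
figures = [extract_count(s, "figure") for s in samples]
pages = [extract_count(s, "page") for s in samples]

print(figures)  # [1, 5, None, None]
print(pages)    # [12, 30, None, None]
```

With pandas, the helper could be applied as data['comments'].apply(lambda x: extract_count(x, 'figure')), followed by dropping the rows where the result is missing.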

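After that, just plot the result as before; I won't write the code out here.
Since I only analyzed the first 10,000 records (the limit passed to readFile), few of them belong to computer science, so few GitHub links turned up. I'll stop here (truthfully, the holidays just started and I've been lazy).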
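For completeness, here is a rough sketch of how the GitHub link count could be done with the same findall approach. The pattern is my own assumption about what a repository URL looks like (github.com/<owner>/<repo>), and the sample texts are invented.

```python
import re

# Assumed URL shape: http(s)://github.com/<owner>/<repo>
GITHUB = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")

texts = [
    "Code is available at https://github.com/example-user/example-repo",
    "15 pages, 3 figures",
    "See http://github.com/another/project.git for details",
]

links = [GITHUB.findall(t) for t in texts]
with_code = sum(1 for found in links if found)  # number of texts containing a link

print(links[0])   # ['https://github.com/example-user/example-repo']
print(links[2])   # ['http://github.com/another/project.git']
print(with_code)  # 2
```

One limitation: because "." is allowed in the repository name, a URL that ends a sentence will keep its trailing period, so the matches may need a final strip('.').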

Origin blog.csdn.net/weixin_45717055/article/details/112847538