Datawhale data analysis Task 1 (1): Use pandas to read data and count papers

Notes on the knowledge points for the paper statistics task (Datawhale data analysis Task 1 (1))

Goal: count the number of computer science papers published in 2019 in each subfield, and practice basic pandas operations
Dataset source: paper data (the file arxiv-metadata-oai-snapshot.json read in section 1.3)

1 json data type and its reading

1.1 The meaning of json data type

Further reading: a detailed description of JSON, a comparison with XML, and the online JSON tool BeJson.
JSON (JavaScript Object Notation) is a lightweight data-interchange format, not a markup language like HTML. It uses a text format that is completely independent of any programming language, while following conventions familiar to programmers of the C family of languages (including C, C++, Java, JavaScript, Perl, Python, etc.). These properties make JSON an ideal data-exchange language: it is easy for people to read and write and easy for machines to parse and generate, and its compact text generally keeps network payloads small.

JSON is constructed in two structures:

  1. A collection of name/value pairs. In different languages this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
    For example: {"firstName": "Brett", "lastName": "McLaughlin", "email": "aaaa"}
  2. An ordered list of values. In most languages this is realized as an array.
    For example: ["iPhone", 6300, true]

In Python, these two structures correspond to dict (a JSON object) and list (a JSON array).
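As a small illustration of the two structures, the sketch below parses one JSON object and one JSON array with the standard json module and checks which Python types come back. The array contents are made up for illustration.

```python
import json

# A JSON object (name/value pairs) and a JSON array (ordered values).
obj_text = '{"firstName": "Brett", "lastName": "McLaughlin", "email": "aaaa"}'
arr_text = '[1, "two", 3.0, true, null]'

person = json.loads(obj_text)   # JSON object -> dict
values = json.loads(arr_text)   # JSON array  -> list

print(type(person).__name__, person["firstName"])
print(type(values).__name__, values)
```

Note that the JSON literals true and null come back as the Python values True and None.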

1.2 Python read and write module for json files

  • json: converts between JSON strings and Python data types. It provides four functions: dumps, dump, loads, load.
    dumps: converts Python data into a JSON string
    dump: converts Python data into a JSON string and writes it to a file
    loads: converts a JSON string into Python data
    load: reads a JSON string from a file and converts it into Python data

  • pickle: converts between Python objects and a Python-specific binary format (bytes). It provides the same four functions: dumps, dump, loads, load, with the same roles as in json.

  • Difference 1: json can exchange data between different languages, while pickle only works between Python programs.

  • Difference 2: json can only serialize the most basic data types (dictionaries, lists, strings, numbers, booleans, None), not dates or instances of custom classes. pickle can serialize most Python objects, including class instances; classes and functions themselves are pickled by reference to their name, so they must be importable when the data is loaded.
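To make Difference 2 concrete, here is a minimal sketch that tries to serialize the same dictionary with both modules. The record contents are invented for illustration; only standard library calls are used.

```python
import json
import pickle
from datetime import date

# Made-up record containing a value that is not a basic JSON type.
record = {"title": "sample paper", "submitted": date(2019, 1, 1)}

# json cannot serialize a date object out of the box: it raises TypeError.
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False

# pickle round-trips the same object, but the result is Python-specific bytes.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

print(json_ok)
print(type(blob).__name__)
print(restored == record)
```

One common workaround on the json side is to pass a fallback converter, for example json.dumps(record, default=str), at the cost of getting a plain string back when loading. Pickled data should only be loaded from sources you trust, since unpickling can execute arbitrary code.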

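import json

test_dict = {'bigberg': [7600, {1: [['iPhone', 6300], ['Bike', 800], ['shirt', 300]]}]}
print(test_dict)
print(type(test_dict))   # <class 'dict'>
# dumps converts the data into a JSON string
json_str = json.dumps(test_dict)
print(json_str)
print(type(json_str))    # <class 'str'>
# loads converts the string back into Python data
# (the integer key 1 comes back as the string '1', because JSON object keys are always strings)
new_dict = json.loads(json_str)
print(new_dict)
print(type(new_dict))    # <class 'dict'>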
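The example above covers dumps and loads. The file-based pair dump and load works the same way, as in the following sketch. The file name test_dict.json is hypothetical and lives in a temporary directory so that nothing is left behind.

```python
import json
import os
import tempfile

test_dict = {'bigberg': [7600, {1: [['iPhone', 6300], ['Bike', 800]]}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test_dict.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(test_dict, f)      # serialize and write to the file
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)        # read from the file and parse

print(loaded)
print(loaded == test_dict)  # False: the integer key 1 came back as the string "1"
```

The round trip is therefore lossless only for data that already fits the JSON model (string keys, lists, numbers, strings, booleans, None).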

1.3 The with...as statement and the open function for reading data

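import json
import pandas as pd

data = []  # initialize
# Advantages of the with statement: 1. the file handle is closed automatically;
# 2. the file is closed even if an exception is raised while reading
with open("arxiv-metadata-oai-snapshot.json", 'r') as f:
    for line in f:
        data.append(json.loads(line))  # each line of the file is parsed as one JSON object

data = pd.DataFrame(data)  # convert the list to a DataFrame so it can be analyzed with pandas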
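Because the file is read one JSON object per line, it is possible to keep only the fields that are needed before building the DataFrame, which may reduce memory use on a large file. The sketch below shows the idea on an in-memory stand-in for the file; the helper read_json_lines, the two sample records, and the field names other than categories are assumptions made for illustration.

```python
import io
import json

# Stand-in for the real file: two made-up records, one JSON object per line.
fake_file = io.StringIO(
    '{"id": "0001", "title": "A", "categories": "cs.AI cs.LG", "abstract": "..."}\n'
    '{"id": "0002", "title": "B", "categories": "math.CO", "abstract": "..."}\n'
)

def read_json_lines(handle, keep_fields):
    """Parse one JSON object per line, keeping only the requested keys."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:          # skip blank lines
            continue
        record = json.loads(line)
        rows.append({key: record.get(key) for key in keep_fields})
    return rows

rows = read_json_lines(fake_file, ["id", "categories"])
print(rows)
```

The resulting list of dictionaries can be passed to pd.DataFrame(rows) in the same way as the full list above.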

Python's built-in open() function is called as open('path/filename', 'mode'). The second argument determines how the file is opened: read-only r, write w, append a, and so on. The modes are listed in the following table:

mode    meaning
r       Open the file for reading only. The file pointer is placed at the beginning of the file. This is the default mode.
rb      Open the file in binary format for reading only. The file pointer is placed at the beginning of the file.
r+      Open the file for reading and writing. The file pointer is placed at the beginning of the file.
w       Open the file for writing only. If the file already exists, its contents are overwritten. If the file does not exist, a new file is created.
a       Open the file for appending. If the file already exists, the file pointer is placed at the end of the file, so new content is written after the existing content. If the file does not exist, a new file is created for writing.
rb, wb, ab    Open the file in binary form for reading, writing, or appending, respectively.
r+, w+, a+ (and rb+, wb+, ab+)    Open the file for both reading and writing. The variants differ in the initial pointer position and in whether existing content is truncated (w+ truncates the file, a+ writes at the end).
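The effect of the most common modes in the table can be checked with a short experiment. This is a sketch only: the file name modes.txt is hypothetical and the file is created in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes.txt")  # hypothetical file name

    with open(path, "w") as f:    # w: create (or truncate) and write
        f.write("first\n")
    with open(path, "a") as f:    # a: write after the existing content
        f.write("second\n")
    with open(path, "r") as f:    # r: read from the beginning
        after_append = f.read()

    with open(path, "r+") as f:   # r+: read and write, pointer at the start
        f.write("FIRST")          # overwrites the first five characters in place
    with open(path, "r") as f:
        after_rplus = f.read()

    with open(path, "w") as f:    # w again: the old content is discarded
        f.write("third\n")
    with open(path, "rb") as f:   # rb: read raw bytes instead of text
        raw = f.read()

print(repr(after_append))
print(repr(after_rplus))
print(raw)
```

The last read returns a bytes object rather than a str, which is the practical difference between the text modes and their b counterparts.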
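with open('/path/to/file', 'r') as f:
    # read() reads the entire file at once; if the file is too large it may not fit in memory,
    # in which case f.read(size) reads at most size characters per call
    print(f.read())
    f.seek(0)  # rewind: read() left the file pointer at the end of the file
    for line in f.readlines():  # read line by line
        print(line)
# Writing data: the operating system usually does not write to disk immediately. It buffers the data in memory
# and writes it out later. Only when close() is called is the remaining buffered data flushed.
# Forgetting to call close() may leave only part of the data on disk, with the rest lost.
# The with statement calls close() automatically
with open('file', 'w') as f:
    f.write('Hello, world!')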
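For files that are too large to load with a single read(), two usual alternatives are reading fixed-size chunks and iterating over the file object line by line. The sketch below demonstrates both on an in-memory stand-in; the content and the chunk size of 8 are arbitrary choices for illustration.

```python
import io

# Stand-in for a large text file (made-up content).
big_file = io.StringIO("line one\nline two\nline three\n")

# Option 1: read fixed-size chunks until read() returns an empty string.
chunks = []
while True:
    chunk = big_file.read(8)   # at most 8 characters per call
    if not chunk:
        break
    chunks.append(chunk)

# Option 2: iterate over the file object, which yields one line at a time.
big_file.seek(0)
line_count = sum(1 for _ in big_file)

print(len(chunks), line_count)
```

Iterating over the file object is the approach used for the arXiv file earlier in this section, since it never holds more than one line of raw text at a time.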
    

2 The split function and nested loops in list comprehensions

2.1 split function

unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories
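The same logic can be tried without the full dataset. In this sketch a plain list of made-up category strings stands in for data["categories"], and a set comprehension replaces the set([...]) wrapper.

```python
# Made-up category strings in the same space-separated style as data["categories"].
categories = ["cs.AI cs.LG", "math.CO", "cs.LG stat.ML"]

# Split each entry on spaces, then collect every individual category into a set.
unique_categories = {cat for entry in categories for cat in entry.split(' ')}

print(len(unique_categories))
print(sorted(unique_categories))
```

The category cs.LG appears in two entries but is counted once, because a set keeps only distinct values.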

split() slices a string at the specified separator and returns a list of substrings. If maxsplit is given, at most maxsplit splits are made, so the result has at most maxsplit+1 elements.

str.split(sep=None, maxsplit=-1)

  • sep: the separator. The default (None) splits on runs of whitespace, including spaces, newlines (\n), tabs (\t), etc.
  • maxsplit: the maximum number of splits. The default is -1, meaning there is no limit.
s = "Line1-abcdef \nLine2-abc \nLine4-abcd"
print(s.split())        # split on whitespace, which includes \n
print(s.split(' ', 1))  # split on a space, at most once, giving two parts

['Line1-abcdef', 'Line2-abc', 'Line4-abcd']
['Line1-abcdef', '\nLine2-abc \nLine4-abcd']
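The difference between calling split() with no separator and calling split(' ') matters when the input contains repeated spaces or tabs. The sample string below is made up to show this.

```python
text = "cs.AI  cs.LG\tstat.ML"   # two spaces, then a tab (made-up sample)

by_whitespace = text.split()     # runs of any whitespace act as one separator
by_space = text.split(' ')       # every single space is a separator
limited = text.split(None, 1)    # whitespace splitting, at most one split

print(by_whitespace)
print(by_space)
print(limited)
```

With an explicit ' ' separator, two adjacent spaces produce an empty string in the result and the tab is not treated as a separator at all.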

2.2 Nested loops in list comprehensions

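A list comprehension generates a new list from an existing iterable and makes the code more concise.
Format 1:
[exp for var in iterable]
exp: an expression
var: a variable
iterable: an iterable object
Execution process:
1. Iterate over the elements of the iterable
2. Assign each element to var
3. Evaluate the expression exp using the value of var
4. Collect the results of the expression into a new list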
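# range returns a list in Python 2; in Python 3 it returns a lazy iterable (a range object)
li = [i for i in range(1, 11)]  # the i in the expression and the i after for must be the same variable
print(li)
l1 = [1, 2, 3, 4]
l2 = [i * i for i in l1]
print(l1)   # [1, 2, 3, 4]
print(l2)   # [1, 4, 9, 16]
# The equivalent for loop:
l1 = [1, 2, 3, 4]
l2 = []
for i in l1:
    l2.append(i * i)
print(l1)
print(l2)
# Generate the numbers 2n+1 for n from 2 to 8
# l4 = [(2*i+1) for i in range(2, 9)]
l4 = [2 * i + 1 for i in range(2, 9)]
print(l4)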
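Format 2:

[exp for var in iterable if condition]

1. Iterate over the iterable to get each element
2. Assign the element to var
3. Test the value of var against the if condition
4. If the condition holds, evaluate the expression exp with that value
5. Append the result of exp to the new list

# Using a list comprehension
l5 = list(range(1, 11))
l6 = [i for i in l5 if i % 2 == 0]
print(l6)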
Format 3: nested loops
l10 = ["a", 'b', 'c', 'd']
l11 = ['f', 'j']
l12 = []
for i in l10:
    for j in l11:
        l12.append(i + j)
print(l12)

Using a list comprehension:

l10 = ["a", 'b', 'c', 'd']
l11 = ['f', 'j']
l12 = [i + j for i in l10 for j in l11]
print(l12)

['af', 'aj', 'bf', 'bj', 'cf', 'cj', 'df', 'dj']
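In a nested comprehension the for clauses are read left to right, with the first one acting as the outer loop. The sketch below checks this against itertools.product, shows that swapping the clauses changes the order, and applies the same pattern to flatten a made-up nested list, which is the shape of the category-splitting code in section 2.1.

```python
from itertools import product

l10 = ["a", "b", "c", "d"]
l11 = ["f", "j"]

pairs = [i + j for i in l10 for j in l11]       # first for clause = outer loop
same = [i + j for i, j in product(l10, l11)]    # same order via itertools.product
swapped = [i + j for j in l11 for i in l10]     # clauses swapped: different order

# Flattening a list of lists uses the same two-clause pattern.
nested = [["cs.AI", "cs.LG"], ["math.CO"]]      # made-up sample
flat = [item for sub in nested for item in sub]

print(pairs == same)
print(swapped)
print(flat)
```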

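Format 4:

Used when there is an else clause. Note that the whole condition (exp1 if condition else exp2) goes before the for.

l14 = ['Abc', 'DEF', 10]
l15 = []
for i in l14:
    if isinstance(i, str):
        l15.append(i.lower())
    else:
        l15.append(i)
print(l15)

Using a list comprehension: strings are converted to lowercase, everything else is left unchanged

l14 = ['Abc', 'DEF', 10]
l15 = [i.lower() if isinstance(i, str) else i for i in l14]
print(l15)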
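Format 2 and Format 4 look similar but behave differently: an if after the for filters elements out, while an if...else before the for keeps every element and only chooses how to transform it. A minimal side-by-side sketch, reusing the list from above:

```python
l14 = ['Abc', 'DEF', 10]

# if...else before for: every element is kept, strings are lowercased.
mapped = [i.lower() if isinstance(i, str) else i for i in l14]

# if after for: elements that fail the test are dropped.
filtered = [i.lower() for i in l14 if isinstance(i, str)]

# Both forms can be combined: filter first, then choose the expression.
both = [i.lower() if i.isupper() else i for i in l14 if isinstance(i, str)]

print(mapped)
print(filtered)
print(both)
```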

Origin blog.csdn.net/qq_43720646/article/details/112598481