Python web crawler study notes (ten): data storage

1. Text storage

import requests
from pyquery import PyQuery as pq
 
url = 'https://www.zhihu.com/explore'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
html = requests.get(url, headers=headers).text
doc = pq(html)
items = doc('.ExploreCollectionCard-contentItem').items()
for item in items:
    question = item.find('a').text()
    answer = pq(item.find('.ExploreCollectionCard-contentExcerpt').html()).text()
    file = open('explore.txt', 'a', encoding='utf-8')
    file.write('\n'.join([question, answer]))
    file.write('\n' + '=' * 50 + '\n')
    file.close()

Here, the first argument to open() is the name of the file to save to, and the second argument, 'a', opens the file in append mode, so new text is added to the end of the file instead of overwriting it.

We use requests to fetch Zhihu's "Explore" page and extract the question and answer text of each hot topic. We then call Python's built-in open() function to open a text file, which returns a file object that we assign to file. Next, we call the file object's write() method to write the extracted content, and finally call close() to close the file. At that point the scraped content has been saved to the text file.

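Open modes

Besides 'a', there are several other modes for opening files:

r: read-only (the default). The file must already exist, and the pointer starts at the beginning.
r+: read and write. The file must already exist, and the pointer starts at the beginning.
w: write-only. An existing file is truncated; a missing file is created.
w+: read and write. An existing file is truncated; a missing file is created.
a: append. Writes go to the end of the file; a missing file is created.
a+: read and append. Writes go to the end of the file; a missing file is created.
rb, rb+, wb, wb+, ab, ab+: the same as above, but in binary mode (for images, audio, and other non-text data).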
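The difference between the most common modes can be checked with a small experiment. This is only a sketch: the scratch file modes.txt and its temporary directory are made up for the illustration.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not part of the crawler).
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w', encoding='utf-8') as f:   # 'w' creates or truncates the file
    f.write('first\n')

with open(path, 'a', encoding='utf-8') as f:   # 'a' appends to the end
    f.write('second\n')

with open(path, 'r', encoding='utf-8') as f:   # 'r' reads; the file must exist
    after_append = f.read()

with open(path, 'w', encoding='utf-8') as f:   # 'w' again: the old content is lost
    f.write('third\n')

with open(path, encoding='utf-8') as f:        # the mode defaults to 'r'
    after_rewrite = f.read()

print(repr(after_append))
print(repr(after_rewrite))
```

After the append, the file holds both lines; after reopening with 'w', only the last line written remains.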

Simplified syntax

Use the with ... as syntax. When the with block ends, the file is closed automatically, so there is no need to call close(). The storage code above can be shortened as follows:

with open('explore.txt', 'a', encoding='utf-8') as file:
    file.write('\n'.join([question, answer]))
    file.write('\n' + '=' * 50 + '\n')

If you want to clear the existing content when saving, change the second argument to 'w':

with open('explore.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join([question, answer]))
    file.write('\n' + '=' * 50 + '\n')
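The snippets above open the file once per item. One possible variation is to open the file a single time and write every item inside the same with block. In the sketch below, save_records, the sample records, and the temporary path are all hypothetical names introduced for illustration; in the crawler, the pairs would come from the pyquery loop.

```python
import os
import tempfile


def save_records(records, path, mode='a'):
    """Write (question, answer) pairs to a text file, one block per pair."""
    with open(path, mode, encoding='utf-8') as file:
        for question, answer in records:
            file.write('\n'.join([question, answer]))
            file.write('\n' + '=' * 50 + '\n')


# Made-up sample data standing in for the scraped items.
records = [('Question 1', 'Answer 1'), ('Question 2', 'Answer 2')]
path = os.path.join(tempfile.mkdtemp(), 'explore.txt')
save_records(records, path, mode='w')

with open(path, encoding='utf-8') as file:
    content = file.read()
print(content)
```

Passing mode='w' starts a fresh file on each run, while the default 'a' keeps adding to the existing one.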

2. JSON file storage

In JavaScript, everything is an object, so any supported type can be represented in JSON: strings, numbers, objects, arrays, and so on. Objects and arrays are the two structural types, and they are the ones used most often.

Object

An object is wrapped in curly braces {} in JavaScript, and its data structure is a set of key-value pairs: {key1: value1, key2: value2, …}. In object-oriented terms, each key is a property of the object and each value is that property's value. JavaScript allows integers and strings as key names, but in JSON text every key must be a double-quoted string. A value can be of any type.

JSON data can be written as follows:

[{
    "name": "Bob",
    "gender": "male",
    "birthday": "1992-10-18"
}, {
    "name": "Selina",
    "gender": "female",
    "birthday": "1995-10-18"
}]

In Python, the same data corresponds to a list of dictionaries:

[{'birthday': '1992-10-18', 'gender': 'male', 'name': 'Bob'},
 {'birthday': '1995-10-18', 'gender': 'female', 'name': 'Selina'}]

Content enclosed in square brackets is an array, which is equivalent to a list in Python. Each element of the list can be of any type; in this example, each element is an object enclosed in curly braces, which is equivalent to a dictionary.

JSON data is built by freely combining these two forms, which can be nested to any depth. The resulting structure is clear, which makes JSON an excellent format for data exchange.
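To make the correspondence between JSON types and Python types concrete, here is a small sketch using the standard json module (covered in the next section). The sample record and its field names are invented for the illustration.

```python
import json

# Made-up JSON text that uses each basic JSON type once.
text = '''
{
    "name": "Bob",
    "age": 25,
    "height": 1.8,
    "married": false,
    "pets": ["cat", "dog"],
    "car": null
}
'''

person = json.loads(text)
for key, value in person.items():
    # Show which Python type each JSON value was converted to.
    print(key, type(value).__name__)
```

An object becomes a dict, an array a list, a string a str, a number an int or float, true/false a bool, and null becomes None.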

2.1 Read JSON

We can call the json library's loads() method to convert a JSON text string into a Python object, and its dumps() method to convert a Python object back into a JSON text string.

import json

# Named json_str rather than str, so the built-in str type is not shadowed.
json_str = '''
[{
    "name": "Bob",
    "gender": "male",
    "birthday": "1992-10-18"
}, {
    "name": "Selina",
    "gender": "female",
    "birthday": "1995-10-18"
}]
'''
print(type(json_str))
data = json.loads(json_str)
print(data)
print(type(data))

<class 'str'>
[{'name': 'Bob', 'gender': 'male', 'birthday': '1992-10-18'}, {'name': 'Selina', 'gender': 'female', 'birthday': '1995-10-18'}]
<class 'list'>

# Save the string to data.json so that it can be read back later.
with open('data.json', 'w', encoding='utf-8') as f:
    f.write(json_str)

Here, loads() converts the string into a Python object. Because the outermost layer of the JSON is a pair of square brackets, the result is a list.

We can then use an index to get the corresponding content. For example, to get the name attribute of the first element, either of the following works:

data[0]['name']
data[0].get('name')

'Bob'

Both approaches return the same result. The get() method is recommended when a key may be missing: indexing with square brackets raises a KeyError, whereas get() returns None. get() also accepts a second argument (i.e. a default value), for example:

print(data[0].get('age'))
data[0].get('age', 25)

None
25

Here we try to get age, a key that does not exist in the original dictionary, so get() returns None by default. If a second argument (i.e. a default value) is passed in, that value is returned instead.

Note that strings in JSON, including key names, must be enclosed in double quotes, not single quotes. For example, the following form causes an error:

import json

# Single-quoted keys and values are not valid JSON.
json_str = '''
[{
    'name': 'Bob',
    'gender': 'male',
    'birthday': '1992-10-18'
}]
'''
data = json.loads(json_str)

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 3 column 5 (char 8)

Be sure to use double quotes to represent the JSON string, otherwise the loads() method will fail to parse.
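When the JSON text comes from an external source, it may be worth catching this error instead of letting the program crash. The following is a minimal sketch; parse_json is a hypothetical helper name, and returning None on failure is only one possible policy.

```python
import json


def parse_json(text):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # lineno, colno and msg are attributes of JSONDecodeError.
        print(f'Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}')
        return None


good = parse_json('[{"name": "Bob"}]')
bad = parse_json("[{'name': 'Bob'}]")   # single quotes: rejected
print(good)
print(bad)
```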

To read content from a JSON file, for example the data.json file whose content is the JSON string defined earlier, we can first read the file's text and then convert it with the loads() method:

import json

with open('data.json', encoding='utf-8') as file:
    text = file.read()
    data = json.loads(text)
    print(data)

[{'name': 'Bob', 'gender': 'male', 'birthday': '1992-10-18'}, {'name': 'Selina', 'gender': 'female', 'birthday': '1995-10-18'}]
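The read-then-parse steps can also be combined: the json module provides load() (without the s), which takes an open file object directly. The sketch below writes a small sample file to a temporary directory first so that it runs on its own; that path is an assumption of the example, not part of the original notes.

```python
import json
import os
import tempfile

# Create a small sample file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'data.json')
with open(path, 'w', encoding='utf-8') as file:
    file.write('[{"name": "Bob", "gender": "male", "birthday": "1992-10-18"}]')

# json.load() reads from the file object and parses in one step.
with open(path, encoding='utf-8') as file:
    data = json.load(file)
print(data)
```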

2.2 Output JSON

We can call the dumps() method to convert a Python object into a JSON string:

import json

data = [{
    'name': 'Bob',
    'gender': 'male',
    'birthday': '1992-10-18'
}]
with open('data.json', 'w') as f:
    f.write(json.dumps(data))

Using the dumps() method, we convert the Python object into a JSON string, and then call the file's write() method to write it to the text file.

If you want the saved JSON to keep an indented, readable layout, add the indent parameter, which sets the number of characters per indentation level:

with open('data.json', 'w') as file:
    file.write(json.dumps(data, indent=2))

If the data contains Chinese characters, they are written as Unicode escape sequences (such as \u738b) instead of the characters themselves:

import json
 
data = [{
    'name': '王伟',
    'gender': '男',
    'birthday': '1992-10-18'
}]
with open('data.json', 'w') as file:
    file.write(json.dumps(data, indent=2))

To output the Chinese characters as they are, set the ensure_ascii parameter to False and also specify the encoding of the output file:

with open('data.json', 'w', encoding='utf-8') as file:
    file.write(json.dumps(data, indent=2, ensure_ascii=False))
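The effect of ensure_ascii can be compared directly, and json.dump() (without the s) can write to the file object in one call. This is a sketch under the assumption of a scratch file in a temporary directory; only the escaped form is printed, since it is plain ASCII on any terminal.

```python
import json
import os
import tempfile

data = [{'name': '王伟', 'gender': '男', 'birthday': '1992-10-18'}]

escaped = json.dumps(data)                        # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)   # keeps the original characters
print(escaped)

# json.dump() serializes straight into an open file.
path = os.path.join(tempfile.mkdtemp(), 'data.json')
with open(path, 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=2, ensure_ascii=False)

# Both forms decode back to the same Python object.
with open(path, encoding='utf-8') as file:
    restored = json.load(file)
print(restored == data)
```

Either representation is valid JSON and round-trips to the same data; ensure_ascii=False only makes the file easier for people to read.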

3. CSV file storage

CSV stands for Comma-Separated Values (sometimes called character-separated values). A CSV file stores tabular data in plain text. The file is a sequence of characters made up of any number of records, separated by some kind of newline character. Each record is composed of fields, and the fields are separated by another character or string, most commonly a comma or a tab. All records have exactly the same sequence of fields, so the file is effectively the plain-text form of a structured table. This often makes CSV a convenient format for saving data.
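Because the comma acts as the field separator, a field that itself contains a comma has to be quoted. Python's csv module (used throughout this section) handles this automatically, as the following sketch shows. It writes to an in-memory io.StringIO buffer rather than a real file, and the sample row is made up.

```python
import csv
import io

# Write two records to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['id', 'name', 'note'])
writer.writerow(['10001', 'Mike', 'likes tea, coffee'])   # field contains a comma

raw_text = buffer.getvalue()
print(raw_text)

# Reading the text back restores the original fields.
rows = list(csv.reader(io.StringIO(raw_text)))
print(rows)
```

The field with the embedded comma is wrapped in double quotes in the raw text and comes back as a single value when parsed.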

3.1 Write

import csv
 
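# newline='' stops the csv module's line endings from being translated again,
# which would otherwise leave a blank line between rows on Windows.
with open('data.csv', 'w', newline='') as csvfile: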
    writer = csv.writer(csvfile)
    writer.writerow(['id', 'name', 'age'])
    writer.writerow(['10001', 'Mike', 20])
    writer.writerow(['10002', 'Bob', 22])
    writer.writerow(['10003', 'Jordan', 21])

First, open data.csv with the mode set to 'w' (i.e. write) to get a file handle. Then call the csv library's writer() function, passing in the handle, to create a writer object, and call its writerow() method once for each row of data. This completes the write.

To change the separator between columns, pass in the delimiter parameter:

import csv
 
with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow(['id', 'name', 'age'])
    writer.writerow(['10001', 'Mike', 20])
    writer.writerow(['10002', 'Bob', 22])
    writer.writerow(['10003', 'Jordan', 21])
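A file written with a custom delimiter has to be read with the same delimiter, otherwise each line comes back as a single field. A minimal round-trip sketch, assuming a scratch file in a temporary directory:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.csv')

# Write with a space as the column separator.
with open(path, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow(['id', 'name', 'age'])
    writer.writerow(['10001', 'Mike', 20])

# Read back with the same delimiter.
with open(path, newline='') as csvfile:
    rows = list(csv.reader(csvfile, delimiter=' '))
print(rows)
```

Note that the age comes back as the string '20': the csv module does not restore numeric types.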

We can also call the writerows() method to write multiple rows at once. In this case, the argument must be a two-dimensional list, for example:

import csv
 
with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['id', 'name', 'age'])
    writer.writerows([['10001', 'Mike', 20], ['10002', 'Bob', 22], ['10003', 'Jordan', 21]])

The csv library also provides a way to write rows from dictionaries, for example:

import csv
 
with open('data.csv', 'w', newline='') as csvfile:
    fieldnames = ['id', 'name', 'age']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'id': '10001', 'name': 'Mike', 'age': 20})
    writer.writerow({'id': '10002', 'name': 'Bob', 'age': 22})
    writer.writerow({'id': '10003', 'name': 'Jordan', 'age': 21})

If you want to write Chinese content, you may run into character-encoding problems. In that case, pass the encoding parameter to open():

import csv
 
with open('data.csv', 'a', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['id', 'name', 'age']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'id': '10005', 'name': '王伟', 'age': 22})
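When appending with mode 'a', the header should be written only once. One way to handle this is to check whether the file already has content before writing. In the sketch below, append_row is a hypothetical helper and the temporary path is an assumption of the example.

```python
import csv
import os
import tempfile


def append_row(path, row, fieldnames):
    """Append one dict row, writing the header first if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if need_header:
            writer.writeheader()
        writer.writerow(row)


path = os.path.join(tempfile.mkdtemp(), 'data.csv')
fields = ['id', 'name', 'age']
append_row(path, {'id': '10001', 'name': 'Mike', 'age': 20}, fields)
append_row(path, {'id': '10005', 'name': '王伟', 'age': 22}, fields)

with open(path, newline='', encoding='utf-8') as csvfile:
    lines = csvfile.read().splitlines()
print(len(lines))
```

The file ends up with one header line followed by the two data rows, no matter how many times the helper is called.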

3.2 Read

We can also use the csv library to read CSV files:

import csv
 
with open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)

['id', 'name', 'age']
['10001', 'Mike', '20']
['10002', 'Bob', '22']
['10003', 'Jordan', '21']
['10005', '王伟', '22']
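As the output shows, every value is read back as a string. If you prefer to access fields by column name and convert types yourself, csv.DictReader is one option. The sketch below parses an in-memory sample with io.StringIO so that it runs without a data.csv file; the sample text mirrors the rows written earlier.

```python
import csv
import io

# In-memory sample standing in for data.csv.
sample = 'id,name,age\n10001,Mike,20\n10002,Bob,22\n'

reader = csv.DictReader(io.StringIO(sample))   # the first row supplies the keys
people = []
for row in reader:
    row['age'] = int(row['age'])   # values arrive as strings; convert explicitly
    people.append(row)

for person in people:
    print(person['name'], person['age'])
```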

If you use pandas, you can call its read_csv() function to read the data from the CSV file, for example:

import pandas as pd
 
df = pd.read_csv('data.csv')
print(df)

      id    name  age
0  10001    Mike   20
1  10002     Bob   22
2  10003  Jordan   21
3  10005      王伟   22