1. Text storage
import requests
from pyquery import PyQuery as pq

url = 'https://www.zhihu.com/explore'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
# Fetch the page and parse it with pyquery
html = requests.get(url, headers=headers).text
doc = pq(html)
items = doc('.ExploreCollectionCard-contentItem').items()
for item in items:
    question = item.find('a').text()
    answer = pq(item.find('.ExploreCollectionCard-contentExcerpt').html()).text()
    # Append the question and answer to explore.txt, followed by a separator line
    file = open('explore.txt', 'a', encoding='utf-8')
    file.write('\n'.join([question, answer]))
    file.write('\n' + '=' * 50 + '\n')
    file.close()
Here we use requests to fetch the "Explore" page of Zhihu and extract the question and answer text of each entry. We then call Python's built-in open() function to open a text file and obtain a file object, which is assigned to file. The first parameter of open() is the name of the target file, and the second parameter, a, means that content is appended to the end of the file. Next, the write() method of the file object writes the extracted content into the file, and finally the close() method closes it, so that the captured content is saved to the text file.
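Open method
Besides a, open() supports several other modes: r (read only, the default), w (write, clearing any existing content), and x (create a new file and fail if it already exists). Adding b selects binary mode (for example rb or wb), and adding + allows both reading and writing (for example r+, w+, or a+).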
Simplified writing
Use the with as syntax. At the end of the with block, the file is closed automatically, so there is no need to call the close() method. The storage code can be abbreviated as follows:
with open('explore.txt', 'a', encoding='utf-8') as file:
    file.write('\n'.join([question, answer]))
    file.write('\n' + '=' * 50 + '\n')
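One possible variation, sketched below, opens the file once and writes every record inside a single with block instead of reopening it for each item. The records list and the save_records() helper are hypothetical stand-ins for the scraped questions and answers, and a temporary directory is used only so the sketch leaves no files behind.

```python
import os
import tempfile

# Hypothetical (question, answer) pairs standing in for scraped results.
records = [
    ('Question one?', 'First answer.'),
    ('Question two?', 'Second answer.'),
]


def save_records(path, records):
    """Append every record to path, opening the file only once."""
    with open(path, 'a', encoding='utf-8') as file:
        for question, answer in records:
            file.write('\n'.join([question, answer]))
            file.write('\n' + '=' * 50 + '\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'explore.txt')
    save_records(path, records)
    # Read the file back to check what was stored.
    with open(path, encoding='utf-8') as file:
        saved = file.read()

print(saved)
```

Opening the file once avoids repeated open and close calls when there are many items, while the with block still guarantees the file is closed.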
If you want to clear the existing content of the file when saving, change the second parameter to w:
with open('explore.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join([question, answer]))
    file.write('\n' + '=' * 50 + '\n')
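The difference between the a and w modes can be checked with a short sketch. It writes to a temporary file (modes.txt is a hypothetical name, and tempfile is used only so that nothing is left on disk), appends to it, and then reopens it with w to show that the earlier content is discarded.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes.txt')  # hypothetical file name

    # 'w' creates the file, or truncates it if it already exists.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('first\n')

    # 'a' keeps the existing content and adds to the end.
    with open(path, 'a', encoding='utf-8') as f:
        f.write('second\n')

    # 'r' opens the file for reading only.
    with open(path, 'r', encoding='utf-8') as f:
        after_append = f.read()

    # Opening with 'w' again discards what was there before.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('third\n')

    with open(path, 'r', encoding='utf-8') as f:
        after_rewrite = f.read()

print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_rewrite))  # 'third\n'
```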
2. JSON file storage
JSON (JavaScript Object Notation) represents data as a combination of objects and arrays. In the JavaScript language, everything is an object, so any supported type can be represented in JSON, such as strings, numbers, objects, and arrays.
Object: in JavaScript it is wrapped in curly braces {} and has the key-value structure {key1: value1, key2: value2, …}. In an object-oriented language, the key is an attribute of the object and the value is the corresponding value. In JSON text, key names must be strings in double quotes; the value can be of any JSON type.
JSON data can be written as follows:
[{
"name": "Bob",
"gender": "male",
"birthday": "1992-10-18"
}, {
"name": "Selina",
"gender": "female",
"birthday": "1995-10-18"
}]
In Python, the same data corresponds to a list of dictionaries:
[{'birthday': '1992-10-18', 'gender': 'male', 'name': 'Bob'},
 {'birthday': '1995-10-18', 'gender': 'female', 'name': 'Selina'}]
Array: data enclosed in square brackets [] is equivalent to a list type. Each element of the list can be of any type; in this example, each element is a dictionary surrounded by braces.
JSON data is built by freely combining these two forms. They can be nested to any depth, the structure stays clear, and the format is an excellent way to exchange data.
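To illustrate how objects and arrays nest, here is a minimal sketch that parses a small document with Python's json module (covered in the next subsection). The document content, including the class, active, teacher, and scores keys, is made up for this example.

```python
import json

# Hypothetical nested document: an object that holds an array of objects.
text = '''
{
    "class": "A1",
    "active": true,
    "teacher": null,
    "students": [
        {"name": "Bob", "scores": [90, 85.5]},
        {"name": "Selina", "scores": [88, 92]}
    ]
}
'''

doc = json.loads(text)

print(type(doc).__name__)               # dict  (JSON object)
print(type(doc['students']).__name__)   # list  (JSON array)
print(doc['active'], doc['teacher'])    # True None  (JSON true and null)
print(doc['students'][0]['scores'])     # [90, 85.5]
```

Each level is reached with ordinary dictionary keys and list indexes, no matter how deep the nesting goes.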
2.1 Read JSON
Python provides the json library for reading and writing JSON. Its loads() method converts a JSON text string into a Python object, and its dumps() method converts a Python object back into a JSON text string.
import json

# Named json_str rather than str so the built-in str type is not shadowed
json_str = '''
[{
    "name": "Bob",
    "gender": "male",
    "birthday": "1992-10-18"
}, {
    "name": "Selina",
    "gender": "female",
    "birthday": "1995-10-18"
}]
'''
print(type(json_str))
data = json.loads(json_str)
print(data)
print(type(data))
<class 'str'>
[{'name': 'Bob', 'gender': 'male', 'birthday': '1992-10-18'}, {'name': 'Selina', 'gender': 'female', 'birthday': '1995-10-18'}]
<class 'list'>
# Save the same JSON text to data.json so that it can be read back later
with open('data.json', 'w') as f:
    f.write(json_str)
The loads() method is used here to convert the string into a Python object. Since the outermost layer of the JSON is square brackets, the result is a list.
We can therefore use an index to get the corresponding content. For example, to get the name attribute of the first element, you can use either of the following:
data[0]['name']
data[0].get('name')
'Bob'
The two approaches return the same result. The get() method is recommended when a key may be missing: if the key does not exist, it returns None instead of raising an error. The get() method also accepts a second parameter (the default value), for example:
print(data[0].get('age'))
print(data[0].get('age', 25))
None
25
Here we try to get age, a key that does not exist in the original dictionary, so get() returns None by default. If a second parameter (the default value) is passed in, that default is returned instead.
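The contrast between square-bracket access and get() can be seen in a minimal sketch. The record dictionary below is a stand-in for one element of the parsed data.

```python
# Stand-in for one element of the parsed list.
record = {'name': 'Bob', 'gender': 'male', 'birthday': '1992-10-18'}

# Square-bracket access raises KeyError when the key is missing.
try:
    age = record['age']
except KeyError as exc:
    print('missing key:', exc)
    age = None

# get() returns None, or the supplied default, instead of raising.
print(record.get('age'))      # None
print(record.get('age', 25))  # 25

# get() does not add the key to the dictionary.
print('age' in record)        # False
```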
Note that strings in JSON data must be enclosed in double quotes, not single quotes. For example, parsing the following text raises an error:
import json

json_str = '''
[{
    'name': 'Bob',
    'gender': 'male',
    'birthday': '1992-10-18'
}]
'''
data = json.loads(json_str)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 3 column 5 (char 8)
Be sure to use double quotes to represent the JSON string, otherwise the loads() method will fail to parse.
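When the JSON text comes from an untrusted or unknown source, one option is to catch the parsing error instead of letting the program stop. The following is only a sketch: parse_json() is a hypothetical helper, and returning None on failure is one possible convention (it cannot be told apart from a valid JSON null).

```python
import json


def parse_json(text):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno, colno and msg describe where and why parsing failed.
        print(f'invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}')
        return None


good = parse_json('[{"name": "Bob"}]')   # double quotes: parses
bad = parse_json("[{'name': 'Bob'}]")    # single quotes: rejected

print(good)
print(bad)
```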
To read JSON from a file, for example a data.json file whose content is the JSON string defined earlier, first read the content of the file and then convert it with the loads() method:
import json

with open('data.json') as file:
    json_str = file.read()
    data = json.loads(json_str)
    print(data)
[{'name': 'Bob', 'gender': 'male', 'birthday': '1992-10-18'}, {'name': 'Selina', 'gender': 'female', 'birthday': '1995-10-18'}]
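As an alternative to read() followed by loads(), the json library also offers load(), which parses directly from an open file object. The sketch below writes a small sample file into a temporary directory first so that it is self-contained; the path and content are assumptions for illustration.

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.json')

    # Create a small sample file to read back.
    with open(path, 'w', encoding='utf-8') as file:
        file.write('[{"name": "Bob", "gender": "male", "birthday": "1992-10-18"}]')

    # json.load() reads and parses the file in one step.
    with open(path, encoding='utf-8') as file:
        data = json.load(file)

print(data[0]['name'])  # Bob
```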
2.2 Output JSON
We can call the dumps() method to convert a Python object into a JSON string:
import json

data = [{
    'name': 'Bob',
    'gender': 'male',
    'birthday': '1992-10-18'
}]
with open('data.json', 'w') as f:
    f.write(json.dumps(data))
Using the dumps() method, we convert the Python object into a string and then call the file's write() method to write it to the text file.
To keep the JSON readable, with line breaks and indentation, add the indent parameter, which sets the number of spaces per indentation level:
with open('data.json', 'w') as file:
    file.write(json.dumps(data, indent=2))
If the data contains Chinese characters, dumps() writes them as Unicode escape sequences (\uXXXX) by default:
import json

data = [{
    'name': '王伟',
    'gender': '男',
    'birthday': '1992-10-18'
}]
with open('data.json', 'w') as file:
    file.write(json.dumps(data, indent=2))
To output the Chinese characters themselves, set the ensure_ascii parameter to False and also specify the encoding of the output file:
with open('data.json', 'w', encoding='utf-8') as file:
    file.write(json.dumps(data, indent=2, ensure_ascii=False))
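The effect of ensure_ascii can be compared directly, as in the sketch below. It also uses json.dump(), which is not mentioned in the text above; it is the standard library counterpart of dumps() that writes straight to a file object. The temporary directory is only there to keep the example self-contained.

```python
import json
import os
import tempfile

data = [{'name': '王伟', 'gender': '男', 'birthday': '1992-10-18'}]

# Default: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(data)
# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(data, ensure_ascii=False)
print(escaped)
print(readable)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.json')

    # json.dump() serializes and writes in one call.
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=2, ensure_ascii=False)

    # Both forms parse back to the same Python data.
    with open(path, encoding='utf-8') as file:
        restored = json.load(file)

print(restored == data)  # True
```

Either form is valid JSON; the escaped form is only harder for people to read.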
3. CSV file storage
CSV stands for Comma-Separated Values (sometimes called character-separated values). A CSV file stores tabular data in plain text. The file is a sequence of characters made up of any number of records, separated by a newline character. Each record consists of fields, and the fields are separated by another character or string, most commonly a comma or a tab. All records have exactly the same sequence of fields, so the file is equivalent to the plain-text form of a structured table, which often makes CSV a convenient format for saving data.
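To see what the records and fields look like as raw text, the following sketch writes two rows to an in-memory buffer with Python's csv module instead of a file. The note column and its value are made up to show what happens when a field itself contains a comma.

```python
import csv
import io

# Write to an in-memory text buffer so the raw CSV text can be inspected.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['id', 'name', 'note'])
writer.writerow(['10001', 'Mike', 'likes tea, coffee'])

raw_text = buffer.getvalue()
# The field that contains a comma is wrapped in double quotes.
print(repr(raw_text))

# Reading the text back restores the original fields.
rows = list(csv.reader(io.StringIO(raw_text)))
print(rows[1])  # ['10001', 'Mike', 'likes tea, coffee']
```

This quoting is the reason to prefer the csv module over joining fields with commas by hand.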
3.1 Write
import csv

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['id', 'name', 'age'])
    writer.writerow(['10001', 'Mike', 20])
    writer.writerow(['10002', 'Bob', 22])
    writer.writerow(['10003', 'Jordan', 21])
First, open the data.csv file with the open mode w (write) to get a file handle. Then call the writer() method of the csv library, passing in the handle, to initialize the writer object, and call its writerow() method once for each row of data. This completes the write.
To change the separator between columns, pass in the delimiter parameter, for example:
import csv

with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow(['id', 'name', 'age'])
    writer.writerow(['10001', 'Mike', 20])
    writer.writerow(['10002', 'Bob', 22])
    writer.writerow(['10003', 'Jordan', 21])
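A file written with a custom delimiter has to be read with the same delimiter. The sketch below assumes a tab separator and uses an in-memory buffer in place of a file, purely to keep the example self-contained.

```python
import csv
import io

buffer = io.StringIO()

# Write tab-separated rows.
writer = csv.writer(buffer, delimiter='\t')
writer.writerow(['id', 'name', 'age'])
writer.writerow(['10001', 'Mike', 20])

# Rewind and read back with the matching delimiter.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter='\t'))

print(rows)  # [['id', 'name', 'age'], ['10001', 'Mike', '20']]
```

Note that the number 20 comes back as the string '20': CSV stores only text, so any type conversion is up to the reader.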
We can also call the writerows() method to write multiple rows at once. In this case the argument must be a two-dimensional list, for example:
import csv

with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['id', 'name', 'age'])
    writer.writerows([['10001', 'Mike', 20], ['10002', 'Bob', 22], ['10003', 'Jordan', 21]])
The csv library also supports writing dictionaries, for example:
import csv

with open('data.csv', 'w', newline='') as csvfile:
    fieldnames = ['id', 'name', 'age']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'id': '10001', 'name': 'Mike', 'age': 20})
    writer.writerow({'id': '10002', 'name': 'Bob', 'age': 22})
    writer.writerow({'id': '10003', 'name': 'Jordan', 'age': 21})
If you want to write Chinese content, you may run into character encoding problems. In that case, specify the encoding with the encoding parameter of open():
import csv

with open('data.csv', 'a', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['id', 'name', 'age']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'id': '10005', 'name': '王伟', 'age': 22})
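When appending with mode a, the header row should be written only once. One way to handle this, sketched below, is to check whether the file is new or empty before calling writeheader(). The append_row() helper and the FIELDNAMES constant are hypothetical names introduced for this example.

```python
import csv
import os
import tempfile

FIELDNAMES = ['id', 'name', 'age']


def append_row(path, row):
    """Append one dict row, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        if needs_header:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.csv')
    append_row(path, {'id': '10001', 'name': 'Mike', 'age': 20})
    append_row(path, {'id': '10005', 'name': '王伟', 'age': 22})

    # Read the file back to confirm the header appears only once.
    with open(path, newline='', encoding='utf-8') as csvfile:
        lines = csvfile.read().splitlines()

print(lines)  # ['id,name,age', '10001,Mike,20', '10005,王伟,22']
```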
3.2 Read
We can also use the csv library to read CSV files:
import csv

with open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)
['id', 'name', 'age']
['10001', 'Mike', '20']
['10002', 'Bob', '22']
['10003', 'Jordan', '21']
['10005', '王伟', '22']
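As the output shows, csv.reader returns every field as a string, including the header row. If dictionaries keyed by column name are more convenient, csv.DictReader can be used instead. The sketch below reads from an in-memory string (an assumption made to keep it self-contained) and converts the age column back to an integer.

```python
import csv
import io

# Sample CSV text standing in for the contents of data.csv.
csv_text = 'id,name,age\r\n10001,Mike,20\r\n10002,Bob,22\r\n'

# DictReader uses the first row as the keys of each record.
reader = csv.DictReader(io.StringIO(csv_text, newline=''))
records = [dict(row) for row in reader]

# Fields are read as strings, so convert types explicitly where needed.
for record in records:
    record['age'] = int(record['age'])

print(records[0])  # {'id': '10001', 'name': 'Mike', 'age': 20}
```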
If you use pandas, you can use the read_csv() method to read the data from the CSV, for example:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
      id    name  age
0  10001    Mike   20
1  10002     Bob   22
2  10003  Jordan   21
3  10005      王伟   22