Where is the data crawled by Python stored? How a Python crawler saves its data

Hello everyone! This post answers the following questions: in what kind of file can Python save crawled data, and how do you write it there? Let's take a look!

The data parsed from crawler responses needs to be saved before any further processing can take place. In general, there are the following ways to save data:

  • Files: txt, csv, Excel, json, etc., for saving small amounts of data.

  • Relational databases: MySQL, Oracle, etc., for saving large amounts of structured data (Python needs the corresponding driver module installed to connect to them).

  • Non-relational databases: MongoDB, Redis, etc., which store data as key-value pairs and handle large amounts of data.

  • Binary files: for saving crawled images, videos, audio and other binary data (see the sketch after this list).
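
For the binary-file case, here is a minimal sketch of saving an image with requests (the image URL below is a hypothetical placeholder, not one from this article):

import requests

img_url = 'https://example.com/cover.jpg' # hypothetical placeholder URL

resp = requests.get(img_url, timeout=10)
resp.raise_for_status() # stop early on HTTP errors

with open('cover.jpg', 'wb') as f: # 'wb' opens the file in binary write mode
    f.write(resp.content) # resp.content holds the raw bytes of the image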

First, crawl three pages of short reviews of "The Ordinary World" from Douban Books, and then save them to a file.

https://book.douban.com/subject/1200840/comments/

The specific code is as follows (exception handling is omitted):

import requests
from bs4 import BeautifulSoup

urls=['https://book.douban.com/subject/1200840/comments/?start={}&limit=20&status=P&sort=new_score'.format(str(i)) for i in range(0, 60, 20)] # based on the pagination pattern of the URL, build the 3 page links and save them in the urls list
print(urls)
dic_h = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
comments_list = [] # list that will hold the short reviews

for url in urls: # fetch each page in turn and add its reviews to comments_list
    r = requests.get(url=url,headers = dic_h).text

    soup = BeautifulSoup(r, 'lxml')
    ul = soup.find('div',id="comments")
    lis= ul.find_all('p')

    list2 =[]
    for li in lis:
        list2.append(li.find('span').string)
    # print(list2)
    comments_list.extend(list2)

print(comments_list)

Running the code above crawls the review data and saves it to the comments_list list.

Use the open() method to write to a file

Save data to txt

Save the above crawled list data to a txt file:

with open('comments.txt', 'w', encoding='utf-8') as f: # use with open() to create the file object f
    # write the items of the list to the text file one by one
    for i in comments_list:
        f.write(i+"\n") # write one review per line
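
To check the result, the file can be read straight back (a small sketch; comments.txt is the file written above):

with open('comments.txt', 'r', encoding='utf-8') as f:
    saved = [line.rstrip('\n') for line in f] # one review per line
print(len(saved), 'reviews read back')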

Save data to csv

CSV (Comma-Separated Values, also called character-separated values) is a plain-text format for storing tabular data. To save a csv file, use Python's built-in csv module.

Write list or tuple data: Create a writer object, use writerow() to write one row of data, and use the writerows() method to write multiple rows of data.

Use the writer object to write list data. The sample code is as follows:

import csv

headers = ['No','name','age']
values = [
    ['01','zhangsan',18],
    ['02','lisi',19],
    ['03','wangwu',20]
]
with open('test1.csv','w',newline='') as fp:
    # create the writer object
    writer = csv.writer(fp)
    # write the data
    writer.writerow(headers) # write the header row
    writer.writerows(values) # write the data rows
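
The matching read side uses csv.reader, and note that every value comes back as a string (a minimal sketch reading the test1.csv file written above):

import csv

with open('test1.csv', newline='') as fp:
    reader = csv.reader(fp) # each row is returned as a list of strings
    for row in reader:
        print(row) # e.g. ['No', 'name', 'age'] for the header row

Passing newline='' when opening the file is what prevents blank lines from appearing between rows on Windows.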

Write dictionary data: Create a DictWriter object, use writerow() to write one row of data, and use the writerows() method to write multiple rows of data.

Use the DictWriter object to write dictionary data. The sample code is as follows:

import csv

headers = ['No','name','age']
values = [
    {"No":'01',"name":'zhangsan',"age":18},
    {"No":'02',"name":'lisi',"age":19},
    {"No":'03',"name":'wangwu',"age":20}]
with open('test.csv','w',newline='') as fp:
    dic_writer = csv.DictWriter(fp,headers)
    dic_writer.writeheader() # write the header row
    dic_writer.writerows(values) # write the data rows
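
The counterpart for reading is csv.DictReader, which yields one dictionary per row, keyed by the header (a minimal sketch against the test.csv file written above):

import csv

with open('test.csv', newline='') as fp:
    for row in csv.DictReader(fp):
        print(row['name'], row['age']) # values are read back as strings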

Save the data crawled above into a csv file:

import requests
import csv
from bs4 import BeautifulSoup
urls=['https://book.douban.com/subject/1200840/comments/?start={}&limit=20&status=P&sort=new_score'.format(str(i)) for i in range(0, 60, 20)] # based on the pagination pattern of the URL, build the 3 page links and save them in the urls list
print(urls)
dic_h = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
comments_list = [] # list that will hold the short reviews

for url in urls: # fetch each page in turn and add its reviews to comments_list
    r = requests.get(url=url,headers = dic_h).text

    soup = BeautifulSoup(r, 'lxml')
    ul = soup.find('div',id="comments")
    lis= ul.find_all('p')

    list2 =[]
    for li in lis:
        list2.append(li.find('span').string)
    # print(list2)
    comments_list.extend(list2)

new_list = [[x] for x in comments_list] # list comprehension: wrap each review in its own sub-list so it becomes one csv row

with open("com11.csv", mode="w", newline="", encoding="utf-8") as f:
    csv_file = csv.writer(f) # create the CSV writer object
    for i in new_list:
        csv_file.writerow(i)

Save data using pandas

pandas supports reading and writing many file formats; csv and Excel operations are the most common. Because files are read directly into DataFrame form, pandas is widely used in crawling and data analysis.

Generally, the crawled data is stored as a DataFrame object (a DataFrame is a table-like, two-dimensional structure in which each row represents a record and each column represents a variable).

Save data to excel and csv with pandas

Saving to Excel or csv with pandas is very simple; the core takes just two lines of code:

import pandas as pd # pandas must be installed (and an engine such as openpyxl for .xlsx output)

df = pd.DataFrame(comments_list) # convert the comments_list list into a pandas DataFrame
df.to_excel('comments.xlsx') # save to an Excel workbook
# df.to_csv('comments.csv') # or save to a csv file
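
Reading the data back is just as short (a small sketch; the filename comes from the snippet above, and reading .xlsx files also needs the openpyxl engine installed):

import pandas as pd

df = pd.read_excel('comments.xlsx', index_col=0) # index_col=0 reuses the saved row index
print(df.head()) # preview the first five rows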

