Data storage for a Python crawler
How do you store data after crawling it? This article walks through the common ways of saving scraped data to txt files, csv files, and a database.
1. Result output
The results here are simply printed to the console. How can we persist them?
for title, actor, time, score, count, comment in zip(titles, actors, times, scores, counts, comments):
    actor = actor.strip()
    time = time.strip().split()[0]
    print(title, actor, time, score, count, comment)
2. Data storage
mode | description |
---|---|
w | Open a file for writing only. If the file already exists, it is truncated: the original content is deleted and writing starts from the beginning. If the file does not exist, a new file is created. |
wb | Open a file in binary format for writing only. If the file already exists, it is truncated. If the file does not exist, a new file is created. |
w+ | Open a file for reading and writing. If the file already exists, it is truncated. If the file does not exist, a new file is created. |
wb+ | Open a file in binary format for reading and writing. If the file already exists, it is truncated. If the file does not exist, a new file is created. |
a | Open a file for appending. If the file already exists, the file pointer is placed at the end, so new content is written after the existing content. If the file does not exist, a new file is created for writing. |
ab | Open a file in binary format for appending. If the file already exists, the file pointer is placed at the end, so new content is written after the existing content. If the file does not exist, a new file is created for writing. |
a+ | Open a file for reading and appending. If the file already exists, the file pointer is placed at the end and writes are appended. If the file does not exist, a new file is created for reading and writing. |
ab+ | Open a file in binary format for reading and appending. If the file already exists, the file pointer is placed at the end. If the file does not exist, a new file is created for reading and writing. |
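The practical difference between the `w` and `a` families is easy to see in a few lines. The snippet below (using a throwaway file in a temp directory) shows that reopening with `w` wipes earlier content, while `a` keeps it:

```python
import os
import tempfile

# Write into a temporary directory so no real files are touched
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' truncates: the second open wipes what the first one wrote
with open(path, 'w') as f:
    f.write('first\n')
with open(path, 'w') as f:
    f.write('second\n')

# 'a' appends: new content lands after the existing content
with open(path, 'a') as f:
    f.write('third\n')

with open(path) as f:
    print(f.read())  # prints "second" then "third"; "first" is gone
```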
2.1 Store data to txt
with open('text.txt', 'w') as f:
    # write a single line
    f.write('first line\n')
    # write multiple lines
    f.writelines(['second line\n', 'third line\n'])
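Putting this together with the scraped fields from section 1, one simple sketch is to join each record's fields with a tab and write one record per line. The `rows` list below is a hypothetical stand-in for the `titles`/`actors`/... lists used earlier:

```python
# Hypothetical scraped rows; stand-ins for the zipped lists in section 1
rows = [
    ('The Shawshank Redemption', '1994', '9.7'),
    ('Farewell My Concubine', '1993', '9.6'),
]

with open('movies.txt', 'w', encoding='utf-8') as f:
    for row in rows:
        # one tab-separated record per line
        f.write('\t'.join(row) + '\n')
```

Tab-separated lines are easy to read back later with `line.split('\t')`, though the csv module in the next section handles quoting and delimiters more robustly.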
2.2 Store data to csv
import csv

with open('bilibili.csv', 'w', encoding='utf-8', newline='') as f:
    # create a writer object
    writer = csv.writer(f)
    # write a single row
    writer.writerow([])
    # write multiple rows
    writer.writerows([(), (), ()])
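A typical pattern is to write a header row once with `writerow`, then dump all data rows with `writerows`. The rows below are hypothetical examples matching the columns used later in the article; note that `newline=''` stops the csv module from emitting blank lines on Windows:

```python
import csv

# Hypothetical rows matching the columns used in the database section below
rows = [
    ('The Shawshank Redemption', '9.7', '2,500,000', 'Hope is a good thing'),
    ('Farewell My Concubine', '9.6', '1,200,000', 'A masterpiece'),
]

with open('movies.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    # header row first, then all data rows in one call
    writer.writerow(['title', 'score', 'count', 'comment'])
    writer.writerows(rows)
```

The writer automatically quotes fields that contain the delimiter (such as the comma-grouped counts above), so they round-trip cleanly through `csv.reader`.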
2.3 Storing data in a database
For more complex data, we can store it in a database. Taking MySQL as an example, first install the driver:
pip install pymysql
2.3.1 Database connection
import pymysql

# general form: pymysql.connect(host=..., user=..., password=..., database=...)
# (keyword arguments are required in pymysql >= 1.0)

# connect to the MySQL server
db = pymysql.connect(host='localhost', user='root', password='123456')
# connect to a specific database on the server
db = pymysql.connect(host='localhost', user='root', password='123456', database='database')
2.3.2 Creating a database and a table
# first connect to the server (no database selected yet)
db = pymysql.connect(host='localhost', user='root', password='123456')
# create a cursor
cursor = db.cursor()
# create the database
cursor.execute("CREATE DATABASE doubanTop250")
# switch to the new database before creating tables in it
cursor.execute("USE doubanTop250")
# create the table (requires an existing, selected database)
cursor.execute('''CREATE TABLE movie
    (`id` INT AUTO_INCREMENT PRIMARY KEY,
    `title` VARCHAR(100) NOT NULL,
    `actor` VARCHAR(100) NOT NULL,
    `release_time` VARCHAR(100) NOT NULL,
    `score` VARCHAR(100) NOT NULL,
    `count` VARCHAR(100) NOT NULL,
    `comment` VARCHAR(100) NOT NULL)
    DEFAULT CHARSET=utf8;''')
2.3.3 Insert data
import pymysql

db = pymysql.connect(host='localhost', user='root', password='123456',
                     database='doubanTop250', charset='utf8')
cursor = db.cursor()
# data is a sequence of six values matching the %s placeholders
data = ('title', 'actor', 'release_time', 'score', 'count', 'comment')
sql = '''insert into movie(title,actor,release_time,score,count,comment) values (%s,%s,%s,%s,%s,%s)'''
cursor.execute(sql, data)
# commit the transaction so the insert is persisted
db.commit()
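When inserting many scraped rows, `executemany()` is usually cleaner than calling `execute()` in a loop. pymysql implements the standard DB-API 2.0 interface, the same one the stdlib `sqlite3` module uses, so the batching pattern can be sketched with sqlite3 (no MySQL server needed); only the placeholder style differs (`?` in sqlite3 versus `%s` in pymysql):

```python
import sqlite3

# In-memory database as a stand-in for the MySQL connection above;
# with pymysql the cursor calls are identical apart from %s placeholders.
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('CREATE TABLE movie (title TEXT, score TEXT)')

rows = [('The Shawshank Redemption', '9.7'),
        ('Farewell My Concubine', '9.6')]

# executemany() runs the statement once per row in a single call
cursor.executemany('INSERT INTO movie (title, score) VALUES (?, ?)', rows)
db.commit()

cursor.execute('SELECT COUNT(*) FROM movie')
print(cursor.fetchone()[0])  # 2
```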
As a complete example, here is a spider for the Douban Top 250 movie list:
import time
import pymysql
import requests
from lxml import etree


class MovieSpider(object):
    def __init__(self):
        self.headers = {
            'user-agent': 'Mozilla/5.0'
        }
        self.url = 'https://movie.douban.com/top250?start={}&filter='
        self.db = pymysql.connect(host='localhost', user='root',
                                  password='123456', database='doubanTop250')
        self.cursor = self.db.cursor()

    def get_html(self, url):
        resp = requests.get(url, headers=self.headers)
        html = resp.text
        self.parse_html(html)

    def parse_html(self, html):
        xp_html = etree.HTML(html)
        titles = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')
        scores = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
        counts = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[4]/text()')
        comments = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/p[2]/span/text()')
        for title, score, count, comment in zip(titles, scores, counts, comments):
            data = [title, score, count, comment]
            sql = '''insert into movie(title,score,count,comment) values (%s,%s,%s,%s)'''
            self.cursor.execute(sql, data)

    def main(self):
        start_time = time.time()
        for i in range(0, 250, 25):
            url = self.url.format(i)
            self.get_html(url)
            self.db.commit()
        end_time = time.time()
        print('Total time:', end_time - start_time)


if __name__ == '__main__':
    spider = MovieSpider()
    spider.main()
Recommended reading:
- Use xpath to crawl data
- jupyter notebook use
- BeautifulSoup crawls the top 250 Douban movies
- An article takes you to master the requests module
- Python web crawler basics-BeautifulSoup
That's all for this article. If it helped you, feel free to like and follow; your likes mean a lot to me.