Save to Redis database
Saving to a database follows the same pattern as saving to a file: during initialization, the file-open operation becomes a database-connection operation, and each write to the file becomes a write to the database. Take the Redis database as an example:
    import json

    import redis


    # Pipeline that saves items to Redis
    class RedisPipeline(object):
        def __init__(self):
            # Initialize the connection
            self.redis_cli = redis.StrictRedis(
                host='127.0.0.1',
                port=6379,
                db=1,
            )

        def process_item(self, item, spider):
            # Save to Redis
            self.redis_cli.lpush('quotes', json.dumps(dict(item)))
            return item

        def close_spider(self, spider):
            self.redis_cli.close()
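For Scrapy to call a pipeline at all, it must be enabled in the project's settings.py. A minimal sketch, assuming a hypothetical project named tutorial:

```python
# settings.py (sketch) -- pipelines must be registered here before Scrapy
# will use them; the integer (0-1000) sets the order in which enabled
# pipelines process each item. "tutorial" is a hypothetical project name.
ITEM_PIPELINES = {
    'tutorial.pipelines.RedisPipeline': 300,
}
```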
Save to MySQL database
    import pymysql


    # Pipeline that saves items to MySQL
    class MySQLPipeline(object):
        """
        create database quotes charset=utf8;
        use quotes;
        create table quotes (txt text, author char(20), tags char(200));
        """
        def __init__(self):
            self.connect = pymysql.connect(
                host='192.168.159.128',
                port=3306,
                db='quotes',  # database name
                user='windows',
                passwd='123456',
                charset='utf8',
                use_unicode=True
            )
            # Create a cursor for operating on the data
            self.cursor = self.connect.cursor()

        def process_item(self, item, spider):
            # Save to MySQL: execute the SQL statement
            self.cursor.execute(
                'insert into quotes (txt, author, tags) values (%s, %s, %s)',
                (item['text'], item['author'], item['tags'])
            )
            # Commit the transaction
            self.connect.commit()
            return item

        def close_spider(self, spider):
            # Close the cursor and the connection
            self.cursor.close()
            self.connect.close()
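Because pymysql follows the Python DB-API, the parameterized-insert pattern above can be tried without a running MySQL server. A sketch using the stdlib sqlite3 module (which uses '?' placeholders where pymysql uses '%s'); the sample item values are made up for illustration:

```python
import sqlite3

# Same DB-API pattern as the pipeline above, using an in-memory SQLite
# database instead of a MySQL server.
connect = sqlite3.connect(':memory:')
cursor = connect.cursor()
cursor.execute('create table quotes (txt text, author char(20), tags char(200))')

item = {'text': 'The truth is rarely pure.', 'author': 'Oscar Wilde', 'tags': 'truth'}
cursor.execute(
    'insert into quotes (txt, author, tags) values (?, ?, ?)',
    (item['text'], item['author'], item['tags']),
)
connect.commit()

# Read the row back to confirm the insert worked
cursor.execute('select txt, author, tags from quotes')
rows = cursor.fetchall()
connect.close()
```

Passing the values as a separate tuple (rather than formatting them into the SQL string) lets the driver escape them, which avoids SQL injection.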
Store data in MongoDB
Sometimes we want to store the crawled data in a database, and this can be done by implementing an Item Pipeline. The following Item Pipeline stores data in a MongoDB database. The code is as follows:
    import pymongo
    from scrapy.item import Item


    class MongoDBPipeline(object):
        DB_URI = 'mongodb://localhost:27017/'
        DB_NAME = 'scrapy_data'

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.DB_URI)
            self.db = self.client[self.DB_NAME]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # One collection per spider; convert Item objects to plain dicts
            collection = self.db[spider.name]
            post = dict(item) if isinstance(item, Item) else item
            collection.insert_one(post)
            return item
The above code is explained as follows.
Two constants are defined as class attributes:
- DB_URI: the URI address of the database.
- DB_NAME: the name of the database.
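The DB_URI and DB_NAME constants are hard-coded class attributes. Scrapy also supports a from_crawler() hook that lets such values be overridden from settings.py. A minimal sketch, assuming hypothetical setting names MONGO_DB_URI and MONGO_DB_NAME:

```python
# Sketch: overriding the class-attribute defaults from project settings.
class MongoDBPipeline(object):
    DB_URI = 'mongodb://localhost:27017/'
    DB_NAME = 'scrapy_data'

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when constructing the pipeline, passing
        # the crawler, whose settings come from settings.py.
        pipeline = cls()
        pipeline.DB_URI = crawler.settings.get('MONGO_DB_URI', cls.DB_URI)
        pipeline.DB_NAME = crawler.settings.get('MONGO_DB_NAME', cls.DB_NAME)
        return pipeline


# Minimal stand-in for a crawler object, just to demonstrate the hook:
class _FakeCrawler:
    class settings:
        _values = {'MONGO_DB_URI': 'mongodb://db.example:27017/'}

        @classmethod
        def get(cls, key, default=None):
            return cls._values.get(key, default)


pipeline = MongoDBPipeline.from_crawler(_FakeCrawler())
```

With this hook, the database location can be changed per project without editing the pipeline code itself.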