Python Scrapy framework tutorial (4): saving to a database

Save to Redis database

Saving to a database follows the same pattern as saving to a file: during initialization, opening the file is replaced by connecting to the database, and when an item is processed, writing to the file is replaced by writing to the database. Take Redis as an example:

import json

import redis


# Pipeline that saves items to Redis
class RedisPipeline(object):
    def __init__(self):
        # Initialize the Redis connection
        self.redis_cli = redis.StrictRedis(
            host='127.0.0.1',
            port=6379,
            db=1,
        )

    def process_item(self, item, spider):
        # Push the item onto a Redis list as a JSON string
        self.redis_cli.lpush('quotes', json.dumps(dict(item)))
        return item

    def close_spider(self, spider):
        # Release the connection when the spider closes
        self.redis_cli.close()
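For Scrapy to actually call the pipeline, it has to be enabled in the project's settings.py through the ITEM_PIPELINES setting. A minimal sketch, assuming the project is named tutorial and the classes in this article live in tutorial/pipelines.py (the module path and the priority value 300 are placeholders to adapt):

# settings.py -- enable the Redis pipeline
# 'tutorial.pipelines' is an assumed module path; lower numbers run earlier
ITEM_PIPELINES = {
    'tutorial.pipelines.RedisPipeline': 300,
}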

 


Save to MySQL database

import pymysql


# Pipeline that saves items to MySQL
class MySQLPipeline(object):
    """
    create database quotes charset=utf8;
    use quotes;
    create table quotes (txt text, author char(20), tags char(200));
    """

    def __init__(self):
        # Connect to the MySQL server
        self.connect = pymysql.connect(
            host='192.168.159.128',
            port=3306,
            db='quotes',        # database name
            user='windows',
            passwd='123456',
            charset='utf8',
            use_unicode=True
        )
        # Create a cursor for executing statements
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Save to MySQL: execute the INSERT statement
        self.cursor.execute(
            'insert into quotes (txt, author, tags) values (%s, %s, %s)',
            (item['text'], item['author'], item['tags'])
        )
        # Commit the transaction
        self.connect.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor and the connection
        self.cursor.close()
        self.connect.close()
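The pipeline above reads item['text'], item['author'] and item['tags'], so the item must define those fields. A minimal sketch of such an item, assuming a class name QuoteItem that is not taken from the original project:

import scrapy


class QuoteItem(scrapy.Item):
    # Fields read by MySQLPipeline; they map to the txt, author and tags columns
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()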

 

Store data in MongoDB

Sometimes we want to store the crawled data in a database; this can be done by implementing an Item Pipeline. The following Item Pipeline stores data in a MongoDB database.

In the code below, two constants are defined as class attributes:

  • DB_URI: the URI of the database.
  • DB_NAME: the name of the database.

import pymongo
from scrapy.item import Item


# Pipeline that saves items to MongoDB
class MongoDBPipeline(object):

    DB_URI = 'mongodb://localhost:27017/'
    DB_NAME = 'scrapy_data'

    def open_spider(self, spider):
        # Connect to MongoDB and select the database
        self.client = pymongo.MongoClient(self.DB_URI)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Use the spider's name as the collection name
        collection = self.db[spider.name]
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)
        return item
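After a crawl finishes, the stored documents can be checked directly with pymongo. A small verification sketch, using the DB_URI and DB_NAME defined above and assuming the spider is named quotes (so the collection is also named quotes):

import pymongo

# Same URI and database name as MongoDBPipeline above
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['scrapy_data']

# 'quotes' is an assumed spider name, hence the collection name
print(db['quotes'].count_documents({}))  # number of stored items
print(db['quotes'].find_one())           # inspect one document
client.close()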
