[Web Scraping] Scrapy Item Pipeline

[Original link] https://doc.scrapy.org/en/latest/topics/item-pipeline.html

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

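Each item pipeline component (sometimes referred to as just an “Item Pipeline”) is a Python class that implements a simple method. It receives an item, performs an action on it, and decides whether the item should continue through the pipeline or be dropped and no longer processed.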

Typical uses of item pipelines are:

cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database

Writing your own item pipeline

Each item pipeline component is a Python class that must implement the following method:

process_item(self, item, spider)

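This method is called for every item pipeline component. process_item() must do one of the following: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.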

Parameters:	item (`Item` object or a dict) – the item scraped
	spider (`Spider` object) – the spider which scraped the item
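
As a minimal sketch of the contract described above, the pipeline below returns an item that has every required field and raises DropItem otherwise. The class name RequiredFieldsPipeline and the field names are assumptions made for illustration; the fallback DropItem class is only there so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed (assumption).
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    """Hypothetical pipeline: pass items through only if required fields are set."""

    # Field names are illustrative assumptions, not part of Scrapy.
    required_fields = ("name", "price")

    def process_item(self, item, spider):
        missing = [f for f in self.required_fields if not item.get(f)]
        if missing:
            # Raising DropItem stops later pipeline components from seeing the item.
            raise DropItem("Missing fields %s in %r" % (missing, item))
        # Returning the item hands it on to the next pipeline component.
        return item
```

Calling process_item() with a complete item returns that same item; calling it with a field missing or empty raises DropItem.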

Additionally, they may also implement the following methods:

open_spider(self, spider)

This method is called when the spider is opened.

Parameters:	spider (`Spider` object) – the spider which was opened

close_spider(self, spider)

This method is called when the spider is closed.

Parameters:	spider (`Spider` object) – the spider which was closed
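
To make the open/close lifecycle concrete, here is a small sketch of a pipeline that sets up state in open_spider() and finalizes it in close_spider(). ItemCountPipeline and its summary attribute are hypothetical names, and the hooks are driven by hand with spider=None instead of by a real crawl.

```python
class ItemCountPipeline:
    """Hypothetical pipeline that counts items between open and close."""

    def open_spider(self, spider):
        # Called once when the spider is opened: set up per-run state.
        self.count = 0

    def process_item(self, item, spider):
        self.count += 1
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed: finalize per-run state.
        self.summary = "processed %d items" % self.count


# Drive the hooks by hand, in the order described above.
pipeline = ItemCountPipeline()
pipeline.open_spider(spider=None)
for item in ({"id": 1}, {"id": 2}):
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)
```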

from_crawler(cls, crawler)

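If present, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.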

Parameters:	crawler (`Crawler` object) – crawler that uses this pipeline
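
The following sketch shows one way from_crawler() can pass a setting into the constructor. ThresholdPipeline, the MIN_PRICE setting and the "cheap" field are invented for illustration, and the crawler is imitated by a SimpleNamespace whose settings attribute is a plain dict, so only the get(name, default) call is exercised.

```python
from types import SimpleNamespace


class ThresholdPipeline:
    """Hypothetical pipeline configured from a setting."""

    def __init__(self, min_price):
        self.min_price = min_price

    @classmethod
    def from_crawler(cls, crawler):
        # MIN_PRICE is an invented setting name used only for this sketch.
        return cls(min_price=crawler.settings.get("MIN_PRICE", 0))

    def process_item(self, item, spider):
        # Flag items priced below the configured threshold.
        item["cheap"] = item["price"] < self.min_price
        return item


# Stand-in for a Crawler: only the .settings attribute is imitated here,
# and a plain dict supplies the .get(name, default) call used above.
fake_crawler = SimpleNamespace(settings={"MIN_PRICE": 10})
pipeline = ThresholdPipeline.from_crawler(fake_crawler)
```

Keeping __init__() free of any crawler dependency means the class can also be constructed directly, for example in tests.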

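Item pipeline example

Price validation and dropping items with no prices

Let's take a look at the following pipeline, which adjusts the price attribute for those items that do not include VAT (price_excludes_vat attribute), and drops those items which don't contain a price: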

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        # Use .get() so an item without the key is dropped instead of raising KeyError.
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

Write items to a JSON file

The following pipeline stores all scraped items (from all spiders) in a single items.jl file, containing one item per line serialized in JSON format:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Note: The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
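
To illustrate the one-item-per-line layout of items.jl, here is a small sketch that writes a few items in that format and reads them back. The sample items are made up, and an in-memory io.StringIO buffer stands in for the file so the snippet leaves nothing on disk.

```python
import io
import json

# Made-up sample items for illustration.
items = [{"name": "apple", "price": 3}, {"name": "pear", "price": 4}]

# Write: one json.dumps() result per line, as in process_item() above.
buffer = io.StringIO()
for item in items:
    buffer.write(json.dumps(item) + "\n")

# Read back: each non-empty line parses on its own.
restored = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
```

Because every line is a complete JSON document, the file can be appended to and read one line at a time without loading it whole.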

Write items to MongoDB

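In this example we'll write items to MongoDB using pymongo. The MongoDB address and database name are specified in Scrapy settings; the MongoDB collection is named after the item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly: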

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

Take screenshot of item (omitted)

Duplicates filter (omitted)

Activating an Item Pipeline component

To activate an Item Pipeline component you must add its class to the ITEM_PIPELINES setting, like in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It’s customary to define these numbers in the 0-1000 range.
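
The ordering rule can be checked with a few lines of plain Python. This only illustrates the rule by sorting the mapping on its integer values; it is not a claim about how Scrapy implements the ordering internally.

```python
# Same class paths and numbers as the ITEM_PIPELINES example, listed out of order
# on purpose to show that the integer values decide, not the dict order.
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
}

# Sort class paths by their integer value, lowest first.
run_order = [path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
```

Here PricePipeline (300) comes before JsonWriterPipeline (800), so items dropped for a missing price never reach the JSON writer.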