Python Crawler 5.2 - Using the Pipeline Module of the Scrapy Framework

Overview

This series documents my learning of Python crawler techniques as a simple tutorial, both to explain and consolidate my own knowledge; if it happens to be useful to you as well, so much the better.
The Python version used is 3.7.4.

In the previous article we introduced the Scrapy framework and built a crawler project with it. In this article we look at Scrapy's pipeline in detail. The previous article already touched on a small pipeline optimization for storing data; here we mainly explain how to use pipelines to handle the received data when a project contains more than one crawler.

Pipeline core methods

We can write a custom Pipeline simply by implementing the required method. The only method that must be implemented is process_item(item, spider).

In addition, there are several other useful methods:

  • open_spider(spider)
  • close_spider(spider)
  • from_crawler(cls, crawler)

process_item(item, spider)

The process_item() method must be implemented; the Item Pipeline we define calls this method for every Item by default. Here we can, for example, process the data or write it to a database. The method must return a value of Item type (or a dict), or raise a DropItem exception.
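For example, here is a minimal sketch (not part of the original project) of a pipeline that drops items missing the 'data' field and passes everything else through unchanged:

from scrapy.exceptions import DropItem


class ValidateItemPipeline(object):
    def process_item(self, item, spider):
        # Drop the item if the 'data' field is missing
        if item.get('data') is None:
            raise DropItem('Missing data in %s' % item)
        # Otherwise hand the item on to the next pipeline
        return item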

open_spider(spider)

The open_spider() method is called automatically when the Spider is opened. Here we can perform some initialization, such as opening a database connection. The spider parameter is the Spider object that was opened.

close_spider(spider)

The close_spider() method is called automatically when the Spider is closed. Here we can do some cleanup, such as closing a database connection. The spider parameter is the Spider object being closed.

from_crawler(cls, crawler)

The from_crawler() method is a class method, marked with @classmethod; it is a form of dependency injection. Its parameter is crawler, and through the crawler object we can access all of Scrapy's core components, such as the global settings, and then create a Pipeline instance. The cls parameter is the Pipeline class itself, and the method finally returns an instance of that class.
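Putting these methods together, here is a minimal sketch of a pipeline that writes every item to a JSON Lines file. The JSONL_FILE setting read in from_crawler is a hypothetical setting name, not something defined elsewhere in this article:

import json


class JsonWriterPipeline(object):
    def __init__(self, file_name):
        self.file_name = file_name
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the global settings;
        # JSONL_FILE is a hypothetical key you would add to settings.py
        return cls(file_name=crawler.settings.get('JSONL_FILE', 'items.jl'))

    def open_spider(self, spider):
        # Called once when the spider opens: do initialization here
        self.file = open(self.file_name, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider closes: do cleanup here
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item: serialize, write, and return it
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item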

Using pipelines

As the dictionary form of ITEM_PIPELINES suggests, there can be more than one pipeline, and indeed we can define multiple pipelines.

Why might we need more than one pipeline?

  1. The content from one spider may need different operations, for example being stored in different databases.
  2. There may be multiple spiders, with different pipelines handling the items from each of them.

Notes:

  1. To use a pipeline, it must be configured in settings.py, as shown in the sketch below.
  2. The smaller a pipeline's weight value, the higher its priority (it runs earlier).
  3. The process_item method of a pipeline cannot be renamed.
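For example, a minimal sketch of the ITEM_PIPELINES configuration in settings.py, using the pipeline class names from this article's mySpider project; the smaller the number, the earlier the pipeline runs:

ITEM_PIPELINES = {
    # weight 300 < 301, so MyspiderPipeline processes each item before MyspiderPipeline1
    'mySpider.pipelines.MyspiderPipeline': 300,
    'mySpider.pipelines.MyspiderPipeline1': 301,
}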

One spider with multiple item types

Create a spider with the following command:

scrapy genspider qsbk_spider qiushibaike.com

Set up the required configuration first; that is not repeated here. Then write the following code in qsbk_spider.py:

import scrapy


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        # handle odd and even numbers differently
        for i in range(0, 50):
            item = {}  # create a fresh dict for each yielded item
            if (i % 2) == 0:
                # even number
                item['come_from'] = 'oushu'
                item['data'] = i
            else:
                # odd number
                item['come_from'] = 'jishu'
                item['data'] = i
            yield item

Then write the following code in pipelines.py:

class MyspiderPipeline(object):
    def process_item(self, item, spider):

        # Check which kind of item was passed in, then apply the corresponding logic
        if item['come_from'] == 'jishu':
            # code..
            print('%d is odd' % item['data'])
        else:
            # code ..
            print('%d is even' % item['data'])

        return item

Run it to see the result. Alternatively, we can define multiple pipeline classes, as follows:

class MyspiderPipeline(object):
    def process_item(self, item, spider):

        # Check which kind of item was passed in, then apply the corresponding logic
        if item['come_from'] == 'jishu':
            # code..
            print('%d is odd' % item['data'])

        return item


class MyspiderPipeline1(object):
    def process_item(self, item, spider):
        # Check which kind of item was passed in, then apply the corresponding logic
        if item['come_from'] == 'oushu':
            # code..
            print('%d is even' % item['data'])

        return item

The configuration in settings.py is then:

ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
    'mySpider.pipelines.MyspiderPipeline1': 301,
}

Run the crawler and you will see the same result.

The multiple-spider case

We can also handle multiple spiders in the same way as above: add an identifier field to the item data and then process it differently according to that identifier. Besides that, there is another way to tell the data apart. The specific approach is as follows:

I created three spiders with the following three commands:

# Crawl Qiushibaike's "text" section
scrapy genspider qsbk1_spider qiushibaike.com
# Crawl Qiushibaike's "pics" section
scrapy genspider qsbk2_spider qiushibaike.com
# Crawl Qiushibaike's "throwback" section
scrapy genspider qsbk3_spider qiushibaike.com

After running each of the three commands, you will find three new files in the spider folder: qsbk1_spider.py, qsbk2_spider.py, and qsbk3_spider.py. These are the module files of our three new crawlers (the project directory structure is not shown here). The scrapy list command shows the list of crawlers created in the project.
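Assuming the project contains the original qsbk_spider plus the three new crawlers, scrapy list should print something like:

scrapy list
# expected output (spider names are listed alphabetically):
# qsbk1_spider
# qsbk2_spider
# qsbk3_spider
# qsbk_spider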

Write the following code in the three files respectively:

  1. qsbk1_spider.py
    import scrapy
    
    
    class Qsbk1SpiderSpider(scrapy.Spider):
        name = 'qsbk1_spider'
        allowed_domains = ['qiushibaike.com']
        start_urls = ['https://www.qiushibaike.com/text/']
    
        def parse(self, response):
            item = {}
            # This is the data returned by qsbk1_spider
            for i in range(0, 10):
                item['come_from'] = 'qsbk1'
                item['data'] = i
                yield item
    
  2. qsbk2_spider.py
    import scrapy
    
    
    class Qsbk2SpiderSpider(scrapy.Spider):
        name = 'qsbk2_spider'
        allowed_domains = ['qiushibaike.com']
        start_urls = ['https://www.qiushibaike.com/pic/']
    
        def parse(self, response):
            item = {}
            # This is the data returned by qsbk2_spider
            for i in range(10, 20):
                item['come_from'] = 'qsbk2'
                item['data'] = i
                yield item
    
  3. qsbk3_spider.py
    import scrapy
    
    
    class Qsbk3SpiderSpider(scrapy.Spider):
        name = 'qsbk3_spider'
        allowed_domains = ['qiushibaike.com']
        start_urls = ['https://www.qiushibaike.com/history/']
    
        def parse(self, response):
            item = {}
            # This is the data returned by qsbk3_spider
            for i in range(20, 30):
                item['come_from'] = 'qsbk3'
                item['data'] = i
                yield item
    

In the end, the data crawled by all three crawlers goes into the same pipeline, where we need to determine which crawler the data came from. The code in pipelines.py is as follows:

class MyspiderPipeline(object):
    def process_item(self, item, spider):

        # Here we can branch based on which spider produced the item
        if spider.name == 'qsbk1_spider':
            print("Data from qsbk1:", item)
        elif spider.name == 'qsbk2_spider':
            print("Data from qsbk2:", item)
        elif spider.name == 'qsbk3_spider':
            print("Data from qsbk3:", item)
        else:
            print('unknown data')
        return item

Run each crawler and you will see the corresponding output.
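As an alternative to branching on spider.name inside a single pipeline, each spider can also pick its own pipelines through Scrapy's custom_settings attribute. A minimal sketch, where Qsbk1Pipeline is a hypothetical pipeline class that would live in mySpider/pipelines.py:

import scrapy


class Qsbk1SpiderSpider(scrapy.Spider):
    name = 'qsbk1_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    # custom_settings overrides the project settings for this spider only,
    # so only Qsbk1Pipeline (hypothetical) handles the items it yields
    custom_settings = {
        'ITEM_PIPELINES': {'mySpider.pipelines.Qsbk1Pipeline': 300},
    }

    def parse(self, response):
        for i in range(0, 10):
            yield {'come_from': 'qsbk1', 'data': i}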

Distinguishing data with multiple Item classes

Write the following code in items.py:

import scrapy


class Qsbk1Item(scrapy.Item):
    """
    Items class for the qsbk1 crawler
    """
    num = scrapy.Field()


class Qsbk2Item(scrapy.Item):
    """
    Items class for the qsbk2 crawler
    """
    num = scrapy.Field()


class Qsbk3Item(scrapy.Item):
    """
    Items class for the qsbk3 crawler
    """
    num = scrapy.Field()

Write the following code in qsbk1_spider.py (the other two crawlers are similar):

import scrapy
# Import the corresponding items class
from mySpider.items import Qsbk1Item


class Qsbk1SpiderSpider(scrapy.Spider):
    name = 'qsbk1_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        for i in range(0, 10):
            item = Qsbk1Item(num=i)
            yield item

Write the following code in pipelines.py:

# Import the corresponding items classes
from mySpider.items import Qsbk1Item
from mySpider.items import Qsbk2Item
from mySpider.items import Qsbk3Item


class MyspiderPipeline(object):
    def process_item(self, item, spider):

        # Here we can branch based on the item's class
        if isinstance(item, Qsbk1Item):
            print("Data from qsbk1:", item)
        elif isinstance(item, Qsbk2Item):
            print("Data from qsbk2:", item)
        elif isinstance(item, Qsbk3Item):
            print("Data from qsbk3:", item)
        else:
            print('unknown data')
        return item

Run the crawlers to see the result.
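Building on the isinstance checks, here is a further sketch (not part of the original project) that writes each item type to its own JSON Lines file instead of just printing it. Like the other pipelines, it would still need to be registered in ITEM_PIPELINES to take effect:

import json

from mySpider.items import Qsbk1Item, Qsbk2Item, Qsbk3Item


class SplitByItemTypePipeline(object):
    def open_spider(self, spider):
        # one output file per item class (the file names are assumptions)
        self.files = {
            Qsbk1Item: open('qsbk1.jl', 'w', encoding='utf-8'),
            Qsbk2Item: open('qsbk2.jl', 'w', encoding='utf-8'),
            Qsbk3Item: open('qsbk3.jl', 'w', encoding='utf-8'),
        }

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # pick the file matching the item's class; ignore unknown types
        f = self.files.get(type(item))
        if f is not None:
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item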

Original post: blog.csdn.net/Zhihua_W/article/details/103615741