Scrapy: using and running multiple spiders, items, and pipelines in one project

Create a single Scrapy project that contains multiple spiders, and give each spider its own items and pipelines. With one startup script, all the spiders can then be launched at the same time.

The code for this article has been uploaded to GitHub; the link is at the end of the text.

First, create a Scrapy project with multiple spiders

scrapy startproject mymultispider
cd mymultispider
scrapy genspider myspd1 sina.com.cn
scrapy genspider myspd2 sina.com.cn
scrapy genspider myspd3 sina.com.cn

Second, how to run the spiders

1. To make the output easy to follow, print an identifying message in each spider

import scrapy
class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    def parse(self, response):
        # Print the spider name so its output is easy to recognize
        print('myspd1')

The other spiders, myspd2 and myspd3, print their own names in the same way.

2. There are two ways to run multiple spiders. The first is simpler: create a crawl.py file in the project directory with the following content

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'myspd1' is the spider name
process.crawl('myspd1')
process.crawl('myspd2')
process.crawl('myspd3')

process.start()

To keep the output readable, set the log level in settings.py so that only errors are logged

LOG_LEVEL = 'ERROR'

Run this file (for example, right-click it in the IDE and choose Run), and each spider prints its name to the console.

3. The second way is to adapt the source code of the crawl command, which can be found in the official GitHub repository: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py

Create a directory named mycmd at the same level as the spiders directory, and create a mycrawl.py file in it. Copy the crawl source code into this file and change the run method to the following

def run(self, args, opts):
    # Get the list of spider names in the project
    spd_loader_list = self.crawler_process.spider_loader.list()
    # Schedule each spider
    for spname in spd_loader_list or args:
        self.crawler_process.crawl(spname, **opts.spargs)
        print("Starting spider: " + spname)
    # Start the reactor; this blocks until all spiders finish
    self.crawler_process.start()
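
One thing to note about the loop above: the expression `spd_loader_list or args` evaluates to the full spider list whenever the project contains at least one spider, so any names passed on the command line are ignored. A possible variant, sketched here as a standalone helper, runs only the requested spiders when names are given and falls back to all spiders otherwise. The function name `choose_spiders` and the check for unknown names are assumptions made for illustration; they are not part of Scrapy or of the original code.

```python
def choose_spiders(available, requested):
    """Return the spider names to run.

    available: names reported by the spider loader.
    requested: names passed on the command line (may be empty).
    """
    if not requested:
        # No names given: run every spider in the project.
        return list(available)
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError("Unknown spider(s): " + ", ".join(unknown))
    # Keep the order the user asked for.
    return list(requested)


if __name__ == "__main__":
    all_spiders = ["myspd1", "myspd2", "myspd3"]
    print(choose_spiders(all_spiders, []))
    print(choose_spiders(all_spiders, ["myspd2"]))
```

Inside run, the loop could then iterate over `choose_spiders(spd_loader_list, args)` instead of `spd_loader_list or args`.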

Create an __init__.py file in the same directory as mycrawl.py so that the directory is treated as a Python package
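
For Scrapy to discover a custom command, the package that contains it generally has to be registered through the COMMANDS_MODULE setting. The text does not show this step, so the module path below is an assumption based on the layout described above, with mycmd sitting next to spiders inside the mymultispider package.

```python
# settings.py
# Tell Scrapy which package to search for custom commands.
# Assumes the directory is mymultispider/mycmd/, next to spiders/.
COMMANDS_MODULE = 'mymultispider.mycmd'
```

The command name is taken from the module's file name, which is why mycrawl.py is invoked as `scrapy mycrawl`.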

When this is done, the mycmd directory sits next to the spiders directory and contains __init__.py and mycrawl.py.

Start the spiders with this command

scrapy mycrawl --nolog

The command prints a startup line for each spider as it is scheduled, followed by each spider's own print output.

Third, assign items to each spider

1. This is relatively simple: create a corresponding class in the items.py file and import it in the spider

items.py

import scrapy


class MymultispiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class Myspd1spiderItem(scrapy.Item):
    name = scrapy.Field()

class Myspd2spiderItem(scrapy.Item):
    name = scrapy.Field()

class Myspd3spiderItem(scrapy.Item):
    name = scrapy.Field()

In the spider, for example myspd1

# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem

class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'pipelines of myspd1'
        yield item

Fourth, assign pipelines to each spider

1. There are also two ways to do this. Method one: define multiple pipeline classes

In pipelines.py:

class Myspd1spiderPipeline(object):
    def process_item(self,item,spider):
        print(item['name'])
        return item

class Myspd2spiderPipeline(object):
    def process_item(self,item,spider):
        print(item['name'])
        return item

class Myspd3spiderPipeline(object):
    def process_item(self,item,spider):
        print(item['name'])
        return item

1.1 Enable the pipelines in settings.py

ITEM_PIPELINES = {
   # 'mymultispider.pipelines.MymultispiderPipeline': 300,
   'mymultispider.pipelines.Myspd1spiderPipeline': 300,
   'mymultispider.pipelines.Myspd2spiderPipeline': 300,
   'mymultispider.pipelines.Myspd3spiderPipeline': 300,
}

1.2 Set the pipeline in the spider, for example myspd1

# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem

class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']
    custom_settings = {
        'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
    }

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'pipelines of myspd1'
        yield item

The code that assigns the pipeline is this part

custom_settings = {
    'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
}

1.3 Run crawl.py. Each spider's items pass only through the pipeline named in its custom_settings.

2. Method two: check in pipelines.py which spider produced the item

2.1 In pipelines.py

class MymultispiderPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'myspd1':
            print('pipelines of myspd1')
        elif spider.name == 'myspd2':
            print('pipelines of myspd2')
        elif spider.name == 'myspd3':
            print('pipelines of myspd3')
        return item
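
As the number of spiders grows, the if/elif chain can be replaced by a lookup table keyed on spider.name. The sketch below shows one way to do that. The handler method names and the `handle_default` fallback are hypothetical, and a stand-in object with a name attribute is used in place of a real spider so that the example runs without starting a crawl.

```python
from types import SimpleNamespace


class MymultispiderPipeline:
    """Route each item to a handler chosen by the spider's name."""

    def __init__(self):
        # Map spider names to handler methods (hypothetical names).
        self.handlers = {
            'myspd1': self.handle_myspd1,
            'myspd2': self.handle_myspd2,
        }

    def process_item(self, item, spider):
        # Fall back to a default handler for spiders not in the table.
        handler = self.handlers.get(spider.name, self.handle_default)
        print(handler(item))
        # Return the item so that later pipelines still receive it.
        return item

    def handle_myspd1(self, item):
        return 'myspd1 pipeline got: ' + item['name']

    def handle_myspd2(self, item):
        return 'myspd2 pipeline got: ' + item['name']

    def handle_default(self, item):
        return 'default pipeline got: ' + item['name']


if __name__ == "__main__":
    pipeline = MymultispiderPipeline()
    fake_spider = SimpleNamespace(name='myspd2')  # stand-in for a spider
    pipeline.process_item({'name': 'example'}, fake_spider)
```

Adding a spider then means adding one entry to the table rather than another branch.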

2.2 In settings.py, enable only the MymultispiderPipeline pipeline

ITEM_PIPELINES = {
   'mymultispider.pipelines.MymultispiderPipeline': 300,
   # 'mymultispider.pipelines.Myspd1spiderPipeline': 300,
   # 'mymultispider.pipelines.Myspd2spiderPipeline': 300,
   # 'mymultispider.pipelines.Myspd3spiderPipeline': 300,
}

2.3 In the spider, comment out the code that assigns a pipeline

# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem

class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']
    # custom_settings = {
    #     'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
    # }

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'pipelines of myspd1'
        yield item

2.4 Run crawl.py. The shared pipeline prints a different message depending on which spider produced the item.

Code repository: https://github.com/terroristhouse/crawler

 

Python tutorial series:

Link: https://pan.baidu.com/s/10eUCb1tD9GPuua5h_ERjHA

Extraction code: h0td

 

 


Origin www.cnblogs.com/nmsghgnv/p/12369656.html