A Scrapy project is created only once, but it can contain multiple spiders, with each spider assigned its own items and pipelines. With a small startup script, all of the spiders can be launched at the same time.
The code for this article has been uploaded to GitHub; the link is at the end of the post.
First, create a Scrapy project with multiple spiders
scrapy startproject mymultispider
cd mymultispider
scrapy genspider myspd1 sina.com.cn
scrapy genspider myspd2 sina.com.cn
scrapy genspider myspd3 sina.com.cn
Second, how to run the spiders
1. To make the output easy to observe, add a print statement to each spider.
import scrapy


class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    def parse(self, response):
        print('myspd1')
The other spiders, myspd2 and myspd3, print their own names in the same way.
2. There are two ways to run multiple spiders. The first is relatively simple: create a crawl.py file in the project root directory, as follows.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'myspd1' is the spider name
process.crawl('myspd1')
process.crawl('myspd2')
process.crawl('myspd3')

process.start()
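If you do not want to list the spider names by hand, the same CrawlerProcess can discover them through its spider loader. Below is a minimal variant sketch of the startup script (the file name crawl_all.py is hypothetical); it starts every spider registered in the project:

# crawl_all.py -- variant of the startup script above (hypothetical file name)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# spider_loader.list() returns the names of all spiders found in the project
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)

# blocks until every scheduled spider has finished
process.start()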
For easier observation, set the log level in settings.py:
LOG_LEVEL = 'ERROR'
Right-click and run this file; the output is as follows.
3. The second way to run the spiders is to modify the crawl command's source code, which can be found on the official GitHub: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py
Create a mycmd directory at the same level as the spiders directory, create a mycrawl.py file inside it, copy in the crawl command's source, and modify its run method as follows.
def run(self, args, opts):
    # get the list of spiders in the project
    spd_loader_list = self.crawler_process.spider_loader.list()
    # schedule each spider
    for spname in spd_loader_list or args:
        self.crawler_process.crawl(spname, **opts.spargs)
        print("started spider: " + spname)
    self.crawler_process.start()
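Only the run method is shown above. As a rough sketch, if you prefer not to copy the whole file, mycrawl.py could instead subclass the built-in crawl command and override just that method; this layout is an assumption for illustration, not the exact code from the repository:

# mycmd/mycrawl.py -- minimal sketch, reusing Scrapy's built-in crawl command
from scrapy.commands import crawl


class Command(crawl.Command):
    def short_desc(self):
        return "Run every spider in the project at once"

    def run(self, args, opts):
        # get the list of spiders in the project
        spd_loader_list = self.crawler_process.spider_loader.list()
        # schedule each spider (or only those passed on the command line)
        for spname in spd_loader_list or args:
            self.crawler_process.crawl(spname, **opts.spargs)
            print("started spider: " + spname)
        # start the reactor; returns when all spiders have finished
        self.crawler_process.start()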
Create an __init__.py initialization file in the same directory.
The directory structure after this step looks like the following.
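One detail to double-check: Scrapy looks for project-level custom commands in the package named by the COMMANDS_MODULE setting, so settings.py usually needs an entry pointing at the mycmd package. The exact path below assumes mycmd sits inside the mymultispider package as described above:

# settings.py -- register the package that holds the custom command
# (path assumes mycmd lives inside the mymultispider package)
COMMANDS_MODULE = 'mymultispider.mycmd'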
Start the crawlers with the command:
scrapy mycrawl --nolog
Output is as follows:
Third, specifying items
1. This is relatively simple: define a corresponding class in the items.py file, then import it into the spider.
items.py
import scrapy


class MymultispiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class Myspd1spiderItem(scrapy.Item):
    name = scrapy.Field()


class Myspd2spiderItem(scrapy.Item):
    name = scrapy.Field()


class Myspd3spiderItem(scrapy.Item):
    name = scrapy.Field()
Inside the spider, e.g. myspd1:

# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem


class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'myspd1 pipelines'
        yield item
Fourth, specifying pipelines
1. There are also two ways to do this. Method one: define multiple pipeline classes.
In the pipelines.py file:
class Myspd1spiderPipeline(object):
    def process_item(self, item, spider):
        print(item['name'])
        return item


class Myspd2spiderPipeline(object):
    def process_item(self, item, spider):
        print(item['name'])
        return item


class Myspd3spiderPipeline(object):
    def process_item(self, item, spider):
        print(item['name'])
        return item
1.1 Enable the pipelines in settings.py:
ITEM_PIPELINES = {
    # 'mymultispider.pipelines.MymultispiderPipeline': 300,
    'mymultispider.pipelines.Myspd1spiderPipeline': 300,
    'mymultispider.pipelines.Myspd2spiderPipeline': 300,
    'mymultispider.pipelines.Myspd3spiderPipeline': 300,
}
1.2 Set the pipeline inside the spider via custom_settings, e.g. myspd1:
# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem


class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    custom_settings = {
        'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
    }

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'myspd1 pipelines'
        yield item
The code that specifies the pipeline:
custom_settings = {
    'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
}
1.3 Run the crawl file; the result is as follows.
2. Method two: check which spider produced the item inside the pipelines.py file.
2.1 In pipelines.py:
class MymultispiderPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'myspd1':
            print('myspd1 pipelines')
        elif spider.name == 'myspd2':
            print('myspd2 pipelines')
        elif spider.name == 'myspd3':
            print('myspd3 pipelines')
        return item
2.2 In settings.py, enable only the MymultispiderPipeline pipeline:
ITEM_PIPELINES = {
    'mymultispider.pipelines.MymultispiderPipeline': 300,
    # 'mymultispider.pipelines.Myspd1spiderPipeline': 300,
    # 'mymultispider.pipelines.Myspd2spiderPipeline': 300,
    # 'mymultispider.pipelines.Myspd3spiderPipeline': 300,
}
2.3 In the spider, comment out the code that specified the pipeline:
# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem


class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    # custom_settings = {
    #     'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
    # }

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'myspd1 pipelines'
        yield item
2.4 Run the crawl.py file; the result is as follows.
Code repository: https://github.com/terroristhouse/crawler
Python tutorial series:
Link: https://pan.baidu.com/s/10eUCb1tD9GPuua5h_ERjHA
Extraction code: h0td