This article is based on Scrapy 2.6.
Foreword
After half a month of parameter tuning, I have collected the configurations and practices I use most often in Scrapy (all lessons paid for in blood and tears TAT).
Configuration instructions
settings.py common configuration
It is recommended to keep global configuration here, such as database connections, third-party secret keys, mail/webhook settings, and so on. Configuration specific to individual Spiders should not be placed in this file.
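As a rough sketch, such a settings.py might look like the following. CONCURRENT_REQUESTS, DOWNLOAD_DELAY, and the MAIL_* keys are standard Scrapy settings; the MYSQL_* keys and the webhook URL are hypothetical project-specific names used only for illustration:

```python
# settings.py -- global, project-wide configuration only.

# Standard Scrapy knobs that apply to every spider by default.
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5

# Mail settings (read by scrapy.mail.MailSender).
MAIL_HOST = "smtp.example.com"
MAIL_FROM = "crawler@example.com"

# Hypothetical custom keys: database connection and alert webhook.
MYSQL_HOST = "127.0.0.1"
MYSQL_PORT = 3306
ALERT_WEBHOOK_URL = "https://example.com/hook"
```

Per-spider options (concurrency limits, cookies, and the like) deliberately do not appear here; they belong in each Spider's custom_settings, as described below.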
All Spiders inherit BaseSpider and override the constructor to initialize the Spider from the settings.
Configuration that applies to a single Spider is best placed in that Spider's custom_settings: the maximum concurrency, the maximum concurrency per IP, whether cookies are enabled, and any extra options.
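Put together, the two points above can be sketched as follows. BaseSpider is shown here as a plain class so the snippet runs standalone (in a real project it would inherit scrapy.Spider), and NewsSpider plus the API_KEY setting are hypothetical names for illustration:

```python
class BaseSpider:
    """Sketch only; in a real project: class BaseSpider(scrapy.Spider)."""

    def __init__(self, *args, settings=None, **kwargs):
        # Pull shared options out of the settings once, in one place,
        # so every concrete spider gets them for free.
        self.settings = settings or {}
        self.api_key = self.settings.get("API_KEY")  # hypothetical key

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Scrapy builds spiders via from_crawler; hand the settings
        # to the constructor so __init__ can do the initialization.
        return cls(*args, settings=crawler.settings, **kwargs)


class NewsSpider(BaseSpider):
    name = "news"  # hypothetical spider
    # Per-spider overrides live here, not in settings.py.
    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "CONCURRENT_REQUESTS_PER_IP": 2,
        "COOKIES_ENABLED": False,
    }
```

The keys in custom_settings are standard Scrapy setting names; Scrapy merges them over the project settings when the spider runs, which keeps settings.py free of spider-specific noise.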
middlewares.py
All SpiderMiddleware classes inherit BaseSpiderMiddleware and all DownloaderMiddleware classes inherit BaseDownloaderMiddleware; both override the constructor to initialize the middleware from the settings.
```python
from scrapy import signals

from utils import get_single_name  # project-local helper


class BaseSpiderMiddleware:
    """Base spider middleware whose constructor receives the settings."""

    def __init__(self, settings=None):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        s = cls(crawler.settings)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        spider.logger.warning('SpiderMiddleware %s, Spider %s, process exception: %s' % (get_single_name(self), spider.name, exception))

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('SpiderMiddleware %s, Spider opened: %s' % (get_single_name(self), spider.name))


class BaseDownloaderMiddleware:
    """Base downloader middleware whose constructor receives the settings."""

    def __init__(self, settings=None):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        s = cls(crawler.settings)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        spider.logger.warning('DownloadMiddleware %s, Spider %s, process exception: %s' % (get_single_name(self), spider.name, exception))

    def spider_opened(self, spider):
        spider.logger.info('DownloadMiddleware: %s, Spider opened: %s' % (get_single_name(self), spider.name))
```
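The from_crawler flow can be exercised without a running crawl. The following standalone sketch shows how a concrete, settings-aware downloader middleware would be built; ProxyMiddleware, FakeCrawler, and the PROXY_POOL key are hypothetical, and the signal wiring is omitted so the snippet runs without Scrapy installed:

```python
class ProxyMiddleware:
    """Hypothetical downloader middleware that reads a proxy list from settings."""

    def __init__(self, settings=None):
        self.settings = settings or {}
        self.proxies = self.settings.get("PROXY_POOL", [])  # hypothetical key

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy instantiates middlewares through from_crawler;
        # forwarding crawler.settings is what makes them configurable.
        return cls(crawler.settings)


class FakeCrawler:
    """Stand-in for scrapy.crawler.Crawler, just enough for the demo."""
    settings = {"PROXY_POOL": ["http://127.0.0.1:8888"]}


mw = ProxyMiddleware.from_crawler(FakeCrawler())
```

Because the base classes above centralize this from_crawler boilerplate, a real subclass only has to read the keys it cares about in its constructor.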
pipelines.py
All Pipelines inherit BasePipeline and override the constructor to initialize the Pipeline from the settings.
Loading the configuration during initialization makes Pipelines easier to manage and maintain.
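A minimal sketch of such a BasePipeline, shown without importing Scrapy so it runs standalone; the MySQLPipeline subclass and the MYSQL_HOST key are hypothetical examples of configuration-driven initialization:

```python
class BasePipeline:
    """Base pipeline whose constructor receives the settings."""

    def __init__(self, settings=None):
        self.settings = settings or {}

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds pipelines via from_crawler; forward the settings.
        return cls(crawler.settings)

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        pass

    def process_item(self, item, spider):
        # Default behavior: pass items through unchanged.
        return item


class MySQLPipeline(BasePipeline):
    """Hypothetical subclass: connection details come from the settings."""

    def open_spider(self, spider):
        self.host = self.settings.get("MYSQL_HOST", "127.0.0.1")
```

With this pattern, swapping a database host or credentials is a settings change only; no pipeline code needs to be touched.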