[Scrapy Framework] Version 2.4.0 Source Code: Settings Explained in Detail

Index of all source code analysis articles:

[Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index

Introduction

This article explains the settings configuration file used by the Scrapy framework.

Setting priorities (descending order)

Settings can be populated through several different mechanisms; the list below is sorted from highest to lowest priority.

  1. Command-line options
    Use the -s (or --set) command-line option to explicitly override one (or more) settings.
scrapy crawl myspider -s LOG_FILE=scrapy.log
  2. Per-spider settings
class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }
  3. Project settings module
    The project settings module is the standard configuration file of a Scrapy project, where most custom settings are defined. For a standard Scrapy project, this means adding or changing settings in the settings.py file created for the project.

  4. Default settings per command
    Each Scrapy tool command can have its own default settings, which override the global defaults. These custom command settings are specified in the default_settings attribute of the command class.

  5. Default global settings
    The global default values are located in the scrapy.settings.default_settings module.
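
As a quick illustration of how these priorities combine, here is a minimal sketch that manipulates a scrapy.settings.Settings object directly (the setting name and values are just examples):

from scrapy.settings import Settings

s = Settings()                                    # pre-populated with the global defaults
print(s.get('CONCURRENT_REQUESTS'))               # 16 (the built-in default)
s.set('CONCURRENT_REQUESTS', 32, priority='project')
s.set('CONCURRENT_REQUESTS', 8, priority='cmdline')
print(s.get('CONCURRENT_REQUESTS'))               # 8, because 'cmdline' outranks 'project'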

Import path and class

New in 2.4.0: when a setting references a callable object to be imported by Scrapy (such as a class or a function), the object itself can be used instead of its import path string.

from mybot.pipelines.validate import ValidateMyItem
ITEM_PIPELINES = {
    # via the class object itself...
    ValidateMyItem: 300,
    # ...which is equivalent to using the class's import path
    'mybot.pipelines.validate.ValidateMyItem': 300,
}

Access Settings

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print(f"Existing settings: {self.settings.attributes.keys()}")

In extensions, middlewares, and item pipelines, settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler object passed to the from_crawler method.

class MyExtension:
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))
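
Besides getbool, the Settings object provides several other typed getters; a quick sketch (the example values are assumptions, not your project's actual values):

settings.getbool('LOG_ENABLED')          # True or False
settings.getint('CONCURRENT_REQUESTS')   # e.g. 16
settings.getfloat('DOWNLOAD_DELAY')      # e.g. 0.25
settings.getlist('SPIDER_MODULES')       # e.g. ['Amazon.spiders']
settings.getdict('ITEM_PIPELINES')       # e.g. {'Amazon.pipelines.CustomPipeline': 200}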

Built-in setting reference

Since many of the settings are rarely needed, only the commonly used ones are covered here. To learn more about the remaining settings, go directly to the
Scrapy 2.4.0 settings section

Basic configuration

  1. Project name
    The default USER_AGENT is built from it, and it is also used as the logger name for log records. It is filled in automatically when you create the project, i.e. when you run

scrapy startproject Amazon

and it normally does not need to be modified.

BOT_NAME = 'Amazon'
  2. Spider module paths
    Created by default; they normally do not need to be modified.
SPIDER_MODULES = ['Amazon.spiders']
NEWSPIDER_MODULE = 'Amazon.spiders'
  3. User-Agent request header
    Does not need to be modified by default.
#USER_AGENT = 'Amazon (+http://www.yourdomain.com)'
  4. Robots protocol
    Whether to obey the robots exclusion protocol. Websites generally publish a robots.txt file at the root of the domain describing what robots may and may not crawl; in practice it is usually ignored here, because strictly obeying it often means there is little left to crawl.
ROBOTSTXT_OBEY = False  # do not obey robots.txt
  5. Cookies
    Whether cookie support is enabled (cookies are handled through a cookiejar); enabled by default.
#COOKIES_ENABLED = False
  6. Telnet console
    Telnet is used to inspect the running crawler, operate on it, and so on. Connect with telnet <ip> <port> and then issue commands; a short example session follows the snippet below.
#TELNETCONSOLE_ENABLED = False
#TELNETCONSOLE_HOST = '127.0.0.1'
#TELNETCONSOLE_PORT = [6023,]
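
A minimal example session, assuming the console runs on the default host and port shown above (note that recent Scrapy versions require the username/password printed in the crawl log):

# connect with:  telnet 127.0.0.1 6023
# the console then gives a Python prompt inside the running crawler:
est()              # print a report of the current engine status
engine.pause()     # pause the crawl
engine.unpause()   # resume it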
  7. Default request headers
    The default headers Scrapy uses when sending HTTP requests. They mainly matter for 302 redirects, or for simple anti-crawling counter-measures that call for rotating request headers per website; for that it is better to build a set of randomly changing request headers (a middleware sketch follows the snippet below). The defaults are left unchanged here.
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
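
A minimal sketch of the kind of randomly rotating request-header middleware suggested above. The class name, module path, priority, and agent strings are all illustrative assumptions, not part of Scrapy:

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
]

class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent for every request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

# enabled in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {'Amazon.middlewares.RandomUserAgentMiddleware': 400}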

Concurrency and latency

  1. Total download concurrency
    The maximum number of concurrent requests processed by the downloader; the default value is 16.
# CONCURRENT_REQUESTS = 32
  2. Per-domain concurrency
    The maximum number of concurrent requests that may be executed against each domain; the default value is 8.
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
  3. Per-IP concurrency
    The number of concurrent requests allowed for a single IP; the default value is 0, which means no per-IP limit.
    If it is non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored, i.e. the concurrency limit is enforced per IP rather than per domain.
    This setting also affects DOWNLOAD_DELAY: if the value is non-zero, the download delay is applied per IP instead of per domain.
#CONCURRENT_REQUESTS_PER_IP = 16
  4. Download delay
    The number of seconds to wait between consecutive requests to the same website, used to throttle the crawl (a combined sketch follows the snippet below).
#DOWNLOAD_DELAY = 3
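
Putting the four knobs together, a settings.py fragment might look like the following sketch (the numbers are illustrative assumptions, not recommendations):

CONCURRENT_REQUESTS = 32             # global cap across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per domain
CONCURRENT_REQUESTS_PER_IP = 0       # 0 = keep the per-domain limits instead
DOWNLOAD_DELAY = 1.5                 # seconds between requests to the same site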

Intelligent speed limit / automatic throttling

Introduction

from scrapy.extensions.throttle import AutoThrottle
#http://scrapy.readthedocs.io/en/latest/topics/autothrottle.html#topics-autothrottle

Design goals:

  1. Be nicer to the target site than the default download delay would be.
  2. Automatically adjust Scrapy to the optimal crawling speed, so users do not have to tune the download delay themselves; the developer only needs to define the maximum allowed concurrency, and the extension handles the rest.

Implementation:

  1. The download latency is measured as the time between establishing the TCP connection and receiving the HTTP headers.
  2. Because Scrapy may be busy processing spider callbacks or be unable to download at that moment, these latencies are hard to measure exactly in a cooperative multitasking environment, which is why the relevant parameters have to be configured in advance.

Rate limiting algorithm:

  1. The AutoThrottle algorithm adjusts the download delay according to the following rules (a short worked example follows this list):
  2. When a response is received, the target delay for the site = latency of that response / AUTOTHROTTLE_TARGET_CONCURRENCY.
  3. The download delay for the next request is set to the average of the previous download delay and the target delay.
  4. Latencies of non-200 responses are not allowed to decrease the delay.
  5. The download delay cannot go below DOWNLOAD_DELAY or above AUTOTHROTTLE_MAX_DELAY.
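
A short worked example of the adjustment rule, with assumed numbers:

latency = 0.6                # seconds from TCP connect to HTTP headers
target_concurrency = 4.0     # AUTOTHROTTLE_TARGET_CONCURRENCY
current_delay = 0.5          # delay used for the previous request

target_delay = latency / target_concurrency       # 0.15 s
next_delay = (current_delay + target_delay) / 2   # 0.325 s
# the result is then clamped between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY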

Configuration and use:

# Enable AutoThrottle (True); the default is False
AUTOTHROTTLE_ENABLED = True
# initial download delay
AUTOTHROTTLE_START_DELAY = 5
# minimum delay
DOWNLOAD_DELAY = 3
# maximum delay
AUTOTHROTTLE_MAX_DELAY = 10
# Average number of requests Scrapy should send in parallel to each remote site; it should not be
# higher than CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP. Raising it increases
# throughput but puts more load on the target site; lowering it makes the crawler more "polite".
# At any given moment the number of concurrent requests may be higher or lower than this value;
# it is a target the crawler tries to reach, not a hard limit.
AUTOTHROTTLE_TARGET_CONCURRENCY = 16.0
# debug mode: print throttling stats for every received response
AUTOTHROTTLE_DEBUG = True
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

Crawling depth and method

  1. Maximum crawl depth
    The maximum depth the crawler is allowed to reach. The current depth can be read from response.meta['depth'] (a sketch follows the snippet below); 0 means no depth limit.
# DEPTH_LIMIT = 3
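
A minimal sketch of reading the current depth inside a spider callback (the callback body is an assumption):

def parse(self, response):
    depth = response.meta.get('depth', 0)   # set by the built-in DepthMiddleware
    self.logger.info("depth=%s url=%s", depth, response.url)
    # yield further requests / items here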
  2. Crawl order
    DEPTH_PRIORITY = 0 means depth-first, LIFO (the default); DEPTH_PRIORITY = 1 means breadth-first, FIFO.
# last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

# first in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
  3. Scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler
  4. Request URL de-duplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'
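
The commented path above points at a custom duplicate filter from another tutorial (the built-in default is scrapy.dupefilters.RFPDupeFilter). A minimal sketch of such a custom filter might look like this; the class name is taken from the setting, the body is an assumption:

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class RepeatUrl(BaseDupeFilter):
    """Keep an in-memory set of request fingerprints and drop repeats."""

    def __init__(self):
        self.seen = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if fp in self.seen:
            return True        # already scheduled: drop it
        self.seen.add(fp)
        return False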

Middlewares, pipelines, and extensions

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Amazon.middlewares.AmazonSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares; one is enabled here because it is needed, otherwise the scraped content cannot be used
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'Amazon.middlewares.DownMiddleware': 543,
}
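
A minimal sketch of what the DownMiddleware enabled above could look like; only the name comes from the setting, the method bodies are assumptions showing the downloader-middleware hooks:

class DownMiddleware:
    def process_request(self, request, spider):
        # return None to continue, a Response to short-circuit the download,
        # or a Request to reschedule
        return None

    def process_response(self, request, response, spider):
        # must return a Response (possibly modified) or a Request
        return response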

DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 0,
    'scrapy.extensions.telnet.TelnetConsole': 0,
    'scrapy.extensions.memusage.MemoryUsage': 0,
    'scrapy.extensions.memdebug.MemoryDebugger': 0,
    'scrapy.extensions.closespider.CloseSpider': 0,
    'scrapy.extensions.feedexport.FeedExporter': 0,
    'scrapy.extensions.logstats.LogStats': 0,
    'scrapy.extensions.spiderstate.SpiderState': 0,
    'scrapy.extensions.throttle.AutoThrottle': 0,
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'Amazon.pipelines.CustomPipeline': 200,
}
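
For reference, a minimal sketch of the CustomPipeline referenced in the commented line above; the name comes from that line, the body is an assumption:

class CustomPipeline:
    def process_item(self, item, spider):
        # validate / clean / store the item here
        return item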

Cache

Caching is enabled in order to store requests that have already been sent, along with their responses, so they can be reused later.

from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.extensions.httpcache import FilesystemCacheStorage
# Whether to enable HTTP caching
# HTTPCACHE_ENABLED = True

# Cache policy: cache every request; identical requests are later served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Cache expiration time in seconds (0 = never expire)
# HTTPCACHE_EXPIRATION_SECS = 0

# Directory where the cache is stored
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes that should not be cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# Cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
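
A minimal sketch that switches the cache on with the filesystem storage and the DummyPolicy (the values are illustrative):

HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_EXPIRATION_SECS = 0                       # 0 = cached entries never expire
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # do not cache server errors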

Origin blog.csdn.net/qq_20288327/article/details/113521524