Spider data mining 9: Scrapy framework (5)

1. Downloader middleware (steps four and five of the data flow: the component that sits between the engine and the downloader; it can be used to modify requests on their way out and responses on their way back)

Running scrapy settings --get DOWNLOADER_MIDDLEWARES_BASE shows the downloader middleware that ships with the system.

It returns a number of downloader middleware classes, each with its own specific function.

The smaller the priority value, the closer the middleware is to the engine; the larger the value, the closer it is to the downloader. Data returned by the downloader passes through the largest value first and the smallest value last.

In other words, requests travel from small values to large values, and responses travel from large values back to small values. After writing a custom middleware, check whether one of the base middleware classes below duplicates what your middleware does. Custom middleware is merged on top of the base middleware listed below, so if you only want your own version to run, disable the corresponding built-in one by setting it to None.

For example:
'baidu.middlewares.User_AgentDownloaderMiddleware': 323,
"scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": None,
When a built-in middleware does the same job as your own, close the one you did not write yourself by setting it to None.
"scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,机器人协议中间件
 "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300, http身份验证中间件
 "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
 下载超时中间件
  "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
  默认请求头中间件
   "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
   用户代理中间件(UA)
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    重新尝试中间件(前面超时中间件超时后才传到这里重新尝试)
 "scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware": 560,
 	ajax抓取中间件(基于元片段html标签抓取ajax页面的中间件)
 "scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware": 580,
  	始终使用字符串作为原因中间件(根据meta-refresh.html标签处理request重定向的中间件)
  "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
  允许网站数据发送或接收压缩的(gizp)流量的中间件
 "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 600,
 重定向中间件(根据request的状态处理重定向)
  "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
  凭证中间件
 "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
 代理中间件
  "scrapy.downloadermiddlewares.stats.DownloaderStats": 850, 
  通过此中间件存储通过它的所有请求、响应、异常信息
  "scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware": 900,
  缓存中间件

Built-in downloader middleware (the middleware that comes with the system):

Commonly used built-in middleware:
CookiesMiddleware: cookie support; it can be turned on and off with the COOKIES_ENABLED setting.

HttpProxyMiddleware: HTTP/IP proxy support, configured by setting the value of request.meta['proxy'].

UserAgentMiddleware: user agent middleware.
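As a rough illustration of the spider-side use described above (the spider name, URL, and proxy address are placeholder values for the example), cookies are switched globally in settings.py, while a per-request proxy is passed through meta so that HttpProxyMiddleware picks it up:

# settings.py
# COOKIES_ENABLED = False    # turn the cookies middleware off globally

# spiders/example.py -- hypothetical spider setting request.meta['proxy']
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # HttpProxyMiddleware reads the proxy address from request.meta['proxy']
        yield scrapy.Request(
            "https://httpbin.org/ip",
            meta={"proxy": "http://127.0.0.1:8888"},
        )

    def parse(self, response):
        self.logger.info(response.text)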

Custom middleware: you can write your own; it does not conflict with the system middleware, and even if a priority value happens to be the same there is no clash.

Downloader middleware is a framework of hooks into Scrapy's request/response processing.
It is a light, low-level system for globally modifying Scrapy's requests and responses.

A downloader middleware in the scrapy framework is a class that implements certain special methods.

The middleware that comes with the scrapy system is placed in the DOWNLOADER_MIDDLEWARES_BASE setting

User-defined middleware is registered in the DOWNLOADER_MIDDLEWARES setting.
This setting is a dict: the key is the middleware class path and the value is the order of the middleware, an integer from 0 to 1000. The smaller the value, the closer to the engine.
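A minimal settings.py sketch of such a registration, reusing the 'baidu' project and the User_AgentDownloaderMiddleware class that this article uses as its running example (adjust the paths to your own project):

# settings.py -- registering a custom downloader middleware
DOWNLOADER_MIDDLEWARES = {
    # custom middleware: 323 places it between HttpAuthMiddleware (300)
    # and DownloadTimeoutMiddleware (350) in the chain
    'baidu.middlewares.User_AgentDownloaderMiddleware': 323,
    # disable the built-in middleware that would do the same job
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}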

Downloader middleware API (the return value matters most: it determines where the request or response goes next):

Each middleware is a Python class that defines one or more of the following methods:
process_request(request, spider): processes the request; this method is called for every request that passes through the middleware (request is the request object, spider is the spider object). Setting the request proxy here means it does not have to be set on every request when the spider builds them, which saves code.

# Called for each request that goes through the downloader
# middleware.

# Must either (exactly one of the following will happen):
# - return None: continue processing this request
Returning None passes the request on to the next middleware; whatever the request needs is not something this middleware handles.
# - or return a Response object
Returning a Response object means the request is not passed to the next process_request; this middleware has produced the response itself, and it is passed on as if it came from the downloader.
# - or return a Request object
Returning a Request object hands it straight back to the engine, where it is rescheduled and submitted again.
# - or raise IgnoreRequest: process_exception() methods of
#   installed downloader middleware will be called
Raising IgnoreRequest means the exception will definitely be handled, by the process_exception() methods of the installed downloader middleware.

process_response(request, response, spider): processes the response; this method is called for every response that passes through the middleware.

# Called with the response returned from the downloader (handles the response).

# Must either:
# - return a Response object
Returning a Response object passes it on to the next middleware; on the way back there is no need to look for a "suitable" middleware, the response simply travels through all of them.
# - return a Request object
Returning a Request object hands it straight to the engine without going through the remaining middleware; no other component receives it, and the engine schedules it for another attempt.
# - or raise IgnoreRequest
Raising IgnoreRequest drops the response and hands the exception over to the exception-handling path.

process_exception(request, exception, spider): called when an exception is raised while processing the request.

# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception (handles exceptions).

# Must either:
# - return None: continue processing this exception
Returning None keeps calling the other middleware, passing the exception on until one of them can deal with it.
# - return a Response object: stops process_exception() chain
Returning a Response object stops calling the other middleware; the first one that can satisfy the requirement handles it.
# - return a Request object: stops process_exception() chain
Returning a Request object also stops the chain and hands the request straight to the engine.

from_crawler(cls, crawler): used by Scrapy when creating the spider/middleware objects; not important here.
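Putting the methods together, here is a minimal sketch of a custom downloader middleware (the class name and the log messages are invented for the example; the return values follow the rules listed above):

# middlewares.py -- a bare-bones downloader middleware
import logging

logger = logging.getLogger(__name__)

class LoggingDownloaderMiddleware:
    """Logs every request, response and exception that passes through, without altering them."""

    def process_request(self, request, spider):
        logger.debug("outgoing request: %s", request.url)
        return None  # None: let the next middleware / the downloader handle it

    def process_response(self, request, response, spider):
        logger.debug("incoming response: %s %s", response.status, response.url)
        return response  # a Response: keep passing it back toward the engine

    def process_exception(self, request, exception, spider):
        logger.debug("download failed for %s: %r", request.url, exception)
        return None  # None: let the other middleware (e.g. RetryMiddleware) handle it

Register it in DOWNLOADER_MIDDLEWARES, as shown earlier, to put it into the chain.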

For other middleware, please refer to the official document: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

Spider middleware: used less often; it belongs to the first and second steps of the data flow, the component that sits between the engine and the spiders.

2. Custom User-Agent middleware

request.headers["User-Agent"] = random.choice(...)  picks one user agent at random from a pool (random selection comes from the random library: import random)
request.headers["<header name>"] = ...  adds the given header to the request; the proxy, in contrast, is set through meta (see the sketch below)

User agent pool: defined in and used by the User-Agent middleware.

proxyip = random.choice(IPPOOL)  (again using the random library)
request.meta['proxy'] = "http://" + proxyip["ipaddr"]  (take only the value, not the key)
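A sketch of the two middleware described in this section (IPPOOL, the user-agent strings, and the RandomProxyDownloaderMiddleware class name are placeholder values; User_AgentDownloaderMiddleware matches the name used earlier in this article; register both in DOWNLOADER_MIDDLEWARES):

# middlewares.py -- random User-Agent and random proxy, as sketched above
import random

USER_AGENT_POOL = [  # placeholder user-agent pool
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

IPPOOL = [  # placeholder proxy pool
    {"ipaddr": "127.0.0.1:8888"},
    {"ipaddr": "127.0.0.1:9999"},
]

class User_AgentDownloaderMiddleware:
    def process_request(self, request, spider):
        # pick one user agent at random for every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        return None

class RandomProxyDownloaderMiddleware:
    def process_request(self, request, spider):
        # take only the value from the pool entry (not the key) and add the scheme
        proxyip = random.choice(IPPOOL)
        request.meta["proxy"] = "http://" + proxyip["ipaddr"]
        return None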

To turn a middleware off, add the corresponding entry to the middleware setting in settings with a value of None, e.g. "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": None.

3. Commonly used Scrapy settings (the more requirements a project has, the more ends up in the settings configuration file)

BOT_NAME = 'baidu'  # name of the Scrapy project
SPIDER_MODULES = ['baidu.spiders']  # spider modules
NEWSPIDER_MODULE = 'baidu.spiders'  # module for spiders created with the genspider command (you can also write them by hand)
USER_AGENT = 'baidu (+http://www.yourdomain.com)'  # default user agent
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32 raises the limit to 32
# Configure a delay for requests for the same website (default: 0)
# DOWNLOAD_DELAY = 3 sets the request delay to 3 seconds
# The download delay setting will honor only one of the following (a restriction on access):
#CONCURRENT_REQUESTS_PER_DOMAIN = 16  maximum concurrent requests allowed for a single domain; for a single site this limit wins: with CONCURRENT_REQUESTS = 32 but the per-domain limit at 16, at most 16 requests run, not 32
#CONCURRENT_REQUESTS_PER_IP = 16  maximum concurrent requests allowed for a single IP; when both the domain limit and the IP limit are set, only the IP limit applies and the domain limit is ignored
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False  # cookies are used by default
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False  # the telnet monitoring console is on by default; uncommenting (False) turns it off
# Override the default request headers (headers can be added here, on the Request itself, or in middleware):
DEFAULT_REQUEST_HEADERS = {...}  # request headers added here in settings are global (and have the lowest priority)
SPIDER_MIDDLEWARES = {  # spider middleware
#    'baidu.middlewares.BaiduSpiderMiddleware': 543,
# }
DOWNLOADER_MIDDLEWARES = {...}  # downloader middleware
There are also the base downloader middleware, for example
"scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
...

Extensions: uncommenting this block disables the telnet monitoring console
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}


#ITEM_PIPELINES = {
#    'github.pipelines.GithubPipeline': 300,
#}  # configure the item pipelines

Enabling or configuring the AutoThrottle extension. The download latency is measured as the time from establishing the TCP connection to receiving the HTTP headers; because Scrapy is usually busy running other callbacks, this time can rarely be measured exactly, but the rough estimate is reasonable, so a delay computed from it is still usable. The throttling rule: download delay == latency of the received response divided by the current concurrency.
Then, with smart throttling, the next delay is taken as an average over several measured latencies (configurable); any response that does not come back with status 200 generally pushes the next delay higher rather than lower, to stay on the safe side. The delay never goes below DOWNLOAD_DELAY and never above AUTOTHROTTLE_MAX_DELAY, and the average number of concurrent requests per second and so on can never be faster than the default delay, average and concurrency limits you set above. (A rough sketch of this rule follows the AutoThrottle settings below.)
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True  # off by default; enabling it turns on smart throttling / automatic rate limiting, which is more convenient than a fixed delay: Scrapy adjusts toward the best crawl speed on its own, you only set the minimum and maximum and it works within that range

# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # uncomment to target one concurrent request per server

# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False  # enable to show the throttling stats for each response received
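A rough, simplified sketch of the throttling rule described above (this is only an illustration, not Scrapy's actual implementation; the numbers are made up): the target delay is the measured latency divided by the target concurrency, the next delay averages that target with the previous delay, non-200 responses are not allowed to lower it, and the result is clamped between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.

# toy model of the AutoThrottle rule (illustrative only)
DOWNLOAD_DELAY = 1.0                    # lower bound
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # desired parallel requests per server

def next_delay(previous_delay, latency, status):
    target = latency / AUTOTHROTTLE_TARGET_CONCURRENCY
    new_delay = (previous_delay + target) / 2.0        # average with the previous delay
    if status != 200:
        new_delay = max(new_delay, previous_delay)     # bad responses never speed things up
    return min(max(new_delay, DOWNLOAD_DELAY), AUTOTHROTTLE_MAX_DELAY)

print(next_delay(previous_delay=5.0, latency=0.6, status=200))  # 2.65: moves from 5.0 toward the 0.3 s target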

Enable or configure the HTTP cache
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True  # disabled by default
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

DEPTH_LIMIT = 3  # how deep the spider follows sub-pages (configurable)


Commonly used project settings:
BOT_NAME    project name

CONCURRENT_ITEMS    maximum number of items processed concurrently, default 100

CONCURRENT_REQUESTS    maximum number of concurrent downloads

CONCURRENT_REQUESTS_PER_DOMAIN    maximum concurrent requests per domain

CONCURRENT_REQUESTS_PER_IP    maximum concurrent requests per IP



User agent (UA): the client itself acts as the "agent" and accesses the server directly (for example, a browser accessing the server).

IP proxy: a proxy server accesses the target on the client's behalf.

Download delay: when crawling to download movies and similar data, set a delay so the crawler behaves like an ordinary user; otherwise the IP is easily blocked.

Priority of settings:

Command line (options specified on the console) > spider (settings written in the spider file) > setting entries used elsewhere (such as in middleware) > defaults used when nothing is specified on the console > settings written in the settings file (such as DEFAULT_REQUEST_HEADERS = {...}, where request headers can be added)
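To make the first two levels concrete, here is a small sketch (the spider name and the delay values are made up for the example): a value passed with -s on the command line overrides the spider's custom_settings, which in turn overrides the value in settings.py.

# settings.py
# DOWNLOAD_DELAY = 3        # lowest of the three levels shown here

# spiders/baidu_spider.py -- hypothetical spider
import scrapy

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    custom_settings = {"DOWNLOAD_DELAY": 2}   # overrides settings.py for this spider

    def parse(self, response):
        pass

# command line -- highest priority of the three:
#   scrapy crawl baidu -s DOWNLOAD_DELAY=5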

Cache strategy: every request is cached, so the next time the same request is made the original cached copy can be used directly.


Origin blog.csdn.net/qwe863226687/article/details/114117160