09 Scrapy Middleware

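Middleware is a core concept in Scrapy. With middleware you can customize data before a request is sent or after its response comes back, which lets you build spiders that adapt to different situations.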

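The Chinese word for "middleware" (中间件) differs by only one character from "man-in-the-middle" (中间人), covered in an earlier chapter, and the two do very similar things. Both can intercept data in transit, modify it, and pass it on. The difference is that middleware is a component the developer adds deliberately, whereas a man-in-the-middle is a link inserted without the developer's involvement, usually maliciously. Middleware mainly supports development, while a man-in-the-middle is mostly used to steal data, forge it, or even mount attacks.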

Scrapy has two kinds of middleware: downloader middleware (Downloader Middleware) and spider middleware (Spider Middleware).
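To make the distinction concrete, here is a minimal sketch of where each kind hooks in. The class names and the `X-Example` header are hypothetical and only for illustration; `process_request` and `process_spider_output` are hook names Scrapy looks for on downloader and spider middleware respectively.

```python
class ExampleDownloaderMiddleware:
    """Sits between the engine and the downloader (hypothetical example)."""

    def process_request(self, request, spider):
        # Called for each request before it is downloaded.
        request.headers["X-Example"] = "1"  # hypothetical header
        return None  # None tells Scrapy to keep processing this request


class ExampleSpiderMiddleware:
    """Sits between the engine and the spider (hypothetical example)."""

    def process_spider_output(self, response, result, spider):
        # Called with whatever the spider callback produced for a response.
        # This sketch passes every item and request through unchanged.
        for item_or_request in result:
            yield item_or_request
```

The rest of this section deals only with downloader middleware.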

Downloader middleware:

# -*- coding: utf-8 -*-
import logging
import random

from scrapy.utils.project import get_project_settings

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):
    """Send each request through a proxy picked at random from PROXIES."""

    def __init__(self):
        self.settings = get_project_settings()

    def process_request(self, request, spider):
        proxy = random.choice(self.settings["PROXIES"])
        logger.info(f'use proxy {proxy}')
        request.meta['proxy'] = proxy


class UaMiddleware(object):
    """Give each request a User-Agent picked at random from USER_AGENT_LIST."""

    def __init__(self):
        self.settings = get_project_settings()

    def process_request(self, request, spider):
        ua = random.choice(self.settings['USER_AGENT_LIST'])
        logger.info(f'use ua : {ua}')
        request.headers['User-Agent'] = ua


class CookieMiddleware(object):
    """Attach a fixed set of cookies to every request."""

    def __init__(self):
        pass

    def process_request(self, request, spider):
        request.cookies = {
            '_octo': 'GH1.1.308160386.1573462117',
            'dotcom_user': 'zj008',
            'logged_in': 'yes',
            '__Host-user_session_same_site': '_RswyHk7fUP475BeR1pVow6qB0XNSq5cOCfw9tUINjraeRhU',
            '_device_id': '69c607831d178592b4c83dde16be0f22',
            '_gh_sess': 'akxkaDREcUpRZFErNmRWTXYwUGovK3QrMjN3aEp4RDY3K1QwK2g4NHVQSkdYV2o5ajl4czZ6Q2dRN0ZHSXVhU3N5S1NsT1haM1hUd281Rkp5eExSQ0FQZmhwbUhSNUFCVmFQbXB1MDhCS2sxL0lhWHp5a3VWZHdiNVZia1JKbjNVNy9zVjYwdWxNcDFTdnNwUHYxaDBMTUpTMGRpR0drOElvZ0J2U3c0bjZjdisvemYvK1NGaENkM2d5UEhLTTRxY1YyWW83b2o4amljK0ZiSG4zeXJtM21sODhXS3JxdG82bG9YbENIV1oyRXUxTDI3VkE2RlNTcmRYWFoxQVBOK0QyR2tUV0Jqd1lFV0VteG9WWWo5U2dxcVNDeHNYTENDRzFhL3pvMVE1SHc9LS1VU2FHTFEycWxWZlVBZWdNcitOUWJ3PT0%3D--38d0fe27f00d3b3c34ac8ec54f6f80c4db3c2a66',
            'has_recent_activity': '1',
            'ignored_unsupported_browser_notice': 'false',
            'user_session': '_RswyHk7fUP475BeR1pVow6qB0XNSq5cOCfw9tUINjraeRhU',
        }

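Here I implemented three middlewares. Each one processes the request object: the first sets a proxy, the second sets the User-Agent, and the third adds cookies.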
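These middlewares only run once they are enabled in the project settings, and they read `PROXIES` and `USER_AGENT_LIST` from there. The following is a sketch of what that `settings.py` fragment might look like. The module path `myproject.middlewares`, the proxy addresses, and the User-Agent strings are placeholders, not values from the original project. The order numbers are an assumption chosen to sit below Scrapy's built-in `CookiesMiddleware` (700) and `HttpProxyMiddleware` (750), so that the cookies and proxy set here are already in place when those built-ins handle the request.

```python
# settings.py (sketch; every value below is a placeholder)

# Read by ProxyMiddleware via settings["PROXIES"].
PROXIES = [
    "http://127.0.0.1:8888",
    "http://127.0.0.1:8889",
]

# Read by UaMiddleware via settings["USER_AGENT_LIST"].
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]

# Lower numbers run their process_request earlier.
# "myproject.middlewares" is a hypothetical module path.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 543,
    "myproject.middlewares.UaMiddleware": 544,
    "myproject.middlewares.CookieMiddleware": 545,
}
```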

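Besides process_request, a downloader middleware can also define process_response, which handles the response object, and process_exception, which is called when the download or a process_request method raises an exception.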
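As an illustration of those two hooks, here is a minimal sketch of a middleware that reschedules a request when the response looks like a ban and logs download errors. The class name, the status codes treated as bans, the `ban_retries` meta key, and the retry limit are all assumptions made for this example; `request.replace()` and the method signatures are part of Scrapy.

```python
import logging

logger = logging.getLogger(__name__)

BAN_STATUSES = {403, 429}   # assumption: statuses treated as "banned"
MAX_BAN_RETRIES = 2         # assumption: give up after this many retries


class BanRetryMiddleware:
    """Hypothetical example of process_response and process_exception."""

    def process_response(self, request, response, spider):
        retries = request.meta.get("ban_retries", 0)
        if response.status in BAN_STATUSES and retries < MAX_BAN_RETRIES:
            logger.warning("status %s for %s, retrying", response.status, request.url)
            # Returning a Request reschedules it; dont_filter=True keeps the
            # duplicate filter from dropping the repeated URL.
            retry = request.replace(dont_filter=True)
            retry.meta["ban_retries"] = retries + 1
            return retry
        # Returning the response passes it on towards the spider.
        return response

    def process_exception(self, request, exception, spider):
        logger.warning("download failed for %s: %r", request.url, exception)
        # Returning None lets other middlewares and Scrapy's default
        # error handling deal with the exception.
        return None
```

Because the retried request passes through process_request again, a random-proxy middleware like the one above would pick a fresh proxy for it.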

Reposted from www.cnblogs.com/zhangjian0092/p/11836823.html