I. Downloader middleware
(Figure: Scrapy framework diagram)
The downloader middleware (Downloader Middlewares) layer sits between the Scrapy engine and the downloader.
Role:
1. While the engine passes a Request to the downloader, the middleware can run a series of processing steps on the request, such as setting the request's User-Agent, setting a proxy, and so on.
2. When the download completes and the Response is passed back to the engine, the middleware can run a series of processing steps on the response, for example gzip decompression.
- We mainly use downloader middleware to process requests. Typically this means setting a random User-Agent and a random proxy on each request, in order to defeat the target site's anti-crawling strategies.
II. UA pool (User-Agent pool)
- Role: disguise as many of the Scrapy project's requests as possible as different browser identities.
- Operating procedure:
1. Intercept requests in the downloader middleware.
2. Tamper with the User-Agent header of the intercepted request to disguise it.
3. Enable the downloader middleware in the configuration file (settings.py).
Code example:
# Imports (note: in Scrapy 1.0+ this class lives at scrapy.downloadermiddlewares.useragent)
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
import random

# UA-pool code, encapsulated in a single downloader-middleware class
class RandomUserAgent(UserAgentMiddleware):

    def process_request(self, request, spider):
        # Randomly pick a UA value from the list
        ua = random.choice(user_agent_list)
        # Write the chosen UA value into the intercepted request's headers
        request.headers.setdefault('User-Agent', ua)

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
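Step 3 of the procedure says to enable the middleware in the configuration file, but the settings entry itself isn't shown. Here is a minimal sketch of what it would look like in settings.py; the project/module path `proName.middlewares` and the priority value 542 are assumptions, so substitute your own project's path:

```python
# settings.py -- enable the custom downloader middleware
# (the module path "proName.middlewares" and priority 542 are illustrative)
DOWNLOADER_MIDDLEWARES = {
    # lower numbers run closer to the engine, higher numbers closer to the downloader
    'proName.middlewares.RandomUserAgent': 542,
}
```

The priority number only matters relative to other downloader middlewares; any value that doesn't collide with a built-in you need to override is fine.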
III. Proxy pool
- Role: set a different proxy IP on as many of the Scrapy project's requests as possible.
- Operating procedure:
1. Intercept requests in the downloader middleware.
2. Change the IP of the intercepted request to a proxy IP.
3. Enable the downloader middleware in the configuration file (settings.py).
Code example:
import random

# Downloader middleware class that swaps in a proxy IP for each intercepted request
class Proxy(object):

    def process_request(self, request, spider):
        # Inspect the intercepted request's URL to decide whether its protocol
        # header is http or https; request.url returns e.g. http://www.xxx.com
        h = request.url.split(':')[0]  # the request's protocol header
        if h == 'https':
            ip = random.choice(PROXY_https)
            request.meta['proxy'] = 'https://' + ip
        else:
            ip = random.choice(PROXY_http)
            request.meta['proxy'] = 'http://' + ip
Proxy IPs to choose from:
PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]
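The protocol-splitting logic in process_request can be exercised outside Scrapy. The sketch below uses a hypothetical FakeRequest class as a stand-in for scrapy.Request (invented here purely for illustration) to show that an https URL receives a proxy from PROXY_https and an http URL one from PROXY_http:

```python
import random

PROXY_http = ['153.180.102.104:80', '195.208.131.189:56055']
PROXY_https = ['120.83.49.90:9000', '95.189.112.214:35508']

class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, for illustration only."""
    def __init__(self, url):
        self.url = url
        self.meta = {}

def assign_proxy(request):
    # Same protocol check as Proxy.process_request above
    h = request.url.split(':')[0]
    if h == 'https':
        request.meta['proxy'] = 'https://' + random.choice(PROXY_https)
    else:
        request.meta['proxy'] = 'http://' + random.choice(PROXY_http)
    return request

req = assign_proxy(FakeRequest('https://www.xxx.com'))
print(req.meta['proxy'])  # e.g. https://120.83.49.90:9000
```

Scrapy's HTTP(S) download handlers read `request.meta['proxy']` and route the request through that address, which is why setting the key is all the middleware needs to do.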