05. Scrapy framework: UA pool and proxy pool

I. Download middleware

(Figure: the Scrapy framework architecture)

In Scrapy, the download middleware (Downloader Middlewares) layer sits between the engine and the downloader.

  Role:

  1. While a request travels from the engine to the downloader, the middleware can apply a series of transformations to it, such as setting the request's User-Agent or attaching a proxy.

  2. After the download completes and the response is passed back to the engine, the middleware can apply a series of transformations to the Response, for example gzip decompression.

We mainly use download middleware to process requests, typically by setting a random User-Agent and a random proxy on each one, in order to defeat the target site's anti-crawling measures.
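The two directions described above correspond to the `process_request` and `process_response` hooks of a middleware class. The sketch below uses plain Python with duck-typed stand-ins (so it runs without Scrapy installed); the class name and User-Agent strings are illustrative, not from the original post:

```python
import random

# A minimal download-middleware sketch. In a real project this class would
# live in middlewares.py and receive scrapy.Request / scrapy.Response objects;
# here any object with a `headers` dict is enough to show the idea.
class DemoDownloaderMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # illustrative values
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        # Engine -> downloader direction: mutate the outgoing request,
        # e.g. set a random User-Agent. Returning None lets processing continue.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None

    def process_response(self, request, response, spider):
        # Downloader -> engine direction: post-process the response
        # (Scrapy's HttpCompressionMiddleware does gzip decompression here).
        return response
```

In Scrapy, returning `None` from `process_request` hands the request to the next middleware in the chain, while returning a Response or a new Request short-circuits it.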

II. UA pool (User-Agent pool)

  - Role: disguise as many of the project's requests as possible as coming from different types of browsers.

  - Procedure:

    1. Intercept the request in a download middleware.

    2. Tamper with the intercepted request's User-Agent header to disguise it.

    3. Enable the download middleware in the configuration file (settings.py).

Code example:

# import the required packages
# (modern import path; the old scrapy.contrib.* path is removed in recent Scrapy versions)
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random

# UA-pool code (encapsulated in a dedicated download-middleware class)
class RandomUserAgent(UserAgentMiddleware):

    def process_request(self, request, spider):
        # randomly pick a UA value from the list
        ua = random.choice(user_agent_list)
        # write the chosen UA into the intercepted request's headers
        request.headers.setdefault('User-Agent', ua)


user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
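Step 3 of the procedure, enabling the middleware, happens in the project's settings.py. A minimal sketch, assuming the project is named `proxy_pro` and the class sits in its middlewares.py (both names are placeholders for your own project):

```python
# settings.py (fragment)
# The key is the dotted path to the middleware class; the value is its
# priority, where lower numbers run closer to the engine.
DOWNLOADER_MIDDLEWARES = {
    'proxy_pro.middlewares.RandomUserAgent': 543,
}
```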

III. Proxy pool

  - Role: route as many of the project's requests as possible through different IP addresses.

  - Procedure:

    1. Intercept the request in a download middleware.

    2. Change the intercepted request's IP to a proxy IP.

    3. Enable the download middleware in the configuration file (settings.py).

Code example:

  

# replace the IP of intercepted requests in batches
# (encapsulated in a dedicated download-middleware class)
import random

class Proxy(object):
    def process_request(self, request, spider):
        # inspect the intercepted request's URL to determine the scheme (http or https)
        # request.url returns a value like: http://www.xxx.com
        h = request.url.split(':')[0]  # the request's scheme
        if h == 'https':
            ip = random.choice(PROXY_https)
            request.meta['proxy'] = 'https://' + ip
        else:
            ip = random.choice(PROXY_http)
            request.meta['proxy'] = 'http://' + ip

Selectable proxy IPs:

PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]
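Splitting the URL on ':' works for well-formed URLs, but the standard library's urllib.parse extracts the scheme more robustly (a bare split is easy to misuse once ports or userinfo appear in the URL). A small sketch of the same scheme check as a standalone helper; the function name is my own, not from the original post:

```python
import random
from urllib.parse import urlsplit

PROXY_http = ['153.180.102.104:80', '195.208.131.189:56055']
PROXY_https = ['120.83.49.90:9000', '95.189.112.214:35508']

def choose_proxy(url):
    # Pick a proxy matching the URL's scheme; urlsplit cleanly isolates
    # the scheme regardless of other ':' characters in the URL.
    scheme = urlsplit(url).scheme
    pool = PROXY_https if scheme == 'https' else PROXY_http
    return scheme + '://' + random.choice(pool)
```

Inside the middleware, the return value would be assigned to `request.meta['proxy']` exactly as in the class above.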

 


Origin www.cnblogs.com/zhaoyang110/p/11525263.html