Python web crawling with the Scrapy framework: countering anti-crawler mechanisms by disguising the User-Agent

User-Agent means "user agent", abbreviated as UA.

Role: it lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, and browser plug-ins.

Websites often serve different pages to different browsers and operating systems by inspecting the UA. But when a crawler requests a page frequently with the same User-Agent, the website's server can easily tell that we are a crawler bot and blacklist us. So we need to change the request headers frequently.
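Outside of Scrapy, the idea of changing the header on every request can be sketched with the standard library alone. This is only a minimal illustration: the function name build_request is made up for this example, and the two User-Agent strings are taken from the list used later in this post. No request is actually sent; the object is only built and inspected.

```python
import random
import urllib.request

# Small illustrative pool; a real project would keep a longer list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
]


def build_request(url):
    """Return a Request whose User-Agent is picked at random from the pool."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = build_request("http://httpbin.org/get")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

Calling build_request repeatedly will pick a header at random each time, which is the same thing the Scrapy middleware in step 1 does for every outgoing request.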

1. Configure random request headers in the middleware file (middlewares.py)

  The code is as follows:

import random


class DobanDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Every request passes through this method before it is sent to the server.
        # For that to happen, the middleware must be enabled in the settings file (see step 2).
        # Besides a hand-written list, you can also use the fake_useragent Python package,
        # which provides User-Agent strings for the browsers currently on the market.
        # Detailed usage reference: https://www.jianshu.com/p/74bce9140934
        MY_USER_AGENT = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        ]
        user_agent = random.choice(MY_USER_AGENT)  # a variable name cannot contain "-"
        request.headers['User-Agent'] = user_agent  # request.headers works like a dictionary
Note: if you do not know what a function or method does, besides searching Baidu you can place the cursor on it and press F12 to view its source (at least in VS Code).
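As a possible variation, the list does not have to live inside process_request. The sketch below keeps it in the project settings and loads it once through Scrapy's from_crawler hook. The class name SettingsUserAgentMiddleware and the setting name USER_AGENT_POOL are hypothetical names chosen for this example, not something Scrapy defines.

```python
import random


class SettingsUserAgentMiddleware:
    """Hypothetical variant that reads the User-Agent pool from the settings."""

    def __init__(self, user_agents):
        # Refuse to start with an empty pool instead of failing on the first request.
        if not user_agents:
            raise ValueError("USER_AGENT_POOL must contain at least one entry")
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, when it exists, to build the middleware.
        # USER_AGENT_POOL is an assumed custom setting, not a built-in one.
        return cls(crawler.settings.getlist("USER_AGENT_POOL"))

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

With this layout the strings would be listed under USER_AGENT_POOL in the settings file, so the pool can be changed without touching the middleware code.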
 
2. Add the following code to the settings.py file:
 
DOWNLOADER_MIDDLEWARES = {
    # The smaller the value, the higher the priority
    'doBan.middlewares.DobanDownloaderMiddleware': 543,  # DobanDownloaderMiddleware is the class name defined in the middlewares file
}
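Scrapy also ships its own UserAgentMiddleware, which fills in a default User-Agent. The custom middleware above overwrites that header anyway, so the following is optional; it is only a sketch of how a built-in middleware can be switched off by mapping it to None, if you prefer a single place that sets the header.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Custom middleware defined in middlewares.py.
    'doBan.middlewares.DobanDownloaderMiddleware': 543,
    # Mapping a built-in middleware to None disables it.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```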
 
3. Check whether the configuration works
  Here you can use this website: http://httpbin.org/get (it returns the request headers it received)
  The code in the spider file is as follows:
 
(The imports are omitted here; apart from the required parts, everything can be customized)
class DobanSpiderSpider(scrapy.Spider):
    name = 'doBan_spider'
    # allowed_domains = ['movie.douban.com']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        print(response.text)  # the output appears in the shell that VS Code opens automatically
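The response from httpbin.org/get is JSON, so instead of reading the whole text you can pull out just the echoed User-Agent. The helper below is a small sketch; the name extract_user_agent and the abridged sample body are made up for illustration. Inside parse you would pass it response.text.

```python
import json


def extract_user_agent(body):
    """Return the User-Agent echoed back by httpbin, or None if it is absent."""
    data = json.loads(body)
    return data.get("headers", {}).get("User-Agent")


if __name__ == "__main__":
    # Hand-written sample shaped like an httpbin.org/get response (abridged).
    sample = json.dumps({
        "args": {},
        "headers": {
            "Host": "httpbin.org",
            "User-Agent": "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        },
        "url": "http://httpbin.org/get",
    })
    print(extract_user_agent(sample))
```

Running the spider several times and comparing the extracted values is a quick way to confirm that the header really changes between requests.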

Origin www.cnblogs.com/RosemaryJie/p/12336662.html