Setting a Random User-Agent in Scrapy

The User-Agent (UA) is a special string in the HTTP request headers that lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, browser language, and browser plug-ins.

Some websites inspect the UA to serve different pages to different browsers and operating systems. The same check can be used to block crawlers, but a crawler can often get past it by disguising its UA.

A typical User-Agent string looks like this:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36
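
To make the structure of that string concrete, here is a minimal sketch that splits it into its product/version tokens and the platform comment. The helper name parse_user_agent and the regular expressions are illustrative assumptions, not how any particular server parses the header; real UA parsing has many more edge cases.

```python
import re

# The example User-Agent string from the text above.
UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/73.0.3683.75 Safari/537.36")


def parse_user_agent(ua):
    """Hypothetical helper: extract the first parenthesized comment
    (usually the platform) and all name/version product tokens."""
    platform = re.search(r"\(([^)]*)\)", ua)
    products = dict(re.findall(r"([A-Za-z]+)/([\d.]+)", ua))
    return {
        "platform": platform.group(1) if platform else None,
        "products": products,
    }


info = parse_user_agent(UA)
print(info["platform"])            # operating system / CPU details
print(info["products"]["Chrome"])  # browser version
```

A server that branches on the UA does something similar in spirit: it looks for tokens such as Chrome or Safari and the platform comment, then decides what to send back.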

This article shows how to rotate through a random User-Agent pool in Scrapy using the fake-useragent library.


  1. Install fake-useragent
    pip install fake-useragent

  2. Create a RandomUserAgentMiddlware class in the Scrapy project's middlewares.py:

    from fake_useragent import UserAgent


    class RandomUserAgentMiddlware(object):
        """Downloader middleware that sets a random User-Agent on each request."""

        def __init__(self, crawler):
            super(RandomUserAgentMiddlware, self).__init__()
            self.ua = UserAgent()
            # Which fake-useragent attribute to read, e.g. "chrome"; "random" is the default.
            self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def process_request(self, request, spider):
            def get_ua():
                # Read the attribute named by ua_type, e.g. self.ua.chrome.
                return getattr(self.ua, self.ua_type)

            # setdefault only sets the header if the request does not already have one.
            request.headers.setdefault('User-Agent', get_ua())
    
  3. In settings.py, set the RANDOM_UA_TYPE value:

    RANDOM_UA_TYPE = "chrome"
    

    Then add RandomUserAgentMiddlware to DOWNLOADER_MIDDLEWARES and disable Scrapy's built-in UserAgentMiddleware:

    DOWNLOADER_MIDDLEWARES = {
        'DouBanSpider.middlewares.RandomUserAgentMiddlware': 2,
        # Setting the value to None disables Scrapy's built-in User-Agent middleware.
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }
    
  4. Test:
    In the spider's parse callback (parse_book here), print the User-Agent value of each request:
    print(response.request.headers['User-Agent'])
    Complete code of the callback:

    def parse_book(self, response):
        item_loader = DoubanItemLoader(item=DoubanspiderItem(), response=response)
        url = response.url
        item_loader.add_value("url", url)
        item_loader.add_value("url_object_id", get_md5(url))
        item_loader.add_css("book_name", "#wrapper>h1>span:nth-child(1)::text")
    
        author = response.xpath("//*[@id='info']/span[1]/a[1]/text()").extract()
        if not author:
            author = response.xpath("//*[@id='info']/a/text()").extract()
        item_loader.add_value("author", author)
    
        item_loader.add_css("content", "#link-report .intro p::text")
        item_loader.add_xpath("comment", "//p[@class='comment-content']/span/text()")
        item_loader.add_value("crawl_time", datetime.now())
        douban_item = item_loader.load_item()
        
        print(response.request.headers['User-Agent'])  # set a breakpoint here to check whether the User-Agent changes on each request
        return douban_item
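
The rotation logic can also be checked without running a crawl. The sketch below mirrors the middleware's core line using only the standard library. StubUserAgent is a hypothetical stand-in for fake_useragent.UserAgent, its UA strings are placeholders, and a plain dict stands in for request.headers; none of this is the real library code.

```python
import random


class StubUserAgent:
    """Hypothetical stand-in for fake_useragent.UserAgent (placeholder strings)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._pool = {
            "chrome": ["chrome-ua-1", "chrome-ua-2", "chrome-ua-3"],
            "firefox": ["firefox-ua-1", "firefox-ua-2"],
        }

    def __getattr__(self, name):
        # Only called for attributes that are not found normally,
        # e.g. ua.chrome or ua.random.
        if name.startswith("_"):
            raise AttributeError(name)
        if name == "random":
            candidates = [ua for uas in self._pool.values() for ua in uas]
        elif name in self._pool:
            candidates = self._pool[name]
        else:
            raise AttributeError(name)
        return self._rng.choice(candidates)


def process_request(headers, ua, ua_type):
    """Same idea as the middleware: pick a UA by attribute name, set it if absent."""
    headers.setdefault("User-Agent", getattr(ua, ua_type))
    return headers


ua = StubUserAgent(seed=0)
seen = [process_request({}, ua, "chrome")["User-Agent"] for _ in range(50)]
print(sorted(set(seen)))  # the distinct values picked across 50 simulated requests

# A header that is already present is left untouched by setdefault.
preset = process_request({"User-Agent": "fixed"}, ua, "chrome")
print(preset["User-Agent"])
```

The second call shows why the middleware should run before anything else that sets the header: setdefault never overwrites an existing User-Agent.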
    

Test results:
[Figure: screenshot of the printed User-Agent values]
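
A note on the numbers used in DOWNLOADER_MIDDLEWARES: the following is a simplified model, not Scrapy's actual implementation, of how the project setting is combined with the built-in defaults. It assumes the built-in middleware paths and default orders shown in the base dict (400 for DefaultHeadersMiddleware, 500 for UserAgentMiddleware); the function name enabled_in_order is made up for this sketch.

```python
def enabled_in_order(base, custom):
    """Simplified model: merge {path: order} dicts, drop entries set to None,
    and return the remaining paths sorted by ascending order value."""
    merged = {**base, **custom}
    active = {path: order for path, order in merged.items() if order is not None}
    return sorted(active, key=active.get)


# Two of Scrapy's built-in downloader middlewares with their assumed default orders.
base = {
    "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
}

# The project setting from step 3.
custom = {
    "DouBanSpider.middlewares.RandomUserAgentMiddlware": 2,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}

order = enabled_in_order(base, custom)
for path in order:
    print(path)
```

Because process_request is called in ascending order, a value of 2 lets the random User-Agent be set before the other middlewares see the request, and None removes the built-in User-Agent middleware from the chain.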


Origin blog.csdn.net/zhaohaibo_/article/details/105242805