The User Agent, abbreviated UA, is a special string in the HTTP request header that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, and browser plug-ins.
Some websites inspect the UA to serve different pages to different browsers and operating systems. The same check can be used to block crawlers, but a crawler can bypass the detection by disguising its UA.
A typical User-Agent looks like this:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36
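To make "disguising the UA" concrete outside Scrapy, here is a minimal sketch using only the Python standard library. It builds a request carrying the example string above without sending it. The URL and the helper name build_request are assumptions for illustration only.

```python
from urllib.request import Request

# Example UA string quoted in the text above
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/73.0.3683.75 Safari/537.36"
)


def build_request(url, user_agent=CHROME_UA):
    """Build (but do not send) a request that carries a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL, used only to construct the request object
req = build_request("https://example.com/")

# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```

Without the explicit header, urllib would identify itself as Python-urllib, which is the kind of value a site can easily filter on.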
This article shows one way to give Scrapy a random user agent pool, using the fake-useragent library.
-
Install fake-useragent
pip install fake-useragent
-
Create a RandomUserAgentMiddlware class in the Scrapy project's middlewares.py:
from fake_useragent import UserAgent


class RandomUserAgentMiddlware(object):
    # Randomly rotate the User-Agent
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        # "random" is the default when RANDOM_UA_TYPE is not set
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            # Read the attribute of self.ua named by ua_type
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())
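The two mechanisms the middleware relies on are the getattr lookup driven by a settings value, and setdefault, which only fills in the header when it is missing. They can be tried in isolation. The sketch below is an assumption-laden stand-in: StubUserAgent is a hypothetical replacement for fake_useragent.UserAgent with a two-entry pool, and a plain dict stands in for Scrapy's request headers.

```python
import random


class StubUserAgent:
    """Hypothetical stand-in for fake_useragent.UserAgent (not the real library)."""

    POOL = {
        "chrome": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/73.0.3683.75 Safari/537.36"
        ),
        "firefox": "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
    }

    def __getattr__(self, name):
        # Only called for names not found through normal attribute lookup
        if name == "random":
            return random.choice(list(self.POOL.values()))
        try:
            return self.POOL[name]
        except KeyError:
            raise AttributeError(name)


def apply_user_agent(headers, ua, ua_type="random"):
    """Mirror the middleware: set User-Agent only if it is not already present."""
    headers.setdefault("User-Agent", getattr(ua, ua_type))
    return headers


ua = StubUserAgent()
fresh = apply_user_agent({}, ua, "chrome")
preset = apply_user_agent({"User-Agent": "my-crawler/1.0"}, ua, "chrome")

print(fresh["User-Agent"])   # filled in from the pool
print(preset["User-Agent"])  # existing value is left untouched
```

Because setdefault never overwrites, the middleware has to run before anything else sets the header, which is why it is registered with a low order number.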
-
Set the RANDOM_UA_TYPE value in settings.py:
RANDOM_UA_TYPE = "chrome"
Then add RandomUserAgentMiddlware to DOWNLOADER_MIDDLEWARES and disable Scrapy's built-in UserAgentMiddleware:
DOWNLOADER_MIDDLEWARES = {
    'DouBanSpider.middlewares.RandomUserAgentMiddlware': 2,
    # Setting a middleware to None disables it
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
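As a rough illustration of what these numbers mean: Scrapy merges the project setting with its built-in defaults, drops entries set to None, and calls process_request in ascending order. The sketch below simulates that merge in plain Python. BASE is only a three-entry subset of Scrapy's defaults, and effective_order is a hypothetical helper, not a Scrapy function.

```python
# Subset of Scrapy's built-in downloader middleware defaults (illustrative only)
BASE = {
    "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
}

# Project-level setting, as in settings.py
PROJECT = {
    "DouBanSpider.middlewares.RandomUserAgentMiddlware": 2,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}


def effective_order(base, project):
    """Merge the two dicts, drop disabled entries, sort by order value."""
    merged = {**base, **project}
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


order = effective_order(BASE, PROJECT)
for path in order:
    print(path)
```

With order 2, the random user agent middleware sees each request before the default headers middleware does, so its setdefault call finds no User-Agent yet.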
-
Test:
In the spider's parse method, print the User-Agent value of each request:
print(response.request.headers['User-Agent'])
Complete code of the parse method:
def parse_book(self, response):
    # Requires "from datetime import datetime" at the top of the spider module
    item_loader = DoubanItemLoader(item=DoubanspiderItem(), response=response)
    url = response.url
    item_loader.add_value("url", url)
    item_loader.add_value("url_object_id", get_md5(url))
    item_loader.add_css("book_name", "#wrapper>h1>span:nth-child(1)::text")
    author = response.xpath("//*[@id='info']/span[1]/a[1]/text()").extract()
    if not author:
        # Fall back to the alternative page layout
        author = response.xpath("//*[@id='info']/a/text()").extract()
    item_loader.add_value("author", author)
    item_loader.add_css("content", "#link-report .intro p::text")
    item_loader.add_xpath("comment", "//p[@class='comment-content']/span/text()")
    item_loader.add_value("crawl_time", datetime.now())
    douban_item = item_loader.load_item()
    # Set a breakpoint here to check whether the User-Agent changes on each request
    print(response.request.headers['User-Agent'])
    return douban_item
Test Results: