User-Agent (user agent), abbreviated as UA, is a request header that identifies the client.
Role: it lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, and browser plug-ins.
Websites often check the UA to send different pages to different browsers and operating systems. But when a crawler requests pages frequently with the same User-Agent, the server can easily tell that the client is a bot and blacklist it. So we need to change the request headers frequently.
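To make the server side concrete, here is a minimal sketch of how a site might guess the platform from a User-Agent string by plain substring matching. The function name guess_platform and the matching rules are assumptions made for this illustration; real sites use much more detailed detection.

```python
def guess_platform(user_agent: str) -> str:
    """Very rough platform guess from a User-Agent string (illustrative only)."""
    ua = user_agent.lower()
    if "windows" in ua:
        return "windows"
    # Android UAs usually also contain "Linux", so check Android first.
    if "android" in ua:
        return "android"
    # iPhone/iPad UAs usually contain "like Mac OS X", so check them before macOS.
    if "iphone" in ua or "ipad" in ua:
        return "ios"
    if "mac os x" in ua:
        return "macos"
    if "linux" in ua or "x11" in ua:
        return "linux"
    return "unknown"


print(guess_platform("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"))
```

A server that sees thousands of requests with exactly the same string can apply the same kind of check to flag that client, which is why the crawler rotates the header.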
1. Configure random request headers in the middleware file (middlewares.py)
The code is as follows:
import random


class DobanDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Every request passes through this method on its way to the server. The middleware only takes effect once it is enabled in the settings file.
        # Besides a hand-written list like this one, you can use the fake_useragent Python package, which provides User-Agent strings for the browsers in common use.
        # For detailed usage, see: https://www.jianshu.com/p/74bce9140934
        MY_USER_AGENT = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1"
        ]
        user_agent = random.choice(MY_USER_AGENT)  # a variable name cannot contain "-"
        request.headers['User-Agent'] = user_agent  # request.headers works like a dictionary
Note: if you are not sure what a function or method does, besides searching Baidu you can place the cursor on it and press F12 to view its source (at least in VS Code).
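The comment in the middleware mentions the fake_useragent package as an alternative to a hand-written list. The following is a minimal sketch of that idea, not the original author's code: the names FALLBACK_USER_AGENTS, random_user_agent and RandomUserAgentMiddleware are hypothetical, and the sketch assumes that fake_useragent exposes a UserAgent class with a random attribute. If the package is missing or fails to load its data, it falls back to two entries copied from MY_USER_AGENT.

```python
import random

# Fallback entries copied from MY_USER_AGENT, used when fake_useragent is unavailable.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
]

try:
    # Third-party package (pip install fake-useragent); it is not part of Scrapy.
    from fake_useragent import UserAgent
    _generator = UserAgent()
except Exception:
    # Package missing, or its browser data could not be loaded.
    _generator = None


def random_user_agent():
    """Return a User-Agent string, preferring fake_useragent when it works."""
    if _generator is not None:
        try:
            return str(_generator.random)
        except Exception:
            pass  # fall through to the hand-written list
    return random.choice(FALLBACK_USER_AGENTS)


class RandomUserAgentMiddleware:
    """Hypothetical variant of DobanDownloaderMiddleware."""

    def process_request(self, request, spider):
        # Returning None lets Scrapy continue processing the request.
        request.headers["User-Agent"] = random_user_agent()
```

Creating the generator once at import time, rather than inside process_request, avoids reloading the browser data for every request.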
2. Add the following code to the settings.py file:
DOWNLOADER_MIDDLEWARES = {
    # The smaller the value, the higher the priority
    # DobanDownloaderMiddleware is the middleware class name defined in middlewares.py
    'doBan.middlewares.DobanDownloaderMiddleware': 543,
}
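To see what the priority number means in practice, the sketch below sorts a middleware dictionary the way the setting is described: entries with smaller numbers have their process_request called earlier, and an entry set to None is disabled. The helper request_order is hypothetical, and the two built-in entries with their numbers (500 and 550) are assumptions that should be checked against DOWNLOADER_MIDDLEWARES_BASE in your Scrapy version.

```python
# Illustrative subset of downloader middlewares; the built-in order values are assumptions.
middlewares = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "doBan.middlewares.DobanDownloaderMiddleware": 543,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
}


def request_order(mw):
    """Return the enabled middleware paths in the order process_request would run."""
    enabled = {path: order for path, order in mw.items() if order is not None}
    return sorted(enabled, key=enabled.get)


for path in request_order(middlewares):
    print(path)
```

Under these assumed numbers, the custom middleware at 543 runs after the built-in User-Agent middleware, so the value it writes into the header is the one that gets sent.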
3. Check whether the configuration works
You can use this website for the check: http://httpbin.org/get (it returns the headers of the request it received).
The code in the spider file is as follows:
(Everything except the necessary parts can be customized.)
import scrapy


class DobanSpiderSpider(scrapy.Spider):
    name = 'doBan_spider'
    # allowed_domains = ['movie.douban.com']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        print(response.text)
        # The response text is printed to the terminal (in VS Code, the terminal panel opens automatically)
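Rather than reading the whole printed response, you can pull out just the User-Agent that the server saw. httpbin.org/get replies with JSON that includes a "headers" object, so a minimal sketch looks like the following. The helper extract_user_agent and the shortened sample_body are hypothetical and used only for illustration.

```python
import json


def extract_user_agent(body: str) -> str:
    """Return the User-Agent that httpbin reports in its JSON reply."""
    data = json.loads(body)
    return data["headers"]["User-Agent"]


# Hypothetical, shortened response body used only for illustration.
sample_body = (
    '{"args": {}, "headers": {"Host": "httpbin.org", '
    '"User-Agent": "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ '
    '(KHTML, like Gecko, Safari/419.3) Arora/0.6"}, '
    '"url": "http://httpbin.org/get"}'
)
print(extract_user_agent(sample_body))
```

In the spider, parse could call extract_user_agent(response.text); running the spider several times should then show different values if the middleware is active.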