The proxy problem in Python crawlers

For programmers who write crawlers, anti-crawling countermeasures are unavoidable. Adding a request header is the simplest one; another is using proxy IPs, which simply means accessing the target website from different IP addresses while you collect data. Many people ask why a proxy is needed at all, and whether the job can be done without one, especially since after switching to a proxy the crawl often slows down so much that it is tempting to give the proxy up. The answer shows up once you scrape a certain volume of data, or collect in bulk: the program starts reporting errors from time to time, and the frequency keeps climbing. That means your crawler has been recognized, and the other side's anti-crawling system has already remembered you. Usually it tells you the connection timed out or was interrupted; worse, instead of simply cutting your program off, it may feed you fake data or lead your crawler into an endless loop. There are many such anti-crawling countermeasures; this post introduces one of them.
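Before moving on to request headers, here is a minimal sketch of the proxy side: routing a request through a proxy with requests. The proxy address 127.0.0.1:8080 is only a placeholder; replace it with a proxy host and port you actually have access to.

import requests

# Placeholder proxy address; substitute a real proxy you control.
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}

# The target site now sees the proxy's IP instead of yours.
resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(resp.text)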

User-Agent

This can be understood as making your program look like a browser when it visits a website, so that the site is less likely to recognize it as a crawler. So what exactly is a User-Agent?
User-Agent is a special string header, widely used to carry information about the browser client, so that the server can identify the operating system and version, CPU type, browser and browser version, rendering engine, browser language, and so on that the client is using.
Different browsers (IE, Firefox, Opera, Chrome, etc.) use different User-Agent strings as their identity, and when search engines (Google, Yahoo, Baidu, Bing) access web pages through their crawlers, they also mark themselves with a User-Agent string. This is why a website's statistics report can count browser traffic, crawler traffic, and so on. Websites use the UA to learn about the client and how their content is displayed on it; some sites even inspect the UA and send different pages to different operating systems and browsers so that the content displays correctly in each.
For more explanation about User-Agent, please refer to: User Agent Learning
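To make the idea concrete, here is a small sketch of sending a hand-written User-Agent with requests. httpbin.org/headers simply echoes back the headers it receives, so the response shows exactly what the server saw; the UA string below is just an example value.

import requests

# Example Chrome-style User-Agent string; any browser-like value works here.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}

# httpbin echoes the request headers, so we can check the UA the server received.
resp = requests.get('https://httpbin.org/headers', headers=headers, timeout=10)
print(resp.json()['headers']['User-Agent'])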

1. Get random request headers

For details, please refer to the official website: fake-headers 1.0.2

    1. Install fake_headers:
pip install fake_headers
    2. Import fake_headers and generate random headers:
>>> from fake_headers import Headers
>>> headers = Headers(headers=True).generate()
>>> headers
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US;q=0.5,en;q=0.3', 'Upgrade-Insecure-Requests': '1', 'Referer': 'https://google.com', 'Pragma': 'no-cache'}
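The generated dict can be passed straight to a request. A minimal sketch, using httpbin.org as a stand-in target; each call to generate() produces a fresh random set of headers.

import requests
from fake_headers import Headers

headers = Headers(headers=True).generate()  # fresh random headers on every call
resp = requests.get('https://httpbin.org/headers', headers=headers, timeout=10)
print(resp.status_code)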

If you don't want to generate a full set of headers, there is also a package that generates just the User-Agent!

2. Get a random User-Agent and use it

For details, please refer to the official website: fake-useragent 0.1.11

  1. Install fake_useragent:
pip install fake_useragent
  2. Import fake_useragent and generate a random User-Agent:
>>> from fake_useragent import UserAgent
>>> UserAgent().random
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'
  3. Use the random User-Agent in a request:
'''
Put the generated random User-Agent into a headers dict and pass it to requests.get()
'''
>>> import requests
>>> headers = {'User-Agent': UserAgent().random}
>>> url = 'https://blog.csdn.net/Lin_Hv/article/details/106119568'
>>> req = requests.get(url=url, headers=headers)
>>> req
<Response [200]>
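In practice you usually want a fresh User-Agent for each request rather than reusing one string for the whole crawl. A small sketch of that idea; the URL list is just a placeholder for whatever pages you are collecting.

import requests
from fake_useragent import UserAgent

ua = UserAgent()
urls = ['https://httpbin.org/headers'] * 3  # placeholder pages to fetch

for url in urls:
    headers = {'User-Agent': ua.random}  # new random UA for every request
    resp = requests.get(url, headers=headers, timeout=10)
    print(resp.status_code, headers['User-Agent'][:40])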

Source: blog.csdn.net/Lin_Hv/article/details/109090731