A simple solution to crawler IP bans

Crawlers used to sound powerful and mysterious. Used well, a crawler can power a search engine like Google or Baidu; used badly, inappropriately high concurrency can bring a small website down in minutes. Writing this, I think of the number of concurrent requests 12306 has to handle every year, which is impressive.

Crawling and anti-crawling have always been an arms race: as the saying goes, as virtue rises one foot, the demon rises ten. Anti-crawler techniques raise the difficulty of crawling, and crawling any site becomes a battle of wits with its webmaster. There are plenty of possible countermeasures; what follows are the "simple" ones, based on fairly basic methods that you can get started with in minutes.

user_agent masquerading and rotation

Different browsers and browser versions have different user_agent strings; the user_agent describes the client in detail and is an important header in every HTTP request. We can supply a different user_agent for each request to bypass anti-crawling mechanisms that inspect the client. For example, put a number of user_agent strings in a list and pick one at random for each request. This website collects user_agent strings of all kinds:

http://www.useragentstring.com/
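A minimal sketch of this idea, assuming the requests library; the user_agent strings and URL below are only placeholders:

    import random
    import requests

    # A small pool of user_agent strings; any list of real browser strings will do.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    ]

    def fetch(url):
        # Pick a random user_agent for every request.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        return requests.get(url, headers=headers, timeout=10)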

Recently I also came across an open source library that provides disguised browser identities. Its name says it all:

fake-useragent
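If that library is installed (pip install fake-useragent), usage looks roughly like this; ua.random is expected to return a different real-world user_agent string on each call:

    from fake_useragent import UserAgent
    import requests

    ua = UserAgent()
    # Ask fake-useragent for a random real-world user_agent string.
    headers = {"User-Agent": ua.random}
    resp = requests.get("http://example.com", headers=headers, timeout=10)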

Using and rotating proxy IPs

Checking the visiting IP is websites' favorite anti-crawling mechanism. When it kicks in, you can switch to a different IP address and keep crawling. If you happen to have many hosts or VPSes with public IPs, that is the better option; if not, consider using a proxy: the proxy server fetches the page content for you and forwards it back to your machine (a code sketch follows the list below). By transparency, proxies fall into transparent, anonymous, and high-anonymity proxies:

  • Transparent proxy: the target website knows you are using a proxy and also knows your source IP address. This obviously defeats the purpose of using a proxy here.
  • Anonymous proxy: a lower degree of anonymity; the website knows you are using a proxy, but does not know your source IP address.
  • High-anonymity proxy: the safest option; the target website knows neither that you are using a proxy nor your source IP address. You can buy such proxies, or crawl them from the many free proxy list sites, though free proxies are usually not very stable.
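A minimal sketch of proxy rotation with the requests library; the proxy addresses below are hypothetical and would come from a purchased pool or a free proxy list you crawled:

    import random
    import requests

    # Hypothetical proxy pool; fill it with proxies you bought or crawled.
    PROXIES = [
        "http://10.10.1.10:3128",
        "http://10.10.1.11:3128",
    ]

    def fetch_via_proxy(url):
        proxy = random.choice(PROXIES)
        # requests sends the request through the chosen proxy server,
        # so the target site sees the proxy's IP instead of yours.
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)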

Set the access interval

Many websites' anti-crawling mechanisms also enforce an access rate limit: if one IP exceeds the allowed number of requests within a short time, it enters a "cooldown" period. So besides rotating IPs and user_agents, you can lengthen the interval between requests, sleeping for a random time after each page:

    import time
    import random

    time.sleep(random.random() * 3)

For a crawler this is the more responsible approach: since crawling can put load on the target website, slowing down not only helps keep your IP from being blocked but also reduces the pressure on the other side's servers.
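As a sketch, the same idea inside a simple crawl loop (crawl and urls are made-up names for illustration):

    import random
    import time
    import requests

    def crawl(urls):
        for url in urls:
            resp = requests.get(url, timeout=10)
            # ... parse resp.text here ...
            # Sleep 0-3 seconds before the next page to keep the request rate low.
            time.sleep(random.random() * 3)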

