Anti-crawler mechanisms (Part 1)

If you crawl for long enough, you will eventually get banned. - Lu Xun

Some sites, especially some dusty old ones, have no anti-crawler mechanism at all. We can crawl them to our heart's content and scrape every last byte of data. Out of basic courtesy, though, we should still crawl slowly and avoid putting too much pressure on their servers. But on sites that do have anti-crawler mechanisms, we can't be so carefree.

UA check

The simplest anti-crawler mechanism is checking the UA (User-Agent). When a browser sends a request, it attaches some parameters describing the browser and the current system environment for the server; this data sits in the header of the HTTP request.

What we have to do is set the UA of our crawler's requests. Third-party HTTP libraries generally send a default UA of their own; using that default is equivalent to announcing "I am a crawler, come ban me!" Some sites won't even let you in without a UA set. Setting the UA with the requests library is simple.

import requests

def download_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
    }
    data = requests.get(url, headers=headers)
    return data

Of course, if we visit the same site repeatedly while always using the same UA, that won't do either. We can build a pool of UAs and randomly pick one for each request, as in the sketch below.
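
A minimal sketch of the idea, extending download_page above; the two UA strings here are just examples, and any realistic browser UAs would do:

import random

import requests

# A small example pool of UA strings; in practice you would collect many more.
UA_POOL = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
]

def download_page(url):
    # Pick a random UA for each request so repeated visits look like different browsers.
    headers = {'User-Agent': random.choice(UA_POOL)}
    return requests.get(url, headers=headers)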

 

Access frequency limits

In general, a human browsing a site does so fairly slowly, but a crawler does not. If someone visits the same site 100 times a second, it is almost certainly a crawler. Generally, there are two ways to deal with this situation.

 

The first approach is very simple: since visiting too fast gets us banned, we just visit slowly enough. We can call time.sleep after each completed request to limit our crawling speed. Ideally, use one machine to probe from slow to fast, find the threshold at which it gets banned, and then crawl slightly slower than that.
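
A minimal throttling sketch, assuming a hypothetical list of urls and a base delay chosen a little below the ban threshold you probed:

import random
import time

import requests

def crawl_slowly(urls, base_delay=2.0):
    for url in urls:
        data = requests.get(url)
        # ... process data here ...
        # Sleep after every request; a bit of random jitter looks less mechanical.
        time.sleep(base_delay + random.uniform(0, 1))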

 

The second method is to change our IP. Websites generally identify visitors by IP, so if we keep changing our IP, we can masquerade as different visitors. The same IP visiting 100 times a second is clearly abnormal, but 100 IPs each visiting once a second is perfectly fine. So how do we change IP? In fact, we don't actually change our own IP; we forward our requests through proxy IPs. Many sites offer lots of free proxy IPs, and we can scrape them down for a rainy day. However, free proxy IPs tend to be short-lived, so they need to be tested from time to time. Setting a proxy IP in requests is simple.

proxies = {"http": "http://42.228.3.155:8080"}
requests.get(url, proxies=proxies)
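
Because free proxies die quickly, it helps to weed out dead ones before use. A minimal liveness-check sketch; the test URL (httpbin.org echoes the requesting IP) and timeout are arbitrary choices:

import requests

def is_proxy_alive(proxy, test_url="http://httpbin.org/ip", timeout=5):
    # Route one quick request through the proxy; treat any failure as dead.
    try:
        resp = requests.get(test_url, proxies={"http": proxy}, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

candidate_proxies = ["http://42.228.3.155:8080"]  # e.g. the proxies scraped earlier
live_proxies = [p for p in candidate_proxies if is_proxy_alive(p)]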

 

CAPTCHA

Some sites, no matter what you do, or at least when you log in or visit certain pages, require you to enter a CAPTCHA. In this case, we must recognize the CAPTCHA before we can crawl the site's content. Simple letter-and-digit CAPTCHAs can be recognized with OCR; others, such as slider CAPTCHAs, need different tricks to crack, and we won't go into detail on those here.
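
For the simple letter-plus-digit kind, a minimal OCR sketch, assuming the Pillow and pytesseract packages (and the Tesseract engine) are installed, and that captcha.png is an image already downloaded from the site:

from PIL import Image
import pytesseract

# Load the downloaded CAPTCHA image.
image = Image.open('captcha.png')
# Convert to grayscale; even simple preprocessing often improves OCR accuracy.
text = pytesseract.image_to_string(image.convert('L'))
print(text.strip())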

 

Login authentication

In many cases a login exists to serve the site's own features, but it doubles as an anti-crawler measure. We can press F12 to open the browser's developer tools, observe what data the site sends when logging in, and then simulate the login by sending the corresponding request with requests. If I have time later, I'll write a dedicated article explaining this in detail.
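
As a taste of it, a minimal sketch of a simulated form login; the URL and field names here are hypothetical placeholders, and the real ones are exactly what you would copy from the login request seen in the developer tools:

import requests

session = requests.Session()
# Placeholder URL and field names; replace with what the developer tools show.
payload = {'username': 'your_name', 'password': 'your_password'}
session.post('http://example.com/login', data=payload)
# The Session keeps the login cookies, so later requests are authenticated.
page = session.get('http://example.com/protected_page')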
