Python Crawlers (1): Crawler Camouflage

1 Introduction

Sites of a certain size, or sites whose content is fairly profitable, almost always put some anti-crawling measures in place. These measures generally fall into two categories: one is to verify the visitor's identity and keep crawlers out at the door; the other is to set up a variety of anti-crawling mechanisms on the site itself that make it hard for a crawler to obtain and parse the returned data.

2 Camouflage Strategies

We know that even fairly small sites usually check the identity of visitors, for example by validating the request headers, and sites of any real scale do so all the more. Therefore, in order to successfully crawl the data we need, we have to let the crawler disguise itself; simply put, we make the crawler's behavior look like a normal user visiting the site.

2.1 Request Headers problem

To demonstrate, I searched Baidu for the 163 mailbox.

Then I used the F12 developer tools to inspect the request information.

In the figure above, we can see that the request headers contain two attributes, Referer and User-Agent. The role of Referer is to tell the server which page the current request was linked from. User-Agent (the "user agent") is a special string header that allows the server to identify the user's operating system, CPU type, browser, and other information. The usual handling strategies are: 1) add a Referer for sites that check it; 2) add a User-Agent to every request.
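As a minimal sketch of these two strategies using only the standard library's urllib (the target URL, Referer, and User-Agent string below are placeholders for illustration, not values taken from the demonstration above):

```python
import urllib.request

# Hypothetical target URL for illustration only
url = "https://www.example.com/search?q=163"

# Disguise the crawler: add a Referer and a browser-like User-Agent
headers = {
    "Referer": "https://www.example.com/",
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/77.0.3865.90 Safari/537.36"),
}

req = urllib.request.Request(url, headers=headers)
# The request object now carries both headers; send it with
# urllib.request.urlopen(req) when actually crawling.
print(req.get_header("User-agent"))
```

Note that urllib normalizes header names, so `get_header("User-agent")` is the lookup key even though we set `"User-Agent"`.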

2.2 IP restrictions

Sometimes we crawl certain sites over a long period or on a large scale, and if we do not change our IP while crawling, a site that monitors the access frequency of each IP may, once some threshold is exceeded, identify us as a crawler and block us. In this case we have to adopt a strategy of intermittent access.
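A minimal sketch of intermittent access: sleep for a random interval between requests so the access frequency stays below a plausible threshold (the URL list and delay bounds here are assumptions for illustration):

```python
import random
import time

# Hypothetical list of pages to crawl
urls = ["https://www.example.com/page/%d" % i for i in range(1, 4)]

for url in urls:
    # The actual fetch would go here, e.g. urllib.request.urlopen(url)
    print("crawling", url)
    # Pause 1-3 seconds between requests to mimic a human visitor
    time.sleep(random.uniform(1, 3))
```

A random delay is preferable to a fixed one, since a perfectly regular interval is itself a telltale sign of a crawler.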

Usually we do not change IP while crawling, but there may be special cases, such as crawling one website continuously for a long time, where we need to use IP proxies. This approach generally increases our expenses; that is, it is likely to cost money.
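When intermittent access is not enough, an IP proxy can be plugged into urllib like this (the proxy address below is a placeholder, not a real service; a paid proxy provider would supply working addresses):

```python
import urllib.request

# Placeholder proxy address for illustration only
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
})
opener = urllib.request.build_opener(proxy_handler)

# Install globally so every urlopen() call goes through the proxy,
# or call opener.open(url) directly for individual requests.
urllib.request.install_opener(opener)
```

Rotating through a pool of such proxies spreads requests across many IPs, so no single address crosses the site's frequency threshold.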

3 Summary

Sometimes, even after we have disguised the request headers while crawling, we still do not succeed, and the results may show the following symptoms: the information obtained is incomplete, irrelevant information is obtained, or no information is obtained at all. In such cases we need to study the site's anti-crawling mechanism and analyze it in detail. I have listed several common cases:

1) Irregular information: the URL contains a long string of information with no apparent pattern. This is usually solved with Selenium (simulating a browser, at lower efficiency);
2) Dynamic check codes: for example, codes generated by custom rules based on the time and some other numbers; in this case we need to find the rule in order to crack it;
3) Dynamic interaction: the page requires interaction to pass verification; this can be solved with Selenium;
4) Asynchronous loading in batches: in this case the information obtained may be incomplete; this can also be solved with Selenium.


Origin www.cnblogs.com/ityard/p/11621311.html