Common anti-crawler mechanisms and how to deal with them

Anti-crawler based on Headers:
Checking the Headers of incoming requests is the most common anti-crawler strategy. Many sites inspect the User-Agent in the Headers, and some also inspect the Referer (some resource sites check the Referer for anti-hotlinking). If you run into this kind of anti-crawler mechanism, you can add Headers to the crawler directly: copy the browser's User-Agent into the crawler's Headers, or set the Referer to the target site's domain. For anti-crawlers that check Headers, modifying or adding the right Headers is usually enough to bypass them.
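A minimal sketch of this approach using the requests library; the URL and the header values below are placeholders for illustration, not taken from any particular site:

import requests

# Copy a real browser's User-Agent and set the Referer to the target domain
# so the server sees what looks like an ordinary browser visit.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/90.0.4430.93 Safari/537.36"),
    "Referer": "https://www.example.com/",  # hypothetical target domain
}

response = requests.get("https://www.example.com/page", headers=headers)
print(response.status_code)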
Anti-crawler based on user behavior:
Some sites detect user behavior instead, for example the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period. Most sites are in the first case, which can be solved with IP proxies. You can write a dedicated crawler that scrapes public proxy IPs from the web, tests them, and saves all the working ones. Such proxy-IP crawlers are needed often, so it is best to maintain one yourself. Once you have a large pool of proxy IPs, you can switch to a different IP every few requests, which is easy to do with requests or urllib2, and this easily bypasses the first kind of anti-crawler. For the second case, you can wait a random interval of a few seconds after each request before sending the next one. On some sites with weak logic, the limit that the same account may not make the same request many times in a short period can be bypassed by making several requests, logging out, logging back in, and continuing to request.
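A rough sketch of rotating proxy IPs and adding random delays with requests; the proxy addresses and URLs are made-up placeholders, and the pool would come from your own proxy-scraping crawler:

import random
import time
import requests

# Hypothetical pool of tested public proxy IPs collected by a separate crawler.
proxies_pool = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]

urls = ["https://www.example.com/page/%d" % i for i in range(1, 6)]  # placeholder URLs

for i, url in enumerate(urls):
    # Switch to a different proxy every few requests.
    proxy = proxies_pool[(i // 3) % len(proxies_pool)]
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        print(url, resp.status_code)
    except requests.RequestException as exc:
        print("request failed, try another proxy:", exc)
    # Wait a random few seconds between requests to avoid the frequency check.
    time.sleep(random.uniform(2, 5))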
Anti-crawler on dynamic pages:
The situations above mostly appear on static pages, but on some sites the data we want to crawl is obtained through ajax requests or generated by JavaScript. First, analyze the network traffic with Fiddler. If you can find the ajax request and work out the meaning of its parameters and of the response, you can use the method above: simulate the ajax request directly with requests or urllib2 and parse the JSON response to get the data you want.
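A minimal sketch of replaying such an ajax request with requests; the endpoint, parameters, and response field names here are assumptions for illustration, since the real ones depend on what Fiddler shows for the target site:

import requests

# Hypothetical ajax endpoint and parameters discovered by inspecting the
# network traffic in Fiddler; the real names depend on the target site.
ajax_url = "https://www.example.com/api/list"
params = {"page": 1, "size": 20}
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0"}

resp = requests.get(ajax_url, params=params, headers=headers)
data = resp.json()                    # the response body is JSON
for item in data.get("items", []):    # "items" is an assumed field name
    print(item)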
Being able to simulate the ajax request directly to get the data is of course ideal, but on some sites every parameter of the ajax request is encrypted, so we simply cannot construct the request we need. In that case, use selenium + PhantomJS: it drives a browser engine and uses PhantomJS to execute JavaScript, simulating human actions and triggering the page's js scripts. From filling in forms to clicking buttons to scrolling the page, everything can be simulated, without worrying about the specific request and response process; it simply acts out a person browsing the page and then takes the data. This framework can bypass most anti-crawler mechanisms, because rather than disguising itself as a browser to fetch the data (which is what adding Headers above does, to a certain extent), it is itself a browser (PhantomJS is just a browser without a user interface); the only thing it is not is a human operating that browser. With selenium + PhantomJS you can do many things, for example handling touch-based (12306) or slider captchas, or brute-forcing page forms.
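A rough sketch of driving PhantomJS through selenium; note that the PhantomJS driver exists only in older selenium releases (newer ones would use headless Chrome or Firefox instead), and the URL and element ids below are assumptions:

from selenium import webdriver

# PhantomJS support was removed in selenium 4; this assumes an older release.
driver = webdriver.PhantomJS()
driver.get("https://www.example.com/login")  # placeholder URL

# Fill a form, click a button, and scroll, as a real user would;
# the element ids here are assumptions.
driver.find_element_by_id("username").send_keys("user")
driver.find_element_by_id("password").send_keys("pass")
driver.find_element_by_id("submit").click()
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

html = driver.page_source  # the fully rendered page, after js has run
driver.quit()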
