How to get past a website's anti-crawling mechanisms

Common anti-scraping measures on the web currently include the following:
1) data encryption;
2) access-frequency limits;
3) presenting data in non-text form;
4) code protection;
5) cookie verification.
This article mainly explores how to get around access-frequency limits.
How access-frequency limiting works:
A server-side program (a WAF, for example) keeps a request count for each client IP. If a client's request frequency exceeds a threshold, its requests are blocked, which usually shows up in one of the following ways:
1) most commonly, the server returns a 403 or 503 error;
2) the connection is reset;
3) most troublesome of all, the server returns invalid content.
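
The per-IP counting described above can be sketched as a small sliding-window limiter. This is only an illustration under assumptions: the class name `PerIpRateLimiter`, the sliding-window approach, and the threshold of 20 requests per 60 seconds (borrowed from the example later in this article) are all hypothetical, and a real WAF may count requests differently.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Toy sliding-window counter keyed by client IP (hypothetical, not a real WAF)."""

    def __init__(self, max_requests=20, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recently allowed requests

    def allow(self, ip, now=None):
        """Return True if the request may pass, False if it should be blocked."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # a real server might answer 403/503 or reset the connection
        hits.append(now)
        return True


limiter = PerIpRateLimiter(max_requests=20, window_seconds=60.0)
# 21 requests from one IP at the same instant: the first 20 pass, the 21st is blocked.
results = [limiter.allow("203.0.113.7", now=0.0) for _ in range(21)]
print(results.count(True), results[-1])  # 20 False
```

Because the counter is kept per IP, a second client is unaffected by the first one's traffic, and a blocked client is let through again once its old requests leave the window.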

Ways around it:
1) Crawl through HTTP proxies. Because the server limits requests per IP, using proxies spreads the downloads across multiple IPs. Note that transparent proxies are usually ineffective, because the WAF can still detect the real source IP, so use anonymous (high-anonymity) proxies.
2) Add a delay between requests. For example, if the WAF limits a single IP to 20 requests per minute, adding a 5-second delay between requests gives a download rate of 12 requests per minute, which will not be blocked.
We usually combine methods 1) and 2), which both avoids blocking and speeds up collection. For example, with 10 proxies and a 5-second delay between downloads on each proxy, the actual throughput is 10 × 12 = 120 downloads per minute.
3) Use search engine caches (Google, Bing, Baidu). This is a roundabout ("Quxianjiuguo") strategy: bypass the target server and collect the pages from a search engine's cache instead. The cached page has the same structure as the original page, so the extraction rules do not need to be rewritten.
4) Use Google Translate. Let Google act as our "proxy": set the source and target languages to the same language, so that the data in the translation result returned by Google is the same as the original page (note that the HTML structure changes considerably, so the extraction rules need to be rewritten).
5) When invalid content may be returned, be sure to find an effective way to detect whether the content is valid; otherwise it is hard to guarantee that all of the collected data is correct.
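
Methods 1), 2) and 5) can be combined roughly as follows. This is a minimal sketch using only the Python standard library, not a definitive implementation. The function names (`requests_per_minute`, `looks_valid`, `fetch_via_proxy`, `crawl`), the proxy address in the docstring, and the "required marker strings" validity check are all assumptions made for illustration. What counts as valid content depends entirely on the target pages.

```python
import itertools
import time
import urllib.request


def requests_per_minute(n_proxies, delay_seconds):
    """Aggregate rate when each proxy waits delay_seconds between its own requests."""
    return n_proxies * 60.0 / delay_seconds


def looks_valid(html, required_markers):
    """Heuristic (assumed): a real page contains every expected marker string."""
    return all(marker in html for marker in required_markers)


def fetch_via_proxy(url, proxy, timeout=10.0):
    """Fetch url through one HTTP proxy, e.g. 'http://198.51.100.1:8080' (placeholder)."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(urls, proxies, delay_seconds, required_markers,
          fetch=fetch_via_proxy, sleep=time.sleep, clock=time.monotonic):
    """Round-robin urls over proxies, spacing each proxy's requests by delay_seconds."""
    next_allowed = {proxy: float("-inf") for proxy in proxies}
    pages, rejected = {}, []
    for url, proxy in zip(urls, itertools.cycle(proxies)):
        wait = next_allowed[proxy] - clock()
        if wait > 0:
            sleep(wait)  # method 2): keep each IP under the threshold
        try:
            html = fetch(url, proxy)  # method 1): spread requests over several IPs
        except OSError:
            html = None  # 403/503 errors, reset connections and timeouts land here
        next_allowed[proxy] = clock() + delay_seconds
        if html is not None and looks_valid(html, required_markers):
            pages[url] = html
        else:
            rejected.append(url)  # method 5): never store unverified content
    return pages, rejected


# The arithmetic from the text: 10 proxies, 5-second delay on each.
print(requests_per_minute(10, 5))  # 120.0
```

The `fetch`, `sleep` and `clock` parameters exist only so the scheduling logic can be exercised without touching the network. URLs that end up in `rejected` can be retried later, ideally through a different proxy.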

Source: blog.51cto.com/14400115/2421496