How to reasonably control crawler speed

Anyone who works with crawlers knows that faster is not always better. The faster a crawler collects pages, the more easily it is detected and the more likely its IP is to be blocked. So how can crawler speed be controlled sensibly?
The usual approach is to limit the crawl frequency by setting a delay between page fetches, so that the crawler neither burdens the server nor gets blocked for accessing it too often. However, this makes the crawl slow, and with a large number of tasks it seriously hurts efficiency.
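As a minimal sketch of the fixed-delay approach, the function below pauses for a constant time between requests. The names crawl_with_fixed_delay, fetch and delay_seconds are hypothetical, and the 2-second default is an arbitrary assumption, not a recommended value.

```python
import time
from typing import Callable, Iterable, List


def crawl_with_fixed_delay(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 2.0,  # assumed value; tune it for the target site
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing delay_seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        pages.append(fetch(url))
    return pages
```

The fetch and sleep callables are passed in so the loop can be exercised without touching the network; in a real crawler fetch would wrap whatever HTTP client is in use.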
A natural improvement is to make the delay dynamic: wait for the minimum time interval minus the time spent reading the page. That way, whether the network is fast or slow, the time between successive requests stays at the minimum interval, as long as reading the page itself takes no longer than that interval. However, this method is only suitable for small-scale, single-threaded crawling of a site.
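The dynamic delay can be sketched as follows: measure how long the page read took and sleep only for what is left of the minimum interval. The function names and the 2-second default are assumptions made for illustration.

```python
import time
from typing import Callable


def remaining_delay(min_interval: float, elapsed: float) -> float:
    """Time still to wait so that a request cycle lasts at least min_interval."""
    return max(0.0, min_interval - elapsed)


def fetch_with_min_interval(
    url: str,
    fetch: Callable[[str], str],
    min_interval: float = 2.0,  # assumed value
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Fetch one page, then sleep for whatever remains of min_interval."""
    start = clock()
    page = fetch(url)
    sleep(remaining_delay(min_interval, clock() - start))
    return page
```

If reading the page takes longer than the minimum interval, the remaining delay is clamped to zero and the next request starts immediately.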
Another option is a PID control algorithm, which regulates crawler speed through feedback instead of computing the delay directly. When the crawler runs too fast, the delay is increased; when it runs too slowly, the delay is automatically reduced.
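A rough sketch of a PID-based delay controller is shown below. The class name PidDelayController, the gains kp, ki and kd, and the delay bounds are all assumptions chosen for illustration; real gains would need tuning against the crawler's measured behavior.

```python
from dataclasses import dataclass


@dataclass
class PidDelayController:
    """Adjusts the per-request delay so the crawl rate tracks target_rate."""

    target_rate: float       # desired pages per second
    kp: float = 0.5          # proportional gain (assumed value)
    ki: float = 0.1          # integral gain (assumed value)
    kd: float = 0.05         # derivative gain (assumed value)
    min_delay: float = 0.0
    max_delay: float = 60.0
    delay: float = 1.0       # current delay in seconds
    _integral: float = 0.0
    _previous_error: float = 0.0

    def update(self, measured_rate: float, dt: float) -> float:
        """Return the new delay given the rate measured over the last dt seconds."""
        if dt <= 0:
            raise ValueError("dt must be positive")
        # Positive error means the crawler is too fast, so the delay must grow.
        error = measured_rate - self.target_rate
        self._integral += error * dt
        derivative = (error - self._previous_error) / dt
        self._previous_error = error
        adjustment = self.kp * error + self.ki * self._integral + self.kd * derivative
        # Clamp the result so the delay stays within sensible bounds.
        self.delay = min(self.max_delay, max(self.min_delay, self.delay + adjustment))
        return self.delay
```

The crawler would call update periodically with the rate it actually achieved and sleep for the returned delay before the next request.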
That is a brief introduction to controlling crawler speed. If these limits keep you from collecting data quickly enough, you can use proxy IPs to improve efficiency: switch between different IPs and keep collecting continuously. A proxy service such as Flash reptiles cloud agent can help here, offering stable IP lines, simple operation and a reasonable price.
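Switching between proxy IPs might look like the sketch below, which uses only the Python standard library. The helper names are hypothetical, and the proxy addresses in the usage comment are placeholders, not real endpoints or the addresses of any particular provider.

```python
import itertools
import urllib.request
from typing import Iterator, Sequence


def proxy_rotation(proxies: Sequence[str]) -> Iterator[str]:
    """Yield the given proxy URLs in round-robin order, forever."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    return itertools.cycle(proxies)


def opener_for(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage sketch (placeholder addresses):
# rotation = proxy_rotation(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
# response = opener_for(next(rotation)).open("http://example.com/", timeout=10)
```

Rotating proxies spreads requests across several IPs, but each individual IP still benefits from one of the delay strategies described above.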

Origin blog.51cto.com/14338698/2404709