Baidu Encyclopedia's definition: a web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.
Websites generally do not want to be crawled.
Anti-crawling mechanism: implemented by the website, to prevent crawlers from scraping its data.
Anti-anti-crawling strategy: implemented by the crawler, to get around those mechanisms.
Common anti-crawling mechanisms and coping strategies:
1. Headers check: the site inspects the request headers to tell whether the request comes from a browser or a program, and may also check the Referer (the page that linked to the current one). (Strategy: disguise the request headers by wrapping browser-like headers into the request.)
2. IP restriction: the site restricts IP addresses that send many requests within a short time. (Strategy: use proxy IPs to change the requesting address; pay attention to the anonymity level of the proxy.)
3. UA restriction: the site validates the User-Agent request header. (Strategy: disguise it.)
4. CAPTCHA or login checks: the site checks for a solved CAPTCHA and whether the user is logged in. (Strategy: use a third-party CAPTCHA recognition platform to solve the code; use the Session feature of the requests library to record cookies and simulate a login.)
5. Dynamically loaded data: the page fetches its data through ajax requests after it loads, so a crawler that fetches the page URL directly does not get the data. (Strategy: use a packet capture tool or the browser's inspector to find the request URL, then request it and process the response data.)
6. Cookie restriction: the site checks cookies to tell a crawler from a browser. (Strategy: if the cookie seldom changes, add it manually to the request headers; if it changes often, make the requests through a Session object; if that is too cumbersome, use the selenium module to fully simulate browser behavior.)
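Several of the strategies above (disguised headers, a proxy IP, and cookies kept across requests) can be sketched together. This is a minimal illustration, not a definitive implementation: it uses the standard library's urllib as a stand-in for the requests Session mentioned above, and the header values, the example.com URLs, the proxy address, and the helper names make_opener and build_request are all assumptions made up for the example.

```python
import urllib.request
from http.cookiejar import CookieJar

# Assumed, browser-like header values; replace with ones copied from a real browser.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36",
    "Referer": "https://example.com/",
}


def make_opener(proxy_url=None):
    """Hypothetical helper: an opener that keeps cookies between requests
    (similar in spirit to a Session) and optionally goes through a proxy."""
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]
    if proxy_url:
        # Route both http and https traffic through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    opener = urllib.request.build_opener(*handlers)
    # Default headers sent with every request made through this opener.
    opener.addheaders = list(BROWSER_HEADERS.items())
    return opener


def build_request(url, extra_headers=None):
    """Hypothetical helper: a request carrying the disguised headers."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

A request for an ajax endpoint found with the browser's inspector could then be built with `build_request("https://example.com/api/list", {"X-Requested-With": "XMLHttpRequest"})` and sent with `make_opener("http://127.0.0.1:8080").open(request)`; both the endpoint and the proxy address here are placeholders.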
Types of crawlers:
1. General crawler: fetches the entire source of a page.
2. Focused crawler: fetches only specific parts of a page.
3. Incremental crawler: monitors a website for updates and fetches only the newly updated data.
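The difference between the focused and incremental types can be sketched in a few lines. This is only an illustration under assumptions: the choice of `<h2>` tags as the "specific part" of the page, the in-memory store, and the names TitleExtractor, extract_titles and SeenStore are all hypothetical; a real incremental crawler would usually persist its fingerprints in a database.

```python
import hashlib
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Focused extraction: collect only the text inside <h2> tags."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


def extract_titles(html):
    """Return the stripped text of every <h2> element in the page source."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


class SeenStore:
    """Incremental filtering: remember fingerprints of items already crawled."""

    def __init__(self):
        self._seen = set()

    def filter_new(self, items):
        """Return only the items not seen in any earlier call."""
        fresh = []
        for item in items:
            digest = hashlib.sha256(item.encode("utf-8")).hexdigest()
            if digest not in self._seen:
                self._seen.add(digest)
                fresh.append(item)
        return fresh
```

On each crawl, the page source would go through `extract_titles` and the result through the same `SeenStore` instance's `filter_new`, so that only titles added since the previous run are kept.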