Web crawler basics

Baidu Encyclopedia's definition: a web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules.

Websites generally do not want to be crawled.

　　Anti-crawling mechanism: implemented by the website, to prevent crawlers from scraping its data.

　　Anti-anti-crawling strategy: implemented by the crawler, to get around the site's anti-crawling mechanism.

Common anti-crawling mechanisms and coping strategies:

　　　　1. Headers check: the site inspects the request headers to tell a browser from a machine, and may also check the Referer (the page the request came from). (Strategy: disguise the request by packing browser-like headers into it.)

　　　　2. IP restriction: the site limits an IP address that sends a high frequency of requests in a short time. (Strategy: use proxy IPs and rotate the IP address; pay attention to the anonymity level of the proxy.)

　　　　3. UA restriction: the site verifies the User-Agent parameter in the request headers. (Strategy: disguise the User-Agent.)

　　　　4. CAPTCHA or login checks: the site checks for a verification code or for whether the user is logged in. (Strategy: use a third-party CAPTCHA recognition platform to solve the code; use the Session feature of requests to record cookies and simulate login.)

　　　　5. Dynamically loaded data: the page requests the data it displays through ajax, which keeps a crawler from getting it directly from the page source. (Strategy: use a packet capture tool or the browser's inspector to identify the request URL, then request it and process the response data.)

　　　　6. Cookie restriction: the site tells a crawler from a browser by checking cookies. (Strategy: if the cookie seldom changes, add it manually as the Cookie parameter in the request headers; if it changes often, make the requests through a Session object; if that is too troublesome, use the selenium module to fully simulate browser behavior.)
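
The header, proxy, and cookie strategies above can be sketched roughly as follows. This is a minimal illustration rather than a complete crawler: it uses the standard library's urllib instead of the requests Session mentioned above so that it has no third-party dependencies, and the User-Agent string, the helper names build_request and make_opener, and any URLs passed in are assumptions made for the example.

```python
import http.cookiejar
import urllib.request

# Hypothetical browser-like User-Agent string, for illustration only.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0 Safari/537.36"
)


def build_request(url, referer=None, cookie=None):
    """Build a request with browser-like headers (items 1, 3 and 6)."""
    headers = {"User-Agent": BROWSER_UA}
    if referer:
        headers["Referer"] = referer
    if cookie:
        # A cookie that seldom changes can simply be pasted in by hand.
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)


def make_opener(proxy=None):
    """Build an opener that remembers cookies between requests and,
    optionally, sends traffic through a proxy (items 2 and 6)."""
    jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if proxy:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
    return urllib.request.build_opener(*handlers), jar
```

Calling opener.open(build_request(url)) would then send the disguised request; cookies set by the site are stored in the jar and sent back on later requests made through the same opener.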

Crawler categories:

1. General crawler: grabs the entire source content of a page

2. Focused crawler: grabs a specific part of a page's content

3. Incremental crawler: monitors a website for data updates and crawls only the data that was newly updated
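
One simple way to approximate an incremental crawler is to remember a fingerprint of each page and only process a page again when its fingerprint changes. The sketch below assumes the page content has already been downloaded as a string; the function names and the in-memory dict are hypothetical, and a real crawler would more likely persist the fingerprints in a database.

```python
import hashlib


def content_fingerprint(html):
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(url, html, seen):
    """Return True if the content for url is new or differs from the
    last recorded version, and record the new fingerprint in seen."""
    digest = content_fingerprint(html)
    if seen.get(url) == digest:
        return False  # same content as last time, nothing to crawl
    seen[url] = digest
    return True
```

A crawl loop would call has_changed for each fetched page and skip parsing and storage whenever it returns False.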

Crawler legality (where the legal risk lies):

　　1. The crawler interferes with the normal operation of the site it visits

　　2. The crawler grabs specific types of data or information that are protected by law

Avoiding legal risk:

　　Comply with the robots protocol

　　Optimize the program so that it avoids sensitive data

　　When using or disseminating crawled information, take care with personal privacy and other sensitive information.

Website robots protocol:

　　The website uses it to tell crawlers which paths may be crawled. (It can usually be viewed by appending /robots.txt to the site's URL.)

　　The robots protocol guards against gentlemen, not villains: it is only a convention and cannot stop a crawler that chooses to ignore it.
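
A crawler that wants to respect the protocol can check a path before requesting it. The sketch below uses the standard library's urllib.robotparser on a robots.txt body supplied as text; the rules, the "MyCrawler" agent name, and the example.com URLs are made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content, as a site might serve at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt):
    """Parse a robots.txt body that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, user_agent, url):
    """Return True if the rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

For a live site, the parser's set_url and read methods can fetch the file directly instead of parsing a string.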
