Problems caused by web crawlers, and the Robots protocol

I. The scale of web crawlers

1. Small scale: the goal is crawling individual web pages, often just for fun. The data volume is small and crawl speed is not a concern, so the Requests library is enough; this covers roughly 90% of all crawlers (see the sketches after this list).

2. Medium scale: the goal is crawling one website or a series of websites, for example one or more travel sites. The data volume is large and crawl speed matters, so a crawler framework such as Scrapy is the better fit (also sketched after this list).

3. Large scale: the goal is crawling the entire web, as a search engine does. Crawl speed is critical, and this level requires custom development.
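As a concrete sketch of the small-scale case, here is a minimal fetch with Requests. The URL, timeout, and error handling are my own illustrative choices, not anything prescribed by the list above.

```python
import requests

def fetch_page(url):
    """Fetch one page and return its text, or None if anything goes wrong."""
    try:
        # An explicit timeout keeps a slow server from hanging the crawler.
        r = requests.get(url, timeout=10)
        r.raise_for_status()              # raise on 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # infer the real encoding from the body
        return r.text
    except requests.RequestException:
        return None

if __name__ == "__main__":
    html = fetch_page("https://www.jd.com/")  # illustrative target
    print(html[:200] if html else "fetch failed")
```

For the medium-scale case, a Scrapy crawler is organized as spider classes instead, and the framework handles scheduling and download speed. A skeletal spider might look roughly like this (the site and the CSS selector are placeholders):

```python
import scrapy

class TravelSpider(scrapy.Spider):
    name = "travel"
    start_urls = ["https://example.com/"]  # placeholder start page

    def parse(self, response):
        # Yield one item per downloaded page.
        yield {"title": response.css("title::text").get()}
```

It can be run with `scrapy runspider travel_spider.py -o titles.json`.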

II. Problems brought by web crawlers

In short: harassment of web servers, legal risk, and loss of privacy.

1. A crawler exploits its fast access to a server: it fetches pages hundreds or even thousands of times faster than a human. Depending on the skill and purpose of its author, a crawler can impose a huge resource overhead on the web server; from the site operator's point of view, such a crawler is a form of harassment (a polite crawler throttles itself; see the sketch after this list).

2. Crawlers bring legal risk. The data on a server has an owner; for example, the news on Sina belongs to Sina. Using a crawler to harvest that data for profit carries legal risk.

3. Crawlers can cause loss of privacy. A crawler may be able to break through simple access controls and expose personal data that should have been protected.
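On the harassment point, the common courtesy is to throttle the crawl. A minimal sketch, assuming a one-second pause is acceptable to the target site (the URLs and the delay are made up for illustration):

```python
import time
import requests

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder pages

for url in urls:
    r = requests.get(url, timeout=10)
    print(url, r.status_code)
    time.sleep(1)  # pause so the crawler looks less like a flood of requests
```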

III. Limits on web crawlers

Source review: restricting access by User-Agent

The server inspects the User-Agent field of each incoming HTTP request header and responds only to browsers and known friendly crawlers.
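A minimal sketch of such a source review, using Flask purely for illustration (the post names no server framework, and the allowlist below is invented):

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Invented allowlist: substrings expected in browser or friendly-crawler UAs.
FRIENDLY_AGENTS = ("Mozilla", "Googlebot", "Baiduspider")

@app.before_request
def review_source():
    ua = request.headers.get("User-Agent", "")
    if not any(token in ua for token in FRIENDLY_AGENTS):
        abort(403)  # refuse anything that does not look browser- or crawler-friendly

@app.route("/")
def index():
    return "hello"
```

Note that the User-Agent header is set by the client and is trivially forged, so source review only deters naive crawlers.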

Announcement: the Robots protocol

The website tells all crawlers its crawling policy and requires crawlers to comply.

IV. The Robots protocol

Role: the website tells crawlers which pages may be crawled and which may not.

Form: a robots.txt file in the root directory of the site.

As an example, open JD.com's Robots protocol: https://www.jd.com/robots.txt

Roughly the following content will appear.
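(The live file may have changed since this was written; fetch the URL above for the current version.)

```
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
```

In plain terms: ordinary crawlers may visit JD.com except for the listed URL patterns, while the four named spiders (EtaoSpider, HuihuiSpider, GwdangSpider, WochachaSpider) are banned from the entire site.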

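Python's standard library can check these rules automatically. A small sketch using urllib.robotparser (the two test URLs are my own examples):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.jd.com/robots.txt")
rp.read()  # download and parse the file

# Given the rules above, a generic crawler may fetch the home page...
print(rp.can_fetch("*", "https://www.jd.com/"))           # expected: True
# ...but EtaoSpider is disallowed from the whole site.
print(rp.can_fetch("EtaoSpider", "https://www.jd.com/"))  # expected: False
```

A well-behaved crawler calls can_fetch() before every request and skips any URL the site has disallowed.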


Source: www.cnblogs.com/qq991025/p/11871743.html