Python crawlers (2): The scale and constraints of web crawlers

 Infi-chu:

http://www.cnblogs.com/Infi-chu/

 

1. The scale of web crawlers:

1. Small scale: small data volume, insensitive to crawl speed; use the Requests library; crawls individual web pages (see the sketch after this list)
2. Medium scale: large data volume, sensitive to crawl speed; use the Scrapy framework; crawls whole websites
3. Large scale: search-engine scale, where crawl speed is critical; requires custom development; crawls the entire web
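
For the small-scale case, a minimal sketch of fetching a single page with Requests might look like the following. The URL, timeout value, and helper name `fetch_page` are illustrative assumptions, not part of the original post.

```python
import requests


def fetch_page(url):
    """Fetch a single page with Requests and return its text, or None on failure."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()                  # raise on 4xx/5xx status codes
        resp.encoding = resp.apparent_encoding   # guess the encoding from the content
        return resp.text
    except requests.RequestException:
        return None


if __name__ == "__main__":
    # example.com is only a placeholder target
    html = fetch_page("https://example.com")
    print(html[:200] if html else "fetch failed")
```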

 

2. Robots protocol:

1. Meaning: Robots Exclusion Standard, the standard for excluding web crawlers
2. Function: the website tells web crawlers which pages may be crawled and which may not
3. Form: a robots.txt file in the root directory of the website
4. Use:
  a. Web crawlers: identify robots.txt automatically or manually, then crawl accordingly (a sketch follows this list)
  b. Binding force: compliance is not mandatory, but ignoring it carries legal risk
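
A minimal sketch of checking robots.txt before crawling, using the standard-library `urllib.robotparser`. The function name `can_crawl` and the example URL are assumptions for illustration only.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def can_crawl(url, user_agent="*"):
    """Check a site's robots.txt to see whether user_agent may fetch url."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                              # download and parse robots.txt
    return rp.can_fetch(user_agent, url)


if __name__ == "__main__":
    # placeholder URL; substitute the page you actually intend to crawl
    print(can_crawl("https://www.python.org/about/"))
```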
