Python: a conceptual overview of general crawlers and focused crawlers

General crawlers: Baidu, 360, Sohu, Google, Bing ...

How they work:

(1) Crawl web pages

(2) Collect the data

(3) Process the data

(4) Provide a retrieval (search) service

Baidu's crawler: Baiduspider

How does a general crawler discover and crawl a new website?

(1) The site owner submits the URL to the search engine

(2) Other sites place links pointing to the new site

(3) Baidu cooperates with DNS service providers to discover and crawl new websites

Ranking of search results

(1) Paid ranking (pay-per-click advertising)

(2) Ranking by PageRank value, which depends on traffic; attracting that traffic is what SEO work is about

If you do not want Baidu's crawler to crawl your website, add a robots.txt file that states which parts of the site may be crawled and which may not. For example, part of Taobao's robots.txt:

User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Allow: /$
Disallow: /product/
Disallow: /

User-agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /
This protocol is only a gentleman's agreement; a crawler that ignores it can still fetch the pages.
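To see how these rules are interpreted in practice, Python's standard-library urllib.robotparser can read a robots.txt file and report whether a given user agent is allowed to fetch a URL. A minimal sketch (the example paths are only illustrations):

from urllib.robotparser import RobotFileParser

# Download and parse Taobao's robots.txt.
rp = RobotFileParser()
rp.set_url("https://www.taobao.com/robots.txt")
rp.read()

# Check whether Baiduspider may fetch two different paths.
print(rp.can_fetch("Baiduspider", "https://www.taobao.com/article/some-page"))  # allowed by Allow: /article
print(rp.can_fetch("Baiduspider", "https://www.taobao.com/product/12345"))      # blocked by Disallow: /product/

A well-behaved crawler performs a check like this before requesting a page, but nothing technically prevents a crawler that skips it from downloading the page anyway.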
Focused crawler: fetches specified data according to specific requirements.
Idea: use a program to access the Internet in place of a browser.
Characteristics of web pages:
(1) Every web page has its own unique URL
(2) Page content has an HTML structure
(3) Pages are served over the HTTP or HTTPS protocol
Crawling steps:
(1) Start from a URL
(2) Write a program that simulates a browser visiting that URL (see the sketch below)
(3) Parse the content and extract the data
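These three steps map directly onto a few lines of Python. The sketch below uses the third-party requests library and the standard-library html.parser to pull the page title out of the response; the URL, the User-Agent string, and the TitleParser helper are only placeholders for illustration:

import requests
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    # Collects the text inside the first <title> tag.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Step (1): start from a URL (placeholder).
url = "https://example.com/"

# Step (2): simulate a browser by sending a browser-like User-Agent header.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers, timeout=10)

# Step (3): parse the HTML and extract the data we care about.
parser = TitleParser()
parser.feed(response.text)
print(parser.title)

In practice the parsing step is usually done with a library such as lxml or BeautifulSoup, but the structure of the program stays the same: request, parse, extract.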


Origin www.cnblogs.com/lyxcode/p/11490064.html