General-purpose crawlers: Baidu, 360, Sohu, Google, Bing, ...
How they work:
(1) Crawl web pages
(2) Collect the data
(3) Process the data
(4) Provide a search service
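The four steps above can be sketched on a toy scale. This is only an illustration under loud assumptions: the "web" is a hypothetical in-memory dictionary (FAKE_WEB), and the function names crawl, build_index and search are made up for this example. Real search engines are far more elaborate.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("python crawler tutorial",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("crawler and search engine", ["http://a.example/"]),
    "http://c.example/": ("cooking recipes", []),
}

def crawl(seed):
    """Steps 1-2: follow links breadth-first and collect each page's text."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = FAKE_WEB[url]
        pages[url] = text
        for link in links:
            if link in FAKE_WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Step 3: build an inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, word):
    """Step 4: answer a one-word query from the index."""
    return sorted(index.get(word.lower(), set()))

pages = crawl("http://a.example/")
index = build_index(pages)
print(search(index, "crawler"))  # ['http://a.example/', 'http://b.example/']
```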
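Baidu's crawler: Baiduspider
How does a general-purpose crawler find a new website?
(1) The site owner submits the URL to the search engine
(2) Links to the new site are placed on other websites
(3) Search engines such as Baidu cooperate with DNS service providers to discover and crawl new sites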
Search result ranking:
(1) Paid ranking (PPC)
(2) Ranking by PageRank value, together with signals such as traffic and click volume; improving these is the work of SEO
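To give a feel for what a PageRank value is, here is a minimal sketch of the textbook power-iteration formulation. The link graph, the damping factor of 0.85 and the function name pagerank are assumptions chosen for illustration; this is not how Baidu or any other engine actually ranks pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict: page -> list of pages it links to.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

In this small graph, C ends up with the highest score because the most pages link to it, and D ends up with the lowest because nothing links to it.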
If you do not want Baidu's crawler to crawl your website, add a robots.txt file that defines which parts of the site may be crawled and which may not. For example, part of Taobao's robots.txt:
User-Agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Allow: /$
Disallow: /product/
Disallow: /

User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: /
This protocol is only a gentlemen's agreement; a crawler that ignores it can still fetch the pages.
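A well-behaved crawler checks these rules before fetching a page. The sketch below uses Python's standard urllib.robotparser on rules modeled on the Baiduspider section quoted above. The rule text is embedded in the script, the tested paths are made up, and the "Allow: /$" line is left out because this parser does plain prefix matching and does not interpret "$".

```python
from urllib.robotparser import RobotFileParser

# Rules modeled on the Baiduspider section quoted above (lowercase paths assumed).
RULES = """\
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Disallow: /product/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(path, agent="Baiduspider"):
    """Return True if `agent` may fetch `path` under the rules above."""
    return parser.can_fetch(agent, "https://www.taobao.com" + path)

print(allowed("/article/123"))  # True: matches "Allow: /article"
print(allowed("/product/456"))  # False: matches "Disallow: /product/"
```

For a live site, the same parser can load the file itself with set_url() followed by read(). Nothing enforces the result, which is why the protocol only binds crawlers that choose to honor it.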
Focused crawler: fetches specified data according to specific requirements.
Idea: a program takes the place of the browser in accessing the web.
Features of web pages:
(1) Each page has its own unique URL
(2) Page content is structured as HTML
(3) Pages are served over the HTTP or HTTPS protocol
Crawling steps:
(1) Start from a URL
(2) Write a program that simulates a browser accessing the URL
(3) Parse the content and extract the data
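The three steps above might look like this using only the Python standard library. It is a minimal sketch: the names fetch_html, TitleAndLinks and extract, the User-Agent string, and the sample HTML are all assumptions for illustration. The parsing step is demonstrated on an inline string so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

def fetch_html(url, timeout=10):
    """Step 2: request the URL with a browser-like User-Agent header."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

class TitleAndLinks(HTMLParser):
    """Step 3: collect the <title> text and every <a href> target."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract(html, base_url):
    """Parse an HTML string and return (title, list of absolute links)."""
    parser = TitleAndLinks(base_url)
    parser.feed(html)
    return parser.title.strip(), parser.links

# Step 1 is the URL itself; here a sample page stands in for a real download.
SAMPLE = ('<html><head><title>Demo</title></head><body>'
          '<a href="/a">A</a> <a href="http://other.example/b">B</a></body></html>')
title, links = extract(SAMPLE, "http://site.example/")
print(title, links)  # Demo ['http://site.example/a', 'http://other.example/b']
```

Against a real page, the call would be extract(fetch_html(url), url), ideally after checking the site's robots.txt.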