Crawler classification - general web crawler, focused web crawler, incremental web crawler, deep web crawler

Crawler classification

Based on system structure and implementation technique, web crawlers can be roughly divided into the following types: general web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers. A real crawler system is usually built by combining several of these techniques.



General Web Crawler

A general web crawler, also known as a whole-web crawler (Scalable Web Crawler), starts from a set of seed URLs and expands its crawl to the entire Web. It mainly collects data for portal-site search engines and large web service providers.

This kind of crawler covers a huge range and number of pages, so it places high demands on crawl speed and storage space, while the order in which pages are crawled matters relatively little. Because there are so many pages to refresh, it usually runs in parallel, yet refreshing any given page still takes a long time.

Simply put, it tries to crawl all data on the Internet.
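
The seed-and-expand idea above can be sketched as a breadth-first traversal. This is only a minimal illustration, not how any particular search engine works: FAKE_WEB, fetch_links and crawl_all are hypothetical names, and the in-memory dictionary stands in for real page downloads and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download and parse pages instead.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return FAKE_WEB.get(url, [])


def crawl_all(seeds, max_pages=1000):
    """Breadth-first crawl: visit every page reachable from the seeds."""
    frontier = deque(seeds)  # URLs waiting to be crawled
    seen = set(seeds)        # URLs already queued, so nothing is crawled twice
    visited = []             # URLs in the order they were crawled
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


print(crawl_all(["http://a.example/"]))
```

The max_pages limit is an assumption added for safety; a crawler of this type has no topic filter, so some stopping condition is needed in practice.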


Focused Web Crawler

A focused crawler (Focused Crawler), also known as a topical crawler (Topical Crawler), is a web crawler that selectively crawls only the pages relevant to a predefined topic.

Compared with a general web crawler, a focused crawler only crawls pages related to its topic, which greatly saves hardware and network resources. Because it stores fewer pages, it can also update them quickly, and it can meet the needs of specific groups of people for information in specific fields.

Simply put, it crawls only one particular kind of data on the Internet.
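
A rough sketch of topic filtering follows. It assumes a very simple relevance measure (the fraction of topic keywords found in the page text); real focused crawlers typically use more elaborate scoring. PAGES, TOPIC, relevance and focused_crawl are hypothetical names, and the dictionary replaces real downloads.

```python
from collections import deque

# Hypothetical pages: URL -> (page text, outgoing links).
PAGES = {
    "http://seed.example/": ("python crawler tutorial",
                             ["http://x.example/", "http://y.example/"]),
    "http://x.example/": ("cooking recipes for pasta", ["http://z.example/"]),
    "http://y.example/": ("writing a web crawler in python", ["http://z.example/"]),
    "http://z.example/": ("crawler politeness and robots rules", []),
}
TOPIC = {"crawler", "python"}  # assumed topic keywords


def relevance(text):
    """Fraction of topic keywords that appear in the page text."""
    words = set(text.lower().split())
    return len(TOPIC & words) / len(TOPIC)


def focused_crawl(seed, threshold=0.5):
    """Keep pages that meet the threshold and follow links only from those pages."""
    frontier = deque([seed])
    seen = {seed}
    kept = []
    while frontier:
        url = frontier.popleft()
        text, links = PAGES[url]
        if relevance(text) < threshold:
            continue  # off-topic page: neither stored nor expanded
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return kept


print(focused_crawl("http://seed.example/"))
```

In this sketch the off-topic page is fetched once and then discarded, so the saving comes from never storing it and never following its links.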


Incremental Web Crawler

An incremental web crawler (Incremental Web Crawler) incrementally updates the pages it has already downloaded and crawls only newly created or changed pages. To some extent, this ensures that the pages it holds are as fresh as possible.

Compared with a crawler that periodically re-crawls and refreshes every page, an incremental crawler fetches pages only when they are new or have been updated, and does not re-download pages that have not changed. This effectively reduces the amount of data downloaded, keeps already crawled pages up to date, and lowers the cost in time and space, but it increases the complexity of the crawling algorithm and the difficulty of implementing it.

Simply put, it grabs only the data that has just been updated on the Internet.
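
One common way to avoid re-downloading unchanged pages is an HTTP conditional request: the crawler sends the ETag it saw last time, and the server answers 304 Not Modified with no body if the page is unchanged. The sketch below only simulates that exchange; SERVER, fake_get and incremental_crawl are hypothetical names, and the original article does not say which change-detection method it has in mind.

```python
# Hypothetical server state: URL -> (etag, body). Stands in for real HTTP servers.
SERVER = {
    "http://news.example/1": ("v1", "first article"),
    "http://news.example/2": ("v1", "second article"),
}


def fake_get(url, etag=None):
    """Mimic a conditional GET: 304 and no body if the caller's ETag is current."""
    current_etag, body = SERVER[url]
    if etag == current_etag:
        return 304, current_etag, None
    return 200, current_etag, body


def incremental_crawl(urls, cache):
    """Download only new or changed pages; cache maps URL -> (etag, body)."""
    downloaded = []
    for url in urls:
        known_etag = cache[url][0] if url in cache else None
        status, etag, body = fake_get(url, known_etag)
        if status == 200:  # new or changed page: store the fresh copy
            cache[url] = (etag, body)
            downloaded.append(url)
    return downloaded


cache = {}
urls = list(SERVER)
print(incremental_crawl(urls, cache))  # first pass: every page is new
SERVER["http://news.example/2"] = ("v2", "second article, revised")
print(incremental_crawl(urls, cache))  # second pass: only the changed page
```

The extra bookkeeping (the cache of ETags) is a small example of the added implementation complexity mentioned above.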


Deep Web Crawler

By the way they are reached, web pages can be divided into surface web pages (Surface Web) and deep web pages (Deep Web, also known as the Invisible Web or Hidden Web).

Surface web pages are pages that traditional search engines can index: mainly static pages that can be reached by following hyperlinks.

Deep web pages are pages whose content mostly cannot be reached through static links. They are hidden behind search forms and can only be obtained when a user submits some keywords.
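
To reach pages hidden behind a search form, a crawler has to submit keywords itself. As a minimal sketch, assuming the form uses the GET method, the code below builds one query URL per keyword with the standard library's urllib.parse.urlencode. The form address, the field name "q" and the function build_query_urls are made up for illustration.

```python
from urllib.parse import urlencode


def build_query_urls(form_action, field_name, keywords, extra_fields=None):
    """Build one GET URL per keyword, as if a user had typed it into the search form."""
    urls = []
    for keyword in keywords:
        params = dict(extra_fields or {})  # fixed form fields, if any
        params[field_name] = keyword       # the keyword field being filled in
        urls.append(form_action + "?" + urlencode(params))
    return urls


# Hypothetical search form at http://db.example/search with a text field named "q".
for url in build_query_urls("http://db.example/search", "q", ["web crawler", "deep web"]):
    print(url)
```

Each generated URL could then be fetched like any ordinary page; choosing which keywords to submit is the hard part and is not covered by this sketch.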

 


Origin blog.csdn.net/qq_39368007/article/details/105047654