java - crawler

This article is reproduced from: http://syc001.iteye.com/blog/1028001

The principle of a crawler: every web page returned to the client is HTML, and the content you need is somewhere inside that HTML. You can save the whole HTML document into a Java String variable; what you then need to do is cut out the content at the corresponding position in the string and save it. Each product page of the website in question has some distinctive place in its markup that you can use to locate the content.
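
To make this concrete, here is a minimal Java sketch that downloads a page into a String and cuts out one piece of content between two markers; the URL and the <h1> markers are only placeholders and have to be adapted to the real page's structure.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class SimpleFetch {
        // Download the page body into a single Java String.
        static String fetch(String url) throws Exception {
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical product page; replace with a real URL.
            String html = fetch("https://example.com/product/1");
            // Cut out the text between two markers; the <h1> markers are assumptions
            // and must be adapted to the real page structure.
            int start = html.indexOf("<h1>");
            int end = html.indexOf("</h1>", start);
            if (start >= 0 && end > start) {
                System.out.println(html.substring(start + 4, end));
            }
        }
    }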



Crawlers are divided into two categories:

Focused crawlers:
A focused crawler is a program that automatically downloads web pages; it selectively visits pages and related links on the World Wide Web according to a given crawl target in order to obtain the required information. A focused crawler does not pursue broad coverage; its goal is to crawl pages related to a specific topic and to prepare data resources for topic-oriented user queries.

Universal crawlers:

Composition of a web crawler
In the system framework of a web crawler, the main process consists of three parts: the controller, the parser, and the resource library. The controller's main job is to assign work tasks to the individual crawler threads when crawling with multiple threads. The parser's main job is to download web pages and process them, mainly stripping JavaScript tags, CSS, whitespace, HTML tags and so on; the basic work of the crawler is done by the parser. The resource library stores the downloaded web resources; generally a large database, such as Oracle, is used to store and index them.

Controller:
  The controller is the central control unit of the web crawler. It is mainly responsible for allocating a thread according to the URL link sent by the system, and then starting that thread to call the crawler to fetch the page.
  
Parser:
  The parser is the core component of the web crawler. Its main tasks are downloading web pages and processing their text, for example filtering content, extracting special HTML tags, and analyzing data.
  
Resource library:
  The resource library is a container used to store the data records downloaded from web pages, and it provides the source material for building indexes. Medium and large database products such as Oracle and SQL Server are commonly used.
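
To show how these three parts fit together, here is a small structural sketch in Java. All of the type and method names (Parser, ResourceRepository, Controller, downloadAndClean, save) are illustrative assumptions, not the API of any of the libraries mentioned later.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative only; a real crawler adds error handling, politeness delays
    // and persistent storage.
    interface Parser {
        // Download one page and strip scripts, CSS, whitespace and HTML tags.
        String downloadAndClean(String url);
    }

    interface ResourceRepository {
        // Persist the cleaned page content (a real system would use a database).
        void save(String url, String content);
    }

    class Controller {
        private final ExecutorService pool = Executors.newFixedThreadPool(4);
        private final Queue<String> urlQueue = new ConcurrentLinkedQueue<>();
        private final Parser parser;
        private final ResourceRepository repository;

        Controller(Parser parser, ResourceRepository repository) {
            this.parser = parser;
            this.repository = repository;
        }

        // Hand one URL to a worker thread: parse the page and store the result.
        void submit(String url) {
            urlQueue.add(url);
            pool.submit(() -> {
                String next = urlQueue.poll();
                if (next != null) {
                    repository.save(next, parser.downloadAndClean(next));
                }
            });
        }
    }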



Overview of web crawlers
The main function of a web crawler is to discover, download and store content from the web. Crawlers are widely used in various search engines.
A typical web crawler mainly consists of the following parts:

    A URL library that can be recognized by the crawler.
    The document download module is mainly used to download content from the web.
    The document parsing module is used to parse the content in the downloaded document, such as parsing HTML, PDF, Word, etc. This module also extracts URLs from web pages and some useful data for indexing.
    A library that stores metadata for documents as well as content.
    The canonical URL module, which converts URLs into a standard format.
    The URL filter, with which the crawler can discard unwanted URLs (a small sketch of these last two modules follows the list).
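
The following Java sketch illustrates the last two modules. The normalization rules shown (lower-casing the scheme and host, resolving "./" and "../", dropping the fragment) and the filter's extension blacklist are assumptions chosen for illustration.

    import java.net.URI;

    public class UrlModules {
        // Canonical URL module: bring a raw URL into one standard form.
        static String normalize(String raw) throws Exception {
            URI u = new URI(raw).normalize();                // resolve "./" and "../"
            String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            return u.getScheme().toLowerCase() + "://" + host + path
                    + (u.getQuery() != null ? "?" + u.getQuery() : ""); // drop the #fragment
        }

        // URL filter: skip resources the crawler does not want to parse.
        static boolean accept(String url) {
            return !url.matches(".*\\.(jpg|png|gif|zip|exe)$");
        }

        public static void main(String[] args) throws Exception {
            String u = normalize("HTTP://Example.COM/a/./b/../page.html#top");
            System.out.println(u + " accepted=" + accept(u));
        }
    }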



The design and implementation of the above modules depend mainly on what your crawler needs to crawl and how wide its scope is. The simplest example is crawling a handful of pages from a known site, where the crawler code can stay very simple. This simple requirement does come up in Internet applications, but if you want to implement a crawler that fetches a large number of documents, it is not so easy. Generally speaking, such a crawler is made up of many cooperating components, and the difficulty lies in making it distributed.

Two stages of a crawler
A typical crawler works in the following two stages:

    Initialize the URL library, then start crawling.
    Read unvisited URLs from the library to determine the scope of work.



For each URL to be crawled, perform the following steps (a minimal sketch of this loop follows the list):

    Fetch the content of the URL.
    Parse the content to obtain new URLs and the required data.
    Store the valuable data.
    Normalize the newly obtained URLs.
    Filter out URLs that do not need to be crawled.
    Add the remaining URLs to the URL library.
    Repeat from step 2 until the required crawl depth is reached.
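
Here is a minimal, single-threaded Java sketch of that loop. It reuses the fetch(), normalize() and accept() helpers sketched earlier in this article; the href regular expression and the depth bookkeeping are simplifications for illustration.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CrawlLoop {
        // Naive link extraction; a real parser would handle relative URLs, etc.
        private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

        static void crawl(String seed, int maxDepth) throws Exception {
            Set<String> visited = new HashSet<>();
            Deque<String[]> frontier = new ArrayDeque<>();    // entries are {url, depth}
            frontier.add(new String[]{seed, "0"});

            while (!frontier.isEmpty()) {
                String[] entry = frontier.poll();
                String url = entry[0];
                int depth = Integer.parseInt(entry[1]);
                if (depth > maxDepth || !visited.add(url)) continue;

                String html = SimpleFetch.fetch(url);          // 1. fetch the content
                Matcher m = HREF.matcher(html);                // 2. parse content, extract links
                // 3. store valuable data (stubbed here as a log line)
                System.out.println("stored " + url + " (" + html.length() + " chars)");
                while (m.find()) {
                    String found = UrlModules.normalize(m.group(1)); // 4. normalize
                    if (UrlModules.accept(found)) {                  // 5. filter
                        frontier.add(new String[]{found, String.valueOf(depth + 1)}); // 6. update URL library
                    }
                }
            }                                                  // 7. repeat until the depth is reached
        }
    }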



In terms of breadth, there are two types of crawlers: universal and focused. A universal crawler captures every document it can parse and relies mainly on URL-filtering techniques to steer itself. A focused crawler mainly fetches documents with specific content, for example crawling the Sina blog, where the pages have a fixed format and the content is exactly what we are interested in.
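
The difference largely comes down to the accept rule applied to candidate URLs. The two predicates below are only a sketch; the blog.sina.com.cn host check stands in for whatever topic rule a real focused crawler would use.

    public class CrawlerScope {
        // Universal crawler: accept nearly every parseable URL.
        static boolean universalAccept(String url) {
            return url.startsWith("http");
        }

        // Focused crawler: restrict the crawl to one site/topic.
        static boolean focusedAccept(String url) {
            return url.contains("blog.sina.com.cn");
        }

        public static void main(String[] args) {
            String u = "http://blog.sina.com.cn/s/blog_123.html";
            System.out.println(universalAccept(u) + " " + focusedAccept(u));
        }
    }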

Fortunately, there are open source crawlers available.
In Java, both Nutch and Heritrix provide crawler implementations. Nutch is a sub-project of Apache Lucene, at http://lucene.apache.org/nutch/. The project is very stable and well documented. Nutch stores multiple web pages in a single file; for a large crawl this reduces I/O reads and writes and gives better performance.

Heritrix is a web crawler for the Internet Archive. The project address is http://crawler.archive.org/. Heritrix focuses on the implementation of large-scale crawlers. The license is LGPL.

In addition, there is another project worth attention: Apache Tika. The project address is http://tika.apache.org/. Tika uses parsers to detect and extract metadata and text content from documents.
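
A minimal usage sketch of the Tika facade, assuming the tika-core and tika-parsers jars are on the classpath; "sample.pdf" is a placeholder for a document the crawler has downloaded.

    import java.io.File;
    import org.apache.tika.Tika;

    public class TikaExtract {
        public static void main(String[] args) throws Exception {
            File doc = new File("sample.pdf");   // placeholder for a downloaded document
            Tika tika = new Tika();              // the facade auto-detects the document type
            System.out.println("type: " + tika.detect(doc));
            System.out.println(tika.parseToString(doc));   // extracted plain text
        }
    }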


Google: "Java Open Source Web Crawler Classification List"
(1)
ItSucks
ItSucks is an open source Java web spider (web robot, crawler). It supports defining download rules through download templates and regular expressions, and provides a Swing GUI. Download address: http://itsucks.sourceforge.net/

(2)
WebSPHINX
WebSPHINX is a Java class library and interactive development environment for web crawlers. Web crawlers (also called robots or spiders) are programs that automatically browse and process Web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library. http://www.cs.cmu.edu/~rcm/websphinx/

(3)
JSpider
JSpider is a fully configurable and customizable Web spider engine. You can use it to check a website for errors (internal server errors, etc.), check internal and external links, analyze the structure of a site (create a sitemap), or download an entire website, and you can also write JSpider plug-ins to extend it with the functionality you need. http://j-spider.sourceforge.net/

(4)
Arale
Arale is mainly designed for personal use and does not focus on page indexing the way other crawlers do. Arale can download entire web sites or certain resources from a web site, and it can also map dynamic pages to static pages. http://web.tiscali.it/_flat/arale.jsp.html

(5)
Web-Harvest
Web-Harvest is a Java open source web data extraction tool. It can collect specified web pages and extract useful data from them. Web-Harvest mainly uses technologies such as XSLT, XQuery and regular expressions to operate on text/XML. http://web-harvest.sourceforge.net/
