Java web crawler WebCollector in-depth analysis: the crawler core

WebCollector official website: https://github.com/CrawlScript/WebCollector

Technical discussion group: 250108697


How to import the crawler kernel into your own project?

1. Go to the crawler's official site http://crawlscript.github.io/WebCollector/ , download the compressed package, and unzip it.

2. Inside the unzipped folder, find "webcollector-version-bin.zip" and unzip it as well.

3. Import all the jar files from "webcollector-version-bin.zip" into your project to use the crawler kernel.


Demo of the crawler kernel

Go to the folder where "webcollector-version-bin.zip" was unzipped (in Explorer on Windows, or from the command line on Linux).

Windows: double-click start.bat

Linux: execute sh start.sh

You will see a simple DEMO. This DEMO can crawl an entire website (including images, files, JS, and CSS) and store it locally, preserving the site's original file paths.

The screenshot shows this DEMO being used to download all the web pages and files from the official website of Hefei University of Technology.





What functions does the crawler kernel provide?

1. An extensible framework. Most crawler developers need a stable, understandable framework, and can build their own crawlers on top of it.

2. The basic class libraries a crawler needs:

    1) Fetching HTML source code (file download).

    2) File operations.

    3) HTML source parsing (extraction).

    4) Thread pool.

    5) URL generator (traverser).

    6) Message mechanism (communication between components).


Basic class library:

Before introducing the crawler framework, let's look at the basic class library.

If you don't want to use our crawler framework and only want to build a basic crawler, a web page collection product, or even just a simple HTML source getter, you can import WebCollector's jar package and call the class library provided by the crawler core directly.
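For instance, the "HTML source getter" case above amounts to downloading a page's source over HTTP. The snippet below is a minimal plain-Java sketch of that idea, using only the standard library rather than WebCollector's classes:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HtmlGetter {

        // Download the HTML source of a page over HTTP.
        public static String getHtml(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            StringBuilder html = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            return html.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(getHtml("https://github.com/CrawlScript/WebCollector"));
        }
    }

A real crawler core adds retries, encoding detection, and connection reuse on top of this, but the essence is the same: URL in, HTML source out.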



Crawler framework:

The crawler framework will be covered in detail in later articles; here we only describe how it differs from other crawler frameworks.

The biggest difference between WebCollector and other crawler frameworks is that it provides a "message mechanism" and a "URL generator".

1) Message mechanism:

Earlier large-scale crawler frameworks (Heritrix, Nutch, Crawler4j) all rely on plug-in mechanisms or overridden code to process (parse and save) crawled information. WebCollector instead provides a powerful message mechanism (Handler).

Take Crawler4j, for example: to customize what happens on each page during a crawl, you must override the relevant methods of the WebCrawler class. This cannot be done at runtime; a subclass of WebCrawler has to be written before compilation. For details, see: http://code.google.com/p/crawler4j/
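To make the contrast concrete, here is roughly what that Crawler4j style looks like (reconstructed from memory of Crawler4j's API at the time; method signatures may differ between versions):

    // Requires the crawler4j library on the classpath.
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    // Crawler4j style: per-page logic is fixed at compile time
    // by subclassing WebCrawler and overriding visit().
    public class MyCrawler extends WebCrawler {

        @Override
        public void visit(Page page) {
            // Custom per-page behavior must live here, in the subclass;
            // it cannot be swapped in at runtime.
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }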

With WebCollector, by contrast, you only need to define a Handler:

        // Define at runtime what to do with each crawled page.
        Handler gene_handler = new Handler() {
            @Override
            public void handleMessage(Message msg) {
                // The message payload carries the crawled Page.
                Page page = (Page) msg.obj;
                // Print the page's HTML source.
                System.out.println(page.html);
            }
        };
Just pass this handler to the traverser.
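To show how such a handler-based message mechanism can work, here is a self-contained plain-Java model. The class shapes (Handler, Message, Page with an html field) mirror the snippet above but are illustrative, not WebCollector's actual source:

    // A minimal model of a handler-based message mechanism.
    public class MessageDemo {

        static class Page {
            String html;
            Page(String html) { this.html = html; }
        }

        static class Message {
            Object obj;          // payload, e.g. a crawled Page
            Message(Object obj) { this.obj = obj; }
        }

        static abstract class Handler {
            public abstract void handleMessage(Message msg);
        }

        public static void main(String[] args) {
            // The crawler side only depends on the Handler type; the user
            // decides at runtime what happens to each page.
            Handler handler = new Handler() {
                @Override
                public void handleMessage(Message msg) {
                    Page page = (Page) msg.obj;
                    System.out.println(page.html);
                }
            };
            // Simulate the traverser delivering a crawled page.
            handler.handleMessage(new Message(new Page("<html>...</html>")));
        }
    }

The point of the pattern is that page processing is plugged in when the crawler is constructed, not baked into a subclass at compile time.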

2) URL generator:

Heritrix, Nutch, and Crawler4j only provide breadth-first traversal of web pages, and it is difficult to change the traversal strategy through their built-in plug-in mechanisms. WebCollector therefore provides a URL generator (Generator): a custom URL generator can implement all kinds of URL traversal, which is especially useful for deep-web crawling (e.g. Weibo or dynamic pages).
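As an illustration of the idea, the sketch below generates the paginated URLs of a hypothetical dynamic listing instead of following links breadth-first. The interface shape (a next() method returning the next URL) is invented for this example and is not necessarily WebCollector's Generator API:

    // Illustrative only: a URL generator that walks a hypothetical
    // paginated listing instead of doing breadth-first link traversal.
    public class PageUrlGenerator {

        private int page = 1;
        private final int maxPage;

        public PageUrlGenerator(int maxPage) {
            this.maxPage = maxPage;
        }

        // Return the next URL to crawl, or null when exhausted.
        public String next() {
            if (page > maxPage) {
                return null;
            }
            return "http://example.com/list?page=" + (page++);
        }

        public static void main(String[] args) {
            PageUrlGenerator gen = new PageUrlGenerator(3);
            for (String url = gen.next(); url != null; url = gen.next()) {
                System.out.println(url);
            }
        }
    }

Because the generator decides which URL comes next, traversal is no longer tied to the link structure of the pages already crawled.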

  

