webcollector 爬虫框架使用说明

学习使用，看到

WebCollector

WebCollector是一个无须配置、便于二次开发的JAVA爬虫框架（内核），它提供精简的的API，只需少量代码即可实现一个功能强大的爬虫。WebCollector-Hadoop是WebCollector的Hadoop版本，支持分布式爬取。

目前WebCollector在Github上维护:https://github.com/CrawlScript/WebCollector

1.WebCollector与传统网络爬虫的区别
传统的网络爬虫倾向于整站下载，目的是将网站内容原样下载到本地，数据的最小单元是单个网页或文件。而WebCollector可以通过设置爬取策略进行定向采集，并可以抽取网页中的结构化信息。

2.WebCollector与HttpClient、Jsoup的区别
WebCollector是爬虫框架，HttpClient是Http请求组件，Jsoup是网页解析器（内置了Http请求功能）。

一些程序员在单线程中通过迭代或递归的方法调用HttpClient和Jsoup进行数据采集，这样虽然也可以完成任务，但存在两个较大的问题：

1) 单线程速度慢，多线程爬虫的速度远超单线程爬虫。

2) 需要自己编写任务维护机制。这套机制里面包括了URL去重、断点爬取（即异常中断处理）等功能。

WebCollector框架自带了多线程和URL维护，用户在编写爬虫时无需考虑线程池、URL去重和断点爬取的问题。

3.WebCollector能够处理的量级
WebCollector目前有单机版和Hadoop版（WebCollector-Hadoop），单机版能够处理千万级别的URL，对于大部分的精数据采集任务，这已经足够了。WebCollector-Hadoop能够处理的量级高于单机版，具体数量取决于集群的规模。

4.WebCollector的遍历
WebCollector采用一种粗略的广度遍历，但这里的遍历与网站的拓扑树结构没有任何关系，用户不需要在意遍历的方式。

网络爬虫会在访问页面时，从页面中探索新的URL，继续爬取。WebCollector为探索新URL提供了两种机制，自动解析和手动解析。两种机制的具体内容请读后面实例中的代码注释。

5.WebCollector 2.x版本新特性
1）自定义遍历策略，可完成更为复杂的遍历业务，例如分页、AJAX
2）可以为每个URL设置附加信息(MetaData)，利用附加信息可以完成很多复杂业务，例如深度获取、锚文本获取、引用页面获取、POST参数传递、增量更新等。
3）使用插件机制，WebCollector内置两套插件。
4）内置一套基于内存的插件（RamCrawler)，不依赖文件系统或数据库，适合一次性爬取，例如实时爬取搜索引擎。
5）内置一套基于Berkeley DB（BreadthCrawler)的插件：适合处理长期和大量级的任务，并具有断点爬取功能，不会因为宕机、关闭导致数据丢失。
6）集成selenium，可以对javascript生成信息进行抽取
7）可轻松自定义http请求，并内置多代理随机切换功能。可通过定义http请求实现模拟登录。
8）使用slf4j作为日志门面，可对接多种日志
9）新增了WebCollector-Hadoop，支持分布式爬取
6.WebCollector爬取新闻网站实例
这里是2.7版本范例大家请注意
自动探测URL
AutoNewsCrawler.java:

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Document;

/**
 * Crawling news from hfut news
 *
 * @author hu
 */
public class AutoNewsCrawler extends BreadthCrawler {
    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true,BreadthCrawler will auto extract
     *                  links which match regex rules from pag
     */
    public AutoNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start page*/
        this.addSeed("http://news.hfut.edu.cn/list-1-1.html");

        /*fetch url like http://news.hfut.edu.cn/show-xxxxxxhtml*/
        this.addRegex("http://news.hfut.edu.cn/show-.*html");
        /*do not fetch jpg|png|gif*/
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch url contains #*/
        this.addRegex("-.*#.*");

        setThreads(50);
        getConf().setTopN(100);

//        setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        /*if page is news page*/
        if (page.matchUrl("http://news.hfut.edu.cn/show-.*html")) {

            /*extract title and content of news by css selector*/
            String title = page.select("div[id=Article]>h2").first().text();
            String content = page.selectText("div#artibody");

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);

            /*If you want to add urls to crawl,add them to nextLink*/
            /*WebCollector automatically filters links that have been fetched before*/
            /*If autoParse is true and the link you add to nextLinks does not match the 
              regex rules,the link will also been filtered.*/
            //next.add("http://xxxxxx.com");
        }
    }

    public static void main(String[] args) throws Exception {
        AutoNewsCrawler crawler = new AutoNewsCrawler("crawl", true);
        /*start crawl with depth of 4*/
        crawler.start(4);
    }

}

手动探测URL

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Document;

/**
 * Crawling news from hfut news
 *
 * @author hu
 */
public class ManualNewsCrawler extends BreadthCrawler {
    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true,BreadthCrawler will auto extract
     *                  links which match regex rules from pag
     */
    public ManualNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*add 10 start pages and set their type to "list"
          "list" is not a reserved word, you can use other string instead
         */
        for(int i = 1; i <= 10; i++) {
            this.addSeed("http://news.hfut.edu.cn/list-1-" + i + ".html", "list");
        }

        setThreads(50);
        getConf().setTopN(100);


//        setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();

        if (page.matchType("list")) {
            /*if type is "list"*/
            /*detect content page by css selector and mark their types as "content"*/
            next.add(page.links("div[class=' col-lg-8 '] li>a")).type("content");
        }else if(page.matchType("content")) {
            /*if type is "content"*/
            /*extract title and content of news by css selector*/
            String title = page.select("div[id=Article]>h2").first().text();
            String content = page.selectText("div#artibody", 0);

            //read title_prefix and content_length_limit from configuration
            title = getConf().getString("title_prefix") + title;
            content = content.substring(0, getConf().getInteger("content_length_limit"));

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }

    }

    public static void main(String[] args) throws Exception {
        ManualNewsCrawler crawler = new ManualNewsCrawler("crawl", false);

        crawler.getConf().setExecuteInterval(5000);

        crawler.getConf().set("title_prefix","PREFIX_");
        crawler.getConf().set("content_length_limit", 20);

        /*start crawl with depth of 4*/
        crawler.start(4);
    }

}

webcollector 爬虫框架使用说明

WebCollector

猜你喜欢