Webmagic - Custom Components

The original text comes from http://webmagic.io/docs/zh. That site is often unreachable, so I have copied the documentation to my blog.

In the first chapter, we mentioned the components of WebMagic. A major feature of WebMagic is that its components can be flexibly customized to achieve exactly the behavior you want.

In the Spider class, the four components PageProcessor, Downloader, Scheduler and Pipeline are all fields of the Spider. Except for the PageProcessor, which is specified when the Spider is created, the Downloader, Scheduler and Pipeline can all be configured and replaced through the Spider's setter methods.

Method | Description | Example
setScheduler() | Set the Scheduler | spider.setScheduler(new FileCacheQueueScheduler("D:\data\webmagic"))
setDownloader() | Set the Downloader | spider.setDownloader(new SeleniumDownloader())
addPipeline() | Add a Pipeline; one Spider can have multiple Pipelines | spider.addPipeline(new FilePipeline())

In this chapter, we'll talk about how to customize these components to do what we want.

Custom Pipelines

The Pipeline is the component that processes results after extraction. It is mainly used to save extraction results, but it can also be customized to implement some general-purpose functionality. In this section, we will introduce the Pipeline and use two examples to explain how to customize it.

Introduction to Pipeline

The interface of Pipeline is defined as follows:

public interface Pipeline {

    // ResultItems holds the extraction results. It is a Map-like structure:
    // data saved with page.putField(key, value) can be retrieved via ResultItems.get(key).
    public void process(ResultItems resultItems, Task task);

}

It can be seen that the Pipeline simply continues processing the results extracted by the PageProcessor. In fact, everything a Pipeline does could basically be implemented directly in the PageProcessor, so why does the Pipeline exist? There are several reasons:

  1. Module separation. "Page extraction" and "post-processing and persistence" are two distinct stages of a crawler. Separating them keeps the code structure clear, and it also makes it possible to later split the processing stage into independent threads or even onto different machines.
  2. The functionality of a Pipeline is relatively fixed, which makes it easier to build reusable components. The way each page is extracted varies endlessly, but the subsequent processing is relatively fixed. For example, saving to a file or saving to a database works the same way for every page. WebMagic already provides several common Pipelines for console output, saving to a file, and saving as a JSON-formatted file.

In WebMagic, one Spider can have multiple Pipelines, and you can add a Pipeline with Spider.addPipeline(). All of these Pipelines will be invoked. For example, you can use

spider.addPipeline(new ConsolePipeline()).addPipeline(new FilePipeline())

to output the results to the console and save them to a file at the same time.

Output the results to the console

When introducing PageProcessor, we used GithubRepoPageProcessor as an example, and in one part of its code we saved the results:

public void process(Page page) {
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+)").all());
    // Save the result "author"; it will eventually be stored in ResultItems
    page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
    page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
    if (page.getResultItems().get("name")==null){
        // After setSkip(true) is called, the results of this page will not be processed by any Pipeline
        page.setSkip(true);
    }
    page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
}

Now we want to output the results to the console. How do we do that? ConsolePipeline can do the job:

public class ConsolePipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        System.out.println("get page: " + resultItems.getRequest().getUrl());
        // Iterate over all results and print them to the console. In the example above,
        // "author", "name" and "readme" are keys, and the corresponding values are the extracted results.
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}

With this example as a reference, you can customize your own Pipeline: take the data out of ResultItems and process it however you want.

Save results to MySQL

Here is a demo project: jobhunter. It is an example that integrates Spring, uses WebMagic to fetch job postings, and uses MyBatis to persist them to MySQL. We will use this project to introduce how to persist results to MySQL.

In Java, we have many ways to save data to MySQL, such as JDBC, DbUtils, spring-jdbc and MyBatis. These tools can all accomplish the same thing, but they differ in features and ease of use. If we use plain JDBC, we only need to fetch the data from ResultItems and save it.
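
A minimal sketch of the plain-JDBC approach. The connection settings, the job_info table and the "title" key are placeholders for illustration and are not taken from the demo project:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class JdbcJobInfoPipeline implements Pipeline {

    // Placeholder connection settings; adjust them for your environment.
    private static final String JDBC_URL = "jdbc:mysql://localhost:3306/test";
    private static final String USER = "root";
    private static final String PASSWORD = "password";

    @Override
    public void process(ResultItems resultItems, Task task) {
        // "title" is assumed to have been put into ResultItems by the PageProcessor.
        String title = resultItems.get("title");
        String url = resultItems.getRequest().getUrl();
        try (Connection conn = DriverManager.getConnection(JDBC_URL, USER, PASSWORD);
             PreparedStatement ps = conn.prepareStatement(
                     "insert into job_info (title, url) values (?, ?)")) {
            ps.setString(1, title);
            ps.setString(2, url);
            ps.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}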

If we use an ORM framework to handle persistence to MySQL, we face a problem: these frameworks generally require the content to be saved to be an object with a defined structure, rather than key-value pairs in a ResultItems. Taking MyBatis as an example, we can define a DAO like this using MyBatis-Spring:

public interface JobInfoDAO {

    @Insert("insert into JobInfo (`title`,`salary`,`company`,`description`,`requirement`,`source`,`url`,`urlMd5`) values (#{title},#{salary},#{company},#原文出自:http://webmagic.io/docs/zh 访问经常出错,于是把文档转到自己博客里

在第一章里,我们提到了WebMagic的组件。WebMagic的一大特色就是可以灵活的定制组件功能,实现你自己想要的功能。
在Spider类里,PageProcessor、Downloader、Scheduler和Pipeline四个组件都是Spider的字段。除了PageProcess,#{requirement},#{source},#{url},#{urlMd5})")
    public int add(LieTouJobInfo jobInfo);
}

All we have to do is implement a Pipeline that assembles the data in ResultItems into a LieTouJobInfo object.
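
A rough sketch of such a Pipeline, reusing the JobInfoDAO shown above. The result keys ("title", "salary") and the setters on LieTouJobInfo are assumptions made for illustration; the demo project's actual fields may differ:

@Component("lieTouJobInfoPipeline")
public class LieTouJobInfoPipeline implements Pipeline {

    @Resource
    private JobInfoDAO jobInfoDAO;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Assemble the key-value results into a LieTouJobInfo object and hand it to the DAO.
        LieTouJobInfo jobInfo = new LieTouJobInfo();
        jobInfo.setTitle(resultItems.get("title"));     // assumed setter and key
        jobInfo.setSalary(resultItems.get("salary"));   // assumed setter and key
        jobInfo.setUrl(resultItems.getRequest().getUrl());
        jobInfoDAO.add(jobInfo);
    }
}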

Annotation mode

In annotation mode, WebMagic has a built-in PageModelPipeline :

public interface PageModelPipeline<T> {

    // The object passed in here has already been assembled
    public void process(T t, Task task);

}

With it, we can elegantly define a JobInfoDaoPipeline to do the job:

@Component("JobInfoDaoPipeline")
public class JobInfoDaoPipeline implements PageModelPipeline<LieTouJobInfo> {

    @Resource
    private JobInfoDAO jobInfoDAO;

    @Override
    public void process(LieTouJobInfo lieTouJobInfo, Task task) {
        // Call the MyBatis DAO to save the result
        jobInfoDAO.add(lieTouJobInfo);
    }
}
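
A sketch of how such a pipeline might be wired into an annotation-mode crawl. The start URL is a placeholder, and in a real Spring setup the pipeline bean would be injected rather than created with new:

OOSpider.create(Site.me(), new JobInfoDaoPipeline(), LieTouJobInfo.class)
        .addUrl("https://www.example.com/jobs")   // placeholder start URL
        .thread(5)
        .run();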

Basic Pipeline Mode

At this point, the result saving has been completed! So how do we do it if we use the original Pipeline interface? The answer is actually very simple: if you want to save an object, then you need to extract it as an object in the first place:

public void process(Page page) {
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+)").all());
    GithubRepo githubRepo = new GithubRepo();
    githubRepo.setAuthor(page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
    githubRepo.setName(page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
    githubRepo.setReadme(page.getHtml().xpath("//div[@id='readme']/tidyText()").toString());
    if (githubRepo.getName() == null) {
        //skip this page
        page.setSkip(true);
    } else {
        page.putField("repo", githubRepo);
    }
}

In Pipeline, just use

GithubRepo githubRepo = (GithubRepo)resultItems.get("repo");

to get this object.
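
Putting it together, a complete custom Pipeline built on this pattern might look like the following sketch. The save method is a hypothetical stand-in for whatever persistence you use, and getName() is assumed to be a getter on GithubRepo:

public class GithubRepoPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Retrieve the object stored by the PageProcessor under the "repo" key.
        GithubRepo githubRepo = (GithubRepo) resultItems.get("repo");
        if (githubRepo != null) {
            save(githubRepo);
        }
    }

    private void save(GithubRepo githubRepo) {
        // Hypothetical persistence step: use JDBC, MyBatis or anything else here.
        System.out.println("saving repo: " + githubRepo.getName());
    }
}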

PageModelPipeline is actually implemented on top of the original Pipeline interface. It works together with the annotation-mode PageProcessor: when saving, the class name is used as the key and the object as the value. For the concrete implementation, see ModelPipeline.

Several Pipelines already provided by WebMagic

Several Pipelines for outputting results to the console, saving to a file, and saving in JSON format are already provided by WebMagic:

Class | Description | Remark
ConsolePipeline | Output results to the console | The extracted results need to implement the toString method
FilePipeline | Save results to a file | The extracted results need to implement the toString method
JsonFilePipeline | Save results to a file in JSON format |
ConsolePageModelPipeline | (Annotation mode) Output results to the console |
FilePageModelPipeline | (Annotation mode) Save results to a file |
JsonFilePageModelPipeline | (Annotation mode) Save results to a file in JSON format | Fields to be persisted need getter methods

Custom Scheduler

Scheduler is the component responsible for URL management in WebMagic. In general, a Scheduler has two responsibilities (its interface is sketched after this list):

  1. Manage the queue of URLs to be crawled.
  2. De-duplicate the crawled URLs.
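
These two responsibilities map directly onto the two methods of the Scheduler interface, whose shape in the WebMagic source is essentially:

public interface Scheduler {

    // Add a request to the queue of URLs to be crawled (deduplication happens here).
    void push(Request request, Task task);

    // Get the next request to crawl; returns null when there is nothing left.
    Request poll(Task task);
}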

WebMagic has several commonly used Schedulers built in. If you are only running small-scale crawlers locally, there is basically no need to customize the Scheduler, but it is worth having a look at the ones that are already provided.

Class | Description | Remark
DuplicateRemovedScheduler | Abstract base class that provides some template methods | Inherit it to implement your own Scheduler
QueueScheduler | Uses an in-memory queue to hold the URLs to be crawled |
PriorityScheduler | Uses an in-memory priority queue to hold the URLs to be crawled | Uses more memory than QueueScheduler, but when request.priority is set, only PriorityScheduler makes the priority take effect
FileCacheQueueScheduler | Uses files to store the URLs being crawled, so a stopped crawl can resume from where it left off on the next run | A path must be specified; two files, .urls.txt and .cursor.txt, will be created
RedisScheduler | Uses Redis to store the crawl queue, which allows multiple machines to crawl cooperatively at the same time | Requires Redis to be installed and running
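
For example, switching a crawl to a shared Redis queue is just a matter of swapping the Scheduler; this sketch assumes a Redis server reachable on localhost:

spider.setScheduler(new RedisScheduler("localhost"));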

In version 0.5.1, I refactored the internal implementation of Scheduler and abstracted the deduplication part into an interface, DuplicateRemover, so that different deduplication strategies can be plugged into the same Scheduler to suit different needs. Two deduplication implementations are provided.

Class | Description
HashSetDuplicateRemover | Uses a HashSet for deduplication; memory usage is relatively high
BloomFilterDuplicateRemover | Uses a BloomFilter for deduplication; memory usage is much lower, but some pages may be missed
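
For reference, the DuplicateRemover interface that both classes implement looks roughly like this (based on the WebMagic source):

public interface DuplicateRemover {

    // Return true if the request has already been seen; otherwise record it.
    boolean isDuplicate(Request request, Task task);

    // Clear the deduplication records for this task.
    void resetDuplicateCheck(Task task);

    // The number of requests recorded so far, useful for monitoring.
    int getTotalRequestsCount(Task task);
}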

All default Schedulers use HashSetDuplicateRemover for deduplication (except RedisScheduler, which uses a Redis set). If you have a very large number of URLs, HashSetDuplicateRemover will consume a lot of memory, so you can try BloomFilterDuplicateRemover instead. It is used like this:

spider.setScheduler(new QueueScheduler()
    .setDuplicateRemover(new BloomFilterDuplicateRemover(10000000)) // 10000000 is the estimated number of pages
);

Using Downloader

WebMagic's default Downloader is based on HttpClient. Generally speaking, you do not need to implement a Downloader yourself, but HttpClientDownloader also reserves several extension points to meet the needs of different scenarios.

In addition, you may want to implement page downloading in other ways, for example using SeleniumDownloader to render dynamic pages.
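
If you do decide to implement your own Downloader and plug it in with spider.setDownloader(...), the interface is small; based on the WebMagic source it looks roughly like this:

public interface Downloader {

    // Download the page for the given request and wrap the response into a Page.
    Page download(Request request, Task task);

    // Tell the downloader how many threads the spider uses, so it can size its pools.
    void setThread(int threadNum);
}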
