[Java-Crawler] Learn to use the WebMagic crawler framework in one article

WebMagic

A crawler is mainly divided into three parts: collection, processing, and storage.
Before learning the WebMagic framework, you should understand the HttpClient and Jsoup (Java HTML parser) libraries, or at least their basic use, because WebMagic uses them internally. When something goes wrong and you read the source code to track down the error, you may be lost if you do not know HttpClient and Jsoup. In fact, without these two libraries WebMagic could hardly be called an easy-to-start crawler framework.

WebMagic Official Documentation

WebMagic Architecture

​ The architecture of WebMagic is divided into four major components: Downloader, PageProcessor, Scheduler and Pipeline, which are organized by the Spider (the container). These four components correspond to the download, processing, management and persistence functions in the crawler life cycle. The design of WebMagic refers to Scrapy (a Python framework), but the implementation is more Java-like.

​ The Spider organizes these components so that they can interact with each other and execute in a pipeline-like process. You can think of the Spider as a large container; it is also the core of WebMagic's logic.

The overall architecture of WebMagic is as follows:

Four major components of WebMagic

1.Downloader

The Downloader is responsible for downloading pages from the Internet for subsequent processing; in other words, it fetches the raw data.

2.PageProcessor

The PageProcessor is responsible for parsing pages, extracting useful information and discovering new links. WebMagic uses Jsoup as its HTML parsing tool and, based on it, developed Xsoup, a tool for parsing XPath. In short, it parses the data.

3.Scheduler

The Scheduler is responsible for managing the URLs to be crawled, as well as deduplication. By default, WebMagic provides a JDK in-memory queue to manage URLs and uses a set for deduplication. Distributed management using Redis is also supported.

4.Pipeline

The Pipeline defines how results are saved. If you need to save them to a specific database, you need to write the corresponding Pipeline. Generally, you only need to write one Pipeline per type of requirement.
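As a rough sketch (the class name and output format here are made up for illustration), a custom Pipeline only has to implement WebMagic's Pipeline interface:

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

// Hypothetical example: print every extracted field; a real one would write to a database.
public class MyConsolePipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        // ResultItems holds everything the PageProcessor stored via page.putField(...)
        resultItems.getAll().forEach((key, value) ->
                System.out.println(key + ":\t" + value));
    }
}

It is then registered on the Spider with addPipeline(new MyConsolePipeline()).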

The four major components are properties in Spider


Objects used for data flow

1. Request

A Request is a layer of encapsulation around a URL address; one Request corresponds to one URL.

It is the carrier for the interaction between PageProcessor and Downloader, and it is also the only way for PageProcessor to control Downloader.

In addition to the URL itself, it also contains a Key-Value field called extra. You can save some special attributes in extra and read them elsewhere to implement different functions, for example to carry along information from the previous page.
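A minimal sketch of that idea (the URL and key used here are made up for illustration): attach an attribute when queuing the request, then read it back later inside process(Page page):

// When discovering a link, attach extra information to the new Request.
Request request = new Request("https://www.51cto.com/some-detail-page"); // hypothetical URL
request.putExtra("category", "java");   // travels with the request
page.addTargetRequest(request);

// When that page has been downloaded, read the attribute back in process(Page page):
String category = (String) page.getRequest().getExtra("category");
page.putField("category", category);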

2. Page

A Page represents a page downloaded by the Downloader; it may be HTML, JSON or content in another text format.

Page is the core object of WebMagic's extraction process, and it provides methods for extraction and for saving results. The examples below show its use in detail.

3. ResultItems

ResultItems is essentially a Map: it holds the results produced by the PageProcessor for use by the Pipeline. Its API is very similar to Map's. One field worth noting is skip: if it is set to true, the results will not be processed by the Pipeline.
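For example (a small sketch, with a made-up URL pattern), a PageProcessor can mark list pages so that only detail pages reach the Pipeline:

// Inside process(Page page): skip pages that are only used for link discovery.
if (page.getUrl().regex(".*list.*").match()) {   // hypothetical URL pattern
    page.setSkip(true);                          // sets ResultItems.skip = true
}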

Write a basic crawler

1. Implement the PageProcessor interface

The customization of PageProcessor is divided into three parts, which are crawler configuration, page element extraction and link discovery.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class JobProcessor implements PageProcessor {

    /**
     * This method is responsible for parsing the page.
     * @param page the downloaded page
     */
    @Override
    public void process(Page page) {
        // Parse the returned page and put the extracted results into resultItems.
        // If no Pipeline output location is specified, the results are printed to the console.
        page.putField("title", page.getHtml().css("title").all());
    }

    private Site site = Site.me();

    @Override
    public Site getSite() {
        return site;
    }

    // Main method: runs the crawler.
    public static void main(String[] args) {
        Spider.create(new JobProcessor())
                .addUrl("https://www.51cto.com/")  // page to crawl
                .run();                            // start the crawler
    }
}

2. Use Selectable to extract elements

Selectable and its chained extraction API are a core feature of WebMagic. Using the Selectable interface, you can complete chained extraction of page elements directly, without worrying about the details of extraction.

As you can see in the example above, page.getHtml() returns an Html object that implements the Selectable interface. This interface contains some important methods, which I divide into two groups: extraction methods and result-retrieval methods.

Three extraction techniques are mainly used in WebMagic: XPath, regular expressions and CSS selectors. In addition, JSONPath can be used to parse content in JSON format.
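For instance, when the downloaded content is JSON rather than HTML, a small sketch (the field name $.name is just an illustration) can use page.getJson() instead of page.getHtml():

// Inside process(Page page), assuming the response body is JSON:
page.putField("name", page.getJson().jsonPath("$.name").get());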

Below are examples of the three extraction methods (XPath, regular expressions and CSS selectors); they can also be combined in a single chain:

    @Override
    public void process(Page page) {
        // Parse the returned page and put the extracted results into resultItems.
        // If no Pipeline output location is specified, the results are printed to the console.

        // CSS selector
        page.putField("title", page.getHtml().css("title").all());

        // Regular expression combined with XPath
        page.putField("p1", page.getHtml().xpath("//div[@class=top]/top/template/div/a/p").regex(".*电话.*").all());

        // XPath
        page.putField("p", page.getHtml().xpath("//div[@class=top]/top/template/div/a/p").all());
    }

In XPath, /text() is equivalent to the text() method of Element. In any case, all three extraction methods (XPath, regular expressions and CSS selectors) are important to master.


(get() and toString() return the same result, and when there are multiple matches the first one is returned by default.)
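A short sketch of the difference between the result methods, inside process(Page page):

List<String> allTitles = page.getHtml().css("title").all();      // every match as a list
String first           = page.getHtml().css("title").get();      // first match only
String sameAsGet       = page.getHtml().css("title").toString(); // identical to get()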

3. Get the link

A PageProcessor handles a parsed page in two ways: it can output data through the Pipeline component, or it can hand new links to the Scheduler component, which stores them so that they are crawled and parsed automatically in later rounds.

// Extract links from the page, keep only those starting with "https", and add them to the crawl queue
page.addTargetRequests(page.getHtml().css("div.blog-nav-box ul li").links().regex("https.*").all());

4. Use Pipeline to save data

​ The component WebMagic uses to save results is called the Pipeline. The "console output" we have relied on so far is also done through a built-in Pipeline called ConsolePipeline, which is added automatically when the Spider is initialized.


​ So, what if we need to save the result to a file instead? Just replace the Pipeline implementation with FilePipeline.
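A minimal sketch (the output directory is arbitrary): register FilePipeline when building the Spider. Once any Pipeline is registered, the default ConsolePipeline is no longer added automatically, so add it explicitly if you still want console output.

Spider.create(new JobProcessor())
        .addUrl("https://www.51cto.com/")
        .addPipeline(new FilePipeline("D:/webmagic/"))   // hypothetical output directory
        .addPipeline(new ConsolePipeline())              // keep console output as well
        .run();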

5. Configuration, startup and termination of the crawler

1. Spider

Spider is the entry point for starting a crawler. Before starting, we need a PageProcessor to create a Spider object (by calling Spider's static create() method), and then call run() to start it. The other components of the Spider (Downloader, Scheduler, Pipeline) can all be configured through the corresponding set/add methods.
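A hedged sketch of a more complete Spider configuration (the URL and thread count are just examples):

Spider.create(new JobProcessor())
        .addUrl("https://www.51cto.com/")      // initial URL(s)
        .setScheduler(new QueueScheduler())    // URL management (this is the default)
        .addPipeline(new ConsolePipeline())    // where results go
        .thread(5)                             // number of download threads
        .run();                                // blocks; use start() to run asynchronously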


2. Crawler Configuration Site

Some configuration of the site itself, such as encoding, HTTP headers, timeout, retry strategy and proxy, can be set on the Site object.

    private Site site = Site.me()
            .setCharset("utf-8")
            .setTimeOut(10000)        // timeout, in milliseconds
            .setRetrySleepTime(3000)  // interval between retries, in milliseconds
            .setRetryTimes(3);        // number of retries

    @Override
    public Site getSite() {
        return site;
    }
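The headers, cookies and User-Agent mentioned above can be configured the same way; a rough sketch (the values are placeholders):

    private Site site = Site.me()
            .setCharset("utf-8")
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // placeholder UA string
            .addHeader("Referer", "https://www.51cto.com/")            // extra HTTP header
            .addCookie("JSESSIONID", "xxxx");                          // cookie for the default domain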


Crawler classification

According to system structure and implementation technology, web crawlers can be roughly divided into the following types: general web crawlers (crawl all data), focused web crawlers (crawl data of interest), incremental web crawlers (crawl changed data) and deep web crawlers (crawl data that can only be obtained after extra steps such as form submission). A real-world crawler system is usually implemented by combining several of these techniques.

1. General Web Crawler

​ A general-purpose web crawler is also called a Scalable Web Crawler. Its crawling scope expands from a set of seed URLs to the entire web, and it mainly collects data for portal search engines and large web service providers.

​ The crawling range and volume of this type of crawler are huge, so it has high requirements for crawling speed and storage space, and relatively low requirements for the order in which pages are crawled. Because there are so many pages to refresh, it usually works in parallel, but it still takes a long time to refresh all pages once.

​To put it simply, it crawls everything on the web, as a search engine does.

2. Focused web crawlers (commonly used)

​ A focused crawler, also known as a topical crawler, is a web crawler that selectively crawls pages related to pre-defined topics.

​Compared with general-purpose web crawlers, a focused crawler only needs to crawl pages related to its topic, which greatly saves hardware and network resources. The saved pages are also updated quickly because there are fewer of them, and this type of crawler can well satisfy the needs of specific groups of people for information in specific fields.

Simply put, only one type of data is crawled from the Internet (for example, price data for a price-comparison site such as Manmanbuy).

3. Incremental web crawler

​An incremental web crawler (Incremental Web Crawler) is a crawler that incrementally updates already-downloaded pages and only crawls newly generated or changed pages. It can guarantee, to a certain extent, that the crawled pages are as fresh as possible.

​ Compared with crawlers that periodically re-crawl and refresh pages, an incremental crawler only crawls newly generated or updated pages when needed and does not re-download pages that have not changed. This effectively reduces the amount of data downloaded and keeps the crawled pages up to date, reducing time and space consumption, but it increases the complexity and implementation difficulty of the crawling algorithm.

Simply put, only newly updated data is crawled from the Internet.

4. Deep Web Crawler

​ Web pages can be divided into surface web pages (Surface Web) and deep web pages (Deep Web, also known as the Invisible Web or Hidden Web).

​ Surface web pages are pages that can be indexed by traditional search engines; they mainly consist of static pages reachable through hyperlinks.

​Deep web pages are those whose content mostly cannot be reached through static links; it is hidden behind search forms and can only be obtained after users submit keywords.

Scheduler component

WebMagic provides Scheduler which can help us solve URL management problems.

Scheduler is a component for URL management in WebMagic. Generally speaking, Scheduler includes two functions:

  1. Manage the queue of URLs to be crawled.
  2. Deduplication of crawled URLs.

Several commonly used Schedulers are built into WebMagic. If you only run small-scale crawlers locally, there is basically no need to customize the Scheduler, but it is still worth knowing the ones already provided (the default is QueueScheduler).
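As a sketch (assuming the webmagic-extension module is on the classpath; the cache directory and Redis host are placeholders), switching Schedulers is a single call on the Spider:

// In-memory queue (the default): the queue is lost when the process exits.
spider.setScheduler(new QueueScheduler());

// File-backed queue: the crawl can resume from where it stopped.
spider.setScheduler(new FileCacheQueueScheduler("D:/webmagic/urls/"));

// Redis-backed queue: supports distributed crawling across machines.
spider.setScheduler(new RedisScheduler("127.0.0.1"));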



The deduplication part is separately abstracted into an interface: DuplicateRemover, so that different deduplication methods can be selected for the same Scheduler to meet different needs. Currently, two deduplication methods are provided.

All default Schedulers use HashSetDuplicateRemover for deduplication (except RedisScheduler, which uses a Redis set). If you have a very large number of URLs, HashSetDuplicateRemover consumes a lot of memory, so you can try BloomFilterDuplicateRemover instead, used as follows:

    @Scheduled(initialDelay = 1000, fixedDelay = 100000)
    public void process() {
        Spider.create(new JobProcessor())
                .addUrl(url)
                .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(100000)))
                .thread(5)
                .run();
    }

Three deduplication methods

  • HashSet
    • Use a HashSet in Java to remove duplicates. The advantage is that it is easy to understand and easy to use.
    • Disadvantages: it occupies a lot of memory, and performance suffers as the number of URLs grows.
  • Redis deduplication
    • Use a Redis set to deduplicate. The advantages are speed (Redis itself is very fast), the fact that deduplication does not consume the crawler server's resources, and the ability to handle larger volumes of data.
    • Disadvantage: a Redis server has to be provided, which increases development and operating costs.
  • Bloom filter (BloomFilter)
    • Deduplication can also be done with a Bloom filter. The advantage is that it occupies far less memory than a HashSet, and it is also suitable for deduplicating large amounts of data.
    • Disadvantage: false positives are possible. A URL that has never been seen may be wrongly judged as a duplicate (so some data may be lost), but a real duplicate will always be detected. For a crawler this is usually acceptable.
      • The Bloom filter was proposed by Burton Howard Bloom in 1970. It is a space-efficient probabilistic data structure used to determine whether an element is in a set. It is commonly used in the blacklist/whitelist matching of spam filters and in the URL deduplication module of crawlers.
      • A hash table can also determine whether an element is in a set, but a Bloom filter needs only 1/8 to 1/4 of the space of a hash table to do the same job. A Bloom filter can insert elements but cannot delete existing ones. The more elements it contains, the higher the false-positive rate, but false negatives are impossible.

Advantages and disadvantages of using WebMagic

Advantages:

  • Easy to use: just add the dependencies and implement a PageProcessor to get started.
  • Built-in IO handling and multithreading, efficient and stable.
  • Modular and easy to extend.

Disadvantages:

  • It can only fetch the original HTML of static pages. Many modern web pages, however, are rendered dynamically: much of their content sits in <script> tags and only appears after the browser executes the JavaScript. In other words, WebMagic does not support JavaScript rendering.

Fortunately, Selenium + ChromeDriver can solve this (sorting that topic out took me two full days and nearly wore me out; I will write it up in a separate blog post). You can also write your own Downloader component and configure it on the Spider, but I have not done that for a long time.
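For reference, here is a rough, untested sketch of that second idea, a Selenium-backed Downloader. The class name, driver setup and error handling are all my assumptions, not the author's implementation:

import org.openqa.selenium.chrome.ChromeDriver;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.selector.PlainText;

// Hypothetical sketch: let Chrome render the page, then hand the final HTML to WebMagic.
public class SeleniumDownloader implements Downloader {

    private final ChromeDriver driver = new ChromeDriver(); // assumes chromedriver is on the PATH

    @Override
    public Page download(Request request, Task task) {
        driver.get(request.getUrl());                 // the browser executes the JavaScript
        Page page = new Page();
        page.setRequest(request);
        page.setUrl(new PlainText(request.getUrl()));
        page.setRawText(driver.getPageSource());      // rendered HTML, not the raw response
        return page;
    }

    @Override
    public void setThread(int threadNum) {
        // A single ChromeDriver is not thread-safe; a real implementation would pool drivers.
    }
}

It would then be wired in with Spider.create(new JobProcessor()).setDownloader(new SeleniumDownloader()).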


Origin blog.csdn.net/qq_63691275/article/details/130836239