Web Crawler Part 2: WebMagic

1. Introduction to WebMagic

  1. Basic knowledge:
    WebMagic is a crawler framework built on top of HttpClient and Jsoup that lets us develop crawlers more conveniently.
    The WebMagic code base is divided into two parts: core and extension. The core part (webmagic-core) is a streamlined, modular crawler implementation, while the extension part adds some convenient, practical features.
    The design goal of WebMagic is to be as modular as possible and to reflect the functional characteristics of a crawler. The core provides a very simple and flexible API with which you can write a crawler with essentially no change to your development model.
    The extension part (webmagic-extension) provides convenience features such as writing crawlers in annotation mode, and also ships some commonly used components to simplify crawler development.
  2. Architecture Introduction
    The structure of WebMagic is divided into four components: Downloader, PageProcessor, Scheduler, and Pipeline, which Spider organizes together. These four components correspond to the download, processing, management, and persistence functions in the crawler life cycle. WebMagic's design takes cues from Scrapy, but the implementation is more Java-flavored.
    Spider organizes these components so that they can interact with one another and run as a flow. Spider can be thought of as a large container; it is also the core of WebMagic's logic.
  3. The overall structure of WebMagic is as follows (architecture diagram omitted):
    3.1. Four components of WebMagic
    1. Downloader
      Downloader is responsible for downloading pages from the Internet for subsequent processing. WebMagic uses Apache HttpClient as the download tool by default.
    2. PageProcessor
      PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parsing tool and has built Xsoup, an XPath extraction tool, on top of it.
      Among the four components, PageProcessor is the one that differs for every page of every site; it is the part the user must customize.
    3. Scheduler
      Scheduler is responsible for managing the URLs to be crawled and for deduplication. By default, WebMagic provides a JDK in-memory queue to manage URLs and uses a set to remove duplicates. It also supports Redis for distributed management.
    4. Pipeline
      Pipeline is responsible for processing the extraction results, including computation, persistence to files, databases, and so on. By default, WebMagic provides two result-processing options: "output to console" and "save to file".
      Pipeline defines how results are saved. If you want to save to a particular database, you need to write the corresponding Pipeline; generally, one Pipeline per type of requirement is enough. A minimal sketch follows this list.
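      As a hedged illustration, a minimal custom Pipeline might look like the sketch below. The class name MyDatabasePipeline and the pretend "save" step are placeholders; the only real requirement is implementing WebMagic's Pipeline interface.

      import java.util.Map;

      import us.codecraft.webmagic.ResultItems;
      import us.codecraft.webmagic.Task;
      import us.codecraft.webmagic.pipeline.Pipeline;

      // Minimal custom Pipeline sketch: the class name and the pretend "save"
      // step are placeholders; WebMagic only requires implementing process().
      public class MyDatabasePipeline implements Pipeline {

          @Override
          public void process(ResultItems resultItems, Task task) {
              // every field the PageProcessor stored via page.putField() is available here
              for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                  // replace this println with a real INSERT/UPDATE against your database
                  System.out.println(entry.getKey() + ":\t" + entry.getValue());
              }
          }
      }

      Such a Pipeline would then be registered on the crawler with spider.addPipeline(new MyDatabasePipeline()).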

3.2. Objects used for data flow

  1. Request
    Request is a layer of encapsulation around a URL address; one Request corresponds to one URL.
    It is the carrier of interaction between PageProcessor and Downloader, and also the only way for PageProcessor to control Downloader.
    Besides the URL itself, a Request also contains a Key-Value field called extra. You can store special attributes in extra and read them elsewhere to implement different functions, for example attaching extra information about a page.
  2. Page
    Page represents a page downloaded by the Downloader; it may be HTML, JSON, or content in another text format.
    Page is the core object of WebMagic's extraction process; it provides methods for extraction, saving results, and so on.
  3. ResultItems
    ResultItems is equivalent to a Map: it stores the results produced by the PageProcessor for use by the Pipeline, and its API is very similar to Map's. It is worth noting that it has a field skip; if skip is set to true, the page will not be processed by any Pipeline, as sketched below.
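    For example, a PageProcessor can mark an "empty" page so that no Pipeline processes it. A minimal sketch, assuming a hypothetical selector div.title and field name titles:

    // Sketch: inside a PageProcessor's process(Page page) method.
    // "div.title" and the field name "titles" are illustrative only.
    java.util.List<String> titles = page.getHtml().css("div.title", "text").all();
    if (titles.isEmpty()) {
        page.setSkip(true);              // sets skip on the ResultItems, so Pipelines ignore this page
    } else {
        page.putField("titles", titles); // stored in the ResultItems for the Pipelines
    }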

3.3 Introductory case

  1. Add dependencies
    Create a Maven project and add the following dependencies

    <!-- WebMagic core -->
    <dependency>
       <groupId>us.codecraft</groupId>
       <artifactId>webmagic-core</artifactId>
       <version>0.7.3</version>
    </dependency>
    <!-- WebMagic extension -->
    <dependency>
       <groupId>us.codecraft</groupId>
       <artifactId>webmagic-extension</artifactId>
       <version>0.7.3</version>
    </dependency>
    

    Note: SSL support in version 0.7.3 is incomplete. If the dependency is downloaded directly from the Maven central repository, an SSL exception is thrown when crawling sites that only support TLS v1.2.
    Solutions:

    1. Wait for the author to release version 0.7.4.
    2. Download the latest code from GitHub and install it into your local Maven repository.
      You can also fix it yourself by following:
      https://github.com/code4craft/webmagic/issues/701
  2. Add the configuration file
    WebMagic uses slf4j-log4j12 as the slf4j implementation.
    Add a log4j.properties configuration file:

    log4j.rootLogger=INFO,A1

    log4j.appender.A1=org.apache.log4j.ConsoleAppender
    log4j.appender.A1.layout=org.apache.log4j.PatternLayout
    log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n
    
    
  3. Case implementation

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class JobProcessor implements PageProcessor {

        // parse the page and extract the fields we care about
        public void process(Page page) {
            page.putField("logo_subtit", page.getHtml().css("div.logo>h2").all());
        }

        // crawler configuration for this site
        private Site site = Site.me();

        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider.create(new JobProcessor())
                    // initial URL to visit
                    .addUrl("https://www.jd.com/")
                    .run();
        }
    }
    

    Running main() prints the extracted logo_subtit field to the console (WebMagic's built-in ConsolePipeline is used by default).

2. WebMagic features

2.1. Implement PageProcessor

  1. Extracting elements with Selectable: WebMagic mainly uses three extraction techniques: XPath, CSS selectors, and regular expressions. In addition, JsonPath can be used to parse JSON-formatted content.

  2. Element extraction API
    The chainable element-extraction API around Selectable is a core feature of WebMagic. Using the Selectable interface, you can complete chained extraction of page elements directly, without having to worry about the details of extraction.
    page.getHtml() returns an Html object, which implements the Selectable interface. The methods in this interface fall into two categories: the extraction API and the result-fetching API.

The extraction methods of Selectable, with examples:
  xpath(String xpath): select with XPath. Example: html.xpath("//div[@class='title']")
  $(String selector): select with a CSS selector. Example: html.$("div.title")
  $(String selector, String attr): select an attribute value with a CSS selector. Example: html.$("div.title", "text")
  css(String selector): same as $(), select with a CSS selector. Example: html.css("div.title")
  links(): select all links. Example: html.links()
  regex(String regex): extract with a regular expression. Example: html.regex("(.*?)")

This part of the extraction API returns the Selectable interface itself, which means calls can be chained. For example, when visiting the https://www.jd.com/ page, extraction steps can be chained as in the sketch below.
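A minimal sketch of such a chained call inside process(); the container selector div.news and the regular expression are only illustrative and depend on the page's current structure:

    // Chained extraction sketch: css() narrows to a region, links() picks the
    // href values inside it, regex() filters them, and all() materializes the list.
    java.util.List<String> newsLinks = page.getHtml()
            .css("div.news")                          // hypothetical container selector
            .links()
            .regex("https://www\\.jd\\.com/news.*")   // keep only news URLs
            .all();
    page.putField("newsLinks", newsLinks);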

  3. Result-fetching API
    When the chained call ends, we usually want to obtain a result of type String. At that point we need the result-fetching API.
    We know that an extraction rule, whether XPath, CSS selector, or regular expression, can always match multiple elements. WebMagic unifies these, so that one or more elements can be obtained through different APIs.
The result-fetching methods, with examples:
  get(): return one result as a String. Example: String link = html.links().get()
  toString(): same as get(), returns a String result. Example: String link = html.links().toString()
  all(): return all extracted results. Example: List<String> links = html.links().all()

When there are multiple matches, get() and toString() both return only the first one.
selectable.toString() reuses the standard toString() method, which is convenient for output and for integration with other frameworks, since in most cases we only need to select a single element. selectable.all() returns all matched elements, as the sketch below shows.
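A short sketch of the result-fetching calls inside process():

    // Result-fetching sketch: get() and toString() return only the first match,
    // while all() returns every match as a List<String>.
    String firstLink = page.getHtml().links().get();
    java.util.List<String> allLinks = page.getHtml().links().all();
    page.putField("firstLink", firstLink);   // same value as page.getHtml().links().toString()
    page.putField("allLinks", allLinks);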


  4. With the logic to obtain links in place, the crawler is almost complete, but one problem remains: a site has many pages, and we cannot list them all at the start, so how do we discover the follow-up links? Link discovery is an indispensable part of a crawler.
    The following example takes all URL addresses on https://www.jd.com/moreSubject.aspx that match the regular expression https://www.jd.com/news.\w+?.*
    and adds those links to the queue to be crawled (see the sketch below).
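    A hedged sketch of that link-discovery step inside process(), using the regular expression from the text above:

    // Take every link on the page that matches the news URL pattern and
    // add it to the Scheduler's queue of requests to crawl next.
    page.addTargetRequests(page.getHtml()
            .links()
            .regex("(https://www\\.jd\\.com/news\\.\\w+?.*)")
            .all());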

2.2 Use Pipeline to save results

  • The component WebMagic uses to save results is called Pipeline. The "output results to the console" we have relied on so far is also done through a built-in Pipeline, called ConsolePipeline.
    So how do we save the results to a file instead? Simply switch the Pipeline implementation to FilePipeline:
    public static void main(String[] args) {
        Spider.create(new JobProcessor())
                // initial URL to visit
                .addUrl("https://www.jd.com/")
                // save results to files under D:/webmagic/
                .addPipeline(new FilePipeline("D:/webmagic/"))
                // number of worker threads
                .thread(5)
                .run();
    }
    

2.3. Crawler configuration, startup and termination

  1. Spider
    Spider is the entry point for starting a crawler. Before starting the crawler, we need to create a Spider object with a PageProcessor and then call run() to start it.
    Other Spider components (Downloader, Scheduler, Pipeline) can all be set through setter methods.
The main methods of Spider, with examples:
  create(PageProcessor): create a Spider. Example: Spider.create(new GithubRepoProcessor())
  addUrl(String...): add initial URLs. Example: spider.addUrl("http://webmagic.io/docs/")
  thread(n): open n threads. Example: spider.thread(5)
  run(): start and block the current thread. Example: spider.run()
  start()/runAsync(): start asynchronously; the current thread continues. Example: spider.start()
  stop(): stop the crawler. Example: spider.stop()
  addPipeline(Pipeline): add a Pipeline; a Spider can have multiple Pipelines. Example: spider.addPipeline(new ConsolePipeline())
  setScheduler(Scheduler): set the Scheduler; a Spider can have only one Scheduler. Example: spider.setScheduler(new RedisScheduler())
  setDownloader(Downloader): set the Downloader; a Spider can have only one Downloader. Example: spider.setDownloader(new SeleniumDownloader())
  get(String): call synchronously and return the result directly. Example: ResultItems result = spider.get("http://webmagic.io/docs/")
  getAll(String...): call synchronously and return a list of results. Example: List<ResultItems> results = spider.getAll("http://webmagic.io/docs/", "http://webmagic.io/xxx")
  2. Crawler configuration: Site
    Site.me() configures the crawler, including the encoding, crawl interval, timeout, number of retries, and so on. Here we set it up briefly: 3 retries and a crawl interval of one second.

    private Site site = Site.me()
            .setCharset("UTF-8")       // character encoding
            .setSleepTime(1000)        // crawl interval, in milliseconds (1 second)
            .setTimeOut(1000 * 10)     // timeout, in milliseconds
            .setRetrySleepTime(3000)   // wait between retries, in milliseconds
            .setRetryTimes(3);         // number of retries
    

    Configuration of the site itself, such as encoding, HTTP headers, timeout, retry strategy, proxy, and so on, can all be set on the Site object.

The main configuration methods of Site, with examples:
  setCharset(String): set the encoding. Example: site.setCharset("utf-8")
  setUserAgent(String): set the User-Agent. Example: site.setUserAgent("Spider")
  setTimeOut(int): set the timeout, in milliseconds. Example: site.setTimeOut(3000)
  setRetryTimes(int): set the number of retries. Example: site.setRetryTimes(3)
  setCycleRetryTimes(int): set the number of cycle retries. Example: site.setCycleRetryTimes(3)
  addCookie(String, String): add a cookie. Example: site.addCookie("dotcomt_user", "code4craft")
  setDomain(String): set the domain name; it must be set before addCookie takes effect. Example: site.setDomain("github.com")
  addHeader(String, String): add an HTTP header. Example: site.addHeader("Referer", "https://github.com")
  setHttpProxy(HttpHost): set an HTTP proxy. Example: site.setHttpProxy(new HttpHost("127.0.0.1", 8080))
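Putting the two tables in this section together, a hedged sketch of a fully configured crawler might look like the following. The User-Agent string, cookie, header, CSS selector, and thread count are illustrative values, not taken from the original example.

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.scheduler.QueueScheduler;

    // Sketch combining the Site options and Spider components described above.
    public class ConfiguredProcessor implements PageProcessor {

        private final Site site = Site.me()
                .setCharset("utf-8")
                .setUserAgent("Mozilla/5.0 (compatible; demo-crawler)") // illustrative User-Agent
                .setDomain("www.jd.com")                                // required before addCookie(name, value)
                .addCookie("example_cookie", "example_value")           // illustrative cookie
                .addHeader("Referer", "https://www.jd.com/")            // illustrative header
                .setTimeOut(10 * 1000)                                  // 10 s, in milliseconds
                .setRetryTimes(3)
                .setSleepTime(1000);                                    // 1 s between requests

        public void process(Page page) {
            page.putField("title", page.getHtml().css("title", "text").get());
        }

        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider spider = Spider.create(new ConfiguredProcessor())
                    .addUrl("https://www.jd.com/")
                    .setScheduler(new QueueScheduler())   // the default in-memory queue, set explicitly here
                    .addPipeline(new ConsolePipeline())   // print results to the console
                    .thread(5);

            spider.start();   // asynchronous start; the current thread keeps running
            // call spider.stop() later when the crawler should be shut down
        }
    }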

3. Crawler classification

According to system structure and implementation technology, web crawlers can be roughly divided into the following types: general-purpose web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers. A real-world crawler system is usually implemented as a combination of several of these techniques.

  1. General-purpose web crawler
    A general-purpose web crawler, also known as a whole-web crawler (Scalable Web Crawler), expands its crawling targets from a set of seed URLs to the entire Web, and mainly collects data for portal-site search engines and large Web service providers.
    The crawling scope and volume of this type of crawler are huge, so it demands high crawling speed and large storage space, while the requirements on the order in which pages are crawled are relatively low. Because there are too many pages to refresh, it usually works in parallel, but it still takes a long time to refresh a page once.
    Simply put, it crawls all data on the Internet.

  2. Focused web crawler
    A focused web crawler (Focused Crawler), also known as a topical crawler (Topical Crawler), selectively crawls pages related to predefined topics.
    Compared with a general-purpose crawler, a focused crawler only needs to crawl topic-related pages, which greatly saves hardware and network resources; because there are fewer saved pages, they can also be updated faster, and such a crawler can well satisfy the needs of specific groups of people for information in specific fields.
    Simply put, it crawls only one particular kind of data from the Internet.

  3. Incremental web crawler
    An incremental web crawler (Incremental Web Crawler) only crawls newly generated pages or pages that have changed since they were downloaded, which guarantees to some extent that the crawled pages are as fresh as possible.
    Compared with crawlers that periodically re-crawl and refresh all pages, an incremental crawler only crawls newly generated or updated pages when needed and does not re-download pages that have not changed. This effectively reduces the amount of data downloaded and keeps the crawled pages up to date, reducing time and space consumption, but it increases the complexity and difficulty of the crawling algorithm.
    Simply put, it captures only newly updated data on the Internet.

  4. Deep web crawler
    Web pages can be divided into surface web pages (Surface Web) and deep web pages (Deep Web, also known as the Invisible Web or Hidden Web) according to how they exist.
    Surface web pages are pages that can be indexed by traditional search engines; they consist mainly of static pages reachable by hyperlinks.
    Deep web pages are pages whose content mostly cannot be reached through static links; they are hidden behind search forms and can only be obtained after the user submits some keywords.

4. Use WebMagic to crawl recruitment information on 51job

Origin: blog.csdn.net/weixin_44505194/article/details/106706959