Getting started with WebMagic: converting Runoob (rookie tutorial) pages from HTML to Markdown

      Recently, while learning how to write crawlers in Java, I discovered the lightweight WebMagic framework. After reading a few tutorials online, I tried writing a crawler for the Runoob (rookie tutorial) site. Its main function is to convert the tutorial content from HTML into Markdown text for convenient offline reading. I built this tool because my workplace generally requires working offline, and the Runoob tutorials are a good introduction to many topics.
      To make offline learning easier, I wrote this application; I am sharing it here mainly for my own learning and for anyone it may help.
This is my first blog post, so please forgive any imperfections.

I will not introduce **WebMagic** itself here; homepage portal -> WebMagic
Maven dependencies

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.1</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.1</version>
</dependency>

Chinese documentation -> http://webmagic.io/docs/zh/  


Because lambda expressions are used, JDK 1.8+ is required; the IDE used here is IntelliJ IDEA.

----
Introductions are hard to write, so let's get into the project.

----
Project creation

  • Create a project and import the jar packages (omitted)

  • The main classes are as follows

Controller - the controller, main-method entry point

MarkdownSavePipeline - persistence component, saves results as files

RunoobPageProcessor - page-parsing component

Service - service component, roughly a Utils class that wraps common helper methods

Runoob tutorial page

The Scala tutorial is used here as the template.

Start coding

import us.codecraft.webmagic.Spider;

/**
 * Crawler controller, main-method entry point
 * Created by bekey on 2017/6/6.
 */
public class Controller {
    public static void main(String[] args) {
//        String url = "http://www.runoob.com/regexp/regexp-tutorial.html";
        String url = "http://www.runoob.com/scala/scala-tutorial.html";
        //crawler engine        add page processor           add url (request)      add persistence pipeline        set thread count   run
        Spider.create(new RunoobPageProcessor()).addUrl(url).addPipeline(new MarkdownSavePipeline()).thread(1).run();
    }
}

There are four main components in WebMagic  

  • Downloader is responsible for downloading pages
  • PageProcessor is responsible for parsing pages
  • Scheduler manages and schedules the URLs (requests)
  • Pipeline persists results to files, databases, etc.

Generally, the Downloader and Scheduler do not need to be customized.

The core engine that drives the whole process is Spider, which lets you freely configure the crawler: creation, start, stop, multi-threading, and so on.
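
As a rough illustration only (this configuration is not from the original project, and the scheduler path is just an example), Spider can also be given a custom Scheduler, e.g. the FileCacheQueueScheduler from webmagic-extension, so that the URL queue survives a restart:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler;

public class ControllerWithScheduler {
    public static void main(String[] args) {
        String url = "http://www.runoob.com/scala/scala-tutorial.html";
        Spider spider = Spider.create(new RunoobPageProcessor())
                .addUrl(url)
                .addPipeline(new MarkdownSavePipeline())
                //persist the URL queue to disk so an interrupted crawl can resume ("./urls" is only an example path)
                .setScheduler(new FileCacheQueueScheduler("./urls"))
                .thread(2);
        spider.run();     //blocks until the crawl finishes
        //spider.stop();  //or stop it asynchronously from another thread
    }
}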

 

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;


/**
 * Runoob tutorial to Markdown converter
 * Created by bekey on 2017/6/6.
 */
public class RunoobPageProcessor implements PageProcessor{
    private static String name = null;
    private static String regex = null;

    // Site configuration for the crawl: encoding, retry count, crawl interval, timeout, request headers, User-Agent, etc.
    private Site site= Site.me().setRetryTimes(3).setSleepTime(1000).setTimeOut(3000).addHeader("Accept-Encoding", "/")
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36");

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    //the page-processing function
    public void process(Page page) {
        Html html = page.getHtml();
//        String name = page.getUrl().toString().substring();
        if(name == null ||regex == null){
            String url = page.getRequest().getUrl();
            name = url.substring(url.lastIndexOf('/',url.lastIndexOf('/')-1)+1,url.lastIndexOf('/'));
            regex = "http://www.runoob.com/"+name+"/.*";
        }
        //add follow-up requests to visit
        page.addTargetRequests(html.links().regex(regex).all());
        //get the main article content
        Document doc = html.getDocument();
        Element article = doc.getElementById("content");
        //get the markdown text
        String document = Service.markdown(article);
        //handle the save operation
        String fileName = article.getElementsByTag("h1").get(0).text().replace("/","").replace("\\","") + ".md";
        page.putField("fileName",fileName);
        page.putField("content",document);
        page.putField("dir",name);
    }
}

In general, the most important part of a crawler is the parsing, so a parser class must be created that implements the PageProcessor interface.

The PageProcessor interface has two methods

  • public Site getSite() - Site holds the crawl configuration for the website; it can generally be defined as a static property
  • public void process(Page page) - the page-processing function, where Page represents a page downloaded by the Downloader; it may be HTML, JSON, or some other text format

Site has many attribute settings you can try yourself. Take special care that the crawl interval is not too short, otherwise it places a heavy burden on the target website.

addHeader -- adds a request header, the most basic way of dealing with anti-crawler measures.
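
A few more Site options, shown here only as a hedged sketch (the values are made up, not the configuration used in this project):

//a minimal sketch of additional Site options (example values, not the project's actual configuration)
private Site site = Site.me()
        .setCharset("utf-8")                          //page encoding
        .setRetryTimes(3)                             //retries on download failure
        .setSleepTime(1000)                           //interval between requests, in ms
        .setTimeOut(3000)                             //download timeout, in ms
        .addCookie("sessionId", "xxx")                //attach a cookie if the site needs one
        .addHeader("Accept-Language", "zh-CN,zh;q=0.9")
        .setUserAgent("Mozilla/5.0 ...");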

        Html html = page.getHtml();
//        String name = page.getUrl().toString().substring();
        if(name == null ||regex == null){
            String url = page.getRequest().getUrl();
            name = url.substring(url.lastIndexOf('/',url.lastIndexOf('/')-1)+1,url.lastIndexOf('/'));
            regex = "http://www.runoob.com/"+name+"/.*";
        }
        //add follow-up requests to visit
        page.addTargetRequests(html.links().regex(regex).all());

This section mainly deals with link handling. In the Controller, the Spider is given one entry request, but you do not need to create a new Spider for every request you want to send (otherwise multi-threading would be pointless).

Requests can easily be added through page.addTargetRequests and its overloads. They are put into the Scheduler, deduplicated, and then visited at the configured sleepTime interval.

The links() method is an abstract method of the Selectable interface that extracts the links on a page. Since the whole tutorial is to be crawled, the matching links are selected with a regular expression and put into the Scheduler.

The Selectable chained-extraction API is a core feature of WebMagic. With the Selectable interface you can chain extraction of page elements without caring about the details; it mainly provides xpath (XPath selector), $ (CSS selector), regex (regular-expression extraction), replace (replacement), and links (link extraction). I was not very familiar with them, so the subsequent page parsing mainly uses Jsoup.
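
Just to illustrate what the chained API looks like (a hedged sketch; these selectors are assumptions and are not used in the project):

//a minimal sketch of Selectable chaining (selectors are illustrative, not verified against the page)
Html html = page.getHtml();
//XPath: text of the first h1 inside the content div
String title = html.xpath("//div[@id='content']/h1/text()").get();
//CSS selector ($ is an alias for css)
String intro = html.css("div#content p", "text").get();
//links + regex: all links that belong to this tutorial, like the regex used above
List<String> tutorialLinks = html.links().regex("http://www\\.runoob\\.com/scala/.*").all();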

Page parsing in a WebMagic PageProcessor is mainly done with Jsoup. Jsoup is an excellent HTML parser; see the documentation at  http://www.open-open.com/jsoup/

        //get the main article content
        Document doc = html.getDocument();
        Element article = doc.getElementById("content");

The conversion from the page to Jsoup is done with getDocument(); the Document class here is org.jsoup.nodes.Document.

Looking at the page structure, it is easy to see that the main tutorial content lives in the div with id content, so we take it out.

        //get the markdown text
        String document = Service.markdown(article);

The Markdown text is obtained through a static method; here is the concrete implementation in the Service class.

    /**
     * Public method: parse the article body into markdown text
     * @param article the #content element
     * @return markdown text
     */
    public static String markdown(Element article){
        StringBuilder markdown = new StringBuilder("");
        article.children().forEach(it ->parseEle(markdown, it, 0));
        return markdown.toString();
    }

    /**
     * Private method: parse a single element and append it to the StringBuilder
     */
    private static void parseEle(StringBuilder markdown,Element ele,int level){
        //convert relative addresses to absolute addresses
        ele.getElementsByTag("a").forEach(it -> it.attr("href",it.absUrl("href")));
        ele.getElementsByTag("img").forEach(it -> it.attr("src",it.absUrl("src")));
        //check the class first, then the nodeName
        String className = ele.className();
        if(className.contains("example_code")){
            String code = ele.html().replace("&nbsp;"," ").replace("<br>","");
            markdown.append("```\n").append(code).append("\n```\n");
            return;
        }
        String nodeName = ele.nodeName();
        //handle each node according to its class and tag, converting it to markdown
        if(nodeName.startsWith("h") && !nodeName.equals("hr")){
            int repeat = Integer.parseInt(nodeName.substring(1)) + level;
            markdown.append(repeat("#", repeat)).append(' ').append(ele.text());
        }else if(nodeName.equals("p")){
            markdown.append(ele.html()).append("  ");
        }else if(nodeName.equals("div")){
            ele.children().forEach(it -> parseEle(markdown, it, level + 1));
        }else if(nodeName.equals("img")) {
            ele.removeAttr("class").removeAttr("alt");
            markdown.append(ele.toString()).append("  ");
        }else if(nodeName.equals("pre")){
            markdown.append("```").append("\n").append(ele.html()).append("\n```");
        }else if(nodeName.equals("ul")) {
            markdown.append("\n");
            ele.children().forEach(it -> parseEle(markdown, it, level + 1));
        }else if(nodeName.equals("li")) {
            markdown.append("* ").append(ele.html());
        }
        markdown.append("\n");
    }

    private static String repeat(String chars,int repeat){
        //markdown supports at most 6 heading levels
        if(repeat > 6) repeat = 6;
        StringBuilder sb = new StringBuilder();
        for(int i = 0;i < repeat;i++){
            sb.append(chars);
        }
        return sb.toString();
    }

I have to say that Java 8 lambda expressions are great; they make Java feel like a scripting language (although many other languages have had this for a long time).

This is the concrete business logic, and there is nothing special to explain; it is mostly grunt work applying rules. I rely mainly on class and nodeName to convert HTML to Markdown. The handling is not perfect, and the implementation can be improved gradually~

Note that the Element objects here all come from the Jsoup framework, and using them feels a lot like JavaScript; if you use JS often, the method names will be familiar, so I won't go into detail. Where an attribute holds a link, the absolute address can easily be obtained with absUrl(String attrName).
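
For example (a small Jsoup-only sketch; the HTML snippet and base URI are made up), absUrl resolves a relative attribute against the document's base URI:

//minimal Jsoup sketch of absUrl (the HTML snippet and base URI are illustrative)
Document doc = Jsoup.parse("<a href='/scala/scala-intro.html'>intro</a>",
        "http://www.runoob.com/scala/scala-tutorial.html");
Element link = doc.select("a").first();
String relative = link.attr("href");    //"/scala/scala-intro.html"
String absolute = link.absUrl("href");  //"http://www.runoob.com/scala/scala-intro.html"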

Back to the process function:

        //handle the save operation
        String fileName = article.getElementsByTag("h1").get(0).text().replace("/","").replace("\\","") + ".md";
        page.putField("fileName",fileName);
        page.putField("content",document);
        page.putField("dir",name);

Once we have the text, we can persist it. In fact we could persist it without a Pipeline component, but for the sake of module separation and better reuse/extension, it is still worth implementing a persistence component (at least if you need more than a one-off crawler).

The page.putField method actually puts the content into a Map inside ResultItems, which holds the results produced by the PageProcessor for the Pipeline to consume. Its API is very similar to Map, but it also wraps some other useful information. Notably it has a skip field: by calling page.setSkip(true), you can mark a page so that it is not passed on for persistence.
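
For instance (a hedged sketch, not part of the original processor), a page could be skipped when it has no content div:

//inside process(Page page): skip persistence when the page has no #content div (illustrative only)
Element article = page.getHtml().getDocument().getElementById("content");
if (article == null) {
    page.setSkip(true);    //this page will not be passed to the Pipeline
    return;
}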

import java.io.IOException;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

/**
 * File-saving function
 * Created by bekey on 2017/6/6.
 */
public class MarkdownSavePipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        try {
            String fileName = resultItems.get("fileName");
            String document = resultItems.get("content");
            String dir = resultItems.get("dir");
            Service.saveFile(document,fileName,dir);
        }catch (IOException e){
            e.printStackTrace();
        }
    }
}

The Pipeline interface likewise requires implementing a single method, public void process(ResultItems resultItems, Task task), which handles the persistence operations.

ResultItems was introduced above. Besides the content you saved via page.putField, it also provides a getRequest() method to obtain the Request of this operation, and a getAll() method that returns everything as a map for easy iteration.

The Task object provides two methods

  • getSite()
  • getUUID()

I haven't used them, but you can probably guess what they do from the method names.
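
As a hedged sketch of how getAll() and the Task methods could be used in a Pipeline (this is not the project's MarkdownSavePipeline):

//illustrative only: dump everything a page produced, tagged with the crawl's UUID
@Override
public void process(ResultItems resultItems, Task task) {
    System.out.println("task " + task.getUUID() + " -> " + resultItems.getRequest().getUrl());
    for (java.util.Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
        System.out.println(entry.getKey() + " = " + entry.getValue());
    }
}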

Service.saveFile is my own simple wrapper for saving files: it creates a folder named after the tutorial at the same level as src, and writes a .md file named after each page's title. Since it is simple IO, I will not post it.
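
Since that code is not posted, the following is only a hypothetical sketch of what such a saveFile method might look like (directory layout, encoding, and error handling are assumptions):

    //hypothetical sketch of Service.saveFile, not the author's actual implementation
    //(needs java.io.* and java.nio.charset.StandardCharsets)
    public static void saveFile(String content, String fileName, String dir) throws IOException {
        File folder = new File(dir);                    //folder named after the tutorial
        if (!folder.exists() && !folder.mkdirs()) {
            throw new IOException("cannot create directory " + folder.getAbsolutePath());
        }
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(new File(folder, fileName)), StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }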

Pay special attention: the WebMagic framework catches exceptions at a low level but does not report them, so during development and debugging, if you want to see exceptions you have to try/catch them yourself, especially RuntimeExceptions.
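
A minimal illustration of that tip (not part of the original code) is simply wrapping the parsing logic yourself:

@Override
public void process(Page page) {
    try {
        //... parsing logic as above ...
    } catch (RuntimeException e) {
        //WebMagic would otherwise swallow this silently; print it so debugging is possible
        e.printStackTrace();
        page.setSkip(true);
    }
}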

 

That's a lot of writing; the complete code can be downloaded from my GitHub:

https://github.com/BekeyChao/HelloWorld/tree/master/src

Because the project is not under git management (the repo mainly holds other content), it is synchronized manually; if it doesn't run directly, just use it as a study reference~
