[Java Crawler - WebMagic] -02- Getting a Page's Source Code

Get page source

Spider: the crawler's entry point

The Spider is usually created and started from the main method.

The crawler can also be given configuration, such as the character encoding, the interval between requests, the timeout and the number of retries, as well as parameters that make it look like a browser, for example the User-Agent, cookies and proxy settings.

public static void main(String[] args) {
    // Create the page processor that parses each page
    PageProcessor pageProcessor = new FirstWebmagic();
    // Create the crawler
    Spider spider = Spider.create(pageProcessor);
    // Add the URL to crawl
    spider.addUrl("https://xiaoshuai.blog.csdn.net/");
    // Use one thread
    spider.thread(1);
    // Start the crawler
    spider.run();
}

Writing a PageProcessor

In WebMagic, a basic crawler only requires you to write one class that implements the PageProcessor interface. That class holds essentially all the code you need to write to crawl a site.

Let's look at what this interface contains:

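/** Responsible for parsing pages, extracting useful information, and discovering new links */
public interface PageProcessor {

    /**
     * Process the page: extract the URLs to fetch, extract the data and store it
     *
     * @param page the page information
     */
    public void process(Page page);

    /**
     * Get the site settings
     *
     * @return site
     * @see Site
     */
    public Site getSite();
}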
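To make the division of labour concrete, here is a toy stand-in written with the standard library only. It is not WebMagic code: ToyPageProcessor, ToySpider and ToyCallbackDemo are hypothetical names invented for this sketch, and nothing is downloaded. It only illustrates the callback idea: the crawler owns the loop over pages, and your class supplies what happens to each page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT WebMagic classes.
interface ToyPageProcessor {
    void process(String pageSource);
}

class ToySpider {
    private final ToyPageProcessor processor;
    private final List<String> pages = new ArrayList<>();

    ToySpider(ToyPageProcessor processor) {
        this.processor = processor;
    }

    // Plays the role of addUrl(): here a "page" is just a string of source text.
    ToySpider addPage(String pageSource) {
        pages.add(pageSource);
        return this;
    }

    // Plays the role of run(): hands every queued page to the processor, in order.
    void run() {
        for (String page : pages) {
            processor.process(page);
        }
    }
}

class ToyCallbackDemo {
    // Records the length of each page's source through the callback.
    static List<Integer> collectLengths(String... sources) {
        List<Integer> lengths = new ArrayList<>();
        ToySpider spider = new ToySpider(src -> lengths.add(src.length()));
        for (String source : sources) {
            spider.addPage(source);
        }
        spider.run();
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println(collectLengths("<html></html>", "<p>hi</p>"));
    }
}
```

In the real framework the same shape holds: you never call process yourself, the Spider calls it once for each page it has fetched.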

Site: configuration

Site mainly provides the configuration for crawling a site, including the character encoding, the interval between requests and the number of retries.

Here we set it up in the getSite() method:

@Override
public Site getSite() {
    Site site = Site.me();     // Create the Site
    site.setTimeOut(1000);     // Set the timeout (in milliseconds)
    site.setRetryTimes(3);     // Set the number of retries
    return site;
}
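For readers new to the idea of a retry count, the following generic sketch shows one way a "retry up to N times" policy can behave. It is not how WebMagic implements setRetryTimes, and the source does not say whether WebMagic counts the first attempt. Here the first attempt is not counted, which is a choice made for this sketch. RetrySketch, callWithRetries and attemptsNeeded are hypothetical names, and the "download" is simulated rather than a real request.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; this is not part of WebMagic.
class RetrySketch {

    // Runs the task once and, if it throws, up to retryTimes more times.
    // Returns the first successful result, or rethrows the last failure.
    static <T> T callWithRetries(Callable<T> task, int retryTimes) throws Exception {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    // Simulates a download that fails a given number of times before succeeding.
    // Returns how many attempts were made, or -1 if every attempt failed.
    static int attemptsNeeded(int failuresBeforeSuccess, int retryTimes) {
        int[] calls = {0};
        Callable<Integer> download = () -> {
            calls[0]++;
            if (calls[0] <= failuresBeforeSuccess) {
                throw new IOException("simulated timeout");
            }
            return calls[0];
        };
        try {
            return callWithRetries(download, retryTimes);
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(attemptsNeeded(2, 3)); // succeeds on the third attempt
        System.out.println(attemptsNeeded(4, 3)); // all four attempts fail
    }
}
```

With a retry count of 3, a page that fails twice is still fetched, while one that fails four times in a row is given up on.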

Page handling: process

process is the core method for your custom crawler logic; the extraction logic is written here.

Every page that is crawled is handled here.

This is where we pull out the results we want.

@Override
public void process(Page page) {
    // The fetched page arrives as a Page object
    Html html = page.getHtml();   // Get the Html from the page
    System.out.println(html);     // Print the HTML source to the console
}

A complete code sample

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

public class FirstWebmagic implements PageProcessor {

    @Override
    public void process(Page page) {
        // The fetched page arrives as a Page object
        Html html = page.getHtml();   // Get the Html from the page
        System.out.println(html);     // Print the HTML source to the console
    }

    @Override
    public Site getSite() {
        Site site = Site.me();     // Create the Site
        site.setTimeOut(1000);     // Set the timeout (in milliseconds)
        site.setRetryTimes(3);     // Set the number of retries
        return site;
    }

    public static void main(String[] args) {
        // Create the page processor that parses each page
        PageProcessor pageProcessor = new FirstWebmagic();
        // Create the crawler
        Spider spider = Spider.create(pageProcessor);
        // Add the URL to crawl
        spider.addUrl("https://xiaoshuai.blog.csdn.net/");
        // Use one thread
        spider.thread(1);
        // Start the crawler
        spider.run();
    }
}


Previous: [Java Crawler - WebMagic] -01- Getting to Know the WebMagic Crawler Framework

Next: [Java Crawler - WebMagic] -03- Parsing the HTML Source

Origin blog.csdn.net/qq_18604209/article/details/104208837