Get page source
Spider entry point
The spider is usually started from the main method.
The crawler can be given configuration, including encoding, crawl interval, timeout, and retry count, as well as parameters for simulating a browser, such as the User Agent, cookies, and proxy settings.
public static void main(String[] args) {
    // Create the processor that parses the pages
    PageProcessor pageProcessor = new FirstWebmagic();
    // Create the spider
    Spider spider = Spider.create(pageProcessor);
    // Add the URL to start crawling from
    spider.addUrl("https://xiaoshuai.blog.csdn.net/");
    // Use a single thread
    spider.thread(1);
    // Start the spider
    spider.run();
}
Writing a PageProcessor
In WebMagic, a basic crawler only requires writing one class that implements the PageProcessor interface. This class contains essentially all the code you need to write to crawl a site.
Let's look at what this interface contains:
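/** Parses pages, extracts useful information, and discovers new links. */
public interface PageProcessor {
    /**
     * Processes the page: extracts the URLs to follow, extracts the data, and stores it.
     *
     * @param page the page information
     */
    public void process(Page page);
    /**
     * Returns the site settings.
     *
     * @return site
     * @see Site
     */
    public Site getSite();
}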
Site configuration information
Site mainly holds the configuration for the site being crawled, including encoding, crawl interval, and number of retries.
Here we set it up in the getSite() method:
@Override
public Site getSite() {
    Site site = Site.me();   // Create the Site
    site.setTimeOut(1000);   // Set the timeout (in milliseconds)
    site.setRetryTimes(3);   // Set the number of retries
    return site;
}
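The other options mentioned earlier (encoding, crawl interval, User Agent, cookies) can be set on the same Site object. The following is a minimal sketch rather than a recommended configuration: the class name SiteConfigSketch, the User Agent string, the cookie, and all numeric values are placeholders made up for illustration. Proxy settings are left out because how they are configured differs between WebMagic versions.

```java
import us.codecraft.webmagic.Site;

public class SiteConfigSketch {

    // Builds a Site carrying the options discussed in the text.
    // Every value here is a placeholder, not a recommendation.
    public static Site buildSite() {
        return Site.me()
                .setCharset("UTF-8")      // page encoding
                .setSleepTime(1000)       // pause between requests, in milliseconds
                .setTimeOut(10000)        // download timeout, in milliseconds
                .setRetryTimes(3)         // number of retries after a failed download
                .setUserAgent("Mozilla/5.0 (compatible; ExampleCrawler/1.0)") // hypothetical User Agent
                .addCookie("session", "example-value");                       // hypothetical cookie
    }

    public static void main(String[] args) {
        Site site = buildSite();
        System.out.println("charset = " + site.getCharset());
        System.out.println("retries = " + site.getRetryTimes());
    }
}
```

Because each setter returns the Site itself, the calls can be chained; getSite() in a PageProcessor could simply return the result of a method like buildSite().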
Page processing
process is the core method for custom crawler logic; this is where the extraction logic is written.
Every crawled page is handled here.
From the page we can extract whatever results we need.
@Override
public void process(Page page) {
    // Each crawled page arrives as a Page object
    Html html = page.getHtml();   // Get the HTML from the page
    System.out.println(html);     // Print the HTML source to the console
}
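Printing the whole page is only a starting point. As a rough sketch of what extraction logic might look like, the processor below pulls the page title with an XPath expression, stores it as a result field, and queues links on the same blog for crawling. The class name TitleProcessor, the helper extractTitle, the field name "title", and the link pattern are assumptions for illustration and do not come from the original example.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

public class TitleProcessor implements PageProcessor {

    // Hypothetical helper: returns the text of the <title> element,
    // or null when the page has no title.
    public static String extractTitle(Html html) {
        return html.xpath("//title/text()").get();
    }

    @Override
    public void process(Page page) {
        // Store the extracted value as a result field of this page
        page.putField("title", extractTitle(page.getHtml()));
        // Queue links that stay on the same blog (the pattern is an assumption)
        page.addTargetRequests(
                page.getHtml().links()
                        .regex("(https://xiaoshuai\\.blog\\.csdn\\.net/.*)")
                        .all());
    }

    @Override
    public Site getSite() {
        return Site.me().setTimeOut(1000).setRetryTimes(3);
    }

    public static void main(String[] args) {
        Spider.create(new TitleProcessor())
                .addUrl("https://xiaoshuai.blog.csdn.net/")
                .thread(1)
                .run();
    }
}
```

Keeping the extraction in a small static method makes it possible to try the XPath expression against an HTML string without starting a crawl.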
A complete code sample
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

public class FirstWebmagic implements PageProcessor {

    @Override
    public void process(Page page) {
        // Each crawled page arrives as a Page object
        Html html = page.getHtml();   // Get the HTML from the page
        System.out.println(html);     // Print the HTML source to the console
    }

    @Override
    public Site getSite() {
        Site site = Site.me();   // Create the Site
        site.setTimeOut(1000);   // Set the timeout (in milliseconds)
        site.setRetryTimes(3);   // Set the number of retries
        return site;
    }

    public static void main(String[] args) {
        // Create the processor that parses the pages
        PageProcessor pageProcessor = new FirstWebmagic();
        // Create the spider
        Spider spider = Spider.create(pageProcessor);
        // Add the URL to start crawling from
        spider.addUrl("https://xiaoshuai.blog.csdn.net/");
        // Use a single thread
        spider.thread(1);
        // Start the spider
        spider.run();
    }
}