1. Background
To keep things simple, this example crawls the popular-comics list on http://www.fzdm.com/.
2. Comparison
The SeimiCrawler framework crawls quickly but is unstable (once the thread count grows, it crashes easily); the Selenium browser-automation tool crawls somewhat more slowly, but it is stable.
3. Approach 1: the SeimiCrawler framework
(1) Add the dependency
<!-- SeimiCrawler open-source crawler framework -->
<dependency>
    <groupId>cn.wanghaomiao</groupId>
    <artifactId>SeimiCrawler</artifactId>
    <version>2.0</version>
</dependency>
(2) Crawling logic
Important: any follow-up handler passed to push(), such as chapterBean(), must be declared public; otherwise nothing gets crawled at all!
package org.pc.demo;

import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import org.jsoup.nodes.Element;
import org.seimicrawler.xpath.JXDocument;
import org.springframework.util.CollectionUtils;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author 咸鱼
 * @date 2018/12/26 21:12
 */
@Crawler(name = "my-crawler", httpTimeOut = 30000)
public class MyCrawler extends BaseSeimiCrawler {
    @Override
    public String[] startUrls() {
        return new String[]{"http://www.fzdm.com/"};
    }

    @Override
    public void start(Response response) {
        JXDocument document = response.document();
        List<Object> links = document.sel("//div[@id='box1']/li/a");
        if (isEmpty(links)) {
            return;
        }
        for (Object link : links) {
            Element element = (Element) link;
            String comicName = element.childNode(0).toString();
            String comicUrl = "http:" + element.attr("href");
            Map<String, String> params = new HashMap<>();
            params.put("comicName", comicName);
            // continue with this comic's chapter pages
            // push(Request.build(comicUrl, MyCrawler::chapterBean).setParams(params));
        }
    }

    // Follow-up logic below; this example only exercises the code above
    public void chapterBean(Response response) {
        String requestUrl = response.getUrl();
        String comicName = response.getRequest().getParams().get("comicName");
        logger.info("Comic name: " + comicName);
        JXDocument document = response.document();
        List<Object> links = document.sel("//div[@id='content']/li/a");
        if (isEmpty(links)) {
            return;
        }
        for (Object link : links) {
            Element element = (Element) link;
            String chapterName = element.childNode(0).toString();
            String chapterUrl = requestUrl + element.attr("href");
            logger.info("Chapter URL: " + chapterUrl);
            Pattern chapterNumberPattern = Pattern.compile("^" + comicName + "\\s*(\\d+)\\S*");
            Matcher matcher = chapterNumberPattern.matcher(chapterName);
            if (matcher.find()) {
                // group(1) is the content of the first capturing group ()
                logger.info(comicName + ", chapter " + matcher.group(1));
            }
            // push(Request.build(chapterUrl, MyCrawler::contentBean)
            //         // use SeimiAgent to pre-render JavaScript
            //         .useSeimiAgent()
            //         // render time in milliseconds
            //         .setSeimiAgentRenderTime(6000)
            // );
        }
    }

    public void contentBean(Response response) {}

    private boolean isEmpty(List<Object> links) {
        if (CollectionUtils.isEmpty(links)) {
            logger.info("Nothing was selected -- is the XPath wrong?");
            return true;
        }
        return false;
    }
}
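The chapter-number matching in chapterBean() can be tried in isolation. The sketch below extracts the number with the same pattern shape; the class and method names are my own, and Pattern.quote is an extra safeguard (not in the article's version) against regex metacharacters in the comic name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChapterRegexDemo {
    /**
     * Extracts the chapter number from a title like "海贼王927话":
     * the title must start with the comic name, optionally followed by
     * whitespace, then digits (the chapter number), then non-whitespace.
     * Returns null when the title does not match.
     */
    static String extractChapterNumber(String comicName, String chapterName) {
        Pattern p = Pattern.compile("^" + Pattern.quote(comicName) + "\\s*(\\d+)\\S*");
        Matcher m = p.matcher(chapterName);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractChapterNumber("海贼王", "海贼王927话")); // 927
        System.out.println(extractChapterNumber("海贼王", "火影忍者700话")); // null (different comic)
    }
}
```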
(3) Configure the crawler
application.properties:
# enable the crawler
seimi.crawler.enabled=true
# which crawlers to start
seimi.crawler.names=my-crawler
# SeimiAgent host IP
seimi.crawler.seimi-agent-host=192.168.10.133
# SeimiAgent port
seimi.crawler.seimi-agent-port=8000
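SeimiCrawler 2.x runs inside a Spring Boot application, so with the dependency and properties above in place, a standard Spring Boot entry class starts the crawler. A minimal sketch (the class name is my own, and it assumes the usual spring-boot-starter setup on the classpath):

```java
package org.pc.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class CrawlerApplication {
    public static void main(String[] args) {
        // boots the Spring context; SeimiCrawler picks up "my-crawler"
        // from seimi.crawler.names in application.properties
        SpringApplication.run(CrawlerApplication.class, args);
    }
}
```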
4. Approach 2: the Selenium automation tool
(1) Prepare the Chrome browser and the matching ChromeDriver
(2) Add the dependencies
<!-- selenium-java client -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
<!-- selenium Chrome driver bindings -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.141.59</version>
</dependency>
<!-- selenium-server -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-server</artifactId>
    <version>3.141.59</version>
</dependency>
(3) Crawling logic
package org.pc.demo;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.springframework.util.CollectionUtils;

import java.util.List;

/**
 * Crawl the popular comics
 * @author 咸鱼
 * @date 2018/12/27 19:02
 */
public class MySelenium {
    public static void main(String[] args) {
        crawl();
    }

    private static void crawl() {
        // point Selenium at the ChromeDriver binary
        System.setProperty("webdriver.chrome.driver", "E:\\demo\\crawler\\chromedriver.exe");
        // start a Chrome instance
        WebDriver webDriver = new ChromeDriver();
        // open the target page
        webDriver.get("http://www.fzdm.com/");
        // select the target DOM elements
        List<WebElement> elements = webDriver.findElements(By.xpath("//div[@id='box1']/li/a"));
        if (CollectionUtils.isEmpty(elements)) {
            return;
        }
        for (WebElement element : elements) {
            System.out.println(element.getText());
        }
    }
}
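If you don't want a visible browser window popping up for every crawl, Chrome can be run headless via ChromeOptions. A sketch under the same assumptions as above (the driver path is an example from this article, and --headless requires a reasonably recent Chrome); note it also quits the driver in a finally block so the browser process is always released:

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessCrawl {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "E:\\demo\\crawler\\chromedriver.exe");
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");    // no visible browser window
        options.addArguments("--disable-gpu"); // often recommended on Windows
        WebDriver webDriver = new ChromeDriver(options);
        try {
            webDriver.get("http://www.fzdm.com/");
            System.out.println(webDriver.getTitle());
        } finally {
            webDriver.quit(); // always release the browser process
        }
    }
}
```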