This series of articles analyzes the webmagic framework itself; there is no hands-on tutorial content. If you have practical questions, feel free to discuss them, and I can also provide technical support.
You are welcome to join QQ group 313557283 (newly created), where beginners can learn from each other.
Scheduler
Let's start with the interface:
```java
package us.codecraft.webmagic.scheduler;

import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;

/**
 * Scheduler is the part of url management.<br>
 * You can implement interface Scheduler to do:
 * manage urls to fetch
 * remove duplicate urls
 *
 * @author [email protected] <br>
 * @since 0.1.0
 */
public interface Scheduler {

    /**
     * add a url to fetch
     *
     * @param request request
     * @param task task
     */
    public void push(Request request, Task task);

    /**
     * get an url to crawl
     *
     * @param task the task of spider
     * @return the url to crawl
     */
    public Request poll(Task task);
}
```
It is very simple: just two methods, one to put a request in (`push`) and one to take a request out (`poll`).
Next, let's look at QueueScheduler, the default Scheduler implementation:
```java
package us.codecraft.webmagic.scheduler;

import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Basic Scheduler implementation.<br>
 * Store urls to fetch in LinkedBlockingQueue and remove duplicate urls by HashMap.
 *
 * @author [email protected] <br>
 * @since 0.1.0
 */
public class QueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    private BlockingQueue<Request> queue = new LinkedBlockingQueue<Request>();

    @Override
    public void pushWhenNoDuplicate(Request request, Task task) {
        queue.add(request);
    }

    @Override
    public Request poll(Task task) {
        return queue.poll();
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return queue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return getDuplicateRemover().getTotalRequestsCount(task);
    }
}
```
Not much to see here. What matters is the interface it implements and the class it extends.
DuplicateRemovedScheduler
```java
package us.codecraft.webmagic.scheduler;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.scheduler.component.DuplicateRemover;
import us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover;
import us.codecraft.webmagic.utils.HttpConstant;

/**
 * Remove duplicate urls and only push urls which are not duplicate.<br><br>
 *
 * @author [email protected]
 * @since 0.5.0
 */
public abstract class DuplicateRemovedScheduler implements Scheduler {

    protected Logger logger = LoggerFactory.getLogger(getClass());

    private DuplicateRemover duplicatedRemover = new HashSetDuplicateRemover();

    public DuplicateRemover getDuplicateRemover() {
        return duplicatedRemover;
    }

    public DuplicateRemovedScheduler setDuplicateRemover(DuplicateRemover duplicatedRemover) {
        this.duplicatedRemover = duplicatedRemover;
        return this;
    }

    @Override
    public void push(Request request, Task task) {
        logger.trace("get a candidate url {}", request.getUrl());
        if (shouldReserved(request) || noNeedToRemoveDuplicate(request) || !duplicatedRemover.isDuplicate(request, task)) {
            logger.debug("push to queue {}", request.getUrl());
            pushWhenNoDuplicate(request, task);
        }
    }

    protected boolean shouldReserved(Request request) {
        return request.getExtra(Request.CYCLE_TRIED_TIMES) != null;
    }

    protected boolean noNeedToRemoveDuplicate(Request request) {
        return HttpConstant.Method.POST.equalsIgnoreCase(request.getMethod());
    }

    protected void pushWhenNoDuplicate(Request request, Task task) {
    }
}
```
In short: duplicate GET requests are removed, duplicate POST requests are kept, and a retried request (one carrying the CYCLE_TRIED_TIMES extra) is always reserved. The default deduplicator is a HashSet, not a Bloom filter. There is also a MonitorableScheduler interface that provides monitoring, i.e. how many URLs are left to crawl and how many URLs there are in total.
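The push flow above can be sketched without webmagic's types. Below is a simplified stand-in (the class, its inner Request, and the field names are mine, not the real API), assuming a plain FIFO queue and a HashSet deduplicator, just like the defaults:

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

// Simplified stand-in for DuplicateRemovedScheduler + QueueScheduler.
public class DedupQueueSketch {

    // Toy Request; the real one lives in us.codecraft.webmagic.
    public static class Request {
        final String url;
        final String method;   // "GET" or "POST"
        final boolean isRetry; // stands in for the CYCLE_TRIED_TIMES extra

        public Request(String url, String method, boolean isRetry) {
            this.url = url;
            this.method = method;
            this.isRetry = isRetry;
        }
    }

    private final Queue<Request> queue = new LinkedList<>();
    private final Set<String> seen = new HashSet<>(); // HashSetDuplicateRemover equivalent

    // Mirrors push(): retried requests and POSTs bypass deduplication;
    // Set.add returns false when the url was already recorded.
    public void push(Request request) {
        if (request.isRetry
                || "POST".equalsIgnoreCase(request.method)
                || seen.add(request.url)) {
            queue.add(request);
        }
    }

    public Request poll() {
        return queue.poll();
    }

    public int leftCount() {
        return queue.size();
    }
}
```

Note that the short-circuiting `||` reproduces a subtlety of the real code: for retries and POSTs the deduplicator is never consulted, so their URLs are not even recorded.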
There is also PriorityScheduler, which supports priorities.
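The core idea can be sketched with a single `java.util.PriorityQueue` that hands out the highest-priority request first (this is a simplification of my own, not PriorityScheduler's actual internal structure):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Toy priority scheduler: the request with the highest priority is polled first.
public class PrioritySketch {

    public static class Request {
        final String url;
        final long priority;

        public Request(String url, long priority) {
            this.url = url;
            this.priority = priority;
        }
    }

    private final PriorityQueue<Request> queue =
            new PriorityQueue<>(Comparator.comparingLong((Request r) -> r.priority).reversed());

    public void push(Request request) {
        queue.add(request);
    }

    public Request poll() {
        return queue.poll();
    }
}
```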
Extensions
BloomFilterDuplicateRemover uses a Bloom filter; reportedly duplicate POSTs can be filtered with it too, though I haven't tested that.
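As an illustration of why a Bloom filter helps (constant memory for millions of URLs, at the cost of occasional false positives), here is a toy version over a `BitSet`. The hash scheme and sizing are my own choices for the sketch, not webmagic's:

```java
import java.util.BitSet;

// Toy Bloom filter for url deduplication: memory use is fixed up front,
// but a url may occasionally be wrongly reported as a duplicate.
public class BloomDedupSketch {

    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomDedupSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Returns true if the url was (probably) seen before, and records it either way.
    public boolean isDuplicate(String url) {
        boolean allSet = true;
        int h1 = url.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // cheap second hash by 16-bit rotation
        for (int i = 0; i < numHashes; i++) {
            int idx = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(idx)) {
                allSet = false;
                bits.set(idx);
            }
        }
        return allSet;
    }
}
```

Unlike a HashSet, the filter cannot enumerate or remove the URLs it has seen; it can only answer "probably seen" or "definitely not seen".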
FileCacheQueueScheduler persists URLs to files and is mainly used for incremental crawling. For example, suppose a site has 100 pages in total; I crawl 20 of them today and then shut the crawler down. When it starts up the next day, it first loads those 20 URLs and filters them out as duplicates.
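That incremental behavior boils down to: append every handled URL to a file, and reload the file on startup. Here is a minimal stand-alone sketch of the pattern (FileCacheQueueScheduler has its own file format and cursor handling; the class and method names below are mine):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of incremental crawling: urls already handled are written to a file,
// so a restarted crawler skips them instead of re-fetching.
public class FileBackedDedup {

    private final Path file;
    private final Set<String> done = new LinkedHashSet<>();

    public FileBackedDedup(Path file) {
        this.file = file;
        try {
            if (Files.exists(file)) {
                done.addAll(Files.readAllLines(file)); // reload previous progress
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the url was not seen before, recording it as done;
    // returns false for a url already handled (possibly in an earlier run).
    public boolean markIfNew(String url) {
        if (!done.add(url)) {
            return false;
        }
        try {
            Files.write(file, (url + System.lineSeparator()).getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }
}
```

The append-on-every-url write keeps the file consistent even if the crawler is killed mid-run, which is exactly what the "shut it down at quitting time" scenario needs.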
RedisScheduler uses Redis as the backing store.
RedisPriorityScheduler uses Redis and adds priority support.
Summary
We have now analyzed essentially all of the modules: we understand how the framework works, have walked through the source code, and know how to use it correctly. The final article of the series is coming up next.