I. Problem

Note: as mentioned in the previous chapter, we use Redis for deduplication and incremental crawling. This article builds on that setup.

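When crawling a novel site with WebMagic, network problems or overly frequent requests can cause the site's server to time out or return errors such as 402/400/502. These URLs are still recorded in Redis, which creates a problem: on the next incremental crawl they will not be visited again (PS: deduplication skips anything already in Redis and crawls only what is missing), so some URLs can never be crawled.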
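The problem can be reproduced without WebMagic or Redis. The sketch below is a minimal in-memory model, assuming the URL is recorded at the moment it is queued (before the download result is known). `DedupScheduler` and `DedupProblemDemo` are hypothetical names used only for illustration; the `HashSet` plays the role of the Redis set.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// In-memory stand-in for a Redis-backed dedup scheduler (hypothetical, not WebMagic code).
class DedupScheduler {
    private final Set<String> seen = new HashSet<>();      // plays the role of the Redis set
    private final Queue<String> queue = new ArrayDeque<>();

    // A URL is marked as seen when it is queued, before any download happens.
    boolean push(String url) {
        if (!seen.add(url)) {
            return false; // already recorded, so it is skipped
        }
        queue.add(url);
        return true;
    }

    String poll() {
        return queue.poll();
    }

    boolean isSeen(String url) {
        return seen.contains(url);
    }
}

class DedupProblemDemo {
    public static void main(String[] args) {
        DedupScheduler scheduler = new DedupScheduler();
        String url = "http://example.com/book/1";
        System.out.println(scheduler.push(url)); // first crawl: the URL is accepted
        scheduler.poll();                        // pretend the download failed with 502
        System.out.println(scheduler.push(url)); // next crawl: rejected, never retried
    }
}
```

Nothing in this model ever removes a URL from `seen`, which is exactly why a failed page stays unreachable.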

II. Solution

(I) Initial idea

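My initial idea was to record the URLs that hit the errors above in a global List<String> errorUrls during the crawl, and then, when the crawl finishes, use the RedisTemplate auto-configured by Spring Boot to delete those URLs from Redis. The next incremental crawl will then retry the failed URLs, and this repeats until they succeed.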
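As a rough illustration of this idea, here is a minimal in-memory sketch. It assumes a plain `HashSet` in place of the Redis set, and `FailedUrlCleanup`, `recordVisit` and `forgetFailedUrls` are hypothetical names, not part of WebMagic or Spring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FailedUrlCleanup {
    // Stand-in for the Redis set of already-crawled URLs (hypothetical).
    final Set<String> seen = new HashSet<>();
    // URLs whose download failed during this run.
    final List<String> errorUrls = Collections.synchronizedList(new ArrayList<>());

    // Every visited URL is recorded as seen; failures are additionally remembered.
    void recordVisit(String url, boolean succeeded) {
        seen.add(url);
        if (!succeeded) {
            errorUrls.add(url);
        }
    }

    // Called once when the crawl ends; returns how many URLs were forgotten.
    int forgetFailedUrls() {
        int removed = 0;
        synchronized (errorUrls) { // iteration over a synchronizedList needs manual locking
            for (String url : errorUrls) {
                if (seen.remove(url)) {
                    removed++;
                }
            }
            errorUrls.clear();
        }
        return removed;
    }
}
```

After `forgetFailedUrls()` runs, the failed URLs look unseen again, so the next run picks them up while successful pages stay deduplicated.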

(II) Implementation

1. Find where the errors are reported, record the failing URLs, and count them

Based on the error logs that WebMagic produces, I identified two places:

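Spider#onDownloadSuccess(): this is mainly where errors such as 400/402/502 are reported.
HttpClientDownloader#download(): this is mainly where access-timeout exceptions are reported. It calls onError() to handle the timeout, but the original author left that method as an empty implementation.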

Given these findings, both places need to be modified so that they record the failed URLs and count them.

(1) Modifying Spider#onDownloadSuccess()

    /**
     * Custom: handle URLs that return errors such as 403 or 502
     */
    /**
     * Number of failed URLs
     */
    private final AtomicInteger errorCount = new AtomicInteger(0);
    /**
     * The failed URLs themselves
     */
    private List<String> errorUrls = Collections.synchronizedList(new ArrayList<>());

    public AtomicInteger getErrorCount(){
        return errorCount;
    }
    public List<String> getErrorUrls(){
        return errorUrls;
    }
    
	private void onDownloadSuccess(Request request, Page page) {
        if (site.getAcceptStatCode().contains(page.getStatusCode())){
            pageProcessor.process(page);
            extractAndAddRequests(page, spawnUrl);
            if (!page.getResultItems().isSkip()) {
                for (Pipeline pipeline : pipelines) {
                    pipeline.process(page.getResultItems(), this);
                }
            }
        } else {
            logger.info("page status code error, page {} , code: {}", request.getUrl(), page.getStatusCode());
            // TODO: custom handling for 403 and similar errors
            // increment the error count
            errorCount.incrementAndGet();
            // add the failed URL to the list of failed links
            errorUrls.add(request.getUrl());
        }
        sleep(site.getSleepTime());
        return;
    }
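The counter and the list above are updated from the crawler's worker threads, which is why an AtomicInteger and a synchronized list are used. The following standalone sketch shows the same pattern under concurrent writes. `ErrorRecorder` and `simulate` are hypothetical names chosen for this illustration, and the thread pool is a plain JDK executor rather than WebMagic's own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ErrorRecorder {
    private final AtomicInteger errorCount = new AtomicInteger(0);
    private final List<String> errorUrls = Collections.synchronizedList(new ArrayList<>());

    // Safe to call from any worker thread.
    void record(String url) {
        errorCount.incrementAndGet();
        errorUrls.add(url);
    }

    int count() {
        return errorCount.get();
    }

    // Copy under the list's lock so the caller can iterate without further locking.
    List<String> snapshot() {
        synchronized (errorUrls) {
            return new ArrayList<>(errorUrls);
        }
    }

    // Simulates `threads` workers that each report `perThread` failures.
    static ErrorRecorder simulate(int threads, int perThread) {
        ErrorRecorder recorder = new ErrorRecorder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    recorder.record("http://example.com/" + id + "/" + i);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return recorder;
    }
}
```

With a plain `ArrayList` and an `int` counter, the same simulation could lose updates. With the thread-safe types, the count and the list size always agree once all workers have finished.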

(2) Modifying HttpClientDownloader#download()#onError()

    /**
     * Custom: handle URLs whose access timed out
     */
    private final AtomicInteger timeoutCount = new AtomicInteger(0);
    private List<String> timeoutUrls = Collections.synchronizedList(new ArrayList<>());

    public AtomicInteger getTimeoutCount() {
        return timeoutCount;
    }

    public List<String> getTimeoutUrls() {
        return timeoutUrls;
    }
    /**
     * Override this method to count timeouts and record the timed-out URLs
     */
    @Override
    protected void onError(Request request) {
        timeoutCount.incrementAndGet();
        timeoutUrls.add(request.getUrl());
    }

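2. Find where the crawl ends and delete all recorded failed URLs and help pages (PS: a help page is a URL that helps us discover the final target pages. For example, if the targets are individual novels, the category pages are the help pages)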

The crawler threads are shut down in Spider#run(), so that is where we delete the previously recorded failed URLs with RedisTemplate. The modified method is as follows:

    /**
     * Redis utility class
     */
    private RedisUtils redisUtils = new RedisUtils();
    
	@Override
    public void run() {
        checkRunningStat();
        initComponent();
        logger.info("Spider {} started!",getUUID());
        while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
            final Request request = scheduler.poll(this);
            if (request == null) {
                if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
                    break;
                }
                // wait until new url added
                waitNewUrl();
            } else {
                threadPool.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            processRequest(request);
                            onSuccess(request);
                        } catch (Exception e) {
                            onError(request);
                            logger.error("process request " + request + " error", e);
                        } finally {
                            pageCount.incrementAndGet();
                            signalNewUrl();
                        }
                    }
                });
            }
        }
        stat.set(STAT_STOPPED);
        // release some resources
        if (destroyWhenExit) {
            close();
        }
        logger.info("Spider {} closed! {} pages downloaded.", getUUID(), pageCount.get());

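        /*
         * After the Spider has closed, delete the failed URLs (mainly those that returned 403, 502, etc.) from Redis.
         * Reason: these pages failed but were still stored in Redis, so the next incremental crawl would skip them.
         */
        // include the timed-out URLs as well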
        if (this.downloader instanceof CustomHttpClientDownloader){
            CustomHttpClientDownloader httpClientDownloader = (CustomHttpClientDownloader) this.downloader;
            List<String> timeoutUrls = httpClientDownloader.getTimeoutUrls();
            errorUrls.addAll(timeoutUrls);
            logger.info("{} URLs timed out", httpClientDownloader.getTimeoutCount());
        }
        logger.info("{} URLs could not be accessed", errorCount);
        // key of the Redis set that stores the crawled URLs
        String setKey = "set_" + site.getDomain();
        logger.info("{} Urls deleted in redis", redisUtils.removeValuesFromRedisSet(setKey, errorUrls));

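        /*
         * Delete the help pages as well: they are also recorded in Redis, so on the next incremental
         * crawl they would be treated as already crawled and skipped (PS: help pages are collected in NovelProcessor).
         */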
        if (this.pageProcessor instanceof NovelProcessor){
            NovelProcessor novelProcessor = (NovelProcessor) this.pageProcessor;
            List<String> helpUrls = novelProcessor.getHelpUrls();
            logger.info("{} Help urls deleted in redis", redisUtils.removeValuesFromRedisSet(setKey, helpUrls));
        }
    }

The Redis utility class used above, RedisUtils, is as follows:

@Component
public class RedisUtils {
    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    public static RedisUtils redisUtils;

    @PostConstruct
    public void init(){
        redisUtils = this;
        redisUtils.redisTemplate = this.redisTemplate;
    }

    public Long removeValuesFromRedisSet(String key, List<String> values){
        long removeCount = 0;
        String[] valueArray = list2String(values);
        if (valueArray != null){
            removeCount = redisUtils.redisTemplate.opsForSet().remove(key, valueArray);
        }
        return removeCount;
    }

    private String[] list2String(List<String> list){
        if (CollectionUtils.isNotEmpty(list)){
            String[] array = new String[list.size()];
            int i = 0;
            for (String str : list){
                array[i++] = str;
            }
            return array;
        }
        return null;
    }
}
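The manual copy loop in list2String can be expressed more compactly. The sketch below is one possible alternative rather than the author's code: it returns an empty array instead of null, so the caller checks the length instead of checking for null. `ListToArray` is a hypothetical helper name.

```java
import java.util.List;

class ListToArray {
    // Returns an empty array for null or empty input, so callers never see null.
    static String[] toArray(List<String> values) {
        if (values == null || values.isEmpty()) {
            return new String[0];
        }
        // List.toArray(T[]) allocates an array of the right size when given a zero-length one.
        return values.toArray(new String[0]);
    }
}
```

The caller should still skip the Redis call when the array is empty, since the Redis SREM command needs at least one member.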

3. Using the modified `Spider` and `HttpClientDownloader`

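Since we cannot modify the library source, we need to define our own Spider and HttpClientDownloader. Apart from the changed code, everything else is identical to Spider and HttpClientDownloader, as follows: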

public class CustomSpider implements Runnable, Task {
...
}

public class CustomHttpClientDownloader extends AbstractDownloader {
...
}

private static CustomSpider spider;
public static void startCraw() {
        spider = CustomSpider.create(new NovelProcessor())
                .addUrl(NOVEL_WEBSITE_URL)
                .addPipeline(new NovelPipeline())
                .setDownloader(new CustomHttpClientDownloader())
                .setScheduler(new RedisScheduler("192.168.10.130"))
                .thread(10);
        addSpiderListeners(spider);
        spider.run();
    }

(III) Optimizing incremental crawling

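1. The incremental crawl problem
After the changes above, a new problem showed up while the project was running: if the Redis connection fails or the project crashes unexpectedly, the help URLs are not deleted. Because the help URLs are then still in Redis, the next incremental crawl cannot proceed.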

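2. Solution
While pages are being processed, we store each help URL in a dedicated Redis Set. Then, every time an incremental crawl starts, we remove those help URLs from the Set of all crawled URLs. This way, even if the program crashes, the next incremental crawl still works. The implementation is described in the parts below.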
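Before looking at the Redis code, the idea can be modeled in memory. This is a minimal sketch under the assumption that two sets are kept, one for every crawled URL and one for help URLs only. `HelpUrlStore` and `markCrawled` are hypothetical names, and the two `HashSet`s stand in for the two Redis sets.

```java
import java.util.HashSet;
import java.util.Set;

class HelpUrlStore {
    final Set<String> allUrls = new HashSet<>();   // stands in for the set of all crawled URLs
    final Set<String> helpUrls = new HashSet<>();  // stands in for the help-URL set

    // Help URLs are saved as soon as they are seen, so a later crash cannot lose them.
    void markCrawled(String url, boolean isHelpPage) {
        allUrls.add(url);
        if (isHelpPage) {
            helpUrls.add(url);
        }
    }

    // Run at the start of every crawl: forget the help pages so they are fetched again.
    int clearHelpUrls() {
        int before = allUrls.size();
        allUrls.removeAll(helpUrls);
        return before - allUrls.size();
    }
}
```

The cleanup no longer depends on the previous run having ended cleanly, because the information it needs was persisted during that run.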

(1) NovelPipeline: stores the help URLs in a Redis Set

/**
 * @author 咸鱼
 * @date 2019-01-24 21:51
 */
@Slf4j
@Component
public class NovelPipeline implements Pipeline {
    @Autowired
    private BookService bookService;
    @Autowired
    private RedisUtils redisUtils;

    public static NovelPipeline novelPipeline;

    @PostConstruct
    public void init(){
        novelPipeline = this;
        novelPipeline.bookService = this.bookService;
        novelPipeline.redisUtils = this.redisUtils;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String helpUrl = resultItems.get("helpUrl");
        if (helpUrl != null){
            // TODO: store the help URL in Redis
            if (novelPipeline.redisUtils.saveValueInRedis(NovelProcessor.PREFIX_NOVEL_HELP_URL_KEY_IN_REDIS +
                    task.getSite().getDomain(), helpUrl) == 1){
                log.info("Help URL {} stored in Redis", helpUrl);
            }
        } else {
            String bookName = resultItems.get("bookName");
            String author = resultItems.get("author");
            String bookUrl = resultItems.get("bookUrl");
            String categoryName = resultItems.get("category");
            String coverImgUrl = resultItems.get("coverImgUrl");
            String summary = resultItems.get("summary");
            // Pages from category listings also arrive here with these fields empty, so validate before saving
            if (StringUtil.isParamsValid(bookName, author, bookUrl)){
                if (!novelPipeline.bookService.addBook(bookName, author, bookUrl, categoryName, coverImgUrl, summary)){
                    log.error("Failed to save novel {}, please retry!", bookName);
                }
            }
        }
    }
}

(2) Modifying CustomSpider#setScheduler(): delete all help URLs before the crawler starts

public CustomSpider setScheduler(Scheduler scheduler) {
        // TODO: delete the help URLs from Redis
        clearHelpUrls();

        checkIfRunning();
        Scheduler oldScheduler = this.scheduler;
        this.scheduler = scheduler;
        if (oldScheduler != null) {
            Request request;
            while ((request = oldScheduler.poll(this)) != null) {
                this.scheduler.push(request, this);
            }
        }
        return this;
    }

/**
     * Delete the help URLs from Redis to enable incremental crawling
     */
    private void clearHelpUrls() {
        String helpUrlKey = NovelProcessor.PREFIX_NOVEL_HELP_URL_KEY_IN_REDIS + getSite().getDomain();
        String allUrlKey = "set_" + getSite().getDomain();
        logger.info("Deleted {} help URLs", redisUtils.deleteSetValues(allUrlKey, helpUrlKey));
    }

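Addendum: why delete the help URLs in CustomSpider#setScheduler() rather than in CustomSpider#run()?
Reason: when the Spider is created, the entry URL is already compared against the URLs recorded in Redis for deduplication. If we deleted the help pages in CustomSpider#run(), the program would still consider the entry URL already crawled and would not crawl it again.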
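The ordering argument can be checked with a small model. This sketch assumes, as the setScheduler() code above shows, that pending requests are handed over to the new scheduler inside setScheduler(), and that the new scheduler drops any URL already in its set. `OrderingDemo` and `entryUrlIsQueued` are hypothetical names, and plain collections stand in for Redis.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class OrderingDemo {
    // Returns true if the entry URL survives the hand-over to the dedup scheduler.
    static boolean entryUrlIsQueued(boolean clearBeforeHandOver) {
        String entryUrl = "http://example.com/category";
        Set<String> seen = new HashSet<>();      // stand-in for the Redis set of crawled URLs
        Set<String> helpUrls = new HashSet<>();  // stand-in for the help-URL set
        seen.add(entryUrl);                      // left over from the previous run
        helpUrls.add(entryUrl);

        Queue<String> pending = new ArrayDeque<>();
        pending.add(entryUrl);                   // queued by addUrl() before setScheduler()

        if (clearBeforeHandOver) {
            seen.removeAll(helpUrls);            // what clearHelpUrls() does
        }

        Queue<String> dedupQueue = new ArrayDeque<>();
        String url;
        while ((url = pending.poll()) != null) { // hand-over loop, as in setScheduler()
            if (seen.add(url)) {                 // dedup check happens at push time
                dedupQueue.add(url);
            }
        }

        if (!clearBeforeHandOver) {
            seen.removeAll(helpUrls);            // clearing later, e.g. in run(), is too late
        }
        return dedupQueue.contains(entryUrl);
    }
}
```

Calling `entryUrlIsQueued(true)` keeps the entry URL in the queue, while `entryUrlIsQueued(false)` loses it, because by then the dedup check has already rejected it.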

RedisUtils

/**
     * Remove from the set stored at key every member of the set stored at otherKey
     */
    public Long deleteSetValues(String key, String otherKey) {
        long num = 0;
        Set<String> members = redisUtils.redisTemplate.opsForSet().members(otherKey);
        if (CollectionUtils.isNotEmpty(members)){
            num = redisUtils.redisTemplate.opsForSet().remove(key, members.toArray());
        }
        return num;
    }

(3) Modifying CustomSpider#run(): remove the help-URL deletion code from it
