WebMagic crawler: URL regex filtering does not take effect for some URLs

Recently, when filtering URLs in WebMagic, I used patterns in the following format:

page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-])").all());

The filter did not always select the correct URLs.

I worked through various possible causes.

I tested the expressions directly with Pattern, Matcher, and find(), and found no problem there.
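A local check of that kind might look like the minimal sketch below. The class name `LocalRegexCheck`, the helper `firstMatch`, and the sample URLs are illustrative assumptions, not code from the original project. It only exercises `java.util.regex` on its own, so it says nothing about how the pattern string reaches WebMagic at runtime.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalRegexCheck {
    // Hypothetical helper: returns the first substring of url matched by regex,
    // or null when nothing matches.
    static String firstMatch(String regex, String url) {
        Matcher m = Pattern.compile(regex).matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Same pattern as in the addTargetRequests call, written as a Java string literal.
        String regex = "(https://github\\.com/[\\w\\-]+/[\\w\\-]+)";

        // Sample URLs chosen for illustration only.
        System.out.println(firstMatch(regex, "https://github.com/code4craft/webmagic"));
        // prints https://github.com/code4craft/webmagic
        System.out.println(firstMatch(regex, "https://example.com/foo/bar"));
        // prints null
    }
}
```

A test like this passes because the pattern is compiled straight from a source literal. It cannot reveal a pattern that is altered somewhere between storage and use.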

February 22, 2020, 17:49:55: I spent most of today on this. Along the way, by chance, while stepping through the matching code, I found that the regex actually being matched had changed.

The pattern https://github\\.com/[\\w\\-] takes effect in plain WebMagic,

but it does not take effect in the Spring Cloud environment.

I then adjusted the format to: https:\\/\\/github\\.com\\/[\\w\\-].

Yet when debugging, the value turned out to be: https:\\\\/\\\\/github\\\\.com\\\\/[\\\\w\\\\-].

That was the problem: the backslashes in the pattern were being escaped a second time.
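The effect of the extra escaping can be reproduced with plain `java.util.regex`, as in the sketch below. The class name `EscapingDemo`, the helper `matches`, and the sample URL are assumptions made for illustration. The two strings stand in for what the pattern might contain at runtime with single and with doubled backslashes (each backslash is written twice again in the Java source literal).

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Hypothetical helper: true if the stored pattern finds a match anywhere in url.
    static boolean matches(String storedRegex, String url) {
        return Pattern.compile(storedRegex).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "https://github.com/code4craft"; // sample URL, for illustration

        // Runtime value: https:\/\/github\.com\/[\w\-]+
        String single = "https:\\/\\/github\\.com\\/[\\w\\-]+";

        // Runtime value: https:\\/\\/github\\.com\\/[\\w\\-]+
        // Here each \\ is a regex for one literal backslash, so the pattern
        // expects backslash characters inside the URL itself.
        String doubled = "https:\\\\/\\\\/github\\\\.com\\\\/[\\\\w\\\\-]+";

        System.out.println(matches(single, url));  // prints true
        System.out.println(matches(doubled, url)); // prints false
    }
}
```

Both strings compile as valid regular expressions, so no exception is thrown. The double-escaped one simply never matches an ordinary URL, which is consistent with the filter silently selecting nothing.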

So when a regex is added and stored through the front end, it should still be stored in normal regex format, with single backslashes.

Java handles the escaping by default when it processes the stored pattern.

So the URL regular expression pattern should be saved as: https:\/\/github\.com\/[\w\-]+

Origin blog.csdn.net/liuhagen/article/details/104447480