Recently I was filtering URLs in WebMagic with expressions in the following format:
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-])").all());
With these, the correct URLs were not always filtered out.
I tried various fixes.
Testing the expressions directly with Pattern, Matcher and find() did not reveal the problem.
February 22, 2020, 17:49:55: I have spent most of today on this, stepping through the matching in the debugger from time to time.
Debugging showed that the regular expression actually being matched had been altered from what I wrote.
The expression https://github\\.com/[\\w\\-] works in WebMagic,
but it does not take effect in the Spring Cloud environment.
I then tried an adjusted format: https:\\/\\/github\\.com\\/[\\w\\-].
When I finally debugged it, the value actually in use was: https:\\\\/\\\\/github\\\\.com\\\\/[\\\\w\\\\-].
That was the problem: every backslash had been escaped a second time.
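The effect of the extra escaping can be reproduced outside WebMagic with plain java.util.regex. The following is only a minimal sketch: the class name EscapeDemo, the helper found() and the sample URL are made up for illustration, and the second string merely simulates a value that reaches the matcher with doubled backslashes.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String url = "https://github.com/some-user/some-repo"; // illustrative URL

        // Java source literal: the compiler turns each \\ into one backslash,
        // so the regex engine receives https://github\.com/[\w\-]+
        String fromLiteral = "https://github\\.com/[\\w\\-]+";

        // Simulated runtime value that already contains doubled backslashes:
        // the regex engine receives https://github\\.com/[\\w\\-]+
        // and now requires a literal backslash after "github".
        String doubled = "https://github\\\\.com/[\\\\w\\\\-]+";

        System.out.println(found(fromLiteral, url));
        System.out.println(found(doubled, url));
    }
}
```

The first call finds a match and the second does not, because a doubled backslash inside a regex means "a literal backslash character" rather than an escape for the following character.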
So when a regular expression is added through the front end and stored, it should still be stored in the normal regular expression format, without Java-style doubled backslashes.
Java takes care of the escaping by default when it processes the value.
So the URL regular expression pattern should be saved as: https:\/\/github\.com\/[\w\-]+
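As a rough illustration of using a pattern stored in that form, the sketch below compiles the stored text as-is and filters a list of links with it. It uses only java.util.regex, not the WebMagic API; the class StoredPatternDemo, the method filter() and the sample links are assumptions for the example. The stored value is written as a Java literal here only because it has to appear in source code; at runtime it holds single backslashes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StoredPatternDemo {
    // Hypothetical helper: keeps the matched part of every link that matches the stored regex.
    public static List<String> filter(String storedPattern, List<String> links) {
        Pattern pattern = Pattern.compile(storedPattern); // no extra escaping applied
        List<String> result = new ArrayList<>();
        for (String link : links) {
            Matcher matcher = pattern.matcher(link);
            if (matcher.find()) {
                result.add(matcher.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Runtime value is https:\/\/github\.com\/[\w\-]+ (as it would be stored).
        String storedPattern = "https:\\/\\/github\\.com\\/[\\w\\-]+";
        List<String> links = Arrays.asList(
                "https://github.com/some-user/some-repo", // illustrative links
                "https://example.com/page");
        System.out.println(filter(storedPattern, links));
    }
}
```

Only the GitHub link survives the filter, and the matched text stops at the first path segment because the stored pattern ends after one [\w\-]+ group.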