Is this a bug in the Java regexp implementation?

rkosegi :

I'm trying to match the string iso_schematron_skeleton_for_xslt1.xsl against the regexp ([a-zA-Z|_])?(\w+|_|\.|-)+(@\d{4}-\d{2}-\d{2})?\.yang.

The expected result is false; the string should not match.

The problem is that the call to matcher.matches() never returns.

Is this a bug in the Java regexp implementation?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HelloWorld {
    private static final Pattern YANG_MODULE_RE = Pattern
            .compile("([a-zA-Z|_])?(\\w+|_|\\.|-)+(@\\d{4}-\\d{2}-\\d{2})?\\.yang");

    public static void main(String[] args) {
        final Matcher matcher = YANG_MODULE_RE.matcher("iso_schematron_skeleton_for_xslt1.xsl");
        System.out.println(matcher.matches());
    }
}

I'm using:

openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b15)
OpenJDK 64-Bit Server VM (build 25.181-b15, mixed mode)

Wiktor Stribiżew :

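The pattern contains nested quantifiers: \w+ sits inside a group that is itself quantified with +. On a string that cannot match, the regex engine has to try a huge number of ways to split the same run of word characters between the inner and the outer quantifier (catastrophic backtracking), so matches() appears to never return. It makes more sense to turn the alternation group into a character class, i.e. (\\w+|_|\\.|-)+ => [\\w.-]+.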
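As a rough check that the rewrite does not change what is accepted, the sketch below runs both forms on a few short strings. The class name, helper methods and sample inputs are made up for illustration. The inputs are kept short on purpose, so the alternation form cannot blow up here.

```java
import java.util.regex.Pattern;

public class AlternationVsCharClass {
    // Quantified alternation taken from the question's pattern.
    static final Pattern ALTERNATION = Pattern.compile("(\\w+|_|\\.|-)+");
    // Single character class suggested as the replacement.
    static final Pattern CHAR_CLASS = Pattern.compile("[\\w.-]+");

    static boolean matchesAlternation(String s) {
        return ALTERNATION.matcher(s).matches();
    }

    static boolean matchesCharClass(String s) {
        return CHAR_CLASS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs: two that should be accepted, two that should not.
        for (String s : new String[] {"a_b", "x.y-z", "a b", ""}) {
            System.out.println("\"" + s + "\": "
                    + matchesAlternation(s) + " / " + matchesCharClass(s));
        }
    }
}
```

Both forms should give the same verdict on every input. The difference is only in how much work the engine does when a longer string fails to match.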

Note that \w already matches _. Also, a | inside a character class matches a literal | char, and [a|b] matches a, | or b, so it seems you should remove the | from your first character class.
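Both points are easy to verify with Pattern.matches. The snippet below is a small sketch, with the class and helper names invented for the example.

```java
import java.util.regex.Pattern;

public class PipeInCharClass {
    // Thin wrapper around Pattern.matches, which requires the whole input to match.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Inside [...] the pipe is an ordinary character, not an alternation operator.
        System.out.println(fullMatch("[a|b]", "|"));  // true
        System.out.println(fullMatch("[ab]", "|"));   // false
        // \w covers letters, digits and the underscore.
        System.out.println(fullMatch("\\w", "_"));    // true
    }
}
```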

Use

.compile("[a-zA-Z_]?[\\w.-]+(?:@\\d{4}-\\d{2}-\\d{2})?\\.yang")
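For reference, here is the rewritten pattern dropped into a complete program. This is only a sketch: the class name, the helper method and the two .yang file names are illustrative and not taken from the question.

```java
import java.util.regex.Pattern;

public class YangModuleNameCheck {
    // Rewritten pattern: one character class instead of a quantified alternation.
    private static final Pattern YANG_MODULE_RE = Pattern
            .compile("[a-zA-Z_]?[\\w.-]+(?:@\\d{4}-\\d{2}-\\d{2})?\\.yang");

    static boolean isYangModuleFile(String name) {
        return YANG_MODULE_RE.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The string from the question: returns promptly with false.
        System.out.println(isYangModuleFile("iso_schematron_skeleton_for_xslt1.xsl")); // false
        // Illustrative names with and without a revision date.
        System.out.println(isYangModuleFile("ietf-interfaces@2018-02-20.yang"));       // true
        System.out.println(isYangModuleFile("ietf-interfaces.yang"));                  // true
    }
}
```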

Note that you may use a non-capturing group ((?:...)) instead of a capturing one: since you are only checking for a match and not extracting substrings, this avoids overhead you do not need.
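One visible effect of the change is the number of groups the matcher has to track, which Matcher.groupCount() reports. A minimal sketch, with hypothetical class and method names:

```java
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Number of capturing groups declared in the given regex.
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Capturing form of the optional revision-date part.
        System.out.println(capturingGroups("(@\\d{4}-\\d{2}-\\d{2})?\\.yang"));   // 1
        // Non-capturing form: same matches, nothing stored.
        System.out.println(capturingGroups("(?:@\\d{4}-\\d{2}-\\d{2})?\\.yang")); // 0
    }
}
```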

See the regex demo (as the pattern is used with matches() and thus requires a full string match, I added ^ and $ in the regex demo).
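The difference between a full match and a partial match can be seen by comparing matches() with find(). The sketch below uses a shortened pattern and a made-up file name purely for illustration.

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    static final Pattern UNANCHORED = Pattern.compile("[\\w.-]+\\.yang");
    static final Pattern ANCHORED = Pattern.compile("^[\\w.-]+\\.yang$");

    // matches(): the entire input must be consumed by the pattern.
    static boolean fullMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // find(): any substring of the input may match.
    static boolean partialMatch(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String name = "module.yang.bak"; // hypothetical file name
        System.out.println(fullMatch(UNANCHORED, name));    // false
        System.out.println(partialMatch(UNANCHORED, name)); // true, finds "module.yang"
        System.out.println(partialMatch(ANCHORED, name));   // false, anchors force a full match
    }
}
```

This is why a tool that searches for the pattern anywhere in the text needs ^ and $ to reproduce what matches() does.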
