Regular Expressions: The Order of Alternation Branches

Let's start with a programming problem.

[Extract all possible IPv4 addresses from a string that contains only the characters . and 0-9]

An IPv4 address is written as decimal numbers separated by dots. Each address contains four decimal numbers in the range 0-255, for example 172.16.254.1. In addition, none of the four numbers may have a leading 0 (a lone 0 is allowed), so 172.16.254.01 is not legal.
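The rules above can be sketched without any regular expression. This is a minimal illustration, not part of the original solution; the class name Ipv4Rules and both method names are hypothetical.

```java
class Ipv4Rules {
    // True if s is a single decimal field in 0-255 without a leading zero.
    static boolean isValidOctet(String s) {
        if (s.isEmpty() || s.length() > 3) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        // "0" alone is fine; "01" or "00" is not.
        if (s.length() > 1 && s.charAt(0) == '0') {
            return false;
        }
        return Integer.parseInt(s) <= 255;
    }

    // True if s consists of exactly four valid fields separated by dots.
    static boolean isValidIpv4(String s) {
        // The limit -1 keeps trailing empty fields, so "1.1.1." is rejected.
        String[] parts = s.split("\\.", -1);
        if (parts.length != 4) {
            return false;
        }
        for (String part : parts) {
            if (!isValidOctet(part)) {
                return false;
            }
        }
        return true;
    }
}
```

With these rules, isValidIpv4("172.16.254.1") is true, while isValidIpv4("172.16.254.01") is false because of the leading zero.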

Given the input text str="1.1.1111.16..172.16.254.1.1", find every IPv4 address that can be formed by any substring of this string.

Many people think: "I know the IPv4 regular expression well, so I can surely finish this quickly!" So they search the Internet for an IPv4 regular expression and write the following code:

// Java
import java.util.*;
import java.util.regex.*;

class Solution {
    private static final Pattern IPV4_PATTERN =
            Pattern.compile("((([1-9][0-9]?)|(1[0-9]{2})|(2[0-4]\\d)|(25[0-5])|0)\\.){3}" +
                    "(([1-9][0-9]?)|(1[0-9]{2})|(2[0-4]\\d)|(25[0-5])|0)");

    public Set<String> findAllIpv4(String input) {
        Set<String> s = new TreeSet<String>();
        Matcher m = IPV4_PATTERN.matcher(input);
        int from = 0;
        while (m.find(from)) {
            s.add(input.substring(m.start(), m.end()));
            from++;
        }
        return s;
    }
}

Reading this code, we can confirm two things: 1. the regular expression is fine, and it can correctly validate IPv4 addresses; 2. the Java Regex API usage has no obvious problem and looks as expected...

Input: 0.0.0.255
Output: [0.0.0.25]

But the result of this code is not correct. The correct output is [0.0.0.2, 0.0.0.25, 0.0.0.255]. The cause lies in how the regular expression alternation operator | is used.

Alternation

Alternation works differently in different regular expression engines. In a traditional NFA engine, the branches of an alternation are tried in order from left to right, and once the overall match succeeds, the remaining branches are not tried. [1]
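A small experiment makes the left-to-right rule visible. The sketch below uses java.util.regex with toy patterns that are not from the original problem; the class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Returns the first match of regex in input, or null when there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The first branch that lets the overall match succeed wins.
        System.out.println(firstMatch("a|ab", "ab")); // a
        System.out.println(firstMatch("ab|a", "ab")); // ab

        // matches() must consume the whole input, so the engine backtracks
        // into the second branch and the branch order no longer matters.
        System.out.println(Pattern.matches("a|ab", "ab")); // true
    }
}
```

The last line hints at why a regular expression that validates whole addresses correctly can still truncate results when it is used with find().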

Taking the regular expression from the example above, we extract the sub-expression for a single decimal number, (([1-9][0-9]?)|(1[0-9]{2})|(2[0-4]\\d)|(25[0-5])|0). Its branches are tried in the following order:

1. ([1-9][0-9]?)      # a one-digit or two-digit number
2. (1[0-9]{2})        # a three-digit number in the range [100-199]
3. (2[0-4]\\d)        # a three-digit number in the range [200-249]
4. (25[0-5])          # a three-digit number in the range [250-255]
5. 0                  # 0

So for the input string 0.0.0.255, when the fourth decimal number 255 is matched, the branch ([1-9][0-9]?) is tried first. It could match either 25 or 2, but ? (the question mark) is a greedy quantifier in regular expressions [2], so it takes 25. The overall match then succeeds and the later branches are never tried, which is why the result we see is [0.0.0.25].
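The walkthrough above can be reproduced directly. In this sketch the two field patterns are copied from the listings in this article, while the class name and the helpers quad and firstMatch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OctetOrderDemo {
    // Field pattern from the first listing: the short branch comes first.
    static final String SHORT_FIRST =
            "(([1-9][0-9]?)|(1[0-9]{2})|(2[0-4]\\d)|(25[0-5])|0)";
    // The same set of fields with the longest branches first.
    static final String LONG_FIRST =
            "(25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d)";

    // Builds the four-field pattern from one field pattern.
    static Pattern quad(String octet) {
        return Pattern.compile("(" + octet + "\\.){3}" + octet);
    }

    // Returns the first match found in input, or null when nothing matches.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(quad(SHORT_FIRST), "0.0.0.255")); // 0.0.0.25
        System.out.println(firstMatch(quad(LONG_FIRST), "0.0.0.255"));  // 0.0.0.255
    }
}
```

Only the order of the branches differs between the two patterns, yet the last field comes out as 25 in one case and 255 in the other.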

With this knowledge, we can produce the correct answer by matching the longest numbers first and deriving the shorter ones by hand. The regular expression needs a small adjustment that moves the branches matching more digits to the front, that is (25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d). The code is as follows:

// Java
import java.util.*;
import java.util.regex.*;

class Solution {
    private static final Pattern IPV4_PATTERN =
            Pattern.compile("((25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d)\\.){3}" +
                    "(25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d)");

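    public Set<String> findAllIpv4(String input) {
        Set<String> s = new TreeSet<String>();
        Matcher m = IPV4_PATTERN.matcher(input);
        int from = 0;
        int lastDotIdx = 0;
        String sub = null;
        // Restart the search at every index so that overlapping matches are found.
        while (m.find(from)) {
            // With the longest branches first, the last field is as long as possible.
            sub = input.substring(m.start(), m.end());
            s.add(sub);
            lastDotIdx = sub.lastIndexOf('.');
            if (lastDotIdx == sub.length() - 3) {
                // Two-digit last field: its one-digit prefix is also valid.
                s.add(sub.substring(0, sub.length() - 1));
            } else if (lastDotIdx == sub.length() - 4) {
                // Three-digit last field: add its two-digit and one-digit prefixes.
                s.add(sub.substring(0, sub.length() - 1));
                s.add(sub.substring(0, sub.length() - 2));
            }
            from++;
        }
        return s;
    }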
}
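One way to cross-check the result is a brute-force search that tests every candidate substring with matches(). This is a different technique from the listing above, shown only as a sketch; the class name BruteForceIpv4 is hypothetical, and the approach trades speed for simplicity.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

class BruteForceIpv4 {
    private static final String OCTET = "(25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d)";
    private static final Pattern FULL = Pattern.compile("(" + OCTET + "\\.){3}" + OCTET);

    // Collects every substring of input that is a complete IPv4 address.
    static Set<String> findAllIpv4(String input) {
        Set<String> result = new TreeSet<>();
        for (int start = 0; start < input.length(); start++) {
            // A dotted quad is between 7 ("0.0.0.0") and 15 characters long.
            int maxEnd = Math.min(input.length(), start + 15);
            for (int end = start + 7; end <= maxEnd; end++) {
                String candidate = input.substring(start, end);
                // matches() requires the whole candidate to be an address.
                if (FULL.matcher(candidate).matches()) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAllIpv4("0.0.0.255")); // [0.0.0.2, 0.0.0.25, 0.0.0.255]
    }
}
```

Because matches() must consume the entire candidate, the order of the alternation branches does not affect this version, which makes it a convenient reference when testing the faster find()-based code.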

  1. Friedl, J. E. (2006). Mastering Regular Expressions. O'Reilly Media, pp. 174-175.

  2. Friedl, J. E. (2006). Mastering Regular Expressions. O'Reilly Media, p. 142.

Origin www.cnblogs.com/imac/p/12077923.html