Regular Expressions: The Order of Alternation Branches
Let's start with a programming problem.
From a string that contains only the characters . and 0-9, extract every possible IPv4 address. An IPv4 address is written as four decimal numbers separated by dots, each in the range 0-255, for example 172.16.254.1. In addition, none of the four decimal numbers may have a leading 0, so 172.16.254.01 is not legal. Given the text input
str="1.1.1111.16..172.16.254.1.1"
find all IPv4 addresses that can be formed by any substring of this string.
Many people think: "I know the IPv4 regular expression well, so I can definitely finish this quickly!"
So they search the Internet for an IPv4 regular expression and write the following code:
// Java
import java.util.*;
import java.util.regex.*;

class Solution {
    // Each number: short branches come first in the alternation.
    private static final Pattern IPV4_PATTERN =
        Pattern.compile("((([1-9][0-9]?)|(1[0-9]{2})|(2[0-4]\\d)|(25[0-5])|0)\\.){3}" +
                        "(([1-9][0-9]?)|(1[0-9]{2})|(2[0-4]\\d)|(25[0-5])|0)");

    public Set<String> findAllIpv4(String input) {
        Set<String> s = new TreeSet<String>();
        Matcher m = IPV4_PATTERN.matcher(input);
        int from = 0;
        // Restart the search at every index so that overlapping matches are found.
        while (m.find(from)) {
            s.add(input.substring(m.start(), m.end()));
            from++;
        }
        return s;
    }
}
Reading this code, we can confirm two things: 1. the regular expression is fine and can be used to validate a well-formed IPv4 address; 2. the Java Regex API usage is largely fine and behaves as expected...
Input: 0.0.0.255
Output: [0.0.0.25]
But the result of this code is not correct. The correct output is [0.0.0.2, 0.0.0.25, 0.0.0.255]. The reason lies in how the regular expression alternation operator | is used.
Alternation
Alternation works differently in different regular expression engines. In a traditional NFA engine, the branches of an alternation are tried in order from left to right. Once one branch matches, the remaining branches are not tried unless a later part of the pattern fails and forces the engine to backtrack. [1]
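The left-to-right rule can be observed in isolation with a minimal sketch using java.util.regex. The class name AlternationOrderDemo and the helper firstMatch are made up for this illustration and are not part of the solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The branch "a" is tried first and succeeds, so "ab" is never tried.
        System.out.println(firstMatch("a|ab", "ab"));     // a
        // With the longer branch first, the whole input is matched.
        System.out.println(firstMatch("ab|a", "ab"));     // ab
        // Here "a" matches but the following "c" fails, so the engine
        // backtracks into the second branch and the match succeeds.
        System.out.println(firstMatch("(a|ab)c", "abc")); // abc
    }
}
```

The third call shows why the same IPv4 expression still works for validating a whole string: a failure later in the pattern sends the engine back to try the remaining branches.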
Taking the regular expression from the example above, we extract the sub-expression for a single decimal number, (([1-9][0-9]?)|(1[0-9]{2})|(2[0-4]\\d)|(25[0-5])|0). Its branches are tried in the following order:
1. ([1-9][0-9]?) # a one-digit or two-digit number
2. (1[0-9]{2}) # a three-digit number in the range [100-199]
3. (2[0-4]\\d) # a three-digit number in the range [200-249]
4. (25[0-5]) # a three-digit number in the range [250-255]
5. 0 # 0
Now consider matching the input string 0.0.0.255. When the fourth decimal number, 255, is matched, the branch ([1-9][0-9]?) is tried first. It could match either 25 or 2, but because ? is a greedy quantifier in regular expressions [2], it takes 25. Nothing follows the last number in the pattern, so no backtracking occurs and the remaining branches are never tried. That is why the final result is [0.0.0.25].
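The effect of the first branch and of the greedy ? can be reproduced on the last number alone. The following is a small sketch; the class OctetBranchDemo, the constant SHORT_FIRST and the helper firstMatch are names invented for illustration, and the reluctant quantifier ?? is shown only as a contrast to the greedy ?.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OctetBranchDemo {
    // The single-number sub-expression from the first listing (short branches first).
    static final String SHORT_FIRST =
        "([1-9][0-9]?)|(1[0-9]{2})|(2[0-4]\\d)|(25[0-5])|0";

    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The first branch wins, so only two digits of 255 are consumed.
        System.out.println(firstMatch(SHORT_FIRST, "255"));    // 25
        // Greedy ?: the optional second digit is taken when available.
        System.out.println(firstMatch("[1-9][0-9]?", "255"));  // 25
        // Reluctant ??: the optional second digit is skipped when possible.
        System.out.println(firstMatch("[1-9][0-9]??", "255")); // 2
    }
}
```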
With this knowledge, we can write a correct solution by "matching the multi-digit numbers first and manually parsing the shorter ones". To do so, the regular expression needs some adjustment: the branches that match more digits are moved to the front, that is (25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d). The code is as follows:
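// Java
import java.util.*;
import java.util.regex.*;

class Solution {
    // Each number: branches that match more digits come first.
    private static final Pattern IPV4_PATTERN =
        Pattern.compile("((25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d)\\.){3}" +
                        "(25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d)");

    public Set<String> findAllIpv4(String input) {
        Set<String> s = new TreeSet<String>();
        Matcher m = IPV4_PATTERN.matcher(input);
        int from = 0;
        int lastDotIdx = 0;
        String sub = null;
        // Restart the search at every index so that overlapping matches are found.
        while (m.find(from)) {
            // The regex yields the longest address starting at this position.
            sub = input.substring(m.start(), m.end());
            s.add(sub);
            // Derive the shorter addresses by trimming digits off the last number.
            lastDotIdx = sub.lastIndexOf('.');
            if (lastDotIdx == sub.length() - 3) {
                // Two-digit last number: also add the one-digit prefix.
                s.add(sub.substring(0, sub.length() - 1));
            } else if (lastDotIdx == sub.length() - 4) {
                // Three-digit last number: also add the two-digit and one-digit prefixes.
                s.add(sub.substring(0, sub.length() - 1));
                s.add(sub.substring(0, sub.length() - 2));
            }
            from++;
        }
        return s;
    }
}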
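One way to gain confidence in the result is to compare it with a brute-force enumeration. The sketch below is not part of the original solution: the class BruteForceIpv4 is a hypothetical cross-check that tests every candidate substring with a full match. Because matches() must consume the whole candidate, the engine backtracks through all branches, so the branch order does not affect the outcome here.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

class BruteForceIpv4 {
    // Same single-number sub-expression as in the corrected listing.
    private static final String OCTET = "(25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]\\d|\\d)";
    private static final Pattern IPV4 =
        Pattern.compile("(" + OCTET + "\\.){3}" + OCTET);

    // Hypothetical cross-check: test every substring with a full match.
    static Set<String> findAllIpv4(String input) {
        Set<String> result = new TreeSet<>();
        for (int start = 0; start < input.length(); start++) {
            // An IPv4 address is between 7 and 15 characters long.
            int maxEnd = Math.min(input.length(), start + 15);
            for (int end = start + 7; end <= maxEnd; end++) {
                String candidate = input.substring(start, end);
                if (IPV4.matcher(candidate).matches()) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAllIpv4("0.0.0.255")); // [0.0.0.2, 0.0.0.25, 0.0.0.255]
    }
}
```

This version does more work per input than the regex-driven search, so it is better suited to testing than to production use.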