A detailed analysis of Java regular expressions

Metacharacters

  1. Regular expressions use specific metacharacters to search, match, and replace strings that conform to a rule
  2. Kinds of metacharacters: ordinary characters, standard characters, qualifying characters (quantifiers), and positioning characters (boundary characters)
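
As a rough illustration of the four kinds of metacharacters listed above, the sketch below exercises each of them with `java.util.regex.Pattern`. The class and method names are made up for this example, and reading "standard characters" as predefined classes such as `\d`, `\w`, and `\s` is an assumption on my part.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class MetacharacterKinds {
    // Ordinary characters: every character stands for itself.
    static boolean ordinary(String s) {
        return Pattern.matches("abc", s);
    }

    // Standard characters (assumed here to mean predefined classes): digit, word character, whitespace.
    static boolean standard(String s) {
        return Pattern.matches("\\d\\w\\s", s);
    }

    // Quantifier: between one and three 'b' characters.
    static boolean quantifier(String s) {
        return Pattern.matches("ab{1,3}c", s);
    }

    // Boundary characters: start of input (^) and word boundary (\b).
    static boolean boundary(String s) {
        return Pattern.compile("^ab\\b").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(ordinary("abc"));     // true
        System.out.println(standard("7x "));     // true
        System.out.println(quantifier("abbc"));  // true
        System.out.println(boundary("ab cd"));   // true
        System.out.println(boundary("abc"));     // false: no word boundary after "ab"
    }
}
```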

The regular expression engine

  1. A regular expression is a formula written with regular-expression symbols
  • The program parses the regular expression and builds a parse tree
  • The regular expression engine then combines with the parse tree to generate an executable program (a state machine) that performs the character matching
  • The regular expression engine is the core set of algorithms used to build that state machine
  • In summary
  • Regular expression => parse tree
  • Parse tree + regular expression engine => state machine => character matching
  2. There are currently two ways to implement a regular expression engine
  • DFA (Deterministic Finite Automaton)
  • NFA (Nondeterministic Finite Automaton)
  3. A DFA costs far more to construct than an NFA, but it executes more efficiently than an NFA
  • For a string of length n, matching with a DFA engine has time complexity O(n)
  • An NFA engine branches and backtracks heavily during matching; if the NFA has s states, the matching time complexity is O(ns) (a backtracking implementation can be far slower than this on pathological patterns)
  4. The advantage of an NFA is that it supports more advanced features, all of which rely on matching subexpressions independently
  • For this reason, the regular expression libraries used in programming languages are mostly NFA-based
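
To make the O(n) claim for a DFA concrete, here is a minimal hand-written state machine for the pattern `ab{1,3}c`. It is only a sketch: the class name, the state numbering, and the transition function are my own assumptions, and it is not how `java.util.regex` works internally. Each input character is examined exactly once and there is never anything to undo.

```java
// Hypothetical hand-built DFA for the single pattern ab{1,3}c (full match).
class TinyDfa {
    static final int DEAD = -1;   // no match is possible any more
    static final int ACCEPT = 5;  // the whole pattern has been seen

    // One deterministic transition: exactly one next state per (state, character).
    static int step(int state, char ch) {
        switch (state) {
            case 0: return ch == 'a' ? 1 : DEAD;                       // expect 'a'
            case 1: return ch == 'b' ? 2 : DEAD;                       // first 'b' is mandatory
            case 2: return ch == 'b' ? 3 : ch == 'c' ? ACCEPT : DEAD;  // one 'b' seen
            case 3: return ch == 'b' ? 4 : ch == 'c' ? ACCEPT : DEAD;  // two 'b's seen
            case 4: return ch == 'c' ? ACCEPT : DEAD;                  // three 'b's seen
            default: return DEAD;                                      // input after ACCEPT fails
        }
    }

    // Single left-to-right pass over the text: O(n) character comparisons.
    static boolean matches(String text) {
        int state = 0;
        for (int i = 0; i < text.length() && state != DEAD; i++) {
            state = step(state, text.charAt(i));
        }
        return state == ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(matches("abbc"));    // true
        System.out.println(matches("abbbbc"));  // false: four 'b's exceed the limit
    }
}
```

The trade-off mentioned above shows up even in this toy: the states had to be worked out in advance, which is the construction cost, while the matching loop itself stays trivial.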

NFA automaton

Matching process

  1. The NFA reads the characters of the regular expression one at a time and compares each with the target string
  2. If the comparison succeeds, it moves on to the next character of the regular expression; otherwise it carries on comparing against the next character of the target string
text="aabcab"
regex="bc"
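
The same search can be run with `java.util.regex` to see where the match lands. This is a small sketch; `FindDemo` and `firstMatch` are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for locating the first match in a string.
class FindDemo {
    // Returns {start, end} of the first match (end is exclusive), or null if there is none.
    static int[] firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        int[] span = firstMatch("bc", "aabcab");
        // The two leading 'a' characters fail against 'b', so the match starts at index 2.
        System.out.println(span[0] + ".." + span[1]); // 2..4
    }
}
```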

Backtracking

  1. When an NFA-based engine runs a more complex regular expression, the matching process often runs into backtracking
  2. Heavy backtracking occupies the CPU for a long time and adds performance overhead to the system
text="abbc"
regex="ab{1,3}c"

First, the engine reads the first element of the regular expression, a, and compares it with the first character of the string, a; they match.

Next, it reads the second element of the regular expression, b{1,3}, and compares it with the second character of the string, b; they match. Because b{1,3} stands for 1 to 3 b characters and the NFA is greedy, it does not yet read the next element of the regular expression, c. Instead it keeps using b{1,3} and compares it with the third character, b, which also matches.

Then b{1,3} is compared with the fourth character of the string, c, and a mismatch is found. This is where backtracking occurs: the fourth character c, which has already been read, is given back, and the pointer returns to the third character position.

After backtracking, the engine reads the next element of the regular expression, c, and compares it with the fourth character of the string, c; they match, and the match is complete.
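
The walkthrough above can be replayed with a small toy matcher that logs every comparison. It is only a sketch of the steps as described here, limited to patterns of the shape head, rep{min,max}, tail, and it is not the algorithm used inside `java.util.regex`. All names (`BacktrackTrace`, `trace`, `backtracks`) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy matcher for the fixed shape: head rep{min,max} tail, anchored at index 0.
class BacktrackTrace {
    // Returns a log of comparisons; lines starting with "backtrack" mark released characters.
    static List<String> trace(String text, char head, char rep, int min, int max, char tail) {
        List<String> log = new ArrayList<>();
        if (text.isEmpty() || text.charAt(0) != head) {
            log.add("fail: '" + head + "' does not match at index 0");
            return log;
        }
        log.add("match: '" + head + "' at index 0");
        int pos = 1;
        int taken = 0;
        // Greedy phase: keep feeding characters to the quantifier until it is full.
        while (taken < max && pos < text.length()) {
            if (text.charAt(pos) == rep) {
                log.add("match: quantifier takes '" + rep + "' at index " + pos);
                pos++;
                taken++;
            } else {
                // The quantifier tried this character and has to release it again.
                log.add("backtrack: quantifier releases '" + text.charAt(pos) + "' at index " + pos);
                break;
            }
        }
        // Tail phase: try the tail; on failure give back one repetition and retry.
        while (taken >= min) {
            if (pos < text.length() && text.charAt(pos) == tail) {
                log.add("match: '" + tail + "' at index " + pos);
                return log;
            }
            if (taken == min) {
                break; // nothing left that the quantifier is allowed to give back
            }
            pos--;
            taken--;
            log.add("backtrack: quantifier gives back '" + rep + "' at index " + pos);
        }
        log.add("fail: no match");
        return log;
    }

    // Counts the log lines that record a backtracking step.
    static long backtracks(List<String> log) {
        return log.stream().filter(line -> line.startsWith("backtrack")).count();
    }

    public static void main(String[] args) {
        List<String> log = trace("abbc", 'a', 'b', 1, 3, 'c');
        log.forEach(System.out::println);
        System.out.println("backtracks: " + backtracks(log)); // backtracks: 1
    }
}
```

With the text "abbbc" the quantifier is already full after three b characters, so it never tries the c and the log contains no backtracking line.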

Avoid backtracking

To avoid backtracking, use lazy (reluctant) mode and possessive mode

Greedy mode (Greedy)

  1. For quantifiers: when +, ?, *, or {min,max} is used on its own, the regular expression matches as much content as possible
  2. text = "abbc", regex = "ab{1,3}c": a comparison fails partway through, which causes one backtrack
  3. text = "abbbc", regex = "ab{1,3}c": the match succeeds without backtracking

Lazy mode (Reluctant)

  1. In lazy mode, the regular expression repeats the quantified character as few times as possible; as soon as that succeeds, it moves on to match the rest of the pattern
  2. Append ? to the quantifier to turn on lazy mode: text = "abc", regex = "ab{1,3}?c"
  • The result is "abc": in this mode the NFA first picks the smallest matching range, that is, a single b, which avoids the backtracking problem

Possessive mode (Possessive)

  1. Like greedy mode, possessive mode matches as much content as possible, but when a comparison fails it ends the match right there, so no backtracking takes place
  2. Append + to the quantifier to turn on possessive mode: text = "abbc", regex = "ab{1,3}+bc"
  • The result is no match: matching ends and no backtracking takes place
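
To see the three modes side by side, the sketch below applies the same pattern to "abbc" with a greedy, a lazy, and a possessive quantifier. `QuantifierModes` and `fullMatch` are names made up for this example. Only the possessive version fails, because it refuses to give back the second b that the trailing "bc" needs.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, lazy, and possessive quantifiers.
class QuantifierModes {
    // True when the whole text matches the pattern.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String text = "abbc";
        // Greedy: takes "bb", fails on the next 'b', gives one 'b' back, then "bc" matches.
        System.out.println(fullMatch("ab{1,3}bc", text));   // true
        // Lazy: takes a single 'b' first, so "bc" matches straight away.
        System.out.println(fullMatch("ab{1,3}?bc", text));  // true
        // Possessive: takes "bb" and never gives anything back, so "bc" cannot match.
        System.out.println(fullMatch("ab{1,3}+bc", text));  // false
    }
}
```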

Code

match("ab{1,3}c", "abbc"); // abbc, greedy mode, backtracking occurs
match("ab{1,3}c", "abbbc"); // abbbc, greedy mode, no backtracking
match("ab{1,3}?", "abbbb"); // ab, lazy mode, no backtracking
match("ab{1,3}+bc", "abbc"); // null, possessive mode, no backtracking
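
The `match` helper is not defined in the article. One plausible reading, assumed here, is that it returns the first substring the pattern finds, or null when nothing matches. The class name `MatchHelper` is likewise invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical definition of the match(...) helper used in the calls above.
class MatchHelper {
    // Assumed behaviour: return the first matching substring, or null if there is none.
    static String match(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(match("ab{1,3}c", "abbc"));    // abbc
        System.out.println(match("ab{1,3}c", "abbbc"));   // abbbc
        System.out.println(match("ab{1,3}?", "abbbb"));   // ab
        System.out.println(match("ab{1,3}+bc", "abbc"));  // null
    }
}
```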

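Regular expression optimization

  1. Use greedy mode less and possessive mode more (to avoid backtracking)
  2. Reduce alternation: regular expressions of the form "(X|Y|Z)" lower performance, so use them as little as possible. If you do have to use one
  • Consider the order of the alternatives and put the more common ones first, so that they can match sooner
  • Factor out the shared part, (abcd|abef) => ab(cd|ef)
  • For a simple alternation, three index lookups can replace (X|Y|Z)
  3. Reduce nested capturing
  • Capturing group: stores the content matched by a subexpression of the regular expression in an array that is numbered or explicitly named; in general, each () is one capturing group
  • Every capturing group has a number, and number 0 stands for the entire match
  • Non-capturing group: a group that takes part in matching but is not given a group number; it is generally written (?:exp)
  • Cutting out groups whose content you do not need can improve the performance of the regular expression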
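
A minimal sketch of two of the alternation tips follows: replacing a simple literal alternation with three `String.indexOf` calls, and factoring out a shared prefix. The class and method names are invented for the example, and the `indexOf` version only applies when the alternatives are plain literals, since regex syntax is not interpreted.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the alternation tips.
class AlternationAlternatives {
    // Plain-string replacement for the regex (X|Y|Z) when X, Y, and Z are literals.
    static boolean containsAny(String text, String x, String y, String z) {
        return text.indexOf(x) >= 0 || text.indexOf(y) >= 0 || text.indexOf(z) >= 0;
    }

    // The original alternation and its factored form accept the same strings.
    static final Pattern ORIGINAL = Pattern.compile("(abcd|abef)");
    static final Pattern FACTORED = Pattern.compile("ab(cd|ef)");

    static boolean sameResult(String text) {
        return ORIGINAL.matcher(text).matches() == FACTORED.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsAny("status=warn", "error", "warn", "fatal")); // true
        System.out.println(containsAny("status=ok", "error", "warn", "fatal"));   // false
        System.out.println(sameResult("abef"));                                   // true
    }
}
```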

Capturing groups

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "<input high=\"20\" weight=\"70\">test</input>";
String reg = "(<input.*?>)(.*?)(</input>)";
Pattern p = Pattern.compile(reg);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group(0)); // the entire match
    System.out.println(m.group(1)); // (<input.*?>)
    System.out.println(m.group(2)); // (.*?)
    System.out.println(m.group(3)); // (</input>)
    // Output:
    // <input high="20" weight="70">test</input>
    // <input high="20" weight="70">
    // test
    // </input>
}

Non-capturing groups

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "<input high=\"20\" weight=\"70\">test</input>";
String reg = "(?:<input.*?>)(.*?)(?:</input>)";
Pattern p = Pattern.compile(reg);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group(0)); // the entire match
    System.out.println(m.group(1)); // (.*?)
    // Output:
    // <input high="20" weight="70">test</input>
    // test
}
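
To check how many groups each variant actually records, `Matcher.groupCount()` can be compared for the two patterns. The sketch below also shows an explicitly named group. `GroupCounts`, `groups`, `namedBody`, and the group name `body` are all invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing capturing and non-capturing groups.
class GroupCounts {
    // Number of capturing groups in the pattern (group 0 is not counted).
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Same pattern as the non-capturing example, with the middle group given a name.
    static String namedBody(String text) {
        Matcher m = Pattern.compile("(?:<input.*?>)(?<body>.*?)(?:</input>)").matcher(text);
        return m.find() ? m.group("body") : null;
    }

    public static void main(String[] args) {
        System.out.println(groups("(<input.*?>)(.*?)(</input>)"));      // 3
        System.out.println(groups("(?:<input.*?>)(.*?)(?:</input>)"));  // 1
        System.out.println(namedBody("<input high=\"20\" weight=\"70\">test</input>")); // test
    }
}
```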

Summary

Use regular expressions only once you have checked their performance; otherwise avoid them where you can, so that they do not cause further performance problems.

That brings this article to a close. I hope it has given you your own understanding of regular expressions in performance tuning.


Origin www.cnblogs.com/sevencutekk/p/11592465.html