Must read! Be careful with regular expressions

What is a regular expression

A regular expression is a text pattern that uses certain special characters to find, match, and replace strings that fit a rule.

The characters that make up regular expression syntax are ordinary characters, special characters (called "metacharacters"), quantifiers, and positioning characters (boundary characters, or anchors).

For an introduction to these characters, I recommend reading Regular Expressions - Syntax and Regular Expressions - Metacharacters.

Regular expression engine

A regular expression is a formula written with regex symbols. A program parses the formula, builds a syntax tree, and then, combining that tree with the regular expression engine, generates an executable matcher (called a state machine, or state automaton) that performs the character matching.

The regular expression engine here is the set of core algorithms used to build the state machine.

There are currently two ways to implement a regular expression engine: the DFA (Deterministic Finite Automaton) and the NFA (Nondeterministic Finite Automaton). For a detailed explanation of DFA and NFA, interested readers can consult "Compilation Principles (Dragon Book)".

In contrast, the cost of constructing DFA automaton is much higher than that of NFA automaton, but the execution efficiency of DFA automaton is higher than that of NFA automaton.

Suppose the string has length n. With a DFA as the regular expression engine, the matching time complexity is O(n). With an NFA, the matching process involves many branches and a lot of backtracking; assuming the NFA has s states, the time complexity of the matching algorithm is O(ns). (A backtracking implementation can do considerably worse than this on patterns that force repeated backtracking.)

We explain the number of states through an example:

String reg = "ab{1,3}d";

In the matching rule above, for example, the number of states is 3, one for each form the pattern can match: abd, abbd, and abbbd.
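As a quick check of the three accepted forms, the sketch below runs the pattern against a few inputs with java.util.regex. The class name QuantifierRange and the helper accepts are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

class QuantifierRange {
    // Hypothetical helper: true when the whole input matches ab{1,3}d.
    static boolean accepts(String input) {
        return Pattern.matches("ab{1,3}d", input);
    }

    public static void main(String[] args) {
        // Zero b's and four b's fall outside {1,3}; the three forms in between match.
        for (String s : new String[] {"ad", "abd", "abbd", "abbbd", "abbbbd"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```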

The advantage of the NFA is that it supports more features, such as capturing groups, lookaround, and possessive quantifiers. These features all rely on matching subexpressions independently, which is why the regular expression libraries of many programming languages, including Java's java.util.regex, are NFA-based.

Capturing groups bring in the concept of grouping in regular expression matching. Groups come in two forms, capturing groups and non-capturing groups. The difference between the two is introduced later; here we only look at grouping and how groups are captured.

String reg = "((\\d+)([a-z]))\\s+";

The regular expression above contains four groups in total. Groups are numbered by their opening parentheses, counted from left to right.

group(0) is the match itself, that is, the entire expression ((\d+)([a-z]))\s+
group(1) is the subexpression ((\d+)([a-z]))
group(2) is the subexpression (\d+)
group(3) is the subexpression ([a-z])

group(0) always represents the entire expression. These groups are called capturing groups because, during matching, each subsequence of the input that matches a group is saved. A captured subsequence can later be used inside the expression through a backreference, or retrieved from the matcher once the matching operation has completed.
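To make the group numbering concrete, here is a minimal sketch that prints every group for a sample input and then shows a backreference. The class GroupDemo, the helper groups, the sample input "123a " and the pattern (\d+)-\1 are illustrative assumptions, not part of the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group(0)..group(groupCount()) of the first match, or an empty array.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int g = 0; g <= m.groupCount(); g++) {
            out[g] = m.group(g);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("((\\d+)([a-z]))\\s+", "123a ");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group(" + i + ") = [" + g[i] + "]");
        }
        // Backreference: \1 must repeat exactly what group 1 captured.
        System.out.println(Pattern.matches("(\\d+)-\\1", "42-42")); // true
        System.out.println(Pattern.matches("(\\d+)-\\1", "42-43")); // false
    }
}
```

Note that groupCount() does not count group(0), so it reports 3 for this pattern even though four values are printed.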

Backtracking of NFA automata

Most of us met backtracking when studying algorithms. Backtracking is a trial-and-error search: it moves forward as long as the current choice can still lead to the goal, and when it finds at some step that the earlier choice was wrong or cannot reach the goal, it steps back and chooses again. The classic Eight Queens problem is typically solved this way.

NFA-based matching is greedy by default: each quantifier in the regular expression matches as much as it can and only turns back when it hits a dead end, which is where backtracking comes from.

Suppose there is such a piece of code that needs regular matching:

String text = "abbc";
String regex = "ab{1,3}c";

The matching process is shown in the figure below:

The matching process in the above figure is relatively simple. If you encounter a complex regular expression, you may backtrack multiple times.
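The steps of such a match can be sketched with a toy matcher that records what it tries. This is a simplified model written for illustration only, not how java.util.regex is implemented: it handles just patterns shaped like one head character, one repeated character with {min,max}, and a literal tail, anchored at index 0. The class GreedyTrace and all of its names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class GreedyTrace {
    // Toy matcher for <head><rep>{min,max}<tail>, anchored at index 0.
    static boolean match(String text, char head, char rep, int min, int max,
                         String tail, List<String> trace) {
        if (text.isEmpty() || text.charAt(0) != head) {
            trace.add("head '" + head + "' fails at 0");
            return false;
        }
        int i = 1;      // next index to read
        int count = 0;  // repetitions consumed so far
        // Greedy phase: consume as many repetitions as allowed.
        while (count < max && i < text.length() && text.charAt(i) == rep) {
            trace.add("'" + rep + "' #" + (count + 1) + " matches at " + i);
            i++;
            count++;
        }
        if (count < max) {
            trace.add("'" + rep + "' #" + (count + 1) + " fails at " + i);
        }
        // Back-off phase: try the tail, giving back one repetition per failure.
        while (count >= min) {
            if (text.startsWith(tail, i)) {
                trace.add("tail \"" + tail + "\" matches at " + i);
                return true;
            }
            trace.add("tail \"" + tail + "\" fails at " + i);
            i--;
            count--;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        // Models ab{1,3}c against "abbc".
        System.out.println(match("abbc", 'a', 'b', 1, 3, "c", trace));
        trace.forEach(System.out::println);
    }
}
```

For "abbc" the trace shows two successful b matches, a failed third attempt at the c, and then the tail matching at that same position. Modelling ab{1,3}bc against "abbbc" instead shows a real back-off: three b's are consumed, the tail fails, one b is given back, and the tail then matches.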

Matching modes

Greedy mode was mentioned above. Regular expressions have two other matching modes as well, so there are three in total.

1. Greedy mode (Greedy)

Quantifiers specify how many times a given component of a regular expression must occur for a match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.

By default, a quantifier in a regular expression matches as much content as possible, as in the following example:

String regex = "ab{1,3}c";

For greedy mode, refer to the matching flow chart above: the NFA automaton consumes the largest range it can, and backtracks after a failure.

Here is what I first thought while studying this. I assumed that on the first attempt the engine picks the longest candidate, abbbc, compares it with the text, and, when that fails, shrinks the candidate step by step and tries again.

That mental model left me confused when I got to possessive mode (also translated as exclusive mode). I could not even tell possessive mode apart from greedy mode, especially in the following case:

String text = "abbc";
String regex = "ab{1,3}+bc"; // result: no match
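The difference is easy to reproduce. The sketch below compares the greedy pattern with its possessive counterpart on the same text; the class PossessiveVsGreedy and the helper fullMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class PossessiveVsGreedy {
    // True when the regex matches the entire text.
    static boolean fullMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "abbc";
        // Greedy: b{1,3} first takes both b's, then gives one back so "bc" can match.
        System.out.println(fullMatch("ab{1,3}bc", text));  // true
        // Possessive: b{1,3}+ keeps both b's, so nothing is left for the literal b.
        System.out.println(fullMatch("ab{1,3}+bc", text)); // false
    }
}
```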

So I wanted to find out exactly which steps a regular expression match goes through. The greedy matching flow chart above was only redrawn from pictures found online, so what actually backs it up? I tried the following:

1. First I looked for an online regex tester that reports how many steps a match takes. I did not find a suitable one. (A few good regex tools are listed later in this article for reference.)

2. With no suitable tool, the only way left was to read the code. Code does not lie: read the matching logic, step through it in a debugger, and hope to learn something.

I am used to Java, so let's start with Java. Here is the test code:

public static void matchTest() {
    String text = "abbc";
    String reg = "ab{1,3}c";
    Pattern p = Pattern.compile(reg);
    Matcher m = p.matcher(text);
    System.out.println(m.find());
}

For studying the Pattern and Matcher source code, these two articles are useful: Regex Regularity of Java Source Code Analysis (1) and Pattern and Matcher.find Source Code Interpretation

They take some of the pressure off reading the source and give you a thread to follow, which helps: the Pattern file alone has nearly 6000 lines of code.

Greedy pattern matching logic source code analysis

I will not list the whole process here; the goal is not a source-code walkthrough, so I focus on the core part. It breaks down into the following steps:

1. The content of reg is read and wrapped in implementations of Node. Node has many subclasses; the first one I met was the Curly class, which has four fields: atom, type, cmin, and cmax. atom is like a tree node: it holds the element being repeated (here an ordinary character from reg) and hands over to the next node after it matches. type distinguishes the matching mode, with greedy mode represented by 0 in the code. For b{1,3}, cmin is 1 and cmax is 3. I took a screenshot to make this easier to follow, as shown below:

98 is the ASCII code of the character b.

2. Now for the matching logic of b{1,3}. The core code is in the match method of the Curly class.

boolean match(Matcher matcher, int i, CharSequence seq) {
    int j;
    for (j = 0; j < cmin; j++) {
        if (atom.match(matcher, i, seq)) {
            i = matcher.last;
            continue;
        }
        return false;
    }
    if (type == GREEDY)      // greedy mode
        return match0(matcher, i, j, seq);
    else if (type == LAZY)   // lazy mode
        return match1(matcher, i, j, seq);
    else                     // possessive mode
        return match2(matcher, i, j, seq);
}

The matching logic for greedy mode is in the match0() method.

// Greedy match.
// i is the index to start matching at
// j is the number of atoms that have matched
boolean match0(Matcher matcher, int i, int j, CharSequence seq) {
    if (j >= cmax) {
        // We have matched the maximum... continue with the rest of
        // the regular expression
        return next.match(matcher, i, seq);
    }
    int backLimit = j;
    while (atom.match(matcher, i, seq)) {
        // k is the length of this match
        int k = matcher.last - i;
        if (k == 0) // Zero length match
            break;
        // Move up index and number matched
        i = matcher.last;
        j++;
        // We are greedy so match as many as we can
        while (j < cmax) {
            if (!atom.match(matcher, i, seq))
                break;
            if (i + k != matcher.last) {
                if (match0(matcher, matcher.last, j+1, seq))
                    return true;
                break;
            }
            i += k;
            j++;
        }
        // Handle backing off if match fails
        while (j >= backLimit) {
            if (next.match(matcher, i, seq))
                return true;
            i -= k;
            j--;
        }
        return false;
    }
    return next.match(matcher, i, seq);
}

Regarding character matching, the specific logic is:

private static abstract class BmpCharProperty extends CharProperty {
    boolean match(Matcher matcher, int i, CharSequence seq) {
        if (i < matcher.to) {
            return isSatisfiedBy(seq.charAt(i))
                && next.match(matcher, i+1, seq);
        } else {
            matcher.hitEnd = true;
            return false;
        }
    }
}

// isSatisfiedBy is implemented as follows:
static final class Single extends BmpCharProperty {
    final int c;
    Single(int c) { this.c = c; }
    boolean isSatisfiedBy(int ch) {
        return ch == c;
    }
}

I will walk through the code above using what the debugger shows. On entering match0(), look at the values of i and j: i=2 means the third character of text is about to be matched, and j=1 means b{1,3} has already matched one b.

In the first loop, atom.match(matcher, i, seq) matches the third character of text, and succeeds. Now j=2, which is less than cmax, so atom.match(matcher, i, seq) is called again. The fourth character of text is c, which cannot match b, so the inner loop breaks.

Then next.match(matcher, i, seq) compares the fourth character of text with the last character of reg. They match, so it returns true.

This walkthrough is brief, but it confirms the greedy matching flow chart above.

2. Lazy mode (Reluctant)

In this mode, the regular expression repeats the quantified element as few times as possible. Once that minimal match succeeds, it goes on to match the rest of the pattern.

String regex = "ab{1,3}?c";

This is the opposite of greedy mode: on the first attempt, the smallest range is tried, that is, abc.

Lazy mode does not avoid backtracking, however. If the text to match is abbc, the first attempt fails, the matching range then grows from small to large, and backtracking occurs here too.

The lazy matching process is shown in the figure below. I will not go through the source code here; the core logic is in the match1() method of the Curly class, and interested readers can step through it in a debugger.

Lazy Pattern Matching Flowchart
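A small sketch makes the contrast with greedy mode visible. It uses find() so that the quantifier alone decides how much is consumed; the class LazyVsGreedy, the helper firstMatch, and the sample text "abbbc" are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Returns the first substring the regex finds, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // With nothing after the quantifier, each mode shows its preference.
        System.out.println(firstMatch("ab{1,3}", "abbbc"));  // abbb
        System.out.println(firstMatch("ab{1,3}?", "abbbc")); // ab
        // With a trailing c, lazy mode has to grow from one b to two before c fits.
        System.out.println(firstMatch("ab{1,3}?c", "abbc")); // abbc
    }
}
```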

3. Possessive mode

Like greedy mode, possessive mode (also translated as exclusive mode) matches as much content as possible. The difference is that in possessive mode, if the rest of the match fails, matching simply ends: the quantifier never gives characters back, so there is no backtracking problem.

Add a "+" after the quantifier to enable possessive mode.

My understanding of possessive mode comes from an article by Mr. Liu Chao in his Geek Time column. It says that in case a below, the match does not go back after failing. Does that mean backtracking never happens in possessive mode? Yet the article then gives case b, says that it matches successfully, and says that backtracking occurred. I was confused. What is going on here?

// Case a: the match fails
String textA = "abbc";
String regexA = "ab{1,3}+bc";

// Case b: the match succeeds (the original article says backtracking occurs)
String textB = "abbc";
String regexB = "ab{1,3}+c";

There is nothing for it but to start from the code again. As shown above, possessive mode goes through the match2() method. Let's take a look.

boolean match2(Matcher matcher, int i, int j, CharSequence seq) {
    for (; j < cmax; j++) {
        if (!atom.match(matcher, i, seq))
            break;
        if (i == matcher.last)
            break;
        i = matcher.last;
    }
    return next.match(matcher, i, seq);
}

Compared with greedy mode, the code logic is indeed much simpler. Let's debug case a first. The loop tries to match the remaining characters of text against b. When a match fails, it breaks out and executes next.match(matcher, i, seq). While debugging, we find that this call enters the Slice class. First, let's look at the value of next:

The code of the Slice class is as follows:

static final class Slice extends SliceNode {
    Slice(int[] buf) {
        super(buf);
    }
    boolean match(Matcher matcher, int i, CharSequence seq) {
        int[] buf = buffer;
        int len = buf.length;
        for (int j=0; j<len; j++) {
            if ((i+j) >= matcher.to) {
                matcher.hitEnd = true;
                return false;
            }
            if (buf[j] != seq.charAt(i+j))
                return false;
        }
        return next.match(matcher, i+len, seq);
    }
}

The content of buffer is [98,99], corresponding to the last two characters of regex (b and c). The loop body evaluates buf[0] != seq.charAt(3), which is true, so the method returns false immediately. There is indeed no backtracking. This was the first time I had seen buffer; I am not clear on all the logic behind it, but comparing several literal characters in one node should help efficiency.

By the way, the Slice object is created by newSlice(buffer, first, hasSupplementary) in the atom() method. From my tests, once a "+" is added after the quantifier, a buffer is generated whenever at least two ordinary characters follow it. Here are a few small cases:

String reg1 = "ab{1,3}+qcsd{1,2}+x";  // only one buffer is produced: [q,c,s]
String reg2 = "ab{1,3}+qcsd{1,2}+xd"; // two buffers are produced: [q,c,s] and [x,d]

Now debug case b. As noted above, only one character, 'c', follows the "+", so no buffer is generated. Let's look at the next object at this point:

The rest of the matching is simple: it only checks whether the two character values are equal.

Looking back: the author of the original article said that case b cannot avoid backtracking, but by my analysis this does not count as backtracking. For comparison, the back-off code in greedy mode is as follows:

// Handle backing off if match fails
while (j >= backLimit) {
    if (next.match(matcher, i, seq))
        return true;
    i -= k;
    j--;
}

The possessive code has nothing like this, so greedy mode is considerably more complicated. When the text to match is very long, possessive mode should be the more efficient of the two.

To sum up, possessive mode performs better than greedy mode, and in my view possessive mode does not backtrack.
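One rough way to see the difference in work, without a debugger, is to count how many characters the engine reads. The sketch below wraps the input in a CharSequence that counts charAt calls; this is my own measuring trick, not something from the JDK or the original article, and the class CharReadCounter, the helper readsFor, and the failing input "abbbd" are assumptions. The exact counts depend on JDK internals, so they are only meaningful relative to one another.

```java
import java.util.regex.Pattern;

class CharReadCounter implements CharSequence {
    private final String text;
    int reads; // number of charAt calls made so far

    CharReadCounter(String text) {
        this.text = text;
    }

    @Override public char charAt(int index) {
        reads++;
        return text.charAt(index);
    }
    @Override public int length() { return text.length(); }
    @Override public CharSequence subSequence(int start, int end) {
        return text.subSequence(start, end);
    }
    @Override public String toString() { return text; }

    // Runs one full-match attempt and returns how many characters were read.
    static int readsFor(String regex, String input) {
        CharReadCounter seq = new CharReadCounter(input);
        Pattern.compile(regex).matcher(seq).matches();
        return seq.reads;
    }

    public static void main(String[] args) {
        String input = "abbbd"; // fails for both patterns: there is no trailing c
        System.out.println("greedy reads:     " + readsFor("ab{1,3}c", input));
        System.out.println("possessive reads: " + readsFor("ab{1,3}+c", input));
    }
}
```

On a failing input the greedy pattern retries c after giving back each b, while the possessive pattern tries c once and stops, so the greedy count should come out higher.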

Grouping

What if you want to repeat several characters at once? That is what grouping is for: wrap the subexpression to be repeated in parentheses "()" and then put a quantifier on it. For example, (abc)? means zero or one occurrence of abc. Each parenthesized expression is a group.

Groups come in two forms: capturing groups and non-capturing groups.

For a detailed explanation of grouping, recommended reading: Advanced usage of regular expressions (grouping and capturing)

Optimization of regular expressions

1. Use less greedy mode and more possessive mode

Greedy mode causes backtracking. In my view possessive mode does not, and it performs better. Keep in mind that the two are not interchangeable: as case a above shows, a possessive quantifier can reject text that the greedy version accepts, so check that the pattern still matches what you need.

2. Reduce branch selection

Branch selection of the form "(X|Y|Z)" lowers performance. It can often be replaced by something else, or the order of the branches can be adjusted:

Put the most frequent alternatives first, much like a prefix index in a database, so that a match is found quickly;
Factor out the common prefix. For example, replace "(abcd|abef)" with "ab(cd|ef)". The latter matches faster, because the NFA automaton tries ab once and, if it is not found, does not try any of the options;
For a simple alternation of literals, three calls to String.indexOf() can replace "(X|Y|Z)".
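The last two suggestions can be sketched as follows. The factored pattern accepts exactly the same strings as the plain alternation, and the indexOf helper avoids the regex engine altogether. The class AlternationDemo and its method names are hypothetical, and no performance figures are claimed here; measure with your own data.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    private static final Pattern PLAIN = Pattern.compile("(abcd|abef)");
    private static final Pattern FACTORED = Pattern.compile("ab(cd|ef)");

    static boolean plain(String s) {
        return PLAIN.matcher(s).find();
    }

    static boolean factored(String s) {
        return FACTORED.matcher(s).find();
    }

    // Regex-free check for a simple alternation of literals such as X|Y|Z.
    static boolean containsAny(String s, String... literals) {
        for (String literal : literals) {
            if (s.indexOf(literal) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"xxabcdyy", "xxabefyy", "xxabceyy"}) {
            System.out.println(s + ": " + plain(s) + " / " + factored(s)
                    + " / " + containsAny(s, "abcd", "abef"));
        }
    }
}
```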

3. Reduce the use of capture groups

Capturing and non-capturing groups were mentioned above. Put simply, a plain () is a capturing group, and capturing groups can be nested. An expression of the form (?:exp) is a non-capturing group.

public static void main(String[] args) {
    String text = "<input high=\"20\" weight=\"70\">test</input>";
    String reg = "(<input.*?>)(.*?)(</input>)";
    Pattern p = Pattern.compile(reg);
    Matcher m = p.matcher(text);
    while (m.find()) {
        System.out.println(m.group(0)); // the entire match
        System.out.println(m.group(1)); // (<input.*?>)
        System.out.println(m.group(2)); // (.*?)
        System.out.println(m.group(3)); // (</input>)
    }
}

The execution result is:

<input high="20" weight="70">test</input>
<input high="20" weight="70">
test
</input>

If you only want to get the content wrapped by the input tag, you can use non-capturing grouping.

public static void main(String[] args) {
    String text = "<input high=\"20\" weight=\"70\">test</input>";
    String reg = "(?:<input.*?>)(.*?)(?:</input>)";
    Pattern p = Pattern.compile(reg);
    Matcher m = p.matcher(text);
    while (m.find()) {
        System.out.println(m.group(0)); // the entire match
        System.out.println(m.group(1)); // (.*?)
    }
}

The execution result is:

<input high="20" weight="70">test</input>
test

Of course, the non-capturing groups above are somewhat redundant. You can simply remove the parentheses around the parts you do not want to capture, and the effect is the same.

String reg = "<input.*?>(.*?)</input>";

In summary, cutting out groups whose content you do not need can improve the performance of a regular expression, because the engine then has less captured text to keep track of.
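The three variants can be compared by asking the matcher how many capturing groups each one defines. This is a minimal sketch; the class GroupCountDemo and its helpers are hypothetical names, and it shows only the group bookkeeping, not a timing measurement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups in the pattern (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Content of group 1 for the first match, or null when nothing matches.
    static String content(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "<input high=\"20\" weight=\"70\">test</input>";
        String capturing = "(<input.*?>)(.*?)(</input>)";
        String nonCapturing = "(?:<input.*?>)(.*?)(?:</input>)";
        String noExtraGroups = "<input.*?>(.*?)</input>";

        System.out.println(groupCount(capturing));     // 3
        System.out.println(groupCount(nonCapturing));  // 1
        System.out.println(groupCount(noExtraGroups)); // 1
        System.out.println(content(noExtraGroups, text)); // test
    }
}
```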

Regular expression tools

Regular expression online test

Play with regular expressions! Recommend a quick check, debugging, verification, visualization tool

Online Access Go directly to ihateregex.io/

Regulex

Regular Expression Test Page for Java

regular expressions 101

Summary

Regular expressions are small, but their matching power is considerable, and they are available in virtually every programming language. In the past I used them only to get the matching done, for example validating a mobile phone number or email address on a registration page, and never considered whether the pattern I wrote had performance problems. Today I tried to show regular expressions from a different angle. My hope is that from now on you will avoid regular expressions where you can do without them, and where you must use them, check their performance and write them as carefully as you can.