04 - Use regular expressions sparingly

In the earlier discussion of String object optimization, I mentioned the split() method and noted that the regular expression it uses can cause backtracking problems. Today, let's take a closer look at what is going on.

Before we start, let's look at a case that can help you better understand the content.

I ran into this problem while developing a small project. To promote a new product, we developed a small program. Based on the previously estimated traffic, about 300,000+ users were expected to take part in the event, with a peak TPS (transactions per second) of about 3000.

This result comes from a performance test I ran against the interface. I usually use the ab tool (quick to install with yum -y install httpd-tools) from another machine to test HTTP request interfaces.

I can simulate the production peak by setting -n (number of requests) and -c (number of concurrent users), and then measure the interface with three indicators: TPS, RT (response time), and the distribution of request times. As shown in the figure below (the hidden part in the figure is my server address):

While doing performance testing, I found that the TPS of a submission interface would not go up. The business logic is very simple, so a performance bottleneck there seemed unlikely.

I used a process of elimination to find the problem. First, comment out all the business code in the method so that only an empty method is left, and then check the performance. This approach cleanly separates framework performance problems from business code performance problems.

I quickly confirmed that the problem was in the business code, and then restored the code piece by piece to find the cause. After I added back the database insert, the TPS dropped slightly, but that was not the cause. In the end, only the split() call was left. Sure enough, after I added back the split() call, the TPS dropped significantly.

But why does a split() method affect TPS? Let's go through how regular expressions work, and the answer will become clear.
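Before going further, it helps to see that String.split() treats its argument as a regular expression rather than as a plain delimiter. The sketch below is only an illustration: the class name SplitIsRegexDemo, the helper pieces() and the sample host name are made up for this example.

```java
import java.util.Arrays;

public class SplitIsRegexDemo {
    // Splits the input on the given expression and returns the number of pieces.
    static int pieces(String input, String regex) {
        return input.split(regex).length;
    }

    public static void main(String[] args) {
        String host = "www.example.com"; // hypothetical sample input

        // "." is the regex "any character", so every character is a delimiter
        // and only empty strings remain; split() drops trailing empty strings.
        System.out.println(Arrays.toString(host.split(".")));   // prints []

        // "\\." is a literal dot.
        System.out.println(Arrays.toString(host.split("\\."))); // prints [www, example, com]
    }
}
```

Because the argument is a pattern, anything said below about regular expression matching can also apply to a split() call.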

1. What is a regular expression?

This is basic material, so here is just a brief review.

Regular expressions are a concept in computer science, and many languages implement them. A regular expression uses specific metacharacters to search for, match, and replace strings that conform to a rule.

A regular expression grammar is built from ordinary characters, standard characters, limiting characters (quantifiers), and positioning characters (boundary characters). Details can be seen in the figure below:

2. Regular expression engine

A regular expression is a formula written with regular symbols. The program parses the formula, builds a syntax tree, and then, using the regular expression engine, generates an execution program from that tree (this execution program is called a state machine, or state automaton) that is used for character matching.

The regular expression engine here is a set of core algorithms for building state machines.

There are currently two ways to implement regular expression engines: the DFA automaton (Deterministic Finite Automaton) and the NFA automaton (Nondeterministic Finite Automaton).

In contrast, the cost of constructing DFA automaton is much higher than that of NFA automaton, but the execution efficiency of DFA automaton is higher than that of NFA automaton.

Assume the string has length n. With a DFA automaton as the regular expression engine, the matching time complexity is O(n). With an NFA automaton, the matching process involves a large number of branches and backtracking; assuming the NFA has s states, the time complexity of the matching algorithm is O(ns).

The advantage of the NFA automaton is that it supports more features, such as capture groups, lookaround, and possessive quantifiers. These features rely on matching subexpressions independently, so the regular expression libraries of many programming languages, Java's included, are implemented based on NFA.

So how does the NFA automaton match? Let me use the following text and expression as an example.

text=“aabcab” regex=“bc”

The NFA automaton reads the regular expression one character at a time and matches it against the target string. If the match succeeds, it moves on to the next character of the regular expression; otherwise it moves on to the next character of the target string. Let's break the process down.

First, it reads the first character of the regular expression, b, and compares it with the first character of the string: b does not match a. It moves to the next character of the string, which is also a, and again there is no match. It moves to the next character, which is b, and this time it matches.

Then, in the same way, it reads the second character of the regular expression, c, and compares it with the fourth character of the string: c matches c. It tries to read the next character of the regular expression, but there are no more characters to match, so the match ends.

This is the matching process of the NFA automaton. Although in practical applications, the regular expressions encountered are more complex than this, the matching method is the same.
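The walk-through above can be checked with java.util.regex. This is a minimal sketch; the class name SimpleMatchDemo and the helper firstMatch() are names chosen for this example. Note that Java reports 0-based offsets, so the third character of the string has index 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleMatchDemo {
    // Returns {start, end} of the first match (end is exclusive), or null if there is none.
    static int[] firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        int[] span = firstMatch("bc", "aabcab");
        // The match covers the third and fourth characters of the text.
        System.out.println(span[0] + ".." + span[1]); // prints 2..4
    }
}
```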

3. Backtracking of NFA automata 

Complicated regular expressions implemented with NFA automata often cause backtracking during the matching process. Heavy backtracking occupies the CPU for a long time, which costs system performance. Let me illustrate.

text=“abbc” regex=“ab{1,3}c”

In this example, the goal is simple: match a string that starts with a, ends with c, and has 1 to 3 b characters in between. The NFA automaton processes it as follows:

First, read the first match character a of the regular expression and compare it with the first character a of the string, a matches a.

Then, it reads the second element of the regular expression, b{1,3}, and compares it with the second character of the string, b, which matches. Because b{1,3} stands for 1 to 3 b characters and the NFA automaton is greedy, it does not move on to the next element of the regular expression yet. Instead it compares b{1,3} with the third character of the string, b, and that matches as well.

It then compares b{1,3} with the fourth character of the string, c, which does not match. At this point backtracking occurs: the fourth character c, which has already been read, is given back, and the pointer returns to the position of the third character, b.

So how does the matching process continue after the backtracking occurs? The program will read the next match character c of the regular expression, compare it with the fourth character c in the string, the result will match, and end.
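To make the give-back step concrete, here is a toy matcher written only for this pattern shape (a, then b{min,max}, then a literal tail). It is a sketch under stated assumptions, not how java.util.regex is implemented, and BacktrackCounter, matchGreedy() and backtracks are hypothetical names. It counts how many already consumed b characters are handed back. The variation ab{1,3}bc is used because, on "abbc", the greedy quantifier first takes both b characters and has to return one before the tail bc can match.

```java
public class BacktrackCounter {
    // Number of consumed 'b' characters handed back during the last call (illustrative only).
    static int backtracks;

    // Matches the whole text against: 'a', then 'b' repeated min..max times (greedy), then tail.
    static boolean matchGreedy(String text, int min, int max, String tail) {
        backtracks = 0;
        if (text.isEmpty() || text.charAt(0) != 'a') {
            return false;
        }
        int pos = 1;
        int count = 0;
        // Greedy step: consume as many 'b' characters as the quantifier allows.
        while (count < max && pos < text.length() && text.charAt(pos) == 'b') {
            pos++;
            count++;
        }
        if (count < min) {
            return false;
        }
        while (true) {
            // Try to match the rest of the pattern at the current position.
            if (text.startsWith(tail, pos) && pos + tail.length() == text.length()) {
                return true;
            }
            if (count == min) {
                return false; // nothing left to give back
            }
            // Backtrack: hand one 'b' back and retry the tail one position earlier.
            pos--;
            count--;
            backtracks++;
        }
    }

    public static void main(String[] args) {
        System.out.println(matchGreedy("abbc", 1, 3, "bc") + ", give-backs: " + backtracks); // true, give-backs: 1
        System.out.println(matchGreedy("abbc", 1, 3, "c") + ", give-backs: " + backtracks);  // true, give-backs: 0
    }
}
```

The counter only tracks characters that were consumed and later returned; a real engine keeps more state than this, so the numbers are meant to show the mechanism, not to predict actual cost.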

4. How to avoid the backtracking problem?

Since backtracking costs performance, how do we deal with it? If you look at the case above carefully, you will find that the greedy behavior of the NFA automaton is the trigger, and that it is closely tied to the matching mode of the regular expression. Let's go through the modes together.

4.1. Greedy mode (Greedy)

As the name implies, in quantity matching, if quantifiers such as +, ?, * or {min,max} are used alone, the regular expression will match as much content as possible.

For example, the example above:

text=“abbc” regex=“ab{1,3}c”

In greedy mode, the NFA automaton goes for the largest matching range, that is, it tries to match 3 b characters. With the text "abbc" the third attempt fails, which causes backtracking. If the text is "abbbc" instead, all three b characters match and the match succeeds directly.

text=“abbbc” regex=“ab{1,3}c”

4.2. Lazy mode (Reluctant)

In this mode, the regular expression repeats the matching characters as few times as possible. If it matches successfully, it continues to match the rest of the string.

For example, adding a "?" after the characters in the above example will enable lazy mode.

text=“abc” regex=“ab{1,3}?c”

The matching result is "abc". In this mode, the NFA automaton first selects the smallest matching range, that is, it matches 1 b character, so for this input the backtracking problem is avoided.
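The difference between the greedy and lazy quantifiers is easiest to see when nothing follows the quantifier, so that both forms can succeed but stop at different places. A minimal sketch, with LazyVsGreedyDemo and firstMatch() as names chosen for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedyDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: takes as many b characters as allowed (up to 3).
        System.out.println(firstMatch("ab{1,3}", "abbb"));  // prints abbb
        // Lazy (reluctant): stops after the minimum of 1 b.
        System.out.println(firstMatch("ab{1,3}?", "abbb")); // prints ab
        // With a required c after the quantifier, both forms match the same text.
        System.out.println(firstMatch("ab{1,3}?c", "abc")); // prints abc
    }
}
```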

4.3. Exclusive mode (Possessive)

Like the greedy mode, the exclusive mode matches as much content as possible. The difference is that in exclusive mode, if the match fails, matching simply ends and no backtracking takes place.

Still the above example, add a "+" after the character to enable the exclusive mode.

text=“abbc” regex=“ab{1,3}+bc”

The result is no match: matching ends, and no backtracking occurs. By now it should be clear that the ways to avoid this kind of backtracking are the lazy mode and the exclusive mode.

As for the question from the beginning, "why does a split() method affect TPS", you should now know the answer too.

I used the split() method to extract the domain name and to check whether the request parameters met the specification. When matching groups, split() ran into a large amount of backtracking on special characters. I solved the problem by appending a "+" to the quantifier after the character class to be matched, which makes it possessive:
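The example above can be reproduced with Pattern.matches(). The sketch below compares the greedy and possessive forms of the same expression on the same text; the class name PossessiveDemo is chosen for this example. It also shows that a possessive quantifier can change the result of a match, not only its cost, so the expression has to be written with that in mind.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Returns true if the whole text matches the regex.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Greedy: b{1,3} takes both b characters, then gives one back so that "bc" can match.
        System.out.println(fullMatch("ab{1,3}bc", "abbc"));  // prints true
        // Possessive: b{1,3}+ keeps both b characters, so "bc" can no longer match.
        System.out.println(fullMatch("ab{1,3}+bc", "abbc")); // prints false
        // Possessive still matches when nothing after it needs the consumed characters.
        System.out.println(fullMatch("ab{1,3}+c", "abbc"));  // prints true
    }
}
```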

\\?(([A-Za-z0-9-~_=%]++\\&{0,1})+)

5. Optimization of regular expressions

The performance problems brought about by regular expressions gave me a wake-up call, and here I also hope to share some experience with you. Any detail problem may lead to performance problems, and what is reflected behind this is that we do not have a thorough understanding of this technology. So I encourage you to learn performance tuning, master the methodology, and learn to see the essence through phenomena. Below I will summarize several regular expression optimization methods for you.

5.1. Use less greedy mode and more exclusive mode

Greedy mode can cause backtracking problems, we can use exclusive mode to avoid backtracking. I have explained it in detail before, so I will not explain it here.

5.2. Reduce branch selection

The regular expressions of the branch selection type "(X|Y|Z)" will reduce performance, and we should try to use them as little as possible during development. If it must be used, we can optimize it in the following ways:

First of all, we need to consider the order of selection, and put the more commonly used options in front so that they can be matched faster;

Second, we can try to extract common patterns. For example, replace "(abcd|abef)" with "ab(cd|ef)". This matches faster because the NFA automaton tries to match ab first, and if that is not found, it does not try any of the options;

Finally, for a simple branch selection, we can replace "(X|Y|Z)" with three indexOf calls. If you test it, you will find that three indexOf calls are somewhat more efficient than "(X|Y|Z)".
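The two rewrites above can be sketched as follows. The code only checks that the rewritten forms accept the same inputs as the originals; it makes no timing claims, and you would need your own benchmark for those. AlternationDemo and its method names are hypothetical, and X, Y and Z stand for whatever literal options your expression has.

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    static final Pattern UNFACTORED = Pattern.compile("(abcd|abef)");
    static final Pattern FACTORED = Pattern.compile("ab(cd|ef)"); // common prefix "ab" pulled out
    static final Pattern ANY_OF = Pattern.compile("(X|Y|Z)");

    static boolean matchesUnfactored(String s) {
        return UNFACTORED.matcher(s).matches();
    }

    static boolean matchesFactored(String s) {
        return FACTORED.matcher(s).matches();
    }

    // Regex version: does the text contain X, Y or Z anywhere?
    static boolean containsAnyRegex(String text) {
        return ANY_OF.matcher(text).find();
    }

    // Plain-string version of the same check, using three indexOf calls.
    static boolean containsAnyIndexOf(String text) {
        return text.indexOf("X") >= 0 || text.indexOf("Y") >= 0 || text.indexOf("Z") >= 0;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "abcd", "abef", "abce" }) {
            System.out.println(s + ": " + matchesUnfactored(s) + " / " + matchesFactored(s));
        }
        for (String s : new String[] { "xxYxx", "none" }) {
            System.out.println(s + ": " + containsAnyRegex(s) + " / " + containsAnyIndexOf(s));
        }
    }
}
```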

5.3. Reduce capture nesting

Before talking about this method, let me briefly introduce what is a capturing group and a non-capturing group.

A capture group saves the content matched by a subexpression of the regular expression into a numbered or explicitly named slot for later reference. Generally, each () is a capture group, and capture groups can be nested.

A non-capturing group is a group that takes part in matching but is not given a group number. It is written as (?:exp).

In regular expressions, each capture group has a number, and number 0 represents the entire matched content. We can look at the following example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDemo {
	public static void main(String[] args) {
		String text = "<input high=\"20\" weight=\"70\">test</input>";
		String reg = "(<input.*?>)(.*?)(</input>)";
		Pattern p = Pattern.compile(reg);
		Matcher m = p.matcher(text);
		while (m.find()) {
			System.out.println(m.group(0)); // the entire matched content
			System.out.println(m.group(1)); // (<input.*?>)
			System.out.println(m.group(2)); // (.*?)
			System.out.println(m.group(3)); // (</input>)
		}
	}
}

Output:

<input high="20" weight="70">test</input>
<input high="20" weight="70">
test
</input>

If you don't need to get the text in a group, then use a non-capturing group. For example, use "(?:X)" instead of "(X)", let's look at the following example again:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingGroupDemo {
	public static void main(String[] args) {
		String text = "<input high=\"20\" weight=\"70\">test</input>";
		String reg = "(?:<input.*?>)(.*?)(?:</input>)";
		Pattern p = Pattern.compile(reg);
		Matcher m = p.matcher(text);
		while (m.find()) {
			System.out.println(m.group(0)); // the entire matched content
			System.out.println(m.group(1)); // (.*?)
		}
	}
}

Output:

<input high="20" weight="70">test</input>
test

In summary, dropping the capture groups whose content you do not need can improve the performance of a regular expression.
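One way to confirm how many groups the engine has to record is Matcher.groupCount(), which returns the number of capture groups in the pattern (group 0 is not included in the count). A minimal sketch reusing the two expressions from the examples above; the class name GroupCountDemo and its helpers are chosen for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    static final String TEXT = "<input high=\"20\" weight=\"70\">test</input>";

    // Number of capture groups declared in the pattern (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher(TEXT).groupCount();
    }

    // Content of group 1 for the first match in TEXT, or null if nothing matches.
    static String firstGroup(String regex) {
        Matcher m = Pattern.compile(regex).matcher(TEXT);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(<input.*?>)(.*?)(</input>)"));     // prints 3
        System.out.println(groupCount("(?:<input.*?>)(.*?)(?:</input>)")); // prints 1
        System.out.println(firstGroup("(?:<input.*?>)(.*?)(?:</input>)")); // prints test
    }
}
```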

6. Summary

Although regular expressions are small, they have powerful matching functions. We often use it, for example, to verify the mobile phone number or email address on the registration page.

But because they are small, we often overlook their usage rules. Some special inputs are not covered by the test cases, and there are many cases of people getting caught out once the code goes live.

Based on my experience, if regular expressions make your code concise and convenient, you can use them as long as you do a proper performance check first; if not, avoid regular expressions wherever you can, so that they do not cause further performance problems.

Origin blog.csdn.net/qq_34272760/article/details/131898606