Master regular expressions - create efficient regular expressions

Table of contents

1. Typical examples

1. A slight modification – put your best foot forward

2. Efficiency vs Accuracy

3. Moving on – limit the scope of greediness

4. “Exponential” matching

2. Comprehensive investigation and review

1. Matching process of traditional NFA

2. POSIX NFA requires more processing

3. The work required when a match fails

4. Be more specific

5. Alternation can be expensive

3. Performance test

1. Test points

2. Understand what is being measured

3. MySQL test

4. Common optimization measures

1. Every gain must come with a loss.

2. Optimization varies

3. Application principles of regular expressions

4. Apply previous optimization measures

(1) Compilation cache

(2) Compilation cache in integrated processing

(3) Compilation cache in programmatic processing

(4) Compilation cache in object-oriented processing

(5) Pre-check required characters/substring optimization

(6) Length judgment optimization

5. Transmission optimizations

(1) String starting/line anchor optimization

(2) Implicit anchor point optimization

(3) String end/line anchor optimization

(4) Starting character/character group/substring recognition optimization

(5) Embedded text string inspection optimization

(6) Length identification transmission optimization

6. Optimize the regular expression itself

(1) Text string connection optimization

(2) Simplify quantifier optimization

(3) Eliminate unnecessary parentheses

(4) Eliminate unnecessary character groups

(5) Character optimization after lazy quantifiers

(6) "Excessive" backtesting

(7) Avoid exponential (super-linear) matching

(8) Use possessive quantifiers to reduce saved states

(9) Quantifier equivalent conversion

(10) Need identification

5. Tips to improve expression speed

1. Common sense optimization

(1) Avoid recompiling

(2) Use non-capturing parentheses

(3) Don’t abuse parentheses

(4) Do not abuse character groups

(5) Use starting anchor point

2. Expose literal text

(1) "Extract" necessary elements from the quantifier

(2) "Extract" the necessary elements at the beginning of the multi-selection structure

3. Isolate the anchor point

(1) Independently add ^ and \G in front of the expression

(2) Independently add $ at the end of the expression

4. Lazy or greedy? It depends on the situation

5. Split regular expressions

6. Simulate beginning character recognition

7. Use atomic grouping and possessive quantifiers

8. Lead the engine to a match

(1) Put the most likely alternative first

(2) Distribute the ending part into the alternation

9. Eliminate loops

(1) Method 1: Construct regular expressions based on experience

(2) Method 2: Top-down perspective

(3) Match host name

(4) Use atomic grouping and possessive quantifiers

(5) Simple example of eliminating loops

10. The free-flowing regular expression


        In general, the key to improving the efficiency of regular expressions lies in a thorough understanding of the process behind backtracking and mastering techniques to avoid possible backtracking.

1. Typical examples

        First, let’s look at an example that truly demonstrates the importance of backtracking and efficiency. The expression "(\\.|[^\\"])*" matches a double-quoted string that may contain escaped double quotes. This expression is not wrong, but on an NFA engine the alternation is applied to every character, which is very inefficient: for each "normal" (non-backslash, non-quote) character in the string, the engine first tests '\\.', fails, backtracks, and finally matches the character with '[^\\"]'. A few changes can speed up this match.
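
        As an illustration (using Python's re module, a backtracking traditional-NFA-style engine, as a stand-in for the tools in this article), the expression can be exercised like this:

```python
import re

# Pattern from the text: a double-quoted string that may contain
# backslash-escaped characters such as \".
pattern = re.compile(r'"(\\.|[^\\"])*"')

target = r'a "2\"x3\" likeness" here'   # subject containing escaped quotes
m = pattern.search(target)
print(m.group())   # "2\"x3\" likeness"
```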

1. A slight modification – put your best foot forward

        In a typical double-quoted string, ordinary characters outnumber escapes. A simple change is to swap the two alternatives, placing '[^\\"]' before '\\.'. Now the alternation backtracks only when an escape character is encountered in the string. The figure below (from the RegexBuddy tool) shows both expressions matching the target string "2\"x3\" likeness".

Figure 1: The effect of alternative order (traditional NFA)

        On the left is the matching process of "(\\.|[^\\"])*", which performs 32 tests and 14 backtracks in total. On the right is the matching process of "([^\\"]|\\.)*", which performs 22 tests and 4 backtracks. The two backslashes each cause one alternation backtrack, and the final double quote causes two more: the first because it fails to match the branch [^\\"], and the second when the star quantifier gives up. At that point every alternative has failed, so the alternation as a whole can no longer match. Each line in the figure represents one test (except line 22, which shows the position where the star quantifier backtracks and involves no test).

        We should think about this modification from the following two aspects:

  • Which engines benefit from this? Traditional NFA, or POSIX NFA, or both?
  • Under what circumstances would this modification bring the greatest benefit? When the text can be matched, when it cannot be matched, or all the time?

        First, this change has no effect on a POSIX NFA. Since it must ultimately try every possibility in the regular expression, the order of the alternatives does not matter. For a traditional NFA, however, this kind of alternative reordering is a win, because the engine stops as soon as it finds a match.

        Second, the change speeds things up only when the match succeeds. An NFA can fail only after all possibilities have been tried, so if there is truly no match, every possibility gets tried and the order has no effect.

        The following table lists the number of tests and backtracks performed in several cases, with lower numbers being better:

target string                   |  "(\\.|[^\\"])*"  |  "([^\\"]|\\.)*"  |     POSIX NFA
                                | (traditional NFA) | (traditional NFA) | (both expressions)
                                | tests | backtracks| tests | backtracks| tests | backtracks
"2\"x3\" likeness"              |   32  |    14     |   22  |     4     |   48  |    30
"makudonarudo"                  |   28  |    14     |   16  |     2     |   40  |    26
"very...99 more chars...long"   |  218  |   109     |  111  |     2     |  325  |   216
"No \"match\" here              |  124  |    86     |  124  |    86     |  124  |    86

Table 1

        The two expressions behave identically under a POSIX NFA, while the modification improves traditional NFA performance (backtracking is reduced). When there is no match (last row), both engines must try all possibilities, so the results are the same.

2. Efficiency vs Accuracy

        The most important consideration when modifying a regular expression for efficiency is whether the change affects the accuracy of the match. Reordering the alternatives as above is safe only if the match result does not depend on the order. Consider a flawed example: using "(\\.|[^"])*" to match the string 'You need a 2\"3\" photo'. After 46 tests and 21 backtracks, the entire string is matched.

        If you swap the alternatives to improve efficiency, putting [^"] first, it does help: only 4 backtracks and 27 tests in total. But the result is two matches: "You need a 2\" and " photo". The root cause is that the characters matched by the two branches overlap: both can match a backslash.

        Therefore, when focusing on efficiency, never forget accuracy. For alternations, it is best to make the branches mutually exclusive first, so that their order cannot affect the match result and accuracy is guaranteed; only then consider performance.

3. Moving on – limit the scope of greediness

        As Figure 1 shows, in either regular expression the star iterates once for every ordinary character, repeatedly entering and leaving the alternation and the parentheses. This has a cost, and the extra processing should be avoided if possible.

        Since [^\\"] matches the "normal" characters (non-quote, non-backslash), using [^\\"]+ lets one iteration of (...)* read in as many characters as possible. For strings without escapes, this reads the entire string in one pass: there is almost no backtracking, and the number of star iterations drops to a minimum.

        Figure 2 shows this applied on a traditional NFA. On the left is the matching process of "(\\.|[^\\"]+)*"; compared with "(\\.|[^\\"])*" in Figure 1, the alternation-related backtracks and star iterations are reduced. On the right is the matching process of "([^\\"]+|\\.)*": combined with the reordering technique, this modification brings even more benefit.

Figure 2: Result of adding plus sign (traditional NFA)

        The new plus sign greatly reduces both the alternation backtracks and the star iterations. The star quantifier governs the parenthesized subexpression, and every iteration must enter and leave the parentheses, which costs time because the engine must record the text matched by the subexpression inside them.
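
        For what it's worth, the modified pattern still matches the same quoted strings as the original; a quick check in Python (a traditional-NFA-style engine, where a successful match stays fast):

```python
import re

old = re.compile(r'"(\\.|[^\\"])*"')      # original: one character per iteration
new = re.compile(r'"([^\\"]+|\\.)*"')     # plus sign: a whole run per iteration

target = r'"2\"x3\" likeness"'
# Both expressions match exactly the same text.
print(old.match(target).group() == new.match(target).group())   # True
```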

4. “Exponential” matching

        For a POSIX NFA, however, adding the plus sign is a disaster waiting to happen. Using "([^\\"]+|\\.)*" to match "very...99 more chars...long" from Table 1 requires more than three billion trillion backtracks. Put simply, this happens because an element of the expression is quantified by the plus sign while also being quantified by the star outside the parentheses, and it is impossible to tell which quantifier should account for which character. This ambiguity is the crux.

        Without the plus sign, [^\\"] is governed by the star alone, and what ([^\\"])* can match is limited: it matches one character, then the next, and so on, up to every character in the target text. It may not end up matching all of them (causing backtracking), but at worst the number of characters matched is linear in the length of the target string. The longer the string, the more work may be required.

        For ([^\\"]+)*, however, the number of ways the plus and the star can divvy up the string grows exponentially. If the target string is makudonarudo, does the star iterate 12 times, with [^\\"]+ matching one character per iteration? Or does the star iterate 3 times, with the inner [^\\"]+ matching 5, 3, and 4 characters? Or 4 times, matching 2, 2, 5, and 3 characters? Or something else entirely?

        For a string of length 12 there are 4096 possibilities, two for each character in the string. A POSIX NFA must try them all before giving a result. This is the origin of "exponential" matching, also called "super-linear". Whatever the name, it means a great deal of backtracking: for a string of length n, the number of backtracks is 2^(n+1), and the number of individual tests is 2^(n+1)+2^n.

        The main difference between a POSIX NFA and a traditional NFA is that the traditional NFA stops at the first complete match. But if there is no complete match, even a traditional NFA must try every possibility before reporting failure. Even for a string as short as "No \"match\" here from Table 1, 8192 possibilities must be tried before failure can be reported.
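
        The counts quoted above can be checked with a line of arithmetic (taking n = 12, the length of makudonarudo, and the formulas given in the text):

```python
# Backtracks and individual tests for a length-n string that cannot
# match, per the formulas in the text: 2^(n+1) and 2^(n+1) + 2^n.
n = 12
print(2 ** (n + 1))             # 8192 backtracks
print(2 ** (n + 1) + 2 ** n)    # 12288 individual tests
```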

        While the regex engine is busy trying this huge number of possibilities, the whole program appears to "lock up." You can use regular expressions like this to identify the type of engine:

  • If one of the expressions can give a result quickly even if it cannot be matched, it may be DFA.
  • If the results are quickly produced only when a match can be made, that is a traditional NFA.
  • If it's always slow, it's POSIX NFA.

        The word "possible" is used in the first judgment because an advanced optimized NFA might be able to detect and avoid these exponentially neverending matches. Similarly, we will see various methods later to improve or rewrite these expressions to speed up their matching or error reporting.

        Excluding the effects of certain advanced optimizations, you can determine the engine type from the relative performance of regular expressions. The traditional NFA is the most widely used engine, and it is easy to identify. First, if lazy (ignore-priority) quantifiers are supported, it is almost certainly a traditional NFA: lazy quantifiers are not supported by DFAs and are meaningless in a POSIX NFA. To confirm, simply match the string 'nfa not' with the regular expression nfa|nfa not. If only 'nfa' matches, this is a traditional NFA; if the entire 'nfa not' matches, the engine is either a POSIX NFA or a DFA.

        MySQL's regular engine is traditional NFA:

mysql> set @r:='nfa|nfa not';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='nfa not';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+------+------+
| c    | s    |
+------+------+
|    1 | nfa  |
+------+------+
1 row in set (0.00 sec)
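
        The same check can be run in any traditional-NFA tool; for example, Python's re module (shown here purely as an illustration) behaves the same way:

```python
import re

# A traditional NFA tries alternatives left to right and stops at the
# first overall match, so the shorter first alternative wins.
m = re.match(r'nfa|nfa not', 'nfa not')
print(m.group())   # prints: nfa
```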

        In some cases the difference between a DFA and a POSIX NFA is obvious: a DFA supports neither capturing parentheses nor backreferences. This helps, but there are also hybrid systems that use both engines, choosing the DFA when capturing parentheses are not used.

        The following simple test is also revealing. Use X(.+)+X to match a string of the form =XX======================. If execution takes a very long time, it is an NFA; if the previous test says it is not a traditional NFA, it must be a POSIX NFA. If execution is fast, it is a DFA, or an NFA with certain advanced optimizations. If it aborts with a stack overflow or exits with a timeout, it is an NFA of one kind or the other.

mysql> set @r:='X(.+)+X';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='=XX======================';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
ERROR 3699 (HY000): Timeout exceeded in regular expression match.

2. Comprehensive investigation and review

        Let's start with an example, applying the regular expression ".*" to the following text:
The name "McDonald's" is said "makudonarudo" in Japanese

        The matching process is shown in Figure 3.

Figure 3: Successful matching process of ".*"

1. Matching process of traditional NFA

        The regular expression is tried starting at each character from the beginning of the string, but because the opening quote cannot match, each attempt fails immediately until the position of the first double quote is reached. The rest of the expression is then tried, but the transmission knows that if this attempt fails, the entire expression can be retried from the next position.

        Then .* matches to the end of the string, where the dot fails and the star stops iterating. Since .* is not required to match any particular number of characters, the engine has recorded 46 states to backtrack to during this process. Now that .* has stopped, the engine backtracks to the most recently saved state, at the end of the string, and tries to match the closing double quote there. It cannot, so the attempt fails again; the engine keeps backtracking and retrying, failing each time.

        The engine tries the saved states in reverse order (most recently saved first). After many attempts it reaches the position just after ...arudo, where the closing quote matches, yielding an overall match at this position:
"McDonald's" is said "makudonarudo"

        This is the traditional NFA's matching process. The remaining unused states are discarded and the successful match is reported.
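
        The end result can be reproduced in any traditional NFA; for instance, in Python (used here only as an illustration):

```python
import re

text = 'The name "McDonald\'s" is said "makudonarudo" in Japanese'

# Greedy ".*" first races to the end of the string, then backtracks
# until the final '"' can match, so the match runs from the first
# quote all the way to the last one.
m = re.search(r'".*"', text)
print(m.group())   # "McDonald's" is said "makudonarudo"
```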

2. POSIX NFA requires more processing

        A POSIX NFA must return the "longest-leftmost" match, so all the saved states still need to be tried to see whether a longer match exists. In this example the first match found is in fact the longest, but the regex engine has to confirm that.

3. The work required when a match fails

        It is also worth analyzing what happens when a match is impossible. ".*"! cannot match the example text, yet it still does a great deal of work in the attempt, as Figure 4 illustrates.

Figure 4: ".*"! Matching failure process

        The sequence of attempts in Figure 4 is what both traditional and POSIX NFAs must go through: when there is no match, a traditional NFA makes as many attempts as a POSIX NFA. Since none of the attempts from the starting point A to the ending point I matches, the transmission must drive a new round of attempts. The attempts starting at J, Q, and V look promising, but they end the same way as the attempt at A. On reaching Y there is nothing left to try, so the overall match fails. As Figure 4 shows, a lot of work went into reaching this result.

4. Be more specific

        For comparison, replace the dot with [^"]. With "[^"]*"!, the text matched by [^"]* cannot include a double quote, which reduces both matching and backtracking. Figure 5 illustrates the failed attempt.

Figure 5: "[^"]*"! Unable to match

        As the figure shows, the number of backtracks is greatly reduced. If this expression still meets the requirements, the reduced backtracking is a welcome side effect.
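
        A small Python illustration of the difference (again, as a stand-in for the tools above):

```python
import re

text = 'The name "McDonald\'s" is said "makudonarudo" in Japanese'

# [^"]* cannot cross a double quote, so far less text is consumed and
# far less backtracking occurs than with ".*".
print(re.search(r'"[^"]*"', text).group())   # "McDonald's"
print(re.search(r'"[^"]*"!', text))          # None: no quote is followed by '!'
```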

5. Alternation can be expensive

        Alternation is probably the main source of backtracking. Compare u|v|w|x|y|z and [uvwxyz] against the following string:
The name "McDonald's" is said "makudonarudo" in Japanese

        Efficiency varies between implementations, but in general a character class is faster than the corresponding alternation. A character class is usually a single, simple test, so [uvwxyz] needs only 34 attempts to find a match.

        With u|v|w|x|y|z, six backtracks are needed at each position, 204 backtracks in total, to reach the same result. Of course, not every alternation can be replaced by a character class, and even when it can, the replacement may not be so simple. However, certain techniques can greatly reduce the backtracking that alternation requires.
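
        The two patterns describe the same set of characters and so find the same matches, which is easy to confirm (Python used here for illustration):

```python
import re

text = 'The name "McDonald\'s" is said "makudonarudo" in Japanese'

# Same language, different cost: the class is tested as a single unit,
# while the alternation is tried branch by branch at each position.
assert re.findall(r'u|v|w|x|y|z', text) == re.findall(r'[uvwxyz]', text)
print(re.findall(r'[uvwxyz]', text))
```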

        Understanding backtracking is probably the most important issue in learning NFA efficiency, but there's more to it than that. Regular engine optimization measures can greatly improve efficiency.

3. Performance test

1. Test points

        The basic performance test records the program's running time: take the system time, run the code, take the system time again, and the difference between the two is the running time. For example, compare ^(a|b|c|d|e|f|g)+$ with ^[a-g]+$. Here is a simple Perl program:

use Time::HiRes 'time';        # so that time() returns a more precise value
$StartTime = time();
"abababdedfg" =~ m/^(a|b|c|d|e|f|g)+$/;
$EndTime = time();
printf("Alternation takes %.3f seconds. \n", $EndTime - $StartTime);

$StartTime = time();
"abababdedfg" =~ m/^[a-g]+$/;
$EndTime = time();
printf("Character class takes %.3f seconds. \n", $EndTime - $StartTime);

        It's simple, but there are a few things to keep in mind when doing performance testing:

  • Only "really interesting" processing times are logged. Record "processing" time as accurately as possible and avoid the impact of "non-processing time" as much as possible. If initialization or other preparations must be performed before starting, start timing after them; if finishing work is required, perform these tasks after timing stops.
  • Do "enough" processing. Often, the time required for testing is quite short, and computer clocks are not precise enough to give meaningful values.

        Running this Perl program on my machine, the result is:
Alternation takes 0.000 seconds. 
Character class takes 0.000 seconds. 

        If the program runs too quickly, run it many times to do "enough" work. How much is "enough" depends on the precision of the system clock. Most systems are accurate to 1/100 s, so a run that takes even 0.5 s yields a meaningful result.

  • Perform "accurate" processing. Performing 10 million quick operations requires accumulating a 10 million counter in the block of code responsible for timing. If possible, the best approach is to increase the proportion of real processing parts without adding additional overhead. In the Perl example, the regular expression is applied to a rather short text: if it were applied to a much longer string, there would be more "real" processing done in each loop.

        Taking these factors into account, the following procedure can be derived:

use Time::HiRes 'time';                    # so that time() returns a more precise value
$TimesToDo = 1000;                         # number of repetitions
$TestString = "abababdedfg" x 1000;        # build a long test string

$Count = $TimesToDo;
$StartTime = time();
while ($Count-- > 0) {
    $TestString =~ m/^(a|b|c|d|e|f|g)+$/;
}
$EndTime = time();
printf("Alternation takes %.3f seconds. \n", $EndTime - $StartTime);

$Count = $TimesToDo;
$StartTime = time();
while ($Count-- > 0) {
    $TestString =~ m/^[a-g]+$/;
}
$EndTime = time();
printf("Character class takes %.3f seconds. \n", $EndTime - $StartTime);

        $TestString and $Count are initialized before timing begins ($TestString is built with Perl's x operator, which repeats the string on its left the number of times given on its right). On my machine, under Perl 5.10.1, the result is:
Alternation takes 1.473 seconds. 
Character class takes 0.012 seconds. 

        So, for this example, the character class is about 123 times faster than the alternation. A test like this should be run several times, taking the shortest time, to reduce the influence of background system activity.
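
        The same benchmark can be sketched in other languages; here is a Python version using timeit (an illustration only; absolute times and ratios will differ from the Perl figures above):

```python
import re
import timeit

test_string = "abababdedfg" * 1000          # long string: more "real" work per match

alt = re.compile(r'^(a|b|c|d|e|f|g)+$')     # alternation with capturing parentheses
cls = re.compile(r'^[a-g]+$')               # equivalent character class

# Compile outside the timed code so that only matching is measured.
t_alt = timeit.timeit(lambda: alt.match(test_string), number=20)
t_cls = timeit.timeit(lambda: cls.match(test_string), number=20)
print(f"Alternation:     {t_alt:.3f} s")
print(f"Character class: {t_cls:.3f} s")
```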

2. Understand what is being measured

        Changing the initializer to the following gives more interesting results:
$TimesToDo = 1000000;
$TestString = "abababdedfg";

        Now, the test string is only 1/1000 of the length above, and the test needs to be run 1,000,000 times. The total number of characters tested and matched per regex does not change, so in theory the "workload" should not change. But the results are quite different:
Alternation takes 1.863 seconds. 
Character class takes 0.247 seconds. 

        Both times are longer than before. The reason is new "non-processing" overhead: checking and updating $Count, and setting up the regex engine, now happen 1000 times more often than before.

        For the character-class test the added overhead is about 0.24 seconds, while the alternation test gains about 0.39 seconds. The alternation's larger increase is mainly due to the capturing parentheses, which require extra processing before and after each application of the regex, and that processing now also happens 1000 times more often.

3. MySQL test

        Below is a test stored procedure on MySQL 8.0.16.

delimiter //
create procedure sp_test_regexp(s varchar(100), r varchar(100), c int)
begin
    set @s:=1;
    set @s1:='';
    while @s<=c do
        set @s1:=concat(@s1,s);                
        set @s:=@s+1;
    end while;
    select now(3) into @startts;
    select regexp_like(@s1, r, 'c') into @ret;
    select now(3) into @endts;
    select @ret,@startts,@endts,timestampdiff(microsecond,@startts,@endts)/1000 diff_ts;
end;
//
delimiter ;

        Initialize strings and regular expressions:

set @str:='abababdedfg';
set @reg1:='^(a|b|c|d|e|f|g)+$';
set @reg2:='^[a-g]+$';

        The first test uses 1,000 repetitions of the string; the results are as follows:

mysql> call sp_test_regexp(@str, @reg1, 1000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-10 15:22:44.921 | 2023-07-10 15:22:44.922 |  1.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (0.01 sec)

Query OK, 0 rows affected (0.01 sec)

mysql> call sp_test_regexp(@str, @reg2, 1000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-10 15:22:44.933 | 2023-07-10 15:22:44.933 |  0.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (0.01 sec)

Query OK, 0 rows affected (0.01 sec)

        The alternation took 1 millisecond and the character class 0 milliseconds. The contrast is not yet obvious, so increase the repetition count. The second test uses 10,000 repetitions; the results are as follows:

mysql> call sp_test_regexp(@str, @reg1, 10000);
ERROR 3699 (HY000): Timeout exceeded in regular expression match.
mysql> call sp_test_regexp(@str, @reg2, 10000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-10 15:25:25.918 | 2023-07-10 15:25:25.919 |  1.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (0.28 sec)

Query OK, 0 rows affected (0.28 sec)

        The alternation run failed with an error, while the character class took 1 millisecond. The error occurs because the match exceeded the limit set by the system variable regexp_time_limit, which caps the number of steps the match engine may execute and thus indirectly bounds execution time (typically on the order of milliseconds). The default value is 32. Increase it to continue testing.

        Third test:

mysql> set global regexp_time_limit=3200;
Query OK, 0 rows affected (0.00 sec)

mysql> call sp_test_regexp(@str, @reg1, 10000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-10 15:33:47.033 | 2023-07-10 15:33:47.046 | 13.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (0.29 sec)

Query OK, 0 rows affected (0.29 sec)

mysql> call sp_test_regexp(@str, @reg2, 10000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-10 15:33:47.318 | 2023-07-10 15:33:47.319 |  1.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (0.27 sec)

        This time, the alternation took 13 milliseconds and the character class 1 millisecond, a 13x difference. Continue testing with a larger repetition count.

        Fourth test:

mysql> call sp_test_regexp(@str, @reg1, 100000);
ERROR 3698 (HY000): Overflow in the regular expression backtrack stack.
mysql> call sp_test_regexp(@str, @reg2, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-10 15:36:42.548 | 2023-07-10 15:36:42.559 | 11.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (27.13 sec)

        The alternation run failed again, while the character class took 11 milliseconds. This time the limit exceeded is the system variable regexp_stack_limit, the maximum memory in bytes available for the backtracking stack when regexp_like() and similar regular expression functions perform a match. The default value is 8000000. Increase it to continue testing.

        Fifth test:

mysql> set global regexp_stack_limit=800000000;
Query OK, 0 rows affected (0.00 sec)

mysql> call sp_test_regexp(@str, @reg1, 100000);
+------+-------------------------+-------------------------+----------+
| @ret | @startts                | @endts                  | diff_ts  |
+------+-------------------------+-------------------------+----------+
|    1 | 2023-07-10 15:41:48.045 | 2023-07-10 15:41:48.190 | 145.0000 |
+------+-------------------------+-------------------------+----------+
1 row in set (27.24 sec)

Query OK, 0 rows affected (27.24 sec)

mysql> call sp_test_regexp(@str, @reg2, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-10 15:42:15.307 | 2023-07-10 15:42:15.317 | 10.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (27.12 sec)

        This time, the alternation took 145 milliseconds and the character class 10 milliseconds, a difference of 14.5 times.

4. Common optimization measures

        There are usually two ways to optimize regular expression implementation:

  • Speed up certain operations. Some match types, such as \d+, are so common that the engine may have special handling for them that runs faster than the general-purpose mechanism.
  • Avoid redundant operations. If the engine can determine that some work is unnecessary for a correct result, or that some work can be applied to less text than before, skipping it saves time. For example, a regular expression beginning with \A can match only at the start of the string; if it fails there, the transmission will not fruitlessly try other positions.
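
        As a small illustration of the second point (Python used as a stand-in; whether the engine actually short-circuits is an internal optimization, but the observable result is the same):

```python
import re

# A pattern anchored with \A can match only at the start of the string,
# so an optimizing engine can give up immediately if position 0 fails,
# instead of bumping along to every later position.
anchored = re.compile(r'\Aabc')
print(bool(anchored.search('abcdef')))   # True: matches at position 0
print(bool(anchored.search('xxabc')))    # False: 'abc' at position 2 is ruled out
```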

1. Every gain must come with a loss.

        There is a trade-off between the time an optimization takes, the time it saves, and, more importantly, the likelihood that it applies at all. An optimization pays off only if detecting whether it applies costs less than the matching time it saves.

        Let’s look at an example. The expression \b\B (a position that is both a word boundary and not a word boundary) can never match. If the engine notices that the supplied expression contains \b\B, it knows the entire expression can never match, performs no matching at all, and immediately reports failure. If the target text is very long, the time saved could be significant.

        However, no regex engine actually performs this optimization, because checking for \b\B in advance costs extra work for every expression that does not contain it. While the check could save a lot of time in rare cases, in the common case the added cost far outweighs the benefit. In fact, \b\B is useful precisely because it never matches: inserting it can force part of a regular expression to fail. For example, inserting \b\B into this|this other guarantees that the first alternative fails.

mysql> select regexp_substr('this this other','this|this other');
+----------------------------------------------------+
| regexp_substr('this this other','this|this other') |
+----------------------------------------------------+
| this                                               |
+----------------------------------------------------+
1 row in set (0.00 sec)

mysql> select regexp_substr('this this other','this\\b\\B|this other');
+----------------------------------------------------------+
| regexp_substr('this this other','this\\b\\B|this other') |
+----------------------------------------------------------+
| this other                                               |
+----------------------------------------------------------+
1 row in set (0.00 sec)

2. Optimization varies

        When applying optimizations, remember that "your mileage may vary": different engines optimize in different ways, and a small change to a regular expression may yield a large speedup in one implementation and a large slowdown in another.

3. Application principles of regular expressions

        The process of applying a regular expression to a target string can be broken roughly into the following steps:
(1) Regular expression compilation: Check the regex's syntax and, if it is correct, compile it into an internal form.
(2) Transmission start: The transmission "positions" the regex engine at the starting position of the target string.
(3) Element detection: The engine begins testing the regex against the text, trying each element (component) of the regex in turn.

  • Literal elements in sequence, such as the S, u, b, j, e, etc. of Subject, are tried one after another, stopping only when one of them fails to match.
  • For elements governed by quantifiers, control alternates between the quantifier (checking whether it should try to match again) and the quantified element (testing whether it can match).
  • Transferring control into and out of capturing parentheses involves some overhead: the text matched inside the parentheses must be preserved so it can be referenced via $1, and because a pair of parentheses may belong to a branch that gets backtracked over, their state is part of the backtracking state and must be updated on entry and exit.

(4) Finding a match: If a match is found, a traditional NFA "locks in" the current state and reports success. A POSIX NFA instead remembers the match if it is the longest seen so far, and continues from any remaining saved states; only after all saved states are exhausted is the longest match returned.
(5) Driving by the transmission: If no match is found, the transmission drives the engine to begin a new attempt starting at the next character of the text (back to step 3).
(6) Overall match failure: If attempts starting from every position in the target string (including the position after the last character) fail, overall match failure is reported.
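
        Steps (2) through (6) describe a simple driver loop. The Python sketch below is illustrative only (the helper name is hypothetical), using Python's re engine to stand in for the per-position element tests of step (3):

```python
import re

def find_first_match(pattern, text):
    """Hypothetical helper sketching the 'transmission' loop above."""
    engine = re.compile(pattern)            # step (1): compile once
    # Steps (2)/(5): position the engine at each start offset in turn,
    # including the position just past the last character.
    for start in range(len(text) + 1):
        m = engine.match(text, start)       # step (3): test elements here
        if m:
            return m                        # step (4): lock in and report
    return None                             # step (6): overall failure

m = find_first_match(r'Date: (.*)', 'Header Date: 2023-07-11')
```

A real transmission uses the optimizations described later to skip hopeless start positions rather than trying every one.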

        The following sections explain how advanced implementations can reduce this processing and how to apply these techniques.

4. Apply previous optimization measures

        A good regular engine implementation can optimize the regular expression before it is actually used. It can sometimes even quickly determine that a certain regular expression cannot be matched anyway, so there is no need to apply this expression at all.

(1) Compilation cache

        The first thing done before a regular expression is used is a syntax check; if the expression is well formed, it is compiled into an internal form, which can then be applied to any number of strings. But what about a program like the following?

while (...) {
   if ($line =~ m/^\s*$/) ...
   if ($line =~ m/^Subject: (.*)/) ...
   if ($line =~ m/^Date: (.*)/) ...
   if ($line =~ m/^Reply-To: (\S+)/) ...
   if ($line =~ m/^From: (\S+) \(([^()]*)\)/) ...
}

        Obviously, recompiling all the regular expressions on every pass through the loop is a waste of time. Instead, saving, or caching, the internal forms produced by the first compilation and reusing them in later iterations clearly improves speed, at the cost of some memory. How this is done depends on the regex processing style the application provides, of which there are three: integrated, programmatic, and object-oriented.

        MySQL's regular expression support is integrated: matching is provided through built-in functions. A regex used in a stored procedure or user-defined function is precompiled, and when accessing the database from a program, for example via Java, MySQL JDBC prepared statements can provide precompilation.

(2) Compilation cache in integrated processing

        Perl and awk use the integrated style, which makes compilation caching easy: internally, each regular expression is associated with a particular point in the code. On first execution the compiled result is tied to that point; on later executions it need only be referenced. This saves the most time possible, at the cost of the memory needed to hold the cached expressions.

        Variable interpolation (using a variable's value as part of the regular expression, analogous to dynamic SQL in MySQL) can complicate caching. In m/^Subject: \Q$DesiredSubject\E\s*$/, for example, the regex text can differ on every loop iteration because it depends on an interpolated variable whose value may change. If it differs every time, the regex must be recompiled every time and caching is useless. A compromise is to check the interpolated result (the actual regex text) and recompile only when that value has changed.

(3) Compilation cache in programmatic processing

        In integrated processing, a regular expression's use is tied to its location in the program, so when that code executes again the compiled form can be fetched from the cache and reused. Programmatic processing, however, offers only a general "apply this expression" function: the compiled form is not bound to any program location, so in principle the regex must be recompiled on every call. In practice nothing prevents caching here either; the usual optimization saves the most recently used patterns along with their compiled forms.

        When the "apply this expression" function is called, the pattern passed as an argument is compared with the saved patterns; if it is in the cache, the cached compiled form is used. If not, the pattern is compiled and added to the cache, possibly displacing an older entry. When the cache is full, the entry thrown out is usually the one unused for the longest time.

        The GNU Emacs cache holds up to 20 regular expressions, Tcl's holds 30, and PHP's holds more than four thousand. The .NET Framework holds 15 by default, but the number can be changed at runtime, or the feature disabled.
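
        The caching scheme just described can be sketched as a small least-recently-used table. The Python below is illustrative only (the class name is hypothetical), not any engine's real implementation; CPython's own re module keeps just such an internal cache of compiled patterns.

```python
import re
from collections import OrderedDict

class RegexCache:
    """Hypothetical sketch of a programmatic-processing compile cache."""
    def __init__(self, capacity=20):            # GNU-Emacs-sized: 20 slots
        self.capacity = capacity
        self._cache = OrderedDict()             # pattern text -> compiled form

    def search(self, pattern, text):
        compiled = self._cache.get(pattern)
        if compiled is None:
            compiled = re.compile(pattern)      # cache miss: compile now
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False) # evict least recently used
            self._cache[pattern] = compiled
        else:
            self._cache.move_to_end(pattern)    # mark as most recently used
        return compiled.search(text)

cache = RegexCache(capacity=2)
```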

(4) Compilation cache in object-oriented processing

        In object-oriented processing, the programmer decides exactly when a regular expression is compiled. Compilation is requested explicitly through constructors such as new Regex(...), re.compile(...) and Pattern.compile(...) (in .NET, Python, and java.util.regex respectively). Compilation happens before the regex is actually used, but it can be done well in advance, perhaps before a loop or during program initialization, after which the compiled object can be used freely.

        In object-oriented processing the programmer also controls when compiled regular expressions are discarded, via object destruction (or by letting objects go out of scope). Promptly discarding compiled forms that are no longer needed saves memory.
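
        In Python, for instance, the object-oriented style means compiling once with re.compile before the loop and reusing the object inside it:

```python
import re

# Compile once, outside the loop; reuse the compiled object each iteration.
subject_re = re.compile(r'^Subject: (.*)')

lines = ['Subject: regex', 'Date: today', 'Subject: speed']
subjects = []
for line in lines:
    m = subject_re.match(line)
    if m:
        subjects.append(m.group(1))
```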

(5) Pre-check required characters/substring optimization

        Compared with a full application of a regular expression, searching a string for a particular character or substring is a much "lighter" operation, so some systems do extra analysis at compile time to determine whether there is a character or string that must be present in any successful match. Before actually applying the regex, a quick scan of the target string checks for that required character or string; if it is absent, no attempt need be made at all.

        For example, the 'Subject:' in ^Subject: (.*) is required. A program can scan the whole string for it, perhaps with the Boyer-Moore search algorithm (a fast string-search algorithm that gets more efficient as the sought string gets longer); even a simple character-by-character check improves efficiency. Choosing a character less likely to occur in the target (such as the ':' after the 't' of 'Subject: ') improves it further.

        The regex engine must recognize that part of ^Subject: (.*) is a fixed literal string. Recognizing that any match of this|that|other requires 'th' takes more effort, and most engines do not do it. The answer is not black and white: an implementation might not recognize that 'th' is required, yet still recognize that 'h' and 't' each are, and so be able to check for at least one character.

        The required characters and strings that implementations can recognize vary greatly. Many are hampered by alternation; in such systems th(is|at) performs better than this|that.
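
        Where an engine lacks this optimization, it can be simulated by hand. A minimal Python sketch (the function name is hypothetical): a plain substring scan is far cheaper than the full regex machinery, so bail out early when the required 'Subject: ' cannot be present.

```python
import re

subject_re = re.compile(r'^Subject: (.*)')

def match_with_precheck(text):
    # Fast literal scan first; the required substring must be present
    # in any successful match, so its absence means certain failure.
    if 'Subject: ' not in text:
        return None
    return subject_re.match(text)
```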

(6) Length judgment optimization

        The length of text that ^Subject: (.*) can match is not fixed, but every match must contain at least 9 characters, so if the target string is shorter than 9 characters there is no need to try at all. The longer the required length, the more this optimization helps; for example, :\d{79}: requires at least 81 characters.
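
        The same check can be hand-rolled; a minimal Python sketch (the function name is hypothetical):

```python
import re

subject_re = re.compile(r'^Subject: (.*)')
MIN_LEN = 9   # 'Subject:' plus the space: no match can be shorter

def match_with_length_check(text):
    # Skip the engine entirely when the target is too short to match.
    if len(text) < MIN_LEN:
        return None
    return subject_re.match(text)
```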

5. Optimization via transmission

        Even when the regex engine cannot predict whether a string will match, these optimizations reduce the number of positions at which the transmission must actually apply the regex.

(1) String starting/line anchor optimization

        This optimization reasons that a regular expression beginning with ^ can match only where ^ itself can, so the expression need be applied only at those positions. Any implementation using it must recognize that ^(this|that) likewise can match only where ^ can; many implementations, however, cannot recognize the same about ^this|^that, so writing ^(this|that) or ^(?:this|that) instead can improve matching speed.

        The same optimization measures are also valid for \A and, if the match occurs multiple times, for \G.

(2) Implicit anchor point optimization

        Engines that can use this optimization know that if a regular expression begins with .* or .+ and does not have a global alternation, it can be assumed that there is an invisible ^ at the beginning of the regular expression. In this way, the "string start/line anchor optimization" in the previous section can be applied, saving a lot of time.

        Smarter systems realize the same optimization applies even when the leading .* or .+ sits inside parentheses, but capturing parentheses demand care. For example, (.+)X\1 is meant to match strings with identical text on either side of an 'X'; adding ^ would prevent it from matching '1234X2345', where the actual match is '234X234':

mysql> select regexp_substr('1234X2345','(.+)X\\1');
+---------------------------------------+
| regexp_substr('1234X2345','(.+)X\\1') |
+---------------------------------------+
| 234X234                               |
+---------------------------------------+
1 row in set (0.00 sec)

(3) String end/line anchor optimization

        When a regular expression ends with $ or another end anchor (\Z, \z, etc.), this optimization lets matching begin only within the last few characters of the string. For example, regex(es)?$ can match only within the final 8 characters, so the transmission can jump straight to that region, skipping a possibly large part of the target string.

        We say 8 characters here, not 7, because in many flavors $ can also match just before a newline at the end of the string.

(4) Starting character/character group/substring recognition optimization

        This is a more general version of the "pre-check required character/substring" optimization. It uses the same information (that any match must begin with a specific character or literal substring) to let the transmission do a fast substring check and apply the regex only at plausible positions in the string. For example, this|that|other can begin a match only at a position holding [ot], so the transmission pre-checks each character and applies the expression only where a match is possible, which can save a great deal of time. The longer the substring that can be pre-checked, the fewer the "false starts".

(5) Embedded text string inspection optimization

        This is similar to the starting-string recognition optimization but more advanced, targeting literal strings that occur at a fixed position within the match. If the regex is \b(perl|java)\.regex\.info\b, then '.regex.info' must appear in any match, so a smart transmission can use the high-speed Boyer-Moore string-search algorithm to find '.regex.info', then back up four characters and begin actually applying the regex there.

        In general, this optimization works only when the embedded literal string lies a fixed distance from the start of any match. It therefore cannot be used for \b(vb|java)\.regex\.info\b: the expression contains a literal string, but its distance from the start of the matched text is not fixed (2 or 4 characters). Nor can it be used for \b(\w+)\.regex\.info\b, since (\w+) may match any number of characters.
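
        The fixed-offset case can also be simulated by hand. In the Python sketch below (illustrative; the function name is hypothetical), a fast substring search finds the embedded literal, the position is backed up the fixed four characters, and only then does the real engine run:

```python
import re

full_re = re.compile(r'\b(perl|java)\.regex\.info\b')

def search_with_embedded_literal(text):
    # '.regex.info' is always 4 characters into any match ('perl' and
    # 'java' are both 4 long), so find it first, then back up 4.
    pos = text.find('.regex.info')
    while pos != -1:
        m = full_re.match(text, max(0, pos - 4))
        if m:
            return m
        pos = text.find('.regex.info', pos + 1)
    return None
```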

(6) Length identification transmission optimization

        This optimization pairs directly with the "length judgment" optimization above: when fewer characters remain after the current position than the minimum needed for a successful match, the transmission abandons further attempts.

6. Optimize the regular expression itself

(1) Text string connection optimization

        Perhaps the most basic optimization is for the engine to treat abc as one element rather than the three elements "a, then b, then c". The whole string can then be matched in a single iteration of the match loop instead of three.

(2) Simplify quantifier optimization

        When a plus, star, or other quantifier governs a simple element, such as a literal character or a character group, the engine can usually be optimized to avoid much of the step-by-step overhead of a general NFA. The main loop inside a regex engine must be general, able to handle every construct the engine supports, and in programming "general" means slow; so this optimization handles simple quantifier-operand combinations such as .* with a dedicated, fast handler, short-circuiting the general engine for those constructs.

        For example, .* and (?:.)* are logically identical, but on systems with this optimization .* is measurably faster. In java.util.regex the improvement is about 10%; in Ruby and .NET about 2.5 times; in Python about 50 times; in PHP/PCRE about 150 times. In Perl, thanks to the optimization described in the next section, the two are equally fast.

        In MySQL, the performance is improved by about 4 times:

mysql> set @str:='abababdedfg';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg1:='.*';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg2:='(?:.)*';
Query OK, 0 rows affected (0.00 sec)

mysql> call sp_test_regexp(@str, @reg1, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-11 15:10:00.241 | 2023-07-11 15:10:00.252 | 11.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (27.15 sec)

Query OK, 0 rows affected (27.15 sec)

mysql> call sp_test_regexp(@str, @reg2, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-11 15:10:34.690 | 2023-07-11 15:10:34.731 | 41.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (27.74 sec)

(3) Eliminate unnecessary parentheses

        If an implementation considers (?:.)* and .* to be exactly equivalent, it will replace the former with the latter.

(4) Eliminate unnecessary character groups

        A character group containing only a single character is redundant: it gives none of a character group's power yet still incurs character-group processing. Smart implementations therefore convert [.] to \. internally.

(5) Ignore character optimization after priority quantifier

        Lazy ("ignore priority") quantifiers, such as the *? in "(.*?)", force the engine to alternate between the quantified object (the dot) and whatever follows the closing quote. For several reasons, lazy quantifiers are generally slower than greedy ("match priority") ones, especially when the greedy construct benefits from the "simplify quantifier" optimization above. Another reason is that when a lazy quantifier lies inside capturing parentheses, control must repeatedly switch into and out of the parentheses, adding overhead.

        The principle of this optimization is that if a literal character follows the lazy quantifier, then until the engine reaches that literal character, the lazy quantifier can be treated like an ordinary greedy one. Implementations with this optimization therefore switch, in this case, to a special lazy quantifier that quickly scans the target text for that literal character, skipping the usual intermediate "lazy" states until it is reached.

mysql> set @str:='"aba"bab"dedfg';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg1:='"(.*)"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg2:='"(.*?)"';
Query OK, 0 rows affected (0.00 sec)

mysql> call sp_test_regexp(@str, @reg1, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-11 15:47:52.673 | 2023-07-11 15:47:52.687 | 14.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (34.25 sec)

Query OK, 0 rows affected (34.25 sec)

mysql> call sp_test_regexp(@str, @reg2, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-11 15:48:28.258 | 2023-07-11 15:48:28.270 | 12.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (35.58 sec)

        This optimization takes various other forms, such as pre-checking for a class of characters rather than a single one, for example checking for ['"] when applying ['"](.*?)["']. This resembles the starting-character recognition optimization introduced earlier.

(6) "Excessive" backtracking

        Structures such as (.+)* that nest quantifiers can produce exponential backtracking. A simple way to avoid this is to cap the number of backtracks and abandon the match when the cap is exceeded. This helps in some practical situations, but it also imposes an artificial limit on the text a regex can be applied to.

        For example, with a cap of 10,000 backtracks, .*? cannot match text longer than 10,000 characters, since every matched character costs a backtrack. Such texts are not uncommon, especially when processing Web pages, so the limit can be a serious problem.

        For different reasons, some implementations cap the size of the backtracking stack (Python, for example, at 10,000), that is, the number of states that can be saved at once. Like a backtrack cap, this limits the length of text the regex can handle.

        In the "MySQL Testing" section, we have seen the default values, effects, and changes of the two related MySQL configuration parameters.

(7) Avoid exponential (super-linear) matching

        A better way to avoid exponential matching is to detect when a match attempt turns superlinear: at some extra cost, record where each quantifier's subexpression has already been tried, and bypass repeated attempts.

        In fact, superlinear behavior is easy to detect once under way: a single quantifier should not iterate more times than there are characters in the target string, and if it does, an exponential match is certainly in progress. If this clue alone does not show that the match must fail, detecting and eliminating the redundant attempts is a harder problem, but given how much time redundant attempts can waste, it may well be worthwhile.

        One side effect of detecting superlinear matches and reporting failure quickly is that genuinely inefficient regular expressions stop looking inefficient: even with this optimization the time taken is far higher than truly necessary, yet no longer slow enough for users to notice.

        Of course, overall the advantages may outweigh the disadvantages. Many people don't care about the efficiency of regular expressions. They have a fear of regular expressions and just want to complete the task without caring about how to complete it.

(8) Use possessive quantifiers to reduce states

        When an object governed by a normal quantifier finishes matching, it leaves behind one "can skip this iteration" state per iteration. Possessive quantifiers retain none of these. There are two ways to achieve that: discard all the accumulated states once the quantifier completes, or, more efficiently, discard the previous iteration's state during each new iteration (one state must always be kept while matching so the engine can continue if the quantifier can match no further).

        Discarding states on the fly during iteration is more efficient because it uses less memory: applying .* creates a state per matched character, which can consume a great deal of memory on a long string.

(9) Quantifier equivalent conversion

        Is there an efficiency difference between \d\d\d\d and \d{4}? With an NFA the answer is almost certainly yes, though it varies by tool. If quantifiers are optimized, \d{4} is faster, unless the quantifier-free form triggers some still better optimization. In MySQL, \d{4} is about 25% faster:

mysql> set @str:='1234';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg1:='\\d\\d\\d\\d';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg2:='\\d{4}';
Query OK, 0 rows affected (0.00 sec)

mysql> call sp_test_regexp(@str, @reg1, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-11 17:26:35.087 | 2023-07-11 17:26:35.091 |  4.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (7.09 sec)

Query OK, 0 rows affected (7.09 sec)

mysql> call sp_test_regexp(@str, @reg2, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-11 17:26:42.168 | 2023-07-11 17:26:42.171 |  3.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (7.07 sec)

        Now compare ==== with ={4}. Here the repeated item is a literal character, and an engine can more readily treat ==== directly as a literal string; if it does, the efficient "starting character/character group/substring recognition" optimizations can come into play. That is exactly the case in Python and Java, where ==== is about 100 times faster than ={4}.

        Perl, Ruby, and .NET are more optimized and don't differentiate between ==== and ={4}, resulting in both being equally fast. The speed of both is also the same in MySQL.

mysql> set @str:='====';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg1:='====';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg2:='={4}';
Query OK, 0 rows affected (0.00 sec)

mysql> call sp_test_regexp(@str, @reg1, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-11 17:33:40.960 | 2023-07-11 17:33:40.963 |  3.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (7.08 sec)

Query OK, 0 rows affected (7.08 sec)

mysql> call sp_test_regexp(@str, @reg2, 100000);
+------+-------------------------+-------------------------+---------+
| @ret | @startts                | @endts                  | diff_ts |
+------+-------------------------+-------------------------+---------+
|    1 | 2023-07-11 17:33:48.054 | 2023-07-11 17:33:48.057 |  3.0000 |
+------+-------------------------+-------------------------+---------+
1 row in set (7.10 sec)

(10) Need identification

        Another simple optimization is for the engine to cancel in advance work whose results are not needed, such as the text-capturing work of capturing parentheses when nothing will use the captures. The ability to recognize this varies greatly by language, but the optimization is easy to provide when matching options can disable costly features.

        Tcl performs this optimization: its capturing parentheses do not actually capture text unless the user explicitly asks for it. .NET regular expressions provide an option letting the programmer specify whether capturing parentheses should capture.

5. Tips to improve expression speed

        The preceding sections described the various optimizations traditional NFA engines use. Combine that with an understanding of how a traditional NFA works, and you can benefit in three ways:

  • Write regular expressions suitable for optimization 

        Write expressions that play to known optimizations. For example, xx* is open to more optimizations than x+, such as checking for a character that must appear in the target string, or starting-character recognition.

  • Simulate optimizations

        Much time can be saved by simulating optimizations by hand. For example, adding (?=t) before this|that lets a system that cannot itself deduce that any match must start with 't' still gain the effect of starting-character recognition.
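
        In Python terms (illustrative), the two patterns below match exactly the same text; the second merely hands the engine a cheap one-character test to fail on first:

```python
import re

plain = re.compile(r'this|that')
# Hand-simulated starting-character recognition: (?=t) is a single cheap
# test that fails at once wherever the next character is not 't', so the
# full alternation runs only at plausible start positions.
hinted = re.compile(r'(?=t)(?:this|that)')

text = 'we know that regex engines differ'
```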

  • Lead the engine to a match

        Knowledge of how a traditional NFA works lets you steer the engine to a faster match. Take this|that: both alternatives begin with th, and if the first cannot match th, the second certainly cannot, so trying it is wasted effort. Writing th(?:is|at) checks th only once, invoking the relatively expensive alternation only when it is genuinely needed. Moreover, th(?:is|at) now begins with the literal th, which opens the door to further optimizations.
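
        A quick Python check (illustrative) that the factored form finds exactly the same matches:

```python
import re

slow = re.compile(r'this|that')
fast = re.compile(r'th(?:is|at)')   # 'th' tested once; alternation only after

# The two are equivalent; the factored form simply does less work per position.
for probe in ['this', 'that', 'thin', 'other stuff', 'say this now']:
    a, b = slow.search(probe), fast.search(probe)
    assert (a is None) == (b is None)
    if a:
        assert a.span() == b.span()
```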

        Efficiency and optimization are sometimes troublesome to deal with. Please pay attention to the following points:

  • Making changes that appear to be helpful sometimes backfires because they may prevent other optimizations from being effective.
  • Content added to simulate an optimization may end up costing more to process than it saves.
  • If you add something to simulate an optimization the software does not yet provide, a later upgrade that adds the real optimization may be undermined or duplicated by your simulation.
  • Likewise, contorting an expression to trigger the optimizations available today may prevent more advanced optimizations from applying after a future upgrade.
  • Modifying an expression to improve efficiency may make the expression difficult to understand and maintain.
  • The degree of benefit or harm caused by the specific modification depends basically on the data to which the expression is applied. Modifications that are beneficial to one type of data may be harmful to another type of data.

        Consider an extreme example: wanting to find (000|999)$ in a Perl script, I replace the capturing parentheses with non-capturing ones, reasoning that eliminating the capture overhead must make it faster. Strangely, this small and seemingly beneficial change slows the expression down by several orders of magnitude. Several factors conspire here: with non-capturing parentheses, the "end of string/line anchor" optimization is switched off. Non-capturing parentheses are beneficial in most cases, but in some they are disastrous.

        Detecting and performance testing the same type of data for the intended application can help determine whether a change is worth it, but there are still many factors that must be weighed.

1. Common sense optimization

(1) Avoid recompiling

        Regular expressions should be compiled and defined as few times as possible. In object-oriented processing, the user has precise control over this. For example, if you want to apply a regular expression in a loop, you should create the regular expression object outside the loop and reuse it in the loop.

        With programmatic processing, as in GNU Emacs and Tcl, try to keep the number of regular expressions used inside a loop below the tool's cache limit.

        If you are using integrated processing, such as Perl, you should try to avoid using variable interpolation in regular expressions within loops, because this will need to regenerate the regular expression each time it is looped, even if the value has not changed (although Perl provides an efficient way to avoid this problem).

(2) Use non-capturing parentheses

        If you do not need to refer to the text matched within parentheses, use non-capturing parentheses (?:...). This saves the cost of capturing and can reduce the number of backtracking states, improving speed on both counts; it also enables further optimizations, such as the elimination of unnecessary parentheses.

(3) Don’t abuse parentheses

        Use parentheses only when needed; elsewhere, parentheses can prevent certain optimizations. Do not use (.)* unless you really need to know the last character matched by .*.

(4) Do not abuse character groups

        Do not use single-character character groups such as ^.*[:]. That way you pay the cost of processing a character group without using its ability to match one of several characters. To match a metacharacter, prefer an escape to a character group: \. or \* rather than [.] or [*]. Replace ^[Ff][Rr][Oo][Mm] with a case-insensitive match.
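
        The equivalence can be sketched in Python (the names are illustrative):

```python
import re

# Character-group spelling: four character classes to process per attempt.
class_style = re.compile(r"^[Ff][Rr][Oo][Mm]")

# Case-insensitive literal: states the intent and leaves room for optimization.
flag_style = re.compile(r"^from", re.IGNORECASE)

# For metacharacters, prefer an escape over a one-character class.
dot = re.compile(r"\.")
```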

(5) Use starting anchor point

        Except in extremely rare cases, a regular expression that starts with .* should begin with ^ or \A. If such a regex cannot match at the beginning of the string, it obviously cannot match anywhere else. Adding the anchor (whether by hand or by the engine's automatic detection) works together with the "initial character/character-group/substring recognition optimization" to save a great deal of wasted work.
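
        A small Python illustration: on a single-line string the anchored and unanchored versions find the same match, but only the anchored one can give up after a single attempt when there is no match.

```python
import re

text = "INFO: all good"

# Without ^, a failed .*: would be retried at every starting position.
unanchored = re.search(r".*:", text)

# With ^, one failed attempt at position 0 settles the question.
anchored = re.search(r"^.*:", text)
```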

2. Expose literal text

        The following manual optimizations help expose literal text, improving the chance that the engine recognizes it and applies its text-based optimizations.

(1) "Extract" necessary elements from the quantifier

        Replacing x+ with xx* exposes the x required for the match. In the same way -{5,7} can be written as ------{0,2}.

(2) "Extract" the necessary elements at the beginning of the multi-selection structure

        Substituting th(?:is|at) for (?:this|that) exposes the necessary th. If the ending parts of different multi-selection branches are the same, they can also be "extracted" from the right, such as (?:optim|standard)ization. As you'll see in the next section, this can be very valuable if the extracted parts include anchor points.
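
        The two spellings accept exactly the same strings, as this Python check illustrates:

```python
import re

plain   = re.compile(r"(?:this|that)")
hoisted = re.compile(r"th(?:is|at)")  # the required 'th' is now exposed

words = ["this", "that", "other"]
same  = [bool(plain.fullmatch(w)) == bool(hoisted.fullmatch(w)) for w in words]
```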

3. Isolate the anchor point

        Some effective internal optimizations use anchor points such as ^, $, and \G to "bind" the expression to one end of the target string. There are some techniques that can help when using these optimizations.

(1) Independently add ^ and \G in front of the expression

        ^(?:abc|123) and ^abc|^123 are logically equivalent, but many regex engines apply the "beginning character/character group/substring recognition optimization" only to the first expression, making the first form much more efficient. In PCRE and the tools that use it, both are equally efficient, but in most other NFA tools the first expression is better.

        Comparing (^abc) and ^(abc) reveals another difference. The former is poorly arranged: the anchor is "hidden" inside the capturing parentheses, which must be entered before the anchor can be checked, and on many systems that is very inefficient. On some systems (PCRE, Perl, .NET) the two are equally efficient, but on others (Ruby and Java) only the latter is optimized.

(2) Independently add $ at the end of the expression

        This measure is similar to the optimization idea in the previous section. Although abc$|123$ and (?:abc|123)$ are logically equivalent, the optimization performance may be different. This optimization is currently only available in Perl, because only Perl currently provides "end-of-string/line-anchor optimization". The optimization works for (...|...)$ but not for (...$|...$).

4. Ignore or match priority? Detailed analysis of specific situations

        In general, whether to use an ignore-preferred quantifier or a match-preferred quantifier depends on the specific needs of the regular expression. For example, ^.*: is completely different from ^.*?: because the former matches the last colon and the latter matches the first colon. However, if the target data contains only one colon, there will be no difference between the two expressions until the unique colon is matched, so it may be more appropriate to choose the faster expression.

        However, the trade-off is not always so clear. The general principle: if the target string is very long and the colon is expected to be near the beginning, use an ignore-priority quantifier so the engine finds the colon sooner. If the colon is expected to be near the end, use a match-priority quantifier. If the data is random and you don't know which end the colon is likely to be near, use a match-priority quantifier, because match-priority quantifiers are generally optimized better, especially when the engine does not provide the "character after ignore-priority quantifier" optimization for the rest of the expression.

        If the string to match is short, the difference is less noticeable. At this time, the speed of the two regular expressions is very fast, but if you care about the little speed difference, let's do a performance test on typical data.

        A related question is how to choose between an ignore-priority quantifier and an excluded character group (^.*?: vs. ^[^:]*:). The answer again depends on the programming language and the application data, but for most engines the excluded character group is much more efficient than the ignore-priority quantifier. Perl is an exception, because it optimizes the character that follows an ignore-priority quantifier.
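
        Both spellings find the same first colon, which is easy to confirm in Python; only the internal work differs:

```python
import re

line = "host: value with trailing text"

lazy    = re.search(r"^.*?:", line)    # ignore-priority (lazy) dot-star
negated = re.search(r"^[^:]*:", line)  # excluded character group
```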

5. Split regular expressions

        Sometimes, applying multiple small regular expressions is much faster than one large regular expression. To take an extreme example, if you want to check whether a long string contains the name of the month, checking January, February, March, etc. in sequence is much faster than January|February|March|... Because for the latter, there is no text content necessary for successful matching, so "embedded text string check optimization" cannot be performed. "Big and comprehensive" regular expressions must test all subexpressions at every position in the target text, which is quite slow.
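
        The month check can be sketched in Python; a handful of literal substring scans stands in for the "several small expressions" approach:

```python
import re

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# One big alternation: no required literal text, so little for the engine to optimize.
big_re = re.compile("|".join(months))

def contains_month_split(text):
    # Twelve fast literal scans instead of one big alternation.
    return any(m in text for m in months)

def contains_month_big(text):
    return big_re.search(text) is not None
```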

        Let's look at another interesting example. To find data similar to HASH(0x80f60ac), a quite straightforward regular expression is \b(?:SCALAR|ARRAY|...|HASH)\(0x[0-9a-fA-F]+\). One would hope that a sufficiently smart engine would realize that (0x is required for any match and enable the "pre-check required character/substring optimization": the data this regex is applied to will rarely contain (0x, so the pre-check could save a great deal of time. Unfortunately, Perl does not do this; it tests the many alternatives of the whole regular expression at each character of each target string, which is not fast.

        One optimization is rather convoluted: \(0x(?<=(?:SCALAR|ARRAY|...|HASH)\(0x)[0-9a-fA-F]+\). Here, once \(0x has matched, a positive lookbehind confirms that the text just matched is preceded by one of the type names, and the rest then checks that the following text looks as expected. The point of this trouble is to give the regular expression a required literal, \(0x, so that several optimizations can apply, in particular the "pre-check required character/substring" optimization and the "beginning character/character-group/substring recognition" optimization.

        If Perl does not automatically look for \(0x, you can do this manually:

if ($data =~ m/\(0x/
   and
   $data =~ m/(?:SCALAR|ARRAY|...|HASH)\(0x[0-9a-fA-F]+\)/)
{
   # report bad data
}

        The \(0x check filters out most of the lines, so the relatively slow full regular expression runs only on lines that are likely to match, balancing efficiency and readability.
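
        The same two-step filter translates directly to Python (the helper name is made up for this sketch):

```python
import re

full_re = re.compile(r"(?:SCALAR|ARRAY|HASH)\(0x[0-9a-fA-F]+\)")

def has_ref_dump(line):
    # The cheap literal pre-check rejects most lines before the
    # comparatively slow full regex ever runs.
    return "(0x" in line and full_re.search(line) is not None
```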

6. Simulate beginning character recognition

        If the implementation you are using does not optimize beginning-character recognition, you can take matters into your own hands and add an appropriate lookahead to the beginning of the expression. The lookahead can check for a suitable starting character before the rest of the regular expression is applied.

        If the regular expression is Jan|Feb|...|Dec, it can be rewritten as (?=[JFMASOND])(?:Jan|Feb|...|Dec). The leading [JFMASOND] lists the possible first letters of English month names. This technique does not always pay off, though: the overhead of the lookaround may exceed the time it saves.
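
        In Python, the guarded and plain versions can be checked for agreement (the sample strings are arbitrary):

```python
import re

names = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

plain   = re.compile(names)
guarded = re.compile(r"(?=[JFMASOND])(?:%s)" % names)

def same_result(s):
    # Both patterns should agree on whether and what they match.
    m1, m2 = plain.search(s), guarded.search(s)
    if m1 is None or m2 is None:
        return m1 is None and m2 is None
    return m1.group() == m2.group()
```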

        If the regular engine can automatically detect [JFMASOND], the speed will of course be much faster than that specified by the user manually. On many systems, you can have the engine detect it automatically using the following complex expression:
[JFMASOND](?:(?<=J)an|(?<=F)eb|...|(?<=D)ec)

        Characters at the beginning of an expression can be exploited by most systems' "initial character/character-group/substring recognition optimization", so the transmission can efficiently look ahead for [JFMASOND]. If the target string contains no matching characters, this runs faster than the original Jan|Feb|...|Dec or the hand-added lookahead version. However, if the target string contains many characters the character group can match, the extra lookarounds may actually slow the matching down.

7. Use atomic grouping and possessive quantifiers

        In many cases, atomic grouping and possessive quantifiers can greatly improve matching speed without changing the results. For example, if the colon in ^[^:]+: does not match on the first try, any backtracking is pointless: by definition, any character "handed back" by the backtracking cannot be a colon. Using the atomic group ^(?>[^:]+): or the possessive quantifier ^[^:]++: discards the saved states directly, or never creates them at all. With no states to backtrack into, the engine avoids all the unnecessary backtracking.

        It must be emphasized, however, that these two constructs demand extreme caution, because improper use can inadvertently alter the matching results. If you use ^(?>.*): instead of ^.*:, the match must fail: the entire line is matched by .*, the following : has nothing left to match, and the atomic group forbids the backtracking that the final : requires.
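
        Python gained (?>...) and possessive quantifiers only in version 3.11, but the classic emulation (?=(X))\1 behaves the same way on any version: the engine never backtracks into a completed lookahead, so the captured X is consumed "atomically" by \1. That makes the semantic change described above easy to demonstrate:

```python
import re

s = "ab:cd"

# Ordinary greedy quantifier: backtracking hands characters back
# until the trailing : can match.
normal = re.search(r"^.*:", s)

# Atomic-group emulation: the lookahead captures .* greedily and is
# never re-entered, so \1 swallows the whole line and : must fail.
atomic_like = re.search(r"^(?=(.*))\1:", s)
```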

8. Matching of dominant engines

        Another way to improve matching efficiency is to place "control" of the match as precisely as possible. For example, use th(?:is|at) instead of this|that. In the latter, the alternation has top-level control; in the former, the relatively expensive alternation gains control only after th has matched.

(1) Put the most likely matching multi-selection branch first

        If the order of the alternatives does not affect the matching results, put the alternative most likely to match first. For example, in a regular expression matching hostnames, ordering the branches by how common each domain is, (?:com|edu|org|net|...), makes common matches likely to be found quickly.

        Of course, this only applies to traditional NFA engines, and only if a match exists. If POSIX NFA is used, or if there is no match, all multi-select branches must be tested, so the order does not matter.

(2) Disperse the ending part into the multi-selection structure

        Compare (?:com|edu|...|[a-z][a-z])\b with com\b|edu\b|...\b|[a-z][a-z]\b. In the latter, the \b that followed the alternation has been distributed to each alternative. With the original, an alternative may match but the subsequent \b then fail; putting \b inside each alternative makes such failures show up sooner, because the failure is detected without first exiting the alternation.

        This optimization is risky. Take care that it does not defeat other optimizations that would otherwise apply. For example, if the "distributed" ending is literal text, rewriting (?:this|that): as this:|that: works against some of the ideas in "expose literal text". Optimizations interact, so weigh them against each other and don't sacrifice a large gain for a small one.

        The same issue arises if you distribute a trailing $ on a system that supports the end-of-string/line anchor optimization. On such systems, (?:com|edu|...)$ is much faster than com$|edu$|...$.

9. Eliminate loops

        Whatever optimizations a system provides, perhaps the greatest benefit comes from understanding how the engine works and writing expressions that cooperate with it. The "loop" here refers to the star in expressions such as (this|that|...)*, and the earlier endless-matching expression "(\\.|[^\\"]+)*" falls into this category as well. When it fails to match, that expression takes practically forever trying every possibility, so it must be improved.

        There are two different ways to implement this technique:

  • Check which parts of (\\.|[^\\"]+)* actually match in various typical cases, noting the "trail" the subexpressions leave behind, and then reconstruct an efficient expression from the patterns found. A helpful mental model: a big ball, representing (...)*, rolls over the text, and the elements inside (...) leave tracks wherever they match.
  • Another approach is to look at the structure the regex is expected to match from a high level, make informed assumptions about what the target strings usually look like based on what you consider the common cases, and construct an efficient expression from that perspective.

(1) Method 1: Construct regular expressions based on experience

        When parsing "(\\.|[^\\"]+)*", it is natural to check the overall match against several concrete strings. If the target string is "hi", the subexpression that does the work is [^\\"]+: the overall match uses the opening ", then the alternative [^\\"]+, then the closing ". If the target is "he said \"hi there\" and left", the corresponding pattern is "[^\\"]+\\.[^\\"]+\\.[^\\"]+". Although we cannot construct a specific expression for every input string, we can identify common patterns and build a regular expression that is more efficient without losing generality.

        Now look at the example of the first four rows in the table below.

target string               corresponding expression
"hi there"                  "[^\\"]+"
"just one \" here"          "[^\\"]+\\.[^\\"]+"
"some \"quoted\" things"    "[^\\"]+\\.[^\\"]+\\.[^\\"]+"
"with \"a\" and \"b\"."     "[^\\"]+\\.[^\\"]+\\.[^\\"]+\\.[^\\"]+\\.[^\\"]+"
"\"ok\"\n"                  "\\.[^\\"]+\\.\\."
"empty \"\" quote"          "[^\\"]+\\.\\.[^\\"]+"

Table 2

        In each case, the match starts with a quote, then [^\\"]+, then some number of \\.[^\\"]+ sequences. Combined, that gives [^\\"]+(\\.[^\\"]+)*. This example illustrates how a general pattern can be used to construct many useful expressions.

        When matching double-quoted strings, the quote itself and the backslash are "special": a quote can end the string, and a backslash means the character after it does not. Everything else is "ordinary" and is matched by [^\\"]. Note how they combine into [^\\"]+(\\.[^\\"]+)*: it fits the general pattern normal+(special normal+)*. Adding the quotes at both ends gives "[^\\"]+(\\.[^\\"]+)*".

        However, the examples in the last two rows of Table 2 cannot be matched by this expression. The crux is that both [^\\"]+ in the current expression require at least one ordinary character. Try changing the two plus signs into asterisks: "[^\\"]*(\\.[^\\"]*)*". Does this achieve the desired result? More importantly, does it have any negative effects?

        Now all the examples in Table 2 match, even strings like "\"\"\"". But such a major change needs checking for unexpected results. Could an incorrectly formatted quoted string now match? Could a properly formatted quoted string fail to match? And what about efficiency?

        Take a closer look at "[^\\"]*(\\.[^\\"]*)*". The leading "[^\\"]* is applied only once; it matches the quote that must appear at the beginning, plus any ordinary characters after it, which is fine. The following (\\.[^\\"]*)* is governed by an asterisk: if it matches zero times, it is as if that part were absent, leaving "[^\\"]*", which is obviously fine and represents the common case with no escaped elements.

        If the (\\.[^\\"]*)* part matches once, it is effectively "[^\\"]*\\.[^\\"]*". Even if the trailing [^\\"]* matches nothing, that is effectively "[^\\"]*\\.", which is still fine. Continuing this analysis shows that the change causes no problems at all. So finally, the regular expression for matching quoted strings including escaped quotes is "[^\\"]*(\\.[^\\"]*)*". It matches exactly the same strings as the original expression, but with the loop eliminated it finishes in bounded time: it is not only much more efficient, it also avoids endless matching.
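
        The final expression is easy to exercise in Python against the strings from Table 2 (a non-capturing group replaces the plain parentheses, which does not change what is matched):

```python
import re

# The loop-eliminated pattern: opening-quote normal* (special normal*)* closing-quote
quoted = re.compile(r'"[^\\"]*(?:\\.[^\\"]*)*"')

samples = [
    r'"hi there"',
    r'"just one \" here"',
    r'"some \"quoted\" things"',
    r'"\"ok\"\n"',
    r'"empty \"\" quote"',
]
all_match = all(quoted.fullmatch(s) for s in samples)
```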

        A common solution to eliminate loops is:

opening normal*(special normal*)* closing

        To avoid endless matching in "[^\\"]*(\\.[^\\"]*)*", three points are important:

1) The matching start of the special part and the normal part cannot overlap.
        Subexpressions in the special and normal parts cannot match starting from the same position. In the above example, the normal part is [^\\"] and the special part is \\. Obviously they cannot match the same characters, because the latter requires a backslash to begin with, while the former does not allow backslashes.

        On the other hand, \\. and [^"] can both match starting at the backslash in "Hello \n", so that pair does not fit this solution. If both parts can match starting at the same position, there is no single way to divide the text between them, and this nondeterminism leads to endless matching; the 'makudonarudo' example illustrates it. When the match fails (or always, with a POSIX NFA), every possibility must be tried. Avoiding that situation is the first reason to improve this expression.

        If you confirm that the special and normal parts cannot match the same characters, you can use the special part as a checkpoint to eliminate the uncertainty caused by the normal part matching the same text in each iteration of (...)*. If it is confirmed that the special part and the normal part never match the same text, then there is a unique "combination sequence" of the special part and the normal part in the match of a specific target string. Checking this sequence is much faster than checking thousands of possibilities, thus avoiding endless matches.

2) The special part must match at least one character.
        The second point is that the special part must match at least one character. If the special part can succeed without consuming any characters, the following characters must still be matched by different iterations of (special normal*)*, and we are back to the original (...*)* problem.

        Selecting (\\.)* as the special part violates this rule. If "[^\\"]*((\\.)*[^\\"]*)*" is applied to "Tubby (note the missing closing quote), the engine must try every way the various [^\\"]* could divide up Tubby before it can conclude that the match fails. Because the special part can match nothing at all, it cannot serve as a checkpoint.

3) The special part must be atomic.
        The text matched by the special part must not be splittable across multiple iterations of that part. For example, suppose you need to match Pascal comments {...} and the whitespace that may appear between them. A comment is matched by \{[^}]*\}, so the whole regular expression is (\{[^}]*\}| +)*. Take ' +' as the special part and \{[^}]*\} as the normal part; applying the normal*(special normal*)* solution yields (\{[^}]*\})*( +(\{[^}]*\})*)*. Now look at this string:
{comment} {another}  

        The run of consecutive spaces might be matched by a single ' +', by several ' +' (one space each), or by combinations in between (each matching a different number of spaces). This is very similar to the earlier 'makudonarudo' problem.

        The root of the problem is that, via (...)*, the special part can match either a long stretch of text or just a piece of it. This nondeterminism opens the possibility of matching the same text in multiple ways.

        If there is a global match, ' +' probably matches only once per run of spaces; but if the global match fails (for example, when this expression is part of a larger one), the engine must test every way ( +)* could carve up each run of spaces. That takes time and contributes nothing to the global match.

        The solution is to make the special part match a fixed amount of text. Since it must match at least one space but might otherwise match more, use a single ' ' as the special part and let (...)* match multiple spaces through multiple iterations.

        This example is good for exposition, but in practice it may be more efficient to swap the special and normal parts: ' *(\{[^}]*\} *)*'. A Pascal program presumably contains more spaces than comments, and for the common cases it is more effective to let the normal part match the common text.
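
        A Python sketch of the swapped version shows it accepting the sample text (non-capturing parentheses stand in for the plain ones):

```python
import re

# Normal part: ' ' (spaces are common); special part: a {...} comment.
pascal_ws = re.compile(r" *(?:\{[^}]*\} *)*")

ok = pascal_ws.fullmatch("{comment} {another}  ")
```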

        If you have several quantifiers at different levels, such as (...*)*, you have to be careful, but many such expressions are perfectly fine. For example:

  • (Re: *)* is used to match any number of 'Re:' sequences (can be used to clear 'Subject: Re: Re: re: hey' in the email subject).
  • ( *\$[0-9]+)* is used to match dollar amounts and may be separated by spaces.
  • (.*\n)+ is used to match one or more lines of text. (In fact, if the dot cannot match a newline character, and there are other elements after this subexpression that cause the match to fail, an endless match will result.)

        These expressions are fine because each contains a checkpoint that rules out "matching the same text in multiple ways": Re: in the first, \$ in the second, and \n in the third (provided the dot cannot match a newline).
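
        The first of these is easy to try in Python (adapted here to also accept the lowercase 're:' that appears in the sample subject):

```python
import re

subject = "Subject: Re: Re: re: hey"

# (Re: *)* with a case tweak: each iteration must consume a literal
# "Re:" checkpoint, so no text can be matched in more than one way.
cleaned = re.sub(r"Subject: (?:[Rr]e: *)*", "Subject: ", subject)
```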

(2) Method 2: Top-down perspective

        Start by matching only the most common parts of the target string, then add handling for the uncommon cases. Consider the endless-matching expression (\\.|[^\\"]+)*, the text it is expected to match, and where it might be used. Normally, ordinary characters in a quoted string far outnumber escaped ones, so [^\\"]+ does most of the work and \\. is needed only for the occasional escape. An alternation can handle both cases, but paying alternation overhead just for the rare escapes reduces efficiency.

        Since [^\\"]+ matches most of the characters in the string, whenever it stops we know we have hit either the closing quote or an escape. If it is an escape, any character may follow it, and then a new round of [^\\"]+ begins. Each time [^\\"]+ stops, we are back in the same situation: expecting either the closing quote or another escape.

        You can express it naturally with an expression and get the same result as method 1:
"[^\\"]+(\\.[^\\"]+)*"

        As before, the content right after the opening quote, or between escapes, may be empty. Changing the two plus signs to asterisks yields the same expression as method 1.

(3) Match host name

        A hostname is basically a sequence of subdomain names separated by dots. Precisely specifying a legal subdomain is fiddly, so for clarity use [a-z]+ to match one. If a subdomain is [a-z]+ and you want a dot-separated sequence of them, match the first subdomain, then any number of others each prefixed by a dot: [a-z]+(\.[a-z]+)*.
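
        A quick Python check of the simplified hostname pattern:

```python
import re

# Simplified subdomain spec from the text: lowercase letters only.
hostname = re.compile(r"[a-z]+(?:\.[a-z]+)*")

good = hostname.fullmatch("www.example.com")
bad  = hostname.fullmatch("www..example.com")  # empty label between two dots
```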

        Conceptually, the dot-separated hostname problem can be viewed the same way as the double-quoted string problem: a sequence of normal parts separated by special parts. Here the normal part is [a-z]+ and the special part is \., so the loop-elimination solution of method 1 applies.

        The subdomain example is in the same category as the double-quoted string example, but there are two major differences:

  • There are no delimiters at the beginning and end of the domain name.
  • The normal part of a subdomain cannot be empty: two dots cannot be adjacent, and a dot cannot appear at the beginning or end of the whole domain. For double-quoted strings, the normal part may be empty, which is why [^\\"]+ was changed to [^\\"]*; for subdomains that modification must not be made.

        Looking back at the double-quoted string example, the advantages and disadvantages of the expression "[^\\"]*(\\.[^\\"]*)*" are obvious.

Disadvantages:

  • Readability: this is the biggest problem. The original "([^\\"]|\\.)*" is easier to understand at a glance; readability is given up here in pursuit of efficiency.
  • Maintainability: maintenance becomes trickier, because any change must be kept identical across both copies of [^\\"]. Maintainability, too, is sacrificed for efficiency.

Advantages:

  • Speed: when it cannot match, or under a POSIX NFA, this regular expression does not fall into endless matching. Thanks to the careful tuning, a particular text can be matched in only one way, and if the text doesn't match, the engine discovers that quickly.
  • More speed: the regex "flows" well, which is the topic of "Fluid Regular Expressions". Under a traditional NFA, the loop-eliminated expression tests much faster than the earlier alternation version, even on matches that would have succeeded without entering an endless-matching state.

(4) Use atomic grouping and possessive quantifiers

        The endless-matching expression "(\\.|[^\\"]+)*" falls into futile effort only when it cannot match. When a match exists, it finishes quickly, because [^\\"]+ matches most of the characters in the target string (it is the normal part discussed earlier). Since [...]+ is usually well optimized and matches most of the characters, the overhead of the outer (...)* quantifier is greatly reduced.

        The problem with "(\\.|[^\\"]+)*" is that when it cannot match, it backtracks endlessly through useless saved states. These states have no value: they merely check different permutations of the same text, none of which can match. If those states could be discarded, the regular expression would report failure quickly. There are two ways to discard (or never create) them: atomic grouping and possessive quantifiers.

        Before eliminating the backtracking, let's swap the order of the alternatives, changing "(\\.|[^\\"]+)*" to "([^\\"]+|\\.)*" so that the element matching "normal" text comes first. When two or more alternatives can match at the same position, their order can affect the result. Here, though, the alternatives match mutually exclusive text: wherever one can match, the other cannot. So order does not affect correctness, and it can be chosen for clarity or efficiency.

  • Use possessive quantifiers to avoid endless matching

        The endless-matching expression "([^\\"]+|\\.)*" contains two quantifiers; you can make one of them possessive, or both. Because most of the troublesome backtracking comes from the states left by [...]+, making it possessive yields an expression that is fast even when no match exists. Making the outer (...)* possessive, however, discards all the states inside the parentheses, including those of [...]+ and of the alternation itself, so if you pick only one, pick the outer one.

        You can also make both quantifiers possessive. The relative speed depends on how the engine optimizes possessive quantifiers. In the MySQL test below, making only the outer quantifier possessive is more than twice as fast as the other two variants, which run at about the same speed.

mysql> set @str:='"empty \\\"\\\" quote"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg1:='"([^\\\\"]++|\\\\.)*"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg2:='"([^\\\\"]+|\\\\.)*+"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg3:='"([^\\\\"]++|\\\\.)*+"';
Query OK, 0 rows affected (0.00 sec)

mysql> 
mysql> call sp_test_regexp(@str, @reg1, 1000);
+------+-------------------------+-------------------------+----------+
| @ret | @startts                | @endts                  | diff_ts  |
+------+-------------------------+-------------------------+----------+
|    0 | 2023-07-17 09:18:46.766 | 2023-07-17 09:18:47.181 | 415.0000 |
+------+-------------------------+-------------------------+----------+
1 row in set (0.43 sec)

Query OK, 0 rows affected (0.43 sec)

mysql> call sp_test_regexp(@str, @reg2, 1000);
+------+-------------------------+-------------------------+----------+
| @ret | @startts                | @endts                  | diff_ts  |
+------+-------------------------+-------------------------+----------+
|    0 | 2023-07-17 09:18:47.191 | 2023-07-17 09:18:47.376 | 185.0000 |
+------+-------------------------+-------------------------+----------+
1 row in set (0.19 sec)

Query OK, 0 rows affected (0.19 sec)

mysql> call sp_test_regexp(@str, @reg3, 1000);
+------+-------------------------+-------------------------+----------+
| @ret | @startts                | @endts                  | diff_ts  |
+------+-------------------------+-------------------------+----------+
|    0 | 2023-07-17 09:18:47.386 | 2023-07-17 09:18:47.794 | 408.0000 |
+------+-------------------------+-------------------------+----------+
1 row in set (0.42 sec)

Query OK, 0 rows affected (0.42 sec)
  • Use atomic grouping to avoid endless matching

        To apply atomic grouping to "([^\\"]+|\\.)*", the obvious first step is to make the ordinary parentheses atomic: "(?>[^\\"]+|\\.)*". But note that (?>...|...)* is quite different from the possessive (...|...)*+ in which states it discards.

        (...|...)*+ leaves no states behind when it completes, whereas (?>...|...)* only discards the states created within each iteration of the alternation. The asterisk lies outside the atomic group, so it is unaffected: the expression still keeps a "skip this iteration" state per iteration, and those states remain available to backtracking. To eliminate the outer quantifier's states as well, the outer parentheses themselves must become an atomic group; that is, to simulate the possessive (...|...)*+ you must use (?>(...|...)*).

        (...|...)*+ and (?>...|...)* are both useful when solving endless matching problems, but they differ in the choice and timing of discarding states. The test situation in MySQL is that the fastest is to use solidified grouping for two layers of brackets, followed by using solidified grouping for only the inner brackets, and the slowest is to use solidified grouping for only the outer brackets. In general, the speeds of the three solidification groupings are not much different, and they are all faster than the fastest occupying priority quantifier method "([^\\"]+|\\.)*+".

mysql> set @str:='"empty \\\"\\\" quote"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg1:='"(?>[^\\\\"]+|\\\\.)*"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg2:='"(?>([^\\\\"]+|\\\\.)*)"';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg3:='"(?>(?>[^\\\\"]+|\\\\.)*)"';
Query OK, 0 rows affected (0.00 sec)

mysql> 
mysql> call sp_test_regexp(@str, @reg1, 1000);
+------+-------------------------+-------------------------+----------+
| @ret | @startts                | @endts                  | diff_ts  |
+------+-------------------------+-------------------------+----------+
|    0 | 2023-07-17 09:54:29.521 | 2023-07-17 09:54:29.684 | 163.0000 |
+------+-------------------------+-------------------------+----------+
1 row in set (0.17 sec)

Query OK, 0 rows affected (0.17 sec)

mysql> call sp_test_regexp(@str, @reg2, 1000);
+------+-------------------------+-------------------------+----------+
| @ret | @startts                | @endts                  | diff_ts  |
+------+-------------------------+-------------------------+----------+
|    0 | 2023-07-17 09:54:29.694 | 2023-07-17 09:54:29.874 | 180.0000 |
+------+-------------------------+-------------------------+----------+
1 row in set (0.19 sec)

Query OK, 0 rows affected (0.19 sec)

mysql> call sp_test_regexp(@str, @reg3, 1000);
+------+-------------------------+-------------------------+----------+
| @ret | @startts                | @endts                  | diff_ts  |
+------+-------------------------+-------------------------+----------+
|    0 | 2023-07-17 09:54:29.884 | 2023-07-17 09:54:30.033 | 149.0000 |
+------+-------------------------+-------------------------+----------+
1 row in set (0.16 sec)

Query OK, 0 rows affected (0.16 sec)

(5) Simple example of eliminating loops

  • Eliminate loops in multi-character "quotes"

        The goal is to match the bold elements in the string ...<B>Billions</B> and <B>Zillions</B> of suns...

mysql> set @str:='<B>Billions</B> and <B>Zillions</B> of suns';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg:='<B>(?>[^<]*)(?>(?!</?B>)<[^<]*)*</B>';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@str,@reg,'') c, regexp_extract(@str,@reg,'') s;
+------+---------------------------------+
| c    | s                               |
+------+---------------------------------+
|    2 | <B>Billions</B>,<B>Zillions</B> |
+------+---------------------------------+

1 row in set (0.01 sec)

        <B> matches the opening <B>; (?>[^<]*) matches any number of "normal" characters; (?!</?B>) checks that what follows is neither <B> nor </B>; < then matches the "special" character; [^<]* goes on to match more "normal" characters; finally </B> matches the closing </B>. Atomic grouping is not strictly necessary here, but when a match can only partially succeed, it lets the engine report failure faster.
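        The same unrolled pattern can be sketched with Python's re module (the atomic groups are dropped here, since the text above notes they are optional and re only gained (?>...) in Python 3.11):

```python
import re

# Unrolled loop: <B>, any "normal" non-'<' characters, then any number
# of ("special" "normal"*) units, where "special" is a '<' that does
# not begin <B> or </B>.
pattern = re.compile(r'<B>[^<]*(?:(?!</?B>)<[^<]*)*</B>')

text = '...<B>Billions</B> and <B>Zillions</B> of suns...'
print(pattern.findall(text))  # ['<B>Billions</B>', '<B>Zillions</B>']
```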

  • Eliminate loops in consecutive row matching
mysql> set @str:=
    -> 'SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\\
    '>          missing.c msg.c node.c re.c version.c';
Query OK, 0 rows affected (0.01 sec)

mysql> set @reg:='^\\w+=((?>[^\\n]*)(?>\\n[^\\n]*)*)';
Query OK, 0 rows affected (0.00 sec)

mysql> select @reg r, regexp_count(@str,@reg,'') c, regexp_extract(@str,@reg,'') s\G
*************************** 1. row ***************************
r: ^\w+=((?>[^\n]*)(?>\n[^\n]*)*)
c: 1
s: SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\
         missing.c msg.c node.c re.c version.c
1 row in set (0.00 sec)

        \w+= matches the leading word and the equals sign; (?>[^\n]*) matches the "normal" characters; (?>\n[^\n]*) matches a "special" character (the newline) followed by more "normal" characters. This first version runs in non-dotall mode, where the newline is the only "special" character. The second version below runs in dotall mode (the 'n' flag), where the backslash is the "special" character and everything else, newlines included, is "normal".

mysql> set @str:=
    -> 'SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\\
    '>          missing.c msg.c node.c re.c version.c';
Query OK, 0 rows affected (0.00 sec)

mysql> set @reg:='^\\w+=((?>[^\\\\]*)(?>\\\\.[^\\\\]*)*)';
Query OK, 0 rows affected (0.00 sec)

mysql> select @reg r, regexp_count(@str,@reg,'mn') c, regexp_extract(@str,@reg,'mn') s\G
*************************** 1. row ***************************
r: ^\w+=((?>[^\\]*)(?>\\.[^\\]*)*)
c: 1
s: SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\
         missing.c msg.c node.c re.c version.c
1 row in set (0.00 sec)

        As in the example above, atomic grouping is not required, but it lets the engine report a failed match faster.
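        A sketch of the dotall version with Python's re (atomic groups again omitted; re.DOTALL lets the dot in \\. match the newline after the continuation backslash):

```python
import re

# "Normal" is any character except backslash; "special" is a backslash
# plus the escaped character, which may be the newline (hence DOTALL).
pattern = re.compile(r'^\w+=([^\\]*(?:\\.[^\\]*)*)', re.DOTALL)

makefile = ('SRC=array.c buildin.c eval.c field.c gawkmisc.c io.c main.c\\\n'
            '         missing.c msg.c node.c re.c version.c')
m = pattern.match(makefile)
print(m.group(1))
```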

  • Eliminate loops in CSV regular expression

        The regular expression used to match a CSV field body is (?:[^"]|"")*, which already separates the normal part ([^"]) from the special part ("").

mysql> set @s:='Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K';
Query OK, 0 rows affected (0.01 sec)

mysql> set @r:='\\G(?:^|,)(?:"((?>[^"]*)(?>""[^"]*)*)"|([^",]*))';
Query OK, 0 rows affected (0.00 sec)

mysql> select replace(trim(both '"' from (trim(leading ',' from regexp_substr(@s,@r,1,lv)))),'""','"') s
    ->   from (with recursive tab1(lv) as (select 1 lv union all select t1.lv + 1 from tab1 t1 
    ->  where lv < regexp_count(@s, @r, '')) select lv from tab1) t;
+-----------------------+
| s                     |
+-----------------------+
| Ten Thousand          |
| 10000                 |
|  2710                 |
|                       |
| 10,000                |
| It's "10 Grand", baby |
| 10K                   |
+-----------------------+
7 rows in set (0.00 sec)

        Adding \G at the beginning avoids trouble caused by the bump-along of the transmission and improves efficiency: each field match must start exactly where the previous one ended. "((?>[^"]*)(?>""[^"]*)*)" matches a double-quoted field; ([^",]*) matches text containing neither quotes nor commas. As in the other examples, atomic grouping is not necessary, but it improves efficiency.
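        Python's re has no \G anchor, so this sketch relies on finditer's scan instead, which is equivalent for well-formed input:

```python
import re

# Unrolled CSV field: a quoted field (with "" as an escaped quote)
# or an unquoted field, each preceded by start-of-string or a comma.
pattern = re.compile(r'(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^",]*))')

line = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K'
fields = []
for m in pattern.finditer(line):
    if m.group(1) is not None:            # quoted field: undouble ""
        fields.append(m.group(1).replace('""', '"'))
    else:                                 # unquoted field
        fields.append(m.group(2))
print(fields)
```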

  • Eliminate loops in C language comments

        In C, comments start with /* and end with */; they may span multiple lines but cannot be nested (C++, Java, and C# also allow this comment form). The simplest approach is a dot with a lazy (ignore-priority) quantifier to match the comment body: /\*.*?\*/.

mysql> set @s:='/** some comment here **/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*.*?\\*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+-----------+------+---------------------------+
| @r        | c    | s                         |
+-----------+------+---------------------------+
| /\*.*?\*/ |    1 | /** some comment here **/ |
+-----------+------+---------------------------+
1 row in set (0.00 sec)

        Applying the loop-elimination technique to C comments also yields an efficient expression. But because the closing delimiter */ is two characters long, the naive /\*[^*]*\*/ cannot match comments whose text contains an asterisk.

mysql> set @s:='/** some comment here **/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*[^*]*\\*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @s, @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+---------------------------+-------------+------+------+
| @s                        | @r          | c    | s    |
+---------------------------+-------------+------+------+
| /** some comment here **/ | /\*[^*]*\*/ |    0 |      |
+---------------------------+-------------+------+------+
1 row in set (0.00 sec)

        To see the structure more clearly, the discussion below uses /x...x/ in place of /*...*/. /\*[^*]*\*/ then becomes /x[^x]*x/, which removes the backslash escapes and is easier to follow.

        The formula to match text within delimiters is:

  1. Match the starting delimiter;
  2. Match text: match "any character except the closing delimiter";
  3. Match the closing delimiter.

        Now, with /x and x/ as the opening and closing delimiters, the difficulty is matching "any character that is not part of the closing delimiter". When the closing delimiter is a single character, a negated character class works, but a character class cannot test a multi-character sequence. With a negative lookahead, however, (?:(?!x/).)* expresses "any character, so long as x/ does not start here", giving /x(?:(?!x/).)*x/. It is correct, just very slow.

mysql> set @s:='/** some comment here **/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*(?:(?!\\*/).)*\\*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+---------------------+------+---------------------------+
| @r                  | c    | s                         |
+---------------------+------+---------------------------+
| /\*(?:(?!\*/).)*\*/ |    1 | /** some comment here **/ |
+---------------------+------+---------------------------+
1 row in set (0.00 sec)

        Because almost every flavor that supports lookahead also supports lazy quantifiers, in practice you could simply use /x.*?x/ instead, so efficiency need not hinge on the lookahead version.
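        In Python's re, both versions are available and find the same match; only their internal effort differs:

```python
import re

comment = '/** some comment here **/'

# Negative-lookahead version: any character not starting the closer */.
lookahead = re.compile(r'/\*(?:(?!\*/).)*\*/', re.DOTALL)
# Lazy-quantifier version: the shortest span between /* and */.
lazy = re.compile(r'/\*.*?\*/', re.DOTALL)

print(lookahead.search(comment).group())  # /** some comment here **/
print(lazy.search(comment).group())       # /** some comment here **/
```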

        There are two plausible ways to match the text before the first x/. The first treats the x as the critical character: "any character that is not part of the closing delimiter" is either a character other than x, or an x whose next character is not a slash:

  • Any character except x: [^x].
  • An x not followed by a slash: x[^/].

        This gives ([^x]|x[^/])* for the body text, and /x([^x]|x[^/])*x/ for the whole comment. Unfortunately, this path does not work.

mysql> set @s:='/** some comment here **/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*([^*]|\\*[^/])*\\*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+----------------------+------+------+
| @r                   | c    | s    |
+----------------------+------+------+
| /\*([^*]|\*[^/])*\*/ |    0 |      |
+----------------------+------+------+
1 row in set (0.00 sec)

        Apply /x([^x]|x[^/])*x/ to /xx foo xx/: after 'foo ', x[^/] matches the first closing x together with the second x (the second x is, after all, "not a slash"). The x that should have started the closing delimiter has been swallowed, so the loop runs another iteration, [^x] matches the slash, and the match runs on past x/ instead of stopping at the real closing delimiter.

        The other way treats the slash as the critical character, taking the closing delimiter to be a slash that immediately follows an x. "Any character that is not part of the closing delimiter" then becomes:

  • Any character except a slash: [^/].
  • A slash not immediately preceded by x: [^x]/.

        So ([^/]|[^x]/)* matches the body text, and /x([^/]|[^x]/)*x/ the whole comment. Unfortunately, this is also a dead end.

mysql> set @s:='/*/ some comment here /*/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*([^/]|[^*]/)*\\*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+---------------------+------+------+
| @r                  | c    | s    |
+---------------------+------+------+
| /\*([^/]|[^*]/)*\*/ |    0 |      |
+---------------------+------+------+
1 row in set (0.00 sec)

        /x([^/]|[^x]/)*x/ cannot match /x/ foo /x/. And when a comment is followed by yet another slash, the expression matches beyond the comment's real closing delimiter; the previous approach suffers from the same defect, as the examples below show.

mysql> set @s:='/** some comment here **// foo*/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*([^*]|\\*[^/])*\\*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+----------------------+------+----------------------------------+
| @r                   | c    | s                                |
+----------------------+------+----------------------------------+
| /\*([^*]|\*[^/])*\*/ |    1 | /** some comment here **// foo*/ |
+----------------------+------+----------------------------------+
1 row in set (0.00 sec)

mysql> set @s:='/*/ some comment here /*// foo*/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*([^/]|[^*]/)*\\*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+---------------------+------+------------+
| @r                  | c    | s          |
+---------------------+------+------------+
| /\*([^/]|[^*]/)*\*/ |    1 | /*// foo*/ |
+---------------------+------+------------+
1 row in set (0.00 sec)

        Now let's fix these expressions. In the first approach, the trouble is that x[^/] swallows the xx before the trailing slash. Changing it to x+[^/], so that x+[^/] matches a run of x's ending with a non-slash character, does let the run match correctly. But under backtracking x+ can give characters back, letting [^/] match an x again, so the expression can still match too much.

mysql> set @s:='/** some comment here **// foo*/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*([^*]|\\*+[^/])*\\*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+-----------------------+------+----------------------------------+
| @r                    | c    | s                                |
+-----------------------+------+----------------------------------+
| /\*([^*]|\*+[^/])*\*/ |    1 | /** some comment here **// foo*/ |
+-----------------------+------+----------------------------------+
1 row in set (0.00 sec)

        To solve the problem, "x's followed by something that does not end the comment" should be x+[^/x]: in '...xxx/' the expression can no longer sneak past the run of x's. And since a comment may end with several x's before the final slash, the closing sequence must be x+/. This yields /x([^x]|x+[^/x])*x+/, which finally matches comments correctly.

mysql> set @r:='/\\*([^*]|\\*+[^/*])*\\*+/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='/** some comment here **// foo*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+-------------------------+------+---------------------------+
| @r                      | c    | s                         |
+-------------------------+------+---------------------------+
| /\*([^*]|\*+[^/*])*\*+/ |    1 | /** some comment here **/ |
+-------------------------+------+---------------------------+
1 row in set (0.00 sec)

mysql> set @s:='/*/ some comment here /*// foo*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+-------------------------+------+---------------------------+
| @r                      | c    | s                         |
+-------------------------+------+---------------------------+
| /\*([^*]|\*+[^/*])*\*+/ |    1 | /*/ some comment here /*/ |
+-------------------------+------+---------------------------+
1 row in set (0.00 sec)

        To improve efficiency, the loop in this expression must be eliminated. The following table breaks the unrolled form

opening normal*(special normal*)* closing

into its elements:

element    purpose                                                  regular expression
--------   ------------------------------------------------------  ------------------
opening    start of the comment                                     /x
normal*    comment text, ending with one or more x's                [^x]*x+
special    a character that cannot continue the closing delimiter   [^/x]
closing    the trailing slash                                       /

table 3

        As in the subdomain example, normal must match at least one character; here a normal run is "any number of non-x characters followed by one or more x's". Because the required closing delimiter is two characters, a normal run that has just consumed a run of x's hands control to the special part only when the next character does not complete the closing delimiter. Applying the general unrolling formula gives:

/x[^x]*x+([^/x][^x]*x+)*/

        Replace each x with \* (x in the character group is replaced with *) to get the actual expression.

mysql> set @r:='/\\*[^*]*\\*+([^/*][^*]*\\*+)*/';
Query OK, 0 rows affected (0.00 sec)

mysql> set @s:='/** some comment here **// foo*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+------------------------------+------+---------------------------+
| @r                           | c    | s                         |
+------------------------------+------+---------------------------+
| /\*[^*]*\*+([^/*][^*]*\*+)*/ |    1 | /** some comment here **/ |
+------------------------------+------+---------------------------+
1 row in set (0.00 sec)

mysql> set @s:='/*/ some comment here /*// foo*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+------------------------------+------+---------------------------+
| @r                           | c    | s                         |
+------------------------------+------+---------------------------+
| /\*[^*]*\*+([^/*][^*]*\*+)*/ |    1 | /*/ some comment here /*/ |
+------------------------------+------+---------------------------+
1 row in set (0.00 sec)

        In real code, comments usually span multiple lines, and this expression handles that as well.

mysql> set @s:=
    -> '/*/ some comment here / foo
    '> * some comment here * foo*
    '> /**/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
@r: /\*[^*]*\*+([^/*][^*]*\*+)*/
 c: 1
 s: /*/ some comment here / foo
* some comment here * foo*
/**/
1 row in set (0.00 sec)
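        The final unrolled pattern behaves the same under Python's re, and since it uses no dot, multi-line comments need no special flag:

```python
import re

# Unrolled C-comment pattern: /*, non-*'s then a run of *'s, and a
# "special" [^/*] whenever the run of *'s is not followed by /.
pattern = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')

s1 = pattern.search('/** some comment here **// foo*/').group()
s2 = pattern.search('/*/ some comment here /*// foo*/').group()
print(s1)  # /** some comment here **/
print(s2)  # /*/ some comment here /*/
```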

        In practice this regular expression still runs into problems: it recognizes C comments but knows nothing else of C syntax. For example, the /*...*/ part below matches even though it is not a comment.

mysql> set @s:='const char *cstart = "/*", *cend = "*/"';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+------------------------------+------+------------------+
| @r                           | c    | s                |
+------------------------------+------+------------------+
| /\*[^*]*\*+([^/*][^*]*\*+)*/ |    1 | /*", *cend = "*/ |
+------------------------------+------+------------------+
1 row in set (0.00 sec)

        This example is discussed in the next section.

10. Regular expressions that work smoothly

        The regular expression /\*[^*]*\*+([^/*][^*]*\*+)*/ can match the wrong text, as in this line of C code:

char *CommentStart = "/*"; /* start of comment */

        The match it finds is /*"; /* start of comment */, but the desired match is /* start of comment */.

mysql> set @s:='char *CommentStart = "/*"; /* start of comment */';
Query OK, 0 rows affected (0.00 sec)

mysql> set @r:='/\\*[^*]*\\*+([^/*][^*]*\\*+)*/';
Query OK, 0 rows affected (0.00 sec)

mysql> select @r, regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s;
+------------------------------+------+-----------------------------+
| @r                           | c    | s                           |
+------------------------------+------+-----------------------------+
| /\*[^*]*\*+([^/*][^*]*\*+)*/ |    1 | /*"; /* start of comment */ |
+------------------------------+------+-----------------------------+
1 row in set (0.00 sec)

        The problem is how to handle a double quote when one is encountered. Similar cases are C character constants in single quotes, single-line comments introduced by two slashes, and so on. A branch can be defined to match each case:

  • Non-single quotes, double quotes, slash strings: [^"'/]
  • Double quoted string: "[^\\"]*(?:\\.[^\\"]*)*"
  • Single quoted string: '[^'\\]*(?:\\.[^'\\]*)*'
  • Single or multi-line comments: /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
  • Single line comment: //[^\n]*

        Joining these five separate expressions with | into one alternation is perfectly safe, because they do not overlap at all. A traditional NFA stops as soon as it finds a match, so the most frequently used branch, [^"'/], is placed first. Scanning the combined expression from left to right, one round of attempts against the target string has these possible outcomes:

  • Matches a single non-single quote, double quote, or slash character
  • Matches a double-quoted string in one go, directly to its end.
  • Matches a single-quoted string directly to its end.
  • Match multiple lines of comments at once, directly reaching the end of the comment.
  • Match the single-line comment part at once, directly to the end of the comment.

        This way the regex never begins an attempt from inside a quoted string or a comment, which is the key to its success. The five branches are held in MySQL variables below; note the escaping of backslashes and single quotes.

set @other:='[^"\'/]';
set @double:='"[^\\\\"]*(?:\\\\.[^\\\\"]*)*"';
set @single:='\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'';
set @comment1:='/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/';
set @comment2:='//[^\\n]*';

        Concatenate expressions from five independent branches:

set @r:=concat('(',@other,'+','|',@double,@other,'*','|',@single,@other,'*',')','|',@comment1,'|',@comment2);

        There are three points to note here:

  • Any run of @other characters can be treated as a single unit, hence @other+. No later element can force it to backtrack, so there is no risk of a neverending match.
  • After a quoted string, and before the next quoted string or comment, there is very likely some @other text. Appending @other* after each quoted-string branch tells the engine to consume that text immediately, instead of returning to the top of the alternation for another iteration.

        This is akin to the loop-elimination technique: it is faster because it takes charge of the engine's matching, using global knowledge of the match to make a local optimization that gives the engine the conditions it needs to run fast.

        It is important that the @other after each quoted-string subexpression carries a star, while the @other at the head of the alternation carries a plus. If the leading @other used a star, it could match nothing and the branch would "succeed" emptily anywhere. If the @other after a quoted string used a plus, the expression would fail on two consecutive quoted strings.

  • All branches except the comments go inside a single capturing group. If a non-comment branch matches, $1 holds the matched text; if a comment branch matches, $1 is empty. Replacing each match with $1 in the regexp_replace function therefore deletes the comments.

        Finally get the regular expression @r:

mysql> select @r;
+-------------------------------------------------------------------------------------------------------------------+
| @r                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------+
| ([^"'/]+|"[^\\"]*(?:\\.[^\\"]*)*"[^"'/]*|'[^'\\]*(?:\\.[^'\\]*)*'[^"'/]*)|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]* |
+-------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

        The results of matching and removing comments are as follows:

mysql> set @s:=
    -> 'char *CommentStart = "/*"; /* start of comment */
    '> char *CommentEnd = "*/"; // end of comment';
Query OK, 0 rows affected (0.00 sec)

mysql> select regexp_count(@s, @r, '') c, regexp_extract(@s, @r, '') s\G
*************************** 1. row ***************************
c: 6
s: char *CommentStart = ,"/*"; ,/* start of comment */,
char *CommentEnd = ,"*/"; ,// end of comment
1 row in set (0.00 sec)

mysql> select @s, regexp_replace(@s, @r, '$1', 1, 0) c\G
*************************** 1. row ***************************
@s: char *CommentStart = "/*"; /* start of comment */
char *CommentEnd = "*/"; // end of comment
 c: char *CommentStart = "/*"; 
char *CommentEnd = "*/"; 
1 row in set (0.00 sec)

        The leading @other+ can match in only two situations: 1) at the very start of the target string, before any quoted string has been matched; 2) immediately after a comment. You might think of adding @other+ after the comment branches too. That would work, except that here everything to be kept must be matched inside the first pair of parentheses, and an @other+ after a comment branch would fall outside the capturing group.

        So if @other+ were placed after the comment branches, would the leading @other+ still be needed? That depends on the application data: if comments outnumber quoted strings, putting it first makes sense; otherwise, put it later.
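        The whole construction can be sketched in Python. One wrinkle: Python's re.sub raises an error when the replacement references a group that did not participate in the match, so a replacement function stands in for '$1':

```python
import re

# The five branches, mirroring the MySQL variables above.
other = '[^"\'/]'
dquote = r'"[^\\"]*(?:\\.[^\\"]*)*"'
squote = r"'[^'\\]*(?:\\.[^'\\]*)*'"
comment1 = r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/'
comment2 = r'//[^\n]*'

# Everything to keep is captured; the comment branches are not.
pattern = re.compile('(' + other + '+'
                     + '|' + dquote + other + '*'
                     + '|' + squote + other + '*'
                     + ')|' + comment1 + '|' + comment2)

code = ('char *CommentStart = "/*"; /* start of comment */\n'
        'char *CommentEnd = "*/"; // end of comment')
stripped = pattern.sub(lambda m: m.group(1) or '', code)
print(stripped)
```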

Origin blog.csdn.net/wzy0623/article/details/131843801