A Disaster Caused by a Regular Expression

Introduction: a commercial IT services company suffered a sudden outage that interrupted service to its customers for nearly an hour. The investigation traced the cause to a single regular expression. How can one small regular expression cause such a serious problem?

The investigation found that matching this regular expression exhausted CPU resources, which set off a chain reaction and left downstream services unable to serve external traffic. The regular expression that caused the failure was:

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

The key part is ".*(?:.*=.*)". To see why, you need to know a little about how a regular expression engine works. The "(?:" and its matching ")" form a non-capturing group: the expression inside the parentheses is grouped into a single expression.

To simplify matters, we can ignore the group and reduce ".*(?:.*=.*)" to ".*.*=.*". Reduced like this, the expression looks needlessly complex. What matters is that any expression asking the engine to "match anything followed by anything" can cause catastrophic backtracking, and this often leads to problems.

In a regular expression, "." matches a single character, and ".*" matches zero or more characters greedily, that is, as many as possible. So ".*.*=.*" means: match zero or more characters, then match zero or more characters, then match a literal "=", then match zero or more characters.

Consider the string "x=x". It matches ".*.*=.*": the ".*.*" before the "=" matches the first "x" (one ".*" matches "x" and the other matches zero characters), and the final ".*" matches the last "x".
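The division of labour described above can be checked with Python's built-in re module. This is a small illustrative sketch, not part of the original analysis: the capturing groups are an addition so that each ".*" reports what it ended up matching.

```python
import re

# Capturing groups stand in for the three ".*" parts so we can see
# which characters each one holds once the match succeeds.
pattern = re.compile(r"(.*)(.*)=(.*)")

match = pattern.match("x=x")
print(match.groups())  # ('x', '', 'x')
```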

This successful match takes 23 steps. First, the first ".*" matches the whole string "x=x". When the engine moves on to the second ".*", there are no characters left, so it matches zero characters. The engine then tries to match "=", but because no characters remain, the match fails.

At this point the engine backtracks. The first ".*" now matches only "x=", and the second ".*" matches the remaining "x". When the engine then tries to match "=", no characters remain, so the match fails and the engine backtracks again.

This time the first ".*" still matches "x=", but the second ".*" matches zero characters. The engine then tries to match "=" against the remaining "x", which fails, so it backtracks again.

Next, the first ".*" matches only "x", and the second ".*" greedily matches "=x". As you can guess, the attempt to match "=" fails and the engine backtracks again.

The first ".*" still matches the first "x", and now the second ".*" matches only "=". Once more the engine cannot match the literal "=", because the second ".*" has already consumed it, so it backtracks again.

We will not walk through every failed attempt; here is the one that succeeds. The first ".*" matches "x", the second ".*" matches zero characters, the "=" in the expression matches the "=" in the string, and the third ".*" matches the final "x". The whole expression now matches. Remember that all of this work is for a string of only three characters.
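The order of attempts in this walk-through can be reproduced with a few lines of Python. The sketch below is a simplified model written for illustration, not the Perl engine: trace_attempts is a hypothetical helper that only handles the fixed pattern ".*.*=.*" and lists how the two leading ".*" parts split the input before each test for the literal "=".

```python
def trace_attempts(text):
    """Yield (first, second, ok) for each split a naive greedy backtracker
    tries for the two leading ".*" parts of ".*.*=.*" before testing
    whether the next character is the literal "="."""
    n = len(text)
    for i in range(n, -1, -1):          # first ".*": longest match first
        for j in range(n, i - 1, -1):   # second ".*": longest match first
            ok = j < n and text[j] == "="
            yield text[:i], text[i:j], ok
            if ok:
                return  # the trailing ".*" always matches the rest


attempts = list(trace_attempts("x=x"))
for first, second, ok in attempts:
    print(repr(first), repr(second), "match" if ok else "backtrack")
```

The six lines printed correspond to the six situations described above: five failures followed by the split that succeeds.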

The figure shows all 23 steps of the match, run with a Perl engine; you can see each step and each backtrack as it happens.

What happens if the string changes from "x=x" to "x=xx"? Clearly more backtracking occurs: the match now takes 33 steps. For "x=xxx" it takes 45. The growth in the number of steps is not linear. For "x=xxxxxxxxxxxxxxxxxxxx" ("=" followed by 20 "x" characters), the match takes 555 steps, as the chart shows. (If the string does not begin with "x=", the engine performs 4947 steps before concluding that no match is possible.)

The original post includes an animation of the matching process for the same string "x=xxxxxxxxxxxxxxxxxxxx".

This is bad: when the input gets only slightly longer, the matching time grows super-linearly. With a slightly different regular expression, the situation would be even worse. Consider ".*.*=.*;", which differs from the expression above only by a trailing ";". Such an expression might well be written to match a string like "foo=bar;".

Matching "x=x" against this expression takes 90 steps rather than 23, and with 20 "x" characters after the "=" it takes 5353 steps. The corresponding chart shows very rapid growth on the Y-axis.
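The growth pattern can be explored with a toy backtracking matcher. The sketch below is a model built on stated assumptions, not the Perl engine: count_steps and its token lists are hypothetical, it understands only greedy ".*" and single literal characters, and it counts one step per attempt to match a token at a position. Its absolute counts therefore differ from the 23, 555 and 5353 quoted above, but it shows the same shape: super-linear growth, and extra work when the trailing ";" forces the match to fail.

```python
def count_steps(tokens, text):
    """Return (matched, steps) for a naive backtracking match of `tokens`
    anchored at the start of `text`. Each token is ".*" (greedy) or a
    single literal character; `steps` counts every attempt to match a
    token (or reach the end of the pattern) at some position."""
    steps = 0

    def match_from(k, pos):
        nonlocal steps
        steps += 1
        if k == len(tokens):
            return True
        if tokens[k] == ".*":
            # Greedy: take as much as possible, then give characters back.
            for end in range(len(text), pos - 1, -1):
                if match_from(k + 1, end):
                    return True
            return False
        # Literal character: it must match here before we can continue.
        return (pos < len(text) and text[pos] == tokens[k]
                and match_from(k + 1, pos + 1))

    return match_from(0, 0), steps


plain = [".*", ".*", "=", ".*"]
with_semicolon = plain + [";"]

for extra in (1, 5, 10, 20, 40):
    text = "x=" + "x" * extra
    print(extra, count_steps(plain, text), count_steps(with_semicolon, text))
```

Doubling the number of trailing "x" characters more than doubles the step count in this model, which is the non-linear behaviour the text describes.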

 

 

Similarly, the original post includes an animation of all 5353 steps of this matching process.

Using lazy rather than greedy matching reduces the amount of backtracking. If the expression is changed to ".*?.*?=.*?", matching "x=x" takes 11 steps instead of 23, because the "?" after each ".*" tells the engine to match as few characters as possible before moving on to the next part of the pattern.

But laziness does not completely solve the problem. If ".*.*=.*;" is changed to ".*?.*?=.*?;", matching "x=x" still takes 555 steps, and matching "x=xxxxxxxxxxxxxxxxxxxx" still takes 5353 steps.
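The difference between the two strategies can be seen with Python's re module. This is an illustrative sketch only: the capturing groups and the test string "x=xxx" are additions, and Python's engine does not report the step counts quoted above.

```python
import re

greedy = re.match(r"(.*)(.*)=(.*)", "x=xxx")
lazy = re.match(r"(.*?)(.*?)=(.*?)", "x=xxx")

print(greedy.groups())  # ('x', '', 'xxx')
print(lazy.groups())    # ('', 'x', '')

# With a trailing ";" that the input lacks, no split can succeed, so a
# backtracking engine has to exhaust the alternatives whether the
# quantifiers are greedy or lazy.
greedy_fail = re.match(r".*.*=.*;", "x=xxx")
lazy_fail = re.match(r".*?.*?=.*?;", "x=xxx")
print(greedy_fail, lazy_fail)  # None None
```

The lazy version stops as soon as the "=" has been matched, which is why it needs fewer steps when a match exists. When no match exists, laziness only changes the order in which the alternatives are tried.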

The real solution, short of rewriting the pattern to be more specific, is to move away from a regular expression engine that uses this backtracking mechanism. In fact, a solution to this kind of problem has existed since 1968, in Ken Thompson's paper "Programming Techniques: Regular expression search algorithm".
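The idea of matching without backtracking can be sketched in a few lines. The code below is a minimal illustration in the spirit of that approach, not Thompson's actual construction: nfa_match is a hypothetical helper that understands only ".*" and single literal characters, and it assumes "." matches every character, including newlines. Instead of trying one split at a time, it tracks the set of all pattern positions that could be active after each input character.

```python
def nfa_match(tokens, text):
    """Report whether `tokens` match at the start of `text` (like a
    prefix match), by tracking the set of reachable pattern positions
    instead of backtracking. Each token is ".*" or one literal character."""
    accept = len(tokens)

    def closure(states):
        # A ".*" may match zero characters, so the position after it
        # is reachable without consuming any input.
        stack, seen = list(states), set(states)
        while stack:
            k = stack.pop()
            if k < accept and tokens[k] == ".*" and k + 1 not in seen:
                seen.add(k + 1)
                stack.append(k + 1)
        return seen

    states = closure({0})
    for ch in text:
        if accept in states:
            return True   # some prefix of the text already matches
        if not states:
            return False  # no position is still alive
        nxt = set()
        for k in states:
            if tokens[k] == ".*":
                nxt.add(k)        # ".*" consumes ch and stays active
            elif tokens[k] == ch:
                nxt.add(k + 1)    # the literal consumes ch
        states = closure(nxt)
    return accept in states


print(nfa_match([".*", ".*", "=", ".*"], "x=" + "x" * 20))
print(nfa_match([".*", ".*", "=", ".*", ";"], "x=" + "x" * 20))
```

Each input character is examined once, and the work per character is bounded by the number of tokens, so the running time grows linearly with the input length regardless of whether the match succeeds.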

 

This excerpt is translated from https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/#appendix-about-regular-expression-backtracking

Origin www.cnblogs.com/029zz010buct/p/11431628.html