Regular Expressions (2) - Zero-Width Assertions, Lazy Matching, and Balancing Groups

The role of parentheses

Category                Syntax          Description

Capture
      (exp)           Matches exp and captures the text into an automatically numbered group
      (?<name>exp)    Matches exp and captures the text into a group named name; can also be written as (?'name'exp)
      (?:exp)         Matches exp, does not capture the matched text, and does not assign a group number to the group
Zero-width assertion
      (?=exp)         Matches the position before exp (a position that is followed by exp)
      (?<=exp)        Matches the position after exp (a position that is preceded by exp)
      (?!exp)         Matches a position that is not followed by exp
      (?<!exp)        Matches a position that is not preceded by exp
Comment
      (?#comment)     Has no effect on how the regular expression is processed; it only provides a comment for human readers

It is important to note that a zero-width assertion consumes no characters: it only checks a position, so the text it tests is not included in the match result.
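To see that an assertion really takes up no characters, here is a minimal sketch. The class ZeroWidthDemo, the helper PixelValue and the sample strings are made up for illustration; only the Regex and Match members come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

class ZeroWidthDemo
{
    // Hypothetical helper: finds digits that are immediately followed by "px".
    public static Match PixelValue(string css)
    {
        // (?=px) only checks that "px" comes next; it does not consume it.
        return Regex.Match(css, @"\d+(?=px)");
    }

    static void Main()
    {
        Match m = PixelValue("width: 100px");
        Console.WriteLine(m.Value);   // 100
        Console.WriteLine(m.Length);  // 3, the "px" is not part of the match
        Console.WriteLine(m.Index);   // 7

        // Lookbehind works the same way on the other side of the match.
        Match p = Regex.Match("price: $42", @"(?<=\$)\d+");
        Console.WriteLine(p.Value);   // 42
    }
}
```

The match length is 3 even though the pattern "looked at" five characters, which is exactly what zero width means here.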

(?:exp) neither captures the matched text nor assigns a group number to the group, so what is it good for?

(?:exp) is a non-capturing group: it matches exp like an ordinary group, but does not store the matched text in a group.

Generally speaking, its purpose is to save resources and improve efficiency.
For example, to verify whether the input is an integer, you can write:
^([1-9][0-9]*|0)$
Here we need () to limit the scope of "|" (the "or" relationship), but we only want to test the rule and have no need to save the text matched by exp into a group, so a non-capturing group can be used instead:
^(?:[1-9][0-9]*|0)$

  Sometimes we have to use (), and by default () captures the text matched by exp into a group. When we only want to test a rule, or never refer back to the text matched by this () later, there is no need to capture it. Capturing it anyway wastes resources and lowers efficiency, so a non-capturing group is used in this case.
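As a rough illustration of the integer check above, the following sketch wraps the non-capturing pattern in a small helper and compares the group counts of the two variants. The class NonCapturingDemo and the helper IsInteger are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

class NonCapturingDemo
{
    // Hypothetical helper: true for 0 or for a digit string without a leading zero.
    public static bool IsInteger(string input)
    {
        return Regex.IsMatch(input, @"^(?:[1-9][0-9]*|0)$");
    }

    static void Main()
    {
        Console.WriteLine(IsInteger("120"));  // True
        Console.WriteLine(IsInteger("012"));  // False
        Console.WriteLine(IsInteger("0"));    // True

        // Groups[0] is always the whole match; a capturing () adds Groups[1].
        Console.WriteLine(Regex.Match("120", @"^([1-9][0-9]*|0)$").Groups.Count);    // 2
        Console.WriteLine(Regex.Match("120", @"^(?:[1-9][0-9]*|0)$").Groups.Count);  // 1
    }
}
```

Both patterns accept exactly the same inputs; the only difference is whether the engine keeps a copy of the matched text in group 1.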

These constructs are hard to explain in words alone, and staring at the symbols does not help much. It is best to look at examples.

using System;
using System.Text.RegularExpressions;

class Program
{
    // Regular expressions and their terminology sound intimidating, but they are actually very simple
    static void Main(string[] args)
    {
        // (exp) matches exp and captures the text into an automatically numbered group
        Regex reg = new Regex(@"A(\w+)A");
        Console.WriteLine(reg.Match("dsA123A"));            // Output: A123A
        Console.WriteLine(reg.Match("dsA123A").Groups[1]);  // Output: 123

        // (?<name>exp) matches exp and captures the text into a group named name; can also be written as (?'name'exp)
        Regex reg2 = new Regex(@"A(?<num>\w+)A");
        Console.WriteLine(reg2.Match("dsA123A").Groups["num"]); // Output: 123

        // (?:exp) matches exp without capturing it, so only the whole match is available
        Regex reg3 = new Regex(@"A(?:\w+A)");
        Console.WriteLine(reg3.Match("dsA123A"));           // Output: A123A

        Console.WriteLine("==============================");

        // (?=exp) zero-width positive lookahead assertion: matches the position before exp
        // Meaning: sing matches only if it is followed by ing; the text tested by the assertion is not part of the match
        Regex reg4 = new Regex(@"sing(?=ing)");
        Console.WriteLine(reg4.Match("ksingkksingingkkk"));      // Output: sing
        Console.WriteLine(reg4.Match("singddddsingingd").Index); // Output: 8, so the first sing was not matched

        // (?<=exp) zero-width positive lookbehind assertion: matches the position after exp
        Regex reg5 = new Regex(@"(?<=wo)man");
        Console.WriteLine(reg5.Match("Hi man Hi woman"));        // Output: man
        Console.WriteLine(reg5.Match("Hi man Hi woman").Index);  // Output: 12, the man inside woman

        // (?!exp) zero-width negative lookahead assertion: matches a position that is not followed by exp
        Regex reg6 = new Regex(@"sing(?!ing)");
        Console.WriteLine(reg6.Match("singing-singabc"));        // Output: sing
        Console.WriteLine(reg6.Match("singing-singabc").Index);  // Output: 8, the sing in singabc

        // (?<!exp) zero-width negative lookbehind assertion: matches a position that is not preceded by exp
        Regex reg7 = new Regex(@"(?<!wo)man");
        Console.WriteLine(reg7.Match("Hi woman Hi man"));        // Output: man
        Console.WriteLine(reg7.Match("Hi woman Hi man").Index);  // Output: 12, the standalone man

        // (?#comment) has no effect on how the regular expression is processed; it only provides a comment for human readers
        Regex reg8 = new Regex("ABC(?#This is just a comment)DEF");
        Console.WriteLine(reg8.Match("ABCDEFG"));   // Output: ABCDEF
    }
}

Lazy matching

Syntax      Description
*?          Repeat any number of times, but as few times as possible
+?          Repeat 1 or more times, but as few times as possible
??          Repeat 0 or 1 times, but as few times as possible
{n,m}?      Repeat n to m times, but as few times as possible
{n,}?       Repeat n or more times, but as few times as possible

If you look closely, you will see that a lazy quantifier is just the original quantifier with a ? appended, meaning "match as few times as possible".
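A common place where the difference matters is extracting one delimited item from a string that contains several. The sketch below is only an illustration: the class LazyDemo, the helper FirstBold and the sample HTML are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

class LazyDemo
{
    // Hypothetical helper: returns the first <b>...</b> match, greedy or lazy.
    public static string FirstBold(string html, bool lazy)
    {
        string pattern = lazy ? @"<b>.+?</b>" : @"<b>.+</b>";
        return Regex.Match(html, pattern).Value;
    }

    static void Main()
    {
        string html = "<b>one</b> and <b>two</b>";

        // Greedy .+ runs to the last </b> in the string.
        Console.WriteLine(FirstBold(html, false)); // <b>one</b> and <b>two</b>

        // Lazy .+? stops at the first </b> that lets the match succeed.
        Console.WriteLine(FirstBold(html, true));  // <b>one</b>
    }
}
```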

using System;
using System.Text.RegularExpressions;

class Program
{
    // Regular expressions and their terminology sound intimidating, but they are actually very simple
    static void Main(string[] args)
    {
        // Greedy matching is the default: match as much as possible
        Regex reg1 = new Regex(@"A(\w)*B");
        Console.WriteLine(reg1.Match("A12B34B56B"));    // Output: A12B34B56B

        Regex reg2 = new Regex(@"A(\w)*?B");            // \w repeats any number of times, but as few times as possible
        Console.WriteLine(reg2.Match("A12B34B56B"));    // Output: A12B

        Regex reg3 = new Regex(@"A(\w)+?");             // \w repeats 1 or more times, but as few times as possible
        Console.WriteLine(reg3.Match("AB12B34B56B"));   // Output: AB (note the different test string)

        Regex reg4 = new Regex(@"A(\w)??B");            // \w repeats 0 or 1 times, but as few times as possible
        Console.WriteLine(reg4.Match("A12B34B56B"));    // Output is empty: the match fails because \w would have to repeat twice between A and B
        Console.WriteLine(reg4.Match("A1B2B34B56B"));   // Output: A1B

        Regex reg5 = new Regex(@"A(\w){4,10}?B");       // \w repeats at least 4 and at most 10 times, but as few times as possible
        // Output: A1B2B3B. After the minimum of 4 repetitions (1B2B) the next character is 3, not B,
        // so \w takes one more character (3) and then the B after it matches
        Console.WriteLine(reg5.Match("A1B2B3B4B5B"));

        Regex reg6 = new Regex(@"A(\w){4,}?");          // \w repeats at least 4 times with no upper limit, but as few times as possible
        Console.WriteLine(reg6.Match("A1B2B3B4B5B"));   // Output: A1B2B, the lazy quantifier stops at the minimum of 4 repetitions

        Console.ReadKey();
    }
}

Balancing groups

A balancing group is used to match content in which the opening and closing symbols occur in equal numbers, that is, in properly nested pairs.
  For example, in the string "xx <aa <bbb> <bbb> aa> yy>" the numbers of < and > are not equal (three < and four >). The simple pattern <.+> matches everything from the first opening bracket < to the last closing bracket >, so the opening and closing brackets in the result do not pair up. If you want to match only text whose opening and closing brackets are correctly paired, you need balancing groups.

Balancing group syntax:
  (?'group'exp) Captures the text matched by exp under the name group and pushes it onto the stack
  (?'-group'exp) Matches exp and pops the most recent capture of group off the stack; if the stack is already empty, this group fails to match
  (?(group)yes|no) If the stack still holds a capture named group, continue matching with the yes part; otherwise continue with the no part
  (?!) Zero-width negative lookahead assertion with an empty expression; because the empty expression always matches, this assertion always fails

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Balancing groups: we want to match the outermost correctly paired brackets
        // The target is <aa <bbb> <bbb> aa>; note that the numbers of < and > are not equal
        string strTag = "xx <aa <bbb> <bbb> aa> yy>";
        Regex reg = new Regex("<.+>");
        // Output: <aa <bbb> <bbb> aa> yy>
        // This is not the desired target: greedy .+ runs to the last >, and the numbers of < and > differ
        Console.WriteLine(reg.Match(strTag));

        Regex reg3 = new Regex("<[^<>]*(((?'Open'<)[^<>]*)+((?'-Open'>)[^<>]*)+)*(?(Open)(?!))>");
        Console.WriteLine(reg3.Match(strTag)); // Output: <aa <bbb> <bbb> aa>, the desired target

        // The most common use of balancing groups is matching HTML; this pattern matches a DIV together with its nested DIVs
        Regex reg2 = new Regex(@"<div[^>]*>[^<>]*(((?'Open'<div[^>]*>)[^<>]*)+((?'-Open'</div>)[^<>]*)+)*(?(Open)(?!))</div>");
        string str = "<a href='http://www.baidu.com'></a><div id='div1'><div id='div2'>Are you doing well in another country?</div></div><p></p>";
        Console.WriteLine(reg2.Match(str)); // Output: <div id='div1'><div id='div2'>Are you doing well in another country?</div></div>
        Console.ReadKey();
    }
}
Syntax explanation:
<                         #The outermost opening bracket
    [^<>]*                #Content after the outermost opening bracket that is not a bracket
    (
        (
            (?'Open'<)    #On an opening bracket, write an "Open" on the blackboard
            [^<>]*        #Match the non-bracket content after the opening bracket
        )+
        (
            (?'-Open'>)   #On a closing bracket, erase one "Open"
            [^<>]*        #Match the non-bracket content after the closing bracket
        )+
    )*
    (?(Open)(?!))         #Before the outermost closing bracket, check whether any "Open" on the blackboard has not been erased; if so, the match fails
>                         #The outermost closing bracket
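The same blackboard idea carries over to other bracket pairs. As a hedged sketch, the pattern below adapts it to round parentheses and anchors it with ^ and $ so that it validates a whole string. The class BalancedDemo and the helper IsBalanced are hypothetical names, and the pattern is one possible adaptation rather than the only way to write it.

```csharp
using System;
using System.Text.RegularExpressions;

class BalancedDemo
{
    // Same structure as the bracket pattern, with ( and ) in place of < and >.
    private static readonly Regex Balanced = new Regex(
        @"^[^()]*(((?'Open'\()[^()]*)+((?'-Open'\))[^()]*)+)*(?(Open)(?!))$");

    // Hypothetical helper: true when every ( has a matching ) in the right order.
    public static bool IsBalanced(string text)
    {
        return Balanced.IsMatch(text);
    }

    static void Main()
    {
        Console.WriteLine(IsBalanced("(a(b)c)"));  // True
        Console.WriteLine(IsBalanced("(a(b)c"));   // False, one "Open" is never erased
        Console.WriteLine(IsBalanced("a)b("));     // False, ) arrives while the stack is empty
        Console.WriteLine(IsBalanced("no parens")); // True
    }
}
```

Balancing groups are specific to the .NET regex engine, so this pattern is not expected to work unchanged in other regex flavors.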

 
