Regular expressions: options, groups, backreferences, and capturing and non-capturing groups (5) (1000)

Foreword

As before, I will go straight to examples.

There is no part four yet: when I reread it, I found that I had misunderstood parts of it, and it is still being revised.

Main text

Options

Partial case insensitivity

Sometimes we want matching to ignore case.

Of course, with a library such as Python's, we can explicitly ask for case-insensitive matching.

But sometimes the requirement is that only part of the pattern is case-insensitive.

For example, in the word the, we want t to be case-sensitive and he to be case-insensitive. How do we write that?

We use (?i) to turn on case insensitivity, so we can write t(?i)he, and that's it.
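As a minimal sketch of the same idea in Python: since version 3.11 the `re` module only accepts a bare `(?i)` at the very start of a pattern, so this example assumes the scoped form `(?i:...)` instead, which limits the flag to the part inside the group. The sample strings are made up for illustration.

```python
import re

# Scoped inline flag: only "he" ignores case; the leading "t" must be lowercase.
pattern = re.compile(r"t(?i:he)")

samples = ["the", "tHE", "tHe", "The", "THE"]
results = {s: bool(pattern.fullmatch(s)) for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```

Only the strings that start with a lowercase t match; the case of h and e does not matter.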

Other selection operations

Note that the regex dialect of each language differs in some ways.

Here are several options that are widely supported:

(?m) means multi-line matching; if that is unclear, see part three of this series.

(?s) means single-line matching: the dot also matches a newline.

(?x) means extended mode: whitespace in the pattern is ignored and comments are allowed.

There are many more; I only mention these here. Since each language is different, it is best to check the documentation of the library you are using.
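The three options above can be tried out in Python roughly as follows. This is only a sketch: the sample text and variable names are invented for illustration, and other engines may behave slightly differently.

```python
import re

text = "first line\nsecond line"

# (?m): ^ and $ match at the start and end of every line.
line_starts = re.findall(r"(?m)^\w+", text)

# (?s): the dot also matches a newline, so .+ spans both lines.
without_s = re.match(r".+", text).group(0)
with_s = re.match(r"(?s).+", text).group(0)

# (?x): whitespace in the pattern is ignored and # starts a comment.
verbose = re.compile(r"""(?x)
    (\w+)   # a word
    \s      # one whitespace character
    (\w+)   # another word
""")
words = verbose.match(text).groups()

print(line_starts, repr(without_s), repr(with_s), words)
```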

Subpatterns

What is a subpattern?

A subpattern is one or more groups inside a pattern; in other words, a pattern within a pattern.

That may be a little hard to follow, so here is an example:

THE RIME OF THE ANCYENT MARINERE

I want to match this line. How?
With a regex, I can of course match it literally:

THE RIME OF THE ANCYENT MARINERE

Now I write it like this instead:

(THE) (RIME) (OF) (THE) (ANCYENT) (MARINERE)


This still matches the whole line, but each parenthesized part is now also a group of its own. These are subpatterns.
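A short Python sketch of the grouped pattern, to show that the whole line and each group are both available after one match (the variable names are only illustrative):

```python
import re

line = "THE RIME OF THE ANCYENT MARINERE"
match = re.fullmatch(r"(THE) (RIME) (OF) (THE) (ANCYENT) (MARINERE)", line)

whole = match.group(0)   # the entire match
groups = match.groups()  # one entry per parenthesized subpattern

print(whole)
print(groups)
```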

Capturing and backreferences

it is a dog

Regex:

(it is) (a dog)

Now the two parts are captured as groups 1 and 2.
Replacing with $2 $1 gives: a dog it is.

In addition, we can write \U$2\E $1, which gives: A DOG it is.

\U converts the text that follows to uppercase, and \E ends the conversion.

There are others as well: \u uppercases the next character, \l lowercases the next character, and \L converts the text that follows to lowercase.

Java and some other high-level languages do not support these natively, so you need a library, but they are part of the common regex conventions.
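Python is one of the languages without \U. As a sketch of how the same two results can be produced there: the `re` module writes backreferences in the replacement as `\1` and `\2` rather than `$1` and `$2`, and the case conversion is done here with a replacement function, which is a substitute technique rather than a built-in escape.

```python
import re

pattern = re.compile(r"(it is) (a dog)")
text = "it is a dog"

# Swap the two captured groups.
swapped = pattern.sub(r"\2 \1", text)

# No \U in Python's re: uppercase group 2 inside a replacement function instead.
shouted = pattern.sub(lambda m: m.group(2).upper() + " " + m.group(1), text)

print(swapped)
print(shouted)
```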

Named groups

Groups can also be given names. A named group is referenced like this:

$+{one}

That is, with $+{xx}, where xx is the name of the group.

The syntax for named groups differs between languages; in Python it is (?P<name>...). Check the documentation for your language.
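In Python, a sketch of named groups might look like this. The name one comes from the reference above; the second name two and the sample sentence are assumptions made for the example. In a replacement string, Python refers to a named group with `\g<name>`.

```python
import re

pattern = re.compile(r"(?P<one>it is) (?P<two>a dog)")
text = "it is a dog"

match = pattern.match(text)
one = match.group("one")  # look a group up by its name

# Refer to the groups by name in the replacement.
swapped = pattern.sub(r"\g<two> \g<one>", text)

print(one)
print(swapped)
```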

Non-capturing groups

Non-capturing groups are the counterpart of capturing groups.

A capturing group stores the text it matched in memory; a non-capturing group does not, so it performs better.

Here is an example I found online:

var str = "a1***ab1cd2***c2";
var reg1 = /((ab)+\d+)((cd)+\d+)/i;
var reg2 = /((?:ab)+\d+)((?:cd)+\d+)/i;
alert(str.match(reg1));//ab1cd2,ab1,ab,cd2,cd
alert(str.match(reg2));//ab1cd2,ab1,cd2

(?:xx), with ?: right after the opening parenthesis, is a non-capturing group.

In other words, (?:xx) lets us use parentheses to structure a regex more clearly without capturing anything.

Here is a C# example as well:
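The same comparison can be sketched in Python, reusing the string and the two patterns from the JavaScript example (the variable names are illustrative):

```python
import re

text = "a1***ab1cd2***c2"

capturing = re.search(r"((ab)+\d+)((cd)+\d+)", text, re.IGNORECASE)
non_capturing = re.search(r"((?:ab)+\d+)((?:cd)+\d+)", text, re.IGNORECASE)

# Both patterns match the same text; only the number of stored groups differs.
print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
```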

using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string pattern = @"(?:\b(?:\w+)\W*)+\.";
      string input = "This is a short sentence.";
      Match match = Regex.Match(input, pattern);
      Console.WriteLine("Match: {0}", match.Value);
      for (int ctr = 1; ctr < match.Groups.Count; ctr++)
         Console.WriteLine("   Group {0}: {1}", ctr, match.Groups[ctr].Value);
   }
}
// The example displays the following output:
//       Match: This is a short sentence.

Atomic grouping

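An atomic group turns off backtracking inside the group.

Many people are unclear about backtracking, so here is an explanation:

Ordinarily, if a regular expression contains an optional or alternative matching pattern and a match attempt does not succeed, the regular expression engine can branch in multiple directions to match the input string against the pattern.
If no match is found using the first branch, the engine can back up, or backtrack, to the point where it took the first branch and try the second branch instead.
This process can continue until all branches have been tried.

I found an example: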

using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string[] inputs = { "cccd.", "aaad", "aaaa" };
      string back = @"(\w)\1+.\b";
      string noback = @"(?>(\w)\1+).\b";
      foreach (string input in inputs)
      {
         Match match1 = Regex.Match(input, back);
         Match match2 = Regex.Match(input, noback);
         Console.WriteLine("{0}: ", input);

         Console.Write("   Backtracking : ");
         if (match1.Success)
            Console.WriteLine(match1.Value);
         else
            Console.WriteLine("No match");

         Console.Write("   Nonbacktracking: ");
         if (match2.Success)
            Console.WriteLine(match2.Value);
         else
            Console.WriteLine("No match");
      }
   }
}
// The example displays the following output:
//    cccd.:
//       Backtracking : cccd
//       Nonbacktracking: cccd
//    aaad:
//       Backtracking : aaad
//       Nonbacktracking: aaad
//    aaaa:
//       Backtracking : aaaa
//       Nonbacktracking: No match

Why isn't the last input, aaaa, matched?

Analysis:

@"(?>(\w)\1+).\b"

(\w) is a capturing group, so it captures the first a.
\1+ then matches one or more further a characters. It is greedy, so the group consumes all of aaaa. After that, . (any character) has nothing left to match. Without the atomic group the engine would backtrack and let \1+ give back one a, but the atomic group forbids that, so the attempt fails. Starting the match at a later position fails for the same reason.

Summary

The above is only my own understanding; if anything is wrong, please point it out.
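The behaviour can be reproduced in Python with a small sketch. Python 3.11 and later accept `(?>...)` directly; to stay runnable on older versions, this example instead assumes the common emulation of an atomic group, a lookahead that captures the run followed by a backreference that consumes it. This is a substitute technique, not the .NET construct itself, and the helper `first_match` is a hypothetical name.

```python
import re

BACKTRACKING = r"(\w)\1+.\b"

# Emulated atomic group: the lookahead captures the run into group 1, and
# \1 then consumes exactly that text. The engine does not re-enter a
# lookahead that has already succeeded, so the run is never shortened.
ATOMIC_LIKE = r"(?=((\w)\2+))\1.\b"


def first_match(pattern, text):
    """Return the first matched text, or None when there is no match."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


results = {
    text: (first_match(BACKTRACKING, text), first_match(ATOMIC_LIKE, text))
    for text in ("cccd.", "aaad", "aaaa")
}

for text, (with_backtracking, without_backtracking) in results.items():
    print(text, with_backtracking, without_backtracking)
```

For aaaa the backtracking pattern succeeds because the run gives one a back to the dot, while the atomic-style pattern has no character left for the dot and finds nothing.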

Origin www.cnblogs.com/aoximin/p/12748670.html