Extended syntax of Python regular expressions (5)

Non-capturing groups and named groups

A well-designed regular expression may be divided into many groups. These groups not only capture the substrings of interest, they also group and structure the regular expression itself. In a complex regular expression with many groups, it becomes difficult to keep track of the group numbers. Two features help with this problem: non-capturing groups and named groups. Both use a common regular expression extension syntax, so let's first take a look at what that extension syntax is.

The extended syntax of regular expressions:

As we all know, Perl 5 added many powerful features to standard regular expressions. For these features, the Perl developers could not simply pick a new single-character metacharacter or a new special sequence beginning with a backslash, because that would conflict with standard regular expressions. For example, suppose they had chosen & as a new metacharacter (in standard regular expressions & has no special meaning). Every existing expression that contained & would then have to be rewritten, because its author intended & to match as an ordinary character and had not escaped it.

In the end, the Perl developers decided to use (?...) as the extension syntax. A question mark ? immediately after an opening parenthesis ( was previously a syntax error, because there is nothing in front of the ? for it to repeat. This avoids any compatibility problem, since no syntactically correct regular expression could have been written that way. The characters immediately following the ? indicate which extension is being used. For example, (?=foo) is one extension (a positive lookahead assertion), and (?:foo) is another (a non-capturing group containing the subexpression foo).

Python supports several of Perl's extensions and adds an extension syntax of its own on top of them. If the first character after the question mark ? is a P, you can be sure it is a Python-specific extension.

Non-capturing group

First, let's talk about non-capturing groups. Sometimes you need a group to denote part of a regular expression, but you are not interested in retrieving what that group matched. In that case, you can use a non-capturing group to state your intention clearly. The syntax for a non-capturing group is (?:...), where ... can be replaced with any regular expression.

>>> import re
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()

"Capturing" means saving the text that a group matched so that it can be retrieved later. Ordinary groups are capturing groups because the data they match can be read back from the match object.

Apart from the fact that you cannot retrieve what a non-capturing group matched, it behaves exactly like an ordinary group. You can put anything inside it, repeat it with a repetition metacharacter such as *, and nest it within other groups (capturing or non-capturing).

(?:...) is particularly useful when you need to modify an existing pattern, because adding a non-capturing group does not change the numbers of the other (capturing) groups. It is worth mentioning that there is no difference in search speed between capturing and non-capturing groups.
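The numbering point can be checked with a small sketch. The key=value pattern and the optional "set " prefix below are made up for illustration; they do not come from the original article.

```python
import re

# Original pattern: group 1 is the key, group 2 is the value.
original = re.compile(r"(\w+)=(\w+)")

# Modified pattern: the optional "set " prefix sits in a non-capturing
# group, so the key and value keep their numbers 1 and 2.
modified = re.compile(r"(?:set\s+)?(\w+)=(\w+)")

# For comparison: a capturing group for the prefix shifts the numbering.
shifted = re.compile(r"(set\s+)?(\w+)=(\w+)")

m_original = original.match("color=red")
m_modified = modified.match("set color=red")
m_shifted = shifted.match("set color=red")

print(m_original.groups())  # ('color', 'red')
print(m_modified.groups())  # ('color', 'red')
print(m_shifted.groups())   # ('set ', 'color', 'red')
```

Code that reads group(1) and group(2) keeps working with the modified pattern, but would silently read the wrong values with the shifted one.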

Named group

Let's look at another important feature: named groups. Ordinary groups are accessed by their numbers, whereas named groups can be accessed by a meaningful name.

The syntax for a named group is one of the Python-specific extensions: (?P<name>...). Obviously, the name inside < > is the name of the group. Apart from associating a name with the group, a named group behaves exactly like any other capturing group.

All the match object methods that deal with capturing groups accept either integers that refer to the group by number or strings that contain the group's name. Besides being accessible by name, named groups can still be accessed by number:

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'

Named groups are handy because they let you use an easily remembered name instead of a meaningless number. Here is an example from the imaplib module:

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

Obviously, retrieving the match with m.group('zonem') is much more straightforward than having to remember to retrieve group 9.
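As a rough illustration, the pattern can be applied to a sample string. The sample value below is an assumption chosen to fit the pattern, not data from the original article.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical input that fits the pattern.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

# The same substring, fetched by name and by number.
print(m.group('zonem'))  # 00
print(m.group(9))        # 00

# groupdict() returns all named groups as a dictionary.
fields = m.groupdict()
print(fields['mon'], fields['year'])  # Jul 1996
```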

The backreference syntax in an expression such as (...)\1 refers to a group by its number. Naturally, there is a corresponding variant for named groups that uses the name instead of the number. Its extension syntax is (?P=name), which means that the contents of the group called name should be matched again at the current position. The regular expression for finding doubled words, (\b\w+)\s+\1, can therefore also be written as (?P<word>\b\w+)\s+(?P=word):

>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'

Lookahead assertions

Another zero-width assertion we want to explain is the lookahead assertion (also called a forward assertion). Lookahead assertions come in two forms: positive lookahead and negative lookahead.

(?=...)

Positive lookahead assertion. It succeeds if the contained regular expression (represented here by ...) matches at the current position, and fails otherwise. Once the contained expression has been tried, however, the matching engine does not advance at all; the rest of the pattern is tried right where the assertion started.

(?!...)

Negative lookahead assertion. This is the opposite of the positive assertion: it succeeds if the contained expression does not match at the current position, and fails if it does.
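A minimal sketch of the zero-width behaviour, using made-up strings: the assertion checks what follows the current position without consuming it.

```python
import re

# Positive lookahead: "foo" matches only when "bar" follows,
# but "bar" itself is not consumed.
m = re.match(r"foo(?=bar)", "foobar")
print(m.group(), m.end())  # foo 3

# The rest of the pattern resumes where the assertion started,
# so "bar" can still be matched afterwards.
m2 = re.match(r"foo(?=bar)bar", "foobar")
print(m2.group())  # foobar

# Negative lookahead: succeeds only when "bar" does NOT follow.
m3 = re.match(r"foo(?!bar)", "foobar")
m4 = re.match(r"foo(?!bar)", "foobaz")
print(m3)          # None
print(m4.group())  # foo
```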

To make this concrete, let's look at a case where a lookahead is really useful. Consider a simple pattern whose job is to match a file name. As we all know, a file name uses a . to separate the base name from the extension. For example, in fishc.txt, fishc is the base name and .txt is the extension.

This regular expression is actually quite simple: .*[.].*$

Note that the . used as the separator is a metacharacter, so we put it inside a character class, [.], to strip it of its special meaning. Also note the trailing $: it ensures that all of the rest of the string is included in the extension. This regular expression matches fishc.txt, foo.bar, autoexec.bat, sendmail.cf, printers.conf and so on.
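A quick check of this pattern against the file names listed above. README is an extra name added here for contrast, since it has no dot at all.

```python
import re

filename = re.compile(r".*[.].*$")

names = ["fishc.txt", "foo.bar", "autoexec.bat",
         "sendmail.cf", "printers.conf", "README"]

# Keep only the names the pattern accepts.
matched = [n for n in names if filename.match(n)]
print(matched)
# ['fishc.txt', 'foo.bar', 'autoexec.bat', 'sendmail.cf', 'printers.conf']
```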

Now let's consider a more complicated situation. What if you want to match file names whose extension is not bat? How should the regular expression be written?
Let's first look at some attempts that get it wrong:

.*[.][^b].*$

To exclude bat, this first attempt requires that the first character of the extension is not b. That is already a mistake, because the extension of foo.bar also starts with b, so foo.bar is rejected as well.
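The flaw in this first attempt is easy to reproduce. The sketch below only uses file names already mentioned in this article.

```python
import re

first_try = re.compile(r".*[.][^b].*$")

names = ["fishc.txt", "foo.bar", "autoexec.bat", "sendmail.cf"]
accepted = [n for n in names if first_try.match(n)]

# autoexec.bat is rejected as intended, but so is foo.bar.
print(accepted)  # ['fishc.txt', 'sendmail.cf']
```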

In order to make up for the mistake just now, we tried this trick:

.*[.]([^b]..|.[^a].|..[^t])$

We have to admit that this regular expression has become ugly. It accepts an extension if its first character is not b, or its second character is not a, or its third character is not t. That is just enough to accept foo.bar and reject autoexec.bat. However, it also requires the extension to be exactly three characters long, so a name such as sendmail.cf is rejected.

Well, let's fix the problem:

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$

In the third attempt, we made the second and third characters optional. This will match shorter extensions, such as sendmail.cf.
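The second and third attempts can be compared side by side. This is only a sketch over the file names used in this article.

```python
import re

second_try = re.compile(r".*[.]([^b]..|.[^a].|..[^t])$")
third_try = re.compile(r".*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$")

names = ["fishc.txt", "foo.bar", "autoexec.bat", "sendmail.cf"]

second_ok = [n for n in names if second_try.match(n)]
third_ok = [n for n in names if third_try.match(n)]

# The second attempt insists on a three-character extension.
print(second_ok)  # ['fishc.txt', 'foo.bar']

# The third attempt also accepts the shorter extension.
print(third_ok)   # ['fishc.txt', 'foo.bar', 'sendmail.cf']
```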

I have to admit that we have made a mess of things: the regular expression is now hard to understand and extremely ugly!

Even worse, if the requirements change, for example if you want to exclude both the bat and exe extensions, this pattern becomes more complicated still.

Ta-da! Enter the protagonist. A negative lookahead assertion solves the problem:

.*[.](?!bat$).*$

Let's explain what this negative lookahead means: if the expression bat$ does not match at the current position, the rest of the pattern is tried; if bat$ does match, the whole pattern fails (because the assertion is negative). The trailing $ inside (?!bat$) is needed so that extensions that merely begin with bat, such as sample.batch, are still matched.

Similarly, with a negative lookahead it becomes quite easy to exclude the bat and exe extensions at the same time:

.*[.](?!bat$|exe$).*$
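Both lookahead patterns can be tried on a few names. In this sketch, setup.exe and archive.tar.bat are made-up examples, and the stricter variant at the end is an addition that does not appear in this article.

```python
import re

no_bat = re.compile(r".*[.](?!bat$).*$")
no_bat_exe = re.compile(r".*[.](?!bat$|exe$).*$")

names = ["foo.bar", "autoexec.bat", "sample.batch",
         "sendmail.cf", "setup.exe"]

ok_no_bat = [n for n in names if no_bat.match(n)]
ok_no_bat_exe = [n for n in names if no_bat_exe.match(n)]

print(ok_no_bat)      # ['foo.bar', 'sample.batch', 'sendmail.cf', 'setup.exe']
print(ok_no_bat_exe)  # ['foo.bar', 'sample.batch', 'sendmail.cf']

# Caveat: with several dots, the leading .* can backtrack to an earlier
# dot, so a name ending in .bat still slips through.
leaky = no_bat.match("archive.tar.bat")
print(leaky is not None)  # True

# Stricter variant (an assumption, not from the article): [^.]* keeps
# the part after the chosen dot free of further dots.
strict = re.compile(r".*[.](?!bat$)[^.]*$")
print(strict.match("archive.tar.bat"))  # None
```

For names with a single dot, both forms behave the same; the stricter one only matters when a name contains more than one dot.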

Origin blog.csdn.net/CSNN2019/article/details/114484123