Zero-width assertions in Python regular expressions (4)

Overview

Some metacharacters do not match any characters; they simply succeed or fail, so they are also called zero-width assertions. For example, \b asserts that the current position is a word boundary, but \b does not change the position. For this reason a zero-width assertion should never be repeated: because \b does not advance the current position, \b\b is no different from \b.

Explanation:
Many people may not understand what "changing the position" and "zero-width assertion" mean, so here is my attempt to explain. Take abc as an example: after a has been matched, the current position moves forward so that matching can continue with b, and so on. With \babc, on the other hand, \b only asserts that the current position is a word boundary (just before the first letter of a word or just after its last letter). The current position does not change, and a is then matched against the character at that same position.
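The idea can be checked directly. The following is a minimal sketch using only the standard re module; the sample text and variable names are illustrative assumptions, not part of the original article. It shows that a match for \b alone has an empty span, and that doubling the assertion changes nothing.

```python
import re

text = 'abc def'

# \b succeeds at position 0 without consuming anything: the span is empty.
boundary = re.search(r'\b', text)
print(boundary.span())                 # (0, 0)

# Repeating the assertion tests the same position twice,
# so the result is identical to using it once.
single = re.search(r'\babc', text)
double = re.search(r'\b\babc', text)
print(single.span(), double.span())    # (0, 3) (0, 3)
```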

|

The "or" operator: it performs an OR operation on two regular expressions. If A and B are regular expressions, A|B matches any string that matches either A or B. So that it works sensibly with multi-character alternatives, | has very low precedence. For example, Fish|C matches either Fish or C, not Fis followed by 'h' or 'C'.

To match the '|' character itself, use \| or put it inside a character class, as in [|].
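A short sketch of the precedence rule, assuming the Fish|C pattern from the text. The grouped variant Fis(h|C) is added here only for contrast (parentheses are covered in the Grouping section below), and the variable names are illustrative.

```python
import re

# | has the lowest precedence, so the alternatives are all of 'Fish' and all of 'C'.
pattern = re.compile(r'Fish|C')
print(pattern.fullmatch('Fish') is not None)   # True
print(pattern.fullmatch('C') is not None)      # True
print(pattern.fullmatch('FisC') is not None)   # False

# To get "Fis followed by h or C", the alternation has to be grouped explicitly.
grouped = re.compile(r'Fis(h|C)')
print(grouped.fullmatch('FisC') is not None)   # True

# A literal '|' needs a backslash or a character class.
escaped = re.findall(r'\|', 'a|b|c')
in_class = re.findall(r'[|]', 'a|b|c')
print(escaped, in_class)                       # ['|', '|'] ['|', '|']
```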

^

Matches at the start of the string. If the MULTILINE flag is set, it also matches at the start of each line, that is, immediately after every newline.
For example, if you only want to match the word From at the beginning of a string, the regular expression can be written as ^From:

>>> import re
>>> print(re.search('^From', 'From Here to'))
<_sre.SRE_Match object; span=(0, 4), match='From'>
>>> print(re.search('^From', 'Reciting From Memory'))
None
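The effect of the MULTILINE flag on ^ can be sketched as follows. The two-line sample string is an assumption made for illustration.

```python
import re

text = 'From Here\nFrom There'

# Without MULTILINE, ^ matches only at the very start of the string.
plain = re.findall(r'^From', text)
print(plain)       # ['From']

# With MULTILINE, ^ also matches right after each newline.
multi = re.findall(r'^From', text, re.MULTILINE)
print(multi)       # ['From', 'From']
```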


$

Matches at the end of the string, or just before a newline at the end of the string. If the MULTILINE flag is set, it also matches just before every newline.

>>> print(re.search('}$', '{block}'))  
<_sre.SRE_Match object; span=(6, 7), match='}'>
>>> print(re.search('}$', '{block} '))
None
>>> print(re.search('}$', '{block}\n'))  
<_sre.SRE_Match object; span=(6, 7), match='}'>
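A small sketch of how MULTILINE changes $, using an invented two-line string. It records the start offset of every '}' that is followed by a position where $ succeeds.

```python
import re

text = '{a}\n{b}\n'

# Without MULTILINE, $ succeeds only at the end of the string
# (here: just before the final newline).
plain = [m.start() for m in re.finditer(r'}$', text)]
print(plain)    # [6]

# With MULTILINE, $ also succeeds just before every newline.
multi = [m.start() for m in re.finditer(r'}$', text, re.MULTILINE)]
print(multi)    # [2, 6]
```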


To match the '$' character itself, use \$ or put it inside a character class, as in [$].

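# Raw strings keep the backslash in \$ from being read as a string escape.
# Both patterns match '}$' at span (6, 8).
print(re.search(r'}[$]', '{block}$'))
print(re.search(r'}\$', '{block}$'))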


\A

Matches only at the start of the string. When the MULTILINE flag is not set, \A and ^ behave identically. When MULTILINE is set they differ: \A still matches only at the start of the string, while ^ also matches at the start of every line.

\Z

Matches only at the end of the string. Unlike $, it does not match before a trailing newline.
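The difference between the string anchors \A and \Z and the line-aware ^ and $ can be sketched like this. The sample text is an assumption chosen to end with a newline.

```python
import re

text = 'first line\nsecond line\n'

# ^ follows the MULTILINE flag; \A never does.
caret = re.findall(r'^\w+', text, re.MULTILINE)
start_anchor = re.findall(r'\A\w+', text, re.MULTILINE)
print(caret)          # ['first', 'second']
print(start_anchor)   # ['first']

# $ tolerates a trailing newline; \Z requires the true end of the string.
dollar = re.search(r'line$', text)
end_anchor = re.search(r'line\Z', text)
print(dollar is not None)       # True
print(end_anchor is not None)   # False
```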

\b

Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A "word" is defined as a sequence of alphanumeric characters, so the end of a word is marked by whitespace or a non-alphanumeric character.

In the following example, class matches only when it appears as a complete word; it does not match when it is part of another word.

>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))  
<_sre.SRE_Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None


When using these special sequences, there are two points to note. The first is that Python string literals and regular expressions conflict on some characters (recall the earlier backslash example). For example, in a Python string literal \b represents the backspace character (ASCII value 8). So if you do not use a raw string, Python converts \b into a backspace character, and the pattern will certainly not behave as you expected.

In the following example we deliberately omit the raw-string prefix r, and the results are indeed very different:

>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b' + 'class' + '\b'))  
<_sre.SRE_Match object; span=(0, 7), match='\x08class\x08'>


The second point is that this assertion cannot be used inside a character class. There, just as in a Python string literal, \b represents only the backspace character.
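A brief sketch of the second point, with illustrative variable names. Even in a raw string, \b inside square brackets is read by the regular expression engine as a backspace character.

```python
import re

# Inside [...], \b means the backspace character (code 8), not a word boundary.
backspace_class = re.compile(r'[\b]')

no_backspace = backspace_class.search('no class at all')
with_backspace = backspace_class.search('a\x08b')

print(no_backspace)              # None
print(with_backspace.span())     # (1, 2)
```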

\B

Another zero-width assertion, the opposite of \b: \B matches only at positions that are not word boundaries.
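As a rough illustration, the three test strings from the \b example can be reused with \B on both sides of class. The pattern below is an assumption made for contrast; it matches only when class is surrounded by other word characters.

```python
import re

# \Bclass\B requires word characters on both sides of 'class'.
inner = re.compile(r'\Bclass\B')

whole_word = inner.search('no class at all')
inside = inner.search('the declassified algorithm')
suffix = inner.search('one subclass is')

print(whole_word)        # None
print(inside.span())     # (6, 11)
print(suffix)            # None
```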

Grouping

In real applications we often need more information than whether a regular expression matched. For more complex input, regular expressions usually use groups to capture different parts of the text separately.

In the example below, each RFC-822 header line is split at the ":" into a name and a value, which are matched separately:

From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com

For input like this, we can write a regular expression that matches a whole RFC-822 header line, and then use groups so that one group matches the header name and another matches the corresponding value.

RFC-822 is the standard format for email messages. Of course, at this point you do not yet know how to define groups. Don't worry, please read on...

In regular expressions, groups are marked with the ( ) metacharacters. They have much the same meaning as parentheses in mathematical expressions: they group together the expressions inside them, so you can apply a repetition metacharacter such as *, +, ? or {m,n} to the contents of a group.

For example, (ab)* matches zero or more repetitions of ab:

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)


Subgroups marked with ( ) are numbered, and the number can be passed as an argument to the methods group(), start(), end() and span(). Number 0 refers to the whole match rather than to any parenthesized subgroup (it always exists, so passing no argument is equivalent to passing the default value 0):

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'


The number of pairs of parentheses determines the number of subgroups; for example, (a)(b) and (a(b)) each contain two subgroups.

Subgroups are numbered from left to right, and they may be nested, so the number of a subgroup can be determined by counting the opening parentheses ( from left to right.

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'


The group() method also accepts several subgroup numbers at once and returns a tuple:

>>> m.group(2,1,2)
('b', 'abc', 'b')


start() returns the starting position of the given subgroup, end() returns its end position, and span() returns both as a (start, end) pair.
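A minimal sketch of these three methods, reusing the (a(b)c)d pattern from the example above:

```python
import re

p = re.compile('(a(b)c)d')
m = p.match('abcd')

# With no argument (or 0) the methods describe the whole match.
print(m.start(), m.end(), m.span())       # 0 4 (0, 4)

# With a subgroup number they describe that subgroup only.
print(m.start(1), m.end(1), m.span(1))    # 0 3 (0, 3)
print(m.start(2), m.end(2), m.span(2))    # 1 2 (1, 2)
```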

The groups() method returns the strings matched by all subgroups at once, as a tuple:

>>> m.groups()
('abc', 'b')
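With groups() available, the RFC-822 header example from the start of this section can be sketched. The pattern below is a hypothetical one written for this illustration, not taken from the original article: group 1 is assumed to capture the header name and group 2 the value. It does not handle real-world details such as headers folded across several lines.

```python
import re

# Hypothetical pattern: group 1 captures the header name, group 2 the value.
header = re.compile(r'^([^:\s]+):\s*(.*)$', re.MULTILINE)

message = """From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com"""

# groups() returns (name, value) for each matched header line.
pairs = [m.groups() for m in header.finditer(message)]
for name, value in pairs:
    print(name, '=>', value)
```

Each line of the message yields one (name, value) tuple, for example ('From', 'author@example.com').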


Backreferences

Backreferences are another concept worth introducing. A backreference lets you reuse, later in the pattern, the text that an earlier group matched; it is written as a backslash followed by a number. For example, \1 refers to the text matched by the first subgroup.

>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'


If you are only searching strings, backreferences are rarely needed, because few text formats repeat data in this way. However, you will soon find that backreferences are very useful when replacing strings!
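As a preview of replacement, here is a minimal sketch that removes the doubled word from the example above. It uses the standard re sub() method, which this part of the series has not introduced; the variable names are illustrative.

```python
import re

doubled = re.compile(r'(\b\w+)\s+\1')

# In the replacement string, \1 inserts whatever group 1 matched,
# so "the the" collapses to a single "the".
fixed = doubled.sub(r'\1', 'Paris in the the spring')
print(fixed)    # Paris in the spring
```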

Note

In a Python string literal, a backslash followed by digits denotes the character with the corresponding (octal) code, so when a regular expression uses backreferences we again stress the need for a raw string.
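The pitfall can be demonstrated with a short sketch. The second pattern deliberately leaves \1 outside a raw string (the other backslashes are doubled so that only the backreference is affected); this construction is an assumption made for illustration.

```python
import re

# Without the r prefix, '\1' is the single character with code 1.
print(len('\1'), ord('\1'))    # 1 1
print(len(r'\1'))              # 2

raw = re.compile(r'(\b\w+)\s+\1')
# Here the backreference is lost: the pattern ends in the character chr(1).
not_raw = re.compile('(\\b\\w+)\\s+\1')

text = 'Paris in the the spring'
raw_result = raw.search(text)
not_raw_result = not_raw.search(text)
print(raw_result.group())      # the the
print(not_raw_result)          # None
```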


Origin blog.csdn.net/CSNN2019/article/details/114483463