statement
Some metacharacters do not match any characters; they simply succeed or fail, which is why they are also called zero-width assertions. For example, \b asserts that the current position is a word boundary, but \b does not change the position. A zero-width assertion should therefore never be repeated: because \b does not advance the current position, \b\b is no different from a single \b.
Explanation:
The meaning of "change position" and "zero-width assertion" may not be obvious, so here is an attempt to explain. Take abc: once a has been matched, the current position advances so that matching can continue with b, and so on. With \babc, however, \b only asserts that the current position is a word boundary (just before the first letter of a word or just after its last letter). The current position does not change, and a is then matched against the character at that same position.
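The zero-width behaviour can be checked directly. The following is a minimal sketch (the sample strings are made up for illustration) showing that \b adds nothing to the matched span and that doubling it changes nothing:

```python
import re

# \b consumes no characters: the match for r'\babc' starts at index 0
# and covers exactly the three characters 'abc'.
m = re.search(r'\babc', 'abc def')
print(m.span())  # (0, 3)

# Repeating a zero-width assertion adds nothing: both patterns
# find the same span.
single = re.search(r'\bdef', 'abc def').span()
double = re.search(r'\b\bdef', 'abc def').span()
print(single, double)  # (4, 7) (4, 7)
```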
|
The "or" operator, which performs alternation between two regular expressions. If A and B are regular expressions, A|B matches any string that matches either A or B. To make it work sensibly with multi-character strings, | has very low precedence. For example, Fish|C matches either Fish or C, not Fis followed by 'h' or 'C'.
To match a literal '|', use \| or put it inside a character class, as in [|].
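As a rough illustration of the precedence rule, the sketch below compares Fish|C with a pattern that really does mean "Fis followed by h or C". The test string is invented for this example, and the second pattern uses a character class as one possible way to express it:

```python
import re

# '|' has very low precedence: the alternatives are 'Fish' and 'C',
# not 'Fis' followed by 'h' or 'C'.
p = re.compile(r'Fish|C')
low_precedence = p.findall('Fish, C, FisC')
print(low_precedence)  # ['Fish', 'C', 'C']

# To get "Fis followed by h or C", a character class can be used instead.
q = re.compile(r'Fis[hC]')
char_class = q.findall('Fish, C, FisC')
print(char_class)  # ['Fish', 'FisC']

# A literal '|' is matched with \| or [|].
escaped = re.findall(r'a\|b', 'a|b')
in_class = re.findall(r'a[|]b', 'a|b')
print(escaped, in_class)  # ['a|b'] ['a|b']
```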
^
Matches at the beginning of the string. If the MULTILINE flag is set, it also matches at the beginning of each line, that is, immediately after every newline.
For example, to match the word From only at the beginning of the string, the regular expression can be written as ^From:
>>> print(re.search('^From', 'From Here to'))
<_sre.SRE_Match object; span=(0, 4), match='From'>
>>> print(re.search('^From', 'Reciting From Memory'))
None
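The effect of the MULTILINE flag on ^ can be seen with a short sketch. The three-line sample text is an assumption made up for this example:

```python
import re

text = 'From Here\nFrom There\nNot From'

# Without MULTILINE, ^ matches only at the start of the string.
default_hits = re.findall(r'^From', text)
print(default_hits)  # ['From']

# With MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall(r'^From', text, re.MULTILINE)
print(multiline_hits)  # ['From', 'From']
```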
$
Matches at the end of the string, or just before a newline at the very end of the string. If the MULTILINE flag is set, it also matches just before every newline.
>>> print(re.search('}$', '{block}'))
<_sre.SRE_Match object; span=(6, 7), match='}'>
>>> print(re.search('}$', '{block} '))
None
>>> print(re.search('}$', '{block}\n'))
<_sre.SRE_Match object; span=(6, 7), match='}'>
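The newline behaviour of $ is easiest to see with more than one line. This is a small sketch over an invented two-line string, listing the spans where }$ matches with and without MULTILINE:

```python
import re

text = '{a}\n{b}\n'

# Without MULTILINE, $ matches only at the end of the string or
# just before a trailing newline.
default_spans = [m.span() for m in re.finditer(r'}$', text)]
print(default_spans)  # [(6, 7)]

# With MULTILINE, $ also matches before every newline.
multiline_spans = [m.span() for m in re.finditer(r'}$', text, re.MULTILINE)]
print(multiline_spans)  # [(2, 3), (6, 7)]
```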
Similarly, to match a literal '$', use \$ or put it inside a character class, as in [$].
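>>> print(re.search(r'}[$]', '{block}$'))
<_sre.SRE_Match object; span=(6, 8), match='}$'>
>>> print(re.search(r'}\$', '{block}$'))
<_sre.SRE_Match object; span=(6, 8), match='}$'>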
\A
Matches only at the beginning of the string. When the MULTILINE flag is not set, \A and ^ behave the same. With MULTILINE set they differ: \A still matches only at the start of the string, while ^ also matches at the start of each line within the string.
\Z
Matches only at the end of the string.
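The difference between the string anchors (\A, \Z) and the line anchors (^, $) only shows up under MULTILINE. A minimal sketch, using a made-up two-line string:

```python
import re

text = 'first line\nsecond line'

# ^ matches at the start of every line under MULTILINE ...
caret = re.findall(r'^\w+', text, re.MULTILINE)
# ... but \A still matches only at the very start of the string.
string_start = re.findall(r'\A\w+', text, re.MULTILINE)
print(caret, string_start)  # ['first', 'second'] ['first']

# Likewise, $ matches at the end of each line, while \Z matches
# only at the end of the string.
dollar = re.findall(r'\w+$', text, re.MULTILINE)
string_end = re.findall(r'\w+\Z', text, re.MULTILINE)
print(dollar, string_end)  # ['line', 'line'] ['line']
```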
\b
Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A "word" is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.
In the following example, class is matched only when it appears as a complete word; when it is contained inside another word, it does not match.
>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))
<_sre.SRE_Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None
There are two points to note when using these special sequences. The first is that Python string literals and regular expressions conflict over some characters (recall the earlier backslash example). For example, in a Python string literal \b represents the backspace character (ASCII value 8). So if you do not use a raw string, Python converts \b into a backspace, and the result will certainly differ from what you expected.
In the following example we deliberately leave off the 'r' raw-string prefix, and the results are indeed very different:
>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b' + 'class' + '\b'))
<_sre.SRE_Match object; span=(0, 7), match='\x08class\x08'>
The second point is that this assertion cannot be used inside a character class. There, just as in a Python string literal, \b represents only the backspace character.
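The character-class behaviour can be confirmed with a short sketch. The sample strings are invented; '\x08' is simply the backspace character written explicitly:

```python
import re

# Inside a character class, \b is the backspace character (code 8),
# not a word-boundary assertion.
in_class = re.search(r'[\b]', 'abc\x08def')
print(in_class.span())  # (3, 4)

# Outside a character class, the same raw-string \b is a zero-width assertion.
boundary = re.search(r'\bdef', 'abc def')
print(boundary.span())  # (4, 7)
```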
\B
Another zero-width assertion, the opposite of \b: \B matches only where the current position is not a word boundary.
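As a sketch of how \B mirrors \b, the pattern below reuses the sentences from the \b example but requires a non-boundary on both sides of class:

```python
import re

p = re.compile(r'\Bclass\B')

# 'class' sits in the middle of a word, so both ends are non-boundaries.
inside_word = p.search('the declassified algorithm')
print(inside_word.span())  # (6, 11)

# As a standalone word, both ends are word boundaries, so \B fails.
whole_word = p.search('no class at all')
print(whole_word)  # None
```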
Grouping
In practice we usually need more information than whether a regular expression matched at all. For more complex content, regular expressions commonly use groups to match different parts separately.
In the example below, each RFC-822 header line is divided at the ":" into a name and a value, which are matched separately:
From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com
For input like this, we can write a regular expression that matches an entire RFC-822 header line and uses groups: one group to match the header name, and another to match the corresponding value.
RFC-822 is the standard format for email messages. Of course, at this point you do not yet know how to define the groups. Don't worry, read on...
In regular expressions, groups are marked with the metacharacters ( and ). They have the same meaning as parentheses in mathematical expressions: they group together the expressions contained inside them, so you can repeat the contents of a group with a repetition metacharacter such as *, +, ? or {m,n}.
For example, (ab)* matches zero or more repetitions of ab:
>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)
Subgroups marked with ( ) are also numbered, and the group number can be passed as an argument to the methods group(), start(), end() and span(). Numbering starts at 0: group 0 is the whole match (it always exists, and it is the default, so omitting the argument is equivalent to passing 0):
>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
The number of pairs of parentheses determines the number of subgroups; for example, (a)(b) and (a(b)) each consist of two subgroups.
Subgroups are numbered from left to right and may be nested, so the number of a subgroup can be determined by counting the opening parentheses ( from left to right.
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
The group() method accepts several group numbers at once, in which case it returns a tuple of the corresponding values:
>>> m.group(2,1,2)
('b', 'abc', 'b')
start() returns the starting position of the subgroup given as its argument; end() returns the ending position of that subgroup; span() returns the (start, end) range of that subgroup.
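These position methods can be tried on the same (a(b)c)d pattern used above. A minimal sketch:

```python
import re

m = re.match(r'(a(b)c)d', 'abcd')

# Each method accepts a group number; 0 (the default) is the whole match.
print(m.start(1), m.end(1), m.span(1))  # 0 3 (0, 3)
print(m.start(2), m.end(2), m.span(2))  # 1 2 (1, 2)
print(m.span())                         # (0, 4)
```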
The groups() method returns a tuple containing the strings of all subgroups at once:
>>> m.groups()
('abc', 'b')
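Returning to the RFC-822 headers shown earlier, one possible way to split each line into a name and a value with two groups is sketched below. The pattern is a hypothetical one written for this illustration, not taken from the original text, and it ignores details such as headers folded across several lines:

```python
import re

# Hypothetical pattern for illustration: group 1 is the header name
# (everything before the first ':'), group 2 is the value.
header = re.compile(r'([^:]+):\s*(.*)')

lines = [
    'From: author@example.com',
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
    'MIME-Version: 1.0',
    'To: editor@example.com',
]

# Collect (name, value) pairs using groups().
pairs = [header.match(line).groups() for line in lines]
for name, value in pairs:
    print(name, '->', value)
```

Using groups() here returns both parts in one call; group(1) and group(2) would work equally well.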
Back reference
Backreferences also need to be introduced. A backreference lets you refer, later in the pattern, to content that an earlier group has already matched; it is written as a backslash followed by a number. For example, \1 succeeds only if the exact text matched by group 1 is found again at the current position.
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'
If you are only searching strings, backreferences are rarely needed, because few text formats repeat data in this way. However, you will soon find that backreferences are very useful when performing string substitutions!
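As a preview of the substitution case, here is a minimal sketch that collapses a doubled word. It relies on re.sub(), which this section has not covered yet, and reuses the doubled-word pattern from the example above:

```python
import re

# The same doubled-word pattern as above.
p = re.compile(r'(\b\w+)\s+\1')

# In the replacement string, \1 inserts the text captured by group 1,
# so the doubled word is replaced by a single copy.
fixed = p.sub(r'\1', 'Paris in the the spring')
print(fixed)  # Paris in the spring
```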
note
Python string literals also use a backslash followed by digits to represent the character with the corresponding (octal) character code, so when a regular expression uses backreferences we again stress the need for a raw string.
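The clash is easy to demonstrate. In this small sketch, the same backreference pattern is written once as a raw string and once as an ordinary string literal (with the other backslashes doubled so that only \1 differs):

```python
import re

# In an ordinary string literal, '\1' is the single character with code 1.
plain = '\1'
print(len(plain), ord(plain))  # 1 1

# In a raw string, r'\1' keeps both the backslash and the digit.
raw = r'\1'
print(len(raw))  # 2

# Only the raw-string version works as a backreference.
with_raw = re.search(r'(\w+)\s+\1', 'the the')
without_raw = re.search('(\\w+)\\s+\1', 'the the')
print(with_raw.group())  # the the
print(without_raw)       # None
```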