Compiled regular expressions of Python regular expressions (2)

Article Directory

statement

Python by providing an interface module re regular expression engine , while allowing you regular expression compiled into an object model , and use them to match .

Note: The
re module is written in C language , so the efficiency is much higher than you use ordinary string methods ; compile regular expressions is also to further improve efficiency ; we will often refer to "modes" later , Refers to the pattern object that the regular expression is compiled into .

Compile regular expression

Regular expressions are compiled into pattern objects , which have various methods for you to manipulate strings , such as finding pattern matches or performing string replacement .

import re
p = re.compile('ab*')

Insert picture description here
re.compile()Can also accept flagsparameters used to open various special features and syntax variations ,

>>> p = re.compile('ab*', re.IGNORECASE)

The regular expression is passed as a string parameterre.compile() . Because the regular expression is not a Pythoncore part of, and therefore does not provide special syntax support for it, so the regular expression can only be a string representation. (Some applications do not need to use regular expressions at all, so the friends in the Python community think it is not necessary to incorporate them into the core of Python.) On the contrary, the re module is only included in Python as an extension module of C , just like socket Module and zlib module .

Troublesome backslash

Regular expressions use '\'character to make some common characters with special abilities (for example \dmeans match any decimal digit), or deprived of the ability of some special characters (for example, \[indicates a match left bracket '['). This will follow Pythona string of characters to achieve the same functionality of the conflict.

If you need to LaTeXuse a file in a regular expression matching string '\section'. Because the backslash is a special character that needs to be matched , you need to add a backslash before it to deprive it of its special function. So we will write the characters of the regular expression as'\\section'

Python also uses backslashes in strings to express special meaning . So, if we want to '\\section'complete the pass re.compile(), we need to add two backslash again ...

Match characters	Matching phase
`\section`	The string to be matched
`\\section`	Regular expressions `'\\'`represent character match`'\'`
`"\\\\section"`	Unfortunately, Python strings can also use `'\\'`represent characters`'\'`

In short, in order to match the backslash character, we need to use four backslashes in the string . Therefore, frequent use of backslashes in regular expressions will cause a storm of backslashes, which in turn will make your string extremely difficult to understand.

The solution is to use Python's original string to represent the regular expression (that is, add r in front of the string, remember...):

Regular string	Raw string
`"ab*"`	r`"ab*"`
`"\\\\section"`	r`"\\section"`
`"\\w+\\s+\\1"`	r`"\w+\s+\1"`

It is recommended to use raw strings to express regular expressions .

Achieve matching

When you compile the regular expression, you get a pattern object . What can you do with him? The pattern object has many methods and properties . Let's enumerate the most important ones below:

method	Features
`match()`	Determine whether a regular expression matches a string from the beginning
`search()`	Traverse the string and find the first position matched by the regular expression
`findall()`	Traverse the string, find all positions matched by the regular expression, and return it in the form of a list
`finditer()`	Traverse the string, find all positions matched by the regular expression, and return it in the form of an iterator

If no match is found, then, match()and search()will return None; if the match is successful, it returns a matching object (match object), contains all the information that match: for example, where to start, where to end, matching substring, etc.

`match()`

Next, we explain step by step:

>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')

Implementation screenshot:
Insert picture description here

Now, you can try to use regular expression [a-z]+ to match various strings.
E.g:

>>> p.match("")
>>> print(p.match(""))
None

Implementation screenshot:
Insert picture description here

Because +indicates matching one or more times , it can not be the empty string match. Therefore, match()return None.

Let's try another string that can be matched:

>>> m = p.match('fish')
>>> m  
<_sre.SRE_Match object; span=(0, 4), match='fish'>

Implementation screenshot:
Insert picture description here

In this example, match()a matching object is returned , and we store it in the variable m for future use.

Next, let's take a look at what information is in the matching object. The matching object contains many methods and properties , the following are the most important:

method	Features
`group()`	Return the matched string
`start()`	Return the starting position of the match
`end()`	Return the ending position of the match
`span()`	Return a tuple representing the matching position (start, end)

>>> m.group()
'fish'
>>> m.start()
0
>>> m.end()
4
>>> m.span()
(0, 4)

Implementation screenshot:
Insert picture description here

Since match()only the check match the regular expression at the start of the string , the start () always returns 0.

`search()`

However, the search() method is different:

>>> print(p.match('^_^fish'))
None
>>> m = p.search('^_^fish')
>>> print(m)
<_sre.SRE_Match object; span=(3, 7), match='fish'>
>>> m.group()
'fish'
>>> m.span()
(3, 7)
>>> m=p.search('^_^f123i54sh')
>>> m
<re.Match object; span=(3, 4), match='f'>

Implementation screenshot:
Insert picture description here

In practical applications, the most common way is to store the matching object in a local variable and check whether the return value is None.

The form is usually as follows:

p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print('Match found: ', m.group())
else:
    print('No match')

There are two methods to return all matching results, one is findall()and the other is finditer().

`findall()`

findall() What is returned is a list:

>>> p = re.compile('\d+')
>>> p.findall('3只小狗，15条腿，多出的3条在哪里？')
['3', '15', '3']

Implementation screenshot:
Insert picture description here

`finditer()`

findall()We need to create a list before returning, and finditer()sucked matching object returned as an iterator:

>>> iterator = p.finditer('3只小狗，15条腿，多出的3条在哪里？')
>>> iterator
<callable_iterator object at 0x00000212CE96ADC8>
>>> for match in iterator:
        print(match.span())
        
(0, 1)
(5, 7)
(13, 14)

Implementation screenshot:
Insert picture description here

If the list is large, it is much more efficient to return the iterator