Regular Expressions in Python

Regular expressions are an easy place to stumble. I rarely use them in daily work, so I am summarizing them here to reinforce my memory.

A regular expression, often abbreviated in code as regex, regexp, or RE, is a pattern that describes a set of strings and is used to search and manipulate text.

The characteristics of regular expressions are:

1. They are very flexible, logical, and powerful.

2. They make complex string handling quick and concise.

3. For newcomers, they are obscure and hard to read.

Basic steps to create a pattern and search with it:

1. All regular expression functions in Python live in the re module, so first run: import re

2. Pass a string containing the regular expression to re.compile(); it returns a Regex object (a compiled pattern).

Take a phone number as an example:

phonenumregex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

3. Match with the Regex object: its search() method scans the string passed to it for a match of the regular expression. It returns None if the pattern is not found in the string.

If a match is found, a Match object is returned.

mo = phonenumregex.search('My phone number is 415-555-4242.')

4. Call the group() method of the Match object to get the text that was actually matched:

mo.group()

Note: pass a raw string to re.compile(). Adding r before the first quotation mark marks the string as raw, so backslashes are not treated as escape characters.

That is, \d in the string is the regular expression for a digit, and there is no need to escape the backslash and write \\d.
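As a minimal sketch of the note above, the following compares a raw string with its escaped equivalent and runs the four steps together, checking for None before calling group(). The variable names are illustrative only.

```python
import re

# A raw string and a doubly escaped normal string describe the same pattern.
raw_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
escaped_pattern = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
same_pattern = raw_pattern == escaped_pattern

phone_regex = re.compile(raw_pattern)
mo = phone_regex.search('My phone number is 415-555-4242.')

# search() returns None when nothing matches, so check before calling group().
found = mo.group() if mo is not None else None
print(same_pattern, found)
```

Checking for None first avoids the AttributeError that appears when group() is called on a failed search.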

Use parentheses to group (the group() method)

Adding parentheses creates 'groups' in a regular expression.

For example, a phone number can be split into two parts, the area code and the main number:

(\d\d\d)-(\d\d\d-\d\d\d\d)

The first pair of parentheses is the first group and the second pair is the second group. Passing 1 or 2 to the group() method of the Match object returns the text matched by that group; passing 0 or no argument returns the entire matched text.

>>> import re
>>> phonenumregex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phonenumregex.search('My number is 415-555-4242')
>>> mo.group()
'415-555-4242'
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'

>>> mo.groups()
('415', '555-4242')

To get all the groups at once, use the groups() method, which returns them as a tuple.

If the regular expression needs to match literal parentheses, escape them as \( and \).
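A short sketch of matching literal parentheses, assuming a phone number written with the area code in parentheses (the sample text is made up for illustration):

```python
import re

# \( and \) match literal parentheses; the unescaped pairs still form groups.
area_regex = re.compile(r'\((\d\d\d)\) (\d\d\d-\d\d\d\d)')
mo = area_regex.search('My phone number is (415) 555-4242.')

whole = mo.group()        # text of the entire match
area_code = mo.group(1)   # first group, without the literal parentheses
print(whole, area_code)
```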

Construct more complex regular expressions

1. The pipe | works like 'or' and matches one of several expressions:

>>> heroRegex = re.compile(r'Batman|Tina Fey')
>>> mo1 = heroRegex.search('Batman and Tina Fey')
>>> mo1.group()
'Batman'
>>> mo1 = heroRegex.search(' Tina Fey and Batman')
>>> mo1.group()
'Tina Fey'

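When more than one alternative occurs in the searched text, the Match object holds whichever occurs first. To match a literal |, escape it as \|. The pipe can also be used inside parentheses to match one of several alternatives after a shared prefix: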

>>> batRegex=re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = batRegex.search('Batmobile lost a wheel')
>>> mo.group()
'Batmobile'
>>> mo.group(1)
'mobile'
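To illustrate the escaping rule for the pipe mentioned above, this small sketch contrasts an unescaped | with \| on a made-up input string:

```python
import re

# Unescaped | means alternation: match 'a' or 'b'.
alternation = re.findall(r'a|b', 'a|b')
# \| matches a literal pipe character.
literal_pipe = re.findall(r'a\|b', 'a|b')
print(alternation, literal_pipe)
```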

2. The question mark makes a group optional

? marks the preceding group as optional: the pattern matches whether or not that text is present. In other words, ? matches the preceding group zero or one time.

>>> betRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
>>> mo1 = betRegex.search('My number is 415-555-4242')
>>> mo1.group()
'415-555-4242'
>>> mo2 = betRegex.search('My number is 555-4242')
>>> mo2.group()
'555-4242'

3. Use * to match zero or more times

* means the preceding group can appear any number of times in the text, including zero.

>>> betRegex = re.compile(r'Bat(wo)*man')
>>> mo1=betRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo1=betRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'
>>> mo1=betRegex.search('The Adventures of Batwowowowoman')
>>> mo1.group()
'Batwowowowoman'

4. Use + to match one or more times

+ means match one or more times. Unlike *, the preceding group must appear at least once.

>>> betRegex = re.compile(r'Bat(wo)+man')
>>> mo1=betRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'
>>> mo1=betRegex.search('The Adventures of Batwowowowoman')
>>> mo1.group()
'Batwowowowoman'
>>> mo1=betRegex.search('The Adventures of Batman')
>>> mo1.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> mo1 is None
True

As shown, when wo does not appear at least once, search() returns None, so mo1 is None.

5. Use {} to match a specific number of occurrences

Add {number} after a group to require exactly that number of repetitions.

You can also give a range: {min,max} sets the minimum and maximum number of repetitions; {min,} means at least min repetitions with no upper limit; {,max} means from zero up to max repetitions.

>>> haRegex = re.compile(r'(Ha){3}')
>>> mo1 = haRegex.search('HaHaHa')
>>> mo1.group()
'HaHaHa'
>>> mo1 = haRegex.search('HaHa')
>>> mo1.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> mo1 is None
True

As shown, because {3} requires exactly three repetitions, searching 'HaHa' returns None.
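The range forms of {} can be sketched the same way. The input string and variable names below are chosen only for illustration:

```python
import re

text = 'HaHaHaHaHa'  # five repetitions of 'Ha'

# {2,4}: between two and four repetitions; greedy, so four are taken.
two_to_four = re.search(r'(Ha){2,4}', text).group()
# {3,}: at least three repetitions, with no upper limit.
at_least_three = re.search(r'(Ha){3,}', text).group()
# {,2}: from zero up to two repetitions.
at_most_two = re.search(r'(Ha){,2}', text).group()
print(two_to_four, at_least_three, at_most_two)
```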

6. Greedy and non-greedy matching

Regular expressions in Python are 'greedy' by default: when more than one match length is possible, they match the longest string possible.

The non-greedy version of {}, written by adding a question mark after the closing brace, matches the shortest string possible.

>>> greedyRegex = re.compile(r'(Ha){3,5}')
>>> mo1 = greedyRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'
>>> nongreedyRegex = re.compile(r'(Ha){3,5}?')
>>> mo1 = nongreedyRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHa'

nongreedyRegex uses the non-greedy form {}?, so even with min=3 and max=5 it matches the shortest possible string, 'HaHaHa'.

7. The findall() method of regular expressions

Only the search() method has been introduced so far. It returns a Match object that contains only the first match in the searched string.

The findall() method returns every match in the searched string, not as a Match object but as a list of strings.

When the regular expression has no groups, findall() returns a list of the matched strings:

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('Cell:415-555-4242 Work 215-555-0000')
>>> mo.group()
'415-555-4242'
>>> mo = phoneNumRegex.findall('Cell:415-555-4242 Work 215-555-0000')
>>> mo.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'group'
>>> mo = phoneNumRegex.findall('Cell:415-555-4242 Work 215-555-0000')
>>> mo
['415-555-4242', '215-555-0000']

As shown, the result of search() is a Match object, so group() must be called to get the text.

findall() returns a list, so calling group() on it raises an error. Because two matches occur in the text, the list contains both.

When the regular expression has groups, findall() returns a list of tuples:

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
>>> mo = phoneNumRegex.findall('Cell:415-555-4242 Work 215-555-0000')
>>> mo
[('415', '555', '4242'), ('215', '555', '0000')]

The result is a list of tuples: each complete match is one tuple, and each group contributes one string to it.

8. Character classes

\d	Any digit from 0 to 9
\D	Any character that is not a digit from 0 to 9
\w	Any letter, digit, or underscore (think of it as matching a 'word' character)
\W	Any character that is not a letter, digit, or underscore
\s	Any space, tab, or newline (think of it as matching a 'whitespace' character)
\S	Any character that is not a space, tab, or newline

The character class [0-3] matches only the digits 0 to 3; \d{3} matches three digits in a row.

>>> xmasregex = re.compile(r'\d+\s\w+')
>>> mo = xmasregex.findall('12 drummers,11 pipers,10 lords')
>>> mo
['12 drummers', '11 pipers', '10 lords']

In the regular expression above, \d+ matches one or more digits, \s matches one whitespace character, and \w+ matches one or more letters, digits, or underscores. So \d+\s\w+ matches one or more digits, followed by a whitespace character, followed by one or more letter/digit/underscore characters.
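The negated shorthand classes and \d{3} are not exercised by the example above, so here is a brief sketch using an invented sample string:

```python
import re

text = 'Room 101, floor 7'

# \d{3} matches exactly three digits in a row.
three_digits = re.findall(r'\d{3}', text)
# \D+ matches runs of characters that are not digits.
non_digits = re.findall(r'\D+', text)
# \S+ matches runs of non-whitespace characters.
chunks = re.findall(r'\S+', text)
print(three_digits, non_digits, chunks)
```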

9. Use [] to create your own character classes

The shorthand character classes are very broad. If you only want to match certain digits or letters, use [] to define exactly the set of characters you want.

[a-zA-Z0-9] matches any uppercase letter, lowercase letter, or digit.

[^aeio] matches any character that is not listed after the caret; a ^ right after the opening bracket negates the class.

Inside [], ordinary regular expression symbols are not interpreted, so characters such as . / * ? ( ) can be written directly without a backslash escape.
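A minimal sketch of custom and negated character classes, using made-up sample strings:

```python
import re

# [aeiouAEIOU] matches any single vowel, in either case.
vowels = re.findall(r'[aeiouAEIOU]', 'Regex Is Fun')
# [^aeiouAEIOU] matches any single character that is not a vowel.
non_vowels = re.findall(r'[^aeiouAEIOU]', 'Regex Is Fun')
# Inside [], '.' and '?' are literal, so no backslash is needed.
punctuation = re.findall(r'[.?]', 'Really? Yes.')
print(vowels, non_vowels, punctuation)
```

Note that the negated class also matches the spaces, since a space is not a vowel.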

10. ^ and $

Use ^ at the beginning of a regular expression to indicate that the match must occur at the start of the searched text.

Use $ at the end of a regular expression to indicate that the searched text must end with the pattern.

Using ^ and $ together means the entire string must match the pattern; matching only a subset of it is not enough.

>>> Regex1 = re.compile(r'^\d')
>>> Regex1 = re.compile(r'^\d+$')
>>> mo=Regex1.search('1234567890').group()
>>> mo
'1234567890'
>>> mo=Regex1.search('123456xy7890').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> Regex1.search('123456xy7890') is None
True
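The session above uses ^ and $ together. As a complementary sketch, each anchor can also be used on its own (the sample strings are invented for illustration):

```python
import re

# ^Hello: the match must start at the beginning of the string.
starts = re.search(r'^Hello', 'Hello world') is not None
not_start = re.search(r'^Hello', 'He said Hello') is not None
# \d$: the string must end with a digit.
ends = re.search(r'\d$', 'Your number is 42') is not None
not_end = re.search(r'\d$', 'Your number is forty two') is not None
print(starts, not_start, ends, not_end)
```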

11. Wildcard characters

The . character is a 'wildcard' that matches any character except a newline.

However, . matches only one character, and a literal period must be escaped with a backslash: \.

>>> arRegex = re.compile(r'.at')
>>> arRegex.findall('The Cat in the hat sat on the flat mat.')
['Cat', 'hat', 'sat', 'lat', 'mat']
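To show the difference between the wildcard and an escaped period, here is a small sketch on a hypothetical version string:

```python
import re

text = 'v1.2 and v1x2'

# Unescaped '.' matches any single character except a newline.
any_char = re.findall(r'v1.2', text)
# '\.' matches only a literal period.
literal_dot = re.findall(r'v1\.2', text)
print(any_char, literal_dot)
```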

12. Use .* to match any text

.* matches zero or more of any character except a newline. It is greedy and matches as much text as possible.

.*? is the non-greedy form and matches as little text as possible.

>>> Regex1 = re.compile(r'<.*>')
>>> mo = Regex1.search('<To Serve man> for dinner.>')
>>> mo
<_sre.SRE_Match object; span=(0, 27), match='<To Serve man> for dinner.>'>
>>> mo.group()
'<To Serve man> for dinner.>'
>>> Regex2 = re.compile(r'<.*?>')
>>> mo = Regex2.search('<To Serve man> for dinner.>')
>>> mo.group()
'<To Serve man>'

13. Pass a second argument to re.compile()

As summarized earlier, the period matches any character except a newline. Passing re.DOTALL as the second argument to re.compile() makes the period match every character, including newlines.

If you only care about the letters and not their case, pass re.IGNORECASE or re.I as the second argument to re.compile(); letters in the pattern, whether written directly or listed in [], then match regardless of case.

Complex text patterns may need a long regular expression. In that case, pass re.VERBOSE as the second argument to re.compile() so that whitespace, newlines, and comments inside the pattern are ignored.

Combining r with triple quotes lets you spread the pattern over multiple lines, just like any other multi-line Python string.
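The three flags can be sketched as follows. The sample strings and variable names are assumptions made for illustration, and the verbose pattern is just one possible layout of the phone number pattern used earlier:

```python
import re

# re.DOTALL lets '.' match newlines as well.
text = 'first line\nsecond line'
without_dotall = re.compile(r'.*').search(text).group()
with_dotall = re.compile(r'.*', re.DOTALL).search(text).group()

# re.IGNORECASE (or re.I) ignores letter case.
robocop = re.compile(r'robocop', re.I).search('ROBOCOP protects').group()

# re.VERBOSE ignores whitespace and comments inside the pattern.
phone_regex = re.compile(r'''
    (\d{3})-    # area code followed by a hyphen
    (\d{3})-    # first three digits
    (\d{4})     # last four digits
    ''', re.VERBOSE)
parts = phone_regex.search('Cell: 415-555-4242').groups()
print(repr(without_dotall), repr(with_dotall), robocop, parts)
```

If more than one flag is needed, they can be combined with the | operator, for example re.I | re.DOTALL.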
