[Liao Xuefeng] Python regular expressions

Basics

Strings are the most commonly used data structure in programming, and the need to operate on them is almost everywhere. For example, to determine whether a string is a legal email address, you could programmatically extract the substrings before and after the @ and then check whether they are a word and a domain name respectively, but this is troublesome and the code is hard to reuse.

Regular expressions are a powerful tool for matching strings. The idea is to use a descriptive language to define a rule for strings. Any string that conforms to the rule is considered a "match"; otherwise, the string is illegal.

So the way we judge whether a string is a legal email is:

  1. Create a regular expression that matches Email;
  2. Use this regular expression to match the user's input to determine whether it is legal.

Because regular expressions are also represented by strings, we must first understand how to use characters to describe characters.

In regular expressions, if characters are given directly, it is an exact match.

\d matches a digit, \w matches a letter, digit, or underscore, and . matches any single character (except a newline, by default):

  • '00\d' can match '007', but not '00A';

  • '\d\d\d' can match '010';

  • '\w\w\d' can match 'py3';

  • 'py.' can match 'pyc', 'pyo', 'py!', etc.
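As a quick check of the examples above, the following sketch runs each pattern against a candidate string. It uses re.fullmatch from Python's re module (the module is introduced later in this article) so that the whole candidate must match; the names cases and results are made up for illustration.

```python
import re

# Each pair is (pattern, candidate). fullmatch succeeds only if the
# entire candidate string matches the pattern.
cases = [
    (r'00\d', '007'),
    (r'00\d', '00A'),
    (r'\d\d\d', '010'),
    (r'\w\w\d', 'py3'),
    (r'py.', 'py!'),
]

results = {(p, s): re.fullmatch(p, s) is not None for p, s in cases}

for (p, s), ok in results.items():
    print(f'{p!r} against {s!r}: {ok}')
```

Only the '00\d' against '00A' pair fails, because 'A' is not a digit.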

To match a variable number of characters, a regular expression uses quantifiers that apply to the preceding item: * means any number of repetitions (including 0), + means at least one, ? means 0 or 1, {n} means exactly n, and {n,m} means n to m.

Let's look at a more complex example: \d{3}\s+\d{3,8}.

Reading it from left to right:

  1. \d{3} matches exactly 3 digits, for example '010';
  2. \s matches a whitespace character (a space, a tab, and so on), so \s+ means at least one whitespace character, such as ' ' or '  ';
  3. \d{3,8} matches 3 to 8 digits, for example '1234567'.

Taken together, this regular expression matches a phone number whose area code is separated from the local number by one or more whitespace characters.
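The pattern just described can be tried out with a small sketch. The helper name looks_like_phone is hypothetical, and re.fullmatch is used so that nothing may come before or after the number.

```python
import re

# 3 digits, at least one whitespace character, then 3 to 8 digits.
phone = re.compile(r'\d{3}\s+\d{3,8}')

def looks_like_phone(text):
    """Return True if the whole text has the form described above."""
    return phone.fullmatch(text) is not None

print(looks_like_phone('010 12345'))       # one space
print(looks_like_phone('010\t1234567'))    # a tab also counts as whitespace
print(looks_like_phone('01012345'))        # no separator at all
print(looks_like_phone('010 123456789'))   # 9 digits exceed {3,8}
```

The first two calls succeed and the last two fail, which mirrors the left-to-right reading of the pattern.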

What if you want to match a number such as '010-12345'? The original tutorial escapes '-' with '\', giving \d{3}\-\d{3,8}. Strictly speaking, '-' is only special inside [], so the escape is optional here (\d{3}-\d{3,8} works too), but it is harmless.

However, it still doesn't match '010 - 12345', because of the spaces around the hyphen. For that we need more flexible matching.
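One possible way to tolerate the spaces, using only the quantifiers introduced so far, is to allow optional whitespace on both sides of the hyphen with \s*. This is a sketch of one approach, not the tutorial's own solution; the name flexible is made up.

```python
import re

# \s* allows zero or more whitespace characters around the hyphen.
flexible = re.compile(r'\d{3}\s*-\s*\d{3,8}')

for candidate in ['010-12345', '010 - 12345', '010 -12345', '010 12345']:
    print(repr(candidate), flexible.fullmatch(candidate) is not None)
```

The last candidate still fails because the hyphen itself remains mandatory.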

Advanced

For a more precise match, you can use [] to express a range, for example:

  • [0-9a-zA-Z\_] can match a digit, a letter, or an underscore;
  • [0-9a-zA-Z\_]+ can match a string consisting of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'Py3000', etc.;
  • [a-zA-Z\_][0-9a-zA-Z\_]* can match any string that starts with a letter or an underscore and is followed by any number of digits, letters, or underscores, which is a legal variable name in Python;
  • [a-zA-Z\_][0-9a-zA-Z\_]{0,19} is more precise: it limits the length of the variable name to 1-20 characters (1 character in front + up to 19 characters after it). Note that the quantifier must be written without a space, {0,19} and not {0, 19}.
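The length-limited variable-name pattern from the list above can be checked with a short sketch. The helper name is_short_identifier is hypothetical, and the underscore is written without a backslash because it needs no escaping inside [].

```python
import re

# One leading letter or underscore, then up to 19 more
# letters, digits, or underscores (1 to 20 characters in total).
identifier = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

def is_short_identifier(name):
    """Return True if name fits the 1-20 character variable-name rule."""
    return identifier.fullmatch(name) is not None

print(is_short_identifier('Py3000'))    # starts with a letter
print(is_short_identifier('0_Z'))       # starts with a digit
print(is_short_identifier('a' * 21))    # one character too long
```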

A|B can match A or B, so (P|p)ython can match either 'Python' or 'python'.

^ indicates the beginning of the line, so ^\d means the line must start with a digit.

$ indicates the end of the line, so \d$ means the line must end with a digit.

You may have noticed that py can also match 'python' (it matches the first two characters), but ^py$ has to match the entire line, so it can only match 'py'.
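The effect of the anchors can be seen by searching for a pattern anywhere in a string. The sketch below uses re.search from Python's re module, which looks for the pattern at any position; the helper name found is made up.

```python
import re

def found(pattern, text):
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

print(found(r'py', 'python'))            # 'py' occurs inside 'python'
print(found(r'^py$', 'python'))          # anchors force the whole line to be 'py'
print(found(r'^py$', 'py'))
print(found(r'^(P|p)ython$', 'Python'))  # alternation inside a group
print(found(r'^\d', '3 apples'))         # starts with a digit
print(found(r'\d$', '3 apples'))         # does not end with a digit
```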

re module

With this background, we can use regular expressions in Python. Python provides the re module, which contains all the regular expression functionality. Because Python strings themselves also use \ as an escape character, pay special attention to:

s = 'ABC\\-001' # a Python string
# the corresponding regular expression string becomes:
# 'ABC\-001'

Therefore, we strongly recommend using Python's r prefix, so you don't have to worry about escaping:

s = r'ABC\-001' # a Python string
# the corresponding regular expression string is unchanged:
# 'ABC\-001'
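To confirm that the two spellings produce the same string, a small sketch can compare them directly. The variable names plain, raw, same, length, and matched are chosen for illustration.

```python
import re

plain = 'ABC\\-001'   # backslash doubled so Python keeps one backslash
raw = r'ABC\-001'     # raw string: the backslash is kept as typed

same = plain == raw
length = len(raw)     # A, B, C, \, -, 0, 0, 1
matched = re.match(raw, 'ABC-001') is not None

print(same, length, matched)
```

Both literals hold the same 8 characters, so either one can be passed to the re functions; the raw form is simply easier to read.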

Let’s first look at how to determine whether a regular expression matches:

>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')  # note: no space is allowed before the 8 in {3,8}
<re.Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>

The match() method determines whether there is a match. If the match succeeds, it returns a Match object; otherwise it returns None. A common way to test this is:

test = 'string entered by the user'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')
# Result: failed

Split string

Using regular expressions to split strings is more flexible than splitting on fixed characters. Here is ordinary splitting with str.split:

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']

As you can see, runs of consecutive spaces are not handled. Try a regular expression instead:

>>> re.split(r'\s+', 'a b    c')  # \s+ means at least one whitespace character
['a', 'b', 'c']

The string is split correctly no matter how many spaces there are. Now add , as a separator:

>>> re.split(r'[\s\,]+', 'a, b,c   d')  # [] expresses the set of separators more precisely
['a', 'b', 'c', 'd']

Now add ; as well:

>>> re.split(r'[\s\,\;]+', 'a,b;; c    d')
['a', 'b', 'c', 'd']

If users enter a set of tags, remember to use a regular expression to turn the irregular input into a proper list.
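The tag-cleaning idea can be sketched as a small helper. The function name parse_tags is hypothetical, and dropping empty strings is an added assumption: re.split returns an empty string when the input starts or ends with a separator.

```python
import re

def parse_tags(raw):
    """Split user-entered tags on any run of whitespace, commas, or semicolons."""
    parts = re.split(r'[\s,;]+', raw.strip())
    # A leading or trailing separator leaves an empty string behind; drop it.
    return [p for p in parts if p]

print(parse_tags(' python, regex;;  re ,'))
```

The character class is written as [\s,;] because , and ; need no escaping inside [].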

Group

In addition to simply determining whether there is a match, regular expressions can also extract substrings. Parentheses () mark a group to be extracted. For example:

^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:

>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<re.Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'

If groups are defined in the regular expression, you can use the group() method on the Match object to extract the substrings.

Note that group(0) is always the string that matched the entire regular expression, while group(1), group(2), ... are the 1st, 2nd, ... captured substrings.
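Because a failed match returns None, calling group() on the result directly would raise an error. The sketch below guards against that and unpacks both groups at once with groups(); the helper name split_phone is made up for illustration.

```python
import re

def split_phone(number):
    """Return (area_code, local_number), or None if number does not match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', number)
    if m is None:
        return None
    area, local = m.groups()   # same as (m.group(1), m.group(2))
    return area, local

print(split_phone('010-12345'))
print(split_phone('010 12345'))
```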

Groups are very useful for extracting substrings. Here is a more heavy-handed example:

>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)  # | means 'or'
>>> m.groups()
('19', '05', '30')

This regular expression can directly identify legal times. But sometimes, full verification cannot be achieved using regular expressions, such as identifying dates:

'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

For illegal dates such as '2-30' and '4-31', a regular expression still cannot identify them, or at least is very hard to write. In such cases the program has to do part of the validation.
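One way the program can help is to let the regular expression extract the month and day and then check the day against the length of that month. The sketch below uses calendar.monthrange from the standard library; the function name is_valid_month_day and the default year 2024 (a leap year, so '2-29' is accepted) are assumptions, since the dates in the text carry no year.

```python
import calendar
import re

# The month-day pattern from the text above.
MONTH_DAY = re.compile(
    r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'
)

def is_valid_month_day(text, year=2024):
    """Check the shape with the regex, then the calendar with code."""
    m = MONTH_DAY.match(text)
    if m is None:
        return False
    month, day = int(m.group(1)), int(m.group(2))
    if not 1 <= month <= 12:       # the pattern also lets a bare '0' through
        return False
    days_in_month = calendar.monthrange(year, month)[1]
    return 1 <= day <= days_in_month

for candidate in ['2-30', '4-31', '4-30', '12-31']:
    print(candidate, is_valid_month_day(candidate))
```

The regular expression handles the format and the ordinary code handles the calendar rules, which keeps both parts simple.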

Greedy matching

Finally, note that regular expression matching is greedy by default, which means it matches as many characters as possible. For example, try to capture the trailing 0s of a number:

>>> re.match(r'^(\d+)(0*)$', '102300').groups()  # * means any number of characters (including 0)
('102300', '')

Because \d+ is greedy, it consumes all the trailing 0s, so 0* can only match the empty string.

To let 0* match the trailing 0s, \d+ must use non-greedy matching (that is, match as few characters as possible). Adding a ? after \d+ makes it non-greedy:

>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
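The two behaviours can be compared side by side on several inputs. In this sketch the helper names greedy_split and lazy_split are made up; the patterns are the two shown above.

```python
import re

def greedy_split(digits):
    """\\d+ takes everything, leaving nothing for 0*."""
    m = re.match(r'^(\d+)(0*)$', digits)
    return m.groups() if m else None

def lazy_split(digits):
    """\\d+? takes as little as possible, so 0* gets the trailing zeros."""
    m = re.match(r'^(\d+?)(0*)$', digits)
    return m.groups() if m else None

for s in ['102300', '100', '123']:
    print(s, greedy_split(s), lazy_split(s))
```

Note that \d+? must still match at least one character, so an input made only of zeros keeps its first '0' in the first group.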

Compile

When we use regular expressions in Python, the re module does two things internally:

  1. Compile the regular expression. If the string of the regular expression itself is illegal, an error will be reported;
  2. Use compiled regular expressions to match strings.

If a regular expression is going to be reused thousands of times, we can precompile it for efficiency, so that the compile step is skipped on reuse and matching happens directly:

>>> import re
# Compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')

Compiling produces a regular expression object. Since the object already contains the regular expression, there is no need to pass the pattern string when calling its methods.
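A typical way to use a precompiled pattern is to define it once at module level and reuse it inside a loop. The sketch below assumes a hypothetical helper extract_all and a made-up list of inputs.

```python
import re

# Compiled once, when the module is loaded.
RE_TELEPHONE = re.compile(r'^(\d{3})-(\d{3,8})$')

def extract_all(numbers):
    """Return the (area, local) pairs for every string that matches."""
    results = []
    for n in numbers:
        m = RE_TELEPHONE.match(n)   # no pattern string needed here
        if m:
            results.append(m.groups())
    return results

print(extract_all(['010-12345', 'not a number', '010-8086']))
```

The re module also keeps an internal cache of recently used patterns, so the gain from compiling by hand is often modest; naming the pattern once is still convenient for readability.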

Summary

Regular expressions are so powerful that it would be impossible to cover them all in just one section. To explain clearly all the contents of regular rules, a thick book could be written. If you often encounter problems with regular expressions, you may need a regular expression reference book.

Practice

1. Please try to write a regular expression to validate email addresses. Version 1 should be able to validate addresses like the ones in the tests below:

import re

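def is_valid_email(addr):
    re_email = re.compile(r'^[a-zA-Z\.]+@[0-9a-zA-Z]+\.com$')
    # Another way to write it:
    # re_email = re.compile(r'^[\w]+\.?[\w]+@[\w]+\.com$')
    # Explanation: one or more word characters, an optional '.', one or more word characters, '@', one or more word characters, '.com'
    # Always return a bool: True on a match, False otherwise
    return re_email.match(addr) is not None

# Tests: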
assert is_valid_email('[email protected]')
assert is_valid_email('[email protected]')
assert not is_valid_email('bob#example.com')
assert not is_valid_email('[email protected]')
print('ok')

# Result: ok

2. Version 2 should extract the name from an email address that carries one:

import re

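def name_of_email(addr):
    re_email = re.compile(r'^<?([\w\s]+)>?\s*[\w]*@[\w]+\.(org|com)$')
    # Explanation: optional '<', one or more word characters or whitespace (the captured name), optional '>', any whitespace, any number of word characters, '@', one or more word characters, '.', then org or com
    m = re_email.match(addr)  # match once and reuse the result
    if m:
        return m.group(1)
    return None

# Tests: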
assert name_of_email('<Tom Paris> [email protected]') == 'Tom Paris'
assert name_of_email('[email protected]') == 'tom'
assert name_of_email('[email protected]') == 'bob'
print('ok')
# Result: ok

Reference link: Regular expression-Liao Xuefeng’s official website (liaoxuefeng.com)

Origin blog.csdn.net/qq_45670134/article/details/127215779