Python study notes --- regular expressions [Liao Xuefeng]

regular expression

Regular expressions are a powerful tool for matching strings. The design idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match"; otherwise, the string is invalid.

To judge whether a string is a legal email address, we:

  1. Create a regular expression that matches email addresses;
  2. Use this regular expression to match the user's input and determine whether it is legal.

In a regular expression, . can match any single character, \d can match a digit, and \s can match a whitespace character.

To match a variable number of characters in a regular expression, use * for any number of characters (including 0), + for at least one character, ? for 0 or 1 character, {n} for exactly n characters, and {n,m} for n to m characters.
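As a rough illustration of the quantifiers above, the sketch below checks a few pattern/text pairs. It assumes re.fullmatch (which requires the whole text to match) purely for demonstration; the sample patterns and texts are made up for this example.

```python
import re

# Each pair is (pattern, text); fullmatch requires the whole text to match.
examples = [
    (r'ab*c', 'ac'),      # * allows zero 'b' characters
    (r'ab+c', 'ac'),      # + needs at least one 'b'
    (r'ab?c', 'abc'),     # ? allows zero or one 'b'
    (r'\d{3}', '010'),    # {n}: exactly three digits
    (r'\d{3,8}', '12'),   # {n,m}: between three and eight digits
]

results = [re.fullmatch(pattern, text) is not None for pattern, text in examples]
for (pattern, text), ok in zip(examples, results):
    print(pattern, repr(text), ok)
```

Only the second and last pairs fail: 'ac' has no 'b' for + to consume, and '12' is shorter than three digits.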

Advanced

For a more precise match, you can use [] to denote a range, for example:

  • [0-9a-zA-Z_] can match a digit, letter or underscore;
  • [0-9a-zA-Z_]+ can match a string consisting of at least one digit, letter or underscore, such as 'a100', '0_Z', 'Py3000', etc.;
  • [a-zA-Z_][0-9a-zA-Z_]* can match a string that starts with a letter or an underscore, followed by any number of digits, letters or underscores, which is the form of a legal Python variable name;
  • [a-zA-Z_][0-9a-zA-Z_]{0,19} is more precise: it limits the length of the variable name to 1-20 characters (1 character in front + up to 19 characters after it). Note that there must be no space inside the braces.
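The last item in the list can be tried out with a small helper. This is only a sketch: the function name is_short_name is hypothetical, and re.fullmatch is assumed here so that the whole string has to fit the pattern.

```python
import re

# A letter or underscore, then up to 19 more letters, digits or underscores.
NAME_RE = r'[a-zA-Z_][0-9a-zA-Z_]{0,19}'

def is_short_name(text):
    """Return True if text is a 1-20 character identifier-like name."""
    return re.fullmatch(NAME_RE, text) is not None

print(is_short_name('Py3000'))   # True
print(is_short_name('_tmp'))     # True
print(is_short_name('3abc'))     # False: starts with a digit
print(is_short_name('a' * 21))   # False: 21 characters is too long
```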

A|B can match either A or B, so (P|p)ython can match either 'Python' or 'python'.

^ indicates the beginning of the line; ^\d means the line must start with a digit.

$ indicates the end of the line; \d$ means the line must end with a digit.

You may have noticed that py can also match 'python', but with ^py$ it becomes a whole-line match, and it can only match 'py'.
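A quick way to see the effect of |, ^ and $ is to search a few sample strings. The sketch below assumes re.search, which looks for the pattern anywhere in the text; the sample texts are invented for illustration.

```python
import re

# (pattern, text) pairs; re.search looks for the pattern anywhere in the text,
# so only the anchors ^ and $ pin a match to the start or the end.
checks = [
    (r'py', 'python'),                  # found inside the text
    (r'^py$', 'python'),                # the whole text would have to be 'py'
    (r'^py$', 'py'),
    (r'^\d', '3 apples'),               # starts with a digit
    (r'\d$', 'room 101'),               # ends with a digit
    (r'(P|p)ython', 'I like Python'),
    (r'(P|p)ython', 'I like python'),
]

found = [re.search(pattern, text) is not None for pattern, text in checks]
for (pattern, text), ok in zip(checks, found):
    print(pattern, repr(text), ok)
```

Every pair matches except the second one, where ^py$ is tested against 'python'.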

re module

With the preliminaries covered, we can use regular expressions in Python. Python provides the re module, which contains all regular expression functionality. Since Python strings themselves also use \ as the escape character, special attention should be paid:

It is strongly recommended to use Python's r prefix, so that there is no need to think about escaping.

s = 'ABC\\-001' # a Python string
# the corresponding regular expression string becomes:
# 'ABC\-001'

s = r'ABC\-001' # a Python string
# the corresponding regular expression string is unchanged:
# 'ABC\-001'
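To confirm that the two spellings give the same pattern, they can be compared directly. This is a small sketch; the variable names escaped and raw are chosen only for this example.

```python
import re

escaped = 'ABC\\-001'   # backslash escaped for the Python string literal
raw = r'ABC\-001'       # raw string: the backslash is kept as typed

same = escaped == raw
print(same)             # True
print(raw, len(raw))    # ABC\-001 8
print(re.match(raw, 'ABC-001') is not None)   # True: \- matches a literal '-'
```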

The match() method determines whether the string matches: if the match succeeds, it returns a Match object; otherwise it returns None.

>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>

A common way to check is:

test = 'string entered by the user'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')
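The same check can be wrapped in a small function. The sketch below reuses the telephone pattern from the earlier example; the function name is_phone and the sample inputs are assumptions made for illustration.

```python
import re

def is_phone(text):
    """Return True if text looks like '010-12345': 3 digits, '-', 3 to 8 digits."""
    return re.match(r'^\d{3}-\d{3,8}$', text) is not None

for candidate in ['010-12345', '010 12345', '0101-12345']:
    print(candidate, 'ok' if is_phone(candidate) else 'failed')
```

Comparing the result with None makes the function return a plain True or False instead of a Match object.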

split string

Splitting strings with regular expressions is more flexible than splitting on fixed characters. Consider the normal splitting code:

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
# consecutive spaces are not recognized

With regular expressions:

>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']

>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']

>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']

group

In addition to simply judging whether a string matches, regular expressions also have the powerful ability to extract substrings. Parentheses () mark a group to be extracted:

>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'

If groups are defined in the regular expression, you can use the group() method on the Match object to extract the substrings.

Note that group(0) is always the string that matches the entire regular expression, while group(1), group(2), ... represent the 1st, 2nd, ... substrings.
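When all groups are needed at once, the Match object can also hand them back as a tuple. A minimal sketch, using the same telephone pattern; the names area, number and pair are only for illustration.

```python
import re

m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
area, number = m.groups()     # all captured groups as a tuple
pair = m.group(1, 2)          # several group numbers at once also give a tuple
print(area, number)           # 010 12345
print(pair)                   # ('010', '12345')
```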

But sometimes regular expressions alone cannot achieve complete validation, and the program has to do additional checks on the extracted values.
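One way to picture this cooperation is date checking: a pattern can confirm the shape of the text, while the calendar rules are left to ordinary code. The sketch below assumes a YYYY-MM-DD format and a hypothetical helper name is_valid_date; neither comes from the original notes.

```python
import re
from datetime import date

def is_valid_date(text):
    """Check the shape with a regex, then let datetime validate the calendar."""
    m = re.match(r'^(\d{4})-(\d{2})-(\d{2})$', text)
    if m is None:
        return False              # wrong shape, e.g. '2023/02/28'
    year, month, day = (int(g) for g in m.groups())
    try:
        date(year, month, day)    # raises ValueError for dates like Feb 30
    except ValueError:
        return False
    return True

print(is_valid_date('2023-02-28'))   # True
print(is_valid_date('2023-02-30'))   # False: passes the regex, fails the calendar
print(is_valid_date('2023/02/28'))   # False: fails the regex
```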

greedy matching

Regular expression matching is greedy by default, that is, it matches as many characters as possible.

For example, to match the trailing 0s of a number:

>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')

Because \d+ is greedy, it matches all the trailing 0s directly, and as a result 0* can only match the empty string.

\d+ must use non-greedy matching (that is, match as little as possible) so that the trailing 0s are left for 0* to match. Adding a ? makes \d+ non-greedy:

>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
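The same difference shows up with other patterns. As a small sketch with an invented sample string, compare .+ and .+? between angle brackets:

```python
import re

text = '<a><b>'
greedy = re.match(r'<.+>', text).group(0)   # .+ runs to the last '>'
lazy = re.match(r'<.+?>', text).group(0)    # .+? stops at the first '>'
print(greedy)   # <a><b>
print(lazy)     # <a>
```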

compile

When we use regular expressions in Python, the re module does two things internally:

  1. Compile the regular expression; if the regular expression string itself is illegal, an error is reported;
  2. Use the compiled regular expression to match the string.
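The error from the first step can be observed by compiling a malformed pattern. A minimal sketch, assuming a hypothetical helper try_compile; the exact error message depends on the Python version, so it is only printed, not relied on.

```python
import re

def try_compile(pattern):
    """Return (compiled_pattern, None) on success or (None, error_message)."""
    try:
        return re.compile(pattern), None
    except re.error as exc:
        return None, str(exc)

good, good_err = try_compile(r'^(\d{3})-(\d{3,8})$')
bad, bad_err = try_compile(r'^(\d{3}-(\d{3,8})$')   # unbalanced parenthesis
print(good is not None, good_err)
print(bad, bad_err)
```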

If a regular expression is to be reused thousands of times, for efficiency we can precompile it; the compile step is then skipped on reuse and matching happens directly:

>>> import re
# compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')

After compilation, a Regular Expression object is generated. Since the object itself contains the regular expression, there is no need to pass the pattern string when calling its methods.
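As a sketch of how a precompiled object might be reused across many inputs, the hypothetical helper parse_all below runs the same re_telephone object over a list of lines; the sample lines are invented for illustration.

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

def parse_all(lines):
    """Return (area, number) tuples for lines that match; skip the rest."""
    parsed = []
    for line in lines:
        m = re_telephone.match(line)   # no pattern argument needed here
        if m:
            parsed.append(m.groups())
    return parsed

print(parse_all(['010-12345', 'bad input', '010-8086']))
print(re_telephone.pattern)   # the original pattern string is kept on the object
```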

Origin blog.csdn.net/mwcxz/article/details/128713807