Regular expressions
Regular expressions are a powerful tool for matching strings. The idea is to use a descriptive language to define a rule for strings: any string that satisfies the rule is said to "match"; any other string is considered invalid.
To check whether a string is a valid email address, we:
- create a regular expression that matches email addresses;
- use this regular expression to match the user's input and decide whether it is valid.
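The two steps above can be sketched as follows. The pattern is a deliberately simplified assumption (real email addresses allow far more than this), and the names `EMAIL_RE` and `is_email` are hypothetical.

```python
import re

# Step 1: create a (simplified, illustrative) pattern for email addresses:
# a local part, "@", then a domain containing at least one dot.
EMAIL_RE = re.compile(r'^[0-9a-zA-Z_.]+@[0-9a-zA-Z\-]+(\.[0-9a-zA-Z\-]+)+$')

def is_email(text):
    # Step 2: match the user's input against the pattern.
    return EMAIL_RE.match(text) is not None

print(is_email('someone@example.com'))  # True
print(is_email('not an email'))         # False
```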
In a regular expression, `.` matches any single character (except a newline, by default).
To match a variable number of characters, use `*` for any number of characters (including 0), `+` for at least one character, `?` for 0 or 1 character, `{n}` for exactly n characters, and `{n,m}` for n to m characters.
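A small sketch of these quantifiers in action. The helper `fits` is a hypothetical name; it relies on `re.fullmatch`, which succeeds only when the whole string fits the pattern.

```python
import re

def fits(pattern, text):
    # True only if the entire string satisfies the pattern.
    return re.fullmatch(pattern, text) is not None

print(fits(r'a.c', 'abc'))       # True:  '.' stands for exactly one character
print(fits(r'ab*', 'a'))         # True:  '*' allows zero occurrences of 'b'
print(fits(r'ab+', 'a'))         # False: '+' needs at least one 'b'
print(fits(r'ab?', 'abb'))       # False: '?' allows at most one 'b'
print(fits(r'a{3}', 'aaa'))      # True:  exactly three 'a'
print(fits(r'a{2,4}', 'aaaaa'))  # False: five 'a' is more than four
```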
Advanced
For a more precise match, use `[]` to specify a range, for example:
- `[0-9a-zA-Z_]` matches a single digit, letter, or underscore;
- `[0-9a-zA-Z_]+` matches a string of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'Py3000', etc.;
- `[a-zA-Z_][0-9a-zA-Z_]*` matches a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores, which is the form of a legal Python variable name;
- `[a-zA-Z_][0-9a-zA-Z_]{0,19}` is more precise: it limits the variable name to 1-20 characters (1 leading character + up to 19 following characters).
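As a rough illustration, the length-limited pattern can be wrapped in a checker. `NAME_RE` and `is_short_name` are hypothetical names, and the check covers ASCII names only (it does not exclude Python keywords).

```python
import re

# Note: the quantifier must be written without a space, as {0,19}.
# Python's re treats "{0, 19}" (with a space) as literal text, not as a repeat count.
NAME_RE = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

def is_short_name(text):
    # fullmatch: the whole string must be a 1-20 character name.
    return NAME_RE.fullmatch(text) is not None

print(is_short_name('Py3000'))    # True
print(is_short_name('3abc'))      # False: starts with a digit
print(is_short_name('a' * 21))    # False: longer than 20 characters
```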
`A|B` matches either A or B, so `(P|p)ython` matches 'Python' or 'python'.
`^` indicates the beginning of a line, so `^\d` means the string must start with a digit.
`$` indicates the end of a line, so `\d$` means the string must end with a digit.
You may have noticed that `py` also matches 'python', but with `^py$` it becomes a whole-line match and can only match 'py'.
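The effect of `|`, `^` and `$` can be checked with a short sketch. The helper `found` is a hypothetical name; it uses `re.search`, which looks for the pattern anywhere in the string, so the anchors are what restrict the position.

```python
import re

def found(pattern, text):
    # re.search scans the whole string for the first place the pattern fits.
    return re.search(pattern, text) is not None

print(found(r'(P|p)ython', 'Python'))  # True
print(found(r'(P|p)ython', 'python'))  # True
print(found(r'py', 'python'))          # True:  'py' occurs inside 'python'
print(found(r'^py$', 'python'))        # False: the whole line must be 'py'
print(found(r'^py$', 'py'))            # True
print(found(r'^\d', '3 apples'))       # True:  starts with a digit
print(found(r'\d$', 'apples 3'))       # True:  ends with a digit
```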
re module
With these preliminaries, we can use regular expressions in Python. Python provides the `re` module, which contains all the regular expression functionality. Because Python strings themselves also use `\` as an escape character, pay special attention to escaping.
==It is strongly recommended to use Python's `r` prefix, so that you do not need to worry about escaping:==
s = 'ABC\\-001' # a Python string
# the corresponding regular expression string becomes:
# 'ABC\-001'
s = r'ABC\-001' # a Python string
# the corresponding regular expression string is unchanged:
# 'ABC\-001'
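A quick way to convince yourself that the two spellings produce the same pattern (the variable names `escaped` and `raw` are illustrative only):

```python
import re

escaped = 'ABC\\-001'   # ordinary string: the backslash itself must be escaped
raw = r'ABC\-001'       # raw string: backslashes are kept as written

print(escaped == raw)   # True: both hold the same 8 characters
print(len(raw))         # 8
print(re.match(raw, 'ABC-001') is not None)  # True: \- matches a literal '-'
```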
The `match()` method checks whether a string matches. If the match succeeds, it returns a `Match` object; otherwise it returns `None`.
>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>
A common way to test a string is:
test = 'string entered by the user'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')
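The same check can be wrapped in a small function. The names `is_phone` and `is_phone_strict` are hypothetical; the pattern is the phone-number pattern used earlier in this section. One detail worth knowing: `$` also matches just before a trailing newline, so `re.fullmatch` is a possible alternative when the input must match exactly.

```python
import re

PHONE = r'^\d{3}-\d{3,8}$'

def is_phone(text):
    # re.match returns a Match object (truthy) or None.
    return re.match(PHONE, text) is not None

def is_phone_strict(text):
    # fullmatch needs no anchors and rejects a trailing newline.
    return re.fullmatch(r'\d{3}-\d{3,8}', text) is not None

print(is_phone('010-12345'))           # True
print(is_phone('010 12345'))           # False
print(is_phone('010-12345\n'))         # True: $ matches before the final newline
print(is_phone_strict('010-12345\n'))  # False
```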
Splitting strings
Splitting strings with regular expressions is more flexible than splitting on a fixed character. Consider the ordinary splitting code:
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
# consecutive spaces are not recognized as a single separator
With regular expressions:
>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']
>>> re.split(r'[\s\,]+', 'a,b, c d')
['a', 'b', 'c', 'd']
>>> re.split(r'[\s\,\;]+', 'a,b;; c d')
['a', 'b', 'c', 'd']
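One caveat, shown as a sketch: when the input begins or ends with a separator, `re.split` returns empty strings at the edges. The helper `tokens` (a hypothetical name) simply filters them out.

```python
import re

def tokens(text):
    # Split on runs of whitespace, commas, or semicolons, dropping empty pieces.
    return [part for part in re.split(r'[\s,;]+', text) if part]

print(re.split(r'[\s,;]+', ' a,b;; c  d '))  # ['', 'a', 'b', 'c', 'd', '']
print(tokens(' a,b;; c  d '))                # ['a', 'b', 'c', 'd']
```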
Groups
In addition to simply checking whether a string matches, regular expressions can also ==extract substrings==. Parentheses `()` mark the group to be extracted:
>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'
If groups are defined in the regular expression, you can use the `group()` method on the `Match` object to extract the substrings.
Note that ==`group(0)` is always the string that matches the entire regular expression==, while `group(1)`, `group(2)`, ... are the 1st, 2nd, ... substrings.
Sometimes, however, a regular expression alone cannot validate the input completely, and additional program logic is needed.
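As one possible illustration of combining groups with program logic, the sketch below lets the regular expression check only the shape of a time string and leaves the range checks to ordinary code. The names `TIME_RE` and `parse_time`, and the `hh:mm:ss` format, are assumptions made for this example.

```python
import re

# The pattern only checks the shape: 1-2 digits, then two 2-digit fields.
TIME_RE = re.compile(r'^(\d{1,2}):(\d{2}):(\d{2})$')

def parse_time(text):
    m = TIME_RE.match(text)
    if m is None:
        return None
    # groups() returns the captured substrings; convert them to integers.
    hour, minute, second = (int(g) for g in m.groups())
    # Range checks are simpler in code than in the pattern itself.
    if hour > 23 or minute > 59 or second > 59:
        return None
    return hour, minute, second

print(parse_time('19:05:30'))  # (19, 5, 30)
print(parse_time('25:00:00'))  # None: the shape is fine, but the hour is out of range
```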
Greedy matching
Regular expression matching is greedy by default, that is, it matches as many characters as possible. For example, to match the trailing `0`s of a number:
>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')
Because `\d+` is greedy, it consumes all the trailing `0`s, and `0*` can only match the empty string.
To let `0*` match the trailing `0`s, `\d+` must use non-greedy matching (that is, match as few characters as possible). Adding a `?` makes `\d+` non-greedy:
>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
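The same difference shows up with other patterns. In this sketch (the helper `first_tag` and the sample text are made up for illustration), a greedy `.+` runs to the last `>`, while the non-greedy `.+?` stops at the first one.

```python
import re

def first_tag(pattern, text):
    # Return the text matched at the start of the string, or None.
    m = re.match(pattern, text)
    return m.group(0) if m else None

print(first_tag(r'<.+>', '<a><b>'))   # <a><b>  (greedy: as much as possible)
print(first_tag(r'<.+?>', '<a><b>'))  # <a>     (non-greedy: as little as possible)
```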
Compiling
When we use a regular expression in Python, the re module does two things internally:
- compiles the regular expression; if the regular expression string itself is invalid, an error is reported;
- uses the **compiled regular expression** to match the string.
If a regular expression is to be reused thousands of times, we can precompile it for efficiency, so that later uses skip the compilation step and match directly:
>>> import re
# Compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')
Compilation produces a regular expression object. Because the object already contains the regular expression, there is no need to pass the pattern string when calling its methods.
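A minimal sketch of reusing one compiled pattern over many inputs. `RE_TELEPHONE`, `split_phone` and the sample list are hypothetical; the function also guards against `match()` returning `None`, since calling `.groups()` on `None` would raise `AttributeError`.

```python
import re

# Compile once, at import time, and reuse the object for every input.
RE_TELEPHONE = re.compile(r'^(\d{3})-(\d{3,8})$')

def split_phone(text):
    m = RE_TELEPHONE.match(text)
    # match() returns None for non-matching input, so check before using it.
    return m.groups() if m else None

numbers = ['010-12345', '010-8086', 'bad input']
print([split_phone(n) for n in numbers])
# [('010', '12345'), ('010', '8086'), None]
```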