I have heard about regular expressions many times but never actually used them. This time I really need them, so I am going to learn them.
Reference links:
https://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386832260566c26442c671fa489ebc6fe85badda25cd000
http://www.runoob.com/regexp/regexp-syntax.html
http://www.runoob.com/python3/python3-reg-expressions.html
1. Introduction to syntax
1.1 Character composition
Regular expressions are mainly used for string matching.
A regular expression is a text pattern made up of ordinary characters and special characters (metacharacters).
Ordinary characters include printable and non-printable characters.
1.2 Specific usage
1. "\d" matches one digit; "\w" matches one letter, digit, or underscore; "." matches any single character except "\n". "*" means the preceding element repeats 0 or more times, "+" means 1 or more times, and "?" means 0 or 1 time.
Example:
"\d\d\d" matches "345"
"\w\w\w" matches "we2"
"a*" matches "aaa" and "aa" (and also the empty string)
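The basic tokens above can be checked quickly with Python's re module. This is only a minimal sketch: the helper name full_match is made up for illustration and is not part of the original notes, and re.fullmatch requires Python 3.4 or later.

```python
import re

def full_match(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(full_match(r"\d\d\d", "345"))      # True: three digits
print(full_match(r"\w\w\w", "we2"))      # True: letters and digits are word characters
print(full_match(r"a*", "aaa"))          # True: "a" repeated three times
print(full_match(r"a*", ""))             # True: zero repetitions are allowed
print(full_match(r"a+", ""))             # False: "+" needs at least one "a"
print(full_match(r"colou?r", "color"))   # True: the "u" is optional
print(full_match(r".", "\n"))            # False: "." does not match a newline by default
```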
2. {n} means the preceding element repeats exactly n times, and {n,m} means it repeats n to m times.
Example:
"\d{3}\s+\d{3,9}"
This matches three digits, then one or more whitespace characters, then three to nine digits.
For example: "231 34367"
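To see the counted quantifiers in action, the pattern from the example can be tried against a few strings. This is a small sketch; the test strings other than "231 34367" are my own and chosen only to show one passing and two failing cases.

```python
import re

# Three digits, one or more whitespace characters, then three to nine digits.
pattern = re.compile(r"\d{3}\s+\d{3,9}")

print(bool(pattern.fullmatch("231 34367")))     # True
print(bool(pattern.fullmatch("231    34367")))  # True: \s+ accepts several spaces
print(bool(pattern.fullmatch("23 34367")))      # False: only two leading digits
print(bool(pattern.fullmatch("231 12")))        # False: fewer than three trailing digits
```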
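3. Ranges with []
[0-9a-zA-Z_] matches a single digit, letter, or underscore, such as "a" or "3".
[0-9a-zA-Z_]+ matches a string of one or more digits, letters, or underscores, such as "sdag1" or "sf321_12".
[a-zA-Z_][0-9a-zA-Z_]* matches a string that starts with a letter or underscore, followed by zero or more digits, letters, or underscores. Variable names in many programming languages follow this form.
[a-zA-Z_][0-9a-zA-Z_]{0,19} matches a string that starts with a letter or underscore, followed by 0 to 19 digits, letters, or underscores, so the total length is between 1 and 20.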
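The length-limited identifier pattern can be wrapped in a small checker. This is a sketch under my own naming: IDENTIFIER and is_identifier are hypothetical names used only for this illustration.

```python
import re

# A letter or underscore, then 0 to 19 letters, digits, or underscores.
IDENTIFIER = re.compile(r"[a-zA-Z_][0-9a-zA-Z_]{0,19}")

def is_identifier(name):
    """Return True if name is 1 to 20 characters long and looks like a variable name."""
    return IDENTIFIER.fullmatch(name) is not None

print(is_identifier("sf321_12"))  # True
print(is_identifier("_tmp"))      # True: a leading underscore is allowed
print(is_identifier("3abc"))      # False: starts with a digit
print(is_identifier("a" * 21))    # False: longer than 20 characters
```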
4. "A|B" matches A or B.
^ means the start of a line; ^\d means the line must start with a digit.
$ means the end of a line; \d$ means the line must end with a digit.
^py$ matches the entire line, so it matches only "py".
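Alternation and the two anchors can be tried out as follows. This is a minimal sketch; the sample strings and the "(P|p)ython" pattern are my own examples, not taken from the notes above.

```python
import re

# ^ anchors the match at the start, $ at the end.
print(bool(re.search(r"^\d", "3 apples")))   # True: starts with a digit
print(bool(re.search(r"^\d", "apples 3")))   # False: starts with a letter
print(bool(re.search(r"\d$", "apples 3")))   # True: ends with a digit
print(bool(re.search(r"^py$", "py")))        # True: the whole line is "py"
print(bool(re.search(r"^py$", "python")))    # False: extra characters after "py"

# A|B matches either alternative; parentheses limit how far the | reaches.
print(bool(re.fullmatch(r"(P|p)ython", "Python")))  # True
print(bool(re.fullmatch(r"(P|p)ython", "python")))  # True
```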
2. re module
2.1 Module introduction
1. re.match function
2. re.search function
3. re.sub function (search and replace)
4. re.compile function
5. re.findall function
6. re.finditer function
7. re.split function
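As a quick orientation, the functions listed above can be compared on one input. This is a rough sketch, assuming a made-up sample string; the variable names text and phone are mine and purely illustrative.

```python
import re

text = "tel: 010-123456, fax: 021-7654321"  # made-up sample data
phone = re.compile(r"\d{3}-\d{5,8}")        # re.compile builds a reusable pattern object

# re.match only tries at the very start of the string, so it fails here.
print(re.match(r"\d{3}-\d{5,8}", text))     # None

# re.search scans the string and returns the first match.
print(re.search(r"\d{3}-\d{5,8}", text).group())  # 010-123456

# re.sub replaces every match of the pattern.
print(re.sub(r"\d", "#", text))             # tel: ###-######, fax: ###-#######

# findall returns all matches as a list of strings.
print(phone.findall(text))                  # ['010-123456', '021-7654321']

# finditer yields Match objects, which also carry positions.
for m in phone.finditer(text):
    print(m.start(), m.group())

# re.split cuts the string wherever the pattern matches.
print(re.split(r",\s*", text))              # ['tel: 010-123456', 'fax: 021-7654321']
```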
2.2 Detailed introduction
1. The r prefix
Adding r before a string literal makes it a raw string, so backslashes do not need to be escaped a second time.
s = "adb\\-sx2"    regular expression: "adb\-sx2"
s = r"adb\-sx2"    regular expression: "adb\-sx2"
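That both spellings produce the same pattern can be confirmed directly. A minimal sketch; the variable names escaped and raw are only for illustration.

```python
import re

escaped = "adb\\-sx2"  # ordinary string: the backslash has to be doubled
raw = r"adb\-sx2"      # raw string: the backslash is kept as written

print(escaped == raw)                        # True: both hold the same 8 characters
print(len(raw))                              # 8
print(bool(re.fullmatch(raw, "adb-sx2")))    # True: \- matches a literal hyphen
```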
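2. Example 1: if the pattern matches, re.match returns a Match object; otherwise it returns None.
import re
# Three digits, a hyphen, then six to eight digits
a = re.match(r"\d{3}\-\d{6,8}", "010-123467")
b = re.match(r"\d{3}\-\d{6,8}", "010 123467")
print(a)  # a Match object
print(b)  # None, because the separator is a space instead of a hyphen
if a:
    print("yes")
else:
    print("no")
if b:
    print("yes")
else:
    print("no")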
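3. Splitting strings: re.split can split on runs of spaces or other unwanted separators, which a plain str.split cannot handle.
import re
s1 = "a b   c  d"  # note the consecutive spaces
s2 = "a, b,c, d, e: f"
x1 = s1.split(" ")              # leaves empty strings between consecutive spaces
x2 = re.split(r" +", s1)        # treats any run of spaces as one separator
x3 = re.split(r"[\s,:]+", s2)   # splits on whitespace, commas, and colons
print("x1=", x1)
print("x2=", x2)  # ['a', 'b', 'c', 'd']
print("x3=", x3)  # ['a', 'b', 'c', 'd', 'e', 'f']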
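4. Groups: parentheses () define a group, which makes it possible to extract substrings.
import re
m = re.match(r"^(\d{3})-(\d{5,8})", "010-123456")
print(m)
print(m.group(0))  # 010-123456 (group 0 is always the whole match)
print(m.group(1))  # 010
print(m.group(2))  # 123456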
import re
time = "19:34:46"
# Hours 00-23, minutes 00-59, seconds 00-59
m = re.match(r"([01][0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])", time)
print(m.groups())  # ('19', '34', '46')
print(m.group(1))  # 19
print(m.group(2))  # 34
print(m.group(3))  # 46
5. Greedy matching
Regular expressions are greedy by default: each quantifier matches as many characters as possible.
import re
m = re.match(r"^(\d+)(0*)$", "230010000")
print(m.groups())  # ('230010000', '')
Here we wanted two groups: the second group should hold the trailing run of zeros, and the first group the digits before it.
However, "\d+" is greedy and also consumes the trailing "0000", so "0*" is left with only the empty string.
To make "\d+" non-greedy, so that it matches as few characters as possible, add a "?" after it.
import re
m = re.match(r"^(\d+?)(0*)$", "230010000")
print(m.groups())  # ('23001', '0000')
6. Compile
When we use a regular expression in Python, the re module does two things internally:
It compiles the regular expression; if the pattern string itself is invalid, an error is raised.
It uses the compiled regular expression to match strings.
If a regular expression is going to be reused thousands of times, we can precompile it for efficiency. Later uses then skip the compilation step and match directly.
import re
pattern = re.compile(r"^(\d{3})-(\d{5,9})$")
s1 = "010-1234567"
s2 = "212-3421534"
m1 = pattern.match(s1)
m2 = pattern.match(s2)
print(m1.groups())  # ('010', '1234567')
print(m2.groups())  # ('212', '3421534')
7. Homework 1
Try to write a regular expression that validates email addresses. Version 1 should be able to validate addresses like these:
[email protected]
[email protected]
import re
# The dot before "com" is escaped so that it matches only a literal "."
pattern = re.compile(r"^[\w.]+@\w+\.com$")
s1 = "[email protected]"
s2 = "[email protected]"
for s in (s1, s2):
    m = pattern.match(s)
    # match() returns None when the string is not a valid address
    print(m.group() if m else None)
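8. Homework 2
Version 2 can validate and extract email addresses that carry a name:
import re
# Both dots that should be literal are escaped with a backslash
pattern = re.compile(r"^(<[\w\s]+>)\s[\w.]+@\w+\.\w+$")
s = "<Tom Paris> [email protected]"
m = pattern.match(s)
if m:
    print(m.group())   # the whole string
    print(m.group(1))  # the name part, including the angle brackets
else:
    print(None)        # the string did not match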
That's all.