Python Regular Expressions

Python supports regular expressions through the re module. For the rules of regular expression syntax, see: https://www.cnblogs.com/tina-python/p/5508402.html

The re module

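The usual workflow with re is to compile the string form of a regular expression into a Pattern object, use that Pattern object to process text and obtain a match result (a Match object), and then read information from the Match object for further processing.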
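The three steps above can be sketched roughly as follows. The date pattern and the sample text are assumptions made up for illustration; they do not come from the original post.

```python
import re

# Step 1: compile the pattern string into a Pattern object.
# The date pattern below is an illustrative assumption.
date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# Step 2: apply the Pattern to some text; the result is a Match object or None.
result = date_pattern.search('released on 2020-01-15, patched later')

# Step 3: read information from the Match object.
if result is not None:
    print(result.group())   # the whole matched text
    print(result.groups())  # the captured groups as a tuple
    print(result.span())    # (start, end) offsets of the match
```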

re.match(pattern, string, flags = 0)

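re.match() tries to match a pattern at the start of the string. Here pattern is the regular expression to match, string is the string to match against, and flags is a set of flags that controls how matching is done, such as case sensitivity or multi-line matching. If the match succeeds, re.match returns a Match object; otherwise it returns None.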

Code:

#!/usr/bin/python3

import re

# span() returns a tuple holding the (start, end) positions of the match.
# None has no span() method, so check whether re.match() returned None
# before calling re.match(...).span().
# print(re.match('hello', 'hello world').span())  # match at the start of the string
# print(re.match('world', 'hello world'))  # no match at the start of the string

cnt1 = re.match('hello', 'hello world')
cnt2 = re.match('world', 'hello world')  # not at the start of the string, so this returns None

if cnt1 is not None:
    print(cnt1.span())
else:
    print(cnt1)

if cnt2 is not None:
    print(cnt2.span())
else:
    print(cnt2)

# Output:
# (0, 5)
# None


Note: None has no span method, so check whether re.match() returned None before calling re.match().span().
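One way to package that check is a small wrapper. This is a minimal sketch; match_span is a hypothetical helper name introduced here, not something the re module provides.

```python
import re

def match_span(pattern, text):
    """Return the (start, end) span of a match at the start of text, or None.

    match_span is a hypothetical helper, not part of the re module.
    """
    m = re.match(pattern, text)
    # m is either a Match object or None, so test it before calling span().
    return m.span() if m is not None else None

print(match_span('hello', 'hello world'))
print(match_span('world', 'hello world'))

# Without the check, calling span() on None raises AttributeError.
try:
    re.match('world', 'hello world').span()
except AttributeError as exc:
    print('AttributeError:', exc)
```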

re.search()

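re.search() scans the whole string and returns the first successful match. If a match is found it returns a Match object; otherwise it returns None.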

re.search(pattern, string, flags = 0)
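As a small illustration of "first successful match", the sketch below uses a made-up sentence in which the pattern could match in several places; only the leftmost match is reported.

```python
import re

# Sample text chosen for illustration; the pattern could match three times.
text = 'one cat, two cats, three cats'

first = re.search(r'cats?', text)
if first is not None:
    # Only the leftmost match is returned, even though later ones exist.
    print(first.group(), first.start(), first.end())

# A pattern that occurs nowhere in the text gives None.
missing = re.search(r'dog', text)
print(missing)
```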

Code:

#!/usr/bin/python3

import re

cnt1 = re.search('hello', 'hello world')
cnt2 = re.search('world', 'hello world')  # found later in the string

if cnt1 is not None:
    print(cnt1.span())
else:
    print(cnt1)

if cnt2 is not None:
    print(cnt2.span())
else:
    print(cnt2)

# Output:
# (0, 5)
# (6, 11)


Difference between re.match and re.search

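re.match only matches at the beginning of the string; if the beginning does not match, it returns None.

re.search scans the entire string until it finds the first match, and returns None only if it reaches the end without finding one.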
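The sketch below contrasts the two on a made-up two-line string. It also shows a detail that is easy to miss: the re.M flag does not make re.match try the start of later lines, whereas re.search with ^ and re.M does. re.fullmatch, which requires the whole string to match, is included for comparison only.

```python
import re

text = 'first line\nsecond line'

# re.match only tries position 0 of the string.
plain_match = re.match(r'second', text)

# Even with re.M, re.match does not try the start of the second line.
multiline_match = re.match(r'second', text, re.M)

# re.search with ^ and re.M finds the match at the start of line two.
multiline_search = re.search(r'^second', text, re.M)

# re.fullmatch succeeds only if the whole string matches the pattern.
partial = re.fullmatch(r'first', 'first line')
whole = re.fullmatch(r'first line', 'first line')

print(plain_match, multiline_match)
print(multiline_search.span() if multiline_search else None)
print(partial, whole is not None)
```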

Code:

#!/usr/bin/python3

import re

line = 'Cats are smarter than dogs'

# re.I (IGNORECASE) makes matching case-insensitive
# re.M (MULTILINE) enables multi-line matching, which affects ^ and $

# The r prefix marks a raw string, so backslashes are not treated as escapes.
matchObj = re.match(r'dogs', line, re.M | re.I)

if matchObj:
    print('use match, the match string is:', matchObj.group())
    # group() returns the string matched by the whole pattern; it also accepts
    # one or more group numbers and returns the strings matched by those groups
else:
    print('No match string')

matchObj = re.search(r'dogs', line, re.M | re.I)

if matchObj:
    print('use search, the match string is:', matchObj.group())
else:
    print('No match string')

# Output:
# No match string
# use search, the match string is: dogs


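Note:

The available flags are:

re.S (DOTALL): makes . match any character, including a newline
re.I (IGNORECASE): makes matching case-insensitive
re.L (LOCALE): locale-aware matching (French, for example); in Python 3 it can only be used with bytes patterns
re.M (MULTILINE): multi-line matching, which affects ^ and $
re.X (VERBOSE): allows a more flexible layout so that the regular expression is easier to read
re.U (UNICODE): interprets characters according to the Unicode character set, which affects \w, \W, \b and \B; in Python 3 this is already the default for str patterns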
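A short sketch of how some of these flags change matching. The sample text and variable names are chosen for illustration; the phone-number pattern is the one used later in this post, rewritten in verbose form.

```python
import re

text = 'Hello\nworld'

# re.S: '.' also matches the newline, so the pattern can span both lines.
without_dotall = re.search(r'Hello.world', text)
with_dotall = re.search(r'Hello.world', text, re.S)

# re.I: case-insensitive matching.
ignore_case = re.search(r'WORLD', text, re.I)

# re.M: '^' also matches right after each newline.
line_starts = re.findall(r'^\w+', text, re.M)

# re.X: whitespace and '#' comments inside the pattern are ignored.
phone = re.compile(r'''
    ^(\d{3})    # area code
    -           # separator
    (\d{3,8})$  # local number
''', re.X)

# Flags are combined with the bitwise OR operator.
combined = re.search(r'^WORLD$', text, re.M | re.I)

print(without_dotall, with_dotall is not None)
print(ignore_case.group(), line_starts)
print(phone.match('010-12345').groups(), combined.group())
```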

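group() returns the string matched by the whole regular expression. It also accepts one or more group numbers and returns the strings matched by those groups; the default argument is 0.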
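The different ways of calling group() can be sketched as below, using a made-up two-word pattern. groups() is shown alongside it because the later examples in this post rely on it.

```python
import re

m = re.match(r'(\w+) (\w+)', 'hello world again')

whole = m.group()        # no argument is the same as group(0): the whole match
first = m.group(1)       # the text captured by group 1
swapped = m.group(2, 1)  # several group numbers return a tuple, in the order given
everything = m.groups()  # all captured groups as a tuple

print(whole)
print(first)
print(swapped)
print(everything)
```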

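Greedy and non-greedy modes

In Python, quantifiers are greedy by default and always try to match as many characters as possible. Non-greedy mode is the opposite and always tries to match as few characters as possible.

For example, searching 'abbbc' with the regular expression 'ab*' finds 'abbb', whereas the non-greedy quantifier in 'ab*?' finds 'a'.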
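The 'ab*' example above can be checked directly. The second half applies the same idea to a made-up tag-like string, which is an illustrative addition rather than part of the original text.

```python
import re

text = 'abbbc'

greedy = re.search(r'ab*', text)   # b* takes as many b's as it can
lazy = re.search(r'ab*?', text)    # b*? takes as few b's as it can (zero here)
print(greedy.group())
print(lazy.group())

# The same contrast with '+': greedy runs to the last '>', lazy stops at the first.
markup = '<b>bold</b>'
greedy_tag = re.search(r'<.+>', markup)
lazy_tag = re.search(r'<.+?>', markup)
print(greedy_tag.group())
print(lazy_tag.group())
```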

Code:

#!/usr/bin/python3
import re

tmp = re.match(r'^(\d+)(0*)$', '102300')

if tmp:
    print(tmp.groups())
else:
    print(tmp)
# Output:
# ('102300', '')


Because \d+ is greedy, it consumes all of the trailing 0s, leaving 0* to match only the empty string. For 0* to match the two trailing 0s, \d+ must use non-greedy matching.

Code:

#!/usr/bin/python3
import re

tmp = re.match(r'^(\d+?)(0*)$', '102300')

if tmp:
    print(tmp.groups())
else:
    print(tmp)
# Output:
# ('1023', '00')


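Substitution

Python's re module provides re.sub for replacing matches in a string.

Pattern.sub(repl, string[, count]) | re.sub(pattern, repl, string[, count])

Returns the string obtained by replacing every matching substring of string with repl. When repl is a string, it may reference groups, for example with \1 or \2. When repl is a function, it must accept a single argument (the Match object) and return the replacement string; group references in the returned string are not expanded. count sets the maximum number of replacements; by default all matches are replaced.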
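The behaviours described above can be sketched as follows. The sample text and the double helper are made up for illustration.

```python
import re

text = 'a1 b22 c333'

# By default every match is replaced; count caps the number of replacements.
all_replaced = re.sub(r'\d+', '#', text)
two_replaced = re.sub(r'\d+', '#', text, count=2)

# A string repl may reference captured groups.
swapped = re.sub(r'([a-z])(\d+)', r'\2\1', text)

# A function repl receives the Match object and returns the replacement text.
def double(m):
    # double is a hypothetical helper for this example.
    return str(int(m.group()) * 2)

doubled = re.sub(r'\d+', double, text)

# Text returned by a function is inserted as-is: '\2\1' is not expanded here.
literal = re.sub(r'([a-z])(\d+)', lambda m: r'\2\1', 'a1')

print(all_replaced)
print(two_replaced)
print(swapped)
print(doubled)
print(literal)
```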

Code:

#!/usr/bin/python3
import re

def fun(m):
    # Capitalize both captured words; the returned string is used as-is.
    return m.group(1).title() + ' ' + m.group(2).title()

# Compile the pattern into a Pattern object. (Compiling frequently used
# regular expressions into Pattern objects can improve efficiency a little.)
pt = re.compile(r'(\w+) (\w+)')
greeting = 'i say, hello world!'

print(pt.sub(r'\2 \1', greeting))  # swap the two words in each matched pair

print(pt.sub(fun, greeting))

# Output:
# say i, world hello!
# I Say, Hello World!


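Compilation

When a regular expression is used in Python, the re module does two things internally:

1. It compiles the regular expression, raising an error if the pattern string is invalid.

2. It matches the string against the compiled regular expression.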
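Step 1 can be observed by compiling a deliberately broken pattern. This is a minimal sketch; try_compile is a hypothetical helper, and the broken pattern is the phone-number pattern with one parenthesis removed.

```python
import re

def try_compile(pattern):
    """Return a compiled Pattern, or None if the pattern string is invalid.

    try_compile is a hypothetical helper, not part of the re module.
    """
    try:
        return re.compile(pattern)
    except re.error as exc:
        # re.error is raised when the pattern string cannot be compiled.
        print('invalid pattern', repr(pattern), '->', exc)
        return None

good = try_compile(r'^(\d{3})-(\d{3,8})$')
bad = try_compile(r'^(\d{3}-(\d{3,8})$')  # missing a closing parenthesis

# Step 2: the compiled Pattern is then used for matching.
if good is not None:
    print(good.match('010-12345').groups())
```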

For a regular expression that is used many times, re.compile() can precompile it to improve efficiency:

#!/usr/bin/python3

import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')  # precompile the regular expression

cnt1 = re_telephone.match('010-12345')
cnt2 = re_telephone.match('010-8086')

if cnt1:
    print(cnt1.groups())
else:
    print(cnt1)

if cnt2:
    print(cnt2.groups())
else:
    print(cnt2)

# Output:
# ('010', '12345')
# ('010', '8086')
