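Pattern Matching with Regular Expressions

Finding Text with Regular Expressions

All of Python's regular expression functions live in the re module. Import it first: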

import re

Passing a pattern string to re.compile() returns a Regex pattern object (or simply a Regex object).

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

The phoneNumRegex variable now holds a Regex object.

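Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if the pattern is not found. Call the Match object's group() method to get the actual matched text as a string.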

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())
Phone number found: 415-555-4242
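
Because search() returns None when nothing matches, calling group() on the result directly would raise an AttributeError. Below is a minimal sketch of guarding against that. The helper name find_phone is an assumption made for illustration and does not come from the book.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    # find_phone is a hypothetical helper name used only for this sketch.
    mo = phoneNumRegex.search(text)
    return mo.group() if mo is not None else None

found = find_phone('My number is 415-555-4242.')
missing = find_phone('No number here.')
print(found)
print(missing)
```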

Matching More Patterns with Regular Expressions

Grouping with Parentheses

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
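
As a small complement to the session above, this sketch shows that group(0) is the whole match and that groups() returns every group at once as a tuple, which can be unpacked. The variable names areaCode and mainNumber are chosen for illustration.

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')

# group(0), like group() with no argument, is the entire matched text.
whole = mo.group(0)

# groups() returns a tuple of all captured groups, which can be unpacked.
areaCode, mainNumber = mo.groups()

print(whole)
print(areaCode, mainNumber)
```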

Matching Multiple Groups with the Pipe

>>> batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = batRegex.search('Batmobile lost a wheel')
>>> mo.group()
'Batmobile'
>>> mo.group(1)
'mobile'
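
The pipe also works without parentheses. When more than one alternative occurs in the searched text, search() returns whichever appears first in the string. A short sketch, where the name heroRegex and the sample sentences are only illustrative:

```python
import re

heroRegex = re.compile(r'Batman|Tina Fey')

# The leftmost occurrence in the searched text wins, not the order
# in which the alternatives are written in the pattern.
first = heroRegex.search('Batman and Tina Fey.').group()
second = heroRegex.search('Tina Fey and Batman.').group()

print(first)
print(second)
```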

Optional Matching with the Question Mark

>>> batRegex = re.compile(r'Bat(wo)?man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

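Note: you can read ? as "match the group before this question mark zero or one times".

Matching Zero or More with the Star

>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')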
>>> mo3.group()
'Batwowowowoman'

Matching One or More with the Plus

>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'
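
To see how ?, * and + differ, the sketch below runs the three Bat(wo)...man patterns against the same sample strings and records whether each one matches. The dictionary layout and the names quantifiers, samples and results are assumptions made for this comparison.

```python
import re

# One pattern per quantifier, applied to the same (wo) group.
quantifiers = {
    '?': r'Bat(wo)?man',   # zero or one "wo"
    '*': r'Bat(wo)*man',   # zero or more "wo"
    '+': r'Bat(wo)+man',   # one or more "wo"
}
samples = ['Batman', 'Batwoman', 'Batwowowowoman']

results = {}
for symbol, pattern in quantifiers.items():
    regex = re.compile(pattern)
    # True where the pattern is found in the sample, False otherwise.
    results[symbol] = [regex.search(s) is not None for s in samples]
    print(symbol, results[symbol])
```

Only + rejects plain 'Batman', and only ? rejects the repeated 'Batwowowowoman'.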

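Matching Specific Repetitions with Curly Brackets

The regex (Ha){3} matches the string 'HaHaHa' but not 'HaHa', because the latter repeats the (Ha) group only twice.

Greedy and Nongreedy Matching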

>>> greedyHaRegex = re.compile(r'(Ha){3,5}')

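The regex (Ha){3,5} can match three, four, or five instances of Ha, and group() returns the longest possible match. Three or four repetitions would satisfy the pattern, but when five are available it matches all five.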

>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'

Python's regular expressions are greedy by default. To make the match nongreedy, put a question mark after the closing curly bracket.
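
A minimal side-by-side sketch of the two modes on the same input. The name nongreedyHaRegex is introduced here for illustration.

```python
import re

greedyHaRegex = re.compile(r'(Ha){3,5}')
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')   # trailing ? makes it nongreedy

# Greedy takes as many repetitions as allowed; nongreedy takes as few.
greedy = greedyHaRegex.search('HaHaHaHaHa').group()
nongreedy = nongreedyHaRegex.search('HaHaHaHaHa').group()

print(greedy)
print(nongreedy)
```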

The findall() Method

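Besides search(), Regex objects also have a findall() method. search() returns a Match object for the first match in the searched string, whereas findall() returns the strings of every match in the searched string.

In short, the Match object returned by search() holds only the first occurrence of matching text, while findall() returns a list of strings as long as the regular expression has no groups. Each string in the list is a piece of the searched text that matched the pattern.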

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']

When called on a regular expression that has groups, findall() returns a list of tuples of strings.

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]
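
When the pattern has groups but you still want whole phone numbers, one option is to join each tuple back together. This is a sketch of that idea rather than anything the book prescribes, and the names parts and numbers are illustrative.

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')  # has groups
text = 'Cell: 415-555-9999 Work: 212-555-0000'

# With groups, findall() yields one tuple of group strings per match.
parts = phoneNumRegex.findall(text)

# Rebuild the full numbers by joining each tuple with the separator.
numbers = ['-'.join(p) for p in parts]

print(parts)
print(numbers)
```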

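Character Classes

\d  Any numeric digit from 0 to 9.
\D  Any character that is not a numeric digit from 0 to 9.
\w  Any letter, numeric digit, or the underscore character (think of this as matching "word" characters).
\W  Any character that is not a letter, numeric digit, or the underscore character.
\s  Any space, tab, or newline character (think of this as matching "space" characters).
\S  Any character that is not a space, tab, or newline.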

>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']

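You can also define your own character class with square brackets. The class [a-zA-Z0-9] matches all lowercase letters, uppercase letters, and digits.

The character class [0-5] matches only the digits 0 to 5.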
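
A short sketch of custom character classes in use. The vowel class and the sample string 'Room 1729, floor 6' are assumptions added for illustration.

```python
import re

# A custom class listing every vowel in both cases.
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowels = vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

# A range inside the brackets: only the digits 0 through 5 match.
lowDigitRegex = re.compile(r'[0-5]')
lowDigits = lowDigitRegex.findall('Room 1729, floor 6')

print(vowels)
print(lowDigits)
```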

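The Caret Character (^)

A negative character class, written with a caret (^) right after the opening square bracket, matches every character that is not in the character class.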

>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')
['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

A caret (^) at the start of a regular expression means the match must occur at the beginning of the searched text.

>>> beginsWithHello = re.compile(r'^Hello')
>>> beginsWithHello.search('Hello world!')
<_sre.SRE_Match object; span=(0, 5), match='Hello'>

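This differs from the caret inside square brackets described above: there it negates a character class, whereas here it anchors the pattern to the start of the searched string (the text inside the quotes).

You can likewise put a dollar sign ($) at the end of a regular expression to indicate that the string must end with this pattern.

Use ^ and $ together to indicate that the entire string must match the pattern, not just some part of it.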

>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<_sre.SRE_Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz67890') == None
True
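
The sketch below contrasts a pattern anchored only at the end with one anchored at both ends. The name endsWithNumber and the sample strings are illustrative assumptions.

```python
import re

endsWithNumber = re.compile(r'\d$')      # only the end is anchored
wholeStringIsNum = re.compile(r'^\d+$')  # both ends are anchored

# $ alone: the string just has to finish with a digit.
ends_ok = endsWithNumber.search('Your number is 42')
ends_bad = endsWithNumber.search('Your number is forty two.')

# ^ and $ together: every character must be a digit.
whole_ok = wholeStringIsNum.search('1234567890')
whole_bad = wholeStringIsNum.search('12 34567890')

print(ends_ok.group(), ends_bad, whole_ok.group(), whole_bad)
```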

The Wildcard Character

The . (dot) character is called a wildcard. It matches any character except a newline.

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']

Matching Everything with Dot-Star (.*)

Try it together with greedy and nongreedy matching:

>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'
>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'

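Dot-star matches everything except a newline. Passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including the newline.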


>>> noNewlineRegex = re.compile('.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.'
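
For comparison, here is the same search with and without re.DOTALL. The name newlineRegex and the result variable names are illustrative.

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

noNewlineRegex = re.compile(r'.*')           # dot stops at the first newline
newlineRegex = re.compile(r'.*', re.DOTALL)  # dot also matches newlines

first_line = noNewlineRegex.search(text).group()
everything = newlineRegex.search(text).group()

print(repr(first_line))
print(repr(everything))
```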

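• ? matches zero or one of the preceding group.
• * matches zero or more of the preceding group.
• + matches one or more of the preceding group.
• {n} matches exactly n of the preceding group.
• {n,} matches n or more of the preceding group.
• {,m} matches 0 to m of the preceding group.
• {n,m} matches at least n and at most m of the preceding group.
• {n,m}? or *? or +? performs a nongreedy match of the preceding group.
• ^spam means the string must begin with spam.
• spam$ means the string must end with spam.
• . matches any character, except newline characters.
• \d, \w, and \s match a digit, word, or space character, respectively.
• \D, \W, and \S match anything except a digit, word, or space character, respectively.
• [abc] matches any character between the brackets (such as a, b, or c).
• [^abc] matches any character that isn't between the brackets.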
Managing Complex Regexes

Pass re.VERBOSE as the second argument to re.compile() to tell it to ignore whitespace and comments inside the regular expression string.

phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))? # area code
(\s|-|\.)? # separator
\d{3} # first 3 digits
(\s|-|\.) # separator
\d{4} # last 4 digits
(\s*(ext|x|ext\.)\s*\d{2,5})? # extension
)''', re.VERBOSE)
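
To check that the verbose pattern behaves as intended, the sketch below compiles it as a standalone script and searches two sample strings. The sample strings are made up for illustration, and the dot in ext\. is escaped here so that it matches a literal period.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code, optionally in parentheses
    (\s|-|\.)?                      # separator
    \d{3}                           # first 3 digits
    (\s|-|\.)                       # separator
    \d{4}                           # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # extension
    )''', re.VERBOSE)

# Whitespace and comments in the pattern are ignored because of re.VERBOSE.
plain = phoneRegex.search('Call 415-555-4242 today').group()
with_ext = phoneRegex.search('Office: (415) 555-4242 x99').group()

print(plain)
print(with_ext)
```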

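Practice Questions

The re.compile() function returns Regex objects.
Raw strings are used so that backslashes do not have to be escaped.
The search() method returns Match objects.
The group() method returns strings of the matched text.
Group 0 is the entire match, group 1 covers the first set of parentheses, and group 2 covers the second set of parentheses.
Periods and parentheses can be escaped with a backslash: \., \(, and \).
If the regex has no groups, findall() returns a list of strings. If the regex has groups, it returns a list of tuples of strings.
The | character signifies matching "either, or" between two groups.
The ? character can either mean "match zero or one of the preceding group" or be used to signify nongreedy matching.
+ matches one or more. * matches zero or more.
{3} matches exactly three instances of the preceding group. {3,5} matches between three and five instances.
The shorthand character classes \d, \w, and \s match a single digit, word, or space character, respectively.
The shorthand character classes \D, \W, and \S match a single character that is not a digit, word, or space character, respectively.
Passing re.I or re.IGNORECASE as the second argument to re.compile() makes the matching case insensitive.
The . character normally matches any character except the newline character. If re.DOTALL is passed as the second argument to re.compile(), the dot also matches newline characters.
.* performs a greedy match, and .*? performs a nongreedy match.
Either [a-z0-9] or [0-9a-z]
'X drummers, X pipers, five rings, X hens'
The re.VERBOSE argument ignores whitespace and comments in the regular expression string.
re.compile(r'^\d{1,3}(,\d{3})*$')
re.compile(r'[A-Z][a-z]*\sNakamoto')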
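
The comma-separated number answer is easy to get subtly wrong, so here is a quick check of the pattern ^\d{1,3}(,\d{3})*$ against a few inputs. The variable names and the test values are assumptions chosen for this sketch.

```python
import re

# One to three leading digits, then any number of ",ddd" groups.
commaNumRegex = re.compile(r'^\d{1,3}(,\d{3})*$')

valid = ['42', '1,234', '6,368,745']
invalid = ['12,34,567', '1234']

valid_results = [commaNumRegex.search(s) is not None for s in valid]
invalid_results = [commaNumRegex.search(s) is not None for s in invalid]

print(valid_results)
print(invalid_results)
```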

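Practice Project
Strong Password Detection:

At least eight characters long
Contains both uppercase and lowercase characters
Has at least one digit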

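import re

# One regex per rule; the password is strong only if all three match.
len_str = re.compile(r'.{8,}')                      # at least 8 characters
num_str = re.compile(r'\d')                         # at least one digit
str_str = re.compile(r'[a-z].*[A-Z]|[A-Z].*[a-z]')  # both lowercase and uppercase

def pwdTest(pwd):
    if len_str.search(pwd) and num_str.search(pwd) and str_str.search(pwd):
        print(pwd + "  is strong enough")
        # Show the span that satisfied the mixed-case rule.
        print(str_str.search(pwd))
    else:
        print('Noo it\'s so weak')

test = 'test123TEST'
pwdTest(test)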
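
An alternative sketch of the same check that returns a boolean instead of printing, which makes it easier to test. The function name is_strong and the sample passwords are assumptions made here, and the length rule uses len() rather than a regex.

```python
import re

def is_strong(pwd):
    """Return True if pwd meets all of the strength rules listed above."""
    # is_strong is a hypothetical name used only for this sketch.
    checks = [
        len(pwd) >= 8,                          # at least 8 characters
        re.search(r'[a-z]', pwd) is not None,   # a lowercase letter
        re.search(r'[A-Z]', pwd) is not None,   # an uppercase letter
        re.search(r'\d', pwd) is not None,      # a digit
    ]
    return all(checks)

for candidate in ['test123TEST', 'short1A', 'alllowercase1', 'NoDigitsHere']:
    print(candidate, is_strong(candidate))
```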

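Reading and Writing Files

On Windows, paths are written using backslashes (\) as the separator between folders. OS X and Linux, however, use the forward slash (/) as their path separator.

os.path.join() builds a file path string using the correct separator for the current operating system.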

>>> import os
>>> myFiles = ['accounts.txt', 'details.csv', 'invite.docx']
>>> for filename in myFiles:
...     print(os.path.join('C:\\Users\\asweigart', filename))
...
C:\Users\asweigart\accounts.txt
C:\Users\asweigart\details.csv
C:\Users\asweigart\invite.docx
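
The same idea as a script that runs on any operating system: build the path from individual folder names and let os.path.join() pick the separator. The relative folder 'Users/asweigart' is used here only as sample data.

```python
import os

myFiles = ['accounts.txt', 'details.csv', 'invite.docx']

# Join folder names first, then append each filename.
folder = os.path.join('Users', 'asweigart')
paths = [os.path.join(folder, name) for name in myFiles]

for p in paths:
    # Backslashes on Windows, forward slashes on OS X and Linux.
    print(p)
```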

>>> os.getcwd()
'C:\\Python34'
>>> os.chdir('C:\\Windows\\System32')
>>> print(os.getcwd())
C:\Windows\System32
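
os.getcwd() returns the current working directory as a string, and os.chdir() changes it. A portable sketch that avoids hard-coded Windows paths by switching into a temporary directory and back; the use of tempfile is an assumption made for this example.

```python
import os
import tempfile

original = os.getcwd()   # remember where we started

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)        # make the temporary folder the working directory
    inside = os.getcwd()
    os.chdir(original)   # go back before the temporary folder is deleted

restored = os.getcwd()
print(inside)
print(restored)
```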

Notes on "Automate the Boring Stuff with Python"
