Python Regular Expressions: Study Notes

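Preface: I am a beginner, and the only way I can make something stick is to organize notes on it. Below are simple regular expression examples I collected from Automate the Boring Stuff with Python, along with a mind map of the key points I put together. For reference only.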

Article structure

Rough structure of the article

The all-purpose approach

Methods used

Walkthrough of the examples

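Finding Patterns of Text with Regular Expressions

Regular expressions are called regexes for short.

For example, the regex \d{3}-\d{3}-\d{4} matches a phone number written as three digits, a hyphen, three digits, a hyphen, and four digits.

Creating Regex Objects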

>>> import re

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

Matching Regex Objects

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> print('Phone number found: ' + mo.group())
Phone number found: 415-555-4242

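Passing Raw Strings to re.compile()

Type r'\d\d\d-\d\d\d-\d\d\d\d'. The r before the opening quote marks the string as a raw string, in which backslashes are not treated as escape characters, so each \d reaches re.compile() unchanged.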
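To see what the raw string saves you, here is a small sketch that is not from the book. It compares the raw form with the same pattern written as an ordinary string; the variable names are made up for illustration.

```python
import re

# A raw string keeps backslashes as-is, so the regex engine sees \d directly.
raw_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

# Without the r prefix, every backslash has to be doubled.
escaped_pattern = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'

# Both spellings produce exactly the same pattern text.
print(raw_pattern == escaped_pattern)

match = re.search(raw_pattern, 'My number is 415-555-4242.')
print(match.group())
```

Both forms work; the raw string is simply easier to read and harder to get wrong.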

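Summary

1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function (remember to use a raw string).
3. Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if the pattern is not found.
4. Call the Match object's group() method to get the string of the actual matched text.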
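The four steps above can be put together in one short script. This is a minimal sketch: the function name find_phone_number is hypothetical, and the None check is there because search() returns None when nothing matches.

```python
import re  # step 1: import the regex module


def find_phone_number(text):
    """Return the first phone number shaped like 415-555-4242, or None."""
    phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')  # step 2: raw-string pattern
    mo = phone_regex.search(text)                   # step 3: Match object or None
    if mo is None:
        return None
    return mo.group()                               # step 4: the matched text


print(find_phone_number('My number is 415-555-4242.'))
print(find_phone_number('No number here.'))
```

Calling group() directly on a failed search would raise AttributeError, which is why the check comes first.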

More Pattern Matching with Regular Expressions

Grouping with Parentheses

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> mo.group(1)
'415'

>>> mo.group(2)
'555-4242'
>>> mo.group(0)
'415-555-4242'
>>> mo.group()
'415-555-4242'

>>> mo.groups()
('415', '555-4242')
>>> areaCode, mainNumber = mo.groups()
>>> print(areaCode)
415
>>> print(mainNumber)
555-4242

>>> phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
>>> mo.group(1)
'(415)'
>>> mo.group(2)
'555-4242'

In the raw string passed to re.compile(), the escaped characters \( and \) match actual parenthesis characters.

Matching Multiple Groups with the Pipe

>>> heroRegex = re.compile(r'Batman|Tina Fey')
>>> mo1 = heroRegex.search('Batman and Tina Fey.')
>>> mo1.group()
'Batman'
>>> mo2 = heroRegex.search('Tina Fey and Batman.')
>>> mo2.group()
'Tina Fey'

>>> batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = batRegex.search('Batmobile lost a wheel')
>>> mo.group()
'Batmobile'
>>> mo.group(1)
'mobile'

Optional Matching with the Question Mark

>>> batRegex = re.compile(r'Bat(wo)?man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

Matching Zero or More with the Star

The * (called the star or asterisk) means "match zero or more" of the group that precedes it.

>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'
>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'

If you need to match an actual star character, put a backslash before the star in the regular expression: \*.

Matching One or More with the Plus

>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'
>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'
>>> mo3 = batRegex.search('The Adventures of Batman')
>>> mo3 == None
True

If you need to match an actual plus sign character, escape it with a backslash before the plus sign: \+.
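A short sketch of both escapes, using made-up sample strings, in case the difference between a quantifier and a literal character is hard to picture:

```python
import re

# \* and \+ match the literal characters instead of acting as quantifiers.
star_regex = re.compile(r'\d\*\d')
plus_regex = re.compile(r'\d\+\d')

print(star_regex.search('2*3 equals 6').group())
print(plus_regex.search('2+3 equals 5').group())

# Unescaped, + is a quantifier: r'\d+\d' means "one or more digits, then a digit",
# so it needs two digits in a row and finds nothing in this string.
print(re.search(r'\d+\d', '2+3 equals 5') is None)
```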

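Matching Specific Repetitions with Curly Brackets

The regex (Ha){3} matches the same text as:

(Ha)(Ha)(Ha)

And the regex (Ha){3,5} accepts the same strings as:

((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))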
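The equivalence of the short and long forms can be checked with a small sketch. It uses fullmatch(), which only succeeds when the entire string fits the pattern; the variable names are made up.

```python
import re

short_regex = re.compile(r'(Ha){3,5}')
long_regex = re.compile(r'((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))')

# Try 'Ha' repeated 1 to 7 times; both patterns should accept only 3, 4 and 5.
for count in range(1, 8):
    text = 'Ha' * count
    short_ok = short_regex.fullmatch(text) is not None
    long_ok = long_regex.fullmatch(text) is not None
    print(count, short_ok, long_ok)

# With search() the two differ in how much they grab: the curly-bracket form
# is greedy, while the pipe form stops at the first alternative that matches.
print(short_regex.search('HaHaHaHaHa').group())
print(long_regex.search('HaHaHaHaHa').group())
```

So the two forms agree on which strings are acceptable, but the curly-bracket version is both shorter and more predictable when searching inside longer text.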

>>> haRegex = re.compile(r'(Ha){3}')
>>> mo1 = haRegex.search('HaHaHa')
>>> mo1.group()
'HaHaHa'
>>> mo2 = haRegex.search('Ha')
>>> mo2 == None
True

Greedy and Nongreedy Matching

>>> greedyHaRegex = re.compile(r'(Ha){3,5}')
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'
>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'

The findall() Method

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
>>> mo.group()
'415-555-9999'

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]
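When the pattern has groups, findall() hands back tuples rather than whole matches. If you still want the full numbers, one option (a sketch, not from the book) is to join each tuple back together:

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
parts = phone_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')

# Each tuple holds one string per group; join them to rebuild the full number.
numbers = ['-'.join(groups) for groups in parts]
print(numbers)
```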

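Character Classes

Shorthand character class	Represents
\d	Any numeric digit from 0 to 9
\D	Any character that is not a numeric digit from 0 to 9
\w	Any letter, numeric digit, or the underscore character (think of this as matching "word" characters)
\W	Any character that is not a letter, numeric digit, or the underscore character
\s	Any space, tab, or newline character (think of this as matching "whitespace" characters)
\S	Any character that is not a space, tab, or newline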
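To get a feel for the shorthand classes in the table, here is a small sketch that runs a few of them over one made-up sample string:

```python
import re

sample = 'Room 42,\tfloor_3!'

digits = re.findall(r'\d', sample)       # numeric digits
spaces = re.findall(r'\s', sample)       # whitespace: the space and the tab
non_words = re.findall(r'\W', sample)    # anything that is not a letter, digit or _
words = re.findall(r'\w+', sample)       # runs of "word" characters

print(digits)
print(spaces)
print(non_words)
print(words)
```

Note that the underscore counts as a word character, which is why floor_3 comes back as a single item.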

>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']

Making Your Own Character Classes

>>> vowelRegex = re.compile(r'[aeiouAEIOU]')
>>> vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')
['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')
['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

The Caret and Dollar Sign Characters

>>> beginsWithHello = re.compile(r'^Hello')
>>> beginsWithHello.search('Hello world!')
<_sre.SRE_Match object; span=(0, 5), match='Hello'>
>>> beginsWithHello.search('He said hello.') == None
True

>>> endsWithNumber = re.compile(r'\d$')
>>> endsWithNumber.search('Your number is 42')
<_sre.SRE_Match object; span=(16, 17), match='2'>
>>> endsWithNumber.search('Your number is forty two.') == None
True

>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<_sre.SRE_Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz67890') == None
True
>>> wholeStringIsNum.search('12 34567890') == None
True
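The ^\d+$ pattern is handy as a simple validation check. Below is a minimal sketch that wraps it in a helper; the name is_all_digits is hypothetical and not from the book.

```python
import re

WHOLE_STRING_IS_NUM = re.compile(r'^\d+$')


def is_all_digits(text):
    """Return True if text is made up of digits from start to end.

    Note: $ also matches just before a single trailing newline,
    so '42\\n' is accepted as well.
    """
    return WHOLE_STRING_IS_NUM.search(text) is not None


for candidate in ['1234567890', '12345xyz67890', '12 34567890', '']:
    print(repr(candidate), is_all_digits(candidate))
```

The empty string is rejected because \d+ needs at least one digit.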

The Wildcard Character

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']
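Two details are easy to miss here: the dot matches exactly one character, and a literal period has to be escaped as \. to be matched. A quick sketch on the same sentence:

```python
import re

text = 'The cat in the hat sat on the flat mat.'

# . matches exactly one character, so 'flat' only contributes 'lat'.
one_char = re.findall(r'.at', text)
print(one_char)

# \. matches a literal period, so only the final 'mat.' qualifies.
with_period = re.findall(r'.at\.', text)
print(with_period)
```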

Matching Everything with Dot-Star

>>> nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
>>> mo = nameRegex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1)
'Al'
>>> mo.group(2)
'Sweigart'

>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'
>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'

Matching Newlines with the Dot Character

>>> noNewlineRegex = re.compile('.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.'
>>> newlineRegex = re.compile('.*', re.DOTALL)
>>> newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.\nProtect the innocent.\nUphold the law.'

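Summary

? matches zero or one of the preceding group.
* matches zero or more of the preceding group.
+ matches one or more of the preceding group.
{n} matches exactly n of the preceding group.
{n,} matches n or more of the preceding group.
{,m} matches 0 to m of the preceding group.
{n,m} matches at least n and at most m of the preceding group.
{n,m}? or *? or +? performs a nongreedy match of the preceding group.
^spam means the string must begin with spam.
spam$ means the string must end with spam.
. matches any character except newline characters.
\d, \w, and \s match a digit, word, or space character, respectively.
\D, \W, and \S match anything except a digit, word, or space character, respectively.
[abc] matches any one character between the brackets (such as a, b, or c).
[^abc] matches any one character that is not between the brackets.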
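Most of the symbols in this summary can be tried out in a few lines. The sketch below pairs each pattern with a made-up sample string and the findall() result I expect. For brevity the quantifiers are applied to a single character or \d rather than to a parenthesized group; they behave the same way on whatever precedes them.

```python
import re

# (pattern, text, expected findall() result); all sample data is made up.
examples = [
    (r'colou?r', 'color colour', ['color', 'colour']),   # ?  zero or one
    (r'ab*', 'a ab abbb', ['a', 'ab', 'abbb']),          # *  zero or more
    (r'ab+', 'a ab abbb', ['ab', 'abbb']),               # +  one or more
    (r'\d{2}', '7 42 123', ['42', '12']),                # {n}
    (r'\d{2,}', '7 42 123', ['42', '123']),              # {n,}
    (r'\d{1,2}', '7 42 123', ['7', '42', '12', '3']),    # {n,m} greedy
    (r'\d{1,2}?', '42', ['4', '2']),                     # {n,m}? nongreedy
    (r'^spam', 'spam and eggs', ['spam']),               # ^  start of string
    (r'spam$', 'eggs and spam', ['spam']),               # $  end of string
    (r'[abc]', 'cab fare', ['c', 'a', 'b', 'a']),        # character class
    (r'[^abc ]', 'cab fare', ['f', 'r', 'e']),           # negated class
]

for pattern, text, expected in examples:
    found = re.findall(pattern, text)
    print(pattern, found, found == expected)
```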

Case-Insensitive Matching

>>> regex1 = re.compile('RoboCop')
>>> regex2 = re.compile('ROBOCOP')
>>> regex3 = re.compile('robOcop')
>>> regex4 = re.compile('RobocOp')

>>> robocop = re.compile(r'robocop', re.I)
>>> robocop.search('RoboCop is part man, part machine, all cop.').group()
'RoboCop'
>>> robocop.search('ROBOCOP protects the innocent.').group()
'ROBOCOP'
>>> robocop.search('Al, why does your programming book talk about robocop so much?').group()
'robocop'

Substituting Strings with the sub() Method

>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'

>>> agentNamesRegex = re.compile(r'Agent (\w)\w*')
>>> agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'
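In the replacement string, \1 stands for the text captured by group 1, \2 for group 2, and so on. As a further illustration (a sketch with a made-up pattern and input), two groups can be swapped:

```python
import re

# Two groups capture the parts to keep; \2 and \1 put them back in swapped order.
name_regex = re.compile(r'(\w+), (\w+)')
swapped = name_regex.sub(r'\2 \1', 'Sweigart, Al')
print(swapped)
```

The replacement is written as a raw string so that \1 and \2 are passed through to sub() rather than being read as escape sequences.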

Managing Complex Regexes

phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

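phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}(\s|-|\.)\d{4}            # first 3 digits, separator, last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)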
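With re.VERBOSE, whitespace and # comments inside the pattern are ignored, so the same regex can be spread over several lines. The sketch below writes the phone pattern both ways and checks that they find the same text; the sample strings and variable names are made up.

```python
import re

one_line = re.compile(
    r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

verbose = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code, with or without parentheses
    (\s|-|\.)?                      # separator
    \d{3}                           # first 3 digits
    (\s|-|\.)                       # separator
    \d{4}                           # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # optional extension
    )''', re.VERBOSE)

samples = ['415-555-4242', '(415) 555-4242', '555.4242 x99']
for text in samples:
    a = one_line.search(text).group()
    b = verbose.search(text).group()
    print(a, a == b)
```

The multi-line form takes more space, but each piece carries its own comment, which makes later changes much less error-prone.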

Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)

>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
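The 'foo' pattern above does not actually exercise any of the three flags, so here is a small sketch with a made-up pattern where each one matters: re.VERBOSE allows the layout and comments, re.IGNORECASE lets foo match FOO, and re.DOTALL lets the dot match the newline.

```python
import re

pattern = r'''
    foo   # the literal text foo, in any letter case
    .     # any one character, newline included when re.DOTALL is set
    bar   # the literal text bar, in any letter case
'''

combined = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.VERBOSE)
mo = combined.search('FOO\nBar')
print(repr(mo.group()))

# Drop re.DOTALL and the dot no longer matches the newline, so nothing is found.
without_dotall = re.compile(pattern, re.IGNORECASE | re.VERBOSE)
print(without_dotall.search('FOO\nBar') is None)
```

The | here is Python's bitwise OR operator, which merges the individual flag values into one argument.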