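Contents

I. Introduction
II. Regular Expressions and Python

1. Matching strings with match()
2. Finding a pattern within a string with search()
3. Matching more than one string (alternation operator: |)
4. Matching any single character (the dot .)
5. Creating character sets ([])
6. Repetition

a. Asterisk *
b. Plus sign +
c. Question mark ?
d. Braces {N}
e. Braces {M,N}

7. Matching the start or end of a string, or a word boundary
8. Grouping and matching subgroups
9. Finding every occurrence with findall() and finditer()
10. Searching and replacing with sub() and subn()
11. Splitting a string on a delimiting pattern with split()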

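I. Introduction

Manipulating text and data is a big deal.
Regular expressions provide the foundation for advanced text pattern matching, extraction, and search-and-replace functionality.
Put simply, a regular expression is a string made of ordinary characters and special symbols that describes repetition of a pattern or stands for multiple characters, so one regular expression can match a whole set of strings that share similar features.
Python supports regular expressions through the re module in the standard library.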

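II. Regular Expressions and Python

1. Matching strings with match()

match() tries to match the pattern starting at the beginning of the string. If the match succeeds it returns a match object; if it fails it returns None. The match object's group() method returns the text that was matched.

# A successful match
>>> import re
>>> m = re.match('foo','foo')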
>>> if m is not None:
...     m.group()
... 
'foo'

# A failed match
>>> m = re.match('foo','bar')
>>> if m is not None:
...     m.group()
... 
>>>

# The match still succeeds when the string is longer than the pattern
>>> m = re.match('foo','food on the table')
>>> if m is not None:
...     m.group()
... 
'foo'
>>>

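# Because match() returns an object, the call can be chained without saving the intermediate result
# (this raises AttributeError when there is no match, since None has no group() method)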
>>> re.match('foo','food on the table').group()
'foo'
>>>
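
Since match() returns None on failure, chaining .group() directly is only safe when a match is guaranteed. Below is a minimal sketch of a guard around that call. The name first_match is a hypothetical helper introduced here for illustration; it is not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if nothing matches.

    `first_match` is a hypothetical helper name, not a function of the re module.
    """
    m = re.match(pattern, text)
    # m is either a match object or None, so test it before calling group()
    return m.group() if m is not None else None

print(first_match('foo', 'food on the table'))  # foo
print(first_match('foo', 'bar'))                # None
```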

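2. Finding a pattern within a string with search()

search() scans its string argument for the first position, anywhere in the string, at which the given pattern matches. If a match is found it returns a match object; otherwise it returns None.

# match() fails: 'foo' is not at the start of 'seafood', so it returns None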
>>> re.match('foo','seafood').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

# search() succeeds
>>> re.search('foo','seafood').group()
'foo'
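
The difference between the two functions can be put side by side. The sketch below also calls re.fullmatch() (available since Python 3.4), which the text above does not cover; it is included only as a contrast, because it requires the whole string to match. The variable names are illustrative.

```python
import re

text = 'seafood'

at_start = re.match('foo', text)           # only tries position 0
anywhere = re.search('foo', text)          # scans forward for the first match
whole = re.fullmatch('foo', text)          # the entire string must match
whole_ok = re.fullmatch('seafood', text)

print(at_start)          # None
print(anywhere.span())   # (3, 6)
print(whole)             # None
print(whole_ok.group())  # seafood
```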

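3. Matching more than one string (alternation operator: |)

The alternation operator | matches any one of the alternatives it separates.

>>> bt = 'bat|bet|bit'
>>> m = re.match(bt, 'bat')  # 'bat' is a match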
>>> m
<re.Match object; span=(0, 3), match='bat'>
>>> m.group()
'bat'
>>> 
>>> m = re.match(bt, 'blt')
>>> m  # no match
>>> 
>>> m = re.match(bt, 'He bit me!')
>>> m  # no match: match() only tries the start of the string
>>> 
>>> m = re.search(bt, 'He bit me!')
>>> m
<re.Match object; span=(3, 6), match='bit'>
>>> m.group()  # search() finds 'bit'
'bit'
>>>

4. Matching any single character (the dot .)

The dot (.) matches any single character except a newline.

>>> anyend = '.end'
>>> m = re.match(anyend, 'bend')  # the dot matches 'b'
>>> m
<re.Match object; span=(0, 4), match='bend'>
>>> m.group()
'bend'
>>> 
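>>> m = re.match(anyend, 'end')  # no match: there is no character for the dot to consume
>>> m
>>> 
>>> m = re.match(anyend, '\nend')  # no match: the dot does not match the newline \n
>>> m
>>> 
>>> m = re.search(anyend, 'The end.')  # search() finds a match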
>>> m
<re.Match object; span=(3, 7), match=' end'>
>>> m.group()  # note the space before the 'e'
' end'
>>> 
>>> patt314 = '3.14'  # regex dot: matches any character
>>> pi_patt = r'3\.14'  # escaped dot: matches a literal '.'
>>> 
>>> m = re.match(pi_patt, '3.14')  # exact match
>>> m
<re.Match object; span=(0, 4), match='3.14'>
>>> m.group()
'3.14'
>>> 
>>> m = re.match(patt314, '3014')  # the dot matches '0'
>>> m
<re.Match object; span=(0, 4), match='3014'>
>>> m.group()
'3014'
>>> 
>>> m = re.match(patt314, '3.14')  # the dot matches '.'
>>> m
<re.Match object; span=(0, 4), match='3.14'>
>>> m.group()
'3.14'
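
Two related tools are not shown in the examples above and are sketched here under that caveat: re.escape() builds a pattern in which special characters such as the dot are escaped, and the re.DOTALL flag makes the dot match a newline as well. The variable names are illustrative.

```python
import re

# re.escape() backslash-escapes regex metacharacters in a literal string
literal = re.escape('3.14')
print(literal)                      # 3\.14

exact = re.match(literal, '3.14')   # the escaped dot matches only '.'
wrong = re.match(literal, '3014')   # so '0' is rejected

# With re.DOTALL the dot also matches '\n'
with_newline = re.match('.end', '\nend', re.DOTALL)

print(exact.group())                # 3.14
print(wrong)                        # None
print(repr(with_newline.group()))   # '\nend'
```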

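5. Creating character sets ([])

[…] : matches any single character from the set.
[…x-y…] : matches any single character in the range x to y.
[^…] : matches any single character that is not in the set, including characters in any range the set lists.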

>>> 
>>> m = re.match('[cr][23][dp][o2]','c3po')
>>> m
<re.Match object; span=(0, 4), match='c3po'>
>>> m.group()
'c3po'
>>> 
>>> 
>>> m = re.match('[cr][23][dp][o2]','c2do')
>>> m
<re.Match object; span=(0, 4), match='c2do'>
>>> m.group()
'c2do'
>>>
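
The session above only uses plain sets. The following sketch, with illustrative strings, shows the range and negated forms from the list as well:

```python
import re

# Range: one letter a-f followed by one digit 0-9
ranged = re.match('[a-f][0-9]', 'c3po')

# Negated set: one or more characters that are not digits
non_digits = re.match('[^0-9]+', 'c3po')
no_match = re.match('[^0-9]+', '42')

# Negated set with findall(): everything except vowels and spaces
consonants = re.findall('[^aeiou ]', 'bite the dog')

print(ranged.group())      # c3
print(non_digits.group())  # c
print(no_match)            # None
print(consonants)          # ['b', 't', 't', 'h', 'd', 'g']
```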

6. Repetition

a. Asterisk *

Matches zero or more occurrences of the preceding regular expression.

>>> patt = r'\w*'
>>> re.match(patt, ' ')  # matches 0 times
<re.Match object; span=(0, 0), match=''>
>>> re.match(patt, 'a')  # matches 1 time
<re.Match object; span=(0, 1), match='a'>
>>> re.match(patt, 'aa')  # matches 2 times
<re.Match object; span=(0, 2), match='aa'>
>>> re.match(patt, 'aaaaa')  # matches 5 times
<re.Match object; span=(0, 5), match='aaaaa'>
>>> re.match(patt, 'aaaaaaaaaa')  # matches 10 times
<re.Match object; span=(0, 10), match='aaaaaaaaaa'>

b. Plus sign +

Matches one or more occurrences of the preceding regular expression.

>>> patt = r'\w+'
>>> re.match(patt, '')  # cannot match 0 times
>>> re.match(patt, 'a')  # matches 1 time
<re.Match object; span=(0, 1), match='a'>
>>> re.match(patt, 'aa')  # matches 2 times
<re.Match object; span=(0, 2), match='aa'>
>>> re.match(patt, 'aaaaa')  # matches 5 times
<re.Match object; span=(0, 5), match='aaaaa'>
>>> re.match(patt, 'aaaaaaaaaa')  # matches 10 times
<re.Match object; span=(0, 10), match='aaaaaaaaaa'>

c. Question mark ?

Matches zero or one occurrence of the preceding regular expression.

>>> patt = r'\w?'
>>> re.match(patt, '')
<re.Match object; span=(0, 0), match=''>
>>> re.match(patt, 'a')
<re.Match object; span=(0, 1), match='a'>
>>> re.match(patt, 'aa')
<re.Match object; span=(0, 1), match='a'>
>>> re.match(patt, 'aaaaa')
<re.Match object; span=(0, 1), match='a'>

d. Braces {N}

Matches exactly N occurrences of the preceding regular expression.

>>> patt = r'\w{3}'
>>> re.match(patt, '')
>>> re.match(patt, 'a')
>>> re.match(patt, 'aa')
>>> re.match(patt, 'aaa')
<re.Match object; span=(0, 3), match='aaa'>
>>> re.match(patt, 'aaaaa')
<re.Match object; span=(0, 3), match='aaa'>
>>> re.match(patt, 'aaaaaaaaaa')
<re.Match object; span=(0, 3), match='aaa'>

e. Braces {M,N}

Matches between M and N occurrences (inclusive) of the preceding regular expression, taking as many as it can.

>>> patt = r'\w{3,5}'
>>> re.match(patt, 'aa')
>>> re.match(patt, 'aaa')
<re.Match object; span=(0, 3), match='aaa'>
>>> re.match(patt, 'aaaa')
<re.Match object; span=(0, 4), match='aaaa'>
>>> re.match(patt, 'aaaaa')
<re.Match object; span=(0, 5), match='aaaaa'>
>>> re.match(patt, 'aaaaaa')
<re.Match object; span=(0, 5), match='aaaaa'>
>>> re.match(patt, 'aaaaaaaaaa')
<re.Match object; span=(0, 5), match='aaaaa'>
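
In practice these quantifiers tend to be combined in a single pattern. A minimal sketch follows; the pattern ZIP and the helper looks_like_zip are hypothetical names chosen for illustration, the format (five digits, optionally followed by a hyphen and four digits) is an assumed example, and re.fullmatch() is used so that the whole string has to fit.

```python
import re

# Hypothetical pattern: exactly 5 digits, then an optional "-dddd" suffix
ZIP = r'\d{5}(-\d{4})?'

def looks_like_zip(text):
    """Return True when the whole string fits the assumed ZIP format."""
    return re.fullmatch(ZIP, text) is not None

for candidate in ('94040', '94040-1234', '9404', '94040-12'):
    print(candidate, looks_like_zip(candidate))
```

Here {5} and {4} fix the digit counts, and ? makes the parenthesized suffix optional as a unit.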

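7. Matching the start or end of a string, or a word boundary

^ : matches at the start of the string;
$ : matches at the end of the string;
\b : matches at a word boundary;
\B : matches where there is no word boundary (for example, inside a word);

>>> m = re.search('^The', 'The dog')  # matches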
>>> m
<re.Match object; span=(0, 3), match='The'>
>>> 
>>> m = re.search('^The', 'dog The')  # no match: 'The' is not at the start
>>> m
>>> 
>>> m = re.search(r'\bthe', 'bite the dog')  # matches at a word boundary
>>> m
<re.Match object; span=(5, 8), match='the'>

>>> m = re.search(r'\bthe', 'bitethe dog')  # no match: there is no word boundary before 'the'
>>> m
>>> 

>>> m = re.search(r'\bthe', 'thebite  dog')
>>> m
<re.Match object; span=(0, 3), match='the'>

>>> m = re.search(r'\Bthe', 'bitethe dog')  # matches: 'the' is not preceded by a word boundary
>>> m
<re.Match object; span=(4, 7), match='the'>
>>>
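
The session above does not exercise $ or a pattern with \b on both sides. The sketch below fills that in with illustrative strings:

```python
import re

# $ anchors the pattern to the end of the string
at_end = re.search('dog$', 'bite the dog')
not_at_end = re.search('dog$', 'dog bites')

# \b on both sides matches 'the' only as a whole word
whole_words = re.findall(r'\bthe\b', 'the other then the')

# \B requires that 'the' does not start at a word boundary
inside_words = re.findall(r'\Bthe', 'the other then the')

print(at_end.span())   # (9, 12)
print(not_at_end)      # None
print(whole_words)     # ['the', 'the']
print(inside_words)    # ['the']
```

In the last two calls, the 'the' inside 'other' is the only one found by \Bthe, while 'then' is rejected by both patterns: it starts at a boundary but does not end at one.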

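8. Grouping and matching subgroups

In a regular expression, a pair of parentheses can do either or both of the following:

group part of the regular expression;
capture the text matched by that subgroup;

>>> patt = r'(\w+)-(\d+)'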
>>> 
>>> re.match(patt,'a-1')
<re.Match object; span=(0, 3), match='a-1'>
>>> re.match(patt,'abc-123')
<re.Match object; span=(0, 7), match='abc-123'>
>>> m = re.match(patt,'abc-123')
>>> m.groups()
('abc', '123')
>>> t = m.groups()
>>> t[0]
'abc'
>>> t[1]
'123'
>>>
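
Besides groups(), a match object can return individual subgroups by number. A short sketch reusing the pattern from the session above (the variable names are illustrative):

```python
import re

m = re.match(r'(\w+)-(\d+)', 'abc-123')

print(m.group(0))     # abc-123  (group 0 is the entire match)
print(m.group(1))     # abc
print(m.group(2))     # 123
print(m.group(1, 2))  # ('abc', '123')
print(m.span(2))      # (4, 7)   (where subgroup 2 sits in the string)

# Subgroups are numbered by the position of their opening parenthesis
nested = re.match('(a(b))', 'ab')
print(nested.groups())  # ('ab', 'b')
```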

9. Finding every occurrence with findall() and finditer()

findall: returns a list of every non-overlapping match.

>>> re.findall('car','car')
['car']
>>> re.findall('car','scary')
['car']
>>> re.findall('car','carry the barcardi to the car')
['car', 'car', 'car']

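finditer: similar to findall, but more memory-efficient, because it returns an iterator of match objects instead of building a list.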

>>> s = 'This and That.'
>>> re.findall(r'(th\w+) and (th\w+)',s, re.I)
[('This', 'That')]
>>> re.finditer(r'(th\w+) and (th\w+)',s, re.I)
<callable_iterator object at 0x1082fe160>

>>> it = re.finditer(r'(th\w+) and (th\w+)',s, re.I)
>>> next(it).groups()  # next() advances the iterator and returns the next match object
('This', 'That')

>>> it = re.finditer(r'(th\w+) and (th\w+)',s, re.I)
>>> next(it).group(1)
'This'
>>> it = re.finditer(r'(th\w+) and (th\w+)',s, re.I)
>>> next(it).group(2)
'That'
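
An iterator returned by finditer() is more commonly consumed with a for loop than by calling next() by hand. A minimal sketch that reuses the findall() example string and records where each occurrence sits:

```python
import re

text = 'carry the barcardi to the car'

# Each item yielded by finditer() is a match object, so positions are available
positions = []
for m in re.finditer('car', text):
    positions.append(m.span())
    print(m.span(), m.group())

print(positions)  # [(0, 3), (13, 16), (26, 29)]
```

findall() would return only the three strings; the match objects from finditer() also carry the spans.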

10. Searching and replacing with sub() and subn()

>>> re.sub('[ae]', 'X', 'abcdef')
'XbcdXf'
>>> re.subn('[ae]', 'X', 'abcdef')
('XbcdXf', 2)

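Both functions perform the replacement; the difference is that subn() returns a tuple that also contains the total number of substitutions made.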
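
sub() accepts a few more forms that the session above does not show. The sketch below, with illustrative variable names, covers the count argument, backreferences in the replacement string, and a function used as the replacement:

```python
import re

# count limits how many substitutions are made
only_first = re.sub('[ae]', 'X', 'abcdef', count=1)

# \1 and \2 in the replacement refer to the captured subgroups
swapped = re.sub(r'(\w+)-(\d+)', r'\2-\1', 'abc-123')

# A function receives each match object and returns the replacement text
uppercased = re.sub('[ae]', lambda m: m.group().upper(), 'abcdef')

print(only_first)  # Xbcdef
print(swapped)     # 123-abc
print(uppercased)  # AbcdEf
```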

11. Splitting a string on a delimiting pattern with split()

# A simple example:
>>> re.split(':','str1:str2:str3')
['str1', 'str2', 'str3']
>>> 
>>> 

# A more complex example (it uses extension notation, which has not been covered yet):
>>> DATA = (
...     'Mountain View, CA 94040',
...     'Sunnyvale, CA',
...     'Los Altos, 94023',
...     'Cupertino 95014',
...     'Palo Alto CA',
... )
>>> for datum in DATA:
...     print(re.split(r', |(?= (?:\d{5}|[A-Z]{2}))',datum))
... 
['Mountain View', 'CA', ' 94040']
['Sunnyvale', 'CA']
['Los Altos', '94023']
['Cupertino', ' 95014']
['Palo Alto', ' CA']
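
Between the simple and the complex example there are a few intermediate uses of split() worth sketching. The strings and variable names below are illustrative; the behaviors shown (maxsplit, a delimiter pattern with optional whitespace, and a capturing group that keeps the delimiter) are standard features of re.split().

```python
import re

# maxsplit limits the number of splits; the remainder stays in one piece
limited = re.split(':', 'str1:str2:str3', maxsplit=1)

# The delimiter can be a pattern: a comma with optional surrounding whitespace
fields = re.split(r'\s*,\s*', 'a , b,c')

# If the pattern contains a capturing group, the delimiters are kept in the result
with_delims = re.split('(:)', 'a:b')

print(limited)      # ['str1', 'str2:str3']
print(fields)       # ['a', 'b', 'c']
print(with_delims)  # ['a', ':', 'b']
```

In the complex example above, (?=...) is a lookahead: it matches a position without consuming text, which is why the space stays attached to ' 94040' and ' CA'. Splitting on such empty matches requires Python 3.7 or later.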

Source: Core Python Applications Programming (3rd edition), Chapter 1, Regular Expressions