Essential Modules for Web Scraping: re

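Regular expressions come up again and again when writing web scrapers. An earlier post on mastering regular expressions summarized the syntax itself, so now it is time to summarize re, the module that provides regular expression support in Python. Almost every language supports regular expressions, and the matching syntax is largely the same across them.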

Commonly used re functions

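1 re.compile() compiles a regular expression pattern into a pattern object. When a pattern is passed as a plain string, Python has to turn that string into a pattern object before it can match. Compiling once with re.compile() means the conversion is not repeated each time the pattern is used, which can improve efficiency. The following compares the running time with and without compilation: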

import re
import datetime

word = 'Sometimes in order to see the light, you have to risk the dark.'

# Compiled pattern: the pattern object is created once and reused
r = re.compile('o')
before = datetime.datetime.now()
for i in range(100000):
    r.findall(word)
t1 = datetime.datetime.now() - before

# Pattern string: passed to re.findall() on every call
r = 'o'
before = datetime.datetime.now()
for i in range(100000):
    re.findall(r, word)
t2 = datetime.datetime.now() - before
print(t1, t2)
0:00:00.095018 0:00:00.210070  # matching with the compiled pattern is clearly faster
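The same comparison can be sketched with the standard-library timeit module, which is designed for this kind of micro-benchmark. This is only an illustration: the names compiled, t_compiled, and t_plain and the loop count are assumptions, not part of the original test. Note that the re module keeps a cache of recently used patterns, so the gap mostly reflects the per-call lookup overhead, and the exact numbers vary by machine.

```python
import re
import timeit

word = 'Sometimes in order to see the light, you have to risk the dark.'
compiled = re.compile('o')

# Both forms return the same matches; only the per-call overhead differs.
same_result = compiled.findall(word) == re.findall('o', word)

# Time 10000 calls of each form (timings vary by machine).
t_compiled = timeit.timeit(lambda: compiled.findall(word), number=10000)
t_plain = timeit.timeit(lambda: re.findall('o', word), number=10000)
print(f'compiled: {t_compiled:.4f}s, module-level: {t_plain:.4f}s')
```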

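2 re.search() and re.match() both take two required arguments, pattern and string: pattern is the regular expression and string is the text to match against. The difference is that re.match() only matches at the beginning of the string; if the beginning does not match the pattern, the match fails and the function returns None. re.search() scans the whole string and stops as soon as it finds the first match.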

re.match('a','abc').group()
'a'
re.search('a', 'abc').group()
'a'
print(re.match('b','abc'))  # match() returns None here, so calling .group() would raise AttributeError
None
re.search('b','abc').group()
'b'
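Because both functions return None when nothing matches, it is usually safer to check the result before calling .group(). A minimal sketch, where first_match is a hypothetical helper introduced only for illustration:

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    # search() returns None on failure, so check before calling group().
    return m.group() if m else None

print(first_match('b', 'abc'))        # b
print(first_match('z', 'abc'))        # None
print(re.match('b', 'abc') is None)   # True: match() only tries the start
```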

3 re.findall() finds all substrings of the string that match the regular expression and returns them as a list. If nothing matches, it returns an empty list.

word = '123 Sometimes in order to see the light, you have to risk the dark.'
re.findall('i',word)
 ['i', 'i', 'i', 'i']

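4 re.finditer() is similar to re.findall(): it finds all substrings of the string that match the regular expression, but it returns an iterator of match objects rather than a list.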

word = '123 Sometimes in order to see the light, you have to risk the dark.'
matches = re.finditer('i', word)  # named matches to avoid shadowing the built-in iter()
for m in matches:
    print(m.group(), m.span())
i (9, 10)
i (14, 15)
i (35, 36)
i (54, 55)
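Since each item yielded by re.finditer() is a match object, the positions can be collected directly. A short sketch, with positions and texts as illustrative names:

```python
import re

word = '123 Sometimes in order to see the light, you have to risk the dark.'

# start() gives the index where each match begins.
positions = [m.start() for m in re.finditer('i', word)]
print(positions)  # [9, 14, 35, 54]

# The matched texts are the same ones findall() would return.
texts = [m.group() for m in re.finditer('i', word)]
print(texts == re.findall('i', word))  # True
```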

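5 re.sub() does find-and-replace, much like the find-and-replace feature in Word. The first argument is the regular expression, the second is the replacement string, and the third is the string to search. re.subn() differs from re.sub() in that it also returns the number of replacements, as a tuple of the new string and the count.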

re.sub('a','f','abc')
 'fbc'
re.subn('a','f','abc')
('fbc', 1)
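The pattern can be any regular expression, not just a literal character, and the optional count argument limits how many replacements are made. A small sketch with a made-up input string:

```python
import re

text = 'a1b22c333'  # illustrative input, not from the examples above

# Replace every run of digits with '#'.
all_replaced = re.sub(r'\d+', '#', text)
print(all_replaced)  # a#b#c#

# count=1 stops after the first replacement.
first_only = re.sub(r'\d+', '#', text, count=1)
print(first_only)  # a#b22c333

# subn() returns the new string together with the number of replacements.
result, n = re.subn(r'\d+', '#', text)
print(result, n)  # a#b#c# 3
```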

6 re.split() splits the string at every substring that matches the pattern and returns the pieces as a list. It is also used very often.

word = 'Sometimes in order to see the light, you have to risk the dark.'
re.split(' ', word)
['Sometimes', 'in', 'order', 'to', 'see', 'the', 'light,', 'you', 'have', 'to', 'risk', 'the', 'dark.']
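Splitting on a single space leaves punctuation attached to words such as 'light,'. Because the separator is a regular expression, a character class can split on commas and whitespace together. A minimal sketch using a shortened phrase as an assumed input:

```python
import re

phrase = 'see the light, you have'  # shortened sample sentence

# Split on any run of commas and whitespace.
parts = re.split(r'[,\s]+', phrase)
print(parts)  # ['see', 'the', 'light', 'you', 'have']

# maxsplit limits the number of splits; the rest stays in one piece.
head_tail = re.split(' ', 'a b c', maxsplit=1)
print(head_tail)  # ['a', 'b c']
```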

Common modifiers: the optional flags argument

flags	Effect
re.I (re.IGNORECASE)	Case-insensitive matching
re.S (re.DOTALL)	Makes . match any character, including newline
re.M (re.MULTILINE)	Multi-line matching; affects ^ and $

Usage examples:

re.findall('a+', 'aAb')
['a']
re.findall('a+', 'aAb', flags=re.I)  # case-insensitive
['aA']

re.findall('a.*', 'aa\na')
['aa', 'a']
re.findall('a.*', 'aa\na', flags=re.S)  # . now also matches the newline \n
['aa\na']

s = '12 34\n56 78\n90'
re.findall(r'^\d+', s, re.M)  # matches the digits at the start of each line
['12', '56', '90']
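Flags can also be combined with the | operator when more than one is needed. The sketch below uses a made-up two-line string to show how re.I and re.M interact:

```python
import re

text = 'Ab\nac'  # illustrative two-line input

# No flags: ^ only matches at the very start, and 'A' is not 'a'.
no_flags = re.findall(r'^a.', text)
print(no_flags)  # []

# re.M alone: ^ also matches after the newline.
multiline = re.findall(r'^a.', text, re.M)
print(multiline)  # ['ac']

# re.I | re.M: case-insensitive and multi-line together.
both = re.findall(r'^a.', text, re.I | re.M)
print(both)  # ['Ab', 'ac']
```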