[Python Data Scraping: Techniques and Practice] Regular Expressions

I am giving regular expressions a chapter of their own because they play a central role in web scraping, so this chapter deserves your full attention.

First, what kind of tool is it?

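Regular expressions are a powerful tool for working with strings. Python's re module provides functions for the usual regular-expression operations. The most commonly used are match, search, and findall; you can look up their usage with help('re').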

>>> help('re')
...
match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.
    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.
...

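Notice that the first argument of each function is pattern. A pattern is a string that may contain special characters, most commonly metacharacters such as "+", "*", "?", and ".". Examples make this clearer.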
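As a quick illustration of how these four metacharacters differ, here is a minimal sketch. The sample strings and the helper name matching are made up for this example, and it uses re.fullmatch (available in Python 3.4 and later), which is not one of the three functions discussed in this chapter; it simply requires the whole string to match.

```python
import re

# Hypothetical sample strings, chosen only to show the differences.
samples = ['ct', 'cat', 'caat', 'cbt']

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

plus_hits = matching(r'ca+t')      # 'a' one or more times
star_hits = matching(r'ca*t')      # 'a' zero or more times
optional_hits = matching(r'ca?t')  # 'a' zero or one time
dot_hits = matching(r'c.t')        # exactly one character between 'c' and 't'

print(plus_hits)      # ['cat', 'caat']
print(star_hits)      # ['ct', 'cat', 'caat']
print(optional_hits)  # ['ct', 'cat']
print(dot_hits)       # ['cat', 'cbt']
```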

The match function

>>> import re
>>> date_s = '2016_06_01'
>>> m = re.match(r'2016',date_s)
>>> m
<_sre.SRE_Match object at 0x0000000002A66B90> 
>>> m.start()
0
>>> m.end()
4
>>> m.span()
(0, 4)
>>> m.group() 
'2016'

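Here r'2016' is the pattern, and date_s is the string that match tries to match against.
The r prefix on r'2016' marks a Python raw string: with this prefix, backslashes in the literal are not treated as escape characters. This pattern contains no backslash, so the r could be omitted here.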
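To see why the r prefix matters once a pattern does contain a backslash, consider this small sketch. The pattern \bcat\b and the sample text are assumptions chosen for illustration: in an ordinary string '\b' is the backspace character, while in a raw string it stays as a backslash followed by b, which the re module reads as a word boundary.

```python
import re

normal = '\bcat\b'   # two backspace characters around 'cat'
raw = r'\bcat\b'     # backslash + 'b' on each side: regex word boundaries

print(len(normal), len(raw))  # 5 7

text = 'a cat sat'  # hypothetical sample text

normal_result = re.search(normal, text)  # looks for literal backspaces
raw_result = re.search(raw, text)        # looks for the word 'cat'

print(normal_result)       # None
print(raw_result.group())  # cat
```

Writing patterns as raw strings by default avoids this kind of silent mismatch.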

r'\d+_\d+_\d+'
# The pattern r'\d' matches any single decimal digit; '+' matches the preceding character one or more times

Still using date_s, this new pattern matches the entire string:

>>> import re
>>> date_s = '2016_06_01'
>>> m = re.match(r'\d+_\d+_\d+',date_s)
>>> m
<_sre.SRE_Match object at 0x0000000002BA6B90>
>>> m.start()
0
>>> m.end()
10
>>> m.group(0)
'2016_06_01'

The search function

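search is more flexible than match: match only matches at the beginning of the string, whereas search scans the whole string for a match. Compare the two below (m_match is None, so the interpreter prints nothing for it):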

>>> import re
>>> date_s_2 = 'today is 2016_06_01'
>>> m_match = re.match(r'(\d+)_(\d+)_(\d+)',date_s_2)
>>> m_match
>>> m_search = re.search(r'(\d+)_(\d+)_(\d+)',date_s_2)
>>> m_search
<_sre.SRE_Match object at 0x0000000002CBF750>
>>> m_search.groups()
('2016', '06', '01')
>>> m_search.group(0)
'2016_06_01'
>>> m_search.group(1)
'2016'
>>> m_search.group(2)
'06'
>>> m_search.group(3)
'01'

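The group method: with argument 0 it returns the entire matched string; with argument 1 it returns the contents of the first group, and so on.
The groups method: returns a tuple containing the contents of all the groups.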
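One common use of groups is to turn the captured pieces into typed values. The following is only a sketch: the function name parse_date is hypothetical, and it assumes the year_month_day layout used in the examples above. It also checks for None first, since search returns None when nothing matches.

```python
import re
from datetime import date

def parse_date(text):
    """Return the first year_month_day date found in text, or None."""
    m = re.search(r'(\d+)_(\d+)_(\d+)', text)
    if m is None:
        # No match: calling m.groups() here would raise AttributeError.
        return None
    year, month, day = (int(part) for part in m.groups())
    return date(year, month, day)

found = parse_date('today is 2016_06_01')
missing = parse_date('no date here')

print(found)    # 2016-06-01
print(missing)  # None
```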

The findall function

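findall finds every match in the string. In the example below, match returns None because the URL does not begin with a date, while findall returns both dates: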

>>> import re
>>> url='http://www.xyz123.com/prod_lst?start_time=2016_01_01&end_time=2016_12_31'
>>> m=re.match(r'(\d+)_(\d+)_(\d+)',url)
>>> m
>>> m=re.findall(r'(\d+)_(\d+)_(\d+)',url)
>>> m
[('2016', '01', '01'), ('2016', '12', '31')]
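As the help text for findall notes, the shape of the result depends on how many groups the pattern contains. A short sketch using the same URL shows the three cases; the variable names are chosen for this example only.

```python
import re

url = 'http://www.xyz123.com/prod_lst?start_time=2016_01_01&end_time=2016_12_31'

# No groups: each item is the whole matched substring.
whole = re.findall(r'\d+_\d+_\d+', url)

# One group: each item is that group's content only.
years = re.findall(r'(\d+)_\d+_\d+', url)

# Several groups: each item is a tuple with one entry per group.
parts = re.findall(r'(\d+)_(\d+)_(\d+)', url)

print(whole)  # ['2016_01_01', '2016_12_31']
print(years)  # ['2016', '2016']
print(parts)  # [('2016', '01', '01'), ('2016', '12', '31')]
```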

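These examples are fairly simple, and a little practice should be enough to master them. The table below lists common regular-expression symbols and their usual meanings, for reference when you run into a regex problem.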

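Symbol	Meaning
.	Matches any character except "\n"
+	Matches the preceding character one or more times
*	Matches the preceding character zero or more times
?	Matches the preceding character zero or one time
( )	The expression inside the parentheses forms a group
^	Matches the start of the string
$	Matches the end of the string
[ ]	A set of characters; - denotes a range, and ^ as the first character inside [ ] negates the set
\s	Matches a whitespace character; equivalent to [ \t\n\r\f\v] (note the space before \t)
\S	Matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]
\d	Matches a digit; equivalent to [0-9]
\D	Matches any non-digit; equivalent to [^0-9]
\w	Matches a word character; equivalent to [a-zA-Z0-9_]
\W	Matches any non-word character; equivalent to [^a-zA-Z0-9_]

These equivalences describe ASCII matching. For str patterns in Python 3, \s, \d, and \w also match the corresponding Unicode characters unless the re.ASCII flag is set.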
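To try several of the symbols in the table at once, here is a minimal sketch. The sample line and variable names are invented for illustration.

```python
import re

# Hypothetical sample line mixing digits, word characters, and whitespace.
line = 'id=42\tname=foo_bar  ok'

digits = re.findall(r'\d+', line)         # runs of digits
words = re.findall(r'\w+', line)          # runs of word characters
chunks = re.findall(r'\S+', line)         # runs of non-whitespace characters
lowercase = re.findall(r'[a-z]+', line)   # a range inside [ ]
others = re.findall(r'[^a-z=\s]+', line)  # ^ first inside [ ] negates the set

starts_with_id = re.search(r'^id', line) is not None      # ^ anchors at the start
ends_with_ok = re.search(r'ok$', line) is not None        # $ anchors at the end
starts_with_name = re.search(r'^name', line) is not None  # 'name' is not at the start

print(digits)     # ['42']
print(words)      # ['id', '42', 'name', 'foo_bar', 'ok']
print(chunks)     # ['id=42', 'name=foo_bar', 'ok']
print(lowercase)  # ['id', 'name', 'foo', 'bar', 'ok']
print(others)     # ['42', '_']
print(starts_with_id, ends_with_ok, starts_with_name)  # True True False
```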
