[Python 3 Web Scraping Study Notes] Using the Scrapy Framework, Part 4

Regex Matching

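Scrapy selectors also support regular-expression matching. For example, the text of each a node in the sample looks like Name: My image 1, and suppose we only want the part after Name:. The re() method handles this: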

>>> response.xpath('//a/text()').re(r'Name:\s(.*)')
['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

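We pass a regular expression to re(); (.*) is the group we want to capture. The output is a list of the captured groups, in document order.
If the pattern contains two groups, the results are still returned in order, as shown below: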

>>> response.xpath('//a/text()').re(r'(.*?):\s(.*)')
['Name', 'My image 1 ', 'Name', 'My image 2 ', 'Name', 'My image 3 ', 'Name', 'My image 4 ', 'Name', 'My image 5 ']
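
To make the ordering of the results concrete, here is a minimal sketch that imitates the behavior using only the standard-library re module. It is a simplified model, not Scrapy's actual implementation; the helper name flatten_matches and the two sample strings are assumptions made for illustration.

```python
import re

def flatten_matches(pattern, texts):
    """Hypothetical stand-in for calling re() on a list of text nodes:
    apply the pattern to each text and collect every captured group
    into one flat list, in the order the texts appear."""
    results = []
    for text in texts:
        for found in re.findall(pattern, text):
            if isinstance(found, tuple):
                # Several groups: findall yields a tuple per match.
                results.extend(found)
            else:
                # Zero or one group: findall yields a plain string.
                results.append(found)
    return results

# Assumed sample data, modeled on the text of the a nodes.
texts = ["Name: My image 1 ", "Name: My image 2 "]

one_group = flatten_matches(r"Name:\s(.*)", texts)
two_groups = flatten_matches(r"(.*?):\s(.*)", texts)

print(one_group)   # one entry per node
print(two_groups)  # both groups of node 1, then both groups of node 2
```

With two groups, the list alternates between the first and second group of each node, so the caller has to know how many groups the pattern has in order to pair them up again.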

Similar to extract_first(), the re_first() method returns the first element of that list:

>>> response.xpath('//a/text()').re_first(r'(.*?):\s(.*)')
'Name'
>>> response.xpath('//a/text()').re_first(r'Name:\s(.*)')
'My image 1 '
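
The "first element" rule can be sketched with the standard-library re module. This is an illustrative model under stated assumptions, not Scrapy's code: the helper first_match and its sample data are hypothetical, and its default argument mirrors the default parameter that re_first() accepts for the no-match case.

```python
import re

def first_match(pattern, texts, default=None):
    """Hypothetical stand-in for re_first(): return the first captured
    group of the first match across all texts, or default if none match."""
    for text in texts:
        for found in re.findall(pattern, text):
            # With several groups findall yields a tuple; take its first item.
            return found[0] if isinstance(found, tuple) else found
    return default

# Assumed sample data, modeled on the text of the a nodes.
texts = ["Name: My image 1 ", "Name: My image 2 "]

two_group_first = first_match(r"(.*?):\s(.*)", texts)
one_group_first = first_match(r"Name:\s(.*)", texts)
missing = first_match(r"Price:\s(.*)", texts, default="N/A")

print(two_group_first, one_group_first, missing)
```

Supplying a default avoids a separate None check when a field is optional on the page.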

No matter how many groups the pattern contains, the result is the first element of the list.
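Note that the response object cannot call re() or re_first() directly. To match a regular expression against the whole document, call xpath() first and then apply the regex, as shown below: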

>>> response.re(r'Name:\s(.*)')
AttributeError                            Traceback (most recent call last)
<ipython-input-23-3635822752bd> in <module>()
----> 1 response.re(r'Name:\s(.*)')
AttributeError: 'HtmlResponse' object has no attribute 're'
>>> response.xpath('.').re(r'Name:\s(.*)<br>')
['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']
>>> response.xpath('.').re_first(r'Name:\s(.*)<br>')
'My image 1 '

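As the example shows, calling re() directly on the response raises an AttributeError because it has no re attribute. Calling xpath('.') first selects the whole document, after which re() and re_first() work as usual.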
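
As an alternative for whole-document matching, the standard-library re module can be run on the page source itself, which a Scrapy text response exposes as response.text. The sketch below is an assumption-laden illustration: the html string is a hypothetical stand-in for response.text, and the pattern accepts both <br> and <br /> because the raw source may not use the same spelling as the markup serialized by xpath('.').

```python
import re

# Hypothetical stand-in for response.text (the decoded page source).
html = (
    "<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n"
    "<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n"
)

# Lazy group so each match stops at the nearest line-break tag;
# <br\s*/?> tolerates both the <br> and <br /> spellings.
pattern = re.compile(r"Name:\s(.*?)<br\s*/?>")

names = pattern.findall(html)          # analogous to re()
first = names[0] if names else None    # analogous to re_first()

print(names)
print(first)
```

Staying with xpath('.').re() keeps everything inside the selector API, while the plain re approach is handy when the raw source is already at hand.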
