Notes on the official Scrapy documentation

1.response.xpath().get(default=None)

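The get() method takes a default parameter whose value is None unless you pass one. When nothing is extracted it returns default (so None if you did not set it); otherwise it returns the first extracted result. Source: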

    def get(self, default=None):
        """
        Return the result of ``.get()`` for the first element in this list.
        If the list is empty, return the default value.
        """
        for x in self:
            return x.get()
        else:
            return default

Example:

response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')

'not-found'

2.response.xpath().re() or response.xpath().re_first()

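Selector objects also support regular expressions: .re() returns a list of str, and .re_first() returns only the first match as a str (or None when nothing matches).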

response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']
response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'

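3.Using count() in xpath() expressions

For example, the XPath expression //ul[count(li)=9] selects every ul element that has exactly 9 li children. The number can also be passed in as an XPath variable, such as $cnt here: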

response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'

4.In Scrapy, there is no need to call lxml yourself to evaluate XPath

>>> from scrapy import Selector
>>> doc = u"""
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']

Regular expressions

The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are not sufficient.

5.Other xpath() usage

from scrapy import Selector

str1 = """
<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>
"""

s = Selector(text=str1, type='html')
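# ret1 and ret2 select the same elements for this input, but they are not
# equivalent in general: contains() is a plain substring test on the attribute
ret1 = s.xpath('//p[has-class("foo")]').getall()  # p elements whose class list includes the class foo
ret2 = s.xpath('//p[contains(@class,"foo")]').getall()  # p elements whose class attribute contains the substring foo
ret3 = s.xpath('//p[has-class("foo", "bar-baz")]').getall()  # p elements that have both foo and bar-baz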
print(ret1, ret2, ret3)

Reposted from blog.csdn.net/zhu6201976/article/details/106607826