Scrapy XPath Examples

Reference: https://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors

>>> from scrapy import Selector
>>> doc = u"""
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']

****************************************************************************************

>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

****************************************************************************************

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1', 'My image 2', 'My image 3', 'My image 4', 'My image 5']

****************************************************************************************

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'
****************************************************************************************
get() and extract_first() return the same result

>>> response.css('a::attr(href)').get()
'image1.html'
>>> response.css('a::attr(href)').extract_first()
'image1.html'

****************************************************************************************

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

****************************************************************************************

>>> response.css('a::attr(href)')[0].get()
'image1.html'
>>> response.css('a::attr(href)')[0].extract()
'image1.html'

****************************************************************************************
Matching one class among several with CSS

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']

****************************************************************************************

>>> from scrapy import Selector
>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()

>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']

>>> xp("(//li)[1]")
['<li>1</li>']

****************************************************************************************

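>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> sel.xpath("//a[1]").getall()  # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall()  # convert it to string
['Click here to go to the Next Page']

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]

>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']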

***************************************************************************************

<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>

>>> response.xpath('//p[has-class("foo")]')
[<Selector xpath='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
 <Selector xpath='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> response.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector xpath='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
>>> response.xpath('//p[has-class("foo", "bar")]')
[]
