Practice URL: https://doc.scrapy.org/en/latest/_static/selectors-sample1.html
1. Get the text value
xpath
In [18]: response.selector.xpath('//title/text()').extract_first(default='')
Out[18]: 'Example website'
css
In [19]: response.selector.css('title::text').extract_first(default='')
Out[19]: 'Example website'
Note: .selector can be omitted, so the xpath call can be shortened to response.xpath(...)
2. Get the attribute value
xpath
In [23]: response.selector.xpath('//base/@href').extract_first()
Out[23]: 'http://example.com/'
css
In [24]: response.selector.css('base::attr(href)').extract_first()
Out[24]: 'http://example.com/'
Note: .selector can be omitted, so the css call can be shortened to response.css(...)
3. Nesting xpath and css
Because both css() and xpath() return a SelectorList instance, the two can be chained (nested) freely.
P.S. When getting an attribute with xpath, @attr already selects the attribute value, so there is no need to append /text().
In [21]: response.selector.css('img').xpath('@src').extract()
Out[21]: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
4. .re()
.re()
.re_first()
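P.S. .re() returns a plain list of strings rather than a SelectorList, so no further .xpath(), .css() or .re() call can be chained after it. .re_first() returns only the first match as a single string.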
In [1]: response.selector.css('div > p:nth-of-type(2)::text').extract()
Out[1]: ['333xxx']
In [2]: response.selector.css('div > p:nth-of-type(2)::text').extract_first()
Out[2]: '333xxx'
In [3]: response.selector.css('div > p:nth-of-type(2)::text').re_first('\w+')
Out[3]: '333xxx'
In [4]: response.selector.css('div > p:nth-of-type(2)::text').re_first('[A-Za-z]+')
Out[4]: 'xxx'
In [5]: response.selector.css('div > p:nth-of-type(2)::text').re('[A-Za-z]+')
Out[5]: ['xxx']
5. Notes on XPath relative path lookup
Goal: find the p tags under the div tag.
<html lang="zh-CN">
<head>
</head>
<body>
<p>11</p>
<div>
<p>222</p>
<p>333</p>
</div>
</body>
</html>
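Wrong way (a path starting with // is absolute, so it searches the whole document and also returns the p outside the div):
In [4]: divs = response.selector.xpath('//div')
In [5]: for p in divs.xpath('//p'):
   ...:     print(p.extract())
   ...:
<p>11</p>
<p>222</p>
<p>333</p>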
Correct approach 1 (a leading dot makes the path relative to each div and matches p at any depth below it):
In [6]: divs = response.selector.css('div')
In [7]: for p in divs.xpath('.//p'):
   ...:     print(p.extract())
   ...:
<p>222</p>
<p>333</p>
Correct approach 2 (a bare tag name is also relative, but matches only direct children of each div):
In [8]: divs = response.selector.css('div')
In [9]: for p in divs.xpath('p'):
   ...:     print(p.extract())
   ...:
<p>222</p>
<p>333</p>