Scrapy Selectors

 

Practice URL: https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

1. Getting a text value

  xpath

In [18]: response.selector.xpath('//title/text()').extract_first(default='')
Out[18]: 'Example website'

  css

In [19]: response.selector.css('title::text').extract_first(default='')
Out[19]: 'Example website'

  Note: .selector can be omitted; response.xpath() is a shortcut for response.selector.xpath().

2. Getting an attribute value

  xpath

In [23]: response.selector.xpath('//base/@href').extract_first()
Out[23]: 'http://example.com/'

  css 

In [24]: response.selector.css('base::attr(href)').extract_first()
Out[24]: 'http://example.com/'

  Note: .selector can be omitted; response.css() is a shortcut for response.selector.css().

3. Nesting xpath and css

  Both css() and xpath() return a SelectorList instance, so the two can be chained (nested) freely.

  PS: when selecting an attribute with XPath, @attr already yields the attribute value, so there is no need to append /text().

In [21]: response.selector.css('img').xpath('@src').extract()
Out[21]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

 

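4. .re() and .re_first()

  .re() returns every regex match as a list of strings.

  .re_first() returns only the first match.

  PS: .re() returns a list of strings rather than a SelectorList, so no further .xpath() or .css() calls can be chained after it.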

In [1]: response.selector.css('div > p:nth-of-type(2)::text').extract()
Out[1]: ['333xxx']

In [2]: response.selector.css('div > p:nth-of-type(2)::text').extract_first()
Out[2]: '333xxx'

In [3]: response.selector.css('div > p:nth-of-type(2)::text').re_first(r'\w+')
Out[3]: '333xxx'

In [4]: response.selector.css('div > p:nth-of-type(2)::text').re_first('[A-Za-z]+')
Out[4]: 'xxx'

In [5]: response.selector.css('div > p:nth-of-type(2)::text').re('[A-Za-z]+')
Out[5]: ['xxx']

5. A note on relative path lookup in XPath

  Goal: select only the p tags under the div tag in the following HTML:

<html lang="zh-CN">
<head>
</head>
<body>
    <p>11</p>
    <div>
        <p>222</p>
        <p>333</p>
    </div>
</body>
</html>

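  Wrong way (an XPath starting with // is absolute and searches the whole document, so the p outside the div is matched as well):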

In [4]: divs = response.selector.xpath('//div')

In [5]: for p in divs.xpath('//p'):
   ...:     print(p.extract())
   ...:
<p>11</p>
<p>222</p>
<p>333</p>

  Correct approach 1 (.//p selects every p that is a descendant of the div):

In [6]: divs = response.selector.css('div')

In [7]: for p in divs.xpath('.//p'):
   ...:     print(p.extract())
   ...:
   ...:
<p>222</p>
<p>333</p>

  Correct approach 2 (p selects only the p tags that are direct children of the div):

In [8]: divs = response.selector.css('div')

In [9]: for p in divs.xpath('p'):
   ...:     print(p.extract())
   ...:
   ...:
   ...:
<p>222</p>
<p>333</p>

 
