[Web Scraping] Python Scrapy Selectors

Original article: https://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors

When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:

  • BeautifulSoup is a very popular web scraping library among Python programmers. It constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
  • lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Scrapy selectors are built on top of the lxml library, which means they’re very similar in speed and parsing accuracy.

This page explains how selectors work and describes their API, which is very small and simple, unlike the lxml API, which is much bigger because the lxml library can be used for many other tasks besides selecting markup documents.

For a complete reference of the selectors API, see the Selector reference.

Using selectors

Constructing selectors

Scrapy selectors are instances of the Selector class, constructed by passing text or a TextResponse object. The class automatically chooses the best parsing rules (XML vs. HTML) based on the input type:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()

['good']

Constructing from response:

>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').extract()

['good']

For convenience, response objects expose a selector on their .selector attribute; it’s perfectly fine to use this shortcut when possible:

>>> response.selector.xpath('//span/text()').extract()

['good']

Using selectors

To explain how to use the selectors we’ll use the Scrapy shell (which provides interactive testing) and an example page located on the Scrapy documentation server:

https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Here’s its HTML code:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

First, let’s open the shell:

scrapy shell https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Then, after the shell loads, you’ll have the response available as the response shell variable, and its attached selector in the response.selector attribute.

Since we’re dealing with HTML, the selector will automatically use an HTML parser.

So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:

In [1]: response.selector.xpath('//title/text()')

Out[1]: [<Selector xpath='//title/text()' data='Example website'>]

Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

In [2]: response.xpath('//title/text()')
Out[2]: [<Selector xpath='//title/text()' data='Example website'>]

In [3]: response.css('title::text')
Out[3]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector’s .extract() method, as follows:

In [5]: response.xpath('//title/text()').extract()

Out[5]: ['Example website']

As you can see, .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data:

In [4]: response.css('img').xpath('@src').extract()

Out[4]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

If you want to extract only the first matched element, you can call the selector’s .extract_first() method:

In [6]: response.xpath('//div[@id="images"]/a/text()').extract_first()

Out[6]: 'Name: My image 1 '

It returns None if no element was found:

In [7]: response.xpath('//div[@id="not-exists"]/text()').extract_first() is None

Out[7]: True

Now we’re going to get the base URL and some image links:

In [8]: response.xpath('//base/@href').extract()
Out[8]: ['http://example.com/']

In [9]: response.css('base::attr(href)').extract()
Out[9]: ['http://example.com/']

In [11]: response.xpath('//a[contains(@href, "image")]/@href').extract()
Out[11]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [12]: response.xpath('//a[contains(@href, "image1")]/@href').extract()
Out[12]: ['image1.html']

In [13]: response.xpath('//a[contains(@href, "image")]/img/@src').extract()
Out[13]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too. Here’s an example:

In [14]: links = response.xpath('//a[contains(@href, "image")]')
In [15]: links.extract()
Out[15]:
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

In [17]: for index, link in enumerate(links):
    ...:     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    ...:     print('Link number %d points to url %s and image %s' % args)
    ...:
Link number 0 points to url ['image1.html'] and image ['image1_thumb.jpg']
Link number 1 points to url ['image2.html'] and image ['image2_thumb.jpg']
Link number 2 points to url ['image3.html'] and image ['image3_thumb.jpg']
Link number 3 points to url ['image4.html'] and image ['image4_thumb.jpg']
Link number 4 points to url ['image5.html'] and image ['image5_thumb.jpg']

Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike the .xpath() or .css() methods, .re() returns a list of unicode strings, so you can’t construct nested .re() calls.

Here’s an example used to extract image names from the HTML code above:

In [18]: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

Out[18]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

There’s an additional helper for .re(), analogous to .extract_first(), named .re_first(). Use it to extract just the first matching string:

In [22]: response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')

Out[22]: 'My image 1 '

Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you’re calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:

>>> divs = response.xpath('//div')
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Note the dot prefixing the .//p XPath. At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())

Another common case would be to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print(p.extract())

For more details about relative XPaths see the Location Paths section in the XPath specification.

Variables in XPath expressions

XPath lets you reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat analogous to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.

Here’s an example to match an element based on its “id” attribute value, without hard-coding it (as was shown previously):

In [34]: response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()

Out[34]: 'Name: My image 1 '

Here’s another example, to find the “id” attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

In [35]: response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()

Out[35]: 'images'

All variable references must have a binding value when calling .xpath() (otherwise you’ll get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.

parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.

Using EXSLT extensions

Being built atop lxml, Scrapy selectors also support some EXSLT extensions (EXSLT is a community initiative to provide extensions to XSLT, broken down into a number of modules) and come with these pre-registered namespaces to use in XPath expressions:

prefix  namespace                              usage
re      http://exslt.org/regular-expressions   regular expressions
set     http://exslt.org/sets                  set manipulation

  • Regular expressions

The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are not sufficient.

Example selecting links in list items with a “class” attribute ending with a digit:

In [36]: from scrapy import Selector
In [38]: doc = """
    ...: <div>
    ...:     <ul>
    ...:         <li class="item-0"><a href="link1.html">first item</a></li>
    ...:         <li class="item-1"><a href="link2.html">second item</a></li>
    ...:         <li class="item-inactive"><a href="link3.html">third item</a></li>
    ...:         <li class="item-1"><a href="link4.html">fourth item</a></li>
    ...:         <li class="item-0"><a href="link5.html">fifth item</a></li>
    ...:     </ul>
    ...: </div>
    ...: """
In [39]: sel = Selector(text=doc, type="html")
In [40]: sel.xpath('//li//@href').extract()
Out[40]: ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

In [41]: sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
Out[41]: ['link1.html', 'link2.html', 'link4.html', 'link5.html']

Warning: the C library libxslt doesn’t natively support EXSLT regular expressions, so lxml’s implementation uses hooks to Python’s re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.

  • Set operations

These can be handy, for example, for excluding parts of a document tree before extracting text elements.

Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and corresponding itemprops:

In [46]: doc = """
    ...:  <div itemscope itemtype="http://schema.org/Product">
    ...:  
    ...:    <span itemprop="name">Kenmore White 17" Microwave</span>
    ...:    <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
    ...:  
    ...:    <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    ...:     Rated <span itemprop="ratingValue">3.5</span>/5
    ...:     based on <span itemprop="reviewCount">11</span> customer reviews
    ...:    </div>
    ...:
    ...:    <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    ...:      <span itemprop="price">$55.00</span>
    ...:      <link itemprop="availability" href="http://schema.org/InStock" />In stock
    ...:    </div>
    ...:
    ...:    Product description:
    ...:    <span itemprop="description">0.7 cubic feet countertop microwave.
    ...:    Has six preset cooking categories and convenience features like
    ...:    Add-A-Minute and Child Lock.</span>
    ...:
    ...:    Customer reviews:
    ...:
    ...:    <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    ...:      <span itemprop="name">Not a happy camper</span> -
    ...:      by <span itemprop="author">Ellie</span>,
    ...:      <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
    ...:      <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    ...:        <meta itemprop="worstRating" content = "1">
    ...:        <span itemprop="ratingValue">1</span>/
    ...:        <span itemprop="bestRating">5</span>stars
    ...:      </div>
    ...:      <span itemprop="description">The lamp burned out and now I have to replace
    ...:      it. </span>
    ...:    </div>
    ...:
    ...:    <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    ...:      <span itemprop="name">Value purchase</span> -
    ...:      by <span itemprop="author">Lucas</span>,
    ...:      <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
    ...:      <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
    ...:        <meta itemprop="worstRating" content = "1"/>
    ...:        <span itemprop="ratingValue">4</span>/
    ...:        <span itemprop="bestRating">5</span>stars
    ...:      </div>
    ...:      <span itemprop="description">Great microwave for the price. It is small and
    ...:      fits in my apartment.</span>
    ...:    </div>
    ...:
    ...:  </div>
    ...:  """
In [47]: sel = Selector(text=doc, type="html")
In [49]: for scope in sel.xpath('//div[@itemscope]'):
    ...:     print("current scope:", scope.xpath('@itemtype').extract())
    ...:     props = scope.xpath('set:difference(./descendant::*/@itemprop, .//*[@itemscope]/*/@itemprop)')
    ...:     print("    properties:", props.extract())
    ...:

current scope: ['http://schema.org/Product']
    properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']
current scope: ['http://schema.org/AggregateRating']
    properties: ['ratingValue', 'reviewCount']
current scope: ['http://schema.org/Offer']
    properties: ['price', 'availability']
current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']
current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']
current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']
current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

Here we first iterate over itemscope elements, and for each one we look for all itemprop elements and exclude those that are themselves inside another itemscope.

Some XPath tips

Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub’s blog. If you are not yet very familiar with XPath, you may want to first take a look at this XPath tutorial.

  • Using text nodes in a condition

When you need to use the text content as argument to an XPath string function, avoid using .//text() and use just . instead.

This is because the expression .//text() yields a collection of text elements – a node-set. And when a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first element only.

A node converted to a string, however, puts together the text of itself plus that of all its descendants. Example:

In [50]: from scrapy import Selector
In [51]: sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
In [52]: sel.xpath('//a//text()').extract() # take a peek at the node-set
Out[52]: ['Click here to go to the ', 'Next Page']

In [53]: sel.xpath("string(//a[1]//text())").extract() # convert it to string
Out[53]: ['Click here to go to the ']

In [54]: sel.xpath("//a[1]").extract() # select the first node
Out[54]: ['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

In [55]: sel.xpath("string(//a[1])").extract() # convert it to string
Out[55]: ['Click here to go to the Next Page']

So, using the .//text() node-set won’t select anything in this case, but using . to mean the node works:

In [56]: sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
Out[56]: []

In [57]: sel.xpath("//a[contains(., 'Next Page')]").extract()
Out[57]: ['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
  • Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example:

In [58]: from scrapy import Selector
In [59]: sel = Selector(text="""<ul class="list">
    ...:                             <li>1</li>
    ...:                             <li>2</li>
    ...:                             <li>3</li>
    ...:                         </ul>
    ...:                         <ul class="list">
    ...:                             <li>4</li>
    ...:                             <li>5</li>
    ...:                             <li>6</li>
    ...:                         </ul>""")

In [60]: xp = lambda x: sel.xpath(x).extract()

In [62]: xp("//li[1]") # This gets all first <li> elements under their respective parents
Out[62]: ['<li>1</li>', '<li>4</li>']

In [63]: xp("(//li)[1]") # And this gets the first <li> element in the whole document
Out[63]: ['<li>1</li>']

In [64]: xp("//ul/li[1]") # This gets all first <li> elements under an <ul> parent
Out[64]: ['<li>1</li>', '<li>4</li>']

In [65]: xp("(//ul/li)[1]") # And this gets the first <li> element under an <ul> parent in the whole document
Out[65]: ['<li>1</li>']
  • When querying by class, consider using CSS (omitted)

Built-in Selectors reference (omitted)

Reposted from blog.csdn.net/sinat_40431164/article/details/81103221