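Scrapy's Selector class is used to extract data. It is built on top of lxml, so the two perform about the same.

All of the code below runs against this HTML document: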

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

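Using selectors

Constructing selectors: a Scrapy selector is an instance of the Selector class, instantiated from either text (an HTML string) or a TextResponse object.

First, import the classes: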

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

Constructing from a response:

>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

A response object carries its own selector, so that selector can be used directly:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

Calls on the response's built-in selector can be shortened to:

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

.xpath() and .css() return a SelectorList instance, which is a list of Selector objects. To extract the matched data, call .extract():

>>> response.xpath('//title/text()').extract()
[u'Example website']

To extract only the first match:

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

When no element matches, extract_first() returns None. A different fallback value can be supplied through the default argument:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'
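
The fallback behaviour can be pictured with a few lines of plain Python. This is only a sketch of the semantics described above, not Scrapy's implementation; first_or_default and the sample lists are hypothetical names made up for illustration.

```python
def first_or_default(values, default=None):
    """Return the first item of values, or default when values is empty."""
    for value in values:
        return value
    return default


# Hypothetical extraction results: one non-empty, one empty.
matches = ["Name: My image 1 ", "Name: My image 2 "]
no_matches = []

first = first_or_default(matches)                             # first match
missing = first_or_default(no_matches)                        # None
fallback = first_or_default(no_matches, default="not-found")  # custom fallback

print(first, missing, fallback)
```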

With the css method, text and attribute values are extracted through the ::text and ::attr(name) pseudo-elements, for example:

>>> response.css('title::text').extract()
[u'Example website']

Nesting selectors

.xpath() and .css() both return a list of selectors, so .xpath() and .css() can be called again on the result:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

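Using regular expressions with selectors

Selectors also have a .re() method that matches with a regular expression. Unlike .xpath() and .css(), .re() returns a list of the matched strings rather than selectors: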

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

To return only the first match, use re_first():

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'

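Using relative XPaths

Note that an XPath starting with / (and therefore also //) is absolute to the document, not relative to the selector it is called on. Keep this in mind when nesting selectors. For example:

Suppose we want to extract the <p> elements inside <div> elements. First we select the <div> elements: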

>>> divs = response.xpath('//div')

If we then write:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()

this extracts every <p> element in the document, not just the ones inside the <div> elements. The correct way is:

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print p.extract()

Or, to get only the direct <p> children:

>>> for p in divs.xpath('p'):
...     print p.extract()
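
The difference between the two relative forms can be tried without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in (its path syntax is only a subset of XPath); the sample markup and variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <p> directly under <div>, one nested deeper,
# and one outside the <div> altogether.
root = ET.fromstring(
    "<body>"
    "<div><p>direct child</p><section><p>nested</p></section></div>"
    "<p>outside any div</p>"
    "</body>"
)
div = root.find("div")

# './/p' is relative to div and reaches every descendant <p>.
descendants = [p.text for p in div.findall(".//p")]

# 'p' is also relative to div but reaches only its direct <p> children.
children = [p.text for p in div.findall("p")]

print(descendants)
print(children)
```

Neither query returns the <p> that sits outside the <div>, which is the point of starting the path from the nested element.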

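Using placeholders in XPath

XPath expressions can also contain placeholders (variables), written as $name and filled in through keyword arguments. For example: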

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()
u'images'

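Every placeholder must be given a value; otherwise a ValueError: XPath error: exception is raised.

Using extensions

Because selectors are built on lxml, the xpath method also supports the extension functions that lxml provides.

Using a regular expression inside the xpath method: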

>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
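
To see which class values the pattern item-\d$ accepts, the same check can be run with Python's own re module. This is a stand-in for re:test, assuming the pattern means the same in both engines; items is a hypothetical list holding the class/href pairs from the snippet above.

```python
import re

# (class, href) pairs copied from the <li> elements in the sample document.
items = [
    ("item-0", "link1.html"),
    ("item-1", "link2.html"),
    ("item-inactive", "link3.html"),
    ("item-1", "link4.html"),
    ("item-0", "link5.html"),
]

# Keep the hrefs whose class ends in "item-" followed by a single digit.
pattern = re.compile(r"item-\d$")
matched = [href for cls, href in items if pattern.search(cls)]

print(matched)
```

Only item-inactive fails the test, so link3.html is the one href left out.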

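Caveats when using text()

text() selects a node-set, which may contain several text nodes. XPath string functions such as contains() and starts-with() first convert their argument to a string, and converting a node-set to a string keeps only its first node. Passing .//text() to such a function can therefore give results that differ from what we expect. In that case, use . instead of text(): . stands for the current node, and its string value includes the text of all its descendants. For example: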

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
# With text(), only the first node survives the conversion to a string
>>> sel.xpath('//a//text()').extract() # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> sel.xpath("string(//a[1]//text())").extract() # convert it to string
[u'Click here to go to the ']
# Converting a node to a string keeps only its text
>>> sel.xpath("//a[1]").extract() # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract() # convert it to string
[u'Click here to go to the Next Page']

# Using .//text() inside contains() may match nothing
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]

# Using . (the current node) inside contains() works
>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
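
The two string conversions can also be reproduced with the standard library. The sketch below uses xml.etree.ElementTree as a stand-in for the selector: a.text plays the role of the first text node, and joining itertext() plays the role of the node's string value. The variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring(
    '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
)

# What survives when a node-set of text nodes is turned into a string:
# only the first text node.
first_text_node = a.text

# The string value of the <a> node itself: all descendant text, concatenated.
string_value = "".join(a.itertext())

# Mirrors contains(.//text(), 'Next Page') versus contains(., 'Next Page').
found_in_first = "Next Page" in first_text_node
found_in_string_value = "Next Page" in string_value

print(repr(first_text_node), found_in_first)
print(repr(string_value), found_in_string_value)
```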

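The difference between //node[1] and (//node)[1]

//node[1] selects every node element that is the first node child of its parent (if there is one).

(//node)[1] first collects all node elements in the document and then selects the first of them.

For example: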

>>> from scrapy import Selector
>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()
>>> xp("//li[1]")
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//li)[1]")
[u'<li>1</li>']
>>> xp("//ul/li[1]")
[u'<li>1</li>', u'<li>4</li>']
>>> xp("(//ul/li)[1]")
[u'<li>1</li>']

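When querying by class, prefer the css method

XPath is awkward for class queries. With @class='someclass', only elements whose class attribute is exactly 'someclass' match, so elements that carry additional classes are missed.

With contains(@class, 'someclass'), any element whose class attribute merely contains that substring also matches (for example class='hesomeclass').

To filter by class, use css instead: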

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']
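
The three matching rules can be compared in plain Python by treating the class attribute as a string. This is only an illustration of the logic, not how Scrapy evaluates the queries; the function names and sample values are hypothetical.

```python
def exact_match(class_attr, name):
    # Behaves like @class='name': the whole attribute must equal the name.
    return class_attr == name


def substring_match(class_attr, name):
    # Behaves like contains(@class, 'name'): any substring hit counts.
    return name in class_attr


def token_match(class_attr, name):
    # Behaves like the CSS selector .name: compare whitespace-separated tokens.
    return name in class_attr.split()


samples = ["shout", "hero shout", "shouting"]
results = {
    s: (exact_match(s, "shout"), substring_match(s, "shout"), token_match(s, "shout"))
    for s in samples
}

for sample, outcome in results.items():
    print(repr(sample), outcome)
```

Only the token comparison accepts "hero shout" while rejecting "shouting". The same test can be written in XPath as contains(concat(' ', normalize-space(@class), ' '), ' someclass '), which is why the css form is usually the easier one to read.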

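Selector objects

Constructor: scrapy.selector.Selector(response=None, text=None, type=None)

response is an HtmlResponse or XmlResponse. When response is not given, text is the text to parse. type sets the selector type and can be html, xml, or None. If type is None, it is chosen from the response type: html for HtmlResponse, xml for XmlResponse. If no response is given but text is, type defaults to html.

Common methods:

xpath(query): returns the elements matching query (an XPath expression) as a SelectorList instance. It can be called directly as response.xpath().

css(query): returns the elements matching query (a CSS selector) as a SelectorList instance. It can be called directly as response.css().

extract(): returns the matched content as text.

re(regex): returns the matches of regex (a regular expression) as a list of strings.

register_namespace(prefix, uri): registers a namespace for the Selector. In some cases (XML, for example) data can only be extracted once the namespace has been specified.

remove_namespaces(): removes all namespaces from the document (HTML or XML). This is a relatively expensive operation.

__nonzero__() (__bool__() in Python 3): returns True if the selector matched anything, otherwise False.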
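
Why a namespace has to be registered before some XML can be queried is easiest to see on a small namespaced document. The sketch below uses the standard library's xml.etree.ElementTree, where a prefix-to-URI dict plays the role that register_namespace(prefix, uri) plays for a Selector; the feed content and the atom prefix are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical Atom-style feed: every element lives in the default namespace.
feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><title>first</title></entry>"
    "<entry><title>second</title></entry>"
    "</feed>"
)

# Without a prefix mapping, the plain name does not match namespaced elements.
without_ns = feed.findall("entry")

# With a prefix bound to the namespace URI, the same elements are found.
ns = {"atom": "http://www.w3.org/2005/Atom"}
with_ns = [e.find("atom:title", ns).text for e in feed.findall("atom:entry", ns)]

print(without_ns)
print(with_ns)
```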

SelectorList objects

SelectorList is a subclass of list that additionally supports the xpath(query), css(query), extract(), and re() methods.

Python Crawler Notes (14): Reading the Scrapy Official Documentation: Selector