1. Common XPath syntax

  • Reference page:
    http://www.w3school.com.cn/xpath/xpath_syntax.asp

  • The examples below demonstrate how XPath is used.

  • First, create an HTML document for the demonstration and use it to build an HtmlResponse object:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> body = '''
... <html>
...     <head>
...         <base href='http://example.com/' />
...         <title>Example website</title>
...     </head>
...     <body>
...         <div id='images'>
...             <a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg' /></a>
...             <a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg' /></a>
...             <a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg' /></a>
...             <a href='image4.html'>Name: My image 4 <br/><img src='image4.jpg' /></a>
...             <a href='image5.html'>Name: My image 5 <br/><img src='image5.jpg' /></a>
...         </div>
...     </body>
... </html>
... '''
>>> response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
  • /: describes an absolute path that starts from the root.
>>> response.xpath('/html')
[<Selector xpath='/html' data='<html>\n\t<head>\n\t\t<base href="http://exam'>]
>>> response.xpath('/html/head')
[<Selector xpath='/html/head' data='<head>\n\t\t<base href="http://example.com/'>]
  • E1/E2: selects all E2 among the child nodes of E1.
# Select all a among the child nodes of div
>>> response.xpath('/html/body/div/a')
[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name: My image 1 <'>,
<Selector xpath='/html/body/div/a' data='<a href="image2.html">Name: My image 2 <'>,
<Selector xpath='/html/body/div/a' data='<a href="image3.html">Name: My image 3 <'>,
<Selector xpath='/html/body/div/a' data='<a href="image4.html">Name: My image 4 <'>,
<Selector xpath='/html/body/div/a' data='<a href="image5.html">Name: My image 5 <'>]
  • //E: selects all E in the document, wherever they are.
# Select all a in the document
>>> response.xpath('//a')
[<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
<Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
<Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
<Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
<Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]
  • E1//E2: selects all E2 among the descendants of E1, at any depth.
# Select all img among the descendants of body
>>> response.xpath('/html/body//img')
[<Selector xpath='/html/body//img' data='<img src="image1.jpg">'>,
<Selector xpath='/html/body//img' data='<img src="image2.jpg">'>,
<Selector xpath='/html/body//img' data='<img src="image3.jpg">'>,
<Selector xpath='/html/body//img' data='<img src="image4.jpg">'>,
<Selector xpath='/html/body//img' data='<img src="image5.jpg">'>]
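The difference between the child step (/) and the descendant step (//) can also be checked outside Scrapy. Below is a minimal sketch, assuming the standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors (it supports only a subset of XPath) and a trimmed, three-link copy of the demo document:

```python
import xml.etree.ElementTree as ET

# Trimmed, three-link copy of the demo document (an assumption for brevity).
root = ET.fromstring(
    "<html><body><div id='images'>"
    "<a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>"
    "<a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>"
    "<a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>"
    "</div></body></html>"
)

# root is the html element, so paths here are written relative to it.
direct_children = root.findall('./body/img')   # img directly under body
descendants = root.findall('./body//img')      # img at any depth below body

print(len(direct_children), len(descendants))
```

The child step finds nothing because every img sits two levels below body, while the descendant step finds all three.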
  • E/text(): selects the text child nodes of E.
# Select the text of all a
>>> sel = response.xpath('//a/text()')
>>> sel
[<Selector xpath='//a/text()' data='Name: My image 1 '>,
<Selector xpath='//a/text()' data='Name: My image 2 '>,
<Selector xpath='//a/text()' data='Name: My image 3 '>,
<Selector xpath='//a/text()' data='Name: My image 4 '>,
<Selector xpath='//a/text()' data='Name: My image 5 '>]
>>> sel.extract()
['Name: My image 1 ',
'Name: My image 2 ',
'Name: My image 3 ',
'Name: My image 4 ',
'Name: My image 5 ']
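For comparison, here is a rough equivalent without Scrapy. It is a sketch that assumes the standard-library xml.etree.ElementTree, which has no text() step; instead, the .text attribute of an element holds the text that precedes its first child element. The document is a trimmed, three-link copy of the demo document:

```python
import xml.etree.ElementTree as ET

# Trimmed, three-link copy of the demo document (an assumption for brevity).
root = ET.fromstring(
    "<html><body><div id='images'>"
    "<a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>"
    "<a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>"
    "<a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>"
    "</div></body></html>"
)

# .text is the text before the first child (<br/>), trailing space included.
texts = [a.text for a in root.findall('.//a')]
print(texts)
```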
  • E/*: selects all element children of E.
# Select all element children of html
>>> response.xpath('/html/*')
[<Selector xpath='/html/*' data='<head>\n\t\t<base href="http://example.com/'>,
<Selector xpath='/html/*' data='<body>\n\t\t<div id="images">\n\t\t\t<a href="i'>]
# Select all descendant element nodes of div
>>> response.xpath('/html/body/div//*')
[<Selector xpath='/html/body/div//*' data='<a href="image1.html">Name: My image 1 <'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image1.jpg">'>,
<Selector xpath='/html/body/div//*' data='<a href="image2.html">Name: My image 2 <'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image2.jpg">'>,
<Selector xpath='/html/body/div//*' data='<a href="image3.html">Name: My image 3 <'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image3.jpg">'>,
<Selector xpath='/html/body/div//*' data='<a href="image4.html">Name: My image 4 <'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image4.jpg">'>,
<Selector xpath='/html/body/div//*' data='<a href="image5.html">Name: My image 5 <'>,
<Selector xpath='/html/body/div//*' data='<br>'>,
<Selector xpath='/html/body/div//*' data='<img src="image5.jpg">'>]
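  • */E: selects all E among the grandchild nodes (E that are children of any child element).
# Select all img among the grandchild nodes of div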
>>> response.xpath('//div/*/img')
[<Selector xpath='//div/*/img' data='<img src="image1.jpg">'>,
<Selector xpath='//div/*/img' data='<img src="image2.jpg">'>,
<Selector xpath='//div/*/img' data='<img src="image3.jpg">'>,
<Selector xpath='//div/*/img' data='<img src="image4.jpg">'>,
<Selector xpath='//div/*/img' data='<img src="image5.jpg">'>]
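The three wildcard forms above (children, all descendants, grandchildren) can be compared side by side. This is a minimal sketch, assuming the standard-library xml.etree.ElementTree in place of Scrapy and a trimmed, three-link copy of the demo document:

```python
import xml.etree.ElementTree as ET

# Trimmed, three-link copy of the demo document (an assumption for brevity).
root = ET.fromstring(
    "<html><head><title>Example website</title></head><body>"
    "<div id='images'>"
    "<a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>"
    "<a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>"
    "<a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>"
    "</div></body></html>"
)

# E/*: element children of html (root is the html element).
children = [el.tag for el in root.findall('./*')]

# E//*: every element below div, in document order.
div = root.find('.//div')
below_div = [el.tag for el in div.findall('.//*')]

# E/*/img: img elements exactly two levels below div.
grand_imgs = [el.get('src') for el in root.findall('.//div/*/img')]

print(children)
print(below_div)
print(grand_imgs)
```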
  • E/@ATTR: selects the ATTR attribute of E.
# Select the src attribute of all img
>>> response.xpath('//img/@src')
[<Selector xpath='//img/@src' data='image1.jpg'>,
<Selector xpath='//img/@src' data='image2.jpg'>,
<Selector xpath='//img/@src' data='image3.jpg'>,
<Selector xpath='//img/@src' data='image4.jpg'>,
<Selector xpath='//img/@src' data='image5.jpg'>]
  • //@ATTR: selects every ATTR attribute in the document.
# Select all href attributes
>>> response.xpath('//@href')
[<Selector xpath='//@href' data='http://example.com/'>,
<Selector xpath='//@href' data='image1.html'>,
<Selector xpath='//@href' data='image2.html'>,
<Selector xpath='//@href' data='image3.html'>,
<Selector xpath='//@href' data='image4.html'>,
<Selector xpath='//@href' data='image5.html'>]
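Attribute values can be collected in a similar way without Scrapy. The sketch below assumes the standard-library xml.etree.ElementTree, which cannot select attribute nodes with @ATTR; it selects the elements (optionally filtered with an [@ATTR] predicate) and then reads the value with get(). The document is a trimmed, three-link copy of the demo document:

```python
import xml.etree.ElementTree as ET

# Trimmed, three-link copy of the demo document (an assumption for brevity).
root = ET.fromstring(
    "<html><head><base href='http://example.com/'/></head><body>"
    "<div id='images'>"
    "<a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>"
    "<a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>"
    "<a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>"
    "</div></body></html>"
)

# Counterpart of //img/@src: find the img elements, then read src.
srcs = [img.get('src') for img in root.findall('.//img')]

# Counterpart of //@href: any element that carries an href attribute.
hrefs = [el.get('href') for el in root.findall('.//*[@href]')]

print(srcs)
print(hrefs)
```

As in the Scrapy output above, the href of the base element comes first because it appears earlier in the document than the links.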
  • E/@*: selects all attributes of E.
# Get all attributes of the img under the first a (here there is only one, src)
>>> response.xpath('//a[1]/img/@*')
[<Selector xpath='//a[1]/img/@*' data='image1.jpg'>]
  • .: selects the current node; used to describe relative paths.
# Get the selector object for the first a
>>> sel = response.xpath('//a')[0]
>>> sel
<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>
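# Suppose we want all img among the descendants of this a. The following is wrong:
# it finds every img in the document, because //img is an absolute path
# that searches from the document root rather than from the current a.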
>>> sel.xpath('//img')
[<Selector xpath='//img' data='<img src="image1.jpg">'>,
<Selector xpath='//img' data='<img src="image2.jpg">'>,
<Selector xpath='//img' data='<img src="image3.jpg">'>,
<Selector xpath='//img' data='<img src="image4.jpg">'>,
<Selector xpath='//img' data='<img src="image5.jpg">'>]
# Use .//img to describe all img among the descendants of the current node
>>> sel.xpath('.//img')
[<Selector xpath='.//img' data='<img src="image1.jpg">'>]
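Relative paths matter most when looping over a set of nodes and extracting related fields from each one. A minimal sketch of that pattern follows, assuming the standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors and a trimmed, three-link copy of the demo document:

```python
import xml.etree.ElementTree as ET

# Trimmed, three-link copy of the demo document (an assumption for brevity).
root = ET.fromstring(
    "<html><body><div id='images'>"
    "<a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>"
    "<a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>"
    "<a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>"
    "</div></body></html>"
)

pairs = []
for a in root.findall('.//a'):
    # The leading "." keeps the search inside the current a element.
    img = a.find('.//img')
    pairs.append((a.get('href'), img.get('src')))

print(pairs)
```

Each link is paired with its own image because the inner search starts from the current a rather than from the document root.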
  • ..: selects the parent of the current node; used to describe relative paths.
# Select the parent nodes of all img
>>> response.xpath('//img/..')
[<Selector xpath='//img/..' data='<a href="image1.html">Name: My image 1 <'>,
<Selector xpath='//img/..' data='<a href="image2.html">Name: My image 2 <'>,
<Selector xpath='//img/..' data='<a href="image3.html">Name: My image 3 <'>,
<Selector xpath='//img/..' data='<a href="image4.html">Name: My image 4 <'>,
<Selector xpath='//img/..' data='<a href="image5.html">Name: My image 5 <'>]
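The parent step can be tried in the same way outside Scrapy. This is a sketch that assumes the standard-library xml.etree.ElementTree, whose limited XPath support includes "..", and a trimmed, three-link copy of the demo document:

```python
import xml.etree.ElementTree as ET

# Trimmed, three-link copy of the demo document (an assumption for brevity).
root = ET.fromstring(
    "<html><body><div id='images'>"
    "<a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>"
    "<a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>"
    "<a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>"
    "</div></body></html>"
)

# Find every img, then step up to its parent element.
parents = root.findall('.//img/..')
parent_tags = [p.tag for p in parents]
parent_hrefs = [p.get('href') for p in parents]

print(parent_tags)
print(parent_hrefs)
```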
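  • node[predicate]: a predicate finds a specific node or a node that contains a specific value.
# Select the 3rd a (position is counted among the a children of the same parent)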
>>> response.xpath('//a[3]')
[<Selector xpath='//a[3]' data='<a href="image3.html">Name: My image 3 <'>]
# Use the last function to select the last one
>>> response.xpath('//a[last()]')
[<Selector xpath='//a[last()]' data='<a href="image5.html">Name: My image 5 <'>]
# Use the position function to select the first 3
>>> response.xpath('//a[position()<=3]')
[<Selector xpath='//a[position()<=3]' data='<a href="image1.html">Name: My image 1 <'>,
<Selector xpath='//a[position()<=3]' data='<a href="image2.html">Name: My image 2 <'>,
<Selector xpath='//a[position()<=3]' data='<a href="image3.html">Name: My image 3 <'>]
# Select all div that have an id attribute
>>> response.xpath('//div[@id]')
[<Selector xpath='//div[@id]' data='<div id="images">\n\t\t\t<a href="image1.htm'>]
# Select all div that have an id attribute whose value is "images"
>>> response.xpath('//div[@id="images"]')
[<Selector xpath='//div[@id="images"]' data='<div id="images">\n\t\t\t<a href="image1.htm'>]
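Several of these predicates also work in the standard library. The sketch below assumes xml.etree.ElementTree, which accepts [N], [last()], [@ATTR] and [@ATTR='value'] but not comparisons such as position()<=3, so a list slice stands in for that case. The document is a trimmed, three-link copy of the demo document:

```python
import xml.etree.ElementTree as ET

# Trimmed, three-link copy of the demo document (an assumption for brevity).
root = ET.fromstring(
    "<html><body><div id='images'>"
    "<a href='image1.html'>Name: My image 1 <br/><img src='image1.jpg'/></a>"
    "<a href='image2.html'>Name: My image 2 <br/><img src='image2.jpg'/></a>"
    "<a href='image3.html'>Name: My image 3 <br/><img src='image3.jpg'/></a>"
    "</div></body></html>"
)

second = [a.get('href') for a in root.findall('.//a[2]')]
last = [a.get('href') for a in root.findall('.//a[last()]')]

# Stand-in for position()<=2. The slice counts over the whole result list,
# which matches here only because all a elements share one parent.
first_two = [a.get('href') for a in root.findall('.//a')[:2]]

with_id = [d.get('id') for d in root.findall('.//div[@id]')]
images_div = root.findall(".//div[@id='images']")

print(second, last, first_two, with_id, len(images_div))
```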

Reposted from blog.csdn.net/qq_41682050/article/details/81073008