Python crawler: XPath syntax and the lxml module

What is XPath?

XPath (XML Path Language) is a language for finding information in XML and HTML documents; it can be used to traverse the elements and attributes of XML and HTML documents.

XPath Development Tools

  1. Chrome plugin: XPath Helper.
  2. Firefox plugin: Try XPath.

XPath syntax

Selecting nodes:

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions are very similar to the ones we use in conventional computer file systems.

Expression | Description | Example | Result
nodename | Selects all the child nodes of the named node | bookstore | Selects all the child nodes of the bookstore element
/ | At the start of an expression, selects from the root node; otherwise, selects a node under another node | /bookstore | Selects all bookstore nodes under the root element
// | Selects nodes anywhere in the document, regardless of their position | //book | Finds all book nodes in the whole document
@ | Selects an attribute of a node | //book[@price] | Selects all book nodes that have a price attribute
. | Refers to the current node | ./a | Selects the a tag under the current node
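
To make these expressions concrete, here is a minimal sketch using Python's lxml library (introduced later in this article); the bookstore XML is invented purely for illustration:

from lxml import etree

# a small, hypothetical bookstore document, used only for illustration
xml = '''
<bookstore>
    <book price="10"><title>first</title></book>
    <book price="40"><title>second</title></book>
</bookstore>
'''
root = etree.fromstring(xml)

print(root.xpath('/bookstore/book'))   # book children of the root bookstore element
print(root.xpath('//book'))            # all book nodes anywhere in the document
print(root.xpath('//book[@price]'))    # book nodes that have a price attribute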

Predicates:

A predicate is used to find a specific node, or a node that contains a specified value. Predicates are embedded in square brackets.
In the table below, we list some path expressions with predicates, together with the result of each expression:

Path expression | Description
/bookstore/book[1] | Selects the first book child element under bookstore
/bookstore/book[last()] | Selects the last book element under bookstore
bookstore/book[position()<3] | Selects the first two book child elements under bookstore
//book[@price] | Selects the book elements that have a price attribute
//book[@price=10] | Selects all book elements whose price attribute equals 10
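
As a minimal sketch (again with an invented bookstore document), these predicates behave as follows in lxml:

from lxml import etree

# hypothetical document, used only to demonstrate predicates
xml = '''
<bookstore>
    <book price="10"><title>first</title></book>
    <book price="20"><title>second</title></book>
    <book price="30"><title>third</title></book>
</bookstore>
'''
root = etree.fromstring(xml)

print(root.xpath('/bookstore/book[1]/title/text()'))       # ['first']
print(root.xpath('/bookstore/book[last()]/title/text()'))  # ['third']
print(len(root.xpath('/bookstore/book[position()<3]')))    # 2 (the first two books)
print(root.xpath('//book[@price=10]/title/text()'))        # ['first']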

Wildcards:

Wildcards are used to match unknown nodes.

Wildcard | Description | Example | Result
* | Matches any element node | /bookstore/* | Selects all the child elements under bookstore
@* | Matches any attribute node | //book[@*] | Selects all book elements that have at least one attribute
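
A short sketch of both wildcards, again on an invented document:

from lxml import etree

# hypothetical document: one book with an attribute, one without
xml = '<bookstore><book price="10"/><book/></bookstore>'
root = etree.fromstring(xml)

print(root.xpath('/bookstore/*'))   # both book elements (all children of bookstore)
print(root.xpath('//book[@*]'))     # only the book that has an attribute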

Selecting multiple paths:

By using the "|" operator in a path expression, you can select several paths at once.
An example is as follows:

//bookstore/book | //book/title
# Selects all book elements, as well as all the title elements under book elements
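
Run against an invented document, the "|" operator returns both node sets combined:

from lxml import etree

# hypothetical document, used only for illustration
xml = '<bookstore><book><title>a</title></book><book><title>b</title></book></bookstore>'
root = etree.fromstring(xml)

# "|" merges the results of the two path expressions
result = root.xpath('//bookstore/book | //book/title')
print(result)   # the two book elements plus their two title elements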

Operators:

XPath expressions also support the usual operators: | (union of two node sets), +, -, *, div (arithmetic), =, !=, <, <=, >, >= (comparison), and, or (boolean logic), and mod (remainder). For example, //book[@price>9.80] selects the book elements whose price attribute is greater than 9.80.

lxml library

lxml is an HTML/XML parser; its main function is to parse and extract HTML/XML data.

Like the re module, lxml is implemented in C. It is a high-performance Python HTML/XML parser, and we can use the XPath syntax we learned above to quickly locate specific elements and node information.

Official lxml documentation: http://lxml.de/index.html

lxml depends on C libraries, and can be installed with pip: pip install lxml

Basic use:

We can use it to parse HTML code, and when parsing HTML code that is not well-formed, it will automatically complete the markup. Sample code is as follows:

# use lxml's etree library
from lxml import etree 

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a> # 注意,此处缺少一个 </li> 闭合标签
     </ul>
 </div>
'''

# use etree.HTML to parse the string into an HTML document
html = etree.HTML(text)
# serialize the HTML document back into a string
result = etree.tostring(html)
print(result)

The output is as follows:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
</body></html>

As you can see, lxml automatically corrected the HTML code: it not only completed the missing li tag, but also added the body and html tags.

Reading HTML code from a file:

In addition to parsing a string directly, lxml also supports reading from a file. Let's create a new file named hello.html:

<!-- hello.html -->
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>

Then use the etree.parse() method to read the file. Sample code is as follows:

from lxml import etree

# read the external file hello.html
html = etree.parse('hello.html')
result = etree.tostring(html, pretty_print=True)
print(result)

The output is the same as before.
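
One caveat: etree.parse() uses an XML parser by default, so unlike etree.HTML() it will raise an error on HTML that is not well-formed. If the file may contain broken HTML, you can pass an HTML parser explicitly; a minimal sketch:

from lxml import etree

# parse a possibly non-well-formed HTML file with an explicit HTML parser
parser = etree.HTMLParser()
html = etree.parse('hello.html', parser)
print(etree.tostring(html, pretty_print=True))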

Using XPath syntax in lxml:

  1. Get all li tags:

     from lxml import etree
    
     html = etree.parse('hello.html')
     print(type(html))  # show the return type of etree.parse()
     result = html.xpath('//li')
     print(result)  # print the list of <li> elements
  2. Get the values of the class attributes of all the li tags:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li/@class')
    
     print(result)
    
  3. Get the a tags under the li tags whose href attribute is www.baidu.com:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li/a[@href="www.baidu.com"]')
    
     print(result)
    
  4. Get all span tags under the li tags:

     from lxml import etree
    
     html = etree.parse('hello.html')
    
     # result = html.xpath('//li/span')
     # Note: the commented line above is wrong, because / only selects
     # direct children, and <span> is not a direct child of <li>,
     # so a double slash is needed
     result = html.xpath('//li//span')
     print(result)
  5. Get all the class attributes of the a tags under the li tags:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li/a//@class')
    
     print(result)
    
  6. Get the value of the href attribute of the a tag under the last li tag:

     from lxml import etree
    
     html = etree.parse('hello.html')
    
     result = html.xpath('//li[last()]/a/@href')
     # the predicate [last()] finds the last element
     print(result)
  7. Get the content of the penultimate li element:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li[last()-1]/a')
    
     # the text attribute gets the element's content
     print(result[0].text)
  8. Get the content of the penultimate li element, using a second approach:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li[last()-1]/a/text()')
    
     print(result)
    

Using requests and XPath to crawl the Movie Heaven (dytt8) site

Sample code is as follows:

import requests
from lxml import etree

BASE_DOMAIN = 'http://www.dytt8.net'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Referer': 'http://www.dytt8.net/html/gndy/dyzz/list_23_2.html'
}

def spider():
    url = 'http://www.dytt8.net/html/gndy/dyzz/list_23_1.html'
    resp = requests.get(url, headers=HEADERS)
    # resp.content: the raw, still-encoded bytes
    # resp.text: the decoded unicode string
    # text: the source code of the page
    text = resp.content.decode('gbk')
    # tree: the object returned by lxml's parser; its xpath method
    # is what we use to extract the data we want
    tree = etree.HTML(text)
    all_a = tree.xpath("//div[@class='co_content8']//a")
    for a in all_a:
        title = a.xpath("text()")[0]
        href = a.xpath("@href")[0]
        if href.startswith('/'):
            detail_url = BASE_DOMAIN + href
            crawl_detail(detail_url)
            break

def crawl_detail(url):
    resp = requests.get(url, headers=HEADERS)
    text = resp.content.decode('gbk')
    tree = etree.HTML(text)
    create_time = tree.xpath("//div[@class='co_content8']/ul/text()")[0].strip()
    imgs = tree.xpath("//div[@id='Zoom']//img/@src")
    # movie poster
    cover = imgs[0]
    # movie screenshot
    screenshot = imgs[1]
    # get all the text nodes under the Zoom div
    infos = tree.xpath("//div[@id='Zoom']//text()")
    for index, info in enumerate(infos):
        if info.startswith("◎年  代"):
            year = info.replace("◎年  代", "").strip()
        if info.startswith("◎豆瓣评分"):
            douban_rating = info.replace("◎豆瓣评分", "").strip()
            print(douban_rating)
        if info.startswith("◎主  演"):
            # walk forward from the current position to collect the actor lines
            actors = [info]
            for x in range(index + 1, len(infos)):
                actor = infos[x]
                if actor.startswith("◎"):
                    break
                actors.append(actor.strip())
            print(",".join(actors))

if __name__ == '__main__':
    spider()

Chrome-related issues:

Chrome 62 (the latest version at the time of writing) has a bug: FormData is not shown for a request when the page performs a 302 redirect. For details, see the following link: https://stackoverflow.com/questions/34015735/http-post-payload-not-visible-in-chrome-debugger.

The Canary build has already fixed this problem; you can download it at the following link: https://www.google.com/chrome/browser/canary.html

Origin: www.cnblogs.com/csnd/p/11469337.html