Article Directory
- Introduction
- What is XML
- XPath definition
- XPath expression
- The most commonly used path expressions
- Common path expressions and the results of expressions
- The predicate is used to find a specific node or a node containing a specified value, and is embedded in square brackets
- Select unknown node
- Select several paths, by using the "|" operator in the path expression, you can select several paths
- XPath operators
- lxml library
- XPath case
Introduction
Some people say that processing HTML documents is tiring when they are not comfortable with regular expressions. Is there another way?
There is: XPath. We can first convert the string data obtained from the network into an HTML/XML document, and then use XPath to find HTML/XML nodes or elements.
What is XML
Everyone knows HTML, so what is XML?
- XML stands for EXtensible Markup Language.
- XML is a markup language, very similar to HTML.
- XML is designed to transmit data, not to display it.
- XML tags are not predefined; we define them ourselves.
- XML is designed to be self-descriptive.
- XML is a W3C Recommendation.
XML node relationship
Parent, child, sibling, ancestor, descendant
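To make these relationships concrete, here is a minimal sketch using lxml element methods. The bookstore document is a made-up sample, not part of the original article.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml)
title = root.xpath("/bookstore/book/title")[0]

parent = title.getparent()                             # <book> is the parent of <title>
children = [child.tag for child in parent]             # children of <book>
sibling = title.getnext()                              # <price> is a sibling of <title>
ancestors = [a.tag for a in title.iterancestors()]     # nearest ancestor first
descendants = [d.tag for d in root.iterdescendants()]  # document order

print(parent.tag)    # book
print(children)      # ['title', 'price']
print(sibling.tag)   # price
print(ancestors)     # ['book', 'bookstore']
print(descendants)   # ['book', 'title', 'price']
```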
XPath definition
XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse elements and attributes in XML documents.
XPath expression
The most commonly used path expressions
| Expression | Description |
| --- | --- |
| `/` | Select from the root node. |
| `//` | Select nodes in the document, starting from the current node, that match the selection, regardless of their location. |
| `.` | Select the current node. |
| `..` | Select the parent node of the current node. |
| `@` | Select attributes. |
Common path expressions and the results of expressions
| Expression | Result |
| --- | --- |
| `/bookstore` | Select the root element bookstore. Note: a path that starts with a forward slash (/) always represents an absolute path to an element. |
| `bookstore/book` | Select all book elements that are children of bookstore. |
| `//book` | Select all book elements, regardless of their position in the document. |
| `bookstore//book` | Select all book elements that are descendants of the bookstore element, regardless of where they are located under bookstore. |
| `//@lang` | Select all attributes named lang. |
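The path expressions above can be tried out with lxml. This is a small sketch; the bookstore document is a hypothetical sample, and the relative form `bookstore/book` is written as an absolute path here because the context node in the sketch is already the bookstore element.

```python
from lxml import etree

# Hypothetical bookstore document, used only for illustration.
xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml)

root_tag = root.xpath("/bookstore")[0].tag      # absolute path from the root
children = root.xpath("/bookstore/book")        # direct children only
anywhere = root.xpath("//book")                 # any depth in the document
descendants = root.xpath("/bookstore//book")    # any depth below bookstore
langs = root.xpath("//@lang")                   # attribute values

first_title = root.xpath("//title")[0]
current_tag = first_title.xpath(".")[0].tag     # "." is the current node
parent_tag = first_title.xpath("..")[0].tag     # ".." is its parent

print(root_tag, len(children), len(anywhere), len(descendants))  # bookstore 2 2 2
print(langs)                    # ['eng', 'eng']
print(current_tag, parent_tag)  # title book
```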
The predicate is used to find a specific node or a node containing a specified value, and is embedded in square brackets
| Path expression | Result |
| --- | --- |
| `/bookstore/book[1]` | Select the first book element that is a child of bookstore. |
| `/bookstore/book[last()]` | Select the last book element that is a child of bookstore. |
| `/bookstore/book[last()-1]` | Select the second-to-last book element that is a child of bookstore. |
| `/bookstore/book[position()<3]` | Select the first two book elements that are children of the bookstore element. |
| `//title[@lang]` | Select all title elements that have an attribute named lang. |
| `//title[@lang='eng']` | Select all title elements that have a lang attribute with the value eng. |
| `/bookstore/book[price>35.00]` | Select all book elements of the bookstore element whose price element has a value greater than 35.00. |
| `/bookstore/book[price>35.00]/title` | Select all title elements of the book elements in bookstore whose price element has a value greater than 35.00. |
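A short sketch of the predicates above, run against a hypothetical three-book document (the titles, languages, and prices are invented for the example):

```python
from lxml import etree

# Hypothetical sample: three books, one title without a lang attribute.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>39.95</price></book>
  <book><title>Learning XML</title><price>49.99</price></book>
</bookstore>
"""
root = etree.fromstring(xml)

def titles(expr):
    """Return the title text of every book selected by expr."""
    return root.xpath(expr + "/title/text()")

first = titles("/bookstore/book[1]")
last = titles("/bookstore/book[last()]")
second_to_last = titles("/bookstore/book[last()-1]")
first_two = titles("/bookstore/book[position()<3]")
expensive = titles("/bookstore/book[price>35.00]")
with_lang = root.xpath("//title[@lang]/text()")
english = root.xpath("//title[@lang='eng']/text()")

print(first)           # ['Harry Potter']
print(last)            # ['Learning XML']
print(second_to_last)  # ['Le Petit Prince']
print(first_two)       # ['Harry Potter', 'Le Petit Prince']
print(expensive)       # ['Le Petit Prince', 'Learning XML']
print(with_lang)       # ['Harry Potter', 'Le Petit Prince']
print(english)         # ['Harry Potter']
```

Note that XPath positions start at 1, not 0.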
Select unknown node
| Wildcard | Description |
| --- | --- |
| `*` | Match any element node. |
| `@*` | Match any attribute node. |

| Path expression | Result |
| --- | --- |
| `/bookstore/*` | Select all child elements of the bookstore element. |
| `//*` | Select all elements in the document. |
| `//title[@*]` | Select all title elements that have at least one attribute. |
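The wildcards can be checked the same way. In this sketch the document is again a hypothetical sample in which only the first title carries an attribute:

```python
from lxml import etree

# Hypothetical sample: only the first <title> has an attribute.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""
root = etree.fromstring(xml)

child_tags = [el.tag for el in root.xpath("/bookstore/*")]  # children of bookstore
all_elements = root.xpath("//*")                            # every element
titles_with_attrs = root.xpath("//title[@*]/text()")        # titles with any attribute
title_attr_values = root.xpath("//title/@*")                # all attribute values of titles

print(child_tags)          # ['book', 'book']
print(len(all_elements))   # 7
print(titles_with_attrs)   # ['Harry Potter']
print(title_attr_values)   # ['eng']
```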
Select several paths, by using the "|" operator in the path expression, you can select several paths
| Path expression | Result |
| --- | --- |
| `//book/title \| //book/price` | Select all the title and price elements of the book elements. |
| `//title \| //price` | Select all the title and price elements in the document. |
| `/bookstore/book/title \| //price` | Select all the title elements belonging to the book elements of the bookstore element, and all the price elements in the document. |
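A brief sketch of the "|" operator, assuming a hypothetical document with a price element both inside and outside a book:

```python
from lxml import etree

# Hypothetical sample: one <price> sits outside any <book>.
xml = """
<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
  <price>5.00</price>
</bookstore>
"""
root = etree.fromstring(xml)

book_parts = root.xpath("//book/title | //book/price")   # only nodes inside <book>
all_parts = root.xpath("//title | //price")               # anywhere in the document
mixed = root.xpath("/bookstore/book/title | //price")     # book titles plus every price

print(len(book_parts))                      # 4
print(len(all_parts))                       # 5
print(len(mixed))                           # 5
print(sorted({el.tag for el in mixed}))     # ['price', 'title']
```

A node that matches both sides of the union appears only once in the result.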
XPath operators
| Operator | Description | Example | Return value |
| --- | --- | --- | --- |
| `\|` | Union of two node sets | `//book \| //cd` | A node set containing all book and cd elements |
| `+` | Addition | `6 + 4` | 10 |
| `-` | Subtraction | `6 - 4` | 2 |
| `*` | Multiplication | `6 * 4` | 24 |
| `div` | Division | `8 div 4` | 2 |
| `=` | Equal | `price=9.80` | true if price is 9.80; false if price is 9.90 |
| `!=` | Not equal | `price!=9.80` | true if price is 9.90; false if price is 9.80 |
| `<` | Less than | `price<9.80` | true if price is 9.00; false if price is 9.90 |
| `<=` | Less than or equal to | `price<=9.80` | true if price is 9.00; false if price is 9.90 |
| `>` | Greater than | `price>9.80` | true if price is 9.90; false if price is 9.80 |
| `>=` | Greater than or equal to | `price>=9.80` | true if price is 9.90; false if price is 9.70 |
| `or` | Or | `price=9.80 or price=9.70` | true if price is 9.80 or 9.70 |
| `and` | And | `price>9.00 and price<9.90` | true if price is 9.80; false if price is 8.50 |
| `mod` | Modulus (remainder of the division) | `5 mod 2` | 1 |
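The operators can be evaluated directly through lxml's `xpath()` method. The sketch below uses a hypothetical three-book document; note that XPath numbers come back to Python as floats.

```python
from lxml import etree

# Hypothetical sample with three prices.
xml = """
<bookstore>
  <book><title>A</title><price>8.50</price></book>
  <book><title>B</title><price>9.80</price></book>
  <book><title>C</title><price>9.90</price></book>
</bookstore>
"""
root = etree.fromstring(xml)

# Arithmetic operators return numbers.
total = root.xpath("6 + 4")
quotient = root.xpath("8 div 4")
remainder = root.xpath("5 mod 2")

# Comparison and boolean operators are most useful inside predicates.
equal = root.xpath("//book[price=9.80]/title/text()")
not_equal = root.xpath("//book[price!=9.80]/title/text()")
in_range = root.xpath("//book[price>9.00 and price<9.90]/title/text()")
either = root.xpath("//book[price=9.80 or price=8.50]/title/text()")

print(total, quotient, remainder)  # 10.0 2.0 1.0
print(equal)       # ['B']
print(not_equal)   # ['A', 'C']
print(in_range)    # ['B']
print(either)      # ['A', 'B']
```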
lxml library
Definition
- lxml is an HTML/XML parsing library whose main job is to parse HTML/XML and extract data from it.
- Like Python's regular expression engine, lxml is implemented in C. It is a high-performance HTML/XML parser for Python, and we can use XPath syntax to quickly locate specific elements and node information.
lxml data conversion
- Source code
from lxml import etree

text = '''<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
# Use etree.HTML to parse the string into an HTML document
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
- Data obtained (etree.HTML wraps the fragment in <html> and <body> and adds the missing </li>; whitespace may differ)
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>
lxml read file
- Source code
from lxml import etree

# Read the external file hello.html
html = etree.parse('./hello.html')
print(html)
- File data (hello.html)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>测试页面</title>
</head>
<body>
<ol>
<li class="haha">醉卧沙场君莫笑,古来征战几人回</li>
<li class="heihei">两岸猿声啼不住,轻舟已过万重山</li>
<li id="hehe" class="nene">一骑红尘妃子笑,无人知是荔枝来</li>
<li class="xixi">停车坐爱枫林晚,霜叶红于二月花</li>
<li class="lala">商女不知亡国恨,隔江犹唱后庭花</li>
</ol>
<div id="pp">
<div>
<a href="http://www.baidu.com">李白</a>
</div>
<ol>
<li class="huanghe">君不见黄河之水天上来,奔流到海不复回</li>
<li id="tata" class="hehe">李白乘舟将欲行,忽闻岸上踏歌声</li>
<li class="tanshui">桃花潭水深千尺,不及汪伦送我情</li>
</ol>
<div class="hh">
<a href="http://mi.com">雷军</a>
</div>
<div class="jj">
<b href="http://mi.com"><c>3</c></b>
<b href="http://mi.com"><c>5</c></b>
<b href="http://mi.com"><c>6</c></b>
<b href="http://mi.com"><c>8</c></b>
<b href="http://mi.com"><c>9</c></b>
<b href="http://mi.com"><c>3</c></b>
</div>
<ol>
<li class="dudu">are you ok</li>
<li class="meme">会飞的猪</li>
</ol>
</div>
</body>
</html>
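The hello.html file above happens to be well-formed, so the default XML parser used by `etree.parse` accepts it. Real-world pages often are not. The sketch below, which writes a deliberately broken, made-up snippet to a temporary file, shows one way to handle that by passing `etree.HTMLParser()` as the parser.

```python
import os
import tempfile

from lxml import etree

# Hypothetical markup with unclosed <li> tags, used only for illustration.
broken_html = '<ul><li class="a">one<li class="b">two</ul>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "broken.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(broken_html)

    # The default XML parser rejects markup that is not well-formed.
    try:
        etree.parse(path)
        strict_ok = True
    except etree.XMLSyntaxError:
        strict_ok = False

    # An HTML parser is lenient and repairs the markup while parsing.
    tree = etree.parse(path, etree.HTMLParser())
    classes = tree.xpath("//li/@class")

print(strict_ok)  # False
print(classes)    # ['a', 'b']
```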
XPath usage in detail
- Source of the page being parsed: the hello.html file listed above under "lxml read file".
- Get all <li> tags
from lxml import etree

html = etree.parse('hello.html')
li_list = html.xpath('//li')
print(li_list)       # Print the list of <li> elements
print(len(li_list))  # Print how many <li> elements were found
- Get the class attribute of every <li> tag
from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/@class')
print(result)
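- Get the <a> tag under <li> whose href is link1.html
from lxml import etree

html = etree.parse('./hello.html')
# Note: the hello.html listed above has no <a> inside <li>, so this prints [];
# the expression fits the <ul> sample from the lxml data conversion example.
result = html.xpath('//li/a[@href="link1.html"]')
print(result)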
- Get all <span> tags under <li> tags
from lxml import etree

data = '''
<div>
    <ul>
        <li class="item-0">你好,老段<a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>'''
html = etree.HTML(data)
# "//" after li matches <span> at any depth below <li>, not only direct children
result = html.xpath('//li//span')
print(result[0].text)
- Get every class attribute inside the <a> tags under <li> tags
from lxml import etree

html = etree.parse('hello.html')
# Matches class attributes on <a> descendants, not on <a> itself
result = html.xpath('//li/a//@class')
print(result)
- Get the href of the <a> in the last <li>
from lxml import etree

xml = etree.parse('./hello.html')
result = xml.xpath('//li[last()]/a/@href')
print(result)
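- Get the content of the second-to-last element
from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')
# result[0] raises IndexError when nothing matches, as with the hello.html listed above
print(result[0].text)
print(result)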
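- Get the tag name of the element whose class is bold
from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//*[@class="bold"]')
# The tag attribute holds the tag name.
# result[0] raises IndexError when nothing matches, as with the hello.html listed above.
print(result[0].tag)
print(result[0].text)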
- Using conditions
  - Get text data: //li[@id="hehe"]/text()
  - Contains a value: //li[contains(@class,"h")]
  - Equals a value: //div[@id="pp"]/ol[last()]/li/@*
  - Combine conditions: //li[@id="hehe"][@class="nene"]/text()
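The four conditions can be sketched against a shortened stand-in for hello.html. The markup below keeps the ids and class names from the file above but replaces the text with short English placeholders, which are an assumption made for readability.

```python
from lxml import etree

# Shortened stand-in for hello.html; the text content is placeholder data.
html_text = '''
<ol>
  <li class="haha">one</li>
  <li id="hehe" class="nene">two</li>
</ol>
<div id="pp">
  <ol>
    <li class="huanghe">three</li>
  </ol>
  <ol>
    <li class="dudu">are you ok</li>
    <li class="meme">four</li>
  </ol>
</div>
'''
doc = etree.HTML(html_text)

text_by_id = doc.xpath('//li[@id="hehe"]/text()')
classes_with_h = doc.xpath('//li[contains(@class,"h")]/@class')
last_ol_attrs = doc.xpath('//div[@id="pp"]/ol[last()]/li/@*')
both_conditions = doc.xpath('//li[@id="hehe"][@class="nene"]/text()')

print(text_by_id)       # ['two']
print(classes_with_h)   # ['haha', 'huanghe']
print(last_ol_attrs)    # ['dudu', 'meme']
print(both_conditions)  # ['two']
```

Chaining two predicates, as in the last expression, requires both conditions to hold.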
XPath case
import requests
from lxml import etree

url1 = 'https://www.neihanba.com/dz/'
url = 'https://www.neihanba.com/dz/list_%d.html'

if __name__ == '__main__':
    fp = open('./duanzi.csv', mode='a', encoding='utf-8')
    for i in range(1, 101):
        # The first page has its own URL; later pages follow the list_%d pattern
        if i == 1:
            url_duanzi = url1
        else:
            url_duanzi = url % i
        response = requests.get(url_duanzi)
        response.encoding = 'gbk'
        content = response.text
        html = etree.HTML(content)
        result = html.xpath('//ul[@class="piclist longList"]/li')
        for li in result:
            try:
                title = li.xpath('.//h4/a/b/text()')[0]
                content = li.xpath('.//div[@class="f18 mb20"]/text()')[0].strip().strip('\n')
                info = ''.join(li.xpath('.//div[@class="ft"]/span//text()')[1:])
                fp.write('%s\t%s\t%s\n' % (title, content, info))
            except Exception as e:
                # Save the failures, analyze them the next day, and crawl them separately.
                pass
        print('Page %d saved successfully!' % i)
    fp.close()
# !!! Exception handling is missing
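One possible way to tidy up the extraction step is sketched below. It moves the XPath logic into a function that can be tested without network access and logs skipped items instead of silently passing. The function name `parse_items` and the sample markup are hypothetical; the markup is only inferred from the XPath expressions in the script above, not taken from the real site.

```python
import logging

from lxml import etree

logging.basicConfig(level=logging.WARNING)


def parse_items(page_html):
    """Extract (title, content, info) tuples from one list page (hypothetical helper)."""
    rows = []
    doc = etree.HTML(page_html)
    for li in doc.xpath('//ul[@class="piclist longList"]/li'):
        try:
            title = li.xpath('.//h4/a/b/text()')[0]
            content = li.xpath('.//div[@class="f18 mb20"]/text()')[0].strip()
            info = ''.join(li.xpath('.//div[@class="ft"]/span//text()')[1:])
        except IndexError:
            # A missing node gives an empty list, so [0] fails; log it and move on.
            logging.warning('skipped an item with missing fields')
            continue
        rows.append((title, content, info))
    return rows


# Hypothetical markup inferred from the XPath expressions, not the real page.
sample = '''
<ul class="piclist longList">
  <li>
    <h4><a href="#"><b>Title A</b></a></h4>
    <div class="f18 mb20"> Body A </div>
    <div class="ft"><span>author</span><span>2020-05-28</span></div>
  </li>
  <li><h4><a href="#">no bold title</a></h4></li>
</ul>
'''
rows = parse_items(sample)
print(rows)  # [('Title A', 'Body A', '2020-05-28')]
```

In the crawler, `parse_items(response.text)` would replace the inner loop, and the file could be opened with a `with` statement so that it is closed even if a request fails.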