Practical source code for web crawler development: https://github.com/MakerChen66/Python3Spider
Original content is not easy to produce; plagiarism and reprinting of this article are prohibited. It summarizes years of hands-on crawler development experience, and infringement will be pursued!
1. Introduction to XPath
When we use regular expressions to extract page information, it feels cumbersome, and a single mistake in the pattern can make the whole match fail, so regular expressions are somewhat inconvenient for this task. When parsing a page, we can instead use XPath or CSS selectors to locate a node and then call the corresponding method to obtain its text or attributes, which lets us extract any information we want.
Python has many such parsing libraries; among the more powerful are lxml, Beautiful Soup, and pyquery. Beautiful Soup has already been introduced, so here I will show how to use XPath selectors to locate node elements with the lxml parsing library. With these tools we no longer need to worry about regular expressions, and parsing efficiency also improves greatly.
2. Using XPath
XPath, short for XML Path Language, is a language for finding information in XML documents, but it also works for searching HTML documents.
2.1 Overview of XPath
XPath's selection capabilities are very powerful: it provides more than 100 built-in functions for string, numeric, and time matching, as well as node and sequence processing. Almost any node we want to locate can be selected with XPath.
Official website of XPath:
https://www.w3.org/TR/xpath/
2.2 XPath common rules
expression | description |
---|---|
nodename | Selects all child nodes of this node |
/ | Selects all direct child nodes of the current node |
// | Selects all descendant nodes of the current node |
. | Selects the current node |
.. | Selects the parent node of the current node |
@ | Selects attributes |
These are the common XPath rules. An example:
//div[@name='loginname']
This is an XPath rule that selects all div nodes whose name attribute is loginname. Later we will use the lxml library together with XPath to locate and parse HTML.
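To get a feel for this rule before moving on, here is a minimal sketch; the two-div HTML snippet is made up purely for illustration:

```python
from lxml import etree

# Made-up two-div snippet, just to demonstrate the rule
text = '<div name="loginname">login box</div><div name="other">sidebar</div>'
html = etree.HTML(text)
# //div[@name="loginname"]: all div nodes whose name attribute is loginname
result = html.xpath('//div[@name="loginname"]')
print([node.text for node in result])  # only the first div matches
```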
2.3 Installation
If you want to do a good job, you must first sharpen your tools. First, install the parsing library lxml. If you have already installed it, you can ignore it.
pip install lxml -i https://pypi.doubanio.com/simple
Here -i specifies the package index URL; this one is the Douban mirror, and you can also use the Tsinghua or Aliyun mirrors instead. We install from a domestic Chinese mirror because the default index is hosted overseas and installation may fail due to connection timeouts, as you have probably experienced. You can also install the package from inside PyCharm.
3. XPath examples
3.1 Introductory example
Now let's feel the process of using XPath to parse HTML through an example, as follows:
from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
First import the etree module of the lxml library, then declare a piece of HTML text, and call the HTML class to initialize, thus successfully creating an XPath parsing object. It should be noted that the last li node in the above HTML text is not closed, but etree can automatically correct the HTML text.
Here we call the tostring() method to output the corrected HTML code. The result is of bytes type, so we then use the decode() method to convert it to str. The output is as follows:
You can see that after processing, the li node label is completed, and the body and html nodes are automatically added.
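This auto-correction is easy to verify with a small sketch; the one-line snippet below is made up for illustration:

```python
from lxml import etree

# Made-up snippet with a deliberately unclosed li tag
text = '<li class="item-0"><a href="link5.html">fifth item</a>'
html = etree.HTML(text)
fixed = etree.tostring(html).decode('utf-8')
# etree closes the dangling li and wraps the fragment in html/body
print('</li>' in fixed, '<body>' in fixed)  # True True
```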
We can also read an HTML file directly and parse it. Example:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
The content of test.html is the HTML code in the above example, ./ indicates the current directory, and the output is as follows:
3.2 All nodes
Generally, we will use XPath rules beginning with // to select all nodes that meet the requirements. Taking the previous HTML as an example, if you want to select all nodes, you can do this:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)
The output results are as follows:
Using * matches all nodes, so every node in the HTML document is retrieved. The return value is a list in which each element is of type Element, followed by the node name, such as html, body, div, ul, li, a, and so on; every node is included in the list. Of course, the match can also specify a node name. For example, to get all li nodes:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])
To select all li nodes, use // plus the node name, calling the xpath() method directly. As you can see, the output is again a list, and each element is an Element object. To take out one of them, use square brackets with an index, such as [0].
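As a small sketch of working with the returned Element objects (the two-item list here is made up), each element exposes its tag name and direct text:

```python
from lxml import etree

# Made-up two-item list, just to inspect the Element objects
html = etree.HTML('<ul><li>first</li><li>second</li></ul>')
result = html.xpath('//li')
# each Element exposes its tag name and the text placed directly inside it
print(result[0].tag, result[0].text)  # li first
```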
3.3 Child nodes
We can find the child node or descendant node of the element through / or //. If we now want to select all direct child nodes a of the li node, we can do this:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)
The output result is as follows:
Adding /a at the end selects all direct child a nodes of all li nodes. To select all descendant a nodes of the li nodes, write //a instead. Here the two outputs happen to be the same, because each li node contains only one a node and it is a direct child.
3.4 Parent node
We know that / and // select all direct child nodes or all descendant nodes, but if we start from a child node, how do we find its parent? This can be done with ..
If you want to select a node whose href attribute value is link4.html, and then get the class attribute value of its parent node, the implementation code is as follows:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
The output is as follows:
['item-1']
[Finished in 0.5s]
Check test.html and find that 'item-1' is the class of the target li node we obtained.
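Outside of pure XPath, lxml's own Element API also offers getparent() for walking up one level; a minimal sketch with a made-up single-li snippet:

```python
from lxml import etree

# Made-up single-li snippet
html = etree.HTML('<li class="item-1"><a href="link4.html">fourth item</a></li>')
a = html.xpath('//a[@href="link4.html"]')[0]
# getparent() walks up one level without going back through XPath
print(a.getparent().get('class'))  # item-1
```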
We can also obtain the parent node through parent::, the code is as follows:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
The output is the same as above. This way of writing is called a node axis; see Part 4 for details.
3.5 Attribute matching
When selecting, you can use the @ symbol to filter attributes. For example, if we want to select the li node whose class is item-0, it can be implemented like this:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)
The output is as follows:
[<Element li at 0x2176e366a48>, <Element li at 0x2176e366a88>]
There are two matching results. As for whether they are the correct two, we will verify later
3.6 Text Acquisition
The text() method in XPath can get the text in the node, and then try to get the text of the a node in the previous li node, the code is as follows:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)
The output is as follows:
['first item', 'fifth item']
As you can see, there are two return values. We first selected all li nodes whose class attribute is item-0, then used / to select their direct child a nodes, and finally obtained the text. The result is exactly the two items we wanted.
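Note that /text() only returns text placed directly inside the selected node; to descend into children, use //text() instead. A minimal sketch with a made-up single-li snippet:

```python
from lxml import etree

# Made-up single-li snippet
html = etree.HTML('<li class="item-0"><a href="link1.html">first item</a></li>')
# /text() returns only text directly inside li (none here)
print(html.xpath('//li[@class="item-0"]/text()'))
# //text() descends into children, so the a node's text is included
print(html.xpath('//li[@class="item-0"]//text()'))
```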
3.7 Attribute Acquisition
We use text() to get node text, so how do we get node attributes? With the @ symbol. For example, to get the href attribute of all a nodes under all li nodes:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
The output is as follows:
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
Here @href retrieves the href attribute of the node. Note that this differs from attribute matching: attribute matching uses square brackets with an attribute name and value to constrain an attribute, such as [@href="link2.html"], whereas the @href here extracts an attribute. The two must be distinguished.
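The two usages can be contrasted side by side; the two-link snippet below is made up for illustration:

```python
from lxml import etree

# Made-up two-link snippet
html = etree.HTML('<ul>'
                  '<li><a href="link1.html">first</a></li>'
                  '<li><a href="link2.html">second</a></li>'
                  '</ul>')
# predicate: [@href="..."] filters a nodes by attribute value
print(html.xpath('//a[@href="link2.html"]/text()'))
# extraction: /@href pulls the attribute itself from every matched a node
print(html.xpath('//a/@href'))
```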
3.8 Attribute multi-value matching
Sometimes, a property of a node may have multiple values, for example:
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
The output is as follows:
[]
An empty list is returned, because the class attribute of the li node has two values, li and li-first, and the exact attribute match used earlier can no longer match it. In this case we need the contains() function, and the code can be rewritten as follows:
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
The output is as follows:
['first item']
This method is often used when an attribute of a node has multiple values, such as the class attribute of a node usually has multiple values
3.9 Multi-attribute matching
In addition, we may also encounter a situation where a node is determined based on multiple attributes, and then multiple attributes need to be matched at the same time. At this point you can use the operator and to connect, examples are as follows:
from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
The output is as follows:
['first item']
The li node here has an additional name attribute. To pinpoint this node, we must select by both the class and the name attribute at the same time: one condition is that the class attribute contains the string li, and the other is that the name attribute equals item. The two conditions are connected with the and operator and placed inside the square brackets as a filter.
The and here is actually an operator in XPath. In addition, there are many operators, such as or, mod, etc. For more operators, please refer to the link:
https://www.w3school.com.cn/xpath/xpath_operators.asp
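As a quick sketch of the or operator, reusing a trimmed version of the earlier HTML:

```python
from lxml import etree

# Trimmed version of the earlier HTML
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)
# or matches a li node when either condition holds
result = html.xpath('//li[@class="item-0" or @class="item-inactive"]/a/text()')
print(result)  # ['first item', 'third item']
```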
3.10 Sequential selection
Sometimes an expression matches multiple nodes at once, but we only want one of them, such as the second node or the last node. In that case we can pass an index in square brackets to get nodes at specific positions. Examples:
from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)
In the first selection we pick the first li node by passing the number 1 in the square brackets. Note that unlike ordinary program code, XPath numbering starts at 1, not 0.
In the second selection, we selected the last li node, just pass last() in the square brackets, and the last li node is returned
In the third selection, we selected the li nodes whose position is less than 3, that is, the nodes whose position numbers are 1 and 2, and the result is the first two li nodes
In the fourth selection we pass last()-2 in the square brackets. Because last() is the last node, last()-2 is the third-to-last li node.
The output is as follows:
['first item']
['fifth item']
['first item', 'second item']
['third item']
Here we used functions such as last() and position(). XPath provides more than 100 functions, covering access, numeric, string, logic, node, sequence, and other processing. For more functions, please refer to the link:
http://www.w3school.com.cn/xpath/xpath_functions.asp
4. XPath node axes
4.1 Common XPath node axes
XPath provides many node axis selection methods, including obtaining child elements, sibling elements, parent elements, ancestor elements, etc. Examples are as follows:
from lxml import etree
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)
The output is as follows:
In the first selection we call the ancestor axis to get all ancestor nodes. The axis name must be followed by two colons and then a node selector; here we use * to match all nodes, so the result is every ancestor of the first li node: html, body, div, and ul.
In the second selection we add a restriction by putting div after the colons, so only the div ancestor is returned.
In the third selection, we called the attribute axis to get all the attribute values, followed by the selector or *, which means to get all the attributes of the node, and the return value is all the attribute values of the li node
In the fourth selection we call the child axis to get all direct child nodes, adding a restriction to select the a node whose href attribute is link1.html.
In the fifth selection, we call the descendant axis to get all descendant nodes. Here we add a condition to get the span node, so the returned result only contains the span node and not the a node
In the sixth selection we call the following axis to get all nodes after the current node. Although we use * here, we also add an index, so only the second following node is obtained.
In the seventh selection, we call the following-sibling axis to get all sibling nodes after the current node. Here we use * matching, so all subsequent sibling nodes are fetched
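XPath also has a preceding-sibling axis, the mirror of following-sibling; a minimal sketch reusing a trimmed version of the earlier HTML:

```python
from lxml import etree

# Trimmed version of the earlier HTML
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)
# preceding-sibling selects sibling nodes that come before the current node;
# lxml returns the matches in document order
result = html.xpath('//li[3]/preceding-sibling::li/a/text()')
print(result)  # ['first item', 'second item']
```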
4.2 Summary
We have now covered the XPath selectors you are likely to need. XPath is very powerful and has many built-in functions; once you are proficient with it, the efficiency of extracting information from HTML improves greatly.
For more usage of node axes, please refer to the link:
https://www.w3school.com.cn/xpath/xpath_axes.asp
For more usage of XPath, please refer to the link:
https://www.w3school.com.cn/xpath/index.asp
For more usage of the lxml library, you can check the official website:
https://lxml.de/
6. Author Info
Author: Xiaohong's Fishing Daily. Goal: make programming more interesting!
Original WeChat public account: "Xiaohong Xingkong Technology", focusing on algorithms, crawlers, websites, game development, data analysis, natural language processing, AI, and more. Looking forward to your follow; let's grow and write code together!