Web crawler | Getting started with the parsing library lxml and XPath selectors

Practical source code for web crawler development: https://github.com/MakerChen66/Python3Spider

Original content is not easy to produce; plagiarism and reprinting of this article are prohibited. It is a summary of years of practical crawler development experience; infringement will be investigated!

1. Introduction to XPath

When we use regular expressions to extract page information, the process feels cumbersome and troublesome, and a single mistake can make the whole match fail, so regular expressions are somewhat inconvenient for extracting page information. When parsing a page, we can instead use XPath or CSS selectors to locate a node, and then call the corresponding method to obtain its text or attributes. That way, we can extract any information we want.

There are many such parsing libraries in Python; among the more powerful ones are lxml, Beautiful Soup, and pyquery. Beautiful Soup has been introduced before, so next I will introduce how to use XPath selectors in the lxml parsing library to locate node elements. With them, we no longer need to worry about regular expressions, and parsing efficiency is also greatly improved.


2. Use of XPath

XPath, whose full name is XML Path Language, is a language for finding information in XML documents, but it is also suitable for searching HTML documents.

2.1 Overview of XPath

The selection capability of XPath is very powerful. It provides more than 100 built-in functions for string, numeric, and time matching as well as node and sequence processing. Almost any node we want to locate can be selected with XPath.

Official website of XPath:
https://www.w3.org/TR/xpath/

2.2 XPath common rules

expression   description
nodename     Selects all child nodes of this node
/            Selects direct child nodes from the current node
//           Selects descendant nodes from the current node
.            Selects the current node
..           Selects the parent node of the current node
@            Selects attributes

The common XPath rules are listed above. An example is as follows:
//div[@name='loginname']
This XPath rule means: select all div nodes whose name attribute is loginname.

Below, we will use the lxml library together with XPath to locate nodes and parse HTML.
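As a quick sketch of how such a rule behaves (the HTML fragment below is made up for illustration), lxml can evaluate it directly:

```python
from lxml import etree

# Hypothetical fragment containing two div nodes with different name attributes
html = etree.HTML('<form><div name="loginname">user</div><div name="other">x</div></form>')
# Only the div whose name attribute is loginname is selected
nodes = html.xpath('//div[@name="loginname"]')
print([n.text for n in nodes])  # ['user']
```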

2.3 Installation

If you want to do a good job, you must first sharpen your tools. So first install the parsing library lxml; if you have already installed it, you can skip this step.

pip install lxml -i https://pypi.doubanio.com/simple

Here, -i specifies the package index URL. This one is the Douban mirror; you can also use the Tsinghua mirror, the Alibaba mirror, and so on. The reason for using a domestic mirror is that the default index is hosted overseas, and installation may fail due to connection timeouts, as you have probably experienced. Alternatively, you can install lxml from inside PyCharm.


3. XPath examples

3.1 An introductory example

Now let's feel the process of using XPath to parse HTML through an example, as follows:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

First import the etree module of the lxml library, then declare a piece of HTML text and pass it to etree.HTML() to initialize it, which creates an XPath parsing object. Note that the last li node in the HTML text above is not closed, but etree can automatically correct the HTML text.

Here we call the tostring() method to output the corrected HTML code. The result is of type bytes, so we then use the decode() method to convert it to str. The output is as follows:
(image: the corrected HTML output)
You can see that after processing, the li tag has been closed, and html and body nodes were added automatically.

In addition, the text file can also be read directly for analysis. Examples are as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

The content of test.html is the HTML code in the above example, ./ indicates the current directory, and the output is as follows:
(image: output of parsing test.html)

3.2 All nodes

Generally, we will use XPath rules beginning with // to select all nodes that meet the requirements. Taking the previous HTML as an example, if you want to select all nodes, you can do this:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

The output results are as follows:
(image: list of all matched Element objects)
Here * matches all nodes, that is, every node in the HTML text is retrieved. The return value is a list in which each element is of type Element, followed by the node's name, such as html, body, div, ul, li, a, and so on. Of course, the match can also specify a node name. For example, to get all li nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])

To select all li nodes, use // plus the node name, and call the xpath() method directly.
(image: list of li Element objects)
You can see that the output is again a list, and each element is an Element object. To take out one of the objects, index it with square brackets, such as [0].

3.3 Child nodes

We can find the child node or descendant node of the element through / or //. If we now want to select all direct child nodes a of the li node, we can do this:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

The output result is as follows:
(image: list of a Element objects)
Here, appending /a selects all direct child a nodes of all li nodes. To select all descendant a nodes of the li nodes instead, write //a. In this case the output happens to be the same, because each li node contains exactly one a node, and it is a direct child.
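The difference between / and // only shows up when an a node is nested deeper than one level. A small sketch (with a made-up nested span) makes it visible:

```python
from lxml import etree

# The first li wraps its a node in a span, so that a is a descendant but not a direct child
text = '<ul><li><span><a href="inner.html">inner</a></span></li><li><a href="direct.html">direct</a></li></ul>'
html = etree.HTML(text)
print(html.xpath('//li/a/@href'))   # direct children only: ['direct.html']
print(html.xpath('//li//a/@href'))  # all descendants: ['inner.html', 'direct.html']
```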

3.4 Parent node

We know that direct child nodes or descendant nodes can be selected with / or //, so if we know a child node, how do we find its parent node? It can be done with ..

If you want to select a node whose href attribute value is link4.html, and then get the class attribute value of its parent node, the implementation code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

The output is as follows:

['item-1']

Check test.html and find that 'item-1' is the class of the target li node we obtained.

We can also obtain the parent node through parent::, the code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)

The output is the same as above. This style of writing is called a node axis; see Part 4 for details.

3.5 Attribute matching

When selecting, you can use the @ symbol to filter attributes. For example, if we want to select the li node whose class is item-0, it can be implemented like this:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

The output is as follows:

[<Element li at 0x2176e366a48>, <Element li at 0x2176e366a88>]

There are two matching results. Whether they are the correct two will be verified shortly.
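As a quick sanity check (a small sketch that inlines the same HTML instead of reading test.html), you can print each matched node's child text right away:

```python
from lxml import etree

# Same HTML as test.html in the examples above
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
# Print the text of each matched li node's child a to see which two matched
texts = [li.xpath('./a/text()') for li in html.xpath('//li[@class="item-0"]')]
print(texts)  # [['first item'], ['fifth item']]
```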

3.6 Text Acquisition

The text() method in XPath can get the text in the node, and then try to get the text of the a node in the previous li node, the code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

The output is as follows:

['first item', 'fifth item']

As you can see, there are two results here. We first selected all li nodes whose class attribute is item-0, then used / to select all of their direct child a nodes, and finally obtained the text. The two results are exactly what we wanted.
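Note that text() behaves differently depending on where it is applied. If the li node itself carries text in addition to its a child, // collects all descendant text, while /a/text() returns only the a node's text (the fragment below is made up for illustration):

```python
from lxml import etree

# Hypothetical li that has its own text before the a node
text = '<li class="item-0">prefix <a href="link5.html">fifth item</a></li>'
html = etree.HTML(text)
print(html.xpath('//li[@class="item-0"]//text()'))   # ['prefix ', 'fifth item']
print(html.xpath('//li[@class="item-0"]/a/text()'))  # ['fifth item']
```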

3.7 Attribute Acquisition

We can use text() to get a node's text, so how do we get a node's attributes? With the @ sign. For example, to get the href attribute of all a nodes under all li nodes, the code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

The output is as follows:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

Here we get the node's href attribute via @href. Note that this is different from attribute matching: attribute matching uses square brackets with an attribute name and value to constrain an attribute, such as [@href="link2.html"], whereas @href here retrieves the attribute itself. The two need to be distinguished.

3.8 Attribute multi-value matching

Sometimes, a property of a node may have multiple values, for example:

from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)

The output is as follows:

[]

An empty list is returned, because the class attribute of the li node has two values, li and li-first, and the exact attribute matching used before cannot match it. In this case the contains() function is needed, and the code can be rewritten as follows:

from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

The output is as follows:

['first item']

This method is frequently used when an attribute of a node has multiple values; for example, the class attribute of a node often has several values.
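One caveat worth knowing: contains() does plain substring matching, so contains(@class, "li") would also match a class such as "liquid". When that matters, a stricter whole-token match can be built by padding both the attribute and the token with spaces, a common XPath idiom (sketched below with a made-up fragment):

```python
from lxml import etree

text = '<ul><li class="li li-first"><a href="a.html">first</a></li><li class="liquid"><a href="b.html">second</a></li></ul>'
html = etree.HTML(text)
# Substring matching also catches class="liquid"
print(html.xpath('//li[contains(@class, "li")]/a/text()'))
# Whole-token matching: pad the attribute value and the token with spaces
print(html.xpath('//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'))
```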

3.9 Multi-attribute matching

In addition, we may also encounter a situation where a node is determined based on multiple attributes, and then multiple attributes need to be matched at the same time. At this point you can use the operator and to connect, examples are as follows:

from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

The output is as follows:

['first item']

The li node here has an additional name attribute. To pin down this node, we must select on both the class and the name attributes: one condition is that the class attribute contains the string li, and the other is that the name attribute equals the string item. Both must hold at the same time, so they are connected with the and operator and then placed in square brackets for conditional filtering.

The and here is actually an operator in XPath. In addition, there are many operators, such as or, mod, etc. For more operators, please refer to the link:
https://www.w3school.com.cn/xpath/xpath_operators.asp
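For instance, or selects nodes that satisfy either condition (a minimal sketch with a made-up fragment):

```python
from lxml import etree

text = '<ul><li class="item-0">a</li><li class="item-1">b</li><li class="item-2">c</li></ul>'
html = etree.HTML(text)
# Nodes whose class is item-0 or item-1, but not item-2
print(html.xpath('//li[@class="item-0" or @class="item-1"]/text()'))  # ['a', 'b']
```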

3.10 Selecting by order

Sometimes an expression matches several nodes at once, but we only want one of them, such as the second node or the last node. What should we do then? We can pass an index in square brackets to get nodes at a specific position. Examples are as follows:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

In the first selection, we picked the first li node by passing the number 1 in the square brackets. Note that, unlike indexing in Python code, XPath positions start at 1, not 0.

In the second selection, we selected the last li node, just pass last() in the square brackets, and the last li node is returned

In the third selection, we selected the li nodes whose position is less than 3, that is, the nodes whose position numbers are 1 and 2, and the result is the first two li nodes

In the fourth selection, we picked the third-to-last li node by passing last()-2 in the square brackets. Since last() is the last node, last()-2 is the third from the end.

The output is as follows:
(image: results of the four selections)
Here we used functions such as last() and position(). XPath provides more than 100 functions, covering string, numeric, logical, node, and sequence processing. For more functions, refer to the link:
http://www.w3school.com.cn/xpath/xpath_functions.asp

4. XPath node axes

4.1 Common XPath node axes

XPath provides many node axis selection methods, including obtaining child elements, sibling elements, parent elements, ancestor elements, etc. Examples are as follows:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)

The output is as follows:
(image: results of the seven axis selections)
In the first selection, we call the ancestor axis to get all ancestor nodes. The axis name must be followed by two colons and then a node selector; here we use * directly, which matches all nodes, so the result is all ancestor nodes of the first li node, including html, body, div and ul.

In the second selection, we added a restriction: this time a div follows the double colon, so the result contains only the div ancestor node.

In the third selection, we call the attribute axis to get all attribute values. It is followed by the selector *, which means getting all attributes of the node, and the return value is all attribute values of the li node.

In the fourth selection, we call the child axis to get all direct child nodes. Here we added a restriction: select the a node whose href attribute is link1.html.
In the fifth selection, we call the descendant axis to get all descendant nodes. Here we add a condition to get the span node, so the returned result only contains the span node and not the a node

In the sixth selection, we call the following axis to get all nodes after the current node. Although we use the * match here, we added an index selection, so only the second subsequent node is obtained.

In the seventh selection, we call the following-sibling axis to get all sibling nodes after the current node. Here we use * matching, so all subsequent sibling nodes are fetched
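There are also axes that look backwards, such as preceding-sibling. Note that on a reverse axis the index [1] refers to the nearest node, counted against document order (a small sketch with a made-up fragment):

```python
from lxml import etree

text = '<ul><li>one</li><li>two</li><li>three</li></ul>'
html = etree.HTML(text)
# All siblings before the third li; lxml returns the node-set in document order
print([e.text for e in html.xpath('//li[3]/preceding-sibling::*')])  # ['one', 'two']
# [1] on a reverse axis means the nearest preceding sibling
print(html.xpath('//li[3]/preceding-sibling::li[1]/text()'))         # ['two']
```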

4.2 Summary

We have now covered the XPath selectors you are likely to need. XPath selectors are very powerful and come with many built-in functions; once you are proficient, they can greatly improve the efficiency of extracting information from HTML.

For more usage of node axes, please refer to the link:
https://www.w3school.com.cn/xpath/xpath_axes.asp

For more usage of XPath, please refer to the link:
https://www.w3school.com.cn/xpath/index.asp

For more usage of the lxml library, you can check the official website:
https://lxml.de/


5. Read the original text

Link to the original text of my original public account: read the original text

Originality is not easy, if you find it useful, I hope you can give it a thumbs up, thank you guys!

6. Author Info

Author: Xiaohong's Fishing Daily, Goal: Make programming more interesting!

Original WeChat official account: "Xiaohong Xingkong Technology", focusing on algorithms, crawlers, websites, game development, data analysis, natural language processing, AI, etc. Looking forward to your attention; let us grow and write code together!

Reprint instructions: This article prohibits plagiarism and reprinting, and infringement must be investigated!


Origin blog.csdn.net/qq_44000141/article/details/121526788