From Getting Started to Giving Up: Python Crawler Series, Using the XPath Parsing Library

Before starting, make sure that a Python 3 environment is set up and the third-party lxml library is installed.
These are the blogger's study notes. If you spot any shortcomings, please point them out.
Follow me! Updates are ongoing~
In addition, the series [From Getting Started to Giving Up: Python Data Analysis] is also being updated~

1. Introduction to xpath

1.1 What is XPath?

XPath stands for XML Path Language, a language for locating parts of an XML document. It works just as well for searching HTML documents.
XPath is based on the XML tree structure. It provides concise path expressions for finding specific nodes in the document tree.
For crawlers, we can use XPath to filter and extract data.

1.2 Small test

Here is a small HTML document. All of the explanations below are based on it.

<html>
	<head>
		<meta charset="utf-8">
		<title>案例</title>
	</head>
	<body>
		<div class="nav">
			<ul>
				<li><a href="">这是1</a></li>
				<li><a href=""></a>这是2</li>
				<li><a href="" id="text">这里有个text链接</a></li>
				<li><img src="" class="myImg" >嗯,图片</li>
				<li><span id="targs">超哥最帅</span></li>
				<li>无</li>
			</ul>
		</div>
		<div id="content sentence">
			<p class="p1 "  id= "content1">是以泰山不让土壤,故能成其大;河海不择细流,故能就其深;王者不却众庶,故能明其德。</p>
			<p class="p2">治大者不可以烦,烦则<strong>乱</strong>;治小者不可以怠,怠则废</p>
		</div>
	</body>
</html>

Let's get a feel for how to use xpath through this small case:

from lxml import etree

# html holds the content of the HTML document shown above
root = etree.HTML(html)    # build the parse tree that xpath queries run against
result = root.xpath('//title/text()')  # get the text content of the title tag
print(result)

The output is as follows:

['案例']

Here we first import etree from lxml. Then we use etree.HTML(html) to convert the HTML string into an object that XPath can query. Finally, we select the text content of the title tag with an XPath path expression. Don't worry if the path expression looks unfamiliar; it is actually very simple. The sections below explain the related knowledge points.
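
For readers who want to run the small test directly, here is a self-contained sketch. It assumes a shortened copy of the document above stored in a variable named html; the shortened markup is an illustrative choice, not part of the original notes. It also shows that xpath() always returns a list, which is empty when nothing matches.

```python
from lxml import etree

# Assumption: a shortened copy of the sample document, kept in a string.
html = """
<html>
  <head><title>案例</title></head>
  <body>
    <div class="nav"><ul><li><a href="">这是1</a></li></ul></div>
  </body>
</html>
"""

root = etree.HTML(html)                # parse the string into an element tree
titles = root.xpath('//title/text()')  # list of matching text nodes
missing = root.xpath('//h1/text()')    # there is no <h1>, so nothing matches

print(titles)    # ['案例']
print(missing)   # []
print(root.tag)  # html
```

Because the result is a list, a crawler usually checks that it is non-empty before taking the first item.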

2. Absolute path and relative path

Readers who know the HTML DOM tree will find the concept of a path familiar. The HTML DOM treats an HTML document as a tree structure called a node tree. The nodes in the tree are related hierarchically: parent node (one level up), child node (one level down), descendant node (any node below the current node), and sibling node (a node at the same level). The top node, <html>, is called the root node. The absolute path described below treats this root node as its root directory.
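
These relationships can also be inspected from Python. The following is a minimal sketch that uses lxml's element methods (getparent, iterdescendants, getprevious, getnext) on a tiny fragment modeled on the nav list; the fragment and the variable names are assumptions made for illustration.

```python
from lxml import etree

# Assumption: a tiny fragment modeled on the sample document's nav list.
html = ('<html><body><div class="nav"><ul>'
        '<li>a</li><li>b</li><li>c</li>'
        '</ul></div></body></html>')

root = etree.HTML(html)
ul = root.xpath('//ul')[0]
div = ul.getparent()                      # parent node of <ul>

parent_tag = div.tag
child_tags = [child.tag for child in ul]  # direct child nodes of <ul>
descendant_tags = [node.tag for node in div.iterdescendants()]  # everything below <div>
second_li = ul[1]
sibling_texts = [second_li.getprevious().text, second_li.getnext().text]

print(parent_tag)       # div
print(child_tags)       # ['li', 'li', 'li']
print(descendant_tags)  # ['ul', 'li', 'li', 'li']
print(sibling_texts)    # ['a', 'c']
```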

2.1 XPath common rules

Before explaining absolute and relative paths, let's go over some common XPath syntax rules that are used throughout the rest of this article:

  • *: Wildcard that matches any element node
  • //: Select descendant nodes of the current node, at any depth
  • /: Select direct child nodes of the current node
  • .: Select the current node
  • ..: Select the parent node of the current node
  • @: Select an attribute
  • []: Predicate that filters nodes, for example by attribute value or position
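
Each rule above can be tried out in a few lines. This is a sketch on a reduced fragment of the sample document; the href and src values are made up so that the attribute results are visible.

```python
from lxml import etree

# Assumption: reduced fragment; href/src values are invented for illustration.
html = '''
<div class="nav">
  <ul>
    <li><a href="/one" id="text">one</a></li>
    <li><img src="/pic.png" class="myImg"></li>
  </ul>
</div>
'''
root = etree.HTML(html)
ul = root.xpath('//ul')[0]

star = [el.tag for el in root.xpath('//ul/*')]   # * : any child element of ul
deep = [el.tag for el in root.xpath('//ul//a')]  # //: a at any depth below ul
direct = root.xpath('//ul/a')                    # / : a must be a direct child of ul
current = ul.xpath('.')[0].tag                   # . : the current node
parent = ul.xpath('..')[0].tag                   # ..: the parent node
href = root.xpath('//a/@href')                   # @ : attribute value
filtered = root.xpath('//img[@class="myImg"]')   # []: filter by attribute

print(star)           # ['li', 'li']
print(deep)           # ['a']
print(direct)         # []
print(current)        # ul
print(parent)         # div
print(href)           # ['/one']
print(len(filtered))  # 1
```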

2.2 Absolute path

An absolute path treats the html node as the root of the whole HTML tree: the search for the target node starts at the root and follows the tree structure. For example, to find all p tags in the sample document:

//body/div/p

or

//p  
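
As a quick check, the sketch below runs both expressions, plus the fully spelled-out path /html/body/div/p, against a reduced copy of the sample document. The reduced markup and the placeholder paragraph text are assumptions.

```python
from lxml import etree

# Assumption: reduced copy of the sample document with placeholder text.
html = '''
<html><body>
  <div class="nav"><ul><li>item</li></ul></div>
  <div id="content sentence">
    <p class="p1">first</p>
    <p class="p2">second</p>
  </div>
</body></html>
'''
root = etree.HTML(html)

full = [p.get('class') for p in root.xpath('/html/body/div/p')]
mixed = [p.get('class') for p in root.xpath('//body/div/p')]
short = [p.get('class') for p in root.xpath('//p')]

print(full)   # ['p1', 'p2']
print(mixed)  # ['p1', 'p2']
print(short)  # ['p1', 'p2']
```

The results agree here only because every p sits directly inside a div under body. //p would also pick up p tags nested elsewhere, while the longer paths would not.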

2.3 Relative path

A relative path starts from the current node and follows the tree structure to the target node. For example, if the current node is the ul tag and we need the img tag among its descendants, we can write:

./li/img

or

.//img
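
In lxml, the "current node" is simply the element that xpath() is called on. The sketch below assumes a reduced fragment with one extra img outside the list (added for illustration) to show why the leading dot matters.

```python
from lxml import etree

# Assumption: reduced fragment; the second div with class="other" is invented.
html = '''
<div class="nav">
  <ul>
    <li><a href="">link</a></li>
    <li><img src="" class="myImg">picture</li>
  </ul>
</div>
<div><img src="" class="other"></div>
'''
root = etree.HTML(html)
ul = root.xpath('//div[@class="nav"]/ul')[0]   # this element is the current node

by_steps = [img.get('class') for img in ul.xpath('./li/img')]
below_ul = [img.get('class') for img in ul.xpath('.//img')]
whole_doc = [img.get('class') for img in ul.xpath('//img')]  # no dot: whole document

print(by_steps)   # ['myImg']
print(below_ul)   # ['myImg']
print(whole_doc)  # ['myImg', 'other']
```

Without the leading dot, //img is evaluated from the document root even though xpath() was called on ul.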

3. Data extraction

Data extraction is an important part of a crawler. It mainly covers two things: extracting text content and extracting tag attributes.

3.1 Location search

Location search means finding nodes by tag name or attribute. It mainly takes the following forms:

  • Single attribute matching: Use [@attribute="value"] to filter by an attribute. For example, to extract the div tag whose class is nav:

    //div[@class="nav"]
    

    Extract the a tag whose id is text:

    //a[@id="text"]
    
  • Attribute multi-value matching: Sometimes an attribute of a tag holds several values, and the exact match above no longer works when you only know one of them. In that case you can use the contains() function. For example, to extract the div tag with id="content sentence" from the document:

    //div[contains(@id, "content")]
    

    The first argument is the attribute and the second is the value to look for. A tag is selected when its attribute value contains that string.

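  • Multi-attribute matching: Sometimes a single attribute cannot pin down the desired tag. In that case, use the and operator to combine several attribute conditions. Note that the class value in the sample document is "p1 " with a trailing space, so an exact @class="p1" comparison would not match it; contains() is used for the class here:

    //div/p[contains(@class, "p1") and @id="content1"]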
    
  • Position-based matching: When there are several target tags and no distinguishing attributes, the methods above do not help. In that case, you can select by position.
    Select the first li (unlike Python, XPath positions start at 1):

    //div[@class="nav"]/ul/li[1]
    

    Select the last li:

    //div[@class="nav"]/ul/li[last()]
    

    Choose the first three li:

    //div[@class="nav"]/ul/li[position()<4]
    

    Choose the penultimate li:

    //div[@class="nav"]/ul/li[last()-1]
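
The location searches above can be checked together with a short sketch. It assumes a simplified copy of the sample document in which the list items and paragraphs hold placeholder English text; the helper function texts() is hypothetical and exists only to keep the example short.

```python
from lxml import etree

# Assumption: simplified copy of the sample document with placeholder text.
html = '''
<div class="nav">
  <ul>
    <li>one</li><li>two</li><li>three</li>
    <li>four</li><li>five</li><li>six</li>
  </ul>
</div>
<div id="content sentence">
  <p class="p1 " id="content1">first</p>
  <p class="p2">second</p>
</div>
'''
root = etree.HTML(html)


def texts(path):
    """Hypothetical helper: text of every element matched by path."""
    return [el.text for el in root.xpath(path)]


nav = root.xpath('//div[@class="nav"]')                # single attribute
multi = root.xpath('//div[contains(@id, "content")]')  # multi-valued attribute
# contains() is used for class so the trailing space in "p1 " does not matter
both = texts('//div/p[contains(@class, "p1") and @id="content1"]')
first = texts('//div[@class="nav"]/ul/li[1]')
last = texts('//div[@class="nav"]/ul/li[last()]')
first_three = texts('//div[@class="nav"]/ul/li[position()<4]')
penultimate = texts('//div[@class="nav"]/ul/li[last()-1]')

print(len(nav), len(multi))  # 1 1
print(both)                  # ['first']
print(first, last)           # ['one'] ['six']
print(first_three)           # ['one', 'two', 'three']
print(penultimate)           # ['five']
```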
    

3.2 Text extraction

Text extraction pulls out the content of a tag using text(). For example, to extract the content of the p tag whose class is p2:

//p[@class="p2"]/text()
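
One detail worth seeing in practice: /text() returns only the text nodes that are direct children of the tag, so text inside a nested tag such as strong is left out. The sketch below uses a short English stand-in for the p2 paragraph (an assumption) and compares /text() with //text() and the XPath string() function.

```python
from lxml import etree

# Assumption: a short stand-in for the p2 paragraph of the sample document.
html = '<div><p class="p2">before <strong>bold</strong> after</p></div>'
root = etree.HTML(html)

direct = root.xpath('//p[@class="p2"]/text()')    # direct text nodes only
nested = root.xpath('//p[@class="p2"]//text()')   # text nodes at any depth
joined = root.xpath('string(//p[@class="p2"])')   # one concatenated string

print(direct)  # ['before ', ' after']
print(nested)  # ['before ', 'bold', ' after']
print(joined)  # before bold after
```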

3.3 Attribute extraction

Sometimes we also need to crawl the attributes of a tag, most commonly image or link addresses. Use @ to get a node's attribute. Note that "[]" is not needed here; the attribute is written as its own step after a /:

//img[@class="myImg"]/@src
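
Here is a minimal sketch of attribute extraction, with the attribute written as a separate step after a slash (/@src). The src and href values are invented, since they are empty in the sample document. The same value can also be read from a matched element with get().

```python
from lxml import etree

# Assumption: src/href values are invented; they are empty in the sample document.
html = '''
<ul>
  <li><a href="/page/1" id="text">link</a></li>
  <li><img src="/static/pic.png" class="myImg"></li>
</ul>
'''
root = etree.HTML(html)

src = root.xpath('//img[@class="myImg"]/@src')  # attribute step after the slash
href = root.xpath('//a[@id="text"]/@href')
img = root.xpath('//img[@class="myImg"]')[0]
src_via_get = img.get('src')                    # same value via the element API

print(src)          # ['/static/pic.png']
print(href)         # ['/page/1']
print(src_via_get)  # /static/pic.png
```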

4. Conclusion

That covers the commonly used knowledge points of xpath.

Origin blog.csdn.net/qq_45807032/article/details/107544103