Introduction to XPath for Python crawlers and basic use of the lxml library: let's learn XPath together

Table of contents

What is XPath?

xpath syntax

knowledge points

node

Pick a node:

Select all href attributes under a node

../ select parent node

bookstore/book select the child element book

bookstore//book Regardless of position, select book elements under bookstore

Predicates

wildcard

operator

Select multiple paths:

match demo

select node

Common functions

lxml parsing library

Install

manual

read html file

If there is any problem in the article, please point it out, thank you!


What is XPath?

xpath (XML Path Language) is a language for finding information in XML and HTML documents, which can be used to traverse and match elements and attributes in XML and HTML documents.

xpath syntax

knowledge points

  • Master how to write element path expressions
  • Master how to obtain attribute values
  • Know how to get text content

node

In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. XML documents are treated as node trees. The root of the tree is called the document node or root node.

Pick a node:

XPath uses path expressions to select nodes or sets of nodes in an XML document. These path expressions are very similar to the expressions we see in regular computer file systems.

| expression | description | example | result |
| --- | --- | --- | --- |
| nodename | Selects all child nodes of the named node | bookstore | Selects all child nodes of the bookstore element |
| / | At the start of a path, selects from the root node; elsewhere, selects a direct child of the node before it | /bookstore | Selects the bookstore element under the root node |
| // | Selects matching nodes anywhere below the current node, direct or indirect, so levels can be "skipped" (takes descendant nodes) | //book | Selects all book elements, no matter where they are in the document |
| @ | Selects an attribute of a node | //a[@class] | Selects all a nodes that have a class attribute |
| . | The current node | ./a | Selects the a elements under the current node |
| .. | The parent of the current node | //div[@class="book-list list"]//div[@class="title"]/../a | Selects the a elements under the parent of the div[@class="title"] node |

In the table below, we have listed some path expressions and their results:

| path expression | result |
| --- | --- |
| bookstore/book | Selects all book elements that are children of bookstore. |
| bookstore//book | Selects all book elements that are descendants of bookstore, no matter where they sit beneath it. |
| //@lang | Selects all attributes named lang. |
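The expressions in the table above can be tried directly with lxml. The sketch below uses a tiny made-up bookstore document:

```python
from lxml import etree

# A small, made-up document to exercise the path expressions from the table.
root = etree.fromstring('''
<bookstore>
    <book><title lang="eng">A</title></book>
    <shelf><book><title lang="zh">B</title></book></shelf>
</bookstore>
''')

direct = root.xpath('/bookstore/book')    # only direct children of bookstore
anywhere = root.xpath('//book')           # book elements at any depth
langs = root.xpath('//@lang')             # every attribute named lang
print(len(direct), len(anywhere), langs)
```

Note how `//book` also finds the book nested inside `shelf`, while `/bookstore/book` does not.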

Select all href attributes under a node

//div[@class="book-list list"]//div[@class="title"]/a/@href

../ select parent node

bookstore/book selects the child element book

bookstore//book selects book elements under bookstore, regardless of position

Predicates

Predicates are used to find a specific node or nodes that contain a specified value.

Predicates are enclosed in square brackets.

In the table below we list some path expressions with predicates, and the result of the expressions:

| path expression | result |
| --- | --- |
| /bookstore/book[1] | Selects the first book element that is a child of the bookstore element. |
| /bookstore/book[last()] | Selects the last book element that is a child of the bookstore element. |
| /bookstore/book[last()-1] | Selects the second-to-last book element that is a child of the bookstore element. |
| /bookstore/book[position()<3] | Selects the first two book children of the bookstore element. |
| //book/title[text()='The ugliest in the world'] | Of the title elements under book, selects only those whose text is "The ugliest in the world". |
| //title[@lang='eng'] | Selects all title elements with a lang attribute value of eng. |
| /bookstore/book[price>35.00] | Selects the book children of bookstore whose price child has a value greater than 35.00. |
| /bookstore/book[price>35.00]//title | Selects the title elements under those book children of bookstore whose price child has a value greater than 35.00. |
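A quick sketch of the predicate expressions above, run against a made-up three-book document:

```python
from lxml import etree

# Hypothetical bookstore data for trying out predicates.
root = etree.fromstring('''
<bookstore>
    <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
    <book><title lang="eng">Learning XML</title><price>39.95</price></book>
    <book><title lang="zh">Python Crawling</title><price>49.00</price></book>
</bookstore>
''')

first = root.xpath('/bookstore/book[1]/title/text()')            # first book
last = root.xpath('/bookstore/book[last()]/title/text()')        # last book
cheap = root.xpath('/bookstore/book[price>35.00]/title/text()')  # numeric comparison on child text
eng = root.xpath('//title[@lang="eng"]/text()')                  # attribute value condition
print(first, last, cheap, eng)
```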

wildcard

* represents a wildcard.

| wildcard | description | example | result |
| --- | --- | --- | --- |
| * | Matches any element node | /bookstore/* | Selects all child elements of bookstore. |
| @* | Matches any attribute of a node | //book[@*] | Selects all book elements that have at least one attribute. |
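Both wildcards in one short, made-up example:

```python
from lxml import etree

# Made-up snippet: one element with an attribute, one without.
root = etree.fromstring(
    '<bookstore><book id="b1"><title>A</title></book>'
    '<magazine><title>B</title></magazine></bookstore>')

children = [el.tag for el in root.xpath('/bookstore/*')]  # any element child of bookstore
with_attr = [el.tag for el in root.xpath('//*[@*]')]      # any element carrying any attribute
print(children, with_attr)
```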

operator

Use or to match the same tag with different attribute values.

 Match tags containing class="title" or class="pic"

//div[@class="book-list list"]//div[@class="title" or @class="pic"]/a

and

The and operator works the same way, except that every condition joined by and must hold at the same time.
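A minimal sketch of or versus and, on markup mirroring the book-list example above:

```python
from lxml import etree

# Hypothetical markup shaped like the book-list example in the text.
html = etree.HTML('''
<div class="book-list list">
    <div class="title"><a href="/t">T</a></div>
    <div class="pic"><a href="/p">P</a></div>
    <div class="other"><a href="/o">O</a></div>
</div>
''')

# or: either condition may hold.
either = html.xpath('//div[@class="book-list list"]'
                    '//div[@class="title" or @class="pic"]/a/@href')
# and: both conditions must hold at once; here that is impossible, so nothing matches.
both = html.xpath('//div[@class="title" and @class="pic"]')
print(either, both)
```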

Select multiple paths:

xpath-expression1 | xpath-expression2 | xpath-expression3

Several paths can be selected by using the "|" operator in a path expression.

Examples are as follows:

| path expression | result |
| --- | --- |
| //book/title \| //book/price | Selects all title and price elements of all book elements. |
| //bookstore/book \| //book/title | Selects all book elements under bookstore and all title elements under book elements. |
| /bookstore/book/title \| //price | Selects all title elements of the book children of bookstore, and all price elements in the document. |
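A quick sketch of the "|" union operator on a one-book sample (made up); the result comes back in document order:

```python
from lxml import etree

# Made-up one-book document for the union operator.
root = etree.fromstring(
    '<bookstore><book><title>T</title><price>9.9</price>'
    '<year>2019</year></book></bookstore>')

union = root.xpath('//book/title | //book/price')  # title and price, but not year
tags = [el.tag for el in union]
print(tags)
```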

match demo

<ul class="book_list"> 
    <li> 
        <title class="book_001">Harry Potter</title> 
        <author>J K. Rowling</author> 
        <year>2005</year> 
        <price>69.99</price> 
    </li>
    <li> 
        <title class="book_002">Spider</title>
        <author>Forever</author>
        <year>2019</year>
        <price>49.99</price>
   </li>
</ul>
1. Find all li nodes
     //li 
2. Under li, find the title child node whose class attribute is 'book_001'
     //li/title[@class="book_001"] 
3. Find the class attribute values of all title nodes under li
     //li//title/@class

Whenever a condition is involved, add []

To get just an attribute value, add @
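The three queries above can be run directly on the sample list (parsed here with etree.fromstring, since the snippet is well-formed XML):

```python
from lxml import etree

# The <ul class="book_list"> sample from the match demo above.
root = etree.fromstring('''
<ul class="book_list">
    <li>
        <title class="book_001">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>69.99</price>
    </li>
    <li>
        <title class="book_002">Spider</title>
        <author>Forever</author>
        <year>2019</year>
        <price>49.99</price>
    </li>
</ul>
''')

lis = root.xpath('//li')                                   # 1. all li nodes
hp = root.xpath('//li/title[@class="book_001"]/text()')    # 2. condition goes in []
classes = root.xpath('//li//title/@class')                 # 3. attribute values via @
print(len(lis), hp, classes)
```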

select node

1. // : search from all nodes (including child and descendant nodes)
2. @ : get attribute values 
# Usage 1 (attribute value as a condition)
    //div[@class="movie"] 
# Usage 2 (get the attribute value directly)
    //div/a/@src

Common functions

1. contains(): match nodes whose attribute value contains a given string 
    # Find title nodes whose class attribute contains "book_"
       //title[contains(@class,"book_")] 
       //div[@class="book-list list"]//a[contains(text(), "The ugliest in the world")]

    # Multiple conditions can be combined:

        //div[contains(@class, 'book2') and contains(@class, 'book3')]
    # If the target class is not necessarily the first one, pad with spaces so only whole class names match:

        //div[contains(concat(' ', @class, ' '), ' book2 ')]

2. text(): get the text content of a node
    # Find the names of all books
       //ul[@class="book_list"]/li/title/text()

3. Convert an element back to a string: etree.tostring()

import requests 
from lxml import etree 

url = 'https://www.douban.com/' 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers) 
html = req.text
xp = etree.HTML(html) 
a = xp.xpath('//div[@class="book-list list"]//div[@class="title"]/a')[0]

print(etree.tostring(a, encoding='utf-8').decode('utf-8'))
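The contains() and text() patterns above can be tried offline on a small made-up snippet:

```python
from lxml import etree

# Made-up snippet combining the contains() and text() examples.
root = etree.fromstring('''
<ul class="book_list">
    <li><title class="book_001">Harry Potter</title></li>
    <li><title class="book_002">Spider</title></li>
    <li><div class="book2 book3 highlight">C</div></li>
</ul>
''')

partial = root.xpath('//title[contains(@class, "book_")]/@class')
multi = root.xpath('//div[contains(@class, "book2") and contains(@class, "book3")]/text()')
# The space-padding trick matches the whole class name "book2" only.
padded = root.xpath('//div[contains(concat(" ", @class, " "), " book2 ")]/text()')
names = root.xpath('//ul[@class="book_list"]/li/title/text()')
print(partial, multi, padded, names)
```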

 


lxml parsing library

Install

pip install lxml

manual

1. Import the module
    from lxml import etree 
2. Create the parse object
    req = etree.HTML(html) 
3. Call xpath on the parse object
    r_list = req.xpath('xpath expression')  # the result is a list
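The three steps can be sketched end to end on an inline HTML string (the markup here is made up):

```python
from lxml import etree

# 1. import is above; this is a minimal run of steps 2 and 3.
html_text = '<div><a href="/a">first</a><a href="/b">second</a></div>'
req = etree.HTML(html_text)        # 2. create the parse object
r_list = req.xpath('//a/@href')    # 3. call xpath; the result is always a list
print(r_list)
```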

We can use it to parse HTML code, and if the HTML is not well-formed, lxml will automatically complete it. The sample code is as follows:

# Use the etree library from lxml 
from lxml import etree 
text = ''' <h2>
        新书速递
            &nbsp;·&nbsp;·&nbsp;·&nbsp;·&nbsp;·&nbsp;·
            <span class="pl">&nbsp;(
                
                    <a href="https://book.douban.com/latest" target="_self">更多</a>
                ) </span>
    </h2> ''' 
# Use etree.HTML to parse the string into an HTML document 
html = etree.HTML(text) 
# Serialize the HTML document back to a string 
result = etree.tostring(html, encoding='utf-8', pretty_print=True)
print(result.decode('utf-8'))

As can be seen from the output, lxml automatically repairs the HTML code: the missing html and body tags are added around the fragment.

read html file

Use the etree.parse() method to read a file. Note that etree.parse() uses an XML parser by default; for HTML that is not well-formed XML, pass etree.HTMLParser() as the second argument.
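The snippets below read a local hello.html. As a self-contained sketch, here we first create such a file (its content is made up to match the queries below) and read it back with an HTML parser:

```python
from lxml import etree

# Write a small, made-up hello.html so the following snippets have input.
with open('hello.html', 'w', encoding='utf-8') as f:
    f.write('''
<ul>
    <li class="item-0"><a href="www.baidu.com">baidu</a></li>
    <li class="item-1"><a href="link2.html"><span class="bold">second</span></a></li>
</ul>
''')

# etree.HTMLParser() makes parse() forgiving of non-well-formed HTML.
html = etree.parse('hello.html', etree.HTMLParser())
print(html.xpath('//li/a/@href'))
```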

# Get all li tags: 
from lxml import etree 
html = etree.parse('hello.html') 
print(type(html)) 
# shows the return type of etree.parse() 
result = html.xpath('//li') 
print(result)  # prints the list of <li> elements
 
# Get the values of all class attributes under all li elements: 
from lxml import etree 
html = etree.parse('hello.html') 
result = html.xpath('//li/@class') 
print(result) 

# Get the a tag under li whose href is www.baidu.com: 
from lxml import etree 
html = etree.parse('hello.html') 
result = html.xpath('//li/a[@href="www.baidu.com"]') 
print(result) 

# Get all span tags under li tags: 
from lxml import etree 
html = etree.parse('hello.html') 
#result = html.xpath('//li/span')  # note: this is wrong, because / selects child elements and <span> is not a direct child of <li>, so use a double slash 
result = html.xpath('//li//span') 
print(result) 

# Get all class attributes of the a tags under li: 
from lxml import etree 
html = etree.parse('hello.html') 
result = html.xpath('//li/a//@class') 
print(result) 

# Get the value of the href attribute of the a in the last li: 
from lxml import etree 
html = etree.parse('hello.html') 
result = html.xpath('//li[last()]/a/@href')  # the predicate [last()] finds the last element 
print(result) 

# Get the content of the second-to-last li element: 
from lxml import etree 
html = etree.parse('hello.html') 
result = html.xpath('//li[last()-1]/a')[0]
# the text attribute gives the element's text content 
print(result.text) 
# A second way to get the content of the second-to-last li element: 
result = html.xpath('//li[last()-1]/a/text()') 
print(result)



Origin blog.csdn.net/q1246192888/article/details/123649072