XPath Explained in Detail: a Powerful Web Page Parsing Tool for Python Crawlers

1. Introduction to XPath

XPath is a language for finding information in XML documents. It was originally designed to search XML documents, but it can also be used to search HTML documents.

2. Install lxml

lxml is a third-party parsing library for Python. It supports both HTML and XML parsing, is very efficient, and makes up for the shortcomings of Python's built-in xml standard library.

Install the third-party library with: pip install lxml

3. XPath parsing principle

  1. Instantiate an etree object and load the page source code to be parsed into it.
  2. Call the xpath method on the etree object with an XPath expression to locate tags and extract content (a minimal sketch follows this list).
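
A minimal sketch of these two steps on an inline HTML string (the snippet and the class name are made up for illustration):

from lxml import etree

# Step 1: instantiate an etree object and load the page source code into it
html = "<html><body><div class='title'>Hello XPath</div></body></html>"
tree = etree.HTML(html)

# Step 2: call xpath() with an XPath expression to locate the tag and grab its text
titles = tree.xpath("//div[@class='title']/text()")
print(titles)  # ['Hello XPath']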

4. Instantiate the etree object

  1. Load the source code of a local HTML document into an etree object: etree.parse(filePath)
  2. Load page source code fetched from the Internet into an etree object: etree.HTML(response.text)
  3. Call xpath('xpath expression') on the resulting object (see the sketch below).
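
A short sketch of both loading methods; ./test.html is a hypothetical local file, and an HTMLParser is passed to etree.parse because real-world HTML is rarely well-formed XML:

from lxml import etree
import requests

# 1. Local HTML document -> etree object
#    (etree.parse uses a strict XML parser by default, so pass an HTML parser)
parser = etree.HTMLParser()
local_tree = etree.parse("./test.html", parser=parser)

# 2. Page source fetched from the Internet -> etree object
response = requests.get("https://www.example.com")
online_tree = etree.HTML(response.text)

# 3. Both objects expose the same xpath() method
print(local_tree.xpath("//title/text()"))
print(online_tree.xpath("//title/text()"))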

5. XPath path expression

Expression                        Description
/                                 Select from the root node
//                                Select nodes at any depth, starting from anywhere in the document
.                                 Select the current node
..                                Select the parent node of the current node
@                                 Select an attribute
//div[@class='title']             Attribute positioning: tag[@attrName="attrValue"]
//div[@class="zhang"]/p[3]        Index positioning; the index starts from 1
/text()                           Get only the direct text content inside the tag
//text()                          Get all text content inside the tag, including text of nested tags
/@attrName (e.g. img/@src)        Get the value of an attribute
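
A small sketch that exercises these expressions; the HTML snippet, class name, and image path below are made up for illustration:

from lxml import etree

html = """
<div class="zhang">
    <p>first</p>
    <p>second <span>nested</span></p>
    <p>third</p>
    <img src="/logo.png" alt="logo"/>
</div>
"""
tree = etree.HTML(html)

# Attribute positioning: tag[@attrName="attrValue"]
print(tree.xpath('//div[@class="zhang"]'))               # a list of matching Element objects

# Index positioning; the index starts from 1
print(tree.xpath('//div[@class="zhang"]/p[3]/text()'))   # ['third']

# /text(): only the direct text of the tag
print(tree.xpath('//div[@class="zhang"]/p[2]/text()'))   # ['second ']

# //text(): all text content, including text of nested tags
print(tree.xpath('//div[@class="zhang"]/p[2]//text()'))  # ['second ', 'nested']

# /@attrName: take an attribute value
print(tree.xpath('//div[@class="zhang"]/img/@src'))      # ['/logo.png']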

6. A hands-on example

Let's take the CSDN website as an example.
Example: here we want to get the titles of the headline blogs on the official website's homepage. Open the developer console (click the element-picker arrow, or press Ctrl+Shift+C), point to a title, and locate it by the class value of its div tag; locating by class is the XPath syntax we use most often.

from lxml import etree
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}
url = "https://www.csdn.net/"
response = requests.get(url=url, headers=headers)
# Parse the page source with etree
data = etree.HTML(response.text)
# //div matches div tags at any depth in the document
names = data.xpath("//div[@class='headlines']/div[@class='headlines-right']//div[@class='headswiper-item']/a/text()")
urls = data.xpath("//div[@class='headlines']/div[@class='headlines-right']//div[@class='headswiper-item']/a/@href")

blog_list = list(zip(names, urls))
for blog in blog_list:
    print(blog)
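
One caveat: the class names in the two expressions above are tied to CSDN's page layout at the time of writing. If the site changes its markup, xpath() does not raise an error; it simply returns empty lists. A minimal guard on top of the script above:

# If the page structure has changed, names and urls come back empty
if not blog_list:
    print("No headlines matched; re-check the XPath expressions in the browser console.")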


The result: each headline title is printed together with its link as a (title, URL) tuple.


Original article: blog.csdn.net/qq_44723773/article/details/128760503